Conversation
Based on Slack chat with @JMMackenzie - we should just merge this and proceed?
Hi @andrewtrotman - the current header is:

```protobuf
message Header {
  int32 num_postings_lists = 1;
  int32 num_docs = 2;
}
```
To get it all into a single protobuf file that doesn't rely on prior knowledge of what's in it, what's repeated, and how many times, I suggest:
```protobuf
message CIFF {
  int32 version = 1;
  int32 num_postings_lists = 2;
  int32 num_docs = 3;
  repeated DocRecord doc_records = 4;
  repeated PostingsList postings_lists = 5;
}
```
I've also renamed Header to CIFF (for clarity) and added a version element. That way, if we later add features (such as fields), a reader can cheaply reject an incompatible CIFF file instead of having to scan a large file for content it doesn't understand.
@andrewtrotman We can't do this because it would require CIFF to be completely held in memory, per the rationale discussed in the README.
Also, CIFF is a format, not a message, so the name isn't totally appropriate.
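For illustration, here is a minimal sketch of the memory argument, summarizing the thread rather than quoting anyone, and assuming the standard generated protobuf writers are used:

```protobuf
// Sketch only: why a single enclosing message implies a memory-resident
// export. With everything inside one message:
//
//   message CIFF {
//     ...
//     repeated DocRecord doc_records = 4;
//     repeated PostingsList postings_lists = 5;
//   }
//
// the standard generated API materializes the entire CIFF object before
// serializing it. Writing a stream of individually length-delimited
// PostingsList and DocRecord messages instead lets an exporter emit one
// record at a time.
```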
How about IndexMetadata? I'm fine with adding a version.
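For concreteness, the renamed header might look like the sketch below; this is an illustration of the suggestion, not the final schema:

```protobuf
// Illustrative only: the header renamed to IndexMetadata, with the
// proposed version field added. Field numbers are placeholders.
message IndexMetadata {
  int32 version = 1;             // lets readers reject layouts they don't understand
  int32 num_postings_lists = 2;  // how many PostingsList messages follow
  int32 num_docs = 3;            // how many DocRecord messages follow
}
```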
Would it be an idea to also add metadata for each DocRecord?
What do you mean exactly? There will be a DocRecord message for every single document in the collection...
> it would require CIFF to be completely held in memory, per the rationale discussed in the README.
Sure, but the ClueWeb index is so small that it fits in memory anyway. Indeed, this is how my code works: it block-reads the CIFF index, converts it to a JASS index entirely in memory, then dumps the JASS index and terminates. Although we are trying to be future-proof with CIFF, I find it unlikely that we'll ever generate a CIFF file too large to fit into memory, because we already assume elsewhere that the search engine's index is memory resident.
The current CW12b index is small because it only has postings for the query terms... the postings for CW12b come to 87 GiB (uncompressed) in the current protobuf format when all postings lists are included.
In order to compute the impact scores in JASSv2, I need the max and min BM25 score across the entire index - so I use the 87 GiB version. Come on, 87 GiB is small. It's a pig to transfer, but small compared to the amount of memory on a modern desktop.
That means I couldn't do an export from my laptop, even though my laptop can comfortably hold the raw index...?
Protobuf repeated fields are BLOBs stored with the BLOB length at the start, so (I'm guessing) you need enough memory to build the repeated field in order to compute its length, so as to write the length and then the data. But my laptop only has 8 GiB of RAM, so I can't get the ClueWeb index into RAM even once it is converted from CIFF. So I don't really buy the argument that "my laptop isn't a supercomputer so we shouldn't do this".
I'll add at this point that I'm happy to go with the consensus here. My experience, however, is that with a multiple-file index something will go wrong that can't go wrong with a single file - which is why ATIRE uses a single file for its index. Yes, it's a bit painful, but you only pay the pain once (when writing the code to write and read indexes).
I disagree, because the "extra" stuff we are proposing to add is tiny in comparison to the postings lists. Also, since Google's code doesn't work on the postings lists, any constraints due to that can be disregarded. If we do decide to have multiple small(er) messages, then I'd prefer three files (like we already have), each of which has a separate protobuf description. That way we can't make wrong assumptions about what data we see in a file (even if we can get the files muddled up).
I disagree with this. I am using Google's standard codegen'ed code to export CIFF from Lucene in Java. The point of using a standard is to also use the associated toolchain, to the extent possible. Just because you wish to write your own custom standards-compliant parser doesn't mean CIFF should force everyone to do so.
I am fine with this option also, so we'd have (say):
I think we could have some definition in the protobuf that describes the preprocessing that has occurred on the index. It might simply be free text, but something we could search, e.g. for /porter/i. Should we also record a version string for whatever created the CIFF file? Currently only Anserini produces CIFF files, but many of our systems could have implementations.
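As a sketch of that idea (the field names description and exporter are hypothetical, not agreed):

```protobuf
// Hypothetical provenance fields; names and numbering are illustrative.
message IndexMetadata {
  int32 version = 1;
  string description = 2;  // free text about preprocessing, e.g. "porter stemming, lowercased"
  string exporter = 3;     // which system and version wrote the file, e.g. an Anserini version string
}
```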
If we do this, then can the metadata file also contain the names of the postings file and the DocRecord file? That way the data is in three files, but the other two are described by the metadata file, and we only need to pass the name of one file to the converterizer.
I'm fine with this design: a single master metadata file that points to two separate files, one with all the postings and the second with the doc records. I'll throw in additional fields in the metadata protobuf for avgdl, description, and other features people have requested on the thread. Before going off to implement this, is everyone happy with this design?
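A sketch of that design; the field names (postings_filename, docrecords_filename, avgdl, description) are assumptions for illustration, not the merged schema:

```protobuf
// Sketch of the master-metadata idea; everything here is illustrative.
message IndexMetadata {
  int32 version = 1;
  string postings_filename = 2;    // file containing the PostingsList stream
  string docrecords_filename = 3;  // file containing the DocRecord stream
  int32 num_postings_lists = 4;
  int32 num_docs = 5;
  double avgdl = 6;                // average document length, for BM25-style scoring
  string description = 7;          // free-text provenance (stemming, stopwords, ...)
}
```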
Can we sketch the updated design here for brevity? I'll outline it as I currently understand it. Please "react" accordingly -- if we're all thumbs-up here, we'll go ahead.
@JMMackenzie let's call the above Option 1.

Option 2 is what I have implemented here in this PR, where we have a single file.

And for completeness, Option 3 is Andrew's suggestion of everything in a single protobuf message.

I am against Option 3 because it requires the entire index to be in memory for exporting purposes. I am agnostic between Option 1 and Option 2, but leaning toward Option 2, because I like the clean design of a single file, even at the expense of heterogeneous protobuf messages stuffed inside. I think it's fine because it follows the general design pattern of "say what you're going to say, and then say it."
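To make Option 2 concrete, the single-file layout (as I read the thread; a sketch, not the merged format) is:

```protobuf
// Option 2, sketched: one file, "say what you're going to say, then say it".
//
//   Header            -- declares P = num_postings_lists, D = num_docs
//   PostingsList * P  -- each message written length-delimited
//   DocRecord    * D  -- each message written length-delimited
//
// A reader parses the Header first, then knows exactly how many
// length-delimited messages of each type to expect.
message Header {
  int32 num_postings_lists = 1;
  int32 num_docs = 2;
}
```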
In this case, I prefer Options 1 or 2. Option 3 is easy now, but it might hurt later if we decide to index fields/positions, which may then not fit entirely in memory. I also think it makes it less likely that other search engines will want to write a CIFF exporter, since it's a bit more difficult.
I prefer Option 1 (assuming Option 3 is off the cards), but with an "int32 version" in the header. I'd like to rename collection_docid to primary_key in the DocRecord. I think primary_key is clearer than collection_docid, and makes more sense in collections in which the collection_docid is more like a URL than an ID.
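Sketched, that rename would look like this; the other DocRecord fields shown are assumptions about the existing schema:

```protobuf
// Illustration of the proposed rename; docid and doclength are assumed
// to match the existing DocRecord, primary_key replaces collection_docid.
message DocRecord {
  int32 docid = 1;         // internal integer document id
  string primary_key = 2;  // was collection_docid; external identifier (may be URL-like)
  int32 doclength = 3;     // assumed existing field
}
```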
@andrewtrotman - like this one?
Sorry, I just noticed this now. Yes, like that. I'm torn between a single file (which is good) and a single description (which is good). I don't like multiple files, and I don't like needing extra knowledge. I think multiple files is the lesser of two evils. My overall preference is one protobuf that fully describes one file - but I can see why others don't want to do that.
Haha, I think I slightly prefer a single file now (Option 2), if anything - because it's already implemented (at your original request, no less). I think the only "extra knowledge" is the header counts that say how many of each message follow. However, I'm happy to implement Option 1 (multiple files) if that's what people prefer. What do @chriskamphuis @cmacdonald @arjenpdevries think?
I do not mind whether it is 1 or 2. I agree with Jimmy's arguments against 3.
@arjenpdevries we're hoping for someone to help us decide between 1 or 2... everyone seems on the fence on this...
I slightly prefer 2, but 1 is fine with me also.
Likewise, I prefer the single file, but could work with both. Knowing the number of documents BEFORE reading the DocRecords could also be useful.
Okay, seeing as there is a slight preference in the votes for a single file (Option 2), I am going to merge this PR and make additional modifications on top.
For comment only - not ready to be merged.