Skip to content

CIFF export in a single file#4

Merged
lintool merged 2 commits intomasterfrom
single-protobuf
Feb 26, 2020
Merged

CIFF export in a single file#4
lintool merged 2 commits intomasterfrom
single-protobuf

Conversation

@lintool
Copy link
Copy Markdown
Member

@lintool lintool commented Feb 6, 2020

For comment only - not ready to be merged.

@lintool lintool requested a review from JMMackenzie February 14, 2020 11:56
@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 14, 2020

Based on Slack chat with @JMMackenzie - we should just merge this and proceed?

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 14, 2020

Hi @andrewtrotman - the README has an explanation for my choice in design. Let me know if you disagree. Otherwise sign off in review?

Copy link
Copy Markdown
Member

@chriskamphuis chriskamphuis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lg

Comment on lines +4 to +7
message Header {
int32 num_postings_lists = 1;
int32 num_docs = 2;
}
Copy link
Copy Markdown
Contributor

@andrewtrotman andrewtrotman Feb 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get it all into a single ProtoBuff file that doesn't rely on knowing what's in there, what's repeated and how many times I suggest:

message CIFF {
  int32 version = 1;
  int32 num_postings_lists = 2;
  int32 num_docs = 3;
  repeat DocRecord = 4;
  repeat PostingsList = 5;
}

I've also renamed header to CIFF (for clarity) and added a version element. This is so that if we later add stuff (such as fields) we can easily reject the CIFF without looking for stuff we don't know about in a large file we can parse most of.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewtrotman We can't do this because it would require CIFF to be completely held in memory, per the rationale discussed in the README.

Also, CIFF is a format, not a message, so the name isn't totally appropriate.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about IndexMetadata? I'm fine with adding a version.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be an idea to also add Metadata for each docrecord?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean exactly? There will be a DocRecord message for every single document in the collection...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would require CIFF to be completely held in memory, per the rationale discussed in the README.

Sure, but the ClueWeb index is so small that it fits in memory anyway. Indeed, this is how my code works, it block reads the CIFF index and converts to a JASS index all in memory, the dumps the JASS index and terminates. Although we are trying to be future-proof with CIFF, I find it unlikely that we'll ever generate a CIFF file too large to fit into memory - because we already assume elsewhere that the search engine's index is memory resident.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current CW12b index is small because it only has postings for the query terms... postings lists for CW12b is 87 GiB (uncompressed) in the current protobuf format for all postings lists.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to compute the impact scores in JASSv2 I need the max and min BM25 score across the entire index - so I use the 87GiB version. Come on, 87GiB is small. Its a pig to transfer, but small by comparison to the amount of memory on a modern desktop.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means I couldn't do an export from my laptop, even though my laptop can comfortably hold the raw index...?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

protobuf repeated fields are BLOBs stored with the BLOB-length at the start, so (I'm guessing) you need enough memory to compute the repeated field so as to compute the length so as to write the length then the data. But, my laptop only has 8GiB RAM, so I can't even get the ClueWeb index into RAM even once it is converted from CIFF. So I don't really buy the argument that "my laptop isn't a supercomputer so we shouldn't do this".

I'll add at this point that I'm happy to go with the consensus on this point. My experience, however, is that if you have a multiple file index then something will go wrong that can't go wrong with a single file - which is why ATIRE has a single file for the index. Yes, its a bit painful, but you only pay the pain once (when writing the code to write and read indexes).

@andrewtrotman
Copy link
Copy Markdown
Contributor

Hi @andrewtrotman - the README has an explanation for my choice in design. Let me know if you disagree. Otherwise sign off in review?

I disagree because the "extra" stuff we are proposing to add is tiny in comparison to the postings lists. Also, since Google's code doesn't work on the postings lists, any constraints due to that can be disregarded.

If we do decide to have multiple small(er) messages than I'd prefer three files (like we already have), each of which has a desperate protobuf description. That way we can't make wrong assumptions about what data we see in a file (even if we can get the files muddled up).

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 18, 2020

Also, since Google's code doesn't work on the postings lists, any constraints due to that can be disregarded.

I disagree with this. I am using Google's standard codegen'ed code to export CIFF from Lucene in Java. The point of using a standard is to also use the associated toolchain, to the extent possible. Just because you wish to write your own custom standards-complaint parser doesn't mean CIFF should force everyone to do so.

If we do decide to have multiple small(er) messages than I'd prefer three files (like we already have), each of which has a desperate protobuf description. That way we can't make wrong assumptions about what data we see in a file (even if we can get the files muddled up).

I am fine with this option also, so we'd have (say):

  • foo.ciff.postings = PostingList *
  • foo.ciff.docrecords = DocRecord *
  • foo.ciff.meta = MetaData

@cmacdonald
Copy link
Copy Markdown
Member

I think we could have some definition in the protobuf that describes the preprocessing that has occurred on the index. it might simply be free text, but something that we could search for /porter/i

string preprocessing = 6;

Should we also record a version string for what created the CIFF file? Currently only Anserini produces the CIFF file, but many of our systems could have implementations.

@andrewtrotman
Copy link
Copy Markdown
Contributor

If we do this then can the MetaData file also contain the names of the Postings file and the DocRecord file - that way the data is in 3 files, but the other two are described by the metadata file, and we only need to pass the name of one file to the converterizer.

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 18, 2020

If we do this then can the MetaData file also contain the names of the Postings file and the DocRecord file - that way the data is in 3 files, but the other two are described by the metadata file, and we only need to pass the name of one file to the converterizer.

I'm fine with this design. A single master metadata file that points to two separate files, one with all the postings, and the second with doc records.

I'll throw in additional fields in the metadata protobuf for avgdl, description, and other features people have requested on the thread.

Before going off to implement this, is everyone happy with this design?

@JMMackenzie
Copy link
Copy Markdown
Member

JMMackenzie commented Feb 19, 2020

Can we sketch the updated design here for brevity? I'll outline it as I currently understand it. Please "react" accordingly -- if we're all thumb's up here, we'll go ahead.

  • foo.ciff.postings = PostingList
message Posting {
  int32 docid = 1;
  int32 tf = 2;
}

message PostingsList {
  string term = 1;
  int64 df = 2;
  int64 cf = 3;
  repeated Posting postings = 4;
}
  • foo.ciff.docrecords = DocRecord
message DocRecord {
  int32 docid = 1;
  string collection_docid = 2;
  int32 doclength = 3;
}
  • foo.ciff.meta = MetaData
message Header {
  string postings_file = 1;
  string documents_file = 2;
  int32 num_postings_lists = 3;
  int32 num_docs = 4;
  double average_doclength = 5;
  string preprocessing = 6;
}

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 19, 2020

@JMMackenzie let's call above Option 1

Option 2 is what I have implemented here in this PR, where we have a single file, foo.ciff = MetaData PostingList* DocRecord*

And for completeness, Option 3 is Andrew's suggestion of everything in a single protobuf message.


I am against Option 3 because it requires the entire index to be in memory for exporting purposes.

I am agnostic wrt either Option 1 or Option 2, but leaning toward Option 2, because I like the clean design of a single file, even at the expense of heterogeneous protobuf messages stuffed inside. I think it's fine because it follows the general design pattern of "say what you're going to say, and then say it."

@JMMackenzie
Copy link
Copy Markdown
Member

In this case, I prefer options 1 or 2. Option 3 is easy now, but it might hurt later if we decide to index fields/positions which may not then be able to fit entirely in-memory. I also think it makes it less likely for other search engines to want to write a CIFF exporter since it's a bit more difficult.

@andrewtrotman
Copy link
Copy Markdown
Contributor

andrewtrotman commented Feb 19, 2020

I prefer option 1 (assuming option 3 is off the cards), but with an "int32 version" in the header. I'd like to rename collection_docid to primary_key in the DocRecord. I think primary_key is clearer than collection_docid, and makes more sense in collecting in which the collection_docid is more like a URL than an ID.

@JMMackenzie
Copy link
Copy Markdown
Member

@andrewtrotman - like this one?

  • foo.ciff.postings = PostingList
message Posting {
  int32 docid = 1;
  int32 tf = 2;
}

message PostingsList {
  string term = 1;
  int64 df = 2;
  int64 cf = 3;
  repeated Posting postings = 4;
}
  • foo.ciff.docrecords = DocRecord
message DocRecord {
  int32 docid = 1;
  string primary_key = 2;
  int32 doclength = 3;
}
  • foo.ciff.meta = MetaData
message Header {
  int32_t version = 1;
  string postings_file = 2;
  string documents_file = 3;
  int32 num_postings_lists = 4;
  int32 num_docs = 5;
  double average_doclength = 6;
  string preprocessing = 7;
}

@andrewtrotman
Copy link
Copy Markdown
Contributor

Sorry, I just noticed this now. Yes, like that. I'm torn between a single file (which is good) and a single description (which is good). I don't like multiple files, and I don't like needing extra knowledge. I think multiple files is the lesser of two evils. My overall preference is one protobuf that fully describes one file - but I can see why others don't want to do that.

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 26, 2020

haha, I think I slightly prefer a single file now (Option 2), if anything - because it's already implemented (at your original request, no less).

I think the only "extra knowledge" is .ciff = MetaData PostingList* DocRecord* which is manageable since we'll have example code. It occurred to me that with a single file, working on the file in compressed form is easier.

However, I'm happy to implement Option 1 (multiple files) if that's what people prefer.

What do @chriskamphuis @cmacdonald @arjenpdevries think?

@arjenpdevries
Copy link
Copy Markdown
Member

I do not mind whether it is 1 or 2.

I agree with Jimmy's arguments against 3.

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 26, 2020

@arjenpdevries we're hoping for someone to help us decide between 1 or 2... everyone seems on the fence on this...

@chriskamphuis
Copy link
Copy Markdown
Member

I slightly prefer 2, but 1 is fine with me also.

@cmacdonald
Copy link
Copy Markdown
Member

Likewise, I prefer the single file, but could work with both. Knowing the number of documents BEFORE reading the DocRecords could also be useful.

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 26, 2020

Okay, seeing as there is a slight preference of votes for single file (option 2) - I am going to merge in this PR and make additional modifications on top.

@lintool lintool merged commit dad6baa into master Feb 26, 2020
@lintool lintool deleted the single-protobuf branch February 26, 2020 18:44
@lintool lintool mentioned this pull request Feb 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants