Conversation
Based on Slack chat with @JMMackenzie - we should just merge this and proceed?
Hi @andrewtrotman - the current header is:

```protobuf
message Header {
  int32 num_postings_lists = 1;
  int32 num_docs = 2;
}
```
To get it all into a single protobuf file that doesn't rely on prior knowledge of what's in it, what's repeated, and how many times, I suggest:
```protobuf
message CIFF {
  int32 version = 1;
  int32 num_postings_lists = 2;
  int32 num_docs = 3;
  repeated DocRecord doc_records = 4;
  repeated PostingsList postings_lists = 5;
}
```
I've also renamed Header to CIFF (for clarity) and added a version element. That way, if we later add features (such as fields), a reader can cheaply reject an incompatible CIFF file instead of having to scan a large file for content it doesn't understand.
@andrewtrotman We can't do this because it would require CIFF to be completely held in memory, per the rationale discussed in the README.
Also, CIFF is a format, not a message, so the name isn't totally appropriate.
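For illustration, here is a minimal sketch of the memory argument, summarizing the thread rather than quoting anyone, and assuming the standard generated protobuf writers are used:

```protobuf
// Sketch only: why a single enclosing message implies a memory-resident
// export. With everything inside one message:
//
//   message CIFF {
//     ...
//     repeated DocRecord doc_records = 4;
//     repeated PostingsList postings_lists = 5;
//   }
//
// the standard generated API materializes the entire CIFF object before
// serializing it. Writing a stream of individually length-delimited
// PostingsList and DocRecord messages instead lets an exporter emit one
// record at a time.
```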
How about IndexMetadata? I'm fine with adding a version.
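For concreteness, the renamed header might look like the sketch below; this is an illustration of the suggestion, not the final schema:

```protobuf
// Illustrative only: the header renamed to IndexMetadata, with the
// proposed version field added. Field numbers are placeholders.
message IndexMetadata {
  int32 version = 1;             // lets readers reject layouts they don't understand
  int32 num_postings_lists = 2;  // how many PostingsList messages follow
  int32 num_docs = 3;            // how many DocRecord messages follow
}
```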
Would it be an idea to also add metadata for each DocRecord?
What do you mean exactly? There will be a DocRecord message for every single document in the collection...
> it would require CIFF to be completely held in memory, per the rationale discussed in the README.
Sure, but the ClueWeb index is so small that it fits in memory anyway. Indeed, this is how my code works: it block-reads the CIFF index, converts it to a JASS index entirely in memory, then dumps the JASS index and terminates. Although we are trying to be future-proof with CIFF, I find it unlikely that we'll ever generate a CIFF file too large to fit into memory, because we already assume elsewhere that the search engine's index is memory resident.
The current CW12b index is small because it only has postings for the query terms... the postings for CW12b come to 87 GiB (uncompressed) in the current protobuf format when all postings lists are included.
In order to compute the impact scores in JASSv2, I need the max and min BM25 score across the entire index - so I use the 87 GiB version. Come on, 87 GiB is small. It's a pig to transfer, but small compared to the amount of memory on a modern desktop.
That means I couldn't do an export from my laptop, even though my laptop can comfortably hold the raw index...?
Protobuf repeated fields are BLOBs stored with the BLOB length at the start, so (I'm guessing) you need enough memory to build the repeated field in order to compute its length, so as to write the length and then the data. But my laptop only has 8 GiB of RAM, so I can't get the ClueWeb index into RAM even once it is converted from CIFF. So I don't really buy the argument that "my laptop isn't a supercomputer so we shouldn't do this".
I'll add at this point that I'm happy to go with the consensus here. My experience, however, is that with a multiple-file index something will go wrong that can't go wrong with a single file - which is why ATIRE uses a single file for its index. Yes, it's a bit painful, but you only pay the pain once (when writing the code to write and read indexes).
I disagree, because the "extra" stuff we are proposing to add is tiny in comparison to the postings lists. Also, since Google's code doesn't work on the postings lists, any constraints due to that can be disregarded. If we do decide to have multiple small(er) messages, then I'd prefer three files (like we already have), each of which has a separate protobuf description. That way we can't make wrong assumptions about what data we see in a file (even if we can get the files muddled up).
I disagree with this. I am using Google's standard codegen'ed code to export CIFF from Lucene in Java. The point of using a standard is to also use the associated toolchain, to the extent possible. Just because you wish to write your own custom standards-compliant parser doesn't mean CIFF should force everyone to do so.
I am fine with this option also, so we'd have (say):
I think we could have some definition in the protobuf that describes the preprocessing that has occurred on the index. It might simply be free text, but something we could search, e.g. for /porter/i. Should we also record a version string for whatever created the CIFF file? Currently only Anserini produces CIFF files, but many of our systems could have implementations.
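As a sketch of that idea (the field names description and exporter are hypothetical, not agreed):

```protobuf
// Hypothetical provenance fields; names and numbering are illustrative.
message IndexMetadata {
  int32 version = 1;
  string description = 2;  // free text about preprocessing, e.g. "porter stemming, lowercased"
  string exporter = 3;     // which system and version wrote the file, e.g. an Anserini version string
}
```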
If we do this, then can the metadata file also contain the names of the postings file and the DocRecord file? That way the data is in three files, but the other two are described by the metadata file, and we only need to pass the name of one file to the converterizer.
I'm fine with this design: a single master metadata file that points to two separate files, one with all the postings and the second with the doc records. I'll throw in additional fields in the metadata protobuf for avgdl, description, and other features people have requested on the thread. Before going off to implement this, is everyone happy with this design?
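A sketch of that design; the field names (postings_filename, docrecords_filename, avgdl, description) are assumptions for illustration, not the merged schema:

```protobuf
// Sketch of the master-metadata idea; everything here is illustrative.
message IndexMetadata {
  int32 version = 1;
  string postings_filename = 2;    // file containing the PostingsList stream
  string docrecords_filename = 3;  // file containing the DocRecord stream
  int32 num_postings_lists = 4;
  int32 num_docs = 5;
  double avgdl = 6;                // average document length, for BM25-style scoring
  string description = 7;          // free-text provenance (stemming, stopwords, ...)
}
```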
Can we sketch the updated design here for brevity? I'll outline it as I currently understand it. Please "react" accordingly -- if we're all thumbs-up here, we'll go ahead.
@JMMackenzie let's call the above Option 1.

Option 2 is what I have implemented here in this PR, where we have a single file.

And for completeness, Option 3 is Andrew's suggestion of everything in a single protobuf message.

I am against Option 3 because it requires the entire index to be in memory for exporting purposes. I am agnostic between Option 1 and Option 2, but leaning toward Option 2, because I like the clean design of a single file, even at the expense of heterogeneous protobuf messages stuffed inside. I think it's fine because it follows the general design pattern of "say what you're going to say, and then say it."
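To make Option 2 concrete, the single-file layout (as I read the thread; a sketch, not the merged format) is:

```protobuf
// Option 2, sketched: one file, "say what you're going to say, then say it".
//
//   Header            -- declares P = num_postings_lists, D = num_docs
//   PostingsList * P  -- each message written length-delimited
//   DocRecord    * D  -- each message written length-delimited
//
// A reader parses the Header first, then knows exactly how many
// length-delimited messages of each type to expect.
message Header {
  int32 num_postings_lists = 1;
  int32 num_docs = 2;
}
```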
In this case, I prefer Options 1 or 2. Option 3 is easy now, but it might hurt later if we decide to index fields/positions, which may then not fit entirely in memory. I also think it makes it less likely that other search engines will want to write a CIFF exporter, since it's a bit more difficult.
I prefer Option 1 (assuming Option 3 is off the cards), but with an "int32 version" in the header. I'd like to rename collection_docid to primary_key in the DocRecord. I think primary_key is clearer than collection_docid, and makes more sense in collections in which the collection_docid is more like a URL than an ID.
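Sketched, that rename would look like this; the other DocRecord fields shown are assumptions about the existing schema:

```protobuf
// Illustration of the proposed rename; docid and doclength are assumed
// to match the existing DocRecord, primary_key replaces collection_docid.
message DocRecord {
  int32 docid = 1;         // internal integer document id
  string primary_key = 2;  // was collection_docid; external identifier (may be URL-like)
  int32 doclength = 3;     // assumed existing field
}
```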
@andrewtrotman - like this one?
Sorry, I just noticed this now. Yes, like that. I'm torn between a single file (which is good) and a single description (which is good). I don't like multiple files, and I don't like needing extra knowledge. I think multiple files is the lesser of two evils. My overall preference is one protobuf that fully describes one file - but I can see why others don't want to do that.
Haha, I think I slightly prefer a single file now (Option 2), if anything - because it's already implemented (at your original request, no less). I think the only "extra knowledge" is the header counts that say how many of each message follow. However, I'm happy to implement Option 1 (multiple files) if that's what people prefer. What do @chriskamphuis @cmacdonald @arjenpdevries think?
I do not mind whether it is 1 or 2. I agree with Jimmy's arguments against 3.
@arjenpdevries we're hoping for someone to help us decide between 1 or 2... everyone seems on the fence on this...
I slightly prefer 2, but 1 is fine with me also.
Likewise, I prefer the single file, but could work with both. Knowing the number of documents BEFORE reading the DocRecords could also be useful.
Okay, seeing as there is a slight preference in the votes for a single file (Option 2), I am going to merge this PR and make additional modifications on top.
For comment only - not ready to be merged.