Update protobuf definition #6
I'm happy with the one file solution.
andrewtrotman left a comment:
All the names are plurals except "total_postings_list". Can we make that "total_postings_lists"?
@andrewtrotman good catch - will do.
I understand how num_postings_lists differs from total_postings_lists, the first being the number in the file, the second being the number in a full index. But what is total_terms_in_collection, and how does it differ from total_postings_lists? Especially since total_postings_lists is defined as being "the vocabulary size".
Updated exports:
Alternatively, would
Now I understand what it is - thanks. You might call it sum_of_document_lengths |
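To make the distinction between these counters concrete, here is a minimal sketch over a made-up two-document collection. The documents, the exported subset of terms, and the helper variable names are assumptions for illustration only; just the field names come from the discussion above, and this is not the exporter's actual code.

```python
# Toy collection (hypothetical): external id -> tokenised document.
docs = {
    "doc-a": ["to", "be", "or", "not", "to", "be"],
    "doc-b": ["to", "do", "is", "to", "be"],
}

# Vocabulary of the full index: one postings list per distinct term.
vocabulary = {term for terms in docs.values() for term in terms}
total_postings_lists = len(vocabulary)

# Assumption: this particular export only carries postings lists for two terms.
exported_terms = {"to", "be"}
num_postings_lists = len(exported_terms & vocabulary)

total_docs = len(docs)

# Every term occurrence counts here, repeats included.
sum_of_document_lengths = sum(len(terms) for terms in docs.values())

print(num_postings_lists, total_postings_lists, total_docs, sum_of_document_lengths)
```

In this toy case the vocabulary has 6 distinct terms while the documents contain 11 term occurrences in total, which is why the two fields cannot stand in for each other.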
int32 total_docs = 5;
// The total number of terms in the entire collection.
int64 total_terms_in_collection = 6;
Suggested rename by @andrewtrotman: sum_of_document_lengths
string collection_docid = 2;
int32 doclength = 3;
int32 docid = 1; // Refers to the docid in the postings lists.
string collection_docid = 2; // Refers to a docid in the external collection.
I think Andrew wanted primary_key instead since it may be more clear? @andrewtrotman
I prefer collection_docid, which has semantic meaning. primary_key is logical, not semantic.
@andrewtrotman @JMMackenzie suggestions adopted.
Once everyone is happy I'll generate fresh exports.
Two comments:
1. This is obtainable from parsing the postings lists, no? Given it is provided, should an implementation use it? I'm not sure I follow the motivation for its presence.
2. A long time ago Terrier standardised nomenclature on docno for the external String for the primary key of the document, as this corresponded to the DOCNO tag in a TREC collection. docid is the integer internal to the system (or CIFF, in this case). I would suggest using the same nomenclature to avoid needing to clarify docid vs docno.
I think the rationale is this: "We store this value explicitly in case the exporting application wants a particular level of precision" as noted in comments. Suppose the exporting application rounds avgdl to the nearest integer... it could use this field to tell the importer. Don't feel too strongly though...
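As a rough illustration of that rationale, the sketch below compares an average document length derived by the importer with one that a hypothetical exporter has rounded before writing it out. The document lengths and variable names are invented for the example; only the idea of storing the value explicitly comes from the comment above.

```python
# Hypothetical per-document lengths as an importer would read them.
doc_lengths = [3, 4, 4]

total_docs = len(doc_lengths)
sum_of_document_lengths = sum(doc_lengths)

# What an importer would compute if it derived the value itself.
derived_avgdl = sum_of_document_lengths / total_docs

# Assumption: the exporting application rounds avgdl to the nearest integer
# and stores that value explicitly.
exported_avgdl = float(round(derived_avgdl))

print(derived_avgdl, exported_avgdl)
```

The two values differ (roughly 3.67 versus 4.0), so an importer that wants to reproduce the exporter's scores would need the stored value and could not recover it from the lengths alone.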
I'm okay with
Make sure to make total_terms_in_collection clearer with a better comment
Per discussion in #4
CIFF to play with: