Skip to content

Update protobuf definition#6

Merged
andrewtrotman merged 5 commits intomasterfrom
protobuf-updates
Mar 2, 2020
Merged

Update protobuf definition#6
andrewtrotman merged 5 commits intomasterfrom
protobuf-updates

Conversation

@lintool
Copy link
Copy Markdown
Member

@lintool lintool commented Feb 26, 2020

Per discussion in #4

CIFF to play with:

@andrewtrotman
Copy link
Copy Markdown
Contributor

I'm happy with the one file solution.

Copy link
Copy Markdown
Contributor

@andrewtrotman andrewtrotman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the names are plurals except "total_postings_list". Can we make that "total_postings_lists"

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 26, 2020

@andrewtrotman good catch - will do.

@andrewtrotman
Copy link
Copy Markdown
Contributor

I understand how num_postings_lists differs from total_postings_lists, the fist being the number in the file, the second being the number in a full index. But what is total_terms_in_collection, and how does it differ from total_postings_lists? Especially since total_postings_lists is defined as being "the vocabulary size".

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 26, 2020

Updated exports:

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 26, 2020

I understand how num_postings_lists differs from total_postings_lists, the fist being the number in the file, the second being the number in a full index. But what is total_terms_in_collection, and how does it differ from total_postings_lists? Especially since total_postings_lists is defined as being "the vocabulary size".

total_terms_in_collection is, literally, how many tokens there are in the entire collection, i.e., the sum of all document lengths of all documents. How would you suggest I clarify the wording?

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 26, 2020

Alternatively, would total_postings_lists be more clear as vocab_size?

@andrewtrotman
Copy link
Copy Markdown
Contributor

Now I understand what it is - thanks. You might call it sum_of_document_lengths

int32 total_docs = 5;

// The total number of terms in the entire collection.
int64 total_terms_in_collection = 6;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested rename by @andrewtrotman : sum_of_document_lengths

Comment thread src/main/protobuf/CommonIndexFileFormat.proto
string collection_docid = 2;
int32 doclength = 3;
int32 docid = 1; // Refers to the docid in the postings lists.
string collection_docid = 2; // Refers to a docid in the external collection.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Andrew wanted primary_key instead since it may be more clear? @andrewtrotman

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer collection_docid, which has semantic meaning. primary_key is logical, not semantic.

Comment thread src/main/java/io/osirrc/ciff/lucene/ExportAnseriniLuceneIndex.java Outdated
@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 26, 2020

@andrewtrotman @JMMackenzie suggestions adopted.

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 26, 2020

Once everyone is happy I'll generate fresh exports.

@cmacdonald
Copy link
Copy Markdown
Member

Two comments:

// The average document length. We store this value explicitly in case the exporting application wants a particular
// level of precision.
double average_doclength = 7;

This is obtainable from parsing the postings lists, no? Given it is provided, should an implementation use it? I'm not sure I follow the motivation for its presence.

string collection_docid = 2;  

a long time ago Terrier standardised nomenclature on docno for the external String for the primary key of the document, as this corresponded to the DOCNO tag in a TREC collection. docid is the integer internal to the system (or CIFF) in this case. I would suggest using the same nomenclature to avoid needing to clarify docid vs docno.

@lintool
Copy link
Copy Markdown
Member Author

lintool commented Feb 26, 2020

Two comments:

// The average document length. We store this value explicitly in case the exporting application wants a particular
// level of precision.
double average_doclength = 7;

This is obtainable from parsing the postings lists, no? Given it is provided, should an implementation use it? I'm not sure I follow the motivation for its presence.

I think the rationale is this: "We store this value explicitly in case the exporting application wants a particular level of precision" as noted in comments. Suppose the exporting application rounds avgdl to the nearest integer... it could use this field to tell the importer.

Don't feel too strongly though...

string collection_docid = 2;  

a long time ago Terrier standardised nomenclature on docno for the external String for the primary key of the document, as this corresponded to the DOCNO tag in a TREC collection. docid is the integer internal to the system (or CIFF) in this case. I would suggest using the same nomenclature to avoid needing to clarify docid vs docno.

I'm okay with docno, although it is a bit counter-intuitive, since "no" connotes an number, which is definitely isn't.

Copy link
Copy Markdown
Member

@chriskamphuis chriskamphuis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

@andrewtrotman
Copy link
Copy Markdown
Contributor

Make sure to make total_terms_in_collection clearer with a better comment

@andrewtrotman andrewtrotman merged commit 8c9d314 into master Mar 2, 2020
lintool added a commit that referenced this pull request Mar 2, 2020
@lintool lintool deleted the protobuf-updates branch March 4, 2020 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants