feat: improve parsing of unlabeled CSVs #678
Tried it on the modified xmarket data, lgtm
90d1f1f to feef2c1
Nice work, added some comments
`This is an English sentence`, `Das ist ein englischer Satz` and `Dit is een Engelse zin` will all be given the same label.
Maybe it is better to roll out the tabs, since they are not just the same thing with small modifications. The content is rather different in each tab.
I'm not sure I understand. Do you mean use tabs for all the example CSVs?
I mean rather remove all tabs
lgtm!
📝 Docs are deployed on https://ft-feat-binary-relevance-judgement--jina-docs.netlify.app 🎉
LGTM
This PR adjusts the `load_finetuning_dataset` method to parse unlabeled CSV files in a way that removes duplicate documents.
This PR does not implement support for binary relevance judgement exactly; instead, it changes the way unlabeled data is parsed to achieve the same result. The formats in which users can provide their data have not changed, only the DocumentArrays that are returned.
For unlabeled data provided in this format:
Previously, we would have returned a DocumentArray with 14 elements (4 duplicates) and 7 classes; now we return a DocumentArray with 10 elements and 3 classes.
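
To make the counts above concrete, here is a minimal sketch of one way such deduplication could work. It is not the code in this PR: the helper name `label_unlabeled_pairs`, the sample rows, and the grouping rule (texts that share a row, directly or transitively, get the same class) are all assumptions for illustration, and it uses plain dicts rather than DocumentArrays.

```python
import csv
import io


def label_unlabeled_pairs(csv_text):
    """Map each unique text to a class id (hypothetical helper, not finetuner's API).

    Assumption: two texts on the same CSV row belong to the same class,
    and that relation is transitive across rows.
    """
    parent = {}  # union-find forest; keys keep first-appearance order

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        root_left, root_right = find(row[0].strip()), find(row[1].strip())
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two groups

    labels, class_ids = {}, {}
    for text in parent:
        root = find(text)
        labels[text] = class_ids.setdefault(root, len(class_ids))
    return labels


# Made-up sample: 7 rows, so 14 cells, 4 of which repeat an earlier text.
SAMPLE = """\
shirt,blue shirt
shirt,red shirt
shirt,green shirt
shoe,running shoe
shoe,leather shoe
hat,sun hat
hat,wool hat
"""

labels = label_unlabeled_pairs(SAMPLE)
print(len(labels), len(set(labels.values())))  # prints "10 3"
```

Treating every row as its own class would give 14 documents in 7 classes for this sample; merging rows that share a text gives 10 unique documents in 3 classes, which matches the shape of the change described above.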