What is filtered out? #39
I just took a look at the filtering on NCI. It seems that most of the sentences which are removed have a high ratio of punctuation or numbers. These are filtered out by the filters I created. The steps I took are below:
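A filter of the kind described above could look something like this. This is only a sketch: the 0.5 threshold and the character-class test are illustrative guesses, not the actual values used in the repo.

```python
import string

# Hypothetical ratio-based sentence filter; the threshold is a
# placeholder, not the repo's actual setting.
def keep_sentence(sentence, max_ratio=0.5):
    """Keep a sentence unless most of its characters are digits or punctuation."""
    if not sentence.strip():
        return False
    noisy = sum(1 for ch in sentence if ch.isdigit() or ch in string.punctuation)
    return noisy / len(sentence) <= max_ratio
```

Ordinary prose passes, while lines dominated by page numbers, references, or tables of figures fall under the threshold and are dropped.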
The file is viewable at
Ok, I think I figured out the problem. The base file The changes I've made to the
With a length filter present, you need to run
Rather than updating an intermediate file on gdrive every time a bug is fixed, how about including the extraction and sentence splitting in the download script? Something like
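The snippet that followed "Something like" did not survive in this copy of the thread, but the shape of the proposal might be sketched as below. Everything here is a placeholder: the URL, the file names, and the naive regex splitter stand in for whatever download and splitting code the repo actually uses.

```python
import re
import urllib.request

def fetch(url):
    """Placeholder download step for the raw corpus."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def split_sentences(text):
    """Naive stand-in splitter: break on sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def download_corpus(url, out_path):
    """Download, extract, and sentence-split in one step, so no
    intermediate file needs to live on Google Drive."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(split_sentences(fetch(url))))
```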
Yes, I agree with this. It was particularly cumbersome when one small change was made to the intermediate file, which then had to be re-uploaded to Google Drive (e.g. creating a version of NCI v2 with newlines after each document by using the
Sounds great. FYI, there are 3485 documents in the

Understanding BERT better: it's strange that they call it next sentence prediction when it is actually more like same-document prediction. If each document is actually a collection of documents from a single source, it becomes even more like domain classification. For next sentence prediction, I would prefer the negative examples to be pairs of sentences from the same document, with the constraint that the distance be at least k sentences, say k = 4. The task could also be expanded to predict the order, e.g. 4 classes {AB, BA, AXB, BXA}.
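The sampling scheme proposed above could be sketched as follows. This is an assumption drawn from the comment, not existing code: negatives come from the same document, at least k sentences away, rather than from a different document.

```python
import random

def make_pairs(doc, k=4, rng=random):
    """Yield (sentence_a, sentence_b, label) pairs from one document.

    doc is a list of sentences. Each adjacent pair is a positive
    example; each negative pairs a sentence with another sentence from
    the SAME document at index distance >= k.
    """
    pairs = []
    for i in range(len(doc) - 1):
        pairs.append((doc[i], doc[i + 1], "next"))             # positive pair
        distant = [j for j in range(len(doc)) if abs(j - i) >= k]
        if distant:
            j = rng.choice(distant)
            pairs.append((doc[i], doc[j], "distant"))          # negative pair
    return pairs
```

The 4-class ordering variant {AB, BA, AXB, BXA} would replace the binary label with a class encoding both adjacency and order.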
What kind of material does the filter remove from the NCI? Take a random sample of the 886823 sentences and look for patterns.
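The audit suggested above could be done with a small reproducible sample. A sketch, assuming the removed sentences sit one per line in a text file (the file name and sample size are placeholders):

```python
import random

def sample_sentences(path, n=100, seed=0):
    """Draw a reproducible random sample of sentences to eyeball for patterns."""
    with open(path, encoding="utf-8") as f:
        sentences = [line.rstrip("\n") for line in f if line.strip()]
    return random.Random(seed).sample(sentences, min(n, len(sentences)))
```

Fixing the seed makes the sample reproducible, so two people looking at the same file see the same sentences.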