
What is filtered out? #39

Open
jowagner opened this issue Nov 26, 2020 · 6 comments
Labels
bug (Something isn't working), NCI (Processing the New Corpus of Ireland)

Comments

@jowagner
Collaborator

What kind of material does the filter remove from the NCI? Take a random sample of the 886823 sentences and look for patterns.
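
For reference, a minimal sketch of drawing such a sample, assuming the removed sentences have been collected into a plain-text file, one per line (the file name is a placeholder):

import random

# Placeholder path: a file holding the removed sentences, one per line,
# e.g. produced by diff-ing the corpus before and after filtering.
removed_path = "filtered_out_sentences.txt"

with open(removed_path, encoding="utf-8") as f:
    removed = [line.rstrip("\n") for line in f]

random.seed(42)  # reproducible sample
for sentence in random.sample(removed, min(100, len(removed))):
    print(sentence)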

@jowagner jowagner added the NCI Processing the New Corpus of Ireland label Nov 26, 2020
@jbrry
Owner

jbrry commented Nov 26, 2020

I just took a look at the filtering on the NCI. It seems that most of the removed sentences have a high ratio of punctuation or digits. These are caught by the filters I created, PunctuationFilter and DigitsFilter in https://github.com/jbrry/Irish-BERT/blob/master/filters/customfilters.py. I used a threshold of 0.4 to filter out lines that are telephone numbers or just noisy punctuation. However, looking at examples of the removed sentences, many of them are clean sentences that simply contain a lot of punctuation, e.g. listings of people's names or items with a comma after each. I think a threshold of 0.4 is too aggressive; we should move to around 0.6 or 0.8 so that only lines which really are telephone numbers or random punctuation are removed.

Note: this is on NCI_v2, which isn't segmented by UDPipe, so a line may contain several sentences. It is therefore likely that even more sentences are filtered out after segmentation, because the ratios will be higher on individual sentences. I will try that soon, and I will also repeat the same step with a threshold of 0.6 for PunctuationFilter and DigitsFilter.
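
For illustration, here is a minimal standalone sketch of the kind of ratio check such a filter could apply; the actual PunctuationFilter and DigitsFilter in filters/customfilters.py inherit from OpusFilter's filter base class and may compute the ratio differently:

import string

def punctuation_ratio(sentence: str) -> float:
    # Assumption: ratio is taken over non-space characters.
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 1.0
    return sum(c in string.punctuation for c in chars) / len(chars)

def digit_ratio(sentence: str) -> float:
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 1.0
    return sum(c.isdigit() for c in chars) / len(chars)

def keep(sentence: str, threshold: float = 0.4) -> bool:
    # A line is kept only if both ratios stay below the threshold; with 0.4,
    # a telephone number or a mostly-punctuation line is dropped, but so is
    # a clean comma-separated listing of names.
    return punctuation_ratio(sentence) < threshold and digit_ratio(sentence) < threshold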

The steps I took are below:

python scripts/download_handler.py --datasets NCI
python scripts/text_processor.py --datasets NCI --bucket-size 100000000 --input-type raw --output-type processed
# add our custom filters to PYTHONPATH so OpusFilter can find them
export PYTHONPATH=/home/jbarry/spinning-storage/jbarry/ga_BERT/Irish-BERT/filters/
python scripts/filter_corpora.py --datasets NCI --filter-threshold 0

Running step 1: {'type': 'filter', 'parameters': {'inputs': ['NCI_00.bz2'], 'outputs': ['NCI_00_filtered_0.bz2'], 'filters': [{'LengthFilter': {'unit': 'word', 'min_length': 1, 'max_length': 100}}, {'LongWordFilter': {'threshold': 40}}, {'HtmlTagFilter': {}}, {'PunctuationFilter': {'threshold': 0.4}, 'module': 'customfilters'}, {'DigitsFilter': {'threshold': 0.4}, 'module': 'customfilters'}]}}

bzip2 -d NCI_00.bz2 NCI_00_filtered_0.bz2

cat NCI_00 | wc -l
519591
cat NCI_00_filtered_0 | wc -l
444714

diff NCI_00 NCI_00_filtered_0 > filtered-no_lang_char+digits-0.4.txt
rclone copy filtered-no_lang_char+digits-0.4.txt "gdrive:Theme A DCU/Irish_Data/ForasNaGaeilge/new-extraction/filtered/"

The file is viewable at Theme A DCU/Irish_Data/ForasNaGaeilge/new-extraction/filtered/

@jbrry
Owner

jbrry commented Nov 26, 2020

OK, I think I figured out the problem. The base file NCI_00 contains 519591 lines (where a line may hold multiple sentences); after segmentation it has 1745027 sentences. I adjusted the threshold for digits and punctuation from 0.4 to 0.6 and then removed those filters completely, which left 446702 and 449065 lines respectively, so the digit and punctuation filters do not play a big role.

I then noticed that the LengthFilter (length of a line in words) was set to 100. I increased it to 10000. After doing that and removing the digits/punctuation filters, 519534 lines remain (only 57 removed). Restoring the digits/punctuation filters at 0.6 leaves 517167. So the digits/punctuation filters remove a bit over 2,000 lines, and the main reason so many lines were filtered out was that they contained more than 100 words.

In the most recent NCI runs, the files had to be filtered by OpusFilter beforehand, i.e. on lines which were not yet segmented into sentences. This means those runs suffered a dramatic reduction in data due to this error: essentially any line with over 100 words was excluded.
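
A quick way to confirm that the length cap, rather than the ratio filters, was responsible is to count the lines exceeding 100 whitespace-separated words directly (a sketch; NCI_00 is the decompressed file from the steps above):

# Count lines in the unsegmented corpus that exceed the LengthFilter cap of 100 words.
over_limit = 0
total = 0
with open("NCI_00", encoding="utf-8") as f:
    for line in f:
        total += 1
        if len(line.split()) > 100:
            over_limit += 1
print(f"{over_limit} of {total} lines exceed 100 words")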

The changes I've made to the wiki-bert-pipeline recently mean that OpusFilter is now called in the corpora-filtering step of the pipeline, i.e. after tokenisation and segmentation, so this error wouldn't happen there.

@jbrry jbrry added the bug Something isn't working label Nov 26, 2020
@jowagner
Collaborator Author

jowagner commented Nov 26, 2020

With a length filter present, you need to run split_tokenised_text_into_sentences.py on the output of my extractor script.

@jowagner
Collaborator Author

Rather than updating an intermediate file on gdrive every time a bug is fixed, how about including the extraction and sentence splitting in the download script? Something like

rclone cat "gdrive:$FNG/$CODE.vert" | \
    python2 scripts/extract_text_from_nci_vert.py | \
    python scripts/split_tokenised_text_into_sentences.py > $OUTDIR/NCI_extracted_v2.txt

@jbrry
Owner

jbrry commented Nov 26, 2020

Yes, I agree with this. It was particularly cumbersome that every small change to the intermediate file had to be re-uploaded to Google Drive (e.g. creating a version of NCI v2 with a newline after each document using the --document-newline option). I would update the commands you've written to include that option for our next run, to support document shuffling. As far as I know you can provide just one input file if you want, but there need to be empty lines between documents for the next-sentence-prediction task. If there are no document boundaries, you can break the single document into shards and the pipeline will use data from another shard as a negative example (see the sketch after the commands below).

rclone cat "gdrive:$FNG/$CODE.vert" | \
    python2 scripts/extract_text_from_nci_vert.py --document-newline | \
    python scripts/split_tokenised_text_into_sentences.py > $OUTDIR/NCI_extracted_v2.txt
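
For the no-boundary case, a hypothetical sketch of breaking a single stream of sentences into shards by inserting blank lines (the file names and shard size are placeholders, not part of the pipeline):

# Insert a blank line every SHARD_SIZE sentences so the pretraining pipeline
# can treat each shard as a "document" and draw negative next-sentence
# examples from other shards.
SHARD_SIZE = 1000

with open("NCI_extracted_v2.txt", encoding="utf-8") as src, \
        open("NCI_extracted_v2_sharded.txt", "w", encoding="utf-8") as dst:
    for i, line in enumerate(src, start=1):
        dst.write(line)
        if i % SHARD_SIZE == 0:
            dst.write("\n")  # blank line marks a shard boundary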

@jowagner
Collaborator Author

Sounds great. FYI, there are 3485 documents in the .vert file.

Now that I understand BERT better, it's strange that they call it next sentence prediction when it is actually more like same-document prediction. If each document is actually a collection of documents from a single source, it becomes even more like domain classification. For next sentence prediction, I would prefer the negative examples to be pairs of sentences from the same document, with the constraint that the distance must be at least k sentences, say k=4. The task could also be expanded to predict the order, e.g. 4 classes {AB, BA, AXB, BXA}.
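
A hypothetical sketch of how such negative pairs could be sampled from a single document with a minimum distance of k sentences (not implemented anywhere in the pipeline; it only illustrates the idea):

import random

def sample_pair(sentences, k=4, positive_prob=0.5):
    # Positives are adjacent sentences; negatives come from the same document
    # but at least k sentences apart.
    n = len(sentences)
    assert n > k + 1, "document too short for this sketch"
    if random.random() < positive_prob:
        i = random.randrange(n - 1)
        return (sentences[i], sentences[i + 1]), True
    while True:
        i, j = random.randrange(n), random.randrange(n)
        if abs(i - j) >= k:
            return (sentences[i], sentences[j]), False

The four-class ordering variant would additionally record whether the first sentence precedes or follows the second and whether the two are adjacent.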
