Provide up-to-date pre-processed text files #35

Open · jowagner opened this issue Nov 25, 2020 · 8 comments
Labels: documentation (Improvements or additions to documentation), reproducibility (Improve transparency what was done)

jowagner (Collaborator) commented Nov 25, 2020

The file Irish_Data > processed_ga_files_for_BERT_runs > train.txt mentioned in issue #32 is severely out of date, and there is no documentation of what settings were used to create it. Please update it and add a readme. Given that BERT requires multiple input files for its next-sentence objective, it would also be better for reproducibility to provide these individual files, e.g. as a .tgz.

jowagner (Collaborator, Author) commented Nov 25, 2020

If you have a .conllu version of these files ready, that would save me a few minutes of converting the files myself. I only need the first two columns populated, as below. No worries if not.

$ head ga-common_crawl-000.conllu 
1       Iarscoláire     _       _       _       _       _       _       _       _
2       é       _       _       _       _       _       _       _       _
3       de      _       _       _       _       _       _       _       _
4       chuid   _       _       _       _       _       _       _       _
5       na      _       _       _       _       _       _       _       _
6       meánscoile      _       _       _       _       _       _       _       _
7       Coláiste        _       _       _       _       _       _       _       _
8       Eoin    _       _       _       _       _       _       _       SpaceAfter=No
9       .       _       _       _       _       _       _       _       _
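
In case it helps, here is a minimal conversion sketch, assuming the processed text is whitespace-tokenised with one sentence per line; the file names are only examples and no attempt is made to recover SpaceAfter=No:

# hypothetical input: one sentence per line, tokens separated by single spaces;
# emit ID and FORM plus eight underscore columns, with a blank line after each sentence
awk '{ for (i = 1; i <= NF; i++) printf "%d\t%s\t_\t_\t_\t_\t_\t_\t_\t_\n", i, $i; print "" }' \
    ga-common_crawl-000.txt > ga-common_crawl-000.conllu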

jowagner added the documentation and reproducibility labels on Nov 25, 2020
jbrry (Owner) commented Nov 25, 2020

Yes, this was the training corpus used for the first multilingual_bert run, when we used this file to do continued pre-training of multilingual_bert. I added the training corpus to Google Drive so it could be accessed later for reproducibility. The pipeline has changed a lot since then, and this file would be considered obsolete: it is always excluded from subsequent runs.

I'm not sure I can recover the exact configuration that was used to create this file, as it was created before this repo existed: it was produced with the scripts in https://github.com/jbrry/Irish-UD-Parsing and would have used a different version of UDPipe for tokenisation/segmentation. I also think I no longer have that repository on grove, due to needing to free up space.

In any case, this file would have been created as follows (I will also add this to a README in the relevant folder on Google Drive); a rough shell sketch follows the list.

  1. Data for conll17 is manually downloaded.
  2. The conll17 data is tokenized/segmented using UDPipe.
  3. The conll17 data and the data on Google Drive are combined into a single file (with some manual exclusion of Paracrawl and NCI_Cleaned).
  4. Run https://github.com/jbrry/Irish-UD-Parsing/blob/master/scripts/text_processor.py on the input file from step 3.
  5. Run https://github.com/jbrry/Irish-UD-Parsing/blob/master/scripts/create_pretraining_data.sh to break the input file into shards.
  6. Run https://github.com/jbrry/Irish-UD-Parsing/blob/master/scripts/run_pretraining.sh to do continued training of multilingual_bert using this training corpus.
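
In shell terms, steps 4-6 would have looked roughly like the following; the exact invocations were not recorded, so this only shows the shape of the pipeline and anything in angle brackets is a placeholder:

# step 4: clean/normalise the combined text file (original options unknown)
python scripts/text_processor.py <options used at the time>
# step 5: split the processed text into shards for BERT pre-training
bash scripts/create_pretraining_data.sh
# step 6: continued pre-training of multilingual_bert on those shards
bash scripts/run_pretraining.sh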

jowagner (Collaborator, Author) commented Nov 26, 2020

Sorry for not having been clearer. As we are highly unlikely to release a model based on this old file, we don't need documentation for it. I meant: "Please update this file to the input of our current best BERT model and then add a readme identifying that BERT model." Ideally, this should be repeated each time we have a new best model. Team members may want to use this file (or a .tgz archive of multiple files) to look for issues, train their own bert/roberta/xlm/elmo/fasttext/word2vec model, or use the data for semi-supervised training of NLP components, e.g. to tri-train a UD parser.

jbrry (Owner) commented Nov 26, 2020

Ok, good idea. Yes, I intend to upload some sort of corpus snapshot, e.g. the version of gdrive_file_list.csv which shows exactly which files were downloaded from Google Drive.

If you want these as individual files, I can upload all of the individual .bz2 files which are stored in data/ga/<corpus>/raw/ by scripts/download_handler.py. Then we have access to all of the raw files we used prior to bucketing and the subsequent tokenisation/segmentation/filtering which takes place in wiki-bert-pipeline. In order to be fully deterministic, the filtering config files will also be included in the snapshot.
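
For example, the snapshot could be packaged in one go; the archive name is arbitrary and the location of the filtering config files is an assumption, so adjust the last path as needed:

# bundle the raw per-corpus .bz2 files, the Google Drive file list and the
# filtering config files into a single reproducibility snapshot
tar -czf ga_corpus_snapshot.tgz \
    data/ga/*/raw/*.bz2 \
    data/ga/gdrive/gdrive_filelist.csv \
    <path to filtering config files>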

jowagner (Collaborator, Author) commented:

How about referring to a commit via its hash, such as commit 04d4a12, in the readme and including the options used with text_processor.py? This should cover everything needed, at least if we follow the suggestion in #39 (comment) to stop using intermediate files on gdrive and instead carry out all processing with scripts in this repo, starting from original files that do not change during the project.
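
To make that concrete, the readme entry could be generated with something like the following; the readme name and wording are only suggestions, not an existing convention in this repo:

# hypothetical: append a provenance note to the corpus readme
cat >> README.txt << 'EOF'
Built with jbrry/Irish-BERT at commit <full or short hash, e.g. 04d4a12>
Preprocessing command: python scripts/text_processor.py <options used for this run>
EOF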

BTW: I don't see create_pretraining_data.sh in the scripts folder.

jbrry (Owner) commented Nov 26, 2020

Would the most recent commit in the repo suffice? E.g.:

# take the first line from git log and print the hash
git log | head -n 1 | awk -F " " '{print $2}'

Yes, the arguments supplied to text_processor.py should tell you which datasets were used, e.g.:

python scripts/text_processor.py --datasets conll17 gdrive NCI oscar --bucket-size 100000000 --input-type raw --output-type processed

It just doesn't give you the list of files which were used from Google Drive; that can be found via:

cat data/ga/gdrive/gdrive_filelist.csv

> BTW: I don't see create_pretraining_data.sh in the scripts folder.

Yes, sorry: given that the initial training corpus train.txt this issue refers to was uploaded at the start of the year, the script is located in the old repo Irish-UD-Parsing: https://github.com/jbrry/Irish-UD-Parsing/blob/master/scripts/create_pretraining_data.sh

jowagner (Collaborator, Author) commented:

> Would the most recent commit in the repo suffice? E.g.:
>
> # take the first line from git log and print the hash
> git log | head -n 1 | awk -F " " '{print $2}'

https://stackoverflow.com/questions/949314/how-to-retrieve-the-hash-for-the-current-commit-in-git shows simpler ways.
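
For the record, the simplest forms from that thread are standard git commands, nothing project-specific:

# full hash of the current commit
git rev-parse HEAD
# abbreviated hash (like the 04d4a12 form mentioned above)
git rev-parse --short HEAD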

> It just doesn't give you the list of files which were used from Google Drive; that can be found via:
>
> cat data/ga/gdrive/gdrive_filelist.csv

If this file were in the repo, it would be covered by the commit. Is there anything sensitive in there? If you agree that it would be a good idea to move it into the repo, let's check with Teresa whether the list of filenames can be published or must stay secret.

> BTW: I don't see create_pretraining_data.sh in the scripts folder.
>
> Yes, sorry: given that the initial training corpus train.txt this issue refers to was uploaded at the start of the year, the script is located in the old repo Irish-UD-Parsing: https://github.com/jbrry/Irish-UD-Parsing/blob/master/scripts/create_pretraining_data.sh

Do we still need or use this script or is it obsolete?

jowagner (Collaborator, Author) commented:

I created an issue for gdrive_filelist.csv (#43). Assign it to Teresa if you agree it is a good idea; otherwise, close it with the "wont-fix" label.
