Provide up-to-date pre-processed text files #35
Yes, this was the training corpus used for the first … I'm not sure if I can recover the exact configuration that was used to create this file, but it was created before this repo existed: it was created using the scripts in https://github.com/jbrry/Irish-UD-Parsing and would have used a different version of UDPipe for tokenisation/segmentation. I also think I no longer have this repository on … In any case, this file would have been created as follows (I will also add this to a README in the relevant folder on Google Drive).
Sorry for not having been more clear. As we are highly unlikely to release a model based on this old file, we don't need documentation for it. I meant: "Please update this file to the input of our current best BERT model, and then add a readme identifying this BERT model." Ideally, this should be repeated each time we have a new best model. Team members may want to use this file (or a .tgz archive of multiple files) to look for issues, train their own BERT/RoBERTa/XLM/ELMo/fastText/word2vec model, or use the data for semi-supervised training of NLP components, e.g. to tri-train a UD parser.
OK, good idea. Yes, I intend to upload some sort of corpus snapshot, e.g. the version of … If you want these files as individual files, I can upload all of the individual …
How about referring to a commit via its hash, such as commit 04d4a12, in the readme, and including the options used with … BTW: I don't see …
Would the most recent commit in the repo suffice? E.g.:
Yes, the arguments supplied to …
It just doesn't give you the list of files which were used from Google Drive; that can be found via: …
Yes, sorry. Given that the initial training corpus …
https://stackoverflow.com/questions/949314/how-to-retrieve-the-hash-for-the-current-commit-in-git shows simpler ways.
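For reference, the simpler approach that Stack Overflow answer describes is `git rev-parse`. A self-contained sketch (it creates a throwaway repository only so the commands can be run anywhere; in practice you would run the last two lines inside the Irish-UD-Parsing checkout):

```shell
# Throwaway repo so the example is runnable outside any checkout.
repo=$(mktemp -d) && cd "$repo"
git init -q .
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -m "corpus snapshot" -q

# Record the exact code version used to create the corpus files.
git rev-parse HEAD          # full 40-character commit hash
git rev-parse --short HEAD  # abbreviated hash, like 04d4a12 above
```

Putting the full hash in the README pins the snapshot to an exact state of the scripts.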
If this file was in the repo it would be covered by the commit. Is there anything sensitive in there? If you agree that it would be a good idea to move it into the repo let's check with Teresa whether the list of filenames can be published or must stay secret.
Do we still need or use this script, or is it obsolete?
I created an issue for …
The file
Irish_Data > processed_ga_files_for_BERT_runs > train.txt
mentioned in issue #32 is severely out of date, and there is no documentation of what settings were used. Please update it and add a readme. Given that BERT requires multiple input files for its next-sentence objective, it will also be better for reproducibility to provide these individual files, e.g. as a .tgz.
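A minimal sketch of the requested packaging, with hypothetical file names. BERT's pretraining pipeline expects one sentence per line with a blank line between documents, so keeping the per-corpus files (rather than one concatenated train.txt) preserves the document boundaries the next-sentence objective relies on:

```shell
# Hypothetical layout: one pre-processed text file per source corpus,
# one sentence per line, blank line between documents.
mkdir -p processed_ga_files
printf 'An chéad abairt .\nAn dara habairt .\n\n' > processed_ga_files/corpus_a.txt
printf 'Abairt eile .\n' > processed_ga_files/corpus_b.txt

# Bundle the individual files so the exact BERT inputs are reproducible.
tar -czf processed_ga_files.tgz processed_ga_files

tar -tzf processed_ga_files.tgz   # list the archive contents
```

Uploading the .tgz alongside the README (with the commit hash of the scripts that produced it) would make the corpus snapshot reproducible end to end.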