Fine-tuned models request #5

Are there any models available for Abstractive Summarization and Grammatical Error Correction? It would also be good to have a `utils` file for parsing the MSR Abstractive Text Compression Dataset and for the Low Resource Track of the GEC task.

Comments
We haven't published the models since they were trained with an internal version of LaserTagger and thus might not be fully compatible. However, we can share the predictions for the eval sets if needed. The easiest approach for parsing the summarization and the GEC datasets is to store them as TSV files and use the …
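For what it's worth, here is a minimal sketch of writing such a TSV file. The file name and the plain two-column source/target layout are assumptions based on the WikiSplit-style format, not something confirmed by the maintainers:

```python
# Minimal sketch: write (source, target) pairs as a WikiSplit-style TSV,
# one example per line, source and target separated by a tab.
def write_tsv(pairs, path):
    with open(path, "w", encoding="utf-8") as f:
        for source, target in pairs:
            # A stray tab inside the text would break the two-column
            # format, so replace tabs with spaces defensively.
            f.write(source.replace("\t", " ") + "\t" +
                    target.replace("\t", " ") + "\n")

# Hypothetical example pair; real pairs would come from the MSR
# summarization or BEA 2019 GEC files.
write_tsv([("he was described as a genius .", "he was a genius .")],
          "train.tsv")
```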
@ekQ I tried that. I got the dataset formatted in the same WikiSplit-style way and used ERRANT to generate the GEC ground truth. I am also confused about adding new tags to the label map, as it only maps to KEEP/DELETE. How do I add tags?
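In case it helps others: here is a rough sketch, not from this repo, of turning a BEA 2019 M2 file (the format ERRANT produces) into the (source, target) pairs that the TSV layout above needs. It assumes the standard M2 block structure, keeps only annotator 0's edits, skips noop lines, and applies edits right to left so token offsets stay valid:

```python
# Sketch: convert an M2 file (ERRANT / BEA 2019 format) into
# (source, target) sentence pairs. Hypothetical helper, not part of
# LaserTagger; assumes one "S ..." line per block followed by "A ..."
# edit lines, with blocks separated by blank lines.
def m2_to_pairs(m2_path):
    pairs = []
    with open(m2_path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    for block in blocks:
        lines = block.splitlines()
        source = lines[0][2:].split()  # strip the leading "S "
        edits = []
        for line in lines[1:]:         # "A start end|||type|||corr|||..."
            fields = line[2:].split("|||")
            span, etype, corr, annotator = (fields[0], fields[1],
                                            fields[2], fields[-1])
            if etype == "noop" or annotator != "0":
                continue
            start, end = map(int, span.split())
            edits.append((start, end, corr))
        target = list(source)
        # Apply edits right to left so earlier offsets stay valid.
        for start, end, corr in sorted(edits, reverse=True):
            target[start:end] = corr.split() if corr != "-NONE-" else []
        pairs.append((" ".join(source), " ".join(target)))
    return pairs
```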
We used vocabulary size 500 for all of the experiments in the paper (but that's not necessarily optimal for those tasks). You should be able to train a model by using this script and just updating the first three parameters:
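On the label map question: if I understand the paper correctly, you don't add tags by hand; the phrase vocabulary optimization step generates them from the training data by combining a base tag (KEEP or DELETE) with an optional phrase to insert. A hypothetical miniature map, just to illustrate the shape (the real tags and ids depend entirely on your data):

```python
# Hypothetical miniature label map; a real one is generated by phrase
# vocabulary optimization, and with vocabulary size 500 it contains
# roughly the 500 most useful inserted phrases.
label_map = {
    "KEEP": 0,        # keep the source token unchanged
    "DELETE": 1,      # drop the source token
    "KEEP|,": 2,      # keep the token, inserting "," before it
    "DELETE|and": 3,  # drop the token, inserting "and" before it
}
```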
I have been using that script. I tried the MSR summarization dataset but didn't get decent results, so I thought maybe some parameters needed tuning.
The summarization dataset is actually different from the other datasets studied in the paper in that it's not already tokenized. You should first run the datasets through a tokenizer and then detokenize the predictions before computing the scores, to get numbers comparable to ours. Tokenization is important since otherwise "A, b." will be treated as two tokens, "A," and "b.", instead of four. We used an internal version of the Cloud Natural Language API for this, but you can use any tokenizer.
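As an illustration of that round trip with an off-the-shelf tokenizer (NLTK here, standing in for the internal API mentioned above; any tokenizer with a matching detokenizer should do):

```python
# Sketch of the tokenize -> predict -> detokenize round trip using NLTK.
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer

nltk.download("punkt", quiet=True)

text = "A, b."
tokens = nltk.word_tokenize(text)
print(tokens)  # ['A', ',', 'b', '.'] -- four tokens, not two

# " ".join(tokens) is what you would feed to the preprocessing scripts.
# After prediction, detokenize before computing scores:
prediction_tokens = tokens  # placeholder for actual model output
print(TreebankWordDetokenizer().detokenize(prediction_tokens))  # "A, b."
```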
So this tokenisation step needs to happen before phrase vocabulary optimization? Is it the same for the BEA 2019 error correction dataset? Does it need tokenisation too?
Correct, it should happen at the very beginning.
Great, thanks. I'll try it out and let you know how it goes.
Thanks, please do.