
Fine-tuned models request #5

Closed
ghost opened this issue Jan 15, 2020 · 9 comments

Comments

@ghost
Copy link

ghost commented Jan 15, 2020

Are there any models available for Abstractive Summarization and Grammatical Error Correction?

It would also be good to have a utils file for parsing the MSR Abstractive Text Compression Dataset and for the Low Resource Track of the GEC task.

@ekQ
Copy link
Collaborator

ekQ commented Jan 29, 2020

We haven't published the models since they are trained with an internal version of LaserTagger and thus might not be fully compatible. However, we can share the predictions for the eval sets if needed.

The easiest approach for parsing the summarization and GEC datasets is to store them as TSV files and use the wikisplit input format, which expects the source in the first column and the target in the second column.
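The conversion described above can be sketched as follows. This is a minimal illustration, not code from the repository; the dataset-specific parsing that produces the `(source, target)` pairs is assumed to happen elsewhere.

```python
# Sketch: write (source, target) sentence pairs in the two-column TSV
# layout that LaserTagger's "wikisplit" input format expects:
# source in the first column, target in the second.
import csv

def write_wikisplit_tsv(pairs, path):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for source, target in pairs:
            writer.writerow([source, target])

pairs = [
    ("He is a good boy he plays well .",
     "He is a good boy . He plays well ."),
]
write_wikisplit_tsv(pairs, "train.tsv")
```

The same layout works for both the summarization and the GEC data, so one small helper covers both tasks.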

@ekQ ekQ closed this as completed Jan 29, 2020
@ghost
Copy link
Author

ghost commented Jan 29, 2020

@ekQ I tried that: I formatted the dataset in the wikisplit way and used ERRANT to generate the GEC ground truth.
What would be a good value of PHRASE_VOCAB_SIZE for these two tasks?

I am also confused about adding new tags to the label map, since it only maps to KEEP/DELETE. How do I add new tags?
Any help in getting a working demo with accuracies similar to those reported in the paper for these two tasks would be highly appreciated.

@ekQ
Copy link
Collaborator

ekQ commented Jan 29, 2020

We used vocabulary size 500 for all of the experiments in the paper (but that's not necessarily optimal for those tasks).

You should be able to train a model by using this script, updating just the first three parameters:
https://github.com/google-research/lasertagger/blob/master/run_wikisplit_experiment.sh#L21-L26
The script will construct the label map.
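To make the label-map question above concrete, here is an illustrative sketch (not the repository's exact code) of how LaserTagger's tagging scheme represents insertions: the base tags KEEP and DELETE are combined with each phrase in the optimized phrase vocabulary, so no extra tag types need to be added by hand.

```python
# Illustrative sketch of a LaserTagger-style label map. Each phrase in the
# optimized phrase vocabulary yields two composite tags: "KEEP|phrase"
# (keep the token and insert the phrase before it) and "DELETE|phrase"
# (delete the token and insert the phrase). Phrases below are made up.
def build_label_map(phrase_vocab):
    labels = ["KEEP", "DELETE"]
    for phrase in phrase_vocab:
        labels.append("KEEP|" + phrase)
        labels.append("DELETE|" + phrase)
    return {label: idx for idx, label in enumerate(labels)}

label_map = build_label_map([",", "and", ". He"])
# e.g. "KEEP|and" tags a token as kept, with "and" inserted before it.
```

With PHRASE_VOCAB_SIZE = 500, this yields roughly 2 + 2 * 500 labels, which is why the vocabulary size is the main knob rather than hand-added tags.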

@ghost
Copy link
Author

ghost commented Jan 29, 2020

I have been using that script. I tried the MSR summarization dataset but didn't get decent results, so I thought maybe some parameters need tuning.

@ekQ
Copy link
Collaborator

ekQ commented Feb 5, 2020

The summarization dataset differs from the other datasets studied in the paper in that it's not already tokenized. You should first run the datasets through a tokenizer and then detokenize the predictions before computing the scores, to get numbers comparable to ours. Tokenization is important since otherwise "A, b." will be treated as two tokens ("A," and "b.") instead of four.

We used an internal version of Cloud Natural Language API for this, but you can use any tokenizer.
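As a minimal stand-in for a real tokenizer (spaCy, NLTK, or the Cloud Natural Language API mentioned above would all work), the round trip can be sketched like this. The regex rules here are a simplification for illustration only.

```python
# Minimal tokenize/detokenize sketch: split punctuation off into its own
# tokens before training, then naively re-attach it to the preceding
# token when post-processing predictions for scoring.
import re

def tokenize(text):
    # Words become one token each; every punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def detokenize(tokens):
    # Naive inverse: join with spaces, then glue punctuation back on.
    text = " ".join(tokens)
    return re.sub(r" ([^\w\s])", r"\1", text)

tokens = tokenize("A, b.")  # ["A", ",", "b", "."] -- four tokens, not two
restored = detokenize(tokens)  # "A, b."
```

This makes the point in the comment above explicit: without the tokenization pass, "A," and "b." would each be treated as a single token.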

@ghost
Copy link
Author

ghost commented Feb 5, 2020

So this tokenisation step needs to happen before phrase vocabulary optimization?

Is it the same for the BEA 2019 error correction dataset? Does it need tokenisation too?

@ekQ
Copy link
Collaborator

ekQ commented Feb 5, 2020

Correct, it should happen in the very beginning.
The BEA dataset should already be tokenized.

@ghost
Copy link
Author

ghost commented Feb 5, 2020

Great, thanks. I'll try it out and let you know how it goes.

@ekQ
Copy link
Collaborator

ekQ commented Feb 5, 2020

Thanks, please do.
