
Fine-tuned models request #5

Closed
ghost opened this issue Jan 15, 2020 · 9 comments

Comments

@ghost
Copy link

ghost commented Jan 15, 2020

Are there any models available for Abstractive Summarization and Grammatical Error Correction?

It would also be good to have a utils file for parsing the MSR Abstractive Text Compression Dataset and for the Low Resource Track of the GEC task.

@ekQ
Copy link
Collaborator

ekQ commented Jan 29, 2020

We haven't published the models since they are trained with an internal version of LaserTagger and thus might not be fully compatible. However, we can share the predictions for the eval sets if needed.

The easiest approach for parsing the summarization and GEC datasets is to store them as TSV files and use the wikisplit input format, which expects the source in the first column and the target in the second column.
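The conversion described above can be sketched as follows. This is a minimal illustration, not code from the repository; the dataset-specific parsing that produces the `(source, target)` pairs is assumed to happen elsewhere.

```python
# Sketch: write (source, target) sentence pairs in the two-column TSV
# layout that LaserTagger's "wikisplit" input format expects:
# source in the first column, target in the second.
import csv

def write_wikisplit_tsv(pairs, path):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for source, target in pairs:
            writer.writerow([source, target])

pairs = [
    ("He is a good boy he plays well .",
     "He is a good boy . He plays well ."),
]
write_wikisplit_tsv(pairs, "train.tsv")
```

The same layout works for both the summarization and the GEC data, so one small helper covers both tasks.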

@ekQ ekQ closed this as completed Jan 29, 2020
@ghost
Copy link
Author

ghost commented Jan 29, 2020

@ekQ I tried that: I formatted the dataset in the wikisplit way and used ERRANT to generate the GEC ground truth.
What would be a good value of PHRASE_VOCAB_SIZE for these two tasks?

I am also confused about adding new tags to the label map, since it only maps to KEEP/DELETE. How do I add new tags?
Any help in getting a working demo with accuracies similar to those reported in the paper for these two tasks would be highly appreciated.

@ekQ
Copy link
Collaborator

ekQ commented Jan 29, 2020

We used vocabulary size 500 for all of the experiments in the paper (but that's not necessarily optimal for those tasks).

You should be able to train a model by using this script, updating just the first three parameters:
https://github.com/google-research/lasertagger/blob/master/run_wikisplit_experiment.sh#L21-L26
The script will construct the label map.
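To make the label-map question above concrete, here is an illustrative sketch (not the repository's exact code) of how LaserTagger's tagging scheme represents insertions: the base tags KEEP and DELETE are combined with each phrase in the optimized phrase vocabulary, so no extra tag types need to be added by hand.

```python
# Illustrative sketch of a LaserTagger-style label map. Each phrase in the
# optimized phrase vocabulary yields two composite tags: "KEEP|phrase"
# (keep the token and insert the phrase before it) and "DELETE|phrase"
# (delete the token and insert the phrase). Phrases below are made up.
def build_label_map(phrase_vocab):
    labels = ["KEEP", "DELETE"]
    for phrase in phrase_vocab:
        labels.append("KEEP|" + phrase)
        labels.append("DELETE|" + phrase)
    return {label: idx for idx, label in enumerate(labels)}

label_map = build_label_map([",", "and", ". He"])
# e.g. "KEEP|and" tags a token as kept, with "and" inserted before it.
```

With PHRASE_VOCAB_SIZE = 500, this yields roughly 2 + 2 * 500 labels, which is why the vocabulary size is the main knob rather than hand-added tags.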

@ghost
Copy link
Author

ghost commented Jan 29, 2020

I have been using that script. I tried the MSR summarization dataset but didn't get decent results, so I thought maybe some parameters need tuning.

@ekQ
Copy link
Collaborator

ekQ commented Feb 5, 2020

The summarization dataset differs from the other datasets studied in the paper in that it's not already tokenized. You should first run the datasets through a tokenizer and then detokenize the predictions before computing the scores, to get numbers comparable to ours. Tokenization is important since otherwise "A, b." will be treated as two tokens ("A," and "b.") instead of four.

We used an internal version of Cloud Natural Language API for this, but you can use any tokenizer.
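As a minimal stand-in for a real tokenizer (spaCy, NLTK, or the Cloud Natural Language API mentioned above would all work), the round trip can be sketched like this. The regex rules here are a simplification for illustration only.

```python
# Minimal tokenize/detokenize sketch: split punctuation off into its own
# tokens before training, then naively re-attach it to the preceding
# token when post-processing predictions for scoring.
import re

def tokenize(text):
    # Words become one token each; every punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def detokenize(tokens):
    # Naive inverse: join with spaces, then glue punctuation back on.
    text = " ".join(tokens)
    return re.sub(r" ([^\w\s])", r"\1", text)

tokens = tokenize("A, b.")  # ["A", ",", "b", "."] -- four tokens, not two
restored = detokenize(tokens)  # "A, b."
```

This makes the point in the comment above explicit: without the tokenization pass, "A," and "b." would each be treated as a single token.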

@ghost
Copy link
Author

ghost commented Feb 5, 2020

So this tokenisation step needs to happen before phrase vocabulary optimization?

Is it the same for the BEA 2019 error correction dataset? Does it need tokenisation too?

@ekQ
Copy link
Collaborator

ekQ commented Feb 5, 2020

Correct, it should happen in the very beginning.
The BEA dataset should already be tokenized.

@ghost
Copy link
Author

ghost commented Feb 5, 2020

Great, thanks. I'll try it out and let you know how it goes.

@ekQ
Copy link
Collaborator

ekQ commented Feb 5, 2020

Thanks, please do.
