
predict.py to work with raw text files #8

Closed
drunkinlove opened this issue Nov 18, 2019 · 6 comments

Comments

@drunkinlove

Hello!

First of all, thank you for the research and shared code, it's immensely helpful.

I wanted to know if there's an easy way to make predict.py work with raw text files, since that seems to be the purpose of the architecture. Is there a reason my input files have to conform to the CoNLL-U format, besides calculating evaluation metrics?

@Hyperparticle
Owner

It's not strictly necessary to use the conllu predictor; it's just there for convenience during evaluation. All the logic for sentence input and prediction output can be found in predictor.py. _json_to_instance() is probably what you're interested in: it takes a JSON dict containing a sentence, which you can then tokenize (AllenNLP provides a Spacy tokenizer that can handle multilingual text) and pass to the dataset reader. I can get a simple example working for you soon.
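To make the flow concrete, here's a minimal self-contained sketch of that _json_to_instance() pattern. The function and variable names below are illustrative, and a plain whitespace split stands in for the Spacy tokenizer mentioned above; in the real predictor the tokens would be handed to the dataset reader rather than returned directly.

```python
import json

def simple_tokenize(sentence):
    # Stand-in for the Spacy-based tokenizer mentioned above; a real
    # setup would use that tokenizer to handle multilingual text.
    return sentence.split()

def json_to_tokens(json_dict):
    # Mirrors the _json_to_instance() flow: take a {"sentence": ...}
    # dict, tokenize the sentence, and produce the tokens that would
    # be passed on to the dataset reader.
    sentence = json_dict["sentence"]
    return simple_tokenize(sentence)

tokens = json_to_tokens(json.loads('{"sentence": "The quick brown fox"}'))
print(tokens)  # ['The', 'quick', 'brown', 'fox']
```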

@Hyperparticle
Owner

I added a new option --raw_text to predict.py that takes an input file with one sentence per line and outputs one JSON annotation object per line.
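To illustrate that input/output convention, here is a minimal sketch of the line-in, JSON-line-out shape. The annotate() helper and its output fields are hypothetical stand-ins for the model's real predictions, which predict.py --raw_text would produce.

```python
import json

def annotate(sentence):
    # Hypothetical stand-in for the model's prediction; the real
    # --raw_text run emits the model's full annotation object.
    return {"sentence": sentence, "tokens": sentence.split()}

def predict_raw_text(lines):
    # One input sentence per line -> one JSON annotation object per
    # line, skipping blank lines, matching the convention above.
    return [json.dumps(annotate(line.strip()))
            for line in lines if line.strip()]

for out in predict_raw_text(["Hello world", "Another sentence"]):
    print(out)
```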

Hope this helps. Let me know if you need any additional help.

@drunkinlove
Author

great, thanks a lot!

@jzhou316

jzhou316 commented Feb 4, 2020

Hi, I also found the option to directly feed a raw text file for prediction very useful. Thanks for that! I have another small question, related to tokenization and BPE encoding. For example, if my text file is already tokenized (with my own tokenizer) and split by BPE, does it still work? And how are the BPE subwords handled in word-level tag prediction?

@Hyperparticle
Owner

The text should work if it is tokenized, though I'm not sure what will happen if it has already been split with BPE. You could either recombine the words so the model can split them itself (easy), or modify the tokenizer code in the repo to bypass the BPE step (harder).

The BPE subword embeddings are all discarded except for the first subword, which is used to represent the information content of the whole word. In my experiments, I found no discernible difference between using the first, the last, or the average of all subword embeddings. This is also explained in the paper, and there is existing work reporting similar findings.
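A minimal sketch of that first-subword selection, with toy lists standing in for the model's real embedding tensors; the function name and the offsets representation are illustrative, not the repo's actual code.

```python
def first_subword_embeddings(subword_embeddings, word_offsets):
    # Keep only the embedding of each word's first subword and discard
    # the rest, as described above. word_offsets[i] is the index of
    # the first subword belonging to word i.
    return [subword_embeddings[start] for start in word_offsets]

# E.g. "unbelievable" -> BPE pieces "un", "##believ", "##able": three
# subwords, but the word is represented by the embedding of "un" alone.
subwords = [[0.1], [0.2], [0.3], [0.4]]  # "The", "un", "##believ", "##able"
offsets = [0, 1]                         # word starts: "The", "unbelievable"
print(first_subword_embeddings(subwords, offsets))  # [[0.1], [0.2]]
```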

@jzhou316

jzhou316 commented Feb 5, 2020

Thanks for the clarification! That really helps.
