
predict.py to work with raw text files #8

Closed
drunkinlove opened this issue Nov 18, 2019 · 6 comments

Comments

@drunkinlove

Hello!

First of all, thank you for the research and shared code, it's immensely helpful.

I wanted to know if there's an easy way to make predict.py work with raw text files, since that seems to be the purpose of the architecture. Is there a reason my input files have to conform to the CoNLL-U format, besides calculating evaluation metrics?

@Hyperparticle
Owner

It's not strictly necessary to use the conllu predictor; it's just there for convenience during evaluation. All the logic for sentence input and prediction output can be found in predictor.py. _json_to_instance() is probably what you're interested in: it takes a JSON dict containing a sentence, which you can then tokenize (AllenNLP provides a Spacy tokenizer that can handle multilingual text) and pass to the dataset reader. I can get a simple example working for you soon.
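To make the flow concrete, here's a minimal self-contained sketch of that _json_to_instance() pattern. The function and variable names below are illustrative, and a plain whitespace split stands in for the Spacy tokenizer mentioned above; in the real predictor the tokens would be handed to the dataset reader rather than returned directly.

```python
import json

def simple_tokenize(sentence):
    # Stand-in for the Spacy-based tokenizer mentioned above; a real
    # setup would use that tokenizer to handle multilingual text.
    return sentence.split()

def json_to_tokens(json_dict):
    # Mirrors the _json_to_instance() flow: take a {"sentence": ...}
    # dict, tokenize the sentence, and produce the tokens that would
    # be passed on to the dataset reader.
    sentence = json_dict["sentence"]
    return simple_tokenize(sentence)

tokens = json_to_tokens(json.loads('{"sentence": "The quick brown fox"}'))
print(tokens)  # ['The', 'quick', 'brown', 'fox']
```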

@Hyperparticle
Owner

I added a new option --raw_text to predict.py that takes an input file with one sentence per line and outputs one JSON annotation object per line.
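To illustrate that input/output convention, here is a minimal sketch of the line-in, JSON-line-out shape. The annotate() helper and its output fields are hypothetical stand-ins for the model's real predictions, which predict.py --raw_text would produce.

```python
import json

def annotate(sentence):
    # Hypothetical stand-in for the model's prediction; the real
    # --raw_text run emits the model's full annotation object.
    return {"sentence": sentence, "tokens": sentence.split()}

def predict_raw_text(lines):
    # One input sentence per line -> one JSON annotation object per
    # line, skipping blank lines, matching the convention above.
    return [json.dumps(annotate(line.strip()))
            for line in lines if line.strip()]

for out in predict_raw_text(["Hello world", "Another sentence"]):
    print(out)
```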

Hope this helps. Let me know if you need any additional help.

@drunkinlove
Author

great, thanks a lot!

@jzhou316

jzhou316 commented Feb 4, 2020

Hi, I also found the option to directly feed a raw text file for prediction very useful. Thanks for that! I have another small question, related to tokenization and BPE encoding. For example, if my text file is already tokenized (with my own tokenizer) and split by BPE, does it still work? And how are the BPE subwords handled in word-level tag prediction?

@Hyperparticle
Owner

The text should work if it is tokenized, though I'm not sure what will happen if it has already been split with BPE. You could either recombine the words so the model can split them itself (easy), or modify the tokenizer code in the repo to bypass the BPE step (harder).

The BPE subword embeddings are all discarded except for the first subword, which is used to represent the information content of the whole word. In my experiments, I found no discernible difference between using the first, the last, or the average of all subword embeddings. This is also explained in the paper, and there is existing work reporting similar findings.
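A minimal sketch of that first-subword selection, with toy lists standing in for the model's real embedding tensors; the function name and the offsets representation are illustrative, not the repo's actual code.

```python
def first_subword_embeddings(subword_embeddings, word_offsets):
    # Keep only the embedding of each word's first subword and discard
    # the rest, as described above. word_offsets[i] is the index of
    # the first subword belonging to word i.
    return [subword_embeddings[start] for start in word_offsets]

# E.g. "unbelievable" -> BPE pieces "un", "##believ", "##able": three
# subwords, but the word is represented by the embedding of "un" alone.
subwords = [[0.1], [0.2], [0.3], [0.4]]  # "The", "un", "##believ", "##able"
offsets = [0, 1]                         # word starts: "The", "unbelievable"
print(first_subword_embeddings(subwords, offsets))  # [[0.1], [0.2]]
```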

@jzhou316

jzhou316 commented Feb 5, 2020

Thanks for the clarification! That really helps.
