predict.py to work with raw text files #8
It's not strictly necessary to use the conllu predictor; it's just there for convenience during evaluation. All the logic for sentence input and prediction output can be found in predictor.py.
I added a new option for raw text input. Hope this helps. Let me know if you need any additional help.
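For reference, a minimal sketch of what driving the predictor on a raw text file might look like. Everything here is assumed: the `Predictor` class, its constructor, and the `predict_sentence` method are placeholders for whatever interface predictor.py actually exposes, so adjust the names to the repo's real API.

```python
# Hypothetical sketch: read a whitespace-tokenized text file (one sentence
# per line) and print word-level tag predictions. `Predictor` and
# `predict_sentence` are placeholder names, not the repo's actual API.
from predictor import Predictor  # assumed import; adapt to predictor.py


def predict_raw_file(path, model_path):
    predictor = Predictor(model_path)  # hypothetical constructor
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().split()  # pre-tokenized input
            if not tokens:
                continue
            tags = predictor.predict_sentence(tokens)  # hypothetical call
            print(" ".join(f"{tok}/{tag}" for tok, tag in zip(tokens, tags)))


if __name__ == "__main__":
    predict_raw_file("input.txt", "model.pt")
```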
great, thanks a lot!
Hi, I also found the option to directly input a raw text file for prediction very useful. Thanks for that! I have another small question related to tokenization and BPE encoding: if my text file is already tokenized (with my own tokenizer) and split by BPE, does it still work? And how are the BPE subwords handled in the word-level tag prediction?
The text should work if it is tokenized, though if you have already split it with BPE I'm not sure what will happen. You could either recombine the words so that they can be split by the model (easy), or modify the tokenizer code in the repo to bypass the BPE step (harder); see the sketch below. The BPE subword embeddings are all discarded except for the first subword, which is used to represent the information content of the whole word. In my experiments, I found no discernible difference between using the first, last, or average of all subword embeddings. This is also explained in the paper, and there's existing work that reports similar findings.
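To make the two ideas concrete, here is a small sketch. The recombination helper assumes the common `@@` continuation-marker convention (your BPE may use a different marker), and the pooling function only illustrates the first-subword selection described above on toy tensors; neither reflects the repo's actual code.

```python
import torch


def recombine_bpe(pieces, cont_marker="@@"):
    """Merge BPE pieces back into words, assuming the convention that a
    piece ending in `cont_marker` continues into the next piece."""
    words, buf = [], ""
    for p in pieces:
        if p.endswith(cont_marker):
            buf += p[: -len(cont_marker)]
        else:
            words.append(buf + p)
            buf = ""
    return words


def first_subword_embeddings(subword_embs, word_starts):
    """Keep one vector per word: the embedding of its first subword.
    subword_embs: (num_subwords, hidden) tensor;
    word_starts: index of the first subword of each word."""
    return subword_embs[torch.tensor(word_starts)]


# Toy example: "un@@ believ@@ able works" -> ["unbelievable", "works"]
pieces = ["un@@", "believ@@", "able", "works"]
print(recombine_bpe(pieces))        # ['unbelievable', 'works']

embs = torch.randn(len(pieces), 8)  # toy subword embeddings
word_starts = [0, 3]                # first-subword index of each word
print(first_subword_embeddings(embs, word_starts).shape)  # torch.Size([2, 8])
```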
Thanks for the clarification! That really helps.
Hello!
First of all, thank you for the research and the shared code; it's immensely helpful.
I wanted to know if there's an easy way to make predict.py work with raw text files, since this seems like the purpose of the architecture. Is there a reason my input files have to conform to the CoNLL-U format besides calculating evaluation metrics?