predict.py to work with .conllu files NOT annotated for dependencies? #26

Closed
lmompela opened this issue Oct 5, 2021 · 2 comments

lmompela commented Oct 5, 2021

Hi there,

I was wondering whether there is a way to use predict.py with my corpus data (.conllu), which is annotated for POS but not for dependencies. My goal at the moment is not to calculate evaluation metrics, but to have my pretrained model predict dependencies so that I get a head start on dependency annotation. I am working on an underdocumented language and would like a first pass of dependency predictions that I can then go back to, verify, and correct to create the gold standard for my language.

Is there a reason my input file has to conform to the CoNLL-U format, other than for evaluation metrics? My issue seems to be that my "head" and "deprel" columns are not integers but simply "_", because they are empty. I would prefer to keep the .conllu format for my input file, since it already contains POS information, which could give me better predictions.

Thank you for the research, it's super helpful, especially for underdocumented languages.

Here is my error message: [image]

@lmompela
Author

Nevermind, found a way around! Thanks

andidyer commented Feb 28, 2024

> Nevermind, found a way around! Thanks

I realize this was a while ago, but do you remember what your solution was? I am also trying to parse a file without annotated dependencies and am running into the same issue.

Maybe the solution is simply to fill each HEAD field with the token's index minus 1? That is what I did, and it worked. The format looked something like this:

# sent_id = 1
# text = "This is a sentence"
1    This    _    _    _    _    0    _    _    _
2    is    _    _    _    _    1    _    _    _
3    a    _    _    _    _    2    _    _    _
4    sentence    _    _    _    _    3    _    _    _

# sent_id = 2
# text = "This is another sentence"
1    This    _    _    _    _    0    _    _    _
2    is    _    _    _    _    1    _    _    _
3    another    _    _    _    _    2    _    _    _
4    sentence    _    _    _    _    3    _    _    _
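
For anyone who wants to automate the workaround above, here is a minimal Python sketch of it. It is not part of this repository: `fill_placeholder_heads` is a hypothetical helper name, and the sketch assumes a standard tab-separated CoNLL-U file with 10 columns, where HEAD is the 7th. It only writes dummy heads (token index minus 1) and leaves every other column, including existing POS tags, untouched.

```python
def fill_placeholder_heads(conllu_text: str) -> str:
    """Set HEAD (column 7) of each plain token line to its own index minus 1.

    Hypothetical helper: these heads are placeholders to make the file
    loadable, not real dependency annotations.
    """
    out_lines = []
    for line in conllu_text.splitlines():
        # Comment lines and sentence-separating blank lines pass through unchanged.
        if not line.strip() or line.startswith("#"):
            out_lines.append(line)
            continue
        cols = line.split("\t")
        token_id = cols[0]
        # Only plain integer IDs are rewritten; multiword ranges ("1-2")
        # and empty nodes ("1.1") are left exactly as they were.
        if len(cols) == 10 and token_id.isdigit():
            cols[6] = str(int(token_id) - 1)
        out_lines.append("\t".join(cols))
    return "\n".join(out_lines) + "\n"


if __name__ == "__main__":
    # Small in-memory example; the POS column (4th) is preserved as-is.
    sample = (
        "# sent_id = 1\n"
        "1\tThis\t_\tPRON\t_\t_\t_\t_\t_\t_\n"
        "2\tis\t_\tAUX\t_\t_\t_\t_\t_\t_\n"
        "\n"
    )
    print(fill_placeholder_heads(sample), end="")
```

To apply it to a file, read the file's text, pass it through the function, and write the result to a new file that you then give to predict.py. Whether DEPREL also needs a placeholder value is not something this thread confirms, so the sketch leaves it as "_", as in the example above.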
