Document what tokenisation was used for the offered models #47

Open
jowagner opened this issue Feb 15, 2019 · 2 comments

Comments

@jowagner

Closed issue #45 indicates that UDPipe was used, and __main__.py suggests that you use the expanded form of CoNLL-U multiword tokens, e.g. the 2 tokens "de le" instead of "du" in French. The readme should mention both.

@jowagner
Author

However, the config.json of a downloaded model suggests that the model was not trained on a conllu file: "train_path": "/users4/conll18st/raw_text/Czech/cs-20m.raw". Are there historical reasons for this, i.e. was the conllu input format only added to elmoformanylangs later, and did you use an external conllu-to-raw converter at the time?

@Oneplus
Member

Oneplus commented Feb 19, 2019

was conllu input format only added later to elmoformanylangs and you used an external conllu-to-raw converter at the time?

cs-20m.raw was produced by an external conllu-to-raw script. The original data can be found at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989, and it was preprocessed with UDPipe.
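
For readers who need to match this preprocessing, here is a minimal sketch of what such a conllu-to-raw conversion could look like. It is not the script that was actually used for the released models. The function name `conllu_to_raw` and the sample sentence are hypothetical, and the choice to keep the expanded words ("de le") rather than the multiword surface token ("du") is an assumption based on the behaviour described for __main__.py above.

```python
from typing import Iterable, Iterator, List


def conllu_to_raw(lines: Iterable[str]) -> Iterator[str]:
    """Yield one whitespace-tokenised sentence per CoNLL-U sentence block.

    Hypothetical helper: keeps the expanded syntactic words and drops
    multiword-token range lines (ID like "3-4") and empty nodes ("8.1").
    """
    words: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if words:
                yield " ".join(words)
                words = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a well-formed token line
        token_id, form = fields[0], fields[1]
        if "-" in token_id or "." in token_id:
            # Assumption: use the expanded form, so skip "3-4 du" itself.
            continue
        words.append(form)
    if words:
        # Handle a final sentence that is not followed by a blank line.
        yield " ".join(words)


# Hypothetical French example with the multiword token "du" = "de" + "le".
_REST = "\t_" * 8  # the eight remaining CoNLL-U columns, left unspecified
SAMPLE = "\n".join([
    "# text = Il parle du livre",
    "1\tIl" + _REST,
    "2\tparle" + _REST,
    "3-4\tdu" + _REST,
    "3\tde" + _REST,
    "4\tle" + _REST,
    "5\tlivre" + _REST,
    "",
])

if __name__ == "__main__":
    for sentence in conllu_to_raw(SAMPLE.splitlines()):
        print(sentence)
```

Text fed to the released models would need the same treatment, i.e. UDPipe tokenisation with multiword tokens expanded, if the assumption above holds.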
