Document what tokenisation was used for the offered models #47

jowagner · 2019-02-15T12:29:09Z

Closed issue #45 indicates that udpipe was used, and __main__.py suggests that you use the expanded form for CoNLL-U multiword tokens, e.g. the two tokens "de le" instead of "du" in French. The README should mention both.


jowagner · 2019-02-17T15:24:54Z

However, the config.json of a downloaded model suggests that the model was not trained on a conllu file: "train_path": "/users4/conll18st/raw_text/Czech/cs-20m.raw". Are there historical reasons for this, i.e. was the conllu input format only added to elmoformanylangs later, and did you use an external conllu-to-raw converter at the time?

Oneplus · 2019-02-19T00:38:00Z

> was conllu input format only added later to elmoformanylangs and you used an external conllu-to-raw converter at the time?

cs-20m.raw was produced by an external conllu-to-raw script. The original data can be found at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989 and was preprocessed with udpipe.
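
For readers who want to reproduce this kind of input, here is a minimal sketch of a conllu-to-raw conversion. It is not the script that was actually used for cs-20m.raw, which is not shown in this thread. The function name `conllu_to_sentences` and the `expand_multiword` flag are hypothetical, and the sample sentence is made up for illustration. The sketch assumes standard CoNLL-U input, where a multiword token such as "du" appears on a range line (ID `3-4`) followed by its syntactic words "de" and "le".

```python
def conllu_to_sentences(conllu_text, expand_multiword=True):
    """Return a list of sentences, each a list of token strings.

    expand_multiword=True keeps the syntactic words ("de", "le");
    expand_multiword=False keeps the surface token ("du") instead.
    """
    sentences, tokens = [], []
    skip_until = 0  # last word ID covered by a surface multiword token
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if tokens:
                sentences.append(tokens)
            tokens, skip_until = [], 0
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        fields = line.split("\t")
        token_id, form = fields[0], fields[1]
        if "." in token_id:
            continue  # empty node (enhanced dependencies), not a token
        if "-" in token_id:
            # Range line of a multiword token, e.g. "3-4  du".
            if not expand_multiword:
                tokens.append(form)
                skip_until = int(token_id.split("-")[1])
            continue
        if int(token_id) <= skip_until:
            continue  # word already covered by the surface token
        tokens.append(form)
    if tokens:
        sentences.append(tokens)
    return sentences


# Made-up sample sentence containing the French contraction "du".
SAMPLE = "\n".join([
    "# text = Je viens du marché",
    "1\tJe\tje\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tviens\tvenir\tVERB\t_\t_\t0\troot\t_\t_",
    "3-4\tdu\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tde\tde\tADP\t_\t_\t5\tcase\t_\t_",
    "4\tle\tle\tDET\t_\t_\t5\tdet\t_\t_",
    "5\tmarché\tmarché\tNOUN\t_\t_\t2\tobl\t_\t_",
])

for sentence in conllu_to_sentences(SAMPLE):
    print(" ".join(sentence))  # one whitespace-tokenised sentence per line
```

Joining each returned sentence with spaces and writing one sentence per line gives a raw text file. Whether the offered models saw expanded or surface forms is exactly what the README should state, so treat the default of `expand_multiword=True` as an assumption based on the first comment in this thread.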
