Is this ready to use? #2

Open
kuchenrolle opened this issue Oct 22, 2019 · 7 comments

Comments

@kuchenrolle

Hi. I am working on Polish and was looking for a parser and just found this. So I was wondering - is this ready to be used? Is there any information on how well the components perform, in particular the dependency parser? Otherwise, could you recommend one of the existing tools? (:

@ryszardtuora
Collaborator

The model is usable right now and achieves quite good results, we will publish the precise evaluation metrics shortly. You can download it here: http://zil.ipipan.waw.pl/SpacyPL
The parser itself does fine as long as you process one sentence at a time. This limitation is an artifact of Polish dependency treebanks (so the evaluation procedures we've used are not sensitive to it), and we will fix it in the next version, which will come out in a matter of days. Some further improvements to both the parser and other components will be included, e.g. we plan to switch to PDB for more robust dependency analyses.
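
One way to follow the one-sentence-at-a-time advice is to split the text yourself and hand each sentence to the parser separately. This is a minimal sketch, not part of the model: the regex splitter is a naive assumption, and `parse` is a hypothetical callable standing in for whatever pipeline object you have loaded.

```python
import re
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Naive splitter (assumption for illustration): break on whitespace that
# follows ., ! or ?. Abbreviations such as "np." will be split wrongly.
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Return the non-empty sentence strings found in `text`."""
    return [s.strip() for s in _SENT_END.split(text) if s.strip()]


def parse_sentencewise(text: str, parse: Callable[[str], T]) -> List[T]:
    """Call `parse` once per sentence instead of once per document."""
    return [parse(sentence) for sentence in split_sentences(text)]
```

With a loaded spaCy pipeline `nlp`, the call would be `parse_sentencewise(text, nlp)`, which returns one parsed document per sentence.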

@kuchenrolle
Author

That sounds good, can't wait to see the results. Two questions:

  1. Since you're using the Morfeusz dictionary, wouldn't it make sense to run their analyzer as part of the pipeline and get the full morphological tags?
  2. I've been playing around with the model, but I got some weird results. For example, when I look at the dependency parse for "W starożytności tereny obecnych Niemiec zamieszkiwały plemiona Celtów.", the verb has two subjects assigned, rather than a subject and an object. Obviously there will be mistakes all the time, but not assigning multiple subjects seems like a hard constraint for a parser, no?

@ryszardtuora
Collaborator

  1. We do have a version which uses Morfeusz in the pipeline, and it should also be available shortly. We have something like this in mind, but the issue with using it for full morphological tags is that Morfeusz supplies multiple analyses, and these have to be disambiguated based on context. At the moment this is done by the ordinary spaCy tagger model, but this model cannot differentiate between analyses that differ in features other than POS tags. We have yet to experiment with this and see whether we can achieve reliable results.

  2. I've tried parsing the sentence you mentioned, and got the same results. The parser is based on a neural network, and so we can never be certain that absurdities like these will not pop up, but it is indeed strange, as I would expect that no sentences in the training dataset contain more than one nsubj. As I've said, the parser will be substantially modified in the upcoming version, and I hope mistakes like these will no longer appear. Thank you for your findings, if you do stumble upon something else that worries you, please do contact us.
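
The single-subject constraint discussed above can be checked after parsing. The sketch below flags heads that received more than one `nsubj` dependent. The `(form, head_index, label)` tuple layout is an assumption made for illustration, not the model's output format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (form, head_index, dependency_label); head_index is the 0-based position
# of the token's head. This layout is an assumption for illustration.
Token = Tuple[str, int, str]


def heads_with_multiple_subjects(tokens: List[Token]) -> Dict[int, List[int]]:
    """Map each head index to its nsubj dependents when there is more than one."""
    subjects = defaultdict(list)
    for index, (_form, head, dep) in enumerate(tokens):
        if dep == "nsubj":
            subjects[head].append(index)
    # Keep only heads that violate the one-subject expectation.
    return {head: deps for head, deps in subjects.items() if len(deps) > 1}
```

For a spaCy `Doc`, the triples could be built from `token.text`, `token.head.i`, and `token.dep_`. A check like this only reports violations; it does not repair the parse.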

@kuchenrolle
Author

We have some manually annotated data that we will compare the parser against, so I'll report if we find something that seems off. Please notify me (here) once the new model is out. (:

@ryszardtuora
Collaborator

The new version came out, please see here for details.
Unfortunately (despite changing the training corpus and optimizing hyperparameters some more) we were not yet able to fix the issue with the sentence you've mentioned. I've tried altering the sentence in some minor ways (e.g. parsing "W starożytności tereny Niemiec kontrolowały plemiona Celtów" and other variants) and still got two subjects in the analysis.

The structure of the sentence might be a little difficult, but as I've said, all examples should have only one nsubj so I would rather expect the errors to look different (e.g. 'Celtów' becoming an obj or something like that). We're experimenting with hyperparameters some more, and the results are promising, maybe the issue will disappear in the next version (which should be available shortly).

Some of the problems with the parser must be attributed to the way tokenization is handled in training. The default tokenizer is used (substituting Morfeusz here would not really be ideal, as it would make reproducible training substantially harder) because -G is now bugged. This clashes with our training resources, which follow a different tokenization methodology. Still, even after Explosion fixes it, -G would leave the parser unable to split documents into sentences, because it forces documents to be one sentence long. We will keep you updated on our work on this problem.

With respect to Morfeusz, the pl_spacy_model_morfeusz version now supports extended tags. The way it works now is that we append those features, and only those features, on which all interpretations agree. Full morphosyntactic tagging is also something we want to do in the near future.
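
The "only the features on which all interpretations agree" rule amounts to an intersection over analyses. This is a minimal sketch of that idea, not the code used in pl_spacy_model_morfeusz: representing each interpretation as a feature dictionary, and the feature names used, are assumptions for illustration.

```python
from typing import Dict, List


def agreed_features(analyses: List[Dict[str, str]]) -> Dict[str, str]:
    """Keep only the feature/value pairs shared by every analysis."""
    if not analyses:
        return {}
    first, rest = analyses[0], analyses[1:]
    return {
        name: value
        for name, value in first.items()
        # A feature survives only if every other analysis has the same value.
        if all(other.get(name) == value for other in rest)
    }


# Two hypothetical interpretations that differ only in case.
interpretations = [
    {"number": "sg", "case": "nom", "gender": "f"},
    {"number": "sg", "case": "acc", "gender": "f"},
]
print(agreed_features(interpretations))
```

Here number and gender are kept, while case is dropped because the interpretations disagree on it.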

Please let us know if anything pops up!

@kuchenrolle
Author

Thanks for the update. So far, our results have been somewhat discouraging, as we get fairly low agreement between our manual annotation and what I'm extracting using the parser, though that is still work in progress. The sentences we use are from Wikipedia, so the one cited above is actually on the simpler side of things; maybe the treebank(s) do not reflect that kind of structure. I will have a look at the new version tomorrow, hopes are up in any case! (:
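
The thread does not say how agreement is measured. One common choice for dependency parses is attachment score: the share of tokens with the correct head (UAS) and with the correct head and label (LAS). The sketch below assumes both analyses use the same tokenization and represents each token as a hypothetical `(head_index, label)` pair.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head_index, dependency_label) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for two analyses of the same tokens."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0, 0.0
    # UAS counts matching heads; LAS requires head and label to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / len(gold), full_hits / len(gold)
```

If the manual annotation and the parser tokenize differently, the tokens have to be aligned first, which this sketch does not attempt.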

@lkobylinski
Contributor

While we work on this spaCy model, you may also take a look at these tools and models, which may give better results at the moment.
http://zil.ipipan.waw.pl/PDB/PDBparser


3 participants