Rules Improvement for French #38

Merged
merged 7 commits into msg-systems:development on Feb 11, 2022


Pantalaymon
Contributor

Hello,

As I will be using coreferee in a new project, I am still working on improving the rules.

I added a few more rules in lang/fr/language_rules.py, as well as a few tests in tests/fr to make sure they work as expected.
There are also some edits to the lang/fr/data files, which are used by the rules.

Regarding the new rules: I don't know whether you plan to use the same rules for the spaCy-native solution you are developing, but I wanted to point out that, on top of the language-specific rules for noun/anaphora - anaphora pairs, the system would greatly benefit from language-specific rules for noun - noun coreferring pairs. For instance, such rules could prevent a singular named entity (say, John Doe) from coreferring with a plural noun (say, the people) or with a gender-incompatible noun.
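
As an illustration of the kind of noun - noun filter meant here, a minimal sketch might look like the following. It is not taken from coreferee: the `Mention` class and the `may_corefer` function are hypothetical names, and the number and gender values are assumed to have been extracted beforehand (for example from morphological features).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    """Hypothetical container for a noun mention; not a coreferee class."""
    text: str
    number: Optional[str] = None  # "Sing", "Plur", or None when unknown
    gender: Optional[str] = None  # "Masc", "Fem", or None when unknown


def _compatible(a: Optional[str], b: Optional[str]) -> bool:
    # Unknown values never block a pairing; only explicit conflicts do.
    return a is None or b is None or a == b


def may_corefer(first: Mention, second: Mention) -> bool:
    """Return False when two noun mentions disagree in number or gender."""
    return _compatible(first.number, second.number) and _compatible(
        first.gender, second.gender
    )


john = Mention("John Doe", number="Sing", gender="Masc")
people = Mention("the people", number="Plur")
man = Mention("the man", number="Sing", gender="Masc")

print(may_corefer(john, people))  # False: singular vs. plural
print(may_corefer(john, man))     # True: no conflict
```

Treating unknown features as compatible keeps the filter conservative: it only rejects pairs with an explicit conflict, which seems the safer default when morphological analysis is incomplete.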

@Pantalaymon
Contributor Author

Hi @richardpaulhudson,
Are you still maintaining coreferee?
It would be really desirable for the latest French version to be updated, as these last commits fix a major issue with the output.

@richardpaulhudson
Collaborator

Hi @Pantalaymon, thank you very much for this and please accept my sincere apologies for taking so unacceptably long to get back to you. Coreferee is still being maintained and will still be maintained in the future; with me having changed employers I seem to have missed the original PR notification in December.

I am currently doing experiments into ways of improving the accuracy specifically for English. The most likely outcome — although this is by no means set in stone — is that we will end up implementing a new library for English coreference. Coreferee will definitely still be supported for the other languages and it may well be that the results of the experiments point to some cross-language improvements that can be made to Coreferee as well.

Your suggestion to implement rules to filter noun-noun coreference sounds like a very good idea and I shall definitely look into this further.

Two questions about this PR:

  • Are the rules designed to be used with the existing model? Would it make sense to generate a new model?
  • If the rules improve the accuracy, would it make sense to specify the improved accuracy in https://github.com/msg-systems/coreferee#142-model-performance? (At the same time, I can see there is no easy way of measuring the new accuracy if we decide not to generate a new model.)

@Pantalaymon
Contributor Author

Hi @richardpaulhudson,

Very interesting. So it would be a new library, independent from base spaCy?

Regarding my suggestion, I think it partly exceeds the original focus of coreferee, which was anaphora resolution, since noun-noun pairing operates mostly at a cross-language level and through a rule-based system. However, if you really plan to start from this project to implement a larger, multi-language coreference resolution solution for spaCy, I am 100% convinced that language-specific rules for noun-noun coreference would be worth designing.

Regarding your questions:

  • I have not retrained the model at all, so yes, the rules work with the existing model. However, I think I have slightly modified the rules for mention definition (independent noun and anaphora), so retraining the model might result in better accuracy... or not. I am not sure whether the noun-noun pairing rules affect the training of the neural ensemble at all; if they do, I will definitely retrain it and compare the results when I have time.
  • Well, I'm not sure, since that table is about the accuracy of the neural ensemble on potential anaphoric pairs, if I'm not mistaken, and not about whole coreference chains.

By the way, regarding the evaluation of whole coreference chains, I have been able to evaluate the tool for French with more usual metrics here, using the CONLL format. The results are not so good, for the reasons explained below, but still OK.
I think the same method could be used to evaluate the other languages supported by coreferee, provided the corpus is converted to CONLL. Only a few adaptations for each language (namely the separators in the CONLL loader and the dependencies to exclude when building mention phrases) would then be required before running the coreference resolution scorer.
Depending on the genres in the test corpora, it could yield better results than what I had for French.
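
For readers unfamiliar with the format, the conversion step can be sketched roughly as follows. This is only an illustration and not the code used for the evaluation mentioned above: it assumes the CoNLL-2012-style bracket notation for the coreference column, and `coref_column` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import List, Tuple


def coref_column(n_tokens: int, mentions: List[Tuple[int, int, int]]) -> List[str]:
    """Build a CoNLL-style coreference column for one document.

    Each mention is (start_token, end_token_inclusive, chain_id).
    Tokens outside any mention boundary get "-".
    """
    cells = defaultdict(list)
    for start, end, chain in mentions:
        if start == end:
            # Single-token mention: opened and closed on the same token.
            cells[start].append(f"({chain})")
        else:
            cells[start].append(f"({chain}")
            cells[end].append(f"{chain})")
    return ["|".join(cells[i]) if i in cells else "-" for i in range(n_tokens)]


# "Marie dit qu' elle viendra ." with "Marie" and "elle" in chain 0
print(coref_column(6, [(0, 0, 0), (3, 3, 0)]))
# ['(0)', '-', '-', '(0)', '-', '-']
```

A column produced this way for both the gold and the predicted chains is what a CoNLL-style scorer would typically compare; the token separators and the mention-span boundaries remain the language-specific parts.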

@richardpaulhudson merged commit 5ffaa37 into msg-systems:development on Feb 11, 2022