Parsing errors, version 5.1.1, French #31

Open
abalvet opened this issue Apr 18, 2018 · 1 comment

abalvet commented Apr 18, 2018

"Les poules du couvent couvent." should return:
1 Les le DET det p 2 det _ _
2 poules poule NC nc fp 5 suj _ _
3 du de P+D P+D ms 2 dep _ _
4 couvent couvent NC nc ms 3 obj _ _
5 couvent couver V v PS3p 0 root _ _
6 . . PONCT PONCT null 5 punct _ _

This is what I get:

1 Les les DET DET n=p 2 det 2 det
2 poules poule NC NC n=p|g=f 0 _ 0 _
3 du de P+D P+D n=s|g=m 2 dep 2 dep
4 couvent couvent NC NC n=s|g=m 3 prep 3 prep
5 couvent couvent NC NC n=s|g=m 4 mod 4 mod <= VERY wrong here due to tagging error
6 . . PONCT PONCT 5 ponct 5 ponct

This is not the only parsing issue. Talismane doesn't seem to distinguish indirect objects from PP modifiers. It also attaches the final punctuation to the last token, while Malt attaches it to the last verbal root.
Talismane seems to systematically attach the NP to the last preposition, labelling its function as "prep" (prepositional object?), while Malt seems to systematically treat prepositions as modifiers of the last verbal root, with the head noun labelled "obj". Malt doesn't seem to distinguish indirect objects from PP modifiers either. Maybe this comes from the dependency analyses of the FTB?
Sometimes, the form of the preposition seems to have some influence on the function of the head noun. For example, "à la directrice" and "au directeur" get different attachments from both parsers in the comparison below.
Talismane, as well as Malt, does not distinguish transitive verbs from intransitive ones: in "Maurice dort le matin", "le matin" should be tagged "mod", not "obj".

Here is a side-by-side comparison with Malt (MaltParser 1.9.2 + fremalt-1.7.mco) on a set of very simple sentences:
ID TOKEN LEMMA TAG_RED TAG_EXT ID_REL REL ID_REL REL
(first ID_REL/REL pair: TALISMANE; second ID_REL/REL pair: MALT)
1 Maurice Maurice NPP NPP 2 suj 2 suj
2 accorde accorder V V 0 root 0 root
3 sa sa DET DET 4 det 4 det
4 guitare guitare NC NC 2 obj 2 obj
5 à à P P 2 mod 2 mod
6 l' le DET DET 7 det 7 det
7 oreille oreille NC NC 5 prep 5 obj
8 . . PONCT PONCT 7 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 accorde accorder V V 0 root 0 root
3 sa sa DET DET 4 det 4 det
4 confiance confiance NC NC 2 obj 2 obj
5 à à P P 2 mod 4 dep
6 la la DET DET 7 det 7 det
7 directrice directrice NC NC 5 prep 5 obj
8 . . PONCT PONCT 7 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 accorde accorder V V 0 root 0 root
3 sa sa DET DET 4 det 4 det
4 confiance confiance NC NC 2 obj 2 obj
5 au à P+D P+D 4 dep 2 mod
6 directeur directeur NC NC 5 prep 5 obj
7 . . PONCT PONCT 6 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 achète acheter V V 0 root 0 root
3 une une DET DET 4 det 4 det
4 voiture voiture NC NC 2 obj 2 obj
5 à à P P 2 mod 2 mod
6 sa sa DET DET 7 det 7 det
7 femme femme NC NC 5 prep 5 obj
8 . . PONCT PONCT 7 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 achète acheter V V 0 root 0 root
3 une une DET DET 4 det 4 det
4 voiture voiture NC NC 2 obj 2 obj
5 pour pour P P 2 mod 2 mod
6 sa sa DET DET 7 det 7 det
7 femme femme NC NC 5 prep 5 obj
8 . . PONCT PONCT 7 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 achète acheter V V 0 root 0 root
3 une une DET DET 4 det 4 det
4 voiture voiture NC NC 2 obj 2 obj
5 à à P P 2 mod 2 mod
6 la la DET DET 7 det 7 det
7 représentante représentant NC NC 5 prep 5 obj
8 . . PONCT PONCT 7 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 achète acheter V V 0 root 0 root
3 une une DET DET 4 det 4 det
4 voiture voiture NC NC 2 obj 2 obj
5 à à P P 2 mod 2 mod
6 la la DET DET 7 det 7 det
7 sauvette sauvette NC NC 5 prep 5 obj
8 . . PONCT PONCT 7 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 achète acheter V V 0 root 0 root
3 une une DET DET 4 det 4 det
4 voiture voiture NC NC 2 obj 2 obj
5 à à P P 2 mod 2 mod
6 l' le DET DET 7 det 7 det
7 étranger étranger NC NC 5 prep 5 obj
8 . . PONCT PONCT 7 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 dort dormir V V 0 root 0 root
3 le le DET DET 4 det 4 det
4 matin matin NC NC 2 obj 2 obj
5 jusqu' jusque P P 2 mod 2 mod
6 à à P P 5 prep 5 obj
7 10 10 ADJ ADJ 8 mod 8 mod
8 heures heure NC NC 6 prep 6 obj
9 . . PONCT PONCT 8 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 dort dormir V V 0 root 0 root
3 dans dans P P 2 mod 2 mod
4 son son DET DET 5 det 5 det
5 jardin jardin NC NC 3 prep 3 obj
6 . . PONCT PONCT 5 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 respire respirer V V 0 root 0 root
3 la la DET DET 4 det 4 det
4 santé santé NC NC 2 obj 2 obj
5 . . PONCT PONCT 4 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 respire respirer V V 0 root 0 root
3 la la DET DET 5 det 5 det
4 bonne bon ADJ ADJ 5 mod 5 mod
5 odeur odeur NC NC 2 obj 2 obj
6 du de P+D P+D 5 dep 5 dep
7 gâteau gâteau NC NC 6 prep 6 obj
8 . . PONCT PONCT 7 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 sent sentir V V 0 root 0 root
3 la la DET DET 4 det 4 det
4 bière bière NC NC 2 obj 2 obj
5 . . PONCT PONCT 4 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 donne donner V V 0 root 0 root
3 un un DET DET 4 det 4 det
4 livre livre NC NC 2 obj 2 obj
5 à à P P 2 mod 2 mod
6 son son DET DET 7 det 7 det
7 frère frère NC NC 5 prep 5 obj
8 . . PONCT PONCT 7 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 donne donner V V 0 root 0 root
3 un un DET DET 4 det 4 det
4 livre livre NC NC 2 obj 2 obj
5 à à P P 2 mod 2 mod
6 lire lire VINF VINF 5 prep 5 obj
7 à à P P 6 mod 6 mod
8 son son DET DET 9 det 9 det
9 frère frère NC NC 7 prep 7 obj
10 . . PONCT PONCT 9 ponct 2 ponct

1 Maurice Maurice NPP NPP 2 suj 2 suj
2 donne donner V V 0 root 0 root
3 un un DET DET 4 det 4 det
4 livre livre NC NC 2 obj 2 obj
5 à à P P 2 mod 2 mod
6 son son DET DET 7 det 7 det
7 frère frère NC NC 5 prep 5 obj
8 à à P P 2 mod 2 mod
9 la la DET DET 10 det 10 det
10 fin fin NC NC 8 prep 8 obj
11 de de P P 2 mod 10 dep
12 la la DET DET 13 det 13 det
13 journée journée NC NC 11 prep 11 obj
14 . . PONCT PONCT 13 ponct 2 ponct

1 À à P P 9 mod 9 mod
2 la la DET DET 3 det 3 det
3 fin fin NC NC 1 prep 1 obj
4 de de P P 3 dep 3 dep
5 la la DET DET 6 det 6 det
6 journée journée NC NC 4 prep 4 obj
7 , , PONCT PONCT 6 ponct 9 ponct
8 Maurice Maurice NPP NPP 9 suj 9 suj
9 donne donner V V 0 root 0 root
10 un un DET DET 11 det 11 det
11 livre livre NC NC 9 obj 9 obj
12 à à P P 9 mod 9 mod
13 son son DET DET 14 det 14 det
14 frère frère NC NC 12 prep 12 obj
15 . . PONCT PONCT 14 ponct 9 ponct
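
The rows above can also be compared mechanically. The following is a minimal Python sketch, not part of either tool: it assumes each row has exactly the nine whitespace-separated columns of the header, and the names `Row`, `parse_row` and `head_disagreements` are made up for illustration. It lists the tokens whose head differs between Talismane and Malt, skipping punctuation, since the two parsers are already known to attach it differently.

```python
from typing import List, NamedTuple, Tuple


class Row(NamedTuple):
    idx: int
    token: str
    tag: str
    talismane: Tuple[int, str]  # (head id, relation label)
    malt: Tuple[int, str]       # (head id, relation label)


def parse_row(line: str) -> Row:
    # Assumed column order:
    # ID TOKEN LEMMA TAG_RED TAG_EXT ID_REL REL ID_REL REL
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, got {len(f)}: {line!r}")
    return Row(int(f[0]), f[1], f[3], (int(f[5]), f[6]), (int(f[7]), f[8]))


def head_disagreements(lines: List[str], skip_tags=("PONCT",)) -> List[Row]:
    """Return rows where the two parsers chose different heads."""
    rows = [parse_row(line) for line in lines if line.strip()]
    return [r for r in rows
            if r.tag not in skip_tags and r.talismane[0] != r.malt[0]]


# One sentence copied from the comparison above.
SAMPLE = """\
1 Maurice Maurice NPP NPP 2 suj 2 suj
2 accorde accorder V V 0 root 0 root
3 sa sa DET DET 4 det 4 det
4 confiance confiance NC NC 2 obj 2 obj
5 à à P P 2 mod 4 dep
6 la la DET DET 7 det 7 det
7 directrice directrice NC NC 5 prep 5 obj
8 . . PONCT PONCT 7 ponct 2 ponct
"""

for row in head_disagreements(SAMPLE.splitlines()):
    print(row.idx, row.token, "Talismane:", row.talismane, "Malt:", row.malt)
```

For this sentence the only row reported is token 5 ("à"), which Talismane attaches to the verb and Malt to "confiance". Label differences such as "prep" versus "obj" are deliberately ignored here because they appear on every prepositional object.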

urieli (Collaborator) commented Apr 23, 2018

You report several issues here. In the future, it would be better to open a separate issue for each one.

I'll try to tackle them one at a time, but first of all, a general comment: Talismane and Malt are both based on supervised machine learning, and can only reproduce what they find in their training corpora. Because of copyright issues, we cannot (unfortunately) publish our training corpora. Most of these issues would be far better handled in a discussion of dependency annotation guidelines for French specifically targeting a "gold" standard corpus, as they have little to do with Talismane as software.

Issue 1: "Les poules du couvent couvent."
This can be fixed by downloading v5.1.2 and running the command with a higher beam width (option --beamWidth=5).
The beam width is a trade-off between parsing speed and parsing accuracy: the higher the beam width, the slower but more accurate the parse.
At beam width = 1 (the default), the parser has to take the first option produced by the pos-tagger, and is therefore highly sensitive to pos-tagger errors. At higher beam widths, the parser selects among several options produced by the pos-tagger.

Note that higher beam widths had a bug in v5.1.1 which was fixed in v5.1.2.
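
To illustrate why the beam width matters for this sentence, here is a toy Python sketch. It is not Talismane's implementation: the tag probabilities, the `parse_score` heuristic (a sentence with no verb is penalised) and all function names are invented for the example. It only shows the general mechanism: at beam width 1 the verb reading of the second "couvent" is discarded before the parser sees it, while at a higher beam width it survives and can be chosen.

```python
# Hypothetical pos-tagger output for "Les poules du couvent couvent ."
# Each token maps candidate tags to made-up probabilities.
CANDIDATES = [
    {"DET": 1.0},
    {"NC": 1.0},
    {"P+D": 1.0},
    {"NC": 0.9, "V": 0.1},
    {"NC": 0.6, "V": 0.4},   # the tagger slightly prefers the wrong tag
    {"PONCT": 1.0},
]


def beam_tag(candidates, beam_width):
    """Keep the `beam_width` best tag sequences after each token."""
    beam = [((), 1.0)]
    for options in candidates:
        expanded = [(tags + (tag,), score * p)
                    for tags, score in beam
                    for tag, p in options.items()]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:beam_width]
    return beam


def parse_score(tags):
    # Invented stand-in for the parser: a sentence without a verb
    # is assumed to be much harder to parse well.
    return 1.0 if "V" in tags else 0.1


def best_analysis(candidates, beam_width):
    """Let the 'parser' pick among the tag sequences left in the beam."""
    beam = beam_tag(candidates, beam_width)
    return max(beam, key=lambda item: item[1] * parse_score(item[0]))[0]


for width in (1, 2, 5):
    print(width, best_analysis(CANDIDATES, width))
```

With these invented numbers, width 1 keeps NC for the second "couvent", while widths 2 and 5 select V.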

Issue 2: Talismane doesn't seem to distinguish indirect objects from PP modifiers.
The examples you give show that this is not systematically true. My assumption is that this can only be corrected by feeding Talismane with many more correctly annotated examples during training.

Issue 3: Talismane attaches the final punctuation to the last token, while Malt attaches the final punctuation to the last verbal root.
In the FTB, punctuation is attached haphazardly, and there is nothing in the annotation guide to indicate where it should be attached. Talismane makes the simplifying assumption that punctuation should be systematically attached to the previous non-punctuation token. I currently see no reason to consider this wrong.
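
The rule described here can be sketched in a few lines of Python. This is an illustration of the stated rule only, not code from Talismane. The function name `punctuation_heads` is hypothetical, and returning head 0 when no non-punctuation token precedes is an assumption not covered by the description.

```python
def punctuation_heads(tags, punct_tag="PONCT"):
    """Map each punctuation token (1-based id) to the id of the
    previous non-punctuation token."""
    heads = {}
    last_non_punct = 0  # assumption: 0 if nothing precedes
    for idx, tag in enumerate(tags, start=1):
        if tag == punct_tag:
            heads[idx] = last_non_punct
        else:
            last_non_punct = idx
    return heads


# Tags of "À la fin de la journée, Maurice donne un livre à son frère."
TAGS = ["P", "DET", "NC", "P", "DET", "NC", "PONCT",
        "NPP", "V", "DET", "NC", "P", "DET", "NC", "PONCT"]

print(punctuation_heads(TAGS))
```

On this sentence the sketch attaches token 7 to token 6 and token 15 to token 14, which matches the Talismane column in the comparison above.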

Issue 4: Talismane seems to systematically attach the NP to the last Prep, tagging its function as "prep"
See https://github.com/joliciel-informatique/talismane/blob/master/talismane_core/languagePacks/french/languagePack/talismaneDependencyLabels_fr.txt#L41

Issue 5: Talismane, as well as Malt, does not distinguish transitive verbs from intransitive ones
I agree with your analysis, and can only assume additional training examples would correct this.
