Inconsistencies in dataset annotations #9

Open
janpf opened this issue Sep 30, 2021 · 19 comments

@janpf commented Sep 30, 2021

Hi!
I've just found an inconsistency in the Darmstadt dataset dev split. I haven't checked whether this also occurs in different datasets or in different splits.

Two back-to-back examples in the dev split look like this:

{
  "sent_id": "DeVry_University_95_05-16-2004-6",
  "text": "I can't overemphasize that enough .",
  "opinions": [
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "that"
        ],
        [
          "22:26"
        ]
      ],
      "Polar_expression": [
        [
          "can't overemphasize enough"
        ],
        [
          "2:33"
        ]
      ],
      "Polarity": "Positive",
      "Intensity": "Strong"
    }
  ]
},
{
  "sent_id": "DeVry_University_95_05-16-2004-7",
  "text": "The school gives students a knowledge base that makes them extremely competitive in the corporate world .",
  "opinions": [
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "students"
        ],
        [
          "17:25"
        ]
      ],
      "Polar_expression": [
        [
          "extremely",
          "competitive"
        ],
        [
          "59:68",
          "69:80"
        ]
      ],
      "Polarity": "Positive",
      "Intensity": "Strong"
    }
  ]
},

Usually the datapoints are handled like in the second sentence: polar expressions (as well as source and target fields, for that matter) are whitespace-separated, even if the words are directly back-to-back. In the first sentence, though, the whole polar expression is listed as a single string, and its span ("2:33") even includes the target word ("that", "22:26"), which is not present in the string ("can't overemphasize enough").
I'm actually unsure whether this issue stems from the provided preprocessing function or from the underlying dataset.
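
For reference, here is a minimal sketch of the kind of check that surfaces these cases. It assumes the JSON structure shown above; the file path and function name are illustrative, not part of the repository.

    import json

    def check_spans(sentence):
        """Yield a message for every Source/Target/Polar_expression string
        that does not match the text selected by its "start:end" offsets."""
        text = sentence["text"]
        for opinion in sentence["opinions"]:
            for field in ("Source", "Target", "Polar_expression"):
                strings, spans = opinion[field]
                # If the offsets are missing entirely, zip() simply yields nothing.
                for s, span in zip(strings, spans):
                    start, end = (int(n) for n in span.split(":"))
                    if s != text[start:end]:
                        yield f'{sentence["sent_id"]}: {field} "{s}" != "{text[start:end]}" ({span})'

    # Example usage; the path is hypothetical.
    with open("darmstadt_unis/dev.json") as f:
        for sent in json.load(f):
            for message in check_spans(sent):
                print(message)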

I also noticed that for this example sentence both splitting methods are applied to the Polar_expression field:

{
  "sent_id": "Capella_University_50_12-09-2005-3",
  "text": "I have found the course work and research more challenging and of higher quality at Capella than at any of the other institutions I graduated from .",
  "opinions": [
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "course work research"
        ],
        [
          "17:41"
        ]
      ],
      "Polar_expression": [
        [
          "higher quality"
        ],
        [
          "66:80"
        ]
      ],
      "Polarity": "Positive",
      "Intensity": "Average"
    },
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "course work research"
        ],
        [
          "17:41"
        ]
      ],
      "Polar_expression": [
        [
          "more",
          "challenging"
        ],
        [
          "42:46",
          "47:58"
        ]
      ],
      "Polarity": "Positive",
      "Intensity": "Strong"
    }
  ]
},

Sometimes the indices for the Polar_expression strings are also missing:

{
  "sent_id": "St_Leo_University_4_04-16-2004-5",
  "text": "The teachers are very helpful , and the staff is , as well .",
  "opinions": [
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "teachers"
        ],
        [
          "4:12"
        ]
      ],
      "Polar_expression": [
        [
          "very",
          "helpful"
        ],
        []
      ],
      "Polarity": "Positive",
      "Intensity": "Strong"
    },
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "staff"
        ],
        [
          "40:45"
        ]
      ],
      "Polar_expression": [
        [
          "very",
          "helpful"
        ],
        []
      ],
      "Polarity": "Positive",
      "Intensity": "Strong"
    }
  ]
},
@jerbarnes (Owner) commented Sep 30, 2021

Hi Jan,

Yes, you're right. The first issue stems from the dataset itself, where negators ('no', 'not', etc.) and intensifiers ('very', 'extremely') are not explicitly included in the polar expression but are instead attached as properties. In the conversion script, we decided to leave them separate, but it is true that this choice is arbitrary. Regarding the missing indices, I'll have to take a deeper look into the code to see why that is happening. Thanks for bringing it up!

@janpf (Author) commented Sep 30, 2021

Thanks for your quick reply!

> negators ('no', 'not', etc.) and intensifiers ('very', 'extremely') are not explicitly included in the polar expression

That information actually helps a lot!
By the way, in train.json something even funkier seems to be going on:

{
  "sent_id": "Colorado_Technical_University_Online_69_10-14-2005-1",
  "text": "They have used one of the books that was used by a professor of mine from a SUNY school that would only teach with graduate level books for undergraduate courses .",
  "opinions": [
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "They"
        ],
        [
          "0:4"
        ]
      ],
      "Polar_expression": [
        [
          "no",
          "complaints"
        ],
        []
      ],
      "Polarity": "Positive",
      "Intensity": "Average"
    }
  ]
},

@jerbarnes (Owner)

Thanks for pointing that out. I'll have a look and try to get back to you soon.

@jerbarnes (Owner)

Ok, I've confirmed that this is only a problem in the Darmstadt dataset and only affects polar expressions, but it occurs in all splits. The problem comes from the fact that the original annotations often span several sentences: a document can have a target in the first sentence and a polar expression much later. When we divide the annotations into sentences, the polar expression is no longer in the same sentence, which gives null offsets. I will refactor the code a bit to remove these sentences and push later today.
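
A rough sketch of the filtering described above (not the actual preprocessing code), for anyone who wants to clean a local copy in the meantime: it drops opinions whose polar expression ended up with no offsets in the sentence.

    def drop_null_offset_opinions(sentences):
        """Remove opinions whose Polar_expression has no character offsets,
        i.e. the expression fell outside the sentence after the
        document-level annotations were split into sentences."""
        for sent in sentences:
            sent["opinions"] = [
                op for op in sent["opinions"]
                if op["Polar_expression"][1]  # offset list is non-empty
            ]
        return sentences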

@jerbarnes (Owner)

Ok, I've updated the preprocessing script to remove the annotations that were problematic. Let me know if it works on your end and I'll close the issue.

@janpf (Author) commented Oct 20, 2021

Thanks! Looks like that removed the issues. If I find something else I'll just reopen ;)

@janpf (Author) commented Oct 29, 2021

I believe that there are some other cases of wrong annotations. Example from multibooked_ca/dev:

{
        "sent_id": "corpora/ca/quintessence-Miriam-1",
        "text": "La porteria i l ' escala .",
        "opinions": []
},
{
        "sent_id": "corpora/ca/quintessence-Miriam-2",
        "text": "Son poc accesibles quan vas amb nens petits i no posaba res a l ' anunci",
        "opinions": [
            {
                "Source": [
                    [],
                    []
                ],
                "Target": [
                    [
                        "l ' escala"
                    ],
                    [
                        "14:24"
                    ]
                ],
                "Polar_expression": [
                    [
                        "poc accesibles"
                    ],
                    [
                        "4:18"
                    ]
                ],
                "Polarity": "Negative",
                "Intensity": "Standard"
            },

The target doesn't exist in the sentence text, but in the sentence right before :O
opener_en/dev also contains an interesting case:

    {
        "sent_id": "../opener/en/kaf/hotel/english00200_e8f707795fc0c7f605a1f7115c3da711-2",
        "text": "Hotel Premiere Classe Orly Rungis is near the airport and close to Orly",
        "opinions": [
            {
                "Source": [
                    [],
                    []
                ],
                "Target": [
                    [
                        "Hotel Premiere Classe Orly Rungis"
                    ],
                    [
                        "0:33"
                    ]
                ],
                "Polar_expression": [
                    [
                        "near the airport"
                    ],
                    [
                        "37:53"
                    ]
                ],
                "Polarity": "Negative",
                "Intensity": "Standard"
            },
            {
                "Source": [
                    [],
                    []
                ],
                "Target": [
                    [
                        "Hotel Premiere Classe Orly Rungis"
                    ],
                    [
                        "0:33"
                    ]
                ],
                "Polar_expression": [
                    [
                        "close to Orly major highways"
                    ],
                    [
                        "0:71"
                    ]
                ],
                "Polarity": "Negative",
                "Intensity": "Standard"
            }
        ]
    },
    {
        "sent_id": "../opener/en/kaf/hotel/english00200_e8f707795fc0c7f605a1f7115c3da711-3",
        "text": "major highways ( all night heard the noise of passing large vehicles ) .",
        "opinions": [
            {
                "Source": [
                    [],
                    []
                ],
                "Target": [
                    [
                        "Hotel Premiere Classe Orly Rungis"
                    ],
                    [
                        "0:33"
                    ]
                ],
                "Polar_expression": [
                    [
                        "noise of passing large vehicles"
                    ],
                    [
                        "37:68"
                    ]
                ],
                "Polarity": "Negative",
                "Intensity": "Standard"
            }
        ]
    },

My guess is that some sentences have accidentally been split into two separate sentences?

janpf reopened this Oct 29, 2021
janpf changed the title from "Inconsistency in Darmstadt" to "Inconsistencies in dataset annotations" Oct 29, 2021
@jerbarnes (Owner)

Yes, you are correct. I will have to take a deeper look at the other datasets and will come back with corrections soon.

@jerbarnes (Owner)

It seems like the problem stems from the original sentence segmentation. The annotation was performed at document level, and although we told annotators to make sure that all sources/targets/expressions were annotated within sentences, at the time it wasn't completely clear that some annotations spanned across incorrect sentence boundaries. This will require quite a bit of work to fix and I'm afraid I'll have to leave it for now. What I will do is filter the dev/eval data to make sure these cases do not influence the evaluation.

@egilron commented Dec 17, 2021

I compared the index and text representations for each segment of each element in ["Source", "Target", "Polar_expression"] in each opinion in the train data. I checked whether the length of the text was similar to the length of the span represented by the index values. Here is what I got:

| dataset | Polar_expression dissimilar | Polar_expression similar | Source dissimilar | Source similar | Source empty | Target dissimilar | Target similar | Target empty |
|---|---|---|---|---|---|---|---|---|
| opener_en | 95 | 2789 | 0 | 266 | 2618 | 17 | 2665 | 202 |
| multibooked_eu | 94 | 1590 | 5 | 200 | 1479 | 23 | 1262 | 399 |
| opener_es | 47 | 2997 | 0 | 176 | 2868 | 3 | 2756 | 285 |
| multibooked_ca | 62 | 1918 | 1 | 167 | 1812 | 23 | 1672 | 285 |
| norec | 0 | 9255 | 0 | 898 | 7550 | 0 | 6819 | 1670 |
| darmstadt_unis | 22 | 1077 | 0 | 63 | 743 | 2 | 804 | 0 |
| mpqa | 0 | 1706 | 0 | 1434 | 272 | 0 | 1481 | 225 |

Example sentence

{"sent_id": "../opener/en/kaf/hotel/english00192_e3fe22eeb360723a699504a27e13065e-5", 
   "text": "I can't explain in words how grand this place looks .", 
   "opinions": [{"Source": [[], []], "Target": [["this place"], ["36:46"]], "Polar_expression": [["how grand looks"], ["26:52"]], 
   "Polarity": "Positive", "Intensity": "Standard"}]}

For cases like this, where words are omitted from the text, like "how grand looks", we could write a script to break the element up into segments. Or just go with the index representations. Or just throw them out.

    # For each (text, span) pair of an opinion element, compare the length of the
    # text string with the length of the span given by its "start:end" offsets.
    for text, span in zip(opinion[element][0], opinion[element][1]):
        start, end = (int(n) for n in span.split(":"))
        if len(text) == end - start:
            data.append(element + " similar")
        else:
            data.append(element + " dissimilar")
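
A rough sketch of the splitting idea mentioned above (my own guess at the intended behaviour, not code from the repository): locate each word of the element string in the sentence text, starting from the span's approximate start, and give it its own offsets.

    def split_into_segments(text, element_string, span):
        """Break e.g. ("how grand looks", "26:52") into per-word segments,
        each with its own "start:end" offsets, by searching for every word
        in the sentence text from the span's start onward."""
        start, _ = (int(n) for n in span.split(":"))
        segments, offsets = [], []
        cursor = max(start - 1, 0)  # some files look shifted by one
        for word in element_string.split():
            pos = text.find(word, cursor)
            if pos == -1:            # word not in the sentence at all,
                return None          # e.g. it came from a neighbouring sentence
            segments.append(word)
            offsets.append(f"{pos}:{pos + len(word)}")
            cursor = pos + len(word)
        return segments, offsets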

@janpf (Author) commented Dec 19, 2021

@egilron nice work!
Maybe you could add another column which indicates whether the string in Source, Target and Polar_expression exists in the original text in the first place? Sometimes the indices are correct, but the string doesn't appear anywhere at all.

@egilron commented Dec 20, 2021

Thank you! Virtually all the dissimilarities between text span and index span that I catch come from the text span omitting words while the index span covers everything from the first to the last word, like the "how grand looks" example. I found only one sentence where a text span is larger than its index span. I have 393 segments where the index span is larger than the text representation. For 389 of these, I find each word of the text representation inside the span representation. For example, for [["how grand looks"], ["26:52"]], all words in ["how", "grand", "looks"] can be found in "I can't explain in words how grand this place looks ."[26-1:52-1] ("how grand this place looks").

For the four spans where the text words are not found in the index span, these words come from outside the sentence.
All counting is on train only.
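
For clarity, that containment check can be sketched like this (illustrative only, not the exact code I ran; it widens the window by one character to tolerate the off-by-one shift noted above):

    def words_in_span(text, element_string, span):
        """Check whether every word of the element string occurs inside the
        substring selected by the "start:end" offsets."""
        start, end = (int(n) for n in span.split(":"))
        window = text[max(start - 1, 0):end]
        return all(word in window for word in element_string.split())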

The table is getting cluttered now. Use at your own risk.

| dataset | Polar_expression indexspan_larger | Polar_expression similar | Polar_expression spanlarger text_notin_sentence | Polar_expression textspan_larger | Polar_expression spanlarger text_in_span | Source indexspan_larger | Source similar | Source spanlarger text_notin_sentence | Source empty | Source spanlarger text_in_span | Target indexspan_larger | Target similar | Target empty | Target spanlarger text_in_span |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| opener_en | 95 | 2789 | 0 | 0 | 95 | 0 | 266 | 0 | 2618 | 0 | 17 | 2665 | 202 | 17 |
| multibooked_eu | 94 | 1590 | 2 | 0 | 92 | 5 | 200 | 1 | 1479 | 4 | 23 | 1262 | 399 | 23 |
| opener_es | 46 | 2997 | 0 | 1 | 46 | 0 | 176 | 0 | 2868 | 0 | 3 | 2756 | 285 | 3 |
| multibooked_ca | 62 | 1918 | 1 | 0 | 61 | 1 | 167 | 0 | 1812 | 1 | 23 | 1672 | 285 | 23 |
| norec | 0 | 9255 | 0 | 0 | 0 | 0 | 898 | 0 | 7550 | 0 | 0 | 6819 | 1670 | 0 |
| darmstadt_unis | 22 | 1077 | 0 | 0 | 22 | 0 | 63 | 0 | 743 | 0 | 2 | 804 | 0 | 2 |
| mpqa | 0 | 1706 | 0 | 0 | 0 | 0 | 1434 | 0 | 272 | 0 | 0 | 1481 | 225 | 0 |

@jerbarnes (Owner)

Hey,

In the end, I was able to fix the easy ones, where the target/polar expression was split but the offsets did not reflect this. That took care of most of them. For the ones that were split across sentences, I either filtered them if they were incorrect annotations (the original annotation spanned a sentence boundary), or combined the text and fixed them otherwise (incorrect sentence segmentation).
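
For anyone curious what "combined the text" amounts to, a simplified sketch (not the actual fix): concatenate the two wrongly split sentences and shift the second sentence's offsets. Opinions whose offsets already point into the first sentence, like the cross-sentence targets shown earlier, would still need separate handling.

    def merge_sentences(first, second):
        """Concatenate two wrongly split sentences and shift the second
        sentence's character offsets by len(first text) + 1 (the joining
        space). Offsets that already refer to the first sentence are not
        detected here and would need manual correction."""
        shift = len(first["text"]) + 1
        merged = {
            "sent_id": first["sent_id"],
            "text": first["text"] + " " + second["text"],
            "opinions": list(first["opinions"]),
        }
        for op in second["opinions"]:
            for field in ("Source", "Target", "Polar_expression"):
                strings, spans = op[field]
                op[field] = [strings, [
                    ":".join(str(int(x) + shift) for x in s.split(":"))
                    for s in spans
                ]]
            merged["opinions"].append(op)
        return merged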

I think that should fix most of the issues, but let me know if you happen to find anything else.

@MinionAttack commented Jan 12, 2022

Hi @jerbarnes, after this change do I have to retrain the models?

@jerbarnes (Owner)

It depends a bit. There were relatively few examples that were affected, so I doubt that retraining anything based on pre-trained language models will see large benefits. On the other hand, if you have smaller models that you can train quickly, it might be worth it.

@janpf (Author) commented Jan 12, 2022

Hi Jeremy,
thanks for your work!
I just found this one and I have no idea what's happening here 😅

from mpqa:

{
        "sent_id": "xbank/wsj_0583-27",
        "text": "Sansui , he said , is a perfect fit for Polly Peck 's electronics operations , which make televisions , videocassette recorders , microwaves and other products on an \" original equipment maker \" basis for sale under other companies ' brand names .",
        "opinions": [
            {
                "Source": [
                    [
                        "sa"
                    ],
                    [
                        "12:14"
                    ]
                ],
                "Target": [
                    [
                        ","
                    ],
                    [
                        "7:8"
                    ]
                ],
                "Polar_expression": [
                    [
                        ","
                    ],
                    [
                        "17:18"
                    ]
                ],
                "Polarity": "Positive",
                "Intensity": "Average"
            }
        ]
    },

I didn't create a script to find issues like these though :/

@jerbarnes (Owner)

It looks like it's a problem in the original annotation file in MPQA. In that particular file, lots of the indices seem to be off. I'm not sure what happened. I can remove this one in the preprocessing script, but I don't currently have a way to search for similar kinds of errors.

@egilron commented Jan 20, 2022

Today I re-downloaded the repo and re-extracted the data. Now my import script only catches a handful of darmstadt_unis sentences with some expression text/span issues. Looking good!

@jerbarnes (Owner)

Great to hear!
