Reduce the number of false positives in affiliation matching outcome #1327

marekhorst · 2022-03-22T15:32:49Z

Originally reported on redmine: https://support.openaire.eu/issues/7392.

The analysis revealed that some false positives are generated with a very high confidence level. These matches are hard to justify when looking at both the affiliation and the matched organization metadata.

The affiliation matching algorithm is quite complex and consists of several matcher and voter modules. To pinpoint which module is malfunctioning we should run some local tests (which are easier to debug), possibly relying on the already defined unit tests. We should prepare an input and output set based on the false positive matches and run the algorithm to take a closer look at the matching logic.

…ching outcome Preparing manual tests.

marekhorst · 2022-03-25T11:29:51Z

After countless hours of testing and debugging I managed to pinpoint the cause of one of the false positive matches, which is most probably shared by other cases as well.

Long story short: the problem occurs when relying on a bucket (more details on the bucket concept can be found here) that includes affiliations and organizations linked by a project. Those are just candidates for a match, but there is one particular voter module (again, more on voters can be found in the document linked above), CommonAffSectionWordsVoter, which under the current configuration allows pretty loose matching. In particular, it votes for a positive match between:

affiliation: "MISTEA, INRA, Montpellier SupAgro, Universite de Montpellier"
organization: "Athena Research and Innovation Center In Information Communication & Knowledge Technologies"

simply by assuming one of the organization tokens ("in") is a subsection of one of the affiliation tokens ("INRA"). This is clearly wrong on many levels, including the pretty high confidence level. If I don't find a way to adjust the CommonAffSectionWordsVoter module (either via reconfiguration or reimplementation), I will argue for removing it from the project-based matcher as the major source of false positives.

…ching outcome Adjusting CommonAffSectionWordsVoter configuration by:
* increasing the length of words to be removed from matching (from 1 up to 2) to get rid of two letter words
* increasing minimum word similarity level in CommonSimilarWordCalculator (from 0.85 up to 0.9) to make more accurate comparison
* specifying the minimum number of words defined in a single affiliation section (set to 2) to avoid taking single word sections into account when matching
* specifying the minimum ratio of the number of common words to the total number of words in organization (set to 0.81) which is an equivalent of similar ratio of the number of common words to the number of words in affiliation section
Supplementing test suite with more cases.

…ching outcome Adjusting CommonAffSectionWordsVoter configuration by:
* setting minNumberOfWordsInAffSection = 2, which requires at least two words in a single affiliation section to be a match candidate
* excluding matches with different country codes, to be applied only when the country code is set in both organization and affiliation
Aligning unit tests and voter trust level.

marekhorst · 2022-04-11T16:34:41Z

After countless rounds of altering the configuration and implementation of CommonAffSectionWordsVoter and analyzing the results, I came to the conclusion that the following two additional restrictions should be imposed:

minNumberOfWordsInAffSection = 2, which requires at least two words in a single affiliation section to be a match candidate
excluding matches with different country codes, to be applied only when country code was set in both organization and affiliation

We require at least two words in an affiliation section because the Jaro-Winkler similarity threshold is set to 0.85, and treating a single common word as enough for a valid match is quite risky. It introduced quite a lot of false positives which at first glance were difficult to explain (e.g. "IN" and "INRA" give 0.866 similarity, so finding the word "in" in the organization name was enough to vote for a match). Increasing the similarity threshold eliminated too many valid matches, so I decided to leave it as it is.
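To make the "IN" vs "INRA" case easier to reproduce, here is a minimal Python sketch (not the project's code). It assumes the textbook Jaro-Winkler definition (prefix scale 0.1, prefix capped at 4 characters) and a heavily simplified stand-in for the voter: a section votes for a match when every word in it has a similar word in the organization name. The function names and the `all words must match` rule are assumptions made for illustration only; the real CommonAffSectionWordsVoter may tokenize and score differently.

```python
import re


def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by the common prefix (at most 4 characters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def words(text: str, word_to_remove_max_length: int = 1) -> list:
    """Lowercase tokens, dropping words not longer than the given length."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if len(w) > word_to_remove_max_length]


def votes_for_match(affiliation: str, organization: str,
                    min_word_similarity: float = 0.85,
                    min_words_in_aff_section: int = 1) -> bool:
    """Simplified voter (assumption): a comma-separated section votes when
    each of its words is similar to some word of the organization name."""
    org_words = words(organization)
    for section in affiliation.split(","):
        section_words = words(section)
        if not section_words or len(section_words) < min_words_in_aff_section:
            continue
        if all(any(jaro_winkler(w, o) >= min_word_similarity for o in org_words)
               for w in section_words):
            return True
    return False


AFFILIATION = "MISTEA, INRA, Montpellier SupAgro, Universite de Montpellier"
ORGANIZATION = ("Athena Research and Innovation Center In Information "
                "Communication & Knowledge Technologies")

print(round(jaro_winkler("in", "inra"), 3))   # 0.867, above the 0.85 threshold
print(votes_for_match(AFFILIATION, ORGANIZATION))                              # True
print(votes_for_match(AFFILIATION, ORGANIZATION, min_words_in_aff_section=2))  # False
```

In this sketch the single-word section "INRA" alone is enough to produce a vote, and requiring two words per section removes it without touching the similarity threshold.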

Excluding matches with different country codes was a rather safe call and eliminated most of the false positives reported in https://support.openaire.eu/issues/7392 (8 out of 10, one was not reproducible). It is important that this rule is not applied when the country code is unknown on either side: organization or affiliation.
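The country code rule can be sketched as follows. This is an illustration only: the function names are hypothetical, and the normalization (trimming and upper-casing) is my assumption rather than something taken from the actual implementation. The point is that a missing code on either side never blocks a match.

```python
from typing import Optional


def normalize_country_code(code: Optional[str]) -> Optional[str]:
    """Return an upper-cased code, or None when the code is missing or blank."""
    if code is None:
        return None
    cleaned = code.strip().upper()
    return cleaned or None


def country_codes_conflict(aff_country: Optional[str],
                           org_country: Optional[str]) -> bool:
    """True only when both codes are known and they differ."""
    aff = normalize_country_code(aff_country)
    org = normalize_country_code(org_country)
    if aff is None or org is None:
        # Unknown on either side: the rule is not applied.
        return False
    return aff != org


print(country_codes_conflict("FR", "GR"))   # True: the match is excluded
print(country_codes_conflict("FR", None))   # False: unknown on one side
print(country_codes_conflict("fr", "FR"))   # False: same country
```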

This turned out to be a quite optimal solution: it removed only 33k relations (out of 12M, roughly 0.3%), and when I analyzed a random sample of 100 removed elements, over 80% of them were false positives. The properly matched cases that were removed involved a poor quality affiliation or organization matched by the sheer luck of a single-word affiliation section having a common word (e.g. university or an organization abbreviation).

Some of the expected matches were lost because of a country mismatch caused by an invalid country code defined on the organization side. In such cases another organization instance with a valid country code is expected to exist and to be matched properly. For example, the "University of California" organization with an invalid country code has ~500 other organization instances with a similar name and a valid country code, and the affiliation is expected to match one of them. There is no need to match with a poorly defined organization instance.

It was a long and bumpy road to come up with the set of changes presented above. Here is the list of other tested improvements which turned out to eliminate too many valid matches:

wordToRemoveMaxLength: 1 -> 2, too many two-letter words were removed which proved to be rather indicative, and it did not address the reported false positive cases anyway
minWordSimilarity: 0.85 -> 0.9, which removed a significant number of positive matches, e.g. "res" and "research" were no longer considered similar
[new param] minCommonWordsToAllOrgWordsRatio = 0.81, which turned out to be too strict because of quite complex organization names (including departments etc.) which could not be matched anymore
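
To illustrate why the minCommonWordsToAllOrgWordsRatio = 0.81 variant was too strict, here is a rough Python sketch. It is an assumption-laden simplification: words are compared by exact equality instead of the similarity measure the voter uses, the function name is hypothetical, and the affiliation and organization strings are made up for the example rather than taken from the analyzed data.

```python
import re


def words(text: str) -> list:
    """Lowercase alphabetic tokens of the given text."""
    return re.findall(r"[a-z]+", text.lower())


def common_words_to_org_words_ratio(aff_section: str, org_name: str) -> float:
    """Share of organization words that also occur in the affiliation section
    (exact equality is used here as a stand-in for word similarity)."""
    org_words = words(org_name)
    if not org_words:
        return 0.0
    section_words = set(words(aff_section))
    common = sum(1 for w in org_words if w in section_words)
    return common / len(org_words)


# Made-up example: the organization name carries a department-level prefix.
AFF_SECTION = "University of Warsaw"
ORG_NAME = ("Interdisciplinary Centre for Mathematical and Computational "
            "Modelling, University of Warsaw")

ratio = common_words_to_org_words_ratio(AFF_SECTION, ORG_NAME)
print(ratio)           # 0.3: 3 common words out of 10 organization words
print(ratio >= 0.81)   # False: rejected under the stricter setting
```

Every word of the affiliation section is present in the organization name, yet the long organization name pushes the ratio far below 0.81.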

This change is relevant to DocOrgRelationMatcher only (the only matcher utilizing CommonAffSectionWordsVoter), so it affects only the clusterization based on doc->proj and proj->org relations.

…ching outcome Introduces code review fixes.

…ching outcome 2nd round of code review fixes: removing two test cases which were not proving anything important and adding one test proving country code comparison works as expected.

marekhorst added the functionality: affiliations label Mar 22, 2022

marekhorst self-assigned this Mar 22, 2022

marekhorst added a commit that referenced this issue Mar 24, 2022

Closes #1327: Reduce the number of false positives in affiliation mat…

e699dfa

…ching outcome Preparing manual tests.

This was referenced Mar 29, 2022

Consider introducing country name checking in CommonAffSectionWordsVoter #1331

Closed

Make affiliation matching debuging more efficient #1332

Open

This was referenced Apr 11, 2022

Marekhorst 1327 reduce the number of false positives in affmatch outcome #1337

Closed

Consider introducing generic veto-voter model in affmatching #1341

Open

marekhorst added a commit that referenced this issue Apr 15, 2022

Closes #1327: Reduce the number of false positives in affiliation mat…

dfdeabd

…ching outcome Introduces code review fixes.

marekhorst closed this as completed in ab635cd Apr 20, 2022
