Reduce the number of false positives in affiliation matching outcome #1327

Closed
marekhorst opened this issue Mar 22, 2022 · 2 comments

Comments

@marekhorst
Member

Originally reported on redmine: https://support.openaire.eu/issues/7392.

The analysis revealed that some false positives are generated with a very high confidence level. These matches are hard to justify when looking at both the affiliation and the matched organization metadata.

The affiliation matching algorithm is quite complex and consists of several matcher and voter modules, so to pinpoint which module is malfunctioning we should run some local tests (which are easier to debug), possibly relying on the already defined unit tests. We should prepare an input and output set based on the false positive matches and run the algorithm to take a closer look at the matching logic.

@marekhorst marekhorst self-assigned this Mar 22, 2022
marekhorst added a commit that referenced this issue Mar 24, 2022
@marekhorst
Member Author

marekhorst commented Mar 25, 2022

After countless hours of testing and debugging I managed to pinpoint the cause of one of the false positive matches, which is most probably shared by other cases as well.

Long story short: this problem occurs when relying on a bucket (more details on the buckets concept can be found here) that includes affiliations and organizations linked by the project. Those are just candidates for a match, but there is one particular voter module (again, more on voters can be found in the document linked above), CommonAffSectionWordsVoter, which under the current configuration allows pretty loose matching. In particular, it votes for a positive match between:

  • affiliation: "MISTEA, INRA, Montpellier SupAgro, Universite de Montpellier"
  • organization: "Athena Research and Innovation Center In Information Communication & Knowledge Technologies"

simply by assuming that one of the organization tokens ("in") is a subsection of one of the affiliation tokens ("INRA"). This is clearly wrong on many levels, including the pretty high confidence level. If I don't find a way to adjust the CommonAffSectionWordsVoter module (either via reconfiguration or reimplementation), I will opt for removing it from the proj-based matcher as the major source of false positives.

marekhorst added a commit that referenced this issue Mar 28, 2022
…ching outcome

Adjusting CommonAffSectionWordsVoter configuration by:
* increasing the maximum length of words to be removed from matching (from 1 up to 2) to get rid of two-letter words
* increasing the minimum word similarity level in CommonSimilarWordCalculator (from 0.85 up to 0.9) to make the comparison more accurate
* specifying the minimum number of words defined in a single affiliation section (set to 2) to avoid taking single-word sections into account when matching
* specifying the minimum ratio of the number of common words to the total number of words in the organization (set to 0.81), which is the counterpart of the similar ratio of the number of common words to the number of words in the affiliation section

Supplementing test suite with more cases.
marekhorst added a commit that referenced this issue Apr 11, 2022
…ching outcome

Adjusting CommonAffSectionWordsVoter configuration by:
* setting minNumberOfWordsInAffSection = 2, which requires at least two words in a single affiliation section to be a match candidate
* excluding matches with different country codes, to be applied only when the country code is set in both organization and affiliation

Aligning unit tests and voter trust level.
@marekhorst
Member Author

marekhorst commented Apr 11, 2022

After countless rounds of altering the configuration/implementation of CommonAffSectionWordsVoter and analyzing the results, I came to the conclusion that the following two additional restrictions should be imposed:

  • minNumberOfWordsInAffSection = 2, which requires at least two words in a single affiliation section to be a match candidate
  • excluding matches with different country codes, to be applied only when country code was set in both organization and affiliation

We require at least two words in an aff section because the Jaro-Winkler similarity threshold is set to 0.85, and treating a single common word as enough for a valid match is quite risky. It introduced quite a lot of false positives which at first glance were difficult to explain (e.g. "IN" and "INRA" give a similarity of 0.866, so finding the word "in" in the organization name was enough to vote for a match). Increasing the similarity threshold eliminated too many valid matches, so I decided to leave it as it is.
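To make the numbers concrete, here is a minimal sketch of textbook Jaro-Winkler similarity. It is not the project's CommonSimilarWordCalculator: the function names, the lowercasing step, and the standard prefix scale of 0.1 with a 4-character prefix cap are assumptions made for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    # Assumption: comparison is case-insensitive.
    s1, s2 = s1.lower(), s2.lower()
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("IN", "INRA"), 3))       # 0.867
print(round(jaro_winkler("res", "research"), 3))  # 0.854
```

With a 0.85 threshold both pairs count as similar (the "IN"/"INRA" value is the 0.866 quoted above, truncated rather than rounded). Raising the threshold to 0.9 would reject both, which is consistent with valid pairs such as "res"/"research" being lost.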

Excluding matches with different country codes was a rather safe call and eliminated most of the false positives reported in https://support.openaire.eu/issues/7392 (8 out of 10, one was not reproducible). It is important that this rule is not applied when the country code is unknown on either side: organization or affiliation.

This turned out to be a reasonably good trade-off: it removed only 33k relations (out of 12M, roughly 0.3%), and when I analyzed a random sample of 100 removed elements, over 80% of them were false positives. The properly matched cases among them involved a poor-quality affiliation or organization matched by sheer luck, thanks to a single-word aff section sharing a common word (e.g. university or an organization abbreviation).

Some of the expected matches were lost because of a country mismatch caused by an invalid country code defined on the organization side. In such cases, however, another organization instance with a valid country code is expected to exist and to be matched properly. For example, a "University of California" organization with an invalid country code has ~500 other organization instances with a similar name and a valid country code, and the affiliation is expected to match one of those. There is no need to match with a poorly defined organization instance.
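The two restrictions can be sketched as simple guard functions. This is an illustrative sketch only, not the actual voter code: the function names, the Optional[str] representation of a country code, and the case-insensitive comparison are assumptions.

```python
from typing import List, Optional


def country_codes_conflict(org_country: Optional[str],
                           aff_country: Optional[str]) -> bool:
    """Return True only when both country codes are known and differ."""
    if not org_country or not aff_country:
        # Unknown on either side: the rule is not applied.
        return False
    # Assumption: codes are compared case-insensitively.
    return org_country.strip().upper() != aff_country.strip().upper()


def section_is_candidate(section_words: List[str],
                         min_words: int = 2) -> bool:
    """Mirror of minNumberOfWordsInAffSection = 2: single-word sections are skipped."""
    return len(section_words) >= min_words


print(country_codes_conflict("GR", "FR"))    # True: match is excluded
print(country_codes_conflict(None, "FR"))    # False: rule not applied
print(section_is_candidate(["INRA"]))        # False: only one word
```

The key design point is the asymmetry: a missing country code never blocks a match, only an explicit disagreement does.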

It was a long and bumpy road to come up with the set of changes presented above. Here is the list of other tested improvements which turned out to eliminate too many valid matches:

  • wordToRemoveMaxLength: 1 -> 2, which removed too many two-letter words that proved to be rather indicative, and it did not address the reported false positive cases anyway
  • minWordSimilarity: 0.85 -> 0.9, which resulted in removing a quite significant number of positive matches, e.g. "res" and "research" were not considered similar anymore
  • [new param] minCommonWordsToAllOrgWordsRatio = 0.81, which turned out to be too strict because of quite complex organization names (including departments etc.) that could not be matched anymore

This change is relevant to DocOrgRelationMatcher only (the only matcher utilizing CommonAffSectionWordsVoter), so it only affects clustering based on doc->proj and proj->org relations.
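To illustrate why a 0.81 ratio is strict for long organization names, here is a small sketch of the common-words-to-organization-words ratio. The function name and the example names are hypothetical, and exact (case-insensitive) word equality is used instead of the voter's similarity-based comparison to keep the example short.

```python
from typing import List


def common_words_to_org_words_ratio(aff_section_words: List[str],
                                    org_words: List[str]) -> float:
    """Fraction of organization words that also occur in the affiliation section."""
    if not org_words:
        return 0.0
    aff = {w.lower() for w in aff_section_words}
    common = sum(1 for w in org_words if w.lower() in aff)
    return common / len(org_words)


# Hypothetical example: the organization name carries a department suffix.
aff_section = ["University", "of", "Warsaw"]
org_name = ["University", "of", "Warsaw", "Faculty", "of", "Physics"]

ratio = common_words_to_org_words_ratio(aff_section, org_name)
print(round(ratio, 3))   # 0.667
print(ratio >= 0.81)     # False: rejected despite a plausible match
```

Every affiliation word is found in the organization name, yet the extra department words pull the ratio below 0.81.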

marekhorst added a commit that referenced this issue Apr 15, 2022
…ching outcome

Introduces code review fixes.
marekhorst added a commit that referenced this issue Apr 20, 2022
…ching outcome

2nd round of code review fixes: removing two test cases which were not proving anything important and adding one test proving country code comparison works as expected.