Reduce the number of false positives in affiliation matching outcome #1327
Comments
…ching outcome Preparing manual tests.
After countless hours of testing and debugging, I managed to pinpoint the cause of one of the false positive matches, which is most probably shared among other cases as well. Long story short: the problem occurs when relying on a bucket (more details on the bucket concept can be found here) that includes affiliations and organizations linked by the project. Those are just candidates for a match, but there is one particular voter module (again, more on voters can be found in the document linked above),
simply by assuming that one of the organization tokens ("in") is a subsection of one of the affiliation tokens ("INRA"). This is clearly wrong on many levels, including the pretty high confidence level, and if I don't find a way to adjust |
…ching outcome Adjusting CommonAffSectionWordsVoter configuration by:
* increasing the length of words to be removed from matching (from 1 up to 2) to get rid of two-letter words
* increasing the minimum word similarity level in CommonSimilarWordCalculator (from 0.85 up to 0.9) to make the comparison more accurate
* specifying the minimum number of words defined in a single affiliation section (set to 2) to avoid taking single-word sections into account when matching
* specifying the minimum ratio of the number of common words to the total number of words in the organization (set to 0.81), which is the counterpart of the analogous ratio of the number of common words to the number of words in the affiliation section

Supplementing test suite with more cases.
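The four adjustments listed in this commit message can be pictured with a small sketch. This is not the actual CommonAffSectionWordsVoter code: the function and parameter names are hypothetical, `difflib.SequenceMatcher` is used as a stand-in for the project's word similarity measure, and it is an assumption that short words are dropped before the section word count is checked.

```python
from difflib import SequenceMatcher


def _words_similar(a, b, min_similarity):
    # Stand-in similarity measure; the real voter relies on its own calculator.
    return SequenceMatcher(None, a, b).ratio() >= min_similarity


def _drop_short_words(words, removed_word_length):
    # Words of this length or shorter are removed from matching.
    return [w.lower() for w in words if len(w) > removed_word_length]


def section_votes_match(
    org_words,
    aff_section_words,
    removed_word_length=2,        # was 1
    min_word_similarity=0.9,      # was 0.85
    min_words_in_aff_section=2,   # newly required
    min_common_to_org_ratio=0.81, # newly required
):
    """Hypothetical vote: True when the section is a plausible match."""
    org = _drop_short_words(org_words, removed_word_length)
    aff = _drop_short_words(aff_section_words, removed_word_length)
    # Single-word sections (and empty organizations) are never candidates.
    if not org or len(aff) < min_words_in_aff_section:
        return False
    # Count organization words that have a similar word in the section.
    common = sum(
        1 for o in org
        if any(_words_similar(o, a, min_word_similarity) for a in aff)
    )
    return common / len(org) >= min_common_to_org_ratio


# The "in" / "INRA" pair from the analysis is no longer a candidate:
print(section_votes_match(["in"], ["INRA"]))
print(section_votes_match(["University", "of", "California"],
                          ["University", "of", "California"]))
```

In this sketch the two-letter token "in" is discarded before any comparison takes place, and a section consisting of the single word "INRA" is rejected by the minimum word count, so the pair cannot produce a vote.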
…ching outcome Adjusting CommonAffSectionWordsVoter configuration by:
* setting minNumberOfWordsInAffSection = 2, which requires at least two words in a single affiliation section for it to be a match candidate
* excluding matches with different country codes, applied only when the country code is set in both organization and affiliation

Aligning unit tests and voter trust level.
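The country code rule from this commit can be sketched as follows. The function names are hypothetical and the normalization (trimming, upper-casing) is an assumption; the only behavior taken from the description is that a mismatch excludes a match solely when the code is set on both sides.

```python
from typing import Optional


def _normalize_country(code: Optional[str]) -> Optional[str]:
    # Treat None, empty and whitespace-only values as "unknown".
    cleaned = (code or "").strip().upper()
    return cleaned or None


def country_codes_compatible(org_country: Optional[str],
                             aff_country: Optional[str]) -> bool:
    """Return False only when both codes are known and they differ."""
    org = _normalize_country(org_country)
    aff = _normalize_country(aff_country)
    if org is None or aff is None:
        # Unknown on either side: the rule is not applied.
        return True
    return org == aff


print(country_codes_compatible("US", "FR"))   # both known, different
print(country_codes_compatible(None, "FR"))   # unknown on one side
```

Keeping the check permissive for unknown codes means it can only remove candidates, never block a pair for which no country evidence exists.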
After countless rounds involving altering configuration/implementation of
We are requiring at least two words in the aff section because the Jaro-Winkler similarity level is set to

Excluding matches with different country codes was a rather safe call and eliminated most of the false positives reported in https://support.openaire.eu/issues/7392 (8 out of 10, one was not reproducible). It is important that this rule is not applied when the country code is unknown on either side: organization or affiliation. This turned out to be quite an optimal solution, which proved to remove only

Some of the expected matches were not made because of a country mismatch caused by an invalid country defined on the organization side, but in such cases another organization with the country code set is expected to be matched properly instead. For example, a "University of California" organization with an invalid country code has ~500 other organization instances with a similar name and a valid country code, and the affiliation is expected to be successfully matched with one of those. There is no need to match with a poorly defined organization instance.

It was a long and bumpy road to come up with the set of changes presented above. Here is the list of other tested improvements which turned out to eliminate too many valid matches:
This change is relevant to |
…ching outcome Introduces code review fixes.
…ching outcome 2nd round of code review fixes: removing two test cases which were not proving anything important and adding one test proving country code comparison works as expected.
Originally reported on redmine: https://support.openaire.eu/issues/7392.
The analysis performed revealed that some false positives are generated with a very high confidence level. These matches are hard to justify when looking at both the affiliation and the matched organization metadata.
The affiliation matching algorithm is quite complex and consists of several matcher and voter modules, so in order to pinpoint which module is malfunctioning we should perform some local tests (which are easier to debug), possibly relying on the already defined unit tests. We should prepare an input and output set based on the false positive matches and run the algorithm to take a closer look at the matching logic.