This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

Add trigram match on facility name fallback when matching via API #1099

Merged
jwalgran merged 3 commits into develop from feature/jcw/extended-matching on Sep 9, 2020

Conversation

jwalgran
Contributor

@jwalgran jwalgran commented Sep 8, 2020

Overview

This PR adds an opt-in fallback to the facility matching/creation API: when there are no dedupe matches, the API falls back to text matching on facility name, filtered by country. It addresses problems experienced by an MSI when attempting to match records with partial or incomplete addresses.

Connects #1094

Notes

Installing the pg_trgm extension "[adds] functions and operators for determining the similarity of alphanumeric text based on trigram matching"

https://postgresql.org/docs/current/pgtrgm.html

We use the same 0.50 confidence threshold as our dedupe matcher for consistency. It is a reasonable starting point and could be adjusted in the future.

We are making this an opt-in feature by adding a new optional query string argument so all existing behavior is preserved.

Testing Instructions

  • Run ./scripts/resetdb and verify that it completes without error
  • Browse http://localhost:8081/api/docs/#!/facilities/facilities_create
  • Authorize swagger with the development testing token [Token 1d18b962d6f976b0b7e8fcf9fcc39b56cf278051]
  • Set the create option to false and POST this JSON. Verify that a normal MATCHED is returned with a non-zero confidence score
[{
    "country": "VN",
    "name": "BI (VN) Co. Ltd.",
    "address": "Vietnam"
}]
  • POST this example and verify that no match is returned (NEW_FACILITY)
{
    "country": "VN",
    "name": "BI Co. Ltd.",
    "address": "Vietnam"
}
  • Set the textonlyfallback option to true and POST the same example. Verify that POTENTIAL_MATCH items are returned with a confidence score of 0 and text_only_match set to true.
{
    "country": "VN",
    "name": "BI Co. Ltd.",
    "address": "Vietnam"
}
  • Set the create option to true and submit the same item. Verify that the matches in the response contain confirm_match_url and reject_match_url links.
{
    "country": "VN",
    "name": "BI Co. Ltd.",
    "address": "Vietnam"
}
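The requests in the steps above can also be built outside Swagger. The following is a minimal sketch, not the project's own client code: the endpoint path `/api/facilities/` is an assumption inferred from the Swagger URL, the `Authorization: Token ...` header format is assumed from the token shown above, and `build_match_request` is a hypothetical helper. Only the `create` and `textonlyfallback` query string arguments come from this PR. The request is constructed but not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: endpoint path inferred from the Swagger docs URL; adjust as needed.
API_URL = "http://localhost:8081/api/facilities/"


def build_match_request(facility, token, create=False, text_only_fallback=False):
    """Build (but do not send) a POST request for the facility matching API.

    `create` and `textonlyfallback` are the query string options described
    in the testing instructions; everything else here is illustrative.
    """
    query = urlencode({
        "create": str(create).lower(),
        "textonlyfallback": str(text_only_fallback).lower(),
    })
    return Request(
        url=f"{API_URL}?{query}",
        data=json.dumps(facility).encode("utf-8"),
        headers={
            # Assumption: token authentication via the Authorization header.
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


facility = {"country": "VN", "name": "BI Co. Ltd.", "address": "Vietnam"}
request = build_match_request(
    facility, token="<development token>", text_only_fallback=True
)
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it to a running development server.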

Checklist

  • fixup! commits have been squashed
  • CI passes after rebase
  • CHANGELOG.md updated with summary of features or fixes, following Keep a Changelog guidelines

@rajadain
Contributor

rajadain commented Sep 9, 2020

Set the create option to false and POST this JSON. Verify that a normal POTENTIAL_MATCH is returned with a non-zero confidence score

{
    "country": "VN",
    "name": "BI (VN) Co. Ltd.",
    "address": "Vietnam"
}

I'm getting MATCHED instead of POTENTIAL_MATCH. Should that be concerning?

{
  "matches": [
    {
      "id": "VN2020253TGDHJC",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          106.6703963,
          10.8834303
        ]
      },
      "properties": {
        "name": "BI (VN) Co. Ltd.",
        "address": "Fri Jan 01 1075 10:52:58 GMT+0300 (EAT),Zone 1,Thanh Xuan Ward,Ho Chi Minh",
        "country_code": "VN",
        "oar_id": "VN2020253TGDHJC",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 11,
            "name": "Service Provider E (Summer 2018 Affiliate List)",
            "is_verified": false
          }
        ],
        "country_name": "Vietnam",
        "claim_info": null,
        "other_locations": [],
        "ppe_product_types": null,
        "ppe_contact_phone": null,
        "ppe_contact_email": null,
        "ppe_website": null
      },
      "confidence": 0.8654
    }
  ],
  "item_id": 933,
  "geocoded_geometry": {
    "type": "Point",
    "coordinates": [
      108.277199,
      14.058324
    ]
  },
  "geocoded_address": "Vietnam",
  "status": "MATCHED",
  "oar_id": "VN2020253TGDHJC"
}

@rajadain
Contributor

rajadain commented Sep 9, 2020

POST this example and verify that there is no match (NEW_FACILITY) is returned

{
    "country": "VN",
    "name": "BI Co. Ltd.",
    "address": "Vietnam"
}

This is giving me a POTENTIAL_MATCH:

{
  "matches": [
    {
      "id": "VN2020253Z0AEY7",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          108.277199,
          14.058324
        ]
      },
      "properties": {
        "name": "Pungkook Saigon Two Corporation",
        "address": "Vietnam.",
        "country_code": "VN",
        "oar_id": "VN2020253Z0AEY7",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 8,
            "name": "Civil Society Organization A (Summer 2019 Apparel List)",
            "is_verified": false
          }
        ],
        "country_name": "Vietnam",
        "claim_info": null,
        "other_locations": [],
        "ppe_product_types": null,
        "ppe_contact_phone": null,
        "ppe_contact_email": null,
        "ppe_website": null
      },
      "confidence": 0.5676
    }
  ],
  "item_id": 934,
  "geocoded_geometry": {
    "type": "Point",
    "coordinates": [
      108.277199,
      14.058324
    ]
  },
  "geocoded_address": "Vietnam",
  "status": "POTENTIAL_MATCH"
}

@jwalgran
Contributor Author

jwalgran commented Sep 9, 2020

I'm getting MATCHED instead of POTENTIAL_MATCH. Should that be concerning?

No. That was a mistake in writing the instructions. Fixed.

@rajadain
Contributor

rajadain commented Sep 9, 2020

Set the textonlyfallback option to true and POST the same example. Verify that POTENTIAL_MATCH items are returned with a confidence score of 0 and text_only_match set to true.

{
    "country": "VN",
    "name": "BI Co. Ltd.",
    "address": "Vietnam"
}

With textonlyfallback=true I get a POTENTIAL_MATCH

{
  "matches": [
    {
      "id": "VN2020253TGDHJC",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          106.6703963,
          10.8834303
        ]
      },
      "properties": {
        "name": "BI (VN) Co. Ltd.",
        "address": "Fri Jan 01 1075 10:52:58 GMT+0300 (EAT),Zone 1,Thanh Xuan Ward,Ho Chi Minh",
        "country_code": "VN",
        "oar_id": "VN2020253TGDHJC",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 11,
            "name": "Service Provider E (Summer 2018 Affiliate List)",
            "is_verified": false
          }
        ],
        "country_name": "Vietnam",
        "claim_info": null,
        "other_locations": [],
        "ppe_product_types": null,
        "ppe_contact_phone": null,
        "ppe_contact_email": null,
        "ppe_website": null
      },
      "confidence": 0,
      "text_only_match": true
    },
    {
      "id": "VN20202534B0BHX",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          106.270029,
          20.933005
        ]
      },
      "properties": {
        "name": "PHI Co.,Ltd",
        "address": "XN10,Dai An Industrial Zone KM51,High Way No.5,Tu Minh,Hai Duong City,Hai Duong Province",
        "country_code": "VN",
        "oar_id": "VN20202534B0BHX",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 10,
            "name": "Union A (Spring 2017 Apparel List)",
            "is_verified": false
          }
        ],
        "country_name": "Vietnam",
        "claim_info": null,
        "other_locations": [],
        "ppe_product_types": null,
        "ppe_contact_phone": null,
        "ppe_contact_email": null,
        "ppe_website": null
      },
      "confidence": 0,
      "text_only_match": true
    },
    {
      "id": "VN2020253RBKVMQ",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          106.2686966,
          20.9329079
        ]
      },
      "properties": {
        "name": "PHI Co., Ltd.",
        "address": "No. 10, Dai An Industrial Zone, Tu Minh , Hai Duong, 17000, Hai Duong",
        "country_code": "VN",
        "oar_id": "VN2020253RBKVMQ",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 15,
            "name": "Manufacturing Group E (Winter 2019 Compliance List)",
            "is_verified": false
          }
        ],
        "country_name": "Vietnam",
        "claim_info": null,
        "other_locations": [],
        "ppe_product_types": null,
        "ppe_contact_phone": null,
        "ppe_contact_email": null,
        "ppe_website": null
      },
      "confidence": 0,
      "text_only_match": true
    }
  ],
  "item_id": 935,
  "geocoded_geometry": {
    "type": "Point",
    "coordinates": [
      108.277199,
      14.058324
    ]
  },
  "geocoded_address": "Vietnam",
  "status": "POTENTIAL_MATCH"
}

Contributor

@rajadain rajadain left a comment


Everything works except for this #1099 (comment). Not sure if it was a fluke or testing error.

@rajadain rajadain assigned jwalgran and unassigned rajadain Sep 9, 2020
Installing the pg_trgm extension "[adds] functions and operators for determining
the similarity of alphanumeric text based on trigram matching"

https://www.postgresql.org/docs/current/pgtrgm.html
This new feature of the facility matching/creation API addresses problems
experienced by an MSI when attempting to match records with partial or
incomplete addresses.

We use the same 0.50 confidence threshold as our dedupe matcher for consistency.
It is a reasonable starting point and could be adjusted in the future. When
testing with realistic data we were sometimes matching dozens of records so we
limit the responses to the first five, which is reasonable given that we are
sorting by similarity score.

We are making this an opt-in feature by adding a new optional query string
argument so all existing behavior is preserved.
@jwalgran jwalgran force-pushed the feature/jcw/extended-matching branch from 94a4f39 to d7dc120 Compare September 9, 2020 19:35
@jwalgran
Contributor Author

jwalgran commented Sep 9, 2020

The SQL query produced looks like this

SELECT
    "api_facility"."ppe_product_types",
    "api_facility"."ppe_contact_email",
    "api_facility"."ppe_contact_phone",
    "api_facility"."ppe_website",
    "api_facility"."id",
    "api_facility"."name",
    "api_facility"."address",
    "api_facility"."country_code",
    "api_facility"."location"::bytea,
    "api_facility"."created_from_id",
    "api_facility"."created_at",
    "api_facility"."updated_at",
    SIMILARITY("api_facility"."name", 'TAKFOOK (CAMBODIA) GARMENT LTD') AS "similarity"
FROM "api_facility"
WHERE ("api_facility"."country_code" = 'KH'
    AND SIMILARITY("api_facility"."name", 'TAKFOOK (CAMBODIA) GARMENT LTD') >= 0.5)
ORDER BY "similarity" DESC;

I used this query with production data to verify that facilities that an MSI was having trouble matching would have results returned by this new query.
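To build intuition for what the SIMILARITY call in the query above measures, here is a rough pure-Python approximation of trigram similarity. It is a sketch under stated assumptions, not the pg_trgm implementation: it assumes names are lowercased, split into ASCII alphanumeric words, padded with two leading spaces and one trailing space, and compared as sets of three-character substrings (shared trigrams divided by total distinct trigrams). The function names `trigrams` and `similarity` are illustrative only, and real pg_trgm handling of non-ASCII text depends on the database locale.

```python
import re


def trigrams(text):
    """Return the set of trigrams in `text`, approximating pg_trgm's rules."""
    # Assumption: only ASCII letters and digits count as word characters.
    words = re.findall(r"[a-z0-9]+", text.lower())
    result = set()
    for word in words:
        # Each word is padded with two spaces in front and one behind.
        padded = f"  {word} "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


score = similarity("BI Co. Ltd.", "BI (VN) Co. Ltd.")
print(round(score, 3))
```

Under this sketch, "BI Co. Ltd." and "BI (VN) Co. Ltd." share 10 of 13 distinct trigrams, which clears the 0.5 threshold used in the WHERE clause.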

@jwalgran
Contributor Author

jwalgran commented Sep 9, 2020

I pushed a rebase on develop that also includes a small substantive change (https://github.com/open-apparel-registry/open-apparel-registry/pull/1099/files#diff-59d0b6e11bdeae4bc0a0f3dd1be03982R1138-R1144)

When testing with realistic data we were sometimes matching dozens of records so we limit the responses to the first five, which is reasonable given that we are sorting by similarity score.
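The selection logic described here (same country, similarity at or above the threshold, best scores first, at most five results) can be sketched as follows. This is an illustration only and not the code in the linked diff; `Candidate` and `top_text_matches` are hypothetical names, and the similarity scores in the sample data are made up.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    country_code: str
    similarity: float  # Precomputed name similarity, 0.0 to 1.0


def top_text_matches(candidates, country_code, threshold=0.5, limit=5):
    """Keep same-country candidates at or above the threshold, best first."""
    eligible = [
        c for c in candidates
        if c.country_code == country_code and c.similarity >= threshold
    ]
    # Highest similarity first, so truncating keeps the strongest matches.
    eligible.sort(key=lambda c: c.similarity, reverse=True)
    return eligible[:limit]


# Made-up sample data for illustration.
sample = [
    Candidate("Facility A", "KH", 0.91),
    Candidate("Facility B", "KH", 0.55),
    Candidate("Facility C", "VN", 0.99),  # Wrong country, excluded
    Candidate("Facility D", "KH", 0.49),  # Below threshold, excluded
    Candidate("Facility E", "KH", 0.72),
    Candidate("Facility F", "KH", 0.60),
    Candidate("Facility G", "KH", 0.50),
    Candidate("Facility H", "KH", 0.83),
]
matches = top_text_matches(sample, "KH")
print([m.name for m in matches])
```

Because the list is sorted before it is truncated, the cap drops only the weakest of the qualifying matches.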

@jwalgran
Contributor Author

jwalgran commented Sep 9, 2020

Everything works except for this #1099 (comment). Not sure if it was a fluke or testing error.

The score on that pending match was 0.56, just above the 0.50 cutoff. Since a new model is trained each time we restart the server or run resetdb, and there is a random element in the training, match results at the borderline can sometimes change. This isn't a regression in behavior, but thanks for noting the deviation from the expectations in the test instructions.

@jwalgran jwalgran merged commit 22696df into develop Sep 9, 2020
@jwalgran jwalgran deleted the feature/jcw/extended-matching branch September 9, 2020 19:47