Add trigram match on facility name fallback when matching via API #1099
I'm getting {
  "matches": [
    {
      "id": "VN2020253TGDHJC",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          106.6703963,
          10.8834303
        ]
      },
      "properties": {
        "name": "BI (VN) Co. Ltd.",
        "address": "Fri Jan 01 1075 10:52:58 GMT+0300 (EAT),Zone 1,Thanh Xuan Ward,Ho Chi Minh",
        "country_code": "VN",
        "oar_id": "VN2020253TGDHJC",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 11,
            "name": "Service Provider E (Summer 2018 Affiliate List)",
            "is_verified": false
          }
        ],
        "country_name": "Vietnam",
        "claim_info": null,
        "other_locations": [],
        "ppe_product_types": null,
        "ppe_contact_phone": null,
        "ppe_contact_email": null,
        "ppe_website": null
      },
      "confidence": 0.8654
    }
  ],
  "item_id": 933,
  "geocoded_geometry": {
    "type": "Point",
    "coordinates": [
      108.277199,
      14.058324
    ]
  },
  "geocoded_address": "Vietnam",
  "status": "MATCHED",
  "oar_id": "VN2020253TGDHJC"
}
This is giving me a {
  "matches": [
    {
      "id": "VN2020253Z0AEY7",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          108.277199,
          14.058324
        ]
      },
      "properties": {
        "name": "Pungkook Saigon Two Corporation",
        "address": "Vietnam.",
        "country_code": "VN",
        "oar_id": "VN2020253Z0AEY7",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 8,
            "name": "Civil Society Organization A (Summer 2019 Apparel List)",
            "is_verified": false
          }
        ],
        "country_name": "Vietnam",
        "claim_info": null,
        "other_locations": [],
        "ppe_product_types": null,
        "ppe_contact_phone": null,
        "ppe_contact_email": null,
        "ppe_website": null
      },
      "confidence": 0.5676
    }
  ],
  "item_id": 934,
  "geocoded_geometry": {
    "type": "Point",
    "coordinates": [
      108.277199,
      14.058324
    ]
  },
  "geocoded_address": "Vietnam",
  "status": "POTENTIAL_MATCH"
}
No. That was a mistake in writing the instructions. Fixed.
With {
  "matches": [
    {
      "id": "VN2020253TGDHJC",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          106.6703963,
          10.8834303
        ]
      },
      "properties": {
        "name": "BI (VN) Co. Ltd.",
        "address": "Fri Jan 01 1075 10:52:58 GMT+0300 (EAT),Zone 1,Thanh Xuan Ward,Ho Chi Minh",
        "country_code": "VN",
        "oar_id": "VN2020253TGDHJC",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 11,
            "name": "Service Provider E (Summer 2018 Affiliate List)",
            "is_verified": false
          }
        ],
        "country_name": "Vietnam",
        "claim_info": null,
        "other_locations": [],
        "ppe_product_types": null,
        "ppe_contact_phone": null,
        "ppe_contact_email": null,
        "ppe_website": null
      },
      "confidence": 0,
      "text_only_match": true
    },
    {
      "id": "VN20202534B0BHX",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          106.270029,
          20.933005
        ]
      },
      "properties": {
        "name": "PHI Co.,Ltd",
        "address": "XN10,Dai An Industrial Zone KM51,High Way No.5,Tu Minh,Hai Duong City,Hai Duong Province",
        "country_code": "VN",
        "oar_id": "VN20202534B0BHX",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 10,
            "name": "Union A (Spring 2017 Apparel List)",
            "is_verified": false
          }
        ],
        "country_name": "Vietnam",
        "claim_info": null,
        "other_locations": [],
        "ppe_product_types": null,
        "ppe_contact_phone": null,
        "ppe_contact_email": null,
        "ppe_website": null
      },
      "confidence": 0,
      "text_only_match": true
    },
    {
      "id": "VN2020253RBKVMQ",
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          106.2686966,
          20.9329079
        ]
      },
      "properties": {
        "name": "PHI Co., Ltd.",
        "address": "No. 10, Dai An Industrial Zone, Tu Minh , Hai Duong, 17000, Hai Duong",
        "country_code": "VN",
        "oar_id": "VN2020253RBKVMQ",
        "other_names": [],
        "other_addresses": [],
        "contributors": [
          {
            "id": 15,
            "name": "Manufacturing Group E (Winter 2019 Compliance List)",
            "is_verified": false
          }
        ],
        "country_name": "Vietnam",
        "claim_info": null,
        "other_locations": [],
        "ppe_product_types": null,
        "ppe_contact_phone": null,
        "ppe_contact_email": null,
        "ppe_website": null
      },
      "confidence": 0,
      "text_only_match": true
    }
  ],
  "item_id": 935,
  "geocoded_geometry": {
    "type": "Point",
    "coordinates": [
      108.277199,
      14.058324
    ]
  },
  "geocoded_address": "Vietnam",
  "status": "POTENTIAL_MATCH"
}
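A client consuming a response like this one may want to tell text-only fallback matches apart from ordinary dedupe matches. The following is a minimal Python sketch, not code from this PR. It relies only on fields visible in the responses in this thread (`matches`, `confidence`, `text_only_match`), and the helper name `split_matches` is hypothetical.

```python
def split_matches(response):
    """Partition a match response into (dedupe_matches, text_only_matches)."""
    dedupe, text_only = [], []
    for match in response.get("matches", []):
        # In the responses shown in this thread, fallback matches carry
        # "text_only_match": true together with a confidence of 0.
        if match.get("text_only_match", False):
            text_only.append(match)
        else:
            dedupe.append(match)
    return dedupe, text_only


# Trimmed-down copy of a text-only response (most properties omitted).
response = {
    "matches": [
        {"id": "VN2020253TGDHJC", "properties": {"name": "BI (VN) Co. Ltd."},
         "confidence": 0, "text_only_match": True},
        {"id": "VN20202534B0BHX", "properties": {"name": "PHI Co.,Ltd"},
         "confidence": 0, "text_only_match": True},
    ],
    "item_id": 935,
    "status": "POTENTIAL_MATCH",
}

dedupe, text_only = split_matches(response)
print(len(dedupe), [m["id"] for m in text_only])
```

Keying on the `text_only_match` flag rather than on `confidence == 0` keeps the check independent of how confidence values are reported.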
Everything works except for this #1099 (comment). Not sure if it was a fluke or testing error.
Installing the pg_trgm extension "[adds] functions and operators for determining the similarity of alphanumeric text based on trigram matching" https://www.postgresql.org/docs/current/pgtrgm.html
This new feature of the facility matching/creation API addresses problems experienced by an MSI when attempting to match records with partial or incomplete addresses. We use the same 0.50 confidence threshold as our dedupe matcher for consistency. It is a reasonable starting point and could be adjusted in the future. When testing with realistic data we were sometimes matching dozens of records, so we limit the responses to the first five, which is reasonable given that we are sorting by similarity score. We are making this an opt-in feature by adding a new optional query string argument so all existing behavior is preserved.
94a4f39 to d7dc120
The SQL query produced looks like this
I used this query with production data to verify that facilities that an MSI was having trouble matching would have results returned by this new query.
I pushed a rebase on develop that also includes a small substantive change (https://github.com/open-apparel-registry/open-apparel-registry/pull/1099/files#diff-59d0b6e11bdeae4bc0a0f3dd1be03982R1138-R1144). When testing with realistic data we were sometimes matching dozens of records, so we limit the responses to the first five, which is reasonable given that we are sorting by similarity score.
The score on that pending match was 0.56, just above the 0.50 cutoff. Since a new model is trained each time we restart the server or run resetdb, and there is a random element in the training, match results at the borderline can sometimes change. This isn't a regression in behavior, but thanks for noting the deviation from the expectations in the test instructions.
Overview
This new feature of the facility matching/creation API addresses problems experienced by an MSI when attempting to match records with partial or incomplete addresses. It adds opt-in fallback text matching on facility name, filtered by country, that is used when there are no dedupe matches.
Connects #1094
Notes
Installing the pg_trgm extension "[adds] functions and operators for determining the similarity of alphanumeric text based on trigram matching"
https://postgresql.org/docs/current/pgtrgm.html
We use the same 0.50 confidence threshold as our dedupe matcher for consistency. It is a reasonable starting point and could be adjusted in the future.
We are making this an opt-in feature by adding a new optional query string argument so all existing behavior is preserved.
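To make the notes above concrete, here is a rough pure-Python approximation of trigram similarity combined with the 0.50 threshold, the country filter, and the five-result limit mentioned in this PR. It is a sketch, not the query the PR runs: the real matching happens in PostgreSQL via pg_trgm, whose handling of non-ASCII characters may differ. The function names, the dictionary shape of `facilities`, and the use of `>=` at the threshold are all assumptions made for illustration.

```python
import re


def trigrams(text):
    """Approximate pg_trgm: lowercase, split on non-alphanumerics, pad each
    word with two leading spaces and one trailing space, collect 3-grams."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by distinct trigrams across both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def text_only_matches(name, country_code, facilities, threshold=0.5, limit=5):
    """Hypothetical fallback: same-country facilities whose name similarity
    meets the threshold, best first, capped at `limit` results."""
    scored = [
        (similarity(name, facility["name"]), facility)
        for facility in facilities
        if facility["country_code"] == country_code
    ]
    hits = [pair for pair in scored if pair[0] >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [facility for _, facility in hits[:limit]]


# Facility names taken from the responses in this thread.
facilities = [
    {"name": "PHI Co.,Ltd", "country_code": "VN"},
    {"name": "PHI Co., Ltd.", "country_code": "VN"},
    {"name": "Pungkook Saigon Two Corporation", "country_code": "VN"},
]
matches = text_only_matches("PHI Co Ltd", "VN", facilities)
print([facility["name"] for facility in matches])
```

Because punctuation only separates words, "PHI Co.,Ltd" and "PHI Co., Ltd." produce the same trigram set, which is the kind of near-duplicate naming that an address-based match can miss when the submitted address is incomplete.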
Testing Instructions
Run ./scripts/resetdb and verify that it completes without error.
[Token 1d18b962d6f976b0b7e8fcf9fcc39b56cf278051]
Set the create option to false and POST this JSON. Verify that a normal MATCHED result is returned with a non-zero confidence score.
Set the textonlyfallback option to true and POST the same example. Verify that POTENTIAL_MATCH items are returned with a confidence score of 0 and text_only_match set to true.
Set the create option to true and submit the same item. Verify that the matches in the response contain confirm_match_url and reject_match_url links.
Take the ID from a confirm_match_url, browse http://localhost:8081/api/docs/#!/facility-matches/facility_matches_confirm and submit the ID. Verify a successful response and copy the matched OAR ID.
Checklist
fixup! commits have been squashed
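The create and textonlyfallback options from the testing instructions can also be exercised from a script instead of the API docs page. The sketch below only builds the request and does not send it. Several details are assumptions rather than facts from this PR: the endpoint path in `BASE_URL`, the `Authorization: Token ...` header scheme, and the field names in the example payload.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the matching endpoint lives at this path. The thread shows the
# host and port (via the API docs URL) but not the endpoint itself.
BASE_URL = "http://localhost:8081/api/facilities/"


def build_match_request(payload, token, create=False, text_only_fallback=False):
    """Build (but do not send) a POST request with the two query options."""
    query = urlencode({
        "create": str(create).lower(),
        "textonlyfallback": str(text_only_fallback).lower(),
    })
    return Request(
        f"{BASE_URL}?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: token authentication with a "Token" prefix.
            "Authorization": f"Token {token}",
        },
        method="POST",
    )


request = build_match_request(
    # Hypothetical payload; field names are not confirmed by this thread.
    {"country": "VN", "name": "PHI Co Ltd", "address": "Vietnam"},
    token="<your API token>",
    text_only_fallback=True,
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. Leaving `text_only_fallback` at its default of `False` mirrors the opt-in design, where requests that do not ask for the fallback keep the existing behavior.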