Some issues with the matching #34
Comments
Hi Vishal, There are some examples of generating one in the unit test: fuzzy-matcher/src/test/java/com/intuit/fuzzymatcher/component/MatchServiceTest.java, line 363 in d2ce6f6.
The library internally uses this in many places, and I have a hunch that it is also causing bad matches to surface. Hope this helps. Thanks,
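If "one" here refers to a unique document key (an assumption, since the embedded snippet did not survive), note that the original post builds each document with `csv[0]`, which is the `first_name` column according to the CSV header, so keys would repeat across rows. A minimal sketch of handing out distinct keys could look like the following. `RowKeys` is a hypothetical helper, not part of fuzzy-matcher.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: hands out a distinct key for every CSV row,
// independent of the row contents.
final class RowKeys {
    private final AtomicLong counter = new AtomicLong();

    // Returns "1", "2", "3", ... on successive calls.
    String next() {
        return Long.toString(counter.incrementAndGet());
    }

    public static void main(String[] args) {
        RowKeys keys = new RowKeys();
        // Two rows share a first name, yet each receives its own key.
        List<String> rows = List.of("Susan,Smith", "Susan,Smithers", "Susanna,Fedak");
        for (String row : rows) {
            System.out.println(keys.next() + " -> " + row);
        }
    }
}
```

The generated string would then be passed where `csv[0]` is passed today, assuming the builder only needs the key to be unique.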
Thanks @manishobhatia. I actually tried this method out earlier; it worked, and I saw multiple results as well. But I am still seeing some interesting results, and it may be a question of configuration. For example: Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]} Since the phone number has been given a greater weight, I am surprised the second match has a higher score. What do you think might be happening? Is it because the name match is better? Also, do you have any suggestions for making the address search a bit smarter, so that the zip code can act as a proxy for the city and state?
Also, would you be able to accommodate requests to add elements for approximate age (I guess we can still use the NUMBER element for this) and gender/sex? Thanks Manish!
Vishal, On the request to add additional elements: yes, absolutely. We are always looking to enhance this with new elements. Your usage of NUMBER for age is correct, but we can add native support in the library. Can you elaborate on gender/sex? What kind of values are you looking to match? Coming back to the issue you are seeing with a lower score on records that visually seem stronger:
In this record, none of the elements have a strong match.
This record on the other hand has 2 strong matches
With regard to the zip code being a proxy for city and state: we did consider that, but working out which city or state a zip code points to would require us to look up external repositories (like the ones maintained by US postal offices). Hope this helps.
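Short of an external zip code lookup, part of the address gap in this thread comes from abbreviations ("Boulevard" versus "Blvd"). The sketch below normalizes those before any matching happens. It is plain Java with a deliberately tiny, hypothetical abbreviation table; `AddressNormalizer` and `NORMALIZE` are illustrative names, not library API.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical pre-processing step, written without any fuzzy-matcher types.
final class AddressNormalizer {
    // Illustrative table only. A real one would be much larger, and "st"
    // is ambiguous ("street" vs "saint"), which this sketch ignores.
    private static final Map<String, String> ABBREVIATIONS = Map.of(
            "blvd", "boulevard",
            "st", "street",
            "ave", "avenue");

    // Lower-cases the input, splits on anything that is not a letter or
    // digit, expands known abbreviations, and rejoins with single spaces.
    static final Function<String, String> NORMALIZE = raw ->
            Arrays.stream(raw.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
                    .filter(word -> !word.isEmpty())
                    .map(word -> ABBREVIATIONS.getOrDefault(word, word))
                    .collect(Collectors.joining(" "));

    public static void main(String[] args) {
        System.out.println(NORMALIZE.apply("47 Ventura Blvd Somerset Somerset NJ 08873"));
        System.out.println(NORMALIZE.apply("47 Ventura Boulevard 08873"));
    }
}
```

After normalization both addresses share the tokens "47", "ventura", "boulevard" and "08873", which gives a token-based matcher more to work with. Mapping a zip code to its city and state would still need an external data source.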
Hi Manish,
Thanks for the detailed response. My responses are inline:
On Thu, Sep 3, 2020 at 1:29 PM Manish Bhatia ***@***.***> wrote:
Vishal,
On the request to add additional elements: yes, absolutely. We are always
looking to enhance this with new elements. Your usage of NUMBER for age is
correct, but we can add native support in the library.
My thinking is age value can differ slightly and a 1 year difference in
value should give us a strong match
This would be great.
Can you elaborate on gender/sex? What kind of values are you looking to
match?
We try to make elements in this library that have fuzziness in them. For
boolean matches like gender, I am trying to understand where you think a
fuzzy match can be useful.
To give you some background on what we are trying to do, we are evaluating
the tool for contact tracing de-duplication. The way we have formulated the
gender question is similar to this:
What is your gender? 1) Female 2) Male 3) Others, Please Specify?
In the case of some transmissions, such as HIV, the response to gender
becomes more ambiguous and varied. I was wondering if your tool could help
deal with that.
Coming back to the issue you are seeing with a lower score on records
that visually seem stronger.
Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]}
Matched With: {[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'ssmithers@cox.net'}, {'732-478-7394'}]} Score: 0.5357142857142857
In this record, none of the elements have a strong match.
- The name has 1 word in common
- The address has a few words missing
- The email again has some similarity but does not match exactly
- For the phone number, we look for all 10 digits to be the same; in this
case the last digit (3 vs 4) is a mismatch
In our case again, a contact tracer or interviewee might get a digit or
two wrong. Any suggestions on how to deal with that?
Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]}
Matched With: {[{'Susan Smith'}, {'2 88th St Somerville Somerset NJ 08876'}, ***@***.***'}, {'908-765-1239'}]} Score: 0.5857864376269051
This record on the other hand has 2 strong matches
- The name gives an exact match: the words "Susan" and "Susanna" are
considered to be the same using the Soundex algorithm
- The email is also considered an exact match, since we disregard the
domain when we run matches
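To see why "Susan" and "Susanna" collapse to the same name while "Smith" and "Smithers" do not, here is a small sketch of classic American Soundex. `SimpleSoundex` is a hypothetical stand-alone class. Whether the library uses exactly this variant is an assumption.

```java
import java.util.Locale;

// Stand-alone sketch of American Soundex: first letter plus three digits.
final class SimpleSoundex {
    // Maps a letter to its Soundex digit; vowels, Y, H and W map to '0'.
    static char code(char c) {
        switch (c) {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0';
        }
    }

    static String encode(String word) {
        String letters = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (letters.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(letters.charAt(0));
        char previous = code(letters.charAt(0));
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate equal codes
            }
            char current = code(c);
            if (current != '0' && current != previous) {
                out.append(current);
            }
            previous = current; // a vowel resets the run of equal codes
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Susan", "Susanna", "Smith", "Smithers"}) {
            System.out.println(name + " -> " + encode(name));
        }
    }
}
```

Under this variant "Susan" and "Susanna" both encode to S250, whereas "Smith" gives S530 and "Smithers" gives S536, which is consistent with the first record having only one name word in common.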
I think I was thrown because the phone number and address were much closer
in case 1) (especially phone number with a higher weight), but your
explanation now makes sense.
With regard to the zip code being a proxy for city and state: we did
consider that, but working out which city or state a zip code points to
would require us to look up external repositories (like the ones
maintained by US postal offices).
That would be difficult to maintain in a standalone Java library.
That said, we do have hooks to pre-process the address before the library
starts fuzzy matching. Each element accepts a pre-processing Java function,
where we could perform some normalization of the address.
Do you have some examples on how to do that?
Hope this helps.
Absolutely. I had an additional question. Have you considered using
penalties for mismatches (from what I understood, you only use additive
scoring)? This way, if a phone number is missing, there is a lower penalty
than for incorrect phone numbers.
Thanks,
Vishal.
I'll add an issue to get age support going. For gender, let me think through the problem and see if we can allow some custom matching for boolean values. For phone numbers, there is some support already: the phone element with just a 1-digit mismatch got a 0.5 score instead of 0 in the second example. To enhance this logic, a custom tokenizer function could be applied here that matches 8 or 9 digits instead of 10.
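As a rough illustration of the "8 or 9 digits instead of 10" idea, the sketch below compares two phone numbers position by position and accepts them when enough digits agree. It is plain Java and does not use the library's tokenizer hook. `PhoneNearMatch` and its methods are hypothetical names.

```java
// Hypothetical helper, independent of fuzzy-matcher's own tokenizer API.
final class PhoneNearMatch {
    // Keeps only the digits, so "732-478-7394" becomes "7324787394".
    static String digitsOnly(String phone) {
        return phone.replaceAll("\\D", "");
    }

    // Counts the positions at which both numbers carry the same digit.
    static int matchingDigits(String a, String b) {
        String x = digitsOnly(a);
        String y = digitsOnly(b);
        int length = Math.min(x.length(), y.length());
        int same = 0;
        for (int i = 0; i < length; i++) {
            if (x.charAt(i) == y.charAt(i)) {
                same++;
            }
        }
        return same;
    }

    // Treats two 10-digit numbers as a match when at least minSame
    // positions agree.
    static boolean nearMatch(String a, String b, int minSame) {
        return digitsOnly(a).length() == 10
                && digitsOnly(b).length() == 10
                && matchingDigits(a, b) >= minSame;
    }

    public static void main(String[] args) {
        System.out.println(nearMatch("7324787393", "732-478-7394", 9));
        System.out.println(nearMatch("7324787393", "908-765-1239", 9));
    }
}
```

A positional comparison tolerates a mistyped digit but not a dropped or inserted one, which shifts every later position. An edit-distance comparison would be the alternative if that case matters.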
Along similar lines for the address field, we can write a custom pre-processing function: instead of the simple lambda function above, you could write something more elaborate that uses APIs connecting to a US Postal ZIP code service and feeds a normalized address to the library. As for the scoring you mentioned, the library tries not to punish results for lack of data. Elements that do not match get a 0 score, and a missing element gets a default 0.5 score. So, in a way, the average score at the document level is punished for incorrect matches, where a 0 score pulls it down. Let me know if you do not see this in your examples. Thanks
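To make the "0 for a mismatch, 0.5 for a missing element" rule concrete, here is a sketch that assumes the document score is a plain weighted average of element scores. The library's real aggregation may well differ. `ScoreSketch` is a hypothetical name.

```java
// Illustrative only: assumes a plain weighted average, which may not be
// how the library actually combines element scores.
final class ScoreSketch {
    static final double MISSING_DEFAULT = 0.5;

    // scores[i] is null when element i is missing; otherwise it is the
    // element's match score between 0.0 and 1.0.
    static double weightedAverage(Double[] scores, double[] weights) {
        double total = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            double score = (scores[i] == null) ? MISSING_DEFAULT : scores[i];
            total += score * weights[i];
            weightSum += weights[i];
        }
        return weightSum == 0.0 ? 0.0 : total / weightSum;
    }

    public static void main(String[] args) {
        // Order: name, address, phone (weight 2), email.
        double[] weights = {1.0, 1.0, 2.0, 1.0};
        Double[] wrongPhone = {1.0, 1.0, 0.0, 1.0};
        Double[] missingPhone = {1.0, 1.0, null, 1.0};
        System.out.println("wrong phone:   " + weightedAverage(wrongPhone, weights));
        System.out.println("missing phone: " + weightedAverage(missingPhone, weights));
    }
}
```

Under these assumptions a wrong phone number drags the document to 3/5 = 0.6, while a missing one leaves it at 4/5 = 0.8, so a mismatch is already penalized more than an absence without a separate penalty term.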
Closing this issue; we released support for the AGE ElementType in 1.0.4.
Hi Manish,
This is an extremely neat tool you have developed, kudos!! I have been playing around with it and have run into a couple of issues. I was hoping you could help me resolve them.
I have been trying to match a single record against a database of records, scaling the database from 10,000 to 500,000 records. This is how I have been configuring the database:
new Document.Builder(csv[0])
        .addElement(new Element.Builder().setType(NAME).setValue(getName(csv)).createElement())
        .addElement(new Element.Builder().setType(ADDRESS).setValue(getAddress(csv)).createElement())
        .addElement(new Element.Builder().setType(PHONE).setValue(csv[8]).setWeight(2)
                .setThreshold(0.5).createElement())
        .addElement(new Element.Builder().setType(EMAIL).setValue(csv[10]).createElement())
        .setThreshold(0.5)
        .createDocument();
But I am seeing some anomalies.
result.entrySet().forEach(entry -> {
    entry.getValue().forEach(match -> {
        System.out.println("Person searched: " + match.getData() + "\nMatched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
    });
});
I don't have a unique identifier for each record. This is what my CSV looks like:
"first_name","last_name","company_name","address","city","county","state","zip","phone1","phone2","email","web"
Do you think that might be the issue?
Person searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'ssmithers@gmail.com'}, {'7324787395'}]}
Matched With: {[{'Susanna Desiga'}, {'4 W Broad St San Juan Capistrano Orange CA 92675'}, {'susanna@aol.com'}, {'949-622-6261'}]} Score: 0.8668599263800767
Given the only thing remotely in common is the first name, I am wondering why there is such a high matching score. Whereas something like:
Person searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'ssmithers@gmail.com'}, {'7324787395'}]}
Matched With: {[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'ssmithers@cox.net'}, {'732-478-7394'}]} Score: 0.7142857142857143
which is actually a better match is only getting a score of 0.7142.
Other examples of "bad" matches with good scores:
searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'ssmithers@gmail.com'}, {'7324787395'}]}
Matched With: {[{'Susanna Fedak'}, {'4983 Mcallister St Cambridge Middlesex MA 02138'}, {'sfedak@fedak.org'}, {'617-357-4376'}]} Score: 0.7142857142857143
searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'ssmithers@gmail.com'}, {'7324787395'}]}
Matched With: {[{'Susanna Molavi'}, {'17389 Market St #8 Pearl City Honolulu HI 96782'}, {'smolavi@molavi.org'}, {'808-723-3110'}]} Score: 0.7142857142857143
And only these "bad" matches were returned despite the "good" match being present in the database:
{[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'ssmithers@cox.net'}, {'732-478-7394'}]} Score: 0.7142857142857143
Any suggestions on how to tune the match database to get better results than this? Would be greatly appreciated!
FYI, the data I am using here is all fake.
Thanks!
Vishal.