Some issues with the matching #34

Closed
vishaln79 opened this issue Sep 2, 2020 · 7 comments

@vishaln79

vishaln79 commented Sep 2, 2020

Hi Manish,

This is an extremely neat tool you have developed, kudos!! I have been playing around with it and have run into a couple of issues. I was hoping you would help me resolve them.

I have been trying to match a single record against a database of records, scaling the database up from 10000 to 500000 records. This is how I have been configuring the database:

new Document.Builder(csv[0])
    .addElement(new Element.Builder().setType(NAME).setValue(getName(csv)).createElement())
    .addElement(new Element.Builder().setType(ADDRESS).setValue(getAddress(csv)).createElement())
    .addElement(new Element.Builder().setType(PHONE).setValue(csv[8]).setWeight(2).setThreshold(0.5).createElement())
    .addElement(new Element.Builder().setType(EMAIL).setValue(csv[10]).createElement())
    .setThreshold(0.5)
    .createDocument();

But I am seeing some anomalies.

  1. I always see only one record returned, whereas I expect all records with a score above the 0.5 threshold. I know that in each case there are multiple records that should pass the threshold. This is how I am printing the records:
    result.entrySet().forEach(entry -> {
        entry.getValue().forEach(match -> {
            System.out.println("Person searched: " + match.getData() + "\nMatched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
        });
    });

I don't have a unique identifier for each record. This is what my CSV looks like:
"first_name","last_name","company_name","address","city","county","state","zip","phone1","phone2","email","web"
Do you think that might be the issue?

  2. I am seeing something like this:
    Person searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'ssmithers@gmail.com'}, {'7324787395'}]}
    Matched With: {[{'Susanna Desiga'}, {'4 W Broad St San Juan Capistrano Orange CA 92675'}, {'susanna@aol.com'}, {'949-622-6261'}]} Score: 0.8668599263800767

Given that the only thing remotely in common is the first name, I am wondering why the matching score is so high, whereas something like:
Person searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'ssmithers@gmail.com'}, {'7324787395'}]}
Matched With: {[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'ssmithers@cox.net'}, {'732-478-7394'}]} Score: 0.7142857142857143
which is actually a better match, only gets a score of 0.7142.
Other examples of "bad" matches with good scores:
searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'ssmithers@gmail.com'}, {'7324787395'}]}
Matched With: {[{'Susanna Fedak'}, {'4983 Mcallister St Cambridge Middlesex MA 02138'}, {'sfedak@fedak.org'}, {'617-357-4376'}]} Score: 0.7142857142857143
searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'ssmithers@gmail.com'}, {'7324787395'}]}
Matched With: {[{'Susanna Molavi'}, {'17389 Market St #8 Pearl City Honolulu HI 96782'}, {'smolavi@molavi.org'}, {'808-723-3110'}]} Score: 0.7142857142857143

And only these "bad" matches were returned despite the "good" match being present in the database:
{[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'ssmithers@cox.net'}, {'732-478-7394'}]} Score: 0.7142857142857143

Any suggestions on how to tune the match database to get better results than this? Would be greatly appreciated!

FYI, the data I am using here is all fake.

Thanks!
Vishal.

@manishobhatia
Contributor

Hi Vishal,
I think the root cause of both issues might be the lack of a unique identifier for each record.
If you are not able to pull a unique id from the db record, I would recommend creating one while generating the Document object.

There are some examples of generating one in the unit tests.

The library uses this key internally in many places, and I have a hunch that its absence is also causing the bad matches to surface.
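
As a minimal sketch of generating such a key: in the snippet from the original post, `csv[0]` is the `first_name` column, so every row sharing a first name would get the same document key. The class and method names below (`RowKeys`, `assignKeys`) are hypothetical and not part of the library; the idea is only to number rows in read order and pass the resulting string to `new Document.Builder(...)` in place of `csv[0]`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowKeys {
    // Assigns a distinct key to each CSV row by numbering rows in read order.
    // The "row-" prefix is arbitrary; any value that is unique per row would do.
    static List<String> assignKeys(List<String[]> rows) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            keys.add("row-" + (i + 1));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Two rows with the same first name would collide if csv[0] were the key.
        List<String[]> rows = Arrays.asList(
            new String[] {"Susanna", "Smithers"},
            new String[] {"Susanna", "Fedak"});
        System.out.println(assignKeys(rows));
    }
}
```

A `java.util.UUID.randomUUID().toString()` per row would work as well, at the cost of keys that are not reproducible between runs.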

hope this helps

Thanks,
Manish

@vishaln79
Author

vishaln79 commented Sep 2, 2020

Thanks @manishobhatia. I actually tried this method earlier; it worked, and I saw multiple results as well. But I am still seeing some interesting results, which may be a question of configuration. For example:

Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]}
Matched With: {[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'ssmithers@cox.net'}, {'732-478-7394'}]} Score: 0.5357142857142857
Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]}
Matched With: {[{'Susan Smith'}, {'2 88th St Somerville Somerset NJ 08876'}, {'ssmith@curb.org'}, {'908-765-1239'}]} Score: 0.5857864376269051

Since the phone number has been given a greater weight, I am surprised that the second match has a higher score. What do you think might be happening? Is it because the name match is better?

Also, do you have any suggestions for making the address search a bit smarter, so that the zip code can act as a proxy for the city and state?

@vishaln79
Author

Also, would you be able to accommodate requests to add elements for approximate age (I guess we can still use the NUMBER element for this) and gender/sex? Thanks Manish!

@manishobhatia
Contributor

Vishal,

On the request to add additional elements: yes, absolutely. We are always looking to enhance the library with new elements. Your usage of NUMBER for age is correct, but we can add native support in the library.
My thinking is that age values can differ slightly, and a 1-year difference should still give us a strong match.

Can you elaborate on gender/sex? What kind of values are you looking to match?
We try to build elements in this library that have some fuzziness in them. For boolean matches like gender, I am trying to understand where you think a fuzzy match can be useful.

Coming back to the issue you are seeing, where records that visually seem stronger get a lower score:

Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]}
Matched With: {[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'ssmithers@cox.net'}, {'732-478-7394'}]} Score: 0.5357142857142857

In this record, none of the elements have a strong match.

  • The name has only 1 word in common
  • The address has a few words missing
  • The email has some similarity but does not match exactly
  • For the phone number, we look for all 10 digits to be the same; in this case the last digit (3 vs 4) is a mismatch

Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]}
Matched With: {[{'Susan Smith'}, {'2 88th St Somerville Somerset NJ 08876'}, {'ssmith@curb.org'}, {'908-765-1239'}]} Score: 0.5857864376269051

This record, on the other hand, has 2 strong matches:

  • The name is an exact match; the words "Susan" and "Susanna" are considered the same by the Soundex algorithm
  • The email is also considered an exact match, since we disregard the domain when we run matches
With regards to the zip code being a proxy for city and state: we did consider that. The problem is that working out which city or state a zip code points to would require us to look up external repositories (like the ones maintained by the US Postal Service).
That would be difficult to maintain in a standalone Java library.
That said, we do have hooks to pre-process the address before the library starts fuzzy matching. Each element accepts a pre-processing Java function, where we could perform some normalization of the address.

Hope this helps.

@vishaln79
Author

vishaln79 commented Sep 4, 2020 via email

@manishobhatia
Contributor

I'll add an issue to get age support going. For gender, let me think through the problem and see if we can allow some custom matching for boolean values.

For the phone number, there is some support already. In the second example, the phone element with just a 1-digit mismatch got a 0.5 score instead of 0.
The phone element goes through a conversion that strips all non-digits and prepends the US country code. So a number like 732-478-7394 gets converted to 17324787394. This makes it an 11-digit number, which is split into 10-digit tokens (2 tokens), i.e. 1732478739 and 7324787394, and we look for either of them to match a token of the other number.
In your example, the same logic was applied to the other number: 7324787393 was converted into the tokens 1732478739 and 7324787393, and the first token found a match, giving the whole element a 0.5 match score.

To enhance this logic, a custom tokenizer function can be applied that matches on 8 or 9 digits instead of 10.
Example:

Function<Element<String>, Stream<Token<String>>> customTokenizer = (element) -> TokenizerFunction.getNGramTokens(9, element);
Element elem = new Element.Builder().setType(PHONE).setValue("17324787394").setTokenizerFunction(customTokenizer).createElement();
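
To see how the token length changes the outcome, here is a standalone sketch of the n-gram logic described above. It does not call the library; `normalize`, `nGrams` and `score` are hypothetical helpers, and the score (shared tokens divided by the larger token count) is a simplification of whatever the library computes internally.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class PhoneNGramSketch {
    // Strips non-digits and prepends the US country code to 10-digit numbers.
    static String normalize(String phone) {
        String digits = phone.replaceAll("\\D", "");
        return digits.length() == 10 ? "1" + digits : digits;
    }

    // Sliding-window tokens of length n; a short value becomes a single token.
    static Set<String> nGrams(String value, int n) {
        Set<String> tokens = new LinkedHashSet<>();
        if (value.length() <= n) {
            tokens.add(value);
            return tokens;
        }
        for (int i = 0; i + n <= value.length(); i++) {
            tokens.add(value.substring(i, i + n));
        }
        return tokens;
    }

    // Simplified element score: shared tokens divided by the larger token count.
    static double score(String a, String b, int n) {
        Set<String> tokensA = nGrams(normalize(a), n);
        Set<String> tokensB = nGrams(normalize(b), n);
        Set<String> shared = new LinkedHashSet<>(tokensA);
        shared.retainAll(tokensB);
        return (double) shared.size() / Math.max(tokensA.size(), tokensB.size());
    }

    public static void main(String[] args) {
        String searched = "7324787393";
        String candidate = "732-478-7394";
        System.out.println(nGrams(normalize(candidate), 10)); // [1732478739, 7324787394]
        System.out.println(score(searched, candidate, 10));   // 1 of 2 tokens shared
        System.out.println(score(searched, candidate, 9));    // 2 of 3 tokens shared
    }
}
```

With 10-digit tokens the two numbers share one of two tokens (0.5); with 9-digit tokens they share two of three, so a single wrong trailing digit is penalized less.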

Along similar lines, for the address field we can write a custom pre-processing function. Instead of a simple lambda function like the one above, you could write something more elaborate that makes use of APIs that connect to a US Postal ZIP code service and feeds a normalized address to the library.
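
A rough sketch of what such a pre-processing function could look like is below. The ZIP lookup is a hypothetical in-memory map standing in for an external service, the abbreviation rules are illustrative only, and how the function is registered on the element is left out because it depends on the library version.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressPreProcessSketch {
    // Hypothetical stand-in for an external ZIP code lookup (assumption, not a real service).
    static final Map<String, String> ZIP_TO_CITY_STATE = new HashMap<>();
    static {
        ZIP_TO_CITY_STATE.put("08873", "somerset nj");
    }

    private static final Pattern FIVE_DIGITS = Pattern.compile("\\b\\d{5}\\b");

    // Lower-cases, collapses whitespace, abbreviates a few street types,
    // and inserts the city and state implied by the ZIP code when they are missing.
    static final Function<String, String> NORMALIZE_ADDRESS = raw -> {
        String s = raw.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
        s = s.replaceAll("\\bboulevard\\b", "blvd").replaceAll("\\bstreet\\b", "st");
        // Treat the last 5-digit group as the ZIP code.
        Matcher m = FIVE_DIGITS.matcher(s);
        int zipStart = -1;
        String zip = null;
        while (m.find()) {
            zipStart = m.start();
            zip = m.group();
        }
        String cityState = (zip == null) ? null : ZIP_TO_CITY_STATE.get(zip);
        if (cityState != null && !s.contains(cityState)) {
            s = s.substring(0, zipStart) + cityState + " " + s.substring(zipStart);
        }
        return s;
    };

    public static void main(String[] args) {
        // prints "47 ventura blvd somerset nj 08873"
        System.out.println(NORMALIZE_ADDRESS.apply("47 Ventura Boulevard 08873"));
    }
}
```

After this normalization the searched address shares more words with "47 Ventura Blvd Somerset Somerset NJ 08873", which should help a word-based address comparison.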

For the scoring you mentioned, the library tries not to punish the results for a lack of data. Elements that do not match get a 0 score, and a missing element gets a default 0.5 score. So, in a way, the average score at the document level is punished for incorrect matches, where a 0 score pulls it down. Let me know if you do not see this in your examples.
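
The effect of the default score can be illustrated with a plain unweighted mean. This is only an assumption made for illustration: the library's actual aggregation and weighting may differ, and the class and method names are hypothetical. A `null` entry stands for a missing element.

```java
class DocumentScoreSketch {
    // Default score given to an element that is missing, as described above.
    static final double MISSING_DEFAULT = 0.5;

    // Unweighted mean of element scores; null marks a missing element.
    // Illustrative only, not the library's scoring formula.
    static double documentScore(Double... elementScores) {
        if (elementScores.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (Double score : elementScores) {
            sum += (score == null) ? MISSING_DEFAULT : score;
        }
        return sum / elementScores.length;
    }

    public static void main(String[] args) {
        // Two exact matches, one missing element, one mismatch.
        System.out.println(documentScore(1.0, 1.0, null, 0.0));  // 0.625
        // Same record, but the fourth element is missing rather than wrong.
        System.out.println(documentScore(1.0, 1.0, null, null)); // 0.75
    }
}
```

Under this simplification a wrong value costs more than an absent one, which is the behaviour described above.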

Thanks

@manishobhatia
Contributor

Closing this issue; we released support for the AGE ElementType in 1.0.4.
Feel free to open a new issue if there are more questions.
