Weighting the matching fields #49
Comments
One suggestion: you should include exact matches in your matching process. This helps the algorithm estimate the parameters. After that, you can set the exact matches aside and focus on the non-exact ones. The algorithm is designed to self-weight each field, and my hope is that including exact matches will do just that.
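To see why keeping the exact matches in the data helps, here is a hedged Python sketch (not fastLink's implementation, which estimates these quantities via EM without labels): exact-match pairs act as unambiguous examples of true matches, anchoring the per-field m-probabilities, P(field agrees | match), against the chance-agreement rates u, P(field agrees | non-match). All data below is toy data.

```python
# Sketch: estimate per-field m- and u-probabilities from labeled pairs.
# In fastLink the labels are latent and estimated by EM; here we use the
# exact matches as known matches to show what they contribute.
from collections import defaultdict

def estimate_m_u(pairs):
    """pairs: list of (agreement, is_match), where agreement maps
    field name -> 1 if the field agrees on the pair, else 0.
    Returns {field: (m, u)}."""
    c = defaultdict(lambda: [0, 0, 0, 0])  # m_agree, m_tot, u_agree, u_tot
    for agree, is_match in pairs:
        for field, a in agree.items():
            if is_match:
                c[field][0] += a
                c[field][1] += 1
            else:
                c[field][2] += a
                c[field][3] += 1
    return {f: (v[0] / v[1], v[2] / v[3]) for f, v in c.items()}

# Toy pairs: an exact match agrees everywhere; non-matches rarely agree.
pairs = [
    ({"street": 1, "postcode": 1}, True),   # exact match kept in the data
    ({"street": 1, "postcode": 0}, True),   # near match
    ({"street": 0, "postcode": 0}, False),
    ({"street": 0, "postcode": 1}, False),  # postcode collides by chance
]
m_u = estimate_m_u(pairs)
print(m_u["street"])    # (1.0, 0.0): street agreement is highly informative
print(m_u["postcode"])  # (0.5, 0.5): postcode agreement is uninformative here
```

Dropping the exact matches before estimation removes exactly the pairs that pin down the m side of these ratios.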
Hi @paull71,

As @kosukeimai mentioned, one good idea is to include the exact matches in your original data. If your dataset is around 1,700 observations, I do not think that will create any additional bottleneck in terms of computational efficiency. Basically, adding the exact matches back could help the model learn how informative agreement on each field is.

If the problem becomes too large as a result of adding back the exact matches, I would divide it into smaller problems by blocking on a field you believe is measured with little to no error (e.g., State or Postcode).

Another idea: combine two fields into one.

Hope this helps!

All my best,
Ted
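The blocking suggestion above can be sketched in a few lines of Python (a hedged illustration, not fastLink's `blockData()` machinery): candidate pairs are generated only within groups that agree exactly on the blocking field, shrinking the comparison space from |A| × |B| to the sum of per-block products.

```python
# Sketch of blocking on a low-error field (e.g. Postcode): only records
# sharing that field's value are ever compared.
from collections import defaultdict

def block_pairs(records_a, records_b, key):
    """Yield candidate pairs (ra, rb) whose `key` field matches exactly."""
    index = defaultdict(list)
    for rb in records_b:
        index[rb[key]].append(rb)
    for ra in records_a:
        for rb in index.get(ra[key], []):
            yield ra, rb

# Toy records (hypothetical postcodes, not real data).
a = [{"id": 1, "postcode": "7000"}, {"id": 2, "postcode": "5000"}]
b = [{"id": 10, "postcode": "7000"}, {"id": 11, "postcode": "7000"},
     {"id": 12, "postcode": "5006"}]

pairs = list(block_pairs(a, b, "postcode"))
print(len(pairs))  # 2 candidate pairs instead of the 6 unblocked comparisons
```

The trade-off is the usual one: any true match whose blocking field is recorded with error is lost, which is why the blocking field should be the one you trust most.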
From my user perspective, I fully agree with the suggestions. Another idea would be to add a dedicated argument for this. Also, some care is required if you decide to combine two fields: for example, the default string-distance settings may behave differently on the longer, concatenated field than on the two fields separately.
Thanks everyone - will give your suggestions a go.
Hi Guys,
Nice work with the package.
Currently using the package to link addresses in our company database to the Government National Address File (GNAF) (https://data.gov.au/data/dataset/19432f89-dc3a-4ef3-b943-5326ef1dbecc), which contains >14 million addresses in Australia.
Aside from the size of the GNAF (which requires clustering to manage), the challenge has been weighting the fields by relative importance. Given this is all spatial, the value of matching should follow Postcode > Suburb > Street name > house number in determining matches; however, the weighting seems to be even. Is there any way of influencing the weight placed on the various fields?
For example the address that I am matching is 8, Unique, Road, Town, 7000, South Australia, row 1 in the following table.
I can see a few addresses in the GNAF on Unique street (e.g. row 2); the Town/Suburb and StreetNumber are incorrect in my database, going by the authority that is the GNAF. fastLink is picking as a match an address that has the same number but a different street in a whole other suburb (row 3). By my count row 3 has 4 agreeing fields and 2 disagreeing fields (against row 1), the same score as if it had picked row 2 to match row 1.
Is there any way of prioritising the StreetName, or any other field, over the StreetNumber?
The subset of the GNAF in South Australia has 440k rows, while the South Australian subset of my database has 164 rows. The 164 figure reflects the fact that 90% of my database can be matched exactly; the 164 are the problematic records with varying data-quality issues.
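The even-weighting observation above is worth unpacking with a hedged Fellegi–Sunter-style sketch (this is the model family fastLink estimates, but the numbers here are assumed for illustration, not estimated from any real data). Fields are not weighted by counting agreements; each field's agreement contributes log2(m/u), so a field that rarely agrees by chance (low u, e.g. a distinctive street name) outweighs one that collides often (high u, e.g. a street number):

```python
# Sketch: Fellegi-Sunter agreement weight per field, log2(m/u).
import math

def field_weight(m, u):
    """m = P(agree | match), u = P(agree | non-match)."""
    return math.log2(m / u)

# Illustrative, assumed probabilities.
street_name = field_weight(m=0.95, u=0.001)   # distinctive: rare chance agreement
street_number = field_weight(m=0.95, u=0.05)  # numbers collide across streets

print(street_name > street_number)  # True: street-name agreement counts for more
```

If the fitted weights come out roughly even, that suggests the estimated u-probabilities are similar across fields, which is exactly what the earlier comments aim to fix by adding the exact matches back into the estimation.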