
Improve performance of household inference #37

Merged
dehall merged 3 commits into master from household_perf on Jan 4, 2023
Conversation

@dehall (Collaborator) commented Oct 11, 2022

The household inference process in households.py is very slow and memory-intensive, and for large datasets it is likely to crash with a MemoryError (or be killed by the OS). The root cause is that the current indexing approach discards too few candidates, so the system tries to load close to n² candidate pairs into memory.

As of now the indexing approach uses a sortedneighborhood index (with the default window=3) on zip code and another sortedneighborhood index on street address. The zip index doesn't make sense when working with a single geographic region, because all the zip codes will start with the same few digits. (For example, we saw in Denver that there are dozens of 801xx codes.) The tentative approach, which gets the code to run on my system with a set of 3 million rows, is to use two block indexes: one on zip code plus street name and one on zip code plus family name. This results in only candidate pairs that match exactly on at least one of those two field pairs.
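The union-of-two-block-indexes idea above can be sketched in plain Python. This is a minimal illustration of the blocking concept, not the recordlinkage implementation used in households.py; the records and field values are made up:

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, keys):
    """Group record indices by an exact-match key tuple and pair within each group."""
    buckets = defaultdict(list)
    for i, rec in enumerate(records):
        buckets[tuple(rec[k] for k in keys)].append(i)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))
    return pairs

# Hypothetical records; field names mirror the PR's blocking keys.
records = [
    {"household_zip": "80203", "street": "Main St", "family_name": "Smith"},
    {"household_zip": "80203", "street": "Main St", "family_name": "Smyth"},
    {"household_zip": "80203", "street": "Oak Ave", "family_name": "Smith"},
    {"household_zip": "80110", "street": "Main St", "family_name": "Jones"},
]

# Union of the two block indexes: a pair survives only if it matches
# exactly on zip+street OR on zip+family name.
candidates = (block_pairs(records, ["household_zip", "street"])
              | block_pairs(records, ["household_zip", "family_name"]))
print(sorted(candidates))  # → [(0, 1), (0, 2)]
```

Records 0 and 1 share zip+street, records 0 and 2 share zip+family name, and record 3 (different zip) is never paired, so the candidate set stays far smaller than the full n² cross product.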

Some notes on performance and quality:

  • It's still not fast on 3 million records, and it will still be memory-constrained at some point if the dataset gets large enough. This is a quick fix intended to get things working; we can revisit and rearchitect if needed.
  • The "ground truth" I'm working with seems to have some quirks. I haven't dug too deeply into this so for now I'm less worried about the absolute scores and more interested in the relative difference between scores and times for each approach.
  • I used the Adjusted Rand Index from sklearn to evaluate this: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
| Approach | Score | Candidate links on synthetic NC site A | Runtime on 3M dataset | Candidate links on 3M |
| --- | --- | --- | --- | --- |
| indexer.full() | 0.72257 | 334971 | n/a (crashed) | |
| indexer.sortedneighborhood (current on master) | 0.72257 | 16239 | n/a (crashed) | |
| indexer.block([zip, addr]) (initial PR) | 0.56015 | 279 | 4s index, 21m processing | 5012942 |
| indexer.block(zip only) | 0.71013 | 6273 | n/a (crashed) | |
| indexer.block(['household_zip', 'street']) + indexer.block(['household_zip', 'family_name']) (PR as currently submitted) | 0.69025 | 428 | 1m index, 48m processing | 12480397 |
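The Adjusted Rand Index used for the Score column comes from sklearn, as linked above. A minimal sketch with made-up household labels shows the key property, that ARI compares partitions rather than literal label values:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels: ground-truth household IDs vs. inferred household IDs.
truth    = [0, 0, 1, 1, 2, 2]
inferred = [1, 1, 0, 0, 2, 2]   # same partition under different label names

# ARI is invariant to label permutation: identical partitions score 1.0.
print(adjusted_rand_score(truth, inferred))  # → 1.0

# A clustering that splits the households differently scores lower.
worse = [0, 1, 0, 1, 2, 2]
print(adjusted_rand_score(truth, worse))
```

This label-invariance is what makes ARI suitable here: the inferred household IDs never need to match the ground-truth IDs by name, only by grouping.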

@dehall dehall changed the title WIP: Improve performance of household inference Improve performance of household inference Oct 24, 2022
@dehall dehall marked this pull request as ready for review October 24, 2022 15:07
@radamson (Collaborator) commented

> The zip index doesn't make sense when working with a single geographic region, because all the zip codes will start with the same few digits.

Can you explain this a bit more and why the runtime is n/a?

Blocking on zip only seems to maintain the highest score, and the most populous zip codes in the US contain under 150,000 individuals.

@dehall (Collaborator, Author) commented Oct 25, 2022

Sure. The various comments are really referring to two different things: the sortedneighborhood approach currently used on master, and a block on zip only.
The idea I think doesn't make sense is the sortedneighborhood on the zip code field when you're working with a single geographic area. As I understand it, sortedneighborhood sorts the records on the key and produces candidate pairs for records within a window of 3 positions in the sorted order (by default; the window is configurable). When all zip codes in a region share the same leading digits, sorting places zip codes that are text-wise very close but geographically distinct next to each other, which results in far more candidate pairs than are necessary.

The block on zip only does seem to produce good results, but the n/a in the runtime column means it didn't complete: it was killed by the OS because it ran out of memory.
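The window behavior described above can be sketched in pure Python. This is a simplified model of sorted-neighborhood indexing with hypothetical zip values (recordlinkage's actual implementation differs in details), contrasting it with exact blocking:

```python
from itertools import combinations

# Hypothetical zip codes from a single metro area: all distinct, but
# text-wise adjacent once sorted (cf. the 801xx codes seen in Denver).
zips = ["80110", "80111", "80112", "80113", "80120", "80121", "80122"]

# Exact blocking: candidate pairs only where the zip codes are identical.
block = [(a, b) for a, b in combinations(range(len(zips)), 2)
         if zips[a] == zips[b]]

# Sorted neighborhood, window=3: sort on the key, then pair each record
# with neighbors within (window - 1) // 2 positions in the sorted order.
order = sorted(range(len(zips)), key=lambda i: zips[i])
half = (3 - 1) // 2
sn = [(order[i], order[j]) for i in range(len(order))
      for j in range(i + 1, min(i + half + 1, len(order)))]

print(len(block), len(sn))  # → 0 6: blocking links nothing, SN pairs every sorted neighbor
```

With distinct but adjacent zips, exact blocking produces zero candidate pairs while sorted neighborhood pairs every record with its sorted neighbors, which is exactly the over-generation problem within a single region.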

@dehall dehall merged commit 5f449ee into master Jan 4, 2023
@dehall dehall deleted the household_perf branch January 4, 2023 23:19