
Improve performance of household inference #37

Merged
dehall merged 3 commits into master from household_perf on Jan 4, 2023
Conversation

@dehall (Collaborator) commented Oct 11, 2022

The household inference process in households.py is very slow and memory-intensive, and for large datasets it is likely to crash with a MemoryError (or be killed by the OS). The root cause is that the current indexing approach discards too few candidates, so the system tries to load close to n² candidate pairs into memory.

As of now the indexing approach uses a sortedneighborhood index (with the default window=3) on zip code and another sortedneighborhood index on street address. The zip index doesn't make sense when working with a single geographic region, because all the zip codes will start with the same few digits. (For example, we saw in Denver that there are dozens of 801xx codes.) The tentative approach, which gets the code to run on my system with a set of 3 million rows, is to use two block indexes: one on zip code plus street name and one on zip code plus family name. This results in only candidate pairs that match exactly on at least one of those two field pairs.
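The union-of-two-block-indexes idea above can be sketched in plain Python. This is a minimal illustration of the blocking concept, not the recordlinkage implementation used in households.py; the records and field values are made up:

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, keys):
    """Group record indices by an exact-match key tuple and pair within each group."""
    buckets = defaultdict(list)
    for i, rec in enumerate(records):
        buckets[tuple(rec[k] for k in keys)].append(i)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))
    return pairs

# Hypothetical records; field names mirror the PR's blocking keys.
records = [
    {"household_zip": "80203", "street": "Main St", "family_name": "Smith"},
    {"household_zip": "80203", "street": "Main St", "family_name": "Smyth"},
    {"household_zip": "80203", "street": "Oak Ave", "family_name": "Smith"},
    {"household_zip": "80110", "street": "Main St", "family_name": "Jones"},
]

# Union of the two block indexes: a pair survives only if it matches
# exactly on zip+street OR on zip+family name.
candidates = (block_pairs(records, ["household_zip", "street"])
              | block_pairs(records, ["household_zip", "family_name"]))
print(sorted(candidates))  # → [(0, 1), (0, 2)]
```

Records 0 and 1 share zip+street, records 0 and 2 share zip+family name, and record 3 (different zip) is never paired, so the candidate set stays far smaller than the full n² cross product.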

Some notes on performance and quality:

  • It's still not fast on 3 million records, and it will still be memory-constrained at some point if the dataset gets large enough. This is a quick fix intended to get things working; we can revisit and rearchitect if needed.
  • The "ground truth" I'm working with seems to have some quirks. I haven't dug too deeply into this so for now I'm less worried about the absolute scores and more interested in the relative difference between scores and times for each approach.
  • I used the Adjusted Rand Index from sklearn to evaluate this: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
| Approach | Score | Candidate links on synthetic NC site A | Runtime on 3M dataset | Candidate links on 3M |
| --- | --- | --- | --- | --- |
| indexer.full() | 0.72257 | 334971 | n/a (crashed) | |
| indexer.sortedneighborhood (current on master) | 0.72257 | 16239 | n/a (crashed) | |
| indexer.block([zip, addr]) (initial PR) | 0.56015 | 279 | 4s index, 21m processing | 5012942 |
| indexer.block(zip only) | 0.71013 | 6273 | n/a (crashed) | |
| indexer.block(['household_zip', 'street']) + indexer.block(['household_zip', 'family_name']) (PR as currently submitted) | 0.69025 | 428 | 1m index, 48m processing | 12480397 |
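The Adjusted Rand Index used for the Score column comes from sklearn, as linked above. A minimal sketch with made-up household labels shows the key property, that ARI compares partitions rather than literal label values:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels: ground-truth household IDs vs. inferred household IDs.
truth    = [0, 0, 1, 1, 2, 2]
inferred = [1, 1, 0, 0, 2, 2]   # same partition under different label names

# ARI is invariant to label permutation: identical partitions score 1.0.
print(adjusted_rand_score(truth, inferred))  # → 1.0

# A clustering that splits the households differently scores lower.
worse = [0, 1, 0, 1, 2, 2]
print(adjusted_rand_score(truth, worse))
```

This label-invariance is what makes ARI suitable here: the inferred household IDs never need to match the ground-truth IDs by name, only by grouping.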

@dehall dehall changed the title WIP: Improve performance of household inference Improve performance of household inference Oct 24, 2022
@dehall dehall marked this pull request as ready for review October 24, 2022 15:07
@radamson (Collaborator) commented

> The zip index doesn't make sense when working with a single geographic region, because all the zip codes will start with the same few digits.

Can you explain this a bit more and why the runtime is n/a?

Blocking on zip only seems to maintain the highest score, and the most populous zip codes in the US contain under 150,000 individuals.

@dehall (Collaborator, Author) commented Oct 25, 2022

Sure. The various comments are really referring to two different things: the sortedneighborhood approach currently used on master, and a block on zip only.
The idea I think doesn't make sense is the sortedneighborhood on the zip code field when you're working with a single geographic area. As I understand it, sortedneighborhood sorts the records on the key and produces candidate pairs for records within a window of 3 positions in the sorted order (by default; the window is configurable). When all zip codes in a region share the same leading digits, sorting places zip codes that are text-wise very close but geographically distinct next to each other, which results in far more candidate pairs than are necessary.

The block on zip only does seem to produce good results, but the n/a in the runtime column means it didn't complete: it was killed by the OS because it ran out of memory.
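The window behavior described above can be sketched in pure Python. This is a simplified model of sorted-neighborhood indexing with hypothetical zip values (recordlinkage's actual implementation differs in details), contrasting it with exact blocking:

```python
from itertools import combinations

# Hypothetical zip codes from a single metro area: all distinct, but
# text-wise adjacent once sorted (cf. the 801xx codes seen in Denver).
zips = ["80110", "80111", "80112", "80113", "80120", "80121", "80122"]

# Exact blocking: candidate pairs only where the zip codes are identical.
block = [(a, b) for a, b in combinations(range(len(zips)), 2)
         if zips[a] == zips[b]]

# Sorted neighborhood, window=3: sort on the key, then pair each record
# with neighbors within (window - 1) // 2 positions in the sorted order.
order = sorted(range(len(zips)), key=lambda i: zips[i])
half = (3 - 1) // 2
sn = [(order[i], order[j]) for i in range(len(order))
      for j in range(i + 1, min(i + half + 1, len(order)))]

print(len(block), len(sn))  # → 0 6: blocking links nothing, SN pairs every sorted neighbor
```

With distinct but adjacent zips, exact blocking produces zero candidate pairs while sorted neighborhood pairs every record with its sorted neighbors, which is exactly the over-generation problem within a single region.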

@dehall dehall merged commit 5f449ee into master Jan 4, 2023
@dehall dehall deleted the household_perf branch January 4, 2023 23:19