Improve performance of household inference #37
Merged
The household inference process in households.py is very slow and memory-intensive, and for large datasets it is likely to crash with a MemoryError (or get killed by the OS). The root cause is that the current indexing approach doesn't discard enough candidates, so the system tries to load close to n^2 candidate pairs into memory.
Currently the indexing approach uses a sorted-neighborhood index (with the default window=3) on zip code and another sorted-neighborhood index on street address. The zip index doesn't make sense when working with a single geographic region, because all the zip codes will start with the same few digits (for example, in Denver we saw dozens of 801xx codes), so sorting on zip barely narrows the candidate set. The tentative approach, which gets the code to run on my system with a set of 3 million rows, is to use two block indexes: one on (zip, street name) and one on (zip, family name). This produces only candidate pairs that match exactly on at least one of these two field pairs.
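For illustration, here is a minimal sketch of the two-block-index idea using plain pandas self-joins rather than the project's actual indexing code; the column names (`zip`, `street_name`, `family_name`) are assumptions and may not match the real schema:

```python
import pandas as pd

def block_candidate_pairs(df, keys):
    """Self-join df on the blocking keys and return candidate pairs
    (i, j) with i < j, i.e. record pairs that match exactly on every key."""
    indexed = df.reset_index().rename(columns={"index": "rec_id"})
    merged = indexed.merge(indexed, on=keys, suffixes=("_a", "_b"))
    pairs = merged[merged["rec_id_a"] < merged["rec_id_b"]]
    return set(zip(pairs["rec_id_a"], pairs["rec_id_b"]))

# Hypothetical toy data; real column names may differ.
df = pd.DataFrame({
    "zip":         ["80110", "80110", "80110", "80210"],
    "street_name": ["Elm",   "Elm",   "Oak",   "Elm"],
    "family_name": ["Smith", "Jones", "Jones", "Smith"],
})

# Union of the two block indexes: a pair is a candidate only if it matches
# exactly on (zip, street name) OR on (zip, family name).
candidates = (block_candidate_pairs(df, ["zip", "street_name"])
              | block_candidate_pairs(df, ["zip", "family_name"]))
# candidates == {(0, 1), (1, 2)} for this toy data
```

Because each block index only emits pairs that agree exactly on both key fields, the candidate set stays far smaller than the near-n^2 set the window-based sorted-neighborhood indexes were producing.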
Some notes on performance and quality: