Reducing memory footprint for catagoricals #6219

Closed
ghost opened this issue Feb 1, 2014 · 4 comments
Labels: Categorical (Categorical Data Type), Ideas (Long-Term Enhancement Discussions), Internals (Related to non-user accessible pandas implementation)

Comments

ghost commented Feb 1, 2014

Python uses 32 bytes for every int.
NumPy uses 4 bytes for an int32 (with which you can index billions of keys).

We currently have a hardwired if preventing single-level multi-indexes,
which would allow us to take advantage of factorization to reduce memory
consumption by ~8x for indices which have lots of duplicates.

Indexes tend to be unique (by their nature). But the same principle goes
for string columns, which are often highly degenerate.

Basically, if we store categorical data as categorical, we can
reduce the memory footprint drastically. If we do the factorization
at the stage where data is read in (perhaps in conjunction with the iterator-based
reader planned for 0.14, #2193), we can drastically reduce the peak memory usage as well.
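
To make the idea concrete, here is a minimal sketch of factorizing values as they stream in from an iterator, so that the full list of Python strings never has to exist at once. It is pure Python and not pandas code: the function name `factorize_stream` is hypothetical, and `array("i")` is assumed to be a 4-byte int on the platform (which is typical).

```python
from array import array


def factorize_stream(values):
    """Map each value to a small integer code, one value at a time.

    Returns (codes, levels): `codes` holds one int per row and
    `levels` holds each distinct value exactly once.
    """
    codes = array("i")   # compact integer storage, assumed 4 bytes per item
    levels = []          # distinct values, in order of first appearance
    lookup = {}          # value -> code, i.e. the hash table
    for value in values:
        code = lookup.get(value)
        if code is None:
            code = len(levels)
            lookup[value] = code
            levels.append(value)
        codes.append(code)
    return codes, levels


# Works on any iterable, including a generator that reads rows lazily.
codes, levels = factorize_stream(iter(["ab", "cd", "ab", "ef", "cd"]))
restored = [levels[c] for c in codes]
```

Only `codes` grows with the number of rows; `levels` and `lookup` grow with the number of distinct values.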

Update: #5313

Contributor

jreback commented Feb 1, 2014

can you give an example of what you are thinking?

Author

ghost commented Feb 2, 2014

Here's a comparison of memory use for a series of 8-char strings stored
as dtype='O' (ragged strings).
Assuming int32 is used for the level/index values, since being in-memory
means we can't handle billions of rows anyway.

I used pympler to measure the memory footprint of python string types.

   Description       rows  factors  Per-value [Bytes]  ndarray [MB]  Categorical [MB]
0  bytestrings  100000000    10000                 48   4577.636719         381.92749
1      unicode  100000000    10000                 88   8392.333984         382.30896

For N rows with n distinct values, with s being the memory size of a single value:

Storing it as an Index (ndarray) takes:
N*s
Storing it as a categorical (labels and levels / factors and index values) takes:
N*sizeof(int32) + n*s

Again, we don't treat categorical data as categorical, and for string data in particular
the memory hit is huge since s >> sizeof(int32).
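
The two formulas above can be checked against the table with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, and MB is taken to mean 1024*1024 bytes, which is what the table's figures imply.

```python
MB = 1024 ** 2        # the table's MB figures correspond to 2**20 bytes
INT32_BYTES = 4


def ndarray_mb(n_rows, value_bytes):
    """N*s: every row holds its own copy of the value."""
    return n_rows * value_bytes / MB


def categorical_mb(n_rows, n_distinct, value_bytes, code_bytes=INT32_BYTES):
    """N*sizeof(int32) + n*s: one code per row, one value per level."""
    return (n_rows * code_bytes + n_distinct * value_bytes) / MB


N, n = 100_000_000, 10_000
for name, s in [("bytestrings", 48), ("unicode", 88)]:
    print(name, ndarray_mb(N, s), categorical_mb(N, n, s))
```

With the per-value sizes from the table (48 and 88 bytes), the per-row codes dominate the categorical cost, which is why both rows land near 382 MB.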

Contributor

jreback commented Feb 2, 2014

If this is your use case, then absolutely, this is a great idea. I don't think it would be that difficult, actually: you would simply inherit a new block type, Categorical, from Object (or maybe just Block, whichever is easier). Use it for object data that is string-like (and not just a random collection of mixed dtypes like Object can hold). Use a different container and just provide the Block interface.

We already have all the hash table stuff, so it's easy to map locations to values (and back) for easy lookups. This is kind of like a Sparse container.

The downside is that slicing becomes more tricky (and potentially slower, because you have to 'figure out' what the slice is and then create it, as opposed to taking a direct slice of memory; but maybe not so bad, because you don't need to potentially copy big memory either).
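
As a rough illustration of the slicing trade-off, here is a toy container that slices only the integer codes and shares the levels between the original and the slice. It is a pure-Python sketch, not the pandas Block interface; the class name `CategoricalSketch` and its methods are hypothetical.

```python
from array import array


class CategoricalSketch:
    """Toy container: integer codes per row plus a shared list of levels."""

    def __init__(self, codes, levels):
        self.codes = codes      # array of ints, one per row
        self.levels = levels    # distinct values, shared between slices

    def __len__(self):
        return len(self.codes)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Only the small codes are copied; the levels are reused as-is.
            return CategoricalSketch(self.codes[key], self.levels)
        # Scalar lookup: translate the code back to its value.
        return self.levels[self.codes[key]]

    def to_list(self):
        """Materialize the original values (the expensive representation)."""
        return [self.levels[c] for c in self.codes]


cat = CategoricalSketch(array("i", [0, 1, 0, 2]), ["ab", "cd", "ef"])
part = cat[1:3]
```

A slice here costs a copy of the codes rather than a view of memory, but it never copies the values themselves. The slice may also carry levels it no longer uses, which a real implementation would have to decide how to handle.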

A nice project. 👍

Author

ghost commented Feb 4, 2014

Closing in favor of #5313, which already covers this.

This issue was closed.