Reducing memory footprint for categoricals #6219

ghost · 2014-02-01T11:50:11Z

Python uses roughly 32 bytes for every int.
NumPy uses 4 bytes for an int32 (which is still enough to index about two billion keys).
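
A quick way to see the gap on your own machine, using only the standard library. This is a minimal sketch: `array("i")` stands in for a packed int32 buffer, and the exact sizes depend on the interpreter version and platform, so treat the printed numbers as indicative only.

```python
import sys
from array import array

# Size of one boxed Python int object; the exact figure depends on the
# interpreter version and platform.
boxed_int_bytes = sys.getsizeof(12345)

# A typed array stores raw machine integers back to back.
packed = array("i", range(1000))
packed_int_bytes = packed.itemsize  # commonly 4, but platform dependent

print("boxed int:", boxed_int_bytes, "bytes")
print("packed int:", packed_int_bytes, "bytes")
```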

We currently have a hardwired if-check preventing single-level multi-indexes.
Lifting it would allow us to take advantage of factorization to reduce memory
consumption by ~8x for indices which have lots of duplicates.

Indexes tend to be unique (by their nature), but the same principle applies
to string columns, which are often highly degenerate.

Basically, if we store categorical data as categorical, we can
reduce memory footprint drastically. If we do the factorization
at the stage where data is read in (perhaps in conjunction with the iterator-based
reader planned for 0.14, #2193), we can drastically reduce the peak memory usage as well.
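
To illustrate factorizing at read time, here is a minimal sketch in plain Python. It is not the pandas implementation: `factorize_stream` is a hypothetical helper, and `array("i")` stands in for an int32 codes array. The point is that each distinct value is stored once, while every row costs only one small integer, so the full column of strings never has to exist in memory.

```python
from array import array

def factorize_stream(values):
    """Encode an iterable of hashable values as integer codes plus a list of levels.

    Hypothetical helper for illustration; values are consumed one at a time.
    """
    codes = array("i")   # one small machine int per row
    levels = []          # each distinct value stored exactly once
    code_of = {}         # value -> code lookup
    for value in values:
        code = code_of.get(value)
        if code is None:
            code = len(levels)
            code_of[value] = code
            levels.append(value)
        codes.append(code)
    return codes, levels

# A generator plays the role of an iterator-based reader.
rows = (city for city in ["oslo", "lima", "oslo", "oslo", "lima", "rome"])
codes, levels = factorize_stream(rows)
print(list(codes), levels)  # [0, 1, 0, 0, 1, 2] ['oslo', 'lima', 'rome']
```

The original values can be recovered with `[levels[c] for c in codes]`.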

Update: #5313

jreback · 2014-02-01T20:23:21Z

can you give an example of what you are thinking?

ghost · 2014-02-02T00:40:30Z

Here's a comparison of memory use for a series of 8-char strings stored
as dtype='O' (ragged strings) versus as a categorical.
I assumed int32 for the level/index value, since being in-memory
means we can't handle billions of rows anyway.

I used pympler to measure the memory footprint of python string types.

   Description       rows  factors  Per-value [Bytes]  ndarray [MB]  Categorical [MB]
0  bytestrings  100000000    10000                 48   4577.636719         381.92749
1      unicode  100000000    10000                 88   8392.333984         382.30896

For N rows with n distinct values, and s being the memory size of a single value:

Storing it as an Index (ndarray) takes:
N*s
Storing it as a categorical (labels and levels / factors and index values) takes:
N*sizeof(int32) + n*s

Again, we don't treat categorical data as categorical, and for string data in particular
the memory hit is huge since s >> sizeof(int32).
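
The two formulas can be checked against the table with a few lines of Python. This is a sketch: the function names are made up for illustration, and it assumes the table's "MB" means 1024**2 bytes, which is what reproduces the figures shown.

```python
INT32_BYTES = 4
MIB = 1024 ** 2  # assumption: the table's "MB" column is in units of 1024**2 bytes

def ndarray_bytes(n_rows, value_bytes):
    # N * s: every row holds its own copy of the value
    return n_rows * value_bytes

def categorical_bytes(n_rows, n_levels, value_bytes, code_bytes=INT32_BYTES):
    # N * sizeof(int32) + n * s: one code per row plus one copy of each level
    return n_rows * code_bytes + n_levels * value_bytes

for label, s in [("bytestrings", 48), ("unicode", 88)]:
    dense = ndarray_bytes(100_000_000, s) / MIB
    cat = categorical_bytes(100_000_000, 10_000, s) / MIB
    print(f"{label}: ndarray {dense:.2f} MB, categorical {cat:.2f} MB")
```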

jreback · 2014-02-02T00:54:03Z

if this is your use case, then absolutely this is a great idea. I don't think it would be that difficult actually: you would simply inherit a new block type, Categorical, from Object (or maybe just Block, whatever is easier). Use it for object data that is string-like (and not just a random collection of mixed dtypes like Object can hold). Use a different container and just provide the Block interface.

We already have all of the hash table stuff, so it's easy to map locations to values (and back) for lookups. This is kind of like a Sparse container.

Downside is that slicing becomes more tricky (and potentially slower because you have to 'figure' out what the slice is and then create it, as opposed to a direct slice of memory, but maybe not so bad because you don't need to potentially copy big memory either).
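
A rough sketch of what slicing such a container could look like. `CategoricalSketch` is a hypothetical class for illustration only; it does not implement the pandas Block interface. A slice copies only the small integer codes and shares the levels, so the large values are never duplicated.

```python
from array import array

class CategoricalSketch:
    """Hypothetical container: integer codes per row plus a shared list of levels."""

    def __init__(self, codes, levels):
        self.codes = codes
        self.levels = levels

    def __len__(self):
        return len(self.codes)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Only the integer codes are copied; the levels list is shared.
            return CategoricalSketch(self.codes[key], self.levels)
        # Scalar lookup: map the code at this location back to its value.
        return self.levels[self.codes[key]]

    def to_list(self):
        return [self.levels[c] for c in self.codes]

cat = CategoricalSketch(array("i", [0, 1, 0, 2, 1]), ["a", "b", "c"])
part = cat[1:4]
print(part.to_list())  # ['b', 'a', 'c']
```

One design consequence: a slice may keep levels that no longer appear in its codes, so a real implementation would have to decide whether and when to prune them.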

A nice project. 👍

ghost · 2014-02-04T12:52:14Z

Closing in favor of #5313, which already covers this.

jreback mentioned this issue Feb 3, 2014

pd.Categorial and level ordering and adding to a DataFrame #6242

Closed

ghost mentioned this issue Feb 4, 2014

ENH: Add support for Categoricals in BlockManager #5313

Closed

ghost closed this as completed Feb 4, 2014

This issue was closed.
