In [1]:
import pandas

In [2]:
import numpy

### Efficient data storage through dtypes


* __Dtypes__ are native NumPy objects that let you define the exact type and number of bits used to store a piece of data.

* __memory_usage()__ shows the number of bytes used by each column. Since there is only one entry (row) per column, each int64 column takes 8 bytes and each int32 column takes 4 bytes.

In [4]:
pandas.DataFrame([[133,2,4]], columns=['uid','dogs','cats']).memory_usage()

Index    80
uid       8
dogs      8
cats      8
dtype: int64

In [6]:
# NumPy's dtype numpy.dtype('int32') represents, for instance, a 32-bit integer.
# Pandas defaults to 64-bit integers; we can save half the space by using 32 bits.

pandas.DataFrame([[133,2,4]], columns=['uid','dogs','cats']).astype({
    'uid':numpy.dtype('int32'),
    'dogs':numpy.dtype('int32'),
    'cats':numpy.dtype('int32')
}).memory_usage()


Index    80
uid       4
dogs      4
cats      4
dtype: int64
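
Downcasting is only safe when every value fits in the smaller type. The following is a minimal sketch, not part of the original notebook, that checks the range with `numpy.iinfo` before converting and also shows `pandas.to_numeric` with `downcast='integer'`, which picks the smallest integer type for a column. The variable names `fits`, `small` and `auto` are illustrative.

```python
import numpy
import pandas

df = pandas.DataFrame([[133, 2, 4]], columns=['uid', 'dogs', 'cats'])

# Range of values a 32-bit signed integer can hold
int32_info = numpy.iinfo(numpy.int32)

# Only downcast when every value in the frame fits in that range
fits = bool(((df >= int32_info.min) & (df <= int32_info.max)).all().all())
small = df.astype('int32') if fits else df

# Alternatively, let pandas choose the smallest integer type for one column
auto = pandas.to_numeric(df['dogs'], downcast='integer')

print(small.dtypes)
print(auto.dtype)
```

Values outside the target range would otherwise overflow or raise an error depending on the library version, so the explicit check is a cheap safeguard.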

* Pandas also introduces the __categorical dtype__, which stores frequently repeated values more efficiently. In the example below, converting the field year to a categorical reduces its memory usage by about 2x (from 1152 to 560 bytes).

In [7]:
import seaborn

In [16]:
df = seaborn.load_dataset('flights')
df.head()

   year     month  passengers
0  1949   January         112
1  1949  February         118
2  1949     March         132
3  1949     April         129
4  1949       May         121


In [10]:
df.memory_usage(deep=True)

Index           80
year          1152
month         1222
passengers    1152
dtype: int64

In [14]:
df.astype({
    'year':'category'
}).memory_usage(deep=True)

Index           80
year           560
month         1222
passengers    1152
dtype: int64
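
To see where the saving comes from, it helps to look inside a categorical column. The sketch below uses a small synthetic Series (an assumption, not the flights dataset) and shows that the unique values are stored once as categories, while each row holds only a small integer code.

```python
import pandas

# Synthetic data: three distinct years repeated many times
years = pandas.Series([1949, 1949, 1950, 1950, 1950, 1951] * 1000)
as_category = years.astype('category')

# The unique values are stored once ...
print(as_category.cat.categories)
# ... and each row holds a small integer code pointing into them
print(as_category.cat.codes.dtype)

# Compare the memory used by the two representations
print(years.memory_usage(index=False, deep=True))
print(as_category.memory_usage(index=False, deep=True))
```

The fewer distinct values a column has relative to its length, the larger the saving; a column where almost every value is unique gains little or nothing from the conversion.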

In [17]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
year          144 non-null int64
month         144 non-null category
passengers    144 non-null int64
dtypes: category(1), int64(2)
memory usage: 3.5 KB


In [20]:
# The overall size of the DataFrame drops (3.5 KB to 2.9 KB) just by changing this dtype
df.astype({
    'year':'category'
}).info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
year          144 non-null category
month         144 non-null category
passengers    144 non-null int64
dtypes: category(2), int64(1)
memory usage: 2.9 KB


Using the right dtypes not only lets you handle larger datasets in memory, it also makes some computations more efficient. In the example below, using the categorical type speeds up the groupby / sum operation by roughly 10% (1.06 ms vs. 959 µs).

In [21]:
dfcategory = df.astype({ 'year':'category'})

In [23]:
%%timeit
year_gb = df.groupby('year').passengers.sum()

1.06 ms ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [24]:
%%timeit
year_gb_cat = dfcategory.groupby('year').passengers.sum()

959 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
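
Outside a notebook, the same comparison can be sketched with the standard `timeit` module. The example below uses synthetic data (an assumption, sized larger than the flights dataset) and first checks that both groupings return the same totals. The helper names `by_int` and `by_category` are illustrative, and the timings will vary with the machine and the pandas version.

```python
import timeit

import numpy
import pandas

# Synthetic stand-in for the flights data
rng = numpy.random.default_rng(0)
df = pandas.DataFrame({
    'year': rng.integers(1949, 1961, size=10_000),
    'passengers': rng.integers(100, 600, size=10_000),
})
dfcategory = df.astype({'year': 'category'})

def by_int():
    return df.groupby('year').passengers.sum()

def by_category():
    # observed=True keeps only the categories that actually appear in the data
    return dfcategory.groupby('year', observed=True).passengers.sum()

# Both versions must produce the same totals before timing means anything
assert (by_int().to_numpy() == by_category().to_numpy()).all()

print(timeit.timeit(by_int, number=200))
print(timeit.timeit(by_category, number=200))
```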
