In [1]:
import pandas

In [2]:
import numpy

### Efficient data storage with dtypes


* __Dtypes__ are NumPy's native type objects. They let you define the exact type and number of bits used to store a given piece of information.

* __memory_usage()__ shows the number of bytes used by the index and by each column. Since there is only one entry (row) per column, each int64 column takes 8 bytes and each int32 column takes 4 bytes.

In [3]:
pandas.DataFrame([[133,2,4]], columns=['uid','dogs','cats']).memory_usage()

Index    80
uid       8
dogs      8
cats      8
dtype: int64

In [4]:
# NumPy's dtype numpy.dtype('int32') represents, for instance, a 32-bit integer.
# Pandas defaults to 64-bit integers; we can save half the space by using 32 bits.

pandas.DataFrame([[133,2,4]], columns=['uid','dogs','cats']).astype({
    'uid':numpy.dtype('int32'),
    'dogs':numpy.dtype('int32'),
    'cats':numpy.dtype('int32')
}).memory_usage()


Index    80
uid       4
dogs      4
cats      4
dtype: int64
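
Downcasting is only safe when every value fits in the narrower type. The following is a minimal sketch, assuming the same toy frame as above (rebuilt here under the hypothetical name `df_small`). It checks the `int32` range with `numpy.iinfo` and then lets `pandas.to_numeric(..., downcast='integer')` pick the smallest signed integer type that can hold each column.

```python
import numpy
import pandas

# Same toy data as above; the name df_small is only used for this illustration.
df_small = pandas.DataFrame([[133, 2, 4]], columns=['uid', 'dogs', 'cats'])

# Range of values that a 32-bit signed integer can represent.
int32_info = numpy.iinfo(numpy.int32)
in_range = (df_small >= int32_info.min) & (df_small <= int32_info.max)
fits_int32 = bool(in_range.all().all())

# Let pandas choose the smallest signed integer dtype for each column.
downcast = pandas.DataFrame({
    col: pandas.to_numeric(df_small[col], downcast='integer')
    for col in df_small.columns
})

print(fits_int32)
print(downcast.dtypes)
print(downcast.memory_usage(index=False))
```

Choosing the type per column this way can go below 32 bits, but it ties the dtype to the values currently in the data, so an explicit `astype` may be the better choice when larger values can arrive later.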

* Pandas also introduces the __categorical dtype__, which stores frequently repeated values efficiently. In the example below, converting the field country to a categorical reduces its memory usage roughly sixfold (from 111,288 to 17,802 bytes).

In [5]:
# link to gapminder data as csv file from software carpentry website
csv_url='http://bit.ly/2cLzoxH'

In [6]:
df = pandas.read_csv(csv_url)
df.head()

       country  year         pop  continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0       Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0       Asia   30.332   820.85303
2  Afghanistan  1962  10267083.0       Asia   31.997   853.10071
3  Afghanistan  1967  11537966.0       Asia    34.02  836.197138
4  Afghanistan  1972  13079460.0       Asia   36.088  739.981106


In [7]:
df.memory_usage(deep=True)

Index            80
country      111288
year          13632
pop           13632
continent    107184
lifeExp       13632
gdpPercap     13632
dtype: int64

In [8]:
df.astype({
    'country':'category'
}).memory_usage(deep=True)

Index            80
country       17802
year          13632
pop           13632
continent    107184
lifeExp       13632
gdpPercap     13632
dtype: int64

In [9]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
year         1704 non-null int64
pop          1704 non-null float64
continent    1704 non-null object
lifeExp      1704 non-null float64
gdpPercap    1704 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 266.7 KB


In [10]:
# the overall size of the data-frame drops by just changing this data type
df.astype({
    'country':'category'
}).info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null category
year         1704 non-null int64
pop          1704 non-null float64
continent    1704 non-null object
lifeExp      1704 non-null float64
gdpPercap    1704 non-null float64
dtypes: category(1), float64(3), int64(1), object(1)
memory usage: 175.4 KB
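
The saving comes from how a categorical column is stored: each distinct value is kept once, and every row holds only a small integer code pointing at it. The following is a minimal sketch on a hypothetical toy column (not the gapminder data), using the `.cat.categories` and `.cat.codes` accessors to show both parts.

```python
import pandas

# Hypothetical toy column: two distinct strings repeated many times.
countries = pandas.Series(['Afghanistan', 'Albania', 'Afghanistan', 'Albania'] * 250)
as_category = countries.astype('category')

# Each distinct string is stored once...
print(as_category.cat.categories)
# ...and each row holds only a small integer code that refers to it.
print(as_category.cat.codes.dtype)

# Compare the footprint of the two representations, excluding the index.
string_bytes = countries.memory_usage(deep=True, index=False)
category_bytes = as_category.memory_usage(deep=True, index=False)
print(string_bytes, category_bytes)
```

The fewer distinct values a column has relative to its length, the larger the saving; a column where nearly every value is unique gains little or nothing from the conversion.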


Using the right dtypes not only lets you handle larger datasets in memory, it can also make some computations more efficient. In the example below, the groupby / average operation runs slightly faster on the categorical column (3.48 ms versus 3.84 ms), although on a dataset this small the difference is within the measurement noise.

In [11]:
dfcategory = df.astype({ 'country':'category'})

In [12]:
%%timeit
year_gb = df.groupby('country').mean()

3.84 ms ± 505 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [13]:
%%timeit
year_gb_cat = dfcategory.groupby('country').mean()

3.48 ms ± 818 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
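
On a frame this small the two timings are hard to tell apart, so a larger input makes the comparison more meaningful. The following is a minimal sketch on hypothetical synthetic data (200,000 rows over 100 made-up keys), using the standard `timeit` module outside of a notebook. It first checks that both versions return the same means and then times each one. No timings are quoted here because they depend on the machine and the pandas version.

```python
import timeit

import numpy
import pandas

# Hypothetical synthetic data: 200,000 rows spread over 100 string keys.
rng = numpy.random.default_rng(0)
keys = numpy.array([f'country_{i:03d}' for i in range(100)])
frame = pandas.DataFrame({
    'country': keys[rng.integers(0, len(keys), size=200_000)],
    'value': rng.random(200_000),
})
frame_cat = frame.astype({'country': 'category'})

# observed=True restricts the result to categories that actually occur.
mean_plain = frame.groupby('country')['value'].mean()
mean_cat = frame_cat.groupby('country', observed=True)['value'].mean()

# Both versions must produce the same numbers; only the speed may differ.
same = bool(numpy.allclose(mean_plain.to_numpy(), mean_cat.to_numpy()))

# Total time in seconds for 5 repetitions of each groupby.
t_plain = timeit.timeit(
    lambda: frame.groupby('country')['value'].mean(), number=5)
t_cat = timeit.timeit(
    lambda: frame_cat.groupby('country', observed=True)['value'].mean(), number=5)
print(same, t_plain, t_cat)
```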


Within pandas, we can set the dtypes either while loading the data (the dtype argument of the read_* functions, such as read_csv) or afterwards as a type conversion (astype).
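
Both routes can be sketched side by side. The following minimal example assumes a small inline CSV (the first rows of the table shown earlier, wrapped in `io.StringIO` so that no download is needed) and applies the same dtype mapping once at load time and once after loading.

```python
import io

import pandas

# Inline CSV standing in for a real file or URL (rows copied from the head above).
csv_text = """country,year,pop
Afghanistan,1952,8425333.0
Afghanistan,1957,9240934.0
Afghanistan,1962,10267083.0
"""

wanted = {'country': 'category', 'year': 'int32'}

# Option 1: declare the dtypes while loading.
loaded = pandas.read_csv(io.StringIO(csv_text), dtype=wanted)

# Option 2: load with the defaults, then convert.
converted = pandas.read_csv(io.StringIO(csv_text)).astype(wanted)

print(loaded.dtypes)
print(converted.dtypes)
```

Declaring the dtypes at load time avoids first building the wider default columns, which matters most when the file is large relative to the available memory.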