# pandas and data types

## Data generation

To run some experiments, we first generate a data set. We want to experiment with floating point numbers, integers, binary values and categorical data, so we create a text file that has two columns of each of those types.

In [1]:
import random

In [19]:
nr_lines = 100_000

In [20]:
with open('random_data.csv', 'w') as file:
    categories = ''.join(chr(ord('A') + i) for i in range(26))
    print('float1,float2,int1,int2,binary1,binary2,cat1,cat2', file=file)
    for _ in range(nr_lines):
        print(f'{random.random()},{random.random()},'
              f'{random.randrange(0, 255)},{random.randrange(0, 255)},'
              f'{random.randint(0, 1)},{random.randint(0, 1)},'
              f'{random.choice(categories)},{random.choice(categories)}',
              file=file)

In [21]:
!wc random_data.csv

 100001  100001 5367829 random_data.csv


## Dataframes

In [25]:
import numpy as np
import pandas as pd

When simply using defaults to read the CSV file, pandas will use 64-bit floating point and integer values, even though that might not be required.

In [49]:
data = pd.read_csv('random_data.csv')

In [50]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
float1     100000 non-null float64
float2     100000 non-null float64
int1       100000 non-null int64
int2       100000 non-null int64
binary1    100000 non-null int64
binary2    100000 non-null int64
cat1       100000 non-null object
cat2       100000 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 6.1+ MB
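
The `+` in the reported memory usage indicates a lower bound: for `object` columns, `info()` counts only the references, not the Python strings they point to. The sketch below illustrates this with `memory_usage(deep=True)` on a small stand-in dataframe; the values are made up for illustration, and the `object` dtype is set explicitly as an assumption to mirror the output above.

```python
import pandas as pd

# Small stand-in for the dataframe above (hypothetical values).
df = pd.DataFrame({
    'float1': [0.1, 0.2, 0.3],
    'cat1': pd.Series(['A', 'B', 'C'], dtype=object),
})

# Shallow: object columns are counted as one reference per row.
shallow = df.memory_usage()
# Deep: the size of the referenced Python strings is included as well.
deep = df.memory_usage(deep=True)

print(shallow)
print(deep)
```

For the numeric column both numbers agree; for the `object` column the deep value is larger.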


Of course, one can convert columns of an existing dataframe into more frugal data types, e.g.,

In [52]:
data['float1'] = data['float1'].astype(np.float32)

In [53]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
float1     100000 non-null float32
float2     100000 non-null float64
int1       100000 non-null int64
int2       100000 non-null int64
binary1    100000 non-null int64
binary2    100000 non-null int64
cat1       100000 non-null object
cat2       100000 non-null object
dtypes: float32(1), float64(1), int64(4), object(2)
memory usage: 5.7+ MB
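
Several columns can be converted in one call by passing a mapping to `astype`, and `pd.to_numeric` can pick a smaller type automatically. This is a minimal sketch on a made-up dataframe, not on the data set above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with the default 64-bit types.
df = pd.DataFrame({
    'float1': [0.25, 0.5, 0.75],
    'float2': [0.1, 0.2, 0.3],
    'int1': [0, 127, 254],
})

# astype accepts a column -> dtype mapping; other columns are left alone.
converted = df.astype({'float1': np.float32,
                       'float2': np.float32,
                       'int1': np.uint8})
print(converted.dtypes)

# to_numeric with downcast selects the smallest type that holds the values.
downcast_int = pd.to_numeric(df['int1'], downcast='unsigned')
print(downcast_int.dtype)
```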


However, it is faster to specify a dictionary of data types when reading the file. Not only does this reduce the execution time, it also makes it possible to handle larger data sets, since the initial (overly large) memory footprint is never required.

Specifying the type of the floating point values as single precision reduces the size of the dataframe considerably.

In [26]:
data = pd.read_csv('random_data.csv',
                   dtype={'float1': np.float32, 'float2': np.float32})

In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
float1     100000 non-null float32
float2     100000 non-null float32
int1       100000 non-null int64
int2       100000 non-null int64
binary1    100000 non-null int64
binary2    100000 non-null int64
cat1       100000 non-null object
cat2       100000 non-null object
dtypes: float32(2), int64(4), object(2)
memory usage: 5.3+ MB
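
The size of the reduction can be estimated from the item size of each data type. The sketch below uses a hypothetical helper, `column_bytes`, to compute the raw storage of one column and ignores any overhead of the index.

```python
import numpy as np

nr_rows = 100_000

def column_bytes(dtype, nr_rows):
    """Raw storage in bytes for a column of nr_rows values (hypothetical helper)."""
    return np.dtype(dtype).itemsize * nr_rows

# float64 uses 8 bytes per value, float32 uses 4.
saved_per_float_column = (column_bytes(np.float64, nr_rows)
                          - column_bytes(np.float32, nr_rows))
print(saved_per_float_column)      # 400000 bytes per column
print(2 * saved_per_float_column)  # 800000 bytes for both float columns
```

The 800,000 bytes saved for the two columns is consistent with the drop from roughly 6.1 MB to 5.3 MB reported by `info()`.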


Similarly, the type of the integer data can be reduced to `np.uint8`, further reducing the size of the dataframe.

In [54]:
data = pd.read_csv('random_data.csv',
                   dtype={'float1': np.float32, 'float2': np.float32,
                          'int1': np.uint8, 'int2': np.uint8})

In [55]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
float1     100000 non-null float32
float2     100000 non-null float32
int1       100000 non-null uint8
int2       100000 non-null uint8
binary1    100000 non-null int64
binary2    100000 non-null int64
cat1       100000 non-null object
cat2       100000 non-null object
dtypes: float32(2), int64(2), object(2), uint8(2)
memory usage: 4.0+ MB
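
A smaller integer type is only safe when every value fits in its range. As a minimal sketch, the hypothetical helper `fits_in` below checks this with `np.iinfo` before a type is chosen.

```python
import numpy as np
import pandas as pd

def fits_in(values, dtype):
    """Return True if all values lie within the range of the integer dtype."""
    info = np.iinfo(dtype)
    return bool(info.min <= values.min() and values.max() <= info.max)

print(fits_in(pd.Series([0, 17, 254]), np.uint8))  # True: range is 0..255
print(fits_in(pd.Series([0, 300]), np.uint8))      # False: 300 is too large
print(fits_in(pd.Series([-1, 5]), np.uint8))       # False: uint8 is unsigned
```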


Furthermore, both the columns with binary data and the columns with categorical data can be explicitly represented as such, further reducing the memory footprint of the dataframe.

In [36]:
data = pd.read_csv('random_data.csv',
                   dtype={'float1': np.float32, 'float2': np.float32,
                          'int1': np.uint8, 'int2': np.uint8,
                          'binary1': 'category', 'binary2': 'category',
                          'cat1': 'category', 'cat2': 'category'})

In [37]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
float1     100000 non-null float32
float2     100000 non-null float32
int1       100000 non-null uint8
int2       100000 non-null uint8
binary1    100000 non-null category
binary2    100000 non-null category
cat1       100000 non-null category
cat2       100000 non-null category
dtypes: category(4), float32(2), uint8(2)
memory usage: 1.3 MB
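
The saving comes from how a categorical column is stored: each distinct value is kept once, and every row holds only a small integer code that refers to it. The sketch below inspects both parts for a made-up series.

```python
import pandas as pd

# Hypothetical example values; the categories are inferred from the data.
s = pd.Series(['B', 'A', 'C', 'A'], dtype='category')

# The distinct values, stored once.
print(list(s.cat.categories))  # ['A', 'B', 'C']
# One integer code per row, indexing into the categories.
print(list(s.cat.codes))       # [1, 0, 2, 0]
```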


Just to see how efficiently categories are represented, we load only a single binary column, and check the size of the dataframe.

In [39]:
data = pd.read_csv('random_data.csv',
                   usecols=('binary1',),
                   dtype={'float1': np.float32, 'float2': np.float32,
                          'int1': np.uint8, 'int2': np.uint8,
                          'binary1': 'category', 'binary2': 'category',
                          'cat1': 'category', 'cat2': 'category'})

In [40]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 1 columns):
binary1    100000 non-null category
dtypes: category(1)
memory usage: 97.8 KB


The memory footprint of a dataframe with a single categorical column is similar.

In [41]:
data = pd.read_csv('random_data.csv',
                   usecols=('cat1',),
                   dtype={'float1': np.float32, 'float2': np.float32,
                          'int1': np.uint8, 'int2': np.uint8,
                          'binary1': 'category', 'binary2': 'category',
                          'cat1': 'category', 'cat2': 'category'})

In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 1 columns):
cat1    100000 non-null category
dtypes: category(1)
memory usage: 99.2 KB


Loading just an integer column as `np.uint8` shows that the memory footprint is again the same, which indicates that categorical data with a small number of categories is stored as a one-byte integer code per value.

In [43]:
data = pd.read_csv('random_data.csv',
                   usecols=('int1',),
                   dtype={'float1': np.float32, 'float2': np.float32,
                          'int1': np.uint8, 'int2': np.uint8,
                          'binary1': 'category', 'binary2': 'category',
                          'cat1': 'category', 'cat2': 'category'})

In [44]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 1 columns):
int1    100000 non-null uint8
dtypes: uint8(1)
memory usage: 97.8 KB
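
This can be checked directly by comparing the raw storage of a `np.uint8` column with that of the codes of a categorical column. The sketch below generates its own random data rather than reading the CSV file, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

nr_rows = 100_000
rng = np.random.default_rng(1234)

# One byte per value for the integer column.
ints = pd.Series(rng.integers(0, 255, size=nr_rows, dtype=np.uint8))

# 26 single-letter categories, as in the generated file.
letters = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
cats = pd.Series(rng.choice(letters, size=nr_rows)).astype('category')

print(ints.memory_usage(index=False))  # 100000 bytes
print(cats.cat.codes.nbytes)           # 100000 bytes
print(cats.cat.codes.dtype)            # int8
```

The codes are held in a signed 8-bit integer here, so both columns need one byte per row; the categorical column additionally stores its 26 category labels once.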
