# pandas and data types

It is worth looking into how pandas stores data internally by default, and how that behaviour can be modified if necessary.

In general, pandas is rather liberal with the data types it uses: by default, integers are stored as 64-bit values and floating point values as double precision numbers.  Since memory bandwidth is often a performance bottleneck, and memory may be at a premium when dealing with large data sets, it is often necessary to ensure that the columns of a dataframe have types that are appropriate for the data they store.
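
To make the difference concrete, the following minimal sketch prints the number of bytes per value for a few NumPy types, and the resulting size of a single column.  The column length `n` is an assumption chosen for illustration only.

```python
import numpy as np

n = 100_000  # hypothetical number of values in one column

for dtype in (np.float64, np.float32, np.int64, np.uint8):
    itemsize = np.dtype(dtype).itemsize  # bytes needed to store one value
    print(f'{np.dtype(dtype).name:8s} {itemsize} bytes/value, '
          f'{n * itemsize / 1024:.1f} KB for {n} values')
```

A column of 8-bit integers hence needs one eighth of the memory of the same column stored with the 64-bit default.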

# Data generation

In order to do some experiments, we generate a data set.  We want to experiment with various types of floating point numbers, integers and categorical data, so we will create a text file that has some columns of each of those types.

In [1]:
import pathlib
import random

In [2]:
file_path = pathlib.Path('random_data_to_remove.csv')
nr_lines = 100_000

In [3]:
with open(file_path, 'w') as file:
    categories = ''.join(chr(ord('A') + i) for i in range(26))
    print('float1,float2,int1,int2,binary1,binary2,cat1,cat2',
          file=file)
    for _ in range(nr_lines):
        print(f'{random.random()},{random.random()},'
              f'{random.randrange(0, 255)},{random.randrange(0, 255)},'
              f'{random.randint(0, 1)},{random.randint(0, 1)},'
              f'{random.choice(categories)},{random.choice(categories)}',
              file=file)

# Tailoring dataframes

In [4]:
import numpy as np
import pandas as pd

When simply using defaults to read the CSV file, pandas will use 64-bit floating point and integer values, even though that might not be required.

In [5]:
data = pd.read_csv(file_path)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   float1   100000 non-null  float64
 1   float2   100000 non-null  float64
 2   int1     100000 non-null  int64  
 3   int2     100000 non-null  int64  
 4   binary1  100000 non-null  int64  
 5   binary2  100000 non-null  int64  
 6   cat1     100000 non-null  object 
 7   cat2     100000 non-null  object 
dtypes: float64(2), int64(4), object(2)
memory usage: 6.1+ MB
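
The reported size can be checked by hand: 8 columns of 8 bytes each for 100,000 rows is 6,400,000 bytes, or about 6.1 MB when counted in units of 1024 × 1024 bytes.  The `+` indicates that this is a lower bound, since for `object` columns only the 8-byte references are counted, not the Python strings they point to.  The sketch below, which uses a small stand-in dataframe with illustrative column names rather than the generated file, compares the shallow and the deep memory usage.

```python
import pandas as pd

# Small stand-in for the generated data; names and values are illustrative.
df_demo = pd.DataFrame({
    'int1': [1, 2, 3, 4],
    'cat1': pd.Series(['A', 'B', 'A', 'C'], dtype=object),
})

shallow = df_demo.memory_usage()        # object column: references only
deep = df_demo.memory_usage(deep=True)  # also counts the Python str objects

print(shallow)
print(deep)
```

For the full report, `data.info(memory_usage='deep')` should give the corresponding total.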


Of course, one can convert columns of an existing dataframe into more frugal data types, e.g.,

In [7]:
data['float1'] = data['float1'].astype(np.float32)

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   float1   100000 non-null  float32
 1   float2   100000 non-null  float64
 2   int1     100000 non-null  int64  
 3   int2     100000 non-null  int64  
 4   binary1  100000 non-null  int64  
 5   binary2  100000 non-null  int64  
 6   cat1     100000 non-null  object 
 7   cat2     100000 non-null  object 
dtypes: float32(1), float64(1), int64(4), object(2)
memory usage: 5.7+ MB
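
As an alternative to naming the target type explicitly, `pd.to_numeric` can pick the smallest type that holds the values through its `downcast` argument.  The following is a minimal sketch on two small series whose values are made up for illustration.

```python
import pandas as pd

s_int = pd.Series([0, 17, 254])         # int64 by default
s_float = pd.Series([0.25, 0.5, 0.75])  # float64 by default

# Let pandas choose the smallest suitable type for the values at hand.
small_int = pd.to_numeric(s_int, downcast='unsigned')
small_float = pd.to_numeric(s_float, downcast='float')

print(small_int.dtype, small_float.dtype)
```

Note that the type chosen this way depends on the values that happen to be present, so it may differ between data sets with the same layout.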


However, it is faster to specify a dictionary of data types when reading the file. Not only does this reduce the execution time, it also makes it possible to deal with larger data sets, since the data never has to be held in memory with the initial (overly large) default types.

Specifying the type of the floating point values as single precision reduces the size of the dataframe considerably.

In [9]:
data = pd.read_csv(file_path,
                   dtype={
                       'float1': np.float32,
                       'float2': np.float32,
                   })

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   float1   100000 non-null  float32
 1   float2   100000 non-null  float32
 2   int1     100000 non-null  int64  
 3   int2     100000 non-null  int64  
 4   binary1  100000 non-null  int64  
 5   binary2  100000 non-null  int64  
 6   cat1     100000 non-null  object 
 7   cat2     100000 non-null  object 
dtypes: float32(2), int64(4), object(2)
memory usage: 5.3+ MB


Similarly, the type of the integer data can be reduced to `np.uint8`, further reducing the size of the dataframe.

In [11]:
data = pd.read_csv(file_path,
                   dtype={
                       'float1': np.float32, 'float2': np.float32,
                       'int1': np.uint8, 'int2': np.uint8,
                   })

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   float1   100000 non-null  float32
 1   float2   100000 non-null  float32
 2   int1     100000 non-null  uint8  
 3   int2     100000 non-null  uint8  
 4   binary1  100000 non-null  int64  
 5   binary2  100000 non-null  int64  
 6   cat1     100000 non-null  object 
 7   cat2     100000 non-null  object 
dtypes: float32(2), int64(2), object(2), uint8(2)
memory usage: 4.0+ MB


Furthermore, both the columns with binary data and the columns with categorical data can be explicitly represented as such, further reducing the memory footprint of the dataframe.

In [13]:
data = pd.read_csv(file_path,
                   dtype={
                       'float1': np.float32, 'float2': np.float32,
                       'int1': np.uint8, 'int2': np.uint8,
                       'binary1': 'category', 'binary2': 'category',
                       'cat1': 'category', 'cat2': 'category',
                   })

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column   Non-Null Count   Dtype   
---  ------   --------------   -----   
 0   float1   100000 non-null  float32 
 1   float2   100000 non-null  float32 
 2   int1     100000 non-null  uint8   
 3   int2     100000 non-null  uint8   
 4   binary1  100000 non-null  category
 5   binary2  100000 non-null  category
 6   cat1     100000 non-null  category
 7   cat2     100000 non-null  category
dtypes: category(4), float32(2), uint8(2)
memory usage: 1.3 MB
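
The overall gain can also be measured programmatically.  The sketch below is self-contained and therefore builds a small CSV in memory instead of using the generated file; the column layout and values are assumptions for illustration.  It compares the deep memory usage of a default read with that of a typed read.

```python
import io

import numpy as np
import pandas as pd

# Small in-memory CSV with one column of each kind (illustrative data).
csv_text = 'float1,int1,binary1,cat1\n' + ''.join(
    f'{i / 1000},{i % 255},{i % 2},{"ABC"[i % 3]}\n' for i in range(1000)
)

default = pd.read_csv(io.StringIO(csv_text))
typed = pd.read_csv(io.StringIO(csv_text),
                    dtype={
                        'float1': np.float32, 'int1': np.uint8,
                        'binary1': 'category', 'cat1': 'category',
                    })

# deep=True also accounts for the strings stored in the text column.
default_bytes = default.memory_usage(deep=True).sum()
typed_bytes = typed.memory_usage(deep=True).sum()
print(f'default: {default_bytes} bytes, typed: {typed_bytes} bytes')
```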


Just to see how efficiently categories are represented, we load only a single binary column, and check the size of the dataframe.

In [15]:
data = pd.read_csv(file_path,
                   usecols=('binary1',),
                   dtype={
                       'float1': np.float32, 'float2': np.float32,
                       'int1': np.uint8, 'int2': np.uint8,
                       'binary1': 'category', 'binary2': 'category',
                       'cat1': 'category', 'cat2': 'category',
                   })

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 1 columns):
 #   Column   Non-Null Count   Dtype   
---  ------   --------------   -----   
 0   binary1  100000 non-null  category
dtypes: category(1)
memory usage: 97.8 KB


The memory footprint of a dataframe with a single categorical column is similar.

In [17]:
data = pd.read_csv(file_path,
                   usecols=('cat1',),
                   dtype={
                       'float1': np.float32, 'float2': np.float32,
                       'int1': np.uint8, 'int2': np.uint8,
                       'binary1': 'category', 'binary2': 'category',
                       'cat1': 'category', 'cat2': 'category',
                   })

In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype   
---  ------  --------------   -----   
 0   cat1    100000 non-null  category
dtypes: category(1)
memory usage: 99.0 KB


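Loading just an integer column as `np.uint8` shows that the memory footprint is again the same: one byte per value.  Categorical data with a small number of categories is thus stored as 8-bit integer codes.  Note that pandas uses signed codes (`int8`, with -1 reserved for missing values), so the single-byte representation applies only up to roughly 127 categories, not 256.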

In [19]:
data = pd.read_csv(file_path,
                   usecols=('int1',),
                   dtype={
                       'float1': np.float32, 'float2': np.float32,
                       'int1': np.uint8, 'int2': np.uint8,
                       'binary1': 'category', 'binary2': 'category',
                       'cat1': 'category', 'cat2': 'category',
                   })

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   int1    100000 non-null  uint8
dtypes: uint8(1)
memory usage: 97.8 KB
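
The type of the integer codes behind a categorical column can be inspected through the `.cat.codes` accessor.  The following minimal sketch uses two made-up series, one with 3 and one with 200 categories, to show that pandas switches to a wider code type once a single byte no longer suffices.

```python
import pandas as pd

# 3 distinct categories
few = pd.Series(list('ABCABC'), dtype='category')
# 200 distinct categories (hypothetical labels c0 ... c199)
many = pd.Series([f'c{i}' for i in range(200)], dtype='category')

print(few.cat.codes.dtype)   # code type for a handful of categories
print(many.cat.codes.dtype)  # code type once one byte is too small
```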


In [21]:
file_path.unlink()

# Failures

You should note that when the data types you specify for reading do not match the data, failures will be silent.  For instance, integers will overflow.
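
The wrap-around behaviour of integers can be reproduced directly in NumPy, independently of reading a file.  In this minimal sketch the values are chosen for illustration: 300 and -1 do not fit into an unsigned 8-bit integer, and the conversion wraps them around modulo 256 without complaint.

```python
import numpy as np

values = np.array([9, 300, -1], dtype=np.int64)
wrapped = values.astype(np.uint8)  # no error is raised for out-of-range values

# 300 becomes 300 - 256 = 44, and -1 becomes 256 - 1 = 255
print(wrapped.tolist())
```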

Consider the following dataframe that has two integer columns and one floating point column.  Column `B` contains integers that cannot be represented as unsigned 8-bit integers, and column `C` contains values that cannot be represented as single precision numbers.

In [22]:
df = pd.DataFrame({
    'A': [3, 19, 5, 7],
    'B': [9, 7, 283, -1],
    'C': [4.59, 21.49e57, -4.8, -1.495e-103]
})

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       4 non-null      int64  
 1   B       4 non-null      int64  
 2   C       4 non-null      float64
dtypes: float64(1), int64(2)
memory usage: 224.0 bytes


We write this dataframe as a CSV file.

In [24]:
file_path = pathlib.Path('overflow_to_remove.csv')

In [25]:
df.to_csv(file_path)

When we read it back in, we specify the types of columns `A` and `B` as `np.uint8` and that of column `C` as `np.float32`.

In [26]:
df_typed = pd.read_csv(file_path,
                       dtype={
                           'A': np.uint8,
                           'B': np.uint8,
                           'C': np.float32
                       },
                       index_col=0)

In [27]:
df_typed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       4 non-null      uint8  
 1   B       4 non-null      uint8  
 2   C       4 non-null      float32
dtypes: float32(1), uint8(2)
memory usage: 56.0 bytes


Inspecting the values shows that the out-of-range integers in column `B` silently wrapped around (283 became 27 and -1 became 255).  In column `C`, the value that is too large to be represented as a single precision number became infinity, and the value whose magnitude is too small became (negative) zero.  Hence, failure is silent.

In [28]:
df_typed

     A    B     C
0    3    9  4.59
1   19    7   inf
2    5   27  -4.8
3    7  255  -0.0


In [29]:
file_path.unlink()
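
Since these failures are silent, it can be worthwhile to check that the data fits the intended types before converting.  The sketch below is one possible approach, not a pandas feature: the helper functions `fits_integer` and `fits_float` are hypothetical names introduced here, and they rely on the limits reported by `np.iinfo` and `np.finfo`.

```python
import numpy as np
import pandas as pd


def fits_integer(series, dtype):
    """Return True if all values lie within the range of the integer dtype."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)


def fits_float(series, dtype):
    """Return True if no value is too large or too close to zero for dtype."""
    info = np.finfo(dtype)
    magnitudes = series.abs()
    nonzero = magnitudes[magnitudes > 0]
    not_too_large = magnitudes.max() <= info.max
    # info.tiny is the smallest positive normal number of the dtype.
    not_too_small = nonzero.empty or nonzero.min() >= info.tiny
    return bool(not_too_large and not_too_small)


# Same values as the dataframe used above.
df_check = pd.DataFrame({
    'A': [3, 19, 5, 7],
    'B': [9, 7, 283, -1],
    'C': [4.59, 21.49e57, -4.8, -1.495e-103],
})

print(fits_integer(df_check['A'], np.uint8))  # column A is safe to convert
print(fits_integer(df_check['B'], np.uint8))  # column B is not
print(fits_float(df_check['C'], np.float32))  # column C is not
```

Such a check requires the data in its wide representation first, so for very large files it would have to be applied to a sample or chunk by chunk.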