# Agenda: Pandas optimization

1. Why do we care? (Some considerations)
2. Memory optimization
3. Files
4. Query speed
5. PyArrow
6. Optimization antipatterns

In [1]:
!wget https://files.lerner.co.il/olympic_athlete_events.zip

--2025-08-27 20:04:59--  https://files.lerner.co.il/olympic_athlete_events.zip
Resolving files.lerner.co.il (files.lerner.co.il)... 64.227.9.246
Connecting to files.lerner.co.il (files.lerner.co.il)|64.227.9.246|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5536302 (5.3M) [application/zip]
Saving to: ‘olympic_athlete_events.zip’


2025-08-27 20:05:01 (4.81 MB/s) - ‘olympic_athlete_events.zip’ saved [5536302/5536302]



# Why do we care?

Everything that happens in Pandas happens in memory.

- You need to make sure that you have enough memory on your system for the data to fit
- Even if it all fits, the smaller the data frame, the faster the queries will run on it
- Even if it all fits, your query might create a large amount of "temporary" data that might use up your system's RAM.

If the data doesn't fit, you can use another system that can work with only part of the data in memory at a time, such as Polars, Dask, Modin, or PyArrow.

Our goal is always going to be: Use as little memory as possible!

# Optimizing memory

There are a few factors to consider:

- How many rows are in your data frame?
- How many columns are in your data frame?
- What dtype does each column have?

You want to have as few rows as possible, as few columns as possible, and as small a dtype as is practical/reasonable for each of those columns.

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
np.random.seed(0)  # guarantees that we all get the same random numbers each time

df = DataFrame(np.random.randint(0, 100, [3, 4]))
df

    0   1   2   3
0  44  47  64  67
1  67   9  83  21
2  36  87  70  88


In [4]:
# how much memory is this using?

# - how many rows? 3
# - how many columns? 4
# - what dtype? np.int64

3 * 4 * 8

96

In [5]:
# is this true?

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       3 non-null      int64
 1   1       3 non-null      int64
 2   2       3 non-null      int64
 3   3       3 non-null      int64
dtypes: int64(4)
memory usage: 228.0 bytes


This is more than the 96 bytes we calculated because the data frame's index uses up some memory, too, on top of the values in the columns!

In [6]:
# another way to calculate memory usage

df.memory_usage()

Index    132
0         24
1         24
2         24
3         24
dtype: int64

In [7]:
24 * 4

96

In [8]:
# what if we now give string labels to the index and the columns?

np.random.seed(0) 

df = DataFrame(np.random.randint(0, 100, [3, 4]),
               index=list('abc'),
               columns=list('wxyz'))
df

    w   x   y   z
a  44  47  64  67
b  67   9  83  21
c  36  87  70  88


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   w       3 non-null      int64
 1   x       3 non-null      int64
 2   y       3 non-null      int64
 3   z       3 non-null      int64
dtypes: int64(4)
memory usage: 120.0+ bytes


In [10]:
df.memory_usage()

Index    24
w        24
x        24
y        24
z        24
dtype: int64

In [11]:
24 * 5

120

The `+` indicates that our data frame contains some values that are stored as Python objects (here, the strings in the index), not directly in NumPy. It takes time and effort to go into Python's memory, track down those objects, and get their sizes. So by default Pandas fudges: it tells us how much memory the *pointer* to each Python object takes up (64 bits, or 8 bytes), which doesn't reflect the actual size of the string in memory.

In [12]:
# we can solve this by telling Pandas to go into Python's memory and calculate the *real* size

df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   w       3 non-null      int64
 1   x       3 non-null      int64
 2   y       3 non-null      int64
 3   z       3 non-null      int64
dtypes: int64(4)
memory usage: 246.0 bytes


In [13]:
df.memory_usage(deep=True)

Index    150
w         24
x         24
y         24
z         24
dtype: int64

In [14]:
df

    w   x   y   z
a  44  47  64  67
b  67   9  83  21
c  36  87  70  88


In [15]:
df.loc['a', 'w'] = 'abcdefghijklmnopqrstuvwxyz' * 10_000_000

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)



In [16]:
df

                                                   w   x   y   z
a  abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrst...  47  64  67
b                                                 67   9  83  21
c                                                 36  87  70  88


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   w       3 non-null      object
 1   x       3 non-null      int64 
 2   y       3 non-null      int64 
 3   z       3 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 228.0+ bytes


In [18]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   w       3 non-null      object
 1   x       3 non-null      int64 
 2   y       3 non-null      int64 
 3   z       3 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 248.0 MB


In [19]:
import sys
sys.getsizeof('')

41

In [20]:
sys.getsizeof('a')

42
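
The two `sys.getsizeof` calls above hint at what `deep=True` has to add up: every Python string carries a fixed overhead plus its characters. Here is a minimal sketch (not Pandas's actual implementation) that estimates the deep size of an object-dtype series by adding the size of each string to the 8-byte pointers. The series `s` and its values are made up for illustration, and the exact byte counts vary by Python version.

```python
import sys
import pandas as pd

# a made-up series, forced to object dtype so that each value is a Python string
s = pd.Series(['a', 'bb', 'ccc'], dtype=object)

# shallow: only the 8-byte pointers, one per row
shallow = s.memory_usage(index=False)

# deep: Pandas also visits each Python object and asks for its size
deep = s.memory_usage(index=False, deep=True)

# our own estimate: the pointers plus the size of every string
estimate = shallow + sum(sys.getsizeof(one_value) for one_value in s)

print(f'{shallow=}, {deep=}, {estimate=}')
```

On CPython, the estimate should match or closely track the `deep` figure, which is why a single enormous string can dominate the memory usage of an entire data frame.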

In [21]:
filename = '/Users/reuven/Courses/Current/data/nyc-parking-violations-2020.csv'

!ls -lh $filename

-rw-r--r-- 1 reuven staff 2.2G Jul  5  2021 /Users/reuven/Courses/Current/data/nyc-parking-violations-2020.csv


In [22]:
# what happens if I read this file into memory?

df = pd.read_csv(filename)

  df = pd.read_csv(filename)


In [23]:
df = pd.read_csv(filename, low_memory=False)

In [24]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12495734 entries, 0 to 12495733
Data columns (total 43 columns):
 #   Column                             Dtype  
---  ------                             -----  
 0   Summons Number                     int64  
 1   Plate ID                           object 
 2   Registration State                 object 
 3   Plate Type                         object 
 4   Issue Date                         object 
 5   Violation Code                     int64  
 6   Vehicle Body Type                  object 
 7   Vehicle Make                       object 
 8   Issuing Agency                     object 
 9   Street Code1                       int64  
 10  Street Code2                       int64  
 11  Street Code3                       int64  
 12  Vehicle Expiration Date            int64  
 13  Violation Location                 float64
 14  Violation Precinct                 int64  
 15  Issuer Precinct                    int64  
 16  Issuer Code     

In [25]:
# 12,495,734 rows in this data frame!

In [26]:
# one great way to reduce memory size: keep only a handful of columns!
# usecols lets us specify the columns we want

df = pd.read_csv(filename, low_memory=False,
                 usecols=['Plate ID', 'Registration State',
                          'Issue Date', 'Vehicle Make',
                          'Vehicle Color', 'Violation Code'])

In [27]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12495734 entries, 0 to 12495733
Data columns (total 6 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   Plate ID            object
 1   Registration State  object
 2   Issue Date          object
 3   Violation Code      int64 
 4   Vehicle Make        object
 5   Vehicle Color       object
dtypes: int64(1), object(5)
memory usage: 3.4 GB


In [28]:
df['Violation Code'].min()

np.int64(0)

In [29]:
df['Violation Code'].max()

np.int64(99)

In [31]:
# we can get away with using an 8-bit integer!
# bytes saved: int64 takes 8 bytes per value, int8 takes only 1

(12495734 * 8) - (12495734 * 1)

87470138

In [None]:
87,470,138

In [32]:
# how can I specify that the "Violation Code" column should use np.int8, and not np.int64?
# here, we can start to specify the dtype keyword argument to read_csv


df = pd.read_csv(filename, low_memory=False,
                 usecols=['Plate ID', 'Registration State',
                          'Issue Date', 'Vehicle Make',
                          'Vehicle Color', 'Violation Code'],
                dtype={'Violation Code':np.int8})

In [33]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12495734 entries, 0 to 12495733
Data columns (total 6 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   Plate ID            object
 1   Registration State  object
 2   Issue Date          object
 3   Violation Code      int8  
 4   Vehicle Make        object
 5   Vehicle Color       object
dtypes: int8(1), object(5)
memory usage: 3.3 GB
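
Checking `min` and `max` by hand works, but Pandas can also pick the smallest integer dtype for a column that is already loaded. Here is a minimal sketch using `pd.to_numeric` with its `downcast` argument; the sample values are made up to mimic the 0-99 range of `Violation Code`.

```python
import numpy as np
import pandas as pd

# made-up stand-in for the 'Violation Code' column, stored as int64
codes = pd.Series([0, 21, 38, 99], dtype=np.int64)

# ask Pandas for the smallest signed / unsigned integer dtype that fits the values
smallest_signed = pd.to_numeric(codes, downcast='integer')
smallest_unsigned = pd.to_numeric(codes, downcast='unsigned')

print(smallest_signed.dtype, smallest_unsigned.dtype)
print(codes.nbytes, smallest_signed.nbytes)   # bytes used by the values alone
print(np.iinfo(np.int8))                      # the range that int8 can hold
```

Keep in mind that the downcast is based only on the values currently in the column. An `int8` holds values from -128 to 127, so a code of 128 or higher in next year's file would no longer fit.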


# Categories

If a column contains strings, then each of those strings is being stored in Python and taking up memory. If we were writing a program, and the program used the same hard-coded strings many times, we would use an enum. Pandas supports the same idea, namely replacing strings with integers and storing the strings in one place. Each integer refers to an element in that storage. Because integers are smaller than strings, and because strings often repeat, categories can reduce memory usage quite a bit.
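
To make the enum analogy concrete, here is a small sketch with a made-up series of color names. It shows the two pieces a category is made of: the stored strings (`.cat.categories`) and the small integers that refer to them (`.cat.codes`).

```python
import pandas as pd

# made-up data; object dtype so that each value starts out as a Python string
colors = pd.Series(['red', 'blue', 'red', 'green', 'blue', 'red'], dtype=object)
as_category = colors.astype('category')

# each distinct string is stored once...
print(as_category.cat.categories)

# ...and every row holds only a small integer pointing at one of them
print(as_category.cat.codes)

# compare the deep memory usage of the two versions
print(colors.memory_usage(deep=True), as_category.memory_usage(deep=True))
```

On a series this tiny, the category version may not come out smaller, because the category storage has its own overhead. The savings show up when a small number of distinct strings repeats across many rows.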

In [34]:
df.columns

Index(['Plate ID', 'Registration State', 'Issue Date', 'Violation Code',
       'Vehicle Make', 'Vehicle Color'],
      dtype='object')

In [35]:
df['Plate ID'].value_counts()

Plate ID
BLANKPLATE    8882
2704819       1535
86107MM       1429
2703289       1364
NS            1189
              ... 
DVV1033          1
JCX9668          1
BXU5368          1
HFS8627          1
T64528C          1
Name: count, Length: 3245467, dtype: int64

In [38]:
for one_column in df.select_dtypes('object').columns:
    print(f'Adjusting {one_column}...')
    df[one_column] = df[one_column].astype('category')

Adjusting Plate ID...
Adjusting Registration State...
Adjusting Issue Date...
Adjusting Vehicle Make...
Adjusting Vehicle Color...


In [39]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12495734 entries, 0 to 12495733
Data columns (total 6 columns):
 #   Column              Dtype   
---  ------              -----   
 0   Plate ID            category
 1   Registration State  category
 2   Issue Date          category
 3   Violation Code      int8    
 4   Vehicle Make        category
 5   Vehicle Color       category
dtypes: category(5), int8(1)
memory usage: 445.5 MB
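
Categories pay off most when values repeat. A column such as `Plate ID`, with more than 3 million distinct values, gains less than `Registration State` does. Here is a sketch of one possible approach: a hypothetical helper, `categorize_low_cardinality`, that converts only columns whose share of distinct values is below a threshold. The helper name, the `max_ratio` parameter, and the demo data are all assumptions made for illustration.

```python
import pandas as pd

def categorize_low_cardinality(df, max_ratio=0.5):
    """Return a copy of df in which object columns with few distinct values
    (relative to the number of rows) are converted to categories.
    Hypothetical helper, not part of Pandas."""
    result = df.copy()
    for one_column in result.select_dtypes(include='object').columns:
        ratio = result[one_column].nunique() / max(len(result), 1)
        if ratio <= max_ratio:
            result[one_column] = result[one_column].astype('category')
    return result

# made-up demo data: 'state' repeats a lot, 'plate' is unique on every row
demo = pd.DataFrame({
    'state': pd.Series(['NY', 'NJ', 'NY', 'NY', 'CT', 'NY'], dtype=object),
    'plate': pd.Series(['A1', 'B2', 'C3', 'D4', 'E5', 'F6'], dtype=object),
})

converted = categorize_low_cardinality(demo)
print(converted.dtypes)
```

The right threshold depends on your data, so treat `0.5` as a placeholder and compare `memory_usage(deep=True)` before and after.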


In [40]:
df['Plate ID']

0             J58JKX
1            KRE6058
2            444326R
3            F728330
4            FMY9090
              ...   
12495729     62161MM
12495730     GYE7330
12495731     HNY4802
12495732    T687081C
12495733      3497ZN
Name: Plate ID, Length: 12495734, dtype: category
Categories (3245467, object): ['000000', '00000K', '00004', '000050', ..., 'hNN7585', 'jhs4430', 'kNH5584', 'tR8DER']

In [41]:
df['Vehicle Color']

0              BK
1             BLK
2           BLACK
3             NaN
4            GREY
            ...  
12495729       BR
12495730      BLK
12495731       GY
12495732      BLK
12495733    WHITE
Name: Vehicle Color, Length: 12495734, dtype: category
Categories (1896, object): ['';', '+', '++', '+++++', ..., 'YYL', 'Yello', 'Z', '`']

In [42]:
df['Vehicle Color'].dtype

CategoricalDtype(categories=['';', '+', '++', '+++++', '-', '- E', '- J', '--H', '-.',
                  '-.H',
                  ...
                  'YWPR', 'YWRD', 'YWTN', 'YWV', 'YWWH', 'YWYW', 'YYL',
                  'Yello', 'Z', '`'],
, ordered=False, categories_dtype=object)

# Exercise: Olympic data

1. Read the entire Olympic data set into memory, using the defaults. How much RAM does this use altogether?
2. Only keep `Team`, `Season`, `Year`, `Height`, and `Weight` columns. Now how much RAM does it use?
3. Modify `Year`, `Height`, and `Weight` to use dtypes smaller than the 64-bit defaults (`int64` and `float64`). How small can they get? How much RAM do we use now?
4. Turn the text columns into categories. How much RAM are we using now?

In [43]:
olympic_df = pd.read_csv('olympic_athlete_events.zip')

In [44]:
olympic_df.shape

(271116, 15)

In [45]:
olympic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB


In [47]:
olympic_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 158.9 MB


In [48]:
!ls -lh olympic_athlete_events.zip

-rw-r--r-- 1 reuven staff 5.3M Apr  3  2022 olympic_athlete_events.zip


In [50]:
# Only keep Team, Season, Year, Height, and Weight columns. Now how much RAM does it use?

olympic_df = pd.read_csv('olympic_athlete_events.zip',
                        usecols=['Team', 'Season', 'Year', 'Height', 'Weight'])
olympic_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 5 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Height  210945 non-null  float64
 1   Weight  208241 non-null  float64
 2   Team    271116 non-null  object 
 3   Year    271116 non-null  int64  
 4   Season  271116 non-null  object 
dtypes: float64(2), int64(1), object(2)
memory usage: 35.3 MB


In [52]:
olympic_df[['Year', 'Height', 'Weight']].agg(['min', 'max'])

     Year  Height  Weight
min  1896   127.0    25.0
max  2016   226.0   214.0


In [58]:
# Modify Year, Height, and Weight to use dtypes smaller than the 64-bit defaults. How small can they get? How much RAM do we use now?

olympic_df = pd.read_csv('olympic_athlete_events.zip',
                        usecols=['Team', 'Season', 'Year', 'Height', 'Weight'],
                        dtype={'Year':np.uint16,
                               'Height':np.float32,
                              'Weight':np.float32})
olympic_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 5 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Height  210945 non-null  float32
 1   Weight  208241 non-null  float32
 2   Team    271116 non-null  object 
 3   Year    271116 non-null  uint16 
 4   Season  271116 non-null  object 
dtypes: float32(2), object(2), uint16(1)
memory usage: 31.7 MB


In [59]:
# Turn the text columns into categories. How much RAM are we using now?

olympic_df = pd.read_csv('olympic_athlete_events.zip',
                        usecols=['Team', 'Season', 'Year', 'Height', 'Weight'],
                        dtype={'Year':np.uint16,
                               'Height':np.float32,
                              'Weight':np.float32,
                              'Team':'category',
                              'Season':'category'})
olympic_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 5 columns):
 #   Column  Non-Null Count   Dtype   
---  ------  --------------   -----   
 0   Height  210945 non-null  float32 
 1   Weight  208241 non-null  float32 
 2   Team    271116 non-null  category
 3   Year    271116 non-null  uint16  
 4   Season  271116 non-null  category
dtypes: category(2), float32(2), uint16(1)
memory usage: 3.5 MB


# Apache Arrow

Wes McKinney, who created Pandas, realized that he was reinventing the wheel. He used NumPy, but having a 2D table in memory wasn't unique to Pandas. He also realized that while NumPy works, and is efficient, it's far from perfect for data analysis. He decided to create a new in-memory data structure designed for data analysis. That project is known as Apache Arrow. Its Python bindings are known as PyArrow.

There are plans to replace NumPy with PyArrow over the coming years. Right now, you can use PyArrow in a few ways:

- You can use its CSV-loading library, which can be far faster than the Pandas default parser (which is written in C, not Python) on large files, although it was slower on the small Olympic file timed below
- You can use two binary file formats (Parquet and Feather) to read and write data frames from and to disk; these are far faster than CSV or Excel, and more accurate, because each column's dtype is stored along with the data
- You can use it instead of NumPy as the storage backend, although this is considered experimental


In [60]:
%%timeit

pd.read_csv('olympic_athlete_events.zip',
                        usecols=['Team', 'Season', 'Year', 'Height', 'Weight'],
                        dtype={'Year':np.uint16,
                               'Height':np.float32,
                              'Weight':np.float32,
                              'Team':'category',
                              'Season':'category'})


243 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [61]:
%%timeit

pd.read_csv('olympic_athlete_events.zip',
                        usecols=['Team', 'Season', 'Year', 'Height', 'Weight'],
                        dtype={'Year':np.uint16,
                               'Height':np.float32,
                              'Weight':np.float32,
                              'Team':'category',
                              'Season':'category'},
           engine='pyarrow')   # Use PyArrow to load the CSV -- but NumPy is still the storage backend


676 ms ± 5.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [62]:
olympic_file = '/Users/reuven/Courses/Current/Data/olympic_athlete_events.csv'

!ls -lh $olympic_file

-rw-r--r-- 1 reuven staff 40M Oct  1  2019 /Users/reuven/Courses/Current/Data/olympic_athlete_events.csv


In [63]:
%%timeit

pd.read_csv(olympic_file,
                        usecols=['Team', 'Season', 'Year', 'Height', 'Weight'],
                        dtype={'Year':np.uint16,
                               'Height':np.float32,
                              'Weight':np.float32,
                              'Team':'category',
                              'Season':'category'})

217 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [66]:
%%timeit

pd.read_csv(olympic_file,
                        usecols=['Team', 'Season', 'Year', 'Height', 'Weight'],
               engine='pyarrow')

631 ms ± 8.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [65]:
filename

'/Users/reuven/Courses/Current/data/nyc-parking-violations-2020.csv'

# PyArrow binary formats

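- Feather format, which is typically larger, but takes less time to read/write (less compression)
- Parquet format, which is typically smaller, but takes more time to read/write (more compression)

These are rules of thumb: in the listing below, the Feather file actually came out smaller than the Parquet file for this data frame, so measure on your own data.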
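
Beyond speed, a binary format keeps the dtypes we worked to shrink. Here is a minimal sketch, using a made-up three-row data frame, that round-trips the same data through CSV and through Parquet. The Parquet half assumes that the `pyarrow` package is installed, and is skipped otherwise.

```python
import io
import numpy as np
import pandas as pd

# made-up data frame with a small integer dtype and a category
df_small = pd.DataFrame({
    'code': np.array([1, 2, 3], dtype=np.int8),
    'state': pd.Categorical(['NY', 'NJ', 'NY']),
})

# CSV is plain text, so the dtype information is lost on the way back in
csv_buffer = io.StringIO()
df_small.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)
from_csv = pd.read_csv(csv_buffer)
print(from_csv.dtypes)        # 'code' is int64 again, 'state' is plain strings

# Parquet stores the schema alongside the data (this part needs pyarrow)
try:
    import pyarrow  # noqa: F401
except ImportError:
    from_parquet = None
    print('pyarrow is not installed; skipping the Parquet round trip')
else:
    parquet_buffer = io.BytesIO()
    df_small.to_parquet(parquet_buffer)
    parquet_buffer.seek(0)
    from_parquet = pd.read_parquet(parquet_buffer)
    print(from_parquet.dtypes)  # the int8 column should come back as int8
```

With CSV, you would have to pass `dtype=` again on every load to get the smaller data frame back.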

In [71]:
df.to_parquet('/tmp/nyc-tickets.parquet')

In [70]:
df.to_feather('/tmp/nyc-tickets.feather')

In [72]:
!ls -lh /tmp/nyc-tickets*

-rw-r--r-- 1 reuven wheel 118M Aug 27 21:03 /tmp/nyc-tickets.feather
-rw-r--r-- 1 reuven wheel 263M Aug 27 21:03 /tmp/nyc-tickets.parquet


In [73]:
%timeit df = pd.read_parquet('/tmp/nyc-tickets.parquet')

4.72 s ± 41.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


You can specify, when you read data in using `read_csv`, that you want `dtype_backend='pyarrow'`. This will use PyArrow instead of NumPy.
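
Here is a minimal sketch of that option. It assumes Pandas 2.0 or later with the `pyarrow` package installed (it is skipped otherwise), and it reads a made-up two-row CSV from a string rather than from the course's files.

```python
import io
import pandas as pd

# made-up CSV content; the second row has a missing Height
csv_text = 'Team,Year,Height\nChina,1992,180.0\nDenmark,1920,\n'

try:
    import pyarrow  # noqa: F401
except ImportError:
    arrow_df = None
    print('pyarrow is not installed; skipping this example')
else:
    # the columns are stored in Arrow-backed dtypes rather than NumPy ones
    arrow_df = pd.read_csv(io.StringIO(csv_text), dtype_backend='pyarrow')
    print(arrow_df.dtypes)
    print(arrow_df.memory_usage(deep=True))
```

Note that `dtype_backend` controls how the data is *stored*, while `engine='pyarrow'` controls which parser *reads* the file; the two can be used independently.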

In many cases, you will find that the PyArrow backend is *far* faster and uses less memory than NumPy, especially for string columns. But it's still not ready for prime time, and certain operations are far slower.

# Common Pandas anti-patterns (meaning: don't do these things!)

Don't use `inplace=True`! It's very common for people to learn about `inplace=True` on many methods. Instead of returning a new data frame, these methods then return `None`, and modify the original data frame. You can do this with `set_index`, `sort_values`, `dropna`, and many others.

However, the core Pandas developers say loud and clear:

1. It's going away!
2. It's not really faster or smaller than returning a new data frame
3. It kills the possibility of doing method chaining, which is increasingly in vogue
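
As an illustration of the third point, here is a small sketch of method chaining on a made-up data frame. Each method returns a new data frame, so the calls can be strung together, and the original is left untouched.

```python
import pandas as pd

# made-up data, including one row with a missing name
raw = pd.DataFrame({'name': ['c', 'a', None, 'b'],
                    'score': [30, 10, 40, 20]})

# no inplace=True anywhere: each step hands a new data frame to the next
cleaned = (
    raw
    .dropna(subset=['name'])
    .sort_values('score')
    .set_index('name')
)

print(cleaned)
print(len(raw))   # the original data frame still has all of its rows
```

With `inplace=True`, each of these methods would return `None`, so the chain would fail at the second step.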

# Don't use `for` loops on your data frames

Pandas is optimized for vectorized method calls. It's really fast at doing those! If you use a `for` loop, you're giving up all of the speed and memory optimizations.

You should almost always be able to find a Pandas method that does what you want. If you're working with strings, check out the `.str` accessor, and the same is true for `.dt` for datetime values.

If you really need to, you can use the `map` and `apply` methods; they aren't perfect, but they're still better than an explicit `for` loop.
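
Here is a small sketch contrasting the two styles on a made-up data frame. The loop and the vectorized expression compute the same values; the vectorized versions, including the `.str` and `.dt` accessors, let Pandas do the work in one call.

```python
import numpy as np
import pandas as pd

# made-up data for illustration
df_demo = pd.DataFrame({
    'name': ['alice', 'bob', 'carol'],
    'when': pd.to_datetime(['2020-01-15', '2021-06-30', '2022-12-01']),
    'price': [10.0, 20.0, 30.0],
})

# the anti-pattern: visiting one value at a time in Python
with_tax_loop = []
for one_price in df_demo['price']:
    with_tax_loop.append(one_price * 1.17)

# the vectorized equivalents: one call per column
with_tax = df_demo['price'] * 1.17
upper_names = df_demo['name'].str.upper()
years = df_demo['when'].dt.year

print(with_tax.tolist(), upper_names.tolist(), years.tolist())
```

On three rows the difference is invisible, but on millions of rows the loop is typically far slower; `%timeit` on your own data will show by how much.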

# Turn dates and times into `datetime` dtypes!

Date strings take up far more memory than `datetime` values (which use 64 bits each), so converting them will both save you memory and give you lots of functionality through the `.dt` accessor.
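
Here is a minimal sketch of the conversion, using made-up date strings. The `MM/DD/YYYY` format is an assumption chosen for this example, so check the actual format of your own column before passing `format=`.

```python
import pandas as pd

# made-up date strings, repeated to make the memory difference visible
date_strings = pd.Series(['06/14/2020', '07/03/2020', '08/21/2020'] * 1_000,
                         dtype=object)

# passing an explicit format avoids guessing, and is usually faster
as_datetime = pd.to_datetime(date_strings, format='%m/%d/%Y')

print(date_strings.memory_usage(deep=True))   # one Python string per row
print(as_datetime.memory_usage(deep=True))    # 8 bytes per row
print(as_datetime.dt.month.value_counts())    # the .dt accessor now works
```

When reading from a file, the `parse_dates` argument to `read_csv` can do the conversion at load time.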