# Idiomatic Polars 

## Matt Harrison - ODSC 2025

## https://github.com/mattharrison/odsc_east_2025



<!-- https://github.com/mattharrison/talks -->

## About Matt Harrison @\_\_mharrison\_\_

* Author of *Effective Polars*, *Effective Pandas*, *Effective XGBoost*, *Learning Python for Data*, *Machine Learning Pocket Reference*, and *Illustrated Guide to Python 3*
* Advisor and consultant.
* Corporate trainer at MetaSnake. Taught Pandas to thousands of students.


## Relevant Background

* 1999 NLP
* 2006 Created Python OLAP Engine
* 2009 Heard about Pandas
* Used Pandas for failure modeling, analytics, and ML
* 2016 Learning the Pandas Library
* 2019 Spark
* 2020 Pandas Cookbook
* 2021 Effective Pandas
* 2022 cuDF, Modin, Polars
* 2023 Effective Pandas 2
* 2024 Effective Polars
* 2024 Effective Polars 1.0
* 2024 NVIDIA Polars backend


## Why Tabular?

- Deep learning and video/audio are popular, but the crown jewels are in Excel or SQL.

- My focus is on tabular tooling (Pandas, Polars, XGBoost, CatBoost, etc.)


## Sad News

- Python is slow
- Pandas gets around this with NumPy (v1) and PyArrow (v2)
- Polars gets around this with Arrow (Rust)
- Stay in the playground

## Outline of Opinions

* Load Data
* Types
* Chaining
* Apply
* Aggregation

## Polars Overview

* Polars is a Rust library with Python bindings
* Polars has expressions and contexts
  - Contexts - `.select`, `.with_columns`, `.filter`
  - Expressions - done w/ `pl.col("col_name")` or `pl.lit(1)`
* Lazy evaluation
* Query optimization
* Multi-threaded


## Data

This just shows how I processed the Strava GPX file to create a dataframe.


In [None]:
import gpxpy
import numpy as np
import polars as pl
import polars.selectors as cs
import pandas as pd

In [None]:
pl.__version__

In [None]:
import polars as pl
import gpxpy
import numpy as np

def gpx_to_polars(fname):
    # Parse the GPX file (the context manager closes the file handle)
    with open(fname) as gpx_file:
        data = gpxpy.parse(gpx_file)
    prev = None
    data_dict = {'course': [],
                 'distance_2d': [],
                 'latitude': [],
                 'longitude': [],
                 'time': [],
                 'elevation': [],
                 'speed_between': [],
                }
    
    # Iterate through tracks, segments, and points
    for track in data.tracks:
        for seg in track.segments:
            for i, pt in enumerate(seg.points):
                if prev is None:
                    prev = pt
                for key in data_dict:
                    attr = getattr(pt, key)
                    if callable(attr):
                        data_dict[key].append(attr(prev))
                    else:
                        data_dict[key].append(attr)
                prev = pt

    # Create a Polars DataFrame
    df = (pl.DataFrame(data_dict)

        #.with_columns([pl.col("time").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%SZ", strict=False)])
        .with_columns(
            travelled= pl.col("distance_2d").cum_sum(),
            elapsed=(pl.col("time") - pl.col("time").min()).dt.total_seconds()
        )
        .with_columns(        
            avg_velocity=pl.col("travelled") / pl.col("elapsed"),
            rolling_travelled=pl.col("travelled").rolling_mean(window_size=5),
            rolling_elapsed=pl.col("elapsed").rolling_mean(window_size=5),
        )
        .with_columns(
            rolling_velocity=pl.col("rolling_travelled") / pl.col("rolling_elapsed"),
            rolling_between=pl.col("speed_between").rolling_mean(window_size=5),
        )
    )

    return df

df = gpx_to_polars('Face_plant.gpx')
print(df)


In [None]:
import polars as pl
import polars.selectors as cs
import numpy as np



In [None]:
# 1.09 Megabytes
df.estimated_size()

In [None]:
(df)  # no index

In [None]:
(df.with_row_index())

## Types
Getting the right types will enable analysis and correctness.

In [None]:
print(df.columns)

In [None]:
cols = ['course', 'distance_2d', 'latitude', 'longitude', 'time', 'elevation', 'speed_between', 'travelled', 
        'elapsed', 'avg_velocity', 'rolling_travelled', 'rolling_elapsed', 'rolling_velocity', 'rolling_between']

In [None]:
df[cols].dtypes

In [None]:
# 1.06 Megabytes
df[cols].estimated_size()

### Ints

In [None]:
# select is a context
# pl.col is an expression
df[cols].select(pl.col([pl.Int64]))

In [None]:
df[cols].select(pl.col(pl.Int64)).describe()

In [None]:
# chaining
(df
 .select(cols)
 .select(pl.col(pl.Int64))
 .describe()
)

In [None]:
# can elapsed be an int8?
np.iinfo(np.int8)

In [None]:
# can elapsed be an int16?
np.iinfo(np.int16)

In [None]:
# chaining
# polars prevents illegal casts: this raises an error because elapsed does not fit in an Int8
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int8))
 #.describe()
)

In [None]:
# chaining
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16))
 .describe()               
)

In [None]:
# chaining
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16)) 
 .estimated_size()
)

In [None]:
(df
 .estimated_size()
)

### Strings

In [None]:
df.select(cols).select(pl.col(pl.String))

In [None]:
# chaining
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup'))
)

In [None]:
# chaining
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup'))
 .with_columns(alt_name=pl.col('course').str.to_uppercase()) 
 
)

In [None]:
# a bunch of string methods off of .str
# note that the spelling might be different from python/pandas
col = pl.col('')
print(dir(col.str))

In [None]:
col.str.to_uppercase?

## Convert Date to Local Time

In [None]:
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver')
              )
)

In [None]:
col = pl.col('time')
print(dir(col))

In [None]:
print(len(dir(col)))

In [None]:
print(dir(col.dt))

In [None]:
print(len(dir(col.dt)))

## Missing Data

- Use `.fill_null` to address
- Use `.filter` to filter rows
- Use `.select` to select columns

To view rows with missing data use `.filter(pl.col("col_name").is_null())`

In [None]:
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver'))
 .null_count()
)

In [None]:
# use .select to find where rows are missing
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver'))
 .select(pl.col('rolling_between').is_null())
)

In [None]:
# change .select to .filter to view the rows
(df
 .select(cols)
 .with_row_index()
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver'))               
 .filter(pl.col('rolling_between').is_null())
)

In [None]:
# what about nans?
# note that nan and null are different in polars
# nan means not a number
# null means missing data
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver'))               
 .select(cs.numeric().is_nan().sum())
)

In [None]:
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver'))               
 .select(pl.col('avg_velocity').is_nan())
)

In [None]:
(df
 .select(cols)
 .with_row_index()  
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver'))               
 .filter(pl.col('avg_velocity').is_nan())
)

In [None]:
# a glorious function

def tweak_gpx(df_):
    return (df_
        .select(cols)
        .with_row_index()  
        .with_columns(pl.col('elapsed').cast(pl.Int16),
                    course=pl.lit('Maple Syrup').cast(pl.Categorical),
                    time=pl.col('time').dt.convert_time_zone('America/Denver'))               
        )

tweak_gpx(df)

## Chain

Chaining is also called "flow" programming. Rather than making intermediate variables, just leverage the fact that most operations return a new object and work on that.

The chain should read like a recipe of ordered steps.

(BTW, this is actually what we did above.)

In [None]:
# a glorious function

def tweak_gpx(df_):
    return (df_
        .select(cols)
        .with_row_index()  
        .with_columns(pl.col('elapsed').cast(pl.Int16),
                    course=pl.lit('Maple Syrup').cast(pl.Categorical),
                    time=pl.col('time').dt.convert_time_zone('America/Denver'))               
        )

tweak_gpx(df).write_parquet('Face_plant.parquet')

In [None]:
# laziness
gpx_lazy = pl.scan_parquet('Face_plant.parquet') 
tweak_gpx(gpx_lazy)

In [None]:
# use .collect to generate plan and materialize
tweak_gpx(gpx_lazy).collect()

In [None]:
# using GPU!
tweak_gpx(gpx_lazy).collect(engine='gpu')

In [None]:
# debugging
# some folks really want the intermediate data...
def get_var(df, var_name):
   globals()[var_name] = df
   return df

def tweak_gpx(df_):
    return (df_
        .pipe(lambda df: print(df.shape) or df)  # Look! 🤯
        .select(cols)
        .with_row_index()  
        .pipe(get_var, 'intermediate')  # Debugging! 💪
        .with_columns(pl.col('elapsed').cast(pl.Int16),
                    course=pl.lit('Maple Syrup').cast(pl.Categorical),
                    time=pl.col('time').dt.convert_time_zone('America/Denver'))               
        )

raw = pl.read_parquet('Face_plant.parquet')
tweak_gpx(raw)

In [None]:
intermediate

## Don't Apply (map_elements) if You Can Avoid It

In [None]:
# debugging
def get_var(df, var_name):
   globals()[var_name] = df
   return df

def tweak_gpx(df_):
    return (df_
        .pipe(lambda df: print(df.shape) or df)
        .select(cols)
        .with_row_index()
        .pipe(get_var, 'intermediate')
        .with_columns(pl.col('elapsed').cast(pl.Int16),
                    course=pl.lit('Maple Syrup').cast(pl.Categorical),
                    time=pl.col('time').dt.convert_time_zone('America/Denver'))               
        )

raw = pl.read_parquet('Face_plant.parquet')
df = tweak_gpx(raw)

In [None]:
# convert elevation from meters to feet
def meters_to_feet(m):
    return m * 3.28084

(df
 .select('elevation', 
         ele_ft=pl.col('elevation').map_elements(meters_to_feet)) 
)

In [None]:
# convert elevation from meters to feet
def meters_to_feet(m):
    return m * 3.28084

(df
 .select('elevation', 
         ele_ft=meters_to_feet(pl.col('elevation')))
)

In [None]:
# Perhaps more readable
# convert elevation from meters to feet
def meters_to_feet(m):
    return m * 3.28084

(df
 .select('elevation', 
         ele_ft=pl.col('elevation').pipe(meters_to_feet))
)

In [None]:
%%timeit
# takes 965 µs on my machine
(df
 .select('elevation', ele_ft=pl.col('elevation').map_elements(meters_to_feet)) 
)

In [None]:
import warnings

warnings.filterwarnings('ignore')

In [None]:
%%timeit
# takes 965 µs on my machine
(df
 .select('elevation', ele_ft=pl.col('elevation').map_elements(meters_to_feet)) 
)

In [None]:
warnings.resetwarnings()

In [None]:
%%timeit
(df
 .select('elevation', 
         ele_ft=pl.col('elevation').pipe(meters_to_feet))
)

In [None]:
%%timeit
(df
 .select('elevation', 
         ele_ft=pl.col('elevation')*3.28084)
)

In [None]:
888/57

## Benchmark Caveat
- Benchmark with data the same size as what you use in the real world

## Master Aggregation

Let's compute speed (and distance) by 10-minute intervals.

In [None]:
def meters_per_second_to_mph(mps):
    return mps * 2.23694

(tweak_gpx(raw)
 .group_by_dynamic(index_column='time', every='10m')
 .agg(pl.col('travelled').last() - pl.col('travelled').first(),
      speed=(pl.col('travelled').last() - pl.col('travelled').first()) / 
          ((pl.col('time').last() - pl.col('time').first()).dt.total_seconds())
    ) 
 .with_columns(mph=pl.col('speed').pipe(meters_per_second_to_mph))
 )

In [None]:
def meters_per_second_to_mph(mps):
    return mps * 2.23694

(tweak_gpx(raw)
 .group_by_dynamic(index_column='time', every='10m')
 .agg(pl.col('travelled').last() - pl.col('travelled').first(),
      speed=(pl.col('travelled').last() - pl.col('travelled').first()) / 
      ((pl.col('time').last() - pl.col('time').first()).dt.total_seconds())
    ) 
 .with_columns(mph=pl.col('speed').pipe(meters_per_second_to_mph))
 .plot.bar(x='time', y='mph')
 )

In [None]:
# uphill vs downhill
def meters_to_feet(m):
    return m * 3.28084

def feet_to_miles(f):
    return f / 5280

(tweak_gpx(raw)
 .with_columns(climbing=pl.col('elevation').diff().gt(0))
 .group_by('climbing')
 .agg(pl.col('distance_2d').sum().pipe(meters_to_feet).pipe(feet_to_miles))
 .filter(~pl.col('climbing').is_null())
)

In [None]:
# uphill vs downhill
def meters_to_feet(m):
    return m * 3.28084

def feet_to_miles(f):
    return f / 5280

(tweak_gpx(raw)
 .with_columns(climbing=pl.col('elevation').diff().gt(0))
 .group_by('climbing')
 .agg(pl.col('distance_2d').sum().pipe(meters_to_feet).pipe(feet_to_miles))
 .filter(~pl.col('climbing').is_null())
 .plot.bar(x='climbing', y='distance_2d')
)

## Summary

* Correct types save space and enable convenient math, string, and date functionality
* Chaining operations will:
   * Make code readable
   * Reduce bugs
   * Make code easier to debug
* `.map_elements` is slow for math
* Aggregations are powerful. Play with them until they make sense


Let's connect! 

See me at the book signing tomorrow.

