# Idiomatic Polars 

## Matt Harrison - ODSC 2024



<!-- https://github.com/mattharrison/talks -->

## About Matt Harrison @\_\_mharrison\_\_

* Author of *Effective Polars*, *Effective Pandas*, *Effective XGBoost*, *Learning Python for Data*, *Machine Learning Pocket Reference*, and *Illustrated Guide to Python 3*
* Advisor and consultant.
* Corporate trainer at MetaSnake. Has taught Pandas to thousands of students.


## Relevant Background

* 1999 NLP
* 2006 Created Python OLAP Engine
* 2009 Heard about Pandas
* Used Pandas for failure modeling, analytics, and ML
* 2016 Learning the Pandas Library
* 2019 Spark
* 2020 Pandas Cookbook
* 2021 Effective Pandas
* 2022 cuDF, Modin, Polars
* 2023 Effective Pandas 2
* 2024 Effective Polars
* 2024 Effective Polars 1.0


## Why Tabular?

- Deep learning and video/audio are popular, but the crown jewels live in Excel or SQL.

- My focus is on tabular tooling (Pandas, Polars, XGBoost, CatBoost, etc.)


## Sad News

- Python is slow
- Pandas gets around this with NumPy (v1) and PyArrow (v2)
- Polars gets around this with Arrow (Rust)
- Stay in the playground: keep the work inside the compiled engine instead of dropping back to Python loops

## Outline of Opinions

* Load Data
* Types
* Chaining
* Apply
* Aggregation

## Polars Overview

* Polars is a Rust library with Python bindings
* Polars has expressions and contexts
  - Contexts - `.select`, `.with_columns`, `.filter`
  - Expressions - done w/ `pl.col("col_name")` or `pl.lit(1)`
* Lazy evaluation
* Query optimization
* Multi-threaded


## Data

In [None]:
!pip install -U polars

In [2]:
!uv add gpxpy

[2K[2mResolved [1m121 packages[0m [2min 306ms[0m[0m                                       [0m
[2K[37m⠙[0m [2mPreparing packages...[0m (0/1)                                                   
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/1)----[0m[0m     0 B/41.65 KiB                     [1A
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/1)----[0m[0m 16.00 KiB/41.65 KiB                   [1A
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/1)----[0m[0m 32.00 KiB/41.65 KiB                   [1A
[2K[2mPrepared [1m1 package[0m [2min 39ms[0m[0m                                                   [1A
         If the cache and target directories are on different filesystems, hardlinking may not be supported.
[2K[2mInstalled [1m1 package[0m [2min 2ms[0m[0m                                  [0m
 [32m+[39m [1mgpxpy[0m[2m==1.6.2[0m


In [1]:
import gpxpy
import numpy as np
import polars as pl
import polars.selectors as cs
import pandas as pd

In [2]:
pl.__version__

'1.12.0'

In [3]:
import polars as pl
import gpxpy
import numpy as np

def gpx_to_polars(fname):
    # Parse the GPX file
    with open(fname) as gpx_file:
        data = gpxpy.parse(gpx_file)
    prev = None
    data_dict = {'course': [],
                 'distance_2d': [],
                 'latitude': [],
                 'longitude': [],
                 'time': [],
                 'elevation': [],
                 'speed_between': [],
                }
    
    # Iterate through tracks, segments, and points
    for track in data.tracks:
        for seg in track.segments:
            for pt in seg.points:
                if prev is None:
                    prev = pt
                for key in data_dict:
                    attr = getattr(pt, key)
                    if callable(attr):
                        data_dict[key].append(attr(prev))
                    else:
                        data_dict[key].append(attr)
                prev = pt

    # Create a Polars DataFrame
    df = (pl.DataFrame(data_dict)

        #.with_columns([pl.col("time").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%SZ", strict=False)])
        .with_columns(
            travelled= pl.col("distance_2d").cum_sum(),
            elapsed=(pl.col("time") - pl.col("time").min()).dt.total_seconds()
        )
        .with_columns(        
            avg_velocity=pl.col("travelled") / pl.col("elapsed"),
            rolling_travelled=pl.col("travelled").rolling_mean(window_size=5),
            rolling_elapsed=pl.col("elapsed").rolling_mean(window_size=5),
        )
        .with_columns(
            rolling_velocity=pl.col("rolling_travelled") / pl.col("rolling_elapsed"),
            rolling_between=pl.col("speed_between").rolling_mean(window_size=5),
        )
    )

    return df

df = gpx_to_polars('Face_plant.gpx')
print(df)


shape: (10_430, 14)
┌────────┬────────────┬───────────┬────────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ course ┆ distance_2 ┆ latitude  ┆ longitude  ┆ … ┆ rolling_t ┆ rolling_e ┆ rolling_v ┆ rolling_b │
│ ---    ┆ d          ┆ ---       ┆ ---        ┆   ┆ ravelled  ┆ lapsed    ┆ elocity   ┆ etween    │
│ null   ┆ ---        ┆ f64       ┆ f64        ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│        ┆ f64        ┆           ┆            ┆   ┆ f64       ┆ f64       ┆ f64       ┆ f64       │
╞════════╪════════════╪═══════════╪════════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ null   ┆ 0.0        ┆ 40.879161 ┆ -111.85516 ┆ … ┆ null      ┆ null      ┆ null      ┆ null      │
│        ┆            ┆           ┆ 9          ┆   ┆           ┆           ┆           ┆           │
│ null   ┆ 1.210883   ┆ 40.879167 ┆ -111.85518 ┆ … ┆ null      ┆ null      ┆ null      ┆ null      │
│        ┆            ┆           ┆ 1          ┆   ┆           ┆       

In [4]:
# 1.09 Megabytes
df.estimated_size()


1091240

In [5]:
# no index!
df

course,distance_2d,latitude,longitude,time,elevation,speed_between,travelled,elapsed,avg_velocity,rolling_travelled,rolling_elapsed,rolling_velocity,rolling_between
null,f64,f64,f64,"datetime[μs, UTC]",f64,f64,f64,i64,f64,f64,f64,f64,f64
,0.0,40.879161,-111.855169,2024-09-10 23:41:56 UTC,1480.0,,0.0,0,,,,,
,1.210883,40.879167,-111.855181,2024-09-10 23:41:58 UTC,1480.1,0.607503,1.210883,2,0.605442,,,,
,1.227612,40.879172,-111.855194,2024-09-10 23:41:59 UTC,1480.1,1.227612,2.438495,3,0.812832,,,,
,0.952238,40.879174,-111.855205,2024-09-10 23:42:00 UTC,1480.1,0.952238,3.390733,4,0.847683,,,,
,0.90551,40.879177,-111.855215,2024-09-10 23:42:01 UTC,1480.1,0.90551,4.296243,5,0.859249,2.267271,2.8,0.80974,
…,…,…,…,…,…,…,…,…,…,…,…,…,…
,4.085147,40.857847,-111.823899,2024-09-11 02:39:58 UTC,1696.8,4.086371,18640.85135,10682,1.745071,18634.60004,10680.0,1.744813,3.177668
,1.535726,40.857852,-111.823916,2024-09-11 02:39:59 UTC,1696.8,1.535726,18642.387076,10683,1.745052,18637.420852,10681.0,1.744913,2.821613
,3.156134,40.857869,-111.823946,2024-09-11 02:40:00 UTC,1696.7,3.157718,18645.54321,10684,1.745184,18640.220738,10682.0,1.745012,2.800696
,3.626488,40.857889,-111.82398,2024-09-11 02:40:01 UTC,1696.5,3.631999,18649.169698,10685,1.74536,18642.943507,10683.0,1.745104,2.724433


In [6]:
(df.with_row_index())

index,course,distance_2d,latitude,longitude,time,elevation,speed_between,travelled,elapsed,avg_velocity,rolling_travelled,rolling_elapsed,rolling_velocity,rolling_between
u32,null,f64,f64,f64,"datetime[μs, UTC]",f64,f64,f64,i64,f64,f64,f64,f64,f64
0,,0.0,40.879161,-111.855169,2024-09-10 23:41:56 UTC,1480.0,,0.0,0,,,,,
1,,1.210883,40.879167,-111.855181,2024-09-10 23:41:58 UTC,1480.1,0.607503,1.210883,2,0.605442,,,,
2,,1.227612,40.879172,-111.855194,2024-09-10 23:41:59 UTC,1480.1,1.227612,2.438495,3,0.812832,,,,
3,,0.952238,40.879174,-111.855205,2024-09-10 23:42:00 UTC,1480.1,0.952238,3.390733,4,0.847683,,,,
4,,0.90551,40.879177,-111.855215,2024-09-10 23:42:01 UTC,1480.1,0.90551,4.296243,5,0.859249,2.267271,2.8,0.80974,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
10425,,4.085147,40.857847,-111.823899,2024-09-11 02:39:58 UTC,1696.8,4.086371,18640.85135,10682,1.745071,18634.60004,10680.0,1.744813,3.177668
10426,,1.535726,40.857852,-111.823916,2024-09-11 02:39:59 UTC,1696.8,1.535726,18642.387076,10683,1.745052,18637.420852,10681.0,1.744913,2.821613
10427,,3.156134,40.857869,-111.823946,2024-09-11 02:40:00 UTC,1696.7,3.157718,18645.54321,10684,1.745184,18640.220738,10682.0,1.745012,2.800696
10428,,3.626488,40.857889,-111.82398,2024-09-11 02:40:01 UTC,1696.5,3.631999,18649.169698,10685,1.74536,18642.943507,10683.0,1.745104,2.724433


## Types
Getting the right types will enable analysis and correctness.

In [7]:
print(df.columns)

['course', 'distance_2d', 'latitude', 'longitude', 'time', 'elevation', 'speed_between', 'travelled', 'elapsed', 'avg_velocity', 'rolling_travelled', 'rolling_elapsed', 'rolling_velocity', 'rolling_between']


In [8]:
cols = ['course', 'distance_2d', 'latitude', 'longitude', 'time', 'elevation', 'speed_between', 'travelled', 
        'elapsed', 'avg_velocity', 'rolling_travelled', 'rolling_elapsed', 'rolling_velocity', 'rolling_between']

In [9]:
df[cols].dtypes

[Null,
 Float64,
 Float64,
 Float64,
 Datetime(time_unit='us', time_zone='UTC'),
 Float64,
 Float64,
 Float64,
 Int64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64]

In [10]:
# 1.09 Megabytes
df[cols].estimated_size()

1091240

### Ints

In [11]:
# select is a context
# pl.col is an expression
df[cols].select(pl.col(pl.Int64))

elapsed
i64
0
2
3
4
5
…
10682
10683
10684
10685


In [12]:
df[cols].select(pl.col(pl.Int64)).describe()

statistic,elapsed
str,f64
"""count""",10430.0
"""null_count""",0.0
"""mean""",5403.773921
"""std""",3067.406074
"""min""",0.0
"""25%""",2760.0
"""50%""",5424.0
"""75%""",8069.0
"""max""",10686.0


In [13]:
# chaining
(df
 .select(cols)
 .select(pl.col(pl.Int64))
 .describe()
)

statistic,elapsed
str,f64
"""count""",10430.0
"""null_count""",0.0
"""mean""",5403.773921
"""std""",3067.406074
"""min""",0.0
"""25%""",2760.0
"""50%""",5424.0
"""75%""",8069.0
"""max""",10686.0


In [19]:
# can elapsed be an int8?
np.iinfo(np.int8)

iinfo(min=-128, max=127, dtype=int8)

In [20]:
# can elapsed be an int16?
np.iinfo(np.int16)

iinfo(min=-32768, max=32767, dtype=int16)

In [21]:
# chaining
# polars prevents illegal casts
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int8))
 #.describe()
)

InvalidOperationError: conversion from `i64` to `i8` failed in column 'elapsed' for 10321 out of 10430 values: [128, 129, … 10686]

In [22]:
# chaining
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16))
 .describe()               
)

statistic,course,distance_2d,latitude,longitude,time,elevation,speed_between,travelled,elapsed,avg_velocity,rolling_travelled,rolling_elapsed,rolling_velocity,rolling_between
str,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""count""",0.0,10430.0,10430.0,10430.0,"""10430""",10430.0,10429.0,10430.0,10430.0,10430.0,10426.0,10426.0,10426.0,10425.0
"""null_count""",10430.0,0.0,0.0,0.0,"""0""",0.0,1.0,0.0,0.0,0.0,4.0,4.0,4.0,5.0
"""mean""",,1.788392,40.859662,-111.822538,"""2024-09-11 01:11:59.773921+00:…",1788.95372,1.811534,10644.58751,5403.773921,,10645.09369,5403.797123,2.065183,1.81141
"""std""",,1.710749,0.005536,0.0137,,141.492882,1.657749,5393.503724,3067.406074,,5391.382656,3066.209114,0.418829,1.506718
"""min""",,0.0,40.848371,-111.855371,"""2024-09-10 23:41:56+00:00""",1480.0,0.0,0.0,0.0,0.374262,2.267271,2.8,0.376911,-8.6597e-15
"""25%""",,0.611231,40.856748,-111.829785,"""2024-09-11 00:27:56+00:00""",1696.0,0.675361,5810.431187,2760.0,1.848393,5812.456847,2761.0,1.848439,0.776866
"""50%""",,1.324637,40.859156,-111.82222,"""2024-09-11 01:12:20+00:00""",1816.5,1.392097,12239.423704,5424.0,1.981256,12239.47112,5424.0,1.981257,1.582638
"""75%""",,2.424472,40.86085,-111.813012,"""2024-09-11 01:56:25+00:00""",1924.9,2.506978,15087.566241,8069.0,2.231917,15086.449322,8068.0,2.231922,2.428585
"""max""",,15.32443,40.879271,-111.801421,"""2024-09-11 02:40:02+00:00""",1984.0,12.871382,18652.930528,10686.0,3.828152,18646.176372,10684.0,3.827216,12.491514


In [23]:
# chaining
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16)) 
 .estimated_size()
)

1028660

In [24]:
(df
 .select(cols)
 .estimated_size()
)

1091240

### Strings

In [23]:
df.select(cols).select(pl.col(pl.String))

In [24]:
# chaining
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup'))
)

course,distance_2d,latitude,longitude,time,elevation,speed_between,travelled,elapsed,avg_velocity,rolling_travelled,rolling_elapsed,rolling_velocity,rolling_between
str,f64,f64,f64,"datetime[μs, UTC]",f64,f64,f64,i16,f64,f64,f64,f64,f64
"""Maple Syrup""",0.0,40.879161,-111.855169,2024-09-10 23:41:56 UTC,1480.0,,0.0,0,,,,,
"""Maple Syrup""",1.210883,40.879167,-111.855181,2024-09-10 23:41:58 UTC,1480.1,0.607503,1.210883,2,0.605442,,,,
"""Maple Syrup""",1.227612,40.879172,-111.855194,2024-09-10 23:41:59 UTC,1480.1,1.227612,2.438495,3,0.812832,,,,
"""Maple Syrup""",0.952238,40.879174,-111.855205,2024-09-10 23:42:00 UTC,1480.1,0.952238,3.390733,4,0.847683,,,,
"""Maple Syrup""",0.90551,40.879177,-111.855215,2024-09-10 23:42:01 UTC,1480.1,0.90551,4.296243,5,0.859249,2.267271,2.8,0.80974,
…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Maple Syrup""",4.085147,40.857847,-111.823899,2024-09-11 02:39:58 UTC,1696.8,4.086371,18640.85135,10682,1.745071,18634.60004,10680.0,1.744813,3.177668
"""Maple Syrup""",1.535726,40.857852,-111.823916,2024-09-11 02:39:59 UTC,1696.8,1.535726,18642.387076,10683,1.745052,18637.420852,10681.0,1.744913,2.821613
"""Maple Syrup""",3.156134,40.857869,-111.823946,2024-09-11 02:40:00 UTC,1696.7,3.157718,18645.54321,10684,1.745184,18640.220738,10682.0,1.745012,2.800696
"""Maple Syrup""",3.626488,40.857889,-111.82398,2024-09-11 02:40:01 UTC,1696.5,3.631999,18649.169698,10685,1.74536,18642.943507,10683.0,1.745104,2.724433


## Convert Date to Local Time

In [27]:
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver')
              )
)

course,distance_2d,latitude,longitude,time,elevation,speed_between,travelled,elapsed,avg_velocity,rolling_travelled,rolling_elapsed,rolling_velocity,rolling_between
cat,f64,f64,f64,"datetime[μs, America/Denver]",f64,f64,f64,i16,f64,f64,f64,f64,f64
"""Maple Syrup""",0.0,40.879161,-111.855169,2024-09-10 17:41:56 MDT,1480.0,,0.0,0,,,,,
"""Maple Syrup""",1.210883,40.879167,-111.855181,2024-09-10 17:41:58 MDT,1480.1,0.607503,1.210883,2,0.605442,,,,
"""Maple Syrup""",1.227612,40.879172,-111.855194,2024-09-10 17:41:59 MDT,1480.1,1.227612,2.438495,3,0.812832,,,,
"""Maple Syrup""",0.952238,40.879174,-111.855205,2024-09-10 17:42:00 MDT,1480.1,0.952238,3.390733,4,0.847683,,,,
"""Maple Syrup""",0.90551,40.879177,-111.855215,2024-09-10 17:42:01 MDT,1480.1,0.90551,4.296243,5,0.859249,2.267271,2.8,0.80974,
…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Maple Syrup""",4.085147,40.857847,-111.823899,2024-09-10 20:39:58 MDT,1696.8,4.086371,18640.85135,10682,1.745071,18634.60004,10680.0,1.744813,3.177668
"""Maple Syrup""",1.535726,40.857852,-111.823916,2024-09-10 20:39:59 MDT,1696.8,1.535726,18642.387076,10683,1.745052,18637.420852,10681.0,1.744913,2.821613
"""Maple Syrup""",3.156134,40.857869,-111.823946,2024-09-10 20:40:00 MDT,1696.7,3.157718,18645.54321,10684,1.745184,18640.220738,10682.0,1.745012,2.800696
"""Maple Syrup""",3.626488,40.857889,-111.82398,2024-09-10 20:40:01 MDT,1696.5,3.631999,18649.169698,10685,1.74536,18642.943507,10683.0,1.745104,2.724433


In [28]:
col = pl.col('time')
print(dir(col))

['__abs__', '__add__', '__and__', '__annotations__', '__array_ufunc__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__invert__', '__le__', '__lt__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_accessors', '_from_pyexpr', '_map_batches_wrapper', '_pyexpr', '_repr_html_', 'abs', 'add', 'agg_groups', 'alias', 'all', 'and_', 'any', 'append', 'approx_n_unique', 'arccos', 'arccosh', 'arcsin', 'arcsinh', 'arctan', 'arctanh', 'arg_max', 'arg_min', 'arg_sort', 'arg_true', 'ar

In [29]:
print(len(dir(col)))

259


In [30]:
print(dir(col.dt))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_accessor', '_pyexpr', 'add_business_days', 'base_utc_offset', 'cast_time_unit', 'century', 'combine', 'convert_time_zone', 'date', 'datetime', 'day', 'dst_offset', 'epoch', 'hour', 'is_leap_year', 'iso_year', 'microsecond', 'millennium', 'millisecond', 'minute', 'month', 'month_end', 'month_start', 'nanosecond', 'offset_by', 'ordinal_day', 'quarter', 'replace_time_zone', 'round', 'second', 'strftime', 'time', 'timestamp', 'to_string', 'total_days', 'total_hours', 'total_microseconds', 'total_milliseconds', 'total_minutes', 'total_nanoseconds', 'total_seconds', 'truncate', 'week', 'weekday', 'with_time_unit', 'year']


In [31]:
print(len(dir(col.dt)))

72


## Missing Data

- Use `.fill_null` to replace missing values
- Use `.filter` to filter rows
- Use `.select` to select columns

To view rows with missing data use `.filter(pl.col("col_name").is_null())`

In [32]:
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
              time=pl.col('time').dt.convert_time_zone('America/Denver'))
 .null_count()
)

course,distance_2d,latitude,longitude,time,elevation,speed_between,travelled,elapsed,avg_velocity,rolling_travelled,rolling_elapsed,rolling_velocity,rolling_between
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,1,0,0,0,4,4,4,5


In [33]:
# use .select to find where rows are missing
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver'))
 .select(pl.col('rolling_between').is_null())
)

rolling_between
bool
true
true
true
true
true
…
false
false
false
false


In [34]:
# change .select to .filter to view the rows
(df
 .select(cols)
 .with_row_index()
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver'))               
 .filter(pl.col('rolling_between').is_null())
)

index,course,distance_2d,latitude,longitude,time,elevation,speed_between,travelled,elapsed,avg_velocity,rolling_travelled,rolling_elapsed,rolling_velocity,rolling_between
u32,cat,f64,f64,f64,"datetime[μs, America/Denver]",f64,f64,f64,i16,f64,f64,f64,f64,f64
0,"""Maple Syrup""",0.0,40.879161,-111.855169,2024-09-10 17:41:56 MDT,1480.0,,0.0,0,,,,,
1,"""Maple Syrup""",1.210883,40.879167,-111.855181,2024-09-10 17:41:58 MDT,1480.1,0.607503,1.210883,2,0.605442,,,,
2,"""Maple Syrup""",1.227612,40.879172,-111.855194,2024-09-10 17:41:59 MDT,1480.1,1.227612,2.438495,3,0.812832,,,,
3,"""Maple Syrup""",0.952238,40.879174,-111.855205,2024-09-10 17:42:00 MDT,1480.1,0.952238,3.390733,4,0.847683,,,,
4,"""Maple Syrup""",0.90551,40.879177,-111.855215,2024-09-10 17:42:01 MDT,1480.1,0.90551,4.296243,5,0.859249,2.267271,2.8,0.80974,


In [35]:
# what about nans?
# note that nan and null are different in polars
# nan means not a number
# null means missing data
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver'))               
 .select(cs.numeric().is_nan().sum())
)

distance_2d,latitude,longitude,elevation,speed_between,travelled,elapsed,avg_velocity,rolling_travelled,rolling_elapsed,rolling_velocity,rolling_between
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,1,0,0,0,0


In [36]:
(df
 .select(cols)
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver'))               
 .select(pl.col('avg_velocity').is_nan())
)

avg_velocity
bool
true
false
false
false
false
…
false
false
false
false


In [37]:
(df
 .select(cols)
 .with_row_index()  
 .with_columns(pl.col('elapsed').cast(pl.Int16),
               course=pl.lit('Maple Syrup').cast(pl.Categorical),
               time=pl.col('time').dt.convert_time_zone('America/Denver'))               
 .filter(pl.col('avg_velocity').is_nan())
)

index,course,distance_2d,latitude,longitude,time,elevation,speed_between,travelled,elapsed,avg_velocity,rolling_travelled,rolling_elapsed,rolling_velocity,rolling_between
u32,cat,f64,f64,f64,"datetime[μs, America/Denver]",f64,f64,f64,i16,f64,f64,f64,f64,f64
0,"""Maple Syrup""",0.0,40.879161,-111.855169,2024-09-10 17:41:56 MDT,1480.0,,0.0,0,,,,,


In [7]:
# a glorious function

def tweak_gpx(df_):
    return (df_
        .select(cols)
        .with_row_index()  
        .with_columns(pl.col('elapsed').cast(pl.Int16),
                    course=pl.lit('Maple Syrup').cast(pl.Categorical),
                    time=pl.col('time').dt.convert_time_zone('America/Denver'))               
        )

tweak_gpx(df)


## Chain

Chaining is also called "flow" programming. Rather than making intermediate variables, just leverage the fact that most operations return a new object and work on that.

The chain should read like a recipe of ordered steps.

(BTW, this is actually what we did above.)

In [14]:
# a glorious function

def tweak_gpx(df_):
    return (df_
        .select(cols)
        .with_row_index()  
        .with_columns(pl.col('elapsed').cast(pl.Int16),
                    course=pl.lit('Maple Syrup').cast(pl.Categorical),
                    time=pl.col('time').dt.convert_time_zone('America/Denver'))               
        )

tweak_gpx(df).write_parquet('Face_plant.parquet')

In [15]:
# laziness
gpx_lazy = pl.scan_parquet('Face_plant.parquet') 
tweak_gpx(gpx_lazy)

In [16]:
tweak_gpx(gpx_lazy).collect()

index,course,distance_2d,latitude,longitude,time,elevation,speed_between,travelled,elapsed,avg_velocity,rolling_travelled,rolling_elapsed,rolling_velocity,rolling_between
u32,cat,f64,f64,f64,"datetime[μs, America/Denver]",f64,f64,f64,i16,f64,f64,f64,f64,f64
0,"""Maple Syrup""",0.0,40.879161,-111.855169,2024-09-10 17:41:56 MDT,1480.0,,0.0,0,,,,,
1,"""Maple Syrup""",1.210883,40.879167,-111.855181,2024-09-10 17:41:58 MDT,1480.1,0.607503,1.210883,2,0.605442,,,,
2,"""Maple Syrup""",1.227612,40.879172,-111.855194,2024-09-10 17:41:59 MDT,1480.1,1.227612,2.438495,3,0.812832,,,,
3,"""Maple Syrup""",0.952238,40.879174,-111.855205,2024-09-10 17:42:00 MDT,1480.1,0.952238,3.390733,4,0.847683,,,,
4,"""Maple Syrup""",0.90551,40.879177,-111.855215,2024-09-10 17:42:01 MDT,1480.1,0.90551,4.296243,5,0.859249,2.267271,2.8,0.80974,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
10425,"""Maple Syrup""",4.085147,40.857847,-111.823899,2024-09-10 20:39:58 MDT,1696.8,4.086371,18640.85135,10682,1.745071,18634.60004,10680.0,1.744813,3.177668
10426,"""Maple Syrup""",1.535726,40.857852,-111.823916,2024-09-10 20:39:59 MDT,1696.8,1.535726,18642.387076,10683,1.745052,18637.420852,10681.0,1.744913,2.821613
10427,"""Maple Syrup""",3.156134,40.857869,-111.823946,2024-09-10 20:40:00 MDT,1696.7,3.157718,18645.54321,10684,1.745184,18640.220738,10682.0,1.745012,2.800696
10428,"""Maple Syrup""",3.626488,40.857889,-111.82398,2024-09-10 20:40:01 MDT,1696.5,3.631999,18649.169698,10685,1.74536,18642.943507,10683.0,1.745104,2.724433


In [17]:
# using GPU!
# engine is keyword-only; the GPU engine needs an NVIDIA GPU and Polars' optional GPU dependencies
tweak_gpx(gpx_lazy).collect(engine='gpu')

In [18]:
# debugging
def get_var(df, var_name):
   globals()[var_name] = df
   return df

def tweak_gpx(df_):
    return (df_
        .pipe(lambda df: print(df.shape) or df)  # Look! 🤯
        .select(cols)
        .with_row_index()  
        .pipe(get_var, 'intermediate')  # Debugging! 💪
        .with_columns(pl.col('elapsed').cast(pl.Int16),
                    course=pl.lit('Maple Syrup').cast(pl.Categorical),
                    time=pl.col('time').dt.convert_time_zone('America/Denver'))               
        )

raw = pl.read_parquet('Face_plant.parquet')
tweak_gpx(raw)

(10430, 15)


index,course,distance_2d,latitude,longitude,time,elevation,speed_between,travelled,elapsed,avg_velocity,rolling_travelled,rolling_elapsed,rolling_velocity,rolling_between
u32,cat,f64,f64,f64,"datetime[μs, America/Denver]",f64,f64,f64,i16,f64,f64,f64,f64,f64
0,"""Maple Syrup""",0.0,40.879161,-111.855169,2024-09-10 17:41:56 MDT,1480.0,,0.0,0,,,,,
1,"""Maple Syrup""",1.210883,40.879167,-111.855181,2024-09-10 17:41:58 MDT,1480.1,0.607503,1.210883,2,0.605442,,,,
2,"""Maple Syrup""",1.227612,40.879172,-111.855194,2024-09-10 17:41:59 MDT,1480.1,1.227612,2.438495,3,0.812832,,,,
3,"""Maple Syrup""",0.952238,40.879174,-111.855205,2024-09-10 17:42:00 MDT,1480.1,0.952238,3.390733,4,0.847683,,,,
4,"""Maple Syrup""",0.90551,40.879177,-111.855215,2024-09-10 17:42:01 MDT,1480.1,0.90551,4.296243,5,0.859249,2.267271,2.8,0.80974,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
10425,"""Maple Syrup""",4.085147,40.857847,-111.823899,2024-09-10 20:39:58 MDT,1696.8,4.086371,18640.85135,10682,1.745071,18634.60004,10680.0,1.744813,3.177668
10426,"""Maple Syrup""",1.535726,40.857852,-111.823916,2024-09-10 20:39:59 MDT,1696.8,1.535726,18642.387076,10683,1.745052,18637.420852,10681.0,1.744913,2.821613
10427,"""Maple Syrup""",3.156134,40.857869,-111.823946,2024-09-10 20:40:00 MDT,1696.7,3.157718,18645.54321,10684,1.745184,18640.220738,10682.0,1.745012,2.800696
10428,"""Maple Syrup""",3.626488,40.857889,-111.82398,2024-09-10 20:40:01 MDT,1696.5,3.631999,18649.169698,10685,1.74536,18642.943507,10683.0,1.745104,2.724433


In [19]:
intermediate

index,course,distance_2d,latitude,longitude,time,elevation,speed_between,travelled,elapsed,avg_velocity,rolling_travelled,rolling_elapsed,rolling_velocity,rolling_between
u32,cat,f64,f64,f64,"datetime[μs, America/Denver]",f64,f64,f64,i16,f64,f64,f64,f64,f64
0,"""Maple Syrup""",0.0,40.879161,-111.855169,2024-09-10 17:41:56 MDT,1480.0,,0.0,0,,,,,
1,"""Maple Syrup""",1.210883,40.879167,-111.855181,2024-09-10 17:41:58 MDT,1480.1,0.607503,1.210883,2,0.605442,,,,
2,"""Maple Syrup""",1.227612,40.879172,-111.855194,2024-09-10 17:41:59 MDT,1480.1,1.227612,2.438495,3,0.812832,,,,
3,"""Maple Syrup""",0.952238,40.879174,-111.855205,2024-09-10 17:42:00 MDT,1480.1,0.952238,3.390733,4,0.847683,,,,
4,"""Maple Syrup""",0.90551,40.879177,-111.855215,2024-09-10 17:42:01 MDT,1480.1,0.90551,4.296243,5,0.859249,2.267271,2.8,0.80974,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
10425,"""Maple Syrup""",4.085147,40.857847,-111.823899,2024-09-10 20:39:58 MDT,1696.8,4.086371,18640.85135,10682,1.745071,18634.60004,10680.0,1.744813,3.177668
10426,"""Maple Syrup""",1.535726,40.857852,-111.823916,2024-09-10 20:39:59 MDT,1696.8,1.535726,18642.387076,10683,1.745052,18637.420852,10681.0,1.744913,2.821613
10427,"""Maple Syrup""",3.156134,40.857869,-111.823946,2024-09-10 20:40:00 MDT,1696.7,3.157718,18645.54321,10684,1.745184,18640.220738,10682.0,1.745012,2.800696
10428,"""Maple Syrup""",3.626488,40.857889,-111.82398,2024-09-10 20:40:01 MDT,1696.5,3.631999,18649.169698,10685,1.74536,18642.943507,10683.0,1.745104,2.724433


## Don't Apply (map_elements) If You Can Avoid It

In [20]:
# debugging
def get_var(df, var_name):
   globals()[var_name] = df
   return df

def tweak_gpx(df_):
    return (df_
        .pipe(lambda df: print(df.shape) or df)
        .select(cols)
        .with_row_index()  
        .pipe(get_var, 'intermediate')
        .with_columns(pl.col('elapsed').cast(pl.Int16),
                    course=pl.lit('Maple Syrup').cast(pl.Categorical),
                    time=pl.col('time').dt.convert_time_zone('America/Denver'))               
        )

raw = pl.read_parquet('Face_plant.parquet')
df = tweak_gpx(raw)

(10430, 15)


In [21]:
# convert elevation from meters to feet
def meters_to_feet(m):
    return m * 3.28084

(df
 .select('elevation', 
         ele_ft=pl.col('elevation').map_elements(meters_to_feet)) 
)

Expr.map_elements is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
  - pl.col("elevation").map_elements(meters_to_feet)
with this one instead:
  + pl.col("elevation") * 3.28084

  ele_ft=pl.col('elevation').map_elements(meters_to_feet))


elevation,ele_ft
f64,f64
1480.0,4855.6432
1480.1,4855.971284
1480.1,4855.971284
1480.1,4855.971284
1480.1,4855.971284
…,…
1696.8,5566.929312
1696.8,5566.929312
1696.7,5566.601228
1696.5,5565.94506


In [22]:
# convert elevation from meters to feet
def meters_to_feet(m):
    return m * 3.28084

(df
 .select('elevation', 
         ele_ft=meters_to_feet(pl.col('elevation')))
)

elevation,ele_ft
f64,f64
1480.0,4855.6432
1480.1,4855.971284
1480.1,4855.971284
1480.1,4855.971284
1480.1,4855.971284
…,…
1696.8,5566.929312
1696.8,5566.929312
1696.7,5566.601228
1696.5,5565.94506


In [34]:
# Perhaps more readable
# convert elevation from meters to feet
def meters_to_feet(m):
    return m * 3.28084

(df
 .select('elevation', 
         ele_ft=pl.col('elevation').pipe(meters_to_feet))
)

elevation,ele_ft
f64,f64
1480.0,4855.6432
1480.1,4855.971284
1480.1,4855.971284
1480.1,4855.971284
1480.1,4855.971284
…,…
1696.8,5566.929312
1696.8,5566.929312
1696.7,5566.601228
1696.5,5565.94506


In [48]:
%%timeit
# takes 965 µs on my machine
(df
 .select('elevation', ele_ft=pl.col('elevation').map_elements(meters_to_feet)) 
)

Expr.map_elements is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
  - pl.col("elevation").map_elements(meters_to_feet)
with this one instead:
  + pl.col("elevation") * 3.28084

856 µs ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [49]:
%%timeit
(df
 .select('elevation', 
         ele_ft=pl.col('elevation').pipe(meters_to_feet))
)

37 µs ± 636 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [50]:
%%timeit
(df
 .select('elevation', 
         ele_ft=pl.col('elevation')*3.28084)
)

36.7 µs ± 471 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [51]:
965/40

24.125

## Benchmark Caveat
- Benchmark with data of the size you actually work with; timings on small data may not carry over

## Master Aggregation

Let's compute speed (and distance) over 10-minute intervals

In [23]:
def meters_per_second_to_mph(mps):
    return mps * 2.23694

(tweak_gpx(raw)
 .group_by_dynamic(index_column='time', every='10m')
 .agg(pl.col('travelled').last() - pl.col('travelled').first(),
      speed=(pl.col('travelled').last() - pl.col('travelled').first()) / 
          ((pl.col('time').last() - pl.col('time').first()).dt.total_seconds())
    ) 
 .with_columns(mph=pl.col('speed').pipe(meters_per_second_to_mph))
 )

(10430, 15)


time,travelled,speed,mph
"datetime[μs, America/Denver]",f64,f64,f64
2024-09-10 17:40:00 MDT,1620.956003,3.356017,7.507208
2024-09-10 17:50:00 MDT,1081.973828,1.8063,4.040585
2024-09-10 18:00:00 MDT,99.724838,0.166486,0.372418
2024-09-10 18:10:00 MDT,1703.006885,2.843083,6.359807
2024-09-10 18:20:00 MDT,1602.514698,2.675317,5.984523
…,…,…,…
2024-09-10 20:00:00 MDT,973.953159,1.625965,3.637187
2024-09-10 20:10:00 MDT,680.118068,1.135422,2.539872
2024-09-10 20:20:00 MDT,796.69377,1.33004,2.975219
2024-09-10 20:30:00 MDT,763.967062,1.275404,2.853002


In [24]:
def meters_per_second_to_mph(mps):
    return mps * 2.23694

(tweak_gpx(raw)
 .group_by_dynamic(index_column='time', every='10m')
 .agg(pl.col('travelled').last() - pl.col('travelled').first(),
      speed=(pl.col('travelled').last() - pl.col('travelled').first()) / 
      ((pl.col('time').last() - pl.col('time').first()).dt.total_seconds())
    ) 
 .with_columns(mph=pl.col('speed').pipe(meters_per_second_to_mph))
 .plot.bar(x='time', y='mph')
 )

(10430, 15)


In [25]:
# uphill vs downhill
def meters_to_feet(m):
    return m * 3.28084

def feet_to_miles(f):
    return f / 5280

(tweak_gpx(raw)
 .with_columns(climbing=pl.col('elevation').diff().gt(0))
 .group_by('climbing')
 .agg(pl.col('distance_2d').sum().pipe(meters_to_feet).pipe(feet_to_miles))
 .filter(~pl.col('climbing').is_null())
)

(10430, 15)


climbing,distance_2d
bool,f64
False,5.46702
True,6.123374


In [26]:
# uphill vs downhill
def meters_to_feet(m):
    return m * 3.28084

def feet_to_miles(f):
    return f / 5280

(tweak_gpx(raw)
 .with_columns(climbing=pl.col('elevation').diff().gt(0))
 .group_by('climbing')
 .agg(pl.col('distance_2d').sum().pipe(meters_to_feet).pipe(feet_to_miles))
 .filter(~pl.col('climbing').is_null())
 .plot.bar(x='climbing', y='distance_2d')
)

(10430, 15)


## Summary

* Correct types save space and enable convenient math, string, and date functionality
* Chaining operations will:
   * Make code more readable
   * Reduce bugs
   * Make code easier to debug
* `.map_elements` is slow for math
* Aggregations are powerful. Play with them until they make sense


Let's connect! Twitter `@__mharrison__`, LinkedIn

Book giveaway

In [None]:
import random

In [None]:
random.randrange(0, 2)

In [None]:
random.randrange(0, 3)

In [None]:
import random
random.choice([0,1])

In [None]:
import random
random.randrange(1,4)