# Polars: A Highly Optimized DataFrame Library


## About Matt Harrison @\_\_mharrison\_\_

* Author of Effective Pandas, Machine Learning Pocket Reference, and Illustrated Guide to Python 3.
* Advisor at Ponder (creators of Modin)
* Corporate trainer at MetaSnake. Taught Pandas to thousands of students.

## Relevant Background

* 1999 NLP
* 2006 Created Python OLAP Engine
* 2009 Heard about Pandas
* Used Pandas for failure modeling, analytics, and ML
* 2016 Learning the Pandas Library
* 2019 Spark
* 2020 Pandas Cookbook
* 2021 Effective Pandas
* 2022 cuDF, Modin, Polars

## Outline of Opinions

* Load Data
* Types
* Chaining
* Apply
* Aggregation

## Polars Overview

* Polars is a Rust library with Python bindings
* Polars has expressions and contexts
  - Contexts - `.select`, `.with_columns`, `.filter`
  - Expressions - built with `pl.col("col_name")` or `pl.lit(1)`
* Lazy evaluation
* Query optimization
* Multi-threaded


## Data

In [None]:
!pip install -U polars

In [None]:
import numpy as np
import polars as pl
import pandas as pd

In [None]:
pl.__version__

In [None]:
import urllib.request
import zipfile
import os

def download_and_unzip(url, extract_to='.'):
    zip_path, _ = urllib.request.urlretrieve(url)
    with zipfile.ZipFile(zip_path) as zip_ref:
        zip_ref.extractall(extract_to)
    os.remove(zip_path)

url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
download_and_unzip(url)

In [None]:
autos = pl.read_csv('vehicles.csv', null_values=['NA'])

In [None]:
autos

In [None]:
autos.columns

In [None]:
# 33.5 Megs
autos.estimated_size()

## Types
Getting the types right enables analysis and ensures correctness.

In [None]:
cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'eng_dscr', 
        'fuelCost08', 'make', 'model', 'trany', 'range', 'createdOn', 'year']

In [None]:
autos[cols].dtypes

In [None]:
# 8 Megs
autos[cols].estimated_size()

### Ints

In [None]:
# select is a context
# pl.col is an expression
autos[cols].select(pl.col(pl.Int64))

In [None]:
autos[cols].select(pl.col(pl.Int64)).describe()

In [None]:
# chaining
(autos
 .select(cols)
 .select(pl.col(pl.Int64))
 .describe()
)

In [None]:
# can comb08 be an int8?
np.iinfo(np.int8)

In [None]:
# no but maybe a uint8
np.iinfo(np.uint8)

In [None]:
# chaining
# polars prevents illegal casts
(autos
 .select(cols)
 .with_columns(pl.col('comb08').cast(pl.Int8))
 .describe()
)

In [None]:
# chaining
(autos
 .select(cols)
 .with_columns(pl.col('comb08').cast(pl.UInt8))
 .describe()               
)

In [None]:
np.iinfo(np.int16)

In [None]:
# chaining
(autos
 .select(cols)
 .with_columns(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ',]).cast(pl.UInt8))
 .with_columns(pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16))
 .estimated_size()
)

In [None]:
(autos
 .select(cols)
 .estimated_size()
)

### Strings

In [None]:
autos.select(cols).select(pl.col(pl.Utf8))

In [None]:
# chaining
(autos
 .select(cols)
 .with_columns(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ',]).cast(pl.UInt8))
 .with_columns(pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16))
 .estimated_size()
)

In [None]:
# chaining
(autos
 .select(cols)
 .with_columns(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ',]).cast(pl.UInt8))
 .with_columns(pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16))
 .with_columns(pl.col(['drive', 'make', 'model',]).cast(pl.Categorical)) 
 .estimated_size()
)

## Extract FFS, Speed, & Manual

In [None]:
# chaining
(autos
 .select(cols)
 .with_columns(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ',]).cast(pl.UInt8))
 .with_columns(pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16))
 .with_columns(pl.col(['drive', 'make', 'model',]).cast(pl.Categorical)) 
 .select(pl.col(pl.Utf8))
)

In [None]:
col = pl.col('test')

In [None]:
col.str.extract?

In [None]:
(autos
 .select(cols)
 .with_columns([pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ',]).cast(pl.UInt8),
     pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
     pl.col(['drive', 'make', 'model',]).cast(pl.Categorical),
     pl.col('eng_dscr').str.contains('FFS').alias('FFS'),
     pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('Speeds'),
     pl.col('trany').str.contains('Manual').alias('Manual'),    
 ])
)

In [None]:
print(autos
 .select(cols)
 .with_columns([pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ',]).cast(pl.UInt8),
     pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
     pl.col(['drive', 'make', 'model',]).cast(pl.Categorical),
     pl.col('eng_dscr').str.contains('FFS').alias('FFS'),
     pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('Speeds'),
     pl.col('trany').str.contains('Manual').alias('Manual'),    
     ])
.columns
)

In [None]:
# remove trany and eng_dscr
(autos
 .with_columns([pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ',]).cast(pl.UInt8),
     pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
     pl.col(['drive', 'make', 'model',]).cast(pl.Categorical),
     pl.col('eng_dscr').str.contains('FFS').alias('FFS'),
     pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('Speeds'),
     pl.col('trany').str.contains('Manual').alias('Manual'),    
  ])
 .select(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'fuelCost08',
  'make', 'model', 'range', 'createdOn', 'year', 'FFS', 'Speeds', 'Manual']))
 .estimated_size()
)

In [None]:
# where are the values missing for drive?
(autos
 .with_columns([pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ',]).cast(pl.UInt8),
     pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
     pl.col(['drive', 'make', 'model',]).cast(pl.Categorical),
     pl.col('eng_dscr').str.contains('FFS').alias('FFS'),
     pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('Speeds'),
     pl.col('trany').str.contains('Manual').alias('Manual'),    
  ])
 .select(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'fuelCost08',
  'make', 'model', 'range', 'createdOn', 'year', 'FFS', 'Speeds', 'Manual']))
 .filter(pl.col('drive').is_null())
)

In [None]:
# fill in missing values with other
(autos
 .with_columns([pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ',]).cast(pl.UInt8),
     pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
     pl.col(['make', 'model',]).cast(pl.Categorical),
     pl.col('drive').fill_null('other').cast(pl.Categorical),
     pl.col('eng_dscr').str.contains('FFS').alias('FFS'),
     pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('Speeds'),
     pl.col('trany').str.contains('Manual').alias('Manual'),    
  ])
 .select(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'fuelCost08',
  'make', 'model', 'range', 'createdOn', 'year', 'FFS', 'Speeds', 'Manual']))
)

In [None]:
# missing cylinders
(autos
 .with_columns([pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ',]).cast(pl.UInt8),
     pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
     pl.col(['make', 'model',]).cast(pl.Categorical),
     pl.col('drive').fill_null('other').cast(pl.Categorical),
     pl.col('eng_dscr').str.contains('FFS').alias('ffs'),
     pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('speeds'),
     pl.col('trany').str.contains('Manual').alias('manual'),    
  ])
 .select(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'fuelCost08',
  'make', 'model', 'range', 'createdOn', 'year', 'ffs', 'speeds', 'manual']))
 .filter(pl.col('cylinders').is_null()) 
)

In [None]:
# fill in missing cylinders with 0
(autos
 .with_columns([pl.col(['city08', 'comb08', 'highway08', 'displ',]).cast(pl.UInt8),
     pl.col('cylinders').fill_null(0).cast(pl.UInt8),
     pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
     pl.col(['make', 'model',]).cast(pl.Categorical),
     pl.col('drive').fill_null('other').cast(pl.Categorical),
     pl.col('eng_dscr').str.contains('FFS').alias('ffs'),
     pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('speeds'),
     pl.col('trany').str.contains('Manual').alias('manual'),    
  ])
 .select(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'fuelCost08',
  'make', 'model', 'range', 'createdOn', 'year', 'ffs', 'speeds', 'manual']))
)

### Dates

In [None]:
col.str.replace?

In [None]:
(autos
 .with_columns([pl.col(['city08', 'comb08', 'highway08', 'displ',]).cast(pl.UInt8),
     pl.col('cylinders').fill_null(0).cast(pl.UInt8),
     pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
     pl.col(['make', 'model',]).cast(pl.Categorical),
     pl.col('drive').fill_null('other').cast(pl.Categorical),
     pl.col('eng_dscr').str.contains('FFS').alias('ffs'),
     pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('speeds'),
     pl.col('trany').str.contains('Manual').alias('manual'),    
    pl.col('createdOn').str.replace(' EDT', ' -0400')
               .str.replace(' EST', ' -0500')
               .str.strptime(pl.Datetime, '%a %b %d %H:%M:%S %z %Y')
               ])
 .select(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'fuelCost08',
  'make', 'model', 'range', 'createdOn', 'year', 'ffs', 'speeds', 'manual']))
)

In [None]:
# NYC TZ
(autos
 .with_columns(pl.col(['city08', 'comb08', 'highway08', 'displ',]).cast(pl.UInt8),
     pl.col('cylinders').fill_null(0).cast(pl.UInt8),
     pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
     pl.col(['make', 'model',]).cast(pl.Categorical),
     pl.col('drive').fill_null('other').cast(pl.Categorical),
     pl.col('eng_dscr').str.contains('FFS').alias('ffs'),
     pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('speeds'),
     pl.col('trany').str.contains('Manual').alias('manual'),    
    pl.col('createdOn').str.replace(' EDT', ' -0400')
               .str.replace(' EST', ' -0500')
               .str.strptime(pl.Datetime, '%a %b %d %H:%M:%S %z %Y')
               .dt.convert_time_zone('America/New_York')
               )
 .select(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'fuelCost08',
  'make', 'model', 'range', 'createdOn', 'year', 'ffs', 'speeds', 'manual']))
)

In [None]:
# a glorious function
def tweak_autos(autos):
    return (autos
     .with_columns(pl.col(['city08', 'comb08', 'highway08', 'displ',]).cast(pl.UInt8),
         pl.col('cylinders').fill_null(0).cast(pl.UInt8),
         pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
         pl.col(['make', 'model',]).cast(pl.Categorical),
         pl.col('drive').fill_null('other').cast(pl.Categorical),
         pl.col('eng_dscr').str.contains('FFS').alias('ffs'),
         pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('speeds'),
         pl.col('trany').str.contains('Manual').alias('manual'),    
        pl.col('createdOn').str.replace(' EDT', ' -0400')
                   .str.replace(' EST', ' -0500')
                   .str.strptime(pl.Datetime, '%a %b %d %H:%M:%S %z %Y')
                   .dt.convert_time_zone('America/New_York')
                   )
     .select(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'fuelCost08',
      'make', 'model', 'range', 'createdOn', 'year', 'ffs', 'speeds', 'manual']))
    )

tweak_autos(autos)

## Chain

Chaining is also called "flow" programming. Rather than making intermediate variables, just leverage the fact that most operations return a new object and work on that.

The chain should read like a recipe of ordered steps.

(BTW, this is actually what we did above.)

In [None]:
# a glorious function
def tweak_autos_lazy(path):
    return (pl.scan_csv(path, null_values=['NA'])  # scan_csv already returns a LazyFrame
     .with_columns(pl.col(['city08', 'comb08', 'highway08', 'displ',]).cast(pl.UInt8),
         pl.col('cylinders').fill_null(0).cast(pl.UInt8),
         pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
         pl.col(['make', 'model',]).cast(pl.Categorical),
         pl.col('drive').fill_null('other').cast(pl.Categorical),
         pl.col('eng_dscr').str.contains('FFS').alias('ffs'),
         pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('speeds'),
         pl.col('trany').str.contains('Manual').alias('manual'),    
        pl.col('createdOn').str.replace(' EDT', ' -0400')
                   .str.replace(' EST', ' -0500')
                   .str.strptime(pl.Datetime, '%a %b %d %H:%M:%S %z %Y',)
                   .dt.convert_time_zone('America/New_York')
                   )
     .select(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'fuelCost08',
          'make', 'model', 'range', 'createdOn', 'year', 'ffs', 'speeds', 'manual']))
     .collect()
    )

tweak_autos_lazy('vehicles.csv')

In [None]:
# easy to debug

def get_var(df, var_name):
    globals()[var_name] = df
    return df


def tweak_autos_debug(autos):
    return (autos
     .pipe(lambda df: print(df.shape) or df)
     .with_columns(pl.col(['city08', 'comb08', 'highway08', 'displ',]).cast(pl.UInt8),
         pl.col('cylinders').fill_null(0).cast(pl.UInt8),
         pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
         pl.col(['make', 'model',]).cast(pl.Categorical),
         pl.col('drive').fill_null('other').cast(pl.Categorical),
         pl.col('eng_dscr').str.contains('FFS').alias('ffs'),
         pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('speeds'),
         pl.col('trany').str.contains('Manual').alias('manual'),    
         pl.col('createdOn').str.replace(' EDT', ' -0400')
                   .str.replace(' EST', ' -0500')
                   .str.strptime(pl.Datetime, '%a %b %d %H:%M:%S %z %Y')
                   .dt.convert_time_zone('America/New_York')
                   )
     .pipe(lambda df: print(df.shape) or df)    
     .pipe(get_var, 'df2')
     .select(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'fuelCost08',
          'make', 'model', 'range', 'createdOn', 'year', 'ffs', 'speeds', 'manual']))
     .pipe(lambda df: print(df.shape) or df)            
           )

tweak_autos_debug(autos)

In [None]:
df2

## Don't Apply (map_elements) If You Can Avoid It

In [None]:
# a glorious function
def tweak_autos(autos):
    return (autos
     .with_columns(pl.col(['city08', 'comb08', 'highway08', 'displ',]).cast(pl.UInt8),
         pl.col('cylinders').fill_null(0).cast(pl.UInt8),
         pl.col(['range', 'fuelCost08', 'year',]).cast(pl.UInt16),
         pl.col(['make', 'model',]).cast(pl.Categorical),
         pl.col('drive').fill_null('other').cast(pl.Categorical),
         pl.col('eng_dscr').str.contains('FFS').alias('ffs'),
         pl.col('trany').str.extract(r'(\d+)').cast(pl.UInt8).alias('speeds'),
         pl.col('trany').str.contains('Manual').alias('manual'),    
        pl.col('createdOn').str.replace(' EDT', ' -0400')
                   .str.replace(' EST', ' -0500')
                   .str.strptime(pl.Datetime, '%a %b %d %H:%M:%S %z %Y')
                   .dt.convert_time_zone('America/New_York')
                   )
     .select(pl.col(['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'fuelCost08',
      'make', 'model', 'range', 'createdOn', 'year', 'ffs', 'speeds', 'manual']))
    )

autos2 = tweak_autos(autos)

In [None]:
# try to be more Euro-centric
def to_lper100km(val):
    return 235.215 / val
(autos2
 .with_columns(pl.col('city08').map_elements(to_lper100km))
)

In [None]:
# Same results

(autos2
 .with_columns(235.215 / pl.col('city08'))
)

In [None]:
%%timeit
(autos2
 .with_columns(235.215 / pl.col('city08'))
)

In [None]:
%%timeit
(autos2
 .with_columns(pl.col('city08').map_elements(to_lper100km))
)

In [None]:
# ratio of the two timings above
37_300 / 139

## Master Aggregation

Let's compare mileage by country by year...🤔

In [None]:
(autos2
   .group_by('year')
   .mean()
)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# 'pandas1book' is the author's custom style sheet; skip this line if it is not installed
plt.style.use('pandas1book') 
sns.set_context('talk')
plt.plot(range(10))

In [None]:
(autos2
   .group_by('year')
   .agg(pl.col('comb08').mean(),
        pl.col('speeds').mean())
)

In [None]:
(autos2
   .group_by('year')
   .agg(pl.col('comb08').mean(),
        pl.col('speeds').mean())
 .to_pandas()
 .set_index('year')
# .plot()
)

In [None]:
(autos2
   .group_by('year')
   .agg(pl.col('comb08').mean(),
        pl.col('speeds').mean())
 .to_pandas()
 .set_index('year')
 .sort_index()
 .plot()
)

In [None]:
(autos2
   .group_by('year')
   .agg(pl.col('comb08').std(),
        pl.col('speeds').std())
 .to_pandas()
 .set_index('year')
 .sort_index()
 .plot()
)

In [None]:
pl.when?

In [None]:
# add country 
(autos2
 .with_columns(
     pl.col('make').cast(pl.Utf8),
     pl.when(pl.col('make').is_in(pl.Series(['Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla'])))
               .then(pl.lit('US'))
               .otherwise(pl.lit('Other'))
               .alias('country')
     )
)

In [None]:
# need to convert back to utf8
(autos2
 .with_columns(pl.when(pl.col('make').cast(pl.Utf8).is_in(['Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla']))
               .then(pl.lit('US'))
               .otherwise(pl.lit('Other'))
               .alias('country'))
   .group_by(['year', 'country'])
   .agg(pl.col('comb08').mean(),
        pl.col('speeds').mean())
# .to_pandas()
# .set_index('year')
# .sort_index()
# .plot()
)

In [None]:
# need to convert back to utf8
(autos2
 .with_columns(pl.when(pl.col('make').cast(pl.Utf8).is_in(['Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla']))
               .then(pl.lit('US'))
               .otherwise(pl.lit('Other'))
               .alias('country'))
   .group_by(['year', 'country'])
   .agg(pl.col('comb08').mean(),
        pl.col('speeds').mean())
 .to_pandas()
 .set_index(['year', 'country'])
 .sort_index()
 .unstack()
 .plot().legend(bbox_to_anchor=(1, 1))
)

In [None]:
# use pivot
(autos2
 .with_columns(pl.when(pl.col('make').cast(pl.Utf8).is_in(['Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla']))
               .then(pl.lit('US'))        # bare strings would be read as column names
               .otherwise(pl.lit('Other'))
               .alias('country'))
   .pivot(index='year', values=['comb08', 'speeds'],
          columns='country', aggregate_function='mean')
 .to_pandas()
 .set_index('year')
 .sort_index()
 .plot()
 .legend(bbox_to_anchor=(1,1))
)

## Summary

* Correct types save space and enable convenient math, string, and date functionality
* Chaining operations will:
   * Make code readable
   * Remove bugs
   * Make code easier to debug
* ``.apply`` (``map_elements`` in Polars) is slow for math
* Aggregations are powerful. Play with them until they make sense


Follow me on Twitter ``@__mharrison__``

Course giveaway: Beyond Pandas 1 https://store.metasnake.com/8e40182f-16ae-4305-a904-371d4ff85d6a