In [None]:
import polars as pl
import pandas as pd
import numpy as np
import pyarrow as pa
import plotly.express as px
import string
import random
from datetime import datetime

# Motivation

1. Small memory footprint
  - Native dtypes: missing, strings.
  - Arrow format.
1. Query Planning
1. Parallelism:
    - Speed
    - Debugging




## Memory Footprint


### Memory Footprint of Storage

Polars vs. Pandas:

In [None]:
letters = pl.Series(list(string.ascii_letters))

n = int(10e6)
letter1 = letters.sample(n,with_replacement=True)
letter1.estimated_size(unit='gb')

In [None]:
letter1_pandas = letter1.to_pandas() 
letter1_pandas.memory_usage(deep=True, index=False) / 1e9

The memory footprint of the polars Series is 1/7 of that of the pandas Series(!).
But I did cheat: I used string data to emphasize the difference. The difference would have been smaller had I used integers or floats. 




### Memory Footprint of Compute

You are probably storing your data to compute with it.
Let's compare the memory footprint of computations. 


In [None]:
%load_ext memory_profiler

In [None]:
%memit letter1.sort()

In [None]:
%memit letter1_pandas.sort_values()

In [None]:
%memit letter1[10]='a'

In [None]:
%memit letter1_pandas[10]='a'

Things to notice:

- Operating on existing data consumes less memory in polars than in pandas.
- Changing the data consumes more memory in polars than in pandas. Why is that?


### Operating From Disk to Disk

What if my data does not fit into RAM?
Turns out you can read from disk, process in RAM, and write to disk. This allows you to process data larger than your memory. 

TODO: demonstrate sink_parquet from [here](https://www.rhosignal.com/posts/sink-parquet-files/).





## Query Planning

Consider a sort operation that follows a filter operation. 
Ideally, the filter precedes the sort, but we did not ensure this...
We now demonstrate that polars' query planner will do it for you. 
En passant, we see that polars is more efficient even without the query planner. 


Polars' Eager evaluation, without query planning. 
Sort then filter. 

In [None]:
%timeit -n 2 -r 2 letter1.sort().filter(letter1.is_in(['a','b','c']))

Polars' Eager evaluation, without query planning. 
Filter then sort. 

In [None]:
%timeit -n 2 -r 2 letter1.filter(letter1.is_in(['a','b','c'])).sort()

Polars' Lazy evaluation with query planning. 
Receives sort then filter; executes filter then sort. 

In [None]:
%timeit -n 2 -r 2 letter1.alias('letters').to_frame().lazy().sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).collect()

Pandas' eager evaluation in the wrong order: Sort then filter. 

In [None]:
%timeit -n 2 -r 2 letter1_pandas.sort_values().loc[lambda x: x.isin(['a','b','c'])]


Pandas eager evaluation in the right order: Filter then sort. 

In [None]:
%timeit -n 2 letter1_pandas.loc[lambda x: x.isin(['a','b','c'])].sort_values()

Pandas alternative syntax, just as slow. 

In [None]:
%timeit -n 2 -r 2 letter1_pandas.loc[letter1_pandas.isin(['a','b','c'])].sort_values()

Things to note:

1. Query planning works!
1. Polars is faster than pandas even in eager evaluation (without query planning).



## Parallelism

Polars seamlessly parallelizes over columns (also within, when possible).
As the number of columns in the data grows, we would expect fixed runtime until all cores are used, and then linear scaling.
The following code demonstrates this idea, using a simple sum-within-column.


In [None]:
import time

def scaling_of_sums(n_rows, n_cols):
  # n_cols = 2
  # n_rows = int(1e6)
  A = {}
  A_numpy = np.random.randn(n_rows,n_cols)
  A['numpy'] = A_numpy.copy()
  A['polars'] = pl.DataFrame(A_numpy)
  A['pandas'] = pd.DataFrame(A_numpy)

  times = {}
  for key,value in A.items():
    start = time.time()
    value.sum()
    end = time.time()
    times[key] = end-start

  return(times)

In [None]:
scaling_of_time = {
  p:scaling_of_sums(n_rows= int(1e6),n_cols = p) for p in np.arange(1,16)}

In [None]:
data = pd.DataFrame(scaling_of_time).T
px.line(
  data, 
  labels=dict(
    index="Number of Columns", 
    value="Runtime")
)

Things to note:

- Pandas is slow. 
- Numpy is quite efficient.
- My machine has 8 cores. I would thus expect a fixed timing until 8 columns, and then linear scaling. This is not the case. I wonder why?


## Speed Of Import

Polars' `read_*` functions are considerably faster than their pandas counterparts. 
This is due to better type "guessing" heuristics, and to native support of the parquet file format. 

We now make synthetic data, save it as csv or parquet, and reimport it with polars and pandas.

Starting with CSV:

In [None]:
n_rows = int(1e5)
n_cols = 10
data = np.random.randn(n_rows,n_cols)
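# ndarray.tofile() flattens the array into a single line; savetxt writes one row per line.
np.savetxt('data/data.csv', data, delimiter=',')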

Import with pandas. 

In [None]:
%timeit -n2 -r2 data_pandas = pd.read_csv('data/data.csv', header = None)

Import with polars. 

In [None]:
%timeit -n2 -r2 data_polars = pl.read_csv('data/data.csv', has_header = False)

Moving to parquet:


In [None]:
data_pandas = pd.DataFrame(data)
data_pandas.columns = data_pandas.columns.astype(str)
data_pandas.to_parquet('data/data.parquet', index = False)

In [None]:
%timeit -n2 -r2 data_pandas = pd.read_parquet('data/data.parquet')

In [None]:
%timeit -n2 -r2 data_polars = pl.read_parquet('data/data.parquet')

Things to note:

- The difference in speed is quite large.
- I dare argue that polars' type guessing is better, but I am not demonstrating it here. 
- Bonus fact: parquet is much faster than csv, and also saves the frame's schema.



## Speed Of Join

Because pandas is built on numpy, people see it as both an in-memory database, and a matrix/array library.
With polars, it is quite clear it is an in-memory database, and not an array processing library (despite having a `pl.dot()` function for inner products).
As such, you cannot multiply two polars dataframes, but you can certainly join them efficiently.

Make some data:

In [None]:
def make_data(n_rows, n_cols):
  data = np.concatenate(
  (
    np.arange(n_rows)[:,np.newaxis], # index
    np.random.randn(n_rows,n_cols), # values
    ),
    axis=1)
    
  return data


n_rows = int(1e6)
n_cols = 10
data_left = make_data(n_rows, n_cols)
data_right = make_data(n_rows, n_cols)

Polars join:

In [None]:
data_left_polars = pl.DataFrame(data_left)
data_right_polars = pl.DataFrame(data_right)

%timeit -n2 -r2 polars_joined = data_left_polars.join(data_right_polars, on = 'column_0', how = 'inner')

Pandas join:

In [None]:
data_left_pandas = pd.DataFrame(data_left)
data_right_pandas = pd.DataFrame(data_right)

%timeit -n2 -r2 pandas_joined = data_left_pandas.merge(data_right_pandas, on = 0, how = 'inner')

## Moving Forward...

If this motivational section has convinced you to try polars instead of pandas, here is a more structured intro. 






# Polars Series

Much like pandas, polars' fundamental building block is the series. 
A series is a column of data, with a name, and a dtype.
In the following we:

1. Create a series and demonstrate basic operations on it.
1. Demonstrate the various dtypes. 
1. Discuss missing values.
1. Filter a series.

## Series Housekeeping
Construct a series

In [None]:
s = pl.Series("a", [1, 2, 3])
s

Make pandas series for comparison:

In [None]:
s_pandas = pd.Series([1, 2, 3], name = "a")

In [None]:
type(s)

In [None]:
type(s_pandas)

In [None]:
s.dtype

In [None]:
s_pandas.dtype

Renaming a series; will be very useful when operating on dataframe columns.

In [None]:
s.alias("b")

In [None]:
s.clone()

In [None]:
s.clone().append(pl.Series("a", [4, 5, 6]))

Note: `series.append` operates in-place. That is why we cloned the series first.

Flatten a list of lists using `explode()`.

In [None]:
pl.Series("a", [[1, 2], [3, 4], [9, 10]]).explode()

In [None]:
s.extend_constant(666, n=2)

In [None]:
#| eval: false
s.new_from_index()

In [None]:
s.rechunk()

In [None]:
s.rename("b", in_place=False) # has an in_place option. Unlike .alias()

In [None]:
s.to_dummies()

In [None]:
s.cleared() # creates an empty series, with same dtype

Constructing a series of floats, for later use.

In [None]:
f = pl.Series("a", [1., 2., 3.])
f

In [None]:
f.dtype

## Memory Representation of Series

Object size in memory. Super useful for profiling:

In [None]:
s.estimated_size(unit="gb")

In [None]:
s.chunk_lengths() # what is the length of each memory chunk?

## Filtering and Subsetting


In [None]:
s[0]

Filtering with booleans requires a series of booleans, not a list:

In [None]:
s.filter(pl.Series("a", [True, False, True])) # works

Will not work:

In [None]:
#| eval: false

s[[True, False, True]]

Don't be confused with pandas!

In [None]:
#| eval: false

s.loc[[True, False, True]] 

In [None]:
s.head(2)

In [None]:
s.limit(2)

Negative indexing is not supported:

In [None]:
#| eval: false

s.head(-1)
s.limit(-1)

In [None]:
s.tail(2)

In [None]:
s.sample(2, with_replacement=False)

In [None]:
s.take([0, 2]) # same as .iloc

In [None]:
s.slice(1, 2) # slice(offset, length); same as pandas .iloc[1:3]

In [None]:
s.take_every(2)

## Aggregations

In [None]:
s.sum()

In [None]:
s.min()

In [None]:
s.arg_min()

In [None]:
s.mean()

In [None]:
s.median()

In [None]:
s.entropy()

In [None]:
s.describe()

In [None]:
s.value_counts()

## Object Transformations


In [None]:
pl.Series("a",[1,2,3,4]).reshape(dims = (2,2))

In [None]:
s.shift(1)

In [None]:
s.shift(-1)

In [None]:
s.shift_and_fill(1, 999)

## Mathematical Transformations

In [None]:
s.abs()

In [None]:
s.sin()

In [None]:
s.exp()

In [None]:
s.hash()

In [None]:
s.log()

In [None]:
s.peak_max()

In [None]:
s.sqrt()

In [None]:
s.clip_max(2)

In [None]:
s.clip_min(1)

You cannot round integers, but you can round floats.


In [None]:
f.round(2)

In [None]:
f.ceil()

In [None]:
f.floor()

In [None]:
s.is_in(pl.Series([1, 10]))

__Caution__: `is_in()` in polars has an underscore, unlike `isin()` in pandas.



## Apply

Applying your own function:

In [None]:
s.apply(lambda x: x + 1)

Using your own functions comes with a performance cost:

In [None]:
s1 = pl.Series(np.random.randn(int(1e5)))

%timeit -n2 -r2 s1+1

In [None]:
%timeit -n2 -r2 s1.apply(lambda x: x + 1)

## Cumulative Operations


In [None]:
s.cummax()

In [None]:
s.cumsum()

In [None]:
s.cumprod()

In [None]:
s.ewm_mean(com=0.5)

## Sequential Operations


In [None]:
s.diff()

In [None]:
s.pct_change()

## Windowed Operations


In [None]:
s.rolling_apply(
  pl.sum, 
  window_size=2)

Not all functions will work within a `rolling_apply`! Only polars' functions will.

In [None]:
#| eval: false

s.rolling_apply(np.sum, window_size=2) # will not work

In [None]:
s.rolling_max(window_size=2)

In [None]:
s.clip(1, 2)

In [None]:
s.clone()

In [None]:
# check equality with clone
s == s.clone()

## Booleans


In [None]:
b = pl.Series("a", [True, True, False])
b.dtype

In [None]:
b.all()

In [None]:
b.any()

## Uniques and Duplicates


In [None]:
s.is_duplicated()

In [None]:
s.is_unique()

In [None]:
s.n_unique()

In [None]:
pl.Series([1,2,3,4,1]).unique_counts()

The first appearance of a value in a series:

In [None]:
pl.Series([1,2,3,4,1]).is_first()

## dtypes

__Note__. Unlike pandas, polars' test functions have an underscore: `is_numeric()` instead of `isnumeric()`.


### Testing

In [None]:
s.is_numeric()

In [None]:
s.is_float()

In [None]:
s.is_utf8()

In [None]:
s.is_boolean()

In [None]:
s.is_datelike()

Compare with Pandas Type Checkers:

In [None]:
pd.api.types.is_string_dtype(s_pandas)

In [None]:
pd.api.types.is_string_dtype(s)

### Casting


In [None]:
s.cast(pl.Int32)

Things to note: 

- `s.cast()` is not an in-place operation: it returns a new series and leaves the original `s` unchanged.
- `cast()` is polars' equivalent of pandas' `astype()`.
- For a list of dtypes see the official [documentation](https://pola-rs.github.io/polars/py-polars/html/reference/datatypes.html).



### Optimizing dtypes

Find the most efficient dtype for a series:

In [None]:
s.shrink_dtype()

Also see [here](http://braaannigan.github.io/software/2022/10/31/polars-dtype-diet.html).

Shrink the memory allocation to the size of the actual data (in place).

In [None]:
s.shrink_to_fit() 

## Ordering and Sorting 


In [None]:
s.sort()

In [None]:
s.reverse()

In [None]:
s.rank()

In [None]:
s.arg_sort() 

`arg_sort()` returns the indices that would sort the series. Same as R's `order()`.


In [None]:
s.sort() == s[s.arg_sort()]

`arg_sort()` can also be used to return the original series from the sorted one:

In [None]:
s == s[s[s.arg_sort()].arg_sort()]

In [None]:
s.shuffle(seed=1)

## Missing

Pandas users will be excited to know that polars has built in missing value support (!) for all dtypes.
This has been a long awaited feature in the Python data science ecosystem, with implications on performance and syntax.


In [None]:
m = pl.Series("a", [1, 2, None, np.nan])
m.is_null()

In [None]:
m.is_nan()

In [None]:
m1 = pl.Series("a", [1, None, 2, ]) # python native None
m2 = pl.Series("a", [1, np.nan, 2, ]) # numpy's nan
m3 = pl.Series("a", [1, float('nan'), 2, ]) # python's nan
m4 = pd.Series([1, None, 2 ])
m5 = pd.Series([1, np.nan, 2, ])
m6 = pd.Series([1, float('nan'), 2, ])

In [None]:
[m1.sum(), m2.sum(), m3.sum(), m4.sum(), m5.sum(), m6.sum()]

Things to note:

- The use of `is_null()` instead of pandas `isna()`.
- Polars supports `np.nan`, but it is a floating-point value, distinct from `None` (which is polars' null, i.e., missing, value). `is_null()` does not flag `NaN`.
- Aggregating pandas and polars series behave differently w.r.t. missing values:
  - Both will ignore `None`; which is unsafe.
  - Polars will not ignore `np.nan`; which is safe. Pandas is unsafe w.r.t. `np.nan`, and will ignore it. 


Filling missing values; `None` and `np.nan` are treated differently:

In [None]:
m1.fill_null(0)

In [None]:
m1.interpolate()

In [None]:
m2.fill_null(0)

In [None]:
m2.fill_nan(0)

In [None]:
m1.drop_nulls()

In [None]:
m1.drop_nans()

In [None]:
m2.drop_nulls()

## Export


In [None]:
s.to_frame()

In [None]:
s.to_list()

In [None]:
s.to_numpy()

In [None]:
s.to_pandas()

In [None]:
s.to_arrow()

## Strings 
Like Pandas, accessed with the `.str` attribute.


In [None]:
st = pl.Series("a", ["foo", "bar", "baz"])

In [None]:
st.str.n_chars() # gets number of chars. In ASCII this is the same as lengths()

In [None]:
st.str.lengths() # gets number of bytes in memory

In [None]:
st.str.concat("-")

In [None]:
st.str.contains("foo|tra|bar")

In [None]:
st.str.count_match(pattern= 'o') # count literal matches

Count pattern matches. 
Notice the `r"<regex pattern>"` syntax for regex (more about it [here](https://stackoverflow.com/questions/2241600/python-regex-r-prefix)). 

In [None]:
st.str.count_match(r"\w") # regex for alphanumeric

In [None]:
st.str.ends_with("oo")

In [None]:
st.str.starts_with("fo")

To extract the first appearance of a pattern, use `extract`:

In [None]:
url = pl.Series("a", [
            "http://vote.com/ballon_dor?candidate=messi&ref=polars",

            "http://vote.com/ballon_dor?candidate=jorginho&ref=polars",

            "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars"
            ])

url.str.extract(r"=(\w+)", 1)

To extract all appearances of a pattern, use `extract_all`:

In [None]:
url.str.extract_all(r"=(\w+)")

In [None]:
st.str.ljust(8, "*")

In [None]:
st.str.rjust(8, "*")

In [None]:
st.str.lstrip('f')

In [None]:
st.str.rstrip('r')

Replacing first appearance of a pattern:

In [None]:
st.str.replace(r"o", "ZZ")  

In [None]:
st.str.replace(r"o+", "ZZ")  

Replace all appearances of a pattern:

In [None]:
st.str.replace_all("o", "ZZ")

String to list of strings. Number of splits inferred.

In [None]:
st.str.split(by="o")

In [None]:
st.str.split(by="a", inclusive=True)

String to dict of strings. Number of splits fixed.

In [None]:
st.str.split_exact("a", 2)

String to dict of strings. Length of output fixed.

In [None]:
st.str.splitn("a", 4)

Strip white spaces.

In [None]:
st.str.rjust(8, " ").str.strip()

In [None]:
st.str.to_uppercase()

In [None]:
st.str.to_lowercase()

In [None]:
st.str.zfill(5)

## Date and Time

There are 4 datetime dtypes in polars:

1. Date: A date, without hours. Generated with `pl.Date()`.
2. Datetime: Date and hours. Generated with `pl.Datetime()`.
3. Duration: As the name suggests. Similar to `timedelta` in pandas. Generated with `pl.Duration()`.
4. Time: Hour of day. Generated with `pl.Time()`.


### Converting from Strings


In [None]:
sd = pl.Series(
    "date",
    [
        "2021-04-22",
        "2022-01-04 00:00:00",
        "01/31/22",
        "Sun Jul  8 00:34:60 2001",
    ],
)
sd.str.strptime(pl.Date, "%F", strict=False)

In [None]:
sd.str.strptime(pl.Date, "%F %T",strict=False)

In [None]:
sd.str.strptime(pl.Date, "%D", strict=False)

### Time Range


In [None]:
from datetime import datetime, timedelta

start = datetime(year= 2001, month=2, day=2)
stop = datetime(year=2001, month=2, day=3)

date = pl.date_range(
  low=start, 
  high=stop, 
  interval=timedelta(seconds=500*61))
date

Things to note:

- How else could I have constructed this series? What other types are accepted as `low` and `high`? 
- `pl.date_range` may return a series of dtype `Date` or `Datetime`. This depends on the granularity of the inputs. 


In [None]:
date.dtype

Cast to different time unit. 
May be useful when joining datasets, and the time unit is different.

In [None]:
date.dt.cast_time_unit(tu="ms")

### From Date to String


In [None]:
date.dt.strftime("%Y-%m-%d")

### Extract Time Sub-Units


In [None]:
date.dt.second()

In [None]:
date.dt.minute()

In [None]:
date.dt.hour()

In [None]:
date.dt.day()

In [None]:
date.dt.week()

In [None]:
date.dt.weekday()

In [None]:
date.dt.month()

In [None]:
date.dt.year()

In [None]:
date.dt.ordinal_day() # day in year

In [None]:
date.dt.quarter()

### Durations 

Equivalent to pandas' `Timedelta` dtype.


In [None]:
diffs = date.diff()
diffs

In [None]:
diffs.dtype

In [None]:
diffs.dt.seconds()

In [None]:
diffs.dt.minutes()

In [None]:
diffs.dt.days()

In [None]:
diffs.dt.hours()

### Date Aggregations
Note that aggregating dates returns a `datetime` object. 


In [None]:
date.dt.max()

In [None]:
date.dt.min()

I have no idea what an "average date" is, but it can be computed.

In [None]:
date.dt.mean()

In [None]:
date.dt.median()

### Date Transformations

Notice the syntax of `offset_by`. It is similar to R's `lubridate` package. Beware that in polars' duration strings `m` stands for minutes; months are written `mo`.

In [None]:
date.dt.offset_by(by="1y2m20d")

Negative offset is also allowed.

In [None]:
date.dt.offset_by(by="-1y2m20d")

In [None]:
date.dt.round("1y")

In [None]:
date2 = date.dt.truncate("30m") # round to period
pd.crosstab(date,date2)

## Comparing Series 

In [None]:
s.series_equal(pl.Series("a", [1, 2, 3]))

# head









# DataFrames

General:
1. There is no row index (like R's `data.frame`, `data.table`, and `tibble`; unlike Python's `pandas`). 
1. Will not accept duplicate column names (unlike pandas).


## DataFrame Housekeeping

A frame can be created as you would expect. 
From a dictionary of series, a numpy array, a pandas dataframe, or a list of polars (or pandas) series, etc.


In [None]:
dataframe = pl.DataFrame({"integer": [1, 2, 3], 
                          "date": [
                              (datetime(2022, 1, 1)), 
                              (datetime(2022, 1, 2)), 
                              (datetime(2022, 1, 3))
                          ], 
                          "float":[4.0, 5.0, 6.0]})

dataframe

In [None]:
print(dataframe)

Things to note:

1. 


## Statistical Aggregations

## Filtering Rows

## Selecting Columns



Select columns by their time unit and convert them.
This may be useful when joining multiple dataframes with different time units.

In [None]:
# df.with_column(
#     pl.col(pl.Datetime("ns")).dt.cast_time_unit(tu="ms")
# )            

## Joins

## Reshaping

## Groupby

## Query Planning and Optimization

- describe_plan
- show_graph
- describe_optimized_plan



# I/O

## Import

- From csv
- From parquet
- From multiple parquets
- From Arrow DataSet



Warnings:

1. String caching


## Export




# Plotting

# Polars and ML

# Strings

# Datetimes


# Config


In [None]:
list(dir(pl.Config))