In [1]:
import polars as pl
import pandas as pd
import numpy as np
import pyarrow as pa
import plotly.express as px
import string
import random
from datetime import datetime

# Motivation

1. Small memory footprint
  - Native dtypes: missing, strings.
  - Arrow format.
1. Query Planning
1. Parallelism:
    - Speed
    - Debugging




## Memory Footprint


### Memory Footprint of Storage

Polars vs. Pandas:

In [2]:
letters = pl.Series(list(string.ascii_letters))

n = int(10e6)
letter1 = letters.sample(n,with_replacement=True)
letter1.estimated_size(unit='gb')

0.08381903916597366

In [3]:
letter1_pandas = letter1.to_pandas() 
letter1_pandas.memory_usage(deep=True, index=False) / 1e9

0.58

The memory footprint of the polars Series is about 1/7 that of the pandas Series(!).
But I did cheat: I used string data to emphasize the difference. The difference would have been smaller if I had used integers or floats.




### Memory Footprint of Compute

You are probably storing your data to compute with it.
Let's compare the memory footprint of computations. 


In [4]:
%load_ext memory_profiler

In [5]:
%memit letter1.sort()

peak memory: 542.71 MiB, increment: 217.00 MiB


In [6]:
%memit letter1_pandas.sort_values()

peak memory: 703.95 MiB, increment: 377.82 MiB


In [7]:
%memit letter1[10]='a'

peak memory: 476.91 MiB, increment: 64.59 MiB


In [8]:
%memit letter1_pandas[10]='a'

peak memory: 413.09 MiB, increment: 0.00 MiB


Things to notice:

- Operating on existing data consumes less memory in polars than in pandas.
- Changing the data consumes more memory in polars than in pandas. Why is that?


### Operating From Disk to Disk

What if my data does not fit into RAM?
Turns out you can read from disk, process in RAM, and write to disk. This allows you to process data larger than your memory. 

TODO: demonstrate sink_parquet from [here](https://www.rhosignal.com/posts/sink-parquet-files/).





## Query Planning

Consider a filter operation that follows a sort operation.
Ideally, the filter would precede the sort, but we did not ensure this...
We now demonstrate that polars' query planner will do it for you.
En passant, we see that polars is more efficient even without the query planner.


Polars' Eager evaluation, without query planning. 
Sort then filter. 

In [9]:
%timeit -n 2 -r 2 letter1.sort().filter(letter1.is_in(['a','b','c']))

905 ms ± 77.9 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Polars' Eager evaluation, without query planning. 
Filter then sort. 

In [10]:
%timeit -n 2 -r 2 letter1.filter(letter1.is_in(['a','b','c'])).sort()

265 ms ± 26.5 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Polars' Lazy evaluation with query planning. 
Receives sort then filter; executes filter then sort.

In [11]:
%timeit -n 2 -r 2 letter1.alias('letters').to_frame().lazy().sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).collect()

306 ms ± 24.7 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Pandas' eager evaluation in the wrong order: Sort then filter. 

%timeit -n 2 -r 2 letter1_pandas.sort_values().loc[lambda x: x.isin(['a','b','c'])]


Pandas eager evaluation in the right order: Filter then sort. 

In [12]:
%timeit -n 2 letter1_pandas.loc[lambda x: x.isin(['a','b','c'])].sort_values()

919 ms ± 104 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)


Pandas alternative syntax, just as slow. 

In [13]:
%timeit -n 2 -r 2 letter1_pandas.loc[letter1_pandas.isin(['a','b','c'])].sort_values()

803 ms ± 44.6 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)


Things to note:

1. Query planning works!
1. Polars is faster than pandas even in eager evaluation (without query planning).



## Parallelism

Polars seamlessly parallelizes over columns (also within, when possible).
As the number of columns in the data grows, we would expect fixed runtime until all cores are used, and then linear scaling.
The following code demonstrates this idea, using a simple sum-within-column.


In [14]:
import time

def scaling_of_sums(n_rows, n_cols):
  # n_cols = 2
  # n_rows = int(1e6)
  A = {}
  A_numpy = np.random.randn(n_rows,n_cols)
  A['numpy'] = A_numpy.copy()
  A['polars'] = pl.DataFrame(A_numpy)
  A['pandas'] = pd.DataFrame(A_numpy)

  times = {}
  for key,value in A.items():
    start = time.time()
    value.sum()
    end = time.time()
    times[key] = end-start

  return(times)

In [15]:
scaling_of_time = {
  p:scaling_of_sums(n_rows= int(1e6),n_cols = p) for p in np.arange(1,16)}

In [16]:
data = pd.DataFrame(scaling_of_time).T
px.line(
  data, 
  labels=dict(
    index="Number of Columns", 
    value="Runtime")
)

Things to note:

- Pandas is slow. 
- Numpy is quite efficient.
- My machine has 8 cores. I would thus expect a fixed timing until 8 columns, and then linear scaling. This is not the case. I wonder why?


## Speed Of Import

Polars' `read_x` functions are considerably faster than pandas'.
This is due to better type "guessing" heuristics, and to native support of the parquet file format. 

We now make synthetic data, save it as csv or parquet, and reimport it with polars and pandas.

Starting with CSV:

In [17]:
n_rows = int(1e5)
n_cols = 10
data = np.random.randn(n_rows,n_cols)
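# Note: tofile() writes the flattened array, so the file holds a single row of n_rows * n_cols values.
data.tofile('data/data.csv', sep = ',')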

Import with pandas. 

In [18]:
%timeit -n2 -r2 data_pandas = pd.read_csv('data/data.csv', header = None)

23.4 s ± 1.44 s per loop (mean ± std. dev. of 2 runs, 2 loops each)


Import with polars. 

In [19]:
%timeit -n2 -r2 data_polars = pl.read_csv('data/data.csv', has_header = False)

4.07 s ± 923 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Moving to parquet:


In [20]:
data_pandas = pd.DataFrame(data)
data_pandas.columns = data_pandas.columns.astype(str)
data_pandas.to_parquet('data/data.parquet', index = False)

In [21]:
%timeit -n2 -r2 data_pandas = pd.read_parquet('data/data.parquet')

The slowest run took 10.09 times longer than the fastest. This could mean that an intermediate result is being cached.
53.6 ms ± 43.9 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


In [22]:
%timeit -n2 -r2 data_polars = pl.read_parquet('data/data.parquet')

40.5 ms ± 8.6 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Things to note:

- The difference in speed is quite large.
- I dare argue that polars' type guessing is better, but I am not demonstrating it here. 
- Bonus fact: parquet is much faster than csv, and also saves the frame's schema.



## Speed Of Join

Because pandas is built on numpy, people see it as both an in-memory database, and a matrix/array library.
With polars, it is quite clear it is an in-memory database, and not an array processing library (despite having a `pl.dot()` function for inner products).
As such, you cannot multiply two polars dataframes, but you can certainly join them efficiently.

Make some data:

In [23]:
def make_data(n_rows, n_cols):
  data = np.concatenate(
  (
    np.arange(n_rows)[:,np.newaxis], # index
    np.random.randn(n_rows,n_cols), # values
    ),
    axis=1)
    
  return data


n_rows = int(1e6)
n_cols = 10
data_left = make_data(n_rows, n_cols)
data_right = make_data(n_rows, n_cols)

Polars join:

In [24]:
data_left_polars = pl.DataFrame(data_left)
data_right_polars = pl.DataFrame(data_right)

%timeit -n2 -r2 polars_joined = data_left_polars.join(data_right_polars, on = 'column_0', how = 'inner')

287 ms ± 78.8 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


Pandas join:

In [25]:
data_left_pandas = pd.DataFrame(data_left)
data_right_pandas = pd.DataFrame(data_right)

%timeit -n2 -r2 pandas_joined = data_left_pandas.merge(data_right_pandas, on = 0, how = 'inner')

626 ms ± 52.7 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


## Moving Forward...

If this motivational section has convinced you to try polars instead of pandas, here is a more structured intro.






# Polars Series

Much like pandas, polars' fundamental building block is the series. 
A series is a column of data, with a name, and a dtype.
In the following we:

1. Create a series and demonstrate basic operations on it.
1. Demonstrate the various dtypes. 
1. Discuss missing values.
1. Filter a series.

## Series Housekeeping
Construct a series

In [26]:
s = pl.Series("a", [1, 2, 3])
s

a
i64
1
2
3


Make pandas series for comparison:

In [27]:
s_pandas = pd.Series([1, 2, 3], name = "a")

In [28]:
type(s)

polars.internals.series.series.Series

In [29]:
type(s_pandas)

pandas.core.series.Series

In [30]:
s.dtype

Int64

In [31]:
s_pandas.dtype

dtype('int64')

Renaming a series; will be very useful when operating on dataframe columns.

In [32]:
s.alias("b")

b
i64
1
2
3


In [33]:
s.clone()

a
i64
1
2
3


In [34]:
s.clone().append(pl.Series("a", [4, 5, 6]))

a
i64
1
2
3
4
5
6


Note: `series.append` operates in-place. That is why we cloned the series first.

Flatten a list of lists using `explode()`.

In [35]:
pl.Series("a", [[1, 2], [3, 4], [9, 10]]).explode()

a
i64
1
2
3
4
9
10


In [36]:
s.extend_constant(666, n=2)

a
i64
1
2
3
666
666


In [37]:
#| eval: false
s.new_from_index()

In [38]:
s.rechunk()

a
i64
1
2
3


In [39]:
s.rename("b", in_place=False) # has an in_place option. Unlike .alias()

b
i64
1
2
3


In [40]:
s.to_dummies()

a_1,a_2,a_3
u8,u8,u8
1,0,0
0,1,0
0,0,1


In [41]:
s.cleared() # creates an empty series, with same dtype

a
i64


Constructing a series of floats, for later use.

In [42]:
f = pl.Series("a", [1., 2., 3.])
f

a
f64
1.0
2.0
3.0


In [43]:
f.dtype

Float64

## Memory Representation of Series

Object size in memory. Super useful for profiling:

In [44]:
s.estimated_size(unit="gb")

2.2351741790771484e-08

In [45]:
s.chunk_lengths() # what is the length of each memory chunk?

[3]

## Filtering and Subsetting


In [46]:
s[0]

1

Filtering with booleans requires a series of booleans, not a list:

In [47]:
s.filter(pl.Series("a", [True, False, True])) # works

a
i64
1
3


Will not work:

In [48]:
#| eval: false

s[[True, False, True]]

Don't be confused with pandas!

In [49]:
#| eval: false

s.loc[[True, False, True]] 

In [50]:
s.head(2)

a
i64
1
2


In [51]:
s.limit(2)

a
i64
1
2


Negative indexing is not supported:

In [52]:
#| eval: false

s.head(-1)
s.limit(-1)

In [53]:
s.tail(2)

a
i64
2
3


In [54]:
s.sample(2, with_replacement=False)

a
i64
1
3


In [55]:
s.take([0, 2]) # same as .iloc

a
i64
1
3


In [56]:
s.slice(1, 2) # offset 1, length 2; same as pandas .iloc[1:3]

a
i64
2
3


In [57]:
s.take_every(2)

a
i64
1
3


## Aggregations

In [58]:
s.sum()

6

In [59]:
s.min()

1

In [60]:
s.arg_min()

0

In [61]:
s.mean()

2.0

In [62]:
s.median()

2.0

In [63]:
s.entropy()

-4.68213122712422

In [64]:
s.describe()

statistic,value
str,f64
"""min""",1.0
"""max""",3.0
"""null_count""",0.0
"""mean""",2.0
"""std""",1.0
"""count""",3.0


In [65]:
s.value_counts()

a,counts
i64,u32
3,1
1,1
2,1


## Object Transformations


In [66]:
pl.Series("a",[1,2,3,4]).reshape(dims = (2,2))

a
list[i64]
"[1, 2]"
"[3, 4]"


In [67]:
s.shift(1)

a
i64
""
1.0
2.0


In [68]:
s.shift(-1)

a
i64
2.0
3.0
""


In [69]:
s.shift_and_fill(1, 999)

a
i64
999
1
2


## Mathematical Transformations

In [70]:
s.abs()

a
i64
1
2
3


In [71]:
s.sin()

a
f64
0.841471
0.909297
0.14112


In [72]:
s.exp()

a
f64
2.718282
7.389056
20.085537


In [73]:
s.hash()

a
u64
13321499719149775801
8196255364589999986
3071011010030224171


In [74]:
s.log()

a
f64
0.0
0.693147
1.098612


In [75]:
s.peak_max()

false
False
True


In [76]:
s.sqrt()

a
f64
1.0
1.414214
1.732051


In [77]:
s.clip_max(2)

a
i64
1
2
2


In [78]:
s.clip_min(1)

a
i64
1
2
3


You cannot round integers, but you can round floats.


In [79]:
f.round(2)

a
f64
1.0
2.0
3.0


In [80]:
f.ceil()

a
f64
1.0
2.0
3.0


In [81]:
f.floor()

a
f64
1.0
2.0
3.0


In [82]:
s.is_in(pl.Series([1, 10]))

a
bool
True
False
False


__Caution__: `is_in()` in polars has an underscore, unlike `isin()` in pandas.



## Apply

Applying your own function:

In [83]:
s.apply(lambda x: x + 1)

a
i64
2
3
4


Using your own functions comes with a performance cost:

In [84]:
s1 = pl.Series(np.random.randn(int(1e5)))

%timeit -n2 -r2 s1+1

The slowest run took 5.40 times longer than the fastest. This could mean that an intermediate result is being cached.
409 µs ± 281 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)


In [85]:
%timeit -n2 -r2 s1.apply(lambda x: x + 1)

25 ms ± 5.74 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


## Cumulative Operations


In [86]:
s.cummax()

a
i64
1
2
3


In [87]:
s.cumsum()

a
i64
1
3
6


In [88]:
s.cumprod()

a
i64
1
2
6


In [89]:
s.ewm_mean(com=0.5)

a
f64
1.0
1.75
2.615385


## Sequential Operations


In [90]:
s.diff()

a
i64
""
1.0
1.0


In [91]:
s.pct_change()

a
f64
""
1.0
0.5


## Windowed Operations


In [92]:
s.rolling_apply(
  pl.sum, 
  window_size=2)

a
i64
""
3.0
5.0


Not all functions will work within a `rolling_apply`! Only polars' functions will.

In [93]:
#| eval: false

s.rolling_apply(np.sum, window_size=2) # will not work

In [94]:
s.rolling_max(window_size=2)

a
i64
""
2.0
3.0


In [95]:
s.clip(1, 2)

a
i64
1
2
2


In [96]:
s.clone()

a
i64
1
2
3


In [97]:
# check equality with clone
s == s.clone()

a
bool
True
True
True


## Booleans


In [98]:
b = pl.Series("a", [True, True, False])
b.dtype

Boolean

In [99]:
b.all()

False

In [100]:
b.any()

True

## Uniques and Duplicates


In [101]:
s.is_duplicated()

a
bool
False
False
False


In [102]:
s.is_unique()

a
bool
True
True
True


In [103]:
s.n_unique()

3

In [104]:
pl.Series([1,2,3,4,1]).unique_counts()

2
1
1
1


The first appearance of a value in a series:

In [105]:
pl.Series([1,2,3,4,1]).is_first()

true
True
True
True
False


## dtypes

__Note__. Unlike pandas, polars' test functions have an underscore: `is_numeric()` instead of `isnumeric()`.


### Testing

In [106]:
s.is_numeric()

True

In [107]:
s.is_float()

False

In [108]:
s.is_utf8()

False

In [109]:
s.is_boolean()

False

In [110]:
s.is_datelike()

False

Compare with Pandas Type Checkers:

In [111]:
pd.api.types.is_string_dtype(s_pandas)

False

In [112]:
pd.api.types.is_string_dtype(s)

False

### Casting


In [113]:
s.cast(pl.Int32)

a
i32
1
2
3


Things to note: 

- `s.cast()` returns a new series; the original `s` keeps its dtype, so there is no need to `clone()` first.
- `cast()` is polars' equivalent of pandas' `astype()`.
- For a list of dtypes see the official [documentation](https://pola-rs.github.io/polars/py-polars/html/reference/datatypes.html).



### Optimizing dtypes

Find the most efficient dtype for a series:

In [114]:
s.shrink_dtype()

a
i8
1
2
3


Also see [here](http://braaannigan.github.io/software/2022/10/31/polars-dtype-diet.html).

Shrink the memory allocation to the size of the actual data. Pass `in_place=True` to modify the series itself rather than return a new one.

In [115]:
s.shrink_to_fit() 

a
i64
1
2
3


## Ordering and Sorting 


In [116]:
s.sort()

a
i64
1
2
3


In [117]:
s.reverse()

a
i64
3
2
1


In [118]:
s.rank()

a
f32
1.0
2.0
3.0


In [119]:
s.arg_sort() 

a
u32
0
1
2


`arg_sort()` returns the indices that would sort the series. Same as R's `order()`.


In [120]:
s.sort() == s[s.arg_sort()]

a
bool
True
True
True


Applying `arg_sort()` twice gives the inverse permutation, which recovers the original series from the sorted one:

In [121]:
s == s.sort()[s.arg_sort().arg_sort()]

a
bool
True
True
True


In [122]:
s.shuffle(seed=1)

a
i64
2
1
3


## Missing

Pandas users will be excited to know that polars has built in missing value support (!) for all dtypes.
This has been a long awaited feature in the Python data science ecosystem, with implications on performance and syntax.


In [123]:
m = pl.Series("a", [1, 2, None, np.nan])
m.is_null()

a
bool
False
False
True
False


In [124]:
m.is_nan()

a
bool
False
False
""
True


In [125]:
m1 = pl.Series("a", [1, None, 2, ]) # python native None
m2 = pl.Series("a", [1, np.nan, 2, ]) # numpy's nan
m3 = pl.Series("a", [1, float('nan'), 2, ]) # python's nan
m4 = pd.Series([1, None, 2 ])
m5 = pd.Series([1, np.nan, 2, ])
m6 = pd.Series([1, float('nan'), 2, ])

In [126]:
[m1.sum(), m2.sum(), m3.sum(), m4.sum(), m5.sum(), m6.sum()]

[3, nan, nan, 3.0, 3.0, 3.0]

Things to note:

- The use of `is_null()` instead of pandas' `isna()`.
- Polars supports `np.nan`, but NaN is a float value, distinct from `None` (polars' null). A null is not considered NaN: `is_nan()` returns null for it, not `True`.
- Aggregations of pandas and polars series behave differently w.r.t. missing values:
  - Both will ignore `None`, which is unsafe.
  - Polars will not ignore `np.nan`, which is safe. Pandas is unsafe w.r.t. `np.nan`, and will ignore it.


Filling missing values; `None` and `np.nan` are treated differently:

In [127]:
m1.fill_null(0)

a
i64
1
0
2


In [128]:
m1.interpolate()

a
i64
1
1
2


In [129]:
m2.fill_null(0)

a
f64
1.0
""
2.0


In [130]:
m2.fill_nan(0)

a
f64
1.0
0.0
2.0


In [131]:
m1.drop_nulls()

a
i64
1
2


In [132]:
m1.drop_nans()

a
i64
1.0
""
2.0


In [133]:
m2.drop_nulls()

a
f64
1.0
""
2.0


## Export


In [134]:
s.to_frame()

a
i64
1
2
3


In [135]:
s.to_list()

[1, 2, 3]

In [136]:
s.to_numpy()

array([1, 2, 3])

In [137]:
s.to_pandas()

0    1
1    2
2    3
Name: a, dtype: int64

In [138]:
s.to_arrow()

<pyarrow.lib.Int64Array object at 0x7fd1d8532680>
[
  1,
  2,
  3
]

## Strings 
Like Pandas, accessed with the `.str` attribute.


In [139]:
st = pl.Series("a", ["foo", "bar", "baz"])

In [140]:
st.str.n_chars() # gets number of chars. In ASCII this is the same as lengths()

a
u32
3
3
3


In [141]:
st.str.lengths() # gets number of bytes in memory

a
u32
3
3
3


In [142]:
st.str.concat("-")

a
str
"""foo-bar-baz"""


In [143]:
st.str.contains("foo|tra|bar")

a
bool
True
True
False


In [144]:
st.str.count_match(pattern= 'o') # count literal matches

a
u32
2
0
0


Count pattern matches. 
Notice the `r"<regex pattern>"` syntax for regex (more about it [here](https://stackoverflow.com/questions/2241600/python-regex-r-prefix)). 

In [145]:
st.str.count_match(r"\w") # regex for alphanumeric

a
u32
3
3
3


In [146]:
st.str.ends_with("oo")

a
bool
True
False
False


In [147]:
st.str.starts_with("fo")

a
bool
True
False
False


To extract the first appearance of a pattern, use `extract`:

In [148]:
url = pl.Series("a", [
            "http://vote.com/ballon_dor?candidate=messi&ref=polars",

            "http://vote.com/ballon_dor?candidate=jorginho&ref=polars",

            "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars"
            ])

url.str.extract(r"=(\w+)", 1)

a
str
"""messi"""
"""jorginho"""
"""ronaldo"""


To extract all appearances of a pattern, use `extract_all`:

In [149]:
url.str.extract_all(r"=(\w+)")

a
list[str]
"[""=messi"", ""=polars""]"
"[""=jorginho"", ""=polars""]"
"[""=ronaldo"", ""=polars""]"


In [150]:
st.str.ljust(8, "*")

a
str
"""foo*****"""
"""bar*****"""
"""baz*****"""


In [151]:
st.str.rjust(8, "*")

a
str
"""*****foo"""
"""*****bar"""
"""*****baz"""


In [152]:
st.str.lstrip('f')

a
str
"""oo"""
"""bar"""
"""baz"""


In [153]:
st.str.rstrip('r')

a
str
"""foo"""
"""ba"""
"""baz"""


Replacing first appearance of a pattern:

In [154]:
st.str.replace(r"o", "ZZ")  

a
str
"""fZZo"""
"""bar"""
"""baz"""


In [155]:
st.str.replace(r"o+", "ZZ")  

a
str
"""fZZ"""
"""bar"""
"""baz"""


Replace all appearances of a pattern:

In [156]:
st.str.replace_all("o", "ZZ")

a
str
"""fZZZZ"""
"""bar"""
"""baz"""


String to list of strings. Number of splits inferred.

In [157]:
st.str.split(by="o")

a
list[str]
"[""f"", """", """"]"
"[""bar""]"
"[""baz""]"


In [158]:
s.str.split(by="a", inclusive=True)

SchemaError: Series of dtype: Int64 != Utf8

String to dict of strings. Number of splits fixed.

In [159]:
st.str.split_exact("a", 2)

a
struct[3]
"{""foo"",null,null}"
"{""b"",""r"",null}"
"{""b"",""z"",null}"


String to dict of strings. Length of output fixed.

In [160]:
st.str.splitn("a", 4)

a
struct[4]
"{""foo"",null,null,null}"
"{""b"",""r"",null,null}"
"{""b"",""z"",null,null}"


Strip white spaces.

In [161]:
st.str.rjust(8, " ").str.strip()

a
str
"""foo"""
"""bar"""
"""baz"""


In [162]:
st.str.to_uppercase()

a
str
"""FOO"""
"""BAR"""
"""BAZ"""


In [163]:
st.str.to_lowercase()

a
str
"""foo"""
"""bar"""
"""baz"""


In [164]:
st.str.zfill(5)

a
str
"""00foo"""
"""00bar"""
"""00baz"""


## Date and Time

There are 4 datetime dtypes in polars:

1. Date: A date, without hours. Generated with `pl.Date()`.
2. Datetime: Date and hours. Generated with `pl.Datetime()`.
3. Duration: As the name suggests. Similar to `timedelta` in pandas. Generated with `pl.Duration()`.
4. Time: Hour of day. Generated with `pl.Time()`.


### Converting from Strings


In [165]:
sd = pl.Series(
    "date",
    [
        "2021-04-22",
        "2022-01-04 00:00:00",
        "01/31/22",
        "Sun Jul  8 00:34:60 2001",
    ],
)
sd.str.strptime(pl.Date, "%F", strict=False)

date
date
2021-04-22
""
""
""


In [166]:
sd.str.strptime(pl.Date, "%F %T",strict=False)

date
date
""
2022-01-04
""
""


In [167]:
sd.str.strptime(pl.Date, "%D", strict=False)

date
date
""
""
2022-01-31
""


### Time Range


In [168]:
from datetime import datetime, timedelta

start = datetime(year= 2001, month=2, day=2)
stop = datetime(year=2001, month=2, day=3)

date = pl.date_range(
  low=start, 
  high=stop, 
  interval=timedelta(seconds=500*61))
date

2001-02-02 00:00:00
2001-02-02 08:28:20
2001-02-02 16:56:40


Things to note:

- How else could I have constructed this series? What other types are accepted as `low` and `high`? 
- `pl.date_range` may return a series of dtype `Date` or `Datetime`. This depends on the granularity of the inputs.


In [169]:
date.dtype

Datetime(tu='us', tz=None)

Cast to different time unit. 
May be useful when joining datasets, and the time unit is different.

In [170]:
date.dt.cast_time_unit(tu="ms")

2001-02-02 00:00:00
2001-02-02 08:28:20
2001-02-02 16:56:40


### From Date to String


In [171]:
date.dt.strftime("%Y-%m-%d")

"""2001-02-02"""
"""2001-02-02"""
"""2001-02-02"""


### Extract Time Sub-Units


In [172]:
date.dt.second()

0
20
40


In [173]:
date.dt.minute()

0
28
56


In [174]:
date.dt.hour()

0
8
16


In [175]:
date.dt.day()

2
2
2


In [176]:
date.dt.week()

5
5
5


In [177]:
date.dt.weekday()

5
5
5


In [178]:
date.dt.month()

2
2
2


In [179]:
date.dt.year()

2001
2001
2001


In [180]:
date.dt.ordinal_day() # day in year

33
33
33


In [181]:
date.dt.quarter()

1
1
1


### Durations 

Equivalent to pandas' `timedelta` dtype.


In [182]:
diffs = date.diff()
diffs

null
8h 28m 20s
8h 28m 20s


In [183]:
diffs.dtype

Duration(tu='us')

In [184]:
diffs.dt.seconds()

null
30500
30500


In [185]:
diffs.dt.minutes()

null
508
508


In [186]:
diffs.dt.days()

null
0
0


In [187]:
diffs.dt.hours()

null
8
8


### Date Aggregations
Note that aggregating dates returns a `datetime` type object.


In [188]:
date.dt.max()

datetime.datetime(2001, 2, 2, 16, 56, 40)

In [189]:
date.dt.min()

datetime.datetime(2001, 2, 2, 0, 0)

I have no idea what is an "average date", but it can be computed.

In [190]:
date.dt.mean()

datetime.datetime(2001, 2, 2, 8, 28, 20)

In [191]:
date.dt.median()

datetime.datetime(2001, 2, 2, 8, 28, 20)

### Date Transformations

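Notice the syntax of `offset_by`. It is similar to R's `lubridate` package. In the offset string, `m` stands for minutes (months are written `mo`), which is why the output below shifts by 2 minutes.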

In [192]:
date.dt.offset_by(by="1y2m20d")

2002-02-22 00:02:00
2002-02-22 08:30:20
2002-02-22 16:58:40


Negative offset is also allowed.

In [193]:
date.dt.offset_by(by="-1y2m20d")

2000-01-12 23:58:00
2000-01-13 08:26:20
2000-01-13 16:54:40


In [194]:
date.dt.round("1y")

2001-01-01 00:00:00
2001-01-01 00:00:00
2001-01-01 00:00:00


In [195]:
date2 = date.dt.truncate("30m") # round to period
pd.crosstab(date,date2)

col_0,2001-02-02 00:00:00,2001-02-02 08:00:00,2001-02-02 16:30:00
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2001-02-02 00:00:00,1,0,0
2001-02-02 08:28:20,0,1,0
2001-02-02 16:56:40,0,0,1


## Comparing Series 

In [196]:
s.series_equal(pl.Series("a", [1, 2, 3]))

True

# DataFrames

General:
1. There is no row index (like R's `data.frame`, `data.table`, and `tibble`; unlike Python's `pandas`). 
1. Will not accept duplicate column names (unlike pandas).


## DataFrame Housekeeping

A frame can be created as you would expect. 
From a dictionary of series, a numpy array, a pandas dataframe, or a list of polars (or pandas) series, etc.


In [197]:
df = pl.DataFrame({"integer": [1, 2, 3], 
                          "date": [
                              (datetime(2022, 1, 1)), 
                              (datetime(2022, 1, 2)), 
                              (datetime(2022, 1, 3))
                          ], 
                          "float":[4.0, 5.0, 6.0]})

df

integer,date,float
i64,datetime[μs],f64
1,2022-01-01 00:00:00,4.0
2,2022-01-02 00:00:00,5.0
3,2022-01-03 00:00:00,6.0


In [198]:
print(df)

shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
└─────────┴─────────────────────┴───────┘


Things to note:

1. The frame may be printed with Jupyter's styling, or as plain text with a `print()` statement.
1. Shape, and dtypes, are part of the output.


In [199]:
df.columns

['integer', 'date', 'float']

In [200]:
df.shape

(3, 3)

In [201]:
df.height # probably more useful than df.shape[0]

3

In [202]:
df.width

3

In [203]:
df.schema # similar to pandas info()

{'integer': Int64, 'date': Datetime(tu='us', tz=None), 'float': Float64}

In [205]:
df.is_empty()

False

In [206]:
df.cleared() # make empty copy

integer,date,float
i64,datetime[μs],f64


In [207]:
df.clone() # deep copy

integer,date,float
i64,datetime[μs],f64
1,2022-01-01 00:00:00,4.0
2,2022-01-02 00:00:00,5.0
3,2022-01-03 00:00:00,6.0


Renaming columns can be done with `rename()`. 
Later, we will see it may also be done with an `alias()` statement within a `with_columns()` context.

In [208]:
df.rename({'integer': 'integer2'})

integer2,date,float
i64,datetime[μs],f64
1,2022-01-01 00:00:00,4.0
2,2022-01-02 00:00:00,5.0
3,2022-01-03 00:00:00,6.0


## Dataframe in Memory


In [209]:
df.estimated_size(unit="gb")

6.705522537231445e-08

In [210]:
df.n_chunks() # number of ChunkedArrays in the dataframe

1

## Statistical Aggregations 


In [211]:
df.describe()

describe,integer,date,float
str,f64,str,f64
"""count""",3.0,"""3""",3.0
"""null_count""",0.0,"""0""",0.0
"""mean""",2.0,,5.0
"""std""",1.0,,1.0
"""min""",1.0,"""2022-01-01 00:...",4.0
"""max""",3.0,"""2022-01-03 00:...",6.0
"""median""",2.0,,5.0


The usual statistical aggregations operate column-wise (and in parallel).


In [212]:
df.max()

integer,date,float
i64,datetime[μs],f64
3,2022-01-03 00:00:00,6.0


In [213]:
df.min()

integer,date,float
i64,datetime[μs],f64
1,2022-01-01 00:00:00,4.0


In [214]:
df.mean()

integer,date,float
f64,datetime[μs],f64
2.0,,5.0


In [215]:
df.median()

integer,date,float
f64,datetime[μs],f64
2.0,,5.0


In [216]:
df.sum()

integer,date,float
i64,datetime[μs],f64
6,,15.0


In [217]:
df.std()

integer,date,float
f64,datetime[μs],f64
1.0,,1.0


In [218]:
df.quantile(0.1)

integer,date,float
f64,datetime[μs],f64
1.0,,4.0


## Extraction

1. If you are used to pandas, recall there is no index. There is thus no need for `loc` vs. `iloc`, `reset_index()`, etc.
2. Filtering and selection is possible with the `[` operator, or the `filter()` and `select()` methods. The latter is recommended to facilitate lazy evaluation (discussed later).



Single cell extraction.

In [219]:
df[0,0] # like pandas .iloc[]

1

Slicing along rows.

In [220]:
df[0:1] 

integer,date,float
i64,datetime[μs],f64
1,2022-01-01 00:00:00,4.0


Slicing along columns.

In [221]:
df[:,0:1]

integer
i64
1
2
3


### Filtering Rows

Row filtering by label

In [222]:
df.filter(pl.col("integer") == 2)

integer,date,float
i64,datetime[μs],f64
2,2022-01-02 00:00:00,5.0


Things to note:

- The `[` operator does not support indexing with boolean such as `df[df["integer"] == 2]`.
- The `filter()` method is recommended over `[` by the authors of polars, to facilitate lazy evaluation (discussed later).






### Selecting Columns

Column selection by label

In [223]:
df.select("integer")
# or df['integer']
# or df[:,'integer']

integer
i64
1
2
3


Multiple column selection by label

In [224]:
df.select(["integer", "float"])
# or df[['integer', 'float']]

integer,float
i64,f64
1,4.0
2,5.0
3,6.0


Column slicing by label

In [225]:
df[:,"integer":"float"]

integer,date,float
i64,datetime[μs],f64
1,2022-01-01 00:00:00,4.0
2,2022-01-02 00:00:00,5.0
3,2022-01-03 00:00:00,6.0


Note: Slicing with `df.select()` does not support ranges such as `df.select("integer":"float")`; only lists of column names.


In [226]:
df.drop("integer")

date,float
datetime[μs],f64
2022-01-01 00:00:00,4.0
2022-01-02 00:00:00,5.0
2022-01-03 00:00:00,6.0


Polars' `drop()` has no `inplace` argument. Use `df.drop_in_place()` instead.


Select along dtype

In [227]:
df.select(pl.col(pl.Int64))

integer
i64
1
2
3


In [228]:
df.select(pl.col(pl.Float64))

float
f64
4.0
5.0
6.0


In [229]:
df.select(pl.col(pl.Utf8))

Things to note:

- The `pl.col()` function will be very useful for referencing columns in a dataframe. It may extract a single column, a list, a particular (polars) dtype, a regex pattern, or simply all columns.
- When extracting along dtype, use polars' dtypes, not numpy's or pandas' dtypes. For example, use `pl.Int64` instead of `np.int64`.




## Missing


In [230]:
df.is_unique()

true
True
True


In [231]:
df.is_duplicated()

false
False
False


In [232]:
df.n_unique()

3

In [233]:
df.null_count()

integer,date,float
u32,u32,u32
0,0,0


In [234]:
df.drop_nulls()

integer,date,float
i64,datetime[μs],f64
1,2022-01-01 00:00:00,4.0
2,2022-01-02 00:00:00,5.0
3,2022-01-03 00:00:00,6.0


## Transformations

- The general idea of column transformation is to wrap all transformations in a `with_columns()` method, and then select the columns to operate on with `pl.col()`.
- The output column will have the same name as the input, unless you use the `alias()` method to rename it. 
- The `with_columns()` is called a __polars context__.
- The flavor of the `with_columns()` context is similar to pandas' `assign()`.


In [235]:
df.with_columns([
    pl.col("integer").alias("integer2"),
    pl.col("integer") * 2
])

integer,date,float,integer2
i64,datetime[μs],f64,i64
2,2022-01-01 00:00:00,4.0,1
4,2022-01-02 00:00:00,5.0,2
6,2022-01-03 00:00:00,6.0,3


Things to note:

- The column `integer` is copied, by renaming it to `integer2`.
- The column `integer` is multiplied by 2 and overwritten in the returned frame, because no `alias` is used.
- You cannot use `[` to assign! This would not have worked: `df['integer3'] = df['integer'] * 2`.




If a selection returns multiple columns, all will be transformed:

In [236]:
df.with_columns([
    pl.col([pl.Int64,pl.Float64])*2
])

integer,date,float
i64,datetime[μs],f64
2,2022-01-01 00:00:00,8.0
4,2022-01-02 00:00:00,10.0
6,2022-01-03 00:00:00,12.0


## Sorting


In [237]:
df.sort("integer")

integer,date,float
i64,datetime[μs],f64
1,2022-01-01 00:00:00,4.0
2,2022-01-02 00:00:00,5.0
3,2022-01-03 00:00:00,6.0


## Uniques


In [238]:
df.unique() # same as pandas .drop_duplicates()

integer,date,float
i64,datetime[μs],f64
1,2022-01-01 00:00:00,4.0
2,2022-01-02 00:00:00,5.0
3,2022-01-03 00:00:00,6.0


## Joins


`.hstack()` is like pandas' `pd.concat(axis=1)` or R's `cbind`.

In [239]:
df.hstack([pl.Series("c", np.repeat(1, df.height))])

integer,date,float,c
i64,datetime[μs],f64,i64
1,2022-01-01 00:00:00,4.0,1
2,2022-01-02 00:00:00,5.0,1
3,2022-01-03 00:00:00,6.0,1


Note: Joining along rows is possible only if matched columns have the same dtype. 
Timestamps may be tricky because they may have different time units.
The following snippet may be useful when joining multiple dataframes with different time units.

In [240]:
#| eval: false
df.with_columns(
    pl.col(pl.Datetime("ns")).dt.cast_time_unit(tu="ms")
)            

## Reshaping

## Groupby

## Query Planning and Optimization

- describe_plan
- show_graph
- describe_optimized_plan



# I/O

## Import

- From csv
- From parquet
- From multiple parquets
- From Arrow DataSet



Warnings:

1. String caching


## Export




# Plotting

# Polars and ML

# Strings

# Datetimes


# Config


In [241]:
list(dir(pl.Config))

['__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'load',
 'restore_defaults',
 'save',
 'set_ascii_tables',
 'set_fmt_str_lengths',
 'set_tbl_cell_alignment',
 'set_tbl_cols',
 'set_tbl_column_data_type_inline',
 'set_tbl_dataframe_shape_below',
 'set_tbl_formatting',
 'set_tbl_hide_column_data_types',
 'set_tbl_hide_column_names',
 'set_tbl_hide_dataframe_shape',
 'set_tbl_hide_dtype_separator',
 'set_tbl_rows',
 'set_tbl_width_chars',
 'set_verbose',
 'state',
 'with_columns_kwargs']