In [1]:
#| echo: false
# %pip install --upgrade pip
# %pip install --upgrade polars
# %pip install --upgrade pyarrow
# %pip install --upgrade Pandas
# %pip install --upgrade plotly
# %pip freeze > requirements.txt

In [2]:
#| label: setup-env

# %pip install -r requirements.txt

In [3]:
#| label: Polars-version
# check your Polars version
%pip show Polars

Name: polars
Version: 0.20.26
Summary: Blazingly fast DataFrame library
Home-page: 
Author: 
Author-email: Ritchie Vink <ritchie46@gmail.com>
License: 
Location: /Users/johnros/workspace/polars_demo/.venv/lib/python3.11/site-packages
Requires: 
Required-by: 


Note: you may need to restart the kernel to use updated packages.


In [4]:
#| label: Pandas-version
# check your Pandas version
%pip show Pandas

Name: pandas
Version: 2.2.2
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: 
Author-email: The Pandas Development Team <pandas-dev@python.org>
License: BSD 3-Clause License

Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
All rights reserved.

Copyright (c) 2011-2023, Open source contributors.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its

Note: you may need to restart the kernel to use updated packages.




In [5]:
#| label: preliminaries

import polars as pl
pl.Config(fmt_str_lengths=50)
import polars.selectors as cs

import pandas as pd
import numpy as np
import pyarrow as pa
import plotly.express as px
import string
import random
import os
import sys
%matplotlib inline 
import matplotlib.pyplot as plt
from datetime import datetime

# Following two lines only required to view plotly when rendering from VScode. 
import plotly.io as pio
# pio.renderers.default = "plotly_mimetype+notebook_connected+notebook"
pio.renderers.default = "plotly_mimetype+notebook"

Which Polars version and dependencies are installed?

In [6]:
#| label: show-versions
pl.show_versions()

--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.9.0
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.30
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not insta

How many cores are available for parallelism?

In [7]:
#| label: show-cores
pl.thread_pool_size()

8

# Preliminaries


## A Polars Frame Can Hold Anything

Fun fact: Polars, like Pandas, can store anything within a cell. For instance, a Polars frame can hold a Polars frame.


In [8]:
#| label: make-polars-frame

df = pl.DataFrame(
    {
        "a": [
            pl.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}), 
            pl.DataFrame({"x": [7, 8, 9], "y": [10, 11, 12]})
            ],
        "b": ["a", "b"]
    }
)

df

a,b
object,str
"shape: (3, 2) ┌─────┬─────┐ │ x ┆ y │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 4 │ │ 2 ┆ 5 │ │ 3 ┆ 6 │ └─────┴─────┘","""a"""
"shape: (3, 2) ┌─────┬─────┐ │ x ┆ y │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 7 ┆ 10 │ │ 8 ┆ 11 │ │ 9 ┆ 12 │ └─────┴─────┘","""b"""


Things to note: 

- The dtype of column `a` (not of the whole frame) is the Polars `Object` dtype, shown as `object` in the output.


## Motivation For Nested dtypes

Consider the following scenarios:

1. I want to compute with part-of-strings generated by a split. 
2. I want to store and compute with lists (in the general sense) of varying lengths. Examples include paths of cars on the globe and paths of users on a website.
3. I want to group columns together, and compute with them as a unit.
4. I want to read a JSON/XML/YML with nested structures.


How can I go about it?

1. Since a Polars object, like a Python object, can hold anything, I could store the nested data as a Polars object.
2. I could generate many columns, and fill with nulls when lengths differ.
3. I could store each unit as a Polars Series.

The object option is always available, but I will try to avoid it. 
If I can avoid Polars (i.e. Rust) calling Python, I will.

The second option could work. If Polars stores nulls efficiently, it may be a good one. 

The third option would be great. I would be computing within Rust, avoid storing zillions of nulls, and have a clean API.
This is today's topic. 



## Nested dtypes in Polars

The Polars nested dtypes, inherited from [Apache Arrow](https://wesm.github.io/arrow-site-test/format/Layout.html):


1.  **Polars List** 
2.  **Polars Array** 
3.  **Polars Struct** 

A **Polars list** is a Polars Series within a cell: all elements in the cell must have the same dtype; the elements are unnamed, so they are accessed by index or by filtering.
Do not be confused by the name. A Polars list is not a Python list.  

A **Polars array** is a Polars list with a fixed length: all elements in the cell must have the same dtype, and all cells must have the same length; the elements are unnamed, so they are accessed by index or by filtering. Another mental model of a Polars array is a numpy 2D array within a Polars frame.


A **Polars struct** is a Polars DataFrame within a cell: all cells must have the same fields; the elements are named, so they are accessed by field name. 
Do not think of the struct as a Python dict within a cell, because all rows must have the same fields.



# Polars List {#sec-list}



## Making a Polars List

Make a Polars list from a Python list.


In [9]:
#| label: make-list

pl.DataFrame(
    {
        "a": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        "b": ["a", "b", "c"]
    }
)

a,b
list[i64],str
"[1, 2, 3]","""a"""
"[4, 5, 6]","""b"""
"[7, 8, 9]","""c"""


Things to note:

- `a` is not a Python list, rather, it is a Polars list. There are differences between the two. For instance, all elements in a Polars list must have the same dtype. Also, the Polars list is a columnar data structure, which is more efficient for certain operations. Finally, the Polars list is a first-class citizen in the Polars API, with its own methods and functions.
- Can a Polars list hold a Polars list? Yes. It can hold any Polars object, even nested ones (list of list of list, etc.); provided that all elements in the cell have the same dtype.

Make a Polars list of lists from nested Python lists:


In [10]:
#| label: make-list-of-list

pl.DataFrame(
    {
        "a": [[[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[10, 11, 12], [13, 14, 15], [16, 17, 18]]],
        "b": ["a", "b"]
    }
)

a,b
list[list[i64]],str
"[[1, 2, 3], [4, 5, 6], [7, 8, 9]]","""a"""
"[[10, 11, 12], [13, 14, 15], [16, 17, 18]]","""b"""


More often you will not make a Polars list directly; rather, it will be the result of some operation:

1. When grouping without aggregation. 
2. When splitting a string into an unknown number of parts (a known number of parts will return a Polars struct, and I expect in the future a Polars array).
3. When wrapping a bunch of columns with `pl.concat_list()`.
4. When "imploding" (aka "collapsing") a `pl.Expr()`.  



### List from Aggregation

A Polars `group_by` will actually create a Polars list internally, before applying the `.agg()` method.


In [11]:
#| label: list-from-aggregation

df = pl.DataFrame(
    {
        "a": [1, 1, 2, 2, 3, 3],
        "b": [1, 2, 3, 4, 5, 6]
    }
)

df.group_by("a").agg(pl.col("b"))

a,b
i64,list[i64]
2,"[3, 4]"
3,"[5, 6]"
1,"[1, 2]"


### List From Splitting a String


In [12]:
#| label: make-frame-with-string-to-split

df_strings = pl.DataFrame(
    {
        "a": ["apple, banana, orange", "apple, banana", "apple"],
    }
)

df_strings

a
str
"""apple, banana, orange"""
"""apple, banana"""
"""apple"""


Now split `a` into... whatever Polars decides to return. 
Hint: a Polars list of Polars strings.


In [13]:
(
    df_strings
    .select(
        pl.col('a').str.split(", ")
        )
)

a
list[str]
"[""apple"", ""banana"", ""orange""]"
"[""apple"", ""banana""]"
"[""apple""]"


### List From pl.concat_list()


In [14]:
#| label: list-from-concatenation

df.select(pl.concat_list([pl.col("a"), pl.col("b")]))

a
list[i64]
"[1, 1]"
"[1, 2]"
"[2, 3]"
"[2, 4]"
"[3, 5]"
"[3, 6]"


### List From `.list.concat()` 


In [15]:
(
    df
    .with_columns(
        pl.concat_list([pl.col("a"), pl.col("b")]).alias('ab')
        )
    .select(pl.col('ab').list.concat(['a','a','a']))
    .to_pandas()
)

Unnamed: 0,ab
0,"[1, 1, 1, 1, 1]"
1,"[1, 2, 1, 1, 1]"
2,"[2, 3, 2, 2, 2]"
3,"[2, 4, 2, 2, 2]"
4,"[3, 5, 3, 3, 3]"
5,"[3, 6, 3, 3, 3]"


### List from Imploding a `pl.Expr()`


In [16]:
#| label: list-from-imploding

df.with_columns(pl.col("b").implode())

a,b
i64,list[i64]
1,"[1, 2, … 6]"
1,"[1, 2, … 6]"
2,"[1, 2, … 6]"
2,"[1, 2, … 6]"
3,"[1, 2, … 6]"
3,"[1, 2, … 6]"


Implode within group:


In [17]:
#| label: implode-within-group

df.group_by("a").agg(pl.col("b"))

a,b
i64,list[i64]
3,"[5, 6]"
1,"[1, 2]"
2,"[3, 4]"


Implode over:


In [18]:
#| label: implode-over
#| eval: false
df.with_columns(pl.col("b").over("a")) # no good
df.with_columns(pl.col("b").implode().over("a")) # no good
df.with_columns(pl.concat_list('b').over("a")) # no good

In [19]:
df.with_columns(pl.col("b").over("a", mapping_strategy="join")) # good!

a,b
i64,list[i64]
1,"[1, 2]"
1,"[1, 2]"
2,"[3, 4]"
2,"[3, 4]"
3,"[5, 6]"
3,"[5, 6]"


See [here](https://github.com/pola-rs/polars/pull/6487) for more context. 



###  More on `.over(mapping_strategy=...)`

TODO



## Operating on List Elements

The most general way is to use `list.eval(pl.element())`:


In [20]:
#| label: list-eval-element

df_with_list = df.with_columns(pl.concat_list([pl.col("a"), pl.col("b")]).alias('ab'))


(
    df_with_list
    .select(
        pl.col('ab').list.eval(pl.element().add(1000))
        )
)

ab
list[i64]
"[1001, 1001]"
"[1001, 1002]"
"[1002, 1003]"
"[1002, 1004]"
"[1003, 1005]"
"[1003, 1006]"


Things to note:

- `.eval()` belongs to the `.list` namespace. 
- `pl.element()` is an expression that refers to the elements of the list (not to be confused with the selectors in `polars.selectors`). It has almost all the methods available to `pl.col()`.


In [21]:
#| label: list-eval-element-methods
print(dir(pl.element()))

['__abs__', '__add__', '__and__', '__annotations__', '__array_ufunc__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__invert__', '__le__', '__lt__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_accessors', '_from_pyexpr', '_map_batches_wrapper', '_pyexpr', '_register_plugin', '_repr_html_', 'abs', 'add', 'agg_groups', 'alias', 'all', 'and_', 'any', 'append', 'apply', 'approx_n_unique', 'arccos', 'arccosh', 'arcsin', 'arcsinh', 'arctan', 'arctanh', 'arg_max', 'arg_min'

## List Methods

`.list.eval(pl.element())` applies an expression to the elements of each list. 
If you want to operate on the list as a whole, you can either use existing [list methods](https://docs.pola.rs/py-polars/html/reference/expressions/list.html), or use `pl.col()` methods after exploding the list (see @sec-list-explode).

We now demonstrate some of the list methods.



### Selecting Elements

Get, Gather, Slice, Gather_Every


In [22]:
#| label: list-get-and-gather

(
    df_with_list.select(
        pl.col('ab'),
        pl.col('ab').list.get(0).alias('get_1'), 
        pl.col('ab').list.gather([0]).alias('gather_1'), # returns a list
        pl.col('ab').list.slice(0,1).alias('slice_1'), # returns a list
        pl.col('ab').list.gather([0, 1]).alias('gather_2'),
        pl.col('ab').list.slice(0,2).alias('slice_2'),
        pl.col('ab').list.gather_every(2).alias('gather_every_2'),
        )
)

ab,get_1,gather_1,slice_1,gather_2,slice_2,gather_every_2
list[i64],i64,list[i64],list[i64],list[i64],list[i64],list[i64]
"[1, 1]",1,[1],[1],"[1, 1]","[1, 1]",[1]
"[1, 2]",1,[1],[1],"[1, 2]","[1, 2]",[1]
"[2, 3]",2,[2],[2],"[2, 3]","[2, 3]",[2]
"[2, 4]",2,[2],[2],"[2, 4]","[2, 4]",[2]
"[3, 5]",3,[3],[3],"[3, 5]","[3, 5]",[3]
"[3, 6]",3,[3],[3],"[3, 6]","[3, 6]",[3]


First, Head, Last, Tail


In [23]:
#| label: list-methods

(
    df_with_list.select(
        pl.col('ab'),
        pl.col('ab').list.get(0).alias('first_1'),
        pl.col('ab').list.first().alias('first_2'),
        pl.col('ab').list.head(1).alias('first_5'), # returns a list
        pl.col('ab').list.last().alias('last_2'), 
        pl.col('ab').list.tail(1).alias('last_1'), # returns a list
        pl.col('ab').list.tail(1).list.first().alias('last_3'),
        )
)

ab,first_1,first_2,first_5,last_2,last_1,last_3
list[i64],i64,i64,list[i64],i64,list[i64],i64
"[1, 1]",1,1,[1],1,[1],1
"[1, 2]",1,1,[1],2,[2],2
"[2, 3]",2,2,[2],3,[3],3
"[2, 4]",2,2,[2],4,[4],4
"[3, 5]",3,3,[3],5,[5],5
"[3, 6]",3,3,[3],6,[6],6


Shift


In [24]:
#| label: list-shift

(
    df_with_list.select(
        pl.col('ab'),
        pl.col('ab').list.shift(1).alias('shift_1'),
        pl.col('ab').list.shift(-1).alias('shift_-1'),
        )
)

ab,shift_1,shift_-1
list[i64],list[i64],list[i64]
"[1, 1]","[null, 1]","[1, null]"
"[1, 2]","[null, 1]","[2, null]"
"[2, 3]","[null, 2]","[3, null]"
"[2, 4]","[null, 2]","[4, null]"
"[3, 5]","[null, 3]","[5, null]"
"[3, 6]","[null, 3]","[6, null]"


Sample


In [25]:
#| label: list-sample

(
    df_with_list.select(
        pl.col('ab'),
        pl.col('ab').list.sample(10, with_replacement=True).alias('sample'),
        )
    .to_pandas()
)

Unnamed: 0,ab,sample
0,"[1, 1]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
1,"[1, 2]","[2, 2, 1, 1, 2, 2, 1, 2, 2, 1]"
2,"[2, 3]","[2, 3, 3, 2, 2, 3, 2, 2, 2, 2]"
3,"[2, 4]","[2, 4, 2, 4, 4, 4, 4, 2, 2, 4]"
4,"[3, 5]","[5, 3, 3, 3, 5, 5, 5, 5, 5, 5]"
5,"[3, 6]","[3, 6, 6, 3, 6, 6, 6, 6, 3, 3]"


Careful where you put the `.sample()`!


In [26]:
(
    df_with_list
    .select(
        pl.col('ab').list.sample(10, with_replacement=True).alias('within_row'),
        pl.col('ab').sample(6, with_replacement=True).alias('within_column'),
    )
    .to_pandas()
)

Unnamed: 0,within_row,within_column
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[2, 4]"
1,"[2, 1, 2, 2, 1, 1, 2, 1, 2, 1]","[3, 5]"
2,"[3, 2, 3, 3, 2, 3, 2, 3, 3, 2]","[1, 1]"
3,"[4, 2, 4, 2, 2, 2, 2, 2, 2, 2]","[1, 2]"
4,"[3, 5, 3, 5, 3, 5, 3, 5, 5, 5]","[1, 2]"
5,"[3, 3, 6, 6, 6, 3, 3, 3, 3, 3]","[1, 1]"


### Statistical Aggregations

Methods in this family: `.arg_max`, `.arg_min`, `.var`, `.std`, `.n_unique`, `.unique`, `.min`, `.max`, `.sum`, `.mean`, `.median`, `.len`.


In [27]:
#| label: list-statistical-aggregations

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.min().alias('min'),
        pl.col('ab').list.max().alias('max'),
        pl.col('ab').list.sum().alias('sum'),
        pl.col('ab').list.mean().alias('mean'),
        pl.col('ab').list.median().alias('median'),
        pl.col('ab').list.var().alias('var'),
        pl.col('ab').list.std().alias('std'),
        pl.col('ab').list.n_unique().alias('n_unique'),
        pl.col('ab').list.unique().alias('unique'),
        pl.col('ab').list.len().alias('len'),
        )
)

ab,min,max,sum,mean,median,var,std,n_unique,unique,len
list[i64],i64,i64,i64,f64,f64,f64,f64,u32,list[i64],u32
"[1, 1]",1,1,2,1.0,1.0,0.0,0.0,1,[1],2
"[1, 2]",1,2,3,1.5,1.5,0.5,0.707107,2,"[1, 2]",2
"[2, 3]",2,3,5,2.5,2.5,0.5,0.707107,2,"[2, 3]",2
"[2, 4]",2,4,6,3.0,3.0,2.0,1.414214,2,"[2, 4]",2
"[3, 5]",3,5,8,4.0,4.0,2.0,1.414214,2,"[3, 5]",2
"[3, 6]",3,6,9,4.5,4.5,4.5,2.12132,2,"[3, 6]",2


### Ordering and Ranking

Methods in this family: `.sort`, `.reverse`.


In [28]:
#| label: list-ordering-and-ranking

(
    df_with_list
    .select(
        pl.col('ab'),
        # pl.col('ab').list.rank().alias('rank'),
        pl.col('ab').list.sort().alias('sort'),
        pl.col('ab').list.reverse().alias('reverse'),
        )
)

ab,sort,reverse
list[i64],list[i64],list[i64]
"[1, 1]","[1, 1]","[1, 1]"
"[1, 2]","[1, 2]","[2, 1]"
"[2, 3]","[2, 3]","[3, 2]"
"[2, 4]","[2, 4]","[4, 2]"
"[3, 5]","[3, 5]","[5, 3]"
"[3, 6]","[3, 6]","[6, 3]"


### Sequence Operations

.diff


In [29]:
#| label: list-diff

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.diff().alias('diff'),
        )
)

ab,diff
list[i64],list[i64]
"[1, 1]","[null, 0]"
"[1, 2]","[null, 1]"
"[2, 3]","[null, 1]"
"[2, 4]","[null, 2]"
"[3, 5]","[null, 2]"
"[3, 6]","[null, 3]"


### Logical Aggregations


In [30]:
#| label: list-logical-aggregations

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.eval(pl.element().eq(1)).list.any().alias('any'),
        pl.col('ab').list.eval(pl.element().eq(1)).list.all().alias('all'),
        )
)

ab,any,all
list[i64],bool,bool
"[1, 1]",True,True
"[1, 2]",True,False
"[2, 3]",False,False
"[2, 4]",False,False
"[3, 5]",False,False
"[3, 6]",False,False


### String Operations


In [31]:
#| label: list-strings

df_2_with_list = pl.DataFrame(
    {
        "ab": [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]],
        'sep': ['@', '#', '$']
    }
)

(
    df_2_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.join(pl.col('sep')).alias('joined'),
        pl.col('ab').list.contains('a').alias('contains_a'),
        pl.col('ab').list.count_matches('a').alias('count_a'),
        )
)

ab,joined,contains_a,count_a
list[str],str,bool,u32
"[""a"", ""b"", ""c""]","""a@b@c""",True,1
"[""d"", ""e"", ""f""]","""d#e#f""",False,0
"[""g"", ""h"", ""i""]","""g$h$i""",False,0


### Filtering

As of the Polars version used here (0.20.26), there is no filter method in the list namespace.
There are, however, several workarounds. 


In [32]:
#| label: list-filter

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.eval(pl.element().gt(2)).alias('gt_2'),
        # pl.col('ab').list.filter(pl.col('ab').list.gt(5)).alias('gt_5'), # not implemented yet
        )
)

ab,gt_2
list[i64],list[bool]
"[1, 1]","[false, false]"
"[1, 2]","[false, false]"
"[2, 3]","[false, true]"
"[2, 4]","[false, true]"
"[3, 5]","[true, true]"
"[3, 6]","[true, true]"


Let's try to filter elements of a list based on the length of the string.


In [33]:
# Sample data
data = pl.DataFrame({
    'string_list': [
        ["apple", "banana", "orange", "grape", "pear"],
        ["apple", "banana", "orange", "grape"],
        ["apple", "banana", "orange"],
        ["apple", "banana"],
        ["apple"],
        ["apple", "banana", "orange", "grape", "pear"],
        ["apple", "banana", "orange", "grape"],
    ],
    }
    )

In each list, keep only strings of length 4.

Using `.list.gather()` and `pl.arg_where()`, and exploding to access `pl.col().str` methods:


In [34]:
(
    data
    .with_row_index('i')
    .with_columns(
        pl.col('string_list').list.gather(
            pl.arg_where(
                pl.col('string_list').explode().str.len_chars().eq(4))
                )
                .over('i')
            )
    .drop('i')
)

string_list
list[str]
"[""pear""]"
[]
[]
[]
[]
"[""pear""]"
[]


Without exploding: Using `list.gather()` to filter, and `list.eval()` to access `pl.element().str` methods.


In [35]:
(
    data
    .select(
        pl.col('string_list').list.gather(
            pl.col('string_list').list.eval(
                pl.arg_where(
                    pl.element().str.len_chars().eq(4)
                    )
                )
            )
        )
)

string_list
list[str]
"[""pear""]"
[]
[]
[]
[]
"[""pear""]"
[]


Explode once, and use the output twice via the walrus operator: once for `pl.col().filter()` and once for `pl.col().str`.


In [36]:
(
    data
    .with_row_index('i')
    .with_columns(
        (
            (cntr := pl.col('string_list').explode())
            .filter(cntr.str.len_chars().eq(4))
            )
        .implode()
        .over('i')
        )
    .drop('i')
)

string_list
list[str]
"[""pear""]"
[]
[]
[]
[]
"[""pear""]"
[]


Things to note:

- If you are unfamiliar with Python's (not Polars') walrus operator `:=`, see [here](https://realpython.com/python-walrus-operator/).









### Missing

I believe everything can be done with `.list.eval(pl.element().method())`.



### Set Operations

.list.set_intersection

.list.set_union

.list.set_difference

.list.set_symmetric_difference





### Exporting to Other Nested Dtypes


In [37]:
#| label: list-to-struct

(
    df_with_list
    .select(
        pl.col('ab'),
        pl.col('ab').list.to_struct().alias('ab_struct'),
        pl.col('ab').list.to_array(width=2).alias('ab_array'),
        )
)

ab,ab_struct,ab_array
list[i64],struct[2],"array[i64, 2]"
"[1, 1]","{1,1}","[1, 1]"
"[1, 2]","{1,2}","[1, 2]"
"[2, 3]","{2,3}","[2, 3]"
"[2, 4]","{2,4}","[2, 4]"
"[3, 5]","{3,5}","[3, 5]"
"[3, 6]","{3,6}","[3, 6]"


### Examples

#### ECDF {#sec-ecdf}


In [38]:
#| label: list-ecdf

(
    df_with_list
    .select(
        pl.col('ab').alias('raw'),
        pl.col('ab').list.eval(pl.element().rank()).alias('ranks'),
        pl.col('ab').list.eval(pl.element().rank().truediv(2)).alias('ecdf_1'),
        # pl.col('ab').list.eval(pl.element().rank().truediv(pl.col('ab').list.len())).alias('ecdf_2'),
    )
)

raw,ranks,ecdf_1
list[i64],list[f64],list[f64]
"[1, 1]","[1.5, 1.5]","[0.75, 0.75]"
"[1, 2]","[1.0, 2.0]","[0.5, 1.0]"
"[2, 3]","[1.0, 2.0]","[0.5, 1.0]"
"[2, 4]","[1.0, 2.0]","[0.5, 1.0]"
"[3, 5]","[1.0, 2.0]","[0.5, 1.0]"
"[3, 6]","[1.0, 2.0]","[0.5, 1.0]"


Things to note:

- Currently, `.list.eval()` cannot reference another column. See [issue](https://github.com/pola-rs/polars/issues/7210).
- `.list.eval(pl.element())` can do more than pointwise operations. Think about the `rank()` example.
- `.rank()` has a `method` argument that can be used to specify how to deal with ties.


Here is another attempt, which does not assume the length of the list is known or fixed. 
The following will not work, but may be fixed in the future. 


In [39]:
#| eval: false
(
    df_with_list
    .select(
        pl.col('ab').list.eval(pl.element().rank().truediv(pl.col('ab').list.len())).alias('ecdf_2'),
    )
)

In [40]:
#| eval: false
# Fails with a NameError: `abrank` is referenced inside the very expression that defines it.
(
    df_with_list
    .with_row_index()
    .group_by('index')
    .agg(
        abrank:= pl.col('ab').explode().rank().truediv(abrank.len()),
    )
)

NameError: name 'abrank' is not defined

#### ECDF wrt Other Column

@sec-ecdf showed how to compute the ECDF of a list. Say we want to evaluate the ECDF of column `a` wrt column `b`.

TODO


#### arg_max_horizontal

Note that there is currently no `pl.arg_max_horizontal()` function.
We will try to make one by concatenating columns into a list, and then using `.list.arg_max()`.
In particular, we want the name of the arg-max column, not its index. 


In [41]:
#| label: arg-max-horizontal

(
    df
    .with_columns(
        pl.concat_list([pl.col("a"), pl.col("b")]).list.arg_max().alias('arg_max'),
        )
    .select(
        'a','b',
        (
            pl.col('arg_max')
            .map_elements(lambda i: df.columns[i], return_dtype=pl.Utf8)
            .alias('arg_max_col_name')
        )
        
    )
)

a,b,arg_max_col_name
i64,i64,str
1,1,"""a"""
1,2,"""b"""
2,3,"""b"""
2,4,"""b"""
3,5,"""b"""
3,6,"""b"""


Things to note:

- Can you find a more efficient way to do this? Maybe using a struct?





## Exploding and Imploding {#sec-list-explode}

Explode: "explode" the list onto a column.

In [42]:
(
    df_with_list
    .select(
        pl.col('ab').list.explode().alias('ab_exploded'),
        )
)

ab_exploded
i64
1
1
1
2
2
…
4
3
5
3


Implode: "implode" the column back into a list.

In [43]:
(
    df_with_list
    .select(
        pl.col('ab').implode().alias('ab_imploded'),
        pl.col('ab').list.explode().implode().alias('ab_exploded_imploded'),
        )
)

ab_imploded,ab_exploded_imploded
list[list[i64]],list[i64]
"[[1, 1], [1, 2], … [3, 6]]","[1, 1, … 6]"


Explode within a group-by on the row index to work on each list as if it were a column.


In [44]:
(
    df_with_list
    .with_row_index()
    .group_by('index')
    .agg(
        pl.col('ab').list.explode().alias('this_is_actually_a_column'),
        )
)

index,this_is_actually_a_column
u32,list[i64]
0,"[1, 1]"
1,"[1, 2]"
2,"[2, 3]"
3,"[2, 4]"
4,"[3, 5]"
5,"[3, 6]"


There is also `df.explode()`:

In [45]:
#| label: df-explode
df_with_list.explode('ab')

a,b,ab
i64,i64,i64
1,1,1
1,1,1
1,2,1
1,2,2
2,3,2
…,…,…
2,4,4
3,5,3
3,5,5
3,6,3


### Examples 

#### Quantile

Recall there is no `.list.quantile()` method; so we need to brew our own. 


In [46]:
#| label: quantile-example

df_for_quantile = (
    df_with_list
    .select(
        pl.col('ab').list.sample(100, with_replacement=True),
    )
)

df_for_quantile.to_pandas()

Unnamed: 0,ab
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,"[2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, ..."
2,"[2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 3, ..."
3,"[2, 4, 2, 2, 4, 2, 4, 2, 4, 4, 4, 2, 2, 2, 2, ..."
4,"[3, 5, 3, 5, 3, 3, 5, 3, 5, 3, 5, 5, 3, 3, 5, ..."
5,"[3, 3, 6, 6, 3, 6, 6, 6, 6, 3, 3, 3, 3, 6, 6, ..."


In [47]:
(
    df_for_quantile
    .with_row_index()
    .group_by('index')
    .agg(
        pl.col('ab').explode().quantile(0.2)
        )
)

index,ab
u32,f64
0,1.0
1,1.0
2,2.0
3,2.0
4,3.0
5,3.0


Things to note:

1. The explode within row index is a powerful trick to apply any `pl.col()` method to a list.
2. There is another way to apply `pl.col()` methods to a list... With `.list.eval(pl.element())` as in the next example.


In [48]:
(
    df_for_quantile
    .select(
        pl.col('ab').list.eval(pl.element().quantile(0.2))
        .list.first() # to extract a float from a list
        .alias('quantile_0.2'),
    )



)

quantile_0.2
f64
1.0
1.0
2.0
2.0
3.0
3.0


## Polars Lists of Polars Lists


In [49]:
#| label: list-of-list

df_with_list_of_list = pl.DataFrame(
    {
        "a": [[[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[10, 11, 12], [13, 14, 15], [16, 17, 18]]],
        "b": ["a", "b"]
    }
)

df_with_list_of_list

a,b
list[list[i64]],str
"[[1, 2, 3], [4, 5, 6], [7, 8, 9]]","""a"""
"[[10, 11, 12], [13, 14, 15], [16, 17, 18]]","""b"""


List operations will work one level deep.


In [50]:
#| label: list-of-list-inner

(
    df_with_list_of_list
    .select(
        pl.col('a'),
        pl.col('a').list.get(0).alias('get_1'), 
        # pl.col('a').list.gather([0]).alias('gather_1'), # returns a list
        # pl.col('a').list.slice(0,1).alias('slice_1'), # returns a list
        # pl.col('a').list.gather([0, 1]).alias('gather_2'),
        # pl.col('a').list.slice(0,2).alias('slice_2'),
        # pl.col('a').list.gather_every(2).alias('gather_every_2'),
        )
)

a,get_1
list[list[i64]],list[i64]
"[[1, 2, 3], [4, 5, 6], [7, 8, 9]]","[1, 2, 3]"
"[[10, 11, 12], [13, 14, 15], [16, 17, 18]]","[10, 11, 12]"


# Polars Array {#sec-array}

Polars arrays are Polars lists with a fixed length.

## Making a Polars Array

Make a Polars array from a Python list.


In [51]:
#| label: make-array

pl.DataFrame(
    [
        pl.Series("Array_1", [[1, 3], [2, 5]]),
        pl.Series("Array_2", [[1, 7, 3], [8, 1, 0]]),
    ],
    schema={
        "Array_1": pl.Array(pl.Int64, 2),
        "Array_2": pl.Array(pl.Int64, 3),
    },
)

Array_1,Array_2
"array[i64, 2]","array[i64, 3]"
"[1, 3]","[1, 7, 3]"
"[2, 5]","[8, 1, 0]"


Things to note:

- The dtype of the array is specified in the schema. Otherwise, a Polars list would have been inferred. 


## Array Methods

Currently, list methods and array methods seem to be largely the same, although array methods live in their own `.arr` namespace. 
I expect the two sets of methods to diverge in the future, because more operations can be defined when assuming a fixed length.







# Polars Struct {#sec-struct}

Quoting [RhoSignal](https://www.rhosignal.com/posts/nested-dtypes/):
> pl.Struct type is a nested collection of columns. The pl.Struct is really just a way of having a nested namespace for columns. The underlying columns are just normal Polars Series.

Also from the [Arrow documentation](https://wesm.github.io/arrow-site-test/format/Layout.html):
> A struct is a nested type parameterized by an ordered sequence of relative types (which can all be distinct), called its fields.

So in summary:

1. The Python dict analogy is not a good one.
2. A Polars struct is a column which consists of another Polars dataframe.
3. There is no difference in performance between a Polars struct and a set of columns (unless an entire field is null. See the [Arrow documentation](https://wesm.github.io/arrow-site-test/format/Layout.html)).




##  Making a Polars Struct

Make a Polars struct from a Python dict.


In [52]:
#| label: make-struct

pl.DataFrame(
    {
        "a": [{"x": 1, "y": 2}, {"x": 3, "y": 4}, {"x": 5, "y": 6}],
        "b": ["a", "b", "c"]
    }
)

a,b
struct[2],str
"{1,2}","""a"""
"{3,4}","""b"""
"{5,6}","""c"""


What are the differences between a Polars struct and a Python dict? 

1. The Polars struct must have the same keys (called `fields`) in all rows. 
2. A Polars struct is a first-class citizen in the Polars API, with its own methods and functions.
3. Besides these, I am still figuring it out. 

What are the differences between a Polars struct and a Polars list?

1. A Polars struct has named elements, while a Polars list has unnamed elements.
2. Because the struct has the same fields in all rows, it must have the same length in all rows. This is not the case for a Polars list.

More often you will not make a Polars struct from Python dicts, but rather you will create one from a Polars DataFrame:

1. By directly creating a struct.
2. As the output of some operation, like `pl.Expr().value_counts()`, or all the horizontal cumulators like `pl.cum_sum_horizontal()`, `pl.cum_reduce()`, `pl.cum_fold()`, ...
3. From a Polars list. 
4. By splitting a string to a known length. 



### Making a Polars Struct Directly


In [53]:
#| label: make-struct-directly
df_with_struct = df.select(pl.struct(['a','b']).alias('struct'))
df_with_struct

struct
struct[2]
"{1,1}"
"{1,2}"
"{2,3}"
"{2,4}"
"{3,5}"
"{3,6}"


Verify that the column is a struct:


In [54]:
#| label: verify-struct
df_with_struct.schema

OrderedDict([('struct', Struct({'a': Int64, 'b': Int64}))])

Alternative constructors:


In [55]:
df.select(pl.struct(aaa=pl.col('a'), bbb=pl.col('b'))).schema

OrderedDict([('aaa', Struct({'aaa': Int64, 'bbb': Int64}))])

### Struct As Output


In [56]:
df.select(pl.col('a').value_counts())
# df.select(pl.col('a').value_counts()).schema

a
struct[2]
"{2,2}"
"{3,2}"
"{1,2}"


### Struct From List


In [57]:
#| label: struct-from-list
df_with_list.select(pl.col('ab').list.to_struct())

ab
struct[2]
"{1,1}"
"{1,2}"
"{2,3}"
"{2,4}"
"{3,5}"
"{3,6}"


### Split a String

TODO: str.split_exact(), .str.splitn()




## Struct Methods


### Extracting Elements


In [58]:
#| label: struct-get
(
    df_with_struct
    .select(
        pl.col('struct').struct.field('a'),
        pl.col('struct').struct.field('b'),
        )
)

a,b
i64,i64
1,1
1,2
2,3
2,4
3,5
3,6


Encode the struct as JSON text:


In [59]:
#| label: struct-to-json
(
    df_with_struct
    .select(
        pl.col('struct').struct.json_encode()
        )
)

struct
str
"""{""a"":1,""b"":1}"""
"""{""a"":1,""b"":2}"""
"""{""a"":2,""b"":3}"""
"""{""a"":2,""b"":4}"""
"""{""a"":3,""b"":5}"""
"""{""a"":3,""b"":6}"""


## Struct to List

There is no `.struct.to_list()` method. 
This makes sense, since a struct does not require all fields to have the same dtype. 
In the case where all fields have the same dtype, you can use the following.


In [60]:
#| label: struct-to-list
(
    df_with_struct
    .unnest('struct')
    .select(pl.concat_list(pl.all()))
)

a
list[i64]
"[1, 1]"
"[1, 2]"
"[2, 3]"
"[2, 4]"
"[3, 5]"
"[3, 6]"


You can also consider element-by-element extractions.


In [61]:
#| label: struct-to-list-element-by-element
(
    df_with_struct
    .select(
        pl.concat_list([
            pl.col('struct').struct.field('a'), 
            pl.col('struct').struct.field('b')
            ])
        )
)

a
list[i64]
"[1, 1]"
"[1, 2]"
"[2, 3]"
"[2, 4]"
"[3, 5]"
"[3, 6]"


## Examples

### Splitting Strings



### Verifying Multi-Column Uniques (hashing)

The `pl.Expr.unique()` method does not accept multiple columns.
We can use a struct to group the columns, and then apply the single-column `.unique()` method.


TODO




# Discussion


### Q: Can I have a list of list of lists (i.e. more than 2 layers of nesting)?  

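Yes. A Polars list can hold lists of lists, to any depth, provided that all elements have the same dtype. Note, however, that list methods operate one level deep. 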

### Q: Can I use list methods on a struct (e.g. arg_max)?  

No. But you can convert the struct to a list, and then use list methods.


### Q: When a Polars List and when a set of columns?

The difference is mostly syntactic, so it is a matter of preference and convenience. See the following example. The same goes for Polars arrays. 


In [62]:
# make a frame with lists of random length

max_length = int(1e6)

def make_list():
    
    result = pl.Series(list(string.ascii_letters)).sample(random.randint(1, max_length), with_replacement=True)
    
    return result

# make_list()

In [63]:
df = pl.DataFrame(
    {
        "a": [make_list(), make_list(), make_list()],
        "b": ["a", "b", "c"]
    }
)

In [64]:
df.estimated_size(unit='mb')

1.7867975234985352

In [65]:
df_2 = df.select(pl.col('a').list.to_struct(n_field_strategy='max_width')).unnest("a")
df_2.estimated_size(unit='mb')

1.7867717742919922

In [66]:
# %timeit df_2.select(pl.col('a').str.len_chars())

### Q: Can I export to CSV? To Parquet? 


In [67]:
#| eval: false
# build a file path inside the system's temporary directory
import tempfile
temp_file = os.path.join(tempfile.gettempdir(), 'something.csv')

df_with_list.write_csv(temp_file)

What is the dtype when I export to Pandas?

In [68]:
#| label: export-to-pandas

df_with_list_pandas = df_with_list.to_pandas()
df_with_list_pandas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       6 non-null      int64 
 1   b       6 non-null      int64 
 2   ab      6 non-null      object
dtypes: int64(2), object(1)
memory usage: 276.0+ bytes
