# Polars quickstart
 
To help you get started this notebook introduces some of the key concepts that make Polars a powerful data analysis tool.

The key concepts we meet are:
- fast flexible analysis with the Expression API in Polars
- easy parallel computations
- automatic query optimisation in lazy mode
- streaming to work with larger-than-memory datasets in Polars

## Stay in touch
I post a lot of material about Polars on social media and my blog. Stay in touch by:
- connecting with me on LinkedIn https://www.linkedin.com/in/liam-brannigan-9080b214a/
- following me on Twitter https://twitter.com/braaannigan
- reading my blog posts https://www.rhosignal.com/
- watching my YouTube channel https://www.youtube.com/channel/UC-J3uR0g7CxCSnx0YFE6R_g/

Send a message to say hi if you are coming from the course! 

## Importing Polars
We begin by importing Polars as `pl`. Following this convention lets you work with examples from the official documentation without changes.

In [1]:
import polars as pl

## Setting configuration options
We want to control how many rows of a `DataFrame` are printed to the screen. Polars lets us set configuration options with methods in the `pl.Config` namespace.

In this notebook we want Polars to print 6 rows of a `DataFrame`, so we use `pl.Config.set_tbl_rows`

In [2]:
pl.Config.set_tbl_rows(6)

polars.config.Config

You can see the full range of configuration options here: https://pola-rs.github.io/polars/py-polars/html/reference/config.html

In the course we see how to apply the right configuration options in a range of contexts.

## Input data
Polars can read from a wide range of data formats including CSV, Parquet, Arrow, JSON, Excel and database connections. We cover all of these in the course.

For this introduction we use a CSV with the Titanic passenger dataset. This dataset gives details of all the passengers on the Titanic and whether they survived.

We begin by setting the path to this CSV

In [3]:
csv_file = "data_titanic.csv"

We read the CSV into a Polars `DataFrame` with the `read_csv` function. 

We then call `head` to print out the first few rows of the `DataFrame`

In [4]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


Each row of the `DataFrame` has details about a passenger on the Titanic including the class they travelled in (`Pclass`), their name (`Name`) and `Age`.

Alternatively we can use `glimpse` to see the first data points arranged vertically. I use this regularly for dataframes with a lot of columns

In [5]:
# glimpse prints its summary directly, so no print call is needed
df.glimpse()

Rows: 891
Columns: 12
$ PassengerId <i64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ Survived    <i64> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1
$ Pclass      <i64> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2
$ Name        <str> 'Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'Heikkinen, Miss. Laina', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'Allen, Mr. William Henry', 'Moran, Mr. James', 'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard', 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)', 'Nasser, Mrs. Nicholas (Adele Achem)'
$ Sex         <str> 'male', 'female', 'female', 'female', 'male', 'male', 'male', 'male', 'female', 'female'
$ Age         <f64> 22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0, 2.0, 27.0, 14.0
$ SibSp       <i64> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1
$ Parch       <i64> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0
$ Ticket      <str> 'A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450', '330877', '17463', '349909', '347742', '237736'
$ Fare        <f64> 7.25, 71.2833, 7

## Expressions
You can use square brackets to select rows and columns in Polars...

In [6]:
df[:3,["Pclass","Name","Age"]]

Pclass,Name,Age
i64,str,f64
3,"""Braund, Mr. Ow…",22.0
1,"""Cumings, Mrs. …",38.0
3,"""Heikkinen, Mis…",26.0


...but using this square bracket approach means that you don't get all the benefits of parallelisation and query optimisation.

> We learn more about square bracket indexing in Section 3 of the course.

To really take advantage of Polars we use the Expression API.

### Selecting and transforming columns with the Expression API

We see a simple example of the Expression API here, where we select the `Pclass`, `Name` and `Age` columns inside a `select` statement, converting the names to lowercase and rounding the ages as we go (we learn much more about the `select` statement in Section 3)

In [8]:
(
    df
    .select(
        [
            pl.col("Pclass"),
            pl.col("Name").str.to_lowercase(),
            pl.col("Age").round(2),
        ]
    )
)

Pclass,Name,Age
i64,str,f64
3,"""braund, mr. ow…",22.0
1,"""cumings, mrs. …",38.0
3,"""heikkinen, mis…",26.0
…,…,…
3,"""johnston, miss…",
1,"""behr, mr. karl…",26.0
3,"""dooley, mr. pa…",32.0


In the Expression API we use `pl.col` to refer to a column.

We would like the strings in the `Name` column to be printed wider. We can do this with

In [None]:
pl.Config.set_fmt_str_lengths(100)

> We learn more about the `pl.Config` namespace for configuring how Polars looks and behaves in a lecture later in this Section.

### What is an expression?

An expression is a function that takes a `Series` (or column in a `DataFrame`) in and returns a `Series` (or column in a `DataFrame`).

Expressions are the core building blocks of data transformations and include:
- the identity expression where the output is the same as the input
- arithmetic where we add/multiply/etc all elements of a `Series`
- rounding off all elements of a `Series`
- converting all strings in a `Series` to uppercase
- extracting the date from all elements of a datetime `Series`
- and so on

In this example we select the same three columns, with a comment labelling each expression. We:
- leave `Pclass` unchanged (the identity expression)
- convert the names to lowercase and
- round off the age to 2 decimal places

In [9]:
(
    df
    .select(
        [
            # Identity expression
            pl.col("Pclass"),
            # Names to lowercase
            pl.col("Name").str.to_lowercase(),
            # Round the ages
            pl.col("Age").round(2)
        ]
    )
)

Pclass,Name,Age
i64,str,f64
3,"""braund, mr. ow…",22.0
1,"""cumings, mrs. …",38.0
3,"""heikkinen, mis…",26.0
…,…,…
3,"""johnston, miss…",
1,"""behr, mr. karl…",26.0
3,"""dooley, mr. pa…",32.0


When we have multiple expressions like this Polars runs them in parallel.

Some expressions return a shorter `Series`, such as `head`, which returns the first rows, or aggregating expressions such as `mean`, which returns the average of the values in a `Series`. Others return a longer `Series`, such as `explode`, which converts each element of a list `Series` into its own row.

> We learn much more about expressions in Section 3 of the course.

### Method chaining and code formatting
In the cell above the code is wrapped in parentheses `()`. In Python (rather than Polars in particular), wrapping code in parentheses lets us call a new method, in this case `select`, on a new line.

In Polars we often build queries in multiple steps with multiple calls to new methods. I find it is much easier to read a series of queries if each method starts on a new line, so I will generally wrap code blocks in parentheses.

### Expression chaining

As well as chaining methods we can chain expressions together to do more transformations in a single step. 

In this example we return three columns:
- the original `Name` column
- the `Name` column split into a list of words
- the count of the number of words when the `Name` column is split into a list of words

Column names in a Polars `DataFrame` are always strings and must be unique. We use the `alias` method at the end of the second and third expressions so we do not end up with multiple columns called `Name`

In [10]:
(
    df
    .select(
        [
            # Get the Name column without changes
            pl.col("Name"),
            # Take the Name column and split it into a list of separate words
            pl.col("Name").str.split(" ").alias("Name_split"),
            # Take the Name column, split it into a list of separate words and count the number of words
            pl.col("Name").str.split(" ").list.len().alias("Name_word_count"),
        ]
    )
)

Name,Name_split,Name_word_count
str,list[str],u32
"""Braund, Mr. Ow…","[""Braund,"", ""Mr."", … ""Harris""]",4
"""Cumings, Mrs. …","[""Cumings,"", ""Mrs."", … ""Thayer)""]",7
"""Heikkinen, Mis…","[""Heikkinen,"", ""Miss."", ""Laina""]",3
…,…,…
"""Johnston, Miss…","[""Johnston,"", ""Miss."", … """"Carrie""""]",5
"""Behr, Mr. Karl…","[""Behr,"", ""Mr."", … ""Howell""]",4
"""Dooley, Mr. Pa…","[""Dooley,"", ""Mr."", ""Patrick""]",3


In [11]:
df.select([pl.col("Name"),
           pl.col("Name").str.split(" ").alias("Name_split"),
           pl.col("Name").str.split(" ").list.len().alias("Name_word_count"),])

Name,Name_split,Name_word_count
str,list[str],u32
"""Braund, Mr. Ow…","[""Braund,"", ""Mr."", … ""Harris""]",4
"""Cumings, Mrs. …","[""Cumings,"", ""Mrs."", … ""Thayer)""]",7
"""Heikkinen, Mis…","[""Heikkinen,"", ""Miss."", ""Laina""]",3
…,…,…
"""Johnston, Miss…","[""Johnston,"", ""Miss."", … """"Carrie""""]",5
"""Behr, Mr. Karl…","[""Behr,"", ""Mr."", … ""Howell""]",4
"""Dooley, Mr. Pa…","[""Dooley,"", ""Mr."", ""Patrick""]",3


We look at expressions in detail throughout the course to find the right expression for many different scenarios.

Expressions can seem verbose, but they also allow us to select groups of columns in one go. For example, to select all the integer columns we can pass `pl.INTEGER_DTYPES` to `pl.col`

In [12]:
(
    df
    .select(
        pl.col(pl.INTEGER_DTYPES)
    )
    .head(3)
)

PassengerId,Survived,Pclass,SibSp,Parch
i64,i64,i64,i64,i64
1,0,3,1,0
2,1,1,1,0
3,1,3,0,0


> We meet other ways to quickly select multiple columns in Section 3.

### Filtering a `DataFrame` with the Expression API

We filter a `DataFrame` by applying a condition to an expression.

In this example we find all the passengers over 70 years of age

In [13]:
(
    df
    .filter(
        pl.col("Age") > 70
    )
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
97,0,1,"""Goldschmidt, M…","""male""",71.0,0,0,"""PC 17754""",34.6542,"""A5""","""C"""
117,0,3,"""Connors, Mr. P…","""male""",70.5,0,0,"""370369""",7.75,,"""Q"""
494,0,1,"""Artagaveytia, …","""male""",71.0,0,0,"""PC 17609""",49.5042,,"""C"""
631,1,1,"""Barkworth, Mr.…","""male""",80.0,0,0,"""27042""",30.0,"""A23""","""S"""
852,0,3,"""Svensson, Mr. …","""male""",74.0,0,0,"""347060""",7.775,,"""S"""


The Expression API is not limited to selecting and filtering. It is at the heart of all data transformations in Polars, as we see below.

> We learn more about applying filter conditions in Section 2 of the course.

## Analytics
Polars has a wide range of functionality for analysing data. In the course we look at a wider range of analytic methods and how we can use expressions to write more complicated analysis in a concise way.

We begin by getting an overview of the `DataFrame` with `describe`

In [14]:
df.describe()

statistic,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
str,f64,f64,f64,str,str,f64,f64,f64,str,f64,str,str
"""count""",891.0,891.0,891.0,"""891""","""891""",714.0,891.0,891.0,"""891""",891.0,"""204""","""889"""
"""null_count""",0.0,0.0,0.0,"""0""","""0""",177.0,0.0,0.0,"""0""",0.0,"""687""","""2"""
"""mean""",446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
…,…,…,…,…,…,…,…,…,…,…,…,…
"""50%""",446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
"""75%""",669.0,1.0,3.0,,,38.0,1.0,0.0,,31.0,,
"""max""",891.0,1.0,3.0,"""van Melkebeke,…","""male""",80.0,8.0,6.0,"""WE/P 5735""",512.3292,"""T""","""S"""


The output of `describe` shows us how many records there are, how many `null` values and some key statistics. The `null_count` has helped me identify emerging data quality issues in my machine learning pipelines.

### Value counts on a column
We use `value_counts` to count occurrences of values in a column.

In this example we count how many passengers there are in each class with `value_counts`

In [15]:
df["Pclass"].value_counts()

Pclass,count
i64,u32
1,216
2,184
3,491


### Groupby and aggregations
Polars has a fast parallel algorithm for `group_by` operations. 

Here we first group by the `Survived` and the `Pclass` columns. We then aggregate in `agg` by counting the number of passengers in each group

In [16]:
(
    df
    .group_by(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("count")
    )
)

Survived,Pclass,count
i64,i64,u32
1,1,136
1,2,87
0,1,80
0,2,97
0,3,372
1,3,119


We use the Expression API for each aggregation in `agg`.

Groupby operations in Polars are fast because Polars has a parallel algorithm for getting the groupby keys. Aggregations are also fast because Polars runs multiple expressions in `agg` in parallel.

### Window operations
Window operations occur when we want to add a column that reflects not just data from that row but from a related group of rows. Windows occur in many contexts including rolling or temporal statistics and Polars covers these use cases.

Another kind of window operation is when we want each row to carry a statistic for the group of rows it belongs to. We use the `over` expression for this (equivalent to `groupby-transform` in Pandas).

In this example we add a column with the maximum age of the passengers in each class. To add a column we use an expression inside the `with_columns` method (we see much more of this method in Section 2). In the expression we calculate the maximum `Age` and use `over` to specify that we want that maximum calculated by passenger class

In [20]:
(
    df
    .with_columns(
        pl.col("Age").max().over("Pclass").alias("MaxAge")
    )
    .select("Pclass","Name","Age","MaxAge")
    .head(10)
)

Pclass,Name,Age,MaxAge
i64,str,f64,f64
3,"""Braund, Mr. Ow…",22.0,74.0
1,"""Cumings, Mrs. …",38.0,80.0
3,"""Heikkinen, Mis…",26.0,74.0
…,…,…,…
3,"""Palsson, Maste…",2.0,74.0
3,"""Johnson, Mrs. …",27.0,74.0
2,"""Nasser, Mrs. N…",14.0,70.0


In [18]:
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""
…,…,…,…,…,…,…,…,…,…,…,…
889,0,3,"""Johnston, Miss…","""female""",,1,2,"""W./C. 6607""",23.45,,"""S"""
890,1,1,"""Behr, Mr. Karl…","""male""",26.0,0,0,"""111369""",30.0,"""C148""","""C"""
891,0,3,"""Dooley, Mr. Pa…","""male""",32.0,0,0,"""370376""",7.75,,"""Q"""


> We learn more about grouping and aggregations in Section 4 of the course.

### Visualisation

We can use popular plotting libraries like Matplotlib, Seaborn, Altair and Plotly directly with Polars.

In this example we create a scatter plot of age and fare with Altair (version 5+ of Altair)

In [None]:
import altair as alt
alt.Chart(
    df,
    title="Scatter plot of Age and Fare"
).mark_circle().encode(
    x="Age:Q",
    y="Fare:Q"
)

> We see how to work with Matplotlib, Seaborn, Altair and Plotly in the visualisation lecture in this Section.

## Lazy mode and query optimisation
In the examples above we work in eager mode. In eager mode Polars runs each part of a query step-by-step.

Polars has a powerful feature called lazy mode. In this mode Polars looks at a query as a whole to make a query graph. Before running the query Polars passes the query graph through its query optimiser to see if there are ways to make the query faster.

When working with a CSV we can switch from eager mode to lazy mode by replacing `read_csv` with `scan_csv`

In [21]:
(
    pl.scan_csv(csv_file)
    .group_by(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("count")
    )
)

The output of a lazy query is a `LazyFrame`, and we see the unoptimised query plan when we output a `LazyFrame`.

### Query optimiser
We can see the optimised query plan that Polars will actually run by adding `explain` at the end of the query

In [22]:
print(
    pl.scan_csv(csv_file)
    .group_by(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("count")
    )
    .explain()
)

AGGREGATE
	[col("PassengerId").count().alias("count")] BY [col("Survived"), col("Pclass")] FROM

    Csv SCAN data_titanic.csv
    PROJECT 3/12 COLUMNS


In this example Polars has identified an optimisation:
```python
PROJECT 3/12 COLUMNS
```
There are 12 columns in the CSV, but the query optimiser sees that only 3 of these columns are required for the query. When the query is evaluated Polars will `PROJECT` 3 out of 12 columns: Polars will only read the 3 required columns from the CSV. This projection saves memory and computation time.

A different optimisation happens when we apply a `filter` to a query. In this case we want the same analysis of survival by class but only for passengers over 50

In [23]:
print(
    pl.scan_csv(csv_file)
    .filter(pl.col("Age") > 50)
    .group_by(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("count")
    )
    .explain()
)

AGGREGATE
	[col("PassengerId").count().alias("count")] BY [col("Survived"), col("Pclass")] FROM

    Csv SCAN data_titanic.csv
    PROJECT 4/12 COLUMNS
    SELECTION: [(col("Age")) > (50.0)]


In this example the query optimiser has seen that:
- 4 out of 12 columns are now required `PROJECT 4/12 COLUMNS` and
- only passengers over 50 should be selected `SELECTION: [(col("Age")) > (50.0)]`

These optimisations are applied as Polars reads the CSV file, so the whole dataset does not need to be read into memory.

### Query evaluation

To evaluate the full query and output a `DataFrame` we call `collect` 

In [24]:
(
    pl.scan_csv(csv_file)
    .filter(pl.col("Age") > 50)
    .group_by(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("count")
    )
    .collect()
)

Survived,Pclass,count
i64,i64,u32
1,1,18
1,3,1
0,1,21
0,3,9
0,2,12
1,2,3


We learn more about lazy mode and evaluating queries in this section of the course.

## Streaming larger-than-memory datasets
By default Polars reads your full dataset into memory when evaluating a lazy query. However, if your dataset is too large to fit into memory Polars can run many operations in *streaming* mode. With streaming Polars processes your query in batches rather than all at once.

To enable streaming we pass the `streaming = True` argument to `collect`

In [25]:
(
    pl.scan_csv(csv_file)
    .filter(pl.col("Age") > 50)
    .group_by(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("count")
    )
    .collect(streaming = True)
)

Survived,Pclass,count
i64,i64,u32
0,3,9
0,1,21
1,2,3
0,2,12
1,3,1
1,1,18


In the course we look at which queries can run in streaming mode (see the Streaming CSV lecture in the I/O section for more detail).

## Summary
This notebook has been a quick overview of the key ideas that make Polars a powerful data analysis tool:
- expressions allow us to write complex transformations concisely and run them in parallel
- lazy mode allows Polars to apply query optimisations that reduce memory usage and computation time
- streaming lets us process larger-than-memory datasets with Polars