# Lazy mode 1: Introducing lazy mode
By the end of this lecture you will be able to:
- create a `LazyFrame` from a CSV file
- explain the difference between a `DataFrame` and a `LazyFrame`
- print the optimized query plan

Lazy mode is crucial to taking full advantage of Polars because it enables query optimisation and streaming of large datasets. We introduce lazy mode in this lecture and revisit it throughout the course.

## Code or queries?
Data analysis often involves multiple steps:
- loading data from a file or database
- transforming the data
- grouping by a column
- ...

We call the set of steps a **query**.

We can write some lines of code that carry out a query step-by-step in eager mode.

There are two problems with this approach:
- Each line of code is not aware of what the others are doing.
- Each line of code materialises a full intermediate dataframe, even when later steps need only part of it.

We can instead write the steps as an integrated query in lazy mode.

With an integrated query:
- a query optimizer can identify efficiencies
- a query engine can minimise the memory usage and produce a single output

## So what are eager and lazy modes?

**Eager mode**: each line of code is run as soon as it is encountered.

**Lazy mode**: each line is added to a query plan and the query plan is optimized.

In [179]:
import polars as pl

In [21]:
csv_file = "../data/titanic.csv"

## `DataFrames` and `LazyFrames`
We **read** a CSV in eager mode with `pl.read_csv`. This creates a **`DataFrame`**.

In [162]:
df_eager = pl.read_csv(csv_file)
df_eager.head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


We **scan** a CSV in lazy mode with `pl.scan_csv`. This creates a **`LazyFrame`**.

In [163]:
df_lazy = pl.scan_csv(csv_file)
df_lazy

When we scan a CSV, Polars:
- opens the file
- reads the column names from the header row
- infers the type of each column from the first 100 rows (the default)

A `LazyFrame` is really a **query plan** - a plan for how Polars will transform your data.

We transform a `LazyFrame` into a `DataFrame` by calling `collect` on it. This processes your data according to the query plan.

In [164]:
(
    df_lazy
    .head(3)
    .collect()
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


We can get the schema of a `LazyFrame` (a mapping from column names to dtypes) by calling `schema`.

In [165]:
df_lazy.schema


Schema([('PassengerId', Int64),
        ('Survived', Int64),
        ('Pclass', Int64),
        ('Name', String),
        ('Sex', String),
        ('Age', Float64),
        ('SibSp', Int64),
        ('Parch', Int64),
        ('Ticket', String),
        ('Fare', Float64),
        ('Cabin', String),
        ('Embarked', String)])

However, we get a warning that getting the schema of a `LazyFrame` may be an expensive operation as it requires Polars to work through the logic of the query plan to see what the final columns and dtypes would be. 

Don't get *too* worried by the word "expensive" here - for a simple `LazyFrame` this might only take 1 millisecond. By "expensive" the Polars devs really mean that it takes some computation to arrive at the schema and that the time taken for this computation will grow as the length of the query plan grows. 

If you have a long and complicated query plan - imagine you are ingesting many files and doing lots of joins, concats and aggregations for example - then you might start to notice how long this takes.

The preferred way to get the schema of a `LazyFrame` is with `collect_schema`, which is what `.schema` calls internally.

In [166]:
(
    df_lazy
    .collect_schema()
)

Schema([('PassengerId', Int64),
        ('Survived', Int64),
        ('Pclass', Int64),
        ('Name', String),
        ('Sex', String),
        ('Age', Float64),
        ('SibSp', Int64),
        ('Parch', Int64),
        ('Ticket', String),
        ('Fare', Float64),
        ('Cabin', String),
        ('Embarked', String)])

Calculating the schema with `collect_schema` is still much faster than evaluating the full query with `collect` (see below) as `collect_schema` does not process your data, it just runs through the optimised query plan.

We cannot get the shape of the `LazyFrame` for free because Polars does not know how many rows a CSV contains until it has read the file. If we want the length, we have to trigger some evaluation of a query.

In [167]:
(
    df_lazy
    .select(
        pl.len()
    )
    .collect()
)

len
u32
891


In this query Polars loads one column from the CSV and counts how long it is with `pl.len`.

We learn more about evaluating a lazy query by calling `collect` in the next lecture.

### Creating a LazyFrame from data
We can also directly create a `LazyFrame` from a constructor with some data

In [168]:
pl.LazyFrame({"values":[0,1,2]})

Or we can call `.lazy` on a `DataFrame`

In [169]:
pl.DataFrame({"values":[0,1,2]}).lazy()

### What's the difference between a `DataFrame` and a `LazyFrame`?

If we print a `DataFrame` we see data...

In [170]:
df_eager.head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


...but if we print a `LazyFrame` we see a **query plan**

**Key message: a method on a `DataFrame` acts on the data. A method on a `LazyFrame` acts on the query plan**.

## Operations on a `DataFrame` and a `LazyFrame` 
To show the difference between operations on a `DataFrame` and a `LazyFrame` we rename the `PassengerId` column to `Id` using `rename`.

On a `DataFrame` we see the first column is renamed...

In [171]:
(
    df_eager
    .rename({"PassengerId":"Id"})
    .head(2)
)    

Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


while on a `LazyFrame` we see that a `RENAME` step is added to the query plan

In [172]:
(
    df_lazy
    .rename({"PassengerId":"Id"})
)    

## Chaining or re-assigning?
In this course we typically run operations with method chaining like this

In [173]:
(
    pl.scan_csv(csv_file)
    .rename({"PassengerId":"Id"})
)    

However, we can also do operations by re-assigning the variable in each step

In [174]:
df_lazy = pl.scan_csv(csv_file)
df_lazy = df_lazy.rename({"PassengerId":"Id"})

The two methods are equivalent when working with `DataFrames` or `LazyFrames`.

## Query optimisation
Polars creates a *naive query plan* from your query. This means a query plan with no optimisations.

Polars passes the naive query plan to its **query optimizer**. The query optimizer looks for more efficient ways to arrive at the output you want.

The `explain` method returns the optimized plan as a string. We wrap it in `print` so that the line breaks are displayed correctly.

In [175]:
print(
    pl.scan_csv(csv_file)
    .explain()
)

Csv SCAN [../data/titanic.csv]
PROJECT */12 COLUMNS


In this simple case the query plan shows that we:
- scan the CSV file
- select all 12 of the columns (*/12*)

When the query is evaluated, the output is a `DataFrame`.

## What query optimizations are applied?
Query optimizations aren't magic. Most optimizations could be implemented by users in a well-written query if the user:
- knows the optimization exists 
- remembers to implement the optimization and 
- implements the optimization correctly!

Optimizations applied by Polars include:
- `projection pushdown`: limit the number of columns read to those required
- `predicate pushdown`: apply filter conditions as early as possible
- `combine predicates`: combine multiple filter conditions
- `slice pushdown`: limit the rows processed when only a limited number of rows is required
- `common subplan elimination`: run duplicated transformations on the same data once and then re-use the result
- `common subexpression elimination`: cache duplicated expressions and re-use the result

We see how most of these optimisations arise later in the course.

### Common subexpression elimination
We see how the common subexpression elimination optimisation works here. Polars identifies where the same expression is calculated more than once, computes it once and re-uses the cached output.

In this example we have a lazy query where we scan the Titanic CSV file. We then:
- use `select` to output a set of new columns
- create a first expression which has the mean age minus one standard deviation
- create a second expression which has the mean age
- create a third expression which has the mean age plus one standard deviation
- evaluate the query with `.collect`

In [176]:
(
    pl.scan_csv(csv_file)
    .select(
        (pl.col("Age").mean() - pl.col("Age").std()).alias("minus_one_std"),
        pl.col("Age").mean().alias("mean"),
        (pl.col("Age").mean() + pl.col("Age").std()).alias("plus_one_std"),
    )
    .collect()
)              

minus_one_std,mean,plus_one_std
f64,f64,f64
15.17262,29.699118,44.225615


In this query we use the `pl.col("Age").mean()` and `pl.col("Age").std()` expressions repeatedly. If we print the optimised query plan with `.explain` we can see that Polars applies the common subexpression elimination optimisation.

In [177]:
print(
    pl.scan_csv(csv_file)
    .select(
        (pl.col("Age").mean() - pl.col("Age").std()).alias("minus_one_std"),
        pl.col("Age").mean().alias("mean"),
        (pl.col("Age").mean() + pl.col("Age").std()).alias("plus_one_std"),
    )
    .explain()
)

 SELECT [[(col("__POLARS_CSER_0x73b29b6cae631f75")) - (col("__POLARS_CSER_0x92272a4df0f11131"))].alias("minus_one_std"), col("__POLARS_CSER_0x73b29b6cae631f75").alias("mean"), [(col("__POLARS_CSER_0x73b29b6cae631f75")) + (col("__POLARS_CSER_0x92272a4df0f11131"))].alias("plus_one_std")] FROM
   WITH_COLUMNS:
   [col("Age").std().alias("__POLARS_CSER_0x92272a4df0f11131"), col("Age").mean().alias("__POLARS_CSER_0x73b29b6cae631f75")] 
    Csv SCAN [../data/titanic.csv]
    PROJECT 1/12 COLUMNS


This query plan has two blocks separated by `FROM`.

Within the upper `SELECT` block the repeated expressions are replaced by references to columns with names of the form `__POLARS_CSER_X`, with one name for the mean expression and one for the standard deviation expression. The lower `WITH_COLUMNS` block calculates each of these columns once. This shows that Polars has identified the shared sub-expressions across the three expressions in the `SELECT` block.

Polars also implements other optimisations, such as fast-path algorithms on sorted data, that are separate from the query optimiser. We learn more about these later in the course.

## Exercises

In the exercises you will develop your understanding of:
- creating a `LazyFrame` from a CSV file
- getting metadata from a `LazyFrame`
- printing the query plans

### Exercise 1
Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [181]:
df = pl.scan_csv(csv_file)

Check to see which of the following metadata you can get from a `LazyFrame`:
- number of rows
- column names
- schema

Create a lazy query where you scan the Titanic CSV file and then select the `Name` and `Age` columns.

In [188]:
(
    pl.scan_csv(csv_file)
    .select(
        pl.col("Name", "Age")
    )
)

Print out the optimised query plan for this query

## Solutions

### Solution to Exercise 1

Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [189]:
df = pl.scan_csv(csv_file)

A `LazyFrame` does not know the number of rows in a CSV

In [190]:
df.shape

AttributeError: 'LazyFrame' object has no attribute 'shape'

A `LazyFrame` does know the column names. As we will see in the I/O section, Polars reads the first row of the CSV file to get the column names in `pl.scan_csv`.

In [186]:
df.columns


['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [187]:
df.schema


Schema([('PassengerId', Int64),
        ('Survived', Int64),
        ('Pclass', Int64),
        ('Name', String),
        ('Sex', String),
        ('Age', Float64),
        ('SibSp', Int64),
        ('Parch', Int64),
        ('Ticket', String),
        ('Fare', Float64),
        ('Cabin', String),
        ('Embarked', String)])

Create a lazy query where you scan the Titanic CSV file and then select the `Name` and `Age` columns.

In [191]:
(
    pl.scan_csv(csv_file)
    .select("Name","Age")
)   

Print out the optimised query plan for this query

In [192]:
print(
    pl.scan_csv(csv_file)
    .select("Name","Age")
    .explain()
)   

Csv SCAN [../data/titanic.csv]
PROJECT 2/12 COLUMNS
