# Lazy mode 1: Introducing lazy mode
By the end of this lecture you will be able to:
- create a `LazyFrame` from a CSV file
- explain the difference between a `DataFrame` and a `LazyFrame`
- print the optimized query plan

Lazy mode is crucial to taking full advantage of Polars: it enables query optimisation and streaming of large datasets. We introduce lazy mode in this lesson and revisit it throughout the course.

## Code or queries?
Data analysis often involves multiple steps:
- loading data from a file or database
- transforming the data
- grouping by a column
- ...

We call this set of steps a **query**.

We can write some lines of code that carry out a query step-by-step in eager mode.

There are two problems with this approach:
- Each line of code is not aware of what the others are doing.
- Each line of code produces a full intermediate dataframe, which can mean extra copies of the data.

We can instead write the steps as an integrated query in lazy mode.

With an integrated query:
- a query optimizer can identify efficiencies
- a query engine can minimise the memory usage and produce a single output

## So what are eager and lazy modes?

**Eager mode**: each line of code is run as soon as it is encountered.

**Lazy mode**: each line is added to a query plan and the query plan is optimized.

In [1]:
import polars as pl

In [2]:
csv_file = "../notebooks/data/titanic.csv"

## `DataFrames` and `LazyFrames`
We **read** a CSV in eager mode with `pl.read_csv`. This creates a **`DataFrame`**.

In [3]:
df_eager = pl.read_csv(csv_file)
df_eager.head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


We **scan** a CSV in lazy mode with `pl.scan_csv`. This creates a **`LazyFrame`**.

In [4]:
df_lazy = pl.scan_csv(csv_file)
df_lazy

When we scan a CSV, Polars:
- opens the file
- reads the column names from the header row
- infers the type of each column from the first 100 rows (the default)

We can get the dtype schema of a `LazyFrame`. This is a mapping from column names to dtypes.

In [5]:
df_lazy.schema

{'PassengerId': Int64,
 'Survived': Int64,
 'Pclass': Int64,
 'Name': Utf8,
 'Sex': Utf8,
 'Age': Float64,
 'SibSp': Int64,
 'Parch': Int64,
 'Ticket': Utf8,
 'Fare': Float64,
 'Cabin': Utf8,
 'Embarked': Utf8}

We cannot get the shape of a `LazyFrame` as Polars does not know how many rows there are from a scan.

### What's the difference between a `DataFrame` and a `LazyFrame`?

If we print a `DataFrame` we see data...

In [6]:
df_eager.head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


...but if we print a `LazyFrame` we see a **query plan**.

**Key message: an operation on a `DataFrame` acts on the data. An operation on a `LazyFrame` acts on the query plan**.

## Operations on a `DataFrame` and a `LazyFrame` 
To show the difference between operations on a `DataFrame` and a `LazyFrame` we rename the `PassengerId` column to `Id` using `rename`.

On a `DataFrame` we see the first column is renamed...

In [7]:
(
    df_eager
    .rename({"PassengerId":"Id"})
    .head(2)
)    

Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


...while on a `LazyFrame` we see that a `RENAME` step is added to the query plan.

In [8]:
(
    df_lazy
    .rename({"PassengerId":"Id"})
)    

## Chaining or re-assigning?
In this course we typically run operations with method chaining like this:

In [9]:
print(
    pl.scan_csv(csv_file)
    .rename({"PassengerId":"Id"})
    .explain()
)    

RENAME

    Csv SCAN ../notebooks/data/titanic.csv
    PROJECT */12 COLUMNS


However, we can also do operations by re-assigning the variable in each step:

In [None]:
df_lazy = pl.scan_csv(csv_file)
df_lazy = df_lazy.rename({"PassengerId":"Id"})
print(df_lazy.explain())

The two methods are equivalent: both build the same query plan.

## Query optimization
Polars creates a *naive query plan* from your query.

Polars passes the naive query plan to its **query optimizer**. The query optimizer looks for more efficient ways to arrive at the output you want.

Printing the output of the `explain` method shows the optimized plan:

In [10]:
df_lazy = pl.scan_csv(csv_file)
print(df_lazy.explain())


  Csv SCAN ../notebooks/data/titanic.csv
  PROJECT */12 COLUMNS


## What query optimizations are applied?
Query optimizations aren't magic. Most optimizations could be implemented by users in a well-written query if the user:
- knows the optimization exists 
- remembers to implement the optimization and 
- implements the optimization correctly!

Optimizations applied by Polars include:
- `projection pushdown`: limit the number of columns read to those required
- `predicate pushdown`: apply filter conditions as early as possible
- `combine predicates`: combine multiple filter conditions
- `slice pushdown`: limit rows processed when limited rows are required
- `common subplan elimination`: run duplicated transformations on the same data once and then re-use the result

We'll see how these optimisations arise later in the course.

Polars also implements other optimisations such as fast-path algorithms on sorted data (separate from the query optimiser). 

## Exercises

In the exercises you will develop your understanding of:
- creating a `LazyFrame` from a CSV file
- getting metadata from a `LazyFrame`
- printing the query plans

### Exercise 1
Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [11]:
df = pl.scan_csv(csv_file)

Use the `fetch` method and count how many rows it returns by default.

In [13]:
df.fetch().shape

(500, 12)

Check to see which of the following metadata you can get from a `LazyFrame`:
- number of rows
- column names
- schema

## Solutions

### Solution to Exercise 1

In [None]:
df = pl.scan_csv(csv_file)

In [None]:
df.fetch().shape

A `LazyFrame` does not know the number of rows in a CSV, so asking for its shape raises an error.

In [None]:
df.shape

A `LazyFrame` does know the column names and the schema. As we will see in the I/O section, Polars reads the first row of the CSV file to get the column names in `pl.scan_csv`.

In [14]:
df.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [15]:
df.schema

{'PassengerId': Int64,
 'Survived': Int64,
 'Pclass': Int64,
 'Name': Utf8,
 'Sex': Utf8,
 'Age': Float64,
 'SibSp': Int64,
 'Parch': Int64,
 'Ticket': Utf8,
 'Fare': Float64,
 'Cabin': Utf8,
 'Embarked': Utf8}