# Lazy mode 1: Introducing lazy mode
By the end of this lecture you will be able to:
- explain the difference between eager mode and lazy mode
- explain what the query optimizer does
- create a `LazyFrame`
- explain the difference between a `DataFrame` and a `LazyFrame`
- print the optimized query plan

Lazy mode is crucial to taking full advantage of Polars because it enables query optimization and streaming of large datasets. We introduce lazy mode in this lesson and revisit it throughout the course.

In [None]:
import polars as pl

In [None]:
csv_file = "../data/titanic.csv"

## Eager mode and lazy mode


In [None]:
(
    pl.read_csv(csv_file)
    .group_by("Pclass")
    .agg(
        pl.col("Age").mean()
    )
)

We cover `group_by.agg` in much more detail later in the course!

In this example Polars works line-by-line to:
- create a `DataFrame` from the full Titanic dataset with 12 columns
- do a `group_by` on the `Pclass` column and
- get the mean of the `Age` column for each group

However, this is not the optimal way to calculate this output: we read in all 12 columns from the file into a `DataFrame` even though we only need 2 columns to get the output. We use more computation and more memory than necessary.

This approach is not optimal because Polars runs the code line-by-line instead of looking at the full set of operations. We call the full set of operations that produces our desired output a **query**.

We call this way of working line-by-line **eager** mode.

Polars has an alternative way of running this query called **lazy** mode

In [None]:
(
    pl.scan_csv(csv_file)
    .group_by("Pclass")
    .agg(
        pl.col("Age").mean()
    )
    .collect()
)

In lazy mode Polars:
- starts a lazy query as we use `pl.scan_csv` instead of `pl.read_csv`
- builds a *naive query plan* from the set of operations as we set them out in our code
- passes this naive query plan to its **query optimizer** to build an optimized query plan
- **evaluates** this optimized query plan when we call `collect`

In this example the query optimization is to read only the `Pclass` and `Age` columns from the CSV file instead of all 12 columns.

Pandas has eager mode only. Polars can run in eager or lazy mode.

## `DataFrames` and `LazyFrames`
In Polars:
- eager mode is equivalent to working with `DataFrames`
- lazy mode is equivalent to working with `LazyFrames`.

We **read** a CSV in eager mode with `pl.read_csv`. This creates a **`DataFrame`**

In [None]:
df_eager = pl.read_csv(csv_file)
df_eager.head(2)

We **scan** a CSV in lazy mode with `pl.scan_csv`. This creates a **`LazyFrame`**

In [None]:
df_lazy = pl.scan_csv(csv_file)
df_lazy

When we print a `LazyFrame` Polars prints out the `naive plan` - we learn more about the naive plan later in this notebook.

A `LazyFrame` is really a **query plan** - a plan for how Polars will transform your data when you evaluate the lazy query (we cover evaluation in more detail in the next lecture).

We evaluate a `LazyFrame` and transform it into a `DataFrame` by calling `collect` on a `LazyFrame`

In [None]:
(
    df_lazy
    .head(3)
    .collect()
)

### Schema and column names

The `schema` of a `DataFrame` sets out the column names and dtypes

In [None]:
df_eager.schema

The schema of a `DataFrame` is an attribute of the `DataFrame` and no computation is required to see it.

The schema of a `LazyFrame` is the schema that we eventually get when we evaluate a `LazyFrame` and turn it into a `DataFrame`.

We *can* get the schema of a `LazyFrame` by calling `schema`

In [None]:
df_lazy.schema

However, we get a `warning` that getting the schema of a `LazyFrame` may be an expensive operation as it requires Polars to work through the logic of the query plan to see what the final columns and dtypes would be. 

Don't get *too* worried by the word "expensive" here - for a simple `LazyFrame` this might only take 1 millisecond! By "expensive" the Polars devs are warning you that it takes some computation to generate the schema from a `LazyFrame` and that the time taken for this computation will grow as the length of the query plan grows. 
If you have a long and complicated query plan - imagine you are ingesting hundreds of files and doing lots of joins, concats and aggregations - then you might start to notice how long getting the schema takes.

The preferred way to get the schema of a `LazyFrame` - equivalent to `.schema` internally - is with `collect_schema`

In [None]:
(
    df_lazy
    .collect_schema()
)

Calculating the schema with `collect_schema` is still much faster than evaluating the full query with `collect` as `collect_schema` does not process your data, it just runs through the optimized query plan.

> Why is `collect_schema` the preferred way? Because the syntax may make it more obvious to anyone reading your code that there is an actual computation going on here to get the schema.

Similarly, we get a warning if we call `columns` on a `LazyFrame`

In [None]:
df_lazy.columns

The preferred way to do this is via `collect_schema().names()`

In [None]:
(
    df_lazy
    .collect_schema()
    .names()
)

We cannot get the number of rows of a `LazyFrame` for free as Polars does not know how many rows there are from a CSV scan. If we want the length of the output `DataFrame` we have to use a query like this with the `pl.len` expression to count the number of rows

In [None]:
(
    df_lazy
    .select(
        pl.len()
    )
    .collect()
)

To evaluate this query Polars analyzes the CSV to count how many rows there are.

We learn more about evaluating a lazy query by calling `collect` in the next lecture.

### Creating a LazyFrame from data

Above we create a `LazyFrame` from a scan of a CSV file. We can also directly create a `LazyFrame` from `pl.LazyFrame` with some data

In [None]:
(
    pl.LazyFrame(
        {"values":[0,1,2]}
    )
)

Or we can call `.lazy` on a `DataFrame`

In [None]:
(
    pl.DataFrame(
        {"values":[0,1,2]}
    )
    .lazy()
)

Each time we print a `LazyFrame` Polars prints the `naive plan`. This is the query plan built directly from your operations in the order you added them to the `LazyFrame` with no query optimizations applied. The naive query plan only tracks the operations you have made; the optimized query plan is what actually runs when we evaluate the query.

### What's the difference between a `DataFrame` and a `LazyFrame`?

If we print a `DataFrame` we see data...

In [None]:
(
    df_eager
    .head(2)
)

...but if we print a `LazyFrame` we see a **query plan**

**Key message: an operation on a `DataFrame` acts on the data. An operation on a `LazyFrame` acts on the query plan**.

### Operations on a `DataFrame` and a `LazyFrame` 
To show the difference between operations on a `DataFrame` and a `LazyFrame` we do a simple operation where we rename the `PassengerId` column to `Id` using `rename`.

On a `DataFrame` we see the first column is renamed...

In [None]:
(
    df_eager
    .rename({"PassengerId":"Id"})
    .head(2)
)    

while on a `LazyFrame` we see that a `RENAME` step is added to the query plan

In [None]:
(
    df_lazy
    .rename({"PassengerId":"Id"})
)    

### Chaining or re-assigning?
In this course we typically run operations with method chaining like this

In [None]:
(
    pl.scan_csv(csv_file)
    .rename({"PassengerId":"Id"})
)    

However, we can also do operations by re-assigning the variable in each step

In [None]:
df_lazy = pl.scan_csv(csv_file)
df_lazy = df_lazy.rename({"PassengerId":"Id"})

The two methods are equivalent when working with `DataFrames` or `LazyFrames`. I find that chaining makes it easier to read so I generally stick with that approach.

## Query optimization
Polars creates a *naive query plan* from your query. This is a query plan that sets out the operations as described in the code you write (i.e. with no optimizations).

Polars passes the naive query plan to its **query optimizer** to produce the *optimized query plan*. The query optimizer looks for more efficient ways to arrive at the output you want. This optimization step is the key advantage of lazy mode.

To see the *optimized* plan we call `explain` on a `LazyFrame` and the plan is returned as a string. We use a `print` statement to format it correctly

In [None]:
print(
    pl.scan_csv(csv_file)
    .explain()
)

In this simple case the query plan shows that we:
- scan the CSV file
- select all 12 of the columns (*/12*)


### What query optimizations are applied?
Query optimizations aren't magic. Most optimizations could be implemented by users in a well-written query if the user:
- knows the optimization exists 
- remembers to implement the optimization and 
- implements the optimization correctly!

Optimizations applied by Polars include:
- `projection pushdown`: limit the number of columns read to those required for a query
- `predicate pushdown`: apply filter conditions as early as possible
- `combine predicates`: combine multiple filter conditions into a single pass through the data
- `common subexpression elimination`: duplicated calculations are saved and re-used
- `common subplan elimination`: run duplicated transformations on the same data once and then re-use

We see how these optimizations arise in the relevant sections later in the course.

Note that if there are no query optimizations that can be applied to your query then the performance of eager mode and lazy mode will be very similar as eager mode uses the same lazy mode code internally. 

## Exercises
In the exercises you will develop your understanding of:
- creating a `LazyFrame` from a CSV file
- getting metadata from a `LazyFrame`
- printing the query plans

For all notebooks you can scroll down to see the solutions.

### Exercise 1
Create a `LazyFrame` by doing a scan of the Titanic CSV file.

In [None]:
# Replace <blank> with your own code in the exercises
df = pl.<blank>

Check to see which of the following metadata you can get from a `LazyFrame`:
- number of rows
- schema
- column names

Create a lazy query where you scan the Titanic CSV file and then select the `Name` and `Age` columns.

In [None]:
(
    pl.scan_csv(csv_file)
    <blank>
)

Print out the optimized query plan for this query

## Solutions

### Solution to Exercise 1

Create a `LazyFrame` by doing a scan of the Titanic CSV file

In [None]:
df = pl.scan_csv(csv_file)

A `LazyFrame` does not know the number of rows in a CSV. It has no `shape` attribute, so the following cell raises an `AttributeError`

In [None]:
df.shape

We can compute the schema

In [None]:
(
    df
    .collect_schema()
)

And we can compute the column names from a `LazyFrame`

In [None]:
(
    df
    .collect_schema()
    .names()
)

Create a lazy query where you scan the Titanic CSV file and then select the `Name` and `Age` columns.

In [None]:
(
    pl.scan_csv(csv_file)
    .select("Name","Age")
)   

Print out the optimized query plan for this query

In [None]:
print(
    pl.scan_csv(csv_file)
    .select("Name","Age")
    .explain()
)   