# Filtering rows 3: using `filter` in lazy mode
By the end of this lecture you will be able to:
- use `filter` in lazy mode
- understand the optimized and non-optimized query plans
- combine multiple conditions in lazy mode

In [1]:
import polars as pl

In [2]:
csvFile = "../data/titanic.csv"

Create a `LazyFrame` by scanning the CSV

In [3]:
df = pl.scan_csv(csvFile)
df

This output is the non-optimized `naive` query plan.

The `SELECTION` part is `None` as no filters have been applied.

## `filter` in lazy mode

When we apply a `filter` in lazy mode a `FILTER` line is added to the `naive` query plan.

In [6]:
df = pl.scan_csv(csvFile)
df = df.filter(pl.col("Age") > 30)

df.explain(optimized=False)

'FILTER [(col("Age")) > (30.0)] FROM\n\n  Csv SCAN ../data/titanic.csv\n  PROJECT */12 COLUMNS'

All query plans are read from bottom to top. 

## Inspecting the non-optimized query plan
We examine the non-optimized query plan above first.

### Bottom row
In this non-optimized query plan the bottom rows:
`Csv SCAN ../data/titanic.csv` and `PROJECT */12 COLUMNS`

state what will happen as Polars reads the CSV file:
- `Csv SCAN` reads the CSV line-by-line
- `PROJECT */12 COLUMNS` selects all 12 of the columns (`*` is a wildcard meaning all)
- no `SELECTION` entry is shown (some Polars versions print `SELECTION: None` instead), which means no rows will be filtered out during the scan.

### Top row
In this non-optimized query plan the top row:
`FILTER [(col("Age")) > (30.0)] FROM`

states that **once the entire CSV file has been read into memory** as a `DataFrame`:
- the `DataFrame` will be filtered for rows with `Age` greater than 30

## Inspecting the optimized query plan
We compare this with the optimized query plan that Polars will actually run when the `LazyFrame` is evaluated with `collect` or `fetch`.

`explain` returns a plain string, so we `print` it to render the line breaks correctly.

In [13]:
df = (
    pl.scan_csv(csvFile)
    .filter(
        pl.col("Age") > 30
    )
)
print(df.explain())


FILTER [(col("Age")) > (30.0)] FROM

  Csv SCAN ../data/titanic.csv
  PROJECT */12 COLUMNS


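When Polars pushes the filter down into the scan, the optimized query plan collapses to a single block and the separate `FILTER` row disappears. How this is printed depends on the Polars version; the output above still lists `FILTER` as its own row.

The `CSV SCAN` and `PROJECT` parts have not changed.

A line such as `SELECTION: [(col("Age")) > (30f64)]` inside the scan block means that Polars will apply the filter on the `Age` column **as the CSV is being read**.

For emphasis: when the filter is pushed into the scan, only the rows of the CSV that meet the filter conditions are read into a `DataFrame`. This is memory efficient because rows that fail the filter are never materialized.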

### Multiple conditions in lazy mode
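In *lazy mode* if we chain multiple `filter` calls then the query optimizer combines these into a *single condition* joined with `&`. Depending on how the plan is printed, this combined condition appears inside `SELECTION` or in a single `FILTER` row.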

In this example we filter for first class passengers over the age of 70.

In [54]:
df = (
    pl.scan_csv(csvFile)
    .filter(
        pl.col('Pclass')==1
    )
    .filter(
        (pl.col('Age') > 70)
    )
)
print(df.explain())

FILTER [([(col("Pclass")) == (1)]) & ([(col("Age")) > (70.0)])] FROM

  Csv SCAN ../data/titanic.csv
  PROJECT */12 COLUMNS


# Exercises
In the exercises you will develop your understanding of:
- using the `filter` method in lazy mode
- interpreting optimized query plans
- applying multiple conditions

## Exercise 1: Create a `LazyFrame` with rows where `Fare` is greater than 10 in lazy mode

In [56]:
print(df.lazy().filter(pl.col("fare") > 10).explain(optimized=True))

print(df.lazy().filter(pl.col("fare") > 10).explain(optimized=False))

FILTER [([([(col("Age")) > (70.0)]) & ([(col("fare")) > (10)])]) & ([(col("Pclass")) == (1)])] FROM

  Csv SCAN ../data/titanic.csv
  PROJECT */12 COLUMNS

FILTER [(col("fare")) > (10)] FROM
FILTER [(col("Age")) > (70)] FROM
FILTER [(col("Pclass")) == (1)] FROM

  Csv SCAN ../data/titanic.csv
  PROJECT */12 COLUMNS


Exercise 1 cont: print out the optimized query plan and confirm the `SELECTION` is updated

Exercise 1 cont: evaluate this query for the first 10 rows

## Exercise 2: Create a `LazyFrame` where `Age` is greater than 30 and the passenger was in 2nd class

In [None]:
df = pl.<blank>

Exercise 2 cont: print out the optimized query plan and confirm the `SELECTION` is updated

## Exercise 3: Create a lazy query with the following optimized plan
```
CSV SCAN ../data/titanic.csv; 
PROJECT */12 COLUMNS; 
SELECTION: [([(col("Fare")) < (10f64)]) & ([(col("Sex")) == (Utf8(female))])]
```

# Solutions

## Solution to Exercise 1

In [57]:
df = pl.scan_csv(csvFile)
df = df.filter(pl.col('Fare') > 10)

Exercise 1 cont: print out the optimized query plan and confirm the `SELECTION` is updated

In [59]:
df = pl.scan_csv(csvFile)
df = df.filter(pl.col('Fare') > 10)
print(df.explain())

FILTER [(col("Fare")) > (10.0)] FROM

  Csv SCAN ../data/titanic.csv
  PROJECT */12 COLUMNS


Exercise 1 cont: evaluate this query for the first 10 rows

In [60]:
df.fetch(10)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | i64 | i64 | str | str | f64 | i64 | i64 | str | f64 | str | str |
| 2 | 1 | 1 | "Cumings, Mrs. …" | "female" | 38.0 | 1 | 0 | "PC 17599" | 71.2833 | "C85" | "C" |
| 4 | 1 | 1 | "Futrelle, Mrs.…" | "female" | 35.0 | 1 | 0 | "113803" | 53.1 | "C123" | "S" |
| 7 | 0 | 1 | "McCarthy, Mr. …" | "male" | 54.0 | 0 | 0 | "17463" | 51.8625 | "E46" | "S" |
| 8 | 0 | 3 | "Palsson, Maste…" | "male" | 2.0 | 3 | 1 | "349909" | 21.075 |  | "S" |
| 9 | 1 | 3 | "Johnson, Mrs. …" | "female" | 27.0 | 0 | 2 | "347742" | 11.1333 |  | "S" |
| 10 | 1 | 2 | "Nasser, Mrs. N…" | "female" | 14.0 | 1 | 0 | "237736" | 30.0708 |  | "C" |


## Solution to Exercise 2

In [61]:
df = pl.scan_csv(csvFile)
df = df.filter(
    (pl.col('Age') > 30) & (pl.col('Pclass')==2)
)

Exercise 2 cont: print out the optimized query plan and confirm the `SELECTION` is updated

In [62]:
df = pl.scan_csv(csvFile)
df = df.filter(
    (pl.col('Age') > 30) & (pl.col('Pclass')==2)
)
print(df.explain())

FILTER [([(col("Age")) > (30.0)]) & ([(col("Pclass")) == (2)])] FROM

  Csv SCAN ../data/titanic.csv
  PROJECT */12 COLUMNS


Exercise 2 cont: evaluate this query for the full `DataFrame`

In [63]:
df.collect()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | i64 | i64 | str | str | f64 | i64 | i64 | str | f64 | str | str |
| 16 | 1 | 2 | "Hewlett, Mrs. …" | "female" | 55.0 | 0 | 0 | "248706" | 16.0 |  | "S" |
| 21 | 0 | 2 | "Fynney, Mr. Jo…" | "male" | 35.0 | 0 | 0 | "239865" | 26.0 |  | "S" |
| 22 | 1 | 2 | "Beesley, Mr. L…" | "male" | 34.0 | 0 | 0 | "248698" | 13.0 | "D56" | "S" |
| 34 | 0 | 2 | "Wheadon, Mr. E…" | "male" | 66.0 | 0 | 0 | "C.A. 24579" | 10.5 |  | "S" |
| 71 | 0 | 2 | "Jenkin, Mr. St…" | "male" | 32.0 | 0 | 0 | "C.A. 33111" | 10.5 |  | "S" |
| 99 | 1 | 2 | "Doling, Mrs. J…" | "female" | 34.0 | 0 | 1 | "231919" | 23.0 |  | "S" |
| 100 | 0 | 2 | "Kantor, Mr. Si…" | "male" | 34.0 | 1 | 0 | "244367" | 26.0 |  | "S" |
| 123 | 0 | 2 | "Nasser, Mr. Ni…" | "male" | 32.5 | 1 | 0 | "237736" | 30.0708 |  | "C" |
| 124 | 1 | 2 | "Webber, Miss. …" | "female" | 32.5 | 0 | 0 | "27267" | 13.0 | "E101" | "S" |
| 149 | 0 | 2 | "Navratil, Mr. …" | "male" | 36.5 | 0 | 2 | "230080" | 26.0 | "F2" | "S" |


## Solution to Exercise 3

In [65]:
(
    pl.scan_csv(csvFile)
    .filter(pl.col('Fare') < 10)
    .filter(pl.col('Sex') == 'female')
).explain()

'FILTER [([(col("Sex")) == (Utf8(female))]) & ([(col("Fare")) < (10.0)])] FROM\n\n  Csv SCAN ../data/titanic.csv\n  PROJECT */12 COLUMNS'