## Replacing missing values
By the end of this lecture you will be able to:
- replace missing values with a constant
- replace missing values with a strategy

In [1]:
import polars as pl

In [2]:
data = {
    "Name": ["Alice", "Bob", "Bob", "Bob", "Alice"],
    "Age": [25, None, 30, None, None],
    "Score": [85.0, 90.5, None, 88.0, 92.0]
}

In [3]:
df = pl.DataFrame(data)

In [4]:
df

Name,Age,Score
str,i64,f64
"""Alice""",25,85.0
"""Bob""",,90.5
"""Bob""",30,
"""Bob""",,88.0
"""Alice""",,92.0


## Replace missing values with a constant
We replace missing values in an expression using `fill_null`.

In [5]:
df_filled = df.with_columns([
    pl.col("Name").fill_null("Unknown"),
    pl.col("Age").fill_null(0),
    pl.col("Score").fill_null(1)
])

In [6]:
df_filled

Name,Age,Score
str,i64,f64
"""Alice""",25,85.0
"""Bob""",0,90.5
"""Bob""",30,1.0
"""Bob""",0,88.0
"""Alice""",0,92.0


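In this case `fill_null` has cast the integer fill value `1` to a float to match the `Score` column, so the filled value appears as `1.0`. The column dtypes (`str`, `i64`, `f64`) are unchanged.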



## Replace missing values with a strategy
We can also replace missing values with a strategy, including:
- forward: replace with the previous non-`null` value
- backward: replace with the next non-`null` value
- min: replace with the smallest value in the `Series`
- max: replace with the largest value in the `Series`
- mean: replace with the mean value in the `Series`

### Forward strategy
In the forward strategy, each missing value is replaced with the previous non-`null` value. The backward strategy does the same with the next non-`null` value.

### Original DataFrame

In [7]:
df

Name,Age,Score
str,i64,f64
"""Alice""",25,85.0
"""Bob""",,90.5
"""Bob""",30,
"""Bob""",,88.0
"""Alice""",,92.0


In [8]:
(
    df
    .with_columns(
        pl.col("Age").fill_null(strategy = "forward"),
        pl.col("Score").fill_null(strategy= "backward")
    )
)

Name,Age,Score
str,i64,f64
"""Alice""",25,85.0
"""Bob""",25,90.5
"""Bob""",30,88.0
"""Bob""",30,88.0
"""Alice""",30,92.0


We can limit how many consecutive rows are filled forward or backward with the `limit` argument.

### Original DataFrame

In [9]:
df

Name,Age,Score
str,i64,f64
"""Alice""",25,85.0
"""Bob""",,90.5
"""Bob""",30,
"""Bob""",,88.0
"""Alice""",,92.0


In [10]:
(
    df
    .with_columns(
        pl.col("Age").fill_null(strategy="forward", limit=1)
    )
)

Name,Age,Score
str,i64,f64
"""Alice""",25,85.0
"""Bob""",25,90.5
"""Bob""",30,
"""Bob""",30,88.0
"""Alice""",,92.0


## Replacing missing values by group
In this example we have missing values in `Age` and we want to fill them with a fill-forward strategy. 

However, we want to fill forward with respect to the groups in the `Name` column.

We can do this using a *window expression* with `over`.

In [11]:
(
    df
    .with_columns(
        pl.col("Age").fill_null(strategy='forward').over('Name')
        
    )
)

Name,Age,Score
str,i64,f64
"""Alice""",25,85.0
"""Bob""",,90.5
"""Bob""",30,
"""Bob""",30,88.0
"""Alice""",25,92.0


We cover window expressions in more detail in the statistics and grouping section.

## Interpolation
Polars provides the `interpolate` method to fill missing values. By default it uses linear interpolation, which computes each missing value as a point on a straight line between the neighbouring non-missing values. Missing values with no non-missing value on both sides, such as trailing `null`s, are left as `null`.

In [13]:
(
    df
    .with_columns(
        pl.col("Age").interpolate()
    )
)

Name,Age,Score
str,f64,f64
"""Alice""",25.0,85.0
"""Bob""",27.5,90.5
"""Bob""",30.0,
"""Bob""",,88.0
"""Alice""",,92.0
