## Missing values
By the end of this lecture you will be able to:
- identify missing values in a `DataFrame`
- count the number of missing values in a column
- find and drop `null` or non-`null` values

In [1]:
import polars as pl
import polars.selectors as cs

### Missing values in Polars
Polars represents missing values with a `null` value for every dtype. We can create a `null` manually by passing `None`.

We create a simple `DataFrame` where the rows have:
- all `null` values
- some `null` values
- one `null` value
- no `null` values

In [2]:
df = pl.DataFrame(
    {
        "col1": [None, 2, 3, 4],
        "col2": [None, None, 5, 6],
        "col3": [None, None, None, 7],
    }
)
df

col1,col2,col3
i64,i64,i64
,,
2,,
3,5,
4,6,7


> In Pandas a missing value can be represented with a `null`, `NaN` or `None` value depending on the dtype of the column. Polars also allows `NaN` values in floating point columns to represent non-numeric results (e.g. where zero has been divided by zero). This use of `NaN` is distinct from missing values.

### Metadata on `null` values
Polars stores metadata about `null` values for each column in a `DataFrame`.

#### Null count
Polars stores a count of how many `null` values there are in each column. We can access this with the `null_count` method on a single column or on all the columns.

In [3]:
df.null_count()

col1,col2,col3
u32,u32,u32
1,2,3


Polars keeps track of the `null_count` at all times so this is a cheap operation regardless of the size of the column.

### Finding `null` values

We use the `is_null` expression to find out whether each value is `null`, and `is_not_null` for the converse.

In [5]:
(
    df
    .select(
        [
            pl.col("col1"),
            pl.col("col1").is_null().alias("is_null"),
            pl.col("col1").is_not_null().alias("is_not_null")
        ]
    )
)

col1,is_null,is_not_null
i64,bool,bool
,True,False
2,False,True
3,False,True
4,False,True


### Filtering by `null` values

#### Filtering on a single column
We can use these methods to filter by `null` or non-`null` values on a single column.

In this example we want all rows where the values in `col1` are not `null`.

In [6]:
(
    df
    .filter(
        pl.col("col1").is_not_null(),
    )
)

col1,col2,col3
i64,i64,i64
2,,
3,5,
4,6,7


#### Filtering by `null` values in multiple columns

In this example we want to remove rows where **all** values are `null`. We can do this using:
- `pl.all().is_not_null()` to give `True` values where we get non-`null` values
- `pl.any_horizontal` to find if there is at least one `True` value in a row

In [7]:
df

col1,col2,col3
i64,i64,i64
,,
2,,
3,5,
4,6,7


In [8]:
(
    df
    .filter(
        pl.any_horizontal(pl.all().is_not_null())
    )
)

col1,col2,col3
i64,i64,i64
2,,
3,5,
4,6,7


In this example we want to keep only the rows where there are no `null` values.

In [9]:
(
    df
    .filter(
        pl.all_horizontal(pl.all().is_not_null())
    )
)

col1,col2,col3
i64,i64,i64
4,6,7


### Using the `drop_nulls` method

Polars has a convenience `drop_nulls` method for dropping rows where any value is `null`.

In [11]:
(
    df
    .drop_nulls()
)

col1,col2,col3
i64,i64,i64
4,6,7


We can also specify a subset of columns to apply the condition on. A row is then dropped only if it has a `null` in one of those columns.

In [12]:
(
    df
    .drop_nulls(subset=["col1","col2"])
)

col1,col2,col3
i64,i64,i64
3,5,
4,6,7


## Exercises
In the exercises you will develop your understanding of:
- counting the `null` values
- filtering by `null` values

### Exercise 1
Count the number of `null` values in each column of the Titanic data.

In [18]:
csv_file = "data_titanic.csv"
(
    pl.read_csv(csv_file)
    .null_count()
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,177,0,0,0,0,687,2


Drop the rows where `Cabin` is `null` and count the `null` values in all columns again.

In [19]:
(
    pl.read_csv(csv_file)
    .drop_nulls(subset=['Cabin'])
    .null_count()
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,19,0,0,0,0,0,2


### Exercise 2
Find all the rows for which the `Age` is `null`.

In [20]:
(
    pl.read_csv(csv_file)
    .filter(pl.col("Age").is_null())
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
6,0,3,"""Moran, Mr. Jam…","""male""",,0,0,"""330877""",8.4583,,"""Q"""
18,1,2,"""Williams, Mr. …","""male""",,0,0,"""244373""",13.0,,"""S"""
20,1,3,"""Masselmani, Mr…","""female""",,0,0,"""2649""",7.225,,"""C"""
27,0,3,"""Emir, Mr. Farr…","""male""",,0,0,"""2631""",7.225,,"""C"""
29,1,3,"""O'Dwyer, Miss.…","""female""",,0,0,"""330959""",7.8792,,"""Q"""
…,…,…,…,…,…,…,…,…,…,…,…
860,0,3,"""Razi, Mr. Raih…","""male""",,0,0,"""2629""",7.2292,,"""C"""
864,0,3,"""Sage, Miss. Do…","""female""",,8,2,"""CA. 2343""",69.55,,"""S"""
869,0,3,"""van Melkebeke,…","""male""",,0,0,"""345777""",9.5,,"""S"""
879,0,3,"""Laleff, Mr. Kr…","""male""",,0,0,"""349217""",7.8958,,"""S"""


Find all the rows for which either the `Age` or the `Cabin` is `null`.

Use the Selectors API (imported above as `cs`) to select the columns.

## Solutions
### Solution to Exercise 1
Count the number of `null` values in each column of the Titanic data.

In [25]:
csv_file = "data_titanic.csv"
(
    pl.read_csv(csv_file)
    .null_count()
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,177,0,0,0,0,687,2


Drop the rows where `Cabin` is `null` and count the `null` values in all columns again.

In [26]:
(
    pl.read_csv(csv_file)
    .filter(pl.col("Cabin").is_not_null())
    .null_count()
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,19,0,0,0,0,0,2


### Solution to Exercise 2
Find all the rows for which the `Age` is `null`.

In [28]:
csv_file = "data_titanic.csv"
(
    pl.read_csv(csv_file)
    .filter(pl.col("Age").is_null())
    .head()
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
6,0,3,"""Moran, Mr. Jam…","""male""",,0,0,"""330877""",8.4583,,"""Q"""
18,1,2,"""Williams, Mr. …","""male""",,0,0,"""244373""",13.0,,"""S"""
20,1,3,"""Masselmani, Mr…","""female""",,0,0,"""2649""",7.225,,"""C"""
27,0,3,"""Emir, Mr. Farr…","""male""",,0,0,"""2631""",7.225,,"""C"""
29,1,3,"""O'Dwyer, Miss.…","""female""",,0,0,"""330959""",7.8792,,"""Q"""


Find all the rows for which either the `Age` or the `Cabin` is `null`.

Use the Selectors API (imported above as `cs`) to select the columns.

In [29]:
(
    pl.read_csv(csv_file)
    .filter(
        pl.any_horizontal(cs.matches("Age|Cabin").is_null())
    )
    .select(
        cs.matches("Age|Cabin")
    )
)

Age,Cabin
f64,str
22.0,
26.0,
35.0,
,
2.0,
…,…
25.0,
39.0,
27.0,
,
