# Missing values and outliers

In this notebook, we look at the basic techniques to identify and deal with missing values and outliers. There is of course much more to this, and it could be an entire course on its own.

In [None]:
import numpy as np
import pandas as pd

We can check for missing values using the `isna` method (also available as the function `pd.isna`):

In [None]:
float_data = pd.Series([1.2, -3.5, np.nan, 0])
float_data

In [None]:
float_data.isna()

In [None]:
data = pd.DataFrame([[1., 6.5, 3.], [1., None, 5],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]],
                   columns = ["Col 1", "Col 2", "Col 3"])
data["Col 4"] = ["a", "b", None, np.nan]
data

In [None]:
data.isna()

For a dataframe, though, we might be more interested in knowing whether there are any missing values in each column, and how many. The `info` method tells us how many non-null values each column has. Together with the `RangeIndex` line, which says there are 4 entries, this lets us work out how many values are missing in each column.

In [None]:
data.info()
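The counts can also be computed directly rather than read off the `info` output. The following is a minimal sketch that rebuilds the same values under the hypothetical name `example`, so that it runs on its own; it relies on the fact that `True` counts as 1 when summing or averaging a boolean dataframe.

```python
import numpy as np
import pandas as pd

# Same values as the `data` dataframe above, rebuilt so this cell is self-contained.
example = pd.DataFrame([[1., 6.5, 3.], [1., None, 5],
                        [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]],
                       columns=["Col 1", "Col 2", "Col 3"])
example["Col 4"] = ["a", "b", None, np.nan]

missing_counts = example.isna().sum()   # number of missing values per column
missing_share = example.isna().mean()   # fraction of missing values per column
has_missing = example.isna().any()      # True for columns with at least one missing value

print(missing_counts)
```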

## Dropping missing values



For a Series, it is easy to just remove the entries with missing values, and sometimes this is what you want to do (but not always!).

In [None]:
float_data

In [None]:
float_data.dropna()

In [None]:
data

For dataframes it is a bit more complicated, in the sense that we need to drop either entire rows or entire columns, which might also remove non-missing values. By default, `dropna` drops all rows that contain at least one missing value:

In [None]:
data.dropna()

It can be made to drop columns instead of rows using the argument `axis="columns"` (equivalently, `axis=1`):

In [None]:
data.dropna(axis = "columns")

**Important note: We always want to be careful when dropping entire rows based on missing values. However, it might make sense to drop an entire column if the majority of its values are missing, or if the column is deemed irrelevant for the later analysis or machine learning model. After removing the columns deemed useless, one might then remove the remaining missing values row-wise. However, we might want to impute the missing values instead of removing them.**

If we only want to remove rows (or columns) where all values are missing, we can pass `how="all"` to `dropna`:

In [None]:
data.dropna(how="all")
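`dropna` has a few more arguments for controlling what gets dropped. The sketch below uses a small made-up dataframe (the name `example` and its columns are illustrative only) to show `how="all"` applied to columns, `subset` to look at selected columns only, and `thresh` to require a minimum number of non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "b" is entirely missing.
example = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [np.nan, np.nan, np.nan],
    "c": ["x", None, "z"],
})

# Drop columns in which every value is missing (removes "b").
no_empty_columns = example.dropna(how="all", axis="columns")

# Drop rows only if column "a" is missing; other columns are ignored.
rows_with_a = example.dropna(subset=["a"])

# Keep rows that have at least two non-missing values.
at_least_two = example.dropna(thresh=2)
```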

## Filling in missing values

It is often preferable to fill in missing values instead of deleting them. (If one is training a machine learning model on a lot of data and only a small share of it, say 5%, has missing values, one can usually just drop those rows. Whether that is safe depends on whether the values are missing in a biased way, that is, whether the rows with missing values differ systematically from the rest.)

Filling in missing values with a fixed value is easy:

In [None]:
float_data

In [None]:
float_data.fillna(0)

In [None]:
float_data.fillna("banana")

In [None]:
data.fillna(0)

In [None]:
data.fillna(0).info()

However, for dataframes you usually want to fill column-wise, and maybe only for some columns:

In [None]:
data.fillna({"Col 1": 0, "Col 4": "no label"})

One can also fill in the missing values of a column with the mean of that column, in the following way:

In [None]:
data.fillna({"Col 1": 0, "Col 3": data["Col 3"].mean(), "Col 4": "no label"})
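When every numeric column should be filled with its own mean, the means do not have to be listed one by one. A minimal sketch, again rebuilding the values under the hypothetical name `example`: `fillna` accepts a Series whose index holds column labels, and columns not present in that Series are left untouched.

```python
import numpy as np
import pandas as pd

# Same values as the `data` dataframe above, rebuilt so this cell is self-contained.
example = pd.DataFrame([[1., 6.5, 3.], [1., None, 5],
                        [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]],
                       columns=["Col 1", "Col 2", "Col 3"])
example["Col 4"] = ["a", "b", None, np.nan]

# Mean of each numeric column; missing values are skipped by default.
numeric_means = example.mean(numeric_only=True)

# Fill each numeric column with its own mean; "Col 4" is not in numeric_means,
# so its missing values stay as they are.
filled = example.fillna(numeric_means)
print(filled)
```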

## Replacing values

Sometimes we might want to replace particular values with other values, for instance replacing a sentinel value such as 9999 or a known outlier:

In [None]:
data.iloc[2, 0] = 9999
data

In [None]:
data.replace(9999, np.nan)

In [None]:
data.iloc[2, 0] = 3.0
data

In [None]:
data.replace({3: 100})

Replacing values in a specific column only:

In [None]:
data.replace({"Col 3": {3: 100}})

## Replacing outliers

When replacing outliers, one can do it explicitly by replacing specific values with other values (or NAs), but one might also be interested in replacing all values over (or under) a certain threshold in a column.

In [None]:
data

In [None]:
data2 = data.copy()
data2

In [None]:
data2.loc[data2["Col 3"] > 4]["Col 3"] = 4.0  # Chained indexing: assigns to a temporary copy and triggers a warning, so do not rely on it

In [None]:
data2.loc[data2["Col 3"] > 4, "Col 3"] = 4.0  # Do this instead: one .loc call with a row mask and a column label

In [None]:
data2
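For simple capping at a threshold, pandas also offers `clip`, which returns a new object instead of modifying the original in place. A minimal sketch on a standalone Series (the name `col` is illustrative) with the same values as "Col 3":

```python
import numpy as np
import pandas as pd

col = pd.Series([3.0, 5.0, np.nan, 3.0], name="Col 3")

# Cap values above 4.0 at 4.0; missing values stay missing.
capped = col.clip(upper=4.0)
print(capped)
```

`clip` also takes a `lower` argument, so both ends can be capped in one call.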

Note that the last technique could also be used to look for outliers. We could, for instance, look for all values in a column that lie more than 3 standard deviations from the mean:

In [None]:
data2.iloc[2, 2] = -100.0
data2

In [None]:
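# Take the mean and standard deviation from `data`, i.e. the column before the
# outlier was injected. With only four rows, a single extreme value inflates the
# standard deviation so much that it would never be flagged by its own statistics.
col3_mean = data["Col 3"].mean()
col3_std = data["Col 3"].std()
is_outlier = (data2["Col 3"] - col3_mean).abs() > 3 * col3_std

In [None]:
data2[is_outlier]

In [None]:
data2.loc[is_outlier, "Col 3"] = np.nan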


In [None]:
data2
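The mean and standard deviation are themselves sensitive to outliers, especially in small samples. A common alternative, swapped in here for illustration and not used elsewhere in this notebook, is the interquartile range (IQR) rule, which flags values more than 1.5 IQRs below the first quartile or above the third quartile. The data and the factor 1.5 below are assumptions for the sketch.

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
values = pd.Series([3.0, 4.0, -100.0, 3.0, 5.0, 4.0, 3.5])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional cutoffs: 1.5 IQRs outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)

# Replace flagged values with NaN, leaving the original Series untouched.
cleaned = values.mask(is_outlier)
print(cleaned)
```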