# Missing values and outliers

In this notebook, we look at the basic techniques to identify and deal with missing values and outliers. There is of course much more to this, and it could be an entire course on its own.

In [1]:
import numpy as np
import pandas as pd

We can check for missing values using the `isna` method (also available as the function `pd.isna`).

In [2]:
float_data = pd.Series([1.2, -3.5, np.nan, 0])
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

In [3]:
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool

In [4]:
data = pd.DataFrame([[1., 6.5, 3.], [1., None, 5],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]],
                   columns = ["Col 1", "Col 2", "Col 3"])
data["Col 4"] = ["a", "b", None, np.nan]
data

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,5.0,b
2,,,,
3,,6.5,3.0,


In [5]:
data.isna()

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,False,False,False,False
1,False,True,False,False
2,True,True,True,True
3,True,False,False,True


For a dataframe, though, we might be more interested in knowing whether there are any missing values in each column, and how many. The `info` method tells us how many non-null values each column has. Together with the `RangeIndex` line, which says there are 4 entries, this lets us work out how many missing values each column has.

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Col 1   2 non-null      float64
 1   Col 2   2 non-null      float64
 2   Col 3   3 non-null      float64
 3   Col 4   2 non-null      object 
dtypes: float64(3), object(1)
memory usage: 260.0+ bytes
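
As a complement to reading the counts off `info`, the missing values per column can be computed directly by summing the boolean frame returned by `isna`. The sketch below rebuilds the same toy dataframe so that it runs on its own; the variable names `missing_counts` and `missing_fraction` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same toy dataframe as above.
data = pd.DataFrame([[1., 6.5, 3.], [1., None, 5],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]],
                    columns=["Col 1", "Col 2", "Col 3"])
data["Col 4"] = ["a", "b", None, np.nan]

# isna() returns booleans; summing them counts the True values per column.
missing_counts = data.isna().sum()

# Taking the mean instead gives the fraction of missing values per column.
missing_fraction = data.isna().mean()

print(missing_counts)
print(missing_fraction)
```

The fraction is often the more useful number when deciding whether a column is worth keeping.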


## Dropping missing values



For a Series, it is easy to simply remove the entries with missing values, and sometimes this is what you want to do (but not always!).

In [7]:
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

In [8]:
float_data.dropna()

0    1.2
1   -3.5
3    0.0
dtype: float64

In [9]:
data

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,5.0,b
2,,,,
3,,6.5,3.0,


For dataframes it is a bit more complicated, in the sense that we need to drop either entire rows or entire columns, which may also remove non-missing values. By default, `dropna` drops all rows that contain at least one missing value.

In [10]:
data.dropna()

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a


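It can be made to drop columns instead by passing the argument `axis="columns"` (equivalently, `axis=1`). Here every column contains at least one missing value, so all columns are dropped and only the index remains: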

In [11]:
data.dropna(axis = "columns")

0
1
2
3


**Important note: We always want to be careful when dropping entire rows based on missing values. However, it might make sense to drop an entire column if the majority of its values are missing, or if the column is deemed irrelevant for the later analysis or machine learning model. After removing the columns deemed useless, one might then remove the remaining missing values row-wise. However, we might prefer to impute the missing values instead of removing them.**

If we only want to remove rows (or columns) where all values are missing values, we can give that as an argument to `dropna`: 

In [12]:
data.dropna(how="all")

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,5.0,b
3,,6.5,3.0,
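
Beyond `how`, `dropna` also accepts `subset` (only look at certain columns) and `thresh` (keep a row or column only if it has at least that many non-missing values). The following is a minimal sketch on the same toy dataframe; the thresholds are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
import pandas as pd

# Same toy dataframe as above.
data = pd.DataFrame([[1., 6.5, 3.], [1., None, 5],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]],
                    columns=["Col 1", "Col 2", "Col 3"])
data["Col 4"] = ["a", "b", None, np.nan]

# Drop only the rows where "Col 1" is missing; other columns are ignored.
rows_with_col1 = data.dropna(subset=["Col 1"])

# Keep only rows that have at least 3 non-missing values.
mostly_complete_rows = data.dropna(thresh=3)

# Keep only columns that have at least 3 non-missing values.
mostly_complete_cols = data.dropna(axis="columns", thresh=3)

print(rows_with_col1)
print(mostly_complete_rows)
print(mostly_complete_cols)
```

Using `thresh` along the column axis is one way to drop columns where most values are missing while keeping the rest intact.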


## Filling in missing values

Whenever possible, it is often preferable to fill in missing values instead of deleting them. (If you are training a machine learning model and have a lot of data, and only around 5% of it is missing, you can usually just drop the affected rows. Whether that is safe depends on whether the values are missing in a biased way.)

Filling in missing values with a fixed value is easy:

In [13]:
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

In [14]:
float_data.fillna(0)

0    1.2
1   -3.5
2    0.0
3    0.0
dtype: float64

In [15]:
float_data.fillna("banana")

0       1.2
1      -3.5
2    banana
3       0.0
dtype: object

In [16]:
data.fillna(0)

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,0.0,5.0,b
2,0.0,0.0,0.0,0
3,0.0,6.5,3.0,0


In [17]:
data.fillna(0).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Col 1   4 non-null      float64
 1   Col 2   4 non-null      float64
 2   Col 3   4 non-null      float64
 3   Col 4   4 non-null      object 
dtypes: float64(3), object(1)
memory usage: 260.0+ bytes


However, for dataframes you usually want to fill column-wise, and perhaps only for some columns:

In [18]:
data.fillna({"Col 1": 0, "Col 4": "no label"})

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,5.0,b
2,0.0,,,no label
3,0.0,6.5,3.0,no label


One can also fill in the missing values of a column with the mean of that column, in the following way:

In [19]:
data.fillna({"Col 1": 0, "Col 3": data["Col 3"].mean(), "Col 4": "no label"})

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,5.0,b
2,0.0,,3.666667,no label
3,0.0,6.5,3.0,no label
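
If every numeric column should be filled with its own mean, the per-column dictionary can be replaced by a Series of means, since `fillna` matches the Series index against the column names. This is a sketch of that idea on the same toy dataframe; `numeric_means` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same toy dataframe as above.
data = pd.DataFrame([[1., 6.5, 3.], [1., None, 5],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]],
                    columns=["Col 1", "Col 2", "Col 3"])
data["Col 4"] = ["a", "b", None, np.nan]

# Means of the numeric columns only; "Col 4" holds strings and is skipped.
numeric_means = data.mean(numeric_only=True)

# Each numeric column is filled with its own mean; "Col 4" is left untouched.
filled = data.fillna(numeric_means)

print(filled)
```

Replacing `mean` with `median` gives a fill value that is less sensitive to extreme values in the column.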


## Replacing values

Sometimes we might want to replace particular values with other values, for instance replacing outliers or placeholder values such as 9999 with NaN.

In [20]:
data.iloc[2, 0] = 9999
data

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,5.0,b
2,9999.0,,,
3,,6.5,3.0,


In [21]:
data.replace(9999, np.nan)

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,5.0,b
2,,,,
3,,6.5,3.0,


In [22]:
data.iloc[2, 0] = 3.0
data

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,5.0,b
2,3.0,,,
3,,6.5,3.0,


In [23]:
data.replace({3: 100})

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,100.0,a
1,1.0,,5.0,b
2,100.0,,,
3,,6.5,100.0,


Replacing values in a specific column only:

In [24]:
data.replace({"Col 3": {3: 100}})

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,100.0,a
1,1.0,,5.0,b
2,3.0,,,
3,,6.5,100.0,
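
`replace` also accepts a list of values to replace, which is handy when a dataset uses several placeholder codes for "missing". The sketch below uses a made-up Series; the sentinel values 9999 and -999 are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where 9999 and -999 both stand for "missing".
readings = pd.Series([1.0, 9999.0, 2.5, -999.0, 3.0])

# Replace every listed sentinel with NaN in one call.
cleaned = readings.replace([9999, -999], np.nan)

print(cleaned)
```

Once the sentinels are NaN, the `isna`, `dropna` and `fillna` tools from the earlier sections apply as usual.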


## Replacing outliers

When replacing outliers, one can do it explicitly by replacing specific values with other values (or NAs), but one might also want to replace all values over (or under) a certain threshold in a column.

In [25]:
data

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,5.0,b
2,3.0,,,
3,,6.5,3.0,


In [26]:
data2 = data.copy()
data2

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,5.0,b
2,3.0,,,
3,,6.5,3.0,


In [27]:
data2.loc[data2["Col 3"] > 4]["Col 3"] = 4.0  # Chained indexing: assigns to a temporary copy and gives a warning

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [28]:
data2.loc[data2["Col 3"] > 4, "Col 3"] = 4.0  # Do this instead: one .loc call with both row and column indexers

In [29]:
data2

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,4.0,b
2,3.0,,,
3,,6.5,3.0,
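
For the common case of capping values at a threshold, pandas also provides the `clip` method, which avoids writing the boolean mask by hand. Here is a small sketch on a standalone copy of the "Col 3" values; the cap of 4.0 is the same arbitrary threshold as above.

```python
import numpy as np
import pandas as pd

# The values of "Col 3" before capping.
col3 = pd.Series([3.0, 5.0, np.nan, 3.0], name="Col 3")

# Values above 4.0 are set to 4.0; missing values stay missing.
capped = col3.clip(upper=4.0)

print(capped)
```

`clip` returns a new Series, and a `lower` bound can be passed in the same call to cap values from below.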


Note that the same boolean-mask technique can also be used to look for outliers. For instance, we could look for all values in a column that lie more than 3 standard deviations from the mean:

In [30]:
data2.iloc[2, 2] = -100.0
data2

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,4.0,b
2,3.0,,-100.0,
3,,6.5,3.0,


In [31]:
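# The std is taken from data (before -100.0 was inserted); the std of data2 is so inflated by that value that nothing would be flagged
mean3std = np.abs(data2["Col 3"].mean()) + 3 * data["Col 3"].std()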

In [32]:
data2[data2["Col 3"].abs() > mean3std]

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
2,3.0,,-100.0,


In [33]:
data2.loc[data2["Col 3"].abs() > mean3std, "Col 3"] = np.nan


In [34]:
data2

Unnamed: 0,Col 1,Col 2,Col 3,Col 4
0,1.0,6.5,3.0,a
1,1.0,,4.0,b
2,3.0,,,
3,,6.5,3.0,
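
With very few observations, a single extreme value inflates the standard deviation so much that a plain "3 standard deviations" rule may not flag it at all. A common alternative is the interquartile range (IQR) rule, which is swapped in here as a different technique from the one used above. The following is a sketch on a standalone copy of the "Col 3" values; the factor 1.5 is the conventional choice, not something prescribed by this notebook.

```python
import pandas as pd

# The values of "Col 3" after inserting the outlier.
col3 = pd.Series([3.0, 4.0, -100.0, 3.0], name="Col 3")

# Plain z-score rule: mean and std both include the outlier itself.
z_scores = (col3 - col3.mean()) / col3.std()
zscore_flags = z_scores.abs() > 3

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = col3.quantile(0.25), col3.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (col3 < lower) | (col3 > upper)

# mask() replaces the flagged entries with NaN and keeps the rest.
cleaned = col3.mask(is_outlier)

print(zscore_flags)
print(is_outlier)
print(cleaned)
```

On these four values the z-score rule flags nothing, while the IQR rule flags only the -100.0 entry.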
