# Missing Data

In real-world datasets, you'll come across *missing data* very often. This can happen for any number of reasons, including the inability to collect every value from the whole sample, loss of data, or wrongly formatted entries that had to be deleted.

Let's look at a few simple ways to deal with this missing data.

In [1]:
import numpy as np
import pandas as pd

```numpy.nan``` is how a **NaN** (Not a Number) value is stored. It is how missing data is represented in most cases.
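A property of NaN that is easy to trip over: it does not compare equal to anything, not even itself, so an equality check cannot find missing values. The sketch below illustrates this with a scalar and a small Series; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN is an ordinary float, but it is not equal to anything, including itself
is_float = isinstance(value, float)
equal_to_itself = (value == value)

# so missing values are detected with pd.isna() rather than with ==
detected = pd.isna(value)

# the same check works element-wise on a Series
s = pd.Series([1.0, np.nan, 3.0])
mask = s.isna()

print(is_float, equal_to_itself, detected)
print(mask.tolist())
```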

In [2]:
# creating a dataframe with missing values

df = pd.DataFrame({'c0': [1, 2, np.nan, 4],
                   'c1': [5, np.nan, np.nan, 8],
                   'c2': [9, 10, 11, 12],
                   'c3': [np.nan, np.nan, np.nan, 16]})

In [3]:
df

|   | c0  | c1  | c2 | c3   |
|---|-----|-----|----|------|
| 0 | 1.0 | 5.0 | 9  | NaN  |
| 1 | 2.0 | NaN | 10 | NaN  |
| 2 | NaN | NaN | 11 | NaN  |
| 3 | 4.0 | 8.0 | 12 | 16.0 |


There are three main methods to deal with this missing data:

1. Drop rows/columns with missing data
2. Fill missing values with something you can infer from the rest of the data
3. A hybrid between the two

### 1. Dropping rows/columns with missing values

> ```dropna()``` is a method defined on the ```DataFrame``` class. Its optional ```thresh``` argument sets the minimum number of non-missing values a row/column must have in order to be kept.


**Note**: like the ```DataFrame.drop()``` method, this method won't change the original dataframe unless we specify ```inplace=True```.
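To make the note concrete, here is a minimal sketch on a two-column dataframe. The names `cleaned`, `rows_before` and `rows_cleaned` are only for illustration. It shows that ```dropna()``` returns a new dataframe and that reassigning the result is a common alternative to ```inplace=True```.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c0': [1, 2, np.nan, 4],
                   'c1': [5, np.nan, np.nan, 8]})

# dropna() returns a new dataframe; df itself is left untouched
cleaned = df.dropna()
rows_before = len(df)         # still all 4 rows
rows_cleaned = len(cleaned)   # only rows 0 and 3 have no missing values

# reassigning the result is a common alternative to inplace=True
df = df.dropna()
print(rows_before, rows_cleaned, len(df))
```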

In [4]:
# by default it drops rows with any missing values
df.dropna()

|   | c0  | c1  | c2 | c3   |
|---|-----|-----|----|------|
| 3 | 4.0 | 8.0 | 12 | 16.0 |


In [5]:
# we can drop columns with any missing values by specifying axis=1
df.dropna(axis=1)

|   | c2 |
|---|----|
| 0 | 9  |
| 1 | 10 |
| 2 | 11 |
| 3 | 12 |


In [6]:
# setting a threshold: keep rows with at least 1 non-missing value
df.dropna(thresh=1)

|   | c0  | c1  | c2 | c3   |
|---|-----|-----|----|------|
| 0 | 1.0 | 5.0 | 9  | NaN  |
| 1 | 2.0 | NaN | 10 | NaN  |
| 2 | NaN | NaN | 11 | NaN  |
| 3 | 4.0 | 8.0 | 12 | 16.0 |


Dropping with a threshold gives us the flexibility to drop only the rows/columns with too many missing values and to fill the rest using statistical inference.
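With ```thresh=1``` nothing is dropped from our dataframe, because every row has at least one value. As a small sketch of a stricter threshold, the example below rebuilds the same dataframe and applies ```thresh``` to both columns and rows; `kept_columns` and `kept_rows` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c0': [1, 2, np.nan, 4],
                   'c1': [5, np.nan, np.nan, 8],
                   'c2': [9, 10, 11, 12],
                   'c3': [np.nan, np.nan, np.nan, 16]})

# keep only the columns with at least 2 non-missing values
# (c3 has a single value, so it is dropped)
kept_columns = df.dropna(axis=1, thresh=2)

# keep only the rows with at least 3 non-missing values
# (rows 1 and 2 have 2 and 1 values respectively, so they are dropped)
kept_rows = df.dropna(thresh=3)

print(list(kept_columns.columns))
print(list(kept_rows.index))
```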

### 2. Filling missing values

> ```fillna()``` is a method defined on the ```DataFrame``` class (and ```Series``` as well). We specify a ```value``` argument to fill in the missing values.

In [7]:
df.fillna(value = 0)

|   | c0  | c1  | c2 | c3   |
|---|-----|-----|----|------|
| 0 | 1.0 | 5.0 | 9  | 0.0  |
| 1 | 2.0 | 0.0 | 10 | 0.0  |
| 2 | 0.0 | 0.0 | 11 | 0.0  |
| 3 | 4.0 | 8.0 | 12 | 16.0 |


Instead of filling in the missing values with a constant, we can fill them with a value of some significance, such as the mean or mode of that column.

In [8]:
# filling the fourth column's missing values with its mean
df['c3'].fillna(value=df['c3'].mean())

0    16.0
1    16.0
2    16.0
3    16.0
Name: c3, dtype: float64

In [9]:
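# filling the first column's missing value with its mode
# mode() returns a Series (there can be ties), so we take its first entry;
# passing the whole Series would fill by matching index labels instead
df['c0'].fillna(value=df['c0'].mode().iloc[0])

0    1.0
1    2.0
2    1.0
3    4.0
Name: c0, dtype: float64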

In [10]:
# filling the second row's missing values with its median
df.iloc[1].fillna(value=df.iloc[1].median())

c0     2.0
c1     6.0
c2    10.0
c3     6.0
Name: 1, dtype: float64
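The fills above work on one column or row at a time. As a hedged sketch of doing the same for the whole dataframe in one call, ```fillna()``` also accepts a Series or dict keyed by column name. The names `filled_with_means` and `filled_per_column` are illustrative, and the choice of statistic per column is arbitrary here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c0': [1, 2, np.nan, 4],
                   'c1': [5, np.nan, np.nan, 8],
                   'c2': [9, 10, 11, 12],
                   'c3': [np.nan, np.nan, np.nan, 16]})

# df.mean() is a Series indexed by column name, so each column
# is filled with its own mean
filled_with_means = df.fillna(df.mean())

# a dict lets us pick a different fill value per column;
# columns not listed (here c3) are left untouched
filled_per_column = df.fillna({'c0': df['c0'].median(),
                               'c1': df['c1'].mean()})

print(filled_with_means)
print(filled_per_column)
```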

**Note:** Other statistics about the dataframe can be found from the ```describe()``` method.

In [11]:
df.describe()

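|       | c0       | c1      | c2       | c3   |
|-------|----------|---------|----------|------|
| count | 3.0      | 2.0     | 4.0      | 1.0  |
| mean  | 2.333333 | 6.5     | 10.5     | 16.0 |
| std   | 1.527525 | 2.12132 | 1.290994 | NaN  |
| min   | 1.0      | 5.0     | 9.0      | 16.0 |
| 25%   | 1.5      | 5.75    | 9.75     | 16.0 |
| 50%   | 2.0      | 6.5     | 10.5     | 16.0 |
| 75%   | 3.0      | 7.25    | 11.25    | 16.0 |
| max   | 4.0      | 8.0     | 12.0     | 16.0 |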


Similarly, if you want to know the (non-statistical) details about the dataframe, you can use the ```info()``` method.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c0      3 non-null      float64
 1   c1      2 non-null      float64
 2   c2      4 non-null      int64  
 3   c3      1 non-null      float64
dtypes: float64(3), int64(1)
memory usage: 256.0 bytes


> ```isnull()``` is a method that returns a boolean dataframe of the same shape as the original, with ```True``` wherever a value is missing.
> So, as an alternative to ```info()```, ```df.isnull().sum()``` gives us the number of missing values in each column.

In [13]:
df.isnull().sum()

c0    1
c1    2
c2    0
c3    3
dtype: int64
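The same boolean mask can answer a couple of related questions. As a small sketch (with illustrative names `missing_fraction` and `rows_with_missing`), taking the mean instead of the sum gives the fraction of missing values per column, and ```any(axis=1)``` selects the rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c0': [1, 2, np.nan, 4],
                   'c1': [5, np.nan, np.nan, 8],
                   'c2': [9, 10, 11, 12],
                   'c3': [np.nan, np.nan, np.nan, 16]})

# True counts as 1 and False as 0, so the mean of the mask
# is the fraction of missing values in each column
missing_fraction = df.isnull().mean()

# rows that have a missing value in at least one column
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_fraction)
print(rows_with_missing)
```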

### 3. Example of the hybrid method

In [14]:
df2 = df.dropna(thresh = 2).copy()
df2

|   | c0  | c1  | c2 | c3   |
|---|-----|-----|----|------|
| 0 | 1.0 | 5.0 | 9  | NaN  |
| 1 | 2.0 | NaN | 10 | NaN  |
| 3 | 4.0 | 8.0 | 12 | 16.0 |


In [15]:
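# assign the result back: calling fillna(..., inplace=True) on a selected
# column may not update df2 in recent pandas versions
df2["c1"] = df2["c1"].fillna(df2["c1"].mean())

In [16]:
# c2 has no missing values, so this fill leaves the column unchanged
df2["c2"] = df2["c2"].fillna(df2["c2"].median())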

In [17]:
df2

|   | c0  | c1  | c2 | c3   |
|---|-----|-----|----|------|
| 0 | 1.0 | 5.0 | 9  | NaN  |
| 1 | 2.0 | 6.5 | 10 | NaN  |
| 3 | 4.0 | 8.0 | 12 | 16.0 |


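**Note:** In the last example we use the ```copy()``` method to create an explicit copy of the dataframe. pandas cannot always tell whether the result of an operation such as ```dropna()``` is a view or a copy of the original, so modifying that result afterwards can trigger a warning.
If we hadn't used ```copy()```, we would have got the following warning: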

In [18]:
df3 = df.dropna(thresh= 2)

In [19]:
df3["c1"].fillna(df3["c1"].mean(), inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3["c1"].fillna(df3["c1"].mean(), inplace=True)
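One way to sidestep this warning is to avoid in-place modification altogether and build the result from methods that return new dataframes. The sketch below wraps the hybrid approach in a hypothetical helper, `drop_sparse_then_fill`, which is not part of pandas. Filling with column means is just one possible choice of statistic.

```python
import numpy as np
import pandas as pd


def drop_sparse_then_fill(frame: pd.DataFrame, min_non_missing: int) -> pd.DataFrame:
    """Drop rows with fewer than `min_non_missing` values, then fill the
    remaining gaps with the mean of each column (hypothetical helper)."""
    kept = frame.dropna(thresh=min_non_missing)
    # fillna() returns a new dataframe, so nothing is modified in place
    return kept.fillna(kept.mean())


df = pd.DataFrame({'c0': [1, 2, np.nan, 4],
                   'c1': [5, np.nan, np.nan, 8],
                   'c2': [9, 10, 11, 12],
                   'c3': [np.nan, np.nan, np.nan, 16]})

result = drop_sparse_then_fill(df, min_non_missing=2)
print(result)
```

Because every step returns a new object, the original `df` keeps its missing values and can be reused to try a different threshold or fill statistic.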
