# Missing Data

Here are a few convenient methods for dealing with missing data in pandas:

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})

In [4]:
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |


As you can see, the DataFrame contains several missing (`NaN`) values. In statistical analyses you may want to drop the observations (rows) or variables (columns) that contain them so the results stay meaningful. Use `df.dropna()` to drop those rows or columns. **The default is `axis=0`, so `df.dropna()` drops every row that contains at least one `NaN` value.** Passing `axis=1` drops every column that contains a `NaN` value instead.

In [6]:
df.dropna()

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |


In [8]:
df.dropna(axis=1)

|   | C |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
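To see why those rows and columns were dropped, it can help to count the missing values along each axis first. The following is a minimal sketch that rebuilds the same `df` so it runs on its own; the variable names (`nan_per_column`, `rows_kept`, and so on) are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Count the NaN values in each column and in each row
nan_per_column = df.isna().sum()
nan_per_row = df.isna().sum(axis=1)

# dropna() keeps only the rows that have no NaN values
rows_kept = df[nan_per_row == 0]

# dropna(axis=1) keeps only the columns that have no NaN values
cols_kept = df.loc[:, nan_per_column == 0]
```

Under these assumptions, `rows_kept` and `cols_kept` should match the results of `df.dropna()` and `df.dropna(axis=1)` respectively.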


We can also set a threshold with `thresh=n`: a row (or column) is kept only if it has at least `n` non-`NaN` values. With `thresh=2`, a row must contain at least 2 real values (not `NaN`) to be kept.

In [10]:
df.dropna(thresh=2)

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
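The threshold rule can be checked by hand by counting the non-`NaN` values in each row. This is a rough sketch on the same example data, not a description of how pandas implements `thresh` internally; the names `non_nan_per_row`, `kept`, and `by_column` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Number of real (non-NaN) values in each row
non_nan_per_row = df.notna().sum(axis=1)

# Keep the rows that have at least 2 real values
kept = df[non_nan_per_row >= 2]

# The same threshold can be applied to columns with axis=1
by_column = df.dropna(thresh=2, axis=1)
```

Row 2 has only one real value (in column C), so it is the only row removed. Along the columns, B has a single real value and is the one dropped.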


Instead of dropping `NaN` values, you may sometimes want to fill them in. As a first example, we fill them with the string `FILL VALUE` to make the replaced cells obvious. After that, we fill with the `mean()` of the column, which makes much more sense for numeric data.

In [12]:
df.fillna(value='FILL VALUE')

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 5 | 1 |
| 1 | 2 | FILL VALUE | 2 |
| 2 | FILL VALUE | FILL VALUE | 3 |


In [13]:
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |


Take column A. It has one missing value; the other values are 1.0 and 2.0. A good candidate for the missing value is therefore the `mean()` of the other data points, which is (1+2)/2 = 1.5 (`mean()` skips `NaN` values by default). Here is how to do it:

In [17]:
df['A'].fillna(value=df['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
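The same idea extends to every column at once: passing the Series returned by `df.mean()` to `fillna()` fills each column with its own mean. A short sketch on the example data, where the name `filled` is just for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# df.mean() gives one mean per column, ignoring NaN values
column_means = df.mean()

# Each NaN is replaced by the mean of the column it sits in
filled = df.fillna(column_means)
```

Column B has only one real value (5.0), so both of its missing entries are filled with 5.0. This is a reminder that a mean based on very few observations may not be a meaningful replacement.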

You can do the same for rows using the `df.loc[]` indexer. Remember that `fillna()` returns a new object by default (`inplace=False`), so the original `df` is not changed.

In [18]:
df.loc[1].fillna(df.loc[1].mean())

A    2.0
B    2.0
C    2.0
Name: 1, dtype: float64

In [20]:
df # the change is not permanent.

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |
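If you do want to keep the filled values, one common approach is to assign the result back to the column. The sketch below uses plain assignment rather than `inplace=True`; that is a choice made for this example, not the only way to do it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Assign the filled Series back so the change is stored in df
df['A'] = df['A'].fillna(df['A'].mean())

# Column A no longer has missing values; column B is untouched
remaining_nans = df.isna().sum()
```

After this, `df['A']` holds 1.0, 2.0, and 1.5, while column B still contains its two `NaN` values.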


# Great Job!