# DataScience Course $\circ$ Pandas $\circ$ Missing Data
By Dr Boontawee Suntisrivaraporn  
<img src="../miscs/logos/logo2.jpg" width=600> 
<hr>

### Topics
- missing data
- `null`, `NaN`, `NA`
- `isnull()`, `notnull()`
- `dropna()`, `fillna()`

Real-world data is rarely clean and ready to use. In particular, many datasets unavoidably have some values missing. Missing data may also be represented in heterogeneous ways, which makes it even harder to handle.

In [38]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

def view_data(df):
    """Draw a heatmap of df (missing cells are left blank) and return df."""
    sns.heatmap(df, cmap="viridis")
    return df

rs = np.random.RandomState(0)
mt = rs.randint(0, 100, 25).reshape(5, 5)
df1 = pd.DataFrame(mt, columns=list("uvxyz"), index=list("abcde"))

for i in range(10):
    df1.iloc[rs.randint(0, 5), rs.randint(0, 5)] = np.nan

# encode every missing value with the sentinel -999
df2 = df1.copy().fillna(-999).astype(int)
df2

|   | u | v | x | y | z |
|---|---|---|---|---|---|
| a | 44 | -999 | 64 | 67 | 67 |
| b | -999 | -999 | 21 | 36 | -999 |
| c | 70 | 88 | 88 | -999 | -999 |
| d | -999 | 39 | 87 | -999 | 88 |
| e | 81 | 37 | 25 | 77 | 72 |


## $\Rightarrow$ Walk-through coding
Follow along in class first, then work through the code step by step.

### None vs np.nan 
- `None` is a special Python object that can stand in for a missing value of any data type
- `np.nan` is a special floating-point value that represents Not-a-Number

Some real-world datasets use a domain-specific sentinel value to represent missing data, e.g. `-999`, `-`, `null`.
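As a minimal sketch of dealing with such a sentinel, the cell below rebuilds the `df2` table shown earlier (values transcribed by hand, so treat them as an assumption) and turns every `-999` back into `np.nan` with `replace()`.

```python
import numpy as np
import pandas as pd

# Values transcribed from the df2 table above; -999 marks a missing entry.
df2 = pd.DataFrame(
    [[44, -999, 64, 67, 67],
     [-999, -999, 21, 36, -999],
     [70, 88, 88, -999, -999],
     [-999, 39, 87, -999, 88],
     [81, 37, 25, 77, 72]],
    columns=list("uvxyz"), index=list("abcde"),
)

# Swap the sentinel for a proper missing value.
restored = df2.replace(-999, np.nan)

print(restored)
print(restored.isnull().sum().sum())  # total number of missing cells
```

When the data comes from a file, the `na_values` parameter of `pd.read_csv` can do the same conversion at load time.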

In [3]:
# object type is slower as all operations are at Python level
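One possible way to fill in this cell, with a made-up four-element array: putting `None` into a NumPy array forces the generic `object` dtype, and arithmetic on it then fails because `None` does not support `+`.

```python
import numpy as np

# None cannot be stored in a numeric array, so NumPy falls back to dtype=object.
arr_obj = np.array([1, None, 3, 4])
print(arr_obj.dtype)  # object

# Each element is an ordinary Python object, so summing runs 1 + None and fails.
try:
    arr_obj.sum()
except TypeError as err:
    print("TypeError:", err)
```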



In [2]:
# floating type is faster thanks to vectorized operations in compiled code
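A matching sketch for this cell, again with illustrative values: with `np.nan` the array keeps a native `float64` dtype, and pandas converts `None` to `NaN` on its own when the rest of the data is numeric.

```python
import numpy as np
import pandas as pd

# np.nan is itself a float, so the whole array is stored as float64.
arr_float = np.array([1, np.nan, 3, 4])
print(arr_float.dtype)  # float64

# pandas turns None into NaN in numeric data, giving a float64 Series.
s = pd.Series([1, None, 3])
print(s.dtype)           # float64
print(s.isnull().sum())  # 1
```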



NaN is contagious like a virus: any arithmetic that touches it yields NaN again, so handle it carefully.
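The short sketch below illustrates this propagation on a made-up array, and shows the NaN-aware alternatives in NumPy as well as the pandas default of skipping NaN.

```python
import numpy as np
import pandas as pd

vals = np.array([1.0, np.nan, 3.0, 4.0])

print(1 + np.nan)        # nan
print(vals.sum())        # nan, a single NaN spoils the whole sum
print(np.nan == np.nan)  # False, NaN is not even equal to itself

# NaN-aware variants ignore the missing entries.
print(np.nansum(vals))   # 8.0

# pandas reductions skip NaN by default (skipna=True).
print(pd.Series(vals).sum())  # 8.0
```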

### Operations on missing values

Because `np.nan` is contagious, one may consider dropping all missing values.
However, a dataframe must stay rectangular, so we cannot drop a single cell; we can only drop an entire row or an entire column.
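Before dropping anything, it can help to see where the gaps are. The sketch below uses `isnull()` and `notnull()` on a frame transcribed from the `df2` table shown earlier, with `-999` read back as missing (an assumption about what `df1` held).

```python
import numpy as np
import pandas as pd

nan = np.nan
# Transcribed from the df2 table above, with -999 read back as NaN.
df1 = pd.DataFrame(
    [[44, nan, 64, 67, 67],
     [nan, nan, 21, 36, nan],
     [70, 88, 88, nan, nan],
     [nan, 39, 87, nan, 88],
     [81, 37, 25, 77, 72]],
    columns=list("uvxyz"), index=list("abcde"),
)

print(df1.isnull())                # boolean mask, True where a value is missing
print(df1.isnull().sum())          # missing values per column
print(df1.notnull().sum(axis=1))   # available values per row
```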

In [4]:
# drop rows
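A possible version of this cell, using a frame transcribed from the `df2` table shown earlier (with `-999` read back as missing, which is an assumption): `dropna()` removes every row containing a missing value, and `thresh` relaxes that to a minimum count of available values.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Transcribed from the df2 table above, with -999 read back as NaN.
df1 = pd.DataFrame(
    [[44, nan, 64, 67, 67],
     [nan, nan, 21, 36, nan],
     [70, 88, 88, nan, nan],
     [nan, 39, 87, nan, 88],
     [81, 37, 25, 77, 72]],
    columns=list("uvxyz"), index=list("abcde"),
)

# Default: drop every row that has at least one missing value.
complete_rows = df1.dropna()
print(complete_rows)   # only row "e" survives

# Keep rows that have at least 4 non-missing values.
mostly_complete = df1.dropna(thresh=4)
print(mostly_complete)  # rows "a" and "e"
```

Note how aggressive the default is here: four of the five rows are lost.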


In [5]:
# drop columns
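The column-wise counterpart can be sketched the same way with `axis=1`, again on a frame transcribed from the `df2` table shown earlier (with `-999` read back as missing, which is an assumption).

```python
import numpy as np
import pandas as pd

nan = np.nan
# Transcribed from the df2 table above, with -999 read back as NaN.
df1 = pd.DataFrame(
    [[44, nan, 64, 67, 67],
     [nan, nan, 21, 36, nan],
     [70, 88, 88, nan, nan],
     [nan, 39, 87, nan, 88],
     [81, 37, 25, 77, 72]],
    columns=list("uvxyz"), index=list("abcde"),
)

# Drop every column that has at least one missing value.
complete_cols = df1.dropna(axis=1)
print(complete_cols)   # only column "x" survives

# Keep columns that have at least 3 non-missing values.
usable_cols = df1.dropna(axis=1, thresh=3)
print(usable_cols)     # all five columns qualify
```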


### Imputing data
- with `0`
- with a pre-defined value from domain knowledge
- with the last known value, looking forward / looking backward
- with the mode, the median, or arithmetic mean
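The first three options can be sketched on a single column. The series below holds column `u` as read from the `df2` table shown earlier, with `-999` taken to mean missing (an assumption); the domain value `-1` is purely illustrative.

```python
import numpy as np
import pandas as pd

# Column "u" from the table above, with -999 read back as NaN.
u = pd.Series([44, np.nan, 70, np.nan, 81], index=list("abcde"), name="u")

zero_filled = u.fillna(0)      # impute with 0
domain_filled = u.fillna(-1)   # impute with a value chosen from domain knowledge (hypothetical)
forward = u.ffill()            # carry the last known value forward
backward = u.bfill()           # take the next known value backward

print(pd.DataFrame({
    "original": u,
    "zero": zero_filled,
    "ffill": forward,
    "bfill": backward,
}))
```

Forward and backward filling only make sense when the row order carries meaning, for example in a time series.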

Impute with global median/average
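A minimal sketch of global imputation, assuming the frame transcribed from the `df2` table shown earlier (with `-999` read back as missing): compute one statistic over all available values, then use it for every gap.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Transcribed from the df2 table above, with -999 read back as NaN.
df1 = pd.DataFrame(
    [[44, nan, 64, 67, 67],
     [nan, nan, 21, 36, nan],
     [70, 88, 88, nan, nan],
     [nan, 39, 87, nan, 88],
     [81, 37, 25, 77, 72]],
    columns=list("uvxyz"), index=list("abcde"),
)

# One statistic over every non-missing cell of the frame.
global_median = np.nanmedian(df1.to_numpy())
global_mean = np.nanmean(df1.to_numpy())
print(global_median, global_mean)

# Every gap receives the same value.
filled_median = df1.fillna(global_median)
filled_mean = df1.fillna(global_mean)
print(filled_median)
```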

Impute with median/average of each variable (data column)
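Per-column imputation can be sketched by passing a Series of column statistics to `fillna()`, which matches it against the column labels. The frame is again transcribed from the `df2` table shown earlier, with `-999` read back as missing (an assumption).

```python
import numpy as np
import pandas as pd

nan = np.nan
# Transcribed from the df2 table above, with -999 read back as NaN.
df1 = pd.DataFrame(
    [[44, nan, 64, 67, 67],
     [nan, nan, 21, 36, nan],
     [70, 88, 88, nan, nan],
     [nan, 39, 87, nan, 88],
     [81, 37, 25, 77, 72]],
    columns=list("uvxyz"), index=list("abcde"),
)

# df1.median() and df1.mean() skip NaN and return one value per column.
by_median = df1.fillna(df1.median())
by_mean = df1.fillna(df1.mean())

print(df1.median())
print(by_median)
```

Unlike the global variant, each gap is filled from its own variable, so columns on different scales do not contaminate one another.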

<hr>
## Finished!