# Handling missing data in pandas

- Import the necessary libraries:

In [4]:
import pandas as pd
import numpy as np

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [3]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

- Check whether each value is `NaN`. In the output, *True* at index 2 marks the missing value:

In [5]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool
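The boolean mask returned by `isnull` can also be summarised instead of read row by row. The following is a minimal sketch that rebuilds the same Series; the variable names `present_mask`, `missing_count`, and `missing_labels` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

# notnull() is the inverse of isnull(): True where a value is present
present_mask = string_data.notnull()

# Summing a boolean mask counts its True entries, i.e. the missing values
missing_count = int(string_data.isnull().sum())

# Index labels of the entries that are missing
missing_labels = string_data.index[string_data.isnull()].tolist()

print(missing_count, missing_labels)
```

Counting with `sum()` is usually more practical than inspecting the mask once the Series has more than a handful of rows.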

* To mark a value as missing, assign `None` to it; `isnull` treats `None` the same way as `NaN`:

In [7]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
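As a small illustration of how `None` and `np.nan` are both reported as missing, the sketch below builds two throwaway Series. The names `text_series` and `num_series` are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Both None and np.nan are reported as missing by isnull()
text_series = pd.Series(['aardvark', None, np.nan])
text_mask = text_series.isnull().tolist()

# In a numeric Series, None is converted to NaN and the dtype becomes float64
num_series = pd.Series([1, None, 3])
num_dtype = str(num_series.dtype)
num_mask = num_series.isnull().tolist()

print(text_mask, num_dtype, num_mask)
```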

## Filtering out missing data

In [8]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

- To *drop* the entries that contain a `NaN` value, use `dropna`:
 - `dropna` returns a new object by default; passing `inplace=True` modifies `data` itself instead.

In [10]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64
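For a Series, `dropna` gives the same result as boolean indexing with `notnull`. The sketch below checks that equivalence and shows that the original Series is left untouched when `inplace` is not set; `dropped` and `filtered` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# dropna() returns a new Series without the NaN entries
dropped = data.dropna()

# Boolean indexing with notnull() selects the same entries
filtered = data[data.notnull()]

# The original Series still has all five entries
print(dropped.equals(filtered), len(data), len(dropped))
```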

* `dropna` also works on a *pandas DataFrame*, where by default it drops every row that contains at least one `NaN`:

In [12]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [13]:
cleaned = data.dropna()

In [14]:
cleaned

     0    1    2
0  1.0  6.5  3.0


* Passing `how='all'` drops only the rows in which every value is `NaN`:

In [16]:
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


* Passing `axis=1` applies the drop to columns instead of rows:

In [17]:
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [18]:
data[4] = np.nan
data

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN


In [19]:
data.dropna(axis=1, how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
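To see how `how` and `axis` interact, the following sketch rebuilds the same frame (including the all-`NaN` column `4`) and compares the four combinations. The result names such as `rows_any` are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data[4] = np.nan  # a column that is entirely NaN

# Default (how='any', axis=0): every row has a NaN in column 4, so none survive
rows_any = data.dropna()

# how='all': only row 2, which is NaN everywhere, is dropped
rows_all = data.dropna(how='all')

# axis=1 with the default how='any': every column contains a NaN, so none survive
cols_any = data.dropna(axis=1)

# axis=1 with how='all': only the all-NaN column 4 is dropped
cols_all = data.dropna(axis=1, how='all')

print(rows_any.shape, rows_all.shape, cols_any.shape, cols_all.shape)
```

With the default `how='any'`, a single all-`NaN` column is enough to empty the frame, which is why `how='all'` is often the safer first step.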


* To keep only the rows that contain at least a given number of non-`NaN` values, pass the `thresh` argument:

In [20]:
df = pd.DataFrame(np.random.randn(7, 3))
df

          0         1         2
0  1.784113 -0.574788 -0.798074
1  0.284474  0.178317  0.679814
2  1.330352  0.043382  0.354472
3 -1.053110 -0.139292  0.140210
4  0.443426 -0.065950  0.961970
5 -0.056108  0.582005  0.133648
6 -1.335600 -0.112706 -1.075440


In [26]:
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

In [27]:
df

          0         1         2
0  1.784113       NaN       NaN
1  0.284474       NaN       NaN
2  1.330352       NaN  0.354472
3 -1.053110       NaN  0.140210
4  0.443426 -0.065950  0.961970
5 -0.056108  0.582005  0.133648
6 -1.335600 -0.112706 -1.075440


In [28]:
df.dropna()

          0         1         2
4  0.443426 -0.065950  0.961970
5 -0.056108  0.582005  0.133648
6 -1.335600 -0.112706 -1.075440


In [29]:
df.dropna(thresh=2)

          0         1         2
2  1.330352       NaN  0.354472
3 -1.053110       NaN  0.140210
4  0.443426 -0.065950  0.961970
5 -0.056108  0.582005  0.133648
6 -1.335600 -0.112706 -1.075440
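The effect of `thresh` can be checked by counting the non-`NaN` values in each row. This sketch uses a deterministic frame built with `np.arange` instead of the random data above, so the numbers differ from the notebook output; `non_null_per_row`, `kept`, and `manual` are illustrative names.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame, with the same NaN pattern
df = pd.DataFrame(np.arange(21, dtype=float).reshape(7, 3))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

# Number of non-NaN values in each row
non_null_per_row = df.notnull().sum(axis=1)

# thresh=2 keeps rows with at least two non-NaN values
kept = df.dropna(thresh=2)

# The same selection written as an explicit boolean filter
manual = df[non_null_per_row >= 2]

print(non_null_per_row.tolist(), kept.index.tolist())
```

Rows 0 and 1 hold only one non-`NaN` value each, so they are the only rows removed.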


## Filling missing data

* Fill any `NaN` value with 0:

In [31]:
df.fillna(0)

          0         1         2
0  1.784113  0.000000  0.000000
1  0.284474  0.000000  0.000000
2  1.330352  0.000000  0.354472
3 -1.053110  0.000000  0.140210
4  0.443426 -0.065950  0.961970
5 -0.056108  0.582005  0.133648
6 -1.335600 -0.112706 -1.075440
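`fillna` is not limited to a single constant. As a hedged sketch that goes slightly beyond the notebook, the example below fills with a different value per column (by passing a dict keyed on column label) and with each column's mean. It uses a deterministic frame instead of the random data above, and the fill values `0.5` and `0` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame, with the same NaN pattern
df = pd.DataFrame(np.arange(21, dtype=float).reshape(7, 3))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

# One constant for every NaN
filled_zero = df.fillna(0)

# A different constant per column, keyed by column label
filled_per_col = df.fillna({1: 0.5, 2: 0})

# Each column's own mean (computed over its non-NaN values)
filled_mean = df.fillna(df.mean())

# fillna returns a new frame, so df still contains its six NaN values
print(int(df.isnull().sum().sum()), int(filled_zero.isnull().sum().sum()))
```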
