# Missing Data

Let's look at a few convenient methods for dealing with missing data in pandas:

```python
import numpy as np
import pandas as pd

# Declare a dictionary
d = {'A':[1,2,np.nan],
     'B':[5,np.nan,np.nan],
     'C':[1,2,3]}

# Convert dictionary into pandas dataframe
df = pd.DataFrame(d)

# output dataframe
df
```

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |


<hr>
<br>
<br>

## <span style="color:red"> Checking For Null Values </span>
When you load a DataFrame, it is often useful to get a quick overview of how many null values it contains.

### `.isnull()`

```python
df.isnull()
```

|   | A | B | C |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | True | False |
| 2 | True | True | False |


```python
df.isnull().sum()
```
`.sum()` adds up the `True` values down each column, so the result has one entry per column: the number of nulls in that column.

A    1
B    2
C    0
dtype: int64
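
For a slightly richer overview, the same boolean mask can be averaged to get the fraction of missing values per column, or summed twice for a single total. This is a minimal sketch that rebuilds the example `df` so it runs on its own; the names `missing_fraction` and `total_missing` are illustrative and not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Same example data as above, rebuilt so this block is self-contained
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Mean of a boolean mask = fraction of True (missing) entries per column
missing_fraction = df.isnull().mean()

# Summing the per-column counts gives the total number of missing cells
total_missing = int(df.isnull().sum().sum())

print(missing_fraction)
print(total_missing)
```

`.isna()` is an alias of `.isnull()`, so either spelling gives the same mask.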

<hr>
<br>
<br>

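**Note:** None of the methods below modify `df` in place. Each returns a new object, so assign the result back (for example `df = df.dropna()`) if you want to keep the change.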
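
A minimal sketch of what "not in place" means in practice, assuming the same example data as above (rebuilt here so the block is self-contained). The names `cleaned` and `df_filled` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same example data as above, rebuilt so this block is self-contained
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# dropna() returns a new DataFrame; the original still holds its NaNs
cleaned = df.dropna()
print(len(df), len(cleaned))

# To keep a change, bind the returned object to a name
df_filled = df.fillna(0)
print(df.isnull().sum().sum(), df_filled.isnull().sum().sum())
```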

## <span style="color:red"> Drop Rows or Columns </span>
Sometimes it is appropriate to simply drop a `row`, a `column`, or several of either in order to deal with missing data.

### `.dropna()`

#### `Default Behavior`: Drop ALL `rows` with ANY NaN values.

```python
df.dropna()
```

The rows that remain keep their original index labels; `.dropna()` does not renumber the index.

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |

<br>

#### `axis=1`: Drop ALL `columns` with ANY `NaN` values.

```python
df.dropna(axis=1)
```

|   | C |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |


#### `axis=1`, `thresh=2`: Keep only `columns` with at least 2 non-`NaN` values (in this 3-row `df`, that drops columns with 2 or more `NaN`s).

```python
df.dropna(axis= 1, thresh= 2)
```

|   | A | C |
|---|---|---|
| 0 | 1.0 | 1 |
| 1 | 2.0 | 2 |
| 2 | NaN | 3 |
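
Because `thresh` counts non-`NaN` values rather than `NaN`s, the result depends on how many rows the frame has. The sketch below uses a hypothetical 4-row frame, `wide` (not from the original walkthrough), in which a column with two `NaN`s is still kept. It also shows `how='all'`, which drops a column only when every value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame (not from the text)
wide = pd.DataFrame({'X': [1, 2, np.nan, np.nan],       # 2 non-NaN, 2 NaN
                     'Y': [1, np.nan, np.nan, np.nan],  # 1 non-NaN, 3 NaN
                     'Z': [1, 2, 3, 4]})                # 4 non-NaN

# thresh=2 keeps columns with at least 2 non-NaN values, so X survives
kept = wide.dropna(axis=1, thresh=2)
print(list(kept.columns))

# how='all' drops a column only when every value in it is NaN
kept_all = wide.dropna(axis=1, how='all')
print(list(kept_all.columns))
```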


<hr>
<br>
<br>

## <span style="color:red"> Fill in NaNs </span>
Sometimes it is appropriate to fill in the missing data, either by some specified value or by some aggregate statistic, like the median of a certain column.

### `.fillna()`

#### Fill ALL `NaN`s in `df` with some value.

```python
df.fillna(value='FILL VALUE')
```

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 5 | 1 |
| 1 | 2 | FILL VALUE | 2 |
| 2 | FILL VALUE | FILL VALUE | 3 |


#### Fill all `NaN`s in `series` with `median` of that `series`.

```python
df['A'].fillna(value= df['A'].median())
```

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64

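Because `.fillna()` returns a new object, assign the result back to the column to keep the filled values:

In [10]:
df['A'] = df['A'].fillna(value= df['A'].median())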

In [11]:
print(df)

     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  1.5  NaN  3
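
The same idea extends from one series to the whole DataFrame. As a minimal sketch, passing `df.median()` to `.fillna()` fills every column with its own median, and a dict lets each column use a different fill value. The example data is rebuilt from scratch here, since `df['A']` was overwritten above. The names `filled` and `custom` are illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as at the top of this section, rebuilt from scratch
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# df.median() returns one median per column (NaNs are skipped by default);
# passing that Series to fillna() fills each column with its own median
filled = df.fillna(value=df.median())
print(filled)

# A dict maps column names to fill values when columns need different rules
custom = df.fillna(value={'A': 0, 'B': df['B'].mean()})
print(custom)
```

Neither call changes `df` itself; both return new DataFrames.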
