# Working with Missing Values in Pandas

In [2]:
import pandas as pd
import numpy as np

Create a DataFrame with a few missing values for demonstration purposes.

In [4]:
# use np.nan to create missing values - nan = "not a number"
df_demo = pd.DataFrame({'A': [1, 2, 4, np.nan, 5, np.nan, 7],
                        'B': [4, np.nan, 7, np.nan, 1, np.nan, 2]})

df_demo

Unnamed: 0,A,B
0,1.0,4.0
1,2.0,
2,4.0,7.0
3,,
4,5.0,1.0
5,,
6,7.0,2.0


### The easiest way to check for missing values is with the isna() method.
- The result is a DataFrame of booleans with the same shape as the original.
- True = the value is missing, False = the value is NOT missing

In [5]:
df_demo.isna()

Unnamed: 0,A,B
0,False,False
1,False,True
2,False,False
3,True,True
4,False,False
5,True,True
6,False,False
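
The boolean output can also be used as a filter. A minimal sketch, assuming the same demo data as above (the names `row_has_missing` and `rows_with_missing` are only illustrative), that pulls out the rows containing at least one missing value:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 4, np.nan, 5, np.nan, 7],
                        'B': [4, np.nan, 7, np.nan, 1, np.nan, 2]})

# True for every row that has at least one missing value
row_has_missing = df_demo.isna().any(axis=1)

# keep only the rows flagged above
rows_with_missing = df_demo[row_has_missing]
rows_with_missing
```

Looking at these rows first can help when deciding whether to drop or fill them.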


### Get a count of missing values
- Chaining sum() onto isna() returns a Series with the number of missing values in each column (each True counts as 1).

In [7]:
df_demo.isna().sum()

A    2
B    3
dtype: int64
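
Counts are easier to judge next to the size of the data. A small sketch, assuming the same demo data (the variable names are illustrative), that gets the share of missing values per column and the total across the whole DataFrame:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 4, np.nan, 5, np.nan, 7],
                        'B': [4, np.nan, 7, np.nan, 1, np.nan, 2]})

# the mean of a boolean column is the fraction of True values,
# i.e. the share of missing values in that column
missing_share = df_demo.isna().mean()

# summing the per-column counts gives the total for the whole DataFrame
total_missing = df_demo.isna().sum().sum()

print(missing_share)
print(total_missing)
```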

### Drop rows that contain missing values
- Use dropna(), but be aware that by default it drops any row that has at least one missing value, even if the other columns in that row have data.
- See example below.

In [8]:
# default is how="any" meaning any row where a null exists gets dropped.
df_demo.dropna()

Unnamed: 0,A,B
0,1.0,4.0
2,4.0,7.0
4,5.0,1.0
6,7.0,2.0


### Change the method's parameters
- The default is how="any", which produces the above result.
- Change it to "all" to only drop rows where ALL values are missing (here, both A and B). Keep in mind that rows with a value in at least one column are kept, so the result can still contain nulls.

In [9]:
df_demo.dropna(how="all")

Unnamed: 0,A,B
0,1.0,4.0
1,2.0,
2,4.0,7.0
4,5.0,1.0
6,7.0,2.0
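
Between "any" and "all" there is also the thresh parameter of dropna(), which keeps rows that have at least that many non-missing values. A hedged sketch: column C below is made up for this example (it is not part of the original data) so that the three options give different results.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 4, np.nan, 5, np.nan, 7],
                        'B': [4, np.nan, 7, np.nan, 1, np.nan, 2]})

# hypothetical third column, added only to illustrate thresh
df_wide = df_demo.assign(C=[1, 2, np.nan, np.nan, 5, 6, 7])

# keep rows that have at least 2 non-missing values
at_least_two = df_wide.dropna(thresh=2)
at_least_two
```

Here row 3 (no values at all) and row 5 (only C present) are dropped, while rows 1 and 2 are kept even though each has one null. Use thresh on its own rather than together with how.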


### Limit the check to specific columns with subset=
- Pass subset=[col names] so that only those columns are checked for missing values; a row is dropped when it is missing a value in one of them.
- dropna() returns a new DataFrame and leaves the original untouched. To change the DataFrame itself, pass inplace=True. It's also not a bad idea to make a copy before doing this (this way we keep the original as the source of truth for our data).
- Examples below.

In [10]:
df_demo.dropna(subset = ["A"])

Unnamed: 0,A,B
0,1.0,4.0
1,2.0,
2,4.0,7.0
4,5.0,1.0
6,7.0,2.0


In [11]:
# make a copy and change it in place
df_demo_copy = df_demo.copy()

df_demo_copy.dropna(how="all", inplace=True)

df_demo_copy

Unnamed: 0,A,B
0,1.0,4.0
1,2.0,
2,4.0,7.0
4,5.0,1.0
6,7.0,2.0
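
An alternative to inplace=True is to assign the result to a new name. This is only a sketch of that pattern with the same demo data (`df_clean` is an illustrative name); it leaves the original DataFrame as it was, so no explicit copy is needed.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 4, np.nan, 5, np.nan, 7],
                        'B': [4, np.nan, 7, np.nan, 1, np.nan, 2]})

# dropna() returns a new DataFrame, so df_demo itself is not changed
df_clean = df_demo.dropna(how="all")

# optional: renumber the rows so the index has no gaps
df_clean = df_clean.reset_index(drop=True)

print(len(df_demo), len(df_clean))
```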


## Replacing (filling) missing values with fillna()
- Use fillna()
- Specify the value to use (can be a hardcoded constant, the mean, the median, etc.)
- Think first about what is better for your model: dropping or filling? How will it affect your data?

In [14]:
# use the original data frame, df_demo. See how handy that copy already is? We still have the original data!
# Let's hardcode 100 (every missing value becomes 100) just as an example:
df_demo.fillna(value=100)

Unnamed: 0,A,B
0,1.0,4.0
1,2.0,100.0
2,4.0,7.0
3,100.0,100.0
4,5.0,1.0
5,100.0,100.0
6,7.0,2.0


A constant is rarely the right answer for numeric data (it is more reasonable when filling in a string value), so let's use the mean.
- Create a variable that holds the mean of column A

In [15]:
mean_value = df_demo["A"].mean()
# view the variable value and add it to the method
mean_value

3.8

Passing the mean variable to the method replaces ALL the missing values in the DataFrame, including those in column B, with the mean of column A.
- fillna() returns a new DataFrame; use inplace=True to make the change permanent.

In [16]:
df_demo.fillna(value=mean_value)

Unnamed: 0,A,B
0,1.0,4.0
1,2.0,3.8
2,4.0,7.0
3,3.8,3.8
4,5.0,1.0
5,3.8,3.8
6,7.0,2.0


Replace each column's missing values with that column's own mean.
- df_demo.mean() returns one mean per column, so the fill value for A (3.8) differs from the one for B (3.5)
- Use inplace=True to make it permanent

In [18]:
df_demo.fillna(value=df_demo.mean())

Unnamed: 0,A,B
0,1.0,4.0
1,2.0,3.5
2,4.0,7.0
3,3.8,3.5
4,5.0,1.0
5,3.8,3.5
6,7.0,2.0
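
fillna() also accepts a dictionary that maps column names to fill values, which allows a different strategy per column. A minimal sketch with the same demo data; the choice of the median for A and 0 for B is arbitrary and only for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 4, np.nan, 5, np.nan, 7],
                        'B': [4, np.nan, 7, np.nan, 1, np.nan, 2]})

# one fill value per column: median for A, a constant 0 for B
fill_values = {"A": df_demo["A"].median(), "B": 0}

df_filled = df_demo.fillna(value=fill_values)
df_filled
```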


Fill only column A with its mean value.

In [17]:
df_demo["A"].fillna(value=mean_value)

0    1.0
1    2.0
2    4.0
3    3.8
4    5.0
5    3.8
6    7.0
Name: A, dtype: float64
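
The call above only returns a filled Series. One way to keep the result, sketched here with the same demo data (`df_filled` is an illustrative name), is to assign the filled column back. Assigning back is generally more predictable than calling fillna(inplace=True) on a selected column, which in recent pandas versions may not update the parent DataFrame.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 4, np.nan, 5, np.nan, 7],
                        'B': [4, np.nan, 7, np.nan, 1, np.nan, 2]})

# work on a copy so the original stays available
df_filled = df_demo.copy()

# fill column A with its own mean and store the result back in the column
df_filled["A"] = df_filled["A"].fillna(df_filled["A"].mean())

# column B is untouched and still has its missing values
df_filled.isna().sum()
```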

#### For other ways to deal with missing values, see scikit-learn's SimpleImputer and KNNImputer and 