# Unit 3: EDA (Exploratory Data Analysis)

---

## Lesson 3.4: Dealing With Missing Values

In [55]:
# First we will load all the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

In [56]:
# Loading the Dataset
df = pd.read_csv("Data/titanic_leapcode_train.csv")

First, let us discuss what a missing value is and why we need to handle it.

### What is a Missing Value?

A missing value is a cell in the dataset that is empty.  
This can happen when:
- A person skipped a question while the data was being collected
- The data was never collected
- Something went wrong while saving the data

### Why It’s a Problem

Most machine learning models **can’t handle missing values**. Many scikit-learn estimators, for example, raise an error if the training data contains one.
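
As a small illustration, the sketch below tries to fit a scikit-learn `LinearRegression` on a tiny made-up array that contains one missing value. The data and variable names are assumptions for demonstration only and are not part of the Titanic dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up feature matrix with one missing entry (np.nan)
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn rejects the input because it contains NaN
    fit_failed = True
    print("Fit failed:", err)
```

Not every model behaves this way, but it is safest to assume missing values need handling before training.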

---

### How to Handle the Missing Values

In [57]:
# Count the null values in each column
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [58]:
# prints first 10 rows
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


From the output of .isnull().sum() and the preview from .head(), we know that the columns Age (177 missing), Cabin (687 missing), and Embarked (2 missing) have missing values.
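
Raw counts can be easier to judge as percentages. The sketch below shows one way to get them, using a small made-up dataframe (`toy` is a hypothetical stand-in, not the Titanic data) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Small made-up dataframe standing in for the real data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction of True
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

The same expression should work on `df` to show what share of each column is missing.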

There are two main methods of dealing with missing values:

### 1. Drop Missing Values

We can drop columns that have missing data.

In [59]:
# Get names of columns with missing values
cols_with_missing = [col for col in df.columns
                     if df[col].isnull().any()]

# Drop the columns with missing values
df_dropped = df.drop(cols_with_missing, axis=1)

df_dropped

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.2500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.9250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1000
4,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.0500
...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,0,0,211536,13.0000
887,888,1,1,"Graham, Miss. Margaret Edith",female,0,0,112053,30.0000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,1,2,W./C. 6607,23.4500
889,890,1,1,"Behr, Mr. Karl Howell",male,0,0,111369,30.0000


The new dataframe no longer has the columns Age, Cabin, and Embarked because they contained missing values.

#### Pros

- Simple and easy to do
- No guessing values in place of the missing ones


#### Cons

- You may lose important data if many columns are dropped. Here the whole Age column is removed even though only 177 of its values are missing

- Can leave too little data to train a model well
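
pandas also provides `dropna`, which can do the same column drop in one call or drop rows instead. The sketch below uses a small made-up dataframe (`toy` is hypothetical) to show both options.

```python
import numpy as np
import pandas as pd

# Small made-up dataframe standing in for the real data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Drop every column that contains at least one missing value
cols_dropped = toy.dropna(axis=1)

# Drop only the rows where Age is missing, keeping all columns
rows_dropped = toy.dropna(subset=["Age"])

print(cols_dropped.columns.tolist())
print(len(rows_dropped))
```

Dropping rows keeps every column but shrinks the number of examples, so which option fits better depends on how the missing values are spread.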

### 2. Impute Missing Values

We can use SimpleImputer from sklearn to replace missing values with:

- The mean (average)

- The median (middle number)

- The mode (most frequent value)

To impute the dataframe with the most frequent value in each column, we could do it like this:

In [61]:
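from sklearn.impute import SimpleImputer

# Replace each missing value with the most frequent value in its column
imputer = SimpleImputer(strategy='most_frequent')
# fit_transform returns a NumPy array, so wrap it back into a dataframe
pd.DataFrame(imputer.fit_transform(df), columns=df.columns)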

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,B96 B98,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,B96 B98,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,B96 B98,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,B96 B98,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,24.0,1,2,W./C. 6607,23.45,B96 B98,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C


#### Pros

- Keeps all rows and columns in your dataset
- Avoids throwing away useful information


#### Cons

- You're guessing the missing value, so it might be wrong

- Can introduce errors or bias in the model

### When Should We Use Mean, Median, or Mode Imputation?

#### Use the mean when:

- The numbers have a normal distribution, meaning the data is shaped like a bell curve

- There are no big outliers

<img src="../Photos/BellCurve.png" width="300">

#### Use the median when:

- The data has outliers (very high or low values)

- The data is skewed (lopsided, with a long tail on one side)
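
To see why outliers matter, the sketch below compares the mean and median of a few made-up fare values (assumed for illustration only), one of which is an extreme outlier.

```python
import numpy as np

# Made-up fares: four typical values and one extreme outlier
fares = np.array([7.0, 8.0, 8.0, 9.0, 500.0])

mean_fare = fares.mean()        # pulled upward by the outlier
median_fare = np.median(fares)  # stays at a typical value
print(mean_fare, median_fare)   # 106.4 8.0
```

Filling a missing fare with 106.4 would not resemble any typical passenger here, while 8.0 would.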

#### Use the mode when:

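- The data is categorical and the most frequent category is a reasonable stand-in for the missing values
- It's best for categorical data because we cannot take the mean or median of a categorical column
- In this case we used it because the dataframe contains categorical columns (Cabin and Embarked), so the mean and median could not be applied to every column. Note that this also filled the numeric Age column with its most frequent value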
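
The guidance above can be combined by imputing each column with a strategy that suits its type. The following is a minimal sketch on a small made-up dataframe (`toy`, `age_imputer`, and `embarked_imputer` are hypothetical names), using the median for a numeric column and the mode for a categorical one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small made-up dataframe with one numeric and one categorical column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 80.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Median for the numeric column (less affected by the outlier 80.0)
age_imputer = SimpleImputer(strategy="median")
# Most frequent value for the categorical column
embarked_imputer = SimpleImputer(strategy="most_frequent")

# fit_transform expects 2-D input, so select each column with double brackets
imputed = pd.DataFrame({
    "Age": age_imputer.fit_transform(toy[["Age"]]).ravel(),
    "Embarked": embarked_imputer.fit_transform(toy[["Embarked"]]).ravel(),
})
print(imputed)
```

The missing Age becomes 26.0 (the median of 22, 26, and 80) and the missing Embarked becomes "S" (the most frequent category).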