## Missing Values

### Missing Data Mechanisms

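1. **MCAR (Missing Completely at Random):** every observation has the same probability of being missing, so the missingness introduces no bias.
2. **MAR (Missing at Random):** the probability of being missing depends on other observed variables, e.g. if men are more likely to disclose their weight than women, missingness in weight depends on sex.
3. **MNAR (Missing Not at Random):** the probability of being missing depends on the missing value itself, e.g. if people failed to fill in a **depression survey** because of their level of depression.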
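
To make the three mechanisms concrete, the following is a minimal sketch on simulated data. The `toy` DataFrame, its `sex` and `weight` columns, and all probabilities are made-up assumptions for illustration; they are not part of the datasets used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 10_000

# Hypothetical toy population: sex and weight are made-up variables.
toy = pd.DataFrame({
    'sex': rng.choice(['male', 'female'], size=n),
    'weight': rng.normal(loc=75, scale=12, size=n),
})

# MCAR: every value has the same 10% chance of being missing.
mcar_mask = rng.random(n) < 0.10

# MAR: the chance depends only on an observed variable (sex), not on weight.
p_mar = np.where(toy['sex'] == 'female', 0.30, 0.10)
mar_mask = rng.random(n) < p_mar

# MNAR: the chance depends on the value that goes missing (assumed rule:
# people above 85 are more likely to hide their weight).
p_mnar = np.where(toy['weight'] > 85, 0.40, 0.05)
mnar_mask = rng.random(n) < p_mnar

# Series.mask replaces the values where the condition is True with NaN.
toy['weight_mcar'] = toy['weight'].mask(mcar_mask)
toy['weight_mar'] = toy['weight'].mask(mar_mask)
toy['weight_mnar'] = toy['weight'].mask(mnar_mask)

# Missing rate per sex: similar under MCAR, clearly different under MAR.
rates_by_sex = toy[['weight_mcar', 'weight_mar']].isnull().groupby(toy['sex']).mean()
print(rates_by_sex)

# Under this MNAR rule the observed mean falls below the true mean.
print(toy['weight'].mean(), toy['weight_mnar'].mean())
```

In this sketch the MAR pattern can be detected from the observed `sex` column, whereas the MNAR pattern cannot be detected from the observed data alone, because the variable driving it is exactly the one that is missing.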

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)  # display all columns of the dataset

In [20]:
data = pd.read_csv('titanic.csv')
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"


In pandas, missing values are represented as NaN.

In [21]:
data.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

There are **263 missing values for Age, 1014 for Cabin and 2 for Embarked**.

In [22]:
data.isnull().mean()

pclass       0.000000
survived     0.000000
name         0.000000
sex          0.000000
age          0.200917
sibsp        0.000000
parch        0.000000
ticket       0.000000
fare         0.000764
cabin        0.774637
embarked     0.001528
boat         0.628724
body         0.907563
home.dest    0.430863
dtype: float64

There are missing data in the variables **Age (20% missing), Cabin (77% missing), and Embarked (~0.2% missing)**; Embarked is the port where the passenger boarded the Titanic.
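
The counts and fractions above can also be combined into a single summary table. This is a minimal sketch; the small `toy` DataFrame is a hypothetical stand-in for the Titanic data so that the example runs on its own, and the column names `n_missing` and `pct_missing` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data, only to keep the sketch self-contained.
toy = pd.DataFrame({
    'age': [29.0, np.nan, 2.0, 30.0, np.nan],
    'cabin': ['B5', np.nan, np.nan, np.nan, 'C22'],
    'embarked': ['S', 'S', 'C', 'S', 'Q'],
})

summary = pd.DataFrame({
    'n_missing': toy.isnull().sum(),           # count of NaN per column
    'pct_missing': toy.isnull().mean() * 100,  # share of NaN per column, in percent
})

# Keep only the columns that actually have missing values, worst first.
summary = summary[summary['n_missing'] > 0].sort_values('pct_missing', ascending=False)
print(summary)
```

Applied to the real dataset, the same two expressions would be computed on `data` instead of `toy`.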

### Missing data Not At Random (MNAR): Systematic missing values

In the Titanic dataset, we would expect more missing values in **age** and **cabin** for people who did not survive, because the people who survived could still be asked for that information afterwards.

In [23]:
data['cabin_null'] = np.where(data['cabin'].isnull(), 1, 0)

In [24]:
data.groupby(['survived'])['cabin_null'].mean()

survived
0    0.873918
1    0.614000
Name: cabin_null, dtype: float64

In [25]:
data['cabin'].isnull().groupby(data['survived']).mean()  # another way for the above!

survived
0    0.873918
1    0.614000
Name: cabin, dtype: float64

The percentage of missing values in cabin is higher for people **who did not survive** (87%) than for people **who survived** (61%).

In [26]:
data['age_null'] = np.where(data['age'].isnull(), 1, 0)
data.groupby(['survived'])['age_null'].mean()

survived
0    0.234858
1    0.146000
Name: age_null, dtype: float64

In [27]:
data['age'].isnull().groupby(data['survived']).mean()

survived
0    0.234858
1    0.146000
Name: age, dtype: float64

The proportion of missing age values is also higher for the people who did not survive the tragedy (23% vs. 15%).
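
The per-variable comparisons above follow the same pattern, so they can be wrapped in a small helper that handles several columns at once. This is a sketch only: `missing_rate_by_group` is a hypothetical helper name, and the `toy` DataFrame is made-up data used to keep the example self-contained.

```python
import numpy as np
import pandas as pd

def missing_rate_by_group(df, group_col, cols):
    """Fraction of missing values in each of `cols`, per level of `group_col`."""
    return df[cols].isnull().groupby(df[group_col]).mean()

# Made-up data with the same column names as the Titanic example.
toy = pd.DataFrame({
    'survived': [0, 0, 0, 0, 1, 1, 1, 1],
    'age': [np.nan, np.nan, 40.0, 22.0, 30.0, np.nan, 18.0, 50.0],
    'cabin': [np.nan, np.nan, np.nan, 'C22', 'B5', np.nan, 'E12', np.nan],
})

rates = missing_rate_by_group(toy, 'survived', ['age', 'cabin'])
print(rates)
```

On the real dataset, the call would be `missing_rate_by_group(data, 'survived', ['age', 'cabin'])`, which avoids creating the extra `cabin_null` and `age_null` columns.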

### Missing data Completely At Random (MCAR)

In [11]:
data[data['embarked'].isnull()]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,cabin_null,age_null
168,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,,6,,,0,0
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,,6,,"Cincinatti, OH",0,0


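These 2 women were traveling together; Miss Icard was the maid of Mrs Stone. Both survived, so they could have been asked where they boarded, and nothing else in their records suggests why **embarked** is missing. With only two such cases, it is reasonable to treat these values as missing completely at random.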

### Missing data At Random (MAR)

In the fictitious **loan book toy dataset** example, missing values in employment are associated with missing values in time_employed.

In [30]:
data = pd.read_csv('loan.csv', usecols=['employment', 'time_employed'])
data.head()

Unnamed: 0,employment,time_employed
0,Teacher,<=5 years
1,Accountant,<=5 years
2,Statistician,<=5 years
3,Other,<=5 years
4,Bus driver,>5 years


In [31]:
data.isnull().mean()  # Check the percentage of missing data

employment       0.0611
time_employed    0.0529
dtype: float64

Both variables have a **similar percentage** of missing observations (about 6% and 5%).

In [14]:
data['employment'].unique()  # Examples of employments

array(['Teacher', 'Accountant', 'Statistician', 'Other', 'Bus driver',
       'Secretary', 'Software developer', 'Nurse', 'Taxi driver', nan,
       'Civil Servant', 'Dentist'], dtype=object)

In [15]:
print('Number of employments: {}'.format(    # Their numbers
    len(data['employment'].unique())))

Number of employments: 12


The output shows the **missing information (nan)** alongside 11 distinct employments; unique() counts nan as one of the 12 values.

In [32]:
data['time_employed'].unique()  # The variable time employed

array(['<=5 years', '>5 years', nan], dtype=object)

A customer can't enter a value for **employment time** if they are **not employed**. They could be **students, retired, self-employed, or working in the home**. So we would expect missingness in these 2 variables to be related.

**Percentage of missing data in time_employed for the customers who declared employment:**

In [35]:
t = data[~data['employment'].isnull()]  # customers who declared employment
t['time_employed'].isnull().mean()  # percentage of missing data in time employed

0.0005325380764724678

**Percentage of missing data in time_employed for the customers who did not declare employment:**

In [36]:
t = data[data['employment'].isnull()]
t['time_employed'].isnull().mean()

0.8576104746317512

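The number of borrowers who reported an occupation and have missing values in time_employed is minimal (about 0.05%), whereas customers who did not report an occupation or employment mostly have missing values in time_employed as well (about 86%). The missingness in time_employed therefore depends on another observed variable, employment, which is what characterises MAR.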
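
Both conditional percentages can be read from a single cross-tabulation of the two missingness indicators. The sketch below uses a made-up `toy` DataFrame in place of `loan.csv`, and the labels `missing` and `reported` are chosen here for readability.

```python
import numpy as np
import pandas as pd

# Made-up data with the same column names as the loan example.
toy = pd.DataFrame({
    'employment': ['Teacher', 'Nurse', np.nan, np.nan,
                   'Dentist', np.nan, 'Other', 'Teacher'],
    'time_employed': ['<=5 years', '>5 years', np.nan, np.nan,
                      '>5 years', '<=5 years', np.nan, '<=5 years'],
})

# Turn each column into a readable missingness indicator.
labels = {True: 'missing', False: 'reported'}
emp = toy['employment'].isnull().map(labels).rename('employment')
time = toy['time_employed'].isnull().map(labels).rename('time_employed')

# normalize='index' makes each row sum to 1, i.e. the share of missing and
# reported time_employed within each employment group.
table = pd.crosstab(emp, time, normalize='index')
print(table)
```

A large gap between the two rows in the `missing` column is the pattern seen in the loan data above: missingness in one variable is strongly tied to missingness in the other.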