## Missing values

Missing data is the absence of values in certain observations of a variable. It is an unavoidable problem in most data sources and may have a significant impact on the conclusions that we derive from the data.


## Why is the data missing?

The source of missing data can vary. These are just some examples:

- The value was forgotten, lost, or not stored properly.

- The value does not exist.

- The value can't be known or identified.


In [19]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# To display all columns in the dataset.
pd.set_option('display.max_columns', None)

In [20]:
# Let's load the Titanic dataset.
data = pd.read_csv('shipdata.csv')

# Let's inspect the first 5 rows.
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [21]:
# We can quantify the missing values using
# the isnull() method plus the sum() method:

data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

There are 177 missing values for Age, 687 for Cabin and 2 for Embarked.

In [22]:
# We can also use the mean() method after isnull()
# to obtain the fraction of missing values:

data.isnull().mean()

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

About 20% of the values in the variable Age are missing.

About 77% of the values in the variable Cabin (the cabin in which the passenger was traveling) are missing.

About 0.2% of the values in the variable Embarked (the port from which the passenger boarded the Titanic) are missing.
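
The counts and the fractions can also be collected in a single summary table. The sketch below is a minimal illustration on a small made-up DataFrame: `toy` and `missing_summary` are hypothetical names, and the values are not taken from the Titanic file. With the real data, `data` would take the place of `toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for three of the Titanic columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Count and fraction of missing values per column, side by side.
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "fraction_missing": toy.isnull().mean(),
})

# Sort so that the most incomplete variables come first.
missing_summary = missing_summary.sort_values(
    "fraction_missing", ascending=False
)

print(missing_summary)
```

Sorting by the fraction makes it easy to see at a glance which variables need the most attention.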

## Mechanisms of Missing Data

### Missing data Not At Random (MNAR)

The missing values of the variables **age** and **cabin** were introduced systematically. For many of those who did not survive, the **age** or the **cabin** remains unknown. The people who survived, by contrast, could have been asked for that information.

Can we infer this by looking at the data?

If the data are MNAR, we would expect a higher proportion of missing values among the people who did not survive.

Let's have a look.

In [23]:
# Let's create a binary variable that indicates 
# if the value of cabin is missing.

data['cabin_null'] = np.where(data['Cabin'].isnull(), 1, 0)

In [24]:
# Let's evaluate the percentage of missing values in
# cabin for the people who survived vs the non-survivors.

# The variable Survived takes the value 1 if the passenger
# survived, or 0 otherwise.

# Group data by Survived vs Non-Survived
# and find the percentage of NaN for Cabin.

data.groupby(['Survived'])['cabin_null'].mean()

Survived
0    0.876138
1    0.602339
Name: cabin_null, dtype: float64

In [26]:
# Another way of doing the above, with fewer lines
# of code:

data['Cabin'].isnull().groupby(data['Survived']).mean()

Survived
0    0.876138
1    0.602339
Name: Cabin, dtype: float64

In [27]:
# Let's do the same for the variable age:

# First, we create a binary variable to indicate
# if a value is missing.

data['age_null'] = np.where(data['Age'].isnull(), 1, 0)

# Then we look at the mean in survivors and non-survivors:
data.groupby(['Survived'])['age_null'].mean()

Survived
0    0.227687
1    0.152047
Name: age_null, dtype: float64
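
In the outputs above, Cabin is missing for about 88% of non-survivors versus 60% of survivors, and Age for about 23% versus 15%. The same comparison can be laid out as a contingency table with `pd.crosstab`. This is only a sketch on made-up data: `toy`, `cabin_status` and `cabin_by_survival` are hypothetical names, and the values are illustrative rather than taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "Cabin": [np.nan, np.nan, np.nan, "C85", np.nan, "B28", "E46", np.nan],
})

# Label each row by whether Cabin is missing or present.
cabin_status = pd.Series(
    np.where(toy["Cabin"].isnull(), "missing", "present"),
    index=toy.index,
    name="cabin_status",
)

# Rows: survival status. Columns: missing vs. present.
# normalize="index" converts each row to proportions that sum to 1.
cabin_by_survival = pd.crosstab(
    toy["Survived"], cabin_status, normalize="index"
)

print(cabin_by_survival)
```

A clearly larger "missing" proportion in the row for `Survived == 0` is what we would expect to see if the values are MNAR in the way described above.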

### Missing data Completely At Random (MCAR)

In [28]:
# In the Titanic dataset, there are also missing values
# for the variable Embarked.

# Let's have a look.

# Let's slice the dataframe to show only the observations
# with missing values for Embarked.

data[data['Embarked'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,cabin_null,age_null
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,,0,0
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,,0,0


### Missing data At Random (MAR)

We will use a financial dataset from a peer-to-peer lending company.

We will look at the variables "employment" and "years in employment", both declared by the borrowers when they applied for a loan.

In this example, missing values in employment are associated with missing values in time in employment.

In [30]:
# Let's load the dataset with just the 2
# variables.

data = pd.read_csv('creditrisk.csv', usecols=['employment', 'time_employed'])

data.head()

Unnamed: 0,employment,time_employed
0,Teacher,<=5 years
1,Accountant,<=5 years
2,Statistician,<=5 years
3,Other,<=5 years
4,Bus driver,>5 years


In [31]:
# Let's check the percentage of missing data.

data.isnull().mean()

employment       0.0611
time_employed    0.0529
dtype: float64

Both variables have a similar percentage of missing observations: about 6.1% for employment and 5.3% for time_employed.

In [32]:
# Let's inspect the different employment types.

# Number of different employments.
print('Number of employments: {}'.format(
    len(data['employment'].unique())))

# Examples of employments.
data['employment'].unique()

Number of employments: 12


array(['Teacher', 'Accountant', 'Statistician', 'Other', 'Bus driver',
       'Secretary', 'Software developer', 'Nurse', 'Taxi driver', nan,
       'Civil Servant', 'Dentist'], dtype=object)

Note that `nan` appears among the employment values, so the count of 12 includes the missing-value marker.

In [33]:
# Let's inspect the variable time employed.

data['time_employed'].unique()

array(['<=5 years', '>5 years', nan], dtype=object)
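
One way to check the association between the two variables is to compare the fraction of missing values in time employed for rows where employment is missing and for rows where it is present. The sketch below assumes made-up data: `toy`, `employment_status` and `mar_check` are hypothetical names, and the values are not from the lending dataset. With the real data, `data` would take the place of `toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking the two columns used above.
toy = pd.DataFrame({
    "employment": ["Teacher", np.nan, "Nurse", np.nan, "Dentist", "Other"],
    "time_employed": ["<=5 years", np.nan, ">5 years", np.nan, np.nan,
                      "<=5 years"],
})

# Label each row by whether employment is missing or present.
employment_status = np.where(
    toy["employment"].isnull(), "employment_missing", "employment_present"
)

# Fraction of missing time_employed within each group.
mar_check = toy["time_employed"].isnull().groupby(employment_status).mean()

print(mar_check)
```

If the fraction is much higher in the `employment_missing` group, the missingness in time employed depends on another observed variable, which is consistent with MAR.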