## Missing values

Reference:

https://www.ncbi.nlm.nih.gov/books/NBK493614/


Missing data occur when __no value__ is stored for a variable in an observation.

### Types of Missing Data

Missing data can be of the following types:

#### Missing Completely at Random [MCAR]:

- A variable has data missing completely at random if the probability of being missing is the same for all the observations

- The fact that the data are missing is independent of the observed and unobserved data

- No systematic differences exist between participants with missing data and those with complete data

- In these instances, the missing data reduce the analyzable population of the study and consequently, the statistical power

- Removing the incomplete observations does not introduce bias: when data are MCAR, the remaining data can be considered a simple random sample of the full data set of interest

- MCAR is generally regarded as a strong and often unrealistic assumption
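
The claim that dropping MCAR rows costs sample size but not accuracy can be checked with a small simulation. The sketch below is only illustrative: the ages are synthetic, and the 20% missingness rate and variable names are assumptions, not values from the data set used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 100,000 ages drawn from a normal distribution
age = rng.normal(loc=30, scale=10, size=100_000)

# MCAR: every observation has the same 20% chance of being missing,
# regardless of its own value or of any other variable
is_missing = rng.random(age.size) < 0.20
observed = age[~is_missing]

full_mean = age.mean()
observed_mean = observed.mean()

print(f"full mean: {full_mean:.2f}, complete-case mean: {observed_mean:.2f}")
print(f"rows lost: {is_missing.mean():.1%}")
```

The two means should agree closely, while roughly a fifth of the rows (and the corresponding statistical power) is lost.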


#### Missing at Random [MAR]: 

- The fact that the data are missing is systematically related to the observed but not the unobserved data

- The probability of an observation being missing depends only on available information

- For example, if women are less likely to disclose their age than men, age is MAR

- If we decide to use a MAR variable with missing values (e.g., age), we have to include the variables that are related to its missingness (e.g., gender) to control the bias the missing observations introduce
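
The age and gender example can be sketched as a simulation. Everything here is assumed for illustration: the data are synthetic, women are made older on average so that the bias becomes visible, and the missingness rates (60% vs. 10%) are arbitrary. The adjustment shown, averaging per-gender means weighted by each gender's share, is one simple way to use the observed variable, not the only one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical population: gender is always observed; in this toy setup
# women are older on average (40 vs. 30 years)
is_female = rng.random(n) < 0.5
age = np.where(is_female, rng.normal(40, 8, n), rng.normal(30, 8, n))

# MAR: the chance that age is missing depends only on gender (observed),
# not on the age value itself
p_missing = np.where(is_female, 0.6, 0.1)
observed = rng.random(n) >= p_missing

true_mean = age.mean()
naive_mean = age[observed].mean()  # complete-case estimate, ignores gender

# Adjusted estimate: per-gender means of the observed ages, weighted by
# each gender's share of the full sample
share_f = is_female.mean()
adjusted_mean = (share_f * age[observed & is_female].mean()
                 + (1 - share_f) * age[observed & ~is_female].mean())

print(f"true: {true_mean:.2f}  naive: {naive_mean:.2f}  adjusted: {adjusted_mean:.2f}")
```

In this setup the naive mean underestimates the true mean because women are under-represented among the complete cases, while the gender-adjusted estimate should land close to it.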

#### Missing Not at Random [MNAR]: 

- When data are MNAR, the fact that the data are missing is systematically related to the unobserved data, that is, the missingness is related to events or factors which are not measured by the researcher.

- MNAR would occur if people failed to fill in a depression survey because of their level of depression. 
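
A comparable sketch for the depression survey example, again on synthetic data with made-up parameters: the chance of not responding rises with the (unobserved) score itself, and the observed covariate `is_female` is assumed to be unrelated to the score. Under these assumptions, adjusting for the observed covariate does not remove the bias.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical depression scores (higher = more depressed) and an observed
# covariate that is unrelated to the score in this toy setup
score = rng.normal(50, 10, n)
is_female = rng.random(n) < 0.5

# MNAR: the probability of not responding increases with the score itself
p_missing = 1 / (1 + np.exp(-(score - 50) / 5))
observed = rng.random(n) >= p_missing

true_mean = score.mean()
naive_mean = score[observed].mean()

# Same stratified adjustment as one would use under MAR; it cannot help
# here because the missingness is driven by the unobserved score
share_f = is_female.mean()
adjusted_mean = (share_f * score[observed & is_female].mean()
                 + (1 - share_f) * score[observed & ~is_female].mean())

print(f"true: {true_mean:.2f}  naive: {naive_mean:.2f}  adjusted: {adjusted_mean:.2f}")
```

Both the naive and the adjusted estimate should fall well below the true mean, since the most depressed respondents are the ones most likely to be missing.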


### Rules of thumb


- A complete case analysis is not biased by the missing data if the missingness is independent of the outcome under study, a condition that can hold whether the data are MAR or MNAR

- However, if the missingness is not independent of the outcome, it can be made so through analytic means only if the missingness is MAR.

__The MAR vs. MNAR distinction therefore does not indicate whether a complete case analysis definitely will or will not be biased. Instead it indicates, if the complete case analysis is biased, whether that bias can be fully removed in analysis (see the sections below for analytic strategies)__
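
The first rule of thumb can be illustrated with a simple linear regression on synthetic data. This is a sketch under stated assumptions: the true slope of 2, the logistic drop-out rules, and all names are made up. In one scenario rows are dropped based on the predictor `x` only, so missingness is independent of the outcome given `x`. In the other, rows are dropped based on the outcome `y`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Hypothetical data: outcome y depends linearly on predictor x (true slope 2)
x = rng.normal(0, 1, n)
y = 2.0 * x + rng.normal(0, 1, n)

def complete_case_slope(keep):
    """Least-squares slope of y on x using only the rows flagged in keep."""
    return np.polyfit(x[keep], y[keep], deg=1)[0]

# Scenario A: a row is kept with a probability that depends only on x
keep_by_x = rng.random(n) < 1 / (1 + np.exp(-x))

# Scenario B: a row is kept with a probability that depends on the outcome y
keep_by_y = rng.random(n) < 1 / (1 + np.exp(-y))

slope_a = complete_case_slope(keep_by_x)
slope_b = complete_case_slope(keep_by_y)

print(f"slope, missingness driven by x: {slope_a:.3f}")
print(f"slope, missingness driven by y: {slope_b:.3f}")
```

In scenario A the complete case slope should stay close to 2 even though many rows are lost. In scenario B it should be noticeably attenuated, which is the situation where an analytic correction is needed.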

In [11]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from google.colab import drive
drive.mount('/content/gdrive')
data = pd.read_csv("gdrive/My Drive/Colab Notebooks/FeatureEngineering/train.csv")


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [12]:
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [13]:
len(data)

891

In [14]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [15]:
data.isnull().mean()*100

PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64
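
The count and the percentage of missing values can also be combined into one table. The helper below is a hypothetical convenience function (`missing_summary` is not part of pandas), shown on a tiny made-up frame rather than on the real data.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": df.isnull().mean() * 100,
    })
    # Keep only columns with at least one missing value,
    # sorted from most to least affected
    affected = summary[summary["n_missing"] > 0]
    return affected.sort_values("n_missing", ascending=False)

# Tiny made-up frame standing in for the real data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

result = missing_summary(toy)
print(result)
```

Calling `missing_summary(data)` on the notebook's frame would give the same information as the two cells above in a single view.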

In [16]:
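# Binary indicator: 1 if Cabin is missing, 0 otherwise
data['Cabin_nulls'] = np.where(data.Cabin.isnull(), 1, 0)

# The mean of the indicator is the overall share of missing Cabin values
data['Cabin_nulls'].mean()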

0.7710437710437711

In [17]:
# Share of missing Cabin values among non-survivors (0) and survivors (1)
data.groupby(['Survived'])['Cabin_nulls'].mean()

Survived
0    0.876138
1    0.602339
Name: Cabin_nulls, dtype: float64

In [18]:
# Same check for Age: missingness indicator, then share missing per outcome
data['Age_nulls'] = np.where(data['Age'].isnull(), 1, 0)
data.groupby(['Survived'])['Age_nulls'].mean()

Survived
0    0.227687
1    0.152047
Name: Age_nulls, dtype: float64

In [19]:
# Embarked: candidate for Missing Completely at Random (only 2 missing values)
data['Embarked_nulls'] = np.where(data['Embarked'].isnull(), 1, 0)
data.groupby(['Survived'])['Embarked_nulls'].mean()

Survived
0    0.000000
1    0.005848
Name: Embarked_nulls, dtype: float64

In [20]:
data['Embarked_nulls'].sum()

2
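
The per-column checks above repeat the same pattern, so they can be folded into one helper. This is a sketch only: `missing_rate_by_group` is a hypothetical name, and the frame is made up. A difference in rates between groups suggests the data are not MCAR, but it does not by itself distinguish MAR from MNAR.

```python
import numpy as np
import pandas as pd

def missing_rate_by_group(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Share of missing values per column within each level of group_col."""
    other_cols = df.columns.drop(group_col)
    # isnull() gives booleans; their group-wise mean is the missing share
    rates = df[other_cols].isnull().groupby(df[group_col]).mean()
    # Drop columns that have no missing values in any group
    return rates.loc[:, (rates > 0).any()]

# Tiny made-up frame standing in for the real data
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1],
    "Cabin": [np.nan, np.nan, "C1", "C2", np.nan],
    "Fare": [1.0, 2.0, 3.0, 4.0, 5.0],
})

rates = missing_rate_by_group(toy, "Survived")
print(rates)
```

Applied as `missing_rate_by_group(data, 'Survived')`, this would report the Cabin, Age and Embarked rates side by side without creating the extra `*_nulls` columns.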