# Missing Data Mechanism
To handle missing data, you need to understand the mechanism that produced it. Missing-data mechanisms fall into three categories: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).
Let's dig into their definitions.

### - Missing Completely at Random (MCAR)
There is no systematic difference between observed and missing data. For example, blood pressure values are missing because the machine that measures them broke down for a while.

### - Missing at Random (MAR)
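Any systematic difference between observed and missing data can be explained by the observed data. For example, the missing blood pressure values may follow a Gaussian distribution with a lower mean than the observed ones. This can happen because young people, who tend to have lower blood pressure, may be less likely to take tests regularly, so the difference is explained by age, which is observed.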

### - Missing Not at Random (MNAR)
Even after taking the observed data into account, the systematic difference is not fully explained. For example, people with high incomes may hesitate to report their income on a questionnaire.
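To make the three definitions concrete, here is a minimal simulation sketch. The data are synthetic and hypothetical (an `age` variable that is always observed and a `bp` variable that rises with age), and the missingness probabilities are arbitrary choices for illustration, not values taken from the references.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: blood pressure (bp) rises with age.
age = rng.uniform(20, 80, size=n)
bp = 90 + 0.6 * age + rng.normal(0, 10, size=n)

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness depends only on an observed variable (age).
# Younger people are more likely to have a missing bp value.
p_mar = 1 / (1 + np.exp((age - 50) / 5))
mar_mask = rng.random(n) < p_mar

# MNAR: missingness depends on the unobserved value itself.
# Higher bp values are more likely to be missing.
p_mnar = 1 / (1 + np.exp(-(bp - bp.mean()) / 5))
mnar_mask = rng.random(n) < p_mnar

# Compare the mean of the values that remain observed with the true mean.
observed_means = {
    name: bp[~mask].mean()
    for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]
}
summary = pd.DataFrame(
    {"true_mean": bp.mean(), "observed_mean": pd.Series(observed_means)}
)
print(summary)
```

Under MCAR the observed mean stays close to the true mean, while under MAR and MNAR it is shifted. The difference is that in the MAR case the shift can be corrected by conditioning on `age`, whereas in the MNAR case no observed variable explains it.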

Roughly speaking, MCAR data can be handled with simple statistics such as the mean or median, while predictive models regressed on the observed data assume MAR. MNAR is generally harder to deal with, and you need to analyze each case more carefully. If you use a model whose assumptions are not satisfied, you will introduce bias into the data.

You can find the definitions in the paper [[1]](https://www.bmj.com/content/338/bmj.b2393). There is also an interesting paper, written as a dialogue, that clarifies the difference between MCAR and MAR [[2]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4121561/).

For tips on how to diagnose missing-data mechanisms, see [this blog post](https://www.theanalysisfactor.com/missing-data-mechanism/).



# Deletion
### - Listwise
Assumption: MCAR
Remove every instance that contains at least one missing value.
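A minimal sketch of listwise deletion with pandas, using a small hypothetical frame (`toy` and its column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values.
toy = pd.DataFrame({
    "age": [25, 40, np.nan, 61],
    "bp": [118.0, np.nan, 130.0, 142.0],
    "income": [30, 52, 47, 80],
})

# Listwise deletion: drop every row that has at least one missing value.
listwise = toy.dropna()
print(listwise)
```

Only the rows that are complete in every column survive, so this can discard a lot of data when missing values are spread across many variables.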

### - Pairwise
Assumption: MCAR
Keep only the cases in which the variables of interest are observed, and remove the others. Unlike listwise deletion, pairwise deletion may retain instances that have missing values in other variables.
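As a rough illustration, the sketch below applies pairwise deletion to a hypothetical frame (`toy` and its columns are assumptions). Each analysis drops rows only if they are missing in the variables it actually uses. pandas' `DataFrame.corr` also excludes missing values pair by pair, so it gives the same result for each pair.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "age": [25, 40, np.nan, 61, 33],
    "bp": [118.0, np.nan, 130.0, 142.0, 121.0],
    "income": [30, 52, 47, 80, 41],
})

# Pairwise deletion: keep the rows that are complete for the pair of interest.
pair_age_income = toy[["age", "income"]].dropna()
pair_bp_income = toy[["bp", "income"]].dropna()

# Listwise deletion for comparison: rows must be complete in all columns.
listwise = toy.dropna()

print(len(pair_age_income), len(pair_bp_income), len(listwise))

# DataFrame.corr() handles missing values pairwise for each pair of columns.
corr_pairwise = toy.corr()
print(corr_pairwise)
```

Each pair keeps more rows than listwise deletion does, at the cost that different statistics are computed on different subsets of the data.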

### - Dropping Variables
Assumption: The dropped variables do not have any predictive power.
Drop the variables that contain missing values.
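A small sketch of dropping variables with pandas. The frame `toy` is hypothetical, and the 50% threshold in the second variant is an arbitrary choice for illustration, not a recommendation from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is mostly missing, "c" has one gap.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# Drop every column that contains any missing value.
no_missing_cols = toy.dropna(axis=1)

# Softer variant: drop only columns whose missing ratio exceeds a threshold.
missing_ratio = toy.isna().mean()
mostly_observed = toy.loc[:, missing_ratio <= 0.5]

print(list(no_missing_cols.columns))
print(list(mostly_observed.columns))
```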


# Imputation
### - Mean, Median and Mode
Assumption: MCAR
Replace missing values with the mean, median, or mode. These are among the simplest imputation methods and are implemented in scikit-learn.
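A minimal sketch using scikit-learn's `SimpleImputer` on a small made-up array. The strategies `"mean"`, `"median"`, and `"most_frequent"` correspond to the mean, median, and mode.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 20.0],
])

# Each imputer fills a missing entry with a statistic of its own column.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)
median_imputed = SimpleImputer(strategy="median").fit_transform(X)
mode_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X)

print(mean_imputed)
print(median_imputed)
print(mode_imputed)
```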

### - Multiple Imputation
Assumption: MAR
Imputation proceeds as follows:

1. Impute missing values with a basic method such as the mean or median.
2. For the feature to impute, set the imputed values back to missing.
3. Regress that feature on the other variables, which may or may not include all of the other variables.
4. Replace the missing values with the predicted values.
5. Repeat steps 2-4 for each feature.
6. Repeat steps 2-5 for a fixed number of cycles, then store the final values as one imputed dataset.
7. Repeat steps 2-6 to get multiple datasets, which are used to estimate the uncertainty of the imputation.

This imputation is known to work well on datasets with a small number of features and instances.
Here is the [reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/pdf/nihms267760.pdf).
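The procedure above can be approximated with scikit-learn's `IterativeImputer`, which is still marked experimental and needs an explicit enabling import. This is only a sketch under assumptions: the data are synthetic, and five datasets with ten cycles each are arbitrary choices. Setting `sample_posterior=True` draws imputations from the predictive distribution, so different seeds give different datasets.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data: the third feature depends on the first two.
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the entries at random.
missing_mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[missing_mask] = np.nan

# Build several imputed datasets, one per random seed.
imputed_datasets = []
for seed in range(5):
    imputer = IterativeImputer(
        initial_strategy="mean",   # step 1: start from a simple imputation
        max_iter=10,               # step 6: number of cycles over the features
        sample_posterior=True,     # sample imputations instead of point estimates
        random_state=seed,
    )
    imputed_datasets.append(imputer.fit_transform(X_missing))

stacked = np.stack(imputed_datasets)   # shape: (5, 200, 3)
pooled = stacked.mean(axis=0)          # average imputation per cell
spread = stacked.std(axis=0)           # variation across imputed datasets

print(spread[missing_mask].mean())
```

The spread across datasets is zero for observed cells and positive for imputed ones, which gives a rough picture of how uncertain each imputation is.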


### - K Nearest Neighbor
Assumption: MAR
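Impute each missing value from the K most similar instances. A naive neighbor search costs $O(N^2)$ in the number of instances $N$.
For large-scale datasets, you can use [LSH](https://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf) instead.
The features need to be standardized first, because the neighbors are found by distance.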
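A minimal sketch of KNN imputation with scikit-learn's `KNNImputer`, standardizing first with `StandardScaler` (which ignores NaNs when fitting and keeps them when transforming). The data and `n_neighbors=2` are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: height in cm and yearly income, on very different scales.
X = np.array([
    [170.0, 30000.0],
    [172.0, np.nan],
    [180.0, 52000.0],
    [np.nan, 50000.0],
    [165.0, 28000.0],
])

# Standardize so that no feature dominates the distance computation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fill each missing entry with the average over the 2 nearest neighbors.
X_imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(X_scaled)

# Map the result back to the original units.
X_imputed = scaler.inverse_transform(X_imputed_scaled)
print(X_imputed)
```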

### - Linear Regression (Logistic Regression)
Assumption: MAR
Regress each feature that has missing values on the other variables, and fill in the missing values with the predictions.
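A rough sketch of regression imputation for a numeric feature, using scikit-learn's `LinearRegression` on a hypothetical frame (`toy`, `x`, and `y` are made-up names). For a categorical feature, a classifier such as `LogisticRegression` would take the place of the linear model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: y is roughly 2 * x, with two missing values.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2.1, 3.9, np.nan, 8.2, np.nan, 12.1],
})

# Fit the model on the rows where the target feature is observed.
observed = toy["y"].notna()
model = LinearRegression().fit(toy.loc[observed, ["x"]], toy.loc[observed, "y"])

# Predict the missing entries and write them back.
imputed = toy.copy()
imputed.loc[~observed, "y"] = model.predict(toy.loc[~observed, ["x"]])
print(imputed)
```

Note that every imputed value lies exactly on the fitted line, so this approach understates the natural variability of the feature.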


In [1]:
import pandas as pd


In [2]:
df = pd.read_csv("train.csv")

In [3]:
df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
5,6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,...,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000
6,7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,307000
7,8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,...,0,,,Shed,350,11,2009,WD,Normal,200000
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2008,WD,Abnorml,129900
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,...,0,,,,0,1,2008,WD,Normal,118000


In [4]:
df.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0
