Missing data can often provide insight into the problem. We can look at the structure of the missing values: which features are affected, which records are affected, and how train and test differ. We can also use missingness as a feature by counting it or transforming it. In this kernel I explore some patterns and suggest one way to improve your model with the findings.

### Patterns
First let's look at the overall pattern of missing data. The [missingno](https://github.com/ResidentMario/missingno) package by [Aleksey Bilogur](https://www.kaggle.com/residentmario) is the perfect tool here. Looking at a sample of data for all columns we see a group of columns where the missing values appear correlated.



In [None]:
import numpy as np
import pandas as pd
import missingno as msno

train = pd.read_csv('../input/application_train.csv')
# Nullity matrix for a random sample of 500 rows (white = missing).
msno.matrix(train.sample(500), sparkline=True, figsize=(20,10), sort=None)

Let's also look at the Test set. If the structure of missing values differs between Train and Test, that will be useful to know. Judging from the picture below, Train and Test look similar, at least for this random sample.

In [None]:
test = pd.read_csv('../input/application_test.csv')
# Same view for Test, drawn in a different color to tell the plots apart.
msno.matrix(test.sample(500), sparkline=True, figsize=(20,10), sort=None, color=(0.25, 0.45, 0.6))
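
The visual comparison can be backed up with numbers. Below is a minimal sketch that compares the fraction of missing values per column in two frames. The helper name `compare_missing_rates` and the toy frames are my own placeholders, not part of the competition data; on the real data you would pass `train` and `test` instead.

```python
import numpy as np
import pandas as pd


def compare_missing_rates(train_df, test_df):
    """Per-column missing fractions for both frames, sorted by their gap."""
    # Only compare columns present in both frames (e.g. TARGET is train-only).
    shared = [c for c in train_df.columns if c in test_df.columns]
    rates = pd.DataFrame({
        "train": train_df[shared].isnull().mean(),
        "test": test_df[shared].isnull().mean(),
    })
    rates["abs_diff"] = (rates["train"] - rates["test"]).abs()
    return rates.sort_values("abs_diff", ascending=False)


# Toy frames standing in for application_train / application_test.
toy_train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "TARGET": [0, 1, 0, 0],
})
toy_test = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
})
rates = compare_missing_rates(toy_train, toy_test)
print(rates)
```

Columns at the top of the result are the ones where the two sets disagree most, which is where a missingness-based feature could behave differently at prediction time.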

Now going back to Train and zooming in on the middle columns, we see that they deal mostly with information about the building where the client lives. It appears that many applicants leave the housing information blank. We can think about why that might be the case and how it might inform our model.

I'll sort the data this time to better see the proportions.

In [None]:
# First 100 rows, columns 40-93 only, sorted by completeness.
msno.matrix(train.iloc[0:100, 40:94], sparkline=True, figsize=(20,10), sort='ascending', fontsize=12, labels=True)

The dendrogram view uses hierarchical clustering to show which columns tend to be missing together. Columns that join close to zero have nearly identical patterns of missing values. Pretty cool!

In [None]:
# Cluster all columns by their pattern of missing values.
msno.dendrogram(train, fontsize=12, figsize=(20,30))

### Comparison and Feature Engineering

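Let's focus on the large group of applications with missing housing data. As a rough proxy, I flag an application as incomplete when 35 or more of its fields are missing. Is there a difference in mean default rates between the more complete and the less complete applications?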

In [None]:
# Flag applications with 35 or more missing fields as incomplete.
train['incomplete'] = (train.isnull().sum(axis=1) >= 35).astype(int)

# TARGET is binary, so its mean within a group is that group's default rate.
mean_c = train.loc[train['incomplete'] == 0, 'TARGET'].mean()
mean_i = train.loc[train['incomplete'] == 1, 'TARGET'].mean()
print('default ratio for more complete: {:.2} \ndefault ratio for less complete: {:.2}'.format(mean_c, mean_i))

So there is some difference. Viewed one way, borrowers with incomplete applications are ~30% more likely to default. You may want to include this information in your model somehow. So far I have found it helpful to add a binary feature called 'no_housing_info': an application gets a 1 if more than 45 of its housing-related fields are NA, and a 0 otherwise. You could also create three classes to account for the applications with only some housing data (which may denote apartment dwellers).
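
Here is one way the 'no_housing_info' flag and the three-class variant could be built. Treat it as a sketch: the helper name, the `housing_info_class` column, and the toy column list are my own illustrations. On the real data you would supply your own list of housing columns (the slice plotted earlier, `train.columns[40:94]`, is a rough starting point, though it is only mostly housing fields) and a threshold of 45.

```python
import numpy as np
import pandas as pd


def housing_info_features(df, housing_cols, threshold):
    """Add a binary flag and a three-level class from missing housing fields."""
    out = df.copy()
    n_missing = out[housing_cols].isnull().sum(axis=1).to_numpy()
    # 1 when the number of missing housing fields exceeds the threshold.
    out["no_housing_info"] = (n_missing > threshold).astype(int)
    # 0 = all housing fields present, 1 = some missing, 2 = above threshold.
    out["housing_info_class"] = np.select(
        [n_missing > threshold, n_missing > 0], [2, 1], default=0
    )
    return out


# Toy example with three stand-in housing columns and threshold 2.
toy_cols = ["APARTMENTS_AVG", "FLOORSMAX_AVG", "HOUSETYPE_MODE"]
toy = pd.DataFrame({
    "APARTMENTS_AVG": [0.1, 0.2, np.nan],
    "FLOORSMAX_AVG": [0.3, np.nan, np.nan],
    "HOUSETYPE_MODE": ["block of flats", "block of flats", None],
})
feats = housing_info_features(toy, toy_cols, threshold=2)
print(feats[["no_housing_info", "housing_info_class"]])
```

The three-level column keeps the partially filled applications separate, which the binary flag would otherwise lump in with the complete ones.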

You can also look at the dendrogram and find ways to compare columns. You might find useful flags if an application has one feature missing but another filled in. There are many possibilities!
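
A flag of the "one missing, the other filled in" kind could look like the sketch below. The function name and the column names `col_a` and `col_b` are hypothetical; which real pairs are worth comparing is something to read off the dendrogram.

```python
import numpy as np
import pandas as pd


def missing_while_present(df, missing_col, present_col):
    """Return 1 where `missing_col` is null but `present_col` is filled in."""
    return (df[missing_col].isnull() & df[present_col].notnull()).astype(int)


# Toy frame with two hypothetical columns.
toy = pd.DataFrame({
    "col_a": [np.nan, 1.0, np.nan],
    "col_b": [5.0, np.nan, np.nan],
})
toy["a_missing_b_present"] = missing_while_present(toy, "col_a", "col_b")
print(toy)
```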


### Statistical Significance
Someone suggested I look at statistical significance of the difference in default rates mentioned above. Good advice! I'll use a [G-test](https://en.wikipedia.org/wiki/G-test) which is similar to Pearson's chi-squared test. Either one should work in this case, but I generally prefer the G-test.

In [None]:
from scipy.stats import chi2_contingency

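# 2x2 table of counts: incomplete flag (rows) vs. TARGET (columns).
props = pd.crosstab(train.incomplete, train.TARGET)
# lambda_="log-likelihood" selects the G-test instead of Pearson's chi-squared.
c = chi2_contingency(props, lambda_="log-likelihood")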
print(props, "\n p-value= ", c[1])
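
For anyone curious what the G-test computes, here is a small worked check. The statistic is G = 2 * sum(O * ln(O / E)), where O are the observed counts and E the counts expected under independence. The counts below are made up for illustration and are not taken from the competition data; `g_test` is my own helper name.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency


def g_test(observed):
    """G statistic and p-value for a contingency table with no zero cells."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts if rows and columns were independent.
    expected = row_totals @ col_totals / observed.sum()
    g = 2.0 * np.sum(observed * np.log(observed / expected))
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return g, chi2.sf(g, dof)


# Made-up counts: rows = complete / incomplete, columns = repaid / default.
toy_counts = np.array([[900, 60], [850, 90]])
g_manual, p_manual = g_test(toy_counts)

res = chi2_contingency(toy_counts, correction=False, lambda_="log-likelihood")
g_scipy, p_scipy = res[0], res[1]
print(g_manual, g_scipy)
print(p_manual, p_scipy)
```

SciPy applies Yates' continuity correction to 2x2 tables by default, so I pass `correction=False` here to make the two numbers comparable. The cell above keeps the default.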

"If p is low, the null must go." So we can reject the null hypothesis with only a small probability of [Type 1 error](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors). In other words, the difference in default ratios between the two groups is not due to random chance. The question now is can we capture the difference in our model?

Good luck!