# Clustering on Missing Values

In a previous [notebook](https://www.kaggle.com/lesibius/two-sigma-financial-modeling/financial-instrument-types), I tried to see if I could differentiate among the instruments in the dataset using their missing (or null) values. Although that study was inconclusive, the heatmap in the linked notebook suggested that clustering this way could make sense.

That said, I would like to go further in this direction, with these objectives in mind:
<ul>
<li>Find a way to keep my clusters constant. k-means returns its clusters in an arbitrary order from one run to the next, which prevents any further analysis. Building on my previous work, I intend to get a clear view of which features can be used for clustering.</li>
<li>Analyse the features and the y variable within each cluster to see whether a one-model-fits-all approach is likely to work.</li>
</ul>

The visualisation I obtained the last time suggests that two main clusters exist, and that some segregation could be done on the second cluster as well.

# Results so Far
<ul>
<li>From the [previous notebook](https://www.kaggle.com/lesibius/two-sigma-financial-modeling/financial-instrument-types), I could isolate two main groups of ids based on their missing values</li>
<li>One of the two groups contains more valid features than the other. Overall, cluster 0 has 57 "mostly valid" features, while cluster 1 has only 24. 20 of these features are common to both clusters</li>
<li>I tried to apply a two-sample Kolmogorov-Smirnov test to their y values to see whether their distributions were roughly the same, without success. The result was inconclusive: the test rejected the null hypothesis that both samples were drawn from the same population, while the first four moments and the cdf clearly indicate the contrary. For more information, see [this discussion](https://www.kaggle.com/c/two-sigma-financial-modeling/discussion/26406#) (thanks to CarrDelling and Oussama Errabia BTW)</li>
</ul>

# Recovering the Previous Results

My first goal is to recover the results of my previous kernel.

In [None]:
#Importing libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


%matplotlib inline

#Getting data

#I owe this to SRK's work here:
#https://www.kaggle.com/sudalairajkumar/two-sigma-financial-modeling/simple-exploration-notebook/notebook
with pd.HDFStore("../input/train.h5", "r") as train:
    df = train.get("train")


input_variables = [x for x in df.columns.values if x not in ['id','y','timestamp']]


df_id_vs_variable = df[['id']+input_variables]       #Removes 'y' and 'timestamp'
df_id_vs_variable = df_id_vs_variable.fillna(0)      #Replace na by 0

def makeBinary(x):
    if abs(x) > 0.00000:
        return 1
    else:
        return 0

df_id_vs_variable = df_id_vs_variable.groupby('id').agg('sum').applymap(makeBinary)


n_clust = 2

km = KMeans(n_clusters=n_clust, n_init=20).fit(df_id_vs_variable)
clust = km.predict(df_id_vs_variable)


#Init table of indexes
df_clust_index = {}
for i in range(0,n_clust):
    df_clust_index[i]=[]

#Fill the cluster index
for i in range(0,len(clust)):
    df_clust_index[clust[i]].append(i)

for i in range(0,n_clust):
    df_clust_index[i] = df_id_vs_variable.iloc[df_clust_index[i]].index.values



df_clust = []

for i in range(0,n_clust):
    df_clust.append(df.loc[df.id.isin(df_clust_index[i])])

At this stage, we have some useful variables:

<ul>
<li>`df_id_vs_variable`, a binary matrix with one row per id (1: non-null variable for the id, 0: null variable)</li>
<li>`clust`, the result of the sklearn k-means (i.e. an array containing the cluster number of each id)</li>
<li>`df_clust_index`, a dictionary with the following shape: {index_clust_0:[ids in cluster 0], index_clust_n:[ids in cluster n]}</li>
<li>`df_clust`, a list of dataframes, where the position in the list is the cluster number</li>
</ul>
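
One caveat about `df_id_vs_variable` as built above: because it sums the values per id and then tests the sum against zero, a feature whose values are all zero (or cancel out) for an id is marked as missing. If that is not the intended behaviour, the matrix can be built from the null mask directly. The snippet below is a minimal sketch on a made-up toy frame; the toy values and the names `presence` and `sum_based` are illustrative assumptions, not part of the original kernel.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): id 1 has real observations that sum to zero,
# id 2 has a genuinely missing feature.
toy = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "technical_1": [0.5, -0.5, np.nan, np.nan],
    "technical_2": [0.0, 0.0, 1.0, np.nan],
})
features = [c for c in toy.columns if c != "id"]

# 1 if the id has at least one non-null observation for the feature, else 0
presence = toy[features].notna().groupby(toy["id"]).any().astype(int)

# Sum-based variant, equivalent to the fillna(0) / sum / makeBinary pipeline
sum_based = toy.fillna(0).groupby("id")[features].sum().ne(0).astype(int)

print(presence)
print(sum_based)
```

On this toy frame the two matrices disagree for id 1: the null-mask version marks both features as present, while the sum-based version marks them as missing.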

# Two Clusters Analysis

A first step is to see whether the data from the two clusters seem to be drawn from the same population.

## Null vs Non-Null Variables

First, I "serialize" my clusters. At the same time, I would like to keep only the columns that contain enough data to be usable. Thus, within each cluster, I only keep a column if more than 95% of the ids have a value for it.

In [None]:
non_null_0 = df_id_vs_variable.loc[clust==0].sum() / df_id_vs_variable.loc[clust==0].shape[0]
non_null_1 = df_id_vs_variable.loc[clust==1].sum() / df_id_vs_variable.loc[clust==1].shape[0]


df_non_null_comparison = pd.concat([non_null_0,non_null_1],axis=1)

bar_width = 1
index = np.arange(df_non_null_comparison.shape[0])

fig, ax = plt.subplots(figsize=(12,50))

rects1 = plt.barh(index ,  np.array(df_non_null_comparison[0]), bar_width/2,
                 color='b',
                 label='Cluster 0')

rects1 = plt.barh(index + bar_width/2,  np.array(df_non_null_comparison[1]), bar_width/2,
                 color='r',
                 label='Cluster 1')


#plt.figure(figsize=(20,50))
plt.legend()
plt.xlabel('Share of ids with a non-null value')
plt.ylabel('Features')
plt.yticks(index + bar_width/4, df_non_null_comparison.index.values)  #Centre each label between its two bars
plt.tight_layout()
plt.show()

In [None]:
non_null_threshold = 0.95

col_0 = non_null_0.loc[non_null_0 > non_null_threshold].index.values
col_1 = non_null_1.loc[non_null_1 > non_null_threshold].index.values

The following lists contain the "serialised" column names.

In [None]:
col_0 = ['derived_2', 'fundamental_0', 'fundamental_2', 'fundamental_7', 'fundamental_8', 'fundamental_10', 'fundamental_11', 'fundamental_13', 'fundamental_14', 'fundamental_15', 'fundamental_16', 'fundamental_18', 'fundamental_19', 'fundamental_21', 'fundamental_23', 'fundamental_29', 'fundamental_30', 'fundamental_33', 'fundamental_35', 'fundamental_36', 'fundamental_37', 'fundamental_39', 'fundamental_41', 'fundamental_42', 'fundamental_43', 'fundamental_44', 'fundamental_45', 'fundamental_46', 'fundamental_48', 'fundamental_50', 'fundamental_53', 'fundamental_54', 'fundamental_55', 'fundamental_56', 'fundamental_59', 'fundamental_60', 'fundamental_62', 'technical_1', 'technical_2', 'technical_3', 'technical_6', 'technical_7', 'technical_11', 'technical_13', 'technical_17', 'technical_19', 'technical_20', 'technical_21', 'technical_22', 'technical_24', 'technical_27', 'technical_30', 'technical_33', 'technical_35', 'technical_36', 'technical_40', 'technical_41']
col_1 = ['technical_1', 'technical_2', 'technical_3', 'technical_5', 'technical_6',
     'technical_7', 'technical_11', 'technical_13', 'technical_14', 'technical_17',
     'technical_19', 'technical_20', 'technical_21', 'technical_22', 'technical_24',
     'technical_27', 'technical_30', 'technical_33', 'technical_34', 'technical_35',
     'technical_36', 'technical_40', 'technical_41', 'technical_43']

common_cols = [x for x in col_0 if x in col_1]

print("Cluster 0 has {0} columns".format(len(col_0)))
print("Cluster 1 has {0} columns".format(len(col_1)))

print("Number of features present in both clusters: {0}".format(len(common_cols)))
print("Number of features that are only present in cluster 0: {0}".format(len([x for x in col_0 if x not in col_1])))
print("Number of features that are only present in cluster 1: {0}".format(len([x for x in col_1 if x not in col_0])))

## Kolmogorov-Smirnov Analysis on the 'y' Value

This section and the next one were an unsuccessful attempt to see whether the distribution of y values differs between clusters. As the graphs and the first four moments contradict the test result, I will just leave this part as is for the record.

*Null hypothesis*  - H0: the two clusters are drawn from the same distribution.

In [None]:
from scipy import stats

y_0 = df_clust[0].y.dropna().values
y_1 = df_clust[1].y.dropna().values 

print("{:01.3f}".format(stats.ks_2samp(y_0, y_1)[1]))

We reject the null hypothesis at the 5% threshold. The two clusters apparently result from different distributions.

In [None]:

for i in range(0,n_clust):
    n, bins, patches = plt.hist(df_clust[i].y.dropna().values, 50, density=True, facecolor='green', alpha=0.75)
    plt.xlabel('y Value')
    plt.ylabel('Density')
    plt.title(r'Distribution of y Value for Cluster '+str(i))
    plt.show()
    print("Mean value: {:.3e}".format(df_clust[i].y.dropna().mean()))
    print("Standard deviation: {:.3e}".format(df_clust[i].y.dropna().std()))
    print("Median value: {:.3e}".format(df_clust[i].y.dropna().median()))
    print("Skew: {:.3e}".format(df_clust[i].y.dropna().skew()))
    print("Kurtosis: {:.3e}".format(df_clust[i].y.dropna().kurtosis()))
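
One possible reading of the mismatch between the test and the moments (an assumption, not something this notebook verifies) is sample size: with very large samples, the two-sample KS test has enough power to reject the null hypothesis for differences that are far too small to show in a histogram. The sketch below illustrates this on synthetic normal samples. The sample size, the 0.03 shift and all variable names are made up for the illustration. It prints the KS statistic (the largest gap between the two empirical cdfs) next to the p-value, since the statistic gives a sense of how large the difference actually is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000

# Two synthetic samples whose means differ by only 0.03 standard deviations
a = rng.normal(loc=0.00, scale=1.0, size=n)
b = rng.normal(loc=0.03, scale=1.0, size=n)

result = stats.ks_2samp(a, b)
ks_stat, p_value = result.statistic, result.pvalue

print("Difference in means: {:.3e}".format(b.mean() - a.mean()))
print("KS statistic (max cdf gap): {:.3e}".format(ks_stat))
print("p-value: {:.3e}".format(p_value))
```

Running the same two lines on `y_0` and `y_1` and looking at `statistic` as well as `pvalue` would show whether the rejection above comes from a large or a tiny gap between the cdfs.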

## Kolmogorov-Smirnov Analysis on the Common Features

In [None]:
#p-value of the two-sample KS test for each feature common to both clusters
p_values_common = [stats.ks_2samp(df_clust[0][x].dropna().values, df_clust[1][x].dropna().values)[1]
                   for x in common_cols]

def isrejected(pval,th = 0.05):
    if pval < th:
        return "rejected"
    else:
        return "not rejected"

for col, pv in zip(common_cols, p_values_common):
    print("{}: {:.3e}: {}".format(col, pv, isrejected(pv)))

In [None]:
df_clust[1].shape

## Inter-Cluster Features Correlation

I would now like to see whether some correlations exist between the features that are common to both clusters and those that are specific to one of them. The idea would be to use a model to fill the missing values from the non-empty data, in order to keep more features (or to avoid filling them with their averages).

In [None]:
#This code's purpose is to make sure that the cluster 0 in my previous analysis is the same here.

if(non_null_0.loc[non_null_0 > non_null_threshold].index.isin(col_0).sum() < 24):
    temp_df = df_clust[0]
    df_clust[0] = df_clust[1]
    df_clust[1] = temp_df
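
The swap above depends on the hard-coded column lists. A more general way to keep cluster numbers stable, sketched here under my own assumptions, is to fix the k-means seed and then renumber the clusters by a property of their centroids, for instance how many features each centroid marks as present. The function name `stable_kmeans_labels`, the ranking rule and the toy matrix are all hypothetical and not part of the original kernel.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def stable_kmeans_labels(binary_df, n_clusters=2, seed=0):
    """Cluster a 0/1 presence matrix and renumber the clusters so that
    cluster 0 is always the one whose centroid has the most features present."""
    km = KMeans(n_clusters=n_clusters, n_init=20, random_state=seed).fit(binary_df)
    richness = km.cluster_centers_.sum(axis=1)   # expected number of present features
    order = np.argsort(-richness)                # k-means ids, richest cluster first
    relabel = np.empty(n_clusters, dtype=int)
    relabel[order] = np.arange(n_clusters)       # map each k-means id to its rank
    return pd.Series(relabel[km.labels_], index=binary_df.index)

# Toy presence matrix: 20 "poor" ids (5 features present), then 30 "rich" ids (10 present)
poor = np.ones((20, 10), dtype=int)
poor[:, 5:] = 0
rich = np.ones((30, 10), dtype=int)
toy = pd.DataFrame(np.vstack([poor, rich]),
                   columns=["feature_{}".format(i) for i in range(10)])

labels_a = stable_kmeans_labels(toy, seed=0)
labels_b = stable_kmeans_labels(toy, seed=1)
print(labels_a.value_counts())
```

With this renumbering, the cluster with more present features always gets number 0, whatever order k-means returns internally.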

In [None]:
cl=0
cols=[col_0,col_1]
for cl in [0,1]:
    remaining_cols = [x for x in cols[cl] if x not in cols[1-cl]]

    df_corr = df_clust[cl][common_cols + remaining_cols]
    df_corr = df_corr.corr()

    df_corr = df_corr[common_cols].loc[df_corr.index.isin(remaining_cols)]

    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    f, ax = plt.subplots(figsize=(11, 9))
    sns.heatmap(df_corr, cmap=cmap, vmax=1,
                square=True, xticklabels=True, yticklabels=True,
                linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)

For cluster 0, it might be possible to build a model for 'fundamental_21', 'fundamental_54' and 'fundamental_60' from the common features. For cluster 1, the candidates are 'technical_14' and 'technical_43'.
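
As a rough illustration of what such a model could look like, the sketch below fits a linear regression on the rows where a feature is present and uses it to predict that feature for rows where it is missing. Everything here is an assumption made for the example: the choice of `LinearRegression`, the helper name `impute_from_common`, and above all the toy data, where the target is an exact linear function of two predictors by construction. It says nothing about how well this would work on the real features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_from_common(df_with, df_without, predictors, target):
    """Fit target ~ predictors where the target exists, then predict it
    for the rows of df_without that have all predictors available."""
    train = df_with[predictors + [target]].dropna()
    model = LinearRegression().fit(train[predictors], train[target])
    usable = df_without[predictors].dropna()
    return pd.Series(model.predict(usable), index=usable.index, name=target)

# Synthetic stand-ins for the two clusters (values are made up)
rng = np.random.default_rng(0)
predictors = ["technical_1", "technical_2"]
cluster_0 = pd.DataFrame(rng.normal(size=(200, 2)), columns=predictors)
cluster_0["fundamental_21"] = 2.0 * cluster_0["technical_1"] - cluster_0["technical_2"]
cluster_1 = pd.DataFrame(rng.normal(size=(5, 2)), columns=predictors)

filled = impute_from_common(cluster_0, cluster_1, predictors, "fundamental_21")
print(filled)
```

On the real data, the correlations in the heatmaps above would guide which common features to use as predictors for each target.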