In this notebook, we provide a qualitative analysis of our data. We drop rows with NaN values for simplicity (missing-data imputation for learning is handled in the next notebook).

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns; sns.set(); sns.set_style('whitegrid')

from sklearn.preprocessing import OneHotEncoder

In [2]:
df_train = pd.read_csv("notebook_insights/preprocessed_train.csv")
df_train = df_train.dropna()
df_train.head()

Unnamed: 0,TARGET_FLAG,KIDSDRIV,AGE,HOMEKIDS,YOJ,INCOME,PARENT1,HOME_VAL,MSTATUS,EDUCATION,...,Student,z_Blue Collar,Commercial,Minivan,Panel Truck,Pickup,Sports Car,Van,z_SUV,Highly Urban/ Urban
0,0,0,60.0,0,11.0,11.117643,0,1.0,0,3,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0,0,43.0,0,11.0,11.423537,0,12.457811,0,0,...,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0,0,35.0,1,10.0,9.682779,0,11.729576,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
5,1,0,34.0,1,12.0,11.738474,1,1.0,0,1,...,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
8,1,0,34.0,0,10.0,11.050541,0,1.0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
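
Before dropping rows, it can help to check how much data `dropna()` discards. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are assumptions, not the real training data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
toy = pd.DataFrame({
    "AGE": [60.0, 43.0, np.nan, 34.0],
    "YOJ": [11.0, np.nan, 10.0, 12.0],
    "TARGET_FLAG": [0, 0, 0, 1],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()

# Number of rows that survive dropna() (rows without any NaN).
rows_kept = len(toy.dropna())

print(missing_per_column)
print(f"{rows_kept} of {len(toy)} rows kept after dropna()")
```

The same two calls can be applied to `df_train` right after `pd.read_csv` to see which columns drive the row loss.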


# Visualize distribution

We first visualize the distribution of each feature. The histograms show that the majority of the features are categorical.

In [None]:
df_train.hist(bins=50, figsize=(20,20), color='navy')
plt.show()
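
The visual impression that most features are categorical can be cross-checked by counting distinct values per column. This is only a sketch on a toy frame; the threshold `MAX_LEVELS` is a hypothetical choice, not something taken from the original analysis:

```python
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
toy = pd.DataFrame({
    "AGE": [60.0, 43.0, 35.0, 34.0, 51.0, 47.0],
    "PARENT1": [0, 0, 0, 1, 0, 1],
    "CLM_FREQ": [0, 2, 0, 1, 0, 2],
})

# Hypothetical threshold: columns with at most this many distinct
# values are treated as categorical-like.
MAX_LEVELS = 3

n_levels = toy.nunique()
categorical_like = n_levels[n_levels <= MAX_LEVELS].index.tolist()

print(n_levels)
print(categorical_like)
```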

# Impact of some numerical features

Observations with positive and negative TARGET_FLAG show roughly the same central tendency for AGE, YOJ, TRAVTIME and TIF in the boxplots below.

In [None]:
for column in ['AGE',
                'YOJ',
                'TRAVTIME',
                'TIF']:
    
    sns.catplot(data=df_train[['TARGET_FLAG',column]],x='TARGET_FLAG', y=column, kind="box")
    plt.title('Impact of {}'.format(column))
    plt.show()
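
A boxplot displays the median rather than the mean, so a numerical summary per class is a useful complement to the plots. A minimal sketch on made-up values (the frame `toy` is an assumption):

```python
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
toy = pd.DataFrame({
    "TARGET_FLAG": [0, 0, 0, 1, 1, 1],
    "AGE": [60.0, 43.0, 35.0, 34.0, 51.0, 53.0],
    "TIF": [1, 4, 7, 1, 4, 7],
})

# Mean and median of each numerical feature, split by target class.
summary = toy.groupby("TARGET_FLAG")[["AGE", "TIF"]].agg(["mean", "median"])

print(summary)
```

In this toy example the two classes share the same mean AGE while their medians differ, which is exactly the kind of detail a single plot can hide.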

In contrast, the boxplots for INCOME, HOME_VAL, OLDCLAIM and CAR_AGE differ visibly between the two classes.

In [None]:
for column in ['INCOME',
                'HOME_VAL',
                'OLDCLAIM',
                'CAR_AGE']:
    
    sns.catplot(data=df_train[['TARGET_FLAG',column]],x='TARGET_FLAG', y=column, kind="box")
    plt.title('Impact of {}'.format(column))
    plt.show()

# Impact of categorical features

Here, we visualize the impact of some categorical features on the target variable by plotting the mean of TARGET_FLAG for each category. For instance, the feature 'Highly Urban/ Urban' seems strongly associated with the target.

In [None]:
for column in [ 'KIDSDRIV',
            'HOMEKIDS',
            'PARENT1',
            'MSTATUS',
            'CLM_FREQ',
            'z_SUV',
            'Highly Urban/ Urban',
            'Lawyer']:

    # Select the target before aggregating so only one column is averaged
    df_train.groupby(column)['TARGET_FLAG'].mean().plot.barh()
    plt.xlabel('Mean of TARGET_FLAG (proportion of positives)')
    plt.title('Impact of {}'.format(column))
    plt.show()
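
A rate computed on a small group can be misleading, so it may be worth reporting the group size next to each rate. A minimal sketch on made-up values (the frame `toy` and the output column names `rate` and `n` are assumptions):

```python
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
toy = pd.DataFrame({
    "Highly Urban/ Urban": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "TARGET_FLAG": [1, 0, 1, 0, 0, 0],
})

# Proportion of positive targets and number of rows for each category.
rates = toy.groupby("Highly Urban/ Urban")["TARGET_FLAG"].agg(rate="mean", n="size")

print(rates)
```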

# Heatmap

We can then visualize the pairwise correlations between the features.

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
corr = df_train.corr()
# np.bool was removed from NumPy; the builtin bool is the replacement
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)
plt.title('Correlation matrix')
plt.show()
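
The mask passed to `sns.heatmap` above is all `False`, so every cell is drawn. Since a correlation matrix is symmetric, a common variant hides the redundant upper triangle. The sketch below only builds such a mask on a toy frame (the names `toy` and `lower` are assumptions); the resulting `mask` could be passed as the `mask` argument of the heatmap:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],
    "c": [4.0, 3.0, 2.0, 1.0],
})
corr = toy.corr()

# True strictly above the diagonal: those cells duplicate the ones below it.
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

# Keep the lower triangle and the diagonal, replace the rest with NaN.
lower = corr.where(~mask)

print(lower)
```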

We now focus on the correlation between each feature and the target variable.

In [None]:
corr.sort_values(by='TARGET_FLAG', ascending=False)['TARGET_FLAG']

We may drop the features whose absolute correlation with the target variable is small (below 0.08 here).

In [None]:
corr_values      = np.abs(corr['TARGET_FLAG'])
low_corr_bool    = corr_values < 0.08
low_corr_columns = corr_values[low_corr_bool].index
print(low_corr_columns)

# t-SNE

We can visualize a 2D projection to get a summary of the data. To this end, we can use either PCA or t-SNE.

In [None]:
TARGET = 'TARGET_FLAG'
y = df_train[TARGET].values
# Drop the target and reset the index so that row labels match array positions
df_train = df_train.drop(columns=TARGET).reset_index(drop=True)
# Keep only the features that are sufficiently correlated with the target
X = df_train.drop(columns=low_corr_columns)
X.head()

We standardize the numerical features before the projection: t-SNE relies on pairwise distances, so features on large scales would otherwise dominate.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.manifold import TSNE

# Preprocessing
numeric_features = ['AGE',
                    'INCOME',
                    'HOME_VAL',
                    'BLUEBOOK',
                    'OLDCLAIM',
                    'CAR_AGE']

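# Scale first, then fill any remaining NaN with 0, i.e. the mean of the scaled feature
# (a no-op here, since rows with NaN were dropped when loading the data)
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0))])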

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),],
    remainder='passthrough')

X = preprocessor.fit_transform(X)

# t-SNE: projection into 2D space
# (random_state is fixed only to make the embedding reproducible across runs)
X_embedded = TSNE(n_components=2, random_state=0).fit_transform(X)
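
PCA, the other option mentioned above, is linear and much faster than t-SNE, and it reports how much variance the two components retain. The following is a minimal sketch on synthetic data (the array `X_toy` is an assumption standing in for the preprocessed feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the preprocessed feature matrix.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))

# Standardize, then project onto the two leading principal components.
X_scaled = StandardScaler().fit_transform(X_toy)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)
# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
```

A low explained variance ratio would suggest that a 2D linear projection loses a lot of structure, which is one reason to prefer t-SNE for visualization.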

In [None]:
LABEL = 'CLM_FREQ'
values = df_train[LABEL].value_counts().index
for value in values:
    # Positional indices of the rows taking this value
    idx = np.where(df_train[LABEL] == value)[0]
    plt.scatter(X_embedded[idx, 0], X_embedded[idx, 1], label=value)
plt.legend()
plt.title('2D Projection, with colors representing {} distribution'.format(LABEL))
plt.show()