# Data Exploration

This notebook explores the dataset to extract some initial insights.
First, a meta-analysis categorizes the variables and counts the missing values.
Then, each group of variables is studied in turn.

## 1. Importing Libraries

In [None]:
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt

import numpy as np
import seaborn as sns

#import joypy
import re
#from IPython.display import display, HTML
#import ipywidgets as widgets # for later



sns.set(style="darkgrid", color_codes=True)
pd.options.display.float_format = '{:.2f}'.format

In [None]:
df = pd.read_csv('train.csv')

## 2. Variables in the data

First, let's check how many variables we have and their respective types.

In [None]:
# Based on asindico kernel - https://www.kaggle.com/asindico/porto-seguro-the-essential-kickstarter
# Classifying the variables in the data
variables = []
for variable in df.columns:
    # Default kind from the dtype: integers are "ordinal", floats "continuous"
    if pd.api.types.is_integer_dtype(df[variable]):
        tybin = "ordinal"
    elif pd.api.types.is_float_dtype(df[variable]):
        tybin = "continuous"
    else:
        tybin = "None"
    # Group: the first name fragment found in the column name
    ty = "None"
    for types in ['ind','reg','car','calc','target','id']:
        if types in variable:
            ty = types
            # 'bin'/'cat' in the name and the target override the dtype-based kind
            if 'bin' in variable:
                tybin = 'binary'
            if 'cat' in variable:
                tybin = 'categorical'
            if 'target' in variable:
                tybin = 'binary'
            break
    variables.append([variable,ty,tybin])

# Creating dataframe containing variables
variablesdf = pd.DataFrame(variables,columns=['name','type','bin'])

In [None]:
# Showing the number of variables per type
print('Total number of variables',len(variablesdf))
variablesdf.pivot_table(values='name',index='type',columns='bin',aggfunc="count",fill_value=0)

The table above shows how many variables of each kind the dataset contains. The target and id columns appear under their own types, 'target' and 'id', because both names are in the list of patterns used for the classification.
As we can see, 'calc' is the most common type, with 20 variables. Among binary, categorical and continuous variables, the continuous ones are the most common, with 26 variables.
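The classification rule described above can be sketched on a toy frame. This is only an illustration: the helper `classify_column` and the toy columns are hypothetical, and it matches whole name tokens (split on `_`) rather than substrings, which is a slightly stricter rule than the one used in the cell above.

```python
import pandas as pd

GROUPS = ["ind", "reg", "car", "calc", "target", "id"]

def classify_column(name, dtype):
    """Return (group, kind) for one column (hypothetical helper)."""
    tokens = name.split("_")
    # Group: first known fragment that appears as a token of the name
    group = next((g for g in GROUPS if g in tokens), "None")
    # Kind: the name takes priority over the dtype
    if name == "target" or "bin" in tokens:
        kind = "binary"
    elif "cat" in tokens:
        kind = "categorical"
    elif pd.api.types.is_integer_dtype(dtype):
        kind = "ordinal"
    elif pd.api.types.is_float_dtype(dtype):
        kind = "continuous"
    else:
        kind = "None"
    return group, kind

# Toy data that only mimics the naming convention of the real file
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "target": [0, 1, 0],
    "ps_ind_01": [2, 5, 0],
    "ps_ind_06_bin": [1, 0, 0],
    "ps_car_01_cat": [10, 11, -1],
    "ps_reg_03": [0.7, -1.0, 0.9],
})

summary = pd.DataFrame(
    [(c, *classify_column(c, toy[c].dtype)) for c in toy.columns],
    columns=["name", "type", "bin"],
)
print(summary.pivot_table(values="name", index="type", columns="bin",
                          aggfunc="count", fill_value=0))
```

Matching on tokens avoids accidental hits when one fragment happens to be contained in another name, at the cost of relying on the underscore convention.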

In [None]:
variablesdf = variablesdf.drop(0)  # drop row 0, the first column of the file (expected to be 'id'), which is not a feature

## 3. Variable distributions

### 3.1 Target Variable

In [None]:
sns.countplot(x='target',data=df)
print(df.shape)
print('Percentage of Target equals 1 =',np.round(df.target.mean()*100,2),'%')

### 3.2 Studying missing data

Check how the missing data is spread across the variables in the whole dataset.

In [None]:
# Missing values are coded as -1: compute their share per column
emptyvalue = (df==-1).mean()
emptyvalue = emptyvalue[emptyvalue>0]

# Plot the variables that have missing values
plt.figure(figsize=(25,7))
sns.barplot(x=emptyvalue.index,y=emptyvalue)

Now, check to see if the missing data is similarly distributed considering only the cases with target equal to 1.

In [None]:
targetdata = df[df.target==1].copy() # Subset containing only the cases in which the target is equal to 1

In [None]:
# Share of missing values (-1) per column, for target=1 only
emptyvalue = (targetdata==-1).mean()
emptyvalue = emptyvalue[emptyvalue>0]
# Plot the variables that have missing values
plt.figure(figsize=(25,7))
sns.barplot(x=emptyvalue.index,y=emptyvalue)

The missing data seems to be similarly distributed in the target=1 cases and in the whole dataset.
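To back the visual comparison with numbers, the share of missing values can be computed per target class in a single table. The following is a minimal sketch on a hypothetical toy frame (the values are made up), assuming, as in the cells above, that missing values are coded as -1.

```python
import pandas as pd

# Made-up data: -1 marks a missing value
toy = pd.DataFrame({
    "target":        [0, 0, 0, 1, 1, 0],
    "ps_car_03_cat": [-1, 1, -1, -1, 0, 1],
    "ps_reg_03":     [0.5, -1.0, 0.7, 0.6, -1.0, 0.9],
    "ps_ind_01":     [1, 2, 3, 4, 5, 6],
})

features = toy.drop(columns="target")

# Share of -1 per column within each target class (rows: variables, columns: 0 and 1)
missing_by_target = features.eq(-1).astype(float).groupby(toy["target"]).mean().T

# Keep only the variables that have missing values in at least one class
missing_by_target = missing_by_target[missing_by_target.sum(axis=1) > 0]

# Difference between the two classes, in share of rows
missing_by_target["gap"] = missing_by_target[1] - missing_by_target[0]
print(missing_by_target)
```

A gap close to zero for every variable would support the reading that missingness does not depend on the target.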

### 3.3 Binary variables

Let's start with the simplest type of data, the binary variables.

In [None]:
binarydata = pd.DataFrame(df[variablesdf.name[(variablesdf.bin=='binary')]].sum().copy(),columns=['1s'])
binarydata['1s'] =binarydata['1s']/len(df)
binarydata['0s'] =(1-binarydata['1s'])

plt.figure(figsize=(26,10))
plt.subplot(211)
plt.xticks(range(len(binarydata)),binarydata.index)
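# Stacked bars: share of 1s at the bottom, share of 0s on top ('left' was renamed to 'x' in matplotlib 2.0)
plt.bar(x=range(len(binarydata)),height=binarydata['1s'].values)
plt.bar(x=range(len(binarydata)),height=binarydata['0s'].values,bottom=binarydata['1s'].values)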

This graph shows the share of 1s and 0s in each binary variable. As we can see, variables such as 'ps_ind_10_bin','ps_ind_11_bin','ps_ind_12_bin' and 'ps_ind_13_bin' take almost a single value. So we can speculate that they will not be very useful.
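Such near-constant flags can also be listed programmatically instead of being read off the plot. A minimal sketch, on made-up data and with a hypothetical cut-off of 2% for the rarer value (the threshold is an assumption, not something tuned on the real data):

```python
import pandas as pd

# Made-up binary columns: one almost constant, one reasonably balanced
toy = pd.DataFrame({
    "ps_ind_10_bin": [0] * 99 + [1],
    "ps_ind_16_bin": [1] * 60 + [0] * 40,
})

# For a 0/1 column the mean is the share of 1s
share_of_ones = toy.mean()

# Share of the rarer of the two values
minority_share = share_of_ones.where(share_of_ones <= 0.5, 1 - share_of_ones)

threshold = 0.02  # hypothetical cut-off
near_constant = minority_share[minority_share < threshold].index.tolist()
print(near_constant)
```

Whether such variables should actually be dropped is a modelling decision; a rare flag can still be informative if it is strongly associated with the target.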

To gain more insight into how meaningful each variable is, we can check whether their distributions change much when we analyze only the target=1 subset.

In [None]:
binarydata_t = pd.DataFrame(targetdata[variablesdf.name[(variablesdf.bin=='binary')]].sum().copy(),columns=['1s_t'])
binarydata['1s_t'] = binarydata_t['1s_t']/len(targetdata)
binarydata['0s_t'] = (1-binarydata['1s_t'])

plt.figure(figsize=(26,5))
plt.xticks(range(len(binarydata)),binarydata.index)
plt.bar(x=range(len(binarydata)),height=binarydata['1s_t'].values)
plt.bar(x=range(len(binarydata)),height=binarydata['0s_t'].values,bottom=binarydata['1s_t'].values)

The distributions seem similar in both cases, but small changes can be noticed in some variables. We can plot the difference in proportions to better observe which distributions changed the most. This can serve as an initial assessment of which binary variables might be relevant.

In [None]:
plt.figure(figsize=(26,5))
binarydata['dif'] = binarydata['0s']-binarydata['0s_t']
plt.xticks(range(len(binarydata)),binarydata.index)
sns.barplot(x=binarydata.index,y=binarydata['dif'])

As we can see, "ps_ind_06_bin","ps_ind_07_bin","ps_ind_16_bin", "ps_ind_17_bin" are the ones that changed the most.
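The same ranking can be obtained as a sorted table rather than from the bar plot. The sketch below uses made-up data; it relies on the fact that the difference in the share of 0s (whole dataset minus target=1) equals the share of 1s in target=1 minus the share of 1s in the whole dataset.

```python
import pandas as pd

# Made-up data that only mimics the column naming
toy = pd.DataFrame({
    "target":        [0, 0, 0, 0, 1, 1],
    "ps_ind_06_bin": [1, 1, 1, 1, 0, 0],
    "ps_ind_07_bin": [0, 0, 1, 0, 1, 1],
    "ps_ind_10_bin": [0, 0, 0, 0, 0, 0],
})

binary_cols = [c for c in toy.columns if c.endswith("_bin")]

share_all = toy[binary_cols].mean()                           # share of 1s, whole data
share_t1 = toy.loc[toy["target"] == 1, binary_cols].mean()    # share of 1s, target=1

# Same quantity as 'dif' above: (share of 0s overall) - (share of 0s in target=1)
dif = share_t1 - share_all

# Variables ordered by how much their distribution shifts
ranking = dif.abs().sort_values(ascending=False)
print(ranking)
```

Sorting by absolute value keeps variables that shift in either direction at the top of the list.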

Finally, let's see how they are correlated.

In [None]:
plt.figure(figsize=(17,7))
plt.subplot(121)
plt.title('All dataset')
sns.heatmap(df[variablesdf.name[(variablesdf.bin=='binary')]].corr(),cmap="coolwarm", linewidths=0.1)
plt.subplot(122)
plt.title('Only target=1')
sns.heatmap(targetdata[variablesdf.name[(variablesdf.bin=='binary')]].corr(),cmap="coolwarm", linewidths=.1)

Looking at the heatmaps, variables 'ps_ind_06_bin','ps_ind_07_bin','ps_ind_08_bin' and 'ps_ind_09_bin' show some modest correlation among themselves, as do 'ps_ind_16_bin','ps_ind_17_bin' and 'ps_ind_18_bin'.

These correlations seem to hold both for the whole dataset and for target=1.

### 3.4 Categorical

Now let's study the categorical variables, starting with the number of distinct values in each one.

In [None]:
# Number of distinct values per categorical variable ('val_unicos' = unique values)
uniquecat = df[variablesdf.name[variablesdf.bin=='categorical']].nunique().to_frame('val_unicos')
uniquecat

In [None]:
for i in variablesdf.name[variablesdf.bin=='categorical']:
    plt.figure(figsize=(20,3))
    plt.subplot(121)
    g= sns.countplot(x=i,data=df)
    plt.subplot(122)
    g= sns.countplot(x=i,data=targetdata)

### 3.5 Continuous

Let's see how the "ordinal" variables are distributed. Due to memory constraints, we plot only the first 3000 data points.

In [None]:
sns.pairplot(df[variablesdf.name[
    (variablesdf.bin=='ordinal')|(variablesdf.name=="target")]][0:3000],hue='target')

From an initial visual inspection, some of these variables seem to show a clear boundary between target=0 and target=1 (e.g. 'ps_calc_10' vs. 'ps_calc_14').

Then, the same procedure is applied to the "continuous" variables, this time with the first 1000 data points.

In [None]:
sns.pairplot(df[variablesdf.name[
    (variablesdf.bin=='continuous')|(variablesdf.name=="target")]][0:1000],hue='target')

It is interesting that the "continuous" calc variables actually behave like discrete variables, as can be seen in the "checkerboard-like" scatter plots. They are also uniformly distributed (I don't know if this has any implication, but it is interesting). No clear relation is found here.
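Both observations, discreteness and uniformity, can be checked numerically instead of visually. The sketch below uses synthetic stand-ins generated with NumPy, not the real columns: `ps_calc_01` is drawn from a grid of ten values and `ps_reg_03` from a normal distribution, purely to show what the two checks look like.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption: these only imitate the behaviour seen in the plots)
toy = pd.DataFrame({
    "ps_calc_01": rng.integers(0, 10, size=5000) / 10,  # grid 0.0, 0.1, ..., 0.9
    "ps_reg_03": rng.normal(size=5000),                 # genuinely continuous
})

# Check 1: a float column with very few distinct values is effectively discrete
n_unique = toy.nunique()
print(n_unique)

# Check 2: for a discrete column, compare the share of each level;
# a small spread between the most and least frequent level suggests uniformity
level_shares = toy["ps_calc_01"].value_counts(normalize=True).sort_index()
spread = level_shares.max() - level_shares.min()
print(level_shares)
print("spread between levels:", round(spread, 4))
```

On the real data, the same two calls can be applied to the columns selected with `variablesdf.bin=='continuous'`.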

Now, let's combine both "ordinal" and "continuous".

In [None]:
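# '&' binds tighter than '|': this selects the ordinal calc variables, all continuous variables and the target
sns.pairplot(df[variablesdf.name[
    ((variablesdf.type=='calc')&(variablesdf.bin=='ordinal'))|(variablesdf.bin=="continuous")|
    (variablesdf.name=="target")]][0:500],hue='target')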

This is a lot to digest, but we can see some interesting plots in which there seems to be a clear relation between the "ordinal" variable and the "continuous" one.

## 4. Correlation

Finally, let's check how all variables are correlated amongst each other.

In [None]:
plt.figure(figsize=(17,17))
plt.title('All dataset')
sns.heatmap(df.dropna().corr(),cmap="coolwarm", linewidths=0.1)

In [None]:
plt.figure(figsize=(17,17))
plt.title('Only target=1')
sns.heatmap(targetdata.dropna().corr(),cmap="coolwarm", linewidths=.1)

So there seem to be some strong correlations, but another thing stands out: none of the calculated ('calc') variables seems to be correlated with any other variable, which seems a bit odd.
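Two points can be checked numerically here. First, since missing values are coded as -1 rather than NaN, `dropna()` does not remove them; replacing -1 with NaN before calling `corr()` makes pandas ignore them pairwise. Second, the "no correlation" impression can be quantified by taking, for each variable, its largest absolute correlation with any other variable. The sketch below uses synthetic data (two correlated `car` stand-ins and two independent `calc` stand-ins), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
base = rng.normal(size=n)

# Synthetic stand-ins: the 'car' pair shares a common component, the 'calc' pair is independent noise
toy = pd.DataFrame({
    "ps_car_12": base + 0.1 * rng.normal(size=n),
    "ps_car_13": base + 0.1 * rng.normal(size=n),
    "ps_calc_01": rng.integers(0, 10, size=n) / 10,
    "ps_calc_02": rng.integers(0, 10, size=n) / 10,
})
toy.loc[:49, "ps_car_12"] = -1  # inject missing values coded as -1

# dropna() leaves the frame unchanged because there are no NaN yet
assert len(toy.dropna()) == len(toy)

# Treat -1 as missing, then correlate (pandas skips NaN pairwise)
cleaned = toy.replace(-1, np.nan)
abs_corr = cleaned.corr().abs()

# Hide the diagonal and take the strongest correlation of each variable with any other
off_diag = abs_corr.mask(np.eye(len(abs_corr), dtype=bool))
max_by_col = off_diag.max()
print(max_by_col)
```

On the real data, `max_by_col` restricted to the calc columns would show how close to zero their strongest correlation actually is.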