This is a high-level exploratory data analysis of the Porto Seguro data. Do let me know if you like it or have any comments!

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Load data

In [None]:
# Pass the path directly so pandas opens and closes the file itself
df = pd.read_csv('../input/train.csv')

Let's get an overview of the data.

In [None]:
df.describe()

The mean for the target is very low, meaning there are very few cases with target=1.
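
To put a number on the imbalance, here is a minimal sketch on a toy frame (the `toy` data is made up for illustration; on the real data you would use `df` instead). For a 0/1 column the mean is simply the share of rows with target=1.

```python
import pandas as pd

# Toy stand-in for the training frame; the values are made up for illustration.
toy = pd.DataFrame({"target": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# For a 0/1 column, the mean equals the share of rows with target == 1.
positive_rate = toy["target"].mean()

# value_counts(normalize=True) reports the share of each class explicitly.
class_shares = toy["target"].value_counts(normalize=True)

print(positive_rate)
print(class_shares)
```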

In [None]:
df.info()

There are no NaN values in the raw data. However, missing values are encoded as -1, so every -1 is effectively a NaN.

Let's check how many -1s there are in each column.

In [None]:
# Count the -1 placeholders (missing values) in each column
col_names = df.columns.tolist()
for col_name in col_names:
    missing = np.sum(df[col_name] == -1)
    print(col_name, missing)


Some columns seem to have a large number of -1 values. Notice that no bin variables have any -1 present.
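
The same count can be done without a loop, which also makes it easy to show the share of missing values per column. This is only a sketch on a toy frame: the column names are borrowed from the dataset, but the values are made up.

```python
import pandas as pd

# Toy frame; -1 plays the role of a missing value, as in the real data.
toy = pd.DataFrame({
    "ps_car_03_cat": [1, -1, 0, 1],
    "ps_ind_06_bin": [0, 1, 0, 1],
    "ps_car_14": [-1.0, -1.0, 0.7, 0.6],
})

is_missing = toy == -1
summary = pd.DataFrame({
    "count": is_missing.sum(),   # number of -1 values per column
    "share": is_missing.mean(),  # fraction of rows that are -1
})

# Keep only the columns that have at least one -1, worst first.
summary = summary[summary["count"] > 0].sort_values("count", ascending=False)
print(summary)
```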

Replace each -1 with NaN.

In [None]:
df1 = df.replace(-1, np.nan)

Let us separate the different types of columns into different variables.

In [None]:
import re
# Column names grouped by data type (bin/cat/other) and by feature group (ind/reg/car/calc)
cat_cols = []
bin_cols = []
other_cols = []
ind_cols = []
reg_cols = []
car_cols = []
calc_cols = []
for col_name in col_names:
    if re.search('bin', col_name):
        bin_cols.append(col_name)
    elif re.search('cat', col_name):
        cat_cols.append(col_name)
    else:
        other_cols.append(col_name)
    if re.search('ind', col_name):
        ind_cols.append(col_name)
    elif re.search('reg', col_name):
        reg_cols.append(col_name)
    elif re.search('car', col_name):
        car_cols.append(col_name)
    else:
        calc_cols.append(col_name)
other_cols.remove('id')
other_cols.remove('target')
calc_cols.remove('id')
calc_cols.remove('target')
print("No. of binary columns:", len(bin_cols))
print("No. of categorical columns:", len(cat_cols))
print("No. of other columns:", len(other_cols))
print("No. of ind columns:", len(ind_cols))
print("No. of reg columns:", len(reg_cols))
print("No. of car columns:", len(car_cols))
print("No. of calc columns:", len(calc_cols))

Check the correlations within each variable group (ind, reg, car, calc).

In [None]:
corrmat = df1[ind_cols].dropna().corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True, cmap='RdBu');

In [None]:
corrmat = df1[reg_cols].dropna().corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True, cmap='RdBu');

In [None]:
corrmat = df1[car_cols].dropna().corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True, cmap='RdBu');

In [None]:
corrmat = df1[calc_cols].dropna().corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True, cmap='RdBu');

A few important correlations stand out and need further investigation:
1. **ps_ind_06_bin, ps_ind_07_bin, ps_ind_08_bin, ps_ind_09_bin** are all highly negatively correlated with each other. Since these are all binary variables, it suggests that exactly one of them is 1 while all the others are 0, which may mean that these variables are one-hot encodings of a single variable.

2. It is a similar case for **ps_ind_16_bin, ps_ind_17_bin, ps_ind_18_bin**, although the correlation between ps_ind_17_bin and ps_ind_18_bin is not so strong. It may be that they represent a single binary variable, so there could be cases where all three variables are 0.

3. **ps_car_12** and **ps_car_14** are highly correlated.

4. **ps_car_08_cat** is highly negatively correlated with **ps_car_03_cat**.
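
Pairs like the ones listed above can also be pulled out of a correlation matrix programmatically instead of reading them off a heatmap. Below is a rough sketch on synthetic data; the columns `a` to `d` and the 0.5 cutoff are assumptions for illustration, not something taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: b and c are built from a, d is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),   # nearly a copy of a
    "c": -a + rng.normal(scale=0.1, size=200),  # nearly the negative of a
    "d": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Hypothetical cutoff for calling a correlation "strong".
threshold = 0.5
strong = pairs[pairs.abs() > threshold]
strong = strong.sort_values(key=lambda s: s.abs(), ascending=False)
print(strong)
```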

Let's first check the relation between ps_ind_06_bin, ps_ind_07_bin, ps_ind_08_bin and ps_ind_09_bin.

In [None]:
sums = (df1[['ps_ind_06_bin','ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin']].sum(axis=1))
len(sums[sums == 1])

Yes! These are one-hot encodings of a single variable!
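
A slightly stricter version of this check compares every row sum against 1, and then collapses the four columns into a single categorical column. The sketch below runs on a toy frame with made-up values; `combined` is a hypothetical name for the collapsed column.

```python
import pandas as pd

onehot_cols = ["ps_ind_06_bin", "ps_ind_07_bin", "ps_ind_08_bin", "ps_ind_09_bin"]

# Toy frame with made-up values: exactly one flag is set in every row.
toy = pd.DataFrame(
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]],
    columns=onehot_cols,
)

# One-hot encoding means every row sums to exactly 1.
row_sums = toy[onehot_cols].sum(axis=1)
is_one_hot = bool((row_sums == 1).all())

# If so, idxmax returns the name of the single column that is 1 in each row.
combined = toy[onehot_cols].idxmax(axis=1)

print(is_one_hot)
print(combined)
```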

Let us now look at ps_ind_16_bin, ps_ind_17_bin and ps_ind_18_bin

In [None]:
sums = (df1[['ps_ind_16_bin','ps_ind_17_bin', 'ps_ind_18_bin']].sum(axis=1))
len(sums[sums == 1]) + len(sums[sums == 0])

As assumed earlier, these can be represented by a single binary variable. Let us add this variable to the dataframe and see if it is well correlated with the target variable.

In [None]:
df1['sum_ind_161718_bin'] = sums

In [None]:
target_bin = ['target'] + ['sum_ind_161718_bin']
corrmat = df1[target_bin].dropna().corr()
f, ax = plt.subplots(figsize=(6, 4))
sns.heatmap(corrmat, vmax=.8, square=True, cmap='RdBu');

:(, not at all!
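
For a single pair of columns, the correlation coefficient itself is easier to read than a 2x2 heatmap. A minimal sketch on made-up values (on the real data the two columns would come from `df1`):

```python
import pandas as pd

# Toy values, made up for illustration only.
toy = pd.DataFrame({
    "sum_ind_161718_bin": [1, 0, 1, 1, 0, 1, 0, 1],
    "target":             [0, 0, 1, 0, 0, 0, 1, 0],
})

# Series.corr computes the Pearson correlation by default.
r = toy["sum_ind_161718_bin"].corr(toy["target"])
print(round(r, 3))
```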

Coming to point 3 of the observations above, let us analyse the relation between ps_car_12 and ps_car_14 in some more detail.

In [None]:
# Avoid shadowing the built-in vars(); newer seaborn uses `height` instead of `size`
pair_vars = ['ps_car_12', 'ps_car_14']
g = sns.pairplot(df1.dropna(), vars=pair_vars, hue="target", height=3.5)

We can see some outliers, but they can have both target=0 and 1.

Let us examine the cat variables in more detail: how many categories there are in each variable, how many samples there are in each category, and so on.

In [None]:
# For each cat column, count the distinct categories among target=0 and target=1 rows
for cat_col in cat_cols:
    print(cat_col, df1.loc[df1['target'] == 0, cat_col].nunique(), df1.loc[df1['target'] == 1, cat_col].nunique())

Since ps_car_11_cat has so many categories, we will exclude it from our analysis for now.

In [None]:
cat_cols.remove('ps_car_11_cat')

In [None]:
types_sum = df1[cat_cols].apply(pd.Series.value_counts)
ax = types_sum.T.plot(kind='bar', figsize=(15, 7), fontsize=12)

Some cat variables are actually binary! **ps_ind_04_cat, ps_car_02_cat, ps_car_03_cat, ps_car_05_cat, ps_car_07_cat** and **ps_car_08_cat**.

**ps_car_03_cat** seems to have a lot of NaNs, so it does not seem very useful for analysis.
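
"A lot of NaNs" can be made concrete by computing the share of NaNs per column. Here is a small sketch on a toy frame; the column names come from the dataset but the values are made up, so the shares say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "ps_car_03_cat": [np.nan, np.nan, 1.0, np.nan],
    "ps_car_07_cat": [1.0, 0.0, np.nan, 1.0],
    "ps_car_08_cat": [1.0, 1.0, 0.0, 1.0],
})

# isna() marks the NaNs; the column mean of that mask is the NaN share.
nan_share = toy.isna().mean().sort_values(ascending=False)
print(nan_share)
```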

Now let us examine the variable ps_car_11_cat in more detail.

In [None]:
plt.figure(figsize=(15, 8))
df["ps_car_11_cat"].value_counts().plot(kind='bar')

Category 104 has by far the most samples in the training dataset. This actually seems like an ordinal variable!

Let us add it back to our original list.

In [None]:
cat_cols += ['ps_car_11_cat']

Now let us check the other columns, which are ordinal or continuous.

In [None]:
# Histogram with a KDE overlay for each remaining column
for col in other_cols:
    plt.figure()
    sns.histplot(df1[col].dropna(), kde=True);

There are only 4 continuous columns!
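
Besides looking at the plots, one rough way to separate continuous from ordinal columns is to count the distinct values in each. The sketch below uses synthetic data, and the cutoff `max_levels` is a hypothetical choice, not something derived from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: one column with a handful of levels, one truly continuous.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ordinal_like": rng.integers(0, 5, size=500),
    "continuous_like": rng.normal(size=500),
})

# Number of distinct non-NaN values per column.
n_unique = toy.nunique()

# Hypothetical cutoff: columns with more distinct values are treated as continuous.
max_levels = 20
continuous_cols = n_unique[n_unique > max_levels].index.tolist()

print(n_unique)
print(continuous_cols)
```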

> The graphs for **ps_calc_01, ps_calc_02 and ps_calc_03** look very similar. Are the values actually the same?

In [None]:
arr1 = ['ps_calc_01', 'ps_calc_02', 'ps_calc_03']
df1[arr1].head(10)

No!
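
Looking at the first 10 rows is enough to rule out identical columns, but a check over all rows is just as short. A sketch on made-up values:

```python
import pandas as pd

# Toy values, made up for illustration only.
toy = pd.DataFrame({
    "ps_calc_01": [0.1, 0.5, 0.9, 0.3],
    "ps_calc_02": [0.1, 0.4, 0.9, 0.7],
    "ps_calc_03": [0.2, 0.5, 0.0, 0.3],
})

# True only if the two columns match in every row.
identical_12 = toy["ps_calc_01"].equals(toy["ps_calc_02"])

# Share of rows in which the two columns agree.
share_equal = (toy["ps_calc_01"] == toy["ps_calc_02"]).mean()

print(identical_12, share_equal)
```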

Coming back to **ps_car_11**, we see that **ps_car_11** has only 7 categories, while **ps_car_11_cat** had 104! It seems likely that the labels for the two variables are interchanged.

I will try to do some more analysis when I get time (distribution of NaNs, relation of different variables to the target etc) and also some feature engineering and modelling. 

Till then, please let me know your comments and suggestions!