In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from IPython.display import display
%matplotlib inline

#This keeps the "middle" columns from being omitted when wide dataframes are being displayed
pd.options.display.max_columns = None

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))


In [None]:
#customers = pd.read_csv('../input/train.csv', na_values='-1')
customers = pd.read_csv('../input/train.csv')

In [None]:
customers.head()

In [None]:
for colname in customers.columns:
    print (colname, customers[colname].dtype)

In [None]:
bin_cols = ['target']
cat_cols = []
cont_cols = []

for colname in customers.columns:
    if 'bin' in colname:
        bin_cols.append(colname)
    elif 'cat' in colname:
        cat_cols.append(colname)
    else:
        cont_cols.append(colname)

**Exploration of the Binary Features**

In [None]:
print (len(customers))

customers[bin_cols].describe()

Noteworthy things in the description matrix for the binary features:
* There are no missing values (none of the minimums are -1)
* The target has a mean of only 0.036448, so only about 3.6% of the rows are positive and the classes are heavily imbalanced. We will need to deal with this later, for example by undersampling or oversampling.
* With means below 0.01, features ps_ind_10_bin through ps_ind_13_bin are quite sparse, with very few positive entries (even fewer than the target)
* Conversely, there are no features that are heavily positive
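
As a rough illustration of the undersampling option mentioned above, here is a minimal sketch on synthetic data. The `demo` frame and its `feature` column are made up for the example; only the `target` name and the roughly 4% positive rate mirror the real data. It is one simple way to rebalance, not necessarily the approach used later.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data: roughly 4% positives (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "target": (rng.random(10_000) < 0.04).astype(int),
})

positives = demo[demo["target"] == 1]
negatives = demo[demo["target"] == 0]

# Random undersampling: keep every positive row plus an equally sized
# random sample of negative rows, then shuffle the combined rows.
balanced = pd.concat([
    positives,
    negatives.sample(n=len(positives), random_state=0),
]).sample(frac=1, random_state=0)

print("positive rate before:", demo["target"].mean())
print("positive rate after: ", balanced["target"].mean())
```

Undersampling throws away most of the negative rows, so oversampling the positives or using class weights may be worth comparing against it.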

Based on this, it seems unlikely that any of these individual features will show much correlation at all with the target. There's a chance, though, that the sparse features might line up pretty well with the sparse targets. A correlation matrix will show this.

So let's take a look:

In [None]:
customers[bin_cols].corr()

Not a lot to see here:
* Features ps_ind_06_bin through ps_ind_09_bin have weak levels of correlation with each other (~0.20 - 0.50).
* Features ps_ind_11_bin and ps_ind_12_bin are weakly correlated with each other (0.25), and ps_ind_12_bin and ps_ind_13_bin even less so. As stated earlier, this could be useful if the "sparsities" lined up with the target, but they don't: there's no correlation with the target.
* Feature ps_ind_16_bin shows strong correlation (0.50+) with both ps_ind_17_bin and ps_ind_18_bin, but ps_ind_17_bin and ps_ind_18_bin show little correlation with each other (0.158). That seems odd.
* The "calc" binary features (ps_calc_15_bin through ps_calc_20_bin) are noteworthy for how little correlation there is, with either each other or with the target.
* The features that are correlated with each other are numbered sequentially, so these features likely represent attributes of the customer that are similar or related.
* None of the binary features show anything close to correlation with the target. The only features with a correlation coefficient with magnitude over 0.03 are ps_ind_06_bin, ps_ind_07_bin, and ps_ind_17_bin, which happen to be the features that show some correlation to each other. Is it meaningful that the features with the highest intercorrelation are also the features with the highest (but still very low) correlation with the target? Maybe...(<-- Warning: confirmation bias). Perhaps some form of PCA on these features would improve the correlation. (A quick Google search shows there are valid techniques for doing PCA-like dimensionality reduction on binary features.)

That's it for the binary features. A few of them look like they *might* be useful, but most of them don't look very informative. I'll start maintaining a list of features that are worth investigating further, and then let's move on to the categorical features.

In [None]:
# Adding features to the list with a correlation magnitude of 0.01 or greater. Tiny, I know...
potential_features = ['ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_16_bin', 'ps_ind_17_bin']
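
The list above was typed in by hand from the correlation matrix. A sketch of how the same kind of selection could be derived from a threshold is shown below, using synthetic data; the column names `demo_related_bin` and `demo_noise_bin` are hypothetical, and the 0.01 cutoff is the same tiny threshold used above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one binary column weakly tied to the target, one unrelated.
rng = np.random.default_rng(1)
n = 20_000
target = (rng.random(n) < 0.04).astype(int)
demo = pd.DataFrame({
    "target": target,
    # positives are more likely to have a 1 here (made-up relationship)
    "demo_related_bin": (rng.random(n) < 0.3 + 0.3 * target).astype(int),
    # independent of the target
    "demo_noise_bin": (rng.random(n) < 0.5).astype(int),
})

threshold = 0.01

# Correlation of every feature with the target, excluding the target itself.
target_corr = demo.corr()["target"].drop("target")

# Keep features whose absolute correlation meets the threshold.
selected = target_corr[target_corr.abs() >= threshold].index.tolist()

print(target_corr)
print(selected)
```

With a cutoff this low, an unrelated column can pass by chance, so the selection is only a shortlist for further investigation.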

**Exploration of the Categorical Features**

In [None]:
customers[cat_cols].describe()

Things to note:
* 7 of the features are missing values. However, since these are categorical I'll plan to just treat a missing value as an additional category for now
* ps_car_11_cat has high cardinality, with over 100 different categories. One-hot encoding should be OK for the other variables, but I will want to test alternative approaches (vs. OHE) for ps_car_11_cat
* ps_car_10_cat has low variance
* ps_car_08_cat is binary-like and is not missing any data. Even though the contest sponsor said these values really are categorical, I don't see a meaningful difference between a two-category feature and a binary feature, so I'll treat this as binary
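
To make the first two notes concrete, here is a minimal sketch on a toy column (`demo_cat` is a made-up name). It shows one-hot encoding with the -1 marker kept as its own category, and frequency encoding as one possible alternative for a high-cardinality feature; which alternative actually works best for ps_car_11_cat is something I would still need to test.

```python
import pandas as pd

# Toy categorical column where -1 marks a missing value (assumption that
# mirrors the dataset's convention).
demo = pd.DataFrame({"demo_cat": [0, 1, -1, 2, 1, -1]})

# One-hot encoding: -1 simply becomes one more indicator column.
# The cast to "category" is harmless for a single Series and matters when a
# whole DataFrame is passed, where integer columns are otherwise left as-is.
encoded = pd.get_dummies(demo["demo_cat"].astype("category"), prefix="demo_cat")
print(encoded.columns.tolist())

# Frequency encoding: replace each category by its share of the rows,
# which yields a single numeric column no matter how many categories exist.
freq = demo["demo_cat"].map(demo["demo_cat"].value_counts(normalize=True))
print(freq.tolist())
```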

Let's look at how ps_car_08_cat is correlated to the target before moving on to the rest of the categorical features:

In [None]:
customers[['target', 'ps_car_08_cat']].corr()

This feature's correlation is above the (tiny) 0.01 threshold I established earlier. I'll add it to the potential features list and the binary columns list.

In [None]:
# remove the binary-like feature
cat_cols = [col for col in cat_cols if col != 'ps_car_08_cat']  # a list comprehension keeps the original column order
# add it to the previous binary features list
bin_cols += ['ps_car_08_cat']
# add it to the potential features list
potential_features += ['ps_car_08_cat']

In [None]:
#sort by correlation to target and save correlation dataframe
bin_corr = customers[bin_cols].corr().sort_values(['target'], ascending=False)
#reorder the dataframe columns so we have the nice symmetry of the 1 correlations down the diagonal
bin_corr = bin_corr[list(bin_corr.index.values)]
bin_corr

In [None]:
plt.figure(figsize=(18, 14))
sns.heatmap(bin_corr, cmap="YlGnBu", annot=True, fmt='03.2f')



Now we'll return to the categorical data. Ideally, I would want to visualize the categorical data with a histogram broken down by category value, but since the data is so imbalanced it will likely be difficult to "eyeball" anything meaningful. Still, I'll take a look just in case something jumps out:

In [None]:
for col in cat_cols:
    sns.countplot(x=col, hue="target", data=customers)
    plt.show()

As suspected, not much is easily visible, although it's easy to see that ps_car_10_cat has very little variance (as previously mentioned). 
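
Since raw counts are dominated by the negative class, one thing that may be easier to read is the share of positive targets within each category. A minimal sketch on synthetic data follows; `demo_cat` and the doubled rate for category 3 are invented for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: four categories, with category 3 given twice the
# positive rate of the others (made-up relationship).
rng = np.random.default_rng(2)
n = 50_000
cat = rng.integers(0, 4, size=n)
rate = np.where(cat == 3, 0.08, 0.04)
demo = pd.DataFrame({
    "demo_cat": cat,
    "target": (rng.random(n) < rate).astype(int),
})

# Positive rate and row count per category; the count indicates how much
# to trust each rate.
summary = demo.groupby("demo_cat")["target"].agg(["mean", "count"])
summary = summary.rename(columns={"mean": "target_rate", "count": "rows"})
print(summary)
```

Rates for categories with very few rows will be noisy, so the row count should be read alongside the rate.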

Instead of "eyeballing it", I can do a Chi-square Test of Independence to see if that reveals potential relationships to the target.

In [None]:
for col in cat_cols:
    cont_table = pd.crosstab(customers['target'], customers[col])
    print ("Feature:", col, "P-value:", stats.chi2_contingency(observed= cont_table)[1])

These results are surprising. The extremely low p-values for all but one of these features (our boring, low-variance friend ps_car_10_cat) suggest that the rest could be useful. I'll add them to the potential features list, and move on to the remaining ordinal/continuous features.
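
One caveat worth keeping in mind: with a dataset this large, a chi-square test can return a tiny p-value even when the relationship is weak. An effect-size measure such as Cramér's V can help put the p-values in context. This is a sketch on synthetic data, not something computed on the real features; `cramers_v`, `demo_cat`, and the rates are all hypothetical.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

def cramers_v(x, y):
    """Return (Cramér's V, chi-square p-value) for two categorical Series."""
    table = pd.crosstab(x, y)
    chi2, p_value, dof, expected = stats.chi2_contingency(table)
    n_obs = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return np.sqrt(chi2 / (n_obs * min_dim)), p_value

# Synthetic stand-in: the positive rate drifts only slightly across five
# categories (3.5% up to 4.7%), but there are many rows.
rng = np.random.default_rng(3)
n = 500_000
cat = rng.integers(0, 5, size=n)
target = (rng.random(n) < 0.035 + 0.003 * cat).astype(int)

v, p = cramers_v(pd.Series(target, name="target"), pd.Series(cat, name="demo_cat"))
print("Cramér's V:", v, "P-value:", p)
```

A small V next to a tiny p-value would suggest the feature is related to the target, but only weakly.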

In [None]:
cat_cols = [col for col in cat_cols if col != 'ps_car_10_cat']  # drop the low-variance feature, keeping column order

potential_features += cat_cols

**Exploration of the Continuous/Ordinal Features**

In [None]:
cont_cols = [col for col in cont_cols if col not in ('id', 'target')]  # keeps column order stable between runs

customers[cont_cols].describe()

In [None]:

plt.figure(figsize=(18, 14))
plt.xticks(rotation=90)
sns.boxplot(data=customers[cont_cols])

Things to note from the description matrix and box plot of continuous/ordinal features:
* ps_reg_01, ps_calc_01, ps_calc_03, ps_calc_02, ps_car_12, and ps_car_14 initially look like low-variance features, but they are all continuous features with small-scale values, so the low variance is only relative to larger-scale features like ps_calc_11, ps_calc_14, and ps_calc_10
* Although the scales of the features are not dramatically different (only 1 order of magnitude difference), given the imbalanced dataset and scale of the binary and encoded categorical features, it may also make sense to scale the continuous features.
* Four features are missing data: ps_reg_03, ps_car_12, ps_car_14, and ps_car_11. 
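
For the scaling idea in the second note, a minimal sketch of standardization (z-scores) on a toy frame is shown here. The column names and values are made up, and on the real data the -1 placeholders would need to be handled before scaling so they don't distort the mean and standard deviation.

```python
import pandas as pd

# Toy stand-in for two continuous features on different scales (assumption).
demo = pd.DataFrame({
    "demo_small": [0.1, 0.4, 0.7, 0.9],
    "demo_large": [2.0, 9.0, 14.0, 23.0],
})

# Standardize each column: subtract its mean, divide by its standard deviation.
scaled = (demo - demo.mean()) / demo.std(ddof=0)

print(scaled.mean().round(6).tolist())       # each column is centered near 0
print(scaled.std(ddof=0).round(6).tolist())  # each column has unit spread
```

If a model is evaluated on held-out data, the means and standard deviations should come from the training rows only.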

For the purposes of this analysis, I'm not going to treat continuous features any differently than ordinal features, except when it comes to handling missing data (if necessary). Let's take a look at how much data is missing:

In [None]:
print ("Missing Feature Counts:")
for col in ['ps_reg_03', 'ps_car_12', 'ps_car_14', 'ps_car_11']:
    # count the -1 placeholders; unlike value_counts().loc[-1], this does not fail if a column has none
    print (col + ": ", (customers[col] == -1).sum())

ps_reg_03 and ps_car_14 are missing quite a bit of data, and are continuous features. If each feature has another feature that is strongly correlated with it, these could be used to provide a better replacement value than a simple mean or median.

The correlation matrix (for rows where both features are not null) is shown below:

In [None]:
plt.figure(figsize=(18, 14))
sns.heatmap(customers[(customers['ps_reg_03'] != -1) & (customers['ps_car_14'] != -1)][cont_cols].corr(), cmap="YlGnBu", annot=True, fmt='03.2f')

ps_reg_03 is strongly correlated (0.74) with ps_reg_02, and ps_car_14 is fairly well correlated with ps_car_13 and ps_car_12 (0.44 and 0.59 respectively). Let's take a quick look at some scatter plots:

In [None]:
temp = customers[(customers['ps_reg_03'] != -1) & (customers['ps_car_14'] != -1)][cont_cols]
sns.regplot(data=temp, x='ps_reg_02', y='ps_reg_03')
plt.show()
sns.regplot(data=temp, x='ps_car_13', y='ps_car_14')
plt.show()
sns.regplot(data=temp, x='ps_car_12', y='ps_car_14')
plt.show()

Since these features look like they have a relationship to the features with missing values, I should be able to use them to make a better guess when replacing the missing values.
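
A minimal sketch of what such a replacement could look like, on synthetic data: fit a one-variable linear model on the rows where the value is known, then predict the rows marked -1. The names `demo_x` and `demo_y` and the linear relationship are invented; this is one simple option, not a tuned imputation method.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: demo_y depends linearly on demo_x plus noise, and
# 500 of its values are overwritten with the -1 "missing" marker.
rng = np.random.default_rng(4)
n = 5_000
x = rng.random(n)
demo = pd.DataFrame({
    "demo_x": x,
    "demo_y": 0.5 + 1.2 * x + rng.normal(scale=0.1, size=n),
})
missing_idx = rng.choice(n, size=500, replace=False)
demo.loc[missing_idx, "demo_y"] = -1

# Fit y = slope * x + intercept using only the rows with known values.
known = demo["demo_y"] != -1
slope, intercept = np.polyfit(demo.loc[known, "demo_x"], demo.loc[known, "demo_y"], deg=1)

# Replace the -1 markers with the model's prediction.
imputed = demo.copy()
imputed.loc[~known, "demo_y"] = intercept + slope * demo.loc[~known, "demo_x"]

print("slope:", slope, "intercept:", intercept)
print("remaining missing:", (imputed["demo_y"] == -1).sum())
```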

As a final step, I'll do simple one-variable linear regressions for each continuous column against the target variable to see if potential relationships are revealed:

In [None]:
lst = []

for col in cont_cols:
    # exclude the -1 (missing) placeholders for the two features with a lot of missing data
    if col in ('ps_reg_03', 'ps_car_14'):
        rows = customers[customers[col] != -1]
    else:
        rows = customers
    slope, intercept, r_value, p_value, std_err = stats.linregress(rows[col], rows['target'])
    lst.append([col, slope, p_value])
        
cont_features = pd.DataFrame(lst, columns=['Feature', 'Slope', 'P-value'])
cont_features.sort_values(['P-value'], inplace=True)
                   
cont_features

Observations:
* None of the "_calc_" features show a significant relationship with the target, and they all have tiny slope coefficients
* At the top of the list, the ps_car_12 and ps_car_13 variables have relatively large slope coefficients and extremely low p-values (the p-value for ps_car_13 is reported to be exactly zero, which is very odd).
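
One plausible explanation for the exact zero, which I have not confirmed against this dataset, is floating-point underflow: a p-value smaller than the smallest representable double is stored as 0.0. The sketch below uses a normal tail as a stand-in for the t distribution that `linregress` uses, and the test statistic of 50 is a made-up value.

```python
import scipy.stats as stats

z = 50.0  # hypothetical, very large test statistic

# The two-sided tail probability is far below the smallest positive double,
# so it underflows to exactly 0.0.
p_two_sided = 2 * stats.norm.sf(z)

# Working on the log scale keeps the magnitude of the tail probability.
log_p = stats.norm.logsf(z)

print(p_two_sided)
print(log_p)
```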

I will add the 11 features with a p-value under 0.01 to my list of potential features.

In [None]:
#Add features with p-value under .01 to potential features list
potential_features += list(cont_features[cont_features['P-value'] < .01]['Feature'])

In [None]:
print ("Count: ", len(potential_features))
potential_features