This is my very first completely self-written Kernel. I have stolen ideas from all over the place (visualization most notably from @Anisotropic) and applied them to this dataset to see how well I understand them. At this point I am reinforcing my fundamentals, especially the ones I have picked up from reading a number of ML books.

There is some work to be done (adding more key points I have picked up and the reasoning behind them) and this Kernel is not final since I want to do more (maybe in a second part). The to-do list for this Kernel (apart from the longer list I have below) involves:

1. Discretizing certain continuous variables
2. Outlier cleaning for important features
3. Re-looking NaNs with a reasonable assumption that NaNs are generally systematic rather than random (and comparing NaNs with test)
4. Looking at more feature transformations for important features
5. Do I really understand what I am doing or am I doing it while others do it? (*Most Important*!)


**I would love some feedback. Thank you in advance!**

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr, pointbiserialr
from sklearn.metrics import matthews_corrcoef, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.base import clone
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
import missingno as msno
from itertools import combinations_with_replacement, combinations
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
#train = pd.read_csv("D:/Kaggle_Data/Safe Driver/train.csv")  # could have used the na_values=-1 argument for automatic replacement of -1 with NaNs
#test = pd.read_csv("D:/Kaggle_Data/Safe Driver/test.csv")
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
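
The `na_values` option mentioned in the comment above can be tried on a tiny in-memory CSV. This is only a sketch: the rows are made up and just the column naming follows the competition files.

```python
import io
import pandas as pd

# Toy CSV standing in for train.csv (made-up values, real naming scheme)
csv_text = "id,ps_ind_02_cat,ps_reg_03\n1,2,0.5\n2,-1,-1\n3,1,0.7\n"

# na_values=[-1] turns every -1 into NaN while the file is being parsed
toy = pd.read_csv(io.StringIO(csv_text), na_values=[-1])
nan_counts = toy.isna().sum()
print(nan_counts)
```

Reading the data this way would make the later `replace(-1, np.nan)` step unnecessary.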

In [None]:
train.shape

In [None]:
train.target.value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(4,4))
ax = sns.countplot(x=train.target)  # newer seaborn versions expect the variable as a keyword
ax.set_facecolor('white')

Now that's one ugly unbalanced dataset. Gotta learn to live with it.

In [None]:
plt.figure(figsize=(4,4))
ax = sns.countplot(x=train.dtypes.astype(str))  # dtype names as strings, passed as a keyword
ax.set_facecolor('white')

In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., __ind, reg, car, calc__). In addition, feature names include the postfix __bin__ to indicate binary features and __cat__ to indicate categorical features. __Features without these designations are either continuous or ordinal__. Values of __-1__ indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.

In [None]:
pd.Series(train.columns)[:10]  # a trivial conversion to Series to see the names without the ugly single quotes

Steps:

1. Get rid of the unnecessary "ps_" prefix in all columns since it just makes it hard to see the names
2. Create name clusters to see how strongly the similarly grouped features are correlated
3. Since step 2 automatically puts a few binary columns into each group, we will leave those alone in the first pass

In [None]:
# Find the first underscore from the left and keep the remaining characters of the col.name
new_col_names = [s[s.find("_")+1:] for s in train.columns]

In [None]:
test_new_col_names = new_col_names[:]
test_new_col_names.remove('target')

In [None]:
pd.DataFrame.from_dict({"New Names":new_col_names, "Old Names":list(train.columns)})

In [None]:
train.columns = new_col_names
test.columns = test_new_col_names

Now we'll try to group the column names depending on their prefix. We will use the default dictionary for this purpose to store the prefix as the key and the list of columns in the value.

In [None]:
# checking how many prefixes exist
prefixes = {s[:s.find("_")] for s in train.columns if "_" in s}
print(prefixes)

In [None]:
grouped_cols = defaultdict(list)

for prefix in prefixes:
    # startswith avoids accidental matches of the prefix in the middle of a name
    grouped_cols[prefix] = [col for col in train.columns if col.startswith(prefix + "_")]

Checking if the dictionary is indeed working

In [None]:
grouped_cols['reg']

### The nightmare of NaNs

First we have to see what kind of missing-value counts we are facing, since correlations etc. only make sense if the data has at least a semblance of completeness. Since NaNs were encoded as -1, we convert them back so they are easier to visualize. Here we go!

In [None]:
train.replace(-1,np.nan, inplace=True)
test.replace(-1,np.nan, inplace=True)

Below we see the absolute NaN count per column and, more importantly, that count as a percentage of the column length.

In [None]:
def display_nans(df):
    '''
    returns a dataframe with Number of NaNs in each column and also as a percentage of all rows in that column

    :param df: DataFrame containing NaNs. Type: pandas.DataFrame
    :return: DataFrame with indices as column names and columns as no. of NaN values and their percentage of # of rows.
    '''
    nans = pd.concat([df.isnull().sum(), (df.isnull().sum() / df.shape[0]) * 100], axis=1,
                     keys=['Num_NaN', 'NaN_Percent'])
    return nans[nans.Num_NaN > 0]

In [None]:
# Train NaNs
display_nans(train)

In [None]:
# Test NaNs
display_nans(test)

Before we do something about the NaNs, it is useful to see whether their missingness is correlated across columns. That might help us do a more advanced version of imputation.

In [None]:
nans = pd.concat([train.isnull().sum(), (train.isnull().sum() / train.shape[0]) * 100], axis=1, keys=['Num_NaN', 'NaN_Percent'])
cols_with_nans = nans[nans.Num_NaN > 0].index
msno.matrix(df=train.loc[:,cols_with_nans], figsize=(20, 20), color=(0.24, 0.77, 0.77))

Although there does seem to be some kind of correlation for example between all the ind NaNs, we will use the prefixes (since they allude to similar groupings) to look for correlation and fill up the columns. For the sake of simplicity I will be comparing them in pairs.

Note that a few columns like car_03_cat and car_05_cat have a lot more nans so filling them is optional. One approach could be to get rid of them. Yet another approach is to see if they are nan in case of any particular value or set of values in other columns and impute them accordingly. I don't know how I could do that very neatly but hell, I'll give it a try later at that too (Perhaps treating it as a sub-ML problem)
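
One way to start on the idea of checking whether a column is NaN for particular values of another column is to compute the NaN rate per category. A minimal sketch on made-up data follows; conditioning on `car_01_cat` is an arbitrary choice for illustration, not something established in this Kernel.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names are borrowed from the dataset
toy = pd.DataFrame({
    "car_01_cat": [7, 7, 7, 11, 11, 11],
    "car_03_cat": [np.nan, np.nan, 1.0, 0.0, 1.0, np.nan],
})

# Share of missing car_03_cat values within each car_01_cat category
nan_rate = toy["car_03_cat"].isna().groupby(toy["car_01_cat"]).mean()
print(nan_rate)
```

A rate that differs strongly between categories would hint that the missingness is systematic rather than random.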

In [None]:
cols_with_nans_ind = [col for col in cols_with_nans if "ind" in col]

# ind_04_cat has a lot more target 1.0s than one might expect when its value is Null. One option 
# could be to create a new category "2" or something.

for col1, col2 in combinations(cols_with_nans_ind, 2):
    print(col1,col2, ":", end=" ")
    # rows where both columns are NaN at the same time
    # (Index & Index no longer performs a set intersection in recent pandas)
    count_of_both_nans = (train[col1].isnull() & train[col2].isnull()).sum()
    print(count_of_both_nans, 'common indices')

The above result shows that when ind_04_cat is NULL then the other two columns namely ind_02_cat and ind_05_cat are also NULL. This means when filling in these NaN values, we ought to maintain this consistency and use ind_04_cat as the base. 

What about the other NaNs that exist only in the other two columns, i.e. 02 and 05, and in fact are the majority? Well, again for simplicity, I will fill them with the same value, since we'd be filling them up with the mode of those columns. (They could have been filled in some other fashion, like a one-way ANOVA that checks the relationship between a continuous variable and a categorical variable, but let's not sweat too much over a relatively small number of NaN values.)

Side Note: On the other hand, the NaNs of ind_02_cat and ind_05_cat do not seem to be correlated to one another as only two extra instances of these are together null apart from the 79 that they share in common with ind_04_cat.
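
The pairwise counts can also be collected into one co-missingness matrix. The sketch below uses made-up rows with the three column names from this section; the diagonal holds each column's own NaN count and the off-diagonal cells hold the number of rows where both columns are NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the dataset
toy = pd.DataFrame({
    "ind_02_cat": [np.nan, 1, np.nan, 2, 1],
    "ind_04_cat": [np.nan, 0, 1, 1, 0],
    "ind_05_cat": [np.nan, 0, 3, np.nan, 0],
})

# 1 where a value is missing, 0 otherwise
missing = toy.isna().astype(int)

# (columns x rows) times (rows x columns) counts joint missingness
co_missing = missing.T.dot(missing)
print(co_missing)
```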

In [None]:
train['ind_04_cat'].value_counts(dropna=False)

In [None]:
# Since 0 occurs more often and the values are categorical, we will impute it with the mode
ind_04_mode = train['ind_04_cat'].mode()[0]  # dont forget the damn [0]
# assigning the result back avoids chained inplace fills, which newer pandas warns about
train['ind_04_cat'] = train['ind_04_cat'].fillna(ind_04_mode)
test['ind_04_cat'] = test['ind_04_cat'].fillna(ind_04_mode)  # Test NaNs are filled with Train mode values

pd.DataFrame({'Train':train['ind_04_cat'].value_counts(dropna=False), 'Test':test['ind_04_cat'].value_counts(dropna=False)})

Now we need to check what values of ind_02_cat and ind_05_cat occurs most often for the 0.0 of ind_04_cat

In [None]:
print('For ind_02_cat:', '\n', train.loc[train.ind_04_cat==0.0,'ind_02_cat'].value_counts())
print("*"*30)
print('For ind_05_cat:', '\n', train.loc[train.ind_04_cat==0.0,'ind_05_cat'].value_counts())

As can be seen above, after filling up ind_04_cat, the overall mode holds its value when checked conditionally against only the 0.0 value of ind_04_cat. Hence we're not doing anything crazy by filling up those NaNs with the mode of that column.
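
Had the conditional mode differed from the overall one, a group-wise fill would have been the natural alternative. The sketch below shows that variant on made-up data in which the two modes disagree; it is not what this Kernel ends up doing.

```python
import numpy as np
import pandas as pd

# Made-up data: the overall mode of ind_02_cat is 1, but within ind_04_cat == 1 it is 2
toy = pd.DataFrame({
    "ind_04_cat": [0, 0, 0, 0, 1, 1, 1, 1],
    "ind_02_cat": [1, 1, 1, 2, 2, 2, 1, np.nan],
})

overall_mode = toy["ind_02_cat"].mode()[0]

# Mode of ind_02_cat within each ind_04_cat group
group_modes = toy.groupby("ind_04_cat")["ind_02_cat"].agg(lambda s: s.mode().iloc[0])

# Fill each NaN with the mode of its own group
filled = toy["ind_02_cat"].fillna(toy["ind_04_cat"].map(group_modes))
print(overall_mode, filled.iloc[7])
```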

In [None]:
# So for both the mode value holds.
for col in ['ind_02_cat', 'ind_05_cat']:
    train_mode = train[col].mode()[0]
    train[col] = train[col].fillna(train_mode)
    test[col] = test[col].fillna(train_mode)  # test is filled with the train mode

Dropping the overly NaN-ed columns

In [None]:
train.drop(['car_03_cat','car_05_cat'], axis=1, inplace=True)
test.drop(['car_03_cat','car_05_cat'], axis=1, inplace=True)

Rechecking the NaN situation

In [None]:
#Train
display_nans(train)

In [None]:
#Test
display_nans(test)

Since we'd already seen before that the other NaNs do not seem to be closely correlated, we'd just go ahead with imputing them column by column.

In [None]:
# starting with the easiest ones i.e. with the fewest values and seem to be ordinal
#train.car_01_cat.value_counts(dropna=False)  # uncomment one at a time to see the value counts
#train.car_02_cat.value_counts(dropna=False)
train.car_11.value_counts(dropna=False)

In [None]:
# Mode for the categorical/ordinal columns; test is always filled with train statistics
for col in ['car_01_cat', 'car_02_cat', 'car_11']:  # assuming car_11_cat has nothing to do with car_11
    train_mode = train[col].mode()[0]
    train[col] = train[col].fillna(train_mode)
    test[col] = test[col].fillna(train_mode)

car_12_mean = train['car_12'].mean()  # mean since car_12 is continuous
train['car_12'] = train['car_12'].fillna(car_12_mean)
test['car_12'] = test['car_12'].fillna(car_12_mean)

In [None]:
f, axarr = plt.subplots(1,4, figsize=(16,5))
train.plot(x="target", y="reg_03", ax=axarr[0], kind="scatter");
train.plot(x="target", y="car_09_cat",ax=axarr[1], kind="scatter");
train.plot(x="target", y="car_07_cat",ax=axarr[2], kind="scatter");
train.plot(x="target", y="car_14",ax=axarr[3], kind="scatter");

As can be seen above, reg_03 takes basically the same values for both target classes. Hence it will probably be of limited significance. The same goes for car_09_cat, car_07_cat and car_14.

In [None]:
reg_03_mean = train['reg_03'].mean()
train['reg_03'] = train['reg_03'].fillna(reg_03_mean)
test['reg_03'] = test['reg_03'].fillna(reg_03_mean)

# the remaining columns are filled with the train mode
for col in ['car_09_cat', 'car_07_cat', 'car_14']:
    train_mode = train[col].mode()[0]
    train[col] = train[col].fillna(train_mode)
    test[col] = test[col].fillna(train_mode)

Final check that we have indeed gotten rid of all the damn NaNs :D

In [None]:
# Train
display_nans(train)

In [None]:
# Test
display_nans(test)

### Heatmaps for correlations between continuous variables of the same groupings (and also with target)

Before looking for correlations, it is very helpful to remind ourselves that many kinds of clearly visible "patterns" still return a linear correlation of zero:

![](http://cdn-ak.f.st-hatena.com/images/fotolife/h/hsameshima/20130703/20130703153559.png)
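
The same point can be checked numerically. In this small sketch (synthetic data), a perfectly deterministic quadratic relationship gives a Pearson correlation of essentially zero, while a straight line gives one.

```python
import numpy as np

# x is symmetric around zero, y depends on x exactly, but not linearly
x = np.linspace(-1, 1, 201)
y = x ** 2

r_quadratic = np.corrcoef(x, y)[0, 1]        # essentially zero
r_linear = np.corrcoef(x, 2 * x + 1)[0, 1]   # essentially one
print(round(r_quadratic, 6), round(r_linear, 6))
```

So a near-zero cell in the heatmaps below rules out a linear relationship only, not a relationship in general.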

We will start by drawing a heatmap of all sets but ensuring that no categorical or binary variables are selected since they need a different kind of treatment

In [None]:
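def draw_heatmap(filtered_cols, train, fmt='.1f', calc_corr=True):
    # calc_corr=True: `train` is the raw data and correlations are computed here
    # calc_corr=False: `train` is already a coefficient matrix and is plotted as-is
    f,ax = plt.subplots(figsize=(len(filtered_cols),len(filtered_cols)))
    if calc_corr:
        sub_train = train.loc[:,filtered_cols]
        # note: with vmin=0, negative correlations share the colour of zero
        sns.heatmap(sub_train.corr(), annot=True, fmt=fmt, ax=ax, vmin=0, vmax=1);
    else:
        sns.heatmap(train, annot=True, fmt=fmt,ax=ax);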

In [None]:
prefix='calc'
filtered_cols = [col for col in grouped_cols[prefix] if ('bin' not in col) and ('cat' not in col)] + ['target']
draw_heatmap(filtered_cols, train)

So the "calc" fields appear to be at least linearly uncorrelated with one another.

In [None]:
prefix='reg'
filtered_cols = [col for col in grouped_cols[prefix] if ('bin' not in col) and ('cat' not in col)] + ['target']
draw_heatmap(filtered_cols, train)

In [None]:
prefix='ind'
filtered_cols = [col for col in grouped_cols[prefix] if ('bin' not in col) and ('cat' not in col)] + ['target']
draw_heatmap(filtered_cols, train)

In [None]:
prefix='ind'
filtered_cols = [col for col in grouped_cols[prefix] if ('bin' not in col) and ('cat' not in col)] + ["target"]
sub_train = train.loc[:,filtered_cols]
sns.pairplot(sub_train,height=2.5,hue="target");  # `size` was renamed to `height` in seaborn 0.9

In [None]:
prefix='car'
filtered_cols = [col for col in grouped_cols[prefix] if ('bin' not in col) and ('cat' not in col)] + ['target']
draw_heatmap(filtered_cols, train)

In [None]:
prefix='car'
filtered_cols = [col for col in grouped_cols[prefix] if ('bin' not in col) and ('cat' not in col)] + ["target"]
sub_train = train.loc[:,filtered_cols]
sns.pairplot(train,height=2.5, vars=filtered_cols,hue="target", plot_kws={'alpha':0.3});  # `size` was renamed to `height`

In [None]:
sns.lmplot(x="car_13", y="car_15", hue="target", data=train,scatter_kws={'alpha':0.3});

In [None]:
sns.lmplot(x="car_12", y="car_15", hue="target", data=train,scatter_kws={'alpha':0.7});

This was an attempt to look closer for any hope of separation using two variables, but boy, they're damn well sandwiched together.

### Comparing binary variables with the continuous features (within the same groupings) and target

Checking which columns are binary...

In [None]:
[s for s in train.columns if "_bin" in s]

Converting the columns to the right data type

In [None]:
for column in [col for col in train.columns if "bin" in col]:
    train[column] = train[column].astype(bool)
    test[column] = test[column].astype(bool)

train['target'] = train['target'].astype(bool)

Pearson's R is usually described for pairs of continuous variables. Hence,

1. ~~For comparing the binary variables to binary variables, we will calculate the phi coefficient.~~ 
 I am using a row-wise comparison (the share of rows on which two binary columns agree) instead, since the phi coefficient calculation was returning wrong values due to a problem for which I have posted a question <a href="https://www.kaggle.com/questions-and-answers/41464">here</a>.
2. For comparing the binary to continuous variables, we will calculate the point-biserial correlation.
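
For orientation, both coefficients named in the list are numerically Pearson's R applied to 0/1-coded data. The sketch below checks this on made-up arrays using the textbook formulas. It does not address the problem described in the linked question.

```python
import numpy as np

# Point-biserial: one binary and one continuous variable (made-up values)
binary = np.array([0, 0, 0, 1, 1, 0, 1, 0], dtype=float)
cont = np.array([0.1, 0.4, 0.3, 0.9, 0.7, 0.2, 0.8, 0.5])

m1, m0 = cont[binary == 1].mean(), cont[binary == 0].mean()
n1, n0, n = (binary == 1).sum(), (binary == 0).sum(), binary.size
# textbook formula, using the population standard deviation (ddof=0)
r_pb = (m1 - m0) / cont.std() * np.sqrt(n1 * n0 / n ** 2)
r_pearson = np.corrcoef(binary, cont)[0, 1]

# Phi: two binary variables, computed from the 2x2 contingency counts
a = np.array([1, 1, 0, 0, 1, 0, 0, 1])
b = np.array([1, 0, 0, 0, 1, 0, 1, 1])
n11 = np.sum((a == 1) & (b == 1))
n10 = np.sum((a == 1) & (b == 0))
n01 = np.sum((a == 0) & (b == 1))
n00 = np.sum((a == 0) & (b == 0))
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
)
phi_pearson = np.corrcoef(a, b)[0, 1]

print(np.isclose(r_pb, r_pearson), np.isclose(phi, phi_pearson))
```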

__Starting with the prefix "ind"__

In [None]:
# defining a correlation dataframe maker

def correl_df_maker(filtered_cols, train, round_to=2):
    # bool/bool pairs: share of matching rows; bool/continuous: point-biserial;
    # everything else: Pearson's r

    coeff_df = pd.DataFrame(columns=filtered_cols,index=filtered_cols)
    for idx,col in combinations_with_replacement(filtered_cols,2):

        if train[idx].dtype == bool and train[col].dtype == bool:
            coeff = np.sum(train[idx]==train[col])/train.shape[0]
        elif train[idx].dtype == bool:
            coeff = pointbiserialr(train[idx].values, train[col].values)[0]
        elif train[col].dtype == bool:
            coeff = pointbiserialr(train[col].values, train[idx].values)[0]
        else:
            coeff = pearsonr(train[idx].values, train[col].values)[0]

        # the table is symmetric, so fill both triangles
        # (np.round replaces np.round_, which was removed in NumPy 2.0)
        coeff_df.loc[idx,col] = coeff_df.loc[col,idx] = np.round(coeff, round_to)

    return coeff_df.astype(float)

In [None]:
prefix='ind'

# comparing all non-categorical "ind" variables (binary and otherwise) with one another and with target
filtered_cols = [col for col in grouped_cols[prefix] if ('cat' not in col)] + ['target']
coeff_ind_df = correl_df_maker(filtered_cols, train)
coeff_ind_df

In [None]:
draw_heatmap(filtered_cols, coeff_ind_df, fmt='.2f', calc_corr=False)

Next we will do the same with the only other prefix with binary variables: __"calc"__

In [None]:
prefix='calc'

# comparing all non-categorical "calc" variables (binary and otherwise) with one another and with target
filtered_cols = [col for col in grouped_cols[prefix] if ('cat' not in col)] + ['target']
correl_calc_df = correl_df_maker(filtered_cols,train)
correl_calc_df

In [None]:
draw_heatmap(filtered_cols, correl_calc_df, fmt='.2f', calc_corr=False)

What's next?

So I have the correlations between binary variables and also between continuous and binary variables. The question is... what now? A bunch of variables seem to match target very well, but that is because of the class imbalance in the output: one could always say False and still match more than 96 percent of the rows. Among these, I need to see which one gives the highest f1_score, keep that one and perhaps remove the rest?
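
A small synthetic example makes the imbalance effect concrete. Below, a made-up binary feature that never fires together with the target still "matches" it on 95 percent of rows, while its phi coefficient is close to zero. The 4 percent positive rate is an assumption chosen to resemble this dataset.

```python
import numpy as np

n = 1000
target = np.zeros(n, dtype=bool)
target[:40] = True            # 4 percent positives (assumed rate)
feature = np.zeros(n, dtype=bool)
feature[-10:] = True          # fires on 10 rows, none of them positive

always_false_match = np.mean(~target)        # what "always say False" achieves
match_fraction = np.mean(target == feature)  # the row-wise comparison used above
phi = np.corrcoef(target.astype(int), feature.astype(int))[0, 1]

print(always_false_match, match_fraction, round(phi, 3))
```

The match fraction is therefore best compared against the constant-False baseline rather than against zero.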

Furthermore, I tried training the RandomForestClassifier (RFC) and it gives less than ideal results. Categorical Variables are basically binary variables also after OHE (One-Hot Encoding) so I could do that and check the correlation again perhaps or run RFC/Catboost etc. on it to see again how well I do.

A further option would be to standardize the fields using StandardScaler and even eliminate or shorten the long tails of continuous variables (should they be normally distributed) before doing that.

Final options, run LDA, PCA, try different methods, try stacking with different methods, imbalance learn and finally the two beasts feature engineering and deep learning (with Keras preferably but TF too).

To-Do:

1. Get rid of highly correlated features
2. Do OHE for the categorical features
3. Try Random Forest
4. Check the distribution of ordinal and continuous variables
5. If they have long tails, bring them to the center by taking the log
6. Standardize them using StandardScaler
7. Run the LDA and PCA on the model and draw the necessary conclusions
8. Try Linear Model (perhaps a regularized version like ElasticNet), try the Support Vector Machines, try k-Nearest Neighbors.
9. See if a simple ensemble for these performs better
10. Try dealing with the imbalance of classes somehow (SMOTE etc.) and retry the models. Check if the outputs improved
11. Try the models and ensemble again
12. Try some of the feature reduction techniques to see if the result improves
13. Try to make sense of the features even though they are anonymized
14. Try a deep learning model with Keras
15. Try adversarial validation

### Remove highly correlated features

So let's begin. In the next step, I will get rid of strongly inter-correlated features:

In [None]:
for column in ['ind_10_bin', 'ind_11_bin', 'ind_12_bin', 'ind_13_bin']:

    print("*"*15,column,"*"*15)
    print(classification_report(train.target.values, train[column].values))

So I will go with ind_12_bin and get rid of the others since it has a better recall and f1-score than the rest.

In [None]:
train.drop(['ind_10_bin', 'ind_11_bin', 'ind_13_bin'], axis=1, inplace=True)
test.drop(['ind_10_bin', 'ind_11_bin', 'ind_13_bin'], axis=1, inplace=True)

In [None]:
for column in ['calc_18_bin', 'calc_20_bin']:

    print("*"*15,column,"*"*15)
    print(classification_report(train.target.astype(int).values, train[column].astype(int).values))

In [None]:
train.drop(['calc_18_bin'], axis=1, inplace=True)  # since its recall of 1s is better even though it has an overall worse f1-score
test.drop(['calc_18_bin'], axis=1, inplace=True)

<p><font color="green"> Before I close this conversation, I would like to reflect on what Pearson's R really means. I mean, if a value is anything below 1.0, where do I stop dropping features because they are highly correlated? How can "high" be mathematically defined? I thought about it and came to the conclusion that beyond the extreme values of 0 and 1, what I am missing is the kind of picture one usually associates with a concept. Yes, I know the higher the number the "more obvious" the correlation.</font></p>

<p><font color="green">So what does it say when, for example, one pair of features has a correlation of 0.6 and another pair 0.7? Should one feature be dropped or neither? So far I am going with this understanding: unless two features are almost completely correlated, with any lack of perfect correlation being due to the noise that is unavoidable in the real world, there are real influences that lead to divergence, and that divergence may capture information.</font></p>

<p><font color="green">For example, as a rule you always reach work 15 minutes after you leave your front door. So the leaving time and arriving time have a perfect correlation, but the traffic introduces variance, i.e. the noise.</font></p>

<p><font color="green">On the other hand, if you drop your kid off at school twice per week, which takes you 10 more minutes, the divergence is not noise and will bring the correlation down. Such a relationship could only be modeled if the data was chronological, so that the data points would show a pattern; otherwise a second feature "go to school" would be needed to model it using a linear relationship (arrival time = 15x1 + 10x2)</font></p>

_<p><font color="green">So the question is always, to what extent is the noise bringing the correlation down and to what extent is it the work of other factors that need to be paid attention to.</font></p>_
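
The commute example can be put into numbers. In the sketch below (synthetic, noise-free data), the school run pulls the correlation between leaving and arriving time below 1, and adding a "go to school" indicator as a second feature lets ordinary least squares recover the 15 and 10 minute effects exactly. Modeling the leaving time as its own regressor is my assumption about how to set the example up.

```python
import numpy as np

# Leaving times in minutes after some reference point, over 20 working days
leave = np.arange(20) * 3.0
# School run on two out of every five days (1 = go to school)
school = np.tile([1.0, 0.0, 0.0, 1.0, 0.0], 4)
# Arrival is 15 minutes after leaving, plus 10 on school days
arrive = leave + 15 + 10 * school

# Correlation with the leaving time alone is high but not perfect
r_leave_arrive = np.corrcoef(leave, arrive)[0, 1]

# Least squares with an intercept and the school indicator explains everything
X = np.column_stack([leave, np.ones_like(leave), school])
coef, *_ = np.linalg.lstsq(X, arrive, rcond=None)

print(round(r_leave_arrive, 3), np.round(coef, 3))
```

Here the whole gap to a perfect correlation comes from the omitted feature, none of it from noise.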


<p> Get rid of highly correlated features &#10004;</p>

### One-Hot Encoding

This is the standard process of converting categories (numerical or strings) into binary format.

In [None]:
for col in [col for col in train.columns if "cat" in col]:
    print(col, end="|")  
    df = pd.get_dummies(train[col],prefix=col).astype(bool)
    train.drop([col],axis=1,inplace=True)  # dropping the original columns
    train = pd.concat([train, df], axis=1)
    
    df = pd.get_dummies(test[col],prefix=col).astype(bool)
    test.drop([col],axis=1,inplace=True)  # dropping the original columns
    test = pd.concat([test, df], axis=1)

In [None]:
train.shape, test.shape  # train has target so all is well!
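
A matching column count does not by itself guarantee matching columns, since a category could occur in only one of the two sets. One possible safeguard, sketched here on made-up data, is to reindex the encoded test columns to the train columns.

```python
import pandas as pd

# Made-up data: category 2 appears in train but not in test
train_toy = pd.DataFrame({"car_04_cat": [0, 1, 2, 1]})
test_toy = pd.DataFrame({"car_04_cat": [0, 1, 1, 0]})

train_ohe = pd.get_dummies(train_toy["car_04_cat"], prefix="car_04_cat").astype(bool)
test_ohe = pd.get_dummies(test_toy["car_04_cat"], prefix="car_04_cat").astype(bool)

# Give test exactly the train columns; categories unseen in test become all-False
test_aligned = test_ohe.reindex(columns=train_ohe.columns, fill_value=False)
print(list(test_aligned.columns))
```

Categories that appear only in test are dropped by this reindexing, which is a design choice rather than the only option.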

<p> Do OHE for the categorical features &#10004;</p>

### Random Forest Classification

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train.iloc[:,2:].values, train.iloc[:,1].values, random_state=42)

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=50, class_weight="balanced", random_state=42)

In [None]:
rf_clf.fit(X_train, y_train)

In [None]:
y_train_pred = rf_clf.predict(X_train)
confusion_matrix(y_train,y_train_pred)

In [None]:
y_test_pred = rf_clf.predict(X_test)
confusion_matrix(y_test, y_test_pred)

In [None]:
dict(zip(train.columns[2:], np.round(rf_clf.feature_importances_*100,3)))

<p> Try Random Forest  &#10004;</p>

### Violin plots for distribution comparison

First we will check how many ordinal and continuous variables we have.

In [None]:
train.select_dtypes(include=[int,float]).describe()

In [None]:
columns_for_violin = list(train.iloc[:,1:].select_dtypes(include=[int,float]).columns) + ['target']  # target for hue
data = train[columns_for_violin]
data = pd.melt(data, id_vars='target', var_name='feature', value_name="value")
plt.figure(figsize=(len(columns_for_violin), len(columns_for_violin)))
sns.violinplot(x="feature", y="value", hue="target", data=data,split=True,inner="quart")
plt.xticks(rotation=90);

<font color="red">What may seem like different data distribution in calc_01/02/03 is actually nothing more than class imbalance (i.e. difference in the number of samples of each class) as can be seen by the numbers below.</font>

In [None]:
calc01_dist_df = pd.concat([train[train.target==True].calc_01.value_counts(),train[train.target==False].calc_01.value_counts()], axis=1)
calc01_dist_df.columns = ['True','False']
calc01_dist_df
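
Normalizing the counts within each class removes the effect of class size, which makes the comparison more direct than raw counts. This is a sketch on made-up data in which both classes have the same distribution but very different sizes.

```python
import pandas as pd

# Made-up data: 6 positives and 60 negatives, both split evenly over two values
toy = pd.DataFrame({
    "calc_01": [0.1] * 3 + [0.2] * 3 + [0.1] * 30 + [0.2] * 30,
    "target": [1] * 6 + [0] * 60,
})

raw_counts = pd.crosstab(toy["calc_01"], toy["target"])
# normalize="columns" turns each target column into proportions that sum to 1
within_class = pd.crosstab(toy["calc_01"], toy["target"], normalize="columns")
print(raw_counts)
print(within_class)
```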

In [None]:
f, axarr = plt.subplots(3,2, figsize=(16,5))
sns.distplot(train.car_13,ax=axarr[0,1])
sns.distplot(train.car_11,ax=axarr[1,1])
sns.distplot(train.car_15,ax=axarr[2,1])
sns.distplot(train.reg_02,ax=axarr[0,0])
sns.distplot(train.calc_01,ax=axarr[1,0])
sns.distplot(train.car_12,ax=axarr[2,0]);

Much of the data, in my opinion based on the graphs above, is ordinal. This can be seen from the frequency peaks at certain points, with zero frequency in between.

In [None]:
sns.distplot(train.car_13);  # a long tail

First thoughts: ~~Before applying the log function, I will try standardizing it and see if that helps.~~

I can't take the log after standardization because then all values will no longer be positive.
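
The ordering issue is easy to see on a handful of made-up positive, right-skewed values: standardizing first produces non-positive numbers, for which the log is undefined, whereas taking the log first and standardizing afterwards works fine.

```python
import numpy as np

# Made-up right-skewed, strictly positive values
values = np.array([0.5, 0.6, 0.8, 1.0, 1.5, 3.5])

# Standardizing first centers the data on zero, so some values turn negative
standardized_first = (values - values.mean()) / values.std()
has_nonpositive = bool((standardized_first <= 0).any())

# Log first, then standardize
logged = np.log2(values)
log_then_std = (logged - logged.mean()) / logged.std()

print(has_nonpositive, np.round(log_then_std, 3))
```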

### Standardization

<p><font color="green">An important point regarding standardization is that the data needs to be split beforehand. This is because whatever values the scaler learns during the fit() call need to be calculated from the training set __only__. Of course we can't know the data we are going to test our model on, and calling fit over the entire data set will introduce a form of leakage.</font></p>
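
A tiny made-up example shows what the leakage looks like: fitting the scaler on train and validation together shifts the learned mean towards the validation value, so information about the held-out data ends up in the transformation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data; the validation value is deliberately extreme
train_part = np.array([[1.0], [2.0], [3.0], [4.0]])
val_part = np.array([[10.0]])

# Correct: statistics come from the training part only
scaler = StandardScaler().fit(train_part)
train_scaled = scaler.transform(train_part)
val_scaled = scaler.transform(val_part)

# Leaky: the validation value influences the learned mean and scale
leaky_scaler = StandardScaler().fit(np.vstack([train_part, val_part]))

print(scaler.mean_[0], leaky_scaler.mean_[0])
```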

In [None]:
train_df, val_df = train_test_split(train, test_size=0.2, random_state=42)

In [None]:
# To preserve the original train DataFrame, I will apply log on the split ones

for df in [train_df, val_df]:
    
    df.car_13 = df.car_13.apply(np.log2)

Only the continuous and ordinal variables ought to be standardized. The code below is ugly, but NaNs kept creeping up for reasons I could not pin down, so I added assertions after every step to narrow down exactly where that happens.

In [None]:
std_scaler = StandardScaler()

for col in train.select_dtypes(include=["int64","float64"]).columns:
    
    clone_scalr = clone(std_scaler)
    print(col, end=' | ')
    
    np_data_train = train_df[col].astype(np.float32).values.reshape(-1, 1)
    assert np.sum(np.isnan(np_data_train)) == 0, 'NaNs exist in Series converted to ndarray for train'
    np_data_val = val_df[col].astype(np.float32).values.reshape(-1, 1)
    assert np.sum(np.isnan(np_data_val)) == 0, 'NaNs exist in Series converted to ndarray for validation'

    np_data_train_t = np.round(clone_scalr.fit_transform(np_data_train).ravel(),4)
    assert np.sum(np.isnan(np_data_train_t)) == 0, 'NaNs exist in transformed ndarray for train'
    np_data_val_t = np.round(clone_scalr.transform(np_data_val).ravel(),4)
    assert np.sum(np.isnan(np_data_val_t)) == 0, 'NaNs exist in transformed ndarray for validation'
    
    train_df[col] = pd.Series(np_data_train_t, name=col, index=train_df.index)
    assert not train_df[col].isnull().any(), "NaNs exist in conversion of transformed ndarray to Series for train"
    val_df[col] = pd.Series(np_data_val_t, name=col, index=val_df.index)
    assert not val_df[col].isnull().any(), "NaNs exist in conversion of transformed ndarray to Series for validation"

In [None]:
sns.distplot(train_df.car_13);  # looks good!

In [None]:
columns_for_violin = list(train_df.select_dtypes(include=["float32"]).columns) +['target'] # target for hue
data = train_df[columns_for_violin]
data = pd.melt(data, id_vars='target', var_name='feature', value_name="value")
plt.figure(figsize=(len(columns_for_violin), len(columns_for_violin)))
sns.violinplot(x="feature", y="value", hue="target", data=data,split=True,inner="quart")
plt.xticks(rotation=90);