# Porto Seguro prediction problem (Kaggle)
## Author : Mateus C. Pedrino

The aim of this notebook is to provide a preliminary, exploratory analysis of the Kaggle Porto Seguro prediction problem (available at : https://www.kaggle.com/c/porto-seguro-safe-driver-prediction), with some tips on how to deal with missing and unbalanced data. After that, a random forest is tested and its results are discussed in terms of accuracy and AUC score.

I'd like to thank Rafael Alencar and Bert Carremans for their kernels, which were essential to the discussions in this notebook.

It's also worth highlighting that the goal of this notebook is not to go deep into the solution; even so, it reaches a high AUC score on the resampled data in the end.

## Preprocessing

### Load data

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df_train=pd.read_csv('train.csv', header=(0))

df_train.head()

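| | id | target | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ... | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 0 | 2 | 2 | 5 | 1 | 0 | 0 | 1 | 0 | ... | 9 | 1 | 5 | 8 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 9 | 0 | 1 | 1 | 7 | 0 | 0 | 0 | 0 | 1 | ... | 3 | 1 | 1 | 9 | 0 | 1 | 1 | 0 | 1 | 0 |
| 2 | 13 | 0 | 5 | 4 | 9 | 1 | 0 | 0 | 0 | 1 | ... | 4 | 2 | 7 | 7 | 0 | 1 | 1 | 0 | 1 | 0 |
| 3 | 16 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | ... | 2 | 2 | 4 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 17 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 3 | 1 | 1 | 3 | 0 | 0 | 0 | 1 | 1 | 0 |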


### Check and deal with missing values

Since missing values are indicated by "-1", we'll replace -1 with NaN so that pandas can be used to count the missing data.

In [8]:
train_cp = df_train.copy() # Work on a copy so df_train stays untouched
train_cp = train_cp.replace(-1, np.nan) # np.nan : the np.NaN alias was removed in NumPy 2.0

# Total missing values per feature
total = train_cp.isnull().sum().sort_values(ascending=False) 

# Percentage of each feature that is missing
percent = (train_cp.isnull().sum()/train_cp.isnull().count()).sort_values(ascending=False)

# Concat both previous information
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [9]:
# Show missing data information

missing_data.head(14)

| | Total | Percent |
|---|---|---|
| ps_car_03_cat | 411231 | 0.690898 |
| ps_car_05_cat | 266551 | 0.447825 |
| ps_reg_03 | 107772 | 0.181065 |
| ps_car_14 | 42620 | 0.071605 |
| ps_car_07_cat | 11489 | 0.019302 |
| ps_ind_05_cat | 5809 | 0.00976 |
| ps_car_09_cat | 569 | 0.000956 |
| ps_ind_02_cat | 216 | 0.000363 |
| ps_car_01_cat | 107 | 0.00018 |
| ps_ind_04_cat | 83 | 0.000139 |


We can see that ps_car_03_cat and ps_car_05_cat both have a lot of missing data (more than 40%), so it's reasonable to remove these features from our analysis in order to avoid further problems.
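
The same check can be wrapped in a small helper so that the threshold is explicit. This is only a sketch : `missing_fraction`, `columns_above` and the toy frame are hypothetical names used for illustration, and the 0.4 cutoff simply mirrors the "more than 40%" figure quoted above. It also assumes that -1 never occurs as a genuine value in the columns being checked.

```python
import numpy as np
import pandas as pd

def missing_fraction(df, sentinel=-1):
    """Fraction of entries per column that are NaN or equal to the sentinel."""
    return df.replace(sentinel, np.nan).isnull().mean().sort_values(ascending=False)

def columns_above(df, threshold=0.4, sentinel=-1):
    """Names of the columns whose missing fraction exceeds the threshold."""
    frac = missing_fraction(df, sentinel)
    return list(frac[frac > threshold].index)

# Toy data (hypothetical), using -1 as the missing-value marker
toy = pd.DataFrame({
    'a_cat': [-1, -1, -1, 0, 1],        # 3 of 5 values missing
    'b':     [0.5, -1, 0.2, 0.1, 0.9],  # 1 of 5 values missing
    'c_bin': [0, 1, 1, 0, 1],           # nothing missing
})

to_drop = columns_above(toy, threshold=0.4)
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
print(toy_reduced.columns.tolist())
```

Keeping the cutoff as a parameter makes it easy to see how many features would be lost with a stricter or looser rule before committing to one.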

Let's take a closer look at the other features with missing values.

In [10]:
features_missing=np.array(missing_data.index,dtype=str)
features_missing=features_missing[2:13] # Labels of missing features
train_cp[features_missing].head()

| | ps_reg_03 | ps_car_14 | ps_car_07_cat | ps_ind_05_cat | ps_car_09_cat | ps_ind_02_cat | ps_car_01_cat | ps_ind_04_cat | ps_car_02_cat | ps_car_11 | ps_car_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.71807 | 0.37081 | 1.0 | 0.0 | 0.0 | 2.0 | 10.0 | 1.0 | 1.0 | 2.0 | 0.4 |
| 1 | 0.766078 | 0.388716 | 1.0 | 0.0 | 2.0 | 1.0 | 11.0 | 0.0 | 1.0 | 3.0 | 0.316228 |
| 2 | NaN | 0.347275 | 1.0 | 0.0 | 2.0 | 4.0 | 7.0 | 1.0 | 1.0 | 1.0 | 0.316228 |
| 3 | 0.580948 | 0.294958 | 1.0 | 0.0 | 3.0 | 1.0 | 7.0 | 0.0 | 1.0 | 1.0 | 0.374166 |
| 4 | 0.840759 | 0.365103 | 1.0 | 0.0 | 2.0 | 2.0 | 11.0 | 1.0 | 1.0 | 3.0 | 0.31607 |


As we can see, the remaining features with missing values fall into 3 types : 

- categorical ("cat" suffix);
- continuous : ps_reg_03, ps_car_12 and ps_car_14;
- ordinal (neither categorical nor continuous) : ps_car_11.

For these missing values, we can consider the following approaches : 

- categorical : replace NaN back with -1 and treat -1 as a new class;
- continuous : replace missing values with the mean or median (we'll try the mean first);
- ordinal : replace missing values with the mode.
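
The three approaches above can be sketched on a toy frame with pandas `fillna`. The column names `x_cat`, `x_cont` and `x_ord` are hypothetical stand-ins, not columns of the real dataset, and this is just one possible way to write the idea.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical columns) with NaN marking missing entries
toy = pd.DataFrame({
    'x_cat':  [0.0, np.nan, 2.0, 1.0],   # categorical
    'x_cont': [0.2, 0.4, np.nan, 0.6],   # continuous
    'x_ord':  [3.0, 3.0, 1.0, np.nan],   # ordinal
})

imputed = toy.copy()

# categorical : keep -1 as its own "missing" class
imputed['x_cat'] = imputed['x_cat'].fillna(-1)

# continuous : fill with the column mean (NaN is skipped when computing it)
imputed['x_cont'] = imputed['x_cont'].fillna(imputed['x_cont'].mean())

# ordinal : fill with the most frequent value
imputed['x_ord'] = imputed['x_ord'].fillna(imputed['x_ord'].mode().iloc[0])

print(imputed)
```

Swapping `.mean()` for `.median()` is enough to try the median variant mentioned in the list.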

In [11]:
# Remove 'ps_car_03_cat' and 'ps_car_05_cat' (over 40% of missing values)
train_cp=train_cp.drop(['ps_car_03_cat','ps_car_05_cat'],axis=1)

print(df_train.shape)
print(train_cp.shape)

(595212, 59)
(595212, 57)


Indeed, we removed those two columns with more than 40% of missing values. Let's handle the other ones !

In [12]:
# Replace missing values in the continuous features by the column mean
X_continous = np.array(train_cp[['ps_reg_03', 'ps_car_12', 'ps_car_14']],\
                       dtype = float)

print('Former X_continous shape : ',X_continous.shape)
print('Number of missing values in continous features : ',np.isnan(X_continous).sum(),'\n')

means = np.nanmean(X_continous, axis = 0) # Means of each feature (NaN ignored)
# Fill each NaN with the mean of its own column (vectorized, same result as an element-wise loop)
nan_rows, nan_cols = np.where(np.isnan(X_continous))
X_continous[nan_rows, nan_cols] = means[nan_cols]
            
print('New X_continous shape : ',X_continous.shape)
print('Still missing values in X_continous : ',np.isnan(X_continous).sum())

Former X_continous shape :  (595212, 3)
Number of missing values in continous features :  150393 

New X_continous shape :  (595212, 3)
Still missing values in X_continous :  0


So, since the shape of X_continous hasn't changed and there are no more NaN values, we can conclude that the mean replacement was successful.

In [13]:
# Replace ordinal feature by mode
X_ordinal=np.array(train_cp['ps_car_11'],dtype = float)

print('Former X_ordinal shape : ',X_ordinal.shape)
print('Number of missing values in ordinal feature : ',np.isnan(X_ordinal).sum(),'\n')

from scipy import stats
modes = stats.mode(X_ordinal,axis=0,nan_policy='omit').mode.item()

# Fill every NaN with the mode (vectorized, same result as an element-wise loop)
X_ordinal[np.isnan(X_ordinal)] = modes
        
print('New X_ordinal shape : ',X_ordinal.shape)
print('Still missing values in X_ordinal : ',np.isnan(X_ordinal).sum())

Former X_ordinal shape :  (595212,)
Number of missing values in ordinal feature :  5 

New X_ordinal shape :  (595212,)
Still missing values in X_ordinal :  0


So, since the shape of X_ordinal hasn't changed and there are no more NaN values, we can conclude that the mode replacement was successful.

Finally, now that we've dealt with the continuous and ordinal features, once they are written back into the data frame (with mean and mode instead of NaN), the only remaining NaN values will belong to categorical features. So, we can replace them back with -1 and treat it as a new class. 

So, first, let's replace !

In [14]:
# Replacing continuous and ordinal missing features
train_cp[['ps_reg_03','ps_car_12','ps_car_14']]=X_continous[:,0:3]
train_cp['ps_car_11']=X_ordinal

In [15]:
# Checking missing values in the replaced data frame

# Total missing values per feature
total_re = train_cp.isnull().sum().sort_values(ascending=False) 

# Percentage of each feature that is missing
percent_re = (train_cp.isnull().sum()/train_cp.isnull().count()).sort_values(ascending=False)

# Concat both previous information
missing_data_re = pd.concat([total_re, percent_re], axis=1, keys=['Total', 'Percent'])

missing_data_re.head(10)

| | Total | Percent |
|---|---|---|
| ps_car_07_cat | 11489 | 0.019302 |
| ps_ind_05_cat | 5809 | 0.00976 |
| ps_car_09_cat | 569 | 0.000956 |
| ps_ind_02_cat | 216 | 0.000363 |
| ps_car_01_cat | 107 | 0.00018 |
| ps_ind_04_cat | 83 | 0.000139 |
| ps_car_02_cat | 5 | 8e-06 |
| ps_calc_20_bin | 0 | 0.0 |
| ps_ind_15 | 0 | 0.0 |
| ps_car_04_cat | 0 | 0.0 |


Thus, the only remaining missing values belong to categorical features, so we can directly replace NaN with -1 again.

In [16]:
train_cp=train_cp.replace(np.nan, -1) # np.nan : the np.NaN alias was removed in NumPy 2.0

In [17]:
# Check the number of -1 to see if they are equal to the number of NaN in missing_data_re
total_minus1=pd.DataFrame((train_cp==-1).sum().sort_values(ascending=False),columns=['Total'])
total_minus1.head(10)

| | Total |
|---|---|
| ps_car_07_cat | 11489 |
| ps_ind_05_cat | 5809 |
| ps_car_09_cat | 569 |
| ps_ind_02_cat | 216 |
| ps_car_01_cat | 107 |
| ps_ind_04_cat | 83 |
| ps_car_02_cat | 5 |
| ps_calc_20_bin | 0 |
| ps_ind_15 | 0 |
| ps_car_04_cat | 0 |


Therefore, the number of -1 values equals the previous number of NaN values in the categorical features, so the missing values in the training features are handled for now. One more simple check is to look for duplicates in the training data.

In [18]:
print('Before dropping duplicates : ',train_cp.shape)

# Dropping duplicates
train_cp = train_cp.drop_duplicates()

print('After dropping duplicates : ',train_cp.shape)

Before dropping duplicates :  (595212, 57)
After dropping duplicates :  (595212, 57)


So, since the training set shape hasn't changed, we can conclude that there were no duplicates.

### Dealing with unbalanced data

In [19]:
# Divide by class
df_class_0 = train_cp[train_cp['target'] == 0]
df_class_1 = train_cp[train_cp['target'] == 1]

print(df_class_0.shape)
print(df_class_1.shape)

(573518, 57)
(21694, 57)


As we can see above, the number of instances with target 0 is far larger than the number of instances with target 1. There are basically two simple ways to deal with this : oversampling and undersampling. Since we are only providing a preliminary approach to this problem, we won't explore resampling in depth. I've already tried basic over- and undersampling and they didn't work well, so we'll use SMOTE oversampling from imblearn, which oversamples the minority class by creating synthetic samples interpolated between a minority sample and its nearest minority neighbours.
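
To make the idea more concrete, here is a simplified numpy sketch of the interpolation step behind SMOTE : pick a minority sample, pick one of its k nearest minority neighbours, and create a new point somewhere on the segment between them. This is not imblearn's implementation; `smote_like` and its parameters are hypothetical names, and the brute-force distance matrix is only reasonable for tiny examples.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)   # random base samples
    pick = rng.integers(0, k, size=n_new)            # which neighbour to use for each
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))                     # position along the segment, in [0, 1)

    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Four minority points on the corners of the unit square (toy data)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=6)
print(X_new.shape)
```

Because every new point lies between two existing minority points, the synthetic samples stay inside the region already covered by the minority class instead of being exact copies.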

In [20]:
# Using SMOTE oversampling
from imblearn.over_sampling import SMOTE

# id isn't important for prediction and target is the output feature
aux_data=train_cp # Avoid changes in original data frame (which is already a copy)
aux_data=aux_data.drop(aux_data.columns[0],axis=1)
labels=list(aux_data.columns) # get features labels without id
X=np.array(aux_data.drop(labels[0], axis = 1)) #input data
Y=np.array(aux_data[labels[0]],dtype=int)

In [21]:
ros = SMOTE(sampling_strategy='minority') # 'ratio' was renamed to 'sampling_strategy' in newer imblearn releases
X_res, y_res = ros.fit_resample(X, Y)

### Split "training" data into real training and test sets

In [22]:
from sklearn.model_selection import train_test_split

p=0.7
train_x, test_x, train_y, test_y = train_test_split(X_res, y_res, test_size = 1-p, random_state=42)

In [23]:
(test_y==1).sum()
#train_x.dtype

171855
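
One point worth noting about the order of the steps : here the resampling happens before the split, so the test set also contains synthetic samples. A common alternative is to split first and resample only the training part. The sketch below shows that order on toy data; to stay free of extra dependencies it uses plain random duplication of minority rows instead of SMOTE, and all names ending in `_toy`, `_tr`, `_te` and `_bal` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy unbalanced data (hypothetical) : 950 negatives, 50 positives
X_toy = rng.normal(size=(1000, 4))
y_toy = np.array([0] * 950 + [1] * 50)

# 1) Split first, keeping the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42)

# 2) Oversample the minority class in the training part only
idx_min = np.flatnonzero(y_tr == 1)
n_extra = (y_tr == 0).sum() - idx_min.size
extra = rng.choice(idx_min, size=n_extra, replace=True)
X_tr_bal = np.vstack([X_tr, X_tr[extra]])
y_tr_bal = np.concatenate([y_tr, y_tr[extra]])

print('train :', (y_tr_bal == 0).sum(), (y_tr_bal == 1).sum())
print('test  :', (y_te == 0).sum(), (y_te == 1).sum())
```

With this order the training set is balanced while the test set keeps the original class ratio, so the scores measured on it are closer to what the model would face on unseen data.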

## Random forest

In [24]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=3, max_features=5)
model.fit(train_x, train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features=5, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [25]:
from sklearn.metrics import accuracy_score

pred_y=model.predict(test_x)
scr=accuracy_score(test_y,pred_y) # (y_true, y_pred) order; accuracy is symmetric, so the value is unchanged

print('Accuracy : ',scr)
print((test_y==1).sum())
print((test_y==0).sum()) # Balanced classes !!!!

Accuracy :  0.8734768722883023
171855
172256


In general, we have to be very careful with accuracy, especially with unbalanced data. Accuracy is just the number of correct predictions divided by the total number of cases. If we have, for instance, 90% of class 0 and 10% of class 1 in the test set and our model always answers "class 0" because of the unbalanced data, our accuracy will be high (90%), but we'll be missing all class 1 data. 
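
The 90% / 10% example above can be reproduced in a few lines. The labels are made up for illustration, and the "model" is just a constant prediction of class 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical test set : 90% class 0, 10% class 1
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that always answers class 0
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)      # recall of class 1
cm = confusion_matrix(y_true, y_pred)   # rows : true class, columns : predicted class

print(acc)  # 0.9
print(rec)  # 0.0
print(cm)
```

The accuracy looks good, but the recall of class 1 is zero and the second column of the confusion matrix is empty : not a single positive case was found.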

To detect this phenomenon we can use a confusion matrix or the AUC score. Now that the data is balanced (we have almost the same number of class 1 and class 0 instances in the test set, as shown above), the accuracy score is less misleading. However, it's good practice to look at the AUC score as well if you want to avoid misunderstandings.

In [26]:
from sklearn.metrics import roc_auc_score

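# Note : roc_auc_score expects (y_true, y_score). This call passes the hard predictions
# first, and the output shown below comes from the call exactly as written.
auc = roc_auc_score(pred_y, test_y)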

print('AUC score:',auc)

AUC score: 0.8739256982987567
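
For reference, scikit-learn's `roc_auc_score` takes the true labels first and a score second, and it is usually fed the predicted probability of class 1 rather than hard labels. Below is a minimal sketch on synthetic data; `X_demo`, `y_demo`, `clf` and `proba` are hypothetical names and the numbers it prints say nothing about the Porto Seguro data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real features (hypothetical)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)

# Probability of class 1 for each test row
proba = clf.predict_proba(X_te)[:, 1]

# True labels first, scores second
auc_proba = roc_auc_score(y_te, proba)
# Same metric computed from hard labels, for comparison
auc_labels = roc_auc_score(y_te, clf.predict(X_te))

print('AUC from probabilities :', auc_proba)
print('AUC from hard labels   :', auc_labels)
```

Probabilities let the metric rank the test rows, whereas hard labels reduce the ROC curve to a single operating point, so the two values generally differ.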


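As we can see, we got a high AUC score on our resampled data, so I think the aim of this preliminary analysis was successfully achieved. Keep in mind that SMOTE was applied before the split, so the test set also contains synthetic samples and this score is likely optimistic compared with what the original, unbalanced data would give.

Varying the random forest parameters or testing XGBoost (also varying its parameters) might be a good way to increase the AUC score and lead to better results. 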
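
As a starting point for the parameter variation, here is a small grid search sketch with scikit-learn's `GridSearchCV`, scored by ROC AUC. The data is synthetic and the grid values are arbitrary assumptions chosen to keep the example fast; they are not tuned recommendations for this competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real features (hypothetical)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Small, arbitrary grid over two random forest parameters
param_grid = {
    'max_depth': [3, 6],
    'max_features': [3, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid,
    scoring='roc_auc',  # cross-validated AUC computed from predicted scores
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

On the full dataset a grid like this can be slow, so it may be worth starting with a subsample or a handful of parameter values.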

Thanks for your attention !