## Introduction
For Project 1 we are participating in a Kaggle contest, Porto Seguro’s Safe Driver Prediction, using the datasets and the accompanying information that the contest provides. The purpose of the competition is to see who can make the best predictions from the given data. Before we even started to load and analyze the dataset, a lot of information about the features was already available. There are three types of features: binary, categorical, and numeric. Many features also have missing data, in some cases a large share of the column. A -1 indicates that a value is missing. Because the feature types differ, we decided that missing data had to be filled in differently for each type: for a binary or categorical feature we cannot simply average values such as 0 and 1, or 1, 2, 3, and so on, the way we can for a numeric feature.
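As a rough illustration of why the fill rule depends on the feature type, the sketch below fills -1 values in a toy frame. The column names, the values, and the helper `fill_missing` are made up for this example. It uses the most frequent value for binary and categorical columns, which is an assumption chosen for illustration (the notebook itself uses the median for categorical columns), and the mean for numeric columns.

```python
import pandas as pd

# Toy data: names follow the "_cat" / "_bin" suffix convention, values are invented
toy = pd.DataFrame({
    "f1_cat": [0, 1, 1, -1, 2],
    "f2_bin": [0, 1, -1, 1, 1],
    "f3":     [0.5, -1.0, 1.5, 2.5, -1.0],
})

def fill_missing(df, missing=-1):
    """Hypothetical helper: fill `missing` markers with a type-appropriate value."""
    out = df.copy()
    for col in out.columns:
        mask = out[col] == missing
        observed = out.loc[~mask, col]
        if col.endswith("_cat") or col.endswith("_bin"):
            # A mean such as 0.75 is not a valid category, so use the mode instead
            fill = observed.mode().iloc[0]
        else:
            # For numeric features the mean of the observed values is meaningful
            fill = observed.mean()
        out.loc[mask, col] = fill
    return out

filled = fill_missing(toy)
print(filled)
```

Averaging `f2_bin` would have produced 0.75, which is not a value a binary feature can take; the mode keeps the filled value inside the feature's own set of values.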

## Data Preparation and Exploration
Using the features provided, we are expected to analyze and preprocess the data. We start by counting the missing values in each column.

In [91]:
import sys

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer as Imputer
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

In [4]:
dat = pd.read_csv("data/train.csv")
X_orig = dat.loc[:, dat.columns != "target"]
y_orig = dat.loc[:, "target"]
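# Missing values are encoded as -1 in this dataset, so count those per column
count_nan = (dat == -1).sum()
print("Number of missing values (-1) per column")
print(count_nan)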

This is a significant amount of missing data. If we ignored it, or simply assumed that there is a reason for these values to be missing, we might miss something. We therefore came up with ways to preprocess the data that are realistic and logical.

In [None]:
X_no_neg = X_orig
y_no_neg = y_orig
cat = pd.DataFrame()
bn = pd.DataFrame()
norm = pd.DataFrame()
ids = X_orig['id']

# We ignore the id column (the first column) for this step
for i in X_orig.columns[1:]:
    if "cat" in i:
        # Categorical feature: missing values are filled with the median later
        cat[i] = dat.loc[:, i]
    elif "bin" in i:
        # Binary feature
        bn[i] = dat.loc[:, i]
    else:
        # Everything else is treated as a numeric feature
        norm[i] = dat.loc[:, i]
 

### Categorical Features
Missing data in the categorical features, which we replace with the median of each column.

In [None]:
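total_missing = 0
for i in cat.columns:
    # Count the -1 placeholders before they are overwritten
    missing = cat[i] == -1
    print(i, ":", missing.sum())
    total_missing += missing.sum()
    # Replace each -1 with the median of the observed values in that column
    cat.loc[missing, i] = np.median(cat.loc[~missing, i])

print("Number of -1 in data:", total_missing)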

### Binary Features
Amount of missing data in each column of the binary features (none).

In [None]:
for i in bn.columns:
    # Number of -1 placeholders in this binary column
    print(i, ":", (bn[i] == -1).sum())

print("There are no negatives in bin columns")

### Normal/Numerical Features
Amount of missing data in each column of the numerical features. The same cell also one-hot encodes the categorical features.

In [None]:
# One-hot encode each categorical feature; the prefix keeps column names unique
temp = []
for feature in cat:
    temp.append(pd.get_dummies(cat[feature], prefix=feature))

new_cat = pd.concat(temp, axis=1)

mx = 0
for i in norm.columns:
    n_missing = (norm[i] == -1).sum()
    print(i, ":", n_missing)
    # Track the largest per-column count of missing values
    mx = max(mx, n_missing)

print("~", mx/float(len(norm)) * 100, "% of the data is missing, so we are going to assume that there is significance that the data is missing")
# Count the -1 placeholders in each row (vectorized; avoids a slow iterrows loop)
num_missing_per_row = (norm == -1).sum(axis=1).values
  

There is a significant amount of data missing in the numerical fields, which is concerning. We concluded that there is most likely no significance to this data being missing, so we used a pipeline to replace the missing values with the mean of the feature. We also scaled the data using a StandardScaler. We chose this because there is a huge amount of data, around 500,000 instances, of which only ~3% is in class 1 while ~96% is in class 0. Using a z-score can help us visualize and quickly detect outliers and check whether they have anything to do with an instance's class being a 1 or a 0.

In [5]:
  
pipe = Pipeline([
    ("remove_neg_ones", Imputer(missing_values=-1, strategy="mean")),
    ("z-scaling", StandardScaler()),
])

scaled_norm = pd.DataFrame(pipe.fit_transform(norm), columns = norm.columns)
X_final = pd.concat([bn, new_cat, scaled_norm], axis=1)
X_orig = X_final
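To make the two pipeline steps concrete, here is a minimal NumPy sketch of the same arithmetic on a single made-up column: replace each -1 with the mean of the observed values, then standardize. The array values are invented for illustration; the only library fact relied on is that StandardScaler divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

x = np.array([2.0, -1.0, 4.0, 6.0, -1.0])   # -1 marks a missing entry

# Step 1: mean imputation, using only the observed values to compute the mean
observed = x[x != -1]
imputed = np.where(x == -1, observed.mean(), x)

# Step 2: z-scaling with the population standard deviation (ddof=0)
z = (imputed - imputed.mean()) / imputed.std()

print(imputed)
print(z.round(3))
```

One consequence worth noticing is that every imputed entry ends up at exactly z = 0, because filling with the mean leaves the column mean unchanged. For a column with many missing values, a large share of its scaled values will therefore sit at zero.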

ps_ind_06_bin : 0
ps_ind_07_bin : 0
ps_ind_08_bin : 0
ps_ind_09_bin : 0
ps_ind_10_bin : 0
ps_ind_11_bin : 0
ps_ind_12_bin : 0
ps_ind_13_bin : 0
ps_ind_16_bin : 0
ps_ind_17_bin : 0
ps_ind_18_bin : 0
ps_calc_15_bin : 0
ps_calc_16_bin : 0
ps_calc_17_bin : 0
ps_calc_18_bin : 0
ps_calc_19_bin : 0
ps_calc_20_bin : 0
ps_ind_01 : 0
ps_ind_03 : 0
ps_ind_14 : 0
ps_ind_15 : 0
ps_reg_01 : 0
ps_reg_02 : 0
ps_reg_03 : 107772
ps_car_11 : 5
ps_car_12 : 1
ps_car_13 : 0
ps_car_14 : 42620
ps_car_15 : 0
ps_calc_01 : 0
ps_calc_02 : 0
ps_calc_03 : 0
ps_calc_04 : 0
ps_calc_05 : 0
ps_calc_06 : 0
ps_calc_07 : 0
ps_calc_08 : 0
ps_calc_09 : 0
ps_calc_10 : 0
ps_calc_11 : 0
ps_calc_12 : 0
ps_calc_13 : 0
ps_calc_14 : 0
~ 18.1064897885 % of the data is missing, so we are going to assume that there is significance that the data is missing


~96% of the data is class 0, and ~3% is class 1. This means an algorithm could guess 0 for everything and still get a 96% accuracy rate. To counter this, the training set can be rebalanced to a 50/50 split, where 50% of the instances are class 0 and the other 50% are class 1. This ensures that the model sees enough training instances of class 1 to make proper predictions when the test set contains a 1. To make the training set larger, we sample class 1 with replacement until we have four times its original count, then sample the same number of class 0 instances, also with replacement. This gives us plenty of data to train on and enough variation among the class 1 instances we draw.
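The accuracy problem described above can be seen in a few lines. The labels below are synthetic, built with a 96/4 class ratio purely for illustration, and the "model" is a hypothetical one that always predicts class 0.

```python
import numpy as np

# Synthetic labels: 96 instances of class 0 and 4 of class 1 (illustrative ratio)
y_true = np.array([0] * 96 + [1] * 4)
y_pred = np.zeros_like(y_true)            # always predict class 0

accuracy = (y_pred == y_true).mean()
# Share of the true class 1 instances that were predicted as class 1
recall_pos = (y_pred[y_true == 1] == 1).mean()

print("accuracy:", accuracy)
print("recall on class 1:", recall_pos)
```

The accuracy looks excellent while not a single class 1 instance is found, which is why accuracy alone is a poor guide on data this imbalanced.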

In [102]:
X_train, X_test, y_train, y_test = train_test_split(X_orig, y_orig)

In [103]:
# Oversample class 1 to four times its original size, drawing with replacement
X_ones = X_train.loc[y_train == 1, :]
dat_ones = X_ones.sample(n=len(X_ones) * 4, replace=True).assign(target=1)
# Draw the same number of class 0 rows, also with replacement
X_zeroes = X_train.loc[y_train == 0, :]
dat_f = X_zeroes.sample(n=len(dat_ones), replace=True).assign(target=0)
# Stack both classes into one balanced 50/50 training frame
dat_ones = pd.concat([dat_ones, dat_f])

In [104]:
X_shrunk = dat_ones.loc[:, dat_ones.columns != "target"]
y_shrunk = dat_ones.target

For this competition, predictions are scored with the normalized Gini coefficient, computed by the functions below.

In [55]:
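def gini(actual, pred, cmpcol=0, sortcol=1):
    assert len(actual) == len(pred)
    # Columns: actual value, prediction, original row index (used as a tie-breaker)
    data = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    # Sort by prediction, highest first; ties keep their original order
    data = data[np.lexsort((data[:, 2], -1 * data[:, 1]))]
    totalLosses = data[:, 0].sum()
    giniSum = data[:, 0].cumsum().sum() / totalLosses

    giniSum -= (len(actual) + 1) / 2.0
    return giniSum / len(actual)

def gini_normalized(a, p):
    # Divide by the Gini of a perfect ranking so the best possible score is 1
    return gini(a, p) / gini(a, a)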
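For binary targets, the normalized Gini is a rescaling of the ROC AUC: when no two predictions are tied, it equals 2 * AUC - 1. The sketch below checks this numerically on synthetic data. The Gini functions are re-stated so the block is self-contained, `auc_pairwise` is a helper written for this example, and the labels and scores are randomly generated.

```python
import numpy as np

def gini(actual, pred):
    data = np.c_[actual, pred, np.arange(len(actual))].astype(float)
    data = data[np.lexsort((data[:, 2], -data[:, 1]))]   # sort by prediction, descending
    gini_sum = data[:, 0].cumsum().sum() / data[:, 0].sum()
    gini_sum -= (len(actual) + 1) / 2.0
    return gini_sum / len(actual)

def gini_normalized(a, p):
    return gini(a, p) / gini(a, a)

def auc_pairwise(actual, pred):
    # Fraction of (positive, negative) pairs in which the positive scores higher
    pos = pred[actual == 1]
    neg = pred[actual == 0]
    return (pos[:, None] > neg[None, :]).mean()

rng = np.random.default_rng(0)
actual = rng.permutation(np.array([0] * 180 + [1] * 20))
pred = rng.random(200) + 0.3 * actual     # noisy continuous scores, so ties are not expected

g = gini_normalized(actual, pred)
auc = auc_pairwise(actual, pred)
print(g, 2 * auc - 1)
```

This means a normalized Gini of about 0.28 corresponds to an AUC of about 0.64, which can help when comparing against results reported as AUC.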

With preprocessing and data exploration done, we can test how the training set performs on a model. We fit a default Gradient Boosting Classifier and pass its predicted probabilities to the gini_normalized function.

In [105]:
clf = GradientBoostingClassifier()
clf.fit(X_shrunk, y_shrunk)
predicted = clf.predict_proba(X_test)

In [106]:
print(f"Gini score for default GradientBoostingClassifier {gini_normalized(y_test, predicted[:,1])}")

Gini score for default GradientBoostingClassifier 0.2763515725664041


This Gini score is quite high in comparison to those of our competitors. Next, the original, non-50/50 training data is put into a default Gradient Boosting Classifier as a comparison.

In [107]:
clf = GradientBoostingClassifier()
clf.fit(X_train, y_train)
predicted = clf.predict_proba(X_test)

In [108]:
print(f"Gini score for default GradientBoostingClassifier {gini_normalized(y_test, predicted[:,1])}")

Gini score for default GradientBoostingClassifier 0.2816102047006163


In [99]:
clf = GradientBoostingClassifier()
clf.fit(X_train, y_train)
predicted = clf.predict_proba(X_test)
#confusion_matrix(y_test, np.round(predicted))
#print(np.amin(predicted))

In [100]:
print(f"Gini score for default GradientBoostingClassifier {gini_normalized(y_test, predicted[:,1])}")

Gini score for default GradientBoostingClassifier 0.27795604630619936


The 50/50 dataset did not do better than the original training set. One hypothesis is that we are training with too many instances of class 1, when in reality, there are not as many in the test set. In the regular training set, there is a realistic amount of class 1 instances.

## Results
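There were many obstacles to work through with this dataset, chiefly the heavy class imbalance and the large amount of missing data.

The difference between splits is large: the same default model trained on the original data scored 0.2816 on one random split and 0.2780 on another, a change of about .004.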
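The arithmetic behind that comparison can be written out using the three scores printed in the cell outputs above. The sketch assumes that the third run (cells 99 and 100) was evaluated on a different random split from the other two, as the execution counts suggest; the variable names are chosen for this example.

```python
# Scores copied from the cell outputs above
gini_balanced = 0.2763515725664041    # 50/50 resampled training set
gini_split_a = 0.2816102047006163     # original training set, same split as the 50/50 run
gini_split_b = 0.27795604630619936    # original training set, assumed to be another split

# How much the score moves when only the random split changes
split_noise = abs(gini_split_a - gini_split_b)
# How far the 50/50 model falls behind each of the original-data runs
gap_a = gini_split_a - gini_balanced
gap_b = gini_split_b - gini_balanced

print(round(split_noise, 4), round(gap_a, 4), round(gap_b, 4))
```

The gap between the 50/50 model and the original-data model is of the same order as the change caused by re-splitting alone, so a single split is probably not enough to tell the two training sets apart. Repeating the comparison over several splits would give a firmer answer.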