# Telstra Kaggle Competition Modeling (2016)

## 3/2/2018

## Hiro Miyake

This notebook deals with data provided in the [Telstra Kaggle competition](https://www.kaggle.com/c/telstra-recruiting-network) held in 2016. Exploratory data analysis is performed in the companion notebook.

# 1. Initial setup and loading of data

First, set up the environment and import modules.

In [1]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

import re

import pandas as pd
import numpy as np

from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.decomposition import SparsePCA
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

from xgboost.sklearn import XGBClassifier



Then load all the `csv` files.

In [2]:
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
event = pd.read_csv("data/event_type.csv")
log = pd.read_csv("data/log_feature.csv")
resource = pd.read_csv("data/resource_type.csv")
severity = pd.read_csv("data/severity_type.csv")

In [3]:
#import sklearn
#sklearn.__version__
## I have version 0.18.1

Take a look at the data in the `train.csv`  and `test.csv` files.

In [4]:
train.head()

Unnamed: 0,id,location,fault_severity
0,14121,location 118,1
1,9320,location 91,0
2,14394,location 152,1
3,8218,location 931,1
4,14804,location 120,0


In [5]:
train.tail()

Unnamed: 0,id,location,fault_severity
7376,870,location 167,0
7377,18068,location 106,0
7378,14111,location 1086,2
7379,15189,location 7,0
7380,17067,location 885,0


In [6]:
test.head()

Unnamed: 0,id,location
0,11066,location 481
1,18000,location 962
2,16964,location 491
3,4795,location 532
4,3392,location 600


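Combine the training and test dataframes, leaving out the dependent variable (`fault_severity`) from the training set. Concatenation keeps each dataframe's original row labels, so the combined index contains duplicates. I reset it to a continuous range, since duplicate labels can cause problems in certain dataframe operations.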

In [7]:
data = pd.concat([train.iloc[:,:-1], test], axis = 0)

## Note that in the above concatenation step, the indices are unchanged
## To reset the indices so that they make sense, take the tip from the following link
## and use the following line of code
## https://stackoverflow.com/questions/35084071/concat-dataframe-reindexing-only-valid-with-uniquely-valued-index-objects
data.reset_index(inplace=True, drop=True)

data.head()

Unnamed: 0,id,location
0,14121,location 118
1,9320,location 91
2,14394,location 152
3,8218,location 931
4,14804,location 120
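
As an aside, the same continuous index can be obtained in one step by passing `ignore_index=True` to `pd.concat`. The sketch below uses small made-up frames (`train_toy` and `test_toy` are hypothetical stand-ins, not the competition data) and is only meant to illustrate the option.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are illustrative only)
train_toy = pd.DataFrame({'id': [14121, 9320],
                          'location': ['location 118', 'location 91'],
                          'fault_severity': [1, 0]})
test_toy = pd.DataFrame({'id': [11066], 'location': ['location 481']})

# ignore_index=True discards the original row labels and assigns 0..n-1
combined = pd.concat([train_toy[['id', 'location']], test_toy],
                     axis=0, ignore_index=True)
print(combined.index.tolist())
```

Selecting the predictor columns by name, as done here, is an alternative to `iloc[:,:-1]` that does not depend on the target being the last column.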


After I am done transforming the data, I will split it back into the training and test sets. To help with that, I add an index column `ind`. It's probably possible to do this with the 'real' index, but this will work for now.

In [8]:
data['ind'] = data.index
data.head()

Unnamed: 0,id,location,ind
0,14121,location 118,0
1,9320,location 91,1
2,14394,location 152,2
3,8218,location 931,3
4,14804,location 120,4
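
One possible alternative to a positional `ind` column is a boolean flag that records which set each row came from. This is a minimal sketch on made-up frames. The column name `is_train` is a hypothetical choice, not something used elsewhere in this notebook.

```python
import pandas as pd

train_toy = pd.DataFrame({'id': [1, 2, 3],
                          'location': ['location 1', 'location 2', 'location 1']})
test_toy = pd.DataFrame({'id': [4, 5],
                         'location': ['location 2', 'location 3']})

# Tag each frame before concatenating (is_train is a hypothetical column name)
combined = pd.concat([train_toy.assign(is_train=True),
                      test_toy.assign(is_train=False)],
                     axis=0, ignore_index=True)

# ... feature engineering on `combined` would go here ...

# Recover the two sets with a boolean mask, then drop the helper column
train_back = combined[combined['is_train']].drop('is_train', axis=1)
test_back = combined[~combined['is_train']].drop('is_train', axis=1)
```

With a flag like this the split does not depend on row order or on knowing how many training rows there are.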


Check the rows around the boundary where the training and test sets were concatenated.

In [9]:
data.iloc[7379:7383,:]

Unnamed: 0,id,location,ind
7379,15189,location 7,7379
7380,17067,location 885,7380
7381,11066,location 481,7381
7382,18000,location 962,7382


Check the total number of rows in the combined training and test dataframe.

In [10]:
data.count()

id          18552
location    18552
ind         18552
dtype: int64

In [11]:
## Some tips on joining/merging dataframes
## https://pandas.pydata.org/pandas-docs/stable/merging.html
## https://chrisalbon.com/python/data_wrangling/pandas_join_merge_dataframe/
## https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html
## https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/
## https://stackoverflow.com/questions/22676081/pandas-the-difference-between-join-and-merge

# 2. One-hot-encode the categorical variables

It turns out that most of the variables appear to be categorical. I deal with this by one-hot-encoding each of the categorical variables. I also perform principal component analysis (PCA) to reduce the number of variables used for prediction. This should help avoid overfitting and ideally lead to a better prediction metric.
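
To make the two steps concrete, here is a minimal sketch of one-hot encoding followed by PCA on a tiny made-up frame. The data and the choice of two components are assumptions for illustration only.

```python
import pandas as pd
from sklearn.decomposition import PCA

toy = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                    'location': ['location 1', 'location 2', 'location 1',
                                 'location 3', 'location 2', 'location 1']})

# One indicator column per distinct category value
dummies = pd.get_dummies(toy['location']).astype(float)

# Project the indicator columns onto the two directions of largest variance
pca = PCA(n_components=2)
reduced = pca.fit_transform(dummies.values)

print(dummies.shape)
print(reduced.shape)
```

In this toy case every row has exactly one indicator set, so the three columns always sum to 1 and two components already capture all of the variance. With many categories the redundancy is less extreme, which is why the explained-variance figure is worth checking for each variable.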

## One-hot-encoding of the `location` variable

In [12]:
print 'Number of unique location values: ' + str(len(data.location.unique()))

Number of unique location values: 1126


In [13]:
location_dum = pd.get_dummies(data['location'])
data_loc = pd.concat([data, location_dum], axis=1)
data_loc.drop(['ind', 'location'], axis = 1, inplace = True)
data_loc.head()

Unnamed: 0,id,location 1,location 10,location 100,location 1000,location 1001,location 1002,location 1003,location 1004,location 1005,...,location 990,location 991,location 992,location 993,location 994,location 995,location 996,location 997,location 998,location 999
0,14121,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,9320,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,14394,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,8218,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,14804,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Dimensionality reduction via principal component analysis can be done as below.

In [14]:
##################################
# RUN THIS IF YOU WANT TO DO PCA #
##################################

## Dimensionality reduction with PCA
## http://scikit-learn.org/stable/modules/decomposition.html
## http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
## http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html
## https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

## Dealing with scaling the data for PCA
## http://scikit-learn.org/stable/modules/preprocessing.html#scaling-sparse-data
## http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
## http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html
## http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

#pca = SparsePCA(n_components=500)
pca = PCA(n_components=500, svd_solver = 'randomized')
X = data_loc.iloc[:,1:]
#X = StandardScaler().fit_transform(X) ## Subtracts mean and scales to unit variance
#X = MaxAbsScaler().fit_transform(X) ## Scales max value to 1.0
X_r = pca.fit(X).transform(X)
#print pca.explained_variance_ratio_
print 'Percent of variance explained: ' + str(100*sum(pca.explained_variance_ratio_)) +'%'
X_r = pd.DataFrame(X_r)
X_r.head()

Percent of variance explained: 89.5401790575%


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,490,491,492,493,494,495,496,497,498,499
0,9.85068e-12,0.003395,-0.014285,-0.013228,-0.009355,-0.0044,-0.005226,-0.018963,-0.008552,-0.011817,...,-2e-06,3.6e-05,-3e-05,-0.000181,7e-06,-0.000215,-8e-06,-3e-06,-4.4e-05,-9.2e-05
1,1.152762e-11,0.003434,-0.014457,-0.013398,-0.00949,-0.004468,-0.005308,-0.019272,-0.008709,-0.012043,...,-2e-06,3.5e-05,-3e-05,-0.000179,7e-06,-0.000212,-8e-06,-3e-06,-4.3e-05,-9.1e-05
2,-5.540639e-09,0.001711,-0.007112,-0.006367,-0.004235,-0.001918,-0.002256,-0.008027,-0.003378,-0.004554,...,-0.003351,0.003083,0.00942,-0.00292,-0.000497,-0.004795,0.007897,-0.001945,-0.003436,0.001699
3,2.235116e-11,0.002564,-0.010723,-0.009763,-0.006691,-0.003085,-0.003645,-0.013089,-0.005684,-0.007747,...,-3e-06,5.2e-05,-4.6e-05,-0.000264,1.1e-05,-0.000313,-1.2e-05,-4e-06,-6.4e-05,-0.000134
4,-1.62426e-10,0.001784,-0.007419,-0.006651,-0.004435,-0.002011,-0.002366,-0.008427,-0.003555,-0.004797,...,-5.4e-05,0.000253,-0.000927,-0.001247,-0.000541,-0.001603,0.000127,-0.000556,2e-05,-0.00035


In [15]:
## Do this if you want to do PCA
data_loc_f = pd.concat([data_loc['id'], X_r], axis=1)
## Do this if you want to take the full data set without PCA
#data_loc_f = data_loc
data_loc_f.head()

Unnamed: 0,id,0,1,2,3,4,5,6,7,8,...,490,491,492,493,494,495,496,497,498,499
0,14121,9.85068e-12,0.003395,-0.014285,-0.013228,-0.009355,-0.0044,-0.005226,-0.018963,-0.008552,...,-2e-06,3.6e-05,-3e-05,-0.000181,7e-06,-0.000215,-8e-06,-3e-06,-4.4e-05,-9.2e-05
1,9320,1.152762e-11,0.003434,-0.014457,-0.013398,-0.00949,-0.004468,-0.005308,-0.019272,-0.008709,...,-2e-06,3.5e-05,-3e-05,-0.000179,7e-06,-0.000212,-8e-06,-3e-06,-4.3e-05,-9.1e-05
2,14394,-5.540639e-09,0.001711,-0.007112,-0.006367,-0.004235,-0.001918,-0.002256,-0.008027,-0.003378,...,-0.003351,0.003083,0.00942,-0.00292,-0.000497,-0.004795,0.007897,-0.001945,-0.003436,0.001699
3,8218,2.235116e-11,0.002564,-0.010723,-0.009763,-0.006691,-0.003085,-0.003645,-0.013089,-0.005684,...,-3e-06,5.2e-05,-4.6e-05,-0.000264,1.1e-05,-0.000313,-1.2e-05,-4e-06,-6.4e-05,-0.000134
4,14804,-1.62426e-10,0.001784,-0.007419,-0.006651,-0.004435,-0.002011,-0.002366,-0.008427,-0.003555,...,-5.4e-05,0.000253,-0.000927,-0.001247,-0.000541,-0.001603,0.000127,-0.000556,2e-05,-0.00035


## One-hot-encoding of the `event` dataframe

In [16]:
print 'There are ' + str(len(event.event_type.unique())) + ' distinct event_type values.'
event.head()

There are 53 distinct event_type values.


Unnamed: 0,id,event_type
0,6597,event_type 11
1,8011,event_type 15
2,2597,event_type 15
3,5022,event_type 15
4,5022,event_type 11


In [17]:
data_e = data.merge(event, on = 'id', how = 'left')
print data_e.count()
data_e.head()

id            31170
location      31170
ind           31170
event_type    31170
dtype: int64


Unnamed: 0,id,location,ind,event_type
0,14121,location 118,0,event_type 34
1,14121,location 118,0,event_type 35
2,9320,location 91,1,event_type 34
3,9320,location 91,1,event_type 35
4,14394,location 152,2,event_type 35


In [18]:
event_dum = pd.get_dummies(data_e['event_type'])
data_e2 = pd.concat([data_e, event_dum], axis=1)
data_e2.drop(['location', 'ind', 'event_type'], axis = 1, inplace = True)
data_e2 = data_e2.groupby('id',as_index = False).agg('sum')
print data_e2['id'].count()
data_e2.head()

18552


Unnamed: 0,id,event_type 1,event_type 10,event_type 11,event_type 12,event_type 13,event_type 14,event_type 15,event_type 17,event_type 18,...,event_type 5,event_type 50,event_type 51,event_type 52,event_type 53,event_type 54,event_type 6,event_type 7,event_type 8,event_type 9
0,1,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
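
The per-`id` count table built above with `get_dummies` followed by a grouped sum can also be sketched with `pd.crosstab`. The example below uses a few made-up rows and compares the two routes. Note that `crosstab` only returns ids that actually appear in the event table, whereas the left merge onto `data` keeps every id.

```python
import pandas as pd

event_toy = pd.DataFrame({'id': [5022, 5022, 6597, 8011],
                          'event_type': ['event_type 15', 'event_type 11',
                                         'event_type 11', 'event_type 15']})

# Rows are ids, columns are event types, cells count occurrences
counts = pd.crosstab(event_toy['id'], event_toy['event_type'])

# Same table via the get_dummies + groupby-sum route
dummies = pd.get_dummies(event_toy['event_type']).astype(int)
summed = pd.concat([event_toy[['id']], dummies], axis=1).groupby('id').sum()
```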


In [19]:
##################################
# RUN THIS IF YOU WANT TO DO PCA #
##################################

#pca = SparsePCA(n_components=15)
pca = PCA(n_components=15, svd_solver = 'randomized')
X = data_e2.iloc[:,1:]
#X = StandardScaler().fit_transform(X) ## Subtracts mean and scales to unit variance
#X = MaxAbsScaler().fit_transform(X) ## Scales max value to 1.0
X_r = pca.fit(X).transform(X)
#print pca.explained_variance_ratio_
print 'Percent of variance explained: ' + str(100*sum(pca.explained_variance_ratio_)) +'%'
X_r = pd.DataFrame(X_r)
X_r.head()

Percent of variance explained: 96.0190597571%


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,-0.649841,-0.549824,0.023177,0.067021,-0.281921,-0.256941,0.639126,-0.475553,-0.062804,-0.190123,0.000169,0.007681,-0.027571,0.007822,0.002161
1,1.047852,-0.043996,-0.147605,0.050249,-0.001219,0.003399,0.016248,0.012815,0.026546,0.001901,0.000334,-0.000678,0.000266,-0.002863,0.000234
2,-0.617005,-0.471573,-0.003318,0.042696,-0.020871,-0.009147,-0.127192,-0.01839,0.062915,-0.005964,0.003387,-0.025956,0.000607,-0.012998,-0.004754
3,-0.126243,0.154063,0.494815,0.130715,-0.355656,-0.088536,-0.167962,0.000358,0.195587,0.024568,0.336185,0.11464,0.000423,-0.045512,0.48322
4,1.047852,-0.043996,-0.147605,0.050249,-0.001219,0.003399,0.016248,0.012815,0.026546,0.001901,0.000334,-0.000678,0.000266,-0.002863,0.000234


Note that 15 principal components explain 96% of the variance across the 53 one-hot-encoded `event_type` columns, so PCA must be adding some value here by compressing the variable with little loss.
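
The number of components in this notebook is set by hand. One way to pick it from a target is to look at the cumulative explained variance ratio. The sketch below does this on a synthetic binary matrix (the data, the 95% target, and the variable names are all assumptions for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic indicator-like matrix standing in for a one-hot table
X_toy = rng.binomial(1, 0.2, size=(200, 20)).astype(float)

# Fit with all components, then accumulate the explained variance ratios
pca_full = PCA().fit(X_toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.95
k = int(np.searchsorted(cumulative, target) + 1)
print(k)
```

scikit-learn can also make this choice itself when `n_components` is a float between 0 and 1 and `svd_solver='full'` is used, which may be more convenient than computing the cumulative sum by hand.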

In [20]:
data_e_f = pd.concat([data_e2['id'], X_r], axis=1)
data_e_f.head()

Unnamed: 0,id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,1,-0.649841,-0.549824,0.023177,0.067021,-0.281921,-0.256941,0.639126,-0.475553,-0.062804,-0.190123,0.000169,0.007681,-0.027571,0.007822,0.002161
1,2,1.047852,-0.043996,-0.147605,0.050249,-0.001219,0.003399,0.016248,0.012815,0.026546,0.001901,0.000334,-0.000678,0.000266,-0.002863,0.000234
2,3,-0.617005,-0.471573,-0.003318,0.042696,-0.020871,-0.009147,-0.127192,-0.01839,0.062915,-0.005964,0.003387,-0.025956,0.000607,-0.012998,-0.004754
3,4,-0.126243,0.154063,0.494815,0.130715,-0.355656,-0.088536,-0.167962,0.000358,0.195587,0.024568,0.336185,0.11464,0.000423,-0.045512,0.48322
4,5,1.047852,-0.043996,-0.147605,0.050249,-0.001219,0.003399,0.016248,0.012815,0.026546,0.001901,0.000334,-0.000678,0.000266,-0.002863,0.000234


## One-hot-encoding of the `resource` dataframe

In [21]:
print 'There are ' + str(len(resource.resource_type.unique())) + ' distinct resource_type values.'
print resource.count()
resource.head()

There are 10 distinct resource_type values.
id               21076
resource_type    21076
dtype: int64


Unnamed: 0,id,resource_type
0,6597,resource_type 8
1,8011,resource_type 8
2,2597,resource_type 8
3,5022,resource_type 8
4,6852,resource_type 8


In [22]:
data_r = data.merge(resource, on = 'id', how = 'left')
print data_r.count()
data_r.head()

id               21076
location         21076
ind              21076
resource_type    21076
dtype: int64


Unnamed: 0,id,location,ind,resource_type
0,14121,location 118,0,resource_type 2
1,9320,location 91,1,resource_type 2
2,14394,location 152,2,resource_type 2
3,8218,location 931,3,resource_type 8
4,14804,location 120,4,resource_type 2


In [23]:
resource_dum = pd.get_dummies(data_r['resource_type'])
data_r2 = pd.concat([data_r, resource_dum], axis=1)
data_r2.drop(['location', 'ind', 'resource_type'], axis = 1, inplace = True)
data_r2 = data_r2.groupby('id',as_index = False).agg('sum')
print data_r2['id'].count()
data_r2.head()

18552


Unnamed: 0,id,resource_type 1,resource_type 10,resource_type 2,resource_type 3,resource_type 4,resource_type 5,resource_type 6,resource_type 7,resource_type 8,resource_type 9
0,1,0,0,0,0,0,0,1,0,1,0
1,2,0,0,1,0,0,0,0,0,0,0
2,3,0,0,0,0,0,0,0,0,1,0
3,4,0,0,1,0,0,0,0,0,0,0
4,5,0,0,1,0,0,0,0,0,0,0


In [24]:
##################################
# RUN THIS IF YOU WANT TO DO PCA #
##################################

#pca = SparsePCA(n_components=50)
pca = PCA(n_components=5, svd_solver = 'randomized')
X = data_r2.iloc[:,1:]
#X = StandardScaler().fit_transform(X) ## Subtracts mean and scales to unit variance
#X = MaxAbsScaler().fit_transform(X) ## Scales max value to 1.0
X_r = pca.fit(X).transform(X)
#print pca.explained_variance_ratio_
print 'Percent of variance explained: ' + str(100*sum(pca.explained_variance_ratio_)) +'%'
X_r = pd.DataFrame(X_r)
X_r.head()

Percent of variance explained: 95.9609884885%


Unnamed: 0,0,1,2,3,4
0,-0.692177,0.448411,0.565812,-0.569977,-0.17011
1,0.75982,0.017455,-0.017058,-0.004959,0.00155
2,-0.652544,-0.024988,-0.074774,-0.006578,-0.00225
3,0.75982,0.017455,-0.017058,-0.004959,0.00155
4,0.75982,0.017455,-0.017058,-0.004959,0.00155


Note that only 5 principal components are sufficient to explain about 96% of the variance across the 10 one-hot-encoded `resource_type` columns, so again PCA must be adding some value.

In [25]:
data_r_f = pd.concat([data_r2['id'], X_r], axis=1)
data_r_f.head()

Unnamed: 0,id,0,1,2,3,4
0,1,-0.692177,0.448411,0.565812,-0.569977,-0.17011
1,2,0.75982,0.017455,-0.017058,-0.004959,0.00155
2,3,-0.652544,-0.024988,-0.074774,-0.006578,-0.00225
3,4,0.75982,0.017455,-0.017058,-0.004959,0.00155
4,5,0.75982,0.017455,-0.017058,-0.004959,0.00155


## One-hot-encoding of the `severity` dataframe

In [26]:
## Each id probably has one severity type
print 'There are ' + str(len(severity.severity_type.unique())) + ' distinct severity_type values.'
print severity.count()
severity.head()

There are 5 distinct severity_type values.
id               18552
severity_type    18552
dtype: int64


Unnamed: 0,id,severity_type
0,6597,severity_type 2
1,8011,severity_type 2
2,2597,severity_type 2
3,5022,severity_type 1
4,6852,severity_type 1


In [27]:
data_s = data.merge(severity, on = 'id', how = 'left')
print data_s.count()
data_s.head()

id               18552
location         18552
ind              18552
severity_type    18552
dtype: int64


Unnamed: 0,id,location,ind,severity_type
0,14121,location 118,0,severity_type 2
1,9320,location 91,1,severity_type 2
2,14394,location 152,2,severity_type 2
3,8218,location 931,3,severity_type 1
4,14804,location 120,4,severity_type 1


In [28]:
severity_dum = pd.get_dummies(data_s['severity_type'])
data_s2 = pd.concat([data_s, severity_dum], axis=1)
data_s2.drop(['location', 'ind', 'severity_type'], axis = 1, inplace = True)
data_s2 = data_s2.groupby('id',as_index = False).agg('sum')
print data_s2['id'].count()
data_s2.head()

18552


Unnamed: 0,id,severity_type 1,severity_type 2,severity_type 3,severity_type 4,severity_type 5
0,1,1,0,0,0,0
1,2,0,1,0,0,0
2,3,1,0,0,0,0
3,4,0,0,0,1,0
4,5,0,1,0,0,0


Note that each `id` has only one `severity_type` value. And since there are only 5 `severity_type` values, it's probably not worth it to perform dimensionality reduction on this variable.
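
The one-value-per-`id` observation can be checked directly rather than read off the row counts. This is a small sketch on made-up rows, and the variable names are hypothetical.

```python
import pandas as pd

severity_toy = pd.DataFrame({'id': [6597, 8011, 2597, 5022],
                             'severity_type': ['severity_type 2', 'severity_type 2',
                                               'severity_type 2', 'severity_type 1']})

# True when no id appears more than once
one_per_id = severity_toy['id'].is_unique

# Equivalent check: the largest number of distinct severity types for any id
max_types = severity_toy.groupby('id')['severity_type'].nunique().max()
```

When the `id` column is unique, the grouped sum in the cell above leaves the indicator columns unchanged and only sorts the rows by `id`.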

In [29]:
data_s_f = data_s2.copy()

## One-hot-encoding of the `log` dataframe

In [30]:
print 'There are ' + str(len(log.log_feature.unique())) + ' distinct log_feature values.'
log.head()

There are 386 distinct log_feature values.


Unnamed: 0,id,log_feature,volume
0,6597,feature 68,6
1,8011,feature 68,7
2,2597,feature 68,1
3,5022,feature 172,2
4,5022,feature 56,1


In [31]:
data_log = data.merge(log, on = 'id', how = 'left')
print data_log.count()
data_log.head()

id             58671
location       58671
ind            58671
log_feature    58671
volume         58671
dtype: int64


Unnamed: 0,id,location,ind,log_feature,volume
0,14121,location 118,0,feature 312,19
1,14121,location 118,0,feature 232,19
2,9320,location 91,1,feature 315,200
3,9320,location 91,1,feature 235,116
4,14394,location 152,2,feature 221,1


In [32]:
log_dum = pd.get_dummies(data_log['log_feature'])
data_log2 = pd.concat([data_log, log_dum], axis=1)
data_log2.drop(['location', 'ind', 'log_feature'], axis = 1, inplace = True)
data_log2.head()

Unnamed: 0,id,volume,feature 1,feature 10,feature 100,feature 101,feature 102,feature 103,feature 104,feature 105,...,feature 90,feature 91,feature 92,feature 93,feature 94,feature 95,feature 96,feature 97,feature 98,feature 99
0,14121,19,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,14121,19,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,9320,200,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9320,116,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,14394,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
## From PCA, it appears that putting the volume value in the corresponding feature column
## does not improve the variance explained. So keep the volume in its own column.

## Loops through the column names of the dataframe
#for i in data_log2:
#    if bool(re.match(r'^feature', i)):
#        #print i
#        data_log2[i] = data_log2['volume']*data_log2[i]
#data_log2.head()

In [34]:
data_log3 = data_log2.groupby('id',as_index = False).agg('sum')
data_log3.head()

Unnamed: 0,id,volume,feature 1,feature 10,feature 100,feature 101,feature 102,feature 103,feature 104,feature 105,...,feature 90,feature 91,feature 92,feature 93,feature 94,feature 95,feature 96,feature 97,feature 98,feature 99
0,1,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,17,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
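
For reference, the volume-weighted variant that the commented-out loop in cell 33 describes (each feature column holding its volume instead of a 0/1 indicator) can be sketched with `pivot_table`. The rows below are made up, and this is only an illustration of the alternative that was tried and set aside, not the approach kept in this notebook.

```python
import pandas as pd

log_toy = pd.DataFrame({'id': [5022, 5022, 6597, 8011],
                        'log_feature': ['feature 172', 'feature 56',
                                        'feature 68', 'feature 68'],
                        'volume': [2, 1, 6, 7]})

# One row per id, one column per feature, cells hold the summed volume
volume_wide = log_toy.pivot_table(index='id', columns='log_feature',
                                  values='volume', aggfunc='sum', fill_value=0)
```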


In [35]:
##################################
# RUN THIS IF YOU WANT TO DO PCA #
##################################

#pca = SparsePCA(n_components=50)
pca = PCA(n_components=80, svd_solver = 'randomized')
X = data_log3.iloc[:,2:]
#print X.max(axis = 0)
#X = StandardScaler().fit_transform(X) ## Subtracts mean and scales to unit variance
#X = MaxAbsScaler().fit_transform(X) ## Scales max value to 1.0
X_r = pca.fit(X).transform(X)
#print pca.explained_variance_ratio_
print 'Percent of variance explained: ' + str(100*sum(pca.explained_variance_ratio_)) +'%'
X_r = pd.DataFrame(X_r)
X_r.head()

Percent of variance explained: 95.416181118%


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,70,71,72,73,74,75,76,77,78,79
0,-0.331675,-0.468316,0.129815,-0.177359,-0.057702,-0.015969,-0.093411,-0.04221,0.06821,0.007045,...,-0.003077,-0.001232,-0.000936,0.005362,0.003032,0.002718,-0.005725,-0.007405,-0.015922,0.000443
1,1.13715,0.340427,1.348122,0.322382,-0.193731,-0.071037,-0.064252,-0.021318,0.000304,-0.277255,...,0.001516,-0.000158,-0.001102,-0.001072,-0.008401,0.015629,0.006009,-0.001121,-0.001502,-0.015872
2,-0.285661,-0.337115,0.043985,0.076141,0.076676,-0.199548,-0.160323,-0.053515,-0.003116,-0.016885,...,-0.009563,0.01301,0.038495,-0.008551,0.002285,0.005897,-0.012325,0.015963,-0.002878,-0.000191
3,-0.273572,-0.305997,0.037948,0.05633,0.060593,-0.16099,-0.092163,-0.030388,-0.001412,-0.010615,...,-0.003521,-0.027062,-0.010762,-0.054319,-0.004727,0.001195,-0.001075,0.012205,-0.028629,0.027125
4,0.759213,0.014236,-0.612816,-0.167859,-0.210102,-0.1773,-0.079396,-0.02302,0.002542,-0.403728,...,-0.001563,0.000473,0.000337,0.000484,0.002149,0.002625,-0.000953,0.000343,0.000514,0.001292


In [36]:
data_log_f = pd.concat([data_log3[['id', 'volume']], X_r], axis=1)
data_log_f.head()

Unnamed: 0,id,volume,0,1,2,3,4,5,6,7,...,70,71,72,73,74,75,76,77,78,79
0,1,5,-0.331675,-0.468316,0.129815,-0.177359,-0.057702,-0.015969,-0.093411,-0.04221,...,-0.003077,-0.001232,-0.000936,0.005362,0.003032,0.002718,-0.005725,-0.007405,-0.015922,0.000443
1,2,5,1.13715,0.340427,1.348122,0.322382,-0.193731,-0.071037,-0.064252,-0.021318,...,0.001516,-0.000158,-0.001102,-0.001072,-0.008401,0.015629,0.006009,-0.001121,-0.001502,-0.015872
2,3,2,-0.285661,-0.337115,0.043985,0.076141,0.076676,-0.199548,-0.160323,-0.053515,...,-0.009563,0.01301,0.038495,-0.008551,0.002285,0.005897,-0.012325,0.015963,-0.002878,-0.000191
3,4,3,-0.273572,-0.305997,0.037948,0.05633,0.060593,-0.16099,-0.092163,-0.030388,...,-0.003521,-0.027062,-0.010762,-0.054319,-0.004727,0.001195,-0.001075,0.012205,-0.028629,0.027125
4,5,17,0.759213,0.014236,-0.612816,-0.167859,-0.210102,-0.1773,-0.079396,-0.02302,...,-0.001563,0.000473,0.000337,0.000484,0.002149,0.002625,-0.000953,0.000343,0.000514,0.001292


In [37]:
## Scalers expect 2D input of shape (n_samples, n_features), so select the column
## with double brackets and flatten the result before assigning it back
#data_log_f['volume'] = StandardScaler().fit_transform(data_log_f[['volume']]).ravel() ## Subtracts mean and scales to unit variance
data_log_f['volume'] = MaxAbsScaler().fit_transform(data_log_f[['volume']]).ravel() ## Scales max absolute value to 1.0
data_log_f.head()



Unnamed: 0,id,volume,0,1,2,3,4,5,6,7,...,70,71,72,73,74,75,76,77,78,79
0,1,0.003032,-0.331675,-0.468316,0.129815,-0.177359,-0.057702,-0.015969,-0.093411,-0.04221,...,-0.003077,-0.001232,-0.000936,0.005362,0.003032,0.002718,-0.005725,-0.007405,-0.015922,0.000443
1,2,0.003032,1.13715,0.340427,1.348122,0.322382,-0.193731,-0.071037,-0.064252,-0.021318,...,0.001516,-0.000158,-0.001102,-0.001072,-0.008401,0.015629,0.006009,-0.001121,-0.001502,-0.015872
2,3,0.001213,-0.285661,-0.337115,0.043985,0.076141,0.076676,-0.199548,-0.160323,-0.053515,...,-0.009563,0.01301,0.038495,-0.008551,0.002285,0.005897,-0.012325,0.015963,-0.002878,-0.000191
3,4,0.001819,-0.273572,-0.305997,0.037948,0.05633,0.060593,-0.16099,-0.092163,-0.030388,...,-0.003521,-0.027062,-0.010762,-0.054319,-0.004727,0.001195,-0.001075,0.012205,-0.028629,0.027125
4,5,0.010309,0.759213,0.014236,-0.612816,-0.167859,-0.210102,-0.1773,-0.079396,-0.02302,...,-0.001563,0.000473,0.000337,0.000484,0.002149,0.002625,-0.000953,0.000343,0.000514,0.001292
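
scikit-learn scalers work on 2D input of shape `(n_samples, n_features)`, which matters when scaling a single column such as `volume`. A minimal sketch on made-up values, assuming a one-column selection with double brackets:

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

toy = pd.DataFrame({'id': [1, 2, 3], 'volume': [5, 2, 20]})

# Double brackets select a one-column DataFrame of shape (3, 1), which is 2D
scaled = MaxAbsScaler().fit_transform(toy[['volume']])

# ravel() flattens the (3, 1) result back to one value per row
toy['volume'] = scaled.ravel()
```

Each value is divided by the largest absolute value in the column, so the maximum becomes 1.0 and zeros stay at zero.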


# 3. Join the one-hot-encoded and PCA'ed dataframes

Now that each of the variables has been transformed, we can recombine them into a single dataframe that is ready to be fed to the model.

In [38]:
data_f = data[['id', 'ind']].merge(data_loc_f, on = 'id', how = 'left') ## merge location data
data_f = data_f.merge(data_e_f, on = 'id', how = 'left') ## merge event data
data_f = data_f.merge(data_r_f, on = 'id', how = 'left') ## merge resource data
data_f = data_f.merge(data_s_f, on = 'id', how = 'left') ## merge severity data
data_f = data_f.merge(data_log_f, on = 'id', how = 'left') ## merge log data
data_f.head()

Unnamed: 0,id,ind,0_x,1_x,2_x,3_x,4_x,5_x,6_x,7_x,...,70_y,71_y,72_y,73_y,74_y,75_y,76_y,77_y,78_y,79_y
0,14121,0,9.85068e-12,0.003395,-0.014285,-0.013228,-0.009355,-0.0044,-0.005226,-0.018963,...,-0.001563,0.000473,0.000337,0.000484,0.002149,0.002625,-0.000953,0.000343,0.000514,0.001292
1,9320,1,1.152762e-11,0.003434,-0.014457,-0.013398,-0.00949,-0.004468,-0.005308,-0.019272,...,-0.008468,0.001595,0.002015,0.005404,-0.001024,0.013165,-0.005862,0.000925,0.002843,0.003173
2,14394,2,-5.540639e-09,0.001711,-0.007112,-0.006367,-0.004235,-0.001918,-0.002256,-0.008027,...,-0.006388,0.002172,0.004246,0.003862,-0.001704,0.005253,-0.003798,0.004116,0.001537,0.006732
3,8218,3,2.235116e-11,0.002564,-0.010723,-0.009763,-0.006691,-0.003085,-0.003645,-0.013089,...,0.000304,0.001135,-0.0009,-0.001189,-0.001053,0.000603,0.000244,0.00321,0.00608,0.002155
4,14804,4,-1.62426e-10,0.001784,-0.007419,-0.006651,-0.004435,-0.002011,-0.002366,-0.008427,...,-0.012211,0.042337,0.007641,-0.031294,0.222864,0.008954,0.057951,0.053078,-0.020236,0.22628
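
The `_x` and `_y` suffixes in the column names appear because several of the PCA outputs use the same integer column labels, and `merge` disambiguates overlapping names. One way to keep track of where each column came from is to prefix the columns before merging. The sketch below uses made-up frames. The helper `with_prefix` and the prefixes are hypothetical names introduced only for this example.

```python
import pandas as pd

ids = pd.DataFrame({'id': [1, 2, 3]})

# Two PCA-style outputs whose component columns are both labelled 0 and 1
event_pca = pd.DataFrame([[0.1, 1.0], [0.2, 2.0], [0.3, 3.0]])
event_pca.insert(0, 'id', [1, 2, 3])
resource_pca = pd.DataFrame([[9.0, 0.5], [8.0, 0.6], [7.0, 0.7]])
resource_pca.insert(0, 'id', [1, 2, 3])

def with_prefix(frame, prefix):
    """Rename every column except 'id' to '<prefix><original label>'."""
    renamed = {c: prefix + str(c) for c in frame.columns if c != 'id'}
    return frame.rename(columns=renamed)

merged = ids.merge(with_prefix(event_pca, 'event_'), on='id', how='left')
merged = merged.merge(with_prefix(resource_pca, 'resource_'), on='id', how='left')
```

With distinct names there is no overlap for `merge` to resolve, and the later renumbering of the columns becomes optional.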


## Perform further dimensionality reduction if desired

In [39]:
if False:
    print data_f.shape
    pca = PCA(n_components=300, svd_solver = 'randomized')
    X = data_f.iloc[:,2:]
    #X = StandardScaler().fit_transform(X)
    #X = MaxAbsScaler().fit_transform(X)
    X_r = pca.fit(X).transform(X)
    print 'Percent of variance explained: ' + str(100*sum(pca.explained_variance_ratio_)) +'%'
    X_r = pd.DataFrame(X_r)
    X_r.head()

In [40]:
#data_f2 = pd.concat([data_f[['id', 'ind']], X_r], axis=1)
data_f2 = data_f.copy()
data_f2.head()

Unnamed: 0,id,ind,0_x,1_x,2_x,3_x,4_x,5_x,6_x,7_x,...,70_y,71_y,72_y,73_y,74_y,75_y,76_y,77_y,78_y,79_y
0,14121,0,9.85068e-12,0.003395,-0.014285,-0.013228,-0.009355,-0.0044,-0.005226,-0.018963,...,-0.001563,0.000473,0.000337,0.000484,0.002149,0.002625,-0.000953,0.000343,0.000514,0.001292
1,9320,1,1.152762e-11,0.003434,-0.014457,-0.013398,-0.00949,-0.004468,-0.005308,-0.019272,...,-0.008468,0.001595,0.002015,0.005404,-0.001024,0.013165,-0.005862,0.000925,0.002843,0.003173
2,14394,2,-5.540639e-09,0.001711,-0.007112,-0.006367,-0.004235,-0.001918,-0.002256,-0.008027,...,-0.006388,0.002172,0.004246,0.003862,-0.001704,0.005253,-0.003798,0.004116,0.001537,0.006732
3,8218,3,2.235116e-11,0.002564,-0.010723,-0.009763,-0.006691,-0.003085,-0.003645,-0.013089,...,0.000304,0.001135,-0.0009,-0.001189,-0.001053,0.000603,0.000244,0.00321,0.00608,0.002155
4,14804,4,-1.62426e-10,0.001784,-0.007419,-0.006651,-0.004435,-0.002011,-0.002366,-0.008427,...,-0.012211,0.042337,0.007641,-0.031294,0.222864,0.008954,0.057951,0.053078,-0.020236,0.22628


# 4. Split the data back into training and test sets for prediction

In [41]:
dfdim = data_f2.shape
data_f2.columns = range(dfdim[1])
data_f2.rename(columns={0: 'id', 1: 'ind'}, inplace=True)
data_f2.head()

Unnamed: 0,id,ind,2,3,4,5,6,7,8,9,...,598,599,600,601,602,603,604,605,606,607
0,14121,0,9.85068e-12,0.003395,-0.014285,-0.013228,-0.009355,-0.0044,-0.005226,-0.018963,...,-0.001563,0.000473,0.000337,0.000484,0.002149,0.002625,-0.000953,0.000343,0.000514,0.001292
1,9320,1,1.152762e-11,0.003434,-0.014457,-0.013398,-0.00949,-0.004468,-0.005308,-0.019272,...,-0.008468,0.001595,0.002015,0.005404,-0.001024,0.013165,-0.005862,0.000925,0.002843,0.003173
2,14394,2,-5.540639e-09,0.001711,-0.007112,-0.006367,-0.004235,-0.001918,-0.002256,-0.008027,...,-0.006388,0.002172,0.004246,0.003862,-0.001704,0.005253,-0.003798,0.004116,0.001537,0.006732
3,8218,3,2.235116e-11,0.002564,-0.010723,-0.009763,-0.006691,-0.003085,-0.003645,-0.013089,...,0.000304,0.001135,-0.0009,-0.001189,-0.001053,0.000603,0.000244,0.00321,0.00608,0.002155
4,14804,4,-1.62426e-10,0.001784,-0.007419,-0.006651,-0.004435,-0.002011,-0.002366,-0.008427,...,-0.012211,0.042337,0.007641,-0.031294,0.222864,0.008954,0.057951,0.053078,-0.020236,0.22628


In [42]:
train_f = data_f2.iloc[:7381,:]
test_f = data_f2.iloc[7381:,:]

In [43]:
train_f.tail()

Unnamed: 0,id,ind,2,3,4,5,6,7,8,9,...,598,599,600,601,602,603,604,605,606,607
7376,870,7376,-3.002237e-09,0.001692,-0.007029,-0.006291,-0.004181,-0.001893,-0.002226,-0.00792,...,-0.001563,0.000473,0.000337,0.000484,0.002149,0.002625,-0.000953,0.000343,0.000514,0.001292
7377,18068,7377,-1.43027e-08,0.001702,-0.00707,-0.006329,-0.004208,-0.001905,-0.002241,-0.007973,...,-0.005297,0.001117,-0.002319,0.000826,0.006319,0.001333,-0.00433,-0.000201,0.000417,0.003845
7378,14111,7378,2.417206e-11,0.002731,-0.011436,-0.010447,-0.007204,-0.003334,-0.003943,-0.014188,...,-0.002847,-0.00025,0.000431,0.002134,0.000711,0.000775,-0.001536,-0.004561,0.000396,0.002253
7379,15189,7379,2.059386e-11,0.002706,-0.011328,-0.010343,-0.007126,-0.003296,-0.003898,-0.01402,...,0.009561,-0.004032,-0.023572,0.004786,-0.001334,-0.004877,-0.003513,-0.001578,-0.000588,-0.006106
7380,17067,7380,9.576173e-13,0.002232,-0.009313,-0.008423,-0.005704,-0.002611,-0.003079,-0.011017,...,0.003573,0.002076,0.005899,-0.007442,-0.005246,0.000966,0.011124,-0.001769,-0.015978,0.004338


In [44]:
test_f.head()

Unnamed: 0,id,ind,2,3,4,5,6,7,8,9,...,598,599,600,601,602,603,604,605,606,607
7381,11066,7381,-5.321147e-13,0.002031,-0.008462,-0.007623,-0.005126,-0.002337,-0.002753,-0.009827,...,-0.008573,0.001792,0.001486,0.002925,0.000384,-0.034962,-0.009123,0.003093,-0.001088,-0.020953
7382,18000,7382,1.121763e-11,0.005596,-0.023946,-0.023246,-0.018113,-0.009124,-0.011043,-0.041673,...,-0.002847,-0.00025,0.000431,0.002134,0.000711,0.000775,-0.001536,-0.004561,0.000396,0.002253
7383,16964,7383,-2.21263e-10,0.00195,-0.008119,-0.007303,-0.004897,-0.002229,-0.002624,-0.009361,...,-0.008468,0.001595,0.002015,0.005404,-0.001024,0.013165,-0.005862,0.000925,0.002843,0.003173
7384,4795,7384,-9.035293e-11,0.001763,-0.007328,-0.006568,-0.004376,-0.001984,-0.002334,-0.008309,...,-0.008462,-0.100142,-0.450021,0.176753,0.043204,0.019032,-0.011268,-0.032611,0.048543,0.045147
7385,3392,7385,1.115611e-11,0.006186,-0.026591,-0.026159,-0.021005,-0.010834,-0.013204,-0.050581,...,-0.002847,-0.00025,0.000431,0.002134,0.000711,0.000775,-0.001536,-0.004561,0.000396,0.002253
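
The split position can also be derived from the length of the training frame instead of being typed in, and taking explicit copies keeps later in-place edits from raising pandas' `SettingWithCopyWarning`. A minimal sketch on made-up frames (all names here are hypothetical stand-ins):

```python
import pandas as pd

train_toy = pd.DataFrame({'id': [10, 11, 12], 'fault_severity': [0, 1, 2]})
combined = pd.DataFrame({'id': [10, 11, 12, 20, 21],
                         'ind': [0, 1, 2, 3, 4],
                         'feat': [0.1, 0.2, 0.3, 0.4, 0.5]})

n_train = len(train_toy)  # replaces a hard-coded row count

# .copy() gives independent frames, so in-place edits below act on
# the copies rather than on slices of `combined`
train_part = combined.iloc[:n_train, :].copy()
test_part = combined.iloc[n_train:, :].copy()

train_part.drop(['ind'], axis=1, inplace=True)
test_part.drop(['ind'], axis=1, inplace=True)
```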


In [45]:
## Use the non-inplace form of drop so that pandas returns new dataframes
## instead of modifying slices of data_f2 (avoids SettingWithCopyWarning)
train_f = train_f.drop(['ind'], axis = 1)
test_f = test_f.drop(['ind'], axis = 1)


In [46]:
train_f.head()

Unnamed: 0,id,2,3,4,5,6,7,8,9,10,...,598,599,600,601,602,603,604,605,606,607
0,14121,9.85068e-12,0.003395,-0.014285,-0.013228,-0.009355,-0.0044,-0.005226,-0.018963,-0.008552,...,-0.001563,0.000473,0.000337,0.000484,0.002149,0.002625,-0.000953,0.000343,0.000514,0.001292
1,9320,1.152762e-11,0.003434,-0.014457,-0.013398,-0.00949,-0.004468,-0.005308,-0.019272,-0.008709,...,-0.008468,0.001595,0.002015,0.005404,-0.001024,0.013165,-0.005862,0.000925,0.002843,0.003173
2,14394,-5.540639e-09,0.001711,-0.007112,-0.006367,-0.004235,-0.001918,-0.002256,-0.008027,-0.003378,...,-0.006388,0.002172,0.004246,0.003862,-0.001704,0.005253,-0.003798,0.004116,0.001537,0.006732
3,8218,2.235116e-11,0.002564,-0.010723,-0.009763,-0.006691,-0.003085,-0.003645,-0.013089,-0.005684,...,0.000304,0.001135,-0.0009,-0.001189,-0.001053,0.000603,0.000244,0.00321,0.00608,0.002155
4,14804,-1.62426e-10,0.001784,-0.007419,-0.006651,-0.004435,-0.002011,-0.002366,-0.008427,-0.003555,...,-0.012211,0.042337,0.007641,-0.031294,0.222864,0.008954,0.057951,0.053078,-0.020236,0.22628


Merge the `fault_severity` variable back into the dataframe for prediction.

In [47]:
train_f2 = pd.merge(train[['id', "fault_severity"]], train_f, on = "id")
train_f2.head()

Unnamed: 0,id,fault_severity,2,3,4,5,6,7,8,9,...,598,599,600,601,602,603,604,605,606,607
0,14121,1,9.85068e-12,0.003395,-0.014285,-0.013228,-0.009355,-0.0044,-0.005226,-0.018963,...,-0.001563,0.000473,0.000337,0.000484,0.002149,0.002625,-0.000953,0.000343,0.000514,0.001292
1,9320,0,1.152762e-11,0.003434,-0.014457,-0.013398,-0.00949,-0.004468,-0.005308,-0.019272,...,-0.008468,0.001595,0.002015,0.005404,-0.001024,0.013165,-0.005862,0.000925,0.002843,0.003173
2,14394,1,-5.540639e-09,0.001711,-0.007112,-0.006367,-0.004235,-0.001918,-0.002256,-0.008027,...,-0.006388,0.002172,0.004246,0.003862,-0.001704,0.005253,-0.003798,0.004116,0.001537,0.006732
3,8218,1,2.235116e-11,0.002564,-0.010723,-0.009763,-0.006691,-0.003085,-0.003645,-0.013089,...,0.000304,0.001135,-0.0009,-0.001189,-0.001053,0.000603,0.000244,0.00321,0.00608,0.002155
4,14804,0,-1.62426e-10,0.001784,-0.007419,-0.006651,-0.004435,-0.002011,-0.002366,-0.008427,...,-0.012211,0.042337,0.007641,-0.031294,0.222864,0.008954,0.057951,0.053078,-0.020236,0.22628
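
The merge above is an inner join on `id`, so a row whose `id` is missing from either side would be dropped without any message. Below is a minimal sketch of a more defensive merge. The frames `labels` and `features` are hypothetical toy stand-ins rather than the notebook's data, and the `validate` argument assumes pandas 0.21 or later.

```python
import pandas as pd

# Hypothetical toy stand-ins for train[['id', 'fault_severity']] and train_f
labels = pd.DataFrame({"id": [14121, 9320, 14394], "fault_severity": [1, 0, 1]})
features = pd.DataFrame({"id": [14121, 9320, 14394], "2": [0.1, 0.2, 0.3]})

# validate="one_to_one" raises MergeError if an id is duplicated on either side
merged = pd.merge(labels, features, on="id", how="inner", validate="one_to_one")

# An inner join drops unmatched ids, so check that no rows were lost
assert len(merged) == len(labels)
print(merged.columns.tolist())
```

If the row-count check fails, some training ids have no matching feature row and the feature engineering steps are worth revisiting.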


# 5. Modeling and prediction

Split the training data into a training set (70%) and a validation set (30%), stratified on `fault_severity` so that both sets keep the same class proportions.

In [48]:
train_train, train_test = train_test_split(train_f2, train_size=0.7, random_state=0,
                                           stratify=train_f2['fault_severity'])
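
To illustrate what `stratify` does, here is a small sketch on hypothetical, deliberately imbalanced toy data (the `toy` frame and its column `x` are made up for illustration). It assumes a scikit-learn version that provides `sklearn.model_selection`, as the notebook itself does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels: 60% class 0, 30% class 1, 10% class 2
toy = pd.DataFrame({"x": np.arange(100),
                    "fault_severity": [0] * 60 + [1] * 30 + [2] * 10})

tr, va = train_test_split(toy, train_size=0.7, random_state=0,
                          stratify=toy["fault_severity"])

# With stratification, class proportions in each part follow the full data
print(tr["fault_severity"].value_counts(normalize=True).sort_index())
print(va["fault_severity"].value_counts(normalize=True).sort_index())
```

Without `stratify`, a rare class such as severity 2 could by chance end up under-represented in the validation set, which would make the validation log loss noisier.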

Uncomment the model you want to use.

In [49]:
## http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
## Smaller C => stronger regularization. C = 10000 and C = 1000 make no difference.
#model = linear_model.LogisticRegression(C = 10000, solver = 'sag', multi_class = 'multinomial', max_iter = 500)

## http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
## max_depth controls regularization: the smaller the depth, the stronger the regularization
#model = RandomForestClassifier(max_depth=5, random_state=0)
#model = RandomForestClassifier(max_depth = 30, random_state=0)

## http://xgboost.readthedocs.io/en/latest/parameter.html
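## fault_severity has three classes (0, 1, 2), so state the multiclass objective explicitly.
## The sklearn wrapper switches to 'multi:softprob' anyway when it sees more than two classes.
#model = XGBClassifier(max_depth=10, learning_rate=1.0, n_estimators=100,
#                    objective='multi:softprob', subsample=1.0, colsample_bytree=0.6, seed=0)
model = XGBClassifier(max_depth=10, learning_rate=1.0, n_estimators=100,
                    objective='multi:softprob', subsample=1.0, colsample_bytree=0.6, seed=0, reg_lambda = 10000)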

Fit the model to the training data.

In [50]:
model.fit(train_train.iloc[:, 2:], train_train["fault_severity"])

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.6,
       gamma=0, learning_rate=1.0, max_delta_step=0, max_depth=10,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='multi:softprob', reg_alpha=0, reg_lambda=10000,
       scale_pos_weight=1, seed=0, silent=True, subsample=1.0)

Determine the log loss on the training and validation sets.

In [51]:
pred_train = model.predict_proba(train_train.iloc[:, 2:])
score = log_loss(train_train["fault_severity"], pred_train)
print('Logloss for the training set: ' + str(score))

## Use a separate name for the validation predictions
pred_valid = model.predict_proba(train_test.iloc[:, 2:])
score = log_loss(train_test["fault_severity"], pred_valid)
print('Logloss for the validation set: ' + str(score))

Logloss for the training set: 0.493300728551
Logloss for the validation set: 0.577787463532
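
For reference, the multiclass log loss is the mean negative log-probability that the model assigns to the true class. The sketch below reimplements it with NumPy on made-up numbers. It is a simplified illustration, not scikit-learn's implementation: it assumes integer labels 0..K-1 and rows that already sum to 1, and the function name `multiclass_log_loss` is hypothetical.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    # Clip to avoid log(0) when a probability is exactly 0 or 1
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    rows = np.arange(len(y_true))
    # Pick, for every row, the probability of its true class
    return -np.mean(np.log(proba[rows, np.asarray(y_true)]))

y_true = [0, 1, 2]
proba = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.2, 0.2, 0.6]]
loss = multiclass_log_loss(y_true, proba)
print(loss)
```

A useful yardstick: predicting a uniform 1/3 for each of the three classes gives a log loss of ln 3, about 1.0986, so both scores above are well below that baseline.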


Use the blocks below if you want to ensemble the predictions from different models.

In [52]:
## For ensembling of logistic regression, random forest, and xgboost
toEnsemble = False
if toEnsemble:
    model_l = linear_model.LogisticRegression(C = 10000, solver = 'sag', multi_class = 'multinomial', max_iter = 500)
    model_r = RandomForestClassifier(max_depth = 30, random_state=0)
    ## fault_severity has three classes, so state the multiclass objective explicitly
    model_x = XGBClassifier(max_depth=10, learning_rate=1.0, n_estimators=100,
                        objective='multi:softprob', subsample=1.0, colsample_bytree=0.6, seed=0, reg_lambda = 1000)
    model_l.fit(train_train.iloc[:, 2:], train_train["fault_severity"])
    model_r.fit(train_train.iloc[:, 2:], train_train["fault_severity"])
    model_x.fit(train_train.iloc[:, 2:], train_train["fault_severity"])

In [53]:
## Continue ensembling
if toEnsemble:
    pred_train_l = model_l.predict_proba(train_train.iloc[:, 2:])
    pred_train_r = model_r.predict_proba(train_train.iloc[:, 2:])
    pred_train_x = model_x.predict_proba(train_train.iloc[:, 2:])
    ## Ensemble weights sum to 1; logistic regression is given zero weight
    pred_train = 0.0*pred_train_l + 0.5*pred_train_r + 0.5*pred_train_x
    score = log_loss(train_train["fault_severity"], pred_train)
    print('Logloss for the training set: ' + str(score))

    pred_valid_l = model_l.predict_proba(train_test.iloc[:, 2:])
    pred_valid_r = model_r.predict_proba(train_test.iloc[:, 2:])
    pred_valid_x = model_x.predict_proba(train_test.iloc[:, 2:])
    pred_valid = 0.0*pred_valid_l + 0.5*pred_valid_r + 0.5*pred_valid_x
    score = log_loss(train_test["fault_severity"], pred_valid)
    print('Logloss for the validation set: ' + str(score))
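
The ensembling above is a weighted average of the class-probability matrices. A minimal sketch of the same idea as a reusable helper follows. The function `blend` and the toy matrices are hypothetical, and the weight normalization is an addition not present in the notebook.

```python
import numpy as np

def blend(probas, weights):
    """Weighted average of class-probability matrices of identical shape."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # normalize so the weights sum to 1
    stacked = np.stack(probas)          # shape: (n_models, n_rows, n_classes)
    # Contract the model axis: result has shape (n_rows, n_classes)
    return np.tensordot(weights, stacked, axes=1)

# Made-up predictions from two models for two rows and three classes
p_a = np.array([[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]])
p_b = np.array([[0.6, 0.3, 0.1], [0.4, 0.4, 0.2]])
blended = blend([p_a, p_b], [0.5, 0.5])
print(blended)
print(blended.sum(axis=1))
```

Because the weights sum to 1 and each input row sums to 1, every blended row is still a valid probability distribution, which is what `log_loss` and the submission format expect.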

In [54]:
test_f.head()

Unnamed: 0,id,2,3,4,5,6,7,8,9,10,...,598,599,600,601,602,603,604,605,606,607
7381,11066,-5.321147e-13,0.002031,-0.008462,-0.007623,-0.005126,-0.002337,-0.002753,-0.009827,-0.004181,...,-0.008573,0.001792,0.001486,0.002925,0.000384,-0.034962,-0.009123,0.003093,-0.001088,-0.020953
7382,18000,1.121763e-11,0.005596,-0.023946,-0.023246,-0.018113,-0.009124,-0.011043,-0.041673,-0.022084,...,-0.002847,-0.00025,0.000431,0.002134,0.000711,0.000775,-0.001536,-0.004561,0.000396,0.002253
7383,16964,-2.21263e-10,0.00195,-0.008119,-0.007303,-0.004897,-0.002229,-0.002624,-0.009361,-0.003971,...,-0.008468,0.001595,0.002015,0.005404,-0.001024,0.013165,-0.005862,0.000925,0.002843,0.003173
7384,4795,-9.035293e-11,0.001763,-0.007328,-0.006568,-0.004376,-0.001984,-0.002334,-0.008309,-0.003502,...,-0.008462,-0.100142,-0.450021,0.176753,0.043204,0.019032,-0.011268,-0.032611,0.048543,0.045147
7385,3392,1.115611e-11,0.006186,-0.026591,-0.026159,-0.021005,-0.010834,-0.013204,-0.050581,-0.02878,...,-0.002847,-0.00025,0.000431,0.002134,0.000711,0.000775,-0.001536,-0.004561,0.000396,0.002253


Now use the fitted model to predict on the test data.

In [55]:
x = model.predict_proba(test_f.iloc[:,1:])

If you are ensembling, run the cell below to overwrite `x` with the ensembled predictions.

In [56]:
## With ensembling
if toEnsemble:
    pred_test_l = model_l.predict_proba(test_f.iloc[:, 1:])
    pred_test_r = model_r.predict_proba(test_f.iloc[:, 1:])
    pred_test_x = model_x.predict_proba(test_f.iloc[:, 1:])
    x = 0.0*pred_test_l + 0.5*pred_test_r + 0.5*pred_test_x

Convert the model prediction matrix to a dataframe.

In [57]:
x2 = pd.DataFrame(x)
x2.head()

Unnamed: 0,0,1,2
0,0.934493,0.044125,0.021382
1,0.294642,0.181729,0.523629
2,0.943519,0.043226,0.013255
3,0.419826,0.532792,0.047382
4,0.273276,0.470908,0.255816


Rename the columns to the names required by the submission format, then attach the `id` column from the test data.

In [58]:
## Copy and paste column names from sample submission file
predcols = ["predict_0","predict_1","predict_2"]
x2.columns = predcols

x2 = pd.concat([test['id'], x2], axis = 1)
x2.head()

Unnamed: 0,id,predict_0,predict_1,predict_2
0,11066,0.934493,0.044125,0.021382
1,18000,0.294642,0.181729,0.523629
2,16964,0.943519,0.043226,0.013255
3,4795,0.419826,0.532792,0.047382
4,3392,0.273276,0.470908,0.255816
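
Note that `pd.concat(..., axis = 1)` aligns rows by index label rather than by position. It works above because `test` and `x2` both carry the default 0..n-1 index. A frame that kept its original labels (as `test_f` does, starting at 7381) would not line up. A minimal sketch of the pitfall on hypothetical toy data:

```python
import pandas as pd

# ids carry non-default index labels, as test_f does in the notebook
ids = pd.Series([11066, 18000, 16964], index=[7381, 7382, 7383], name="id")
preds = pd.DataFrame({"predict_0": [0.9, 0.3, 0.9],
                      "predict_1": [0.05, 0.2, 0.05],
                      "predict_2": [0.05, 0.5, 0.05]})

# Index labels do not overlap, so concat yields 6 rows padded with NaN
misaligned = pd.concat([ids, preds], axis=1)
print(misaligned.shape)

# Resetting the index makes the join effectively positional
aligned = pd.concat([ids.reset_index(drop=True), preds], axis=1)
print(aligned.shape)
```

Checking that the result has no missing values and the expected number of rows is a cheap guard before writing a submission.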


Save the dataframe to a csv file for submission.

In [59]:
## Don't keep the indices
## https://stackoverflow.com/questions/16923281/pandas-writing-dataframe-to-csv-file
x2.to_csv("submission/submit_5.csv", index=False)
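
Before uploading, it can help to sanity-check the submission. The sketch below uses a hypothetical two-row stand-in for `x2` (values copied from the output above) and writes to an in-memory buffer rather than to disk. The specific checks are suggestions, not requirements stated by the competition.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for the x2 dataframe
sub = pd.DataFrame({"id": [11066, 18000],
                    "predict_0": [0.934493, 0.294642],
                    "predict_1": [0.044125, 0.181729],
                    "predict_2": [0.021382, 0.523629]})

prob_cols = ["predict_0", "predict_1", "predict_2"]
assert sub["id"].is_unique                                       # one row per id
assert np.allclose(sub[prob_cols].sum(axis=1), 1.0, atol=1e-4)   # rows sum to 1

# Write to an in-memory buffer instead of disk, then read it back
buf = io.StringIO()
sub.to_csv(buf, index=False)
buf.seek(0)
round_trip = pd.read_csv(buf)
print(round_trip.columns.tolist())
```

Reading the file back confirms that `index=False` kept the row index out of the CSV, so the header contains only `id` and the three prediction columns.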