
Data Analysis Explained While Kaggling

Goal

Find the best model for the Kaggle competition data-science-london-scikit-learn.

Initial data setup:

import numpy as np

# Training features, training labels and test features
data  = np.loadtxt('train.csv', delimiter=',')
label = np.loadtxt('trainLabels.csv', delimiter=',')
test  = np.loadtxt('test.csv', delimiter=',')

Function to create a Kaggle-compliant submission file:

def kaggle_file(path, coll):
    """Write predictions to a CSV file in the Kaggle submission format."""
    with open(path, "w") as csvfile:
        csvfile.write("id,solution\n")
        for i, v in enumerate(coll):
            csvfile.write("%d,%d\n" % (i + 1, v))

Cross Validation

To determine the performance of any supervised model, the known dataset is split into two parts (see the sketch below):

  • a training part to fit the model;
  • a test part to compute the model’s performance.
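A minimal sketch of this simple split, assuming scikit-learn's train_test_split with 25% of the samples held out for testing:

from sklearn import cross_validation

# Hold out 25% of the known data to evaluate the model
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    data, label, test_size=0.25, random_state=0)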

A more sophisticated validation scheme, called cross-validation, splits the dataset into K equal parts:

  • one part is used as the test set;
  • the others as the training set.

The procedure is repeated K times, so each part serves once as the test set, and the K scores are then combined (averaged).

Here is the KFold used:

from sklearn import cross_validation
cv = cross_validation.KFold(len(data), n_folds=10, shuffle=True, random_state=0)

Simple classifier: kNN

For any reasonable k, the model gives similar and reasonably good performance (> 80%).

Which one to choose? Some Kaggle submissions show that k=6 (0.88860) is slightly better than k=3 (0.88785) or k=13 (0.88189).

The boxplot below shows no clear winner:

./boxplot_knn_1_30.png

From now on, the average cross-validation score will be used as the single performance indicator.
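A minimal sketch of how these scores can be computed, assuming KNeighborsClassifier and the 10-fold cv defined above (the boxplot is drawn from the per-fold scores):

from sklearn.neighbors import KNeighborsClassifier

k_values = range(1, 31)
# One array of 10 fold scores per value of k
knn_scores = [cross_validation.cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                               data, label, cv=cv)
              for k in k_values]
# Average score per k, used as the single performance indicator
knn_averages = [np.average(s) for s in knn_scores]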

SVM optimization

SVM is a natural choice for a small dataset.

  • Pro: robust
  • Con: computationally expensive (quadratic in the number of samples)

Let’s optimize C and gamma. Practice shows that gamma should be the first parameter to optimize, and then C can be optimized.

Below is a Python code snippet that tests each value of the parameter with cross-validation, plots the performance, and returns the best parameter:

from sklearn import svm
import matplotlib.pyplot as plt

# Successive search ranges for C, from coarse to fine
c_values = [1e-1, 1., 1e2, 1e3, 1e4, 1e5, 1e6]
c_values = [1e-2, 0.03, 0.09, 0.1, 0.6, 1, 2, 3, 6, 10, 20]
c_values = np.linspace(1, 3, 20)
clf = svm.SVC(gamma=0.01)

final_scores = []
for c in c_values:
    clf.C = c
    scores = cross_validation.cross_val_score(clf, data, label, cv=cv)
    final_scores.append(np.average(scores))
plt.plot(c_values, final_scores)
print max(zip(final_scores, c_values))   # best average score and the corresponding C

In scikit-learn, GridSearchCV automatically finds the best parameters of an estimator:

from sklearn import grid_search

C_range = np.linspace(2, 20, 20)
gamma_range = np.logspace(-3, 0, 5)
param_grid = {'gamma': gamma_range, 'C': C_range}
cv = cross_validation.KFold(len(label), n_folds=10, shuffle=True, random_state=0)
classifier = svm.SVC()
gs = grid_search.GridSearchCV(classifier, param_grid=param_grid, cv=cv)
gs.fit(data, label)
print gs.best_estimator_

Other Models

Neural networks, boosting and bootstrap models give poor results, as the dataset is rather small.
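For reference, a minimal sketch of how such models could be checked with the same cross-validation; RandomForestClassifier and AdaBoostClassifier are used here only as illustrative examples, with arbitrary settings:

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

for model in [RandomForestClassifier(n_estimators=100, random_state=0),
              AdaBoostClassifier(n_estimators=100, random_state=0)]:
    # Same 10-fold cross-validation as for kNN and SVM
    scores = cross_validation.cross_val_score(model, data, label, cv=cv)
    print model.__class__.__name__, np.average(scores)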

With Dimension reduction

With 40 features, it is reasonable to reduce their number and obtain a less noisy dataset.

The new features (principal components) are linear combinations of the original ones.

The PCA variance of the 40 orthogonal components shows that the last 3 are very small compared to the others. But the empirically best number of PCA components for the SVM is 12: there is no obvious threshold on the orthogonal components.
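A minimal sketch of how the number of components can be scanned, assuming sklearn's PCA chained with the SVM through a Pipeline (the SVM parameters are the tuned values reported in the conclusion):

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

n_components_range = range(2, 41)
pca_scores = []
for n in n_components_range:
    model = Pipeline([('pca', PCA(n_components=n)),
                      ('svm', svm.SVC(C=3, gamma=0.28))])
    scores = cross_validation.cross_val_score(model, data, label, cv=cv)
    pca_scores.append(np.average(scores))
print max(zip(pca_scores, n_components_range))   # best average score and its number of components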

How to get even better performance

  • Train the model on a well-truncated dataset, since some data points may have biased the model?
  • Refine the parameter optimization, not simply one parameter at a time?

Conclusion

The SVM (C=3 and gamma=0.28) combined with a 12-component PCA gives a score of 0.94635 (a slight improvement with GridSearchCV: 0.94970).
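As an illustration, the final model described above could be assembled and turned into a submission like this (a sketch reusing kaggle_file from the setup; the file name is illustrative):

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# 12-component PCA followed by the tuned SVM
final_model = Pipeline([('pca', PCA(n_components=12)),
                        ('svm', svm.SVC(C=3, gamma=0.28))])
final_model.fit(data, label)
kaggle_file('final_submission.csv', final_model.predict(test))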

This work was done at bigdive2013, with special thanks to our teacher André Panisson.