# Nonlinear classifiers

Try nonlinear classifiers. Can you do better than the models from above?
- Try a Random Forest. Does increasing the number of trees help?
- Try SVMs. Does the RBF kernel perform better than the linear one?

In [19]:
import numpy as np
import pandas as pd

In [2]:
with np.load('train.npz', allow_pickle=False) as npz_file:
    # Load the arrays
    x_train = npz_file['features']
    y_train = npz_file['targets']

with np.load('valid.npz', allow_pickle=False) as npz_file:
    # Load the arrays
    x_valid = npz_file['features']
    y_valid = npz_file['targets']

with np.load('test.npz', allow_pickle=False) as npz_file:
    # Load the arrays
    x_test = npz_file['features']
    y_test = npz_file['targets']

# X = np.concatenate((x_train, x_valid, x_test), axis=0)
# y = np.concatenate((y_train, y_valid, y_test), axis=0)

## Random Forest

In [3]:
from sklearn.ensemble import RandomForestClassifier

Let's start with an ensemble of 1 decision tree and a max_depth of 3 for better comparability with the previous model, 'Simple decision tree'.

In [4]:
rf = RandomForestClassifier(n_estimators=1, max_depth=3, random_state=0)
rf.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [5]:
rf.score(x_valid, y_valid)

0.19424460431654678

The simple decision tree has a lower accuracy than this single-tree Random Forest.

Tune the number of trees:

In [6]:
rf_tuned = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
rf_tuned.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [7]:
rf_tuned.score(x_valid, y_valid)

0.41007194244604317

By increasing the number of trees, we achieve a higher accuracy than the simple (notably untuned) decision tree and a slightly lower accuracy than the k-NN model. The Random Forest model also does slightly better than the tuned logistic regression model. Note: the logistic regression model was tuned on a larger data set, since x_train and x_valid were concatenated. For this reason, the two models cannot be compared one-to-one.
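
To see how accuracy changes with the number of trees rather than comparing only two settings, the ensemble size can be swept in a loop. The following is a minimal sketch, not the notebook's own procedure: it uses synthetic data from `make_classification` as a stand-in for the `.npz` files, and the list `tree_counts` is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): the notebook's .npz files are not used here.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Candidate ensemble sizes (illustrative values, not taken from the notebook).
tree_counts = [1, 10, 50, 100]

valid_scores = {}
for n in tree_counts:
    # Same depth limit and seed as in the cells above; only n_estimators varies.
    model = RandomForestClassifier(n_estimators=n, max_depth=3, random_state=0)
    model.fit(X_tr, y_tr)
    valid_scores[n] = model.score(X_va, y_va)

for n, acc in valid_scores.items():
    print('n_estimators={:>3d} - validation accuracy {:.3f}'.format(n, acc))
```

With the real data, `X_tr`, `y_tr`, `X_va`, and `y_va` would be replaced by `x_train`, `y_train`, `x_valid`, and `y_valid`.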

In [8]:
from sklearn.model_selection import cross_validate

In [9]:
# Mean test score of a single decision tree 
rf_scores = cross_validate(rf, x_train, y_train, cv=10)
print('Decision tree - mean test {:.3f}'.format(
    np.mean(rf_scores['test_score'])))

# Mean test score of a random forest
rf_tuned_scores = cross_validate(rf_tuned, x_train, y_train, cv=10)
print('Random forest - mean test {:.3f}'.format(
    np.mean(rf_tuned_scores['test_score'])))

Decision tree - mean test 0.221
Random forest - mean test 0.231


A (stratified) 10-fold cross-validation on the training set shows that the ensemble of trees achieves a slightly better mean accuracy than the single tree (0.231 vs. 0.221).
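
Because the two cross-validated means are close, it can help to look at the spread across folds as well. The sketch below makes the stratification explicit and reports mean and standard deviation per model. It runs on synthetic stand-in data (an assumption), and the dictionary names are hypothetical. For a classifier, passing `cv=10` is equivalent to `StratifiedKFold(n_splits=10)`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (assumption): replace with x_train / y_train.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=4, random_state=0)

# Explicit version of what cv=10 does for a classifier.
cv = StratifiedKFold(n_splits=10)

models = {
    'Single tree': RandomForestClassifier(
        n_estimators=1, max_depth=3, random_state=0),
    'Random forest': RandomForestClassifier(
        n_estimators=100, max_depth=3, random_state=0),
}

cv_summary = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv)['test_score']
    # Keep both the mean and the fold-to-fold spread.
    cv_summary[name] = (np.mean(scores), np.std(scores))
    print('{} - mean test {:.3f} (std {:.3f})'.format(
        name, cv_summary[name][0], cv_summary[name][1]))
```

If the difference between the means is small relative to the standard deviations, the comparison should be read with caution.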

## RBF SVM (Support Vector Machines)

In [10]:
from sklearn.svm import SVC

In [11]:
rbf_svc = SVC(kernel='rbf', C=10, gamma=1, random_state=0)
rbf_svc.fit(x_train, y_train)

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=0, shrinking=True,
  tol=0.001, verbose=False)

In [12]:
rbf_svc.score(x_valid, y_valid)

0.23741007194244604

Change the kernel to the linear one:

In [13]:
# Note: gamma is ignored by the linear kernel; it is kept only to mirror the RBF call
linear_svc = SVC(kernel='linear', C=10, gamma=1, random_state=0)
linear_svc.fit(x_train, y_train)

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='linear',
  max_iter=-1, probability=False, random_state=0, shrinking=True,
  tol=0.001, verbose=False)

In [14]:
linear_svc.score(x_valid, y_valid)

0.15827338129496402

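With these settings, the RBF kernel scores higher on the validation set than the linear kernel (0.237 vs. 0.158). Scikit-learn also provides a LinearSVC estimator, which is a more efficient implementation of support vector machines with a linear kernel. It is not used here, and because it optimizes a slightly different objective, its accuracy can differ from that of SVC(kernel='linear').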
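
To check how `LinearSVC` behaves relative to `SVC(kernel='linear')`, both can be fitted on the same split. This is a minimal sketch on synthetic stand-in data (an assumption), with `max_iter=10000` chosen only to give the solver room to converge. By default `LinearSVC` uses the squared hinge loss and a one-vs-rest scheme, while `SVC` uses the hinge loss and trains one-vs-one, so the two scores need not match.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in data (assumption): replace with the notebook's arrays.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

estimators = {
    # Kernel SVM with a linear kernel (libsvm-based), as in the cell above.
    'SVC(kernel=linear)': SVC(kernel='linear', C=10, random_state=0),
    # Dedicated linear SVM (liblinear-based); max_iter is an illustrative choice.
    'LinearSVC': LinearSVC(C=10, max_iter=10000, random_state=0),
}

svm_scores = {}
for name, est in estimators.items():
    est.fit(X_tr, y_tr)
    svm_scores[name] = est.score(X_va, y_va)
    print('{} - validation accuracy {:.3f}'.format(name, svm_scores[name]))
```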

---------------

Store the model names and the accuracy on the test data in a .csv file:

In [15]:
rf_tuned.score(x_test, y_test)

0.44

In [16]:
linear_svc.score(x_test, y_test)

0.18

In [17]:
rbf_svc.score(x_test, y_test)

0.24

In [20]:
Test_accuracy_06_Nonlinear_classifiers = pd.DataFrame(data={
    'model': ['random forest', 'svm linear', 'svm rbf'],
    'test_acurracy': [
        rf_tuned.score(x_test, y_test),
        linear_svc.score(x_test, y_test),
        rbf_svc.score(x_test, y_test),
    ],
})

In [22]:
Test_accuracy_06_Nonlinear_classifiers.to_csv(path_or_buf = r'C:\Users\heyus\Desktop\Desktop\EPFL_Data Science COS\EPFL\04. Applied Machine Learning 2\11. Course project\Test_accuracy_06_Nonlinear_classifiers.csv')
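
The export step above can also be written without an absolute, machine-specific path. The sketch below is one possible variant, not the notebook's own code: it reuses the test accuracies printed in the cells above, writes to a temporary directory as a hypothetical stand-in for the project folder, and passes `index=False` so that the row index is not stored as an extra column.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Test accuracies as printed in the cells above.
test_scores = {
    'random forest': 0.44,
    'svm linear': 0.18,
    'svm rbf': 0.24,
}

# Column names copied from the cell above so the CSV layout stays the same.
results = pd.DataFrame({
    'model': list(test_scores.keys()),
    'test_acurracy': list(test_scores.values()),
})

# Hypothetical output location: a temporary directory stands in for the project folder.
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / 'Test_accuracy_06_Nonlinear_classifiers.csv'

# index=False is a design choice here; the original call keeps the index column.
results.to_csv(out_path, index=False)

# Read the file back to confirm what was written.
reloaded = pd.read_csv(out_path)
print(reloaded)
```

In the notebook itself, `out_dir` could instead be `Path('.')` or any folder relative to the notebook, which keeps the cell runnable on other machines.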