# Nonlinear classifiers

>*Try with nonlinear classifiers, can you do better than the baseline models from above?*
> * *Try with a random forest, does increasing the number of trees help?*
> * *Try with SVMs - does the RBF kernel perform better than the linear one?*

### Random Forest

I am going to start by loading the features and labels for all three sets (train, validation and test).

In [1]:
# Import the packages needed 
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Load features and labels from npz files
with np.load('train.npz', allow_pickle=False) as npz_file:
    X_train=npz_file['features']
    y_train=npz_file['labels']

with np.load('valid.npz', allow_pickle=False) as npz_file:
    X_valid=npz_file['features']
    y_valid=npz_file['labels']

with np.load('test.npz', allow_pickle=False) as npz_file:
    X_test=npz_file['features']
    y_test=npz_file['labels']

For the random forest I am going to tune the number of estimators (trees) and see which value gives the best result. I am going to keep `max_depth` equal to 3 so that the results can be compared with the decision tree. I am going to fit the random forest with 1 to 100 estimators and save the accuracy on both the validation and test sets.

In [3]:
valid_log=[]
test_log=[]
n_estimators=np.arange(100)

for n_est in n_estimators:
    # Create random forest estimator
    rf = RandomForestClassifier(n_estimators=n_est+1, max_depth=3, random_state=0)

    # Fit estimator
    rf.fit(X_train, y_train)
    
    # Save accuracy on validation set
    valid_accuracy=rf.score(X_valid, y_valid)
    valid_log.append(valid_accuracy)
    
    # Save accuracy on test set
    test_accuracy=rf.score(X_test, y_test)
    test_log.append(test_accuracy)

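I am going to save the results in a dataframe and sort them by validation accuracy. Note that the `n_estimators` column stores the loop value `n_est`, while each forest is fitted with `n_est+1` trees, so the actual number of trees is the listed value plus one.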

In [4]:
results=pd.DataFrame({'n_estimators':n_estimators,'validation accuracy':valid_log,'test accuracy':test_log})
results.sort_values(by='validation accuracy', ascending=False).head(5)

| | n_estimators | validation accuracy | test accuracy |
|---|---|---|---|
| 8 | 8 | 0.827338 | 0.75 |
| 17 | 17 | 0.820144 | 0.816667 |
| 7 | 7 | 0.820144 | 0.75 |
| 16 | 16 | 0.820144 | 0.816667 |
| 24 | 24 | 0.81295 | 0.866667 |
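
To report the actual tree count of the best forest, the loop value has to be shifted by one, since each forest is fitted with `n_est+1` trees. A minimal sketch using only the five rows shown above; the names `top_rows`, `best_row` and `best_n_trees` are illustrative and not part of the original notebook:

```python
import pandas as pd

# The five best rows from the table above (index = loop value n_est)
top_rows = pd.DataFrame(
    {
        "n_estimators": [8, 17, 7, 16, 24],
        "validation accuracy": [0.827338, 0.820144, 0.820144, 0.820144, 0.81295],
        "test accuracy": [0.75, 0.816667, 0.75, 0.816667, 0.866667],
    },
    index=[8, 17, 7, 16, 24],
)

# Row with the highest validation accuracy
best_row = top_rows.loc[top_rows["validation accuracy"].idxmax()]

# The loop fits n_est + 1 trees, so add one to get the real tree count
best_n_trees = int(best_row["n_estimators"]) + 1
best_test_accuracy = float(best_row["test accuracy"])

print(best_n_trees, best_test_accuracy)
```

Under this reading, the forest with the highest validation accuracy uses nine trees.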


I am going to save the test accuracy of the random forest with the highest validation accuracy in a dataframe, and at the end of this exercise I will save it in a CSV file.

In [5]:
# save results
results=pd.DataFrame({
        'model': ['random forest'],
        'test_accuracy': '{:.3f}'.format(results.sort_values(by='validation accuracy', ascending=False).iloc[0,2])
    })

### Linear SVM

For the linear SVM I am going to fit the estimator with the train set and then compute the score on both the validation and test sets.

In [6]:
from sklearn.svm import LinearSVC

# Create SVM with linear kernel
linear_svc = LinearSVC()

In [7]:
# Fit estimator
linear_svc.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [8]:
linear_svc.score(X_valid, y_valid)

0.9064748201438849

In [9]:
linear_svc.score(X_test, y_test)

0.9333333333333333

I am going to save the test accuracy of the linear svc in a dataframe and add it to the random forest result.

In [10]:
# save results
results2=pd.DataFrame({
        'model': ['svm linear'],
        'test_accuracy': '{:.3f}'.format(linear_svc.score(X_test, y_test))
    })

In [11]:
# add test accuracy of linear svc to test accuracy of random forest
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
results=pd.concat([results, results2])

### SVM with RBF kernel

For the SVM with RBF kernel I am going to fit the model on the train set for ten different C values, logarithmically spaced from 0.0001 to 10000. I am going to save the accuracy on both the validation and test sets for each C value.
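
As a quick check of the grid itself, the sketch below prints the ten C values that `np.logspace(-4, 4, num=10)` produces. The names `C_grid` and `ratios` are illustrative only:

```python
import numpy as np

# Ten values spaced evenly on a log scale between 10**-4 and 10**4
C_grid = np.logspace(-4, 4, num=10)

# Consecutive values differ by a constant factor of 10**(8/9), about 7.74
ratios = C_grid[1:] / C_grid[:-1]

for C in C_grid:
    print(f"{C:.6f}")
```

Each step multiplies C by the same factor, so the grid covers eight orders of magnitude with only ten fits.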

In [12]:
from sklearn.svm import SVC

In [13]:
valid_log=[]
test_log=[]
C_values=[]

for C_value in np.logspace(-4, 4, num=10):
    
    # Create SVM with RBF kernel
    rbf_svc = SVC(kernel='rbf', C=C_value)
    
    # Fit estimator
    rbf_svc.fit(X_train, y_train)
    
    # Save accuracy on validation set
    valid_accuracy=rbf_svc.score(X_valid, y_valid)
    valid_log.append(valid_accuracy)
    
    # Save accuracy on test set
    test_accuracy=rbf_svc.score(X_test, y_test)
    test_log.append(test_accuracy)
    
    # Save C value
    C_values.append(C_value)



I am going to save the results in a dataframe and sort it by the validation accuracy.

In [14]:
scores_df=pd.DataFrame(C_values, columns=['C'])
scores_df['validation accuracy']=valid_log
scores_df['test accuracy']=test_log

In [15]:
scores_df.sort_values('validation accuracy', ascending=False)

| | C | validation accuracy | test accuracy |
|---|---|---|---|
| 6 | 21.544347 | 0.920863 | 0.95 |
| 7 | 166.810054 | 0.920863 | 0.95 |
| 8 | 1291.549665 | 0.920863 | 0.95 |
| 9 | 10000.0 | 0.920863 | 0.95 |
| 5 | 2.782559 | 0.899281 | 0.966667 |
| 4 | 0.359381 | 0.791367 | 0.85 |
| 3 | 0.046416 | 0.244604 | 0.266667 |
| 0 | 0.0001 | 0.23741 | 0.2 |
| 1 | 0.000774 | 0.23741 | 0.2 |
| 2 | 0.005995 | 0.23741 | 0.2 |
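
Each accuracy is the ratio of correctly classified samples to the set size, so the reported values hint at how many samples separate two models. The sketch below recovers the simplest fraction behind each value. The set sizes it suggests are an inference from the rounded numbers, not something the notebook states, and the helper `smallest_set_size` is hypothetical:

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def smallest_set_size(accuracies, max_size=200):
    """Smallest sample count for which every accuracy is a whole number of hits.

    Assumes the true set size does not exceed max_size.
    """
    denominators = [
        Fraction(acc).limit_denominator(max_size).denominator for acc in accuracies
    ]
    # Least common multiple of all denominators
    return reduce(lambda a, b: a * b // gcd(a, b), denominators, 1)

# Accuracies reported in this notebook
valid_accuracies = [0.827338, 0.820144, 0.81295, 0.9064748201438849, 0.920863, 0.899281]
test_accuracies = [0.75, 0.816667, 0.866667, 0.9333333333333333, 0.95, 0.966667]

n_valid = smallest_set_size(valid_accuracies)
n_test = smallest_set_size(test_accuracies)
print(n_valid, n_test)
```

If the test set really has 60 samples, one sample is worth about 1.7 percentage points, so the gap between 0.95 and 0.966667 on the test set corresponds to a single sample.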


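I am going to save the test accuracy of the RBF SVM with the highest validation accuracy in a dataframe and add it to the random forest and linear SVC results. Four C values share the highest validation accuracy, and all four have the same test accuracy of 0.95.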

In [16]:
# save results
results3=pd.DataFrame({
        'model': ['svm rbf'],
        'test_accuracy': '{:.3f}'.format(scores_df.sort_values(by='validation accuracy', ascending=False).iloc[0,2])
    })
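
When several C values tie on validation accuracy, the row that ends up first after sorting depends on how ties are ordered. One way to make the choice explicit is a secondary sort key. The sketch below prefers the smallest C (the most strongly regularized model) among the ties; this tie-break rule and the names `top_scores`, `ranked` and `best` are assumptions for illustration, not the notebook's method:

```python
import pandas as pd

# Top rows of the RBF results table above
top_scores = pd.DataFrame(
    {
        "C": [21.544347, 166.810054, 1291.549665, 10000.0, 2.782559],
        "validation accuracy": [0.920863, 0.920863, 0.920863, 0.920863, 0.899281],
        "test accuracy": [0.95, 0.95, 0.95, 0.95, 0.966667],
    },
    index=[6, 7, 8, 9, 5],
)

# Sort by validation accuracy (descending), then by C (ascending) to break ties
ranked = top_scores.sort_values(
    by=["validation accuracy", "C"], ascending=[False, True]
)
best = ranked.iloc[0]
print(best["C"], best["test accuracy"])
```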

In [17]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
results=pd.concat([results, results3])

I am going to save the test accuracies of the three models in a csv file so that we can compare the results at the end.

In [18]:
# add to csv file with results
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
pd.concat([pd.read_csv('results.csv'), results]).to_csv('results.csv', index=False)
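
The cell above assumes that `results.csv` already exists from the earlier exercises. A hedged sketch of a helper that also handles a missing file follows; the function name `append_results` is hypothetical, and the example writes to a temporary directory rather than to the real `results.csv`:

```python
import os
import tempfile

import pandas as pd

def append_results(path, new_rows):
    """Append new_rows to the CSV at path, creating the file if it is missing."""
    if os.path.exists(path):
        combined = pd.concat([pd.read_csv(path), new_rows], ignore_index=True)
    else:
        combined = new_rows.reset_index(drop=True)
    combined.to_csv(path, index=False)
    return combined

# Usage with the test accuracies reported in this notebook
new_rows = pd.DataFrame(
    {
        "model": ["random forest", "svm linear", "svm rbf"],
        "test_accuracy": ["0.750", "0.933", "0.950"],
    }
)
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "results.csv")
    first = append_results(csv_path, new_rows)   # file is created
    second = append_results(csv_path, new_rows)  # rows are appended
    n_first, n_second = len(first), len(second)
```

As with the original cell, running the helper twice appends the same rows twice, so it should be called once per notebook run.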