## Fitting an SVM to veg-colour data

1. Split the data into train, validation and test sets.
2. Scale the features.
3. Try an SVM with a linear kernel first, and find the value of C that minimises the error on the validation set.
4. Try adding new features or using a Gaussian kernel.

**Only 1000 data points are used to begin with, to speed up processing while we get the pipeline working.**

In [None]:
import sklearn
import pandas as pd
import numpy as np

In [None]:
data = pd.read_csv("training.csv", header= None)
#data = data.iloc[1:100000,:]

In [None]:
X = data.iloc[:,0:2]
Y = data.iloc[:,2]

Split data into train, val and test sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.4)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size = 0.5, train_size = 0.5)
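
The two-stage split above can be checked on a small synthetic array. The sketch below is only an illustration: the random data, the `_demo` names, and the `random_state` and `stratify` arguments are assumptions added here for reproducibility and are not part of the original cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 2 features, binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))
y_demo = rng.integers(0, 2, size=1000)

# First split: 60% train, 40% held out.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0, stratify=y_demo
)
# Second split: halve the held-out part into validation and test sets.
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0, stratify=y_hold
)

print(len(X_tr), len(X_va), len(X_te))  # 600 200 200
```

Fixing `random_state` makes the split repeatable between runs, which helps when comparing models trained in different cells.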

Now that we have our training, validation and test sets, we need to scale our features. For this I have used the preprocessing tools in scikit-learn. `StandardScaler` stores the statistics computed on the training set (mean and standard deviation of each feature) so that the same transformation can be applied later to the validation and test sets. For convenience we transform those two sets now as well.

In [None]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
scaler



In [None]:
X_train = scaler.transform(X_train)
X_val =scaler.transform(X_val)
X_test = scaler.transform(X_test)
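
To make the "stored properties" concrete, the following sketch fits a `StandardScaler` on synthetic training data and applies it to a separate validation array. The arrays and the `_demo` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and validation features.
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=5.0, scale=2.0, size=(600, 2))
val_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 2))

scaler_demo = StandardScaler().fit(train_demo)

# The fitted scaler keeps the per-column training mean and standard deviation.
print(scaler_demo.mean_, scaler_demo.scale_)

train_scaled = scaler_demo.transform(train_demo)
val_scaled = scaler_demo.transform(val_demo)

# Training columns are centred on zero; validation columns are only
# approximately centred, because they reuse the training statistics.
print(train_scaled.mean(axis=0), val_scaled.mean(axis=0))
```

Reusing the training statistics keeps information from the validation and test sets out of the preprocessing step.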

Next we'll implement a for loop to find an optimum value for C in the SVM algorithm.

In [None]:
# Candidate values of C to try in the search below
values = [30, 10, 0.01, 0.03, 0.1, 0.3, 1, 3, 0.0001]


Below we fit a trial SVM on the 1000-row sample before attempting to optimise the SVM parameters.

In [None]:
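from sklearn import svm

# Linear-kernel baseline; note that score() returns accuracy, not error
clf = svm.SVC(C = 1, kernel = 'linear')
clf.fit(X_train, y_train)
accuracy = clf.score(X_val, y_val)
accuracy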

In [None]:
# Same trial with the RBF (Gaussian) kernel; score() returns accuracy
clf = svm.SVC(C = 1, kernel = 'rbf')
clf.fit(X_train, y_train)
accuracy = clf.score(X_val, y_val)
accuracy

Below we try to optimise C on the 1,000-row sample. We use the RBF kernel with its default settings to begin with.

In [None]:
from sklearn import svm
error_min = 100
list_C = list()
list_error = list()
for i in values:
    clf = svm.SVC(C = i, kernel = 'rbf')
    clf.fit(X_train, y_train)
    error = 1 - clf.score(X_val, y_val)
    list_C.append(i)
    list_error.append(error)
    if error < error_min:
        error_min = error
        optim_C = i
        optim_model = clf


In [None]:
print(list_error)
print(list_C)
print(optim_C)

In [None]:
optim_model.score(X_test, y_test)
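
The search loop above can be wrapped in a small helper so that it can be reused with other estimators. This is a minimal sketch: `select_C` is a hypothetical name introduced here, and the synthetic data only stands in for the scaled sets used in the notebook.

```python
import numpy as np
from sklearn import svm


def select_C(make_model, candidates, X_train, y_train, X_val, y_val):
    """Fit one model per candidate C; return (best_C, best_model, errors)."""
    best_C, best_model, best_error = None, None, float("inf")
    errors = {}
    for C in candidates:
        model = make_model(C).fit(X_train, y_train)
        error = 1 - model.score(X_val, y_val)  # validation error rate
        errors[C] = error
        if error < best_error:
            best_C, best_model, best_error = C, model, error
    return best_C, best_model, errors


# Synthetic two-feature problem standing in for the real data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_tr, y_tr = X_demo[:200], y_demo[:200]
X_va, y_va = X_demo[200:], y_demo[200:]

best_C, best_model, errors = select_C(
    lambda C: svm.SVC(C=C, kernel="rbf"), [0.01, 0.1, 1, 10], X_tr, y_tr, X_va, y_va
)
print(best_C, errors)
```

Passing a factory such as `lambda C: svm.SVC(C=C, kernel="rbf")` means the same helper would also work for any other estimator that takes a `C` argument.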

## Logistic Regression

The SVM is currently taking a long time to run, so we will now try logistic regression, which should give similar results on this problem. We use the same train, validation and test sets as above, which have already been scaled. We will also add polynomial features.

In [None]:
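import sklearn.linear_model

logreg = sklearn.linear_model.LogisticRegression()
logreg.fit(X_train, y_train)  # fit on the training set, not the test set
accuracy = logreg.score(X_val, y_val)  # score() returns accuracy, not error
accuracy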

In [None]:
values = [30, 10, 0.01,0.03, 0.1, 0.3, 1, 3, 0.0001]
error_min = 100
list_C = list()
list_error = list()
for i in values:
    logreg = sklearn.linear_model.LogisticRegression(C = i)
    logreg.fit(X_train, y_train)
    error = 1 - logreg.score(X_val, y_val)
    list_C.append(i)
    list_error.append(error)
    if error < error_min:
        error_min = error
        optim_C = i
        optim_model = logreg

In [None]:
print(list_error)
print(list_C)
print(optim_C)

In [None]:
optim_model.score(X_test, y_test)

With logistic regression we get nearly exactly the same level of accuracy, ~68%. To try to improve on this we'll add some new features. We'll start completely from scratch to do this.

## Restart the notebook here ##

In [1]:
import sklearn
import pandas as pd
import numpy as np
data = pd.read_csv("training.csv", header= None)
#data = data.iloc[1:100000,:]
X = data.iloc[:,0:2]
Y = data.iloc[:,2]
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X)
X_scale = scaler.transform(X)

At this point we have scaled all our X values. We will now add additional polynomial features. This can be done easily using sklearn:

In [7]:
poly = sklearn.preprocessing.PolynomialFeatures(degree = 5)
X_poly = poly.fit_transform(X_scale)
X_poly.shape

(6300548, 21)
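
The second dimension of the shape above can be checked by counting monomials: with 2 input features and degree 5, there is one term per monomial of total degree 0 to 5, including the constant bias column. The sketch below uses only the standard library; the variable names are illustrative.

```python
from itertools import combinations_with_replacement
from math import comb

n_inputs, degree = 2, 5

# Closed form: number of monomials of total degree <= `degree`
# in `n_inputs` variables.
n_terms = comb(n_inputs + degree, degree)

# Same count by enumeration: for each d in 0..degree, pick d factors
# from the inputs with repetition allowed.
enumerated = sum(
    1
    for d in range(degree + 1)
    for _ in combinations_with_replacement(range(n_inputs), d)
)

print(n_terms, enumerated)  # 21 21
```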

We now have 21 features in total (including the constant bias column), increasing the complexity of our model. Next we split the data into train, validation and test sets in the same 60:20:20 ratio.

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_poly, Y, test_size = 0.4)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size = 0.5, train_size = 0.5)
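
Note that the restart cell fits the scaler on all rows before splitting, so the validation and test rows contribute to the scaling statistics. One possible variant, sketched here on synthetic data, chains the steps in a scikit-learn `Pipeline` so that scaling and feature expansion are fitted on the training rows only. The data, the `_demo` names, and the `max_iter` and `random_state` settings are assumptions, not the notebook's own configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic two-feature problem with a non-linear (circular) boundary.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] ** 2 + X_demo[:, 1] ** 2 > 1.0).astype(int)

# Same 60:20:20 split as in the notebook, applied to the raw features.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0
)
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0
)

# Scaling, polynomial expansion and the classifier are fitted together,
# using the training rows only.
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=5),
    LogisticRegression(C=1.0, max_iter=1000),
)
model.fit(X_tr, y_tr)

val_accuracy = model.score(X_va, y_va)
print(val_accuracy)
```

With millions of rows the difference in scaling statistics is likely to be small, but the pipeline form also makes it harder to forget a preprocessing step when scoring new data.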

In [9]:
from sklearn import svm
values = [0.01, 0.3, 1, 3, 100]
error_min = 100
list_C = list()
list_error = list()
for i in values:
    print('testing', i)
    logreg = sklearn.linear_model.LogisticRegression(C = i)
    logreg.fit(X_train, y_train)
    error = 1 - logreg.score(X_val, y_val)
    list_C.append(i)
    list_error.append(error)
    if error < error_min:
        error_min = error
        optim_C = i
        optim_model = logreg

testing 0.01
testing 0.3
testing 1
testing 3
testing 100


In [10]:
print(list_error)
print(list_C)
print(optim_C)

[0.32402409313472635, 0.32379316091452337, 0.32339160866908445, 0.32537953035846079, 0.32128068184523573]
[0.01, 0.3, 1, 3, 100]
100


In [11]:
optim_model.score(X_test, y_test)

0.67805032893953698
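
The validation errors printed above can be plotted against C to see how little the regularisation strength matters here. This is a sketch only: the two lists are copied from the printed output, and the plotting choices (log-scaled x-axis, axis labels) are assumptions.

```python
import matplotlib.pyplot as plt

# Values copied from the printed output of the search over C.
list_C = [0.01, 0.3, 1, 3, 100]
list_error = [
    0.32402409313472635,
    0.32379316091452337,
    0.32339160866908445,
    0.32537953035846079,
    0.32128068184523573,
]

# Pick the C with the lowest validation error and measure the spread.
best_C = list_C[list_error.index(min(list_error))]
spread = max(list_error) - min(list_error)
print(best_C, round(spread, 4))  # 100 0.0041

fig, ax = plt.subplots()
ax.plot(list_C, list_error, marker="o")
ax.set_xscale("log")  # C spans four orders of magnitude
ax.set_xlabel("C")
ax.set_ylabel("validation error")
```

The spread between the best and worst C is under half a percentage point, which suggests that tuning C alone is unlikely to move the accuracy much beyond ~68%.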

In [None]:
import matplotlib.pyplot as plt
