# Assignment 3: SVM, Cross-Conformal Predictors, Neural Networks

In this assignment, we implement support vector machines, pipelines and neural networks with the help of scikit-learn functions. The cross-conformal predictor has not been attempted.

First, we load the wine dataset and split it using the train_test_split function.

In [172]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

Xtrain, Xtest, ytrain, ytest =  train_test_split(wine.data, wine.target, random_state= 2206)

In [173]:
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

From sklearn, we import cross_val_score and use the default parameters of SVC to check the accuracy of the SVM.

In [174]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
import numpy as np

svm = SVC()
wineSVC = cross_val_score(svm, Xtrain, ytrain, n_jobs= -1)

print('Cross val score is:', wineSVC)
GenWine = np.mean(wineSVC)

svm.fit(Xtrain, ytrain)

testscore = svm.score(Xtest, ytest)

Cross val score is: [0.66666667 0.59259259 0.7037037  0.61538462 0.61538462]


This outputs an array with one score per cross-validation fold. As we can see above, the score varies from fold to fold, so we take the mean of the scores as an estimate of the generalisation accuracy, which comes to about 0.6387.

We also calculate the test error rate, which comes out to be 0.267.

A test accuracy of about 73% (an error rate of about 27%) with default parameters leaves considerable room for improvement. We will tune the parameters later to get a better score. We also notice that the cross-validation accuracy (0.639) is lower than the test accuracy (0.733).

Note: This is done with the default parameters to get a baseline for our model. Later, we will normalise the data and use different parameters to see how the model responds.
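
The relationship between the test score, the test error rate and the number of misclassified wines can be checked with a short calculation. This is only a sketch: it assumes that train_test_split was called with its default test fraction of 0.25 (no test_size is passed in the cell above), and the variable names are chosen for illustration.

```python
import math

n_samples = 178          # size of the wine dataset
test_fraction = 0.25     # assumption: default used when test_size is not given

# train_test_split rounds the size of the test set up
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

test_score = 0.7333333333333333   # accuracy reported by svm.score
test_error = 1 - test_score       # error rate = 1 - accuracy
n_misclassified = round(test_error * n_test)

print("train / test sizes:", n_train, n_test)
print("test error rate:", round(test_error, 4))
print("misclassified test samples:", n_misclassified)
```

Under that assumption, the error rate corresponds to a whole number of misclassified test samples, which is a quick sanity check on the reported score.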

In [175]:
print('Generalisation is:', GenWine)
print('The test score is:', testscore)
print('Test error rate is:', 1-testscore)

Generalisation is: 0.6387464387464388
The test score is: 0.7333333333333333
Test error rate is: 0.2666666666666667


Now, we will test the SVM with various pre-processing techniques and observe which one works best. We will also use a pipeline to apply the normalisation and perform cross-validation using GridSearchCV.

First, we use MinMaxScaler along with different values of C and gamma for our SVM model.


In [176]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline

param_grid = {"svc__C": [0.1, 1, 10, 100,1000], "svc__gamma":[0.01, 0.1, 1, 10, 100]}

pipe = make_pipeline(MinMaxScaler(), SVC())
grid = GridSearchCV(pipe, param_grid= param_grid, cv = 5)
grid.fit(Xtrain, ytrain)
score= grid.score(Xtest, ytest)

print('Best cross-validation accuracy: ', grid.best_score_)
print('Best parameters:', grid.best_params_)
print('Test score:', score)
print('Test error score:', round((1-score),4))

Best cross-validation accuracy:  0.9925925925925926
Best parameters: {'svc__C': 0.1, 'svc__gamma': 1}
Test score: 0.9555555555555556
Test error score: 0.0444
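
The keys svc__C and svc__gamma in param_grid follow the names that make_pipeline gives to its steps. A minimal sketch of where these names come from is given here; it only inspects the pipeline and does not fit anything, and the variable names step_names and svc_params are illustrative.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = make_pipeline(MinMaxScaler(), SVC())

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipe.steps]
print(step_names)

# parameters of a step are addressed as "<step name>__<parameter>"
svc_params = sorted(p for p in pipe.get_params() if p.startswith("svc__"))
print("svc__C" in svc_params, "svc__gamma" in svc_params)
```

Because the scaler is a step of the pipeline, GridSearchCV refits it on the training part of each fold, so the held-out part of a fold does not influence the scaling.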


Now, we use the normalisation technique Normalizer.

In [177]:
from sklearn.preprocessing import Normalizer

param_grid = {"svc__C": [0.1, 1, 10, 100, 1000], "svc__gamma":[0.01, 0.1, 1, 10, 100]}

pipe = make_pipeline(Normalizer(), SVC())
grid = GridSearchCV(pipe, param_grid= param_grid, cv = 5)
grid.fit(Xtrain, ytrain)
score= grid.score(Xtest, ytest)

print('Best cross-validation accuracy: ', grid.best_score_)
print('Best parameters:', grid.best_params_)
print('Test score:', score)
print('Test error score:', round((1-score),4))

Best cross-validation accuracy:  0.962962962962963
Best parameters: {'svc__C': 1000, 'svc__gamma': 100}
Test score: 0.8666666666666667
Test error score: 0.1333


Now, we use StandardScaler.

In [178]:
from sklearn.preprocessing import StandardScaler

param_grid = {"svc__C": [0.1, 1, 10, 100, 1000], "svc__gamma":[0.01, 0.1, 1, 10, 100]}

pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(pipe, param_grid= param_grid, cv = 5)
grid.fit(Xtrain, ytrain)
score= grid.score(Xtest, ytest)

print('Best cross-validation accuracy: ', grid.best_score_)
print('Best parameters:', grid.best_params_)
print('Test score:', score)
print('Test error score:', round((1-score),4))

Best cross-validation accuracy:  0.9925925925925926
Best parameters: {'svc__C': 1, 'svc__gamma': 0.01}
Test score: 0.9555555555555556
Test error score: 0.0444


From the above, we can observe that the standardisation technique StandardScaler and the scaling technique MinMaxScaler give very similar results (both reach a test error rate of 0.0444). With Normalizer, the test score is noticeably lower and the test error rate higher (0.1333) than with the other two pre-processing techniques. This difference might be due to the fact that Normalizer rescales each sample (row) to unit norm rather than scaling each feature (column), so features with large values still dominate.
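
To make the difference between the three techniques concrete, here is a small NumPy sketch of what each transformer computes with its default settings. The toy matrix is hypothetical and not taken from the wine data; it has two features on very different scales.

```python
import numpy as np

# toy data: two features on very different scales (hypothetical values)
X = np.array([[1.0, 100.0],
              [2.0, 400.0],
              [3.0, 1000.0]])

# MinMaxScaler: each column is mapped to the range [0, 1]
minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# StandardScaler: each column gets zero mean and unit variance
standard = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalizer (default l2 norm): each row is scaled to unit length
normalized = X / np.linalg.norm(X, axis=1, keepdims=True)

print(minmax)
print(standard)
print(normalized)
```

In the normalised output the second column is close to 1 in every row, because the larger feature dominates the norm of each row. The first two techniques, in contrast, put both columns on a comparable scale.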

Now, we import the MLPClassifier to train a neural network on the wine dataset. First, we check our model with the default parameters.

In [179]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier()

score= cross_val_score(mlp, Xtrain, ytrain)
print('Using cross-val:', score)

scores= np.mean(score)

print('Generalisation score is:', scores)

Using cross-val: [0.59259259 0.88888889 0.33333333 0.92307692 0.38461538]
Generalisation score is: 0.6245014245014244


In [180]:
mlp.fit(Xtrain, ytrain)

testscore = mlp.score(Xtest, ytest)
print('Test error', 1-testscore)

Test error 0.7111111111111111


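Comparing the generalisation score and the test error, we notice that with default parameters the MLPClassifier does worse than the SVC: the mean cross-validation accuracy is slightly lower (0.6245 against 0.6387) and the test error rate is much higher (0.711 against 0.267). The fold scores also vary widely, from 0.33 to 0.92.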
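
The spread of the fold scores can be summarised alongside the mean. The sketch below simply copies the fold scores printed in the outputs above into arrays; nothing is retrained.

```python
import numpy as np

# fold scores copied from the cross_val_score outputs above
svc_scores = np.array([0.66666667, 0.59259259, 0.7037037, 0.61538462, 0.61538462])
mlp_scores = np.array([0.59259259, 0.88888889, 0.33333333, 0.92307692, 0.38461538])

for name, scores in [("SVC", svc_scores), ("MLP", mlp_scores)]:
    print(name, "mean:", round(scores.mean(), 4), "std:", round(scores.std(), 4))
```

A much larger standard deviation for the MLP suggests that its result with default parameters depends heavily on which fold is held out.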

We now introduce different pre-processing types along with the tuned parameters of the MLPClassifier & compare them.

In [181]:
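# lbfgs solver with a single hidden layer of 10 units
mlp1= MLPClassifier(solver= 'lbfgs', hidden_layer_sizes= [10])
# Note: beta_1 is only used by the 'adam' solver, so it does not affect an 'lbfgs' fit;
# the value 9 also lies outside its valid range [0, 1)
param_grid = {'mlpclassifier__alpha': [0.01, 0.1, 1, 10, 100], 'mlpclassifier__beta_1':[0.09, 0.9, 0.999, 9]}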

pipe1 = make_pipeline(MinMaxScaler(), mlp1)
grid1 = GridSearchCV(pipe1, param_grid= param_grid, cv = 5)
grid1.fit(Xtrain, ytrain)
score= grid1.score(Xtest, ytest)

print('Best cross-validation accuracy: ', grid1.best_score_)
print('Best parameters:', grid1.best_params_)
print('Test score:', score)
print('Test error score:', round((1-score),4))

Best cross-validation accuracy:  0.9851851851851852
Best parameters: {'mlpclassifier__alpha': 0.01, 'mlpclassifier__beta_1': 0.9}
Test score: 0.9777777777777777
Test error score: 0.0222


In [182]:
pipe1 = make_pipeline(StandardScaler(), mlp1)
grid1 = GridSearchCV(pipe1, param_grid= param_grid, cv = 5)
grid1.fit(Xtrain, ytrain)
score= grid1.score(Xtest, ytest)

print('Best cross-validation accuracy: ', grid1.best_score_)
print('Best parameters:', grid1.best_params_)
print('Test score:', score)
print('Test error score:', round((1-score),4))

Best cross-validation accuracy:  1.0
Best parameters: {'mlpclassifier__alpha': 10, 'mlpclassifier__beta_1': 0.09}
Test score: 0.9777777777777777
Test error score: 0.0222


In [183]:
pipe1 = make_pipeline(Normalizer(), mlp1)
grid1 = GridSearchCV(pipe1, param_grid= param_grid, cv = 5)
grid1.fit(Xtrain, ytrain)
score= grid1.score(Xtest, ytest)

print('Best cross-validation accuracy: ', grid1.best_score_)
print('Best parameters:', grid1.best_params_)
print('Test score:', score)
print('Test error score:', round((1-score),4))

Best cross-validation accuracy:  0.9242165242165242
Best parameters: {'mlpclassifier__alpha': 0.01, 'mlpclassifier__beta_1': 0.9}
Test score: 0.6888888888888889
Test error score: 0.3111


We observe that MinMaxScaler and StandardScaler give us the best test error rate (0.0222 each), while Normalizer gives a much higher error rate of about 0.311. Although the results of MinMaxScaler and StandardScaler are similar apart from the cross-validation accuracy, the best parameters chosen are different.

Also, we used different parameters for the MLPClassifier instead of only the default ones. We used the solver 'lbfgs' instead of the default 'adam', as lbfgs works well on smaller datasets and converges faster too. At the same time, we set hidden_layer_sizes to [10], i.e. a single hidden layer of 10 units instead of the default 100, as it is a small dataset.

Note: While I used 'lbfgs', I had initially experimented with the default 'adam' too, and it took many more iterations to converge than 'lbfgs', which converged much faster.
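
One way to see why a smaller hidden layer suits this dataset is to count the trainable parameters. The helper below is hypothetical (it is not part of scikit-learn) and assumes a fully connected network with a single hidden layer, 13 input features and 3 output classes, as in the wine data.

```python
def mlp_parameter_count(n_inputs, hidden_units, n_outputs):
    """Count the weights and biases of a fully connected net with one hidden layer."""
    hidden = n_inputs * hidden_units + hidden_units    # input -> hidden
    output = hidden_units * n_outputs + n_outputs      # hidden -> output
    return hidden + output

n_features, n_classes = 13, 3  # wine dataset: 13 attributes, 3 classes

small = mlp_parameter_count(n_features, 10, n_classes)     # hidden_layer_sizes=[10]
default = mlp_parameter_count(n_features, 100, n_classes)  # default of 100 units
print(small, default)
```

With only about 133 training samples (assuming the default 75/25 split), the network with 10 hidden units has far fewer parameters to fit than the default one.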

# USPS DATASET


We use genfromtxt to load the train and test sets. We then concatenate them into a single dataset and split it again using train_test_split.

In [184]:
train = np.genfromtxt('zip.train')
test = np.genfromtxt('zip.test')

print(train.shape)
print(test.shape)

(7291, 257)
(2007, 257)


In [185]:
# numpy is already imported as np above
usps = np.concatenate([train, test], axis= 0)
print(usps.shape)

(9298, 257)


In [186]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(usps[:,1:], usps[:,0], random_state = 2206)
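
The slicing used here (labels in column 0, features in the remaining columns) can be illustrated on a toy array. The arrays below are random stand-ins with hypothetical names, not the actual zip.train and zip.test files, and they have 4 feature columns instead of 256.

```python
import numpy as np

# toy stand-ins for zip.train / zip.test: the first column is the label,
# the remaining columns are the features
rng = np.random.default_rng(0)
toy_train = np.hstack([rng.integers(0, 10, size=(6, 1)), rng.random((6, 4))])
toy_test = np.hstack([rng.integers(0, 10, size=(2, 1)), rng.random((2, 4))])

toy = np.concatenate([toy_train, toy_test], axis=0)  # stack the rows
X, y = toy[:, 1:], toy[:, 0]                         # features, labels

print(toy.shape, X.shape, y.shape)
```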

Now we implement the SVC on USPS dataset with the default parameters.

In [187]:
svm  = SVC()

score = cross_val_score(svm, X_train, y_train)

print('Using cross-val with default parameters:', score)

print('Generalisation score is: ', (np.mean(score)))

Using cross-val with default parameters: [0.96917563 0.96774194 0.97562724 0.97202296 0.97345768]
Generalisation score is:  0.9716050868288569


In [188]:
svm.fit(X_train, y_train)

testscoreUSPS = svm.score(X_test, y_test)

print('Test error score:', round((1-testscoreUSPS),4))

Test error score: 0.0241


We notice that both the generalisation score and the test error rate are very good with the default parameters.

We now try different parameters of SVC on the USPS dataset. As the dataset takes too much time to compute, only StandardScaler and Normalizer are used here.

The cross-validation accuracy using the default parameters (0.9716) is better than that obtained with the tuned values of C and gamma and StandardScaler (0.9211). Also, here, Normalizer gives us a better result (0.9660) than StandardScaler.

In [189]:
param_gridUSPS = {'svc__C': [0.01, 0.1, 1], 'svc__gamma':[0.01, 0.1, 1]}

pipe = make_pipeline(StandardScaler(), svm)
gridUSPS= GridSearchCV(pipe, param_grid=param_gridUSPS, cv = 3, n_jobs= -1)
gridUSPS.fit(X_train, y_train)
score = gridUSPS.score(X_test, y_test)

print('Best cross-val accuracy:', gridUSPS.best_score_)
print('Best parameters:', gridUSPS.best_params_)
print('Test score:', score)
print('Test error rate',  round((1-score),4))

Best cross-val accuracy: 0.9211242512291871
Best parameters: {'svc__C': 1, 'svc__gamma': 0.01}
Test score: 0.9294623655913978
Test error rate 0.0705


In [191]:
param_gridUSPS = {'svc__C': [0.01, 0.1, 1], 'svc__gamma':[0.01, 0.1, 1]}

pipe = make_pipeline(Normalizer(), svm)
gridUSPS= GridSearchCV(pipe, param_grid=param_gridUSPS, cv = 3, n_jobs= -1)
gridUSPS.fit(X_train, y_train)
score1 = gridUSPS.score(X_test, y_test)

print('Best cross-val accuracy:', gridUSPS.best_score_)
print('Best parameters:', gridUSPS.best_params_)
print('Test score:', score1)
print('Test error rate',  round((1-score1),4))

Best cross-val accuracy: 0.9660124985348459
Best parameters: {'svc__C': 1, 'svc__gamma': 1}
Test score: 0.9716129032258064
Test error rate 0.0284


Now, we use an MLPClassifier on the USPS dataset. This time we have also used MinMaxScaler. The results of the three pre-processing techniques are very close, with test error rates between 0.0267 and 0.0292. MinMaxScaler and StandardScaler are again quite similar, although different values of alpha and beta_1 are chosen for them.

Note: This time we have kept the default solver and hidden layer size, as they work well on a large dataset like this one.

In [192]:
mlpUS = MLPClassifier()

param_gridUSPS = {'mlpclassifier__alpha': [0.01, 0.1, 1], 'mlpclassifier__beta_1':[0.09, 0.9]}

pipe = make_pipeline(Normalizer(), mlpUS)
gridUSPS= GridSearchCV(pipe, param_grid=param_gridUSPS, cv = 3, n_jobs= -1)
gridUSPS.fit(X_train, y_train)
score2 = gridUSPS.score(X_test, y_test)

print('Best cross-val accuracy:', gridUSPS.best_score_)
print('Best parameters:', gridUSPS.best_params_)
print('Test score:', score2)
print('Test error rate',  round((1-score2),4))

Best cross-val accuracy: 0.955113294961721
Best parameters: {'mlpclassifier__alpha': 0.01, 'mlpclassifier__beta_1': 0.9}
Test score: 0.970752688172043
Test error rate 0.0292


In [193]:
mlpUS = MLPClassifier()

param_gridUSPS = {'mlpclassifier__alpha': [0.01, 0.1, 1], 'mlpclassifier__beta_1':[0.09, 0.9]}

pipe = make_pipeline(StandardScaler(), mlpUS)
gridUSPS= GridSearchCV(pipe, param_grid=param_gridUSPS, cv = 3, n_jobs= -1)
gridUSPS.fit(X_train, y_train)
score3 = gridUSPS.score(X_test, y_test)

print('Best cross-val accuracy:', gridUSPS.best_score_)
print('Best parameters:', gridUSPS.best_params_)
print('Test score:', score3)
print('Test error rate',  round((1-score3),4))

Best cross-val accuracy: 0.9644354376029464
Best parameters: {'mlpclassifier__alpha': 0.1, 'mlpclassifier__beta_1': 0.09}
Test score: 0.9733333333333334
Test error rate 0.0267


In [194]:
param_gridUSPS = {'mlpclassifier__alpha': [0.01, 0.1, 1], 'mlpclassifier__beta_1':[0.09, 0.9]}


pipe = make_pipeline(MinMaxScaler(), mlpUS)
gridUSPS = GridSearchCV(pipe, param_grid= param_gridUSPS, cv = 3, n_jobs= -1)
gridUSPS.fit(X_train, y_train)
score4= gridUSPS.score(X_test, y_test)

print('Best cross-validation accuracy: ', gridUSPS.best_score_)
print('Best parameters:', gridUSPS.best_params_)
print('Test score:', score4)
print('Test error score:', round((1-score4),4))

Best cross-validation accuracy:  0.9611370828937872
Best parameters: {'mlpclassifier__alpha': 0.1, 'mlpclassifier__beta_1': 0.9}
Test score: 0.9711827956989247
Test error score: 0.0288


We have applied the SVM and the neural network to the wine and USPS datasets, along with different pre-processing techniques.

The computation time required on a large dataset like USPS was quite long, so we had to search over fewer parameter values and keep the number of cross-validation folds low as well. The use of n_jobs, as described in the Appendix of the assignment, may also have helped to reduce the computation time.
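
The effect of the grid size and the number of folds on the workload can be estimated by counting model fits. The helper n_fits is hypothetical and written only for this illustration; it counts one fit per parameter combination and fold, and leaves out the final refit on the whole training set.

```python
from itertools import product

def n_fits(param_grid, cv):
    """Number of model fits in an exhaustive grid search, excluding the final refit."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates * cv

# grids copied from the cells above
wine_grid = {"svc__C": [0.1, 1, 10, 100, 1000], "svc__gamma": [0.01, 0.1, 1, 10, 100]}
usps_svc_grid = {"svc__C": [0.01, 0.1, 1], "svc__gamma": [0.01, 0.1, 1]}
usps_mlp_grid = {"mlpclassifier__alpha": [0.01, 0.1, 1], "mlpclassifier__beta_1": [0.09, 0.9]}

print("wine SVC:", n_fits(wine_grid, cv=5))
print("USPS SVC:", n_fits(usps_svc_grid, cv=3))
print("USPS MLP:", n_fits(usps_mlp_grid, cv=3))
```

The USPS searches involve far fewer fits than the wine search, but each fit works on thousands of 256-dimensional samples, which is why they still take much longer.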