### CHEM E 545 Homework 5 (30 points)

#### (1) Classification using SVM (25)

**Image Classification with SVM**

Problem Statement:
In this assignment, you are tasked with performing image classification on a dataset containing two classes: 'forest' and 'ocean'. The dataset is divided into 'train' and 'test' folders, each containing images. Your goal is to build a Support Vector Machine (SVM) model with hyperparameter optimization using cross validation. Additionally, you will analyze the accuracy and computational efficiency of the algorithm.

[Link to the Data](https://drive.google.com/drive/folders/15fjBRD7pDRiha7fDez_PX7hWfUoRZWkp?usp=sharing)


Tasks:

1. Data Preparation (10):
   - Read the images from the 'train' and 'test' folders.
   - Resize the images to a fixed size of 16x16 pixels with 3 color channels (RGB). Flatten the image to make a vector of size 768.
   - Normalize the pixel values of the images to the range [0, 1] to ensure consistent data for model training.

2. Model Training and Optimization (10):
   - Train the model and use cross validation to optimize the hyperparameters. Try both the linear and rbf kernel types. 
   - Explain your results based on your intuition

3. Model Evaluation (3):
   - Train the final SVM model (using the best parameters from above) using the entire training dataset.
   - Report both the train and test accuracy.


4. Computational Efficiency Analysis (2):
   - Measure and report the time taken for model training and testing for each SVM model.
   - Analyze and compare the computational efficiency of the linear and Gaussian kernel SVM models. Consider the training and prediction times.




In [1]:
# you can unzip the data using the code below

#!unzip test.zip
#!unzip train.zip

print('After unzipping, I commented out the two unzip lines and reran the code so that the terminal output is hidden')

After unzipping, I commented out the two unzip lines and reran the code so that the terminal output is hidden


In [4]:
#Step 1 Data Preparation
from os import listdir
from PIL import Image as PImage
import numpy as np
import pandas as pd
from sklearn import preprocessing

In [5]:
#Define a function to import the images from the folder and do some processing on them
def ImageImport(location):
    folder = listdir(location) 
    vector_array = []
    for img in folder: #for loop to iterate through each image in the folder, which is passed to the function as a parameter
        img = PImage.open(location + img).resize((16,16)).convert('RGB') #define image object and use PImage.open it, .resize() to resize it to 16x16
        # and .convert() to convert it to RGB
        vector = np.array(img).flatten() #make a 3D array called vector which is the array form of the the image, and then .flatten() to make it a 1D vector of length 768
        vector_array.append(vector) #append the vector to vector_array
    return vector_array

In [6]:
#Run the ImageImport Function 4 times, each one passed a different directory corresponding to forest/ocean train/test
f_train = ImageImport('train/forest/')
f_test = ImageImport('test/forest/')
o_train = ImageImport('train/ocean/')
o_test = ImageImport('test/ocean/')


In [7]:
min_max_scaler = preprocessing.MinMaxScaler() #From preprocessing use MinMaxScaler to scale all the data between [0,1]
f_train_sc = min_max_scaler.fit_transform(f_train) # for the train data; fit and transform, for the test data; only transform
f_test_sc = min_max_scaler.fit(f_test)
o_train_sc = min_max_scaler.fit_transform(o_train)
o_test_sc = min_max_scaler.fit(o_test)

In [8]:
#0 = ocean, 1 = forest
#Convert the o_train array into a dataframe
o_train_df = pd.DataFrame(o_train_sc) 
#and add a column called 'clf' (classification) with all the values being 0
o_train_df['clf'] = 0 
f_train_df = pd.DataFrame(f_train_sc) #do the same thing for the forest dataframe
f_train_df['clf'] = 1 #with all classifications being 1 
#concat the two dataframes atop each other
train = pd.concat([o_train_df,f_train_df], ignore_index=True) 
X_train = train.drop(['clf'], axis=1) #X_train is all of the scaled RGB vectors
Y_train = train['clf'] #Y_train is classification of what each image is 

#repeat the process for the ocean test and ocean train data
o_test_df = pd.DataFrame(o_test)
o_test_df['clf'] = 0
f_test_df = pd.DataFrame(f_test)
f_test_df['clf'] = 1
test = pd.concat([o_test_df,f_test_df], ignore_index=True)
X_test = test.drop(['clf'], axis=1)
Y_test = test['clf']


In [9]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
#First try regular linear model before cross-validation and tuning
clf_svm = LinearSVC(max_iter=10000, random_state=42).fit(X_train, Y_train)
print('Accuracy on Training Set ' + str(clf_svm.score(X_train,Y_train)))
print('Accuracy on Testing Set ' + str(clf_svm.score(X_test, Y_test)))

Accuracy on Training Set 1.0
Accuracy on Testing Set 0.645


In [29]:
#Training the model and optimizing the hyperparemeters: Linear Kernel
#parameters dictionary to find the optimal C parameter, using an np.linspace array with 10 values between 0.01 and 20 
#On first pass I tried 10 values between 0.01 and 20, and it returned C=4.45
#so I am trying again with a narrower range of values to hone in 
parameters_dict = { 'C': np.linspace(3,5,num=10)}
svc = LinearSVC(max_iter = 10000)
#initializing gridsearch object
grid_search = GridSearchCV(svc, parameters_dict, scoring='f1',
                           return_train_score=True, cv=5, verbose=1) 
#fitting the grid search to X_train and y_train
grid_search.fit(X_train,Y_train)

best_model = grid_search.best_estimator_
best_parameters = grid_search.best_params_
best_f1 = grid_search.best_score_

print('The best model was:', best_model)
print('The best parameter values were:', best_parameters)
print('The best f1-score was:', best_f1)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
The best model was: LinearSVC(C=4.111111111111111, max_iter=10000)
The best parameter values were: {'C': 4.111111111111111}
The best f1-score was: 0.9831123234108808


In [31]:
#RBF Kernel SVC
from sklearn.svm import SVC
parameters_dict_rbf = {'C': np.linspace(0.01,20,num=10),
                      'gamma': np.linspace(0.01,0.1,num=5)}
svc_rbf = SVC(kernel='rbf')
grid_search_rbf = GridSearchCV(svc_rbf, parameters_dict_rbf,
                               scoring='f1', return_train_score = True,
                               cv=5, verbose=1)
grid_search_rbf.fit(X_train,Y_train)
print('The best model was: ' + str(grid_search_rbf.best_estimator_))
print('The best parameter values were : ' + str(grid_search_rbf.best_params_))
print('The best f1-score was: ' + str(grid_search_rbf.best_score_))

Fitting 5 folds for each of 50 candidates, totalling 250 fits
The best model was: SVC(C=20.0, gamma=0.01)
The best parameter values were : {'C': 20.0, 'gamma': 0.01}
The best f1-score was: 0.9730046751168778


It appears that a larger C value (as opposed to small ones like 0.1) is favorable for this data which indicates a harder margin is preferred. Since a small gamma value was found from GridSearchCV it means each individual training example has a large influence, hence the decision boundary becomes less linear.

In [33]:
#Train the best model from above using the entire training data
#and report the train and test accuracy
#Linear Model
import time #Use the time package to measure the time taken to train the model
svm_start = time.time()
svm_linear = LinearSVC(max_iter = 10000, C = 4.11,random_state=42).fit(X_train,Y_train)
svm_stop = time.time()
print('Linear SVC Training time = {:.3f} ' .format(svm_stop - svm_start) + 'sec')
print('Linear SVC Accuracy on Training Set ' + str(svm_linear.score(X_train,Y_train)))
test_start = time.time()
print('Linear SVC Accuracy on Testing Set ' + str(svm_linear.score(X_test, Y_test)))
test_stop = time.time()
print('Linear SVC Testing time =  {:.3f} ' .format(test_stop-test_start) + 'sec')

#RBF Kernel
rbf_start = time.time()
svm = SVC(kernel='rbf', C=20, gamma=0.01).fit(X_train,Y_train)
rbf_stop = time.time()
print('RBF Training Time {:.3f} ' .format(rbf_stop - rbf_start) + 'sec')
print('RBF Accuracy on Training Set ' + str(svm.score(X_train,Y_train)))
test_start = time.time()
print('RBF Accuracy on Testing Set ' + str(svm.score(X_test, Y_test)))
test_stop = time.time()
print('RBF Testing time =  {:.3f} ' .format(test_stop-test_start) + 'sec')



Linear SVC Training time = 2.547 sec
Linear SVC Accuracy on Training Set 1.0
Linear SVC Accuracy on Testing Set 0.6475
Linear SVC Testing time =  0.013 sec
RBF Training Time 0.306 sec
RBF Accuracy on Training Set 1.0
RBF Accuracy on Testing Set 0.5
RBF Testing time =  0.094 sec


It seems that the gaussian radial basis function took less time to train than the linear support vector model, making it more computationally efficient. However, it performed worse on the testing set than the linear model did; and the linear model was still only 2.547 seconds to train. 


#### (2) Conceptual Questions about SVM (5)

Match the following classifers (given in the figure below) to the descriptions below. Support vectors are given by solid pointers.

(a) A linear support vector machine with C = 0.1.

(b) A linear support vector machine with C = 10.

(c) A  support vector machine with K(u, v) = u · v + .  $(u · v)^{2}$

(d) A  support vector machine with K(u, v) = exp ( -$\frac{1}{4}$  $∥u − v∥^{2}$ ).

(e) A  support vector machine with K(u, v) = exp ( -4  $∥u − v∥^{2}$ ).

![](SVM_examples.png)

A) linear svm C = 0.1 - Number 4

B) linear svm C = 10 - Number 3

C) A support vector machine K(u, v) = u · v + .  (𝑢·𝑣)2 - Number 5

D) svm with K(u,v) = exp(-1/4 ||u-v||^2) - Number 6

E) svm with K(u,v) = exp(-4 ||u-v||^2) - Number 1 
