# Assignment 4 - Support Vector Machine (SVM) - Multiclass Classification

In [1]:
import pandas as pd 
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings("ignore")

In [2]:
train = pd.read_csv('/Users/parthbansal/Downloads/train_data.csv')

In [3]:
test = pd.read_csv('/Users/parthbansal/Downloads/test_data.csv')

In [4]:
train.head()

Unnamed: 0,class,BrdIndx,Area,Round,Bright,Compact,ShpIndx,Mean_G,Mean_R,Mean_NIR,...,SD_NIR_140,LW_140,GLCM1_140,Rect_140,GLCM2_140,Dens_140,Assym_140,NDVI_140,BordLngth_140,GLCM3_140
0,concrete,1.32,131,0.81,222.74,1.66,2.18,192.94,235.11,240.15,...,31.15,5.04,0.8,0.58,8.56,0.82,0.98,-0.1,1512,1287.52
1,shadow,1.59,864,0.94,47.56,1.41,1.87,36.82,48.78,57.09,...,12.01,3.7,0.52,0.96,7.01,1.69,0.86,-0.14,196,2659.74
2,shadow,1.41,409,1.0,51.38,1.37,1.53,41.72,51.96,60.48,...,18.75,3.09,0.9,0.63,8.32,1.38,0.84,0.1,1198,720.38
3,tree,2.58,187,1.91,70.08,3.41,3.11,93.13,55.2,61.92,...,27.67,6.33,0.89,0.7,8.56,1.1,0.96,0.2,524,891.36
4,asphalt,2.6,116,2.05,89.57,3.06,3.02,73.17,94.89,100.64,...,32.05,1.01,0.83,0.75,8.62,2.08,0.08,-0.1,496,1194.76


## Data Processing 

In [5]:
train.shape

(507, 148)

In [6]:
test.shape

(168, 148)

In [7]:
train = train.dropna()

In [8]:
test = test.dropna()

In [9]:
X_train = train.drop('class', axis=1)
y_train = train[['class']]
X_test = test.drop('class', axis=1)
y_test = test[['class']]

In [10]:
scaler = StandardScaler()
scaler.fit(X_train)

StandardScaler()

In [11]:
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Random Forest Classifier - Base Model

In [12]:
rf = RandomForestClassifier(random_state = 42)
rf.fit(X_train_scaled, y_train)

RandomForestClassifier(random_state=42)

In [13]:
y_pred_test = rf.predict(X_test_scaled)

In [14]:
# Confusion matrix and classification report for the test data 
print(confusion_matrix(y_test, y_pred_test), '\n')
print(classification_report(y_test, y_pred_test))

[[14  0  0  0  0  0  0  0  0]
 [ 1 22  0  2  0  0  0  0  0]
 [ 1  1 13  0  0  0  0  0  0]
 [ 0  5  0 18  0  0  0  0  0]
 [ 0  0  0  0 25  0  0  0  4]
 [ 1  0  1  0  0 13  0  0  0]
 [ 3  0  0  0  0  0 13  0  0]
 [ 0  1  0  5  3  0  0  5  0]
 [ 0  0  0  1  1  0  0  0 15]] 

              precision    recall  f1-score   support

    asphalt        0.70      1.00      0.82        14
   building        0.76      0.88      0.81        25
        car        0.93      0.87      0.90        15
   concrete        0.69      0.78      0.73        23
      grass        0.86      0.86      0.86        29
       pool        1.00      0.87      0.93        15
     shadow        1.00      0.81      0.90        16
       soil        1.00      0.36      0.53        14
       tree        0.79      0.88      0.83        17

    accuracy                           0.82       168
   macro avg       0.86      0.81      0.81       168
weighted avg       0.85      0.82      0.82       168



In [15]:
y_pred_train = rf.predict(X_train_scaled)

In [16]:
# Confusion matrix and classification report for the train data
print(confusion_matrix(y_train, y_pred_train), '\n')
print(classification_report(y_train, y_pred_train))

[[45  0  0  0  0  0  0  0  0]
 [ 0 97  0  0  0  0  0  0  0]
 [ 0  0 21  0  0  0  0  0  0]
 [ 0  0  0 93  0  0  0  0  0]
 [ 0  0  0  0 83  0  0  0  0]
 [ 0  0  0  0  0 14  0  0  0]
 [ 0  0  0  0  0  0 45  0  0]
 [ 0  0  0  0  0  0  0 20  0]
 [ 0  0  0  0  0  0  0  0 89]] 

              precision    recall  f1-score   support

    asphalt        1.00      1.00      1.00        45
   building        1.00      1.00      1.00        97
        car        1.00      1.00      1.00        21
   concrete        1.00      1.00      1.00        93
      grass        1.00      1.00      1.00        83
       pool        1.00      1.00      1.00        14
     shadow        1.00      1.00      1.00        45
       soil        1.00      1.00      1.00        20
       tree        1.00      1.00      1.00        89

    accuracy                           1.00       507
   macro avg       1.00      1.00      1.00       507
weighted avg       1.00      1.00      1.00       507



**There are signs of overfitting as the model performs better on the training data then the test data. The latter model has higher accuracy, precision and recall.** 

In [17]:
imp = rf.feature_importances_
sorted_imp = imp.argsort()[::-1]

for idx in sorted_imp[:5]:
    print(f"{X_train.columns[idx]}: {imp[idx]}")

NDVI: 0.042750325674957734
Mean_NIR: 0.02936951152523256
Mean_R_40: 0.02897001358024602
NDVI_60: 0.02778714825107778
Mean_NIR_40: 0.0259840578164955


## LinearSVM Classifier - Base Model

In [18]:
svc_model = LinearSVC()
svc_model.fit(X_train_scaled, y_train)

LinearSVC()

In [19]:
y_pred_test = svc_model.predict(X_test_scaled)

In [20]:
# Confusion matrix and classification report for the test data
print(confusion_matrix(y_test, y_pred_test), '\n')
print(classification_report(y_test, y_pred_test))

[[13  0  0  0  0  0  1  0  0]
 [ 0 21  1  1  1  0  0  1  0]
 [ 0  2 12  0  0  0  0  0  1]
 [ 1  6  0 15  0  0  0  0  1]
 [ 0  0  0  1 26  0  0  0  2]
 [ 1  0  1  0  0 13  0  0  0]
 [ 2  0  0  0  0  0 14  0  0]
 [ 0  4  0  1  3  0  0  6  0]
 [ 0  0  0  1  6  0  0  0 10]] 

              precision    recall  f1-score   support

    asphalt        0.76      0.93      0.84        14
   building        0.64      0.84      0.72        25
        car        0.86      0.80      0.83        15
   concrete        0.79      0.65      0.71        23
      grass        0.72      0.90      0.80        29
       pool        1.00      0.87      0.93        15
     shadow        0.93      0.88      0.90        16
       soil        0.86      0.43      0.57        14
       tree        0.71      0.59      0.65        17

    accuracy                           0.77       168
   macro avg       0.81      0.76      0.77       168
weighted avg       0.79      0.77      0.77       168



In [21]:
y_pred_train = svc_model.predict(X_train_scaled)

In [22]:
# Confusion matrix and classification report for the train data
print(confusion_matrix(y_train, y_pred_train), '\n')
print(classification_report(y_train, y_pred_train))

[[45  0  0  0  0  0  0  0  0]
 [ 0 97  0  0  0  0  0  0  0]
 [ 0  0 21  0  0  0  0  0  0]
 [ 0  0  0 93  0  0  0  0  0]
 [ 0  1  0  0 80  0  0  0  2]
 [ 0  0  0  0  0 14  0  0  0]
 [ 0  0  0  0  0  0 45  0  0]
 [ 0  0  0  0  0  0  0 20  0]
 [ 0  0  0  0  0  0  0  0 89]] 

              precision    recall  f1-score   support

    asphalt        1.00      1.00      1.00        45
   building        0.99      1.00      0.99        97
        car        1.00      1.00      1.00        21
   concrete        1.00      1.00      1.00        93
      grass        1.00      0.96      0.98        83
       pool        1.00      1.00      1.00        14
     shadow        1.00      1.00      1.00        45
       soil        1.00      1.00      1.00        20
       tree        0.98      1.00      0.99        89

    accuracy                           0.99       507
   macro avg       1.00      1.00      1.00       507
weighted avg       0.99      0.99      0.99       507



**There are signs of overfitting as the model performs better on the training data then the test data. The latter model has higher accuracy, precision and recall.** 

## Support Vector Machine Classifier + Linear Kernel + Grid Search

In [23]:
svc_model = SVC(kernel='linear')

In [24]:
param_grid = {'C': np.arange(0.01, 10.2, 0.2)}

In [25]:
gs = GridSearchCV(svc_model, param_grid, cv=5)

In [26]:
gs.fit(X_train_scaled, y_train)

GridSearchCV(cv=5, estimator=SVC(kernel='linear'),
             param_grid={'C': array([1.000e-02, 2.100e-01, 4.100e-01, 6.100e-01, 8.100e-01, 1.010e+00,
       1.210e+00, 1.410e+00, 1.610e+00, 1.810e+00, 2.010e+00, 2.210e+00,
       2.410e+00, 2.610e+00, 2.810e+00, 3.010e+00, 3.210e+00, 3.410e+00,
       3.610e+00, 3.810e+00, 4.010e+00, 4.210e+00, 4.410e+00, 4.610e+00,
       4.810e+00, 5.010e+00, 5.210e+00, 5.410e+00, 5.610e+00, 5.810e+00,
       6.010e+00, 6.210e+00, 6.410e+00, 6.610e+00, 6.810e+00, 7.010e+00,
       7.210e+00, 7.410e+00, 7.610e+00, 7.810e+00, 8.010e+00, 8.210e+00,
       8.410e+00, 8.610e+00, 8.810e+00, 9.010e+00, 9.210e+00, 9.410e+00,
       9.610e+00, 9.810e+00, 1.001e+01])})

In [27]:
print('Best Parameters:', gs.best_params_)
print('Best Model:', gs.best_estimator_)

Best Parameters: {'C': 0.01}
Best Model: SVC(C=0.01, kernel='linear')


In [28]:
y_pred_test = gs.best_estimator_.predict(X_test_scaled)

In [29]:
# Confusion matrix and classification report for the test data
print(confusion_matrix(y_test, y_pred_test), '\n')
print(classification_report(y_test, y_pred_test))

[[13  0  0  0  0  0  1  0  0]
 [ 0 22  0  2  1  0  0  0  0]
 [ 0  1 14  0  0  0  0  0  0]
 [ 0  5  0 17  0  0  0  1  0]
 [ 0  0  0  1 25  0  0  0  3]
 [ 0  0  0  0  0 14  1  0  0]
 [ 1  0  0  0  0  0 15  0  0]
 [ 0  3  0  5  2  0  0  4  0]
 [ 0  0  0  1  2  0  0  0 14]] 

              precision    recall  f1-score   support

    asphalt        0.93      0.93      0.93        14
   building        0.71      0.88      0.79        25
        car        1.00      0.93      0.97        15
   concrete        0.65      0.74      0.69        23
      grass        0.83      0.86      0.85        29
       pool        1.00      0.93      0.97        15
     shadow        0.88      0.94      0.91        16
       soil        0.80      0.29      0.42        14
       tree        0.82      0.82      0.82        17

    accuracy                           0.82       168
   macro avg       0.85      0.81      0.82       168
weighted avg       0.83      0.82      0.81       168



In [30]:
y_pred_train = gs.best_estimator_.predict(X_train_scaled)

# Confusion matrix and classification report for the train data
print(confusion_matrix(y_train, y_pred_train), '\n')
print(classification_report(y_train, y_pred_train))

[[40  0  0  0  0  0  5  0  0]
 [ 2 87  0  7  0  0  1  0  0]
 [ 0  1 19  1  0  0  0  0  0]
 [ 0  9  0 83  1  0  0  0  0]
 [ 0  1  0  0 70  0  0  0 12]
 [ 0  1  0  0  1 12  0  0  0]
 [ 1  0  0  0  0  0 43  0  1]
 [ 0  3  0  4  2  0  0 11  0]
 [ 0  0  0  0  3  0  1  0 85]] 

              precision    recall  f1-score   support

    asphalt        0.93      0.89      0.91        45
   building        0.85      0.90      0.87        97
        car        1.00      0.90      0.95        21
   concrete        0.87      0.89      0.88        93
      grass        0.91      0.84      0.88        83
       pool        1.00      0.86      0.92        14
     shadow        0.86      0.96      0.91        45
       soil        1.00      0.55      0.71        20
       tree        0.87      0.96      0.91        89

    accuracy                           0.89       507
   macro avg       0.92      0.86      0.88       507
weighted avg       0.89      0.89      0.89       507



**There are no signs of overfitting as the model performs similarly on the training data and the test data. The  models have similar accuracy, precision and recall.** 

## Support Vector Machine Classifier + Polynomial Kernel + Grid Search

In [31]:
svc = SVC(kernel='poly')

In [32]:
param_grid = {'C': np.arange(0.01, 10.2, 0.2), 'degree': [2, 3, 4, 5, 6]}
gs = GridSearchCV(svc, param_grid, cv=5)

In [33]:
gs.fit(X_train_scaled, y_train)

GridSearchCV(cv=5, estimator=SVC(kernel='poly'),
             param_grid={'C': array([1.000e-02, 2.100e-01, 4.100e-01, 6.100e-01, 8.100e-01, 1.010e+00,
       1.210e+00, 1.410e+00, 1.610e+00, 1.810e+00, 2.010e+00, 2.210e+00,
       2.410e+00, 2.610e+00, 2.810e+00, 3.010e+00, 3.210e+00, 3.410e+00,
       3.610e+00, 3.810e+00, 4.010e+00, 4.210e+00, 4.410e+00, 4.610e+00,
       4.810e+00, 5.010e+00, 5.210e+00, 5.410e+00, 5.610e+00, 5.810e+00,
       6.010e+00, 6.210e+00, 6.410e+00, 6.610e+00, 6.810e+00, 7.010e+00,
       7.210e+00, 7.410e+00, 7.610e+00, 7.810e+00, 8.010e+00, 8.210e+00,
       8.410e+00, 8.610e+00, 8.810e+00, 9.010e+00, 9.210e+00, 9.410e+00,
       9.610e+00, 9.810e+00, 1.001e+01]),
                         'degree': [2, 3, 4, 5, 6]})

In [34]:
print('Best Parameters:', gs.best_params_)
print('Best Model:', gs.best_estimator_)

Best Parameters: {'C': 3.81, 'degree': 3}
Best Model: SVC(C=3.81, kernel='poly')


In [35]:
y_pred_test = gs.best_estimator_.predict(X_test_scaled)

In [36]:
# Confusion matrix and classification report for the test data
print(confusion_matrix(y_test, y_pred_test), '\n')
print(classification_report(y_test, y_pred_test))

[[13  0  0  0  0  0  1  0  0]
 [ 0 22  0  2  1  0  0  0  0]
 [ 0  2 11  0  0  1  0  1  0]
 [ 0  5  0 17  1  0  0  0  0]
 [ 0  0  0  0 26  0  0  1  2]
 [ 0  0  0  0  0 14  1  0  0]
 [ 1  0  0  0  0  0 14  0  1]
 [ 0  2  0  5  7  0  0  0  0]
 [ 0  0  0  1  3  0  0  0 13]] 

              precision    recall  f1-score   support

    asphalt        0.93      0.93      0.93        14
   building        0.71      0.88      0.79        25
        car        1.00      0.73      0.85        15
   concrete        0.68      0.74      0.71        23
      grass        0.68      0.90      0.78        29
       pool        0.93      0.93      0.93        15
     shadow        0.88      0.88      0.88        16
       soil        0.00      0.00      0.00        14
       tree        0.81      0.76      0.79        17

    accuracy                           0.77       168
   macro avg       0.74      0.75      0.74       168
weighted avg       0.73      0.77      0.75       168



In [37]:
y_pred_train = gs.best_estimator_.predict(X_train_scaled)

# Confusion matrix and classification report for the train data
print(confusion_matrix(y_train, y_pred_train), '\n')
print(classification_report(y_train, y_pred_train))

[[44  0  0  0  1  0  0  0  0]
 [ 0 95  0  1  1  0  0  0  0]
 [ 0  0 20  0  1  0  0  0  0]
 [ 0  1  0 91  1  0  0  0  0]
 [ 0  1  0  0 81  0  0  0  1]
 [ 0  0  0  0  1 13  0  0  0]
 [ 0  0  0  0  0  0 45  0  0]
 [ 0  0  0  0 11  0  0  9  0]
 [ 0  0  0  0  5  0  0  0 84]] 

              precision    recall  f1-score   support

    asphalt        1.00      0.98      0.99        45
   building        0.98      0.98      0.98        97
        car        1.00      0.95      0.98        21
   concrete        0.99      0.98      0.98        93
      grass        0.79      0.98      0.88        83
       pool        1.00      0.93      0.96        14
     shadow        1.00      1.00      1.00        45
       soil        1.00      0.45      0.62        20
       tree        0.99      0.94      0.97        89

    accuracy                           0.95       507
   macro avg       0.97      0.91      0.93       507
weighted avg       0.96      0.95      0.95       507



**There are signs of overfitting as the model performs better on the training data then the test data. The latter model has higher accuracy, precision and recall.** 

## Support Vector Machine Classifier + RBF Kernel + Grid Search

In [39]:
param_grid = {'C': np.arange(0.01, 10.01, 0.2), 'gamma': [0.01, 0.1, 1, 10, 100]}
svm = SVC(kernel='rbf')
svm_gs = GridSearchCV(svm, param_grid, cv=5)

In [40]:
svm_gs.fit(X_train_scaled, y_train)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': array([0.01, 0.21, 0.41, 0.61, 0.81, 1.01, 1.21, 1.41, 1.61, 1.81, 2.01,
       2.21, 2.41, 2.61, 2.81, 3.01, 3.21, 3.41, 3.61, 3.81, 4.01, 4.21,
       4.41, 4.61, 4.81, 5.01, 5.21, 5.41, 5.61, 5.81, 6.01, 6.21, 6.41,
       6.61, 6.81, 7.01, 7.21, 7.41, 7.61, 7.81, 8.01, 8.21, 8.41, 8.61,
       8.81, 9.01, 9.21, 9.41, 9.61, 9.81]),
                         'gamma': [0.01, 0.1, 1, 10, 100]})

In [42]:
print('Best Parameters:', svm_gs.best_params_)
print('Best Model:', svm_gs.best_estimator_)

Best Parameters: {'C': 2.81, 'gamma': 0.01}
Best Model: SVC(C=2.81, gamma=0.01)


In [43]:
y_pred_test = svm_gs.best_estimator_.predict(X_test_scaled)

In [44]:
# Confusion matrix and classification report for the test data
print(confusion_matrix(y_test, y_pred_test), '\n')
print(classification_report(y_test, y_pred_test))

[[13  0  0  0  0  0  1  0  0]
 [ 0 21  0  3  1  0  0  0  0]
 [ 0  1 14  0  0  0  0  0  0]
 [ 0  4  0 19  0  0  0  0  0]
 [ 0  1  0  0 26  0  0  0  2]
 [ 0  0  0  0  0 14  1  0  0]
 [ 1  0  0  0  0  0 15  0  0]
 [ 0  2  0  4  3  0  0  5  0]
 [ 0  0  0  1  1  0  0  0 15]] 

              precision    recall  f1-score   support

    asphalt        0.93      0.93      0.93        14
   building        0.72      0.84      0.78        25
        car        1.00      0.93      0.97        15
   concrete        0.70      0.83      0.76        23
      grass        0.84      0.90      0.87        29
       pool        1.00      0.93      0.97        15
     shadow        0.88      0.94      0.91        16
       soil        1.00      0.36      0.53        14
       tree        0.88      0.88      0.88        17

    accuracy                           0.85       168
   macro avg       0.88      0.84      0.84       168
weighted avg       0.86      0.85      0.84       168



In [45]:
y_pred_train = gs.best_estimator_.predict(X_train_scaled)

# Confusion matrix and classification report for the train data
print(confusion_matrix(y_train, y_pred_train), '\n')
print(classification_report(y_train, y_pred_train))

[[44  0  0  0  1  0  0  0  0]
 [ 0 95  0  1  1  0  0  0  0]
 [ 0  0 20  0  1  0  0  0  0]
 [ 0  1  0 91  1  0  0  0  0]
 [ 0  1  0  0 81  0  0  0  1]
 [ 0  0  0  0  1 13  0  0  0]
 [ 0  0  0  0  0  0 45  0  0]
 [ 0  0  0  0 11  0  0  9  0]
 [ 0  0  0  0  5  0  0  0 84]] 

              precision    recall  f1-score   support

    asphalt        1.00      0.98      0.99        45
   building        0.98      0.98      0.98        97
        car        1.00      0.95      0.98        21
   concrete        0.99      0.98      0.98        93
      grass        0.79      0.98      0.88        83
       pool        1.00      0.93      0.96        14
     shadow        1.00      1.00      1.00        45
       soil        1.00      0.45      0.62        20
       tree        0.99      0.94      0.97        89

    accuracy                           0.95       507
   macro avg       0.97      0.91      0.93       507
weighted avg       0.96      0.95      0.95       507



**There are signs of overfitting as the model performs better on the training data then the test data. The latter model has higher accuracy, precision and recall.** 

## Conceptual Questions

**a) From the models run in steps 2-6, which performs the best based on the Classification Report? Support your reasoning with evidence around your test data.**
- Support Vector Machine Classifier + RBF Kernel + Grid Search performs the best. It has the highest F1 score and accuracy score (0.85) on the test data.


**b) Compare models run for steps 4-6 where different kernels were used. What is the benefit of using a polynomial or rbf kernel over a linear kernel? What could be a downside of using a polynomial or rbf kernel?** 
-  The foremost benefit of using a polynomial or rbf kernel over a linear kernel is that they are able to model non-linear relationships between the input features and the target variable. However, this is computationally expensice and is prone to overfitting.

**c) Explain the 'C' parameter used in steps 4-6. What does a small C mean versus a large C in sklearn? Why is it important to use the 'C' parameter when fitting a model?**
- C is a regularization parameter that controls the trade-off between achieving a low training error and a low testing error. Lowercase 'c' represents a softwer margin (allows more misclassifications) and vice versa.

**d) Scaling our input data does not matter much for Random Forest, but it is a critical step for Support Vector Machines. Explain why this is such a critical step. Also, provide an example of a feature from this data set that could cause issues with our SVMs if not scaled.**
- Scaling is necessary because SVMs are sensitive to the scale of the input features. If the features are not scaled, features with larger magnitudes will dominate the optimization process. For example, 'GLCM3_14' would dominate the data if not scaled beacuse it has a range in thousands.

**e) Describe conceptually what the purpose of a kernel is for Support Vector Machines.**
- Kernel functions allow us to visualize input data in higher dimensional spaces which makes it easier to seperate data poitns and visualize boundaries.