In [1]:
# 27/02/2023 <---- Started working on
# Author: Pushpraj Katiyar
# email: pk825@snu.edu.in <--- for any query, reach out to this email
# Roll no: 2220120001

# Import the packages used throughout the notebook

# The dataset is provided as a ZIP archive; zipfile is used to extract it
import zipfile
# pandas is used to read the extracted CSV file
import pandas as pd

# Import the required scikit-learn utilities
# Documentation can be found at https://scikit-learn.org/
from sklearn.model_selection import train_test_split # <----- splits arrays into random train and test subsets
from sklearn.preprocessing import StandardScaler     # <----- standardizes features by removing the mean and scaling to unit variance
from sklearn.metrics import accuracy_score

# Import the required classifiers from scikit-learn
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import time

# Suppress deprecation warnings triggered by the installed Python version; not essential to the analysis
import warnings
warnings.filterwarnings('ignore')

# Extract the CSV file from the ZIP file
with zipfile.ZipFile("MNIST_Dataset.zip", "r") as zip_ref:
    zip_ref.extractall("MNIST_Dataset")

In [2]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv("MNIST_Dataset/mnist.csv")

# Split features and target
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

# Standardize the features (note: the scaler is fit on all rows, including those later used for testing)
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
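
The cell above fits `StandardScaler` on the full dataset before splitting, so the test rows contribute to the mean and variance used for scaling. A minimal sketch of the alternative, fitting the scaler on the training split only, is given below. It uses small synthetic data as a stand-in for `mnist.csv` (an assumption for illustration), and the names `X_demo`, `y_demo` and `scaler_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pixel matrix (assumption: not the real MNIST data)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(200, 16)).astype(float)
y_demo = rng.integers(0, 10, size=200)

# Split first, so the test rows never influence the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(X_te_scaled.mean(axis=0).round(3))  # close to 0, but not exactly 0
```

Whether this changes the reported accuracies noticeably is not something the notebook measures; the sketch only shows the order of operations.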

### Train the SVM classifier (Linearly Non-Separable) using the training data set and predict the labels for the testing data. Find the accuracy score for different C (Regularization or Penalty Factor) values: 0.1, 1 and 10. Observe the changes in performance and comment on computational time.

In [3]:
# Train an SVM classifier with a linear kernel for each C; the timer covers both fit and predict
for C in [0.1, 1, 10]:
    print(f"C = {C}")
    start = time.time()
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_train, y_train)
    y_pred = svm.predict(X_test)
    end = time.time()
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy = {accuracy:.4f}")
    print(f"Time taken = {end-start:.2f} seconds")
    print("Predicted labels for testing data",y_pred)
    

C = 0.1
Accuracy = 0.9218
Time taken = 87.41 seconds
Predicted labels for testing data [3 6 9 ... 5 8 6]
C = 1
Accuracy = 0.9113
Time taken = 88.73 seconds
Predicted labels for testing data [3 6 9 ... 5 8 6]
C = 10
Accuracy = 0.9077
Time taken = 92.32 seconds
Predicted labels for testing data [3 6 9 ... 5 8 6]


In [4]:
# For the MNIST dataset, the smallest value of C (0.1) gave the best test accuracy (0.9218 vs 0.9113 and 0.9077),
# which is consistent with a wider margin generalizing better here. Run time also rose slightly with C (87.41 s to 92.32 s).
# Larger values of C penalize margin violations more heavily and can lead to overfitting.
# The appropriate value of C depends on the problem at hand and varies from problem to problem.
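
The loop in cell In [3] measures `fit` and `predict` as a single interval, so it cannot say which of the two dominates the run time. The sketch below times them separately and also reports the number of support vectors, which is one quantity that often drives prediction cost. It runs on the small 8x8 digits dataset bundled with scikit-learn as a stand-in for MNIST (an assumption, chosen so the example finishes quickly); the `results` list and its keys are hypothetical names.

```python
import time

from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small bundled digits dataset (assumption: stand-in for the MNIST CSV)
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.5, random_state=0)

# Fit the scaler on the training split and apply it to both splits
scaler_d = StandardScaler().fit(X_tr)
X_tr, X_te = scaler_d.transform(X_tr), scaler_d.transform(X_te)

results = []
for C in [0.1, 1, 10]:
    clf = SVC(kernel="linear", C=C)

    t0 = time.perf_counter()
    clf.fit(X_tr, y_tr)
    fit_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    pred = clf.predict(X_te)
    predict_s = time.perf_counter() - t0

    results.append({
        "C": C,
        "accuracy": accuracy_score(y_te, pred),
        "fit_s": fit_s,
        "predict_s": predict_s,
        "n_sv": int(clf.n_support_.sum()),  # total support vectors over all classes
    })
    print(results[-1])
```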

###  Use Gaussian Kernel (RBF) and predict the labels. With different C and gamma combinations (0.1, 1) and (1, 0.1), observe the effect on the classifier performance.

In [5]:
# Train SVM classifier with RBF kernel
for C, gamma in [(1, 0.1), (0.1, 1)]:
    print(f"C = {C}, gamma = {gamma}")
    start = time.time()
    svm = SVC(kernel='rbf', C=C, gamma=gamma)
    svm.fit(X_train, y_train)
    y_pred = svm.predict(X_test)
    end = time.time()
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy = {accuracy:.4f}")
    print(f"Time taken = {end-start:.2f} seconds")
    print("Predicted labels for testing data",y_pred)

C = 1, gamma = 0.1
Accuracy = 0.1757
Time taken = 883.23 seconds
Predicted labels for testing data [7 7 7 ... 7 7 7]
C = 0.1, gamma = 1
Accuracy = 0.1121
Time taken = 1019.88 seconds
Predicted labels for testing data [1 1 1 ... 1 1 1]


In [6]:
# Train an SVM with Polynomial Kernel for two different values of degree
degrees = [2, 4]
for degree in degrees:
    # Create an SVM with Polynomial Kernel with the given degree
    svm_poly = SVC(kernel='poly', degree=degree, random_state=42)
    start = time.time()
    # Train the classifier on the training data
    svm_poly.fit(X_train, y_train)

    # Predict the labels for the testing data
    y_pred = svm_poly.predict(X_test)

    # Calculate the accuracy score and print the results
    end = time.time()
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Time taken = {end-start:.2f} seconds")
    print(f"Degree = {degree}")
    print(f"Accuracy = {accuracy:.4f}")
    print("Predicted labels for testing data",y_pred)

Time taken = 190.19 seconds
Degree = 2
Accuracy = 0.9598
Predicted labels for testing data [3 6 9 ... 5 8 6]
Time taken = 444.24 seconds
Degree = 4
Accuracy = 0.8203
Predicted labels for testing data [3 6 9 ... 5 8 8]


Here is a comparison of the performance of the three SVM classifiers we used (linear SVM, SVM with RBF kernel, and SVM with polynomial kernel):

Linear SVM (best run, C = 0.1):
Accuracy: 92.18%
Time (fit + predict): 87.41 seconds

SVM with RBF kernel (best run, C = 1, gamma = 0.1):
Accuracy: 17.57%
Time (fit + predict): 883.23 seconds

SVM with polynomial kernel (best run, degree = 2):
Accuracy: 95.98%
Time (fit + predict): 190.19 seconds

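The polynomial kernel can capture nonlinear patterns in the data, but higher degrees may overfit and take longer to train: going from degree 2 to degree 4 lowered the accuracy from 95.98% to 82.03% and raised the run time from 190.19 to 444.24 seconds.
Based on these results, the SVM with a degree-2 polynomial kernel achieved the highest accuracy among the SVMs on this dataset. It took roughly twice as long as the linear SVM, while the RBF kernel with the tested C and gamma values was both the slowest and the least accurate. Therefore, the choice of the best classifier depends on the specific requirements of the problem, such as the trade-off between accuracy and training time.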

We can also compare the performance of the SVM classifiers with a Random Forest classifier, which is another popular algorithm for classification tasks:

In [7]:
# Create a Random Forest classifier with the given number of trees
rf = RandomForestClassifier(n_estimators=97, random_state=42)

# Train the classifier on the training data
rf.fit(X_train, y_train)

# Predict the labels for the testing data
y_pred = rf.predict(X_test)

# Calculate the accuracy score and print the results
accuracy = accuracy_score(y_test, y_pred)
print("Predicted labels for testing data",y_pred)
print(f"Accuracy = {accuracy:.4f}")

Predicted labels for testing data [3 6 9 ... 5 8 6]
Accuracy = 0.9556


##### Detailed Comments:

Here are some detailed comments on the observed results of the different classifiers and on how changes to the parameters affect each classifier's performance:

Linear SVM:

This classifier is a simple and fast algorithm that tries to find the best separating hyperplane between the classes.
The performance of the classifier is largely determined by the choice of the regularization parameter C. Higher values of C result in a tighter margin and may lead to overfitting, while lower values of C result in a wider margin and may lead to underfitting.
In general, linear SVMs work well when the classes are linearly separable or when there is a clear margin between them. However, they may not capture complex nonlinear patterns in the data.
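
The relationship between C and the margin described above can be checked on a toy problem. For a linear SVM with weight vector w, the margin width is 2 / ||w||, so a smaller C should give a wider margin. The sketch below uses two overlapping synthetic 2D classes (an assumption for illustration, not the MNIST data); `X_toy`, `y_toy` and `margins` are hypothetical names.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs in 2D (synthetic data, assumption for illustration)
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, scale=1.5, size=(100, 2))
X_neg = rng.normal(loc=-1.0, scale=1.5, size=(100, 2))
X_toy = np.vstack([X_pos, X_neg])
y_toy = np.array([1] * 100 + [0] * 100)

margins = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X_toy, y_toy)
    w = clf.coef_[0]                        # weight vector of the separating hyperplane
    margins[C] = 2.0 / np.linalg.norm(w)    # margin width = 2 / ||w||
    print(f"C = {C}: margin width = {margins[C]:.3f}")
```

On data like this the smallest C should report the widest margin; how much the margin changes depends on how strongly the classes overlap.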

SVM with RBF kernel:

This classifier is a more complex and slower algorithm that uses a nonlinear kernel function to map the data to a higher-dimensional feature space.
The performance of the classifier is largely determined by two parameters: the regularization parameter C and the kernel parameter gamma. Higher values of C result in a tighter margin and may lead to overfitting, while higher values of gamma result in a more complex decision boundary and may lead to overfitting as well.
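In general, SVMs with RBF kernel work well when the classes have complex nonlinear patterns that cannot be captured by a linear classifier. However, they are sensitive to the choice of C and gamma and usually require tuning to achieve good performance. In this notebook both tested combinations collapsed to predicting a single class (accuracy 17.57% and 11.21%), which suggests that gamma values of 0.1 and 1 are too large for 784 standardized features.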
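
One way to see why gamma matters so much with many standardized features is to evaluate the RBF kernel, K(a, b) = exp(-gamma * ||a - b||^2), directly. The sketch below does this for two synthetic 784-dimensional points drawn from a standard normal distribution, which is only a rough stand-in for the scaled pixel matrix (an assumption). It compares the two gamma values used in the notebook with the value that scikit-learn's `gamma='scale'` setting computes, `1 / (n_features * X.var())`. The helper `rbf_kernel_value` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 784  # 28 x 28 pixels, as in MNIST

# Synthetic standardized features (assumption: stand-in for the scaled pixels)
X_syn = rng.standard_normal(size=(50, n_features))

def rbf_kernel_value(a, b, gamma):
    """Return K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Value used by SVC(gamma='scale'): 1 / (n_features * X.var())
gamma_scale = 1.0 / (n_features * X_syn.var())

kernel_values = {}
for label, gamma in [("gamma=1", 1.0), ("gamma=0.1", 0.1), ("gamma='scale'", gamma_scale)]:
    kernel_values[label] = rbf_kernel_value(X_syn[0], X_syn[1], gamma)
    print(f"{label}: K = {kernel_values[label]:.3e}")
```

When the kernel value between distinct points is numerically zero, every training point looks unrelated to every other point, and the model has little basis for generalizing to test points. This is a plausible explanation for the single-class predictions above, not something the notebook verifies.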

SVM with polynomial kernel:

This classifier is similar to the SVM with RBF kernel, but uses a polynomial kernel function instead of a Gaussian (RBF) kernel function.
The performance of the classifier is largely determined by the regularization parameter C, as with the RBF kernel, and by the polynomial degree, which takes over from gamma as the main control on the complexity of the decision boundary. Higher values of C result in a tighter margin and may lead to overfitting, and higher values of degree result in a more complex decision boundary and may lead to overfitting as well. In this notebook, degree 4 scored 82.03% against 95.98% for degree 2.
In general, SVMs with polynomial kernel work well when the classes have complex nonlinear patterns that can be captured by a polynomial function. However, they may be even more sensitive to the choice of C and degree than the SVM with RBF kernel and may require more careful tuning to achieve optimal performance.
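
For reference, the polynomial kernel in scikit-learn has the form K(a, b) = (gamma * <a, b> + coef0)^degree. The sketch below computes it by hand for the two degrees used in the notebook and compares the result with `sklearn.metrics.pairwise.polynomial_kernel`. The matrix `A` is random illustrative data, and `gamma` and `coef0` are set explicitly here (`coef0=0.0` matches the `SVC` default; the gamma value is an arbitrary choice for the demo).

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(0)
A = rng.standard_normal(size=(4, 6))  # 4 illustrative samples with 6 features

gamma = 1.0 / A.shape[1]  # arbitrary demo value
coef0 = 0.0               # SVC's default for the polynomial kernel

matches = {}
for degree in (2, 4):
    # Manual evaluation of (gamma * <a, b> + coef0) ** degree for all pairs
    manual = (gamma * (A @ A.T) + coef0) ** degree
    library = polynomial_kernel(A, A, degree=degree, gamma=gamma, coef0=coef0)
    matches[degree] = bool(np.allclose(manual, library))
    print(f"degree = {degree}: manual matches library = {matches[degree]}")
```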

Random Forest:

This classifier is an ensemble algorithm that uses multiple decision trees to make predictions.
The performance of the classifier is largely determined by the number of trees in the forest (n_estimators) and the maximum depth of the trees (max_depth). Deeper trees produce more complex decision boundaries and may overfit. Adding more trees mainly reduces the variance of the averaged prediction and increases training time; it does not usually cause overfitting by itself.
In general, Random Forests work well when the classes have complex patterns that can be captured by decision trees. They are also relatively fast to train and can handle high-dimensional datasets well. However, they may not perform as well as SVMs on datasets with very complex patterns or low signal-to-noise ratio.
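
The effect of `n_estimators` can be checked empirically by comparing training and test accuracy for a few forest sizes. The sketch below does this on the small digits dataset bundled with scikit-learn, used here as a quick stand-in for MNIST (an assumption); the tree counts and the name `rf_scores` are illustrative choices, not values from the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small bundled digits dataset (assumption: stand-in for the MNIST CSV)
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.5, random_state=0)

rf_scores = {}
for n_trees in [5, 25, 100]:
    rf_demo = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    rf_demo.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this forest size
    rf_scores[n_trees] = (rf_demo.score(X_tr, y_tr), rf_demo.score(X_te, y_te))
    print(f"n_estimators = {n_trees}: train = {rf_scores[n_trees][0]:.4f}, "
          f"test = {rf_scores[n_trees][1]:.4f}")
```

A widening gap between training and test accuracy would point to overfitting; a test accuracy that levels off as trees are added would point to diminishing returns instead.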
In summary, the choice of the best classifier depends on the specific requirements of the problem, such as the trade-off between accuracy and training time, and the complexity of the patterns in the data. The performance of each classifier can be affected by various parameters, such as the regularization parameter, kernel parameter, and number of trees, which may require tuning to achieve optimal performance.
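
The tuning mentioned in the summary is commonly done with a cross-validated grid search. The sketch below is one possible setup, not the method used in this notebook: it wraps the scaler and an RBF SVM in a `Pipeline` so that scaling is refit inside each fold, and searches a small grid with `GridSearchCV`. It runs on the bundled digits dataset as a stand-in for MNIST (an assumption), and the grid values and step names (`scale`, `svm`) are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small bundled digits dataset (assumption: stand-in for the MNIST CSV)
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.5, random_state=0)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])

# Illustrative grid; 'scale' is scikit-learn's data-dependent default for gamma
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.1, 1]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)  # accuracy of the refit best model
print("Best parameters:", search.best_params_)
print(f"Test accuracy = {test_accuracy:.4f}")
```

On the full MNIST data the same search would take far longer, given that a single RBF fit took over 800 seconds above, so a subsample or a coarser grid may be more practical.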


### >>>>>>>>>>> COMPLETE <<<<<<<<<<<<