# Model Training and Evaluation
You should build a machine learning pipeline with a complete model training and evaluation step. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Conduct data exploration, data preprocessing, and feature engineering if necessary.
- Choose a few machine learning algorithms, such as [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [gradient boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).
- Define a grid of hyperparameters for every selected model.
- Conduct [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) or [random search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) using k-fold cross-validation on the training set to find out the best model (i.e., the best algorithm and its hyperparameters).
- Train the best model on the whole training set.
- Test the trained model on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.
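
As a reference point for the cross-validation step in the list above, here is a minimal sketch of 5-fold cross-validation for a single model. It assumes scikit-learn's bundled 8x8 digits data (`load_digits`) as a stand-in for `mnist.csv`, and the choice of `n_neighbors=3` is arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): scikit-learn's bundled 8x8 digits, not the course mnist.csv.
X, y = load_digits(return_X_y=True)

# 5 folds: each fold is held out once while the model trains on the other 4.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=cv)

print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```

The mean of the per-fold scores is what `GridSearchCV` reports as `best_score_` for the winning hyperparameter combination.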

## Importing Libraries

In [68]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

## Importing the Dataset

In [6]:
df = pd.read_csv("https://raw.githubusercontent.com/m-mahdavi/teaching/refs/heads/main/datasets/mnist.csv")

In [7]:
df.head(3)

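| | id | class | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 | pixel784 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31953 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 34452 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 60897 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |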


In [4]:
df.shape

(4000, 786)

## Splitting the Data

In [9]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
print(df_train.shape)
print(df_test.shape)

(3200, 786)
(800, 786)
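
The split above is purely random, so the digit classes may end up in slightly different proportions in the two parts. A minimal sketch of a stratified split follows, using a hypothetical toy frame (`toy`) rather than the real data; `stratify` is a documented parameter of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the real data: 3 imbalanced classes.
toy = pd.DataFrame({
    "class": [0] * 60 + [1] * 30 + [2] * 10,
    "pixel1": range(100),
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["class"]
)

# With stratify, both parts keep the 60/30/10 class proportions.
print(toy_train["class"].value_counts(normalize=True).sort_index())
print(toy_test["class"].value_counts(normalize=True).sort_index())
```

In the notebook this would correspond to passing `stratify=df["class"]` to the existing call.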


## Data Exploration

In [10]:
df_train.head(5)

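| | id | class | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 | pixel784 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3994 | 13260 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 423 | 10953 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2991 | 37374 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1221 | 31597 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 506 | 69405 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |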


In [12]:
df_train.isnull().sum().value_counts()

| | count |
|---|---|
| 0 | 786 |


In [13]:
df_train.describe()

| | id | class | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 | pixel784 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3200.0 | 3200.0 | 3200.0 | 3200.0 | 3200.0 | 3200.0 | 3200.0 | 3200.0 | 3200.0 | 3200.0 | ... | 3200.0 | 3200.0 | 3200.0 | 3200.0 | 3200.0 | 3200.0 | 3200.0 | 3200.0 | 3200.0 | 3200.0 |
| mean | 34454.785312 | 4.427812 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.095937 | 0.019062 | 0.01625 | 0.001875 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| std | 20488.023669 | 2.879215 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.924578 | 1.078338 | 0.919239 | 0.106066 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| min | 17.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 16743.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 34262.5 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 52250.75 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 69998.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 125.0 | 61.0 | 52.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
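
The summary shows a standard deviation of 0.0 for `pixel1` to `pixel8` and `pixel779` to `pixel784`, meaning those border pixels are constant in the training set. A minimal sketch of how such columns could be listed follows; the `toy` frame is hypothetical and only illustrates the idea.

```python
import pandas as pd

# Hypothetical toy frame: two constant "border" pixels and one that varies.
toy = pd.DataFrame({
    "pixel1": [0, 0, 0, 0],
    "pixel2": [0, 125, 0, 61],
    "pixel3": [0, 0, 0, 0],
})

# A column with a single distinct value carries no information for any model.
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]
print(constant_cols)
print(f"{len(constant_cols)} of {toy.shape[1]} columns are constant")
```

Applied to `df_train`, the same list comprehension would show how many of the 784 pixel columns could be dropped without losing information.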


## Data Preprocessing

In [17]:
x_train = df_train.drop(["id","class"], axis=1)
y_train = df_train["class"]

x_test = df_test.drop(["id","class"], axis=1)
y_test = df_test["class"]

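# A notebook cell only displays its last expression, so print both pairs explicitly.
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(3200, 784) (3200,)
(800, 784) (800,)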

## Feature Engineering

In [None]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
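
Scaling the whole training set up front means that, during cross-validation, the scaler has already seen each validation fold. One common way to avoid this is to put the scaler inside a `Pipeline`, so it is refit on the training part of every fold. The sketch below assumes the bundled `load_digits` data as a stand-in, and the small `C` grid is for illustration only.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled 8x8 digits, not mnist.csv.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# make_pipeline names each step after its lowercased class: "standardscaler", "svc".
pipe = make_pipeline(StandardScaler(), SVC())

# Step parameters are addressed as <step>__<param>.
search = GridSearchCV(pipe, param_grid={"svc__C": [1, 10]}, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("test accuracy:", search.score(X_te, y_te))
```

With this layout the raw, unscaled features are passed to `fit`, and the pipeline applies the same fitted scaler automatically at prediction time.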


## Model Preparation

### Decision Tree

In [29]:
dc = DecisionTreeClassifier()
dc.fit(x_train, y_train)

### K-Nearest Neighbour

In [28]:
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)

### SVM

In [69]:
svm = SVC()
svm.fit(x_train, y_train)

### Gradient Boosting Classifier

In [82]:
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)

## Model Prediction

In [83]:
y1_predict = dc.predict(x_test)
y2_predict = knn.predict(x_test)
y3_predict = svm.predict(x_test)
y4_predict = gbc.predict(x_test)


In [84]:
# y1_1_predict = dc.predict(x_train)
# y2_2_predict = knn.predict(x_train)
# y3_3_predict = svm.predict(x_train)
# y4_4_predict = gbc.predict(x_train)

## Model Evaluation

In [86]:
dc_acc = accuracy_score(y_test, y1_predict)
knn_acc = accuracy_score(y_test, y2_predict)
svm_acc = accuracy_score(y_test, y3_predict)
gbc_acc = accuracy_score(y_test, y4_predict)

print("Decision Tree Accuracy:", dc_acc*100)
print("KNN Accuracy:", knn_acc*100)
print("SVM Accuracy:", svm_acc*100)
print("Gradient Boosting Accuracy:", gbc_acc*100)

# These are test-set accuracies. The training-set accuracies (see the commented-out
# cell below) were higher, which suggests the models overfit to some degree.



Decision Tree Accuracy: 73.25
KNN Accuracy: 91.5
SVM Accuracy: 94.375
Gradient Boosting Accuracy: 90.625
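
Accuracy alone hides which digits get confused with which. A minimal sketch of the other imported metrics follows, using hypothetical label lists (`y_true`, `y_pred`) for illustration; in the notebook these would be `y_test` and one of the `y*_predict` arrays.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall and F1.
print(classification_report(y_true, y_pred))
print("accuracy:", accuracy_score(y_true, y_pred))  # 6 of 8 correct = 0.75
```

Off-diagonal entries of the confusion matrix show the specific misclassifications, for example one true `0` predicted as `1` in the first row.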


In [88]:
# dc_acc_1 = accuracy_score(y_train, y1_1_predict)
# knn_acc_2 = accuracy_score(y_train, y2_2_predict)
# svm_acc_3 = accuracy_score(y_train, y3_3_predict)
# gbc_acc_4 = accuracy_score(y_train, y4_4_predict)

# print("Decision Tree Accuracy:", dc_acc_1*100)
# print("KNN Accuracy:", knn_acc_2*100)
# print("SVM Accuracy:", svm_acc_3*100)
# print("Gradient Boosting Accuracy:", gbc_acc_4*100)

# # Here we observed that the accuracy on the training data is higher, so the models are overfitted.

## GridSearchCV

### Decision Tree

In [40]:
gs_dt = {"criterion" : ["gini", "entropy", "log_loss"],
         "splitter" : ["best", "random"],
         "max_depth" : list(range(1, 10))
         }

gd = GridSearchCV(estimator=dc, param_grid=gs_dt, cv=5)
gd.fit(x_train, y_train)
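
An exhaustive grid search fits one model per hyperparameter combination per fold, so its cost grows multiplicatively with every list in the grid. The helper below (`n_fits`, a hypothetical name) is a small sketch for estimating that cost before launching a search.

```python
from math import prod

def n_fits(param_grid: dict, cv: int) -> int:
    """Number of model fits an exhaustive grid search performs (excluding the final refit)."""
    return prod(len(values) for values in param_grid.values()) * cv

# Same values as the decision-tree grid above.
dt_grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "splitter": ["best", "random"],
    "max_depth": list(range(1, 10)),
}
print(n_fits(dt_grid, cv=5))  # 3 * 2 * 9 * 5 = 270
```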

In [54]:
gd.best_params_

{'criterion': 'entropy', 'max_depth': 8, 'splitter': 'best'}

In [55]:
gd.best_score_

0.7621874999999999

### K-Nearest Neighbour

In [64]:
# Note: "minkowski" with the default p=2 is the same distance as "euclidean".
gs_knn = {"n_neighbors" : [2],
          "metric" : ["euclidean", "manhattan", "chebyshev", "minkowski"]}

gd_knn = GridSearchCV(estimator=knn, param_grid=gs_knn, cv=5)
gd_knn.fit(x_train, y_train)

In [65]:
gd_knn.best_params_

{'metric': 'euclidean', 'n_neighbors': 2}

In [66]:
gd_knn.best_score_

0.90625
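
The grid above fixes `n_neighbors` at 2, so the search only compares distance metrics. A sketch of a wider KNN search follows, assuming the bundled `load_digits` data as a stand-in; the candidate values for `n_neighbors` and `weights` are illustrative choices, not tuned recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): scikit-learn's bundled 8x8 digits, not mnist.csv.
X, y = load_digits(return_X_y=True)

# Vary the number of neighbours and how their votes are weighted.
knn_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}
knn_search = GridSearchCV(KNeighborsClassifier(), knn_grid, cv=5)
knn_search.fit(X, y)

print(knn_search.best_params_, round(knn_search.best_score_, 3))
```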

### SVM

In [72]:
gs_svm = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
}

gd_svm = GridSearchCV(estimator=svm, param_grid=gs_svm, cv=5)
gd_svm.fit(x_train, y_train)

In [77]:
gd_svm.best_params_

{'C': 10, 'kernel': 'rbf'}

In [78]:
gd_svm.best_score_

0.9546875
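
Of the three searches with recorded results, the SVM has the highest cross-validated accuracy (0.955, against 0.906 for KNN and 0.762 for the decision tree). The remaining steps of the task, picking the best search and testing its refitted model, could look roughly like the sketch below. It assumes the bundled `load_digits` data as a stand-in and deliberately small grids; the names `candidates`, `searches`, `best_name` and `best_model` are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled 8x8 digits, not mnist.csv.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Small illustrative grids, not the ones used in the notebook.
candidates = {
    "decision_tree": (DecisionTreeClassifier(random_state=42), {"max_depth": [4, 8]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5]}),
    "svm": (SVC(), {"C": [1, 10]}),
}

# One cross-validated search per algorithm, on the training set only.
searches = {}
for name, (estimator, grid) in candidates.items():
    searches[name] = GridSearchCV(estimator, grid, cv=3).fit(X_tr, y_tr)

# Pick the algorithm whose best configuration has the highest CV score.
best_name = max(searches, key=lambda n: searches[n].best_score_)

# With the default refit=True, best_estimator_ is already retrained on all of X_tr.
best_model = searches[best_name].best_estimator_

print(best_name, searches[best_name].best_params_)
print(classification_report(y_te, best_model.predict(X_te)))
```

The test set is touched only once, at the very end, so the reported metrics are not influenced by the model selection.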

### Gradient Boosting

In [None]:
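# Note: this grid has 3 * 4 * 4 * 3 * 4 * 3 * 3 = 5184 combinations, i.e. 25,920 fits
# with cv=5, so the exhaustive search is very slow (this cell has no recorded output).
gs_gbc = {
    'n_estimators': [100, 200, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10],
    'subsample': [0.6, 0.8, 1.0],
    'max_features': ['sqrt', 'log2', None]
}

# n_jobs=-1 runs the fits in parallel on all available CPU cores.
gd_gbc = GridSearchCV(estimator=gbc, param_grid=gs_gbc, cv=5, n_jobs=-1)
gd_gbc.fit(x_train, y_train)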

In [None]:
gd_gbc.best_params_

In [None]:
gd_gbc.best_score_
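
For a grid of this size, the random search mentioned in the task description is a more practical option: `RandomizedSearchCV` evaluates only `n_iter` sampled combinations instead of all of them. The sketch below assumes a 300-sample subset of the bundled `load_digits` data and a reduced, illustrative search space so that it runs quickly; none of these values come from the notebook's results.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# Stand-in data (assumption): a small subset of scikit-learn's bundled digits.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# The gradient boosting grid from the notebook, only used here to count combinations.
gs_gbc = {
    "n_estimators": [100, 200, 500],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5, 10],
    "subsample": [0.6, 0.8, 1.0],
    "max_features": ["sqrt", "log2", None],
}
print(len(ParameterGrid(gs_gbc)))  # 5184 combinations

# Reduced, illustrative search space so the sketch finishes quickly.
param_distributions = {
    "n_estimators": [20, 50],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3],
    "subsample": [0.6, 0.8, 1.0],
    "max_features": ["sqrt", "log2"],
}

# Only n_iter=4 sampled combinations are evaluated, each with 3-fold CV.
rs = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=4,
    cv=3,
    random_state=42,
)
rs.fit(X, y)

print(rs.best_params_)
print(round(rs.best_score_, 3))
```

The trade-off is that a random search may miss the single best combination, but its cost is fixed by `n_iter` rather than by the size of the grid.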

In [90]:
# help(DecisionTreeClassifier)
# help(KNeighborsClassifier)
# help(SVC)
# help(GradientBoostingClassifier)