# Model Assessment
Build a machine learning pipeline with a complete model assessment step. In particular, do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 
- Choose a few machine learning algorithms, such as [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
- Define a grid of hyperparameters for every selected model.
- Run a [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with k-fold cross-validation on the training set to find the best model (i.e., the best algorithm together with its hyperparameters).
- Train the best model on the whole training set.
- Test the trained model on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.

In [1]:
import pandas as pd
import sklearn.model_selection
import sklearn.metrics
import sklearn.svm
import sklearn.tree
import sklearn.neighbors
import plotly.express as px




In [None]:
df = pd.read_csv("../../datasets/mnist.csv")
df = df.set_index("id")
df.head(3)

In [None]:
x = df.drop(["class"], axis=1)
y = df["class"]

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y)
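
By default, `train_test_split` holds out 25% of the rows and reshuffles differently on every run. The following is a minimal sketch of a reproducible split that also keeps the class proportions equal in both parts. It assumes a small synthetic stand-in for the MNIST frame; the `toy` data and the `test_size` and `random_state` values are illustrative and not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the MNIST frame: 100 rows, 4 pixel columns, 2 classes.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 256, size=(100, 4)), columns=["p0", "p1", "p2", "p3"])
toy["class"] = [0] * 80 + [1] * 20  # imbalanced on purpose

x_toy = toy.drop(["class"], axis=1)
y_toy = toy["class"]

# stratify=y_toy keeps the 80/20 class ratio in both parts;
# random_state fixes the shuffle so the split is the same on every run.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Whether stratification matters for the real dataset depends on how balanced its digit classes are, so treat this as an option to consider rather than a required change.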

In [None]:
parameters_grid = {
    "criterion": ["gini", "entropy"], 
    "max_depth": range(1, 20, 3),   # [1, 4, 7, ...]
    "min_samples_split": range(2, 20, 3)
}
model_1 = sklearn.model_selection.GridSearchCV(sklearn.tree.DecisionTreeClassifier(), 
                                               parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)
model_1.fit(x_train, y_train)
print("Accuracy of best decision tree classifier = {:.2f}".format(model_1.best_score_))
print("Best found hyperparameters of decision tree classifier = {}".format(model_1.best_params_))
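
Beyond `best_score_` and `best_params_`, a fitted `GridSearchCV` also exposes `cv_results_` (the scores of every candidate) and `best_estimator_` (the best candidate refit on all data passed to `fit`, because `refit=True` is the default). A sketch of how these might be inspected, using scikit-learn's small built-in digits dataset purely as a fast stand-in for `mnist.csv`; the reduced grid and the names `small_grid`, `search`, and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 8x8 digit images instead of the notebook's mnist.csv.
x_digits, y_digits = load_digits(return_X_y=True)

# Deliberately small grid so the example runs in a few seconds.
small_grid = {"criterion": ["gini", "entropy"], "max_depth": [4, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), small_grid, scoring="accuracy", cv=5
)
search.fit(x_digits, y_digits)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to rank candidates.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))

# best_estimator_ is an ordinary fitted DecisionTreeClassifier.
print("Depth of the refit tree:", search.best_estimator_.get_depth())
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between two candidates is larger than the fold-to-fold noise.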



In [None]:
parameters_grid = {
    "kernel": ["linear", "rbf", "poly"], 
    "C": [0.001, 0.01, 0.1, 1, 10, 100]
}
model_2 = sklearn.model_selection.GridSearchCV(sklearn.svm.SVC(), 
                                               parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)
model_2.fit(x_train, y_train)
print("Accuracy of best SVM classifier = {:.2f}".format(model_2.best_score_))
print("Best found hyperparameters of SVM classifier = {}".format(model_2.best_params_))

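# Confusion matrix of the tuned SVM (seaborn is not imported in this notebook, so plot with px).
y_predicted = model_2.predict(x_test)
cm = sklearn.metrics.confusion_matrix(y_test, y_predicted)
px.imshow(cm, text_auto=True)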


In [None]:
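# Analyzing the KPIs (key performance indicators) of the tuned SVM
y_predicted = model_2.predict(x_test)
# classification_report expects the true labels first, then the predictions.
print(sklearn.metrics.classification_report(y_test, y_predicted))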

In [None]:
parameters_grid = {
    "n_neighbors": [1, 5, 10, 15, 20],
    # Note: "minkowski" with the default p=2 is the same distance as "euclidean".
    "metric": ["minkowski", "euclidean", "manhattan"]
}
model_3 = sklearn.model_selection.GridSearchCV(sklearn.neighbors.KNeighborsClassifier(),
                                               parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)
model_3.fit(x_train, y_train)
print("Accuracy of best KNN classifier = {:.2f}".format(model_3.best_score_))
print("Best found hyperparameters of KNN classifier = {}".format(model_3.best_params_))
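# Confusion matrix of the tuned KNN (seaborn is not imported in this notebook, so plot with px).
y_predicted = model_3.predict(x_test)
cm = sklearn.metrics.confusion_matrix(y_test, y_predicted)
px.imshow(cm, text_auto=True)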
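
Each search reports its own `best_score_`; the overall winner is the search with the highest cross-validated score, and it can be chosen without looking at the test set. A sketch of that selection step, assuming the digits dataset and deliberately tiny grids as fast stand-ins; `candidates`, `searches`, `best_name`, and `best_model` are illustrative names that do not appear in the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): small digits dataset instead of mnist.csv.
x_digits, y_digits = load_digits(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(x_digits, y_digits, random_state=0)

# One (estimator, grid) pair per algorithm; the grids are kept tiny for speed.
candidates = {
    "decision tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [4, 10]}),
    "SVM": (SVC(), {"C": [1, 10]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [1, 5]}),
}

searches = {}
for name, (estimator, grid) in candidates.items():
    # fit() returns the fitted GridSearchCV object itself.
    searches[name] = GridSearchCV(estimator, grid, scoring="accuracy", cv=5).fit(x_tr, y_tr)
    print("{}: CV accuracy = {:.3f}".format(name, searches[name].best_score_))

# Pick the algorithm by cross-validated score only; the test set stays untouched.
best_name = max(searches, key=lambda name: searches[name].best_score_)
best_model = searches[best_name].best_estimator_  # already refit on all of x_tr

print("Selected:", best_name, searches[best_name].best_params_)
print("Test accuracy = {:.3f}".format(best_model.score(x_te, y_te)))
```

Using the test set only once, for the selected model, keeps the reported test accuracy an unbiased estimate.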

In [None]:
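# Final evaluation. This cell assumes model_2 (SVM) had the best cross-validation score.
# GridSearchCV has already refit it on the whole training set (refit=True by default).
y_predicted = model_2.predict(x_test)
accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)
cm = sklearn.metrics.confusion_matrix(y_test, y_predicted)
# Without an `average` argument these are arrays with one value per class.
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(y_test, y_predicted)

print("Accuracy =", accuracy)
print("Precision =", precision)
print("Recall =", recall)
print("F1-Score =", f1)

# Confusion matrix as an annotated heatmap (seaborn is not imported, so plot with px).
px.imshow(cm, text_auto=True)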



In [None]:
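# Analyzing the KPIs (key performance indicators) of the final model
# classification_report expects the true labels first, then the predictions.
print(sklearn.metrics.classification_report(y_test, y_predicted))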
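
`precision_recall_fscore_support` returns one value per class by default; passing `average="macro"` collapses them into a single unweighted mean, which corresponds to the "macro avg" row of `classification_report`. A small worked example with hand-made labels (the two label lists are invented for illustration) that also shows why the argument order matters:

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Invented labels: one class-0 sample is misclassified as class 1.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

# Per-class values: one entry for each of the classes 0, 1, 2.
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
print("precision per class:", precision)  # class 1 drops to 2/3
print("recall per class:   ", recall)     # class 0 drops to 2/3
print("support per class:  ", support)

# Macro average: unweighted mean over the classes (support is None in this mode).
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
print("macro precision, recall, F1:", macro_p, macro_r, macro_f1)

# Ground truth first, predictions second. Swapping them exchanges
# the precision and recall columns of the report.
print(classification_report(y_true, y_pred))
```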