## **EXERCISE 01:**

Exercise: _Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set. Hint: the `KNeighborsClassifier` works quite well for this task; you just need to find good hyperparameter values (try a grid search on the `weights` and `n_neighbors` hyperparameters)._

---

In [1]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", as_frame=False)

In [2]:
mnist.keys()  # type: ignore

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [7]:
data_set, data_labels = mnist["data"], mnist["target"]  # type: ignore

In [5]:
data_set.shape

(70000, 784)

In [8]:
data_labels.shape

(70000,)

In [9]:
# MNIST is already split: the first 60,000 images are the training set, the last 10,000 the test set.
train_set, train_labels = data_set[:60000], data_labels[:60000]
test_set, test_labels = data_set[60000:], data_labels[60000:]

In [10]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
cross_val_score(knn_clf, train_set, train_labels, cv=3, scoring="accuracy")

array([0.9676 , 0.9671 , 0.96755])

In [11]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier()
cross_val_score(forest_clf, train_set, train_labels, cv=3, scoring="accuracy")

array([0.96535, 0.96315, 0.96635])

In [12]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier()
cross_val_score(sgd_clf, train_set, train_labels, cv=3, scoring="accuracy")

array([0.85935, 0.8584 , 0.8741 ])

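With default hyperparameters, `KNeighborsClassifier` has the best cross-validation accuracy of the three (about 96.7%, against about 96.5% for `RandomForestClassifier` and about 86% for `SGDClassifier`), so we will try to tune it further. To speed up the calculations, let's first check how it does on a smaller subset: the first 5,000 training samples.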

In [16]:
cross_val_score(knn_clf, train_set[:5000], train_labels[:5000], cv=3, scoring="accuracy")

array([0.91061788, 0.92921416, 0.92496999])
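Slicing the first 5,000 rows works here, but a stratified random subsample guarantees that every digit keeps roughly its original share of the data. The following is a minimal sketch of that idea, not what this notebook does: it assumes scikit-learn's bundled `load_digits` data as a small offline stand-in for MNIST, and the 30% fraction and `random_state` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 1,797 8x8 digit images bundled with scikit-learn.
X, y = load_digits(return_X_y=True)

# Keep a stratified 30% subsample; the remainder is simply ignored here.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.3, stratify=y, random_state=42
)

# Class frequencies in the subsample should closely mirror the full set.
full_freq = np.bincount(y) / len(y)
small_freq = np.bincount(y_small) / len(y_small)
max_freq_gap = np.abs(full_freq - small_freq).max()
print(X_small.shape, max_freq_gap)
```

On MNIST the same call would take `train_set` and `train_labels` in place of `X` and `y`.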

In [23]:
from sklearn.model_selection import GridSearchCV

param_grid = [{
    "n_neighbors": [3, 4, 5, 6],
    "weights": ["uniform", "distance"],
    # "algorithm" only selects how neighbors are searched for; it changes speed, not the model.
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"]
}]

grid_search = GridSearchCV(knn_clf, param_grid, cv=3)
grid_search.fit(train_set[:5000], train_labels[:5000])

In [24]:
grid_search.best_params_

{'algorithm': 'auto', 'n_neighbors': 4, 'weights': 'distance'}
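The exercise hint only asks for a search over `weights` and `n_neighbors`, which keeps the grid at 8 candidates instead of 32. Below is a hedged sketch of that smaller search. It assumes the bundled `load_digits` data as an offline stand-in for MNIST, and the 80/20 split and `random_state` are illustrative choices, so the scores it prints are not comparable to the MNIST numbers in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): small digits dataset bundled with scikit-learn.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Only the two hyperparameters named in the exercise hint.
param_grid = {
    "n_neighbors": [3, 4, 5, 6],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# With the default refit=True, the best candidate is refit on all of X_train,
# so search.score evaluates that refit model on the held-out test split.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, f"{test_accuracy:.2%}")
```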

In [30]:
cvres = grid_search.cv_results_
cvres.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_algorithm', 'param_n_neighbors', 'param_weights', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])

In [29]:
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)

0.9224003938708055 {'algorithm': 'auto', 'n_neighbors': 3, 'weights': 'uniform'}
0.9278000342308489 {'algorithm': 'auto', 'n_neighbors': 3, 'weights': 'distance'}
0.9218005138468106 {'algorithm': 'auto', 'n_neighbors': 4, 'weights': 'uniform'}
0.9294007945109657 {'algorithm': 'auto', 'n_neighbors': 4, 'weights': 'distance'}
0.9216006738628265 {'algorithm': 'auto', 'n_neighbors': 5, 'weights': 'uniform'}
0.9264001941508457 {'algorithm': 'auto', 'n_neighbors': 5, 'weights': 'distance'}
0.9182007536067817 {'algorithm': 'auto', 'n_neighbors': 6, 'weights': 'uniform'}
0.9264010343189465 {'algorithm': 'auto', 'n_neighbors': 6, 'weights': 'distance'}
0.9224003938708055 {'algorithm': 'ball_tree', 'n_neighbors': 3, 'weights': 'uniform'}
0.9278000342308489 {'algorithm': 'ball_tree', 'n_neighbors': 3, 'weights': 'distance'}
0.9218005138468106 {'algorithm': 'ball_tree', 'n_neighbors': 4, 'weights': 'uniform'}
0.9294007945109657 {'algorithm': 'ball_tree', 'n_neighbors': 4, 'weights': 'distance'}
0.
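Printing every candidate gets long quickly. One way to shorten the listing is to sort by `rank_test_score` and show only the top few. The sketch below assumes a small search on the bundled `load_digits` data as an offline stand-in, and the helper name `top_candidates` is hypothetical; the same function could be called on this notebook's `cvres`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def top_candidates(cv_results, k=3):
    """Return (mean_test_score, params) for the k best-ranked candidates."""
    order = np.argsort(cv_results["rank_test_score"])[:k]
    return [
        (float(cv_results["mean_test_score"][i]), cv_results["params"][i])
        for i in order
    ]


# Stand-in search (assumption): small digits dataset, two hyperparameters.
X, y = load_digits(return_X_y=True)
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 4, 5, 6], "weights": ["uniform", "distance"]},
    cv=3,
)
search.fit(X, y)

best_three = top_candidates(search.cv_results_, k=3)
for mean_score, params in best_three:
    print(f"{mean_score:.4f}", params)
```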

In [31]:
grid_search.best_score_

0.9294007945109657

In [32]:
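# best_estimator_ was fitted on the 5,000-sample subset only,
# so refit it on the full training set before scoring on the test set.
best_knn_clf = grid_search.best_estimator_
best_knn_clf.fit(train_set, train_labels)  # type: ignore
accuracy = best_knn_clf.score(test_set, test_labels)  # type: ignore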

In [48]:
print(f"{accuracy = :.2%}")

accuracy = 97.14%
