EM 538-001: Practical Machine Learning for Engineering Analytics (Spring 2025)  
Instructor: Fred Livingston (fjliving@ncsu.edu)  


## K-Nearest Neighbors Implementation

- Below is a very simple implementation of a *k*-nearest neighbors (*k*NN) classifier.
- This implementation is slow and inefficient; for real-world problems, it is generally recommended to use established libraries (such as scikit-learn) instead of implementing algorithms from scratch.
- The scikit-learn library, for example, implements *k*NN much more efficiently and robustly.
- Implementing algorithms from scratch is useful for learning and teaching, or for trying out new algorithms. Hence the implementation below, which also gently introduces how things are implemented in scikit-learn.

In [12]:
# from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

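# Load the Iris data and keep two features so the model can be visualized in 2D
df_iris = pd.read_csv('data/iris.csv')
X = df_iris[['PetalLength[cm]', 'PetalWidth[cm]']]
y = df_iris['Species']

# Hold out 20% for testing; stratify=y preserves class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y,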
                                                    test_size=0.2,
                                                    random_state=123,
                                                    shuffle=True,
                                                    stratify=y)
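A from-scratch *k*NN classifier can be sketched in a few lines. The version below is a minimal illustration under stated assumptions, not necessarily the course's reference implementation: the class name `SimpleKNNClassifier`, the attribute `labels_`, and the toy data are made up for this example, and it uses plain Python sequences with Euclidean distance and a majority vote. It expects rows as sequences of numbers (for a pandas DataFrame, one would likely pass `X_train.values`).

```python
import math
from collections import Counter


class SimpleKNNClassifier:
    """Hypothetical from-scratch kNN classifier (illustration only)."""

    def __init__(self, k=3):
        self.k = k  # hyperparameter: number of neighbors to consult

    def fit(self, X, y):
        # kNN is a lazy learner: "fitting" only stores the training data.
        self.dataset_ = [tuple(row) for row in X]
        self.labels_ = list(y)
        return self

    def _predict_one(self, x):
        # Sort training indices by Euclidean distance to the query point x
        order = sorted(range(len(self.dataset_)),
                       key=lambda i: math.dist(self.dataset_[i], x))
        # Majority vote among the labels of the k closest training points
        nearest_labels = [self.labels_[i] for i in order[:self.k]]
        return Counter(nearest_labels).most_common(1)[0][0]

    def predict(self, X):
        return [self._predict_one(tuple(row)) for row in X]


# Toy data, invented for this example: two well-separated clusters
toy_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
toy_y = ['a', 'a', 'a', 'b', 'b', 'b']

toy_knn = SimpleKNNClassifier(k=3).fit(toy_X, toy_y)
toy_pred = toy_knn.predict([[1.1, 1.0], [5.1, 5.0]])
print(toy_pred)
```

Every call to `predict` scans the whole training set and sorts it by distance, which is why this approach does not scale well and library implementations are preferred in practice.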

Note that some attributes in the implementation above carry a `_` suffix -- this is not a typo.
- The trailing `_` (e.g., here: `self.dataset_`) is a scikit-learn convention and indicates that these are "fit" attributes -- that is, attributes that are available only *after* calling the `fit` method.

## The Scikit-Learn Estimator API

- Below is an overview of the scikit-learn estimator API, which is used for implementing classification and regression models/algorithms.
- We have seen the methods in the context of the *k*NN implementation earlier; however, one interesting, additional method we have not covered yet is `score`.
- The `score` method simply runs `predict` on the features (`X`) internally and then computes the performance by comparing the predicted targets to the true targets `y`.
- In the case of classification models, the `score` method computes the classification accuracy (in the range [0, 1]) -- i.e., the proportion of correctly predicted labels.
- In the case of regression models, the `score` method computes the coefficient of determination ($R^2$).
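The two metrics described above can be written out by hand. The following is a minimal sketch in plain Python; the helper names `accuracy` and `r_squared` and the example values are assumptions for illustration, not scikit-learn functions.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels (range [0, 1])."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot


# Hypothetical labels and predictions, for illustration only
demo_acc = accuracy(['setosa', 'virginica', 'setosa', 'versicolor'],
                    ['setosa', 'virginica', 'versicolor', 'versicolor'])  # 3 of 4 correct -> 0.75
demo_r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 3.0])  # 1 - 1/5, approximately 0.8
```

Unlike accuracy, $R^2$ is not bounded below by 0: a model that fits worse than always predicting the mean of `y` yields a negative value.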

```python
class SupervisedEstimator(...):
    
    def __init__(self, hyperparam_1, ...):
        self.hyperparam_1 = hyperparam_1
        ...
    
    def fit(self, X, y):
        ...
        self.fit_attribute_ = ...
        return self
    
    def predict(self, X):
        ...
        return y_pred
    
    def score(self, X, y):
        ...
        return score
    
    def _private_method(self):
        ...
    ...
    
```
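To make the skeleton concrete, here is a deliberately trivial estimator that follows the same method layout. It is a sketch under stated assumptions: `MajorityClassClassifier`, its `fallback_label` hyperparameter, and the attribute `majority_label_` are hypothetical names chosen for this illustration and are not part of scikit-learn.

```python
from collections import Counter


class MajorityClassClassifier:
    """Hypothetical estimator that always predicts the most frequent training label."""

    def __init__(self, fallback_label=None):
        # Hyperparameters are stored unchanged in __init__
        self.fallback_label = fallback_label

    def fit(self, X, y):
        labels = list(y)
        # Fit attribute (trailing underscore): exists only after fit is called
        self.majority_label_ = (Counter(labels).most_common(1)[0][0]
                                if labels else self.fallback_label)
        return self  # returning self allows chaining fit with other calls

    def predict(self, X):
        return [self.majority_label_ for _ in X]

    def score(self, X, y):
        # Run predict internally, then compare against the true targets
        y_pred = self.predict(X)
        return sum(t == p for t, p in zip(y, y_pred)) / len(y_pred)


demo_model = MajorityClassClassifier().fit([[0], [1], [2]], ['a', 'a', 'b'])
demo_score = demo_model.score([[3], [4]], ['a', 'b'])  # one of two correct -> 0.5
```

Because `fit` returns `self`, construction and fitting can be chained in one expression, as in the last two lines.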

- The graphic below summarizes the usage of the `SupervisedEstimator` API that scikit-learn uses for implementing classification and regression algorithms/models.

<img src="images/estimator-api.png" alt="drawing" width="250"/>

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test)
y_pred


In [None]:
print(f'Test accuracy: {knn_model.score(X_test, y_test) * 100:.2f}%')

### Visualize Decision Boundary

Usually, in machine learning, we work with datasets that have more than two feature variables. For educational purposes, however, we chose a very simple dataset consisting of only two features here (the petal length and the petal width of Iris flowers). With only two features, we can easily visualize the decision boundary of the model.

In [None]:
X_train['PetalLength[cm]'].values

In [None]:
from sklearn.inspection import DecisionBoundaryDisplay

disp = DecisionBoundaryDisplay.from_estimator(
        knn_model, X_train, response_method="predict",
        xlabel='PetalLength[cm]', ylabel='PetalWidth[cm]',
        alpha=0.5)
disp.ax_.scatter(X_train['PetalLength[cm]'].values, X_train['PetalWidth[cm]'].values, edgecolor="k")
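Conceptually, a decision-boundary plot comes from evaluating the fitted model on a grid of points covering the two feature axes and coloring each point by its predicted class; the boundary is where the predicted label changes. The sketch below shows that idea in plain Python. `predict_on_grid` and `toy_rule` are hypothetical names for this illustration, and `toy_rule` merely stands in for a fitted model's `predict`.

```python
def predict_on_grid(predict_fn, x_range, y_range, steps=5):
    """Evaluate predict_fn at evenly spaced points covering the feature plane."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    xs = [x_min + i * (x_max - x_min) / (steps - 1) for i in range(steps)]
    ys = [y_min + j * (y_max - y_min) / (steps - 1) for j in range(steps)]
    # One row of predicted labels per y value, one column per x value
    return [[predict_fn((x, y)) for x in xs] for y in ys]


# Hypothetical stand-in for a fitted model: a single threshold on the first feature
def toy_rule(point):
    return 'short' if point[0] < 3.0 else 'long'


grid_labels = predict_on_grid(toy_rule, x_range=(1.0, 7.0), y_range=(0.0, 2.5))
for row in grid_labels:
    print(row)
```

A plotting routine would use a much finer grid and map each label to a color, so the change from one label to the next appears as a boundary between colored regions.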