# Exercise: Shallow Sommelier (Classification, KNN, SVM, Logistic Regression)


`#classification` `#k-nearest-neighbors` `#support-vector-machines` `#logistic-regression`

> Objectives:
>
> - Explore classification models
> - Use SciKit-Learn's models to perform classification:
>   - K-Nearest Neighbors
>   - Support Vector Machines
>   - Logistic Regression


## Standard Deep Atlas Exercise Set Up


- [ ] Ensure you are using the coursework Pipenv environment and kernel ([instructions](../SETUP.md))
- [ ] Apply the standard Deep Atlas environment setup process by running this cell:

In [None]:
import sys, os
sys.path.insert(0, os.path.join('..', 'includes'))

import deep_atlas
from deep_atlas import FILL_THIS_IN
deep_atlas.initialize_environment()
if deep_atlas.environment == 'COLAB':
    %pip install -q python-dotenv==1.0.0

### 🚦 Checkpoint: Start

- [ ] Run this cell to record your start time:

In [None]:
deep_atlas.log_start_time()

---


## Context


Many ML applications involve assigning predefined labels (classes) to input data by recognizing patterns. Models are trained on labeled data to classify unseen data.

Common applications include:

- Spam detection
- Sentiment analysis
- Image recognition
- Fraud detection
- Medical diagnosis

Large datasets may require deep learning, but this walkthrough covers shallow learning techniques:

- **K-Nearest Neighbors (KNN)**: Classifies instances based on the classes of their k-nearest neighbors.
- **Support Vector Machines (SVM)**: Finds a hyperplane that maximally separates classes in the feature space.
- **Logistic Regression**: Models the probability of class membership using a logistic function.

## Exercise Goal:


- Develop a model that can classify wine varieties.


## Dependencies


In [None]:
if deep_atlas.environment == 'VIRTUAL':
    !pipenv install ipykernel==6.28.0
    !pipenv install scikit-learn==1.4.1.post1 pandas==2.2.1 matplotlib==3.8.3
if deep_atlas.environment == 'COLAB':
    %pip install scikit-learn==1.4.1.post1 pandas==2.2.1 matplotlib==3.8.3


## Imports


In [None]:
# Data loading
import random
from sklearn.datasets import load_wine

# Creating training/testing sets
from sklearn.model_selection import train_test_split

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Evaluation
from sklearn.metrics import accuracy_score
import time

# Inspecting data
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# Suppress scientific notation in printed output
np.set_printoptions(suppress=True)

## Load the data


For this exercise, we can load data using SciKit-Learn's built-in `load_wine` function, one of a [few easily loaded practice datasets](https://scikit-learn.org/stable/datasets.html#datasets).

- [ ] Get the features (X) and the corresponding classes (y) from the dataset:


In [None]:
# Load the Wine dataset
X, y = load_wine(return_X_y=True)

Each row in the dataset contains the following 13 features: alcohol content, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline.

The dataset has already been preprocessed: all the values are numeric and there are no missing values to impute.

- [ ] Print 5 random samples from the dataset:


In [None]:
print("Random set of 5 items from the dataset:")
random_indices = random.sample(range(len(X)), 5)
for i in random_indices:
    print(f"Label: {y[i]}")
    print(f"Features: {X[i]}")

Looking at a few samples helps understand the types of features, but does not give us an intuitive view of the data.

Instead, let's apply Principal Component Analysis (PCA) to project the data onto the 2 directions (linear combinations of the original features) along which it varies the most. We can then plot the points along those axes, coloring each point by its label.
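One caveat worth keeping in mind: PCA looks for directions of maximum variance, so features with large numeric ranges (proline is in the hundreds or thousands, while hue is close to 1) can dominate the result. The sketch below, which assumes that standardizing with `StandardScaler` is acceptable for visualization, compares how much variance the first two components explain with and without scaling. The names `pca_raw`, `pca_scaled`, and `X_scaled` are illustrative and not part of the exercise.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# PCA on the raw features: large-valued features dominate the variance
pca_raw = PCA(n_components=2).fit(X)

# PCA after standardizing each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2).fit(X_scaled)

print("Explained variance ratio (raw):   ", pca_raw.explained_variance_ratio_)
print("Explained variance ratio (scaled):", pca_scaled.explained_variance_ratio_)
```

If the first raw component explains almost all of the variance, the plot is mostly showing one large-valued feature rather than structure shared across features.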

- [ ] Perform PCA and plot the points:


In [None]:
# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot the classes after PCA
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("Wine Dataset - Classes after PCA")
plt.show()

The final step in setting up the data for training is creating a training and testing split:

- [ ] Set aside 20% of the data for testing:


In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=FILL_THIS_IN, random_state=42
)

<details>
    <summary>Solution:</summary>

```py
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

</details>


## Training SciKit-Learn Classification Models


A major benefit of using the established SciKit-Learn library is that all of its model implementations expose the same APIs (with methods like `fit` for training and `predict` for inference).

This allows us to write a single training function to fit a few different models.

- [ ] In the function definition below:
  - Update the function signature to accept the name and classifier instance as arguments.
  - Update the print statement to print the name of the model being explored.
  - Call the classifier's `fit` method after the start time has been recorded.
    - Make sure to pass in the training features (`X_train`) and labels (`y_train`).
  - Call the classifier's `predict` method and set its output to `y_pred`.
  - Save each model instance, accuracy and training time to the `results` dictionary.


In [None]:
# Dictionary to store the trained model, accuracy, and training time
results = {}


def train_and_evaluate(FILL_THIS_IN):
    print(f"Training {FILL_THIS_IN}...")

    # Fit the model to the training data
    start_time = time.time()

    FILL_THIS_IN

    training_time = time.time() - start_time
    print(f"Training time: {training_time:.4f} seconds")

    # Predict using the testing data
    y_pred = FILL_THIS_IN
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.4f}")

    # Store the results
    results[FILL_THIS_IN] = {
        "model": FILL_THIS_IN,
        "Accuracy": accuracy,
        "Training Time": training_time,
    }

Evaluation of classification models is typically done using metrics such as accuracy, precision, recall, and F1 score. These metrics provide insights into the performance of the model in terms of correctly classifying instances from different classes.
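To make these metrics concrete, here is a minimal pure-Python sketch that computes precision, recall, and F1 for one class treated as the "positive" class. The helper `precision_recall_f1` and the example labels are made up for illustration; in practice, `sklearn.metrics` provides equivalent functions.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Made-up labels for illustration only
example_true = [0, 0, 1, 1, 1, 2]
example_pred = [0, 1, 1, 1, 0, 2]
example_metrics = precision_recall_f1(example_true, example_pred, positive=1)
print(example_metrics)
```

Here class 1 has two true positives, one false positive, and one false negative, so precision, recall, and F1 all come out to 2/3.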


<details>
    <summary>Solution:</summary>

```py
def train_and_evaluate(name, classifier):
    print(f"Training {name}...")

    # Fit the model to the training data
    start_time = time.time()
    classifier.fit(X_train, y_train)
    training_time = time.time() - start_time
    print(f"Training time: {training_time:.4f} seconds")

    # Predict using the testing data
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.4f}")

    # Store the results
    results[name] = {
        "model": classifier,
        "Accuracy": accuracy,
        "Training Time": training_time,
    }
```

</details>


### K-Nearest Neighbors


The k-nearest neighbors (KNN) algorithm is a type of instance-based learning algorithm.

Here's how it works for classification:

1. During the training phase, the KNN algorithm simply stores all the training data.

2. When you want to classify a new, unseen instance, the KNN algorithm finds the `k` training instances that are closest to the new instance. "Closeness" is typically measured using a distance metric, such as Euclidean distance.

3. **Majority Voting**: The algorithm then assigns the class label of the new instance based on the majority class label of these `k` nearest neighbors. In other words, the new instance is assigned to the class that most of its `k` nearest neighbors belong to.

The number `k` is a hyperparameter that you choose. A small `k` (like 1 or 2) will make the classifier more sensitive to noise in the data, while a large `k` will make the classifier more resistant to noise, but also more likely to misclassify instances because it considers more distant instances in the voting process.

The KNN algorithm is simple and can be very effective, but it can also be slow for large datasets because it needs to compute the distance between the new instance and every instance in the training set.
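The steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not how `KNeighborsClassifier` is implemented internally; the function `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2D data, made up for illustration
toy_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
toy_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(toy_points, toy_labels, query=(2.0, 2.0), k=3))  # a
print(knn_predict(toy_points, toy_labels, query=(6.0, 7.0), k=3))  # b
```

Note that "training" here is nothing more than keeping `toy_points` and `toy_labels` around; all of the work happens at prediction time.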


In [None]:
train_and_evaluate(
    "K-Nearest Neighbors",
    KNeighborsClassifier(n_neighbors=5, weights="uniform"),
)

- [ ] Note the training times and accuracy of this model and the subsequent ones.


### Support Vector Machines


Support Vector Machines (SVM) are a class of supervised learning algorithms for both classification and regression. In SciKit-Learn, they are implemented in the `SVC` (Support Vector Classification) and `SVR` (Support Vector Regression) classes.

Here's how it works for classification:

1. **Training**: During the training phase, the SVM algorithm tries to find a hyperplane\* that separates the classes in the feature space. If the data is not linearly separable, it uses a technique called the kernel trick\*\* to project the data into a higher-dimensional space where a hyperplane can be found. The chosen hyperplane is the one that maximizes the margin between the classes, which is defined as the distance between the hyperplane and the closest data points from each class (these points are called support vectors).

   - \*_Hyperplane_ refers to a flat subspace that has one dimension fewer than the space it's in. For example, in a 3D space, a hyperplane is a 2D plane.
   - \*\*_Kernel trick_ refers to using a kernel function to compute the dot product of two vectors in a higher-dimensional space without explicitly transforming them into that space.

2. **Prediction**: When you want to classify a new, unseen instance, the SVM algorithm applies the same transformation to the new instance as it did to the training data (if a kernel was used), and then determines which side of the hyperplane the new instance falls on. The class of the new instance is then determined based on which side of the hyperplane it falls on.

The SVM algorithm is effective in high-dimensional spaces and remains effective even when the number of dimensions is greater than the number of samples. It's also versatile, as different kernel functions can be specified for the decision function. However, it does not directly provide probability estimates.
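To illustrate the kernel trick mentioned above, the following sketch uses a degree-2 polynomial kernel on 2D inputs, chosen here only because its explicit feature map is easy to write down (it is not the default kernel of `SVC`, which is RBF). The helper names `poly_kernel`, `feature_map`, and `dot` are hypothetical. The kernel computed in the original 2D space gives the same value as a dot product in the explicit 3D space.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel: the squared dot product in the original space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit map from 2D to 3D whose dot product matches poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


vec_a, vec_b = (1.0, 2.0), (3.0, 0.5)

kernel_value = poly_kernel(vec_a, vec_b)  # computed in 2D
explicit_value = dot(feature_map(vec_a), feature_map(vec_b))  # computed in 3D

print(kernel_value, explicit_value)
```

Both values agree (up to floating-point rounding), but the kernel never builds the 3D vectors. For kernels whose feature space is very high-dimensional, that shortcut is what makes training feasible.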

See the resources section below for a visual explanation of SVMs.


In [None]:
train_and_evaluate("Support Vector Machine", SVC())

### Logistic Regression


In its basic form, Logistic Regression trains a function to classify an input into one of two classes (binary classification).

Here's how it works for classification:

1. **Training**: During the training phase, the Logistic Regression algorithm tries to find the best parameters (weights and bias) for the logistic function that separates the classes in the feature space. This is done by minimizing a cost function (like the log loss) using an optimization algorithm (like Gradient Descent). The logistic function is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits. This output can be interpreted as the probability of the instance belonging to the positive class.

2. **Prediction**: When you want to classify a new, unseen instance, the Logistic Regression algorithm applies the logistic function to the dot product of the instance features and the learned weights, plus the bias term. If the output is greater than 0.5, the instance is classified as the positive class. Otherwise, it's classified as the negative class.

Thus, it has the advantage of providing probabilities for the predictions, which can be useful in many applications.
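The prediction step described above can be sketched as follows. The weights and bias are hypothetical values picked for illustration; a trained model would learn them from data.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_binary(features, weights, bias, threshold=0.5):
    """Return (probability of the positive class, predicted label)."""
    # Dot product of features and weights, plus the bias term
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability > threshold)


# Hypothetical parameters, not learned from any dataset
demo_weights = [0.8, -0.4]
demo_bias = -0.1

print(predict_binary([2.0, 1.0], demo_weights, demo_bias))  # probability above 0.5
print(predict_binary([0.0, 3.0], demo_weights, demo_bias))  # probability below 0.5
```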

#### What about _multi-class_ problems?

Logistic Regression can be extended to handle multi-class classification problems:

1. **One-vs-Rest (OvR)**: In this strategy, a separate model is trained for each class predicted against all other classes. For example, if there are three classes A, B, and C, three models would be trained: A vs. B and C, B vs. A and C, and C vs. A and B. To make a prediction, all models are run on the input and the model with the highest confidence in its prediction is chosen.
2. **Softmax/Multinomial Logistic Regression**: This is a generalization of Logistic Regression to the multi-class case. The model computes a score for each class, then applies the softmax function to these scores to obtain the probability of each class. The class with the highest probability is chosen as the prediction. The model is trained by minimizing the cross-entropy loss, which penalizes the model if it estimates a low probability for the target class.
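As a small illustration of the softmax approach, the sketch below turns made-up class scores into probabilities, picks the most probable class, and computes the cross-entropy loss for one assumed true class. The scores and variable names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the max score improves numerical stability
    # and does not change the result
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three classes
class_scores = [2.0, 1.0, 0.1]
probabilities = softmax(class_scores)
predicted_class = probabilities.index(max(probabilities))

# Cross-entropy loss for a single sample whose true class is assumed to be 0
loss_if_true_is_0 = -math.log(probabilities[0])
# The loss is larger if the true class were one the model considers unlikely
loss_if_true_is_2 = -math.log(probabilities[2])

print(probabilities, predicted_class)
print(loss_if_true_is_0, loss_if_true_is_2)
```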


In [None]:
train_and_evaluate(
    "Logistic Regression",
    LogisticRegression(
        max_iter=200, solver="lbfgs", multi_class="multinomial"
    ),
)

- [ ] Note the convergence warning. The SciKit-Learn `LogisticRegression` class does not use plain Gradient Descent; the `lbfgs` solver specified here stopped before reaching a minimum of the loss because it ran out of iterations (`max_iter=200`).
- [ ] (Optional) Try fixing the problem by following the recommendations in the warning message.
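One possible approach to the optional step, sketched under the assumption that scaling the features is acceptable here: standardize the inputs with `StandardScaler` inside a pipeline before fitting. The variable names (`X_demo`, `scaled_logreg`, and so on) are illustrative, and the data is reloaded so the block runs on its own. Increasing `max_iter` is the other option the warning suggests.

```python
import warnings

from sklearn.datasets import load_wine
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Standardize features, then fit logistic regression on the scaled values
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot go unnoticed
    warnings.simplefilter("error", category=ConvergenceWarning)
    scaled_logreg.fit(X_tr, y_tr)

scaled_accuracy = scaled_logreg.score(X_te, y_te)
print(f"Accuracy with scaling: {scaled_accuracy:.4f}")
```

Because the pipeline exposes the same `fit` and `predict` methods as a bare classifier, it could also be passed to `train_and_evaluate` in place of the `LogisticRegression` instance.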


### 🚦 Checkpoint: Stop

- [ ] Uncomment this code
- [ ] Complete the feedback form
- [ ] Run the cell to log your responses and record your stop time:

In [None]:
# deep_atlas.log_feedback(
#     {
#         # How long were you actively focused on this section? (HH:MM)
#         "active_time": FILL_THIS_IN,
#         # Did you feel finished with this section (Yes/No):
#         "finished": FILL_THIS_IN,
#         # How much did you enjoy this section? (1–5)
#         "enjoyment": FILL_THIS_IN,
#         # How useful was this section? (1–5)
#         "usefulness": FILL_THIS_IN,
#         # Did you skip any steps?
#         "skipped_steps": [FILL_THIS_IN],
#         # Any obvious opportunities for improvement?
#         "suggestions": [FILL_THIS_IN],
#     }
# )
# deep_atlas.log_stop_time()

## You did it!


### Resources:

- [Classifier comparison — SciKit-Learn docs](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)
- Video: [Support Vector Machine (SVM) in 2 minutes](https://www.youtube.com/watch?v=_YPScrckx28)
