# Lab 9: Classification

This lab is presented with some revisions from [Dennis Sun at Cal Poly](https://web.calpoly.edu/~dsun09/index.html) and his [Data301 Course](http://users.csc.calpoly.edu/~dsun09/data301/lectures.html).

### When you have filled out all the questions, submit via [Tulane Canvas](https://tulane.instructure.com/)

In [None]:
# clone the course repository, change to right directory, and import libraries.
%cd /content
!git clone https://github.com/nmattei/cmps3160.git
%cd /content/cmps3160/_labs/Lab10
import pandas as pd
import numpy as np

_Classification models_ are used when the label we want to predict is categorical. In this section, we will train a classification model to predict the color of a wine (red or white) from its chemical properties.

The training data for the red and white wines are stored in separate files on GitHub (`../data/reds.csv` and `../data/whites.csv`). Let's read in the two datasets, add a column for the color ("red" or "white"), and combine them into one `DataFrame`.

In [None]:
import numpy as np
import pandas as pd
pd.options.display.max_rows = 5

reds = pd.read_csv("../data/reds.csv", sep=";")
whites = pd.read_csv("../data/whites.csv", sep=";")

reds["color"] = "red"
whites["color"] = "white"

wines = pd.concat([reds, whites],
                  ignore_index=True)
wines

Let's focus on just two features for now: volatile acidity and total sulfur dioxide. Let's plot the training data, using color to represent the class label.

In [None]:
colors = wines["color"].map({
    "red": "darkred",
    "white": "gold"
})

wines.plot.scatter(
    x="volatile acidity", y="total sulfur dioxide", c=colors,
    alpha=.3, xlim=(0, 1.6), ylim=(0, 400)
);

Now suppose that we have a new wine with volatile acidity .85 and total sulfur dioxide 120, represented by a black circle in the plot below. Is this likely a red wine or a white wine?

![](https://github.com/nmattei/cmps3160/blob/master/_labs/images/classification.png?raw=1)

It is not hard to guess that this wine is probably red, just by looking at the plot. The reasoning goes like this: most of the wines in the training data that were "close" to this wine were red, so it makes sense to predict that this wine is also red. This is precisely the idea behind the $k$-nearest neighbors classifier:

1. Calculate the distance between the new point and each point in the training data, using some distance metric on the features.
2. Determine the $k$ closest points. Of these $k$ closest points, count up how many of each class label there are.
3. The predicted class of the new point is whichever class was most common among the $k$ closest points.

The only difference between the $k$-nearest neighbors classifier and the $k$-nearest neighbors regressor from the previous chapter is the last step. Instead of averaging the labels of the $k$ neighbors to obtain our prediction, we count up the number of occurrences of each category among the labels and take the most common one. It makes sense that we have to do something different because the label is now categorical instead of quantitative. **This is yet another example of the general principle that was introduced in Chapter 1: the analysis changes depending on the variable type!**
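
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used later in the lab: the function name `knn_predict` and the toy data are made up for this example, and the points are assumed to be already scaled.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, new_point, k):
    """Predict the label of new_point by majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distance from the new point to every training point
    dists = [math.dist(p, new_point) for p in train_points]
    # Step 2: indices of the k closest training points
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # Step 3: most common label among those k neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up, already-scaled toy data (not the wine data)
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
toy_labels = ["white", "white", "white", "red", "red", "red"]

toy_prediction = knn_predict(toy_points, toy_labels, (0.8, 0.9), k=3)
print(toy_prediction)
```

With two classes, choosing an odd $k$ avoids tied votes. In this sketch a tie would be broken by whichever label was counted first, which is an arbitrary choice.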

# Implementing the K-Nearest Neighbors Classifier

Let's implement $9$-nearest neighbors for the wine above. First, we extract the training data and scale the features:

In [None]:
X_train = wines[["volatile acidity", "total sulfur dioxide"]]
y_train = wines["color"]

X_train_sc = (X_train - X_train.mean()) / X_train.std()

Then, we create a `Series` for the new wine, being sure to scale it in the exact same way:

In [None]:
x_new = pd.Series(dtype=float)
x_new["volatile acidity"] = 0.85
x_new["total sulfur dioxide"] = 120

x_new_sc = (x_new - X_train.mean()) / X_train.std()
x_new_sc
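
To see why scaling matters, here is a small sketch on made-up numbers (not the wine data). The two features are on very different scales, so the raw Euclidean distance is dominated by the larger one. The variable names are hypothetical, and the new point is scaled with the mean and standard deviation of the training values, as in the cell above.

```python
import math
import statistics

# Made-up training values on very different scales (hypothetical, not the wine data)
acidity = [0.30, 0.85, 0.25, 0.90]
sulfur = [100.0, 130.0, 60.0, 170.0]
new_raw = (0.80, 105.0)

# Learn the mean and standard deviation from the training values only
acid_mean, acid_sd = statistics.mean(acidity), statistics.stdev(acidity)
sulf_mean, sulf_sd = statistics.mean(sulfur), statistics.stdev(sulfur)

# Scale the training points and the new point in exactly the same way
train_sc = [((a - acid_mean) / acid_sd, (s - sulf_mean) / sulf_sd)
            for a, s in zip(acidity, sulfur)]
new_sc = ((new_raw[0] - acid_mean) / acid_sd, (new_raw[1] - sulf_mean) / sulf_sd)

raw_dists = [math.dist((a, s), new_raw) for a, s in zip(acidity, sulfur)]
sc_dists = [math.dist(p, new_sc) for p in train_sc]

# Index of the single nearest training point, before and after scaling
nearest_raw = min(range(len(raw_dists)), key=lambda i: raw_dists[i])
nearest_sc = min(range(len(sc_dists)), key=lambda i: sc_dists[i])
print(nearest_raw, nearest_sc)
```

In this toy example the nearest neighbor changes once both features are standardized. Note that `statistics.stdev` divides by $n - 1$, like the `pandas` `.std()` method, whereas Scikit-Learn's `StandardScaler` divides by $n$, so the scaled values from the two approaches differ very slightly.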

Now we calculate the (Euclidean) distance between this new wine and each wine in the training data, and sort the distances from smallest to largest.

In [None]:
dists = np.sqrt(((X_train_sc - x_new_sc) ** 2).sum(axis=1))
dists_sorted = dists.sort_values()
dists_sorted

The first 9 entries of this sorted list will be the 9 nearest neighbors. Let's get their index.

In [None]:
inds_nearest = dists_sorted.index[:9]
inds_nearest

Now we can look up these indices in the original data.

In [None]:
wines.loc[inds_nearest]

As a sanity check, notice that these wines are all similar to the new wine in terms of volatile acidity and total sulfur dioxide. To make a prediction for this new wine, we need to count up how many reds and whites there are among these 9-nearest neighbors.

In [None]:
wines.loc[inds_nearest, "color"].value_counts()

There were more reds than whites, so the 9-nearest neighbors model predicts that the wine is red.

As a measure of confidence in a prediction, classification models often report the predicted _probability_ of each label, instead of just the predicted label. The predicted probability of a class in a $k$-nearest neighbors model is simply the proportion of the $k$ neighbors that are in that class. In the example above, instead of simply predicting that the wine is red, we could have instead said that the wine has a $6/9 = .667$ probability of being red.
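
As a quick sketch of this calculation, suppose the 9 nearest neighbors consist of 6 reds and 3 whites, as in the example in the text. The list `neighbor_labels` is typed in by hand here for illustration rather than computed from the data.

```python
from collections import Counter

# Hypothetical labels of the 9 nearest neighbors (6 red, 3 white)
neighbor_labels = ["red"] * 6 + ["white"] * 3

k = len(neighbor_labels)
counts = Counter(neighbor_labels)

# Predicted probability of each class = proportion of the k neighbors in that class
neighbor_probs = {label: count / k for label, count in counts.items()}

# The predicted label is the class with the highest predicted probability
predicted_label = max(neighbor_probs, key=neighbor_probs.get)
print(neighbor_probs, predicted_label)
```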

# K-Nearest Neighbors Classifier in Scikit-Learn

Now let's see how to implement the same $9$-nearest neighbors model above using Scikit-Learn.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# define the training data
X_train = wines[["volatile acidity", "total sulfur dioxide"]]
y_train = wines["color"]

# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)

# fit the 9-nearest neighbors model
model = KNeighborsClassifier(n_neighbors=9)
model.fit(X_train_sc, y_train)

# define the test data (Scikit-Learn expects a matrix)
x_new = pd.DataFrame()
x_new["volatile acidity"] = [0.85]
x_new["total sulfur dioxide"] = [120]
x_new_sc = scaler.transform(x_new)

# use the model to predict on the test data
model.predict(x_new_sc)

What if we want the predicted probabilities? For classification models, there is an additional method, `.predict_proba()`, that returns the predicted probability of each class.

In [None]:
model.predict_proba(x_new_sc)

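The first number represents the probability of the first class ("red") and the second number represents the probability of the second class ("white"). The columns follow the order of the labels in `model.classes_`, which Scikit-Learn sorts alphabetically for string labels. Notice that the predicted probabilities add up to 1, as they must.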

# Part 1: Exercises

## Exercise 1

In the above example, we built a 9-nearest neighbors classifier to predict the color of a wine from just its volatile acidity and total sulfur dioxide. Use the model above to predict the color of a wine with the following features:

- fixed acidity: 11
- volatile acidity: 0.3
- citric acid: 0.3
- residual sugar: 2
- chlorides: 0.08
- free sulfur dioxide: 17
- total sulfur dioxide: 60
- density: 1.0
- pH: 3.2
- sulphates: 0.6
- alcohol: 9.8
- quality: 6

Now, build a 15-nearest neighbors classifier using all of the features in the data set. Use this new model to predict the color of the same wine above.

Does the predicted label change? Do the predicted probabilities of the labels change?

In [None]:
# TYPE YOUR CODE HERE

**Written Answer Here:**

# Part 2: Evaluating Classification Models

Just as with regression models, we need ways to measure how good a classification model is. With regression models, the main metrics were MSE, RMSE, and MAE. With classification models, the main metrics are accuracy, precision, and recall. All of these metrics can be calculated on either the training data or the test data. We can also use cross validation to estimate the value of the metric on test data.

First, let's train a $9$-nearest neighbors model on the wine data, just so that we have a model to evaluate. The following code is copied from above.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
pd.options.display.max_rows = 5

reds = pd.read_csv("../data/reds.csv", sep=";")
whites = pd.read_csv("../data/whites.csv", sep=";")

reds["color"] = "red"
whites["color"] = "white"

wines = pd.concat([reds, whites],
                  ignore_index=True)
wines

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# define the training data
X_train = wines[["volatile acidity", "total sulfur dioxide"]]
y_train = wines["color"]

# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)

# fit the 9-nearest neighbors model
model = KNeighborsClassifier(n_neighbors=9)
model.fit(X_train_sc, y_train)

We will start by calculating training metrics, so we need predictions for the observations in the training data.

In [None]:
y_train_pred = model.predict(X_train_sc)
y_train_pred

# Metrics for Classification

Because the labels $y_i$ in a classification model are categorical, we cannot calculate the difference $y_i - \hat y_i$ between the actual and predicted labels, as we did for regression models. But we can determine if the predicted label $\hat y_i$ is correct ($\hat y_i = y_i$) or not ($\hat y_i \neq y_i$). For example, the **error rate** is defined to be:

$$ \textrm{error rate} = \textrm{proportion where } \hat y_i \neq y_i $$

With classification models, it is more common to report the performance in terms of a score, like **accuracy**, where a higher value is better:

$$ \textrm{accuracy} = \textrm{proportion where } \hat y_i = y_i $$

In [None]:
accuracy = (y_train_pred == y_train).mean()
accuracy

If you ever forget how to calculate accuracy, you can have Scikit-Learn do it for you. It just needs to know the true labels and the predicted labels:

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_train, y_train_pred)

The problem with accuracy is that it is sensitive to the initial distribution of classes in the training data. For example, if 99% of the wines in the data set were white, it would be trivial to obtain a model with 99% accuracy: the model could simply predict that every wine is white. Even though such a model has high overall accuracy, it is remarkably bad for red wines. We need some way to measure the "accuracy" of a model for a particular class.
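
The 99% example can be checked with a few lines of plain Python. The labels below are made up to match the scenario in the text, and the "model" is simply a list that predicts white for every wine.

```python
# Hypothetical imbalanced data: 99 white wines and 1 red wine
toy_true = ["white"] * 99 + ["red"]
# A trivial "model" that predicts white for every wine
toy_pred = ["white"] * len(toy_true)

# Overall accuracy: proportion of predictions that match the true label
toy_acc = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

# Proportion of the actual red wines that the model labeled as red
n_red = sum(t == "red" for t in toy_true)
n_red_found = sum(t == "red" and p == "red" for t, p in zip(toy_true, toy_pred))
red_found_rate = n_red_found / n_red

print(toy_acc, red_found_rate)
```

The overall accuracy is high, yet the model never identifies a red wine, which is exactly the weakness a per-class metric is meant to expose.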

Suppose we want to know the "accuracy" of our model for class $c$. There are two ways to interpret "accuracy for class $c$". Do we want to know the accuracy among the observations our model _predicted to be_ in class $c$ or the accuracy among the observations that _actually were_ in class $c$? The two options lead to two different notions of "accuracy" for class $c$: precision and recall.

The **precision** of a model for class $c$ is the proportion of observations predicted to be in class $c$ that actually were in class $c$.

$$ \textrm{precision for class } c = \frac{\# \{i:  \hat y_i = c \textrm{ and } y_i = c\}}{\# \{i: \hat y_i = c \}}. $$

The **recall** of a model for class $c$ is the proportion of observations actually in class $c$ that were predicted to be in class $c$.

$$ \textrm{recall for class } c = \frac{\# \{i:  \hat y_i = c \textrm{ and } y_i = c\}}{\# \{i: y_i = c \}}. $$

Another way to think about precision and recall is in terms of true positives (TP) and false positives (FP). A "positive" is an observation that the model identified as belonging to class $c$ (i.e., $\hat y_i = c$). A true positive is one that actually was in class $c$ (i.e., $\hat y_i = c$ and $y_i = c$), while a false positive is one that was not (i.e., $\hat y_i = c$ and $y_i \neq c$). True and false _negatives_ are defined analogously.

In the language of positives and negatives, the precision is the proportion of positives that are true positives:
$$ \textrm{precision for class } c = \frac{TP}{TP + FP}, $$
while the recall is the proportion of observations in class $c$ that are positives (as opposed to negatives):
$$ \textrm{recall for class } c = \frac{TP}{TP + FN}. $$
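
The two formulas can be written out directly in terms of counts. The sketch below uses a hypothetical helper, `precision_recall_from_counts`, and made-up labels. It assumes that class $c$ is predicted at least once and occurs at least once, since otherwise one of the denominators would be zero.

```python
def precision_recall_from_counts(y_true, y_pred, c):
    """Precision and recall for class c, computed from TP, FP, and FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == c and p == c for t, p in pairs)  # predicted c, actually c
    fp = sum(t != c and p == c for t, p in pairs)  # predicted c, actually not c
    fn = sum(t == c and p != c for t, p in pairs)  # actually c, predicted not c
    return tp / (tp + fp), tp / (tp + fn)

# Made-up true and predicted labels for 10 wines
toy_true = ["red", "red", "red", "red",
            "white", "white", "white", "white", "white", "white"]
toy_pred = ["red", "red", "red", "white",
            "red", "red", "white", "white", "white", "white"]

red_prec, red_rec = precision_recall_from_counts(toy_true, toy_pred, "red")
white_prec, white_rec = precision_recall_from_counts(toy_true, toy_pred, "white")
print(red_prec, red_rec, white_prec, white_rec)
```

Here the red class has 3 true positives, 2 false positives, and 1 false negative, so the precision and recall for red differ from each other and from those for white.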

The diagram below may help you to remember which numbers go in the numerator and denominator. The precision is the proportion of the red rectangle that is a TP, while the recall is the proportion of the red circle that is a TP.

![](https://github.com/nmattei/cmps3160/blob/master/_labs/images/precision_recall.png?raw=1)

Let's calculate the precision and recall of our $9$-nearest neighbors model for the "red" class:

In [None]:
true_positives = ((y_train_pred == "red") & (y_train == "red")).sum()

precision = true_positives / (y_train_pred == "red").sum()
recall = true_positives / (y_train == "red").sum()

precision, recall

Again, you can have Scikit-Learn calculate precision and recall for you. These functions work similarly to `accuracy_score` above, except we have to explicitly specify the class for which we want the precision and recall. For example, to calculate the precision and recall of the model for red wines:

In [None]:
from sklearn.metrics import precision_score, recall_score

(precision_score(y_train, y_train_pred, pos_label="red"),
 recall_score(y_train, y_train_pred, pos_label="red"))

It is important to specify `pos_label` because the precision and recall for other classes may be quite different:

In [None]:
(precision_score(y_train, y_train_pred, pos_label="white"),
 recall_score(y_train, y_train_pred, pos_label="white"))

In general, there is a tradeoff between precision and recall. For example, you can improve recall by predicting more observations to be in class $c$, but this will hurt precision. To take an extreme example, a model that predicts that _every_ observation is in class $c$ has 100% recall, but its precision would likely be poor. To visualize this phenomenon, suppose we expand the positives from the dashed circle to the solid circle, as shown in the figure below, at right. This increases recall (because the circle now covers more of the red rectangle) but decreases precision (because the red rectangle now makes up a smaller fraction of the circle).

![](https://github.com/nmattei/cmps3160/blob/master/_labs/images/precision_recall_tradeoff.png?raw=1)

Likewise, you can improve precision by predicting fewer observations to be in class $c$ (i.e., only the ones you are very confident about), but this will hurt recall. This is illustrated in the figure above, at left.
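
One way to see the tradeoff numerically is to predict "red" only when the predicted probability of red is at least some threshold, and then vary the threshold. The probabilities, labels, and the helper `precision_recall_at` below are all made up for illustration.

```python
# Made-up predicted probabilities of "red" and the corresponding true labels
probs_red = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_true = ["red", "red", "white", "red", "white",
            "red", "white", "white", "white", "white"]

def precision_recall_at(threshold):
    """Predict "red" whenever the probability of red is at least `threshold`."""
    toy_pred = ["red" if p >= threshold else "white" for p in probs_red]
    tp = sum(t == "red" and p == "red" for t, p in zip(toy_true, toy_pred))
    return tp / toy_pred.count("red"), tp / toy_true.count("red")

# A lower threshold predicts more wines to be red
for threshold in (0.85, 0.5, 0.25):
    prec, rec = precision_recall_at(threshold)
    print(threshold, round(prec, 3), round(rec, 3))
```

Lowering the threshold can only add predicted reds, so recall never decreases. In this example precision falls at the same time, although in general precision does not have to change monotonically.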

# Validation Accuracy, Precision, and Recall in Scikit-Learn

We calculated the training accuracy of our classifier above. However, test accuracy is more useful in most cases. We can estimate the test accuracy by splitting our data into two sets: a **training set** and a **validation set**. The model will only be fit on the observations of the training set. Then, the model will be evaluated on the validation set. Because the validation set has not been seen by the model already, it essentially plays the role of "future" data, even though it was carved out of the current data.

To split our data into training and validation sets, we use the `.sample()` method in `pandas`. Let's use this to split our data into two equal halves. In the code below, the validation half is stored in variables named `X_test` and `y_test`.

In [None]:
X_train = wines.sample(frac=.5)
X_test = wines.drop(X_train.index)

y_train = X_train["color"]
y_test = X_test["color"]

# keep only the two feature columns (this also leaves out the "color" label)
X_train = X_train[["volatile acidity", "total sulfur dioxide"]]
X_test = X_test[["volatile acidity", "total sulfur dioxide"]]
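
The idea behind the split can be sketched without `pandas`: shuffle the row labels, then cut the shuffled list in half. The row labels and the seed below are hypothetical. Fixing a seed makes the split reproducible, and the `pandas` `.sample()` method accepts a `random_state` argument for the same purpose.

```python
import random

# Hypothetical row labels standing in for the rows of a data set
row_ids = list(range(10))

# A fixed seed makes the shuffle, and therefore the split, reproducible
rng = random.Random(0)
shuffled = row_ids[:]
rng.shuffle(shuffled)

# First half for training, second half for validation
half = len(shuffled) // 2
train_ids = sorted(shuffled[:half])
val_ids = sorted(shuffled[half:])
print(train_ids, val_ids)
```

Every row ends up in exactly one of the two sets, which is what lets the validation set stand in for data the model has not seen.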

Now let's use this training/validation split to approximate the test error of our model.

First, we use Scikit-Learn to preprocess the training and the validation data. Note that the scaler is fit to the training data only, so we learn the mean and standard deviation from the training set and use these to transform both the training and validation sets.

In [None]:
# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)

Now, we can fit the KNN model to our scaled training data and make predictions for both our training and validation datasets.

In [None]:
# fit the 9-nearest neighbors model
model.fit(X_train_sc, y_train)

# predict the training data
y_train_pred = model.predict(X_train_sc)

# predict the validation data
y_test_pred = model.predict(X_test_sc)

Finally, we can calculate the accuracy for both the training and validation sets.

In [None]:
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)

print("Training accuracy: {}".format(train_acc))
print("Validation accuracy: {}".format(test_acc))

The validation accuracy is still high, but typically somewhat lower than the training accuracy (the exact numbers depend on the random split). This makes sense because it is generally harder to predict for future data than for data the model has already seen.

If we want to calculate the precision and recall for red wine in our validation set, we can use the `precision_score` and `recall_score` functions as described above.

In [None]:
test_recall = recall_score(y_test, y_test_pred, pos_label="red")
test_precision = precision_score(y_test, y_test_pred, pos_label="red")

print("Validation recall: {}".format(test_recall))
print("Validation precision: {}".format(test_precision))

When compared with the training recall and precision (below), the validation recall and precision are usually lower. Again, this makes sense, as the model has not seen this data before.

In [None]:
train_recall = recall_score(y_train, y_train_pred, pos_label="red")
train_precision = precision_score(y_train, y_train_pred, pos_label="red")

print("Training recall: {}".format(train_recall))
print("Training precision: {}".format(train_precision))

# F1 Score: Combining Precision and Recall

We have replaced accuracy by two numbers: precision and recall. We can combine the precision and recall into a single number, called the **F1 score**.

The F1 score is defined to be the **harmonic mean** of the precision and the recall. That is,

$$ \frac{1}{\text{F1 score}} = \frac{ \frac{1}{\text{precision}} + \frac{1}{\text{recall}}}{2}, $$

or equivalently,

$$ \text{F1 score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. $$

The harmonic mean of two numbers is always between the two numbers, but in general will be closer to the smaller number. For example, if precision is $90\%$ and recall is $10\%$, then the harmonic mean is

$$ \text{F1 score} = \frac{2 \cdot 0.9 \cdot 0.1}{0.9 + 0.1} = 18\%. $$

This is a desirable property of F1 scores because we want to encourage models to have both high precision _and_ high recall. It is not sufficient for one of these to be high if the other is very low. In other words, we do not want to allow a high precision to cancel out a low recall, or vice versa.
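
To make the comparison concrete, the sketch below contrasts the ordinary (arithmetic) mean with the harmonic mean for the example values above, using Python's built-in `statistics` module. The short names `p` and `r` are used only for this illustration.

```python
import statistics

# Example values from the text: high precision, low recall
p, r = 0.9, 0.1

arithmetic = statistics.mean([p, r])         # lets the high value mask the low one
harmonic = statistics.harmonic_mean([p, r])  # pulled toward the smaller value
f1_from_formula = 2 * p * r / (p + r)        # same quantity, via the formula above

print(arithmetic, harmonic, f1_from_formula)
```

The arithmetic mean would report 50%, while the harmonic mean reports 18%, which reflects the poor recall much more clearly.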

The F1 score for red wines is:

In [None]:
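# use the training precision and recall computed after the train/validation split,
# so that this matches the labels passed to f1_score below
2 * train_precision * train_recall / (train_precision + train_recall)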

We could have also asked Scikit-Learn to calculate this for us. If we know the actual and predicted labels, we can use the `f1_score` function, which works similarly to `precision_score` and `recall_score` from above:

In [None]:
from sklearn.metrics import f1_score

f1_score(y_train, y_train_pred, pos_label="red")

# Part 2: Exercises

Exercises 3-6 ask you to use the Titanic data set (`../data/titanic.csv`) to train various classifiers.

## Exercise 3

Build a 5-nearest neighbors model to predict whether or not a passenger on the Titanic would survive, using their age, sex, and class as features. Use the Titanic data set (`../data/titanic.csv`) as your data. Split the data into _training_ and _validation_ sets, with half the data in each set. Calculate the _training_ accuracy, precision, and recall of this model for survivors.

In [None]:
# TYPE YOUR CODE HERE.

## Exercise 4

Estimate the _test_ accuracy, precision, and recall of your model for survivors.

In [None]:
# TYPE YOUR CODE HERE.

## Exercise 5

Use your model to predict whether a 20-year-old female in first class would survive. What about a 20-year-old female in third class?

In [None]:
# TYPE YOUR CODE HERE.

## Exercise 6

You want to build a $k$-nearest neighbors model to predict where a Titanic passenger embarked, using their age, sex, and class.

- What value of $k$ optimizes overall _test_ accuracy?
- What value of $k$ optimizes the _test_ F1 score for Southampton?
- Does the same value of $k$ optimize accuracy and the F1 score?


**Hints:**
1. Southampton refers to the `S` class in the embarked variable (what we are predicting). Unlike the previous examples, where you predicted binary variables (2 classes), this exercise is a multi-class classification problem (embarked has 3 categories: `S`, `Q`, and `C`).
2. Since the second part of this exercise asks you to calculate the F1 score for only the Southampton class, you may find this sklearn function useful: [`f1_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). Please read through its documentation; it should help you with this exercise.

In [None]:
# TYPE YOUR CODE HERE.

**Written Answers**:

**When you have filled out all the questions, submit via [Tulane Canvas](https://tulane.instructure.com/)**