# Worksheet - Classification (Part II)

### Learning Goals:

After completing this workshop session, you will be able to:

* Describe what a test data set is and how it is used in classification.
* Understand several ways of representing classifier performance: accuracy, precision, recall, and the confusion matrix.
* Using Python, evaluate classifier performance using a test data set and appropriate metrics.
* Using Python, execute cross-validation to choose the number of neighbours.
* Identify when it is necessary to scale variables before classification and do this using Python.
* In a dataset with > 2 attributes, perform k-nearest neighbour classification in Python using the `scikit-learn` package to predict the class of a test dataset.
* Describe advantages and disadvantages of the k-nearest neighbour classification algorithm.

This worksheet covers parts of [Chapter 6](https://python.datasciencebook.ca/classification2) of the online textbook. You should read this chapter before attempting this assignment. Any place you see `___`, you must fill in the function, variable, or data to complete the code. Replace `raise NotImplementedError` with your completed code and answers, then run the cell!

In [None]:
### Run this cell before continuing.
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

**Question 0.1** Multiple Choice:

The confusion matrix is:

A. A way to confuse you.

B. A table where rows correspond to predicted class and columns correspond to true class.

C. Each cell in the confusion matrix displays the number of observations with a particular predicted/true class as given by the row and column labels.

D. Is an important tool for understanding what type of mistakes a classifier makes and how often these mistakes happen.

E. All of the above except A.

*Assign your answer to an object called `answer0_1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`)*.

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_1)).encode("utf-8")+b"b9cdb").hexdigest() == "351b6021c8f87571484eee4e4c7bf0a0b7ed21fc", "type of answer0_1 is not str. answer0_1 should be an str"
assert sha1(str(len(answer0_1)).encode("utf-8")+b"b9cdb").hexdigest() == "439859842b2ff5343f142aebc166bbd2f5c1ba63", "length of answer0_1 is not correct"
assert sha1(str(answer0_1.lower()).encode("utf-8")+b"b9cdb").hexdigest() == "738521fce4199fe135093d56bdcd11e5af3694cb", "value of answer0_1 is not correct"
assert sha1(str(answer0_1).encode("utf-8")+b"b9cdb").hexdigest() == "f04c8abf26aed0bfc1e3cac0304301b24f47827f", "correct string value of answer0_1 but incorrect case of letters"

print('Success!')


**Question 0.2** Multiple Choice:

Precision and recall are ways to summarize the confusion matrix. What is something we must do before calculating precision and recall?

A. Turn the values (counts of observations) appearing in each cell of the table into a proportion. 

B. Choose one of the class labels as being more interesting and equate that with the "positive" label.

C. Flip the column and rows of the matrix.

D. None of the above.

*Assign your answer to an object called `answer0_2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`)*.

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_2)).encode("utf-8")+b"40f78").hexdigest() == "88fbc36d74dbc11e3dc49cd606c9d12aab8dadde", "type of answer0_2 is not str. answer0_2 should be an str"
assert sha1(str(len(answer0_2)).encode("utf-8")+b"40f78").hexdigest() == "8f13c1689337a5e773751fc0a20a001db5754a5a", "length of answer0_2 is not correct"
assert sha1(str(answer0_2.lower()).encode("utf-8")+b"40f78").hexdigest() == "904faf5c0cc90042d3f8b8e6064bae76b809e406", "value of answer0_2 is not correct"
assert sha1(str(answer0_2).encode("utf-8")+b"40f78").hexdigest() == "a586864d6d403b52cf8043aaeda3ab9125c88230", "correct string value of answer0_2 but incorrect case of letters"

print('Success!')

## 1. Fruit Data Example - (Part II)
**Question 1.0** 

In the agricultural industry, cleaning, sorting, grading, and packaging food products are all necessary tasks in the post-harvest process. Products are classified based on appearance, size, and shape, attributes which help determine the quality of the food. Sorting can be done by humans, but it is tedious and time consuming. Automatic sorting could help save time and money. Images of the food products are captured and analysed to determine visual characteristics.

The [dataset](https://www.kaggle.com/mjamilmoughal/k-nearest-neighbor-classifier-to-predict-fruits/notebook) contains observations of fruit described with four features: (1) mass (in g), (2) width (in cm), (3) height (in cm), and (4) color score (on a scale from 0 - 1).

To get started building a classifier that can classify a fruit based on its appearance, use `pd.read_csv` to load the file `fruit_data.csv` (found in the data folder) from the previous tutorial into your notebook.

*Assign your data to an object called `fruit_data`.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_data is None)).encode("utf-8")+b"8fbd4").hexdigest() == "94dbd13369ff838be1cc37ddd2571e56a2adbad6", "type of fruit_data is None is not bool. fruit_data is None should be a bool"
assert sha1(str(fruit_data is None).encode("utf-8")+b"8fbd4").hexdigest() == "32b4f132c7e4d7fd9a4568ae950c0ade0f91c972", "boolean value of fruit_data is None is not correct"

assert sha1(str(type(fruit_data.shape)).encode("utf-8")+b"8fbd5").hexdigest() == "8bc91c167da94224c2c03ff132c43919b3ec3977", "type of fruit_data.shape is not tuple. fruit_data.shape should be a tuple"
assert sha1(str(len(fruit_data.shape)).encode("utf-8")+b"8fbd5").hexdigest() == "e95430ca7bc9a404e2b167dfe8a30b4055fc9bcd", "length of fruit_data.shape is not correct"
assert sha1(str(sorted(map(str, fruit_data.shape))).encode("utf-8")+b"8fbd5").hexdigest() == "89046d3b818261b39a3c972f538928b8817bb9ba", "values of fruit_data.shape are not correct"
assert sha1(str(fruit_data.shape).encode("utf-8")+b"8fbd5").hexdigest() == "9a07d2fbcd6742a34bdf9d9a0f7b4d90d70fe81b", "order of elements of fruit_data.shape is not correct"

assert sha1(str(type(fruit_data.fruit_name.dtype)).encode("utf-8")+b"8fbd6").hexdigest() == "77c18e2fcbfeb18070e7059fdc3a6049fd40025b", "type of fruit_data.fruit_name.dtype is not correct"
assert sha1(str(fruit_data.fruit_name.dtype).encode("utf-8")+b"8fbd6").hexdigest() == "226ebd072ebdd8910b1c851f48d365fc91a37987", "value of fruit_data.fruit_name.dtype is not correct"

print('Success!')

Let's take a look at the first few observations in the fruit dataset. Run the cell below.

In [None]:
# Run this cell.
fruit_data.head()

Now let's investigate the class proportions for each kind of fruit:

In [None]:
fruit_data['fruit_name'].value_counts(normalize=True)

## Randomness and Setting Seeds

This worksheet uses functions from the `scikit-learn` library, which not only allows us to perform K-nearest neighbour classification, but also allows us to evaluate how well our classification worked. In order to ensure that the steps in the worksheet are reproducible, we need to set a *`random_state`* or *random seed*, i.e., a numerical "starting value," which determines the sequence of random numbers Python will generate.

Many of the cells below include a `random_state` argument or a call to `np.random.seed`. These are necessary to make sure the autotesting code functions properly.
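
As a minimal sketch of what a seed does (the seed value `2024` below is an arbitrary, hypothetical choice and is not used anywhere else in this worksheet), re-seeding NumPy's generator with the same value replays the same sequence of "random" numbers:

```python
import numpy as np

# Seeding the global NumPy generator fixes the sequence of "random" draws.
np.random.seed(2024)  # arbitrary, hypothetical seed value
first_draw = np.random.randint(0, 100, size=5)

# Re-seeding with the same value replays exactly the same sequence.
np.random.seed(2024)
second_draw = np.random.randint(0, 100, size=5)

print(first_draw)
print(second_draw)
print(np.array_equal(first_draw, second_draw))  # True
```

The `random_state` argument of `scikit-learn` functions plays the same role for a single function call.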

## 2. Splitting the data into a training and test set

In this exercise, we will be partitioning `fruit_data` into a training (75%) and testing (25%) set using the `scikit-learn` package. After creating the test set, we will put the test set away in a lock box and not touch it again until we have found the best k-nn classifier we can make using the training set. We will use the variable `fruit_name` as our class label. 
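
To illustrate the idea on something small, here is a sketch that splits a made-up data frame (the `toy` data, its column names, and `random_state=1` are all hypothetical and unrelated to the fruit data) into 75% training and 25% test rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 observations, one predictor, one class label.
toy = pd.DataFrame({
    "size": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],
    "label": ["small"] * 4 + ["large"] * 4,
})

# Hold out 25% of the rows; random_state makes the split reproducible.
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=1)

# 6 rows end up in the training set and 2 rows in the test set.
print(toy_train.shape, toy_test.shape)
```

Every row lands in exactly one of the two sets, so the test rows play no part in anything computed from the training rows.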


**Question 2.0**

To create the training and test sets, use the `train_test_split` function from the `scikit-learn` package. Save the training dataset and test dataset as `fruit_train` and `fruit_test`, respectively.

In [None]:
# Randomly place 75% of the data in the training set.
# Note: no `stratify` argument is used here, so the split is not
# guaranteed to preserve the class proportions of the full dataset.

# ___, ___ = train_test_split(___, test_size=___, random_state=123) # set the random state to be 123

# your code here
raise NotImplementedError
fruit_train

In [None]:
fruit_test

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_train is None)).encode("utf-8")+b"ef8b1").hexdigest() == "9f77123ff7006f15c739b7b78f5f2d4de80459f5", "type of fruit_train is None is not bool. fruit_train is None should be a bool"
assert sha1(str(fruit_train is None).encode("utf-8")+b"ef8b1").hexdigest() == "7ed9c577a01bef83e68375a03929f6c1cba7755b", "boolean value of fruit_train is None is not correct"

assert sha1(str(type(fruit_test is None)).encode("utf-8")+b"ef8b2").hexdigest() == "b0edefb6186b34edf34d83f7bcf5cbb59e46ea56", "type of fruit_test is None is not bool. fruit_test is None should be a bool"
assert sha1(str(fruit_test is None).encode("utf-8")+b"ef8b2").hexdigest() == "96404ff84888a103c5f891759de786517fe317e2", "boolean value of fruit_test is None is not correct"

assert sha1(str(type(fruit_train.shape)).encode("utf-8")+b"ef8b3").hexdigest() == "ad58db68d7f455473cab66cff03bdba319b16f11", "type of fruit_train.shape is not tuple. fruit_train.shape should be a tuple"
assert sha1(str(len(fruit_train.shape)).encode("utf-8")+b"ef8b3").hexdigest() == "1d7f9a2b05a377135a71d8c34ecb0a5926601bb4", "length of fruit_train.shape is not correct"
assert sha1(str(sorted(map(str, fruit_train.shape))).encode("utf-8")+b"ef8b3").hexdigest() == "9e1e7bdacecb3a487b64e0f64d1f61366a6ea4fe", "values of fruit_train.shape are not correct"
assert sha1(str(fruit_train.shape).encode("utf-8")+b"ef8b3").hexdigest() == "342376346f8cf46351b13bed9b0160a140dd9fbb", "order of elements of fruit_train.shape is not correct"

assert sha1(str(type(fruit_test.shape)).encode("utf-8")+b"ef8b4").hexdigest() == "c00a1e9b6f64d1629e4f464c1935eda7b0f96279", "type of fruit_test.shape is not tuple. fruit_test.shape should be a tuple"
assert sha1(str(len(fruit_test.shape)).encode("utf-8")+b"ef8b4").hexdigest() == "6c6e5d05f1db9c0aef7b920ba2cfd609e1a2d2d1", "length of fruit_test.shape is not correct"
assert sha1(str(sorted(map(str, fruit_test.shape))).encode("utf-8")+b"ef8b4").hexdigest() == "ade72264f11f0c795f1ca06d6df199248d2fda40", "values of fruit_test.shape are not correct"
assert sha1(str(fruit_test.shape).encode("utf-8")+b"ef8b4").hexdigest() == "191ee842577393f5745b2c256304dcdd2895b4a6", "order of elements of fruit_test.shape is not correct"

assert sha1(str(type(sum(fruit_train.mass))).encode("utf-8")+b"ef8b5").hexdigest() == "d43052ac9a4c7109e3fd472091a2a67541dc4590", "type of sum(fruit_train.mass) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(fruit_train.mass)).encode("utf-8")+b"ef8b5").hexdigest() == "03b793f8288b890907d114552b291e4cb42a95f0", "value of sum(fruit_train.mass) is not correct"

assert sha1(str(type(sum(fruit_test.mass))).encode("utf-8")+b"ef8b6").hexdigest() == "21310037fbeb70908577535bbc83ff4f6b2a975a", "type of sum(fruit_test.mass) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(fruit_test.mass)).encode("utf-8")+b"ef8b6").hexdigest() == "baabdedf68419227ffd00e67f81a343e9c77d3e5", "value of sum(fruit_test.mass) is not correct"

print('Success!')

**Question 2.1** 

K-nearest neighbors is sensitive to the scale of the predictors so we should do some preprocessing to standardize them. Remember that standardizing involves centering/shifting (subtracting the mean of each variable) and scaling (dividing by its standard deviation). Also remember that standardization is *part of your training procedure*, so you can't use your test data to compute the centered / scaled values for each variable. Therefore, you must pass only the training data to your preprocessor to compute the preprocessing steps. This ensures that our test data does not influence any aspect of our model training. Once we have created the standardization preprocessor, we can then later on apply it separately to both the training and test data sets.
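
The following sketch shows the principle on made-up numbers (the arrays `X_train_toy` and `X_test_toy` are hypothetical and are not the fruit data). The scaler learns its mean and standard deviation from the training rows only, and those same values are then applied to the test rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy predictors: column 0 in the hundreds, column 1 in [0, 1].
X_train_toy = np.array([[100.0, 0.2], [150.0, 0.4], [200.0, 0.6], [250.0, 0.8]])
X_test_toy = np.array([[175.0, 0.5], [300.0, 0.9]])

# Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train_toy)

# Apply the same training-based shift and scale to both sets.
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)

# The test rows are standardized with the *training* statistics.
manual_test = (X_test_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)
print(np.allclose(X_test_scaled, manual_test))  # True
```

After scaling, both training columns have mean 0 and standard deviation 1, so neither predictor dominates the distance calculation simply because of its units.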

For this exercise, let's see if `mass` and `color_score` can predict `fruit_name`. 

To scale and center the data, first pass a `StandardScaler()` together with the names of the predictor columns to the `make_column_transformer` function to make the preprocessor.

*Assign your answer to an object called `fruit_preprocessor`.*

In [None]:
# ___ = make_column_transformer(
#     (___, [___, ___]),
#     verbose_feature_names_out=False
# )

# your code here
raise NotImplementedError
fruit_preprocessor

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_preprocessor is None)).encode("utf-8")+b"36d09").hexdigest() == "c16043f1e011ae5e24d73f767d244e153de052d1", "type of fruit_preprocessor is None is not bool. fruit_preprocessor is None should be a bool"
assert sha1(str(fruit_preprocessor is None).encode("utf-8")+b"36d09").hexdigest() == "a36c675b8869eb47d14d616bcfc17bf0679c1480", "boolean value of fruit_preprocessor is None is not correct"

assert sha1(str(type(type(fruit_preprocessor))).encode("utf-8")+b"36d0a").hexdigest() == "5b4a6271e3d19f04ad8e3a02389bdde94d645e8f", "type of type(fruit_preprocessor) is not correct"
assert sha1(str(type(fruit_preprocessor)).encode("utf-8")+b"36d0a").hexdigest() == "33fc23392075e2159cb275f5b072c2bd3eef06f2", "value of type(fruit_preprocessor) is not correct"

assert sha1(str(type(fruit_preprocessor.transformers[0][0])).encode("utf-8")+b"36d0b").hexdigest() == "c2ee407b203b6e1c1faf4f5bbefb51099c64d2fa", "type of fruit_preprocessor.transformers[0][0] is not str. fruit_preprocessor.transformers[0][0] should be an str"
assert sha1(str(len(fruit_preprocessor.transformers[0][0])).encode("utf-8")+b"36d0b").hexdigest() == "b66e860a0a8f53af64caf60afdbf123a56eddbfa", "length of fruit_preprocessor.transformers[0][0] is not correct"
assert sha1(str(fruit_preprocessor.transformers[0][0].lower()).encode("utf-8")+b"36d0b").hexdigest() == "b0dd09d0d913f060dfae366a65bdc73cc4c6dbe9", "value of fruit_preprocessor.transformers[0][0] is not correct"
assert sha1(str(fruit_preprocessor.transformers[0][0]).encode("utf-8")+b"36d0b").hexdigest() == "b0dd09d0d913f060dfae366a65bdc73cc4c6dbe9", "correct string value of fruit_preprocessor.transformers[0][0] but incorrect case of letters"

assert sha1(str(type(fruit_preprocessor.transformers[0][2])).encode("utf-8")+b"36d0c").hexdigest() == "462915beb3245eb3f238c0e6100eba8c60d5eeeb", "type of fruit_preprocessor.transformers[0][2] is not list. fruit_preprocessor.transformers[0][2] should be a list"
assert sha1(str(len(fruit_preprocessor.transformers[0][2])).encode("utf-8")+b"36d0c").hexdigest() == "dbcd1005f772edc467986417157dca8731db4c76", "length of fruit_preprocessor.transformers[0][2] is not correct"
assert sha1(str(sorted(map(str, fruit_preprocessor.transformers[0][2]))).encode("utf-8")+b"36d0c").hexdigest() == "280af5a8b2775780fabf4d84c5f5db2cfdbdbe56", "values of fruit_preprocessor.transformers[0][2] are not correct"
assert sha1(str(fruit_preprocessor.transformers[0][2]).encode("utf-8")+b"36d0c").hexdigest() == "2b1472a2c7fea5769561b8e3eef648c0052e5e8a", "order of elements of fruit_preprocessor.transformers[0][2] is not correct"

print('Success!')

Now that we have split the data, we can do things like exploratory data analysis and model fitting. Before we move on to the latter, run the cell below to visualize two of the variables (mass in grams and width in cm) as a scatter plot, colouring the observations by their class labels.

In [None]:
# Create the scatterplot
fruit_chart = alt.Chart(fruit_data).mark_point(size=15).encode(
    x=alt.X("mass").title("Mass (grams)"),
    y=alt.Y("width")
        .title("Width (cm)")
        .scale(zero=False),
    color=alt.Color("fruit_name").title("Fruit")
)

fruit_chart

**Question 2.2**

So far, we have split the data into training and testing sets and created a preprocessor to standardize the predictors. Now, let's create our K-nearest neighbour classifier with only the training set using the `scikit-learn` package. First, create the classifier by specifying that we want $K = 3$ neighbors. *Assign your answer to an object called `knn_spec`*.

Next, separate the predictor columns from the target column. Name the predictor variable `X` and the target `y`. 

Train the classifier with the training data set using the `make_pipeline` and `fit` functions. The `make_pipeline` function allows you to bundle together your pre-processing, modeling, and post-processing requests. Scaffolding is provided below for you.

*Assign your answer to an object called `fruit_fit`*.
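
For orientation only, here is a sketch of the same bundling pattern on made-up data. The data frame `toy_train` and its columns `length`, `shade`, and `kind` are hypothetical and are not part of the fruit dataset:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy training data with two well-separated classes.
toy_train = pd.DataFrame({
    "length": [10.0, 11.0, 12.0, 30.0, 31.0, 32.0],
    "shade": [0.10, 0.15, 0.20, 0.80, 0.85, 0.90],
    "kind": ["a", "a", "a", "b", "b", "b"],
})

# Standardize both predictor columns.
toy_preprocessor = make_column_transformer(
    (StandardScaler(), ["length", "shade"]),
)
toy_knn = KNeighborsClassifier(n_neighbors=3)

# Fitting the pipeline fits the scaler first and then the classifier.
toy_fit = make_pipeline(toy_preprocessor, toy_knn).fit(
    toy_train[["length", "shade"]], toy_train["kind"]
)

# A new observation is scaled with the training statistics before voting.
new_obs = pd.DataFrame({"length": [11.5], "shade": [0.18]})
print(toy_fit.predict(new_obs))
```

Because the preprocessor lives inside the pipeline, any data later passed to `predict` is standardized automatically with the values learned during `fit`.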

In [None]:
# ___ = KNeighborsClassifier(n_neighbors=___)

# ___ = ___[["mass", "color_score"]]
# ___ = fruit_train[___]

# ___ = make_pipeline(___, ___).fit(___, ___)

# your code here
raise NotImplementedError
fruit_fit

In [None]:
from hashlib import sha1
assert sha1(str(type(knn_spec is None)).encode("utf-8")+b"27e07").hexdigest() == "e2d16a9547e72a3ae1e7aef7c84c6c0c587bad37", "type of knn_spec is None is not bool. knn_spec is None should be a bool"
assert sha1(str(knn_spec is None).encode("utf-8")+b"27e07").hexdigest() == "a15dcfacc29248935e7777a673f2a5eb8c9afe27", "boolean value of knn_spec is None is not correct"

assert sha1(str(type(type(knn_spec))).encode("utf-8")+b"27e08").hexdigest() == "8192fb1b7fe569d5b3c2f3eab3fb65464e75bfb1", "type of type(knn_spec) is not correct"
assert sha1(str(type(knn_spec)).encode("utf-8")+b"27e08").hexdigest() == "69a8d47f5a15872b58b9d89610eb519425f3fde2", "value of type(knn_spec) is not correct"

assert sha1(str(type(knn_spec.effective_metric_)).encode("utf-8")+b"27e09").hexdigest() == "8a3cc99092881053a4372e18346c2e25f3f9c98e", "type of knn_spec.effective_metric_ is not str. knn_spec.effective_metric_ should be an str"
assert sha1(str(len(knn_spec.effective_metric_)).encode("utf-8")+b"27e09").hexdigest() == "70f6652e73aa1506c93346ed63d995b98fab1b15", "length of knn_spec.effective_metric_ is not correct"
assert sha1(str(knn_spec.effective_metric_.lower()).encode("utf-8")+b"27e09").hexdigest() == "ca0f4ad54d67bcd18ff83d85539d213a6136c949", "value of knn_spec.effective_metric_ is not correct"
assert sha1(str(knn_spec.effective_metric_).encode("utf-8")+b"27e09").hexdigest() == "ca0f4ad54d67bcd18ff83d85539d213a6136c949", "correct string value of knn_spec.effective_metric_ but incorrect case of letters"

assert sha1(str(type(knn_spec.n_neighbors)).encode("utf-8")+b"27e0a").hexdigest() == "b053dd855a7ee23c1a8f2db8de53400908f659ab", "type of knn_spec.n_neighbors is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(knn_spec.n_neighbors).encode("utf-8")+b"27e0a").hexdigest() == "b726c76a82a1e689bd8bf1a87e14148b1e1ffa28", "value of knn_spec.n_neighbors is not correct"

assert sha1(str(type(sum(X.mass))).encode("utf-8")+b"27e0b").hexdigest() == "1d6efa7539e497d2785a27892fb55925101adc69", "type of sum(X.mass) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(X.mass)).encode("utf-8")+b"27e0b").hexdigest() == "acf33445c79fe8216094e045df98b2d792ffd530", "value of sum(X.mass) is not correct"

assert sha1(str(type(sum(X.color_score))).encode("utf-8")+b"27e0c").hexdigest() == "9d9694d1c2272d3f4d59434862483d10ab72a496", "type of sum(X.color_score) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(X.color_score), 2)).encode("utf-8")+b"27e0c").hexdigest() == "ae307b97500019064af59ef4e7b7ed62c89d5454", "value of sum(X.color_score) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(y.name)).encode("utf-8")+b"27e0d").hexdigest() == "6abe773b05068bd331b9458fe572e23efd0ba854", "type of y.name is not str. y.name should be an str"
assert sha1(str(len(y.name)).encode("utf-8")+b"27e0d").hexdigest() == "90f4248559ecfae4d207bd46db84457909a6af5f", "length of y.name is not correct"
assert sha1(str(y.name.lower()).encode("utf-8")+b"27e0d").hexdigest() == "92c4a59d1166ec03a74d8d9e0e7550629d7e35d5", "value of y.name is not correct"
assert sha1(str(y.name).encode("utf-8")+b"27e0d").hexdigest() == "92c4a59d1166ec03a74d8d9e0e7550629d7e35d5", "correct string value of y.name but incorrect case of letters"

assert sha1(str(type(fruit_fit is None)).encode("utf-8")+b"27e0e").hexdigest() == "15099bad8aa8fd88e3ed8b0336fcd45b8f39af37", "type of fruit_fit is None is not bool. fruit_fit is None should be a bool"
assert sha1(str(fruit_fit is None).encode("utf-8")+b"27e0e").hexdigest() == "cc6d59b8d76d220ac7ceeac94e7f984f2f6aacdb", "boolean value of fruit_fit is None is not correct"

assert sha1(str(type(type(fruit_fit))).encode("utf-8")+b"27e0f").hexdigest() == "7b4a4056a45db874cc809a3221d59cca0c09005c", "type of type(fruit_fit) is not correct"
assert sha1(str(type(fruit_fit)).encode("utf-8")+b"27e0f").hexdigest() == "dd6490f704dd393427bfc2cc32a050c152dd221d", "value of type(fruit_fit) is not correct"

assert sha1(str(type(len(fruit_fit.named_steps))).encode("utf-8")+b"27e10").hexdigest() == "2f62b5548ecff5633a52f4591fa7a95747b18374", "type of len(fruit_fit.named_steps) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(len(fruit_fit.named_steps)).encode("utf-8")+b"27e10").hexdigest() == "da25296125b548d8848520ab24f87345dd290640", "value of len(fruit_fit.named_steps) is not correct"

assert sha1(str(type(fruit_fit.named_steps.keys())).encode("utf-8")+b"27e11").hexdigest() == "281cfae204ddee5837c5ebe161f9bcdcaa605d2e", "type of fruit_fit.named_steps.keys() is not correct"
assert sha1(str(fruit_fit.named_steps.keys()).encode("utf-8")+b"27e11").hexdigest() == "6287d8aeb5a0ee86fcdc595d4a94d5ed18c48017", "value of fruit_fit.named_steps.keys() is not correct"

print('Success!')

**Question 2.3**

Now that we have created our K-nearest neighbor classifier object, let's predict the class labels for our test set.

We want to make sure to `assign` the predicted class labels to a new column in the dataframe, called `predicted`. To create the predicted class labels, call the `predict` method of your fitted model pipeline on the predictor columns of the **test dataset**.

*Assign your answer to an object called `fruit_test_predictions`.*

In [None]:
# ___ = fruit_test.___(
#     predicted=___.predict(___[[___, ___]])
# )

# your code here
raise NotImplementedError
fruit_test_predictions

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_test_predictions is None)).encode("utf-8")+b"2f2bd").hexdigest() == "2b270d171753a33179223403dd1ff41c35a42b87", "type of fruit_test_predictions is None is not bool. fruit_test_predictions is None should be a bool"
assert sha1(str(fruit_test_predictions is None).encode("utf-8")+b"2f2bd").hexdigest() == "a4451e8e45f322126a6fbc7a90c20ae3a5bee464", "boolean value of fruit_test_predictions is None is not correct"

assert sha1(str(type(fruit_test_predictions)).encode("utf-8")+b"2f2be").hexdigest() == "11449a9d181d481214112ee5255a27c33852aaac", "type of type(fruit_test_predictions) is not correct"

assert sha1(str(type(fruit_test_predictions.shape)).encode("utf-8")+b"2f2bf").hexdigest() == "2d059656a62390dea1fb448bbdc215c77a9817f5", "type of fruit_test_predictions.shape is not tuple. fruit_test_predictions.shape should be a tuple"
assert sha1(str(len(fruit_test_predictions.shape)).encode("utf-8")+b"2f2bf").hexdigest() == "c98f20a3aacd58713ce44462685524951a0d6beb", "length of fruit_test_predictions.shape is not correct"
assert sha1(str(sorted(map(str, fruit_test_predictions.shape))).encode("utf-8")+b"2f2bf").hexdigest() == "269abef7b85b70a76ac2028e7198760edbf0102b", "values of fruit_test_predictions.shape are not correct"
assert sha1(str(fruit_test_predictions.shape).encode("utf-8")+b"2f2bf").hexdigest() == "ccdebed5e7deb5b68c4991d2b4130e87ba2f5bb8", "order of elements of fruit_test_predictions.shape is not correct"

assert sha1(str(type("predicted" in fruit_test_predictions.columns)).encode("utf-8")+b"2f2c0").hexdigest() == "95b9c71897a9496eda9fa4e37c40745a4b6017c6", "type of \"predicted\" in fruit_test_predictions.columns is not bool. \"predicted\" in fruit_test_predictions.columns should be a bool"
assert sha1(str("predicted" in fruit_test_predictions.columns).encode("utf-8")+b"2f2c0").hexdigest() == "c3cace2eeb4178326381514950fac3f2606c7de7", "boolean value of \"predicted\" in fruit_test_predictions.columns is not correct"

print('Success!')

**Question 2.4**

Great! We have now computed some predictions for our test dataset! From glancing at the dataframe above, it looks like most of them are correct, but wouldn't it be interesting if we could find out the exact accuracy of our classifier?

Thankfully, the `score` method from the `scikit-learn` package can help us. To measure the quality of our model, call `score` on the `fruit_fit` pipeline. First separate the test set into predictors and target, naming them `X_test` and `y_test`, then pass both `X_test` and `y_test` to `score`.

*Assign your answer to an object called `fruit_prediction_accuracy`.*
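
As a small check of what `score` reports for a classifier, the sketch below (on hypothetical one-predictor toy arrays, not the fruit data) compares it with the fraction of correct predictions computed by hand:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-predictor toy data.
X_tr = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y_tr = np.array(["low", "low", "low", "high", "high", "high"])
X_te = np.array([[1.5], [2.5], [6.5], [8.5]])
# The third test label is deliberately atypical so that one prediction is wrong.
y_te = np.array(["low", "low", "low", "high"])

toy_model = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)

# score() reports the fraction of test observations predicted correctly.
accuracy_from_score = toy_model.score(X_te, y_te)
accuracy_by_hand = float(np.mean(toy_model.predict(X_te) == y_te))
print(accuracy_from_score, accuracy_by_hand)  # 0.75 0.75
```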

In [None]:
# ___ = ___[[___, ___]]
# ___ = ___["fruit_name"]

# ___ = fruit_fit.score(___, ___)

# your code here
raise NotImplementedError
fruit_prediction_accuracy

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_prediction_accuracy is None)).encode("utf-8")+b"58ea0").hexdigest() == "7488706d3b954877e08217e6c10b6e56e6a7b906", "type of fruit_prediction_accuracy is None is not bool. fruit_prediction_accuracy is None should be a bool"
assert sha1(str(fruit_prediction_accuracy is None).encode("utf-8")+b"58ea0").hexdigest() == "2ede9eb19115d81336d63c4f466910ed740a56cc", "boolean value of fruit_prediction_accuracy is None is not correct"

assert sha1(str(type(fruit_prediction_accuracy)).encode("utf-8")+b"58ea1").hexdigest() == "d1b9e2d27e4f648284fe90f1e995ff37b15b3b52", "type of fruit_prediction_accuracy is not correct"
assert sha1(str(fruit_prediction_accuracy).encode("utf-8")+b"58ea1").hexdigest() == "cba4f4b8508049ee804d50c214f02a7d03ec456b", "value of fruit_prediction_accuracy is not correct"

print('Success!')

**Question 2.5**

Now, let's look at the *confusion matrix* for the classifier. This will show us a table comparing the predicted labels with the true labels. 

A confusion matrix tabulates how many observations fall into each combination of true and predicted class. Either the rows or the columns can hold the true class; in the example below, the rows represent the true class and the columns represent the predicted class.

|                    | Predicted Positive | Predicted Negative |
|--------------------|:------------------:|:------------------:|
| **Truly Positive** | True Positive      |     False Negative |
| **Truly Negative** | False Positive     |      True Negative |


- A **true positive** is an outcome where the model correctly predicts the positive class.
- A **true negative** is an outcome where the model correctly predicts the negative class.
- A **false positive** is an outcome where the model incorrectly predicts the positive class.
- A **false negative** is an outcome where the model incorrectly predicts the negative class.
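
These four counts are all that is needed to compute precision, recall, and accuracy. A short sketch with made-up counts (the numbers below are hypothetical and do not come from the fruit classifier):

```python
# Hypothetical counts from a two-class confusion matrix.
true_positive = 8
false_positive = 2
false_negative = 4
true_negative = 6

# Precision: of everything predicted positive, what fraction was truly positive?
precision = true_positive / (true_positive + false_positive)

# Recall: of everything truly positive, what fraction was predicted positive?
recall = true_positive / (true_positive + false_negative)

# Accuracy: fraction of all predictions that were correct.
accuracy = (true_positive + true_negative) / (
    true_positive + false_positive + false_negative + true_negative
)

print(precision, recall, accuracy)
```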

<br>

We can create a confusion matrix by using the `crosstab` function from `pandas`. In the dataframe created by `crosstab`, the true labels will be to the left, and the predicted labels will be on top (as in the matrix above). In contrast to the confusion matrix above, where there are only two possible classes (positive/negative), we have four possible classes (the four fruit names). Therefore, our dataframe will be bigger than the matrix above and contain 16 cells (4 × 4) instead of 4.

*Assign your answer to an object called `fruit_mat`*.
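
To see how such a table is laid out with more than two classes, here is a sketch using made-up labels (the `toy_results` data frame and its three animal classes are hypothetical):

```python
import pandas as pd

# Hypothetical true and predicted labels for a three-class problem.
toy_results = pd.DataFrame({
    "truth":     ["cat", "cat", "dog", "dog", "dog", "bird", "bird", "cat"],
    "predicted": ["cat", "dog", "dog", "dog", "cat", "bird", "bird", "cat"],
})

# Rows are true labels, columns are predicted labels.
toy_mat = pd.crosstab(toy_results["truth"], toy_results["predicted"])
print(toy_mat)

# Correct predictions lie on the diagonal of the table.
n_correct = sum(toy_mat.loc[label, label] for label in toy_mat.index)
print(n_correct)  # 6
```

Every off-diagonal cell counts one specific kind of mistake, for example a true "cat" predicted as "dog".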

In [None]:
# ___ = pd.___(
#     fruit_test_predictions[___],  # True labels
#     fruit_test_predictions[___],  # Predicted labels
# )

# your code here
raise NotImplementedError
fruit_mat

With many observations, it can be difficult to interpret the confusion matrix when it is presented as a table like above. In these cases, we could instead use the `ConfusionMatrixDisplay` class from the `scikit-learn` package to visualize the confusion matrix as a heatmap. Please run the cell below to see the fruit confusion matrix as a heatmap.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
    fruit_fit,  # We are directly passing the pipeline and let sklearn do the predictions for us
    X_test,
    y_test
)

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_mat is None)).encode("utf-8")+b"6d49c").hexdigest() == "a56e492407ffda77dae01cf693d7cec3160d24a5", "type of fruit_mat is None is not bool. fruit_mat is None should be a bool"
assert sha1(str(fruit_mat is None).encode("utf-8")+b"6d49c").hexdigest() == "c7f4085c7df1435511dabd1fe07a5e3f0bf6a04a", "boolean value of fruit_mat is None is not correct"

assert sha1(str(type(fruit_mat)).encode("utf-8")+b"6d49d").hexdigest() == "28c9794a963e59bc5a7cf363c4c0e2789544fc33", "type of type(fruit_mat) is not correct"

assert sha1(str(type(fruit_mat.to_numpy().sum())).encode("utf-8")+b"6d49e").hexdigest() == "01e9653b50d767f4025c03eca241b30e06583876", "type of fruit_mat.to_numpy().sum() is not correct"
assert sha1(str(fruit_mat.to_numpy().sum()).encode("utf-8")+b"6d49e").hexdigest() == "bf73171c60376a7a7ab9275366e3d8a810042a38", "value of fruit_mat.to_numpy().sum() is not correct"

print('Success!')

**Question 2.6** Multiple Choice:

Reading `fruit_mat`, how many observations were labelled correctly?

A. 7

B. 8

C. 9

D. 14

*Assign your answer to an object called `answer2_6`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError
answer2_6

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_6)).encode("utf-8")+b"63bd7").hexdigest() == "74364102b77b3fd097894f4b2ed5ffe1a3bedb6e", "type of answer2_6 is not str. answer2_6 should be an str"
assert sha1(str(len(answer2_6)).encode("utf-8")+b"63bd7").hexdigest() == "aa9b3cc37bc9e2a6492756385bfc9db9b3b3ffad", "length of answer2_6 is not correct"
assert sha1(str(answer2_6.lower()).encode("utf-8")+b"63bd7").hexdigest() == "ee28d9dab7da140c2703a3142b1a76eca2f29704", "value of answer2_6 is not correct"
assert sha1(str(answer2_6).encode("utf-8")+b"63bd7").hexdigest() == "f37d9ce6ca5d36961515b5c2a132fd356075f5f3", "correct string value of answer2_6 but incorrect case of letters"

print('Success!')

**Question 2.7**

Reading `fruit_mat`, let's suppose that we are really interested in the lemons, and treat "lemon" as being the "positive" class. What is the precision of our classifier?

*Assign your answer (as a `float`) to an object called `answer2_7`.*

In [None]:
# your code here
raise NotImplementedError
answer2_7

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_7)).encode("utf-8")+b"cf09a").hexdigest() == "a2acc89ec6bf6fbfcaf70c4f3fd8e257714b28f5", "type of answer2_7 is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(answer2_7, 2)).encode("utf-8")+b"cf09a").hexdigest() == "6e0748f97ae0f02ee4986a895c656d7199300d4c", "value of answer2_7 is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 2.8**

Again, let us treat "lemon" as being the "positive" class. What is the recall of our classifier?

*Assign your answer to an object called `answer2_8`.*

In [None]:
# your code here
raise NotImplementedError
answer2_8

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_8)).encode("utf-8")+b"51283").hexdigest() == "458dc6e2f39b0dc333b972f861a30a8b8840373e", "type of answer2_8 is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(answer2_8, 2)).encode("utf-8")+b"51283").hexdigest() == "2769552e2d149c93b991ad429cb50a9b3b1877a6", "value of answer2_8 is not correct (rounded to 2 decimal places)"

print('Success!')

### 3. Cross-validation

**Question 3.1**

The vast majority of predictive models in statistics and machine learning have parameters that you have to pick. For the past few exercises, we have had to pick the number of neighbours for the class vote, which we have done arbitrarily. But is it possible to make this selection, i.e., *tune the model*, in a principled way? Ideally, we want to pick the number of neighbours to maximize the performance of our classifier on data *it hasn’t seen yet*.

An important aspect of the tuning process is that we can, if we want to, split our training data again, train and evaluate a classifier for each split, and then choose the parameter based on all of the different results. If we just split our training data once, our best parameter choice will depend strongly on the randomness from how this single split was made. Using multiple different splits, we’ll get a more robust estimate of accuracy, which will lead to a more suitable choice of the number of neighbours $K$ to perform well on unseen data.

The idea of training and evaluating models on multiple splits of the training data is called "cross-validation". In cross-validation, we split our overall training data into $C$ evenly-sized chunks, and then iteratively use 1 chunk as the **validation set** and combine the remaining $C−1$ chunks as the **training set.** The validation set is used in a similar way to the test set, **except** that the test set is used only once, at the end, to report model performance, whereas performance on the validation set is used to select the model during cross-validation.
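
The chunking described above can be sketched with `KFold` from `scikit-learn` on ten made-up observations. This is only an illustration: `toy_X` is hypothetical. Note that when `cross_validate` is given an integer `cv` and a classifier, it uses a stratified variant of this splitting.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical observations with a single feature.
toy_X = np.arange(10).reshape(-1, 1)

fold_sizes = []  # (training size, validation size) for each fold
val_seen = []    # indices that have served as validation data

# C = 5 chunks: each chunk is the validation set exactly once,
# and the remaining C - 1 chunks form the training set.
for train_idx, val_idx in KFold(n_splits=5).split(toy_X):
    fold_sizes.append((len(train_idx), len(val_idx)))
    val_seen.extend(val_idx.tolist())
    print("train:", train_idx, "validation:", val_idx)
```

Every observation ends up in a validation set exactly once, so each one contributes to the accuracy estimate without ever being scored by a model that was trained on it.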

---

We can perform a cross-validation in Python using the `cross_validate` function from the `scikit-learn` package. To use this function, you have to identify the model, the training set, and specify the `cv` parameter (the number of folds $C$, defaults to 5). We should set `return_train_score` to be `True` to return the training score as well.
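
To see the shape of what `cross_validate` returns, here is a small sketch on synthetic data. The names `toy_X`, `toy_y`, `toy_pipe`, and `toy_cv` are hypothetical, and the data comes from `make_classification` rather than the fruit data set, so the scores themselves mean nothing for this worksheet.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data, used only for illustration.
toy_X, toy_y = make_classification(n_samples=100, n_features=4, random_state=0)

# An unfitted pipeline: cross_validate refits it on each training split.
toy_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# cross_validate returns a dictionary of arrays, one entry per fold.
toy_cv = pd.DataFrame(
    cross_validate(toy_pipe, toy_X, toy_y, cv=5, return_train_score=True)
)
print(toy_cv.columns.tolist())
```

Each row of `toy_cv` corresponds to one fold. The `test_score` column holds the score on that fold's validation chunk, not on a held-out test set.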

Before we use the `cross_validate` function, we need to set up the pipeline again. You can reuse the `X` and `y` variables you constructed from the training data earlier, as well as the `fruit_preprocessor` and `knn_spec` variables. However, you will need to create a new pipeline, since the one we made earlier is already fitted on all the data, and here we want to fit it on different splits of the data during cross-validation. Since the `cross_validate` function outputs a dictionary, we use `pd.DataFrame` to convert it to a dataframe for convenience, as in the textbook.

*Assign your answer to an object called `fruit_vfold_score`*.

In [None]:
np.random.seed(2020)  # DO NOT REMOVE

# ___ = ___(fruit_preprocessor, knn_spec)
# ___ = pd.___(
#     cross_validate(
#         estimator=___,
#         cv=5,
#         X=___,
#         y=___,
#         return_train_score=True,
#     )
# )

# your code here
raise NotImplementedError
fruit_vfold_score

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_vfold_score is None)).encode("utf-8")+b"36a22").hexdigest() == "b1fc58d5e23c9423bbf8765714e5e58804fce66a", "type of fruit_vfold_score is None is not bool. fruit_vfold_score is None should be a bool"
assert sha1(str(fruit_vfold_score is None).encode("utf-8")+b"36a22").hexdigest() == "d2cc03f8ff44ef13491fa232070b6de92b811475", "boolean value of fruit_vfold_score is None is not correct"

assert sha1(str(type(fruit_vfold_score)).encode("utf-8")+b"36a23").hexdigest() == "d2dcb89307585b23fa4ca6f8a76f8c2c0f23346f", "type of type(fruit_vfold_score) is not correct"

assert sha1(str(type(fruit_vfold_score.shape)).encode("utf-8")+b"36a24").hexdigest() == "546d04c6fccac7326ab7400bb48a82c394c90130", "type of fruit_vfold_score.shape is not tuple. fruit_vfold_score.shape should be a tuple"
assert sha1(str(len(fruit_vfold_score.shape)).encode("utf-8")+b"36a24").hexdigest() == "d8cd8be65e2d915f492f7185bcd2736d2f68a197", "length of fruit_vfold_score.shape is not correct"
assert sha1(str(sorted(map(str, fruit_vfold_score.shape))).encode("utf-8")+b"36a24").hexdigest() == "653cb8e4264891bd349087f0a6ae4d134fd597c4", "values of fruit_vfold_score.shape are not correct"
assert sha1(str(fruit_vfold_score.shape).encode("utf-8")+b"36a24").hexdigest() == "89f531606aaf9ca14b74ef68296d4ab26838067f", "order of elements of fruit_vfold_score.shape is not correct"

assert sha1(str(type(fruit_pipe is None)).encode("utf-8")+b"36a25").hexdigest() == "9541e0d3962315e7cfa81d63c9ff1179e06dbe63", "type of fruit_pipe is None is not bool. fruit_pipe is None should be a bool"
assert sha1(str(fruit_pipe is None).encode("utf-8")+b"36a25").hexdigest() == "020dfe9202b64ad0e6861ad87373f929d736eefc", "boolean value of fruit_pipe is None is not correct"

assert sha1(str(type(type(fruit_pipe))).encode("utf-8")+b"36a26").hexdigest() == "fd97847bfa2e68ff8df95e6074f5a234c145fd58", "type of type(fruit_pipe) is not correct"
assert sha1(str(type(fruit_pipe)).encode("utf-8")+b"36a26").hexdigest() == "656864c949d89ae0be260be526c38fa4039065bd", "value of type(fruit_pipe) is not correct"

assert sha1(str(type(len(fruit_pipe.named_steps))).encode("utf-8")+b"36a27").hexdigest() == "bbcdb9c08b17d3d8a66fe4a53686f010ae0bccbb", "type of len(fruit_pipe.named_steps) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(len(fruit_pipe.named_steps)).encode("utf-8")+b"36a27").hexdigest() == "33ba12d9299d1f48e8d73ede50813d18a77ff874", "value of len(fruit_pipe.named_steps) is not correct"

assert sha1(str(type(fruit_pipe.named_steps.keys())).encode("utf-8")+b"36a28").hexdigest() == "ea5bb9eb37991a1ed3ecb7735822ec397000ce42", "type of fruit_pipe.named_steps.keys() is not correct"
assert sha1(str(fruit_pipe.named_steps.keys()).encode("utf-8")+b"36a28").hexdigest() == "577fdf89e2c84a15d65fa7cec8269b13b6314a4e", "value of fruit_pipe.named_steps.keys() is not correct"

print('Success!')

**Question 3.2**

Now that we have run cross-validation on each train/validation split, one has to ask: how accurate was the classifier across the folds? We can aggregate the *mean* and *standard error* of the scores from the folds. The standard error is essentially a measure of how uncertain we are in the mean value. Use the `agg` dataframe method to compute both the mean and the standard error; make sure the first row of the dataframe contains the mean values and the second contains the standard error values.

*Assign your answer to an object called `fruit_metrics`.*
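
To make the standard error concrete, the sketch below computes it by hand on five made-up fold scores and compares it with the pandas `sem` aggregation. The scores in `toy_scores` are hypothetical and are not results from the fruit data.

```python
import numpy as np
import pandas as pd

# Five hypothetical validation scores, one per fold.
toy_scores = pd.DataFrame({"test_score": [0.80, 0.90, 0.85, 0.75, 0.95]})

n = len(toy_scores)
mean_by_hand = toy_scores["test_score"].mean()
# Standard error of the mean: sample standard deviation divided by sqrt(n).
sem_by_hand = toy_scores["test_score"].std(ddof=1) / np.sqrt(n)

# The same two summaries via agg; rows appear in the order requested.
toy_summary = toy_scores.agg(["mean", "sem"])
print(toy_summary)
```

Because the standard error divides by $\sqrt{n}$, it describes the uncertainty in the mean score, not the spread of the individual fold scores.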

In [None]:
# ___ = fruit_vfold_score.___([___, ___])


# your code here
raise NotImplementedError
fruit_metrics

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_metrics.shape)).encode("utf-8")+b"7608a").hexdigest() == "8e5f57d0f7ed0aa98c690da117537d080d1bff1a", "type of fruit_metrics.shape is not tuple. fruit_metrics.shape should be a tuple"
assert sha1(str(len(fruit_metrics.shape)).encode("utf-8")+b"7608a").hexdigest() == "f68c65a025bd3924f48305894fadea89e0c5b46e", "length of fruit_metrics.shape is not correct"
assert sha1(str(sorted(map(str, fruit_metrics.shape))).encode("utf-8")+b"7608a").hexdigest() == "bbca15d44552fa5e34cd23e46733915b6043b927", "values of fruit_metrics.shape are not correct"
assert sha1(str(fruit_metrics.shape).encode("utf-8")+b"7608a").hexdigest() == "f8ddb51af4648fc9c62d3c1d3a78fc8589ac74ca", "order of elements of fruit_metrics.shape is not correct"

assert sha1(str(type(fruit_metrics.test_score)).encode("utf-8")+b"7608b").hexdigest() == "4fae85143b5b445ce9b9807dfe9a1a379dc90033", "type of fruit_metrics.test_score is not correct"
assert sha1(str(fruit_metrics.test_score).encode("utf-8")+b"7608b").hexdigest() == "6924635698218860ac3c086773cb0acfa29dace4", "value of fruit_metrics.test_score is not correct"

print('Success!')

## 4. Parameter value selection

Using 5-fold cross-validation, we have established a prediction accuracy for our classifier. To improve our classifier, we would like to try different numbers of neighbours, $K$. We could then use cross-validation to calculate an accuracy for each value of $K$ in a reasonable range, and pick the value of $K$ that gives us the best accuracy on the validation data.

The great thing about the `scikit-learn` package is that it provides functions to conveniently tune parameters such as $K$ by training and evaluating models (via cross-validation) for a range of specified values of $K$. The function we will use here is called "exhaustive grid search" ([sklearn.model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)).
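
A grid search addresses pipeline parameters by name, in the form `<step name>__<parameter>`. The sketch below, on synthetic data, shows where those names come from. All `toy_*` names are hypothetical, and the range of $K$ and the number of folds are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data, used only for illustration.
toy_X, toy_y = make_classification(n_samples=60, n_features=4, random_state=1)

# make_pipeline names each step after its class, in lowercase.
toy_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
step_names = list(toy_pipe.named_steps)
print(step_names)

# Grid key = step name + double underscore + parameter name.
toy_grid = {"kneighborsclassifier__n_neighbors": range(1, 6)}

# Each candidate K is evaluated with 3-fold cross-validation.
toy_search = GridSearchCV(toy_pipe, toy_grid, cv=3).fit(toy_X, toy_y)
toy_results = pd.DataFrame(toy_search.cv_results_)
print(toy_results[["param_kneighborsclassifier__n_neighbors", "mean_test_score"]])
```

The results table has one row per candidate value of $K$, with the score averaged over the folds in `mean_test_score`.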

**Question 4.0**

Create a new K-nearest neighbor model specification but instead of specifying a particular value for the `n_neighbors` argument, try exploring a range of values with `GridSearchCV`. Before we use `GridSearchCV`, we should define the grid of values that we want to explore, and redefine the pipeline without specifying a particular value of $K$. To save us some time, instruct the grid search to use 4-fold cross-validation, rather than the default 5-fold.

*Assign your answer to an object called `knn_tune_grid`.* 

In [None]:
### Run this cell
param_grid = {
    "kneighborsclassifier__n_neighbors": range(2, 15, 1),
}
fruit_tune_pipe = make_pipeline(fruit_preprocessor, KNeighborsClassifier())

In [None]:
# ___ = GridSearchCV(
#     ___, ___, ___=___,
# )


# your code here
raise NotImplementedError
knn_tune_grid

In [None]:
from hashlib import sha1
assert sha1(str(type(knn_tune_grid is None)).encode("utf-8")+b"de7e3").hexdigest() == "b8f0b73e45acd1ecde4311b54cb8ca5c4fd48116", "type of knn_tune_grid is None is not bool. knn_tune_grid is None should be a bool"
assert sha1(str(knn_tune_grid is None).encode("utf-8")+b"de7e3").hexdigest() == "22ff6a02d8a0748a0942a49f18a56f89c44071a4", "boolean value of knn_tune_grid is None is not correct"

assert sha1(str(type(type(knn_tune_grid))).encode("utf-8")+b"de7e4").hexdigest() == "bba351be309dda4ac68b4ebd1aafb92b79462f94", "type of type(knn_tune_grid) is not correct"
assert sha1(str(type(knn_tune_grid)).encode("utf-8")+b"de7e4").hexdigest() == "5daba4ef43b4a410d3c641b8d3b2e56daed064e6", "value of type(knn_tune_grid) is not correct"

assert sha1(str(type(knn_tune_grid.param_grid.keys())).encode("utf-8")+b"de7e5").hexdigest() == "ff297bec527b81362f07bc67cd01011f1add1a08", "type of knn_tune_grid.param_grid.keys() is not correct"
assert sha1(str(knn_tune_grid.param_grid.keys()).encode("utf-8")+b"de7e5").hexdigest() == "1c7370822141bce722ca65597af73993dba4a560", "value of knn_tune_grid.param_grid.keys() is not correct"

assert sha1(str(type(knn_tune_grid.estimator.named_steps.keys())).encode("utf-8")+b"de7e6").hexdigest() == "dd0e8c205b661479a0048e81ead835f12b2b5ddf", "type of knn_tune_grid.estimator.named_steps.keys() is not correct"
assert sha1(str(knn_tune_grid.estimator.named_steps.keys()).encode("utf-8")+b"de7e6").hexdigest() == "9fe3afc262d1fef9058a7a3342aa8dbc9a76685c", "value of knn_tune_grid.estimator.named_steps.keys() is not correct"

assert sha1(str(type(knn_tune_grid.cv)).encode("utf-8")+b"de7e7").hexdigest() == "947f9f92f52034f3258a2e920f111db2ce0fc28b", "type of knn_tune_grid.cv is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(knn_tune_grid.cv).encode("utf-8")+b"de7e7").hexdigest() == "048c41e7d49846e06e4228f89a17df11f2732e82", "value of knn_tune_grid.cv is not correct"

print('Success!')

**Question 4.1**

Now, let's fit the grid search object to the data, using the `X` and `y` variables we created earlier.

*Assign your tuned model to a variable called `knn_model_grid`.*

Next, extract the `cv_results_` attribute from `knn_model_grid` and save it as a dataframe.

*Assign your answer to a variable called `accuracies_grid`.*

In [None]:
# ___ = ___.fit(___, ___)

# ___ = pd.DataFrame(___.cv_results_)

# your code here
raise NotImplementedError
accuracies_grid

In [None]:
from hashlib import sha1
assert sha1(str(type(type(knn_model_grid))).encode("utf-8")+b"5aaa1").hexdigest() == "7800ffeeffbe3f109596728a651492f72ed697b9", "type of type(knn_model_grid) is not correct"
assert sha1(str(type(knn_model_grid)).encode("utf-8")+b"5aaa1").hexdigest() == "a777b6bc7a96e92dc37d7e48e532c4403b013440", "value of type(knn_model_grid) is not correct"

assert sha1(str(type(accuracies_grid is None)).encode("utf-8")+b"5aaa2").hexdigest() == "2bbc8cb94c17fac722104bbe0ae8178d6283ea7d", "type of accuracies_grid is None is not bool. accuracies_grid is None should be a bool"
assert sha1(str(accuracies_grid is None).encode("utf-8")+b"5aaa2").hexdigest() == "b31adcd31362204c2f193a8d747bac72afd4a47a", "boolean value of accuracies_grid is None is not correct"

assert sha1(str(type(accuracies_grid)).encode("utf-8")+b"5aaa3").hexdigest() == "a67afa6224ce1811c7d07895e39f81d73e309929", "type of type(accuracies_grid) is not correct"

assert sha1(str(type(accuracies_grid.shape)).encode("utf-8")+b"5aaa4").hexdigest() == "07ba20ae9b6524ef5c009590c807287bad2b6ca8", "type of accuracies_grid.shape is not tuple. accuracies_grid.shape should be a tuple"
assert sha1(str(len(accuracies_grid.shape)).encode("utf-8")+b"5aaa4").hexdigest() == "e362cae886bd94094b80f36e4d1c28b67e1ca7d7", "length of accuracies_grid.shape is not correct"
assert sha1(str(sorted(map(str, accuracies_grid.shape))).encode("utf-8")+b"5aaa4").hexdigest() == "2a86bd742543c1a3f2d3750c8ae31e564709e00e", "values of accuracies_grid.shape are not correct"
assert sha1(str(accuracies_grid.shape).encode("utf-8")+b"5aaa4").hexdigest() == "bcd625ee3af5d1cd105b2b2335288631494670e5", "order of elements of accuracies_grid.shape is not correct"

assert sha1(str(type(sum(accuracies_grid.mean_test_score))).encode("utf-8")+b"5aaa5").hexdigest() == "f04499ac18fececdea66994f437f6fe195a9a404", "type of sum(accuracies_grid.mean_test_score) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(accuracies_grid.mean_test_score), 2)).encode("utf-8")+b"5aaa5").hexdigest() == "fef13114978598bb2dd1245e857f4ad0cd359ce9", "value of sum(accuracies_grid.mean_test_score) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(accuracies_grid.std_test_score))).encode("utf-8")+b"5aaa6").hexdigest() == "90280827318185a637233f593624365d40176a3a", "type of sum(accuracies_grid.std_test_score) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(accuracies_grid.std_test_score), 2)).encode("utf-8")+b"5aaa6").hexdigest() == "7ea1bce420fa1d5037e56e6b0e1b99bf030cc962", "value of sum(accuracies_grid.std_test_score) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(accuracies_grid.param_kneighborsclassifier__n_neighbors))).encode("utf-8")+b"5aaa7").hexdigest() == "bcd0d8dac389990c3750e6af054592990b173ab3", "type of sum(accuracies_grid.param_kneighborsclassifier__n_neighbors) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(accuracies_grid.param_kneighborsclassifier__n_neighbors)).encode("utf-8")+b"5aaa7").hexdigest() == "8805dfb33692c14619c01c7dd91f5135cbda75dd", "value of sum(accuracies_grid.param_kneighborsclassifier__n_neighbors) is not correct"

print('Success!')

**Question 4.2**

Visually inspecting the grid search results can help us find the best value for the number of neighbors parameter.

Create a line plot using the `accuracies_grid` dataframe with `param_kneighborsclassifier__n_neighbors` on the x-axis and the `mean_test_score` on the y-axis. Use `point=True` to include a point for each value of $K$. Make it an effective visualization.

*Assign your answer to a variable called `accuracy_versus_k_grid`.*

In [None]:
# ___ = alt.Chart(___).mark_line(___).encode(
#     x=alt.X(___)
#         .title(___)
#         .scale(zero=False),
#     y=alt.Y(___)
#         .title(___)
#         .scale(zero=False)
# )


# your code here
raise NotImplementedError
accuracy_versus_k_grid

In [None]:
from hashlib import sha1
assert sha1(str(type(accuracy_versus_k_grid is None)).encode("utf-8")+b"6363").hexdigest() == "87aea9e6a7c50496055760ac5f8da1558cd36bbd", "type of accuracy_versus_k_grid is None is not bool. accuracy_versus_k_grid is None should be a bool"
assert sha1(str(accuracy_versus_k_grid is None).encode("utf-8")+b"6363").hexdigest() == "97d9d9c5a176cbd7e29b0dacb28ca2875d1fc330", "boolean value of accuracy_versus_k_grid is None is not correct"

assert sha1(str(type(accuracy_versus_k_grid.encoding.x['shorthand'])).encode("utf-8")+b"6364").hexdigest() == "e3175586c7aca8d91960689a5362be05a9843531", "type of accuracy_versus_k_grid.encoding.x['shorthand'] is not str. accuracy_versus_k_grid.encoding.x['shorthand'] should be an str"
assert sha1(str(len(accuracy_versus_k_grid.encoding.x['shorthand'])).encode("utf-8")+b"6364").hexdigest() == "bc650671ac20b6c78d46c4d81c4da5fb776adf7a", "length of accuracy_versus_k_grid.encoding.x['shorthand'] is not correct"
assert sha1(str(accuracy_versus_k_grid.encoding.x['shorthand'].lower()).encode("utf-8")+b"6364").hexdigest() == "b40361992fe114502030798e70d94b33e1ebff13", "value of accuracy_versus_k_grid.encoding.x['shorthand'] is not correct"
assert sha1(str(accuracy_versus_k_grid.encoding.x['shorthand']).encode("utf-8")+b"6364").hexdigest() == "b40361992fe114502030798e70d94b33e1ebff13", "correct string value of accuracy_versus_k_grid.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(accuracy_versus_k_grid.encoding.y['shorthand'])).encode("utf-8")+b"6365").hexdigest() == "359150eb18472065771be3fabdfe0bc17f14c652", "type of accuracy_versus_k_grid.encoding.y['shorthand'] is not str. accuracy_versus_k_grid.encoding.y['shorthand'] should be an str"
assert sha1(str(len(accuracy_versus_k_grid.encoding.y['shorthand'])).encode("utf-8")+b"6365").hexdigest() == "a77f4185f8ba84217e96c473d78afe7defecd6d9", "length of accuracy_versus_k_grid.encoding.y['shorthand'] is not correct"
assert sha1(str(accuracy_versus_k_grid.encoding.y['shorthand'].lower()).encode("utf-8")+b"6365").hexdigest() == "888d0123d2b4b33cf32214184508a0981b6737c9", "value of accuracy_versus_k_grid.encoding.y['shorthand'] is not correct"
assert sha1(str(accuracy_versus_k_grid.encoding.y['shorthand']).encode("utf-8")+b"6365").hexdigest() == "888d0123d2b4b33cf32214184508a0981b6737c9", "correct string value of accuracy_versus_k_grid.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(accuracy_versus_k_grid.mark)).encode("utf-8")+b"6366").hexdigest() == "4f70c15a9e4661590d768563d18d2bf38949d4ca", "type of accuracy_versus_k_grid.mark is not correct"
assert sha1(str(accuracy_versus_k_grid.mark).encode("utf-8")+b"6366").hexdigest() == "792bb3fbbb9b6f8d1cf136e30874108af9aac99e", "value of accuracy_versus_k_grid.mark is not correct"

assert sha1(str(type(accuracy_versus_k_grid.mark['point'])).encode("utf-8")+b"6367").hexdigest() == "734ecab84aea2bf10ce9cfa3ca4ebe42f82594bc", "type of accuracy_versus_k_grid.mark['point'] is not bool. accuracy_versus_k_grid.mark['point'] should be a bool"
assert sha1(str(accuracy_versus_k_grid.mark['point']).encode("utf-8")+b"6367").hexdigest() == "a0f7053f039367549a809452ba2ef7ef8eec5bec", "boolean value of accuracy_versus_k_grid.mark['point'] is not correct"

print('Success!')

From the plot above, we can see that $K = 2$ or $3$ provides the highest accuracy. Larger $K$ values result in a reduced accuracy estimate. Remember: the values you see on this plot are **estimates** of the true accuracy of our classifier. Although this is the best information we have about the ideal value of $K$, it is not a guarantee that the classifier will always be more accurate with this parameter value when it is used in practice!

Great, now you have completed a full analysis with cross-validation using the `scikit-learn` package! For your information, we can choose any number of folds, and typically the more folds we use, the better our accuracy estimate will be (lower standard error). However, more folds mean a greater computation time. In practice, `cv` is usually chosen to be either 5 or 10.

**Discussion question**: because we are learning, we did something in this worksheet that we were not supposed to do. What was it?

> Your answer here