# Worksheet Classification (Part I)

## Learning Goals:

After completing this workshop session, you will be able to:

* Recognize situations where a simple classifier would be appropriate for making predictions.
* Explain the $K$-nearest neighbour classification algorithm.
* Interpret the output of a classifier.
* Compute, by hand, the distance between points when there are two explanatory variables/predictors.
* Describe what a training data set is and how it is used in classification.
* Given a dataset with two explanatory variables/predictors, use $K$-nearest neighbour classification in Python using the `scikit-learn` framework to predict the class of a single new observation.

This worksheet covers parts of [Chapter 5](https://python.datasciencebook.ca/classification1.html) of the online textbook.
You should read this chapter to gain a better understanding of the assignment.
Any place you see `___`, you must fill in the function, variable, or data to complete the code.
Replace the `raise NotImplementedError` with your completed code and answers, then run the cell.

In [1]:
### Run this cell before continuing
import random

import altair as alt
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import set_config

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

## 1. Breast Cancer Data Set 

We will work with the breast cancer data from the accompanying textbook chapter.

> Note that the breast cancer data in this worksheet have been **standardized (centred and scaled)** for you already.
We will implement these steps ourselves in a future worksheet, but for now, just know that the data have been standardized.
As a result, the variables are unitless, which is why variables like Radius can take zero and negative values.
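
To see why standardized values can be zero or negative, here is a minimal sketch of centring and scaling on a made-up column. The `radius_mm` name and its values are hypothetical and are not taken from the worksheet's data; the sketch assumes that standardizing means subtracting the column mean and dividing by the column standard deviation.

```python
import pandas as pd

# Hypothetical raw measurements (illustrative only, not from clean-wdbc-data.csv)
raw = pd.DataFrame({"radius_mm": [10.0, 12.0, 14.0, 16.0, 18.0]})

# Centre (subtract the mean) and scale (divide by the standard deviation)
standardized = (raw - raw.mean()) / raw.std()

# Values below the original mean become negative, the mean itself becomes 0
print(standardized)
```

After this transformation the column has mean 0 and standard deviation 1, so it no longer carries the original units.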

**Question 1.0**
<br> {points: 1}

Read the `clean-wdbc-data.csv` file (found in the `data` directory) using the `pd.read_csv` function into the notebook and store it as a data frame. *Name it `cancer`.*

In [2]:
### BEGIN SOLUTION
cancer = pd.read_csv("data/clean-wdbc-data.csv")
### END SOLUTION
cancer

Unnamed: 0,ID,Class,Radius,Texture,Perimeter,Area,Smoothness,Compactness,Concavity,Concave_points,Symmetry,Fractal_dimension
0,842302,M,1.885031,-1.358098,2.301575,1.999478,1.306537,2.614365,2.107672,2.294058,2.748204,1.935312
1,842517,M,1.804340,-0.368879,1.533776,1.888827,-0.375282,-0.430066,-0.146620,1.086129,-0.243675,0.280943
2,84300903,M,1.510541,-0.023953,1.346291,1.455004,0.526944,1.081980,0.854222,1.953282,1.151242,0.201214
3,84348301,M,-0.281217,0.133866,-0.249720,-0.549538,3.391291,3.889975,1.987839,2.173873,6.040726,4.930672
4,84358402,M,1.297434,-1.465481,1.337363,1.219651,0.220362,-0.313119,0.612640,0.728618,-0.867590,-0.396751
...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,1.899514,0.117596,1.751022,2.013529,0.378033,-0.273077,0.663928,1.627719,-1.358963,-0.708467
565,926682,M,1.535369,2.045599,1.420690,1.493644,-0.690623,-0.394473,0.236365,0.733182,-0.531387,-0.973122
566,926954,M,0.560868,1.373645,0.578492,0.427529,-0.808876,0.350427,0.326479,0.413705,-1.103578,-0.318129
567,927241,M,1.959515,2.235958,2.301575,1.651717,1.429169,3.901415,3.194794,2.287972,1.917396,2.217684


In [3]:
from hashlib import sha1
assert sha1(str(type(cancer is None)).encode("utf-8")+b"1edfa").hexdigest() == "f89107c5738f5567ac4ce7e619af326d8dc7e7e4", "type of cancer is None is not bool. cancer is None should be a bool"
assert sha1(str(cancer is None).encode("utf-8")+b"1edfa").hexdigest() == "71bbe216c3f112174b74da6122dd837cb4abaafa", "boolean value of cancer is None is not correct"

assert sha1(str(type(cancer)).encode("utf-8")+b"1edfb").hexdigest() == "5f4717efa0f9568127506afdab187398929a3f76", "type of type(cancer) is not correct"

assert sha1(str(type(cancer.shape)).encode("utf-8")+b"1edfc").hexdigest() == "6790c026fc62f7f025cc5dfb65b5589e70c94f24", "type of cancer.shape is not tuple. cancer.shape should be a tuple"
assert sha1(str(len(cancer.shape)).encode("utf-8")+b"1edfc").hexdigest() == "e400bacf4406589a930f6cb04146f55ae09adbe1", "length of cancer.shape is not correct"
assert sha1(str(sorted(map(str, cancer.shape))).encode("utf-8")+b"1edfc").hexdigest() == "bc58412175f96aad1b5f00e35cbf014630e9667a", "values of cancer.shape are not correct"
assert sha1(str(cancer.shape).encode("utf-8")+b"1edfc").hexdigest() == "03f45e6b2934a9d89a82761d8cd384e41a96882e", "order of elements of cancer.shape is not correct"

assert sha1(str(type(sum(cancer.Area))).encode("utf-8")+b"1edfd").hexdigest() == "8459034dfcce3dc398256d3235e7ea1c0a6eab66", "type of sum(cancer.Area) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(cancer.Area), 2)).encode("utf-8")+b"1edfd").hexdigest() == "578eeb3cce10656ea3740570a63c462e74b5bee3", "value of sum(cancer.Area) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(cancer.columns.values)).encode("utf-8")+b"1edfe").hexdigest() == "ec1b028616e5bf668b318569d326edbe4fa5792d", "type of cancer.columns.values is not correct"
assert sha1(str(cancer.columns.values).encode("utf-8")+b"1edfe").hexdigest() == "46e462ccc418670863709af1c7c6c89e4aa501be", "value of cancer.columns.values is not correct"

assert sha1(str(type(cancer['Class'].dtype)).encode("utf-8")+b"1edff").hexdigest() == "0c82f629a8e37cf9fc16959e34213e0fa89a6a68", "type of cancer['Class'].dtype is not correct"
assert sha1(str(cancer['Class'].dtype).encode("utf-8")+b"1edff").hexdigest() == "852a6171a68036a67ec486822760a69c8cd61e0d", "value of cancer['Class'].dtype is not correct"

print('Success!')

Success!


**Question 1.1** True or False: 
<br> {points: 1}

After looking at the first few rows of the `cancer` data frame, suppose we asked you to predict the variable `Area` for a new observation. **Is this a classification problem?**

*Assign your answer to an object called `answer1_1`. Make sure your answer is a boolean, i.e. `True` or `False`.*

In [4]:
### BEGIN SOLUTION
answer1_1 = False
### END SOLUTION

In [5]:
from hashlib import sha1
assert sha1(str(type(answer1_1)).encode("utf-8")+b"3524b").hexdigest() == "96e967f4261014f1a2a76ba5f230d7b7e85abe83", "type of answer1_1 is not bool. answer1_1 should be a bool"
assert sha1(str(answer1_1).encode("utf-8")+b"3524b").hexdigest() == "88e4fbb75d430084961a28b324179d14f4999b12", "boolean value of answer1_1 is not correct"

print('Success!')

Success!


**Question 1.2** 
<br> {points: 1}

Create a scatterplot of the data with `Symmetry` on the x-axis and `Radius` on the y-axis, and colour the points by `Class`. As you create this plot, ensure you follow the guidelines for creating effective visualizations. In particular, note in the chart axis titles whether the data are standardized, and add a suitable opacity level to the graphical mark. You should also replace the values in the data frame's `Class` column from `'M'` to `'Malignant'` and from `'B'` to `'Benign'`.

*Assign your plot to an object called `cancer_plot`.*

In [6]:
cancer["Class"] = cancer["Class"].replace({
    'M' : 'Malignant',
    'B' : 'Benign'
})
cancer_plot = alt.Chart(cancer).mark_point(opacity=0.5).encode(
    x=alt.X("Symmetry").title("Standardized symmetry"),
    y=alt.Y("Radius").title("Standardized radius"),
    color=alt.Color("Class").title("Diagnosis")
)
cancer_plot

In [7]:
from hashlib import sha1
assert sha1(str(type(cancer['Class'].unique())).encode("utf-8")+b"7038e").hexdigest() == "2c870d7a3657b742557d66961de4a4891ee76aa2", "type of cancer['Class'].unique() is not correct"
assert sha1(str(cancer['Class'].unique()).encode("utf-8")+b"7038e").hexdigest() == "7be82206c225ab0f0a4ffad7c1488a64143481f7", "value of cancer['Class'].unique() is not correct"

assert sha1(str(type(cancer_plot is None)).encode("utf-8")+b"7038f").hexdigest() == "5f05f0c0e171e12d0d2b2a22d2789f0d9a8e342e", "type of cancer_plot is None is not bool. cancer_plot is None should be a bool"
assert sha1(str(cancer_plot is None).encode("utf-8")+b"7038f").hexdigest() == "515f1ad42fa781dff7c9975c8ded7f317ce95332", "boolean value of cancer_plot is None is not correct"

assert sha1(str(type(cancer_plot.encoding.x['shorthand'])).encode("utf-8")+b"70390").hexdigest() == "1b054e65bf0aac12e848da1d2afa2b12a5226b61", "type of cancer_plot.encoding.x['shorthand'] is not str. cancer_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(cancer_plot.encoding.x['shorthand'])).encode("utf-8")+b"70390").hexdigest() == "dd930b7fa9aa4d16b4d1a89c7a39089300827489", "length of cancer_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(cancer_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"70390").hexdigest() == "dd17739f27d755296153ed0ec742286fd3b39cb2", "value of cancer_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(cancer_plot.encoding.x['shorthand']).encode("utf-8")+b"70390").hexdigest() == "d4212b1d5679f4601309609b6ea38ffb2d4ed07c", "correct string value of cancer_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(cancer_plot.encoding.y['shorthand'])).encode("utf-8")+b"70391").hexdigest() == "6fac44d279065f2a3f185c5a884fd9df6dd81cfb", "type of cancer_plot.encoding.y['shorthand'] is not str. cancer_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(cancer_plot.encoding.y['shorthand'])).encode("utf-8")+b"70391").hexdigest() == "df067cfef2a7c5291dd0f9acc53fe92d50ccf1fe", "length of cancer_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(cancer_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"70391").hexdigest() == "edf8bc4ede29cf991a36bb6d8b38713806753b47", "value of cancer_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(cancer_plot.encoding.y['shorthand']).encode("utf-8")+b"70391").hexdigest() == "e3f5fd9a0d9f3c695791e536282f29ee7d2c6b4a", "correct string value of cancer_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(cancer_plot.encoding.color['shorthand'])).encode("utf-8")+b"70392").hexdigest() == "a1514e0d638e7fa47039d17e310bba44e6a8ff2c", "type of cancer_plot.encoding.color['shorthand'] is not str. cancer_plot.encoding.color['shorthand'] should be an str"
assert sha1(str(len(cancer_plot.encoding.color['shorthand'])).encode("utf-8")+b"70392").hexdigest() == "a43d370b386a7d21c0ac8e73bdc942b85ee93cae", "length of cancer_plot.encoding.color['shorthand'] is not correct"
assert sha1(str(cancer_plot.encoding.color['shorthand'].lower()).encode("utf-8")+b"70392").hexdigest() == "a2e54ddb2a451b8647df9f5b9aae970b19620f66", "value of cancer_plot.encoding.color['shorthand'] is not correct"
assert sha1(str(cancer_plot.encoding.color['shorthand']).encode("utf-8")+b"70392").hexdigest() == "dfff632fd3104b75a651bf07567aa5d572835f70", "correct string value of cancer_plot.encoding.color['shorthand'] but incorrect case of letters"

assert sha1(str(type(cancer_plot.mark)).encode("utf-8")+b"70393").hexdigest() == "e736aec18d8c2224b69bfe505db7d3968a4a2c7e", "type of cancer_plot.mark is not correct"
assert sha1(str(cancer_plot.mark).encode("utf-8")+b"70393").hexdigest() == "1746955c8bc53ef47b4cf0267c98ce50ce3b184a", "value of cancer_plot.mark is not correct"

assert sha1(str(type(isinstance(cancer_plot.encoding.color['title'], str))).encode("utf-8")+b"70394").hexdigest() == "40775fb7d185dcb2ad7acf54925bb5075bd50aca", "type of isinstance(cancer_plot.encoding.color['title'], str) is not bool. isinstance(cancer_plot.encoding.color['title'], str) should be a bool"
assert sha1(str(isinstance(cancer_plot.encoding.color['title'], str)).encode("utf-8")+b"70394").hexdigest() == "e42ecc8f1d08c9a9cbb062b31d6180e6271e740b", "boolean value of isinstance(cancer_plot.encoding.color['title'], str) is not correct"

assert sha1(str(type(isinstance(cancer_plot.encoding.x['title'], str))).encode("utf-8")+b"70395").hexdigest() == "711e00550c10a33b1b90544076b3b8d4a89d23bf", "type of isinstance(cancer_plot.encoding.x['title'], str) is not bool. isinstance(cancer_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(cancer_plot.encoding.x['title'], str)).encode("utf-8")+b"70395").hexdigest() == "b697a6b22f874c7f85b3e081a24d1b5cdd084659", "boolean value of isinstance(cancer_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(cancer_plot.encoding.y['title'], str))).encode("utf-8")+b"70396").hexdigest() == "37f686ca3136e2e102714b66fd51cf3ef4703fc6", "type of isinstance(cancer_plot.encoding.y['title'], str) is not bool. isinstance(cancer_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(cancer_plot.encoding.y['title'], str)).encode("utf-8")+b"70396").hexdigest() == "d0eb5c4439424db54df14662a17296d7dc74ff3d", "boolean value of isinstance(cancer_plot.encoding.y['title'], str) is not correct"

print('Success!')

Success!


**Question 1.3** 
<br> {points: 1}

Just by looking at the scatterplot above, how would you classify an observation with `Symmetry` = 1 and `Radius` = 1 (Benign or Malignant)?

*Assign your answer to an object called `answer1_3`. Make sure your answer is written out in full. Remember to surround your answer with quotation marks (e.g. "Benign" / "Malignant").*

In [8]:
### BEGIN SOLUTION
answer1_3 = "Malignant"
### END SOLUTION

In [9]:
from hashlib import sha1
assert sha1(str(type(answer1_3)).encode("utf-8")+b"192dd").hexdigest() == "c22b7e131d7091f92de118d328cf99ebfc373e56", "type of answer1_3 is not str. answer1_3 should be an str"
assert sha1(str(len(answer1_3)).encode("utf-8")+b"192dd").hexdigest() == "426e383fda7dfde77b6cbef2a70be0047a593331", "length of answer1_3 is not correct"
assert sha1(str(answer1_3.lower()).encode("utf-8")+b"192dd").hexdigest() == "04f0ffa0d0870e9e4d9a8bf7bb1ed255f9165a52", "value of answer1_3 is not correct"
assert sha1(str(answer1_3).encode("utf-8")+b"192dd").hexdigest() == "11394ec2da0fa66e3dd679f582c3d6024557dc1a", "correct string value of answer1_3 but incorrect case of letters"

print('Success!')

Success!


## 2. Using `scikit-learn` to perform k-nearest neighbours

Now that we understand how K-nearest neighbours (k-nn) classification works,
let's get familiar with the `scikit-learn` Python package.
The benefit of using `scikit-learn` is that it will keep our code simple, readable and accurate.
Coding less and in a tidier format means that there is less chance for errors to occur.

We'll again focus on `Radius` and `Symmetry` as the two predictors.
This time, we would like to predict the class of a new observation with `Symmetry = 1` and `Radius = 0`.
This one is a bit tricky to do visually from the plot below,
and so is a motivating example for us to compute the prediction using k-nn with the `scikit-learn` package.
Let's use `K = 7`.
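
Before handing the work over to `scikit-learn`, it may help to see the k-nn steps written out by hand. The sketch below is only an illustration: the `toy` data frame and its values are hypothetical (not the cancer data), and it uses `K = 3` to keep the output short. It computes the straight-line distance from the new point to every training point, keeps the `K` closest rows, and takes a majority vote.

```python
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical toy training data (not the worksheet's cancer data)
toy = pd.DataFrame({
    "Symmetry": [0.0, 0.5, 1.0, 1.5, 3.0, 3.5],
    "Radius":   [0.0, 0.5, 0.5, 0.0, 3.0, 2.5],
    "Class":    ["Benign", "Benign", "Malignant", "Benign", "Malignant", "Malignant"],
})
new_point = pd.DataFrame({"Symmetry": [1.0], "Radius": [0.0]})

# Distance from the new point to each training point:
# sqrt((Symmetry difference)^2 + (Radius difference)^2)
toy["distance"] = euclidean_distances(toy[["Symmetry", "Radius"]], new_point)[:, 0]

# Keep the K closest rows and take a majority vote on their classes
k = 3
nearest = toy.nsmallest(k, "distance")
toy_prediction = nearest["Class"].mode()[0]

print(nearest)
print(toy_prediction)
```

Here two of the three nearest points are benign, so the vote is `Benign`. `KNeighborsClassifier` carries out the same kind of computation for us.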

In [10]:
# Run this to remind yourself what the data looks like
cancer_plot

**Question 2.1** 
<br> {points: 1}

Create a **model** for K-nearest neighbours classification by using the `KNeighborsClassifier` function. Specify that we want to set `n_neighbors = 7`.

Name your model specification `knn_spec`.

In [11]:
# your code here
### BEGIN SOLUTION
knn_spec = KNeighborsClassifier(n_neighbors=7)
### END SOLUTION

knn_spec

In [12]:
from hashlib import sha1
assert sha1(str(type(knn_spec is None)).encode("utf-8")+b"2245").hexdigest() == "78a85e8a7790e4e7a7d20d518a6e9e23249aeb08", "type of knn_spec is None is not bool. knn_spec is None should be a bool"
assert sha1(str(knn_spec is None).encode("utf-8")+b"2245").hexdigest() == "b44afb32e99c75fdbfb62a424921d1a2fa6f12ee", "boolean value of knn_spec is None is not correct"

assert sha1(str(type(knn_spec.n_neighbors)).encode("utf-8")+b"2246").hexdigest() == "1458700d2adfd82356dfd982b7941979cbd45589", "type of knn_spec.n_neighbors is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(knn_spec.n_neighbors).encode("utf-8")+b"2246").hexdigest() == "22e82178737bce10c71283d1cf37544dd1dcbbb8", "value of knn_spec.n_neighbors is not correct"

assert sha1(str(type(knn_spec.algorithm)).encode("utf-8")+b"2247").hexdigest() == "c71db7d462b7abd56ce64492cec1d48ea9040224", "type of knn_spec.algorithm is not str. knn_spec.algorithm should be an str"
assert sha1(str(len(knn_spec.algorithm)).encode("utf-8")+b"2247").hexdigest() == "fc15a33f2e81ff2a1ebe2cdd52ca95582dc1df46", "length of knn_spec.algorithm is not correct"
assert sha1(str(knn_spec.algorithm.lower()).encode("utf-8")+b"2247").hexdigest() == "6fb8474d83ec0dc215e52cdc08b990751b7d94a9", "value of knn_spec.algorithm is not correct"
assert sha1(str(knn_spec.algorithm).encode("utf-8")+b"2247").hexdigest() == "6fb8474d83ec0dc215e52cdc08b990751b7d94a9", "correct string value of knn_spec.algorithm but incorrect case of letters"

print('Success!')

Success!


**Question 2.2**
<br> {points: 1}

To train the model on the breast cancer dataset,
call the `.fit` method of `knn_spec` with the predictors and the target taken from the `cancer` data frame.
Use `Class` as your target variable (`y`) and the `Symmetry` and `Radius` variables as your predictors (`X`).
Name your fitted model `knn_fit`.

In [13]:
### BEGIN SOLUTION
X = cancer[["Symmetry", "Radius"]]
y = cancer["Class"]
knn_fit = knn_spec.fit(X, y)
### END SOLUTION

# your code here
knn_fit

In [14]:
from hashlib import sha1
assert sha1(str(type(knn_fit is None)).encode("utf-8")+b"2bff6").hexdigest() == "56a79c088e5ec5a42b96d6ea0d14cd8712fac03e", "type of knn_fit is None is not bool. knn_fit is None should be a bool"
assert sha1(str(knn_fit is None).encode("utf-8")+b"2bff6").hexdigest() == "268e8a79ae1936524d056ca5169d2c71a72e70a5", "boolean value of knn_fit is None is not correct"

assert sha1(str(type(type(knn_fit))).encode("utf-8")+b"2bff7").hexdigest() == "097a69bbecb57e7e121a5edde688e93d2fbf9d17", "type of type(knn_fit) is not correct"
assert sha1(str(type(knn_fit)).encode("utf-8")+b"2bff7").hexdigest() == "40872fc0b9fdb9d1467bfb606dfb31df50cfe4ea", "value of type(knn_fit) is not correct"

assert sha1(str(type(knn_fit.classes_)).encode("utf-8")+b"2bff8").hexdigest() == "cb4bb5ca40cc7954a382bfa33bd1532a3df2e583", "type of knn_fit.classes_ is not correct"
assert sha1(str(knn_fit.classes_).encode("utf-8")+b"2bff8").hexdigest() == "c422ff7ed5336560783f0ada3740882c912a4d97", "value of knn_fit.classes_ is not correct"

assert sha1(str(type(knn_fit.effective_metric_)).encode("utf-8")+b"2bff9").hexdigest() == "aa9d11a76326cf64bec451997ff4ae28ec4e2a9a", "type of knn_fit.effective_metric_ is not str. knn_fit.effective_metric_ should be an str"
assert sha1(str(len(knn_fit.effective_metric_)).encode("utf-8")+b"2bff9").hexdigest() == "6e6efca67ebf0603a534109614346757c0ef146d", "length of knn_fit.effective_metric_ is not correct"
assert sha1(str(knn_fit.effective_metric_.lower()).encode("utf-8")+b"2bff9").hexdigest() == "49892ddeb7ca1f1c4bf772ff4a585dd10249b813", "value of knn_fit.effective_metric_ is not correct"
assert sha1(str(knn_fit.effective_metric_).encode("utf-8")+b"2bff9").hexdigest() == "49892ddeb7ca1f1c4bf772ff4a585dd10249b813", "correct string value of knn_fit.effective_metric_ but incorrect case of letters"

assert sha1(str(type(knn_fit.n_features_in_)).encode("utf-8")+b"2bffa").hexdigest() == "08548a765b78a776a8b132bc9a51a75e1d960554", "type of knn_fit.n_features_in_ is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(knn_fit.n_features_in_).encode("utf-8")+b"2bffa").hexdigest() == "114d53a02f71bf2cea7abe3280555955b16647b8", "value of knn_fit.n_features_in_ is not correct"

assert sha1(str(type(X.columns.values)).encode("utf-8")+b"2bffb").hexdigest() == "bd7b6259f8053a315fca30986b168530209f266d", "type of X.columns.values is not correct"
assert sha1(str(X.columns.values).encode("utf-8")+b"2bffb").hexdigest() == "b7dfb165b13d0e67f16b09e1dc18733df42a65a0", "value of X.columns.values is not correct"

assert sha1(str(type(y.name)).encode("utf-8")+b"2bffc").hexdigest() == "e8b9e2753401141ac2885e2394f32cc9cffa348a", "type of y.name is not str. y.name should be an str"
assert sha1(str(len(y.name)).encode("utf-8")+b"2bffc").hexdigest() == "793a9c94797d9b5899e9865188980d72d080ac2d", "length of y.name is not correct"
assert sha1(str(y.name.lower()).encode("utf-8")+b"2bffc").hexdigest() == "bcffa9a693072729c80524aafde25fd6b259f4c3", "value of y.name is not correct"
assert sha1(str(y.name).encode("utf-8")+b"2bffc").hexdigest() == "788a47e28ae9d6cd3cbf4823a5d93017aa6addd9", "correct string value of y.name but incorrect case of letters"

assert sha1(str(type(sum(X.Symmetry))).encode("utf-8")+b"2bffd").hexdigest() == "903f7c4e29e57fe29f567ef056e34f51335aec78", "type of sum(X.Symmetry) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(X.Symmetry), 2)).encode("utf-8")+b"2bffd").hexdigest() == "9397d4f7e4087a12b928268b533b6f46554e7efa", "value of sum(X.Symmetry) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(X.Radius))).encode("utf-8")+b"2bffe").hexdigest() == "d9df2ec2ea3462b25d7d591366163062bdb4021e", "type of sum(X.Radius) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(X.Radius), 2)).encode("utf-8")+b"2bffe").hexdigest() == "5bd565738755b0a3859dbce84dffc4a36fe6656e", "value of sum(X.Radius) is not correct (rounded to 2 decimal places)"

print('Success!')

Success!


**Question 2.3**
<br>{points: 1}

Now we will make our prediction on the `Class` of a new observation with a `Symmetry` of 1 and a `Radius` of 0.
First, create a dataframe with these variables and values and call it `new_obs`.
Next, call the `.predict` method of `knn_fit` on `new_obs` to obtain our prediction. Name your predicted class `class_prediction`.
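
To interpret a prediction, it can be useful to look at which training points actually voted. The following sketch does this on hypothetical toy data (all names prefixed with `toy_` are illustrative and not part of the worksheet), using the `kneighbors` and `predict_proba` methods of a fitted `KNeighborsClassifier`.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data (not the worksheet's cancer data)
toy_X = pd.DataFrame({
    "Symmetry": [0.0, 0.5, 1.0, 1.5, 3.0, 3.5],
    "Radius":   [0.0, 0.5, 0.5, 0.0, 3.0, 2.5],
})
toy_y = pd.Series(
    ["Benign", "Benign", "Malignant", "Benign", "Malignant", "Malignant"],
    name="Class",
)

toy_fit = KNeighborsClassifier(n_neighbors=3).fit(toy_X, toy_y)

# The new observation uses the same column names, in the same order, as toy_X
toy_new_obs = pd.DataFrame([[1, 0]], columns=["Symmetry", "Radius"])

# predict returns an array with one label per row of the new data
toy_label = toy_fit.predict(toy_new_obs)

# kneighbors returns the distances to, and row positions of, the nearest training points
distances, positions = toy_fit.kneighbors(toy_new_obs)
voters = toy_y.iloc[positions[0]]

# predict_proba gives the share of neighbours in each class, ordered as in classes_
shares = toy_fit.predict_proba(toy_new_obs)

print(toy_label, list(voters), toy_fit.classes_, shares)
```

Note that `predict` returns an array even for a single observation, which is why the worksheet output appears as `array([...])`.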

In [15]:
### BEGIN SOLUTION
new_obs = pd.DataFrame([[1, 0]], columns=["Symmetry", "Radius"])
class_prediction = knn_fit.predict(new_obs)
### END SOLUTION

# your code here
class_prediction

array(['Malignant'], dtype=object)

In [16]:
from hashlib import sha1
assert sha1(str(type(new_obs is None)).encode("utf-8")+b"cc461").hexdigest() == "ab0357cbb8bfe0c9949b6d69e4c245d2af0635c6", "type of new_obs is None is not bool. new_obs is None should be a bool"
assert sha1(str(new_obs is None).encode("utf-8")+b"cc461").hexdigest() == "a170f5d0292d586ae590d1baca2f13355fb1c178", "boolean value of new_obs is None is not correct"

assert sha1(str(type(new_obs)).encode("utf-8")+b"cc462").hexdigest() == "46dce56999c53d1950618cc8e55690c524892a78", "type of type(new_obs) is not correct"

assert sha1(str(type(new_obs.Symmetry.values)).encode("utf-8")+b"cc463").hexdigest() == "113c0758f0ab20fd6712d816469f64cbc4243a12", "type of new_obs.Symmetry.values is not correct"
assert sha1(str(new_obs.Symmetry.values).encode("utf-8")+b"cc463").hexdigest() == "a41abcf000cf506078cdf4e306ca8121772da0f6", "value of new_obs.Symmetry.values is not correct"

assert sha1(str(type(new_obs.Radius.values)).encode("utf-8")+b"cc464").hexdigest() == "ca2beddfd89352b63fc45c70ca25260e7271c928", "type of new_obs.Radius.values is not correct"
assert sha1(str(new_obs.Radius.values).encode("utf-8")+b"cc464").hexdigest() == "195b0a370675aea18bf01c541c099428f4e248bd", "value of new_obs.Radius.values is not correct"

assert sha1(str(type(class_prediction is None)).encode("utf-8")+b"cc465").hexdigest() == "1ebd9c72926a2430221b10ae35a374b95040d55f", "type of class_prediction is None is not bool. class_prediction is None should be a bool"
assert sha1(str(class_prediction is None).encode("utf-8")+b"cc465").hexdigest() == "883aa082b8853226ec339ff84fd4744558d7fd3b", "boolean value of class_prediction is None is not correct"

assert sha1(str(type(class_prediction)).encode("utf-8")+b"cc466").hexdigest() == "a0d1780e9e6b14b3a86e32b907408ae8253ceec5", "type of class_prediction is not correct"
assert sha1(str(class_prediction).encode("utf-8")+b"cc466").hexdigest() == "2704eb31133cad075ccb97f6c93ca707487c9642", "value of class_prediction is not correct"

print('Success!')

Success!


**Question 2.4**
<br> {points: 1}

Let's perform K-nearest neighbour classification again, but with three predictors. Use the `scikit-learn` package and `K = 7` to classify a new observation where we measure `Symmetry = 1`, `Radius = 0` and `Concavity = 1`. Use the scaffolding from **Questions 2.2** and **2.3** to help you.

- Call `fit` on the same `knn_spec` from before, but this time specify `Symmetry`, `Radius`, and `Concavity` as the predictors. Save the predictors as `X_2` and the target as `y_2`. Store the output in `knn_fit_2`.
- Store the new observation values in an object called `new_obs_2`.
- Store the output of `predict` in an object called `class_prediction_2`.
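
One detail worth knowing when reusing a model object: in `scikit-learn`, `fit` updates the estimator in place and returns that same object, so a name bound to an earlier fit also reflects the refit. The sketch below illustrates this on hypothetical toy data and shows `sklearn.base.clone` as one possible way to keep fits independent. The worksheet does not require this; it is only meant to explain what happens under the hood.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data (not the worksheet's cancer data)
toy = pd.DataFrame({
    "Symmetry":  [0.0, 0.5, 1.0, 1.5, 3.0, 3.5],
    "Radius":    [0.0, 0.5, 0.5, 0.0, 3.0, 2.5],
    "Concavity": [0.1, 0.2, 1.1, 0.3, 2.0, 2.2],
    "Class":     ["Benign", "Benign", "Malignant", "Benign", "Malignant", "Malignant"],
})

spec = KNeighborsClassifier(n_neighbors=3)
fit_two = spec.fit(toy[["Symmetry", "Radius"]], toy["Class"])
fit_three = spec.fit(toy[["Symmetry", "Radius", "Concavity"]], toy["Class"])

# fit returns the estimator itself, so both names refer to one refitted object
same_object = fit_two is fit_three
n_features_seen = fit_two.n_features_in_

# clone makes an unfitted copy with the same settings, keeping this fit separate
independent_fit = clone(spec).fit(toy[["Symmetry", "Radius"]], toy["Class"])

print(same_object, n_features_seen, independent_fit.n_features_in_)
```

In practice this means that after the second call to `fit`, predictions must be made with three-predictor observations, even through the earlier name.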

In [17]:
# your code here
### BEGIN SOLUTION
X_2 = cancer[["Symmetry", "Radius", "Concavity"]]
y_2 = cancer["Class"]
knn_fit_2 = knn_spec.fit(X_2, y_2)
new_obs_2 = pd.DataFrame([[1, 0, 1]], columns=["Symmetry", "Radius", "Concavity"])
class_prediction_2 = knn_fit_2.predict(new_obs_2)
### END SOLUTION
class_prediction_2

array(['Malignant'], dtype=object)

In [18]:
from hashlib import sha1
assert sha1(str(type(knn_fit_2 is None)).encode("utf-8")+b"1c52c").hexdigest() == "56fcaa2be4db81bcc9472911429b629cba6101f3", "type of knn_fit_2 is None is not bool. knn_fit_2 is None should be a bool"
assert sha1(str(knn_fit_2 is None).encode("utf-8")+b"1c52c").hexdigest() == "99533bc9488848c2608ec11993eb5cfe3033cdf5", "boolean value of knn_fit_2 is None is not correct"

assert sha1(str(type(knn_fit_2.kneighbors)).encode("utf-8")+b"1c52d").hexdigest() == "3f30af90ec8c1a811004f1225ae54fbd83681019", "type of knn_fit_2.kneighbors is not correct"
assert sha1(str(knn_fit_2.kneighbors).encode("utf-8")+b"1c52d").hexdigest() == "55cb84f4cc156f43e6f8f2b5aa8379f34d7d579d", "value of knn_fit_2.kneighbors is not correct"

assert sha1(str(type(knn_fit_2.effective_metric_)).encode("utf-8")+b"1c52e").hexdigest() == "4a071b6a39a1366f1c50cfff6bf5bafe83bd944b", "type of knn_fit_2.effective_metric_ is not str. knn_fit_2.effective_metric_ should be an str"
assert sha1(str(len(knn_fit_2.effective_metric_)).encode("utf-8")+b"1c52e").hexdigest() == "6949133fe0bc9d23f3f2391308a85eeac883a74a", "length of knn_fit_2.effective_metric_ is not correct"
assert sha1(str(knn_fit_2.effective_metric_.lower()).encode("utf-8")+b"1c52e").hexdigest() == "fed6a9fc14d942c80421836e802e4f459699e299", "value of knn_fit_2.effective_metric_ is not correct"
assert sha1(str(knn_fit_2.effective_metric_).encode("utf-8")+b"1c52e").hexdigest() == "fed6a9fc14d942c80421836e802e4f459699e299", "correct string value of knn_fit_2.effective_metric_ but incorrect case of letters"

assert sha1(str(type(type(knn_fit_2))).encode("utf-8")+b"1c52f").hexdigest() == "c0688283eb347b3a716716a2762a46dfa6000cbd", "type of type(knn_fit_2) is not correct"
assert sha1(str(type(knn_fit_2)).encode("utf-8")+b"1c52f").hexdigest() == "0704b8be28fd8da7859dd6fc8e5148535b75399d", "value of type(knn_fit_2) is not correct"

assert sha1(str(type(knn_fit_2.n_features_in_)).encode("utf-8")+b"1c530").hexdigest() == "3ac6a926c4de23bcae7e9a759aca7cd489f4cad2", "type of knn_fit_2.n_features_in_ is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(knn_fit_2.n_features_in_).encode("utf-8")+b"1c530").hexdigest() == "a032550f3adad26547a9e2adaae58acd93a5b36e", "value of knn_fit_2.n_features_in_ is not correct"

assert sha1(str(type(X_2.columns.values)).encode("utf-8")+b"1c531").hexdigest() == "a7b153af47464cec438aaac3a5398422f15c739c", "type of X_2.columns.values is not correct"
assert sha1(str(X_2.columns.values).encode("utf-8")+b"1c531").hexdigest() == "c3fd80e7a523addffb5f040e8560caf25c601447", "value of X_2.columns.values is not correct"

assert sha1(str(type(y_2.name)).encode("utf-8")+b"1c532").hexdigest() == "6ecd85411910b1c1e49f8fb6a6c446122031f3a3", "type of y_2.name is not str. y_2.name should be an str"
assert sha1(str(len(y_2.name)).encode("utf-8")+b"1c532").hexdigest() == "ebad5f09e153e6397258881e828d3d859c85154e", "length of y_2.name is not correct"
assert sha1(str(y_2.name.lower()).encode("utf-8")+b"1c532").hexdigest() == "7c29dd51d86bde999e3497b24a52324ffa88d80b", "value of y_2.name is not correct"
assert sha1(str(y_2.name).encode("utf-8")+b"1c532").hexdigest() == "ff10a06265a833d1274bee055d40fe11d6c1de8f", "correct string value of y_2.name but incorrect case of letters"

assert sha1(str(type(y_2.values)).encode("utf-8")+b"1c533").hexdigest() == "6701ed81a4e9d4e3d605bff672f33abc4cab49ab", "type of y_2.values is not correct"
assert sha1(str(y_2.values).encode("utf-8")+b"1c533").hexdigest() == "1eeb2c2596b3d6987df53c982b8921d368bfa856", "value of y_2.values is not correct"

assert sha1(str(type(sum(X_2.Symmetry))).encode("utf-8")+b"1c534").hexdigest() == "401b25657e3ccc6f373b04c3f73b718954f2f87f", "type of sum(X_2.Symmetry) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(X_2.Symmetry), 2)).encode("utf-8")+b"1c534").hexdigest() == "10b6bfc8d4273cf2cda252fff3ead0fcdb4a13cd", "value of sum(X_2.Symmetry) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(X_2.Radius))).encode("utf-8")+b"1c535").hexdigest() == "bc045aaaa3e654827d701f1a154d82ce3bf159de", "type of sum(X_2.Radius) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(X_2.Radius), 2)).encode("utf-8")+b"1c535").hexdigest() == "ba710607201f7fb546dfbff2bbf8125c43eb507a", "value of sum(X_2.Radius) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(new_obs_2 is None)).encode("utf-8")+b"1c537").hexdigest() == "0e6db0897dec3a5a0be1727ee2174838fed8392d", "type of new_obs_2 is None is not bool. new_obs_2 is None should be a bool"
assert sha1(str(new_obs_2 is None).encode("utf-8")+b"1c537").hexdigest() == "3bd0c39ffd477eef3d20295014e9603c292a36db", "boolean value of new_obs_2 is None is not correct"

assert sha1(str(type(new_obs_2)).encode("utf-8")+b"1c538").hexdigest() == "dbd7bddf2ba93297c398200c3aba22eee791cc2b", "type of type(new_obs_2) is not correct"

assert sha1(str(type(new_obs_2.Symmetry.values)).encode("utf-8")+b"1c539").hexdigest() == "f025231cf463eb1b53a6eb94ed504ae8eacbd2b3", "type of new_obs_2.Symmetry.values is not correct"
assert sha1(str(new_obs_2.Symmetry.values).encode("utf-8")+b"1c539").hexdigest() == "ee3ccc35dc76e76f3d3f26ce2af31aba1c00475d", "value of new_obs_2.Symmetry.values is not correct"

assert sha1(str(type(new_obs_2.Radius.values)).encode("utf-8")+b"1c53a").hexdigest() == "88a9bfa1595cca275e5f844a75c8fd0e7e660b85", "type of new_obs_2.Radius.values is not correct"
assert sha1(str(new_obs_2.Radius.values).encode("utf-8")+b"1c53a").hexdigest() == "71bc38e24dca943a89fdba0e143c0c93a97d9dbd", "value of new_obs_2.Radius.values is not correct"

assert sha1(str(type(new_obs_2.Concavity.values)).encode("utf-8")+b"1c53b").hexdigest() == "581e0fa5a3a53d9c88d6df38ab3e022f128d41bb", "type of new_obs_2.Concavity.values is not correct"
assert sha1(str(new_obs_2.Concavity.values).encode("utf-8")+b"1c53b").hexdigest() == "6366ed5e9216d5cf0dab850bfd420217af90487c", "value of new_obs_2.Concavity.values is not correct"

assert sha1(str(type(class_prediction_2 is None)).encode("utf-8")+b"1c53c").hexdigest() == "102656310fdb351c06ebfcd34be208a380387416", "type of class_prediction_2 is None is not bool. class_prediction_2 is None should be a bool"
assert sha1(str(class_prediction_2 is None).encode("utf-8")+b"1c53c").hexdigest() == "2a4bb29e14b26cb4fba6321e00f5ea7264414756", "boolean value of class_prediction_2 is None is not correct"

assert sha1(str(type(class_prediction_2)).encode("utf-8")+b"1c53d").hexdigest() == "0d077a418cd3754263e26adf0dd15e657d98c01a", "type of class_prediction_2 is not correct"
assert sha1(str(class_prediction_2).encode("utf-8")+b"1c53d").hexdigest() == "fa0463880a9a71ddc10bfae3b027f9f4a6da0038", "value of class_prediction_2 is not correct"

print('Success!')

Success!


**Question 2.5**
<br>{points: 1}

Finally, we will perform $K$-nearest neighbour classification again with the `scikit-learn` package and `K = 7`, this time using **all the predictors** in our data set to classify a new observation (we give you its values in the code below).

But first we have to do one important thing: remove the `ID` variable from the analysis. It is an identifier rather than a measurement, so it should not be used for classification. Thankfully, `scikit-learn` provides a nice way of combining data preprocessing and training into a single consistent pipeline.

We will first create a preprocessor that removes the `ID` variable using the `drop` preprocessing step. Since we aren't preprocessing any of the other columns, we will set the `remainder` parameter to `"passthrough"` so that they are kept unchanged. Do so below using the provided scaffolding. Name the preprocessor object `knn_preprocessor`.


In [19]:
# ___ = make_column_transformer(
#     ("drop", [___]),
#     remainder=___
# )

# your code here

### BEGIN SOLUTION
knn_preprocessor = make_column_transformer(
    ("drop", ["ID"]),
    remainder='passthrough'
)
### END SOLUTION

knn_preprocessor

In [20]:
from hashlib import sha1
assert sha1(str(type(knn_preprocessor is None)).encode("utf-8")+b"c9190").hexdigest() == "1aa697ad8121d7a033d091dd1dde11fd9a9048c0", "type of knn_preprocessor is None is not bool. knn_preprocessor is None should be a bool"
assert sha1(str(knn_preprocessor is None).encode("utf-8")+b"c9190").hexdigest() == "fa97b74142ab349c61853efd4aed109620ea4419", "boolean value of knn_preprocessor is None is not correct"

assert sha1(str(type(type(knn_preprocessor))).encode("utf-8")+b"c9191").hexdigest() == "1109d30d9e1570f804f20d5d028ef6afce897841", "type of type(knn_preprocessor) is not correct"
assert sha1(str(type(knn_preprocessor)).encode("utf-8")+b"c9191").hexdigest() == "b7f41df01f0ad5de7c336fbe446f33229e39ffba", "value of type(knn_preprocessor) is not correct"

assert sha1(str(type(knn_preprocessor.get_feature_names_out)).encode("utf-8")+b"c9192").hexdigest() == "bb33b66fca24d58fe1cd06f2a2985674ee8f0698", "type of knn_preprocessor.get_feature_names_out is not correct"
assert sha1(str(knn_preprocessor.get_feature_names_out).encode("utf-8")+b"c9192").hexdigest() == "ee35e154ea1785364b8455f97248689b3f960199", "value of knn_preprocessor.get_feature_names_out is not correct"

print('Success!')

Success!


**Question 2.6**
<br>{points: 1}

Create a **pipeline** that includes the new preprocessor (`knn_preprocessor`) and the model specification (`knn_spec`) using the scaffolding below. Name the pipeline object `knn_pipeline`.
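To see what `make_pipeline` produces, the sketch below chains a preprocessor and a classifier using hypothetical objects (`toy_preprocessor`, `toy_spec`, `toy_pipeline`) and an arbitrary `n_neighbors=3`; none of these are the worksheet's own objects. It also shows that `make_pipeline` names each step after its lowercased class name, which is how a step can be looked up through `named_steps`.

```python
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical preprocessor and model specification for illustration only
toy_preprocessor = make_column_transformer(
    ("drop", ["ID"]),
    remainder="passthrough",
)
toy_spec = KNeighborsClassifier(n_neighbors=3)

# Steps run in the order given: preprocessing first, then the classifier
toy_pipeline = make_pipeline(toy_preprocessor, toy_spec)

# make_pipeline generates step names from the lowercased class names
step_names = list(toy_pipeline.named_steps.keys())
print(step_names)

# Individual steps (and their parameters) can be accessed by name
toy_k = toy_pipeline.named_steps.kneighborsclassifier.n_neighbors
print(toy_k)
```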

In [21]:
# ___ = make_pipeline(___, ___)

# your code here
### BEGIN SOLUTION
knn_pipeline = make_pipeline(knn_preprocessor, knn_spec)
### END SOLUTION

knn_pipeline

In [22]:
from hashlib import sha1
assert sha1(str(type(knn_pipeline is None)).encode("utf-8")+b"570b0").hexdigest() == "5229d62594055e066352668824c5c3908bb29f74", "type of knn_pipeline is None is not bool. knn_pipeline is None should be a bool"
assert sha1(str(knn_pipeline is None).encode("utf-8")+b"570b0").hexdigest() == "1d2994bb8a065a370b5d478a8087d37e3841c353", "boolean value of knn_pipeline is None is not correct"

assert sha1(str(type(type(knn_pipeline))).encode("utf-8")+b"570b1").hexdigest() == "4b6b2b6e8dcd91c384d5e7b3ffaa5df5ed5fac76", "type of type(knn_pipeline) is not correct"
assert sha1(str(type(knn_pipeline)).encode("utf-8")+b"570b1").hexdigest() == "7cff373362fd36faa9d06c119ebecf6820b812f0", "value of type(knn_pipeline) is not correct"

assert sha1(str(type(knn_pipeline.named_steps.kneighborsclassifier.n_neighbors)).encode("utf-8")+b"570b2").hexdigest() == "c0cf487b545aea403a76ddbbedba6f508f1e6b8f", "type of knn_pipeline.named_steps.kneighborsclassifier.n_neighbors is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(knn_pipeline.named_steps.kneighborsclassifier.n_neighbors).encode("utf-8")+b"570b2").hexdigest() == "0c675a0ab79e6e0c763779b13443212fca47db27", "value of knn_pipeline.named_steps.kneighborsclassifier.n_neighbors is not correct"

print('Success!')

Success!


**Question 2.7**
<br>{points: 1}

Finally, `fit` the pipeline and predict the class label for the new observation named `new_obs_all`.
Name the data frame of predictors `X_3` and the target variable `y_3`.
Name the `fit` object `knn_fit_all`, and the class prediction `class_prediction_all`.
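The overall fit-then-predict workflow can be sketched on a tiny made-up data set. Everything here (`toy`, `toy_X`, `toy_y`, `toy_pipeline`, `toy_fit`, `toy_new_obs`, the values, and `n_neighbors=3`) is hypothetical and chosen only so the result is easy to check by eye; it is not the worksheet's data or answer.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data with two well-separated groups
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5, 6],
        "Radius": [-1.0, -0.8, -1.2, 1.0, 1.1, 0.9],
        "Texture": [-0.9, -1.1, -1.0, 1.2, 0.8, 1.0],
        "Class": ["Benign"] * 3 + ["Malignant"] * 3,
    }
)

# Predictors are every column except the label; the target is the label
toy_X = toy.drop(columns=["Class"])
toy_y = toy["Class"]

toy_pipeline = make_pipeline(
    make_column_transformer(("drop", ["ID"]), remainder="passthrough"),
    KNeighborsClassifier(n_neighbors=3),
)

# fit() returns the fitted pipeline itself
toy_fit = toy_pipeline.fit(toy_X, toy_y)

# The new observation needs the same columns as toy_X, including "ID",
# even though the preprocessor drops "ID" before the classifier sees it
toy_new_obs = pd.DataFrame({"ID": [999], "Radius": [1.0], "Texture": [1.0]})
toy_prediction = toy_fit.predict(toy_new_obs)
print(toy_prediction)
```

Because the new point sits in the middle of the three "Malignant" rows, its three nearest neighbours all carry that label, so the majority vote is unambiguous in this toy case.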

In [23]:
new_obs_all = pd.DataFrame(
    [[None, 0, 0, 0, 0, 0.5, 0, 1, 0, 1, 0]],
    columns=[
        "ID",
        "Radius",
        "Texture",
        "Perimeter",
        "Area",
        "Smoothness",
        "Compactness",
        "Concavity",
        "Concave_points",
        "Symmetry",
        "Fractal_dimension",
    ],
)
# X_3 = cancer.drop(columns=[___])
# y_3 = cancer[___]
# ___ = knn_pipeline.fit(___, ___)
# ___ = knn_fit_all.____(____)

### BEGIN SOLUTION
X_3 = cancer.drop(columns=["Class"])
y_3 = cancer["Class"]
knn_fit_all = knn_pipeline.fit(X_3, y_3)
class_prediction_all = knn_fit_all.predict(new_obs_all)
### END SOLUTION

class_prediction_all

array(['Benign'], dtype=object)

In [24]:
from hashlib import sha1
assert sha1(str(type(class_prediction_all)).encode("utf-8")+b"bdab9").hexdigest() == "6a39e4695fdfb5df240814e4336bf420375f8ab0", "type of class_prediction_all is not correct"
assert sha1(str(class_prediction_all).encode("utf-8")+b"bdab9").hexdigest() == "f3a6c0bfa60af2a0f2398b36ffe3f27359ac2478", "value of class_prediction_all is not correct"

print('Success!')

Success!
