# DS106 Machine Learning : Lesson Seven Companion Notebook

### Table of Contents <a class="anchor" id="DS106L7_toc"></a>

* [Table of Contents](#DS106L7_toc)
    * [Page 1 - Introduction](#DS106L7_page_1)
    * [Page 2 - Clustering](#DS106L7_page_2)
    * [Page 3 - K-means Clustering in Python](#DS106L7_page_3)
    * [Page 4 - k-Nearest Neighbors](#DS106L7_page_4)
    * [Page 5 - Performing k-Nearest Neighbors in Python](#DS106L7_page_5)
    * [Page 6 - KNN Analysis](#DS106L7_page_6)
    * [Page 7 - Key Terms](#DS106L7_page_7)
    * [Page 8 - Lesson 2 Practice Hands-On](#DS106L7_page_8)
    * [Page 9 - Lesson 2 Practice Hands-On Solution](#DS106L7_page_9)

    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Overview of this Module<a class="anchor" id="DS106L7_page_1"></a>

[Back to Top](#DS106L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

```python
from IPython.display import VimeoVideo
# Tutorial Video Name: k-Means and K-Nearest Neighbors
VimeoVideo('244082598', width=720, height=480)
```


The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO106-ML-L02overview.zip)**.

# Introduction

Now that you have a good understanding of supervised machine learning, you will move into unsupervised machine learning with *k*-means clustering. You will also meet *k*-nearest neighbors, which, despite the similar name, is a supervised technique that learns from labeled data.  You will start with *k*-means clustering and then move on to *k*-nearest neighbors!

By the end of this lesson, you should be able to:

* Understand the process of clustering
* Perform *k*-means clustering in Python
* Understand the difference between *k*-means and *k*-nearest neighbors
* Perform *k*-nearest neighbors in Python

This lesson will culminate in a Hands-On in which you explore data on cars' miles per gallon using *k*-means and *k*-nearest neighbors in Python.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Clustering<a class="anchor" id="DS106L7_page_2"></a>

[Back to Top](#DS106L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Clustering

*Clustering* is a type of unsupervised learning in which similar data are grouped together. This means the clustering algorithm does not have a label for each piece of data but finds similarities within the data itself. Here's a scenario where this would be useful: imagine that Netflix released the first season of a new comedy show. Over the next week or so, Netflix continues to gather information about who has watched the show and rated it highly. Every user on their platform has an account containing age, gender, city of residence, and so on. Netflix could then run a clustering algorithm on the gathered data that would produce clusters of the demographics that like the show. Perhaps the show is popular amongst male Texans between ages 18-35 or female Canadians between ages 45-65. This would be invaluable information for the company's marketing department, as they can advertise on specific platforms rather than in a generic setting.

---

# k-Means Clustering

_k_-means clustering is a popular algorithm used for cluster analysis via unsupervised machine learning. Once the _k_-means algorithm is given the data and the number of expected clusters, it calculates how close each piece of data is to the center of each cluster and assigns it to the nearest one. However, the _k_-means algorithm doesn't do this in one step. It takes an iterative approach to clustering the data points. Initially, _k_ points are chosen at random to act as cluster centers. Each data point is then assigned to its nearest center, and the location of each center is recalculated as the mean of the points assigned to it. The cycle is repeated until the cluster assignments stop changing. The animated GIF below illustrates how the _k_ points are refined with each iteration of the _k_-means algorithm. Notice that the centers move less and less as the algorithm converges.

Note that two dimensions are easy to visualize, but plotting isn't practical for complex clustering where the data has four or more dimensions.

![A graph title iteration number ten. The x axis runs from zero to one in increments of zero point one. The y axis runs from zero point one to zero point nine in increments of zero point one. The graph is broken into three sections, with data in each section plotted in a different color.](https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif)

Here is the process that *k*-means takes: 

* You choose the number of clusters (*k*)
* Randomly assign each point to a cluster
* Calculate the center of each cluster by taking the mean vector of every point within that cluster 
* Reassign each data point to the cluster where the center is closest
* Recalculate and reassign until the data “settles,” or the clusters stop changing 
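
The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation that ```sklearn``` uses: the function name ```kmeans_sketch```, the sample points, the seed, and the way an empty cluster is re-seeded are all hypothetical choices made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Illustrative k-means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Randomly assign each point to one of the k clusters
    labels = rng.integers(0, k, size=len(points))
    for _ in range(n_iter):
        # Center of each cluster = mean vector of the points assigned to it;
        # an empty cluster is re-seeded with a randomly chosen data point (assumption)
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else points[rng.integers(len(points))]
            for j in range(k)
        ])
        # Reassign each point to the cluster whose center is closest
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once the data has "settled" and no assignment changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers

# Hypothetical 2-D points forming two loose groups
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centers = kmeans_sketch(points, k=2)
print(labels)
print(centers)
```

When the loop ends, every point is assigned to its nearest center, which is exactly the "settled" condition described in the last step.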

---

## Illustrated Definitions in Clustering

---

### 1. _k_ random data observations (k=3)

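In the illustration below, the value of _k_ is 3. Three points (the colored circles) are chosen at random as the initial cluster centers. The distance from each observation to a center, also referred to as the _point-to-cluster-centroid distance_, determines which center each gray square is nearest to. 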

![Three circles, one red, one green, and one blue. The red circle is near the top. The green circle is below the red. The blue circle is to the right of the green circle. Three gray squares in a vertical column are below and to the left of the red circle, and above and to the left of the green circle. Three gray squares are above and to the right of the blue circle. A U shape of six gray squares are below and between the green and blue circles.](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5e/K_Means_Example_Step_1.svg/498px-K_Means_Example_Step_1.svg.png)


---

### 2. Nearest mean

The shaded areas represent the cluster field. These areas are the boundaries that distinguish one cluster from another. Boundaries are initially created and are adjusted as each data point is associated with the mean of the neighboring points. 

![A square with three sections, each shaded in a different color. In the red section at the top of the square is a red circle and red square. In the green section, which is the largest section, is a green circle and six green squares. In the blue section to the right of the square is a blue circle and five blue squares.](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/K_Means_Example_Step_2.svg/556px-K_Means_Example_Step_2.svg.png)

---

### 3. Centroid

Once every point has been assigned and the primitive boundaries have been created (as shown in the image above), the _centroid_ of each cluster can be calculated. This newly created centroid point represents the "center of mass" for all the observations in the cluster.

![At top, a pink circle with an arrow pointing downward and leftward to a red circle. Below the red circle are two green squares with lines downward and rightward to a green circle. Just above the green circle is a light green circle with an arrow pointing to the green circle. Below the green circle are four green squares, each with a line to the green circle. To the right are two blue squares, with lines that point upward and rightward to a blue circle. Just below and to the right of the blue circle is a light blue circle with an arrow pointing to the blue circle. Just above and to the right of the blue circle are three blue squares, each with a line to the blue circle.](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3e/K_Means_Example_Step_3.svg/556px-K_Means_Example_Step_3.svg.png)

---

### 4. Max Convergence

Steps 2 and 3 are repeated, refining the centroids each time, until the assignments stop changing and the boundaries of the clusters settle into their final shape.

![A square with three sections. In the red section, which is the largest, is a red circle and directly below it two red squares. Below this section is a green section, which contains a green circle and four green squares. To the right of this section is a blue section, which contains a blue circle and five blue squares.](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/K_Means_Example_Step_4.svg/556px-K_Means_Example_Step_4.svg.png)

The next section of this lesson will explain, step-by-step, how to conduct k-means clustering in Python. 

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - K-means Clustering in Python<a class="anchor" id="DS106L7_page_3"></a>

[Back to Top](#DS106L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# K-means Clustering in Python

Are you ready to get this party started?  Time to perform *k*-means clustering in Python!

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/528490801"> recorded live workshop </a> that goes over the material on k-means. </p>
    </div>
</div>

---

## Import Packages

Of course the first thing that needs to be done is to import packages: 

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.cluster import KMeans
```

---

## Load in Data

Next, you'll want to load in data.  For this, you can use the ```seaborn``` built-in dataset, ```iris```.  Use this line of code to load it in: 

```python
iris = sns.load_dataset('iris')
```

But if seaborn isn't working for you, **[click here](https://repo.exeterlms.com/documents/V2/DataScience/Machine-Learning/Iris.zip)** to download the data.

---

## Data Wrangling

Almost there! The ```KMeans()``` function cannot handle string values, so you will create a new DataFrame that is the same as the old one, but without the ```species``` column. If there were data you actually wanted to use in a string variable, you could instead re-code that variable numerically, but in this case, you can just drop ```species``` using ```drop()```:

```python
irisTrimmed = iris.drop('species', axis=1)
```

Excellent! Now the data is in a format you can use. 

---

## Perform k-Means

The next step is pretty straightforward.  You will use the function ```KMeans()``` to specify the number of clusters, and then fit it using ```fit()```: 

```python
kmeans = KMeans(n_clusters=2)
kmeans.fit(irisTrimmed)
```

This is what you will receive in return (the exact parameters listed depend on your version of ```sklearn```):

```text
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
```
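
Because *k*-means starts from randomly chosen positions, the cluster numbers (and occasionally the clusters themselves) can change from one run to the next. A minimal sketch of making the result repeatable follows. It assumes ```sklearn```'s bundled copy of the iris measurements as a stand-in for ```irisTrimmed```, and the seed value ```42``` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Stand-in for irisTrimmed: the four iris measurements as a NumPy array (assumption)
measurements = load_iris().data

# random_state=42 is an arbitrary illustrative seed; n_init=10 repeats the
# algorithm from 10 starting points and keeps the run with the lowest inertia
first_run = KMeans(n_clusters=2, n_init=10, random_state=42).fit(measurements)
second_run = KMeans(n_clusters=2, n_init=10, random_state=42).fit(measurements)

# With the same seed, both runs assign every flower to the same cluster number
print(np.array_equal(first_run.labels_, second_run.labels_))
```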

That’s it! You are done.  Get yourself a cookie…  

![A stack of three chocolate chip cookies.](Media/106.L4.10.jpg)

---

## Utilizing k-Means

What? You want to actually be able to use the results?  And after your cookie consumption was validated!  Fine.  How about you plot the data? 

```python
plt.figure(figsize=(10,6))
plt.title('K Means')
plt.scatter(irisTrimmed['petal_length'], irisTrimmed['petal_width'], c=kmeans.labels_, cmap='viridis')
```

This creates a figure, adds a title, and creates a scatter plot where ```petal_length``` is the x axis and ```petal_width``` is the y.  If you left it at that, however, you would have a lot of dots that are all the same color.  Using the argument ```c=kmeans.labels_``` means that the dots will be colored based on the created clusters, and ```cmap=``` is an argument for the color scheme. Your graph should look something like this:

![A graph titled K means. The x axis runs from one to seven in increments of one. The y axis runs from zero point zero to two point five in increments of zero point five. At the bottom left, data is clustered and is in a purple color. Near the center of the graph and then moving up to the right are data plotted mostly in yellow, with only two data points in purple.](Media/106.L4.5.png)

Not bad! You can see that the clustering worked pretty well; the plot shows two distinct clusters, one highlighted in purple and the other in yellow.  But how can you get the data back in a usable form?  Luckily, the fitted ```kmeans``` object has attributes for that! 

```python
kmeans.labels_
```

If you just type in the code above, it will return an array specifying which cluster the data was in:

```text
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

You may recognize this as the ```c``` value that you plotted before.  However, comparing this array to your data is a little awkward, so add it back into the most recent DataFrame as a new column:

```python
irisTrimmed['Group'] = kmeans.labels_
```
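
Once the labels sit in a column, ordinary ```pandas``` tools can summarize the clusters. The sketch below is one possible way to do that, not a required step. It rebuilds the data from ```sklearn```'s bundled iris copy so that it runs on its own; the names ```iris_df```, ```species```, and ```table``` are hypothetical, and ```random_state=0``` is an arbitrary seed.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Stand-in for the seaborn DataFrame, built from scikit-learn's bundled iris data
bunch = load_iris()
iris_df = pd.DataFrame(bunch.data, columns=['sepal_length', 'sepal_width',
                                            'petal_length', 'petal_width'])
species = pd.Series(bunch.target_names[bunch.target], name='species')

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(iris_df)
iris_df['Group'] = kmeans.labels_

# Average measurements of the flowers in each cluster
print(iris_df.groupby('Group').mean())

# How the clusters line up with the species labels that k-means never saw
table = pd.crosstab(species, iris_df['Group'])
print(table)
```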

Now you can skim through your data and tell at a glance where each point falls. This will be mainly what you will be doing with *k*-means clustering, but you can also find the center point (centroid) of your clusters as well, using ```cluster_centers_```: 

```python
kmeans.cluster_centers_
```

Which returns this array:

```text
array([[6.30103093, 2.88659794, 4.95876289, 1.69587629],
       [5.00566038, 3.36981132, 1.56037736, 0.29056604]])
```
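
Each row of this array is one cluster, and the columns follow the order of the columns that were passed to ```fit()```. As a small sketch, the values printed above can be wrapped in a labeled DataFrame to make them easier to read; the variable names here are hypothetical, and in your own notebook you would pass ```kmeans.cluster_centers_``` directly instead of typing the numbers.

```python
import numpy as np
import pandas as pd

# Centroid coordinates copied from the output above
centers = np.array([[6.30103093, 2.88659794, 4.95876289, 1.69587629],
                    [5.00566038, 3.36981132, 1.56037736, 0.29056604]])

# The four iris measurement columns, in the order they appear in the DataFrame
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

centroids = pd.DataFrame(centers, columns=feature_names)
centroids.index.name = 'Group'
print(centroids)
```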

You can also find the sum of the squared distances from every point to its cluster center using ```inertia_```:

```python
kmeans.inertia_
```

The inertia value provided is:

```text
152.34795176035792
```
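
To see where a number like this comes from, ```inertia_``` can be recomputed by hand as the sum of squared distances between each observation and the centroid of its assigned cluster. The sketch below assumes ```sklearn```'s bundled iris measurements as a stand-in for ```irisTrimmed``` and an arbitrary ```random_state=0```, so the value it prints is not guaranteed to match the one shown above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Stand-in for irisTrimmed (assumption): the four iris measurements
measurements = load_iris().data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(measurements)

# Look up the centroid that belongs to each observation's assigned cluster
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]

# Squared Euclidean distance from each observation to its centroid, summed over all rows
manual_inertia = np.sum((measurements - assigned_centers) ** 2)

print(manual_inertia, kmeans.inertia_)
```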

However, it is rare that finding the centroid or the distance from each centroid will be required.  Be sure to focus on plotting your data and knowing how to add the results back into your DataFrame, so you can use it.  

For real now - would you like a cookie?   

![A hand holding five small cookies, each a different color.](Media/106.L4.11.jpg)

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - k-Nearest Neighbors<a class="anchor" id="DS106L7_page_4"></a>

[Back to Top](#DS106L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# k-Nearest Neighbors

*k*-nearest neighbors (or "KNN") is an important algorithm for supervised machine learning: it learns from data points whose labels are already known. Basically, KNN looks at the data points closest to an unknown point (the neighbors) and decides whether that point should be classified into one type or another based on them.   

KNN prediction first calculates the distance from the unknown point to every point in the dataset, then sorts them from closest to furthest away, and finally predicts the majority label of the *k* closest points.  *k* is the number of nearest points that are being compared to the unknown point, so a *k* of 3 looks at the 3 points with the shortest distance from the unknown data you're trying to predict.  While the distance can be any metric measure, standard *Euclidean distance* is the most common choice. 

It should be noted that the *k* you choose can impact what category a point will fall into.  The higher the *k*, the more bias you introduce.  The boundaries between categories will be much smoother, but you will falsely categorize more data points near those boundaries.

The number of samples can be a user-defined constant (*k*-nearest neighbor learning), or vary based on the local density of points (*radius-based neighbor learning*). Neighbors-based methods are known as *non-generalizing machine learning* methods, since they simply “remember” all of the training data. 

---

## Example KNN Classification

If you look at this image, the green point is the unknown, or the data point you are trying to predict.  You are trying to gauge whether the green circle should fall with the blue squares, or with the red triangles.  If you looked at a *k* of 3 (surrounded by the solid black line) then you may assume that the green circle should really be classified as a red triangle, since two of its three nearest neighbors are triangles.  But if you were to expand to a *k* of 5 (surrounded by the dotted black line) then you may assume that instead, the green circle should be classified as a blue square, since three of its five nearest neighbors are squares.  This demonstrates that the number of neighbors is important! Luckily, functions in Python will not only help you run KNN, but also help you determine the optimal *k* for any given dataset. 

![A solid line circle surrounded by a large dashed line circle. Inside the solid line circle is a green circle in the middle with a green question mark above it. To the right are two red triangles. To the left is a blue square. In the space outside the solid line circle but within the dashed line circle are two blue squares. Outside the dashed line circle are three blue squares near the upper left and three red triangles near the top and top right of the space.](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/500px-KnnClassification.svg.png)
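
The prediction steps (measure distances, sort, take the majority vote) and the way the answer flips with a larger *k* can be sketched with nothing but the Python standard library. The function ```knn_predict``` and the coordinates are hypothetical; the points are only arranged to resemble the figure, with *k* values of 3 and 5 chosen for this sketch.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, unknown, k):
    """Predict a label by majority vote among the k nearest training points."""
    # 1. Euclidean distance from the unknown point to every training point
    distances = [math.dist(point, unknown) for point in train_points]
    # 2. Sort the training indices from closest to furthest away
    order = sorted(range(len(train_points)), key=lambda i: distances[i])
    # 3. Majority label among the k closest points
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical coordinates: two triangles and one square close by,
# two more squares a little further out, and two distant points
train_points = [(1.0, 0.3), (1.1, -0.4), (-1.2, 0.0),
                (-2.0, 1.0), (-1.8, -1.5), (-3.5, 3.0), (3.0, 3.5)]
train_labels = ['triangle', 'triangle', 'square',
                'square', 'square', 'square', 'triangle']
unknown = (0.0, 0.0)

print(knn_predict(train_points, train_labels, unknown, k=3))  # triangle
print(knn_predict(train_points, train_labels, unknown, k=5))  # square
```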

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Performing k-Nearest Neighbors in Python<a class="anchor" id="DS106L7_page_5"></a>

[Back to Top](#DS106L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Performing k-Nearest Neighbors in Python

KNN in Python is a relatively simple process with very few parameters, and it is easy to use and to extend with more data.  However, there are limitations.  Because the distance from each new point to every training point is calculated, it takes a fair amount of processing power, so it does not work well with large datasets or datasets that have huge numbers of variables.  

To predict a categorical variable with KNN, you will need continuous predictor variables.  In this example, you'll use the built-in ```iris``` dataset from ```seaborn``` to predict a species of iris (categorical) from continuous independent variables such as petal length and width.  

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/528059230"> recorded live workshop </a> that goes over the material on KNN. </p>
    </div>
</div>

---

## Import Packages

First, as always, you have to import your packages! 

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
```

---

## Load in Data

Now you can load in your data.  You will use the following code to access the ```iris``` dataset:

```python
iris = sns.load_dataset('iris')
```

If you don't have access to it, you can also **[click here to download it](https://repo.exeterlms.com/documents/V2/DataScience/Machine-Learning/Iris.zip)**.

---

## Question Setup

With this analysis, you are trying to predict which type of iris (```species```) a flower will be based on the other variables in the dataset - length and width of the various flower parts.

---

## Data Wrangling

Now, you have data, but of course there's work to do before you get to the fun part! 

---

### Scaling Your Data

KNN is based completely on distance from one point to another, so you need to make sure that all of the data is on the same scale.  If some of the variables were in inches and some were in yards, it could do some funky things to your results.  Luckily, ```sklearn``` has a tool for just such a situation! 

```python
scaler = StandardScaler()
scaler.fit(iris.drop('species', axis=1))
scaledVariables = scaler.transform(iris.drop('species',axis=1))
irisScaled = pd.DataFrame(scaledVariables, columns=iris.columns[:-1])
```

This fits ```StandardScaler()``` to your data, except for the column that you are predicting, which is ```species```.  You then transform the data and save it as ```scaledVariables```.  Finally, you turn it into a DataFrame that you can work with called ```irisScaled```; the code ```[:-1]``` takes every column name except the last one, ```species```, which is the target variable rather than a predictor.  

The new data now looks like this, because it's all been placed on the same scale:

![Four columns and five rows of data. Column headings are sepal underscore length, sepal underscore width, petal underscore length, and petal underscore width. Row zero, negative zero point nine zero zero six eight one, one point zero one nine zero zero four, negative one point three four zero two two seven, negative one point three one five four four four. Row one, negative one point one four three zero one seven, negative zero point one three one nine seven nine, negative one point three four zero two two seven, negative one point three one five four four four. Row two, negative one point three eight five three five three, zero point three two eight four one four, negative one point three nine seven zero six four, negative one point three one five four four four. Row three, negative one point five zero six five two one, zero point zero nine eight two one seven, negative one point two eight three three eight nine, negative one point three one five four four four. Row four, negative one point zero two one eight four nine, one point two four nine two zero one, negative one point three four zero two two seven, negative one point three one five four four four.](Media/106.L3.11.png)
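
Scaled values like these come from subtracting each column's mean and dividing by each column's standard deviation. The sketch below checks that by-hand calculation against ```StandardScaler()``` on a small made-up array (one column in inches, one in yards); the numbers and variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements: column 0 in inches, column 1 in yards
raw = np.array([[36.0, 1.0],
                [72.0, 2.5],
                [18.0, 0.5],
                [54.0, 2.0]])

# By hand: subtract the column mean, then divide by the column standard deviation
# (np.std defaults to the population form, ddof=0, which StandardScaler also uses)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# fit_transform() is shorthand for fit() followed by transform() on the same data
scaled = StandardScaler().fit_transform(raw)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has a mean of 0 and a standard deviation of 1, so no single unit of measurement dominates the distance calculation.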

---

### Creating x and y Datasets

Now you'll need to subset your x and y data: 

```python
x = irisScaled
y = iris['species']
```

---

## Train Test Split

The next step is to do the train/test split. The basis of ```train_test_split()``` is just to separate your data, and you can apply many different machine learning techniques to the data once you’ve used train_test_split.

```python
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3, random_state=101)
```

Here, ```x``` is the scaled dataset of independent variables and ```y``` is the thing you are predicting (the dependent variable).  The argument ```test_size=0.3``` holds out 30% of the rows for testing.  The argument ```random_state=``` does not usually have to be set, but, in this case, it will make your results line up with the example.
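
As a quick sanity check on what the split produces, the sketch below runs ```train_test_split()``` on stand-in arrays with the same number of rows as ```iris``` (150). The arrays are placeholders invented for this example, not the real measurements.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 150 rows and 4 feature columns, like the scaled iris data
x = np.arange(600).reshape(150, 4)
y = np.repeat(['setosa', 'versicolor', 'virginica'], 50)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)

# 30% of 150 rows go to the test set, the remaining 70% to the training set
print(x_train.shape, x_test.shape)   # (105, 4) (45, 4)
print(len(y_train), len(y_test))     # 105 45
```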

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - KNN Analysis<a class="anchor" id="DS106L7_page_6"></a>

[Back to Top](#DS106L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# KNN Analysis

Now that you've taken care of the setup, it is time to take a trial run and actually perform KNN!

```python
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train, y_train)
pred = knn.predict(x_test)
```

Done! Wasn’t that easy?  Simply a matter of utilizing the ```KNeighborsClassifier()``` function, specifying the number of neighbors with the argument ```n_neighbors=```, fitting the model to your training data, and then predicting on the test data! In this case the number of neighbors is one to start with.

---

## Interpret KNN Predictions

You can look at ```pred``` by itself, as shown below, but it is hard to understand: 

```text
array(['setosa', 'setosa', 'setosa', 'virginica', 'versicolor',
       'virginica', 'virginica', 'versicolor', 'virginica', 'setosa',
       'virginica', 'setosa', 'setosa', 'virginica', 'virginica',
       'versicolor', 'versicolor', 'versicolor', 'setosa', 'versicolor',
       'versicolor', 'setosa', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'virginica', 'setosa', 'setosa',
       'virginica', 'versicolor', 'virginica', 'versicolor', 'virginica',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'virginica', 'setosa', 'setosa', 'setosa', 'versicolor',
       'versicolor'], dtype=object)
```

So you can use some other ```sklearn``` tools to make this pretty and usable.  You'll call on the functions ```confusion_matrix()``` and ```classification_report()```. Start with the confusion matrix:

```python
print(confusion_matrix(y_test, pred))
```

Here is what is produced: 

```text
[[13  0  0]
 [ 0 19  1]
 [ 0  1 11]]
```
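
The printed matrix carries no labels. By default, ```confusion_matrix()``` orders the classes by their sorted label values (alphabetically for strings), with actual classes as rows and predicted classes as columns. One way to remove the guesswork is to pass the order explicitly and wrap the result in a DataFrame. The sketch below uses a small hypothetical set of true and predicted labels rather than the model's output; with a fitted model, ```knn.classes_``` holds the same label order.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Small hypothetical example: six flowers, one of them misclassified
y_true = ['setosa', 'versicolor', 'virginica', 'versicolor', 'virginica', 'setosa']
y_pred = ['setosa', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'setosa']

# Fixing the label order explicitly controls the row and column order
labels = ['setosa', 'versicolor', 'virginica']
matrix = confusion_matrix(y_true, y_pred, labels=labels)

labeled = pd.DataFrame(matrix,
                       index=[f'actual {name}' for name in labels],
                       columns=[f'predicted {name}' for name in labels])
print(labeled)
```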

This confusion matrix shows how the predicted data lines up with reality. There are three different species of iris: ```setosa```, ```versicolor```, and ```virginica```.  Even though there are no labels on the above chart, they go in that (alphabetical) order, matching the classification report further down. Each row is the actual species and each column is the predicted species.  Imagine it looking something like this with headers: 

<table class="table table-striped">
    <tr>
        <th>Species</th>
        <th>setosa</th>
        <th>versicolor</th>
        <th>virginica</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>setosa</td>
        <td>13</td>
        <td>0</td>
        <td>0</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>versicolor</td>
        <td>0</td>
        <td>19</td>
        <td>1</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>virginica</td>
        <td>0</td>
        <td>1</td>
        <td>11</td>
    </tr>
</table>

So what this means is that 13 iris plants were correctly classified as setosa, with no misclassifications.  19 iris plants were correctly classified as versicolor, with one accidentally being misclassified as virginica.  And lastly, 11 iris plants were correctly classified as virginica, with one accidentally being classified as versicolor.  Not many mistakes here, so it looks like this KNN algorithm is pretty darn accurate! Want verification of that accuracy in numbers, though? Then check out the classification report as well:

```python
print(classification_report(y_test,pred))
```

Here is what is produced: 

```text
             precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        13
  versicolor       0.95      0.95      0.95        20
   virginica       0.92      0.92      0.92        12

   micro avg       0.96      0.96      0.96        45
   macro avg       0.96      0.96      0.96        45
weighted avg       0.96      0.96      0.96        45
```

You want to focus on the ```precision``` column here. Precision is the share of predictions for a species that turned out to be correct: every plant the KNN algorithm predicted to be setosa really was setosa, 95% of the plants predicted to be versicolor were versicolor, and 92% of the plants predicted to be virginica were virginica.  Awesome! You can also look at the ```weighted avg``` row for ```precision```, which gives an overall value of 96%. 
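
The numbers in the report can be traced back to the confusion matrix. As a worked sketch, precision divides the correct predictions on the diagonal by the column totals, recall divides them by the row totals, and overall accuracy divides them by the grand total. The variable names are hypothetical; the matrix is the one printed earlier for *k*=1.

```python
import numpy as np

# Confusion matrix from the k=1 model: rows = actual species, columns = predicted species
matrix = np.array([[13,  0,  0],
                   [ 0, 19,  1],
                   [ 0,  1, 11]])

correct = np.diag(matrix)                   # correctly classified plants per species
precision = correct / matrix.sum(axis=0)    # correct / everything predicted as that species
recall = correct / matrix.sum(axis=1)       # correct / everything actually of that species
accuracy = correct.sum() / matrix.sum()     # correct / all test plants

print(precision.round(2))  # [1.   0.95 0.92]
print(recall.round(2))     # [1.   0.95 0.92]
print(round(accuracy, 2))  # 0.96
```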

---

## Choose the Best Model

What if you wanted to get that accuracy level just a bit higher? You could try using the *Elbow Method*, which is a way to plot error to see which number of neighbors is best.

```python
errorRate = []
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    predI = knn.predict(x_test)
    errorRate.append(np.mean(predI != y_test))
```

The first line of this creates an empty list called ```errorRate```.  Then you set up a for loop that fits and tests a model for every *k* from 1 through 39 with the ```range()``` function (```range(1,40)``` stops before 40).  Finally, it appends the error rate, which is the proportion of test points that were predicted incorrectly, to the list.  You can look at the list now, but it is a little hard to understand:

```text
[0.044444444444444446,
 0.044444444444444446,
 0.022222222222222223,
 0.044444444444444446,
 0.022222222222222223,
 0.044444444444444446,
 0.0,
 0.0,
 0.0,
 0.022222222222222223,
 0.0,
 0.0,
 0.022222222222222223,
 0.044444444444444446,
 0.044444444444444446,
 0.044444444444444446,
 0.022222222222222223,
 0.044444444444444446,
 0.06666666666666667,
 0.06666666666666667,
 0.06666666666666667,
 0.06666666666666667,
 0.06666666666666667,
 0.06666666666666667,
 0.06666666666666667,
 0.06666666666666667,
 0.06666666666666667,
 0.08888888888888889,
 0.08888888888888889,
 0.1111111111111111,
 0.1111111111111111,
 0.15555555555555556,
 0.15555555555555556,
 0.15555555555555556,
 0.13333333333333333,
 0.15555555555555556,
 0.13333333333333333,
 0.13333333333333333,
 0.1111111111111111]
```

Which is why you can plot it! Here's how - and you can choose your own labels, colors, and title: 

```python
plt.figure(figsize=(10,6))
plt.plot(range(1,40), errorRate, color='blue', linestyle='dashed', marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
```

Here is the resulting plot:

![A graph titled Error Rate versus K Value. The x axis is labeled K and runs from zero to forty in increments of five. The y axis is labeled error rate and runs from zero point zero zero to zero point one six in increments of zero point zero two. Data is plotted on the graph, with dashed lines connecting each piece of data.](Media/106.L3.8.png)

In this case, *k* values of 7, 8, 9, 11, and 12 are all equally low, each with an error rate of zero on the test data. No error there!
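
The lowest-error values of *k* can also be read off programmatically instead of from the plot. This small sketch uses only the first twelve error rates from the list above; the names ```lowest``` and ```best_ks``` are hypothetical.

```python
# First twelve error rates from the list above, for k = 1 through 12
errorRate = [0.044444444444444446, 0.044444444444444446, 0.022222222222222223,
             0.044444444444444446, 0.022222222222222223, 0.044444444444444446,
             0.0, 0.0, 0.0, 0.022222222222222223, 0.0, 0.0]

lowest = min(errorRate)
# enumerate starts at 1 because the first entry in the list corresponds to k=1
best_ks = [k for k, error in enumerate(errorRate, start=1) if error == lowest]

print(lowest, best_ks)  # 0.0 [7, 8, 9, 11, 12]
```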

---

## Run the Final Model

The final step is simply to use one of these *k* values in the model:

```python
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(x_train, y_train)
pred = knn.predict(x_test)
```

Then you can print the confusion matrix for the new predictions: 

```text
[[13  0  0]
 [ 0 20  0]
 [ 0  0 12]]
```

And the classification report for the new predictions:

```text
             precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        13
  versicolor       1.00      1.00      1.00        20
   virginica       1.00      1.00      1.00        12

   micro avg       1.00      1.00      1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
```

When you ran the KNN again with *k*=8, you were able to predict the species of iris with 100% accuracy!  Excellent work!

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Key Terms<a class="anchor" id="DS106L7_page_7"></a>

[Back to Top](#DS106L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Clustering</td>
        <td>Unsupervised learning in which similar data are grouped together.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Centroid</td>
        <td>The center of each cluster.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>k-Nearest Neighbor (KNN)</td>
        <td>A supervised method used for classification and regression.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Radius-Based Neighbor Learning</td>
        <td>A neighbors-based method in which the number of neighbors varies with the local density of points.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Non-Generalizing Machine Learning Method</td>
        <td>Remembers all the training data.</td>
    </tr>
</table>

---

## Key Python Packages

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sklearn.cluster</td>
        <td>Used for k-means analysis.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sklearn.preprocessing</td>
        <td>Contains code for scaling your data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sklearn.neighbors</td>
        <td>For performing KNN.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sklearn.metrics</td>
        <td>For interpreting KNN.</td>
    </tr>
</table>

---

## Key Python Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>KMeans()</td>
        <td>Performs k-means analysis.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>n_clusters=</td>
        <td>An argument to KMeans() in which you choose the number of clusters.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>kmeans.labels_</td>
        <td>Provides the cluster to which each data point belongs.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>kmeans.cluster_centers_</td>
        <td>Provides the centroids for each cluster.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>kmeans.inertia_</td>
        <td>Provides the sum of squared distances from every point to its nearest centroid.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>StandardScaler()</td>
        <td>Standardizes each variable to a mean of 0 and a standard deviation of 1, which puts all of your data on the same scale for KNN.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>scaler.transform()</td>
        <td>Applies the fitted scaling to your data and returns the scaled values.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>KNeighborsClassifier()</td>
        <td>Creates a KNN classifier.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>n_neighbors=</td>
        <td>An argument to KNeighborsClassifier() that allows you to specify the number of neighbors.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>confusion_matrix()</td>
        <td>Creates a confusion matrix from your KNN.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>classification_report()</td>
        <td>Provides the precision, recall, F1-score, and support for each class, along with the overall accuracy of your model.</td>
    </tr>
</table>
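
To see how the pieces in the tables above fit together, here is a minimal sketch. It is not the lesson's own analysis: the data come from `make_blobs` (synthetic stand-in data), and `train_test_split`, the 70/30 split, `n_init=10`, and the `random_state` values are assumptions chosen only to make the example self-contained and repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 150 points around 3 centers.
X, y = make_blobs(n_samples=150, centers=3, n_features=2, random_state=42)

# --- k-means (unsupervised: the labels in y are never used) ---
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
print(kmeans.labels_[:10])      # cluster assigned to each of the first 10 points
print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.inertia_)          # sum of squared distances to the nearest centroid

# --- KNN (supervised: the labels in y are required) ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
pred = knn.predict(X_test_scaled)

print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
```

Notice that *k*-means never sees `y`, while KNN cannot be fit without it. That is the practical difference between the unsupervised and supervised methods in this lesson.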

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Lesson 2 Practice Hands-On<a class="anchor" id="DS106L7_page_8"></a>

[Back to Top](#DS106L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">



This Hands-On will **not** be graded, but you are encouraged to complete it, because the best way to become a data scientist is to practice.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## K-Means Hands-On

In this Hands-On exercise, you will create a project that will solidify your understanding of *k*-means and *k*-nearest neighbors. This Hands-On will be completed in Python, using your text editor or IDE of choice (e.g., VSCode, Jupyter Notebooks, or Spyder).

Determine how cars are grouped together by using the `mpg` dataset built into Seaborn.  Import it using the following code: 

```python
import seaborn as sns

Mpg = sns.load_dataset('mpg')
```

If seaborn isn't working for you, **[click here](https://repo.exeterlms.com/documents/V2/DataScience/Machine-Learning/Mpg.zip)** to download the data.

Remember that you need continuous variables for these analyses, so you'll want to pinpoint columns such as ```mpg```, ```cylinders```, ```displacement```, ```horsepower```, ```weight```, ```acceleration``` or ```model_year``` as variables.  You'll also need to have those continuous variables as integers...so...hint, hint...there may be a little data wrangling involved.
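
As a nudge on the wrangling hint, the sketch below shows one possible way to get a set of columns into integer form. It uses a small made-up dataframe named `toy` (hypothetical rows, not the real `mpg` data), and the choice to drop rows with missing values and to round before converting is an assumption, not the required approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows shaped like the mpg data (values invented for illustration).
toy = pd.DataFrame({
    "mpg": [18.0, 15.5, 32.1, 24.0],
    "horsepower": [130.0, np.nan, 65.0, 95.0],
    "weight": [3504, 3693, 1985, 2620],
    "name": ["car a", "car b", "car c", "car d"],
})

numeric_cols = ["mpg", "horsepower", "weight"]

# Keep only the numeric columns and drop rows with missing values,
# because NaN cannot be converted to an integer.
trimmed = toy[numeric_cols].dropna()

# Round first so fractional values such as 32.1 are rounded rather than truncated,
# then convert every column to an integer type.
wrangled = trimmed.round().astype(int)

print(wrangled.dtypes)
print(wrangled)
```

If you would rather keep every row, filling the missing values (for example with the column median) before converting is another option to consider.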

Use the *k*-means machine learning algorithm first and then the *k*-nearest neighbors algorithm. For each one, find the most appropriate *k* to examine and provide the graph that supports your choice. Then add the cluster labels back into your dataframe.  How are these groups being divided? What conclusions can you draw about the data?
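
One common way to look for an appropriate *k* is sketched below, again on synthetic `make_blobs` data rather than the `mpg` data. Treat it as an assumption about the approach: for *k*-means it records the inertia for each candidate *k* (often called the elbow method), and for KNN it records the error rate on a held-out test set. The range of 1 to 8, the 70/30 split, and the `random_state` values are arbitrary illustration choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption): 200 points around 4 centers.
X, y = make_blobs(n_samples=200, centers=4, n_features=2, random_state=0)

# k-means: record the inertia for each candidate number of clusters.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# KNN: record the test error rate for each candidate number of neighbors.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
error_rates = {}
for k in range(1, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rates[k] = np.mean(knn.predict(X_test) != y_test)

for k in inertias:
    print(k, round(inertias[k], 1), round(error_rates[k], 3))
```

Plotting the values in `inertias` against *k* gives the kind of graph the exercise asks for: look for the point where adding more clusters stops reducing the inertia very much. With real data that has variables on different scales, you would also scale the data before fitting KNN.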

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Lesson 2 Practice Hands-On Solution<a class="anchor" id="DS106L7_page_9"></a>

[Back to Top](#DS106L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Lesson 2 Practice Hands-On Solution

The Jupyter Notebook containing the Lesson 2 Practice Hands-On solution is located **[here](https://repo.exeterlms.com/documents/V2/DataScience/Machine-Learning/DSO106MLL3handson.zip)**. Please make sure to download a copy and then open it with your own Jupyter Notebook program.