# KNN Classification

## ANTICIPATED TIME


2 hours



## BEFORE YOU BEGIN



[Clustering](Clustering.ipynb)

## WHAT YOU WILL LEARN


- What is the K-Nearest Neighbors (KNN) algorithm and how does it work?
- How do you choose the value of 'k' in KNN, and what impact does it have on the model's performance?
- What distance metrics are commonly used in KNN, and how do they affect classification?
- How do you handle categorical data in KNN?
- What are the advantages and disadvantages of using KNN for classification?
- How do you evaluate the performance of a KNN classifier?
- Can you implement KNN in Python using libraries like scikit-learn?


## DEFINITIONS YOU’LL NEED TO KNOW


- Confusion Matrix - a contingency table for classifier performance. It shows how often the classifier was correct and the kinds of mistakes it made.
- Accuracy - a measurement of how many times a classifier was correct.
- Recall - is the ratio of true positives to the total number of actual positives.
- Precision - is the number of true positives divided by all the times true is predicted.
- Specificity - a ratio that uses the number of true negatives to the total number of actual negatives to test how good the classifier is.
- KNN - a commonly used classifier; also known as k-nearest neighbors.
- Classification - the process of predicting a class for a new observation.
- Classifier - a tool used to predict classes from data; the predictions are based on features of the data.
- Standardize - putting the data on the same scale


## SCENARIO:


Angelia and Diego have thought of different ways to understand the data they collected to help solve their pollution problem in the city. They used measures of association to understand the relationships and patterns in their datasets and used clustering to group areas with similar pollution levels. Clustering did not give them clear information about which areas had the worst pollution problems. They needed a method that could make predictions and not just group certain areas with each other. Kiana suggested using KNN, which will help them predict pollution levels in one city by looking at pollution levels in nearby cities that they have already collected data about. Using this method will make it easier for the group to predict which cities most need cleaner air.


**So what even is KNN?**

In the last notebook, we learned about clustering and putting data together so we can label them, which is really helpful when we don’t know the groups ahead of time.

But what if we knew the groups ahead of time? This is where ***K-nearest neighbors (KNN)*** is super helpful because it will help us ‘classify’ new data into groups that are already known. This kind of classification means we know and name the groups, so we use a model to predict the label for any new data point. In other words, when we know some of the features of the data, we can give those pieces of information to the model (calculations based on your data) so it can classify the data into labels.



**Diving Deeper into KNN**

Let’s think through how KNN works.

Remember that idea of distance we introduced in the clustering notebook? Well, KNN classifies (labels) new items based on their distance to other datapoints. Actually, KNN uses a specific number of closest data points to predict the label of the new unknown item that we are considering. To make it simple, we’ll call that number k.

If our k = 2, then we are looking for the two closest data points to predict the label for the new data point the model considers. If k = 3, we’ll look at the three closest data points to predict the label.

How simple is that?!

This can be pretty useful most of the time, but we still have to be careful because we don't know the best value k to use. It turns out that different values of k make a big difference in how accurate our classifier will be.
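To make the idea concrete, here is a minimal sketch of KNN in plain Python. The function name `knn_predict` and the (Methane, NOxEmissions) readings are made up for illustration and are not taken from the real dataset. The sketch measures the straight-line (Euclidean) distance to every labeled point, keeps the k closest, and lets them vote.

```python
import math
from collections import Counter

def knn_predict(points, labels, new_point, k):
    """Predict a label for new_point by majority vote among its k nearest points."""
    # Euclidean distance from new_point to every labeled point.
    distances = [
        (math.dist(p, new_point), label) for p, label in zip(points, labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up (Methane, NOxEmissions) readings; 1 = contaminated, 0 = not contaminated.
points = [(350, 430), (370, 480), (390, 490), (700, 730), (780, 790), (850, 960)]
labels = [0, 0, 0, 1, 1, 1]

new_point = (600, 640)
prediction_k3 = knn_predict(points, labels, new_point, k=3)
prediction_k5 = knn_predict(points, labels, new_point, k=5)
print(prediction_k3, prediction_k5)
```

With these made-up numbers, the same new point gets a different label when k changes from 3 to 5, which is why the choice of k deserves some care.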


**Standardization**

Another thing you might need to think about is standardizing the dataset if you have different scales. Imagine you have two things you’d like to compare: the weight of someone (weight) and how tall someone is (height). The weight of someone in pounds is usually a bigger number than the height of someone in feet. If you only look at the bigger number (weight), it might look more important than the smaller number (height), even though it is also important.

Yikes.

You can imagine how confusing that could get if we only thought the biggest number was the most important.  To fix this, we can take each number and do a special math trick called “standardizing.” This means we put them all on the same scale. This way, weight and height can be compared fairly, which helps the KNN model make smarter choices.
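One common way to standardize is the z-score: subtract the average from each value and divide by the standard deviation. The sketch below assumes that approach (the notebook does not say which method it uses), and the weights and heights are invented for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values so they have mean 0 and standard deviation 1 (z-scores)."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    return [(v - center) / spread for v in values]

# Invented measurements on very different scales.
weights_lb = [120, 150, 180, 210]
heights_ft = [5.0, 5.5, 6.0, 6.5]

weights_z = standardize(weights_lb)
heights_z = standardize(heights_ft)
print(weights_z)
print(heights_z)
```

After standardizing, both lists land on the same scale, so neither weight nor height dominates a distance calculation just because its raw numbers are bigger.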


**Tools to Test Classifiers (How to put data into the labels)**

To begin, we have to standardize the features by putting them on the same scale. When we say “standardize” or put the data on the "same scale," we mean making sure big numbers don't affect the distance more than small numbers.

How can we have a model that can predict how to assign the labels? A simple way is to split our data into two sets: one that we use to train the model and another that we use to test its predictions. KNN uses the items that were classified (labeled) ahead of time in the training data to classify the new data points it finds in the testing data.

We can check if the classifier is learning properly by splitting the data into training and testing sets. When we do this, all of the rows are shuffled together (almost like a deck of cards) and then split into two groups: one for training the classifier and one for testing the classifier later. A classifier that does well on the testing group is more likely to generalize to new data that we come across.

If a classifier chooses the same label each time we use it, the model isn’t working properly. If the model overfits (memorizes the training data), it won't work well on new data. If it generalizes well (i.e., learns the structure of the data), its performance on new data will be close to its performance on the training data.
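The shuffle-and-split idea can be sketched in a few lines of plain Python. The helper name `shuffle_and_split`, the 30% test fraction, and the seed are assumptions chosen for illustration; in this notebook the data already comes split into separate training and testing files.

```python
import random

def shuffle_and_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows like a deck of cards, then cut them into train and test groups."""
    shuffled = rows[:]                      # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the shuffle repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for 10 rows of data
train_rows, test_rows = shuffle_and_split(rows)
print(len(train_rows), len(test_rows))
```

Every row ends up in exactly one group, so the classifier is never tested on a row it was trained on.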



**How do we know if we have a good classifier?**

A confusion matrix is a table that compares the class we predict and the true class. So this really helps us know how good or bad our classifier actually is.

First, let's think about how our classifier can be right and how it can be wrong. There are two ways it can be right:
- ***True Positive (TP)***: When you say something is there and it actually is. I say there is a UFO, and there is actually one out there.
- ***True Negative (TN)***: When you say something is not there, and it actually isn’t. I say there isn’t a UFO, and there actually isn’t one out there.

There are also two ways the classifier can be wrong:
- ***False Positive (FP)***: When you say something is there, but it actually isn’t. I say there is a UFO, but there isn’t one out there.
- ***False Negative (FN)***: When you say something is not there, but it actually is. I say there isn’t a UFO, but there is actually one out there.

![](https://pbs.twimg.com/media/GaQ4srLXMAAiPt2?format=png&name=small)

A good classifier will have a large number of TP and TN (correct data) and a small number of FP and FN (errors or incorrect data).

Consider a COVID test as an example. A true positive means the test is positive and you have COVID, and a true negative means the test is negative and you don't have COVID. False positives and false negatives are when the test is wrong: either it says you have COVID when you don't (FP) or it says you don't have COVID when you do (FN).
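Here is a small sketch of how the four boxes of a confusion matrix can be counted by hand. The `actual` and `predicted` lists are invented test results (1 = positive, 0 = negative), and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for labels coded as 1 and 0."""
    pairs = list(zip(actual, predicted))
    return {
        "TP": sum(1 for a, p in pairs if a == 1 and p == 1),  # said yes, was yes
        "TN": sum(1 for a, p in pairs if a == 0 and p == 0),  # said no, was no
        "FP": sum(1 for a, p in pairs if a == 0 and p == 1),  # said yes, was no
        "FN": sum(1 for a, p in pairs if a == 1 and p == 0),  # said no, was yes
    }

# Invented results for eight tests.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)
```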



**What are the Measures of a Good Classifier?**

The confusion matrix is a great tool by itself, but there are a few more tools we can use to understand how well our classifier is doing.

- ***Accuracy*** - accuracy helps us understand how many times we were right! With this metric, we compare the true positives and true negatives against everything in the matrix. To do this, we add TP and TN and then divide that number by the total of every classification in the confusion matrix (TP + TN + FP + FN). This number will be between 0 (bad) and 1 (great).

- ***Recall*** - is the ratio of true positives to the total number of actual positives. In other words, we divide the number of times we correctly predicted positive (TP) by the number of times the data was actually positive (TP + FN). Recall will always be a number between 0 and 1. The closer the number is to 1, the better the classifier is.

- ***Precision*** - is the number of true positives divided by all the times we predicted positive (first row above added together…TP + FP). With precision, the number will always be between 0 and 1. The closer the number is to 1, the better the classifier is.
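These measures are simple arithmetic on the four confusion matrix counts. The sketch below uses invented counts and a hypothetical helper named `classifier_metrics`; specificity (from the definitions list) is included alongside accuracy, recall, and precision.

```python
def classifier_metrics(tp, tn, fp, fn):
    """Compute common classifier measures from confusion matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # share of all predictions that were right
        "recall": tp / (tp + fn),                     # share of actual positives we caught
        "precision": tp / (tp + fp),                  # share of positive predictions that were right
        "specificity": tn / (tn + fp),                # share of actual negatives we caught
    }

# Invented counts for 100 predictions.
metrics = classifier_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(name, round(value, 3))
```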


![](https://pbs.twimg.com/media/GaQ4xssXYAAvG4L?format=png&name=small)

## YOUR TURN



Now that you understand the basics of KNN and its classification power, it’s time to practice applying these concepts! As you go through this, think about how metrics like accuracy and specificity help you determine how well your model is performing and how you can adjust it for better results.


### Goal 1: Importing the Pandas Library

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it as an ‘add-on’. In this case, we bring in pandas, which is a popular library for doing data science stuff.  

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **pd**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>


![](https://pbs.twimg.com/media/GcxNkjYXkAAlpKD?format=png&name=240x240)
</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.


**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out pandas, which will bring in some cool data manipulation features.


**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **pd**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZmkVCYWEA4oGso?format=jpg&name=small)
</details>

**Your Turn**: Now it’s your turn! We’re going to dive into the pandas package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “pd” to make it easier to use throughout our notebook.



In [3]:
#freehand code 


**Explanation**: *Congrats! You did it! You have successfully imported the "pandas" package as the variable "pd"*.

### Goal 2: Bringing in the Dataframe

Let’s bring in the data that we want to look at.


#### Blockly


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable and call it **train**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

In Blockly, bring in the VARIABLES menu.



**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

In Blockly, go to the Variables and drag the Set block for the **train** variable. This will allow us to assign the result of a function call to the variable. A function is basically code that does a specific task for us.



**Step 3 - Bring in the data**

Now we need to look at the file that has all our data. To load our dataframe, we’ll use a simple command to bring in the file we need (CSV….Comma Separated Values). Let’s say we have a file called ‘AirQualityTrain.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **train**.

From the Variable menu, drag a DO block using the **pd** variable, go ahead with the do operation **read_csv**. The read_csv function reads a CSV file and returns a DataFrame object.

In our case, let’s bring in the “datasets/AirQualityTrain.csv" (use the Quotes from the TEXT menu) because that is what Kiana is working with.



**Step 4 - Display the variable**

Let’s see it now by ‘displaying’ and showing our work.

Drag the **train** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQOYi_WsAAG6u8?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us  do cool things, let’s bring the data into a variable called train. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!



**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.




**Step 3 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need (CSV….Comma Separated Values). Let’s say we have a file called ‘AirQualityTrain.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **train**. For this, we use the function “pd.read_csv”, which reads the csv file. This variable is now our dataframe!

In our case, let’s bring in “datasets/AirQualityTrain.csv” (typed inside quotes) because that is what we are working with.



**Step 4 - Print the variable**

Let’s see it now by ‘printing’ and showing our work. Retype the variable name underneath the code and it will display the dataframe. In this case, we will type out the variable name **train**.



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQObPkWAAAKRNW?format=png&name=small)
</details>

**Your Turn**: Now it’s your turn!  Let’s dive in and start working with the data! We’ll begin by loading it into a dataframe, which will allow us to easily interact with and analyze the dataset.


In [4]:
#freehand code 



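| Unnamed: 0 | Contaminated | Methane | NOxEmissions | PM2.5Emissions | VOCEmissions | SO2Emissions | CO2Emissions |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 848 | 960 | 1367 | 1784 | 1745 | 2445 |
| 1 | 1 | 1063 | 968 | 1627 | 1736 | 1785 | 1888 |
| 2 | 1 | 771 | 765 | 1391 | 1692 | 1523 | 2106 |
| 3 | 1 | 536 | 624 | 1224 | 1594 | 1171 | 2158 |
| 4 | 1 | 782 | 789 | 1333 | 1732 | 1529 | 2213 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1286 | 1 | 694 | 729 | 1396 | 1568 | 1492 | 2106 |
| 1287 | 0 | 377 | 482 | 1010 | 1466 | 935 | 2242 |
| 1288 | 1 | 737 | 685 | 1292 | 1659 | 1436 | 2308 |
| 1289 | 0 | 369 | 478 | 890 | 1471 | 1003 | 2232 |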


**Explanation**: *The dataset could be used to train a classification model (KNN) that predicts whether conditions will result in contamination based on various emission levels. Alternatively, if all entries are contaminated, the data could support regression models to predict specific emission levels under contaminated conditions or clustering methods to identify patterns in emission profiles among contaminated sites*.
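If you want to experiment with `pd.read_csv` without the real file, here is a small sketch that reads CSV text from memory instead. It copies three rows from the output above and assumes the file's first column has no header name, which would explain why pandas labels it "Unnamed: 0". The variable names `sample` and `sample_indexed` are made up for this example.

```python
import io
import pandas as pd

# Three rows copied from the output above; the first header cell is left empty on purpose.
csv_text = """,Contaminated,Methane,NOxEmissions,PM2.5Emissions,VOCEmissions,SO2Emissions,CO2Emissions
0,1,848,960,1367,1784,1745,2445
1,1,1063,968,1627,1736,1785,1888
1287,0,377,482,1010,1466,935,2242
"""

# read_csv accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.columns.tolist())

# Optional: treat the first column as the row index instead of a data column.
sample_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample_indexed.shape)
```

Using `index_col=0` is one way to keep the row numbers out of the feature columns; the notebook itself simply leaves the extra column in place and does not select it as a feature.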


### Goal 3: Bring in SkLearn Package to Help with KNN Classification.

Remember how we bring in packages to help with extra data science things we come across? Now that we are doing KNN classification, we are going to bring in the scikit-learn (sklearn) package to help us classify a new data point based on its nearest neighbors.

#### Blockly


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.neighbors**, which finds the closest data points to a new data point, and then classifies the new data point based on the majority class of its nearest neighbors.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **neighbors**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQP_vjXcAABeTu?format=png&name=360x360)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.neighbors**, which finds the closest data points to a new data point and then classifies the new data point based on the majority class of its nearest neighbors.



**Step 3 - Import package as acronym**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **neighbors**



**Step 4 - Run**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQP4lLXYAAMJIq?format=png&name=360x360)

</details>

**Your Turn**: So let’s bring in SKLearn and let’s get started!


In [5]:
#freehand code 


**Explanation**: *Congrats! You did it! You have successfully imported the "sklearn.neighbors" package as the variable "neighbors"*.

### Goal 4: Create the Model Object.

Let’s get started with KNN classification by creating the model and telling it how many neighbors to use.

#### Blockly


**Step 1 - Write out the name of a variable you want to use for the classifier**

We set up a variable to store the classifier model for later use. In this case, we will call it **knn**.

On the "Variables" menu, click Create Variable, type a name for our model, **knn**. Then, drag a "SET" block to the workspace for the created variable. This block allows us to create a new variable and assign a value to it.



**Step 2 - Create the KNN classifier model**

Using the neighbors library, we call the KNeighborsClassifier() to create the KNN model.

From the Variable menu, drag a Create block for the **neighbors** variable. On the create listbox select the option **KNeighborsClassifier**. This specifies the type (class) of object we want to create, which is the KNeighborsClassifier from the neighbors module.



**Step 3 - Define the hyperparameters**

Inside the method, we say how many neighbors to explore for our classifier. We use 5 neighbors in this case.

Drag a Freestyle block, and type **n_neighbors=5**. This specifies the number of neighbors to consider when making predictions.



**Step 4 - Assign the classifier model to the variable you created**

We can now connect the **knn** variable with the **KNeighborsClassifier** model.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQesefWgAARv5o?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Write out the name for a variable you want to use for the classifier**

We set up a variable to store the classifier model for later use. In this case, we will call it **knn**.



**Step 2 - Create the KNN classifier model**

Using the neighbors library, we call KNeighborsClassifier() to create the KNN model.

`neighbors.KNeighborsClassifier()`



**Step 3 - Define the hyperparameters**

Inside the parentheses, we say how many neighbors to explore for our classifier. We use 5 neighbors in this case.

`neighbors.KNeighborsClassifier(n_neighbors=5)`



**Step 4 - Assign the classifier model to the variable you created**

We can now assign the classifier model to the **knn** variable we chose in Step 1.

`knn = neighbors.KNeighborsClassifier(n_neighbors=5)`




**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQevIqX0AAXDSS?format=png&name=small)

</details>

**Your Turn**: Ready! Set! Code!

In [6]:
#freehand code 


**Explanation**: *You have created a K-Nearest Neighbors (KNN) classifier. The KNN classifier is a machine learning model that classifies data points based on the closest data points (neighbors) around them. Here, the model will look at the 5 nearest neighbors to decide the class of a new data point*.  
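As a quick check, you can ask the model object what settings it is holding. This sketch assumes scikit-learn is installed; `knn_one` is a made-up second model that only shows how a different number of neighbors is stored.

```python
import sklearn.neighbors as neighbors

# The model from this goal: look at the 5 nearest neighbors.
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

# A made-up second model for comparison: look only at the single nearest neighbor.
knn_one = neighbors.KNeighborsClassifier(n_neighbors=1)

# get_params() returns the settings (hyperparameters) the model was created with.
print(knn.get_params()["n_neighbors"])
print(knn_one.get_params()["n_neighbors"])
```

Creating the object does not look at any data yet. The model only learns once `fit()` is called on it.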

### Goal 5: Train and Score the Classifier Model.

Now that we’ve brought in our KNN model, let’s train the model to see how it will learn from the data points that we have in the file.

#### Blockly

**Step 1 - Prepare to train the model**

To train the classifier model, we call its fit() method, which makes the model learn from the training data.  

From the Variable menu, drag the DO block for the **knn** variable, and select the **fit** function as the do operation. This specifies the function we want to call, which is the fit method of the knn object.



**Step 2 - Have the training features ready**

The next step for training the model is to select the features to train the classifier. In this step, we select the features and add them as a dataframe in the parameter. In this case, the model will train (learn) the classifier based on these 6 variables and use it to predict the label

From the Lists menu, drag a dictVariable, and select the "train" variable from the list of available variables. Also, from the Lists menu, you will get a Create List block. Using the Gear icon, add up to 6 items. For each one of the items, add a Text (a Quote “” from the Text menu), as follows:  "Methane", "NOxEmissions", "PM2.5Emissions", "VOCEmissions", "SO2Emissions", and "CO2Emissions". These are the feature names applied to train (fit) the model.



**Step 3 - Have the training label ready**

So what is the label that we are trying to predict? Next, we need to add the data labels for the selected features. We add the data labels (**Contaminated**) as a parameter in the fit() method.

From the Lists menu, drag a dictVariable, and select the **train** variable from the list of available variables. From the Text menu, get a Quote “” block and add a Text "**Contaminated**". This is the target value applied to train (fit) the model.



**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the score() method of the **knn** model. Just as in the previous step, we pass in the features and the label, but we replace the fit() method with the score() method. After the ‘fit’, this shows how many of the labels in our training dataset the model predicts correctly.

This will give us the **knn** model's correctness score. A good score will be close to 1 (i.e., 100% correct). A medium score might be more like .95 (95% accurate), and a not-so-great score might be .90 (90%). What counts as good depends on the topic you are looking at.

Right-click on the "knn.**fit**" block and select "Duplicate" from the context menu. This creates a copy of the block. Within the duplicated block, click on the method dropdown menu and select "**score**" from the list of available methods. The score method will work similarly to fit, and will use the training features and label to measure how much of the training data was learned.   



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQzqqPWsAAdMhn?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Prepare to train the model**

To train the classifier model, we call its fit() method, which makes the model learn from the training data.  

`knn.fit()`


**Step 2 - Have the training features ready**

The next step for training the model is to select the features to train the classifier. In this step, we select the features and add them as a dataframe in the parameter. In this case, the model will train (learn) the classifier based on these 6 variables and use it to predict the label

`knn.fit(train[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']])`


**Step 3 -  Have the training label ready**

So what is the label that we are trying to predict? Next, we need to add the data labels for the selected features. We add the data labels (**Contaminated**) as a parameter in the fit() method.

`knn.fit(train[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']],train['Contaminated'])`



**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the score() method of the knn model. Just as in the previous step, we pass in the features and the label, but we replace the **fit**() method with the **score**() method. After the ‘fit’, this shows how many of the labels in our training dataset the model predicts correctly.

This will give us the knn model's correctness score. A good score will be close to 1 (i.e., 100% correct). A medium score might be more like .95 (95% accurate), and a not-so-great score might be .90 (90%). What counts as good depends on the topic you are looking at.

`knn.score(train[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']],train['Contaminated'])`


**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQznuXXwAEzVd9?format=jpg&name=medium)

</details>

**Your Turn**: Try to see if you can recreate the code to train and score your own classifier model


In [7]:
#freehand code 


0.9907048799380326

**Explanation**:  *You have trained a KNN model that can predict if something is "Contaminated" (the label) based on several types of emissions, the features: Methane, NOxEmissions, PM2.5Emissions, VOCEmissions, SO2Emissions, and CO2Emissions.  You have also calculated the score, which tells us the model's accuracy on this data. A higher score means the model is better at making correct predictions based on the given emissions*.
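If you don't have the full file handy, the same fit-and-score pattern can be tried on just the training rows displayed earlier in this notebook. This is only a sketch: the names `train_sample` and `train_score` are made up, and because there are so few rows it uses 3 neighbors instead of 5, so the score it prints will not match the one from the full dataset.

```python
import pandas as pd
import sklearn.neighbors as neighbors

features = ["Methane", "NOxEmissions", "PM2.5Emissions",
            "VOCEmissions", "SO2Emissions", "CO2Emissions"]

# Only the rows visible in the training output; the real file has many more.
train_sample = pd.DataFrame(
    [
        [1, 848, 960, 1367, 1784, 1745, 2445],
        [1, 1063, 968, 1627, 1736, 1785, 1888],
        [1, 771, 765, 1391, 1692, 1523, 2106],
        [1, 536, 624, 1224, 1594, 1171, 2158],
        [1, 782, 789, 1333, 1732, 1529, 2213],
        [1, 694, 729, 1396, 1568, 1492, 2106],
        [0, 377, 482, 1010, 1466, 935, 2242],
        [1, 737, 685, 1292, 1659, 1436, 2308],
        [0, 369, 478, 890, 1471, 1003, 2232],
    ],
    columns=["Contaminated"] + features,
)

# Assumption: 3 neighbors, because this sample is tiny.
knn = neighbors.KNeighborsClassifier(n_neighbors=3)

# fit() learns from the features and the label.
knn.fit(train_sample[features], train_sample["Contaminated"])

# score() returns the fraction of rows whose label is predicted correctly.
train_score = knn.score(train_sample[features], train_sample["Contaminated"])
print(train_score)
```

Keep in mind that a score measured on the same rows the model was trained on tends to look optimistic, which is why the next goal brings in separate test data.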


### Goal 6: Bringing in the Test Data.

So we’ve looked at the training dataset to learn something about our data. How about applying it to the rest of the dataset and ‘test’ to see how good our predictions are?

#### Blockly



**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us do cool things, let’s bring the data into a variable called **test**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

To do this, bring in the VARIABLES menu.



**Step 2 - Assign the dataframe to the variable you created**

In Blockly, go to the Variables and drag the Set block for the **test** variable. This will allow us to assign the result of a function call to the variable. A function is basically code that does a specific task for us.




**Step 3 - Bring in the data**

From the Variable menu, drag a DO block using the **pd** variable, go ahead with the do operation **read_csv**. The read_csv function reads a CSV file and returns a DataFrame object.

In our case, let’s bring in the "datasets/AirQualityTest.csv"  (use the Quotes from the TEXT menu) because that is what Angelina is working with.




**Step 4 - Display the variable**

Let’s see it now by ‘displaying’ and showing our work.

Drag the **test** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ0zDqWUAAzB82?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable called **test**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!



**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.



**Step 3 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need (CSV….Comma Separated Values). Let’s say we have a file called ‘AirQualityTest.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **test**. For this, we use the function “pd.read_csv”, which reads the csv file. This variable is now our dataframe!

In our case, let’s bring in “datasets/AirQualityTest.csv” (typed inside quotes) because that is what the group is working with.



**Step 4 - Print the variable**

Let’s see it now by ‘printing’ and showing our work. Retype the variable name underneath the code and it will print the data. In this case, we will type out the variable name **test**.



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ0w41WMAA60B_?format=png&name=small)
</details>
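
As a rough sketch of Steps 1 to 4, the cell might look like the code below. So that it runs anywhere, this version reads a few rows (copied from the output shown in this notebook, without the index column) out of an in-memory string. The name `sample_csv` is a stand-in for this example, not the real file; in the notebook you would pass the path "datasets/AirQualityTest.csv" to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Stand-in for the real file: a few rows copied from this notebook's output.
sample_csv = io.StringIO(
    "Contaminated,Methane,NOxEmissions,PM2.5Emissions,VOCEmissions,SO2Emissions,CO2Emissions\n"
    "0,355,433,917,1472,947,2245\n"
    "1,749,832,1438,1740,1492,1866\n"
    "1,556,635,1166,1599,1282,2144\n"
)

# In the notebook this would be: test = pd.read_csv("datasets/AirQualityTest.csv")
test = pd.read_csv(sample_csv)

# Typing the variable name on the last line of a cell displays the dataframe.
test
```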

**Your Turn**: Let’s try it! Now we can apply our predictions to the rest of the dataset and ‘test’ to see how good our predictions are!


In [8]:
#freehand code 


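| Unnamed: 0 | Contaminated | Methane | NOxEmissions | PM2.5Emissions | VOCEmissions | SO2Emissions | CO2Emissions |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 355 | 433 | 917 | 1472 | 947 | 2245 |
| 1 | 1 | 749 | 832 | 1438 | 1740 | 1492 | 1866 |
| 2 | 1 | 556 | 635 | 1166 | 1599 | 1282 | 2144 |
| 3 | 1 | 479 | 594 | 1192 | 1534 | 1243 | 2012 |
| 4 | 0 | 373 | 435 | 867 | 1416 | 912 | 2286 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 549 | 1 | 543 | 654 | 1268 | 1642 | 1192 | 2099 |
| 550 | 1 | 459 | 572 | 1060 | 1543 | 1185 | 2124 |
| 551 | 1 | 520 | 633 | 1218 | 1627 | 1103 | 2137 |
| 552 | 0 | 387 | 489 | 850 | 1486 | 1005 | 2262 |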


**Explanation**: You have now loaded a testing dataset. A test dataset is necessary because it allows us to check how well our model or analysis works on new, unseen data. When we train a model, we use one set of data to learn from (called the training dataset), but we also need to make sure it performs well on different data (the test dataset). This helps us know if the model can generalize to real-world situations, not just the data it was trained on.
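
This notebook loads a ready-made test file, but the idea of holding data back can be sketched with scikit-learn's `train_test_split`. This is only an illustration: the tiny `data` frame borrows a handful of values from the table above, and the 25% split and the names `train_part` and `test_part` are assumptions made for this example, not how the dataset was actually prepared.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 8 rows with one feature and one label (illustration only).
data = pd.DataFrame({
    "Methane": [355, 749, 556, 479, 373, 543, 459, 387],
    "Contaminated": [0, 1, 1, 1, 0, 1, 1, 0],
})

# Hold back 25% of the rows; the model never sees them during training.
train_part, test_part = train_test_split(data, test_size=0.25, random_state=0)

print(len(train_part), "rows to learn from")
print(len(test_part), "rows kept aside for testing")
```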

### Goal 7: Predict Labels for the Testing Dataset (i.e., the rest of the data).

So far we’ve taken a smaller part of all our data to train a model and learn something from it. Can we take what we’ve learned from the training and use it to predict the rest of our dataset?

#### Blockly


**Step 1 - Write out the variable name you want to use**

Now that we have a trained model and a test dataset, let’s create a variable called **predictions** to hold the model’s output.

From the Variables menu, click Create Variable, and type **predictions**. From the same menu, drag the Set block for the **predictions** variable. This variable will hold the result of the prediction.



**Step 2 - Prepare the predict operation**

So let’s take the **knn** variable from before and try to predict the label of the new dataset (contaminated, not contaminated). Let’s start by using the predict() method from the knn model.

From the Variables menu, get a DO block, for the **knn** variable. With that select the operation **predict**.



**Step 3 - Set the test features**

Inside the predict() method, we provide the test features from the test data. This will use the 6 features (i.e., columns) to predict the labels.

From the Lists menu, drag a dictVariable, and select the "test" variable from the list of available variables. Also, from the Lists menu, you will get a Create List block. Using the Gear icon, add up to 6 items. For each one of the items, add a Text (a Quote “” from the Text menu), as follows:  "Methane", "NOxEmissions", "PM2.5Emissions", "VOCEmissions", "SO2Emissions", and "CO2Emissions". These are the feature names applied to predict the target label on the testing dataset. Store the output of the KNN prediction in the "predictions" variable. This variable will now hold the result of the prediction.



**Step 4 - Assign the predictions to the variable you created**

Just like we did before, we use a variable with an easy-to-remember name to hold a result. This time it holds the predicted labels instead of a dataset.

To store the prediction labels in the **predictions** variable, connect the SET predictions block to the **knn.predict**() block.



**Step 5 - Display the predictions**

Finally, we display the prediction labels using **predictions**.

From the Variables menu, drag the "**predictions**" variable. This will show the result of the KNN predictions.



**Step 6 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ3yDQWAAAD16J?format=png&name=900x900)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Prepare the predict operation**

So let’s take the **knn** variable from before and try to predict the label of the new dataset (contaminated, not contaminated). Let’s start by using the predict() method from the knn model.

`knn.predict()`


**Step 2 - Set the test features**

Inside the predict() method, we provide the test features from the test data. This will use the 6 features (i.e., columns) to predict the labels.

`knn.predict(test[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']])`


**Step 3 - Assign the predictions to the variable you created**

Next, we store the prediction labels in a variable called **predictions**. Just like we did before, an easy-to-remember variable name holds the result so we can use it again later.

`predictions = knn.predict(test[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']])`


**Step 4 - Print the predictions**

Finally, we print the prediction labels using **predictions**.

`predictions`


**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ3vx7WAAA30Q4?format=jpg&name=medium)

</details>
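
Put together, the steps above might look like the sketch below. To keep it self-contained, it builds a small stand-in model: the `train` and `test` frames hold only a few rows copied from the table shown earlier, and `n_neighbors=3` is an assumed value, so the real `knn` from the earlier goals will differ.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

features = ["Methane", "NOxEmissions", "PM2.5Emissions",
            "VOCEmissions", "SO2Emissions", "CO2Emissions"]

# Stand-in training data (a few rows copied from the table shown earlier).
train = pd.DataFrame(
    [[0, 355, 433, 917, 1472, 947, 2245],
     [1, 749, 832, 1438, 1740, 1492, 1866],
     [1, 556, 635, 1166, 1599, 1282, 2144],
     [1, 479, 594, 1192, 1534, 1243, 2012],
     [0, 373, 435, 867, 1416, 912, 2286],
     [0, 387, 489, 850, 1486, 1005, 2262]],
    columns=["Contaminated"] + features,
)

# Stand-in test data: two more rows from the same table.
test = pd.DataFrame(
    [[1, 543, 654, 1268, 1642, 1192, 2099],
     [1, 459, 572, 1060, 1543, 1185, 2124]],
    columns=["Contaminated"] + features,
)

# Assumed setup for the model; the notebook's real knn was trained earlier.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train[features], train["Contaminated"])

# Predict one label per test row using the six feature columns.
predictions = knn.predict(test[features])
predictions
```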

**Your Turn**:  Give it a go! Let’s see what our results are once we display them!


In [9]:
#freehand code 


array([0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,

**Explanation**: *Using the test dataset with the same six features ("Methane," "NOxEmissions," "PM2.5Emissions," "VOCEmissions," "SO2Emissions," and "CO2Emissions"), you were able to make a prediction for each data point in the test set. Each prediction is the class the `knn` model thinks that data point belongs to: 0 for not contaminated or 1 for contaminated*.

### Goal 8: Bringing in SKLearn metrics to help look at the performance of predictions.

So we’ve tried to predict on our new dataset. How well did we do? Let’s use SKLearn Metrics to help us think through that.

#### Blockly


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.metrics**, which is a tool that grades your machine learning model's performance, telling you how well it did on its test.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **metrics**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ5swrXUAAEJ-D?format=png&name=360x360)

</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.


**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.metrics**, which is a tool that grades your machine learning model's performance, telling you how well it did on its test.


**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **metrics**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ5u3ZWsAA_df1?format=png&name=360x360)

</details>

**Your Turn**: Test it out yourself! We set up a command to tell our computer what to do, and after our hard work, we’ll run what we have to see our data science skills at work!


In [10]:
#freehand code 


**Explanation**: *The metrics library provides tools to measure and evaluate the performance of machine learning models. By using `metrics`, we can check how well our model is working, like seeing how accurate it is or how well it groups data in clustering. This helps us understand if our model is doing a good job or if it needs improvement*.
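
In freehand form, the import described in the steps above is a single line. The loop that follows is only an optional check, added here for illustration, that the `metrics` name exposes the three functions used in the next goals.

```python
# Import scikit-learn's metrics module under the short name "metrics".
import sklearn.metrics as metrics

# Optional check: these are the functions used in the goals that follow.
for name in ["accuracy_score", "confusion_matrix", "classification_report"]:
    print(name, hasattr(metrics, name))
```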

## Assessing the performance of the classifier



So how well did our predictions do? Let’s assess the performance of the predictions on the testing dataset in three steps: accuracy, confusion matrix, and precision/recall.



### Goal 9: Assessing the Performance of the Predictions on Test Dataset Using Accuracy Score.

So how well did our predictions do on our test data? Let’s calculate the accuracy score to give us an idea.

#### Blockly

**Step 1 - Call the accuracy_score() method using the metrics library**

To calculate the accuracy of the model predictions, we will use the **accuracy_score**() function from the metrics library.  

From the Variables menu, drag a DO block for the metrics variable. Select the accuracy_score function from the metrics list of operations. This function takes two inputs: the true labels and the predicted labels.



**Step 2 - Calculate KNN model’s accuracy**

The accuracy_score() function takes 2 parameters to calculate the accuracy score, which measures the percentage of correct predictions. So let’s compare the **Contaminated** column from the test dataset with the **predictions** from the model we just created.

From the Lists menu get a dictVariable block and select the test variable. From the Text menu get a Quote “” block to give the label name **"Contaminated"**. This column will be used as the true labels for the accuracy calculation.

As the second parameter of the **accuracy_score** get the variable **predictions**.  The accuracy score function will calculate the accuracy of the model by comparing the true labels with the predicted labels. The result will be a score that indicates the performance of the model.



**Step 3 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ8AdSWIAEPAL_?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Call the accuracy_score() method using the metrics library**

To calculate the accuracy of the model predictions, we will use the accuracy_score() method from the metrics library.  This accuracy score will measure the percentage of correct predictions.

`metrics.accuracy_score()`



**Step 2 - Calculate KNN model’s accuracy**

The **accuracy_score**() function takes 2 parameters to calculate the accuracy score, which measures the percentage of correct predictions. So let’s compare the **Contaminated** column from the test dataset with the **predictions** from the model we just created.

`metrics.accuracy_score(test['Contaminated'],predictions)`



**Step 3 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ7bb5WMAAad1n?format=png&name=small)

</details>

**Your Turn**:  Have a go at it! Once you begin you’ll be able to assess the performance of the predictions!



In [11]:
#freehand code 


0.9927797833935018

**Explanation**: *The accuracy score tells us the percentage of correct predictions out of the total. A higher accuracy score means the model is doing a good job matching the actual labels*.
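
To see what the score means, here is a tiny example with made-up labels (not the air quality data). Four of the five predictions match the true labels, so the accuracy is 4/5.

```python
import sklearn.metrics as metrics

# Made-up labels for illustration only.
true_labels = [0, 1, 1, 1, 0]
predicted_labels = [0, 1, 1, 0, 0]   # the fourth prediction is wrong

# accuracy = number of correct predictions / total number of predictions
accuracy = metrics.accuracy_score(true_labels, predicted_labels)
print(accuracy)  # 0.8
```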

### Goal 10: Assessing the Performance of the Predictions on Test Dataset Using the Confusion Matrix.

Earlier, we looked at the accuracy score to get an idea of how well our predictions did on the test data. Now let’s look at the errors (false positives and false negatives) in the confusion matrix to explore further.

#### Blockly

**Step 1 - Call the confusion_matrix() function using the metrics library**

To break down the accuracy, let’s get the numbers from the confusion matrix using the **confusion_matrix**() function from the metrics library.  

From the Variables menu, drag a DO block for the metrics variable. Select the **confusion_matrix** function from the metrics list of operations. This function takes two inputs: the true labels and the predicted labels.



**Step 2 - Calculate KNN model’s confusion matrix**

The confusion_matrix() function takes 2 parameters to explore different parts of the confusion matrix. So let’s compare **Contaminated** from the test dataset and **predictions** from the model we just created. The confusion matrix will tell us these numbers - TP (true positive), TN (true negative), FP (false positive), and FN (false negative).

From the Lists menu, get a dictVariable block and select the test variable. From the Text menu get a Quote “” block to give the label name **"Contaminated"**. This column will be used as the true labels for the confusion matrix calculation. As the second parameter of the **confusion_matrix** function, get the variable **predictions**.



**Step 3 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ-UXxWgAAwYlY?format=png&name=small)
</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Call the confusion_matrix() function from the metrics library**

To calculate the confusion matrix of the model predictions, we will use the confusion_matrix() function from the metrics library.

`metrics.confusion_matrix()`



**Step 2 - Calculate KNN model’s confusion matrix**

The confusion_matrix() function takes 2 parameters to explore different parts of the confusion matrix. So let’s compare **Contaminated** from the test dataset and **predictions** from the model we just created. The confusion matrix will tell us these numbers - TP (true positive), TN (true negative), FP (false positive), and FN (false negative).

Test data labels: `test['Contaminated']`

The predicted labels: `predictions`

`metrics.confusion_matrix(test['Contaminated'],predictions)`



**Step 3 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ-R7IW8AA9YQw?format=png&name=small)

</details>

**Your Turn**: Let’s type in the code and see what our performance looks like! What do you see?


In [12]:
#freehand code 


array([[169,   3],
       [  1, 381]])

**Explanation**: *This matrix indicates that the model correctly identified 169 instances as "Not Contaminated" and 381 instances as "Contaminated." However, it made 3 false positive errors (predicting "Contaminated" when it was actually "Not Contaminated") and 1 false negative error (predicting "Not Contaminated" when it was actually "Contaminated"). This breakdown helps understand the model's accuracy and the types of mistakes it makes*.
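
As a sketch of how to read the four cells in code, the example below rebuilds label arrays that give the same counts as the matrix above and unpacks the matrix with `ravel()`. The arrays are reconstructed for illustration and are not the real rows. For 0/1 labels, scikit-learn puts the true labels on the rows and the predicted labels on the columns. The last line computes (TN + TP) / total, which should agree with the accuracy score from Goal 9.

```python
import numpy as np
import sklearn.metrics as metrics

# Rebuild labels that reproduce the counts above: 169 TN, 3 FP, 1 FN, 381 TP.
true_labels = np.array([0] * 169 + [0] * 3 + [1] * 1 + [1] * 381)
predicted_labels = np.array([0] * 169 + [1] * 3 + [0] * 1 + [1] * 381)

matrix = metrics.confusion_matrix(true_labels, predicted_labels)

# Row 0 = actually "Not Contaminated", row 1 = actually "Contaminated".
tn, fp, fn, tp = matrix.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)

# Accuracy is the share of correct predictions: (TN + TP) / total.
print((tn + tp) / matrix.sum())
```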

### Goal 11: Assessing the Performance of the Predictions on Test Dataset using recall and precision.

Let’s look further. Precision compares our true positives with our false positives: of everything we predicted as positive, how much was right? Recall compares our true positives with our false negatives: of everything that was actually positive, how much did we find?

#### Blockly


**Step 1 - Call the classification_report() function from the metrics library**

To calculate the classification_report() for the model, we will use the **classification_report**() function from the metrics library.

From the Variables menu, drag a DO block for the metrics variable. Select the **classification_report** function from the metrics list of operations. This function takes two inputs: the true labels and the predicted labels.



**Step 2 - Saying what parameters to use for the classification report**

The classification_report() method takes 2 parameters to calculate the classification report.

From the Lists menu, get a dictVariable block and select the test variable. From the Text menu, get a Quote “” block to give the label name **"Contaminated"**. This column will be used as the true labels for the classification report. As the second parameter of the **classification_report** function, get the variable **predictions**.



**Step 3 - Print the classification report**

Connect the metrics block to a Print block (from the Text menu). The classification report will include other metrics like precision and recall.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaRCnOcX0AAPsn4?format=png&name=small)
</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Call the classification_report() function from the metrics library**

To calculate the classification_report() for the model, we will use the classification_report() function from the metrics library.

`metrics.classification_report()`




**Step 2 - Saying what parameters to use for the classification report**

The classification_report() method takes 2 parameters to calculate the classification report.

`metrics.classification_report(test['Contaminated'],predictions)`



**Step 3 - Print the classification report**

Use the print function to show the results of the classification report.

`print(metrics.classification_report(test['Contaminated'],predictions))`



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaRClCEWAAAMfW3?format=png&name=small)
</details>

**Your Turn**: Let’s try it! First we tried accuracy score, confusion matrix, and now we have recall and precision. The possibilities are endless!


In [13]:
#freehand code 


              precision    recall  f1-score   support

           0       0.99      0.98      0.99       172
           1       0.99      1.00      0.99       382

    accuracy                           0.99       554
   macro avg       0.99      0.99      0.99       554
weighted avg       0.99      0.99      0.99       554



**Explanation**: *The classification report provides several metrics, particularly precision and recall. Precision measures how many of the samples the model labeled as contaminated were actually contaminated, essentially showing the accuracy of its positive predictions. Recall, on the other hand, reflects the model's ability to detect all actual contaminated samples, indicating how well it "found" the true cases. In this report, the precision and recall scores are all high: precision is 0.99 for both classes, and recall is 0.98 for class 0 (not contaminated) and 1.00 for class 1 (contaminated). This demonstrates that the model is highly accurate in identifying both contaminated and non-contaminated samples. These scores mean that nearly all positive predictions made by the model were correct, and it missed almost none of the actual contaminated cases*.
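
The numbers in the report can be checked by hand from the confusion matrix counts in Goal 10 (169 true negatives, 3 false positives, 1 false negative, 381 true positives). The sketch below does that arithmetic for the "Contaminated" class and also computes specificity, which is the same thing as recall for class 0. The variable names are chosen for this example.

```python
# Counts taken from the confusion matrix in Goal 10.
tn, fp, fn, tp = 169, 3, 1, 381

# Precision: of everything predicted "Contaminated", how much really was?
precision = tp / (tp + fp)

# Recall: of everything actually "Contaminated", how much did we find?
recall = tp / (tp + fn)

# Specificity: of everything actually "Not Contaminated", how much did we get right?
specificity = tn / (tn + fp)

print(round(precision, 2), round(recall, 2), round(specificity, 2))  # 0.99 1.0 0.98
```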

## WHAT DID YOU LEARN?



In this notebook, we learned about [K-Nearest Neighbors](## "a commonly used classifier; also known as k-nearest neighbors") and how it works. It classifies a new data point by looking at the nearby data points and picking the most common label among them. Choosing the number of neighbors (K) is important because it decides how many nearby points get a vote. Checking too few neighbors makes the predictions sensitive to noise in the data, while checking too many can blur the differences between classes and make our results look a bit funky.
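
The voting idea can be sketched in a few lines of plain Python and NumPy. This is a simplified illustration with made-up points and a hypothetical helper named `knn_vote`; it is not how scikit-learn implements KNN.

```python
from collections import Counter
import numpy as np

# Made-up training points (two features each) and their labels.
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                   [6.0, 6.0], [7.0, 7.5], [6.5, 8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def knn_vote(new_point, k):
    """Return the most common label among the k nearest training points."""
    # Euclidean distance from the new point to every training point.
    distances = np.sqrt(((points - new_point) ** 2).sum(axis=1))
    # Indices of the k closest points.
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels.
    return Counter(labels[nearest].tolist()).most_common(1)[0][0]

print(knn_vote(np.array([2.0, 2.0]), k=3))  # 0
print(knn_vote(np.array([6.0, 7.0]), k=3))  # 1
```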

## WHAT’S NEXT?

[KNN Regression](KNN_Regression.ipynb)

## TELL ME MORE


- [Datawhys KNN Classification Notebook](https://github.com/memphis-iis/datawhys-content-notebooks-python/blob/master/KNN-classification.ipynb)
- [Datawhys KNN Classification Problem-Solving Notebook](https://github.com/memphis-iis/datawhys-content-notebooks-python/blob/master/KNN-classification-PS.ipynb)
- [K Nearest Neighbors](https://youtu.be/0p0o5cmgLdE?si=V8336jEKrVF_aLJc) - Machine Learning Basics (video)
- [K-nearest neighbors](https://youtu.be/HVXime0nQeI?si=WjFinXnxVm1CJhX5) - StatQuest (video)