# KNN Regression

## ANTICIPATED TIME


2 Hours



## BEFORE YOU BEGIN

[KNN Classification](KNN_Classification.ipynb)


## WHAT YOU WILL LEARN


- What is regression?
- What is KNN regression?
- When do I use KNN regression?
- How does KNN regression work?
- How to choose K?
- How do you evaluate the performance of a KNN regression model?
- What are the strengths and weaknesses of using KNN regression?


## DEFINITIONS YOU’LL NEED TO KNOW



- Regression - a type of model used to predict a number
- KNN Regression - a technique that predicts a continuous value by averaging the values of the K most similar data points
- Predictors - variables used to predict the value of a dependent/response variable
- Response - the number you are trying to predict
- Residuals - the difference between a predicted value and the actual value
- Feature similarity - how closely the features of a new data point match those of points in the training set
- Standardization - putting data on the same scale so that every feature counts equally


## SCENARIO:  


Ethan and Angelina want to predict how much pollution a new car would produce. They aren't sure which features of a car matter most for pollution, so they decide to make predictions using cars that are similar to the new car. Diego suggested that the team should figure this out using the KNN (K-nearest neighbors) algorithm.

To make sure that KNN works correctly, Ethan and Angelina need to make sure all of the features are treated equally. Otherwise a feature like car size, which has bigger numbers than fuel efficiency, could affect their prediction by a lot. The team also has to consider picking the right number for K. A small K would mean predictions use only the few cars most similar to the new car. A large K would use many cars, including some that are less similar, in order to make the prediction. Picking the correct K and making sure all the features are looked at equally can help the team make a good prediction about how much pollution the new car would produce! Ethan and Angelina suggest they use this approach to better understand the data in their afterschool program.
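To see why treating features equally matters, here is a minimal sketch of standardization. It borrows a few Weight and Volume values from the sample rows shown later in this notebook; the helper name `standardize` is just an illustration, not part of any library.

```python
import statistics

# A few Weight and Volume values from the sample rows shown later in this notebook.
weights = [1300, 1495, 1680, 1375, 1965]
volumes = [12.3, 14.0, 15.0, 12.5, 16.5]

def standardize(values):
    """Rescale values so they have mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [(v - mean) / spread for v in values]

scaled_weights = standardize(weights)
scaled_volumes = standardize(volumes)

# Gap between the first and last car, before and after scaling.
raw_gap = (abs(weights[0] - weights[4]), abs(volumes[0] - volumes[4]))
scaled_gap = (
    abs(scaled_weights[0] - scaled_weights[4]),
    abs(scaled_volumes[0] - scaled_volumes[4]),
)
print("raw gaps (weight, volume):", raw_gap)
print("scaled gaps (weight, volume):", scaled_gap)
```

Before scaling, the weight gap is in the hundreds while the volume gap is only a few units, so weight would dominate any distance calculation. After scaling, the two gaps are of similar size, so both features get a fair say.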


## WHAT DO I NEED TO KNOW?



***How KNN regression works***
In the last notebook, we looked at predicting the class (label) of an item, such as car type (diesel, gas, hybrid). What happens if we want to predict a number?

We are in luck because K-nearest neighbors (KNN) can also be used to predict a number (the **response variable**). The KNN algorithm works for both classification (sorting items into classes) and regression (predicting numbers). In both cases it uses feature similarity to make its prediction.

For example, if you want to predict the height of a new classmate (a transfer student), you can use the heights of some of your friends who are most like them (the nearest neighbor data points) and take the average height as your prediction. In the example of being environmentally friendly, you can use the CO2 emissions of similar cars and take the average as your prediction for a new car coming next year.

**Feature similarity** means a new data point is given a value based on how closely it resembles the K closest points in the training set. That's what KNN regression does: it makes predictions based on the average of the nearest neighbors.
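The idea of averaging the nearest neighbors can be sketched in a few lines of plain Python. This is only an illustration: the function name `knn_predict` and the classmate data are made up for this example, and a real project would normally use a library such as sklearn.

```python
import math

def knn_predict(train_points, train_values, new_point, k):
    """Predict a number for new_point by averaging the values of its k nearest training points."""
    # Pair each training point's distance to new_point with its known value.
    distances = [
        (math.dist(point, new_point), value)
        for point, value in zip(train_points, train_values)
    ]
    # Sort by distance and keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The prediction is the plain average of the neighbors' values.
    return sum(value for _, value in nearest) / k

# Hypothetical classmates: age in years -> height in cm.
ages = [(12,), (13,), (13,), (15,), (16,)]
heights = [150, 156, 159, 165, 175]

# Predict the height of a new 13-year-old classmate from the 3 most similar classmates.
prediction = knn_predict(ages, heights, new_point=(13,), k=3)
print(prediction)
```

Here the three nearest classmates are the two 13-year-olds and the 12-year-old, so the prediction is the average of their heights.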

***Residuals***
Just like we learned in our other notebooks, we want to see how good our model is. In regression, we measure how large the ‘errors’ in our predictions are. In other words, we use the difference between the number we predicted and the actual number in the dataset, which is called the residual. Because a residual can be negative or positive, we usually square it so that errors in opposite directions do not cancel each other out.
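As a small illustration, the sketch below computes residuals and squared residuals for three cars. The actual and predicted numbers are hypothetical, chosen only to show how the signs behave.

```python
# Hypothetical actual and predicted CO2 emissions for three cars.
actual = [120, 135, 140]
predicted = [125, 130, 140]

# Residual = actual value minus predicted value; it can be negative or positive.
residuals = [a - p for a, p in zip(actual, predicted)]

# Squaring removes the sign, so errors in opposite directions do not cancel out.
squared = [r ** 2 for r in residuals]

# Averaging the squared residuals gives the mean squared error (MSE).
mse = sum(squared) / len(squared)
print(residuals, squared, mse)
```

Notice that the plain residuals add up to zero even though two of the three predictions are off, which is why the squared version is the more honest measure.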

***Choosing K***
Now let's go back to seeing how a person’s height and age affect their weight. After we’ve done our standardization, we can choose a number called K to help us draw a line that shows the relationship between height and weight. If K is 1, you use only the closest data point. That works well if that point is very close, but if even the closest point is far away, the prediction will be poor. In many cases, the line will look pretty jagged and bumpy. It will change a lot with small differences in height because it tries to fit each person’s weight closely.

If K is a larger number, then the line becomes smoother. This is because it averages the weights of several people with similar heights to decide where the line should go, which makes it steadier. When you look at pictures of the line with different K values, you can see that a low K gives a zigzaggy line, while a high K gives a smoother line.

*So finding the right K can be a bit tricky.*

The best choice of K depends on how good your features are at predicting the outcome, and whether they are consistently good or just sometimes good. If the features are not that good at predicting the outcome or are not consistently good, then a larger K will help by including more neighbors. However, if K is too large, the model becomes insensitive to the values of the nearest neighbors, and this generally won't work well because the closest neighbors are normally the best predictors.
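To make the jagged-versus-smooth idea concrete, here is a rough sketch that predicts weight from height with three different K values. The heights and weights are invented for this example (one person is deliberately unusual), and `knn_predict_1d` is a hypothetical helper, not a library function.

```python
def knn_predict_1d(xs, ys, new_x, k):
    """Average the y values of the k training points whose x is closest to new_x."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - new_x))
    return sum(ys[i] for i in order[:k]) / k

# Hypothetical heights (cm) and weights (kg); the 155 cm person is unusually heavy.
heights = [150, 155, 160, 165, 170, 175]
weights = [45, 70, 52, 56, 61, 66]

# Predict at every height for a small, a medium, and a very large K.
for k in (1, 3, 6):
    line = [round(knn_predict_1d(heights, weights, h, k), 1) for h in heights]
    print(k, line)
```

With K = 1 the predictions simply copy each person's own weight, including the unusual one, so the line zigzags. With K = 6 (every point in the data) the prediction is the same overall average everywhere, so the line is flat and no longer reacts to height at all. A middle value sits between those two extremes.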


## YOUR TURN:


Now let’s give it a go with some practice! You now understand another use of KNN and its ability to make predictions, so it’s time to practice applying these concepts. As you go through this, think about how much the choice of K affects how good your prediction is.


### Goal 1: Importing the Pandas Library

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it as an ‘add-on’. In this case, we bring in pandas, which is a popular library for doing data science stuff.  

#### Blockly


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **pd**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GcxNkjYXkAAlpKD?format=png&name=240x240)

</details>

In [4]:
#blocks code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.


**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.


**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **pd**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZmkVCYWEA4oGso?format=jpg&name=small)

</details>

**Your Turn**: Now it’s your turn! We’re going to dive into the pandas package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “pd” to make it easier to use throughout our notebook.


In [8]:
#freehand code 


**Explanation**: *Congrats! You did it! You have now successfully imported the "pandas" package as the variable "pd"*.

### Goal 2 - Bringing in the Dataframe
Let’s bring in the data that we want to look at.


#### Blockly


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable and call it **train**.

In Blockly, bring in the VARIABLES menu.



**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

In Blockly, go to the Variables menu and drag the Set block for the **train** variable. This will allow us to assign the result of a function call to the variable. A function is a named piece of code that performs a task and gives back a result.



**Step 3 - Bring in the data**

Now we need to look at the file that has all our data. To load our dataframe, we’ll use a simple command to bring in the file we need (a CSV, or Comma Separated Values, file). Let’s say we have a file called ‘AirQualityCars.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **train**.

From the Variables menu, drag a DO block for the **pd** variable and select the do operation **read_csv**. The read_csv function reads a CSV file and returns a DataFrame object.

In our case, let’s bring in the “datasets/AirQualityCars.csv" (use the Quotes from the TEXT menu) because that is what Kiana is working with.



**Step 4 - Display the variable**

Let’s see it now by ‘displaying’ and showing our work.

Drag the **train** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab5Z-TXsAAQAWr?format=jpg&name=medium)
</details>

In [3]:
#blocks code


#### Freehand

**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable called train. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.



**Step 2 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need (a CSV, or Comma Separated Values, file). Let’s say we have a file called ‘AirQualityCars.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **train**. To do this, we call the function “pd.read_csv”, which reads the CSV file and returns a dataframe. Once it is stored in **train**, this variable is our dataframe!

In our case, let’s bring in the “datasets/AirQualityCars.csv” file (put the file path in quotes) because that is what Kiana is working with.



**Step 3 - Assign the dataframe to the variable you created**

Now connect the two pieces: use the equals sign (=) to store the dataframe returned by pd.read_csv in the **train** variable.



**Step 4 - Print the variable**

Let’s see it now by printing the **train** variable and showing our work.
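Here is a minimal sketch of loading and printing a dataframe. So that it runs without the real file, it reads from a small in-memory text block containing three of the sample rows from this notebook; in the notebook itself you would pass the file path in quotes instead of `io.StringIO(csv_text)`.

```python
import io
import pandas as pd

# Stand-in for the CSV file: three of the sample rows shown in this notebook.
csv_text = """Brand,Model,Volume,Weight,CO2Emission
Toyota,Corolla,12.3,1300,120
Toyota,Camry,14.0,1495,135
Ford,F-150,19.5,2200,255
"""

# read_csv accepts a file path or a file-like object; StringIO lets this sketch run without the file.
train = pd.read_csv(io.StringIO(csv_text))

# Printing the variable shows the rows and columns that were loaded.
print(train)
```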



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab5c61WwAApSiM?format=png&name=900x900)
</details>

**Your Turn**: Now it’s your turn! Let’s dive in and start working with the data! We’ll begin by loading it into a dataframe, which will allow us to easily interact with and analyze the dataset.


In [9]:
#freehand code 


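| Unnamed: 0 | Brand | Model | Volume | Weight | CO2Emission |
|---|---|---|---|---|---|
| 0 | Toyota | Corolla | 12.3 | 1300 | 120 |
| 1 | Toyota | Camry | 14.0 | 1495 | 135 |
| 2 | Toyota | RAV4 | 15.0 | 1680 | 140 |
| 3 | Toyota | Prius | 12.5 | 1375 | 90 |
| 4 | Toyota | Highlander | 16.5 | 1965 | 165 |
| 5 | Ford | F-150 | 19.5 | 2200 | 255 |
| 6 | Ford | Mustang | 13.0 | 1655 | 180 |
| 7 | Ford | Explorer | 16.8 | 2020 | 200 |
| 8 | Ford | Focus | 12.0 | 1350 | 115 |
| 9 | Ford | Escape | 15.4 | 1580 | 150 |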


**Explanation**: *To predict CO2Emission (the label), this dataset can be used in a regression model with CO2 emissions as the target variable and the other numeric columns as features. Each feature, such as Volume and Weight, describes something about the car that might be related to how much CO2 it emits. Training on these features allows the model to estimate CO2 emissions for a new car based on the patterns observed in similar cars*.

###  Goal 3 - Import the Plotly.Express Library.

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it like an ‘add-on’. In this case, we bring in Plotly Express, which is a popular library for doing different visualizations.

#### Blockly


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the import block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out plotly.express, which will bring in some cool data visualization features.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the import and package together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into px, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab594HW4AA7zAX?format=png&name=small)
</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.


**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out plotly.express, which will bring in some cool data visualization features.


**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into px.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab6AduXQAIEi5X?format=png&name=900x900)
</details>

**Your Turn**: Now it’s your turn! We’re going to dive into the plotly.express package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “px” to make it easier to use throughout our notebook.


In [10]:
#freehand code 





**Explanation**: *Congrats! You did it! You have now successfully imported the plotly.express package as the variable px*.

### Goal 4 - Create a scatter plot and trendline to visualize the correlation between two variables.

Let’s try to look at two variables to see their relationship, along with a trendline.

#### Blockly


**Step 1 - Call the scatter function from plotly**

Let’s see if we can explore the correlations between two variables. Let’s start with a visualization that helps ‘scatter’ all the data on one visualization.

From the Variables menu, get a DO block for the **px** block. With that, select the **scatter** operation. That will generate the scatter plot.



**Step 2 - Saying what data to use for the scatter plot**

So what are we going to look at? Simple, you just have to tell plotly which variables go on the x-axis and y-axis.

Inside the scatter() method, we add our parameters for the plot. First we have the dataframe, which contains the data to plot. From the Variables menu, get the **train** variable that holds the data you want to generate the scatter plot with. Then, get two Quote “” blocks from the Text menu.



**Step 3 - Tell plotly what columns to put on the axis:**

Now we need to tell plotly which columns go on the x-axis and y-axis. In the two Quote blocks, type the column names **‘Weight’** (x-axis) and **'CO2Emission'** (y-axis).



**Step 4 - Show the linear regression**

Lastly, we want to add a trendline to our scatter plot to help visualize the relationship between the x and y variables.

In this case, using a **freestyle** block, type **trendline='ols'**. OLS stands for ordinary least squares, the method used to fit the straight trendline. This will give us the scatterplot with a trendline, which will show the relationship between the weight of the car and the amount of CO2 it emits.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab603uXMAA2kms?format=png&name=small)
</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Call the scatter function from plotly**

Let’s see if we can explore the correlations between the two variables. Let’s start with a visualization that helps ‘scatter’ all the data in one visualization.

To make the scatterplot, we first call the scatter() function from the px library.

`px.scatter()`



**Step 2 - Saying what data to use for the scatter plot**

So what are we going to look at? Simple, you just have to tell plotly which variables go on the x-axis and y-axis.

Inside the scatter() method, we add our parameters for the plot. First, we have **train**, the dataframe that contains the data to plot.



**Step 3 - Tell plotly what columns to put on the axis**

**‘Weight’** is the column to be plotted on the x-axis, representing the weight of the cars. '**CO2Emission’** is the amount of CO2 emissions we have recorded. This second thing is the column to be plotted on the y-axis.

Inside the scatter() method, we add the following parameters to make our scatterplot.
- **train**: the name of the dataframe
- **Weight**: the data to plot in the X-Axis
- **CO2Emission**: the data to plot in the Y-Axis



**Step 4 - Show the linear regression**

Lastly, we want to add a trendline to our scatter plot to help visualize the relationship between the x and y variables. In this case, we’ll type **trendline='ols'**. OLS stands for ordinary least squares, the method used to fit the straight trendline.

This will give us the scatterplot, which will show the relationship between the weight of the car and the amount of CO2 it emits.

`px.scatter(train,'Weight','CO2Emission',trendline='ols')`
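If you are curious what an ordinary least squares trendline actually calculates, the sketch below works out the slope and intercept by hand in plain Python. It uses the Weight and CO2Emission values from the ten sample rows shown earlier in this notebook; it is only meant to illustrate the idea and is not how plotly computes the line internally.

```python
# Weight and CO2Emission values from the ten sample rows shown earlier in this notebook.
weights = [1300, 1495, 1680, 1375, 1965, 2200, 1655, 2020, 1350, 1580]
co2 = [120, 135, 140, 90, 165, 255, 180, 200, 115, 150]

mean_x = sum(weights) / len(weights)
mean_y = sum(co2) / len(co2)

# OLS picks the slope and intercept that make the sum of squared residuals as small as possible.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(weights, co2)) / sum(
    (x - mean_x) ** 2 for x in weights
)
intercept = mean_y - slope * mean_x

print(f"CO2Emission ~ {slope:.3f} * Weight + {intercept:.1f}")
```

A positive slope means the trendline rises from left to right, which matches the pattern of heavier cars tending to emit more CO2.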


**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab6yUOXcAIM6_H?format=jpg&name=medium)
</details>

**Your Turn**: Test it out yourself! First, create a scatterplot to explore the relationship between our two variables: car weight and CO2 emissions. Make sure to set the x and y axes with the correct data. Then add a trendline to see the pattern between the variables and better understand their relationship!


In [1]:
#freehand code 


**Explanation**: *The scatterplot shows the relationship between the weight of a car and the amount of CO2 it emits. The dots represent data points for different weights and their corresponding CO2 emissions. As we can see, there’s a trend where heavier cars generally produce more CO2 emissions, shown by the upward slope of the trend line. The trend line helps summarize the overall direction of the data. In simple terms, this means that, typically, the heavier a car is, the more CO2 it tends to emit*.

### Goal 5 - Import the sklearn library

Remember how we bring in packages to help with extra data science things we come across? Now that we are doing KNN regression, we are going to bring in sklearn (scikit-learn) to help us predict new data points based on their nearest neighbors.

#### Blockly


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.neighbors**, which finds the closest data points to a new data point and then predicts the new data point's value based on the average of its nearest neighbors.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **neighbors**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab7_88WwAE79BG?format=png&name=small)
</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.neighbors**, which will provide us with nearest-neighbor models such as the KNN regressor.



**Step 3 - Import package as an acronym**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **neighbors**



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab79hSWsAALBYV?format=png&name=900x900)
</details>

**Your Turn**: Now it’s your turn! We’re going to dive into the sklearn.neighbors package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “neighbors” to make it easier to use throughout our notebook.



In [12]:
#freehand code 


**Explanation**: *Congrats! You did it! You have now successfully imported the sklearn.neighbors package as the variable neighbors*.

### Goal 6 - Create a KNN Regression model.

Let’s get started with the KNN regression model, which will help us predict the number we are interested in.

#### Blockly


**Step 1 - Write out the variable name you want to use**

We set up a variable to store the model for later use. In this case, we will call it **regr** to make it easier to remember.

On the "Variables" menu, click Create Variable, type a name for our model, **regr**. Then, drag a "SET" block to the workspace for the created variable. This block allows us to create a new variable and assign a value to it.



**Step 2 - Create the KNN regression model**

We read about the KNN Regression model. Now let’s do it!
Using the neighbors library, we call KNeighborsRegressor() to create the KNN regression model.

From the Variables menu, drag a Create block for the **neighbors** variable. In the create listbox, select the option **KNeighborsRegressor**. This specifies the type (class) of object we want to create, which is the KNeighborsRegressor from the neighbors module. The KNeighborsRegressor class is a type of regression model that uses k-nearest neighbors to make predictions.



**Step 3 -  Define the hyperparameters**

KNN Regression will look at a certain number of neighbors. But how many neighbors?

Inside the method, we say how many neighbors to explore for our regressor. We use 5 neighbors in this case.

Drag a Freestyle block, and type **n_neighbors=5**. This specifies the number of neighbors to consider when making predictions.



**Step 4 - Assign the regressor model to the variable you created**

We can now connect the **regr** variable with the **KNeighborsRegressor** model.



**Step 5- Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GcO1zFZW0AAyZOu?format=jpg&name=small)
</details>

In [None]:
#blocks code


#### Freehand


**Step 1 -  Create the KNN regressor model**

We read about the KNN Regression model. Now let’s do it!

Using the neighbors library, we call the KNeighborsRegressor() method.

`neighbors.KNeighborsRegressor()`



**Step 2 -  Define the hyperparameters**

KNN Regression will look at a certain number of neighbors. But how many neighbors?
Inside the method, we say how many neighbors to explore for our regressor. We use 5 neighbors in this case.

`neighbors.KNeighborsRegressor(n_neighbors=5)`



**Step 3 - Assign the regressor model to the variable you created**

We set up a variable to store the regressor model for later use. In this case, we will call it **regr**.

`regr = neighbors.KNeighborsRegressor(n_neighbors=5)`




**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GcO1wkSXEAEBSPN?format=jpg&name=small)
</details>

**Your Turn**: Try it! Using the KNN regressor tool helps us make predictions based on the neighbors of the data point, so let’s see how useful it is!


In [13]:
#freehand code 


**Explanation**: *The KNN regressor is a tool that helps make predictions based on the "neighbors" of a data point. In other words, if we want to predict something like CO2 emissions, the model will look at similar data points (its "neighbors") and use them to make a guess. This model is useful when there’s a pattern in the data, where similar inputs have similar outputs*.

### Goal 7 - Train and Score the Regressor Model.

Now that we’ve brought in our KNN model, let’s train the model to see how it will learn from the data points that we have in the file.

#### Blockly


**Step 1 - Prepare to train the model**

To train the regressor model, we call its fit() method. The **fit** method is what lets the model learn from the training data.

From the Variables menu, drag the DO block for the **regr** variable, and select the **fit** function as the do operation. This specifies the function we want to call, which is the fit method of the regr object.



**Step 2 - Have the training features ready**

The next step for training the model is to select the features to train the regressor. In this step, we select the features and add them as a dataframe in the parameter. In this case, the model will train (learn) the regressor based on these 2 variables and use it to predict the label

From the Lists menu, drag a dictVariable, and select the **train** variable from the list of available variables. Also, from the Lists menu, you will get a Create List block. Using the Gear icon, add up to 2 items. For each one of the items, add a Text (a Quote “” from the Text menu), as follows:  "**Weight**" and "**Volume**".



**Step 3 -  Have the training label ready**

So what are we trying to predict? Next, we need to add the data labels for the selected features. We add the data labels(**CO2Emission**) as a parameter in the fit() method.

From the Lists menu, drag a dictVariable, and select the "train" variable from the list of available variables. From the Text menu, get a Quote “” block and add a Text **CO2Emission**. This is the target value applied to train (fit) the model.



**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the score() method of our **regr** model. Just as in the previous step, we will just replace the fit() method with the score() method. Based on the **‘fit’**, we will see how well we were able to predict the values in our training dataset.

This will give us the KNN model’s correctness score. For a regression model, this score is called R² (the coefficient of determination), and a perfect score is 1. The closer the score is to 1, the better. As a rough guide, a medium score might be more like .95 and a not-so-great score would be .90, but it depends on the topic you are looking at.

Right-click on the "regr.fit" block and select "Duplicate" from the context menu. This creates a copy of the block. Within the duplicated block, click on the method dropdown menu and select "**score**" from the list of available methods. The score method will work similarly to fit, and will use the training features and label to measure how much of the training data was learned.  



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab9wftX0AAdu0V?format=jpg&name=small)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Prepare to train the model**

To train the regressor model, we call its fit() method. The **‘fit’** method is what lets the model learn from the training data.

`regr.fit()`



**Step 2 - Have the training features ready**

The next step for training the model is to select the features to train the regressor. In this step, we select the features and add them as a dataframe in the parameter. In this case, the model will train (learn) the regressor with the features we have stored in the variable **train**.

`regr.fit(train[['Weight','Volume']])`


**Step 3 - Have the training label ready**

So what is the label that we are trying to predict? Next, we need to add the data labels for the selected features. We add the data labels (**CO2Emission**) as a parameter in the fit() method.

`regr.fit(train[['Weight','Volume']], train['CO2Emission'])`


**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the score() method of our **regr** model. Just as in the previous step, we will just replace the fit() method with the score() method. Based on the ‘fit’, we will see how well we were able to predict the values in our training dataset.

This will give us the KNN model’s correctness score. For a regression model, this score is called R² (the coefficient of determination), and a perfect score is 1. The closer the score is to 1, the better. As a rough guide, a medium score might be more like .95 and a not-so-great score would be .90, but it depends on the topic you are looking at.

`regr.score(train[['Weight','Volume']], train['CO2Emission'])`



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab90PUWsAAdNzI?format=jpg&name=small)

</details>

**Your Turn**: Now it’s your turn! Train the regressor on the **Weight** and **Volume** features with fit(), then use score() to see how well it does on the training data.


In [14]:
#freehand code 


0.8845480512300746

**Explanation**: *You have now trained the model to predict CO2 emissions based on two features: Weight and Volume. By fitting the model to this data, the algorithm learns the relationship between weight, volume, and CO2 emissions, allowing it to make predictions based on that relationship. You have also calculated the model’s score on the training data. It checks how well the model’s predictions match the actual CO2 emissions in the training dataset. The score is R2: 1 means the predictions match perfectly, and lower values mean a worse match*.
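
To see fit() and score() working together outside the lesson's notebook, here is a minimal sketch. The small `train` dataframe is a hypothetical stand-in for the lesson's scaled training data, and K=3 is an assumption, since the lesson's **regr** may have been created with a different K.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical stand-in for the lesson's training data (not the real file).
train = pd.DataFrame({
    "Weight": [950, 1100, 1200, 1500, 1800, 2000],
    "Volume": [10, 12, 12, 14, 15, 16],
    "CO2Emission": [90, 105, 110, 130, 160, 190],
})
features = train[["Weight", "Volume"]]
label = train["CO2Emission"]

# K=3 is an assumption; the lesson's regr may use a different K.
regr = KNeighborsRegressor(n_neighbors=3)
regr.fit(features, label)
train_score = regr.score(features, label)
print(train_score)

# With K=1 every training car is its own nearest neighbor, so the
# training score is a perfect 1.0. That does not mean the model
# will do as well on cars it has never seen.
regr_k1 = KNeighborsRegressor(n_neighbors=1).fit(features, label)
k1_score = regr_k1.score(features, label)
print(k1_score)
```

The K=1 case is a reminder that a high training score can be too optimistic, which is why the next goals check the model on a separate test dataset.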

### Goal 8 - Load the Test dataset.

So we’ve looked at the training dataset to learn something about our data. How about applying it to the rest of the dataset and seeing how good our predictions are?

#### Blockly


**Step 1: Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable and call it **test**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data! To do this, bring in the VARIABLES menu.



**Step 2: Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

From the Variables menu, drag the Set block for the **test** variable. This will allow us to assign the result of a function call to the variable.



**Step 3: Bring in the data**

Now we need to look at the file that has all our data. To load our dataframe, we’ll use a simple command to bring in the file we need (CSV, or Comma Separated Values). Let’s say we have a file called ‘AirQualityCtest.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **test**.

To load our test dataframe, we’ll call the method **read_csv**() from pandas to read the file we need (i.e., a CSV (Comma Separated Values) file).

From the Variables menu, drag a DO block using the **pd** variable and select the operation **read_csv**. The read_csv function reads a CSV file and returns a DataFrame object. In our case, let’s bring in "datasets/AirQualityCtest.csv" (use a Quote block from the TEXT menu) because that is what we are working with.



**Step 4 - Display the variable**

Drag the **test** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab-trtWgAAOONP?format=jpg&name=small)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable called **test**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!



**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.



**Step 3 - Bring in the data**

Now, we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need (CSV, or Comma Separated Values). Let’s say we have a file called ‘AirQualityCtest.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **test**. For this, we use the function “**pd.read_csv**”, which reads the csv file. This variable is now our dataframe!

In our case, let’s bring in ‘AirQualityCtest.csv’ because that is what our group is working with.



**Step 4 - Print the variable**

Let’s see it now by ‘printing’ and showing our work. Retype the variable name underneath the code, and it will print the dataframe. In this case, we will type out the variable name **test**.



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab-wAQXMAAIVs8?format=png&name=small)
</details>

**Your Turn**: Now it’s your turn!  Let’s dive in and start working with the data!  We’ll begin by loading it into our variable named test, which will allow us to easily interact with and analyze the dataset.


In [15]:
#freehand code 


Unnamed: 0,Brand,Model,Volume,Weight,CO2Emission
0,Porsche,911,13,1570,153
1,Porsche,Cayenne,16,2000,199
2,Porsche,Macan,15,1855,152
3,Porsche,Panamera,15,1850,170
4,Porsche,Taycan,14,2290,245
5,Fiat,500,10,965,82
6,Fiat,Panda,11,950,79
7,Fiat,Tipo,12,1270,104
8,Fiat,500X,13,1420,126
9,Fiat,Doblo,14,1600,152


**Explanation**: *The testing dataset is used to check how well the model can predict CO2 emissions for new cars. After training, the model uses Volume and Weight from this new data to predict CO2 emissions. We then compare these predictions to the actual emissions in the testing dataset. If the predictions are close, it means the model learned well and can make accurate predictions on new data; if not, the model may need improvement*.
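
As a rough sketch of the loading step, the example below reads the same ten rows shown in the output above. In the lesson the data would come from `pd.read_csv("datasets/AirQualityCtest.csv")`; here the rows are read from a text string (an assumption made only so the example runs without the file).

```python
import io
import pandas as pd

# Same rows as the output above, kept in a string so no file is needed.
csv_text = """,Brand,Model,Volume,Weight,CO2Emission
0,Porsche,911,13,1570,153
1,Porsche,Cayenne,16,2000,199
2,Porsche,Macan,15,1855,152
3,Porsche,Panamera,15,1850,170
4,Porsche,Taycan,14,2290,245
5,Fiat,500,10,965,82
6,Fiat,Panda,11,950,79
7,Fiat,Tipo,12,1270,104
8,Fiat,500X,13,1420,126
9,Fiat,Doblo,14,1600,152
"""

# read_csv accepts a file path or a file-like object such as StringIO.
test = pd.read_csv(io.StringIO(csv_text))

print(test.shape)                         # rows, columns
print(test[["Weight", "Volume"]].head())  # the two features used later
```

The first column has no name in the file, so pandas labels it `Unnamed: 0`, which is why that name appears in the output above.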

### Goal 9: Predict Labels for the Testing Dataset (i.e., - predict the rest of the data)

So far we’ve taken a smaller part of all our data to train and try and learn something about it. Can we take what we’ve learned from the training and use it to predict the rest of our dataset?

#### Blockly

**Step 1 - Write out the variable name you want to use**

Now that we have a trained model and a test dataset, let’s create a variable to hold the predictions and call it **predictions**.

From the Variables menu, click Create Variable, and type **predictions**. On the same menu, drag the Set block of the **predictions** variable. This variable will hold the result of the prediction.





**Step 2 - Prepare the predict operation**

So let’s take the regr variable from before and try to predict the label of the new dataset for **CO2Emission**. Let’s start by using the predict() method from the knn model.

From the Variables menu, get a DO block, for the **regr** variable. With that select the operation **predict**.



**Step 3 - Set the test features**

Inside the predict() method, we provide the **test** features from the test data. This will use the 2 features (ie - columns) to predict the labels.

From the Lists menu, drag a dictVariable, and select the "test" variable from the list of available variables. Also, from the Lists menu, you will get a Create List block. Using the Gear icon, add up to 2 items. For each one of the items, add a Text (a Quote “” from the Text menu), as follows: **Weight** and **Volume**. These are the feature names applied to predict the target label on the testing dataset. Store the output of the KNN prediction in the **predictions** variable. This variable will now hold the result of the prediction.



**Step 4 - Assign the predictions to the variable you created**

Next, we store the prediction labels into a variable **‘predictions’**. To do that we have to connect the SET predictions variable to the **regr.predict**() block.



**Step 5 - Display the predictions**

Let’s see it now by showing our work.

Drag the **predictions** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace.



**Step 6 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GbeSDKLbsAAT4EG?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Prepare the predict operation**

So let’s take the **regr** variable from before and try to predict the label of the new dataset (**CO2Emission**). Let’s start by using the predict() method from the knn model.

`regr.predict()`



**Step 2 - Set the test features**

Inside the predict() method, we provide the test features from the test data.

`regr.predict(test[['Weight', 'Volume']])`



**Step 3 - Assign the predictions to the variable you created**

Next, we store the prediction labels into a variable ‘predictions’.

`predictions = regr.predict(test[['Weight', 'Volume']])`



**Step 4 - Print the predictions**

Finally, we print the prediction labels using ‘predictions’.

`predictions`



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab_xhAWwAAYySA?format=jpg&name=small)

</details>

**Your Turn**: *Let’s give it a go! Try using the predict() method to predict the label for your new dataset and eventually store the predictions in our predictions variable. We’ll then see the results to see how accurate our model is!*


In [17]:
#freehand code 


array([141., 195., 165., 165., 223., 114., 114., 114., 125., 143.])

**Explanation**: *You have used the predict function to estimate results based on new input data, the testing dataset. In this case, the code tells the model to make predictions using values from the Weight and Volume columns in the test dataset. These columns are the features the model uses to predict CO2 emissions*.
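
To make clearer what predict() does for each new car, here is a small hand-written sketch of the idea: find the K most similar training cars and average their CO2 emissions. The training rows, the new car, and the helper `knn_predict` are all hypothetical and are not part of the lesson's dataset or of scikit-learn. The sketch uses a plain average, which matches scikit-learn's default setting for `KNeighborsRegressor` (`weights='uniform'`).

```python
import math

# Hypothetical training cars: (Weight, Volume, CO2Emission).
train_rows = [
    (950, 10, 90),
    (1100, 12, 105),
    (1200, 12, 110),
    (1500, 14, 130),
    (2000, 16, 190),
]

def knn_predict(new_car, rows, k):
    """Average the CO2Emission of the k training cars closest to new_car."""
    # Sort training cars by straight-line distance in (Weight, Volume).
    by_distance = sorted(rows, key=lambda row: math.dist(new_car, row[:2]))
    nearest_labels = [row[2] for row in by_distance[:k]]
    return sum(nearest_labels) / k

# New car with Weight 1000 and Volume 11, using K=3.
prediction = knn_predict((1000, 11), train_rows, k=3)
print(prediction)
```

Because these features are not scaled, Weight dominates the distance here; that is the reason the lesson standardizes the features before training.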

### Goal 10: Importing the SKLearn Metrics Library

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it as an ‘add-on’. In this case, we bring in SKLearn, which is a popular library that helps us understand how well the predictions performed.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.metrics**, which will bring in tools for measuring how well a model performed.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **metrics**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>


![](https://pbs.twimg.com/media/GacAgr6W4AAT6At?format=png&name=small)
</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.metrics**, which will provide us with tools for measuring how well a model performed. Some packages have a lengthy name, so we will want to make our own nickname for this.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **metrics**.



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GacAi_vXgAAdT_8?format=png&name=small)
</details>

**Your Turn**: Now it’s your turn! Import **sklearn.metrics** and give it the nickname **metrics** so it is ready for checking our predictions.


In [18]:
#freehand code 


**Explanation**: *The metrics library provides tools to measure and evaluate the performance of machine learning models. By using `metrics`, we can check how well our model is working*.
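
A minimal sketch of the import with a nickname, followed by a quick check that the library is ready to use. The tiny lists passed to `r2_score` are made-up values, used only to show the call.

```python
# Import the library and give it the nickname "metrics".
import sklearn.metrics as metrics

# r2_score is one of the tools inside metrics.
# Identical actual and predicted values give a perfect score.
perfect = metrics.r2_score([1, 2, 3], [1, 2, 3])
print(perfect)  # 1.0
```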

## Assessing the Performance of the Regressor

So how well did our predictions do? Let’s calculate R2 to help us think about the performance of predictions on the testing dataset.


### Goal 11: Assessing the Performance of the Predictions on Test Dataset Using R2


#### Blockly


**Step 1 - Prepare the R2 score calculation from the metrics library**

To calculate the correctness of the model predictions, we will use the r2_score() function from the metrics library. This score measures how close the predictions are to the actual values: the closer it is to 1, the better.

From the Variables menu, get a DO block for the **metrics** variable. With that, select the **r2_score** operation. This operation compares the test labels with the predicted values.



**Step 2 - Calculate KNN model’s correctness**

The r2_score() function takes 2 parameters to calculate the R2 score. So let’s compare **CO2Emission** from the test dataset and **predictions** from the model we just created.

From the Lists menu get a dictVariable block and select the test variable. From the Text menu get a Quote “” block to give the label name "CO2Emission". These values will be used as the true labels for the R2 calculation.



**Step 3 -  Compare testing labels with the predicted values**

From the Lists menu, get a dictVariable block and select "**test**". Then get a Quote “” block from the Text menu, and type the column name "**CO2Emission**". Then add the **predictions** variable as the second input so the true labels can be compared with the predicted values.
<details>
    <summary>Click to see the answer...</summary>


![](https://pbs.twimg.com/media/GcNQTiBWsAASJbl?format=jpg&name=small)
</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Prepare the R2 score calculation from the metrics library**

To calculate the correctness of the model predictions, we will use the r2_score() function from the metrics library. This score measures how close the predictions are to the actual values: the closer it is to 1, the better.

`metrics.r2_score()`



**Step 2 - Calculate knn model’s correctness**

The **r2_score**() function takes 2 parameters to calculate the R2 score:
- *Test data labels*: **test['CO2Emission']**
- *The predicted labels*: **predictions**



**Step 3 - Compare testing labels with the predicted values**

So let’s compare **CO2Emission** from the test dataset and **predictions** from the model we just created.

`metrics.r2_score(test['CO2Emission'], predictions)`



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GcNPLE0XkAETUid?format=jpg&name=small)
</details>

**Your Turn**: Give it a try! Here we use a metric called r2_score() to verify how accurate our model is by comparing our test labels with predictions!


In [21]:
#freehand code 


0.8641064866392857

**Explanation**: *You have calculated a metric called R-squared to evaluate how well a model’s predictions match the actual values. The R-squared score is usually a number between 0 and 1. If it’s close to 1, it means the model’s predictions are very close to the actual values, so it’s doing a good job. If it’s closer to 0, it means the predictions are not much better than always guessing the average value. It can even drop below 0 when the predictions are worse than that. In short, a higher R-squared means a better model*.
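
To show where the R-squared number comes from, the sketch below recomputes it by hand from the residuals. The two lists are copied from the outputs shown earlier in this lesson (the CO2Emission column of the test dataset and the predictions array); the variable names are chosen here for illustration. It uses the standard formula R2 = 1 - (sum of squared residuals) / (sum of squared differences from the mean).

```python
# CO2Emission column of the test dataset and the model's predictions,
# copied from the outputs shown earlier in the lesson.
actual = [153, 199, 152, 170, 245, 82, 79, 104, 126, 152]
predicted = [141, 195, 165, 165, 223, 114, 114, 114, 125, 143]

# Residual = actual value minus predicted value, one per car.
residuals = [a - p for a, p in zip(actual, predicted)]
mean_actual = sum(actual) / len(actual)

# How far off the predictions are, compared with how far off
# we would be if we always guessed the average emission.
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.8641
```

The result agrees with the score returned by `metrics.r2_score` above, and the residuals list shows which cars were predicted worst (the two lightest Fiats are off by more than 30).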

## WHAT DID YOU LEARN?


In this lesson, we learned how to use the K-nearest-neighbor (KNN) algorithm to predict numeric values based on feature similarity. We explored the importance of standardizing data to ensure accurate comparisons and how to choose the optimal value of K to balance model complexity and accuracy. Additionally, we learned to evaluate the model’s performance using residuals and the coefficient of determination (r²). This lesson provided a foundation for understanding non-parametric regression techniques and their applications in real-world scenarios.



## WHAT’S NEXT?


[Simple Linear Regression](Simple_Linear_Regression.ipynb)



## TELL ME MORE


- [Datawhys KNN Regression Notebook](https://github.com/memphis-iis/datawhys-content-notebooks-python/blob/master/KNN-regression.ipynb)
- [Datawhys KNN Regression Problem-Solving Notebook](https://github.com/memphis-iis/datawhys-content-notebooks-python/blob/master/KNN-regression.ipynb)
