# Simple Linear Regression

## ANTICIPATED TIME


2 hours




## BEFORE YOU BEGIN


[KNN Regression](KNN_Regression.ipynb)

## WHAT YOU WILL LEARN



- What is a linear model?
- How can you check if a model fits the data well?
- How do you train and score a linear regression model?
- What is the best fit line and how is it used in predictions?
- How do you calculate and interpret the coefficient of determination (r²)?
- How do you evaluate the performance of a model’s predictions?

## DEFINITIONS YOU’LL NEED TO KNOW



- Regression - model used to predict a number
- Predictor - a variable that is used to predict the values of a dependent/response variable
- Response - the number you are trying to predict
- Coefficient of determination (r²) - a way to tell how well our model fits. An r² close to 1 means our model fits the training data well, while an r² closer to 0 means that the model does not fit the training data well.
- Best fit line - the line that is closest to all the data points on a scatter plot. It shows the direction that the data follows and can help us make predictions.
- Linear relationship - a way to describe two variables that change at a constant rate. This will create a straight line on a graph.
- Residual - the difference between a predicted value and the actual value


## SCENARIO


Ethan, Diego, Angelina, and Kiana are working on a project to solve traffic and pollution problems in their city. Diego has been diligently collecting data on different things that could be related to pollution, such as the number of cars and their emission levels.

Before, they were trying to see if they could group things to find patterns. Now Diego suggests seeing whether they can use one variable to predict the amount of pollution.


## WHAT DO I NEED TO KNOW?


**Simple Linear Regression**
Remember when we talked about the relationships between two variables? Well, what if we could use one variable to predict another?

So how would that work?

Let’s think of an example first. Imagine wanting to know how study time affects your test scores. Here, we have two things that are connected:

- Study time is one variable that you can change.
- Your test score is the other variable, and it depends on your study time.

Simple linear regression can help us understand how these two things are related! Better yet, we can take that relationship to predict the dependent variable. This word linear is really important because the pattern on the graph will be a straight line that we call a ***linear relationship.*** That makes sense, right?

We can use a simple linear regression model to find the best-fit line through our data. The ***best-fit line*** touches or stays close to most of our data points - it's like using a ruler to draw a line through our data points.

What’s really cool about the **best-fit line** is that it:

- Tells us about data points that we *already have*
- Helps us predict data points we *don’t know*!

Let’s go back to our study time and test score scenario. If the best-fit line shows that students score 5 points more for every hour that they study, we can use the formula of the line to estimate what their scores will be based on their total study time. Again, this isn’t info that we already know or have observed. But it is a good guess, or a prediction, based on the study times and test scores that we know and put on our graph. We call this the **predicted response value**.
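To make the idea of predicting with a best-fit line concrete, here is a small sketch. The slope of 5 points per hour comes from the example above. The starting score of 60 is an assumption made up for illustration, and `predict_score` is a hypothetical helper name.

```python
# A hypothetical best-fit line for the study-time example.
# The slope (5 points per hour) comes from the text above.
# The intercept (60) is an assumed value, made up for illustration.
SLOPE = 5.0       # points gained per hour of study
INTERCEPT = 60.0  # assumed score with zero hours of study


def predict_score(hours_studied):
    """Return the predicted test score for a given study time."""
    return INTERCEPT + SLOPE * hours_studied


# Predict scores for study times we may not have observed.
for hours in [0, 1, 2, 3]:
    print(hours, "hours ->", predict_score(hours), "points")
```

Each extra hour adds the same 5 points to the prediction, which is exactly what makes the relationship linear.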

**Residuals…..again!**
We’ve already mentioned in the KNN Regression notebooks that **residuals** tell us the difference between each actual value and the predicted value found on the best-fit line. As you can probably guess, we want them as small as possible because it means our estimates are pretty good.

If you’re like me and worry about eyeballing how small the residuals are, one way to check is with the **coefficient of determination (r²)**, which, for a best-fit line checked against its own training data, falls between 0 and 1.

What does r² tell us? It tells us how much of the variation in our data the model explains. If it’s closer to 1, our model fits the data really well! If it’s closer to 0, we might need to rethink which variables are needed to predict the dependent variable.
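Here is a minimal sketch of how residuals and r² can be worked out by hand. The actual and predicted scores are made-up numbers for illustration only, and the residual is taken as the actual value minus the predicted value.

```python
# Made-up actual scores and predictions, for illustration only.
actual = [62, 68, 71, 74, 80]
predicted = [60, 65, 70, 75, 80]

# Residual = actual value minus predicted value, one per data point.
residuals = [a - p for a, p in zip(actual, predicted)]

mean_actual = sum(actual) / len(actual)
ss_res = sum(r ** 2 for r in residuals)               # error left over after predicting
ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation in the data

# r squared compares the leftover error with the total variation.
r_squared = 1 - ss_res / ss_tot

print("residuals:", residuals)
print("r squared:", round(r_squared, 3))
```

Small residuals make `ss_res` small compared with `ss_tot`, which pushes r² toward 1.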

## YOUR TURN



### Goal 1: Importing the Library

To import a package, use the import command followed by the package name. You can optionally give the package a nickname (an alias) so it is easier to use.

#### Blockly


**Step 1 - Starting the import**
First, we need to set up a “command” to tell the computer what to do. In this case, “command” is to “import” to bring the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the import block.



**Step 2 - Telling what library to import**
In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out pandas, which will bring in some cool data manipulation features.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the package a short nickname (an alias). This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve used pd, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GcxNkjYXkAAlpKD?format=png&name=240x240)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Command**

First, we need to set up a “command” to tell the computer what to do. First we will bring in a certain “package”, which gives us special cool features that help us solve our problem. So we first need to start to “import” that package. In this case, “command” is set up as “import”.


**Step 2 - Package**

Then, we need the “package” we would like to import. So, the package name is typed next to the ‘import’ command. In our case, we will bring in **pandas**, which will provide us with some cool data manipulation features. Some packages have a lengthy name, so we will want to make our own nickname for this.


**Step 3 - Import package as an acronym**

Once you are done, give the package a short nickname (an alias) by adding ‘as’ and the nickname after the package name. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve used **pd**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GZmkVCYWEA4oGso?format=jpg&name=small)
</details>


**Your Turn**: Now it’s your turn! We’re going to dive into the pandas package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “pd” to make it easier to use throughout our notebook.


In [3]:
#freehand code 


**Explanation**: *Congrats! You have successfully imported the pandas package under the name pd*.

### Goal 2: Load the training dataset

Load data from a CSV file into a variable called train to visualize the data for further analysis and manipulation.

#### Blockly


**Step 1 - Write out the variable name you want to use:**

Now that we’re all set with our new package to help us do cool things, let’s bring the data into a variable and call it **train**.

In Blockly, bring in the VARIABLES menu.



**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

In Blockly, go to the Variables and drag the Set block for the **train** variable. This will allow us to assign the result of a function call to the variable.



**Step 3 - Bring in the data**

Now we need to look at the file that has all our data. To load our dataframe, we’ll use a simple command to bring in the file we need (CSV stands for Comma-Separated Values). Let’s say we have a file called ‘AirQualityCars.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **train**.

From the Variable menu, drag a DO block using the **pd** variable, and go ahead with the do operation **read_csv**. The read_csv function reads a CSV file and returns a DataFrame object.

In our case, let’s bring in the “datasets/AirQualityCars.csv" (use the Quotes from the TEXT menu) because that is what Angelina is working with.



**Step 4: Display the variable**

Let’s see it now by ‘displaying’ and showing our work.

Drag the **train** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/Gab5Z-TXsAAQAWr?format=jpg&name=small)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us do cool things, let’s bring the data into a variable called **train**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.



**Step 2 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need (CSV stands for Comma-Separated Values). Let’s say we have a file called ‘AirQualityCars.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **train**. For this, we use the function “pd.read_csv”, which reads the CSV file and returns a dataframe. The variable **train** now holds our dataframe!

In our case, let’s bring in the “datasets/AirQualityCars.csv” because that is what Kiana is working with.
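As a rough sketch of what pd.read_csv does, the example below reads a few rows copied from this notebook’s dataset output. The rows are held in a string (an assumption made so the example runs without the real file), and the extra index column is left out.

```python
import io

import pandas as pd

# A few rows copied from the dataset output in this notebook, held in a
# string so the example runs without the real file.
csv_text = """Brand,Model,Volume,Weight,CO2Emission
Toyota,Corolla,12.3,1300,120
Toyota,Camry,14.0,1495,135
Toyota,RAV4,15.0,1680,140
"""

# With the real file this would be:
#   train = pd.read_csv('datasets/AirQualityCars.csv')
train = pd.read_csv(io.StringIO(csv_text))

# Show the dataframe: one row per car, one column per variable.
print(train)
```

The first line of the CSV becomes the column names, and every following line becomes one row of the dataframe.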



**Step 3 - Assign the dataframe to the variable you created**

Use the equals sign (=) to assign the dataframe that pd.read_csv returns to the **train** variable. From now on, this easy-to-remember name holds the data we brought in.



**Step 4 - Print the variable**

Let’s see it now by ‘printing’ and showing our work.



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/Gab5c61WwAApSiM?format=png&name=small)

</details>


**Your Turn**: Now it’s your turn!  Let’s dive in and start working with the data! We’ll begin by loading it into a dataframe, which will allow us to easily interact with and analyze the dataset.



In [4]:
#freehand code 


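| Unnamed: 0 | Brand | Model | Volume | Weight | CO2Emission |
|---|---|---|---|---|---|
| 0 | Toyota | Corolla | 12.3 | 1300 | 120 |
| 1 | Toyota | Camry | 14.0 | 1495 | 135 |
| 2 | Toyota | RAV4 | 15.0 | 1680 | 140 |
| 3 | Toyota | Prius | 12.5 | 1375 | 90 |
| 4 | Toyota | Highlander | 16.5 | 1965 | 165 |
| 5 | Ford | F-150 | 19.5 | 2200 | 255 |
| 6 | Ford | Mustang | 13.0 | 1655 | 180 |
| 7 | Ford | Explorer | 16.8 | 2020 | 200 |
| 8 | Ford | Focus | 12.0 | 1350 | 115 |
| 9 | Ford | Escape | 15.4 | 1580 | 150 |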



**Explanation**: *The training dataset contains information about different car models and their CO2 emissions. Each row includes the brand (e.g., Toyota, Ford), model, volume (likely engine size), weight, and CO2Emission level. Heavier cars or those with larger engine volumes tend to have higher CO2 emissions because they use more fuel. For example, smaller cars like the Toyota Prius have lower emissions, while larger vehicles like the Ford F-150 have higher emissions. This data helps a model learn the relationship between features like volume and weight and their impact on CO2 emissions*.


### Goal 3: Import the Plotly Express Library

We’ve already brought pandas to help with data science. Let’s bring in Plotly Express to help with some fancy-pants visualizations.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, “command” is to “import” to bring the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **plotly.express**, which will bring in some cool data visualization features.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the package a short nickname (an alias). This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve used **px**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/Gab594HW4AA7zAX?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, “command” is to “import” to bring the add-on package in.


**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **plotly.express**, which will bring in some cool data visualization features.


**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the package a short nickname (an alias) by adding ‘as’ and the nickname after the package name. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve used **px**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/Gab6AduXQAIEi5X?format=png&name=small)

</details>

**Your Turn**: Now it’s your turn! We’re going to dive into the plotly.express package, which helps us with some really cool data visualizations. First, let’s import the package and assign it to the variable “px” to make it easier to use throughout our notebook.


In [5]:
#freehand code 


**Explanation**: *The line import plotly.express as px imports the Plotly Express library, a high-level interface for creating interactive and dynamic visualizations in Python*.

### Goal 4: Present a scatter plot  

Scatter plots help us look at each data point when we are working with interval or ratio (numeric) data. The scatter plot shows us the relationship between two variables in a data set. One variable is plotted on the X-axis, while the other variable is plotted on the Y-axis. They are super handy for finding the relationship between different numeric variables.

#### Blockly

**Step 1 - Call the scatter function from plotly**

Let’s see if we can explore the correlations between two variables. Let’s start with a visualization that helps ‘scatter’ all the data on one visualization.

From the Variables menu, get a DO block for the **px** block. With that, select the **scatter** operation. That will generate the scatter plot.



**Step 2 -  Saying what data to use for the scatter plot**

So what are we going to look at? Simple, you just have to specify the data and the x-axis and y-axis variables.

Inside the scatter() method, we add our parameters for the plot. First we have **train**, which is the dataframe containing the data to plot. From the Variables menu, get the **train** variable. Then, get two Quote “” blocks from the Text menu.



**Step 3 -  Tell plotly what columns to put on the axis**

Now we need to specify the x-axis and y-axis variables. In the two Quote blocks, type the names of the columns for the x-axis and y-axis: **‘Weight’** and **‘CO2Emission’**.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GbE_1liWwAAaKro?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Call the scatter function from plotly**

Let’s see if we can explore the correlations between two variables. Let’s start with a visualization that helps ‘scatter’ all the data on one visualization.

To make the scatterplot, we first call the scatter() function from the px library.

`px.scatter()`



**Step 2 - Saying what data to use for the scatter plot**

So what are we going to look at? Simple, you just have to specify the data and the x-axis and y-axis variables.

Inside the scatter() method, we add our parameters for the plot. First we have **train**, which is the dataframe containing the data to plot.



**Step 3 - Tell plotly what columns to put on the axis**

**‘Weight’** is the column plotted on the x-axis, representing the weight of the cars. **‘CO2Emission’** is the column plotted on the y-axis, representing the amount of CO2 emissions we have recorded.

Inside the scatter() method, we add the following parameters to make our scatterplot.
- **train**: the name of the dataframe
- **Weight**: the data to plot in the X-Axis
- **CO2Emission**: the data to plot in the Y-Axis



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GbE_xP_WgAA5J88?format=png&name=small)

</details>

**Your Turn**: Now you give it a go! Let’s start by creating a scatterplot using px.scatter so we can better visualize the relationship between our two variables.


In [1]:
#freehand code 


**Explanation**: *The scatter plot shows how the weight of an object (like a car) is related to the amount of CO2 it emits. Each dot represents a sample with its specific weight and CO2 emission level. We can see a general upward trend: as the weight increases, CO2 emissions tend to be higher as well. This suggests that heavier objects or vehicles usually produce more CO2. However, the points don't form a perfect line, which means there are other factors besides weight that also affect CO2 emissions*.

### Goal 5: Create a scatter plot and trendline to visualize the correlation between two variables

Let’s try to look at two variables to see their relationship, along with a trendline.

#### Blockly


**Step 1 - Call the scatter function from plotly**

Let’s see if we can explore the correlations between two variables. Let’s start with a visualization that helps ‘scatter’ all the data on one visualization.

From the Variables menu, get a DO block for the **px** block. With that, select the **scatter** operation. That will generate the scatter plot.



**Step 2 -  Saying what data to use for the scatter plot**

So what are we going to look at? Simple, you just have to specify the data and the x-axis and y-axis variables.

Inside the scatter() method, we add our parameters for the plot. First we have **train**, which is the dataframe containing the data to plot. From the Variables menu, get the **train** variable. Then, get two Quote “” blocks from the Text menu.



**Step 3 -  Tell plotly what columns to put on the axis:**

Now we need to specify the x-axis and y-axis variables. In the two Quote blocks, type the names of the columns for the x-axis and y-axis: **‘Weight’** and **‘CO2Emission’**.



**Step 4 - Show the linear regression**

Lastly, we want to add a trendline to our scatter plot to help visualize the relationship between the x and y variables.

In this case, using **freestyle**, type **trendline='ols'**. Here ‘ols’ stands for ordinary least squares, the method that finds the best-fit line by making the squared residuals as small as possible. This will give us the scatterplot, which will show the relationship between the weight of the car and the amount of CO2 it emits.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/Gab603uXMAA2kms?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Call the scatter function from plotly**

Let’s see if we can explore the correlations between the two variables. Let’s start with a visualization that helps ‘scatter’ all the data in one visualization.

To make the scatterplot, we first call the scatter() function from the px library.

`px.scatter()`



**Step 2 - Saying what data to use for the scatter plot**

So what are we going to look at? Simple, you just have to specify the data and the x-axis and y-axis variables.

Inside the scatter() method, we add our parameters for the plot. First we have **train**, which is the dataframe containing the data to plot.



**Step 3 - Tell plotly what columns to put on the axis**

**‘Weight’** is the column plotted on the x-axis, representing the weight of the cars. **‘CO2Emission’** is the column plotted on the y-axis, representing the amount of CO2 emissions we have recorded.

Inside the scatter() method, we add the following parameters to make our scatterplot.
- **train**: the name of the dataframe
- **Weight**: the data to plot in the X-Axis
- **CO2Emission**: the data to plot in the Y-Axis



**Step 4 - Show the linear regression**

Lastly, we want to add a trendline to our scatter plot to help visualize the relationship between the x and y variables. In this case, we’ll type **trendline='ols'**. Here ‘ols’ stands for ordinary least squares, the method that finds the best-fit line by making the squared residuals as small as possible.
- **trendline='ols'**: this parameter adds a trendline to the scatter plot to help visualize the relationship between the x and y variables

This will give us the scatterplot, which will show the relationship between the weight of the car and the amount of CO2 it emits.

`px.scatter(train,'Weight','CO2Emission',trendline='ols')`
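To see what an ordinary least squares trendline works out behind the scenes, here is a plain-Python sketch using the Weight and CO2Emission values from the ten rows in the dataset output above. It is an illustration of the method, not the code plotly itself runs, and the variable names are made up for this example.

```python
# Weight and CO2Emission values from the ten rows in the dataset output.
weights = [1300, 1495, 1680, 1375, 1965, 2200, 1655, 2020, 1350, 1580]
co2 = [120, 135, 140, 90, 165, 255, 180, 200, 115, 150]

n = len(weights)
mean_x = sum(weights) / n
mean_y = sum(co2) / n

# Ordinary least squares picks the slope and intercept that make the
# sum of squared residuals as small as possible.
numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(weights, co2))
denominator = sum((x - mean_x) ** 2 for x in weights)
slope = numerator / denominator
intercept = mean_y - slope * mean_x

# Residuals: actual CO2 minus the value on the trendline.
residuals = [y - (intercept + slope * x) for x, y in zip(weights, co2)]

print("slope:", slope)
print("intercept:", intercept)
```

A positive slope matches the upward trend in the scatter plot: heavier cars tend to have higher CO2 emissions.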


**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/Gab6yUOXcAIM6_H?format=jpg&name=small)

</details>

**Your Turn**: Test it out yourself! Firstly we will create a scatterplot to explore our two variables: car weight and CO2 emissions to see their relationship. Make sure to set the x and y axis with the correct data then we can add a trendline to see the pattern between the variables and better understand their relationship!


In [3]:
#freehand code 


**Explanation**: *The scatterplot shows the relationship between the weight of an object (like a car) and the amount of CO2 it emits. The dots represent data points for different weights and their corresponding CO2 emissions. As we can see, there’s a trend where heavier objects generally produce more CO2 emissions, shown by the upward slope of the trend line. The trend line helps summarize the overall direction of the data. In simple terms, this means that, typically, the heavier something is, the more CO2 it tends to emit*.

### Goal 6: Import the Sklearn linear model library

Let’s bring in a library/package to help with the linear regression that will help us with our analysis and other data science tasks.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, “command” is to “import” to bring the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.linear_model**, which will bring in the tools we need for linear regression.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the package a short nickname (an alias). This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve used **lm**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GbFDOz7XoAAVLtr?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, “command” is to “import” to bring the add-on package in.


**Step 2 - Telling what Library to Import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.linear_model**, which will bring in the tools we need for linear regression.


**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the package a short nickname (an alias) by adding ‘as’ and the nickname after the package name. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve used **lm**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GbFDMCWW8AABSiy?format=png&name=small)

</details>

**Your Turn**: Now it’s your turn! We’re going to dive into the sklearn.linear_model package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “lm” to make it easier to use throughout our notebook.


In [8]:
#freehand code 


**Explanation**: *You have imported the linear model module of the scikit-learn library, which includes tools for creating linear regression models*.

### Goal 7: Setting up our linear model

Let’s create a model to help with the training that we will do for our dataset.

#### Blockly

**Step 1 - Create and assign a variable**

Now that we’re all set with our new package to help us do cool things, let’s create a variable to hold our model and call it **regr**.

In Blockly, bring in the VARIABLES menu. On the "Variables" menu, click Create Variable, type a name for our model, **regr**. Then, drag a "SET" block to the workspace for the created variable. This block allows us to create a new variable and assign a value to it.




**Step 2 - Create the linear regression model**

Using the linear model library, we call LinearRegression() to create the linear regression model.

From the Variables menu, drag a Create block for the **lm** variable. In the Create listbox, select the option **LinearRegression**. This specifies the type (class) of object we want to create, which is LinearRegression from the linear_model module.

With that, a new object of the model, **LinearRegression**, is created. **LinearRegression** is a type of regression model that fits a best-fit line. Here it uses one number to predict one other number.



**Step 3 - Store the linear regressor model in a variable**

We can now connect the **regr** variable with the **LinearRegression** model.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GbFD3KsXwAAqIWJ?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Create the linear regression model**

Using the linear model library, we call LinearRegression() to create the model.

`lm.LinearRegression()`



**Step 2 - Store the regression model in a variable**

Now that we’re all set with our new package to help us do cool things, let’s store the model in a variable called **regr**.

Just like we did before, let’s type out a variable name. This easy-to-remember name will hold the model we just created.

`regr = lm.LinearRegression()`
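Put together with the import from Goal 6, the step above can be sketched as below. The check at the end is an extra added for illustration: it shows that a freshly created model has not learned anything yet, because scikit-learn only sets the `coef_` attribute once fit() has been called.

```python
import sklearn.linear_model as lm

# Create an empty (untrained) linear regression model and store it.
regr = lm.LinearRegression()

# The model exists, but it has no slope yet because it has not been trained.
print(type(regr).__name__)
print("trained yet?", hasattr(regr, "coef_"))
```

Training comes next: the model only finds its best-fit line when we call fit() with our data.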





**Step 3 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GbFD57vWsAAOizJ?format=png&name=small)

</details>


**Your Turn**: Now it's your turn to create a linear regression model for training your dataset! Use the linear model library to call LinearRegression() and store the regression model in a variable. Once you're ready, connect the blocks or run the code to create your model.


In [9]:
#freehand code 


**Explanation**:  *You have created a linear regression model. By creating this model, we're preparing to analyze how one variable, like weight, affects another variable, like CO2 emissions, by finding the line that best fits the data points. Once trained, this model can help predict CO2 emissions based on weight or any other factor it’s trained on*.

### Goal 8: Train and Score the Regressor Model

Now that we’ve brought in our linear regression model, let’s train the model to see how it will learn from the datapoints that we have in the file.

#### Blockly


**Step 1 - Prepare to train the model**

To train the regression model, we call the fit() method on it. The fit() method is how the model learns from the data we give it.

From the Variable menu, drag the DO block for the **regr** variable, and select the fit function as the do operation. This specifies the function we want to call, which is the fit method of the linear regression.



**Step 2 - Have the training features ready**

The next step for training the model is to select the features to train the regressor. In this step, we select the features and add them as a dataframe in the parameter. In this case, the model will learn from the **Weight** column and use it to predict our other variable.

From the Lists menu, drag a dictVariable and select the **train** variable from the list of available variables. Also, from the Lists menu, get a Create List block. Using the Gear icon, add up to 2 items. For each item, add a Text block (a Quote “” from the Text menu). Here we only need one: "**Weight**".



**Step 3 -  Have the training label ready**

So what are we trying to predict? Next, we need to add the data labels for the selected features. We add the data label (**CO2Emission**) as a parameter in the fit() method.

From the Lists menu, drag a dictVariable, and select the "train" variable from the list of available variables. From the Text menu, get a Quote “” block and add a Text **CO2Emission**. This is the target value applied to train (fit) the model.



**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the score() method of our model. Just as in the previous step, we replace the fit() method with the score() method. Based on the ‘fit’, we will see how well the model predicts the values in our training dataset.

This will give us the linear regression score, which is the coefficient of determination (r²). A score closer to 1 means a better fit, and a score closer to 0 means a poor fit. How high the score needs to be depends on the topic you are looking at.

Right-click on the "**regr.fit**" block and select "Duplicate" from the context menu. This creates a copy of the block. Within the duplicated block, click on the method dropdown menu and select "**score**" from the list of available methods. The score method works similarly to fit: it uses the training features and label to measure how well the model fits the training data.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GbFFHUgXgAAnFx5?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Prepare to train the model**

To train the regressor model, we call its fit() method. fit() is the step where the model learns from the training data.

`regr.fit()`



**Step 2 - Have the training features ready**

The next step for training the model is to select the features used to train the regressor. In this step, we select the features and pass them as a dataframe in the first parameter. In this case, the model will train (learn) the regressor with the features we have stored in the variable **train**.

`regr.fit(train[['Weight']])`


**Step 3 - Have the training label ready**

So what is the label that we are trying to predict? Next, we need to add the data labels for the selected features. We add the data labels (**CO2Emission**) as a parameter in the fit() method.

`regr.fit(train[['Weight']], train['CO2Emission'])`
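The two arguments to fit() have different shapes, which is why **Weight** sits inside double brackets and **CO2Emission** inside single brackets. The features are a table (one row per car, one column per feature), while the label is a single column of values. The sketch below only mimics those two shapes with plain Python lists; the numbers are hypothetical and are not taken from the lesson's training file.

```python
# Hypothetical values, for illustration only.
weights = [1000, 1500, 2000]
emissions = [80, 140, 180]

# Features: a table with one row per car and one column per feature,
# similar in shape to train[['Weight']].
features = [[w] for w in weights]

# Label: a flat column of values, similar in shape to train['CO2Emission'].
labels = list(emissions)

print(features)  # [[1000], [1500], [2000]]
print(labels)    # [80, 140, 180]
```

Both must have the same number of rows, because each row of features is paired with one label.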



**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the score() method of the linear regression model. Just as in the previous step, we pass the training features and label, but we replace the fit() method with the score() method. This tells us how well the fitted line matches the training dataset.

This will give us the linear regression score, r². A good score will be close to 1 (i.e., the line explains close to 100% of the variation in the label). Medium might be more like .95 (95% explained). Not great would be .90 (90%). What counts as good depends on the topic you are looking at.

`regr.score(train[['Weight']], train['CO2Emission'])`



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GbFFJWxXgAAAkrQ?format=png&name=small)

</details>


**Your Turn**: Now it's time to train and score your regressor model! Prepare to train the model by calling the fit method and setting up your training features and labels. You’ll measure the correctness of your model by using the score method and evaluate how well your model has learned from the training data. Let’s use the same steps to call the fit method, add your training features and labels, and measure the correctness of your model. Finally, run the code to see your results!


In [4]:
#freehand code 


**Explanation**: *You have trained the model using data from the train dataset. It takes weight as the input feature and CO2 emission as the output we want to predict. The fit function makes the model learn the relationship between weight and CO2 emissions from this data. You have also calculated the model’s R-squared score on the training data, which tells us how well the model's predictions match the actual CO2 emission values in the training dataset. An R-squared score closer to 1 means the model’s predictions are close to the actual values, while a score closer to 0 means the predictions are further off*.
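To make "learning the relationship" less mysterious, here is a minimal sketch of the arithmetic behind a least-squares line and its R-squared score, written in plain Python. It is not scikit-learn's actual code, and the helper names `fit_line` and `r_squared` as well as the three data points are hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (leftover error / total variation)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical training data, for illustration only.
weights = [1000, 1500, 2000]
emissions = [80, 140, 180]

slope, intercept = fit_line(weights, emissions)
fitted = [intercept + slope * w for w in weights]
train_r2 = r_squared(emissions, fitted)
print(slope, intercept, train_r2)
```

In this sketch, `fit_line` plays the role of fit() and `r_squared` plays the role of score(): one finds the best fit line, the other reports how closely that line follows the data it was fitted on.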

### Goal 9: Load the testing dataset

So we’ve looked at the training dataset to learn something about our data. How about applying it to the rest of the dataset and seeing how good our predictions are?

#### Blockly


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable and call it **test**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

In Blockly, bring in the VARIABLES menu.



**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

In Blockly, go to the Variables and drag the Set block for the **test** variable. This will allow us to assign the result of a function call to the variable. A function is basically code that does a specific task for us.



**Step 3 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need (a CSV, or Comma Separated Values, file). Let’s say we have a file called ‘AirQualityCTest.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **test**.

From the Variable menu, drag a DO block using the **pd** variable, and select the do operation **read_csv**. The read_csv function reads a CSV file and returns a DataFrame object.

In our case, let’s bring in the "datasets/AirQualityCtest.csv" (use the Quotes from the TEXT menu) because that is what Angelina is working with.



**Step 4 - Display the variable**

Let’s see it now by ‘displaying’ and showing our work.

Drag the **test** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab-trtWgAAOONP?format=jpg&name=small)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable called **test**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.



**Step 2 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need (a CSV, or Comma Separated Values, file). Let’s say we have a file called ‘AirQualityCTest.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **test**. For this, we use the function “**pd.read_csv**”, which makes the code read the csv file. This variable is now our dataframe!

In our case, let’s bring in the "datasets/AirQualityCtest.csv" (wrap the file path in quotes) because that is what Kiana is working with.



**Step 3 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.



**Step 4 - Print the variable**

Let’s see it now by ‘printing’ and showing our work.



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/Gab-wAQXMAAIVs8?format=png&name=small)

</details>

**Your Turn**: Now it’s your turn!  Let’s dive in and start working with the data!  We’ll begin by loading it into a dataframe, which will allow us to easily interact with and analyze the dataset.


In [11]:
#freehand code 


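| Unnamed: 0 | Brand | Model | Volume | Weight | CO2Emission |
|---|---|---|---|---|---|
| 0 | Porsche | 911 | 13 | 1570 | 153 |
| 1 | Porsche | Cayenne | 16 | 2000 | 199 |
| 2 | Porsche | Macan | 15 | 1855 | 152 |
| 3 | Porsche | Panamera | 15 | 1850 | 170 |
| 4 | Porsche | Taycan | 14 | 2290 | 245 |
| 5 | Fiat | 500 | 10 | 965 | 82 |
| 6 | Fiat | Panda | 11 | 950 | 79 |
| 7 | Fiat | Tipo | 12 | 1270 | 104 |
| 8 | Fiat | 500X | 13 | 1420 | 126 |
| 9 | Fiat | Doblo | 14 | 1600 | 152 |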


**Explanation**: *You have loaded the testing dataset, which is essential to see if our model makes accurate predictions on new, unseen data, ensuring it generalizes well beyond the training data. In this dataset, we have different Porsche and Fiat models with their Weight and actual CO2Emission values. Testing the model on this data shows if it can predict CO2 emissions accurately based on features like weight, confirming the model's reliability*.
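If you are curious what a CSV file looks like underneath, the sketch below reads a few rows with Python's built-in `csv` module instead of pandas. It is only an illustration of the header-plus-rows structure: the text is pasted from three rows of the table shown above (without the index column) so the sketch runs without the actual file, and the variable names are hypothetical.

```python
import csv
import io

# Three rows copied from the test table shown above, pasted as CSV text
# so the sketch does not depend on the real file being present.
sample = """Brand,Model,Volume,Weight,CO2Emission
Porsche,911,13,1570,153
Porsche,Cayenne,16,2000,199
Fiat,500,10,965,82
"""

# The first line becomes the column names; every other line becomes one row.
rows = list(csv.DictReader(io.StringIO(sample)))

# The csv module returns every value as text, so numbers are converted by hand
# (pandas works out the column types on its own).
weights = [int(row["Weight"]) for row in rows]
print(rows[0]["Brand"], weights)
```

In the lesson itself, `pd.read_csv` does all of this in one call and hands back a dataframe.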

### Goal 10: Predict Labels for Testing Dataset (i.e., predict the rest of the data)

So far, we’ve taken a smaller part of all our data to train and try and learn something about it. Can we take what we’ve learned from the training and use it to predict the rest of our dataset?

#### Blockly

**Step 1 - Write out the variable name you want to use**

Now that the model is trained, let’s create a variable called **predictions** to hold the values the model predicts.

From the Variables menu, click Create Variable, and type **predictions**. On the same menu, drag the Set block of the **predictions** variable. This variable will hold the result of the prediction.




**Step 2 - Prepare the predict operation**

So let’s take the **regr** variable from before and try to predict the label of the new dataset for **CO2Emission**. Let’s start by using the predict() method from the linear regression model.

From the Variables menu, get a DO block, for the **regr** variable. With that select the operation **predict**.



**Step 3 - Set the test features**

Inside the predict() method, we provide the test features from the test data. This will use the one feature (i.e., column), **Weight**, to predict the labels.

From the Lists menu, drag a dictVariable and select the "test" variable from the list of available variables. Also from the Lists menu, get a Create List block. Using the Gear icon, add up to 2 items. For each item, add a Text block (a Quote “” from the Text menu), as follows: **Weight**. These are the feature names applied to predict the target label on the testing dataset. Store the output of the model prediction in the "**predictions**" variable. This variable will now hold the result of the prediction.



**Step 4 - Assign the predictions to the variable you created**

Next, we store the prediction labels into a variable ‘predictions’. To do that we have to connect the SET **predictions** variable to the **regr.predict**() block.



**Step 5 - Display the predictions**

Let’s see it now by showing our work.

Drag the **predictions** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace.



**Step 6 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GbFGIkHWwAARNmD?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Prepare the predict operation**

So let’s take the regr variable from before and try to predict the label of the new dataset (**CO2Emission**). Let’s start by using the predict() method from the linear regression model.

`regr.predict()`



**Step 2 - Set the test features**

Inside the predict() method, we provide the test features from the test data. The model was trained on **Weight** only, so we pass that same single column.

`regr.predict(test[['Weight']])`



**Step 3 - Assign the predictions to the variable you created**

Next, we store the predicted labels in a variable called ‘predictions’.

`predictions = regr.predict(test[['Weight']])`



**Step 4 - Display the predictions**

Finally, we display the predicted labels using ‘predictions’.

`predictions`




**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GbFGKjbW4AAGRIC?format=png&name=small)

</details>

**Your Turn**: Let’s give it a go and see the output of your predictions!


In [12]:
#freehand code 


array([140.55133978, 190.71308247, 173.79807622, 173.21480014,
       224.54309499,  69.97493436,  68.22510613, 105.55477511,
       123.05305744, 144.05099625])

**Explanation**: *You have used the predict function to estimate results based on new input data from the testing dataset. In this case, the code tells the model to make predictions using values from the Weight column in the test dataset. This column is the feature of the model used to predict the outcome, CO2 emissions*.
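Because the model is a straight line, every prediction is just an intercept plus a slope times the car's weight. The sketch below checks this using only numbers already shown above: the Weight column of the test table and the predictions array. It rebuilds the line from the first two cars and confirms that the other eight predictions sit on it. The variable names are hypothetical, and the slope and intercept are recovered from the rounded output, so they are approximate.

```python
# Weight values and model output copied from the test table and the
# predictions array shown above.
weights = [1570, 2000, 1855, 1850, 2290, 965, 950, 1270, 1420, 1600]
predictions = [140.55133978, 190.71308247, 173.79807622, 173.21480014,
               224.54309499, 69.97493436, 68.22510613, 105.55477511,
               123.05305744, 144.05099625]

# Two points are enough to pin down a straight line.
slope = (predictions[1] - predictions[0]) / (weights[1] - weights[0])
intercept = predictions[0] - slope * weights[0]

# Every other prediction should land on that same line.
rebuilt = [intercept + slope * w for w in weights]
largest_gap = max(abs(r - p) for r, p in zip(rebuilt, predictions))
print(round(slope, 4), round(intercept, 2), largest_gap < 1e-5)
```

Read this way, the slope says how much the predicted CO2Emission goes up for each extra unit of Weight, which is exactly what the best fit line is used for.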

### Goal 11: Import metrics library

Let’s bring in a library/package that will help us evaluate the linear regression model’s predictions.

#### Blockly


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.metrics**, which will help us evaluate our linear regression.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **metrics**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GacAgr6W4AAT6At?format=png&name=small)

</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.


**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.metrics**, which will help us evaluate our linear regression.


**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **metrics**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GacAi_vXgAAdT_8?format=png&name=small)

</details>

**Your Turn**: Now it’s your turn! We’re going to bring in the sklearn.metrics package, which gives us tools to evaluate our model. Let’s import the package and assign it to the variable “metrics” to make it easier to use throughout our notebook.

In [13]:
#freehand code 


**Explanation**: *The metrics library provides tools to measure and evaluate the performance of machine learning models. By using `metrics`, we can check how well our model is working*.
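Scores like the ones in `metrics` are built from residuals, the gaps between what the model predicted and what actually happened. As a small sketch, the code below computes one residual per car from the actual CO2Emission values and the predictions shown above, then finds the car the model missed by the most. The sign convention used here (actual minus predicted) and the variable names are assumptions made for this illustration.

```python
# Actual values and predictions copied from the test table and the
# predictions array shown above.
actual = [153, 199, 152, 170, 245, 82, 79, 104, 126, 152]
predictions = [140.55133978, 190.71308247, 173.79807622, 173.21480014,
               224.54309499, 69.97493436, 68.22510613, 105.55477511,
               123.05305744, 144.05099625]

# Residual = actual value minus predicted value, one per car.
residuals = [a - p for a, p in zip(actual, predictions)]

# Row position of the car with the largest miss, ignoring the sign.
worst = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print(worst, round(residuals[worst], 2))
```

A negative residual means the model predicted more CO2 than the car actually emits; a positive one means it predicted less.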

### Goal 12: Assessing the performance of the predictions on test dataset using R2  

So how well did our predictions do? Let’s calculate R2 to help us think about the performance of the predictions on the testing dataset.

#### Blockly

**Step 1 - Prepare the R2 score calculation from the metrics library**

To calculate the correctness of the model predictions, we will use the r2_score() function from the metrics library. This score measures how closely the predictions match the actual values: the closer to 1, the better.

From the Variables menu, get a DO block for the **metrics** variable. With that, select the **r2_score** operation. This operation will compare the correctness of the model and the test label with the predicted values.



**Step 2 - Calculate the linear model’s correctness**

The r2_score() function takes 2 parameters: the true labels and the predicted values. So let’s compare **CO2Emission** from the test dataset and **predictions** from the model we just created.

From the Lists menu get a dictVariable block and select the test variable. From the Text menu get a Quote “” block to give the label name **"CO2Emission"**. This column will be used as the true labels for the R2 calculation.



**Step 3 -  Compare testing labels with the predicted values**

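From the Lists menu, get a dictVariable block and select "**test**". Then get a Quote “” block from the Text menu and type "**CO2Emission**" to pick that column as the first parameter. For the second parameter, drag in the **predictions** variable so the true labels are compared with the predicted values.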



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GcNQTiBWsAASJbl?format=jpg&name=small)

</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Prepare the R2 score calculation from the metrics library**

To calculate the correctness of the model predictions, we will use the **metrics.r2_score**() function from the metrics library. This score measures how closely the predictions match the actual values: the closer to 1, the better.

`metrics.r2_score()`



**Step 2 - Calculate linear model’s correctness**

The r2_score() function takes 2 parameters to calculate the score:
- Test data labels: **test['CO2Emission']**
- The predicted labels: **predictions**


**Step 3 - Compare testing labels with the predicted values**

So let’s compare CO2Emission from the test dataset and predictions from the model we just created.

`metrics.r2_score(test['CO2Emission'], predictions)`



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>
    
![](https://pbs.twimg.com/media/GcNPLE0XkAETUid?format=jpg&name=small)

</details>

**Your Turn**: Give it a try! Here we use a metric called r2_score() to check how well our model did by comparing our test labels with the predictions!


In [14]:
#freehand code 


0.9391988904414559

**Explanation**: *You have calculated a metric called R-squared to evaluate how well a model’s predictions match the actual values. The R-squared score is usually a number between 0 and 1 (on new data it can even drop below 0 if the model does worse than always guessing the average). If it’s close to 1, it means the model’s predictions are very close to the actual values, so it’s doing a good job. If it’s closer to 0, it means the predictions aren’t very accurate. In short, a higher R-squared means a better fit*.
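To see where that number comes from, the sketch below works out R-squared by hand from the actual CO2Emission values in the test table and the predictions array shown above. It uses the standard formula, 1 minus the leftover squared error divided by the total squared variation around the average, and reproduces the score from the output above (about 0.939). The variable names are hypothetical, and the predictions are the rounded values as printed, so the last few decimals may differ.

```python
# Actual values and predictions copied from the test table and the
# predictions array shown above.
actual = [153, 199, 152, 170, 245, 82, 79, 104, 126, 152]
predictions = [140.55133978, 190.71308247, 173.79807622, 173.21480014,
               224.54309499, 69.97493436, 68.22510613, 105.55477511,
               123.05305744, 144.05099625]

mean_actual = sum(actual) / len(actual)

# Leftover error: how far the predictions are from the actual values.
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predictions))

# Total variation: how far the actual values are from their own average.
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.9392
```

The leftover error is only about 6% of the total variation, which is why the score lands near 0.94.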


## WHAT DID YOU LEARN?



In this lesson on simple linear regression, we started by understanding what a linear model is and how it can help us predict outcomes based on known variables. You explored the concept of the best fit line, which shows the direction of the data and helps make predictions. We also covered how to check the model’s fit using tools like the coefficient of determination (r²) and the Wald test, which gives us a p-value to check whether the relationship we found is likely to be real rather than due to chance. Additionally, you learned about residuals and how plotting them can help us understand the model's accuracy. Throughout the lesson, you got hands-on experience with practical examples, ensuring you not only understood the theory but also gained practical skills in applying simple linear regression.


## WHAT’S NEXT?


[Multiple Linear Regression](Multiple_Linear_Regression.ipynb)



## TELL ME MORE


- [Datawhys Simple Linear Regression Notebook](https://github.com/memphis-iis/datawhys-content-notebooks-python/blob/master/Simple-linear-regression.ipynb)
- [Datawhys Simple Linear Regression Problem-Solving Notebook](https://github.com/memphis-iis/datawhys-content-notebooks-python/blob/master/Simple-linear-regression-PS.ipynb)
