# Regression Trees

## ANTICIPATED TIME:

2 hours

## BEFORE YOU BEGIN

[Decision Trees](Decision_Trees.ipynb)

## WHAT YOU WILL LEARN:

* What are regression trees?
* How can regression trees help make predictions for numerical variables?
* What are nodes and how do they help with regression trees?
* How does pruning help us make a better regression tree?
* What is a leaf node?

## DEFINITIONS YOU’LL NEED TO KNOW

* Regression Tree - a model used to make decisions at multiple points using scale/numeric data
* Node - point in a regression tree where a choice is made
* Leaf Node - the end point of a decision tree that gives a final prediction
* Outcome variable - the thing the decision tree is trying to predict
* Partition - to split large groups into smaller groups
* Pruning - removing parts of a decision tree to make it smaller


## SCENARIO:

At this point, the data club has collected a lot of data about vehicle emissions in their city, which is great, but it’s so much data! As they all look at the data, they notice that much of the data is numerical, and they’re not really sure how to go forward from here. The club used decision trees to help make predictions based on category, but now they also want to look at the numerical data like temperature, amount of CO2 emissions, and gasoline used. Even the most experienced group members are feeling overwhelmed by all of the information. But then, Angela remembers learning something about regression trees and how they could be used for all of their numerical data. Maybe this could help!

## WHAT DO I NEED TO KNOW?

In the last notebook, we used decision trees to help predict a category. But what if we are trying to predict a number instead? When connections between data aren’t simple or in a straight line, regression trees can help us see which variables in our data matter the most.

**How Are Regression Trees Different from Other Regression Approaches?**

We’ve used other approaches like simple or multiple linear regression, where we try to predict a number based on other data. Linear regression works best with linear relationships, but regression trees don’t need that. They can handle situations where the relationship between variables is not simple or linear.

**How it Works**

Just like decision trees, regression trees use nodes to ‘partition’ the data into different branches. The regions at the end of those branches are like little boxes where any point inside gets the same prediction: the average of the training points in that box. To keep things simple, each region looks like a rectangle, as you can see in Figure 1.

Let’s think back to the example. When we looked at the decision tree notebook, we had categories like whether or not someone will ‘pass’ or ‘fail’. What if instead of ‘pass’ or ‘fail’ categories, we wanted to predict their grade on a scale of 0-100? Also, we can rethink categories for their year in school (‘freshman’, ‘sophomore’, ‘junior’ or ‘senior’), and instead use their age as a number. Lastly, how about we transform variables like ‘received extra tutoring (yes/no)’ and ‘extracurricular activities (yes/no)’ into the number of hours they spent on each of these things. In many cases, the prediction at a leaf is the average value of the data that reaches that leaf.

![](https://pbs.twimg.com/media/GrKs2JwWYAAi664?format=jpg&name=large)

Now that we’ve moved from thinking in categories to numbers, we are all set with our regression trees!
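
To make the "little boxes" idea concrete, here is a minimal sketch in Python. The tutoring hours and grades are made-up numbers for illustration only, and the split point of 5 hours is picked by hand (a real regression tree searches for a good split on its own).

```python
from statistics import mean

# Made-up example data: (hours of tutoring, grade on a 0-100 scale)
students = [(1, 55), (2, 60), (3, 65), (6, 80), (7, 85), (8, 90)]

# A hand-picked split point (an assumption for this sketch)
split_hours = 5

# Partition the students into two "boxes" (leaves)
left_leaf = [grade for hours, grade in students if hours <= split_hours]
right_leaf = [grade for hours, grade in students if hours > split_hours]

# Each leaf predicts the average grade of the students inside it
left_prediction = mean(left_leaf)
right_prediction = mean(right_leaf)


def predict(hours):
    """Return the leaf average for a student with this many hours."""
    return left_prediction if hours <= split_hours else right_prediction


print(left_prediction, right_prediction)
print(predict(4), predict(7))
```

Every student who lands in the same leaf gets the same predicted grade, no matter where exactly they sit inside that box.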


**Can Trees Get Too Big?**

Big trees can show a lot of data, which isn’t always bad. However, if a regression tree is too big, it can make predictions less accurate. To fix this, we use a process called pruning, where we force the tree to be smaller. There are various ways to prune, like restricting the size of the leaves or the number of branches. Pruning makes the tree smaller and easier to understand. The goal is to keep just enough data, grouped closely together, so the predictions are more accurate and easier to trust.
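
Here is a small sketch of how limiting a tree's size might look with scikit-learn. The data is randomly generated just for illustration, and the settings (`max_depth=3`, `min_samples_leaf=20`) are example values, not recommendations. Strictly speaking, these settings stop the tree from growing too large in the first place, which is one simple way to get a smaller tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up noisy data for illustration: one feature, one numeric outcome
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 2, size=200)

# With no limits, the tree keeps splitting until its leaves are tiny
big_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Two common ways to keep a tree small:
# limit the depth, or require a minimum number of points in each leaf
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
roomy_leaves = DecisionTreeRegressor(min_samples_leaf=20, random_state=0).fit(X, y)

print("leaves (no limit):", big_tree.get_n_leaves())
print("leaves (max_depth=3):", shallow_tree.get_n_leaves())
print("leaves (min_samples_leaf=20):", roomy_leaves.get_n_leaves())
```

A tree with a maximum depth of 3 can have at most 8 leaves, so it is much easier to read than the unrestricted tree.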

## YOUR TURN:

### Goal 1: Importing the pandas library

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it as an ‘add-on’. In this case, we bring in pandas, which is a popular library for doing data science stuff.  

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **pd**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZYmOOpW8AAf7uc?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand

 **Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.


**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **pd**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZmkVCYWEA4oGso?format=jpg&name=small)
</details>

**Your Turn**: *Now it’s your turn! We’re going to dive into the pandas package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “pd” to make it easier to use throughout our notebook*

In [None]:
# freehand code

**Explanation**: *Congrats! You did it! You have now successfully imported the "pandas" package as the variable "pd"*.
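
For reference, the import described in the steps above can be written in one line. The version check afterwards is just an optional extra to confirm the import worked.

```python
# Bring in the pandas library and give it the short name "pd"
import pandas as pd

# Optional check that the import worked: show the installed version
print(pd.__version__)
```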

### Goal 2: Bringing in the dataframe.

Let’s bring in the data from a CSV file to visualize the data for further analysis and manipulation.

#### Blockly

**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us do cool things, let’s bring the data into a variable and call it **data**.

In Blockly, bring in the VARIABLES menu.

**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

In Blockly, go to the Variables and drag the Set block for the **data** variable. This will allow us to assign the result of a function call to the variable. A function is basically code that does a specific task for us.

**Step 3 - Bring in the data**

Now we need to look at the file that has all our data. To load our dataframe, we’ll use a simple command to bring in the file we need (CSV stands for Comma-Separated Values). Let’s say we have a file called ‘AirQualityReg.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **data**.

From the Variable menu, drag a DO block using the pd variable, and select the do operation read_csv. The read_csv function reads a CSV file and returns a DataFrame object.

In our case, let’s bring in the "datasets/AirQualityReg.csv" (use the Quotes from the TEXT menu) because that is what Angela is working with.

**Step 4 - Display the variable**

Let’s see it now by ‘displaying’ and showing our work.

Drag the **data** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace.

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeU1JbW0AERBhQ?format=jpg&name=small)
</details>


In [None]:
# blockly code


#### Freehand

**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us do cool things, let’s bring the data into a variable called **data**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

**Step 2 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need (CSV stands for Comma-Separated Values). Let’s say we have a file called ‘AirQualityReg.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **data**. To do this, we use the “pd.read_csv” function, which reads the CSV file. This variable is now our dataframe!

In our case, let’s bring in "datasets/AirQualityReg.csv" (written inside quotes) because that is what our group is working with.

**Step 3 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

**Step 4 - Print the variable**

Let’s see it now by ‘printing’ and showing our work.

**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeU7BAXwAAjwEH?format=png&name=small)
</details>




**Your Turn**: Now it’s your turn!  Let’s dive in and start working with the data!  We’ll begin by loading it into a dataframe, which will allow us to easily interact with and analyze the dataset.

In [None]:
# freehand code


**Explanation**:  *The training dataset contains information about different car models and their CO2 emissions. Each row includes the brand (e.g., Toyota, Ford), model, volume (likely engine size), weight, and CO2Emission level. Heavier cars or those with larger engine volumes tend to have higher CO2 emissions because they use more fuel. For example, smaller cars like the Toyota Prius have lower emissions, while larger vehicles like the Ford F-150 have higher emissions. This data helps a model learn the relationship between features like volume and weight and their impact on CO2 emissions*.
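
Here is a minimal sketch of the loading step. In the notebook you would read the real file at "datasets/AirQualityReg.csv". To keep this example self-contained, it reads a tiny stand-in table instead. The three rows and the brand/model names are made up, and the exact column spellings "Brand" and "Model" are assumptions.

```python
import io

import pandas as pd

# In the notebook you would load the real file:
#   data = pd.read_csv("datasets/AirQualityReg.csv")
# Here a tiny made-up stand-in (hypothetical values) keeps the sketch self-contained.
csv_text = """Brand,Model,Volume,Weight,CO2Emission
BrandA,Small,1000,900,95
BrandB,Medium,1600,1200,105
BrandC,Large,2500,1700,120
"""

# read_csv accepts a file path or a file-like object and returns a DataFrame
data = pd.read_csv(io.StringIO(csv_text))

# Show the dataframe, then its size as (rows, columns)
print(data)
print(data.shape)
```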

### Goal 3: Import the seaborn library

Before, we’ve been using Plotly to help with the visualization, but we’ll switch it up a bit and use the Seaborn library which has some other visualizations for the correlations we’ll be looking at.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **seaborn**, which will bring in some cool data visualization features.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the import and seaborn together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **sns**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gbdo5AabUAA8svq?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **seaborn**, which will bring in some cool data visualization features.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **sns**.

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the data science magic!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GbfSdtqaQAA2hpQ?format=png&name=small)
</details>


**Your Turn**: Now it’s your turn! We’re going to dive into the seaborn package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “sns” to make it easier to use throughout our notebook.

`import seaborn as sns`

In [None]:
# freehand code


**Explanation**: *Now you have a new graphic library, seaborn, which is a tool that makes it easy to create nice-looking charts and graphs in Python*.



### Goal 4: Analyze correlations with a pair plot

Before, we used a scatter plot that helped show the relationship between two variables. Let’s use a pair plot to find the relationships among multiple variables, which will help us pick features for our regression tree.

#### Blockly

**Step 1 - Create a pair plot**

Let’s see if we can explore the correlations between the variables. Let’s start with a visualization called ‘pairplot’ that helps visualize the relationship among all the variables used.

From the Variables menu, get a DO block for the **sns** variable. With it, select the operation **pairplot** from the Seaborn library.

**Step 2 - Add the parameters to the graph**

Inside the pairplot() method, we add our parameter for the plot. In this case, we’ll bring in the **data** variable that has the dataframe and the variables we’ll look at.

Drag the **data** variable from the variable menu to connect to it. The pair plot will show histograms for each variable on its diagonal and, on each one of the grids, a scatter plot between variables.

**Step 3 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeVlBBXIAA7DC2?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand

**Step 1 - Create a pair plot**

Let’s see if we can explore the correlations between the variables. Let’s start with a visualization called **‘pairplot’** that helps visualize the relationship among all the variables used.

For the pairplot, we call the pairplot() method from the seaborn library.

`sns.pairplot()`

**Step 2 - Add the parameters of the graph**

Inside the pairplot() method, we add our parameter for the plot. In this case, we’ll bring in the **data** variable that has the dataframe and the variables we’ll look at.

`sns.pairplot(data)`

**Step 3 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeVnlLWwAA1GZv?format=png&name=small)
</details>

**Your Turn**: Now it’s your turn! Use pair plot to analyze the relationships between variables.

In [None]:
# freehand code


**Explanation**: *The pairplot shows histograms for each variable on its diagonal, and on each of the grids a scatter plot between variables. We can see that as volume and weight increase, CO2Emission also tends to increase. The plots with upward trends (like volume vs. CO2 and weight vs. CO2) indicate that cars with larger engines and more weight generally emit more CO2. This pattern suggests that both engine size and weight play a role in how much CO2 a car produces, which hints that there is a correlation between the features and the label, so we can create a model to predict it*.


### Goal 5: Importing SKLearn and Model_Selection

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it like an ‘add-on’. In this case, we bring in model_selection from SKLearn, a popular machine learning library, which will help us split our data for training and testing. This will help us learn from it and understand it later.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the import block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.model_selection**, which will bring in tools for splitting data into training and testing sets.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the import and package together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into model_selection, and we type it in the open area.


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZYmOOpW8AAf7uc?format=png&name=small)
</details>

In [None]:
# blockly code



#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type **sklearn.model_selection**, which will bring in tools for splitting data into training and testing sets.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **model_selection**.

**Step 4 - Run**

Hit ‘control’ and ‘enter’ at the same time to run the data science magic!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZmkVCYWEA4oGso?format=jpg&name=small)
</details>


**Your Turn**: Now it’s your turn! We’re going to dive into the sklearn.model_selection package, which helps us split our data for training and testing. First, let’s import the package and assign it to the variable model_selection to make it easier to use throughout our notebook. We can also import the train_test_split function directly, since we will call it by name in the next goal:

`from sklearn.model_selection import train_test_split`

In [None]:
# freehand code


**Explanation**: *Congrats! You did it! You have now successfully imported the model selection sublibrary from the sklearn package and named it model_selection*.
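
This goal mentions two ways of writing the import, so here is a short sketch showing both side by side. Either style gives you the same train_test_split function; the only difference is how you type its name later.

```python
# Style 1: import the whole sublibrary under a short name
import sklearn.model_selection as model_selection

# Style 2: import just the one function we need
from sklearn.model_selection import train_test_split

# Both names point to the same function
print(model_selection.train_test_split is train_test_split)
```

With style 1 you would call `model_selection.train_test_split(...)`, and with style 2 you call `train_test_split(...)` directly, which is how the Freehand steps in the next goal are written.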

### Goal 6: Split the dataset into train test data

We are going to split our data. We will use one part of our data to train the model, and the other part to test how well the model handles data it has not seen. So even if we have unseen data, we should still be able to make good predictions!

#### Blockly

**Step 1 - Use the train_test_split function**

First we need to use a function that will help us with the splitting. In other words, what data to train on and what data to test on.
From the List menu, drag the **train test split** block to divide the dataset.

**Step 2 - Split the dataset**

So now let’s split! But how much? Most people recommend 20%, so let’s go with that.

From the Math menu, drag the number block and set the test size to **0.2** (20% of the data will be used for testing), to define the Test Size. From the Variable menu drag the **data** variable to define the Dataframe parameter.

**Step 3 - Define the label and the features**

We want to predict the variable **CO2Emission**. But what is our prediction based on? Let’s go ahead and tell our model

From the Text menu, drag the Quotes block and type "**CO2Emission**" to define the Label. From the List menu, drag the List block, and use the gear icon to set the number of items (we need 2). From the Text menu, drag 2 Quotes blocks, and type the following features: "**Volume**" and "**Weight**". Connect that block with the Features input.

**Step 4 - Store the split into a variable**

We’ve done our split. Now let’s put it into a new variable so it’s easier to work with. Let’s go ahead and call it **split** so it’s easier to remember.

From the Variables menu, create a variable named **split**. Connect that with the Split block.


**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeV7DHWgAMXIoq?format=jpg&name=small)
</details>

In [None]:
# blockly code


#### Freehand

**Step 1 - Use the train_test_split function**

First we need to use a function that will help us with the splitting. In other words, what data to train on and what data to test on.
The **train_test_split**() function splits data into training and testing sets so the model can be evaluated fairly. It takes parameters like test_size and random_state, and returns separate feature and label sets for training and testing, which helps us spot overfitting.

`train_test_split()`

**Step 2 - Split the dataset**

Set the test size to **0.2** (20% of the data will be used for testing). Also, pass in the complete Dataframe stored in the **data** variable.

`train_test_split(data, test_size=0.2)`

**Step 3 - Define the label and the features**

We want to predict the variable **CO2Emission**. But what is our prediction based on? Let’s go ahead and tell our model.
Define the label as the column "**CO2Emission**" and the features "Volume" and "Weight" from the Dataframe called data.

`train_test_split(data[['Volume','Weight']], data['CO2Emission'], test_size=0.2)`

**Step 4 - Store the split into a variable**

We’ve done our split. Now let’s put it into a new variable so it’s easier to work with. Let’s go ahead and call it split so it’s easier to remember.

Store the result in a variable called split, containing four outputs: **X_train** and **X_test** (feature sets for training and testing) and **y_train** and **y_test** (corresponding labels). This separation lets us train the model on one part of the data while evaluating its performance on unseen data, which helps us spot overfitting.

`split = train_test_split(data[['Volume','Weight']], data['CO2Emission'], test_size=0.2)`




**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeV-f4WMAAygqx?format=jpg&name=medium)
</details>


**Your Turn**: Breaking down your data and setting up train-test splits is a crucial step toward building a solid predictive model. Remember, every piece of data you split and organize brings you closer to a well-trained model. You’ve got this—now go ahead and let your code do the magic!

In [None]:
# freehand code


**Explanation**: *This code imports train_test_split from scikit-learn and uses it to split the dataset into training and testing sets. It selects two features (Volume and Weight) from the data DataFrame as inputs and the CO2Emission column as the target variable. The test_size=0.2 parameter ensures that 20% of the data is allocated for testing, while the remaining 80% is used for training. The result, stored in split, consists of four outputs: training features, testing features, training labels, and testing labels. This helps evaluate the model's performance on unseen data*.
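
Here is a minimal sketch of the split, using a small made-up table (hypothetical values) in place of the real file. The `random_state=42` setting is an optional extra added here so the split comes out the same every time; the value 42 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data (hypothetical values) with this notebook's column names
data = pd.DataFrame({
    "Volume": [1000, 1200, 1300, 1500, 1600, 1800, 2000, 2200, 2500, 3000],
    "Weight": [900, 1000, 1050, 1150, 1200, 1300, 1400, 1500, 1700, 2000],
    "CO2Emission": [95, 99, 100, 103, 105, 108, 110, 114, 120, 130],
})

# Features first, label second; 20% of the rows are held back for testing.
# random_state is an optional extra (an assumption here) that makes the split repeatable.
split = train_test_split(
    data[["Volume", "Weight"]],
    data["CO2Emission"],
    test_size=0.2,
    random_state=42,
)

# The result holds four pieces, always in this order
X_train, X_test, y_train, y_test = split

print(len(X_train), len(X_test))
```

Because the pieces always come back in the same order, `split[0]` and `split[2]` are the training features and training labels, while `split[1]` and `split[3]` are the testing ones.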

### Goal 7: Bring in the SkLearn package to use regression trees in our prediction.  


Remember when we brought in other packages for the extra add-ons? Now let’s bring in a predictor (regressor) to help us find out the numerical variable we want to predict. We specifically want to use SKLearn here to help with regression trees as part of our prediction.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is **“import”**, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the import block.


**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.tree**, which will give us the regression trees that we can use to help with our predictions.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **tree**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GctrVUWWcAAZjt3?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.tree**, which will give us the regression trees that we can use to help with our predictions.

**Step 3 - Import package as acronym**

Once you are done, put the **‘import’** and **‘package’** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **tree**.

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GctrXuFXMAAueAW?format=png&name=small)
</details>

**Your Turn**: Now it’s your turn! We’re going to dive into the sklearn.tree package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “tree” to make it easier to use throughout our notebook.

`import sklearn.tree as tree`

In [None]:
# freehand code




**Explanation**: *Congrats! You did it! You have now successfully imported the sklearn.tree package as the variable tree*.

### Goal 8: Create a regression tree model.

Let’s create a model to help with the training we will do for our dataset.  

#### Blockly

**Step 1 - Write out the name of a variable you want to use for the regressor**

We set up a variable to store the regressor model for later use. In this case, we will call it **rtree**.

On the "Variables" menu, click Create Variable, and type a name for our model, rtree. Then, drag a "SET" block to the workspace for the created variable. This block allows us to assign a value to the new variable.

**Step 2 - Create the regression tree model**

We got the variable, so let’s get our regression tree. Easy peasy!

Using the **tree** library, we call the **DecisionTreeRegressor**() to create the Tree model.

From the Variable menu, drag a Create block for the **tree** variable. On the create list box select the option **DecisionTreeRegressor**. This specifies the type (class) of object we want to create, which is the DecisionTreeRegressor from the tree module.

**Step 3 - Define the hyperparameters**

As we saw above, a tree has different levels. But how many do we want to look at? In this case, we want a tree with a maximum depth of 3 levels.

Drag a Freestyle block, and type **max_depth=3**. This tells the maximum depth of the tree.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeWICxXgAANRHr?format=jpg&name=small)
</details>


#### Freehand

**Step 1 - Write out the name for a variable you want to use for the regressor**

We set up a variable to store the regressor model for later use. In this case, we will call it **rtree**.

Using the tree library, we call the **DecisionTreeRegressor**() constructor.

`tree.DecisionTreeRegressor()`

**Step 2 -  Define the hyperparameters**

As we saw above, a tree has different levels. But how many do we want to look at? In this case, we want a tree with a maximum depth of 3 levels.

`tree.DecisionTreeRegressor(max_depth=3)`

**Step 3 - Assign the regressor model to the variable you created**

We set up a variable to store the regressor model for later use. In this case, we will call it **rtree**.

`rtree = tree.DecisionTreeRegressor(max_depth=3)`

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeWKQmWIAEYcQx?format=jpg&name=small)
</details>

**Your Turn**: Now give it a go and see what you get!

In [None]:
# freehand code


**Explanation**:  *You have created a decision tree regressor model. By creating this model, we're preparing to analyze how variables like volume and weight affect another variable, like CO2 emissions, by splitting the data into groups and predicting the average of each group. Once trained, this model can help predict CO2 emissions based on weight or any other factor it’s trained on*.
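
For reference, here is a short sketch of creating the model. The last line is an optional extra that reads the setting back so you can confirm it was stored.

```python
import sklearn.tree as tree

# Create a regression tree that may grow at most 3 levels deep
rtree = tree.DecisionTreeRegressor(max_depth=3)

# The model exists but is not trained yet; we can still inspect its settings
print(rtree.get_params()["max_depth"])
```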

### Goal 9: Train and score the regressor model.

Now that we’ve created our regression tree model, let’s train the model to see how it will learn from the data points that we have in the file.

#### Blockly

**Step 1 - Prepare to train the model**

Now we want to train our regression tree using what the machine learning understands. Now let’s train the model using training data! To do this, the **fit**() method will help us

From the Variable menu, drag the DO block for the **rtree** variable, and select the fit function as the do operation. This specifies the function we want to call, which is the fit method of the tree object.

**Step 2 - Have the training features ready**

So, what variables are we going to use in our prediction? The next step for training the model is to select the features to train the regressor. In this step, we select the features (variables) and add them as a dataframe in the parameter. What’s cool is that the model will train (learn) the regressor based on these 2 variables and use it to predict the label.

From the Lists menu, drag a Train Test Split selection. Select **XTrain** as the feature input. Lastly, from the Variable menu, drag the **split** variable. These are the feature names applied to train (fit) the model.

**Step 3 - Have the training label ready**

So, what is the label that we are trying to predict? Next, we need to add the data labels for the selected features. We add the data labels (the **CO2Emission** column) as a parameter in the fit() method.

From the Lists menu, drag a Train Test Split selection. Select **YTrain** as the feature input. Lastly, from the Variable menu, drag the **split** variable. This is the target value applied to train (fit) the model.

**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the **score**() method from the rtree object. Just as in the previous step, we will just replace the fit() method with the score() method. Based on the ‘fit’, we will try to see how much we were able to predict in our training dataset.

This will give us the tree model's correctness score. A good score will be closer to 1 (i.e., 100%). Medium might be more like 0.95 (95%), and not great might be 0.90 (90%), but it depends on the topic you are looking at.

Right-click on the "**rtree.fit**" block and select "Duplicate" from the context menu. This creates a copy of the block. Within the duplicated block, click on the method dropdown menu and select "**score**" from the list of available methods. The score method will work similarly to fit, and will use the training features and label to measure how much of the training data was learned.

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeWtFrWYAEejGQ?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand

**Step 1 - Prepare to train the model**

Now we want to train our regression tree so it can learn from the data. Let’s train the model using the training data!

To train the regressor model, we call the **fit**() method from it.

`rtree.fit()`

**Step 2 - Have the training features ready**

The next step for training the model is to select the features to train the regressor. In this step, we select the features and add them as a dataframe in the parameter.

In this case, the model will train (learn) the regressor based on these 2 variables, here saved as **position 0** (**XTrain**) of the split variable.

`rtree.fit(split[0])`

**Step 3 -  Have the training label ready**

So what is the label that we are trying to predict? Next, we need to add the data labels for the selected features.

We add the data labels (**CO2Emission**) as a parameter in the fit() method, here saved as **position 2** (**YTrain**) of the split variable.

`rtree.fit(split[0],split[2])`

**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the **score**() method from the rtree object. Just as in the previous step, we will just replace the fit() method with the **score**() method. Based on the ‘fit’, we will try to see how much we could predict in our training dataset.

This will give us the tree model's correctness score. A good score will be closer to 1 (i.e., 100%). Medium might be more like 0.95 (95%), and not great might be 0.90 (90%), but it depends on the topic you are looking at.

`rtree.score(split[0],split[2])`

**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeWu4dWQAAPmpA?format=png&name=small)
</details>

**Your Turn**: Now it’s your turn to train the model! Put your skills into action by training and scoring the regressor model. Explore how well the model learns from the training data and analyze its correctness score. It's a great opportunity to understand the model's predictive power! Ready! Set! Go!

In [None]:
# freehand code


**Explanation**: *You have trained the model using data from the training dataset. It takes weight and volume as the input features and CO2 emissions as the output we want to predict. The fit function makes the model learn the relationship between weight and volume and CO2 emissions from this data. You have also calculated the model's R-squared score on the training data. The R-squared score, which is 0.86 (or 86%), shows how well the model’s predictions match the actual CO2 emissions. A score close to 1 (or 100%) means the model is doing a good job. Here, 0.86 means the model explains 86% of the variation in CO2 emissions in the training data based on weight and volume, indicating it's a fairly accurate model*.
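Here is a small sketch of the whole train-and-score step in one cell. The car numbers are made up for illustration (they are not the lesson's dataset), and the sketch assumes `split` was built with scikit-learn's `train_test_split`, which returns the pieces in the order XTrain, XTest, YTrain, YTest.

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Made-up example cars (NOT the lesson's real dataset).
cars = pd.DataFrame({
    "Volume": [1000, 1200, 1400, 1500, 1600, 1800, 2000, 2200, 2500, 3000],
    "Weight": [790, 1100, 1150, 1250, 1300, 1400, 1500, 1600, 1750, 1900],
    "CO2Emission": [95, 99, 101, 104, 105, 108, 112, 115, 118, 125],
})

# split holds [XTrain, XTest, YTrain, YTest], in that order.
split = train_test_split(cars[["Volume", "Weight"]], cars["CO2Emission"],
                         test_size=0.2, random_state=1)

rtree = tree.DecisionTreeRegressor(max_depth=3)
rtree.fit(split[0], split[2])                  # learn from the training rows
train_score = rtree.score(split[0], split[2])  # R-squared on the same rows
print(train_score)
```

Because the score here is measured on the same rows the tree learned from, it tends to look better than the score on new data, which is why we check the test data later.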

### Goal 10: Importing the graphviz library.

Let’s bring in a package to help us see what our regression tree looks like. In this case, we want to use the graphviz package.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, “command” is to “import” to bring the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **graphviz**, which will bring in tools for drawing graphs like our regression tree.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **gviz**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GctxWklXEAAG9cq?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, “command” is to “import” to bring the add-on package in.


**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **graphviz**, which will provide us with tools for drawing graphs like our regression tree.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into gviz.

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GctxZDmXUAABQd6?format=png&name=small)
</details>


**Your Turn**: Test it out yourself! We set up a command to tell our computer what to do, and after our hard work, we’ll run what we have to see our data science magic at work!

In [None]:
# freehand code


**Explanation**: *Congrats! You did it! You have successfully imported the graphviz package under the name gviz*.

### Goal 11: Show the trained tree.

So we’ve trained our tree on some data. Let’s see what it looks like!


#### Blockly

**Step 1 - Let’s tell our system to show the image**

Let’s start by creating a graph, which we call a source object.

From the Variable menu, drag the Create block. With the **gviz** variable, select **Source** to create the object that exhibits the graph.

**Step 2 - Let’s clarify that we are trying to show a regression tree**

From the Variable menu, drag the DO block with the **tree** variable, and select the operation **export_graphviz** block and connect it to the **Source** block.

**Step 3 - Tell the image generator what model to show**

Let’s tell it what it is we want to generate. In this case, let’s display rtree where we have our regression tree stored.

From the Variable menu, drag the **rtree** variable to specify the regression tree model used for visualization.

**Step 4 - Add in color so it’s easier to read**

Let’s add a bit of color so it’s easier to read.
From the Freestyle menu, drag a freestyle block and type **filled** into it. From the Logic menu,  drag the **true** block and connect it to the parameter to ensure that the tree nodes are color-filled for better readability.


**Step 5 - Bring in variable names into our regression tree**

What variable names do we use to make predictions in our regression tree?

From the Lists menu, drag the create list block and use the gear icon to add two items.
From the Text menu, drag two Quote blocks and enter the feature names: "**Volume**", "**Weight**".
Connect this block to the **feature_names** parameter.

**Step 6 - Displaying what we are trying to predict**

Now that we have our variable (feature) names, let’s tell it what to say about the prediction.  In this case, we want to figure out the **CO2Emission** level based on **Volume** and **Weight**.

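Because a regression tree predicts a number instead of a class, we do not need a separate list of class names here. The list with "**Volume**" and "**Weight**" from Step 5 stays connected to the **feature_names** parameter, and each box in the tree will show a **value**, which is the predicted **CO2Emission** for that group.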

**Step 7 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeXBq4WgAM_ItL?format=jpg&name=small)
</details>


In [None]:
# blockly code


#### Freehand

**Step 1 - Let’s tell our system to show the image**

Let’s start by creating a graph, which we call a source object.
Use **gviz** to create a visualization **Source** object for graph-based output.

`gviz.Source()`


**Step 2 - Let’s clarify that we are trying to show a regression tree**

Use **tree.export_graphviz** to generate a regression tree visualization.

`gviz.Source(tree.export_graphviz())`

**Step 3 - Tell the image generator what model to show**

Let’s tell it what it is we want to generate. In this case, let’s display rtree where we have our regression tree stored.

Pass the **rtree** variable to specify the regression tree model being visualized.

`gviz.Source(tree.export_graphviz(rtree))`

**Step 4 - Add in color so it’s easier to read**

Let’s add a bit of color so it’s easier to read.

Set **filled=True** to color the nodes, making the visualization clearer.

`gviz.Source(tree.export_graphviz(rtree,filled=True))`

**Step 5 - Bring in variable names into our regression tree**

What variable names do we use to make predictions in our regression tree?
Define the **feature_names** list: '**Volume**', '**Weight**', which are used as input variables for the model.

`gviz.Source(tree.export_graphviz(rtree,filled=True,feature_names= ['Volume', 'Weight']))`

**Step 6 - Displaying what we are trying to predict**

Now that we have our variable (feature) names, let’s tell it what to say about the prediction. In this case, we want to figure out the CO2Emission level based on **Volume** and **Weight**.

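Because a regression tree predicts a number instead of a class, we do not need **class_names** here. Each box in the tree will show a **value**, which is the predicted CO2Emission for that group, so the code stays the same as in Step 5.

`gviz.Source(tree.export_graphviz(rtree,filled= True,feature_names= ['Volume', 'Weight']))`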

**Step 7 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeXDldXkAAvFDi?format=jpg&name=medium)
</details>


**Your Turn**: Give it a try!

`gviz.Source(tree.export_graphviz(rtree,filled= True,feature_names= ['Volume', 'Weight']))`

In [None]:
# freehand code



**Explanation**: *You have generated a visual representation of a regression tree (rtree) using Graphviz. It converts the tree to Graphviz's DOT format, colors nodes based on their predicted value, labels features as 'Volume' and 'Weight', and creates a Graphviz Source object for rendering the tree's graphical representation*.
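To see what `export_graphviz` hands over to Graphviz, the sketch below trains a tree on a few made-up rows (not the lesson's data) and prints the start of the DOT text. It relies on `export_graphviz` returning the DOT description as a string when no output file is given.

```python
from sklearn import tree

# Made-up training rows: [Volume, Weight] -> CO2Emission.
X = [[1000, 790], [1200, 1100], [1500, 1250], [1600, 1300],
     [2000, 1500], [2200, 1600], [2500, 1750], [3000, 1900]]
y = [95, 99, 104, 105, 112, 115, 118, 125]

rtree = tree.DecisionTreeRegressor(max_depth=3).fit(X, y)

# With no output file given, export_graphviz returns the tree as DOT text.
dot_text = tree.export_graphviz(rtree, filled=True,
                                feature_names=["Volume", "Weight"])
print(dot_text[:200])  # peek at the start of the DOT description

# With graphviz imported as gviz, gviz.Source(dot_text) draws the picture.
```

The DOT text is just a written description of the boxes and arrows. Passing it to `gviz.Source(...)` is what turns that description into the picture you see in the notebook.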

### Goal 12: Predict labels for the testing dataset (i.e., predict the rest of the data).

So far, we’ve taken a smaller part of all our data to train and try and learn something about it. Can we take what we’ve learned from the training and use it to predict the rest of our dataset?

#### Blockly

**Step 1 - Write out the variable name you want to use**

From the Variables menu, click Create Variable, and type **predictions**. On the same menu, drag the Set block of the **predictions** variable. This variable will hold the result of the prediction.


**Step 2 - Prepare the predict operation**

So let’s take the rtree variable from before and try to predict the label of the new dataset. What will we see in the columns? We’ll want to predict the CO2Emission variable. Let’s start by using the **predict**() method from the tree model.

From the Variables menu, get a DO block, for the **rtree** variable. With that, select the operation predict.

**Step 3 - Set the test features**

Inside the predict() method, we provide the test features from the test data. This will use the 2 features (ie - columns) to predict the labels.

From the Lists menu, drag a Train Test Split selection. Select **XTest** as the feature input. Lastly, from the Variable menu, drag the **split** variable. These feature names are applied to predict the target label on the testing dataset.

**Step 4 - Assign the predictions to the variable you created**

Next, we store the prediction labels into a variable ‘predictions’. To do that, we have to connect the SET **predictions** variable to the **rtree.predict**() block.

**Step 5 - Display the predictions**

Finally, we display the prediction labels using **‘predictions’**.

From the Variables menu, drag the "predictions" variable. This will show the result of the Tree predictions.

**Step 6 - Connect the blocks to run the code**

Connect the blocks and run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeXTRWWUAAiZjm?format=jpg&name=small)
</details>


In [None]:
# blockly code


#### Freehand

**Step 1 - Prepare the predict operation**

So let’s take the rtree variable from before and try to predict the label of the new dataset. What will we see in the columns? We’ll want to predict the CO2Emission variable. Let’s start by using the **predict**() method from the tree model.

`rtree.predict() `

**Step 2 - Set the test features**

Inside the predict() method, we provide the test features from the test data. This will use the 2 features (ie - columns) to predict the labels. We provide the test features from the test data from position 1 of the split variable (**XTest**).

`rtree.predict(split[1])`

**Step 3 - Assign the predictions to the variable you created**

Next, we store the predicted values in a variable called ‘predictions’.

`predictions = rtree.predict(split[1])`


**Step 4 - Display the predictions**

Finally, we display the prediction labels using ‘predictions’

`predictions`

**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeXU_7XAAAAvq_?format=png&name=small)
</details>

**Your Turn**: Give it a go! Let’s see what our results are once we display them!


In [None]:
# freehand code


**Explanation**: *You have used the predict function to estimate results based on new input data, Weight and Volume, from the testing dataset. In this case, the code tells the model to make predictions using values from the Weight and Volume columns in the test dataset. These columns are the features of the model used to predict the outcome, CO2 emissions*.
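The prediction step can be sketched end to end like this. The data is made up for illustration, and `split` is assumed to come from scikit-learn's `train_test_split`, so position 1 is XTest.

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Made-up example cars (NOT the lesson's real dataset).
cars = pd.DataFrame({
    "Volume": [1000, 1200, 1400, 1500, 1600, 1800, 2000, 2200, 2500, 3000],
    "Weight": [790, 1100, 1150, 1250, 1300, 1400, 1500, 1600, 1750, 1900],
    "CO2Emission": [95, 99, 101, 104, 105, 108, 112, 115, 118, 125],
})
split = train_test_split(cars[["Volume", "Weight"]], cars["CO2Emission"],
                         test_size=0.3, random_state=1)

rtree = tree.DecisionTreeRegressor(max_depth=3).fit(split[0], split[2])

# One predicted CO2Emission value for every row of the test features.
predictions = rtree.predict(split[1])
print(predictions)
```

Each prediction is the average CO2Emission of the training cars that ended up in the same leaf node, so a regression tree never predicts a number outside the range it saw during training.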

### Goal 13: Bringing in SKLearn metrics to help look at the performance of predictions.

So we’ve tried to predict on our new dataset. How well did we do? Let’s use SKLearn Metrics to help us think through that.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, “command” is to “import” to bring the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.metrics**, which is a tool that grades your machine learning model's performance, telling you how well it did on its test.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **metrics**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ5swrXUAAEJ-D?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, “command” is to “import” to bring the add-on package in.


**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.metrics**, which is a tool that grades your machine learning model's performance, telling you how well it did on its test.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **metrics**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ simultaneously to run the data science magic!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ5u3ZWsAA_df1?format=png&name=small)
</details>

**Your Turn**: Test it out yourself! We set up a command to tell our computer what to do and after our hard work, we’ll run what we have to see our data science magic at work!

`import sklearn.metrics as metrics`

In [None]:
# freehand code


**Explanation**: *The metrics library provides tools to measure and evaluate the performance of machine learning models. By using `metrics`, we can check how well our model is working, like seeing how accurate it is or how well it groups data in clustering. This helps us understand if our model is doing a good job or if it needs improvement*.
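As a quick illustration of what the library offers, the sketch below grades three made-up guesses against three made-up actual values using two functions from `sklearn.metrics`. The numbers are invented just to show the calls.

```python
import sklearn.metrics as metrics

# Tiny made-up example: actual values vs. a model's guesses.
actual = [100, 110, 120]
guesses = [102, 108, 121]

# Average size of the miss, in the same units as the data.
print(metrics.mean_absolute_error(actual, guesses))

# R-squared: closer to 1 means the guesses track the actual values well.
print(metrics.r2_score(actual, guesses))
```

Both functions take the actual values first and the predicted values second.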

## Assessing the Performance of the Regressor

So, how well did our predictions do? Let’s calculate R2 to help us think about the performance of predictions on the testing dataset.

### Goal 14: Assessing the performance of the predictions on test dataset using R2.

So, how well do our predictions align with the actual values in the test data? Let’s calculate the R² (coefficient of determination) score to evaluate the model's performance and determine how much variance in the target variable is explained by the predictions.

#### Blockly

**Step 1 - Prepare the R2 score calculation from the metrics library**

To calculate the correctness of the model predictions, we will use the **r2_score**() function from the metrics library. This score measures how much of the variation in the actual values is explained by the predictions.

From the Variables menu, get a DO block for the **metrics** variable. With that, select the **r2_score** operation. This operation will compare the correctness of the model and the test label with the predicted values.

**Step 2 - Calculate the regression tree’s correctness**

The r2_score() function takes 2 parameters to calculate the correctness score, which measures how closely the predictions match the actual values. So let’s compare **CO2Emission** from the test dataset and **predictions** from the model we have just created.

From the Lists menu, get a Train Test Split selector and select **YTest**. Connect the **split** variable to it. That will define the values that will be used as the true labels for the R2 calculation, "**CO2Emission**".

**Step 3 -  Compare testing labels with the predicted values**

From the Variable menu, get the variable **predictions** to connect with the **r2_score** function.


**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeXfXxXYAAVGX3?format=jpg&name=small)
</details>

In [None]:
# blockly code


#### Freehand

**Step 1 - Prepare the R2 score calculation from the metrics library**

To calculate the correctness of the model predictions, we will use the **metrics.r2_score**() method from the metrics library. This score measures how much of the variation in the actual values is explained by the predictions.

`metrics.r2_score()`

**Step 2 - Calculate the regression tree’s correctness**

The **r2_score**() method takes 2 parameters to calculate the correctness score:
- Test data labels: **split[3]** (**YTest**, stored in position 3 of the split variable)
- The predicted values: **predictions**


**Step 3 - Compare testing labels with the predicted values**

So let’s compare CO2Emission from the test dataset and predictions from the model we just created.

`metrics.r2_score(split[3], predictions)`

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjeXhHZXAAEPoDl?format=png&name=small)
</details>

**Your Turn**: Give it a try! Here we use a metric called r2_score() to verify how accurate our model is by comparing our test labels with predictions!

In [None]:
# freehand code


**Explanation**: *You have calculated a metric called R-squared to evaluate how well a model’s predictions match the actual values. The R-squared score is usually a number between 0 and 1. If it’s close to 1, it means the model’s predictions are very close to the actual values, so it’s doing a good job. If it’s closer to 0 (or even below 0, which can happen when a model does worse than simply guessing the average), it means the predictions aren’t very accurate. In short, a higher R-squared means a better model*.
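To make the R-squared idea concrete, the sketch below scores a tree on test data with `metrics.r2_score` and then rebuilds the same number by hand. The cars are made up for illustration, and `split` is assumed to come from `train_test_split`, so YTest sits in position 3.

```python
import numpy as np
import pandas as pd
import sklearn.metrics as metrics
from sklearn import tree
from sklearn.model_selection import train_test_split

# Made-up example cars (NOT the lesson's real dataset).
cars = pd.DataFrame({
    "Volume": [1000, 1200, 1400, 1500, 1600, 1800, 2000, 2200, 2500, 3000],
    "Weight": [790, 1100, 1150, 1250, 1300, 1400, 1500, 1600, 1750, 1900],
    "CO2Emission": [95, 99, 101, 104, 105, 108, 112, 115, 118, 125],
})
split = train_test_split(cars[["Volume", "Weight"]], cars["CO2Emission"],
                         test_size=0.3, random_state=1)

rtree = tree.DecisionTreeRegressor(max_depth=3).fit(split[0], split[2])
predictions = rtree.predict(split[1])

# R-squared on the test rows: actual values first, predictions second.
test_r2 = metrics.r2_score(split[3], predictions)

# Same number by hand: 1 - (squared misses) / (spread around the average).
actual = np.asarray(split[3], dtype=float)
ss_res = np.sum((actual - predictions) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot
print(test_r2, manual_r2)
```

The hand calculation shows why the score can drop below 0: if the squared misses are larger than the spread around the average, the fraction is bigger than 1. Calling `rtree.score(split[1], split[3])` gives the same test score in a single step.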

## WHAT DID YOU LEARN?

In this lesson, we learned how to use regression trees for making predictions with numerical variables.  We explored the basics of regression trees, including how to construct them and the importance of pruning to improve model performance.  We also learned practical techniques to enhance the accuracy and robustness of our models.

## WHAT’S NEXT?:


[Random forests](Random_Forests.ipynb)

## ANY EXTRAS?


