# Decision Trees

## ANTICIPATED TIME


2 hours


## BEFORE YOU BEGIN:



[Logistic Regression](Logistic_Regression.ipynb)

## WHAT YOU WILL LEARN

*   What are decision trees?
*   How can decision trees help make decisions for categorical variables?
* What are nodes and how do they help with decision trees?
* How does pruning help us make a better decision tree?
* What is a leaf node?

## DEFINITIONS YOU’LL NEED TO KNOW:

* Decision Tree - a model that predicts an outcome by making a series of decisions at multiple points, often to sort data into categories
* Node - point in a decision tree where a choice is made
* Leaf Node - the end point of a decision tree that gives a final prediction
* Outcome variable - the thing the decision tree is trying to predict
* Partition - to split large groups into smaller groups
* Pruning - removing parts of a decision tree to make it smaller

## SCENARIO:

Kofi, a new member of the group, recognizes that the team has difficulty predicting what type of transportation people are more likely to choose, such as bikes, scooters, or cars. To help think through the categorical data, he suggests that the team use decision trees. Kofi explains that decision trees work by splitting up the data at each layer, using categories like efficiency, cost, and convenience to help with predictions. For example, the first layer of the tree might split based on high or low efficiency. At other layers, the tree might consider other variables, such as cost categories like expensive or cheap. This structure helps the team break down complex data into manageable chunks for better-informed decisions, especially predictions. As the team learns how decision trees work, Kofi realizes this method will help them see their data in different ways and find patterns they haven’t seen before.


## WHAT DO I NEED TO KNOW?

**How Are Decision Trees Different than Other Classification Approaches?**

So far, we’ve learned about the [KNN classifier](KNN_Classification.ipynb), which uses distance to known data points to classify any new data. We also learned about [logistic regression](Logistic_Regression.ipynb), which weighs the variables to help with our classification of new data. We will now talk about a new way to classify using **decision trees**.

Building a decision tree means creating a tree where each branching point is a decision point, which is called a node. Let’s dive in to learn more about our decision trees.


**How are Decision Trees Created?**

Let’s think through different things we might want to predict. For example, we might want to predict whether someone will pass or fail a class, or whether something is contaminated or not contaminated. The decision tree algorithm tries to find the predictor that gives us the best classification.

The trees use nodes to **‘partition’** the data into different ‘branches’. The best predictor (independent variable) goes at the top of the tree.

Once that is done, the algorithm repeats itself: it finds the next predictor (independent variable) that gives the best classification, splits the data on it, then does the same with the next predictor, and so on. Each predictor that is considered adds a new layer, and there are fewer and fewer data points under each layer. The last layer, which holds our final outcomes, is made up of **leaf nodes**.
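To make "the predictor that gives the best classification" more concrete, here is a minimal sketch of one common way to score a split: Gini impurity, which is the default criterion in scikit-learn's `DecisionTreeClassifier`. The lesson does not say which measure is used, so treat this as an illustration only. The student records and the helper names `gini` and `split_impurity` are made up for this sketch.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 means every label in the group is the same."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def split_impurity(rows, predictor, outcome):
    """Weighted average impurity of the groups made by splitting on one predictor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[predictor], []).append(row[outcome])
    total = len(rows)
    return sum(len(g) / total * gini(g) for g in groups.values())

# Hypothetical data, invented for this sketch
students = [
    {"tutoring": "yes", "activities": "yes", "result": "pass"},
    {"tutoring": "yes", "activities": "no",  "result": "pass"},
    {"tutoring": "no",  "activities": "yes", "result": "fail"},
    {"tutoring": "no",  "activities": "no",  "result": "fail"},
]

# Lower impurity means the predictor separates the outcomes more cleanly
scores = {p: split_impurity(students, p, "result") for p in ("tutoring", "activities")}
best_predictor = min(scores, key=scores.get)
print(scores, best_predictor)
```

In this made-up data, splitting on tutoring separates pass from fail perfectly, so it would go at the top of the tree.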

Let’s dig into an example. Let’s say you want to predict how someone will do in a class using categories like ‘pass’ or ‘fail’. The most impactful predictor might be grade level - 'middle school' and 'high school'. Under that, you might have another node like ‘received extra tutoring (yes/no)’ or another node that says ‘extracurricular activities (yes/no)’. At the bottom level, we will find the leaf nodes, which show us whether each combination of predictors leads to a prediction of ‘pass’ or ‘fail’.
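One way to picture this example is as a set of nested questions. The sketch below writes a tree like that by hand. The function name `predict_result` and the outcome at each leaf are invented for illustration; a real decision tree would learn them from data.

```python
def predict_result(grade_level, tutoring, activities):
    """Walk a hypothetical tree from the top node down to a leaf node."""
    if grade_level == "middle school":   # top node: the most impactful predictor
        if tutoring == "yes":            # second layer on the middle school branch
            return "pass"                # leaf node
        return "fail"                    # leaf node
    # high school branch
    if activities == "yes":              # second layer on the high school branch
        return "pass"                    # leaf node
    return "fail"                        # leaf node

print(predict_result("middle school", "yes", "no"))
print(predict_result("high school", "no", "no"))
```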


![](https://pbs.twimg.com/media/GrFtGFHWMAEvGeK?format=jpg&name=4096x4096)


**Can Trees Get Too Big?**

The problem with this algorithm is that it can be too powerful - the decision tree algorithm can easily memorize the data. If a decision tree is too big, it can make predictions on new data less accurate. To fix this, we use a process called **pruning**, where we force the tree to be smaller. Pruning makes the tree smaller and easier to understand. The goal is to keep just enough splits that the predictions stay accurate on new data and are easier to trust.
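As a rough illustration of how a tree can memorize data, the sketch below grows one tree with no size limit and one tree capped with `max_depth`, the same setting this lesson uses later. Capping the depth is only one simple way to keep a tree small. The data here is randomly generated for the sketch and has nothing to do with the lesson's dataset.

```python
import numpy as np
from sklearn import tree

# Made-up data: 200 rows, 3 numeric features, labels with some random noise
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)
noise = rng.rand(200) < 0.1              # flip roughly 10% of the labels
y[noise] = 1 - y[noise]

# One tree with no size limit, one tree limited to 2 levels
big_tree = tree.DecisionTreeClassifier(random_state=0).fit(X, y)
small_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print("big tree depth:", big_tree.get_depth(), "leaves:", big_tree.get_n_leaves())
print("small tree depth:", small_tree.get_depth(), "leaves:", small_tree.get_n_leaves())
print("big tree training score:", big_tree.score(X, y))
print("small tree training score:", small_tree.score(X, y))
```

The unlimited tree keeps splitting until it fits even the noisy rows, which is the "memorizing" described above.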

## YOUR TURN:

### Goal 1: Importing the pandas library

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it like an ‘add-on’. In this case, we bring in pandas, which is a popular library for doing data science stuff.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **pd**, and we type it in the open area.


**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZYmOOpW8AAf7uc?format=png&name=240x240)

</details>



In [None]:
# blocks code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to **import**. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out pandas, which will bring in some cool data manipulation features.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **‘import’** and **‘package’** together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **pd**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZmkVCYWEA4oGso?format=jpg&name=small)

</details>



**Your Turn**: Now it’s your turn! We’re going to dive into the pandas package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “pd” to make it easier to use throughout our notebook.

In [None]:
#freehand code


**Explanation**: *Congrats! You did it! You have now successfully imported the "pandas" package as the variable "pd"*.
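For reference, here is a minimal sketch of what this kind of import looks like in freehand code, followed by a quick check that the nickname works. The tiny table and its values are made up for illustration.

```python
import pandas as pd  # "pd" is just a short nickname for the pandas library

# Quick check that the nickname works: build a tiny table (made-up values)
example = pd.DataFrame({"vehicle": ["bike", "scooter", "car"], "wheels": [2, 2, 4]})
print(example)
```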

### Goal 2: Bringing in the dataframe

Let’s bring in the data that we want to look at.


#### Blockly

**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable and call it **data**.

In Blockly, bring in the VARIABLES menu.

**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

In Blockly, go to the Variables and drag the Set block for the **data** variable. This will allow us to assign the result of a function call to the variable. A function is basically code that does a specific task for us.

**Step 3 - Bring in the data**

Now we need to look at the file that has all our data. To load our dataframe, we’ll use a simple command to bring in the file we need, which is a CSV (Comma Separated Values) file. Let’s say we have a file called ‘AirQualityClass.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **data**.

From the Variable menu, drag a DO block using the **pd** variable, and select **read_csv** as the Do operation. The read_csv function reads a CSV file and returns a DataFrame object.

In our case, let’s bring in “datasets/AirQualityClass.csv” (use the Quotes from the TEXT menu) because that is what students are working with.

**Step 4 - Display the variable**

Let’s see it now by ‘displaying’ and showing our work.

Drag the **data** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace.

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuKgNgXEAA3dDc?format=jpg&name=medium)

</details>



In [None]:
#blockly code


#### Freehand

**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable called **data**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

`pd.read_csv("datasets/AirQualityClass.csv")`





**Step 2 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need, which is a CSV (Comma Separated Values) file. Let’s say we have a file called ‘AirQualityClass.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **data**. For this function, we need to write the code as “**pd.read_csv**”, which makes the code read the CSV file. This variable is now our dataframe!

In our case, let’s bring in “datasets/AirQualityClass.csv” (typed inside quotation marks) because that is what Kofi is working with.

**Step 3 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

**Step 4 - Print the variable**

Let’s see it now by ‘printing’ and showing our work.

**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuKh7zWkAA4z89?format=png&name=small)

</details>



**Your Turn**:  Now it’s your turn!  Let’s dive in and start working with the data! We’ll begin by loading it into a dataframe, which will allow us to easily interact with and analyze the dataset.

In [None]:
#freehand code







**Explanation**:  *Easy-peasy! You have now brought in the dataframe and stored it as a variable that you can reference later on. Now, onto the fun part*!
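If you want to try `read_csv` without the lesson's file, here is a minimal sketch that reads a few made-up rows from text instead. The column names come from this lesson, but the values (including the 0/1 coding of Contaminated) are assumptions invented for illustration.

```python
import io
import pandas as pd

# Stand-in for the real file: made-up rows using column names from this lesson
csv_text = """Methane,NOxEmissions,Contaminated
1.2,0.4,0
3.5,2.1,1
0.8,0.3,0
"""

# read_csv accepts a file path or a file-like object; the lesson passes the path
# "datasets/AirQualityClass.csv" where this sketch passes io.StringIO(...)
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())
```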

### Goal 3: Importing SKLearn and model_selection.

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it like an ‘add-on’. In this case, we bring in model_selection from SKLearn, a popular machine learning library. It will help us split our data so we can train and test our model. This will help us learn from the data and understand it later.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the import block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.model_selection**, which will bring in tools for splitting data into training and testing sets.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the import and package together in a single variable. This handy feature helps cut down on all the typing later on.  Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **model_selection**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gnj02MLWoAE8wxx?format=png&name=small)

</details>



In [None]:
#blockly code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type **sklearn.model_selection**, which will bring in tools for splitting data into training and testing sets.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **model_selection**.

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gnj1FAnXsAAA5dt?format=jpg&name=small)

</details>



**Your Turn**: Now it’s your turn! We’re going to dive into the sklearn.model_selection package, which helps us split our data for training and testing. First, let’s import the package and assign it to the variable model_selection to make it easier to use throughout our notebook.

In [None]:
#freehand code


**Explanation**: *Congrats! You did it! You have now successfully imported the model selection sublibrary from the sklearn package and named it model_selection*.
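There is more than one way to write this import, and the later steps call `train_test_split` by its short name. The sketch below shows two styles side by side and checks that they reach the same function. The variable `same_function` is just a name chosen for this illustration.

```python
# Style used in this goal: import the sublibrary under a name
import sklearn.model_selection as model_selection

# Alternative style: import just the one function we need
from sklearn.model_selection import train_test_split

# Both names point at the same function
same_function = model_selection.train_test_split is train_test_split
print(same_function)
```

With the first style you would write `model_selection.train_test_split(...)`; with the second you can write `train_test_split(...)` directly.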




### Goal 4: Split the dataset into train test data.

We are going to split our data into two parts. The model will learn from one part (the training data), and we will use the other part (the testing data) to check how well the model learned. That way, even when we have unseen data, the model should still be able to make predictions about it!

#### Blockly

**Step 1 - Use the train_test_split function**

First we need to use a function that will help us with the splitting. In other words, what data to train on and what data to test on.
From the List menu, drag the train test split block to divide the dataset.

**Step 2 - Split the dataset**

So now let’s split! But how much? Most people recommend 20%, so let’s go with that.

From the Math menu, drag the number block and set it to **0.2** to define the Test Size (20% of the data will be used for testing). From the Variable menu, drag the **data** variable to define the Dataframe parameter.

**Step 3 - Define the label and the features**

We want to predict the variable **Contaminated**. But what is our prediction based on? Let’s go ahead and tell our model.

From the Text menu, drag the Quotes block and type "**Contaminated**" to define the Label. From the List menu, drag the List block, and use the gear icon to add up to 6 items. From the Text menu, drag 6 Quotes blocks, and type the following features: "Methane", "NOxEmissions", "PM2.5Emissions", "VOCEmissions", "SO2Emissions", and "CO2Emissions". Connect that block with the Features input.

**Step 4 - Store the split into a variable**

We’ve done our split. Now let’s put it into a new variable so it’s easier to work with. Let’s go ahead and call it **split** so it’s easier to remember.

From the Variables menu, create a variable named **split**. Connect that with the Split block.

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuGX39XAAA7M8u?format=jpg&name=small)

</details>



In [None]:
#blockly code


#### Freehand

**Step 1 - Use the train_test_split function**

We need to use a function that will help us with the splitting. In other words, what data to train on and what data to test on.

The train_test_split() function splits data into training and testing sets so the model can be evaluated properly. It takes parameters like test_size and random_state, and returns separate feature and label sets so we can check how the model does on data it has not seen.

`train_test_split()`


**Step 2 - Split the dataset**

So now let’s split! But how much? Most people recommend 20%, so let’s go with that.

Set the test size to **0.2** (20% of the data will be used for testing). Also, pass in the complete dataframe stored in the **data** variable.

`train_test_split(data, test_size=0.2)`


**Step 3 - Define the label and the features**

We want to predict the variable **Contaminated**. But what is our prediction based on? Let’s go ahead and tell our model.

Define the label as the column "**Contaminated**" and the features: "Methane", "NOxEmissions", "PM2.5Emissions", "VOCEmissions", "SO2Emissions", and "CO2Emissions" from the Dataframe called **data**.

`train_test_split(data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']], data['Contaminated'], test_size=0.2)`

**Step 4 - Store the split into a variable**

We’ve done our split. Now let’s put it into a new variable so it’s easier to work with. Let’s go ahead and call it **split** so it’s easier to remember.

Store the result in a variable called **split**, which contains four outputs: **X_train** and **X_test** (feature sets for training and testing) and **y_train** and **y_test** (corresponding labels). This separation lets us train the model on one part of the data and evaluate its performance on unseen data, which helps us spot overfitting.

`split = train_test_split(data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']], data['Contaminated'], test_size=0.2)`

**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuGSF5X0AAdHvi?format=jpg&name=medium)

</details>



**Your Turn**: Now it’s your turn! Split the dataset into training and testing sets, and store the result in the variable **split**.

In [None]:
#freehand code


**Explanation**: *This code imports train_test_split from scikit-learn and uses it to split the dataset into training and testing sets. It selects six features—Methane, NOxEmissions, PM2.5Emissions, VOCEmissions, SO2Emissions, and CO2Emissions—from the data DataFrame as inputs and the Contaminated column as the target variable. The test_size=0.2 parameter ensures that 20% of the data is allocated for testing, while the remaining 80% is used for training. The result, stored in split, consists of four outputs: training features, testing features, training labels, and testing labels. This helps evaluate the model's performance on unseen data*.
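To see the four outputs and their positions, here is a minimal sketch that runs the same kind of split on a small made-up table. The feature names come from this lesson, but the 10 rows of values are invented, and `random_state=42` is an optional extra (not part of the lesson's code) that makes the shuffle repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

features = ["Methane", "NOxEmissions", "PM2.5Emissions",
            "VOCEmissions", "SO2Emissions", "CO2Emissions"]

# Made-up data: 10 rows, so the 80/20 split is easy to see
data = pd.DataFrame({name: [float(i + k) for i in range(10)]
                     for k, name in enumerate(features)})
data["Contaminated"] = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

split = train_test_split(data[features], data["Contaminated"],
                         test_size=0.2, random_state=42)

# The four outputs sit at positions 0, 1, 2, and 3 of split
X_train, X_test, y_train, y_test = split
print(len(X_train), len(X_test), len(y_train), len(y_test))
```

Knowing the positions matters later on, because training uses `split[0]` (training features) and `split[2]` (training labels).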

### Goal 5: Bring in the SkLearn package to use decision trees in our prediction.

Remember when we brought in other packages for the extra add-ons? Now let’s bring in a predictor (classifier) to help us find out the categorical variable we want to predict. We specifically want to use SKLearn here to help with decision trees as part of our prediction.


#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.tree**, which will give us the decision trees that we can use to help with our predictions.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **tree**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GctrVUWWcAAZjt3?format=png&name=small)

</details>



In [None]:
#blockly code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.


**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.tree**, which will give us the decision trees that we can use to help with our predictions.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **‘import’** and **‘package’** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **tree**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!



<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GctrXuFXMAAueAW?format=png&name=small)

</details>



**Your Turn**: Now it’s your turn! We’re going to dive into the sklearn.tree package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “tree” to make it easier to use throughout our notebook.

In [None]:
#freehand code


**Explanation**: *Congrats! You did it! You have now successfully imported the sklearn.tree package as the variable tree*.

### Goal 6: Create the model object.

Now that we’ve imported the library, let’s get started on creating our decision tree. As we get going, we’ll set up the classifier model, assign it to a variable, and define its hyperparameters for later use.


#### Blockly

**Step 1 - Write out the name of a variable you want to use for the classifier**

We set up a variable to store the classifier model for later use. In this case, we will call it **dtree**.

On the "Variables" menu, click Create Variable, type a name for our model, **dtree**. Then, drag a "SET" block to the workspace for the created variable. This block allows us to create a new variable and assign a value to it.

**Step 2 - Create the Decision Tree classifier model**

We got the variable, so let’s get our decision tree. Easy peasy!

Using the **tree** library, we call the **DecisionTreeClassifier**() to create the Tree model.

From the Variable menu, drag a Create block for the **tree** variable. On the create list box select the option **DecisionTreeClassifier**. This specifies the type (class) of object we want to create, which is the DecisionTreeClassifier from the tree library.

**Step 3 - Define the hyperparameters**

As we saw above, a tree has different levels. But how many do we want to look at? In this case, we want a tree with a maximum depth of 2 levels.

Drag a Freestyle block, and type **max_depth=2**. This tells the maximum depth of the tree.


**Step 4 - Assign the classifier model to the variable you created**

We can now connect the **dtree** variable with the **DecisionTreeClassifier** model.

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GcttrFQWYAAiRoO?format=jpg&name=small)

</details>



In [None]:
#blockly code


#### Freehand


**Step 1 - Create the Decision Tree classifier model**

We will store the classifier model in a variable called **dtree** for later use. First, let’s create the tree. Easy peasy!

Using the **tree** library, we call the **DecisionTreeClassifier**() to create the Tree model.

`tree.DecisionTreeClassifier()`

**Step 2 - Define the hyperparameters**

As we saw above, a tree has different levels. But how many do we want to look at? In this case, we want a tree with a maximum depth of 2 levels.

`tree.DecisionTreeClassifier(max_depth=2)`

**Step 3 - Assign the classifier model to the variable you created**

We can now store the **DecisionTreeClassifier** model in the **dtree** variable.

`dtree = tree.DecisionTreeClassifier(max_depth=2)`

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GctttmiXgAETWWZ?format=jpg&name=small)

</details>



**Your Turn**: Ready! Set! Code!


In [None]:
#freehand code


**Explanation**: *You have created a decision tree classifier, which is a type of machine learning model used for classification tasks. By specifying max_depth=2, the tree is limited to two levels of splits, meaning it asks at most two questions about the data before reaching a prediction. This helps keep the model simple and reduces the risk of overfitting (learning too much from the training data). The classifier will later be trained to classify data into categories based on input features*.
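As a small sanity check, the sketch below creates the same kind of model and looks at its settings. At this point the model has only stored its hyperparameters; it has not learned anything, because it has not seen any data yet.

```python
import sklearn.tree as tree

dtree = tree.DecisionTreeClassifier(max_depth=2)

# get_params() lists the settings the model was created with
print(dtree.get_params()["max_depth"])

# A fitted tree stores what it learned in the tree_ attribute; before training it is missing
print(hasattr(dtree, "tree_"))
```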

### Goal 7: Train and score the classifier model.

Now that we’ve brought in our decision tree model, let’s train the model to see how it will learn from the data points that we have in the file.

#### Blockly

**Step 1 - Prepare to train the model**

Now we want the decision tree to learn from our data, so let’s train the model using the training data! To do this, the fit() method will help us.

From the Variable menu, drag the DO block for the **dtree** variable, and select the **fit** function as the do operation. This specifies the function we want to call, which is the fit method of the tree object.

**Step 2 - Have the training features ready**

So what variables are we going to use in our prediction? The next step for training the model is to select the features to train the classifier. In this step, we select the features (variables) and add them as a dataframe in the parameter. What’s cool is that the model will train (learn) the classifier based on these 6 variables and use it to predict the label.

From the Lists menu, drag a Train Test Split selection. Select **XTrain** as the feature input. Lastly, from the Variable menu drag the **split** variable. These are the feature names applied to train (fit) the model.

**Step 3 - Have the training label ready**

So what is the label that we are trying to predict? Next, we need to add the data labels for the selected features. We add the data labels (the Contaminated column) as a parameter in the fit() method.

From the Lists menu, drag a Train Test Split selection. Select **YTrain** as the label input. Lastly, from the Variable menu drag the **split** variable. This is the target value applied to train (fit) the model.

**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the score method from the dtree object. Just as in the previous step, we will just replace the fit() method with the **score**() method. Based on the ‘fit’, we will try to see how much we were able to predict in our training dataset.

This will give us the tree model's correctness score. A good score will be close to 1 (i.e., 100% correct). Medium might be more like .95 (95% accurate). Not great would be .90 (90%). It depends on the topic you are looking at.

Right-click on the "dtree.fit" block and select "Duplicate" from the context menu. This creates a copy of the block. Within the duplicated block, click on the method dropdown menu and select "**score**" from the list of available methods. The score method will work similarly to fit, and will use the training features and label to measure how much of the training data was learned.   

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuEOQjXUAAvDye?format=jpg&name=small)

</details>



In [None]:
#blockly code


#### Freehand

**Step 1 - Prepare to train the model**

Now we want the decision tree to learn from our data, so let’s train the model using the training data! To do this, the **fit**() method will help us.

To train the classifier model, we call the fit() method on the model we created.

`dtree.fit()`

**Step 2 - Have the training features ready**

The next step for training the model is to select the features to train the classifier. In this step, we select the features and add them as a dataframe in the parameter.

In this case, the model will train (learn) the classifier based on these 6 variables, here saved as the **position 0** (**XTrain**) of the split variable.

`dtree.fit(split[0])`

**Step 3 -  Have the training label ready**

So what is the label that we are trying to predict? Next, we need to add the data labels for the selected features.

We add the data labels (Contaminated) as a parameter in the fit() method, here saved as the **position 2** (**YTrain**) of the split variable.

`dtree.fit(split[0],split[2])`

**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the score method from the dtree object. Just as in the previous step, we will just replace the fit() method with the **score**() method. Based on the ‘fit’, we will try to see how much we could predict in our training dataset.

This will give us the Tree model's correctness score. A good score will be close to 1 (i.e., 100% correct). Medium might be more like .95 (95% accurate). Not great would be .90 (90%). It depends on the topic you are looking at.

`dtree.score(split[0],split[2])`
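For a scikit-learn classifier, the number returned by `score()` is accuracy: the fraction of rows where the prediction matches the true label. The sketch below works that out by hand. The helper name `accuracy` and the labels are made up for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Made-up labels: 3 of the 4 predictions match
predicted = ["yes", "no", "no", "yes"]
actual = ["yes", "no", "yes", "yes"]
print(accuracy(predicted, actual))  # 3 / 4 = 0.75
```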

**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuEMWOWgAEP8IA?format=png&name=small)

</details>









**Your Turn**: Put your skills into action by training and scoring the classifier model. Explore how well the model learns from the training data and analyze its correctness score. It's a great opportunity to understand the model's predictive power! Ready! Set! Go!

In [None]:
#freehand code


**Explanation**:  *You have trained a decision tree model (dtree) to predict whether something is "Contaminated" based on six types of emissions: Methane, NOx, PM2.5, VOC, SO2, and CO2. The first line fits (or trains) the model using the data in the train dataset, where the emissions are the input features, and "Contaminated" is the output label. The second line calculates the model's accuracy on the same training data, showing how well the trained model can predict the contamination status using those emissions*.
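Putting the pieces together, here is a minimal end-to-end sketch on made-up data. The feature names come from this lesson, but the values and the rule used to create the Contaminated labels are invented so the example can run without the lesson's file. The last scoring line, which checks the test data at positions 1 and 3 of split, is an addition that the steps above do not include.

```python
import pandas as pd
import sklearn.tree as tree
from sklearn.model_selection import train_test_split

features = ["Methane", "NOxEmissions", "PM2.5Emissions",
            "VOCEmissions", "SO2Emissions", "CO2Emissions"]

# Made-up data: 20 rows of invented emission values
data = pd.DataFrame({name: [float((i * (k + 3)) % 11) for i in range(20)]
                     for k, name in enumerate(features)})
# Invented labeling rule, only so the sketch has something to learn
data["Contaminated"] = (data["Methane"] > 5).astype(int)

split = train_test_split(data[features], data["Contaminated"],
                         test_size=0.2, random_state=0)

dtree = tree.DecisionTreeClassifier(max_depth=2)
dtree.fit(split[0], split[2])                    # training features, training labels

train_score = dtree.score(split[0], split[2])    # same call as in this goal
test_score = dtree.score(split[1], split[3])     # testing features and labels (unseen data)
print(train_score, test_score)
```

Comparing the two scores is one way to notice overfitting: a training score far above the test score suggests the tree memorized the training data.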

### Goal 8: Importing the graphviz library

Let’s bring in a package to help us see what our decision tree looks like. In this case, we want to use the graphviz package.

#### Blockly


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.

Open the IMPORT menu, which is where we bring in other data tools, and drag out the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import.

A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **graphviz**, which gives us tools for drawing graphs and trees.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **gviz**, and we type it in the open area.


**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GctxWklXEAAG9cq?format=png&name=360x360)

</details>



In [None]:
#blockly code


#### Freehand


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **graphviz**, which gives us tools for drawing graphs and trees.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **gviz**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GctxZDmXUAABQd6?format=png&name=small)
</details>



**Your Turn**: Test it out yourself! We set up a command to tell our computer what to do. After our hard work, we’ll run what we have to see our data science magic at work!



In [None]:
#freehand code


**Explanation**: *Congrats! You have successfully imported the graphviz package as the variable gviz*.
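As a small sketch of the import, assuming the graphviz Python package may or may not be installed where you are working, the version below falls back to a message instead of stopping with an error. The `try`/`except` guard is an addition for illustration and is not part of the lesson's answer.

```python
# Try to bring in graphviz under the short name gviz.
try:
    import graphviz as gviz
except ImportError:
    gviz = None  # the package is not installed in this environment
    print("graphviz is not installed; try: pip install graphviz")
```

If the import works, `gviz.Source` is the piece used in the next goal to draw the tree.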

### Goal 9: Show the trained tree

So we’ve taken some data to train on. Let’s see what the trained tree looks like!


#### Blockly

**Step 1 - Let’s tell our system to show the image**

Let’s start by creating a graph, which we call a source object.

From the Variable menu, drag the Create block. With the **gviz** variable, select **Source** to create the object that exhibits the graph.

**Step 2 - Let’s clarify that we are trying to show a decision tree**

From the Variable menu, drag the DO block. With the **tree** variable, select the **export_graphviz** operation and connect the block to the **Source** block.

**Step 3 - Tell the image generator which model to show**

Let’s tell it what it is we want to generate. In this case, let’s display **dtree**, where we have our decision tree stored.

From the Variable menu, drag the **dtree** variable to specify the decision tree model used for visualization.

**Step 4 - Add in color so it’s easier to read**

Let’s add a bit of color so it’s easier to read. We want each node on our tree to be shaded by whether it leans toward **cont** (contaminated) or **not** (not contaminated).

From the Freestyle menu, drag a freestyle block and type **filled** into it. From the Logic menu,  drag the **true** block and connect it to the parameter to ensure that the tree nodes are color-filled for better readability.

**Step 5 - Bring in variable names into our decision tree**

What variable names do we use to make predictions in our decision tree?

From the Lists menu, drag the create list block and use the gear icon to add six items.

From the Text menu, drag six Quote blocks and enter the feature names: "**Methane**", "**NOxEmissions**", "**PM2.5Emissions**", "**VOCEmissions**", "**SO2Emissions**", and "**CO2Emissions**".

Connect this block to the feature_names parameter.

**Step 6 - Displaying what we are trying to predict**

Now that we have our variable (feature) names, let’s tell it what to say about the prediction. In this case, **cont** (contaminated) or **not** (not contaminated).

From the Lists menu, drag another create list block and add two items. From the Text menu, drag two Quote blocks and enter the class names: "cont" (Contaminated) and "not" (Not Contaminated). Connect this block to the **class_names** parameter.

**Step 7 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gctnk11XMAA5BYb?format=png&name=900x900)
</details>



In [None]:
#blockly code


#### Freehand

**Step 1 - Let’s tell our system to show the image**

Let’s start by creating a graph, which we call a source object.

Use **gviz** to create a visualization **Source** object for graph-based output.

`gviz.Source()`

**Step 2 - Let’s clarify that we are trying to show a decision tree**

Use **tree.export_graphviz** to generate a decision tree visualization.

`gviz.Source(tree.export_graphviz())`


**Step 3 - Tell the image generator which model to show**

Let’s tell it what it is we want to generate. In this case, let’s display **dtree** where we have our decision tree stored.

Pass the **dtree** variable to specify the decision tree model being visualized.

`gviz.Source(tree.export_graphviz(dtree))`

**Step 4 - Add in color so it’s easier to read**

Let’s add a bit of color so it’s easier to read. We want each node on our tree to be shaded by whether it leans toward **cont** (contaminated) or **not** (not contaminated).

Set **filled=True** to color the nodes, making the visualization clearer.

`gviz.Source(tree.export_graphviz(dtree,filled=True))`

**Step 5 - Bring in variable names into our decision tree**


What variable names do we use to make predictions in our decision tree?

Define the feature_names list: '**Methane**', '**NOxEmissions**', '**PM2.5Emissions**', '**VOCEmissions**', '**SO2Emissions**', '**CO2Emissions**', which are used as input variables for the model.

`gviz.Source(tree.export_graphviz(dtree,filled=True,feature_names= ['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']))`

**Step 6 - Displaying what we are trying to predict**

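Now that we have our variable (feature) names, let’s tell it what to say about the prediction. In this case, **cont** (contaminated) or **not** (not contaminated).

Set **class_names=['cont', 'not']** to define the classification labels, representing whether contamination is present or not. scikit-learn assigns these names to the classes in ascending order of their label values, so check that the order matches how **Contaminated** is coded in your data.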

`gviz.Source(tree.export_graphviz(dtree,filled= True,feature_names= ['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions'],class_names= ['cont', 'not']))`

**Step 7 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gctnn_EXcAAf9O0?format=jpg&name=large)
</details>



**Your Turn**: Give it a try!

`gviz.Source(tree.export_graphviz(dtree,filled= True,feature_names= ['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions'],class_names= ['cont', 'not']))`

In [None]:
#freehand code


**Explanation**: *You have generated a visual representation of a decision tree model using the graphviz library. The code takes a trained decision tree (dtree) and creates a diagram that shows how the tree makes decisions based on six features (like 'Methane' and 'CO2Emissions'). Each node in the tree splits the data using one of these features to classify examples into one of two categories ('cont' or 'not'). The filled=True option colors each node by its majority class, with a stronger color when the node is purer, making the tree easier to interpret*.
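To see what `export_graphviz` hands to `gviz.Source`, here is a sketch that prints the text description of a small tree. The assumptions: the data is synthetic, the label 1 is taken to mean contaminated, and `max_depth=3` is used only to keep the output short. Because class names are matched to label values in ascending order, 'not' (label 0) is listed first under this coding.

```python
from sklearn.datasets import make_classification
from sklearn import tree

feature_names = ['Methane', 'NOxEmissions', 'PM2.5Emissions',
                 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']

# Synthetic stand-in data; label 1 is assumed to mean contaminated.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)
dtree = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# With no output file given, export_graphviz returns the DOT text as a string.
dot_text = tree.export_graphviz(dtree, filled=True,
                                feature_names=feature_names,
                                class_names=['not', 'cont'])  # label 0 first, then label 1
print(dot_text[:200])
```

Passing `dot_text` to `gviz.Source(dot_text)` in a notebook cell draws the same tree as a picture.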

### Goal 10: Predict labels for testing dataset (ie - rest of the data).

So far, we’ve taken part of our data to train the model and learn something about it. Can we take what we’ve learned from the training and use it to predict the labels for the rest of our dataset?

#### Blockly


**Step 1 - Write out the variable name you want to use**

From the Variables menu, click Create Variable, and type **predictions**. On the same menu, drag the Set block of the **predictions** variable. This variable will hold the result of the prediction.


**Step 2 - Prepare the predict operation**

So let’s take the **dtree** variable from before and try to predict the label of the new dataset. What will we see in the columns? It’ll be either 0 (not contaminated) or 1 (contaminated). Let’s start by using the **predict**() method from the tree model.

From the Variables menu, get a DO block, for the **dtree** variable. With that, select the operation predict.

**Step 3 - Set the test features**

Inside the **predict**() method, we provide the test features from the test data. This will use the 6 features (ie - columns) to predict the labels.

From the Lists menu, drag a Train Test Split selection. Select **XTest** as the feature input. Lastly, from the Variable menu drag the **split** variable. These feature names are applied to **predict** the target label on the testing dataset.

**Step 4 - Assign the predictions to the variable you created**

Next, we store the prediction labels into a variable **‘predictions’**. To do that we have to connect the SET **predictions** variable to the **dtree.predict**() block.

**Step 5 - Display the predictions**

Finally, we display the prediction labels using **‘predictions’**.

From the Variables menu, drag the **predictions** variable. This will show the result of the tree's predictions.

**Step 6 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuBghrWkAEd_gO?format=jpg&name=small)
</details>




In [None]:
#blockly code


#### Freehand

**Step 1 - Prepare the predict operation**

So let’s take the **dtree** variable from before and try to **predict** the label of the new dataset (contaminated or not contaminated). Let’s start by using the **predict**() method from the decision tree model.

`dtree.predict()`

**Step 2 - Set the test features**

Inside the **predict**() method, we provide the test features from the test data, saved at position 1 of the split variable (**XTest**). This will use the 6 features (i.e., columns) to predict the labels.

`dtree.predict(split[1])`

**Step 3 - Assign the predictions to the variable you created**

Next, we store the prediction labels in a variable called **‘predictions’**.

`predictions = dtree.predict(split[1])`

**Step 4 - Display the predictions**

Finally, we display the prediction labels using ‘predictions’.

`predictions`


**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuBZ6nXUAAJQJb?format=png&name=small)

</details>




**Your Turn**: Give it a go! Let’s see what our results are once we display them!


In [None]:
#freehand code


**Explanation**: *You have used a decision tree model (dtree) to make predictions based on test data. It selects specific columns from the test dataset (related to various emissions like Methane, NOx, PM2.5, VOC, SO2, and CO2) and passes them to the model’s predict method. The result, stored in predictions, contains the model’s output (e.g., categories, values, or labels) for each row in the test data based on patterns it learned during training*.
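The prediction step can be sketched as follows. This is an illustration on synthetic data, not the emissions dataset, and it assumes `split` comes from `train_test_split` so that position 1 holds XTest. It shows that `predict` returns one label per test row.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import tree

# Synthetic stand-in data with 6 features and a 0/1 label.
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)
split = train_test_split(X, y, test_size=0.2, random_state=0)

dtree = tree.DecisionTreeClassifier(random_state=0).fit(split[0], split[2])

predictions = dtree.predict(split[1])    # one predicted label per row of XTest
print(predictions[:10])                  # first ten predicted labels
print(len(predictions), len(split[1]))   # the two lengths match
```

The test labels (position 3) are never passed to `predict`. They are held back so the predictions can be graded afterwards.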

### Goal 11: Bringing in SKLearn metrics to help look at the performance of predictions.

So we’ve tried to predict on our new dataset. How well did we do? Let’s use SKLearn Metrics to help us think through that.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.

Open the IMPORT menu, which is where we bring in other data tools, and drag out the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.metrics**, which is a tool that grades your machine learning model's performance, telling you how well it did on its test.


**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into metrics, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ5swrXUAAEJ-D?format=png&name=medium)

</details>




In [None]:
#blockly code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.


**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.metrics**, which is a tool that grades your machine learning model's performance, telling you how well it did on its test.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **metrics**

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ simultaneously to run the data science magic!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ5u3ZWsAA_df1?format=png&name=medium)

</details>


**Your Turn**: Test it out yourself! We set up a command to tell our computer what to do. After our hard work, we’ll run what we have to see our data science magic at work!

In [None]:
#freehand code


**Explanation**: *The metrics library provides tools to measure and evaluate the performance of machine learning models. By using `metrics`, we can check how well our model is working, like seeing how accurate it is or how well it groups data in clustering. This helps us understand if our model is doing a good job or if it needs improvement*.

## Assessing the performance of the classifier










So how well did our predictions do? Let’s assess the predictions on the testing dataset in three ways: accuracy, the confusion matrix, and precision/recall.


### Goal 12: Assessing the performance of the predictions on test dataset using the accuracy score

Now that we have made predictions on our dataset, let’s evaluate our predictions using the accuracy score. This allows you to measure the percentage of correct predictions and get a clear sense of how well your model aligns with the actual test data. Let’s go!

#### Blockly

**Step 1 - Call the accuracy_score() method using the metrics library**

To calculate the accuracy of the model predictions, we will use the **accuracy_score**() function from the metrics library.  

From the Variables menu, drag a DO block for the metrics variable. Select the accuracy_score function from the metrics list of operations. This function takes two inputs: the true labels and the predicted labels.


**Step 2 - Calculate Tree model’s accuracy**

The accuracy_score() function takes 2 parameters to calculate the accuracy score and help measure the percentage of correct predictions. So let’s compare contaminated from the test dataset and predictions from the model we just created.

From the Lists menu, drag a Train Test Split selection. Select **YTest** as the true labels. Lastly, from the Variable menu drag the **split** variable. These are the actual labels of the testing dataset, and they go in as the first parameter.

As the second parameter of the **accuracy_score**, get the variable **predictions**.  The accuracy score function will calculate the accuracy of the model by comparing the true labels with the predicted labels. The result will be a score that indicates the performance of the model.

**Step 3 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuB2maWIAA2D1g?format=jpg&name=small)

</details>



In [None]:
#blockly code


#### Freehand


**Step 1 - Call the accuracy_score() method using the metrics library**

To calculate the accuracy of the model predictions, we will use the **accuracy_score**() method from the metrics library.  This accuracy score will measure the percentage of correct predictions.

`metrics.accuracy_score()`

**Step 2 - Calculate Tree model’s accuracy**

The accuracy_score() function takes 2 parameters to calculate the accuracy score and helps measure the percentage of correct predictions. So let’s compare **contaminated** from the test dataset saved on position 3 from the split variable with the **predictions** from the model we have just created.

`metrics.accuracy_score(split[3],predictions)`

**Step 3 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuATwRWYAA5wJ4?format=png&name=small)

</details>

**Your Turn**: Have a go at it! Once you begin, you’ll be able to assess the performance of the predictions!

In [None]:
#freehand code



**Explanation**: *The accuracy score tells us the percentage of correct predictions out of the total. A higher accuracy score means the model is doing a good job matching the actual labels*.
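A tiny worked example can make the accuracy score concrete. The eight labels below are made up for illustration. The sketch counts the matches by hand and then checks the result against `metrics.accuracy_score`.

```python
from sklearn import metrics

# Made-up true labels and predictions (1 = contaminated, 0 = not contaminated).
y_true      = [1, 0, 1, 1, 0, 1, 0, 1]
predictions = [1, 0, 1, 0, 0, 1, 1, 1]

# Count how many predictions match the true label, then divide by the total.
correct = sum(t == p for t, p in zip(y_true, predictions))
by_hand = correct / len(y_true)

by_sklearn = metrics.accuracy_score(y_true, predictions)
print(by_hand, by_sklearn)   # 0.75 0.75
```

Six of the eight predictions match, so the accuracy is 6 / 8 = 0.75.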


### Goal 13: Assessing the performance of the predictions on the test dataset using the confusion matrix

Now that we have calculated the accuracy, let’s evaluate your model's performance using a confusion matrix. You'll discover how to analyze true positives, true negatives, false positives, and false negatives to better understand the accuracy of your predictions and where the model might need improvement. Ready to dive in?

#### Blockly

**Step 1 - Call the confusion_matrix() function using the metrics library**

To break down the accuracy, let’s get the numbers from the confusion matrix using the confusion_matrix() function from the metrics library.  

From the Variables menu, drag a DO block for the metrics variable. Select the **confusion_matrix** function from the metrics list of operations. This function takes two inputs: the true labels and the predicted labels.

**Step 2 - Calculate Tree model’s confusion matrix**

The confusion_matrix() function takes 2 parameters to explore different parts of the confusion matrix. So let’s compare **Contaminated** from the test dataset and **predictions** from the model we have just created. The confusion matrix will tell us these numbers: TP (true positive), TN (true negative), FP (false positive), and FN (false negative).

From the Lists menu, drag a Train Test Split selection. Select **YTest** as the true labels. Lastly, from the Variable menu drag the **split** variable. These are the actual labels of the testing dataset, and they go in as the first parameter. As the second parameter of **confusion_matrix**, get the variable **predictions**.

**Step 3 - Connect the blocks to run the code**

Connect the blocks and run the code!  

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Git8_tEWcAAzVxz?format=jpg&name=small)

</details>



In [None]:
#blockly code


#### Freehand


**Step 1 - Call the confusion_matrix() method from the metrics library**

To calculate the confusion matrix of the model predictions, we will use the **confusion_matrix**() function from the metrics library.

`metrics.confusion_matrix()`

**Step 2 - Calculate Tree model’s confusion matrix**

The confusion_matrix() function takes 2 parameters to explore different parts of the confusion matrix. So let’s compare **contaminated** from the test dataset, saved at position 3 of the split variable, with the **predictions** from the model we have just created.

The confusion matrix will tell us these numbers - TP (true positive), TN (true negative), FP (false positive), and FN (false negative).

`metrics.confusion_matrix(split[3],predictions)`


**Step 3 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Git883YW8AAlXJ6?format=png&name=small)

</details>



**Your Turn**: Let’s type in the code and see what our performance looks like! What do you see?

In [None]:
#freehand code


**Explanation**: *This matrix indicates that the model correctly identified 169 instances as "Not Contaminated" and 381 instances as "Contaminated." However, it made 3 false positive errors (predicting "Contaminated" when it was actually "Not Contaminated") and 1 false negative error (predicting "Not Contaminated" when it was actually "Contaminated"). This breakdown helps understand the model's accuracy and the types of mistakes it makes*.

![](https://pbs.twimg.com/media/GaQ4srLXMAAiPt2?format=png&name=small)
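To show how the four numbers sit inside the matrix, the sketch below rebuilds label lists that match the counts in the explanation above (169, 3, 1, and 381). The arrays are constructed for illustration and are not the real test data. The coding 0 = not contaminated and 1 = contaminated is assumed.

```python
import numpy as np
from sklearn import metrics

# Label arrays built to match the counts in the explanation
# (0 = not contaminated, 1 = contaminated).
y_true      = np.array([0] * 169 + [0] * 3 + [1] * 1 + [1] * 381)
predictions = np.array([0] * 169 + [1] * 3 + [0] * 1 + [1] * 381)

# Rows are the true labels, columns are the predicted labels.
cm = metrics.confusion_matrix(y_true, predictions)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)

# Accuracy is the share of correct predictions (TN and TP) out of all rows.
accuracy = (tn + tp) / cm.sum()
print(accuracy)
```

For two classes, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], so the top row holds the samples that were truly not contaminated.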


### Goal 14: Assessing the performance of the predictions on the test dataset using recall and precision

Now let’s evaluate your model's performance using precision and recall. These metrics show how well your model avoids false positives (precision) and false negatives (recall), giving you a deeper understanding of its predictions. Let’s dive in and see how your predictions measure up!


#### Blockly

**Step 1 - Call the classification_report() method from the metrics library**

To calculate the classification_report() for the model, we will use the **classification_report**() function from the metrics library.

From the Variables menu, drag a DO block for the metrics variable. Select the **classification_report** function from the metrics list of operations. This function takes two inputs: the true labels and the predicted labels.

**Step 2 - Saying what parameters to use for the classification report**

The classification_report() method takes 2 parameters to calculate the classification report.

From the Lists menu, drag a Train Test Split selection. Select **YTest** as the true labels. Lastly, from the Variable menu drag the **split** variable. These are the actual labels of the testing dataset, and they go in as the first parameter.

As the second parameter of the **classification_report** function, get the variable **predictions**.

**Step 3 - Print the classification report**

Connect the metrics block to a Print block (from the Text menu). The classification report will include other metrics like precision and recall.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Git7yH6XsAEpd0H?format=jpg&name=small)

</details>


In [None]:
# blockly code


#### Freehand


**Step 1 - Call the classification_report() method from the metrics library**

To calculate the classification_report() for the model, we will use the classification_report() function from the metrics library.

`metrics.classification_report()`

**Step 2 - Saying what parameters to use for the classification report**

The **classification_report**() method takes 2 parameters to calculate the classification report. The first is the expected labels, saved at position 3 of the split variable. The second is the **predictions** variable, which holds the labels predicted by the model.

`metrics.classification_report(split[3],predictions)`

**Step 3 - Print the classification report**

Use the print function to show the results of the classification report.

`print(metrics.classification_report(split[3],predictions))`

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Git7uTnXMAAtCcW?format=jpg&name=small)

</details>



**Your Turn**: Let’s try it! First we tried accuracy score, confusion matrix, and now we have recall and precision. The possibilities are endless!

In [None]:
#freehand code


**Explanation**: *The classification report provides some metrics, particularly precision and recall. Precision measures how many of the samples the model labeled as contaminated were actually contaminated, essentially showing the accuracy of its positive predictions. Recall, on the other hand, reflects the model's ability to detect all actual contaminated samples, indicating how well it "found" the true cases. In this report, the precision and recall scores are both high, at 0.99 or 99% for each class, which demonstrates that the model is highly accurate in identifying both contaminated and non-contaminated samples. These scores mean that nearly all positive predictions made by the model were correct, and it missed almost none of the actual contaminated cases*.
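Precision and recall can also be worked out by hand on a tiny example. The eight labels below are made up for illustration and are not the lesson's data. The sketch counts true positives, false positives, and false negatives for the contaminated class (label 1), then compares the results with scikit-learn.

```python
from sklearn import metrics

# Made-up true labels and predictions (1 = contaminated, 0 = not contaminated).
y_true      = [1, 0, 1, 1, 0, 1, 0, 1]
predictions = [1, 0, 1, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, predictions))
tp = sum(t == 1 and p == 1 for t, p in pairs)   # contaminated, predicted contaminated
fp = sum(t == 0 and p == 1 for t, p in pairs)   # not contaminated, predicted contaminated
fn = sum(t == 1 and p == 0 for t, p in pairs)   # contaminated, predicted not contaminated

precision = tp / (tp + fp)   # 3 / 4 = 0.75
recall = tp / (tp + fn)      # 3 / 5 = 0.6

print(precision, metrics.precision_score(y_true, predictions))
print(recall, metrics.recall_score(y_true, predictions))
print(metrics.classification_report(y_true, predictions))
```

Here the model is right in 3 of its 4 "contaminated" calls (precision) but finds only 3 of the 5 truly contaminated samples (recall), which shows that the two numbers can differ.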


## WHAT DID YOU LEARN?

In this notebook, you learned how decision trees can be used to make predictions, especially when the data is categorical.

## WHAT’S NEXT?:

Next, you will learn about [Regression trees](Regression_Trees.ipynb) and why you might use a regression tree instead of a decision tree.

## Additional Resources

* [Datawhys Decision Trees Notebook](https://github.com/memphis-iis/datawhys-intern-solutions-2024/blob/master/Decision-trees.ipynb)
* [Datawhys Decision Trees Problem-Solving Notebook](https://github.com/memphis-iis/datawhys-intern-notebooks-2024/blob/master/Decision-trees.ipynb)