# Random Forests

## ANTICIPATED TIME:

2 hours

## BEFORE YOU BEGIN:


[Regression Trees](Regression_Trees.ipynb)

## WHAT YOU WILL LEARN:




* What is overfitting, and why is it bad?
* What are random forests, and how do they improve accuracy compared to individual decision trees?
* How can hyperparameters be tweaked to help make the best random forest model?



## DEFINITIONS YOU’LL NEED TO KNOW:

* Random forest - an ensemble (‘mix’) algorithm that combines multiple decision trees to make better predictions than a single tree can
* Bootstrapping - choosing random samples of our data, where the same row can be picked more than once, to make each new tree.
* Bootstrap aggregating (bagging) - an ensemble learning technique that combines the predictions of multiple models to improve the overall predictive performance and reduce the risk of overfitting.
* Overfitting - when a model learns the training data so closely that it makes mistakes when we ask it for new predictions
* Out-of-bag (OOB) error - a way to measure how well a model predicts data that its trees did not see during training
* Feature Importance - how much each variable helps predict the final outcome
* Training Data - the part of data used to teach a machine learning model to recognize patterns.
* Testing Data - the other part of data used to evaluate a trained model’s accuracy by comparing predictions to actual values.

## SCENARIO


Angelia and Diego built a decision tree to predict pollution levels in their city, but it kept making mistakes with new data. The model was a little too perfect: it fit the training data so closely that it wasn’t helpful when presented with new situations. They looked into this a bit more and realized that every time they ran a decision tree, it was trained on a different random subset of the data. So how could they use the power of decision trees without worrying about the random split between the training and testing data?
Someone suggested that instead of relying on just one decision tree, they could use many trees together: a random forest. By creating many slightly different trees and combining their predictions, they got much more accurate predictions of the categories.
With their new model, they could identify places where pollution is the highest. Now they wondered: what other problems could random forests help solve?


## WHAT DO I NEED TO KNOW?

Sometimes when we’re working with decision trees (which predict categories) or regression trees (which predict numbers), we run into problems where they seem just a bit... too perfect. They try so hard to ‘fit’ the training data that they don’t know what to do when they come across new data during testing. This is called **overfitting**.
How can we fix this one-too-perfect-tree problem? Instead of building one perfect tree, how about we build many imperfect trees... or a forest of trees! The idea is to create many trees that are slightly different from each other and then combine their answers. When we combine the predictions from many trees that aren’t perfect, we can get a prediction that is as good as, or better than, a single perfect tree.
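One rough way to see overfitting in action is sketched below. It trains one fully grown decision tree and one random forest on synthetic, deliberately noisy data made with scikit-learn's `make_classification`. This is a stand-in, not the air quality dataset used later, and the sample size, noise level, and `random_state` values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (NOT the air quality dataset).
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           flip_y=0.2, random_state=0)

# Hold back part of the data so we can test on rows the models never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One fully grown tree versus a forest of 100 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("tree   train/test:", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("forest train/test:", forest.score(X_train, y_train), forest.score(X_test, y_test))
```

The single tree memorizes its training rows, so its training score is perfect while its test score is lower. That gap is the overfitting; compare it with the forest's test score when you run the cell.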

**How Can We Make Trees in Our Forest Different?**

To make sure our trees are different, we do something called **bootstrapping**. It’s a fancy way of saying we’re choosing random samples of our data (like shuffling a deck and picking cards, putting each card back before the next pick) to make each new tree. This helps us make sure that no trees have the exact same data. In other words, bootstrapping is all about selecting random samples for each of our trees!
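Here is a minimal sketch of bootstrapping using only Python's built-in `random` module. It assumes the usual version of the technique, where each sample is the same size as the data and rows are drawn with replacement. The 10 row numbers and the seed are made up for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch gives the same result every run

rows = list(range(10))  # stand-in for 10 rows of data

# One bootstrap sample per tree: same size as the data, drawn WITH replacement,
# so some rows repeat and others are left out (those are "out-of-bag").
samples = [random.choices(rows, k=len(rows)) for _ in range(3)]

for i, sample in enumerate(samples, start=1):
    out_of_bag = sorted(set(rows) - set(sample))
    print(f"tree {i}: sample={sorted(sample)} out-of-bag={out_of_bag}")
```

The rows a tree never sees are what the out-of-bag score in Goal 6 is built from.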

When we combine the predictions from all these bootstrapped trees, it’s called **bootstrap aggregating** or **bagging**. For example, let’s say we have 300 decision trees. If 151 trees say “Pass” and 149 say “Fail” for a student, the final prediction of the forest is “Pass” because most of the trees agree. The majority wins!
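The vote counting in this example can be sketched in a few lines of plain Python. The list of votes below is made up to match the 151 versus 149 numbers above; a real forest collects these votes from its trained trees.

```python
from collections import Counter

# Hypothetical votes from 300 trees for one student.
votes = ["Pass"] * 151 + ["Fail"] * 149

# Count each answer and keep the most common one: the majority wins.
counts = Counter(votes)
forest_prediction = counts.most_common(1)[0][0]

print(counts["Pass"], counts["Fail"], forest_prediction)
```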

**How ‘Random’ Are Random Forests?**

Once we create a random forest model (Goal 4), how does it work? Bootstrapping already gives each tree its own random sample of the data. We can make the predictions even more reliable by also randomly selecting the features each tree gets to use. Instead of using every variable or feature to make each tree, the algorithm picks a random set of features (variables) to consider. This helps the trees become even more unique, with their own specific identities!
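A simplified picture of random feature selection is sketched below, using the six feature names from later in this notebook. Picking 2 features per tree is an arbitrary choice for illustration. Note that scikit-learn's random forest draws a fresh random subset at every split inside a tree (controlled by `max_features`) rather than once per tree, so treat this as an illustration of the idea only.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

features = ["Methane", "NOxEmissions", "PM2.5Emissions",
            "VOCEmissions", "SO2Emissions", "CO2Emissions"]

# For each of 3 trees, pick 2 of the 6 features at random (no repeats within a pick).
subsets = [random.sample(features, k=2) for _ in range(3)]

for i, subset in enumerate(subsets, start=1):
    print(f"tree {i} may use: {subset}")
```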

Ultimately, a random forest is excellent because it lets us combine and consider the answers of many different trees. Whether it’s predicting if someone will pass or fail a class... or how many miles an electric scooter can travel on one charge, random forests work like a team: by combining many trees, we can help minimize the mistakes that come from relying on just one tree.


### Goal 1: Importing the pandas library.

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it as an ‘add-on’. In this case, we bring in pandas, which is a popular library for data science.

#### Blockly

**Step 1 - Start the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Open the IMPORT menu, which is helpful for bringing in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Tell what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.

**Step 3 - Rename the library so it’s easy to remember**

Once you are done, give the imported **package** a short nickname by storing it in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **pd**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZYmOOpW8AAf7uc?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand

**Step 1 - Start the import:**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

**Step 2 - Tell what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.


**Step 3 - Rename the library so it’s easy to remember**

Once you are done, give the imported package a short nickname by storing it in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name will help you remember it later on. In the example below, we’ve put everything into **pd**.

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZmkVCYWEA4oGso?format=jpg&name=small)
</details>


**Your Turn**: Now it’s your turn! We’re going to dive into the pandas package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “pd” to make it easier to use throughout our notebook.

In [None]:
# freehand code


**Explanation**: *Congrats! You did it! You have now successfully imported the "pandas" package as the variable "pd"*.
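In text form, the import from this goal looks roughly like the sketch below. The tiny `check` table is not part of the lesson; it is only a hypothetical way to confirm that the nickname `pd` works.

```python
import pandas as pd  # "pd" is the short nickname we will type from now on

# Quick check that the import worked: build a tiny one-column table.
check = pd.DataFrame({"works": [True]})
print(check)
```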

### Goal 2: Bringing in the dataframe.

To load data into a dataframe in Python, use the `pd.read_csv` command to read a CSV file and store it in a variable for easy data manipulation and analysis.


#### Blockly

**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable called **data**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

In Blockly, bring in the VARIABLES menu.

**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than typing out the full file name for our data, this easy-to-remember name will hold the data we bring in.

In Blockly, go to the Variables menu and drag the Set block for the **data** variable. This will allow us to assign the result of a function call to the variable.

**Step 3 - Bring in the data**

Now we need to look at the file that has all our data. To load our dataframe, we’ll use a simple command to bring in the file we need (CSV stands for Comma Separated Values). Let’s say we have a file called ‘AirQualityCars.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **data**.

From the Variables menu, drag a DO block using the **pd** variable, and select **read_csv** as the do operation. The read_csv function reads a CSV file and returns a DataFrame object.

In our case, let’s bring in "datasets/AirQualityClass.csv" (use the Quotes from the TEXT menu) because that is the file we are working with.

**Step 4 - Display the variable**

Let’s see it now by ‘displaying’ and showing our work.

Drag the **data** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace.

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuKgNgXEAA3dDc?format=jpg&name=small)
</details>

In [None]:
# blockly code


#### Freehand


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable called **data**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

Just like we did before, let’s pick a variable name. Rather than typing out the full file name for our data, this easy-to-remember name will hold the data we bring in. The data itself will come from the read_csv command:

`pd.read_csv()`

**Step 2 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need (CSV stands for Comma Separated Values). Let’s say we have a file called ‘AirQualityCars.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **data**. To do this, we write `pd.read_csv`, which tells pandas to read the CSV file.

In our case, let’s bring in "datasets/AirQualityClass.csv" (remember to put the file name in quotes).

Once the file is read, we will store the result in a dataframe variable called **data**. A dataframe is like a table made up of rows and columns, which helps us organize and work with data easily.

`pd.read_csv("datasets/AirQualityClass.csv")`


**Step 3 - Assign the dataframe to the variable you created**

Just like we did before, let’s use the variable name we picked. Rather than typing out the full file name for our data, this easy-to-remember name will hold the data we bring in. The equals sign stores the dataframe in **data**.

`data = pd.read_csv("datasets/AirQualityClass.csv")`


**Step 4 - Print the variable**

Let’s see it now by ‘printing’ and showing our work. Typing the variable's name **data** as the last line of the cell will display the contents of the variable on the screen.

`data`

**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuKh7zWkAA4z89?format=png&name=small)
</details>

**Your Turn**:  Now it’s your turn!  Let’s dive in and start working with the data! We’ll begin by loading it into a dataframe, which will allow us to easily interact with and analyze the dataset.

In [None]:
# freehand code


**Explanation**:  *The dataset could be used to train a classification model (such as a random forest) that predicts whether conditions will result in contamination based on various emission levels. Alternatively, if all entries are contaminated, the data could support regression models to predict specific emission levels under contaminated conditions or clustering methods to identify patterns in emission profiles among contaminated sites*.
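If you want to try `pd.read_csv` without the real file, the sketch below reads a tiny CSV from a text string instead. The column names come from this notebook, but the two rows of numbers and the 0/1 coding of `Contaminated` are made-up placeholders, not values from "datasets/AirQualityClass.csv".

```python
import io

import pandas as pd

# Made-up rows for illustration only; the real data lives in
# "datasets/AirQualityClass.csv" and may look different.
csv_text = """Methane,NOxEmissions,PM2.5Emissions,VOCEmissions,SO2Emissions,CO2Emissions,Contaminated
1.2,0.5,10.0,0.3,0.1,400.0,0
3.4,1.1,35.0,0.9,0.4,520.0,1
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)             # (number of rows, number of columns)
print(data.columns.tolist())  # the column names read from the header line
```

With the real file, the only change is passing the path string to `pd.read_csv`, as shown in Step 3 above.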

### Goal 3: Bring in the SkLearn Package to Create the Classifier.

We’ll import the SKLearn Ensemble library to help us with our random forest. Using the import command, we’ll name the library for easy use, store it in a variable, and connect everything. Let’s set it up!

#### Blockly

**Step 1 - Start the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Open the IMPORT menu, which is helpful for bringing in other data tools. In this case, we're bringing in the import block.

**Step 2 - Tell what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.ensemble**, which will bring in models that combine many trees, including random forests.

**Step 3 - Rename the library so it’s easy to remember**

Once you are done, give the imported package a short nickname by storing it in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **ensemble**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoXkEHW0AA4zgo?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand


**Step 1 - Start the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.


**Step 2 - Tell what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.ensemble**, which will provide us with models that combine many trees, including random forests.

**Step 3 - Rename the library so it’s easy to remember**

Once you are done, give the imported package a short nickname by storing it in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **ensemble**.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoXrGSW8AAbb8c?format=png&name=small)
</details>

**Your Turn**: Try setting up the sklearn.ensemble library for your classifier.

In [None]:
# freehand code


**Explanation**: *Congrats! You did it! You have now successfully imported the sklearn.ensemble package as the variable ensemble*.
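As text, the import from this goal could look like the short sketch below. The `print` line is only a hypothetical check that the random forest class is reachable through the `ensemble` nickname.

```python
import sklearn.ensemble as ensemble  # "ensemble" is the nickname used in this notebook

# Quick check: the random forest classifier lives inside this module.
print(ensemble.RandomForestClassifier.__name__)
```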

### Goal 4: Create the Model Object.

Now we’ll set up the foundation for your Random Forest classifier! We’ll create a variable to store the model, initialize it, define hyperparameters, and connect everything to make it ready for use. Let’s build it step by step!

#### Blockly

**Step 1 - Write out the name of a variable you want to use for the classifier**

We set up a variable to store the classifier model for later use. In this case, we will call it **forest**.

On the "Variables" menu, click Create Variable, type a name for our model, **forest**. Then, drag a "SET" block to the workspace for the created variable. This block allows us to create a new variable and assign a value to it.

**Step 2 - Create the random forest classifier model**

We've got the variable, so let's create our forest. Easy peasy!

Using the **ensemble** library, we call **RandomForestClassifier**() to create the forest model.

From the Variables menu, drag a Create block for the forest variable. In the create list box, select the option RandomForestClassifier. This specifies the type (class) of object we want to create, which is the RandomForestClassifier from the ensemble module.

**Step 3 - Define the hyperparameters**

Hyperparameters are settings we choose before training. In this case, we say how many trees we want in the forest (100) and ask the model to calculate the out-of-bag (OOB) score so we can check its performance later.

**Step 4 - Assign the forest model to the variable you created**

We can now connect the forest variable with the **RandomForestClassifier** model.

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoX3KaXIAAaLuZ?format=jpg&name=small)
</details>


In [None]:
# blockly code


#### Freehand


**Step 1 - Write out the name for a variable you want to use for the classifier**

We set up a variable to store the classifier model for later use. In this case, we will call it **forest**.

Using the ensemble library, we call the RandomForestClassifier() constructor.

`ensemble.RandomForestClassifier()`

**Step 2 - Define the hyperparameters**

We've got the variable, so let's create our forest. Easy peasy!

Using the ensemble library, we call RandomForestClassifier() to create the forest model.

Inside the parentheses, we say **n_estimators=100** and **oob_score=True**. The first sets the number of trees in the forest (100), and the second tells the model to calculate the out-of-bag (OOB) score.

`ensemble.RandomForestClassifier(n_estimators=100,oob_score=True)`


**Step 3 - Assign the classifier model to the variable you created**

We store the classifier model in the variable we chose for later use. In this case, we call it **forest**.

`forest=ensemble.RandomForestClassifier(n_estimators=100,oob_score=True)`

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoX6HjWIAA--gs?format=jpg&name=small)
</details>

**Your Turn**: Define, initialize, and set up your Random Forest Classifier. Bring the "forest" to life and get it ready for precise forecasts!

`forest=ensemble.RandomForestClassifier(n_estimators=100,oob_score=True)`

In [None]:
# freehand code


**Explanation**: *You have created a Random Forest Classifier, which is a machine learning (ML) model. It's called a "forest" because it uses multiple decision trees (100 in this case, set by n_estimators=100) to make predictions. These trees work together like a forest of decision-makers. The oob_score=True part enables a feature that helps evaluate how well the model performs without needing separate test data. Essentially, this model combines the "opinions" of many decision trees to make more accurate predictions, similar to how asking many friends for advice might lead to a better decision than asking just one*.
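Putting the pieces of this goal together, a runnable version might look like the sketch below. The `print` line is an extra (not part of the lesson) that reads the two hyperparameters back from the model so you can confirm they were set.

```python
import sklearn.ensemble as ensemble

# 100 trees, and ask the model to compute the out-of-bag score during training.
forest = ensemble.RandomForestClassifier(n_estimators=100, oob_score=True)

# Read the settings back from the (still untrained) model.
print(forest.n_estimators, forest.oob_score)
```

Nothing has been trained yet at this point; the model only learns from data once `fit` is called in Goal 5.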


### Goal 5: Train and score the classifier model.

Now that we’ve created our forest model, let’s train it so it can learn from the data points that we have in the file.

#### Blockly

**Step 1 - Prepare to train the model**

Now we want to train our random forest so it can learn patterns from the data. Let’s train the model using training data! To do this, the fit() method will help us.

From the Variables menu, drag the DO block for the **forest** variable, and select the fit function as the do operation. This specifies the function we want to call, which is the fit method of the forest object.

**Step 2 - Have the training features ready**

The next step for training the model is to select the features to train the classifier. In this step, we select the features and pass them as a dataframe parameter. In this case, the model will train (learn) the classifier based on these 6 variables and use them to predict the label.

From the Lists menu, drag a dictVariable and select the "data" variable from the list of available variables. Also from the Lists menu, get a Create List block. Using the Gear icon, add items until there are 6. For each item, add a Text (a Quote “” from the Text menu), as follows: "**Methane**", "**NOxEmissions**", "**PM2.5Emissions**", "**VOCEmissions**", "**SO2Emissions**", and "**CO2Emissions**". These are the feature names applied to train (fit) the model.


**Step 3 - Have the training label ready**

So what is the label that we are trying to predict? Next, we need to add the data labels for the selected features. We add the data labels (the **Contaminated** column) as a parameter in the fit() method.

From the Lists menu, drag a dictVariable, and select the **“data”** variable from the list of available variables. From the Text menu, get a Quote “” block and add a Text "**Contaminated**". This is the target value applied to train (fit) the model.


**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the score method from the **forest** object. Just as in the previous step, we will replace the fit() method with the **score**() method. Based on the ‘fit’, we will see how much of our training dataset the model can predict correctly.

This will give us the forest model correctness score. A good score will be close to 1 (i.e., 100% accurate). A medium score might be more like .95 (95% accurate). Not great would be .90 (90%). It depends on the topic you are looking at.

Right-click on the "**forest.fit**" block and select "Duplicate" from the context menu. This creates a copy of the block. Within the duplicated block, click on the method dropdown menu and select "**score**" from the list of available methods. The score method will work similarly to fit, and will use the training features and label to measure how much of the training data was learned.

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!  


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoYGIDWcAAfc9b?format=jpg&name=small)
</details>


In [None]:
# blockly code


#### Freehand

**Step 1 - Prepare to train the model**

Now we want to train our random forest so it can learn patterns from the data. Let’s train the model using training data! To do this, the fit() method will help us.

To train the classifier model, we take the model variable and call the **fit**() method on it.

`forest.fit()`

**Step 2 - Have the training features ready**

The next step for training the model is to select the features to train the classifier. In this step, we select the features and pass them as a dataframe parameter. In this case, the model will train (learn) the classifier based on these 6 variables and use them to predict the label.

`forest.fit(data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']])`

**Step 3 -  Have the training label ready**

So what is the label that we are trying to predict? Next, we need to add the data labels for the selected features. We add the data labels (**Contaminated**) as a parameter in the fit() method.

`forest.fit(data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']],data['Contaminated'])`




**Step 4 - Measure the correctness on the training dataset**

To measure the correctness of the model, we will use the score method from the forest object. Just as in the previous step, we will replace the fit() method with the **score**() method. Based on the ‘fit’, we will see how much of our training dataset the model can predict correctly.
This will give us the forest model correctness score. A good score will be close to 1 (i.e., 100% accurate). A medium score might be more like .95 (95% accurate). Not great would be .90 (90%). It depends on the topic you are looking at.

`forest.score(data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']],data['Contaminated'])`


**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoYJQBXQAAjGbr?format=jpg&name=small)
</details>

**Your Turn**: Now, it's time for you to apply the steps outlined above to train and score the classifier model using your own data.

`forest.fit(data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']],data['Contaminated'])`

`forest.score(data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']],data['Contaminated'])`


In [None]:
# freehand code


**Explanation**: *You have trained a random forest model to predict whether something is "Contaminated" based on six types of emissions: Methane, NOx, PM2.5, VOC, SO2, and CO2. The first line fits (or trains) the model using the training data, where the emissions are the input features and "Contaminated" is the output label. The second line calculates the model's accuracy on the same training data, showing how well the trained model can predict the contamination status using those emissions*.
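To try `fit` and `score` end to end without the real file, the sketch below builds a synthetic stand-in dataframe with the same column names. The numbers come from scikit-learn's `make_classification`, not from the air quality data, and `random_state=0` is an added assumption so the run is repeatable.

```python
import pandas as pd
import sklearn.ensemble as ensemble
from sklearn.datasets import make_classification

features = ["Methane", "NOxEmissions", "PM2.5Emissions",
            "VOCEmissions", "SO2Emissions", "CO2Emissions"]

# Synthetic stand-in for the real dataframe (values are NOT real measurements).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
data = pd.DataFrame(X, columns=features)
data["Contaminated"] = y

forest = ensemble.RandomForestClassifier(n_estimators=100, oob_score=True,
                                         random_state=0)

# Train on the six feature columns, with "Contaminated" as the label.
forest.fit(data[features], data["Contaminated"])

# Accuracy on the same rows the model was trained on (between 0 and 1).
train_score = forest.score(data[features], data["Contaminated"])
print(train_score)
```

Because this score uses the very rows the forest learned from, it tends to look optimistic. That is why Goal 6 checks the out-of-bag score next.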

### Goal 6: Assessing the performance of the training.

Now that we have trained and scored the classifier model, let’s see how well we did in our training. To do this, let’s calculate the out-of-bag (OOB) score. Let’s jump in!

#### Blockly

**Step 1 - Get the OOB from the forest variable**

From the **forest** object, get the attribute **oob_score_**. The out-of-bag (OOB) score measures how well the model predicts data it hasn't seen during training. It is calculated by predicting each data point using only the trees that didn't include that point in their bootstrap sample, then checking how often those predictions are correct.

**Step 2 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoYc_nW8AAiUJw?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand

**Step 1 - Get the OOB from the forest variable**

From the **forest** object, get the attribute **oob_score_**. The out-of-bag (OOB) score measures how well the model predicts data it hasn't seen during training. It is calculated by predicting each data point using only the trees that didn't include that point in their bootstrap sample, then checking how often those predictions are correct.

`forest.oob_score_`


**Step 2 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoYjQdW4AA3RzT?format=png&name=small)
</details>





**Your Turn**: Now, it's your turn to evaluate how well your classifier handles unseen data using the OOB score.

In [None]:
# freehand code


**Explanation**: *The forest.oob_score_ attribute of the Random Forest model provides an estimate of the model's performance on unseen data, calculated using out-of-bag samples. During the training of each decision tree within the forest, some data points are left out. These "out-of-bag" samples are then used to predict their corresponding target values, and the average prediction performance across all trees yields the OOB score. This metric offers a convenient, built-in way to assess the model's generalization ability without requiring a separate validation dataset*.
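The sketch below compares the training score with the OOB score on a synthetic stand-in dataset. The data is generated by scikit-learn's `make_classification` and is not the air quality file; `random_state=0` is an assumption added for repeatability.

```python
import pandas as pd
import sklearn.ensemble as ensemble
from sklearn.datasets import make_classification

features = ["Methane", "NOxEmissions", "PM2.5Emissions",
            "VOCEmissions", "SO2Emissions", "CO2Emissions"]

# Synthetic stand-in data (values are NOT real measurements).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
data = pd.DataFrame(X, columns=features)
data["Contaminated"] = y

# oob_score=True is required, otherwise oob_score_ is not computed.
forest = ensemble.RandomForestClassifier(n_estimators=100, oob_score=True,
                                         random_state=0)
forest.fit(data[features], data["Contaminated"])

train_score = forest.score(data[features], data["Contaminated"])
oob = forest.oob_score_  # accuracy on rows each tree did not see in training

print("training score:", train_score)
print("OOB score:     ", oob)
```

The OOB score is usually a little lower than the training score, and it is the more honest estimate of how the forest will do on new data.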

### Goal 7: Getting the most important features to predict the label

So far, we have trained the model and made our prediction. So, what are the most important features that will predict our final label? Let’s dive in for feature importance.

#### Blockly

**Step 1 - Get the feature importance from the forest variable**

From the forest object, get the attribute **feature_importances_**. Feature importance in Random Forest quantifies the relative contribution of each input variable to the model's predictive accuracy. In other words, how important each one is for predicting the variable we want to look at.

**Step 2 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoY3qBWsAAboww?format=png&name=small)
</details>



In [None]:
#blockly code


#### Freehand


**Step 1 - Get the feature importance from the forest variable**

From the forest object, get the attribute **feature_importances_**. Feature importance in Random Forest quantifies the relative contribution of each input variable to the model's predictive accuracy. In other words, how important each one is for predicting the variable we want to look at.

`forest.feature_importances_`

**Step 2 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoY5vSW8AA8kdm?format=png&name=small)
</details>

**Your Turn**: Take the next step and explore which features play the most significant role in predicting your label.

**Explanation**: *The random forest’s feature importance provides a numerical representation of the relative importance of each feature in the dataset. It quantifies how much each feature contributes to the predictions made by the model, based on how frequently and effectively it was used to reduce impurity across all the decision trees within the forest. Higher values indicate greater importance. This allows users to understand which features have the most significant impact on the model's output, aiding in feature selection and gaining insights into the underlying relationships within the data*.
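On its own, `forest.feature_importances_` is just a list of numbers in the same order as the feature columns. The sketch below pairs each number with its column name and sorts them. It uses synthetic stand-in data from `make_classification`, so the ranking it prints says nothing about the real air quality features.

```python
import pandas as pd
import sklearn.ensemble as ensemble
from sklearn.datasets import make_classification

features = ["Methane", "NOxEmissions", "PM2.5Emissions",
            "VOCEmissions", "SO2Emissions", "CO2Emissions"]

# Synthetic stand-in data (values are NOT real measurements).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
data = pd.DataFrame(X, columns=features)
data["Contaminated"] = y

forest = ensemble.RandomForestClassifier(n_estimators=100, oob_score=True,
                                         random_state=0)
forest.fit(data[features], data["Contaminated"])

# One importance value per feature, in the same order as the columns.
importances = forest.feature_importances_

# Pair each value with its name and sort from most to least important.
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```

The values add up to 1, so each one can be read as that feature's share of the total importance.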

In [None]:
# freehand code


### Goal 8: Import the Plotly.Express Library.

pandas is a great library, but to make charts we need a different one. Let's bring in Plotly Express, which is a popular library for making visualizations.

#### Blockly

**Step 1 - Start the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.
Open the IMPORT menu, which is helpful for bringing in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Tell what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **plotly.express**, which will bring in some cool data visualization features.


**Step 3 - Rename the library so it’s easy to remember**

Once you are done, give the imported **package** a short nickname by storing it in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **px**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab594HW4AA7zAX?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand


**Step 1 - Start the import**

First, we need a “command” that tells the computer what to do. In this case, the command is “import”, which brings the add-on package in.

**Step 2 - Tell what library to import**

After the import command, we type the name of the library we want to import. A library is like an add-on that gives us more coding abilities. In our case, we will type out **plotly.express**, which will bring in some cool data visualization features.

**Step 3 - Rename the library so it’s easy to remember**

Once you are done, give the imported package a short name by storing it in a single variable. This shortcut cuts down on typing later on. Feel free to use whatever name will help you remember it. In the example below, we’ve called it px.


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gab6AduXQAIEi5X?format=png&name=small)
</details>


**Your Turn**: Now it’s your turn! We’re going to dive into the plotly.express package, which helps us build charts from our data. First, let’s import the package and assign it to the variable “px” to make it easier to use throughout our notebook.

`import plotly.express as px`

In [None]:
# freehand code

**Explanation**: *Congrats, you did it! You have successfully imported the plotly.express package under the name px.*

### Goal 9: Dropping the label column

Right now we have 7 columns - our 6 prediction features (variables) plus our label. Because we are mostly interested in the 6 prediction variables, let’s get rid of the label column so that our chart is easier to read later on.

#### Blockly

**Step 1 - Bring in the drop method from the data variable**

From the **data** variable, get a DO block and select the **drop** method. This sets up dropping the label, leaving only the features in the dataset.


**Step 2 - Tell what columns we want to drop**

Now that we have the drop method ready to go, let’s tell it what we want to get rid of. In this case, we’ll get rid of the ‘Contaminated’ column.

As the first parameter of the drop method, use a Quote block with the name of the column to drop: “Contaminated” (the label). As the second parameter, use a freehand block with axis=1, which means remove a column (not a row).

**Step 3 - Assign the updated dataset to the variable**

Originally, we started with 7 columns (6 prediction features, one label). Now let’s make sure that our **data** variable holds only the 6 prediction features.

Set the **data** variable to the updated dataset (without the label, only with the features).

**Step 4 - Print the updated dataset**

Drag the **data** variable to the canvas so it prints on the screen.

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoZEetWIAAOLOj?format=png&name=small)
</details>




In [None]:
# blockly code


#### Freehand

**Step 1 - Bring in the drop method from the data variable**

Starting from the **data** variable, we call the **drop** method. This sets up dropping the label, leaving only the features in the dataset.

`data.drop()`

**Step 2 - Tell what columns we want to drop**

Now that we have the drop method ready to go, let’s tell it what we want to get rid of. In this case, we’ll get rid of the **‘Contaminated’** column.

As the first parameter of the **drop** method, pass the name of the column to drop (the label) as a quoted string: **“Contaminated”**. As the second parameter, pass **axis=1**, which means remove a column (not a row).

`data.drop('Contaminated',axis=1)`

**Step 3 - Assign the updated dataset to the variable**

Originally, we started with 7 columns (6 prediction features, one label). Now let’s make sure that our data variable holds only the 6 prediction features.

Set the **data** variable to the updated dataset (without the label, only with the features).

`data = data.drop('Contaminated',axis=1)`

**Step 4 - Print the updated dataset**

Type the data variable on the last line of the cell so the notebook prints it on the screen.

`data`

**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoZGrFXkAAc1kn?format=png&name=small)
</details>

**Your Turn**: Now that you've seen how to remove a label from a dataset using the drop() method, it's time to try it yourself! Once you've executed the steps, take a moment to analyze the dataset—do you now have only the predictive features left?

In [None]:
# freehand code


**Explanation**: *You removed the column named 'Contaminated' from the Pandas DataFrame data. The axis=1 argument specifies that we are dropping a column, not a row. After this operation, the DataFrame data no longer includes the 'Contaminated' column, so the label is gone and only the prediction features remain. The final data line then displays the modified DataFrame, showing all remaining columns and rows.*
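
One detail worth seeing in action: `drop` returns a new DataFrame and leaves the original alone unless you assign the result back. The sketch below uses a tiny made-up table; the `pH` and `Turbidity` columns and all the values are invented for illustration, and only the `Contaminated` name comes from the lesson.

```python
import pandas as pd

# Tiny made-up table: two hypothetical features plus the label.
data = pd.DataFrame({
    "pH": [7.1, 6.4, 8.0],
    "Turbidity": [1.2, 5.8, 0.9],
    "Contaminated": [0, 1, 0],
})

# drop returns a new DataFrame without the label column.
features = data.drop("Contaminated", axis=1)
print(list(features.columns))  # ['pH', 'Turbidity']

# The original is unchanged until we assign the result back to it.
print(list(data.columns))  # ['pH', 'Turbidity', 'Contaminated']

# Dropping a column that is already gone raises a KeyError,
# unless we ask pandas to skip missing names with errors="ignore".
features_again = features.drop("Contaminated", axis=1, errors="ignore")
print(list(features_again.columns))  # ['pH', 'Turbidity']
```

This is also why running the `data = data.drop('Contaminated',axis=1)` cell a second time produces an error: the column was already removed the first time.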

### Goal 10: Showing the most important features in a bar chart.  

We’ve run the numbers to see which variable is the most important for our prediction. Now let’s graph them to visualize it!

#### Blockly


**Step 1 - Call the bar chart function from plotly**

To make a bar chart, we first need to call the bar chart function with our plotly library (px).

From the Variable menu, get a DO block for the **px** variable and select the **bar** operation. From this same menu, get a variable.

**Step 2 - Tell Plotly what columns to put on the x-axis for the bar chart**

For the first parameter of the bar function, use a freestyle block to name the first axis, **x=**. Then, from the **data** object, get the **columns** attribute and connect it to x=. This sets the x-axis to the feature names (the names of the columns in the dataset).


**Step 3 - Tell Plotly what columns to put on the y-axis for the bar chart**

For the second parameter of the bar function, use a freestyle block to name the second axis, **y=**. Then, from the **forest** object, get the **feature_importances_** attribute and connect it to y=. This sets the y-axis to the feature importance scores.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoZPZzWIAAHZKu?format=jpg&name=small)
</details>

In [None]:
# blockly code


#### Freehand


**Step 1 - Call the bar chart function from plotly**

To make a bar chart, we first need to call the bar chart function with our plotly library (px).

`px.bar()`

**Step 2 - Tell Plotly what columns to put on the x-axis for the bar chart**

For the first parameter of the bar function, type the first axis name, **x=**. Then get the **columns** attribute from the **data** object and assign it to x=. This sets the x-axis to the feature names (the names of the columns in the dataset).

`px.bar(x= (data.columns) )`



**Step 3 - Tell Plotly what columns to put on the y-axis for the bar chart**

For the second parameter of the bar function, type the second axis name, **y=**. Then get the **feature_importances_** attribute from the **forest** object and assign it to y=. This sets the y-axis to the feature importance scores.

`px.bar(x= (data.columns),y= (forest.feature_importances_))`

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjoZRKSW8AAhriN?format=jpg&name=small)
</details>

**Your Turn**: Now that we’ve identified the key features, let’s bring them to life with a bar chart. Using Plotly, create a graph to visualize their importance.

`px.bar(x= (data.columns),y= (forest.feature_importances_))`


In [None]:
# freehand code


**Explanation**: *You generated an interactive bar chart to visualize feature importances. It uses Plotly Express to create a bar for each feature in the data DataFrame, with the x-axis representing the feature names and the y-axis representing the corresponding importance scores from the trained forest model. This visualization allows for a quick and clear understanding of which features contribute most significantly to the model's predictions*.

## WHAT DID YOU LEARN?

In this lesson, you learned about Random Forests, including how this ‘ensemble’ (mix of other models) learning method combines multiple decision trees to make our predictions more accurate.  You explored bootstrapping and bagging, understanding how these tools contribute to the diversity of the training datasets used in Random Forests.

## WHAT’S NEXT?:

[Crossvalidation](Crossvalidation.ipynb)

## ANY EXTRAS?

* **Random Forests** - [Data Science Explained](https://godatadrive.com/blog/random-forests)
  * ***Description***: In this blog, dive into random forests and how they are different from your standard decision trees.

* **Art Connection**: Visualizing Decision Trees - [GeeksforGeeks](https://www.geeksforgeeks.org/difference-between-random-forest-and-decision-tree/)
  * ***Description***: This resource discusses how decision trees can be visualized, making them an excellent tool for artists and designers interested in data visualization. Understanding how to represent complex data structures visually can enhance storytelling through data, bridging the gap between art and data science.

* **Math Connection**: Decision Trees and Random Forests with scikit-learn - [Free Video Tutorial](https://www.udemy.com/course/decision-trees-and-random-forests-with-scikit-learn/)
  * ***Description***: This free video tutorial covers the implementation of decision trees and Random Forests using the scikit-learn library.

* **Computer Science Connection**: Random Forest Algorithm in Machine Learning - [GeeksforGeeks](https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/)
  * ***Description***: This article offers a comprehensive overview of the Random Forest algorithm, detailing its implementation and advantages in computer science. It serves as a valuable resource for students looking to understand how these algorithms are applied in software development and data analysis.

* **Career Connection**: Exploring Decision Trees & Random Forests in ML – [365 Data Science](https://365datascience.com/tutorials/machine-learning-tutorials/decision-trees-random-forests/)
  * ***Description***: This resource highlights the relevance of decision trees and Random Forests in various career paths, including data science, machine learning engineering, and analytics. It emphasizes the skills needed to work with these algorithms and their importance in making data-driven decisions in business environments.
