# Crossvalidation

## ANTICIPATED TIME:


2 hours

## BEFORE YOU BEGIN:  

[Random forests](Random_Forests.ipynb)

## WHAT YOU WILL LEARN:

* How does cross-validation help avoid overfitting?
* What does it mean to split data for training and testing, and why does it matter?
* How can cross-validation help us test our model in a fair way?
* What are hyperparameters, and how do they change how our model learns?
* How does grid search help us find the best settings for our model instead of just guessing?
* What metrics are used to evaluate models during cross-validation?

## DEFINITIONS YOU’LL NEED TO KNOW:


* Training Data - the part of data used to teach a machine learning model to recognize patterns.
* Testing Data - the other part of data used to evaluate a trained model’s accuracy by comparing predictions to actual values.
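* Crossvalidation - a technique used to evaluate the performance of a machine learning model by splitting the dataset into multiple subsets, training the model on some of the subsets, and testing it on the remaining one, repeating until each subset has been used for testing once.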
* Folds - groups used for testing and training in crossvalidation.
* Hyperparameters - parameters we define when we create our model.
* Grid Search - a method that helps us choose the best hyperparameters by making a list and testing each value separately.
* Out-of-bag error (OOB) - a way to measure how well a random forest can predict results, using the records each tree did not see during training.

## SCENARIO

The team has come so far in solving the pollution problem in their city! They have collected a lot of data, such as the number of cars in the city and different air quality levels. However, to make sure the data they collected was accurate, Rin suggested they use a new method called cross-validation. Ever since they looked at random forests, they wanted to look at other approaches that would lessen the randomness. While random forest helped with the randomness when looking at categorical data, cross-validation might help with the numerical data, like CO2 emissions and the number of fuel gallons used. Maybe with this method, they can find which factor is causing the most pollution in their city!

## WHAT DO I NEED TO KNOW?

Some important things we’ve talked about when it comes to data science…

* Selecting relevant variables? Check!
* Choosing the right algorithm? Yep!
* Building our model! You know it!

Another thing we need to do is split the data into training and testing sets. Training is when the model finds patterns in one section of our data (usually about 80% of it). Once that is done, we test the model on the rest of the dataset (e.g., the remaining 20% of the data points) to make sure it has actually ‘learned’ well enough to make accurate predictions on future data. This lets us compare the actual outcome values with the values the trained model predicts, which shows how well the model performs (e.g., it predicted correctly in 90% of the cases).
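Here is a minimal sketch of an 80/20 split in Python. It assumes scikit-learn is installed and uses made-up practice data from `make_classification` instead of the team's real file, so the accuracy it prints is only an illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up practice data: 100 rows, 6 numeric features, one yes/no label
X, y = make_classification(n_samples=100, n_features=6, random_state=0)

# Keep 80% of the rows for training and hold back 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train on the training rows only
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Compare the predicted labels with the actual labels of the testing rows
test_predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, test_predictions)

print(len(X_train), "training rows and", len(X_test), "testing rows")
print("Accuracy on the testing data:", accuracy)
```

The model never sees the 20 testing rows while it is learning, which is what makes the accuracy a fair check.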

**Training and Splitting - Not so Fast!**

Training and testing are necessary because we want our model to make good predictions.

Everything is all perfect, right? Well, we need to be careful because we can get "lucky" with our train/test split, and randomly get training data that is "hard" and test data that is "easy."

How can we randomly pick the wrong part of our dataset during testing and training? Imagine we had 100 students in our summer program that we want to look at data for. We trained our model (KNN, decision tree, and others we’ve learned about) on a random set of 80 students who signed up. But maybe there was something weird with the remaining 20 that were picked, which messes with our ‘randomness’ that we need for doing intense training and testing.

Let’s think about why this matters. If we had split our training and testing on a different set of students (1-20 or 21-40 or 41-60 or 61-80), our results might be totally different. All of a sudden, now we can’t quite trust our training and testing because the ‘chunk’ we looked at during training might be off.  
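To see how much the "luck of the split" can matter, here is a small sketch that repeats the same 80/20 split five times, each time with a different random shuffle. The data is made up with `make_classification`, and the seeds 0 to 4 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up practice data standing in for our 100 students
X, y = make_classification(n_samples=100, n_features=6, random_state=0)

scores = []
for seed in range(5):
    # A different seed puts a different group of 20 rows in the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    model.fit(X_train, y_train)
    # score() returns the accuracy on the rows we held back
    scores.append(model.score(X_test, y_test))

print("Accuracy for each split:", scores)
```

The five numbers will usually not be identical, even though the model and the data never changed. Only the choice of which rows landed in the test set did.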


**Crossvalidation to the Rescue for Our Train/Test Problem!**

Crossvalidation is a great way to use all of the data in a dataset for both training and testing… but not at the same time. The idea is to split the data into equal pieces called folds. One fold is used for testing, and the rest are used for training. Once we’ve ‘chunked’ our data into folds, each fold takes a turn as the testing set. We build as many models as there are folds, using a different fold for testing each time. The performance metrics (accuracy, R²) are then combined, usually by averaging, across all the different models.

![](https://pbs.twimg.com/media/GrVZP1yXMAMryQt?format=jpg&name=small)
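The fold idea can be sketched with scikit-learn's `KFold` helper. This example uses ten made-up record numbers and 5 folds (both are assumptions chosen to keep the printout short) and simply shows which records are used for training and testing in each round.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up record numbers: 0 through 9
records = np.arange(10)

# Split the records into 5 folds of 2 records each
folds = KFold(n_splits=5)

tested = []
for train_idx, test_idx in folds.split(records):
    print("train on", train_idx, "and test on", test_idx)
    tested.extend(test_idx.tolist())

# Every record should appear in a test fold exactly once
print("Records that were tested:", sorted(tested))
```

Notice that each record is used for testing in exactly one round and for training in all the others.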

**Hyperparameters - Putting You In the Driver’s Seat for Data Science!**

Now, let’s talk **hyperparameters**.

Some parts of a model, called parameters, are learned automatically from the data, so we don’t have to think about them too much because the model takes care of them. Hyperparameters are different: they are settings we choose or customize ourselves when we create a model. We’ve run into this before, like when we picked the number of clusters in clustering or the number of neighbors in KNN classification. Just like with those topics, the tricky part is choosing the best value!

If we don’t use crossvalidation, we might pick a hyperparameter that works well on some data but doesn’t work well with others. For example, what about….
….number of clusters to consider in clustering?
….or the K in KNN classification?
….or levels in our decision or regression trees?
….how many trees to select in random forests?
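As one possible illustration, the sketch below uses crossvalidation to compare a few values of K in KNN. The data is made up, and the candidate values 1, 5, and 15 as well as the 5 folds are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up practice data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

average_scores = {}
for k in [1, 5, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    # cross_val_score returns one accuracy per fold
    fold_scores = cross_val_score(knn, X, y, cv=5)
    average_scores[k] = fold_scores.mean()
    print("K =", k, "average accuracy:", average_scores[k])
```

Because each K is judged on every fold, a value that only looks good on one lucky chunk of data is less likely to win.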


**Grid Search**

Sometimes we need to choose and customize multiple hyperparameters for our model. For example, in decision trees, we might select hyperparameters like the number of levels in our tree (1, 2, 3?) and the number of leaf nodes (6, 7, 8?). So… many… hyperparameter… combinations…!
How can we find the best combination without randomly guessing? We can run a grid search, which tries the combinations on our list and reports the one that performs best, such as a decision tree with 3 levels and 6 leaf nodes. For a different dataset, it might tell us that 2 levels and 7 leaf nodes work best. That’s why grid search is helpful: it tells us which value to use for each hyperparameter instead of leaving us to guess.

![](https://pbs.twimg.com/media/GrVZyxTWoAAuR5r?format=jpg&name=small)
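One tool that can run a grid search is scikit-learn's `GridSearchCV`. The sketch below is only an illustration on made-up data: it tries the levels (1, 2, 3) and leaf node counts (6, 7, 8) from the example above, using 5 folds, which is an assumption made here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Made-up practice data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# The "grid": every value of one list is paired with every value of the other
param_grid = {
    "max_depth": [1, 2, 3],
    "max_leaf_nodes": [6, 7, 8],
}

# Each of the 3 x 3 = 9 combinations is scored with 5-fold crossvalidation
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("Best combination:", search.best_params_)
print("Best average accuracy:", search.best_score_)
```

The winning combination depends on the data, so a different dataset may produce a different answer.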


## YOUR TURN:

### Goal 1: Importing the pandas library.

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it like an ‘add-on’. In this case, we bring in pandas, which is a popular library for doing data science stuff.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.


**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the import and package together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **pd**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZYmOOpW8AAf7uc?format=png&name=small)
</details>

In [1]:
# blockly code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **pd**.

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZmkVCYWEA4oGso?format=jpg&name=small)
</details>

**Your Turn**: Now it’s your turn! We’re going to dive into the pandas package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable pd to make it easier to use throughout our notebook.

In [None]:
# freehand code


**Explanation**: *Congrats! You did it! You have now successfully imported the "pandas" package as the variable "pd"*.
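If you want to check that the import worked, here is a tiny sketch. The table and its values are made up just for this check and are not the team's data.

```python
import pandas as pd

# A tiny made-up table, just to confirm that pd is ready to use
small = pd.DataFrame({"Cars": [120, 95], "CO2Emissions": [410.5, 388.2]})

print(small)
print("Rows and columns:", small.shape)
```

If the table prints without an error, the short name `pd` is working.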

### Goal 2: Bringing in the dataframe.

Let’s bring in the data that we want to look at.

#### Blockly

**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable called **data**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

In Blockly, bring in the VARIABLES menu.

**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

In Blockly, go to the Variables and drag the Set block for the **data** variable. This will allow us to assign the result of a function call to the variable. A function is basically code that does a specific task for us.

**Step 3 - Bring in the data**

Now we need to look at the file that has all our data. To load our dataframe, we’ll use a simple command to bring in the file we need (a CSV, or Comma-Separated Values, file). Let’s say we have a file called ‘AirQualityClass.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **data**.

From the Variables menu, drag a DO block for the **pd** variable and select the **read_csv** operation. The read_csv function reads a CSV file and returns a DataFrame object.

In our case, let’s bring in the “datasets/AirQualityClass.csv" (use the Quotes from the TEXT menu) because that is what the team is working with.

**Step 4 - Display the variable**

Let’s see it now by ‘displaying’ and showing our work.

Drag the **data** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, allowing us to see the variable in the Blockly workspace.

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuKgNgXEAA3dDc?format=jpg&name=small)
</details>




In [None]:
# blockly code

#### Freehand


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us do cool things, let’s bring the data into a variable called **data**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

`pd.read_csv("datasets/AirQualityClass.csv")`

**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

**Step 3 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need (a CSV, or Comma-Separated Values, file). Let’s say we have a file called ‘AirQualityClass.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **data**. To do this, we write the function as **pd.read_csv**, which makes the code read the CSV file. This variable is now our dataframe!

In our case, let’s bring in “datasets/AirQualityClass.csv” (typed inside quotation marks) because that is what Rin is working with.

`data = pd.read_csv("datasets/AirQualityClass.csv")`

**Step 4 - Print the variable**

Let’s see it now by ‘printing’ and showing our work.

`data`

**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GiuKh7zWkAA4z89?format=png&name=small)
</details>

**Your Turn**: Let’s dive in and start working with the data! We’ll begin by loading it into a dataframe, which will allow us to interact with and analyze the dataset easily.

In [None]:
# freehand code


**Explanation**:  *The dataset could be used to train a classification model (KNN) that predicts whether conditions will result in contamination based on various emission levels. Alternatively, if all entries are contaminated, the data could support regression models to predict specific emission levels under contaminated conditions or clustering methods to identify patterns in emission profiles among contaminated sites*.
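If you don't have the file handy, here is a sketch of what `pd.read_csv` does, using two made-up rows typed directly into the code. The column names come from this lesson, but the numbers and the 0/1 coding of `Contaminated` are assumptions for illustration only.

```python
import io

import pandas as pd

# Two made-up rows in CSV format (the real file has many more rows)
csv_text = """Methane,NOxEmissions,PM2.5Emissions,VOCEmissions,SO2Emissions,CO2Emissions,Contaminated
1.2,30.5,12.1,4.4,2.0,410.0,1
0.8,22.3,9.7,3.1,1.5,395.0,0
"""

# read_csv turns the comma-separated text into a dataframe
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print("Column names:", sample.columns.tolist())
```

The first line of the CSV text becomes the column names, and every line after it becomes one row of the dataframe.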

### Goal 3: Bring in the SkLearn package to create the classifier:

Remember when we brought in other packages for the extra add-ons? Now let’s bring in a predictor (classifier) to help us find out the categorical variable we want to predict. We specifically want to use SKLearn here to help with decision trees as part of our prediction.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can help bring in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.tree**, which will bring in the tools for building decision trees.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **tree**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GctrVUWWcAAZjt3?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.tree**, which will provide us with the tools for building decision trees.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **‘import’** and **‘package’** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **tree**.

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GctrXuFXMAAueAW?format=png&name=small)
</details>

**Your Turn**: Now it’s your turn! We’re going to dive into the sklearn.tree package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “tree” to make it easier to use throughout our notebook.


In [None]:
# freehand code



**Explanation**: *Congrats! You did it! You have now successfully imported the sklearn.tree package as the variable tree*.

### Goal 4: Importing SKLearn and model_selection.

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it like an ‘add-on’. In this case, we bring in model_selection, a part of the popular machine learning library SKLearn that will help us train and test our model. This will help us learn from our data and understand it later.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.model_selection**, which will bring in tools for splitting data and running cross-validation.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on.  Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **model_selection**, and we type it in the open area.

**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gnj02MLWoAE8wxx?format=png&name=small)
</details>

In [None]:
# blockly code


#### Freehand


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type **sklearn.model_selection**, which will bring in tools for splitting data and running cross-validation.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **model_selection**.

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gnj1FAnXsAAA5dt?format=jpg&name=small)
</details>

**Your Turn**: Now it’s your turn! We’re going to dive into the sklearn.model_selection package, which helps us train and test our models. First, let’s import the package and assign it to the variable model_selection to make it easier to use throughout our notebook.

In [None]:
# freehand code


**Explanation**: *Congrats! You did it! You have now successfully imported the model selection sublibrary from the sklearn package and named it model_selection*.
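As a quick check, the sketch below imports the sublibrary under the same name and confirms that a few of its tools are available. The three tool names listed are real scikit-learn functions and classes, picked here only as examples.

```python
import sklearn.model_selection as model_selection

# A few tools that live inside model_selection
tool_names = ["train_test_split", "cross_val_predict", "GridSearchCV"]

available = {}
for name in tool_names:
    available[name] = hasattr(model_selection, name)
    print(name, "is available:", available[name])
```

We will use `cross_val_predict` from this sublibrary later in the lesson.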

### Goal 5: Create the model object.

Now that we’ve imported the library, let’s get started on creating our decision tree. As we get going, we’ll set up a classifier model, assign it to a variable, and define its hyperparameters for later use.


#### Blockly



**Step 1 - Write out the name of a variable you want to use for the classifier**

We set up a variable to store the classifier model for later use. In this case, we will call it **dtree**.

On the "Variables" menu, click Create Variable, type a name for our model, **dtree**. Then, drag a "SET" block to the workspace for the created variable. This block allows us to create a new variable and assign a value to it.

**Step 2 - Create the Decision Tree classifier model**

We’ve got the variable, so let’s create our tree. Easy peasy!

Using the **tree** library, we call the **DecisionTreeClassifier**() to create the Tree model.

From the Variable menu, drag a Create block for the tree variable. On the create list box select the option DecisionTreeClassifier. This specifies the type (class) of object we want to create, which is the DecisionTreeClassifier from the tree library.

**Step 3 - Define the hyperparameters**

As we saw above, a tree has different levels. But how many do we want to look at? In this case, we want a tree with a maximum depth of 2 levels.

Drag a Freestyle block, and type **max_depth=2**. This tells the maximum depth of the tree.

**Step 4 - Assign the decision tree to the variable you created**

We can now connect the dtree variable with the DecisionTreeClassifier model.

**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GcttrFQWYAAiRoO?format=jpg&name=small)
</details>



In [None]:
# blockly code


#### Freehand


**Step 1 - Create the Decision Tree classifier model**

Using the **tree** library, we call **DecisionTreeClassifier**() to create the tree model.

`tree.DecisionTreeClassifier()`


**Step 2 - Define the hyperparameters**

As we saw above, a tree has different levels. In this case, we want a tree with a maximum depth of 2 levels, so we type **max_depth=2** inside the parentheses.

`tree.DecisionTreeClassifier(max_depth=2)`

**Step 3 - Assign the decision tree to the variable you created**

We set up a variable to store the classifier model for later use. In this case, we will call it **dtree**.

`dtree = tree.DecisionTreeClassifier(max_depth=2)`

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GctttmiXgAETWWZ?format=jpg&name=small)
</details>

**Your Turn**: Ready! Set! Code!

In [None]:
# freehand code


**Explanation**: *You have created a decision tree classifier, which is a type of machine learning model used for classification tasks. By specifying max_depth=2, the tree is limited to two levels, meaning each prediction is based on at most two yes/no splits in a row. This helps keep the model simple and reduces the risk of overfitting (learning too much from the training data). The classifier will later be trained to classify data into categories based on input features*.
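To see what `max_depth=2` does in practice, here is a small sketch that trains the tree on made-up data and then asks the tree how deep it grew. The practice data is an assumption for illustration, since a tree has no depth until it has been trained on something.

```python
import sklearn.tree as tree
from sklearn.datasets import make_classification

# Made-up practice data: 200 rows, 6 numeric features, one yes/no label
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Same model object as in this goal
dtree = tree.DecisionTreeClassifier(max_depth=2)
dtree.fit(X, y)

# The trained tree can never be deeper than the limit we set
print("Depth of the trained tree:", dtree.get_depth())
print("Number of leaf nodes:", dtree.get_n_leaves())
```

With two levels of yes/no splits, the tree can end in at most four leaf nodes.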

### Goal 6: Train and predict using the cross validation process.

So our next step is to train the model to predict the label, in this particular case, whether something is contaminated. Let’s jump in!

#### Blockly




**Step 1 - Prepare to train and predict the model**

First we need to start by doing the training and prediction.

With the **model_selection** library, DO the **cross_val_predict** operation, which sets up the cross-validation process. This process takes a model (a tree, in our case) and trains it using the cross-validation technique on a single dataset. Finally, the label is predicted for every record in the dataset.

**Step 2 - Pass the decision tree model to the cross-validation operation**

Now we need to say that we are using the tree model as the baseline for the cross-validation we’ll be doing.

From the variables menu, connect the **dtree** variable as the first parameter of the cross_val_predict operation.


**Step 3 - Have the training features ready**

So, what are we going to use to make our predictions? These will be the variables (features) that we’ll use to predict our label.

From the Lists menu, get a dictVariable block for **data**. To select the desired columns (features), create a list inside the [] with the feature names (Quote blocks): 'Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions'.

**Step 4 - Have the training label ready**

So what are we trying to predict? In our case, we are looking to predict the variable **Contaminated**.

From the list menu, get a dictVariable as **data**. Select the desired column (label): [ “Contaminated” (Quote block) ]. This will define the target variable to train the model.  

**Step 5 - Define the number of folds**

Now we need to set the number of chunks (‘folds’) that we will use. In our case, let’s break the data down into 10 different ‘folds’.

Using the Freestyle block, set the number of folds to 10 with **cv=10**. The 10-fold cross-validation divides the dataset into 10 parts and trains and tests the model 10 times, with each part used once as the test set, so every record gets a prediction from a model that did not see it during training.


**Step 6 - Assigning the predicted labels to the predictions variable**

We’ve done all this work, so let’s make sure to store it in a separate variable we can reference later on. In this case, let’s call it **predictions**.

Set the **predictions** variable to the result of the cross_val_predict operation. That variable will hold the predictions for all records in the original dataset.

**Step 7 - Print the predictions**

From the Variables menu, drag the predictions variable to the canvas, which will print the predictions on the screen.


**Step 8 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjtjNeiXcAAqGrk?format=jpg&name=small)
</details>


In [None]:
# blockly code


#### Freehand


**Step 1 - Call the cross_val_predict function from the model_selection library**

First we need to start by doing the training and prediction.

`model_selection.cross_val_predict( )`



**Step 2 - Pass the tree model to the cross-validation operation**

Now we need to say that we are using the tree model as the baseline for the cross-validation we’ll be doing.

`model_selection.cross_val_predict(dtree )`

**Step 3 - Define the training features**

So what are we going to use to make our predictions? These will be the variables (features) that we’ll use to predict our label.

`model_selection.cross_val_predict(dtree,data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']]  )`

**Step 4 - Define the training label**

So what are we trying to predict? In our case, we are looking to predict the variable Contaminated.

`model_selection.cross_val_predict(dtree,data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']],data['Contaminated'])`

**Step 5 - Define the number of folds**

Now we need to set the number of chunks (‘folds’) that we will use. In our case, let’s break the data down into 10 different ‘folds’.

`model_selection.cross_val_predict(dtree,data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']],data['Contaminated'],cv=10)  `

**Step 6 - Assign the predicted labels to the predictions variable**

We’ve done all this work, so let’s make sure to store it in a separate variable we can reference later on. In this case, let’s call it **predictions**.

`predictions = model_selection.cross_val_predict(dtree,data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']],data['Contaminated'],cv=10) `

**Step 7 - Print the predictions**

Type the name of the predictions variable on its own line, which will print the predictions on the screen.

`predictions`

**Step 8 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjtjPmDXYAA5h29?format=jpg&name=large)
</details>




**Your Turn**: Put your skills into action by training the classifier model with cross-validation and generating its predictions. It's a great opportunity to understand the model's predictive power! Ready! Set! Go!



In [None]:
# freehand code


**Explanation**: *You have generated the predictions from a decision tree (dtree) using 10-fold cross-validation. It trains and predicts on different subsets of the data, avoiding the need for a separate validation set. Specifically, it uses the features 'Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', and 'CO2Emissions' to predict the 'Contaminated' target variable. The resulting predictions variable stores the predicted values for each data point, obtained from the folds where that point was held out during training. This provides a robust assessment of the model's performance*.
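Here is a self-contained sketch of the same call that runs without the team's file. It builds a practice dataframe from made-up numbers, reusing the lesson's column names, so the predictions themselves mean nothing. The point is only to show the shape of the result.

```python
import pandas as pd
import sklearn.model_selection as model_selection
import sklearn.tree as tree
from sklearn.datasets import make_classification

feature_names = ['Methane', 'NOxEmissions', 'PM2.5Emissions',
                 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']

# Made-up practice data with the same column names as the lesson
X, y = make_classification(n_samples=100, n_features=6, random_state=0)
practice = pd.DataFrame(X, columns=feature_names)
practice['Contaminated'] = y

dtree = tree.DecisionTreeClassifier(max_depth=2)

# Each record is predicted by a model trained on the other 9 folds
predictions = model_selection.cross_val_predict(
    dtree, practice[feature_names], practice['Contaminated'], cv=10
)

print(len(predictions), "predictions for", len(practice), "records")
```

There is exactly one prediction per record, which is what lets us compare the predictions with the actual labels in the next goals.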

### Goal 7: Bringing in SKLearn metrics to help look at the performance of predictions.

So we’ve tried to predict on our new dataset. How well did we do? Let’s use SKLearn Metrics to help us think through that.

#### Blockly



**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.metrics**, which is a tool that grades your machine learning model's performance, telling you how well it did on its test.


**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **metrics**, and we type it in the open area.


**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ5swrXUAAEJ-D?format=png&name=medium)
</details>

In [None]:
#blockly code


#### Freehand



**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is “import”, which brings the add-on package in.

**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.metrics**, which is a tool that grades your machine learning model's performance, telling you how well it did on its test.

**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **metrics**.

**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ simultaneously to run the data science magic!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaQ5u3ZWsAA_df1?format=png&name=medium)
</details>


**Your Turn**: Test it out yourself! We set up a command to tell our computer what to do, and after our hard work, we’ll run what we have to see our data science magic at work!

In [None]:
# freehand code


**Explanation**: *The metrics library provides tools to measure and evaluate the performance of machine learning models. By using `metrics`, we can check how well our model is working, like seeing how accurate it is or how well it groups data in clustering. This helps us understand if our model is doing a good job or if it needs improvement*.
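As a rough illustration of what the `metrics` name gives us once it is imported, the sketch below grades a tiny set of made-up labels with a confusion matrix. The lists `true_labels` and `predicted_labels` are invented for this example and are not part of the pollution dataset.

```python
# Bring in the grading tools under the short name "metrics"
import sklearn.metrics as metrics

# Made-up labels for illustration only (1 = contaminated, 0 = not contaminated)
true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 0, 1, 0]

# The confusion matrix counts right and wrong predictions for each class:
# rows are the true classes, columns are the predicted classes
matrix = metrics.confusion_matrix(true_labels, predicted_labels)
print(matrix)
```

Numbers on the diagonal are correct predictions; anything off the diagonal is a mistake the model made.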

## Assessing the performance of the classifier.



So, how well did our predictions do? Let’s look at three ways to measure the performance of predictions on the testing dataset: accuracy, the confusion matrix, and precision/recall.

### Goal 8: Assessing the performance of the predictions on test dataset using the accuracy score.

So how well did our predictions do on our test data? Let’s calculate the accuracy score to give us an idea.

#### Blockly

**Step 1 - Call the accuracy_score() method using the metrics library**

To calculate the accuracy of the model predictions, we will use the **accuracy_score**() function from the metrics library.  

From the Variables menu, drag a DO block for the metrics variable. Select the accuracy_score function from the metrics list of operations. This function takes two inputs: the true labels and the predicted labels.


**Step 2 - Calculate Tree model’s accuracy**

The accuracy_score() function takes 2 parameters, the true labels and the predicted labels, and compares them to measure the percentage of correct predictions. So let’s compare the **Contaminated** column from the dataset with the predictions from the model we just created.

From the Lists menu, get a dictVariable block and select the **data** variable. From the Text menu, get a Quote “” block to provide the label name **“Contaminated”**. This column will be used as the true labels for the accuracy calculation.

As the second parameter of **accuracy_score**, get the variable **predictions**. The accuracy score function will calculate the accuracy of the model by comparing the true labels with the predicted labels. The result will be a score that indicates the performance of the model.

**Step 3 - Connect the blocks to run the code**

Connect the blocks and run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjtjbUSXEAAzvr7?format=jpg&name=small)
</details>

In [None]:
# blockly code

#### Freehand


**Step 1 - Call the accuracy_score() method using the metrics library**

To calculate the accuracy of the model predictions, we will use the **accuracy_score**() method from the metrics library.  This accuracy score will measure the percentage of correct predictions.

`metrics.accuracy_score()`


**Step 2 - Calculate Tree model’s accuracy**

The accuracy_score() function takes 2 parameters, the true labels and the predicted labels, and compares them to measure the percentage of correct predictions. So let’s compare the **Contaminated** column from the **data** dataset with the predictions from the model we just created.

`metrics.accuracy_score(data['Contaminated'],predictions)`

**Step 3 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjtjdxYXgAAbBKP?format=jpg&name=small)
</details>



**Your Turn**:  Have a go at it! Once you start, you’ll be able to assess the performance of the predictions!

In [None]:
# freehand code


**Explanation**: *The accuracy score tells us the percentage of correct predictions out of the total. A higher accuracy score means the model is doing a good job matching the actual labels*.
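The idea of "correct predictions out of the total" can be worked through by hand. The sketch below uses made-up labels (not the pollution dataset) and plain Python, so every step of the arithmetic is visible.

```python
# Made-up labels for illustration only (1 = contaminated, 0 = not contaminated)
true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

# Count how many predictions match the true label
correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)

# Accuracy = correct predictions / total predictions
accuracy = correct / len(true_labels)
print(accuracy)  # 6 correct out of 8 -> 0.75
```

Calling `metrics.accuracy_score(true_labels, predicted_labels)` on the same two lists should give the same value.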

### Goal 9:  Create a grid search for the optimal hyperparameter value.

We know the tree can include a lot of different levels. But what is the ideal number of levels? A grid search is a handy way of working out what that right number is.

#### Blockly

**Step 1 - Create a GridSearchCV object from the model_selection library**

Let’s create the object that we’ll use for our gridsearch. In other words, we are going to create a grid that we’ll use to compare the different trees against each other.

With the **model_selection** library, create a **GridSearchCV** object. That will set up a grid of hyperparameters for a model; in our case, a decision tree with different depths: 2, 3, or 4.

**Step 2 - Tell the grid search to use decision trees for the comparison**

Now we need to tell our grid search what we want it to do. In this case, we want it to use a decision tree as the baseline model for the comparison.

From the variables menu, connect the **dtree** variable as the first parameter of the GridSearchCV constructor.

**Step 3 - Define the hyperparameters**

So how many different variables do we want to explore in our gridsearch? In other words, how many levels of our tree do we want to experiment with?

As the second parameter of the GridSearchCV constructor, using a freestyle block type **param_grid**. Connect to it a dict (from the List menu), associated with a freestyle block of **max_depth=**. Lastly, from the List menu, create a list with the values (numbers from the Math menu): 2, 3, and 4. That will define the possible tree depths.

**Step 4 - How many folds in the grid?**

Now we need to set up the number of chunks (‘folds’) that we will explore. In our case, let’s break this down into 10 different ‘folds’.

Using the freestyle block, set the number of folds to 10 with **cv=10**. The 10-fold cross-validation divides the dataset into 10 parts, trains and tests the model 10 times with each part used once as the test set, and averages the results for a more reliable performance estimate.

**Step 5 - Assigning the grid search to the gridSearch variable**

We’ve done all this work, so let’s make sure to store it in a separate variable we can reference later on. In this case, let’s assign the grid search object to the **gridSearch** variable.

Set the gridSearch variable to the object created with the grid search setup.

**Step 6 - Connect the blocks to run the code**

Connect the blocks and run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gjtjni3XEAALw2y?format=jpg&name=medium)
</details>


In [None]:
# blockly code


#### Freehand


**Step 1 - Create a GridSearchCV object from the model_selection library**

With the **model_selection** library, create a GridSearchCV object that will set up a grid of hyperparameters for a model; in our case, a decision tree with different depths: 2, 3, or 4.

`model_selection.GridSearchCV( )`

**Step 2 - Tell the grid search to use decision trees for the comparison**

Pass the **dtree** variable as the first parameter of the GridSearchCV constructor.

`model_selection.GridSearchCV(dtree )`

**Step 3 - Define the hyperparameters**

As the second parameter of the GridSearchCV constructor, type **param_grid=**. Give it a dict with **max_depth=** set to a list of the values 2, 3, and 4. That will define the possible tree depths.

`model_selection.GridSearchCV(dtree,param_grid= (dict(max_depth= [2, 3, 4])) )`

**Step 4 - How many folds in the grid?**

Set the number of folds to 10 with **cv=10**. The 10-fold cross-validation divides the dataset into 10 parts, trains and tests the model 10 times with each part used once as the test set, and averages the results for a more reliable performance estimate.

`model_selection.GridSearchCV(dtree,param_grid= (dict(max_depth= [2, 3, 4])),cv= 10)`
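To see what `cv=10` means in practice, here is a minimal hand-written sketch of the splitting idea. The dataset size `n_records` is made up, and scikit-learn's own splitting is more careful than this (for classifiers it also tries to keep the classes balanced across folds), so treat this as an illustration rather than what the library does internally.

```python
# A hand-rolled sketch of k-fold splitting (scikit-learn does this for us)
n_records = 20   # pretend dataset size, made up for this example
k = 10           # number of folds, matching cv=10

indices = list(range(n_records))
fold_size = n_records // k

# Track how many times each record ends up in a test set
times_tested = [0] * n_records

for fold in range(k):
    # This fold's records are held out for testing...
    test_idx = indices[fold * fold_size:(fold + 1) * fold_size]
    # ...and every other record is used for training
    train_idx = [i for i in indices if i not in test_idx]
    for i in test_idx:
        times_tested[i] += 1
    print(f"Fold {fold + 1}: test on {test_idx}, train on {len(train_idx)} records")
```

Every record is tested exactly once and used for training in the other nine rounds, which is why the averaged score is a fairer estimate than a single split.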

**Step 5 - Assigning the grid search to the gridSearch variable**

Set the **gridSearch** variable to the object created with the grid search setup.

`gridSearch = model_selection.GridSearchCV(dtree,param_grid= (dict(max_depth= [2, 3, 4])),cv= 10)`

**Step 6 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjtjpytW8AAw_Ub?format=jpg&name=small)
</details>

**Your Turn:** Give it a go! Let’s see what our results are once we display them!

In [None]:
# freehand code


**Explanation**: *You have created a tool for systematically optimizing a decision tree (dtree). GridSearchCV automates the process of trying out different combinations of hyperparameters, which are settings that control the model's behavior. The param_grid argument defines the specific hyperparameters and their possible values to be tested; here, max_depth values of 2, 3, and 4. Once it is given data to train on, the GridSearchCV object will train and evaluate the decision tree with each combination of hyperparameters, using 10-fold cross-validation to assess performance, and ultimately identify the best-performing set of settings*.
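To see the grid search pick a winner, it needs data to train on. The sketch below is one possible way to try that out: it uses synthetic data from `make_classification` as a stand-in for the pollution dataset, and it assumes `dtree` is a `DecisionTreeClassifier`. The names `X_demo` and `y_demo` are made up for this example.

```python
from sklearn import model_selection, tree
from sklearn.datasets import make_classification

# Synthetic stand-in data (NOT the pollution dataset): 200 rows, 6 features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Assumed definition of dtree; the notebook creates its own earlier on
dtree = tree.DecisionTreeClassifier(random_state=0)

gridSearch = model_selection.GridSearchCV(
    dtree, param_grid=dict(max_depth=[2, 3, 4]), cv=10
)

# fit() trains and scores a tree for each depth on each of the 10 folds
gridSearch.fit(X_demo, y_demo)

print(gridSearch.best_params_)  # the depth with the best average score
print(gridSearch.best_score_)   # that depth's average accuracy across the folds
```

The winning depth depends on the data, so the result on the pollution dataset may differ from what this synthetic example prints.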

### Goal 10: Perform a grid search using the cross validation technique.

So our next step is to train the model to predict the label, in this particular case, whether something is contaminated. Before, we used a single tree, but now we’ll use the multiple trees that we defined in our grid search. Let’s jump in!

#### Blockly

**Step 1 - Prepare to train and predict the model**

First, we need to start by doing the training and prediction.

With the **model_selection** library, do the **cross_val_predict** operation, which will set up the cross-validation process. This process takes a model, a tree in our case, and trains it using the cross-validation technique on a single dataset. Finally, the label will be predicted for all records in the dataset.

**Step 2 - Pass the grid search to the cross-validation operation**

Now we need to say that we are using the grid search as the baseline model for the cross-validation we’ll be doing.

From the variables menu, connect the **gridSearch** variable as the first parameter of the cross_val_predict operation.

**Step 3 - Have the training features ready**

So what are we going to base our predictions on? These will be the variables that we’ll use to predict our label.

From the list menu, get a dictVariable as **data**. To select the desired columns (features) inside the [], create a list with the feature names (Quote blocks): 'Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions'.

**Step 4 - Have the training label ready**

So what are we trying to predict? In our case, we are looking to predict the variable **Contaminated**.

From the list menu, get a dictVariable as **data**. Select the desired column (label): [ “Contaminated” (Quote block) ]. This will define the target variable to train the model.  

**Step 5 - Assigning the predicted labels to the predictions variable**

We’ve done all this work, so let’s make sure to store it in a separate variable we can reference later on. In this case, let’s call it **predictions**.

Set the predictions variable to the result of the cross_val_predict operation. That variable will hold the predictions for all records in the original dataset.  

**Step 6 - Print the predictions**

From the variable menu, drag the predictions variable to the canvas, which will print the predictions on the screen.


**Step 7 - Connect the blocks to run the code**

Connect the blocks and run the code!


<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GjtjzQnXUAAS7iT?format=jpg&name=small)
</details>

In [None]:
# blockly code


#### Freehand



**Step 1 - Call the cross_val_predict function from the model_selection library**

First, we need to start by doing the training and prediction.

With the **model_selection** library, call the **cross_val_predict** function, which will set up the cross-validation process. This process takes a model, a tree in our case, and trains it using the cross-validation technique on a single dataset. Finally, the label will be predicted for all records in the dataset.

`model_selection.cross_val_predict( )`

**Step 2 - Pass the grid search to the cross-validation operation**

Now we need to say that we are using the grid search as the baseline model for the cross-validation we’ll be doing.

Pass the **gridSearch** variable as the first parameter of the cross_val_predict operation.

`model_selection.cross_val_predict(gridSearch )`

**Step 3 - Define the training features**

So what are we going to base our predictions on? These will be the variables that we’ll use to predict our label.

As the second parameter, use the **data** variable and select the desired columns (features) inside [] with a list of the feature names: 'Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions'.

`model_selection.cross_val_predict(gridSearch,data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']] )`

**Step 4 - Define the training label**

So what are we trying to predict? In our case, we are looking to predict the variable **Contaminated**.

From the list menu, get a dictVariable as **data**. Select the desired column (label): [ “Contaminated” (Quote block) ]. This will define the target variable to train the model.

`model_selection.cross_val_predict(gridSearch,data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']],data['Contaminated'])`

**Step 5 - Assigning the predicted labels to the predictions variable**

We’ve done all this work, so let’s make sure to store it in a separate variable we can reference later on. In this case, let’s call it **predictions**.

Set the **predictions** variable to the result of the cross_val_predict operation. That variable will hold the predictions for all records in the original dataset.

`predictions = model_selection.cross_val_predict(gridSearch,data[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']],data['Contaminated'])`

**Step 6 - Print the predictions**

Type the **predictions** variable on its own line, which will print the **predictions** on the screen.

`predictions`

**Step 7 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gjtj1X5XEAAKslc?format=jpg&name=small)
</details>

**Your Turn**: Test it out yourself!

In [None]:
# freehand code


**Explanation**: *The classification report provides some metrics, particularly precision and recall. Precision measures how many of the samples the model labeled as contaminated were actually contaminated, essentially showing the accuracy of its positive predictions. Recall, on the other hand, reflects the model's ability to detect all actual contaminated samples, indicating how well it "found" the true cases. In this report, the precision and recall scores are both high, at 0.99 or 99% for each class, which demonstrates that the model is highly accurate in identifying both contaminated and non-contaminated samples. These scores mean that nearly all positive predictions made by the model were correct, and it missed almost none of the actual contaminated cases.*
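The precision and recall described above can be computed from the cross-validated predictions with functions from the metrics library. The sketch below is only an illustration: it uses synthetic data from `make_classification` instead of the pollution dataset, assumes `dtree` is a `DecisionTreeClassifier`, and the names `X_demo` and `y_demo` are made up. Its scores will not match the ones quoted for the real data.

```python
from sklearn import metrics, model_selection, tree
from sklearn.datasets import make_classification

# Synthetic stand-in data (NOT the pollution dataset): 200 rows, 6 features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Assumed definition of dtree; the notebook creates its own earlier on
dtree = tree.DecisionTreeClassifier(random_state=0)
gridSearch = model_selection.GridSearchCV(
    dtree, param_grid=dict(max_depth=[2, 3, 4]), cv=10
)

# Each record is predicted by a grid search that never saw it during training
predictions = model_selection.cross_val_predict(gridSearch, X_demo, y_demo)

# Precision: of the records predicted as class 1, how many really were class 1?
precision = metrics.precision_score(y_demo, predictions)
# Recall: of the records that really were class 1, how many did we find?
recall = metrics.recall_score(y_demo, predictions)
print(precision, recall)

# The classification report shows precision and recall for every class at once
print(metrics.classification_report(y_demo, predictions))
```

On the real notebook data, the same three calls would take `data['Contaminated']` as the true labels in place of `y_demo`.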

## WHAT DID YOU LEARN?


Cross-validation helps us use all our data for training and testing with specific tools to avoid overfitting. By splitting data into sections and testing different parts at different times, our model gets better at making predictions. Grid search helps find the best hyperparameters by testing more than one option. Remember, it’s important not to test hyperparameters on the same data we trained on, so that we don’t get misleading results.

## WHAT’S NEXT?

[Data Cleaning](Data_Cleaning.ipynb)

## ANY EXTRAS?

none