# Clustering

## ANTICIPATED TIME

2 hours.

## WHAT YOU WILL LEARN


- What is a cluster?
- When do I use clustering?
- What are the different types of clustering methods?
- What do I use to find patterns within a cluster?
- How do I evaluate the effectiveness of clustering?



## DEFINITIONS YOU’LL NEED TO KNOW


- Cluster - a group of data points that are more similar to each other than to points in other clusters.
- Seeds - the initial centroids or centers of the clusters in k-means clustering.
- Hierarchical clustering - a clustering approach that builds a tree of clusters nested inside other clusters.
- Agglomerative clustering - small clusters are combined step by step until one big cluster is formed.
- Divisive clustering - one big cluster is split into smaller clusters until every data point is alone.
- Partitioning - a method of clustering that groups data based on certain rules; sometimes we call this optimization.
- K-means Clustering - a common method of partitioning.
- Centroid - the average of data in each cluster.
- Feature - something unique about the data that helps describe a data point.


## SCENARIO


Ethan, Angelina, Diego, and Kiana have been trying to solve the problem of CO2 emissions in their city. They have collected data on CO2 levels from neighborhoods, cars, factories, and parks in different areas. Diego suggested that the team use clustering, which would let them group together neighborhoods that have similar levels of CO2. They wondered whether neighborhoods with high CO2 levels could be grouped based on where people lived or the cars they drove. That got them thinking about how many clusters to look for, given the type of data they had. They decided that clustering the data would help them discover which areas need the most attention to reduce pollution and make the city's air cleaner.


## WHAT DO I NEED TO KNOW?




**Grouping similar data types into clusters**

Have you ever wondered whether your data holds hidden patterns that you might not recognize at first? You are in luck: data science can help with clustering, which sorts the data into groups of similar items.

Imagine a forest floor covered in fallen leaves. We could group the leaves by size, color, and other traits. We could create “clusters” for each type of tree (oak, pecan, etc.) and then, within those, smaller clusters for the leaves of each specific oak tree, pecan tree, and so on.

**Variables as features**

While analyzing data, each variable in a data set can be viewed as a feature. A feature is something about the data that helps describe a data point. Once we look at all the data points, features become helpful because they let us think through how similar or different the points are. Better yet, we can use them to calculate the distances between data points.

- Smaller distance between data points? They probably should be grouped together.
- Bigger distance between data points? They probably should not be grouped together.
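
To make the idea of distance concrete, here is a minimal sketch that measures the straight-line (Euclidean) distance between data points. The three points use two features each, with values borrowed from the air-quality dataset used later in this lesson; the variable names are made up for illustration.

```python
import math

# Each point is (CO2Emissions, PM2.5Emissions) for one observation.
point_a = (2431, 1252)
point_b = (2433, 1156)
point_c = (2037, 1564)

# math.dist returns the straight-line (Euclidean) distance between two points.
distance_ab = math.dist(point_a, point_b)
distance_ac = math.dist(point_a, point_c)

print(round(distance_ab, 1))
print(round(distance_ac, 1))

# A is much closer to B than to C, so A and B are better candidates
# for sharing a cluster.
print(distance_ab < distance_ac)
```

Real clustering tools do this kind of calculation for every pair of points and across all features at once.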

**Hierarchical clustering**

*So how do we get started with clustering?*

**Agglomerative clustering** says, “Hey, we are starting to seem similar. Let’s combine.” We start at the bottom, where each separate data point is considered its own cluster, and then merge similar clusters to make new ones. This repeats until there is one large cluster at the top. So it’s like working from the bottom to the top to find the clusters.

**Divisive clustering** is kind of the opposite and says, “Hey, maybe we aren’t so similar. Let’s split.” It’s the top-down way: you start with one big cluster and split it whenever the groups inside become different enough.

Over time, we create a tree-like structure that holds all these different clusters. This helps us analyze the relationships and think through how all the data points are related to each other.
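
The bottom-up (agglomerative) idea can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: it uses five CO2 readings borrowed from the dataset used later in this lesson, and it measures how close two clusters are by their two nearest members (often called single linkage). The helper name `gap` is made up.

```python
# Five CO2 readings, one per observation.
readings = [2431, 2433, 2361, 2037, 2008]

# Agglomerative (bottom-up): every reading starts as its own cluster.
clusters = [[value] for value in readings]


def gap(cluster_a, cluster_b):
    """Smallest distance between any two members of the two clusters."""
    return min(abs(a - b) for a in cluster_a for b in cluster_b)


merge_history = []
while len(clusters) > 1:
    # Find the pair of clusters that are closest to each other.
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    i, j = min(pairs, key=lambda pair: gap(clusters[pair[0]], clusters[pair[1]]))
    merged = sorted(clusters[i] + clusters[j])
    merge_history.append(merged)
    # Replace the two old clusters with the merged one.
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

# Each line is one merge, from the bottom of the tree to the top.
for step, merged in enumerate(merge_history, start=1):
    print(step, merged)
```

Reading the merges in reverse order gives the top-down (divisive) view of the same tree.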




**Nonhierarchical methods**

Not all clustering methods work like a tree, where we start with all data points as individuals and slowly group them (hierarchical). Some methods jump right into creating groups based on simple rules.

One popular method for this is **partitioning**. **Partitioning**, also called **optimization**, clusters data based on certain rules. Think of it as dividing your data into sections based on specific guidelines, like sorting your closet by color, size, or even favorite items. Using partitioning, you could group the items in the closet by the characteristics they share.

There are different partitioning methods that help divide a lot of data into clear groups quickly. One common technique in partitioning is **k-means clustering**. How does it work? It works in a few main steps.

We start by picking a few random points, called **“seeds,”** around which the clusters will form. The number of seeds equals the number of clusters we want. If you choose three seeds, you’re aiming for three clusters. The seeds are usually spread out from each other to capture different types of data points.

So how do you decide the number of clusters? Deciding on the number of clusters is a key part of k-means clustering. If there are too few clusters, you might combine things that aren’t very similar. If there are too many clusters, you might end up with some that don’t have a clear purpose. Ideally, you want to find a balance where items within each cluster are close to each other, and the clusters themselves are well-separated.
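
One common way to explore this balance is to try several values of k and see how tightly each one packs the points. The sketch below is an illustration, not a rule: it uses made-up data with three obvious groups, and it relies on scikit-learn's `inertia_` attribute (the sum of squared distances from each point to its centroid). The `n_init` and `random_state` settings are assumptions added so that reruns give the same result.

```python
import numpy as np
import sklearn.cluster as cluster

# Made-up data: three blobs of 30 points each around hypothetical centers.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [20, 0]])
data = np.vstack([center + rng.normal(scale=1.0, size=(30, 2)) for center in centers])

# Fit one model per candidate k and record how compact its clusters are.
inertias = {}
for k in range(1, 7):
    model = cluster.KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(data)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The numbers drop sharply up to k = 3 and only a little after that, which hints that three clusters is a reasonable balance for this made-up data. Real data is rarely this clear-cut.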

Once you have chosen the number of clusters, assign each data point to the closest seed based on its features. For instance, a photo showing a large, gray animal might get grouped with other large, gray animals.

After grouping the data points, we calculate the **centroid**. The centroid is like the average point, showing the general center of each group based on all items in it. This represents the “heart” of each group and is recalculated each time we adjust the clusters. So new clusters = new centroids. Makes sense if you think about it.

With new centroids, we might begin to wonder if some data points fit better in other clusters. So, we reassign them and recalculate the centroids, repeating until the points stop switching clusters. At that point each data point sits with its closest centroid, and the clusters are well-defined.
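
The assign-then-recalculate loop described above can be sketched in plain Python. This is a simplified illustration, not how a library implements it: the six points are made up, the two seeds are picked by hand instead of at random, and the helper names `closest` and `mean_point` are hypothetical. It also assumes no cluster ever ends up empty.

```python
import math

# Made-up points that form two loose groups.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]

# Seeds: the starting centroids. Here we simply pick two of the points by hand.
centroids = [points[0], points[3]]


def closest(point, centers):
    """Index of the center nearest to the point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def mean_point(group):
    """Average of each coordinate: the centroid of the group."""
    return tuple(sum(coord) / len(group) for coord in zip(*group))


while True:
    # Step 1: assign every point to its nearest centroid.
    labels = [closest(p, centroids) for p in points]
    # Step 2: recalculate each centroid from the points assigned to it.
    new_centroids = [
        mean_point([p for p, label in zip(points, labels) if label == k])
        for k in range(len(centroids))
    ]
    # Step 3: stop once the centroids no longer move.
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(labels)
print(centroids)
```

The first three points end up in one cluster and the last three in the other, with each centroid sitting in the middle of its group.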


## YOUR TURN



Now that you understand what a cluster is and when to use it, it’s time to practice applying this concept! As you go through this, pay close attention to the different tools that you can use to identify patterns within a cluster and how the tools help us group data by its features.

### Goal 1: Importing the Pandas Library

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it like an ‘add-on’. In this case, we bring in pandas, which is a popular library for doing data science stuff.  

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is **import**, which brings the add-on package in.

Open the IMPORT menu, which helps us bring in other data tools. In this case, we're bringing in the **import** block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the imported library a short nickname (an alias). This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve used **pd**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

![](https://pbs.twimg.com/media/GZYmOOpW8AAf7uc?format=png&name=240x240)

#### Freehand


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is **import**, which brings the add-on package in.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the imported library a short nickname (an alias) using the word **as**. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve used **pd**.



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

![](https://pbs.twimg.com/media/GZmkVCYWEA4oGso?format=jpg&name=small)

**Your Turn**: Now it’s your turn! We’re going to dive into the pandas package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “pd” to make it easier to use throughout our notebook.

In [3]:
import pandas as pd

**Explanation**: *Congrats! You did it! You have successfully imported the "pandas" package under the alias "pd"*.
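
As a quick check that the alias works, here is a small sketch that builds a tiny table through `pd`. The variable name `sample` is made up, and the two rows are copied from the air-quality data used in the next goal.

```python
import pandas as pd

# Because of the alias, pd.DataFrame means the same thing as pandas.DataFrame.
sample = pd.DataFrame({"Methane": [670, 641], "CO2Emissions": [2431, 2433]})

# shape is (number of rows, number of columns).
print(sample.shape)
```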

### Goal 2: Bringing in the Dataframe

Let’s bring in the data that we want to look at.


#### Blockly


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable and call it **dataframe**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

In Blockly, bring in the VARIABLES menu.



**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

In Blockly, go to the Variables and drag the Set block for the **dataframe** variable. This will allow us to assign the result of a function call to the variable. A function is basically code that does a specific task for us.



**Step 3 - Bring in the data**

Now we need to look at the file that has all our data. To load our dataframe, we’ll use a simple command to bring in the file we need, which is a CSV (Comma Separated Values) file. Let’s say we have a file called ‘AirQualityEmissions.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **dataframe**.

From the Variable menu, drag a DO block using the **pd** variable and select the **read_csv** operation. The read_csv function reads a CSV file and returns a DataFrame object.

In our case, let’s bring in “datasets/AirQualityEmissions.csv” (use the Quotes from the TEXT menu) because that is what Angelina is working with.



**Step 4 - Display the variable**

Let’s see it now by ‘displaying’ and showing our work.

Drag the **dataframe** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

![](https://pbs.twimg.com/media/GaNzytBWAAAGX0n?format=png&name=small)

#### Freehand

**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable called **dataframe**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!



**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.



**Step 3 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need, which is a CSV (Comma Separated Values) file. Let’s say we have a file called ‘AirQualityEmissions.csv’ in the folder **‘datasets’**. We call `pd.read_csv`, which reads the CSV file, and store the result in a variable called **dataframe**. This variable is now our dataframe!

In our case, let’s bring in “datasets/AirQualityEmissions.csv” (typed inside quotes) because that is what Kiana is working with.



**Step 4 - Print the variable**

Let’s see it now by ‘printing’ and showing our work.



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

![](https://pbs.twimg.com/media/GenW05YW8AA3PMJ?format=png&name=small)

**Your Turn**: Now it’s your turn! We’re going to load the data into a dataframe and display it so we can see what we are working with.




In [4]:
dataframe = pd.read_csv('datasets/AirQualityEmissions.csv')
dataframe

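|  | Methane | NOxEmissions | PM2.5Emissions | VOCEmissions | SO2Emissions | CO2Emissions | Class | Contaminated |
|---|---|---|---|---|---|---|---|---|
| 0 | 670 | 696 | 1252 | 1720 | 1321 | 2431 | 4 | 1 |
| 1 | 641 | 674 | 1156 | 1652 | 1410 | 2433 | 4 | 1 |
| 2 | 642 | 646 | 1159 | 1643 | 1455 | 2361 | 4 | 1 |
| 3 | 640 | 590 | 1105 | 1608 | 1459 | 2427 | 4 | 1 |
| 4 | 616 | 627 | 1192 | 1637 | 1466 | 2447 | 4 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1840 | 862 | 826 | 1564 | 1768 | 1540 | 2037 | 4 | 1 |
| 1841 | 917 | 821 | 1571 | 1779 | 1543 | 2008 | 4 | 1 |
| 1842 | 925 | 832 | 1582 | 1776 | 1545 | 1989 | 4 | 1 |
| 1843 | 928 | 840 | 1587 | 1787 | 1538 | 1986 | 4 | 1 |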



**Explanation**:  *The dataset contains measurements of various air pollutants and their emissions, including Methane, NOx (Nitrogen Oxides), PM2.5 (Particulate Matter), VOCs (Volatile Organic Compounds), SO2 (Sulfur Dioxide), and CO2 (Carbon Dioxide). Each row represents a different observation with specific values for each pollutant. The “Class” column likely categorizes the data into different types or sources of emissions, while the “Contaminated” column indicates whether the observation is considered contaminated (1) or not (0). This dataset can be used to analyze the levels of different pollutants and their potential sources, as well as to assess the overall air quality and identify contaminated samples*.
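
Once the data is loaded, a few quick checks help confirm it came in as expected. The sketch below is a stand-in: instead of the real file, it reads the first five rows shown above from a piece of text, so the name `sample` and the CSV text are assumptions for illustration only.

```python
import io

import pandas as pd

# Stand-in for the real file: the first five rows shown above, typed as CSV text.
csv_text = """Methane,NOxEmissions,PM2.5Emissions,VOCEmissions,SO2Emissions,CO2Emissions,Class,Contaminated
670,696,1252,1720,1321,2431,4,1
641,674,1156,1652,1410,2433,4,1
642,646,1159,1643,1455,2361,4,1
640,590,1105,1608,1459,2427,4,1
616,627,1192,1637,1466,2447,4,1
"""

# read_csv accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                    # (rows, columns)
print(list(sample.columns))            # column names
print(sample["CO2Emissions"].mean())   # average of one column
```

With the real file, the same three checks work on **dataframe** directly.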

### Goal 3: Import the Plotly.Express Library

We’ve already brought pandas to help with data science. Let’s bring in Plotly Express to help with some fancy-pants visualizations.


#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is **import**, which brings the add-on package in.

Open the IMPORT menu, which helps us bring in other data tools. In this case, we're bringing in the **import** block.




**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out the **plotly.express** library. Plotly is a popular library in Python that provides functions for fancy-pants data visualizations.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the **plotly.express** library a short nickname (an alias). This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve used **px** so it’s easier to remember.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

![](http://pbs.twimg.com/media/GaON4A9XsAANFz2?format=png&name=240x240)

#### Freehand


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is **import**, which brings the add-on package in.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out the **plotly.express** library. Plotly is a popular library in Python that provides functions for fancy-pants data visualizations.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the imported library a short nickname (an alias) using the word **as**. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve used **px**.



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

![](https://pbs.twimg.com/media/GaON1ppW0AIrmJh?format=png&name=360x360)

**Your Turn** Now it’s your turn! We’re going to bring in Plotly Express to help with those visualizations. Let’s start the import making sure to rename it before we run it.

In [6]:
import plotly.express as px

**Explanation**: *Congrats! You did it! You have successfully imported the **plotly.express** package under the alias px*.

### Goal 4: Initial visual inspection of correlations using Scatter Plot

Scatter plots help us look at each data point when we have interval or ratio (numeric) data. A scatter plot shows the relationship between two variables in a data set. The independent variable is plotted on the X-axis, while the dependent variable is plotted on the Y-axis. Scatter plots are super handy for finding relationships between numeric variables.

#### Blockly


**Step 1 -  Call the scatter function from plotly**

To make a scatterplot, we first need to call the scatter function with our plotly library (px).

From the Variables menu drag a DO block for the **px** variable. Select the "**scatter**" function. This specifies the function we want to call, which is the scatter function from the Plotly Express library (imported as "px" earlier).



**Step 2 -  Saying what data to use for the scatter plot**

In order to make a plot, we need to choose which data to plot. In this case, our dataset is stored in the dataframe **dataframe**.

For the first argument, drag the **dataframe** variable from the Variable menu. This tells the scatter function which dataframe to look at.



**Step 3 -  Tell plotly what columns to put on the axis**

Identify the two variables you want to look at. One variable will go along the X-axis (*across*) and the other along the Y-axis (*up and down*). In our context, we want to see the relationship between **CO2Emissions** and **PM2.5Emissions**, so we assign one variable to each of the two axes.

From the TEXT menu, drag the Quotes and type the text **CO2Emissions**. This specifies CO2Emissions as the x-axis variable for the scatter plot. Then drag another Quotes block from the TEXT menu and type the text **PM2.5Emissions**. This specifies PM2.5Emissions as the y-axis variable.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

![](https://pbs.twimg.com/media/GbbDb9PbcAAvgvN?format=png&name=900x900)

#### Freehand

**Step 1 - Call the scatter function from plotly**

To make a scatterplot, we first need to call the scatter function with our plotly library (px).

`px.scatter()`


**Step 2 -  Saying what data to use for the scatter plot**

In order to make a plot, we need to choose which data to plot. In this case, our dataset is stored in the dataframe **“dataframe”**.

`px.scatter(dataframe) `


**Step 3 - Tell plotly what columns to put on the axes**

Identify the two variables you want to look at. One variable will go along the X-axis (*across*) and the other along the Y-axis (*up and down*). In our context, we want to see the relationship between **CO2Emissions** and **PM2.5Emissions**, so we assign one variable to each of the two axes.

`px.scatter(dataframe, x="CO2Emissions", y="PM2.5Emissions")`


**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


![](https://pbs.twimg.com/media/GbbDZJYbgAA-gdA?format=jpg&name=medium)

**Your Turn** Now it’s your turn! We’re going to create a scatter plot so we can review the correlations of the data. Let’s start with adding in the x and y-axis variables.

In [7]:
px.scatter(dataframe,'CO2Emissions','PM2.5Emissions')

**Explanation**: *This scatter plot shows groups of data points that represent different patterns in CO2 and PM2.5 emissions. We can see that most points fall within a few main clusters, indicating that certain levels of CO2 emissions are often associated with specific ranges of PM2.5 emissions. For example, one cluster of points is concentrated around higher PM2.5 levels with lower CO2 emissions, while another cluster shows lower PM2.5 levels as CO2 emissions increase. Identifying these clusters helps us understand how different sources or types of emissions are linked to different pollution patterns*.

### Goal 5: Create a new dataframe with new columns

We will create a new dataframe using the emissions data found in some of the columns of our dataset. Be warned: because we store the new dataframe under the same name, we overwrite the old one, so this becomes the latest version of **dataframe**. (Gulp!)

#### Blockly


**Step 1 - Select the dataframe columns**

Let’s think through what variables in **dataframe** will help us with the prediction.

Get the dictVariable from the LISTS menu. Select **dataframe** from the dropdown menu.

Create a new list to store the column names. Using the Quotes “” from the Text menu, add the following column names to the list "Methane", "NOxEmissions", "PM2.5Emissions", "VOCEmissions", "SO2Emissions", and "CO2Emissions".



**Step 2 - Store the columns in a new variable**

Now that we have all the data we want to look at, let’s assign the resulting **dataframe** to the dataframe variable we used earlier.

The new dataset now has only the variables we think will help with our prediction. With that, it would be possible to find patterns as a result of the clustering process.



**Step 3 - Display the dataframe**

Finally, we display the dataframe variable to see what we got.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

![](https://pbs.twimg.com/media/GaOHIb4WMAAjl-0?format=png&name=small)

#### Freehand


**Step 1 - Select the dataframe columns**

Let’s think through what variables in **dataframe** will help us with the prediction.

To select specific columns, write the dataframe name followed by square brackets, `dataframe[ ]`. Inside the brackets, we give the column names as a list.

`dataframe[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']]`



**Step 2 - Store the columns in a new variable**

Now that we have all the data we want to look at, let’s assign the resulting dataframe to the **dataframe** variable we used earlier.

The new dataset now has only the variables we think will help with our prediction. With that, it would be possible to identify patterns as the result of the clustering process.

`dataframe = dataframe[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']]`




**Step 3 - Print the dataframe**

Finally, we print the dataframe variable to see what we got.

`dataframe`



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


![](https://pbs.twimg.com/media/GaOHF6LWkAAa5ZH?format=jpg&name=medium)


**Your Turn**: Now it’s your turn! Keep only the emissions columns and display the result.



In [8]:
dataframe = dataframe[['Methane', 'NOxEmissions', 'PM2.5Emissions', 'VOCEmissions', 'SO2Emissions', 'CO2Emissions']]
dataframe

|  | Methane | NOxEmissions | PM2.5Emissions | VOCEmissions | SO2Emissions | CO2Emissions |
|---|---|---|---|---|---|---|
| 0 | 670 | 696 | 1252 | 1720 | 1321 | 2431 |
| 1 | 641 | 674 | 1156 | 1652 | 1410 | 2433 |
| 2 | 642 | 646 | 1159 | 1643 | 1455 | 2361 |
| 3 | 640 | 590 | 1105 | 1608 | 1459 | 2427 |
| 4 | 616 | 627 | 1192 | 1637 | 1466 | 2447 |
| ... | ... | ... | ... | ... | ... | ... |
| 1840 | 862 | 826 | 1564 | 1768 | 1540 | 2037 |
| 1841 | 917 | 821 | 1571 | 1779 | 1543 | 2008 |
| 1842 | 925 | 832 | 1582 | 1776 | 1545 | 1989 |
| 1843 | 928 | 840 | 1587 | 1787 | 1538 | 1986 |


**Explanation**: *You have created a new dataset containing only specific emissions-related columns, which can then be used to identify patterns through clustering analysis*.
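
If overwriting the original feels risky, one alternative is to store the selection under a different name. The sketch below shows the idea on a tiny stand-in table; the name `emissions` is hypothetical, and the lesson itself keeps using **dataframe**.

```python
import pandas as pd

# Stand-in with two rows copied from the output above, plus the two label columns.
dataframe = pd.DataFrame({
    "Methane": [670, 641],
    "CO2Emissions": [2431, 2433],
    "Class": [4, 4],
    "Contaminated": [1, 1],
})

# Passing a list of names inside [] returns a new dataframe with only those columns.
emissions = dataframe[["Methane", "CO2Emissions"]]

print(list(emissions.columns))
print(list(dataframe.columns))  # the original still has all four columns
```

Either approach works for clustering. Keeping the original around just makes it easier to go back if you change your mind about which columns to use.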

### Goal 6: Import the package to start the K-means cluster

We will use the K-means clustering model to predict cluster membership for each datapoint (i.e. a row in our dataframe) then use the fit_predict function to store cluster prediction results.

#### Blockly


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is **import**, which brings the add-on package in.

Open the IMPORT menu, which helps us bring in other data tools. In this case, we're bringing in the **import** block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.cluster**, a part of the scikit-learn library with tools that automatically group similar data points into awesome clusters!


**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the imported library a short nickname (an alias). This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve used **cluster**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

![](https://pbs.twimg.com/media/GaN02IIWAAA2XDP?format=png&name=360x360)

#### Freehand


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the command is **import**, which brings the add-on package in.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **sklearn.cluster**, a part of the scikit-learn library with tools that automatically group similar data points into awesome clusters!



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, give the imported library a short nickname (an alias) using the word **as**. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve used **cluster**.



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


![](https://pbs.twimg.com/media/GaN0z4hWUAA7ExU?format=png&name=360x360)


**Your Turn**: Let’s give it a go. See if you can bring in the sklearn.cluster library to help us do some interesting data science tasks.


In [9]:
import sklearn.cluster as cluster

**Explanation**: *You have now imported the library that provides clustering algorithms to group similar data points, helping to find patterns and natural groupings in data*.

### Goal 7: Create the model object

Let’s get started with KMeans clustering and tell it how many clusters to look for.

#### Blockly

**Step 1 - Write out the variable name you want to use**

On the Variable menu, click Create Variable, and type **kmeansModel**. From the same menu, get a Set block for this newly created variable.


**Step 2 - Create the KMeans model**

Now we will try to create a KMeans model to find the patterns and put them in groups.

From the Variable menu, get a Create block. Select the **cluster** variable, and on the Create menu select the **KMeans** object type (class). Let’s use the **cluster** module to help us create the KMeans model.


**Step 3 - Define the hyperparameters**

But how many groups? How about 4? For this dataset, we set the number of clusters to 4.

Get a Freestyle block and type **n_clusters=4**. This sets the number of clusters to 4, meaning the data will be divided into 4 clusters.



**Step 4 - Assign the model to the variable you created**

Just like we did before, let’s put all that code into a single variable that’s easy to remember.

Assign the model to the **kmeansModel** variable.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!


![](https://pbs.twimg.com/media/GaOKGjtWkAALJBf?format=png&name=small)


#### Freehand


**Step 1 - Create the KMeans model**

Now we will try to create a KMeans model to find the patterns and put them in groups.  Let’s use the cluster module to help us create the KMeans model.

`cluster.KMeans()`



**Step 2 - Define the hyperparameters**

But how many groups? How about 4? For this dataset, we set the number of clusters to 4, meaning the data will be divided into  4 clusters.

`cluster.KMeans(n_clusters=4)`



**Step 3 - Assign the model to the variable you created**

Just like we did before, let’s put all that code into a single variable that’s easy to remember.

Finally, we save the model in a variable called **‘kmeansModel’**.

`kmeansModel = cluster.KMeans(n_clusters=4)`



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

![](https://pbs.twimg.com/media/GaQeFCZXMAA2ywS?format=png&name=small)

**Your Turn**: So let’s try it ourselves! Let’s see if we can run our KMeans with 4 clusters

In [10]:
kmeansModel = cluster.KMeans(n_clusters=4)

**Explanation**: *You now have a KMeans clustering model, which will group data into 4 clusters. The model will find four distinct groups in the data based on similarities. This can help us see patterns or categorize data points that are alike*.
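
Because k-means starts from random seeds, two runs can number or even shape the clusters a little differently. The sketch below shows two optional settings that help with this. They are extras assumed here for illustration, not part of the lesson's code: `random_state` fixes the randomness so reruns match, and `n_init` sets how many times k-means restarts from fresh seeds before keeping the best result.

```python
import sklearn.cluster as cluster

# n_clusters is the hyperparameter from the lesson.
# n_init and random_state are optional extras (assumptions for this sketch).
kmeansModel = cluster.KMeans(n_clusters=4, n_init=10, random_state=42)

# The settings are stored on the model and can be read back.
print(kmeansModel.n_clusters)
print(kmeansModel.get_params()["random_state"])
```

Nothing has been clustered yet at this point. Creating the model only records the settings; the grouping happens when the model is fitted to data.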

### Goal 8: Train (fit) and predict

Let’s train (fit) the model on our data. Once it’s trained, it predicts which of the 4 clusters each data point goes into.


#### Blockly

**Step 1 - Write out the name of a variable you want to use**

Now that our model is ready, let’s create a variable to hold the results and call it **predictions**. It will store the cluster number that the model assigns to each data point.

In Blockly, bring in the VARIABLES menu.



**Step 2 - Train the model**

To make predictions using the cluster model, we use our **kmeansModel** and call the **fit_predict()** method on it. This method first trains our model and then sets a cluster label for each datapoint.

From the Variable menu, get a DO block for the variable **kmeansModel**, and select the **fit_predict** operation.



**Step 3 - Bring the training dataset**

Next, we give the model the data to cluster. Here, we provide **dataframe**, the same data we prepared in Goal 5. We could also provide other data to the model to assess its performance better.

From the Variables menu, get the **dataframe** variable and connect it to this operation. This will train the model and assign each data point to one of the 4 clusters.



**Step 4 - Assign the predicted results to the variable you created**

From the Variables menu, drag the **predictions** variable. It will hold the model's prediction of which of the four clusters each data point belongs to.



**Step 5 - Display the predicted clusters**

So let’s try to look at our results. So now we display the predicted results by using variable **predictions**.



**Step 6 - Connect the blocks to run the code**

Connect the blocks and run the code!

![](https://pbs.twimg.com/media/GaN3kx5XwAAnKSP?format=png&name=small)

#### Freehand


**Step 1 - Train the model**

To make predictions using the cluster model, we use our **kmeansModel** and call the fit_predict() method on it. This method will first train our model and then set labels for each datapoint.

`kmeansModel.fit_predict()`



**Step 2 - Bring the training dataset**

Next, we give the model the data to cluster. Here, we provide **dataframe**, the same data we prepared in Goal 5. We could also provide other data to the model to assess its performance better.

`kmeansModel.fit_predict(dataframe)`



**Step 3 - Assign the predicted results to the variable you created**

Next, we save the prediction results in a variable **predictions**.

`predictions = kmeansModel.fit_predict(dataframe)`



**Step 4 - Print the predicted clusters**

Finally, let’s look at our results. We display the predicted clusters by using the variable **predictions**.

`predictions`



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

![](https://pbs.twimg.com/media/GaN3irEW4AIQcrq?format=png&name=small)

**Your Turn** Now it’s your turn! Let’s try to train (fit) our model and see what we get!



In [11]:
predictions = kmeansModel.fit_predict(dataframe)
predictions

array([0, 0, 0, ..., 1, 1, 1], dtype=int32)

**Explanation**: *At this point you asked the model to look at the data, figure out the clusters, and then assign each data point to a cluster*.
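To see the same idea on a tiny example, here is a minimal sketch. It assumes the model is scikit-learn's `KMeans`; the lesson's **kmeansModel** was created earlier, so its exact settings are not shown here. The names `toy_dataframe`, `toy_model`, `toy_predictions`, and `new_rows`, the made-up numbers, and the choice of 2 clusters are all assumptions for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: two obvious groups of emission readings.
toy_dataframe = pd.DataFrame({
    "CO2Emissions": [2400, 2430, 2450, 2000, 1990, 2010],
    "PM2.5Emissions": [1100, 1150, 1200, 1560, 1580, 1570],
})

# n_clusters=2, n_init=10, and random_state=0 are choices made for this sketch.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict trains the model and returns one cluster label per row.
toy_predictions = toy_model.fit_predict(toy_dataframe)
print(toy_predictions)

# After training, predict() assigns new rows to the nearest existing centroid
# without retraining the model.
new_rows = pd.DataFrame({"CO2Emissions": [2420], "PM2.5Emissions": [1120]})
print(toy_model.predict(new_rows))
```

The label numbers themselves are arbitrary. What matters is which rows share the same label.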

### Goal 9: Now that we have our predictions

Let’s just add it back into the dataframe we already created


#### Blockly


**Step 1 - Call the insert() method of the dataframe**

To see the clustering result for each data point, we will add the results to our dataframe. This lets us look at the cluster number assigned to each data point. To do this, we use the insert() method on the ‘dataframe’.

From the Variables menu, get a DO block for the **dataframe** variable. With it, select the **insert** operation. The insert function will add a new column to the dataframe.



**Step 2 - Add parameters into the insert() method**

Inside the insert method, we use parameters to tell it where the new column goes, what to call it, and what data to put in it. In this case, let’s write…
- **0** The column index where the new column should be inserted. Here, 0 specifies the first column position.
- "**cluster**" The name of the new column.
- **predictions** The data to insert into the new column. predictions should be a list, array, or Series with the same number of elements as the rows in dataframe.

From the Math menu, get the number 123 block and change it to 0 (zero). This is the column index where the new column should be inserted; 0 specifies the first column position.

Then get a Quote “” block and type **“cluster”**, which defines the name of the column.

Lastly, from the Variables menu, get the block for the **predictions** variable. This brings in the clusters we predicted in the last goal.



**Step 3 - Display the updated dataframe with the new ‘cluster’ column**

From the Variables menu, get the block of the **dataframe** variable. This will display the updated dataframe, which now includes the new "cluster" column with the predicted cluster labels.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!

![](https://pbs.twimg.com/media/GaOMK_yWUAANQ78?format=png&name=360x360)

#### Freehand


**Step 1 - Call the insert() method of the dataframe**

To see the clustering result for each data point, we will add the results to our dataframe. This lets us look at the cluster number assigned to each data point. To do this, we use the **insert**() method on the ‘dataframe’.

`dataframe.insert()`



**Step 2 - Add parameters into the insert() method**

Inside the insert method, we use parameters to tell it where the new column goes, what to call it, and what data to put in it. In this case, let’s write…

- **0** The column index where the new column should be inserted. Here, 0 specifies the first column position.
- "**cluster**" The name of the new column.
- **predictions** The data to insert into the new column. This brings in the clusters we predicted in the last goal.

`dataframe.insert(0, "cluster", predictions)`



**Step 3 - Print the updated dataframe with the new ‘cluster’ column**

Now, we can view the new dataframe that has the cluster column added to it.

`dataframe`



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!


![](https://pbs.twimg.com/media/GaOMNx3XoAAKbZR?format=png&name=small)


**Your Turn** Seems easy enough, right? See if you can type out the code above and try to run it. What do you get?!


In [12]:
dataframe.insert(0,"cluster",predictions)
dataframe

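| | cluster | Methane | NOxEmissions | PM2.5Emissions | VOCEmissions | SO2Emissions | CO2Emissions |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 670 | 696 | 1252 | 1720 | 1321 | 2431 |
| 1 | 0 | 641 | 674 | 1156 | 1652 | 1410 | 2433 |
| 2 | 0 | 642 | 646 | 1159 | 1643 | 1455 | 2361 |
| 3 | 0 | 640 | 590 | 1105 | 1608 | 1459 | 2427 |
| 4 | 0 | 616 | 627 | 1192 | 1637 | 1466 | 2447 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1840 | 1 | 862 | 826 | 1564 | 1768 | 1540 | 2037 |
| 1841 | 1 | 917 | 821 | 1571 | 1779 | 1543 | 2008 |
| 1842 | 1 | 925 | 832 | 1582 | 1776 | 1545 | 1989 |
| 1843 | 1 | 928 | 840 | 1587 | 1787 | 1538 | 1986 |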


**Explanation**: *You have added a new column called "cluster" at the beginning of the dataframe. The values in this column come from predictions, which contains the cluster label assigned to each data point. This way, each row in the dataframe shows which cluster it belongs to, helping us see how the data is grouped*.
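One thing worth knowing: insert() changes the dataframe in place, so running the same cell a second time fails because the "cluster" column already exists. The sketch below shows this on a tiny, made-up dataframe (`toy_dataframe` and `toy_predictions` are hypothetical names and values, not the lesson's data), along with one possible way to make the cell safe to re-run.

```python
import pandas as pd

# Hypothetical toy data and made-up cluster labels.
toy_dataframe = pd.DataFrame({
    "Methane": [670, 641, 862],
    "CO2Emissions": [2431, 2433, 2037],
})
toy_predictions = [0, 0, 1]

# The first insert works and puts "cluster" in the first column position.
toy_dataframe.insert(0, "cluster", toy_predictions)
print(list(toy_dataframe.columns))

# A second insert with the same name raises a ValueError.
try:
    toy_dataframe.insert(0, "cluster", toy_predictions)
except ValueError as error:
    print("Second insert failed:", error)

# Assigning by column name overwrites the existing column instead,
# so this line can be re-run as often as needed.
toy_dataframe["cluster"] = toy_predictions
```

If the error shows up in your own notebook, re-running the earlier cell that loads the data gives you a fresh dataframe without the "cluster" column.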

### Goal 10: Create scatterplots using the new cluster variable.


#### Blockly


**Step 1 -  Call the scatter function from plotly**

To make a scatterplot, we first need to call the scatter function with our plotly library (px).

From the Variables menu drag a DO block for the **px** variable. Select the **scatter** function. This specifies the function we want to call, which is the scatter function from the Plotly Express library (imported as "px" earlier).



**Step 2 -  Saying what data to use for the scatter plot**

To make a plot, we need to choose which data to plot. In this case, our dataset is stored in the dataframe **dataframe**.

For the first argument, drag the **dataframe** variable from the Variables menu. This tells the scatter function which dataframe to look at.



**Step 3 -  Tell plotly what columns to put on the axis**

Identify the two variables you want to look at. One variable will be alongside the X-axis (*across*) and another one alongside the Y-axis (*up and down*). In our context, we want to see the relationship between **CO2Emissions** and **PM2.5Emissions**. We will assign the variables to the 2 axes in the graph.

From the TEXT menu, drag the Quotes. Type the text **CO2Emissions**. This specifies CO2Emissions as the x-axis variable for the scatter plot. Then, from the TEXT menu, drag another Quotes block and type the text **PM2.5Emissions**. This specifies the y-axis variable.



**Step 4 - Adding colors to each cluster**

Finally, in the last Quote block, type **cluster**. This is the name of the column that holds the predicted clusters, so each cluster is shown in its own color.

From the TEXT menu, drag the Quotes. Type the text **‘cluster’**.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

![](https://pbs.twimg.com/media/GaOPZxQWEAEkoxU?format=png&name=360x360)


#### Freehand


**Step 1 -  Call the scatter function from plotly**

To make a scatterplot, we first need to call the scatter function with our plotly library (px).

Type **px.scatter** followed by parentheses. This calls the scatter function from the Plotly Express library (imported as "px" earlier).

`px.scatter()`



**Step 2 -  Saying what data to use for the scatter plot**

To make a plot, we need to choose which data to plot. In this case, our dataset is stored in the dataframe **dataframe**.

`px.scatter(dataframe)`




**Step 3 - Tell plotly what columns to put on the axes**

Inside the scatter() method, we add our parameters for the plot:
- **dataframe** The DataFrame containing the data to plot.
- '**CO2Emissions**' The column to be plotted on the x-axis, representing the CO₂ emission levels.
- '**PM2.5Emissions**' The column to be plotted on the y-axis, representing PM2.5 emission levels.

`px.scatter(dataframe,'CO2Emissions','PM2.5Emissions')`



**Step 4 - Adding colors to each cluster**

Finally, let’s add **cluster** as the last parameter. It names the column that holds the predicted clusters, so each cluster is shown in its own color.

- '**cluster**' The column used to color each dot according to its cluster.

`px.scatter(dataframe,'CO2Emissions','PM2.5Emissions','cluster')`


**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

![](https://pbs.twimg.com/media/GaOPc3IXgAA7qxy?format=png&name=small)


**Your Turn**: We’ve created our new cluster variable. Let’s jump in and see what our scatterplot looks like and whether we can find any interesting patterns.


In [13]:
px.scatter(dataframe,'CO2Emissions','PM2.5Emissions','cluster')




**Explanation**: *The scatter plot shows how CO2 and PM2.5 emissions are grouped into four clusters, each represented by a different color. The trend shows that as CO2 emissions increase, PM2.5 emissions generally decrease, but not in a straightforward way. The clusters reveal distinct groups: the orange cluster has high PM2.5 emissions and lower CO2 emissions, while the purple and blue clusters show lower PM2.5 emissions at varying levels of CO2. This clustering helps us understand that emissions don’t follow a single pattern but instead form different groups, which can give insights into the sources or types of emissions*.

### Goal 11: Create scatterplots using different variables to compare.

We just made one scatterplot. Now let’s try comparing some new variables. How about methane?

#### Blockly

**Step 1 -  Call the scatter function from plotly**

To make a scatterplot, we first need to call the scatter function with our plotly library (px).

From the Variables menu drag a DO block for the **px** variable. Select the "**scatter**" function. This specifies the function we want to call, which is the scatter function from the Plotly Express library (imported as "px" earlier).



**Step 2 -  Saying what data to use for the scatter plot**

To make a plot, we need to choose which data to plot. In this case, our dataset is stored in the dataframe **dataframe**.

For the first argument, drag the **dataframe** variable from the Variables menu. This tells the scatter function which dataframe to look at.



**Step 3 -  Tell plotly what columns to put on axis**

Identify the two variables you want to look at. One variable will be alongside the X-axis (*across*) and another one alongside the Y-axis (*up and down*). In our context, we want to see the relationship between **CO2Emissions** and **Methane**. We will assign the variables to the 2 axes in the graph.

From the TEXT menu, drag the Quotes. Type the text **CO2Emissions**. This specifies CO2Emissions as the x-axis variable for the scatter plot. Then, from the TEXT menu, drag another Quotes block and type the text **Methane**. This specifies the y-axis variable.



**Step 4 - Adding colors to each cluster**

Finally, in the last Quote block, type **cluster**. This is the name of the column that holds the predicted clusters, so each cluster is shown in its own color.

From the TEXT menu, drag the Quotes. Type the text ‘cluster’.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

![](https://pbs.twimg.com/media/GbbFJzhagAAiy0x?format=png&name=900x900)

#### Freehand


**Step 1 -  Call the scatter function from plotly**

To make a scatterplot, we first need to call the scatter function with our plotly library (px).

Type **px.scatter** followed by parentheses. This calls the scatter function from the Plotly Express library (imported as "px" earlier).

`px.scatter()`



**Step 2 -  Saying what data to use for the scatter plot**

To make a plot, we need to choose which data to plot. In this case, our dataset is stored in the dataframe **dataframe**.

`px.scatter(dataframe)`



**Step 3 - Tell plotly what columns to put on the axes**

Inside the scatter() method, we add our parameters for the plot:
- **dataframe** The DataFrame containing the data to plot.
- '**CO2Emissions**' The column to be plotted on the x-axis, representing the CO₂ emission levels.
- '**Methane**' The column to be plotted on the y-axis, representing methane emission levels.

`px.scatter(dataframe,'CO2Emissions','Methane')`



**Step 4 - Adding colors to each cluster**

Finally, let’s add **cluster** as the last parameter. It names the column that holds the predicted clusters, so each cluster is shown in its own color.

- '**cluster**' The column used to color each dot according to its cluster.

`px.scatter(dataframe,'CO2Emissions','Methane','cluster')`



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

![](https://pbs.twimg.com/media/GbbFGQoaIAAvAdq?format=jpg&name=medium)

**Your Turn**: Let’s try another one! This time, let’s create a scatterplot, but let’s tweak the variables we are looking at.


In [14]:
px.scatter(dataframe,'CO2Emissions','Methane','cluster')

**Explanation**: *The scatter plot shows the relationship between CO2 emissions and Methane emissions, with data points grouped into four clusters. Each color represents a different cluster, indicating that certain ranges of CO2 and Methane emissions are commonly found together. For example, the orange cluster has higher CO2 and Methane levels, while the purple cluster has lower levels of both. The clustering helps us see that emissions don’t follow a single pattern; instead, there are different groups with distinct levels of CO2 and Methane. This kind of grouping can be useful for identifying different sources or types of pollution*.
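A plot is one way to compare clusters. Another is to summarize each cluster with numbers. The sketch below is not part of the lesson's steps; it assumes pandas' `groupby` and uses a hypothetical `toy_dataframe` built from four rows of the output shown in Goal 9, so it does not reproduce the results for the full dataset.

```python
import pandas as pd

# Four rows borrowed from the Goal 9 output preview (two from each cluster shown).
toy_dataframe = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "CO2Emissions": [2431, 2433, 2037, 2008],
    "Methane": [670, 641, 862, 917],
})

# Group the rows by cluster label and average each column within the group.
cluster_means = toy_dataframe.groupby("cluster")[["CO2Emissions", "Methane"]].mean()
print(cluster_means)
```

Each row of `cluster_means` is one cluster, which makes it easy to say in numbers what "higher" or "lower" means for that group.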

### Goal 12: Create even more scatterplots using different variables to compare

So how about we try even more mixing and matching of variables to compare!?

#### Blockly


**Step 1 -  Call the scatter function from plotly**

To make a scatterplot, we first need to call the scatter function with our plotly library (px).

From the Variables menu drag a DO block for the **px** variable. Select the "**scatter**" function. This specifies the function we want to call, which is the scatter function from the Plotly Express library (imported as "px" earlier).



**Step 2 -  Saying what data to use for the scatter plot**

To make a plot, we need to choose which data to plot. In this case, our dataset is stored in the dataframe **dataframe**.

For the first argument, drag the **dataframe** variable from the Variables menu. This tells the scatter function which dataframe to look at.



**Step 3 -  Tell plotly what columns to put on the axis**

Identify the two variables you want to look at. One variable will be alongside the X-axis (*across*) and another one alongside the Y-axis (*up and down*). In our context, we want to see the relationship between **Methane** and **PM2.5Emissions**. We will assign the variables to the 2 axes in the graph.

From the TEXT menu, drag the Quotes. Type the text **Methane**. This specifies Methane as the x-axis variable for the scatter plot. Then, from the TEXT menu, drag another Quotes block and type the text **PM2.5Emissions**. This specifies the y-axis variable.



**Step 4 - Adding colors to each cluster**

Finally, in the last Quote block, type **cluster**. This is the name of the column that holds the predicted clusters, so each cluster is shown in its own color.

From the TEXT menu, drag the Quotes. Type the text **‘cluster’**.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!

![](https://pbs.twimg.com/media/GbbEQk-acAAD_0T?format=png&name=900x900)
    


#### Freehand


**Step 1 -  Call the scatter function from plotly**

To make a scatterplot, we first need to call the scatter function with our plotly library (px).

Type **px.scatter** followed by parentheses. This calls the scatter function from the Plotly Express library (imported as "px" earlier).

`px.scatter()`



**Step 2 -  Saying what data to use for the scatter plot**

To make a plot, we need to choose which data to plot. In this case, our dataset is stored in the dataframe **dataframe**.

`px.scatter(dataframe)`



**Step 3 - Tell plotly what columns to put on the axes**

Inside the scatter() method, we add our parameters for the plot:
- **dataframe** The DataFrame containing the data to plot.
- '**Methane**' The column to be plotted on the x-axis, representing methane emission levels.
- '**PM2.5Emissions**' The column to be plotted on the y-axis, representing PM2.5 emission levels.

`px.scatter(dataframe,'Methane','PM2.5Emissions')`



**Step 4 - Adding colors to each cluster**

Finally, let’s add **cluster** as the last parameter. It names the column that holds the predicted clusters, so each cluster is shown in its own color.

- '**cluster**' The column used to color each dot according to its cluster.

`px.scatter(dataframe,'Methane','PM2.5Emissions','cluster')`



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

![](https://pbs.twimg.com/media/GbbENvkawAAuixa?format=jpg&name=medium)


**Your Turn**: You are probably feeling pretty good about this by now. Let’s run a scatterplot for these variables and see what we find!


In [15]:
px.scatter(dataframe,'Methane','PM2.5Emissions','cluster')

**Explanation**: *The scatter plot shows how data points are grouped into four clusters based on two variables: Methane and PM2.5 emissions. Each color represents a different cluster, and we can see that as Methane emissions increase, PM2.5 emissions also tend to increase. However, the clustering suggests that there are different "types" or patterns within this trend. For example, points in the purple cluster have lower Methane and PM2.5 emissions, while points in the orange cluster have higher emissions of both. This clustering helps us see natural groupings in the data, which can be useful for understanding patterns in emissions*.
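If you want to back up a statement like "as Methane increases, PM2.5 also tends to increase" with a number, one option is a correlation. The sketch below is an optional extra that assumes pandas' `Series.corr` (Pearson correlation by default). The hypothetical `toy_dataframe` holds only the nine rows visible in the Goal 9 output preview, so the value it prints describes those rows and not the whole dataset.

```python
import pandas as pd

# The nine rows visible in the Goal 9 output preview.
toy_dataframe = pd.DataFrame({
    "Methane": [670, 641, 642, 640, 616, 862, 917, 925, 928],
    "PM2.5Emissions": [1252, 1156, 1159, 1105, 1192, 1564, 1571, 1582, 1587],
})

# A value near +1 means the two columns rise together,
# near -1 means one falls as the other rises, and near 0 means no linear trend.
correlation = toy_dataframe["Methane"].corr(toy_dataframe["PM2.5Emissions"])
print(round(correlation, 2))
```

A single correlation describes the overall trend only. The colors in the scatterplot are still what show the separate groups inside that trend.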

## WHAT DID YOU LEARN?


In this lesson, we’ve covered the ins and outs of [clustering](## "a group of data points that are more similar to each other than to points in other clusters"). We’ve explored different methods that we can use to find patterns and group similar things together. We also explored the differences between hierarchical clustering and non-hierarchical clustering. Remember, agglomerative hierarchical clustering starts by combining small groups of data until there’s one big cluster. Non-hierarchical clustering, such as k-means, starts with seed points and builds groups around them. Regardless of the method, clustering is great because it can help us solve real problems by showing us how data fits together!



## WHAT’S NEXT?



[KNN Classification](KNN_Classification.ipynb)



## TELL ME MORE




- [Datawhys Clustering Notebook](https://saturn.olney.ai/user/ehunter1/lab/tree/datawhys-content-notebooks-python/Clustering.ipynb)
- [Datawhys Clustering Problem-Solving Notebook](https://saturn.olney.ai/user/ehunter1/lab/tree/datawhys-content-notebooks-python/Clustering-PS.ipynb)
