# Measures of Association

## ANTICIPATED TIME


2 hours


## BEFORE YOU BEGIN

[Descriptive Statistics](Descriptive_Statistics.ipynb)

## WHAT YOU WILL LEARN



- How should you determine the relationship between two variables?
- What are Pearson’s correlation coefficient and Spearman’s rank correlation coefficient, and how are they used?
- How is correlation different from causation?
- How do you calculate correlations between variables and create heatmaps for visualization?
- How do you interpret correlation coefficients using data science?
- How do you create a contingency table to make categorical associations?

## DEFINITIONS YOU’LL NEED TO KNOW


- Correlations - seeing how two different variables are related.
- Correlation coefficient - measures the strength of a linear relationship.
- Ratio data - numerical data with a meaningful zero point.
- Pearson’s correlation coefficient - measures linear relationships between two variables.
- Heatmap - a visual representation that uses colors to show the magnitude of values in a dataset.
- Spearman’s rank correlation coefficient - computes a measure of association using ranks.
- Ordinal data - categorical data with a natural order or ranking.
- Contingency table - a type of table used to show the frequency distribution of variables.



## SCENARIO:


Kiana is a new member of the afterschool program that Ethan and Diego are involved in. They have all shared their ideas for solving the traffic and pollution problems. Diego has been collecting different types of data to help Angelina and the group understand the variables causing pollution in the city. Angelina and Diego have both said they want to understand the data they’ve collected better by looking at the middle points of the samples. Since joining the group, Kiana has been brainstorming ways to see whether two variables are related and follow similar patterns. For example, Diego suggested that Kiana look at the number of cars and their emissions levels. Kiana, however, suggests looking at whether the number of sidewalks or bike lanes is related to pollution levels. She also wonders whether there are other variables, in addition to the sidewalks and bike lanes, that could cause pollution levels to be low.


## WHAT DO I NEED TO KNOW?



We looked at the types of variables in Data Science and Nature of Data. So let’s assume that you have a hunch that two variables are related. Maybe they both go up in kinda the same direction? Or maybe one goes up while the other seems to go down? Can we go beyond a ‘hunch’ and actually measure whether they are related?

You’re in luck because that’s what we are going to look at!

A couple of things.

1. Just because a correlation shows that two things are related, that doesn’t necessarily mean one causes the other. It only tells us the strength of the relationship and the direction it’s moving.

2. As a general rule, a correlation below 0.10 (ignoring the sign) is considered a weak association between the variables, values between 0.10 and 0.30 are considered a medium association, and values above 0.30 are evidence of a strong association. Cutoffs like these are only rules of thumb and vary from field to field.
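The rule of thumb above can be written as a small helper. This is only a sketch: the function name `describe_strength` is made up for this example, and it applies the cutoffs to the absolute value so that negative correlations are labeled by strength too.

```python
def describe_strength(r):
    """Label a correlation coefficient using the rule of thumb above."""
    size = abs(r)  # the sign only gives the direction, so ignore it here
    if size < 0.10:
        return "weak"
    elif size <= 0.30:
        return "medium"
    else:
        return "strong"


# Try it on a few example values
for r in [0.05, -0.25, 0.87]:
    print(r, describe_strength(r))
```

Keeping the sign separate from the strength makes it easier to talk about both: -0.25 is a medium association that happens to run in the negative direction.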




## YOUR TURN


Now that you understand measures of association and the different ways these methods can show us relationships between data, it’s time to practice applying this concept! As you go through this, pay close attention to the different relationships that come from the data.


## Correlations for Ratio Data


So let’s say we want to look at two ratio variables. How do we know if something is actually related?


Welp, the **Pearson correlation coefficient (r)** is a number that lets us know how two things are related. To make it even easier, we can just think of this number as going from -1 to 1. If the correlation is above 0 and up to 1, it means the two things have a positive correlation and move in the same direction. In other words, as one variable goes up, the other also goes up. An example might be the number of minutes spent exercising and the amount of water you drink.


If the correlation is below 0 and goes down to -1, it means the two things are still related but have a negative correlation because they move in opposite directions. Sticking with our exercise example, more minutes spent biking might be related to lower weight.


And how about 0? That is simple - it just means the two things have no linear relationship, and we could say there is no correlation. Maybe think of the minutes spent exercising and the number of times you stub your toe.


Huh?


Since we probably wouldn’t expect to see any relationship there, we would expect it to be r = 0.
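To see where r comes from, here is a minimal sketch that works it out by hand and then checks the answer against pandas. The numbers are made up for illustration, and the column names `minutes_exercising` and `cups_of_water` are not part of the lesson's dataset.

```python
import pandas as pd

# Made-up numbers, for illustration only
exercise = pd.DataFrame({
    "minutes_exercising": [10, 20, 30, 40, 50],
    "cups_of_water": [2, 3, 5, 6, 8],
})

x = exercise["minutes_exercising"]
y = exercise["cups_of_water"]

# How far each value sits from its column's average
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson's r: how much x and y vary together,
# divided by how much each one varies on its own
r_by_hand = (x_dev * y_dev).sum() / ((x_dev ** 2).sum() * (y_dev ** 2).sum()) ** 0.5

# pandas does the same calculation in one call
r_pandas = x.corr(y)

print(r_by_hand, r_pandas)
```

Both numbers should match and sit close to 1, because in this made-up data the water goes up almost every time the minutes go up.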


### Goal 1: Importing the Pandas Library

Need extra tools to help solve this problem? Well, we can bring in extra ‘libraries’ to help us do extra data science stuff. You can think of it as an ‘add-on’. In this case, we bring in pandas, which is a popular library for doing data science stuff.

#### Blockly


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **package** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **pd**, and we type it in the open area.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GcxNkjYXkAAlpKD?format=png&name=240x240)
</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, the “command” is “import”, which brings the add-on package in.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out **pandas**, which will bring in some cool data manipulation features.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **‘pd’**.



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZmkVCYWEA4oGso?format=jpg&name=small) 
</details>

**Your Turn**: Now it’s your turn! We’re going to dive into the pandas package, which helps us with some really cool data science things. First, let’s import the package and assign it to the variable “pd” to make it easier to use throughout our notebook.

In [3]:
#freehand code 


**Explanation**: *Congrats!  Your attempts finally made it!  Now you have successfully imported the "pandas" package as the variable "pd"*.
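As a quick check that the import worked, a sketch like the one below prints the installed version. The name `pd` is the short name used in this lesson, and the version number you see will depend on your own setup.

```python
import pandas as pd  # "pd" is the short name used throughout this lesson

# If the import worked, this prints the installed pandas version
print(pd.__version__)
```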

### Goal 2: Bringing in the Dataframe

Let’s bring in the data that we want to look at.


#### Blockly


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable and call it **dataframe**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

In Blockly, bring in the VARIABLES menu.



**Step 2 - Assign the dataframe to the variable you created**

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.

In Blockly, go to the Variables and drag the Set block for the **dataframe** variable. This will allow us to assign the result of a function call to the variable.  



**Step 3 - Bringing in the data**

Now we need to look at the file that has all our data.
To load our dataframe, we’ll use a simple command to bring in the file we need (CSV stands for Comma-Separated Values). Let’s say we have a file called ‘AirQuality.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **dataframe**.

From the VARIABLES menu, drag a DO block using the **pd** variable and choose the do operation **read_csv**. The read_csv function reads a CSV file and returns a DataFrame object.

In our case, let’s bring in the “datasets/AirQuality.csv” (use the Quotes from the TEXT menu) because that is what Angelina is working with.



**Step 4 - Display the variable**

Let’s see it now by ‘displaying’ and showing our work.

Drag the **dataframe** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, as it allows us to see the variable in the Blockly workspace.



**Step 5 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaNM5olXIAAzaAB?format=png&name=small)
</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Write out the variable name you want to use**

Now that we’re all set with our new package to help us to do cool things, let’s bring the data into a variable called **dataframe**. Think of it as a digital spreadsheet with much more power to analyze and manipulate the data!

Just like we did before, let’s type out a variable name. Rather than type out the full file name for our data, this easy to remember name will hold the data we bring in.



**Step 2 - Assign the dataframe to the variable you created**

Next, type an equals sign (=) after the variable name. Whatever we put on the right-hand side of the equals sign will be stored in **dataframe**.



**Step 3 - Bring in the data**

Now we need to look at the file that has all our data.

To load our dataframe, we’ll use a simple command to bring in the file we need (CSV stands for Comma-Separated Values). Let’s say we have a file called ‘AirQuality.csv’ in the folder **‘datasets’**. We’re telling Python to read the CSV file and store it in a variable called **dataframe**. To do this, we call the function `pd.read_csv`, which reads the CSV file. This variable is now our dataframe!

In our case, let’s bring in “datasets/AirQuality.csv” (with quotes around the file path) because that is what Kiana is working with.



**Step 4 - Print the variable**

Let’s see it now by ‘printing’ and showing our work. Retype the variable name on the line underneath the code, and the notebook will display the data. In this case, we will type out the variable name **dataframe**.



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>


![](https://pbs.twimg.com/media/GaNNf6QXIAAhIEA?format=png&name=small)
</details>

**Your Turn**: Now it’s your turn! Let’s dive in and start working with the data! We’ll begin by loading it into a dataframe, which will allow us to easily interact with and analyze the dataset.

In [4]:
#freehand code 


Unnamed: 0,TrafficVolume,AverageSpeed,CO2Emissions,NoiseLevel,TrafficCondition,Road Type
0,2222.222222,75,33.898305,33.75,Free Flow,Rural
1,1666.666667,50,33.898305,33.75,Free Flow,Rural
2,1111.111111,60,25.423729,33.75,Free Flow,Rural
3,833.333333,55,42.372881,33.75,Free Flow,Rural
4,1944.444444,80,33.898305,33.75,Free Flow,Rural
...,...,...,...,...,...,...
145,6666.666667,50,355.932203,112.50,Heavy,City
146,5555.555556,25,338.983051,97.50,Heavy,City
147,6111.111111,50,355.932203,101.25,Heavy,City
148,5277.777778,70,372.881356,112.50,Heavy,City


**Explanation**: *Easy-peasy! Just like before, you’ve brought in the dataset*.

*The dataset contains traffic measurements, with one row per observation. The first column, Unnamed: 0, is just a row counter. The other columns shown above are traffic volume, average speed, CO2 emissions, and noise level (all numeric), plus two categorical columns: traffic condition (such as Free Flow or Heavy) and road type (such as Rural or City). It can be used to explore how traffic patterns relate to pollution and noise*.
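You can try `read_csv` without having the file on disk. The sketch below copies a few rows from the output above into a text string and reads that instead. The variable names `sample_csv` and `sample` are made up for this example, and the blank first header field is an assumption about how the original file is laid out (pandas names such a column `Unnamed: 0`).

```python
import io

import pandas as pd

# A few rows copied from the output above, stored as text so no file is needed
sample_csv = """,TrafficVolume,AverageSpeed,CO2Emissions,NoiseLevel,TrafficCondition,Road Type
0,2222.222222,75,33.898305,33.75,Free Flow,Rural
1,1666.666667,50,33.898305,33.75,Free Flow,Rural
145,6666.666667,50,355.932203,112.50,Heavy,City
146,5555.555556,25,338.983051,97.50,Heavy,City
"""

# read_csv accepts a file path or any file-like object such as StringIO
sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # which columns pandas treats as numbers and which as text
print(sample.head())
```

Checking `dtypes` right after loading is a handy habit: it shows which columns are numeric and can take part in a correlation, and which are text categories.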


### Goal 3: Calculate Correlations

Now that we’ve brought in Pandas and have our dataset loaded, let’s see if we can find any relationships between the variables we can calculate.

#### Blockly


**Step 1 - Write out the variable name you want to use for the correlation matrix**

Create a new variable named **corrMatrix** and drag it into the VARIABLES menu. This variable will hold the correlation matrix calculated from the dataframe.



**Step 2 - Call the corr() method of the dataframe**

To get the correlation of variables, we first need to call the corr() method to help calculate the correlations. We’ll apply it to the data we have in our **dataframe** variable.

From the VARIABLES menu, set the **corrMatrix** variable to the result of WITH the dataframe variable DO the **corr**() function.



**Step 3 - Run correlations for only numeric values**

We tell our code **numeric_only=True** (FREESTYLE menu), which means we only want to consider numeric columns in the calculation.



**Step 4 - Assign the correlation matrix to the variable you created**

Finally, we need to store the correlation in a variable for later use. Here, we store the correlation in the variable **corrMatrix**.

We will do this through Set **corrMatrix** variable to the dataframe DO corr block.



**Step 5 - Display the correlation matrix**

Lastly, we display the correlation variable. To do this, we drag the **corrMatrix** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, allowing us to see the variable in the Blockly workspace.



**Step 6 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaNORqEXEAAJDF8?format=png&name=small)
</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Call the corr() method of the dataframe**

To get the correlation of variables, we first need to call the corr() method to help calculate the correlations. We’ll apply it to the data we have in our **dataframe** variable.

`dataframe.corr()`


**Step 2 - Run correlations for only numeric values**

We pass **numeric_only=True** to the method, which means we only want to consider numeric columns in the calculation.

`dataframe.corr(numeric_only=True)`



**Step 3 - Assign the correlation matrix to the variable you created**

Finally, we need to store the correlation in a variable for later use. Here, we store the correlation in the variable **corrMatrix**.

`corrMatrix = dataframe.corr(numeric_only=True)`



**Step 4 - Print the correlation matrix**

Lastly, we print the correlation variable.

`print(corrMatrix)`



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZpDrBdX0AUY5IP?format=png&name=small)
</details>

**Your Turn**: Now it’s your turn! Let’s start calculating! We’ll begin by calculating the correlation matrix and assigning it to a variable, which will allow us to print the result.

In [5]:
#freehand code 


               TrafficVolume  AverageSpeed  CO2Emissions  NoiseLevel
TrafficVolume       1.000000     -0.109369      0.871754    0.817954
AverageSpeed       -0.109369      1.000000     -0.420516   -0.356544
CO2Emissions        0.871754     -0.420516      1.000000    0.962757
NoiseLevel          0.817954     -0.356544      0.962757    1.000000


**Explanation**: *The correlation matrix reveals several interesting relationships between the variables. Traffic Volume is strongly positively correlated with CO2 Emissions (0.87) and Noise Level (0.82). This means that as traffic volume increases, emissions and noise pollution tend to increase too*.

*On the other hand, Average Speed is negatively correlated with CO2 Emissions (-0.42) and Noise Level (-0.36), suggesting that higher speeds are associated with lower emissions and noise levels*.

*Also, CO2 Emissions and Noise Level are very strongly correlated (0.96), implying that they may be driven by similar factors. Overall, the correlations are consistent with traffic volume being a key driver of environmental pollution and with average speed having a mitigating effect, but remember that correlation alone cannot prove what causes what*.
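Once you have a correlation matrix, you can pull out a single column and sort it to see which variables go most closely with one you care about. The sketch below uses only the nine rows displayed earlier in this notebook, so its numbers will differ from the full-dataset matrix above. The names `sample` and `co2` are made up for this example.

```python
import pandas as pd

# The nine rows shown in the dataframe preview earlier (not the full dataset)
sample = pd.DataFrame({
    "TrafficVolume": [2222.222222, 1666.666667, 1111.111111, 833.333333, 1944.444444,
                      6666.666667, 5555.555556, 6111.111111, 5277.777778],
    "AverageSpeed": [75, 50, 60, 55, 80, 50, 25, 50, 70],
    "CO2Emissions": [33.898305, 33.898305, 25.423729, 42.372881, 33.898305,
                     355.932203, 338.983051, 355.932203, 372.881356],
    "NoiseLevel": [33.75, 33.75, 33.75, 33.75, 33.75, 112.50, 97.50, 101.25, 112.50],
    "Road Type": ["Rural"] * 5 + ["City"] * 4,
})

# numeric_only=True leaves the text column "Road Type" out of the calculation
corrMatrix = sample.corr(numeric_only=True)

# Take the CO2 column, drop its correlation with itself, and sort high to low
co2 = corrMatrix["CO2Emissions"].drop("CO2Emissions").sort_values(ascending=False)
print(co2)
```

Sorting one column like this is often easier to read than scanning the whole matrix, especially when a dataset has many numeric columns.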

### Goal 4: Import the Plotly.Express Library

We’ve already brought pandas to help with data science. Let’s bring in Plotly Express to help with some fancy-pants visualizations.

#### Blockly

**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, “command” is to “import” to bring the add-on package in.

Bring in the IMPORT menu, which can be helpful to bring in other data tools. In this case, we're bringing in the **import** block.




**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out the **plotly.express** library. Plotly is a popular library in Python that provides functions for fancy-pants data visualizations.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the **import** and **plotly.express** together in a single variable. This handy feature helps cut down on all the typing later on. You can call it whatever is easiest for you to remember. In the example below, we’ve put everything into **px** so it’s easier to remember.



**Step 4 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaON4A9XsAANFz2?format=png&name=240x240)
</details>

In [1]:
#blocks code


#### Freehand


**Step 1 - Starting the import**

First, we need to set up a “command” to tell the computer what to do. In this case, “command” is to “import” to bring the add-on package in.



**Step 2 - Telling what library to import**

In the text area, we type the name of the library we want to import. A library is like an extra thing we bring in to give us more coding abilities. In our case, we will type out the **plotly.express** library. Plotly is a popular library in Python that provides functions for fancy-pants data visualizations.



**Step 3 - Renaming the library so it’s easy to remember**

Once you are done, put the ‘import’ and ‘package’ together in a single variable. This handy feature helps cut down on all the typing later on. Feel free to use whatever name you want that will help you remember it later on. In the example below, we’ve put everything into **px**.



**Step 4 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaON1ppW0AIrmJh?format=png&name=360x360)
</details>

**Your Turn**: Now it’s your turn! We’re going to bring in Plotly Express to help with those visualizations. Let’s start the import, making sure to rename it before we run it.

In [6]:
#freehand code 


**Explanation**: *Congrats! Your attempts finally made it! Now you have successfully imported the **plotly.express** package as the variable **px**.*

### Goal 5: View the heatmap of the correlation matrix.

We’ve run the correlations and seen the numbers. How about we look at a heatmap that uses colors to show how strongly the variables are correlated with each other?

#### Blockly


**Step 1 - Call the image show function from plotly**

To make a heatmap, we first need to call the imshow() function with our plotly library (px).

From the VARIABLE menu, drag the DO block for the **px** variable, which is a reference to the plotly library.




**Step 2 - Tell what data to use for image show to create the heatmap**

Inside the method, we provide the correlation matrix to show. In this case, our correlation matrix is **corrMatrix**.

To do that in the VARIABLES menu, we will tell the **imshow** function to look at the **corrMatrix** variable.



**Step 3 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>


![](https://pbs.twimg.com/media/GZoflMaWEAUJ1fw?format=png&name=360x360)
</details>

In [None]:
#blocks code


#### Freehand

**Step 1 - Call the Image show function from plotly**

To make a heatmap, we first need to call the imshow() function with our plotly library (px).

`px.imshow()`



**Step 2 - Tell what data to use for image show to create the heatmap**

Inside the method, we provide the correlation matrix to show. In this case, our correlation matrix is **corrMatrix**.

`px.imshow(corrMatrix)`



**Step 3 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZpFgE_W8AMWBhq?format=png&name=360x360)
</details>

**Your Turn**: Now you give it a try! Create your own heatmap that will give you a visual that shows the relationship between the different values in the data.

In [1]:
#freehand code 


**Explanation**: *The heatmap generated by px.imshow(corrMatrix) visually represents the correlation matrix, providing a clear and intuitive illustration of the strong positive relationships between Traffic Volume and CO2 Emissions/Noise Level, as well as the negative relationships between Average Speed and these pollutants*.

## Correlations for Ordinal Data



What happens if the variable types are a bit different? For ordinal data, there are measures of correlation based on ranks. We’ll work through this by replacing the original values with their ranks.

*Relationship between Ordinal Data*

**Spearman’s rank correlation coefficient** uses rank numbers to help us find out how different ordinal variables are related. Remember **Pearson’s correlation coefficient** from before? We can't use Pearson correlation for ordinal data directly, but we can make it work with a trick: we convert the ordinal values to ranks.

For example, if our ordinal data is coded as small, medium, and large, we can recode it in ranks as 1, 2, 3. Magic! Then we can calculate Pearson correlation with these ranks. This idea is so useful it has a special name, Spearman's rank correlation coefficient. You can use Spearman's for ordinal variables or even for numeric variables with wild distributions.
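Here is a minimal sketch of that trick, using made-up data about drink sizes and prices (the `drinks` dataframe and its column names are not part of the lesson's dataset). It recodes the sizes as 1, 2, 3, then checks that asking pandas for Spearman gives the same answer as ranking the columns and using Pearson.

```python
import pandas as pd

# Made-up data, for illustration only
drinks = pd.DataFrame({
    "cup_size": ["small", "medium", "large", "small", "large", "medium"],
    "price": [2.00, 2.75, 3.50, 2.10, 3.25, 3.00],
})

# Recode the ordered categories as numbers that respect the order
size_order = {"small": 1, "medium": 2, "large": 3}
drinks["size_code"] = drinks["cup_size"].map(size_order)

codes = drinks[["size_code", "price"]]

# Spearman's coefficient, asked for directly
spearman = codes.corr(method="spearman")

# The same thing in two steps: rank each column, then use Pearson
pearson_on_ranks = codes.rank().corr(method="pearson")

print(spearman)
print(pearson_on_ranks)
```

The two printed matrices should match, which is the whole idea: Spearman is Pearson applied to ranks. Tied values (two "small" cups, for example) share an average rank.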



### Goal 6 - Calculate Correlations for Ordinal Data

In our dataset, we have all the different kinds of data types that we talked about in our Data Science and the Nature of Data notebook. As we look to help Kiana, let’s see if we can explore other relationships.

By default, the Pearson correlation method is used when we call the corr() method, which is really helpful when we have ratio data.

But that’s not what we use when it comes to ordinal data. In order to use the Spearman correlation method, we can use the “method” parameter to define which correlation method we want to use.


#### Blockly


**Step 1 - Write out the variable name you want to use for the correlation matrix**

Create a new variable named **corrMatrix** and drag it into the VARIABLES menu. This variable will hold the correlation matrix calculated from the dataframe.



**Step 2 - Call the corr() method of the dataframe**

To get the correlation of variables, we first need to call the corr() method to help calculate the correlations. We’ll apply it to the data we have in our **dataframe** variable.

From the VARIABLES menu, set the **corrMatrix** variable to the result of WITH the **dataframe** variable DO the corr function.



**Step 3 - Add spearman correlation method in corr() function**

Then, we pass the method argument to the corr function with the value 'spearman'. This specifies the correlation method to use: **method='spearman'**. To do this, use the FREESTYLE menu.



**Step 4 - Run correlations for only numeric values**

We tell our code **numeric_only=True** (FREESTYLE menu), which means we only want to consider numeric columns in the calculation.



**Step 5 - Assign the correlation matrix to the variable you created**

Finally, we need to store the correlation into a variable for later use. Here, we store the correlation into the variable **corrMatrix**.

We will do this through Set **corrMatrix** variable to the **dataframe** DO **corr** block.



**Step 6 - Display the correlation matrix**

Lastly, we display the correlation variable. To do this, we drag the **corrMatrix** variable to the workspace, making it available for further use in our program. This step is more of a visualization step, allowing us to see the variable in the Blockly workspace.



**Step 7 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaNPPz1XEAABJXF?format=png&name=small)
</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Call the corr() method of the dataframe**

To get the correlation of variables, we first need to call the corr() method to help calculate the correlations. We’ll apply it to the data we have in our **dataframe** variable.

`dataframe.corr()`



**Step 2 - Run correlations for only numeric values**

The method takes the parameter **numeric_only=True**, which means showing correlations for only numeric values.

`dataframe.corr(numeric_only=True)`



**Step 3 -  Add spearman correlation method in corr() function**

Then, we pass the method argument to the corr function with the value 'spearman'. This specifies the correlation method to use: **method='spearman'**.

`dataframe.corr(method="spearman", numeric_only=True)`






**Step 4 - Assign the correlation matrix to the variable you created**

Finally, we need to store the correlation in a variable for later use. Here, we store the correlation in the variable **corrMatrix**.

`corrMatrix = dataframe.corr(method="spearman", numeric_only=True)`



**Step 5 - Print the correlation matrix**

Lastly, we print the correlation variable.

`print(corrMatrix)`



**Step 6 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!

<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GZpGTe1W0Ac89mL?format=png&name=small)

</details>

**Your Turn**: Now it’s your turn! Start by applying the Pearson correlation method to the data, and after that you’ll add Spearman’s method into the mix!


In [8]:
#freehand code 


               TrafficVolume  AverageSpeed  CO2Emissions  NoiseLevel
TrafficVolume       1.000000     -0.159457      0.881386    0.834421
AverageSpeed       -0.159457      1.000000     -0.303421   -0.277511
CO2Emissions        0.881386     -0.303421      1.000000    0.936003
NoiseLevel          0.834421     -0.277511      0.936003    1.000000


**Explanation**: *The Spearman rank correlation matrix reveals that Traffic Volume is strongly and positively correlated with CO2 Emissions (0.88) and Noise Level (0.83), indicating a monotonic increasing relationship between these variables. In contrast, Average Speed is negatively correlated with CO2 Emissions (-0.30) and Noise Level (-0.28), suggesting that emissions and noise levels tend to decrease as speed increases. The strong correlation between CO2 Emissions and Noise Level (0.94) implies that these two variables are closely related and may be driven by similar factors. Overall, the correlations are consistent with Traffic Volume being a crucial driver of environmental pollution and with Average Speed having a mitigating effect, though correlation alone cannot prove cause*.
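The Pearson and Spearman numbers for the same pair of columns are usually close but not identical. One reason is that Pearson looks for a straight-line pattern, while Spearman only asks whether one variable keeps rising (or falling) as the other rises. The sketch below shows this with made-up data that follows a curve; the `curve` dataframe is invented for this example.

```python
import pandas as pd

# Made-up data: y always rises with x, but along a curve rather than a line
curve = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8]})
curve["y"] = curve["x"] ** 3

pearson = curve.corr(method="pearson").loc["x", "y"]
spearman = curve.corr(method="spearman").loc["x", "y"]

print("Pearson: ", pearson)
print("Spearman:", spearman)
```

Spearman comes out at 1 here because the order of the values matches perfectly, while Pearson comes out lower because the points do not fall on a straight line.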


## Correlations for Categorical Data



We’ve talked about associations between numbers, but what if we wanted to know how data in specific categories are related? We have a few tools, called tables, that we can use to understand these relationships. The first is a **contingency table**, which counts how often each combination of categories occurs. Contingency tables are easiest to read when each variable has only a few possible outcomes, such as the two-outcome example below.

|             | **Pass**    | **Fail**    |
| ----------- | ----------- | ----------- |
| Attended    | 25          | 5           |
| Skipped     | 5           | 10          |



To understand the relationship between the two categories, we need to know if there actually is one. For example, if we wanted to know if skipping a class (one category) has an effect on passing the same class (another category), we could put this in a table - passing/failing and skipping/attending. We would use the number of students who skipped and passed and those who skipped and failed and compare these numbers to the students who attended and passed and those who attended and failed. When we add these numbers together and compare them, the totals can help us predict what will happen when other students skip or attend class.

So can we use data science to measure that relationship? This is where the **chi-square statistic** comes in handy. Rather than eyeballing the table or assuming there is a difference, the chi-square statistic tells us whether the difference is statistically significant.
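To make the chi-square idea concrete, here is a sketch that works it out by hand for the Pass/Fail table above. It compares the counts we observed with the counts we would expect if attendance and passing had nothing to do with each other. The variable names are made up for this example, and no continuity correction is applied.

```python
import pandas as pd

# Counts from the Pass/Fail table above
observed = pd.DataFrame(
    {"Pass": [25, 5], "Fail": [5, 10]},
    index=["Attended", "Skipped"],
)

row_totals = observed.sum(axis=1)     # students per attendance group
col_totals = observed.sum(axis=0)     # students per outcome
grand_total = observed.values.sum()   # all students

# Counts we would expect if attendance and passing were unrelated
expected = pd.DataFrame(
    [[r * c / grand_total for c in col_totals] for r in row_totals],
    index=observed.index,
    columns=observed.columns,
)

# Chi-square: squared gaps between observed and expected, scaled by expected
chi_square = ((observed - expected) ** 2 / expected).values.sum()

print(expected)
print(chi_square)  # 11.25
```

For a table with two rows and two columns, a chi-square value above about 3.84 is conventionally treated as statistically significant at the 5% level, so 11.25 would count. Library functions such as `scipy.stats.chi2_contingency` can do this calculation too, though by default that function applies a continuity correction to 2x2 tables, so its number may differ from the hand calculation.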

So let’s dive in!


### Goal 7:  Calculate Correlations for Categorical Data

You’ve now mastered ratio and ordinal data. You might be thinking “Well, what about categorical data?”. Let’s see how we can use crosstabs to explore categorical data.

#### Blockly


**Step 1 - Write out the variable name you want to use for our contingency table**

Create a new variable named **contingencyTable** and drag it into the VARIABLES menu. This variable will hold the contingency table calculated from the dataframe.



**Step 2 - Call the crosstab() function**

To create a contingency table, we first need to call the crosstab() method. From the VARIABLES menu, set the **contingencyTable** variable to the result of WITH **pd** DO the crosstab function.



**Step 3 - Add parameters to the crosstab() function**

In the next step, let’s look at which variables (columns) to compare in our crosstab() method.

Now, from the LISTS menu, get a *dictVariable*. Change it to **dataframe** and enter the name of the column "TrafficCondition" (use Quotes from the TEXT menu). Add it as a parameter of the crosstab function. To add a second parameter, repeat the process for the column "Road Type".



**Step 4 - Assign the crosstab to the variable you created**

Finally, we need to store the contingency table result in a variable for later use. Here, we store it in the variable **contingencyTable**. From the VARIABLES menu, set the **contingencyTable** variable to the result of WITH **pd** DO the **crosstab** function.



**Step 5 - Print the contingency table**

From the TEXT menu, print the **contingencyTable** variable to the console, displaying the calculated contingency table.



**Step 6 - Connect the blocks to run the code**

Connect the blocks and run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/GaspkPFWQAAVv_n?format=jpg&name=medium)

</details>

In [None]:
#blocks code


#### Freehand


**Step 1 - Call the crosstab() function**

To create a contingency table, we first need to call the **crosstab**() method.

`pd.crosstab()`



**Step 2 -  Add parameters to the crosstab() function**

In the next step, let’s look at which variables (columns) to compare in our crosstab() method. The parameters for this table will be:

- `dataframe['TrafficCondition']` - the column representing different traffic conditions
- `dataframe['Road Type']` - the column representing different types of roads

`pd.crosstab(dataframe['TrafficCondition'], dataframe['Road Type'])`






**Step 3 - Assign the crosstab to the variable you created**

Finally, we need to store the contingency table result in a variable for later use. Here, we store it in the variable **contingencyTable**.

`contingencyTable = pd.crosstab(dataframe['TrafficCondition'], dataframe['Road Type'])`



**Step 4 -  Print the contingency table**

Lastly, we print the contingency table.

`print(contingencyTable)`



**Step 5 - Run the code**

Hit ‘control’ and ‘enter’ at the same time to run the code!
<details>
    <summary>Click to see the answer...</summary>

![](https://pbs.twimg.com/media/Gasq8wuXcAAxTN-?format=jpg&name=medium)

</details>

**Your Turn**: Now it’s your turn! Start by creating a variable for the contingency table. After you’ve added parameters, you’ll use the cross-tabulation method to get the results.


In [9]:
#freehand code 


Road Type         City  Highway  Rural
TrafficCondition                      
Free Flow            3        3     44
Heavy               44        3      3
Moderate             6       38      6


**Explanation**: *The contingency table shows how Traffic Condition and Road Type are distributed together. Most Free Flow traffic conditions occur on Rural roads (44 of 50), while Heavy traffic conditions are most common on City roads (44 of 50). Moderate traffic conditions occur mostly on Highway roads (38 of 50), with the rest split evenly between City and Rural roads. These findings suggest that Road Type is associated with Traffic Condition, with Rural roads tending to have lighter traffic and City roads experiencing heavier traffic*.
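Two optional settings of `pd.crosstab` can make a contingency table easier to interpret: `margins=True` adds totals, and `normalize="index"` turns each row into shares. The sketch below rebuilds a dataframe from the counts printed above so that it runs on its own; the `roads` dataframe is a stand-in for the lesson's data, not the real file.

```python
import pandas as pd

# Rebuild one row per observation from the counts printed above
counts = {
    ("Free Flow", "City"): 3, ("Free Flow", "Highway"): 3, ("Free Flow", "Rural"): 44,
    ("Heavy", "City"): 44, ("Heavy", "Highway"): 3, ("Heavy", "Rural"): 3,
    ("Moderate", "City"): 6, ("Moderate", "Highway"): 38, ("Moderate", "Rural"): 6,
}
rows = [pair for pair, n in counts.items() for _ in range(n)]
roads = pd.DataFrame(rows, columns=["TrafficCondition", "Road Type"])

# margins=True adds an "All" row and column holding the totals
with_totals = pd.crosstab(roads["TrafficCondition"], roads["Road Type"], margins=True)

# normalize="index" turns each row into shares that add up to 1
row_shares = pd.crosstab(roads["TrafficCondition"], roads["Road Type"], normalize="index")

print(with_totals)
print(row_shares.round(2))
```

Row shares make comparisons easier when the groups have different sizes. Here every traffic condition has 50 observations, so the counts and the shares tell the same story.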


## WHAT DID YOU LEARN?


[Correlations](## "seeing how two different variables are related") describe the strength and direction of a relationship between two variables. In this lesson, we learned about two important concepts called [Pearson's correlation coefficient](## "measures linear relationships between two variables") and [Spearman's rank correlation coefficient](## "computes a measure of association using ranks"). They help determine whether and how strongly pairs of variables are related. We learned how to calculate these coefficients and what they mean.

We also learned that these tools have certain assumptions and limitations, which means they don’t work perfectly in every situation. We also explored how to visualize these relationships using [heatmaps](## "a visual representation that uses colors to show the magnitude of values in a dataset") and [contingency tables](## "a type of table used to show the frequency distribution of variables") for categorical data analysis.


## WHAT’S NEXT?




[Clustering](Clustering.ipynb)


## TELL ME MORE




- [Datawhys Measures of Association Notebook](https://github.com/memphis-iis/datawhys-content-notebooks-python/blob/master/Measures-of-association.ipynb)
- [Datawhys Measures of Association Problem-Solving Notebook](https://github.com/memphis-iis/datawhys-content-notebooks-python/blob/master/Measures-of-association-PS.ipynb)
- [What is Correlation?](https://youtu.be/PEfQCv9nvSo?si=q76qtB4STTqK8ZLY) - Simplilearn (video)
- [The (Pearson) Correlation Coefficient Explained in One Minute](https://youtu.be/WpZi02ulCvQ?si=spmoPwDI6TxrBdHO) - One Minute Economics (video)