# Measures of association

When working on a data science project, we often want to decide whether two variables (features) are related.
In this notebook we will explore some of the ways we can measure the association between two variables.
As before, the type of variable we have, either categorical (nominal or ordinal level of measure) or numerical (interval/ratio level), will determine the procedures we can use to identify associations between variables.

## Correlation

### Numerical Data

For numerical data the most common measure of association is *Pearson’s correlation coefficient*, used to measure *linear* relationships between two variables.

A scatterplot of the variables can be used to see whether there is a possible relationship and whether it is linear.
If we let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$ be the two sets of variables which are assumed to be random samples from two different populations, then Pearson’s correlation coefficient is defined as:

$$r=\frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^n(X_i - \bar{X})^2}\sqrt{\sum_{i=1}^n (Y_i - \bar{Y})^2}}$$ 

where $\bar{X}$ and $\bar{Y}$ are the sample means from each sample.
We can show that $r$ will always be between $-1$ and $1$ and we can use the value to explain the relationship between the two variables.
For values of $r>0$ we say the variables are *positively correlated*.
For values of $r<0$ the variables are deemed *negatively correlated*.
If $r=0$ there is no linear relationship between the two variables.
If $r=1$ or $-1$, there is a perfect linear relationship between the two.
The figures below show cases of positive, negative, and no correlation.
Figure 1 shows two variables that have a fairly high positive correlation of $r=0.79$.
Figure 2 shows two variables with a negative correlation of $r=-0.87$.
In Figure 3 we see two variables that have no linear relationship between them according to the scatterplot.
The estimated correlation should be close to zero and Pearson’s correlation coefficient was $0.11$.

![image.png](attachment:image.png)
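
To make the formula concrete, here is a minimal Python sketch that computes $r$ term by term. The helper name `pearson_r` and the toy data are assumptions made for illustration; they are not the data behind the figures.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson's r computed directly from the defining formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations X_i - X-bar
    dy = y - y.mean()  # deviations Y_i - Y-bar
    return (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))


# Hypothetical toy data (not the data shown in the figures)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1                                  # exact increasing line
y_down = -3 * x + 10                              # exact decreasing line
y_noisy = np.array([2.0, 1.0, 4.0, 3.0, 5.0])     # increasing trend with scatter

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_noisy = pearson_r(x, y_noisy)

# NumPy's built-in correlation, used here only as a cross-check
r_check = np.corrcoef(x, y_noisy)[0, 1]

print(r_up, r_down, r_noisy, r_check)
```

The exact lines give values of $1$ and $-1$ (up to floating-point rounding), while the scattered values give $r = 0.8$, matching NumPy's `np.corrcoef`.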

Let's take a look at correlations in the now-familiar `iris` dataset.
First, the import:

- `import pandas as pd`

Now load the dataset into a dataframe variable:

- Create `dataframe`
- Set it to `with pd do read_csv using "datasets/iris.csv"`
- `dataframe` (so you can see the dataframe displayed)

Now that we have the `dataframe`, we can calculate correlations:

- Create variable `corrMatrix`
- Set it to `with dataframe do corr using`
- `corrMatrix` (so you can see the matrix displayed)

First, notice that `corrMatrix` only contains correlations for numeric variables - the nominal (categorical) variable `Species` has been ignored here.
We will talk about measures of association for nominal variables in a moment.

Second, notice that entries on the top left to bottom right diagonal are all 1.0.
This is because every variable is perfectly correlated with itself by definition.

Finally, notice that the triangle of entries below the diagonal (lower triangular matrix) and triangle of entries above the diagonal (upper triangular matrix) are mirror images of each other.
For that reason, you often see matrices like this with only the lower triangular matrix, because the rest is redundant.
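
For readers who want to see the same steps as plain pandas code, the sketch below builds a correlation matrix and checks the diagonal and the mirror-image property. The miniature dataframe is made up for illustration and merely stands in for `datasets/iris.csv`; the `select_dtypes` call is our own addition, because recent pandas versions may raise an error on text columns rather than ignore them.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset standing in for datasets/iris.csv;
# the values are invented for illustration.
dataframe = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
    "SepalWidth": [3.5, 3.0, 3.3, 2.7, 3.0, 3.0],
    "PetalLength": [1.4, 1.4, 6.0, 5.1, 5.9, 5.8],
    "Species": ["setosa", "setosa", "virginica",
                "virginica", "virginica", "virginica"],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
corrMatrix = dataframe.select_dtypes("number").corr()
print(corrMatrix)

values = corrMatrix.to_numpy()
# Every variable is perfectly correlated with itself.
diagonal_is_one = np.allclose(np.diag(values), 1.0)
# The triangles above and below the diagonal mirror each other.
is_symmetric = np.allclose(values, values.T)
print(diagonal_is_one, is_symmetric)
```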

`PetalLength` and `PetalWidth` are almost perfectly *positively* correlated at .96, meaning that as one increases, the other tends to increase along a nearly straight line.

In contrast, both `PetalLength` and `PetalWidth` are *negatively* correlated with `SepalWidth` at -.42 and -.36 respectively. 
So as `SepalWidth` increases, we expect `PetalLength` and `PetalWidth` to *decrease*.

Sometimes it can be useful to look at a correlation matrix as a plot instead of as a table of values, especially if you are mostly interested in positive/negative associations.
What we will do is convert the numbers into colors, such that purple is a negative correlation and yellow is a positive correlation.
This is also called a **heatmap**.

- `import plotly.express as px`

To display the correlation matrix as a heatmap, we just need `imshow`:

- Create `fig_iris`
- Set it to `with px do imshow using` a list containing
    - `corrMatrix`

Now we just need to show it:

- `with fig_iris do show using`

With a heatmap, it's always important to look at the color that represents zero, which here is a kind of violet.
Any color close to yellow is a strong positive correlation, and any color close to dark purple (or indigo) is a strong negative correlation.

One problem with this plot is that it's a little hard to interpret without labels for the variables.
We can get those by using `x` and `y` as we did last time - the difference is that `x` and `y` are now our axis labels.
Copy the `imshow` blocks above, click the cell below, and paste the blocks into the workspace on the left. 
Then make the following changes:

- Set `fig_iris` to `with px do imshow using` a list containing
    - `corrMatrix`
    - A freestyle block **with a notch on the right** containing `x=`, connected to `from corrMatrix get columns`
    - A freestyle block **with a notch on the right** containing `y=`, connected to `from corrMatrix get columns`


Now show it:

- `with fig_iris do show using`

Let's keep going.

The value of $r$ can be close to zero if the relationship between the two variables is curved because it only measures linear relationships.
For example the relationship $y=e^x$ is a perfect relationship between $Y$ and $X$ and yet it is not linear so the Pearson correlation coefficient will not be 1.
Correlation can also be close to zero if there are outliers in the data that are much different than the linear trend seen in the remaining data.
These anomalies illustrate the importance of always plotting the data first to be sure the relationship is linear.
It is also important to note that *correlation* does not imply *causation*.
Just because two variables are highly correlated does not mean there is a cause and effect relationship between them.
For example, significant correlation has been found between sunspot activity and economic cycles and yet it is not plausible to say that one causes the other.
There could also be lurking variables causing both variables to move in the same direction.
For example, a correlation can be found between the amount of ice cream sold and the number of people at the beach.
Neither causes the other, but both are affected by the outside temperature.
Their relationship is illustrated in the figure below.

![image.png](attachment:image.png)

Remember that causality requires a counterfactual, which we typically create using a randomized experiment.
You can't establish causality with a correlation.
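
Returning to the earlier point about curved relationships, the following sketch checks the $y=e^x$ example numerically. The range and number of $x$ values are arbitrary assumptions; the exact value of $r$ depends on them.

```python
import numpy as np

x = np.linspace(0, 5, 50)   # hypothetical, evenly spaced x values
y = np.exp(x)               # y is completely determined by x, but not linearly

# Pearson's r for the curved relationship
r_curved = np.corrcoef(x, y)[0, 1]
# For comparison: an exact straight line through the same x values
r_line = np.corrcoef(x, 3 * x + 2)[0, 1]

print(r_curved, r_line)
```

Even though $y$ can be predicted perfectly from $x$, `r_curved` comes out noticeably below 1, while the straight line gives a value of 1 up to rounding.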

### Ordered Categorical Data

For ordered categorical data there are measures of correlation based on ranks.
In these methods the original variables are replaced by their *ranks*.
If we order the original sample from smallest to largest, then rank $R_1$ corresponds to the smallest value, rank $R_2$ to the next smallest, etc.
*Spearman’s rank correlation coefficient* computes a measure of association using the ranks.
The method is to assign ranks to the $X$'s and $Y$'s separately and then compute Pearson’s correlation coefficient on the ranks.
Spearman’s correlation works for numerical data as well as ordinal data.
It can be interpreted in the same way as Pearson’s correlation.

We can get a Spearman correlation in almost exactly the same way as a Pearson correlation.
The only difference is that we need to tell `corr` what correlation to use (Pearson is the default):

- Set `corrMatrix` to `with dataframe do corr using` a list containing
    - a freestyle block containing `method='spearman'`
- `corrMatrix` (so you can see the matrix displayed)

As you can see, the results are very similar to Pearson correlation in this case.
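
The description of Spearman's method (rank each variable, then apply Pearson's formula to the ranks) can be checked against `method='spearman'` directly. This is a minimal sketch on made-up data, not the `iris` values; the column names `x` and `y` are placeholders.

```python
import pandas as pd

# Hypothetical data: y grows roughly like e**x, so it is curved but always increasing.
data = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.7, 7.4, 20.1, 54.6, 148.4, 403.4],
})

# Step 1: replace each value by its rank within its own column.
ranks = data.rank()

# Step 2: Pearson's correlation (the default) computed on the ranks.
spearman_by_hand = ranks["x"].corr(ranks["y"])

# The built-in shortcut, and plain Pearson on the raw values for comparison.
spearman_builtin = data.corr(method="spearman").loc["x", "y"]
pearson = data.corr().loc["x", "y"]

print(spearman_by_hand, spearman_builtin, pearson)
```

Because `y` always increases with `x`, both columns have identical ranks and the Spearman correlation is 1, while Pearson's correlation on the raw values is lower because the relationship is not a straight line.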

## Categorical Associations

Suppose we are interested in determining whether two categorical variables are related to each other.
We can construct a *contingency table* to divide the data into categories and count the frequency of observations that fall into each category.
Consider the case where each variable consists of only two possible outcomes.
For example suppose we are interested in determining whether attendance in a class is related to whether you pass or fail.
Then we can classify the students in the class into four cases:

1. Those who attended the class and passed
2. Those who attended the class and failed
3. Those who skipped class and passed
4. Those who skipped class and failed

The frequency of students in each of these four cases can be summarized in a $2 \times 2$ contingency table as shown below.

|          | Pass | Fail |
|----------|------|------|
| Attended | 25   | 5    |
| Skipped  | 5    | 10   |

To determine if there is a connection between passing/failing and attending/skipping, we need to see what our table would look like if there were no connection (called the *expected* table) and then see how different the expected table is from what we have (called the *observed* table).

In order to find the expected table, we need another table detailing the row sums and column sums, called *marginals*.

|          | Pass | Fail | Total |
|----------|------|------|-------|
| Attended | 25   | 5    | 30    |
| Skipped  | 5    | 10   | 15    |
| Total    | 30   | 15   | 45    |

Using the marginal totals we can say the probability of a student having attended the class is the total number of students who attended divided by the total number of students, 30/45.
The probability of the student passing the class is the total number of students who passed divided by the total number of students, also 30/45.
If there is no relationship between the row and column variables then the probability of a student attending class and passing the class is the probability of attending times the probability of passing - (30/45)\*(30/45).
And so, if there is no relationship, the number of students we would expect in a cell is 45 times the probability of being in that cell.
For the attended-and-passed cell, that is

$$45 \times P(\text{attending}) \times P(\text{passing}) = 45 \times \left( \frac{30}{45} \right) \left( \frac{30}{45}\right) = 20.$$

The marginal totals must stay the same so we get the expected table as shown below.

|          | Pass | Fail | Total |
|----------|------|------|-------|
| Attended | 20   | 10   | 30    |
| Skipped  | 10   | 5    | 15    |
| Total    | 30   | 15   | 45    |
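
The same expected table can be computed from the marginals in a few lines. This is a sketch using NumPy; the variable names are our own, and `np.outer` is simply a compact way to multiply every row total by every column total.

```python
import numpy as np

# Observed counts from the attendance example
# (rows: Attended, Skipped; columns: Pass, Fail)
observed = np.array([[25, 5],
                     [5, 10]])

row_totals = observed.sum(axis=1)   # [30, 15]
col_totals = observed.sum(axis=0)   # [30, 15]
n = observed.sum()                  # 45

# Under independence: expected count = n * P(row) * P(column)
#                                    = row total * column total / n
expected = np.outer(row_totals, col_totals) / n
print(expected)
```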

Now to compute the difference in the two tables (observed and expected), we compute the *chi-square* statistic, $\chi^2$.
If $O$ stands for counts in the observed table and $E$ stands for counts in the expected table, then the chi-square statistic is found as $$\chi^2 = \sum \left(\frac{(O-E)^2}{E}\right)$$ 

where the sum is over the 4 cells in the contingency table.
For our example this would be 

$$\left( \frac{(25-20)^2}{20} \right) + \left(\frac{(5-10)^2}{10}\right) + \left( \frac{(5-10)^2}{10} \right)+\left( \frac{(10-5)^2}{5} \right) = 11.25.
$$  
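
As a quick check of this arithmetic, the sketch below computes each cell's contribution $(O-E)^2/E$ and adds them up. The array names are our own.

```python
import numpy as np

# Observed and expected tables from the attendance example
observed = np.array([[25, 5],
                     [5, 10]])
expected = np.array([[20, 10],
                     [10, 5]])

# One (O - E)^2 / E term per cell, then the sum over all four cells
contributions = (observed - expected) ** 2 / expected
chi_square = contributions.sum()

print(contributions)
print(chi_square)
```

The four contributions are 1.25, 2.5, 2.5, and 5, which sum to 11.25.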

The chi-square statistic can be used to test for whether the row and column variables are independent by comparing it to a chi-square distribution (hence the name) or it can be used to compute a single measure of association between the two variables.
One such measure for a $2 \times 2$ table is the *Phi coefficient* and is defined as 

$$\phi = \sqrt{\frac{\chi^2}{N}}$$

where $N$ is the total number of counts in the contingency table.
The measure ranges between 0 and 1, and higher values indicate a stronger association.
The value of $\phi$ for our example is 

$$\phi = \sqrt{\frac{11.25}{45}}=0.5.$$ 

This is considered a strong association.
As a rule of thumb, values of $\phi$ below 0.10 are considered a weak association between the variables, values between 0.10 and 0.30 are considered a moderate association, and values above 0.30 are evidence of a strong association.
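
The formula and the rule of thumb can be wrapped in two small helpers. Both function names (`phi_coefficient`, `describe_strength`) are hypothetical and exist only for this sketch; the cutoffs are the ones quoted above.

```python
import math


def phi_coefficient(chi_square, n):
    """Phi for a 2 x 2 table: square root of chi-square divided by N."""
    return math.sqrt(chi_square / n)


def describe_strength(phi):
    """Label phi using the rule of thumb quoted in the text."""
    if phi < 0.10:
        return "weak"
    if phi <= 0.30:
        return "moderate"
    return "strong"


# Values from the attendance example: chi-square = 11.25, N = 45
phi = phi_coefficient(11.25, 45)
label = describe_strength(phi)
print(phi, label)
```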

As per usual, we can get Python to do all this work for us, as long as we know how to tell it.

Let's start with a new dataset that has some better categorical variables.
The dataset we will use is called `titanic` and is a list of passengers on the famous ship Titanic that sank over 100 years ago when it hit an iceberg:

- Set `dataframe` to `with pd do read_csv using "datasets/titanic.csv"`
- `dataframe` 

The two variables we are interested in are `Survived` and `Sex`, because we wonder if men or women were more likely to survive.
Notice that `Survived` is either 1 or 0 - it is common to code a categorical variable this way when 1 is "true" and 0 is "false".
Even though `Survived` looks like a numeric variable, it is actually categorical.

Next we need to make a contingency table for these two variables:

- Create variable `contingencyTable`
- Set it to `with pd do crosstab using` a list containing
    - `dataframe ["Survived"]` (In menu LISTS, get `{dictVariable}[ ]` and add text block `"Survived"`)
    - `dataframe ["Sex"]`
- `contingencyTable` (for display)
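
In plain pandas, the blocks above correspond to a single `pd.crosstab` call. The sketch below uses a handful of made-up passengers instead of `datasets/titanic.csv`, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical passengers, invented for illustration;
# the notebook itself reads datasets/titanic.csv.
dataframe = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 0, 1],
    "Sex": ["female", "male", "female", "male",
            "male", "female", "female", "male"],
})

# Rows are the values of Survived, columns are the values of Sex,
# and each cell counts the passengers with that combination.
contingencyTable = pd.crosstab(dataframe["Survived"], dataframe["Sex"])
print(contingencyTable)
```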

Stop for a moment to consider what `crosstab` has done - it has gone through 887 rows of the dataframe and counted the number of times females survived, males survived, females died, and males died in order to make this table.

Next we need to calculate chi-square from `contingencyTable`.
To do this, we need to import a new library, `scipy.stats`:

- `import scipy.stats as stats`

Now call `chi2_contingency` in `stats`:

- `with stats do chi2_contingency using` a list containing
    - `contingencyTable`

The output is a little strange because `chi2_contingency` returns a bunch of information rather than just one thing.
However, we just want the first number returned, 258.39, which is the chi-square value.
To finish calculating phi, all we need to do is divide 258.39 by 887 (the number of people) and take the square root:

- In menu MATH, get `1 + 1`, change the + to &#xF7; and then change the 1s to divide 258.39 by 887
- In menu MATH, get the `square root` block and connect it to the front of the first block
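
The whole calculation can also be sketched in plain Python. To keep the example self-contained, the contingency table is typed in by hand from the survival counts quoted in this notebook rather than read from the CSV file. Note that for a $2 \times 2$ table `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so its chi-square value is slightly smaller than the uncorrected formula would give.

```python
import math

import pandas as pd
import scipy.stats as stats

# Counts quoted in this notebook for the Titanic data, entered by hand
# (rows: Survived 0/1, columns: Sex).
contingencyTable = pd.DataFrame(
    {"female": [81, 233], "male": [464, 109]},
    index=pd.Index([0, 1], name="Survived"),
)

# chi2_contingency returns several values; unpack them by position.
chi2, p_value, dof, expected = stats.chi2_contingency(contingencyTable)

n = contingencyTable.to_numpy().sum()   # total number of passengers in the table
phi = math.sqrt(chi2 / n)

print(chi2, dof, phi)
```

Unpacking the result this way avoids copying 258.39 by hand, and computing `n` from the table keeps the divisor consistent with the counts.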

The result, about 0.54, is a pretty strong association and matches what we see in the contingency table:

- 233 females survived and 81 females died, so females were about three times as likely to live as die
- 109 males survived and 464 males died, so males were over four times as likely to die as to live

## Submit your work

When you have finished the notebook, please download it, log in to [OKpy](https://okpy.org/) using "Student Login", and submit it there.

Then let your instructor know on Slack.

To download the notebook, click on the file browser icon and right click on the notebook file.

![image.png](attachment:image.png)