# Correlation

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
import scipy.stats

food = pd.read_pickle("../data/processed/food")

## Pearson's correlation coefficient, $r$

$r$ is standardised, which is useful because:

- results for variables measured in all sorts of units can be compared with each other,
- the result always lies between -1 (perfect negative association) and 1 (perfect positive association), with 0 meaning no linear association.

$r$ is a measure of effect size (or correlation) between two numerical variables.
It works on the principle that as the difference from the mean for one variable increases we expect the difference from the mean for the related variable to increase (positive correlation) or decrease (negative correlation).

For example, the mean income is:

In [None]:
food.P344pr.mean()

Let's say we hypothesise that people with higher incomes spend more money on food (they have more money to shop at Waitrose).
Expenditure is top-coded, so let's trim the data like we did for income and take a look at the resulting distribution:

In [None]:
food = food[food.P550tpr < food.P550tpr.max()]
food.hist(column = "P550tpr", bins = 100)
plt.xlabel("Food expenditure (£)")
plt.ylabel("Frequency")

The mean expenditure is:

In [None]:
food.P550tpr.mean()

If we take an individual with a high income (their income deviates from the mean) we would expect their expenditure to also deviate from the expenditure mean.
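Averaging the squared deviations of a single variable gives its variance; if the deviations of two variables tend to move together, the variables **covary**. So we are stating that we expect income and expenditure on food to covary.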
This principle is used to calculate the **Pearson correlation coefficient** (usually just called the correlation), which is a standardised measure of how much the two variables vary together.
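
To make the principle concrete, here is a minimal sketch that computes $r$ by hand from the deviations and checks it against `np.corrcoef`. The data are synthetic and purely illustrative (they are not the food survey), and the names `x`, `y`, `cov_xy` and `r_manual` are assumptions made for this example.

```python
import numpy as np

# Synthetic, illustrative data (not the food survey): y rises with x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(loc=500, scale=100, size=200)
y = 0.1 * x + rng.normal(loc=0, scale=10, size=200)

# Deviation of each observation from its variable's mean.
dx = x - x.mean()
dy = y - y.mean()

# Covariance: the average product of the paired deviations (n - 1 denominator).
cov_xy = (dx * dy).sum() / (len(x) - 1)

# Standardise by both standard deviations so the result lies between -1 and 1.
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# The hand-computed value should agree with NumPy's built-in calculation.
print(r_manual, np.corrcoef(x, y)[0, 1])
```

Dividing the covariance by the two standard deviations is what removes the units, which is why $r$ can be compared across variables measured on different scales.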

In [None]:
scipy.stats.pearsonr(
    food["P344pr"], food["P550tpr"]
)

In this example the first number is the correlation coefficient and the second number is its associated $p$ value.
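
If you want to use the two numbers separately, the result of `scipy.stats.pearsonr` can be unpacked into two names. The sketch below uses synthetic data rather than the food survey, and the names `r` and `p` are just a convention chosen for this example.

```python
import numpy as np
import scipy.stats

# Synthetic, illustrative data (not the food survey).
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x + rng.normal(size=100)

# Unpack the result: correlation coefficient first, p value second.
r, p = scipy.stats.pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.3g}")
```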

The correlation is positive so as income goes up, expenditure on food goes up (if it were negative it would be a negative correlation, which would state that as income went up expenditure on food went down for some reason).
The value of 0.63 indicates a strong correlation: squaring it gives $r^2 \approx 0.40$, so roughly 40% of the variance in expenditure is accounted for by income.

The $p$ value is $\ll 0.01$ ($\ll$ means 'much less than'), so it is highly improbable that we would see a correlation this large by chance alone if there were no association.
We therefore have strong evidence to reject the null hypothesis and conclude that there is an association between income and expenditure on food.

### Assumptions

Pearson's correlation coefficient requires both variables to be numeric, and for the $p$ value to be accurate they should also be normally distributed.
In this case our variables are numeric (income and expenditure) so the first requirement is met.

Neither variable should have any outliers (defined as any value greater than the mean + 3.29 standard deviations).
For income this is ok:

In [None]:
# Count incomes above the mean + 3.29 standard deviations
income_cutoff = food.P344pr.mean() + 3.29 * food.P344pr.std()
len(food[food.P344pr > income_cutoff])

But there are a few outliers for the expenditure variable:

In [None]:
# Count expenditures above the mean + 3.29 standard deviations
expenditure_cutoff = food.P550tpr.mean() + 3.29 * food.P550tpr.std()
len(food[food.P550tpr > expenditure_cutoff])

To be safe, let's remove these:

In [None]:
food = food[food.P550tpr < food.P550tpr.mean() + (3.29 * food.P550tpr.std())]
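
The same threshold expression is repeated in each of the cells above. As a sketch, it could be wrapped in a small helper. The function name `upper_outliers` and the toy data are hypothetical and not part of the original analysis. The helper only uses `.mean()` and `.std(ddof=1)`, so it should behave the same way on a NumPy array or on a pandas column such as `food.P550tpr`.

```python
import numpy as np

def upper_outliers(values, z=3.29):
    """Return a boolean mask that is True where a value exceeds mean + z * SD.

    Hypothetical helper for illustration; it only flags the upper tail,
    matching the one-sided rule used in the text.
    """
    cutoff = values.mean() + z * values.std(ddof=1)
    return values > cutoff

# Toy data: 49 ordinary values and one extreme value.
spend = np.array([50.0] * 25 + [60.0] * 24 + [500.0])

mask = upper_outliers(spend)
print(mask.sum())       # how many values are flagged as outliers

# Keep only the values that were not flagged.
trimmed = spend[~mask]
print(len(trimmed))
```

Removing outliers changes the mean and standard deviation, so running the check again on the trimmed data can flag new values; whether to repeat the step is a judgement call.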

A scatterplot of these two variables:

In [None]:
food.plot.scatter("P344pr", "P550tpr")
plt.xlabel("Income")
plt.ylabel("Expenditure")

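The points should follow a roughly straight line (linearity) and form a roughly cylindrical band of even width along that line to meet the assumptions.
If the cloud of points is too conical (narrow at one end and wide at the other) it means the spread around the line isn't constant (heteroskedasticity).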
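
Judging the shape by eye can be backed up with a rough numerical check. The sketch below is one informal way to do this, not a formal test: it fits a straight line, then compares the spread of the residuals in the lower and upper halves of the $x$ values. The data are synthetic and deliberately heteroskedastic, and all variable names are assumptions made for this example.

```python
import numpy as np

# Synthetic data whose noise grows with x, so the scatter is cone-shaped.
rng = np.random.default_rng(42)
x = rng.uniform(1, 10, size=1000)
y = 2 * x + rng.normal(scale=0.5 * x)

# Fit a straight line and compute each point's residual from it.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Compare the residual spread below and above the median of x.
median_x = np.median(x)
spread_low = residuals[x <= median_x].std(ddof=1)
spread_high = residuals[x > median_x].std(ddof=1)

# A ratio near 1 suggests even spread; a ratio far from 1 suggests a cone.
ratio = spread_high / spread_low
print(ratio)
```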

If these assumptions aren't true of our data we can use **Spearman's $\rho$** (pronounced 'row').
Spearman's $\rho$ is also useful when we have a numeric variable and an ordinal variable (something we couldn't test with Pearson's $r$).

This is a **non-parametric** test.
Non-parametric tests tend to be more robust (which is why we can use them when we violate some of the assumptions of the parametric equivalents, in this case Pearson's $r$) but sometimes have lower statistical power.
Therefore, try to use the parametric version by default and switch to the non--parametric version when necessary.
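
One way to see where the robustness comes from: Spearman's $\rho$ is Pearson's $r$ calculated on the ranks of the values rather than the values themselves, so extreme values and skew have much less influence. The sketch below checks this equivalence on synthetic data (not the food survey); the variable names are assumptions made for this example.

```python
import numpy as np
import scipy.stats

# Synthetic data: x is skewed and y rises with x in a non-linear way.
rng = np.random.default_rng(7)
x = rng.exponential(size=200)
y = np.exp(x) + rng.normal(scale=0.1, size=200)

# Replace each value by its rank (ties would receive the average rank).
rank_x = scipy.stats.rankdata(x)
rank_y = scipy.stats.rankdata(y)

# Pearson's r on the ranks...
rho_from_ranks, _ = scipy.stats.pearsonr(rank_x, rank_y)

# ...should match Spearman's rho on the raw values.
rho, p = scipy.stats.spearmanr(x, y)
print(rho_from_ranks, rho)
```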

In [None]:
scipy.stats.spearmanr(
    food["P344pr"], food["P550tpr"]
)

As you can see in this example, the correlation statistic is very similar and the $p$ value is still significant ($\ll 0.01$).