<a href="https://colab.research.google.com/github/lcnature/statsthinking21-python/blob/master/12_ModelingContinuousRelationships.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Correlation


## What is the evidence that carbon dioxide increases global temperature?

We often hear about climate change. The central cause is the increased emission of greenhouse gases, particularly carbon dioxide (CO2), due to burning fuels.

How are CO2 concentration in air and temperature related?

Let's look at some data on historical CO2 concentration (estimated from ice deep in glaciers) and on temperature fluctuation (estimated from isotope concentrations in ancient rocks). [Data source](https://archive.epa.gov/climatechange/kids/documents/temp-and-co2.pdf)


Let's first load the data from a text file:




In [None]:
import pandas as pd
url = 'https://raw.githubusercontent.com/lcnature/statsthinking21-python/refs/heads/master/notebooks/data/Co2_temp.csv'

data = pd.read_csv(url)
print(data)

After the index column, the second column is the year (B.C.). The third column is the estimated concentration of CO2 in the air (ppm). The fourth column is the deviation of the temperature in each year from a baseline (here chosen as 400 B.C.).

Let's plot how the two variables (CO2 concentration and temperature anomaly) change over the years.


In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=[10,3])
plt.plot(data['year_BC'], data['CO2'])
plt.xlabel('year (BC)')
plt.ylabel('CO2 (ppm)')
plt.xticks(range(0, 500000, 100000))
plt.show()

plt.figure(figsize=[10,3])
plt.plot(data['year_BC'], data['temperature_anomaly'])
plt.xlabel('year (BC)')
# Raw string, so that the backslash in \circ is not treated as an escape sequence
plt.ylabel(r'temperature anomaly ($^\circ$C)')
plt.xticks(range(0, 500000, 100000))
plt.show()

What do you see?


They seem to rise and fall in similar patterns.

Can we visualize their relationship in a different way?


In [None]:
plt.scatter(data['CO2'], data['temperature_anomaly'])
plt.xlabel('CO2 (ppm)')
plt.ylabel(r'temperature anomaly ($^\circ$C)')
plt.show()


A pattern like this indicates that when the X variable increases, the Y variable tends to increase as well. We call such a pattern a correlation.

The opposite pattern, where Y tends to decrease as X increases, is called anti-correlation.

## Calculate Pearson correlation in Python

We want to use a number to quantify how strongly two variables are related. This number is called the **correlation coefficient**. More specifically, we will use a technique called **Pearson correlation** to obtain it.

Let's calculate the correlation coefficient between these two variables.



In [None]:
from scipy.stats import pearsonr

r, p = pearsonr(data['CO2'], data['temperature_anomaly'])

print('correlation coefficient between CO2 and temperature anomaly:', r)
print('p-value:', p)

It is a positive number!
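
To see where such a number comes from, here is a minimal sketch that computes the Pearson correlation coefficient directly from its definition: the sum of products of deviations from the mean, divided by the square root of the product of the two sums of squared deviations. The data below are synthetic stand-ins (an assumption for illustration, not the CO2 dataset), and `pearson_r` is a hypothetical helper name. The result is compared against `np.corrcoef`, which should give the same coefficient.

```python
import numpy as np

# Synthetic stand-in data (an assumption for illustration, not the CO2 dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

def pearson_r(a, b):
    """Pearson correlation coefficient computed from its definition."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_dev = a - a.mean()  # deviations of a from its mean
    b_dev = b - b.mean()  # deviations of b from its mean
    # Sum of products of deviations, normalized by the spread of each variable
    return np.sum(a_dev * b_dev) / np.sqrt(np.sum(a_dev ** 2) * np.sum(b_dev ** 2))

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]

print('r from the definition:', r_manual)
print('r from np.corrcoef:', r_numpy)
```

Because of the normalization in the denominator, the coefficient always lies between -1 and 1, regardless of the units of the two variables.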

## Magnitude of correlation coefficient and pattern of scatter plot

What would a pattern of no correlation look like?

If we make a new list by randomly drawing CO2 values from the historical data without replacement, we effectively obtain a shuffled version of the original CO2 data, in which each CO2 value is randomly re-associated with a year. After this shuffling, we shouldn't expect the shuffled variable to be correlated with the temperature anomaly, right?

This procedure is called **permutation**.

Let's give it a try.

In [None]:
import random, copy
# Work on a copy, because we don't want to destroy the original column in `data`
copy_CO2_data = copy.copy(data['CO2'])
print(type(copy_CO2_data))

# random.shuffle reorders the items of copy_CO2_data in place
random.shuffle(copy_CO2_data)
# To avoid confusion, we refer to the shuffled values by a more descriptive name.
# Note that this is a new name for the same object, not another copy.
shuffled_CO2 = copy_CO2_data

plt.scatter(shuffled_CO2, data['temperature_anomaly'])
plt.xlabel('shuffled CO2 (ppm)')
plt.ylabel(r'temperature anomaly ($^\circ$C)')
plt.show()


That certainly looks messier.

What is the correlation coefficient for such a pattern?

Please copy a relevant line of code from above to calculate the correlation coefficient.


In [None]:
r_shuffled, p_shuffled =
# Complete the line above


print('correlation coefficient between shuffled CO2 and temperature anomaly:',
      r_shuffled)
print('p-value:', p_shuffled)
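
A single shuffle gives only one chance-level correlation coefficient. As a rough sketch of how the idea extends, we can repeat the permutation many times and look at the whole range of coefficients that arise by chance. The data below are synthetic (an assumption, not the CO2 dataset), `np.corrcoef` stands in for `pearsonr`, and the number of permutations is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (hypothetical, not the CO2 dataset)
x = rng.normal(size=100)
y = x + rng.normal(size=100)

r_observed = np.corrcoef(x, y)[0, 1]

n_permutations = 2000  # arbitrary choice for this sketch
r_null = np.empty(n_permutations)
for i in range(n_permutations):
    # rng.permutation returns a shuffled copy and leaves x untouched
    x_shuffled = rng.permutation(x)
    r_null[i] = np.corrcoef(x_shuffled, y)[0, 1]

# Fraction of shuffles whose |r| is at least as large as the observed |r|
p_perm = np.mean(np.abs(r_null) >= np.abs(r_observed))

print('observed r:', r_observed)
print('mean of r after shuffling:', r_null.mean())
print('spread (std) of r after shuffling:', r_null.std())
print('fraction of shuffles with |r| >= observed |r|:', p_perm)
```

The shuffled coefficients should scatter around zero, and the fraction in the last line is a simple permutation-based estimate of how surprising the observed coefficient would be if the two variables were unrelated.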

If we flip the sign of one of the variables, what should the correlation coefficient look like?



In [None]:
r_flipped, p_flipped =
# Complete the line above, negating one of the two variables

print('correlation coefficient between flipped CO2 and temperature anomaly:',
      r_flipped)
print('p-value:', p_flipped)


Did you notice how this correlation coefficient and p-value compare with the original correlation coefficient and p-value?
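
The same behavior can be checked on synthetic data. This is a small sketch under assumptions: the data are made up, and `np.corrcoef` is used in place of `pearsonr`. Negating one variable should flip the sign of the coefficient while keeping its magnitude, and rescaling or shifting a variable by positive constants should leave the coefficient unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in data (hypothetical, not the CO2 dataset)
x = rng.normal(size=50)
y = x + rng.normal(size=50)

r_original = np.corrcoef(x, y)[0, 1]
r_negated = np.corrcoef(-x, y)[0, 1]           # flip the sign of x
r_rescaled = np.corrcoef(3 * x + 10, y)[0, 1]  # change the units and offset of x

print('original r:', r_original)
print('r after negating x:', r_negated)
print('r after rescaling and shifting x:', r_rescaled)
```

With the default two-sided test in `pearsonr`, the p-value depends only on the magnitude of the coefficient and the sample size, so flipping the sign of a variable should leave it unchanged.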

In case you are interested, the following code generates the figures used in the slides.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

n_sample = 1000

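for rho in np.linspace(-1, 1, 9):
  # Draw from a bivariate normal with unit variances and correlation rho.
  # We store the draws in `samples` so the `data` DataFrame loaded earlier is not overwritten.
  samples = np.random.multivariate_normal([0,0], [[1,rho],[rho,1]], n_sample)
  plt.scatter(samples[:,0], samples[:,1], alpha=0.2)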
  plt.title(r'$\rho$={:.02f}'.format(rho), fontsize=26)
  plt.xlabel('X')
  plt.ylabel('Y')
  plt.show()

## A few cautions for correlation

### Sensitivity to outliers
When a few data points deviate extremely from most of the other data points, they may dominate the correlation coefficient.

In [None]:
from scipy.stats import pearsonr
import numpy as np
X = np.random.randn(40)
Y = np.random.randn(40)

# Since X and Y are two sets of random numbers drawn from independent normal distributions,
# we do not expect them to show a strong correlation.
r, p = pearsonr(X, Y)

plt.scatter(X,Y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('without outlier, r='+r'{:.02f}'.format(r))
plt.show()


print('correlation coefficient between X and Y:', r)

# However, if we add a single (X, Y) pair whose values have very large magnitude,
# the correlation coefficient will change dramatically.

X_wOutlier = np.append(X, 6)
Y_wOutlier = np.append(Y, 8)
r, p = pearsonr(X_wOutlier, Y_wOutlier)
print('correlation coefficient between X and Y after including an outlier:', r)


# This can be shown by the scatter plot
plt.scatter(X_wOutlier, Y_wOutlier)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('with outlier, r='+r'{:.02f}'.format(r))
plt.show()
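
One simple way to spot such a dominating point, sketched here as an assumption rather than as part of the original analysis, is to recompute the correlation coefficient with each data point left out in turn and see which removal changes it the most. The data are synthetic, with the same outlier pair (6, 8) appended, and `np.corrcoef` stands in for `pearsonr`.

```python
import numpy as np

rng = np.random.default_rng(4)

# 40 independent random pairs plus one appended outlier pair (6, 8)
x = np.append(rng.normal(size=40), 6.0)
y = np.append(rng.normal(size=40), 8.0)

r_all = np.corrcoef(x, y)[0, 1]

# Correlation coefficient with each point left out in turn
r_leave_one_out = np.array([
    np.corrcoef(np.delete(x, i), np.delete(y, i))[0, 1]
    for i in range(len(x))
])

# Index of the point whose removal changes r the most
most_influential = int(np.argmax(np.abs(r_leave_one_out - r_all)))

print('r with all points:', r_all)
print('index of the most influential point:', most_influential)
print('r without that point:', r_leave_one_out[most_influential])
```

If one removal shifts the coefficient far more than all the others, that point deserves a closer look before the correlation is interpreted.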

### Spurious correlation can arise due to a common causal factor

Let's assume that a common variable $Z$ influences both $X$ and $Y$.

$X = 3Z + \epsilon_X$

$Y = -2Z + \epsilon_Y$

$\epsilon_X$ and $\epsilon_Y$ are random noise in $X$ and $Y$ that is not explained by $Z$.

What does the correlation between $X$ and $Y$ look like?

In [None]:
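n = 30
# Z is the common cause shared by X and Y
Z = np.random.uniform(low=-2, high=2, size=n)

# Use n for the noise terms as well, so all three arrays always have the same length
X = 3 * Z + np.random.randn(n)

Y = -2 * Z + np.random.randn(n)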

r, p = pearsonr(X, Y)

plt.scatter(X,Y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('spurious correlation \ninduced by a common cause Z: r='+r'{:.02f}'.format(r))
plt.show()
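
To make the role of $Z$ more concrete, here is a sketch under assumptions: a much larger sample than above, a fixed random seed, and `np.corrcoef` in place of `pearsonr`. It compares the sample correlation with the value implied by the model, and then removes the linear effect of $Z$ from both variables before correlating what is left. For $Z$ uniform on $[-2, 2]$, the variance of $Z$ is $4^2/12 = 4/3$, and the noise terms have unit variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000  # large sample (an assumption) so estimates sit close to theory

Z = rng.uniform(low=-2, high=2, size=n)
X = 3 * Z + rng.normal(size=n)
Y = -2 * Z + rng.normal(size=n)

# Correlation implied by the model:
# Cov(X, Y) = 3 * (-2) * Var(Z), Var(X) = 9 * Var(Z) + 1, Var(Y) = 4 * Var(Z) + 1
var_Z = (2 - (-2)) ** 2 / 12
rho_theory = (3 * -2 * var_Z) / np.sqrt((9 * var_Z + 1) * (4 * var_Z + 1))  # about -0.88

r_xy = np.corrcoef(X, Y)[0, 1]

# Fit a straight line of each variable on Z and keep the residuals
slope_x, intercept_x = np.polyfit(Z, X, 1)
slope_y, intercept_y = np.polyfit(Z, Y, 1)
X_resid = X - (slope_x * Z + intercept_x)
Y_resid = Y - (slope_y * Z + intercept_y)

r_resid = np.corrcoef(X_resid, Y_resid)[0, 1]

print('correlation implied by the model:', rho_theory)
print('sample correlation between X and Y:', r_xy)
print('correlation after removing the effect of Z:', r_resid)
```

Once the part explained by $Z$ is taken out, the remaining correlation should be close to zero, which is what we expect when $X$ and $Y$ are related only through their common cause.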