In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from numpy import mean, std
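import matplotlib
matplotlib.rcParams['legend.scatterpoints'] = 1
try:
    from pylab import poly_between
except ImportError:
    # poly_between is missing from newer matplotlib releases; this stand-in
    # returns the vertices of the polygon between ylower and yupper
    def poly_between(x, ylower, yupper):
        x, ylower, yupper = (np.asarray(a) for a in (x, ylower, yupper))
        return (np.concatenate((x, x[::-1])),
                np.concatenate((yupper, ylower[::-1])))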

# stats60 specific

from code.week2 import pearson_lee
figsize = (8,8)

## Correlation (Chapters 8 and 9)

In the early 1900s Karl Pearson collected data on [heights](http://www.stat.cmu.edu/~roeder/stat707/=data/=data/data/Rlibraries/alr3/html/heights.html) of mothers
and daughters.

In [None]:
%%capture
height = pearson_lee()
f = plt.figure(figsize=figsize)
height._figure = f
height.draw()


In [None]:
height.figure


There is a *positive association* between the two.


### Drawing a scatter plot

What we saw above is an example of a **scatter plot**.

- Given two lists: $X, Y$ of length $n$
- Draw a coordinate system between roughly `[min(X), max(X)]` on the $x$-axis and between roughly `[min(Y), max(Y)]` on the $y$-axis.
- Each pair $(X_i, Y_i)$ gets a point with those corresponding coordinates.

In [None]:
fig = plt.figure(figsize=(6,6))
ax = fig.gca()
X = [1,4,6,9,3]
Y = [-2,2,8,0,1]
ax.scatter(X, Y, c='r', s=100)
ax.set_xlabel('X', fontsize=15)
ax.set_ylabel('Y', fontsize=15)

### Dependent and independent variables

- The variable plotted on the $x$-axis is called the *independent variable.*

- The variable plotted on the $y$-axis is called the *dependent variable.*

In [None]:
ax.set_xlabel('Independent', fontsize=15)
ax.set_ylabel('Dependent', fontsize=15)
ax.figure


### What can correlation tell us? 

- From the plot, daughters born to taller mothers tend to be taller. 

- The scatter plot also has more concrete information: it allows us to estimate
the *average* height of daughters born to mothers of a *given height*.

In [None]:
height.strip = 65
height.draw()
height.figure


What if we want to guess a daughter’s height when her mother is 65 inches tall?

### Average within a strip

In [None]:
height.axes.set_xlim([64.5,65.5])
height.axes.set_title('Zooming in on the strip...', color='red', fontsize=15)
height.figure


We see some variability within the strip: the scatter plot is not exactly a line.

However, we can compute the average height within a given strip.
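The strip average can be sketched directly in NumPy. This is only an illustration under stated assumptions: the data below are synthetic (made-up numbers, not Pearson's), the helper `strip_average` is hypothetical, and the half-width of 0.5 inch is a guess suggested by the zoomed limits `[64.5, 65.5]` above; the course's `mean_strip` may be defined differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (made-up) heights in inches, NOT Pearson's data
mother = 62.5 + 2.4 * rng.standard_normal(1000)
daughter = 63.7 + 0.5 * (mother - 62.5) + 2.2 * rng.standard_normal(1000)

def strip_average(x, y, center, halfwidth=0.5):
    """Average of y over the points whose x lies within halfwidth of center."""
    in_strip = np.abs(x - center) <= halfwidth
    return y[in_strip].mean()

avg_65 = strip_average(mother, daughter, 65)
print("average daughter height in the strip around 65 in: %0.1f" % avg_65)
```

The boolean mask `in_strip` plays the role of the vertical strip in the figure: only the points inside it contribute to the average.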

In [None]:
height.axes.set_title('The average within the strip is %0.1f' % height.mean_strip, fontsize=15)
height.figure

In [None]:
height.axes.set_xlim([54,72])
height.figure

Let's collect these averages within many strips.

In [None]:
%%capture
averages = []
height.draw()
mother_heights = range(56,69)
for mother in mother_heights:
    height.strip = mother
    averages.append(height.mean_strip)

height.strip = None
height.axes.plot(mother_heights, averages, linewidth=5, c='k')
height.axes.set_title("Relationship is almost a straight line.", fontsize=15)
height.axes.scatter(mother_heights, averages, s=300, c='yellow', label='Average(strip)')
height.axes.legend(loc='lower right') 

In [None]:
height.figure

In [None]:
height.axes.set_title('Slope of the line is predicted by correlation...', fontsize=15)
height.axes.figure

## Correlation

### Conceptual definition

- A numerical summary of a scatterplot, i.e. a pair of lists.
- If there is a strong association between two variables, then knowing one helps a lot in predicting the other. But when there is a weak association, information about one variable does not help much in guessing the other.

### Correlation coefficient

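The *correlation coefficient*, $r$, is a measure of the strength of this association. It always lies between $-1$ and $+1$.
* $r=+1$ if the variables are perfectly positively associated: the points lie exactly on a line with positive slope.
* $r=-1$ if the variables are perfectly negatively associated: the points lie exactly on a line with negative slope.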

### Perfectly positively correlated

In [None]:
%%capture
positive = plt.figure(figsize=(8,8))
ax = positive.gca()
X = np.random.standard_normal(50)
ax.scatter(X, X, c='red', s=100)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('$r=+1$', fontsize=20)

In [None]:
positive

### Perfectly negatively correlated, $r=-1$

In [None]:
%%capture
negative = plt.figure(figsize=(8,8))
ax = negative.gca()
X = np.random.standard_normal(50)
ax.scatter(X, -X, c='red', s=100)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('$r=-1$', fontsize=20)


In [None]:
negative

### Uncorrelated variables (no relation) $r=0$

In [None]:
%%capture
uncorrelated = plt.figure(figsize=(8,8))
ax = uncorrelated.gca()
X = np.random.standard_normal(50)
Y = np.random.standard_normal(50)
ax.scatter(X, Y, c='red', s=100)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('$r=0$', fontsize=20)


In [None]:
uncorrelated

### Positive and negative correlation

In [None]:
%%capture
def c_from_r(r):
   return np.sqrt(r**2 / (1.-r**2)) * np.sign(r)
mixture = plt.figure(figsize=(8,8))

X = np.random.standard_normal(50)
Y = np.random.standard_normal(50)
for i, r in zip([1,2,3,4], [0.4,0.9,-0.4,-0.9]):
   ax = plt.subplot(2,2,i)
   ax.scatter(X, Y + c_from_r(r) * X, c='red', s=50)
   ax.set_xticks([])
   ax.set_yticks([]);
   ax.set_title('r=%0.1f' % r)

In [None]:
mixture

## Computing $r$, the correlation coefficient

* Given two lists, $X, Y$, convert them each to standardized units. Call these new lists $Z_X, Z_Y$.
* Make a new list $Z_{XY}$ whose entries are the products of the entries of $Z_X, Z_Y$.
* Then, $r = \text{average}(Z_{XY}).$
* Another way:
$$
   r = \frac{\text{average(products $X, Y$)} - \text{average}(X) \times \text{average}(Y)}{\text{SD}(X) \times \text{SD}(Y)}.
   $$
   

### Summation notation

- The entries of the lists $Z_X, Z_Y, Z_{XY}$ are: 
$$\begin{aligned}
     Z_{X,i} &= \frac{X_i - \bar{X}}{\text{SD}(X)} \\
     Z_{Y,i} &= \frac{Y_i - \bar{Y}}{\text{SD}(Y)} \\
     Z_{XY,i} &= Z_{X,i} \times Z_{Y,i}
     \end{aligned}$$
- Then, $$r = r(X,Y) = \bar{Z}_{XY} = \frac{1}{n} \sum_{i=1}^n Z_{XY,i}.$$
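The standardized-units recipe can be checked on the small lists from the scatter plot example. This is a minimal sketch: it assumes the SD divides by $n$ (NumPy's default for `std`), and it compares the result with `np.corrcoef` as an independent check.

```python
import numpy as np

X = np.array([1, 4, 6, 9, 3], dtype=float)
Y = np.array([-2, 2, 8, 0, 1], dtype=float)

# Standardized units: subtract the average, divide by the SD.
# np.std divides by n (not n - 1) by default.
Z_X = (X - X.mean()) / X.std()
Z_Y = (Y - Y.mean()) / Y.std()

# Entry-by-entry products, then their average
Z_XY = Z_X * Z_Y
r = Z_XY.mean()

print("r from standardized units: %0.4f" % r)
print("r from np.corrcoef:        %0.4f" % np.corrcoef(X, Y)[0, 1])
```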

### Summation notation

- The other way above can be written as:
     * If $XY$ is the list with entries $X_i \times Y_i$, then $$r = \frac{\overline{XY} - \bar{X} \times \bar{Y}}{\text{SD}(X) \times \text{SD}(Y)}.$$
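
The two ways of computing $r$ agree. A short derivation, using only the definitions of the averages $\bar{X}$, $\bar{Y}$ and $\overline{XY}$:

$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})
  &= \frac{1}{n}\sum_{i=1}^n X_i Y_i - \bar{X} \cdot \frac{1}{n}\sum_{i=1}^n Y_i - \bar{Y} \cdot \frac{1}{n}\sum_{i=1}^n X_i + \bar{X}\bar{Y} \\
  &= \overline{XY} - \bar{X}\bar{Y} - \bar{X}\bar{Y} + \bar{X}\bar{Y} \\
  &= \overline{XY} - \bar{X}\bar{Y}.
\end{aligned}$$

Dividing both sides by $\text{SD}(X) \times \text{SD}(Y)$ turns the left side into the average of the products of standardized units and the right side into the second formula.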

## Example

Take X = [1,4,6,9,3],
Y = [-2,2,8,0,1].

 $$\begin{aligned}
   \bar{X} &= 4.6 & \text{SD}(X) &= 2.72 \\
   \bar{Y} &= 1.8 & \text{SD}(Y) &= 3.37 \\
   \end{aligned}$$
   
The only new thing to compute is $\overline{XY}$. 
$$XY = [-2,8,48,0,3], \qquad \overline{XY}=(-2+8+48+0+3)/5=11.4$$

Therefore
$$ r = \frac{11.4 - 4.6 \times 1.8}{2.72 \times 3.37} \approx 0.34$$

In [None]:
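X = [1,4,6,9,3]
Y = [-2,2,8,0,1]
XY = [x*y for x,y in zip(X,Y)]
# averages, SDs (np.std divides by n), and the average of the products
print(mean(X), mean(Y), std(X), std(Y), mean(XY))
R = (mean(XY) - mean(X) * mean(Y)) / (std(X) * std(Y))
# exact value next to the hand computation with rounded inputs
R, (11.4 - 4.6 * 1.8)/(2.72 * 3.37)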

### Properties of correlation

* Correlation is unitless.
* Changing units of $X$ or $Y$ does not change the correlation.
* Correlation does not change if we interchange $X$ and $Y$: it is *symmetric*.
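
These properties can be checked numerically on the small example lists. The sketch below uses a hypothetical helper `corr` (the average product of standardized units); the factor 2.54 stands in for a change of units such as inches to centimeters, and the shift by 10 is an extra illustration, since adding a constant also leaves standardized units unchanged.

```python
import numpy as np

X = np.array([1, 4, 6, 9, 3], dtype=float)
Y = np.array([-2, 2, 8, 0, 1], dtype=float)

def corr(a, b):
    """Correlation as the average product of standardized units."""
    za = (a - a.mean()) / a.std()
    zb = (b - b.mean()) / b.std()
    return (za * zb).mean()

r_original = corr(X, Y)
r_rescaled = corr(2.54 * X, Y)   # change the units of X
r_shifted = corr(X + 10, Y)      # add a constant to X
r_swapped = corr(Y, X)           # interchange the two lists

print(r_original, r_rescaled, r_shifted, r_swapped)
```

All four values agree up to floating-point rounding, because rescaling or shifting a list does not change its standardized units, and the product $Z_{X,i} \times Z_{Y,i}$ does not depend on the order of the factors.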

In [None]:
height.draw()
height.axes.set_title("$r=%0.2f$" % np.corrcoef([height.D, height.M])[0,1], fontsize=20)

## Correlation in Pearson's data

In [None]:
height.figure

## Correlation is symmetric

In [None]:
%%capture
swapped = plt.figure(figsize=(8,8))
ax = swapped.gca()
ax.scatter(height.D, height.M, c='red', s=100, edgecolor='gray')
ax.set_ylabel("Mother's height (inches)", fontsize=15)
ax.set_xlabel("Daughter's height (inches)", fontsize=15)
ax.set_title("$r=%0.2f$" % np.corrcoef([height.D, height.M])[0,1], fontsize=20)

In [None]:
swapped

**This plot also illustrates the important principle:**

     Correlation is not causality!
     
Why?


## Correlation

Like the mean and SD, the correlation can be greatly affected by outliers.

In [None]:
%%capture
outlier_fig = plt.figure(figsize=(8,8))
ax = outlier_fig.gca()
X = np.random.standard_normal(30)
X.sort()
e = np.random.standard_normal(30)
Y = 2 + 2.5 * X + 0.5 * e
Z = Y * 1.
Z[-1] = -3
ax.scatter(X[-1],Z[-1], s=200, c='r')
ax.scatter(X[:-1],Y[:-1], s=200)
ax.set_title('r=%0.2f with outlier, %0.2f without' % (np.corrcoef(X,Z)[0,1], np.corrcoef(X[:-1],Y[:-1])[0,1]),
             fontsize=15, color='red')
ax.set_xlim([1.1*X.min(),1.1*X.max()])
None

In [None]:
outlier_fig

## Correlation

- Correlation is a linear measure of association.

- Variables can be associated without being *linearly associated.*

In [None]:
%%capture 
quadratic_fig = plt.figure(figsize=(8,8))
ax = quadratic_fig.gca()
X = np.linspace(-2,2,50)
X += np.random.uniform(-0.05, 0.05, (50,))  # small symmetric jitter
X.sort()
e = np.random.standard_normal(50)
Y = - 1.5 * X**2 + 0.5 * e
ax.scatter(X,Y, c='r', s=200)
ax.set_title('r=%0.2f' % np.corrcoef(X,Y)[0,1], color='red', fontsize=20)

In [None]:
quadratic_fig

### The SD line

* The SD line passes through the point of averages $(\bar{X}, \bar{Y})$ and has $\text{slope(SD line)} = \frac{\text{SD}(Y)}{\text{SD}(X)} \times \text{sign}(r(X,Y))$.
* For every increase of one SD in $X$, the SD line changes by one SD in $Y$. The direction of change is positive if $X$ and $Y$ are positively correlated, and negative if they are negatively correlated.
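
As a small illustration, the SD line for the example lists can be built from the two facts above: its slope and the point it passes through. The function name `sd_line` is hypothetical, and the SD is assumed to divide by $n$, as with `np.std`.

```python
import numpy as np

X = np.array([1, 4, 6, 9, 3], dtype=float)
Y = np.array([-2, 2, 8, 0, 1], dtype=float)

r = np.corrcoef(X, Y)[0, 1]

# Slope: SD(Y) / SD(X), with the sign of the correlation
slope = np.sign(r) * Y.std() / X.std()
# Intercept chosen so that the line passes through the point of averages
intercept = Y.mean() - slope * X.mean()

def sd_line(x):
    """Height of the SD line at x."""
    return intercept + slope * x

# Moving one SD of X to the right changes the line by one SD of Y
step = sd_line(X.mean() + X.std()) - sd_line(X.mean())
print(slope, intercept, step, Y.std())
```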

### Correlation

In [None]:
height.SDline()
height.figure


The point cloud seems to cluster around the SD line.

Let's look at those means we computed within each strip above.

While the points are almost on a line, **they do not lie on the SD line.**

In [None]:
height.axes.plot(mother_heights, averages, linewidth=5, c='k', label='Strip averages')
height.axes.scatter(mother_heights, averages, s=300, c='yellow', label='Average(strip)')
height.axes.legend(loc='lower right', scatterpoints=1) 

In [None]:
height.figure

### Ecological correlations

* Plots of averages vs. averages can exaggerate correlations.
* Each point is an average of $X$ plotted against an average of $Y$. Averaging removes the variability within each group, so the points are less scattered than the individual pairs.
* In the next figure, I divided the data into groups based on mother’s height, then averaged both mother’s height and daughter’s height within each group.
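
The effect can be reproduced on synthetic data. The sketch below is an illustration only: the heights are made-up numbers (not Pearson's data), and the strips of half-width 0.5 inch around whole-inch centers are an assumption about how the groups are formed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (made-up) heights in inches, NOT Pearson's data
n = 1000
mother = 62.5 + 2.4 * rng.standard_normal(n)
daughter = 63.7 + 0.5 * (mother - 62.5) + 2.2 * rng.standard_normal(n)

# Correlation computed from the individual pairs
r_individuals = np.corrcoef(mother, daughter)[0, 1]

# Group by mother's height, then average both variables within each group
centers = np.arange(58, 68)
avg_mother, avg_daughter = [], []
for c in centers:
    in_strip = np.abs(mother - c) <= 0.5
    avg_mother.append(mother[in_strip].mean())
    avg_daughter.append(daughter[in_strip].mean())

# Correlation computed from the group averages
r_averages = np.corrcoef(avg_mother, avg_daughter)[0, 1]

print("r for individuals:    %0.2f" % r_individuals)
print("r for group averages: %0.2f" % r_averages)
```

The correlation of the averages comes out much closer to 1 than the correlation of the individual pairs, because the scatter within each strip has been averaged away.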

In [None]:
%%capture
maverages = []
for mother in mother_heights:
    maverages.append(height.M[(height.M >= mother-0.5) 
                              & (height.M <= mother+0.5)].mean())

ecological_plot = plt.figure(figsize=(8,8))
ax = ecological_plot.gca()
ax.scatter(maverages, averages, s=100, c='yellow', edgecolor='gray')
ax.set_xlabel("Average mother's height within strip (inches)", fontsize=15)
ax.set_ylabel("Average daughter's height within strip (inches)", fontsize=15)
ax.set_title('r=%0.2f' % np.corrcoef(maverages,averages)[0,1])


In [None]:
ecological_plot


Ecological correlations ignore variability …

## Correlation and selection bias

- Just as in measuring a single variable, correlation
is susceptible to *selection bias.*

- When the correlation is truly 0, selection bias can
cause us to think there is some correlation.

- The next figure shows 2000 points scattered at random, with $X$ and $Y$ generated independently, plotted on the square $[-4,4] \times [-4,4]$.

In [None]:
%%capture
X, Y = np.random.standard_normal((2,2000))  # * 8 - 4
independent = plt.figure(figsize=(8,8))
ax = independent.gca()
ax.scatter(X,Y, c='r', s=100, edgecolor='gray')
ax.set_title('$r=%0.2f$' % np.corrcoef([X,Y])[0,1], fontsize=20)
ax.set_xlim([-4,4])
ax.set_ylim([-4,4])

In [None]:
independent

However, suppose we collected data for our study in a particular fashion.
Without knowing it, we collected only the pairs whose sum $X+Y$ lies between $-1$ and $1$ (the shaded band in the next figure).


In [None]:
xf, yf = poly_between([-4,4], [3,-5], [5, -3])
g = (X + Y >= -1) * (X + Y <= 1)
ax.fill(xf, yf, facecolor='yellow', alpha=0.4, hatch='/')
ax.set_xlim([-4,4])
ax.set_ylim([-4,4])

In [None]:
independent

We have introduced a strong correlation between $X$ and $Y$ where
previously there was none.

This is an unrealistic example of how bias might be introduced
in correlation, but still...

In [None]:
%%capture
sampled = plt.figure(figsize=(8,8))
ax = sampled.gca()
ax.scatter(X[g],Y[g], c='r', s=100, edgecolor='gray')
ax.set_title('$r=%0.2f$' % np.corrcoef([X[g],Y[g]])[0,1], fontsize=20)
ax.set_xlim([-4,4])
ax.set_ylim([-4,4])
ax.fill(xf, yf, facecolor='yellow', alpha=0.4, hatch='/')

In [None]:
sampled