# Deficiencies of the Pearson Correlation Coefficient

While the Pearson correlation is widely used, it has significant limitations that can lead to misleading conclusions. In this notebook, we will explore:

1. **Anscombe's Quartet** - Four datasets with identical correlations but very different relationships
2. **Non-linear relationships** - Cases where correlation fails to capture dependence
3. **The relationship between correlation and regression**

Understanding these limitations is essential for knowing when to use more sophisticated dependence measures.

---

## Setup and Imports

In [None]:
# Environment configuration for Jupyter
%load_ext autoreload
%autoreload 2
%matplotlib notebook

# Visualization libraries
import matplotlib.pyplot as plt
from IPython import display
import seaborn as sns

# Standard library
import time

# Numerical computing
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Machine learning utilities
from sklearn import linear_model
import sklearn.datasets as toy_datasets

## Quick Review: Pearson Correlation

Recall the population Pearson correlation coefficient:

$$\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

where:
- $\text{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$ is the covariance
- $\mu_X = \mathbb{E}[X]$ is the mean of $X$
- $\mu_Y = \mathbb{E}[Y]$ is the mean of $Y$
- $\sigma_X$ is the standard deviation of $X$
- $\sigma_Y$ is the standard deviation of $Y$

And the sample estimator:

$$ r_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} $$
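As a quick sanity check, the sample estimator can be written out term by term and compared against `np.corrcoef`. This is a minimal sketch: the helper name `pearson_r`, the synthetic data, and the seed are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation, computed directly from the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the sample mean of x
    dy = y - y.mean()  # deviations from the sample mean of y
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2)))

# Synthetic, noisy linear data (seed chosen arbitrarily for reproducibility)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=200)
y_demo = 0.5 * x_demo + rng.normal(size=200)

r_manual = pearson_r(x_demo, y_demo)
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]
print(f"manual: {r_manual:.6f}   np.corrcoef: {r_numpy:.6f}")
```

The two values should agree up to floating-point rounding, since `np.corrcoef` normalizes the sample covariance in the same way.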

---

## Anscombe's Quartet: Same Correlation, Different Data

Anscombe's Quartet (1973) is a famous example demonstrating that datasets can have nearly identical summary statistics yet look completely different when plotted. To roughly two decimal places, all four datasets have:
- Same mean of $x$: 9
- Same mean of $y$: 7.50
- Same sample variance of $x$: 11
- Same sample variance of $y$: 4.125
- Same correlation: 0.816
- Same regression line: $y = 3.00 + 0.500x$

Yet visually, they tell very different stories!

In [None]:
# Anscombe's Quartet data
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = {
    'I': (x, y1),
    'II': (x, y2),
    'III': (x, y3),
    'IV': (x4, y4)
}

# Create 2x2 subplot grid
fig, axs = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 8),
                        gridspec_kw={'wspace': 0.08, 'hspace': 0.08})
axs[0, 0].set(xlim=(0, 20), ylim=(2, 14))
axs[0, 0].set(xticks=(0, 10, 20), yticks=(4, 8, 12))

for ax, (label, (x_data, y_data)) in zip(axs.flat, datasets.items()):
    # Add dataset label
    ax.text(0.1, 0.9, label, fontsize=20, transform=ax.transAxes, va='top')
    ax.tick_params(direction='in', top=True, right=True)
    
    # Plot data points
    ax.plot(x_data, y_data, 'o')

    # Fit and plot linear regression
    p1, p0 = np.polyfit(x_data, y_data, deg=1)  # slope, intercept
    ax.axline(xy1=(0, p0), slope=p1, color='r', lw=2)

    # Add correlation coefficient in text box
    stats = f'$r$ = {np.corrcoef(x_data, y_data)[0][1]:.2f}'
    bbox = dict(boxstyle='round', fc='blanchedalmond', ec='orange', alpha=0.5)
    ax.text(0.95, 0.07, stats, fontsize=9, bbox=bbox,
            transform=ax.transAxes, horizontalalignment='right')

plt.suptitle("Anscombe's Quartet: Same r, Different Relationships", fontsize=14)
plt.tight_layout()

### Observations from Anscombe's Quartet

- **Dataset I:** Noisy linear relationship - correlation is an appropriate summary
- **Dataset II:** Smooth curved (quadratic) relationship - $r$ summarizes only the linear trend and misses the curvature
- **Dataset III:** Exactly linear relationship except for one outlier - the outlier pulls $r$ below 1 and tilts the fitted line
- **Dataset IV:** $x$ is constant except for one extreme point - that single point alone produces the correlation

**Lesson:** Always visualize your data! Correlation alone can be misleading.
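The summary statistics quoted for the quartet can be checked numerically. The sketch below recomputes them from the same data used in the plot; the names `quartet` and `summary` are illustrative, and the variances are assumed to be sample variances (`ddof=1`).

```python
import numpy as np

# Anscombe's Quartet (same values as in the plotting cell)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I':   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for label, (xs, ys) in quartet.items():
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    slope, intercept = np.polyfit(xs, ys, deg=1)  # least-squares line
    summary[label] = {
        'mean_x': xs.mean(),
        'mean_y': ys.mean(),
        'var_x': xs.var(ddof=1),   # sample variance (assumption: ddof=1)
        'var_y': ys.var(ddof=1),
        'r': np.corrcoef(xs, ys)[0, 1],
        'slope': slope,
        'intercept': intercept,
    }

for label, s in summary.items():
    print(label, {k: round(float(v), 3) for k, v in s.items()})
```

The rows agree closely but not exactly, which is why the statistics are usually quoted to only two or three significant digits.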

---

## Non-Linear Relationships

Pearson correlation measures **linear** dependence only. For non-linear relationships, it can fail spectacularly.

### Example 1: Quadratic Relationship ($y = x^2$)

This is a perfectly deterministic relationship, but because $x$ is sampled symmetrically around zero, the positive and negative halves cancel and the correlation is zero!

In [None]:
# Generate quadratic data
x = np.linspace(-3, 3, 100)
y = np.power(x, 2)

# Create plot
plt.figure(figsize=(8, 5))

# Calculate correlation
r = np.corrcoef(x, y)
plt.title(r'$y = x^2$ : r = %.2f' % (r[0][1],))
plt.grid(True)

# Plot with regression line
g = sns.regplot(x=x, y=y)

# Add regression equation
regr = linear_model.LinearRegression()
regr.fit(x.reshape(-1, 1), y.reshape(-1, 1))
props = dict(boxstyle='round', alpha=0.5, color=sns.color_palette()[0])
textstr = 'y = %.2f + %.2fx' % (regr.intercept_[0], regr.coef_[0][0])
g.text(0.1, 0.9, textstr, transform=g.transAxes, fontsize=14, bbox=props)

plt.xlabel('x')
plt.ylabel('y')
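The zero correlation in this example depends on the sampling range, not just on the functional form. As a rough sketch (the variable names `x_sym`, `x_pos`, `r_sym`, and `r_pos` are illustrative), compare the symmetric range used above with a one-sided range where $y = x^2$ is monotonically increasing:

```python
import numpy as np

# Symmetric range: the left and right halves of the parabola cancel
x_sym = np.linspace(-3, 3, 100)
r_sym = np.corrcoef(x_sym, x_sym**2)[0, 1]

# One-sided range: y = x^2 is increasing, so a linear trend appears
x_pos = np.linspace(0, 3, 100)
r_pos = np.corrcoef(x_pos, x_pos**2)[0, 1]

print(f"x in [-3, 3]: r = {r_sym:.3f}")
print(f"x in [ 0, 3]: r = {r_pos:.3f}")
```

The same deterministic relationship therefore yields a correlation near zero or a large positive one depending on where $x$ is observed, which is one more reason not to read $r$ as a measure of dependence in general.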

### Example 2: Logarithmic Relationship ($y = \log(x)$)

Here $y$ is a strictly increasing, deterministic function of $x$, yet $r$ falls short of 1 because the relationship is not linear.

In [None]:
# Generate logarithmic data
x = np.linspace(0.01, 3, 100)
y = np.log(x)

# Create plot
plt.figure(figsize=(8, 5))

# Calculate correlation
r = np.corrcoef(x, y)
plt.title(r'$y = \log(x)$ : r = %.2f' % (r[0][1],))
plt.grid(True)

# Plot with regression line
g = sns.regplot(x=x, y=y)

# Add regression equation
regr = linear_model.LinearRegression()
regr.fit(x.reshape(-1, 1), y.reshape(-1, 1))
props = dict(boxstyle='round', alpha=0.5, color=sns.color_palette()[0])
textstr = 'y = %.2f + %.2fx' % (regr.intercept_[0], regr.coef_[0][0])
g.text(0.1, 0.9, textstr, transform=g.transAxes, fontsize=14, bbox=props)

plt.xlabel('x')
plt.ylabel('y')

---

## Relationship Between Regression and Correlation

There is a direct mathematical relationship between the regression slope and the correlation coefficient.

For simple linear regression:
$$Y_i = \alpha + \beta X_i + \epsilon_i$$

The ordinary least squares (OLS) estimator of the slope is:
$$\hat{\beta} = r_{xy} \frac{s_y}{s_x}$$

where:
- $r_{xy}$ is the sample correlation coefficient
- $s_x$ and $s_y$ are the sample standard deviations

**Key Insight:** The correlation coefficient is essentially a standardized regression slope. When both variables have the same standard deviation, $r = \hat{\beta}$.
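The identity $\hat{\beta} = r_{xy}\, s_y / s_x$ can be verified numerically. The sketch below uses synthetic data with arbitrary, illustrative parameters (seed, slope, noise level) and compares the slope from `np.polyfit` with the one reconstructed from the correlation; it also checks that the slope equals $r$ once both variables are standardized.

```python
import numpy as np

# Synthetic data (all parameters are arbitrary choices for illustration)
rng = np.random.default_rng(42)
x_s = rng.normal(loc=2.0, scale=3.0, size=500)
y_s = 1.5 - 0.8 * x_s + rng.normal(scale=2.0, size=500)

# OLS slope from a direct least-squares fit
slope_ols, intercept_ols = np.polyfit(x_s, y_s, deg=1)

# Slope reconstructed from the correlation and the sample standard deviations
r_xy = np.corrcoef(x_s, y_s)[0, 1]
slope_from_r = r_xy * y_s.std(ddof=1) / x_s.std(ddof=1)

# After standardizing both variables, the OLS slope is r itself
zx = (x_s - x_s.mean()) / x_s.std(ddof=1)
zy = (y_s - y_s.mean()) / y_s.std(ddof=1)
slope_std, _ = np.polyfit(zx, zy, deg=1)

print(f"OLS slope:          {slope_ols:.6f}")
print(f"r * s_y / s_x:      {slope_from_r:.6f}")
print(f"standardized slope: {slope_std:.6f}   r: {r_xy:.6f}")
```

The choice of `ddof` does not matter here as long as it is the same for both standard deviations, because the normalizing factors cancel in the ratio $s_y / s_x$.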

---

## Summary of Pearson Correlation Deficiencies

| Deficiency | Example |
|------------|---------|
| Only measures **linear** dependence | $y = x^2$ with $x$ symmetric about 0 has $r = 0$ despite perfect dependence |
| Sensitive to **outliers** | A single extreme point can dominate the correlation |
| Does not imply **causation** | High correlation does not mean one variable causes the other |
| Not **invariant** to transformations | In general, $\rho(X, Y) \neq \rho(f(X), g(Y))$ for nonlinear $f, g$ |
| Can be **zero** for dependent variables | Symmetric non-linear relationships yield $r = 0$ |
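The lack of invariance under nonlinear transformations can be illustrated with the logarithmic example from earlier. In this sketch (variable names are illustrative), an increasing affine rescaling of $x$ leaves $r$ unchanged, while applying $\log$ to $x$ changes it, here all the way to 1 because $y = \log(x)$ is then an exactly linear function of the transformed variable.

```python
import numpy as np

x_t = np.linspace(0.01, 3, 100)
y_t = np.log(x_t)

# Correlation on the original scale
r_raw = np.corrcoef(x_t, y_t)[0, 1]

# Increasing affine transformation of x: correlation is unchanged
r_affine = np.corrcoef(2.0 * x_t + 1.0, y_t)[0, 1]

# Nonlinear (log) transformation of x: correlation changes
r_log = np.corrcoef(np.log(x_t), y_t)[0, 1]

print(f"r(x, y)        = {r_raw:.3f}")
print(f"r(2x + 1, y)   = {r_affine:.3f}")
print(f"r(log(x), y)   = {r_log:.3f}")
```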

---

## Key Takeaways

1. **Always visualize data** before relying on correlation.

2. **Correlation measures linear dependence only** - it can completely miss non-linear relationships.

3. **Outliers can dramatically affect correlation** - be cautious with extreme values.

4. **We need better dependence measures** - especially for non-linear relationships and non-Gaussian data.

---

**Next:** We will explore alternative dependence measures that address these limitations, including Mutual Information, Distance Correlation, and the Maximal Information Coefficient.