# Pearson Correlation Coefficient

The Pearson correlation coefficient is the most widely used measure of linear dependence between two continuous random variables. In this notebook, we will:

1. Define the population and sample correlation coefficients
2. Explore how correlation relates to the bivariate Gaussian distribution
3. Visualize correlation with real data examples

---

## Setup and Imports

In [None]:
# Environment configuration for Jupyter
%load_ext autoreload
%autoreload 2
%matplotlib notebook

# Standard library
import time

# Visualization libraries
import matplotlib.pyplot as plt
from IPython import display
import seaborn as sns

# Numerical computing
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Machine learning utilities
from sklearn import linear_model
import sklearn.datasets as toy_datasets

## Population Correlation Coefficient

The **population Pearson correlation coefficient** measures the linear relationship between two random variables $X$ and $Y$:

$$\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

where:
- $\text{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$ is the covariance
- $\mu_X = \mathbb{E}[X]$ is the mean of $X$
- $\mu_Y = \mathbb{E}[Y]$ is the mean of $Y$
- $\sigma_X$ is the standard deviation of $X$
- $\sigma_Y$ is the standard deviation of $Y$
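
As a quick numerical illustration of the definition, the sketch below computes $\rho$ from a 2x2 covariance matrix. The matrix entries and the helper name `correlation_from_covariance` are made up for illustration and do not come from any dataset in this notebook.

```python
import numpy as np

def correlation_from_covariance(cov):
    """Return rho = Cov(X, Y) / (sigma_X * sigma_Y) for a 2x2 covariance matrix."""
    cov = np.asarray(cov, dtype=float)
    sigma_x = np.sqrt(cov[0, 0])  # standard deviation of X
    sigma_y = np.sqrt(cov[1, 1])  # standard deviation of Y
    return cov[0, 1] / (sigma_x * sigma_y)

# Hypothetical covariance matrix: Var(X) = 4, Var(Y) = 9, Cov(X, Y) = 3
example_cov = [[4.0, 3.0], [3.0, 9.0]]
rho_example = correlation_from_covariance(example_cov)
print(rho_example)  # 3 / (2 * 3) = 0.5
```

Dividing by both standard deviations is what makes $\rho$ unitless: rescaling $X$ or $Y$ by a positive constant changes the covariance but leaves the ratio unchanged.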

**Properties:**
- $\rho \in [-1, 1]$
- $\rho = 1$ indicates a perfect positive linear relationship
- $\rho = -1$ indicates a perfect negative linear relationship
- $\rho = 0$ indicates no linear relationship (but not necessarily independence!)
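
The last property is easy to check numerically. In the minimal sketch below, `y_sq` is completely determined by `x_sym`, yet the correlation is zero up to floating-point error because the grid is symmetric around zero. The variable names and the choice of $y = x^2$ are assumptions made for this illustration.

```python
import numpy as np

# x is a grid symmetric around zero; y is a deterministic function of x
x_sym = np.linspace(-1.0, 1.0, 201)
y_sq = x_sym ** 2

# Off-diagonal entry of the 2x2 correlation matrix
r_sym = np.corrcoef(x_sym, y_sq)[0, 1]
print(abs(r_sym) < 1e-10)  # correlation vanishes despite full dependence
```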

## Sample Correlation Coefficient (Estimator)

Given $n$ paired observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the **sample Pearson correlation coefficient** is:

$$ r_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} $$

where $\bar{x}$ and $\bar{y}$ are the sample means.

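This estimator is consistent but not exactly unbiased. Even when the data come from a bivariate normal distribution, $\mathbb{E}[r_{xy}] \approx \rho \left( 1 - \frac{1-\rho^2}{2n} \right)$, so $r_{xy}$ slightly underestimates $|\rho|$ in small samples; the bias vanishes as $n \to \infty$.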
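
The sample formula above can be transcribed directly into NumPy. The sketch below does so and compares the result with `np.corrcoef`; the helper name `pearson_r` and the five data points are invented for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation computed term by term from the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the sample mean of x
    dy = y - y.mean()  # deviations from the sample mean of y
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

# Made-up, nearly linear data
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r_manual = pearson_r(x_demo, y_demo)
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]
print(np.isclose(r_manual, r_numpy))
```

Note that the $1/(n-1)$ factors of the sample covariance and sample variances cancel, which is why they do not appear in the formula or in the code.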

---

## Examples with Real Data

### Example 1: Strong Positive Correlation

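We examine the relationship between alcohol content and proline levels in scikit-learn's wine dataset (178 samples, 13 chemical features). These two variables are clearly positively correlated; the computed $r$ is shown in the plot title.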

In [None]:
# Load wine dataset
X, y = toy_datasets.load_wine(return_X_y=True)

# Extract alcohol content and proline
xx = X[:, 0]   # Alcohol content
yy = X[:, 12]  # Proline

# Create scatter plot
plt.figure()
sns.scatterplot(x=xx, y=yy)

# Calculate and display Pearson correlation
r = np.corrcoef(xx, yy)
plt.title('r = %.2f' % (r[0, 1],))
plt.xlabel('Alcohol')
plt.ylabel('Proline')

### Example 2: Weaker Correlation

The relationship between alcohol and magnesium shows a weaker positive correlation.

In [None]:
# Extract alcohol and magnesium
xx = X[:, 0]   # Alcohol
yy = X[:, 4]   # Magnesium

# Create scatter plot
plt.figure()
sns.scatterplot(x=xx, y=yy)

# Calculate and display Pearson correlation
r = np.corrcoef(xx, yy)
plt.title('r = %.2f' % (r[0, 1],))
plt.xlabel('Alcohol')
plt.ylabel('Magnesium')
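
The two examples above select columns by hard-coded index. As a hedged alternative, the sketch below looks the columns up by name through the dataset's `feature_names` and uses `scipy.stats.pearsonr`, which also returns a p-value for the null hypothesis of zero correlation. The helper `wine_correlation` is a hypothetical name introduced here, not part of either library.

```python
from scipy.stats import pearsonr
from sklearn.datasets import load_wine

wine = load_wine()
names = list(wine.feature_names)

def wine_correlation(name_a, name_b):
    """Pearson r and two-sided p-value between two named wine features."""
    col_a = wine.data[:, names.index(name_a)]
    col_b = wine.data[:, names.index(name_b)]
    r_value, p_value = pearsonr(col_a, col_b)
    return r_value, p_value

r_proline, p_proline = wine_correlation('alcohol', 'proline')
r_magnesium, p_magnesium = wine_correlation('alcohol', 'magnesium')
print('alcohol vs proline:   r = %.2f' % r_proline)
print('alcohol vs magnesium: r = %.2f' % r_magnesium)
```

Looking columns up by name avoids silently reading the wrong feature if the column order is ever misremembered.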

---

## Correlation and the Bivariate Gaussian Distribution

The Pearson correlation coefficient has a special relationship with the bivariate Gaussian distribution. For a bivariate normal distribution with parameters $(\mu_x, \mu_y, \sigma_x, \sigma_y, \rho)$, the probability density function is:

$$ f(x,y) = \frac{1}{2 \pi \sigma_x \sigma_y \sqrt{1-\rho^2}} \exp \left( -\frac{1}{2(1-\rho^2)} \left[ \left( \frac{x-\mu_x}{\sigma_x} \right)^2 - 2\rho \left( \frac{x-\mu_x}{\sigma_x} \right) \left( \frac{y-\mu_y}{\sigma_y} \right) + \left( \frac{y-\mu_y}{\sigma_y} \right)^2 \right] \right) $$

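**Key Insight:** For the bivariate Gaussian, once the marginal means and variances are fixed, the correlation coefficient $\rho$ completely characterizes the dependence structure; in particular, $\rho = 0$ implies that the two variables are independent. This is **not** true for other distributions!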
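
As a sanity check on the density formula, the sketch below evaluates it directly and compares the result with `scipy.stats.multivariate_normal`. The function name `bivariate_gaussian_pdf`, the parameter values, and the evaluation point are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bivariate_gaussian_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Evaluate the bivariate Gaussian PDF term by term from the formula."""
    zx = (x - mu_x) / sigma_x  # standardized x
    zy = (y - mu_y) / sigma_y  # standardized y
    norm_const = 1.0 / (2.0 * np.pi * sigma_x * sigma_y * np.sqrt(1.0 - rho ** 2))
    quad = (zx ** 2 - 2.0 * rho * zx * zy + zy ** 2) / (2.0 * (1.0 - rho ** 2))
    return norm_const * np.exp(-quad)

# Arbitrary illustrative parameters
mu_x, mu_y, sigma_x, sigma_y, rho = 1.0, -0.5, 2.0, 0.5, 0.6

# Equivalent covariance matrix: the off-diagonal entry is rho * sigma_x * sigma_y
cov = [[sigma_x ** 2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y ** 2]]

point = (0.3, -0.2)
pdf_formula = bivariate_gaussian_pdf(point[0], point[1],
                                     mu_x, mu_y, sigma_x, sigma_y, rho)
pdf_scipy = multivariate_normal(mean=[mu_x, mu_y], cov=cov).pdf(point)
print(np.isclose(pdf_formula, pdf_scipy))
```

The five-parameter form and the covariance-matrix form describe the same distribution; the only translation step is the off-diagonal entry $\rho \sigma_x \sigma_y$.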

### Visualizing Different Correlation Values

The following plots show contours of the bivariate Gaussian PDF for different values of $\rho$:

In [None]:
# Create 3x3 grid of bivariate Gaussian contour plots
fig, ax = plt.subplots(3, 3, figsize=(9, 9))

# Nine correlation values evenly spaced from -0.9 to 0.9
rvec = np.linspace(-0.9, 0.9, 9)

# Evaluation grid, shared by all panels. It is named grid_x/grid_y so that the
# wine feature matrix X loaded earlier is not overwritten.
N = 200  # grid points per axis
K = 2    # plot range is [-K, K] on both axes
grid_x, grid_y = np.meshgrid(np.linspace(-K, K, N), np.linspace(-K, K, N))
pos = np.dstack((grid_x, grid_y))

for axis, rho in zip(ax.flat, rvec):
    # Covariance matrix with unit variances, so it equals the correlation matrix
    R = [[1, rho], [rho, 1]]
    rv = mvn([0, 0], R)
    Z = rv.pdf(pos)

    # Plot contours of the PDF
    axis.contour(grid_x, grid_y, Z)
    axis.set_title(r'$\rho$ = %.2f' % (rho,))
    axis.set_aspect('equal')

plt.suptitle('Bivariate Gaussian Distributions with Varying Correlation', y=1.02)
plt.tight_layout()
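
To connect the contour plots back to the sample estimator, the sketch below draws samples from a bivariate Gaussian with a chosen $\rho$ and checks that the sample correlation lands close to it. The seed, the sample size, and the value `rho_true = 0.7` are arbitrary assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

rho_true = 0.7
cov = [[1.0, rho_true], [rho_true, 1.0]]  # unit variances, so cov equals the correlation matrix

# Draw paired samples; each row is one (x, y) observation
samples = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

r_hat = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
print('true rho = %.2f, sample r = %.2f' % (rho_true, r_hat))
```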

---

## Key Takeaways

1. **The Pearson correlation coefficient** measures linear dependence between two variables.

2. **Range:** $\rho \in [-1, 1]$, where the sign indicates the direction and the magnitude indicates the strength.

3. **Gaussian relationship:** For bivariate Gaussian distributions, $\rho$ fully characterizes the dependence structure.

4. **Limitations:** Pearson correlation has significant limitations for non-Gaussian data, which we explore in the next notebook.

---

**Next:** We will examine the deficiencies of Pearson correlation and why it can be misleading for many real-world applications.