# Module 3: Correlation and Simple Linear Regression
This module explores how variables relate to each other through correlation analysis and how we can model and predict these relationships using simple linear regression.

## 🎯 Learning Objectives
- Understand correlation coefficients and how to interpret them
- Use scatterplots and heatmaps to explore relationships
- Perform and interpret a simple linear regression in Python
- Diagnose model performance and residuals

## 🔗 What is Correlation?
Correlation measures the strength and direction of a linear relationship between two numerical variables.

- **Positive correlation**: As one variable increases, the other tends to increase
- **Negative correlation**: As one variable increases, the other tends to decrease
- **No correlation**: No consistent linear pattern between the two

### 📏 Pearson Correlation Coefficient (r)
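- Ranges from -1 to 1; the sign gives the direction and the magnitude gives the strength
- Values close to 0 indicate a weak linear relationship (a nonlinear relationship may still exist)
- Use Pearson for linear relationships and Spearman's rank correlation for monotonic ones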
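
The difference between the two coefficients is easiest to see on data that is monotonic but not linear. The sketch below is illustrative only: the exponential relationship, the seed, and the names `x_demo` and `y_demo` are assumptions made for this example and are not part of the module's dataset. It also computes Pearson's r directly from its definition to show what `pearsonr` is doing.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility

# Hypothetical data: y always increases with x, but exponentially rather than linearly
x_demo = rng.uniform(0, 5, 200)
y_demo = np.exp(x_demo)

# Pearson's r from its definition: covariation divided by the product of the spreads
dx = x_demo - x_demo.mean()
dy = y_demo - y_demo.mean()
manual_r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Library versions (each returns the coefficient and a p-value)
pearson_r, _ = pearsonr(x_demo, y_demo)
spearman_rho, _ = spearmanr(x_demo, y_demo)

print(f"Pearson r (manual): {manual_r:.3f}")
print(f"Pearson r (scipy):  {pearson_r:.3f}")
print(f"Spearman rho:       {spearman_rho:.3f}")
```

Because Spearman works on ranks, it treats any strictly increasing relationship as perfect, while Pearson is pulled below 1 by the curvature.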

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

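# Simulated dataset: scores rise by about 2 points per study hour, plus random noise
np.random.seed(42)  # fix the seed so the results are reproducible
x = np.random.normal(50, 10, 100)          # 100 values with mean 50 and SD 10
y = 2 * x + np.random.normal(0, 10, 100)   # linear signal plus noise with SD 10
df = pd.DataFrame({'Study Hours': x, 'Test Score': y})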

# Correlation matrix (DataFrame.corr() computes Pearson's r by default)
correlation = df.corr()
print(correlation)

## 📊 Visualizing Correlation

In [None]:
sns.scatterplot(data=df, x='Study Hours', y='Test Score')
plt.title('Scatterplot of Study Hours vs Test Score')
plt.show()

sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

## 📐 Simple Linear Regression
Simple linear regression models the relationship between one independent variable (X) and a dependent variable (Y) using a straight line.

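Model: $Y = \beta_0 + \beta_1 X + \varepsilon$
- $\beta_0$: Intercept (the predicted value of Y when X = 0)
- $\beta_1$: Slope (the expected change in Y for a one-unit increase in X)
- $\varepsilon$: Error term (the variation in Y that the line does not explain)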
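
The least-squares estimates of the slope and intercept have a simple closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept makes the line pass through the point of means. The sketch below checks this against `np.polyfit`. The data is simulated with the same recipe as the module's dataset but a different, arbitrary seed, and the names `hours` and `scores` are placeholders for this example.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility

# Hypothetical data: true slope 2, true intercept 0, noise SD 10
hours = rng.normal(50, 10, 100)
scores = 2 * hours + rng.normal(0, 10, 100)

# Closed-form least-squares estimates
slope_hat = np.cov(hours, scores, ddof=1)[0, 1] / np.var(hours, ddof=1)
intercept_hat = scores.mean() - slope_hat * hours.mean()

# Cross-check: a degree-1 polynomial fit returns [slope, intercept]
slope_np, intercept_np = np.polyfit(hours, scores, 1)

print(f"Closed form: slope={slope_hat:.3f}, intercept={intercept_hat:.3f}")
print(f"np.polyfit:  slope={slope_np:.3f}, intercept={intercept_np:.3f}")
```

The estimated slope should land near the true value of 2 but will not match it exactly, since the noise differs from sample to sample.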

In [None]:
from scipy.stats import linregress

# Perform linear regression (linregress takes x first, then y)
slope, intercept, r_value, p_value, std_err = linregress(df['Study Hours'], df['Test Score'])

print(f"Slope: {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
print(f"R-squared: {r_value**2:.3f}")
print(f"P-value: {p_value:.3g}")  # general format, so a very small p-value is not printed as 0.0000
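
R-squared is the share of the variation in Y that the fitted line accounts for. In simple linear regression it equals the square of Pearson's r, which is why the cell above can compute it as `r_value**2`. The sketch below computes it from the sums of squares instead and confirms that the two agree. It uses its own simulated data (arbitrary seed, placeholder names) so that it runs on its own.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)  # arbitrary seed for reproducibility

# Hypothetical data generated with the same recipe as the module's dataset
hours_r2 = rng.normal(50, 10, 100)
scores_r2 = 2 * hours_r2 + rng.normal(0, 10, 100)

fit = linregress(hours_r2, scores_r2)
fitted = fit.intercept + fit.slope * hours_r2

ss_res = np.sum((scores_r2 - fitted) ** 2)            # variation left unexplained
ss_tot = np.sum((scores_r2 - scores_r2.mean()) ** 2)  # total variation in Y
r_squared = 1 - ss_res / ss_tot

print(f"R-squared from sums of squares: {r_squared:.3f}")
print(f"R-squared from r_value squared: {fit.rvalue ** 2:.3f}")
```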

### 📈 Plotting the Regression Line

In [None]:
# Predict values
predicted = intercept + slope * df['Study Hours']

# Plot
plt.scatter(df['Study Hours'], df['Test Score'], label='Observed')
plt.plot(df['Study Hours'], predicted, color='red', label='Fitted Line')
plt.xlabel('Study Hours')
plt.ylabel('Test Score')
plt.legend()
plt.title('Linear Regression Fit')
plt.show()

## 🧪 Residual Analysis
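Residuals are the differences between observed and predicted values (observed minus predicted).
A residual plot helps check the linearity and homoscedasticity (constant variance) assumptions: points scattered randomly around zero with a roughly even spread support both, a curved pattern suggests nonlinearity, and a funnel shape suggests non-constant variance.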

In [None]:
residuals = df['Test Score'] - predicted
plt.scatter(predicted, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residual Plot')
plt.xlabel('Predicted Score')
plt.ylabel('Residuals')
plt.show()
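
A few numbers can back up what the residual plot shows. In a least-squares fit with an intercept, the residuals always average to zero and are uncorrelated with X, so those two checks only confirm that the fit was computed correctly. The spread comparison at the end is a rough heuristic for homoscedasticity, not a formal test, and the data, seed, and variable names are assumptions made for this sketch.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)  # arbitrary seed for reproducibility

# Hypothetical data with constant noise variance
hours_chk = rng.normal(50, 10, 100)
scores_chk = 2 * hours_chk + rng.normal(0, 10, 100)

fit_chk = linregress(hours_chk, scores_chk)
fitted_chk = fit_chk.intercept + fit_chk.slope * hours_chk
resid_chk = scores_chk - fitted_chk

# Properties guaranteed by least squares (up to floating-point error)
resid_mean = resid_chk.mean()
resid_x_corr = np.corrcoef(hours_chk, resid_chk)[0, 1]

# Rough heuristic: compare residual spread for low vs. high fitted values
order = np.argsort(fitted_chk)
half = len(order) // 2
low_half, high_half = resid_chk[order[:half]], resid_chk[order[half:]]
spread_ratio = high_half.std(ddof=1) / low_half.std(ddof=1)

print(f"Mean residual:              {resid_mean:.2e}")
print(f"Corr(residuals, X):         {resid_x_corr:.2e}")
print(f"Spread ratio (high / low):  {spread_ratio:.2f}")
```

A spread ratio far from 1 would be a hint to look more closely at the plot for a funnel shape.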

## ✅ Practice Exercises
1. Simulate two variables with a negative correlation and calculate Pearson’s r.
2. Plot a regression line for any two continuous variables and interpret the slope.
3. Use a real dataset (e.g., seaborn's `tips` or `penguins`) to run a regression.
4. Create a residual plot. What patterns do you see?
5. Reflect: What does an R² of 0.80 tell you about the model?