# Linear Regression and Related Topics

This notebook summarizes key concepts of linear regression, each with a short illustrative Python example. It closes with a real-world example that applies linear regression to scikit-learn's built-in Diabetes dataset.

## 1.1 Best Linear Prediction and Ordinary Least Squares (OLS)

**Core Content:**
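- The best linear prediction (BLP) parameter \(\beta\) minimizes the expected squared prediction error \(E[(Y - X'\beta)^2]\).
- Setting the first-order condition \(E[X(Y - X'\beta)] = 0\) and assuming \(E[XX']\) is invertible gives \(\beta = (E[XX'])^{-1}E[XY]\).
- The Ordinary Least Squares (OLS) estimator replaces these expectations with sample averages: \(\hat{\beta}_n = (X'X)^{-1}X'Y\), where \(X\) now denotes the \(n \times p\) matrix of regressors and \(Y\) the \(n\)-vector of outcomes. Under standard regularity conditions (for example, i.i.d. sampling and suitable moment conditions), \(\hat{\beta}_n\) is consistent for \(\beta\) and asymptotically normal.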

**Key Python Code:**

In [None]:
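import numpy as np

np.random.seed(0)  # fix the seed so the output is reproducible
n = 100  # number of observations
p = 3    # number of regressors
X = np.random.randn(n, p)
# True coefficients are (2, -1, 0.5); the noise is standard normal
Y = X @ np.array([2, -1, 0.5]) + np.random.randn(n)
# Solve the normal equations (X'X) beta = X'Y without forming an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print("OLS Estimator:", beta_hat)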
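
The consistency property mentioned above can be illustrated by simulation. The following is a minimal sketch, not a proof: it assumes the same data-generating process as the cell above, and the helper `ols_error`, the seed, and the sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator; the seed is arbitrary
beta_true = np.array([2.0, -1.0, 0.5])  # coefficients used to simulate the data

def ols_error(n):
    """Simulate n observations and return the Euclidean error of the OLS estimate."""
    X = rng.standard_normal((n, beta_true.size))
    Y = X @ beta_true + rng.standard_normal(n)
    # lstsq solves the least-squares problem without inverting X'X explicitly
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.linalg.norm(beta_hat - beta_true)

# The estimation error should shrink as the sample size grows
errors = {n: ols_error(n) for n in (100, 10_000, 100_000)}
for n, err in errors.items():
    print(f"n = {n:>9,d}  ||beta_hat - beta|| = {err:.4f}")
```

In a single simulated run the error need not fall monotonically, but it should be much smaller at the largest sample size than at the smallest.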

## 1.2 Frisch-Waugh-Lovell Theorem

**Core Content:**
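- Split the covariates into a target variable \(D\) and controls \(W\). The coefficient on \(D\) in the full regression of \(Y\) on \(D\) and \(W\) equals the coefficient from regressing the residualized \(Y\) on the residualized \(D\), where each is residualized by partialling out \(W\) (that is, by taking the residual from its own regression on \(W\)).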

**Key Python Code:**

In [None]:
from statsmodels.api import OLS, add_constant
import numpy as np

n = 100
D = np.random.randn(n)     # target regressor
W = np.random.randn(n, 2)  # controls
Y = D + W @ np.array([0.5, -0.3]) + np.random.randn(n)

# Full regression of Y on a constant, D, and W; the coefficient on D is reported as x1
X = add_constant(np.column_stack([D, W]))
model = OLS(Y, X).fit()
print(model.summary())
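
The partialling-out step itself can be checked numerically. The sketch below is a self-contained illustration using only NumPy; the helper `residualize` and the seed are assumptions made for this example. It compares the coefficient on `D` from the full regression with the one obtained after residualizing both `Y` and `D` on the controls.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n = 100
D = rng.standard_normal(n)
W = rng.standard_normal((n, 2))
Y = D + W @ np.array([0.5, -0.3]) + rng.standard_normal(n)

def residualize(v, controls):
    """Return the residual of v after an OLS projection onto the controls."""
    coef, *_ = np.linalg.lstsq(controls, v, rcond=None)
    return v - controls @ coef

controls = np.column_stack([np.ones(n), W])  # constant plus W
full = np.column_stack([np.ones(n), D, W])   # constant, D, and W

# Coefficient on D from the full regression (column index 1)
beta_full, *_ = np.linalg.lstsq(full, Y, rcond=None)

# Coefficient on D from the residual-on-residual regression
Y_tilde = residualize(Y, controls)
D_tilde = residualize(D, controls)
beta_fwl = (D_tilde @ Y_tilde) / (D_tilde @ D_tilde)

print("Full regression coefficient on D:", beta_full[1])
print("Partialled-out coefficient on D: ", beta_fwl)
```

The two numbers agree up to floating-point error, which is the content of the theorem. The constant is included among the controls so that both regressions partial out the same set of variables.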

## 1.3 Omitted Variable Bias

**Core Content:**
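- When a relevant variable (e.g., \(W\)) is omitted from the regression, the estimated coefficient on the included variable (\(D\)) is biased whenever \(W\) both affects \(Y\) and is correlated with \(D\). In the simple case \(Y = \beta D + \gamma W + \epsilon\), with \(\epsilon\) uncorrelated with \(D\) and \(W\), the coefficient from regressing \(Y\) on \(D\) alone converges to \(\beta + \gamma\,\mathrm{Cov}(D, W)/\mathrm{Var}(D)\).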

**Key Python Code:**

In [None]:
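# Reuses n, D, OLS, and add_constant from the previous cell
# W must be correlated with D; an independent W would not bias the coefficient on D
W = 0.8 * D + np.random.randn(n)
Y = 2*D + 0.5*W + np.random.randn(n)
# Regress Y on D alone, omitting W; the slope converges to 2 + 0.5 * 0.8 = 2.4, not 2
X_omit = add_constant(D)
model_omit = OLS(Y, X_omit).fit()
print(model_omit.summary())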
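
The size of the bias can be verified with the in-sample omitted-variable decomposition: the coefficient from the short regression equals the long-regression coefficient on `D` plus the long-regression coefficient on `W` times the slope from regressing `W` on `D`. The following is a minimal sketch under illustrative assumptions: the 0.8 loading of `W` on `D`, the helper `ols`, and the seed are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 1_000
D = rng.standard_normal(n)
W = 0.8 * D + rng.standard_normal(n)  # omitted variable, correlated with D
Y = 2 * D + 0.5 * W + rng.standard_normal(n)

def ols(y, *columns):
    """OLS of y on a constant plus the given columns; returns the coefficients."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

short = ols(Y, D)     # omits W: [intercept, slope on D]
long_ = ols(Y, D, W)  # includes W: [intercept, slope on D, slope on W]
aux = ols(W, D)       # auxiliary regression of W on D

# Short slope = long slope on D + (long slope on W) * (slope of W on D)
implied = long_[1] + long_[2] * aux[1]
print("Short-regression coefficient on D:", short[1])
print("Implied by the decomposition:     ", implied)
print("Long-regression coefficient on D: ", long_[1])
```

The first two printed values match exactly in-sample, while the long-regression coefficient stays close to the true value of 2.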

## 1.4 Conditional Expectation

**Core Content:**
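- The conditional expectation \(E[Y|X]\) is the expected value of \(Y\) given \(X\); it is a function of \(X\) and the best predictor of \(Y\) under squared-error loss.
- Key properties include linearity, \(E[aY + bZ|X] = aE[Y|X] + bE[Z|X]\), and the law of iterated expectations, \(E[E[Y|X]] = E[Y]\).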

**Key Python Code:**

In [None]:
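X_vals = np.random.randn(n)
Y_vals = 3 * X_vals + np.random.randn(n)
# Sample analog of E[Y | X > 0]: average Y over the observations with X > 0.
# For standard normal X the population value is 3 * sqrt(2 / pi), about 2.39.
conditional_mean = np.mean(Y_vals[X_vals > 0])
print("Conditional Expectation for X > 0:", conditional_mean)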
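
The law of iterated expectations has an exact sample counterpart when the data are split into groups: weighting each group mean by its group share recovers the overall mean. The sketch below illustrates this with the two groups \(X > 0\) and \(X \le 0\); the seed and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 10_000
X_vals = rng.standard_normal(n)
Y_vals = 3 * X_vals + rng.standard_normal(n)

# Partition the sample by the sign of X
positive = X_vals > 0
mean_pos = Y_vals[positive].mean()   # sample analog of E[Y | X > 0]
mean_neg = Y_vals[~positive].mean()  # sample analog of E[Y | X <= 0]
share_pos = positive.mean()          # sample analog of P(X > 0)

# Average the conditional means, weighted by the group probabilities
iterated = share_pos * mean_pos + (1 - share_pos) * mean_neg
print("Weighted average of group means:", iterated)
print("Overall mean of Y:              ", Y_vals.mean())
```

The two printed values coincide up to floating-point error.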

## 1.5 Linear Regression Model

**Core Content:**
- In the linear regression model \(Y = X'\beta + \epsilon\), the assumption \(E[\epsilon|X] = 0\) implies \(E[Y|X] = X'\beta\).

**Key Python Code:**

In [None]:
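from sklearn.linear_model import LinearRegression
X = np.random.randn(n, 2)
# True model: intercept 3, slopes 2 and -1, standard normal noise
Y = 3 + 2*X[:, 0] - X[:, 1] + np.random.randn(n)
# LinearRegression fits an intercept by default, so X needs no constant column
model = LinearRegression().fit(X, Y)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)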
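
The assumption \(E[\epsilon|X] = 0\) implies the moment condition \(E[X\epsilon] = 0\), and OLS imposes the sample version of that condition by construction. The following sketch, which uses NumPy only and an arbitrary seed, fits the same model by least squares and checks that the residuals are orthogonal to every column of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 100
X = rng.standard_normal((n, 2))
Y = 3 + 2 * X[:, 0] - X[:, 1] + rng.standard_normal(n)

# Add a constant column so the intercept is estimated with the slopes
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
residuals = Y - X_design @ beta_hat

# Sample analog of E[X * epsilon]; zero up to floating-point error
moments = X_design.T @ residuals / n
print("Estimated (intercept, slopes):", beta_hat)
print("Sample moments of X * residual:", moments)
```

This orthogonality is a mechanical property of the OLS fit, so it does not by itself confirm that \(E[\epsilon|X] = 0\) holds in the population.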

## 1.6 Potential Outcomes Framework and Causal Parameters

**Core Content:**
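- Each unit has two potential outcomes, \(Y(1)\) under treatment and \(Y(0)\) without it, but only one of them is observed. The Average Treatment Effect (ATE) is \(E[Y(1) - Y(0)]\), and the Conditional Average Treatment Effect (CATE) is \(E[Y(1) - Y(0)|X]\). When treatment is randomly assigned, the ATE can be estimated by the difference in mean outcomes between treated and untreated units.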

**Key Python Code:**

In [None]:
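D = np.random.binomial(1, 0.5, n)  # randomized treatment assignment
Y0 = 10 + np.random.randn(n)       # potential outcome without treatment
Y1 = 15 + np.random.randn(n)       # potential outcome with treatment; the true ATE is 5
Y = D * Y1 + (1 - D) * Y0          # only one potential outcome is observed per unit
# Difference in means; this estimates the ATE because D is randomly assigned
ATE = np.mean(Y[D == 1]) - np.mean(Y[D == 0])
print("Estimated ATE:", ATE)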
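
A CATE can be estimated the same way within subgroups. The sketch below is a hypothetical extension of the simulation: the binary covariate `G`, the effect sizes (3 when `G = 0`, 7 when `G = 1`), the helper `diff_in_means`, and the seed are all assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 20_000
G = rng.binomial(1, 0.5, n)     # hypothetical binary covariate
D = rng.binomial(1, 0.5, n)     # randomized treatment, independent of G
Y0 = 10 + rng.standard_normal(n)
Y1 = Y0 + 3 + 4 * G             # effect is 3 when G = 0 and 7 when G = 1
Y = D * Y1 + (1 - D) * Y0       # observed outcome

def diff_in_means(y, d):
    """Difference in mean outcomes between treated and untreated units."""
    return y[d == 1].mean() - y[d == 0].mean()

ate_hat = diff_in_means(Y, D)
# Repeat the comparison inside each value of G to estimate the CATE
cate_hat = {g: diff_in_means(Y[G == g], D[G == g]) for g in (0, 1)}

print("Estimated ATE:", ate_hat)
for g, effect in cate_hat.items():
    print(f"Estimated CATE for G = {g}:", effect)
```

With half the units in each group, the ATE in this design is the average of the two conditional effects, (3 + 7) / 2 = 5.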

## 1.7 Endogeneity and Instrumental Variables

**Core Content:**
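- Endogeneity occurs when a regressor is correlated with the error term, which makes OLS inconsistent. Instrumental Variables (IV) address this with an instrument \(Z\) that is correlated with the endogenous regressor (relevance) but uncorrelated with the error term (exogeneity).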

**Key Python Code:**

In [None]:
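from statsmodels.sandbox.regression.gmm import IV2SLS
Z = np.random.randn(n)  # instrument: shifts X_iv but is unrelated to the error
U = np.random.randn(n)  # unobserved confounder that makes X_iv endogenous
X_iv = 0.5 * Z + U + np.random.randn(n)
# The error term (U plus noise) is correlated with X_iv, so OLS would be biased
Y_iv = 2 * X_iv + U + np.random.randn(n)
iv_model = IV2SLS(Y_iv, X_iv, instrument=Z).fit()
print(iv_model.summary())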
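
With one regressor and one instrument, the IV estimate reduces to the ratio \(\mathrm{Cov}(Z, Y)/\mathrm{Cov}(Z, X)\). The sketch below compares it with OLS in a simulation where an unobserved confounder `U` makes the regressor endogenous. The confounder, the helper `cov`, the seed, and the large sample size are assumptions chosen so that the contrast is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 50_000
Z = rng.standard_normal(n)                 # instrument
U = rng.standard_normal(n)                 # unobserved confounder
X = 0.5 * Z + U + rng.standard_normal(n)   # endogenous regressor
Y = 2 * X + U + rng.standard_normal(n)     # true coefficient on X is 2

def cov(a, b):
    """Sample covariance of two one-dimensional arrays."""
    return np.mean((a - a.mean()) * (b - b.mean()))

beta_ols = cov(X, Y) / cov(X, X)  # biased upward because X and U move together
beta_iv = cov(Z, Y) / cov(Z, X)   # uses only the variation in X induced by Z

print("OLS estimate:", beta_ols)
print("IV estimate: ", beta_iv)
```

In this design the OLS slope converges to 2 + 1/2.25, roughly 2.44, while the IV ratio converges to 2.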

## 1.8 Challenges with Many Covariates

**Core Content:**
- With many covariates, multicollinearity and overfitting can arise. When predictors are highly correlated, \(X'X\) is close to singular, so the OLS coefficients have large variances and can change sharply with small changes in the data.

**Key Python Code:**

In [None]:
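X1 = np.random.randn(n)
X2 = 0.9 * X1 + 0.1 * np.random.randn(n)  # nearly collinear with X1 (correlation about 0.99)
Y_multi = 3 * X1 + 0.5 * X2 + np.random.randn(n)
X_multicol = add_constant(np.column_stack([X1, X2]))
model_multicol = OLS(Y_multi, X_multicol).fit()
# Expect large standard errors on the X1 and X2 coefficients in the summary
print(model_multicol.summary())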
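
Two common diagnostics quantify this instability: the variance inflation factor (VIF), defined as \(1/(1 - R^2)\) from regressing one predictor on the others, and the condition number of the design matrix. The following is a minimal NumPy-only sketch of both, using the same construction of `X1` and `X2`; the seed and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 1_000
X1 = rng.standard_normal(n)
X2 = 0.9 * X1 + 0.1 * rng.standard_normal(n)  # nearly collinear with X1

# R^2 from regressing X2 on a constant and X1
A = np.column_stack([np.ones(n), X1])
coef, *_ = np.linalg.lstsq(A, X2, rcond=None)
resid = X2 - A @ coef
r_squared = 1 - resid.var() / X2.var()

# VIF: factor by which collinearity inflates the variance of the X2 coefficient
vif = 1 / (1 - r_squared)

# Condition number: ratio of largest to smallest singular value of the design matrix
design = np.column_stack([np.ones(n), X1, X2])
cond_number = np.linalg.cond(design)

print("R^2 of X2 on X1:", r_squared)
print("VIF for X2:     ", vif)
print("Condition number of the design matrix:", cond_number)
```

As a rough rule of thumb, a VIF above about 10 is often read as a sign of problematic collinearity; here the population value is about 82.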

## Real-World Example: Diabetes Dataset

We use the built-in Diabetes dataset from scikit-learn to perform linear regression and assess which variables are predictive of disease progression.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

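diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Split first, then fit the scaler on the training data only,
# so that no test-set statistics leak into the preprocessing step
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)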

model = LinearRegression().fit(X_train, y_train)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Test R^2 Score:", model.score(X_test, y_test))

y_pred = model.predict(X_test)
plt.scatter(y_test, y_pred)
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Diabetes Progression Prediction")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'r--')
plt.grid(True)
plt.show()
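
To see which variables carry the most weight, the coefficients can be paired with the dataset's feature names and sorted by absolute size. The sketch below is one simple way to do this, refitting the same model in a self-contained cell; the variable `ranked` is an illustrative name.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)

# Standardize with training-set statistics so coefficients are on a comparable scale
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# Pair each coefficient with its feature name and sort by absolute magnitude
ranked = sorted(
    zip(diabetes.feature_names, model.coef_),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, coef in ranked:
    print(f"{name:>4s}  {coef:+8.2f}")
```

Coefficient size is only a rough guide to predictive importance here, because several of the features are correlated with one another, which is the multicollinearity issue from Section 1.8.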