## Closed-Form Solution
### 1. Ordinary Least Squares (OLS) Perspective

#### 1.1 Model Representation:
The linear regression model is represented as:
$$
\hat{y} = X\theta
$$
where:
- $X$ is the design matrix with dimensions $n \times (m+1)$ for $n$ samples and $m$ features; the extra column of ones accounts for the intercept term.
- $\theta$ is the parameter vector with dimensions $(m+1) \times 1$.
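
As a minimal sketch of how such a design matrix might be assembled with NumPy: the feature values, the parameter values, and the names `X_raw` and `X_design` are made up for illustration and are not part of the original notebook.

```python
import numpy as np

# Hypothetical raw feature matrix: n = 4 samples, m = 2 features
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0],
                  [7.0, 8.0]])

# Prepend a column of ones so that theta[0] acts as the intercept
X_design = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Illustrative parameter vector of shape (m + 1, 1), intercept first
theta = np.array([[0.5], [1.0], [-1.0]])

y_hat = X_design @ theta  # predictions, shape (n, 1)
print(X_design.shape, theta.shape, y_hat.shape)
```

Placing the intercept in the first column matches the convention used by the `LinearRegression` class later in this notebook.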

#### 1.2 Loss Function:
The mean squared error (MSE) between the predictions $X\theta$ and the $n \times 1$ vector of observed targets $y$ is given by:
$$
J(\theta) = \frac{1}{n}(X\theta - y)^T(X\theta - y)
$$
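
The matrix form of the loss can be checked against the more familiar elementwise mean of squared residuals. The sketch below uses a tiny made-up dataset, and the helper name `mse_loss` is hypothetical.

```python
import numpy as np

def mse_loss(X, theta, y):
    """J(theta) = (1/n) (X theta - y)^T (X theta - y), returned as a float."""
    residual = X @ theta - y          # shape (n, 1)
    n = X.shape[0]
    return float((residual.T @ residual).item() / n)

# Made-up data: intercept column plus one feature
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([[1.0], [3.0], [5.0]])
theta = np.array([[0.0], [2.0]])      # predicts 0, 2, 4

loss = mse_loss(X, theta, y)
# The matrix form agrees with the elementwise mean of squared residuals
assert np.isclose(loss, np.mean((X @ theta - y) ** 2))
```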

#### 1.3 Objective:
Given a dataset with $n$ samples and $m$ features, we want to find the parameter vector $\theta$ that minimizes the mean squared error (MSE) between predicted values $\hat{y}$ and actual values $y$. Because the constant factor $\frac{1}{n}$ does not change the minimizer, it can be dropped:
$$
\theta = \arg\min_\theta{J(\theta)}=\arg\min_\theta{(X\theta - y)^T(X\theta - y)}
$$
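
One way to see this objective in action, sketched here on synthetic data with illustrative parameter values, is to let `np.linalg.lstsq` minimize the sum of squared residuals and then confirm that randomly perturbed parameter vectors never achieve a lower loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): n = 50 samples, intercept plus m = 2 features
n = 50
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
y = X @ np.array([[1.0], [2.0], [-3.0]]) + 0.1 * rng.normal(size=(n, 1))

def J(theta):
    r = X @ theta - y
    return (r.T @ r).item() / n

# np.linalg.lstsq minimizes the sum of squared residuals, i.e. n * J(theta)
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Loss at randomly perturbed parameter vectors, for comparison
perturbed = [J(theta_star + 0.1 * rng.normal(size=theta_star.shape))
             for _ in range(100)]
print(J(theta_star), min(perturbed))
```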

### 2. Maximum Likelihood Estimation Perspective


#### 2.1 Model Representation:
The linear regression model is represented as:
$$
y = X\theta + \epsilon
$$
where $\epsilon$ is the $n \times 1$ vector of error terms, each assumed to be independent and normally distributed with mean zero and variance $\sigma^2$.
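
To make the generative assumption concrete, the following sketch draws data from this model. The sample size, noise level, and "true" parameters are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n, sigma = 1000, 0.5                      # illustrative sample size and noise level
theta_true = np.array([[4.0], [3.0]])     # hypothetical intercept and slope

x = rng.uniform(0.0, 1.0, size=(n, 1))
X = np.hstack([np.ones((n, 1)), x])       # design matrix with intercept column

# Independent Gaussian noise with mean 0 and standard deviation sigma
epsilon = rng.normal(loc=0.0, scale=sigma, size=(n, 1))
y = X @ theta_true + epsilon

# The empirical noise moments should be close to the assumed ones
print(epsilon.mean(), epsilon.std())
```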

#### 2.2 Likelihood Function:
Under the assumption of independent Gaussian errors, the likelihood of the observed data is the product of the probability density functions of the individual observations, where $X^{(i)}$ denotes the $i$-th row of $X$ and $y^{(i)}$ the $i$-th target:
$$
\mathcal{L}(\theta | X, y) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)} - X^{(i)}\theta)^2}{2\sigma^2}\right)
$$

#### 2.3 Log-Likelihood Function:
Taking the logarithm of the likelihood function yields the log-likelihood function:
$$
\ell(\theta | X, y) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y^{(i)} - X^{(i)}\theta)^2
$$
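
As a quick numerical check, the log-likelihood formula can be compared with the logarithm of the product of per-sample densities. This is a sketch on a small made-up dataset; both function names are hypothetical, and $\sigma^2$ is treated as known.

```python
import numpy as np

def log_likelihood(theta, X, y, sigma2):
    """Gaussian log-likelihood of theta for a known noise variance sigma2."""
    n = X.shape[0]
    residual = y - X @ theta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (residual ** 2).sum() / (2 * sigma2)

def log_likelihood_from_product(theta, X, y, sigma2):
    """Same quantity, computed as the log of the product of per-sample densities."""
    residual = y - X @ theta
    densities = np.exp(-residual ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return np.log(np.prod(densities))

# Small made-up example
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[1.1], [2.9], [5.2]])
theta = np.array([[1.0], [2.0]])

a = log_likelihood(theta, X, y, sigma2=0.25)
b = log_likelihood_from_product(theta, X, y, sigma2=0.25)
print(a, b)
```

For large $n$ the product of densities can underflow to zero in floating point, which is one practical reason to work with the log-likelihood directly.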

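#### 2.4 Objective:
We seek the parameter vector $\theta$ that maximizes the likelihood of the observed data. Because the logarithm is monotonic, this is equivalent to maximizing the log-likelihood, and since the first term of $\ell$ does not depend on $\theta$, it is also equivalent to minimizing the sum of squared residuals.

#### 2.5 Maximum Likelihood Estimation:
To find the maximum likelihood estimate of $\theta$, we take the gradient of the log-likelihood with respect to $\theta$ and set it to zero:
$$
\nabla_{\theta} \ell(\theta | X, y) = \frac{1}{\sigma^2}X^T(y - X\theta) = 0
$$
Solving this equation gives $X^TX\theta = X^Ty$, the same condition that the OLS objective leads to, so the maximum likelihood estimate of $\theta$ coincides with the OLS estimate.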
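
Differentiating the log-likelihood with respect to $\theta$ gives $X^T(y - X\theta)/\sigma^2$. The sketch below, on synthetic data with illustrative values, checks that this gradient vanishes at a least-squares estimate computed independently with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative values)
n, sigma2 = 200, 0.25
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
y = X @ np.array([[0.5], [-1.0], [2.0]]) + np.sqrt(sigma2) * rng.normal(size=(n, 1))

def grad_log_likelihood(theta):
    # Gradient of the Gaussian log-likelihood: X^T (y - X theta) / sigma^2
    return X.T @ (y - X @ theta) / sigma2

# Least-squares estimate computed independently of the likelihood
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Largest gradient component at the least-squares estimate
print(np.abs(grad_log_likelihood(theta_ols)).max())
```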

Both the ordinary least squares perspective and the maximum likelihood perspective lead to the same closed-form solution, which the next section derives step by step.

### 3. Derivation Steps
1. **Compute the Gradient:**
$$
\frac{\partial J}{\partial \theta} = \frac{2}{n}X^T(X\theta - y)
$$

2. **Set the Gradient to Zero:**
$$
\frac{2}{n}X^T(X\theta - y) = 0
$$

3. **Rearrange into the Normal Equations:**
$$
X^TX\theta = X^Ty
$$

4. **Obtain the Analytical Solution:** assuming $X^TX$ is invertible (that is, $X$ has full column rank),
$$
\theta = (X^TX)^{-1}X^Ty
$$
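
The derivation steps above can be checked numerically. The sketch below uses synthetic data with illustrative values. It compares the analytic gradient with central finite differences, then solves the normal equations and compares the result with `np.linalg.lstsq`. Using `np.linalg.solve` instead of an explicit inverse is a choice made here, not something the derivation requires.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data (illustrative values)
n = 100
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 3))])
y = X @ np.array([[1.0], [0.5], [-2.0], [3.0]]) + 0.2 * rng.normal(size=(n, 1))

def J(theta):
    r = X @ theta - y
    return (r.T @ r).item() / n

def grad_J(theta):
    # Analytic gradient from step 1: (2/n) X^T (X theta - y)
    return 2.0 / n * X.T @ (X @ theta - y)

# Step 1 check: central finite differences at an arbitrary point
theta0 = rng.normal(size=(4, 1))
eps = 1e-6
numeric = np.zeros_like(theta0)
for j in range(theta0.shape[0]):
    e = np.zeros_like(theta0)
    e[j] = eps
    numeric[j] = (J(theta0 + e) - J(theta0 - e)) / (2 * eps)

# Steps 2 to 4: solve the normal equations X^T X theta = X^T y
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.abs(numeric - grad_J(theta0)).max())
print(np.abs(theta_closed - theta_lstsq).max())
```

When $X^TX$ is singular or badly conditioned, the explicit inverse is unreliable, and a least-squares solver such as `np.linalg.lstsq` is usually the safer choice.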








In [5]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
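class LinearRegression:
    def __init__(self, lr=0.01, epochs=1000, method='closed_form', reg=None, alpha=0.01):
        self.lr = lr  # learning rate (gradient descent only)
        self.epochs = epochs  # number of gradient descent iterations
        self.method = method  # 'closed_form' or 'gradient_descent'
        self.theta = None  # parameter vector, intercept first
        self.reg = reg  # regularization: None (no penalty), 'l1', or 'l2'
        self.alpha = alpha  # regularization strength
        self.loss_history = []  # MSE recorded after each gradient descent step

    def fit(self, X, y):
        # Prepend a column of ones for the intercept term
        X_b = np.c_[np.ones((X.shape[0], 1)), X]

        if self.method == 'closed_form':
            if self.reg is None:
                # No regularization: theta = (X^T X)^{-1} X^T y
                self.theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
            elif self.reg == 'l1':
                # Note: Lasso has no closed-form solution in general. This branch
                # computes a ridge-style solution that leaves the intercept unpenalized.
                L = np.eye(X_b.shape[1])
                L[0, 0] = 0  # do not penalize the intercept
                self.theta = np.linalg.inv(X_b.T.dot(X_b) + self.alpha * L).dot(X_b.T).dot(y)
            elif self.reg == 'l2':
                # L2 regularization (Ridge): theta = (X^T X + alpha I)^{-1} X^T y.
                # Unlike the gradient descent branch, this also penalizes the intercept.
                self.theta = np.linalg.inv(X_b.T.dot(X_b) + self.alpha * np.eye(X_b.shape[1])).dot(X_b.T).dot(y)
        elif self.method == 'gradient_descent':
            # Random initialization of the parameters
            self.theta = np.random.randn(X_b.shape[1], 1)
            # Gradient descent
            for _ in range(self.epochs):
                # Gradient of the MSE: (2/n) X^T (X theta - y)
                gradients = 2 / len(X_b) * X_b.T.dot(X_b.dot(self.theta) - y)
                if self.reg == 'l1':
                    # L1 regularization (Lasso): subgradient, intercept not penalized
                    gradients[1:] += self.alpha * np.sign(self.theta[1:])
                elif self.reg == 'l2':
                    # L2 regularization (Ridge): intercept not penalized
                    gradients[1:] += 2 * self.alpha * self.theta[1:]
                self.theta -= self.lr * gradients
                # Record the unpenalized MSE after the update
                loss = np.mean((X_b.dot(self.theta) - y) ** 2)
                self.loss_history.append(loss)
        else:
            raise ValueError("Unsupported method. Choose 'closed_form' or 'gradient_descent'.")

    def predict(self, X):
        # Prepend a column of ones for the intercept term
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        return X_b.dot(self.theta)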

In [None]:
# Generate some random data
np.random.seed(42)
X = np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Introduce outliers
X_outliers = np.array([[1], [1.01], [1.02],[1.03], [1.04], [1.05]])
y_outliers = np.array([[30], [35], [39],[40], [41], [42]])
X = np.vstack((X, X_outliers))
y = np.vstack((y, y_outliers))
methods = ['closed_form','gradient_descent']
regs = [None,'l1','l2']
# Instantiate and fit one model per (method, regularization) combination
models = {}
for m in methods:
    for r in regs:
        models['{}_{}'.format(m,r)] = LinearRegression(method=m,reg=r,alpha=10)
        models['{}_{}'.format(m,r)].fit(X,y)
# theta[0] is the intercept b and theta[1] is the slope w
w, b = [], []
for k in models:
    p = models[k].theta.ravel()
    b.append(p[0])
    w.append(p[1])
df = pd.DataFrame({'solution':list(models.keys()),'w':w,'b':b})
df

In [None]:
# Visualize data and fitted models
plt.figure(figsize=(10, 5))

# Scatter plot of original data
plt.scatter(X, y, color='blue', label='Original data')

# Fitted lines for every method and regularization combination
for m in methods:
    for r in regs:
        plt.plot(X, models['{}_{}'.format(m,r)].predict(X), label='{}_{}'.format(m,r))

plt.title('Linear Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Plot loss change during gradient descent
plt.figure(figsize=(8, 5))
for r in regs:
    key = '{}_{}'.format(methods[1], r)  # methods[1] is 'gradient_descent'
    plt.plot(range(len(models[key].loss_history)), models[key].loss_history, label=key)
plt.title('Gradient Descent Loss')
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.grid(True)
plt.legend()
plt.show()