# **Machine Learning: Ordinary Least Squares**

- __Part 1: Probability__ 

- __Part 2: Linear Algebra__

- __Part 3: Statistics__

## Part 1: Probability

- **Conditional Distributions (Single Variable)**

    - The distribution of a random variable $Y$ conditional on another random variable $X$ taking on a specific value is called the conditional distribution of $Y$ given $X$

    - In general: $$Pr(Y = y | X = x) = \frac{Pr(Y = y, X = x)}{Pr(X = x)}$$
    
    - In regression, we're interested in how $Y$ is distributed for each value of $X$. Ordinary Least Squares assumes that for each $X = x$, there's a conditional distribution of $Y$ with:

        - A mean that depends on $x$

        - Constant variance (homoskedasticity assumption)

- **Conditional Expectations (Single Variable)**

    - The conditional expectation is the expected value of $Y$, computed using the conditional distribution of $Y$ given $X$

    - If $Y$ takes on $k$ values $y_1, y_2, \ldots, y_k$, then the conditional mean of $Y$ given $X = x$ is:
$$E[Y | X = x] = \sum_{i=1}^{k} y_i \cdot Pr(Y = y_i | X = x)$$

- **Conditional Distributions (Multi-variable)**

    - The distribution of the random variable $Y$ conditional on multiple random variables $X_1, X_2, \ldots, X_p$ each taking on a specific value is called the conditional distribution of $Y$ given $X_1, X_2, \ldots, X_p$
    
    - In general: $$Pr(Y = y | X_1 = x_1, X_2 = x_2, \ldots, X_p = x_p) = \frac{Pr(Y = y, X_1 = x_1, \ldots, X_p = x_p)}{Pr(X_1 = x_1, \ldots, X_p = x_p)}$$

    - In multiple regression, we model how $Y$ is distributed conditional on the entire set of predictor variables

- **Conditional Expectations (Multi-variable)**

    - The conditional expectation of $Y$ given multiple predictors is:
    $$E[Y | X_1 = x_1, \ldots, X_p = x_p] = \sum_{i=1}^{k} y_i \cdot Pr(Y = y_i | X_1 = x_1, \ldots, X_p = x_p)$$

    - Multiple linear regression assumes that this conditional expectation is linear in the predictors:
    $$E[Y | X_1, \ldots, X_p] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$
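
The definitions above can be illustrated with a small discrete example. This is a minimal sketch: the joint probabilities below are hypothetical numbers chosen for illustration, not data from these notes.

```python
import numpy as np

# Hypothetical joint pmf Pr(Y = y, X = x): rows index y, columns index x
y_values = np.array([0.0, 1.0, 2.0])
x_values = np.array([0, 1])
joint = np.array([
    [0.10, 0.05],   # Pr(Y=0, X=0), Pr(Y=0, X=1)
    [0.30, 0.15],   # Pr(Y=1, X=0), Pr(Y=1, X=1)
    [0.10, 0.30],   # Pr(Y=2, X=0), Pr(Y=2, X=1)
])

# Marginal Pr(X = x): sum the joint pmf over all values of y
marginal_x = joint.sum(axis=0)

# Conditional Pr(Y = y | X = x) = Pr(Y = y, X = x) / Pr(X = x), column by column
conditional = joint / marginal_x

# Conditional mean E[Y | X = x] = sum_i y_i * Pr(Y = y_i | X = x)
cond_mean = y_values @ conditional

for x, m in zip(x_values, cond_mean):
    print(f"E[Y | X = {x}] = {m:.4f}")
```

Each column of `conditional` sums to one, and the conditional mean changes with $x$, which is exactly the dependence that regression tries to model.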

## Part 2: Linear Algebra

### Basic Linear Algebra Notation and Basic Computations

- **Matrix**

    - An _m × n_ rectangular array of entries 

- **Vector**

    - Row Vector

        - A _1 × n_ matrix consisting of n entries

        - For instance, let $\mathbf{u}$ denote a _1 × n_ row vector,
        $$\mathbf{u} = [u_1, u_2, \ldots, u_n]$$

    - Column Vector

        - An _m × 1_ matrix consisting of m entries

        - For instance, let $\mathbf{v}$ denote an _m × 1_ column vector,
        $$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix}$$

- **Inner Product**

    - Commonly referred to as the dot product when $\mathbf{u}$ and $\mathbf{v}$ are in $\mathbb{R}^n$

    - The inner product results in a scalar

    - The inner product of two vectors $\mathbf{u}$ and $\mathbf{v}$ is denoted by

$$ \langle \mathbf{u}, \mathbf{v} \rangle = [u_1, u_2, \ldots, u_n] \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n $$

- **Distance**

    - For $\mathbf{u}$ and $\mathbf{v}$ in $\mathbb{R}^n$, the distance between $\mathbf{u}$ and $\mathbf{v}$, is the length of the vector $\mathbf{u} - \mathbf{v}$

    - Distance is denoted by

$$
\text{dist}(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\| = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_n - v_n)^2}
$$
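
As a quick numerical check of the inner product and distance formulas, here is a minimal sketch with two arbitrary example vectors (the values are illustrative only).

```python
import numpy as np

# Two arbitrary vectors in R^3, chosen only for illustration
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, -1.0, 2.0])

# Inner (dot) product: 1*4 + 2*(-1) + 3*2
inner = np.dot(u, v)

# Distance: length of u - v, computed with the norm and directly from the formula
dist = np.linalg.norm(u - v)
dist_manual = np.sqrt(np.sum((u - v) ** 2))

print(f"<u, v> = {inner}")
print(f"dist(u, v) = {dist:.4f}")
```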

### Orthogonality

- Two vectors are said to be orthogonal (perpendicular) if their inner product is zero 

- **Definition: Orthogonal Set**

    - A set of vectors where each pair of distinct vectors from the set is orthogonal

    - That is, $\mathbf{u}_i \cdot \mathbf{u}_j = 0$ whenever $i \neq j$, where $\mathbf{u}_i$ and $\mathbf{u}_j$ are vectors in the set

- **Orthogonal Projection Formula**

    - The orthogonal projection of $\mathbf{y}$ onto the line $\mathbf{L}$ spanned by a nonzero vector $\mathbf{u}$ is

$$\hat{\mathbf{y}} = \text{proj}_\mathbf{L}(\mathbf{y}) = \frac{\mathbf{y} \cdot \mathbf{u}}{\mathbf{u} \cdot \mathbf{u}} \mathbf{u}$$

- **Orthogonal Projections**

    - Given a vector $\mathbf{y}$ and a subspace $\mathbf{W}$ in $\mathbb{R}^n$, there is a vector $\hat{\mathbf{y}}$ in $\mathbf{W}$ such that

        - $\hat{\mathbf{y}}$ is the unique vector in $\mathbf{W}$ for which $\mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to $\mathbf{W}$, and

        - $\hat{\mathbf{y}}$ is the unique vector in $\mathbf{W}$ closest to $\mathbf{y}$

- **Orthogonal Decomposition Formula**

    - Let $\mathbf{W}$ be a subspace of $\mathbb{R}^n$. Then each $\mathbf{y}$ in $\mathbb{R}^n$ can be written uniquely in the form
    $\mathbf{y} = \hat{\mathbf{y}} + \mathbf{z}$, where $\hat{\mathbf{y}}$ is in $\mathbf{W}$ and $\mathbf{z}$ is in $\mathbf{W}^\perp$

    - In fact, if $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_p\}$ is any orthogonal basis of $\mathbf{W}$, then

    $$\hat{\mathbf{y}} = \frac{\mathbf{y} \cdot \mathbf{u}_1}{\mathbf{u}_1 \cdot \mathbf{u}_1}\mathbf{u}_1 + \frac{\mathbf{y} \cdot \mathbf{u}_2}{\mathbf{u}_2 \cdot \mathbf{u}_2}\mathbf{u}_2 + \cdots + \frac{\mathbf{y} \cdot \mathbf{u}_p}{\mathbf{u}_p \cdot \mathbf{u}_p}\mathbf{u}_p $$

    - And, 
    $$\mathbf{z} = \mathbf{y} - \hat{\mathbf{y}}$$

### Orthogonal Projection of $\mathbf{y}$ onto $\mathbf{W}$
<img src="/Users/henrycosentino/Desktop/Python/quant_mentorship/images/orthogonal_projection.png">

In [8]:
import numpy as np
import pandas as pd

In [9]:
# Example of an Orthogonal Projection Decomposition

# Basis for W (matrix)
W = [np.array([1, 0, 0]), np.array([0, 1, 0]), np.array([0, 0, 1])]

# y (vector)
y = np.array([0, 1, 1])

# y_hat (vector)
y_hat = sum((np.dot(y, u) / np.dot(u, u)) * u for u in W) # <--- Decomposition formula applied to find y_hat

print("W:\n", np.matrix(W))
print("\n")
print("y_hat:\n", y_hat)

W:
 [[1 0 0]
 [0 1 0]
 [0 0 1]]


y_hat:
 [0. 1. 1.]
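
In the cell above the basis is the standard basis of $\mathbb{R}^3$, so $\mathbf{W}$ is all of $\mathbb{R}^3$ and the projection returns $\mathbf{y}$ unchanged. A case where $\mathbf{W}$ is a proper subspace may be more instructive. The following is a minimal sketch; the basis vectors and $\mathbf{y}$ are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical orthogonal basis for a plane W in R^3 (u1 . u2 = 0)
u1 = np.array([1.0, 1.0, 0.0])
u2 = np.array([1.0, -1.0, 0.0])
basis = [u1, u2]

# Vector to decompose (illustrative values)
y = np.array([2.0, 3.0, 4.0])

# Orthogonal decomposition: y = y_hat + z with y_hat in W and z in W-perp
y_hat = sum((np.dot(y, u) / np.dot(u, u)) * u for u in basis)
z = y - y_hat

print("y_hat:", y_hat)
print("z:", z)
print("z . u1 =", np.dot(z, u1), ", z . u2 =", np.dot(z, u2))
```

Here $\hat{\mathbf{y}} = \tfrac{5}{2}\mathbf{u}_1 - \tfrac{1}{2}\mathbf{u}_2 = (2, 3, 0)$ and $\mathbf{z} = (0, 0, 4)$, which is orthogonal to both basis vectors.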


### Orthogonality (cont'd)

- **The Best Approximation Theorem**
    
    - Let $\mathbf{W}$ be a subspace of $\mathbb{R}^n$, let $\mathbf{y}$ be any vector in $\mathbb{R}^n$, and let $\hat{\mathbf{y}}$ be the orthogonal projection of $\mathbf{y}$ onto $\mathbf{W}$. Then $\hat{\mathbf{y}}$ is the closest point in $\mathbf{W}$ to $\mathbf{y}$, in the sense that

    $$ \|\mathbf{y} - \hat{\mathbf{y}}\| < \|\mathbf{y} - \mathbf{v}\|, \forall\mathbf{v}\in\mathbf{W} \text{ where } \mathbf{v} \ne \hat{\mathbf{y}} $$
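
The theorem can be checked numerically, though a check is not a proof. The sketch below assumes a hypothetical plane $\mathbf{W}$ in $\mathbb{R}^3$ and compares the distance from $\mathbf{y}$ to its projection against the distance to many randomly drawn points of $\mathbf{W}$. The seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed for reproducibility

# Hypothetical orthogonal basis of a plane W in R^3, and a vector y
basis = np.vstack([[1.0, 1.0, 0.0], [1.0, -1.0, 0.0]])
y = np.array([2.0, 3.0, 4.0])

# Orthogonal projection of y onto W
y_hat = sum((y @ u) / (u @ u) * u for u in basis)
best = np.linalg.norm(y - y_hat)

# Random points v in W: random linear combinations of the basis vectors
coeffs = rng.normal(scale=5.0, size=(1000, 2))
candidates = coeffs @ basis
dists = np.linalg.norm(y - candidates, axis=1)

print(f"||y - y_hat||       = {best:.4f}")
print(f"min over random v   = {dists.min():.4f}")
```

No sampled point of $\mathbf{W}$ is closer to $\mathbf{y}$ than $\hat{\mathbf{y}}$, as the theorem predicts.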

### Best Approximation Theorem
<img src="/Users/henrycosentino/Desktop/Python/quant_mentorship/images/best_approximation.png">

### Linear Algebra Notation (cont'd)

- **Matrix-Vector Equation**

    - Let $\mathbf{A}$ be an _m × n_ real valued matrix, $\mathbf{x}$ be an _n × 1_ real valued column vector, and $\mathbf{b}$ be an _m × 1_ real valued column vector
    
    - Then, a matrix-vector equation is denoted by $$\mathbf{Ax} = \mathbf{b}$$

    - If there exists an $\mathbf{x}$ such that $\mathbf{Ax} = \mathbf{b}$ holds, then $\mathbf{Ax} = \mathbf{b}$ is said to be _consistent_

    - If there is no such $\mathbf{x}$ such that $\mathbf{Ax} = \mathbf{b}$ holds, then $\mathbf{Ax} = \mathbf{b}$ is said to be _inconsistent_

- **Matrix-Vector Equation as a System of Linear Equations**

    - $\mathbf{Ax} = \mathbf{b}$ can be decomposed into
    
$$\begin{align}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\
&\vdots \\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m
\end{align}$$
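
One way to test consistency numerically is to compare ranks: $\mathbf{Ax} = \mathbf{b}$ has a solution exactly when appending $\mathbf{b}$ to $\mathbf{A}$ as an extra column does not increase the rank. This criterion is a standard linear algebra fact rather than something stated in these notes, and the helper name `is_consistent` and the matrices are hypothetical.

```python
import numpy as np

def is_consistent(A, b):
    """Return True if Ax = b has at least one solution.

    Uses the rank criterion: rank(A) == rank([A | b]).
    """
    augmented = np.column_stack([A, b])
    return bool(np.linalg.matrix_rank(A) == np.linalg.matrix_rank(augmented))

# Three equations, two unknowns (illustrative values)
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

b_consistent = np.array([2.0, 3.0, 4.0])    # equals A @ [1, 1], so a solution exists
b_inconsistent = np.array([1.0, 3.0, 2.0])  # not in the column space of A

print(is_consistent(A, b_consistent))
print(is_consistent(A, b_inconsistent))
```

Rank computations in floating point depend on a tolerance, so this check is best treated as a diagnostic for small, well-scaled systems.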

### Least Squares Notation & Solution

- **Notation**

    - Instead of the typical notation $\mathbf{Ax} = \mathbf{b}$, we can rewrite this in a more familiar manner as:

    $$\mathbf{X\beta} = \mathbf{y}$$

    - Where:

        - $\mathbf{X}$ represents an _n × (m+1)_ matrix of our data (independent variables, known values)

        - $\boldsymbol{\beta}$ represents an _(m+1) × 1_ matrix of weights (coefficients, unknown values)

        - $\mathbf{y}$ represents an _n × 1_ matrix of our observation data (dependent variable, known values)

    - $\mathbf{X\beta} = \mathbf{y}$ can be further decomposed:

    $$\begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1m} \\ 1 & x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

- **Solution**

    - Inconsistent systems arise often in applications. Thus, we are not always guaranteed a vector $\boldsymbol{\beta}$ that satisfies $\mathbf{X\beta} = \mathbf{y}$

    - To approximate a solution to $\mathbf{X\beta} = \mathbf{y}$, we look for the vector in the column space of $\mathbf{X}$ that is "closest" to $\mathbf{y}$. The closest solution for $\boldsymbol{\beta}$ is then the vector of weights on the columns of $\mathbf{X}$ that produces this vector

    - In terms of the definition of projection we are looking for a solution to:
    
    $$\tag{1} \mathbf{X\beta} = \text{proj}_{\text{col}(\mathbf{X})}(\mathbf{y})$$

    - We call the solution $\hat{\boldsymbol{\beta}}$ and let $\hat{\mathbf{y}} = \text{proj}_{\text{col}(\mathbf{X})}(\mathbf{y})$, so that equation $(1)$ becomes $$\mathbf{X}\hat{\boldsymbol{\beta}} = \hat{\mathbf{y}}$$

    - By the Orthogonal Decomposition Theorem, $\mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to the column space of $\mathbf{X}$, so $\mathbf{x}_j \cdot (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0$ for each column $\mathbf{x}_j$ of $\mathbf{X}$

    - Writing the inner product as a matrix product, this means that $\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0$ for each column $\mathbf{x}_j$ of $\mathbf{X}$. The vectors $\mathbf{x}_j^T$ are the rows of $\mathbf{X}^T$, and so $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$

    - Thus, the _normal equations_ are given by

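    $$\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}$$

    - The only unknown in the _normal equations_ is $\hat{\boldsymbol{\beta}}$, which can be found by row reduction (Gaussian elimination). When the columns of $\mathbf{X}$ are linearly independent, $\mathbf{X}^T\mathbf{X}$ is invertible and the solution is unique: $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$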
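
The derivation above can be sketched in NumPy. The data here are synthetic (a hypothetical line with intercept 2 and slope 0.5 plus noise), and the seed is arbitrary. The sketch solves the normal equations, confirms that the residual is orthogonal to the columns of $\mathbf{X}$, and compares the result with `np.linalg.lstsq`, which solves the same least squares problem without forming $\mathbf{X}^T\mathbf{X}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

# Synthetic data: y = 2 + 0.5 x + noise (hypothetical values)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), x])

# Solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The residual y - X beta_hat should be orthogonal to every column of X
residual = y - X @ beta_hat
print("X^T (y - X beta_hat):", X.T @ residual)

# Same problem solved by NumPy's least squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("normal equations:", beta_hat)
print("np.linalg.lstsq :", beta_lstsq)
```

For badly conditioned design matrices, `np.linalg.lstsq` is generally the safer choice, since forming $\mathbf{X}^T\mathbf{X}$ squares the condition number.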

### Least Squares Projection
<img src="/Users/henrycosentino/Desktop/Python/quant_mentorship/images/least_squares_proj.png">

### Least Squares (cont'd)

- **Least Squares Error of Approximation**

    - The least squares error of approximation can be denoted by:
    
    $$\|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|$$

    - Notice that $\mathbf{y}$ is a vector of the observed data. The error is therefore the distance between the observed data $\mathbf{y}$ and the fitted values $\mathbf{X}\hat{\boldsymbol{\beta}}$

- **General Linear Models**

    - The matrix equation is still $\mathbf{X\beta} = \mathbf{y}$, but the specific form of $\mathbf{X}$ changes from problem to problem due to the design of the model and the available data

    - Thus, statisticians usually introduce a residual vector $\boldsymbol{\epsilon}$, defined by:
    
    $$\boldsymbol{\epsilon} = \mathbf{y} - \mathbf{X\beta}$$

    - So, the model becomes:
    
    $$\mathbf{y} = \mathbf{X\beta} + \boldsymbol{\epsilon}$$

    - The goal is to minimize the length of $\boldsymbol{\epsilon}$

- **Least Squares of Other Curves**

    - When the data points $(x_1, y_1), \ldots, (x_n, y_n)$ do not lie close to any 'straight' line, it may be appropriate to postulate some other functional relationship between $x$ (the independent variable) and $y$ (the dependent variable)

    - One such form:
    
    $$y = \beta_0 f_0(x) + \beta_1 f_1(x) + \cdots + \beta_k f_k(x) \tag{2}$$

    - Where:

        - $\beta_0, \beta_1, \ldots, \beta_k$ are parameters (coefficients) that must be determined

        - $f_0(x), f_1(x), \ldots, f_k(x)$ are known functions

    - Notice that $(2)$ is a linear model because it is linear in terms of the parameters $\beta_0, \beta_1, \ldots, \beta_k$

    - The solution to least squares of other curves is similar, and involves the _normal equations_

    - Example:

        - Consider the polynomial linear regression model of degree three:
        
        $$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$$

        - Which can be rewritten as:
        
        $$\begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 & x_n^3 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

- **Multiple Linear Regression**

    - Experiments involving more than one independent variable

    - Generally:

    $$y = \beta_0 f_0(u,v,\ldots) + \beta_1 f_1(u,v,\ldots) + \cdots + \beta_k f_k(u,v,\ldots)$$

    - Where:

        - $\beta_0, \beta_1, \ldots, \beta_k$ are parameters (coefficients) that must be determined

        - $f_0(u,v,\ldots), f_1(u,v,\ldots), \ldots, f_k(u,v,\ldots)$ are known functions

    - The solution for multiple linear regression is similar, and involves the _normal equations_
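
Both cases above come down to building a design matrix whose columns are the known functions evaluated at the data, then solving the normal equations. The sketch below uses hypothetical, noise-free data so the recovered coefficients can be compared with the ones used to generate it. The choice of functions $1, u, v, uv$ for the two-variable case is an assumption made for illustration.

```python
import numpy as np

def solve_normal_equations(X, y):
    """Least squares coefficients from X^T X beta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# --- Cubic polynomial in one variable: columns 1, x, x^2, x^3 ---
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
beta_true = np.array([1.0, -2.0, 0.5, 0.25])      # hypothetical coefficients
X_poly = np.vander(x, N=4, increasing=True)       # Vandermonde design matrix
y_poly = X_poly @ beta_true                       # noise-free responses
beta_poly = solve_normal_equations(X_poly, y_poly)

# --- Two independent variables: known functions 1, u, v, u*v ---
u = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
v = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
gamma_true = np.array([1.0, 2.0, -1.0, 0.5])      # hypothetical coefficients
X_multi = np.column_stack([np.ones_like(u), u, v, u * v])
y_multi = X_multi @ gamma_true
gamma_hat = solve_normal_equations(X_multi, y_multi)

print("cubic fit:       ", beta_poly)
print("two-variable fit:", gamma_hat)
```

Although the fitted curves are non-linear in $x$, $u$, and $v$, the models are linear in the coefficients, which is why the same normal equations apply.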

### Example One

Find a least squares solution and least squares error of the below system by hand.

$$x_1 + x_2 = 1$$
$$x_1 + 2x_2 = 3$$
$$x_1 + 3x_2 = 2$$
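
After working the problem by hand, the answer can be checked numerically. This is a minimal sketch that writes the system as $\mathbf{Ax} = \mathbf{b}$ and solves the normal equations; the variable names are illustrative.

```python
import numpy as np

# Coefficient matrix and right-hand side of the system above
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 3.0, 2.0])

# Normal equations: A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Least squares error: ||b - A x_hat||
error = np.linalg.norm(b - A @ x_hat)

print("x_hat:", x_hat)
print(f"error: {error:.4f}")
```

By hand, $\mathbf{A}^T\mathbf{A} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}$ and $\mathbf{A}^T\mathbf{b} = \begin{bmatrix} 6 \\ 13 \end{bmatrix}$, which gives $\hat{x}_1 = 1$, $\hat{x}_2 = \tfrac{1}{2}$, and a least squares error of $\sqrt{1.5} \approx 1.2247$.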

### Example Two

Find a least squares solution for the below points in quadratic form through the origin.

$$(1, 1), (2, 5), (-1, 2)$$
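
A numerical check is sketched below. It assumes that "quadratic form through the origin" means the model $y = \beta_1 x + \beta_2 x^2$ with no intercept term; if a different model is intended, the design matrix changes accordingly.

```python
import numpy as np

# Data points (x, y) from the example
x = np.array([1.0, 2.0, -1.0])
y = np.array([1.0, 5.0, 2.0])

# Assumed model: y = beta_1 * x + beta_2 * x^2 (no intercept, so the curve passes through the origin)
X = np.column_stack([x, x ** 2])

# Normal equations: X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print("beta_1, beta_2:", beta_hat)
```

Under this assumption, $\mathbf{X}^T\mathbf{X} = \begin{bmatrix} 6 & 8 \\ 8 & 18 \end{bmatrix}$ and $\mathbf{X}^T\mathbf{y} = \begin{bmatrix} 9 \\ 23 \end{bmatrix}$, so the fitted curve is $y = -\tfrac{1}{2}x + \tfrac{3}{2}x^2$.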

### Example Three

Find a least squares solution and least squares error of the below data using NumPy.

- [Data Description](https://www.princeton.edu/~mwatson/Stock-Watson_3u/Students/EE_Datasets/Growth_Description.pdf)

    - _growth_: Average annual percentage growth of real Gross Domestic Product (GDP) from 1960 to 1995

    - _yearsschool_: Average number of years of schooling of adult residents in that country in 1960

In [10]:
# Load in data

fh = "/Users/henrycosentino/Desktop/Python/quant_mentorship/data/growth_data.csv"

df = pd.read_csv(fh)

df

| | country_name | growth | oil | rgdp60 | tradeshare | yearsschool | rev_coups | assasinations |
|---|---|---|---|---|---|---|---|---|
| 0 | India | 1.915168 | 0 | 765.999817 | 0.140502 | 1.45 | 0.133333 | 0.866667 |
| 1 | Argentina | 0.617645 | 0 | 4462.001465 | 0.156623 | 4.99 | 0.933333 | 1.933333 |
| 2 | Japan | 4.304759 | 0 | 2953.999512 | 0.157703 | 6.71 | 0.000000 | 0.200000 |
| 3 | Brazil | 2.930097 | 0 | 1783.999878 | 0.160405 | 2.89 | 0.100000 | 0.100000 |
| 4 | United States | 1.712265 | 0 | 9895.003906 | 0.160815 | 8.66 | 0.000000 | 0.433333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60 | Cyprus | 5.384184 | 0 | 2037.000366 | 0.979355 | 4.29 | 0.100000 | 0.166667 |
| 61 | Malaysia | 4.114544 | 0 | 1420.000244 | 1.105364 | 2.34 | 0.033333 | 0.033333 |
| 62 | Belgium | 2.651335 | 0 | 5495.001953 | 1.115917 | 7.46 | 0.000000 | 0.000000 |
| 63 | Mauritius | 3.024178 | 0 | 2861.999268 | 1.127937 | 2.44 | 0.000000 | 0.000000 |


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   country_name   65 non-null     object 
 1   growth         65 non-null     float64
 2   oil            65 non-null     int64  
 3   rgdp60         65 non-null     float64
 4   tradeshare     65 non-null     float64
 5   yearsschool    65 non-null     float64
 6   rev_coups      65 non-null     float64
 7   assasinations  65 non-null     float64
dtypes: float64(6), int64(1), object(1)
memory usage: 4.2+ KB


In [12]:
# Assign independent and dependent variables

X = df[['yearsschool']].values # <-- The ".values" attribute gives us the values and not a data frame
X = np.column_stack([np.ones(len(X)), X]) # <-- Here we are adding a column of ones for the intercept

y = df[['growth']].values

X[:5 , :]

array([[1.        , 1.45000005],
       [1.        , 4.98999977],
       [1.        , 6.71000004],
       [1.        , 2.8900001 ],
       [1.        , 8.65999985]])

In [None]:
# Solving using NumPy

coefficients = np.linalg.solve(X.T @ X, X.T @ y) # <-- We use "@" for matrix multiplication and ".T" to take the transpose

e = y - X @ coefficients
error = np.linalg.norm(e)

rss = error**2
tss = np.sum((y - np.mean(y))**2)
r_squared = 1 - (rss / tss)

print(f"R-squared: {r_squared:.4f}")
print(f"Error: {error:.4f}")
print(f"Intercept: {coefficients[0][0]:.4f}")
print(f"Slope (yearsschool): {coefficients[1][0]:.4f}")

R-squared: 0.1096
Error: 14.3215
Intercept: 0.9583
Slope (yearsschool): 0.2470


In [14]:
# Verifying answers with sklearn

from sklearn.linear_model import LinearRegression

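model = LinearRegression() # <-- fit_intercept=True by default, so sklearn estimates the intercept itself
model.fit(X[:, 1:], y) # <-- Pass only the yearsschool column; the column of ones is not needed here
print(f"R-squared: {model.score(X[:, 1:], y):.4f}")
print(f"Intercept: {model.intercept_[0]:.4f}")
print(f"Slope (yearsschool): {model.coef_[0][0]:.4f}")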

R-squared: 0.1096
Intercept: 0.9583
Slope (yearsschool): 0.2470


## Part 3: Statistics

- **The Gauss-Markov Theorem**

    - Theorem: 

        - The Ordinary Least Squares (OLS) estimator of $\boldsymbol{\beta}$ in the linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ is given by $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$. If the errors have mean $\mathbf{0}$ and covariance $\sigma^2\mathbf{I}$, then the OLS estimator has the lowest sampling variance among all linear unbiased estimators of $\boldsymbol{\beta}$; that is, the OLS estimator is the _best linear unbiased estimator_ (BLUE) under these assumptions

    - Assumptions:

        1. Linearity in Parameters: The dependent variable is a linear function of the parameters

        2. Random Sampling: The data is obtained through random sampling from the population, so observations are _independent and identically distributed_

        3. No Perfect Collinearity: No independent variable is a perfect linear combination of the others

        4. Zero Conditional Mean (exogeneity):

            - $E[\boldsymbol{\epsilon} | \mathbf{X}] = \mathbf{0}$

            - The error term has an expected value of zero given any value of the independent variables

        5. Homoskedasticity

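            - $\text{var}[\epsilon_i | \mathbf{X}] = \sigma^2$ for every observation $i$; together with assumption 6, this gives $\text{var}[\boldsymbol{\epsilon} | \mathbf{X}] = \sigma^2\mathbf{I}$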

            - The variance of the error term is constant across all observations

            - Errors have the same variance regardless of the $\mathbf{X}$ values

        6. No Autocorrelation (time series specific)

            - $\text{cov}[\epsilon_i, \epsilon_j | \mathbf{X}] = 0$, when $i \neq j$

            - Error terms are uncorrelated with each other

    - In a sense, the Gauss-Markov Theorem tells us that, among linear unbiased estimators, OLS extracts information from the data most efficiently (it has the smallest sampling variance). This optimality holds only when the assumptions above are satisfied, which is rarely exactly the case with real data

    - Proof of Unbiasedness (assumption 4)

We aim to show that $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$.

Consider the OLS estimator:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

Substitute the population model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ into the estimator:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon})$$

Distribute the terms:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{X}^T\mathbf{X})\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}$$

Since $(\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{X}^T\mathbf{X}) = \mathbf{I}$, we simplify to:
$$\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}$$

Apply the conditional expectation $E[\cdot | \mathbf{X}]$. Because $\boldsymbol{\beta}$ is a fixed parameter and we are conditioning on $\mathbf{X}$, both are treated as non-stochastic in this step:
$$E[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\boldsymbol{\epsilon} | \mathbf{X}]$$

Apply the zero conditional mean assumption ($E[\boldsymbol{\epsilon} | \mathbf{X}] = \mathbf{0}$):
$$E[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{0})$$
$$E[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}$$

Finally, apply the Law of Iterated Expectations to find the unconditional expectation:
$$E[\hat{\boldsymbol{\beta}}] = E[E[\hat{\boldsymbol{\beta}} | \mathbf{X}]] = E[\boldsymbol{\beta}]$$

Since $\boldsymbol{\beta}$ is a constant vector of population parameters, $E[\boldsymbol{\beta}] = \boldsymbol{\beta}$, therefore:
$$E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$$
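
Unbiasedness can also be illustrated by simulation. The sketch below assumes a hypothetical population model with intercept 1 and slope 0.25 and errors drawn independently of $\mathbf{X}$ with mean zero, so assumption 4 holds by construction. The sample size, number of replications, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

beta = np.array([1.0, 0.25])    # hypothetical true parameters
n, n_sims = 100, 5000           # sample size and number of simulated samples

estimates = np.empty((n_sims, 2))
for s in range(n_sims):
    # Draw a fresh sample: regressors, errors with E[eps | X] = 0, and responses
    x = rng.uniform(0, 10, size=n)
    X = np.column_stack([np.ones(n), x])
    eps = rng.normal(loc=0.0, scale=2.0, size=n)
    y = X @ beta + eps

    # OLS estimate for this sample via the normal equations
    estimates[s] = np.linalg.solve(X.T @ X, X.T @ y)

# Averaging over many samples approximates E[beta_hat]
mean_estimate = estimates.mean(axis=0)
print("true beta:         ", beta)
print("average of beta_hat:", mean_estimate)
```

Individual estimates vary from sample to sample, but their average is close to the true parameter vector, consistent with $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$.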

- **Maximum Likelihood Estimator**

    - Likelihood Function

        - The information in the sample and the parameter $\theta$ are involved in the joint distribution of the random sample, $$\prod_{i=1}^{n} f(x_i ; \theta)$$

        - Here $f$ is any probability density function

        - We want to view this as a function of $\theta$, so we can write it as, $$L(\theta) = L(\theta; x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i ; \theta)$$

    - Maximum Likelihood Estimator (MLE)

        - An often-used estimate is the value of $\theta$ that provides a maximum of our function $L(\theta)$

        - If the maximizer is unique, then we denote it by $\hat{\theta}$, and so, $$\hat{\theta} = \arg\max_{\theta} L(\theta)$$

        - In a sense we are looking for a peak of the $L(\theta)$ function

    - Log-Likelihood

        - In practice, it is often much easier to work with the log likelihood, $$\ell(\theta) = \ln[L(\theta)]$$

        - Since $\ln$ is a strictly increasing function, the value that maximizes $\ell(\theta)$ is the same value that maximizes $L(\theta)$

        - If we assume the probability density function is a differentiable function of $\theta$ (which it usually is), then $\hat{\theta}$ frequently solves, $$\frac{\partial \ell(\theta)}{\partial \theta} = 0$$

    - OLS & MLE

        - MLE is not OLS. However, in regression modeling, maximizing the likelihood function yields estimates of the intercept and slope of the regression line. In other words, MLE can be used to estimate the same parameters that OLS estimates

        - In terms of OLS, $\theta$ will be a vector of parameters, $$\theta = [\beta_0, \beta_1, \sigma]$$

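        - And the probability density function of each observation $y_i$, given $x_i$, is that of a normal distribution with mean $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$, $$f(y_i | x_i ; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{y_i - \beta_0 - \beta_1 x_i}{\sigma}\right)^2}$$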

        - In other words, for the MLE estimates of regression coefficients ($\beta$) to be identical to OLS estimates, we must assume that the error terms ($\epsilon$) follow a Normal distribution with a mean of zero and constant variance (homoskedasticity)
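
The link between OLS and MLE can be sketched numerically. Assuming independent Normal errors with mean zero and constant variance, the log-likelihood is $\ell(\beta, \sigma) = -\tfrac{n}{2}\ln(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\sum_i (y_i - \mathbf{x}_i^T\beta)^2$, so for any fixed $\sigma$ it is largest where the sum of squared residuals is smallest. The data, seed, and the helper name `gaussian_log_likelihood` below are hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(beta, sigma, X, y):
    """Log-likelihood of y when y_i ~ Normal(x_i^T beta, sigma^2), independently."""
    resid = y - X @ beta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)

rng = np.random.default_rng(1)  # arbitrary seed

# Synthetic data from a hypothetical line with Normal errors
n = 200
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.25 * x + rng.normal(scale=2.0, size=n)

# OLS coefficients and the MLE of sigma (divides by n, not n - 2)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
sigma_mle = np.sqrt(np.sum((y - X @ beta_ols) ** 2) / n)

# Log-likelihood at the OLS solution versus at randomly perturbed coefficients
ll_at_ols = gaussian_log_likelihood(beta_ols, sigma_mle, X, y)
perturbations = rng.normal(scale=0.1, size=(500, 2))
ll_perturbed = np.array([
    gaussian_log_likelihood(beta_ols + d, sigma_mle, X, y) for d in perturbations
])

print(f"log-likelihood at OLS estimate:       {ll_at_ols:.4f}")
print(f"best log-likelihood among perturbed:  {ll_perturbed.max():.4f}")
```

None of the perturbed coefficient vectors achieves a higher log-likelihood than the OLS solution, which is what the equivalence under Normal errors predicts.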

- **Statistical Modeling to Machine Learning**

    - Two Different Goals

        - Causation: Statisticians and econometricians usually seek to answer questions, and in doing so answer problems of causality

        - Prediction: Data scientists typically seek to predict a specified value(s), often with limited explanatory power

    - Statistical Inference

        - Goal: causal models with explanatory power

        - Typical Process

            1. Statement of hypothesis

            2. Specification of the mathematical model of theory

            3. Specification of the statistical model of theory

            4. Obtaining the data

            5. Estimation of the statistical model's parameters

            6. Determining whether to reject or fail to reject the initial hypothesis

        - Statistical models typically embody a probabilistic framework, are expressed linearly, do not scale well (they are limited to lower-dimensional data), are prone to overfitting, and come with extensive model diagnostics

    - Machine Learning

        - Goal: prediction performance, usually with low explanatory power

        - Typical Process

            1. Define the prediction problem and target variable

            2. Collect and prepare the data (cleaning, feature engineering)

            3. Split data into training, validation, and test sets

            4. Select and train candidate models/algorithms

            5. Tune hyperparameters and evaluate performance on validation set

            6. Select best model and assess final performance on test set

        - Machine learning models typically embody probabilistic and algorithmic frameworks, can handle non-linear relationships, scale to high-dimensional data, are less prone to overfitting (through regularization and cross-validation), and prioritize predictive accuracy over interpretability

    - Therefore, ordinary least squares can be considered a type of parametric, supervised, linear regression machine learning model if used in the aforementioned predictive capacity
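
Used in this predictive capacity, OLS is fit on one part of the data and judged on another. The following is a minimal sketch with synthetic data and an arbitrary 80/20 split. Because plain OLS has no hyperparameters to tune, the validation set from the process above is omitted here, which is a simplification.

```python
import numpy as np

def mse(X, y, beta):
    """Mean squared prediction error of the linear model X @ beta."""
    return np.mean((y - X @ beta) ** 2)

rng = np.random.default_rng(7)  # arbitrary seed

# Synthetic data from a hypothetical line with unit-variance noise
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.25 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])

# Shuffle the row indices and split 80% train / 20% test
idx = rng.permutation(n)
n_train = int(0.8 * n)
train_idx, test_idx = idx[:n_train], idx[n_train:]

# Fit on the training rows only
X_train, y_train = X[train_idx], y[train_idx]
beta_hat = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

# Evaluate on both sets; the test error estimates out-of-sample performance
train_mse = mse(X_train, y_train, beta_hat)
test_mse = mse(X[test_idx], y[test_idx], beta_hat)

print(f"train MSE: {train_mse:.4f}")
print(f"test MSE:  {test_mse:.4f}")
```

The held-out error, rather than the in-sample fit, is the quantity a prediction-focused workflow would report.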

### References

1. **Hogg, R. V., McKean, J. W., & Craig, A. T.** (2019). *Introduction to Mathematical Statistics* (8th ed.). Pearson.

2. **Lay, D. C., Lay, S. R., & McDonald, J. J.** (2021). *Linear Algebra and Its Applications* (6th ed., Global Edition). Pearson.

3. **Stock, J. H., & Watson, M. W.** (2020). *Introduction to Econometrics* (4th ed., Global Edition). Pearson.

4. **Jiang, Y.** (2023). *STA 211: The Mathematics of Regression - Lecture 8: The Gauss-Markov Theorem*. Department of Statistical Science, Duke University.

5. **Jiang, Y.** (2023). *STA 211: The Mathematics of Regression - Lecture 9: Maximum Likelihood Estimation*. Department of Statistical Science, Duke University.