# Linear Regression #

## ***Vocabulary***

second derivative test\
variance\
covariance\
span\
projection\
transpose\
normal equations\
pseudo-inverse\
likelihood\
derivative rules\
pdf of the gaussian


# Lecture Notes #

## ***Introduction***

#### **Background**

Linear regression is a core task in multiple fields including statistics, computer science, and machine learning. It is the problem of fitting a line to data.

One difference between classification and regression is that the labels in regression are real-valued, not just 0 or 1.

#### **Deriving the Regression Function**

Let's say we have two random variables $X$, $Y$. We want to predict $Y$, the label.

In one scenario, assume we are not able to see $X$. Given that $(X,Y) \sim \mathbf{D}$ and that we measure our loss using square loss, $(\text{prediction}-Y)^2$, our optimal guess for $Y$ is simply $\mathbf{E}[Y]$: it is the constant that minimizes the expected square loss.
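
As a quick numerical check that the mean is the best constant guess under square loss, the sketch below compares the average square loss of the sample mean against a few nearby constant predictions. The sample values and the helper name `avg_square_loss` are made up for illustration, and a finite sample stands in for the distribution.

```python
def avg_square_loss(prediction, ys):
    """Average of (prediction - y)^2 over the sample."""
    return sum((prediction - y) ** 2 for y in ys) / len(ys)

# Hypothetical sample of labels, standing in for draws of Y.
ys = [2.0, 3.0, 5.0, 10.0]
mean_y = sum(ys) / len(ys)

# Compare the sample mean against a few other constant guesses.
candidates = [mean_y - 1.0, mean_y - 0.1, mean_y, mean_y + 0.1, mean_y + 1.0]
losses = {c: avg_square_loss(c, ys) for c in candidates}

# The candidate with the smallest average square loss.
best = min(losses, key=losses.get)
print(best == mean_y)
```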

In the scenario where we do get to see $X$, then the optimal prediction for $Y$ will be $\mathbf{E}[Y|X]$, the expected value of $Y$ conditioned on $X$.

This value, $\mathbf{E}[Y|X]$, is a function of the random variable $X$, written $f(X)$. We call this function the **regression** function.

A major obstacle is that $f(x)$ might be unknown, or very hard to compute.

#### **Introducing the Coefficients**

Linear regression asks the following: 

**Given $X$, what linear function of $X$ should we use to predict $Y$?**

Essentially, we want to know which line best fits the data with respect to square loss. We want to learn coefficients $\beta_0$ and $\beta_1$ that minimize $\mathbf{E}_{(X,Y) \sim \mathbf{D}}[(Y-(\beta_0 + \beta_1X))^2]$, the expected value of the square loss.

## ***Finding the Betas***

#### **Our Minimization Function**

We are going to draw a training set of size $m$: $(x^1, y^1),\; \dots \;, (x^m,y^m)$, where $x$ and $y$ are scalars, so this is simple linear regression. In this case, we want to:

$$ \underset{\beta_0, \beta_1}{\min} \frac{1}{m}\sum_{j=1}^m(y^j-(\beta_0+\beta_1 x^j))^2$$
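
The objective above can be evaluated directly for any candidate pair of betas. The following is a minimal sketch; the function name `empirical_square_loss` and the toy training set are assumptions made for illustration.

```python
def empirical_square_loss(beta0, beta1, xs, ys):
    """(1/m) * sum_j (y^j - (beta0 + beta1 * x^j))^2 for a training set."""
    m = len(xs)
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys)) / m

# Hypothetical training set lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

loss_true_line = empirical_square_loss(1.0, 2.0, xs, ys)   # the line the data lies on
loss_other_line = empirical_square_loss(0.0, 2.0, xs, ys)  # intercept off by 1
print(loss_true_line, loss_other_line)
```

The minimization problem asks for the pair of betas that makes this number as small as possible.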

#### **How to Find Beta 0 and Beta 1**

We will take the derivative with respect to $\beta_0$ and $\beta_1$, and set each equal to 0. Let's fix $\ell$ as $\frac{1}{m}\sum_{j=1}^m(y^j-(\beta_0+\beta_1 x^j))^2$, the objective from the minimization problem. Now we can compute the partial derivatives of $\ell$ with respect to $\beta_0$ and $\beta_1$:

$$ \frac{\partial \ell}{\partial \beta_0} = \frac{1}{m}\sum_{j=1}^m(y^j-\beta_0-\beta_1x^j)(-2)$$

$$ \frac{\partial \ell}{\partial \beta_1} = \frac{1}{m}\sum_{j=1}^m(y^j-\beta_0-\beta_1x^j)(-2x^j)$$

Note: We know from calculus that setting a derivative equal to zero yields a critical point, but how can we be sure that this is the global minimum? Because this function is convex, any critical point is a global minimum. We can also confirm this with the second derivative test: the matrix of second derivatives (the Hessian) is positive semi-definite, which can be checked by looking at its eigenvalues, as we will do later in the course.
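
The second-derivative check can be made concrete for this loss. Differentiating the two partial derivatives once more (keeping the $\frac{1}{m}$ factor, and writing $\bar{x}$ and $\overline{x^2}$ for the training-set averages of $x^j$ and $(x^j)^2$) gives the entries of the Hessian $H$, none of which depend on the betas:

$$ H_{00} = 2 \;\;\;\; H_{01} = H_{10} = 2\bar{x} \;\;\;\; H_{11} = 2\,\overline{x^2} $$

The trace $2 + 2\,\overline{x^2}$ is positive and the determinant is

$$ H_{00}H_{11} - H_{01}^2 = 4\left(\overline{x^2}-(\bar{x})^2\right) \geq 0 $$

so both eigenvalues are non-negative and $H$ is positive semi-definite. The determinant is strictly positive as long as the $x^j$ are not all equal, in which case the critical point is the unique global minimum.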

Setting both partial derivatives equal to zero and dividing out the $-2$ multiplier, we have:

$$ \frac{1}{m}\sum_{j=1}^m(y^j-\beta_0-\beta_1x^j) = 0 $$

$$\frac{1}{m}\sum_{j=1}^m(y^j-\beta_0-\beta_1x^j)(x^j) = 0$$

Solving for the betas:

$$ \beta_0 = \bar{y}-\beta_1\bar{x} $$

$$ \beta_1 = \frac{\overline{xy}-\bar{x}\cdot\bar{y}}{\overline{x^2}-(\bar{x})^2} $$
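
The algebra behind these two expressions can be sketched as follows, where a bar denotes an average over the training set (for example $\overline{xy} = \frac{1}{m}\sum_{j=1}^m x^jy^j$). Distributing the sums in the two equations gives:

$$ \bar{y}-\beta_0-\beta_1\bar{x} = 0 $$

$$ \overline{xy}-\beta_0\bar{x}-\beta_1\overline{x^2} = 0 $$

The first equation yields $\beta_0 = \bar{y}-\beta_1\bar{x}$. Substituting that into the second equation:

$$ \overline{xy}-(\bar{y}-\beta_1\bar{x})\bar{x}-\beta_1\overline{x^2} = 0 $$

$$ \overline{xy}-\bar{x}\cdot\bar{y} = \beta_1\left(\overline{x^2}-(\bar{x})^2\right) $$

Dividing by $\overline{x^2}-(\bar{x})^2$ (which is nonzero unless all $x^j$ are equal) gives the expression for $\beta_1$.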

<br>
<center>
    <img src="images/1.7.1.png" alt="Professor Notes" />
</center>
<br>
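
Below is a minimal sketch of the closed-form expressions for $\beta_0$ and $\beta_1$ in Python, using only the standard library. The function name `fit_simple_linear_regression` and the sample data are illustrative assumptions, not part of the lecture.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta0, beta1) minimizing the average squared loss."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / m
    x2_bar = sum(x * x for x in xs) / m

    denom = x2_bar - x_bar ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is not determined")

    beta1 = (xy_bar - x_bar * y_bar) / denom
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical training set lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta0, beta1 = fit_simple_linear_regression(xs, ys)
print(beta0, beta1)
```

Because the toy data lies exactly on a line, the fit recovers that line's intercept and slope.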

#### **Beta 1 and the Variance of X**

Notice that the denominator in our expression for $\beta_1$ looks a lot like the expression for the variance of $x$:

$$ \overline{x^2}-(\bar{x})^2 $$
$$ Var(X) = \mathbf{E}[X^2]-(\mathbf{E}[X])^2 $$

And the numerator looks a lot like the expression for the covariance of $x$ and $y$:

$$ \overline{xy}-\bar{x}\cdot\bar{y} $$
$$ Cov(X,Y) = \mathbf{E}[X\cdot Y] - \mathbf{E}[X] \cdot \mathbf{E}[Y] $$

So it seems that $\beta_1$, the slope of the line, is the covariance of $x$ and $y$ divided by the variance of $x$.

$$ \beta_1 = \frac{Cov(X,Y)}{Var(X)} $$
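
This identity can be checked numerically on a sample. In the sketch below the helper names and the data are assumptions, and both covariance and variance are computed as population (divide by $m$) quantities. Any common normalization cancels in the ratio.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average of (x - mean_x) * (y - mean_y)."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def variance(xs):
    """Population variance, i.e. the covariance of xs with itself."""
    return covariance(xs, xs)

# Hypothetical noisy data, not taken from the lecture.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Slope as covariance over variance.
slope_cov_var = covariance(xs, ys) / variance(xs)

# Same slope from the averaged ("bar") formula derived earlier.
xy_bar = mean([x * y for x, y in zip(xs, ys)])
x2_bar = mean([x * x for x in xs])
slope_bar = (xy_bar - mean(xs) * mean(ys)) / (x2_bar - mean(xs) ** 2)

print(abs(slope_cov_var - slope_bar) < 1e-9)
```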

## ***Regression with Multiple Variables***

#### **Defining the Problem**
Now, instead of assuming $X$ is a scalar, we let $X$ be an $n$-dimensional vector, while $Y$ is still a scalar:

$$ X \in \mathbf{R}^n\;\;\;\; y \in \mathbf{R} $$

So we are fitting a linear function (a hyperplane rather than a line) to $n$-dimensional data.

Consider a matrix $X$ of size $m \times n$.
- $X$ has $m$ rows, where row $i$ is a point $x^i$ drawn from $\mathbf{D}$. Each row is an $n$-dimensional data point.
- $X$ has $n$ columns because each point is in $\mathbf{R}^n$; that is, the data has $n$ features.

We will still have a vector $y \in \mathbf{R}^m$ corresponding to the labels for these $m$ points.

**The goal** is to find a vector $w \in \mathbf{R}^n$ that minimizes $||X\cdot w-y||^2_2$

For example, given a single training point and its label:

$$ x^1 = x^1_1, \dots , x^1_n \;\;\;\; y^1 $$

its contribution to the squared loss is:

$$ (y^1-(x^1_1 w_1 + \dots +x^1_n w_n))^2 $$

#### **Formal Problem Statement**

$$ \underset{w}{\min} ||Xw-y||^2_2 $$

#### **How to Find w**

Let's find $w$ by using geometric concepts. $Xw$ is a vector in the span of the columns of $X$. The point $y \in \mathbf{R}^m$ is not necessarily in the span of the columns of $X$.

What point should we pick in the span of $X$ to best approximate $y$, geometrically speaking? We should take the orthogonal projection of $y$ down to the span of $X$, and that is the optimal point. We will call this point $Xw$, and that is the point we should choose.

The vector $y-Xw$ is the line segment from the point $Xw$ to the point $y$, and it is orthogonal to every column of $X$.

<br>
<center>
    <img src="images/1.7.2.png" alt="Professor Notes" />
</center>
<br>

Now we can take the vector $y-Xw$ and, since it is orthogonal to $X$, do the following:

$$ X^T\cdot (y-Xw) = 0 $$
$$ X^Ty-X^TXw = 0 $$
$$ X^Ty = X^TXw $$
$$ (X^TX)^{-1}X^Ty = w $$

Thus, provided $X^TX$ is invertible, we have solved for $w$.
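
The derivation above can be sketched in plain Python. This is an illustration under stated assumptions rather than a production solver: the helper names are hypothetical, the data is made up, a column of ones is added to $X$ so that the first weight acts as an intercept, and the system $X^TXw = X^Ty$ is solved by Gaussian elimination instead of forming the inverse explicitly.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (X^T X is not invertible)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

def least_squares(X, y):
    """Solve X^T X w = X^T y for w."""
    Xt = transpose(X)
    return solve(matmul(Xt, X), matvec(Xt, y))

# Hypothetical data: m = 4 points, n = 2 features (a constant 1 and one input).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 4.0, 8.0]

w = least_squares(X, y)
residual = [y_i - p for y_i, p in zip(y, matvec(X, w))]
orthogonality = matvec(transpose(X), residual)  # X^T (y - Xw), should be close to 0
print(w, orthogonality)
```

The last vector is the left-hand side of the orthogonality condition $X^T(y-Xw) = 0$, so its entries should vanish up to floating-point error.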

#### **The Normal Equations** 

The system $X^TXw = X^Ty$ is called the normal equations, and $w = (X^TX)^{-1}X^Ty$ is its solution when the inverse exists. There are a few issues with the normal equations.

1. What if $X^TX$ is not invertible?
- We end up using something called a pseudo-inverse instead of a true inverse. In this case there are multiple choices of $w$ that achieve the minimum square loss.

2. What is the running time for computing $w$, in terms of $m$ and $n$?
- Forming $X^TX$ takes $\mathcal{O}(m\cdot n^2)$ time, and inverting the resulting $n \times n$ matrix takes $\mathcal{O}(n^3)$ with standard methods, so the total is roughly $\mathcal{O}(n^3 + m\cdot n^2)$. We will soon see faster algorithms, specifically using gradient descent.
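
To illustrate the first issue, here is a small sketch with made-up data in which two feature columns are identical, so $X^TX$ is singular and many different $w$ reach the same minimum loss. The helper names are hypothetical. The Moore-Penrose pseudo-inverse selects the minimizer with the smallest Euclidean norm, which for this data is $(1, 1)$.

```python
def predictions(X, w):
    return [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) for row in X]

def squared_loss(X, w, y):
    return sum((p - y_i) ** 2 for p, y_i in zip(predictions(X, w), y))

# Hypothetical data whose two feature columns are identical.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]

# X^T X and its determinant: zero means X^T X has no inverse.
gram = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
det = gram[0][0] * gram[1][1] - gram[0][1] * gram[1][0]

# Every w with w_1 + w_2 = 2 fits the labels exactly.
candidates = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [5.0, -3.0]]
losses = [squared_loss(X, w, y) for w in candidates]

# Among these minimizers, find the one with the smallest Euclidean norm.
norms = [sum(w_j ** 2 for w_j in w) ** 0.5 for w in candidates]
smallest = candidates[norms.index(min(norms))]
print(det, losses, smallest)
```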

## ***Maximum Likelihood***

#### **Assumption**

We will be looking at the "simple" linear regression case, with scalar $X$, and assume $y = \beta_0 + \beta_1x + \epsilon$, where $\epsilon$ is a random noise variable with a Gaussian distribution, $\epsilon \sim N(0, \sigma^2)$.

Imagine we have drawn $x^1, \dots, x^m$ and $y^1, \dots, y^m$. We want to understand: for a fixed choice of $\beta_0$ and $\beta_1$ ($\sigma^2$ is known), **what is the probability that we see $(x^1, y^1), \dots, (x^m,y^m)$?**

#### **Likelihood Function**

The likelihood function is the probability of seeing the training set given a choice $\beta_0$ and $\beta_1$ of our parameters:

$$ \prod_{i=1}^m P(y^i|x^i;\beta_0, \beta_1) $$

Since $\epsilon \sim N(0, \sigma^2)$, each $y^i$ given $x^i$ is Gaussian with mean $\beta_0+\beta_1x^i$ and variance $\sigma^2$. Plugging the PDF of the Gaussian in for $P(y^i|x^i;\beta_0, \beta_1)$:

$$ = \prod_{i=1}^m \frac{1}{\sqrt{2\pi \sigma^2}}\cdot e^{-\frac{(y^i-(\beta_0+\beta_1x^i))^2}{2\sigma^2}} $$

This is the likelihood of our training set. Let's call this quantity $L(\beta_0,\beta_1)$. We want to choose the $\beta_0$ and $\beta_1$ that maximize this likelihood.

#### **Maximizing Likelihood**

Instead of directly maximizing the likelihood, we will maximize the log-likelihood $\log L(\beta_0,\beta_1)$. Because the likelihood is a product, taking the log conveniently turns it into a sum, and because the log is increasing, the same betas maximize both.

$$ \log L(\beta_0,\beta_1) = \log\prod_{i=1}^m P(y^i|x^i;\beta_0,\beta_1) $$
$$ = \sum_{i=1}^m \log P(y^i|x^i;\beta_0,\beta_1) $$
$$ = -\frac{m}{2}\log(2\pi)-m\log\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^m (y^i-(\beta_0+\beta_1x^i))^2 $$
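
The last line follows by taking the log of a single Gaussian factor and then summing over the $m$ training points:

$$ \log\left(\frac{1}{\sqrt{2\pi \sigma^2}}\cdot e^{-\frac{(y^i-(\beta_0+\beta_1x^i))^2}{2\sigma^2}}\right) = -\frac{1}{2}\log(2\pi)-\log\sigma-\frac{(y^i-(\beta_0+\beta_1x^i))^2}{2\sigma^2} $$

The first two terms are the same for every $i$, which is where the factors of $m$ come from.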

Notice that the only term that will be affected by our choice of $\beta$'s is the last term. Also notice that the sum in the last term is the **least-squares** objective for simple linear regression we looked at earlier, so maximizing the likelihood is the same as minimizing the squared loss.
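
A small numerical sketch of this equivalence: the code below computes the Gaussian log-likelihood from the formula above and checks that the least-squares betas score higher than nearby perturbed choices. The data, the value of `sigma`, and the function names are assumptions for illustration only.

```python
import math

def log_likelihood(beta0, beta1, sigma, xs, ys):
    """Gaussian log-likelihood of the training set under y = beta0 + beta1*x + noise."""
    m = len(xs)
    sse = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    return -m / 2 * math.log(2 * math.pi) - m * math.log(sigma) - sse / (2 * sigma ** 2)

def least_squares_betas(xs, ys):
    """Closed-form least-squares intercept and slope for scalar x."""
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / m
    x2_bar = sum(x * x for x in xs) / m
    beta1 = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    return y_bar - beta1 * x_bar, beta1

# Hypothetical noisy data; sigma is assumed known, as in the text.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
sigma = 1.0

b0, b1 = least_squares_betas(xs, ys)
best = log_likelihood(b0, b1, sigma, xs, ys)

# Log-likelihood at a grid of perturbed betas around the least-squares solution.
perturbed = [
    log_likelihood(b0 + d0, b1 + d1, sigma, xs, ys)
    for d0 in (-0.5, 0.0, 0.5)
    for d1 in (-0.5, 0.0, 0.5)
    if (d0, d1) != (0.0, 0.0)
]
print(all(best > value for value in perturbed))
```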

#### **Interpreting Coefficients**

There are two interpretations for coefficients in linear regression:
- Geometric: the coefficients of the line that minimizes the squared distance from the line to our labels
- Statistical: the coefficients that give the maximum likelihood estimator for a training set generated per the assumption, in this case $y^i \sim N(\beta_0+\beta_1x^i, \sigma^2)$.

# Personal Notes #