# Linear Regression

- Fit a line to a data set of observations
- Use this line to predict unobserved values
- I don’t know why they call it “regression.” It’s really misleading. You can use it to predict points in the future, the past, whatever. In fact, time usually has nothing to do with it.

# Linear Regression Example: Weight (kg) vs Height (cm)

Linear regression is a statistical method that allows us to model the relationship between a dependent variable and one or more independent variables. It is one of the most widely used techniques for data analysis and prediction.

In this example, we explore the relationship between weight (kg) and height (cm). We treat weight as the dependent variable, because it may depend on height, and height as the independent variable, because it is the value we use to predict weight.

To perform a linear regression, you need to have some data points that represent the values of weight and height for different individuals. You can plot these data points on a scatter plot, where the x-axis is height and the y-axis is weight.

The goal of linear regression is to find a line that best fits the data points. This line is called the regression line, and it has the equation:

$$y = \beta_0 + \beta_1 x$$

where $y$ is the predicted value of weight, $x$ is the given value of height, $\beta_0$ is the intercept (the value of $y$ when $x$ is zero), and $\beta_1$ is the slope (the change in $y$ for a unit change in $x$).

The regression line can be used to estimate the weight of a person given their height, or to test hypotheses about the relationship between weight and height. For example, you can test whether there is a significant positive or negative correlation between weight and height, or whether the slope of the regression line is different from zero.

There are different methods to find the best-fitting regression line, such as the method of least squares, which minimizes the sum of squared errors between the observed and predicted values of weight. You can also use software tools or online calculators to perform linear regression and obtain the values of $\beta_0$ and $\beta_1$, as well as other statistics such as the coefficient of determination ($R^2$), which measures how well the regression line explains the variation in weight.

# Linear Regression Usually using “least squares”

Linear regression is a statistical method that tries to find the best-fitting line for a set of data points. The “least squares” approach measures how well a line fits the data by the sum of squared errors between the observed values and the values the line predicts, and it picks the line that minimizes that sum. A squared error measures how much a prediction differs from the actual value: for each data point, take the difference between the predicted value and the actual value, then square that difference.
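To make the idea of a squared error concrete, here is a minimal sketch that scores two candidate lines against the same points. The function name `sum_squared_errors` and the four data points are made up for illustration; they are not part of the original notes.

```python
def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared differences between each observed y and the line's prediction."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = slope * x + intercept
        total += (y - predicted) ** 2
    return total


# Illustrative points that lie exactly on y = 2x + 1 (assumed data).
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

sse_exact = sum_squared_errors(xs, ys, slope=2.0, intercept=1.0)  # passes through every point
sse_worse = sum_squared_errors(xs, ys, slope=1.0, intercept=1.0)  # a poorer candidate
print(sse_exact, sse_worse)
```

The line with the smaller sum is the better fit under the least squares criterion.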

The formula for the least squares line is $y = mx + b$, where $m$ is the slope and $b$ is the y-intercept (the same line as $y = \beta_0 + \beta_1 x$, with $m = \beta_1$ and $b = \beta_0$). To find $m$ and $b$, we can use these steps:

- For each $(x,y)$ point, calculate $x^2$ and $xy$.
- Sum up $x$, $y$, $x^2$ and $xy$, which gives us $\Sigma x$, $\Sigma y$, $\Sigma x^2$ and $\Sigma xy$.
- Calculate $m$ using this formula: $m = \frac{N\Sigma(xy) - \Sigma x\Sigma y}{N\Sigma(x^2) - (\Sigma x)^2}$, where $N$ is the number of points.
- Calculate $b$ using this formula: $b = \frac{\Sigma y - m\Sigma x}{N}$.
- Plug $m$ and $b$ into the equation $y = mx + b$.
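The steps above can be sketched as a small function. This is a minimal pure-Python version, assuming the inputs are plain lists of numbers; the name `least_squares_line` and the sample points are illustrative.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        # Happens when every x is the same: a vertical spread has no finite slope.
        raise ValueError("all x values are identical; the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - m * sum_x) / n
    return m, b


# Illustrative points on y = 2x + 1 (assumed data), so we expect m = 2 and b = 1.
m, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)
```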

Here is an example of how to use the least squares method to find the best-fitting line for some data points: 

| $x$ | $y$ | $x^2$ | $xy$ |
|---|---|-----|----|
| 2 | 4 | 4   | 8  |
| 3 | 5 | 9   | 15 |
| 5 | 7 | 25  | 35 |
| 7 | 10| 49  | 70 |
| 9 | 15| 81  | 135|

$\Sigma x = 26$, $\Sigma y = 41$, $\Sigma x^2 = 168$, $\Sigma xy = 263$, $N = 5$

$m = \frac{5 \times 263 - 26 \times 41}{5 \times 168 - 26^2} = \frac{249}{164} \approx 1.518$

$b = \frac{41 - 1.5183 \times 26}{5} \approx 0.305$

$y = 1.518x + 0.305$
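The arithmetic in this example can be checked with a few lines of Python. The sketch below recomputes $m$ and $b$ from the table and also computes the coefficient of determination $R^2$ using the common definition $1 - SS_{res}/SS_{tot}$; the variable names are illustrative.

```python
xs = [2, 3, 5, 7, 9]
ys = [4, 5, 7, 10, 15]
n = len(xs)

# Sums used by the least squares formulas.
sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

# R^2 compares the line's residual error with the spread of y around its mean.
mean_y = sum_y / n
ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"m = {m:.3f}, b = {b:.3f}, R^2 = {r_squared:.3f}")
```

An $R^2$ close to 1 means the line explains most of the variation in $y$; for these five points it comes out at roughly 0.96.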


# More than one way to do Linear Regression

- Gradient Descent is an alternative to solving least squares directly: it minimizes the same kind of squared-error loss, but iteratively.
- Basically iterates to find the line that best follows the contours defined by the data.
- Can make sense when dealing with 3D data
- Easy to try in Python and just compare the results to least squares
    - But usually least squares is a perfectly good choice.

Let's go through how to use Gradient Descent for linear regression in two dimensions ($\mathbb{R}^2$) with a simple example.

**Step 1: Define the Model**

In the plane ($\mathbb{R}^2$), a linear regression model can be represented as:

$$y = b_0 + b_1 x$$

where:
- $y$ is the dependent variable we want to predict.
- $x$ is the independent variable.
- $b_0$ and $b_1$ are the parameters of the model we want to learn.

**Step 2: Initialize Parameters**

We initialize the parameters (b0, b1) with some random values.

**Step 3: Define the Loss Function**

The loss function for our linear regression model is the Mean Squared Error (MSE), defined as:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - (b_0 + b_1 x_i)\right)^2$$

where:
- $N$ is the total number of observations.
- $\sum$ denotes the sum over all observations.
- $y_i$ is the actual value of the dependent variable for the $i$th observation.
- $x_i$ is the value of the independent variable for the $i$th observation.
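As a rough sketch, the loss can be written as a short Python function. The name `mean_squared_error` and the five sample points (which lie on $y = x + 1$) are assumptions made for illustration.

```python
def mean_squared_error(xs, ys, b0, b1):
    """Average squared difference between each y and the prediction b0 + b1*x."""
    n = len(xs)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / n


xs = [1, 2, 3, 4, 5]
ys = [2, 3, 4, 5, 6]

mse_start = mean_squared_error(xs, ys, b0=0.0, b1=0.0)  # loss at a poor starting guess
mse_fit = mean_squared_error(xs, ys, b0=1.0, b1=1.0)    # loss on the line through all points
print(mse_start, mse_fit)
```

The loss is large at the starting guess and drops to zero for the line that passes through every point, which is what gradient descent tries to reach.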

**Step 4: Compute the Gradients**

The gradients of the loss function with respect to the parameters are:

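$$\frac{\partial \mathrm{MSE}}{\partial b_0} = -\frac{2}{N} \sum_{i=1}^{N} \left(y_i - (b_0 + b_1 x_i)\right)$$

$$\frac{\partial \mathrm{MSE}}{\partial b_1} = -\frac{2}{N} \sum_{i=1}^{N} \left(y_i - (b_0 + b_1 x_i)\right) x_i$$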
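One way to gain confidence in these gradient formulas is to compare them with a numerical estimate. The sketch below evaluates the analytic gradients at an arbitrary point and checks them against central finite differences of the MSE; the function names, the test point, and the sample data are all illustrative assumptions.

```python
def mse(xs, ys, b0, b1):
    n = len(xs)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / n


def mse_gradients(xs, ys, b0, b1):
    """Analytic partial derivatives of the MSE with respect to b0 and b1."""
    n = len(xs)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    grad_b0 = -2.0 / n * sum(residuals)
    grad_b1 = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    return grad_b0, grad_b1


xs = [1, 2, 3, 4, 5]
ys = [2, 3, 4, 5, 6]
b0, b1 = 0.5, 0.25  # arbitrary point at which to compare the two estimates
eps = 1e-6

grad_b0, grad_b1 = mse_gradients(xs, ys, b0, b1)
# Central differences: nudge one parameter up and down, keep the other fixed.
numeric_b0 = (mse(xs, ys, b0 + eps, b1) - mse(xs, ys, b0 - eps, b1)) / (2 * eps)
numeric_b1 = (mse(xs, ys, b0, b1 + eps) - mse(xs, ys, b0, b1 - eps)) / (2 * eps)

print(grad_b0, numeric_b0)
print(grad_b1, numeric_b1)
```

If the two numbers in each printed pair agree closely, the derivative formulas and the code implementing them are consistent.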

**Step 5: Update the Parameters**

We update the parameters using the learning rate (α) and the gradients:

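$$b_0 \leftarrow b_0 - \alpha \frac{\partial \mathrm{MSE}}{\partial b_0}$$

$$b_1 \leftarrow b_1 - \alpha \frac{\partial \mathrm{MSE}}{\partial b_1}$$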

**Step 6: Repeat Steps 4 and 5**

We repeat steps 4 and 5 for a fixed number of iterations or until the loss function converges to its minimum. At that point the step sizes ($\alpha \cdot \partial \mathrm{MSE}/\partial b_0$ and $\alpha \cdot \partial \mathrm{MSE}/\partial b_1$) tend to 0.
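Putting the six steps together, a minimal sketch of the whole loop might look like this. The function name `fit_by_gradient_descent`, the `tolerance` stopping threshold, the zero starting values, and the sample data (five points on $y = x + 1$) are assumptions chosen for illustration, not a definitive implementation.

```python
def fit_by_gradient_descent(xs, ys, learning_rate=0.01, max_iterations=1000,
                            tolerance=1e-9, b0=0.0, b1=0.0):
    """Fit y = b0 + b1*x by gradient descent on the MSE.

    Stops early once both step sizes fall below `tolerance`.
    Returns (b0, b1, iterations_used).
    """
    n = len(xs)
    iterations_used = 0
    for _ in range(max_iterations):
        # Step 4: gradients at the current parameters.
        residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
        grad_b0 = -2.0 / n * sum(residuals)
        grad_b1 = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))

        # Step 5: move both parameters against their gradients.
        step_b0 = learning_rate * grad_b0
        step_b1 = learning_rate * grad_b1
        b0 -= step_b0
        b1 -= step_b1
        iterations_used += 1

        # Step 6: stop when the steps have become negligible.
        if abs(step_b0) < tolerance and abs(step_b1) < tolerance:
            break
    return b0, b1, iterations_used


xs = [1, 2, 3, 4, 5]
ys = [2, 3, 4, 5, 6]
b0, b1, used = fit_by_gradient_descent(xs, ys, max_iterations=100_000)
print(b0, b1, used)
```

Both parameters are updated from gradients computed at the same point, so neither update sees the other's new value within an iteration.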

Let's see how to fit a linear regression model ($y = b_0 + b_1 x$) to a small dataset using gradient descent. We'll use a learning rate of 0.01 and 1000 iterations, start from the initial parameters $b_0 = 0$ and $b_1 = 0$, and use the Mean Squared Error (MSE) as our loss function. Our dataset is:

**Dataset:**

| x | y  |
|---|----|
| 1 | 2  |
| 2 | 3  |
| 3 | 4  |
| 4 | 5  |
| 5 | 6  |

**Step 1: Initialize Parameters**

We initialize both parameters to 0, so $b_0 = 0$ and $b_1 = 0$.

**Step 2: Define the Loss Function**

The loss function is the Mean Squared Error (MSE):

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - (b_0 + b_1 x_i)\right)^2$$

**Step 3: Compute the Gradients and Update Parameters**

We need to compute the gradients and update the parameters for each iteration. 

The gradients of the loss function with respect to the parameters are:

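$$\frac{\partial \mathrm{MSE}}{\partial b_0} = -\frac{2}{N} \sum_{i=1}^{N} \left(y_i - (b_0 + b_1 x_i)\right)$$

$$\frac{\partial \mathrm{MSE}}{\partial b_1} = -\frac{2}{N} \sum_{i=1}^{N} \left(y_i - (b_0 + b_1 x_i)\right) x_i$$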

***Let's calculate these for the first iteration:***

For b0:

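$$\frac{\partial \mathrm{MSE}}{\partial b_0} = -\frac{2}{5}\left[(2-(0+0 \cdot 1)) + (3-(0+0 \cdot 2)) + (4-(0+0 \cdot 3)) + (5-(0+0 \cdot 4)) + (6-(0+0 \cdot 5))\right] = -\frac{2}{5} \times 20 = -8$$

For b1:

$$\frac{\partial \mathrm{MSE}}{\partial b_1} = -\frac{2}{5}\left[(2-(0+0 \cdot 1)) \cdot 1 + (3-(0+0 \cdot 2)) \cdot 2 + (4-(0+0 \cdot 3)) \cdot 3 + (5-(0+0 \cdot 4)) \cdot 4 + (6-(0+0 \cdot 5)) \cdot 5\right] = -\frac{2}{5} \times 70 = -28$$

Update the parameters:

$$b_0 = 0 - 0.01 \times (-8) = 0.08$$

$$b_1 = 0 - 0.01 \times (-28) = 0.28$$

***Second Iteration:***

For b0:

$$\frac{\partial \mathrm{MSE}}{\partial b_0} = -\frac{2}{5}\left[(2-(0.08+0.28 \cdot 1)) + (3-(0.08+0.28 \cdot 2)) + (4-(0.08+0.28 \cdot 3)) + (5-(0.08+0.28 \cdot 4)) + (6-(0.08+0.28 \cdot 5))\right] = -\frac{2}{5} \times 15.4 = -6.16$$

For b1:

$$\frac{\partial \mathrm{MSE}}{\partial b_1} = -\frac{2}{5}\left[(2-(0.08+0.28 \cdot 1)) \cdot 1 + (3-(0.08+0.28 \cdot 2)) \cdot 2 + (4-(0.08+0.28 \cdot 3)) \cdot 3 + (5-(0.08+0.28 \cdot 4)) \cdot 4 + (6-(0.08+0.28 \cdot 5)) \cdot 5\right] = -\frac{2}{5} \times 53.4 = -21.36$$

Update the parameters:

$$b_0 = 0.08 - 0.01 \times (-6.16) = 0.1416$$

$$b_1 = 0.28 - 0.01 \times (-21.36) = 0.4936$$

***Third Iteration:***

For b0:

$$\frac{\partial \mathrm{MSE}}{\partial b_0} = -\frac{2}{5}\left[(2-(0.1416+0.4936 \cdot 1)) + (3-(0.1416+0.4936 \cdot 2)) + (4-(0.1416+0.4936 \cdot 3)) + (5-(0.1416+0.4936 \cdot 4)) + (6-(0.1416+0.4936 \cdot 5))\right] = -\frac{2}{5} \times 11.888 \approx -4.755$$

For b1:

$$\frac{\partial \mathrm{MSE}}{\partial b_1} = -\frac{2}{5}\left[(2-(0.1416+0.4936 \cdot 1)) \cdot 1 + (3-(0.1416+0.4936 \cdot 2)) \cdot 2 + (4-(0.1416+0.4936 \cdot 3)) \cdot 3 + (5-(0.1416+0.4936 \cdot 4)) \cdot 4 + (6-(0.1416+0.4936 \cdot 5)) \cdot 5\right] = -\frac{2}{5} \times 40.728 \approx -16.291$$

Update the parameters:

$$b_0 \approx 0.1416 - 0.01 \times (-4.755) \approx 0.1892$$

$$b_1 \approx 0.4936 - 0.01 \times (-16.291) \approx 0.6565$$

After three iterations, our parameters are approximately $b_0 \approx 0.1892$ and $b_1 \approx 0.6565$.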
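Hand arithmetic like this is easy to get wrong, so it can help to let a short script carry out the first three updates on the dataset and print each gradient and parameter value. This is a minimal sketch; the variable names and the `history` list are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 4, 5, 6]
n = len(xs)
alpha = 0.01          # learning rate from the example
b0, b1 = 0.0, 0.0     # initial parameters from the example

history = []  # one (grad_b0, grad_b1, b0, b1) tuple per iteration
for iteration in range(1, 4):
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    grad_b0 = -2.0 / n * sum(residuals)
    grad_b1 = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))

    # Both gradients are computed before either parameter changes.
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1
    history.append((grad_b0, grad_b1, b0, b1))
    print(f"iteration {iteration}: grad_b0={grad_b0:.4f}, grad_b1={grad_b1:.4f}, "
          f"b0={b0:.4f}, b1={b1:.4f}")
```

Comparing the printed lines with the hand calculation is a quick way to catch a slipped sign or a mis-added sum.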

**Step 4: Repeat Step 3**

We repeat step 3 for a fixed number of iterations (in this case, 1000) or until our loss function converges to the minimum. 

As the iterations proceed, the values of $b_0$ and $b_1$ converge. Because every point in this dataset lies on the line $y = x + 1$, they approach $b_0 = 1$ and $b_1 = 1$, the parameters of the linear regression model that best fits our data.
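To compare gradient descent with least squares on this dataset, the sketch below runs the 1000 iterations and then computes the closed-form least squares line from the summation formulas. The variable names are illustrative, and the comparison assumes the same learning rate and starting point as the example.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 4, 5, 6]
n = len(xs)

# Gradient descent: learning rate 0.01, 1000 iterations, starting at (0, 0).
alpha = 0.01
b0, b1 = 0.0, 0.0
for _ in range(1000):
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    grad_b0 = -2.0 / n * sum(residuals)
    grad_b1 = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

# Closed-form least squares from the summation formulas.
sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n

print(f"gradient descent: b0={b0:.4f}, b1={b1:.4f}")
print(f"least squares:    b0={intercept:.4f}, b1={slope:.4f}")
```

With this learning rate the gradient descent estimates should land close to the least squares values without matching them exactly; more iterations narrow the gap.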

Please note that in practice, we often use libraries or built-in functions to perform these computations as they can handle more complex scenarios and are optimized for performance.