# Linear Regression

Many factors can influence a student's performance on a test, but for the sake of this example, let's consider the number of hours spent studying as the key determinant. In this scenario, the student's test score is the 'dependent' variable, which is influenced by the 'independent' variable: hours studied.

For instance, if student one achieved a perfect score of 100% after dedicating 10 hours to studying, and student two attained a score of 50% after studying for 5 hours, what might student three score after studying for 7 hours? If you guessed 70%, you're correct.

In the language of data analysis, predicting the value of one variable from another by fitting a straight line to the data is termed linear regression analysis.

The linear regression formula is represented as:

\[ y = mx + b \]

Where:
- \( y \) is the dependent variable,
- \( x \) is the independent variable,
- \( m \) is the slope of the regression line, and
- \( b \) is the y-intercept.

The formula above introduces two constants, \( m \) and \( b \). Let's begin with \( m \): it is the slope of the regression line, the rate at which the student's score changes with respect to the number of hours studied (\( dy/dx \)).
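As a minimal sketch, the slope can be computed as "rise over run" from the two students in the opening example (5 hours for 50% and 10 hours for 100%). The helper name `slope` is illustrative only and not part of any library.

```python
def slope(x1, y1, x2, y2):
    """Rate of change of y with respect to x between two points (rise over run)."""
    return (y2 - y1) / (x2 - x1)

# Student two: 5 hours -> 50%; student one: 10 hours -> 100%
m = slope(5, 50, 10, 100)
print(m)  # 10.0
```

Each extra hour of study is worth 10 percentage points in this example.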

Now, imagine a situation where class attendance contributes 10% of the overall score. In this context, a student who attended every class but spent no time studying would still score 10%. This 10% is the y-intercept \( b \): the value of \( y \) when \( x = 0 \).
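To see the role of the intercept, the sketch below evaluates the line at zero study hours. The intercept of 10 comes from the attendance scenario above; the slope of 9 is a purely hypothetical value, chosen here so that 10 hours of study still tops out at 100%. The function name `predict` is also just an illustrative helper.

```python
def predict(x, m, b):
    """Evaluate the regression line y = m*x + b."""
    return m * x + b

b = 10  # attendance credit from the scenario above
m = 9   # hypothetical slope, for illustration only

print(predict(0, m, b))   # 10 -> with no studying, the score equals b
print(predict(10, m, b))  # 100
```

Whatever the slope is, the prediction at `x = 0` is always `b`.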


## Question

Let's revisit the initial scenario with the student data and discuss how we can mathematically determine the values of m, b, and y using a set of data points.

| Student | Study Hours | Score (%) |
| ------- | ----------- | --------- |
| S1      | 10          | 100       |
| S2      | 5           | 50        |
| S3      | 7           | y         |

Find the values of m, b and y.

## Solution

Prerequisite
- Linear Algebra: Systems of Linear Equations

### Formula
y = mx + b

### Step 1
Substitute the values of x and y for S1 and S2 into the formula above to obtain a system of two linear equations:

100 = 10m + b (eq. 1)

50 = 5m + b (eq. 2)

### Step 2
Subtract eq. 2 from eq. 1 to eliminate b:

50 = 5m

### Step 3
Divide both sides by 5:

m = 10

### Step 4
Substitute the value of m into either eq. 1 or eq. 2 to find b. Using eq. 1:

100 - 10(10) = b

b = 0

### Step 5
Solve for y for student 3, who studied for 7 hours:

y = 10(7) + 0

y = 70
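
The same elimination can be checked numerically. The sketch below writes eq. 1 and eq. 2 in matrix form and solves them with NumPy's `np.linalg.solve`; the variable names `A` and `scores` are illustrative choices, not part of the original walkthrough.

```python
import numpy as np

# Each row is one equation in the unknowns (m, b):
#   10m + 1b = 100   (eq. 1)
#    5m + 1b =  50   (eq. 2)
A = np.array([[10.0, 1.0],
              [5.0, 1.0]])
scores = np.array([100.0, 50.0])

m, b = np.linalg.solve(A, scores)

# Predict student three's score after 7 hours of study
y = m * 7 + b
print(m, b, y)
```

Up to floating-point rounding, this should reproduce the hand-derived values m = 10, b = 0, and y = 70.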

## Handling Larger Datasets

The previous example provided a straightforward illustration because both data points lay exactly on one line. Real datasets are noisy, and as they grow in size, say to 10,000 rows, the points no longer share a single line: computing m and b from just two of them gives a different answer for every pair chosen and may fit the rest of the data poorly. In such cases, we fit a regression line to all the points by minimizing a cost (error) function, possibly with an iterative method such as gradient descent.
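
As a rough sketch of these ideas, the block below defines a mean squared error (MSE) cost function, minimizes it with plain gradient descent, and compares the result with NumPy's closed-form least-squares fit (`np.polyfit`). The five points are hard-coded from the first five rows of the test-score data printed later in this section so that the block runs without the CSV file. The choice of MSE as the cost, the learning rate, and the step count are assumptions made for illustration; they were picked so that the loop converges on these particular values.

```python
import numpy as np

# First five rows of the test-score data shown in this section
hours = np.array([7.12, 11.45, 5.78, 14.23, 9.36])
scores = np.array([33.54, 61.13, 31.98, 67.57, 47.48])

def mse(m, b, x, y):
    """Cost function: mean squared error of the line y = m*x + b."""
    return np.mean((m * x + b - y) ** 2)

def gradient_descent(x, y, lr=0.005, steps=50_000):
    """Fit m and b by repeatedly stepping against the MSE gradient."""
    m, b = 0.0, 0.0
    for _ in range(steps):
        error = m * x + b - y             # residuals for the current line
        m -= lr * 2 * np.mean(error * x)  # d(MSE)/dm
        b -= lr * 2 * np.mean(error)      # d(MSE)/db
    return m, b

m_gd, b_gd = gradient_descent(hours, scores)

# Closed-form least-squares fit for comparison (degree-1 polynomial)
m_ls, b_ls = np.polyfit(hours, scores, 1)

print(f"gradient descent: m={m_gd:.3f}, b={b_gd:.3f}, MSE={mse(m_gd, b_gd, hours, scores):.3f}")
print(f"least squares:    m={m_ls:.3f}, b={b_ls:.3f}, MSE={mse(m_ls, b_ls, hours, scores):.3f}")
```

Both approaches should land on essentially the same line. Gradient descent is the slower route here, but the same loop carries over to models that have no closed-form solution.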

Let's consider a new dataset to delve deeper into this complexity.

```python
# import necessary modules
import numpy as np
import pandas as pd

# read the new dataset using pd.read_csv(path_to_csv)
df = pd.read_csv('data/test_score.csv')

print(df)
```

    Study Hours  Score
0          7.12  33.54
1         11.45  61.13
2          5.78  31.98
3         14.23  67.57
4          9.36  47.48
5          2.15  13.52
6         16.78  86.92
7          3.99  23.76
8         18.21  90.57
9         13.45  67.02
10         6.89  37.21
11        19.78  99.45
12         8.55  42.12
13         1.45   9.34
14        12.67  64.89
15         4.89  25.74
16        17.32  86.11
17        10.75  53.67
18         4.23  22.56
19        15.67  79.12
20         6.44  34.21
21         9.88  51.79
22         2.99  14.56
23        13.12  64.78
24         5.56  27.89
25        11.01  55.32
26         8.23  40.98
27         3.78  19.56
28        16.01  81.78
29         7.99  41.23
30        18.56  94.12
31        12.89  65.76
32         6.21  34.98
33        19.12  96.89
34         9.45  49.34
35         1.78  10.56
36        14.56  73.23
37         4.12  20.67
38        17.89  89.45
39        10.23  51.78
40         3.45  18.56
41        15.34  78.12
42         