# **IS-4100: Linear Regression from Scratch**

### **Objective**
In this assignment, you will implement a simple linear regression model from scratch using NumPy. You will compute the slope ($m$) and intercept ($b$) of the best-fit line using the least squares method and evaluate your implementation with test data.

---

### **Background**
Simple linear regression models the relationship between a dependent variable ($y$) and a single independent variable ($x$) as a straight line. The formula for the regression line is:

$$
y = mx + b
$$

Where:
- **$m$ (slope)** is calculated as:
$$
m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
$$

- **$b$ (intercept)** is calculated as:
$$
b = \bar{y} - m \bar{x}
$$
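
Both formulas follow from minimizing the sum of squared residuals. A short derivation, sketched here with $S$ as an illustrative name for that sum (it is not used elsewhere in the assignment):

$$
S(m, b) = \sum (y_i - m x_i - b)^2
$$

Setting $\partial S / \partial b = 0$ gives the intercept:

$$
-2 \sum (y_i - m x_i - b) = 0 \quad\Longrightarrow\quad b = \bar{y} - m \bar{x}
$$

Setting $\partial S / \partial m = 0$ and substituting that expression for $b$ gives the slope:

$$
\sum x_i \left[ (y_i - \bar{y}) - m (x_i - \bar{x}) \right] = 0 \quad\Longrightarrow\quad m = \frac{\sum x_i (y_i - \bar{y})}{\sum x_i (x_i - \bar{x})} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
$$

The last equality holds because $\sum (y_i - \bar{y}) = 0$ and $\sum (x_i - \bar{x}) = 0$, so replacing the factor $x_i$ with $x_i - \bar{x}$ changes neither sum.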

The model predicts values of the dependent variable $y$ from the independent variable $x$. The **goodness of fit** of the model can be measured using the $R^2$ value, which is the proportion of the variability in $y$ that the model explains.
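
To build intuition for the scale of $R^2$, it can help to look at a few reference cases. The sketch below assumes the usual definition, one minus the residual sum of squares divided by the total sum of squares; the helper name `r2_demo` and the toy values are illustrative only.

```python
import numpy as np

def r2_demo(y_true, y_pred):
    """Hypothetical helper: 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([2.0, 4.0, 6.0, 8.0])

# Perfect predictions leave no residual error, so R^2 is 1
r2_perfect = r2_demo(y_true, y_true)

# Always predicting the mean explains none of the variability, so R^2 is 0
r2_mean_only = r2_demo(y_true, np.full_like(y_true, y_true.mean()))

# Predictions that are worse than the mean push R^2 below 0
r2_worse = r2_demo(y_true, y_true[::-1])

print(r2_perfect, r2_mean_only, r2_worse)
```

A value close to 1 therefore means the line leaves little unexplained variation, while a value at or below 0 means the model does no better than predicting $\bar{y}$ everywhere.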

---

### **Task**

1. **Implement the Linear Regression Formula**
   - Write a function `linear_regression(x, y)` that takes two NumPy arrays, $x$ (independent variable) and $y$ (dependent variable), as inputs.
   - Compute the slope ($m$) and intercept ($b$) using the least squares method.
   - Return the slope and intercept as a tuple.

2. **Predict Using the Model**
   - Write a function `predict(x, m, b)` that takes $x$, $m$, and $b$ as inputs and predicts $y$ values using the regression line formula.

3. **Evaluate the Model**
   - Write a function `r_squared(y_true, y_pred)` to compute the $R^2$ value, which measures how well the model fits the data:
   $$
   R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}
   $$

4. **Test Your Implementation**
   - Use the provided dataset to test your functions.
   - Compare your slope, intercept, and $R^2$ results with the implementation from `sklearn.linear_model.LinearRegression`.
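
For the comparison in step 4, floating-point results are better compared with a tolerance than with `==`. A minimal sketch, assuming `np.polyfit` as a stand-in reference so that it runs without scikit-learn; the variable names are illustrative, and the same `np.isclose` check can be applied to `model.coef_[0]` and `model.intercept_` from a fitted `LinearRegression`.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([50, 55, 61, 66, 70, 74, 79, 83, 88, 92], dtype=float)

# Closed-form least-squares slope and intercept
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

# Reference fit (assumption: np.polyfit stands in for scikit-learn here).
# For deg=1 it returns the coefficients as [slope, intercept].
m_ref, b_ref = np.polyfit(x, y, deg=1)

# Compare with a tolerance instead of exact equality
slopes_match = bool(np.isclose(m, m_ref))
intercepts_match = bool(np.isclose(b, b_ref))
print(slopes_match, intercepts_match)
```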

---

### **Dataset**

You can use the following dataset to test your implementation:

```python
import numpy as np

# Independent variable (e.g., hours studied)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Dependent variable (e.g., test scores)
y = np.array([50, 55, 61, 66, 70, 74, 79, 83, 88, 92])
```

## Function Implementation

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
```

```python
# This function takes two array-likes of values (x values and y values) and
# returns a tuple of the slope and y-intercept of the line of best fit,
# calculated using the least-squares method.

def linear_regression(independent_var, dependent_var):
    # Make sure the parameters are NumPy arrays
    x = np.asarray(independent_var, dtype=float)
    y = np.asarray(dependent_var, dtype=float)

    # Calculate the means of x and y
    x_mean = np.mean(x)
    y_mean = np.mean(y)

    # Calculate the slope and intercept terms
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean

    return slope, intercept
```
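
One edge case the function above does not handle: if every `x` value is identical, the denominator $\sum (x_i - \bar{x})^2$ is zero and the slope is undefined (the division evaluates to `nan` with a runtime warning). A minimal sketch of a guarded variant, where the name `linear_regression_checked` and the choice to raise `ValueError` are assumptions rather than part of the assignment:

```python
import numpy as np

def linear_regression_checked(x, y):
    """Hypothetical variant of linear_regression() that rejects constant x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    if x.max() == x.min():
        # All x values are equal, so no unique best-fit slope exists
        raise ValueError("x must contain at least two distinct values")

    x_centered = x - x.mean()
    slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# A vertical stack of points has no well-defined slope
try:
    linear_regression_checked([3, 3, 3], [1, 2, 3])
    raised = False
except ValueError:
    raised = True

# An ordinary input still returns (slope, intercept)
slope_ok, intercept_ok = linear_regression_checked([1, 2, 3], [2, 4, 6])
```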

```python
# This function takes x values plus the slope (m) and intercept (b) returned
# by linear_regression() and returns the predicted y values.

def predict(x, m, b):
    # np.asarray lets the function accept a scalar, a list, or an array
    return m * np.asarray(x) + b
```

```python
# This function takes two array-likes as input: the actual y values and the
# corresponding predicted y values. It computes and returns the R-squared value.

def r_squared(y_true, y_pred):
    # Ensure parameters are NumPy arrays
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Calculate R^2 = 1 - (residual sum of squares / total sum of squares)
    y_mean = np.mean(y_true)

    numerator = np.sum((y_true - y_pred) ** 2)
    denominator = np.sum((y_true - y_mean) ** 2)

    return 1 - (numerator / denominator)
```


## Function Testing

```python
# Testing of the linear_regression() function

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([50, 55, 61, 66, 70, 74, 79, 83, 88, 92])

results = linear_regression(x, y)

print(f'Manual BF Line  = {results[0]}x + {results[1]}')

# Verify results using scikit-learn, which expects a 2-D feature matrix
X = x.reshape(-1, 1)

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_

print(f'SKLearn BF Line = {slope}x + {intercept}')
```

```
Manual BF Line  = 4.618181818181818x + 46.4
SKLearn BF Line = 4.618181818181818x + 46.4
```


```python
# Testing of the predict() function
y_prediction = predict(x, results[0], results[1])

# Verify results using scikit-learn
y_prediction2 = model.predict(X)

# Print the values side by side
print('Manual Function     SKLearn Function')

for val1, val2 in zip(y_prediction, y_prediction2):
    print(f"{val1:.2f}                {val2:.2f}")
```

```
Manual Function     SKLearn Function
51.02                51.02
55.64                55.64
60.25                60.25
64.87                64.87
69.49                69.49
74.11                74.11
78.73                78.73
83.35                83.35
87.96                87.96
92.58                92.58
```


```python
# Testing of the r_squared() function

# Get the value from the manual function
r_squared_value = r_squared(y, y_prediction)

# Verify results using scikit-learn
r_squared_value2 = model.score(X, y)

print(f'Manual R^2 = {r_squared_value:.3f}')
print(f'SKLearn R^2 = {r_squared_value2:.3f}')
```

```
Manual R^2 = 0.998
SKLearn R^2 = 0.998
```
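
Beyond matching scikit-learn, a least-squares fit with an intercept has two properties that can be checked directly: the residuals sum to zero, and they are orthogonal to `x`. A minimal sketch, recomputing the fit inline so that it runs on its own (the variable names are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([50, 55, 61, 66, 70, 74, 79, 83, 88, 92], dtype=float)

# Closed-form least-squares fit
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

# Residuals: observed minus predicted
residuals = y - (m * x + b)

# Both sums should vanish up to floating-point rounding
residual_sum = np.sum(residuals)
residual_dot_x = np.sum(residuals * x)

sum_is_zero = bool(np.isclose(residual_sum, 0.0, atol=1e-9))
dot_is_zero = bool(np.isclose(residual_dot_x, 0.0, atol=1e-9))
print(sum_is_zero, dot_is_zero)
```

If either check fails for a hand-written implementation, the slope or intercept formula is likely miscoded.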
