<a href="https://colab.research.google.com/github/jaityagi63/TDS/blob/master/simple_linear_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Formula for simple linear regression (closed-form solution)

$$
m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
$$

$$
b = \bar{y} - m \cdot \bar{x}
$$
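The two formulas map directly onto array operations. The following is a minimal vectorized sketch, assuming equal-length one-dimensional inputs; the helper name `closed_form_fit` is hypothetical, and `np.polyfit` appears only as a cross-check.

```python
import numpy as np

def closed_form_fit(x, y):
    """Return slope m and intercept b of the least-squares line (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center both variables around their means
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    m = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
    # Intercept: the fitted line passes through (x_mean, y_mean)
    b = y.mean() - m * x.mean()
    return m, b

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
m, b = closed_form_fit(x, y)

# Cross-check against NumPy's degree-1 polynomial fit (slope first, then intercept)
m_ref, b_ref = np.polyfit(x, y, deg=1)
print(m, b)
```

For this data the numerator is 6 and the denominator is 10, so the slope is 0.6 and the intercept is 4 - 0.6 · 3 = 2.2.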


In [None]:
class LR_function:
    def __init__(self):
        self.m = None
        self.b = None

    def fit(self, x_train, y_train):
        num, den = 0, 0
        x_mean = x_train.mean()
        y_mean = y_train.mean()

        for i in range(x_train.shape[0]):
            num += (x_train[i] - x_mean) * (y_train[i] - y_mean)
            den += (x_train[i] - x_mean) ** 2

        self.m = num / den
        self.b = y_mean - self.m * x_mean

        print(f'value of m is: {self.m}')
        print(f'value of b is: {self.b}')
        return self.m, self.b

    def predict(self, x_test):
        return self.m * x_test + self.b

    def show_predictions(self, x_train, y_train, x_test):
        # Fit on the training data, then print and return the test predictions
        self.fit(x_train, y_train)
        y_pred = self.predict(x_test)
        print(f'y_test (predicted): {y_pred}')
        return y_pred



In [None]:
one = LR_function()


In [None]:
import numpy as np

In [None]:
x_train = np.array([1, 2, 3, 4, 5])
y_train = np.array([2, 4, 5, 4, 5])


In [None]:
one.fit(x_train , y_train )

value of m is: 0.6
value of b is: 2.2


(np.float64(0.6), np.float64(2.2))

In [None]:
x_test = np.array([6, 7, 8, 9, 10])
one.show_predictions(x_train,y_train,x_test)

value of m is: 0.6
value of b is: 2.2
y_test (predicted): [5.8 6.4 7.  7.6 8.2]


array([5.8, 6.4, 7. , 7.6, 8.2])

Metrics for linear regression
1. MAE = mean absolute error
2. MSE = mean squared error
3. RMSE = root mean squared error
4. R² score = coefficient of determination
5. Adjusted R² score

### Mean Absolute Error (MAE)

The **Mean Absolute Error (MAE)** is a commonly used evaluation metric for regression models. It measures the average magnitude of the errors in a set of predictions, without considering their direction. The formula for MAE is:

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

#### Explanation:

- \( y_i \): Actual value for the \( i \)-th data point  
- \( \hat{y}_i \): Predicted value for the \( i \)-th data point  
- \( n \): Total number of data points  
- \( |y_i - \hat{y}_i| \): Absolute difference between the actual and predicted values

#### Key Points:

- MAE gives an idea of how far predictions are from the actual values on average.
- It is expressed in the same units as the target variable.
- MAE is easy to interpret and less sensitive to outliers than squared-error metrics such as Mean Squared Error (MSE).

A lower MAE value indicates better model performance.
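The formula can be checked by hand on a tiny example. This is a minimal sketch with made-up values; the function name `mean_absolute_error` is a hypothetical standalone helper, separate from the class used in this notebook.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Absolute errors are 1, 0 and 2, so MAE = (1 + 0 + 2) / 3 = 1.0
mae = mean_absolute_error(y_true, y_pred)
print(mae)
```

The sign of each error is discarded, so an over-prediction of 2 and an under-prediction of 2 contribute equally.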


In [None]:
class MAE(LR_function):
    def __init__(self):
        super().__init__()
        self.y_pred = None
        self.mae = None

    def calculate_mae(self, x_train, y_train, x_test, y_test):
        self.fit(x_train, y_train)

        self.y_pred = self.predict(x_test)

        absolute_errors = abs(self.y_pred - y_test)
        self.mae = absolute_errors.mean()

        print(f"Predictions: {self.y_pred}")
        print(f"Mean Absolute Error (MAE): {self.mae}")
        return self.mae




In [None]:


x_train = np.array([1, 2, 3, 4, 5])
y_train = np.array([2.1, 3.9, 6.2, 7.8, 9.7])

x_test = np.array([6, 7, 8])
y_test = np.array([11.8, 14.5, 16.1])

model = MAE()
model.calculate_mae(x_train, y_train, x_test, y_test)


value of m is: 1.9099999999999997
value of b is: 0.21000000000000085
Predictions: [11.67 13.58 15.49]
Mean Absolute Error (MAE): 0.5533333333333358


np.float64(0.5533333333333358)

### Mean Squared Error (MSE)

The **Mean Squared Error (MSE)** is a popular metric used to evaluate the performance of regression models. It measures the average of the squared differences between actual and predicted values.

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

#### Explanation:

- \( y_i \): Actual value for the \( i \)-th data point  
- \( \hat{y}_i \): Predicted value for the \( i \)-th data point  
- \( n \): Total number of data points  
- \( (y_i - \hat{y}_i)^2 \): Squared error for each prediction

#### Key Points:

- MSE gives a higher penalty to larger errors due to squaring.
- It is sensitive to outliers.
- The lower the MSE, the better the model's predictions align with the actual data.
- Unlike MAE, MSE is not expressed in the units of the target variable; it is in squared units.

MSE is often used when large errors are particularly undesirable.
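To see how squaring amplifies large errors, the sketch below compares MAE and MSE on two made-up sets of predictions, one with evenly sized errors and one with a single large miss. The helper names and the data are illustrative assumptions.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)))

def mean_squared_error(y_true, y_pred):
    return np.mean((np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2)

y_true = np.zeros(4)
pred_even = np.array([1.0, 1.0, 1.0, 1.0])      # every error has size 1
pred_outlier = np.array([1.0, 1.0, 1.0, 9.0])   # one error has size 9

mae_even = mean_absolute_error(y_true, pred_even)        # (1+1+1+1)/4 = 1
mse_even = mean_squared_error(y_true, pred_even)         # (1+1+1+1)/4 = 1
mae_outlier = mean_absolute_error(y_true, pred_outlier)  # (1+1+1+9)/4 = 3
mse_outlier = mean_squared_error(y_true, pred_outlier)   # (1+1+1+81)/4 = 21

print(mae_even, mse_even, mae_outlier, mse_outlier)
```

The single large miss triples the MAE but multiplies the MSE by 21, which is why MSE is described as sensitive to outliers.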


In [None]:
class MSE(LR_function):
    def __init__(self):
        super().__init__()
        self.y_pred = None
        self.mse = None

    def calculate_mse(self, x_train, y_train, x_test, y_test):
        # Train the model
        self.fit(x_train, y_train)

        # Predict on test data
        self.y_pred = self.predict(x_test)

        # Calculate Mean Squared Error
        squared_errors = (self.y_pred - y_test) ** 2
        self.mse = squared_errors.mean()
        print(f"Mean Squared Error (MSE): {self.mse}")
        return self.mse


In [None]:
import numpy as np

# Noisy training data
x_train = np.array([1, 2, 3, 4, 5])
y_train = np.array([2.2, 3.8, 6.3, 8.1, 10.2])

# Slightly off test data
x_test = np.array([6, 7, 8])
y_test = np.array([12.1, 13.9, 16.5])

# Instantiate and use MSE class
model = MSE()
model.calculate_mse(x_train, y_train, x_test, y_test)


value of m is: 2.03
value of b is: 0.02999999999999936
Mean Squared Error (MSE): 0.060200000000000135


np.float64(0.060200000000000135)

### Root Mean Squared Error (RMSE)

The **Root Mean Squared Error (RMSE)** is a commonly used metric to evaluate the accuracy of a regression model. It is the **square root** of the Mean Squared Error (MSE), which makes it easier to interpret because it is in the **same units** as the target variable.

$$
\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }
$$

#### Explanation:

- \( y_i \): Actual value for the \( i \)-th data point  
- \( \hat{y}_i \): Predicted value for the \( i \)-th data point  
- \( n \): Total number of data points  
- \( (y_i - \hat{y}_i)^2 \): Squared error for each prediction

#### Key Points:

- RMSE gives a **heavier penalty to large errors** (just like MSE), but its output is more interpretable since it has the **same units** as the original data.
- It is **sensitive to outliers**, because it squares the errors.
- A **lower RMSE** means better model performance.

RMSE is often used when large prediction errors are particularly undesirable and interpretability is important.
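The unit argument is easiest to see with numbers. In this minimal sketch every prediction is off by exactly 2 units (the data are made up and the helper name is hypothetical), so the MSE is 4 in squared units while the RMSE recovers the typical error size of 2.

```python
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [12.0, 10.0, 16.0, 14.0]   # each prediction misses by 2

mse = np.mean((np.array(y_true) - np.array(y_pred)) ** 2)  # (4+4+4+4)/4 = 4
rmse = root_mean_squared_error(y_true, y_pred)             # sqrt(4) = 2
print(mse, rmse)
```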


In [None]:
class RMSE(MSE):
    def __init__(self):
        super().__init__()
        self.rmse = None

    def calculate_RMSE(self, x_train, y_train, x_test, y_test):
        # RMSE is the square root of the MSE computed by the parent class
        mse = self.calculate_mse(x_train, y_train, x_test, y_test)
        self.rmse = np.sqrt(mse)
        return self.rmse


In [None]:
import numpy as np

# Noisy training data
x_train = np.array([1, 2, 3, 4, 5])
y_train = np.array([2.2, 3.8, 6.3, 8.1, 10.2])

# Slightly off test data
x_test = np.array([6, 7, 8])
y_test = np.array([12.1, 13.9, 16.5])

# Instantiate and use RMSE class
model = RMSE()
model.calculate_RMSE(x_train, y_train, x_test, y_test)

value of m is: 2.03
value of b is: 0.02999999999999936
Mean Squared Error (MSE): 0.060200000000000135


np.float64(0.24535688292770622)

### R² Score (Coefficient of Determination)

The **R² score**, or **coefficient of determination**, is a statistical measure that shows how well a regression model explains the variability of the target variable.

It is at most **1** and usually lies between **0 and 1**, although it can be negative (for example on held-out test data), where:
- **1** means perfect prediction
- **0** means the model does no better than simply predicting the mean
- **< 0** means the model performs worse than the mean

#### Formula:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
$$


$$
R^2 = 1 - \frac{\text{sum of squared errors of the regression}}{\text{sum of squared errors around the mean}}
$$

#### Explanation:

- \( y_i \): Actual value  
- \( \hat{y}_i \): Predicted value  
- \( \bar{y} \): Mean of the actual values  
- \( \sum (y_i - \hat{y}_i)^2 \): Residual sum of squares (model error)  
- \( \sum (y_i - \bar{y})^2 \): Total sum of squares (total variance in data)

#### Key Points:

- R² measures the proportion of the variance in the dependent variable that is **predictable from the independent variable(s)**.
- The closer R² is to **1**, the better the model explains the data.
- A **low or negative R²** means the model is not capturing the pattern in the data well.

R² is a good quick-check metric, but it should be used along with MAE, MSE, or RMSE for a complete evaluation.
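The three reference values (1, 0, and below 0) can be reproduced with a few lines. This is a minimal sketch on made-up data; `r2_from_predictions` is a hypothetical standalone helper that takes predictions directly instead of fitting a model.

```python
import numpy as np

def r2_from_predictions(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, computed from actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2_from_predictions(y_true, y_true)                      # SS_res = 0
r2_mean = r2_from_predictions(y_true, np.full(4, y_true.mean()))      # SS_res = SS_tot
r2_reversed = r2_from_predictions(y_true, y_true[::-1])               # SS_res = 20, SS_tot = 5

print(r2_perfect, r2_mean, r2_reversed)
```

Perfect predictions give 1, always predicting the mean gives 0, and the reversed predictions give 1 - 20/5 = -3, which is worse than predicting the mean.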


In [None]:
class R2(LR_function):
    def __init__(self):
        super().__init__()
        self.r2 = 0

    def calculate_r2(self, x_train, y_train, x_test, y_test):
        # Fit on the training data and print the test predictions
        self.show_predictions(x_train, y_train, x_test)
        y_pred = self.predict(x_test)

        y_mean = np.mean(y_test)
        numerator = np.sum((y_test - y_pred) ** 2)    # residual sum of squares
        denominator = np.sum((y_test - y_mean) ** 2)  # total sum of squares
        self.r2 = 1 - (numerator / denominator)
        print(f'value of r2: {self.r2}')
        return self.r2




In [None]:
x_train = np.array([1, 2, 3, 4, 5])
y_train = np.array([2, 4, 5, 4, 5])

x_test = np.array([6, 7, 8])
y_test = np.array([6, 7, 9])

# Instantiate and test R2
model = R2()
r2_score = model.calculate_r2(x_train, y_train, x_test, y_test)

value of m is: 0.6
value of b is: 2.2
y_test (predicted): [5.8 6.4 7. ]
value of r2: 0.05714285714285716


# Adjusted R² Score in Regression Analysis

## What is R² (R-squared)?
R², or the **coefficient of determination**, is a statistical measure in regression that explains how much of the variance in the dependent variable is predictable from the independent variables.

- **Range:** at most 1; between 0 and 1 for a least-squares fit with an intercept evaluated on its own training data
- **Interpretation:**
  - 0: The model explains none of the variability.
  - 1: The model explains all the variability.

However, R² on the training data **never decreases** as you add more variables, even if those variables don't actually improve the model.
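This behaviour can be demonstrated on synthetic data. The sketch below is an illustration under stated assumptions: it fits ordinary least squares with `np.linalg.lstsq` (not the `LR_function` class, which handles a single feature only), the data are randomly generated, and `training_r2` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)     # y depends on x plus noise
noise_feature = rng.normal(size=n)   # generated independently of y

def training_r2(features, y):
    """Fit least squares with an intercept and return R² on the same data."""
    design = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_one_feature = training_r2(x.reshape(-1, 1), y)
r2_two_features = training_r2(np.column_stack([x, noise_feature]), y)

print(r2_one_feature, r2_two_features)
```

The second value is never smaller than the first, because the two-feature model can always reproduce the one-feature fit by giving the extra column a coefficient of zero.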

---

## What is Adjusted R²?

**Adjusted R²** is a modified version of R² that adjusts for the number of predictors in the model. It penalizes the addition of irrelevant predictors and is more reliable for comparing models with a different number of variables.

### Formula:

$$
\text{Adjusted } R^2 = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - p - 1} \right)
$$

Where:
- \( R^2 \) = regular R-squared
- \( n \) = number of observations
- \( p \) = number of independent variables
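
The effect of the penalty can be seen by holding R² and n fixed and varying p. This is a minimal sketch with made-up inputs; `adjusted_r2_score` is a hypothetical helper that applies the formula above to values you already have.

```python
def adjusted_r2_score(r2, n, p):
    """Apply the adjusted R² formula for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("adjusted R² requires n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R² and sample size, different numbers of predictors
adj_few = adjusted_r2_score(0.9, n=50, p=5)    # 1 - 0.1 * 49 / 44
adj_many = adjusted_r2_score(0.9, n=50, p=20)  # 1 - 0.1 * 49 / 29

print(adj_few, adj_many)
```

Both results fall below the unadjusted 0.9, and the model with 20 predictors is penalized more heavily than the one with 5.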

---

## Key Differences: R² vs Adjusted R²

| Feature         | R²         | Adjusted R²         |
|----------------|------------|----------------------|
| Behaviour when variables are added | Never decreases (on training data) | Increases only if the variable improves the model enough |
| Penalizes unnecessary predictors | No     | Yes                  |
| Best for comparing models with different numbers of features | No | Yes |

---

## When to Use Adjusted R²

- When you're comparing models with **different numbers of predictors**.
- When you want to **prevent overfitting** from adding too many features.
- In **multiple linear regression** problems.

---

## Summary

- **Adjusted R²** helps in choosing a model that balances goodness-of-fit with model simplicity.
- It is a more **trustworthy metric** than R² when you add more variables.

> **Tip:** If Adjusted R² decreases when you add a variable, that variable likely isn't helping your model.



In [None]:
class adjusted_r2(R2):
    def __init__(self):
        super().__init__()
        self.adj_r2 = 0

    def calculate_adjusted_R2(self, x_train, y_train, x_test, y_test):
        # x_train is expected as a 2-D array of shape (n, p); the parent
        # classes implement simple regression, so p must be 1 here
        n, p = x_train.shape
        if n - p - 1 == 0:
            raise ValueError("Cannot calculate adjusted R² due to division by zero (n - p - 1 == 0).")

        # Flatten the single feature column so predictions stay 1-D and
        # line up element-wise with y_test instead of broadcasting to 2-D
        r2 = self.calculate_r2(x_train.ravel(), y_train, x_test.ravel(), y_test)

        self.adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
        print(f'value of adjusted_R2: {self.adj_r2}')
        return self.adj_r2


In [None]:
x_train = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # (5, 1)
y_train = np.array([2, 4, 5, 4, 5])

x_test = np.array([6, 7, 8]).reshape(-1, 1)        # (3, 1)
y_test = np.array([6, 7, 9])

# Instantiate and compute adjusted R²
model = adjusted_r2()
adjusted_r2_value = model.calculate_adjusted_R2(x_train, y_train, x_test, y_test)

value of adjusted_R2: -0.2571428571428571
