## **Linear Regression From Scratch**
Student: Lucca de Sena Barbosa

Degree program: Computer Science - 2025.2

- **Case study**: Forecast an employee's salary based on their age;
- **Objective**: Understand how linear regression models work by building one from scratch;

### **1. Importing the necessary libraries;**

In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression

### **2. Creating a Simple Dataset;**

In [2]:
quantity = 1000

age = np.random.randint(18, 81, quantity)

salary = 800 + (age - 18) * (20000 - 800) / (80 - 18)  
noise = np.random.normal(0, 1500, quantity)  
salary = np.clip(salary + noise, 800, 20000) 

# Create the DataFrame
dataframe = pd.DataFrame({
    "Age": age,
    "Salary": salary.astype(int)
})

In [3]:
dataframe

|     | Age | Salary |
| --- | --- | --- |
| 0 | 79 | 18459 |
| 1 | 77 | 19016 |
| 2 | 50 | 8071 |
| 3 | 36 | 3850 |
| 4 | 69 | 14806 |
| ... | ... | ... |
| 995 | 39 | 8726 |
| 996 | 21 | 800 |
| 997 | 41 | 7396 |
| 998 | 49 | 10994 |


### **3. Creating the model:**

#### **3.1 Simple Linear Regression (y):**

$$
y = m \cdot x + b
$$

- We need to determine the **slope (m)** and the **intercept (b)** of the line that represents the model. In the code these are named `inclination` and `interception`.
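To make the roles of the two parameters concrete, here is a minimal sketch that evaluates the line for a few ages. The helper `line` and the values of `m` and `b` are illustrative assumptions, not the fitted results of this notebook.

```python
def line(x, m, b):
    """Evaluate y = m * x + b for a single input x."""
    return m * x + b

# Hypothetical parameters, chosen only for illustration.
m = 300.0    # slope: change in salary for each additional year of age
b = -4400.0  # intercept: value of y when x = 0

for age in (20, 40, 60):
    print(age, line(age, m, b))
```

Increasing the age by one year changes the prediction by exactly `m`, and `b` only shifts the whole line up or down.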

---

#### **3.2 Slope (m):**

$$
m = \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}
$$

**3.2.1 Covariance (Cov):**

$$
\operatorname{Cov}(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}
$$


**3.2.2 Variance (Var):**

$$
\operatorname{Var}(X) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
$$
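The two formulas above can be checked by hand on a very small dataset. The sketch below is one possible pure-Python implementation; the function names and the four data points are assumptions made for illustration, and the variance is cross-checked against `statistics.variance` from the standard library, which also uses the `n - 1` denominator.

```python
import math
import statistics

def covariance(xs, ys):
    """Sample covariance of two equally long sequences (n - 1 denominator)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)

def variance(xs):
    """Sample variance of a sequence (n - 1 denominator)."""
    n = len(xs)
    x_mean = sum(xs) / n
    return sum((x - x_mean) ** 2 for x in xs) / (n - 1)

# Hypothetical toy data: four ages and four salaries.
ages = [20, 30, 40, 50]
salaries = [1500, 4000, 8000, 10500]

print(covariance(ages, salaries))  # sum of products 155000, divided by 3
print(variance(ages))              # sum of squares 500, divided by 3

# Cross-check against the standard library.
print(math.isclose(variance(ages), statistics.variance(ages)))
```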

---

#### **3.3 Intercept (b):**

$$
b = \bar{y} - m \cdot \bar{x}
$$

---

In [50]:
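def mean(values):
    """
    Calculate the arithmetic mean of a sequence of numbers.
    """
    n = len(values)  # length of the input, not of the global dataframe
    total = 0  # avoids shadowing the built-in sum()
    for i in range(n):
        total += values[i]

    return total / n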

class SimpleRegression:
    def __init__(self):
        # Initializing the model parameters:
        self.m = None # Line Inclination (m)
        self.b = None # Interception (b)

    def inclination(self, x, y):
        """
        Calculate the inclination of the regression line based on the dataframe.
                                    m = Cov(X,Y) / Var(x)
        """
        xMean = mean(x)
        yMean = mean(y)
        n = len(x)

        sum_squared_x = 0
        sum_covariance = 0

        for i in range(x.shape[0]):
            sum_squared_x += ((x[i] - xMean)**2)
            sum_covariance += ((x[i] - xMean))*((y[i] - yMean))
        
        VarX = sum_squared_x / (n - 1)
        CovXY = sum_covariance / (n - 1)

        return CovXY / VarX

    def interception(self, xMedia, yMedia, m):
        b = yMedia - (m*xMedia) 
        return b

    def print_value(self):
        print(f"{self.m}, {self.b}")
        

    def fit(self, x, y):
        self.m = self.inclination(x, y)
        self.b = self.interception(mean(x), mean(y), self.m)

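    def predict(self, value):
        """
        Predict y for a given x using the fitted line y = m*x + b.
        """
        # Fail loudly instead of printing a message and returning None.
        if self.m is None or self.b is None:
            raise ValueError("The model must be fitted with fit() before calling predict().")

        return (self.m * value) + self.b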

x = dataframe.iloc[:, 0].values
y = dataframe.iloc[:, 1].values

model = LinearRegression()
model.fit(x.reshape(-1, 1), y)
y_predict = model.predict(x.reshape(-1, 1))


In [51]:
model1 = SimpleRegression()
model1.fit(x, y)
model2 = LinearRegression()
model2.fit(x.reshape(-1, 1), y)

LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)


In [52]:
model1.print_value()

301.9983641694407, -4420.380780505206


In [29]:
model2.coef_, model2.intercept_

(array([301.99836417]), np.float64(-4420.380780505207))
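The from-scratch model and scikit-learn agree up to floating-point rounding. The same computation can also be written without explicit loops. The following is a sketch under assumptions: `fit_line` is a hypothetical helper, the data is regenerated with a fixed seed (so the numbers differ from the outputs above), and `np.polyfit` is used as an independent reference.

```python
import numpy as np

def fit_line(x, y):
    """Return (m, b) for y = m * x + b using vectorized NumPy operations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    # The (n - 1) denominators cancel, so dot products of the deviations suffice.
    m = (x_dev @ y_dev) / (x_dev @ x_dev)
    b = y.mean() - m * x.mean()
    return m, b

# Synthetic data with an assumed true slope of 300 and intercept of -4400.
rng = np.random.default_rng(0)
x = rng.integers(18, 81, size=1000)
y = 300.0 * x - 4400.0 + rng.normal(0, 1500, size=1000)

m, b = fit_line(x, y)
m_ref, b_ref = np.polyfit(x, y, deg=1)  # coefficients, highest degree first
print(m, b)
print(np.isclose(m, m_ref), np.isclose(b, b_ref))
```

On arrays of this size the vectorized version avoids the Python-level loop, while the loop-based class remains easier to read next to the formulas.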

### **4. Visualizing the model:**

In [14]:
fig = go.Figure()

# Scatter plot of the actual records;
fig.add_trace(go.Scatter(
    x=x, 
    y=y, 
    mode='markers', 
    name='Dataset',
    marker=dict(color='blue')
))

# Line plot of the values predicted by the model;
fig.add_trace(go.Scatter(
    x=x, 
    y=y_predict, 
    mode='lines',  # line only
    name='Model',
    line=dict(color='red', width=3)
))

# Adjusting the layout
fig.update_layout(
    title='Simple Linear Regression',
    xaxis_title='Age',
    yaxis_title='Salary',
    showlegend=True,
    template='plotly_dark' 
)

fig.show()

### **5. Evaluating the model's performance:**

In [15]:
def r_squared(y_true, y_predict):
    sumPredict = 0
    yMean = mean(y_true)
    sumMeanY = 0

    for i in range(len(y_true)):
        sumPredict += ((y_true[i] - y_predict[i])**2)
        sumMeanY += ((y_true[i] - yMean)**2)
    
    return 1 - (sumPredict / sumMeanY)

def rmse(y_true, y_predict):
    sumPredict = 0
    n = len(y_true)

    for i in range(len(y_true)):
        sumPredict += ((y_true[i] - y_predict[i])**2)
    
    return (sumPredict / n)**0.5

print(f"R squared: {r_squared(y_true=y, y_predict=y_predict)}\nRoot Mean Squared Error: {rmse(y_true=y, y_predict=y_predict)}")

R squared: 0.9401355222732326
Root Mean Squared Error: 1397.379717750068


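To see how the two metrics behave, it helps to evaluate them on a dataset small enough to check by hand. The sketch below re-implements both metrics compactly and applies them to four hypothetical points and the line `y = 310 * x - 4850`. The data and the baseline comparison are assumptions added for illustration.

```python
import math

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Square root of the mean squared residual, in the units of y."""
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(ss_res / len(y_true))

# Hypothetical toy data and the predictions of the line y = 310 * x - 4850.
ages = [20, 30, 40, 50]
y_true = [1500, 4000, 8000, 10500]
y_pred = [310 * x - 4850 for x in ages]  # [1350, 4450, 7550, 10650]

print(r_squared(y_true, y_pred))  # 1 - 450000 / 48500000
print(rmse(y_true, y_pred))       # sqrt(450000 / 4)

# Baseline: always predicting the mean of y gives R squared = 0.
baseline = [sum(y_true) / len(y_true)] * len(y_true)
print(r_squared(y_true, baseline))  # 0.0
```

R squared compares the model with that mean-only baseline, so a value near 1 means most of the variation in salary is explained by age, while RMSE reports the typical size of the error in salary units.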