In [1]:
import pandas as pd
import numpy as np

In [5]:
# Step 1: Load and Inspect
data = pd.read_csv("/content/multiple_linear_regression_dataset.csv")

print(data.head())
print("\nColumns:", data.columns)
print("Shape:", data.shape)

   age  experience  income
0   25           1   30450
1   30           3   35670
2   47           2   31580
3   32           5   40130
4   43          10   47830

Columns: Index(['age', 'experience', 'income'], dtype='object')
Shape: (20, 3)


- Which columns are inputs -> experience and age.

- Which column is the output -> income.

- How many features does your model need to handle -> 2 features.

In [6]:
# Step 2: Separate Inputs and Output
X = data[["age", "experience"]].values
y = data["income"].values

- What is the shape of X -> (20, 2)

- What is the shape of y -> (20,)

- Why does X have 2 columns but y only one -> Because we use 2 independent variables (age and experience) to predict a single dependent variable (income).
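
The shapes above can be checked directly. The sketch below is only an illustration: it rebuilds the five rows printed by `data.head()` instead of reading the CSV, so the row count is 5 rather than 20. The name `head_rows` is made up for this example.

```python
import pandas as pd

# Only the five rows shown by data.head(); the real file has 20 rows.
head_rows = pd.DataFrame({
    "age": [25, 30, 47, 32, 43],
    "experience": [1, 3, 2, 5, 10],
    "income": [30450, 35670, 31580, 40130, 47830],
})

X = head_rows[["age", "experience"]].values  # 2-D array: one row per person, one column per feature
y = head_rows["income"].values               # 1-D array: one target value per person

print("X shape:", X.shape)
print("y shape:", y.shape)
```

Selecting with a list of column names (`[["age", "experience"]]`) keeps X two-dimensional, while selecting a single name (`["income"]`) gives a one-dimensional array.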

In [9]:
# Step 3: Initialize the Model Parameters
n = X.shape[1]
w = np.zeros(n)
b = 0.0

- Why do we need one weight per feature -> Because each feature affects income differently. One extra year of experience might be worth more money than one extra year of age. The model needs a separate multiplier for each feature to learn its distinct contribution.

- Why is bias separate -> Bias acts as the base income. It's the starting value when both age and experience are zero.

- Would initializing with large values be risky -> Yes. Large initial weights would cause massive initial predictions, leading to gigantic errors. This can cause the gradients to explode, making the model instantly unstable.
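
To make the initialization point concrete, here is a small sketch that compares the starting loss for zero weights against deliberately large weights. It uses only the five rows shown by `data.head()`, and the value `1e4` for the large weights is an arbitrary choice for illustration.

```python
import numpy as np

# Five rows from data.head(); an illustration, not the full dataset.
X = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10]], dtype=float)
y = np.array([30450, 35670, 31580, 40130, 47830], dtype=float)

def predict(X, w, b):
    return X.dot(w) + b

def mean_squared_error(y, y_pred):
    return ((y_pred - y) ** 2).mean()

w_zero = np.zeros(X.shape[1])        # the initialization used in Step 3
w_large = np.full(X.shape[1], 1e4)   # arbitrary large weights (assumption for this demo)

loss_zero = mean_squared_error(y, predict(X, w_zero, 0.0))
loss_large = mean_squared_error(y, predict(X, w_large, 0.0))

print(f"Starting loss with zero weights:  {loss_zero:.2f}")
print(f"Starting loss with large weights: {loss_large:.2f}")
```

Because the gradients are proportional to the error, the larger starting loss also means much larger first updates.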

In [8]:
# Step 4: Define the Forward Pass
def predict(X, w, b):
    return X.dot(w) + b

- Why is there no activation function -> An activation function at the output would constrain or reshape the prediction; a sigmoid, for example, squeezes it into a probability between 0 and 1. Here we want to predict an actual, unrestricted amount, so we output the raw linear calculation.

- What kind of values can y_pred take -> Any real number from -infinity to infinity.

- How is this different from logistic regression -> Logistic regression squashes this exact same linear output into a probability between 0 and 1 using a sigmoid curve. Linear regression leaves it alone to predict continuous numbers.
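
The difference from logistic regression can be sketched in a few lines. The `sigmoid` helper and the sample values in `z` are assumptions made for this illustration; they are not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    # Hypothetical helper: maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Made-up raw linear outputs, standing in for X.dot(w) + b.
z = np.array([-5.0, 0.0, 5.0])

linear_output = z             # linear regression: use the value as-is
logistic_output = sigmoid(z)  # logistic regression: squash it into a probability

print("Linear regression output:  ", linear_output)
print("Logistic regression output:", logistic_output)
```

The linear part is identical in both models; only the final squashing step differs.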

In [10]:
# Step 5: Define the Loss Function (MSE)
def mean_squared_error(y, y_pred):
    return ((y_pred - y) ** 2).mean()

- Why square the error -> Squaring makes every error positive, so a prediction that is 1000 too high cannot cancel out one that is 1000 too low. It also penalizes large mistakes much more heavily than small ones.

- What happens if one prediction is very wrong -> Because the error is squared, a single huge mistake dominates the loss and its gradient, pushing the model to correct mostly for that point on the next update.

- Why not just take absolute error -> Absolute error is V-shaped, with a kink at zero error where the derivative is undefined, which makes gradient-based optimization less convenient. MSE is a smooth bowl with a well-defined gradient everywhere.
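
A quick way to see how squaring reacts to one bad prediction is to compare MSE with mean absolute error on the same errors. The predictions below are invented for this sketch, and `mean_absolute_error` is a helper added here for comparison only.

```python
import numpy as np

def mean_squared_error(y, y_pred):
    return ((y_pred - y) ** 2).mean()

def mean_absolute_error(y, y_pred):
    # Comparison helper, not used elsewhere in the notebook.
    return np.abs(y_pred - y).mean()

y = np.array([30450.0, 35670.0, 31580.0, 40130.0])

# Invented predictions: every error is 100, except one that is off by 10000.
clean_pred = y + np.array([100.0, -100.0, 100.0, -100.0])
outlier_pred = y + np.array([100.0, -100.0, 100.0, -10000.0])

mse_ratio = mean_squared_error(y, outlier_pred) / mean_squared_error(y, clean_pred)
mae_ratio = mean_absolute_error(y, outlier_pred) / mean_absolute_error(y, clean_pred)

print(f"MSE grows by a factor of {mse_ratio:.2f}")
print(f"MAE grows by a factor of {mae_ratio:.2f}")
```

The single outlier inflates MSE far more than it inflates absolute error, which is the "heavy penalty" described above.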

In [12]:
# Step 6: Compute Gradients
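def compute_gradients(X, y, y_pred):
    N = len(y)
    error = y_pred - y               # residuals, shape (N,)
    dw = (2 / N) * X.T.dot(error)    # gradient of MSE w.r.t. w, one entry per feature
    db = (2 / N) * error.sum()       # gradient of MSE w.r.t. b, a scalar
    return dw, db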

- Why does X appear in dw but not in db -> It follows from the chain rule. In the forward pass y_pred = X.dot(w) + b, each weight is multiplied by its feature, so the derivative of y_pred with respect to a weight is that feature's value and the gradient scales with X. The bias is added as a constant, so its derivative is 1 and db is just the scaled sum of the errors.

- Why does the error term appear everywhere -> Because the error tells the model how much to change. If you are extremely wrong, you need a big update.

- What happens if error is zero -> Both gradients become zero, so the update step leaves w and b unchanged. The model then fits the training data exactly.
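
One way to gain confidence in the gradient formulas is to compare them with a numerical estimate. The sketch below uses central finite differences on the five rows from `data.head()`. The starting values for `w` and `b`, the step `eps`, and the helper `loss_at` are arbitrary choices for this check.

```python
import numpy as np

def predict(X, w, b):
    return X.dot(w) + b

def mean_squared_error(y, y_pred):
    return ((y_pred - y) ** 2).mean()

def compute_gradients(X, y, y_pred):
    N = len(y)
    dw = (2 / N) * X.T.dot(y_pred - y)
    db = (2 / N) * (y_pred - y).sum()
    return dw, db

# Five rows from data.head(); w and b are arbitrary test values.
X = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10]], dtype=float)
y = np.array([30450, 35670, 31580, 40130, 47830], dtype=float)
w = np.array([100.0, 200.0])
b = 50.0

dw, db = compute_gradients(X, y, predict(X, w, b))

def loss_at(w, b):
    # Hypothetical helper: loss as a function of the parameters only.
    return mean_squared_error(y, predict(X, w, b))

eps = 1e-3
num_dw = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    # Central difference: nudge one weight up and down, measure the loss change.
    num_dw[j] = (loss_at(w + step, b) - loss_at(w - step, b)) / (2 * eps)
num_db = (loss_at(w, b + eps) - loss_at(w, b - eps)) / (2 * eps)

print("Analytic dw: ", dw, " Numerical dw: ", num_dw)
print("Analytic db: ", db, " Numerical db: ", num_db)
```

If the analytic and numerical values agree closely, the formulas in `compute_gradients` match the loss being minimized.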

In [13]:
# Step 7: Update Parameters
def update_parameters(w, b, dw, db, lr):
    w = w - lr * dw
    b = b - lr * db
    return w, b

In [14]:
# Step 8: Training Loop
lr = 0.0001
epochs = 1000

for epoch in range(epochs):
    y_pred = predict(X, w, b)
    loss = mean_squared_error(y, y_pred)
    dw, db = compute_gradients(X, y, y_pred)
    w, b = update_parameters(w, b, dw, db, lr)

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.2f}")

Epoch 0, Loss: 1727049635.00
Epoch 100, Loss: 66491868.55
Epoch 200, Loss: 61752567.20
Epoch 300, Loss: 58616531.08
Epoch 400, Loss: 56528801.54
Epoch 500, Loss: 55126542.03
Epoch 600, Loss: 54172526.95
Epoch 700, Loss: 53511656.14
Epoch 800, Loss: 53042523.73
Epoch 900, Loss: 52698829.56


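- Does loss decrease over time -> Yes. In the run above it falls from about 1.73e9 at epoch 0 to about 5.27e7 at epoch 900. It is still dropping slowly at that point, so the model has not fully converged after 1000 epochs.

- What happens if it increases -> The learning rate is most likely too high. The model is taking steps that are too large and bouncing out of the error bowl.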

- How do learning rate and epochs interact -> If you use a very tiny learning rate, you are taking baby steps, meaning you will need a massive number of epochs to reach the goal. If you use a larger learning rate, you can get there in fewer epochs (but risk overshooting if it's too big).
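
The effect of the learning rate can be seen by running the same loop twice. This sketch uses the five rows from `data.head()` and compares the notebook's `lr = 0.0001` with a tenfold larger value. The `train` helper, the value `0.001`, and the 20-epoch length are assumptions chosen for the demonstration.

```python
import numpy as np

# Five rows from data.head(); an illustration, not the full dataset.
X = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10]], dtype=float)
y = np.array([30450, 35670, 31580, 40130, 47830], dtype=float)

def train(X, y, lr, epochs):
    # Hypothetical helper: same gradient descent steps as the notebook,
    # but it returns the loss recorded at the start of every epoch.
    w = np.zeros(X.shape[1])
    b = 0.0
    N = len(y)
    losses = []
    for _ in range(epochs):
        error = X.dot(w) + b - y
        losses.append((error ** 2).mean())
        w = w - lr * (2 / N) * X.T.dot(error)
        b = b - lr * (2 / N) * error.sum()
    return losses

low = train(X, y, lr=0.0001, epochs=20)   # the learning rate used in Step 8
high = train(X, y, lr=0.001, epochs=20)   # ten times larger (assumption for this demo)

print(f"lr=0.0001: first loss {low[0]:.2f}, last loss {low[-1]:.2f}")
print(f"lr=0.001:  first loss {high[0]:.2f}, last loss {high[-1]:.2f}")
```

On these rows the smaller rate lowers the loss while the larger one makes it grow, because the unscaled age values make the loss surface very steep in one direction.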

In [15]:
# Step 9: Final Evaluation
print(f"Final Weights: {w}")
print(f"Final Bias: {b}")

Final Weights: [ 764.75405919 1371.03430441]
Final Bias: 321.73641174472493
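
Printing the parameters alone does not show how well the model predicts. As a rough check, the sketch below plugs the final weights and bias printed above into the forward pass for the five rows from `data.head()`. Because it covers only 5 of the 20 rows, its error will not match the training loss exactly.

```python
import numpy as np

# Five rows from data.head(); the model was trained on all 20 rows.
X_head = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10]], dtype=float)
y_head = np.array([30450, 35670, 31580, 40130, 47830], dtype=float)

# Final parameters copied from the printed output above.
w = np.array([764.75405919, 1371.03430441])
b = 321.73641174472493

y_pred = X_head.dot(w) + b
rmse = np.sqrt(((y_pred - y_head) ** 2).mean())  # error in the same units as income

for actual, predicted in zip(y_head, y_pred):
    print(f"actual: {actual:.0f}  predicted: {predicted:.2f}")
print(f"RMSE on these five rows: {rmse:.2f}")
```

RMSE is used here because it is expressed in income units, which makes it easier to read than the raw MSE.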
