# Multiple Regression

YT Video - https://www.youtube.com/watch?v=zITIFTsivN8&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=11

Multiple regression uses more than one variable to predict the target. Instead of fitting a line, we now fit a plane (with two variables) or a "higher-dimensional object" (with more than two). Use multiple regression when you think **multiple factors** might influence an outcome.
* If we add another variable to our model, the equation just gets a little longer: `Body Length = y-intercept + slope1 * Mouse Weight + slope2 * Tail Length`

Let's add `tail_length` to our data and build a new model using the same scikit-learn component, **LinearRegression**.



In [2]:
import numpy as np 
from sklearn.linear_model import LinearRegression

# Make data with two independent variables (Mouse Weight and Tail Length) and one dependent variable (Body Length)
mouse_weight = np.array([18, 20, 22, 24, 26, 28, 30, 32])
tail_length = np.array([7.0, 7.2, 7.5, 7.8, 8.0, 8.1, 8.3, 8.5])
body_length = np.array([42, 44, 45, 48, 50, 51, 52, 55])

# Combine two features into a single 2D array
X_multiple = np.c_[mouse_weight, tail_length]

# Create and fit the multiple regression model
multiple_model = LinearRegression()
multiple_model.fit(X_multiple, body_length)

# New results 
print("\n--- Multiple Regression ---")
print(f"Intercept: {multiple_model.intercept_:.2f}")
print(f"Coefficients (for Mouse Weight and Tail Length): {multiple_model.coef_}")
print(f"Equation: Body Length = {multiple_model.intercept_:.2f} + ({multiple_model.coef_[0]:.2f} * Mouse_Weight) + ({multiple_model.coef_[1]:.2f} * Tail_Length)")


--- Multiple Regression ---
Intercept: 6.12
Coefficients (for Mouse Weight and Tail Length): [0.48484848 3.86363636]
Equation: Body Length = 6.12 + (0.48 * Mouse_Weight) + (3.86 * Tail_Length)
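
To make the fitted equation concrete, here is a minimal sketch that plugs one mouse into it by hand and compares the result with `.predict()`. The values in `new_weight` and `new_tail` are made up for illustration and are not part of the original data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data as in the cell above
mouse_weight = np.array([18, 20, 22, 24, 26, 28, 30, 32])
tail_length = np.array([7.0, 7.2, 7.5, 7.8, 8.0, 8.1, 8.3, 8.5])
body_length = np.array([42, 44, 45, 48, 50, 51, 52, 55])

X_multiple = np.c_[mouse_weight, tail_length]
multiple_model = LinearRegression().fit(X_multiple, body_length)

# Hypothetical new mouse (illustrative values, not from the dataset)
new_weight, new_tail = 25.0, 7.9

# Prediction written out from the equation:
# intercept + slope1 * Mouse Weight + slope2 * Tail Length
manual_prediction = (
    multiple_model.intercept_
    + multiple_model.coef_[0] * new_weight
    + multiple_model.coef_[1] * new_tail
)

# The same prediction from scikit-learn (expects a 2D array: one row per mouse)
sklearn_prediction = multiple_model.predict(np.array([[new_weight, new_tail]]))[0]

print(f"By hand:    {manual_prediction:.2f}")
print(f".predict(): {sklearn_prediction:.2f}")
```

Both lines should print the same number, since `.predict()` evaluates exactly this equation.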


### Measuring the fit: R-Squared (R²)

R-squared tells you how much of the variation in the target (Body Length) is explained by your model. A value of 1.0 is a perfect fit. The good news is that R² is calculated exactly the same way for both simple and multiple regression: you're still comparing how much better your fit is than just guessing the average body length every time.

**scikit-learn component**: `r2_score`
* To get this value, scikit-learn gives us a handy function in its metrics module called `r2_score`. Alternatively, you can just call the `.score()` method directly on your fitted model.
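
The "better than guessing the average" idea can be written out directly. The sketch below computes R² by hand from the two sums of squares and checks it against `r2_score` and `.score()`. The names `ss_mean`, `ss_fit` and `r2_by_hand` are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

mouse_weight = np.array([18, 20, 22, 24, 26, 28, 30, 32])
body_length = np.array([42, 44, 45, 48, 50, 51, 52, 55])

# Fit a simple model on mouse weight only
X = mouse_weight.reshape(-1, 1)
model = LinearRegression().fit(X, body_length)
predictions = model.predict(X)

# Variation around the mean (the "just guess the average" baseline)
ss_mean = np.sum((body_length - body_length.mean()) ** 2)
# Variation left over around the fitted line
ss_fit = np.sum((body_length - predictions) ** 2)

# R² = fraction of the baseline variation that the fit removes
r2_by_hand = (ss_mean - ss_fit) / ss_mean

# The two scikit-learn routes to the same number
r2_from_metric = r2_score(body_length, predictions)  # takes (y_true, y_pred)
r2_from_model = model.score(X, body_length)          # takes (X, y_true)

print(f"By hand:   {r2_by_hand:.4f}")
print(f"r2_score:  {r2_from_metric:.4f}")
print(f".score():  {r2_from_model:.4f}")
```

Note the different arguments: `r2_score` compares true values with predictions, while `.score()` takes the features and makes the predictions itself.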

In [3]:
from sklearn.metrics import r2_score

# Reshape the single feature into a 2D array
X_simple = mouse_weight.reshape(-1, 1)
simple_model = LinearRegression()
simple_model.fit(X_simple, body_length)
simple_r2 = simple_model.score(X_simple, body_length)

# Multiple model's R-squared
multiple_r2 = multiple_model.score(X_multiple, body_length)

# Print results
print(f"Simple Model R-squared (mouse weight): {simple_r2:.2f}")
print(f"Multiple Model R-squared (mouse weight and tail length): {multiple_r2:.2f}")

Simple Model R-squared (mouse weight): 0.98
Multiple Model R-squared (mouse weight and tail length): 0.99


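Notice the R² value went up when we added Tail Length, so the new model explains slightly more of the variation in Body Length. On its own that is not proof that tail length helps: with least squares, R² on the data used for fitting never goes down when you add a variable, so we need a way to check whether the gain is bigger than chance would give.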
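
A higher R² alone should be read with care, because a least-squares fit can always ignore a useless variable by giving it a slope of zero. As a rough illustration, the sketch below adds a column of pure random numbers (a made-up `random_noise` feature, not part of the original data) and compares R² before and after.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

mouse_weight = np.array([18, 20, 22, 24, 26, 28, 30, 32])
body_length = np.array([42, 44, 45, 48, 50, 51, 52, 55])

# Hypothetical feature that has nothing to do with body length
rng = np.random.default_rng(0)
random_noise = rng.normal(size=len(body_length))

X_simple = mouse_weight.reshape(-1, 1)
X_noise = np.c_[mouse_weight, random_noise]

# R² on the training data, with and without the noise column
r2_simple = LinearRegression().fit(X_simple, body_length).score(X_simple, body_length)
r2_noise = LinearRegression().fit(X_noise, body_length).score(X_noise, body_length)

print(f"Weight only:           {r2_simple:.4f}")
print(f"Weight + random noise: {r2_noise:.4f}")
```

The second value is never lower than the first, even though the extra column carries no real information.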

# The F-test

With the F-test we can directly compare the simple model to the multiple model to see whether adding `tail_length` gave us a statistically significant improvement.
* It is a statistical test of whether the improvement in R² between a simple model and a more complex model is big enough to be meaningful, or whether it could have happened by random chance.
* Instead of comparing our model to the `mean`, we now compare the **multiple regression** model to the **simple regression** model. We replace the mean terms in the F-value equation with the simple regression terms.
* `F = (Improvement in Fit) / (Remaining Unexplained Variation)`
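
Written out with the quantities used in this comparison, the formula takes the form below. Here $SS$ is the sum of squared residuals of a model, $p$ is its number of parameters (including the intercept), and $n$ is the number of data points. The subscript names are labels chosen here for illustration.

$$
F = \frac{\left(SS_{\text{simple}} - SS_{\text{multiple}}\right) / \left(p_{\text{multiple}} - p_{\text{simple}}\right)}{SS_{\text{multiple}} / \left(n - p_{\text{multiple}}\right)}
$$

The numerator is the improvement in fit per extra parameter. The denominator is the variation the multiple model still leaves unexplained, per remaining degree of freedom.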

In [None]:
from scipy.stats import f

# Calculate the Sum of Squared errors for both models
ss_simple = np.sum((body_length - simple_model.predict(X_simple))**2)
ss_multiple = np.sum((body_length - multiple_model.predict(X_multiple))**2)

# Simple model has 2 parameters: intercept and slope for weight
p_simple = 2
# Multiple model has 3: intercept, slope for weight, slope for tail length
p_multiple = 3
# Number of data points
n = len(body_length)

# Calculate the F-statistic using the formula from the video
numerator = (ss_simple - ss_multiple) / (p_multiple - p_simple)
denominator = ss_multiple / (n - p_multiple)
f_statistic = numerator / denominator

# Calculate the p-value from the F-statistic (sf is the survival function, 1 - cdf)
p_value = f.sf(f_statistic, dfn=(p_multiple - p_simple), dfd=(n - p_multiple))

print("\n--- Model Comparison F-test ---")
print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nThe p-value is small! Adding Tail Length to the model was worth the trouble.")
else:
    print("\nThe p-value is not small. Adding Tail Length did not significantly improve the model.")


--- Model Comparison F-test ---
F-statistic: 1.3917
P-value: 0.2912

The p-value is not small. Adding Tail Length did not significantly improve the model.


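If the p-value is small (traditionally less than 0.05), the improvement we saw in R² is statistically significant, and we can be confident that adding the tail length data made our model meaningfully better. Here the p-value is 0.29, so the small increase in R² from adding tail length is consistent with random chance, and for this data the simple model does about as well.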