# Machine Learning with Multiple Linear Regression

## Topics Today:

1. Multiple Linear Regression
2. Multiple Linear Regression in StatsModels
3. Transformation of Linear Models
4. Evaluating Multiple Regression (R-squared, Adjusted R-squared)
5. Dealing with Categorical Variables
6. One-Hot Encoding
7. Interpreting One-Hot Encoded Coefficients
8. Error Metrics: MSE, MAE and RMSE
9. Practice Exercises

---

## 1. Multiple Linear Regression

Multiple linear regression helps us predict a value using more than one input variable. Think of predicting house prices using size, number of bedrooms, and location.

### Formula:

```
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
```

Where:

- `y` = predicted value
- `x1, x2,..., xn` = input variables
- `b0` = intercept
- `b1, b2,..., bn` = coefficients for each variable
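As a quick check of the formula, the sketch below evaluates it by hand for a single house. The intercept and coefficient values are illustrative assumptions, not fitted results.

```python
import numpy as np

# Illustrative (assumed) values: an intercept and one coefficient per feature
b0 = 100_000.0                   # intercept
b = np.array([50.0, 20_000.0])   # coefficients for [size, bedrooms]
x = np.array([1500.0, 3.0])      # one house: size 1500, 3 bedrooms

# y = b0 + b1*x1 + b2*x2, written as a dot product
y_hat = b0 + b @ x
print(y_hat)  # 235000.0
```

Each coefficient is the change in the prediction when its feature increases by one unit and the other features stay fixed.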

### Python Example:

In [34]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [35]:
# Sample data
X = pd.DataFrame({"size": [1000, 1500, 1800], "bedrooms": [2, 3, 4]})
y = [200000, 250000, 300000]

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)

[-1.2730274e-13  5.0000000e+04] 100000.00000000006


**Explanation:**

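- `X` is a DataFrame with one column per feature (`size`, `bedrooms`).
- `model.fit(X, y)` estimates the intercept and coefficients by least squares.
- `model.coef_` holds one coefficient per feature, in column order; `model.intercept_` holds `b0`.
- With only three rows the fit is exact: the `size` coefficient is effectively zero (`-1.27e-13` is floating-point noise), so the model reduces to `100000 + 50000 * bedrooms`.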
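To see roughly what `fit` computes, the sketch below solves the same least-squares problem with NumPy. It assumes the ordinary least-squares formulation with a column of ones for the intercept; scikit-learn's internals may differ in detail.

```python
import numpy as np

# Same sample data as the example above: columns are size and bedrooms
X = np.array([[1000, 2], [1500, 3], [1800, 4]], dtype=float)
y = np.array([200000, 250000, 300000], dtype=float)

# Prepend a column of ones so the first solved parameter is the intercept b0
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of A @ params ≈ y
params, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b_size, b_bedrooms = params
print(np.round(params, 4))
```

The solved parameters should agree with `model.intercept_` and `model.coef_` up to floating-point error.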

---

## 2. Multiple Linear Regression in StatsModels

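StatsModels fits the same ordinary least squares model but reports a full statistical summary: standard errors, t-statistics, p-values, and confidence intervals for each coefficient, along with R-squared and adjusted R-squared.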

In [36]:
X = sm.add_constant(X)  # Adds intercept
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Tue, 12 Aug 2025   Prob (F-statistic):                nan
Time:                        15:45:32   Log-Likelihood:                 58.448
No. Observations:                   3   AIC:                            -110.9
Df Residuals:                       0   BIC:                            -113.6
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const           1e+05        inf          0        n

In [37]:
R2 = model.rsquared
adj_r2 = model.rsquared_adj

R2, adj_r2

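(1.0, nan)

Adjusted R-squared is `nan` here because the model estimates three parameters (the constant plus two coefficients) from only three observations. That leaves zero residual degrees of freedom, so the fit is exact and the adjustment divides by zero.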

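---

## 3. Transformation of Linear Models

A linear model assumes a straight-line relationship between the inputs and the target. When the relationship is curved, transforming a variable (for example, taking its logarithm) can make it approximately linear so that linear regression still applies.

**Use Case:**

- When data grows exponentially (e.g. population vs. time), regressing `log(y)` on time gives a straight-line fit.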
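For the exponential-growth use case, one common approach is to fit a straight line to the logarithm of the target. The sketch below uses synthetic, noise-free data; the starting value and growth rate are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic (assumed) data: a population starting at 1000 that grows
# with a continuous rate of 0.05 per year
t = np.arange(10, dtype=float)
population = 1000.0 * np.exp(0.05 * t)

# A straight line fits log(population) because
# log(a * exp(r * t)) = log(a) + r * t
slope, intercept = np.polyfit(t, np.log(population), deg=1)

growth_rate = slope               # estimate of r
start_value = np.exp(intercept)   # estimate of a
print(growth_rate, start_value)
```

Predictions made on the log scale need `np.exp` applied to return to the original units.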

---

## 4. Regression Model Evaluation

### R-squared:

- Measures the proportion of the variance in the target that the model explains. Values closer to 1 indicate a better fit to the data used for evaluation.
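R-squared can be computed directly from its definition, `1 - SS_res / SS_tot`. The sketch below is a minimal hand-written version with made-up numbers; `r_squared` is a hypothetical helper name, not a library function.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Illustrative (made-up) values
y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(y_true, [3.0, 5.0, 7.0, 9.0])    # perfect predictions
mean_only = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])  # always predicting the mean
print(perfect, mean_only)  # 1.0 0.0
```

A model that always predicts the mean scores 0, which is why R-squared is often read as improvement over that baseline.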

### Adjusted R-squared:

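- Adjusts R-squared for the number of predictors: adding a predictor raises it only if the predictor improves the fit more than would be expected by chance. This makes it less likely to reward overfitting, but it does not prevent it.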
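The adjustment can be written as `1 - (1 - R^2) * (n - 1) / (n - p - 1)`, where `n` is the number of observations and `p` the number of predictors. The sketch below applies it to values reported in the StatsModels summaries in this document; `adjusted_r_squared` is a hypothetical helper name.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1   # residual degrees of freedom
    if dof <= 0:
        return float("nan")          # undefined without residual degrees of freedom
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof

# R-squared 0.886 with 5 observations and 2 predictors
adj = round(adjusted_r_squared(0.886, n_obs=5, n_predictors=2), 3)
print(adj)  # 0.772

# R-squared 1.0 with 3 observations and 2 predictors
degenerate = adjusted_r_squared(1.0, n_obs=3, n_predictors=2)
print(degenerate)  # nan
```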

### In Code:

In [38]:
# Sample data
data = {
    'size': [1000, 1500, 1800, 1200, 2000, 1600],
    'bedrooms': [2, 3, 4, 2, 5, 3],
    'age': [10, 15, 20, 5, 25, 10]
}

X = pd.DataFrame(data)
y = [200000, 250000, 300000, 220000, 400000, 280000]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features: fit the scaler on the training set only,
# then reuse the same scaling for the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions on the training data (the scores below are therefore optimistic)
y_pred = model.predict(X_train)

# Model evaluation
mse = mean_squared_error(y_train, y_pred)
r2 = r2_score(y_train, y_pred)

# Output results
print(y_pred)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R-squared:", r2)


[280000. 300000. 400000. 220000.]
Coefficients: [ -59160.797831    245967.47752498 -126491.10640674]
Intercept: 300000.0
Mean Squared Error: 2.541098841762901e-21
R-squared: 1.0
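The scaling step can be written out by hand to show why the scaler should learn its parameters from the training rows only. The sketch below assumes standardization as `(x - mean) / std` with the population standard deviation, which is what `StandardScaler` computes; the feature values are made up.

```python
import numpy as np

# Illustrative (made-up) feature matrix: columns are size and bedrooms
X_train = np.array([[1000, 2], [1500, 3], [2000, 5], [1200, 2]], dtype=float)
X_test = np.array([[1800, 4], [1600, 3]], dtype=float)

# Learn the scaling parameters from the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)   # population standard deviation (ddof=0)

# Apply the same parameters to both sets
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 per column
print(X_test_scaled)  # not centred at 0: it reuses the training mean and std
```

Refitting the scaler on the test set would put the two sets on different scales, so the coefficients learned on the training data would no longer apply to the test features.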


---

## 5. Dealing with Categorical Variables

Regression models work on numbers, so text categories such as "red" or "blue" must be converted to numeric columns before fitting.

In [39]:
df = pd.DataFrame({"color": ["red", "blue", "red"]})
df

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | red   |


In [40]:
df = pd.DataFrame({"color": ["red", "blue", "red"], "value": [1, 2, 3]})
df

|   | color | value |
|---|-------|-------|
| 0 | red   | 1     |
| 1 | blue  | 2     |
| 2 | red   | 3     |


---

## 6. One-Hot Encoding

In [41]:
df_encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
df_encoded

|   | value | color_red |
|---|-------|-----------|
| 0 | 1     | True      |
| 1 | 2     | False     |
| 2 | 3     | True      |


**Result:**

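- Replaces `color` with a single indicator column, `color_red`. The values are `True`/`False` here; append `.astype(int)` to get 0/1.
- `drop_first=True` drops the first category (`blue`), which becomes the baseline: a row with `color_red` equal to `False` is blue.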
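The encoding itself is simple enough to write by hand. The sketch below is not how pandas implements `get_dummies`; `one_hot_drop_first` is a hypothetical helper, and it assumes categories are ordered alphabetically, which matches the output above.

```python
def one_hot_drop_first(values, prefix="color"):
    """Return (column_names, rows) of 0/1 indicators, dropping the first category."""
    categories = sorted(set(values))   # e.g. ['blue', 'red']
    kept = categories[1:]              # the first category becomes the baseline
    names = [f"{prefix}_{c}" for c in kept]
    rows = [[int(v == c) for c in kept] for v in values]
    return names, rows

names, rows = one_hot_drop_first(["red", "blue", "red"])
print(names)  # ['color_red']
print(rows)   # [[1], [0], [1]]
```

With `k` categories this produces `k - 1` columns; a row of all zeros identifies the baseline category.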

---

## 7. Interpreting Encoded Coefficients

Each binary column's coefficient tells how much that category increases or decreases the prediction compared to the baseline (the dropped category).



In [42]:
data = {
    'Color': ['red', 'blue', 'green', 'red', 'blue'],
    'Satisfaction': [20, 15, 10, 25, 12]
}

df = pd.DataFrame(data)

# One-hot encoding; drop_first=True drops 'blue' (first alphabetically),
# which makes blue the baseline category
df_encoded = pd.get_dummies(df, columns=['Color'], drop_first=True).astype(int)

# Fit the model
X = df_encoded.drop('Satisfaction', axis=1)
y = df_encoded['Satisfaction']

X = sm.add_constant(X)  # Add the constant column for the intercept

model = sm.OLS(y, X).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:           Satisfaction   R-squared:                       0.886
Model:                            OLS   Adj. R-squared:                  0.772
Method:                 Least Squares   F-statistic:                     7.776
Date:                Tue, 12 Aug 2025   Prob (F-statistic):              0.114
Time:                        15:45:32   Log-Likelihood:                -10.154
No. Observations:                   5   AIC:                             26.31
Df Residuals:                       2   BIC:                             25.14
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const          13.5000      2.062      6.548      
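When the only predictors are the dummy columns of one categorical variable, the OLS intercept equals the baseline group's mean and each coefficient equals that group's mean minus the baseline mean. The sketch below recomputes those quantities from the same data in plain Python; the baseline value agrees with the `const` of 13.5 in the summary above.

```python
colors = ['red', 'blue', 'green', 'red', 'blue']
satisfaction = [20.0, 15.0, 10.0, 25.0, 12.0]

# Average satisfaction per color
group_mean = {}
for c in sorted(set(colors)):
    scores = [s for col, s in zip(colors, satisfaction) if col == c]
    group_mean[c] = sum(scores) / len(scores)

baseline = group_mean['blue']   # 'blue' is the category dropped by drop_first=True
print("const       :", baseline)                       # 13.5
print("Color_green :", group_mean['green'] - baseline)  # -3.5
print("Color_red   :", group_mean['red'] - baseline)    # 9.0
```

Read this way, a coefficient of 9.0 for `Color_red` means red items score 9 points higher on average than blue ones.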

---

## 8. Error Metrics: MSE, MAE and RMSE

### Formulas:

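```
MAE  = (1/n) * Σ |yi - ŷi|
MSE  = (1/n) * Σ (yi - ŷi)^2
RMSE = sqrt(MSE) = sqrt( (1/n) * Σ (yi - ŷi)^2 )
```

Where `yi` is the actual value, `ŷi` is the predicted value, and `n` is the number of observations.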

### Python Equivalent

- Model predictions: $1.20, $2.00, $2.50
- Actual prices: $1.00, $2.10, $2.40

In [43]:
# Actual values
y_actual = np.array([1.00, 2.10, 2.40])

# Predicted values
y_predicted = np.array([1.20, 2.00, 2.50])

# MAE: mean of the absolute errors
mae = np.mean(np.abs(y_actual - y_predicted))
print(f"Mean Absolute Error: {mae:.4f}")

# MSE: mean of the squared errors
mse = np.mean((y_actual - y_predicted) ** 2)
print(f"Mean Squared Error: {mse:.4f}")

# RMSE: square root of the MSE
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse:.4f}")


Mean Absolute Error: 0.1333
Mean Squared Error: 0.0200
Root Mean Squared Error: 0.1414


In [44]:
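# The same metrics using scikit-learn
y_true = [3, -0.5, 2]
y_pred = [2.5, 0.0, 2]

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"MAE: {mae:.4f}, RMSE: {rmse:.4f}")  # MAE: 0.3333, RMSE: 0.4082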
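One consequence of the formulas in this section is that RMSE penalizes a few large errors more heavily than MAE does, because errors are squared before averaging. The sketch below illustrates this with two made-up sets of residuals in plain Python; the helper names are hypothetical.

```python
import math

def mae_of(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse_of(errors):
    """Square root of the mean squared error."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Two made-up sets of residuals with the same total absolute error
even_errors = [1.0, 1.0, 1.0, 1.0]
one_big_error = [0.0, 0.0, 0.0, 4.0]

print(mae_of(even_errors), rmse_of(even_errors))      # 1.0 1.0
print(mae_of(one_big_error), rmse_of(one_big_error))  # 1.0 2.0
```

If occasional large misses are especially costly, RMSE is usually the more informative of the two; if all errors matter equally, MAE is easier to interpret.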

---

## 9. Practice Exercises

1. Create a linear regression model using two or more features.
2. Use StatsModels to evaluate the regression with `.summary()`.
3. Try applying a log transformation on one variable.
4. Encode a categorical variable using one-hot encoding.
5. Interpret the adjusted R-squared from your model.
6. Calculate MAE and RMSE of your model predictions.

---

## Summary

- Multiple linear regression is a foundational ML technique.
- StatsModels gives a deeper look into the statistics behind the model (standard errors, p-values, confidence intervals).
- One-hot encoding is crucial for categorical variables.
- R-squared, MAE, and RMSE help assess model performance.

Next step: Practice with real datasets and interpret your results statistically and practically.