
# Regression Analysis with Boston Housing Dataset

This notebook covers different regression techniques using the **Boston Housing Dataset**.  
We will go step by step, starting with **Simple Linear Regression**, then **Multiple Linear Regression**, **Polynomial Regression**, and finally **Regularized Regression (Ridge, Lasso, ElasticNet)**.  

At the end, we will also cover **Evaluation Metrics** (MSE, RMSE, MAE, R² score) and the **assumptions** to check before interpreting regression results.



## Step 1: Load the Dataset

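The **Boston Housing dataset** used to ship with `sklearn.datasets` as `load_boston`, but that loader was removed in scikit-learn 1.2, so we read a CSV copy from GitHub instead.  
Each row describes a Boston-area census tract, and the target is the median home value in thousands of dollars.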


In [None]:
import pandas as pd
import numpy as np

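# Load the Boston Housing dataset from a CSV copy hosted on GitHub
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
df = pd.read_csv(url)

# This copy is assumed to use lowercase column names with `medv` as the target.
# Normalize to the uppercase names used in later cells (RM, PRICE, ...).
df.columns = df.columns.str.upper()
df = df.rename(columns={'MEDV': 'PRICE'})

# Display the first few rows
df.head()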




## Step 2: Simple Linear Regression

We start with **one independent variable** (e.g., RM = average number of rooms per dwelling) to predict the house price.

**Steps:**
1. Select `RM` as the independent variable (X) and `PRICE` as the dependent variable (y).
2. Fit a linear regression model.
3. Plot the regression line.
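
As a rough illustration of what step 2 computes, the sketch below fits a one-feature model on synthetic data (not the Boston data) and compares the closed-form least-squares slope and intercept with what `LinearRegression` returns. The variable names and the synthetic data are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for a single predictor and a target (not the Boston data)
rng = np.random.default_rng(0)
x = rng.uniform(4, 9, size=200)
y_demo = 9.0 * x - 34.0 + rng.normal(0, 2.0, size=200)

# Closed-form ordinary least squares for one feature:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.cov(x, y_demo, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y_demo.mean() - slope * x.mean()

# The same fit through scikit-learn
demo_model = LinearRegression().fit(x.reshape(-1, 1), y_demo)

print("closed form :", slope, intercept)
print("scikit-learn:", demo_model.coef_[0], demo_model.intercept_)
```

Both lines should print matching values up to floating-point rounding, since `LinearRegression` solves the same least-squares problem.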


In [None]:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

X = df[['RM']]
y = df['PRICE']

lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Predictions
y_pred = lin_reg.predict(X)

# Visualization
plt.scatter(X, y, alpha=0.5)
plt.plot(X, y_pred, color='red', linewidth=2)
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('House Price')
plt.title('Simple Linear Regression (RM vs Price)')
plt.show()




## Step 3: Multiple Linear Regression

Now we use **all available features** to predict house prices.

**Steps:**
1. Use all columns except PRICE as independent variables (X).
2. Fit a linear regression model.
3. Generate predictions, which are evaluated with metrics in Step 6.


In [None]:

X = df.drop('PRICE', axis=1)
y = df['PRICE']

mlr = LinearRegression()
mlr.fit(X, y)

y_pred_mlr = mlr.predict(X)
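
The cell above predicts on the same rows the model was trained on, which tends to give optimistic scores. A common alternative is to hold out a test set. The sketch below shows the pattern on synthetic data from `make_regression`; the split ratio, seeds, and variable names are illustrative assumptions, and on the notebook's data `X_demo` and `y_demo` would be replaced by `X` and `y`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a multi-feature regression problem
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=10.0, random_state=0
)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training rows only
model = LinearRegression().fit(X_train, y_train)

# Score separately on rows the model has and has not seen
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print("train R2:", train_r2)
print("test  R2:", test_r2)
```

A large gap between the two scores would be a hint that the model is overfitting the training rows.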



## Step 4: Polynomial Regression

Sometimes relationships are **non-linear**.  
Polynomial regression allows us to capture such relationships by including powers of the features.

**Steps:**
1. Use `RM` again as the predictor.
2. Transform it into polynomial features (degree=2).
3. Fit a regression model.
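
To make step 2 concrete, the sketch below shows what `PolynomialFeatures(degree=2)` produces for a tiny single-column input (toy values, not taken from the dataset). The first output column is a constant 1. Because `LinearRegression` fits its own intercept, passing `include_bias=False` drops that column without changing the predictions.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy single-feature input (values chosen only for illustration)
x_toy = np.array([[2.0], [3.0]])

# Default: columns are [1, x, x^2]
with_bias = PolynomialFeatures(degree=2).fit_transform(x_toy)
print(with_bias)
# [[1. 2. 4.]
#  [1. 3. 9.]]

# Without the constant column: [x, x^2]
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x_toy)
print(no_bias)
# [[2. 4.]
#  [3. 9.]]
```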


In [None]:

from sklearn.preprocessing import PolynomialFeatures

# include_bias=False: LinearRegression already fits the intercept term
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['RM']])

lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)

y_poly_pred = lin_reg2.predict(X_poly)

plt.scatter(df['RM'], y, alpha=0.5)
plt.scatter(df['RM'], y_poly_pred, color='red', s=10)
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('House Price')
plt.title('Polynomial Regression (Degree=2)')
plt.show()



## Step 5: Regularized Regression

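Regularization helps prevent **overfitting** by adding a penalty on the size of the coefficients to the least-squares loss.

- **Ridge Regression:** L2 penalty (sum of squared coefficients); shrinks coefficients toward zero but rarely to exactly zero.
- **Lasso Regression:** L1 penalty (sum of absolute coefficients); can shrink some coefficients to exactly 0, which acts as feature selection.
- **ElasticNet Regression:** A weighted combination of the L1 and L2 penalties, controlled by `l1_ratio`.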
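
Because all three penalties act on coefficient magnitudes, they are sensitive to feature scale, so it is common to standardize features first. The following sketch compares how many coefficients each model sets to exactly zero on synthetic data. The `alpha` values, the data, and the pipeline layout are assumptions for illustration, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which actually drive the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=1.0),
    "elasticnet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

zero_counts = {}
for name, estimator in models.items():
    # Standardize inside the pipeline so the penalty treats features equally
    pipe = make_pipeline(StandardScaler(), estimator).fit(X_demo, y_demo)
    coefs = pipe[-1].coef_
    # Count coefficients that were shrunk to exactly zero
    zero_counts[name] = int(np.sum(coefs == 0))
    print(name, np.round(coefs, 2))

print(zero_counts)
```

On data like this, Ridge typically keeps every coefficient non-zero, while Lasso tends to zero out the uninformative features.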


In [None]:

from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)
ridge.fit(X, y)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X, y)



## Step 6: Evaluation Metrics

We evaluate regression models using:

- **Mean Squared Error (MSE):** Average of squared errors.
- **Root Mean Squared Error (RMSE):** Square root of MSE, expressed in the same units as the target, which makes it easier to interpret.
- **Mean Absolute Error (MAE):** Average of absolute errors.
- **R² Score:** Proportion of variance explained by the model.
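
The definitions above can be checked by hand on a few numbers. This sketch computes each metric with plain NumPy on toy values (made up for the example) and compares the results with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values chosen so the arithmetic is easy to follow
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat                          # [ 1.  0. -1. -2.]

mse = np.mean(errors ** 2)                       # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                              # about 1.2247
mae = np.mean(np.abs(errors))                    # (1 + 0 + 1 + 2) / 4 = 1.0
ss_res = np.sum(errors ** 2)                     # 6.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 20.0
r2 = 1 - ss_res / ss_tot                         # 1 - 6/20 = 0.7

print("MSE :", mse, "vs sklearn", mean_squared_error(y_true, y_hat))
print("RMSE:", rmse)
print("MAE :", mae, "vs sklearn", mean_absolute_error(y_true, y_hat))
print("R2  :", r2, "vs sklearn", r2_score(y_true, y_hat))
```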


In [None]:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

mse = mean_squared_error(y, y_pred_mlr)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y, y_pred_mlr)
r2 = r2_score(y, y_pred_mlr)

print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae)
print("R² Score:", r2)



## Step 7: Verify Assumptions Before Interpreting Results

Linear regression makes several assumptions:

1. **Linearity:** Relationship between predictors and response is linear.
2. **Independence:** Residuals are independent.
3. **Homoscedasticity:** Constant variance of residuals.
4. **Normality of Residuals:** Errors should be normally distributed.

We use a residuals-versus-predicted plot to check linearity and homoscedasticity, and a histogram of the residuals to check normality.
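
Plots can be complemented with simple numeric summaries. The sketch below is one possible approach, not a standard recipe: `residual_checks` is a hypothetical helper, SciPy is an extra dependency not used elsewhere in this notebook, and the data are synthetic with noise that deliberately grows with the predictor. It runs a Shapiro-Wilk test for normality and uses the Spearman correlation between absolute residuals and fitted values as a crude signal of non-constant variance.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression


def residual_checks(y_true, y_fitted):
    """Return simple numeric summaries of the residuals (hypothetical helper)."""
    y_fitted = np.asarray(y_fitted)
    resid = np.asarray(y_true) - y_fitted
    # Shapiro-Wilk: a small p-value suggests the residuals are not normal
    _, shapiro_p = stats.shapiro(resid)
    # Spearman correlation between |residual| and fitted value:
    # a clearly non-zero value suggests the spread changes with the prediction
    spread_rho, _ = stats.spearmanr(np.abs(resid), y_fitted)
    return {"mean": resid.mean(), "shapiro_p": shapiro_p, "spread_rho": spread_rho}


# Synthetic data whose noise grows with x, i.e. deliberately heteroscedastic
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y_demo = 2.0 * x + rng.normal(0, 1, size=500) * x

fit = LinearRegression().fit(x.reshape(-1, 1), y_demo)
checks = residual_checks(y_demo, fit.predict(x.reshape(-1, 1)))
print(checks)
```

On the notebook's data, the same helper could be called as `residual_checks(y, y_pred_mlr)`. The residual mean is zero by construction for a least-squares fit with an intercept, so the informative outputs are the p-value and the spread correlation.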


In [None]:

residuals = y - y_pred_mlr

sns.scatterplot(x=y_pred_mlr, y=residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

sns.histplot(residuals, kde=True)
plt.title('Distribution of Residuals')
plt.show()
