Exercise: Predicting House Prices Using Linear Regression
Objective:
Use the Boston Housing Dataset to build a Linear Regression model that predicts house prices.

The columns in the Boston Housing Dataset represent different attributes (features) of houses in the Boston area. Here’s what each column means:

RM - Average number of rooms per dwelling (indicates house size).
LSTAT - Percentage of lower-status population in the area (a socioeconomic indicator).
PTRATIO - Pupil-teacher ratio by town (educational quality indicator).
MEDV - Median value of owner-occupied homes in USD (target variable for regression).

Tasks:

Load the dataset.
Preprocess the data.
Split into training and testing sets.
Train a Linear Regression model.
Evaluate the model using Mean Squared Error (MSE).
Verify the model's performance using assertions.

In [1]:
# Import all the required libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:

# Load the dataset
df = pd.read_csv('housing.csv')

# Display the first few rows to visually inspect the data
print("First 5 rows of the dataset:")
print(df.head())


First 5 rows of the dataset:
      RM  LSTAT  PTRATIO      MEDV
0  6.575   4.98     15.3  504000.0
1  6.421   9.14     17.8  453600.0
2  7.185   4.03     17.8  728700.0
3  6.998   2.94     18.7  701400.0
4  7.147   5.33     18.7  760200.0
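
The task list includes a preprocessing step, but the notebook goes straight from loading to modelling. A minimal sketch of the checks that step might cover is shown here, run on the five rows printed above rather than on the full file. The names `df_sample` and `df_clean` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Sample rows copied from the df.head() output above; in the notebook these
# checks would run on the full DataFrame loaded from housing.csv.
df_sample = pd.DataFrame({
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
    "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33],
    "PTRATIO": [15.3, 17.8, 17.8, 18.7, 18.7],
    "MEDV": [504000.0, 453600.0, 728700.0, 701400.0, 760200.0],
})

# Count missing values in each column
missing_per_column = df_sample.isnull().sum()
print(missing_per_column)

# Confirm every column is numeric, since LinearRegression needs numeric input
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in df_sample.dtypes)
print("All columns numeric:", all_numeric)

# Drop incomplete rows (this changes nothing for the sample, which has none)
df_clean = df_sample.dropna()
```

If the full dataset turned out to contain missing values, dropping rows is only one option; filling them in is another, and the right choice depends on how many rows are affected.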


In [4]:
# Here we assume the target column is the last column in the dataset.
# If your target column is in a different position, update the code accordingly.
# All columns except the last one are the features, or input variables (X).
# The last column is the target, or output variable (y).


X = df.iloc[:, :-1]  # Features: every column except the last
y = df.iloc[:, -1]   # Target: the last column

print("\nFeatures (first 5 rows):")
print(X.head())
print("\nTarget (first 5 rows):")
print(y.head())


Features (first 5 rows):
      RM  LSTAT  PTRATIO
0  6.575   4.98     15.3
1  6.421   9.14     17.8
2  7.185   4.03     17.8
3  6.998   2.94     18.7
4  7.147   5.33     18.7

Target (first 5 rows):
0    504000.0
1    453600.0
2    728700.0
3    701400.0
4    760200.0
Name: MEDV, dtype: float64
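
Selecting by position works only while the target stays in the last column. A sketch of the same separation done by column name follows, again on the five sample rows. The names `TARGET`, `X_by_name`, and `y_by_name` are illustrative assumptions.

```python
import pandas as pd

# Sample rows copied from the df.head() output earlier in the notebook
df_sample = pd.DataFrame({
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
    "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33],
    "PTRATIO": [15.3, 17.8, 17.8, 18.7, 18.7],
    "MEDV": [504000.0, 453600.0, 728700.0, 701400.0, 760200.0],
})

TARGET = "MEDV"  # name of the target column

# Features: every column except the target, regardless of column order
X_by_name = df_sample.drop(columns=TARGET)

# Target: the named column
y_by_name = df_sample[TARGET]

print(list(X_by_name.columns))
print(y_by_name.name)
```

For this dataset both approaches give the same X and y; the name-based version simply keeps working if the columns are reordered.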


In [5]:
# Test the feature/target separation by checking shapes
assert X.shape[0] == y.shape[0], "X and y must have the same number of rows"
assert X.shape[1] == df.shape[1] - 1, "X should hold every column except the target"
assert y.name == "MEDV", "The target should be the MEDV column"


In [6]:
# Train-Test Split
# Split the data into training (80%) and testing (20%) sets.
# The results are stored in X_train, X_test, y_train, and y_test;
# random_state=42 makes the split reproducible.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nTraining set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])



Training set size: 391
Test set size: 98
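
The sizes above can be checked by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the training set receives the remainder, which is consistent with the 391/98 split printed above.

```python
import math

n_samples = 391 + 98  # total rows, taken from the sizes printed above
test_size = 0.2       # fraction passed to train_test_split

# 0.2 * 489 = 97.8, which rounds up to 98 test rows
n_test = math.ceil(test_size * n_samples)

# The training set gets every remaining row
n_train = n_samples - n_test

print("Expected training rows:", n_train)
print("Expected test rows:", n_test)
```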


In [7]:
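# Assertions to ensure the split is correct
assert X_train.shape[0] + X_test.shape[0] == X.shape[0], "Split must cover every row"
assert X_train.shape[0] == y_train.shape[0], "Training features and targets must align"
assert X_test.shape[0] == y_test.shape[0], "Test features and targets must align"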


In [8]:
# Train the Linear Regression Model
# Initialize the model, then fit it to the training data with model.fit()

model = LinearRegression()
model.fit(X_train, y_train)

print("\nModel training complete.")
print("Model coefficients:")
print(model.coef_)


Model training complete.
Model coefficients:
[ 87322.20361861 -10620.63731522 -19324.4102965 ]
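
The coefficient array is easier to read once each value is paired with its feature. The sketch below assumes the coefficients follow the column order of X (RM, LSTAT, PTRATIO) and reuses the values printed above; `coef_by_feature` is an illustrative name.

```python
# Feature names in the same order as the columns of X
feature_names = ["RM", "LSTAT", "PTRATIO"]

# Coefficients copied from the model.coef_ output above
coefficients = [87322.20361861, -10620.63731522, -19324.4102965]

coef_by_feature = dict(zip(feature_names, coefficients))

# Each coefficient is the change in predicted price for a one-unit increase
# in that feature, with the other two features held fixed
for name, coef in coef_by_feature.items():
    direction = "raises" if coef > 0 else "lowers"
    print(f"One more unit of {name} {direction} the predicted price by about ${abs(coef):,.0f}")
```

Read this way, an extra room is associated with a higher predicted price, while a higher LSTAT or PTRATIO is associated with a lower one.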


In [9]:
# Test that the model has learned coefficients
assert hasattr(model, "coef_"), "Model has no coefficients; was fit() called?"
assert model.coef_.shape == (X_train.shape[1],), "Expected one coefficient per feature"


In [10]:
# Make Predictions
# Use the trained model to predict prices for the test set
y_pred = model.predict(X_test)

print("\nFirst 5 predictions:")
print(y_pred[:5])


First 5 predictions:
[342593.79029768 506257.0916297  410499.93166174 237792.7411537
 327005.79653234]


In [12]:
# Evaluate the Model
# Calculate the Mean Squared Error (MSE) between actual and predicted prices
mse = mean_squared_error(y_test, y_pred)

print(f"\nMean Squared Error (MSE): {mse:.2f}")


Mean Squared Error (MSE): 6789025559.27
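
An MSE in the billions is hard to judge because it is measured in squared dollars. Taking the square root gives the Root Mean Squared Error (RMSE), which is in the same units as MEDV. A short sketch using the value printed above:

```python
import math

mse = 6789025559.27  # value printed by the evaluation cell above

# RMSE is the square root of MSE, so it is expressed in dollars
rmse = math.sqrt(mse)

print(f"Root Mean Squared Error (RMSE): ${rmse:,.0f}")
```

The result is a little over $82,000, which can be compared directly with house prices in the dataset, such as the values between roughly $450,000 and $760,000 in the first five rows.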


In [13]:
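# Verify the model's performance: it should beat a baseline that always
# predicts the mean training price
baseline_mse = mean_squared_error(y_test, np.full(len(y_test), y_train.mean()))

assert mse >= 0, "MSE must be non-negative"
assert mse < baseline_mse, "Model should outperform the mean-price baseline"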
