# House Price Prediction With a Multiple Variable Linear Regression

This notebook builds a multiple linear regression model that predicts house prices from several numerical features. The project focuses on understanding the full machine learning workflow, from data inspection to model evaluation.

**Goal:** Predict house prices using multiple variable linear regression  
**Tools:** Python, pandas, scikit-learn, matplotlib


### Import Necessary Tools & Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

### Load the Dataset

In [None]:
df = pd.read_csv("Housing_Price_Data.csv")

### Inspect Data

In [None]:
df.info()

In [None]:
df.head()

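### Select Features and Split Train/Test Data

In [None]:
# Numerical features used as predictors, and the target column
X = df[['area', 'bedrooms', 'bathrooms', 'stories', 'parking']]
y = df['price']

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)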

### Create The Model

In [None]:
model = LinearRegression()

### Train The Model (Fit it)

In [None]:
model.fit(X_train, y_train)

### Feature Importance by Coefficient Comparison

The coefficients indicate how much each feature contributes to the predicted house price while holding other variables constant.
Features with larger absolute coefficients have a stronger influence on price, while features with coefficients near zero contribute less to the model.
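
The "holding other variables constant" reading can be checked directly. The sketch below uses small synthetic data (an assumption for illustration, not the housing dataset) and hypothetical names such as `demo_model`: two houses that differ only by one bathroom should differ in predicted price by exactly the bathrooms coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption, not the housing dataset):
# price = 100 * area + 5000 * bathrooms + 20000, with no noise.
rng = np.random.default_rng(0)
area = rng.uniform(1000, 5000, size=50)
bathrooms = rng.integers(1, 4, size=50)
X_demo = np.column_stack([area, bathrooms])
y_demo = 100 * area + 5000 * bathrooms + 20000

demo_model = LinearRegression().fit(X_demo, y_demo)

# Two houses that are identical except for one extra bathroom.
house_a = np.array([[3000.0, 2.0]])
house_b = np.array([[3000.0, 3.0]])
price_gap = demo_model.predict(house_b)[0] - demo_model.predict(house_a)[0]

# The gap matches the bathrooms coefficient up to floating-point error.
print(price_gap, demo_model.coef_[1])
```

The same reasoning applies to every feature in the housing model: each coefficient is the predicted price change for a one-unit increase in that feature, with the others fixed.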

In [None]:
coefficients = pd.Series(
    model.coef_,
    index=X.columns
)

# Keep the sorted result (sort_values returns a new Series)
coefficients = coefficients.sort_values(ascending=False)

# Visualize the coefficients as a horizontal bar chart
coefficients.plot(kind='barh', title='Feature Coefficients')
plt.show()



Although house area appears to have a coefficient close to zero, this reflects differences in feature scale rather than a lack of importance. Area takes much larger numeric values than the other features, so its per-unit coefficient is small and its magnitude is not directly comparable with the others. Area remains an important predictor of house price and should not be excluded based on raw coefficient values alone.
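
One common way to make coefficients comparable is to standardize the features before fitting. The following is a minimal sketch on synthetic stand-in data (the columns, value ranges, and noise level are assumptions, not values from the housing dataset); it uses scikit-learn's `StandardScaler`, which this notebook does not otherwise apply.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): area spans thousands, bathrooms 1-3.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "area": rng.uniform(1500, 10000, size=200),
    "bathrooms": rng.integers(1, 4, size=200).astype(float),
})
price = (
    400 * demo["area"]
    + 300_000 * demo["bathrooms"]
    + rng.normal(0, 50_000, size=200)
)

# Raw coefficients: price change per one unit of each feature.
raw_model = LinearRegression().fit(demo, price)
raw_coefs = pd.Series(raw_model.coef_, index=demo.columns)

# Standardized coefficients: price change per one standard deviation.
scaled = StandardScaler().fit_transform(demo)
scaled_model = LinearRegression().fit(scaled, price)
scaled_coefs = pd.Series(scaled_model.coef_, index=demo.columns)

print(raw_coefs)
print(scaled_coefs)
```

In this synthetic setup the raw coefficient for `area` is far smaller than the one for `bathrooms`, yet after standardization `area` has the larger coefficient. Whether the same reversal occurs on the real housing data would need to be checked by running the equivalent code on `X_train`.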

### Create a Refined Feature Set

In [None]:
# Keep area (its small raw coefficient is a scale effect) and drop bedrooms
X_refined = df[['area', 'bathrooms', 'stories', 'parking']]
y = df['price']

### Split Train/Test Data

In [None]:
# Same test size and seed as the first split, so the two models see the same rows
X_train_refined, X_test_refined, y_train_refined, y_test_refined = train_test_split(
    X_refined, y, test_size=0.2, random_state=42
)

### Create and Fit a Refined Model

In [None]:
model_refined = LinearRegression()
# Fit on the training split only; fitting on all rows would leak the test set
model_refined.fit(X_train_refined, y_train_refined)


### Make Predictions With Refined Model

In [None]:
y_pred_refined = model_refined.predict(X_test_refined)


### Create a Predictions DataFrame

In [None]:
predictions_multi = X_test_refined.copy()

predictions_multi["actual_price"] = y_test_refined.values
predictions_multi["predicted_price"] = y_pred_refined


### Save to CSV

In [None]:
predictions_multi.to_csv("multi_variable_predictions.csv", index=False)

### Evaluate The Model

In [None]:
# Compare predictions against the targets from the refined split
mse_multi = mean_squared_error(y_test_refined, y_pred_refined)
r2_multi = r2_score(y_test_refined, y_pred_refined)

print("Mean Squared Error (MSE):", mse_multi)
print("R² Score:", r2_multi)


## Model Evaluation

### Mean Squared Error (MSE)

The Mean Squared Error for this model is **2.86 × 10¹²**.  
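
MSE is expressed in squared price units, which is why the figure is so large. Taking its square root gives the root mean squared error (RMSE), which is in the same units as price and is easier to read. The snippet below is a small sketch that starts from the rounded MSE quoted above, so its result is approximate.

```python
import math

mse = 2.86e12          # rounded MSE reported above, in squared price units
rmse = math.sqrt(mse)  # back in the same units as price

print(f"RMSE: {rmse:,.0f}")
```

This puts the typical prediction error at roughly 1.69 million in price units.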



### R² Score
The R² score for this model is **0.43**, meaning that approximately **43% of the variance in house prices** is explained by the selected features collectively (area, bathrooms, stories, and parking).
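
For reference, R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes it by hand on toy values (an assumption, not taken from the housing data) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values (assumption), not taken from the housing dataset.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # variation the model misses
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

Here the residual sum of squares is 1.0 and the total sum of squares is 20.0, so both values come out to 0.95.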



Compared to the single-variable model, the multiple linear regression model achieves a higher R² score and lower MSE, indicating that incorporating additional features improves the model’s ability to explain variance in house prices. This suggests that house price is influenced by multiple factors beyond area alone.

### Residual Plot

The model captures general trends but struggles to accurately predict individual house prices, especially at higher price ranges. This suggests that while the selected numerical features contribute to price estimation, additional variables or more complex models may be required for improved accuracy.

In [None]:
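# Residual = actual price minus predicted price
residuals = y_test_refined - y_pred_refined

plt.scatter(y_pred_refined, residuals)
plt.axhline(0)  # reference line: points on it are predicted exactly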
plt.xlabel("Predicted Price")
plt.ylabel("Residuals")
plt.title("Residuals vs Predicted Values")

plt.show()

### Model Limitations

This model assumes a linear relationship between features and house price, which may not fully capture real-world pricing dynamics. Additionally, categorical and location-based factors are not included, which may limit predictive accuracy. Feature scaling was not applied, which affects coefficient comparability but not overall model predictions.
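
The claim that scaling changes the coefficients but not the predictions can be verified for ordinary least squares. The following sketch uses synthetic data (an assumption; the feature ranges only loosely mimic area and bathrooms) and compares predictions from a model fit on raw features with one fit on standardized features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two features on very different scales.
rng = np.random.default_rng(1)
X_demo = np.column_stack([
    rng.uniform(1500, 10000, size=100),  # large-scale feature, like area
    rng.integers(1, 4, size=100),        # small-scale feature, like bathrooms
])
y_demo = (
    400 * X_demo[:, 0]
    + 300_000 * X_demo[:, 1]
    + rng.normal(0, 50_000, size=100)
)

X_fit, X_new = X_demo[:80], X_demo[80:]
y_fit = y_demo[:80]

# Model fit on the raw features.
raw_pred = LinearRegression().fit(X_fit, y_fit).predict(X_new)

# Model fit on standardized features; the scaler is fit on the training rows only.
scaler = StandardScaler().fit(X_fit)
scaled_model = LinearRegression().fit(scaler.transform(X_fit), y_fit)
scaled_pred = scaled_model.predict(scaler.transform(X_new))

same_predictions = np.allclose(raw_pred, scaled_pred)
print(same_predictions)
```

This equivalence holds for plain `LinearRegression`; it would not generally hold for regularized models such as ridge or lasso, where the penalty depends on coefficient size.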