# ML Regression 01 — House Price Prediction

This notebook walks through a complete supervised-learning regression workflow:

1. Import libraries
2. Load the dataset
3. Exploratory Data Analysis (EDA)
4. Feature engineering & preprocessing
5. Train a Linear Regression model
6. Evaluate the model (MAE, RMSE, R²) and visualise predictions vs actual values
7. Summary

**Dataset**: `dataset.csv` — synthetic house-price data with features such as size, bedrooms, bathrooms, age, and distance to city.

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

%matplotlib inline
sns.set_theme(style='whitegrid')

## 2. Load the Dataset

In [None]:
df = pd.read_csv('dataset.csv')
print(f'Shape: {df.shape}')
df.head()

## 3. Exploratory Data Analysis (EDA)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# Check for missing values
print('Missing values:')
print(df.isnull().sum())

In [None]:
# Distribution of the target variable
plt.figure(figsize=(8, 4))
sns.histplot(df['price_usd'], bins=15, kde=True)
plt.title('Distribution of House Prices')
plt.xlabel('Price (USD)')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

In [None]:
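# Correlation heatmap
# numeric_only=True (pandas >= 1.5) skips any non-numeric columns, which would
# otherwise raise an error in pandas >= 2.0
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')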
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()
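
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `df.corr()`. As a rough illustration of what that number measures, the sketch below computes it by hand for two short lists. The function name `pearson_r` and the toy values are assumptions made for this example and are not taken from `dataset.csv`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same constant factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values, chosen only to show a strong positive relationship
size_sqft = [800, 1200, 1500, 2000, 2400]
price_usd = [150_000, 210_000, 260_000, 330_000, 400_000]

r = pearson_r(size_sqft, price_usd)
print(f'Pearson r: {r:.4f}')
```

Values near +1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship; a non-linear dependence can still exist in that case.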

In [None]:
# Scatter plots: each feature vs price
features = ['house_size_sqft', 'num_bedrooms', 'num_bathrooms', 'age_years', 'distance_to_city_km']

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()

for i, feature in enumerate(features):
    axes[i].scatter(df[feature], df['price_usd'], alpha=0.6)
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Price (USD)')
    axes[i].set_title(f'{feature} vs Price')

# Hide unused subplot
axes[-1].set_visible(False)
plt.tight_layout()
plt.show()

## 4. Feature Engineering & Preprocessing

In [None]:
# Separate features (X) and target (y)
X = df[features]
y = df['price_usd']

# Split into train / test sets (80 / 20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Training samples : {len(X_train)}')
print(f'Test samples     : {len(X_test)}')
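
To make the 80 / 20 split less of a black box, here is a minimal standard-library sketch of the same idea: shuffle the row indices with a fixed seed, then hold out a fraction for testing. The helper `simple_train_test_split` is hypothetical and does not reproduce the exact row assignment that scikit-learn's `train_test_split` produces for `random_state=42`.

```python
import random


def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then hold out a test fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(rows) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test


rows = list(range(50))  # hypothetical stand-in for 50 dataset rows
train_rows, test_rows = simple_train_test_split(rows)
print(f'Training samples : {len(train_rows)}')
print(f'Test samples     : {len(test_rows)}')
```

Fixing the seed matters for the same reason `random_state=42` does in the cell above: rerunning the notebook yields the same train and test rows, so metric changes reflect model changes rather than a different shuffle.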

In [None]:
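# Feature scaling (StandardScaler)
# Fit the scaler on the training set only, then reuse its mean and standard
# deviation for the test set so no test-set statistics leak into training.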
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
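
The sketch below shows, for a single feature, the arithmetic that standardization performs: subtract the training mean and divide by the training standard deviation. `StandardScaler` uses the population standard deviation (`ddof=0`), which `statistics.pstdev` matches. The helper names and the sample values are assumptions made for illustration.

```python
import statistics


def fit_standardizer(column):
    """Return the mean and population standard deviation of a training column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std (ddof=0)
    return mean, std


def standardize(column, mean, std):
    """Apply z = (x - mean) / std using already-fitted statistics."""
    return [(value - mean) / std for value in column]


# Hypothetical house sizes in square feet
train_sizes = [800, 1200, 1500, 2000, 2500]
test_sizes = [1000, 3000]

mean, std = fit_standardizer(train_sizes)             # statistics from training data only
train_scaled = standardize(train_sizes, mean, std)
test_scaled = standardize(test_sizes, mean, std)      # reuse the training statistics

print([round(z, 3) for z in train_scaled])
print([round(z, 3) for z in test_scaled])
```

The scaled training column has mean 0 and standard deviation 1. The scaled test column generally does not, which is expected because it was transformed with the training statistics.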

## 5. Train the Model

In [None]:
model = LinearRegression()
model.fit(X_train_scaled, y_train)

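# The model was fit on standardized features, so each coefficient is the change
# in price (USD) per one standard deviation of that feature.
print('Model coefficients (USD per 1 std of each feature):')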
for name, coef in zip(features, model.coef_):
    print(f'  {name:30s}: {coef:,.2f}')
print(f'  Intercept                     : {model.intercept_:,.2f}')
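
Because the model is trained on standardized inputs, its coefficients are not in "USD per square foot" or "USD per year". One way to recover per-unit effects is to divide each coefficient by the feature's standard deviation and adjust the intercept accordingly. The sketch below does this with made-up numbers; the function `to_original_units` and every value in it are hypothetical.

```python
def to_original_units(coefs, intercept, means, scales):
    """Convert coefficients fitted on z-scored features back to raw feature units.

    For y = b0 + sum(c_j * (x_j - mean_j) / scale_j):
      raw slope_j   = c_j / scale_j
      raw intercept = b0 - sum(c_j * mean_j / scale_j)
    """
    raw_coefs = [c / s for c, s in zip(coefs, scales)]
    raw_intercept = intercept - sum(
        c * m / s for c, m, s in zip(coefs, means, scales)
    )
    return raw_coefs, raw_intercept


# Hypothetical values for two features: size (sqft) and age (years)
scaled_coefs = [60_000.0, -12_000.0]   # USD per 1 std of each feature
scaled_intercept = 300_000.0
means = [1_600.0, 20.0]                # training means
scales = [600.0, 8.0]                  # training standard deviations

raw_coefs, raw_intercept = to_original_units(
    scaled_coefs, scaled_intercept, means, scales
)
print(raw_coefs)       # USD per sqft, USD per year
print(raw_intercept)
```

In this notebook the equivalent inputs would be `model.coef_`, `model.intercept_`, `scaler.mean_`, and `scaler.scale_`.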

## 6. Evaluate the Model

In [None]:
y_pred = model.predict(X_test_scaled)

mae  = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)

print(f'Mean Absolute Error (MAE)      : ${mae:,.2f}')
print(f'Root Mean Squared Error (RMSE) : ${rmse:,.2f}')
print(f'R² Score                       : {r2:.4f}')
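
For reference, the three metrics can be computed directly from their definitions. The sketch below is a plain-Python illustration on hypothetical prices, not a replacement for the scikit-learn functions used above; `regression_metrics` is a name invented for this example.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) computed from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n               # mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)    # root of the mean squared error

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical actual and predicted prices in USD
y_true_demo = [200_000, 250_000, 300_000, 350_000]
y_pred_demo = [210_000, 240_000, 310_000, 340_000]

mae_demo, rmse_demo, r2_demo = regression_metrics(y_true_demo, y_pred_demo)
print(f'MAE : {mae_demo:,.2f}')
print(f'RMSE: {rmse_demo:,.2f}')
print(f'R²  : {r2_demo:.4f}')
```

MAE and RMSE are in the same unit as the target (USD). RMSE penalizes large errors more heavily, so a wide gap between the two hints at a few badly mispredicted houses. R² is unitless and compares the model against always predicting the mean price.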

In [None]:
# Actual vs Predicted scatter plot
plt.figure(figsize=(7, 5))
plt.scatter(y_test, y_pred, alpha=0.7, edgecolors='k', linewidths=0.4)
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', label='Perfect prediction')
plt.xlabel('Actual Price (USD)')
plt.ylabel('Predicted Price (USD)')
plt.title('Actual vs Predicted House Prices')
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Residual plot
residuals = y_test - y_pred

plt.figure(figsize=(7, 4))
plt.scatter(y_pred, residuals, alpha=0.7, edgecolors='k', linewidths=0.4)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Predicted Price (USD)')
plt.ylabel('Residuals (USD)')
plt.title('Residual Plot')
plt.tight_layout()
plt.show()

## 7. Summary

| Metric | Value |
|--------|-------|
| MAE    | see output above |
| RMSE   | see output above |
| R²     | see output above |

### Key Takeaways
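- **house_size_sqft** and **num_bathrooms** are the strongest positive predictors of price. Because the features were standardized, the coefficient magnitudes printed in Section 5 can be compared directly.
- **age_years** and **distance_to_city_km** have a negative effect on price, which is intuitive: older houses and houses farther from the city tend to be cheaper.
- Linear Regression achieves a high R² on this synthetic dataset because the data was generated with roughly linear relationships. Real-world housing data is usually noisier, so a similar score should not be assumed there.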

### Next Steps
- Try **Ridge / Lasso** regression to add regularisation.
- Experiment with **polynomial features** to capture non-linear relationships.
- Use a **Random Forest Regressor** or **Gradient Boosting** for potentially higher accuracy.
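
As a possible starting point for the first suggestion, the sketch below wraps `StandardScaler` and `Ridge` in a single pipeline and scores it with 5-fold cross-validation. It runs on randomly generated data so that it is self-contained; the arrays `X_demo` and `y_demo`, the coefficients used to generate them, and the choice `alpha=1.0` are all assumptions for illustration, not values from `dataset.csv`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the five house features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([3.0, 1.0, 2.0, -1.5, -2.5])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=200)

# The pipeline refits the scaler inside every cross-validation fold,
# so validation rows never influence the scaling statistics.
ridge_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

cv_r2 = cross_val_score(ridge_pipeline, X_demo, y_demo, cv=5, scoring='r2')
mean_cv_r2 = cv_r2.mean()
print(f'R² per fold: {np.round(cv_r2, 4)}')
print(f'Mean R²    : {mean_cv_r2:.4f}')
```

To try this on the notebook's data, `X_demo` and `y_demo` could be replaced with `X` and `y` from Section 4. Swapping `Ridge` for `Lasso` in the pipeline gives the L1-regularised variant.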