# Task 6 — House Price Prediction (Regression)

This notebook predicts house prices from tabular features using **Gradient Boosting**.

**Dataset:** Use a Kaggle-style `housing.csv` (or similar with columns like `price`, `sqft_living`, `bedrooms`, `bathrooms`, `lat`, `long`, etc.). Place the file next to this notebook.

**How to run (locally):**
1. Download a house price dataset (e.g., *House Sales in King County, USA*).
2. Put the CSV in the same folder (rename to `housing.csv` or adjust the path below).
3. Install requirements: `pip install scikit-learn matplotlib pandas numpy`
4. Run all cells.

**Metrics:** MAE and RMSE, plus a predicted vs. actual plot.
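
As a rough illustration of what the two metrics measure, here is a minimal pure-Python sketch. The helper names `mae` and `rmse` and the four prices are made up for this example and are not part of the notebook's pipeline. MAE averages the absolute errors, while RMSE squares them first, so large misses weigh more.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average size of the miss, in price units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: like MAE, but penalizes large misses more."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

# Hypothetical prices, for illustration only
actual = [300000, 450000, 500000, 250000]
predicted = [310000, 440000, 470000, 290000]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
print('MAE:', mae_value)
print('RMSE:', round(rmse_value, 2))
```

RMSE is never smaller than MAE on the same predictions. A large gap between the two usually points to a few badly mispredicted houses.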


In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

pd.set_option('display.max_columns', None)

In [None]:
# Load data (expects 'housing.csv' in the same directory)
df = pd.read_csv('housing.csv')
print('Shape:', df.shape)
df.head()

In [None]:
# Basic cleaning: drop the row identifier and raw date string if present
drop_cols = [c for c in ['id','date'] if c in df.columns]
df = df.drop(columns=drop_cols)

# Handle missing values (simple strategy)
df = df.dropna()
df.describe()

In [None]:
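# Define target and features
target = 'price'  # adjust if your CSV uses a different target name
# Keep numeric columns only: GradientBoostingRegressor cannot fit on string columns
X = df.drop(columns=[target]).select_dtypes(include='number')
y = df[target]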

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gradient Boosting is tree-based, so the features are used without scaling
gb = GradientBoostingRegressor(random_state=42)
gb.fit(X_train, y_train)
preds = gb.predict(X_test)

mae = mean_absolute_error(y_test, preds)
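# Take the square root explicitly: the squared=False argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, preds))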
print('MAE:', mae)
print('RMSE:', rmse)

In [None]:
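# Plot predicted vs. actual prices; points on the dashed diagonal are perfect predictions
plt.figure()
plt.scatter(y_test, preds, alpha=0.4)
lims = [min(y_test.min(), preds.min()), max(y_test.max(), preds.max())]
plt.plot(lims, lims, linestyle='--', color='gray')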
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Predicted vs Actual House Prices')
plt.show()

In [None]:
# Feature importances (impurity-based) from the fitted model, top 15
if hasattr(gb, 'feature_importances_'):
    importances = pd.Series(gb.feature_importances_, index=X.columns).sort_values(ascending=False)
    print(importances.head(15))
else:
    print('Model does not expose feature_importances_.')

## Notes & Next Steps
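- Try **XGBoost** or **LightGBM**; both are typically faster to train and, with tuning, may score better than the scikit-learn default used here.
- Engineer features like price per sqft (computed from training rows only, e.g., averaged per neighborhood, to avoid leaking the target), age of house, and neighborhood encodings.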
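
The feature ideas above could be sketched roughly as follows. This is an illustration on a tiny made-up frame, not part of the notebook's pipeline. The column names `yr_built` and `zipcode`, the `REFERENCE_YEAR` constant, and the `add_features` helper are assumptions; adjust them to your CSV. The per-zipcode price per sqft is learned from training rows only and then mapped onto test rows, so no test prices are used.

```python
import pandas as pd

# Tiny made-up data; `yr_built` and `zipcode` are assumed column names
train = pd.DataFrame({
    'price': [300000.0, 500000.0, 400000.0],
    'sqft_living': [1000, 2000, 2000],
    'yr_built': [1990, 2005, 1975],
    'zipcode': [98001, 98001, 98002],
})
test = pd.DataFrame({
    'sqft_living': [1500, 1200],
    'yr_built': [2010, 1960],
    'zipcode': [98002, 98999],  # 98999 never appears in train
})

REFERENCE_YEAR = 2015  # assumed sale year, used to derive house age

def add_features(frame, zip_ppsf, fallback):
    """Return a copy of `frame` with house age and a neighborhood encoding."""
    out = frame.copy()
    out['house_age'] = REFERENCE_YEAR - out['yr_built']
    # Zipcodes unseen in training fall back to the overall training mean
    out['zip_price_per_sqft'] = out['zipcode'].map(zip_ppsf).fillna(fallback)
    return out

# Learn the encoding from training rows only
ppsf = train['price'] / train['sqft_living']
zip_ppsf = ppsf.groupby(train['zipcode']).mean()
fallback = ppsf.mean()

train_fe = add_features(train, zip_ppsf, fallback)
test_fe = add_features(test, zip_ppsf, fallback)
print(test_fe)
```

Note that each training row still sees its own price inside its zipcode average. A stricter variant would compute the encoding out-of-fold.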
