# 🏡 Housing Market Analysis and Prediction using Python

This notebook performs end-to-end analysis and price prediction for housing data:
- Data loading & cleaning
- Exploratory data analysis (EDA)
- Feature engineering
- Model training & evaluation (Linear Regression, Random Forest)
- Cross-validation & feature importance
- Example predictions & model persistence

> Note: This notebook uses **matplotlib** only for charts (no seaborn).

## 1. Setup & Imports

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import joblib
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


## 2. Load Data
The cell below reads `housing_market.csv` from the working directory if the file exists.
Otherwise, it builds a small synthetic sample dataset so the rest of the notebook still runs.

In [None]:
# Try reading a CSV; otherwise, create a small sample dataset
csv_path = 'housing_market.csv'
if os.path.exists(csv_path):
    df = pd.read_csv(csv_path)
else:
    # Sample synthetic data (you can replace with your actual dataset)
    data = {
        'SqFt': [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2700, 3000, 3200],
        'Price': [200000, 230000, 280000, 330000, 360000, 390000, 450000, 480000, 550000, 590000],
        'Bedrooms': [2, 2, 3, 3, 3, 4, 4, 4, 4, 5]
    }
    df = pd.DataFrame(data)
df.head()
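The later cells assume the columns `SqFt`, `Price`, and `Bedrooms` are present, which a user-supplied CSV may not guarantee. A minimal sketch of a column check follows; the helper name `check_required_columns` and the constant `REQUIRED_COLUMNS` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Columns the later cells rely on (assumption based on the sample dataset)
REQUIRED_COLUMNS = ["SqFt", "Price", "Bedrooms"]


def check_required_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are missing from `frame`."""
    return [col for col in required if col not in frame.columns]


# Illustrative frame that lacks the 'Bedrooms' column
sample = pd.DataFrame({"SqFt": [1000, 1200], "Price": [200000, 230000]})
missing = check_required_columns(sample)
if missing:
    print("Missing columns:", missing)
```

Running the same check on `df` right after loading would surface a schema mismatch before the plots and models fail with a less obvious `KeyError`.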


## 3. Basic Data Checks

In [None]:
print('Shape:', df.shape)
print('\nInfo:')
print(df.info())
print('\nDescribe:')
display(df.describe())
print('\nMissing values (per column):')
print(df.isna().sum())
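If the checks above report duplicates or missing values, some cleaning is needed before modeling. The sketch below shows one possible approach on a small made-up frame: drop exact duplicates, drop rows without a target, and fill missing `SqFt` with the median. The frame `raw` and the choice of median imputation are assumptions for illustration, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Made-up data with one duplicate row and two missing values
raw = pd.DataFrame({
    "SqFt": [1000, 1200, np.nan, 1800, 1800],
    "Price": [200000, np.nan, 280000, 330000, 330000],
    "Bedrooms": [2, 2, 3, 3, 3],
})

clean = (
    raw.drop_duplicates()          # remove exact duplicate rows
       .dropna(subset=["Price"])   # a row without a target cannot be used for training
       .copy()
)
# Fill remaining gaps in SqFt with the column median
clean["SqFt"] = clean["SqFt"].fillna(clean["SqFt"].median())

print(clean)
print("Remaining missing values:", int(clean.isna().sum().sum()))
```

For a stricter evaluation, the median would be computed on the training split only, so that no information from the test rows reaches the model.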


## 4. Exploratory Data Analysis (EDA)

In [None]:
# Histogram of Price
plt.figure()
plt.hist(df['Price'].dropna(), bins=10)
plt.title('Distribution of House Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

# Scatter: Price vs SqFt
plt.figure()
plt.scatter(df['SqFt'], df['Price'])
plt.title('Price vs. Square Feet')
plt.xlabel('Square Feet')
plt.ylabel('Price')
plt.show()

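# Boxplot: Price by Bedrooms
plt.figure()
# groupby sorts its keys, so labels and groups stay aligned
grouped = df.groupby('Bedrooms')['Price']
labels = [str(int(b)) for b, _ in grouped]
groups = [g.values for _, g in grouped]
plt.boxplot(groups)
# Set tick labels separately: boxplot's `labels` keyword is deprecated in Matplotlib 3.9+
plt.xticks(range(1, len(labels) + 1), labels)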
plt.title('Price Distribution by Bedroom Count')
plt.xlabel('Bedrooms')
plt.ylabel('Price')
plt.show()


## 5. Feature Engineering

In [None]:
# Descriptive metric only: it is derived from the target, so do not use it as a model feature
df['Price_per_SqFt'] = df['Price'] / df['SqFt']
display(df.head())


## 6. Train/Test Split & Baseline Models

In [None]:
# Select features and target
features = ['SqFt', 'Bedrooms']
X = df[features]
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)
print('Train size:', X_train.shape, ' Test size:', X_test.shape)


### 6.1 Linear Regression

In [None]:
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred_lr = linreg.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)
# Take the square root explicitly; `squared=False` was removed in recent scikit-learn releases
rmse_lr = np.sqrt(mse_lr)
r2_lr = r2_score(y_test, y_pred_lr)
print(f'Linear Regression — RMSE: {rmse_lr:.2f}, R2: {r2_lr:.4f}')
print('Coefficients:', dict(zip(features, linreg.coef_)))
print('Intercept:', linreg.intercept_)
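To make the two reported metrics concrete, the sketch below computes RMSE and R² by hand with NumPy. The arrays `y_true` and `y_hat` are made-up values for illustration and are not outputs of the model above.

```python
import numpy as np

# Made-up actual and predicted prices
y_true = np.array([200000.0, 300000.0, 400000.0])
y_hat = np.array([210000.0, 290000.0, 400000.0])

errors = y_true - y_hat

# RMSE: square root of the mean squared error, in the same units as Price
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)

# R²: 1 minus the ratio of residual variation to total variation around the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.2f}, R2: {r2:.4f}")
```

Because R² divides by the variation of the actual values, it becomes unstable when the test set holds only a handful of rows.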


### 6.2 Random Forest Regressor

In [None]:
rf = RandomForestRegressor(n_estimators=300, random_state=RANDOM_STATE)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
# Take the square root explicitly; `squared=False` was removed in recent scikit-learn releases
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)
print(f'Random Forest — RMSE: {rmse_rf:.2f}, R2: {r2_rf:.4f}')
print('Feature Importances:', dict(zip(features, rf.feature_importances_)))
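The impurity-based `feature_importances_` printed above are computed from the training data only. As a complementary view, scikit-learn offers permutation importance, which measures how much the score drops when one feature is shuffled. The sketch below is self-contained and uses its own copy of the sample data; the names `demo`, `rf_demo`, and `perm` are hypothetical, and scoring on the training rows is a simplification forced by the tiny dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Copy of the notebook's synthetic sample data
demo = pd.DataFrame({
    "SqFt": [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2700, 3000, 3200],
    "Price": [200000, 230000, 280000, 330000, 360000,
              390000, 450000, 480000, 550000, 590000],
    "Bedrooms": [2, 2, 3, 3, 3, 4, 4, 4, 4, 5],
})
X_demo = demo[["SqFt", "Bedrooms"]]
y_demo = demo["Price"]

rf_demo = RandomForestRegressor(n_estimators=50, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Shuffle each feature 10 times and record the drop in R²
perm = permutation_importance(
    rf_demo, X_demo, y_demo, n_repeats=10, random_state=42
)
for name, mean, std in zip(X_demo.columns, perm.importances_mean, perm.importances_std):
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

With a larger dataset, passing held-out rows instead of the training rows gives a more honest picture of which features the model depends on.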


## 7. Cross-Validation (Optional)

In [None]:
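# Note: on very small datasets (such as the 10-row sample) each fold scores only a few rows,
# so per-fold R² values are noisy and should be read with caution
cv_scores = cross_val_score(linreg, X, y, cv=5, scoring='r2')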
print('Linear Regression 5-fold R²:', cv_scores)
print('Mean R²:', cv_scores.mean().round(4))
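Passing `cv=5` to a regressor uses unshuffled folds, so data that is sorted (as the sample dataset is, by `SqFt`) yields folds of neighboring rows. A possible variation, sketched below under those assumptions, shuffles the folds with `KFold` and reports RMSE, which stays well defined on small folds. The names `demo`, `kf`, and `rmse_per_fold` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Copy of the notebook's synthetic sample data
demo = pd.DataFrame({
    "SqFt": [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2700, 3000, 3200],
    "Price": [200000, 230000, 280000, 330000, 360000,
              390000, 450000, 480000, 550000, 590000],
    "Bedrooms": [2, 2, 3, 3, 3, 4, 4, 4, 4, 5],
})
X_demo = demo[["SqFt", "Bedrooms"]]
y_demo = demo["Price"]

# Shuffled folds avoid grouping neighboring (sorted) rows together
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so RMSE is exposed as a negated value
neg_rmse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=kf, scoring="neg_root_mean_squared_error",
)
rmse_per_fold = -neg_rmse

print("RMSE per fold:", rmse_per_fold.round(2))
print("Mean RMSE:", rmse_per_fold.mean().round(2))
```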


## 8. Residual Analysis

In [None]:
residuals = y_test - y_pred_lr
plt.figure()
plt.scatter(y_pred_lr, residuals)
plt.axhline(0)
plt.title('Residuals vs Predicted (Linear Regression)')
plt.xlabel('Predicted Price')
plt.ylabel('Residuals')
plt.show()


## 9. Example Prediction

In [None]:
# Example: predict price for a 2100 SqFt, 3-bedroom house
example = pd.DataFrame({'SqFt': [2100], 'Bedrooms': [3]})
pred_lr = linreg.predict(example)[0]
pred_rf = rf.predict(example)[0]
print('Example input:', example.to_dict(orient='records')[0])
print(f'Predicted price (Linear Regression): {pred_lr:,.0f}')
print(f'Predicted price (Random Forest):   {pred_rf:,.0f}')


## 10. Save Trained Model(s)

In [None]:
os.makedirs('artifacts', exist_ok=True)
joblib.dump(linreg, 'artifacts/linear_regression.joblib')
joblib.dump(rf, 'artifacts/random_forest.joblib')
print('Saved to artifacts/.')
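A saved model is only useful if it loads back and predicts the same values. The sketch below round-trips a small model through `joblib.dump` and `joblib.load`; it writes to a temporary directory rather than `artifacts/`, and the training arrays are made-up values for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: columns are SqFt and Bedrooms
X_demo = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4]])
y_demo = np.array([200000, 280000, 360000, 450000])

model = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear_regression.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```

Load joblib files only from trusted sources, and with the same scikit-learn version that produced them where possible, since the format is pickle-based.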


## 11. Conclusions & Next Steps
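- **Linear Regression and Random Forest** both serve as baselines; Random Forest can capture non-linear relationships. With the 10-row sample dataset, the test metrics rest on only two rows and should not be read as real performance.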
- Collect more data and add features (location, lot size, year built) to improve accuracy.
- Try regularized models (Ridge/Lasso), gradient boosting, and hyperparameter tuning.
- Validate with **time-based splits** if data is temporal.
- Monitor for drift if deploying in production.
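
As one way to start on the regularization and tuning steps listed above, the sketch below fits a Ridge model inside a scaling pipeline and searches a small grid of `alpha` values. The grid, the fold setup, and the names `demo`, `pipeline`, and `search` are assumptions chosen for illustration, not tuned recommendations.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Copy of the notebook's synthetic sample data
demo = pd.DataFrame({
    "SqFt": [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2700, 3000, 3200],
    "Price": [200000, 230000, 280000, 330000, 360000,
              390000, 450000, 480000, 550000, 590000],
    "Bedrooms": [2, 2, 3, 3, 3, 4, 4, 4, 4, 5],
})
X_demo = demo[["SqFt", "Bedrooms"]]
y_demo = demo["Price"]

# Scale features first so the Ridge penalty treats them comparably
pipeline = make_pipeline(StandardScaler(), Ridge())

# make_pipeline names the step 'ridge', hence the 'ridge__alpha' key
search = GridSearchCV(
    pipeline,
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

print("Best alpha:", search.best_params_["ridge__alpha"])
print("Best CV RMSE:", round(-search.best_score_, 2))
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so the cross-validated score is not contaminated by the held-out rows.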