# Housing Price Prediction using Scikit-Learn

This notebook demonstrates how to build a machine learning model to predict housing prices using various regression algorithms from scikit-learn.

## What You'll Learn:
1. Loading and exploring housing data
2. Data preprocessing and feature engineering
3. Training multiple regression models
4. Evaluating model performance
5. Making predictions on new data

Let's get started! 🚀

## 1. Import Required Libraries

First, we'll import all the necessary libraries for data manipulation, visualization, and machine learning.

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Machine Learning
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✓ All libraries imported successfully!")

## 2. Load and Explore the Dataset

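We'll use the California Housing dataset, which is derived from the 1990 U.S. census. Each row describes a census block group rather than an individual house, with features such as median income, median house age, and average rooms per household. The target is the median house value of the block group, expressed in units of $100,000.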

In [None]:
# Load the California Housing dataset
housing = fetch_california_housing()

# Create a DataFrame
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['PRICE'] = housing.target  # Add the target variable

print(f"Dataset shape: {df.shape}")
print(f"Features: {', '.join(housing.feature_names)}")
print(f"\nFirst 5 rows:")
df.head()

In [None]:
# Get dataset information
print("Dataset Information:")
df.info()  # info() prints its report directly and returns None, so no print() wrapper
print("\n" + "="*50)
print("\nStatistical Summary:")
df.describe()

In [None]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

### Visualize Data Distribution

In [None]:
# Distribution of housing prices
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(df['PRICE'], kde=True, bins=50, color='blue')
plt.title('Distribution of Housing Prices')
plt.xlabel('Price (in $100,000s)')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.boxplot(y=df['PRICE'], color='skyblue')
plt.title('Boxplot of Housing Prices')
plt.ylabel('Price (in $100,000s)')

plt.tight_layout()
plt.show()

In [None]:
# Correlation analysis
print("Correlation with Price:")
correlations = df.corr()['PRICE'].sort_values(ascending=False)
print(correlations)

# Correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()

## 3. Data Preprocessing and Feature Engineering

We start by separating the features from the target. Scaling happens after the train/test split in the next section, so that the scaler is fitted on training data only.

In [None]:
# Separate features (X) and target (y)
X = df.drop('PRICE', axis=1)
y = df['PRICE']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns: {list(X.columns)}")
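The heading mentions feature engineering, so here is a minimal sketch of what that could look like: two ratio features built only from existing input columns. The three-row `demo` frame, the helper `add_ratio_features`, and the new column names are hypothetical; the rest of the notebook continues with the original eight features.

```python
import pandas as pd

# Hypothetical three-row stand-in for X, using the dataset's column names
demo = pd.DataFrame({
    "AveRooms": [5.0, 6.0, 8.0],
    "AveBedrms": [1.0, 1.2, 2.0],
    "AveOccup": [2.5, 3.0, 2.0],
})

def add_ratio_features(frame):
    """Return a copy with two ratio features built from input columns only."""
    out = frame.copy()
    out["BedrmsPerRoom"] = out["AveBedrms"] / out["AveRooms"]
    out["RoomsPerPerson"] = out["AveRooms"] / out["AveOccup"]
    return out

demo_fe = add_ratio_features(demo)
print(demo_fe.round(3))
```

Because the helper never touches the target column, it can be applied to training data, test data, and new inputs alike without leaking price information.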

## 4. Split Data into Training and Testing Sets

We'll split our data into training (80%) and testing (20%) sets. The training set is used to train the model, and the testing set is used to evaluate its performance.

In [None]:
# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")
print(f"Training set percentage: {len(X_train) / len(X) * 100:.1f}%")
print(f"Testing set percentage: {len(X_test) / len(X) * 100:.1f}%")

In [None]:
# Scale the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✓ Features scaled using StandardScaler")
print(f"\nOriginal feature range example (MedInc):")
print(f"  Before scaling: {X_train['MedInc'].min():.2f} to {X_train['MedInc'].max():.2f}")
print(f"  After scaling: {X_train_scaled[:, 0].min():.2f} to {X_train_scaled[:, 0].max():.2f}")
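To make the scaling step less of a black box, the sketch below reproduces `StandardScaler` by hand on synthetic data. The arrays `train` and `test` and the name `demo_scaler` are made up for illustration. The point is that the test set is transformed with the mean and standard deviation of the training set, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train and X_test (assumption: 3 numeric features)
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Fit on the training data only
demo_scaler = StandardScaler().fit(train)

# Manual z-score using the training mean and population std (ddof=0)
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

# Both routes should give the same scaled test set
print(np.allclose(demo_scaler.transform(test), manual_test))
```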

## 5. Train a Linear Regression Model

Let's start with the simplest regression model: Linear Regression. This model assumes a linear relationship between features and the target.

In [None]:
# Create and train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

print("✓ Linear Regression model trained!")
print(f"\nModel coefficients:")
for feature, coef in zip(X.columns, lr_model.coef_):
    print(f"  {feature:15s}: {coef:8.4f}")
print(f"\nIntercept: {lr_model.intercept_:.4f}")

## 6. Make Predictions and Evaluate Model Performance

Now we'll use our trained model to make predictions and evaluate how well it performs using various metrics.

In [None]:
# Make predictions
y_train_pred = lr_model.predict(X_train_scaled)
y_test_pred = lr_model.predict(X_test_scaled)

# Calculate evaluation metrics
def evaluate_model(y_true, y_pred, set_name=""):
    """Calculate and print regression metrics"""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    
    print(f"{set_name} Metrics:")
    print(f"  Mean Absolute Error (MAE):  ${mae:.4f} (×100k)")
    print(f"  Mean Squared Error (MSE):   {mse:.4f} (squared ×$100k units)")
    print(f"  Root Mean Squared Error:     ${rmse:.4f} (×100k)")
    print(f"  R² Score:                    {r2:.4f}")
    print(f"  Variance explained (R²):     {r2*100:.2f}%")
    
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2}

print("="*50)
train_metrics = evaluate_model(y_train, y_train_pred, "Training Set")
print("\n" + "="*50)
test_metrics = evaluate_model(y_test, y_test_pred, "Test Set")
print("="*50)

### Understanding the Metrics:

- **MAE (Mean Absolute Error)**: Average absolute difference between predicted and actual prices. Lower is better.
- **RMSE (Root Mean Squared Error)**: Square root of average squared differences. Penalizes larger errors more. Lower is better.
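- **R² Score**: Proportion of variance in the target variable explained by the model. Higher is better: 1 is a perfect prediction, 0 matches always predicting the mean, and the score can be negative on held-out data when the model does worse than that baseline.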
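As a small worked example, the metrics can be computed by hand and checked against scikit-learn. The arrays `y_true`, `y_good`, and `y_bad` and the helper `metrics_by_hand` are invented for illustration. The second set of predictions is deliberately reversed to show R² dropping below zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_good = np.array([1.5, 2.0, 2.5, 4.0])  # close to the truth
y_bad = np.array([4.0, 3.0, 2.0, 1.0])   # reversed order, worse than the mean

def metrics_by_hand(y_true, y_pred):
    """Compute MAE, RMSE, and R² directly from their definitions."""
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

for label, y_pred in [("good", y_good), ("bad", y_bad)]:
    mae, rmse, r2 = metrics_by_hand(y_true, y_pred)
    print(f"{label}: MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
    # Cross-check against scikit-learn
    assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
    assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
    assert np.isclose(r2, r2_score(y_true, y_pred))
# good: MAE=0.250 RMSE=0.354 R2=0.900
# bad: MAE=2.000 RMSE=2.236 R2=-3.000
```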

## 7. Visualize Predictions vs Actual Values

Let's visualize how well our model predictions match the actual housing prices.

In [None]:
# Scatter plot: Predicted vs Actual
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(y_test, y_test_pred, alpha=0.5, color='blue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Price (×$100k)')
plt.ylabel('Predicted Price (×$100k)')
plt.title('Predicted vs Actual Housing Prices')
plt.legend()
plt.grid(True, alpha=0.3)

# Residual plot
plt.subplot(1, 2, 2)
residuals = y_test - y_test_pred
plt.scatter(y_test_pred, residuals, alpha=0.5, color='green')
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel('Predicted Price (×$100k)')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Average residual: ${np.mean(residuals):.4f}")
print(f"Residual standard deviation: ${np.std(residuals):.4f}")

In [None]:
# Distribution of prediction errors
plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True, bins=50, color='purple')
plt.xlabel('Prediction Error (Actual - Predicted)')
plt.ylabel('Frequency')
plt.title('Distribution of Prediction Errors')
plt.axvline(x=0, color='r', linestyle='--', lw=2, label='Zero Error')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 8. Experiment with Other Regression Models

Let's train and compare multiple regression algorithms to find the best one for our data!

In [None]:
# Define multiple models to compare
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Decision Tree': DecisionTreeRegressor(max_depth=10, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)
}

# Train and evaluate each model
results = []

print("Training and evaluating models...\n")
print("="*80)

for name, model in models.items():
    print(f"\n{name}:")
    print("-" * 80)
    
    # Train
    model.fit(X_train_scaled, y_train)
    
    # Predict
    y_pred = model.predict(X_test_scaled)
    
    # Evaluate
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='r2')
    
    results.append({
        'Model': name,
        'MAE': mae,
        'RMSE': rmse,
        'R² Score': r2,
        'CV R² Mean': cv_scores.mean(),
        'CV R² Std': cv_scores.std()
    })
    
    print(f"  MAE:  ${mae:.4f}")
    print(f"  RMSE: ${rmse:.4f}")
    print(f"  R²:   {r2:.4f} ({r2*100:.2f}%)")
    print(f"  CV R² Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

print("\n" + "="*80)
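One caveat about the loop above: `cross_val_score` receives features that were scaled with statistics from the whole training set, so every validation fold has already influenced the scaler. With `StandardScaler` the effect is usually small, but wrapping the scaler and model in a pipeline avoids it entirely. The following is a sketch on synthetic data from `make_regression` (an assumption made to keep the example self-contained), not a rerun of the comparison.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled training data
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)

# The scaler is refitted inside each CV fold, on that fold's training part only
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")

print(f"CV R² per fold: {pipe_scores.round(3)}")
print(f"Mean: {pipe_scores.mean():.3f} (+/- {pipe_scores.std():.3f})")
```

In the notebook, the same pattern would take the unscaled `X_train` in place of `X_demo`.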

In [None]:
# Create comparison DataFrame
results_df = pd.DataFrame(results).sort_values('R² Score', ascending=False)

print("\n" + "="*80)
print("MODEL COMPARISON SUMMARY (Sorted by R² Score)")
print("="*80)
print(results_df.to_string(index=False))

# Display the best model
best_model = results_df.iloc[0]['Model']
best_r2 = results_df.iloc[0]['R² Score']
print(f"\n🏆 Best Model: {best_model} (R² = {best_r2:.4f})")

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# R² Score comparison
axes[0, 0].barh(results_df['Model'], results_df['R² Score'], color='skyblue')
axes[0, 0].set_xlabel('R² Score')
axes[0, 0].set_title('Model Comparison: R² Score')
axes[0, 0].grid(axis='x', alpha=0.3)

# MAE comparison
axes[0, 1].barh(results_df['Model'], results_df['MAE'], color='lightcoral')
axes[0, 1].set_xlabel('Mean Absolute Error')
axes[0, 1].set_title('Model Comparison: MAE')
axes[0, 1].grid(axis='x', alpha=0.3)

# RMSE comparison
axes[1, 0].barh(results_df['Model'], results_df['RMSE'], color='lightgreen')
axes[1, 0].set_xlabel('Root Mean Squared Error')
axes[1, 0].set_title('Model Comparison: RMSE')
axes[1, 0].grid(axis='x', alpha=0.3)

# CV R² Score with error bars
axes[1, 1].barh(results_df['Model'], results_df['CV R² Mean'], 
                xerr=results_df['CV R² Std'], color='plum', capsize=5)
axes[1, 1].set_xlabel('Cross-Validation R² Score')
axes[1, 1].set_title('Model Comparison: CV R² Score (with std)')
axes[1, 1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

## 9. Feature Importance Analysis

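Tree-based models expose impurity-based feature importances, which indicate how much each feature contributed to reducing prediction error across the trees. They give a quick view of which features matter most for predicting housing prices.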

In [None]:
# Use Random Forest for feature importance
rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Get feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance (Random Forest):")
print(feature_importance.to_string(index=False))

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='Importance', y='Feature', palette='viridis')
plt.title('Feature Importance in Housing Price Prediction')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
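Impurity-based importances are computed from the training data, so it can be worth cross-checking them against permutation importance measured on held-out data. The sketch below does this with scikit-learn's `permutation_importance` on synthetic data. The dataset, the variable names, and the `feature_{idx}` labels are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, only 2 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=400, n_features=5, n_informative=2, noise=5.0, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xtr, ytr)

# Shuffle each feature on the held-out set and measure the drop in R²
perm = permutation_importance(forest, Xte, yte, n_repeats=5, random_state=0)

for idx in perm.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: impurity={forest.feature_importances_[idx]:.3f}, "
          f"permutation={perm.importances_mean[idx]:.3f}")
```

If the two rankings disagree noticeably, that is a hint to treat the impurity-based ranking with some caution.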

## 10. Making Predictions on New Data

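Now let's use the Random Forest model from the previous section to make predictions on new, unseen data! If a different model topped your comparison table, substitute it here.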

In [None]:
# Create example inputs (each row describes a census block group, not a single house)
example_houses = pd.DataFrame({
    'MedInc': [3.0, 5.0, 8.0],            # Median income (in $10,000s)
    'HouseAge': [25.0, 15.0, 10.0],       # Median house age (years)
    'AveRooms': [5.0, 6.5, 8.0],          # Average rooms per household
    'AveBedrms': [1.0, 1.2, 2.0],         # Average bedrooms per household
    'Population': [1200.0, 800.0, 500.0], # Block group population
    'AveOccup': [3.0, 2.5, 2.0],          # Average household members
    'Latitude': [35.0, 37.0, 34.0],       # Block group latitude
    'Longitude': [-120.0, -122.0, -118.0] # Block group longitude
})

print("Example Houses to Predict:")
print(example_houses)

# Scale the features
example_houses_scaled = scaler.transform(example_houses)

# Make predictions with the Random Forest fitted in Section 9 (not necessarily the top-ranked model)
predictions = rf_model.predict(example_houses_scaled)

# Display results
print("\n" + "="*60)
print("PREDICTIONS")
print("="*60)
for i, pred in enumerate(predictions, 1):
    print(f"\nHouse {i}:")
    print(f"  Predicted Price: ${pred:.2f} × 100k = ${pred * 100000:.2f}")
    print(f"  Features: MedInc={example_houses.iloc[i-1]['MedInc']}, "
          f"HouseAge={example_houses.iloc[i-1]['HouseAge']}, "
          f"AveRooms={example_houses.iloc[i-1]['AveRooms']}")

## Summary and Next Steps

### What We Learned:
✅ How to load and explore housing data  
✅ How to preprocess data (scaling, train-test split)  
✅ How to train multiple regression models  
✅ How to evaluate models using MAE, RMSE, and R²  
✅ How to visualize predictions and errors  
✅ How to compare different algorithms  
✅ How to identify important features  
✅ How to make predictions on new data  

### Model Performance Summary:
The Random Forest and Gradient Boosting models typically perform best, achieving test-set R² scores around 0.80-0.82, meaning they explain roughly 80% of the variance in housing prices. Exact numbers depend on the hyperparameters and library versions you use.

### Next Steps to Improve:
1. **Hyperparameter Tuning**: Use GridSearchCV or RandomizedSearchCV
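2. **Feature Engineering**: Create new features (e.g., rooms per person, bedrooms per room). Avoid features derived from the target, such as price per room, because they leak the answer into the inputs.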
3. **Handle Outliers**: Identify and handle outliers in the data
4. **Try Advanced Models**: XGBoost, LightGBM, CatBoost
5. **Ensemble Methods**: Combine multiple models for better predictions
6. **Use Your Own Data**: Swap in housing data from your own area
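As a starting point for the hyperparameter tuning step, here is a minimal `GridSearchCV` sketch. It runs on synthetic data from `make_regression`, and the grid values are arbitrary assumptions chosen to keep the search fast, not recommended settings for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the housing features and target
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Small illustrative grid: 2 x 2 = 4 candidates, each scored with 3-fold CV
param_grid = {"max_depth": [5, 10], "n_estimators": [25, 50]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid, cv=3, scoring="r2"
)
search.fit(Xtr, ytr)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV R²:      {search.best_score_:.3f}")
print(f"Test R²:         {search.score(Xte, yte):.3f}")
```

The test set is used only once, after the search has picked its parameters, so the final score remains an honest estimate.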
