## Explanation of the House Price Prediction Process

This section walks through the entire workflow: the dataset’s features, the transformations applied to them, the models trained, and the predictions produced, along with the rationale behind the main decisions. I have condensed everything into one cell to improve performance during deployment. Some of the visualizations I used when training and evaluating the model are left out here because they already appear in the dashboard.

---

### Key Features and Their Definitions

The model leverages a variety of features extracted from housing records. Each feature contributes to the understanding of a home’s potential market value. Below are some of the core features used, along with their ranges and definitions:

- **1stFlrSF (First Floor SF):**  
  *Range:* 334 – 4692  
  *Description:* Total square feet of the house’s first floor living area.
  
- **2ndFlrSF (Second Floor SF):**  
  *Range:* 0 – 2065  
  *Description:* Total square feet of the house’s second floor living area.
  
- **BedroomAbvGr (Bedrooms Above Grade):**  
  *Range:* 0 – 8  
  *Description:* Number of bedrooms above ground level (excludes basement bedrooms).
  
- **BsmtExposure (Basement Exposure):**  
  *Categories:* None (No Basement), No, Mn, Av, Gd  
  *Description:* Describes walkout or garden-level basement walls. Higher categories (e.g., Gd) indicate better exposure.
  
- **BsmtFinType1 (Basement Finish Type):**  
  *Categories:* None (No Basement), Unf, LwQ, Rec, BLQ, ALQ, GLQ  
  *Description:* Indicates the finish quality of the basement, from unfinished (Unf) to high-quality living quarters (GLQ).
  
- **BsmtFinSF1 (Type 1 Finished Basement SF):**  
  *Range:* 0 – 5644  
  *Description:* Finished basement area in square feet.
  
- **BsmtUnfSF (Unfinished Basement SF):**  
  *Range:* 0 – 2336  
  *Description:* Unfinished portion of the basement area in square feet.
  
- **TotalBsmtSF (Total Basement SF):**  
  *Range:* 0 – 6110  
  *Description:* Total basement area in square feet.
  
- **GarageArea (Garage Area):**  
  *Range:* 0 – 1418  
  *Description:* Size of the garage in square feet.
  
- **GarageFinish (Garage Finish Quality):**  
  *Categories:* None (No Garage), Unf, RFn, Fin  
  *Description:* Interior finish level of the garage.
  
- **GarageYrBlt (Garage Year Built):**  
  *Range:* 1900 – 2010  
  *Description:* The year the garage was constructed.
  
- **GrLivArea (Above Grade Living Area):**  
  *Range:* 334 – 5642  
  *Description:* Total above-ground (non-basement) living area in square feet.
  
- **KitchenQual (Kitchen Quality):**  
  *Categories:* Po, Fa, TA, Gd, Ex  
  *Description:* Rates the kitchen’s quality from poor (Po) to excellent (Ex).
  
- **LotArea (Lot Size):**  
  *Range:* 1300 – 215245  
  *Description:* Total lot size in square feet.
  
- **LotFrontage (Street Frontage):**  
  *Range:* 21 – 313  
  *Description:* Linear feet of street connected to the property.
  
- **MasVnrArea (Masonry Veneer Area):**  
  *Range:* 0 – 1600  
  *Description:* Square feet of masonry veneer on the exterior of the home.
  
- **EnclosedPorch (Enclosed Porch Area):**  
  *Range:* 0 – 286  
  *Description:* Total enclosed porch area in square feet.
  
- **OpenPorchSF (Open Porch Area):**  
  *Range:* 0 – 547  
  *Description:* Total open porch area in square feet.
  
- **OverallCond (Overall Condition):**  
  *Range:* 1 (Very Poor) – 10 (Very Excellent)  
  *Description:* Rates the overall condition of the house.
  
- **OverallQual (Overall Quality):**  
  *Range:* 1 (Very Poor) – 10 (Very Excellent)  
  *Description:* Rates the overall material and finish of the house.
  
- **WoodDeckSF (Wood Deck Area):**  
  *Range:* 0 – 736  
  *Description:* Total wood deck area in square feet.
  
- **YearBuilt (Year Built):**  
  *Range:* 1872 – 2010  
  *Description:* Year the property was originally constructed.
  
- **YearRemodAdd (Year Remodeled):**  
  *Range:* 1950 – 2010  
  *Description:* Year the house was remodeled or had an addition. If no remodeling occurred, it’s the year built.
  
- **SalePrice (Target Variable):**  
  *Range:* 34900 – 755000  
  *Description:* The actual sale price of the property, serving as the target variable for model training.

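The documented ranges and categories can double as a sanity check on new records before they reach the model. The sketch below is illustrative only: `FEATURE_RANGES`, `ALLOWED_CATEGORIES`, and `find_out_of_range` are hypothetical names that do not appear in the notebook, and only a few of the features listed above are included. A flagged value is not necessarily invalid; it simply falls outside what the model saw during training, so its prediction would be an extrapolation.

```python
# Hypothetical helper (not part of the notebook): flag values that fall
# outside the ranges and categories documented above.
FEATURE_RANGES = {
    "1stFlrSF": (334, 4692),
    "GrLivArea": (334, 5642),
    "LotArea": (1300, 215245),
    "OverallQual": (1, 10),
    "YearBuilt": (1872, 2010),
}

ALLOWED_CATEGORIES = {
    "KitchenQual": {"Po", "Fa", "TA", "Gd", "Ex"},
    "GarageFinish": {"None", "Unf", "RFn", "Fin"},
}


def find_out_of_range(record):
    """Return (feature, value) pairs that violate the documented limits."""
    problems = []
    for feature, (low, high) in FEATURE_RANGES.items():
        value = record.get(feature)
        if value is not None and not (low <= value <= high):
            problems.append((feature, value))
    for feature, allowed in ALLOWED_CATEGORIES.items():
        value = record.get(feature)
        if value is not None and value not in allowed:
            problems.append((feature, value))
    return problems


# Made-up record: GrLivArea exceeds the documented maximum and
# "Great" is not a documented KitchenQual category.
example = {"1stFlrSF": 856, "GrLivArea": 6000, "OverallQual": 7, "KitchenQual": "Great"}
print(find_out_of_range(example))
```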
---

### Process Overview

**Key Inputs:**  
- **House Sales Historical Data:** Contains the above features and the `SalePrice` for a large number of properties.  
- **Inherited Houses Data:** Similar structure but without known `SalePrice`, used for producing price predictions.

**Major Steps:**

1. **Data Loading & Initial Inspection:**  
   **Input:** Raw CSV files.  
   **Process:** Read into Pandas, confirm shapes, and inspect initial rows.  
   **Output:** Two DataFrames ready for preprocessing (`house_data`, `inherited_houses`).

2. **Preprocessing & Missing Values:**  
   **Input:** Raw datasets with potential missing data.  
   **Process:**  
   - Impute zero for features where a missing value most likely means the feature is absent (e.g., `EnclosedPorch`).  
   - Fill categorical missing values with mode or appropriate defaults (`KitchenQual` → ‘TA’).  
   - Fill numeric gaps with median values as needed.  
   
   These steps produce complete, consistent datasets suitable for modeling.  
   **Output:** Cleaned DataFrames with no missing values.

3. **Data Transformations (Normalization & Encoding):**  
   **Input:** Cleaned DataFrames.  
   **Process:**  
   - Apply `log1p` to `SalePrice` to form the target `SalePrice_Log`, and apply `log1p` or Box-Cox to features whose absolute skewness exceeds 0.75, for more stable distributions.  
   - Ordinally encode categorical features (e.g., `BsmtExposure`, `KitchenQual`) to preserve meaningful order.  
   
   **Output:** DataFrames with normalized numeric features and encoded categorical variables.

4. **Feature Engineering:**  
   **Input:** Transformed DataFrames.  
   **Process:**  
   - Create `TotalSF` (`TotalBsmtSF` + `1stFlrSF` + `2ndFlrSF`) and `Qual_TotalSF` (`OverallQual` × `TotalSF`) to capture additional value-driven relationships.  
   
   **Output:** Enhanced feature sets that potentially improve predictive accuracy.

5. **Feature Selection:**  
   **Input:** DataFrames with multiple potential predictors.  
   **Process:**  
   - Use a preliminary Random Forest to gauge feature importance.  
   - Keep the 20 features with the highest importance to simplify the model and potentially boost performance.  
   
   **Output:** A refined set of selected features (`selected_features.pkl`) for the final model.

6. **Model Training & Evaluation:**  
   **Input:** Selected features and log-transformed target (`SalePrice_Log`).  
   **Process:**  
   - Split the data into training and test sets.  
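   - Standardize features with `StandardScaler`; this mainly benefits the linear models, though every model is trained on the scaled features.  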
   - Train several models (Linear, Ridge, Lasso, ElasticNet, Gradient Boosting, Random Forest, XGBoost) and compare performance using MAE, RMSE, and R².  
   
   **Output:**  
   - Trained models saved as joblib files.  
   - Evaluation metrics stored in CSV.  
   - Identification of the best model for final deployment.

7. **Predicting on Inherited Houses:**  
   **Input:** `inherited_houses` dataset processed identically to training data.  
   **Process:**  
   - Apply the same cleaning, encoding, scaling, and feature selection steps.  
   - Use trained models to predict house prices (in transformed scale, then converted back to the original scale).  
   
   **Output:** A CSV file containing predictions for the inherited houses and a `final_model.joblib` representing the best-performing model.

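Steps 2 to 4 and step 7 depend on the training data and the inherited houses passing through the same preprocessing. One way to make that explicit is a single function applied to both frames, with fill values learned from the training rows only. This is a minimal sketch under stated assumptions: `fit_fill_values`, `preprocess`, and the toy rows are hypothetical, only a subset of the features is shown, and the skewness transformations are left out for brevity.

```python
import numpy as np
import pandas as pd

# Subset of the ordinal orderings described in the feature definitions.
ORDINAL_MAPPINGS = {
    "KitchenQual": {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "GarageFinish": {"None": 0, "Unf": 1, "RFn": 2, "Fin": 3},
}


def fit_fill_values(train_df):
    """Learn fill values from the training rows only."""
    return {
        "2ndFlrSF": 0,  # missing most likely means no second floor
        "KitchenQual": "TA",
        "GarageFinish": "Unf",
        "LotFrontage": train_df["LotFrontage"].median(),
    }


def preprocess(df, fill_values):
    """Fill, encode, and engineer features; returns a new DataFrame."""
    out = df.fillna(value=fill_values)
    for col, mapping in ORDINAL_MAPPINGS.items():
        out[col] = out[col].map(mapping)
    out["TotalSF"] = out["TotalBsmtSF"] + out["1stFlrSF"] + out["2ndFlrSF"]
    out["Qual_TotalSF"] = out["OverallQual"] * out["TotalSF"]
    return out


# Toy rows for illustration; not taken from the dataset.
train = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, np.nan, 866],
    "TotalBsmtSF": [856, 1262, 920],
    "OverallQual": [7, 6, 7],
    "LotFrontage": [65.0, 80.0, np.nan],
    "KitchenQual": ["Gd", None, "Gd"],
    "GarageFinish": ["RFn", "RFn", None],
})
new = pd.DataFrame({
    "1stFlrSF": [1000],
    "2ndFlrSF": [np.nan],
    "TotalBsmtSF": [800],
    "OverallQual": [5],
    "LotFrontage": [np.nan],
    "KitchenQual": ["TA"],
    "GarageFinish": ["Fin"],
})

fills = fit_fill_values(train)
train_ready = preprocess(train, fills)
new_ready = preprocess(new, fills)  # same function, same fill values
print(new_ready[["LotFrontage", "TotalSF", "Qual_TotalSF"]])
```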
---

### Overall Rationale & Professional Approach

- **Imputation & Transformation Choices:** Fills gaps with domain-appropriate defaults and reduces skew in the inputs and the target, giving the models complete, better-behaved data to learn from.  
- **Ordinal Encoding & Feature Engineering:** Incorporates domain knowledge and logical structures into the dataset, improving signal strength.  
- **Multiple Models & Thorough Evaluation:** Encourages evidence-based selection of the best algorithm, rather than relying on a single approach.  
- **Reproducibility & Documentation:** All intermediate artifacts (e.g., scalers, selected features, trained models) are saved for transparency, reproducibility, and future maintenance.

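The transformation choices rest on two properties: a Box-Cox lambda fitted on the training data can be reused on new data, and both Box-Cox and `log1p` are reversible, so predictions can be mapped back to prices. The following sketch illustrates this with made-up numbers; the arrays are assumptions for demonstration, while `scipy.stats.boxcox`, `scipy.special.inv_boxcox`, `np.log1p`, and `np.expm1` are the library calls involved.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

# Illustrative values only; not taken from the dataset.
train_area = np.array([850.0, 1200.0, 1500.0, 2100.0, 4300.0])
new_area = np.array([900.0, 1800.0])

# Fitting on training data returns the transformed values and the fitted lambda.
train_transformed, lam = boxcox(train_area)

# Passing the stored lambda applies the identical transformation to new data.
new_transformed = boxcox(new_area, lmbda=lam)

# With the same lambda the transformation can be undone.
recovered = inv_boxcox(new_transformed, lam)

# Target: train on log1p(price), then map predictions back with expm1.
prices = np.array([34900.0, 180000.0, 755000.0])
log_prices = np.log1p(prices)
restored_prices = np.expm1(log_prices)

print(np.allclose(recovered, new_area))
print(np.allclose(restored_prices, prices))
```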
In sum, this process exemplifies a structured, professional, and scalable approach to building a predictive model. It leverages domain insights, maintains rigorous standards for data handling, and ensures that the final product is both reliable and easily integrable into a real-world pricing scenario.

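Saving the scaler, the selected feature list, and the model means a later session can predict without retraining. The sketch below shows that round trip on synthetic data. It is an assumption-laden illustration, not the notebook's code: the `Ridge` model, the two-feature dataset, and the temporary directory are stand-ins, and only the file names mirror the artifacts described above.

```python
import pickle
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; values are made up for illustration.
rng = np.random.default_rng(42)
selected_features = ["GrLivArea", "OverallQual"]
X = pd.DataFrame({
    "GrLivArea": rng.uniform(334, 5642, size=50),
    "OverallQual": rng.integers(1, 11, size=50).astype(float),
})
y_log = np.log1p(50_000 + 60 * X["GrLivArea"] + 15_000 * X["OverallQual"])

scaler = StandardScaler().fit(X[selected_features])
model = Ridge(alpha=1.0).fit(scaler.transform(X[selected_features]), y_log)

with tempfile.TemporaryDirectory() as tmp:
    artifact_dir = Path(tmp)
    with open(artifact_dir / "selected_features.pkl", "wb") as f:
        pickle.dump(selected_features, f)
    joblib.dump(scaler, artifact_dir / "scaler.joblib")
    joblib.dump(model, artifact_dir / "final_model.joblib")

    # Later, e.g. in a dashboard: reload the artifacts instead of retraining.
    with open(artifact_dir / "selected_features.pkl", "rb") as f:
        loaded_features = pickle.load(f)
    loaded_scaler = joblib.load(artifact_dir / "scaler.joblib")
    loaded_model = joblib.load(artifact_dir / "final_model.joblib")

# Columns arrive in a different order; the saved list restores the trained order.
new_house = pd.DataFrame({"OverallQual": [7.0], "GrLivArea": [1710.0]})
scaled = loaded_scaler.transform(new_house[loaded_features])
predicted_price = np.expm1(loaded_model.predict(scaled))
print(predicted_price)
```

Selecting columns through the saved feature list before scaling matters because the scaler and model expect columns in the order used during training.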

In [None]:
# house_price_prediction.ipynb

# Import essential libraries
import os
import pandas as pd
import numpy as np
import joblib
import pickle
import warnings
from scipy.stats import boxcox
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, RidgeCV, ElasticNetCV, LassoCV
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from pathlib import Path

# Ignore warnings for clean output
warnings.filterwarnings('ignore')

# Set BASE_DIR to the current working directory for Jupyter Notebooks
BASE_DIR = Path(os.getcwd())

# Define directories
data_dir = BASE_DIR / 'data'
models_dir = data_dir / 'models'  # Subdirectory for models
os.makedirs(models_dir, exist_ok=True)  # Ensure models directory exists

# Set file paths
house_data_file = BASE_DIR / 'raw_data' / 'house_prices_records.csv'
inherited_houses_file = BASE_DIR / 'raw_data' / 'inherited_houses.csv'

# Import datasets
house_data = pd.read_csv(house_data_file)
inherited_houses = pd.read_csv(inherited_houses_file)

print(f"House Data Shape: {house_data.shape}")
print(f"Inherited Houses Shape: {inherited_houses.shape}")

# Display first few rows of the datasets
print("First few rows of house_data:")
print(house_data.head())
print("First few rows of inherited_houses:")
print(inherited_houses.head())

# Apply log transformation to SalePrice
# The sale prices are right-skewed; applying log transformation to normalize the distribution
house_data['SalePrice_Log'] = np.log1p(house_data['SalePrice'])

# Handle missing values in house_data
print("\nHandling missing values in house_data...")

# List of features where missing values likely indicate absence of the feature
zero_fill_features = ['2ndFlrSF', 'EnclosedPorch', 'MasVnrArea', 'WoodDeckSF',
                      'BsmtFinSF1', 'TotalBsmtSF', '1stFlrSF', 'BsmtUnfSF']

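for feature in zero_fill_features:
    # Assign the result back: fillna(..., inplace=True) on a selected column
    # may not update the DataFrame in newer pandas versions
    house_data[feature] = house_data[feature].fillna(0)
    print(f"Filled missing values in {feature} with 0.")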

# Fill missing categorical features with mode or default value
categorical_mode_fill = {
    'BedroomAbvGr': house_data['BedroomAbvGr'].mode()[0],
    'BsmtFinType1': 'None',
    'GarageFinish': 'Unf',
    'BsmtExposure': 'No',
    'KitchenQual': 'TA'
}

for feature, value in categorical_mode_fill.items():
    house_data[feature] = house_data[feature].fillna(value)
    print(f"Filled missing values in {feature} with '{value}'.")

# Fill missing numerical features with median
numerical_median_fill = ['GarageYrBlt', 'LotFrontage', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd']

for feature in numerical_median_fill:
    median_value = house_data[feature].median()
    house_data[feature] = house_data[feature].fillna(median_value)
    print(f"Filled missing values in {feature} with median value {median_value}.")

# Verify that there are no missing values left
print("\nChecking for remaining missing values:")
print(house_data.isnull().sum()[house_data.isnull().sum() > 0])

# Encode categorical features
print("\nEncoding categorical features in house_data...")

# Define mappings for ordinal categorical features based on their definitions
ordinal_mappings = {
    'BsmtFinType1': {'None': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6},
    'KitchenQual': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
    'BsmtExposure': {'None': 0, 'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4},
    'GarageFinish': {'None': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}
}

for col, mapping in ordinal_mappings.items():
    if col in house_data.columns:
        house_data[col] = house_data[col].map(mapping)
        print(f"Encoded {col} using ordinal mapping.")

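# Identify numeric features, excluding the target columns so the skewness
# step below does not transform SalePrice a second time
numeric_feats = house_data.select_dtypes(include=[np.number]).columns.drop(
    ['SalePrice', 'SalePrice_Log'], errors='ignore')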

# Check skewness of numeric features
skewness = house_data[numeric_feats].apply(lambda x: x.skew()).sort_values(ascending=False)
print("\nSkewness of numeric features:")
print(skewness)

# Features with high skewness (threshold can be adjusted)
skewed_features = skewness[abs(skewness) > 0.75].index.tolist()
print("\nFeatures with high skewness (|skewness| > 0.75):")
print(skewed_features)

# Apply log or box-cox transformation to skewed features
print("\nTransforming skewed features in house_data...")

# Dictionary to store lambda values for box-cox transformation
lam_dict = {}

for feat in skewed_features:
    if (house_data[feat] <= 0).any():
        # If the feature has zero or negative values, use log1p transformation
        house_data[feat] = np.log1p(house_data[feat])
        print(f"Applied log1p transformation to {feat}.")
    else:
        # Apply box-cox transformation
        try:
            transformed_data, lam = boxcox(house_data[feat])
            house_data[feat] = transformed_data
            lam_dict[feat] = lam
            print(f"Applied box-cox transformation to {feat} with lambda {lam:.4f}.")
        except ValueError:
            # If box-cox fails, use log1p
            house_data[feat] = np.log1p(house_data[feat])
            print(f"Applied log1p transformation to {feat} (box-cox failed).")

# Save skewed features and lambda values for future use
with open(models_dir / 'skewed_features.pkl', 'wb') as f:
    pickle.dump(skewed_features, f)
with open(models_dir / 'lam_dict.pkl', 'wb') as f:
    pickle.dump(lam_dict, f)

# Feature engineering
print("\nPerforming feature engineering in house_data...")

# Create new features based on domain knowledge
house_data['TotalSF'] = house_data['TotalBsmtSF'] + house_data['1stFlrSF'] + house_data['2ndFlrSF']
print("Created TotalSF feature as sum of TotalBsmtSF, 1stFlrSF, and 2ndFlrSF.")

house_data['Qual_TotalSF'] = house_data['OverallQual'] * house_data['TotalSF']
print("Created Qual_TotalSF feature as product of OverallQual and TotalSF.")

# Prepare data for modeling
print("\nPreparing data for modeling...")

# Drop unnecessary columns
X = house_data.drop(['SalePrice', 'SalePrice_Log'], axis=1, errors='ignore')
y = house_data['SalePrice_Log']

# Define the features based on the provided metadata
feature_list = [
    '1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtExposure', 'BsmtFinType1',
    'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', 'GarageArea', 'GarageFinish',
    'GarageYrBlt', 'GrLivArea', 'KitchenQual', 'LotArea', 'LotFrontage',
    'MasVnrArea', 'EnclosedPorch', 'OpenPorchSF', 'OverallCond', 'OverallQual',
    'WoodDeckSF', 'YearBuilt', 'YearRemodAdd', 'TotalSF', 'Qual_TotalSF'  # Include engineered features
]

# Ensure the features are in X
X = X[feature_list]

# Feature selection using Random Forest
print("\nPerforming feature selection using Random Forest...")

# Use Random Forest to estimate feature importances
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Feature importances from Random Forest:")
print(importances)

# Select top features (e.g., top 20)
selected_features = importances[:20].index.tolist()
print("\nSelected top features for modeling:")
print(selected_features)

# Save selected features for future use
with open(models_dir / 'selected_features.pkl', 'wb') as f:
    pickle.dump(selected_features, f)

# Keep only selected features
X = X[selected_features]

# Split data into training and test sets
print("\nSplitting data into training and test sets...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save train and test data for the dashboard
joblib.dump((X_train, X_test, y_train, y_test), models_dir / 'train_test_data.joblib')

# Scaling features
print("\nScaling features...")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save the scaler for future use
joblib.dump(scaler, models_dir / 'scaler.joblib')

# Model training
print("\nTraining models...")

# Adjusted alpha values for Ridge Regression and Lasso Regression to avoid numerical instability
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': RidgeCV(alphas=np.logspace(-3, 3, 7), cv=5),
    'ElasticNet': ElasticNetCV(alphas=np.logspace(-4, -0.5, 30), l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=10000),
    'Lasso Regression': LassoCV(alphas=np.logspace(-3, -0.5, 30), cv=5, max_iter=10000),
    'Gradient Boosting': GradientBoostingRegressor(
        n_estimators=300, learning_rate=0.05, max_depth=3,
        min_samples_leaf=5, max_features=0.8, random_state=42),
    'Random Forest': RandomForestRegressor(
        n_estimators=100, max_depth=None, max_features='sqrt',
        min_samples_leaf=2, random_state=42),
    'XGBoost': XGBRegressor(
        n_estimators=300, learning_rate=0.05, max_depth=5,
        min_child_weight=3, subsample=0.8, colsample_bytree=0.8, random_state=42)
}

# Model evaluation
print("\nEvaluating models...")
results = {'Model': [], 'MAE': [], 'RMSE': [], 'R² Score': []}

for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train_scaled, y_train)
    model_filename = f"{name.replace(' ', '_').lower()}_model.joblib"
    # Save the trained model
    joblib.dump(model, os.path.join(models_dir, model_filename))
    # Make predictions on the test set
    predictions = model.predict(X_test_scaled)
    # Calculate performance metrics
    y_test_exp = np.expm1(y_test)
    predictions_exp = np.expm1(predictions)
    # Handle any negative predictions due to model limitations
    predictions_exp[predictions_exp < 0] = 0
    mae = mean_absolute_error(y_test_exp, predictions_exp)
    rmse = np.sqrt(mean_squared_error(y_test_exp, predictions_exp))
    r2 = r2_score(y_test_exp, predictions_exp)
    # Store results
    results['Model'].append(name)
    results['MAE'].append(mae)
    results['RMSE'].append(rmse)
    results['R² Score'].append(r2)
    print(f"{name} - MAE: {mae:.2f}, RMSE: {rmse:.2f}, R² Score: {r2:.4f}")

# Create a DataFrame with the results
results_df = pd.DataFrame(results)
print("\nModel Evaluation Results:")
print(results_df)

# Save the results to a CSV file
results_df.to_csv(os.path.join(models_dir, 'model_evaluation.csv'), index=False)

# Save feature importances
# Using the 'importances' Series from Random Forest
feature_importances = importances.reset_index()
feature_importances.columns = ['Feature', 'Importance']
feature_importances.to_csv(os.path.join(models_dir, 'feature_importances.csv'), index=False)
print("\nSaved feature importances to 'feature_importances.csv'.")

# Process inherited houses
print("\nProcessing inherited houses...")

# Handle missing values in inherited_houses
print("Handling missing values in inherited_houses...")
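for feature in zero_fill_features:
    inherited_houses[feature] = inherited_houses[feature].fillna(0)
    print(f"Filled missing values in {feature} with 0.")

for feature, value in categorical_mode_fill.items():
    inherited_houses[feature] = inherited_houses[feature].fillna(value)
    print(f"Filled missing values in {feature} with '{value}'.")

# Take medians from the raw training file: house_data has already been through
# the skewness transformations, so its medians may no longer be on the raw scale
raw_medians = pd.read_csv(house_data_file)[numerical_median_fill].median()
for feature in numerical_median_fill:
    median_value = raw_medians[feature]
    inherited_houses[feature] = inherited_houses[feature].fillna(median_value)
    print(f"Filled missing values in {feature} with median value {median_value}.")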

# Encode categorical features
print("Encoding categorical features in inherited_houses...")
for col, mapping in ordinal_mappings.items():
    if col in inherited_houses.columns:
        inherited_houses[col] = inherited_houses[col].map(mapping)
        print(f"Encoded {col} using ordinal mapping.")

# Transform skewed features before feature engineering, matching the order used for house_data
print("\nTransforming skewed features in inherited_houses...")
for feat in skewed_features:
    if feat in inherited_houses.columns:
        if (inherited_houses[feat] <= 0).any():
            inherited_houses[feat] = np.log1p(inherited_houses[feat])
            print(f"Applied log1p transformation to {feat}.")
        else:
            # Reuse the lambda fitted on house_data, if one was stored
            lam = lam_dict.get(feat)
            if lam is not None:
                try:
                    inherited_houses[feat] = boxcox(inherited_houses[feat], lam)
                    print(f"Applied box-cox transformation to {feat} with lambda {lam:.4f}.")
                except ValueError:
                    inherited_houses[feat] = np.log1p(inherited_houses[feat])
                    print(f"Applied log1p transformation to {feat} (box-cox failed).")
            else:
                inherited_houses[feat] = np.log1p(inherited_houses[feat])
                print(f"Applied log1p transformation to {feat} (no lambda found).")

# Feature engineering on inherited houses, built from the transformed columns as in training
print("Performing feature engineering on inherited_houses...")
inherited_houses['TotalSF'] = inherited_houses['TotalBsmtSF'] + inherited_houses['1stFlrSF'] + inherited_houses['2ndFlrSF']
print("Created TotalSF feature.")
inherited_houses['Qual_TotalSF'] = inherited_houses['OverallQual'] * inherited_houses['TotalSF']
print("Created Qual_TotalSF feature.")

# Ensure the features match
inherited_houses = inherited_houses.reindex(columns=selected_features, fill_value=0)
print("\nReindexed inherited_houses to match selected features.")

# Scaling
print("Scaling inherited houses features...")
inherited_houses_scaled = scaler.transform(inherited_houses)

# Predictions
print("\nMaking predictions on inherited houses...")
predictions_df = pd.DataFrame()
for name, model in models.items():
    predictions_log = model.predict(inherited_houses_scaled)
    predictions_actual = np.expm1(predictions_log)
    # Handle negative predictions
    predictions_actual[predictions_actual < 0] = 0
    # Store predictions
    predictions_df[name] = predictions_actual
    print(f"Predictions made using {name}.")

# Save predictions to CSV
predictions_df.to_csv(os.path.join(models_dir, 'inherited_houses_predictions.csv'), index=False)
print("\nPredictions saved to 'inherited_houses_predictions.csv'.")

# Optional: Display the predictions
print("\nPredictions for Inherited Houses:")
print(predictions_df)

# Save the final model (best performing model)
best_model_name = results_df.sort_values('RMSE').iloc[0]['Model']
print(f"\nBest performing model is {best_model_name}. Saving as final_model.joblib.")
joblib.dump(models[best_model_name], os.path.join(models_dir, 'final_model.joblib'))
