# 🏠 House Price Prediction

Welcome to the **House Price Prediction** project. This notebook builds a predictive model that estimates house prices from property features. It covers data preprocessing, feature engineering, model training, evaluation, and prediction on new data, with the goal of helping Lydia Doe estimate the sale prices of her inherited houses in Ames, Iowa, USA.

---



# Cell 2: Import Essential Libraries

In [1]:
# Importing necessary libraries for data manipulation, model building, and evaluation
import os
import pandas as pd
import numpy as np
import joblib
import pickle
import warnings
from scipy.stats import boxcox

# Machine Learning libraries
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, RidgeCV, ElasticNetCV, LassoCV
from xgboost import XGBRegressor

# Evaluation metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')


# Cell 3: Directory Configuration

In [2]:
# Define the base data directory
data_dir = 'data'

# Define the models directory within the data directory
models_dir = os.path.join(data_dir, 'models')

# Create the models directory if it doesn't exist to store trained models and related artifacts
os.makedirs(models_dir, exist_ok=True)


# Cell 4: Load Datasets

In [None]:
# Define file paths for the datasets
house_data_file = os.path.join(data_dir, 'house_prices_records.csv')
inherited_houses_file = os.path.join(data_dir, 'inherited_houses.csv')

# Import datasets using pandas
house_data = pd.read_csv(house_data_file)
inherited_houses = pd.read_csv(inherited_houses_file)

# Display the shape of the datasets to understand their dimensions
print(f"House Data Shape: {house_data.shape}")
print(f"Inherited Houses Shape: {inherited_houses.shape}")


# Cell 5: Log Transformation of SalePrice

In [4]:
# Apply log transformation to the 'SalePrice' to normalize its distribution
house_data['SalePrice_Log'] = np.log1p(house_data['SalePrice'])
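
The log transform is only useful if predictions can be mapped back to prices. A minimal sketch of the round trip, using made-up prices rather than values from the dataset:

```python
import numpy as np

# Hypothetical sale prices, for illustration only
prices = np.array([50_000.0, 120_000.0, 755_000.0])

log_prices = np.log1p(prices)     # forward transform: log(1 + x)
recovered = np.expm1(log_prices)  # inverse transform: exp(y) - 1

# The inverse should reproduce the original prices up to floating-point error
round_trip_ok = np.allclose(recovered, prices)
print(round_trip_ok)
```

`np.expm1` is the same inverse the notebook applies later when it reports metrics and predictions on the price scale.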

# Cell 6: Handle Missing Values in house_data

In [None]:
print("Handling missing values in house_data...")

# List of features where missing values should be filled with zero
zero_fill_features = [
    '2ndFlrSF', 'EnclosedPorch', 'MasVnrArea', 'WoodDeckSF',
    'BsmtFinSF1', 'TotalBsmtSF', '1stFlrSF', 'BsmtUnfSF'
]

# Fill missing values with zero for the specified features.
# The result is assigned back because chained `fillna(..., inplace=True)` on a
# column is deprecated in recent pandas versions.
for feature in zero_fill_features:
    house_data[feature] = house_data[feature].fillna(0)

# Fill the bedroom count with its most frequent value
house_data['BedroomAbvGr'] = house_data['BedroomAbvGr'].fillna(
    house_data['BedroomAbvGr'].mode()[0]
)

# Fill categorical features with appropriate default values
categorical_defaults = {
    'BsmtFinType1': 'None', 'GarageFinish': 'Unf',
    'BsmtExposure': 'No', 'KitchenQual': 'TA'
}
for col, default in categorical_defaults.items():
    house_data[col] = house_data[col].fillna(default)

# Fill numerical features with median values
median_fill_features = [
    'GarageYrBlt', 'LotFrontage', 'OverallQual',
    'OverallCond', 'YearBuilt', 'YearRemodAdd'
]
for col in median_fill_features:
    house_data[col] = house_data[col].fillna(house_data[col].median())
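
One way to confirm that imputation left no gaps is to count the remaining missing values. A minimal sketch on a made-up three-row frame (column names follow the notebook, values are hypothetical), using a per-column `fillna` dictionary:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; values are invented for illustration
df = pd.DataFrame({
    '2ndFlrSF': [854.0, np.nan, 0.0],
    'LotFrontage': [65.0, np.nan, 80.0],
    'KitchenQual': ['Gd', None, 'TA'],
})

# One fill value per column: zero, the column median, and a default category
fill_values = {
    '2ndFlrSF': 0,
    'LotFrontage': df['LotFrontage'].median(),
    'KitchenQual': 'TA',
}
df = df.fillna(value=fill_values)

# Total number of missing cells left after imputation
remaining = int(df.isna().sum().sum())
print(remaining)
```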


# Cell 7: Encode Categorical Features

In [None]:
print("Encoding categorical features in house_data...")

# Define ordinal mappings for categorical features to convert them into numerical values
ordinal_mappings = {
    'BsmtFinType1': {'None': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6},
    'KitchenQual': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
    'BsmtExposure': {'No': 0, 'Mn': 1, 'Av': 2, 'Gd': 3, 'None': 0},
    'GarageFinish': {'None': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}
}

# Apply the ordinal mappings to the respective columns
for col, mapping in ordinal_mappings.items():
    if col in house_data.columns:
        house_data[col] = house_data[col].map(mapping)
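
`Series.map` with a dictionary returns `NaN` for any label that is missing from the mapping, so a misspelled or unexpected category silently becomes a missing value. A small sketch of a check for this, using a deliberately invalid label:

```python
import pandas as pd

kitchen_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# 'Excellent' is a deliberately bad label that the mapping does not cover
raw = pd.Series(['Gd', 'TA', 'Excellent'])
encoded = raw.map(kitchen_map)

# Labels that were present in the data but absent from the mapping
unmapped = sorted(set(raw[encoded.isna()]))
print(unmapped)
```

Running a check like this before overwriting the column makes it easier to spot categories that need an entry in `ordinal_mappings`.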


# Cell 8: Identify and Transform Skewed Features


In [None]:
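# Identify numerical features in the dataset, excluding the target columns
# so that the skew transformation is applied to predictors only
numeric_feats = house_data.select_dtypes(include=[np.number]).columns.drop(
    ['SalePrice', 'SalePrice_Log'], errors='ignore'
)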

# Calculate skewness for each numerical feature
skewness = house_data[numeric_feats].apply(lambda x: x.skew()).sort_values(ascending=False)

# Select features with absolute skewness greater than 0.75
skewed_features = skewness[abs(skewness) > 0.75].index.tolist()

print(f"Skewed Features: {skewed_features}")

# Apply log or box-cox transformation to reduce skewness
print("Transforming skewed features in house_data...")
lam_dict = {}  # Dictionary to store lambda values for box-cox transformation
for feat in skewed_features:
    if (house_data[feat] <= 0).any():
        # If any value is less than or equal to zero, apply log1p transformation
        house_data[feat] = np.log1p(house_data[feat])
    else:
        try:
            # Apply box-cox transformation and store lambda
            transformed_data, lam = boxcox(house_data[feat])
            house_data[feat] = transformed_data
            lam_dict[feat] = lam
        except ValueError:
            # If box-cox fails, fallback to log1p transformation
            house_data[feat] = np.log1p(house_data[feat])

# Save the skewed features and lambda values for future use
with open(os.path.join(models_dir, 'skewed_features.pkl'), 'wb') as f:
    pickle.dump(skewed_features, f)
with open(os.path.join(models_dir, 'lam_dict.pkl'), 'wb') as f:
    pickle.dump(lam_dict, f)
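
The reason for storing each lambda is that new data must be transformed with the value fitted on the training data, not refitted. A minimal sketch on synthetic, strictly positive data (the arrays are assumptions, not notebook values), with `scipy.special.inv_boxcox` as the inverse:

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

# Synthetic right-skewed, strictly positive "training" feature
rng = np.random.default_rng(0)
train = rng.lognormal(mean=9.0, sigma=0.5, size=200)

# Hypothetical new observations to transform later
new = np.array([5_000.0, 9_000.0, 20_000.0])

# Fit lambda on the training data only
train_t, lam = boxcox(train)

# Reuse the fitted lambda; with lmbda given, boxcox returns only the array
new_t = boxcox(new, lmbda=lam)

# Invert the transform to confirm that nothing was lost
restored = inv_boxcox(new_t, lam)
print(np.allclose(restored, new))
```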


# Cell 9: Feature Engineering


In [None]:
print("Performing feature engineering in house_data...")

# Create new features by combining existing ones.
# Note: neither feature appears in feature_list (Cell 10), so the models below do not use them.
house_data['TotalSF'] = house_data['TotalBsmtSF'] + house_data['1stFlrSF'] + house_data['2ndFlrSF']
house_data['Qual_TotalSF'] = house_data['OverallQual'] * house_data['TotalSF']


# Cell 10: Prepare Data for Modeling

In [None]:
print("Preparing data for modeling...")

# Define the feature list based on metadata
feature_list = [
    '1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtExposure', 'BsmtFinType1',
    'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', 'GarageArea', 'GarageFinish',
    'GarageYrBlt', 'GrLivArea', 'KitchenQual', 'LotArea', 'LotFrontage',
    'MasVnrArea', 'EnclosedPorch', 'OpenPorchSF', 'OverallCond', 'OverallQual',
    'WoodDeckSF', 'YearBuilt', 'YearRemodAdd'
]

# Ensure that only the specified features are included in X
X = house_data.drop(['SalePrice', 'SalePrice_Log'], axis=1, errors='ignore')
X = X[feature_list]

# Define the target variable
y = house_data['SalePrice_Log']


# Cell 11: Feature Selection Using Random Forest


In [None]:
print("Performing feature selection using Random Forest...")

# Initialize Random Forest Regressor for feature importance
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Extract feature importances and sort them
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Select the top 20 features based on importance
selected_features = importances[:20].index.tolist()

print(f"Selected Features: {selected_features}")

# Save the selected features for future reference
with open(os.path.join(models_dir, 'selected_features.pkl'), 'wb') as f:
    pickle.dump(selected_features, f)

# Update X to include only the selected features
X = X[selected_features]
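
The cell above ranks features on the full dataset before the train/test split. An alternative that keeps the test rows out of the selection step is to split first and rank on the training portion only. A sketch on synthetic data, where the column names `f0` to `f5` and the choice of three features are assumptions made for illustration:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing features
X_arr, y_arr = make_regression(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Split first so the test rows never influence feature selection
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_arr, test_size=0.2, random_state=42
)

# Rank features by impurity-based importance on the training split only
rf_demo = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
ranked = pd.Series(rf_demo.feature_importances_, index=X_tr.columns)
ranked = ranked.sort_values(ascending=False)

# Keep the top three features and apply the same selection to both splits
top_k = ranked.head(3).index.tolist()
X_tr_sel, X_te_sel = X_tr[top_k], X_te[top_k]
print(top_k)
```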


# Cell 12: Split Data into Training and Test Sets

In [None]:
print("Splitting data into training and test sets...")

# Split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


# Cell 13: Feature Scaling

In [None]:
print("Scaling features...")

# Initialize StandardScaler for feature normalization
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save the scaler for future use
joblib.dump(scaler, os.path.join(models_dir, 'scaler.joblib'))
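
Fitting the scaler on the training split only means the test data is standardized with the training mean and standard deviation. A tiny worked example with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Mean and standard deviation are learned from the training rows only
sc = StandardScaler().fit(train)
train_scaled = sc.transform(train)
test_scaled = sc.transform(test)

# Training mean is 2 and population std is sqrt(2/3), so 4 maps to 2 / sqrt(2/3)
expected = (4.0 - 2.0) / np.sqrt(2.0 / 3.0)
print(np.isclose(test_scaled[0, 0], expected))
```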


# Cell 14: Model Training

In [None]:
print("Training models...")

# Define a dictionary of models to train
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': RidgeCV(alphas=np.logspace(-4, 4, 10)),
    'ElasticNet': ElasticNetCV(
        alphas=np.logspace(-4, -0.5, 30), l1_ratio=0.5, cv=5, max_iter=10000
    ),
    'Lasso Regression': LassoCV(
        alphas=np.logspace(-4, -0.5, 30), cv=5, max_iter=10000
    ),
    'Gradient Boosting': GradientBoostingRegressor(
        n_estimators=300, learning_rate=0.05, max_depth=3,
        min_samples_leaf=5, max_features=0.8, random_state=42
    ),
    'Random Forest': RandomForestRegressor(
        n_estimators=100, max_depth=None, max_features='sqrt',
        min_samples_leaf=2, random_state=42
    ),
    'XGBoost': XGBRegressor(
        n_estimators=300, learning_rate=0.05, max_depth=5,
        min_child_weight=3, subsample=0.8, colsample_bytree=0.8, random_state=42
    )
}

# Initialize a dictionary to store evaluation results
results = {'Model': [], 'MAE': [], 'RMSE': [], 'R² Score': []}

# Iterate through each model, train it, evaluate, and save the results
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train_scaled, y_train)  # Train the model on scaled training data
    
    # Define a filename for saving the trained model
    model_filename = f"{name.replace(' ', '_').lower()}_model.joblib"
    
    # Save the trained model to the models directory
    joblib.dump(model, os.path.join(models_dir, model_filename))
    
    # Make predictions on the test set
    predictions = model.predict(X_test_scaled)
    
    # Calculate evaluation metrics on the original price scale
    y_true = np.expm1(y_test)
    y_pred = np.expm1(predictions)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    
    # Append the results to the dictionary
    results['Model'].append(name)
    results['MAE'].append(mae)
    results['RMSE'].append(rmse)
    results['R² Score'].append(r2)
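
Because the models are trained on log prices, the metrics are computed after converting both targets and predictions back with `np.expm1`. A small sketch with a hypothetical helper, `dollar_metrics`, and made-up prices chosen so the results can be checked by hand:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def dollar_metrics(y_true_log, y_pred_log):
    """Hypothetical helper: MAE, RMSE and R2 on the original price scale."""
    y_true = np.expm1(y_true_log)
    y_pred = np.expm1(y_pred_log)
    return {
        'MAE': float(mean_absolute_error(y_true, y_pred)),
        'RMSE': float(np.sqrt(mean_squared_error(y_true, y_pred))),
        'R2': float(r2_score(y_true, y_pred)),
    }

# Made-up true and predicted prices, stored on the log scale as in the notebook
y_true_log = np.log1p(np.array([100_000.0, 200_000.0, 300_000.0]))
y_pred_log = np.log1p(np.array([110_000.0, 190_000.0, 300_000.0]))

metrics = dollar_metrics(y_true_log, y_pred_log)
print(metrics)
```

For these numbers the absolute errors are 10,000, 10,000 and 0, so the MAE is about 6,666.67, the RMSE is about 8,164.97, and R² is 0.99.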


# Cell 15: Model Evaluation

In [None]:
# Convert the results dictionary to a pandas DataFrame for better visualization
results_df = pd.DataFrame(results)

print("Model Evaluation Results:")
print(results_df)

# Save the evaluation results to a CSV file in the models directory
results_df.to_csv(os.path.join(models_dir, 'model_evaluation.csv'), index=False)


# Cell 16: Process Inherited Houses

In [None]:
print("Processing inherited houses...")

# Handle missing values in inherited_houses dataset.
# Imputation statistics are taken from the raw training file because columns in
# house_data may already have been skew-transformed in Cell 8, which would put
# their medians on a different scale.
raw_house_data = pd.read_csv(house_data_file)

# Fill missing values with zero for the specified features
for feature in zero_fill_features:
    inherited_houses[feature] = inherited_houses[feature].fillna(0)

# Fill the bedroom count with the most frequent training value
inherited_houses['BedroomAbvGr'] = inherited_houses['BedroomAbvGr'].fillna(
    raw_house_data['BedroomAbvGr'].mode()[0]
)

# Fill categorical features with the same defaults used for house_data
categorical_defaults = {
    'BsmtFinType1': 'None', 'GarageFinish': 'Unf',
    'BsmtExposure': 'No', 'KitchenQual': 'TA'
}
for col, default in categorical_defaults.items():
    inherited_houses[col] = inherited_houses[col].fillna(default)

# Fill numerical features with median values from the raw training data
median_fill_features = [
    'GarageYrBlt', 'LotFrontage', 'OverallQual',
    'OverallCond', 'YearBuilt', 'YearRemodAdd'
]
for col in median_fill_features:
    inherited_houses[col] = inherited_houses[col].fillna(raw_house_data[col].median())

# Encode categorical features using the same mappings as house_data
for col, mapping in ordinal_mappings.items():
    if col in inherited_houses.columns:
        inherited_houses[col] = inherited_houses[col].map(mapping)

# Feature engineering on inherited_houses
inherited_houses['TotalSF'] = (
    inherited_houses['TotalBsmtSF']
    + inherited_houses['1stFlrSF']
    + inherited_houses['2ndFlrSF']
)
inherited_houses['Qual_TotalSF'] = (
    inherited_houses['OverallQual'] * inherited_houses['TotalSF']
)


# Cell 17: Transform Skewed Features in inherited_houses


In [None]:

print("Transforming skewed features in inherited_houses...")

# Replay the training-time transformation for each skewed feature.
# Features with a stored lambda were box-cox transformed in house_data; all others used log1p.
for feat in skewed_features:
    if feat not in inherited_houses.columns:
        continue
    lam = lam_dict.get(feat)
    if lam is not None and not (inherited_houses[feat] <= 0).any():
        # Apply box-cox transformation using the stored lambda
        inherited_houses[feat] = boxcox(inherited_houses[feat], lmbda=lam)
    else:
        # Use log1p when no lambda was stored or when any value is <= 0
        inherited_houses[feat] = np.log1p(inherited_houses[feat])


# Cell 18: Ensure Feature Matching

In [18]:

# Reindex inherited_houses to include only the selected features, filling missing with zero
inherited_houses = inherited_houses.reindex(columns=selected_features, fill_value=0)
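
`reindex` drops extra columns, reorders the rest, and creates any missing column filled with `fill_value`. Since a zero-filled column may not be a sensible default for a real feature, it can help to list which columns were absent. A sketch on a made-up one-row frame (`Extra` is a hypothetical column):

```python
import pandas as pd

# Hypothetical frame with one extra column and one required column missing
frame = pd.DataFrame({'GrLivArea': [1500], 'Extra': [1], 'LotArea': [9000]})
wanted = ['LotArea', 'GrLivArea', 'GarageArea']

# Record which required columns are absent before they are filled with zero
missing = [c for c in wanted if c not in frame.columns]

# Align columns to the required order; absent columns are created as zeros
aligned = frame.reindex(columns=wanted, fill_value=0)
print(missing)
print(aligned)
```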


# Cell 19: Scaling Inherited Houses Features

In [19]:

# Scale the inherited_houses data using the previously fitted scaler
inherited_houses_scaled = scaler.transform(inherited_houses)


# Cell 20: Make Predictions on Inherited Houses


In [None]:

print("Making predictions on inherited houses...")

# Initialize an empty DataFrame to store predictions
predictions_df = pd.DataFrame()

# Iterate through each trained model to make predictions
for name, model in models.items():
    # Predict log sale prices
    predictions_log = model.predict(inherited_houses_scaled)
    
    # Convert log predictions back to original scale
    predictions_actual = np.expm1(predictions_log)
    
    # Handle infinite or NaN values by replacing them with the mean of the predictions
    predictions_actual = np.where(
        np.isfinite(predictions_actual), predictions_actual, np.nan
    )
    predictions_actual = np.nan_to_num(
        predictions_actual, nan=np.nanmean(predictions_actual)
    )
    
    # Add the predictions to the DataFrame
    predictions_df[name] = predictions_actual

# Save the predictions to a CSV file in the models directory
predictions_df.to_csv(
    os.path.join(models_dir, 'inherited_houses_predictions.csv'), index=False
)

print("Predictions saved to 'inherited_houses_predictions.csv'.")
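
The non-finite handling above can be illustrated with a log prediction large enough to overflow `np.expm1`. A minimal sketch with made-up log predictions; the count of replaced values is an addition for illustration, since a replaced prediction usually signals a problem worth inspecting:

```python
import numpy as np

# Hypothetical log-scale predictions; 800.0 overflows when exponentiated
predictions_log = np.array([12.0, 800.0])

# Suppress the overflow warning for this demonstration
with np.errstate(over='ignore'):
    predictions_actual = np.expm1(predictions_log)

# Mark non-finite values as NaN and count them
cleaned = np.where(np.isfinite(predictions_actual), predictions_actual, np.nan)
n_replaced = int(np.isnan(cleaned).sum())

# Replace NaN with the mean of the remaining finite predictions
cleaned = np.nan_to_num(cleaned, nan=np.nanmean(cleaned))
print(n_replaced, cleaned)
```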


# Cell 21: Display Predictions for Inherited Houses


In [None]:

# Display the predictions for inherited houses
print("Predictions for Inherited Houses:")
print(predictions_df)


# 🎉 Conclusion

In this project, we successfully built and evaluated multiple machine learning models to predict house prices based on various features. The steps included data preprocessing, handling missing values, encoding categorical variables, feature engineering, feature selection, model training, and evaluation. Additionally, we applied the trained models to make predictions on a separate dataset of inherited houses.

**Key Takeaways:**
- **Data Preprocessing:** Proper handling of missing values and encoding of categorical features are crucial for model performance.
- **Feature Engineering:** Creating new features can provide additional insights and improve model accuracy.
- **Model Evaluation:** Comparing multiple models using metrics like MAE, RMSE, and R² Score helps in selecting the best-performing model.
- **Scalability:** Saving models and preprocessing objects ensures that the pipeline can be reused for future predictions without retraining.
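
A minimal sketch of the save-and-reload round trip behind the last point, assuming a throwaway scaler and a temporary directory rather than the notebook's `data/models` folder:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data used only to fit a demonstration scaler
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler_demo = StandardScaler().fit(X_demo)

# Persist the fitted scaler, then load it back as a later session would
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'scaler.joblib')
    joblib.dump(scaler_demo, path)
    restored = joblib.load(path)

# The restored object should transform data exactly like the original
same = np.allclose(restored.transform(X_demo), scaler_demo.transform(X_demo))
print(same)
```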

**Next Steps:**
- **Hyperparameter Tuning:** Further optimize model parameters to enhance performance.
- **Cross-Validation:** Implement cross-validation techniques to ensure model robustness.
- **Deployment:** Consider deploying the best-performing model as an API or integrate it into a web application for real-time predictions.
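
As a starting point for the cross-validation step, the sketch below scores a scaler-plus-model pipeline with 5-fold cross-validation. The synthetic data and the plain `Ridge` model are assumptions for illustration; wrapping the scaler in the pipeline means it is refitted on each training fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the scaled housing features and log prices
X_demo, y_demo = make_regression(
    n_samples=150, n_features=5, noise=10.0, random_state=0
)

# The scaler lives inside the pipeline, so each fold fits it on training rows only
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# Shuffled 5-fold split with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='r2')

print(scores.mean(), scores.std())
```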

Thank you for following along with this project! Feel free to reach out with any questions or for further discussion.

---
