
# Model Deployment for House Prices Prediction

## 1. Introduction
In this notebook, we will:
1. Load the saved models from the model building process.
2. Apply the same feature engineering steps to the `test.csv` data as we did for the training set.
3. Ensure that the test dataset's features are aligned with the training dataset by adding any missing columns.
4. Run predictions using the models on the test set and save them in the required submission format.



## 2. Loading the Saved Models and Training Columns
We will load the saved Random Forest, Gradient Boosting, and tuned Gradient Boosting models, as well as the saved column names from the training dataset.


In [7]:

import joblib

# Load the saved models and training column names
loaded_rf = joblib.load('random_forest_model.pkl')
loaded_gb = joblib.load('gradient_boosting_model.pkl')
loaded_best_gb = joblib.load('tuned_gradient_boosting_model.pkl')
train_columns = joblib.load('train_columns.pkl')  # Load saved training columns

print("Models and training columns loaded successfully.")


Models and training columns loaded successfully.
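
As a quick sanity check on the loading step, the sketch below round-trips a small model through `joblib` and confirms that the restored estimator predicts the same values and expects the same number of features. The toy data, the `toy_model.pkl` file name, and the variable names are assumptions made for illustration; they are not the artifacts saved by the model building notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the engineered training set (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
# and expect the same number of input features.
roundtrip_ok = np.allclose(model.predict(X), restored.predict(X))
n_features_ok = restored.n_features_in_ == X.shape[1]
print(roundtrip_ok, n_features_ok)
```

In this notebook, the analogous check would be comparing each loaded model's `n_features_in_` against `len(train_columns)` before predicting.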



## 3. Loading and Applying Feature Engineering to `test.csv` (with Feature Alignment)
We will load the test dataset (`test.csv`), apply the same feature engineering steps, and ensure that the features match those used during training by adding any missing columns.
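
To make the alignment idea concrete, here is a minimal sketch on two toy frames whose categorical levels differ. The `Zone` and `Area` columns are hypothetical, and `reindex` is used here as one possible way to add missing columns, drop unseen ones, and restore the training column order in a single call.

```python
import pandas as pd

# Hypothetical toy frames: the test split lacks category 'C' and has an extra 'D'.
train = pd.DataFrame({"Zone": ["A", "B", "C"], "Area": [10, 20, 30]})
test = pd.DataFrame({"Zone": ["A", "D"], "Area": [15, 25]})

train_enc = pd.get_dummies(train, columns=["Zone"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["Zone"], drop_first=True)

# Column names the model was trained on.
train_cols = list(train_enc.columns)

# reindex adds missing columns (filled with 0), drops columns unseen in
# training, and puts the columns in the training order.
aligned = test_enc.reindex(columns=train_cols, fill_value=0)
print(aligned.columns.tolist())
```

One caveat worth keeping in mind: with `drop_first=True`, the dropped level is determined separately for each frame, so a test set that lacks the training baseline level may drop a different category than the training set did.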


In [9]:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Load the test data
test_data = pd.read_csv('test.csv')

# 1. Handle missing values (same as training set)
numerical_with_nan = [feature for feature in test_data.columns if test_data[feature].isnull().sum() > 0 and test_data[feature].dtypes != 'O']
categorical_with_nan = [feature for feature in test_data.columns if test_data[feature].isnull().sum() > 0 and test_data[feature].dtypes == 'O']

# Impute missing values for numerical features with the median
for feature in numerical_with_nan:
    median_value = test_data[feature].median()
    test_data.loc[:, feature + '_nan'] = np.where(test_data[feature].isnull(), 1, 0)  # Use .loc[] to avoid chaining
    test_data.loc[:, feature] = test_data[feature].fillna(median_value)

# Impute missing values with the mode for categorical features
for feature in categorical_with_nan:
    mode_value = test_data[feature].mode()[0]
    test_data.loc[:, feature + '_nan'] = np.where(test_data[feature].isnull(), 1, 0)
    test_data.loc[:, feature] = test_data[feature].fillna(mode_value)

# 2. Encoding categorical features (one-hot encoding as in training set)
categorical_features = [feature for feature in test_data.columns if test_data[feature].dtype == 'O']
test_data_encoded = pd.get_dummies(test_data, columns=categorical_features, drop_first=True)

# 3. Creating new features (e.g., Age, TotalSF)
test_data_encoded['TotalSF'] = test_data_encoded['TotalBsmtSF'] + test_data_encoded['1stFlrSF'] + test_data_encoded['2ndFlrSF']
test_data_encoded['Age'] = test_data_encoded['YrSold'] - test_data_encoded['YearBuilt']

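# 4. Feature scaling
# Caution: fit_transform() fits a new scaler on the test data, so the scaling
# only approximates the one applied to the training set. If the scaler fitted
# during training was saved, load it and call transform() here instead.
numerical_features_to_scale = ['LotArea', 'GrLivArea', 'TotalSF', 'Age']
scaler = StandardScaler()
test_data_encoded[numerical_features_to_scale] = scaler.fit_transform(test_data_encoded[numerical_features_to_scale])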

# Align the test dataset with the training dataset's features:
# reindex adds any missing columns (filled with 0), drops columns the model
# did not see during training, and restores the training column order
test_data_encoded = test_data_encoded.reindex(columns=train_columns, fill_value=0)

# Preview the aligned test data
test_data_encoded.head()


|   | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | TotalSF | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | 80.0 | 0.363929 | 5 | 6 | 1961 | 1961 | 0.0 | 468.0 | ... | False | False | True | False | False | False | True | False | -0.954072 | 0.413784 |
| 1 | 1462 | 20 | 81.0 | 0.897861 | 6 | 6 | 1958 | 1958 | 108.0 | 923.0 | ... | False | False | True | False | False | False | True | False | 0.164507 | 0.512398 |
| 2 | 1463 | 60 | 74.0 | 0.809646 | 5 | 5 | 1997 | 1998 | 0.0 | 791.0 | ... | False | False | True | False | False | False | True | False | 0.036124 | -0.769577 |
| 3 | 1464 | 60 | 78.0 | 0.032064 | 6 | 6 | 1998 | 1998 | 20.0 | 602.0 | ... | False | False | True | False | False | False | True | False | 0.001804 | -0.802448 |
| 4 | 1465 | 120 | 43.0 | -0.971808 | 8 | 5 | 1992 | 1992 | 0.0 | 263.0 | ... | False | False | True | False | False | False | True | False | 0.039938 | -0.605221 |
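
The cell above calls `fit_transform` on the test data, which computes a fresh mean and standard deviation instead of reusing the training ones. A possible alternative, sketched here under the assumption that the scaler could be saved alongside the models, is to fit on the training frame, persist the scaler, and only call `transform` at prediction time. The toy values and the `scaler.pkl` file name are hypothetical; no such file is produced by the notebook as written.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

cols = ["LotArea", "GrLivArea"]

# Toy frames standing in for the training and test sets (hypothetical values).
train = pd.DataFrame({"LotArea": [8000.0, 9600.0, 11250.0],
                      "GrLivArea": [1200.0, 1700.0, 1800.0]})
test = pd.DataFrame({"LotArea": [9000.0, 20000.0],
                     "GrLivArea": [1500.0, 2500.0]})

# Training time: fit the scaler on the training data only, then save it.
scaler = StandardScaler().fit(train[cols])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.pkl")  # hypothetical file name
    joblib.dump(scaler, path)
    # Prediction time: load the fitted scaler and apply it without refitting.
    loaded_scaler = joblib.load(path)

test_scaled = test.copy()
test_scaled[cols] = loaded_scaler.transform(test[cols])

# For comparison: refitting on the test data yields different values.
refit = StandardScaler().fit_transform(test[cols])
print(np.allclose(test_scaled[cols].to_numpy(), refit))
```

The same reasoning applies to the imputation step: medians and modes computed on the training set can be stored and reused so that both datasets are filled with identical values.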



## 4. Making Predictions using the Loaded Models
We will use the loaded models to make predictions on the aligned test data (`test.csv`).


In [10]:
# Ensure that SalePrice is not in the test dataset
if 'SalePrice' in test_data_encoded.columns:
    test_data_encoded = test_data_encoded.drop('SalePrice', axis=1)
    
# Use the loaded models to make predictions on the aligned test set
rf_predictions = loaded_rf.predict(test_data_encoded)
gb_predictions = loaded_gb.predict(test_data_encoded)
best_gb_predictions = loaded_best_gb.predict(test_data_encoded)

# Display the first few predictions from each model
print(f"Random Forest Predictions: {rf_predictions[:5]}")
print(f"Gradient Boosting Predictions: {gb_predictions[:5]}")
print(f"Tuned Gradient Boosting Predictions: {best_gb_predictions[:5]}")


Random Forest Predictions: [124922.83 167684.56 181902.29 189358.5  209899.46]
Gradient Boosting Predictions: [123127.33932033 159790.25023812 193302.5972056  189742.05597774
 180714.77470896]
Tuned Gradient Boosting Predictions: [123198.5565692  160405.23452153 195114.6230271  191109.26418761
 177549.25069156]
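
Before writing predictions to disk, it can help to run a few basic checks on each prediction vector. The helper below is a hypothetical sketch (the name `check_predictions` and the specific checks are assumptions, not part of the original workflow); it is applied to the first five Random Forest values printed above.

```python
import numpy as np


def check_predictions(preds, n_rows):
    """Return a list of problems found in a prediction vector (empty if none)."""
    preds = np.asarray(preds, dtype=float)
    problems = []
    if preds.shape != (n_rows,):
        problems.append(f"expected shape ({n_rows},), got {preds.shape}")
    if not np.isfinite(preds).all():
        problems.append("contains NaN or infinite values")
    elif (preds <= 0).any():
        problems.append("contains non-positive prices")
    return problems


# First five Random Forest predictions from the output above.
rf_head = [124922.83, 167684.56, 181902.29, 189358.5, 209899.46]
issues = check_predictions(rf_head, n_rows=5)
print(issues)
```

In the notebook itself, the call would presumably look like `check_predictions(rf_predictions, len(test_data))`, repeated for each model.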



## 5. Saving the Predictions in Submission Format
We will save the predictions to CSV files in the required submission format: `Id,SalePrice`.


In [11]:

# Create a DataFrame for predictions
rf_results = pd.DataFrame({'Id': test_data['Id'], 'SalePrice': rf_predictions})
gb_results = pd.DataFrame({'Id': test_data['Id'], 'SalePrice': gb_predictions})
best_gb_results = pd.DataFrame({'Id': test_data['Id'], 'SalePrice': best_gb_predictions})

# Save the predictions to CSV files in submission format
rf_results.to_csv('rf_predictions_submission.csv', index=False)
gb_results.to_csv('gb_predictions_submission.csv', index=False)
best_gb_results.to_csv('best_gb_predictions_submission.csv', index=False)

print("Predictions saved in submission format as 'rf_predictions_submission.csv', 'gb_predictions_submission.csv', and 'best_gb_predictions_submission.csv'.")


Predictions saved in submission format as 'rf_predictions_submission.csv', 'gb_predictions_submission.csv', and 'best_gb_predictions_submission.csv'.
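
A saved submission file can be read back to confirm it has the expected `Id,SalePrice` layout. The sketch below does this for a small toy frame written to a temporary file; the `toy_submission.csv` name is hypothetical, and the three rows reuse the first Ids and Random Forest predictions shown earlier.

```python
import os
import tempfile

import pandas as pd

# Toy results frame built from the first three Ids and RF predictions above.
results = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [124922.83, 167684.56, 181902.29],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_submission.csv")  # hypothetical file name
    results.to_csv(path, index=False)
    check = pd.read_csv(path)

# The file should contain exactly the two expected columns, one row per Id,
# and no missing values.
columns_ok = list(check.columns) == ["Id", "SalePrice"]
ids_unique = check["Id"].is_unique
no_missing = not check.isnull().values.any()
rows_ok = len(check) == len(results)
print(columns_ok, ids_unique, no_missing, rows_ok)
```

Passing `index=False` to `to_csv`, as the cell above does, is what keeps an extra unnamed index column out of the file.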



## 6. Summary and Next Steps
In this notebook, we:
1. Loaded the saved models and the feature columns from the training dataset.
2. Applied the same feature engineering steps to the `test.csv` data as we did for the training data.
3. Ensured the test dataset's features match the training set by adding any missing columns.
4. Ran the models on the aligned test data and saved the predictions in the required submission format.
