# Regression

## Objectives

*   Fit and evaluate a regression model to:
    * Accurately predict house sale prices
    * Meet or exceed an R² score of 0.75 on both training and test sets
    * Be used to predict prices for 4 inherited houses
    * Be deployed in an interactive dashboard for real-time predictions

## Inputs

* outputs/datasets/collection/house_prices_record.csv
* outputs/datasets/collection/inherited_houses.csv: used after model training for prediction only; not part of the training/testing

## Outputs

| Output                                  | Description                                              |
| --------------------------------------- | -------------------------------------------------------- |
| Trained ML model(s)                     | e.g., `LinearRegression`, `RandomForestRegressor`        |
| Model performance metrics               | R², MAE, RMSE for both train and test sets               |
| Predicted prices (for `X_test`)         | Needed for evaluating generalization                     |
| Predicted prices (for inherited houses) | Final client deliverable                                 |


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/housing-prices/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/housing-prices'

---

# Load Data

In [5]:
import pandas as pd

# Load training set
train_set_path = "outputs/datasets/cleaned/train_set.csv"
TrainSet = pd.read_csv(train_set_path)

# Load test set
test_set_path = "outputs/datasets/cleaned/test_set.csv"
TestSet = pd.read_csv(test_set_path)


In [6]:
# Separate features and target for training
X_train = TrainSet.drop(columns=['SalePrice'])
y_train = TrainSet['SalePrice']

# Separate features and target for testing
X_test = TestSet.drop(columns=['SalePrice'])
y_test = TestSet['SalePrice']

---

# Train First Model (Linear Regression)

Install scikit-learn (if needed):

pip install scikit-learn

Encode Remaining Categorical Values

In [11]:
X_train.select_dtypes(include='object').columns
from feature_engine.encoding import OrdinalEncoder

# List of remaining categorical variables
categorical_cols = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']

# Fill missing values in the categorical columns of both sets
for col in categorical_cols:
    X_train[col] = X_train[col].fillna('None')
    X_test[col] = X_test[col].fillna('None')

# Create and fit the encoder
encoder = OrdinalEncoder(encoding_method='arbitrary', variables=categorical_cols)
encoder.fit(X_train)

# Transform both train and test
X_train_encoded = encoder.transform(X_train)
X_test_encoded = encoder.transform(X_test)
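
The idea behind arbitrary ordinal encoding can be sketched in plain Python. This is an illustration only, not feature_engine's implementation: the helper names and the toy `KitchenQual` values are hypothetical, and first-seen ordering is an assumption of this sketch. It shows why the mapping is learned from the training data alone and why a category that was never seen during fitting needs attention before prediction.

```python
def fit_arbitrary_mapping(values):
    """Assign 0, 1, 2, ... to categories in the order they are first seen."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def transform(values, mapping):
    """Encode values; categories absent from the mapping become None."""
    return [mapping.get(value) for value in values]


# Toy training values (hypothetical, not taken from the dataset)
train_kitchen = ["Gd", "TA", "Gd", "Ex", "None"]
mapping = fit_arbitrary_mapping(train_kitchen)

# "Fa" never appeared in training, so it has no code
encoded_test = transform(["TA", "Fa"], mapping)

print(mapping)
print(encoded_test)
```

The integers carry no ranking information; they only need to be applied consistently to the train, test, and new data.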



Train and Evaluate

In [12]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Train model
lr_model = LinearRegression()
lr_model.fit(X_train_encoded, y_train)

# Predict
y_pred_train = lr_model.predict(X_train_encoded)
y_pred_test = lr_model.predict(X_test_encoded)

# Evaluation
def evaluate_model(true, pred, name):
    print(f"{name} Set:")
    print(f"R²: {r2_score(true, pred):.3f}")
    print(f"MAE: {mean_absolute_error(true, pred):,.0f}")
    print(f"RMSE: {np.sqrt(mean_squared_error(true, pred)):.0f}")
    print("-" * 30)

evaluate_model(y_train, y_pred_train, "Train")
evaluate_model(y_test, y_pred_test, "Test")


Train Set:
R²: 0.795
MAE: 20,899
RMSE: 34937
------------------------------
Test Set:
R²: 0.827
MAE: 22,541
RMSE: 36396
------------------------------
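
To make the three metrics concrete, here is a minimal plain-Python sketch of how R², MAE, and RMSE are defined, followed by a check of the reported R² values against the 0.75 target from the objectives. The function names, the toy prices, and the `R2_TARGET` constant are assumptions introduced for illustration; the notebook itself relies on the scikit-learn functions.

```python
import math


def mae(true, pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def rmse(true, pred):
    """Root mean squared error: weights large errors more heavily than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def r2(true, pred):
    """R² = 1 - SS_res / SS_tot, the share of variance explained."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1 - ss_res / ss_tot


# Toy prices (hypothetical, not taken from the dataset)
toy_true = [100, 200, 300]
toy_pred = [110, 190, 310]
toy_scores = (r2(toy_true, toy_pred), mae(toy_true, toy_pred), rmse(toy_true, toy_pred))

# Success criterion from the objectives, applied to the R² values reported above
R2_TARGET = 0.75
reported_r2 = {"train": 0.795, "test": 0.827}
target_met = all(score >= R2_TARGET for score in reported_r2.values())

print(toy_scores)
print(f"R² target met on both sets: {target_met}")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, which is consistent with the gap between the two in the results above.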


In [26]:
import os
import json

# Compute metrics for saving
r2_train = r2_score(y_train, y_pred_train)
mae_train = mean_absolute_error(y_train, y_pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))

r2_test = r2_score(y_test, y_pred_test)
mae_test = mean_absolute_error(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))

# Prepare metrics dictionary
metrics = {
    "r2_train": round(r2_train, 3),
    "mae_train": int(mae_train),
    "rmse_train": int(rmse_train),
    "r2_test": round(r2_test, 3),
    "mae_test": int(mae_test),
    "rmse_test": int(rmse_test)
}

# Save to JSON
os.makedirs("outputs/evaluation", exist_ok=True)
with open("outputs/evaluation/metrics.json", "w") as f:
    json.dump(metrics, f, indent=4)

## Predict Inherited Houses Price

Load `inherited_houses.csv`

In [13]:
inherited_path = "outputs/datasets/collection/inherited_houses.csv"
df_inherited = pd.read_csv(inherited_path)
df_inherited.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd
0,896,0,2,No,468.0,Rec,270.0,0,730.0,Unf,...,11622,80.0,0.0,0,6,5,882.0,140,1961,1961
1,1329,0,3,No,923.0,ALQ,406.0,0,312.0,Unf,...,14267,81.0,108.0,36,6,6,1329.0,393,1958,1958
2,928,701,3,No,791.0,GLQ,137.0,0,482.0,Fin,...,13830,74.0,0.0,34,5,5,928.0,212,1997,1998
3,926,678,3,No,602.0,GLQ,324.0,0,470.0,Fin,...,9978,78.0,20.0,36,6,6,926.0,360,1998,1998


Ensure Inherited Houses Data Matches X_train

In [18]:
from feature_engine.encoding import OrdinalEncoder

# Define categorical columns used in model
categorical_cols = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']

# Select only the categorical columns of X_train
X_train_cat = X_train[categorical_cols]

# Refit encoder on categorical columns only
encoder_cat = OrdinalEncoder(encoding_method='arbitrary', variables=categorical_cols)
encoder_cat.fit(X_train_cat)

# Apply to inherited data
df_inherited_cat = df_inherited[categorical_cols].fillna('None')
df_inherited_encoded_cat = encoder_cat.transform(df_inherited_cat)



Combine with Numeric Features

In [19]:
# Get the numeric columns used in X_train
numerical_cols = X_train.drop(columns=categorical_cols).columns
df_inherited_num = df_inherited[numerical_cols]

# Combine final inputs
X_inherited_final = pd.concat([
    df_inherited_num.reset_index(drop=True),
    df_inherited_encoded_cat.reset_index(drop=True)
], axis=1)

Predict Prices

In [27]:
correct_order = X_train_encoded.columns
X_inherited_final = X_inherited_final[correct_order]
predictions = lr_model.predict(X_inherited_final)
df_inherited['Predicted_SalePrice'] = predictions.round(0)

# Show results
print(df_inherited[['Predicted_SalePrice']])

# Total value
total_value = df_inherited['Predicted_SalePrice'].sum()
print(f"Total Predicted Value of Inherited Houses: ${total_value:,.0f}")
# Save predictions to CSV for the dashboard
import os
os.makedirs("outputs/predictions", exist_ok=True)
df_inherited.to_csv("outputs/predictions/inherited_house_predictions.csv", index=False)


   Predicted_SalePrice
0             126767.0
1             167978.0
2             170321.0
3             198186.0
Total Predicted Value of Inherited Houses: $663,252


## Predict Prices for User House Inputs

Build Predictive Function

In [22]:
def predict_house_price(user_input: dict, model, encoder, feature_order):
    """
    Predict sale price for a new house based on user input.

    Parameters:
        user_input (dict): Raw input values from the user (must include all required features)
        model (sklearn estimator): Trained regression model
        encoder (OrdinalEncoder): Fitted encoder for categorical features
        feature_order (list): List of column names in the correct model input order

    Returns:
        float: Predicted sale price
    """
    import pandas as pd

    # Convert input to DataFrame
    input_df = pd.DataFrame([user_input])

    # Fill missing categorical values with 'None' if needed
    for col in ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']:
        if col in input_df.columns:
            input_df[col] = input_df[col].fillna('None')

    # Encode categorical columns
    input_cat = input_df[['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']]
    input_cat_encoded = encoder.transform(input_cat)

    # Extract numeric columns (from model training)
    numeric_cols = [col for col in feature_order if col not in input_cat.columns]
    input_num = input_df[numeric_cols]

    # Combine numeric + encoded categorical
    input_final = pd.concat([input_num.reset_index(drop=True), input_cat_encoded.reset_index(drop=True)], axis=1)

    # Reorder to match training feature order
    input_final = input_final[feature_order]

    # Predict
    predicted_price = model.predict(input_final)[0]

    return round(predicted_price, 0)

Test the Predictive Function

In [24]:
full_feature_list = X_train_encoded.columns.tolist()

default_input = {
    'GrLivArea': 1500,
    'GarageArea': 400,
    'TotalBsmtSF': 1000,
    '1stFlrSF': 1200,
    'LotArea': 8000,
    '2ndFlrSF': 300,
    'BedroomAbvGr': 3,
    'BsmtFinSF1': 600,
    'BsmtUnfSF': 400,
    'GarageYrBlt': 2005,
    'LotFrontage': 60,
    'MasVnrArea': 100,
    'OpenPorchSF': 40,
    'OverallCond': 5,
    'OverallQual': 6,
    'YearBuilt': 1995,
    'YearRemodAdd': 2005,
    'BsmtExposure': 'Av',
    'BsmtFinType1': 'GLQ',
    'GarageFinish': 'Fin',
    'KitchenQual': 'Gd'
}

predicted_price = predict_house_price(
    user_input=default_input,
    model=lr_model,
    encoder=encoder_cat,
    feature_order=full_feature_list
)

print(f"Predicted Sale Price: ${predicted_price:,.0f}")


Predicted Sale Price: $191,542




## Model Artifact Preservation for Deployment

To support the prediction interface and ensure full reproducibility of the machine learning pipeline, we have saved the following key assets:

1. Trained Model (linear_regression_model.pkl): The fitted Linear Regression model, which reached an R² of 0.795 on the train set and 0.827 on the test set.
2. Ordinal Encoder (ordinal_encoder.pkl): The encoder fitted on categorical variables (BsmtExposure, BsmtFinType1, GarageFinish, KitchenQual) used to ensure consistent input formatting for both historical and new data.
3. Feature Order (feature_order.json): A list of all features used during model training, saved in their original order. This ensures that any future prediction input aligns exactly with what the model expects.
4. User Input Template (default_user_input.json): A complete dictionary of input fields with sample values. This serves as a reusable starting point for the interactive user interface, ensuring users can input values that match the trained pipeline.

These saved components allow seamless integration with the upcoming Streamlit dashboard, enabling both predefined and real-time house price predictions.

In [25]:
import joblib
import json
import os

# Create a models directory if it doesn't exist
os.makedirs("outputs/models", exist_ok=True)

# 1. Save the model
joblib.dump(lr_model, "outputs/models/linear_regression_model.pkl")

# 2. Save the encoder
joblib.dump(encoder_cat, "outputs/models/ordinal_encoder.pkl")

# 3. Save the feature list
feature_list = X_train_encoded.columns.tolist()
with open("outputs/models/feature_order.json", "w") as f:
    json.dump(feature_list, f)

# 4. Save the default input template
default_input = {
    'GrLivArea': 1500,
    'GarageArea': 400,
    'TotalBsmtSF': 1000,
    '1stFlrSF': 1200,
    'LotArea': 8000,
    '2ndFlrSF': 300,
    'BedroomAbvGr': 3,
    'BsmtFinSF1': 600,
    'BsmtUnfSF': 400,
    'GarageYrBlt': 2005,
    'LotFrontage': 60,
    'MasVnrArea': 100,
    'OpenPorchSF': 40,
    'OverallCond': 5,
    'OverallQual': 6,
    'YearBuilt': 1995,
    'YearRemodAdd': 2005,
    'BsmtExposure': 'Av',
    'BsmtFinType1': 'GLQ',
    'GarageFinish': 'Fin',
    'KitchenQual': 'Gd'
}
with open("outputs/models/default_user_input.json", "w") as f:
    json.dump(default_input, f)
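
On the dashboard side, the saved feature order can serve as a checklist before calling the model, since `predict_house_price` selects columns by name and fails on any that are absent. The sketch below is one possible way to do that check, not part of the notebook: `find_input_problems` is a hypothetical helper, and the short JSON strings stand in for the contents of `feature_order.json` and a user's input.

```python
import json


def find_input_problems(user_input, feature_order):
    """Return (missing, unexpected) feature names for a user input dict."""
    missing = [name for name in feature_order if name not in user_input]
    unexpected = [name for name in user_input if name not in feature_order]
    return missing, unexpected


# Stand-ins for the saved files (shortened, hypothetical contents)
feature_order = json.loads('["GrLivArea", "KitchenQual", "OverallQual"]')
user_input = json.loads('{"GrLivArea": 1500, "KitchenQual": "Gd", "YearBuilt": 1995}')

missing, unexpected = find_input_problems(user_input, feature_order)
print(f"Missing features: {missing}")
print(f"Unexpected features: {unexpected}")
```

Reporting missing fields up front gives the user a clearer message than the `KeyError` that pandas would raise during column selection.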



---