# 🧹 Data Preprocessing for Toronto Housing

Now that we've explored the structure of our data, it's time to clean and prepare it for modeling.
We'll handle missing values, encode categorical variables, scale features (optional), and split our dataset into training and test sets.

In [3]:
# 📦 Load Required Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# 📂 Load Cleaned Dataset
df = pd.read_csv('../data/processed/toronto_cleaned_housingdata.csv')
df.head()

| Unnamed: 0 | address | price | details | bedrooms | bathrooms | category | city |
|---|---|---|---|---|---|---|---|
| 0 | 218 Golden Trl, Vaughan, ON L6A 5A1 | 1199000.0 | 3 bds3 ba- Townhouse for sale | 3.0 | 3.0 | Townhouse | Vaughan |
| 1 | 24 Sicilia St, Vaughan, ON L4H 1G3 | 1399999.0 | 4 bds4 ba- House for sale | 4.0 | 4.0 | House | Vaughan |
| 2 | 81 Mahogany Forest Dr, Vaughan, ON L6A 0T1 | 1258800.0 | 4 bds4 ba- House for sale | 4.0 | 4.0 | House | Vaughan |
| 3 | 99 Abner Miles Dr, Vaughan, ON L6A 4X4 | 2299000.0 | 5 bds6 ba- House for sale | 5.0 | 6.0 | House | Vaughan |
| 4 | 26 Bruce St #E17, Vaughan, ON L4L 0H4 | 649999.0 | 2 bds2 ba- Condo for sale | 2.0 | 2.0 | Condo | Vaughan |


## 🔻 Step 1: Drop Rows with Missing Price
We’ll drop rows where the price is missing, since price is our target variable.

In [4]:
df = df.dropna(subset=['price'])
print(f"✅ Remaining rows after dropping missing prices: {len(df)}")

✅ Remaining rows after dropping missing prices: 3903
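
As a small illustration of what `dropna(subset=['price'])` does, here is a sketch on a made-up three-row frame (the rows are hypothetical, not taken from the dataset). Only rows whose `price` is missing are removed; missing values in other columns are left for later steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "price": [500000.0, np.nan, 750000.0],
    "bedrooms": [2.0, 3.0, np.nan],
})

n_before = len(toy)
toy_clean = toy.dropna(subset=["price"])  # drop only rows with a missing target
n_dropped = n_before - len(toy_clean)

print(f"Dropped {n_dropped} of {n_before} rows")
print(toy_clean["bedrooms"].isna().sum(), "missing bedroom value(s) remain")
```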


## 🛏️ Step 2: Impute Missing Bedrooms and Bathrooms
We'll fill missing bedroom and bathroom counts with the column median. The median is robust to outliers, which makes it a reasonable choice for skewed counts like these.

In [5]:
bed_median = df['bedrooms'].median()
bath_median = df['bathrooms'].median()

# Fill both columns in one call and assign the result back; the chained
# df[col].fillna(..., inplace=True) form is deprecated and stops working in pandas 3.0
df = df.fillna({'bedrooms': bed_median, 'bathrooms': bath_median})

df[['bedrooms', 'bathrooms']].isnull().sum()


bedrooms     0
bathrooms    0
dtype: int64
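
To see why the median is often preferred over the mean for skewed counts, consider this sketch with a made-up series containing one unusually large value (the numbers are illustrative assumptions, not values from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical bedroom counts with one outlier and one missing value
beds = pd.Series([1, 2, 2, 3, 3, 3, 4, 20, np.nan])

mean_fill = beds.mean()      # pulled upward by the outlier
median_fill = beds.median()  # depends only on the middle of the sorted values

# Assign the result back rather than using inplace=True on a column selection
beds_filled = beds.fillna(median_fill)

print(f"mean={mean_fill}, median={median_fill}")
```

Here the single large value drags the mean well above the typical count, while the median stays at a value most listings actually have.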

## 🏷️ Step 3: Encode Categorical Features
We’ll apply one-hot encoding to the `category` and `city` columns, and impute and standardize the numeric columns, combining both in a single `ColumnTransformer`.

In [6]:
# Define features and target
X = df[['bedrooms', 'bathrooms', 'category', 'city']]
y = df['price']

# Preprocessing pipeline
numeric_features = ['bedrooms', 'bathrooms']
categorical_features = ['category', 'city']

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
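
The `handle_unknown='ignore'` setting matters once the data is split, because the test set may contain a value the encoder never saw. A minimal sketch of that behavior, using made-up category values (assumptions for illustration only):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category values, for illustration only
train_cats = np.array([["House"], ["Condo"], ["Townhouse"]], dtype=object)
new_cats = np.array([["Condo"], ["Duplex"]], dtype=object)  # "Duplex" was never seen in fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cats)

# The default output is a SciPy sparse matrix; densify it for printing
encoded = encoder.transform(new_cats).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)
```

The unseen value is encoded as a row of zeros instead of raising an error, so transforming the test set does not fail on a city or category that is absent from the training set.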

In [7]:
import joblib

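# Save the preprocessor definition. Note: it is still unfitted here (fitting happens
# in Step 5), so this file holds no learned medians, scaling statistics, or categories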
joblib.dump(preprocessor, "../data/models/preprocessor.pkl")

['../data/models/preprocessor.pkl']
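
Keep in mind that `joblib.dump` stores an object in whatever state it is in when called. The sketch below uses a throwaway `StandardScaler` and a temporary directory (both are assumptions for illustration) to show that an estimator dumped before fitting is still unfitted after loading, while one dumped after fitting can be used directly.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted


def is_fitted(estimator):
    """Return True if scikit-learn considers the estimator fitted."""
    try:
        check_is_fitted(estimator)
        return True
    except NotFittedError:
        return False


data = np.array([[1.0], [2.0], [3.0]])  # hypothetical values

with tempfile.TemporaryDirectory() as tmp:
    path_before = os.path.join(tmp, "scaler_before_fit.pkl")
    path_after = os.path.join(tmp, "scaler_after_fit.pkl")

    scaler = StandardScaler()
    joblib.dump(scaler, path_before)  # saved before fitting
    scaler.fit(data)
    joblib.dump(scaler, path_after)   # saved after fitting

    loaded_before = joblib.load(path_before)
    loaded_after = joblib.load(path_after)

print(is_fitted(loaded_before), is_fitted(loaded_after))
```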

## 🧪 Step 4: Split into Training and Test Sets
We’ll use an 80/20 split for training and evaluation.

In [8]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")

Training set: (3122, 4), Test set: (781, 4)
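
The reported shapes follow from how `train_test_split` rounds: with a fractional `test_size`, the test count is rounded up and the remainder goes to training. A quick check with a placeholder array of the same length (the array contents are an assumption; only the row count matters):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 3903  # row count reported after dropping missing prices
placeholder_X = np.zeros((n_rows, 4))
placeholder_y = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    placeholder_X, placeholder_y, test_size=0.2, random_state=42
)

expected_test = math.ceil(n_rows * 0.2)  # 780.6 rounds up to 781
expected_train = n_rows - expected_test  # 3903 - 781 = 3122

print(X_tr.shape, X_te.shape)
```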


## ⚙️ Step 5: Fit Preprocessing Pipeline to Training Data

In [9]:
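# Fit on the training data only, then apply the same fitted transform to the test data
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Overwrite the earlier dump so the saved preprocessor is the fitted one
joblib.dump(preprocessor, "../data/models/preprocessor.pkl")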

print(f"✅ Transformed training data shape: {X_train_processed.shape}")
print(f"✅ Transformed test data shape: {X_test_processed.shape}")

✅ Transformed training data shape: (3122, 41)
✅ Transformed test data shape: (781, 41)


In [10]:
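import numpy as np
from scipy import sparse

# The ColumnTransformer output may be a SciPy sparse matrix (one-hot output is sparse
# by default); np.save would pickle that as an object array, so densify it first
def to_dense(matrix):
    return matrix.toarray() if sparse.issparse(matrix) else np.asarray(matrix)

np.save("../data/processed/X_train_processed.npy", to_dense(X_train_processed))
np.save("../data/processed/X_test_processed.npy", to_dense(X_test_processed))
np.save("../data/processed/y_train.npy", y_train)
np.save("../data/processed/y_test.npy", y_test)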



In [11]:
import joblib

joblib.dump(X_train_processed, "../data/processed/X_train_processed.pkl")
joblib.dump(X_test_processed, "../data/processed/X_test_processed.pkl")
joblib.dump(y_train, "../data/processed/y_train.pkl")
joblib.dump(y_test, "../data/processed/y_test.pkl")
print("✅ Preprocessed data saved successfully.")

✅ Preprocessed data saved successfully.
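
A later notebook can read these artifacts back with `joblib.load`. The following round-trip sketch uses a temporary directory and a made-up array instead of the real files, so the file name and values are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical stand-in for a processed feature matrix
features = np.array([[0.5, 1.0, 0.0], [-0.5, 0.0, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_example.pkl")
    joblib.dump(features, path)   # same call pattern as the cell above
    restored = joblib.load(path)  # what a downstream notebook would run

print(np.array_equal(features, restored))
```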
