<a href="https://www.kaggle.com/code/jhtkoo0426/house-prices-prediction?scriptVersionId=148376212" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction
We will predict house sale prices using a scikit-learn pipeline that streamlines preprocessing, training and analysis.

## 0. Prerequisites
Import the required packages and declare the global variables used throughout the notebook.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import zscore

from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Preprocessing
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Models
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor, DMatrix
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

In [2]:
# Fixed variables
RANDOM_STATE = 12

# Path variables
TRAIN_CSV_PATH = "/kaggle/input/house-prices-advanced-regression-techniques/train.csv"
TEST_CSV_PATH = "/kaggle/input/house-prices-advanced-regression-techniques/test.csv"

## 1. Data Preprocessing
Before processing the dataset, we take a closer look at its features to understand what they represent and how significant they are. This includes exploring potential correlations between columns.

In [3]:
data = pd.read_csv(TRAIN_CSV_PATH)
print("Number of features in the dataset:", len(data.columns))

Number of features in the dataset: 81


### 1.1 Identifying & cleaning null features
Before preprocessing the dataset or applying feature engineering to specific columns, we first need to deal with null values. Let's start by identifying all columns that contain them.

In [4]:
# Calculate the percentage of null values in each column
null_columns = data.isna().mean().sort_values(ascending=False)
null_columns[null_columns > 0]

PoolQC          0.995205
MiscFeature     0.963014
Alley           0.937671
Fence           0.807534
MasVnrType      0.597260
FireplaceQu     0.472603
LotFrontage     0.177397
GarageYrBlt     0.055479
GarageCond      0.055479
GarageType      0.055479
GarageFinish    0.055479
GarageQual      0.055479
BsmtFinType2    0.026027
BsmtExposure    0.026027
BsmtQual        0.025342
BsmtCond        0.025342
BsmtFinType1    0.025342
MasVnrArea      0.005479
Electrical      0.000685
dtype: float64

We will deal with these columns in 2 ways:
1. Some columns, such as the numerical `GarageYrBlt`, have relatively low proportions of `null` values (<10%). In this case, we replace those `null` values with a fixed value (0 or the mean of the column).
2. Some columns, such as `PoolQC`, have relatively high proportions of `null` values (>45%). Retaining them would add little to our analysis, so we delete all such columns from the dataset. We also delete `Id`, which contains no `null` values but is only a row identifier.

Considering these two cases together, we create the first two classes in our ML pipeline. `DropFeatures` drops the columns with a large proportion of `null` values, and `CleanFeatures` replaces the `null` values in the remaining columns with 0. This catch-all fill also covers columns that fall between the two thresholds, such as `LotFrontage`.

In [5]:
class DropFeatures(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X = X.copy()
        
        # Drop all columns with a large proportion of NULL values
        col_set1 = ['Id', 'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'MasVnrType', 'FireplaceQu']
        X = X.drop(col_set1, axis=1)
        return X
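
The hard-coded list in `DropFeatures` could also be derived from the null ratios computed earlier. The following is a minimal sketch, not part of the original notebook: the helper name `high_null_columns`, the toy frame and its values are assumptions made for illustration, and the 0.45 threshold is taken from the text above.

```python
import numpy as np
import pandas as pd


def high_null_columns(df: pd.DataFrame, threshold: float = 0.45) -> list:
    """Return the names of columns whose share of null values exceeds `threshold`."""
    null_ratio = df.isna().mean()
    return null_ratio[null_ratio > threshold].index.tolist()


# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],   # 75% null
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],  # 25% null
})

# `Id` is dropped because it is an identifier, not because of null values.
to_drop = ["Id"] + high_null_columns(toy)
cleaned = toy.drop(columns=to_drop)
print(cleaned.columns.tolist())
```

Deriving the list this way keeps the dropped columns in sync with the threshold, at the cost of making the transformer depend on the data it was fitted on.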

In [6]:
class CleanFeatures(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        return self
    
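    def transform(self, X, y=None):
        X = X.copy()                      # Avoid modifying the caller's DataFrame
        
        # Columns with a small proportion of NULL values
        col_set2 = ['GarageYrBlt', 'GarageCond', 'GarageType', 'GarageFinish', 'GarageQual', 'MasVnrArea', 'Electrical']
        X[col_set2] = X[col_set2].fillna(0)
        X = X.fillna(0)                   # Replace any remaining NULL values with 0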
        return X
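
`CleanFeatures` fills every gap with 0. The mean-based alternative mentioned earlier can be compared on a small example. This is only an illustrative sketch: the `LotFrontage` values below are made up, and the notebook itself does not use mean imputation.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry.
lot_frontage = pd.Series([60.0, np.nan, 80.0, 70.0], name="LotFrontage")

# Option 1: replace nulls with 0 (what CleanFeatures does).
zero_filled = lot_frontage.fillna(0)

# Option 2: replace nulls with the column mean; `mean()` skips NaN by default.
mean_filled = lot_frontage.fillna(lot_frontage.mean())

print(zero_filled.tolist())
print(mean_filled.tolist())
```

Filling with 0 places the missing rows far from the observed values, whereas the mean keeps them in the middle of the observed range.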

### 1.2 Identifying & encoding non-numerical features
Non-numerical features must be processed before performing regression or other machine learning techniques. Common ways of dealing with non-numerical features include **one-hot encoding** and **label encoding**.

In [7]:
class EncodeNonNumericals(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        return self
    
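    def transform(self, X, y=None):
        X = X.copy()                      # Avoid modifying the caller's DataFrame
        non_numerical_columns = X.select_dtypes(object).columns
        label_encoder = LabelEncoder()
        
        # NOTE: the encoder is refit on whatever data is passed in, so the
        # integer codes are only guaranteed to be consistent within one call.
        for column in non_numerical_columns:
            X[column] = X[column].astype(str)
            X[column] = label_encoder.fit_transform(X[column])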
        return X
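
Because `EncodeNonNumericals` fits its encoder inside `transform`, the same category can receive different codes in the training and test sets. One way to avoid this is to learn the mapping once and reuse it. The sketch below uses plain pandas. The helper names `fit_category_codes` and `apply_category_codes`, the `-1` code for unseen categories and the toy values are all assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def fit_category_codes(train: pd.DataFrame, columns) -> dict:
    """Learn one {category: integer code} mapping per listed column of `train`."""
    mappings = {}
    for column in columns:
        # Sorted order mirrors how LabelEncoder assigns codes.
        categories = sorted(train[column].astype(str).unique())
        mappings[column] = {category: code for code, category in enumerate(categories)}
    return mappings


def apply_category_codes(df: pd.DataFrame, mappings: dict, unknown: int = -1) -> pd.DataFrame:
    """Encode `df` with previously learned mappings; unseen categories become `unknown`."""
    df = df.copy()
    for column, mapping in mappings.items():
        df[column] = df[column].astype(str).map(mapping).fillna(unknown).astype(int)
    return df


# Toy data: "Dirt" never appears in the training frame.
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
test = pd.DataFrame({"Street": ["Grvl", "Dirt"]})

mappings = fit_category_codes(train, ["Street"])
print(apply_category_codes(train, mappings)["Street"].tolist())
print(apply_category_codes(test, mappings)["Street"].tolist())
```

In a scikit-learn transformer, the first function would belong in `fit` and the second in `transform`.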

### 1.3 Dealing with outliers
In this section, we will use the **interquartile range** (IQR) to identify values that lie far from the bulk of each numerical column. This will form the next phase in our pipeline.

Note: At the moment, we do not consider whether the data is **skewed** or not. This is something that we can improve upon in further versions.

We will treat any value more than 1.5 × IQR below the 25th percentile or above the 75th percentile as an outlier. Such values are capped at the corresponding limit rather than removed from the dataset.

In [8]:
class RemoveOutliers(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X = X.copy()
        numerical_cols = X.select_dtypes(np.number).columns
        for column in numerical_cols:
            # Compute the IQR and the capping limits
            percentile25 = X[column].quantile(0.25)
            percentile75 = X[column].quantile(0.75)
            iqr = percentile75 - percentile25
            upper_limit = percentile75 + 1.5 * iqr
            lower_limit = percentile25 - 1.5 * iqr
            X[column] = np.where(
                X[column] > upper_limit,
                upper_limit,
                np.where(
                    X[column] < lower_limit,
                    lower_limit,
                    X[column]
                )
            )
        return X
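
The capping logic in `RemoveOutliers` can be traced on a small series. This is an illustrative sketch with made-up values, using the same quantiles and the same 1.5 × IQR limits as the class above.

```python
import numpy as np
import pandas as pd

# Made-up data with one extreme value.
values = pd.Series([1, 2, 3, 4, 100], name="toy")

q1, q3 = values.quantile(0.25), values.quantile(0.75)  # 2.0 and 4.0
iqr = q3 - q1                                          # 2.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # -1.0 and 7.0

# Values outside [lower, upper] are replaced by the nearest limit.
capped = np.where(values > upper, upper, np.where(values < lower, lower, values))
print(capped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

The extreme value is pulled in to the upper limit while every row is kept, so the dataset does not shrink.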
        

### 1.4 Build data processing pipeline

In [9]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('drop', DropFeatures()),
    ('clean', CleanFeatures()),
    ('encode', EncodeNonNumericals()),
    ('outliers', RemoveOutliers()),
])

### 1.5 Process and clean input dataset

In [10]:
data = pipeline.fit_transform(data)

In [11]:
X = data.drop(['SalePrice'], axis=1)
y = data['SalePrice']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1168, 73), (292, 73), (1168,), (292,))

## 2. Model training & selection

In [12]:
def train_models(X_train, X_test, y_train, y_test):
    model_dict = {
        "linear": LinearRegression(),
        "Ridge": Ridge(alpha=0.2),
        "KNN": KNeighborsRegressor(n_jobs=-1, n_neighbors=4),
        "XGB": XGBRegressor(random_state=RANDOM_STATE),
        "light": LGBMRegressor(random_state=RANDOM_STATE),
        "Cat": CatBoostRegressor(random_state=RANDOM_STATE, loss_function='RMSE', verbose=False)
    }

    for model_name, model in model_dict.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        # RMSE on the held-out split. Taking np.sqrt of the MSE avoids the
        # `squared=False` argument, which is deprecated in newer scikit-learn releases.
        rmse = np.sqrt(mean_squared_error(y_test, pred))
        print(f"Validation RMSE for model {model_name}: {rmse}")

In [13]:
train_models(X_train, X_test, y_train, y_test)

Validation RMSE for model linear: 21647.798258956846
Validation RMSE for model Ridge: 21647.0963611605
Validation RMSE for model KNN: 33367.48808795419
Validation RMSE for model XGB: 20787.48576858588
Validation RMSE for model light: 19266.266860050393
Validation RMSE for model Cat: 18342.9744347733


We conclude that the `CatBoostRegressor` model performs the best on our dataset, with the lowest error on the held-out split. Therefore, we will train it for the submission.
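
The scores reported above are root mean squared errors (RMSE) computed on the held-out split, so lower is better and the unit is the same as `SalePrice`. A minimal sketch of the computation follows. The helper name `rmse` and the example prices are assumptions for illustration only.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up prices and predictions, purely for illustration.
actual = [100_000, 200_000, 300_000]
predicted = [110_000, 190_000, 330_000]
print(round(rmse(actual, predicted), 2))
```

Because residuals are squared before averaging, a single large miss (30,000 here) weighs more heavily than several small ones.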

## 3. Create submission

In [14]:
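test_df = pd.read_csv(TEST_CSV_PATH)
submission_df = test_df[['Id']].copy()   # Copy so that adding a column later does not write to a view
test_df = pipeline.transform(test_df)    # The pipeline steps learn nothing in fit, so there is nothing to refit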
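
Before predicting, it can be worth checking that the processed test features have the same columns, in the same order, as the features the model was trained on. The sketch below is an assumption-laden illustration rather than part of the original notebook: `align_features` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd


def align_features(features: pd.DataFrame, reference_columns) -> pd.DataFrame:
    """Reorder `features` to match `reference_columns`, failing loudly if any are missing."""
    missing = [c for c in reference_columns if c not in features.columns]
    if missing:
        raise ValueError(f"Missing columns in test features: {missing}")
    return features[list(reference_columns)]


# Toy stand-ins for the training column order and the processed test frame.
toy_train_columns = ["LotArea", "OverallQual"]
toy_test = pd.DataFrame({"OverallQual": [5, 7], "LotArea": [8450, 9600]})

aligned = align_features(toy_test, toy_train_columns)
print(aligned.columns.tolist())
```

In this notebook the call would look like `test_df = align_features(test_df, X_train.columns)`.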

In [15]:
model = CatBoostRegressor(random_state=RANDOM_STATE, verbose=False, loss_function='RMSE')
model.fit(X_train, y_train)

predictions = model.predict(test_df)
predictions.shape

(1459,)

In [16]:
submission_df['SalePrice'] = predictions
submission_df = submission_df.set_index('Id')
submission_df.to_csv("submission.csv", encoding='utf-8')   # to_csv returns None when given a path, so there is nothing to assign

In [17]:
!head submission.csv

Id,SalePrice
1461,125530.27204058287
1462,159694.42638339094
1463,177342.2827569663
1464,192303.16825895087
1465,184358.42069772468
1466,178485.30339587128
1467,176875.5944243623
1468,167011.35155374094
1469,174933.45743636764


## Appendix: References
- https://www.kaggle.com/code/theusman/easy-and-accurate-model-among-top-20#Feature-Droping-😥😥❄❄