In [1]:
import pandas as pd
import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, KFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, root_mean_squared_error
from category_encoders import TargetEncoder

# 1.0 Introduction

### Project Context
This notebook initiates a series dedicated to the Machine Learning component of the project. An exploratory data analysis (EDA) has been previously conducted and is available on GitHub alongside this code. This EDA provides foundational insights for the project, outlining essential preprocessing steps such as identifying columns for removal, addressing anomalies in the Price column, and initial assessments of feature characteristics.  
<br>
This dataset contains information about Airbnb properties in the London region as of September 2022. As a snapshot in time, it does not provide information on price variations over months. The dataset includes details such as nightly prices, locations, host information, reviews, property types (private or shared), amenities, and more.  
<br>
The dataset was acquired from Kaggle:  
https://www.kaggle.com/datasets/mrnabiz/detailed-airbnb-listing-data-london-sep-2022  
The underlying public data comes from Inside Airbnb and can be accessed at the original source: http://insideairbnb.com/  

<span style="font-size: smaller;">**Note**:
This project is primarily intended for learning purposes. Initially, the EDA phase encompassed the entire dataset because there was no plan for a subsequent machine learning component. As per best practices, EDA should ideally be conducted exclusively on the training dataset to ensure model validity and reliability for production deployment. Conducting EDA on the entire dataset may lead to evaluation metrics that overestimate the model's generalization to new, unseen data. Despite this initial approach, the project continues with awareness of these implications, prioritizing learning and understanding.
</span> 

### Development Plan
As we continue focusing on the Machine Learning component of the project, we are establishing the framework for model evaluation. The target variable will be the Price of each property, and Mean Absolute Error has been selected as the primary evaluation metric for the models. This choice followed the identification of outliers and positively skewed distributions during the EDA.  
<br>
The primary objective of this initial notebook is to **establish a simple baseline model**. This model will serve as a reference for evaluating more advanced models that are expected to achieve lower Mean Absolute Error values. To create this baseline, we will employ a Dummy Regressor.  
<br>
Additionally, we will **develop a comprehensive data processing pipeline that includes feature engineering and preprocessing stages**. While these steps do not impact the performance of a Dummy Regressor, they are crucial for evaluating more advanced machine learning models in later stages of the project.

### Data import and initial processing

Based on the EDA, some columns are deemed unnecessary and will therefore be dropped.

In [2]:
df = pd.read_csv('../21 LondonAirbnb - EDA/clean_df.csv',
                 parse_dates=[
                     'first_review',
                     'last_review',
                     'host_since',
                 ])
df.drop(columns=[
    'id',
    'listing_url',
    'neighborhood_overview',
    'host_id',
    'host_url',
    'host_name',
    'host_picture_url',
    'property_type',
    'calculated_host_listings_count',
    'calculated_host_listings_count_entire_homes',
    'calculated_host_listings_count_private_rooms',
    'calculated_host_listings_count_shared_rooms',
    'minimum_nights_avg_ntm',
    'maximum_nights_avg_ntm',
    'host_location',
    'host_about',
    ], inplace = True)

The EDA showed that null and zero ('$0.00') Price values are rare, so the affected rows will be dropped.  
The remaining Price values will then be converted to floats.

In [3]:
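df['price'] = df['price'].replace('$0.00', np.nan)
df = df.dropna(subset=['price']).copy()
# regex must be the boolean False so that '$' is treated as a literal character;
# the string 'False' is truthy and would be interpreted as regex=True.
df['price'] = (df['price'].str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False)
               .astype(float))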
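
The parsing steps can be checked on a small scale. The sketch below applies them to a hypothetical Series of price strings (the values are made up for illustration, not taken from the dataset). It passes the boolean `regex=False`, so `$` is matched as a literal character rather than as the regular-expression end-of-string anchor.

```python
import numpy as np
import pandas as pd

# Hypothetical price strings in the same format as the raw 'price' column.
prices = pd.Series(['$1,250.00', '$0.00', '$85.00', None])

# Treat '$0.00' as missing, then drop every missing entry.
prices = prices.replace('$0.00', np.nan).dropna()

# Strip the currency symbol and thousands separator as literal characters.
parsed = (prices.str.replace('$', '', regex=False)
          .str.replace(',', '', regex=False)
          .astype(float))
print(parsed.tolist())
```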

### Partition the data into Train and Test sets:

In [4]:
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=777)

# 2.0 Feature Engineering

This project will use a straightforward and **simple feature engineering process that can precede preprocessing** seamlessly.  
<br>
This approach involves creating new features, applying basic encoding to the 'host_response_time' column, and converting numeric data currently stored as text into suitable formats. Each newly generated feature will replace its original column. The feature engineering process includes:
  
* Incorporating all features identified during the EDA phase.
* Generating new features for each datetime column to indicate the elapsed time from a reference date.
* Replacing the 'host_verifications' column with separate columns for each of the most common host verification methods.
* Introducing a new 'distance_to_palace' column, computed from the 'latitude' and 'longitude' columns, for enhanced location-based analysis.
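
As an illustration of the elapsed-time idea, here is a minimal vectorized sketch. It assumes the reference date is September 2022 (the month of the dataset snapshot); the sample dates and the names `REF_YEAR` and `REF_MONTH` are hypothetical.

```python
import pandas as pd

# Hypothetical sample of a datetime column; None becomes NaT.
dates = pd.Series(pd.to_datetime(['2020-03-15', '2022-09-01', None]))

# Assumed reference date: September 2022.
REF_YEAR, REF_MONTH = 2022, 9

# Whole months between each date and the reference month; NaT propagates as NaN.
months_elapsed = 12 * (REF_YEAR - dates.dt.year) + (REF_MONTH - dates.dt.month)
print(months_elapsed.tolist())
```

Only the year and month are used, so the day of the month does not affect the result.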

Declare functions that will be employed to generate new features:

In [5]:
def calc_shared_bathroom(row):
    """Returns 't' if the word 'shared' is in column 'bathrooms_text'.

    Returns 'f' otherwise, or NaN when the value is missing.
    """
    if pd.isnull(row['bathrooms_text']):
        return np.nan
    elif 'shared' in row['bathrooms_text'].lower():
        return 't'
    else:
        return 'f'


def calc_number_of_bathrooms(row):
    """Returns the number of bathrooms in column 'bathrooms_text'.

    Note: a half-bath is counted as 0.5 bathrooms.
    """
    if pd.isnull(row['bathrooms_text']):
        return np.nan
    elif 'half-bath' in row['bathrooms_text'].lower():
        return 0.5
    else:
        return float(row['bathrooms_text'].split()[0])


def calc_elapsed_time(row, column, time_unit='months'):
    """Returns the elapsed time since a date stored in the specified row and column.  
    
    'time_unit' can be either 'months'(default) or 'years'.
    """
    valid_time_units = {'months', 'years'}
    if time_unit not in valid_time_units:
        raise ValueError('time_unit: must be one of %r.' % valid_time_units)
    elif pd.isnull(row[column]):
        return np.nan
    elif time_unit == 'months':
        return (12*(2022-row[column].year) + (9-row[column].month))
    else:
        return (2022 - row[column].year)


def calc_number_of_amenities(row):
    """Returns the number of amenities stored in column 'amenities'."""
    if pd.isnull(row['amenities']):
        return np.nan
    items = row['amenities'].strip('[]"')
    # An empty list ('[]') would otherwise be counted as one amenity.
    return len(items.split('", "')) if items else 0


def has_host_verification(row, verification_type):
    """Returns 't' if the given verification_type is in column 'host_verifications'.

    Returns 'f' otherwise, or NaN when the value is missing.
    """
    if pd.isnull(row['host_verifications']):
        return np.nan
    elif verification_type in row['host_verifications']:
        return 't'
    else:
        return 'f'


def distance_from_palace(row):
    """Calculates distance in km from Buckingham Palace.  
    
    Columns 'longitude' and 'latitude' define the point that the distance is
    calculated from."""
    def haversine(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees).
        """
        # Convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        # Haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
        km = 6371 * c
        return km
    
    
    if pd.isnull(row['longitude']):
        return np.nan
    elif pd.isnull(row['latitude']):
        return np.nan
    else:
        palace_lon, palace_lat = -0.140634, 51.501476
        lon = row['longitude']
        lat = row['latitude']
        return haversine(palace_lon, palace_lat, lon, lat)
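
Row-wise `apply` can be slow on large frames. As an alternative, the haversine formula can be evaluated on whole arrays at once. The sketch below is one way to do that; `haversine_km` and the sample coordinates are hypothetical, while the palace coordinates are the ones used above.

```python
import numpy as np

PALACE_LON, PALACE_LAT = -0.140634, 51.501476  # same reference point as above


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km; inputs are decimal degrees (scalars or arrays)."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# Hypothetical listings: one at the palace itself, one a few km to the east.
lon = np.array([PALACE_LON, -0.075])
lat = np.array([PALACE_LAT, 51.505])
dist = haversine_km(PALACE_LON, PALACE_LAT, lon, lat)
print(dist)
```

Missing coordinates need no special handling here, because NaN inputs propagate to NaN distances.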

Map that encodes each 'host_response_time' category as an approximate response time in hours:

In [6]:
response_time_map = {
    'within an hour': 1,
    'within a few hours': 6,
    'within a day': 24,
    'a few days or more': 72,
}
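
A short sketch of how `Series.map` behaves with this dictionary. The sample values are hypothetical, and the last two are deliberately not keys of the map.

```python
import pandas as pd

response_time_map = {
    'within an hour': 1,
    'within a few hours': 6,
    'within a day': 24,
    'a few days or more': 72,
}

# Hypothetical sample: a missing value and an unexpected label are included.
sample = pd.Series(['within an hour', 'within a day', None, 'unknown'])

# Values that are not keys of the dictionary become NaN.
encoded = sample.map(response_time_map)
print(encoded.tolist())
```

Unmapped labels silently turn into missing values, so it may be worth checking the column's unique values before relying on the encoding.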

Declare function to convert numeric data currently stored as text into float format:

In [7]:
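def convert_percentage(row, column):
    """Returns the percentage string in the given column (e.g. '95%') as a
    fraction (0.95), or NaN when it is missing or cannot be parsed.
    """
    if pd.isnull(row[column]):
        return np.nan
    try:
        return float(row[column].strip('%')) / 100.0
    except (ValueError, AttributeError):
        return np.nan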
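
The same conversion can also be written without a row-wise function. The following is a sketch of a vectorized alternative, not the approach used in this notebook; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample of a percentage column stored as text.
rates = pd.Series(['95%', '100%', None, 'N/A'])

# Strip the trailing '%', coerce unparseable entries to NaN, and rescale.
converted = pd.to_numeric(rates.str.rstrip('%'), errors='coerce') / 100.0
print(converted.tolist())
```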

### Creating FeatureEngineering transformer

We are developing a custom transformer that subclasses BaseEstimator and TransformerMixin from scikit-learn, so that it integrates into scikit-learn pipelines. Running feature engineering inside the pipeline means it is re-applied within each cross-validation fold and can take part in hyperparameter tuning, which helps prevent data leakage.  
<br>
The current version of the custom transformer supports the **encode_hrt** setting, which can be toggled between True or False. The code structure allows for easy integration of additional settings to explore various feature engineering possibilities.  
<br>
<span style="font-size: smaller;"> Note: by default, the value of **encode_hrt** is set to True. When set to False, the 'host_response_time' column remains categorical and is not encoded during feature engineering. Consequently, it is treated as a categorical column during preprocessing.</span>

In [8]:
class FeatureEngineering(BaseEstimator, TransformerMixin):
    """
    Transformer for feature engineering, compatible with scikit-learn pipelines.  
    
    This transformer subclasses BaseEstimator and TransformerMixin from scikit-learn,
    ensuring compatibility with scikit-learn pipelines. It facilitates streamlined
    integration into machine learning workflows, supporting cross-validation,
    hyperparameter tuning, and preventing data leakage.
    
    Parameters:
    ----------
    encode_hrt : bool, default=True
        Whether to encode the 'host_response_time' column during feature engineering.
        If True, 'host_response_time' is encoded; if False, it remains categorical.
   
    """
    def __init__(self, encode_hrt=True):
        self.encode_hrt = encode_hrt
    
    
    def fit(self, X, y=None):
        self.feature_names_ = X.columns.tolist()
        return self
    
    
    def transform(self, X):
        X = X.copy()
        
        # Create new features:
        X['shared_bathroom'] = X.apply(calc_shared_bathroom, axis=1)
        X['number_of_bathrooms'] = X.apply(calc_number_of_bathrooms, axis=1)
        X['number_of_amenities'] = X.apply(calc_number_of_amenities, axis=1)
        X['host_experience_y'] = X.apply(
            lambda row: calc_elapsed_time(row, 'host_since', 'years'), axis=1)
        X['months_first_rev'] = X.apply(
            lambda row: calc_elapsed_time(row, 'first_review'), axis=1)
        X['months_last_rev'] = X.apply(
            lambda row: calc_elapsed_time(row, 'last_review'), axis=1)
        lst_verif_types = [
            'phone',
            'email',
            'work_email',
        ]
        for verification_type in lst_verif_types:
            X['host_verif_'+verification_type] = X.apply(
                lambda row:  has_host_verification(row, verification_type), axis=1)
        X['distance_to_palace'] = X.apply(distance_from_palace, axis=1)
        
        # Transform categorical features into numerical features:
        if self.encode_hrt:
            X['host_response_time'] = (X['host_response_time']
                                            .map(response_time_map))
        
        # Transform numbers saved as strings into numbers:
        X['host_response_rate'] = (X.apply(
            lambda row: convert_percentage(row, 'host_response_rate'), axis=1))
        X['host_acceptance_rate'] = (X.apply(
            lambda row: convert_percentage(row, 'host_acceptance_rate'), axis=1))
    
        # Remove the columns that were originally used to generate new
        # features, as they are no longer necessary:
        X.drop(columns = [
            'name',
            'description',
            'bathrooms_text',
            'amenities',
            'host_since',
            'first_review',
            'last_review',
            'host_verifications',
            'latitude',
            'longitude',
            ], inplace=True)
        
        self.feature_names_ = X.columns.tolist()
        
        return X
    
    
    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = self.feature_names_
        return input_features
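
Because the setting is stored unchanged in `__init__`, scikit-learn can read and change it through `get_params` and `set_params`, which is what makes it tunable from a pipeline. The sketch below shows the mechanism with a deliberately simplified, hypothetical stand-in (`ToyEngineering`) and a two-row toy frame; it is not the transformer defined above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import Pipeline


class ToyEngineering(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in with the same single setting, encode_hrt."""
    def __init__(self, encode_hrt=True):
        self.encode_hrt = encode_hrt

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        if self.encode_hrt:
            X['host_response_time'] = X['host_response_time'].map(
                {'within an hour': 1, 'within a day': 24})
        return X


toy_pipe = Pipeline(steps=[
    ('feature_engineering', ToyEngineering()),
    ('regressor', DummyRegressor(strategy='median')),
])
X_toy = pd.DataFrame({'host_response_time': ['within an hour', 'within a day']})

# '<step name>__<parameter>' addresses a parameter of a pipeline step.
toy_pipe.set_params(feature_engineering__encode_hrt=False)
kept = toy_pipe.named_steps['feature_engineering'].fit_transform(X_toy)

toy_pipe.set_params(feature_engineering__encode_hrt=True)
encoded = toy_pipe.named_steps['feature_engineering'].fit_transform(X_toy)

print(kept['host_response_time'].tolist())
print(encoded['host_response_time'].tolist())
```

The same `feature_engineering__encode_hrt` key could be used in a parameter grid for a search such as `GridSearchCV`.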

# 3.0 Preprocessing

To demonstrate the functionality of automatic column type detection within the pipeline, we will apply a minimal preprocessing step. This step categorizes the data into true/false, categorical, and numeric columns, each undergoing specific processing: 
<br>
- True/False values will be converted to binary (ones and zeros).
- Categorical data will be encoded using TargetEncoder.
- Numeric data will undergo no preprocessing at the moment.  

In subsequent stages, imputers and scalers will be integrated into the pipeline to explore more complex models. It's important to note that different data types and models may necessitate varying strategies for imputation and scaling.

#### True/False columns preprocessing:

In [9]:
class TFToBinaryTransformer(BaseEstimator, TransformerMixin):
    """Custom transformer to convert 't' and 'f' values to binary (1 and 0)."""
    def fit(self, X, y=None):
        return self
    
    
    def transform(self, X):
        X = X.copy()
        return X.replace({'t': 1, 'f': 0})
    
    
    def get_feature_names_out(self, input_features=None):
        return input_features

In [10]:
tf_transformer = Pipeline(steps=[
    ('tf_to_binary', TFToBinaryTransformer()),
])

#### Category columns preprocessing:

In [11]:
cat_transformer = Pipeline(steps=[
    ('encoder', TargetEncoder()),
])

#### Numeric columns preprocessing:

In [12]:
num_transformer = Pipeline(steps=[
    ('pass', 'passthrough'),
])

### UpdateColumnTransformer

This transformer facilitates automatic column type detection within the pipeline. It dynamically identifies the correct lists of columns for preprocessing based on their types at the time preprocessing occurs.  
<br>
Due to the nature of Feature Engineering, which generates new columns within the pipeline, column type detection cannot be performed before the pipeline begins.

In [13]:
class UpdateColumnTransformer(BaseEstimator, TransformerMixin):
    """
    A custom transformer that updates the column grouping in a preprocessor
    based on data characteristics.  

    This transformer subclasses BaseEstimator and TransformerMixin from
    scikit-learn, making it compatible with scikit-learn pipelines. It
    dynamically categorizes columns into 'tf' (true/false), 'cat' (categorical),
    and 'num' (numerical) groups based on their data type and values, and
    updates the preprocessing steps accordingly.
    """
    def __init__(self, preprocessor):
        self.preprocessor = preprocessor
    
    
    def fit(self, X, y):
        # True/False columns contain only 't' and 'f' values or may have missing values.
        tf_features = [col for col in (X.select_dtypes(include=['object', 'category'])
                                       .columns)
                       if set(X[col].dropna().unique()) <= {'t', 'f'}]
        # All remaining columns of object or category type will be processed together.
        cat_features = [col for col in (X.select_dtypes(include=['object', 'category'])
                                       .columns)
                       if col not in tf_features]
        # Numeric type columns will be processed together
        numeric_features = X.select_dtypes(['number']).columns.to_list()
        
        new_transformers = []
        for name, transformer, columns in self.preprocessor.transformers:
            if name == 'cat':
                new_transformers.append((name, transformer, cat_features))
            elif name == 'tf':
                new_transformers.append((name, transformer, tf_features))
            elif name == 'num':
                new_transformers.append((name, transformer, numeric_features))
            else:
                new_transformers.append((name, transformer, columns))
        self.preprocessor.transformers = new_transformers
        
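        # Fit the wrapped preprocessor, then return this transformer itself,
        # following the scikit-learn convention for fit().
        self.preprocessor.fit(X, y)
        return self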
    
    
    def transform(self, X):
        return self.preprocessor.transform(X)
    
    
    def get_feature_names_out(self, input_features=None):
        return self.preprocessor.get_feature_names_out(input_features)
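
The grouping rule used in `fit` can be tried in isolation. The sketch below applies the same three checks to a hypothetical three-column frame; the column names resemble those of the dataset but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one true/false, one categorical and one numeric column.
X_demo = pd.DataFrame({
    'host_is_superhost': pd.Series(['t', 'f', np.nan], dtype=object),
    'room_type': pd.Series(['Entire home/apt', 'Private room', 'Shared room'],
                           dtype=object),
    'accommodates': [2, 4, 1],
})

text_cols = X_demo.select_dtypes(include=['object', 'category']).columns
# A column is true/false if its non-missing values are a subset of {'t', 'f'}.
tf_cols = [c for c in text_cols
           if set(X_demo[c].dropna().unique()) <= {'t', 'f'}]
cat_cols = [c for c in text_cols if c not in tf_cols]
num_cols = X_demo.select_dtypes(include=['number']).columns.tolist()

print(tf_cols, cat_cols, num_cols)
```

One consequence of the subset test is that an object column containing only missing values would also be classified as true/false, because the empty set is a subset of every set.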

### Preprocessor
The preprocessor is initialized with empty column lists, as they will be filled in by UpdateColumnTransformer when the pipeline is fitted.

In [14]:
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', cat_transformer, []),
        ('tf', tf_transformer, []),
        ('num', num_transformer, []),
    ],
    remainder='passthrough',
    verbose_feature_names_out=False
)

# 4.0 Pipeline

The pipeline integrates the feature engineering and preprocessing components, using a Dummy Regressor as the baseline estimator. This Dummy Regressor ignores the features and always predicts the median of the target values in the training data used to fit it.

In [15]:
pipeline = Pipeline(steps=[
    ('feature_engineering', FeatureEngineering()),
    ('type_detection', UpdateColumnTransformer(preprocessor)),
    ('regressor', DummyRegressor(strategy='median')),
])
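
To make the baseline's behaviour concrete, the following sketch fits a `DummyRegressor` with the median strategy on hypothetical toy prices and predicts for unrelated feature values.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Hypothetical nightly prices with one expensive listing.
y_toy = np.array([40.0, 60.0, 75.0, 90.0, 500.0])
X_toy = np.zeros((len(y_toy), 1))  # placeholder features; they are ignored

dummy = DummyRegressor(strategy='median').fit(X_toy, y_toy)

# Any input yields the training median, regardless of the feature values.
pred = dummy.predict(np.array([[123.0], [-5.0]]))
print(pred)
```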

# 5.0 Baseline Evaluation

Given the identification of outliers and positively skewed distributions during the EDA in both the features and target values, the chosen primary evaluation metric for the models is Mean Absolute Error (MAE). MAE is preferred for its robustness to outliers and its straightforward measure of average prediction error.  
<br>
Additionally, Root Mean Squared Error (RMSE) will be considered as an auxiliary evaluation metric. Because it squares the errors before averaging, RMSE penalizes large errors more heavily than MAE, so comparing the two indicates how much of the total error comes from a few large misses.
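
A small numeric sketch of how the two metrics react to a skewed target. The values are hypothetical, and `mae` and `rmse` are local helper functions written for this illustration.

```python
import numpy as np

# Hypothetical right-skewed target with a single very large value.
y_true = np.array([50.0, 80.0, 100.0, 120.0, 2000.0])


def mae(y, pred):
    return np.mean(np.abs(y - pred))


def rmse(y, pred):
    return np.sqrt(np.mean((y - pred) ** 2))


# Compare two constant predictions: the median and the mean of the target.
for label, const in [('median', np.median(y_true)), ('mean', np.mean(y_true))]:
    print(f'{label:>6}: MAE={mae(y_true, const):.1f}  '
          f'RMSE={rmse(y_true, const):.1f}')
```

On this toy data the median gives the lower MAE while the mean gives the lower RMSE, which is consistent with pairing an MAE-focused evaluation with a median-predicting baseline.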

### Cross Validation
We will use cross-validation to measure the baseline values of the primary metric, Mean Absolute Error (MAE), and the auxiliary metric, Root Mean Squared Error (RMSE).
<br>
This approach provides an estimate of the evaluation metric without relying on the test dataset.

In [16]:
kf = KFold(n_splits=5, shuffle=True, random_state=777)

avg_mae = 0
avg_rmse = 0
for tr, ts in kf.split(X_train, y_train):
    Xtr, Xvl = X_train.iloc[tr], X_train.iloc[ts]
    ytr, yvl = y_train.iloc[tr], y_train.iloc[ts]
    
    pipeline.fit(Xtr, ytr)
    ypred = pipeline.predict(Xvl)
    avg_mae += mean_absolute_error(yvl, ypred)
    avg_rmse += root_mean_squared_error(yvl, ypred)

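# Average over the folds, using the number of splits configured in kf.
print('Baseline Average MAE:', avg_mae / kf.get_n_splits())
print('Baseline Average RMSE:', avg_rmse / kf.get_n_splits())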

Baseline Average MAE: 119.67833952398125
Baseline Average RMSE: 411.2180994583384
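
The manual loop could also be expressed with scikit-learn's `cross_validate`, which accepts several scorers at once. The sketch below uses synthetic data (random features and a log-normal target generated only for this example), so its numbers are unrelated to the baseline values reported above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data: random features and a right-skewed target.
rng = np.random.default_rng(777)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.lognormal(mean=4.5, sigma=1.0, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=777)
scores = cross_validate(
    DummyRegressor(strategy='median'), X_demo, y_demo, cv=kf,
    scoring=('neg_mean_absolute_error', 'neg_root_mean_squared_error'),
)

# scikit-learn scorers follow "higher is better", so error metrics are negated.
mae = -scores['test_neg_mean_absolute_error'].mean()
rmse = -scores['test_neg_root_mean_squared_error'].mean()
print('MAE:', mae)
print('RMSE:', rmse)
```

In the notebook itself, the full `pipeline` object would take the place of the bare `DummyRegressor`, with `X_train` and `y_train` as the data.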
