# NYC Short-Term Renting Price Pipeline: Modeling Trials

This notebook is the research environment in which different modeling approaches are tried.

Some cells in this notebook are replicated in the final pipeline component/step `train_random_forest`.

Table of contents:

- [1. Download Dataset](#1.-Download-Dataset)
- [2. Split](#2.-Split)
- [3. Feature Engineering and Processing Pipeline](#3.-Feature-Engineering-and-Processing-Pipeline)
- [4. Model and Inference Pipeline](#4.-Model-and-Inference-Pipeline)
- [5. Train](#5.-Train)

In [35]:
import json
import yaml
import wandb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, FunctionTransformer

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline, make_pipeline

## 1. Download Dataset

In [2]:
# Download the clean and segregated dataset
# Note: this dataset must already exist in W&B.
# To that end, we need to execute these components/steps in order:
# mlflow run . -P steps="download"
# mlflow run . -P steps="basic_cleaning"
# mlflow run . -P steps="data_check"
# mlflow run . -P steps="data_split"
run = wandb.init(project="nyc_airbnb", group="modeling", save_code=True)
local_path = wandb.use_artifact("trainval_data.csv:latest").file()
df = pd.read_csv(local_path)

[34m[1mwandb[0m: Currently logged in as: [33mdatamix-ai[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: wandb version 0.13.4 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


In [3]:
df.head()

|   | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25937108 | Luxury 2 Bedroom Grand Central | 168465501 | Ian | Manhattan | Murray Hill | 40.75058 | -73.97746 | Entire home/apt | 200 | 30 | 0 |  |  | 4 | 364 |
| 1 | 22483776 | THE COOL HOUSE | 55197894 | Erick | Queens | Woodside | 40.74907 | -73.90083 | Private room | 30 | 5 | 7 | 2019-02-12 | 0.38 | 1 | 0 |
| 2 | 28584877 | Cozy bedroom available in best area of Brooklyn! | 92733485 | Vitaly | Brooklyn | Downtown Brooklyn | 40.69164 | -73.99055 | Private room | 50 | 1 | 5 | 2018-10-16 | 0.54 | 1 | 0 |
| 3 | 19388198 | Charming Hotel Alternative 2\nMount Sinai | 661399 | Vivianne | Manhattan | East Harlem | 40.79179 | -73.94506 | Private room | 89 | 3 | 21 | 2019-06-16 | 0.91 | 2 | 125 |
| 4 | 6492864 | Private Artisitic Room East Village | 1480124 | Goldwyn | Manhattan | East Village | 40.72174 | -73.98418 | Private room | 82 | 3 | 11 | 2015-09-26 | 0.23 | 1 | 0 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15200 entries, 0 to 15199
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              15200 non-null  int64  
 1   name                            15193 non-null  object 
 2   host_id                         15200 non-null  int64  
 3   host_name                       15193 non-null  object 
 4   neighbourhood_group             15200 non-null  object 
 5   neighbourhood                   15200 non-null  object 
 6   latitude                        15200 non-null  float64
 7   longitude                       15200 non-null  float64
 8   room_type                       15200 non-null  object 
 9   price                           15200 non-null  int64  
 10  minimum_nights                  15200 non-null  int64  
 11  number_of_reviews               15200 non-null  int64  
 12  last_review                     

## 2. Split

In [5]:
# Features and target
X = df
y = X.pop("price")

In [8]:
# Train/val split
# Values from config.yaml
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=X["neighbourhood_group"], random_state=42
)
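
As a quick illustration of what `stratify` does in the split above, here is a minimal sketch on a made-up frame (the `toy` data and its 70/30 class mix are assumptions, not values from the dataset). It checks that the class shares are preserved in both partitions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 70% Manhattan, 30% Queens
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 70 + ["Queens"] * 30,
    "minimum_nights": range(100),
})

# Same arguments as in the notebook cell, applied to the toy frame
toy_train, toy_val = train_test_split(
    toy, test_size=0.3, stratify=toy["neighbourhood_group"], random_state=42
)

# Share of each neighbourhood group in both partitions
train_share = toy_train["neighbourhood_group"].value_counts(normalize=True)
val_share = toy_val["neighbourhood_group"].value_counts(normalize=True)
print(train_share.to_dict())
print(val_share.to_dict())
```

Without `stratify`, the shares in the validation partition would vary with the random seed, which matters most for the rarer neighbourhood groups.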

## 3. Feature Engineering and Processing Pipeline

In [9]:
# Let's handle the categorical features first
# Ordinal categorical features are those whose order is meaningful, for example
# room type: 'Entire home/apt' > 'Private room' > 'Shared room'
ordinal_categorical = ["room_type"]
non_ordinal_categorical = ["neighbourhood_group"]

In [10]:
# NOTE: we do not need to impute room_type because the type of the room
# is mandatory on the websites, so missing values are not possible in production
# (nor during training). That is not true for neighbourhood_group
ordinal_categorical_preproc = OrdinalEncoder()
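
One detail worth checking: with default settings, `OrdinalEncoder` assigns codes following the sorted (alphabetical) order of the categories, not a semantic order. The sketch below compares the default with an explicitly specified order; the explicit ordering is my assumption of what "meaningful order" could look like, not something taken from the pipeline.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

room_types = np.array([["Private room"], ["Shared room"], ["Entire home/apt"]])

# Default behaviour: categories are sorted alphabetically before coding
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(room_types).ravel()

# Explicit order (assumed here): from the smallest to the largest offering
explicit_enc = OrdinalEncoder(
    categories=[["Shared room", "Private room", "Entire home/apt"]]
)
explicit_codes = explicit_enc.fit_transform(room_types).ravel()

print(default_enc.categories_[0], default_codes)
print(explicit_enc.categories_[0], explicit_codes)
```

For the three room types the alphabetical order happens to be monotonic (just reversed), so a tree-based model should be largely unaffected; for other features the explicit `categories` argument is the safer choice.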

In [11]:
# Build a pipeline with two steps:
# 1 - A SimpleImputer(strategy="most_frequent") to impute missing values
# 2 - A OneHotEncoder() step to encode the variable

In [12]:
non_ordinal_categorical_preproc = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder()
)
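
To see what the two steps above do, here is a small sketch on a made-up column with one missing value (the `toy_groups` array is hypothetical). The missing entry is first replaced by the most frequent value and then one-hot encoded.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column with a missing neighbourhood group
toy_groups = np.array(
    [["Manhattan"], ["Brooklyn"], [np.nan], ["Manhattan"]], dtype=object
)

toy_preproc = make_pipeline(
    SimpleImputer(strategy="most_frequent"),  # NaN -> "Manhattan"
    OneHotEncoder(),                          # columns: Brooklyn, Manhattan
)

# OneHotEncoder returns a sparse matrix by default; densify for printing
encoded = toy_preproc.fit_transform(toy_groups).toarray()
print(encoded)
```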

In [13]:
# Let's impute the numerical columns to make sure we can handle missing values
# (note that we do not scale because the RF algorithm does not need that)
zero_imputed = [
    "minimum_nights",
    "number_of_reviews",
    "reviews_per_month",
    "calculated_host_listings_count",
    "availability_365",
    "longitude",
    "latitude"
]
zero_imputer = SimpleImputer(strategy="constant", fill_value=0)

In [14]:
# A MINIMAL FEATURE ENGINEERING step:
# we create a feature that represents the number of days passed since the last review
# First we impute the missing review date with an old date (because there hasn't been
# a review for a long time), and then we create a new feature from it
def delta_date_feature(dates):
    """
    Given a 2d array containing dates (in any format recognized by pd.to_datetime),
    return the delta in days between each date and the most recent date in its column.
    """
    date_sanitized = pd.DataFrame(dates).apply(pd.to_datetime)
    return date_sanitized.apply(lambda d: (d.max() - d).dt.days, axis=0).to_numpy()

In [15]:
date_imputer = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='2010-01-01'),
    FunctionTransformer(delta_date_feature, check_inverse=False, validate=False)
)
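
The following sketch runs the same two-step date pipeline on three made-up dates (the `toy_dates` array is hypothetical) to show the resulting feature. The function is repeated so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def delta_date_feature(dates):
    """Days between each date and the most recent date in its column."""
    date_sanitized = pd.DataFrame(dates).apply(pd.to_datetime)
    return date_sanitized.apply(lambda d: (d.max() - d).dt.days, axis=0).to_numpy()


# Hypothetical last_review column with one missing value
toy_dates = np.array([["2019-06-16"], [np.nan], ["2019-06-06"]], dtype=object)

toy_date_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="2010-01-01"),
    FunctionTransformer(delta_date_feature, check_inverse=False, validate=False),
)

deltas = toy_date_pipe.fit_transform(toy_dates)
print(deltas.ravel())  # most recent date -> 0, imputed date -> large delta
```

Note that the reference date is the most recent date within whatever batch is passed in, so the same listing can get a different value at training and at inference time; a fixed reference date would be one way to avoid that.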

In [17]:
# Some minimal NLP for the "name" column
max_tfidf_features = 5  # from config.yaml
# Note: NumPy >= 2.1 deprecates the `newshape` keyword of np.reshape in favour of `shape`
reshape_to_1d = FunctionTransformer(np.reshape, kw_args={"newshape": -1})
name_tfidf = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=""),
    reshape_to_1d,
    TfidfVectorizer(
        binary=False,
        max_features=max_tfidf_features,
        stop_words='english'
    )
)
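
As a rough illustration of the name pipeline, the sketch below applies the same three steps to a few made-up listing names (the `toy_names` array and `max_features=2` are assumptions). Instead of `np.reshape` with keyword arguments it uses a small helper, `flatten`, which is my substitution to stay independent of the NumPy version.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def flatten(x):
    """Turn the (n, 1) column into the 1d array TfidfVectorizer expects."""
    return np.asarray(x).reshape(-1)


# Hypothetical listing names, one of them missing
toy_names = np.array(
    [["Cozy room in Brooklyn"], [np.nan], ["Luxury room near park"], ["Cozy loft"]],
    dtype=object,
)

toy_tfidf = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=""),  # NaN -> empty string
    FunctionTransformer(flatten),
    TfidfVectorizer(binary=False, max_features=2, stop_words="english"),
)

name_matrix = toy_tfidf.fit_transform(toy_names)
vocab = sorted(toy_tfidf.steps[-1][1].vocabulary_)
print(vocab)
print(name_matrix.toarray())
```

Only the `max_features` most frequent terms across the corpus are kept, and the imputed empty name becomes an all-zero row.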

In [18]:
# PIPELINE
# Let's put everything together
preprocessor = ColumnTransformer(
    transformers=[
        ("ordinal_cat", ordinal_categorical_preproc, ordinal_categorical),
        ("non_ordinal_cat", non_ordinal_categorical_preproc, non_ordinal_categorical),
        ("impute_zero", zero_imputer, zero_imputed),
        ("transform_date", date_imputer, ["last_review"]),
        ("transform_name", name_tfidf, ["name"])
    ],
    remainder="drop",  # This drops the columns that we do not transform
)
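
A reduced sketch of how `ColumnTransformer` combines per-column transformers and what `remainder="drop"` means in practice. The `toy` frame and the two-transformer setup are assumptions for illustration; `sparse_threshold=0` is set only to force a dense array for printing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini dataset
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Queens", "Manhattan"],
    "minimum_nights": [30.0, None, 1.0],
    "host_name": ["Ian", "Erick", "Vitaly"],  # not listed below, so it is dropped
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("non_ordinal_cat", OneHotEncoder(), ["neighbourhood_group"]),
        ("impute_zero", SimpleImputer(strategy="constant", fill_value=0), ["minimum_nights"]),
    ],
    remainder="drop",
    sparse_threshold=0,  # always return a dense array (for readability only)
)

# Output columns follow the transformer order: 2 one-hot columns, then 1 numeric
toy_features = toy_preprocessor.fit_transform(toy)
print(toy_features)
```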

In [19]:
processed_features = ordinal_categorical + non_ordinal_categorical + zero_imputed + ["last_review", "name"]

In [20]:
print(processed_features)

['room_type', 'neighbourhood_group', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', 'longitude', 'latitude', 'last_review', 'name']


## 4. Model and Inference Pipeline

In [27]:
# Load config.yaml and extract dictionary
# related to the random forest
config = dict()
with open("../../config.yaml") as fp:
    config = yaml.safe_load(fp)

rf_config = dict(config["modeling"]["random_forest"].items())
rf_config['random_state'] = 42

In [28]:
rf_config

{'n_estimators': 100,
 'max_depth': 15,
 'min_samples_split': 4,
 'min_samples_leaf': 3,
 'n_jobs': -1,
 'criterion': 'mae',
 'max_features': 0.5,
 'oob_score': True,
 'random_state': 42}

In [32]:
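# Create random forest
# Note: scikit-learn 1.0 renamed criterion 'mae' to 'absolute_error'
# and removed the old name in 1.2, so newer versions need the new value
random_forest = RandomForestRegressor(**rf_config)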

In [33]:
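# Note: the final step is named "classifier" although it holds a regressor;
# the grid-search parameter keys in the Train section rely on this name
sk_pipe = Pipeline(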
    steps=[
        ("processor", preprocessor),
        ("classifier", random_forest),
    ]
)
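
The notebook does not fit this pipeline on the train split directly, so here is a minimal sketch of how such an inference pipeline could be fitted and scored on a validation split. Everything in it is an assumption for illustration: the synthetic data, the reduced preprocessor, and the small forest.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data: the price depends linearly on minimum_nights plus noise
rng = np.random.default_rng(42)
n = 200
toy_X = pd.DataFrame({
    "minimum_nights": rng.integers(1, 30, size=n).astype(float),
    "availability_365": rng.integers(0, 365, size=n).astype(float),
})
toy_y = 50 + 2 * toy_X["minimum_nights"] + rng.normal(0, 5, size=n)

toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42
)

# Preprocessing and model live in one object, so the same
# transformations are applied at training and at inference time
toy_pipe = Pipeline(steps=[
    ("processor", ColumnTransformer([
        ("impute_zero", SimpleImputer(strategy="constant", fill_value=0), list(toy_X.columns)),
    ])),
    ("classifier", RandomForestRegressor(n_estimators=20, random_state=42)),
])

toy_pipe.fit(toy_X_train, toy_y_train)
toy_pred = toy_pipe.predict(toy_X_val)
toy_mae = mean_absolute_error(toy_y_val, toy_pred)
print(f"Validation MAE: {toy_mae:.2f}")
```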

## 5. Train

In [38]:
# Define Grid Search: parameters to try, cross-validation size
# In the production code we train the model without grid search;
# instead, hyperparameter tuning is done with hydra sweeps.
# Note that with hydra sweeps we also vary max_tfidf_features,
# which is not that easy to vary here with the selected arrangement.
# Also note that the complete train-val dataset (X, y) is used here with k-fold
# cross-validation; we should use the dedicated splits separately instead.
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_features': [0.1, 0.33, 0.5, 0.75, 1.0],
    'classifier__max_depth': list(range(5, 20, 5))  # 5, 10, 15
}
# Grid search
search = GridSearchCV(estimator=sk_pipe,
                      param_grid=param_grid,
                      cv=3,
                      scoring='neg_mean_absolute_error') # Negative MAE
# Find best hyperparameters and best estimator pipeline
search.fit(X, y)
# We would export this model
rfc_pipe = search.best_estimator_
# This is the best score; it is the negative MAE (higher is better),
# so the actual MAE is -search.best_score_
print('Best score: MAE = \n', search.best_score_)
# We can export the best parameters to a YAML and load them for inference
print('\nBest params:\n', search.best_params_)

Best score: MAE = 
 -32.81049986344972

Best params:
 {'classifier__max_depth': 15, 'classifier__max_features': 0.5, 'classifier__n_estimators': 200}
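
A possible way to turn the output above into something reusable, sketched under assumptions: the score and parameters are copied from the printed results, and the `modeling` / `random_forest` nesting mirrors how `config.yaml` is read earlier in this notebook. The sign of the score is flipped to get the MAE, and the `classifier__` step prefix is stripped so the keys match `RandomForestRegressor` arguments.

```python
import yaml

# Values copied from the grid search output above
best_score = -32.81049986344972
best_params = {
    "classifier__max_depth": 15,
    "classifier__max_features": 0.5,
    "classifier__n_estimators": 200,
}

# scoring='neg_mean_absolute_error' returns the negated MAE
best_mae = -best_score

# Drop the pipeline step prefix: 'classifier__max_depth' -> 'max_depth'
rf_params = {key.split("__", 1)[1]: value for key, value in best_params.items()}

# Serialize in the same nesting used when reading config.yaml (assumed layout)
yaml_text = yaml.safe_dump({"modeling": {"random_forest": rf_params}})
print(f"MAE: {best_mae:.2f}")
print(yaml_text)

# Round trip: this is what inference code would load
restored = yaml.safe_load(yaml_text)
```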
