In [0]:
from data_preparation import (
    split_data,
    preprocess_data,
    id_highly_correlated_variables,
    calc_mutual_info_scores,
    remove_highly_correlated_vars,
)
from feature_selection import recursive_feature_selector, feature_importance_selector
from train_model import fit_model
from hyperopt_utils import Hyperopt
from databricks.sdk.runtime import spark
import json

In [0]:
with open("./configs/config.json") as config_file:
    config = json.load(config_file)

data_path = config["data_path"]
target_col = config["target_col"]
seed = config["seed"]

In [0]:
pandas_df = spark.read.parquet(data_path).toPandas()

In [0]:
# Row count is small, so I will model with scikit-learn rather than Spark ML because it offers a wider choice of algorithms
pandas_df.shape[0]

In [0]:
display(pandas_df)

In [0]:
# appears from the above that the data types came in correctly
pandas_df.dtypes

In [0]:
pandas_df.isnull().sum()

In [0]:
# It appears that the dataset is pre-imputed and includes a missing-value flag wherever data was originally missing
na_columns = pandas_df.columns[24:35]
pandas_df[na_columns].sum()

## Data summary observations
- Price, the target variable, is heavily skewed, with a long tail of high values, which may make the highest prices difficult to predict. However, given the numbers shown below, Airbnb likely gets most of its revenue from low-to-middle priced listings, so we may want to predict those values well rather than the tails. This means MAE or MAPE may be a better evaluation metric than RMSE or MSE, which penalize large errors more heavily and are therefore dominated by outliers.
- property_type and neighbourhood_cleansed have very high cardinality. One-hot encoding will not be a good method for these features.
- There appear to be no obvious ordinal categorical variables in the category or numeric columns
- Most obvious data cleaning has already been completed by Databricks. Mostly just feature engineering is left.

In [0]:
dbutils.data.summarize(pandas_df)  # noqa

In [0]:
pandas_df.nunique()

## High cardinality variables
- For property_type, most listings fall into a small number of types, followed by a long tail of rare ones. I will keep the major types and recode the rest into an "Other" category.
- For neighbourhood_cleansed, the listings are spread more evenly across categories. A leave-one-out target encoding strategy may work better here.
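
To make the leave-one-out idea concrete, here is a minimal sketch in plain Python. It is not the encoder used inside `preprocess_data` (that code is not shown here); the function name, the neighbourhood labels, and the prices are all made up for illustration. Each row is encoded as the mean target of the other rows in its category, so a row never sees its own price.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row as the mean target of the OTHER rows in its category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    # Fallback for categories that appear only once (nothing left to average)
    global_mean = sum(targets) / len(targets)

    encoded = []
    for cat, y in zip(categories, targets):
        if counts[cat] > 1:
            # Remove the row's own target before averaging
            encoded.append((sums[cat] - y) / (counts[cat] - 1))
        else:
            encoded.append(global_mean)
    return encoded


# Hypothetical sample data, not taken from the real dataset
neighbourhoods = ["A", "A", "A", "B", "B", "C"]
prices = [100.0, 200.0, 300.0, 150.0, 250.0, 400.0]
encoded = leave_one_out_encode(neighbourhoods, prices)
print(encoded)
```

The leave-one-out step applies to training rows only. For validation or test rows, the usual approach is to use the plain per-category mean learned from the training data.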

In [0]:
pandas_df["property_type"].value_counts()

In [0]:
pandas_df["neighbourhood_cleansed"].value_counts()

In [0]:
# Group smaller property types into a larger "other" category
keep_property_values = list(
    pandas_df["property_type"].value_counts()[lambda x: x > 80].index
)
pandas_df["property_type"] = pandas_df["property_type"].apply(
    lambda x: x if x in keep_property_values else "Other"
)

### Choice of loss metric
As shown during the EDA, Airbnb's rates are highly skewed. The majority of Airbnb's revenue likely comes from where the bulk of the rates are: the low and medium range. Only a small number of listings had very high rates.

I wanted to predict low and medium rates well, so I chose MAE as my loss metric. MAE weights every error linearly, so it does not penalize a model disproportionately for missing the highest values. MSE would have been a better choice if predicting higher rates was more important to the business.
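
The difference between the two metrics can be seen in a small worked example. This is only a sketch: the prices and both sets of predictions are invented, and the helper functions are written out in plain Python to keep the arithmetic visible.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: every miss counts in proportion to its size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: large misses are amplified by squaring."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Hypothetical nightly prices: four typical listings and one luxury outlier
y_true = [100, 120, 90, 110, 1000]

# Model A: accurate on typical listings, misses the outlier by 200
pred_a = [105, 115, 95, 105, 800]
# Model B: off by 60 on every typical listing, close on the outlier
pred_b = [160, 60, 150, 50, 980]

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(f"Model {name}: MAE={mae(y_true, pred):.1f}, RMSE={rmse(y_true, pred):.1f}")
```

In this toy case MAE prefers Model A, which is accurate on the bulk of listings, while RMSE prefers Model B, which is closer on the single expensive listing.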

In [0]:
# I'm going to work with the imputed values already in the dataset and drop the missing-value flag columns.
# The existing imputation appears to be more sophisticated than the simplest possible method.
modeling_df = pandas_df.drop(columns=na_columns)

In [0]:
modeling_df.head()

In [0]:
# Split dataframe into training, validation, and test data sets
# Separating target column from features
train, y_train, val, y_val, test, y_test = split_data(modeling_df, target_col)

In [0]:
print(f"Train data row count is {train.shape[0]}")
print(f"Validation data row count is {val.shape[0]}")
print(f"Holdout/test data row count is {test.shape[0]}")

In [0]:
# Create lists of features for different data types
high_card_vars = ["neighbourhood_cleansed"]
cat_vars = list(
    set(train.select_dtypes(include=["object"]).columns.to_list()) - set(high_card_vars)
)
cont_vars = train.select_dtypes(include=["float64"]).columns.to_list()

## Model Tracking
I prefer to use **MLflow** for model tracking and evaluation. It lets you compare the performance of different models against their hyperparameters, which makes trade-offs easier to judge. It also prevents duplicated work. Databricks also has some really awesome new capabilities for connecting models with feature stores and data lineage.

In [0]:
import mlflow

# Point MLflow at the experiment; set_experiment creates it if it does not already exist
experiment_path = "/Users/contact@kelseyhuntzberry.com/databricks-coding-challenge/mlflow_experiments/"
experiment_name = "airbnb_price_prediction"

mlflow.set_experiment(experiment_name=f"{experiment_path}{experiment_name}")

In [0]:
mlflow.sklearn.autolog()

In [0]:
# Send data through preprocessing pipeline
(
    preprocessor,
    X_train,
    X_val,
    X_test,
    cat_model_names,
    num_model_names,
    hc_model_names,
) = preprocess_data(train, y_train, val, test, cat_vars, cont_vars, high_card_vars)

In [0]:
mlflow.sklearn.autolog(disable=True)

In [0]:
# Identify highly correlated variables above the user-defined threshold
cont_var_names = num_model_names + hc_model_names

high_corr_df = id_highly_correlated_variables(X_train, cont_var_names, corr_cutoff=0.7)

high_corr_df

In [0]:
# Calculate mutual information scores
mi_score_df = calc_mutual_info_scores(X_train, y_train, cont_var_names)

mi_score_df.head()

In [0]:
# Removing variables with the highest mutual information score that are above the correlation threshold
corr_mi_score_df, remove_variables = remove_highly_correlated_vars(
    high_corr_df, mi_score_df
)

print("remove the following variables:", remove_variables)

corr_mi_score_df.head()

In [0]:
# Removing highly correlated features
X_train = X_train.drop(columns=remove_variables)
X_val = X_val.drop(columns=remove_variables)
X_test = X_test.drop(columns=remove_variables)

## Feature Selection
I am using two model-specific methods of feature selection, **recursive feature elimination** and a **feature importance** selector. These feature lists are then used in hyperopt hyperparameter tuning to improve model MAE.
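
For readers unfamiliar with the two approaches, here is a minimal sketch using scikit-learn's built-in `RFE` and `SelectFromModel` on synthetic data. This is an assumption about the general technique, not the code inside `recursive_feature_selector` or `feature_importance_selector`, whose implementations are not shown in this notebook. The data, the column names, and the choice of three features are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Ridge

# Synthetic stand-in data: 8 features, only 3 of which drive the target
X_arr, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Recursive feature elimination: fit, drop the weakest feature, repeat
rfe = RFE(estimator=Ridge(random_state=0), n_features_to_select=3, step=1)
rfe.fit(X, y)
rfe_features = X.columns[rfe.support_].to_list()

# Importance-based selection: keep features whose importance is at least the mean
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0),
    threshold="mean",
)
selector.fit(X, y)
importance_features = X.columns[selector.get_support()].to_list()

print("RFE:", rfe_features)
print("Importance:", importance_features)
```

Both selectors depend on the estimator they wrap, which is why it can be useful to build a separate feature list per model type.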

In [0]:
# Setting up a dictionary to cleanly perform feature selection without a lot of repeated code
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

feature_selection_model_dict = {
    "gbt": GradientBoostingRegressor(random_state=seed, max_depth=5),
    "rf": RandomForestRegressor(random_state=seed, max_depth=5),
    "ridge": Ridge(random_state=seed),
    "lasso": Lasso(random_state=seed),
    "elastic_net": ElasticNet(random_state=seed),
    "decision_trees": DecisionTreeRegressor(random_state=seed, max_depth=5),
}

In [0]:
features_dict = {}

# Performing recursive feature elimination for a variety of models
for model_name, model in feature_selection_model_dict.items():

    feature_names = recursive_feature_selector(model, X_train, y_train)

    feature_set_name = f"{model_name}_rfe"

    features_dict[feature_set_name] = feature_names

In [0]:
# Performing feature-importance-based selection for a variety of models
for model_name, model in feature_selection_model_dict.items():

    feature_names = feature_importance_selector(model, X_train, y_train)

    feature_set_name = f"{model_name}_importance"

    features_dict[feature_set_name] = feature_names

In [0]:
with open("./configs/features.json", "w") as fp:
    json.dump(features_dict, fp)

In [0]:
mlflow.sklearn.autolog()

## Hyperparameter Tuning
I am using the Hyperopt package, which tunes hyperparameters with Bayesian optimization and can parallelize trials for better scalability.

In [0]:
# Run hyperopt hyperparameter tuning across model types
from hyperopt import space_eval

hyperopt_class = Hyperopt(
    X_train, y_train, X_val, y_val, features_dict, apply_overfit_penalty=True
)

best_result, search_space = hyperopt_class.run_hyperopt(max_evals=150)

hyperopt_results = space_eval(search_space, best_result)
print(hyperopt_results)

## Final Evaluation
Because Hyperopt does not log or optimize multiple evaluation metrics, I am doing a final evaluation with the validation and holdout datasets below.

In [0]:
# Dynamically extract best hyperparameter values and correct model
final_model_type = hyperopt_results["type"]
final_feature_list_name = hyperopt_results["feature_list"]
final_feature_list = features_dict[final_feature_list_name]

final_model_params = hyperopt_results.copy()
del final_model_params["type"]
del final_model_params["feature_list"]

final_model_params = hyperopt_class.round_hyperparameters(final_model_params)

final_model = hyperopt_class.extract_model(
    model_name=final_model_type, **final_model_params
)

In [0]:
# Do full evaluation on model with validation data
final_model, final_val_results, final_feature_importances = fit_model(
    final_model,
    features_dict[final_feature_list_name],
    X_train,
    y_train,
    X_val,
    y_val,
    final_model_type,
    final_feature_list_name,
)

In [0]:
final_val_results

In [0]:
final_feature_importances.head()

In [0]:
# Evaluate the model on the holdout/test data
holdout_model, holdout_results, holdout_feature_importances = fit_model(
    final_model,
    features_dict[final_feature_list_name],
    X_train,
    y_train,
    X_test,
    y_test,
    final_model_type,
    final_feature_list_name,
)

In [0]:
holdout_results

In [0]:
holdout_feature_importances.head()

## Chosen Model:
The model chosen above as the best model was random forest. It had the lowest MAE, the metric I chose in order to better predict the majority of listings.

## Results Overview:
The error of this model was quite high, which is likely driven by the small sample size of this dataset. In the entire dataset, there are only ~7k rows. There is also a long tail to the data, which is likely increasing error for the bulk of the dataset.

## To Improve Model, Would Incorporate Business Goals:
If given this dataset in a business setting, you would want to question stakeholders on whether predicting the "bulk" was more important than predicting the larger values. You would adjust the modeling methodology based on this feedback.

If predicting the **high values** is more important, you would want to:
- Change the loss metric for evaluation to RMSE, which penalizes large errors more heavily and so discourages missing the target by a large amount

If predicting the **low values and middle** is more important:
- You would want to stay with the current loss metric
- You could perform outlier removal of very high priced rentals
- You could try outlier-resistant models such as Huber regression
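
As a rough illustration of why Huber regression is outlier-resistant, the sketch below implements the standard Huber loss in plain Python. The `delta` value and the residuals are arbitrary choices for the example. The loss is quadratic for small residuals and linear for large ones, so a single very expensive listing contributes far less than it would under squared error.

```python
def huber_loss(residual, delta=50.0):
    """Standard Huber loss: quadratic within +/- delta, linear beyond it."""
    abs_res = abs(residual)
    if abs_res <= delta:
        return 0.5 * abs_res**2
    return delta * (abs_res - 0.5 * delta)


def squared_loss(residual):
    """Half squared error, scaled to match the quadratic part of Huber."""
    return 0.5 * residual**2


# Hypothetical residuals: a typical miss and a large miss on a luxury listing
small_miss, large_miss = 10.0, 500.0

print(huber_loss(small_miss), squared_loss(small_miss))
print(huber_loss(large_miss), squared_loss(large_miss))
```

For the small miss the two losses agree, while for the large miss the Huber loss grows only linearly. scikit-learn provides this family of models as `sklearn.linear_model.HuberRegressor`.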

## Modeling Challenges: Overfitting
When I removed the "overfitting penalty" from the hyperopt function, the evaluation metrics improved, but the training metrics were much better than the validation metrics, which is a sign of overfitting. Because the model needs to generalize to new data, I prioritized reducing overfitting over achieving better evaluation metrics.
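
One common way to express such a penalty is to add the train/validation gap to the validation loss. The sketch below is an assumption about the general idea, not the code behind `apply_overfit_penalty` in `hyperopt_utils`, which is not shown here. The function name, the `penalty_weight` parameter, and the MAE values are all hypothetical.

```python
def penalized_loss(train_mae, val_mae, penalty_weight=1.0):
    """Validation MAE plus a penalty for how much worse it is than training MAE."""
    # Only penalize when validation is worse than training
    gap = max(0.0, val_mae - train_mae)
    return val_mae + penalty_weight * gap


# Hypothetical candidates from a tuning run
overfit_model = {"train_mae": 20.0, "val_mae": 60.0}  # big train/val gap
stable_model = {"train_mae": 55.0, "val_mae": 65.0}   # small train/val gap

print(penalized_loss(**overfit_model))  # 60 + 40 = 100
print(penalized_loss(**stable_model))   # 65 + 10 = 75
```

Without the penalty the first candidate wins on validation MAE alone (60 vs 65). With it, the second candidate is preferred because its training and validation errors are much closer.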