
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Capstone Project: Managing the Machine Learning Lifecycle

Create a workflow that includes pre-processing logic, the optimal ML algorithm and hyperparameters, and post-processing logic.

## Instructions

In this course, we've primarily used Random Forest in `sklearn` to model the Airbnb dataset.  In this exercise, perform the following tasks:
<br><br>
0. Create custom pre-processing logic to featurize the data
0. Try a number of different algorithms and hyperparameters, and choose the best-performing solution
0. Create related post-processing logic
0. Package the results and execute it as its own run

## Prerequisites
- Web browser: Chrome
- A cluster configured with **8 cores** and **DBR 7.0 ML**

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the<br/>
start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [0]:
%run "./Includes/Classroom-Setup"

In [0]:
# Adjust our working directory from what DBFS sees to what Python actually sees
working_path = workingDir.replace("dbfs:", "/dbfs")
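
The path translation above can be tried in isolation. The following is a minimal sketch, assuming the usual Databricks convention that a `dbfs:/...` URI is visible to local Python file APIs under the `/dbfs` mount. The helper name `to_local_path` and the example paths are hypothetical.

```python
def to_local_path(dbfs_uri: str) -> str:
    """Translate a DBFS URI into the path that local Python file APIs see."""
    prefix = "dbfs:"
    # Only rewrite the scheme prefix; leave already-local paths untouched.
    if dbfs_uri.startswith(prefix):
        return "/dbfs" + dbfs_uri[len(prefix):]
    return dbfs_uri


print(to_local_path("dbfs:/user/example/capstone"))
```

Unlike a bare `str.replace`, this version only touches a leading `dbfs:` and leaves any later occurrence in the path alone.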

## Pre-processing

Take a look at the dataset and notice that there are plenty of strings and `NaN` values present. Our end goal is to train a sklearn regression model to predict the price of an Airbnb listing.


Before we can start training, we need to pre-process our data to be compatible with sklearn models by making all features purely numerical.

In [0]:
import pandas as pd

airbnbDF = spark.read.parquet("/mnt/training/airbnb/sf-listings/sf-listings-correct-types.parquet").toPandas()

display(airbnbDF)

In the following cells we will walk you through the most basic pre-processing steps necessary. Feel free to add additional steps afterwards to improve your model performance.

First, convert the `price` from a string to a float since the regression model will be predicting numerical values.

In [0]:
# ANSWER
airbnbDF["int_price"] = airbnbDF["price"].apply(lambda s: float(s.replace("$", "").replace(",", "")))
airbnbDF_cleaned_price = airbnbDF.drop(["price"],axis=1)
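
The lambda above calls `s.replace` on each value, so it would raise an error on a missing price. As a hedged alternative, the sketch below does the same conversion with vectorized pandas string methods, which leave missing values as `NaN`. It assumes prices formatted like `$1,200.00`; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample of raw price strings, including a missing value.
prices = pd.Series(["$1,200.00", "$85.00", None])

# Strip the currency symbol and thousands separators, then cast to float.
# Missing values stay NaN rather than raising an error.
clean_prices = prices.str.replace(r"[$,]", "", regex=True).astype(float)

print(clean_prices.tolist())
```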

Take a look at our remaining columns with strings (or numbers) and decide if you would like to keep them as features or not.

Remove the features you decide not to keep.

In [0]:
# ANSWER
airbnbDF_cleaned_features = airbnbDF_cleaned_price.drop(["host_is_superhost", "instant_bookable","cancellation_policy"], axis=1)

For the string columns that you've decided to keep, pick a numerical encoding. Don't forget to deal with the `NaN` entries in those columns first.

In [0]:
# ANSWER

airbnbDF_cleaned_features = airbnbDF_cleaned_features[airbnbDF_cleaned_features["zipcode"] != "-- default zip code --"] # removed entry with unusual zipcode
airbnbDF_cleaned_features = airbnbDF_cleaned_features.dropna(subset=["zipcode"])

# encoded each string label into an integer
airbnbDF_cleaned_features['zipcode'] = pd.factorize(airbnbDF_cleaned_features['zipcode'])[0]
airbnbDF_cleaned_features['neighbourhood_cleansed'] = pd.factorize(airbnbDF_cleaned_features['neighbourhood_cleansed'])[0]
airbnbDF_cleaned_features['property_type'] = pd.factorize(airbnbDF_cleaned_features['property_type'])[0]
airbnbDF_cleaned_features['room_type'] = pd.factorize(airbnbDF_cleaned_features['room_type'])[0]
airbnbDF_cleaned_features['bed_type'] = pd.factorize(airbnbDF_cleaned_features['bed_type'])[0]
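
To see why `NaN` entries should be handled before encoding, the short sketch below runs `pd.factorize` on a hypothetical column with one missing value. Codes are assigned in order of first appearance, and by default a missing entry receives the sentinel code `-1`, which a model would then treat like any other category.

```python
import pandas as pd

# Hypothetical room types with one missing entry.
room_type = pd.Series(["Entire home/apt", "Private room", None, "Entire home/apt"])

codes, uniques = pd.factorize(room_type)

print(codes.tolist())    # integer code per row; the missing entry is coded -1
print(list(uniques))     # distinct labels in order of first appearance
```

Note that these integer codes carry no meaningful order. Tree-based models such as Random Forest usually tolerate that, while linear models may be better served by one-hot encoding.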


Before we create a train test split, check that all your columns are numerical. Remember to drop the original string columns after creating numerical representations of them.

Make sure to drop the price column from the training data when doing the train test split.

In [0]:
# ANSWER
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(airbnbDF_cleaned_features.drop(["int_price"], axis=1), airbnbDF_cleaned_features[["int_price"]].values.ravel(), random_state=42)
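
One way to check that every column is numerical before splitting is to list the columns whose dtype is not numeric. This is a minimal sketch on a hypothetical frame; the helper name `non_numeric_columns` is an assumption, not part of the course material.

```python
import pandas as pd


def non_numeric_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is not numeric."""
    return df.select_dtypes(exclude="number").columns.tolist()


# Hypothetical frame: one column was left as strings by mistake.
sample = pd.DataFrame({
    "bedrooms": [1, 2],
    "bathrooms": [1.0, 1.5],
    "room_type": ["Private room", "Entire home/apt"],
})

print(non_numeric_columns(sample))
```

An empty list means the frame is ready to be passed to a sklearn model.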

## Model

After cleaning our data, we can start creating our model!

First, if there are still `NaN`s in your data, you may want to impute these values instead of dropping those entries entirely. Make sure that any further processing or imputing steps after the train test split are part of a model/pipeline that can be saved.

In the following cell, create and fit a single sklearn model.

In [0]:
# ANSWER

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Impute NaN values with the column median. A SimpleImputer placed directly in a Pipeline
# is applied to every column it receives, so one imputer covers all of the columns below.
columns_to_impute = ["review_scores_value", "review_scores_location", "review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness", "review_scores_checkin", "review_scores_communication", "host_total_listings_count", "bathrooms", "beds"]
preprocessing_steps = [("medianImputer", SimpleImputer(missing_values=np.nan, strategy="median"))]

# define regression 
n_estimators=100
max_depth=5
model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth) 

# create and train pipeline
pipeline = Pipeline(preprocessing_steps+[("model", model)])
pipeline.fit(X_train, y_train) 
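
In a `Pipeline`, each step receives every column, so a `SimpleImputer` added directly as a step imputes all columns regardless of its step name. If imputation should be restricted to specific columns, one option is sklearn's `ColumnTransformer`. The sketch below shows the idea on hypothetical toy data; it is not the course's reference solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical toy data: only "beds" contains missing values.
X = pd.DataFrame({
    "accommodates": [2, 4, 6, 2, 3, 5],
    "beds": [1.0, np.nan, 3.0, 1.0, np.nan, 2.0],
})
y = np.array([100.0, 180.0, 260.0, 90.0, 150.0, 210.0])

# Impute the median only in the listed columns; pass the rest through unchanged.
imputer = ColumnTransformer(
    [("median_imputer", SimpleImputer(strategy="median"), ["beds"])],
    remainder="passthrough",
)

toy_pipeline = Pipeline([
    ("impute", imputer),
    ("model", RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)),
])
toy_pipeline.fit(X, y)

predictions = toy_pipeline.predict(X)
print(predictions.shape)
```

Because the imputer lives inside the pipeline, its medians are learned from the training data only and are saved together with the model. Be aware that `ColumnTransformer` places the transformed columns first, so the column order seen by the model differs from the input frame.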


Pick and calculate a regression metric for evaluating your model.

In [0]:
# ANSWER
import numpy as np
from sklearn.metrics import mean_squared_error

# Predict once and reuse the result when computing RMSE
predictions = pipeline.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, predictions))
rmse
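
RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same unit as the target (here, dollars per night). As a small worked example on hypothetical values, it can be computed by hand with NumPy:

```python
import numpy as np

# Hypothetical nightly prices and predictions, in dollars.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

errors = y_pred - y_true                 # 10, -10, 30
rmse_by_hand = np.sqrt(np.mean(errors ** 2))

print(round(rmse_by_hand, 2))
```

Because the errors are squared before averaging, the single 30-dollar miss contributes far more to the score than the two 10-dollar misses.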

Log your model on MLflow with the same metric you calculated above so we can compare all the different models you have tried! Make sure to also log any hyperparameters that you plan on tuning!

In [0]:
# ANSWER
import mlflow.sklearn

with mlflow.start_run() as run:
  mlflow.sklearn.log_model(pipeline, "model")
  mlflow.log_param("max_depth", max_depth)
  mlflow.log_param("n_estimators", n_estimators)
  mlflow.log_metric("rmse", rmse)

  experimentID = run.info.experiment_id

Change and re-run the three code cells above to log different models and/or models with different hyperparameters until you are satisfied with the performance of at least one of them.

Look through the MLflow UI for the best model. Copy its `URI` so you can load it as a `pyfunc` model.

In [0]:
# ANSWER
import mlflow.pyfunc
from mlflow.tracking import MlflowClient
import pandas as pd
client = MlflowClient()

runs = []
for run in client.search_runs(experimentID):
  run_dict = run.to_dictionary()
  rmse = run_dict["data"]["metrics"].get("rmse")
  artifact_uri = run_dict["info"].get("artifact_uri")
  if rmse is not None: # keep runs that logged rmse, including a score of 0.0
    runs.append((rmse, artifact_uri))
  
runsDF = pd.DataFrame(runs, columns = ["rmse", "artifact_uri"])
best_URI = runsDF.sort_values("rmse").iloc[0,-1]

best_model_path = best_URI.replace("dbfs:", "/dbfs") + "/model"
best_model = mlflow.pyfunc.load_model(model_uri=best_model_path)
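
The selection step above boils down to filtering out runs without the metric and taking the one with the lowest RMSE. The sketch below shows that logic without MLflow, on hypothetical `(rmse, artifact_uri)` pairs.

```python
# Hypothetical (rmse, artifact_uri) pairs, as collected from an experiment's runs.
candidate_runs = [
    (152.3, "dbfs:/example/run-a/artifacts"),
    (None, "dbfs:/example/run-b/artifacts"),   # run that never logged rmse
    (147.9, "dbfs:/example/run-c/artifacts"),
]

# Keep only runs that logged the metric. Comparing against None (rather than
# relying on truthiness) also keeps a legitimate score of 0.0.
scored_runs = [(rmse, uri) for rmse, uri in candidate_runs if rmse is not None]

# Lower RMSE is better, so the best run is the minimum by score.
best_rmse, best_uri = min(scored_runs, key=lambda pair: pair[0])

print(best_rmse, best_uri)
```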

## Post-processing

Our model currently gives us the predicted price per night for each Airbnb listing. Now we would like our model to tell us what the price per person would be for each listing, assuming the number of renters is equal to the `accommodates` value.

Fill in the following model class to add in a post-processing step which will get us from total price per night to **price per person per night**.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Check out <a href="https://www.mlflow.org/docs/latest/models.html#id13" target="_blank">the MLflow docs for help.</a>

In [0]:
# ANSWER
class Airbnb_Model(mlflow.pyfunc.PythonModel):

    def __init__(self, model):
        self.model = model

    def postprocess_result(self, model_input, results):
        return results/list(model_input["accommodates"])
    
    def predict(self, context, model_input):
        results = self.model.predict(model_input)
        return self.postprocess_result(model_input, results)
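
The post-processing logic can be sanity-checked without MLflow or a trained model. The sketch below uses the same wrapping idea as the class above, but with a hypothetical stand-in model that always predicts 120 and without the `mlflow.pyfunc.PythonModel` base class, so that it runs on its own.

```python
import numpy as np
import pandas as pd


class ConstantPriceModel:
    """Hypothetical stand-in for the trained model: predicts 120 for every row."""

    def predict(self, model_input):
        return np.full(len(model_input), 120.0)


class PricePerPerson:
    """Wraps a model and divides its nightly price by the number of guests."""

    def __init__(self, model):
        self.model = model

    def predict(self, model_input):
        nightly_price = self.model.predict(model_input)
        # Element-wise division: one price per person for each listing.
        return nightly_price / model_input["accommodates"].to_numpy()


listings = pd.DataFrame({"accommodates": [1, 2, 4]})
per_person = PricePerPerson(ConstantPriceModel()).predict(listings)
print(per_person.tolist())
```

A listing with `accommodates` equal to 0 would lead to a division by zero, so it may be worth checking for such rows in real data.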

Construct and save the model to the given `final_model_path`.

In [0]:
# ANSWER
final_model_path =  f"{working_path}/model"

import shutil

# Remove any model left over from a previous run, since save_model may fail if the path already exists
shutil.rmtree(final_model_path, ignore_errors=True)

price_per_person = Airbnb_Model(best_model)
mlflow.pyfunc.save_model(path=final_model_path, python_model=price_per_person)

Load the model in `python_function` format and apply it to our test data `X_test` to check that we are getting price per person predictions now.

In [0]:
# ANSWER
# Load the model
final_model = mlflow.pyfunc.load_model(final_model_path)

# Apply the model
final_model.predict(X_test)

## Packaging your Model

Now we would like to package our completed model!

First, save your testing data at `test_data_path` so we can test the packaged model.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** When using `.to_csv` make sure to set `index=False` so you don't end up with an extra index column in your saved dataframe.

In [0]:
# ANSWER
# save the testing data 
test_data_path = f"{working_path}/test_data.csv"
X_test.to_csv(test_data_path, index=False)

prediction_path = f"{working_path}/predictions.csv"
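
The effect of `index=False` is easy to see with an in-memory round trip. This sketch uses a hypothetical two-row frame and `io.StringIO` instead of a file on disk.

```python
import io

import pandas as pd

# Hypothetical two-row feature frame.
features = pd.DataFrame({"accommodates": [2, 4], "beds": [1.0, 2.0]})

# With the default index=True, the row index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(features.to_csv()))

# With index=False, the saved columns match the original frame.
without_index = pd.read_csv(io.StringIO(features.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

The extra column matters here because the reloaded frame is passed straight to the model, which expects exactly the features it was trained on.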

Next, determine what the project script should do. Fill out the `model_predict` function to load the trained model you just saved (at `final_model_path`) and make price per person predictions on the data at `test_data_path`. Those predictions should then be saved under `prediction_path` for the user to access later.

Run the cell to check that your function is behaving correctly and that you have predictions saved at `demo_prediction_path`.

In [0]:
# ANSWER
import click
import mlflow.pyfunc
import pandas as pd

@click.command()
@click.option("--final_model_path", default="", type=str)
@click.option("--test_data_path", default="", type=str)
@click.option("--prediction_path", default="", type=str)
def model_predict(final_model_path, test_data_path, prediction_path):
  final_model = mlflow.pyfunc.load_model(final_model_path)
  X_test = pd.read_csv(test_data_path)
  prediction = final_model.predict(X_test) 
  pd.DataFrame(prediction).to_csv(prediction_path, index=False)

# test model_predict function
demo_prediction_path = f"{working_path}/predictions.csv"

from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(model_predict, ['--final_model_path', final_model_path, 
                                       '--test_data_path', test_data_path,
                                       '--prediction_path', demo_prediction_path], catch_exceptions=True)

assert result.exit_code == 0, "Code failed" # Check to see that it worked
print("Price per person predictions: ")
print(pd.read_csv(demo_prediction_path))

Next, we will create an MLproject file and put it under our `workingDir`. Complete the parameters and command of the file.

In [0]:
# ANSWER
dbutils.fs.put(f"{workingDir}/MLproject", 
'''
name: Capstone-Project

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      final_model_path: {type: str, default: ""}
      test_data_path: {type: str, default: ""}
      prediction_path: {type: str, default: ""}
      stacktrace_path: {type: str, default: ""}
    command: "python predict.py --test_data_path {test_data_path} --final_model_path {final_model_path} --prediction_path {prediction_path} --stacktrace_path {stacktrace_path}"
'''.strip(), overwrite=True)
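
When the project runs, each `{name}` placeholder in `command` is filled with the value of the matching parameter. The sketch below mimics that idea with Python's `str.format`. It is only an illustration with hypothetical paths, not MLflow's implementation, which may treat values differently (for example by quoting them).

```python
# Command template in the same shape as the MLproject entry point above.
command_template = (
    "python predict.py --test_data_path {test_data_path} "
    "--final_model_path {final_model_path} "
    "--prediction_path {prediction_path} "
    "--stacktrace_path {stacktrace_path}"
)

# Hypothetical parameter values, for illustration only.
parameters = {
    "final_model_path": "/dbfs/example/model",
    "test_data_path": "/dbfs/example/test_data.csv",
    "prediction_path": "/dbfs/example/predictions.csv",
    "stacktrace_path": "/dbfs/example/stacktrace.txt",
}

# Plain str.format stands in for MLflow's own substitution here.
command = command_template.format(**parameters)
print(command)
```

Every placeholder used in `command` needs a matching entry under `parameters`, and the option names must match the `click` options that the script defines.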

In [0]:
print(prediction_path)

We then create a `conda.yaml` file to list the dependencies needed to run our script.

For simplicity, we will pin the same library versions that we are running in this notebook.

In [0]:
import cloudpickle, numpy, pandas, sklearn, sys

version = sys.version_info # Handles possibly conflicting Python versions

file_contents = f"""
name: Capstone
channels:
  - defaults
dependencies:
  - python={version.major}.{version.minor}.{version.micro}
  - cloudpickle={cloudpickle.__version__}
  - numpy={numpy.__version__}
  - pandas={pandas.__version__}
  - scikit-learn={sklearn.__version__}
  - pip:
    - mlflow=={mlflow.__version__}
""".strip()

dbutils.fs.put(f"{workingDir}/conda.yaml", file_contents, overwrite=True)

print(file_contents)

Now we will put the **`predict.py`** script into our project package.

Complete the **`.py`** file by copying and placing the **`model_predict`** function you defined above.

In [0]:
# ANSWER
dbutils.fs.put(f"{workingDir}/predict.py", 
'''
import click
import mlflow.pyfunc
import pandas as pd
import traceback

@click.command()
@click.option("--final_model_path", default="", type=str)
@click.option("--test_data_path", default="", type=str)
@click.option("--prediction_path", default="", type=str)
@click.option("--stacktrace_path", default="", type=str)
def model_predict(final_model_path, test_data_path, prediction_path, stacktrace_path):
  try:
    final_model = mlflow.pyfunc.load_model(final_model_path)
    X_test = pd.read_csv(test_data_path)
    prediction = final_model.predict(X_test) 
    pd.DataFrame(prediction).to_csv(prediction_path, index = False)
    
  except Exception:
    # Record the stack trace so the failure can be inspected after the run
    with open(stacktrace_path, "w") as file:
      traceback.print_exc(file=file)
    
if __name__ == "__main__":
  model_predict()

'''.strip(), overwrite=True)
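
The script above writes the stack trace to a file when something fails, presumably so that errors inside the project run can be inspected from the notebook afterwards. The pattern can be tried on its own with the standard library. In this sketch, the helper `run_and_record` and the failing task are hypothetical.

```python
import os
import tempfile
import traceback


def run_and_record(task, stacktrace_path):
    """Run task(); on failure, write the stack trace to stacktrace_path."""
    try:
        task()
    except Exception:
        with open(stacktrace_path, "w") as file:
            traceback.print_exc(file=file)


# Hypothetical failing task, used only to trigger the error path.
def failing_task():
    return 1 / 0


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stacktrace.txt")
    run_and_record(failing_task, path)
    with open(path) as file:
        recorded = file.read()

print(recorded.splitlines()[-1])
```

One trade-off of this design is that the process still exits successfully after an error, so the caller has to check for the stack trace file to detect a failure.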

Let's double-check that all the files we've created are in the `workingDir` folder. You should have at least the following three files:
* `MLproject`
* `conda.yaml`
* `predict.py`

In [0]:
display( dbutils.fs.ls(workingDir) )

Under **`workingDir`** is your completely packaged project.

Run the project to use the model saved at **`final_model_path`** to predict the price per person of each Airbnb listing in **`test_data_path`** and save those predictions under **`second_prediction_path`** (defined below).

In [0]:
# ANSWER

second_prediction_path = f"{working_path}/predictions-2.csv"

mlflow.projects.run(working_path,
  parameters={
    "final_model_path": final_model_path,
    "test_data_path": test_data_path,
    "prediction_path": second_prediction_path,
    "stacktrace_path": f"{working_path}/stacktrace.txt"
})

try: print( dbutils.fs.head(f"{workingDir}/stacktrace.txt") )
except: print("No errors detected") # No file, then no error

Run the following cell to check that your model's predictions are there!

In [0]:
print("Price per person predictions: ")
print(pd.read_csv(second_prediction_path))

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "./Includes/Classroom-Cleanup"

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> All done!</h2>

Thank you for your participation!

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>