# Model Building and Deployment

In this exercise we will build a machine learning model, introduce the vetiver package for machine learning operations, and learn how to use the package to deploy models to Posit Connect.

#### We are starting a new activity, so deactivate the old virtual environment and create a new one!

```bash
# deactivate older venv
deactivate 

# Create a virtual environment
# Use our alias!
py-venv
python -m pip install -r requirements.txt

```

In [9]:
# import required libraries
import os
import ibis
from dotenv import load_dotenv
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
import vetiver
import pins
import rsconnect
from datetime import datetime, timedelta

### Task 1 - Exploratory Data Analysis

#### 🔄 Task

Before we build a machine learning model to predict if an inspection would be successful or not, we need to understand the data.
Use this time to see what features we have available and how they can be used for our model.


In [None]:
# begin by making a connection to the database using the ibis package

import os
import ibis

con = ibis.postgres.connect(
    ...
)

inspection_data = con.table("food_inspection_validated").to_pandas()

pandas DataFrames offer several attributes and methods for exploring the data, such as *columns*, *groupby*, *count*, and *unique*.

In [7]:
# (Uncomment to run some analysis)
#
#inspection_data.columns
#inspection_data.dtypes
#inspection_data.groupby("facility_type").count()["inspection_id"].sort_values(ascending=False)
#inspection_data.groupby("results").count()["inspection_id"]
#inspection_data['results'].unique()
#inspection_data.groupby("risk").count()["inspection_id"]
#inspection_data['violations']

**For Bonus Points**

If you want, you can also use *matplotlib* to plot relationships between different features!

### Task 2 - Feature Engineering

#### 🔄 Task

Now that we have some understanding of our columns, let's look into building features for our model:
- Which columns can we use as features?
- Do all the column values make sense? E.g., look at the *category* column values.
- Are these columns in the right format?
- What kind of encoding can we use for categorical columns?
- Do we need to build a new feature to correctly capture the data?

For this exercise, we will use the following model features:
- Facility type, one-hot encoded for "BAKERY", "RESTAURANT", and "GROCERY STORE"
- Risk, one-hot encoded as HIGH, MEDIUM, and LOW
- Cumulative violations, the running number of violations for a license up to and including that inspection date

**Note**: this is just a preliminary list, you can always add or remove features
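
As a minimal sketch of the two feature ideas above, the toy example below one-hot encodes a facility type column and builds a running violation total per license. The data is invented for illustration, and `count_violations` is assumed to be a per-inspection count that has already been computed.

```python
import pandas as pd

# Toy data, invented for illustration; column names mirror the inspection table.
toy = pd.DataFrame(
    {
        "license_": [1, 1, 2, 2],
        "inspection_date": pd.to_datetime(
            ["2023-01-10", "2023-06-01", "2023-02-15", "2023-03-20"]
        ),
        "facility_type": ["BAKERY", "BAKERY", "RESTAURANT", "RESTAURANT"],
        "count_violations": [2, 1, 0, 3],
    }
)

# One-hot encode facility type: one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["facility_type"], dtype=int)

# Running total of violations per license, ordered by inspection date.
encoded = encoded.sort_values(["license_", "inspection_date"])
encoded["CUM_VIOLATIONS"] = encoded.groupby("license_")["count_violations"].cumsum()

print(encoded[["license_", "facility_type_BAKERY", "CUM_VIOLATIONS"]])
```

With the default prefix, the dummy columns are named `facility_type_BAKERY` and `facility_type_RESTAURANT`; renaming them is a matter of taste.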

In [10]:
# clean up input data for modelling

inspection_data_for_training = (
    inspection_data
    # remove NA licenses
    .loc[inspection_data["license_"] != 0]
    # only use inspections in the last year
    .loc[inspection_data["inspection_date"]>=(inspection_data["inspection_date"].max() - timedelta(days=365))]
    # select only Restaurant, Bakery, Grocery Store
    .loc[
        inspection_data["facility_type"].isin(["RESTAURANT", "BAKERY", "GROCERY STORE"])
    ]
    .pipe(
        lambda inspection_data: pd.get_dummies(
            inspection_data, columns=["facility_type"], prefix=[""], dtype=int
        )
    )
    .rename(
        columns={
            "_BAKERY": "BAKERY",
            "_RESTAURANT": "RESTAURANT",
            "_GROCERY STORE": "GROCERY_STORE",
        }
    )
    # keep only the relevant inspection results
    .loc[inspection_data["results"].isin(["FAIL", "PASS", "PASS W/ CONDITIONS"])]
    # treat "PASS W/ CONDITIONS" as FAIL, since it is not a full pass
    .assign(results=(lambda x: x["results"].replace(["PASS W/ CONDITIONS"], "FAIL")))
    .assign(RESULTS=(lambda x: x["results"].map({"PASS": 1, "FAIL": 0}).astype(int)))
    .drop(columns=["results"])
    # filter out valid risk entries
    .loc[
        inspection_data["risk"].isin(
            ["RISK 1 (HIGH)", "RISK 2 (MEDIUM)", "RISK 3 (LOW)"]
        )
    ]
    # create dummy variables for risk
    .pipe(
        lambda inspection_data: pd.get_dummies(
            inspection_data, columns=["risk"], prefix=[""], dtype=int
        )
    )
    .rename(
        columns={
            "_RISK 1 (HIGH)": "HIGH_RISK",
            "_RISK 2 (MEDIUM)": "MEDIUM_RISK",
            "_RISK 3 (LOW)": "LOW_RISK",
        }
    )
    # sort results by business and inspection date
    .sort_values(by=["license_", "inspection_date"])
)
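
The chain above builds each boolean mask from the original `inspection_data` frame and relies on pandas aligning the mask to the already-filtered rows by index label. One possible alternative, sketched below on invented toy data, is to pass a callable to `.loc` so that each mask is computed on the frame as it exists at that step. This is a stylistic option, not a required change.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame(
    {
        "license_": [0, 101, 102, 103],
        "facility_type": ["RESTAURANT", "BAKERY", "SCHOOL", "GROCERY STORE"],
    }
)
keep_types = ["RESTAURANT", "BAKERY", "GROCERY STORE"]

# Style used above: masks computed on the original frame, aligned by index label.
by_alignment = toy.loc[toy["license_"] != 0].loc[toy["facility_type"].isin(keep_types)]

# Alternative: each callable receives the frame as it is at that point in the chain.
by_callable = toy.loc[lambda d: d["license_"] != 0].loc[
    lambda d: d["facility_type"].isin(keep_types)
]

print(by_callable)
```

Both forms select the same rows here. The callable form only works while the column it reads is still present in the chained frame.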

In [11]:
# count violations for each inspection
inspection_data_for_training["count_violations"] = (
    inspection_data_for_training["violations"]
    .apply(lambda x: len(x[1:-1].split('","')) if x is not None else None)
    .fillna(0)
)
# count cumulative violations up to each inspection date for a license
inspection_data_for_training["CUM_VIOLATIONS"] = inspection_data_for_training.groupby(
    ["license_"]
)["count_violations"].cumsum()
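
To make the string handling in the counting step easier to follow, here is a small standalone sketch. It assumes the `violations` value is a string of the form `{"first","second"}`, as the truncated output suggests; the example text is invented. Unlike the cell above, this hypothetical helper also treats an empty `{}` as zero violations.

```python
def count_violations(raw):
    """Count quoted entries in a string such as '{"a","b"}'."""
    # Missing values (None, NaN) and empty collections count as zero.
    if not isinstance(raw, str) or raw in ("", "{}"):
        return 0
    # Drop the outer braces, then split on the '","' separator between entries.
    # Commas inside an entry are not followed by a quote, so they do not split it.
    return len(raw[1:-1].split('","'))


example = '{"first violation, with a comma","second violation","third violation"}'
print(count_violations(example))  # 3
print(count_violations(None))     # 0
```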

In [12]:
inspection_data_for_training

Unnamed: 0,inspection_id,dba_name,aka_name,license_,zip,inspection_date,inspection_type,violations,BAKERY,GROCERY_STORE,RESTAURANT,RESULTS,HIGH_RISK,MEDIUM_RISK,LOW_RISK,count_violations,CUM_VIOLATIONS
173,120273,"QUITEFRANKLY,LTD.",UPS CAFETERIA,0,60607,2010-01-06,CANVASS,"{""33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTEN...",0,0,1,1,1,0,0,4.0,4.0
2731,68320,TACOS REYNA,,0,60617,2010-03-02,CONSULTATION,,0,0,1,0,1,0,0,0.0,4.0
2342,78339,JEWEL FOOD STORE #3030,JEWEL FOOD STORE #3030,1000572,60649,2010-03-10,COMPLAINT,"{""34. FLOORS: CONSTRUCTED PER CODE, CLEANED, G...",0,1,0,1,1,0,0,1.0,1.0
2726,114354,NAHA RESTAURANT,NAHA RESTAURANT,1000612,60654,2010-02-26,CANVASS,"{""35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONS...",0,0,1,1,1,0,0,3.0,3.0
4250,176979,CHICAGOLAND PIZZA & PASTA,CHICAGOLAND PIZZA & PASTA,1000639,60629,2010-04-05,CANVASS,"{""32. FOOD AND NON-FOOD CONTACT SURFACES PROPE...",0,0,1,1,1,0,0,3.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2614,160280,O'BRIENS RESTAURANT & BAR,O'BRIENS RESTAURANT & BAR,9027,60610,2010-02-25,CANVASS,"{""18. NO EVIDENCE OF RODENT OR INSECT OUTER OP...",0,0,1,0,1,0,0,6.0,6.0
2857,160282,O'BRIENS RESTAURANT & BAR,O'BRIENS RESTAURANT & BAR,9027,60610,2010-03-04,CANVASS RE-INSPECTION,"{""32. FOOD AND NON-FOOD CONTACT SURFACES PROPE...",0,0,1,1,1,0,0,5.0,11.0
3370,215247,STAN'S CORNER INC,STAN'S CORNER INC,9498,60638,2010-03-15,CANVASS,"{""34. FLOORS: CONSTRUCTED PER CODE, CLEANED, G...",0,0,1,1,1,0,0,2.0,2.0
4131,197351,GRANDMA GEBHARDS,GRANDMA GEBHARDS,9616,60602,2010-03-31,CANVASS,"{""30. FOOD IN ORIGINAL CONTAINER, PROPERLY LAB...",0,0,1,1,1,0,0,2.0,2.0


### Task 3 - Building "a" classification model

Now we will use these features to build a classification model.

!DISCLAIMER!

This is NOT a modelling workshop, so we won't spend a lot of time on this task.

#### 🔄 Task

Feel free to apply your data science skills and build a model that works for you. The purpose of this exercise is to build a simple, functioning model that we can work with later; there is no single correct solution.

In [None]:
# your code here

One example of a classification model is a random forest model, as built below:

In [13]:
# Create training and test split
X = inspection_data_for_training.drop(
    columns=[
        "license_",
        "RESULTS",
        "inspection_id",
        "dba_name",
        "aka_name",
        "inspection_type",
        "violations",
        "count_violations",
        "inspection_date",
        "zip",
    ]
)
y = inspection_data_for_training[["RESULTS"]]

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [14]:
# Train a random forest model
clf = RandomForestClassifier(max_depth=10, random_state=0)
clf.fit(X_train, np.ravel(y_train))

In [15]:
# test predictions
y_pred = clf.predict(X_test)


In [16]:
mse = metrics.mean_squared_error(y_test, y_pred)

print(np.sqrt(mse))

0.5789290259492332
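
Because `RESULTS` takes only the values 0 and 1, every squared error is either 0 or 1, so the mean squared error equals the share of misclassified rows. The printed RMSE of about 0.579 therefore corresponds to an error rate of about 0.335 (0.579 squared), or an accuracy of roughly 0.665. The sketch below checks this relationship on invented toy labels using only the standard library.

```python
import math

# Toy labels, invented for illustration (1 = PASS, 0 = FAIL).
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_hat = [1, 1, 1, 0, 0, 1, 0, 1]

n = len(y_true)
errors = sum(t != p for t, p in zip(y_true, y_hat))  # misclassified rows
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / n
rmse = math.sqrt(mse)
accuracy = 1 - errors / n

print(rmse)      # 0.5
print(accuracy)  # 0.75
```

For a classifier, reporting accuracy directly (for example with `metrics.accuracy_score`) is usually easier to interpret than RMSE.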


### Task 4 - Vetiver model deployment

Now that we have a functioning model, we will use the *vetiver* framework to deploy it to Posit Connect

#### 🔄 Task

In this task:
- Install the *vetiver* and *pins* packages in your virtual environment
- Convert your model to a Vetiver model
- Create a model board for the workshop's Posit Connect instance
- Use *vetiver*'s *vetiver_pin_write* function to write the model as a pin to Connect
- Use *vetiver*'s *deploy_rsconnect* function to deploy the model as an API on Connect
- Test the API using the *vetiver.predict* function

In [None]:
# your code here

# start by creating a vetiver model
v = vetiver.VetiverModel(
    # change username in model name
    model=..., model_name="[USER.NAME]/inspection_results", prototype_data=X_train[:1]
)
# then create a pins model board for Connect
model_board = pins.board_connect(
    server_url=...,
    api_key=...,
    allow_pickle_read=True
)

# use vetiver_pin_write to write the model as a pin
vetiver.vetiver_pin_write(board=..., model=...)

# use deploy_rsconnect to deploy the model as an API
rsc_server = os.getenv("CONNECT_SERVER")
rsc_key = os.getenv("CONNECT_API_KEY")
connect_server = rsconnect.api.RSConnectServer(url=rsc_server, api_key=rsc_key)


vetiver.deploy_rsconnect(
    connect_server=...,
    board=...,
    # change username in pin name
    pin_name="[USER.NAME]/inspection_results",
)
# test model predictions using the API
...
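
For consumers that do not use *vetiver*, the deployed API can also be called over plain HTTP. The sketch below only builds such a request and does not send it. It assumes the API exposes a `/predict` path that accepts a JSON list of records, that Connect expects the API key in an `Authorization: Key ...` header, and that the model features are exactly the seven columns shown. The URL, key, and record values are hypothetical placeholders.

```python
import json
import urllib.request

# Hypothetical values: substitute your own content URL and API key.
api_url = "https://connect.example.com/content/1234/predict"
api_key = "YOUR_API_KEY"

# One record with the same columns as X_train (values invented).
records = [
    {
        "BAKERY": 0,
        "GROCERY_STORE": 0,
        "RESTAURANT": 1,
        "HIGH_RISK": 1,
        "MEDIUM_RISK": 0,
        "LOW_RISK": 0,
        "CUM_VIOLATIONS": 4.0,
    }
]

request = urllib.request.Request(
    api_url,
    data=json.dumps(records).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Key {api_key}",
    },
    method="POST",
)

# Sending is left commented out because the URL above is a placeholder.
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```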

### Task 5 - Explore API performance tuning on Posit Connect

#### 🔄 Task

Now that our model has been deployed as an API on Connect and we can make predictions from it, explore what options we have on Connect to make this API available to other consumers. Some questions to consider:
- How do we make sure our API is always available?
- How many external processes would be using the API?
- Which users and processes should have access to it?

More details on what settings can be tuned are available [here](https://docs.posit.co/connect/user/content-settings/#content-runtime)

### Task 6 - Schedule model run for regular deployments

Now that we have built and deployed the model, we can also deploy the code that generates it to Connect, so that the model can be retrained and refreshed on a schedule.

#### 🔄 Task

Use the *rsconnect-python* deployment functionality learned in the previous exercise to deploy your notebook

In [None]:
# your code here

After deployment, explore the scheduling options available on Connect. More details can be found [here](https://docs.posit.co/connect/user/scheduling/)