# Model building and Deployment

In this exercise we will build a machine learning model, introduce the vetiver package for machine learning operations, and learn how to use the package to deploy models to Posit Connect.

### Task 1 - Exploratory Data Analysis

#### 🔄 Task

Before we build a machine learning model to predict if an inspection would be successful or not, we need to understand the data.
Use this time to see what features we have available and how they can be used for our model.


#### ✅ Solution

Load the validated food inspection data from the database and look at the different columns and their values.

In [1]:
# Use the ibis package from the previous exercise to perform this operation

pandas DataFrames offer several attributes and methods for this, such as *columns*, *groupby*, *count*, and *unique*.

In [None]:
# (Uncomment to run some analysis)
#
# inspection_data.columns
# inspection_data.dtypes
# inspection_data.groupby("facility_type").count()["inspection_id"].sort_values(ascending=False)
# inspection_data.groupby("results").count()["inspection_id"]
# inspection_data['results'].unique()
# inspection_data.groupby("risk").count()["inspection_id"]
# inspection_data['violations']
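
As a minimal sketch of the kind of summary these calls produce, the cell below builds a tiny made-up frame and counts values per category. The rows are invented for illustration; only the column names are taken from the workshop data. `value_counts` is used here as a shorter alternative to `groupby(...).count()`.

```python
import pandas as pd

# Invented rows; only the column names mirror the workshop data.
toy = pd.DataFrame(
    {
        "inspection_id": [1, 2, 3, 4, 5],
        "facility_type": [
            "RESTAURANT", "BAKERY", "RESTAURANT", "GROCERY STORE", "RESTAURANT",
        ],
        "results": ["PASS", "FAIL", "PASS W/ CONDITIONS", "PASS", "FAIL"],
    }
)

# Number of inspections per facility type, most frequent first.
facility_counts = toy["facility_type"].value_counts()

# Share of each inspection outcome.
result_share = toy["results"].value_counts(normalize=True)

print(facility_counts)
print(result_share)
```

Looking at shares rather than raw counts makes it easier to spot class imbalance in *results* before modelling.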

**For Bonus Points**

If you want, you can also use *matplotlib* to plot relationships between different features!

### Task 2 - Feature Engineering

#### 🔄 Task

Now that we have some understanding of our columns, let's look into building features for our model:
- Which columns can we use as features?
- Do all the column values make sense? E.g. look at the *category* column values.
- Are these columns in the right format?
- What kind of encoding can we use for categorical columns?
- Do we need to build a new feature to correctly capture the data?

#### ✅ Solution

For this exercise, we will use the following model features:
- Facility type, one-hot encoded as BAKERY, RESTAURANT and GROCERY_STORE
- Risk, one-hot encoded as HIGH_RISK, MEDIUM_RISK and LOW_RISK
- Cumulative violations: the number of violations recorded for a license up to and including that inspection date

**Note**: this is just a preliminary list; you can always add or remove features.

In [None]:
# Clean up the input data for modelling
import pandas as pd

inspection_data_for_training = (
    inspection_data
    # remove NA licenses
    .loc[inspection_data["license_"] != 0]
    # select only Restaurant, Bakery, Grocery Store
    .loc[
        inspection_data["facility_type"].isin(["RESTAURANT", "BAKERY", "GROCERY STORE"])
    ]
    .pipe(
        lambda inspection_data: pd.get_dummies(
            inspection_data, columns=["facility_type"], prefix=[""], dtype=int
        )
    )
    .rename(
        columns={
            "_BAKERY": "BAKERY",
            "_RESTAURANT": "RESTAURANT",
            "_GROCERY STORE": "GROCERY_STORE",
        }
    )
    # keep only the relevant inspection results
    .loc[inspection_data["results"].isin(["FAIL", "PASS", "PASS W/ CONDITIONS"])]
    # treat "PASS W/ CONDITIONS" as FAIL, since it is not a clean pass
    .assign(results=(lambda x: x["results"].replace(["PASS W/ CONDITIONS"], "FAIL")))
    .assign(RESULTS=(lambda x: x["results"].map({"PASS": 1, "FAIL": 0}).astype(int)))
    .drop(columns=["results"])
    # keep only rows with a valid risk entry
    .loc[
        inspection_data["risk"].isin(
            ["RISK 1 (HIGH)", "RISK 2 (MEDIUM)", "RISK 3 (LOW)"]
        )
    ]
    # create dummy variables for risk
    .pipe(
        lambda inspection_data: pd.get_dummies(
            inspection_data, columns=["risk"], prefix=[""], dtype=int
        )
    )
    .rename(
        columns={
            "_RISK 1 (HIGH)": "HIGH_RISK",
            "_RISK 2 (MEDIUM)": "MEDIUM_RISK",
            "_RISK 3 (LOW)": "LOW_RISK",
        }
    )
    # sort results by business and inspection date
    .sort_values(by=["license_", "inspection_date"])
)
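
The pipeline above filters with boolean masks built from the original `inspection_data`. This works because pandas aligns each mask to the already-filtered frame by index. An alternative is to pass callables to `.loc`, so each condition is evaluated against the intermediate frame. The sketch below shows that pattern on invented rows; `toy` and `cleaned` are illustrative names, not part of the workshop code.

```python
import pandas as pd

# Invented rows; only the column names mirror the workshop data.
toy = pd.DataFrame(
    {
        "license_": [0, 11, 11, 22, 33, 44],
        "facility_type": [
            "RESTAURANT", "BAKERY", "SCHOOL",
            "RESTAURANT", "GROCERY STORE", "RESTAURANT",
        ],
        "results": [
            "PASS", "FAIL", "PASS",
            "NO ENTRY", "PASS W/ CONDITIONS", "PASS",
        ],
    }
)

cleaned = (
    toy
    # Each lambda receives the frame as it stands at that step of the chain.
    .loc[lambda df: df["license_"] != 0]
    .loc[lambda df: df["facility_type"].isin(["RESTAURANT", "BAKERY", "GROCERY STORE"])]
    .loc[lambda df: df["results"].isin(["FAIL", "PASS", "PASS W/ CONDITIONS"])]
    # An empty prefix and separator yield plain category names as column names.
    .pipe(
        lambda df: pd.get_dummies(
            df, columns=["facility_type"], prefix="", prefix_sep="", dtype=int
        )
    )
)

print(cleaned)
```

With callables, the filters no longer depend on the original frame keeping the same index, which helps if a step such as `reset_index` is added to the chain later.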

In [None]:
# count violations for each inspection
inspection_data_for_training["count_violations"] = (
    inspection_data_for_training["violations"]
    .apply(lambda x: len(x[1:-1].split('","')) if x is not None else None)
    .fillna(0)
)
# count cumulative violations per license, in inspection-date order
inspection_data_for_training["CUM_VIOLATIONS"] = inspection_data_for_training.groupby(
    ["license_"]
)["count_violations"].cumsum()
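
To make the two steps concrete, here is a small worked example on invented rows. It assumes the *violations* strings look like `{"A","B"}`, which is what the `x[1:-1].split('","')` expression above implies. The helper `count_violations` and the extra `PRIOR_VIOLATIONS` column are hypothetical additions for illustration. Note that splitting an empty string returns one element, so the helper treats an empty `{}` explicitly as zero violations.

```python
import pandas as pd


def count_violations(raw):
    """Count entries in a string assumed to look like '{"A","B"}'."""
    if raw is None or pd.isna(raw):
        return 0
    inner = raw[1:-1]  # strip the surrounding braces
    if not inner:
        return 0  # "".split(...) would otherwise report one entry
    return len(inner.split('","'))


# Invented rows; the string format is an assumption.
toy = pd.DataFrame(
    {
        "license_": [11, 11, 11, 22],
        "inspection_date": pd.to_datetime(
            ["2023-01-05", "2023-06-01", "2024-01-10", "2023-03-03"]
        ),
        "violations": ['{"A","B"}', None, '{"C"}', "{}"],
    }
).sort_values(["license_", "inspection_date"])

toy["count_violations"] = toy["violations"].apply(count_violations)

# Running total per license; this includes the current inspection.
toy["CUM_VIOLATIONS"] = toy.groupby("license_")["count_violations"].cumsum()

# Hypothetical variant: violations recorded before the current inspection only.
toy["PRIOR_VIOLATIONS"] = toy["CUM_VIOLATIONS"] - toy["count_violations"]

print(toy[["license_", "count_violations", "CUM_VIOLATIONS", "PRIOR_VIOLATIONS"]])
```

Whether the current inspection's violations belong in the feature is a modelling decision: they are recorded during the same inspection whose result the model predicts, so the prior-only variant may be the safer choice.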

### Task 3 - Building "a" classification model

Now we will use these features to build a classification model.

#### 🔄 Task

Since we can use our data science skills here, feel free to build a model that works for you. The purpose of this exercise is to build a simple, functioning model that we can work with later; there is no single correct solution.

#### ✅ Solution

One example of a classification model is a random forest model, as built below:

In [None]:
# Create training and test split
from sklearn.model_selection import train_test_split
X = inspection_data_for_training.drop(
    columns=[
        "license_",
        "RESULTS",
        "inspection_id",
        "dba_name",
        "aka_name",
        "inspection_type",
        "violations",
        "count_violations",
        "inspection_date",
        "zip",
    ]
)
y = inspection_data_for_training[["RESULTS"]]

X_train, X_test, y_train, y_test = train_test_split(X, y)
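
The split above uses the defaults, so it changes on every run. As a hedged sketch, `random_state` makes the split reproducible and `stratify` keeps the pass/fail ratio the same in both parts. The arrays below are synthetic stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 20 rows, 15 of class 0 and 5 of class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,      # hold out 4 of the 20 rows
    stratify=y_toy,     # preserve the 3:1 class ratio in both parts
    random_state=0,     # same split on every run
)

print(len(y_tr), len(y_te), y_tr.sum(), y_te.sum())
```

A fixed seed matters once the notebook is scheduled to rerun, since otherwise metric changes between runs mix data changes with split changes.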

In [None]:
# Train a random forest model
import numpy as np
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=10, random_state=0)
clf.fit(X_train, np.ravel(y_train))

In [None]:
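# Evaluate predictions on the test set. For 0/1 labels the RMSE printed below
# is the square root of the misclassification rate.
from sklearn import metrics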
y_pred = clf.predict(X_test)
mse = metrics.mean_squared_error(y_test, y_pred)

print(np.sqrt(mse))
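
For a classifier, accuracy and a confusion matrix are usually easier to read than RMSE. The sketch below uses hand-written labels (not model output) to show how the two relate: with 0/1 labels, the RMSE equals the square root of one minus the accuracy.

```python
import numpy as np
from sklearn import metrics

# Hand-written labels for illustration; not produced by the workshop model.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

accuracy = metrics.accuracy_score(y_true, y_hat)
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_hat))

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = metrics.confusion_matrix(y_true, y_hat)

print(accuracy, rmse)
print(cm)
```

The same calls can be applied to `y_test` and `y_pred` from the cell above; the confusion matrix also shows whether the errors are mostly missed failures or missed passes.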

### Task 4 - Vetiver model deployment

Now that we have a functioning model, we will use the *vetiver* framework to deploy it to Posit Connect.

#### 🔄 Task

In this task:
- Install the *vetiver* and *pins* packages in your virtual environment
- Convert your model to a Vetiver model
- Create a model board for the workshop's Posit Connect instance
- Use *vetiver*'s vetiver_pin_write function to write the model as a pin to Connect
- Use *vetiver*'s deploy_rsconnect function to deploy the model as an API to Connect
- Test the API using the *vetiver.predict* function

#### ✅ Solution


```bash
python -m pip install vetiver pins
```

In [None]:
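import os

import pins
import rsconnect.api
import vetiver

# Create a vetiver model object. Pins on Connect are named "<username>/<pin name>",
# so replace the "gagan/" prefix with your own Connect username.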
v = vetiver.VetiverModel(
    model=clf, model_name="gagan/inspection_results", prototype_data=X_train[:1]
)
v

# Write the vetiver model as a pin for versioning
model_board = pins.board_connect(
    os.getenv("CONNECT_SERVER"),
    api_key=os.getenv("CONNECT_API_KEY"),
    allow_pickle_read=True
)
vetiver.vetiver_pin_write(model_board, model=v)

# Deploy the vetiver model as an API on Posit Connect
rsc_server = os.getenv("CONNECT_SERVER")
rsc_key = os.getenv("CONNECT_API_KEY")
connect_server = rsconnect.api.RSConnectServer(url=rsc_server, api_key=rsc_key)


vetiver.deploy_rsconnect(
    connect_server=connect_server,
    board=model_board,
    pin_name="gagan/inspection_results",
)
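
The last item in the task list, testing the API, is not covered by the cell above. The following is a minimal sketch of how that could look, using `vetiver_endpoint` and `predict` from `vetiver.server`. The helper names `predict_url` and `query_deployed_model`, the `/content/<guid>/predict` URL layout, and the `<content-guid>` placeholder are assumptions for illustration; check the URL of your deployed content on Connect.

```python
import os


def predict_url(server: str, content_guid: str) -> str:
    """Build the predict URL for a content item (URL layout is an assumption)."""
    return f"{server.rstrip('/')}/content/{content_guid}/predict"


def query_deployed_model(new_data, content_guid: str):
    """Send rows to the deployed API and return its predictions."""
    # Imported lazily so predict_url can be used without vetiver installed.
    from vetiver.server import predict, vetiver_endpoint

    endpoint = vetiver_endpoint(
        predict_url(os.environ["CONNECT_SERVER"], content_guid)
    )
    # Connect expects the API key in an Authorization header.
    headers = {"Authorization": f"Key {os.environ['CONNECT_API_KEY']}"}
    return predict(endpoint=endpoint, data=new_data, headers=headers)


# Placeholder values only; nothing is sent over the network here.
example_url = predict_url("https://connect.example.com/", "<content-guid>")
print(example_url)
```

After deployment, a call such as `query_deployed_model(X_test.head(), "<content-guid>")` should return one prediction per row, provided the columns match the `prototype_data` used when creating the Vetiver model.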

### Task 5 - Explore API performance tuning on Posit Connect

#### 🔄 Task

Now that our model has been deployed as an API on Connect and we can make predictions from it, explore what options we have on Connect to make this API available to other consumers. Some questions to consider:
- How do we make sure our API is always available?
- How many external processes would be using the API?
- Who and what process would have access to it?

More details on the settings that can be tuned are available [here](https://docs.posit.co/connect/user/content-settings/#content-runtime).

### Task 6 - Schedule model run for regular deployments

Now that we have built and deployed the model, we can also deploy the code that generates it to Connect, so that the model can be retrained and redeployed on a regular basis.

#### 🔄 Task

Use the *rsconnect-python* deployment functionality learned in the previous exercise to deploy your notebook.

After deployment, explore the scheduling options available on Connect. More details can be found [here](https://docs.posit.co/connect/user/scheduling/).