# Model Training

In this exercise we will train a model on the ferry and weather data sets from earlier today. We'll use a mix of `polars` and `scikit-learn` for feature engineering and preprocessing. The model will be deployed to and served from [Posit Connect](https://pub.ferryland.posit.team/) using [`pins`](https://rstudio.github.io/pins-python/) and [`vetiver`](https://rstudio.github.io/vetiver-python/stable/). For this section, we'll use a [random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#) model to predict the delay in ferry departures.

## Preliminaries

First we'll load our environment variables from a `.env` file and get our Connect username using the [Posit SDK for Python](https://github.com/posit-dev/posit-sdk-py).

In [None]:
# Load environment variables from .env.
import os
from pathlib import Path
from dotenv import load_dotenv

if Path(".env").exists():
    load_dotenv()

In [None]:
# Get Connect username.
from posit.connect import Client

connect_url = os.environ["CONNECT_SERVER"]
connect_api_key = os.environ["CONNECT_API_KEY"]

with Client(url=connect_url, api_key=connect_api_key) as client:
    username = client.me.username

print(username)
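
The cells above read `CONNECT_SERVER` and `CONNECT_API_KEY` directly from `os.environ`, which raises a bare `KeyError` if one is missing. As a minimal sketch (the helper name `require_env` and the example values are hypothetical, not part of the workshop code), the variables could be checked up front so that a single clear error lists everything that is missing:

```python
import os


def require_env(names, environ=None):
    """Return the requested variables, raising one clear error if any are missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Demonstration with a stand-in mapping instead of the real environment.
example_env = {
    "CONNECT_SERVER": "https://connect.example.com",  # placeholder URL
    "CONNECT_API_KEY": "not-a-real-key",  # placeholder key
}
settings = require_env(["CONNECT_SERVER", "CONNECT_API_KEY"], environ=example_env)
print(sorted(settings))
```

In the notebook itself, calling `require_env` without the `environ` argument would check the real environment after `load_dotenv()` has run.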

## Task 0 - Reading the data

### 🔄 Task

- Read in and glimpse the vessel history data
- Read in and glimpse the vessel verbose data
- Read in and glimpse the weather data

### 🧑‍💻 Code


In [None]:
import polars as pl

db_uri = os.environ["DATABASE_URI_PYTHON"]

In [None]:
vessel_history = pl.read_database_uri(
    query=f"SELECT * FROM {username}_vessel_history_clean;", uri=db_uri, engine="adbc"
)

vessel_history.head(3)

In [None]:
vessel_verbose = pl.read_database_uri(
    query=f"SELECT * FROM {username}_vessel_verbose_clean;", uri=db_uri, engine="adbc"
)

vessel_verbose.head(3)

In [None]:
weather = pl.read_database_uri(
    query=f"SELECT * FROM {username}_terminal_weather_clean;", uri=db_uri, engine="adbc"
)

weather.head(3)

## Task 1 - Feature Engineering

### 🔄 Task

- Join the `vessel_history`, `vessel_verbose` and `weather` data into a form useful for modeling
- Transform the columns into new ones we can use for modeling

### 🧑‍💻 Code

In [None]:
ferry_trips = vessel_history.select(
    pl.col("Vessel", "Departing", "Arriving"),
    (pl.col("ActualDepart") - pl.col("ScheduledDepart"))
    .dt.total_seconds()
    .alias("Delay"),
    pl.col("Date"),
    pl.col("Date").dt.year().alias("Year"),
    pl.col("Date").dt.month().alias("Month"),
    pl.col("Date").dt.weekday().alias("Weekday"),
    pl.col("Date").dt.hour().alias("Hour"),
)

ferry_trips.head(3)

A quick look at the `Delay` data shows that there's significant skew and even some negative delays.

In [None]:
ferry_trips.plot.hist("Delay", bin_range=(-1800, 7200), bins=30)

To make the data easier to model, we'll treat delays as non-negative by clamping every value to at least 1 second, and then take the natural logarithm to get a distribution better suited to regression.

In [None]:
ferry_trips = ferry_trips.select(
    pl.exclude("Delay"),
    pl.col("Delay").clip(lower_bound=1).log().alias("LogDelay"),
)

ferry_trips.plot.hist("LogDelay")
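
Because the target is now the log of the delay, anything predicted on that scale has to be exponentiated to get back to seconds. The sketch below shows the round trip in plain Python. The helper names `to_log_delay` and `from_log_delay` are hypothetical and only mirror the clamp-then-log transform used above.

```python
import math


def to_log_delay(delay_seconds):
    """Clamp the delay to at least 1 second, then take the natural log."""
    return math.log(max(delay_seconds, 1))


def from_log_delay(log_delay):
    """Invert the log transform to recover a delay in seconds."""
    return math.exp(log_delay)


# Early departures (negative delays) and zero delays all map to log(1) = 0.
for delay in (-120, 0, 1, 300):
    log_delay = to_log_delay(delay)
    print(delay, round(log_delay, 3), round(from_log_delay(log_delay), 1))
```

Note that the clamp is not reversible: an early departure comes back as a 1 second delay, so the model cannot predict early departures.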

Now we'll join in the data describing the vessel used for each trip. First we select a subset of the columns and extract the year from the `YearBuilt` and `YearRebuilt` columns.

In [None]:
ferry_info = vessel_verbose.select(
    pl.col("VesselName").str.to_lowercase(),
    pl.col("ClassName"),
    pl.col(
        "SpeedInKnots",
        "EngineCount",
        "Horsepower",
        "MaxPassengerCount",
        "PassengerOnly",
        "FastFerry",
        "PropulsionInfo",
    ),
    pl.col("YearBuilt", "YearRebuilt").dt.year(),
)

ferry_trips = ferry_trips.join(
    ferry_info, left_on="Vessel", right_on="VesselName", how="left", coalesce=True
)

ferry_trips.head(3)

The weather data has a granularity of one hour, so in order to join it with the `ferry_trips` data we're going to round the timestamp associated with each trip to the nearest hour. We're going to join in the weather data twice, once for the departing terminal and once for the arriving terminal. Finally, there are a number of columns associated with the weather data that are not needed and will be dropped.

In [None]:
import polars.selectors as cs

ferry_trips = (
    ferry_trips.with_columns(pl.col("Date").dt.round("1h").alias("time"))
    .join(
        weather.rename(lambda col_name: f"departing_{col_name}"),
        how="left",
        left_on=["Departing", "time"],
        right_on=["departing_terminal_name", "departing_time"],
        coalesce=True,
    )
    .join(
        weather.rename(lambda col_name: f"arriving_{col_name}"),
        how="left",
        left_on=["Arriving", "time"],
        right_on=["arriving_terminal_name", "arriving_time"],
        coalesce=True,
    )
    .select(
        ~cs.ends_with(
            "latitude",
            "longitude",
            "generationtime_ms",
            "utc_offset_seconds",
            "timezone",
            "timezone_abbreviation",
            "elevation",
            "hourly_units",
        ),
    )
    .select(pl.exclude("time"))
)

ferry_trips.head(3)
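
The idea behind the hourly join can be sketched in plain Python: round each trip timestamp to the nearest hour, then look up the weather record keyed by terminal and hour. The terminal name, timestamps, and temperatures below are made up for illustration, and rounding a tie at exactly 30 minutes upward is an assumption of this sketch.

```python
from datetime import datetime, timedelta


def round_to_hour(ts):
    """Round a timestamp to the nearest hour (ties at 30 minutes round up)."""
    shifted = ts + timedelta(minutes=30)
    return shifted.replace(minute=0, second=0, microsecond=0)


# Hypothetical hourly weather records keyed by (terminal, hour).
weather_by_hour = {
    ("seattle", datetime(2024, 5, 1, 8)): 11.2,
    ("seattle", datetime(2024, 5, 1, 9)): 12.0,
}

trip_time = datetime(2024, 5, 1, 8, 41)
key = ("seattle", round_to_hour(trip_time))
temperature = weather_by_hour.get(key)  # None when no weather record matches
print(key[1], temperature)
```

Using `.get` here plays the role of the left join: a trip with no matching weather record keeps its row and gets a missing value rather than being dropped.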

## Task 2 - Preprocessing and Modeling

### 🔄 Task

Define a `scikit-learn` pipeline that

- Transforms the data for the model to ingest
- Trains a random forest model to predict the logged departure delay

### 🧑‍💻 Code

First we separate the columns into numeric features and categorical features. Our random forest model requires the categorical features to be one-hot encoded, while our numeric features can be left as-is.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

numeric_features = [
    "Month",
    "Weekday",
    "Hour",
    "SpeedInKnots",
    "EngineCount",
    "Horsepower",
    "MaxPassengerCount",
    # "PassengerOnly",
    # "FastFerry",
    "YearBuilt",
    "YearRebuilt",
    "departing_temperature_2m",
    # "departing_precipitation",
    "departing_cloud_cover",
    "departing_wind_speed_10m",
    "departing_wind_direction_10m",
    "departing_wind_gusts_10m",
    "arriving_temperature_2m",
    # "arriving_precipitation",
    "arriving_cloud_cover",
    "arriving_wind_speed_10m",
    "arriving_wind_direction_10m",
    "arriving_wind_gusts_10m",
]

categorical_features = [
    "Departing",
    "Arriving",
    "ClassName",
    "PropulsionInfo",
    "departing_weather_code",
    "arriving_weather_code",
]


preprocessor = ColumnTransformer(
    [
        ("num", "passthrough", numeric_features),
        # Ignore categories at prediction time that were not seen during training.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)
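
To make the one-hot encoding step concrete, here is a conceptual sketch in plain Python. It is not how `OneHotEncoder` is implemented (the real encoder returns a sparse matrix by default, for instance), and the terminal names are illustrative values only.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# Illustrative terminal names, not taken from the workshop data.
departing = ["seattle", "bainbridge", "seattle", "edmonds"]
categories, encoded = one_hot_encode(departing)

print(categories)
for row in encoded:
    print(row)
```

Each categorical column becomes one indicator column per distinct value, so a column like `Departing` contributes as many model inputs as there are terminals.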

Here we define our random forest model.

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(verbose=2, random_state=2, n_jobs=-1)

Now our preprocessor and random forest model are joined together into a single pipeline. This makes using the model easier, as we won't have to feed in pre-processed data: the pipeline will take care of that step for us during inference.

In [None]:
from sklearn.pipeline import Pipeline

model = Pipeline([("preprocess", preprocessor), ("random-forest", rf)])

Now we filter the data, keeping only trips from roughly the past year (53 weeks), and then split it into training and test sets.

In [None]:
import datetime

from sklearn.model_selection import train_test_split

ferry_trips_filtered = ferry_trips.drop_nulls().filter(
    pl.col("Date").dt.date() >= (datetime.date.today() - datetime.timedelta(weeks=53))
)

X = ferry_trips_filtered.drop("Vessel", "Date", "Year", "LogDelay")

# TODO: review issues with prototype data
X = X.drop(
    "PassengerOnly",
    "FastFerry",
    "arriving_precipitation",
    "departing_precipitation",
)
y = ferry_trips_filtered["LogDelay"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
print(f"Nrows training data: {X_train.shape[0]}")
print(f"Nrows testing data:  {X_test.shape[0]}")

In order to use the test data later for our model card (to be discussed), the test data will be saved to the database.

In [None]:
X_test.with_columns(y_test).write_database(
    table_name=f"{username}_test_data",
    connection=db_uri,
    engine="adbc",
    if_table_exists="replace",
)

Finally, we train the model and compute the R-squared on the test data set.

In [None]:
%%time
model.fit(X_train.to_pandas(), y_train)
model.score(X_test.to_pandas(), y_test)
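
For scikit-learn regressors, `score` returns the coefficient of determination, R². Since the target here is `LogDelay`, the value describes fit on the log scale, not in seconds. As a rough sketch of what that number means, R² can be computed by hand. The function name `r_squared` and the example values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up log-delay values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

print(round(r_squared(y_true, y_pred), 3))
print(r_squared(y_true, [2.5] * 4))  # always predicting the mean gives 0.0
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that baseline.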

## Task 3 - Deploying

### 🔄 Task

- Deploy the model using `vetiver` and `pins` onto Posit Connect
- Deploy an API around the model onto Posit Connect

### 🧑‍💻 Code

In [None]:
from vetiver import VetiverModel

v = VetiverModel(
    model, model_name=f"{username}/ferry_delay", prototype_data=X.to_pandas()
)

In [None]:
import pins
import vetiver

model_board = pins.board_connect(
    server_url=connect_url, api_key=connect_api_key, allow_pickle_read=True
)
vetiver.vetiver_pin_write(model_board, model=v)

In [None]:
%%time
from rsconnect.api import RSConnectServer

connect_server = RSConnectServer(url=connect_url, api_key=connect_api_key)
vetiver.deploy_rsconnect(
    connect_server=connect_server,
    board=model_board,
    pin_name=f"{username}/ferry_delay",
)

## Task 4 - Model Card

### 🔄 Task

- Use a model card to describe various metrics for how the model performs
- Deploy the card to Connect

### 🧑‍💻 Code

In [None]:
# vetiver.templates.model_card()