# Taxi trips example

## Peeking at the data

Running `simulate.py` inserts data into two topics: one for taxi departures and one for taxi arrivals. Both topics live in the default Redpanda broker provided by Beaver. Beaver also provides a default Materialize instance for processing the data in Redpanda with SQL. We'll use it to take a look at the taxi departures first.

In [38]:
import psycopg

conn = psycopg.connect("postgresql://materialize@localhost:6875/materialize?sslmode=disable")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("DROP VIEW IF EXISTS taxi_departures")
    cur.execute("DROP SOURCE IF EXISTS taxi_departures_src")

    cur.execute("""
    CREATE MATERIALIZED SOURCE taxi_departures_src
    FROM KAFKA BROKER 'redpanda:29092' TOPIC 'taxi-departures'
        KEY FORMAT TEXT
        VALUE FORMAT BYTES
        INCLUDE KEY AS trip_no, TIMESTAMP AS received_at;
    """)

    cur.execute("""
    CREATE VIEW taxi_departures AS (
        SELECT
            trip_no,
            received_at,
            CAST(CONVERT_FROM(data, 'utf8') AS JSONB) AS trip
        FROM taxi_departures_src
    );
    """)

In [39]:
import pandas as pd

pd.read_sql('SELECT * FROM taxi_departures ORDER BY received_at DESC LIMIT 10', conn)


,trip_no,received_at,trip
0,9,2023-01-13 00:03:16.320,"{'dropoff_latitude': 40.74546813964844, 'dropo..."
1,8,2023-01-13 00:03:13.716,"{'dropoff_latitude': 40.81161117553711, 'dropo..."
2,7,2023-01-13 00:03:12.443,"{'dropoff_latitude': 40.8165397644043, 'dropof..."
3,6,2023-01-13 00:03:11.772,"{'dropoff_latitude': 40.761734008789055, 'drop..."
4,5,2023-01-13 00:03:11.502,"{'dropoff_latitude': 40.84777069091797, 'dropo..."
5,4,2023-01-13 00:03:10.631,"{'dropoff_latitude': 40.742988586425774, 'drop..."
6,3,2023-01-13 00:03:10.226,"{'dropoff_latitude': 40.75033950805664, 'dropo..."
7,2,2023-01-13 00:03:09.355,"{'dropoff_latitude': 40.81517028808594, 'dropo..."
8,1,2023-01-13 00:03:08.817,"{'dropoff_latitude': 40.71749114990234, 'dropo..."
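
Each `trip` value is the message payload decoded from raw bytes to UTF-8 text and then parsed as JSON, which is what the `CAST(CONVERT_FROM(data, 'utf8') AS JSONB)` expression in the view does. Here is a rough Python equivalent of that step. The payload below is made up for illustration: the field names follow the keys used later in this example, but the values are not taken from the dataset.

```python
import json

# Hypothetical raw message value, as bytes (values are invented).
raw_value = (
    b'{"pickup_latitude": 40.75, "pickup_longitude": -73.99, '
    b'"pickup_datetime": "2023-01-13 00:03:08"}'
)


def decode_trip(value: bytes) -> dict:
    """Mimic CONVERT_FROM(data, 'utf8') followed by a cast to JSONB."""
    text = value.decode("utf-8")  # bytes -> str
    return json.loads(text)       # str -> parsed JSON document


trip = decode_trip(raw_value)
print(trip["pickup_latitude"])
```

Once parsed, individual fields can be pulled out by key, much like the `->>` operator does in SQL.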


Let's do the same for taxi arrivals.

In [40]:
with conn.cursor() as cur:
    cur.execute("DROP VIEW IF EXISTS taxi_arrivals")
    cur.execute("DROP SOURCE IF EXISTS taxi_arrivals_src")

    cur.execute("""
    CREATE MATERIALIZED SOURCE taxi_arrivals_src
    FROM KAFKA BROKER 'redpanda:29092' TOPIC 'taxi-arrivals'
        KEY FORMAT TEXT
        VALUE FORMAT BYTES
        INCLUDE KEY AS trip_no, TIMESTAMP AS received_at
    """)

    cur.execute("""
    CREATE VIEW taxi_arrivals AS (
        SELECT
            trip_no,
            received_at,
            CAST(CONVERT_FROM(data, 'utf8') AS JSONB) AS arrival
        FROM taxi_arrivals_src
    )
    """)

pd.read_sql('SELECT * FROM taxi_arrivals ORDER BY received_at DESC LIMIT 10', conn)


,trip_no,received_at,arrival


## Streaming features

Beaver encourages you to process your streaming data with SQL. We'll start by building up some features which we'll then feed to a machine learning model. Let's start simple and calculate two features based on the distance between the pick-up and drop-off locations, as well as some basic temporal features.
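
Before writing them in SQL, here is a plain-Python sketch of these features for a single trip. It mirrors the intent of the `taxi_features` view defined in the next cell and is only meant to make the arithmetic concrete. The sample trip is invented, and the day-of-week numbering assumes Materialize follows the PostgreSQL convention, where `DOW` runs from 0 (Sunday) to 6 (Saturday).

```python
import math
from datetime import datetime

# Index in this list matches the SQL DOW value: 0 is Sunday, 6 is Saturday.
DAYS = ["sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday"]


def taxi_features(trip: dict) -> dict:
    """Compute distance and temporal features for one trip (sketch)."""
    d_lat = float(trip["dropoff_latitude"]) - float(trip["pickup_latitude"])
    d_lon = float(trip["dropoff_longitude"]) - float(trip["pickup_longitude"])
    pickup = datetime.fromisoformat(trip["pickup_datetime"])

    # isoweekday() is 1 (Monday) .. 7 (Sunday); modulo 7 maps Sunday to 0.
    dow = pickup.isoweekday() % 7

    features = {
        # Distances are in degrees, as in the SQL view, not in kilometres.
        "manhattan_distance": abs(d_lat) + abs(d_lon),
        "euclidean_distance": math.hypot(d_lat, d_lon),
        "pickup_hour": pickup.hour,
    }
    for i, day in enumerate(DAYS):
        features[f"is_{day}"] = dow == i
    return features


# Hypothetical trip; the values are made up.
sample = {
    "pickup_latitude": 40.75,
    "pickup_longitude": -73.99,
    "dropoff_latitude": 40.78,
    "dropoff_longitude": -73.95,
    "pickup_datetime": "2023-01-13 00:03:08",
}
features = taxi_features(sample)
print(features)
```

Exactly one of the `is_<day>` flags is true for any given trip, which amounts to a one-hot encoding of the weekday.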

In [41]:
feature_set_query = """
DROP VIEW IF EXISTS taxi_features;

CREATE VIEW taxi_features AS (
    SELECT 
        trip_no,
        ABS(dropoff_lat - pickup_lat) + ABS(dropoff_lon - pickup_lon) AS manhattan_distance,
        SQRT(POWER(dropoff_lat - pickup_lat, 2) + POWER(dropoff_lon - pickup_lon, 2)) AS euclidean_distance,
        EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour,
        EXTRACT(DOW FROM pickup_datetime) = 1 AS is_monday,
        EXTRACT(DOW FROM pickup_datetime) = 2 AS is_tuesday,
        EXTRACT(DOW FROM pickup_datetime) = 3 AS is_wednesday,
        EXTRACT(DOW FROM pickup_datetime) = 4 AS is_thursday,
        EXTRACT(DOW FROM pickup_datetime) = 5 AS is_friday,
        EXTRACT(DOW FROM pickup_datetime) = 6 AS is_saturday,
        EXTRACT(DOW FROM pickup_datetime) = 0 AS is_sunday
    FROM (
        SELECT
            trip_no,
            CAST(trip ->> 'dropoff_latitude' AS FLOAT) AS dropoff_lat,
            CAST(trip ->> 'pickup_latitude' AS FLOAT) AS pickup_lat,
            CAST(trip ->> 'dropoff_longitude' AS FLOAT) AS dropoff_lon,
            CAST(trip ->> 'pickup_longitude' AS FLOAT) AS pickup_lon,
            CAST(trip ->> 'pickup_datetime' AS TIMESTAMP) AS pickup_datetime
        FROM taxi_departures
    )
)
"""

Instead of running this query like we did above, we'll register it in Beaver so it can be associated with a model. Registering a feature set happens through the API.

In [42]:
import requests
requests.post(
    "http://localhost:8000/api/features/",
    json={
        "name": "taxi_features",
        "query": feature_set_query,
        "key_field": "trip_no",
        "processor_id": 1
    },
)

<Response [200]>

## Streaming targets

The features we built can be used to do inference. But to train a model, we also need a target. Beaver also encourages you to define this target with SQL. For this example we'll predict the duration in seconds of each trip, which is a regression task.

In [43]:
target_query = """
DROP VIEW IF EXISTS taxi_targets;

CREATE VIEW taxi_targets AS (
    SELECT
        trip_no,
        CAST(arrival ->> 'duration' AS INTEGER) AS duration
    FROM taxi_arrivals
)
"""

requests.post(
    "http://localhost:8000/api/targets/",
    json={
        "name": "taxi_targets",
        "query": target_query,
        "key_field": "trip_no",
        "target_field": "duration",
        "task": "REGRESSION",
        "processor_id": 1
    }
)

<Response [200]>

## Sending a first model

Now let's upload a first model. We'll start with a plain and simple linear regression.

In [44]:
import base64
import dill
import requests
from river import linear_model, preprocessing

model = preprocessing.StandardScaler() | linear_model.LinearRegression()
model.learn = model.learn_one
model.predict = model.predict_one

requests.post(
    "http://localhost:8000/api/models/",
    json={
        "name": "taxis-linear-regression",
        "task": "REGRESSION",
        "content": base64.b64encode(dill.dumps(model)).decode("ascii"),
    },
)

<Response [200]>

In this example, the models are hosted in Beaver, which is why we encode the model and send it in the payload.
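
The `content` field has to be text because JSON cannot carry raw bytes: the model is first serialized to bytes, then Base64-encoded into an ASCII string. The sketch below round-trips a stand-in object through the same steps. Two substitutions are assumptions made to keep it dependency-free: it uses the standard library's `pickle` instead of `dill` (which offers the same `dumps`/`loads` interface), and a plain dict instead of a River pipeline. How Beaver decodes the payload on its side isn't shown in this example; `decode_model` is simply the inverse of the encoding.

```python
import base64
import pickle

# Stand-in for a model. A real River pipeline would be serialized with dill.
model_state = {"weights": {"manhattan_distance": 0.0}, "intercept": 0.0}


def encode_model(obj) -> str:
    """Serialize an object and wrap it as ASCII text for a JSON payload."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode_model(content: str):
    """Reverse encode_model: Base64 text -> bytes -> object."""
    return pickle.loads(base64.b64decode(content))


content = encode_model(model_state)
restored = decode_model(content)
print(restored == model_state)
```

As with any pickled data, only decode payloads that come from a source you trust.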

## Creating an experiment

We now have all we need to run a first experiment. An experiment boils down to training a model on a feature set to predict a target. Because the model has already been uploaded, Beaver takes care of training it in real time. Beaver also issues a prediction for each newly arriving sample.

In [45]:
exp = requests.post(
    "http://localhost:8000/api/experiments/",
    json={
        "name": "Taxi trips lin reg experiment",
        "feature_set_id": 2,
        "target_id": 2,
        "model_id": 2,
        "runner_id": 1,
        "sink_id": 1
    },
)
exp.content

b'{"id":2,"model_state":"gASVzgMAAAAAAACMFnJpdmVyLmNvbXBvc2UucGlwZWxpbmWUjAhQaXBlbGluZZSTlCmBlH2UKIwFc3RlcHOUjAtjb2xsZWN0aW9uc5SMC09yZGVyZWREaWN0lJOUKVKUKIwOU3RhbmRhcmRTY2FsZXKUjBlyaXZlci5wcmVwcm9jZXNzaW5nLnNjYWxllIwOU3RhbmRhcmRTY2FsZXKUk5QpgZR9lCiMCHdpdGhfc3RklIiMBmNvdW50c5RoBowHQ291bnRlcpSTlH2UhZRSlIwFbWVhbnOUaAaMC2RlZmF1bHRkaWN0lJOUjApkaWxsLl9kaWxslIwKX2xvYWRfdHlwZZSTlIwFZmxvYXSUhZRSlIWUUpSMBHZhcnOUaBloH4WUUpR1YowQTGluZWFyUmVncmVzc2lvbpSMGnJpdmVyLmxpbmVhcl9tb2RlbC5saW5fcmVnlIwQTGluZWFyUmVncmVzc2lvbpSTlCmBlH2UKIwJb3B0aW1pemVylIwPcml2ZXIub3B0aW0uc2dklIwDU0dElJOUKYGUfZQojAJscpSMFnJpdmVyLm9wdGltLnNjaGVkdWxlcnOUjAhDb25zdGFudJSTlCmBlH2UjA1sZWFybmluZ19yYXRllEc/hHrhR64Ue3NijAxuX2l0ZXJhdGlvbnOUSwB1YowEbG9zc5SMEnJpdmVyLm9wdGltLmxvc3Nlc5SMB1NxdWFyZWSUk5QpgZSMAmwylEcAAAAAAAAAAIwCbDGURwAAAAAAAAAAjA5pbnRlcmNlcHRfaW5pdJRHAAAAAAAAAACMCWludGVyY2VwdJRHAAAAAAAAAACMDGludGVyY2VwdF9scpRoNCmBlH2UaDdHP4R64UeuFHtzYowNY2xpcF9ncmFkaWVudJRHQm0alKIAAACMC2luaXRpYWxpemVylIwYcml2ZXIub3B0aW0uaW5pdGlhbGl6ZXJzlIwFWmV
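
Conceptually, an experiment like this predicts when a trip's features show up and learns once its duration is known. The sketch below imitates that loop in plain Python with a toy model that predicts the running mean of the durations seen so far. The `RunningMean` class, the event tuples, and the durations are all made up for illustration; this is not Beaver's actual runner.

```python
class RunningMean:
    """Toy online regressor with the same learn/predict method names."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self, x: dict) -> float:
        return self.mean

    def learn(self, x: dict, y: float) -> None:
        # Incremental mean update, one sample at a time.
        self.n += 1
        self.mean += (y - self.mean) / self.n


# Hypothetical stream: departures carry features, arrivals carry the target.
events = [
    ("departure", "1", {"manhattan_distance": 0.02}),
    ("departure", "2", {"manhattan_distance": 0.05}),
    ("arrival", "1", 300),
    ("departure", "3", {"manhattan_distance": 0.03}),
    ("arrival", "2", 500),
    ("arrival", "3", 400),
]

model = RunningMean()
pending = {}     # trip_no -> (features, prediction made at departure)
abs_errors = []  # one absolute error per completed trip

for kind, trip_no, payload in events:
    if kind == "departure":
        # Predict as soon as the features are available.
        pending[trip_no] = (payload, model.predict(payload))
    else:
        # The target has arrived: score the earlier prediction, then learn.
        features, prediction = pending.pop(trip_no)
        abs_errors.append(abs(payload - prediction))
        model.learn(features, payload)

mae = sum(abs_errors) / len(abs_errors)
print(f"MAE over {len(abs_errors)} trips: {mae:.1f}")
```

Scoring each prediction before the model learns from the matching target keeps the error estimate honest, because the model is always evaluated on samples it has not yet seen.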

## Monitoring an experiment

We can monitor the progress of the experiment as it keeps running.

In [46]:
monitor = requests.get(
    "http://localhost:8000/api/experiments/2/monitor",
)

monitor.content

b'{"now":"2023-01-13T00:03:40.844116","training_progress":0.0,"mse":73230.8,"mae":246.4}'
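
The monitoring endpoint returns JSON, so the metrics are easy to work with once parsed. As a small illustration, the snippet below parses the response shown above and derives the root mean squared error, which is expressed in the same unit as the target (seconds). It works on the literal bytes rather than calling the API, so it runs without a Beaver instance.

```python
import json
import math

# Response body copied from the monitoring call above.
content = (
    b'{"now":"2023-01-13T00:03:40.844116","training_progress":0.0,'
    b'"mse":73230.8,"mae":246.4}'
)

metrics = json.loads(content)

# RMSE is the square root of the MSE, so it is back in seconds.
rmse = math.sqrt(metrics["mse"])
mae_minutes = metrics["mae"] / 60

print(f"RMSE: {rmse:.1f} s, MAE: {mae_minutes:.1f} min")
```

At this early stage the model is off by roughly four minutes per trip on average, which is expected given that `training_progress` is still 0.0.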

## Sending a new model

TODO: send a random forest on the same dataset. Show how it compares to the existing model.

## Defining new features

TODO: creating stateful features with Materialize. Create a new experiment with the random forest on these features.