# Mobi Vancouver GBFS Tools (Unity Catalog)

This notebook collects small, reusable **tools** that make it easy to explore Mobi data from **SQL, Genie rooms, and agents**.

These examples highlight a useful Databricks capability: you can define **Python-backed SQL table functions** (UDTFs) that call live APIs, run custom logic, and return rows directly to SQL, without deploying servers, containers, or external services. This lets you build real-time “agent tools” entirely inside your data platform.

These tools show how to:

- Wrap **Delta tables** in friendly, parameterized SQL functions.
- Call **live GBFS APIs** from SQL using Python UDTFs.
- Combine static trip history with real-time station status and basic geospatial logic.

You can treat this notebook as:

- A **demo** of how Databricks SQL and Python can work together.
- A **toolbox** you can copy from and extend during the hackathon.

**Steps in this notebook:**

1. Setup (config + imports)  
2. Create tools (SQL and Python functions)  
3. Try queries that you can reuse in your own projects or Genie rooms  

> **Prerequisite:** Run `01_data.ipynb` first to create `silver_trips` and `silver_stations` in your chosen `catalog.schema`.


In [0]:
# Setup: minimal deps + add src to sys.path
%pip install -q mlflow requests

import sys
from pathlib import Path
src_path = Path.cwd() / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))


In [0]:
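import mlflow

# Project settings from config.yaml
config = mlflow.models.ModelConfig(development_config='config.yaml')

# The 'catalog' and 'schema' widgets must already exist on this notebook;
# the %sql cells below read them as ${catalog} and ${schema}.
catalog = dbutils.widgets.get('catalog')
schema = dbutils.widgets.get('schema')
print(f"Using catalog.schema: {catalog}.{schema}")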

In [0]:
%sql
SELECT station_id, station_name, lat, lon
FROM `${catalog}`.`${schema}`.`silver_stations`
LIMIT 10


## Tool: `recent_trips_by_station`

`recent_trips_by_station(station_id)` is a **pure SQL table function** over `silver_trips`.

What it does:

- Takes a **station_id** as input.
- Returns the **10 most recent trips** departing from that station.
- Casts station IDs to `STRING` and `duration_sec` to `DOUBLE`, so callers get consistent types from SQL or Genie.

Why this is useful:

- It hides the full `silver_trips` schema behind a simple interface.
- It’s easy to plug into **Genie tools** (e.g., “show me the latest trips from station 0152”).
- It’s a pattern you can reuse for other “recent X by Y” questions.

This pattern is not specific to Databricks, but Databricks makes it easy to:

- Define it in SQL, backed by Delta tables.
- Share it across notebooks, dashboards, and Genie rooms.
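
To make the function's semantics concrete, here is a minimal plain-Python sketch of the same logic: filter on the string-cast departure station, sort newest first, keep the top N. The sample rows are made up for illustration (not real Mobi data), and the `limit` parameter is an addition that the SQL function does not have.

```python
from datetime import datetime
from typing import Any, Dict, List


def recent_trips_by_station(
    trips: List[Dict[str, Any]], station_id: str, limit: int = 10
) -> List[Dict[str, Any]]:
    """Mirror the SQL tool: filter by departure station, newest first, top N."""
    # Compare IDs as strings, like CAST(departure_station_id AS STRING) in the SQL.
    matches = [t for t in trips if str(t["departure_station_id"]) == station_id]
    # Equivalent of ORDER BY departure_time DESC followed by LIMIT.
    matches.sort(key=lambda t: t["departure_time"], reverse=True)
    return matches[:limit]


# Made-up sample rows for illustration only.
sample_trips = [
    {"trip_id": 1, "departure_station_id": "0152", "departure_time": datetime(2024, 6, 1, 8, 0)},
    {"trip_id": 2, "departure_station_id": "0007", "departure_time": datetime(2024, 6, 1, 9, 0)},
    {"trip_id": 3, "departure_station_id": "0152", "departure_time": datetime(2024, 6, 2, 7, 30)},
]

latest = recent_trips_by_station(sample_trips, "0152")
print([t["trip_id"] for t in latest])  # [3, 1]
```

One thing the sketch makes visible: the match is an exact string comparison, so if station IDs were stored as integers, the cast would drop leading zeros and `'0152'` would not match `152`.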


In [0]:
%sql
CREATE OR REPLACE FUNCTION `${catalog}`.`${schema}`.recent_trips_by_station(
  station_id STRING
)
RETURNS TABLE (
  trip_id BIGINT,
  departure_time TIMESTAMP,
  departure_station_id STRING,
  departure_station_name STRING,
  return_time TIMESTAMP,
  return_station_id STRING,
  return_station_name STRING,
  duration_sec DOUBLE
)
RETURN
SELECT
  trip_id,
  departure_time,
  CAST(departure_station_id AS STRING) AS departure_station_id,
  departure_station_name,
  return_time,
  CAST(return_station_id AS STRING) AS return_station_id,
  return_station_name,
  CAST(duration_sec AS DOUBLE) AS duration_sec
FROM `${catalog}`.`${schema}`.`silver_trips`
WHERE CAST(departure_station_id AS STRING) = station_id
ORDER BY departure_time DESC
LIMIT 10;

In [0]:
%sql
SELECT * FROM `${catalog}`.`${schema}`.recent_trips_by_station('0152')

## Tool: `live_station_status`

`live_station_status(station_id)` is a **Python table function** that calls the live GBFS API from SQL.

What it does:

- Fetches **live station status** from  
  `https://gbfs.kappa.fifteen.eu/gbfs/2.2/mobi/en/station_status.json`
- Filters to the requested `station_id`.
- Returns a single row (or no rows if the station is not found or the request fails) with:
  - bikes available
  - docks available
  - renting/returning flags
  - last reported timestamp

Why this is interesting:

- It shows how you can call an **external HTTP API** directly from a SQL function.
- You can join it with `silver_stations` or `silver_trips` for richer views.
- It’s a great candidate for **Genie tools** or **agents** that answer “What’s happening at station X right now?”

This pattern (Python UDTF + external API) is particularly nice on Databricks because:

- The function runs on **serverless compute** managed by Databricks.
- You can invoke it from SQL, notebooks, dashboards, and Genie with the same syntax.
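
The lookup step can be tried offline before wiring it into a UDTF. The sketch below runs against a made-up payload shaped like a GBFS 2.2 `station_status` response, so it needs no network access. The helper names `find_station` and `last_reported_utc` are hypothetical, and reading `last_reported` as POSIX seconds is an assumption based on the GBFS 2.x convention.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Made-up payload shaped like a GBFS 2.2 station_status response (not real data).
sample_payload = {
    "data": {
        "stations": [
            {
                "station_id": "0152",
                "num_bikes_available": 4,
                "num_docks_available": 11,
                "is_renting": True,
                "is_returning": True,
                "last_reported": 1717230000,
            },
            {"station_id": "0007", "num_bikes_available": 0},
        ]
    }
}


def find_station(payload: Dict[str, Any], station_id: str) -> Optional[Dict[str, Any]]:
    """Return the first station whose ID matches, compared as strings."""
    stations = payload.get("data", {}).get("stations", [])
    for station in stations:
        if str(station.get("station_id")) == str(station_id):
            return station
    return None


def last_reported_utc(station: Dict[str, Any]) -> datetime:
    """Interpret last_reported as POSIX seconds (assumed) and return a UTC datetime."""
    return datetime.fromtimestamp(int(station.get("last_reported", 0)), tz=timezone.utc)


station = find_station(sample_payload, "0152")
print(station["num_bikes_available"])          # 4
print(last_reported_utc(station).isoformat())  # 2024-06-01T08:20:00+00:00
```

Keeping the parsing in small pure functions like these makes it straightforward to unit-test the tool's logic separately from the HTTP call.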


In [0]:
%sql
CREATE OR REPLACE FUNCTION `${catalog}`.`${schema}`.live_station_status(
  station_id STRING
)
RETURNS TABLE (
  station_id STRING,
  num_bikes_available INT,
  num_docks_available INT,
  is_renting BOOLEAN,
  is_returning BOOLEAN,
  last_reported BIGINT
)
LANGUAGE PYTHON
HANDLER 'LiveStationStatus'
AS $$
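class LiveStationStatus:
    """Fetch live GBFS status for a single station."""

    def ensure_station_fields(self, station):
        """Fill missing fields with defaults and coerce values to the declared column types."""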
        fields = [
            'station_id',
            'num_bikes_available',
            'num_docks_available',
            'is_renting',
            'is_returning',
            'last_reported'
        ]
        defaults = {
            'station_id': None,
            'num_bikes_available': 0,
            'num_docks_available': 0,
            'is_renting': True,
            'is_returning': True,
            'last_reported': 0
        }
        result = {k: station.get(k, defaults[k]) for k in fields}
        # Type conversions
        result['station_id'] = str(result['station_id']) if result['station_id'] is not None else None
        result['num_bikes_available'] = int(result['num_bikes_available'])
        result['num_docks_available'] = int(result['num_docks_available'])
        result['is_renting'] = bool(result['is_renting'])
        result['is_returning'] = bool(result['is_returning'])
        result['last_reported'] = int(result['last_reported'])
        return result

    def eval(self, station_id: str):
        """Return one row for the requested station, or no rows on a miss or error."""
        import requests
        url = "https://gbfs.kappa.fifteen.eu/gbfs/2.2/mobi/en/station_status.json"
        try:
            r = requests.get(url, timeout=10)
            data = r.json()
            stations = data.get('data', {}).get('stations', [])
            matches = [s for s in stations if str(s.get('station_id')) == str(station_id)]
            if not matches:
                return []
            return [self.ensure_station_fields(matches[0])]
        except Exception:
            return []
$$

In [0]:
%sql
SELECT * FROM `${catalog}`.`${schema}`.live_station_status('0152')

## Tool: `nearby_stations`

`nearby_stations(target_lat, target_lon, radius_km)` finds stations **within a radius** of a given location.

What it does:

- Calls the GBFS **station_information** endpoint to get all stations and their coordinates.
- Uses the **haversine formula** inside Python to compute distance in kilometers.
- Returns all stations within `radius_km`, with:
  - `station_id`
  - `station_name`
  - `lat`, `lon`
  - `distance_km`
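
The distance filter described above can be sketched as a standalone function. This is a minimal illustration, assuming a spherical Earth with a mean radius of 6371 km. The station coordinates are made up, and `haversine_km` and `stations_within` are hypothetical names, not part of the tool itself.

```python
import math
from typing import Dict, List


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def stations_within(
    stations: List[Dict], target_lat: float, target_lon: float, radius_km: float
) -> List[Dict]:
    """Keep stations inside the radius, closest first."""
    hits = []
    for s in stations:
        dist = haversine_km(target_lat, target_lon, s["lat"], s["lon"])
        if dist <= radius_km:
            hits.append({**s, "distance_km": dist})
    return sorted(hits, key=lambda s: s["distance_km"])


# Made-up coordinates near downtown Vancouver, for illustration only.
sample_stations = [
    {"station_id": "B", "lat": 49.3000, "lon": -123.1207},
    {"station_id": "A", "lat": 49.2850, "lon": -123.1200},
    {"station_id": "C", "lat": 49.2000, "lon": -123.0000},
]

nearby = stations_within(sample_stations, 49.2827, -123.1207, 1.0)
print([s["station_id"] for s in nearby])  # ['A']
```

As a sanity check, one degree of latitude comes out at roughly 111 km with this radius, which is a quick way to confirm the formula is wired up correctly.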

Why this is powerful:

- It turns a **raw API feed** into a simple SQL function:
  ```sql
  SELECT * FROM nearby_stations(49.2827, -123.1207, 1.0)
  ```


In [0]:
%sql
CREATE OR REPLACE FUNCTION `${catalog}`.`${schema}`.nearby_stations(
  target_lat DOUBLE,
  target_lon DOUBLE,
  radius_km DOUBLE
)
RETURNS TABLE (
  station_id STRING,
  station_name STRING,
  lat DOUBLE,
  lon DOUBLE,
  distance_km DOUBLE
)
LANGUAGE PYTHON
HANDLER 'Nearby'
AS $$
from typing import Any, Dict


class Nearby:
    """Calculate distances to nearby stations using GBFS metadata."""

    def haversine(self, lat1: float, lon1: float, lat2: float, lon2: float) -> float:
        """Return the haversine distance in kilometers between two points."""
        import math

        radius_km = 6371.0
        dlat = math.radians(lat2 - lat1)
        dlon = math.radians(lon2 - lon1)
        a = (
            math.sin(dlat / 2) ** 2
            + math.cos(math.radians(lat1))
            * math.cos(math.radians(lat2))
            * math.sin(dlon / 2) ** 2
        )
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
        return radius_km * c

    def eval(self, target_lat: float, target_lon: float, radius_km: float):
        """Return stations within the requested radius ordered by proximity."""
        import requests

        url = "https://gbfs.kappa.fifteen.eu/gbfs/2.2/mobi/en/station_information.json"
        try:
            r = requests.get(url, timeout=10)
            stations = r.json().get("data", {}).get("stations", [])
            res = []
            for station in stations:
                try:
                    lat = float(station.get("lat"))
                    lon = float(station.get("lon"))
                    dist = self.haversine(float(target_lat), float(target_lon), lat, lon)
                    if dist <= float(radius_km):
                        res.append(
                            {
                                "station_id": str(station.get("station_id")),
                                "station_name": station.get("name"),
                                "lat": lat,
                                "lon": lon,
                                "distance_km": dist,
                            }
                        )
                except Exception:
                    # Skip malformed station entries
                    continue
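            # Sort by distance so results match the "ordered by proximity" docstring
            res.sort(key=lambda row: row["distance_km"])
            return res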
        except Exception:
            # On network/parse error, return empty result
            return []
$$


In [0]:
%sql
SELECT * FROM `${catalog}`.`${schema}`.nearby_stations(49.2827, -123.1207, 1.0)

## Other Ideas for Hackathon Tools

These examples are meant to inspire your own tools. A few ideas you could build next:

- `weather_for_trip(trip_id)`  
  - Look up the trip’s start time and location from `silver_trips`.  
  - Call a weather API (e.g., historical weather) to return temperature, precipitation, etc.

- `station_health(station_id)`  
  - Aggregate the last N days of data to compute:
    - average bikes available  
    - % of time the station was empty or full  
  - Useful for planning rebalancing or spotting problem stations.

- `trip_summary_for_rider(rider_id)` (if rider IDs exist or are simulated)  
  - Summarize total trips, distance, time of day, and favorite stations.

- `events_near_station(station_id, day)`  
  - Join station coordinates with a dataset of events or POIs.  
  - Let an agent explain unusual demand patterns.

- `recommend_station(lat, lon, time_of_day)`  
  - Combine `nearby_stations` with historical load patterns to suggest the “best” station.

All of these can follow the same patterns shown here:

1. Use **SQL** to query `silver_trips` / `silver_stations`.  
2. Use **Python UDTFs** when you need external APIs or non-trivial logic.  
3. Expose them as **SQL table functions** so they are easy to call from Genie rooms and agents.
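
As one illustration of the ideas above, the aggregation behind `station_health` could be sketched as follows. It assumes you have already collected periodic status snapshots for one station, which this notebook does not do. The snapshot values are made up and the output field names are hypothetical.

```python
from typing import Dict, List


def station_health(snapshots: List[Dict[str, int]]) -> Dict[str, float]:
    """Summarize availability snapshots for a single station."""
    if not snapshots:
        # No data: return zeros rather than dividing by zero.
        return {"avg_bikes": 0.0, "pct_empty": 0.0, "pct_full": 0.0}
    n = len(snapshots)
    avg_bikes = sum(s["num_bikes_available"] for s in snapshots) / n
    # "Empty" means no bikes to rent; "full" means no docks to return to.
    empty = sum(1 for s in snapshots if s["num_bikes_available"] == 0)
    full = sum(1 for s in snapshots if s["num_docks_available"] == 0)
    return {
        "avg_bikes": avg_bikes,
        "pct_empty": 100.0 * empty / n,
        "pct_full": 100.0 * full / n,
    }


# Made-up snapshots for illustration only.
snapshots = [
    {"num_bikes_available": 0, "num_docks_available": 15},
    {"num_bikes_available": 5, "num_docks_available": 10},
    {"num_bikes_available": 15, "num_docks_available": 0},
    {"num_bikes_available": 4, "num_docks_available": 11},
]

summary = station_health(snapshots)
print(summary)  # {'avg_bikes': 6.0, 'pct_empty': 25.0, 'pct_full': 25.0}
```

In a real tool, the same aggregation would more likely be a SQL query over a Delta table of snapshots. The Python version is just a quick way to pin down the expected numbers.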
