# AI_FORECAST DBSQL Function

AI_FORECAST is a table-valued function designed to extrapolate time series data into the future. In its most general form, AI_FORECAST accepts __grouped, multivariate, mixed-granularity data__ and forecasts that data up to some horizon in the future.


AI_FORECAST is an all-in-one function for making out-of-sample predictions on a large number of time series simultaneously. AI_FORECAST is useful for:

- On-the-fly applications where training and persisting models is not required (e.g. dashboards, investigations).
- Scenarios where model persistence is complicated or cumbersome (e.g. generating forecasts for multiple grouping set rollups over the same dataset, or when some dimensions have a few months of data and others have years of data).
- Forecasting “at scale”, in the sense that many independent models are trained and evaluated simultaneously.

### Prerequisites 
Standard compute clusters or SQL warehouses running DBR 15.1+. The function is not yet available on serverless.

### Setup
AI_FORECAST can be enabled in standard compute environments (i.e. not SQL warehouses) at a session level via the Spark configuration.

`SET spark.databricks.sql.functions.aiForecast.enabled = TRUE`

### API

```
SELECT ... FROM AI_FORECAST(
  observed TABLE,
  horizon DATE | TIMESTAMP | STRING,
  time_col STRING,
  value_col STRING | ARRAY<STRING>,
  group_col STRING | ARRAY<STRING> | NULL DEFAULT NULL,
  prediction_interval_width DOUBLE DEFAULT 0.95,
  frequency STRING DEFAULT 'auto',
  seed INTEGER | NULL DEFAULT NULL,
  parameters STRING DEFAULT '{}' -- * New in DBR 15.3 *
)

```
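As a rough sketch of how the named arguments fit together, the call below passes every optional argument explicitly. The `daily_sales` table and its `ds`, `sales`, and `store_id` columns are hypothetical. The `frequency` value `'D'` assumes that a day-level frequency string is accepted, and the `parameters` JSON reuses the `global_floor` key that appears in the examples later in this notebook.

```sql
-- Hypothetical input table: daily_sales(ds DATE, sales DOUBLE, store_id INT)
SELECT *
FROM AI_FORECAST(
  TABLE(daily_sales),
  horizon => '2024-12-31',             -- forecast up to this date
  time_col => 'ds',
  value_col => 'sales',
  group_col => 'store_id',             -- one forecast per store
  prediction_interval_width => 0.9,    -- overrides the 0.95 default
  frequency => 'D',                    -- assumed spelling for a daily frequency
  seed => 42,                          -- explicit seed instead of the NULL default
  parameters => '{"global_floor": 0}'  -- parameters argument is new in DBR 15.3
)
```

Only `observed`, `horizon`, `time_col`, and `value_col` lack defaults in the signature above, so the remaining arguments can be dropped when the defaults are acceptable.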


# Examples

In [0]:
SET spark.databricks.sql.functions.aiForecast.enabled = TRUE

### Forecast until a Specified Date

In [0]:
SELECT * FROM samples.nyctaxi.trips ORDER BY tpep_pickup_datetime DESC

##### Let's say we wanted to forecast revenue (sum of fare_amount) per day.

In [0]:
WITH
aggregated AS (
  SELECT
    DATE(tpep_pickup_datetime) AS ds,
    SUM(fare_amount) AS revenue
  FROM
    samples.nyctaxi.trips
  GROUP BY
    1
)
SELECT * FROM AI_FORECAST(
  TABLE(aggregated),
  horizon => '2016-03-31',
  time_col => 'ds',
  value_col => 'revenue'
)
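
The forecast columns are named after the value column. The investigation example later in this notebook reads `y_forecast` for a value column named `y`, so the sketch below assumes the same `_forecast` suffix for `revenue`. The `_lower` and `_upper` columns for the prediction interval bounds are an assumption that this notebook does not show directly.

```sql
WITH
aggregated AS (
  SELECT
    DATE(tpep_pickup_datetime) AS ds,
    SUM(fare_amount) AS revenue
  FROM
    samples.nyctaxi.trips
  GROUP BY
    1
)
SELECT
  ds,
  revenue_forecast,
  revenue_lower,  -- assumed name for the lower interval bound
  revenue_upper   -- assumed name for the upper interval bound
FROM AI_FORECAST(
  TABLE(aggregated),
  horizon => '2016-03-31',
  time_col => 'ds',
  value_col => 'revenue'
)
ORDER BY ds
```

Selecting the columns by name rather than with `SELECT *` makes it easier to plot the point forecast together with its interval.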



##### Let's say we wanted to forecast the number of trips per day.

In [0]:
WITH
aggregated AS (
  SELECT
    DATE(tpep_pickup_datetime) AS ds,
    COUNT(*) AS n_trips
  FROM
    samples.nyctaxi.trips
  GROUP BY
    1
)
SELECT * FROM AI_FORECAST(
  TABLE(aggregated),
  horizon => '2016-03-31',
  time_col => 'ds',
  value_col => 'n_trips'
)



### A slightly more complex example

Tables very commonly do not materialize rows for zeros or empty entries. If the values of the missing entries can be inferred (e.g. 0, 100%, etc.), then these values should be coalesced prior to calling the forecast function. If the values are truly missing or unknown, then they can be left empty.

For very sparse data (e.g. >50% missing entries), it is best practice to provide a frequency value explicitly. Two entries 35 days apart will be inferred as a time series with granularity 35D, rather than a daily series with 34 missing entries.
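
A minimal sketch of passing the frequency explicitly, using the `frequency` argument from the API above. The value `'D'` is an assumption about the accepted spelling for a daily frequency, and the zip code filter is only there to produce a sparse series.

```sql
WITH
aggregated AS (
  SELECT
    DATE(tpep_pickup_datetime) AS ds,
    SUM(fare_amount) AS revenue
  FROM
    samples.nyctaxi.trips
  WHERE dropoff_zip = 7114
  GROUP BY
    1
)
SELECT * FROM AI_FORECAST(
  TABLE(aggregated),
  horizon => '2016-03-31',
  time_col => 'ds',
  value_col => 'revenue',
  frequency => 'D'  -- treat the series as daily instead of inferring the granularity
)
```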


Here's an example of missing dates in the `nyctaxi.trips` table.

In [0]:
SELECT
    DATE(tpep_pickup_datetime) AS ds,
    dropoff_zip,
    SUM(fare_amount) AS revenue,
    COUNT(*) AS n_trips
  FROM
    samples.nyctaxi.trips
  WHERE dropoff_zip = 7114
  GROUP BY
    1, 2
  ORDER BY ds 


##### Let's say we wanted to forecast revenue AND number of trips for each dropoff zip code.

In [0]:
-- Generate the aggregated table from the nyctaxi.trips
WITH
aggregated AS (
  SELECT
    DATE(tpep_pickup_datetime) AS ds,
    dropoff_zip,
    SUM(fare_amount) AS revenue,
    COUNT(*) AS n_trips
  FROM
    samples.nyctaxi.trips
  GROUP BY
    1, 2
),
-- Generate the full series of dates for each zip code, including dates with no trips
spine AS (
  SELECT all_dates.ds, all_zipcodes.dropoff_zip
  FROM (SELECT DISTINCT ds FROM aggregated) all_dates
  CROSS JOIN (SELECT DISTINCT dropoff_zip FROM aggregated) all_zipcodes
)
-- Perform forecast on the spine and aggregated table
SELECT * FROM AI_FORECAST(
-- Input table fills in zero for dates that were originally empty
  TABLE(
    SELECT
      spine.*,
      COALESCE(aggregated.revenue, 0) AS revenue,
      COALESCE(aggregated.n_trips, 0) AS n_trips
    FROM spine LEFT JOIN aggregated USING (ds, dropoff_zip)
  ),
  horizon => '2016-03-31',
  time_col => 'ds',
  value_col => ARRAY('revenue', 'n_trips'),
  group_col => 'dropoff_zip',
  prediction_interval_width => 0.9,
  parameters => '{"global_floor": 0}'
)
ORDER BY dropoff_zip, ds


###### To help better visualize what the `spine` table looks like:

In [0]:
WITH
aggregated AS (
  SELECT
    DATE(tpep_pickup_datetime) AS ds,
    dropoff_zip,
    SUM(fare_amount) AS revenue,
    COUNT(*) AS n_trips
  FROM
    samples.nyctaxi.trips
  GROUP BY
    1, 2
),
-- Generate the full series of dates for each zip code, including dates with no trips
spine AS (
  SELECT all_dates.ds, all_zipcodes.dropoff_zip
  FROM (SELECT DISTINCT ds FROM aggregated) all_dates
  CROSS JOIN (SELECT DISTINCT dropoff_zip FROM aggregated) all_zipcodes
)
SELECT * FROM spine
WHERE dropoff_zip = 7114
ORDER BY ds ASC

###### To help better visualize what the input table looks like:

In [0]:
-- Generate the aggregated table from the nyctaxi.trips
WITH
aggregated AS (
  SELECT
    DATE(tpep_pickup_datetime) AS ds,
    dropoff_zip,
    SUM(fare_amount) AS revenue,
    COUNT(*) AS n_trips
  FROM
    samples.nyctaxi.trips
  GROUP BY
    1, 2
),
-- Generate the full series of dates for each zip code, including dates with no trips
spine AS (
  SELECT all_dates.ds, all_zipcodes.dropoff_zip
  FROM (SELECT DISTINCT ds FROM aggregated) all_dates
  CROSS JOIN (SELECT DISTINCT dropoff_zip FROM aggregated) all_zipcodes
)
-- Join the spine to the aggregated table, filling in zero for dates that were originally empty
SELECT
  spine.*,
  COALESCE(aggregated.revenue, 0) AS revenue,
  COALESCE(aggregated.n_trips, 0) AS n_trips
FROM spine LEFT JOIN aggregated USING (ds, dropoff_zip)
WHERE dropoff_zip = 7114
ORDER BY ds ASC


### Daily + Hourly Forecasting
AI_FORECAST can be used to generate forecasts at multiple granularities spanning the same window of time.


In [0]:
SELECT * FROM AI_FORECAST(
-- Daily aggregations of revenue
  TABLE(
    SELECT
      DATE_TRUNC('DAY', tpep_pickup_datetime) AS ts,
      ANY_VALUE('DAY') AS granularity,
      SUM(fare_amount) AS revenue
    FROM
      samples.nyctaxi.trips
    GROUP BY
      1
    
    UNION ALL
-- Hourly aggregations of revenue
    SELECT
      DATE_TRUNC('HOUR', tpep_pickup_datetime) AS ts,
      ANY_VALUE('HOUR') AS granularity,
      SUM(fare_amount) AS revenue
    FROM
      samples.nyctaxi.trips
    GROUP BY
      1
  ),
  horizon => '2016-03-31',
  time_col => 'ts',
  value_col => 'revenue',
  group_col => 'granularity'
)



### Investigations
AI_FORECAST can be used to perform drill-down investigations. Join forecasting results with the original table to compute residuals. Pair this functionality with grouping set rollups to quickly isolate unexpected changes.

In the sample data below, we have introduced an anomaly for all CA/Rural UIDs on 2023-01-31.


In [0]:
CREATE OR REPLACE TEMPORARY VIEW
hierarchical_data_with_an_anomalous_date
AS
WITH
-- Create the dimensions for the dataset
dimensions AS (
  SELECT
    country, population, uid, 10 * RAND() AS intercept, RAND() AS slope
  LATERAL VIEW
    EXPLODE(ARRAY('US', 'CA', 'UK', 'IN')) t1 AS country
  LATERAL VIEW
    EXPLODE(ARRAY('Urban', 'Rural', 'Suburban')) t2 AS population
  LATERAL VIEW
    EXPLODE(SEQUENCE(0, 10)) t3 AS uid
),
-- Create the timestamps for the dataset
dim_times AS (
  SELECT dimensions.*, ts, DATEDIFF(HOUR, '2023-01-01', ts) AS x
  FROM dimensions
  LATERAL VIEW
    EXPLODE(SEQUENCE(
      TIMESTAMP('2023-01-01'),
      TIMESTAMP('2023-02-01'),
      INTERVAL 1 HOUR
    )) t AS ts
)
-- Create the value column
SELECT
  dim_times.*,
  (intercept + (slope * x) + RANDN())
  * IF(
      -- Introduce an anomaly for all CA/Rural UIDs on 2023-01-31
      DATE(ts) = '2023-01-31'
      AND country = 'CA'
      AND population = 'Rural',
      0.75,
      1.0
  ) AS y
FROM
  dim_times;
SELECT country, population, uid, ts, y FROM hierarchical_data_with_an_anomalous_date;


##### Let's see if we can detect anomalies in the data on `2023-01-31` using AI_FORECAST().

In [0]:
WITH
-- Calculate rollup values for each dimension
rollups AS (
  SELECT country, population, uid, ts, SUM(y) AS y
  FROM hierarchical_data_with_an_anomalous_date
  GROUP BY GROUPING SETS(
    (country, population, uid, ts),
    (country, population, ts),
    (country, ts)
  )
),
-- Get observations for the target investigation date: 2023-01-31
obs AS (SELECT * FROM rollups WHERE DATE(ts) = '2023-01-31'),
-- Calculate the forecast for 2023-01-31 using historical data
fcst AS (
  SELECT * FROM AI_FORECAST(
    TABLE(SELECT * FROM rollups WHERE ts < '2023-01-31'),
    horizon => '2023-02-01',
    time_col => 'ts',
    value_col => 'y',
    group_col => ARRAY('country', 'population', 'uid')
  )
)
-- Find the groupings with the largest mean absolute deviation from the predicted value
SELECT
  obs.country,
  IF(obs.population IS NULL, '[All]', obs.population) AS population,
  IF(obs.uid IS NULL, '[All]', CAST(obs.uid AS STRING)) AS uid,
  AVG(ABS(obs.y - fcst.y_forecast)) AS mean_abs_deviation
FROM
  obs
JOIN
  fcst
ON
  fcst.ts = obs.ts
  AND fcst.country <=> obs.country
  AND fcst.population <=> obs.population
  AND fcst.uid <=> obs.uid
GROUP BY
  1, 2, 3
ORDER BY
  4 DESC
LIMIT
  15
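
As a complementary check, observed values can be compared against the prediction interval instead of the point forecast. This sketch reuses the `hierarchical_data_with_an_anomalous_date` view and only looks at country-level totals. It assumes the interval bounds are returned as `y_lower` and `y_upper`, following the `y_forecast` naming used above. The `country_totals` name is introduced here for illustration.

```sql
WITH
-- Country-level hourly totals
country_totals AS (
  SELECT country, ts, SUM(y) AS y
  FROM hierarchical_data_with_an_anomalous_date
  GROUP BY country, ts
),
-- Forecast 2023-01-31 from the preceding history
fcst AS (
  SELECT * FROM AI_FORECAST(
    TABLE(SELECT * FROM country_totals WHERE ts < '2023-01-31'),
    horizon => '2023-02-01',
    time_col => 'ts',
    value_col => 'y',
    group_col => 'country'
  )
)
-- Count the hours on 2023-01-31 whose observed value falls outside the interval
SELECT
  obs.country,
  COUNT(*) AS n_hours,
  COUNT_IF(obs.y < fcst.y_lower OR obs.y > fcst.y_upper) AS n_hours_outside_interval
FROM
  country_totals AS obs
JOIN
  fcst
ON
  fcst.ts = obs.ts
  AND fcst.country = obs.country
WHERE
  DATE(obs.ts) = '2023-01-31'
GROUP BY
  1
ORDER BY
  3 DESC
```

A country whose count of out-of-interval hours stands out from the rest is a candidate for the finer-grained drill-down shown above.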
