# Using the VIEWS forecasts storage system

One interacts with the forecast storage system through a set of extensions to ordinary pandas dataframes.

The basic unit which the prediction store deals with is a '**run**'. A run should be thought of as equivalent to an ensemble (fatalities001 and fatalities002 are runs). Forecasts from the ensemble itself and from all its constituent models can all be associated with the same run.

The storage system has two main components:

## (i) the metadata store
This serves as an index of what is stored in the system. It is just a database on gjoll called **forecasts3**. To access the metadata store (and, in practice, the prediction store), therefore, **one must be on the VPN**.

Runs have a unique integer run_id associated with a unique run_name. **Very annoyingly, in the fatalities 00x notebooks, the name and integer id of the run are often both called the run_id.**

The metadata store contains quite a lot of basic data attached to each run (see below).

## (ii) the forecasts store
The actual forecast dataframes are stored on a Microsoft Azure instance.

There are commands to associate a given dataframe with a given run, to push a dataframe to the store, and to fetch dataframes from the store.


To proceed, first import the library:

In [1]:
import views_forecasts
from views_forecasts.extensions import * # this will fail or hang if you are not on the VPN

Off the VPN, the import fails with an error like this:

OperationalError: (psycopg2.OperationalError) could not translate host name "gjoll.muspelheim.local" to address: Name or service not known

(Background on this error at: https://sqlalche.me/e/14/e3q8)

In [2]:
import pandas as pd

To communicate with the metadata store, one instantiates a ViewsMetadata object:

In [3]:
vmd=ViewsMetadata()

To **create a new run**, one needs to come up with a name which is not currently in use. One can check to see if a name is in use by doing:

In [11]:
name='desperate_measures'

In [12]:
ViewsMetadata().with_name(name=name).fetch()

Unnamed: 0,id,name,file_name,runs_id,model_generations_id,user_name,spatial_loa,temporal_loa,ds,osa,time_min,time_max,space_min,space_max,steps,target,prediction_columns,date_written,description,deleted


The df returned by ViewsMetadata is empty, so this name is not in use. To create a new run, one needs a description:

In [6]:
description='Protests model using xgboost'

In [7]:
# Left commented out so that running this notebook does not create a run.
# vmd.new_run(name=name,description=description,min_month=1,max_month=999)

Creating a new run **permanently** associates **a new integer run_id** with the supplied unique **run name**.

Once a run exists, one can check the metadata associated with every stored file belonging to the run by doing:

In [9]:
name='orange_pasta'
ViewsMetadata().with_name(name=name).fetch()

Unnamed: 0,id,name,file_name,runs_id,model_generations_id,user_name,spatial_loa,temporal_loa,ds,osa,time_min,time_max,space_min,space_max,steps,target,prediction_columns,date_written,description,deleted
0,22317,orange_pasta_calibration,pr_1_orange_pasta_calibration.parquet,1,1,xiaolong,pg,m,False,True,397,444,62356,190511,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...",ged_sb_dep,"[step_pred_1, step_pred_10, step_pred_11, step...",2024-05-30 13:53:47.545939+00:00,,False


This shows that there is one file belonging to this run in the forecast storage system. The '1' in the 'pr_1' prefix of the filename is the integer run_id (run_id=1 indicates the generic 'tests' run).
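As a small illustration, the run_id can be read back out of a stored filename. The sketch below assumes that every filename follows the `pr_<run_id>_<name>.parquet` pattern; that pattern is inferred from the single example above and is not documented here. `run_id_from_filename` is a hypothetical helper, not part of views_forecasts.

```python
import re

# Assumed pattern, inferred from one example filename: pr_<run_id>_<name>.parquet
_FILENAME_PATTERN = re.compile(r"^pr_(\d+)_(.+)\.parquet$")


def run_id_from_filename(file_name):
    """Return (run_id, name) parsed from a stored filename (hypothetical helper)."""
    match = _FILENAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected filename format: {file_name!r}")
    return int(match.group(1)), match.group(2)


run_id, stored_name = run_id_from_filename("pr_1_orange_pasta_calibration.parquet")
print(run_id, stored_name)
```

In practice the `runs_id` column returned by `fetch()` is the more reliable source; parsing the filename is only a convenience.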

Important data returned here:

(i) **runs_id**: the unique run id the returned filenames belong to

(ii) **user_name**: who created the files

(iii) **spatial and temporal loa**: this tells you what the index of the dataframe corresponding to each file contains. Acceptable loas are:

        self.ACCEPTABLE_SPACE = {'c': ['country_id', 'c_id'],
                                 'a': ['actor_id', 'a_id'],
                                 'pg': ['priogrid_id', 'priogrid_gid', 'pg_gid', 'pg_id']}

        self.ACCEPTABLE_TIME = {'m': ['month_id'],
                                'y': ['year_id', 'year']}


(iv) **ds**: short for the obsolete 'dynasim' model class. This is autodetected from the input dataframe.

(v) **osa**: short for 'one_step_ahead' - models that produce n_steps predictions for every space unit for every time unit in the calibration/test partitions. This is autodetected from the input dataframe.

Note that a forecast may be **neither a ds nor an osa model**. This is not a problem.

(vi) **time min/max and space min/max** are computed when the file is pushed

(vii) **steps**: if the model type is osa (one-step-ahead), these are the steps that are being forecast. This is autodetected from the input dataframe.

(viii) **target**: what variable is being forecast

(ix) **prediction_columns**: the names of the predicted columns. To be autodetected as ds or osa, these must have '_pred' in their column names, and columns from step-shifted models must have 'step_pred'.

If no such column names exist, the run will be classified as neither ds nor osa.
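To make the fields above more concrete, here is a minimal pandas-only sketch of how the loas, the osa flag and the steps could be derived from a dataframe. It is not the library's actual detection code: `loa_of` and `summarise` are hypothetical helpers, the `step_pred_<n>` naming is assumed from the example output above, and ds detection is not attempted because its rule is not described here.

```python
import re

import pandas as pd

# Copied from the lists of acceptable loas above.
ACCEPTABLE_SPACE = {'c': ['country_id', 'c_id'],
                    'a': ['actor_id', 'a_id'],
                    'pg': ['priogrid_id', 'priogrid_gid', 'pg_gid', 'pg_id']}
ACCEPTABLE_TIME = {'m': ['month_id'],
                   'y': ['year_id', 'year']}


def loa_of(index_name, acceptable):
    """Return the loa key whose accepted names include index_name, else None."""
    for loa, names in acceptable.items():
        if index_name in names:
            return loa
    return None


def summarise(df):
    """Mimic a few metadata fields for a (time, space)-indexed df (hypothetical)."""
    time_name, space_name = df.index.names
    pred_cols = [c for c in df.columns if '_pred' in c]
    steps = []
    for column in pred_cols:
        # Assumption: step-shifted columns are named step_pred_<n>.
        match = re.fullmatch(r'step_pred_(\d+)', column)
        if match is not None:
            steps.append(int(match.group(1)))
    return {'temporal_loa': loa_of(time_name, ACCEPTABLE_TIME),
            'spatial_loa': loa_of(space_name, ACCEPTABLE_SPACE),
            'osa': bool(steps),
            'steps': sorted(steps),
            'prediction_columns': pred_cols}


index = pd.MultiIndex.from_product([[397, 398], [62356, 62357]],
                                   names=['month_id', 'priogrid_gid'])
df = pd.DataFrame({'ged_sb_dep': [0.0, 1.0, 0.0, 2.0],
                   'step_pred_1': [0.1, 0.9, 0.2, 1.8],
                   'step_pred_2': [0.2, 0.8, 0.1, 1.5]}, index=index)
summary = summarise(df)
print(summary)
```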

In [13]:
ViewsMetadata().get_runs_by_id(1)

Unnamed: 0_level_0,name,description,min_month,max_month
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,tests,Just for testing/development,,


## To store a file in the prediction store:

(i) There must be a run with a **unique run name** and a **unique run_id** to associate the file with. 

The run name/run_id should probably be associated with **an ensemble**, so that the forecasts belonging to the ensemble and the forecasts belonging to its constituent models are all associated with the same run_id, with the **names** of the models indicating which constituent model each forecast comes from.

If no suitable run exists, use 

    ViewsMetadata().new_run(name=run_name) 
    
to create one. If the run_name is already in use, an error will be returned.

Once a suitable run exists, a dataframe can be stored by doing

    df.forecasts.set_run(run_name)

    df.forecasts.to_store(name=dataframe_file_name,overwrite=True)

(ii) the dataframe to be pushed **should have a double-column MultiIndex**, with the time column first and the space column second, and with index names taken from the ACCEPTABLE_TIME and ACCEPTABLE_SPACE lists given above.

If no such index exists, the storage system **will try to build one from the column names**, in which case the df **must** contain one column named from ACCEPTABLE_TIME and one column named from ACCEPTABLE_SPACE.

**It is essential that the df be sorted by time unit, and by space unit within each time unit**. Failure to adhere to this will likely cause errors further downstream.

(iii) the target is autodetected: **The target is defined as the FIRST column of the df that does not contain '_id' or '_pred' in its column name**. 

The target should therefore be the **first column after the index** in the df.

If this requirement is too onerous, do

    df.set_target(target_column_name)

**before** trying to store the dataframe.

(iv) There is no restriction on the names of forecast columns or what they can contain.
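Points (ii) and (iii) can be sketched in plain pandas. The example below builds the (time, space) MultiIndex from two columns, sorts it by time and then by space within time, and applies the target rule as stated above to the remaining columns. `build_sorted_index` and `guess_target` are hypothetical helpers written for illustration; the storage system's own implementation may differ.

```python
import pandas as pd


def build_sorted_index(df, time_col, space_col):
    """Index by (time, space) and sort by time, then by space within time."""
    return df.set_index([time_col, space_col]).sort_index()


def guess_target(df):
    """Return the first column whose name contains neither '_id' nor '_pred'."""
    for column in df.columns:
        if '_id' not in column and '_pred' not in column:
            return column
    return None


# Deliberately unsorted input, with the index still held in ordinary columns.
raw = pd.DataFrame({'priogrid_gid': [62357, 62356, 62357, 62356],
                    'month_id': [398, 398, 397, 397],
                    'ged_sb_dep': [2.0, 0.0, 1.0, 0.0],
                    'step_pred_1': [1.8, 0.1, 0.9, 0.2]})

prepared = build_sorted_index(raw, 'month_id', 'priogrid_gid')
target = guess_target(prepared)
print(prepared.index.is_monotonic_increasing, target)
```

Note that the rule is applied here after the index has been built, so the index columns are no longer candidates for the target.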

In the view of Simon and Jim, the prediction store should be used for storing **true forecasts** only, whether point predictions or probabilistic.

Dataframes to be stored should therefore contain:

- a double-column MultiIndex or two columns that can be used to create one, and be sorted by time unit and by space unit within time units.
- a single column containing the target variable and named after it, placed immediately after (i.e. to the right of) the index, or of the columns to be used to build the index
- for point predictions, a single column (i.e. one prediction per unit-of-analysis) with a sensible name
- for probabilistic/uncertainty predictions, an arbitrary number of columns named **concisely** in such a way that it is clear what they represent (e.g. cumulative probabilities, HDIs, etc.), with the percentiles they represent included in the column name. **Anyone fetching the df should immediately be able to tell from the column names what they mean.**
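As one possible illustration of the layout described in this list, the sketch below builds a probabilistic forecast frame from simulated draws. Everything specific in it is an assumption made for the example: the `<target>_pred_p<percentile>` column naming is just one convention that puts the percentile in the name, the draws are random numbers, and the target column is filled with NaN as a placeholder because the text does not say what it should hold for true forecasts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Time level first, space level second, already sorted.
index = pd.MultiIndex.from_product([[397, 398], [1, 2, 3]],
                                   names=['month_id', 'country_id'])

# Hypothetical draws: 500 simulated outcomes per unit of analysis.
draws = rng.poisson(lam=3.0, size=(len(index), 500))

percentiles = [5, 50, 95]
quantiles = np.percentile(draws, percentiles, axis=1)  # shape (3, n_units)

# Target column comes first, named after the target (placeholder values).
forecast = pd.DataFrame({'ged_sb_dep': np.nan}, index=index)
for p, values in zip(percentiles, quantiles):
    # The percentile is spelled out in the column name, e.g. ..._pred_p05.
    forecast[f'ged_sb_dep_pred_p{p:02d}'] = values

print(forecast.columns.tolist())
```

Including '_pred' in these names also keeps them from being picked up by the target rule described earlier.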