# Backcasting Demo Notebook

_Loren Champlin_

Adapted from _Adarsh Pyarelal_'s WM 12 Month Evaluation Notebook 

As always, we begin with imports, and print out the commit hash for a rendered
version of the notebook.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')
import numpy as np
import pandas as pd
from delphi.db import engine
import random as rm
import delphi.evaluation_port as EN
import warnings
warnings.filterwarnings("ignore")
import logging
logging.getLogger().setLevel(logging.CRITICAL)
from delphi.cpp.AnalysisGraph import AnalysisGraph as AG, InitialBeta as IB, RNG
import time
import seaborn as sns
import matplotlib.pyplot as plt

Here I set the random seeds for NumPy, Python's random module, and Delphi's RNG.

In [None]:
np.random.seed(87)
rm.seed(87)
R = RNG.rng()
R.set_seed(87)
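
As a minimal sketch of why the seeds are fixed, the snippet below uses only Python's standard random module: reseeding with the same value reproduces the same draws. The helper name draw_samples is hypothetical and is not part of Delphi.

```python
import random


def draw_samples(seed, n=3):
    """Seed Python's global generator, then draw n uniform samples."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first_run = draw_samples(87)
second_run = draw_samples(87)
other_seed = draw_samples(88)

# The same seed reproduces the same draws; a different seed does not.
print(first_run == second_run)
print(first_run == other_seed)
```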

This is an example of constructing a statement that represents a two-node CAG. A statement is a list of tuples, where each tuple represents an edge. Each edge tuple contains two node tuples: the first represents the parent and the second represents the child. A node tuple holds the size of the node's effect on its child nodes, whether that effect is positive or negative (1 or -1), and the full node name.

In [None]:
statements = [ (("large", -1, "UN/entities/human/financial/economic/inflation"),("small", 1, "UN/events/human/human_migration"))]
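
To make the nesting explicit, here is a small sketch in plain Python (no Delphi required) that unpacks each edge of the statement into its parent and child fields. The helper describe_edges and the dictionary keys are hypothetical names chosen for illustration.

```python
# Same statement as in the cell above.
statements = [
    (
        ("large", -1, "UN/entities/human/financial/economic/inflation"),
        ("small", 1, "UN/events/human/human_migration"),
    )
]


def describe_edges(statements):
    """Unpack each (parent, child) edge into a flat dictionary."""
    edges = []
    for parent, child in statements:
        parent_size, parent_polarity, parent_name = parent
        child_size, child_polarity, child_name = child
        edges.append(
            {
                "parent": parent_name,
                "parent_size": parent_size,
                "parent_polarity": parent_polarity,
                "child": child_name,
                "child_size": child_size,
                "child_polarity": child_polarity,
            }
        )
    return edges


edges = describe_edges(statements)
for edge in edges:
    print(edge["parent"], "->", edge["child"])
```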



Now we load the Causal Analysis Graph (CAG) using the statement above. 

In [None]:
start_time = time.time()
G = AG.from_statements(statements)


Next we map indicator variables to nodes. For the most part, indicator variables can be inferred from the available data and texts, but we can also map indicators to nodes manually.

In [None]:
G.map_concepts_to_indicators()

G.replace_indicator("UN/events/human/human_migration","Net migration","New asylum seeking applicants", "UNHCR")

Here we train the inference model for the Causal Analysis Graph. In the cell below, the CAG G calls the train_model method.

Its first four arguments are:
- start_year: The initial year to start training from.
- start_month: The initial month to start training from.
- end_year: The ending year for training.
- end_month: The ending month for training. 

The above arguments ensure that the model is trained on the data from the chosen time range.

The 5th argument is the sample resolution (set to 500 in the call below; the default is 200), and the 6th argument is the number of samples to burn in the MCMC sampler before retaining samples. Finally, the last argument sets the initial conditions for the MCMC sampler; IB.ZERO sets the betas initially to zero.

The train_model function also accepts the same parameter arguments as parameterize, which allow the country, state, and units to be set.

In [None]:
G.train_model(2015,1,2015,12,500,90000,initial_beta=IB.ZERO)
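
To illustrate what the sample resolution and burn-in arguments control, here is a toy random-walk Metropolis sampler in plain Python. It is only a sketch of the general idea, not Delphi's sampler: the target is a standard normal, and the names metropolis, n_retained, and n_burn are hypothetical.

```python
import math
import random


def metropolis(n_retained, n_burn, seed=87, step=1.0):
    """Toy random-walk Metropolis sampler targeting a standard normal.

    The first n_burn states are discarded; the next n_retained are kept.
    """
    rng = random.Random(seed)
    x = 0.0  # initial state, loosely analogous to starting the betas at zero
    retained = []
    for i in range(n_burn + n_retained):
        proposal = x + rng.uniform(-step, step)
        # Log acceptance ratio for a standard normal target density.
        log_ratio = 0.5 * (x * x - proposal * proposal)
        # Accept with probability min(1, exp(log_ratio)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        # Only keep states visited after the burn-in period.
        if i >= n_burn:
            retained.append(x)
    return retained


samples = metropolis(n_retained=500, n_burn=2000)
print(len(samples))  # 500
```

A longer burn-in gives the chain more time to forget its starting point, while a higher resolution yields more retained samples at the cost of run time.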

The next function generates predictions for a given time range. Like train_model, it takes the arguments start_year, start_month, end_year, and end_month, which specify the time range for predictions. The function returns a tuple: the first element is a list of the prediction dates and the second element is a nested data structure. The nested structure is a list containing one list per sample (the number of samples is set by res in train_model()). Each inner list contains one dictionary per time step in the prediction range (including the 0th time step). The keys of these dictionaries are the node names as strings, and the values are themselves dictionaries, whose keys are the indicator names as strings and whose values are the predicted values.

*Note: The predictions can depend heavily on the initial conditions, which are determined by the initial date of the prediction range (i.e., I suspect there is an initial condition bias). It remains to be tested whether starting predictions from the initial training date or from the end of the training range yields more accurate predictions. For example, if we train from January 2015 to December 2015 and want predictions for January 2016 to December 2016, is it better to start predicting from January 2015 or from the start of the dates we want (January 2016)? Starting predictions one time step before the prediction range is another possibility that may give the most accurate predictions.

In [None]:
preds = G.generate_prediction(2016,1,2016,12)
end_time = time.time()

total_time = end_time-start_time

In [None]:
total_time
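
The nested structure described above can be hard to picture, so here is a sketch that builds a small mock of it in plain Python and pulls out the values for one indicator. The mock values, the date labels, and the helper extract_indicator are all made up for illustration; the real preds object comes from generate_prediction and its date format may differ.

```python
# Mock of the described structure: (dates, samples), where each sample is a
# list of per-time-step dictionaries {node_name: {indicator_name: value}}.
node = "UN/events/human/human_migration"
indicator = "New asylum seeking applicants"
dates = ["2016-1", "2016-2", "2016-3"]  # placeholder date labels
mock_preds = (
    dates,
    [
        [{node: {indicator: 10.0}}, {node: {indicator: 12.0}}, {node: {indicator: 15.0}}],
        [{node: {indicator: 11.0}}, {node: {indicator: 13.0}}, {node: {indicator: 14.0}}],
    ],
)


def extract_indicator(preds, indicator_name):
    """Return one row per sample and one column per time step."""
    _, samples = preds
    rows = []
    for sample in samples:
        row = []
        for time_step in sample:
            # Search every node's indicator dictionary at this time step.
            for indicators in time_step.values():
                if indicator_name in indicators:
                    row.append(indicators[indicator_name])
        rows.append(row)
    return rows


table = extract_indicator(mock_preds, indicator)
print(table)  # [[10.0, 12.0, 15.0], [11.0, 13.0, 14.0]]
```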

Now that the predictions have been generated, a user can store or present them however they choose. The evaluation module, however, comes with several convenient options for displaying output for a specific indicator. The first option is to return the raw predictions for a given indicator variable as a numpy array. This lets one do their own plotting and manipulation for a given indicator without having to sort through the entire prediction structure.

*Note: True data values from the delphi database can be retrieved using the data_to_df function in evaluation.py. 

In [None]:
EN.pred_to_array(preds,'New asylum seeking applicants')

In [None]:
pred_array = EN.pred_to_array(preds,'New asylum seeking applicants')

df_pred = pd.DataFrame(pred_array)

df_pred.T.plot(legend=False)

In [None]:
sns.lineplot(x='variable',y='value',data=df_pred.melt(),err_style='bars',ci=99)

The evaluation module can also output a pandas dataframe with the mean of the predictions along with a specified confidence interval for a given indicator variable. There are also options for presenting the true values, the residuals, and error bounds based on the residuals.

*Note: Setting true_vals = True assumes that real data values exist in the database for the time points of the predictions. Because the data retrieval function returns heuristic estimates for missing data values, the "true" data may be completely made up if no real values exist for the prediction time range. Also, the mean_pred_to_df function should be passed the same country, state, and units arguments as train_model (if any were passed).

In [None]:
EN.mean_pred_to_df(preds,'New asylum seeking applicants',true_vals=True,ci=0.99)
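
For intuition about a per-time-step mean with an interval, here is one common way to compute such a summary by hand: the sample mean plus an empirical percentile interval. I have not confirmed that mean_pred_to_df uses this method, so treat it as an assumption; the helper summarize and the input values are hypothetical.

```python
import statistics


def summarize(samples_by_step, ci=0.99):
    """Mean and empirical percentile interval for each time step.

    samples_by_step[t] holds the predicted values of every sample at step t.
    """
    tail = (1.0 - ci) / 2.0
    summary = []
    for values in samples_by_step:
        ordered = sorted(values)
        last = len(ordered) - 1
        # Nearest-rank positions of the lower and upper percentiles.
        lower = ordered[int(round(tail * last))]
        upper = ordered[int(round((1.0 - tail) * last))]
        summary.append((statistics.mean(ordered), lower, upper))
    return summary


# One time step with 101 made-up sample values: 1.0, 2.0, ..., 101.0.
values = [float(v) for v in range(1, 102)]
summary = summarize([values], ci=0.9)
print(summary)  # [(51.0, 6.0, 96.0)]
```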

Finally, we can produce plots representing the same data shown above.

The plot types are:
- Prediction: Shows only the predictions with specified confidence intervals. This is the default setting.
- Comparison: Shows the predictions and confidence intervals along with a curve representing the true data values.
- Error: Plots just the error with the error bounds along with a red reference line at 0. 

*Note: The above note for mean_pred_to_df also holds for the Comparison and Error plot types. Any other string passed to plot_type falls back to the default 'Prediction' plot type. The save_as argument can be set to a filename (with extension) to save the plot as a file (e.g., save_as='pred_plot.pdf').

In [None]:
EN.pred_plot(preds,'New asylum seeking applicants',plot_type='Prediction',ci=0.99,save_as=None)