# Backcasting Demo Notebook

_Loren Champlin_

Adapted from _Adarsh Pyarelal_'s WM 12 Month Evaluation Notebook 

As always, we begin with imports, and print out the commit hash for a rendered
version of the notebook.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import pickle
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')
from delphi.visualization import visualize
import delphi.jupyter_tools as jt
import numpy as np
import pandas as pd
from delphi.db import engine
jt.print_commit_hash_message()
import random as rm
import delphi.evaluation as EN
import delphi.AnalysisGraph as AG
import warnings
warnings.filterwarnings("ignore")
import logging
logging.getLogger().setLevel(logging.CRITICAL)

Here we set the random seeds so that the results are reproducible.

In [None]:
np.random.seed(87)
rm.seed(87)
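
As a quick check of what seeding buys us, here is a minimal sketch that reseeds both generators and confirms that the same draws come back. It is not part of Delphi, and the helper name `draw` is made up for this illustration.

```python
import random as rm

import numpy as np


def draw(seed, n=3):
    """Reseed both generators, then return n draws from each."""
    np.random.seed(seed)
    rm.seed(seed)
    numpy_draws = np.random.rand(n).tolist()
    stdlib_draws = [rm.random() for _ in range(n)]
    return numpy_draws, stdlib_draws


first = draw(87)
second = draw(87)
print(first == second)  # identical seeds give identical draws
```

Note that `numpy` and the standard-library `random` module keep separate state, which is why both are seeded.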

Now we load the Causal Analysis Graph (CAG). This CAG was inferred by reading in a JSON corpus, then pruned and adjusted to be centered on human migration. We also print a list of the nodes contained in the CAG.

In [None]:
with open("../scripts/build/migration_centered_CAG.pkl",'rb') as f:
    G = pickle.load(f)

for n in G.nodes:
    print(n)

Next, we map indicator variables to nodes. For the most part, indicator variables can be inferred from the available data and texts, but we can also map indicators to nodes manually.

In [None]:
G.map_concepts_to_indicators()

G.set_indicator("UN/events/human/human_migration", "New asylum seeking applicants", "UNHCR")
G.set_indicator("UN/entities/human/financial/economic/market", "Inflation Rate", "ieconomics.com")
G.set_indicator("UN/entities/human/food/food_security", "IPC Phase Classification", "FEWSNET")

Here is also a list of the indicator variables, in the same order as the list of nodes above (i.e., "Claims on other sectors of the domestic economy" is attached to "UN/events/human/economic_crisis").

In [None]:
for n in G.nodes(data=True):
    for indicators in n[1]["indicators"].values():
        print(indicators.name)

In the cell below, we visualize the CAG parameterized with indicator values for January, 2012. Note that you can also specify units for particular indicator variables using a dictionary whose keys are the indicator variable names and whose values are the desired units. Default units are used if the selected units for an indicator variable do not exist.

Legend for visualization: 
- Red edge: overall inhibition, green edge: overall promotion
- Edge thickness corresponds roughly to the 'strength' of the influence.
- Edge opacity corresponds roughly to the number of evidence fragments 
  that support the causal relationship.

In [None]:
units = {"Claims on other sectors of the domestic economy": "annual growth as % of broad money"}
G.parameterize(year=2012, units=units)
visualize(G, indicators=True, indicator_values=True)
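
The fallback to default units can be pictured with a small sketch. This is not Delphi's implementation: the function `resolve_units`, the placeholder unit names, and the convention that the first available unit is the default are all assumptions made for illustration.

```python
def resolve_units(indicator, requested_units, available_units):
    """Pick the units to use for one indicator.

    requested_units: dict mapping indicator name -> requested units string.
    available_units: list of units recorded for the indicator; the first
        entry is treated as the default (an assumption of this sketch).
    """
    wanted = requested_units.get(indicator)
    if wanted in available_units:
        return wanted
    return available_units[0]  # fall back to the default units


# Placeholder unit names, invented for this example.
available = ["unit_a", "unit_b"]
print(resolve_units("some indicator", {"some indicator": "unit_b"}, available))
print(resolve_units("some indicator", {"some indicator": "unit_z"}, available))
print(resolve_units("some indicator", {}, available))
```

The first call returns the requested units. The other two fall back to the default, because the requested units either do not exist or were never specified.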

Here we train the inference model for the Causal Analysis Graph. Below, the CAG G is passed to the train_model function.

Other important arguments are:
- start_year: The initial year to start training from.
- start_month: The initial month to start training from.
- end_year: The ending year for training.
- end_month: The ending month for training. 

The above arguments ensure that the model is trained on the data within the given time range.

The sixth argument shown is the sample resolution (currently set to 1000; the default is 200).

The last argument passed, k, is a scale parameter for setting the "standard deviation" of the set of data values for each indicator variable over the given time range. This affects the standard deviation of the predictions.

The train_model function can also take the same arguments as parameterize, which allows the country, state, units, etc. to be set.

In [None]:
EN.train_model(G,2015,1,2015,12,1000,30000,k=1)
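
To make the time range concrete, the following minimal sketch enumerates the months that a start/end pair covers. It assumes monthly time steps and an inclusive end month. Both are assumptions, and the helper `months_in_range` is hypothetical rather than part of evaluation.py.

```python
def months_in_range(start_year, start_month, end_year, end_month):
    """Return every (year, month) pair from start to end, inclusive."""
    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append((year, month))
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return months


training_months = months_in_range(2015, 1, 2015, 12)
print(training_months[0], training_months[-1], len(training_months))
```

Under these assumptions, the training call above covers twelve monthly time points, from January, 2015 through December, 2015.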

The next function generates predictions for a given time range. Like train_model, it takes the arguments start_year, start_month, end_year, and end_month, which specify the time range for the predictions.

*Note: The predictions can rely heavily on the initial conditions, which are determined by the initial date of the prediction range (i.e., I suspect there is an initial condition bias). It remains to be tested whether starting predictions from the initial training date or from the end of the training range yields more accurate predictions. For example, if we train from January, 2015 to December, 2015 and want predictions for January, 2016 to December, 2016, is it better to start predicting from January, 2015 or from the start of the dates we want (January, 2016)? Initiating predictions one time step before the prediction range is also a candidate for yielding the most accurate predictions.

In [None]:
EN.generate_predictions(G,2016,1,2016,12)
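
The three candidate start dates discussed in the note can be listed with a short sketch. It assumes monthly time steps, and the helper `previous_month` is hypothetical. Which candidate gives the best predictions is still untested.

```python
def previous_month(year, month):
    """Return the (year, month) one monthly time step earlier."""
    if month == 1:
        return year - 1, 12
    return year, month - 1


train_start = (2015, 1)  # first month of the training range
pred_start = (2016, 1)   # first month we want predictions for

candidates = {
    "training start": train_start,
    "prediction start": pred_start,
    "one step before prediction start": previous_month(*pred_start),
}

for label, (year, month) in candidates.items():
    print(f"{label}: {year}-{month:02d}")
```

Each candidate could be passed as the start_year and start_month of generate_predictions, and the resulting predictions compared against the true values.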

Now that the predictions have been generated, there are several options for output. The first is to return the raw predictions for a given indicator variable as a numpy array. This allows you to do your own plotting and manipulation.

*Note: True data values from the delphi database can be retrieved using the data_to_df function in evaluation.py. 

In [None]:
EN.pred_to_array(G,'New asylum seeking applicants')
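
As an example of working with raw predictions directly, the sketch below computes a mean and a 95% percentile interval per time step. The array here is synthetic, and its layout (one row per sample, one column per time step) is an assumption. Check the `.shape` of the array returned by pred_to_array before reusing this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the raw predictions: 200 samples (rows) by
# 12 monthly time steps (columns). This layout is ASSUMED, not confirmed.
preds = rng.normal(loc=100.0, scale=10.0, size=(200, 12))

# Summarize across samples (axis 0) to get one value per time step.
mean = preds.mean(axis=0)
lower, upper = np.percentile(preds, [2.5, 97.5], axis=0)

for step, (lo, m, hi) in enumerate(zip(lower, mean, upper), start=1):
    print(f"step {step:2d}: mean={m:.1f}  95% interval=({lo:.1f}, {hi:.1f})")
```

If the real array turns out to be laid out the other way around, summarizing along `axis=1` instead gives the same per-time-step statistics.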

The evaluation.py module can also output a pandas dataframe with the mean of the predictions, along with a specified confidence interval, for a given indicator variable. There are also options for presenting the true values, the residuals, and error bounds based on the residuals.

*Note: Setting true_vals = True assumes that real data values exist in the database that match the time points of the predictions. Since the data retrieval function is set to return heuristic estimates for missing data values, it is possible to get completely "made-up" true data if none actually exist for the prediction time range. Also, the mean_pred_to_df function should be passed the same country, state, and units arguments as train_model (if any were passed).

In [None]:
EN.mean_pred_to_df(G,'New asylum seeking applicants',true_vals=True)
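
One plausible reading of "residuals" and "error bounds" is sketched below: the residual is the mean prediction minus the true value, and the error bounds are the confidence interval shifted by the true value. This is an assumption for illustration; evaluation.py may define them differently. The helper `residual_table` is hypothetical, and the numbers are made up.

```python
def residual_table(mean_preds, lower, upper, true_vals):
    """Build one row per time step with residuals and error bounds.

    residual    = mean prediction - true value   (assumed sign convention)
    error_lower = lower CI bound  - true value
    error_upper = upper CI bound  - true value
    """
    rows = []
    for m, lo, hi, t in zip(mean_preds, lower, upper, true_vals):
        rows.append({
            "mean": m,
            "true": t,
            "residual": m - t,
            "error_lower": lo - t,
            "error_upper": hi - t,
        })
    return rows


# Made-up numbers for two time steps, purely to exercise the function.
rows = residual_table([10.0, 12.0], [8.0, 9.0], [12.0, 15.0], [11.0, 12.0])
for row in rows:
    print(row)
```

With this convention, a prediction interval that contains the true value shows up as error bounds that straddle 0.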

Finally, we can get plots representing the same data shown above.

The plot types are:
- Prediction: Shows only the predictions with specified confidence intervals. This is the default setting.
- Comparison: Shows the predictions and confidence intervals along with a curve representing the true data values.
- Error: Plots just the error with the error bounds along with a red reference line at 0. 

*Note: The above note for mean_pred_to_df also holds true for the Comparison and Error plot types. Any other string passed to plot_type falls back to the default 'Prediction' plot type. The save_as argument can be set to a filename (with extension) to save the plot as a file (e.g., save_as='pred_plot.pdf').

In [None]:
EN.pred_plot(G,'New asylum seeking applicants',plot_type='Comparison',save_as=None)
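
The plot_type fallback described in the note amounts to the following logic. This is a sketch, not the code in evaluation.py. The helper `resolve_plot_type` is hypothetical, and treating the names as case-sensitive is an assumption.

```python
def resolve_plot_type(plot_type):
    """Return the plot type that would actually be drawn."""
    if plot_type in ("Comparison", "Error"):
        return plot_type
    return "Prediction"  # default for 'Prediction' and any other string


for requested in ("Prediction", "Comparison", "Error", "Histogram"):
    print(f"{requested!r} -> {resolve_plot_type(requested)!r}")
```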