# Evaluation with `justcause`

In this notebook, we exemplify how to use `justcause` to evaluate methods on reference datasets. For simplicity, we use only one dataset, but show how evaluation works with multiple methods, covering both the standard causal methods implemented in the framework and custom methods.


## Custom First
The goal of the `justcause` framework is to be a modular and flexible facilitator of causal evaluation.

In [1]:
%load_ext autoreload

%autoreload 2

# Loading all required packages 
import itertools
import numpy as np
from sklearn.model_selection import train_test_split

from justcause.data import Col
from justcause.data.sets import load_ihdp
from justcause.metrics import pehe_score, mean_absolute
from justcause.evaluation import (
    calc_scores,
    setup_result_df,
    setup_scores_df,
    summarize_scores,
)

from sklearn.linear_model import LinearRegression

### Setup data and methods you want to evaluate
Let's say we want to compare an S-Learner with propensity weighting, based on a propensity estimate of our choice. We then cannot simply use the predefined SLearner from `justcause.learners`, but have to provide our own adaptation, which first estimates propensities and then uses them to fit an adjusted model.

By providing a "blackbox" method like the one below, you can do whatever you want inside. For example, you can replace your predictions with the available factual outcomes, estimate the propensity in different ways, or even use the true propensity when it is available, as in the case of a generated dataset. You can also resort to out-of-sample prediction, where no information about treatment is provided to the method.
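To make the idea concrete, here is a minimal NumPy sketch of what a weighted S-Learner does conceptually. It is not the `justcause` implementation: the helper names `fit_weighted_slearner` and `predict_ite` are made up for this illustration, a plain linear model is assumed, and the data is a noise-free toy example.

```python
import numpy as np


def fit_weighted_slearner(X, t, y, weights=None):
    """Fit one linear model on [1, X, t] by weighted least squares (hypothetical helper)."""
    n = X.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    design = np.column_stack([np.ones(n), X, t])
    # Scaling rows by sqrt(w) turns ordinary least squares into weighted least squares
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef


def predict_ite(coef, X):
    """ITE = prediction under treatment minus prediction under control."""
    n = X.shape[0]
    treated = np.column_stack([np.ones(n), X, np.ones(n)])
    control = np.column_stack([np.ones(n), X, np.zeros(n)])
    return treated @ coef - control @ coef


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
t = rng.integers(0, 2, size=200)
y = X @ np.array([1.0, -2.0, 0.5]) + 2.0 * t  # toy data, true effect is 2

coef = fit_weighted_slearner(X, t, y, weights=rng.uniform(0.5, 2.0, size=200))
ite = predict_ite(coef, X)
```

With a purely linear base model and no interaction terms, the predicted ITE is the same for every unit, namely the coefficient on `t`.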

In [3]:
from justcause.learners import SLearner
from justcause.learners.propensity import estimate_propensities


data = load_ihdp()
metrics = [pehe_score, mean_absolute]

# Limit evaluation to the first 100 replications of IHDP
replications = list(itertools.islice(data, 100))
train_size = 0.8
random_state = 42

def weighted_slearner(train, test):
    """
    Custom method that takes 'train' and 'test' CausalFrames (see causal_frames.ipynb)
    and returns ITE predictions for both after training on 'train'.

    Implement your own method in a similar fashion to evaluate it within the framework!
    """
    train_X, train_t, train_y = train.np.X, train.np.t, train.np.y
    test_X, test_t, test_y = test.np.X, test.np.t, test.np.y
    
    
    # Get calibrated propensity estimates
    p = estimate_propensities(train_X, train_t)

    # Make sure the supplied learner accepts `sample_weight` in its fit() method
    slearner = SLearner(LinearRegression())
    
    # Weight with inverse probability of treatment (inverse propensity)
    slearner.fit(train_X, train_t, train_y, weights=1/p)
    return (
        slearner.predict_ite(train_X, train_t, train_y),
        slearner.predict_ite(test_X, test_t, test_y)
    )
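
The function above weights every unit by `1/p`. A common alternative, sketched here under the assumption that `p` is the estimated probability of receiving treatment, weights treated units by `1/p` and control units by `1/(1-p)`, and clips the propensities to avoid extreme weights. The name `iptw_weights` and the clipping bounds are illustrative choices, not part of `justcause`.

```python
import numpy as np


def iptw_weights(p, t, clip=(0.05, 0.95)):
    """Inverse probability of treatment weights (hypothetical helper)."""
    # Clip first so that neither 1/p nor 1/(1-p) can blow up
    p = np.clip(np.asarray(p, dtype=float), *clip)
    t = np.asarray(t)
    return np.where(t == 1, 1.0 / p, 1.0 / (1.0 - p))


p = np.array([0.8, 0.8, 0.2, 0.01])  # toy propensity estimates
t = np.array([1, 0, 1, 0])           # toy treatment indicators
w = iptw_weights(p, t)
```

Such a weight vector could be passed to a learner in place of `1/p`, provided the learner accepts per-sample weights.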

### Example Evaluation Loop
Now given a callable like `weighted_slearner` we can evaluate that method using multiple metrics on the given replications. 
The result dataframe then contains two rows with the summarized scores over all replications for train and test separately. 
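
For orientation, here is a sketch of what the two metrics are assumed to compute: PEHE as the root mean squared error between true and predicted ITEs, and `mean_absolute` as the absolute error of the implied average treatment effect. The functions below are stand-ins written for this illustration; consult `justcause.metrics` for the actual definitions.

```python
import numpy as np


def pehe_sketch(true_ite, pred_ite):
    """Root mean squared error of the ITE predictions."""
    diff = np.asarray(pred_ite) - np.asarray(true_ite)
    return np.sqrt(np.mean(diff ** 2))


def ate_error_sketch(true_ite, pred_ite):
    """Absolute difference between the true and the predicted average effect."""
    return np.abs(np.mean(true_ite) - np.mean(pred_ite))


true_ite = np.array([1.0, 2.0, 3.0, 4.0])
pred_ite = np.array([2.0, 2.0, 2.0, 4.0])

pehe = pehe_sketch(true_ite, pred_ite)          # individual-level error
ate_err = ate_error_sketch(true_ite, pred_ite)  # population-level error
```

In this toy case the individual errors cancel out on average, so the ATE error is zero while PEHE is not, which illustrates why it can be useful to look at both.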

In [4]:
results_df = setup_result_df(metrics)
    
test_scores = setup_scores_df(metrics)
train_scores = setup_scores_df(metrics)

for rep in replications:

    train, test = train_test_split(
        rep, train_size=train_size, random_state=random_state
    )

    # REPLACE this with the function you implemented and want to evaluate
    train_ite, test_ite = weighted_slearner(train, test)

    # Calculate the scores and append them to a dataframe
    test_scores.loc[len(test_scores)] = calc_scores(test[Col.ite],
                                                    test_ite,
                                                    metrics)

    train_scores.loc[len(train_scores)] = calc_scores(train[Col.ite],
                                                      train_ite,
                                                      metrics)

# Summarize the scores and save them in a dataframe
results_df.loc[len(results_df)] = np.append(['slearner', True], summarize_scores(train_scores))
results_df.loc[len(results_df)] = np.append(['slearner', False], summarize_scores(test_scores))

In [5]:
results_df

Unnamed: 0,method,train,pehe_score-mean,pehe_score-median,pehe_score-std,mean_absolute-mean,mean_absolute-median,mean_absolute-std
0,slearner,True,5.592355721307152,2.569471750747637,8.248291408843441,0.3699388743475459,0.2124273147540643,0.5243953093782159
1,slearner,False,5.493401193725237,2.589651399901557,7.903173959543428,0.6556018033508734,0.2872014321744491,0.9419412237403736


Now in this case, using `justcause` has hardly any advantages, because only one dataset and one method are used. You might as well implement the whole evaluation manually. However, the loop can easily be extended to more methods by iterating over the callables.

In [7]:
def basic_slearner(train, test):
    """Plain S-Learner without propensity weighting, used as a baseline."""
    train_X, train_t, train_y = train.np.X, train.np.t, train.np.y
    test_X, test_t, test_y = test.np.X, test.np.t, test.np.y

    slearner = SLearner(LinearRegression())
    slearner.fit(train_X, train_t, train_y)
    return (
        slearner.predict_ite(train_X, train_t, train_y),
        slearner.predict_ite(test_X, test_t, test_y)
    )

methods = [basic_slearner, weighted_slearner]

results_df = setup_result_df(metrics)

for method in methods:
    
    test_scores = setup_scores_df(metrics)
    train_scores = setup_scores_df(metrics)

    for rep in replications:

        train, test = train_test_split(
            rep, train_size=train_size, random_state=random_state
        )

        # Each method is a callable that takes (train, test) and returns ITE predictions for both
        train_ite, test_ite = method(train, test)

        # Calculate the scores and append them to a dataframe
        test_scores.loc[len(test_scores)] = calc_scores(test[Col.ite],
                                                        test_ite,
                                                        metrics)

        train_scores.loc[len(train_scores)] = calc_scores(train[Col.ite],
                                                        train_ite,
                                                        metrics)

    # Summarize the scores and save them in a dataframe
    results_df.loc[len(results_df)] = np.append([method.__name__, True], summarize_scores(train_scores))
    results_df.loc[len(results_df)] = np.append([method.__name__, False], summarize_scores(test_scores))

In [8]:
results_df

Unnamed: 0,method,train,pehe_score-mean,pehe_score-median,pehe_score-std,mean_absolute-mean,mean_absolute-median,mean_absolute-std
0,basic_slearner,True,5.633659795888926,2.623297102872905,8.362124759175456,0.7324426200135632,0.2381850431319927,1.4932757697867245
1,basic_slearner,False,5.625971000721637,2.6359926738390502,8.213625971533043,1.2926681149657069,0.3962455718526638,2.474603428686128
2,weighted_slearner,True,5.592355721307152,2.569471750747637,8.248291408843441,0.3699388743475459,0.2124273147540643,0.5243953093782159
3,weighted_slearner,False,5.493401193725237,2.589651399901557,7.903173959543428,0.6556018033508734,0.2872014321744491,0.9419412237403736


Because in most cases nothing within this loop changes for the ITE case, `justcause` provides a default implementation.

## Standard Evaluation of ITE predictions
Using the same list of method callables, we can just call `evaluate_ite` and pass all the information. The default implementation sets up a dataframe for the result following a certain convention. 

First, there are two columns: one naming the method the results belong to and one indicating whether they were calculated on train or test. Then, for each of the supplied `metrics`, one column per entry in `formats` is listed.
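
The column convention can be sketched as follows. This is an illustration that reproduces the header seen in the result tables of this notebook, not the code `justcause` uses internally; the helper `result_columns` is hypothetical.

```python
import itertools

import numpy as np


def result_columns(metric_names, formats=(np.mean, np.median, np.std)):
    """Build column names of the form '<metric>-<format>' (hypothetical helper)."""
    score_cols = [
        f"{metric}-{fmt.__name__}"
        for metric, fmt in itertools.product(metric_names, formats)
    ]
    # The two leading columns identify the method and the train/test split
    return ["method", "train"] + score_cols


columns = result_columns(["pehe_score", "mean_absolute"])
```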

Standard `metrics` such as PEHE or mean absolute error are implemented in `justcause.metrics`. 
The standard formats used for summarizing the scores over multiple replications are `np.mean, np.median, np.std`; other potentially interesting formats are *skewness*, *minmax*, and *kurtosis*. A function provided as a format must take an `axis` parameter, so that it can be applied to the scores dataframe.
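
As a sketch of a custom format, the function below computes the range of the scores and accepts an `axis` argument, as required. The name `score_range` and the toy score table are made up for this example, and applying the formats column-wise to a NumPy array is only assumed to mirror what happens to the scores dataframe.

```python
import numpy as np


def score_range(scores, axis=0):
    """Difference between the largest and smallest score along `axis`."""
    return np.max(scores, axis=axis) - np.min(scores, axis=axis)


# One row per replication, one column per metric (toy numbers)
scores = np.array([
    [1.0, 0.2],
    [3.0, 0.4],
    [2.0, 0.9],
])

formats = [np.mean, np.median, np.std, score_range]
# Summarize over replications, i.e. along axis 0
summary = {fmt.__name__: fmt(scores, axis=0) for fmt in formats}
```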



In [10]:
from justcause.evaluation import evaluate_ite

result = evaluate_ite(replications, methods, metrics, train_size=train_size, random_state=random_state)

In [11]:
result

Unnamed: 0,method,train,pehe_score-mean,pehe_score-median,pehe_score-std,mean_absolute-mean,mean_absolute-median,mean_absolute-std
0,basic_slearner,True,5.633659795888926,2.623297102872905,8.362124759175456,0.7324426200135632,0.2381850431319927,1.4932757697867245
1,basic_slearner,False,5.633659795888926,2.623297102872905,8.362124759175456,0.7324426200135632,0.2381850431319927,1.4932757697867245
2,weighted_slearner,True,5.592355721307152,2.569471750747637,8.248291408843441,0.3699388743475459,0.2124273147540643,0.5243953093782159
3,weighted_slearner,False,5.592355721307152,2.569471750747637,8.248291408843441,0.3699388743475459,0.2124273147540643,0.5243953093782159


### Adding standard causal methods to the mix
Within `justcause.learners` we've implemented a couple of standard methods that provide a `predict_ite()` method. Instead of going the tedious way like we've done in `weighted_slearner` above, we can just use these methods directly. The default implementation will use a default base learner for all the meta-learners, fit the method on train and predict the ITEs for train and test. 

By doing so, we can get rid of the `basic_slearner` method above, because it just uses the default setting and procedure for fitting the model. Instead, we just use `SLearner(LinearRegression())`. 
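
Conceptually, handing a learner object to the evaluation is equivalent to wrapping it in a callable like the ones written by hand above. The sketch below shows that idea with stand-ins only: `as_callable`, `MeanDifferenceLearner`, and the `SimpleNamespace` frames mimicking the `.np.X`, `.np.t`, `.np.y` access are all hypothetical and not part of `justcause`.

```python
from types import SimpleNamespace

import numpy as np


class MeanDifferenceLearner:
    """Toy learner predicting the same ITE for everyone (difference in means)."""

    def fit(self, X, t, y):
        self.effect_ = y[t == 1].mean() - y[t == 0].mean()
        return self

    def predict_ite(self, X, t=None, y=None):
        return np.full(X.shape[0], self.effect_)


def as_callable(learner):
    """Turn a learner with fit/predict_ite into a (train, test) -> ITEs function."""

    def method(train, test):
        # Fit on train only, then predict ITEs for both splits
        learner.fit(train.np.X, train.np.t, train.np.y)
        return (
            learner.predict_ite(train.np.X, train.np.t, train.np.y),
            learner.predict_ite(test.np.X, test.np.t, test.np.y),
        )

    return method


def toy_frame(X, t, y):
    """Minimal stand-in for a CausalFrame exposing `.np.X`, `.np.t`, `.np.y`."""
    return SimpleNamespace(np=SimpleNamespace(X=X, t=t, y=y))


train = toy_frame(np.zeros((4, 2)), np.array([1, 1, 0, 0]), np.array([3.0, 5.0, 1.0, 1.0]))
test = toy_frame(np.zeros((2, 2)), np.array([1, 0]), np.array([4.0, 1.0]))

train_ite, test_ite = as_callable(MeanDifferenceLearner())(train, test)
```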

In [None]:
from justcause.learners import TLearner, XLearner, RLearner

# All in standard configuration
methods = [SLearner(LinearRegression()), weighted_slearner, TLearner(), XLearner(), RLearner(LinearRegression())]

result = evaluate_ite(replications, methods, metrics, train_size=train_size, random_state=random_state)

In [None]:
result