# Running custom and standard evaluation with `justcause`

In this notebook, we demonstrate how to use `justcause` to evaluate methods on reference datasets. For simplicity, we use only one dataset but show how evaluation works with multiple methods, covering both the standard causal methods implemented in the framework and custom methods.


## Custom First
The goal of the `justcause` framework is to be a modular and flexible facilitator of causal evaluation.

In [1]:
%load_ext autoreload

%autoreload 2

# Loading all required packages 
import itertools
import numpy as np
from sklearn.model_selection import train_test_split

from justcause.data import Col
from justcause.data.sets import load_ihdp
from justcause.learners import SLearner
from justcause.metrics import pehe_score, mean_absolute
from justcause.evaluation import setup_result_df, setup_scores_df, calc_scores, \
    summarize_scores

from sklearn.linear_model import LinearRegression



### Setup data and methods you want to evaluate

In [2]:
data = load_ihdp()
metrics = [pehe_score, mean_absolute]

# Limit evaluation to the first 100 replications of IHDP
replications = list(itertools.islice(data, 100))
train_size = 0.8
random_state = 42

def slearner_eval(train, test):
    """
    Custom method that takes 'train' and 'test' CausalFrames (see causal_frames.ipynb)
    and returns ITE predictions for both after training on 'train'. 
    
    Implement your own method in a similar fashion to evaluate them within the framework!
    """
    train_X, train_t, train_y = train.np.X, train.np.t, train.np.y
    test_X, test_t, test_y = test.np.X, test.np.t, test.np.y

    slearner = SLearner(LinearRegression())
    slearner.fit(train_X, train_t, train_y)
    return (
        slearner.predict_ite(train_X, train_t, train_y),
        slearner.predict_ite(test_X, test_t, test_y)
    )
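
To illustrate the contract a custom method has to fulfil, here is a minimal sketch of a second, deliberately naive method that predicts the same ITE for every unit (the difference in mean outcomes between treated and control). The name `naive_ate_eval` and the `make_frame` stand-in are hypothetical and only mimic the `.np.t` / `.np.y` attributes used above; they are not part of `justcause`.

```python
from types import SimpleNamespace


def naive_ate_eval(train, test):
    """Predict one constant ITE for every unit: the difference in mean outcomes."""
    t, y = train.np.t, train.np.y
    treated = [yi for ti, yi in zip(t, y) if ti == 1]
    control = [yi for ti, yi in zip(t, y) if ti == 0]
    ate = sum(treated) / len(treated) - sum(control) / len(control)
    # Same signature and return shape as slearner_eval:
    # one prediction per unit, for train and test respectively
    return [ate] * len(train.np.t), [ate] * len(test.np.t)


def make_frame(t, y):
    """Hypothetical stand-in exposing only the attributes this sketch needs."""
    return SimpleNamespace(np=SimpleNamespace(t=t, y=y))


train = make_frame(t=[1, 1, 0, 0], y=[5.0, 7.0, 1.0, 3.0])
test = make_frame(t=[1, 0], y=[6.0, 2.0])
train_ite, test_ite = naive_ate_eval(train, test)
```

With real data, returning NumPy arrays instead of lists is probably the safer choice, since the metrics operate on arrays.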

### Example Evaluation Loop
Given a callable like `slearner_eval`, we can evaluate that method using multiple metrics on the given replications.
The result dataframe then contains two rows with the scores summarized over all replications, one for train and one for test.
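
As a rough sketch of what is being computed, the following plain-Python version assumes that `pehe_score` is the root mean squared error between true and predicted ITEs and that `mean_absolute` is the absolute error of the average effect, which are the usual definitions. The function names below are hypothetical and not the `justcause` API, and whether the library uses the population or the sample standard deviation is also an assumption here.

```python
import math
import statistics


def pehe(true_ite, pred_ite):
    """Root of the mean squared error between true and predicted ITEs."""
    squared = [(p - t) ** 2 for t, p in zip(true_ite, pred_ite)]
    return math.sqrt(sum(squared) / len(squared))


def mean_abs_ate_error(true_ite, pred_ite):
    """Absolute difference between the true and the predicted average effect."""
    return abs(statistics.fmean(true_ite) - statistics.fmean(pred_ite))


def summarize(scores):
    """Collapse one score per replication into mean, median and std."""
    return {
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "std": statistics.pstdev(scores),  # population std; an assumption
    }


# Two toy "replications": (true ITEs, predicted ITEs)
replication_data = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),  # perfect predictions
    ([0.0, 0.0], [3.0, -3.0]),  # individual errors cancel in the average
]
pehe_per_rep = [pehe(t, p) for t, p in replication_data]
ate_err_per_rep = [mean_abs_ate_error(t, p) for t, p in replication_data]
summary = summarize(pehe_per_rep)
```

The second toy replication shows why both metrics are worth reporting: the average effect is recovered exactly even though every individual prediction is off by 3.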

In [6]:
results_df = setup_result_df(metrics)
    
test_scores = setup_scores_df(metrics)
train_scores = setup_scores_df(metrics)

for rep in replications:

    train, test = train_test_split(
        rep, train_size=train_size, random_state=random_state
    )

    # REPLACE this with the function you implemented and want to evaluate
    train_ite, test_ite = slearner_eval(train, test)

    # Calculate the scores and append them to a dataframe
    test_scores.loc[len(test_scores)] = calc_scores(test[Col.ite],
                                                    test_ite,
                                                    metrics)

    train_scores.loc[len(train_scores)] = calc_scores(train[Col.ite],
                                                    train_ite,
                                                    metrics)

# Summarize the scores and save them in a dataframe
results_df.loc[len(results_df)] = np.append(['slearner', True], summarize_scores(train_scores))
results_df.loc[len(results_df)] = np.append(['slearner', False], summarize_scores(test_scores))

With only one dataset and one method, this offers hardly any advantage over implementing the evaluation manually. However, the loop extends easily to more methods by iterating over the callables.

In [3]:
def another_eval(train, test):
    """ Nonsense weighted SLearner evaluation (all weights are 1) """
    train_X, train_t, train_y = train.np.X, train.np.t, train.np.y
    test_X, test_t, test_y = test.np.X, test.np.t, test.np.y

    slearner = SLearner(LinearRegression())
    slearner.fit(train_X, train_t, train_y, weights=np.full(len(train_t), 1))
    return (
        slearner.predict_ite(train_X, train_t, train_y),
        slearner.predict_ite(test_X, test_t, test_y)
    )

methods = [another_eval, slearner_eval]

results_df = setup_result_df(metrics)

for method in methods:
    
    test_scores = setup_scores_df(metrics)
    train_scores = setup_scores_df(metrics)

    for rep in replications:

        train, test = train_test_split(
            rep, train_size=train_size, random_state=random_state
        )

        # Call the method currently under evaluation
        train_ite, test_ite = method(train, test)

        # Calculate the scores and append them to a dataframe
        test_scores.loc[len(test_scores)] = calc_scores(test[Col.ite],
                                                        test_ite,
                                                        metrics)

        train_scores.loc[len(train_scores)] = calc_scores(train[Col.ite],
                                                        train_ite,
                                                        metrics)

    # Summarize the scores and save them in a dataframe
    results_df.loc[len(results_df)] = np.append([method.__name__, True], summarize_scores(train_scores))
    results_df.loc[len(results_df)] = np.append([method.__name__, False], summarize_scores(test_scores))

In [4]:
results_df

| | method | train | pehe_score-mean | pehe_score-median | pehe_score-std | mean_absolute-mean | mean_absolute-median | mean_absolute-std |
|---|---|---|---|---|---|---|---|---|
| 0 | another_eval | True | 5.633659795888926 | 2.623297102872905 | 8.362124759175456 | 0.7324426200135632 | 0.2381850431319927 | 1.4932757697867245 |
| 1 | another_eval | False | 5.625971000721637 | 2.6359926738390502 | 8.213625971533043 | 1.2926681149657069 | 0.3962455718526638 | 2.474603428686128 |
| 2 | slearner_eval | True | 5.633659795888926 | 2.623297102872905 | 8.362124759175456 | 0.7324426200135632 | 0.2381850431319927 | 1.4932757697867245 |
| 3 | slearner_eval | False | 5.625971000721637 | 2.6359926738390502 | 8.213625971533043 | 1.2926681149657069 | 0.3962455718526638 | 2.474603428686128 |


Because in most cases nothing within this loop changes except the way `train_ite` and `test_ite` are calculated from `train` and `test`, `justcause` provides a default implementation of that loop.

In [12]:
from justcause.evaluation import evaluate_ite

result = evaluate_ite(replications, methods, metrics, train_size=train_size, random_state=random_state)


In [9]:
result

| | method | train | pehe_score-mean | pehe_score-median | pehe_score-std | mean_absolute-mean | mean_absolute-median | mean_absolute-std |
|---|---|---|---|---|---|---|---|---|
| 0 | another_eval | True | 5.633659795888926 | 2.623297102872905 | 8.362124759175456 | 0.7324426200135632 | 0.2381850431319927 | 1.4932757697867245 |
| 1 | another_eval | False | 5.633659795888926 | 2.623297102872905 | 8.362124759175456 | 0.7324426200135632 | 0.2381850431319927 | 1.4932757697867245 |
| 2 | slearner_eval | True | 5.633659795888926 | 2.623297102872905 | 8.362124759175456 | 0.7324426200135632 | 0.2381850431319927 | 1.4932757697867245 |
| 3 | slearner_eval | False | 5.633659795888926 | 2.623297102872905 | 8.362124759175456 | 0.7324426200135632 | 0.2381850431319927 | 1.4932757697867245 |


### Adding standard causal methods to the mix
Within `justcause.learners`, we've implemented several standard methods that provide a `predict_ite()` method. Instead of wrapping them by hand as we did in `slearner_eval` above, we can pass these learners directly.
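
How an evaluation loop can accept both plain functions and learner objects may be sketched as follows. This is not the actual `evaluate_ite` implementation; `as_eval_callable`, `method_name`, `FixedEffectLearner` and `frame` are hypothetical names, and the learner interface (`fit` plus `predict_ite`) is assumed from the usage in `slearner_eval`.

```python
from types import SimpleNamespace


def as_eval_callable(method):
    """Return a (train, test) -> (train_ite, test_ite) callable."""
    if hasattr(method, "fit") and hasattr(method, "predict_ite"):
        def run(train, test):
            # Same steps that slearner_eval spells out by hand
            method.fit(train.np.X, train.np.t, train.np.y)
            return (
                method.predict_ite(train.np.X, train.np.t, train.np.y),
                method.predict_ite(test.np.X, test.np.t, test.np.y),
            )
        return run
    return method  # already a plain evaluation function


def method_name(method):
    """Functions carry __name__; learner instances fall back to str()."""
    return getattr(method, "__name__", str(method))


class FixedEffectLearner:
    """Toy learner (hypothetical) that always predicts the same effect."""

    def __init__(self, effect):
        self.effect = effect

    def fit(self, x, t, y):
        pass  # nothing to learn in this toy example

    def predict_ite(self, x, t=None, y=None):
        return [self.effect] * len(x)

    def __str__(self):
        return f"FixedEffectLearner(effect={self.effect})"


def frame(x, t, y):
    """Stand-in exposing only the .np.X / .np.t / .np.y attributes."""
    return SimpleNamespace(np=SimpleNamespace(X=x, t=t, y=y))


train = frame([[0.1], [0.2], [0.3]], [1, 0, 1], [5.0, 1.0, 7.0])
test = frame([[0.4], [0.5]], [0, 1], [2.0, 6.0])

learner = FixedEffectLearner(effect=4.0)
train_ite, test_ite = as_eval_callable(learner)(train, test)
name = method_name(learner)
```

Resolving the display name with a fallback matters because learner instances, unlike functions, have no `__name__` attribute.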

In [18]:
from justcause.learners import TLearner, XLearner, RLearner

# All in standard configuration
methods = [another_eval, slearner_eval, TLearner(), XLearner(), RLearner(LinearRegression())]

result = evaluate_ite(replications, methods, metrics, train_size=train_size, random_state=random_state)

In [19]:
result

| | method | train | pehe_score-mean | pehe_score-median | pehe_score-std | mean_absolute-mean | mean_absolute-median | mean_absolute-std |
|---|---|---|---|---|---|---|---|---|
| 0 | another_eval | True | 5.633659795888926 | 2.623297102872905 | 8.362124759175456 | 0.7324426200135632 | 0.2381850431319927 | 1.4932757697867245 |
| 1 | another_eval | False | 5.633659795888926 | 2.623297102872905 | 8.362124759175456 | 0.7324426200135632 | 0.2381850431319927 | 1.4932757697867245 |
| 2 | slearner_eval | True | 5.633659795888926 | 2.623297102872905 | 8.362124759175456 | 0.7324426200135632 | 0.2381850431319927 | 1.4932757697867245 |
| 3 | slearner_eval | False | 5.633659795888926 | 2.623297102872905 | 8.362124759175456 | 0.7324426200135632 | 0.2381850431319927 | 1.4932757697867245 |
| 4 | TLearner(control=LassoLars, treated=LassoLars) | True | 5.5726257778279535 | 2.5437982727262867 | 8.213573470353799 | 0.2931874178151998 | 0.166370039035391 | 0.4280283924070575 |
| 5 | TLearner(control=LassoLars, treated=LassoLars) | False | 5.5726257778279535 | 2.5437982727262867 | 8.213573470353799 | 0.2931874178151998 | 0.166370039035391 | 0.4280283924070575 |
| 6 | XLearner(outcome_c=LassoLars, outcome_t=LassoL... | True | 5.579284802381838 | 2.5437982727262867 | 8.24060645800564 | 0.2896989907685176 | 0.1663700390353935 | 0.427007847078984 |
| 7 | XLearner(outcome_c=LassoLars, outcome_t=LassoL... | False | 5.579284802381838 | 2.5437982727262867 | 8.24060645800564 | 0.2896989907685176 | 0.1663700390353935 | 0.427007847078984 |
| 8 | RLearner(outcome=LinearRegression, effect=Line... | True | 2.556630302290813 | 1.227161430077221 | 3.749933191978111 | 0.258468768775593 | 0.1798698811956178 | 0.3068464815378742 |
| 9 | RLearner(outcome=LinearRegression, effect=Line... | False | 2.556630302290813 | 1.227161430077221 | 3.749933191978111 | 0.258468768775593 | 0.1798698811956178 | 0.3068464815378742 |
