# BBC dataset
**Caution:** Many data valuation methods need to train a large number of models to produce reliable estimates. When each model is a DistilBERT classifier, as in this notebook, **this is extremely inefficient**. We recommend using embeddings instead.
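
To make the recommendation concrete, here is a minimal sketch of the embedding pattern: run the expensive encoder once, then train many cheap models on the fixed embeddings. Everything in it is an assumption for illustration. `frozen_encoder` is a hypothetical stand-in (a fixed random projection), not a real language model, and the toy data and the least-squares head are not part of `opendataval`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an expensive frozen text encoder: a fixed random
# projection followed by tanh. A real pipeline would call a pretrained model.
PROJECTION = rng.normal(size=(50, 8))
encoder_calls = 0


def frozen_encoder(raw_features):
    """Map raw features to 8-dimensional embeddings and count the calls."""
    global encoder_calls
    encoder_calls += 1
    return np.tanh(raw_features @ PROJECTION)


def fit_linear_head(x, y, n_classes):
    """Fit a cheap least-squares classifier head on fixed embeddings."""
    targets = np.eye(n_classes)[y]  # one-hot targets
    weights, *_ = np.linalg.lstsq(x, targets, rcond=None)
    return weights


# Toy data (assumption): 200 points, 50 raw features, 5 classes.
raw = rng.normal(size=(200, 50))
labels = rng.integers(0, 5, size=200)

# The expensive step runs exactly once.
embeddings = frozen_encoder(raw)

# The repeated step (one model per subset) only touches the small embeddings.
heads = []
for _ in range(10):
    subset = rng.choice(len(raw), size=100, replace=False)
    heads.append(fit_linear_head(embeddings[subset], labels[subset], n_classes=5))

print(encoder_calls, len(heads))
```

The point of the design is that the cost of the encoder is paid once, while the per-model cost shrinks to fitting a small head.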

In [1]:
# Imports
import numpy as np
import pandas as pd
import torch

# Opendataval
from opendataval.dataloader import Register, DataFetcher, mix_labels, add_gauss_noise
from opendataval.dataval.random import RandomEvaluator
from opendataval.dataval.margcontrib import LeaveOneOut, DataShapley, BetaShapley
from opendataval.dataval.influence import InfluenceFunctionEval
from opendataval.dataval.dvrl import DVRL
from opendataval.dataval.knnshap import KNNShapley
from opendataval.dataval.margcontrib.banzhaf import DataBanzhaf
from opendataval.dataval.ame import AME
from opendataval.dataval.oob import DataOob

from opendataval.experiment import ExperimentMediator

## [Step 1] Set up an environment
`ExperimentMediator` is the central object for setting up an `opendataval` experiment. It lets users configure the experiment's hyperparameters, including a dataset, a type of synthetic noise, and a prediction model. Once it is set up, users can compute data values with several data valuation algorithms through the same object.

The following code cell demonstrates how to set up `ExperimentMediator` with a pre-registered dataset and a prediction model.
- Dataset: bbc
- Model: `DistilBertModel` from the `transformers` library (selected below with `model_name = "BertClassifier"`)
- Metric: Classification accuracy

In [2]:
dataset_name = "bbc" 
train_count, valid_count, test_count = 1000, 100, 500
noise_rate = 0.1
noise_kwargs = {'noise_rate': noise_rate}
model_name = "BertClassifier"
metric_name = "accuracy"
train_kwargs = {"epochs": 2, "batch_size": 50}
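# Use the GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")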

exper_med = ExperimentMediator.model_factory_setup(
    dataset_name=dataset_name,
    cache_dir="../data_files/",  
    force_download=False,
    train_count=train_count,
    valid_count=valid_count,
    test_count=test_count,
    add_noise=mix_labels,
    noise_kwargs=noise_kwargs,
    train_kwargs=train_kwargs,
    device=device,
    model_name=model_name,
    metric_name=metric_name
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Base line model metric_name='accuracy': perf=0.8939999938011169
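
The setup above passes `mix_labels` with `noise_rate = 0.1` as the synthetic noise. As a rough illustration of what label noise at that rate means, the sketch below replaces the labels of 10% of the points with a different class. It is an assumption-laden stand-in, not `opendataval`'s `mix_labels` implementation; `flip_labels` and the toy labels are hypothetical names introduced here.

```python
import numpy as np


def flip_labels(labels, noise_rate, n_classes, seed=0):
    """Return a copy of `labels` with a `noise_rate` fraction changed to a different class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_noisy = int(round(noise_rate * len(labels)))
    noisy_idx = rng.choice(len(labels), size=n_noisy, replace=False)
    # An offset in [1, n_classes - 1] taken modulo n_classes never maps a
    # label back to itself, so every selected label really changes.
    offsets = rng.integers(1, n_classes, size=n_noisy)
    noisy = labels.copy()
    noisy[noisy_idx] = (labels[noisy_idx] + offsets) % n_classes
    return noisy, noisy_idx


# Toy labels (assumption): 1000 training points, 5 classes.
clean = np.random.default_rng(1).integers(0, 5, size=1000)
noisy, noisy_idx = flip_labels(clean, noise_rate=0.1, n_classes=5)

print((noisy != clean).sum())  # 100
```

Keeping `noisy_idx` around is useful later: a good data valuation method should tend to assign low values to exactly those points.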


## [Step 2] Compute data values
`opendataval` provides various state-of-the-art data valuation algorithms. `ExperimentMediator.compute_data_values()` takes a list of data evaluators and computes data values with each of them.

In [3]:
data_evaluators = [ 
    RandomEvaluator(),
#     LeaveOneOut(), # leave one out ## slow
    InfluenceFunctionEval(num_models=10), # influence function
#     DVRL(rl_epochs=10), # Data valuation using Reinforcement Learning ## inappropriate
#     KNNShapley(k_neighbors=valid_count), # KNN-Shapley ## inappropriate
#     DataShapley(gr_threshold=1.05, mc_epochs=300, cache_name="cached"), # Data-Shapley ## slow
#     BetaShapley(gr_threshold=1.05, mc_epochs=300, cache_name="cached"), # Beta-Shapley ## slow
    DataBanzhaf(num_models=10), # Data-Banzhaf
    AME(num_models=10), # Average Marginal Effects
    DataOob(num_models=10) # Data-OOB
]
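
Of the evaluators in this list, Data-OOB is perhaps the easiest to sketch by hand. As we understand the idea, it trains models on bootstrap samples and scores each training point by how often the models that did not see it (its out-of-bag models) predict its label correctly. The code below is a simplified sketch under that reading, not the `DataOob` implementation: the nearest-centroid classifier, the toy blobs, and the names `nearest_centroid_predict` and `oob_values` are all assumptions made for illustration.

```python
import numpy as np


def nearest_centroid_predict(x_train, y_train, x_query):
    """Predict the class whose training centroid is closest to each query point."""
    classes = np.unique(y_train)
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(x_query[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


def oob_values(x, y, num_models=10, seed=0):
    """Fraction of out-of-bag models that predict each point's label correctly."""
    rng = np.random.default_rng(seed)
    n = len(x)
    correct = np.zeros(n)
    counts = np.zeros(n)
    for _ in range(num_models):
        boot = rng.integers(0, n, size=n)        # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), boot)   # points left out of this sample
        if oob.size == 0:
            continue
        preds = nearest_centroid_predict(x[boot], y[boot], x[oob])
        correct[oob] += preds == y[oob]
        counts[oob] += 1
    # Points that were never out-of-bag get NaN instead of a value.
    return np.divide(correct, counts, out=np.full(n, np.nan), where=counts > 0)


# Toy data (assumption): two well-separated 2-D blobs, 5 labels flipped.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-3, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
flipped = np.arange(5)
y[flipped] = 1

values = oob_values(x, y, num_models=50)
print(np.nanmean(values[flipped]), np.nanmean(values[5:]))
```

On this toy data the flipped points receive clearly lower values than the clean ones, which is the behavior that makes such scores useful for finding noisy labels.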

In [4]:
%%time
# compute data values.
## Training multiple DistilBERT models is extremely slow. We recommend using embeddings.
exper_med = exper_med.compute_data_values(data_evaluators=data_evaluators)

Elapsed time RandomEvaluator(): 0:00:00.026010


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:18<00:00,  1.88s/it]


Elapsed time InfluenceFunctionEval(num_models=10): 0:00:18.837721


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:14<00:00,  1.41s/it]


Elapsed time DataBanzhaf(num_models=10): 0:00:14.123893


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:06<00:00,  1.52it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00,  1.17s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:16<00:00,  1.65s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:21<00:00,  2.16s/it]


Elapsed time AME(num_models=10): 0:00:56.581227


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:30<00:00,  3.00s/it]

Elapsed time DataOob(num_models=10): 0:00:30.067738
CPU times: user 1min 50s, sys: 9.07 s, total: 1min 59s
Wall time: 1min 59s





## [Step 3] Store data values

In [5]:
from opendataval.experiment.exper_methods import save_dataval

# Saving the results
output_dir = f"../tmp/{dataset_name}_{noise_rate=}/"
exper_med.set_output_directory(output_dir)
output_dir

'../tmp/bbc_noise_rate=0.1/'

In [6]:
exper_med.evaluate(save_dataval, save_output=True)

| | indices | data_values |
|---|---|---|
| RandomEvaluator() | 102 | 0.666932 |
| RandomEvaluator() | 1934 | 0.861419 |
| RandomEvaluator() | 1380 | 0.566374 |
| RandomEvaluator() | 569 | 0.252284 |
| RandomEvaluator() | 1428 | 0.216644 |
| ... | ... | ... |
| DataOob(num_models=10) | 1482 | 1.0 |
| DataOob(num_models=10) | 1945 | 0.75 |
| DataOob(num_models=10) | 2151 | 1.0 |
| DataOob(num_models=10) | 813 | 1.0 |
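
A common next step is to look at the lowest-valued point per evaluator, since low values may point to noisy or unhelpful examples. The sketch below does this with `pandas` on the nine rows visible in the output above; the elided rows are not reconstructed. The column name `evaluator` is a hypothetical label for the unnamed first column, and the frame is built by hand rather than loaded from the saved file.

```python
import pandas as pd

# Rows copied from the displayed output; "evaluator" is a hypothetical name
# for the unnamed first column.
df = pd.DataFrame(
    {
        "evaluator": ["RandomEvaluator()"] * 5 + ["DataOob(num_models=10)"] * 4,
        "indices": [102, 1934, 1380, 569, 1428, 1482, 1945, 2151, 813],
        "data_values": [
            0.666932, 0.861419, 0.566374, 0.252284, 0.216644,
            1.0, 0.75, 1.0, 1.0,
        ],
    }
)

# For each evaluator, keep the row with the smallest data value.
lowest = df.loc[df.groupby("evaluator")["data_values"].idxmin()]
print(lowest)
```

Among the visible rows, this selects index 1945 for `DataOob(num_models=10)` and index 1428 for `RandomEvaluator()`. The random baseline's ranking carries no information, so only the former is a meaningful candidate for inspection.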
