# Local notebook

This notebook calls `evaluate_on_task`, which is defined elsewhere in our codebase (the `src.openai_bb` module).

## BIG-Bench Metrics

Each task can report multiple metrics but has one preferred metric, which is used for aggregate scores. The full list of metrics available for JSON tasks is in the [BIG-bench repo](https://github.com/google/BIG-bench/blob/main/docs/doc.md#available-metrics). Programmatic tasks can define their own metrics. The main JSON metrics are:

Text-to-text:
- `exact_str_match`
- `bleu`
- `bleurt`: a learned metric that uses a BERT-based model to judge similarity to the reference
- `rouge`

Multiple-choice:
- `multiple_choice_grade`: A weighted multiple-choice accuracy between 0 and 100, where a set of targets and a score for each potential target are specified. This reduces to standard multiple-choice accuracy when a single target is assigned a score of 1 and the rest score 0. Note that the results below report it as a fraction between 0 and 1.
- `expected_calibration_error`: A measure of a model’s calibration, i.e. how well the model’s accuracy matches the probability it assigns to a response. `expected_calibration_error` is the absolute deviation between the assigned probability and the average accuracy, after binning examples by assigned probability (Naeini et al., 2015).
- `multiple_choice_brier_score`: A measure of calibration given as the squared error between model-assigned probabilities and 0/1 targets across classes (Brier, 1950). It appears as `calibration_multiple_choice_brier_score` in the results below.
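
To make the three multiple-choice definitions above concrete, here is a minimal sketch in plain Python. It is not the BIG-bench implementation: the function names, the choice of 10 equal-width bins, and averaging the Brier score over classes are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def _argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def multiple_choice_grade(probs: Sequence[Sequence[float]],
                          target_scores: Sequence[Sequence[float]]) -> float:
    """Mean target score of the highest-probability choice, as a fraction."""
    picked = [scores[_argmax(p)] for p, scores in zip(probs, target_scores)]
    return sum(picked) / len(picked)


def brier_score(probs: Sequence[Sequence[float]],
                target_scores: Sequence[Sequence[float]]) -> float:
    """Squared error between probabilities and 0/1 targets.

    Assumption: averaged over classes, then over examples.
    """
    per_example = [
        sum((pi - si) ** 2 for pi, si in zip(p, scores)) / len(p)
        for p, scores in zip(probs, target_scores)
    ]
    return sum(per_example) / len(per_example)


def expected_calibration_error(probs: Sequence[Sequence[float]],
                               target_scores: Sequence[Sequence[float]],
                               n_bins: int = 10) -> float:
    """Size-weighted mean of |confidence - accuracy| over confidence bins."""
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for p, scores in zip(probs, target_scores):
        choice = _argmax(p)
        confidence = p[choice]
        # Assumption: equal-width bins on [0, 1]; confidence 1.0 goes in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, float(scores[choice])))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += len(members) / total * abs(mean_confidence - accuracy)
    return ece


# Toy data (made up): four binary examples, one row of probabilities each.
example_probs = [[0.85, 0.15], [0.65, 0.35], [0.25, 0.75], [0.95, 0.05]]
example_targets = [[1, 0], [0, 1], [0, 1], [1, 0]]

print(multiple_choice_grade(example_probs, example_targets))
print(brier_score(example_probs, example_targets))
print(expected_calibration_error(example_probs, example_targets))
```

On this toy data the model picks the correct choice in three of four examples, so the grade is 0.75; the second example, where it is confidently wrong, dominates both calibration metrics.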

## Running

In [3]:
from src.openai_bb import evaluate_on_task

ModuleNotFoundError: No module named 'bigbench'

In [None]:
results_data = evaluate_on_task(task_name='analytic_entailment', model_name='text-curie-001', shots_list=[3])

--------------------------------------------------------------------------------
evaluating text-curie-001...
evaluating analytic_entailment for 3 shots...
results:
{'calibration_multiple_choice_brier_score': 0.49853687789680134,
 'expected_calibration_error': 0.49579620574939887,
 'multiple_choice_grade': 0.4857142857142857,
 'normalized_aggregate_score': -2.857142857142858}
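
The negative `normalized_aggregate_score` above is consistent with a min-max rescaling of the preferred metric, where a task-specific low score maps to 0 and a high score maps to 100. The sketch below reproduces the number under that reading. The helper `normalize_score` is hypothetical, and the low score of 0.5 (chance on a two-choice task) and high score of 1.0 are assumptions inferred from the output, not values read from `evaluate_on_task`.

```python
def normalize_score(raw: float, low: float, high: float) -> float:
    """Rescale `raw` so that `low` maps to 0 and `high` maps to 100."""
    if high == low:
        raise ValueError("high and low scores must differ")
    return 100.0 * (raw - low) / (high - low)


# multiple_choice_grade reported for analytic_entailment above.
raw_grade = 0.4857142857142857

# Assumed bounds: 0.5 = random guessing between two choices, 1.0 = perfect.
print(normalize_score(raw_grade, low=0.5, high=1.0))
```

Under these assumptions the result is about -2.857, meaning the model scored slightly below chance on this task.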


In [None]:
results_data = evaluate_on_task(task_name='emoji_movie', model_name='text-curie-001', shots_list=[3])

--------------------------------------------------------------------------------
evaluating text-curie-001...
evaluating emoji_movie for 3 shots...
results:
{'bleu': 10.839884430478508,
 'calibration_multiple_choice_brier_score': 0.22352366663024548,
 'exact_str_match': 0.08,
 'expected_calibration_error': 0.4583487377784708,
 'multiple_choice_grade': 0.2,
 'normalized_aggregate_score': 4.857225732735058e-14,
 'rouge1': 19.274314574314577,
 'rouge2': 11.933333333333332,
 'rougeLsum': 19.346176046176044}


In [None]:
results_data = evaluate_on_task(task_name='taboo', model_name='ada', shots_list=[1])

--------------------------------------------------------------------------------
evaluating ada...
results:
{'first_response_score': -0.97,
 'full': -0.954387807683251,
 'second_response_score': 0.015612192316749042}
{'first_response_score': -0.97,
 'full': -0.954387807683251,
 'normalized_aggregate_score': 67.42686987194583,
 'second_response_score': 0.015612192316749042}
