# Motivation

BIG-Bench, one of the largest language-model benchmarks, does not ship example code for evaluating OpenAI GPT-3 models. This notebook is a barebones example of doing just that on any of the 200+ BIG-Bench tasks.

## BIG-Bench Metrics

Each task can support multiple metrics, but has one preferred metric, used for aggregate scores. The full list of available metrics for JSON tasks can be found on the [BIG-bench repo](https://github.com/google/BIG-bench/blob/main/docs/doc.md#available-metrics). Programmatic tasks can define their own metrics. The main JSON metrics are:

Text-to-text:
- `exact_str_match`
- `bleu`
- `bleurt`: uses BERT to judge similarity
- `rouge`

Multiple-choice:
- `multiple_choice_grade`: A weighted multiple-choice accuracy between 0 and 100, where a set of targets and scores for each potential target are specified. This reduces to standard multiple-choice accuracy when a single target is assigned a score of 1 and the rest score 0. (The results printed later in this notebook report it as a fraction between 0 and 1.)
- `expected_calibration_error`: A measure of a model’s calibration, i.e. how well the model’s accuracy matches the probability it assigns to a response. `expected_calibration_error` is the absolute deviation between the assigned probability and average accuracy, after binning examples in terms of assigned probability (Naeini et al., 2015).
- `multiple_choice_brier_score`: A measure of calibration given as the squared error between model-assigned probabilities and 0/1 targets across classes (Brier, 1950).
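As a rough illustration of the three multiple-choice metrics, here is a minimal pure-Python sketch. It is not BIG-bench's implementation: the function names, the 10-bin default, the averaging over classes in the Brier score, and the 0 to 1 scale are all assumptions made for this example, and the library may differ in those details.

```python
def multiple_choice_grade(probs, target_scores):
    """Average target score of the highest-probability choice per example."""
    total = 0.0
    for p, s in zip(probs, target_scores):
        best = max(range(len(p)), key=lambda i: p[i])
        total += s[best]
    return total / len(probs)


def brier_score(probs, target_scores):
    """Squared error between probabilities and 0/1 targets.

    Assumption: averaged over classes, then over examples.
    """
    total = 0.0
    for p, s in zip(probs, target_scores):
        total += sum((pi - si) ** 2 for pi, si in zip(p, s)) / len(p)
    return total / len(probs)


def expected_calibration_error(probs, target_scores, n_bins=10):
    """Bin examples by top-choice probability, then take the size-weighted
    average of |accuracy - confidence| over the non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for p, s in zip(probs, target_scores):
        best = max(range(len(p)), key=lambda i: p[i])
        confidence = p[best]
        bin_idx = min(int(confidence * n_bins), n_bins - 1)
        bins[bin_idx].append((confidence, s[best]))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        avg_acc = sum(a for _, a in items) / len(items)
        ece += len(items) / len(probs) * abs(avg_acc - avg_conf)
    return ece


# Made-up probabilities for three two-choice examples (illustration only).
probs = [[0.75, 0.25], [0.35, 0.65], [0.95, 0.05]]
target_scores = [[1, 0], [1, 0], [0, 1]]

print(multiple_choice_grade(probs, target_scores))
print(brier_score(probs, target_scores))
print(expected_calibration_error(probs, target_scores))
```

Only the first example is answered correctly here, so the grade is low while the high confidence on the two wrong answers pushes both calibration metrics up.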

## Install BIG-Bench

This takes a bit of time so you could probably go make tea and scroll through a few FTX Twitter threads.

In [2]:
!pip install git+https://github.com/google/BIG-bench.git
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/google/BIG-bench.git
  Cloning https://github.com/google/BIG-bench.git to /tmp/pip-req-build-6v_2nxjk
  Running command git clone -q https://github.com/google/BIG-bench.git /tmp/pip-req-build-6v_2nxjk
Processing //tmp/pip-req-build-6v_2nxjk/bleurt/bleurt-b610120347ef22b494b6d69b4316e303f5932516.zip
Collecting tensorflow-text>=2.6
  Downloading tensorflow_text-2.10.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.9 MB)
Collecting black>=21.6b0
  Downloading black-22.10.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)
Collecting datasets
  Downloading datasets-2.6.1-py3-none-any.whl (441 kB)
Collecting immutabledict
  Downloading immutable

## Running
You'll need to provide an [OpenAI API key](https://openai.com/blog/api-no-waitlist/) in the input field below (appears after you run the cell).

In [2]:
import os
from getpass import getpass

os.environ['OPENAI_API_KEY'] = getpass('Enter token here:') # should look like `sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX`

Enter token here:··········


In [3]:
import os
import importlib
import pprint as pp

import numpy as np
import scipy.special  # import the submodule explicitly; `import scipy` alone may not load it

import openai
import bigbench.models.model_utils as model_utils
import bigbench.api.json_task as json_task
import bigbench.api.results as results_api
import bigbench.api.util as util
import bigbench.models.huggingface_models as hf_models
from bigbench.evaluate_task import _sanitize_results
from bigbench.api.model import Model, ModelData

openai.api_key = os.getenv("OPENAI_API_KEY")

In [None]:
# from inverse-scaling-eval-pipeline
size_dict = {
    # based on https://blog.eleuther.ai/gpt3-model-sizes/
    "ada": 350_000_000,
    "babbage": 1_300_000_000,
    "curie": 6_700_000_000,
    "davinci": 175_000_000_000,
    "text-ada-001": 350_000_000,
    "text-babbage-001": 1_300_000_000,
    "text-curie-001": 6_700_000_000,
    "text-davinci-001": 175_000_000_000,
    "text-davinci-002": 175_000_000_000,
}

class OpenAIGPT3(Model):
    def __init__(self, model='ada', max_parallel=20):
        self.queries = []
        self.model = model
        self.max_parallel = max_parallel

    def generate_text(
        self, inputs, max_length=500, stop_string=None, output_regex=None,
    ):
        # Remember whether a bare string was passed, so that a one-element list
        # still gets a list back.
        single_input = isinstance(inputs, str)
        if single_input:
            inputs = [inputs]
        outputs = []

        # Send the prompts in batches of at most `max_parallel` per request.
        n_batches = int(np.ceil(len(inputs) / self.max_parallel))
        for batch_idx in range(n_batches):
            batch_inputs = inputs[
                batch_idx * self.max_parallel : (batch_idx + 1) * self.max_parallel
            ]
            batch_outputs = openai.Completion.create(
                model=self.model,
                prompt=batch_inputs,
                max_tokens=max_length,
                stop=stop_string,
                temperature=0,
            )
            for completion in batch_outputs.choices:
                outputs.append(completion.text)

        if single_input:
            outputs = outputs[0]
        
        outputs = model_utils.postprocess_output(
            outputs, max_length, stop_string, output_regex
        )
        return outputs

    def flatten_multiple_choice_examples(self, inputs, targets):
        flat_idx = []
        flat_inputs = []
        flat_choices = []
        for example_id, (example_input, choices) in enumerate(zip(inputs, targets)):
            for choice_id, choice in enumerate(choices):
                flat_idx.append((example_id, choice_id))
                flat_inputs.append(example_input)
                flat_choices.append(choice)

        return flat_idx, flat_inputs, flat_choices

    def get_target_logprobs(self, completion, target):
        '''Naive implementation of getting the logprobs of the target:
        
        To find out which tokens the target is made of, the function iteratively 
        concatenates returned tokens from the end, and compares a running 
        concatenation with the target.
        '''
        cum_sum = ''
        for i, token in enumerate(reversed(completion.logprobs['tokens'])):
            cum_sum = token + cum_sum
            if cum_sum.strip() == target.strip():
                break

        target_tokens_logprobs = completion.logprobs['token_logprobs'][-(i+1):]
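        if None in target_tokens_logprobs:
            # With echo=True the first prompt token has no logprob (None); seeing it
            # here usually means the target was not matched at the end of the prompt.
            print('Found None in target_tokens_logprobs:', target_tokens_logprobs, 'in completion:', completion)
        # Skip None entries so the sum does not raise a TypeError.
        return sum(lp for lp in target_tokens_logprobs if lp is not None)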

    def cond_log_prob(self, inputs, targets, absolute_normalization=False):

        if isinstance(targets, str):
            targets = [targets]

        # Remember whether a single example was passed, so that a one-element list
        # of inputs still gets a list of score rows back.
        single_input = isinstance(inputs, str)
        if single_input:
            inputs = [inputs]
            targets = [targets]

        flat_idx, flat_inputs, flat_choices = self.flatten_multiple_choice_examples(
            inputs=inputs, targets=targets
        )
        num_examples = len(flat_idx)
        flat_scores = []
        batch_size = self.max_parallel
        for idx in range(0, num_examples, batch_size):
            batch_inputs = flat_inputs[idx : min(idx + batch_size, num_examples)]
            batch_choices = flat_choices[idx : min(idx + batch_size, num_examples)]

            # Echo back prompt + choice with logprobs and generate nothing new.
            batch_queries = [inpt + target for inpt, target in zip(batch_inputs, batch_choices)]
            batch_outputs = openai.Completion.create(
                model=self.model,
                prompt=batch_queries,
                max_tokens=0,
                temperature=0,
                logprobs=1,
                echo=True,
            )

            for i, completion in enumerate(batch_outputs.choices):
                target_logprobs = self.get_target_logprobs(completion, batch_choices[i])
                flat_scores.append(target_logprobs)

        scores = [[] for _ in range(len(inputs))]

        for idx, score in zip(flat_idx, flat_scores):
            if score == 0:
                # all tokens were masked. Setting score to -inf.
                print('Found score identical to zero. Probably from empty target. '
                      'Setting score to -inf.')
                scores[idx[0]].append(-np.inf)
            else:
                scores[idx[0]].append(score)

        if not absolute_normalization:
            # Normalize so the probabilities over each example's choices sum to 1.
            scores = [
                list(score_row - scipy.special.logsumexp(score_row))
                for score_row in scores
            ]

        if single_input:
            scores = scores[0]

        return scores

    def model_data(self):
        # TODO: replace with correct metadata
        return ModelData(
            model_family="GPT-3",
            model_name=self.model,
            total_params=size_dict[self.model],
            non_embedding_params=size_dict[self.model], # don't know
            flop_matched_non_embedding_params=size_dict[self.model], # don't know
            training_batch_size=1, # don't know
            training_steps=100_000_000, # don't know
            decoding_params={},
            description="see https://arxiv.org/abs/2005.14165"
        )
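The two least obvious steps in the class above are the suffix matching in `get_target_logprobs` and the log-sum-exp normalization in `cond_log_prob`. The sketch below reproduces both offline in plain Python, with no API call. The token lists and logprob values are made up for illustration, the helper names `target_logprob` and `normalize` are hypothetical, and unlike the class it raises an error when the target is not found.

```python
import math


def target_logprob(tokens, token_logprobs, target):
    """Sum the logprobs of the trailing tokens whose concatenation equals `target`."""
    suffix = ''
    for n, token in enumerate(reversed(tokens), start=1):
        suffix = token + suffix
        if suffix.strip() == target.strip():
            return sum(token_logprobs[-n:])
    raise ValueError('target not found at the end of the token sequence')


def normalize(logprobs):
    """Subtract log-sum-exp so the probabilities over the choices sum to 1."""
    m = max(logprobs)
    lse = m + math.log(sum(math.exp(lp - m) for lp in logprobs))
    return [lp - lse for lp in logprobs]


# Hypothetical echoed tokens for one prompt followed by two different choices.
# The first logprob is None, as it is for the first token of an echoed prompt.
tokens_a = ['Q', ':', ' 2', '+', '2', '=?', ' A', ':', ' four']
logprobs_a = [None, -1.0, -2.0, -0.5, -0.3, -1.2, -0.7, -0.1, -0.4]
tokens_b = ['Q', ':', ' 2', '+', '2', '=?', ' A', ':', ' fi', 've']
logprobs_b = [None, -1.0, -2.0, -0.5, -0.3, -1.2, -0.7, -0.1, -2.0, -0.5]

raw = [
    target_logprob(tokens_a, logprobs_a, ' four'),
    target_logprob(tokens_b, logprobs_b, ' five'),
]
normalized = normalize(raw)
print(raw)
print(normalized)
```

Because the match walks backwards from the end, a target split into several tokens (`' fi'` + `'ve'`) is handled the same way as a single-token target.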

In [None]:
def evaluate_on_task(task_name, model_name, huggface=False, max_examples=None, shots_list=[0,1,2,3]):
    '''Also supports gpt2 models from huggingface.'''

    task_module_name = f"bigbench.benchmark_tasks.{task_name}"
    task_module = importlib.import_module(task_module_name)
    task_submodule_name = f"{task_module_name}.task"

    module_path = list(task_module.__path__)[0]
    json_path = os.path.join(module_path, "task.json")

    if os.path.exists(json_path):
        task = json_task.JsonTask(
            json_path,
            max_examples=max_examples,
            shot_list=list(map(int, shots_list)),
        )
    else:
        task = util.load_programmatic_task(task_submodule_name)

    model = None
    if huggface:
        model = hf_models.BIGBenchHFModel(
                model_name=model_name,
                max_length=1000,
                show_progress=False,
        )
    else:
        model = OpenAIGPT3(model_name)

    print("-" * 80)
    print(f"evaluating {model_name}...")

    results = task.evaluate_model(model, max_examples=max_examples)

    if isinstance(results, list):
        results_list = results
    else:
        results_list = [results]

    results_list = _sanitize_results(scores=results_list)
    results_list = results_api.add_aggregate_scores(
        task_name=task_name, scores=results_list
    )

    print("results:")
    for r in results_list:
        pp.pprint(r.score_dict)


    return results_list

In [None]:
results_data = evaluate_on_task(task_name='analytic_entailment', model_name='text-curie-001', shots_list=[3])

--------------------------------------------------------------------------------
evaluating text-curie-001...
evaluating analytic_entailment for 3 shots...
results:
{'calibration_multiple_choice_brier_score': 0.49853687789680134,
 'expected_calibration_error': 0.49579620574939887,
 'multiple_choice_grade': 0.4857142857142857,
 'normalized_aggregate_score': -2.857142857142858}
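The negative `normalized_aggregate_score` above is consistent with rescaling the preferred metric so that a baseline score maps to 0 and a perfect score maps to 100. A minimal sketch of that arithmetic follows; the helper name is hypothetical, and the bounds `low=0.5` (chance level for a two-way choice) and `high=1.0` are assumptions inferred from the printed numbers rather than read from the task definition.

```python
def normalized_score(raw, low, high):
    """Rescale a raw score so that `low` maps to 0 and `high` maps to 100."""
    return 100 * (raw - low) / (high - low)


# multiple_choice_grade from the output above; the bounds are assumed values.
score = normalized_score(0.4857142857142857, low=0.5, high=1.0)
print(score)
```

Under these assumptions a model slightly below chance lands slightly below zero, which matches the sign and magnitude of the reported value.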


In [None]:
results_data = evaluate_on_task(task_name='emoji_movie', model_name='text-curie-001', shots_list=[3])

--------------------------------------------------------------------------------
evaluating text-curie-001...
evaluating emoji_movie for 3 shots...
results:
{'bleu': 10.839884430478508,
 'calibration_multiple_choice_brier_score': 0.22352366663024548,
 'exact_str_match': 0.08,
 'expected_calibration_error': 0.4583487377784708,
 'multiple_choice_grade': 0.2,
 'normalized_aggregate_score': 4.857225732735058e-14,
 'rouge1': 19.274314574314577,
 'rouge2': 11.933333333333332,
 'rougeLsum': 19.346176046176044}


In [None]:
results_data = evaluate_on_task(task_name='taboo', model_name='ada', shots_list=[1])

--------------------------------------------------------------------------------
evaluating ada...
results:
{'first_response_score': -0.97,
 'full': -0.954387807683251,
 'second_response_score': 0.015612192316749042}
{'first_response_score': -0.97,
 'full': -0.954387807683251,
 'normalized_aggregate_score': 67.42686987194583,
 'second_response_score': 0.015612192316749042}
