

In [None]:
print("Installing dependencies...")
!pip install seqio-nightly

import functools
import numpy as np
import seqio
import scipy.special  # logsumexp lives in the scipy.special submodule
import tensorflow as tf

This colab demonstrates how to use `seqio.Task` and `seqio.Evaluator` to carry out model evaluation. The two central pieces of this process are metric functions and model functions. Metric functions are attached to a `seqio.Task` when it is constructed, and model functions are passed to the evaluator at evaluation time.

Note: metric functions are defined at the `Task` level, but one can still run evaluation on a `Mixture`. In that case, SeqIO runs evaluation separately on each sub-task under the mixture, i.e., it evaluates the metric functions configured in each sub-task. This differs from training on a mixture, where SeqIO loads and samples from each sub-task and produces a single dataset of mixed data.
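To make the per-sub-task behavior concrete, here is a minimal plain-Python sketch. It does not use the SeqIO API: `toy_mixture` and `evaluate_per_task` are hypothetical stand-ins, and the targets and predictions are made-up strings. It only illustrates that each sub-task is scored with its own metric functions and that results are keyed by task name.

```python
import numpy as np


def sequence_accuracy(targets, predictions):
  seq_acc = 100 * np.mean([p == t for p, t in zip(predictions, targets)])
  return {"sequence_accuracy": seq_acc}


# Hypothetical stand-in for a mixture: each sub-task carries its own
# targets, model predictions and metric functions.
toy_mixture = {
    "task_a": {
        "targets": ["x", "y"],
        "predictions": ["x", "y"],
        "metric_fns": [sequence_accuracy],
    },
    "task_b": {
        "targets": ["u", "v"],
        "predictions": ["u", "w"],
        "metric_fns": [sequence_accuracy],
    },
}


def evaluate_per_task(mixture):
  """Scores each sub-task separately; nothing is mixed across tasks."""
  results = {}
  for name, task in mixture.items():
    task_metrics = {}
    for metric_fn in task["metric_fns"]:
      task_metrics.update(metric_fn(task["targets"], task["predictions"]))
    results[name] = task_metrics
  return results


per_task = evaluate_per_task(toy_mixture)
print(per_task)
```

The result is one metrics dictionary per sub-task rather than a single pooled number.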

# s1. setup

We start with a Task created in a previous [colab on Task and Mixtures](https://github.com/google/seqio/blob/main/seqio/notebooks/Basics_Task_and_Mixtures.ipynb), which already has three preprocessors defined.

- `seqio.preprocessors.rekey`
- custom preprocessor: `sample_from_answers`
- `seqio.preprocessors.tokenize` using `seqio.SentencePieceVocabulary`

In [None]:
# seqio.map_over_dataset is a decorator that maps the decorated function
# (e.g., sample_from_answers below) over all examples in a dataset.
# For details, please refer to the seqio.map_over_dataset() documentation.
@seqio.map_over_dataset(num_seeds=1)
def sample_from_answers(x, seed):
 answers = x['targets']
 sample_id = tf.random.stateless_uniform([],
                                         seed=seed,
                                         minval=0,
                                         maxval=len(answers),
                                         dtype=tf.int32)
 x['targets'] = answers[sample_id]
 return x

In [None]:
sentencepiece_model_file = "gs://t5-data/vocabs/cc_all.32000.100extra/sentencepiece.model"
vocab = seqio.SentencePieceVocabulary(sentencepiece_model_file)

In [None]:
seqio.TaskRegistry.remove('my_simple_task')
seqio.TaskRegistry.add(
    'my_simple_task',
    source=seqio.TfdsDataSource('natural_questions_open:1.0.0'),
    preprocessors=[
        functools.partial(
            seqio.preprocessors.rekey,
            key_map={
                'inputs': 'question',
                'targets': 'answer',
                'answers': 'answer',
            }),
        sample_from_answers,
        seqio.preprocessors.tokenize,
    ],
    output_features={
        'inputs': seqio.Feature(vocabulary=vocab),
        'targets': seqio.Feature(vocabulary=vocab),
    },
)

<seqio.dataset_providers.Task at 0x7ff12fab5430>

In [None]:
task = seqio.TaskRegistry.get('my_simple_task')
ds = task.get_dataset(sequence_length=None, split="train", shuffle=False)
list(ds.take(1).as_numpy_iterator())

[{'answers': array([b'Romi Van Renterghem.'], dtype=object),
  'inputs': array([ 113,   19,    8, 3202,   16,   72,  145,   25,  214], dtype=int32),
  'inputs_pretokenized': b'who is the girl in more than you know',
  'targets': array([12583,    23,  4480,  9405,    49,   122,  6015,     5],
        dtype=int32),
  'targets_pretokenized': b'Romi Van Renterghem.'}]

# s2. define metric functions

Currently `seqio` supports two types of metric functions.

- Type 1: metric depending on model predictions (i.e., model output sequence)
- Type 2: metric depending on model scores (i.e., log probability/likelihood of target sequence given input sequence)

We will define one for each type below.

- `sequence_accuracy()` belongs to Type 1: it computes the percentage of model output sequences that exactly match the corresponding target sequences.
- `log_likelihood()` belongs to Type 2: it computes the average log likelihood of target sequences.

In [None]:
def sequence_accuracy(targets, predictions):
  # Percentage of predictions that exactly match their target.
  seq_acc = 100 * np.mean([p == t for p, t in zip(predictions, targets)])
  return {"sequence_accuracy": seq_acc}

def log_likelihood(targets, scores):
  # `targets` is unused here; the metric depends only on the model scores.
  avg_log_likelihood = np.mean([scipy.special.logsumexp(el) for el in scores])
  return {"log_likelihood": avg_log_likelihood}
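As a quick sanity check of the two signatures, the sketch below calls a prediction-based and a score-based metric on toy data. The strings and scores are made up, `mean_score` is a hypothetical NumPy-only stand-in for a score metric, and `metric_type` is a hypothetical helper that reflects our assumption that the two types are told apart by their argument names (`predictions` vs. `scores`).

```python
import inspect

import numpy as np


def sequence_accuracy(targets, predictions):
  seq_acc = 100 * np.mean([p == t for p, t in zip(predictions, targets)])
  return {"sequence_accuracy": seq_acc}


def mean_score(targets, scores):
  # Hypothetical score-based metric: the average of per-example scores.
  return {"mean_score": float(np.mean(scores))}


def metric_type(metric_fn):
  """Hypothetical helper: classifies a metric by its argument names."""
  args = tuple(inspect.signature(metric_fn).parameters)
  if args == ("targets", "predictions"):
    return "prediction"
  if args == ("targets", "scores"):
    return "score"
  raise ValueError(f"Unrecognized metric signature: {args}")


# Toy strings standing in for decoded target and output sequences.
targets = ["paris", "42", "blue"]
predictions = ["paris", "41", "blue"]
scores = [-1.0, -2.0, -3.0]

acc = sequence_accuracy(targets, predictions)
avg = mean_score(targets, scores)
print(metric_type(sequence_accuracy), acc)
print(metric_type(mean_score), avg)
```

Two of the three toy predictions match, so the accuracy is two thirds of 100.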

We supply these two metric functions to the Task via the `metric_fns` argument.

In [None]:
seqio.TaskRegistry.remove('my_simple_task')
seqio.TaskRegistry.add(
    'my_simple_task',
    source=seqio.TfdsDataSource('natural_questions_open:1.0.0'),
    preprocessors=[
        functools.partial(
            seqio.preprocessors.rekey,
            key_map={
                'inputs': 'question',
                'targets': 'answer',
                'answers': 'answer',
            }),
        sample_from_answers,
        seqio.preprocessors.tokenize,
    ],
    output_features={
        'inputs': seqio.Feature(vocabulary=vocab),
        'targets': seqio.Feature(vocabulary=vocab),
    },
    metric_fns=[sequence_accuracy, log_likelihood]
)

<seqio.dataset_providers.Task at 0x7ff1270145e0>

# s3. define model functions

Now we define model functions. They return the model outputs that the metric functions take in to compute the metrics.

Currently, we only need two types of model outputs: predictions and scores (i.e., log probability/likelihood of target sequence given input sequence). We will have

- `dummy_predict_fn` to produce predictions
- `dummy_score_fn` to produce scores

Note: in real-world applications, standard modeling frameworks such as T5X support the SeqIO evaluator. Specifically, users provide the model functions defined in those modeling frameworks to SeqIO. At eval time, the modeling framework invokes the SeqIO evaluator and reports the metrics.


In [None]:
def dummy_predict_fn(ds):
 return [(i, d['decoder_target_tokens']) for i, d in ds]

def dummy_score_fn(ds):
 return [(i, 0.4) for i, d in ds]
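The sketch below exercises the two dummy model functions on a toy input, assuming (as the `for i, d in ds` loops suggest) that a model function receives an iterable of `(index, features)` pairs and returns a list of `(index, output)` pairs. `toy_ds` is a hypothetical plain-Python stand-in for the dataset the evaluator would pass in, and the token IDs are made up.

```python
import numpy as np


def dummy_predict_fn(ds):
  # "Predicts" by echoing the target tokens back.
  return [(i, d['decoder_target_tokens']) for i, d in ds]


def dummy_score_fn(ds):
  # Assigns the same constant score to every example.
  return [(i, 0.4) for i, d in ds]


# Hypothetical stand-in for the indexed dataset a model function receives.
toy_ds = [
    (0, {'decoder_target_tokens': np.array([5, 6, 1])}),
    (1, {'decoder_target_tokens': np.array([7, 1])}),
]

preds = dummy_predict_fn(toy_ds)
scores = dummy_score_fn(toy_ds)
print(preds)
print(scores)
```

Keeping the index next to each output lets the caller line outputs up with examples even if the model returns them in a different order.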

We construct a SeqIO evaluator that is tied to the task we'd like to evaluate on, which ensures we are getting data from the desired task. Concretely, the evaluator loads the data and converts it into the format the model functions expect.

In [None]:
evaluator = seqio.Evaluator(
   mixture_or_task_name='my_simple_task',
   feature_converter=seqio.EncDecFeatureConverter(pack=False),
   eval_split='validation')
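To give a rough idea of what "the format the model functions expect" means for an encoder-decoder model, here is a NumPy-only sketch. It is not the `seqio.EncDecFeatureConverter` implementation: `to_enc_dec_features` is a hypothetical helper, and it assumes padding ID 0, a decoder input that is the target shifted right by one position, and loss weights of 1 on non-padding target positions.

```python
import numpy as np


def to_enc_dec_features(example, input_len, target_len):
  """Hypothetical sketch of an encoder-decoder feature conversion."""

  def pad(ids, length):
    # Truncate to `length`, then right-pad with zeros (assumed padding ID).
    ids = np.asarray(ids, dtype=np.int32)[:length]
    return np.pad(ids, (0, length - len(ids)))

  targets = pad(example["targets"], target_len)
  # Shift targets right by one position, starting with ID 0.
  decoder_inputs = np.concatenate([[0], targets[:-1]]).astype(np.int32)
  return {
      "encoder_input_tokens": pad(example["inputs"], input_len),
      "decoder_target_tokens": targets,
      "decoder_input_tokens": decoder_inputs,
      "decoder_loss_weights": (targets > 0).astype(np.int32),
  }


features = to_enc_dec_features(
    {"inputs": [113, 19, 8], "targets": [12583, 23, 5]},
    input_len=5,
    target_len=4)
for name, value in features.items():
  print(name, value)
```

This is why the dummy prediction function can read `decoder_target_tokens` from each example: that key is produced by the feature conversion, not by the Task's preprocessors.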

We supply `dummy_predict_fn` and `dummy_score_fn` to `evaluator.evaluate()` so that the evaluator can call them to get the `predictions` and `scores` used for metric computation.

In [None]:
metrics, _, _ = evaluator.evaluate(
   compute_metrics=True,
   step=None,
   predict_fn=dummy_predict_fn,
   score_fn=dummy_score_fn)

print(metrics.result())

# s4. add postprocess_fn (optional)

Sometimes we need to carry out certain processing on predictions and targets before computing metrics. This is where `postprocess_fn` comes in handy. It runs on each target and each prediction separately before `metric_fn`.

In [None]:
def gather_answers(target_or_pred, example, is_target):
  if not is_target:
    return target_or_pred
  return [a.decode() for a in example["answers"]]

def multi_target_sequence_accuracy(targets, predictions):
  # targets is a list of lists.
  seq_acc = 100 * np.mean([p in t for p, t in zip(predictions, targets)])
  return {"multi_target_sequence_accuracy": seq_acc}
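The following toy run shows how the two functions fit together. The examples and predictions are made up, and the postprocessor is applied by hand here rather than by the evaluator, so this is only a sketch of the data flow.

```python
import numpy as np


def gather_answers(target_or_pred, example, is_target):
  if not is_target:
    return target_or_pred
  return [a.decode() for a in example["answers"]]


def multi_target_sequence_accuracy(targets, predictions):
  # targets is a list of lists.
  seq_acc = 100 * np.mean([p in t for p, t in zip(predictions, targets)])
  return {"multi_target_sequence_accuracy": seq_acc}


# Made-up examples; "answers" holds every acceptable answer as bytes.
examples = [
    {"answers": [b"Paris", b"Paris, France"]},
    {"answers": [b"1969"]},
]
raw_predictions = ["Paris", "1970"]

# Targets become the full list of acceptable answers for each example.
targets = [gather_answers(None, ex, is_target=True) for ex in examples]
# Predictions pass through the postprocessor unchanged.
predictions = [
    gather_answers(p, ex, is_target=False)
    for p, ex in zip(raw_predictions, examples)
]

result = multi_target_sequence_accuracy(targets, predictions)
print(result)
```

The first prediction matches one of its acceptable answers and the second does not, so the metric comes out at 50.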

In [None]:
seqio.TaskRegistry.remove('my_simple_task')
seqio.TaskRegistry.add(
    'my_simple_task',
    source=seqio.TfdsDataSource('natural_questions_open:1.0.0'),
    preprocessors=[
        functools.partial(
            seqio.preprocessors.rekey,
            key_map={
                'inputs': 'question',
                'targets': 'answer',
                'answers': 'answer',
            }),
        sample_from_answers,
        seqio.preprocessors.tokenize,
    ],
    output_features={
        'inputs': seqio.Feature(vocabulary=vocab),
        'targets': seqio.Feature(vocabulary=vocab),
    },
    postprocess_fn=gather_answers,
    metric_fns=[multi_target_sequence_accuracy]
)

<seqio.dataset_providers.Task at 0x7ff127268670>

In [None]:
evaluator = seqio.Evaluator(
   mixture_or_task_name='my_simple_task',
   feature_converter=seqio.EncDecFeatureConverter(pack=False),
   eval_split='validation')

In [None]:
metrics, _, _ = evaluator.evaluate(
   compute_metrics=True,
   step=None,
   predict_fn=dummy_predict_fn,
   score_fn=dummy_score_fn)

print(metrics.result())

{'my_simple_task': {'multi_target_sequence_accuracy': 97.4792243767313}}
