
<a href="https://colab.research.google.com/github/google/seqio/blob/main/seqio/notebooks/Basics_Task_and_Mixtures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
print("Installing dependencies...")
!pip install seqio-nightly

import seqio
import tensorflow as tf
import functools

# s1. define simplest seqio.Task

Let’s define the simplest SeqIO Task that just reads data from a TFDS dataset, no preprocessing.

In [None]:
seqio.TaskRegistry.add(
   'my_simple_task',
   source=seqio.TfdsDataSource('natural_questions_open:1.0.0'),
   output_features={}
)


<seqio.dataset_providers.Task at 0x7fde663c55b0>

Then we get the task from the registry, get the dataset from the task, and inspect the first example.

In [None]:
task = seqio.TaskRegistry.get('my_simple_task')
ds = task.get_dataset(sequence_length=None, split="train", shuffle=False)
list(ds.take(1).as_numpy_iterator())

[{'answer': array([b'Romi Van Renterghem.'], dtype=object),
  'question': b'who is the girl in more than you know'}]

# s2. add preprocessors

To make the data ready for sequence modeling, we need to change a few things.

- We'd like each example to have the keys `inputs` and `targets` instead of `question` and `answer`, so that the naming is less task-dependent. This can be done with `seqio.preprocessors.rekey()`.

- The `answer` field is currently a list of sequences (texts), whereas modeling often assumes there's only one output sequence. We need to sample a single sequence from the list.

- We need to tokenize both `inputs` and `targets`, for which we can use `seqio.preprocessors.tokenize` and supply a seqio vocabulary.

## s2.1 add rekey

In [None]:
seqio.TaskRegistry.remove('my_simple_task')
seqio.TaskRegistry.add(
    'my_simple_task',
    source=seqio.TfdsDataSource('natural_questions_open:1.0.0'),
    preprocessors=[
       functools.partial(
           seqio.preprocessors.rekey,
           key_map={
               'inputs': 'question',
               'targets': 'answer',
           }),
   ],
    output_features={}
)

<seqio.dataset_providers.Task at 0x7fde3cb53d30>

We check the same example and see that the keys are now `inputs` and `targets`.

In [None]:
task = seqio.TaskRegistry.get('my_simple_task')
ds = task.get_dataset(sequence_length=None, split="train", shuffle=False)
list(ds.take(1).as_numpy_iterator())

[{'inputs': b'who is the girl in more than you know',
  'targets': array([b'Romi Van Renterghem.'], dtype=object)}]
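
To make the effect of the key map concrete, here is a minimal plain-Python sketch of the renaming applied to a single example. It is an illustration only, not the library code: `rekey_example` is a hypothetical helper, and the example uses Python strings and lists instead of tensors.

```python
def rekey_example(example, key_map):
    """Return a new dict whose keys are renamed according to key_map.

    key_map maps each NEW key to the EXISTING key it is read from, in the
    same direction as the {'inputs': 'question', 'targets': 'answer'} map
    used in the cell above. Keys not listed in key_map are dropped.
    """
    return {new_key: example[old_key] for new_key, old_key in key_map.items()}


example = {
    'question': 'who is the girl in more than you know',
    'answer': ['Romi Van Renterghem.'],
}
rekeyed = rekey_example(example, {'inputs': 'question', 'targets': 'answer'})
print(rekeyed)
```

Note the direction of the map: the new name is the dictionary key and the old name is the value.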

## s2.2 sample one sequence from answers/targets

Here we need to define our own preprocessor function using `seqio.map_over_dataset`.

In [None]:
# seqio.map_over_dataset is a decorator that maps the decorated function
# (e.g., sample_from_answers below) over all examples in a dataset.
# With num_seeds=1, each call also receives a random seed as `seed`.
# For details, refer to the seqio.map_over_dataset() documentation.
@seqio.map_over_dataset(num_seeds=1)
def sample_from_answers(x, seed):
  answers = x['targets']
  # Draw a random index in [0, len(answers)) from the provided seed.
  sample_id = tf.random.stateless_uniform([],
                                          seed=seed,
                                          minval=0,
                                          maxval=len(answers),
                                          dtype=tf.int32)
  x['targets'] = answers[sample_id]
  return x


The preprocessor `sample_from_answers` is added after `seqio.preprocessors.rekey`, because seqio executes preprocessors in the order listed and `sample_from_answers` expects the `targets` key to exist.

In [None]:
seqio.TaskRegistry.remove('my_simple_task')
seqio.TaskRegistry.add(
    'my_simple_task',
    source=seqio.TfdsDataSource('natural_questions_open:1.0.0'),
    preprocessors=[
       functools.partial(
           seqio.preprocessors.rekey,
           key_map={
               'inputs': 'question',
               'targets': 'answer',
           }),
       sample_from_answers,
   ],
    output_features={}
)

<seqio.dataset_providers.Task at 0x7fde476dfa90>

We check the same example and should see that there is only one string in the `targets` field.

In [None]:
task = seqio.TaskRegistry.get('my_simple_task')
ds = task.get_dataset(sequence_length=None, split="train", shuffle=False)
list(ds.take(1).as_numpy_iterator())

[{'inputs': b'who is the girl in more than you know',
  'targets': b'Romi Van Renterghem.'}]
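
The sampling step is stateless: the chosen index depends only on the seed, so the same seed always selects the same answer. The sketch below illustrates that property in plain Python. It swaps in Python's `random.Random` for `tf.random.stateless_uniform`, and `sample_one` is a hypothetical name, so treat it as an analogy, not as what seqio runs.

```python
import random


def sample_one(answers, seed):
    """Pick one answer; the choice is a pure function of (answers, seed)."""
    rng = random.Random(seed)            # fresh generator, no shared global state
    index = rng.randrange(len(answers))  # integer in [0, len(answers))
    return answers[index]


answers = ['first answer', 'second answer', 'third answer']
first_pick = sample_one(answers, seed=42)
second_pick = sample_one(answers, seed=42)

# Same seed, same result: the sampling is reproducible.
print(first_pick == second_pick)
```

In the example above the answer list has a single element, so the sampled index is always 0.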

## s2.3 tokenize `inputs` and `targets`

We will use `seqio.SentencePieceVocabulary`, a commonly used vocabulary class, to tokenize the sequences.

In [None]:
sentencepiece_model_file = "gs://t5-data/vocabs/cc_all.32000.100extra/sentencepiece.model"
vocab = seqio.SentencePieceVocabulary(sentencepiece_model_file)

In [None]:
seqio.TaskRegistry.remove('my_simple_task')
seqio.TaskRegistry.add(
    'my_simple_task',
    source=seqio.TfdsDataSource('natural_questions_open:1.0.0'),
    preprocessors=[
       functools.partial(
           seqio.preprocessors.rekey,
           key_map={
               'inputs': 'question',
               'targets': 'answer',
           }),
       sample_from_answers,
       seqio.preprocessors.tokenize,
   ],
    output_features={
        'inputs': seqio.Feature(vocabulary=vocab),
        'targets': seqio.Feature(vocabulary=vocab),
    }
)

<seqio.dataset_providers.Task at 0x7fde3cb539d0>

We check the same example and should see that `inputs` and `targets` have become tokenized sequences (arrays of integers). The original strings are kept in the `inputs_pretokenized` and `targets_pretokenized` fields.

In [None]:
task = seqio.TaskRegistry.get('my_simple_task')
ds = task.get_dataset(sequence_length=None, split="train", shuffle=False)
list(ds.take(1).as_numpy_iterator())

[{'inputs': array([ 113,   19,    8, 3202,   16,   72,  145,   25,  214], dtype=int32),
  'inputs_pretokenized': b'who is the girl in more than you know',
  'targets': array([12583,    23,  4480,  9405,    49,   122,  6015,     5],
        dtype=int32),
  'targets_pretokenized': b'Romi Van Renterghem.'}]
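
The shape of this output (integer IDs under the original key, raw text under a `*_pretokenized` key) can be mimicked with a small plain-Python sketch. Everything here is an assumption made for illustration: `tokenize_example` is a hypothetical helper, and `toy_vocab` is a made-up whitespace vocabulary, not the SentencePiece model loaded above.

```python
def tokenize_example(example, vocab, feature_keys=('inputs', 'targets')):
    """Replace each text feature with token IDs and keep the original text."""
    out = dict(example)
    for key in feature_keys:
        text = example[key]
        out[f'{key}_pretokenized'] = text                  # keep the raw string
        out[key] = [vocab[word] for word in text.split()]  # toy whitespace tokenizer
    return out


# Hypothetical word-level vocabulary; real IDs come from the SentencePiece model.
toy_vocab = {'who': 3, 'is': 4, 'she': 5, 'Romi': 6}
example = {'inputs': 'who is she', 'targets': 'Romi'}
tokenized = tokenize_example(example, toy_vocab)
print(tokenized)
```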

# s3. define a Mixture

A seqio Mixture combines multiple tasks, so let's define a second task that is identical to `my_simple_task` but named `my_simple_task2`.

In [None]:
seqio.TaskRegistry.remove('my_simple_task2')
seqio.TaskRegistry.add(
    'my_simple_task2',
    source=seqio.TfdsDataSource('natural_questions_open:1.0.0'),
    preprocessors=[
       functools.partial(
           seqio.preprocessors.rekey,
           key_map={
               'inputs': 'question',
               'targets': 'answer',
           }),
       sample_from_answers,
       seqio.preprocessors.tokenize,
   ],
    output_features={
        'inputs': seqio.Feature(vocabulary=vocab),
        'targets': seqio.Feature(vocabulary=vocab),
    }
)

<seqio.dataset_providers.Task at 0x7fde471cea90>

In [None]:
seqio.MixtureRegistry.add(
    'my_simple_mixture',
    [('my_simple_task', 0.5), 'my_simple_task2'],
    default_rate=1.0
)

<seqio.dataset_providers.Mixture at 0x7fde3a85b3a0>
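
In the mixture definition above, `my_simple_task` has an explicit rate of 0.5, while `my_simple_task2` has none and falls back to `default_rate=1.0`. Assuming the rates act as relative weights that are normalized into sampling proportions, the following plain-Python sketch computes the share each task would get. `mixing_proportions` is a hypothetical helper written for this illustration.

```python
def mixing_proportions(tasks, default_rate):
    """Turn a Mixture-style task list into normalized sampling proportions.

    Each entry is either a task name (uses default_rate) or a
    (task name, rate) pair.
    """
    rates = {}
    for entry in tasks:
        if isinstance(entry, str):
            rates[entry] = default_rate
        else:
            name, rate = entry
            rates[name] = rate
    total = sum(rates.values())
    return {name: rate / total for name, rate in rates.items()}


proportions = mixing_proportions(
    [('my_simple_task', 0.5), 'my_simple_task2'], default_rate=1.0)
print(proportions)
```

Under that assumption, roughly one third of the examples come from `my_simple_task` and two thirds from `my_simple_task2`.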

In [None]:
mixture = seqio.MixtureRegistry.get('my_simple_mixture')
ds = mixture.get_dataset(sequence_length=None, split="train", shuffle=False)
list(ds.take(6).as_numpy_iterator())

[{'inputs': array([ 113,   19,    8, 3202,   16,   72,  145,   25,  214], dtype=int32),
  'targets': array([12583,    23,  4480,  9405,    49,   122,  6015,     5],
        dtype=int32)},
 {'inputs': array([  116,   405,   467,   147,   388,   369,    91,    30,  3134,
            89, 17591], dtype=int32),
  'targets': array([ 1332, 12992,   846], dtype=int32)},
 {'inputs': array([ 116,   47,    8,    3, 5397,   15,   13,    8, 4913, 8769,   52,
         1545], dtype=int32),
  'targets': array([1003, 4327,  104, 3916], dtype=int32)},
 {'inputs': array([11003,    13,    16,  8603,  2717,    16,  1353,    13, 19680,
             3,   122,    26,   102], dtype=int32),
  'targets': array([17353,    18, 15599,     7,    17], dtype=int32)},
 {'inputs': array([  113,    47,     8,   163,   178,  2753,   113,     7, 18573,
            47,    92,  2753], dtype=int32),
  'targets': array([15717, 20429], dtype=int32)},
 {'inputs': array([ 113, 2832,    3,   23,  317,    3,   23,   31,   51,  352,

# s4. add feature converters

Up to this point, the processing defined by `Task` and `Mixture` is model-agnostic. A feature converter adds model-specific processing on top of it. The example below uses `seqio.EncDecFeatureConverter`, which targets encoder-decoder models.

In [None]:
fc = seqio.EncDecFeatureConverter(pack=True)
feature_lengths = {"inputs": 5, "targets": 5}
# get_dataset truncates inputs and targets to the lengths in feature_lengths.
# Truncation happens after tokenization.
task = seqio.TaskRegistry.get('my_simple_task')
ds = task.get_dataset(sequence_length=feature_lengths, split="train", shuffle=False)
list(ds.take(1).as_numpy_iterator())

[{'inputs': array([ 113,   19,    8, 3202,   16], dtype=int32),
  'inputs_pretokenized': b'who is the girl in more than you know',
  'targets': array([12583,    23,  4480,  9405,    49], dtype=int32),
  'targets_pretokenized': b'Romi Van Renterghem.'}]
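
The truncation seen here can be reproduced on the token IDs from the earlier output with a short plain-Python sketch. `trim_features` is a hypothetical helper, and the sketch ignores any end-of-sequence handling the library may perform.

```python
def trim_features(example, feature_lengths):
    """Cut each listed feature down to at most its maximum length."""
    trimmed = dict(example)
    for key, max_len in feature_lengths.items():
        if key in trimmed:
            trimmed[key] = trimmed[key][:max_len]
    return trimmed


# Token IDs copied from the untruncated output shown earlier.
example = {
    'inputs': [113, 19, 8, 3202, 16, 72, 145, 25, 214],
    'targets': [12583, 23, 4480, 9405, 49, 122, 6015, 5],
}
trimmed = trim_features(example, {'inputs': 5, 'targets': 5})
print(trimmed)
```

Only the tokenized features are shortened; the `*_pretokenized` strings in the output above are left intact.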

After applying the feature converter, the same example has the additional fields needed for encoder-decoder modeling.

In [None]:
ds = fc(ds, feature_lengths)
list(ds.take(1).as_numpy_iterator())

[{'decoder_input_tokens': array([    0, 12583,    23,  4480,  9405], dtype=int32),
  'decoder_loss_weights': array([1, 1, 1, 1, 1], dtype=int32),
  'decoder_positions': array([0, 1, 2, 3, 4], dtype=int32),
  'decoder_segment_ids': array([1, 1, 1, 1, 1], dtype=int32),
  'decoder_target_tokens': array([12583,    23,  4480,  9405,    49], dtype=int32),
  'encoder_input_tokens': array([ 113,   19,    8, 3202,   16], dtype=int32),
  'encoder_positions': array([0, 1, 2, 3, 4], dtype=int32),
  'encoder_segment_ids': array([1, 1, 1, 1, 1], dtype=int32)}]
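
In this output, `decoder_input_tokens` is `decoder_target_tokens` shifted right by one position with a 0 in front, and `decoder_loss_weights` is 1 for every non-padding target. The plain-Python sketch below reproduces both fields for this single, unpacked example. The helper names and the choice of 0 as the start and padding ID are assumptions read off the output; the handling of several packed segments is deliberately left out.

```python
def shift_right(target_tokens, start_id=0):
    """Build decoder inputs: prepend start_id and drop the last target token."""
    return [start_id] + target_tokens[:-1]


def loss_weights(target_tokens, pad_id=0):
    """Weight 1 for real tokens and 0 for padding positions."""
    return [1 if token != pad_id else 0 for token in target_tokens]


decoder_target_tokens = [12583, 23, 4480, 9405, 49]
decoder_input_tokens = shift_right(decoder_target_tokens)
decoder_loss_weights = loss_weights(decoder_target_tokens)

print(decoder_input_tokens)
print(decoder_loss_weights)
```

With this shift, the decoder input at each step is the target token of the previous step, which is the usual teacher-forcing setup.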