# ArrayRecordDataSource

This tutorial shows how to retrieve records from ArrayRecord files using `grain.sources.ArrayRecordDataSource` and how to process and transform the data with Grain.

## Read records from ArrayRecord files
This section reads records from ArrayRecord files and defines an example transform function that parses and tokenizes the record data.

### Define File Path

In [None]:
import grain
import numpy as np
from tensorflow_datasets.core.constants import ARRAY_RECORD_DATA_DIR

In [None]:
# grain.sources.ArrayRecordDataSource supports sharded file paths (here, the '@1' suffix).
example_file_paths = (
    ARRAY_RECORD_DATA_DIR + '/aeslc/1.0.0/aeslc-train.array_record@1'
)
print(example_file_paths)
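
The `@1` suffix in the path above is a shard specification: `name@N` stands for a set of N shard files. As a rough illustration, the sketch below expands such a spec into individual file names, assuming the common `-XXXXX-of-YYYYY` shard naming scheme. `expand_sharded_path` is a hypothetical helper written for this tutorial, not part of Grain, and Grain's own expansion may differ in detail.

```python
import re


def expand_sharded_path(path: str) -> list[str]:
    """Expands 'name@N' into N shard file names (hypothetical helper)."""
    match = re.fullmatch(r"(?P<prefix>.+)@(?P<num_shards>\d+)", path)
    if match is None:
        # Not a sharded spec: treat the path as a single file.
        return [path]
    prefix = match["prefix"]
    num_shards = int(match["num_shards"])
    # Assumed naming scheme: <prefix>-<shard index>-of-<shard count>.
    return [f"{prefix}-{i:05d}-of-{num_shards:05d}" for i in range(num_shards)]


print(expand_sharded_path("aeslc-train.array_record@1"))
```

A path without an `@N` suffix is returned unchanged, so the helper can be applied to both sharded and unsharded paths.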

In [None]:
# @title Load Data Source
example_array_record_data_source = grain.sources.ArrayRecordDataSource(example_file_paths)
print(f"Number of records: {len(example_array_record_data_source)}")
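
The data source is queried with `len()` here and is later read by integer index. When experimenting without ArrayRecord files, a small in-memory stand-in that supports those two operations can be handy. `InMemoryBytesSource` below is a hypothetical class for illustration only; whether it can be passed to Grain in place of a real source is an assumption to verify against the Grain documentation.

```python
from typing import Sequence


class InMemoryBytesSource:
    """Hypothetical stand-in for a random-access source of serialized records."""

    def __init__(self, records: Sequence[bytes]):
        self._records = list(records)

    def __len__(self) -> int:
        # Total number of records available.
        return len(self._records)

    def __getitem__(self, index: int) -> bytes:
        # Random access: return the record stored at the given position.
        return self._records[index]


source = InMemoryBytesSource([b"first record", b"second record"])
print(f"Number of records: {len(source)}")
print(source[1])
```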

### Define Transformation Function

In [None]:
# Load a pre-trained tokenizer.
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-cased")

In [None]:
class ParseAndTokenizeText(grain.transforms.Map):
  """Parses a serialized tf.Example and tokenizes its 'email_body' feature.

  The 'email_body' feature is expected to be a list of bytes. This transform
  decodes each item as UTF-8, tokenizes it with the provided tokenizer, flattens
  the resulting lists of tokens, and returns the first 10 tokens.
  """

  def __init__(self, tokenizer):
    self._tokenizer = tokenizer

  def map(self, proto_bytes: bytes) -> np.ndarray:
    # Parse a single serialized record.
    parsed_element = grain.fast_proto.parse_tf_example_experimental(
        proto_bytes, strip_trailing_null_characters=True
    )
    tokens = [
        self._tokenizer.encode(item.decode('utf-8')).tokens
        for item in parsed_element['email_body']
    ]
    tokens = np.array(tokens).flatten()
    # Keep only the first 10 tokens of the tokenized text for testing.
    return tokens[:10]
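
To follow the decode, tokenize, flatten, and truncate steps without downloading a tokenizer or parsing a proto, here is a plain-Python sketch. It assumes a whitespace tokenizer in place of `bert-base-cased` and a plain dict in place of the parsed example; `tokenize_feature` is a hypothetical helper, not part of Grain.

```python
from itertools import chain


def tokenize_feature(feature_values: list[bytes], max_tokens: int = 10) -> list[str]:
    """Decodes, tokenizes, flattens, and truncates a bytes feature (illustration)."""
    # Decode each bytes value and split on whitespace (stand-in tokenizer).
    per_item_tokens = [value.decode("utf-8").split() for value in feature_values]
    # Flatten the per-item token lists into one list, then truncate.
    flat_tokens = list(chain.from_iterable(per_item_tokens))
    return flat_tokens[:max_tokens]


# Stand-in for a parsed record with an 'email_body' feature.
parsed_element = {"email_body": [b"Hi team, the report is attached.", b"Thanks!"]}
print(tokenize_feature(parsed_element["email_body"], max_tokens=5))
```

`chain.from_iterable` flattens items with different token counts, whereas `np.array(...).flatten()` relies on every item producing the same number of tokens.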

In [None]:
# Example using Grain's MapDataset with ArrayRecord file source.
example_datasets = (
    grain.MapDataset.source(example_array_record_data_source)
    .shuffle(seed=42)
    .map(ParseAndTokenizeText(tokenizer))
    .batch(batch_size=10)
)

In [None]:
# Output the batch at index 100 of the shuffled, batched dataset.
print(example_datasets[100])
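
Because `.batch(batch_size=10)` is the last step of the pipeline, index 100 selects a batch of records rather than a single record. The sketch below illustrates that index arithmetic on a plain list. `batch_at` is a hypothetical helper; it models only the indexing, not Grain's shuffling or its handling of a final partial batch.

```python
def batch_at(records: list, batch_index: int, batch_size: int) -> list:
    """Returns the records that fall into batch `batch_index` (illustration only)."""
    start = batch_index * batch_size
    return records[start:start + batch_size]


records = list(range(2000))
batch = batch_at(records, batch_index=100, batch_size=10)
print(batch[0], batch[-1])  # 1000 1009
```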