## Preprocess
After loading a dataset, the next step is to preprocess it into the format required for training.

There are many ways to preprocess a dataset, and the right choice depends on what we are trying to achieve. Common steps include:

- Tokenizing
- Resampling an audio dataset
- Applying transformations to an image dataset

In [20]:
# !pip install transformers

In [21]:
# !pip install datasets

## Tokenize text
Tokenization converts text into numbers (token IDs) that the model can process.

In [22]:
from transformers import AutoTokenizer
from datasets import load_dataset

model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
dataset = load_dataset("rotten_tomatoes", split="train")

In [23]:
tokenizer(dataset[0]["text"])

{'input_ids': [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1000, 16608, 1000, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

- `input_ids`: the numbers representing the tokens in the text.
- `token_type_ids`: indicates which sequence a token belongs to when there is more than one sequence.
- `attention_mask`: indicates whether the model should attend to a token (1) or ignore it, as with padding (0).
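
To make the three fields concrete, here is a minimal sketch that mimics the shape of a BERT-style encoding for a pair of sequences. The vocabulary `TOY_VOCAB` and the helper `toy_encode` are made up for illustration and are not part of `transformers`; the real tokenizer uses a much larger vocabulary and subword splitting.

```python
# Toy illustration only: a hypothetical seven-entry vocabulary, not BERT's real one.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "good": 3, "movie": 4, "bad": 5, "plot": 6}

def toy_encode(first, second=None):
    """Mimic the layout of a BERT-style encoding for one or two sequences."""
    tokens = ["[CLS]"] + first.split() + ["[SEP]"]
    type_ids = [0] * len(tokens)              # segment 0: first sequence
    if second is not None:
        extra = second.split() + ["[SEP]"]
        tokens += extra
        type_ids += [1] * len(extra)          # segment 1: second sequence
    return {
        "input_ids": [TOY_VOCAB[t] for t in tokens],
        "token_type_ids": type_ids,
        "attention_mask": [1] * len(tokens),  # no padding yet, so every position is real
    }

encoded = toy_encode("good movie", "bad plot")
print(encoded["input_ids"])       # [1, 3, 4, 2, 5, 6, 2]
print(encoded["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 1]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1]
```

For a single sequence, as in the output above, `token_type_ids` is all zeros because there is only one segment.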

In [24]:
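# Tokenize with the map function. With batched=True, examples["text"]
# is a list of strings, which the tokenizer accepts directly.

def tokenize(examples: dict) -> dict:
    """Tokenize the "text" column of a batch of examples."""
    return tokenizer(examples["text"])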

In [25]:
dataset = dataset.map(tokenize, batched=True)

In [26]:
dataset

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 8530
})
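
The call above uses `batched=True`, so the mapped function receives a dictionary of lists rather than a single row, and the columns it returns are added to the dataset. The sketch below imitates that contract in plain Python. `toy_batched_map` and `count_tokens` are hypothetical helpers written for this note, not the `datasets` implementation, and a whitespace split stands in for the real tokenizer.

```python
def count_tokens(batch):
    # In batched mode, batch["text"] is a list of strings, not a single string.
    return {"n_tokens": [len(text.split()) for text in batch["text"]]}

def toy_batched_map(columns, fn, batch_size=2):
    """Hypothetical helper: apply fn to slices of a dict of lists and merge the result."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Slice every column to build one batch.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Returned columns are added to (or replace) the existing ones.
    return {**columns, **new_columns}

data = {"text": ["a fine film", "dull", "not bad at all"], "label": [1, 0, 1]}
mapped = toy_batched_map(data, count_tokens)
print(list(mapped))        # ['text', 'label', 'n_tokens']
print(mapped["n_tokens"])  # [3, 1, 4]
```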

In [27]:
# Set the output format to match the framework in use: "torch", "tensorflow", or "numpy"

dataset.set_format(type="torch")

dataset[0]["input_ids"]

tensor([  101,  1996,  2600,  2003, 16036,  2000,  2022,  1996,  7398,  2301,
         1005,  1055,  2047,  1000, 16608,  1000,  1998,  2008,  2002,  1005,
         1055,  2183,  2000,  2191,  1037, 17624,  2130,  3618,  2084,  7779,
        29058,  8625, 13327,  1010,  3744,  1011, 18856, 19513,  3158,  5477,
         4168,  2030,  7112, 16562,  2140,  1012,   102])

In [28]:
dataset.format["type"]

'torch'

We can use the `to_tf_dataset()` function to convert the dataset to TensorFlow format, together with `DataCollatorWithPadding` from `transformers`, which pads sequences of varying length so that every sequence in a batch has the same length.
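
Conceptually, padding a batch means extending each sequence to the length of the longest one and marking the added positions in the attention mask. The following is a minimal sketch of that idea in plain Python. `pad_batch` is a hypothetical helper, not the collator's actual code, and `pad_id=0` is an assumption that matches the zeros visible in the padded batches in this notebook.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 2009, 102], [101, 1037, 2307, 2143, 102]])
print(batch["input_ids"])       # [[101, 2009, 102, 0, 0], [101, 1037, 2307, 2143, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Because the target length is computed per batch, each batch is only as wide as its own longest sequence, rather than being padded to a single dataset-wide maximum.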

In [29]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    return_tensors="tf"
)
tf_dataset = dataset.to_tf_dataset(
    columns=["input_ids"],
    label_cols=["label"],
    shuffle=True,
    batch_size=8,
    collate_fn=data_collator
)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 


In [32]:
tf_dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(None, None), dtype=tf.int64, name=None), TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

In [34]:
type(tf_dataset)

In [35]:
data_tf = next(iter(tf_dataset))
data_tf

(<tf.Tensor: shape=(8, 38), dtype=int64, numpy=
 array([[  101,  5064,  5796,  1012, 21960,  1998,  2720,  1012, 29198,
          3401,  3288,  2125,  2023,  3748,  6124,  1059, 14341,  6508,
          1012,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0],
        [  101,  2025,  2130,  7112, 28740,  2038, 13830,  2039,  2107,
          1038, 20051,  4630,  1998,  5305,  7406,  4031, 11073,  1999,
          1037,  3185,  1012,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0],
        [  101,  1037,  2659,  1011,  9278,  2128,  7913,  4215,  1997,
          1996,  7344,  4620,  1012,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0],
        [  101, 