Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors before a model can use it.

For text, use a **tokenizer** to split the text into a sequence of tokens, map the tokens to numerical IDs, and assemble them into tensors.

For speech and audio, use a **feature extractor** to extract sequential features from audio waveforms and convert them into tensors.

For images, use an **image processor** to convert images into tensors.

For multimodal inputs, use a **processor** to combine a tokenizer with a feature extractor or image processor.

In [18]:
!pip -q install datasets evaluate transformers > /dev/null

In [None]:
# Start by loading the pretrained tokenizer that matches the model

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [3]:
encoded_input = tokenizer("""Do not meddle in the affairs 
                    of pre trained models, for they are 
                    subtle and quick to error out.""")
print(encoded_input)

{'input_ids': [101, 2091, 1136, 1143, 13002, 1107, 1103, 5707, 1104, 3073, 3972, 3584, 117, 1111, 1152, 1132, 11515, 1105, 3613, 1106, 7353, 1149, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [4]:
tokenizer.decode(encoded_input["input_ids"])

'[CLS] Do not meddle in the affairs of pre trained models, for they are subtle and quick to error out. [SEP]'

In [5]:
batch_sentences = [
    "But what about the craft on the raft?",
    "Don't think he knows about the craft, Boogle.",
    "What about upsies?",
]

encoded_inputs = tokenizer(batch_sentences)

Sentences in a batch rarely have the same length, but tensors must be rectangular. Padding fixes this by adding a special padding token to the shorter sentences.

In [6]:
encoded_inputs = tokenizer(batch_sentences, padding=True)
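To make the effect of padding concrete, here is a plain-Python sketch of the idea. It is not the tokenizer's actual implementation: `pad_batch` and the toy token IDs are hypothetical, and the only detail taken from the real model is that `bert-base-cased` uses ID 0 for its `[PAD]` token.

```python
# Illustrative sketch only; the real tokenizer does this internally.
PAD_ID = 0  # bert-base-cased maps [PAD] to id 0


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102]]  # made-up token ids
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

The attention mask is what lets the model tell real tokens apart from the padding that was added.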

A sequence may be too long for a model to handle. In this case, you’ll need to truncate the sequence to a shorter length.

In [7]:
encoded_input = tokenizer(batch_sentences, 
                          padding=True, 
                          truncation=True)
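With `truncation=True` and no explicit `max_length`, sequences are cut to the maximum length the model accepts (512 tokens for `bert-base-cased`). The following plain-Python sketch shows the general idea on a toy sequence. `truncate_ids` is a hypothetical helper, not a library function, and it assumes a single sequence that starts with `[CLS]` (101), ends with `[SEP]` (102), and a `max_length` of at least 2.

```python
# Illustrative sketch only; the real tokenizer handles this internally.
def truncate_ids(ids, max_length):
    """Trim a [CLS] ... [SEP] sequence so it has at most max_length tokens."""
    if len(ids) <= max_length:
        return ids
    # Keep the leading [CLS] and as many content tokens as fit,
    # then re-append the final [SEP] so the sequence stays well formed.
    return ids[: max_length - 1] + [ids[-1]]


long_ids = [101, 11, 12, 13, 14, 15, 102]  # made-up token ids
short_ids = truncate_ids(long_ids, max_length=5)
print(short_ids)
```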

To get tensors back instead of Python lists, pass `return_tensors="pt"` (PyTorch).

In [9]:
encoded_input = tokenizer(batch_sentences, 
                          padding=True, 
                          truncation=True,
                          return_tensors='pt')

### Let's now head to datasets and load one

In [None]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [12]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [13]:
tokenized_datasets = dataset.map(tokenize_function, 
                                 batched=True)


In [14]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [15]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", 
                                                           num_labels=5)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [16]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

In [19]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")


In [20]:
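from transformers import TrainingArguments, Trainer

# Evaluate at the end of every epoch.
# Note: recent transformers releases rename this argument to `eval_strategy`.
training_args = TrainingArguments(output_dir="test_trainer",
                                  evaluation_strategy="epoch")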

In [22]:
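def compute_metrics(eval_pred):
    # eval_pred unpacks into the model's raw logits and the true labels
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)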
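As a rough illustration of what `compute_metrics` calculates, the sketch below reproduces the same argmax-then-accuracy logic in plain Python, without NumPy or `evaluate`. The helper names and the toy logits are made up for this example.

```python
# Illustrative sketch only; the notebook itself uses np.argmax and evaluate.
def argmax(row):
    """Return the index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


toy_logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 0.9]]
toy_labels = [1, 0, 1]
result = accuracy(toy_logits, toy_labels)
print(result)
```

Here two of the three predictions match their labels, so the reported accuracy is two thirds.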

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()