# Implementing Transformer

## What's attention?

- Attention was introduced in the Attention is all you need paper, for a translation task
- Attention is a small part of the network.
- GPT is a generative network which uses only the decoder part of the model presented in Attention is all you need.
- BERT is also a transformer model, that can be finetuned for many tasks

Let's implement the translation model.

# Let's start with BERT and transformers library

In [1]:
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(comment="new unfreezed BERT 64")
PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0
PYTORCH_ENABLE_MPS_FALLBACK=1

In [2]:
import pandas as pd

# Load the dataset into a pandas dataframe.
df = pd.read_csv("/Users/kyoto/git/phospho_project/data.csv")

# Report the number of sentences.
print("Number of training sentences: {:,}\n".format(df.shape[0]))

# Display 10 random rows from the data.
df.sample(10)

Number of training sentences: 1,121



Unnamed: 0,task_id,text,AI,code,news,phospho
132,1c171b70387345beb41272a92e5a0133,"Quels sont les légumes de saison? En mai 2024,...",0,0,0,0
876,c589d748d18043a4aec7fd0194aa02fd,i don't know what you are talking about i have...,0,1,0,0
838,bd4d5207284f46a5ab652b487061f6bb,L'IAG va-t-elle générer de nouveaux besoins en...,1,0,0,0
516,759b2c6621b04953b31bb5d1b93c940e,change la numerotation Voici le texte modifié ...,0,0,0,0
158,23a7092c5bbd4882a4f118c00126224e,"Qui est tu? Je suis Tak, un expert en recherch...",0,0,0,1
288,3d89f18e87ca462a8da463fcdbdfc7da,Peux-tu me lister 10 autres start-up qui ont l...,1,0,0,0
525,7801f653404c4cbda117f044ade6f015,What are the best news API? Here are some of t...,0,1,0,0
415,5a92bf4d55a247f0b0962f245aeefe5c,tu serais analyser un ERNT L'ERNT (État des R...,0,0,0,0
400,566afebfde0e441fbaac783ca950ae57,qu'est ce que phospho ? Phospho peut se référe...,0,0,0,1
382,512f70ca5d524b7099bf7804fbe95619,"improved As we discussed a few months ago, we ...",0,0,0,0


In [3]:
from transformers import BertTokenizer, AutoTokenizer

# Load the BERT tokenizer.
print("Loading BERT tokenizer...")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
#tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

  from .autonotebook import tqdm as notebook_tqdm


Loading BERT tokenizer...


In [4]:
import torch

# Tokenize all of the sentences and map the tokens to thier word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for sent in df["text"]:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
        sent,  # Sentence to encode.
        add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
        max_length=64,  # Pad & truncate all sentences.
        pad_to_max_length=True,
        return_attention_mask=True,  # Construct attn. masks.
        return_tensors="pt",  # Return pytorch tensors.
    )

    # Add the encoded sentence to the list.
    input_ids.append(encoded_dict["input_ids"])

    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict["attention_mask"])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = df[["AI", "code", "news", "phospho"]].values
labels_tensor = torch.tensor(labels, dtype=torch.float32)


# Print sentence 0, now as a list of IDs.
print("Original: ", df["text"][0])
print("Token IDs:", input_ids[0])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Original:  que faire a new york ce week end?  Voici quelques activités intéressantes à faire à New York ce week-end :

1. **Randy Rainbow for President**
   - **Description** : Assistez à un spectacle humoristique de Randy Rainbow au Beacon Theater.
   - **Date** : Vendredi soir.
   - **Lieu** : Beacon Theater, 2124 Broadway, Manhattan.
   - [Plus d'infos](https://www.nytimes.com/article/things-to-do-in-nyc.html)
   - ![Image](https://i2.wp.com/totravelandbeyond.com/wp-content/uploads/2016/02/things-to-do-in-new-york-city.jpg?resize=565,847)

2. **Frieze New York**
   - **Description** : Pour les amateurs d'art, visitez Frieze New York, une exposition d'art contemporain avec plus de 1 000 artistes et 200 galeries internationales.
   - **Date** : Du 1er au 5 mai 2024.
   - **Lieu** : The Shed, Hudson Yards.
   - [Plus d'infos](https://loving-newyork.com/new-york-in-may/)
   - ![Image](https://media.timeout.com/images/103495347/image.jpg)

3. **Fleet Week NYC 2024**
   - **Description** 

In [5]:
from torch.utils.data import TensorDataset, random_split

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels_tensor)
# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print("{:>5,} training samples".format(train_size))
print("{:>5,} validation samples".format(val_size))

1,008 training samples
  113 validation samples


In [6]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# The DataLoader needs to know our batch size for training, so we specify it
# here. For fine-tuning BERT on a specific task, the authors recommend a batch
# size of 16 or 32.
batch_size = 4

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
    train_dataset,  # The training samples.
    sampler=RandomSampler(train_dataset),  # Select batches randomly
    batch_size=batch_size,  # Trains with this batch size.
)

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
    val_dataset,  # The validation samples.
    sampler=SequentialSampler(val_dataset),  # Pull out batches sequentially.
    batch_size=batch_size,  # Evaluate with this batch size.
)

In [35]:
from transformers import (
    AdamW,
    BertModel,
    AutoModelForSeq2SeqLM,
    BertForSequenceClassification
)
import numpy as np



### OG BERT MODEL

# class BERTClass(torch.nn.Module):
#     def __init__(self):
#         super(BERTClass, self).__init__()
#         self.bert_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)
#         self.dropout = torch.nn.Dropout(0.1)
#         self.linear = torch.nn.Linear(768, 4)
#         #self.bert_model.requires_grad_ = False   #freezing the gradients or not
        
#     def forward(self, input_ids, attn_mask, token_type_ids):
#         output = self.bert_model(
#             input_ids, attention_mask=attn_mask, token_type_ids=token_type_ids
#         )
#         #output_dropout = self.dropout(output.pooler_output)
#         #output = self.linear(output_dropout)
#         return output


### T5 MODEL

# class T5Model(torch.nn.Module):
#     def __init__(self):
#         super(T5Model, self).__init__()
#         self.t5_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
#         self.dropout = torch.nn.Dropout(0.1)
#         self.linear = torch.nn.Linear(768, 4)
#         self.t5_model.requires_grad = False

#     def forward(self, input_ids, attn_mask, token_type_ids):
#         output = self.t5_model(
#             input_ids=input_ids,
#             attention_mask=attn_mask,
#             decoder_input_ids=input_ids,
#             output_hidden_states=True
#         )
#         output_dropout = self.dropout(output.encoder_hidden_states[-1][:,0,:])
#         output = self.linear(output_dropout)
#         return output

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)

mps_device = torch.device("mps")
model.to(mps_device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [36]:
def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)

In [37]:
# Note: AdamW is a class from the huggingface library (as opposed to pytorch)
# I believe the 'W' stands for 'Weight Decay fix"
optimizer = AdamW(
    model.parameters(),
    lr=2e-5,  # args.learning_rate - default is 5e-5, our notebook had 2e-5
    eps=1e-8,  # args.adam_epsilon  - default is 1e-8.
)



In [38]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs. The BERT authors recommend between 2 and 4.
# We chose to run for 4, but we'll see later that this may be over-fitting the
# training data.
epochs = 4

# Total number of training steps is [number of batches] x [number of epochs].
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,  # Default value in run_glue.py
    num_training_steps=total_steps,
)

In [39]:
import torch
import torch.backends
import torch.backends.mps

# If there's a GPU available...
if torch.cuda.is_available():
    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")

    print("There are %d GPU(s) available." % torch.cuda.device_count())

    print("We will use the GPU:", torch.cuda.get_device_name(0))

elif torch.backends.mps.is_available():
    print("Apple GPU")
    device = torch.device("mps")
else:
    print("No GPU available, using the CPU instead.")
    device = torch.device("cpu")

Apple GPU


In [40]:
PYTORCH_ENABLE_MPS_FALLBACK=1

In [41]:
import random
import numpy as np
import time
from sklearn.metrics import f1_score, recall_score, precision_score, multilabel_confusion_matrix

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

# Set the seed value all over the place to make this reproducible.


seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We'll store a number of quantities such as training and validation loss,
# validation accuracy, and timings.
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()
total_eval_f1 = 0

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    if pred_flat.shape[0] != labels_flat.shape[0]:
        raise ValueError(
            f"Predicted labels and true labels are not the same length: {pred_flat.shape[0]} vs {labels_flat.shape[0]}"
        )
    return np.sum(pred_flat == labels_flat) / len(labels_flat)


def calculate_metrics(logits, labels):
    # Assuming logits and labels are PyTorch tensors when this function is called
    preds = torch.sigmoid(logits) > 0.5
    preds_flat = preds.view(-1)
    labels_flat = labels.view(-1)

    # Calculate accuracy and F1 score
    accuracy = (preds_flat == labels_flat).float().mean().item()
    return accuracy


# For each epoch...
for epoch_i in range(0, epochs):
    all_preds = []
    all_labels = []

    # ========================================
    #               Training
    # ========================================

    # Perform one full pass over the training set.

    print("")
    print("======== Epoch {:} / {:} ========".format(epoch_i + 1, epochs))
    print("Training...")

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode. Don't be mislead--the call to
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):
        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time in minutes.
            elapsed = time.time() - t0

            # Report progress.
            print(
                "  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.".format(
                    step, len(train_dataloader), elapsed
                )
            )

        # Unpack this training batch from our dataloader.
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids
        #   [1]: attention masks
        #   [2]: labels
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because
        # accumulating the gradients is "convenient while training RNNs".
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()

        # Perform a forward pass (evaluate the model on this training batch).
        # The documentation for this `model` function is here:
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        # It returns different numbers of parameters depending on what arguments
        # arge given and what flags are set. For our useage here, it returns
        # the loss (because we provided labels) and the "logits"--the model
        # outputs prior to activation.
        output = model(input_ids, attention_mask=attention_masks)
        logits = output
        loss = loss_fn(output, b_labels)
        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value
        # from the tensor.
        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)

    # Measure how long this epoch took.
    training_time = time.time() - t0

    print("")
    print("  Average training loss: {0:.4f}".format(avg_train_loss))
    writer.add_scalar("Loss func", avg_train_loss, epoch_i)

    print("  Training epoch took: {:}".format(training_time))

    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        # Unpack this training batch from our dataloader.
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using
        # the `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids
        #   [1]: attention masks
        #   [2]: labels
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            output = model(b_input_ids, attn_mask=b_input_mask, token_type_ids=None)

            loss = loss_fn(output, b_labels)

            logits = output.detach().cpu()

            total_eval_loss += loss.item()

            labels = b_labels.detach().cpu()

            preds = (torch.sigmoid(logits) > 0.5).int()
            accuracy = calculate_metrics(logits, labels)
            total_eval_accuracy += accuracy

            # Store predictions and labels
            all_preds.append(preds)
            all_labels.append(labels)

    all_preds = torch.cat(all_preds).numpy()
    all_labels = torch.cat(all_labels).numpy()

    # Calculate F1 score using 'micro' to consider label imbalance
    f1 = f1_score(all_labels, all_preds, average="weighted")
    print("  F1 Score: {:.2f}".format(f1))

    recall = recall_score(all_labels, all_preds, average="weighted")
    print("  Recall Score: {:.2f}".format(recall))

    precision = precision_score(all_labels, all_preds, average="weighted")
    print("  Precision Score: {:.2f}".format(precision))


    multilabel_confusion_matrix(all_labels, all_preds)

    writer.add_scalar("F1", f1, epoch_i)
    writer.add_scalar("Recall", recall, epoch_i)
    writer.add_scalar("Precision", precision, epoch_i)

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)

    # Measure how long the validation run took.
    validation_time = time.time() - t0

    print("  Validation Loss: {0:.4f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            "epoch": epoch_i + 1,
            "Training Loss": avg_train_loss,
            "Valid. Loss": avg_val_loss,
            "Valid. Accur.": avg_val_accuracy,
            "Training Time": training_time,
            "Validation Time": validation_time,
        }
    )

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(time.time() - total_t0))
writer.flush()


Training...


TypeError: BertForSequenceClassification.forward() got an unexpected keyword argument 'attn_mask'

tensorboard command: `tensorboard --logdir:runs`
