<a href="https://colab.research.google.com/github/larajakl/Computational-Linguistics/blob/main/tutorial5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 5: Introduction to Computational Linguistics

This is the fifth tutorial with practical exercises for the lecture Introduction to Computational Linguistics in the winter semester 2024. Hands-on exercises are marked with 👋 ⚒ and questions are marked with ❓. Remember to first **store this notebook** in your Drive or GitHub.

----
## **Lesson 1: Visualizing Hidden Layers**

This tutorial shows how to visualize the hidden representations (embeddings) of each hidden layer at each checkpoint with the [TensorFlow Embedding Projector](https://projector.tensorflow.org/). You decide how often a checkpoint is saved: if checkpoints are saved once per epoch, you can visualize every hidden layer for every epoch.

First, we again need to create and fine-tune a model. We will use the same model as in Tutorial 4.

In [None]:
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install bertviz

Please mount your Google Drive to store the hidden embedding layers permanently so that you can load them in the [TensorFlow Projector API](https://projector.tensorflow.org/). Alternatively, you can store the files locally on your computer:

```
from google.colab import files
files.download("file.tsv")
```

Please do not store the files only in the active session, since this will not allow you to load them in the TensorFlow Projector. Store them in Google Drive or locally instead. As a less convenient alternative, you can download them from the session once they have been created (see the folder structure on the left in Google Colab).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Simple visualization with BertViz

In [None]:
# Load model and retrieve attention weights
from bertviz import head_view, model_view
from transformers import BertTokenizer, BertModel

model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version, output_attentions=True)

tokenizer = BertTokenizer.from_pretrained(model_version)

sentence_a = "The cat sat on the mat"
sentence_b = "The cat lay on the rug"

inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt')
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
sentence_b_start = token_type_ids[0].tolist().index(1)
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
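
To make the role of `sentence_b_start` in the cell above concrete, here is a minimal, library-free sketch. The token list is written out by hand and is an assumption about how the two example sentences are split (one token per word plus the special `[CLS]` and `[SEP]` markers); the real values come from the tokenizer.

```python
# Hand-written stand-in for the tokenizer output (assumed, not computed):
# segment A is "[CLS] ... [SEP]", segment B is "... [SEP]".
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]",
          "the", "cat", "lay", "on", "the", "rug", "[SEP]"]
token_type_ids = [0] * 8 + [1] * 7

# Same expression as in the cell above, applied to a plain Python list
sentence_b_start = token_type_ids.index(1)

segment_a = tokens[:sentence_b_start]
segment_b = tokens[sentence_b_start:]
print(sentence_b_start, segment_a, segment_b)
```

BertViz's `head_view` accepts this index as its optional `sentence_b_start` argument, next to the `attention` and `tokens` values computed above, so that the view can tell the two sentences apart.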

The so-called neuron view provides a nice visual explanation of attention. Once the visualization is available, click on one of the + signs next to the inputs to extend the visualization.

In [None]:
from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from bertviz.neuron_view import show

model_type = 'bert'
model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=True)
show(model, model_type, tokenizer, sentence_a, sentence_b, layer=4, head=3)

## Store hidden embedding layers for visualization

We need to import some additional libraries for this step.  

In [None]:
from torch.utils.tensorboard import SummaryWriter
import re
import torch
import tensorflow as tf
import tensorboard as tb

**Load dataset**

We first need to load our example dataset again, which is the same as in Tutorial 4.

To speed up training and reduce processing time, we will again drastically reduce the dataset. Today we will use the validation set for the visualization. For your project, you need to use a separate test set.


In [None]:
from datasets import load_dataset, DatasetDict
from transformers import DataCollatorWithPadding
from transformers import AutoTokenizer

imdb_dataset = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")


# Keep only the first 100 whitespace-separated words of each review for speed on CPU
def truncate(example):
    return {
        'text': " ".join(example['text'].split()[:100]),
        'label': example['label']
    }

# Take 128 shuffled examples for training and the next 32 for validation
small_imdb_dataset = DatasetDict(
    train=imdb_dataset['train'].shuffle(seed=24).select(range(128)).map(truncate),
    val=imdb_dataset['train'].shuffle(seed=24).select(range(128, 160)).map(truncate),
)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True)

small_tokenized_dataset = small_imdb_dataset.map(tokenize_function, batched=True, batch_size=16)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
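
Note that the `truncate` helper cuts each review by whitespace-separated words, not by subword tokens, so the tokenized sequence can still be longer than the word limit. The following is a minimal sketch on a made-up review; the text and the `max_words` parameter are illustrative assumptions and not part of the tutorial code.

```python
def truncate_words(example, max_words=100):
    """Keep only the first `max_words` whitespace-separated words of the text."""
    return {
        "text": " ".join(example["text"].split()[:max_words]),
        "label": example["label"],
    }

# Made-up review with 250 words; label 1 stands for a positive review
fake_review = {"text": " ".join(f"word{i}" for i in range(250)), "label": 1}
shortened = truncate_words(fake_review)

print(len(fake_review["text"].split()), "->", len(shortened["text"].split()))
```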

**Load pre-trained model and fine-tune**

Just like in Tutorial 4, we first need to fine-tune the generic model to the specific task, here the classification of IMDB movie reviews as positive or negative. You can store the checkpoints in your session, in Google Drive, or locally.

👋 ⚒ What needs to be changed in the following code to store the checkpoints in your Google Drive or locally?

In [None]:
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-cased', num_labels=2)
accuracy = evaluate.load("accuracy")

arguments = TrainingArguments(
    output_dir="sample_cl_trainer",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_steps=8,
    num_train_epochs=5,
    eval_strategy="epoch", # run validation at the end of each epoch
    save_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    report_to='none',
    seed=224
)

def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculates the accuracy
    return accuracy.compute(predictions=predictions, references=labels)


trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=small_tokenized_dataset['train'],
    eval_dataset=small_tokenized_dataset['val'], # change to test when you do your final evaluation!
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
trainer.train()

**Load model for specific checkpoint**

👋 ⚒ Load one of the checkpoints (here: status of model at a specific epoch) for the fine-tuned model.  

In [None]:
# Complete the code here to load one of the checkpoints for the fine-tuned model
fine_tuned_model = AutoModelForSequenceClassification.from_pretrained("sample_cl_trainer/checkpoint-8")

model_inputs = tokenizer(small_tokenized_dataset['val']['text'], padding=True, truncation=True, return_tensors='pt')
outputs = fine_tuned_model(**model_inputs, output_hidden_states=True)
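
The folder name `checkpoint-8` in the cell above is not arbitrary: the Trainer names each checkpoint after the number of optimizer steps taken so far. The sketch below reproduces that arithmetic for the settings used in this tutorial, assuming a single device and no gradient accumulation. `checkpoint_name` is a hypothetical helper, not a Transformers function.

```python
import math

def checkpoint_name(epoch, num_examples=128, batch_size=16,
                    output_dir="sample_cl_trainer"):
    """Folder holding the model state after `epoch` full epochs (1-based)."""
    # One optimizer step per batch; the last batch may be smaller
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return f"{output_dir}/checkpoint-{epoch * steps_per_epoch}"

print(checkpoint_name(1))
```

If in doubt, check the folders that actually exist inside `sample_cl_trainer` in the file browser on the left.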

The following code stores the hidden states of every layer for the loaded checkpoint in a folder named `results_vis`; the `path` variable decides whether that folder lives in your Google Drive or in the current session. Keep in mind that existing folders are not automatically overwritten. So before you run the code again, either delete the `results_vis` folder or change the name to `results_vis_1`, etc.

Keep in mind that the number of layers can be set manually. For the [default pretrained model DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert), `n_layers` is set to 6. `outputs['hidden_states']` therefore contains seven entries: index 0 holds the output of the embedding layer, and indices 1 to 6 hold the outputs of the six transformer layers.

❓ How can I get the hidden embedding layers for the model at the state of another epoch (we have five in total)?

❓ Which elements of the following code are very dataset-specific? What needs to be changed mainly for a different dataset?

In [None]:
import torch
import os

# Pick ONE output location. The Google Drive path is persistent; the
# session-local alternative is lost when the Colab session ends.
path = "/content/drive/MyDrive/results_vis"
# path = "results_vis"
if not os.path.exists(path):
  os.mkdir(path)

cls_id = tokenizer.cls_token_id  # id of the [CLS] token (101 for this tokenizer)
val_texts = small_tokenized_dataset['val']['text']
val_labels = small_tokenized_dataset['val']['label']

# One folder per entry in hidden_states (embedding output plus each transformer layer)
for layer, layer_states in enumerate(outputs['hidden_states']):
  layer_dir = path + '/layer_' + str(layer)
  if not os.path.exists(layer_dir):
    os.mkdir(layer_dir)

  tensors = []
  labels = []

  for example in range(len(layer_states)):
    # Use the hidden state at the [CLS] position as the vector for the whole review
    cls_position = model_inputs['input_ids'][example].tolist().index(cls_id)
    tensors.append(layer_states[example][cls_position])

    # Metadata shown in the projector: review text and gold label
    labels.append([val_texts[example], str(val_labels[example])])

  # Writes the tensor and metadata TSV files for this layer
  writer = SummaryWriter(layer_dir)
  writer.add_embedding(torch.stack(tensors), metadata=labels, metadata_header=['Text','Emotion'])
  writer.close()
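
If you rerun the cell above often, one way to avoid deleting old results by hand is to pick the next unused folder name automatically. This is only a minimal sketch: `next_free_dir` is a hypothetical helper that follows the `results_vis_1`, `results_vis_2`, ... naming suggested earlier.

```python
import os

def next_free_dir(base):
    """Return `base` if it does not exist yet, otherwise the first free `base_N`."""
    if not os.path.exists(base):
        return base
    n = 1
    while os.path.exists(f"{base}_{n}"):
        n += 1
    return f"{base}_{n}"

# Example: ask for a fresh folder name before writing any files
fresh_path = next_free_dir("results_vis")
print(fresh_path)
```

The helper only returns a name and does not create the folder, so it can be used to set `path` before the existing `os.mkdir` call.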


Now you can upload the metadata file and the tensor file to the TensorFlow [Embedding Projector](https://projector.tensorflow.org/).

👋 ⚒ Compare the visualizations of Layer 1 and Layer 6. To this end, go to the [Embedding Projector](https://projector.tensorflow.org/), click on the option `Load`, and first load the two TSV files stored in the `layer_1` subfolder of your `results_vis` folder.

Change the setting `Color by` to `Emotion` and the visualization method to `Custom` instead of `PCA`. Take a screenshot of the visualization before you do the same for Layer 6.
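
For orientation, the two files loaded into the Projector are plain tab-separated text: the tensor file has one row of numbers per vector, and the metadata file has one row per vector, preceded by a header row when there is more than one column. The sketch below writes such a pair by hand with made-up three-dimensional vectors. The file names and values are illustrative assumptions; in this tutorial `add_embedding` produces the files for you.

```python
import csv
import os
import tempfile

vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]                # made-up embeddings
metadata = [["A great film", "1"], ["Dull and slow", "0"]]  # made-up text and label

out_dir = tempfile.mkdtemp()
tensor_path = os.path.join(out_dir, "tensors.tsv")
meta_path = os.path.join(out_dir, "metadata.tsv")

# Tensor file: no header, one tab-separated vector per line
with open(tensor_path, "w", newline="") as f:
    csv.writer(f, delimiter="\t", lineterminator="\n").writerows(vectors)

# Metadata file: header row first, then one row per vector in the same order
with open(meta_path, "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerow(["Text", "Emotion"])
    writer.writerows(metadata)

with open(meta_path) as f:
    print(f.read())
```

The row order matters: the n-th metadata row (after the header) describes the n-th vector in the tensor file.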



