# Implementing a Sentence Transformer Model

09/09/2024

This notebook is my submission to Fetch's interview challenge.

In this notebook, I explore using a sentence transformer model from the SentenceTransformers library (also written "Sentence Transformers") for sentence classification and named entity recognition.


The documentation for this library may be found here: https://www.sbert.net/

# Step 1: Implement a Sentence Transformer Model

First, I install the "sentence-transformers" package using "pip install" so that it can be used in this notebook.

In [1]:
!pip install -U sentence-transformers



Now that the package is installed, I import the SentenceTransformer class from the sentence_transformers library and use it to initialize the model I will be using for the rest of this exercise.

In this case, I use one of the original models, "all-mpnet-base-v2", which was trained on more than 1 billion training pairs and achieves the highest quantitative results among the original models.

(Those quantitative results may be viewed here:
https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#original-models)

In [2]:
from sentence_transformers import SentenceTransformer

# Here I am initializing a pretrained SentenceTransformer model
sbert_model = SentenceTransformer("all-mpnet-base-v2")



In the code cell below:
- I provide some example sentences.
- I pass the sentences through the model's encode method.
- I print the shape of the resulting embeddings, followed by the embeddings themselves.

In [3]:
# Here I am providing some example sentences.
# The fourth sentence uses three quotation marks to allow one string to span two lines.
sentences = [
    "This is an example sentence for a transformer model.",
    "The Seattle Seahawks were riding a high after winning another game of football.",
    "The two friends decided to play 'Golf' this weekend, and also invited their friends.",
    """The young man noticed a book at the bookstore, and wasn't sure if it was worth his time.
     He decided to buy it anyway."""
]

# I input the sentences through the sbert_model encoder
embeddings = sbert_model.encode(sentences)

# Then I print out the shape of the embeddings
print(embeddings.shape)
# (4, 768)

# Finally I also print out the embeddings themselves to showcase them
print(embeddings)


(4, 768)
[[-0.00960231 -0.10135006 -0.01356449 ... -0.04973421 -0.04874466
  -0.02236462]
 [-0.05270625  0.00597759 -0.01673021 ...  0.01083517  0.0528818
  -0.0270721 ]
 [ 0.01378679 -0.05925906 -0.01800795 ... -0.00167301  0.00147419
  -0.00167317]
 [ 0.04201893 -0.00050176  0.00044005 ... -0.01496557  0.06104949
  -0.06737804]]


The embeddings array has a shape of (4, 768), which tells us that the encoding succeeded: each of the four sentences has been transformed into a row of 768 values.

I would like to note that, for this section, I did not change the model architecture or add anything around it.
The SentenceTransformer model was designed specifically to work best with short-form inputs like sentences, passages and paragraphs.

There was also no need to change any hyperparameters to obtain fixed-length embeddings, since this model always outputs a vector of 768 values per input, regardless of the input's length.

More information on the hyperparameters may be found here:
https://www.sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#id1
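
To illustrate why the output length stays fixed no matter how long the sentence is, here is a minimal NumPy sketch of masked mean pooling, which, as far as I understand, is the kind of pooling step this model applies on top of its per-token vectors. The token vectors are random stand-ins rather than real model outputs, and `mean_pool` is a hypothetical helper name, not part of the library.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of each sentence, ignoring padded positions."""
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

rng = np.random.default_rng(0)
hidden_size = 768

# Two "sentences" padded to 6 tokens; the second one has only 3 real tokens.
token_embeddings = rng.normal(size=(2, 6, hidden_size))
attention_mask = np.array([[1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 0, 0, 0]])

sentence_embeddings = mean_pool(token_embeddings, attention_mask)
print(sentence_embeddings.shape)  # (2, 768)
```

Both sentences end up as one 768-value row, even though they contain a different number of real tokens.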

# Step 2: Multi-Task Learning Expansion

In this section, I take a slightly different approach.

Instead of exclusively using the sentence_transformers (SBERT) library, I switch over to a few libraries made available by Hugging Face, which help import the datasets and set up a sentence transformer model for Sentence Classification and Named Entity Recognition.

As in the previous section, I first install the necessary libraries.

In [21]:
!pip install datasets
!pip install transformers
!pip install evaluate
!pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.6/43.6 kB 4.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=637420165ad38e526d559b438111da4766141c41a18d77ff6c5bf62888752319
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


### Task A: Sentence Classification


In [5]:
import numpy as np
# Huggingface imports below
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate
from transformers import pipeline



In [6]:
# Importing "IMDb" dataset from Huggingface
imdb_ds = load_dataset("imdb")

# Displaying the structure of the training split,
# and then the structure of the overall loaded dataset
print(imdb_ds["train"])
print(imdb_ds)

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [7]:
# The full dataset is rather large and was using up my GPU credits,
# so I downsample it:
# I aim to have 12500 samples in the training set, and 5000 samples in the testing set
train_imdb_ds = imdb_ds["train"]
test_imdb_ds = imdb_ds["test"]

# train_test_split randomly splits a Dataset into a "train" and a "test" subset;
# I keep only one side of each split, which avoids any long-winded conversions between data formats
train_imdb_ds, _ = train_imdb_ds.train_test_split(test_size=0.5).values()
_, test_imdb_ds = test_imdb_ds.train_test_split(test_size=0.2).values()

print(train_imdb_ds)
print(test_imdb_ds)

imdb_ds = DatasetDict({'train': train_imdb_ds, 'test': test_imdb_ds})
print(imdb_ds)

Dataset({
    features: ['text', 'label'],
    num_rows: 12500
})
Dataset({
    features: ['text', 'label'],
    num_rows: 5000
})
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 12500
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})
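
The split above is random, so each run keeps a different subset. As a rough sketch of how the downsampling could be made reproducible, the snippet below draws a seeded set of row indices with NumPy. `sample_indices` is a hypothetical helper; I assume the resulting indices could then be passed to the dataset's `select` method, or that a `seed` argument could be given to `train_test_split` instead.

```python
import numpy as np

def sample_indices(num_rows, fraction, seed=42):
    """Return a sorted, seeded random subset of row indices covering `fraction` of the rows."""
    rng = np.random.default_rng(seed)
    keep = int(round(num_rows * fraction))
    return np.sort(rng.choice(num_rows, size=keep, replace=False))

# Same target sizes as in the cell above: half of train, a fifth of test.
train_idx = sample_indices(25_000, 0.5)
test_idx = sample_indices(25_000, 0.2)
print(len(train_idx), len(test_idx))  # 12500 5000
```

Because the generator is seeded, calling the helper again with the same arguments returns the same indices.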


In [8]:
# We load an accuracy evaluation metric for our compute_metrics method
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

# We also establish the label mappings to their id values
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# We load a tokenizer and establish a preprocessing_function
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_imdb = imdb_ds.map(preprocess_function, batched=True)


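
To make the `compute_metrics` step more concrete, here is a small sketch of the same logic on hand-written logits: take the argmax over the class dimension, then compare with the reference labels. `toy_accuracy` is a hypothetical stand-in that computes accuracy directly with NumPy instead of calling the `evaluate` metric, and the logits are made up.

```python
import numpy as np

def toy_accuracy(logits, labels):
    """Pick the highest-scoring class per row and report the share of correct predictions."""
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float((predictions == labels).mean())}

# Four made-up examples with two class scores each (0 = negative, 1 = positive).
logits = np.array([[2.0, -1.0],
                   [0.1, 0.3],
                   [-0.5, 1.5],
                   [1.2, 0.8]])
labels = np.array([0, 1, 0, 0])

print(toy_accuracy(logits, labels))  # {'accuracy': 0.75}
```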

In [9]:
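# We load the pretrained sentence transformer checkpoint with a new classification head.
# Note: id2label/label2id from the previous cell are not passed here,
# so predictions are later reported with generic names (LABEL_0 / LABEL_1).
imbd_sentence_classifier_model = AutoModelForSequenceClassification.from_pretrained("sentence-transformers/all-mpnet-base-v2")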

# And a data collator for the trainer
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Establishing trainer
training_args = TrainingArguments(
    output_dir="NA_imbd_trained_sentence_transformer",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
)

trainer = Trainer(
    model=imbd_sentence_classifier_model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of MPNetForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/all-mpnet-base-v2 and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2531,0.164366,0.9458


TrainOutput(global_step=782, training_loss=0.23303307047890276, metrics={'train_runtime': 1345.5397, 'train_samples_per_second': 9.29, 'train_steps_per_second': 0.581, 'total_flos': 3252089644241760.0, 'train_loss': 0.23303307047890276, 'epoch': 1.0})

In [10]:
text = "The original Korean version of this movie was allegedly superior to the American one. I haven't seen either of them, so I can't comment on it, but a lot of people nowadays say they vastly prefer the American version."

In [12]:
classifier = pipeline("sentiment-analysis", model="NA_imbd_trained_sentence_transformer/checkpoint-782")
classifier(text)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'LABEL_1', 'score': 0.7935006618499756}]
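
The generic name `LABEL_1` in this output corresponds to id 1 in the label mapping defined earlier. As a minimal sketch, the hypothetical helper below translates such a prediction back into a readable label after the fact. Passing `id2label` and `label2id` when loading the model, as is done for the NER model later on, would probably be the cleaner route.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def readable_label(prediction, id2label):
    """Translate a generic 'LABEL_<id>' name into the human-readable label."""
    label_id = int(prediction["label"].rsplit("_", 1)[1])
    return {"label": id2label[label_id], "score": prediction["score"]}

# The prediction printed by the pipeline above.
prediction = {"label": "LABEL_1", "score": 0.7935006618499756}
print(readable_label(prediction, id2label)["label"])  # POSITIVE
```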

### Task B: Named Entity Recognition

In [75]:
from datasets import load_dataset
from transformers import DataCollatorForTokenClassification
import evaluate
import numpy as np
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import pipeline


movie_trivia_ds = load_dataset("tner/mit_movie_trivia")

print(movie_trivia_ds["train"][0])

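# The tags are stored as plain integers without label names,
# so this prints an int32 Value rather than a list of label names
label_list = movie_trivia_ds["train"].features["tags"].feature
print(label_list)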

{'tokens': ['what', '1995', 'romantic', 'comedy', 'film', 'starred', 'michael', 'douglas', 'as', 'a', 'u', 's', 'head', 'of', 'state', 'looking', 'for', 'love'], 'tags': [0, 9, 10, 15, 0, 0, 1, 2, 0, 3, 4, 4, 4, 4, 4, 4, 4, 4]}
Value(dtype='int32', id=None)


In [76]:
# establish tokenizer
ner_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

# tokenize one pre-split example into sub-word tokens
example = movie_trivia_ds["train"][0]
tokenized_input = ner_tokenizer(example["tokens"], is_split_into_words=True)
tokens = ner_tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

# view the raw example (the sub-word pieces are stored in `tokens`)
print(example)

{'tokens': ['what', '1995', 'romantic', 'comedy', 'film', 'starred', 'michael', 'douglas', 'as', 'a', 'u', 's', 'head', 'of', 'state', 'looking', 'for', 'love'], 'tags': [0, 9, 10, 15, 0, 0, 1, 2, 0, 3, 4, 4, 4, 4, 4, 4, 4, 4]}
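
Before a token classification model can be trained on this data, the word-level tags have to be aligned with the sub-word tokens produced by the tokenizer, since one word may be split into several pieces. The sketch below shows one common convention: label only the first piece of each word and mark special tokens and remaining pieces with -100, which PyTorch's cross-entropy loss ignores by default. The `word_ids` list is a hand-written, hypothetical example of what a fast tokenizer's `word_ids()` call returns, and `align_labels` is a hypothetical helper name.

```python
def align_labels(word_ids, word_tags, ignore_index=-100):
    """Map word-level tags onto sub-word tokens, labelling only the first piece of each word."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as the start/end markers carry no tag.
            aligned.append(ignore_index)
        elif word_id != previous_word:
            # First sub-word piece of a new word gets the word's tag.
            aligned.append(word_tags[word_id])
        else:
            # Later pieces of the same word are ignored by the loss.
            aligned.append(ignore_index)
        previous_word = word_id
    return aligned

# Hypothetical case: three words, the second of which was split into two pieces.
word_ids = [None, 0, 1, 1, 2, None]
word_tags = [0, 9, 10]
print(align_labels(word_ids, word_tags))  # [-100, 0, 9, -100, 10, -100]
```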




In [77]:
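# NOTE: this 13-label mapping does not appear to match tner/mit_movie_trivia:
# the sample printed above already contains tag id 15, which falls outside the 0-12 range.
# It should be replaced with the dataset's own tag-to-id mapping before training.
id2label = {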
    0: "O",
    1: "B-corporation",
    2: "I-corporation",
    3: "B-creative-work",
    4: "I-creative-work",
    5: "B-group",
    6: "I-group",
    7: "B-location",
    8: "I-location",
    9: "B-person",
    10: "I-person",
    11: "B-product",
    12: "I-product",
}

label2id = {
    "O": 0,
    "B-corporation": 1,
    "I-corporation": 2,
    "B-creative-work": 3,
    "I-creative-work": 4,
    "B-group": 5,
    "I-group": 6,
    "B-location": 7,
    "I-location": 8,
    "B-person": 9,
    "I-person": 10,
    "B-product": 11,
    "I-product": 12,
}

In [78]:
# now we initialize our sentence transformer model with a token classification head
ner_model = AutoModelForTokenClassification.from_pretrained(
    "sentence-transformers/all-mpnet-base-v2", num_labels=13, id2label=id2label, label2id=label2id
)

Some weights of MPNetForTokenClassification were not initialized from the model checkpoint at sentence-transformers/all-mpnet-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [79]:
text = "The Golden State Warriors are an American professional basketball team based in San Francisco."
classifier = pipeline("ner", model=ner_model, tokenizer=ner_tokenizer)
classifier(text)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'entity': 'B-creative-work',
  'score': 0.09131047,
  'index': 1,
  'word': 'the',
  'start': 0,
  'end': 3},
 {'entity': 'B-creative-work',
  'score': 0.08274924,
  'index': 2,
  'word': 'golden',
  'start': 4,
  'end': 10},
 {'entity': 'B-creative-work',
  'score': 0.08654826,
  'index': 3,
  'word': 'state',
  'start': 11,
  'end': 16},
 {'entity': 'B-creative-work',
  'score': 0.085353725,
  'index': 4,
  'word': 'warriors',
  'start': 17,
  'end': 25},
 {'entity': 'B-creative-work',
  'score': 0.08739137,
  'index': 5,
  'word': 'are',
  'start': 26,
  'end': 29},
 {'entity': 'B-creative-work',
  'score': 0.08954705,
  'index': 6,
  'word': 'an',
  'start': 30,
  'end': 32},
 {'entity': 'B-creative-work',
  'score': 0.08371857,
  'index': 7,
  'word': 'american',
  'start': 33,
  'end': 41},
 {'entity': 'B-creative-work',
  'score': 0.08758899,
  'index': 8,
  'word': 'professional',
  'start': 42,
  'end': 54},
 {'entity': 'B-creative-work',
  'score': 0.08833624,
  'index': 9,
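
The scores in the output above all sit between roughly 0.08 and 0.09, and every token receives the same tag. That is what I would expect from a classification head that has been newly initialized but not yet trained: its logits are close to each other, so the softmax is close to uniform over the 13 labels. The sketch below reproduces the effect with random stand-in logits; the scale of 0.1 is an assumption chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_labels = 13

# Five "tokens" with small random logits, as an untrained head might produce.
logits = rng.normal(scale=0.1, size=(5, num_labels))
probs = softmax(logits)

print(1 / num_labels)      # probability of each label under a uniform guess
print(probs.max(axis=-1))  # top score per token stays close to that value
```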

# Step 3: Discussion Questions


Question 1) Consider the scenario of training the multi-task sentence transformer that you implemented in Task 2. Specifically, discuss how you would decide which portions of the network to train and which parts to keep frozen. For example:

- When would it make sense to freeze the transformer backbone and only train the task specific layers?
- When would it make sense to freeze one head while training the other?



Answer 1)

As far as I know, the high-level training pipelines provided by SentenceTransformers and Hugging Face do not offer a dedicated option for keeping some portions of the network frozen and others trainable for a "specific" transfer learning attempt, so this has to be set up manually on the underlying model.

I am familiar with this training methodology, especially from my experience with sequential computer vision models like VGG-16 and VGG-19, and with other architectures such as dense or residual neural networks.

To accomplish the same freezing and unfreezing of layers here, I would need to review the model's structure, and if necessary the library code, to identify which parameters belong to the backbone and which belong to each head.

To answer the questions, however: it makes sense to freeze the transformer backbone and only train the task-specific layers when the model is being moved to a dataset or task that is similar to what the backbone was trained on, but that needs to classify or work with different labels.

It would make sense to freeze one head while training the other when only one head needs to be fine-tuned towards the desired goal, whether or not the input data for the two tasks is similar. I can imagine a situation where the Named Entity Recognition head already has the knowledge of the desired labels to identify, but the sentence classification or sentiment analysis head needs to be updated.
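
As a rough sketch of what such freezing could look like in plain PyTorch, the snippet below builds a tiny stand-in model with a shared backbone and two heads, then disables gradients for everything except the classification head. `TinyMultiTaskModel` and `set_trainable` are hypothetical names, and the linear layers only stand in for a real transformer. With a Hugging Face model I assume the same loop could be applied to its backbone module (for example `model.base_model`).

```python
import torch
from torch import nn

class TinyMultiTaskModel(nn.Module):
    """Hypothetical stand-in: one shared encoder with one head per task."""
    def __init__(self, hidden_size=32, num_classes=2, num_tags=13):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(16, hidden_size), nn.ReLU())
        self.classification_head = nn.Linear(hidden_size, num_classes)
        self.ner_head = nn.Linear(hidden_size, num_tags)

    def forward(self, features):
        hidden = self.backbone(features)
        return self.classification_head(hidden), self.ner_head(hidden)

def set_trainable(module, trainable):
    """Enable or disable gradient updates for every parameter of a module."""
    for parameter in module.parameters():
        parameter.requires_grad = trainable

model = TinyMultiTaskModel()
set_trainable(model.backbone, False)  # freeze the shared backbone
set_trainable(model.ner_head, False)  # freeze one head, train the other

trainable_names = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable_names)  # ['classification_head.weight', 'classification_head.bias']

# Only the parameters that are still trainable are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```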

Question 2)

Discuss how you would decide when to implement a multi-task model like the one in this assignment and when it would make more sense to use two completely separate models for each task.

Answer 2)

In my experience so far, I would implement a multi-task model when the scenario it is deployed in is limited in scope and handles relatively consistent data. I have found that models attempting multiple tasks sometimes produce less than ideal, "low-performing" results, or use up a lot of resources to train and deploy and potentially cause issues elsewhere.

In most situations, I would choose to use completely separate models for each task, train each one to its best performance for its situation, and then work on optimizing them for what the larger system needs.

Question 3)

When training the multi-task model, assume that Task A has abundant data, while Task B has limited data. Explain how you would handle this imbalance.

Answer 3)

Depending on how big the difference is between the sizes of the datasets for Task A and Task B, I would try different approaches.

For example, if Task B has enough data to train a consistently "good" model on, such as thousands or tens of thousands of samples, but Task A has hundreds of thousands or millions of samples, I would downsample Task A's dataset so that it matches the size of Task B's dataset.

I would also carry out some subjective or quantitative analysis of both datasets to ensure that they are "balanced" in the desired way, for example in terms of label distribution.

Depending on the situation, I would also attempt to bootstrap or upsample Task B's dataset to minimize the imbalance.
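
The two resampling options described above can be sketched with NumPy index sampling. The dataset sizes below are made up for illustration, and both helper names are hypothetical. Downsampling draws without replacement, while upsampling (bootstrapping) draws with replacement, so some Task B rows are repeated.

```python
import numpy as np

def downsample_indices(large_size, target_size, seed=0):
    """Pick `target_size` distinct rows from the larger dataset."""
    rng = np.random.default_rng(seed)
    return rng.choice(large_size, size=target_size, replace=False)

def upsample_indices(small_size, target_size, seed=0):
    """Bootstrap the smaller dataset: draw rows with replacement up to `target_size`."""
    rng = np.random.default_rng(seed)
    return rng.choice(small_size, size=target_size, replace=True)

size_a, size_b = 100_000, 5_000  # made-up sizes for Task A and Task B

task_a_subset = downsample_indices(size_a, size_b)
task_b_boot = upsample_indices(size_b, 20_000)
print(len(task_a_subset), len(task_b_boot))  # 5000 20000
```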