Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# Sentence Representations

In a previous notebook, we have explored different ways of leveraging word embeddings to come up with a representation for an input sequence. Given a sequence of words (tokens), we can get the embeddings for each word in the sequence and either concatenate them, or use some kind of aggregation function such as taking the mean or element-wise max of the embeddings.
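
As a quick reminder, the three aggregation strategies mentioned above can be sketched with NumPy on a toy example. The embedding values below are made up for illustration and do not come from any pretrained model.

```python
import numpy as np

# Toy word embeddings for a 3-token sequence, 4 dimensions each
# (values are invented for illustration only).
token_embeddings = np.array(
    [
        [1.0, 0.0, 2.0, -1.0],
        [3.0, 1.0, 0.0, 0.5],
        [2.0, -1.0, 1.0, 0.5],
    ]
)

# Concatenation: the size grows with the sequence length.
concatenated = token_embeddings.reshape(-1)

# Mean and element-wise max: the size is fixed, whatever the sequence length.
mean_pooled = token_embeddings.mean(axis=0)
max_pooled = token_embeddings.max(axis=0)

print(concatenated.shape, mean_pooled, max_pooled)
```

Note that concatenation yields a representation whose size depends on the number of tokens, whereas mean and max pooling always return a vector with the embedding dimension.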

In the notebook on Transformer models, we have seen how to use a full pretrained BERT-based model as is, or even how to fine-tune the whole model to our task.

In this notebook, we explore alternative ways of coming up with richer sentence representations by leveraging language models, while avoiding the need to fine-tune the full BERT-based model. As mentioned in the [BERT paper](https://arxiv.org/abs/1810.04805), a _feature-based approach, where fixed features are extracted from the pretrained model, has certain advantages_. One of them is computational: an expensive representation of the data can be pre-computed once, and several experiments can then be run on top of it using computationally cheaper models.
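
One possible way to benefit from pre-computed features is to cache them on disk, so that the expensive encoding step runs only once. The sketch below is not part of the original notebook: `load_or_compute_features` and `expensive_encoder` are hypothetical names, and the random matrix merely stands in for real model outputs.

```python
import os
import tempfile

import numpy as np


def load_or_compute_features(cache_path, compute_fn):
    """Return cached features if the file exists; otherwise compute and cache them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    features = compute_fn()
    np.save(cache_path, features)
    return features


calls = {"n": 0}


def expensive_encoder():
    # Stand-in for running a language model over the whole dataset.
    calls["n"] += 1
    return np.random.default_rng(0).normal(size=(5, 8))


cache_path = os.path.join(tempfile.mkdtemp(), "features.npy")
first = load_or_compute_features(cache_path, expensive_encoder)   # computes and saves
second = load_or_compute_features(cache_path, expensive_encoder)  # loads from disk

print(calls["n"], np.array_equal(first, second))
```

With the features cached, different classifiers or hyperparameters can be tried without running the language model again.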


## The dataset and an evaluation helper

We will be comparing the effect of using different sentence representations for the same text classification task. For that, we start by loading our dataset:


In [36]:
import pandas as pd

# Importing the dataset
dataset = pd.read_csv("../data/restaurant_reviews.tsv", delimiter="\t", quoting=3)

dataset.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


To make it easy to test different sentence representations, let's define a generic function. Given the features used to represent each text entry and the output labels, it partitions the dataset into training and test sets, trains a logistic regression classifier on the training set, and prints the results obtained on the test set.


In [37]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)


def evaluate_feature_representation(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=0, stratify=y
    )

    # print(X_train.shape, y_train.shape)
    # print(X_test.shape, y_test.shape)

    # print("\nLabel distribution in the training set:")
    # print(y_train.value_counts())

    # print("\nLabel distribution in the test set:")
    # print(y_test.value_counts())

    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(confusion_matrix(y_test, y_pred))
    print("Accuracy: ", accuracy_score(y_test, y_pred))
    print("Precision: ", precision_score(y_test, y_pred))
    print("Recall: ", recall_score(y_test, y_pred))
    print("F1: ", f1_score(y_test, y_pred))

    return

## BERT embeddings

We can make use of BERT's internal representation of the input sequence as features. Let's start by loading a BERT model.


In [38]:
model_name = "distilbert-base-uncased"


In [39]:
from transformers import AutoTokenizer
from transformers import AutoModel

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Using the CLS token

BERT models add a special CLS token to the beginning of the input sequence. In the model's final hidden state, this token's representation is used as an aggregate sequence representation for classification tasks.

Let's see what we get from the representation of the CLS token for a specific example.


In [40]:
print(dataset["Review"][0])
inputs = tokenizer(
    dataset["Review"][0], padding=True, truncation=True, return_tensors="pt"
)
print(inputs["input_ids"])
# print(inputs['input_ids'].shape)

Wow... Loved this place.
tensor([[  101, 10166,  1012,  1012,  1012,  3866,  2023,  2173,  1012,   102]])


As you can see, the text has been split into 10 tokens, including the special [CLS] (101) and [SEP] (102) tokens.

We now pass the input through BERT and obtain the last hidden state of the model.
(Note: if you want to check all hidden states, via _outputs.hidden_states_, you must load the model with the _output_hidden_states=True_ option.)


In [41]:
outputs = model(**inputs)
# print(outputs.last_hidden_state)   # or outputs["last_hidden_state"]
print(outputs.last_hidden_state.shape)

torch.Size([1, 10, 768])


The embedding size is, in this case, 768, so we have a tensor with dimensions 1x10x768 (batch size x number of tokens x embedding size). To get the CLS token embedding, we access the first token position.


In [42]:
print(outputs.last_hidden_state[0][0].shape)
outputs.last_hidden_state[0][0]  # the CLS token is the first one

torch.Size([768])


tensor([ 4.2126e-02, -5.0134e-03, -3.5180e-02, -4.1186e-02,  2.2420e-02,
        -2.9745e-01,  2.7671e-01,  5.0136e-01, -1.5983e-01, -2.1376e-01,
         5.4243e-02, -4.1223e-01, -1.0171e-01,  5.8712e-01,  1.3457e-01,
         2.1712e-01, -1.0883e-01,  2.7112e-01,  1.1865e-01, -1.5052e-01,
         1.5066e-03, -3.5435e-01, -3.8533e-02,  1.1027e-01, -7.6557e-02,
        -3.8043e-02, -1.6030e-02, -2.0002e-01,  1.3438e-01,  2.5308e-02,
         1.4056e-02,  8.4898e-02, -2.6438e-01, -1.7116e-01, -1.0666e-01,
        -9.9342e-02, -1.1813e-02, -1.0511e-01, -2.2373e-01,  1.2219e-01,
        -3.2356e-02, -1.3577e-01,  1.8531e-01, -1.0054e-01, -1.7740e-01,
        -3.3294e-01, -2.3977e+00,  5.2049e-02, -2.2174e-02, -1.1433e-01,
         4.4919e-01,  8.9583e-03,  1.7044e-01,  2.8112e-01,  3.2639e-01,
         4.3146e-01, -2.8787e-01,  4.8212e-01,  1.5943e-02, -7.4780e-02,
         3.7156e-01,  3.0133e-02, -1.2007e-01, -1.6729e-01, -6.9575e-02,
         2.2837e-01,  5.0202e-02,  5.9328e-02, -4.7

Now, we can get the CLS token embedding for every review. For that, we need to detach each tensor from the computation graph with _detach()_ and convert it into a numpy.ndarray with the _numpy()_ method.


In [43]:
import numpy as np

# Encode each review on its own and keep only the [CLS] vector (token position 0)
X = np.array(
    [
        model(**tokenizer(rev, padding=True, truncation=True, return_tensors="pt"))
        .last_hidden_state[0][0]
        .detach()
        .numpy()
        for rev in dataset["Review"]
    ]
)

We get the labels and check the shape of the feature matrix. Each input element should have 768 features (the dimension of the encoder layers in DistilBERT, which is the same as the hidden size of the BERT base model).


In [44]:
y = dataset["Liked"]
print(X.shape, y.shape)

(1000, 768) (1000,)


Let's see how this representation fares with our generic evaluation function, which trains and tests a classifier based on the representation we provide to it.


In [45]:
evaluate_feature_representation(X, y)


[[88 12]
 [ 9 91]]
Accuracy:  0.895
Precision:  0.883495145631068
Recall:  0.91
F1:  0.896551724137931


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Averaging over token embeddings

Alternatively, we can average the embeddings of all tokens in the last hidden state. Even though the [BERT](https://arxiv.org/abs/1810.04805) paper suggests using the CLS token as the representation of the input sequence for classification tasks, in some cases averaging across token embeddings yields better performance. Can you try it out?


In [46]:
# your code here
# Average the last hidden state over all tokens of each review (special tokens included)
y = dataset["Liked"]

X = np.array(
    [
        model(**tokenizer(rev, padding=True, truncation=True, return_tensors="pt"))
        .last_hidden_state[0]
        .mean(axis=0)
        .detach()
        .numpy()
        for rev in dataset["Review"]
    ]
)

print(X.shape, y.shape)
evaluate_feature_representation(X, y)

(1000, 768) (1000,)
[[90 10]
 [ 9 91]]
Accuracy:  0.905
Precision:  0.900990099009901
Recall:  0.91
F1:  0.9054726368159204


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
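
The averaging cell above encodes one review at a time, so no padding tokens are involved. If several reviews were encoded in a single padded batch, the padding positions should probably be excluded from the average by using the attention mask returned by the tokenizer. The following is a minimal NumPy sketch of that idea, with toy arrays in place of real model outputs; `masked_mean_pool` is a hypothetical helper, not a library function.

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over real (non-padding) tokens only.

    hidden_states: array of shape (batch, seq_len, hidden)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# Toy batch: 2 sequences, 3 token positions, hidden size 2.
hidden_states = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
        [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    ]
)
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden_states, attention_mask)
print(pooled)
```

In the first sequence the padding vector is ignored, so the average is taken over the two real tokens only.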


## SBERT (SentenceTransformers)

Several other sentence representation models exist. Here we explore the usage of [SentenceTransformers](https://www.sbert.net/). Although this framework was built with semantic similarity tasks in mind, its representations can also be used for text classification tasks, as evidenced in the [original paper](https://arxiv.org/abs/1908.10084).

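SBERT modifies the BERT network using siamese and triplet network structures, so that it produces sentence embeddings that can be compared directly (for instance, with cosine similarity). The models in the original paper were fine-tuned on Natural Language Inference data ([SNLI](https://nlp.stanford.edu/projects/snli/) and MultiNLI), and a triplet loss function is one of the training objectives considered.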
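
To give an idea of what a triplet loss does, here is a minimal NumPy sketch: it asks the anchor to be closer to a positive example than to a negative one by at least a margin. The Euclidean distance, the margin of 1 and the toy vectors are assumptions made for illustration, not the exact training setup of any released model.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean of max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()


anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])       # distance 1 from the anchor
near_negative = np.array([[1.0, 0.0]])  # distance 1 from the anchor
far_negative = np.array([[3.0, 4.0]])   # distance 5 from the anchor

loss_hard = triplet_loss(anchor, positive, near_negative)  # 1 - 1 + 1 = 1
loss_easy = triplet_loss(anchor, positive, far_negative)   # max(1 - 5 + 1, 0) = 0

print(loss_hard, loss_easy)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only "hard" triplets contribute to training.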


### Comparing BERT and SBERT representations

To compare the representations obtained by BERT and those provided by SentenceTransformers, we can see how similar those representations are for a few sentences.


In [47]:
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The quick brown fox jumps over the lazy dog.",
]

Let's start with BERT, making use of a utility function provided by SentenceTransformers to compute cosine similarities.


In [48]:
from sentence_transformers import util

# [CLS] embedding of each sentence, obtained as in the classification example
embeddings = np.array(
    [
        model(**tokenizer(s, padding=True, truncation=True, return_tensors="pt"))
        .last_hidden_state[0][0]
        .detach()
        .numpy()
        for s in sentences
    ]
)

cos_sim = util.cos_sim(embeddings, embeddings)
cos_sim

tensor([[1.0000, 0.9867, 0.9358],
        [0.9867, 1.0000, 0.9248],
        [0.9358, 0.9248, 1.0000]])

As you can see, all sentence representation pairs have very high cosine similarities.
This can be somewhat alleviated by averaging across the embeddings for all tokens in the last hidden state, but the sentences will still have an unexpectedly high cosine similarity.
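
For reference, cosine similarity is the dot product of two vectors after normalizing each to unit length. A minimal NumPy sketch of a pairwise computation, using toy vectors rather than model embeddings, could look like this (`cosine_similarity_matrix` is a hypothetical helper written for illustration):

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
sims = cosine_similarity_matrix(vectors, vectors)
print(np.round(sims, 4))
```

Because of the normalization, the result depends only on the angle between vectors and not on their lengths: orthogonal vectors score 0 and identical directions score 1.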


Let's now load a SentenceTransformer model and see what it gives us.


In [49]:
from sentence_transformers import SentenceTransformer

sbert_model = SentenceTransformer("all-MiniLM-L6-v2")

Using SentenceTransformers consists of simply encoding the sentences that we have, in a single step.


In [50]:
sbert_embeddings = sbert_model.encode(sentences)

cos_sim = util.cos_sim(sbert_embeddings, sbert_embeddings)
cos_sim


tensor([[ 1.0000,  0.7553, -0.0220],
        [ 0.7553,  1.0000,  0.0033],
        [-0.0220,  0.0033,  1.0000]])

### Using SBERT embeddings for classification

We now use SentenceTransformer embeddings for our classification problem. For that, we need to encode the reviews in the dataset. You will find that this step is much faster than doing it using BERT.
Then, we can use our generic function to train and test a classifier by passing it the reviews' embeddings and the labels.


In [51]:
# your code here
X = sbert_model.encode(dataset["Review"])
y = dataset["Liked"]
evaluate_feature_representation(X, y)

[[90 10]
 [12 88]]
Accuracy:  0.89
Precision:  0.8979591836734694
Recall:  0.88
F1:  0.888888888888889


## SimCSE: Simple Contrastive Learning of Sentence Embeddings

[SimCSE](https://github.com/princeton-nlp/SimCSE) is another approach, which uses contrastive learning to train a BERT-based model to produce sentence embeddings.
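
The core idea of contrastive learning can be sketched with an InfoNCE-style loss over in-batch negatives: each sentence embedding should be more similar to its own positive pair than to the positives of the other sentences in the batch. The NumPy code below is a simplified illustration with random toy vectors; the function name and the temperature value are assumptions, and this is not the SimCSE training code.

```python
import numpy as np


def info_nce_loss(queries, keys, temperature=0.05):
    """Cross-entropy over cosine similarities; row i of keys is the positive for row i of queries."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = (q @ k.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 16))

# Positives that are slightly perturbed copies of the queries.
aligned_keys = queries + 0.01 * rng.normal(size=(4, 16))
# Same vectors, but paired with the wrong queries.
shuffled_keys = aligned_keys[[1, 2, 3, 0]]

loss_aligned = info_nce_loss(queries, aligned_keys)
loss_shuffled = info_nce_loss(queries, shuffled_keys)
print(loss_aligned, loss_shuffled)
```

The loss is low when each embedding matches its own positive and high when the pairs are mismatched, which is what pushes semantically related sentences together and unrelated ones apart.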


Let's load a SimCSE model and see what it gives us as sentence representations.


In [None]:
from simcse import SimCSE

simcse_model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")

We can easily obtain sentence embeddings:


In [None]:
simcse_model.encode(sentences)


SimCSE's API also allows us to obtain similarity scores directly from the source sentences.


In [None]:
simcse_model.similarity(sentences, sentences)


Compare these with those obtained using SBERT.


### Using SimCSE embeddings for classification

We now use SimCSE embeddings for our classification problem.


In [None]:
# your code here