# Unsupervised sentence embedding learning - TSDAE

In this notebook, we will look at the work of `Kexin Wang, Nils Reimers, Iryna Gurevych` in their paper [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/pdf/2104.06979.pdf).

Here is the authors' own summary of the paper.

> Learning sentence embeddings often requires a large amount of labeled data. However, for most tasks and domains, labeled data is seldom available and creating it is expensive. In this work, we present a new state-of-the-art unsupervised method based on pre-trained Transformers and Sequential Denoising Auto-Encoder (TSDAE) which outperforms previous approaches by up to 6.4 points. It can achieve up to 93.1% of the performance of in-domain supervised approaches. Further, we show that TSDAE is a strong domain adaptation and pre-training method for sentence embeddings, significantly outperforming other approaches like Masked Language Model.
>
> A crucial shortcoming of previous studies is the narrow evaluation: Most work mainly evaluates on the single task of Semantic Textual Similarity (STS), which does not require any domain knowledge. It is unclear if these proposed methods generalize to other domains and tasks. We fill this gap and evaluate TSDAE and other recent approaches on four different datasets from heterogeneous domains.

The techniques we have discussed so far, such as Bi-Encoders and Cross-Encoders, require labeled data. While they perform well on in-domain data, their performance declines rapidly on out-of-domain data.

| ![](assets/models_generalize.png) | 
|:--:| 
| Fig. 1. Illustration of the generalizability of neural IR models on 18 IR datasets from the [BEIR benchmark](https://arxiv.org/abs/2104.08663). (Image source: https://www.youtube.com/watch?v=xbdLowiQTlk&t=658s) |

We can see that BM25 (retriever/candidate generator) combined with a Cross-Encoder (re-ranker) works best, while many dense retrievers fail to outperform plain BM25 on most of the datasets.

So why not use BM25 + Cross-Encoders?

As we mentioned earlier, Cross-Encoders are expensive because we can't index the corpus beforehand. For each query, we need to score the query against every retrieved candidate (which could number in the hundreds).

So we come back to Bi-Encoders, as they are fast both at indexing the corpus and at inference (using approximate nearest neighbor search).
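
The cost argument can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_encode` is a hypothetical bag-of-words stand-in for a real Bi-Encoder, and the search is an exact scan over the index, not an approximate nearest neighbor lookup. The point is that the corpus is encoded once, offline, and each query costs one encoder call plus cheap vector comparisons.

```python
import math
from collections import Counter


def toy_encode(text):
    """Hypothetical stand-in for a Bi-Encoder: an L2-normalized bag-of-words vector."""
    counts = Counter(text.lower().replace(".", "").split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a, b):
    """Dot product of two normalized sparse vectors, i.e. their cosine similarity."""
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "A man is riding a horse.",
    "A woman is playing violin.",
]

# Offline step: every document is encoded exactly once and stored.
index = [toy_encode(doc) for doc in corpus]


def search(query, top_k=2):
    """Encode the query once, then compare it against the stored vectors."""
    query_vec = toy_encode(query)
    scored = sorted(
        ((cosine(query_vec, doc_vec), i) for i, doc_vec in enumerate(index)),
        reverse=True,
    )
    return [(corpus[i], score) for score, i in scored[:top_k]]


for doc, score in search("A man is eating pasta."):
    print(f"{score:.2f}\t{doc}")
```

A Cross-Encoder has no equivalent of `index`: it would have to run the full model on every (query, document) pair at query time.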

In TSDAE, the authors train encoder-based models with a pre-training objective that is similar to `Masked Language Modeling` (MLM) but with a slight variation.
In MLM, we mask some tokens and train the encoder to predict them. In TSDAE, we instead delete some tokens from the input sentence, create a pooled representation of the corrupted sentence (mean pooling or the CLS embedding), and pass it to a decoder that must reconstruct the original input text.
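
To make the pooling step concrete, here is a minimal sketch of the two pooling strategies on plain Python lists. The function names and the toy token embeddings are illustrative assumptions, not the `sentence_transformers` implementation; the sketch also assumes the CLS token sits at position 0, as it does for BERT-style models.

```python
def cls_pooling(token_embeddings):
    """Use the embedding of the first token (assumed to be CLS) as the sentence vector."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings, attention_mask):
    """Average the embeddings of real tokens, skipping padding (mask value 0)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: 3 real tokens and 1 padding token, each with 2 dimensions.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

print(cls_pooling(token_embeddings))                   # [1.0, 2.0]
print(mean_pooling(token_embeddings, attention_mask))  # [3.0, 4.0]
```

Either way, the decoder only sees this single fixed-size vector, so all the information needed to rebuild the sentence has to be squeezed into it.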

| ![](assets/tsdae.png) | 
|:--:| 
| Fig. 2. Illustration of TSDAE. (Image source: https://arxiv.org/pdf/2104.06979.pdf) |

The authors tried several configurations for adding noise. The best results came from token deletion with a deletion ratio of 0.6.
Note that the models were trained on a combination of the SNLI and MultiNLI datasets without labels and evaluated on the STS benchmark using Spearman rank correlation.
CLS and mean pooling worked best, with similar performance. The authors recommend CLS pooling, so we will use that as well.
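
As a rough illustration of deletion noise at a ratio of 0.6, the sketch below drops each word independently with that probability. It is a simplified assumption-laden version: `delete_noise` is a hypothetical helper, it splits on whitespace, and keeping at least one word is a choice made here so the input is never empty. The `DenoisingAutoEncoderDataset` used later may tokenize and sample differently.

```python
import random


def delete_noise(sentence, del_ratio=0.6, rng=random):
    """Drop each whitespace-separated word with probability `del_ratio`."""
    words = sentence.split()
    if not words:
        return sentence
    keep = [rng.random() > del_ratio for _ in words]
    # Assumption of this sketch: never return an empty sentence.
    if not any(keep):
        keep[rng.randrange(len(words))] = True
    return " ".join(word for word, k in zip(words, keep) if k)


rng = random.Random(10)
original = "A young boy standing on a hill that overlooks a village of homes."
for _ in range(3):
    print(delete_noise(original, del_ratio=0.6, rng=rng))
```

The surviving words keep their original order, which is what makes the task a reconstruction problem and not a reordering problem.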

| ![](assets/tsdae_config.png) |
|:--:|
| Fig. 3. Results with different noise types, noise ratios and pooling methods. (Image source: https://arxiv.org/pdf/2104.06979.pdf) |

Here we will use the `sentence_transformers` library to train this architecture. Let's start.

## Data preparation

In [1]:
import random

import datasets as hf_datasets
import numpy as np
import nltk
import pandas as pd
from sentence_transformers import InputExample, SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
import torch
from torch.utils.data import DataLoader

In [2]:
nltk.download("punkt")
pd.set_option("display.max_colwidth", None)

[nltk_data] Downloading package punkt to /home/utsav/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
seed = 10

random.seed(seed)
np.random.seed(seed)

torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

For training, we will use a sample of the `snli` dataset.

In [4]:
dataset = hf_datasets.load_dataset("snli", split="train")
dataset = dataset.filter(lambda _: random.random() > 0.9)

len(dataset), dataset[0]

Reusing dataset snli (/home/utsav/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b)
Loading cached processed dataset at /home/utsav/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-08577eb1924770d3.arrow


(109812,
 {'premise': 'Children smiling and waving at camera',
  'hypothesis': 'They are smiling at their parents',
  'label': 1})

In [5]:
train_sentences = [d["premise"] for d in dataset]
train_sentences = list(set(train_sentences))
len(train_sentences)

77550

In [6]:
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)

Let's look at some examples of the noise added by the `DenoisingAutoEncoderDataset`.

In [7]:
df = pd.DataFrame([train_dataset[i].texts for i in range(20)], columns=["noisy", "original"])
df

| | noisy | original |
|:--:|:--|:--|
| 0 | Two are crossing a while pedestrians them. | Two men are crossing a street carrying a frame, while pedestrians walk around them. |
| 1 | a sleeping on with and dog sitting next him | a man sleeping on a bench outside with a white and black dog sitting next to him. |
| 2 | men volleyball | Why are grown men playing volleyball? |
| 3 | A boy on hill of | A young boy standing on a hill that overlooks a village of homes. |
| 4 | A man a Mohawk | A man with a many colored Mohawk smiling. |
| 5 | swing | A girl rides on a swing. |
| 6 | , white coats, standing | Two women, both wearing white coats, are standing outside a large framed doorway. |
| 7 | man with | From inside building, view of man washing window with tool. |
| 8 | woman glass of | A blond woman pours a glass of wine in a dim restaurant. |
| 9 | Children. | Children are playing with baseball bats. |


In [8]:
batch_size = 16
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

## Model config 

In [9]:
model_name = "bert-base-uncased"

In [10]:
embedding_model = models.Transformer(model_name)
pooling = models.Pooling(embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[embedding_model, pooling], device="cuda")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

In [12]:
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True
)

Epoch: 5/5 [00:00<?, ?it/s]

In [22]:
model.save(f"models/{model_name}-tsdae")

## Evaluation

We will evaluate the trained model on the validation split of the STS benchmark (`stsb` from GLUE) and compare the fine-tuned model against the original `bert-base-uncased` model.

In [23]:
sts = hf_datasets.load_dataset("glue", "stsb", split="validation")
sts = sts.map(lambda x: {"label": x["label"] / 5.0})

len(sts), sts[0]

Using the latest cached version of the module from /home/utsav/.cache/huggingface/modules/datasets_modules/datasets/glue/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad (last modified on Fri Jul 22 22:08:30 2022) since it couldn't be found locally at glue., or remotely on the Hugging Face Hub.
Reusing dataset glue (/home/utsav/.cache/huggingface/datasets/glue/stsb/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)




(1500,
 {'sentence1': 'A man with a hard hat is dancing.',
  'sentence2': 'A man wearing a hard hat is dancing.',
  'label': 1.0,
  'idx': 0})

In [24]:
input_examples = [InputExample(texts=[data["sentence1"], data["sentence2"]], label=data["label"])
                  for data in sts]

In [25]:
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(input_examples, write_csv=False)
evaluator(model)

0.7487081683853725

In [26]:
original_embedding_model = models.Transformer("bert-base-uncased")
pooling = models.Pooling(original_embedding_model.get_word_embedding_dimension(), "cls")

original_model = SentenceTransformer(modules=[original_embedding_model, pooling])
evaluator(original_model)


0.3172617272784805

Our unsupervised fine-tuned model does much better on this semantic similarity dataset than the original `bert-base-uncased` model (`0.75` vs. `0.32`).
It does not perform as well as the supervised Bi-Encoders and Cross-Encoders we trained earlier (where we saw an average Spearman rank correlation of `~0.80`), but that's to be expected.
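
For reference, here is a minimal pure-Python sketch of Spearman rank correlation, assuming the evaluator score is the Spearman correlation between the predicted similarities and the gold labels. The helper names are hypothetical; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


predicted = [0.10, 0.55, 0.40, 0.90]  # toy cosine similarities
gold = [0.0, 0.3, 0.6, 1.0]           # toy normalized labels
print(round(spearman(predicted, gold), 2))
```

Because only the ranks matter, the metric rewards a model for ordering sentence pairs correctly even when its raw similarity values are on a different scale from the labels.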

Let's look at how this model can be used for retrieval tasks.

In [44]:
def get_ranked_docs(model: SentenceTransformer, query: str, corpus_embeds: np.ndarray) -> None:
    # Embed the query and score it against the pre-computed corpus embeddings.
    query_embed = model.encode(query)
    scores = util.cos_sim(query_embed, corpus_embeds)
    print(f"Query - {query}\n---")
    scores = scores.cpu().detach().numpy()[0]
    # Print documents from most to least similar (`corpus` is the global list of texts).
    scores_ix = np.argsort(scores)[::-1]
    for ix in scores_ix:
        print(f"{scores[ix]: >.2f}\t{corpus[ix]}")

In [55]:
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey."
]

query = "A man is eating pasta."

In [56]:
corpus_embeds = model.encode(corpus)
get_ranked_docs(model, query, corpus_embeds)

Query - A man is eating pasta.
---
0.80	A man is eating food.
0.75	A man is eating a piece of bread.
0.53	A man is riding a horse.
0.47	A woman is playing violin.
0.42	A cheetah is running behind its prey.
0.41	A monkey is playing drums.
0.41	A man is riding a white horse on an enclosed ground.
0.39	The girl is carrying a baby.
0.26	Two men pushed carts through the woods.


In [57]:
corpus_embeds = original_model.encode(corpus)
get_ranked_docs(original_model, query, corpus_embeds)

Query - A man is eating pasta.
---
0.98	A man is eating a piece of bread.
0.96	A man is riding a horse.
0.96	A woman is playing violin.
0.96	A man is eating food.
0.95	A monkey is playing drums.
0.91	A man is riding a white horse on an enclosed ground.
0.89	A cheetah is running behind its prey.
0.86	The girl is carrying a baby.
0.85	Two men pushed carts through the woods.


In [58]:
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A woman is practicing jumps with her horse.",
    "A horse is running around the track."
]

query = "Horse jumped over the obstacle."

In [59]:
corpus_embeds = model.encode(corpus)
get_ranked_docs(model, query, corpus_embeds)

Query - Horse jumped over the obstacle.
---
0.61	A horse is running around the track.
0.49	A woman is practicing jumps with her horse.
0.41	Two men pushed carts through the woods.
0.33	A man is eating a piece of bread.
0.32	A man is eating food.
0.28	A woman is playing violin.


In [60]:
corpus_embeds = original_model.encode(corpus)
get_ranked_docs(original_model, query, corpus_embeds)

Query - Horse jumped over the obstacle.
---
0.84	A horse is running around the track.
0.84	Two men pushed carts through the woods.
0.82	A man is eating food.
0.78	A woman is playing violin.
0.76	A man is eating a piece of bread.
0.72	A woman is practicing jumps with her horse.


---

## References

[1] Kexin Wang, Nils Reimers and Iryna Gurevych. "[TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/pdf/2104.06979.pdf)"

[2] [BEIR benchmark](https://arxiv.org/abs/2104.08663)