# DeCLUTR
We present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations, a self-supervised method for learning universal sentence embeddings that transfer to a wide variety of natural language processing (NLP) tasks. Our objective leverages recent advances in deep metric learning (DML) and has the advantage of being conceptually simple and easy to implement, requiring no specialized architectures or labelled training data. We demonstrate that our objective can be used to pretrain transformers to state-of-the-art performance on SentEval, a popular benchmark for evaluating universal sentence embeddings, outperforming existing supervised, semi-supervised and unsupervised methods. We perform extensive ablations to determine which factors contribute to the quality of the learned embeddings. Our code will be publicly available and can be easily adapted to new datasets or used to embed unseen text. 
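To make "contrastive learning" concrete, here is a minimal sketch of a generic InfoNCE-style loss in plain Python. It is an illustration under assumptions, not DeCLUTR's training code: the function names, the toy vectors, and the `temperature` value are all hypothetical. The loss is small when the anchor is closer to its positive than to the negatives.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative assumption, not the library's code).

    Computes -log(exp(s(a, p) / t) / sum_k exp(s(a, k) / t)), where k ranges
    over the positive and all negatives.
    """
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

# Toy 2-d "embeddings"
anchor = [1.0, 0.0]
loss_good = contrastive_loss(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_bad = contrastive_loss(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(loss_good, loss_bad)
```

In the first call the positive is the nearest vector to the anchor, so its loss is lower than in the second call, where the nearest vector is a negative.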

!pip install git+https://github.com/JohnGiorgi/DeCLUTR.git

In [1]:
import torch
import pandas as pd
from declutr import Encoder
pretrained_model_or_path = "declutr-small"


if torch.cuda.is_available():
    device = torch.device("cuda")
    cuda_device = torch.cuda.current_device()
else:
    device = torch.device("cpu")
    cuda_device = -1

In [2]:
### ------------------------- DATA -------------------------------------

DATA_PATH = "../data/final_tweets/"

train_df = pd.read_csv(DATA_PATH + 'train_df.csv')
valid_df = pd.read_csv(DATA_PATH + 'validate_df.csv')
test_df = pd.read_csv(DATA_PATH + 'test_df.csv')

In [4]:
X_train = train_df['tweet_text']
Y_train = train_df['text_info']

X_valid = valid_df['tweet_text']
Y_valid = valid_df['text_info']

train_full_df = pd.concat([X_train, X_valid])
Y_train_val  = pd.concat([Y_train, Y_valid])

In [19]:
tweet_text = list(X_valid.values)
tweet10 = tweet_text[0:10]

In [10]:
encoder = Encoder(pretrained_model_or_path, cuda_device=cuda_device)
embeddings = encoder(tweet10)

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
Y_valid[0:10]

0    0
1    0
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: text_info, dtype: int64

In [20]:
tweet10

['finally turntable running irma to celebrate i thought i play dancin tunes haim',
 'couple blow away irma amp harvey schluter happy 75th wedding anniversary',
 'state tab irma already rising flapol',
 'rt looking ice food hot meals irma here find',
 '5 am edt sep 27 key messages tropical storm maria',
 'hurricane insurance claims faq answered experts maria irma',
 'cyclone mora adds rohingya plight bangladesh',
 'rt we survived once we can survive again hurricaneharvey',
 'prepare mobile command vehicles deployment assist hurricane maria',
 'donate underwear harvey victims get free cup coffee cobb']

In [21]:
import torch
from scipy.spatial.distance import cosine

from transformers import AutoModel, AutoTokenizer

# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-small")
model = AutoModel.from_pretrained("johngiorgi/declutr-small")
model = model.to(device)

# Prepare some text to embed
text = ['finally turntable running irma to celebrate i thought i play dancin tunes haim',
 'couple blow away irma amp harvey schluter happy 75th wedding anniversary',
 'state tab irma already rising flapol',
 'rt looking ice food hot meals irma here find',
 '5 am edt sep 27 key messages tropical storm maria',
 'hurricane insurance claims faq answered experts maria irma',
 'cyclone mora adds rohingya plight bangladesh',
 'rt we survived once we can survive again hurricaneharvey',
 'prepare mobile command vehicles deployment assist hurricane maria',
 'donate underwear harvey victims get free cup coffee cobb']


inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
# Put the tensors on the GPU, if available
for name, tensor in inputs.items():
    inputs[name] = tensor.to(model.device)

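# Embed the text
with torch.no_grad():
    # Index the output instead of tuple-unpacking it: indexing works whether the
    # model returns a plain tuple or a ModelOutput object (newer transformers versions)
    sequence_output = model(**inputs, output_hidden_states=False)[0]

# Mean pool the token-level embeddings to get sentence-level embeddings,
# using the attention mask so that padding tokens do not contribute
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdim=True), min=1e-9)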
embeddings = embeddings.cpu()

# Compute semantic similarity as 1 minus the cosine distance
# (scipy's `cosine` returns a distance, not a similarity)
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
print(semantic_sim)

Some weights of the model checkpoint at johngiorgi/declutr-small were not used when initializing RobertaModel: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


0.496265172958374
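The mean-pooling step in the cell above can be hard to read in tensor form. The following is a minimal pure-Python sketch of the same idea on a toy batch, assuming nested lists in place of tensors. The helper name `masked_mean_pool` and the toy values are hypothetical. Padded positions have mask 0 and are excluded from the average.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of each sequence, ignoring padded positions.

    token_embeddings: list of sequences, each a list of token vectors.
    attention_mask:   list of sequences of 1 (real token) or 0 (padding).
    """
    pooled = []
    for vectors, mask in zip(token_embeddings, attention_mask):
        dim = len(vectors[0])
        total = [0.0] * dim
        for vec, m in zip(vectors, mask):
            for i in range(dim):
                total[i] += vec[i] * m  # padded tokens contribute zero
        count = max(sum(mask), 1e-9)  # guard against division by zero
        pooled.append([t / count for t in total])
    return pooled

# Toy batch: two sequences padded to length 3, embedding size 2
toy_tokens = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
]
toy_mask = [[1, 1, 0], [1, 1, 1]]

pooled = masked_mean_pool(toy_tokens, toy_mask)
print(pooled)  # [[2.0, 3.0], [4.0, 2.0]]
```

Without the mask, the padding vector `[9.0, 9.0]` would pull the first sentence embedding away from the average of its real tokens.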


In [24]:
semantic_sim = 1 - cosine(embeddings[0], embeddings[3])
print(semantic_sim)

0.46503373980522156


In [17]:
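# Note: this cell's execution count (17) predates the cell that defines the current
# `embeddings` (In [21]), so the output below reflects an earlier state of `embeddings`
semantic_sim = 1 - cosine(embeddings[0], embeddings[3])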
print(semantic_sim)

0.541486382484436


In [18]:
embeddings[0].shape

torch.Size([768])
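The cells above compare one pair of tweets at a time. A possible next step is to compute all pairwise similarities at once. This is a minimal sketch in plain Python, assuming embeddings are given as lists of floats. The helper names and toy vectors are hypothetical and stand in for the real 768-dimensional embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_similarity(vectors):
    """Return the full similarity matrix as a nested list."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy stand-ins for sentence embeddings
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim = pairwise_similarity(toy_embeddings)

for row in sim:
    print([round(x, 3) for x in row])
```

With the tensor from the notebook, the same function could be called as `pairwise_similarity(embeddings.tolist())`. The resulting matrix could then be compared with the `text_info` labels to see whether tweets sharing a label tend to be more similar.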