<a href="https://colab.research.google.com/github/ounospanas/AIDL_B_CS01/blob/main/Retrieving_Similar_News_Posts_with_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune RoBERTa on STS-B

In [1]:
!pip install datasets transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting urllib3!=1

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# download dataset
raw_datasets = load_dataset("glue", "stsb")

# define transformer and tokenizer
checkpoint = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# set a tokenization function
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# apply tokenization to data
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Downloading and preparing dataset glue/stsb to /root/.cache/huggingface/datasets/glue/stsb/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...

Generating train split: 5749 examples
Generating validation split: 1500 examples
Generating test split: 1379 examples

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/stsb/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.
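`DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than padding the whole dataset to one global length. The following is a minimal pure-Python sketch of that idea, not the library's implementation. The helper name `pad_batch` and the toy token ids are made up for illustration, and the padding id of 1 is assumed to match RoBERTa's `<pad>` token.

```python
def pad_batch(sequences, pad_id=1):
    """Pad variable-length token id lists to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# toy batch: three "tokenized" examples of different lengths (ids are illustrative)
batch = pad_batch([[0, 250, 313, 2], [0, 250, 2], [0, 250, 313, 848, 39, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch keeps short batches short, which is why the collator is passed to the `DataLoader` instead of padding inside `tokenize_function`.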

In [3]:
# change dataset's column names
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'attention_mask']

In [4]:
# define the train/eval dataloaders

from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

In [5]:
# retrieve pretrained model and set num of labels to 1 (it is a regression task)
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=1)

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias', 'roberta.pooler.dense.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif
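
With `num_labels=1`, the sequence classification head is treated as a regression head, and when labels are passed the model returns a mean squared error loss between its single logit and the gold similarity score. A minimal pure-Python sketch of that loss follows. The helper name `mse_loss` and the toy values are illustrative assumptions; the model computes the same quantity on tensors.

```python
def mse_loss(predictions, targets):
    """Mean squared error between predicted and gold similarity scores."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# toy batch: STS-B gold scores lie in [0, 5]
predicted = [4.5, 1.0, 3.0]
gold = [5.0, 0.0, 3.0]
loss = mse_loss(predicted, gold)
print(loss)
```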

In [6]:
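# define an optimizer, a learning-rate scheduler and the number of epochs
# torch.optim.AdamW replaces the deprecated transformers.AdamW;
# eps and weight_decay are set to match the old defaults
from torch.optim import AdamW
from transformers import get_scheduler

optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-6, weight_decay=0.0)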
num_epochs = 3

num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

2157
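
The printed value follows from the split size and the batch size: 5749 training pairs in batches of 8 give 719 batches per epoch (the last one partial), and 3 epochs give 2157 steps. The sketch below reproduces that arithmetic and shows how a linear schedule without warmup scales the learning rate. `linear_lr` is a hypothetical helper written for illustration, not the `transformers` implementation.

```python
import math

train_examples = 5749   # size of the STS-B train split reported above
batch_size = 8
num_epochs = 3
base_lr = 1e-5

steps_per_epoch = math.ceil(train_examples / batch_size)  # last batch is partial
num_training_steps = num_epochs * steps_per_epoch


def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, num_training_steps)
print(linear_lr(0, num_training_steps, base_lr))                   # start of training
print(linear_lr(num_training_steps, num_training_steps, base_lr))  # end of training
```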




In [7]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

In [8]:
# train loop
# takes around 15 minutes 
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [9]:
# run inference to get the eval scores
from datasets import load_metric

metric = load_metric("glue", "stsb")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

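    # the regression head returns shape (batch, 1); flatten to (batch,) to match the labels
    logits = outputs.logits.squeeze(-1)
    metric.add_batch(predictions=logits, references=batch["labels"])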

metric.compute()

{'pearson': 0.9243598203217201, 'spearmanr': 0.921807147785083}
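
The two numbers reported for STS-B are the Pearson and Spearman correlations between predicted and gold scores. As a rough illustration of what they measure, here is a dependency-free sketch. The helper names are hypothetical and the toy scores are invented; a real evaluation should rely on the metric implementation used above.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


gold = [0.0, 1.5, 2.5, 4.0, 5.0]
predicted = [0.3, 1.1, 3.0, 3.6, 4.9]
print(pearson(predicted, gold), spearman(predicted, gold))
```

Pearson rewards predictions that are linearly related to the gold scores, while Spearman only asks that the ordering of the pairs is preserved.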

In [10]:
# store model
torch.save(model.state_dict(), 'roberta_stsb.pt')

In [11]:
# load model
model.load_state_dict(torch.load('roberta_stsb.pt'))

<All keys matched successfully>

# Download example dataset

In [12]:
# install library
! pip install -q kaggle

In [13]:
# import files class to upload files to colab
from google.colab import files

In [14]:
# upload kaggle.json
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"pkasnesis","key":"<redacted>"}'}

In [15]:
# Create the ~/.kaggle directory, copy kaggle.json into it and restrict its permissions.
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

In [16]:
# Download news category dataset and unzip to news folder
! kaggle datasets download 'rmisra/news-category-dataset'
! mkdir news
! unzip news-category-dataset.zip  -d news

Downloading news-category-dataset.zip to /content
 60% 16.0M/26.5M [00:00<00:00, 47.6MB/s]
100% 26.5M/26.5M [00:00<00:00, 73.9MB/s]
Archive:  news-category-dataset.zip
  inflating: news/News_Category_Dataset_v3.json  


In [18]:
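# Parse the JSON Lines file (one JSON object per line) into a list of dicts

import json

list_ = []
# use names that do not shadow `files` imported from google.colab above
with open('news/News_Category_Dataset_v3.json') as f:
    for line in f:
        list_.append(json.loads(line))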

In [19]:
# Convert list to dataframe

import pandas as pd
data = pd.DataFrame(list_)
data.head()

|   | link | headline | category | short_description | authors | date |
|---|------|----------|----------|-------------------|---------|------|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |


In [20]:
# get description column

descriptions = data['short_description']

# Create sentence embeddings with Sentence Transformers (SRoBERTa)

In [21]:
!pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: file

In [22]:
from sentence_transformers import SentenceTransformer

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [23]:
embedder = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')

In [24]:
# create embeddings (takes around 22 min) and store them
# encode() expects a list of strings, so convert the pandas Series first
post_embeddings = embedder.encode(descriptions.tolist())
np.save('post_embeddings.npy', post_embeddings)

# Retrieve k most similar posts/news with cosine similarity

In [25]:
# add an example post and get the embeddings

input_post = 'A man killed his wife'
input_emb = embedder.encode(input_post)

In [26]:
%%timeit
# time the cosine similarity between the input post embedding and all stored embeddings

cosine_similarity([input_emb], post_embeddings)

902 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
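
Cosine similarity is the dot product of two vectors divided by the product of their norms, so when the stored embeddings are reused across many queries they can be normalized once, and each query then reduces to a matrix-vector product. The sketch below illustrates this with random stand-in vectors. `top_k_cosine` is a hypothetical helper, the array shapes are toy values, and no speed-up over the timing above has been measured.

```python
import numpy as np


def top_k_cosine(query, unit_matrix, k):
    """Return (indices, similarities) of the k rows most similar to `query`.

    `unit_matrix` must already have L2-normalized rows.
    """
    query = query / np.linalg.norm(query)
    sims = unit_matrix @ query                  # cosine similarity per row
    top = np.argpartition(sims, -k)[-k:]        # k largest, in no particular order
    top = top[np.argsort(sims[top])[::-1]]      # order from most to least similar
    return top, sims[top]


rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(1000, 16))    # stand-in for post_embeddings
unit_embeddings = toy_embeddings / np.linalg.norm(toy_embeddings, axis=1, keepdims=True)

# a query that is a slightly perturbed copy of row 42 should retrieve row 42 first
toy_query = toy_embeddings[42] + 0.01 * rng.normal(size=16)
indices, sims = top_k_cosine(toy_query, unit_embeddings, k=5)
print(indices, sims)
```

`np.argpartition` avoids a full sort of all similarities when only the top k are needed; only the k selected entries are sorted afterwards.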


In [27]:
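# function for retrieving the k most similar news posts based on their textual similarity (SRoBERTa)

def get_highest_similarity(embedding, post_embeddings, highest=32):
    '''
    highest: how many relevant posts to retrieve

    Returns a dict mapping each post index (as a string) to a one-element
    list holding its cosine similarity, ordered from least to most similar.
    '''
    # similarity between the query and every stored embedding, shape (1, n_posts)
    text_similarities = cosine_similarity([embedding], post_embeddings)

    # indices of the `highest` largest similarities, in ascending order
    high_txt = np.argsort(text_similarities)[0, -highest:]
    sim_txt = text_similarities[0, high_txt]

    highest_texts = {}
    for idx, sim in zip(high_txt, sim_txt):
        highest_texts[str(idx)] = [sim]

    return highest_texts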

In [28]:
# get 32 most similar ones
highest_texts = get_highest_similarity(input_emb, post_embeddings, highest=32)

# Batch, tokenize and run inference using the RoBERTa-large model fine-tuned on STS-B

In [29]:
# store them pairwise in a list to be fed to the tokenizer
k_similar_posts = []

for i in highest_texts.keys():
    print(highest_texts[i], data.iloc[int(i)]['short_description'])
    k_similar_posts.append([input_post, data.iloc[int(i)]['short_description']])

[0.5756899] The actor's death was ruled a suicide.
[0.57667387] The suspect was found dead of a self-inflicted gunshot wound.
[0.5769217] The suspect is accused of taking the life of an elderly man who just happened to cross his path.
[0.5779034] The suspect called his mother before killing himself.
[0.5807812] A grand jury indicted a 73-year-old man on Thursday for the alleged murder of his first wife more than 50 years ago who he
[0.58272517] "Kill me!" suspect says.
[0.5829412] Authorities say Kevin Janson Neal killed his wife late Monday before going on a shooting spree the following day.
[0.5830883] She left her husband. He killed their children. Just another day in America.
[0.5849714] Police are searching for the child's father over the woman's death.
[0.58602935] Michael Stasko allegedly shot wife and daughter dead before turning gun on himself.
[0.5865575] Love is dead.
[0.5866624] The man allegedly did this while sitting between his wife and the victim.
[0.5870414] A body fou

In [30]:
# tokenize the (input post, candidate post) pairs
tokenized_similar_posts = tokenizer(k_similar_posts, padding=True,
                                    truncation=True, return_tensors='pt')
print(tokenized_similar_posts)

# move the input tensors to the same device as the model
tokenized_similar_posts = tokenized_similar_posts.to(device)

{'input_ids': tensor([[  0, 250, 313,  ...,   1,   1,   1],
        [  0, 250, 313,  ...,   1,   1,   1],
        [  0, 250, 313,  ...,   1,   1,   1],
        ...,
        [  0, 250, 313,  ...,   1,   1,   1],
        [  0, 250, 313,  ...,   1,   1,   1],
        [  0, 250, 313,  ...,   1,   1,   1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}


In [31]:
# run inference with the RoBERTa-large model fine-tuned on STS-B
model.eval()
with torch.no_grad():
    outputs = model(**tokenized_similar_posts)

In [32]:
# print the pair that the fine-tuned model scores highest; it differs from
# SRoBERTa's top result and looks more relevant to the input post
k_similar_posts[np.argmax(outputs.logits.cpu().numpy())]

['A man killed his wife',
 'A California businessman is in custody on suspicion of murdering his wife, who has been missing since Wednesday. The couple']
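
The notebook follows a retrieve-then-rerank pattern: a fast embedding comparison narrows the corpus to a short list, and the slower pair-wise model rescoring picks the best match from it. A schematic of that control flow is sketched below, with toy scoring functions standing in for SRoBERTa and the fine-tuned RoBERTa. Every name in it is hypothetical, and the word-overlap scores exist only to make the example runnable.

```python
def retrieve_then_rerank(query, corpus, cheap_score, costly_score, k):
    """Shortlist k candidates with a cheap scorer, then rerank them with a costly one."""
    shortlist = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:k]
    return sorted(shortlist, key=lambda doc: costly_score(query, doc), reverse=True)


def word_overlap(a, b):
    """Toy stand-in for the embedding similarity: number of shared words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def jaccard(a, b):
    """Toy stand-in for the pair-wise model: shared words over all words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


corpus = [
    "a man was arrested after his wife was found dead at home last night",
    "a man killed his wife police say",
    "the weather will be sunny tomorrow",
    "his wife said the man was kind",
]
ranked = retrieve_then_rerank("a man killed his wife", corpus, word_overlap, jaccard, k=3)
print(ranked[0])
```

The costly scorer only ever sees `k` candidates, which is what makes it affordable to run a full cross-encoder pass per query, as done above with 32 candidates.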