The purpose of this notebook is to fine-tune a BERT-based model for the article spoiling task. <br>
Use the conda env: eda_env

What do I want to do? <br>
I would like to check the pre-trained RoBERTa model and the text it produces. The input must be tokenized. If the output does not match the expected spoiler, I will fine-tune the model.

Problems encountered: <br>
BERT-based models can process at most 512 input tokens, while the average article body is ~1600 words long.
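
One common way around the 512-token limit is to split the token sequence into overlapping windows. The sketch below is only an illustration: `sliding_windows`, `window_size` and `stride` are hypothetical names, and a list of integers stands in for real token ids (1600 words would usually produce more than 1600 subword tokens).

```python
def sliding_windows(tokens, window_size=512, stride=384):
    """Split a token list into overlapping windows of at most window_size tokens."""
    if not 0 < stride <= window_size:
        raise ValueError("stride must be in (0, window_size]")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        # Stop once the current window reaches the end of the sequence
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows


# Stand-in "tokens": 1600 integers, roughly one article's worth of words
tokens = list(range(1600))
windows = sliding_windows(tokens)
print(len(windows), [len(w) for w in windows])  # 4 [512, 512, 512, 448]
```

Consecutive windows overlap by `window_size - stride` tokens, so an answer that straddles a window boundary still appears whole in at least one window, provided it is shorter than the overlap.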

In [None]:
# Data processing
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Visualisation
import matplotlib.pyplot as plt

# BERT model
from transformers import RobertaTokenizer, RobertaForQuestionAnswering,  BertTokenizer, BertModel
import torch
import nltk
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize

In [62]:
# VARIABLES
nltk.download('punkt')
RANDOM_STATE = 42 # Random state for reproducibility

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wojom\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Loading and splitting data

In [19]:
spoil_df = pd.read_csv("../data/spoiling_data.csv", sep=";")
print(spoil_df.shape)

(3358, 6)


In [20]:
spoil_df.columns

Index(['targetTitle', 'targetParagraphs', 'humanSpoiler', 'spoiler', 'tags',
       'spoilerPositions'],
      dtype='object')

In [22]:
x_train, x_test, y_train, y_test = train_test_split(
    spoil_df.drop(columns=["humanSpoiler", "spoiler"]), 
    spoil_df[["humanSpoiler", "spoiler"]],
    test_size=0.2, 
    random_state=RANDOM_STATE
)

x_test, x_val, y_test, y_val = train_test_split(
    x_test, 
    y_test,
    test_size=0.5,  # 50% of the original x_test size for validation
    random_state=RANDOM_STATE
)
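
The two splits above should give roughly an 80/10/10 train/test/validation partition. The arithmetic can be checked with a small sketch; `split_sizes` is a hypothetical helper, and it assumes that the fractional part passed as `test_size` is rounded up, which is my understanding of how `train_test_split` behaves.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_first, n_second) for a fractional test_size, rounding the second part up."""
    n_second = math.ceil(test_size * n_rows)
    return n_rows - n_second, n_second


# 3358 rows, as printed by spoil_df.shape above
n_train, n_holdout = split_sizes(3358, 0.2)
n_test, n_val = split_sizes(n_holdout, 0.5)
print(n_train, n_test, n_val)
```

Comparing these numbers with `len(x_train)`, `len(x_test)` and `len(x_val)` is a cheap sanity check that no rows were lost between the two calls.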

### Loading models.

In [209]:
tokenizer_bert_uncased = BertTokenizer.from_pretrained("bert-base-uncased")
model_bert_uncased = BertModel.from_pretrained("bert-base-uncased")

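def calculate_bert_similarity(texts):
    """Return the cosine similarity between the mean-pooled BERT embeddings of two texts."""
    embeddings = []
    for text in texts:
        # Texts longer than the model's input limit are truncated
        inputs = tokenizer_bert_uncased(text, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():  # inference only, no gradients needed
            outputs = model_bert_uncased(**inputs)
        # Mean-pool the token embeddings into a single vector per text
        embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
        embeddings.append(embedding)

    # Cosine similarity of the two embedding vectors
    similarity = np.dot(embeddings[0], embeddings[1]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1]))
    return similarity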
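
To make the two steps in `calculate_bert_similarity` concrete, here is a toy version with hand-written vectors instead of BERT outputs. `mean_pool` and `cosine_similarity` are illustrative names, and the 2-dimensional "token embeddings" are invented for the example.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Two "texts": one with two tokens, one with a single token
text_a = mean_pool([[1.0, 0.0], [0.0, 1.0]])  # -> [0.5, 0.5]
text_b = mean_pool([[2.0, 2.0]])              # -> [2.0, 2.0]

same_direction = cosine_similarity(text_a, text_b)
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Cosine similarity ignores vector length and depends only on direction, so `text_a` and `text_b` score close to 1 even though their magnitudes differ, while the orthogonal pair scores 0.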

### Q&A with RoBERTa

In [None]:
tokenizer_roberta = RobertaTokenizer.from_pretrained('roberta-base')
model_roberta = RobertaForQuestionAnswering.from_pretrained('roberta-base')

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [252]:
query = x_train.iloc[1]["targetTitle"]
context = x_train.iloc[1]["targetParagraphs"]

In [256]:
sentences = nltk.sent_tokenize(context)
tokenized_corpus = [word_tokenize(doc.lower()) for doc in sentences]
bm25 = BM25Okapi(tokenized_corpus)

tokenized_query = word_tokenize(query.lower())
scores = bm25.get_scores(tokenized_query)

scores_idx = [[score, idx] for idx, score in enumerate(scores)]
scores_idx.sort(reverse=True, key=lambda x: x[0])
sentences = pd.DataFrame(sentences)
important_sentences = sentences.iloc[[arr[1] for arr in scores_idx[:10]],:]

In [257]:
context = ""
for sentence in important_sentences[0]:
    context += str(sentence) + " "
context

'In its 2017 budget request, NASA asked Congress for $1.3 billion to build its next jumbo rocket. It is happened before, Garver said, with the the $100 billion International Space Station. Building SLS let us NASA keep its options open if the next president decides to look to lunar landings instead, something that Obama seemed to rule out in a 2010 speech. The SLS — a rocket to nowhere, as Cowing put it — fits this pattern neatly because it provides thousands of jobs in space states. Might SpaceX privately building a big rocket in the next two years, and sending something to Mars, make SLS look redundant to politicians, finally ending NASA’s long cycle of rocket-building jobs programs? Last week, despite years of fighting with the Obama Administration over its plans to explore an asteroid with the rocket, the Senate Appropriations Committee not only granted the request, but gave the space agency an extra $995 million to build it. One piece of NASA’s massive new rocket NASA / Via nasa.g
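
The cells above join the selected sentences in descending BM25 score order, so the condensed context no longer follows the order of the article. A possible variant, sketched here with a hypothetical helper `top_k_in_document_order` and made-up scores, keeps the same top-k sentences but restores their original positions.

```python
def top_k_in_document_order(sentences, scores, k=10):
    """Keep the k highest-scoring sentences, returned in their original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # back to document order
    return [sentences[i] for i in keep]


# Toy data: the scores are invented, not real BM25 output
toy_sentences = ["A.", "B.", "C.", "D."]
toy_scores = [0.1, 1.5, 0.0, 2.0]

selected = top_k_in_document_order(toy_sentences, toy_scores, k=2)
print(" ".join(selected))  # B. D.
```

Whether document order helps the downstream model is an open question for this task; the sketch only shows how to make the comparison possible.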

In [None]:
inputs = tokenizer_roberta(query, context, truncation=True, padding=True, max_length=512, return_tensors="pt")
print("Number of input tokens: ", len(inputs['input_ids'][0]), "\nShould be less than 512")
model_roberta.eval()
with torch.no_grad():
    outputs = model_roberta(**inputs)
    start_index = torch.argmax(outputs.start_logits)
    end_index = torch.argmax(outputs.end_logits)

answer = tokenizer_roberta.decode(inputs['input_ids'][0][start_index:end_index + 1])
print("Extracted Spoiler:", answer)

Number of input tokens:  374 
Should be less than 512
Extracted Spoiler:  a rocket to nowhere, as Cowing put it — fits this pattern neatly because it provides thousands of jobs in space states. Might SpaceX privately building a big rocket in the next two years, and sending something to Mars, make SLS look redundant to politicians, finally ending NASA’s long cycle of rocket-building jobs programs? Last week, despite years of fighting with the Obama Administration over its plans to explore an asteroid with the rocket, the Senate Appropriations Committee not only granted the request, but
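
Taking `argmax` of the start and end logits independently can yield an end position before the start, or a very long span like the one printed above. A common post-processing step is to search for the best-scoring pair that satisfies `start <= end` and a length cap. The sketch below uses plain Python lists in place of the logit tensors; `best_span`, `max_answer_len` and `n_best` are hypothetical names. Note that, per the warning printed when loading `roberta-base`, the QA head is newly initialized, so this only constrains the shape of the span and does not make the prediction meaningful.

```python
def best_span(start_logits, end_logits, max_answer_len=30, n_best=20):
    """Return (start, end, score) with the highest start+end logit among valid spans."""
    top_starts = sorted(range(len(start_logits)),
                        key=lambda i: start_logits[i], reverse=True)[:n_best]
    top_ends = sorted(range(len(end_logits)),
                      key=lambda i: end_logits[i], reverse=True)[:n_best]
    best = None
    for s in top_starts:
        for e in top_ends:
            # Skip spans that end before they start or are too long
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best  # None if no valid span exists


# Toy logits: independent argmax would pick start=2, end=1 (an empty span)
toy_start = [0.1, 0.2, 5.0, 0.3, 4.0]
toy_end = [0.0, 6.0, 0.1, 3.0, 0.2]

span = best_span(toy_start, toy_end)
print(span)  # (2, 3, 8.0)
```

With real model outputs, `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()` could be passed in; a fuller version would also exclude positions that belong to the question rather than the context.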


In [262]:
bert_sim = calculate_bert_similarity([
    answer,
    y_train.iloc[1]["humanSpoiler"]
])
print("BERT Similarity Score: ", bert_sim)

BERT Similarity Score:  0.86259645


### Q&A with chunking using roberta-base

In [263]:
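def chunk_text(sentences, max_len=512, tokenizer=tokenizer_roberta):
    """Group consecutive sentences into chunks of at most max_len context tokens.

    max_len counts context tokens only; the query and special tokens are added
    later, so the tokenizer may still truncate the end of a full chunk.
    """
    chunks = []
    chunk = []
    token_count = 0

    for sentence in sentences:
        tokens = tokenizer.encode(sentence, add_special_tokens=False)

        # Start a new chunk when this sentence would exceed the limit; the
        # `chunk` check avoids an empty chunk for an overlong first sentence
        if chunk and token_count + len(tokens) > max_len:
            chunks.append(chunk)
            chunk = []
            token_count = 0

        chunk.append(sentence)
        token_count += len(tokens)

    if chunk:
        chunks.append(chunk)

    return chunks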
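
`chunk_text` budgets `max_len` tokens for the context alone, but the model input also contains the question and the special tokens. A possible refinement is to subtract those from the budget first. In the sketch below, `chunk_with_budget`, `n_special` and `count_tokens` are hypothetical; the default counter splits on whitespace purely so the example runs without a tokenizer, and `n_special=4` assumes RoBERTa's pair format `<s> question </s></s> context </s>`.

```python
def chunk_with_budget(sentences, query, max_len=512, n_special=4, count_tokens=None):
    """Group sentences into chunks that leave room for the query and special tokens."""
    if count_tokens is None:
        # Stand-in counter (assumption): whitespace words instead of subword tokens
        count_tokens = lambda text: len(text.split())

    budget = max_len - n_special - count_tokens(query)
    if budget <= 0:
        raise ValueError("query leaves no room for context")

    chunks, chunk, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Flush the current chunk when this sentence would exceed the budget
        if chunk and used + n > budget:
            chunks.append(chunk)
            chunk, used = [], 0
        chunk.append(sentence)
        used += n
    if chunk:
        chunks.append(chunk)
    return chunks


# Tiny example: 12-token limit, 4 special tokens, 3-word query -> 5 tokens per chunk
toy_chunks = chunk_with_budget(
    ["a b c", "d e", "f g h", "i"], query="why build it", max_len=12
)
print(toy_chunks)  # [['a b c', 'd e'], ['f g h', 'i']]
```

With the real tokenizer, `count_tokens=lambda t: len(tokenizer_roberta.encode(t, add_special_tokens=False))` could be passed in. Per-sentence counts can differ slightly from the count of the joined text, so keeping `truncation=True` as a safety net still seems sensible.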

In [264]:
context = x_train.iloc[1]["targetParagraphs"]
query = x_train.iloc[1]["targetTitle"]

sentences = nltk.sent_tokenize(context)
chunks = chunk_text(sentences)

model_roberta.eval()  # inference mode; set once before the loop

answers = []
for chunk in chunks:
    context_chunk = " ".join(chunk)
    inputs = tokenizer_roberta(query, context_chunk, truncation=True, padding=True, max_length=512, return_tensors="pt")

    with torch.no_grad():
        outputs = model_roberta(**inputs)
        start_index = torch.argmax(outputs.start_logits)
        end_index = torch.argmax(outputs.end_logits)

    # Decode the predicted span for this chunk (empty if end_index < start_index)
    answer_tokens = inputs['input_ids'][0][start_index:end_index + 1]
    answer = tokenizer_roberta.decode(answer_tokens)
    answers.append(answer)

final_answer = " ".join(answers)
print(f"Final Answer: {final_answer}")

Final Answer:  to explore an asteroid with the rocket, the Senate Appropriations Committee not only granted the request, but  to Mars. Eric Berger, a reporter at Ars Technica decried the move to build the rocket ahead of a destination. Cuts also came to NASA’s space technology funding, a $130 million bite meant to keep aloft earth-observing satellites run out of Goddard Space Flight Center, based in Mikulski’s state of Maryland. As Berger said, the motivation seems primarily to be keeping people employed. One problem with this approach is that you might end up with space rocket that does not have any space technology — a habitat module, deep-space propulsion, landers — to actually let astronauts voyage to asteroids, or the moon, or Mars. You might spend billions on workers who maintain the rocket you have built, and not have any money left over to build the interplanetary survival gear needed to ever get to Mars, the ultimate goal of the space agency. It is happened before, Garver said
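
Joining the answers from every chunk produces a long passage rather than a single spoiler. One alternative, sketched here as a heuristic only, is to keep a score per chunk (for example the sum of the chosen start and end logits) and return the highest-scoring non-empty answer. `pick_best_answer` and the scores below are invented for illustration, and logits from different chunks are not strictly comparable.

```python
def pick_best_answer(candidates):
    """Return the non-empty answer text with the highest score.

    candidates: list of (answer_text, score) pairs, one per chunk.
    """
    non_empty = [(text.strip(), score) for text, score in candidates if text.strip()]
    if not non_empty:
        return ""
    return max(non_empty, key=lambda pair: pair[1])[0]


# Invented scores; an empty answer is ignored even if its score is high
toy_candidates = [(" to Mars.", 3.2), ("", 9.0), (" Garver said", 4.5)]
best_answer = pick_best_answer(toy_candidates)
print(best_answer)  # Garver said
```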

In [265]:
bert_sim = calculate_bert_similarity([
    final_answer,
    y_train.iloc[1]["humanSpoiler"]
])
print("BERT Similarity Score: ", bert_sim)

BERT Similarity Score:  0.8618836
