The purpose of this notebook is to fine-tune a BERT-based model for the article spoiling task. <br>
Use the conda env: eda_env

What do I want to do? <br>
I would like to check the pre-trained RoBERTa model and the text it generates. The input must be tokenized. If the output does not match the expected spoilers, I would like to fine-tune the model.

Problems encountered: <br>
BERT-based models can process an input of at most 512 tokens, while the average article body is ~1600 words long.

In [270]:
# Data processing
import pandas as pd
import numpy as np

# Visualisation
import matplotlib.pyplot as plt

# Data splitting
from sklearn.model_selection import train_test_split

# BERT model
from transformers import RobertaTokenizer, RobertaForQuestionAnswering, BertTokenizer, BertModel
import torch
import nltk
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize

In [271]:
# VARIABLES
nltk.download('punkt')
RANDOM_STATE = 42 # Random state for reproducibility

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wojom\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Loading and splitting data.

In [272]:
spoil_df = pd.read_csv("../data/spoiling_data.csv", sep=";")
print(spoil_df.shape)

(3358, 6)


In [273]:
spoil_df.columns

Index(['targetTitle', 'targetParagraphs', 'humanSpoiler', 'spoiler', 'tags',
       'spoilerPositions'],
      dtype='object')

In [274]:
x_train, x_test, y_train, y_test = train_test_split(
    spoil_df.drop(columns=["humanSpoiler", "spoiler"]), 
    spoil_df[["humanSpoiler", "spoiler"]],
    test_size=0.2, 
    random_state=RANDOM_STATE
)

x_test, x_val, y_test, y_val = train_test_split(
    x_test, 
    y_test,
    test_size=0.5,  # 50% of the original x_test size for validation
    random_state=RANDOM_STATE
)
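
The two calls above give roughly an 80/10/10 train/test/validation split. A small sketch of the resulting row counts, assuming scikit-learn rounds a fractional `test_size` up and gives the remainder to the first returned part; `split_sizes` is a hypothetical helper, not part of the notebook:

```python
import math

def split_sizes(n_rows, test_size):
    """Return (first_part, second_part) sizes for a fractional test_size."""
    n_second = math.ceil(n_rows * test_size)  # assumed rounding rule
    return n_rows - n_second, n_second

# 3358 rows is the shape printed when the CSV was loaded.
n_train, n_holdout = split_sizes(3358, 0.2)
# The second call splits the hold-out part in half (test / validation).
n_test, n_val = split_sizes(n_holdout, 0.5)
print(n_train, n_test, n_val)
```

Checking `len(x_train)`, `len(x_test)` and `len(x_val)` directly is the reliable way to confirm these numbers.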

### Loading models.

In [None]:
tokenizer_bert_uncased = BertTokenizer.from_pretrained("bert-base-uncased")
model_bert_uncased = BertModel.from_pretrained("bert-base-uncased")

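def calculate_bert_similarity(texts):
    """Cosine similarity between the mean-pooled BERT embeddings of two texts."""
    embeddings = []
    for text in texts:
        inputs = tokenizer_bert_uncased(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
        # Inference only, so skip gradient tracking
        with torch.no_grad():
            outputs = model_bert_uncased(**inputs)
        # Mean-pool the token embeddings into a single vector
        embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
        embeddings.append(embedding)

    # Cosine similarity of the two pooled embeddings
    similarity = np.dot(embeddings[0], embeddings[1]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1]))
    return similarity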
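
To make the two steps in `calculate_bert_similarity` concrete (mean pooling, then cosine similarity), here is a toy sketch in plain Python. The 2-dimensional "token vectors" are made up for illustration, and `mean_pool` and `cosine` are hypothetical helper names:

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

text_a = [[1.0, 0.0], [0.0, 1.0]]  # two tokens, pooled to [0.5, 0.5]
text_b = [[2.0, 2.0]]              # same direction as pooled text_a
text_c = [[1.0, -1.0]]             # orthogonal to pooled text_a

sim_ab = cosine(mean_pool(text_a), mean_pool(text_b))
sim_ac = cosine(mean_pool(text_a), mean_pool(text_c))
```

Cosine similarity ignores vector length, so `text_b` scores as identical in direction to `text_a` even though its magnitude differs.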

### Q&A bm25 with roberta-base

In [278]:
tokenizer_roberta = RobertaTokenizer.from_pretrained('roberta-base')
model_roberta = RobertaForQuestionAnswering.from_pretrained('roberta-base')

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [301]:
def calculate_bm25(query, context):
    # Split the article into sentences and tokenize each one for BM25
    sentences = nltk.sent_tokenize(context)
    tokenized_corpus = [word_tokenize(doc.lower()) for doc in sentences]
    bm25 = BM25Okapi(tokenized_corpus)

    # Score every sentence against the query (the article title)
    tokenized_query = word_tokenize(query.lower())
    scores = bm25.get_scores(tokenized_query)

    # Keep the 10 highest-scoring sentences, ordered by descending score
    top_idx = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)[:10]
    return " ".join(sentences[idx] for idx in top_idx)
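
For intuition about what the sentence scores mean, below is a minimal BM25 sketch in plain Python. It uses one common smoothed IDF variant with `k1=1.5` and `b=0.75` as assumed parameters, so its numbers may differ from what `rank_bm25` returns; `bm25_scores` is a hypothetical name:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized sentence against the query with a basic BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many sentences each term occurs.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1; b normalizes by sentence length.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

sentences = [
    "the actor reveals the secret ending",
    "fans waited for years",
    "tickets go on sale friday",
]
corpus = [s.split() for s in sentences]
scores = bm25_scores("secret ending".split(), corpus)
best = max(range(len(sentences)), key=scores.__getitem__)
```

Sentences that share no term with the title score exactly zero, which is why a short title can leave many sentences tied at the bottom of the ranking.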

In [302]:
x_train["important_sentences"] = x_train.apply(lambda x: calculate_bm25(x["targetTitle"], x["targetParagraphs"]), axis=1)
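
Ten sentences can still exceed the 512-token limit once the title is prepended. One possible alternative is to select sentences greedily by score until a token budget is used up, and then restore document order. The sketch below counts tokens by whitespace splitting as a stand-in for the real tokenizer; `select_within_budget` and its parameters are hypothetical:

```python
def select_within_budget(sentences, scores, budget,
                         count_tokens=lambda s: len(s.split())):
    """Pick the highest-scoring sentences that fit the budget, in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for idx in ranked:
        cost = count_tokens(sentences[idx])
        if used + cost <= budget:
            chosen.append(idx)
            used += cost
    # Re-sort by position so the context reads in its original order.
    return [sentences[i] for i in sorted(chosen)]

sentences = ["a b c", "d e", "f g h i", "j"]
scores = [0.2, 0.9, 0.8, 0.1]
picked = select_within_budget(sentences, scores, budget=6)
```

With the RoBERTa tokenizer, `count_tokens` could be `lambda s: len(tokenizer_roberta.encode(s, add_special_tokens=False))`, and the budget should leave room for the title and the special tokens.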

In [None]:
def generate_spoiler(query, context, model=model_roberta, tokenizer=tokenizer_roberta):
    # Use the tokenizer passed as an argument rather than the global one
    inputs = tokenizer(query, context, truncation=True, padding=True, max_length=512, return_tensors="pt")
    if len(inputs['input_ids'][0]) == 512:
        print("Number of input tokens: ", len(inputs['input_ids'][0]), "/512")
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        start_index = torch.argmax(outputs.start_logits)
        end_index = torch.argmax(outputs.end_logits)

    # The slice is empty (and so is the answer) when end_index < start_index
    answer = tokenizer.decode(inputs['input_ids'][0][start_index:end_index + 1])
    return answer
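
Taking the two `argmax` values independently can produce an end index that lies before the start index, in which case the decoded answer is an empty string. This is one plausible source of empty spoilers. A sketch of a joint search over valid spans follows, using toy logits; `best_span` and `max_answer_len=30` are assumptions for illustration:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start + end logits with start <= end."""
    best = (0, 0, float("-inf"))
    for start, start_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best

start_logits = [0.1, 0.2, 3.0, 0.0]
end_logits = [2.5, 0.1, 0.3, 1.0]
# Independent argmax would give start=2, end=0, which is an empty slice.
start, end, score = best_span(start_logits, end_logits)
```

In the notebook, the logits would come from `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. A production version would also exclude positions that belong to the question or to special tokens.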

In [309]:
x_train["spoiler_bm25"] = x_train.apply(lambda x: generate_spoiler(x["targetTitle"], x["important_sentences"]), axis=1)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


In [None]:
y_train_bm25 = y_train.loc[x_train["spoiler_bm25"] != ""]
x_train_bm25 = x_train.loc[x_train["spoiler_bm25"] != ""]

In [None]:
df_bm25 = pd.concat([x_train_bm25, y_train_bm25], axis=1)
df_bm25.shape # around 1000 rows with empty spoilers 

(1839, 8)

In [331]:
df_bm25["bert_sim"] = df_bm25.apply(lambda x: calculate_bert_similarity([
    x["spoiler_bm25"],
    x["spoiler"]
]), axis=1)

In [332]:
df_bm25["bert_sim"].mean()

0.59938526

### Q&A chunking with roberta-base

In [263]:
def chunk_text(sentences, max_len=512, tokenizer=tokenizer_roberta):
    chunks = []
    chunk = []
    token_count = 0
    
    for sentence in sentences:
        tokens = tokenizer.encode(sentence, add_special_tokens=False)
        token_count += len(tokens)
        
        # Start a new chunk once the limit is exceeded; never flush an empty chunk
        if token_count > max_len and chunk:
            chunks.append(chunk)
            chunk = [sentence]
            token_count = len(tokens)
        else:
            chunk.append(sentence)
    
    if chunk:
        chunks.append(chunk)
    
    return chunks
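
A chunk of exactly 512 tokens leaves no room for the title and the special tokens that the tokenizer adds to each pair, so the end of every full chunk gets truncated. A sketch of the same greedy packing with a reserved budget follows, counting tokens by whitespace as a stand-in for the real tokenizer; `chunk_sentences` and `reserved` are hypothetical names:

```python
def chunk_sentences(sentences, max_len, reserved=0,
                    count_tokens=lambda s: len(s.split())):
    """Greedily pack sentences into chunks of at most max_len - reserved tokens."""
    budget = max_len - reserved
    chunks, current, used = [], [], 0
    for sentence in sentences:
        cost = count_tokens(sentence)
        # Close the current chunk if this sentence would overflow it.
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(sentence)  # an oversized sentence still gets its own chunk
        used += cost
    if current:
        chunks.append(current)
    return chunks

sentences = ["a b c", "d e", "f g h i", "j"]
chunks = chunk_sentences(sentences, max_len=8, reserved=3)
```

For the real model, `reserved` would be the token length of the title plus the special tokens of a sequence pair, which can be measured by encoding the title together with an empty context.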

In [None]:
def generate_chuncked_spoiler(query, context, model=model_roberta, tokenizer=tokenizer_roberta):
    sentences = nltk.sent_tokenize(context)
    chunks = chunk_text(sentences, tokenizer=tokenizer)

    model.eval()
    answers = []
    for chunk in chunks:
        context_chunk = " ".join(chunk)
        # max_length keeps query + chunk within the model's 512-token limit
        inputs = tokenizer(query, context_chunk, truncation=True, padding=True, max_length=512, return_tensors="pt")

        with torch.no_grad():
            outputs = model(**inputs)
            start_index = torch.argmax(outputs.start_logits)
            end_index = torch.argmax(outputs.end_logits)

        answer_tokens = inputs['input_ids'][0][start_index:end_index + 1]
        answers.append(tokenizer.decode(answer_tokens))

    # Concatenate the answers found in the individual chunks
    final_answer = " ".join(answers)
    return final_answer
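
Joining the answers from all chunks concatenates unrelated spans into one string. A possible alternative is to keep a score for each chunk's answer (for example the sum of its start and end logits) and return only the best non-empty one. The sketch below uses made-up answers and scores, and `pick_best_answer` is a hypothetical helper:

```python
def pick_best_answer(candidates):
    """Return the non-empty answer with the highest score, or '' if none."""
    valid = [(score, answer) for answer, score in candidates if answer.strip()]
    if not valid:
        return ""
    return max(valid, key=lambda pair: pair[0])[1]

# (answer, score) pairs, one per chunk; the empty answer is ignored
# even though its score is the highest.
candidates = [("in 2019", 4.2), ("", 9.0), ("the director", 6.1)]
best = pick_best_answer(candidates)
```

Logit sums from different chunks are not strictly comparable, so this is a heuristic rather than a calibrated choice.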

In [337]:
x_train["spoiler_chuncked"] = x_train.apply(lambda x: generate_chuncked_spoiler(x["targetTitle"], x["targetParagraphs"]), axis=1)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

In [346]:
y_train_chuncked = y_train.loc[x_train["spoiler_chuncked"] != ""]
x_train_chuncked = x_train.loc[x_train["spoiler_chuncked"] != ""]
df_chuncked = pd.concat([x_train_chuncked, y_train_chuncked], axis=1)
df_chuncked.shape

(2184, 9)

In [348]:
df_chuncked["bert_sim"] = df_chuncked.apply(lambda x: calculate_bert_similarity([
    x["spoiler_chuncked"],
    x["spoiler"]
]), axis=1)
df_chuncked["bert_sim"].mean()

0.58484