The purpose of this notebook is to finetune BERT model for article spoiling task. <br>
Use the conda env: eda_env

What do I want to do? <br>
I would like to check pre-trained roberta and its generated text. Input should be tokenized. If output will be not maching, then I would like to fine-tune the model.

Problems encountered: <br>
BERT based models can process the ninput of 512 tokens maximum, the average of article body text is ~1600 words

In [1]:
# Data processing
import pandas as pd
import numpy as np

# Visualisation
import matplotlib.pyplot as plt

# BERT model
from transformers import RobertaTokenizer, RobertaForQuestionAnswering,  BertTokenizer, BertModel, logging, RobertaTokenizerFast
import torch
import nltk
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# VARIABLES
nltk.download('punkt')
RANDOM_STATE = 42 # Random state for reproducibility
logging.set_verbosity_error()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wojom\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Loading data and splitting data.

In [3]:
spoil_df = pd.read_csv("../data/spoiling_data.csv", sep=";")
print(spoil_df.shape)

(4000, 6)


In [4]:
spoil_df.columns

Index(['targetTitle', 'targetParagraphs', 'humanSpoiler', 'spoiler',
       'spoilerPositions', 'tags'],
      dtype='object')

In [5]:
x_train, x_test, y_train, y_test = train_test_split(
    spoil_df.drop(columns=["humanSpoiler", "spoiler"]), 
    spoil_df[["humanSpoiler", "spoiler"]],
    test_size=0.2, 
    random_state=RANDOM_STATE
)

x_test, x_val, y_test, y_val = train_test_split(
    x_test, 
    y_test,
    test_size=0.5,  # 50% of the original x_test size for validation
    random_state=RANDOM_STATE
)

### Loading models.

In [6]:
tokenizer_bert_uncased = BertTokenizer.from_pretrained("bert-base-uncased")
model_bert_uncased = BertModel.from_pretrained("bert-base-uncased")

def calculate_bert_similarity(texts):
    embeddings = []
    for text in texts:
        inputs = tokenizer_bert_uncased(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
        outputs = model_bert_uncased(**inputs)
        embedding = outputs.last_hidden_state.mean(dim=1).squeeze().detach().numpy()
        embeddings.append(embedding)
    
    similarity = np.dot(embeddings[0], embeddings[1]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1]))
    return similarity

### Q&A bm25 with roberta-base

In [7]:
tokenizer_roberta = RobertaTokenizer.from_pretrained('roberta-base')
model_roberta = RobertaForQuestionAnswering.from_pretrained('roberta-base')

In [8]:
def calculate_bm25(query, context):
    sentences = nltk.sent_tokenize(context)
    tokenized_corpus = [word_tokenize(doc.lower()) for doc in sentences]
    bm25 = BM25Okapi(tokenized_corpus)

    tokenized_query = word_tokenize(query.lower())
    scores = bm25.get_scores(tokenized_query)

    scores_idx = [[score, idx] for idx, score in enumerate(scores)]
    scores_idx.sort(reverse=True, key=lambda x: x[0])
    sentences = pd.DataFrame({"sentences": sentences})
    important_sentences = sentences.iloc[[arr[1] for arr in scores_idx[:10]],:]
    important_sentences_list = important_sentences["sentences"].tolist()
    return " ".join(important_sentences_list)

In [9]:
def generate_spoiler(query, context, model=model_roberta, tokenizer=tokenizer_roberta):
    overflowing_tokens_case = 0
    inputs = tokenizer_roberta(query, context, truncation=True, padding=True, max_length=512, return_tensors="pt")
    if len(inputs['input_ids'][0]) == 512:
        overflowing_tokens_case += 1
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        start_index = torch.argmax(outputs.start_logits)
        end_index = torch.argmax(outputs.end_logits)

    answer = tokenizer.decode(inputs['input_ids'][0][start_index:end_index + 1])
    print(f"Overflowing tokens case: {overflowing_tokens_case}")
    return answer

In [10]:
x_train["important_sentences"] = x_train.apply(lambda x: calculate_bm25(x["targetTitle"], x["targetParagraphs"]), axis=1)

AttributeError: 'float' object has no attribute 'lower'

In [None]:
x_train["spoiler_bm25"] = x_train.apply(lambda x: generate_spoiler(x["targetTitle"], x["important_sentences"]), axis=1)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512
Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Number of input tokens:  512 /512


In [None]:
y_train_bm25 = y_train.loc[x_train["spoiler_bm25"] != ""]
x_train_bm25 = x_train.loc[x_train["spoiler_bm25"] != ""]

In [None]:
df_bm25 = pd.concat([x_train_bm25, y_train_bm25], axis=1)
df_bm25.shape # around 1000 rows with empty spoilers 

(1839, 8)

In [331]:
df_bm25["bert_sim"] = df_bm25.apply(lambda x: calculate_bert_similarity([
    x["spoiler_bm25"],
    x["spoiler"]
]), axis=1)

In [332]:
df_bm25["bert_sim"].mean()

0.59938526

### Q&A chunking with roberta-base

In [263]:
def chunk_text(sentences, max_len=512, tokenizer=tokenizer_roberta):
    chunks = []
    chunk = []
    token_count = 0
    
    for sentence in sentences:
        tokens = tokenizer.encode(sentence, add_special_tokens=False)
        token_count += len(tokens)
        
        if token_count > max_len:
            chunks.append(chunk)
            chunk = [sentence]
            token_count = len(tokens)
        else:
            chunk.append(sentence)
    
    if chunk:
        chunks.append(chunk)
    
    return chunks

In [None]:
def generate_chuncked_spoiler(query, context, model=model_roberta, tokenizer=tokenizer_roberta):
    sentences = nltk.sent_tokenize(context)
    chunks = chunk_text(sentences)

    answers = []
    for chunk in chunks:
        context_chunk = " ".join(chunk)
        inputs = tokenizer_roberta(query, context_chunk, truncation=True, padding=True, return_tensors="pt")
        
        model_roberta.eval()
        with torch.no_grad():
            outputs = model_roberta(**inputs)
            start_index = torch.argmax(outputs.start_logits)
            end_index = torch.argmax(outputs.end_logits)
            
            answer_tokens = inputs['input_ids'][0][start_index:end_index + 1]
            answer = tokenizer_roberta.decode(answer_tokens)
            answers.append(answer)

    final_answer = " ".join(answers)
    return final_answer

In [337]:
x_train["spoiler_chuncked"] = x_train.apply(lambda x: generate_chuncked_spoiler(x["targetTitle"], x["targetParagraphs"]), axis=1)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

In [346]:
y_train_chuncked = y_train.loc[x_train["spoiler_chuncked"] != ""]
x_train_chuncked = x_train.loc[x_train["spoiler_chuncked"] != ""]
df_chuncked = pd.concat([x_train_chuncked, y_train_chuncked], axis=1)
df_chuncked.shape

(2184, 9)

In [348]:
df_chuncked["bert_sim"] = df_chuncked.apply(lambda x: calculate_bert_similarity([
    x["spoiler_chuncked"],
    x["spoiler"]
]), axis=1)
df_chuncked["bert_sim"].mean()

0.58484

### Roberta finetuning task

In [37]:
tokenizer_roberta_fast = RobertaTokenizerFast.from_pretrained('roberta-base')
def head_tail_turncation(text, max_length=512, head_length=128, tail_length=382, tokenizer=tokenizer_roberta_fast):
    tokenized = tokenizer(text, truncation=False, return_offsets_mapping=True)
    input_ids = tokenized["input_ids"]
    attention_mask = tokenized["attention_mask"]
    offsets = tokenized["offset_mapping"]

    if len(input_ids) > max_length:
        truncated_input_ids = input_ids[:head_length] + input_ids[-tail_length:]
        truncated_attention_mask = attention_mask[:head_length] + attention_mask[-tail_length:]
        truncated_offsets = offsets[:head_length] + offsets[-tail_length:]
    else:
        truncated_input_ids = input_ids
        truncated_attention_mask = attention_mask
        truncated_offsets = offsets

    return {
        "input_ids": truncated_input_ids,
        "attention_mask": truncated_attention_mask,
        "offset_mapping": truncated_offsets,
    }

In [None]:
# x_train
title = x_train["targetTitle"].iloc[0]
paragraph = x_train["targetParagraphs"].iloc[0]
spoiler = y_train["spoiler"].iloc[0]
x_train

Unnamed: 0,targetTitle,targetParagraphs,tags,spoilerPositions,important_sentences
2597,Why people like Edward Snowden say they will b...,Googles new messenger app is stirring up a deb...,passage,"[[[2, 0], [2, 78]]]",— Edward Snowden (Snowden) — Edward Snowden (S...
1957,Why NASA Is Building An $18 Billion Rocket To ...,One piece of NASA’s massive new rocket NASA / ...,phrase,"[[[8, 33], [8, 44]]]","In its 2017 budget request, NASA asked Congres..."
1926,Justin Bieber Kicked Out Of Hotel In Argentina...,Justin Bieber was allegedly kicked out of a ho...,passage,"[[[1, 0], [1, 149]]]",Justin Bieber was allegedly kicked out of a ho...
1991,"Two players meet in No Man’s Sky, guess what h...",No Mans Sky is finally released in the UK and ...,phrase,"[[[6, 112], [6, 135]]]",Two players have posted proof on Reddit that t...
2807,Ayvani Hope Perez Kidnapping: Teens Mom Was Ar...,Authorities in Georgia have revealed a new lin...,passage,"[[[1, 88], [1, 133]]]",Police records show that Maria Corral and susp...
...,...,...,...,...,...
1095,The Theory To Why Jennifer Anistons Nipples We...,TheLADbible http:www.theladbible.com/ http:www...,passage,"[[[4, 1], [4, 99]]]",I do not recall the episode or particular scen...
1130,Sandra Bullock Googled Herself And This Is Wha...,Sandra Bullock googled herself only to discove...,passage,"[[[0, 66], [0, 109]]]","In conclusion, Sandra Bullock is our imaginary..."
1294,Former CEO of a $33 billion fast-food company ...,During his tenure as CEO of Yum Brands from 19...,multi,"[[[0, 28], [0, 38]], [[0, 58], [0, 69]], [[4, ...",It\s why he tells all 20-something is just sta...
860,The No. 1 Because Men Care About The Most,"Women give more charity than men do, but that ...",phrase,"[[[1, 196], [1, 231]]]",That was the most popular because across all a...


: 

In [42]:
truncated = head_tail_turncation(title + paragraph)

In [44]:
print("Truncated Input IDs:", truncated["input_ids"])
print("Truncated Offset Mapping:", truncated["offset_mapping"])
print(truncated["attention_mask"])

Truncated Input IDs: [0, 7608, 82, 101, 7393, 23339, 224, 51, 40, 13978, 1204, 17, 27, 29, 8946, 11203, 1553, 11478, 2154, 1634, 92, 34697, 1553, 16, 21881, 62, 10, 2625, 59, 549, 451, 197, 146, 670, 23061, 10, 6814, 1905, 13, 1434, 4, 36, 1251, 43, 1204, 42, 186, 585, 10, 92, 11203, 1553, 19, 670, 23061, 6, 3099, 14, 110, 4372, 1395, 28, 30537, 11251, 4, 125, 89, 16, 10, 2916, 35, 370, 33, 7, 1004, 15, 14, 1905, 2512, 4, 20, 2903, 39996, 17, 27, 29, 563, 7, 1709, 404, 139, 42, 1035, 396, 23061, 30, 6814, 34, 4777, 885, 37414, 3633, 31, 103, 5666, 4, 2381, 2154, 1634, 568, 7, 33022, 253, 12, 560, 12, 1397, 23061, 30, 6814, 11, 63, 92, 849, 3684, 139, 7359, 1553, 6, 2626, 394, 9, 806, 23, 5, 13468, 933, 10261, 18823, 4, 5900, 15797, 6, 13, 1246, 6, 115, 109, 103, 9, 5, 5774, 15, 5, 2187, 4, 125, 6, 37, 26, 6, 24, 74, 28, 1202, 7, 1950, 609, 22680, 7, 6267, 396, 5, 476, 9, 1204, 17, 27, 29, 6063, 13282, 6, 61, 74, 240, 7, 192, 5, 542, 46706, 22680, 4, 38, 109, 45, 206, 5, 806, 16, 89, 64