# Clickbait Challenge at SemEval 2023 - Clickbait Spoiling

Task 1 on Spoiler Type Classification: The input is the clickbait post and the linked document. The task is to classify the spoiler type that the clickbait post warrants ("phrase", "passage", or "multi"). For each input, an output of the form `{"uuid": "<UUID>", "spoilerType": "<SPOILER-TYPE>"}` has to be generated, where `<SPOILER-TYPE>` is phrase, passage, or multi.
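
As an illustration of this output format, the sketch below serializes one prediction as a JSON object. The helper name `format_prediction` and the example UUID are hypothetical and not part of the task tooling.

```python
import json

VALID_SPOILER_TYPES = {"phrase", "passage", "multi"}

def format_prediction(uuid, spoiler_type):
    """Return one task 1 output line (hypothetical helper)."""
    if spoiler_type not in VALID_SPOILER_TYPES:
        raise ValueError(f"unknown spoiler type: {spoiler_type}")
    return json.dumps({"uuid": uuid, "spoilerType": spoiler_type})

# "example-uuid" is a placeholder, not a real dataset identifier
line = format_prediction("example-uuid", "phrase")
print(line)
```

Writing one such line per input yields a file with one JSON object per line.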
    
For each entry in the training and validation dataset, the following fields are available:

* uuid: The uuid of the dataset entry.
* postText: The text of the clickbait post which is to be spoiled.
* **targetParagraphs**: The main content of the linked web page to classify the spoiler type ***(task 1)*** and to generate the spoiler (task 2). Consists of the paragraphs of manually extracted main content.
* **targetTitle**: The title of the linked web page to classify the spoiler type ***(task 1)*** and to generate the spoiler (task 2).
* targetUrl: The URL of the linked web page.
* humanSpoiler: The human generated spoiler (abstractive) for the clickbait post from the linked web page. This field is only available in the training and validation dataset (not during test).
* spoiler: The human extracted spoiler for the clickbait post from the linked web page. This field is only available in the training and validation dataset (not during test).
* spoilerPositions: The position of the human extracted spoiler for the clickbait post from the linked web page. This field is only available in the training and validation dataset (not during test).
* **tags**: The spoiler type (might be "phrase", "passage", or "multi") that is to be classified in ***task 1*** (spoiler type classification). For task 1, this field is only available in the training and validation dataset (not during test). For task 2, this field is always available and can be used.

Some fields contain additional metainformation about the entry but are unused: postId, postPlatform, targetDescription, targetKeywords, targetMedia.

## Deep models

### Training process

Importing the required libraries and defining helper functions

In [4]:
from datasets import load_dataset,Dataset,DatasetDict
from transformers import DataCollatorWithPadding,AutoModelForSequenceClassification, Trainer, TrainingArguments,AutoTokenizer,AutoModel,AutoConfig
from transformers.modeling_outputs import TokenClassifierOutput
import torch
import torch.nn as nn
import pandas as pd

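def create_df_from_jsonl(path):
    # Each line of the file holds one JSON entry
    df = pd.read_json(path, lines=True)
    # postText, targetParagraphs and tags are lists of strings; join each into one string
    df['input'] = df['postText'].apply(', '.join) + '. ' + df['targetParagraphs'].apply(', '.join)
    df['label'] = df['tags'].apply(', '.join)
    return df[['input', 'label']]

# Show full cell contents when displaying DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)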
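
To make the effect of `create_df_from_jsonl` concrete, here is a minimal sketch that applies the same joining logic to a single record without pandas. The record contents and the helper name `build_row` are made up for illustration.

```python
# Hypothetical record with only the fields that create_df_from_jsonl reads
entry = {
    "postText": ["You won't believe what happened next"],
    "targetParagraphs": ["First paragraph.", "Second paragraph."],
    "tags": ["passage"],
}

def build_row(entry):
    """Mirror the per-row logic of create_df_from_jsonl for one record."""
    text = ", ".join(entry["postText"]) + ". " + ", ".join(entry["targetParagraphs"])
    label = ", ".join(entry["tags"])
    return {"input": text, "label": label}

row = build_row(entry)
print(row["input"])
print(row["label"])
```

The post text and the paragraphs end up in one string, which is what the tokenizer later receives as `input`.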

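Loading the data and preprocessing it: entries tagged "multi" are dropped, and the remaining labels are encoded as phrase = 1 and passage = 0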

In [5]:
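# Training split: drop "multi" entries and encode the label as phrase = 1, passage = 0
train_df = create_df_from_jsonl('data/train.jsonl')
train_df = train_df[train_df.label != "multi"].copy()
# astype(int) keeps the labels integer-valued even when get_dummies returns booleans
train_df['label'] = pd.get_dummies(train_df['label'], drop_first=True)['phrase'].astype(int)

# The validation split serves as the held-out test set, with the same preprocessing
test_df = create_df_from_jsonl('data/validation.jsonl')
test_df = test_df[test_df.label != "multi"].copy()
test_df['label'] = pd.get_dummies(test_df['label'], drop_first=True)['phrase'].astype(int)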

train_df = Dataset.from_pandas(train_df)
test_df = Dataset.from_pandas(test_df)

train_df = train_df.remove_columns(['__index_level_0__'])
test_df = test_df.remove_columns(['__index_level_0__'])
data = DatasetDict({
    'train': train_df,
    'test': test_df})
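
The label encoding used in this cell can be sketched in plain Python. The label list below is a made-up example. The explicit mapping is an equivalent stand-in for the `get_dummies` call, not the notebook's own code.

```python
# Hypothetical label column as produced by create_df_from_jsonl
labels = ["phrase", "passage", "multi", "passage", "phrase"]

# Drop "multi", as the filter on train_df and test_df does
kept = [label for label in labels if label != "multi"]

# get_dummies orders the categories alphabetically and drop_first removes the
# first one ("passage"), so the remaining "phrase" column is 1 for "phrase"
# entries and 0 for "passage" entries
encoded = [1 if label == "phrase" else 0 for label in kept]
print(encoded)  # [1, 0, 0, 1]
```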


Downloading the pretrained model and using its tokenizer to tokenize the data.
The following models were trained:
- bert-base-uncased with batch_size equal to 4
- microsoft/deberta-base-mnli with batch_size equal to 1
- textattack/roberta-base-MNLI with batch_size equal to 4

In [6]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.model_max_length = 512

def tokenize(batch):
    return tokenizer(batch["input"], truncation=True,max_length=512)

tokenized_dataset = data.map(tokenize, batched=True)
tokenized_dataset.set_format("torch",columns=["input_ids", "attention_mask", "label"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)




Defining the model structure and the functions that will be used for training and evaluating the model

In [7]:
class CustomModel(nn.Module):
    def __init__(self, checkpoint, num_labels):
        super().__init__()
        self.num_labels = num_labels

        # Load the pretrained body (encoder without a task head) for the given checkpoint
        config = AutoConfig.from_pretrained(checkpoint, output_attentions=True, output_hidden_states=True)
        self.model = AutoModel.from_pretrained(checkpoint, config=config)
        self.dropout = nn.Dropout(0.1)
        # Classification head, randomly initialized; 768 is the hidden size of the base checkpoints used here
        self.classifier = nn.Linear(768, num_labels)

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        # Extract outputs from the body
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)

        # Apply dropout to the last hidden state (outputs[0])
        sequence_output = self.dropout(outputs[0])

        # Classify from the representation of the first token ([CLS] for BERT)
        logits = self.classifier(sequence_output[:, 0, :].view(-1, 768))

        # Compute the cross-entropy loss when labels are provided
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        # TokenClassifierOutput serves here only as a generic output container
        return TokenClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions)
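
In equation form, the forward pass and loss defined above can be summarized as follows. The symbols are introduced here for illustration and do not appear in the notebook: $\mathbf{h}_0$ is the last-layer hidden state of the first token, $W$ and $\mathbf{b}$ are the parameters of `self.classifier`, $y$ is the gold label, and $K$ is `num_labels`.

$$
\mathbf{z} = W\,\mathrm{Dropout}(\mathbf{h}_0) + \mathbf{b},
\qquad
\mathcal{L} = -\log \frac{\exp(z_y)}{\sum_{k=0}^{K-1} \exp(z_k)}
$$

`nn.CrossEntropyLoss` averages this per-example loss over the batch by default.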

Setting the hyperparameters for the training process

In [9]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model=CustomModel(checkpoint=checkpoint,num_labels=2).to(device)
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_dataset["train"], shuffle=True, batch_size=4, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_dataset["test"], shuffle=True, batch_size=4, collate_fn=data_collator
)
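from transformers import get_scheduler

# torch.optim.AdamW stands in for the AdamW class that transformers deprecated;
# weight_decay=0.0 matches the default of the transformers implementation
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)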

num_epochs = 10
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
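# Note: load_metric is deprecated in newer releases of datasets, where
# evaluate.load('f1') from the separate evaluate package takes its place
from datasets import load_metric
metric = load_metric('f1')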

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
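
The `"linear"` schedule configured above, with `num_warmup_steps=0`, decays the learning rate from 5e-5 to zero over the training steps. Below is a minimal pure-Python sketch of that rule. The function name `linear_lr` and the run length of 1000 steps are assumptions for illustration, since the actual step count depends on the dataset size.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` updates under linear warmup and linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Hypothetical run length: 10 epochs of 100 batches each
total_steps = 10 * 100
for step in (0, 500, 1000):
    print(step, linear_lr(step, 5e-5, total_steps))
```

With no warmup, the rate starts at its full value, reaches half of it at the midpoint, and hits zero at the final step.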


Training the model

In [None]:
from tqdm.auto import tqdm

progress_bar_train = tqdm(range(num_training_steps))
progress_bar_eval = tqdm(range(num_epochs * len(eval_dataloader)))


for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar_train.update(1)

    model.eval()
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])
        progress_bar_eval.update(1)
    
    print(metric.compute())
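
The score printed after each epoch is an F1 value. As a rough sketch of what it measures, the function below computes binary F1 with label 1 (phrase) as the positive class, which is assumed here to match the default behaviour of the `f1` metric. The function name and the example predictions are made up.

```python
def binary_f1(predictions, references, pos_label=1):
    """F1 score of the positive class for two equally long label sequences."""
    pairs = list(zip(predictions, references))
    tp = sum(p == pos_label and r == pos_label for p, r in pairs)
    fp = sum(p == pos_label and r != pos_label for p, r in pairs)
    fn = sum(p != pos_label and r == pos_label for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical predictions and gold labels (1 = phrase, 0 = passage)
preds = [1, 0, 1, 1, 0]
refs = [1, 0, 0, 1, 1]
f1 = binary_f1(preds, refs)
print(f1)
```

In this example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.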

      

### Evaluating the model

All trained models are saved on disk at the following paths:
- output/roberta
- output/deberta
- output/bert-best

In [10]:
model = AutoModelForSequenceClassification.from_pretrained("output/bert-best")
from transformers import pipeline
evaluate_model = pipeline('text-classification', model=model, tokenizer=tokenizer)

In [11]:
text1 = "This Is How Many People Police Have Killed So Far In 2016In the first half of 2016, police have killed 532 people — many of whom were unarmed, mentally ill, and people of color. Going by the Going by the Guardian’s count , 261 white people were killed by police — the highest total out of any racial group. But data also shows that black people and Native Americans are being killed at higher rates than any other group. The slight discrepancies in numbers between Killed by Police and The Guardian reflect differences in how each outlet collects data about police killings. Killed by Police is mainly open-sourced and also relies on The slight discrepancies in numbers between Killed by Police and The Guardian reflect differences in how each outlet collects data about police killings. Killed by Police is mainly open-sourced and also relies on corporate news reports for its data on people killed by police. For its database, The Guardian relies on traditional reporting on police reports and witness statements, while also culling data from verified crowdsourced information using regional news outlets, research groups, and reporting projects that include Killed by Police. There has always been a high volume of police killings, although damning videos, photos, and news reports highlight officer violence — especially against people of color — now more than ever. But what’s become an even more alarming trend is the number of officers involved in these killings who receive minor to no punishment. According to the According to the Wall Street Journal , 2015 saw the highest number of police officers being charged for deadly, on-duty shootings in a decade: 12 as of September 2015. Still, in a year when approximately 1,200 people were killed by police, zero officers were convicted of murder or manslaughter, painting the picture that officers involved in killing another person will not be held accountable for their actions."
text0 ="Videos show the most delightful protest everAustralians know how to protest. Hundreds of people gathered Saturday local time at Parliament House in Canberra to make their way down a hill in a mass protest roll. The government plans to build a security fence to block access to the hill and other capital grounds. Protesters opposed to the fence rolled down the grassy slope just as many visitors to Parliament House often do. Even dogs got in on the democratic action. The event was organized by Lester Yao, an architect, on Facebook and delightful videos of the roll-a-thon were shared widely on social media. It was only going to be about 20 friends and families, and now we had more than 600 or 700 people, Yao told the Sydney Morning Herald. Unfortunately, kids might not be able to do this again and they're just enjoying themselves. The fence became a matter of debate after demonstrators breached security at Parliament House earlier this year. Lawmakers had even tossed around the idea of digging a moat around the slope, but that was sanely rejected."

Testing the model on sample validation data

In [12]:
evaluate_model(text0)

[{'label': 'LABEL_0', 'score': 0.9987004995346069}]

In [13]:
evaluate_model(text1)

[{'label': 'LABEL_1', 'score': 0.9718432426452637}]
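
The pipeline reports generic label names. Assuming the saved model keeps the training-time encoding (phrase = 1, passage = 0), they can be mapped back to spoiler types as sketched below. The mapping and the helper name `to_spoiler_type` are illustrative and not stored in the model configuration shown here.

```python
# Assumed mapping, following the training-time encoding (phrase = 1, passage = 0)
PIPELINE_LABEL_TO_SPOILER_TYPE = {"LABEL_0": "passage", "LABEL_1": "phrase"}

def to_spoiler_type(pipeline_output):
    """Map the top pipeline prediction to a spoiler type (hypothetical helper)."""
    return PIPELINE_LABEL_TO_SPOILER_TYPE[pipeline_output[0]["label"]]

# Outputs copied from the two cells above
out0 = [{'label': 'LABEL_0', 'score': 0.9987004995346069}]
out1 = [{'label': 'LABEL_1', 'score': 0.9718432426452637}]
print(to_spoiler_type(out0))  # passage
print(to_spoiler_type(out1))  # phrase
```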