# SciBERT Multi-class Classification with Global Context

This notebook fine-tunes SciBERT as a multi-class citation classifier (one of six labels per citation) using global context features: the title and abstract of the cited paper, combined with the citation context. The process includes data loading, preprocessing, tokenization, model setup, training, and evaluation on a validation split.

## Import Libraries

We start by importing all necessary libraries for data handling, model training, and evaluation.

In [None]:
# Import required libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoConfig, EarlyStoppingCallback
import evaluate
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch

## Device Setup

Check if a GPU is available and set the device for PyTorch accordingly.

In [None]:
# Set device to the first GPU if available, otherwise use CPU
if torch.cuda.is_available():
    device = torch.device("cuda:0")  # Select the first GPU
    torch.cuda.set_device(device)
    print(f"Using GPU: {torch.cuda.get_device_name(device)}")
else:
    device = torch.device("cpu")
    print(f"Using CPU: {device}")

## Load and Prepare Data

Load the ACT2 dataset, keeping only the columns needed here, and rename `citation_class_label` to `labels`, the column name the Trainer passes to the model as the target.

In [None]:
# Load dataset and rename label column
df_act2_full = pd.read_csv('ACT2_dataset.tsv', sep='\t', usecols=['cited_title','cited_abstract','citation_context', 'unique_id','citation_class_label'])
df_act2_full.rename(columns={'citation_class_label': 'labels'}, inplace=True)

## Data Cleaning

Fill missing values in the title and abstract columns with empty strings.

In [None]:
# Fill missing values
df_act2_full['cited_title'] = df_act2_full['cited_title'].fillna('')
df_act2_full['cited_abstract'] = df_act2_full['cited_abstract'].fillna('')

## Feature Engineering

Concatenate the title, abstract, and citation context into a single input string for the model.

In [None]:
# Create input string for the model
df_act2_full['input_model'] = df_act2_full['cited_title'] + " " + df_act2_full['cited_abstract'] +  " [ES_SEP] " + df_act2_full['citation_context']
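
As a minimal sketch of what this concatenation produces, the toy frame below (invented rows, not ACT2 data) builds `input_model` the same way. It also illustrates one thing worth checking: only the title and abstract are filled in the cleaning step, so if any `citation_context` were missing, the concatenated input for that row would be `NaN` rather than a string.

```python
import pandas as pd

# Toy rows for illustration only; the values are invented, not taken from ACT2.
toy = pd.DataFrame({
    "cited_title": ["A title", ""],
    "cited_abstract": ["An abstract.", ""],
    # The second context is missing, as read_csv would represent an empty cell.
    "citation_context": ["We follow #CITATION_TAG here.", float("nan")],
})

# Same concatenation as in the notebook cell above.
toy["input_model"] = (
    toy["cited_title"] + " " + toy["cited_abstract"]
    + " [ES_SEP] " + toy["citation_context"]
)

print(toy["input_model"].tolist())

# Count rows whose input became missing because the context was missing.
n_missing = int(toy["input_model"].isna().sum())
print(f"Rows with missing input: {n_missing}")
```

Running the same `isna().sum()` check on `df_act2_full['input_model']` would show whether the real data needs `citation_context` filled as well.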

## Split Data

Hold out the first 1000 rows as the test set and use the remaining rows for training. The split is positional, so it depends on the row order of the file.

In [None]:
# Split by position: first 1000 rows for testing, the rest for training
df_act2_test = df_act2_full.iloc[:1000].copy()
df_act2_train = df_act2_full.iloc[1000:].copy()

## Train/Validation Split

Further split the training data into train and validation sets, stratified by label.

In [None]:
# Stratified split for train and validation
train_df, val_df = train_test_split(
    df_act2_train,
    test_size=0.2,
    stratify=df_act2_train['labels'],
    random_state=42
)
print(f"Train data size: {len(train_df)}")
print(f"Validation data size: {len(val_df)}")
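
To illustrate what `stratify` does, the sketch below runs the same kind of split on an invented, imbalanced label column and compares class shares in the two parts. The toy frame and its 80/20 class ratio are assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented labels: 80 examples of class 0 and 20 of class 1.
toy = pd.DataFrame({"labels": [0] * 80 + [1] * 20})

toy_train, toy_val = train_test_split(
    toy,
    test_size=0.2,
    stratify=toy["labels"],
    random_state=42,
)

# With stratification, both parts keep the original 80/20 class ratio.
train_share = toy_train["labels"].value_counts(normalize=True).sort_index()
val_share = toy_val["labels"].value_counts(normalize=True).sort_index()
print(train_share.to_dict())
print(val_share.to_dict())
```

The same `value_counts(normalize=True)` comparison can be applied to `train_df` and `val_df` to confirm that the six citation classes are represented in similar proportions.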

## Convert to Hugging Face Datasets

Convert the pandas DataFrames to Hugging Face `Dataset` objects, grouped in a `DatasetDict`, so that all splits can be tokenized with a single `map` call.

In [None]:
# Convert to Hugging Face datasets
from datasets import Dataset, DatasetDict
dataset = DatasetDict({
    'train': Dataset.from_pandas(train_df, preserve_index=False),
    'validation': Dataset.from_pandas(val_df, preserve_index=False),
    'test': Dataset.from_pandas(df_act2_test, preserve_index=False)
})

## Tokenizer Setup

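Initialize the SciBERT tokenizer and register `[ES_SEP]` and `#CITATION_TAG` as additional tokens so that they are kept whole instead of being split into word pieces.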

In [None]:
# Initialize tokenizer and add custom tokens
model_name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_tokens(["[ES_SEP]", "#CITATION_TAG"])

## Tokenization Example

Tokenize a sample input to verify the tokenizer setup.

In [None]:
# Tokenize a single training example and inspect the result
sample = train_df.iloc[0]
text = sample["input_model"]
tokens = tokenizer.tokenize(text)
print(text)
print(tokens)

## Tokenize the Dataset

Define a tokenization function that pads or truncates every input to 512 tokens, and apply it to all three splits.

In [None]:
# Tokenization function
def tokenize(batch):
    return tokenizer(batch["input_model"], padding="max_length", truncation=True, max_length=512)

# Tokenize all splits
tokenized_datasets = dataset.map(tokenize, batched=True)
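
Because the citation context is appended after the title and abstract, right-side truncation (the tokenizer default) drops tokens from the end of the sequence first. The sketch below uses a plain token list as a stand-in for SciBERT tokenization and a hypothetical helper, `truncate_right`, to show how a long abstract can push the context past the limit. It is a simplified illustration, not the tokenizer's actual algorithm.

```python
def truncate_right(tokens, max_length):
    """Keep the first max_length tokens, mimicking right-side truncation."""
    return tokens[:max_length]

# Invented example: an abstract that alone exceeds the 512-token limit.
long_abstract = ["word"] * 600
context = ["cited", "in", "#CITATION_TAG"]
tokens = ["title"] + long_abstract + ["[ES_SEP]"] + context

kept = truncate_right(tokens, max_length=512)

# The separator and everything after it were cut off.
context_survives = "[ES_SEP]" in kept
print(len(kept), context_survives)
```

Whether this matters in practice depends on the length distribution of the real inputs, which can be checked by tokenizing without truncation and counting sequences longer than 512 tokens.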

## Label Mapping

Define the mapping between label IDs and label names.

In [None]:
# Label mappings
id2label = {
    0: "Background",
    1: "Compares_contrasts",
    2: "Extension",
    3: "Future",
    4: "Motivation",
    5: "Uses"
}
label2id = {v: k for k, v in id2label.items()}
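
The model's output index `i` is reported as `id2label[i]`, so the integer values in the `labels` column need to follow the same 0 to 5 encoding. That the dataset uses this encoding is an assumption here. A short sanity check of the two dictionaries could look like this:

```python
id2label = {
    0: "Background",
    1: "Compares_contrasts",
    2: "Extension",
    3: "Future",
    4: "Motivation",
    5: "Uses",
}
label2id = {name: idx for idx, name in id2label.items()}

# IDs should be contiguous and start at 0, matching the classifier's output indices.
assert sorted(id2label) == list(range(len(id2label)))

# The two mappings should be exact inverses of each other.
assert all(label2id[id2label[idx]] == idx for idx in id2label)

print(label2id["Uses"])
```

A comparable check on the data would be to confirm that `set(df_act2_full['labels'].unique())` is a subset of `set(id2label)`.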

## Model Setup

Load the SciBERT model for sequence classification and resize the token embeddings to include new tokens.

In [None]:
# Load model and resize embeddings
num_labels = len(id2label)
config = AutoConfig.from_pretrained(model_name, num_labels=num_labels, id2label=id2label, label2id=label2id)
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
model.resize_token_embeddings(len(tokenizer))

## Training Arguments

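Set up the training arguments: batch size, learning rate, weight decay, and logging every 10 steps. The model is evaluated and checkpointed once per epoch, and the checkpoint with the best validation F1 is reloaded when training ends.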

In [None]:
# Training arguments
batch_size = 32
training_dir = "./checkpoints/scibert_training_global_info"
training_args = TrainingArguments(
    output_dir=training_dir,
    overwrite_output_dir=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=20,
    weight_decay=0.01,
    logging_dir="./logs/global_train",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    logging_strategy="steps",
    logging_steps=10,
)
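
For a rough sense of scale, the arithmetic below estimates how many optimizer steps and log entries these settings imply. The training-set size `n_train` is a hypothetical placeholder (the real value is printed by the split cell), and the estimate assumes a single device with no gradient accumulation.

```python
import math

n_train = 2400          # hypothetical size; replace with len(train_df)
batch_size = 32
num_train_epochs = 20
logging_steps = 10

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(n_train / batch_size)

# Upper bound on total steps; early stopping may end training sooner.
max_steps = steps_per_epoch * num_train_epochs

# Number of training-loss log entries if all epochs run.
log_events = max_steps // logging_steps

print(steps_per_epoch, max_steps, log_events)
```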

## Early Stopping

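Stop training when the validation F1 score has not improved for three consecutive evaluations. The callback relies on `load_best_model_at_end=True` and `metric_for_best_model` being set in the training arguments.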

In [None]:
# Early stopping callback
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
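
The patience rule can be sketched in plain Python. The function `stopping_epoch` below is a hypothetical, simplified illustration of the idea (count evaluations without a new best score, stop once the count reaches the patience), not the library's implementation, and the F1 values are invented.

```python
def stopping_epoch(scores, patience=3):
    """Return the 1-based evaluation at which training would stop, or None."""
    best = None
    evals_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None or score > best:
            # New best score: remember it and reset the counter.
            best = score
            evals_without_improvement = 0
        else:
            # A tie or a drop counts as no improvement.
            evals_without_improvement += 1
        if evals_without_improvement >= patience:
            return epoch
    return None

# Invented validation F1 per epoch: the best value is reached at epoch 3.
toy_f1 = [0.40, 0.45, 0.47, 0.46, 0.47, 0.44, 0.43]
print(stopping_epoch(toy_f1))  # 6
```

In this toy run, training stops after epoch 6, and with `load_best_model_at_end=True` the weights from the best checkpoint would be the ones kept.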

## Metrics Function

Define a function to compute accuracy and macro F1 score during evaluation.

In [None]:
# Load the metrics once, so they are not reloaded at every evaluation
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

# Compute metrics function
def compute_metrics(eval_pred):
    # eval_pred unpacks into the model logits and the reference labels
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=1)
    acc = accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1_score = f1.compute(predictions=predictions, references=labels, average="macro")["f1"]
    return {
        "f1": f1_score,
        "accuracy": acc
    }
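
As a worked example of the two metrics, the sketch below computes accuracy and macro F1 with NumPy on invented logits for three classes. The helper `macro_f1` is a hypothetical stand-in written for illustration; it averages per-class F1 over all `num_classes`, which should match the `evaluate` result when every class occurs in the labels or predictions.

```python
import numpy as np

# Invented logits for four examples and three classes.
logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [1.2, 0.3, 0.8],
    [0.1, 0.2, 2.2],
])
labels = np.array([0, 1, 2, 2])

# Predicted class = index of the largest logit, here [0, 1, 0, 2].
predictions = np.argmax(logits, axis=1)

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

accuracy_value = float(np.mean(predictions == labels))
f1_value = macro_f1(labels, predictions, num_classes=3)
print(accuracy_value, round(f1_value, 4))  # 0.75 0.7778
```

Per-class F1 here is 2/3, 1, and 2/3, so the macro average is 7/9. Because every class counts equally, macro F1 is more sensitive than accuracy to errors on rare classes.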

## Trainer Setup

Initialize the Hugging Face Trainer with the model, training arguments, tokenized train and validation splits, metrics function, and early stopping callback.

In [None]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    callbacks=[early_stopping],
)

In [None]:
# Train the model
trainer.train()