<a href="https://colab.research.google.com/github/len-rtz/2024-10-15-Master_Digital_Sciences-Linked_Open_Data_and_Knowledge_Graphs_WiSe_2024_20245/blob/main/encoder/Finetuning_PEFT_encoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!apt update
!apt install -y libmariadb-dev

[33m0% [Working][0m            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
83 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: Skipping acquire of configured file 'main/source/Sources' as repository 

In [2]:
!pip install mysql-connector-python sqlalchemy mariadb



In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
!git clone https://github.com/Horizontal-Labs/Argument-Mining.git

fatal: destination path 'Argument-Mining' already exists and is not an empty directory.


In [5]:
# Add the cloned repository to the import path so `db.queries` can be imported
import sys
sys.path.append('/content/Argument-Mining')

In [6]:
from db.queries import get_training_data, get_test_data

# Load training data
claims_train, premises_train, relationships_train = get_training_data()

# Load test data
claims_test, premises_test, relationships_test = get_test_data()

In [7]:
import pandas as pd

# Create pairs of claims and premises
debate_pairs = []

for i in range(len(claims_train)):
    debate_pairs.append({
        "claim": claims_train[i].text,
        "premise": premises_train[i].text,
        "stance": relationships_train[i],
    })

# Create final DataFrame
train_data = pd.DataFrame(debate_pairs)

print(train_data.head())

                                               claim  \
0  This house believes that the sale of violent v...   
1  This house supports the one-child policy of th...   
2  This house would permit the use of performance...   
3  This house would make physical education compu...   
4  This house believes in the use of affirmative ...   

                                             premise      stance  
0  video game violence is not related to serious ...  stance_con  
1         The policy had proved remarkably effective  stance_pro  
2  The use of drugs to enhance performance is con...  stance_con  
3  Frequent and regular physical exercise boosts ...  stance_pro  
4  In some countries which have laws on racial eq...  stance_con  


In [8]:
# Remove 'stance_' prefix for simplicity
train_data['stance'] = train_data['stance'].str.replace('stance_', '', regex=False)

print(train_data.head())

                                               claim  \
0  This house believes that the sale of violent v...   
1  This house supports the one-child policy of th...   
2  This house would permit the use of performance...   
3  This house would make physical education compu...   
4  This house believes in the use of affirmative ...   

                                             premise stance  
0  video game violence is not related to serious ...    con  
1         The policy had proved remarkably effective    pro  
2  The use of drugs to enhance performance is con...    con  
3  Frequent and regular physical exercise boosts ...    pro  
4  In some countries which have laws on racial eq...    con  


# Finetune RoBERTa

In [9]:
import torch
import pandas as pd
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    RobertaForSequenceClassification,
    RobertaTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from datasets import Dataset
from sklearn.model_selection import train_test_split
from peft import (
    get_peft_model,
    LoraConfig,
    TaskType,
    PeftModel,
    PeftConfig
)

In [10]:
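# Hold out 20% of the pairs (train_test_split is already imported above).
# Note: train_argument_mining_roberta() splits its input again, so only
# 64% of the original pairs end up in the training set.
train_data, eval_data = train_test_split(train_data, test_size=0.2, random_state=42)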

In [11]:
# Task definitions
TASKS = {
    "adu_identification": {
        "num_labels": 2,  # Yes/No - contains ADU
        "id2label": {0: "No", 1: "Yes"},
        "label2id": {"No": 0, "Yes": 1}
    },
    "adu_classification": {
        "num_labels": 2,  # claim or premise
        "id2label": {0: "claim", 1: "premise"},
        "label2id": {"claim": 0, "premise": 1}
    },
    "stance_classification": {
        "num_labels": 2,  # pro or con
        "id2label": {0: "con", 1: "pro"},
        "label2id": {"con": 0, "pro": 1}
    },
    "relationship_identification": {
        "num_labels": 2,  # supportive or contradictory
        "id2label": {0: "contradictory", 1: "supportive"},
        "label2id": {"contradictory": 0, "supportive": 1}
    }
}
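
The task table above carries each mapping in both directions, so a typo in one direction is easy to miss. The following is a minimal sketch of a consistency check. `check_task_config` and `EXAMPLE_TASKS` are hypothetical names introduced for illustration and are not part of the notebook. In the cells shown here, only `num_labels` is passed to the model, so a check like this matters mostly if `id2label` and `label2id` are later handed to `from_pretrained`.

```python
# Hypothetical helper (not part of the notebook): validates one task entry.
def check_task_config(name, cfg):
    id2label = cfg["id2label"]
    label2id = cfg["label2id"]
    # The number of labels must match both mappings
    assert len(id2label) == cfg["num_labels"], f"{name}: num_labels mismatch"
    assert len(label2id) == cfg["num_labels"], f"{name}: label2id size mismatch"
    # Ids must be exactly 0 .. num_labels - 1
    assert sorted(id2label) == list(range(cfg["num_labels"])), f"{name}: ids not contiguous"
    # The two mappings must be inverses of each other
    for idx, label in id2label.items():
        assert label2id[label] == idx, f"{name}: {label!r} maps back to {label2id[label]}"
    return True


# One entry copied from the TASKS dictionary, for a self-contained example
EXAMPLE_TASKS = {
    "stance_classification": {
        "num_labels": 2,
        "id2label": {0: "con", 1: "pro"},
        "label2id": {"con": 0, "pro": 1},
    }
}

checked = {name: check_task_config(name, cfg) for name, cfg in EXAMPLE_TASKS.items()}
print(checked)
```

Running the same loop over the full `TASKS` dictionary would cover all four tasks.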

# Data Formatting

In [12]:
def format_for_argument_mining_roberta(df):
    formatted_data = []

    for _, row in df.iterrows():
        # Task 1: ADU Identification (claim)
        # Claims and premises are both ADUs, so this task only ever gets label 1
        claim_adu_sample = {
            "task": "adu_identification",
            "text": row['claim'],
            "label": 1  # 1 for contains ADU
        }

        # Task 1: ADU Identification (premise)
        premise_adu_sample = {
            "task": "adu_identification",
            "text": row['premise'],
            "label": 1  # 1 for contains ADU
        }

        # Task 2: ADU Classification (claim)
        adu_class_sample_claim = {
            "task": "adu_classification",
            "text": row['claim'],
            "label": 0  # 0 for claim
        }

        # Task 2: ADU Classification (premise)
        adu_class_sample_premise = {
            "task": "adu_classification",
            "text": row['premise'],
            "label": 1  # 1 for premise
        }

        # Task 3: Stance Classification
        stance_sample = {
            "task": "stance_classification",
            # Claim and premise joined with a literal </s>, RoBERTa's separator token
            "text": f"{row['claim']} </s> {row['premise']}",
            "label": 1 if row['stance'] == 'pro' else 0  # 1 for 'pro', 0 for 'con'
        }

        # Task 4: Relationship identification
        # Same input and label as Task 3: 'pro' maps to supportive, 'con' to contradictory
        relationship_sample = {
            "task": "relationship_identification",
            "text": f"{row['claim']} </s> {row['premise']}",
            "label": 1 if row['stance'] == 'pro' else 0  # 1 for supportive, 0 for contradictory
        }

        # Add all tasks to our dataset
        formatted_data.extend([
            claim_adu_sample,
            premise_adu_sample,
            adu_class_sample_claim,
            adu_class_sample_premise,
            stance_sample,
            relationship_sample
        ])

    return formatted_data
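
Before training, it can help to look at the label distribution each task receives from this formatting step. The sketch below uses three hypothetical toy rows and a simplified stand-in (`task_labels`, not part of the notebook) that reproduces only the task and label assignments of `format_for_argument_mining_roberta`. It shows that `adu_identification` contains a single class, so a model that always answers "Yes" already reaches an accuracy of 1.0 on it.

```python
from collections import Counter

# Hypothetical toy rows standing in for train_data
rows = [
    {"claim": "C1", "premise": "P1", "stance": "pro"},
    {"claim": "C2", "premise": "P2", "stance": "con"},
    {"claim": "C3", "premise": "P3", "stance": "con"},
]


def task_labels(row):
    """Simplified stand-in: the (task, label) pairs emitted for one row."""
    stance_label = 1 if row["stance"] == "pro" else 0
    return [
        ("adu_identification", 1),   # claim
        ("adu_identification", 1),   # premise
        ("adu_classification", 0),   # claim
        ("adu_classification", 1),   # premise
        ("stance_classification", stance_label),
        ("relationship_identification", stance_label),
    ]


# Count labels per task
label_counts = {}
for row in rows:
    for task, label in task_labels(row):
        label_counts.setdefault(task, Counter())[label] += 1

# Majority-class share = accuracy of a model that always predicts the majority label
for task, counts in label_counts.items():
    majority_share = max(counts.values()) / sum(counts.values())
    print(f"{task}: {dict(counts)} majority baseline {majority_share:.2f}")
```

A meaningful ADU identification model would need negative examples (text that is not an argumentative unit), which this dataset of claim and premise pairs does not provide.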

In [13]:
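def tokenize_function_roberta(examples, tokenizer, max_length=512):
    # Pad every example to max_length so the default collator can batch them.
    # No return_tensors here: Dataset.map stores the result as plain lists anyway.
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )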
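
The stance and relationship inputs are built as one string with a literal `</s>` between claim and premise. RoBERTa tokenizers encode a real sentence pair slightly differently, as `<s> A </s></s> B </s>`, which is what a call such as `tokenizer(claim, premise, truncation=True)` produces. The sketch below only illustrates the two layouts on hypothetical token ids; the function names are made up for this example, and the inline layout is an approximation because whitespace around the literal `</s>` can change the neighbouring token ids.

```python
# Special-token ids of roberta-base: <s> = 0, </s> = 2
CLS_ID, SEP_ID = 0, 2


def pair_with_special_tokens(ids_a, ids_b):
    """Layout RoBERTa tokenizers use for a sentence pair: <s> A </s></s> B </s>."""
    return [CLS_ID] + ids_a + [SEP_ID, SEP_ID] + ids_b + [SEP_ID]


def single_with_inline_separator(ids_a, ids_b):
    """Approximate layout when one string contains a literal </s>: <s> A </s> B </s>."""
    return [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]


claim_ids = [100, 101]         # hypothetical token ids for the claim
premise_ids = [200, 201, 202]  # hypothetical token ids for the premise

native = pair_with_special_tokens(claim_ids, premise_ids)
inline = single_with_inline_separator(claim_ids, premise_ids)
print(native)
print(inline)
```

Either layout can be fine-tuned on. The pair call has the practical advantage that truncation can be applied per segment instead of cutting off the end of the combined string.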

# Language Model configuration

In [14]:
def setup_roberta_model(model_name="roberta-base", num_labels=2):
    try:
        # Load pre-trained RoBERTa model and tokenizer
        model = RobertaForSequenceClassification.from_pretrained(
            model_name,
            num_labels=num_labels,
            problem_type="single_label_classification"
        )
        tokenizer = RobertaTokenizer.from_pretrained(model_name)

        return model, tokenizer
    except Exception as e:
        print(f"Error loading model: {e}")
        raise


# PEFT LoRA configuration

In [15]:
def configure_peft_roberta(model):
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
        # Target attention modules in RoBERTa
        target_modules=["query", "key", "value"],
        bias="none",
    )

    model = get_peft_model(model, peft_config)

    # Print trainable parameters info
    print_trainable_parameters(model)

    return model

def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()

    print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}")
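
The trainable-parameter count printed by `print_trainable_parameters` can be reproduced by hand. The sketch below assumes the `roberta-base` dimensions (hidden size 768, 12 encoder layers) and assumes that PEFT keeps the classification head trainable for `TaskType.SEQ_CLS` in addition to the LoRA matrices. Under those assumptions the arithmetic matches the 1034498 trainable parameters reported in the training log.

```python
HIDDEN = 768     # roberta-base hidden size
LAYERS = 12      # roberta-base encoder layers
R = 8            # LoRA rank from the LoraConfig above
TARGETS = 3      # adapted modules per layer: query, key, value
NUM_LABELS = 2

# Each adapted Linear(HIDDEN, HIDDEN) gets two low-rank matrices:
# A with shape (R, HIDDEN) and B with shape (HIDDEN, R)
lora_per_module = R * HIDDEN + HIDDEN * R
lora_total = lora_per_module * TARGETS * LAYERS

# Classification head: dense (HIDDEN -> HIDDEN) and out_proj (HIDDEN -> NUM_LABELS),
# each with a bias vector
head_total = (HIDDEN * HIDDEN + HIDDEN) + (HIDDEN * NUM_LABELS + NUM_LABELS)

trainable = lora_total + head_total
print(lora_total, head_total, trainable)  # 442368 592130 1034498
```

More than half of the trainable parameters therefore sit in the classification head, not in the LoRA adapters.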

In [16]:
def setup_training_roberta(model, train_dataset, eval_dataset, output_dir="./argument-mining-roberta"):
    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=16,  # RoBERTa can handle larger batches than decoder models
        learning_rate=2e-5,  # Standard learning rate for RoBERTa fine-tuning
        weight_decay=0.01,
        logging_steps=20,
        save_strategy="epoch",
        warmup_ratio=0.1,
        eval_strategy="epoch",
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy"
    )

    # Define compute metrics function
    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = torch.argmax(torch.tensor(logits), dim=-1)
        accuracy = (predictions == torch.tensor(labels)).float().mean().item()
        return {"accuracy": accuracy}

    # Set up the trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics
    )

    return trainer
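
To get a feel for what these arguments mean in optimizer steps, the schedule can be estimated by hand. This is a rough sketch that assumes a single device, no gradient accumulation, and a final partial batch that is kept. The training-set size is the one the `Map` progress bar reports for the ADU tasks.

```python
import math

train_examples = 52380  # per-task training set size reported in the log
batch_size = 16         # per_device_train_batch_size
epochs = 3              # num_train_epochs
warmup_ratio = 0.1

# Optimizer steps per epoch; the last batch may be smaller than batch_size
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
# Steps during which the learning rate ramps up from 0 to 2e-5
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)  # 3274 9822 983
```

With `logging_steps=20`, a run of this length would log the training loss roughly 490 times.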

In [17]:
def train_argument_mining_roberta(train_data, model_name="roberta-base"):
    # Hold out 20% of the data for evaluation (a single split, not cross-validation)
    train_df, eval_df = train_test_split(train_data, test_size=0.2, random_state=42)

    # Format data for RoBERTa training
    formatted_train_data = format_for_argument_mining_roberta(train_df)
    formatted_eval_data = format_for_argument_mining_roberta(eval_df)

    # Trained model, tokenizer, and save path for each task
    task_datasets = {}

    for task in ["adu_identification", "adu_classification", "stance_classification", "relationship_identification"]:
        # Filter data for specific task
        task_data = [item for item in formatted_train_data if item["task"] == task]
        task_eval_data = [item for item in formatted_eval_data if item["task"] == task]

        # Skip if no data for task
        if not task_data or not task_eval_data:
            continue

        # Setup model and tokenizer for this task
        num_labels = max([item["label"] for item in task_data]) + 1
        model, tokenizer = setup_roberta_model(model_name, num_labels=num_labels)

        # Create HF Dataset
        hf_train_dataset = Dataset.from_list(task_data)
        hf_eval_dataset = Dataset.from_list(task_eval_data)

        # Apply tokenizer
        train_dataset = hf_train_dataset.map(
            lambda x: tokenize_function_roberta(x, tokenizer),
            batched=True,
            remove_columns=["text", "task"]
        )

        eval_dataset = hf_eval_dataset.map(
            lambda x: tokenize_function_roberta(x, tokenizer),
            batched=True,
            remove_columns=["text", "task"]
        )

        # Configure PEFT/LoRA
        model = configure_peft_roberta(model)

        # Setup training
        trainer = setup_training_roberta(
            model,
            train_dataset,
            eval_dataset,
            output_dir=f"./argument-mining-roberta-{task}"
        )

        # Train the model
        trainer.train()

        # Save the model and tokenizer
        peft_model_id = f"argument-mining-roberta-{task}"
        trainer.model.save_pretrained(peft_model_id)
        tokenizer.save_pretrained(peft_model_id)

        print(f"Model for task {task} saved to {peft_model_id}")

        # Store model for inference
        task_datasets[task] = {"model": model, "tokenizer": tokenizer, "model_id": peft_model_id}

    return task_datasets

In [18]:
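def predict_with_roberta_models(text, claim=None, task_datasets=None):
    # task_datasets maps task name -> {"model", "tokenizer", "model_id"}
    task_datasets = task_datasets or {}
    results = {}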

    # ADU Identification
    if "adu_identification" in task_datasets:
        model = task_datasets["adu_identification"]["model"]
        tokenizer = task_datasets["adu_identification"]["tokenizer"]

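        # Put the inputs on the model's device and disable dropout and gradients
        model.eval()
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(next(model.parameters()).device)
        with torch.no_grad():
            outputs = model(**inputs)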
        prediction = torch.argmax(outputs.logits, dim=-1).item()

        results["adu_identification"] = "Contains ADU" if prediction == 1 else "Does not contain ADU"

    # ADU Classification
    if "adu_classification" in task_datasets and results.get("adu_identification") == "Contains ADU":
        model = task_datasets["adu_classification"]["model"]
        tokenizer = task_datasets["adu_classification"]["tokenizer"]

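        # Put the inputs on the model's device and disable dropout and gradients
        model.eval()
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(next(model.parameters()).device)
        with torch.no_grad():
            outputs = model(**inputs)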
        prediction = torch.argmax(outputs.logits, dim=-1).item()

        results["adu_classification"] = "Claim" if prediction == 0 else "Premise"

    # Stance Classification & Relationship (only if claim is provided)
    if claim and text and "stance_classification" in task_datasets:
        model = task_datasets["stance_classification"]["model"]
        tokenizer = task_datasets["stance_classification"]["tokenizer"]

        combined_text = f"{claim} </s> {text}"  # RoBERTa uses </s> as separator
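        # Put the inputs on the model's device and disable dropout and gradients
        model.eval()
        inputs = tokenizer(combined_text, return_tensors="pt", padding=True, truncation=True).to(next(model.parameters()).device)
        with torch.no_grad():
            outputs = model(**inputs)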
        prediction = torch.argmax(outputs.logits, dim=-1).item()

        stance = "pro" if prediction == 1 else "con"
        results["stance_classification"] = stance

        # Relationship identification uses the same prediction for simplicity
        relationship = "supportive" if prediction == 1 else "contradictory"
        results["relationship"] = relationship

    return results
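
The prediction function reports only the arg-max class. If a confidence score is wanted as well, the logits can be turned into probabilities with a softmax. The sketch below shows the computation in plain Python on hypothetical logits; inside the function itself the equivalent would be `torch.softmax(outputs.logits, dim=-1)`.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 2.3]  # hypothetical logits for the classes (con, pro)
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

labels = ["con", "pro"]
print(labels[predicted], round(probs[predicted], 3))
```

A low top probability can then be used to flag inputs where the stance prediction is uncertain.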

# Training

In [None]:
# Train all task models
task_models = train_argument_mining_roberta(train_data)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/52380 [00:00<?, ? examples/s]

Map:   0%|          | 0/13096 [00:00<?, ? examples/s]

trainable params: 1034498 || all params: 125681668 || trainable%: 0.82


No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
[34m[1mwandb[0m: Currently logged in as: [33mlen-rtz[0m ([33mlen-rtz-th-k-ln[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss,Accuracy
1,0.0,2e-06,1.0
2,0.0,1e-06,1.0
3,0.0,0.0,1.0


Model for task adu_identification saved to argument-mining-roberta-adu_identification


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/52380 [00:00<?, ? examples/s]

Map:   0%|          | 0/13096 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


trainable params: 1034498 || all params: 125681668 || trainable%: 0.82


Epoch,Training Loss,Validation Loss,Accuracy
1,0.1723,0.221383,0.947618
2,0.089,0.158815,0.957697
