# Sequential Fine-tuning of MARBERT for Arabic NLP Tasks

This notebook demonstrates how to fine-tune the [MARBERT model](https://huggingface.co/UBC-NLP/MARBERT) sequentially on three Arabic NLP tasks:
1. Dialect detection (Egypt, MSA, Gulf, Magreb, Levant)
2. Sarcasm detection (True, False)
3. Sentiment classification (Positive, Negative, Neutral)

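All three tasks share the MARBERT encoder and tokenizer. Each stage starts from the weights saved by the previous one; because the number of labels changes from task to task, the classification head is re-initialized at every stage and only the encoder weights carry over.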

## Install Required Libraries

First, let's install the necessary libraries for our fine-tuning process.

In [None]:
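# torch is imported directly below, and recent transformers releases need accelerate for Trainer
!pip install transformers datasets evaluate scikit-learn pandas numpy torch accelerate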

## Import Libraries

Now let's import all the necessary libraries for our fine-tuning pipeline.

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
import evaluate
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from datasets import Dataset
from typing import Dict, List, Union
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)

## Load and Preprocess Data

In this section, we'll load our dataset and prepare it for the three tasks. We'll use the `training-data.csv` and `testing-data.csv` files in our workspace. The code below expects each file to contain the columns `tweet`, `dialect`, `sarcasm`, and `sentiment`.

In [None]:
# Load data
train_df = pd.read_csv('training-data.csv')
test_df = pd.read_csv('testing-data.csv')

# Display a few samples to understand the data
train_df.head()

In [None]:
# Check dataset information
print(f"Training data shape: {train_df.shape}")
print(f"Testing data shape: {test_df.shape}")
print("\nColumns in dataset:")
print(train_df.columns.tolist())

# Check for missing values
print("\nMissing values in training data:")
print(train_df.isnull().sum())

# Check distribution of labels for each task
print("\nDialect distribution:")
print(train_df['dialect'].value_counts())

print("\nSarcasm distribution:")
print(train_df['sarcasm'].value_counts())

print("\nSentiment distribution:")
print(train_df['sentiment'].value_counts())
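
The pipeline below evaluates on `testing-data.csv` after every epoch and uses those scores to pick the best checkpoint. To keep the test set for a final check only, one option is to carve a validation split out of the training data. The notebook already imports `train_test_split`, which accepts a `stratify` argument for this purpose. The sketch below is a dependency-free illustration of the same idea; `stratified_split` and the toy labels are hypothetical and not part of the original notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.1, seed=42):
    """Return (train_idx, val_idx) so that each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, val_idx = [], []
    for label in sorted(by_label, key=str):
        indices = by_label[label]
        rng.shuffle(indices)
        # Classes with a single row stay entirely in the training part
        n_val = max(1, round(len(indices) * val_fraction)) if len(indices) > 1 else 0
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels standing in for train_df['sentiment'].tolist()
toy_labels = ["Positive"] * 10 + ["Negative"] * 20 + ["Neutral"] * 30
train_idx, val_idx = stratified_split(toy_labels, val_fraction=0.2)
print(len(train_idx), len(val_idx))
```

The returned positions could then be used with `train_df.iloc[train_idx]` and `train_df.iloc[val_idx]` to build the two DataFrames.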

## Helper Functions for Preprocessing and Training

Let's create helper functions for preprocessing our data, encoding labels, and evaluating our model.

In [None]:
def preprocess_and_encode_data(df: pd.DataFrame, task: str, tokenizer, max_length: int = 128):
    """Preprocess text data and encode labels for a specific task
    
    Args:
        df: DataFrame containing the data
        task: One of 'dialect', 'sarcasm', or 'sentiment'
        tokenizer: BERT tokenizer
        max_length: Maximum sequence length for tokenization
        
    Returns:
        Tuple of (tokenized Dataset with encoded labels, label-to-id mapping dict)
    """
    # The label column has the same name as the task
    if task not in ('dialect', 'sarcasm', 'sentiment'):
        raise ValueError(f"Unsupported task: {task}")
    label_column = task
    unique_labels = sorted(df[label_column].unique())
    
    label_mapping = {label: i for i, label in enumerate(unique_labels)}
    print(f"Label mapping for {task}: {label_mapping}")
    
    # Encode labels
    labels = [label_mapping[label] for label in df[label_column]]
    
    # Create dataset
    dataset_dict = {
        "text": df["tweet"].tolist(),
        "label": labels
    }
    dataset = Dataset.from_dict(dataset_dict)
    
    # Tokenize function
    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=max_length,
        )
    
    # Tokenize all examples
    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    
    return tokenized_dataset, label_mapping
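
Label ids are only comparable between the training and evaluation splits if both use the same mapping. If the mapping is derived separately from each DataFrame, a class that is missing from one split shifts the ids of the others. The sketch below illustrates the effect with toy labels; `build_label_mapping` and `encode_labels` are hypothetical helpers, not functions from the notebook.

```python
def build_label_mapping(train_labels):
    """Map each distinct label to an integer id, in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(train_labels)))}

def encode_labels(labels, label_mapping):
    """Encode labels with an existing mapping and fail loudly on unseen labels."""
    unseen = set(labels) - set(label_mapping)
    if unseen:
        raise ValueError(f"Labels not present in the mapping: {sorted(unseen)}")
    return [label_mapping[label] for label in labels]

train_labels = ["Positive", "Negative", "Neutral", "Negative"]
test_labels = ["Positive", "Negative"]  # 'Neutral' happens to be absent here

shared_mapping = build_label_mapping(train_labels)
test_only_mapping = build_label_mapping(test_labels)

print("shared mapping:   ", shared_mapping)
print("test-only mapping:", test_only_mapping)  # 'Positive' gets a different id
print("test labels encoded with the shared mapping:", encode_labels(test_labels, shared_mapping))
```

Building the mapping once from the training labels and reusing it for every split keeps the ids aligned with what the model was trained on.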

In [None]:
# Load the metrics once; loading them inside compute_metrics would repeat the work at every evaluation
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
precision = evaluate.load("precision")
recall = evaluate.load("recall")

def compute_metrics(eval_pred):
    """Compute accuracy and macro-averaged F1, precision, and recall"""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy_score = accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    # For multiclass, we use macro averaging
    f1_score = f1.compute(predictions=predictions, references=labels, average="macro")["f1"]
    precision_score = precision.compute(predictions=predictions, references=labels, average="macro")["precision"]
    recall_score = recall.compute(predictions=predictions, references=labels, average="macro")["recall"]
    
    return {
        "accuracy": accuracy_score,
        "f1": f1_score,
        "precision": precision_score,
        "recall": recall_score
    }
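
To make the macro averaging used above concrete, here is a small worked example in plain Python. It computes F1 per class and then takes the unweighted mean, so a rare class counts as much as a frequent one. `macro_f1` is an illustrative helper written for this sketch; the notebook itself relies on the `evaluate` metrics.

```python
def macro_f1(predictions, references):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(references) | set(predictions))
    scores = []
    for c in classes:
        pairs = list(zip(predictions, references))
        tp = sum(p == c and r == c for p, r in pairs)  # true positives for class c
        fp = sum(p == c and r != c for p, r in pairs)  # false positives
        fn = sum(p != c and r == c for p, r in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Toy example: class 0 is frequent, class 1 is rare
references = [0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 0, 0, 1]

accuracy_value = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(f"accuracy: {accuracy_value:.4f}")
print(f"macro F1: {macro_f1(predictions, references):.4f}")
```

Here class 0 has F1 = 8/9 and class 1 has F1 = 2/3, so the macro F1 (7/9) is lower than the accuracy (5/6). This is why the notebook selects checkpoints on F1 rather than accuracy when label distributions are uneven.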

In [None]:
def fine_tune_model(task: str, 
                 train_dataset, 
                 eval_dataset, 
                 model, 
                 num_labels: int,
                 output_dir: str,
                 epochs: int = 3,
                 batch_size: int = 16):
    """Fine-tune MARBERT for a specific task
    
    Args:
        task: The name of the task ('dialect', 'sarcasm', or 'sentiment')
        train_dataset: Training dataset
        eval_dataset: Evaluation dataset
        model: Pre-trained or previously fine-tuned model
        num_labels: Number of labels for the task
        output_dir: Directory to save the model
        epochs: Number of training epochs
        batch_size: Training batch size
        
    Returns:
        Fine-tuned model
    """
    logging.info(f"Fine-tuning model for task: {task}")
    
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Set up training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=2e-5,
        weight_decay=0.01,
        eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        report_to="none",  # Disable reporting to avoid wandb or other integrations
    )
    
    # Set up trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )
    
    # Train the model
    trainer.train()
    
    # Evaluate the model
    eval_results = trainer.evaluate()
    logging.info(f"Evaluation results for {task}: {eval_results}")
    
    # Save the model
    trainer.save_model(output_dir)
    logging.info(f"Model saved to {output_dir}")
    
    return model
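
The keyword that schedules evaluation has been renamed across `transformers` releases (`evaluation_strategy` in older ones, `eval_strategy` in newer ones), so a notebook with unpinned dependencies may hit a `TypeError` on one of them. One possible workaround is to choose the keyword at runtime by inspecting the constructor. `first_supported_kwarg` is a hypothetical helper, and the two stand-in functions only simulate the two library versions.

```python
import inspect

def first_supported_kwarg(callable_obj, candidates):
    """Return the first name in `candidates` that `callable_obj` accepts as a parameter."""
    params = inspect.signature(callable_obj).parameters
    for name in candidates:
        if name in params:
            return name
    raise TypeError(f"None of {candidates} is accepted by {callable_obj!r}")

# Stand-ins that mimic an older and a newer constructor signature
def old_style_args(output_dir, evaluation_strategy="no"):
    return {"output_dir": output_dir, "strategy": evaluation_strategy}

def new_style_args(output_dir, eval_strategy="no"):
    return {"output_dir": output_dir, "strategy": eval_strategy}

names = ["eval_strategy", "evaluation_strategy"]
for factory in (old_style_args, new_style_args):
    key = first_supported_kwarg(factory, names)
    args = factory("marbert_dialect", **{key: "epoch"})
    print(factory.__name__, "->", key, args)
```

In the notebook, the same pattern would presumably apply as `first_supported_kwarg(TrainingArguments, names)`, with the result unpacked into the `TrainingArguments(...)` call.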

## Sequential Fine-tuning Pipeline

Now we'll implement the sequential fine-tuning pipeline where we'll first fine-tune on dialect detection, then use that model for sarcasm detection, and finally for sentiment classification.
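
The chain described above can be written down as a small plan: each stage records which checkpoint it starts from and where it saves its result. This is only a sketch of the bookkeeping; `plan_sequential_runs` is a hypothetical helper, and the directory names follow the `marbert_<task>` pattern used in this notebook.

```python
def plan_sequential_runs(tasks, base_checkpoint):
    """Return one dict per stage: the checkpoint it starts from and its output directory."""
    plan = []
    init_from = base_checkpoint
    for task in tasks:
        output_dir = f"marbert_{task}"
        plan.append({"task": task, "init_from": init_from, "output_dir": output_dir})
        init_from = output_dir  # the next stage starts from this stage's saved model
    return plan

plan = plan_sequential_runs(["dialect", "sarcasm", "sentiment"], "UBC-NLP/MARBERT")
for stage in plan:
    print(f"{stage['task']:<10} starts from {stage['init_from']:<18} saves to {stage['output_dir']}")
```

Only the first stage loads the base checkpoint; every later stage loads the directory written by the stage before it.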

In [None]:
# Define paths for saving models
dialect_model_path = "marbert_dialect"
sarcasm_model_path = "marbert_sarcasm"
sentiment_model_path = "marbert_sentiment"

# Define hyperparameters
model_name = "UBC-NLP/MARBERT"
max_length = 128
batch_size = 16
epochs = 3
random_state = 42
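
As a quick sanity check on these hyperparameters, the number of optimizer steps per task follows from the dataset size, the batch size, and the epoch count. The sketch assumes single-device training, no gradient accumulation, and that the last partial batch is kept; the dataset size of 10,000 is made up for illustration.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps for single-device training without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last partial batch still counts
    return steps_per_epoch * epochs

# 10_000 is a hypothetical number of training rows
steps = total_optimizer_steps(10_000, batch_size=16, epochs=3)
print(f"{steps} optimizer steps per task, {3 * steps} across the three tasks")
```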

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

### Task 1: Dialect Detection

First, we'll fine-tune the MARBERT model on the dialect detection task.

In [None]:
# Preprocess data for dialect detection task
train_dialect_dataset, dialect_label_mapping = preprocess_and_encode_data(
    train_df, 'dialect', tokenizer, max_length
)
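eval_dialect_dataset, eval_dialect_mapping = preprocess_and_encode_data(
    test_df, 'dialect', tokenizer, max_length
)
# Each call builds its own mapping; label ids are only comparable if both splits agree
assert eval_dialect_mapping == dialect_label_mapping, "Train/test label mappings differ"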

# Load base MARBERT model for dialect detection
dialect_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=len(dialect_label_mapping),
    ignore_mismatched_sizes=True
)

# Fine-tune for dialect detection
dialect_model = fine_tune_model(
    task='dialect',
    train_dataset=train_dialect_dataset,
    eval_dataset=eval_dialect_dataset,
    model=dialect_model,
    num_labels=len(dialect_label_mapping),
    output_dir=dialect_model_path,
    epochs=epochs,
    batch_size=batch_size
)

### Task 2: Sarcasm Detection

Next, we'll use the fine-tuned dialect model as a starting point for the sarcasm detection task.

In [None]:
# Preprocess data for sarcasm detection task
train_sarcasm_dataset, sarcasm_label_mapping = preprocess_and_encode_data(
    train_df, 'sarcasm', tokenizer, max_length
)
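eval_sarcasm_dataset, eval_sarcasm_mapping = preprocess_and_encode_data(
    test_df, 'sarcasm', tokenizer, max_length
)
# Each call builds its own mapping; label ids are only comparable if both splits agree
assert eval_sarcasm_mapping == sarcasm_label_mapping, "Train/test label mappings differ"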

# Load fine-tuned dialect model for sarcasm detection
sarcasm_model = AutoModelForSequenceClassification.from_pretrained(
    dialect_model_path,
    num_labels=len(sarcasm_label_mapping),
    ignore_mismatched_sizes=True
)

# Fine-tune for sarcasm detection
sarcasm_model = fine_tune_model(
    task='sarcasm',
    train_dataset=train_sarcasm_dataset,
    eval_dataset=eval_sarcasm_dataset,
    model=sarcasm_model,
    num_labels=len(sarcasm_label_mapping),
    output_dir=sarcasm_model_path,
    epochs=epochs,
    batch_size=batch_size
)

### Task 3: Sentiment Classification

Finally, we'll use the fine-tuned sarcasm model as a starting point for the sentiment classification task.

In [None]:
# Preprocess data for sentiment classification task
train_sentiment_dataset, sentiment_label_mapping = preprocess_and_encode_data(
    train_df, 'sentiment', tokenizer, max_length
)
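eval_sentiment_dataset, eval_sentiment_mapping = preprocess_and_encode_data(
    test_df, 'sentiment', tokenizer, max_length
)
# Each call builds its own mapping; label ids are only comparable if both splits agree
assert eval_sentiment_mapping == sentiment_label_mapping, "Train/test label mappings differ"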

# Load fine-tuned sarcasm model for sentiment classification
sentiment_model = AutoModelForSequenceClassification.from_pretrained(
    sarcasm_model_path,
    num_labels=len(sentiment_label_mapping),
    ignore_mismatched_sizes=True
)

# Fine-tune for sentiment classification
sentiment_model = fine_tune_model(
    task='sentiment',
    train_dataset=train_sentiment_dataset,
    eval_dataset=eval_sentiment_dataset,
    model=sentiment_model,
    num_labels=len(sentiment_label_mapping),
    output_dir=sentiment_model_path,
    epochs=epochs,
    batch_size=batch_size
)

## Model Inference

Now let's create functions to use our fine-tuned models for inference on new Arabic text.

In [None]:
def predict_with_model(text, model_path, label_mapping):
    """Make a prediction using a fine-tuned model
    
    Args:
        text: The input Arabic text
        model_path: Path to the fine-tuned model
        label_mapping: Dictionary mapping text labels to numerical ids, as returned by
            preprocess_and_encode_data (it is inverted internally)
        
    Returns:
        Predicted label and confidence score
    """
    # Load model and tokenizer
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128, padding="max_length")
    
    # Make prediction
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probabilities = torch.nn.functional.softmax(logits, dim=1)
        prediction = torch.argmax(probabilities, dim=1).item()
        confidence = probabilities[0][prediction].item()
    
    # Map numerical label back to text label
    inverse_mapping = {v: k for k, v in label_mapping.items()}
    predicted_label = inverse_mapping[prediction]
    
    return predicted_label, confidence
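
The confidence returned above is the softmax probability of the predicted class. The following plain-Python sketch reproduces that step on made-up logits, without loading a model; `softmax`, `predict_label`, and the example values are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, id_to_label):
    """Return the label with the highest probability and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id_to_label[best], probs[best]

# Hypothetical id-to-label mapping and logits for a three-class sentiment head
id_to_label = {0: "Negative", 1: "Neutral", 2: "Positive"}
label, confidence = predict_label([0.2, 0.1, 2.3], id_to_label)
print(f"{label} (confidence: {confidence:.4f})")
```

A high softmax probability means the model prefers one class strongly over the others; it is not a calibrated estimate of correctness.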

In [None]:
# Example of how to use the fine-tuned models for inference
def analyze_arabic_text(text):
    """Analyze Arabic text using all three fine-tuned models
    
    Args:
        text: Arabic text input
        
    Returns:
        Dictionary with predictions for all three tasks
    """
    # Predict dialect
    dialect, dialect_confidence = predict_with_model(
        text, dialect_model_path, dialect_label_mapping
    )
    
    # Predict sarcasm
    sarcasm, sarcasm_confidence = predict_with_model(
        text, sarcasm_model_path, sarcasm_label_mapping
    )
    
    # Predict sentiment
    sentiment, sentiment_confidence = predict_with_model(
        text, sentiment_model_path, sentiment_label_mapping
    )
    
    return {
        "text": text,
        "dialect": {
            "prediction": dialect,
            "confidence": f"{dialect_confidence:.4f}"
        },
        "sarcasm": {
            "prediction": sarcasm,
            "confidence": f"{sarcasm_confidence:.4f}"
        },
        "sentiment": {
            "prediction": sentiment,
            "confidence": f"{sentiment_confidence:.4f}"
        }
    }

## Test with Example Sentences

Let's test our models with a few example Arabic sentences:

In [None]:
# Test examples (add these after training is complete)
test_examples = [
    "أنا سعيد جدا بهذا الخبر العظيم",  # I am very happy with this great news
    "هههههه والله انك مسخرة يا رجل",   # Hahaha, you're so funny man
    "الطقس حار جدا اليوم في القاهرة"    # The weather is very hot today in Cairo
]

# Set to True to run predictions once all three models have been trained and saved
RUN_PREDICTIONS = False

if RUN_PREDICTIONS:
    for i, example in enumerate(test_examples):
        print(f"\nExample {i+1}: {example}")
        results = analyze_arabic_text(example)
        print(f"Dialect: {results['dialect']['prediction']} (confidence: {results['dialect']['confidence']})")
        print(f"Sarcasm: {results['sarcasm']['prediction']} (confidence: {results['sarcasm']['confidence']})")
        print(f"Sentiment: {results['sentiment']['prediction']} (confidence: {results['sentiment']['confidence']})")

## Conclusion

In this notebook, we've demonstrated how to fine-tune the MARBERT model sequentially on three Arabic NLP tasks: dialect detection, sarcasm detection, and sentiment classification. We've shown how to:

1. Load and preprocess the data for each task
2. Fine-tune the model sequentially, using the model from the previous task as a starting point
3. Save the fine-tuned models
4. Use the models for inference on new Arabic text

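This approach lets the encoder carry what it learned on one task into the next, which may improve performance over fine-tuning each task independently from the base MARBERT checkpoint. The notebook does not measure that gain; comparing against independently fine-tuned baselines would confirm whether it materializes.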