# <span style="color:#FF7F7F">**Model Comparison Notebook**</span>

## <span style="color:#FF7F7F">**1. Installation of Required Libraries**</span>
- Installs the required libraries: `transformers`, `datasets`, `seqeval`, and `evaluate`.

## <span style="color:#FF7F7F">**2. Loading the Dataset**</span>
- Loads a `.conll` file containing labeled messages and parses it into tokens and labels.

## <span style="color:#FF7F7F">**3. Dataset Preparation**</span>
- Converts the parsed data into a Hugging Face `Dataset` with `ClassLabel`-encoded labels and splits it 80/20 into training and test sets (`seed=42`).

## <span style="color:#FF7F7F">**4. Tokenization and Label Alignment**</span>
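- Tokenizes the pre-split words and aligns the word-level labels with the subword tokens: the first subword of each word keeps the label, while remaining subwords and special tokens get `-100` so the loss ignores them.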

## <span style="color:#FF7F7F">**5. Model Fine-Tuning**</span>
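- Fine-tunes three multilingual checkpoints on the dataset: `xlm-roberta-base`, `distilbert-base-multilingual-cased`, and `bert-base-multilingual-cased` (referred to as `xlm-roberta`, `distilbert`, and `mbert`).
- Saves each fine-tuned model and its tokenizer, then evaluates accuracy on the test split.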

## <span style="color:#FF7F7F">**6. Evaluation**</span>
- Computes token-level accuracy on the test split, skipping positions labeled `-100`, and prints the result for each model.

In [1]:
!pip install transformers datasets seqeval




In [2]:
!pip install evaluate




In [3]:
file_path = "C:/Users/ibsan/Desktop/TenX/week-5/data/labeled_messages.conll"

In [4]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from datasets import Dataset, Features, Value, Sequence, ClassLabel
import evaluate  # Hugging Face `evaluate` library, used to load metrics
import numpy as np

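# Function to parse the .conll file
def read_conll_file(file_path):
    """Read a CoNLL-style file and return {"tokens": [...], "labels": [...]}, one inner list per sentence."""
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.readlines()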

    tokens = []
    labels = []
    current_tokens = []
    current_labels = []

    for line in lines:
        line = line.strip()
        if line == "":  # End of a sentence
            if current_tokens:
                tokens.append(current_tokens)
                labels.append(current_labels)
                current_tokens = []
                current_labels = []
        else:
            # Expected format: token and label separated by whitespace (tab or spaces)
            parts = line.split()
            if len(parts) == 2:  # Lines without exactly two fields are skipped silently
                current_tokens.append(parts[0])
                current_labels.append(parts[1])

    # Add the last sentence if the file doesn't end with a blank line
    if current_tokens:
        tokens.append(current_tokens)
        labels.append(current_labels)

    return {"tokens": tokens, "labels": labels}

# Load the .conll file (file_path is defined in the previous cell)
data = read_conll_file(file_path)
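
To make the expected input concrete, here is a minimal sketch of the same parsing logic applied to an in-memory string instead of a file. The `parse_conll_text` helper, the sample tokens, and the `B-PRODUCT`, `B-PRICE`, and `I-PRICE` label names are made up for illustration; the real labels come from `labeled_messages.conll`.

```python
def parse_conll_text(text):
    """Parse CoNLL-style text: one 'token label' pair per line, blank line between sentences."""
    tokens, labels = [], []
    current_tokens, current_labels = [], []

    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line closes the current sentence
            if current_tokens:
                tokens.append(current_tokens)
                labels.append(current_labels)
                current_tokens, current_labels = [], []
            continue
        parts = line.split()
        if len(parts) == 2:  # any other field count is skipped, as in read_conll_file
            current_tokens.append(parts[0])
            current_labels.append(parts[1])

    # Flush the last sentence if the text does not end with a blank line
    if current_tokens:
        tokens.append(current_tokens)
        labels.append(current_labels)

    return {"tokens": tokens, "labels": labels}


# Hypothetical two-sentence sample
sample = "phone B-PRODUCT\nprice O\n500 B-PRICE\nbirr I-PRICE\n\nnew O\nshoes B-PRODUCT\n"
parsed = parse_conll_text(sample)
print(parsed["tokens"])  # [['phone', 'price', '500', 'birr'], ['new', 'shoes']]
print(parsed["labels"])  # [['B-PRODUCT', 'O', 'B-PRICE', 'I-PRICE'], ['O', 'B-PRODUCT']]
```

Because lines that do not split into exactly two fields are dropped, a file with an extra column or a missing label would yield shorter sentences without raising an error, so it may be worth counting skipped lines when loading real data.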






In [5]:
# Get the sorted set of unique labels across all sentences
unique_labels = sorted({label for sublist in data["labels"] for label in sublist})

# Define the features for the dataset
features = Features({
    "tokens": Sequence(Value("string")),
    "labels": Sequence(ClassLabel(names=unique_labels)),  # Treat labels as ClassLabel
})

# Convert to Hugging Face Dataset with explicit features
dataset = Dataset.from_dict(data, features=features)

# Split into train (80%) and test (20%) sets; the test split also serves as the evaluation set during training
dataset = dataset.train_test_split(test_size=0.2, seed=42)

print(dataset)

# Function to tokenize and align labels.
# Note: relies on a module-level `tokenizer`, which is assigned inside the fine-tuning loop.
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        padding="max_length",
        max_length=512,
        is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples["labels"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:  # Special tokens (e.g., [CLS], [SEP])
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # New word
                label_ids.append(label[word_idx])
            else:  # Same word (subword)
                label_ids.append(-100)  # Use -100 to ignore subwords in the loss function
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


DatasetDict({
    train: Dataset({
        features: ['tokens', 'labels'],
        num_rows: 40
    })
    test: Dataset({
        features: ['tokens', 'labels'],
        num_rows: 10
    })
})
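
The alignment rule inside `tokenize_and_align_labels` can be traced without a tokenizer by feeding it a hand-written `word_ids` list. The sketch below isolates that loop in a helper named `align_labels` (a name introduced here for illustration); the `word_ids` and label ids are hypothetical and do not come from the dataset.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; mask everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:  # special or padding token
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:  # first subword of a new word
            aligned.append(word_labels[word_idx])
        else:  # continuation subword of the same word
            aligned.append(ignore_index)
        previous_word_idx = word_idx
    return aligned


# Hypothetical case: 3 words split into 5 subwords, wrapped in two special tokens
word_ids = [None, 0, 0, 1, 2, 2, None]
word_labels = [3, 0, 1]
print(align_labels(word_ids, word_labels))  # [-100, 3, -100, 0, 1, -100, -100]
```

Padding positions also have a word id of `None`, so with `padding="max_length"` every padded position ends up as `-100` and is excluded from both the loss and the accuracy computed later.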


In [19]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
import evaluate
import numpy as np

# Define the models to compare
models_to_compare = {
    "xlm-roberta": "xlm-roberta-base",
    "distilbert": "distilbert-base-multilingual-cased",
    "mbert": "bert-base-multilingual-cased",
}

# Define a custom base directory for saving models
custom_base_dir = "C:/Users/ibsan/Desktop/TenX/week-5/model_output/results"

# Fine-tune each model
for model_name, model_checkpoint in models_to_compare.items():
    print(f"Fine-tuning {model_name}...")

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

    # Tokenize the dataset
    tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

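    # Load the model; id2label/label2id store the label names in the saved config
    model = AutoModelForTokenClassification.from_pretrained(
        model_checkpoint,
        num_labels=len(unique_labels),  # Number of unique labels
        id2label=dict(enumerate(unique_labels)),
        label2id={label: i for i, label in enumerate(unique_labels)},
    )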

    # Training configuration (`eval_strategy` is the current name of the older `evaluation_strategy` argument)
    training_args = TrainingArguments(
        output_dir=f"{custom_base_dir}/{model_name}-fine-tuned",
        eval_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
        save_strategy="epoch",
        logging_dir=f"{custom_base_dir}/{model_name}-logs",
        logging_steps=10,
        report_to="none",
    )

    # Build the Trainer (`processing_class` is the current name of the older `tokenizer` argument)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["test"],
        processing_class=tokenizer,
    )

    # Fine-tune the model
    trainer.train()

    # Save the fine-tuned model to the custom directory
    model_save_path = f"{custom_base_dir}/{model_name}-fine-tuned"
    trainer.save_model(model_save_path)
    tokenizer.save_pretrained(model_save_path)

    print(f"{model_name} fine-tuning completed and saved successfully at {model_save_path}!")

    # Load the accuracy metric
    accuracy_metric = evaluate.load("accuracy")

    # Evaluate the model and get predictions
    predictions, labels, metrics = trainer.predict(tokenized_dataset["test"])

    # Convert logits to predicted class indices
    predictions = np.argmax(predictions, axis=-1)

    # Flatten predictions and labels
    flat_predictions = predictions.flatten()
    flat_labels = labels.flatten()

    # Filter out -100 values (ignored labels)
    mask = flat_labels != -100
    filtered_predictions = flat_predictions[mask]
    filtered_labels = flat_labels[mask]

    # Calculate accuracy using the metric
    eval_accuracy = accuracy_metric.compute(predictions=filtered_predictions, references=filtered_labels)
    print(f"{model_name} Accuracy: {eval_accuracy}")

Fine-tuning xlm-roberta...


Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,No log,1.609584
2,No log,1.42928
3,No log,1.331598


xlm-roberta fine-tuning completed and saved successfully at C:/Users/ibsan/Desktop/TenX/week-5/model_output/results/xlm-roberta-fine-tuned!


xlm-roberta Accuracy: {'accuracy': 0.6911196911196911}
Fine-tuning distilbert...


Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,No log,1.308204
2,No log,1.117758
3,No log,1.068869


distilbert fine-tuning completed and saved successfully at C:/Users/ibsan/Desktop/TenX/week-5/model_output/results/distilbert-fine-tuned!


distilbert Accuracy: {'accuracy': 0.6911196911196911}
Fine-tuning mbert...


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,No log,1.006708
2,No log,0.828552
3,No log,0.782473


mbert fine-tuning completed and saved successfully at C:/Users/ibsan/Desktop/TenX/week-5/model_output/results/mbert-fine-tuned!


mbert Accuracy: {'accuracy': 0.7683397683397684}
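
Token accuracy can be dominated by the most frequent label. The identical scores for `xlm-roberta` and `distilbert` above would be consistent with both models predicting a single class for every token, although the logged output alone does not confirm that. One way to check, sketched here in plain Python with made-up label ids, is to compare the masked accuracy against a majority-class baseline computed over the same positions. The `masked_accuracy` and `majority_baseline` helpers are illustrative names, not part of the notebook or of any library.

```python
from collections import Counter

IGNORE_INDEX = -100  # same sentinel the notebook uses for ignored positions


def masked_accuracy(predictions, labels):
    """Accuracy over positions whose label is not IGNORE_INDEX."""
    pairs = [(p, l) for p, l in zip(predictions, labels) if l != IGNORE_INDEX]
    return sum(p == l for p, l in pairs) / len(pairs)


def majority_baseline(labels):
    """Most frequent label and the accuracy of always predicting it."""
    kept = [l for l in labels if l != IGNORE_INDEX]
    most_common_label, count = Counter(kept).most_common(1)[0]
    return most_common_label, count / len(kept)


# Hypothetical flattened labels and predictions (not taken from the run above)
flat_labels = [-100, 2, 2, 0, -100, 2, 1, 2, -100, -100]
flat_predictions = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]  # a model that always predicts class 2

acc = masked_accuracy(flat_predictions, flat_labels)
label, baseline = majority_baseline(flat_labels)
print(f"accuracy={acc:.3f}, majority label={label}, baseline={baseline:.3f}")
```

If a model's accuracy equals the baseline, entity-level precision, recall, and F1 (for example from `seqeval`, which the first cell installs) would give a clearer comparison than accuracy alone.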
