# 0. Instructions and setup

## 0.1. Instructions. Part 2: Data Scientist Challenge (3.5 points)

- **Objective:** Explore different techniques to enhance model performance with limited labeled data. You will be limited to 32 labeled examples in your task. The rest can be viewed as unlabeled data.

- **Tasks:**
  - **a. BERT Model with Limited Data (0.5 points):** Train a BERT-based model using only 32 labeled examples and assess its performance.
  - **b. Dataset Augmentation (1 point):** Experiment with an automated technique to increase your dataset size **without using LLMs** (ChatGPT / Mistral / Gemini / etc.). Evaluate the impact on model performance.
  - **c. Zero-Shot Learning with LLM (0.5 points):** Apply an LLM (ChatGPT/Claude/Mistral/Gemini/...) in a zero-shot learning setup. Document the performance.
  - **d. Data Generation with LLM (1 point):** Use an LLM (ChatGPT/Claude/Mistral/Gemini/...) to generate new, labeled dataset points. Train your BERT model with them plus the 32 labeled examples. Analyze how this impacts model metrics.
  - **e. Optimal Technique Application (0.5 points):** Based on the previous experiments, apply the most effective technique(s) to further improve your model's performance. Comment on your results and propose improvements.

## 0.2. Libraries

In [1]:
# !pip install polars  # Install polars for faster data processing

In [2]:
# Utilities
import numpy as np
import polars as pl
from library.metrics import Metrics
from library.utilities import set_seed
from datasets import Dataset
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import os

# Deep Learning and NLP
import torch
from setfit import SetFitModel, Trainer, TrainingArguments
import matplotlib.pyplot as plt


In [3]:
# Initialize the metrics object to save the results
metrics = Metrics()

In [4]:
# Check the availability of a GPU
print(torch.cuda.is_available())

True


## 0.3. Random Seed

In [5]:
# Set random seed for reproducibility
seed = 42
set_seed(seed)

Seed set to 42. This ensures reproducibility of results across runs.


## 0.4. Loading the data: Swiss Judgement Prediction

Source: https://huggingface.co/datasets/rcds/swiss_judgment_prediction

In [6]:
# Load the cleaned Parquet file
df = pl.read_parquet('swiss_judgment_prediction_fr&it_clean.parquet')

# Display the loaded DataFrame
print("\nLoaded DataFrame shape:", df.shape)
print("\nLoaded DataFrame schema:")
print(df.schema)
print("\nFirst few rows of the loaded DataFrame:")
df.head()


Loaded DataFrame shape: (35386, 9)

Loaded DataFrame schema:
Schema({'id': Int32, 'year': Int32, 'text': String, 'labels': Int64, 'language': String, 'region': String, 'canton': String, 'legal area': String, 'split': String})

First few rows of the loaded DataFrame:


id,year,text,labels,language,region,canton,legal area,split
i32,i32,str,i64,str,str,str,str,str
22014,2011,"""Faits: A. Le 28 octobre 2002 à…",0,"""fr""","""Région lémanique""","""ge""","""civil law""","""train"""
11593,2007,"""Faits : Faits : A. Le 17 avril…",1,"""fr""","""Région lémanique""","""ge""","""penal law""","""train"""
26670,2013,"""Faits: A. Par jugement du 2 ma…",0,"""fr""","""Région lémanique""","""vd""","""penal law""","""train"""
5864,2004,"""Faits: Faits: A. N._, née en 1…",1,"""fr""","""Région lémanique""","""vd""","""insurance law""","""train"""
16122,2009,"""Faits: A. Y._ est propriétaire…",0,"""fr""","""Région lémanique""","""ge""","""public law""","""train"""


In [7]:
# Split the DataFrame into training, validation and test sets
train_df = df.filter(pl.col('split') == 'train')
val_df = df.filter(pl.col('split') == 'validation')
test_df = df.filter(pl.col('split') == 'test')

# Split each of the splits into the different languages
# train_fr_df = train_df.filter(pl.col('language') == 'fr')
# train_it_df = train_df.filter(pl.col('language') == 'it')
# val_fr_df = val_df.filter(pl.col('language') == 'fr')
# val_it_df = val_df.filter(pl.col('language') == 'it')
# test_fr_df = test_df.filter(pl.col('language') == 'fr')
# test_it_df = test_df.filter(pl.col('language') == 'it')

# Delete the original data to free up memory
del df

# 1. BERT Model with Limited Data

Outline of the intermediate tasks:

1. Preprocessing Pipeline
   - Lowercasing, punctuation stripping (or not, depending on BERT tokenizer).
   - SentencePiece/BPE tokenization via the CamemBERT (for French) or UmBERTo (for Italian) tokenizer.
   - (Optional) language tags if you merge FR+IT in one model.
2. Hold-out Split. With only 32 labels, use stratified k-fold CV (e.g. 8 × 4-fold) to get more reliable estimates, or leave-one-out to maximize the training data per fold.
3. BERT Model with Only 32 Examples
   - Model Choice: Pick a multilingual BERT (mBERT) or separate CamemBERT/UmBERTo checkpoint. Alternatives:
     - Multilingual/monolingual models:
       - One multilingual [`Sentence-Transformer` model](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#original-models), such as `paraphrase-multilingual-mpnet-base-v2` (best multilingual performer), `paraphrase-multilingual-MiniLM-L12-v2` (performs similarly to the former, but is much faster and smaller) or `distiluse-base-multilingual-cased-v1` (the weakest of the three).
       - [BERT multilingual base model (cased)](https://huggingface.co/google-bert/bert-base-multilingual-cased). [Uncased model](https://huggingface.co/google-bert/bert-base-multilingual-uncased) also available. Paper: "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)". For an even smaller model, the distilled version can be used: [`distilbert-base-multilingual-cased`](https://huggingface.co/distilbert/distilbert-base-multilingual-cased).
       - [CamemBERT 2.0](https://huggingface.co/almanach/camembertv2-base) and [CamemBERTav2](https://huggingface.co/almanach/camembertav2-base), models trained on French text and described in the paper "[CamemBERT 2.0: A Smarter French Language Model Aged to Perfection](https://arxiv.org/html/2411.08868v1#S3)". These models reportedly improve on the original [CamemBERT](https://huggingface.co/docs/transformers/en/model_doc/camembert) model, described in "[CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894)".
       - [FlauBERT](https://huggingface.co/docs/transformers/en/model_doc/flaubert), another model pre-trained on French text. Paper: "[FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372)".
       - [BERT Base Italian Cased](https://huggingface.co/dbmdz/bert-base-italian-cased), [Uncased](https://huggingface.co/dbmdz/bert-base-italian-uncased), and [XXL Uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased).
       - [UmBERTo Commoncrawl Cased](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1), another model trained on a large corpus of Italian text.
     - Domain-specific models (law):
       - [LEGAL-BERT](https://huggingface.co/nlpaueb/legal-bert-base-uncased) does not seem to be a good option, as it was trained only on English data.
       - [JuriBERT](https://huggingface.co/dascim/juribert-base) for legal texts in French. Paper explaining the model: [JuriBERT: A Masked-Language Model Adaptation for French Legal Text](https://arxiv.org/pdf/2110.01485).
       - [ITALIAN-LEGAL-BERT](https://huggingface.co/dlicari/Italian-Legal-BERT) for legal text in Italian. Paper explaining the model: [ITALIAN-LEGAL-BERT models for improving natural language processing tasks in the Italian legal domain](https://www.sciencedirect.com/science/article/pii/S0267364923001188).
     - Models to try: 
       - For both languages: `paraphrase-multilingual-MiniLM-L12-v2` (of the SentenceTransformers, good balance between performance and speed), BERT multilingual base model (cased).
       - For French: CamemBERT, FlauBERT, CamemBERTav2 (based on DebertaV2 architecture), JuriBERT.
       - For Italian: BERT Base Italian Cased, UmBERTo Commoncrawl Cased, ITALIAN-LEGAL-BERT.
   - Fine-tuning Setup.
     - Freeze or unfreeze last n encoder layers—try both.
     - Small learning rate (2e-5 – 5e-5), batch size = 8 or 16.
     - Early stopping on validation loss.
4. Training & Evaluation
    - Run your k-fold CV training loops.
    - Track accuracy, F1, precision, recall per fold.
    - Report mean ± std.
5. Error Analysis and feature interpretation.
    - Use `LIME` for analyzing the most relevant features for classifying the texts. 
    - Look at which examples are mispredicted.
    - Check language breakdown (FR vs. IT) to see if one is harder.
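
The hold-out step in the outline (stratified k-fold over the 32 labeled examples) can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `stratified_folds` is a hypothetical helper, and the label vector assumes 16 examples per class.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=42):
    """Split example indices into k folds with (near-)equal class counts per fold."""
    rng = random.Random(seed)

    # Group example indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical label vector: 32 labeled examples, 16 per class
labels = [0] * 16 + [1] * 16
folds = stratified_folds(labels, k=4)

for fold_id, val_idx in enumerate(folds):
    # Every fold serves once as validation set; the remaining folds form the training set
    train_idx = [i for fold in folds if fold is not val_idx for i in fold]
    n_pos = sum(labels[i] for i in val_idx)
    print(f"fold {fold_id}: train={len(train_idx)}, val={len(val_idx)}, val positives={n_pos}")
```

With 4 folds, each model is trained on 24 examples and validated on 8, so per-fold metrics are noisy; that is why the outline asks for mean ± std across folds rather than a single score.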

## 1.1. Standard fine-tuning

Notebook of reference: `Session_5_1_BERT_HF_Implementation.ipynb`, sections 1, 2 and 5.

## 1.2. Using SetFit ("Sentence Transformer Fine-Tuning")

Notebook of reference: `Session_6_2_Zero_Shot_Classification.ipynb`, introduction, step 1 (loading data) and step 6 ("Few-Shot Classification with SetFit").

Applying SetFit (the “Sentence Transformer Fine-Tuning” recipe) can be regarded as **training**: it fine-tunes a pre-trained sentence-embedding model and fits a small classifier on top. Furthermore, SetFit was designed specifically to reach strong performance with only a few dozen labeled examples.

---

**Why SetFit can be regarded as training**

- **Contrastive fine-tuning:**
  We start from a pre-trained Sentence-Transformer model and *fine-tune* it on sentence pairs generated automatically from the 32 labeled examples.
- **Classifier head training:**
  After contrastive tuning, SetFit fits a lightweight logistic-regression (or small MLP) classifier on the resulting embeddings.
- Both steps update model parameters, so this is training/fine-tuning, not mere prompt engineering or zero-shot inference.

---

**Why SetFit excels in limited-label regimes**

1. **Data amplification via contrastive pairs**

   - From the labeled examples, SetFit builds positive pairs (two examples that share a label) and negative pairs (two examples from different classes), turning 32 labels into hundreds of pairwise training signals.
   - That extra signal helps the embedding space separate classes, even when you only have a few “gold” labels.

2. **Lightweight classifier**

   - Because the embedding model has already been tuned to distinguish the classes, the final classifier can be a simple logistic regression or MLP, so it needs very few examples to learn decision boundaries.

3. **Empirical few-shot strength**

   - In benchmarks, SetFit often outperforms standard BERT fine-tuning with few labels, and it trains quickly because the sentence-embedding body is small and needs no prompts or verbalizers.
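
To give a feel for the size of this amplification, the sketch below counts the pairs obtainable from 32 balanced labels. It is a counting illustration only: `count_contrastive_pairs` is a hypothetical helper, and both the inclusion of self-pairs and the oversampling of the smaller pair set are assumptions about the sampler, not something this notebook confirms.

```python
from itertools import combinations, combinations_with_replacement


def count_contrastive_pairs(labels, include_self_pairs=True):
    """Count same-label (positive) and different-label (negative) index pairs."""
    indices = range(len(labels))
    if include_self_pairs:
        # Assumption: an example paired with itself counts as a positive pair
        pair_iter = combinations_with_replacement(indices, 2)
    else:
        pair_iter = combinations(indices, 2)

    positives = negatives = 0
    for i, j in pair_iter:
        if labels[i] == labels[j]:
            positives += 1
        else:
            negatives += 1
    return positives, negatives


# 32 labeled examples, 16 per class (the balanced sample used in this notebook)
labels = [0] * 16 + [1] * 16
pos, neg = count_contrastive_pairs(labels)

# Assumption: the smaller set is oversampled until both sets have the same size
balanced_total = 2 * max(pos, neg)
print(f"positive pairs: {pos}, negative pairs: {neg}, balanced total: {balanced_total}")
```

Under these assumptions the balanced total is 544, which agrees with the `Num unique pairs = 544` line in the training logs of this notebook. Without self-pairs the count would be 240 positive and 256 negative pairs.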

---

Note that there are 2 aspects that we need to implement for SetFit to work properly:

1. **Balance the examples of positive and negative classes** that will be passed for contrastive learning. Since approximately 29% of the labels are positive (approved motions) and the other 71% correspond to rejected motions, passing examples for contrastive learning without balancing the classes would bias the embedding space towards the majority class.
   
2. We need to **run several iterations** to reduce the variability of the results: with only 32 labeled examples, a single training run depends heavily on which examples happen to be sampled, so its score is very noisy. Averaging across iterations smooths out this variance and gives a more reliable estimate of the expected performance.
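
The 29% / 71% imbalance also affects how the scores should be read. The sketch below computes the metrics of the two trivial classifiers (always "rejected", always "approved") for that positive rate. `trivial_baseline_metrics` is a hypothetical helper, and the 0.29 rate is the approximate figure quoted above, not a value recomputed from the data.

```python
def trivial_baseline_metrics(positive_rate, predict_positive):
    """Metrics of a classifier that always predicts the same class."""
    if predict_positive:
        tp, fp, fn, tn = positive_rate, 1.0 - positive_rate, 0.0, 0.0
    else:
        tp, fp, fn, tn = 0.0, 0.0, positive_rate, 1.0 - positive_rate

    accuracy = tp + tn
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


positive_rate = 0.29  # approximate share of approved motions quoted in the text

always_negative = trivial_baseline_metrics(positive_rate, predict_positive=False)
always_positive = trivial_baseline_metrics(positive_rate, predict_positive=True)

print("always negative:", {k: round(v, 3) for k, v in always_negative.items()})
print("always positive:", {k: round(v, 3) for k, v in always_positive.items()})
```

Always predicting "rejected" reaches 71% accuracy with an F1 of 0, which is why F1 on the positive class is a more informative target than accuracy here. Always predicting "approved" gives a non-trivial F1 for free, so a trained model's F1 should be compared against that baseline for the positive rate of the evaluation set actually used.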

### Utilities

In [8]:
# Function to sample a balanced dataset
def sample_balanced_dataset(dataset: pl.DataFrame, num_samples: int, seed: int) -> Dataset:
    """
    Sample a balanced dataset with equal numbers of positive and negative examples.

    Args:
        dataset (pl.DataFrame): The input dataset containing 'text' and 'label' columns.
        num_samples (int): Total number of samples to return, must be even.
        seed (int): Random seed for reproducibility.

    Returns:
        Dataset: A balanced Dataset object with equal numbers of positive and negative examples.
    """
    # Validate the arguments before doing any work
    if seed is None:
        raise ValueError("Seed must be provided for reproducibility.")
    if num_samples % 2 != 0:
        raise ValueError("num_samples must be even to balance the two classes.")

    # Get positive and negative examples
    pos_examples = dataset.filter(pl.col('label') == 1)
    neg_examples = dataset.filter(pl.col('label') == 0)

    # Sample equal numbers from each class
    samples_per_class = num_samples // 2
    pos_sampled = pos_examples.sample(n=samples_per_class, shuffle=True, seed=seed)
    neg_sampled = neg_examples.sample(n=samples_per_class, shuffle=True, seed=seed)

    # Concatenate the sampled DataFrames, keeping only the text and label columns
    df = pl.concat(
        [pos_sampled.select(['text', 'label']), neg_sampled.select(['text', 'label'])],
        how='vertical',
    )

    # Convert the combined DataFrame into a single Dataset object
    return Dataset.from_polars(df)

In [18]:
# Function to run SetFit training and evaluation routine
def run_setfit_training(train_df: pl.DataFrame, val_df: pl.DataFrame, 
                        model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2',
                        num_epochs=5, batch_size=16, learning_rate=2e-5, sample_size=32,
                        metric='f1', num_iterations=10, seed=42):
    """
    Run SetFit training and evaluation routine.
    
    Args:
        train_df (pl.DataFrame): Training DataFrame with 'text' and 'label' columns.
        val_df (pl.DataFrame): Validation DataFrame with 'text' and 'label' columns.
        model_name (str): Pretrained model name for SetFit.
        num_epochs (int): Number of epochs for training.
        batch_size (int): Batch size for training.
        learning_rate (float): Learning rate for the optimizer.
        sample_size (int): Number of samples to use for training in each iteration.
        metric (str): Metric to optimize during training ('f1', 'accuracy', etc.).
        num_iterations (int): Number of iterations to run the training process.
        seed (int): Random seed for reproducibility.
    
    Returns:
        pl.DataFrame: One row per iteration with the validation metrics and the iteration number.
    """
    
    # Prepare the validation set
    val_set = Dataset.from_polars(val_df.select(['text', 'label']))

    # Store results across iterations (for different metrics and iterations)
    iteration_results = []
    best_score = float('-inf')  # Any real score beats this, so a best model is always stored
    best_model = None
    best_iteration = 0

    # Run the sampling and training process with SetFit
    for iteration in range(num_iterations):
        print(f"\nIteration {iteration + 1}/{num_iterations}")

        # Create a fresh model for each iteration to avoid contamination
        model = SetFitModel.from_pretrained(model_name)

        # Sample balanced training data
        train_samples = sample_balanced_dataset(train_df, sample_size, seed + iteration)

        # Create the training arguments
        train_args = TrainingArguments(
            num_epochs=num_epochs,  # Number of epochs for training
            batch_size=batch_size,  # Batch size for training
            body_learning_rate=learning_rate  # Learning rate for the optimizer
        )

        # Initialize and train SetFit model
        trainer = Trainer(
            model=model,
            train_dataset=train_samples,  # Pairs of text and labels for Contrastive Learning
            eval_dataset=val_set,  # Validation set for evaluation
            metric=metric,  # Metric to optimize
            args=train_args  # Training arguments
        )

        trainer.train()  # Train the model
        print("Training completed.")

        # Evaluate on validation set (for hyperparameter tuning/model selection)
        val_predictions = trainer.model.predict(val_set['text'])
        val_metrics = {
            'accuracy': accuracy_score(val_set['label'], val_predictions),
            'f1': f1_score(val_set['label'], val_predictions),
            'precision': precision_score(val_set['label'], val_predictions),
            'recall': recall_score(val_set['label'], val_predictions)
        }
        
        # Store results for this iteration
        iteration_results.append(val_metrics)
        print(f"Validation F1: {val_metrics['f1']:.4f}")

        # Check if this is the best model so far
        current_score = val_metrics[metric]
        if current_score > best_score:
            best_score = current_score
            best_iteration = iteration + 1
            # Clean up previous best model
            if best_model is not None:
                del best_model
            # Store reference to current best model
            best_model = trainer.model
            print(f"New best model found with {metric}: {best_score:.4f}")

        # Clean up the per-iteration names. If this iteration produced the best model,
        # `best_model` still holds a reference to it, so it is not freed.
        del model, trainer

        # Clear CUDA cache if using GPU
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()  # Wait for all operations to complete
    
    print('Finished training in all iterations. Saving the results to a parquet file...')
    
    # Save the best model
    san_model_name = model_name.split(sep='/')[-1]  # Sanitize model name for file path, keep only the last part
    model_path = os.path.join('models', 'part_2', 'a', f'setfit_best_{san_model_name}')
    os.makedirs(model_path, exist_ok=True)
    best_model.save_pretrained(model_path)  # Save the best model to the specified path
    print(f'Best model saved to: {model_path}')

    # Save the results in a polars DataFrame
    results_df = pl.DataFrame(iteration_results)
    results_df = results_df.with_columns(pl.Series("iteration", range(1, len(results_df) + 1)))  # Add iteration number to the results DataFrame

    # Save the results DataFrame to a Parquet file
    results_path = os.path.join('results', 'part_2', 'a', f'setfit_results_{san_model_name}.parquet')
    os.makedirs(os.path.dirname(results_path), exist_ok=True)  # Ensure the directory exists
    results_df.write_parquet(results_path)
    print(f'Results saved to: {results_path}')

    return results_df

Note that SetFit natively supports only sentence-transformer models (see the [models published by `sentence-transformers`](https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers) and the [docs](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#original-models)). If you pass any other checkpoint, the BERT model is automatically wrapped in a sentence-transformers layer with mean pooling, and the following message is shown:

> No sentence-transformers model found with name google-bert/bert-base-multilingual-cased. Creating a new one with mean pooling.

However, for better performance with SetFit, it is preferred to use a model that was specifically pre-trained as a sentence transformer.
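
To make "mean pooling" concrete: the sentence embedding is the average of the token embeddings produced by the encoder. The sketch below is a pure-Python illustration that assumes padding positions are excluded through the attention mask. `mean_pool` and the toy vectors are illustrative and are not the library's code, which operates on tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, ignoring padding positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n_tokens = 0

    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # mask value 1 = real token, 0 = padding
            n_tokens += 1
            for d in range(dim):
                summed[d] += vector[d]

    if n_tokens == 0:
        raise ValueError("The attention mask excludes every token.")
    return [value / n_tokens for value in summed]


# Toy example: three token positions with 2-dimensional embeddings, the last one is padding
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [2.0, 3.0]
```

A plain BERT checkpoint was not trained so that this average is a good sentence representation, which is one way to read the recommendation above to prefer models pre-trained as sentence transformers.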


Models to try: 
- For both languages: `paraphrase-multilingual-MiniLM-L12-v2` (of the SentenceTransformers, good balance between performance and speed), distilled BERT multilingual base model (cased) ([`distilbert-base-multilingual-cased`](https://huggingface.co/distilbert/distilbert-base-multilingual-cased)).
- For French: CamemBERTav2 (based on DebertaV2 architecture), JuriBERT.
- For Italian: BERT Base Italian Cased, ITALIAN-LEGAL-BERT.

To check whether a model actually fits in the memory of your computer, use the [Model memory estimator](https://huggingface.co/docs/accelerate/en/usage_guides/model_size_estimator).



Finally, note that running SetFit can cause memory issues (see https://github.com/huggingface/setfit/issues/472). Adapt the hyperparameters to your memory limits accordingly.

### Multilingual models

In [19]:
# Rename the 'labels' column to 'label' (the column name SetFit expects), working on a copy
train_df_setfit = train_df.clone()
val_df_setfit = val_df.clone()
train_df_setfit = train_df_setfit.rename({'labels': 'label'})
val_df_setfit = val_df_setfit.rename({'labels': 'label'})

# Define sample size (labelled examples for training)
sample_size = 32
num_iterations = 10  # Run several iterations for the sample size (32 labels) to minimize the impact of randomness
metric = 'f1'  # Metric to optimize
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"  # Path to the pre-trained model
batch_size = 64 # Batch size for training (reduce if you run into memory issues, but larger batch sizes will speed up training)
num_epochs = 5  # Number of epochs for training
learning_rate= 2e-5  # Learning rate for the optimizer

# Create a subset of the validation data frame for faster processing
sample_size_val = 500
val_df_setfit = val_df_setfit.sample(n=sample_size_val, shuffle=True, seed=seed)

results_m1 = run_setfit_training(
    train_df=train_df_setfit, 
    val_df=val_df_setfit, 
    model_name=model_name, 
    num_epochs=num_epochs, 
    batch_size=batch_size, 
    learning_rate=learning_rate, 
    sample_size=sample_size, 
    metric=metric, 
    num_iterations=num_iterations, 
    seed=seed
)


Iteration 1/10


model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
  obj.co_lnotab,  # for < python 3.10 [not counted in args]


Map:   0%|          | 0/32 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 544
  Batch size = 64
  Num epochs = 5


Step,Training Loss
1,0.2428


Training completed.
Validation F1: 0.2857
New best model found with f1: 0.2857

Iteration 2/10


Map:   0%|          | 0/32 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 544
  Batch size = 64
  Num epochs = 5


Step,Training Loss
1,0.2876


Training completed.
Validation F1: 0.2508

Iteration 3/10


Map:   0%|          | 0/32 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 544
  Batch size = 64
  Num epochs = 5


Step,Training Loss
1,0.2936


Training completed.
Validation F1: 0.2566

Iteration 4/10


Map:   0%|          | 0/32 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 544
  Batch size = 64
  Num epochs = 5


Step,Training Loss
1,0.3051


Training completed.
Validation F1: 0.2593

Iteration 5/10


Map:   0%|          | 0/32 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 544
  Batch size = 64
  Num epochs = 5


Step,Training Loss
1,0.2565


Training completed.
Validation F1: 0.3069
New best model found with f1: 0.3069

Iteration 6/10


Map:   0%|          | 0/32 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 544
  Batch size = 64
  Num epochs = 5


Step,Training Loss
1,0.2692


Training completed.
Validation F1: 0.2338

Iteration 7/10


Map:   0%|          | 0/32 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 544
  Batch size = 64
  Num epochs = 5


Step,Training Loss
1,0.2593


Training completed.
Validation F1: 0.2830

Iteration 8/10


Map:   0%|          | 0/32 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 544
  Batch size = 64
  Num epochs = 5


Step,Training Loss
1,0.282


Training completed.
Validation F1: 0.2618

Iteration 9/10


Map:   0%|          | 0/32 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 544
  Batch size = 64
  Num epochs = 5


Step,Training Loss
1,0.2593


Training completed.
Validation F1: 0.2837

Iteration 10/10


Map:   0%|          | 0/32 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 544
  Batch size = 64
  Num epochs = 5


Step,Training Loss
1,0.2597


Training completed.
Validation F1: 0.2279
Finished training in all iterations. Saving the results to a parquet file...
Best model saved to: models/part_2/a/setfit_best_paraphrase-multilingual-MiniLM-L12-v2
Results saved to: results/part_2/a/setfit_results_paraphrase-multilingual-MiniLM-L12-v2.parquet


In [24]:
print(f'Results for model: {model_name}')
display(results_m1)

# Print mean of all metrics across iterations
# (the list is named `metric_names` so it does not overwrite the `metrics = Metrics()` object created earlier)
metric_names = ['accuracy', 'f1', 'precision', 'recall']
mean_metrics = results_m1.select([pl.col(name).mean().alias(name) for name in metric_names]).to_dict(as_series=False)
print("\nMean metrics across all iterations:")
print(mean_metrics)

Results for model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2


accuracy,f1,precision,recall,iteration
f64,f64,f64,f64,i64
0.55,0.285714,0.204545,0.473684,1
0.546,0.250825,0.182692,0.4,2
0.49,0.25656,0.177419,0.463158,3
0.52,0.259259,0.183406,0.442105,4
0.458,0.306905,0.202703,0.631579,5
0.502,0.233846,0.165217,0.4,6
0.544,0.283019,0.201794,0.473684,7
0.436,0.26178,0.174216,0.526316,8
0.384,0.283721,0.18209,0.642105,9
0.58,0.227941,0.175141,0.326316,10



Mean metrics across all iterations:
{'accuracy': [0.5010000000000001], 'f1': [0.26495709982755045], 'precision': [0.1849223869644958], 'recall': [0.47789473684210526]}
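
The mean alone hides the run-to-run spread that motivated running several iterations. The sketch below reports mean ± standard deviation for F1, using the ten per-iteration F1 values copied from the table above. It uses only the standard library, and the sample standard deviation (`statistics.stdev`) is one reasonable choice, not something the notebook prescribes.

```python
from statistics import mean, stdev

# Per-iteration validation F1 scores, copied from the results table above
f1_scores = [
    0.285714, 0.250825, 0.25656, 0.259259, 0.306905,
    0.233846, 0.283019, 0.26178, 0.283721, 0.227941,
]

f1_mean = mean(f1_scores)
f1_std = stdev(f1_scores)  # sample standard deviation (n - 1 in the denominator)

print(f"F1 = {f1_mean:.4f} ± {f1_std:.4f} over {len(f1_scores)} iterations "
      f"(min {min(f1_scores):.4f}, max {max(f1_scores):.4f})")
```

Reporting the spread next to the mean makes it easier to judge whether a difference between two models is larger than the noise introduced by sampling different sets of 32 examples.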


In [None]:
# Rename the 'labels' column to 'label' (the column name SetFit expects), working on a copy
train_df_setfit = train_df.clone()
val_df_setfit = val_df.clone()
train_df_setfit = train_df_setfit.rename({'labels': 'label'})
val_df_setfit = val_df_setfit.rename({'labels': 'label'})

# Define sample size (labelled examples for training)
sample_size = 32
num_iterations = 5  # Run several iterations for the sample size (32 labels) to minimize the impact of randomness
metric = 'f1'  # Metric to optimize
model_name = "google-bert/bert-base-multilingual-cased"  # Path to the pre-trained model
batch_size = 8 # Batch size for training (reduce if you run into memory issues, but larger batch sizes will speed up training)
num_epochs = 5  # Number of epochs for training
learning_rate = 2e-5  # Learning rate for the optimizer

# Create a subset of the validation data frame for faster processing
sample_size_val = 500
val_df_setfit = val_df_setfit.sample(n=sample_size_val, shuffle=True, seed=seed)

results_m2 = run_setfit_training(
    train_df=train_df_setfit, 
    val_df=val_df_setfit, 
    model_name=model_name, 
    num_epochs=num_epochs, 
    batch_size=batch_size, 
    learning_rate=learning_rate, 
    sample_size=sample_size, 
    metric=metric, 
    num_iterations=num_iterations, 
    seed=seed
)

In [None]:
print(f'Results for model: {model_name}')
display(results_m2)

# Print mean of all metrics across iterations
metrics = ['accuracy', 'f1', 'precision', 'recall']
mean_metrics = results_m2.select([pl.col(metric).mean().alias(metric) for metric in metrics]).to_dict(as_series=False)
print("\nMean metrics across all iterations:")
print(mean_metrics)

Results for model: google-bert/bert-base-multilingual-cased


accuracy,f1,precision,recall,iteration
f64,f64,f64,f64,i64
0.532,0.26875,0.191111,0.452632,1
0.42,0.299517,0.194357,0.652632,2
0.61,0.229249,0.183544,0.305263,3
0.464,0.298429,0.198606,0.6,4
0.496,0.296089,0.201521,0.557895,5


### Models trained on French data

In [26]:
# Work on copies so the original data frames stay untouched
train_fr_df = train_df.clone()
val_fr_df = val_df.clone()
# Keep only the rows where the language is French and rename 'labels' to 'label', as SetFit expects
train_fr_df = train_fr_df.filter(pl.col('language') == 'fr').rename({'labels': 'label'})
val_fr_df = val_fr_df.filter(pl.col('language') == 'fr').rename({'labels': 'label'})

# Define sample size (labelled examples for training)
sample_size = 32
num_iterations = 5  # Run several iterations for the sample size (32 labels) to minimize the impact of randomness
metric = 'f1'  # Metric to optimize
model_name = "almanach/camembert-base"  # Path to the pre-trained model
batch_size = 8  # Batch size for training (reduce if you run into memory issues, but larger batch sizes will speed up training)
num_epochs = 5  # Number of epochs for training
learning_rate = 2e-5  # Learning rate for the optimizer

# Create a subset of the validation data frame for faster processing
sample_size_val = 500
val_fr_df = val_fr_df.sample(n=sample_size_val, shuffle=True, seed=seed)

results_m3 = run_setfit_training(
    train_df=train_fr_df, 
    val_df=val_fr_df, 
    model_name=model_name, 
    num_epochs=num_epochs, 
    batch_size=batch_size, 
    learning_rate=learning_rate, 
    sample_size=sample_size, 
    metric=metric, 
    num_iterations=num_iterations, 
    seed=seed
)


Iteration 1/5


No sentence-transformers model found with name almanach/camembert-base. Creating a new one with mean pooling.
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


Map:   0%|          | 0/32 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 544
  Batch size = 8
  Num epochs = 5


Step,Training Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB. GPU 0 has a total capacity of 7.75 GiB of which 12.69 MiB is free. Including non-PyTorch memory, this process has 7.73 GiB memory in use. Of the allocated memory 7.46 GiB is allocated by PyTorch, and 109.34 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
print(f'Results for model: {model_name}')
display(results_m3)

# Print mean of all metrics across iterations
metrics = ['accuracy', 'f1', 'precision', 'recall']
mean_metrics = results_m3.select([pl.col(metric).mean().alias(metric) for metric in metrics]).to_dict(as_series=False)
print("\nMean metrics across all iterations:")
print(mean_metrics)

Results for model: almanach/camembert-base


NameError: name 'results_m3' is not defined

# 2. Dataset Augmentation

Outline of the intermediate tasks. The pipeline should be fully automated; good candidates are Easy Data Augmentation (EDA) and back-translation with open-source machine translation (MT) models.
1. Choose Technique(s)
   - EDA: random synonym substitution (WordNet or fastText), random swap, random insertion, random deletion.
   - Back-translation: FR → EN → FR and IT → EN → IT using MarianMT or OPUS-MT models.
2. Implement & Generate
   - For each of the 32 labeled examples, generate k augmented pseudo-examples (e.g. k=5).
   - Deduplicate and filter (e.g. reject a candidate whose token overlap with its source text is below 50%).
3. Merge & Re-split
   - Combine the 32 originals with up to N = 32×k synthetic examples (fewer after deduplication and filtering).
   - Re-run the same CV split strategy, ensuring that augmented copies of a given original stay in the same fold.
4. Re-train BERT
   - Use exactly the same hyperparameters as in (a), so that any change in the metrics can be attributed to the augmented data.
   - Track the performance uplift vs. the baseline.
5. Analysis
   - Compare metrics: ΔAccuracy, ΔF1.
   - Ablation: EDA vs. back-translation vs. combined.
   - Qualitative: inspect a few synthetic samples.
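
Steps 2 and 3 of the outline can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the pipeline used in this notebook. It covers only two of the EDA operations (random swap and random deletion), splits tokens on whitespace, and measures overlap as the share of the source's distinct tokens that survive in the candidate. Every function name, the `group` and `synthetic` fields, and the two toy sentences are hypothetical and chosen for illustration only.

```python
import random


def random_swap(tokens, rng, n_swaps=1):
    """Return a copy of tokens with n_swaps random position swaps."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, rng, p=0.1):
    """Drop each token with probability p, always keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


def token_overlap(source_tokens, candidate_tokens):
    """Share of the source's distinct tokens that also appear in the candidate."""
    source_set = set(source_tokens)
    if not source_set:
        return 0.0
    return len(source_set & set(candidate_tokens)) / len(source_set)


def augment(examples, k=5, p_delete=0.1, min_overlap=0.5, seed=42):
    """Generate up to k synthetic variants per (text, label) pair.

    Each returned row carries a 'group' id pointing at its source example,
    so that an original and its copies can later be kept in the same fold.
    """
    rng = random.Random(seed)
    rows = []
    for group_id, (text, label) in enumerate(examples):
        tokens = text.split()
        normalized = " ".join(tokens)
        seen = {normalized}  # used to deduplicate within a group
        rows.append({"text": normalized, "label": label,
                     "group": group_id, "synthetic": False})
        for _ in range(k):
            if rng.choice(["swap", "delete"]) == "swap":
                new_tokens = random_swap(tokens, rng)
            else:
                new_tokens = random_deletion(tokens, rng, p=p_delete)
            new_text = " ".join(new_tokens)
            if new_text in seen:
                continue  # exact duplicate of the source or an earlier variant
            if token_overlap(tokens, new_tokens) < min_overlap:
                continue  # drifted too far from the source text
            seen.add(new_text)
            rows.append({"text": new_text, "label": label,
                         "group": group_id, "synthetic": True})
    return rows


def assign_folds(rows, n_folds=4):
    """Give every row the fold of its group, so copies never straddle folds."""
    return [{**row, "fold": row["group"] % n_folds} for row in rows]


# Toy usage with two hypothetical sentences (not taken from the dataset)
toy_examples = [
    ("le service était vraiment très lent ce soir", 1),
    ("la pizza era ottima e il personale gentile", 0),
]
for row in assign_folds(augment(toy_examples, k=3), n_folds=2):
    print(row)
```

Assigning folds by `group` rather than by row is what keeps near-duplicates of a training example from leaking into the evaluation fold. This sketch does not stratify folds by label; if the 32 examples are imbalanced, a group-aware and stratified splitter would be a better fit.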

# 3. Zero-Shot Learning with LLM

# 4. Data Generation with LLM

# 5. Optimal Technique Application