# 0. Instructions and setup

## 0.1. Instructions. Part 2: Data Scientist Challenge (3.5 points)

- **Objective:** Explore different techniques to enhance model performance with limited labeled data. You are limited to 32 labeled examples in your task; the rest can be treated as unlabeled data.

- **Tasks:**
  - **a. BERT Model with Limited Data (0.5 points):** Train a BERT-based model using only 32 labeled examples and assess its performance.
  - **b. Dataset Augmentation (1 point):** Experiment with an automated technique to increase your dataset size **without using LLMs** (ChatGPT / Mistral / Gemini / etc.). Evaluate the impact on model performance.
  - **c. Zero-Shot Learning with LLM (0.5 points):** Apply an LLM (ChatGPT/Claude/Mistral/Gemini/...) in a zero-shot learning setup. Document the performance.
  - **d. Data Generation with LLM (1 point):** Use an LLM (ChatGPT/Claude/Mistral/Gemini/...) to generate new, labeled data points. Train your BERT model on them together with the 32 labeled examples. Analyze how this impacts model metrics.
  - **e. Optimal Technique Application (0.5 points):** Based on the previous experiments, apply the most effective technique(s) to further improve your model's performance. Comment on your results and propose improvements.

## 0.2. Libraries

In [None]:
# !pip install polars  # Install polars for faster data processing

Collecting polars
  Using cached polars-1.30.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (14 kB)
Using cached polars-1.30.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.3 MB)
Installing collected packages: polars
Successfully installed polars-1.30.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [28]:
# Utilities
import numpy as np
import polars as pl
from library.metrics import Metrics
from library.utilities import set_seed
from datasets import Dataset
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Deep Learning and NLP
import torch
from setfit import SetFitModel, Trainer, TrainingArguments
import matplotlib.pyplot as plt

In [17]:
# Initialize the metrics object to save the results
metrics = Metrics()

In [18]:
# Check the availability of a GPU
print(torch.cuda.is_available())

True


## 0.3. Random Seed

In [19]:
# Set random seed for reproducibility
seed = 42
set_seed(seed)

Seed set to 42. This ensures reproducibility of results across runs.


## 0.4. Loading the data: Swiss Judgment Prediction

Source: https://huggingface.co/datasets/rcds/swiss_judgment_prediction

In [20]:
# Load the cleaned Parquet file
df = pl.read_parquet('swiss_judgment_prediction_fr&it_clean.parquet')

# Display the loaded DataFrame
print("\nLoaded DataFrame shape:", df.shape)
print("\nLoaded DataFrame schema:")
print(df.schema)
print("\nFirst few rows of the loaded DataFrame:")
df.head()


Loaded DataFrame shape: (35386, 9)

Loaded DataFrame schema:
Schema({'id': Int32, 'year': Int32, 'text': String, 'labels': Int64, 'language': String, 'region': String, 'canton': String, 'legal area': String, 'split': String})

First few rows of the loaded DataFrame:


id,year,text,labels,language,region,canton,legal area,split
i32,i32,str,i64,str,str,str,str,str
22014,2011,"""Faits: A. Le 28 octobre 2002 à…",0,"""fr""","""Région lémanique""","""ge""","""civil law""","""train"""
11593,2007,"""Faits : Faits : A. Le 17 avril…",1,"""fr""","""Région lémanique""","""ge""","""penal law""","""train"""
26670,2013,"""Faits: A. Par jugement du 2 ma…",0,"""fr""","""Région lémanique""","""vd""","""penal law""","""train"""
5864,2004,"""Faits: Faits: A. N._, née en 1…",1,"""fr""","""Région lémanique""","""vd""","""insurance law""","""train"""
16122,2009,"""Faits: A. Y._ est propriétaire…",0,"""fr""","""Région lémanique""","""ge""","""public law""","""train"""


In [23]:
# Split the DataFrame into training, validation and test sets
train_df = df.filter(pl.col('split') == 'train')
val_df = df.filter(pl.col('split') == 'val')
test_df = df.filter(pl.col('split') == 'test')

# Split each of the splits into the different languages
train_fr_df = train_df.filter(pl.col('language') == 'fr')
train_it_df = train_df.filter(pl.col('language') == 'it')
val_fr_df = val_df.filter(pl.col('language') == 'fr')
val_it_df = val_df.filter(pl.col('language') == 'it')
test_fr_df = test_df.filter(pl.col('language') == 'fr')
test_it_df = test_df.filter(pl.col('language') == 'it')

# 1. BERT Model with Limited Data

Outline of the intermediate tasks:

1. Preprocessing Pipeline
   - Lowercasing, punctuation stripping (or not, depending on the BERT tokenizer).
   - SentencePiece/BPE tokenization via the CamemBERT (for French) or UmBERTo (for Italian) tokenizer.
   - (Optional) language tags if you merge FR+IT in one model.
2. Hold-out Split. Since you only have 32 labels, use stratified k-fold CV (e.g. 8 × 4-fold) to get more reliable estimates, or leave-one-out if you want maximum training data per fold.
3. BERT Model with Only 32 Examples
   - Model Choice: Pick a multilingual BERT (mBERT) or separate CamemBERT/UmBERTo checkpoint. Alternatives:
     - Multilingual/monolingual models:
       - One multilingual [`Sentence-Transformer` model](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#original-models), such as `paraphrase-multilingual-mpnet-base-v2` (best multilingual performer), `paraphrase-multilingual-MiniLM-L12-v2` (similar performer as the former, but much faster and smaller) or `distiluse-base-multilingual-cased-v1` (worst of the bunch).
       -  [BERT multilingual base model (cased)](https://huggingface.co/google-bert/bert-base-multilingual-cased). [Uncased model](https://huggingface.co/google-bert/bert-base-multilingual-uncased) also available. Paper: "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)". 
       -  [CamemBERT 2.0](https://huggingface.co/almanach/camembertv2-base) and [CamemBERTav2](https://huggingface.co/almanach/camembertav2-base), models trained with French text and explained in the paper: [CamemBERT 2.0: A Smarter French Language Model Aged to Perfection](https://arxiv.org/html/2411.08868v1#S3). These models supposedly improve over the performance of the original [CamemBERT](https://huggingface.co/docs/transformers/en/model_doc/camembert) model, explained in "[CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894)".
       -  [FlauBERT](https://huggingface.co/docs/transformers/en/model_doc/flaubert), another model pre-trained on French text. Paper: "[FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372)". 
       -  [BERT Base Italian Cased](https://huggingface.co/dbmdz/bert-base-italian-cased), [Uncased](https://huggingface.co/dbmdz/bert-base-italian-uncased), and [XXL Uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased).
       -  [UmBERTo Commoncrawl Cased](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1), another model trained with a large corpus of texts in Italian.
     - Domain-specific models (law):
       - [LEGAL-BERT](https://huggingface.co/nlpaueb/legal-bert-base-uncased) does not seem to be a good option as it was trained only on English data.
       - [JuriBERT](https://huggingface.co/dascim/juribert-base) for legal texts in French. Paper explaining the model: [JuriBERT: A Masked-Language Model Adaptation for French Legal Text](https://arxiv.org/pdf/2110.01485).
       - [ITALIAN-LEGAL-BERT](https://huggingface.co/dlicari/Italian-Legal-BERT) for legal text in Italian. Paper explaining the model: [ITALIAN-LEGAL-BERT models for improving natural language processing tasks in the Italian legal domain](https://www.sciencedirect.com/science/article/pii/S0267364923001188).
     - Models to try: 
       - For both languages: `paraphrase-multilingual-MiniLM-L12-v2` (of the SentenceTransformers, good balance between performance and speed), BERT multilingual base model (cased).
       - For French: CamemBERT, FlauBERT, CamemBERTav2 (based on DebertaV2 architecture), JuriBERT.
       - For Italian: BERT Base Italian Cased, UmBERTo Commoncrawl Cased, ITALIAN-LEGAL-BERT.
   - Fine-tuning Setup.
     - Freeze or unfreeze last n encoder layers—try both.
     - Small learning rate (2e-5 – 5e-5), batch size = 8 or 16.
     - Early stopping on validation loss.
4. Training & Evaluation
    - Run your k-fold CV training loops.
    - Track accuracy, F1, precision, recall per fold.
    - Report mean ± std.
5. Error Analysis and feature interpretation.
    - Use `LIME` for analyzing the most relevant features for classifying the texts. 
    - Look at which examples are mispredicted.
    - Check language breakdown (FR vs. IT) to see if one is harder.

## 1.1. Standard fine-tuning

Notebook of reference: `Session_5_1_BERT_HF_Implementation.ipynb`, sections 1, 2 and 5.

## 1.2. Using SetFit ("Sentence Transformer Fine-Tuning")

Notebook of reference: `Session_6_2_Zero_Shot_Classification.ipynb`, introduction, step 1 (loading data) and step 6 ("Few-Shot Classification with SetFit").

Applying SetFit (the “Sentence Transformer Fine-Tuning” recipe) counts as **training**: it fine-tunes a pre-trained sentence-embedding model and fits a small classifier on top. SetFit was built **for** exactly this scenario: getting strong performance with as few as a few dozen labeled examples.

---

**Why SetFit = Training**

* **Contrastive fine-tuning:**
  You start with a pre-trained Sentence-Transformer model and then *fine-tune* it on sentence pairs generated automatically from your 32 labels.
* **Classifier head training:**
  After contrastive tuning, you fit a lightweight logistic-regression (or small MLP) classifier on the resulting embeddings.
* Both steps update model parameters—so it’s training/fine-tuning, not mere prompt-engineering or zero-shot.

---

**Why SetFit excels in limited-label regimes**

1. **Data amplification via contrastive pairs**

   * From the labeled examples, SetFit creates positive pairs (two texts with the same label) and negative pairs (two texts with different labels), turning 32 labels into hundreds or thousands of pairwise signals.
   * That extra signal helps the embedding space separate classes, even when you only have a few “gold” labels.

2. **Lightweight classifier**

   * Because the embedding model has already been tuned to distinguish your classes, the final classifier can be a simple logistic or MLP—so it needs very few examples to learn decision boundaries.

3. **Empirical few-shot strength**

   * In benchmarks, SetFit often outperforms standard BERT fine-tuning when you have <100 labels, and it is typically fast to train because it needs no prompts and works with comparatively small sentence-transformer models.
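
To make the pair amplification concrete, here is a minimal sketch of how same-class and cross-class pairs can be enumerated from a balanced set of 32 labeled texts. `build_contrastive_pairs` is a hypothetical helper and the texts are placeholders; SetFit's own sampler may oversample, undersample or repeat pairs, so the counts in its training logs need not match these numbers.

```python
from itertools import combinations


def build_contrastive_pairs(examples):
    """Build (text_a, text_b, target) triples: 1.0 for same class, 0.0 for different classes."""
    positives, negatives = [], []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        if label_a == label_b:
            positives.append((text_a, text_b, 1.0))
        else:
            negatives.append((text_a, text_b, 0.0))
    return positives, negatives


# Hypothetical balanced sample: 16 placeholder texts per class
examples = [(f"rejected_{i}", 0) for i in range(16)] + [(f"approved_{i}", 1) for i in range(16)]
positives, negatives = build_contrastive_pairs(examples)

print(len(positives), len(negatives))  # 240 256
```

With 16 examples per class there are 2 × C(16, 2) = 240 same-class pairs and 16 × 16 = 256 cross-class pairs, so 32 labels already yield 496 distinct training signals before any resampling.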

---

**How to plug SetFit into the task**

1. **Install** the SetFit library (e.g. via `pip install setfit`).
2. **Initialize** a pre-trained checkpoint:

   ```python
   from setfit import SetFitModel, Trainer, TrainingArguments
   model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
   ```
3. **Prepare** your 32 labeled examples as a `datasets.Dataset` with `text` and `label` columns.
4. **Train** the model:

   ```python
   from sentence_transformers.losses import CosineSimilarityLoss

   args = TrainingArguments(
       loss=CosineSimilarityLoss,  # contrastive loss applied to the sentence pairs
       batch_size=16,
       num_iterations=20,          # controls how many contrastive pairs are sampled
       num_epochs=1                # one pass over the generated pairs
   )
   trainer = Trainer(
       model=model,
       args=args,
       train_dataset=my_32_examples,
       eval_dataset=my_dev_split,
       metric="accuracy",
   )
   trainer.train()
   ```
5. **Evaluate** on your held-out data.

You’ll have fine-tuned embeddings *and* a classifier head—all with only 32 labels. That makes SetFit not just “considered training,” but *one of the best* training-with-few-labels recipes out there.

---

Note that there are 2 aspects that we need to implement for SetFit to work properly:
1. **Balance the examples of the positive and negative classes** that are passed for contrastive learning. Approximately 29% of the labels are positive (approved motions) and the other 71% correspond to rejected motions, so passing examples for contrastive learning without balancing the classes would bias the embedding space towards the majority class.
2. We need to **run several iterations** to reduce the variability of the results: a single training run can vary a lot because of randomness in the example sampling (we only pass 32 labelled examples). Averaging across iterations smooths out this variance and gives a more reliable estimate of the expected performance.

In [35]:
# Function to sample a class-balanced dataset
def sample_balanced_dataset(dataset: pl.DataFrame, num_samples: int, seed: int) -> Dataset:
    """Draw num_samples // 2 examples per class and return them as a Hugging Face Dataset."""
    if seed is None:
        raise ValueError("Seed must be provided for reproducibility.")

    # Get positive and negative examples
    pos_examples = dataset.filter(pl.col('label') == 1)
    neg_examples = dataset.filter(pl.col('label') == 0)

    # Sample equal numbers from each class
    samples_per_class = num_samples // 2
    pos_sampled = pos_examples.sample(n=samples_per_class, shuffle=True, seed=seed)
    neg_sampled = neg_examples.sample(n=samples_per_class, shuffle=True, seed=seed)

    # Keep only the text and label columns and stack the two classes
    df = pl.concat(
        [pos_sampled.select(['text', 'label']), neg_sampled.select(['text', 'label'])],
        how='vertical',
    )

    # Convert the polars DataFrame into a Hugging Face Dataset
    return Dataset.from_polars(df)

Note that SetFit natively supports only sentence-transformer models. If you pass a checkpoint that is [not one](https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers) (see also the [docs](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#original-models)), SetFit automatically wraps the BERT model in a sentence-transformers layer with mean pooling and prints the following message:

> No sentence-transformers model found with name google-bert/bert-base-multilingual-cased. Creating a new one with mean pooling.

However, SetFit may perform better with a model that was specifically pre-trained as a sentence transformer.
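
For intuition, mean pooling turns the per-token vectors of a sequence into a single sentence vector by averaging them while skipping padding. The sketch below uses plain Python lists and a hypothetical `mean_pool` helper with toy numbers; it illustrates the idea and is not the sentence-transformers implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sequence, ignoring padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n_tokens = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_tokens += 1
            for j, value in enumerate(vector):
                summed[j] += value
    if n_tokens == 0:
        raise ValueError("attention_mask contains no real tokens")
    return [s / n_tokens for s in summed]


# Toy example: three token vectors of dimension 2; the last one is padding
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [2.0, 3.0]
```

A checkpoint that was only pre-trained with masked-language modelling was never optimized to make this average meaningful, which is one plausible reason a dedicated sentence-transformer model can be a better starting point.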


In [None]:
# Rename the 'labels' column to 'label', the column name SetFit expects.
# `rename` returns a new DataFrame, so the original splits are left untouched.
train_df_setfit = train_df.rename({'labels': 'label'})
val_df_setfit = val_df.rename({'labels': 'label'})

# Define sample size (labelled examples for training)
sample_size = 32
num_iterations = 10  # Independent runs with different samples of 32 labels, to reduce the impact of randomness
metric = 'f1'  # Metric computed by trainer.evaluate()
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"  # Pre-trained checkpoint
batch_size = 32  # Batch size for training
num_epochs = 5  # Number of epochs for training
learning_rate = 2e-5  # Learning rate for the sentence-transformer body

# Prepare the validation set
val_set = Dataset.from_polars(val_df_setfit.select(['text', 'label']))

# Store results across iterations (for different metrics and iterations)
iteration_results = []

# Run the sampling and training process with SetFit
for iteration in range(num_iterations):
    print(f"\nIteration {iteration + 1}/{num_iterations}")

    # Sample balanced training data
    train_samples = sample_balanced_dataset(train_df_setfit, sample_size, seed + iteration)

    # Reload the pre-trained checkpoint so that every run starts from the same weights
    # and the model never sees more than `sample_size` labelled examples
    model = SetFitModel.from_pretrained(model_name)

    # Create the training arguments
    train_args = TrainingArguments(
        num_epochs=num_epochs,  # Number of epochs for training
        batch_size=batch_size,  # Batch size for training
        body_learning_rate=learning_rate  # Learning rate for the sentence-transformer body
    )

    # Initialize and train SetFit model
    trainer = Trainer(
        model=model,
        train_dataset=train_samples,  # Labelled texts; SetFit builds the contrastive pairs internally
        eval_dataset=val_set,  # Validation set for evaluation
        metric=metric,  # Metric computed by trainer.evaluate()
        args=train_args  # Training arguments
    )

    trainer.train()  # Train the model
    print("Training completed.")

    # Evaluate on validation set (for hyperparameter tuning/model selection)
    val_predictions = trainer.model.predict(val_set['text'])
    val_metrics = {
        'accuracy': accuracy_score(val_set['label'], val_predictions),
        'f1': f1_score(val_set['label'], val_predictions),
        'precision': precision_score(val_set['label'], val_predictions),
        'recall': recall_score(val_set['label'], val_predictions)
    }
    
    # Store results for this iteration
    iteration_results.append(val_metrics)
    print(f"Validation F1: {val_metrics['f1']:.4f}")

# Calculate average performance across iterations
avg_results = {name: np.mean([r[name] for r in iteration_results])
               for name in ['accuracy', 'f1', 'precision', 'recall']}
std_results = {name: np.std([r[name] for r in iteration_results])
               for name in ['accuracy', 'f1', 'precision', 'recall']}

print(f"\nAverage Validation Results across {num_iterations} iterations:")
for name in avg_results:  # `name` avoids overwriting the `metric` setting defined above
    print(f"{name.capitalize()}: {avg_results[name]:.4f} ± {std_results[name]:.4f}")

# Delete the model to free up memory
del model

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.



Iteration 1/10



***** Running training *****
  Num unique pairs = 544
  Batch size = 32
  Num epochs = 5


# 2. Dataset Augmentation

Outline of the intermediate tasks: We want a fully automated pipeline. A good candidate is Easy Data Augmentation (EDA) or back-translation via open‐source MT models.
1. Choose Technique(s)
   - EDA: random synonym substitution (WordNet or fastText), random swap, insertion, deletion.
   - Back-translation: FR → EN → FR and IT → EN → IT using MarianMT or opus-MT.
2. Implement & Generate
   - For each of the 32 labeled examples, generate k augmented pseudo-examples (e.g. k=5).
   - Deduplicate and filter (e.g. reject an augmented text if it shares less than 50% of its tokens with the original).
3. Merge & Re-split
   - Combine original 32 + synthetic N = 32×k examples.
   - Re-run the same CV split strategy, ensuring augmented copies of a given original stay in the same fold.
4. Re-train BERT
   - Exactly the same hyperparams as in (a).
   - Track performance uplift vs. the baseline.
5. Analysis
   - Compare metrics: ΔAccuracy, ΔF1.
   - Ablation: EDA vs. back-translation vs. combined.
   - Qualitative: inspect a few synthetic samples.
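
As a rough illustration of steps 1 and 2, the sketch below implements two of the EDA operations (random swap and random deletion) together with a deduplication and overlap filter, using only the standard library. All function names, the default parameters and the French example sentence are assumptions made for illustration; synonym replacement and random insertion are omitted because they require a lexical resource such as WordNet or fastText.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    tokens = tokens[:]
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]


def token_overlap(original, augmented):
    """Share of the original's distinct tokens that survive in the augmented text."""
    original_set = set(original)
    return len(original_set & set(augmented)) / len(original_set)


def augment(text, k=5, p_delete=0.1, n_swaps=2, min_overlap=0.5, seed=42):
    """Generate up to k distinct augmented variants of `text`."""
    rng = random.Random(seed)
    tokens = text.split()
    if not tokens:
        return []
    results = []
    for _ in range(k):
        candidate = random_deletion(random_swap(tokens, n_swaps, rng), p_delete, rng)
        candidate_text = " ".join(candidate)
        # Deduplicate: skip unchanged texts and repeats
        if candidate_text == text or candidate_text in results:
            continue
        # Filter: keep only variants that stay close enough to the original
        if token_overlap(tokens, candidate) >= min_overlap:
            results.append(candidate_text)
    return results


# Made-up example sentence; each variant inherits the label of its source text
text = "Le recours est rejeté dans la mesure où il est recevable"
augmented = augment(text)
for variant in augmented:
    print(variant)
```

Every variant keeps the label of the example it was derived from and should be placed in the same CV fold as that example, so that near-duplicates never leak from training into evaluation.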

# 3. Zero-Shot Learning with LLM

# 4. Data Generation with LLM

# 5. Optimal Technique Application