# Legal document classification in a zero-shot cross-lingual transfer setting

# Part II: Results reproduction

Date: May 2025

Course project: Natural Language Processing - ENSAE 3A S2

Author: Noémie Guibé

In [1]:
# imports
import os

import numpy as np  # needed by the weight comparison sketch further down
import pandas as pd
import tensorflow as tf
from datasets import Dataset
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TFAutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

from src import baseline_model, frozen_model, adapter_model

2025-05-03 17:01:13.328979: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-05-03 17:01:13.330555: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.

In [11]:
# Training parameters
train_size = 5000
test_size = 5000
batch_size = 32
epochs = 2

In [2]:
# Load the dataset
df = pd.read_parquet('https://minio.lab.sspcloud.fr/nguibe/NLP/multi_eurlex_reduced.parquet', engine='pyarrow')

# 1 - First result reproduction: Performance drop from English-only fine-tuning

In [None]:
# Run training and evaluation of the baseline model
results = baseline_model.run_training_pipeline(
    data=df,
    train_sample_size=train_size,
    test_sample_size=test_size,
    batch_size=batch_size,
    epochs=epochs,
)

# Results are printed to the log during the run; they can also be displayed with:
#import pprint
#pprint.pprint(results)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
Map: 100%|██████████| 5000/5000 [00:04<00:00, 1102.49 examples/s]
Map: 100%|██████████| 987/987 [00:02<00:00, 472.89 examples/s]
Map: 100%|██████████| 1024/1024 [00:01<00:00, 760.46 examples/s]
Map: 100%|██████████| 1003/1003 [00:01<00:00, 566.83 examples/s]
Map: 100%|██████████| 1014/1014 [00:02<00:00, 489.41 examples/s]
Map: 100%|██████████| 972/972 [00:01<00:00, 556.74 examples/s]
All PyTorch model weights were used when initializing TFXLMRobertaForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFXLMRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias',

Epoch 1/2
     54/Unknown - 826s 15s/step - loss: 0.3683 - auc_3: 0.4961

## Performance analysis - draft

This section was intended to compare the weights of the pre-trained multilingual model with those of the version fine-tuned on English only, as a way to quantify potential catastrophic forgetting.

Due to time constraints, this analysis was not completed. However, the following code sketch could be used to pursue this direction in future work.

In [None]:
# Draft code for future weight comparison analysis
import numpy as np

def compute_weight_difference(pretrained_weights, retrained_weights):
    """Return the L2 norm of the difference between each pair of weight tensors."""
    differences = []
    for pretrained, retrained in zip(pretrained_weights, retrained_weights):
        diff = np.linalg.norm(pretrained - retrained)  # L2 norm
        differences.append(diff)
    return differences

# Example usage:
# pretrained_weights = model_pretrained.get_weights()
# retrained_weights = model_retrained.get_weights()
# weight_differences = compute_weight_difference(pretrained_weights, retrained_weights)
# most_changed_layers = np.argsort(weight_differences)[::-1]
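
As a quick sanity check of the sketch above, the following self-contained example applies the same per-tensor L2 comparison to small synthetic arrays instead of real model weights. The helper `rank_weight_changes`, the tensor names and the values are all made up for illustration; nothing here comes from the actual checkpoints.

```python
import numpy as np

def rank_weight_changes(pretrained_weights, retrained_weights, names):
    """Return (name, L2 distance) pairs sorted from most to least changed."""
    distances = [
        float(np.linalg.norm(before - after))
        for before, after in zip(pretrained_weights, retrained_weights)
    ]
    order = np.argsort(distances)[::-1]  # indices by decreasing distance
    return [(names[i], distances[i]) for i in order]

# Synthetic stand-ins for model.get_weights() (hypothetical names and values).
names = ["embeddings", "encoder_layer_0", "classifier"]
pretrained = [np.zeros((2, 2)), np.ones((2, 2)), np.zeros(3)]
retrained = [np.zeros((2, 2)), np.ones((2, 2)) + 0.5, np.array([3.0, 4.0, 0.0])]

ranking = rank_weight_changes(pretrained, retrained, names)
for name, dist in ranking:
    print(f"{name}: {dist:.3f}")
```

Note that `get_weights()` returns one array per weight tensor, so the ranking is per tensor rather than per layer; grouping tensors by layer would be an extra step.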

# 2 - Second result reproduction: "better" performance with adaptation strategies

## Frozen layers

In [7]:
N = 6  # number of transformer layers to freeze
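
The freezing logic itself lives in `src/frozen_model` and is not reproduced in this notebook. As a rough sketch of the general idea only, the helper below marks the first `n` layers of a list as non-trainable. The function name `freeze_first_n_layers` and the 12-layer stand-in encoder are assumptions made for illustration; with a Hugging Face TensorFlow model the layer list would presumably be obtained from the model's encoder.

```python
from types import SimpleNamespace

def freeze_first_n_layers(layers, n):
    """Mark the first n layers as non-trainable and return how many were frozen."""
    if n < 0:
        raise ValueError("n must be non-negative")
    frozen = 0
    for layer in layers[:n]:
        layer.trainable = False  # Keras-style flag: excluded from gradient updates
        frozen += 1
    return frozen

# Stand-in objects for a 12-layer encoder (hypothetical size); real layers
# would be Keras layers exposing the same `trainable` attribute.
encoder_layers = [SimpleNamespace(name=f"layer_{i}", trainable=True) for i in range(12)]

n_frozen = freeze_first_n_layers(encoder_layers, 6)
trainable_names = [layer.name for layer in encoder_layers if layer.trainable]
print(n_frozen, trainable_names)
```

Freezing the lower layers keeps their pre-trained multilingual representations fixed, so only the upper layers and the classification head are updated on the English training data.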

In [8]:
# Run training and evaluation of the model with N frozen layers, other parameters unchanged
results = frozen_model.run_training_pipeline_with_freezing(
    df=df,
    train_sample_size=train_size,
    test_sample_size=test_size,
    batch_size=batch_size,
    epochs=epochs,
    n_frozen_layer=N,
)

# Results are printed to the log during the run; they can also be displayed with:
#import pprint
#pprint.pprint(results)

Map: 100%|██████████| 5000/5000 [00:04<00:00, 1191.71 examples/s]
Map: 100%|██████████| 987/987 [00:01<00:00, 499.02 examples/s]
Map: 100%|██████████| 1024/1024 [00:01<00:00, 950.96 examples/s]
Map: 100%|██████████| 1003/1003 [00:01<00:00, 588.22 examples/s]
Map: 100%|██████████| 1014/1014 [00:01<00:00, 543.28 examples/s]
Map: 100%|██████████| 972/972 [00:01<00:00, 569.87 examples/s]
All PyTorch model weights were used when initializing TFXLMRobertaForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFXLMRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[INFO] Successfully froze first 6 transformer layers.
Epoch 1/2
Epoch 2/2
Training time: 4220.72 seconds
Initial memory usage: 13012.20 MB
Final memory usage: 40761.23 MB
Memory used during training: 27749.03 MB

[INFO] Evaluating for de
R-Precision: 0.2700
Micro F1: 0.2289
Macro F1: 0.0329
LRAP: 0.5416
Evaluation time: 156.76 seconds

[INFO] Evaluating for en


R-Precision: 0.2688
Micro F1: 0.2366
Macro F1: 0.0335
LRAP: 0.5417
Evaluation time: 164.81 seconds

[INFO] Evaluating for fi
R-Precision: 0.2687
Micro F1: 0.2309
Macro F1: 0.0331
LRAP: 0.5330
Evaluation time: 159.60 seconds

[INFO] Evaluating for fr
R-Precision: 0.2740
Micro F1: 0.2419
Macro F1: 0.0339
LRAP: 0.5527
Evaluation time: 160.89 seconds

[INFO] Evaluating for pl
R-Precision: 0.2711
Micro F1: 0.2399
Macro F1: 0.0342
LRAP: 0.5387
Evaluation time: 154.70 seconds
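
The R-Precision figures in these logs come from the project's own evaluation code, which is not shown here. As a minimal sketch of the usual definition (for a document with R gold labels, the share of its R highest-scored labels that are gold, averaged over documents), the function below works on toy data. The name `mean_r_precision` and the example values are illustrative assumptions, and the project's implementation may differ in details such as how documents without labels are handled.

```python
import numpy as np

def mean_r_precision(y_true, y_score):
    """Mean R-Precision over documents for multi-label predictions."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    values = []
    for truth, scores in zip(y_true, y_score):
        r = int(truth.sum())  # number of gold labels for this document
        if r == 0:
            continue  # assumption: documents without gold labels are skipped
        top_r = np.argsort(scores)[::-1][:r]  # indices of the r best-scored labels
        values.append(truth[top_r].sum() / r)
    return float(np.mean(values)) if values else 0.0

# Toy data: 2 documents, 4 labels (made-up values).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_score = [[0.9, 0.8, 0.1, 0.2], [0.3, 0.7, 0.2, 0.1]]

score = mean_r_precision(y_true, y_score)
print(score)
```

Because R-Precision and LRAP depend only on the ranking of label scores, they can stay fairly stable while the thresholded F1 scores vary, which is worth keeping in mind when reading the numbers above.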


## Adapters

In [12]:
# Training parameters
train_size = 5000
test_size = 5000
batch_size = 32
epochs = 2

In [None]:
# Run training and evaluation of the adapter model, other parameters unchanged
results = adapter_model.run_adapter_training_pipeline(
    data=df,
    train_sample_size=train_size,
    test_sample_size=test_size,
    batch_size=batch_size,
    epochs=epochs,
)

# Results are printed to the log during the run; they can also be displayed with:
#import pprint
#pprint.pprint(results)

Map: 100%|██████████| 5000/5000 [00:14<00:00, 335.83 examples/s]
Map: 100%|██████████| 987/987 [00:05<00:00, 173.12 examples/s]
Map: 100%|██████████| 1024/1024 [00:04<00:00, 212.96 examples/s]
Map: 100%|██████████| 1003/1003 [00:06<00:00, 166.37 examples/s]
Map: 100%|██████████| 1014/1014 [00:06<00:00, 146.61 examples/s]
Map: 100%|██████████| 972/972 [00:06<00:00, 150.48 examples/s]
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFXLMRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing TFXLMRobertaModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFXLMRobertaModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceC

Epoch 1/2
157/157 ━━━━━━━━━━━━━━━━━━━━ 980s 6s/step - auc_3: 0.5002 - loss: 0.4051
Epoch 2/2
157/157 ━━━━━━━━━━━━━━━━━━━━ 1506s 10s/step - auc_3: 0.5266 - loss: 0.3229
Training time: 2486.09 seconds
Initial memory usage: 16112.71 MB
Final memory usage: 15944.74 MB
Memory used during training: -167.97 MB
[INFO] Evaluating on language: de
R-Precision: 0.2873
Micro F1: 0.4511
Macro F1: 0.1413
LRAP: 0.6569
Evaluation time: 321.10 seconds
[INFO] Evaluating on language: en
R-Precision: 0.2905
Micro F1: 0.4477
Macro F1: 0.1525
LRAP: 0.6519
Evaluation time: 193.24 seconds
[INFO] Evaluating on language: fi
R-Precision: 0.2922
Micro F1: 0.4409
Macro F1: 0.1389
LRAP: 0.6331
Evaluation time: 176.75 seconds
[INFO] Evaluating on language: fr
R-Precision: 0.2888
Micro F1: 0.4470
Macro F1: 0.1265
LRAP: 0.6357
Evaluation time: 180.75 seconds
[INFO] Evaluating on language: pl
R-Precision: 0.2884
Micro F1: 0.4238
Macro F1: 0.1401
LRAP: 0.6204
Evaluation time: 239.95 seconds
{'de': {'Eval Time (s)': 321.10262393951416,
        'LRAP': 0.6569023707135659,
        