# Legal document classification in a zero-shot cross-lingual transfer setting

# Part II: Results reproduction

Date: May 2025

Course project: Natural Language Processing - ENSAE 3A S2

Author: Noémie Guibé

In [1]:
# imports
import pandas as pd
import json

In [2]:
# import data base
df = pd.read_parquet('data/dataset/multi_eurlex_reduced.parquet', engine='pyarrow')

In [None]:
# import train, test and dev datasets - Note: this takes 2-3 min
with open('../train.jsonl', 'r', encoding='utf-8') as f:
    train_ds = [json.loads(line) for line in f]
train_df = pd.DataFrame(train_ds).assign(split='train')
print(f'Train dataset has {len(train_df)} rows, as expected')

with open('../dev.jsonl', 'r', encoding='utf-8') as f:
    dev_ds = [json.loads(line) for line in f]
dev_df   = pd.DataFrame(dev_ds).assign(split='dev')
print(f'Dev dataset has {len(dev_df)} rows, as expected')

with open('../test.jsonl', 'r', encoding='utf-8') as f:
    test_ds = [json.loads(line) for line in f]
test_df  = pd.DataFrame(test_ds).assign(split='test')
print(f'Test dataset has {len(test_df)} rows, as expected')

df = pd.concat([train_df, dev_df, test_df], ignore_index=True)
df.head()
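The three loading blocks above repeat the same steps. A minimal sketch of a shared helper is given below; the names `load_split` and `load_all` are hypothetical, and the sketch assumes each split is stored as `<data_dir>/<split>.jsonl`, as in the cell above.

```python
import json
from pathlib import Path

import pandas as pd


def load_split(path, split):
    """Read one JSON Lines file into a DataFrame and tag rows with the split name."""
    with open(path, "r", encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break json.loads
        rows = [json.loads(line) for line in f if line.strip()]
    return pd.DataFrame(rows).assign(split=split)


def load_all(data_dir, splits=("train", "dev", "test")):
    """Load every split from data_dir and stack them into a single DataFrame."""
    frames = [load_split(Path(data_dir) / f"{name}.jsonl", name) for name in splits]
    return pd.concat(frames, ignore_index=True)
```

With the layout used in this notebook, the call would be `df = load_all("..")`.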

In [11]:
langs_to_keep = ['en', 'de', 'fr', 'pl', 'fi'] 
df_reduced = df.copy()
df_reduced['text'] = df['text'].apply(lambda x: {lang: x[lang] for lang in langs_to_keep if lang in x and x[lang] is not None})

# Memory reduction
original_memory = df.memory_usage(deep=True).sum() / (1024 ** 2)  # Convert to MB
reduced_memory = df_reduced.memory_usage(deep=True).sum() / (1024 ** 2)
memory_reduction_percentage = ((original_memory - reduced_memory) / original_memory) * 100
print(f"Number of empty rows: {len(df_reduced[df_reduced['text'].apply(lambda x: len(x))==0])}")
print(df_reduced.iloc[0]['text'])
print(f"Original memory usage: {original_memory:.2f} MB")
print(f"Reduced memory usage: {reduced_memory:.2f} MB")
print(f"Memory reduction: {memory_reduction_percentage:.2f}% lighter")

Number of empty rows: 0
{'en': 'COMMISSION DECISION\nof 6 March 2006\nestablishing the classes of reaction-to-fire performance for certain construction products as regards wood flooring and solid wood panelling and cladding\n(notified under document number C(2006) 655)\n(Text with EEA relevance)\n(2006/213/EC)\nTHE COMMISSION OF THE EUROPEAN COMMUNITIES,\nHaving regard to the Treaty establishing the European Community,\nHaving regard to Directive 89/106/EEC of 21 December 1988, on the approximation of laws, regulations and administrative provisions of the Member States relating to construction products (1), and in particular Article 20(2) thereof,\nWhereas:\n(1)\nDirective 89/106/EEC envisages that in order to take account of different levels of protection for construction works at national, regional or local level, it may be necessary to establish in the interpretative documents classes corresponding to the performance of products in respect of each essential requirement. Those docume

In [3]:
langs_to_keep = ['en', 'de', 'fr', 'pl', 'fi'] 

In [4]:
df_reduced = df.copy()

In [5]:
# Calculate the length of the document for each language
def compute_lengths(text_dict):
    lengths = {lang: len(text_dict[lang]) for lang in langs_to_keep if text_dict.get(lang) is not None}
    return lengths
# Apply the function to the 'text' column and store the result in a new column 'doc_lengths'
df_reduced['doc_lengths'] = df_reduced['text'].apply(compute_lengths)

In [6]:
df_reduced['max_doc_length'] = df_reduced['doc_lengths'].apply(lambda d: max(d.values(), default=0))

In [7]:
df_reduced = df_reduced[df_reduced['max_doc_length']<500000]
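The length filter above can be checked on a toy frame. The sketch below folds the two steps (per-language lengths, then the maximum) into one hypothetical helper, `max_length`, and uses a made-up threshold of 20 characters instead of 500000. Note that a row with no text at all gets length 0 and therefore passes the filter.

```python
import pandas as pd

LANGS = ["en", "de", "fr", "pl", "fi"]


def max_length(text_dict, langs=LANGS):
    """Length in characters of the longest available translation (0 if none)."""
    lengths = [len(text_dict[lang]) for lang in langs if text_dict.get(lang) is not None]
    return max(lengths, default=0)


toy = pd.DataFrame({"text": [
    {"en": "short", "de": "etwas länger"},  # longest translation: 12 characters
    {"en": "x" * 50, "fr": None},           # None entries are ignored
    {},                                     # no translation at all
]})
toy["max_doc_length"] = toy["text"].apply(max_length)

# Keep only documents whose longest translation is under the threshold
filtered = toy[toy["max_doc_length"] < 20]
print(filtered["max_doc_length"].tolist())  # [12, 0]
```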

# Get the data ready

In [8]:
# keep only level 3 labels
df_reduced['level_3_labels'] = df_reduced['eurovoc_concepts'].apply(lambda d: d['level_3'] if 'level_3' in d else [])

In [12]:
df_reduced

Unnamed: 0,celex_id,publication_date,text,eurovoc_concepts,split,doc_lengths,max_doc_length,level_3_labels
0,32006D0213,2006-03-06,{'de': 'ENTSCHEIDUNG DER KOMMISSION vom 6. Mär...,"{'all_levels': ['1706', '1826', '2754', '3690'...",train,"{'en': 3233, 'de': 3302, 'fr': 3642, 'pl': 332...",3642,"[1386, 2825, 138, 2475, 3879, 3641]"
1,32003R1330,2003-07-25,{'de': 'Verordnung (EG) Nr. 1330/2003 der Komm...,"{'all_levels': ['1117', '1118', '1605', '2635'...",train,"{'en': 1328, 'de': 1430, 'fr': 1437, 'fi': 1366}",1437,"[1115, 2656, 1602]"
2,32003R1786,2003-09-29,{'de': 'Verordnung (EG) Nr. 1786/2003 des Rate...,"{'all_levels': ['2173', '4854', '614', '797'],...",train,"{'en': 17741, 'de': 19641, 'fr': 19133, 'pl': ...",19641,"[614, 712, 1277, 2443]"
3,31985R2590,1985-09-13,{'de': '***** VERORDNUNG (EWG) Nr. 2590/85 DER...,"{'all_levels': ['1201', '1261', '5334', '755',...",train,"{'en': 2525, 'de': 2720, 'fr': 2684, 'fi': 2527}",2720,"[2413, 712, 2477, 4488, 2443]"
4,31993R1103,1993-04-30,{'de': 'VERORDNUNG (EWG) Nr. 1103/93 DER KOMMI...,"{'all_levels': ['1309', '2159', '2192', '235',...",train,"{'en': 27992, 'de': 29436, 'fr': 32297}",32297,"[539, 956, 1847, 2106, 614, 2858, 6205, 1845, ..."
...,...,...,...,...,...,...,...,...
64995,32014R1325,2014-12-10,{'de': 'VERORDNUNG (EU) Nr. 1325/2014 DER KOMM...,"{'all_levels': ['1871', '1877', '2282', '2308'...",test,"{'en': 1828, 'de': 2007, 'fr': 1995, 'pl': 200...",2007,"[130, 5283, 1652, 122, 2106, 3146, 4522, 567, ..."
64996,32015R0122,2015-01-22,{'de': 'VERORDNUNG (EU) 2015/122 DER KOMMISSIO...,"{'all_levels': ['1188', '1294', '1318', '1509'...",test,"{'en': 1890, 'de': 2115, 'fr': 2116, 'pl': 208...",2116,"[130, 5283, 122, 1652, 2106, 4590, 2564, 913, ..."
64997,32014R0860,2014-08-05,{'de': 'DURCHFÜHRUNGSVERORDNUNG (EU) Nr. 860/2...,"{'all_levels': ['1391', '3183', '5751'], 'leve...",test,"{'en': 2522, 'de': 2647, 'fr': 2802, 'pl': 261...",2802,"[3895, 4380]"
64998,32013D0392,2013-07-22,{'de': 'BESCHLUSS DES RATES vom 22. Juli 2013 ...,"{'all_levels': ['4359', '5334', '5796', '616']...",test,"{'en': 5104, 'de': 5381, 'fr': 5420, 'pl': 477...",5420,"[4488, 566, 1422]"


In [9]:
# Take an explicit copy so the assignment below does not trigger SettingWithCopyWarning
train_df = df_reduced[df_reduced['split']=='train'].copy()
# English-only training set: keep the English text, or None when it is missing
train_df['text'] = train_df["text"].apply(lambda x: x.get("en") if isinstance(x, dict) else None)


In [10]:
# test 
test_df = df_reduced[df_reduced['split']=='test']

# Test set in multiple languages
# Only languages listed in langs_to_keep exist in the reduced dataset ("es" is not among them)
test_langs = ["fr", "de"]
test_dfs = []

for lang in test_langs:
    # Filter rows where the language exists in the text dictionary (copy to avoid SettingWithCopyWarning)
    df_lang = test_df[test_df["text"].apply(lambda x: isinstance(x, dict) and lang in x)].copy()
    
    # Now extract the respective language text, and add the 'lang' column
    df_lang["text"] = df_lang["text"].apply(lambda x: x[lang])  # Extract the language text
    df_lang["lang"] = lang  # Add a new column for language
    
    # Append to test_dfs
    test_dfs.append(df_lang)

# Combine the list of DataFrames into one (exploded test set)
final_test_df = pd.concat(test_dfs, ignore_index=True)
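To make the reshaping above concrete, here is a small sketch on toy data. The function name `explode_by_language` and the toy rows are assumptions for illustration. A language that no document provides (here `"es"`) simply contributes no rows.

```python
import pandas as pd


def explode_by_language(df, langs):
    """Turn one row per document into one row per (document, language) pair."""
    frames = []
    for lang in langs:
        has_lang = df["text"].apply(lambda d: isinstance(d, dict) and d.get(lang) is not None)
        part = df[has_lang].copy()
        if part.empty:
            continue  # no document is available in this language
        part["text"] = part["text"].apply(lambda d: d[lang])
        part["lang"] = lang
        frames.append(part)
    return pd.concat(frames, ignore_index=True)


toy = pd.DataFrame({
    "celex_id": ["a", "b"],
    "text": [{"en": "hello", "fr": "bonjour"}, {"en": "bye", "de": "tschüss"}],
})
long_df = explode_by_language(toy, ["fr", "de", "es"])
print(long_df[["celex_id", "lang", "text"]])
```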

Note: there is no need to keep only the English texts in the training set, because the language is passed as a parameter to the function.

In [11]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
label_matrix = mlb.fit_transform(train_df["level_3_labels"])
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
train_df = train_df.copy()
train_df["label_vector"] = list(label_matrix)

# Apply same transformation to test sets
final_test_df["label_vector"] = list(mlb.transform(final_test_df["level_3_labels"]))
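Because the binarizer is fitted on the training labels only, any label that appears only in the test set is dropped at transform time, with a warning. The sketch below shows this on made-up label IDs.

```python
import warnings

from sklearn.preprocessing import MultiLabelBinarizer

train_labels = [["614", "712"], ["2443"], ["614"]]
test_labels = [["712", "9999"], []]  # "9999" never appears in the training labels

mlb = MultiLabelBinarizer()
mlb.fit(train_labels)

# Record warnings so the "unknown class" message can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    test_matrix = mlb.transform(test_labels)

print(list(mlb.classes_))   # classes are sorted as strings
print(test_matrix.tolist())
print([str(w.message) for w in caught])
```

For evaluation, this means a test document whose gold labels were all unseen during training ends up with an all-zero label vector.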


In [12]:
from datasets import Dataset

train_dataset = Dataset.from_pandas(train_df[["text", "label_vector"]])
test_datasets = {
    lang: Dataset.from_pandas(df[["text", "label_vector"]]) 
    for lang, df in final_test_df.groupby("lang")
}



In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize(batch):
    # With batched=True, batch["text"] is a list; coerce any non-string entry to str
    texts = [item if isinstance(item, str) else str(item) for item in batch["text"]]
    return tokenizer(texts, padding="max_length", truncation=True, max_length=512)

In [48]:
# Save the tokenizer to a specified directory
tokenizer.save_pretrained("model/tokenizer")

('model/tokenizer\\tokenizer_config.json',
 'model/tokenizer\\special_tokens_map.json',
 'model/tokenizer\\tokenizer.json')

In [16]:
from transformers import AutoTokenizer

# Load the tokenizer from the local directory
tokenizer = AutoTokenizer.from_pretrained("model/tokenizer/")


ModuleNotFoundError: No module named 'torch'

In [23]:
# Apply tokenization to the training dataset
def tokenize(batch):
    # batch["text"] is already List[str]
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)

train_dataset = train_dataset.map(tokenize, batched=True)

# Apply tokenization to each language-specific test dataset
for lang in test_datasets:
    test_datasets[lang] = test_datasets[lang].map(tokenize, batched=True)

Map: 100%|██████████| 54994/54994 [03:19<00:00, 275.60 examples/s]
Map: 100%|██████████| 4996/4996 [00:48<00:00, 103.90 examples/s]
Map: 100%|██████████| 4996/4996 [01:11<00:00, 69.93 examples/s]


In [26]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())

2.7.0+cpu
False


In [28]:
from transformers import AutoModelForSequenceClassification

In [31]:
import os
os.environ["TRANSFORMERS_BACKEND"] = "pt"

In [34]:
from transformers import TFAutoModelForSequenceClassification


# Get the number of labels
num_labels = len(mlb.classes_)

# Load model
model = TFAutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    problem_type="multi_label_classification",
    num_labels=num_labels,
    id2label={i: label for i, label in enumerate(mlb.classes_)},
    label2id={label: i for i, label in enumerate(mlb.classes_)}
)




Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development





All PyTorch model weights were used when initializing TFXLMRobertaForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFXLMRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [35]:
def prepare_dataset(example):
    example["labels"] = example["label_vector"]
    return example

# Apply the transformation for training and test datasets
train_dataset = train_dataset.map(prepare_dataset)
for lang in test_datasets:
    test_datasets[lang] = test_datasets[lang].map(prepare_dataset)

# Set format for TensorFlow
train_dataset.set_format(type="tensorflow", columns=["input_ids", "attention_mask", "labels"])
for lang in test_datasets:
    test_datasets[lang].set_format(type="tensorflow", columns=["input_ids", "attention_mask", "labels"])

Map: 100%|██████████| 54994/54994 [00:42<00:00, 1288.60 examples/s]
Map: 100%|██████████| 4996/4996 [00:04<00:00, 1242.97 examples/s]
Map: 100%|██████████| 4996/4996 [00:03<00:00, 1313.88 examples/s]


In [36]:
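import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(pred):
    logits, labels = pred
    # Sigmoid computed with NumPy, so the metric does not depend on a PyTorch install
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    preds = (probs > 0.5).astype(int)
    return {
        "micro_f1": f1_score(labels, preds, average="micro", zero_division=0),
        "macro_f1": f1_score(labels, preds, average="macro", zero_division=0)
    }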
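To see what the two averages measure, the sketch below recomputes them by hand with NumPy on made-up logits. Micro-F1 pools true positives, false positives and false negatives over all labels, while macro-F1 averages the per-label F1 scores, so rare labels weigh more. The helper name `f1_scores` is hypothetical, and labels with no positives and no predictions are scored 0 here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def f1_scores(logits, labels, threshold=0.5):
    """Micro and macro F1 for multi-label predictions, computed from raw counts."""
    preds = (sigmoid(np.asarray(logits, dtype=float)) > threshold).astype(int)
    labels = np.asarray(labels).astype(int)
    tp = (preds & labels).sum(axis=0)        # per-label true positives
    fp = (preds & (1 - labels)).sum(axis=0)  # per-label false positives
    fn = ((1 - preds) & labels).sum(axis=0)  # per-label false negatives

    micro_den = 2 * tp.sum() + fp.sum() + fn.sum()
    micro = 2 * tp.sum() / micro_den if micro_den else 0.0

    per_den = 2 * tp + fp + fn
    per_label = np.divide(2 * tp, per_den, out=np.zeros(len(tp)), where=per_den > 0)
    return {"micro_f1": float(micro), "macro_f1": float(per_label.mean())}


toy_logits = [[2.0, -1.0, 0.5], [-3.0, 1.5, -0.2], [1.0, -2.0, -1.0]]
toy_labels = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
scores = f1_scores(toy_logits, toy_labels)
print(scores)  # micro_f1 = 0.75, macro_f1 = 2/3
```

In this toy case the third label is always predicted wrongly, which pulls macro-F1 (2/3) below micro-F1 (0.75).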

In [46]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(mlb.classes_),  # one output per label in the binarizer, not 2
    problem_type="multi_label_classification"
)


ImportError: 
AutoModelForSequenceClassification requires the PyTorch library but it was not found in your environment.
However, we were able to find a TensorFlow installation. TensorFlow classes begin
with "TF", but are otherwise identically named to our PyTorch classes. This
means that the TF equivalent of the class you tried to import would be "TFAutoModelForSequenceClassification".
If you want to use TensorFlow, please use TF classes instead!

If you really do want to use PyTorch please go to
https://pytorch.org/get-started/locally/ and follow the instructions that
match your environment.


In [45]:
pip show transformers datasets torch

Note: you may need to restart the kernel to use updated packages.
Name: transformers
Version: 4.51.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: c:\users\guibe\onedrive\documents\ensae\3a\s2\nlp\projet\nlp-legal-document-classification\tf_env\lib\site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 
---
Name: datasets
Version: 3.5.0
Summary: HuggingFace community-driven open-source library of datasets
Home-page: https://github.com/huggingface/datasets
Author: HuggingFace Inc.
Author-email: thomas@huggingface.co
License: Apache 2.0
Location: c:\users\guibe\onedrive\documents\ensae\3a\s2\nlp\p

In [47]:
# Load pre-trained TensorFlow model and tokenizer
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
model = TFAutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    problem_type="multi_label_classification",
    num_labels=len(mlb.classes_),  # was 2, which does not match the multi-label vectors
)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

All PyTorch model weights were used when initializing TFXLMRobertaForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFXLMRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [48]:
# Define the training arguments
# Note: with transformers 4.51.3 the keyword `evaluation_strategy` raises
# "TypeError: ... unexpected keyword argument"; it has been renamed to `eval_strategy`
training_args = TrainingArguments(
    output_dir="./test_output",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,
    load_best_model_at_end=True,
    logging_dir="./logs",                    # Log directory
    report_to="tensorboard"                  # Use tensorboard for logging
)

In [44]:
# Define your training arguments (adjust hyperparameters as needed)
# Note: with transformers 4.51.3 the keyword `evaluation_strategy` raises
# "TypeError: ... unexpected keyword argument"; it has been renamed to `eval_strategy`
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./xlm-roberta-eurovoc",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="micro_f1",
)
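The name of the evaluation-strategy keyword of `TrainingArguments` depends on the installed transformers release. If the notebook has to run across several releases, one option is to inspect the constructor signature and pick whichever name it accepts. The sketch below is an assumption-laden illustration: `eval_strategy_kwarg` is a hypothetical helper, and the two stub classes only stand in for older and newer versions of `TrainingArguments`.

```python
import inspect


def eval_strategy_kwarg(args_cls, value="epoch"):
    """Return {keyword: value} using the evaluation-strategy keyword that args_cls accepts."""
    params = inspect.signature(args_cls.__init__).parameters
    for name in ("eval_strategy", "evaluation_strategy"):
        if name in params:
            return {name: value}
    raise TypeError(f"{args_cls.__name__} accepts neither eval_strategy nor evaluation_strategy")


# Stub classes standing in for newer and older TrainingArguments signatures
class NewStyleArgs:
    def __init__(self, output_dir, eval_strategy="no"):
        self.output_dir = output_dir
        self.eval_strategy = eval_strategy


class OldStyleArgs:
    def __init__(self, output_dir, evaluation_strategy="no"):
        self.output_dir = output_dir
        self.evaluation_strategy = evaluation_strategy


print(eval_strategy_kwarg(NewStyleArgs))  # {'eval_strategy': 'epoch'}
print(eval_strategy_kwarg(OldStyleArgs))  # {'evaluation_strategy': 'epoch'}
```

With the real class, the intended use would be `TrainingArguments(output_dir=..., **eval_strategy_kwarg(TrainingArguments))`; this has not been checked against every release.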

In [None]:
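# Note: Trainer only works with PyTorch models, so `model` must be loaded with
# AutoModelForSequenceClassification rather than the TF class
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_datasets["fr"],  # or "de"; the next cell loops over every test language
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)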

trainer.train()


In [None]:
for lang, dataset in test_datasets.items():
    results = trainer.evaluate(dataset)
    print(f"Language: {lang}")
    print(results)


## Test with article code

In [None]:
# label encoding
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit(train_df['level_3_labels'])

# Work on explicit copies so the assignments below do not trigger SettingWithCopyWarning
train_df = train_df.copy()
test_df = test_df.copy()
train_df['label_vector'] = list(mlb.transform(train_df['level_3_labels']))
test_df['label_vector'] = list(mlb.transform(test_df['level_3_labels']))

label_index = {label: idx for idx, label in enumerate(mlb.classes_)}


In [None]:
from datasets import load_dataset

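# For a local CSV file, pass the "csv" builder name and give the file path through data_files
streamed_dataset = load_dataset("csv", data_files="path/to/your/data.csv", split="train", streaming=True)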


In [46]:
from datasets import Dataset as HFDataset

train_dataset = HFDataset.from_pandas(train_df[['text', 'label_vector', 'celex_id']])
test_dataset = HFDataset.from_pandas(test_df[['text', 'label_vector', 'celex_id']])

ArrowMemoryError: realloc of size 826277888 failed
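One plausible cause of this memory error is that every row is converted at once. A hedged alternative to subsampling is to feed the rows through a generator that yields one flat record at a time, keeping a single language per document. The helper `iter_records` below is hypothetical; `Dataset.from_generator` exists in the datasets library, but its use here is only indicated in a comment and was not run on the full data.

```python
import pandas as pd


def iter_records(df, lang="en", columns=("label_vector", "celex_id")):
    """Yield one flat dict per row, with the text reduced to a single language."""
    for row in df.itertuples(index=False):
        # The text column may hold a {lang: text} dict or an already extracted string
        text = row.text.get(lang) if isinstance(row.text, dict) else row.text
        if text is None:
            continue  # skip documents not available in this language
        record = {"text": text}
        for col in columns:
            record[col] = getattr(row, col)
        yield record


toy = pd.DataFrame({
    "text": [{"en": "hello", "fr": "bonjour"}, {"fr": "salut"}, "already flat"],
    "label_vector": [[1, 0], [0, 1], [1, 1]],
    "celex_id": ["a", "b", "c"],
})
records = list(iter_records(toy))
print(records)

# Possible use with the datasets library (assumption, not run here):
# from datasets import Dataset
# train_dataset = Dataset.from_generator(iter_records, gen_kwargs={"df": train_df})
```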

In [51]:
small_train_df = train_df.sample(1000).copy()
train_dataset = HFDataset.from_pandas(small_train_df[['text', 'label_vector', 'celex_id']])

In [52]:
small_test_df = test_df.sample(500).copy()
test_dataset = HFDataset.from_pandas(small_test_df[['text', 'label_vector', 'celex_id']])

In [53]:
from experiments.model import Classifier

ModuleNotFoundError: No module named 'tensorflow'

In [None]:
model = Classifier(bert_model_path='xlm-roberta-base', num_labels=len(label_index))
model.adapt_model(use_adapters=True, num_frozen_layers=None)  # Or skip this for baseline



In [None]:
train_gen = SampleGenerator(train_dataset, label_index, 'xlm-roberta-base', lang='en', multilingual_train=False)
test_gen = SampleGenerator(test_dataset, label_index, 'xlm-roberta-base', lang=['fr', 'de', 'es', 'it'], multilingual_train=True)


Suggested steps to fix the issue:

Downgrade Python to a compatible version (3.7–3.10): to resolve the issue, I recommend using a Python version that TensorFlow supports, ideally Python 3.10. Here's how to do it:

Step 1: Install Python 3.10

Download Python 3.10: go to the Python 3.10 download page and download the installer for your operating system.

Install Python 3.10: during installation, make sure to check the box that says "Add Python to PATH" to make it accessible from the command line.

Step 2: Create a virtual environment with Python 3.10

After installing Python 3.10, create a new virtual environment.

Windows (the `py` launcher selects the interpreter version):

    py -3.10 -m venv tf_env
    .\tf_env\Scripts\activate

macOS/Linux:

    python3.10 -m venv tf_env
    source tf_env/bin/activate

Step 3: Install TensorFlow in the new virtual environment

After creating and activating your new environment, install TensorFlow:

    pip install tensorflow

Verify the TensorFlow installation: once TensorFlow is installed, check that the installation succeeded by running:

    import tensorflow as tf
    print(tf.__version__)

This should print the TensorFlow version without errors.

Alternative: Using Docker for Isolation (Optional)
If you prefer not to downgrade Python globally or create a new Python installation, you can use Docker to run a TensorFlow-compatible environment in an isolated container. Docker allows you to run a specific version of Python and TensorFlow without affecting your system-wide Python installation.
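Whichever route is chosen, it helps to confirm from inside the notebook which interpreter and which libraries the kernel actually sees, since the errors above come from a kernel where `torch` or `tensorflow` was missing. A minimal sketch follows; `environment_report` is a hypothetical helper and the package list is only an example.

```python
import importlib.util
import sys


def environment_report(packages=("tensorflow", "torch", "transformers", "datasets")):
    """Report the Python version and whether each package can be imported by this kernel."""
    report = {"python": "%d.%d.%d" % sys.version_info[:3]}
    for name in packages:
        # find_spec returns None when the top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```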