# 0. Instructions and setup

## 0.1. Instructions. Part 1: Setting Up the Problem (1.5 points)

- **Objective**: Understand and establish the baseline for your chosen dataset.

- **Tasks:**
  - **a. Bibliography and SOA (0.25 points):** Briefly present your task by researching and documenting its main objective, a potential business case, and the current state of the art for your dataset's task. Include relevant benchmarks and methodologies. You can consult Google Scholar, NLP Index, or Papers with Code.
  - **b. Dataset Description (0.5 points):** Provide a brief overview of your dataset, including size, class distribution, and any peculiar characteristics. Include basic descriptive statistics.
  - **c. Random Classifier Performance (0.25 points):** Calculate the expected performance of a random classifier for your dataset to set a benchmark. The calculation should include an implementation.
  - **d. Baseline Implementation (0.5 points):** Develop a rule-based classifier as a baseline. Discuss its performance in the context of the dataset's complexity and compare it with human-level performance if available.

## 0.2. Libraries

In [None]:
# !pip install polars  # Install polars for faster data processing

Collecting polars
  Using cached polars-1.30.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (14 kB)
Using cached polars-1.30.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.3 MB)
Installing collected packages: polars
Successfully installed polars-1.30.0


In [None]:
from datasets import load_dataset
from pprint import pprint
import polars as pl
from library.metrics import Metrics
from library.utilities import set_seed

In [3]:
# Initialize the metrics object to save the results
metrics = Metrics()

## 0.3. Random Seed

In [None]:
# Set random seed for reproducibility
seed = 42
set_seed(seed)

Seed set to 42. This ensures reproducibility of results across runs.


# 1. Bibliography and SOA

# 2. Dataset Description: Swiss Judgment Prediction

Source: https://huggingface.co/datasets/rcds/swiss_judgment_prediction

## 2.1. Loading and restructuring the data

In [5]:
# Load original dataset (with the original languages)
swiss_french = load_dataset('swiss_judgment_prediction', 'fr', trust_remote_code=True)
swiss_italian = load_dataset('swiss_judgment_prediction', 'it', trust_remote_code=True)

In [6]:
# Print the dataset structure for the French and Italian versions
print("French version of the dataset:\n", swiss_french)
print("\nItalian version of the dataset:\n", swiss_italian)

French version of the dataset:
 DatasetDict({
    train: Dataset({
        features: ['id', 'year', 'text', 'label', 'language', 'region', 'canton', 'legal area', 'source_language'],
        num_rows: 21179
    })
    validation: Dataset({
        features: ['id', 'year', 'text', 'label', 'language', 'region', 'canton', 'legal area', 'source_language'],
        num_rows: 3095
    })
    test: Dataset({
        features: ['id', 'year', 'text', 'label', 'language', 'region', 'canton', 'legal area', 'source_language'],
        num_rows: 6820
    })
})

Italian version of the dataset:
 DatasetDict({
    train: Dataset({
        features: ['id', 'year', 'text', 'label', 'language', 'region', 'canton', 'legal area', 'source_language'],
        num_rows: 3072
    })
    validation: Dataset({
        features: ['id', 'year', 'text', 'label', 'language', 'region', 'canton', 'legal area', 'source_language'],
        num_rows: 408
    })
    test: Dataset({
        features: ['id', 'year', 'text'

In [7]:
# Access the first example in the training set
pprint(swiss_french['train'][0])

# Access the first example in the validation set
pprint(swiss_french['validation'][0])

# Access the first example in the test set
pprint(swiss_french['test'][0])

{'canton': 'n/a',
 'id': 0,
 'label': 0,
 'language': 'fr',
 'legal area': 'civil law',
 'region': 'n/a',
 'source_language': 'n/a',
 'text': "A.- Par contrat d'entreprise signé le 2 octobre 1998, Narcisse "
         'Pannatier, domicilié à Sion, a adjugé à Georges-André Dorsaz les '
         "travaux de construction d'une charpente pour une villa sise à Fully, "
         'dans le district de Martigny. Le 11 novembre 1998, Georges-André '
         'Dorsaz a adressé à Narcisse Pannatier une facture de 16 179 fr.85. '
         "Le maître de l'ouvrage a payé 15 712 fr.55. Diverses tentatives de "
         "recouvrement du solde à l'amiable ont échoué. B.- Par demande des 23 "
         'novembre/3 décembre 1999, Georges-André Dorsaz a ouvert action '
         'contre Narcisse Pannatier devant le juge de commune de Fully en vue '
         "d'obtenir le paiement de 467 fr. plus intérêts. Le juge de commune a "
         'fixé une audience au 4 février 2000. Le demandeur a donné suite à '
    

In [8]:
# Convert datasets to Polars DataFrames
def dataset_to_polars(dataset_split, seed=42):
    """Shuffle a Hugging Face dataset split and convert it to a Polars DataFrame."""
    return dataset_split.shuffle(seed=seed).to_polars()

# Convert French dataset splits
french_train_df = dataset_to_polars(swiss_french['train'])
french_val_df = dataset_to_polars(swiss_french['validation'])
french_test_df = dataset_to_polars(swiss_french['test'])

# Convert Italian dataset splits
italian_train_df = dataset_to_polars(swiss_italian['train'])
italian_val_df = dataset_to_polars(swiss_italian['validation'])
italian_test_df = dataset_to_polars(swiss_italian['test'])

# Combine all French splits into one DataFrame with a split indicator
french_df = pl.concat([
    french_train_df.with_columns(pl.lit("train").alias("split")),
    french_val_df.with_columns(pl.lit("validation").alias("split")),
    french_test_df.with_columns(pl.lit("test").alias("split"))
])

# Combine all Italian splits into one DataFrame with a split indicator
italian_df = pl.concat([
    italian_train_df.with_columns(pl.lit("train").alias("split")),
    italian_val_df.with_columns(pl.lit("validation").alias("split")),
    italian_test_df.with_columns(pl.lit("test").alias("split"))
])

# Combine both languages into a single DataFrame
combined_df = pl.concat([french_df, italian_df])

print("Combined DataFrame shape:", combined_df.shape)
print("\nDataFrame schema:")
print(combined_df.schema)
print("\nFirst few rows:")
combined_df.head()

Combined DataFrame shape: (35386, 10)

DataFrame schema:
Schema({'id': Int32, 'year': Int32, 'text': String, 'label': Int64, 'language': String, 'region': String, 'canton': String, 'legal area': String, 'source_language': String, 'split': String})

First few rows:


id,year,text,label,language,region,canton,legal area,source_language,split
i32,i32,str,i64,str,str,str,str,str,str
22014,2011,"""Faits: A. Le 28 octobre 2002 à…",0,"""fr""","""Région lémanique""","""ge""","""civil law""","""n/a""","""train"""
11593,2007,"""Faits : Faits : A. Le 17 avril…",1,"""fr""","""Région lémanique""","""ge""","""penal law""","""n/a""","""train"""
26670,2013,"""Faits: A. Par jugement du 2 ma…",0,"""fr""","""Région lémanique""","""vd""","""penal law""","""n/a""","""train"""
5864,2004,"""Faits: Faits: A. N._, née en 1…",1,"""fr""","""Région lémanique""","""vd""","""insurance law""","""n/a""","""train"""
16122,2009,"""Faits: A. Y._ est propriétaire…",0,"""fr""","""Région lémanique""","""ge""","""public law""","""n/a""","""train"""


In [9]:
# Remove columns that are not needed for the final dataset
columns_to_remove = ['source_language']
combined_df = combined_df.drop(columns_to_remove)

# Replace "n/a" placeholder strings in several columns with actual null values
columns_to_clean = ['region', 'canton', 'legal area']

for column in columns_to_clean:
    combined_df = combined_df.with_columns(
        pl.col(column).replace("n/a", None)
    )

print("\nFirst few rows:")
combined_df.head()


First few rows:


id,year,text,label,language,region,canton,legal area,split
i32,i32,str,i64,str,str,str,str,str
22014,2011,"""Faits: A. Le 28 octobre 2002 à…",0,"""fr""","""Région lémanique""","""ge""","""civil law""","""train"""
11593,2007,"""Faits : Faits : A. Le 17 avril…",1,"""fr""","""Région lémanique""","""ge""","""penal law""","""train"""
26670,2013,"""Faits: A. Par jugement du 2 ma…",0,"""fr""","""Région lémanique""","""vd""","""penal law""","""train"""
5864,2004,"""Faits: Faits: A. N._, née en 1…",1,"""fr""","""Région lémanique""","""vd""","""insurance law""","""train"""
16122,2009,"""Faits: A. Y._ est propriétaire…",0,"""fr""","""Région lémanique""","""ge""","""public law""","""train"""


In [10]:
# Save the combined DataFrame to a Parquet file
combined_df.write_parquet('swiss_judgment_prediction_fr&it_clean.parquet')

## 2.2. EDA

**Data fields**:
- id: (int) a unique identifier for the document
- year: (int) the publication year
- text: (str) the facts of the case
- label: (class label) the judgment outcome: 0 (dismissal) or 1 (approval)
- language: (str) one of (de, fr, it); only fr and it are loaded in this notebook
- region: (str) the region of the lower court
- canton: (str) the canton of the lower court
- legal area: (str) the legal area of the case

In [12]:
# Load the cleaned Parquet file
df = pl.read_parquet('swiss_judgment_prediction_fr&it_clean.parquet')

# Display the loaded DataFrame
print("\nLoaded DataFrame shape:", df.shape)
print("\nLoaded DataFrame schema:")
print(df.schema)
print("\nFirst few rows of the loaded DataFrame:")
df.head()


Loaded DataFrame shape: (35386, 9)

Loaded DataFrame schema:
Schema({'id': Int32, 'year': Int32, 'text': String, 'label': Int64, 'language': String, 'region': String, 'canton': String, 'legal area': String, 'split': String})

First few rows of the loaded DataFrame:


id,year,text,label,language,region,canton,legal area,split
i32,i32,str,i64,str,str,str,str,str
0,2000,"""A.- Par contrat d'entreprise s…",0,"""fr""",,,"""civil law""","""train"""
1,2000,"""A.- Le 12 avril 1995, A._ a su…",0,"""fr""",,,"""insurance law""","""train"""
2,2000,"""A.- En février 1994, M._ a été…",0,"""fr""","""Région lémanique""","""ge""","""insurance law""","""train"""
3,2000,"""A.- M._ a travaillé en qualité…",0,"""fr""",,,"""insurance law""","""train"""
6,2000,"""A.- Le 29 septembre 1997, X._ …",0,"""fr""","""Espace Mittelland""","""ne""","""penal law""","""train"""


# 3. Random Classifier Performance
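
Task c asks for the expected performance of a random classifier, with an implementation. The cell below is a sketch, not the final calculation. It assumes a binary label, uses accuracy as the metric, and runs on a hypothetical label vector with 25% positives, which is a placeholder and not the measured class distribution of this dataset. The helper names `expected_accuracy` and `simulate_accuracy` are illustrative. If the classifier predicts 1 with probability q regardless of the input and the true positive rate is p, the expected accuracy is p·q + (1 − p)·(1 − q).

```python
import random


def expected_accuracy(positive_rate, predict_positive_rate):
    """Closed-form expected accuracy of a classifier that ignores its input
    and predicts 1 with probability `predict_positive_rate`."""
    p, q = positive_rate, predict_positive_rate
    return p * q + (1 - p) * (1 - q)


def simulate_accuracy(labels, predict_positive_rate, n_runs=200, seed=42):
    """Average accuracy of the same random classifier over repeated simulated runs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        hits = sum((rng.random() < predict_positive_rate) == bool(y) for y in labels)
        total += hits / len(labels)
    return total / n_runs


# Hypothetical labels: 25% positives. Replace with the real label column.
labels = [1] * 250 + [0] * 750
p = sum(labels) / len(labels)

uniform = expected_accuracy(p, 0.5)   # coin flip
stratified = expected_accuracy(p, p)  # predicts 1 at the observed positive rate
majority = max(p, 1 - p)              # always predicts the most frequent class

print(f"Uniform random (analytic):    {uniform:.3f}")
print(f"Uniform random (simulated):   {simulate_accuracy(labels, 0.5):.3f}")
print(f"Stratified random (analytic): {stratified:.3f}")
print(f"Majority class:               {majority:.3f}")
```

On an imbalanced dataset, the majority-class figure is a stricter reference point than the coin flip, so it is worth reporting both.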

# 4. Baseline Implementation