#### COLIEE 2024

This notebook contains Damian Curran and Mike Conway's implementation for Task 1 of the 2024 Competition on Legal Information Extraction/Entailment (COLIEE).

Details of the implementation can be found in our paper 'Similarity Ranking of Case Law Using Propositions as Features' (2024).

#### Imports

In [6]:
# Import functions from helper python files.

import t5train_code, file_code, pairs_code, model_code
import importlib
from tqdm import tqdm
tqdm.pandas()

import pandas as pd

#### t5 Proposition Extraction Model

In [None]:
# Fine-tune t5-base on training data

importlib.reload(t5train_code)
from t5train_code import get_trainer, train_save_model

trainer = get_trainer()
train_save_model(trainer)

#### Files

In [7]:
importlib.reload(file_code)
from file_code import *

In [8]:
files = get_files()

Reading raw data from text files. Generating files dataframe.


Read-in train files: 100%|█████████████████| 7350/7350 [00:10<00:00, 687.59it/s]
Read-in test files: 100%|██████████████████| 2159/2159 [00:05<00:00, 410.31it/s]

Returning "files" df.





In [9]:
files

Unnamed: 0,filename,set,query,cases,text
0,059961,train,False,[],<FRAGMENT_SUPPRESSED> <FRAGMENT_SUPPRESSED...
1,058794,train,False,[],\n \n \n \n \n \n \n \n <FRAGMENT_SUPPRESSED...
2,057680,train,False,[],"[1]\nLeBlanc, J.\n: This is a judicial review ..."
3,075584,train,True,"[067672.txt, 019050.txt, 095238.txt, 047989.txt]","I.\nIntroduction\n[1]\nShore, J.\n[Translation..."
4,085456,train,False,[],<FRAGMENT_SUPPRESSED> <FRAGMENT_SUPPRESSED...
...,...,...,...,...,...
9284,076100,test,False,[],"[1]\nGagné, J.\n: This application for judicia..."
9285,035730,test,False,[],<FRAGMENT_SUPPRESSED> \nFederal Court\nMactav...
9286,005193,test,False,[],"[1]\nMartineau, J.\n: This is an application f..."
9287,053188,test,True,[],<FRAGMENT_SUPPRESSED> \nFederal Court\nO'Keef...


In [12]:
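# Smoke test: keep one query case and one non-query case so the remaining
# cells run quickly. Skip this cell to process the full dataset.
row_true = files[files['query'] == True].sample(n=1)
row_false = files[files['query'] == False].sample(n=1)

# Combine them
files = pd.concat([row_true, row_false])
files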

Unnamed: 0,filename,set,query,cases,text
7174,32547,test,True,[],"I.\nBackground\n[1]\nHugessen, J.\n: This is a..."
13,85437,train,False,[],-\nPara.\nINTRODUCTION\n1\n--I. NATURE OF THE ...


In [13]:
add_paragraphs(files)
get_paragraphs_formatted(files)
add_suppressed_sections(files)

Reading text from files. Extracting paragraphs based on regex pattern.


Extracting all paragraphs: 100%|██████████████████| 2/2 [00:00<00:00, 43.71it/s]

Updated "files" df has with "paragraphs".
Getting formatted paragraphs of length < 250 words.



Getting formatted paragraphs: 100%|███████████████| 2/2 [00:00<00:00, 24.86it/s]

Added formatted paragraphs to "files" df in "paragraphs_formatted".
Using regex and spacy (for long paragraphs) to extract and modify suppressed sections from paragraphs:



Extracting suppressed sections from paragraphs: 100%|█| 2/2 [00:00<00:00, 687.14

Added suppressed sections to "suppressed_sections" field.
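
The log messages above describe paragraphs being extracted with a regex pattern and then filtered to fewer than 250 words. The actual pattern lives in `file_code` and is not shown in this notebook, so the following is only a minimal sketch. It assumes paragraphs are introduced by a bracketed number on its own line, as in the `text` previews above; `split_paragraphs` and `keep_short` are hypothetical names, not functions from `file_code`.

```python
import re

# Assumed marker: a bracketed paragraph number such as "[12]" on its own line.
PARA_MARKER = re.compile(r"^\[(\d+)\]\s*$", flags=re.MULTILINE)


def split_paragraphs(text):
    """Split raw case text into (number, paragraph) tuples at each "[n]" marker."""
    pieces = PARA_MARKER.split(text)
    # pieces = [preamble, num1, body1, num2, body2, ...]
    paragraphs = []
    for i in range(1, len(pieces) - 1, 2):
        body = " ".join(pieces[i + 1].split())  # collapse newlines and runs of spaces
        paragraphs.append((int(pieces[i]), body))
    return paragraphs


def keep_short(paragraphs, max_words=250):
    """Keep paragraphs with more than one word and fewer than max_words words."""
    return [(n, p) for n, p in paragraphs if 1 < len(p.split()) < max_words]


sample = (
    "I.\nIntroduction\n[1]\nShore, J.\n: This is a judicial review.\n"
    "[2]\nOK\n[3]\nThe application is dismissed."
)
paras = keep_short(split_paragraphs(sample))
print(paras)
```

Text before the first marker (headings such as "I. Introduction") is discarded in this sketch, and the one-word paragraph is dropped by the length filter.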





In [14]:
files

Unnamed: 0,filename,set,query,cases,text,paragraphs,paragraphs_formatted,suppressed_sections
7174,32547,test,True,[],"I.\nBackground\n[1]\nHugessen, J.\n: This is a...","[[1] Hugessen, J. : This is an action in which...","[Hugessen, J. : This is an action in which the...",[[13] The Supreme Court of Canada has confirme...
13,85437,train,False,[],-\nPara.\nINTRODUCTION\n1\n--I. NATURE OF THE ...,"[[1] de Montigny, J. [Translation]: On Decembe...","[de Montigny, J. [Translation]: On December 23...",[]


In [15]:
def get_suppressed_formatted(files):
    """Clean each case's suppressed sections and store them in the "propositions" column."""

    def extract_paragraphs_formatted(row):
        # Clean every suppressed section shorter than 250 words
        # (clean_text is presumably provided by the star import from file_code above).
        paragraphs_formatted = [clean_text(p) for p in row['suppressed_sections'] if len(p.split()) < 250]
        # Drop anything left with fewer than 2 words after cleaning.
        paragraphs_formatted = [p for p in paragraphs_formatted if len(p.split()) > 1]
        return paragraphs_formatted

    tqdm.pandas(desc="Formatting suppressed sections")
    files['propositions'] = files.progress_apply(extract_paragraphs_formatted, axis=1)


get_suppressed_formatted(files)

Formatting suppressed sections: 100%|████████████| 2/2 [00:00<00:00, 535.06it/s]


In [16]:
files

Unnamed: 0,filename,set,query,cases,text,paragraphs,paragraphs_formatted,suppressed_sections,propositions
7174,32547,test,True,[],"I.\nBackground\n[1]\nHugessen, J.\n: This is a...","[[1] Hugessen, J. : This is an action in which...","[Hugessen, J. : This is an action in which the...",[[13] The Supreme Court of Canada has confirme...,[The Supreme Court of Canada has confirmed tha...
13,85437,train,False,[],-\nPara.\nINTRODUCTION\n1\n--I. NATURE OF THE ...,"[[1] de Montigny, J. [Translation]: On Decembe...","[de Montigny, J. [Translation]: On December 23...",[],[]


In [17]:
get_english_propositions(files)

Using language detection model to filter non-English propositions.


Getting English propositions: 100%|███████████████| 2/2 [00:12<00:00,  6.13s/it]

Added English-only propositions to "files" df in "propositions_en".





In [18]:
add_sentences(files)

Using spacy to extract sentences of char length > 25 from paragraphs:


Extracting sentences from paragraphs: 100%|███████| 2/2 [00:13<00:00,  6.61s/it]

Added lists of sentences to "sentences".





In [19]:
files

Unnamed: 0,filename,set,query,cases,text,paragraphs,paragraphs_formatted,suppressed_sections,propositions,propositions_en,sentences
7174,32547,test,True,[],"I.\nBackground\n[1]\nHugessen, J.\n: This is a...","[[1] Hugessen, J. : This is an action in which...","[Hugessen, J. : This is an action in which the...",[[13] The Supreme Court of Canada has confirme...,[The Supreme Court of Canada has confirmed tha...,[The Supreme Court of Canada has confirmed tha...,[This is an action in which the plaintiffs see...
13,85437,train,False,[],-\nPara.\nINTRODUCTION\n1\n--I. NATURE OF THE ...,"[[1] de Montigny, J. [Translation]: On Decembe...","[de Montigny, J. [Translation]: On December 23...",[],[],[],"[de Montigny, J. [Translation]: On December 23..."


In [20]:
for s in files['sentences'].iloc[:2]:
    print(len(s))   # how many sentences per row?

135
1717


In [None]:
files.to_csv('just_before_english_sentences_all.csv', index=False)

In [23]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from tqdm import tqdm

# Device
device = "cpu"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("papluca/xlm-roberta-base-language-detection")
model = AutoModelForSequenceClassification.from_pretrained(
    "papluca/xlm-roberta-base-language-detection"
).to(device)
model.eval()

# Batch size for processing sentences (adjust if memory is low)
BATCH_SIZE = 128

'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 85884490-31c9-4f23-94f4-848c23711b78)')' thrown while requesting HEAD https://huggingface.co/papluca/xlm-roberta-base-language-detection/resolve/main/tokenizer_config.json

Retrying in 1s [Retry 1/5].


In [24]:
def custom_get_english_sections(sections, tokenizer, model, device, batch_size=BATCH_SIZE):
    """Return the items of `sections` that the language-detection model labels as English."""
    english_sentences = []
    n_batches = (len(sections) + batch_size - 1) // batch_size  # ceiling division

    for i in range(0, len(sections), batch_size):
        batch = sections[i:i+batch_size]
        print(f"Processing batch {i//batch_size + 1} of {n_batches}...")  # DEBUG

        # Tokenize the batch; inputs longer than the model's maximum length are truncated
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**inputs)
            # Index of the highest-scoring language for each sentence
            preds = torch.argmax(outputs.logits, dim=1).cpu().tolist()

        # model.config.id2label maps class ids to language codes
        for sentence, pred_id in zip(batch, preds):
            lang = model.config.id2label[pred_id]
            if lang.lower() == "en":
                english_sentences.append(sentence)

    return english_sentences

# Apply the filter to every row of the files DataFrame
def custom_get_english_sentences(files):
    import logging
    logging.info('Using language detection model to filter non-English sentences.')

    tqdm.pandas(desc="Getting English sentences")
    files['sentences_en'] = files['sentences'].progress_apply(
        lambda sections: custom_get_english_sections(sections, tokenizer, model, device)
    )

    logging.info('Added English-only sentences to "sentences_en" column.')



In [25]:
custom_get_english_sentences(files)
files.to_csv('get_english_sentences_all.csv', index=False)

Using language detection model to filter non-English sentences.


Getting English sentences:   0%|                          | 0/2 [00:00<?, ?it/s]

Processing batch 1 of 2...
Processing batch 2 of 2...


Getting English sentences: 100%|██████████████████| 2/2 [00:58<00:00, 29.26s/it]

Processing batch 1 of 14...
Processing batch 2 of 14...
Processing batch 3 of 14...
Processing batch 4 of 14...
Processing batch 5 of 14...
Processing batch 6 of 14...
Processing batch 7 of 14...
Processing batch 8 of 14...
Processing batch 9 of 14...
Processing batch 10 of 14...
Processing batch 11 of 14...
Processing batch 12 of 14...
Processing batch 13 of 14...
Processing batch 14 of 14...


Getting English sentences: 100%|█████████████████| 2/2 [11:56<00:00, 358.13s/it]

Added English-only sentences to "sentences_en" column.
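
The "Processing batch i of n" lines above come from slicing each case's sentence list into fixed-size batches before it is passed to the language-detection model. The slicing can be illustrated on its own, without the model. In this sketch the batch total is computed with ceiling division, so a list whose length is an exact multiple of the batch size is not reported as having an extra batch; `iter_batches` is a hypothetical helper, not part of the notebook's code.

```python
def iter_batches(items, batch_size):
    """Yield (batch_number, total_batches, batch) for consecutive slices of items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = (len(items) + batch_size - 1) // batch_size  # ceiling division
    for start in range(0, len(items), batch_size):
        yield start // batch_size + 1, total, items[start:start + batch_size]


sentences = [f"sentence {i}" for i in range(256)]
for number, total, batch in iter_batches(sentences, 128):
    print(f"Processing batch {number} of {total} ({len(batch)} sentences)")
```

An empty list yields no batches at all, so a case without sentences costs no model calls.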





In [26]:
add_quotes(files)

Using regex to extract quotations from suppressed sections:


Extracting quotes from suppressed sections:: 100%|█| 2/2 [00:00<00:00, 237.45it/

Added quotes to the "quotes" field for query cases.
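
`add_quotes` is described above as using regex to extract quotations from the suppressed sections of query cases. Its pattern is not shown in this notebook. As a rough illustration only, the sketch below assumes quotations are delimited by straight or curly double quotes and ignores very short ones; `extract_quotes`, `count_shared_quotes` and the `min_words` threshold are hypothetical.

```python
import re

# Assumed delimiters: straight double quotes or curly opening/closing quotes.
QUOTE_PATTERN = re.compile(r'["\u201c]([^"\u201c\u201d]+)["\u201d]')


def extract_quotes(text, min_words=3):
    """Return quoted passages containing at least min_words words."""
    quotes = [q.strip() for q in QUOTE_PATTERN.findall(text)]
    return [q for q in quotes if len(q.split()) >= min_words]


def count_shared_quotes(quotes, candidate_text):
    """Count how many of the quotes occur verbatim in the candidate text."""
    return sum(1 for q in quotes if q in candidate_text)


section = 'The Court held that "the duty of fairness is flexible and variable" and cited "ibid".'
quotes = extract_quotes(section)
candidate = "It is settled that the duty of fairness is flexible and variable in content."
print(quotes, count_shared_quotes(quotes, candidate))
```

A verbatim match between a quotation in the query case and the text of a candidate case is a strong hint that the candidate is the quoted source, which is presumably why a quote count appears among the pair features later in the notebook.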





In [27]:
add_entities(files)

Using spacy to extract noun entities, from English sentences:


Extracting entities from english sentences: 100%|█| 2/2 [00:13<00:00,  6.54s/it]

Added entity strings to "entity_string" and entities as sets to "entity_set".





In [28]:
add_strings_sets(files)

Extracting case word strings (for tfidf) and sets (for case jaccard):


Extracting case string and sets from sentences_en:: 100%|█| 2/2 [00:00<00:00,  3

Added case strings and sets.
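
The word sets mentioned above feed a case-level Jaccard similarity. For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The tokenisation used by `add_strings_sets` is not shown here, so `word_set` below is a simplified stand-in that keeps lower-cased alphabetic tokens.

```python
import re


def word_set(text):
    """Lower-case the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


query = word_set("The officer breached the duty of procedural fairness.")
candidate = word_set("Procedural fairness requires that the officer give notice.")
print(round(jaccard(query, candidate), 3))
```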





In [29]:
add_set_lists(files)

Extracting set lists:


Extracting set list sentences_en:: 100%|██████████| 2/2 [00:00<00:00,  3.31it/s]
Extracting set list paragraphs_formatted:: 100%|██| 2/2 [00:00<00:00,  3.87it/s]
Extracting set list from propositions_en:: 100%|██| 2/2 [00:00<00:00, 69.88it/s]

Added set lists.





In [30]:
add_judge_name(files)

Using regex to extract judge surname from first paragraphs:


Extracting judge surname from paragraphs: 100%|█| 2/2 [00:00<00:00, 1877.49it/s]

Added judge name to "judge" field.
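
The paragraph previews earlier in the notebook show that a judgment typically opens with the judge's surname followed by "J." (for example "Hugessen, J. : This is an action..."). The regex used by `add_judge_name` is not shown, so the following is a sketch of one pattern that handles the openings visible above; `JUDGE_PATTERN` and `extract_judge` are hypothetical names.

```python
import re

# Optional "[n]" marker, then a surname, a comma, and a title ending in "J."
# (the optional prefix also covers forms such as "C.J.").
JUDGE_PATTERN = re.compile(r"^(?:\[\d+\]\s*)?([^,:\[\]]+?),\s*(?:[A-Z]\.\s?)*J\.")


def extract_judge(first_paragraph):
    """Return the surname if the paragraph opens with "Surname, J.", else None."""
    match = JUDGE_PATTERN.match(first_paragraph.strip())
    return match.group(1).strip() if match else None


for opening in [
    "[1] Hugessen, J. : This is an action in which the plaintiffs seek damages.",
    "[1] de Montigny, J. [Translation]: On December 23",
    "The application is dismissed.",
]:
    print(extract_judge(opening))
```

This pattern will over-match on a sentence that merely contains a comma followed by "J.", so a production version would need tighter constraints on what counts as a surname.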





In [31]:
add_year(files)

Using string search to find year:


Extracting most recent year from file text: 100%|█| 2/2 [00:00<00:00, 83.80it/s]

Added year to files.
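
The step above records the most recent year mentioned in each file. How `add_year` searches the text is not shown; a minimal sketch, assuming a year is any standalone four-digit number from 1800 up to a cut-off, could look like this. `most_recent_year` and its `latest` parameter are hypothetical.

```python
import re

# Standalone four-digit numbers beginning 18, 19 or 20.
YEAR_PATTERN = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def most_recent_year(text, latest=2024):
    """Return the largest plausible year mentioned in the text, or None if there is none."""
    years = [int(y) for y in YEAR_PATTERN.findall(text)]
    years = [y for y in years if y <= latest]  # discard numbers beyond the cut-off
    return max(years) if years else None


print(most_recent_year("Decided in 2005, citing a 1999 case and s. 2099 of the Act."))
```

One plausible use of this feature, and it is an assumption here, is to rule out candidate cases that appear to post-date the query case, since a judgment cannot cite a later decision.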





In [32]:
get_embeddings(files)
files.to_csv('embeddings_all.csv', index=False)
files

Getting embeddings from sentences, paragraphs and propositions.


Getting embeddings for sentences_en: 100%|███████| 2/2 [06:53<00:00, 206.56s/it]
Getting embeddings for paragraphs formatted: 100%|█| 2/2 [03:56<00:00, 118.30s/i
Getting embeddings for propositions_en: 100%|█████| 2/2 [00:07<00:00,  3.74s/it]

Added embeddings.





#### Pairs

In [33]:
importlib.reload(pairs_code)
from pairs_code import (
    get_pairs, add_bins,
    get_prop_max_cos_sim_sents, get_prop_max_cos_sim_paras,
    get_prop_max_jaccard_sents, get_prop_max_jaccard_paras,
    get_prop_max_overlap_sents, get_prop_max_overlap_paras,
    add_max_overall, get_case_jaccard_sims, check_same_case,
    get_case_tfidf_scores, get_num_quotes, binarize_quotes,
    check_years, add_judge_checks,
)

In [None]:
# Generate the pairs dataframe, with one query-candidate case pair per row.
# Each call below compares file features from the files df to add pair features.

pairs = get_pairs(files)
get_prop_max_cos_sim_sents(files, pairs)
get_prop_max_cos_sim_paras(files, pairs)
get_prop_max_jaccard_sents(files, pairs)
get_prop_max_jaccard_paras(files, pairs)
get_prop_max_overlap_sents(files, pairs)
get_prop_max_overlap_paras(files, pairs)
add_max_overall(pairs, files)
get_case_jaccard_sims(files, pairs)
check_same_case(pairs)
get_case_tfidf_scores(files, pairs)
get_num_quotes(files, pairs)
binarize_quotes(pairs)
check_years(files, pairs)
add_judge_checks(files, pairs)
add_bins(files, pairs)

Generating pairs dataframe from files.
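
Several of the pair features above are named as "max" similarities between a query case's propositions and a candidate case's sentences or paragraphs. The implementation is in `pairs_code` and is not reproduced here. As an illustration of the idea only, the sketch below takes each proposition embedding and finds its highest cosine similarity against all sentence embeddings of a candidate, using plain Python lists in place of real embedding vectors; `cosine` and `prop_max_cos_sims` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prop_max_cos_sims(prop_embeddings, sent_embeddings):
    """For each proposition, the highest cosine similarity to any candidate sentence."""
    return [
        max((cosine(p, s) for s in sent_embeddings), default=0.0)
        for p in prop_embeddings
    ]


props = [[1.0, 0.0], [0.0, 1.0]]  # toy query proposition embeddings
sents = [[1.0, 0.0], [1.0, 1.0]]  # toy candidate sentence embeddings
print(prop_max_cos_sims(props, sents))
```

Taking the maximum means a candidate scores highly for a proposition if even one of its sentences is close to it, regardless of how long the rest of the case is.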


#### Model

In [None]:
# Do k-fold validation on train set to identify best hyperparameters:

importlib.reload(model_code)
from model_code import get_k_fold_model_dev_pairs, save_model_df_pairs

model_df_pairs = get_k_fold_model_dev_pairs(pairs)
save_model_df_pairs(model_df_pairs)
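
How `get_k_fold_model_dev_pairs` builds its folds is defined in `model_code` and is not shown in this notebook. One point worth keeping in mind when cross-validating pair-level data is that every query case contributes many rows, so splitting rows at random would put pairs from the same query in both the training and development folds. The sketch below assumes folds are formed over query cases instead; `grouped_k_folds` is a hypothetical helper and may differ from what `model_code` does.

```python
import random


def grouped_k_folds(query_ids, k=5, seed=0):
    """Assign each distinct query id to one of k folds; yield (train_ids, dev_ids) per fold."""
    unique = sorted(set(query_ids))
    random.Random(seed).shuffle(unique)
    folds = [unique[i::k] for i in range(k)]  # round-robin assignment after shuffling
    for i in range(k):
        dev = set(folds[i])
        train = set(unique) - dev
        yield train, dev


# One entry per pair row; each query id repeats once per candidate.
pair_query_ids = [f"q{i}" for i in range(10) for _ in range(3)]
for train_ids, dev_ids in grouped_k_folds(pair_query_ids, k=5):
    print(sorted(dev_ids))
```

The returned id sets can then be used to select rows of the pairs dataframe by query case.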

In [None]:
importlib.reload(model_code)
from model_code import apply_models_to_dfs
apply_models_to_dfs(model_df_pairs, infer_type=1)

In [None]:
importlib.reload(model_code)
from model_code import apply_models_to_dfs
apply_models_to_dfs(model_df_pairs, infer_type=2)

#### Final Inference

In [1]:
# Train model

importlib.reload(model_code)
from model_code import build_train_model

train_df = pairs[pairs['set']=='train']
model, train_df = build_train_model(train_df)

In [2]:
# Generate final results

importlib.reload(model_code)
from model_code import inference_on_test

test_df = pairs[pairs['set']=='test']

for infer_type in [1,2]:
    results_df = inference_on_test(model, test_df, infer_type)
    print()