# Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdictions, and Legal Domains

This notebook contains the code for the paper:

*Jaromir Savelka, Hannes Westermann, Karim Benyekhlef, Charlotte S. Alexander, Jayla C. Grant, David Restrepo Amariles, Rajaa El Hamdani, Sébastien Meeùs, Aurore Troussel, Michał Araszkiewicz, Kevin D. Ashley, Alexandra Ashley, Karl Branting, Mattia Falduti, Matthias Grabmair, Jakub Harašta, Tereza Novotná, Elizabeth Tippett, and Shiwanni Johnson. 2021. Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdictions, and Legal Domains. In Eighteenth International Conference for Artificial Intelligence and Law (ICAIL’21), June 21–25, 2021, São Paulo, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3462757.3466149*

The notebook loads data from annotated legal cases drawn from multiple contexts (languages, jurisdictions, and legal domains). It then embeds the individual sentences in a multilingual vector space using [Language-Agnostic SEntence Representations](https://github.com/facebookresearch/LASER) (LASER). Finally, it trains a gated recurrent unit (GRU) network that takes a new case and predicts the label of each sentence. The code to run the experiments presented in the paper and to create the visualizations used in its discussion section is included. Notably, this includes an evaluation of how well the model performs on a context different from the one it was trained on.

This notebook can be run either locally or in the cloud with Google Colab. 

* On Google Colab **(easier)**. Follow this link: https://colab.research.google.com/drive/1zSsKIPZXp3JdlU5E5GVZox-FiapYm4oD?usp=sharing The notebook will download the data from GitHub. Note that RAM restrictions on Colab mean that the experiments for H2 and H3 are likely to crash.
* Locally: It is recommended to set up a new Python environment and run the first cell to install the necessary requirements. Instructions for enabling CUDA training, which significantly speeds up execution, are available at https://www.tensorflow.org/install/gpu and https://pytorch.org/get-started/locally/.
Certain cells (such as loading the data from GitHub) are only required on Google Colab and can be skipped when running locally.


Created by Hannes Westermann, Cyberjustice Laboratory

# Installation

Only required the first time, or when running online. This can take a while, especially when running locally, since it downloads a number of large packages.

In [None]:
## Only required when running locally; these are already installed on Google Colab
!pip install tensorflow
!pip install pandas
!pip install matplotlib
!pip install scikit-learn

In [None]:
#Install laserembeddings and load the model. Required both on colab and locally.

from IPython.display import clear_output
!pip install laserembeddings
!python -m laserembeddings download-models
clear_output()

## RESTART RUNTIME

In [None]:
# if you are on colab, run this cell to load the data and switch to the relevant path
!git clone https://github.com/lexrosetta/caselaw_functional_segmentation_multilingual.git
%cd ./caselaw_functional_segmentation_multilingual/

# Data preparation

Select the datasets and, where a dataset has multiple annotators, the annotator whose labels are used.

In [None]:
# If you have obtained the Canadian data from CanLII, set the following variable to True:
include_canada = False

In [None]:
import os
import pandas as pd
rootdir = './data/'
datasets = []
annotators = []
if include_canada:
    datasets.append("Canada-EN-1")
    annotators.append("annotator-2")

datasets += [
            "Czech_Republic-CZ-1",
            "France-FR-1",
            "Germany-DE-1",
            "Italy-IT-1",
            "Poland-PL-1",
            "United_States-EN-1",
            "United_States-EN-2",

]
annotators += [
              "annotator-1",
              "annotator-1",
              "annotator-1",
              "annotator-1",
              "annotator-1",
              "annotator-2",
              "annotator-1"
]
dataframes = {}
for i, dataset in enumerate(datasets):
    dataframes[dataset] = pd.read_csv(rootdir+dataset+f"/{annotators[i]}-ICAIL2021.csv") 



Split the documents into 10 folds, around 10 documents per fold per dataset.

In [None]:
# K-fold split
import numpy as np
from sklearn.model_selection import KFold


for k,v in dataframes.items():
    df = v
    docs = df["Document"].unique()
    kf = KFold(n_splits=10, shuffle=True, random_state = 42)
    current_fold = 0
    ref = {}
    for train_index, test_index in kf.split(docs):
        
        for index in test_index:
            ref[docs[index]] = current_fold
        current_fold += 1
    fold_column = []
    for doc in df["Document"].values:
        fold_column.append(ref[doc])
    df["fold"] = fold_column
    elements = k.split("-")
    df["lang"] = elements[1].lower()
    df["dataset"] = k
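
The split above is made over documents rather than sentences, so every sentence of a case lands in the same fold. The following is a minimal standard-library sketch of that idea. It uses a simple shuffle-and-deal scheme as a stand-in for `KFold`, so the exact assignments will differ from the cell above; the helper `assign_folds` and the toy rows are hypothetical.

```python
import random

def assign_folds(doc_ids, n_folds=10, seed=42):
    """Map each unique document id to a fold index in [0, n_folds)."""
    unique_docs = sorted(set(doc_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_docs)
    # Deal shuffled documents round-robin so fold sizes differ by at most one.
    return {doc: i % n_folds for i, doc in enumerate(unique_docs)}

# Toy sentence table: (document id, sentence text), three sentences per document.
rows = [(f"doc{d}", f"sentence {s}") for d in range(25) for s in range(3)]
doc_to_fold = assign_folds([doc for doc, _ in rows], n_folds=10)

# Every sentence inherits the fold of its document.
sentence_folds = [doc_to_fold[doc] for doc, _ in rows]
```

Splitting at the document level matters here because the model labels whole cases as sequences; a sentence-level split would scatter one case across training and test data.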

Load all sentences into a dataframe. Perform encoding of labels into numerical values.

In [None]:
lst = []
for k,v in dataframes.items():
    lst.append(v)
data_df=pd.concat(lst, ignore_index=True)
data_df = data_df.astype({'Type': 'category'})
data_df["labels_all"] = data_df["Type"].cat.codes
data_df["labels"] = data_df["Type"].cat.codes
data_df = data_df.rename(columns={"Text": "text"})
data_df
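
To make the label encoding concrete, here is a small pure-Python sketch of what `cat.codes` produces for string labels, assuming pandas' default behaviour of sorting the categories and no missing values. The label names in the toy list are illustrative only, and `encode_labels` is a hypothetical helper.

```python
def encode_labels(labels):
    """Return (categories, codes), with categories in sorted order."""
    categories = sorted(set(labels))
    index = {label: i for i, label in enumerate(categories)}
    return categories, [index[label] for label in labels]

toy_types = ["Background", "Analysis", "Outcome", "Analysis"]
categories, codes = encode_labels(toy_types)
print(categories)  # ['Analysis', 'Background', 'Outcome']
print(codes)       # [1, 0, 2, 0]
```

Because the codes depend on the sorted set of labels present, the mapping is only stable as long as the same label set appears in the data.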

## Verification

Quick and dirty verification that labels and folds are distributed as expected.

In [None]:
print (data_df.groupby(['dataset','labels']).size())

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(data_df.groupby(['dataset','fold', 'labels']).size())

## Sentence embedding

Load the data into a list of per-case dictionaries that will be used for prediction.

In [None]:
%%time

cases = []
for doc_name in data_df["Document"].unique():
    case_df = data_df.loc[data_df['Document'] == doc_name]
    case = {"file_name": doc_name, "sents": [], "labels_text": [], "labels": []}
    case["sents"] = case_df["text"].values
    case["labels_text"] = case_df["Type"].values
    case["labels"] = list(case_df["labels_all"].values)
    case["fold"] = case_df["fold"].values[0]
    case["lang"] = case_df["lang"].values[0]
    case["dataset"] = case_df["dataset"].values[0]
    cases.append(case)

Get list of labels, number of unique labels.

In [None]:
unique_labels = list(data_df["Type"].cat.categories)
num_labels = data_df["labels_all"].nunique()

Establish maximum length (in sentences) of a case. For each case, create an embedding and pad with blank vectors to reach the maximum length of cases in sentences. This cell takes a few minutes to run.

In [None]:
%%time
import numpy as np
from numpy import argmax
import math  
from laserembeddings import Laser

laser = Laser()


#Determine maximal length in sentences of a case for padding
max_len = 0
for t, case in enumerate(cases):
    sents = case["sents"]
    case["original_length"] = len(sents)
    max_len = max(max_len, len(sents))
print ("Longest sequence: ",max_len)

pad_label = 0

# Embed sentences in cases and pad until max_len
for t, case in enumerate(cases):
    sents = case["sents"]
    case["original_length"] = len(sents)
    diff = (max_len-case["original_length"])
    message_embeddings = laser.embed_sentences(case["sents"], lang=case["lang"])
    embs = np.array(message_embeddings).tolist()
    embs_ext = embs + [[0]*1024]*diff
    case["embs_ext"] = embs_ext
    case["embs"] = embs
    case["label_ext"] = case["labels"]+[pad_label]*diff
    print (f"Created embeddings for case {t+1}/{len(cases)}")
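
The padding step can be illustrated on a toy case without running LASER. This is a sketch under the assumption that embeddings arrive as a `(n_sents, dim)` array; `pad_case` is a hypothetical helper, and the 4-dimensional vectors stand in for the 1024-dimensional ones used above.

```python
import numpy as np

def pad_case(embs, labels, max_len, pad_label=0):
    """Pad a (n_sents, dim) embedding matrix and its labels up to max_len."""
    embs = np.asarray(embs, dtype=float)
    diff = max_len - embs.shape[0]
    # Append all-zero rows so every case has the same number of timesteps.
    padded_embs = np.vstack([embs, np.zeros((diff, embs.shape[1]))])
    padded_labels = list(labels) + [pad_label] * diff
    return padded_embs, padded_labels

toy_embs = np.ones((3, 4))   # 3 sentences, 4-dim vectors
toy_labels = [2, 1, 1]
padded_embs, padded_labels = pad_case(toy_embs, toy_labels, max_len=5)
print(padded_embs.shape)     # (5, 4)
print(padded_labels)         # [2, 1, 1, 0, 0]
```

Note that the padding label coincides with a real class code (0), so padded positions have to be excluded before scoring, which is what the stored `original_length` is for.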

In [None]:
# Save embedded cases, since previous cell is slow.

import pickle
with open("cases.pkl", "wb") as outfile:
    pickle.dump((cases, datasets, data_df), outfile)

Restart point: if you need to clear RAM, you can restart the notebook or Colab session here. You can then run from the next cell to load all the required data from a file.


# Model

Loads the data and builds the model.

In [None]:
# Only necessary if running in Google Colab
%cd ./caselaw_functional_segmentation_multilingual/

In [None]:
# Load everything. The notebook should run from here after restart.

import pickle
with open("cases.pkl", "rb") as infile:
    (cases, datasets, data_df) = pickle.load(infile)

unique_labels = list(data_df["Type"].cat.categories)
num_labels = data_df["labels_all"].nunique()

In [None]:
# Required when running on Windows for some reason...
# https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

In [None]:
# Reset seed for consistent results.
from numpy.random import seed

import tensorflow as tf


def resetSeed():
    tf.compat.v1.set_random_seed(1)
    seed(1)

The following cell defines the model and some utility functions.

In [None]:
from __future__ import print_function
import numpy as np

# Relevant references:
# https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/
# https://keras.io/examples/lstm_stateful/
# https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import *
from tensorflow.keras.datasets import imdb
import tensorflow.keras
import tensorflow.keras.backend as K
from tensorflow.keras.callbacks import *
from tensorflow.keras.optimizers import *

batch_size = 32


def getData(selectedCases):
    # Load data from list of cases, transform into format expected by keras
  X = []
  y = []
  for case in selectedCases:
    X.append(case["embs_ext"])
    y.append(case["label_ext"])
  y2 = []

  # One-hot encode labels
  for case in y:
    lss = []
    for sample in case:
      ls = [0]*num_labels
      ls[sample] = 1
      lss.append(ls)
    y2.append(lss)

  num_samples = len(X)
  length = len(X[0])
  features = len(X[0][0])
  X = np.array(X)
  y = np.array(y2)
  X = X.reshape(num_samples, length, features)
  y = y.reshape(num_samples, length, num_labels)
  
  return X,y

def trainModel(train_cases, val_cases):
    resetSeed()
    x_train, y_train = getData(train_cases)
    num_samples = len(x_train)
    length = len(x_train[0])
    features = len(x_train[0][0])
    X_val, Y_val = getData(val_cases)
    print (length, features)




    ## Model
    model = Sequential()
    model.add(Masking(mask_value=0., input_shape=(length, features)))
    model.add(Bidirectional(GRU(256,stateful=False,return_sequences=True, input_shape=(length, features), dropout=0.2)))
    model.add(TimeDistributed(Dense(num_labels, activation='softmax')))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print (model.summary())


    ## Callbacks
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1,
                                patience=2, min_lr=1e-7, verbose=1)
    checkpoint_filepath = './tmp/checkpoint'
    earlyStopping = EarlyStopping(monitor='val_accuracy', patience=80, verbose=1, mode='auto')
    mcp_save = ModelCheckpoint(checkpoint_filepath, save_best_only=True, monitor='val_accuracy', mode='max', save_weights_only=True,)
    # `min_delta` is the current name of the argument formerly called `epsilon`.
    reduce_lr_loss = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=50, verbose=1, min_delta=1e-4, mode='min')

    ## Train
    history = model.fit(x_train, y_train,
            batch_size=batch_size,
            epochs=1000,
            validation_data=(X_val, Y_val),
            callbacks=[ mcp_save, earlyStopping, reduce_lr_loss])
    plot_graphs(history, "accuracy")
    model.load_weights(checkpoint_filepath)
    return model, history
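
The nested one-hot loop in `getData` can also be expressed with NumPy indexing. The following is a sketch of an equivalent formulation, not the code used for the paper; `one_hot` is a hypothetical helper and the toy labels are made up.

```python
import numpy as np

def one_hot(label_batch, num_labels):
    """Turn (num_cases, length) integer labels into (num_cases, length, num_labels)."""
    label_batch = np.asarray(label_batch, dtype=int)
    # Indexing the identity matrix by label picks the matching one-hot row.
    return np.eye(num_labels, dtype=int)[label_batch]

toy_labels = [[0, 2, 1, 0], [1, 1, 0, 0]]   # two padded cases, four sentences each
y = one_hot(toy_labels, num_labels=3)
print(y.shape)   # (2, 4, 3)
```

The resulting shape matches what the `TimeDistributed` softmax layer emits: one probability vector per sentence position in each case.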

Function to plot the history of training and validation scores.

In [None]:
import matplotlib.pyplot as plt
# https://www.tensorflow.org/text/tutorials/text_classification_rnn

def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

Function to test model on a number of cases.

In [None]:
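from numpy import argmax  # used in testModel; imported here so this cell also works after a restart
from sklearn.metrics import classification_report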


def testModel(model, test_cases):
    x_test, y_test = getData(test_cases)
    y_pred = model.predict(x_test)
    test = test_cases
    print (y_pred.shape)
    preds = []
    trues = []

    # Remove the padded part of the case for evaluation purposes.
    for a, sample in enumerate(y_pred):
        for b in range(test[a]["original_length"]):
            sentence_pred = sample[b]
            preds.append(argmax(sentence_pred))
            trues.append(test[a]["labels"][b])
    return classification_report(trues, preds, target_names=unique_labels, output_dict=True)
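
The padding removal in `testModel` can be tried on a small array of made-up probabilities. This sketch assumes predictions of shape `(num_cases, max_len, num_labels)`; `strip_padding` is a hypothetical helper that mirrors the loop above.

```python
import numpy as np

def strip_padding(y_pred, original_lengths):
    """Flatten per-sentence argmax predictions, skipping padded positions."""
    preds = []
    for case_probs, n_sents in zip(y_pred, original_lengths):
        # Only the first n_sents timesteps correspond to real sentences.
        preds.extend(np.argmax(case_probs[:n_sents], axis=-1).tolist())
    return preds

toy_pred = np.array([
    [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]],   # 2 real sentences + 1 padded position
    [[0.3, 0.7], [0.4, 0.6], [0.6, 0.4]],   # 3 real sentences
])
flat = strip_padding(toy_pred, original_lengths=[2, 3])
print(flat)   # [1, 0, 1, 1, 0]
```

Without this step, padded positions (which carry label 0) would be counted as ordinary sentences and distort the classification report.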




Function to get the relevant folds for training, evaluation (validation), and testing.

In [None]:
def getFolds(index):
    eval_folds = set([index])
    test_folds = set([(index + 1) % 10])
    train_folds = set([(index + 2 + n) % 10 for n in range(8)])
    return (train_folds, eval_folds, test_folds)

for n in range(10):
    print(getFolds(n))
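
As a sanity check on the rotation, the sketch below re-implements the same scheme and verifies that the three sets are disjoint and that each fold serves as the test fold exactly once. The `n_folds` parameter is a generalization added here for illustration; the notebook itself fixes it at 10.

```python
def get_folds(index, n_folds=10):
    """Same rotation as getFolds above, with the fold count as a parameter."""
    eval_folds = {index}
    test_folds = {(index + 1) % n_folds}
    train_folds = {(index + 2 + n) % n_folds for n in range(n_folds - 2)}
    return train_folds, eval_folds, test_folds

rotation = [get_folds(i) for i in range(10)]
for train, val, test in rotation:
    # The three sets never overlap and together cover every fold.
    assert not (train & val) and not (train & test) and not (val & test)
    assert train | val | test == set(range(10))

# Each fold is used as the test fold exactly once across the ten runs.
test_fold_order = [next(iter(test)) for _, _, test in rotation]
```

For index 9 the wrap-around gives folds 1 to 8 for training, fold 9 for evaluation, and fold 0 for testing.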

These functions retrieve the relevant cases, depending on dataset and fold. They correspond to the hypotheses in the paper.

In [None]:
def getCases(dataset, folds):
    # H1 - get cases for training from a single dataset
    return [case for case in cases if case["dataset"] == dataset and case["fold"] in folds]

def getCasesExcept(dataset, folds):
    # H2 - get cases for training from all datasets except the supplied dataset
    return [case for case in cases if case["dataset"] != dataset and case["fold"] in folds]

def getAllCases(folds):
    # H3 - get cases from all datasets
    return [case for case in cases if case["fold"] in folds]
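
The three selection rules differ only in how they filter by dataset. A compact way to see this is a single filter with optional arguments, sketched below on toy cases; `select` and the dataset names "A", "B", "C" are hypothetical and not part of the notebook.

```python
toy_cases = [
    {"dataset": "A", "fold": 0}, {"dataset": "A", "fold": 1},
    {"dataset": "B", "fold": 0}, {"dataset": "B", "fold": 1},
    {"dataset": "C", "fold": 1},
]

def select(cases, folds, only=None, exclude=None):
    """Keep cases in the given folds, optionally restricted to or excluding one dataset."""
    return [
        c for c in cases
        if c["fold"] in folds
        and (only is None or c["dataset"] == only)
        and (exclude is None or c["dataset"] != exclude)
    ]

single = select(toy_cases, {1}, only="A")          # behaves like getCases
pooled_out = select(toy_cases, {1}, exclude="A")   # behaves like getCasesExcept
pooled_all = select(toy_cases, {1})                # behaves like getAllCases
```

Keeping three separately named functions, as the notebook does, makes each experiment cell read closer to the hypothesis it tests.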

This cell sets up a list of all results, and a function to save results in a standard format.

In [None]:
import json
allResults = []
def addResults(model, train, test, results, fold_index):
    newRes = {
        "model": model,
        "train": train,
        "test": test,
        "fold_index": fold_index,
        "results": results
    }
    print ("Weighted F1-score: ",results["weighted avg"]["f1-score"])
    allResults.append(newRes)
    print (newRes)
    with open("results_tmp.json", "w") as outfile:
        json.dump(allResults, outfile)


# Experiments
This section contains the different hypotheses that we evaluate in the paper.

## Dummy
This dummy classifier is used as a baseline for H1.

In [None]:
import numpy as np
from sklearn.dummy import DummyClassifier

def trainDummy(casesProv):
    X = []
    y = []
    for case in casesProv:
        for i, sent in enumerate(case["sents"]):
            X.append(sent)
            y.append(case["labels"][i])
    dummy_clf = DummyClassifier(strategy="stratified")
    dummy_clf.fit(X, y)
    return dummy_clf

def testDummy(model, casesProv):
    X = []
    y = []
    for case in casesProv:
        for i, sent in enumerate(case["sents"]):
            X.append(sent)
            y.append(case["labels"][i])
    preds = model.predict(X)
    return classification_report(y, preds, target_names=unique_labels, output_dict=True)


for test_dataset in datasets:
    print ("-------------------")
    print (f"Starting dataset {test_dataset}")
    print ("-------------------")
    for fold_index in range(10):
        print ("-------------------")
        print (f"Starting fold index {fold_index}")
        print ("-------------------")
        (train_folds, eval_folds, test_folds) = getFolds(fold_index)
        train_cases = getCases(test_dataset, train_folds)
        val_cases = getCases(test_dataset, eval_folds)
        print (f"Training on dataset {test_dataset}, folds {train_folds}, evaluating on {eval_folds}")
        model = trainDummy(train_cases+val_cases)
        print (model)
        #for test_dataset in datasets:
        test_cases = getCases(test_dataset, test_folds)
        print (f"Model trained on {test_dataset}, evaluating on {test_dataset}, fold {test_folds}:")
        train = "dummy"
        modelName = "sequential"
        test = test_dataset
        results = testDummy(model, test_cases)
        print (results)
        print ("Weighted F1-score: ",results["weighted avg"]["f1-score"])
        addResults(modelName, train, test, results, fold_index)
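
To get a feel for what the stratified baseline can achieve, note that it ignores the input and draws each prediction from the label distribution seen in training. If the test labels follow the same distribution, the expected accuracy is the sum of the squared class proportions. The sketch below computes that quantity for a made-up distribution; `expected_stratified_accuracy` is a hypothetical helper, and the result is an expectation rather than what any single random run will return.

```python
from collections import Counter

def expected_stratified_accuracy(train_labels):
    """Expected accuracy of label-distribution guessing on identically distributed test data."""
    counts = Counter(train_labels)
    total = sum(counts.values())
    # A guess is correct when the sampled label matches the true one: sum of p_i * p_i.
    return sum((n / total) ** 2 for n in counts.values())

toy_labels = [0] * 6 + [1] * 3 + [2] * 1   # 60% / 30% / 10%
baseline = expected_stratified_accuracy(toy_labels)
print(round(baseline, 2))   # 0.46
```

The more skewed the label distribution, the higher this baseline becomes, which is why it is a useful reference point when reading the per-dataset scores.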

## H1 - Out-Context Experiment
Trains the model on a single dataset and tests it against each dataset, once for each of the 10 folds. Saves the results to the results table. This experiment takes a number of hours to run, depending on the speed of the GPU.

In [None]:
%%time
from numpy import argmax

for train_dataset in datasets:
    print ("-------------------")
    print (f"Starting dataset {train_dataset}")
    print ("-------------------")
    for fold_index in range(10):
        print ("-------------------")
        print (f"Starting fold index {fold_index}")
        print ("-------------------")
        (train_folds, eval_folds, test_folds) = getFolds(fold_index)
        train_cases = getCases(train_dataset, train_folds)
        val_cases = getCases(train_dataset, eval_folds)
        print (f"Training on dataset {train_dataset}, folds {train_folds}, evaluating on {eval_folds}")
        model, history = trainModel(train_cases, val_cases)
        for test_dataset in datasets:
            test_cases = getCases(test_dataset, test_folds)
            print (f"Model trained on {train_dataset}, evaluating on {test_dataset}, fold {test_folds}:")
            train = train_dataset
            modelName = "sequential"
            test = test_dataset
            results = testModel(model, test_cases)
            addResults(modelName, train, test, results, fold_index)

## H2 - Pooled Out-Context Experiment
Trains the model on all datasets excluding the target dataset and tests it against the target dataset, once for each of the 10 folds. Saves the results to the results table. WARNING: High RAM usage - likely to crash when running on Google Colab.

Can take a couple of hours.

In [None]:
%%time
from numpy import argmax
import json


for excl_dataset in datasets:
    print ("-------------------")
    print (f"Starting dataset {excl_dataset}")
    print ("-------------------")
    for fold_index in range(10):
        print ("-------------------")
        print (f"Starting fold index {fold_index}")
        print ("-------------------")
        (train_folds, eval_folds, test_folds) = getFolds(fold_index)
        train_cases = getCasesExcept(excl_dataset, train_folds)
        val_cases = getCasesExcept(excl_dataset, eval_folds)
        test_cases = getCases(excl_dataset, test_folds)
        model, history = trainModel(train_cases, val_cases)
        print (f"Model trained on all except test, evaluating on {excl_dataset}:")
        train = "all-excl"
        modelName = "sequential"
        test = excl_dataset
        results = testModel(model, test_cases)
        addResults(modelName, train, test, results, fold_index)
    print (f"Finished dataset {excl_dataset}. Results:")
    print (allResults)
    
    

## H3 - Pooled With In-Context Experiment
Trains the model on all datasets and tests it against each dataset, once for each of the 10 folds. Saves the results to the results table.

WARNING: High RAM usage - likely to crash when running on Google Colab.

In [None]:
%%time
for fold_index in range(10):
    print ("-------------------")
    print (f"Starting fold index {fold_index}")
    print ("-------------------")
    (train_folds, eval_folds, test_folds) = getFolds(fold_index)
    train_cases = getAllCases(train_folds)
    val_cases = getAllCases(eval_folds)
    model, history = trainModel(train_cases, val_cases)
    for test_dataset in datasets:
        print (f"Model trained on all, evaluating on {test_dataset}:")
        train = "all"
        test = test_dataset
        modelName = "sequential"
        test_cases = getCases(test_dataset, test_folds)
        results = testModel(model, test_cases)
        addResults(modelName, train, test, results, fold_index)

## Save results

Save the results to a file.

In [None]:
with open("results_final.json", "w") as outfile:
        json.dump(allResults, outfile)
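
Once the results are saved, one plausible way to summarize them is to average the weighted F1-score over folds for each train/test pair. The sketch below assumes entries shaped like those stored by `addResults`; `mean_weighted_f1` is a hypothetical helper and the scores in the toy entries are made up.

```python
from collections import defaultdict
from statistics import mean

def mean_weighted_f1(all_results):
    """Average the weighted F1-score over folds for each (train, test) pair."""
    scores = defaultdict(list)
    for res in all_results:
        scores[(res["train"], res["test"])].append(
            res["results"]["weighted avg"]["f1-score"]
        )
    return {pair: mean(vals) for pair, vals in scores.items()}

# Toy entries in the same shape that addResults stores.
toy_results = [
    {"model": "sequential", "train": "A", "test": "B", "fold_index": 0,
     "results": {"weighted avg": {"f1-score": 0.5}}},
    {"model": "sequential", "train": "A", "test": "B", "fold_index": 1,
     "results": {"weighted avg": {"f1-score": 0.75}}},
]
summary = mean_weighted_f1(toy_results)
print(summary)   # {('A', 'B'): 0.625}
```

To apply it to real output, load `results_final.json` with `json.load` and pass the resulting list to the function.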

# Visualization
This section produces two visualizations of the cases, used in the discussion section of the paper.

## Average case visualization
Here, we take the average sentence embedding of each case and project these averages to two dimensions using Principal Component Analysis (PCA). We then draw a scatter plot of the projection, coloring the cases by the dataset they come from.

Code based on `Principle Component Analysis (PCA) for Data Visualization` by Michael Galarnyk, licensed under the [MIT license](https://opensource.org/licenses/MIT), available here: https://github.com/mGalarnyk/Python_Tutorials/blob/master/Sklearn/PCA/PCA_Data_Visualization_Iris_Dataset_Blog.ipynb

In [None]:
import numpy as np
import pandas as pd

embeddings = []
for case in cases:
    embs = case["embs"]
    avg = np.mean(embs, axis=0)
    embeddings.append(avg)

In [None]:
case_df = pd.DataFrame({
    "dataset": [case["dataset"] for case in cases],
    "embedding": embeddings
})


x = embeddings
y = [case["dataset"] for case in cases]

In [None]:
from sklearn.decomposition import PCA
import pandas as pd

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
print (principalComponents.shape)
principalDf = pd.DataFrame(data = principalComponents,
             columns = ['pc1', 'pc2'])
pcaDF = pd.concat([principalDf, case_df[['dataset']]], axis = 1)
print (pcaDF.shape)
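
For readers who want to see what the projection does, here is a NumPy-only sketch of PCA via singular value decomposition. It is an illustration under the assumption of a dense `(n_cases, dim)` matrix, not a replacement for scikit-learn's `PCA`; component signs may differ from scikit-learn's output, and the random toy data stands in for the real average embeddings.

```python
import numpy as np

def pca_2d(x):
    """Project rows of x onto their first two principal components via SVD."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(20, 8))   # 20 "cases", 8-dim average embeddings
projected = pca_2d(toy_embeddings)
print(projected.shape)   # (20, 2)
```

The first column captures the direction of greatest variance across cases, which is why clusters by dataset, if present, tend to separate along it.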

In [None]:
colors = [
    '#7fc97f',
    '#beaed4',
    '#fdc086',
    '#ffff99',
    '#386cb0',
    '#f0027f',
    '#bf5b17',
    '#666666',
]

In [None]:
trainNameMapping = {
            "dummy": "Dummy",
            "Canada-EN-1": "Canada",
            "Czech_Republic-CZ-1": "Czech R.",
            "France-FR-1": "France",
            "Germany-DE-1": "Germany",
            "Italy-IT-1": "Italy",
            "Poland-PL-1": "Poland",
            "United_States-EN-1": "U.S.A. I",
            "United_States-EN-2": "U.S.A. II",
            "all": "All",
            "all-excl": "All -target"

}

targetsProper = []
for target in datasets:
    targetsProper.append(trainNameMapping[target])
print (targetsProper)

In [None]:

import matplotlib.pyplot as plt
plt.rc('legend',fontsize=16)

fig = plt.figure(figsize = (10,10))
plt.axis('off')
ax = fig.add_subplot(1,1,1) 

targets = datasets
for target, color in zip(targets,colors):
    indicesToKeep = pcaDF['dataset'] == target
    ax.scatter(pcaDF.loc[indicesToKeep, 'pc1']
               , pcaDF.loc[indicesToKeep, 'pc2']
               , c = color
               , s = 15)
ax.legend(targetsProper)
plt.savefig('PCA.pdf', bbox_inches='tight')

## Cases visualization
Visualizes the distribution of labels in the cases for each dataset.

In [None]:
%%time
import matplotlib.pyplot as plt

trainNameMapping = {
            "dummy": "Dummy",
            "Canada-EN-1": "Canada",
            "Czech_Republic-CZ-1": "Czech R.",
            "France-FR-1": "France",
            "Germany-DE-1": "Germany",
            "Italy-IT-1": "Italy",
            "Poland-PL-1": "Poland",
            "United_States-EN-1": "U.S.A. I",
            "United_States-EN-2": "U.S.A. II",
            "all": "All",
            "all-excl": "All -target"

}


plt.rcParams["axes.grid"] = False
fig = plt.figure(figsize = (20,10))
plt.gca().invert_yaxis()
plt.legend(unique_labels)
plt.axis('off')

colors = [
    "#f3722c",
    "#f9c74f",
    "#43aa8b"
]

for i, dataset in enumerate(datasets):
    ax = fig.add_subplot(2,4,i+1)
    ax.invert_yaxis()
    ax.axis("off")
    ax.set_title(trainNameMapping[dataset], fontsize = 36)
    dCases = [case for case in cases if case["dataset"] == dataset]
    for a, case in enumerate(dCases):
        caseLen = len(case["sents"])
        for b, label in enumerate(case["labels"]):
            startY = b/caseLen
            endY = (b+1)/caseLen
            line, = ax.plot([a, a], [startY, endY], color=colors[label], linewidth=3)
