# Predicting Judicial Decisions of the European Court of Human Rights

In this notebook, we train a ULMFiT classifier to label cases as 'violation' or 'non-violation'.
The cases were originally downloaded from HUDOC and grouped by the article they fall under.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
from fastai import *
from fastai.text import *
from fastai.utils.mem import gpu_mem_get_free_no_cache
from sklearn.model_selection import train_test_split

In [0]:
import numpy as np
import re
import os
import copy
import torch

To read our dataset, we use os.walk to traverse the directory tree and load all of our training cases and labels. We skip the folder 'both', because the files inside are labelled as both violation and non-violation.
The dataset is loaded into two dictionaries keyed by article: the values are lists of cases (X, our training set) or lists of labels (Y).

In [0]:
def read_dataset(PATH):
    X_dataset = {}
    Y_dataset = {}
    for path, dirs, files in os.walk(PATH):
        for filename in files:
            fullpath = os.path.join(path, filename)
            if "both" not in fullpath:
                with open(fullpath, 'r', encoding="utf8") as file:
                    X_dataset, Y_dataset = add_file_to_dataset(fullpath, X_dataset, Y_dataset, file.read())

    return X_dataset, Y_dataset       

In [0]:
def add_file_to_dataset(fullpath, x_dataset, y_dataset, file):
    # The article and the label are both encoded in the file path.
    article = extract_article(fullpath)
    file = preprocess(file)
    if article not in x_dataset:
        x_dataset[article] = []
        y_dataset[article] = []
    x_dataset[article].append(file)
    # 0 = non-violation, 1 = violation
    label = 0 if "non-violation" in fullpath else 1
    y_dataset[article].append(label)
    return x_dataset, y_dataset
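
The walk-and-label logic can be tried on a throwaway directory tree. The sketch below is only an illustration: the folder layout (`ArticleN/violation`, `ArticleN/non-violation`, `ArticleN/both`), the file names and the helper names are assumptions, not the real dataset structure. It also checks the path relative to the root rather than the full path, so that the temporary directory name cannot interfere with the substring tests.

```python
import os
import re
import tempfile
from collections import defaultdict


def build_toy_tree(root):
    """Create a tiny, hypothetical directory tree with one file per leaf folder."""
    for article, outcome in [("Article2", "violation"),
                             ("Article2", "non-violation"),
                             ("Article3", "both")]:
        folder = os.path.join(root, article, outcome)
        os.makedirs(folder)
        with open(os.path.join(folder, "case.txt"), "w", encoding="utf8") as f:
            f.write(f"{article} {outcome} text")


def read_toy_dataset(root):
    """Walk the tree and collect texts and labels per article, skipping 'both'."""
    texts, labels = defaultdict(list), defaultdict(list)
    for path, _dirs, files in os.walk(root):
        for filename in files:
            fullpath = os.path.join(path, filename)
            rel = os.path.relpath(fullpath, root)
            if "both" in rel:
                continue  # ambiguous cases are left out
            article = re.search(r"(Article\d+)", rel).group(1)
            with open(fullpath, "r", encoding="utf8") as f:
                texts[article].append(f.read())
            # "non-violation" must be tested, since it contains "violation"
            labels[article].append(0 if "non-violation" in rel else 1)
    return dict(texts), dict(labels)


with tempfile.TemporaryDirectory() as tmp:
    build_toy_tree(tmp)
    toy_X, toy_Y = read_toy_dataset(tmp)

print(toy_X.keys(), toy_Y)
```

Only `Article2` survives in the toy result, with one label of each class; the order of the two labels depends on the order in which os.walk visits the folders.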

We use a regular expression to extract the article name (for example 'Article6') from the full path and append the file to the list for that article.

In [0]:
def extract_article(path): 
    pattern = r"(Article\d+)"
    result = re.search(pattern, path)
    article = result.group(1)
    return article
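
As a quick check of how a path maps to an article and a label, here is a small sketch. The example paths and the helper name `parse_case_path` are hypothetical; the regular expression and the labelling rule are the ones used in the cells above. Unlike `extract_article`, this variant raises a descriptive error when no article folder is found instead of failing on `None`.

```python
import re

ARTICLE_PATTERN = re.compile(r"(Article\d+)")


def parse_case_path(path):
    """Return (article, label) for a case file path; label 0 = non-violation, 1 = violation."""
    match = ARTICLE_PATTERN.search(path)
    if match is None:
        raise ValueError(f"no article folder found in path: {path!r}")
    label = 0 if "non-violation" in path else 1
    return match.group(1), label


# Hypothetical paths, for illustration only
examples = [
    "train/Article11/violation/001.txt",
    "train/Article2/non-violation/002.txt",
]
parsed = [parse_case_path(p) for p in examples]
print(parsed)
```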

### Preprocessing 

In [0]:
def preprocess(file): 
    file = extract_paragraphs(file)
    return file

In [0]:
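def extract_paragraphs(file): 
    # Strip control characters and characters in the \x7f-\xff range.
    file = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', file)
    # Group 1 is the optional PROCEDURE section, group 2 the facts section;
    # the match stops at the next section heading (for example THE LAW).
    pat = r'(PROCEDURE\s*\n.+?)?((THE CIRCUMSTANCES OF THE CASE\s*\n.+?RELEVANT DOMESTIC LAW.+?)|(\n(AS TO THE FACTS|THE FACTS)\s*\n.+?))(\nIII\.|THE LAW\s*\n|PROCEEDINGS BEFORE THE COMMISSION\s*\n|ALLEGED VIOLATION OF ARTICLE [0-9]+ OF THE CONVENTION \s*\n)'
    result = re.search(pat, file, re.S | re.IGNORECASE)
    if result is None:
        # Fail with a clear message instead of an AttributeError on None.
        raise ValueError("could not locate the facts section in this judgment")
    content = ""
    if result.group(1) is not None:
        content += result.group(1)
    content += result.group(2)
    return content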
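
To see what the pattern keeps, it can be applied to a short synthetic judgment. The text below is invented for illustration, and `extract_facts_or_none` is a hypothetical variant that returns `None` when the headings are missing; the regular expressions are copied from the cell above.

```python
import re

CONTROL_CHARS = r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]'
FACTS_PATTERN = r'(PROCEDURE\s*\n.+?)?((THE CIRCUMSTANCES OF THE CASE\s*\n.+?RELEVANT DOMESTIC LAW.+?)|(\n(AS TO THE FACTS|THE FACTS)\s*\n.+?))(\nIII\.|THE LAW\s*\n|PROCEEDINGS BEFORE THE COMMISSION\s*\n|ALLEGED VIOLATION OF ARTICLE [0-9]+ OF THE CONVENTION \s*\n)'


def extract_facts_or_none(text):
    """Return the procedure and facts sections, or None if the headings are not found."""
    text = re.sub(CONTROL_CHARS, '', text)
    result = re.search(FACTS_PATTERN, text, re.S | re.IGNORECASE)
    if result is None:
        return None
    return (result.group(1) or "") + result.group(2)


# Synthetic text, not taken from a real judgment
sample = (
    "PROCEDURE\n"
    "1. The case originated in an application.\n"
    "THE FACTS\n"
    "2. The applicant was born in 1970.\n"
    "THE LAW\n"
    "3. The Court finds as follows.\n"
)
facts = extract_facts_or_none(sample)
print(facts)
```

The extracted text covers the procedure and the facts and stops before the `THE LAW` heading, so the Court's reasoning is not included in the extracted text.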

### Loading the data

In [0]:
base_path = "/content/drive/My Drive/Colab Notebooks/Datasets/Human rights dataset"

In [0]:
X_train_docs, Y_train_docs = read_dataset(base_path + "/train")

In [0]:
X_train_docs.keys()

dict_keys(['Article11', 'Article10', 'Article12', 'Article13', 'Article5', 'Article3', 'Article4', 'Article18', 'Article6', 'Article7', 'Article14', 'Article2', 'Article8'])

Following Medvedeva, M., Vols, M. & Wieling, M., Artif Intell Law (2019), we remove the articles that contain too few cases (50 or fewer in our training set). We keep Article 11 "as an estimate of how well the model performs when only very few cases are available".

In [0]:
def select_articles(train_set):
    # Work on a copy so the original dictionary is left untouched.
    selected_training_set = copy.deepcopy(train_set)

    for key in train_set.keys():
        # Drop articles with 50 or fewer training cases.
        if len(train_set[key]) <= 50:
            selected_training_set.pop(key)
    return selected_training_set

In [0]:
X_train_docs = select_articles(X_train_docs)

In [0]:
X_train_docs.keys()

dict_keys(['Article11', 'Article10', 'Article13', 'Article5', 'Article3', 'Article6', 'Article14', 'Article2', 'Article8'])
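
Note that only the dictionary of cases is filtered here; the labels dictionary keeps all its keys and is simply indexed by the selected articles later on. If both dictionaries should stay in sync, a possible alternative is sketched below. The function name and the toy counts are made up for illustration; the cut-off mirrors the `<= 50` test in `select_articles`.

```python
MIN_CASES = 50  # articles with this many cases or fewer are dropped


def select_articles_with_labels(texts, labels, min_cases=MIN_CASES):
    """Keep only articles with more than `min_cases` cases, in both dictionaries."""
    kept = [article for article, cases in texts.items() if len(cases) > min_cases]
    return ({article: texts[article] for article in kept},
            {article: labels[article] for article in kept})


# Toy data: one article above the cut-off, one below
toy_texts = {"Article2": ["text"] * 114, "Article12": ["text"] * 9}
toy_labels = {"Article2": [1] * 114, "Article12": [0] * 9}
sel_texts, sel_labels = select_articles_with_labels(toy_texts, toy_labels)
print(list(sel_texts), list(sel_labels))
```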

### Combining all the articles according to class

In [0]:
X_train = X_train_docs["Article2"] + X_train_docs["Article3"] + X_train_docs["Article5"] + X_train_docs["Article6"] + X_train_docs["Article8"] + X_train_docs["Article10"] + X_train_docs["Article11"] + X_train_docs["Article13"] + X_train_docs["Article14"]

In [0]:
article_order = ["Article2", "Article3", "Article5", "Article6", "Article8", "Article10", "Article11", "Article13", "Article14"]
print("+".join(str(len(X_train_docs[a])) for a in article_order) + "=" + str(len(X_train)))

114+568+300+916+457+212+64+212+288=3131


In [0]:
Y_train = Y_train_docs["Article2"] + Y_train_docs["Article3"] + Y_train_docs["Article5"] + Y_train_docs["Article6"] + Y_train_docs["Article8"] + Y_train_docs["Article10"] + Y_train_docs["Article11"] + Y_train_docs["Article13"] + Y_train_docs["Article14"]

In [0]:
len(Y_train)

3131
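
Because cases and labels are concatenated in two separate cells, both must use exactly the same article order, otherwise texts and labels no longer line up. A minimal sketch of one way to enforce this is to flatten both dictionaries with the same helper and the same order list. The helper name and the toy data are assumptions; the per-article counts in the final check are the ones printed above.

```python
from itertools import chain


def flatten_by_article(per_article, order):
    """Concatenate the per-article lists in a fixed article order."""
    return list(chain.from_iterable(per_article[article] for article in order))


# Toy dictionaries, deliberately stored in a different key order
toy_X_docs = {"Article3": ["c", "d", "e"], "Article2": ["a", "b"]}
toy_Y_docs = {"Article3": [0, 1, 1], "Article2": [1, 0]}
order = ["Article2", "Article3"]

toy_X = flatten_by_article(toy_X_docs, order)
toy_Y = flatten_by_article(toy_Y_docs, order)
pairs = list(zip(toy_X, toy_Y))  # each text stays next to its own label

# Sanity check on the counts reported in the notebook output
reported_counts = [114, 568, 300, 916, 457, 212, 64, 212, 288]
assert sum(reported_counts) == 3131
```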

In [0]:
X_train[0]

"PROCEDURE\n1.The case originated in an application (no. 8532/06) against the Russian Federation lodged with the Court under Article 34 of the Convention for the Protection of Human Rights and Fundamental Freedoms (“the Convention”) by a Russian national, Ms Valentina Petrovna Geppa (“the applicant”), on 30 December 2005.\n2.The applicant was represented by Ms G.V. Zambrovskaya, a lawyer practising in Kursk. The Russian Government (“the Government”) were represented by Mr G. Matyushkin, Representative of the Russian Federation at the European Court of Human Rights.\n3.The applicant alleged that the authorities were responsible for the death of her son in a correctional colony and that there had been no effective investigation of the circumstances of his death.\n4.On 6 November 2009 the President of the First Section decided to give notice of the application to the Government. It was also decided to examine the merits of the application at the same time as its admissibility (Article 29 

### Creating the Classifier DataBunches with K-folds

Credit to fast.ai for explaining and providing the code (https://github.com/fastai/course-nlp/blob/master/5-nn-imdb.ipynb).

In [0]:
base_path = "/content/drive/My Drive/Colab Notebooks"
path = base_path + "/ULMFiTModel"
path

'/content/drive/My Drive/Colab Notebooks/ULMFiTModel'

In [0]:
X_train_np = np.array(X_train)
Y_train_np = np.array(Y_train)

In [0]:
from sklearn.model_selection import StratifiedKFold
runs = 20
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
for train_index, test_index in kfold.split(X_train_np, Y_train_np):
    runs += 1
    X_train_split, X_valid_split= X_train_np[train_index], X_train_np[test_index]
    y_train_split, y_valid_split = Y_train_np[train_index], Y_train_np[test_index]

    # Creating DataFrames for the training and validation data
    train_cases = {'Case': X_train_split, 'Label': y_train_split}
    train_df = DataFrame(train_cases, columns= ['Case', 'Label'])

    valid_cases = {'Case': X_valid_split, 'Label': y_valid_split}
    valid_df = DataFrame(valid_cases, columns= ['Case', 'Label'])

    #Transforming the DataFrame into a DataBunch
    data_lm = TextLMDataBunch.from_df(path, train_df = train_df, valid_df = valid_df, text_cols = 0, bs=32)
    data_clas = TextClasDataBunch.from_df(path, train_df = train_df, valid_df = valid_df, vocab=data_lm.train_ds.vocab, text_cols = 0, label_cols = 1, bs=32)

    #Saving our training data
    lm_filename = 'data_lm_export_' + str(((runs / 10) + 1)) + '.' + str((runs % 10))
    clas_filename = 'data_clas_export_' + str(((runs / 10) + 1)) + '.' + str((runs % 10))
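    # Note: `runs / 10` is float division, so the first part of the name already changes with every fold.
    # Appending `runs % 10` as well was an unintended mistake; it is left as-is because re-running this cell takes too long.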

    data_lm.save(lm_filename)
    data_clas.save(clas_filename)
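
The naming expression can be inspected without re-running the expensive loop. The sketch below reproduces it for the ten folds and contrasts it with a hypothetical integer-based scheme. The helper `export_name` and the reading of the first number as a repetition index are assumptions, not something the notebook states.

```python
def export_names_as_written(start_run, n_folds, prefix="data_lm_export_"):
    """Rebuild the file names produced by the naming expression in the loop."""
    names = []
    runs = start_run
    for _ in range(n_folds):
        runs += 1
        names.append(prefix + str((runs / 10) + 1) + '.' + str(runs % 10))
    return names


def export_name(prefix, repetition, fold):
    """Hypothetical alternative: explicit integers, no float division involved."""
    return f"{prefix}_{repetition}.{fold}"


as_written = export_names_as_written(start_run=20, n_folds=10)
alternative = [export_name("data_lm_export", 3, fold) for fold in range(1, 11)]

for old, new in zip(as_written, alternative):
    print(old, "->", new)
```

The names produced by the loop are still unique per fold, so the saved files do not overwrite each other; any later cell that loads them has to use the same expression.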