# Predicting Judicial Decisions of the European Court of Human Rights

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In this notebook, we train a model to classify cases as 'violation' or 'non-violation' using a BERT sequence classification model from the Transformers library.
The cases were originally downloaded from HUDOC and are organised by the article they fall under.

In [0]:
!pip install transformers



In [0]:
import tensorflow as tf
from transformers import *
import torch

In [0]:
import numpy as np
import re
import os
import copy

To read our dataset, we use os.walk to traverse the directory tree and load all of our training cases and labels. We skip the folder 'both', because the files inside are labelled as both violation and non-violation.
The dataset is loaded into two dictionaries keyed by article: the values are lists of cases (X, our training set) or lists of labels (Y).

## Data

In [0]:
def read_dataset(PATH):
    X_dataset = {}
    Y_dataset = {}
    for path, dirs, files in os.walk(PATH):
        for filename in files:
            fullpath = os.path.join(path, filename)
            if "both" not in fullpath:
                with open(fullpath, 'r', encoding="utf8") as file:
                    X_dataset, Y_dataset = add_file_to_dataset(fullpath, X_dataset, Y_dataset, file.read())

    return X_dataset, Y_dataset       
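The traversal above can be tried on a throwaway directory tree. The sketch below is only an illustration: the layout `Article3/{violation,non-violation,both}/case.txt` and the helper name `collect_case_paths` are assumptions, not the real dataset structure. It also compares whole path components instead of substrings, which is a slightly stricter variant of the `"both" not in fullpath` check.

```python
import os
import tempfile

def collect_case_paths(root):
    """Return all file paths under root, skipping files inside a 'both' folder."""
    kept = []
    for path, dirs, files in os.walk(root):
        for filename in files:
            fullpath = os.path.join(path, filename)
            # Compare path components relative to root, not raw substrings
            parts = os.path.relpath(fullpath, root).split(os.sep)
            if "both" not in parts:
                kept.append(fullpath)
    return sorted(kept)

# Hypothetical layout, created only for this demonstration
with tempfile.TemporaryDirectory() as root:
    for label in ("violation", "non-violation", "both"):
        folder = os.path.join(root, "Article3", label)
        os.makedirs(folder)
        with open(os.path.join(folder, "case.txt"), "w", encoding="utf8") as f:
            f.write("placeholder text")
    found = [os.path.relpath(p, root) for p in collect_case_paths(root)]

print(found)
```

Only the files under `violation` and `non-violation` are returned; the one under `both` is skipped.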

In [0]:
def add_file_to_dataset(fullpath, x_dataset, y_dataset, file):
    article = extract_article(fullpath)
    file = preprocess(file)
    if article not in x_dataset:
        x_dataset[article] = []
        y_dataset[article] = []
    x_dataset[article].append(file)
    # Test for "non-violation": "violation" alone would also match those paths
    label = 0 if "non-violation" in fullpath else 1
    y_dataset[article].append(label)
    return x_dataset, y_dataset

We use a regular expression to extract the Article identifier (for example 'Article3') from the full path and insert the file into the list under that specific Article.

In [0]:
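def extract_article(path):
    # Match e.g. "Article10" anywhere in the path
    pattern = r"(Article\d+)"
    result = re.search(pattern, path)
    if result is None:
        raise ValueError("No article folder found in path: " + path)
    return result.group(1)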
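As a small illustration of how the article and the label are both read off the path, the sketch below runs the same two checks on made-up paths. The paths and the helper name `parse_case_path` are hypothetical and only mimic the folder naming that the code above relies on.

```python
import re

def parse_case_path(path):
    """Return (article, label) for a case path; label 1 = violation, 0 = non-violation."""
    match = re.search(r"(Article\d+)", path)
    if match is None:
        raise ValueError("No article folder found in path: " + path)
    # "violation" is a substring of "non-violation", so test the longer name
    label = 0 if "non-violation" in path else 1
    return match.group(1), label

# Hypothetical example paths
examples = [
    "train/Article3/violation/case_001.txt",
    "train/Article10/non-violation/case_002.txt",
]
parsed = [parse_case_path(p) for p in examples]
print(parsed)
```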

### Preprocessing 

Following the research paper this work is based on, we use only the PROCEDURE and THE FACTS paragraphs of the cases as our training set. The later sections contain the Court's reasoning and its decision, so including them may bias the model.

In [0]:
def preprocess(file): 
    file = extract_paragraphs(file)
    return file

In [0]:
def extract_paragraphs(file):
    # Remove control characters and characters in the \x7f-\xff range
    file = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', file)

    # # Remove any number at the beginning of a new line
    # pat = r'\n[0-9].'
    # result = re.findall(pat, file, re.S | re.IGNORECASE)
    # for group in result:
    #     file = file.replace(group, "\n")

    # Extract THE FACTS paragraphs: everything from the facts heading up to,
    # but not including, the heading that opens the next section
    pat = r'((THE CIRCUMSTANCES OF THE CASE\s*\n.+?RELEVANT DOMESTIC LAW.+?)|(\n(AS TO THE FACTS|THE FACTS)\s*\n.+?))(\nIII\.|THE LAW\s*\n|PROCEEDINGS BEFORE THE COMMISSION\s*\n|ALLEGED VIOLATION OF ARTICLE [0-9]+ OF THE CONVENTION \s*\n)'
    result = re.search(pat, file, re.S | re.IGNORECASE)
    if result is None:
        # Fail with a clear message instead of an AttributeError on None
        raise ValueError("Could not locate THE FACTS section in this case")

    return result.group(1)
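To make the effect of the extraction pattern concrete, here is a minimal sketch that applies the same regular expression to a toy judgment. The text is invented for illustration and far shorter than a real HUDOC case; the point is only that the match stops before THE LAW, so the outcome never reaches the model.

```python
import re

# Same pattern as in extract_paragraphs
FACTS_PATTERN = r'((THE CIRCUMSTANCES OF THE CASE\s*\n.+?RELEVANT DOMESTIC LAW.+?)|(\n(AS TO THE FACTS|THE FACTS)\s*\n.+?))(\nIII\.|THE LAW\s*\n|PROCEEDINGS BEFORE THE COMMISSION\s*\n|ALLEGED VIOLATION OF ARTICLE [0-9]+ OF THE CONVENTION \s*\n)'

# Invented toy judgment, not taken from the dataset
toy_case = (
    "PROCEDURE\n"
    "1. The case originated in an application against a State.\n"
    "THE FACTS\n"
    "2. The applicant was born in 1970.\n"
    "THE LAW\n"
    "3. The Court finds that there has been a violation.\n"
)

match = re.search(FACTS_PATTERN, toy_case, re.S | re.IGNORECASE)
facts = match.group(1)
print(repr(facts))
```

Because the search is case-insensitive, a phrase such as "the law" on its own line inside the facts would end the match early, which is worth keeping in mind when inspecting unusually short extractions.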

### Loading the data

In [0]:
base_path = "/content/drive/My Drive/Colab Notebooks/Datasets/Human rights dataset"

In [0]:
X_train_docs, Y_train_docs = read_dataset(base_path + "/train")
#X_extra_test_docs, Y_extra_test = read_dataset(base_path + "\\test_violations")

Following Medvedeva, M., Vols, M. & Wieling, M., Artif Intell Law (2019), we remove the articles that contain too few cases (here, 50 or fewer). We include Article 11 "as an estimate of how well the model performs when only very few cases are available".

In [0]:
def select_articles(train_set):
    # Keep only the articles with more than 50 cases
    return {key: copy.deepcopy(cases)
            for key, cases in train_set.items()
            if len(cases) > 50}

In [0]:
X_train_docs = select_articles(X_train_docs)

In [0]:
print(len(X_train_docs))

9


In [0]:
print(X_train_docs.keys())

dict_keys(['Article11', 'Article10', 'Article13', 'Article5', 'Article3', 'Article6', 'Article14', 'Article2', 'Article8'])
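The selection step filters only the dictionary of cases; the labels stay aligned because later cells index both dictionaries with the same article keys. As a hedged sketch of an alternative, the hypothetical helper below (`select_articles_with_labels`, not part of the notebook) filters both dictionaries together so they cannot drift apart. The toy data is invented.

```python
def select_articles_with_labels(x_docs, y_docs, min_cases=51):
    """Keep articles with at least min_cases cases, in both dictionaries."""
    kept = [article for article, cases in x_docs.items() if len(cases) >= min_cases]
    x_selected = {article: list(x_docs[article]) for article in kept}
    y_selected = {article: list(y_docs[article]) for article in kept}
    return x_selected, y_selected

# Toy data: one article above the threshold, one below it
x_docs = {"Article3": ["text"] * 60, "Article4": ["text"] * 5}
y_docs = {"Article3": [1] * 60, "Article4": [0] * 5}

x_sel, y_sel = select_articles_with_labels(x_docs, y_docs)
print(list(x_sel), list(y_sel))
```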


### Combining all the articles according to class

In [0]:
article_order = ["Article2", "Article3", "Article5", "Article6", "Article8",
                 "Article10", "Article11", "Article13", "Article14"]
X_train = [case for article in article_order for case in X_train_docs[article]]

In [0]:
Y_train = [label for article in article_order for label in Y_train_docs[article]]

In [0]:
len(Y_train)

3131
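Since the cross-validation later in the notebook is stratified by label, it can help to look at the class balance of the combined label list first. The following is a minimal sketch on toy labels; `class_balance` is a hypothetical helper, and the real proportions in `Y_train` are not shown here.

```python
from collections import Counter

def class_balance(labels):
    """Return the share of each label value, keyed by label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

# Toy labels: 1 = violation, 0 = non-violation
toy_labels = [1, 1, 1, 0]
balance = class_balance(toy_labels)
print(balance)
```

Calling `class_balance(Y_train)` in the notebook would report the same shares for the real data.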

### Activate logging

In [0]:
import logging
logging.basicConfig(level=logging.INFO)

In [0]:
# Fall back to the CPU when no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [0]:
n_gpu = torch.cuda.device_count()
n_gpu

1

In [0]:
torch.cuda.get_device_name(0)

'Tesla P100-PCIE-16GB'

## Training with the BERT Model

Credit to https://mccormickml.com/2019/07/22/BERT-fine-tuning/ for explaining and demonstrating how to fine-tune BERT.

#### Tokenization

In [0]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


BERT imposes two constraints:

* All sentences must be padded or truncated to a single, fixed length and must be formatted with the special tokens ([CLS], [SEP]).
* The maximum sentence length is 512 tokens.

Because most of our cases are longer than 512 tokens, the tokenization step keeps only the first 512 tokens of each case.
Any case shorter than 512 tokens is padded until it reaches 512.

To see how much this affects us, we compute the maximum, average, and minimum token counts:

In [0]:
# max_len = 0
# avg = 0
# min_len = 999990000

# for case in X_train:
#     tokens = tokenizer.tokenize(case)
#     max_len = max(len(tokens), max_len)
#     avg += len(tokens)
#     min_len = min(len(tokens), min_len)

# print("Maximum tokens: " + str(max_len)) ### 60252
# print("Average tokens: " + str(avg/len(X_train))) ### 4266
# print("Minimum tokens: " + str(min_len)) ### 118
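With an average of about 4266 tokens per case, truncation to 512 discards most of the text of a typical case. The truncation and padding rules described above can be sketched without loading a tokenizer. This is an illustration only: `to_fixed_length` is a hypothetical helper, the word IDs are arbitrary, and only the special-token IDs (101, 102, 0) are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased

def to_fixed_length(token_ids, max_length=512):
    """Truncate or pad a list of token IDs and build the matching attention mask."""
    body = token_ids[:max_length - 2]        # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                    # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 marks padding

# A short sequence is padded; a long one is cut to its first tokens
short_ids, short_mask = to_fixed_length([2023, 2003, 1037], max_length=8)
long_ids, long_mask = to_fixed_length(list(range(1000, 1020)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```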

In [0]:
# Tokenize all of the cases, map the tokens to their word IDs and retrieve attention masks.

# `batch_encode_plus` will, for each case:
#   (1) Tokenize the text.
#   (2) Prepend the `[CLS]` token to the start.
#   (3) Append the `[SEP]` token to the end.
#   (4) Truncate to `max_length` and pad shorter cases up to that length.
#   (5) Map tokens to their IDs.
#   (6) Build an attention mask marking real tokens versus padding.
# Note: later transformers releases deprecate `pad_to_max_length` in favour of
# `padding='max_length'` together with `truncation=True`.
encoded_sent = tokenizer.batch_encode_plus(
                    X_train,                   # list of cases to encode.
                    add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                    pad_to_max_length = True,  # Add padding
                    max_length = 512
               )
    
# Retrieve attention mask and token IDs.
attention_masks = encoded_sent['attention_mask']
input_ids = encoded_sent['input_ids']

In [0]:
print(type(attention_masks))

<class 'list'>


In [0]:
attention_masks_np = np.array(attention_masks)
input_ids_np = np.array(input_ids)
Y_train_np = np.array(Y_train)

#### Training Set & Validation Set
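This section splits the encoded data into stratified folds, slicing the token IDs and the attention masks in two separate passes over the same splitter. As a minimal sketch on toy arrays, assuming NumPy and scikit-learn are installed, the code below checks that a `StratifiedKFold` with an integer `random_state` returns the same folds on repeated calls, and shows how inputs, masks and labels could be sliced in a single loop. The `toy_*` arrays and the `folds` list are stand-ins, not notebook variables.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins for input_ids_np, attention_masks_np and Y_train_np
toy_ids = np.arange(40 * 4).reshape(40, 4)
toy_masks = np.ones((40, 4), dtype=int)
toy_labels = np.array([0, 1] * 20)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# With an integer random_state, each call to split() yields the same folds
first_pass = [val.tolist() for _, val in kfold.split(toy_ids, toy_labels)]
second_pass = [val.tolist() for _, val in kfold.split(toy_masks, toy_labels)]
print(first_pass == second_pass)

# Single-loop variant: slice every array with the same indices
folds = []
for train_idx, val_idx in kfold.split(toy_ids, toy_labels):
    folds.append({
        "train_inputs": toy_ids[train_idx],
        "train_masks": toy_masks[train_idx],
        "train_labels": toy_labels[train_idx],
        "val_inputs": toy_ids[val_idx],
        "val_masks": toy_masks[val_idx],
        "val_labels": toy_labels[val_idx],
    })

print(len(folds), sum(len(f["val_labels"]) for f in folds))
```

Slicing everything in one loop avoids relying on the two passes staying in step, at the cost of holding all arrays for a fold in memory at once.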

In [0]:
from sklearn.model_selection import StratifiedKFold
base_path = "/content/drive/My Drive/Colab Notebooks/BertModel/"

runs = 0
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for train_index, test_index in kfold.split(input_ids_np, Y_train_np):
    runs += 1
    # `/` is true division in Python 3, so run 1 produces the tag "1.1.1" and run 10 "2.0.0"
    version = "" + str(((runs / 10) + 1)) + '.' + str((runs % 10))
    #print("Version: " + version)

    # Retrieving the actual data based on index
    train_inputs, validation_inputs= input_ids_np[train_index], input_ids_np[test_index]
    train_labels, validation_labels = Y_train_np[train_index], Y_train_np[test_index]

   
    # Converting to PyTorch Data Types
    train_inputs = torch.tensor(train_inputs)
    validation_inputs = torch.tensor(validation_inputs)
    #print("Shape: " + str(validation_inputs.shape))

    train_labels = torch.tensor(train_labels)
    validation_labels = torch.tensor(validation_labels)
    #print("Shape: " + str(validation_labels.shape))


    # Saving our training data
    torch.save(train_inputs, base_path + 'train_inputs_' + str(version) + '.pt')
    torch.save(validation_inputs, base_path + 'validation_inputs_' + str(version) + '.pt')

    torch.save(train_labels, base_path + 'train_labels_' + str(version) + '.pt')
    torch.save(validation_labels, base_path + 'validation_labels_' + str(version) + '.pt')

runs = 0
# kfold uses a fixed random_state, so this pass reproduces the folds of the loop above
for train_index, test_index in kfold.split(attention_masks_np, Y_train_np):
    runs += 1
    version = "" + str(((runs / 10) + 1)) + '.' + str((runs % 10))
    #print("Version: " + version)


    train_masks, validation_masks= attention_masks_np[train_index], attention_masks_np[test_index]

    train_masks = torch.tensor(train_masks)
    validation_masks = torch.tensor(validation_masks) 
    #print("Shape: " + str(validation_masks.shape))

    torch.save(train_masks, base_path + 'train_masks_' + str(version) + '.pt')
    torch.save(validation_masks, base_path + 'validation_masks_' + str(version) + '.pt')

Version: 1.1.1
Shape: torch.Size([314, 512])
Shape: torch.Size([314])
Version: 1.2.2
Shape: torch.Size([313, 512])
Shape: torch.Size([313])
Version: 1.3.3
Shape: torch.Size([313, 512])
Shape: torch.Size([313])
Version: 1.4.4
Shape: torch.Size([313, 512])
Shape: torch.Size([313])
Version: 1.5.5
Shape: torch.Size([313, 512])
Shape: torch.Size([313])
Version: 1.6.6
Shape: torch.Size([313, 512])
Shape: torch.Size([313])
Version: 1.7.7
Shape: torch.Size([313, 512])
Shape: torch.Size([313])
Version: 1.8.8
Shape: torch.Size([313, 512])
Shape: torch.Size([313])
Version: 1.9.9
Shape: torch.Size([313, 512])
Shape: torch.Size([313])
Version: 2.0.0
Shape: torch.Size([313, 512])
Shape: torch.Size([313])
Version: 1.1.1
Shape: torch.Size([314, 512])
Version: 1.2.2
Shape: torch.Size([313, 512])
Version: 1.3.3
Shape: torch.Size([313, 512])
Version: 1.4.4
Shape: torch.Size([313, 512])
Version: 1.5.5
Shape: torch.Size([313, 512])
Version: 1.6.6
Shape: torch.Size([313, 512])
Version: 1.7.7
Shape: torch.Si