# Transjurisdictional Transformers
Helping lawyers quickly find similar legislation from another jurisdiction

### Problem Statement and Background

Lawyers engaged in cross-jurisdictional work often have to familiarise themselves quickly with the legal position in several jurisdictions. These jurisdictions may have statutes covering similar points of law without using consistent wording. Although the actual legal requirements may differ between jurisdictions, the provisions are analogous (compare, for example, the legislation on mask wearing in Singapore with that in the UK).

Finding the analogous provisions on a point of law in a different jurisdiction can be tedious: lawyers either start with background research on cross-jurisdiction comparison articles, or simply comb through the different statutes manually.

The hypothesis is that natural language processing techniques that encode semantic similarity can speed up the process of finding the analogous provisions in another jurisdiction.

### Project Goals

This project aims to help lawyers doing cross-jurisdictional research quickly match a legislative provision of interest to its equivalent in another jurisdiction. The matching will be done on a semantic level using vector similarity, specifically cosine similarity.
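As a rough illustration of matching on cosine similarity, the sketch below scores one section vector against candidates from another jurisdiction and picks the closest. The three-dimensional vectors, the ids `uk_1` and `uk_2`, and the helper names are made up for this example; real vectors come from the embedding methods listed below.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_match(query_vec, candidates):
    """Return the id of the candidate whose vector is most similar to query_vec."""
    return max(candidates, key=lambda sec_id: cosine_similarity(query_vec, candidates[sec_id]))

# Toy embeddings (assumption): one Singapore section and two UK sections.
sg_section = [0.9, 0.1, 0.0]
uk_sections = {
    "uk_1": [0.0, 0.2, 0.9],
    "uk_2": [0.8, 0.2, 0.1],
}

match = best_match(sg_section, uk_sections)
print(match)
```

Cosine similarity only compares the direction of the vectors, so a long section and a short section can still match closely if they are about the same thing.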

To achieve this, several methods for embedding each legislative provision will be used and the outcome will be evaluated on some prepared positive match examples. 

The embedding methods tried will be:
- tfidf (term frequency inverse document frequency)
- fastText
- BERT

A baseline will also be calculated from the edit distance between the titles of the legislation sections.

### Scope

For the first version of this project, each dataset contains only one piece of legislation from each jurisdiction. The legislation covered is:
- Singapore Copyright Act & UK Copyright, Designs and Patents Act (CDPA)
- Singapore and UK Trade Marks Acts
- Singapore Personal Data Protection Act and European Union (EU) General Data Protection Regulation (GDPR)

The matching will be done on a section level. For the purposes of this project, articles in the GDPR will be treated as sections.

### Adaptability

Although the scope of this project is limited to the legislations mentioned above, the limitation comes more from the time needed to annotate examples for meaningful evaluation rather than any technical limitation on data volume.

The code in these notebooks is written in a way to allow it to be as reusable as possible for anyone to try their own legislation matching with these embedding methods once they have prepared their legislation data in the required input format.

```
required_input_cols = [id_col, 'title', 'url', 'cleaned']
```
For users wishing to expand the scope of this application, the `id_col` can be renamed to something more appropriate than 'sec', and the format for creating id numbers can be made more sophisticated to account for multiple pieces of legislation per jurisdiction, or for a different provision granularity such as divisions or subsections.
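One possible way to build richer ids is sketched below. The `make_provision_id` helper, the `provision_id` column name, and all row values are hypothetical placeholders, not part of the project code.

```python
def make_provision_id(jurisdiction, act, provision):
    """Join the parts into one id, e.g. 'sg_copyright-act_s35' (hypothetical scheme)."""
    parts = [jurisdiction, act, provision]
    return "_".join(p.strip().lower().replace(" ", "-") for p in parts)

id_col = "provision_id"  # hypothetical replacement for 'sec'
required_input_cols = [id_col, "title", "url", "cleaned"]

# Placeholder row: every value here is illustrative only.
row = {
    id_col: make_provision_id("SG", "Copyright Act", "s35"),
    "title": "Example title",
    "url": "https://example.com/legislation/s35",
    "cleaned": "example cleaned text of the provision",
}

# Check the row against the required input format before using it.
missing = [col for col in required_input_cols if col not in row]
print(row[id_col], missing)
```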

# Embedding Prep 1a: Baselines, tfidf Vectors, and fastText Vectors

This notebook covers computing tfidf and fastText embeddings, as well as the edit distance between titles, measured as the Levenshtein distance.

The edit distance between titles simulates lawyers simply eyeballing similar or identical section titles to find the closest match across jurisdictions. Put differently, if our embedding methods cannot outperform this baseline, a lawyer might as well scan or Ctrl-F through the titles themselves to find the equivalent legislation.
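To make the baseline concrete, here is a minimal pure-Python sketch of the Levenshtein distance used to pick the closest title. The notebook itself uses `nltk.edit_distance`; the `levenshtein` function and the titles below are made up for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

# Made-up titles (assumption): one query title and candidates from another jurisdiction.
query_title = "Duration of copyright"
candidate_titles = [
    "Fees",
    "Duration of copyright in works",
    "Power of the Minister to make regulations under this Act",
]

closest = min(candidate_titles, key=lambda title: levenshtein(query_title, title))
print(closest)
```

The baseline rewards titles that look alike character by character, which is why it cannot find matches whose titles are worded differently.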

### Why tfidf?
We are trying tfidf because it is a relatively simple representation of the legislative text to compute. tfidf is commonly used in search engines and to featurise data for downstream machine learning tasks like text classification. Its reliability over the years is also thanks to its robustness on longer texts, likely because the term frequency element weighs the count of each term against the document length.

One drawback of tfidf, however, is that it does not encode semantic similarity between synonymous terms, as each word/phrase is a separate feature in the matrix no matter how similar their meanings are. This could be an obstacle for this project if the two jurisdictions use very different terms to express the same legal concepts.
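The synonym problem can be seen in a small sketch. The code below is a simplified tf-idf written for illustration (raw counts times `log(N / df)`, which is not the exact weighting scikit-learn uses), and the three sentences are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Simplified tf-idf: raw term count times log(n_docs / document frequency)."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    n_docs = len(docs)
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        vectors.append([counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vectors

def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

docs = [
    "lawyer drafts agreement",
    "attorney prepares contract",  # same meaning as the first, different words
    "lawyer reviews agreement",    # different meaning, shared words
]
vecs = tfidf_vectors(docs)

sim_synonyms = cosine(vecs[0], vecs[1])
sim_shared = cosine(vecs[0], vecs[2])
print(sim_synonyms, sim_shared)
```

The two sentences that mean the same thing score zero because they share no terms, while the sentence that merely reuses words scores higher.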

### Why fastText?

See [Enriching Word Vectors with Subword Information (Bojanowski et al., 2016)](https://arxiv.org/pdf/1607.04606v2.pdf).

fastText is a word embedding method that is similar to word2vec, but adds character-level ngram representations that preserve subword information. As in word2vec, the word representations are trained with either the continuous bag-of-words (CBOW) or the skipgram method, in which the representation of a word is learned from the words occurring around it in the training corpus.

Theoretically, this should overcome the tfidf weakness of not being able to encode semantic similarity. Ideally, fastText should perform well on capturing similar meaning between legislation sections that do not necessarily use the exact same words.
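The subword idea can be sketched as follows. This is only an illustration of how character ngrams are extracted, following the description in the paper linked above, where a word is wrapped in `<` and `>` boundary markers. The `char_ngrams` helper is hypothetical, and its default range of 3 to 6 characters is an assumption based on fastText's `minn` and `maxn` options.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character ngrams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

# The trigram example from the paper: the word "where".
trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)

# Related word forms share many ngrams, so their vectors share components.
shared = set(char_ngrams("copyright")) & set(char_ngrams("copyrighted"))
print(sorted(shared)[:5])
```

Because a word vector is built from its ngram vectors, word forms that never appear in the training corpus can still get a sensible representation.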

## Expected Output from this Notebook

In this notebook, we are embedding our clean data to get vector representations for the legislation sections so that we can match them later on.

As such, these are the outputs we expect to get and save from this notebook:
- title edit distance scores, serving as our baseline (csv file)
- tfidf vectors representing each legislation (npy file)
- fasttext vectors representing each legislation (npy file)

The outputs below are based on data for the SG Copyright Act and UK CDPA. Note that the data files containing the legislation content are not in the repo.

### Specifying Save Data Paths

In [1]:
edit_distance_file = '../data/baselines/test_saving_copyright_levdists.csv'
tfidf_vector_file = '../data/vectors/copyright/test_saving_cp_sg_uk_vecs_tfidf.npy'
ft_vector_file = '../data/vectors/copyright/test_saving_cp_sg_uk_vecs_ft.npy'

## Imports and Loading Data

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from nltk import edit_distance
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import fasttext as ft

In [3]:
id_col = 'sec'
required_input_cols = [id_col, 'title', 'url', 'cleaned']

In [4]:
input_data_filepath = '../data/clean/copyright/sg_uk_copyright.csv' 
# this data file will not be pushed to git repo 

In [5]:
data = pd.read_csv(input_data_filepath)

In [6]:
try:
    data = data[required_input_cols]
except KeyError:
    raise Exception('Ensure that the input data contains all the columns specified in required_input_cols.')

## Computing Baseline on Title Edit Distance

Our baseline will later be computed from the edit (Levenshtein) distance between legislation section titles. To prepare for that evaluation, we create a matrix that contains the edit distance between each entry's title and every other entry's title.

X represents what will be our edit distance matrix.

In [7]:
X = data[['title']]

In [8]:
X = pd.concat([X, pd.DataFrame(np.zeros((data.shape[0], data.shape[0])))], axis=1)

In [9]:
X.iloc[:, 1:] = data['title'].values

Compute the Levenshtein distance.

In [10]:
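# write by position: rows yielded by iterrows() may be copies, so edits to them can be lost
titles = data['title'].values
for i, main_text in enumerate(titles):
    for j, other_text in enumerate(titles):
        X.iat[i, 1 + j] = edit_distance(main_text, other_text)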

In [11]:
result_lds = X.drop(columns='title')

In [12]:
result_lds = result_lds.set_index(data[id_col])

In [13]:
result_lds = result_lds.astype('int')

In [14]:
col_mapper = dict(zip(result_lds.columns, result_lds.index.values))

In [15]:
result_lds = result_lds.rename(columns=col_mapper)

In [16]:
result_lds.shape

(780, 780)

In [17]:
result_lds.to_csv(edit_distance_file)

## Preparing tfidf Embeddings

We set `max_df` to 0.95 to filter out terms that appear in more than 95% of the sections. This should remove most stopwords without requiring us to load a stopwords file. We use an ngram range of (1,2) because legal terms often depend on their context, while staying below 3 keeps the feature space from bloating.
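The effect of these two settings can be sketched in plain Python. This is a simplified stand-in for what `TfidfVectorizer` does when it builds its vocabulary: it uses whitespace tokenisation rather than scikit-learn's own tokeniser, and the helper names and sentences are made up.

```python
def unigrams_and_bigrams(text):
    """Terms for an ngram range of (1, 2): single tokens plus adjacent token pairs."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def build_vocabulary(docs, max_df=0.95):
    """Keep terms whose document frequency is at most max_df * number of documents."""
    doc_freq = {}
    for doc in docs:
        for term in set(unigrams_and_bigrams(doc)):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    limit = max_df * len(docs)
    return sorted(term for term, df in doc_freq.items() if df <= limit)

docs = [
    "the owner of the copyright",
    "the duration of the copyright",
    "the fair dealing exception",
]
vocab = build_vocabulary(docs)
print(vocab)
```

Here "the" appears in every document and is dropped, while bigrams such as "fair dealing" are kept as features of their own.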

In [18]:
tfv = TfidfVectorizer(ngram_range=(1,2), max_df=0.95)

In [19]:
tfv_matrix = tfv.fit_transform(data['cleaned'])

In [20]:
tfv_matrix = tfv_matrix.toarray()  # dense ndarray rather than np.matrix

In [21]:
np.save(tfidf_vector_file, tfv_matrix)

## Preparing fastText Embeddings

To train a fastText model with our data, the content must be converted to txt format as required by fastText.

In [22]:
# specify the path where the processed txt data will sit
# this is the path that fastText will use to train the model too
ft_filename = '../data/clean/copyright/sg_uk_copyright.txt'

# this txt file will not be pushed to the repo

In [23]:
# convert our dataframe content entries to txt
with open(ft_filename, 'w') as f:
    for entry in data['cleaned'].values:
        f.write(entry)
        f.write('\n\n')

### fastText Params
The following params seemed to produce a decent model for the copyright and trade mark examples. However, more time can be spent tweaking the [hyperparams](https://fasttext.cc/docs/en/python-module.html#api) to get better performance.

In [24]:
wordNgrams = 2
ft_model_type = 'skipgram' # skipgram or cbow

### Train Model

In [25]:
model = ft.train_unsupervised(ft_filename, model=ft_model_type, wordNgrams=wordNgrams)

For the data protection dataset, the following hyperparams appear to perform better than those above.

In [None]:
# model = ft.train_unsupervised(ft_filename, dim=50, 
#                               lr=0.0001, epoch=50, 
#                               minn=6, minCount=3, ws=10,
#                               model='skipgram', wordNgrams=3)

There is an option to use pretrained Wikipedia weights, which can be downloaded [here](https://fasttext.cc/docs/en/english-vectors.html).

However, for the scope of this project, the pretrained weights did not significantly improve performance and require a model file of a few GB to be stored, so they are not used.

In [None]:
# wiki_model_path = ''
# model = ft.load_model(wiki_model_path)

### Get Sentence Vectors

In [None]:
# for the GDPR data, this further cleaning was required for the sentence vectorisation to work
# data['cleaned'] = data['cleaned'].map(lambda x: x.replace('\n', ' '))

In [26]:
sent_vecs = [model.get_sentence_vector(s) for s in data['cleaned'].values]

In [27]:
np.save(ft_vector_file, sent_vecs)

In [28]:
len(sent_vecs)

780
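For intuition, the sketch below shows roughly how a sentence vector can be built from word vectors. To our understanding, for unsupervised models `get_sentence_vector` divides each word vector by its L2 norm and then averages them, and the Python binding rejects text containing a newline, which is why the GDPR data needed the extra cleaning above. The `sentence_vector` helper and the toy lookup table are assumptions that stand in for a trained model.

```python
import math

def sentence_vector(sentence, word_vectors):
    """Average the L2-normalised vectors of the whitespace-separated words in sentence."""
    if "\n" in sentence:
        raise ValueError("expects a single line; replace newlines with spaces first")
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for word in sentence.split():
        vec = word_vectors.get(word)
        if vec is None:
            continue  # a real fastText model would build a vector from subwords instead
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            total = [t + x / norm for t, x in zip(total, vec)]
            count += 1
    return [t / count for t in total] if count else total

# Toy two-dimensional word vectors (assumption), not output from a trained model.
toy_vectors = {"fair": [3.0, 4.0], "dealing": [0.0, 2.0]}
vec = sentence_vector("fair dealing", toy_vectors)
print(vec)
```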

### Additional Notes on fastText

fastText does not require your data to be lower-cased, neither does it do so for you automatically.
Lower casing the data does not necessarily improve performance, and in this project, a lower-cased version was tried and the results did not improve.

To try it out yourself, simply lowercase both the training data and data to be vectorized.

```
with open(ft_filename, 'w') as f:
    for entry in data['cleaned'].values:
        f.write(entry.lower())
        f.write('\n\n')
...

sent_vecs = [model.get_sentence_vector(s.lower()) for s in data['cleaned'].values]
```

## Next Notebooks

In notebook 1b, vectors from a BERT model will be prepared. It is done in a separate notebook to facilitate easy optional training on cloud GPUs.

In notebook 2, the vectors are matched on cosine similarity and the matches are then evaluated against answer keys.