Implementation, trained models, and result data for the paper Evaluating Document Representations for Content-based Legal Literature Recommendations (PDF on arXiv). The supplemental material is available for download under GitHub Releases.
The qualitative analysis is available as a PDF in /appendix.
- Python 3.7
- CUDA GPU (for Transformers)
- Case Law Access Project API key (only for dataset construction)
- JHU-Legal-BERT must be downloaded from here.
Create a new virtual environment for Python 3.7 with Conda:
conda create -n paper python=3.7
conda activate paper
Clone the repository and install dependencies:
git clone https://github.com/malteos/legal-document-similarity.git repo
cd repo
pip install -r requirements.txt
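After installing the dependencies, you can quickly verify the prerequisites listed above. A minimal sketch, assuming PyTorch is pulled in via requirements.txt (the check itself is not part of the repository):

import sys
import torch  # assumed to be installed via requirements.txt

print(sys.version)                                    # should report Python 3.7.x
print("CUDA available:", torch.cuda.is_available())   # True if a CUDA GPU is usable (needed for Transformers)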
Jaccard index for similarity or diversity of two recommendation sets (averaged over all seeds from the two datasets).
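For reference, the Jaccard index of two recommendation sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch of the metric (the function name and example sets are illustrative, not taken from the repository code):

def jaccard_index(a, b):
    """Jaccard index of two recommendation sets: overlap divided by union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are considered identical
    return len(a & b) / len(a | b)

# Example: recommendations produced by two different document representations
recs_fasttext = {"case_1", "case_2", "case_3"}
recs_bert = {"case_2", "case_3", "case_4"}
print(jaccard_index(recs_fasttext, recs_bert))  # 0.5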
To reproduce our experiments, follow these steps:
We construct two silver standards from Open Case Book and WikiSource. The underlying full-text and citation data is taken from the Case Law Access Project and CourtListener.
Scripts for data preprocessing are in ./datasets.
mkdir -p ./data/ocb ./data/wikisource
# Open Case Book
wget https://github.com/malteos/legal-document-similarity/releases/download/1.0/ocb.tar.gz
tar -xvzf ocb.tar.gz -C ./data/ocb
# WikiSource
wget https://github.com/malteos/legal-document-similarity/releases/download/1.0/wikisource.tar.gz
tar -xvzf wikisource.tar.gz -C ./data/wikisource
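If you want to rebuild the corpora from the raw sources instead of downloading the archives, the full texts come from the Case Law Access Project API; this is where the API key from the requirements is needed. A minimal sketch of fetching a single case via the public CAP v1 REST endpoint (the exact queries used by the preprocessing scripts in ./datasets may differ):

import requests

CAP_API_KEY = "YOUR_API_KEY"  # Case Law Access Project API key (see requirements)

# Fetch one case including its full text from the CAP v1 API
response = requests.get(
    "https://api.case.law/v1/cases/",
    params={"search": "copyright", "full_case": "true", "page_size": 1},
    headers={"Authorization": f"Token {CAP_API_KEY}"},
)
response.raise_for_status()
print(response.json()["results"][0]["name"])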
With the following commands, the fastText and GloVe word vectors can either be trained on the extracted corpora or downloaded as pretrained files.
# Extract plain-text corpora
python cli.py extract_text --data_dir=./data
# fastText vectors
python cli.py train_fasttext --data_dir=./data
# GloVe vectors
sh ./sbin/compute_glove.sh
# Download pretrained word vectors
wget -P ./data/ https://github.com/malteos/legal-document-similarity/releases/download/1.0/ocb_and_wikisource.fasttext.w2v.txt.gz
wget -P ./data/ https://github.com/malteos/legal-document-similarity/releases/download/1.0/ocb_and_wikisource.glove.w2v.txt.gz
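The downloaded files are in the plain word2vec text format, so they can be inspected with gensim, for example (assuming gensim is available from requirements.txt; gzip files are read transparently):

from gensim.models import KeyedVectors

# Load the pretrained fastText word vectors (word2vec text format)
wv = KeyedVectors.load_word2vec_format(
    "./data/ocb_and_wikisource.fasttext.w2v.txt.gz", binary=False
)
# Pick another query token if "court" is not in the vocabulary
print(wv.most_similar("court", topn=5))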
The following commands create or download vectors for all documents in the two datasets.
# Generate (using GPU 0)
python cli.py compute_doc_vecs wikisource --override=1 --gpu 0 --data_dir=./data
python cli.py compute_doc_vecs ocb --override=1 --gpu 0 --data_dir=./data
# Download pretrained document vectors
wget https://github.com/malteos/legal-document-similarity/releases/download/1.0/models.tar.gz
tar -xvzf models.tar.gz
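How the generated vectors are stored depends on the representation method (see compute_doc_vecs in cli.py). Independently of the storage format, recommendations are obtained by nearest-neighbour search over the document vectors; a minimal sketch with cosine similarity, where the matrix and the document ids are placeholders rather than repository code:

import numpy as np

doc_vectors = np.random.rand(100, 300)        # placeholder for the real (n_docs, dim) matrix
doc_ids = [f"doc_{i}" for i in range(100)]    # placeholder for the real document ids

def top_k_similar(query_idx, k=5):
    """Return the k documents most similar to doc_ids[query_idx] by cosine similarity."""
    vecs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    sims = vecs @ vecs[query_idx]
    ranked = np.argsort(-sims)
    return [(doc_ids[i], float(sims[i])) for i in ranked if i != query_idx][:k]

print(top_k_similar(0))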
After generating the document representations for Open Case Book and WikiSource, the results can be computed and viewed with a Jupyter notebook. Figures and tables from the paper are part of the notebook.
jupyter notebook evaluation.ipynb
Due to space constraints, some results could not be included in the paper. The full results for all methods are available as a CSV file (or via the Jupyter notebook).
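For a quick look at the exported results without running the notebook, the CSV can be loaded with pandas (the file name below is a placeholder; use the actual results CSV from the release assets):

import pandas as pd

results = pd.read_csv("results.csv")  # placeholder name; point this at the released results CSV
print(results.head())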
If you use our code, please cite our paper:
@InProceedings{Ostendorff2021,
  title     = {{Evaluating Document Representations for Content-based Legal Literature Recommendations}},
  author    = {Malte Ostendorff and Elliott Ash and Terry Ruas and Bela Gipp and Julian Moreno-Schneider and Georg Rehm},
  booktitle = {The 18th International Conference on Artificial Intelligence and Law (ICAIL 2021)},
  year      = {2021},
  month     = {6},
  address   = {Sao Paulo, Brazil},
  note      = {Accepted for publication},
  keywords  = {aip},
}
MIT