Legal Document Similarity

Implementation, trained models, and result data for the ICAIL 2021 paper Evaluating Document Representations for Content-based Legal Literature Recommendations (PDF on arXiv). The supplemental material is available for download under GitHub Releases.

The qualitative analysis is available as a PDF in /appendix.


Requirements

  • Python 3.7
  • CUDA GPU (for Transformers)
  • Case Law Access Project API key (only for dataset construction)
  • JHU-Legal-BERT must be downloaded from here.


Create a new virtual environment for Python 3.7 with Conda:

conda create -n paper python=3.7
conda activate paper

Clone repository and install dependencies:

git clone repo
cd repo
pip install -r requirements.txt


Overall scores for top k=5 recommendations from Open Case Book and Wikisource (Table 2 in the paper):

Overall results

Jaccard index measuring the similarity (or diversity) of two recommendation sets (averaged over all seeds from the two datasets):

Overlap of results
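
The Jaccard index above can be computed as in this small sketch; the case IDs and the two recommendation lists are made up for illustration:

```python
def jaccard(a, b):
    """Jaccard index of two recommendation sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two hypothetical top-5 recommendation lists for the same seed case
recs_method_1 = ["case_1", "case_2", "case_3", "case_4", "case_5"]
recs_method_2 = ["case_3", "case_4", "case_5", "case_6", "case_7"]

print(jaccard(recs_method_1, recs_method_2))  # 3 shared / 7 unique ≈ 0.43
```

A value of 1.0 means two methods recommend exactly the same cases; values near 0 indicate diverse (complementary) recommendations.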


To reproduce our experiments, follow these steps:

Download datasets

We construct two silver standards from Open Case Book and WikiSource. The underlying full-text and citation data is taken from the Case Law Access Project and CourtListener. Scripts for data preprocessing are in ./datasets.

mkdir -p ./data/ocb ./data/wikisource

# Open Case Book
tar -xvzf ocb.tar.gz -C ./data/ocb

# WikiSource
tar -xvzf wikisource.tar.gz -C ./data/wikisource

Prepare word vectors

With the following commands, fastText and GloVe vectors can be trained or downloaded.

# Extract plain-text corpora
python extract_text.py --data_dir=./data

# fastText vectors
python train_fasttext.py --data_dir=./data

# GloVe vectors
sh ./sbin/

# Download pretrained word vectors
wget -O ./data/
wget -O ./data/
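
Word vectors like these are later aggregated into per-document vectors. As an illustration only (not the repository's exact pipeline), a common baseline averages the vectors of all in-vocabulary tokens; the toy vectors, dimensions, and tokens below are made up:

```python
# Toy word vectors; in the real pipeline these come from the trained or
# downloaded fastText/GloVe files (typically 300-dimensional).
word_vectors = {
    "court":    [0.2, 0.1, 0.0],
    "ruled":    [0.0, 0.3, 0.1],
    "contract": [0.4, 0.0, 0.2],
}

def doc_vector(tokens, vectors):
    """Average the vectors of in-vocabulary tokens (a common baseline)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

print(doc_vector(["the", "court", "ruled"], word_vectors))
```

Out-of-vocabulary tokens ("the" above) are simply skipped, so the result is the mean of the "court" and "ruled" vectors.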

Generate Document Representations

The following commands create or download vectors for all documents in the two datasets.

# Generate (using GPU 0)
python compute_doc_vecs.py wikisource --override=1 --gpu 0 --data_dir=./data
python compute_doc_vecs.py ocb --override=1 --gpu 0 --data_dir=./data

# Download pretrained document vectors
tar -xvzf models.tar.gz
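
Given such document vectors, content-based recommendations for a seed case are typically its k nearest neighbors under cosine similarity. A minimal sketch, with made-up document IDs and 2-dimensional vectors for readability:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(seed_id, doc_vecs, k=5):
    """Rank all other documents by cosine similarity to the seed document."""
    seed = doc_vecs[seed_id]
    scored = [(other, cosine(seed, vec))
              for other, vec in doc_vecs.items() if other != seed_id]
    scored.sort(key=lambda pair: -pair[1])
    return [doc_id for doc_id, _ in scored[:k]]

doc_vecs = {
    "seed": [1.0, 0.0],
    "near": [0.9, 0.1],
    "mid":  [0.5, 0.5],
    "far":  [0.0, 1.0],
}
print(top_k("seed", doc_vecs, k=2))  # -> ['near', 'mid']
```

The brute-force loop above is fine for small corpora; at scale an approximate nearest-neighbor index would replace it.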


After generating the document representations for Open Case Book and WikiSource, the results can be computed and viewed with a Jupyter notebook. Figures and tables from the paper are part of the notebook.

jupyter notebook evaluation.ipynb

Due to space constraints, some results could not be included in the paper. The full results for all methods are available as a CSV file (or via the Jupyter notebook).
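
As a reading aid for the result tables, a top-k recommendation list can be scored against a silver-standard relevance set with, for example, precision@k. The IDs below are hypothetical, and this is only one of the standard retrieval metrics, not necessarily the exact set computed in the notebook:

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended documents found in the relevant set."""
    top = recommended[:k]
    return sum(1 for doc in top if doc in relevant) / k

# Hypothetical ranked recommendations and silver-standard relevant cases
recommended = ["c1", "c2", "c3", "c4", "c5", "c6"]
relevant = {"c2", "c4", "c9"}

print(precision_at_k(recommended, relevant, k=5))  # 2 hits in top 5 -> 0.4
```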

How to cite

If you are using our code, please cite our paper:

@inproceedings{ostendorff2021evaluating,
  title     = {{Evaluating Document Representations for Content-based Legal
                Literature Recommendations}},
  author    = {Malte Ostendorff and Elliott Ash and Terry Ruas and Bela Gipp
               and Julian Moreno-Schneider and Georg Rehm},
  booktitle = {The 18th International Conference on Artificial Intelligence
               and Law (ICAIL 2021)},
  year      = {2021},
  month     = {6},
  address   = {Sao Paulo, Brazil},
  note      = {Accepted for publication},
  keywords  = {aip},
}



