This repository contains the code used to generate legal instruction datasets. To add a new dataset:
- If there is no public Hugging Face repo for the dataset, take the raw data and upload it to the Hugging Face Hub (in a private repo if the data is not permissively licensed; see the upload sketch after this list)
- Add a class to the folder `instruction_datasets` that inherits from `AbstractDataset` and implements the abstract method `get_data` (see the class sketch after this list). The `get_data` method should yield datapoints with the following fields:
  - "instruction_language": the language of the instruction
  - "prompt_language": the language of the prompt
  - "answer_language": the language of the answer
  - "instruction": the instruction telling the model what to do
  - "prompt": the prompt input to the model
  - "answer": the answer providing the solution
  - "task_type": the type of task (e.g. "summarization")
  - "jurisdiction": the jurisdiction of the example (e.g. "US")
  - "subset": the subset of the dataset (e.g. "swiss_judgment_prediction" for "lextreme")
- Write one to ten seed instructions for the new class to the `en.json` file
- Add the dataset to the list in `lawinstruct_datasets.py`
- To generate the dataset, run `python build_instruction_datasets.py --datasets ExampleDatasetName1 ExampleDatasetName2 --build_from_scratch`
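A minimal sketch of the raw-data upload mentioned in the first step, using `huggingface_hub` (the repo id and folder path are hypothetical examples):

```python
from huggingface_hub import HfApi

api = HfApi()
# Create a private dataset repo if the data is not permissively licensed.
api.create_repo(
    repo_id="lawinstruct/example_raw_dataset",  # hypothetical repo id
    repo_type="dataset",
    private=True,
    exist_ok=True,
)
# Upload the raw files from a local folder (example path).
api.upload_folder(
    folder_path="raw_data/example_raw_dataset",
    repo_id="lawinstruct/example_raw_dataset",
    repo_type="dataset",
)
```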
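Similarly, a minimal sketch of a dataset class (the constructor arguments, the `get_data` signature, and the `_load_raw_examples` helper are assumptions for illustration; mirror an existing class in `instruction_datasets` for the exact interface):

```python
from abstract_dataset import AbstractDataset  # import path may differ


class ExampleDatasetName1(AbstractDataset):
    """Hypothetical dataset illustrating the required datapoint fields."""

    def __init__(self):
        # Assumed constructor: a name and a source URL, as in other classes.
        super().__init__(
            "ExampleDatasetName1",
            "https://huggingface.co/datasets/lawinstruct/lawinstruct_raw",
        )

    def get_data(self):
        # Yield one dict per example; keep examples of the same subset
        # together (see the note on subsets near the end of this README).
        for example in self._load_raw_examples():  # hypothetical helper
            yield {
                "instruction_language": "en",
                "prompt_language": "en",
                "answer_language": "en",
                "instruction": "Summarize the following court decision.",
                "prompt": example["text"],
                "answer": example["summary"],
                "task_type": "summarization",
                "jurisdiction": "US",
                "subset": "example_subset",
            }
```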
Install the requirements from `requirements.txt`. Make sure to have Python 3.10 or higher.
Make sure you have git and git-lfs installed.
On the ubelix Slurm system, load the module with `module load git-lfs/2.4.2`.
Run `git lfs install` to set up git-lfs.
Clone the `lawinstruct_raw` repository locally:
`git clone https://huggingface.co/datasets/lawinstruct/lawinstruct_raw`
Clone the natural instructions data there too:
`git clone https://github.com/allenai/natural-instructions lawinstruct_raw/raw_data/ni_instructions_data`
The `en.json` file was created by writing one to five seed instructions per task. Using GPT-4, we generated paraphrases for each task. We used the following prompt: "Below is a list of instructions for a large language model. Expand this json to 10 paraphrases. Provide json as output. Keep the provided examples."
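For illustration, a sketch of the layout we assume for `en.json` (each task name maps to a list of instruction paraphrases; check the actual file for the exact schema):

```python
# Assumed layout of en.json: task name -> list of instruction paraphrases.
seed_instructions = {
    "ExampleDatasetName1": [
        "Summarize the following court decision.",
        "Provide a concise summary of this judgment.",
        "Condense the decision below into a short summary.",
    ],
}
```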
- improve the Hugging Face dataset loading script: enable dynamic loading of instructions with differing numbers of paraphrases and languages
- add retrieval datasets (see here for how to structure prompts: https://crfm-helm.readthedocs.io/en/latest/scenarios/#helm.benchmark.scenarios.msmarco_scenario) ==> average prompt is very long, so we could probably only use a small part of the data
- frame CaseHOLD as a generation task: let the model generate the correct holding statement
- put local data on the Hugging Face Hub (find them by checking which datasets use the `raw_data` folder)
- translate some answers into the 24 EU languages ==> save instructions and answers into different columns
- do not use xP3 and natural instructions but only code and legal instructions because of figure 4: https://arxiv.org/pdf/2210.11416v5.pdf
- add CoT data (https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json) ==> this is only for MMLU (which we leave out)
- Long-form Legal Question Answering (https://huggingface.co/datasets/maastrichtlawtech/lleqa)
  - ask for access
  - add the question answering task, similar to other QA tasks
  - try to think of other tasks that could be added, like asking for regions or topics
- Legal Lens (https://huggingface.co/collections/joelniklaus/legallens-datasets-6601a17a92e663923265c845)
  - add the NER task similar to other NER tasks and the NLI task similar to other NLI tasks
- Swiss Citation Extraction (https://huggingface.co/datasets/rcds/swiss_citation_extraction/viewer/original/test?row=0)
  - combine the words again into a standard string, using the indices to edit it (see the sketch after this TODO list)
  - two subsets:
    - citation extraction:
      - input: text with citations
      - output: list of citations (as a markdown list: `- ` before each item, separated by newlines)
    - citation prediction:
      - input: text with the citations masked with placeholders
      - output: each placeholder followed by the citation it masks, one per line
- MultiLegalNeg (https://huggingface.co/datasets/rcds/MultiLegalNeg)
  - combine the words similar to Swiss Citation Extraction
  - more information here: https://arxiv.org/pdf/2309.08695
  - two subsets:
    - negation cue detection: the goal is for the language model to repeat the negation cue word
    - negation scope detection: the goal is for the language model to repeat the negation scope given the negation cue
- Keyphrase Generation (https://huggingface.co/datasets/NCube/europa)
- CLERC (https://huggingface.co/datasets/jhu-clsp/CLERC)
- GerDALIR (https://github.com/lavis-nlp/GerDaLIR)
- Covid Law Matching (https://github.com/DFKI-NLP/covid19-law-matching)
- BSARD (https://github.com/maastrichtlawtech/bsard)
- SwissIR (https://huggingface.co/datasets/rcds/doc2doc)
- Dutch Legal Summarization (https://github.com/prijsdf/dutch-legal-summarization) ==> Requires multiple requests per document to retrieve; documentation in Dutch; no actual summarization targets.
- LegalSum (https://github.com/sebimo/LegalSum) ==> complicated to read because of the norms and would require extensive preprocessing; additionally, it contains very long sequences
- Indian/Australian Summarization (https://github.com/manavkapadnis/LegalEvaluation_LREC2022) ==> too long, and for the Australian data the annotation was done automatically
- Cookie Policy Summarization (https://github.com/senjed/Summarization-of-Privacy-Policies , http://ceur-ws.org/Vol-2645/paper3.pdf) ==> automatic annotation, no summarization data available
- BVA Summarization (https://github.com/luimagroup/bva-summarization, https://dl.acm.org/doi/10.1145/3322640.3326728) ==> repo is poorly documented; it is not clear which dataset to use
- LegalCaseReports Summ (https://archive.ics.uci.edu/ml/machine-learning-databases/00239 , https://aclanthology.org/W12-0515.pdf) ==> no redistribution allowed, thus uploaded to `raw_data` (summaries not clear)
- BVACItationPrediction (https://github.com/TUMLegalTech/bva-citation-prediction) ==> no dataset downloadable directly
- Cornell eRulemaking Corpus (https://facultystaff.richmond.edu/~jpark/data/jpark_lrec18.zip , https://facultystaff.richmond.edu/~jpark/papers/jpark_lrec18.pdf) ==> the full text of the comments is not available
- US Caselaw Segmentation (https://github.com/jsavelka/us-dec-func-iss-sgm/blob/master/trade_secret_cases.json , http://ebooks.iospress.nl/volumearticle/50840) ==> sentence boundary detection is probably not the most useful task
- MultiLegalSBD (https://huggingface.co/datasets/rcds/MultiLegalSBD) ==> sentence boundary detection is probably not the most useful task
- Contract extraction dataset (http://nlp.cs.aueb.gr/software_and_datasets/CONTRACTS_ICAIL2017/index.html , http://nlp.cs.aueb.gr/pubs/icail2017.pdf) ==> looks like a complicated dataset requiring preprocessing
- CASS (https://github.com/euranova/CASS-dataset) ==> Couldn't download; `wget` failed.
- LegalLinking (https://github.com/mayhewsw/legal-linking) ==> Could not recreate the necessary Python environment.
- Privacy Policies (https://usableprivacy.org/data) (excluding OPP-115 Corpus: already present in natural instructions)
- MakeThisYourLastTime (https://www.makethisyourlasttime.com/essay-bank/) ==> Requires scraping several PDFs; format not standardized.
- ECHR Argument Mining (http://www.di.uevora.pt/~pq/echr/) ==> This is an argument mining dataset.
- AsyLex (https://huggingface.co/datasets/clairebarale/AsyLex) ==> complicated structure requiring preprocessing; additionally, the outcome prediction task seems quite trivial as the outcome is often clearly stated at the end of the text.
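For the Swiss Citation Extraction item above, a sketch of the word-recombination step (the `words` list and the `(start, end)` citation span indices are assumptions about the dataset's layout):

```python
def rebuild_text_and_citations(words, citation_spans):
    """Join the dataset's word list back into a string and extract citations.

    `citation_spans` is assumed to be a list of (start, end) word-index
    pairs marking each citation in `words`.
    """
    text = " ".join(words)  # naive join; real detokenization may need edits
    citations = [" ".join(words[start:end]) for start, end in citation_spans]
    # Format the extraction target as a markdown list, one citation per line.
    answer = "\n".join(f"- {citation}" for citation in citations)
    return text, answer
```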
Make sure to yield all examples of the same subset consecutively in the `get_data()` method. Otherwise, the writer will put only one example into each file before closing it again.
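One way to guarantee this ordering (a sketch; `self.subsets`, `self._load_subset`, and `self._build_datapoint` are hypothetical stand-ins for however the class organizes its raw data):

```python
from abstract_dataset import AbstractDataset  # import path may differ


class SomeDataset(AbstractDataset):  # hypothetical subclass
    def get_data(self):
        # Iterate subset by subset so that all examples of one subset are
        # yielded consecutively and each output file is written in one pass.
        for subset in self.subsets:  # hypothetical attribute
            for example in self._load_subset(subset):  # hypothetical loader
                yield self._build_datapoint(example, subset=subset)
```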
Please cite the following preprint:
@misc{niklaus2024flawnt5,
title={FLawN-T5: An Empirical Examination of Effective Instruction-Tuning Data Mixtures for Legal Reasoning},
author={Joel Niklaus and Lucia Zheng and Arya D. McCarthy and Christopher Hahn and Brian M. Rosen and Peter Henderson and Daniel E. Ho and Garrett Honke and Percy Liang and Christopher Manning},
year={2024},
eprint={2404.02127},
archivePrefix={arXiv},
primaryClass={cs.CL}
}