harvard-lil/cold-french-law-pipeline

COLD French Law Dataset Pipeline

CLI pipeline for generating and uploading the COLD French Law dataset.

The COLD French Law dataset is a collection of currently applicable law articles filtered from France's LEGI dataset.

English translations are available for ~800K articles. These translations were generated by OpenAI's GPT-4 and provided by Casetext, Part of Thomson Reuters.

⚠️ This process is transformative. Although the data is sourced from France's LEGI dataset:

  • The accuracy of the data going in and out of this pipeline cannot be guaranteed.
  • This pipeline and the resulting dataset are unofficial and experimental.

Usage

This pipeline requires Python 3.11+ and Python Poetry.

Pulling data from and pushing data to HuggingFace may require the HuggingFace CLI and valid authentication.
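The prerequisites above can be checked before installing anything. The following is a minimal sketch, not part of this repository: `check_environment` is a hypothetical helper, and it assumes the Poetry executable is named `poetry` and is discoverable on `PATH`. It does not verify HuggingFace authentication.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in this README


def check_environment(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means the basics are in place.

    Hypothetical helper for illustration only. `version_info` and `which`
    are injectable so the check can be exercised without touching the system.
    """
    version = tuple((version_info or sys.version_info)[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version[0]}.{version[1]}"
        )
    # Assumption: the Poetry executable is called "poetry" and is on PATH.
    if which("poetry") is None:
        problems.append("Poetry executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running the script prints one line per missing prerequisite and nothing when both checks pass.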

1. Clone this repository

git clone https://github.com/harvard-lil/cold-french-law-pipeline.git

2. Install dependencies

poetry install

3. Run the "build" script

This generates a CSV file under data/cold_csv.

# See: build.py --help for a list of available options
poetry run python build.py
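Once the build finishes, a quick sanity check of the output can be useful before uploading. This README only states that a CSV is written under `data/cold_csv`. It does not give the file name or the column names, so the sketch below makes no assumption about either: it scans the directory for any `.csv` file and reports the header row and the number of data rows. `summarize_csv_dir` is a hypothetical helper, and UTF-8 encoding is an assumption.

```python
import csv
import sys
from pathlib import Path


def summarize_csv_dir(directory="data/cold_csv"):
    """Return {filename: (header, row_count)} for every .csv file in `directory`.

    Hypothetical helper for illustration only; not part of the pipeline.
    """
    # Article texts can be long, so raise the csv module's per-field size limit.
    csv.field_size_limit(min(sys.maxsize, 2**31 - 1))
    summary = {}
    for path in sorted(Path(directory).glob("*.csv")):
        # Assumption: the file is UTF-8 encoded.
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])  # first row is treated as the header
            row_count = sum(1 for _ in reader)  # remaining rows are data
        summary[path.name] = (header, row_count)
    return summary


if __name__ == "__main__":
    for name, (header, rows) in summarize_csv_dir().items():
        print(f"{name}: {rows} rows, columns: {header}")
```

Rows are streamed rather than loaded at once, so the check stays cheap even for a large file.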

4. Upload to HuggingFace (optional)

This attempts to upload the resulting CSV file to harvard-lil/cold-french-law on HuggingFace.

poetry run python upload.py
