<a href="https://colab.research.google.com/github/louisbrulenaudet/hf-for-legal/blob/main/notebooks/Dataset_formatting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HF for Legal: A Community Package for Legal Applications 🤗

[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)

<img src="https://huggingface.co/spaces/HFforLegal/README/resolve/main/assets/thumbnail.png">

Welcome to the HF for Legal package, a library dedicated to making language models less opaque for legal professionals. Our mission is to give legal practitioners, scholars, and researchers the knowledge and tools they need to work with AI in the legal domain. At HF for Legal, we aim to:
- Demystify AI language models for the legal community
- Share curated resources, including specialized legal models, datasets, and tools
- Foster collaboration on projects that enhance legal research and practice through AI
- Provide a platform for discussing ethical implications and best practices of AI in law
- Offer tutorials and workshops on leveraging AI technologies in legal work

## Installation

To use hf-for-legal, you need to have the following Python packages installed:
- `numpy`
- `datasets`

You can install them, together with `hf-for-legal` itself, via pip:

```bash
pip install numpy datasets hf-for-legal
```
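
To confirm that the installation succeeded, the installed versions can be checked from Python. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_versions` is hypothetical and is not part of hf-for-legal.

```python
from importlib import metadata

# Distribution names taken from the pip command above.
REQUIRED = ("numpy", "datasets", "hf-for-legal")


def installed_versions(names=REQUIRED):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Any entry reported as `not installed` means the pip command needs to be run again in the active environment.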

## Citing & Authors

If you use this code in your research, please cite it using the following BibTeX entry.

```BibTeX
@misc{louisbrulenaudet2024,
  author =       {Louis Brulé Naudet},
  title =        {HF for Legal: A Community Package for Legal Applications},
  year =         {2024},
  howpublished = {\url{https://github.com/louisbrulenaudet/hf-for-legal}},
}
```

## Feedback

If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).

# Configuration

In [1]:
!pip3 install numpy datasets hf-for-legal

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting hf-for-legal
  Downloading hf_for_legal-0.0.13-py3-none-any.whl.metadata (9.7 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Downloading hf_for_legal-0.0.13-py3-none-any.whl (1

In [2]:
from typing import (
    IO,
    TYPE_CHECKING,
    Any,
    Dict,
    List,
    Type,
    Tuple,
    Union,
    Mapping,
    TypeVar,
    Callable,
    Optional,
    Sequence,
)

import datasets

from hf_for_legal import DatasetFormatter

# Dataset formatting

In [4]:
# Build a small in-memory sample dataset
dataset = datasets.Dataset.from_dict(
    {
        "document": [
            "This is a test document.",
            "Another test document.",
        ]
    }
)

# Create an instance of DatasetFormatter
formatter = DatasetFormatter(dataset)

# Calling the formatter adds the `hash` and `uuid` columns
formatted_dataset = formatter()
formatted_dataset

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

2024-07-26 11:52:21,778 - INFO - hash took 0.0357 seconds to execute.
INFO:hf_for_legal._logger:hash took 0.0357 seconds to execute.
2024-07-26 11:52:21,785 - INFO - Memory Usage Report for 'hash':
INFO:hf_for_legal._logger:Memory Usage Report for 'hash':
2024-07-26 11:52:21,789 - INFO -   Memory Used: 0.72 MB
INFO:hf_for_legal._logger:  Memory Used: 0.72 MB


Map:   0%|          | 0/2 [00:00<?, ? examples/s]

2024-07-26 11:52:21,840 - INFO - uuid took 0.0481 seconds to execute.
INFO:hf_for_legal._logger:uuid took 0.0481 seconds to execute.
2024-07-26 11:52:21,846 - INFO - Memory Usage Report for 'uuid':
INFO:hf_for_legal._logger:Memory Usage Report for 'uuid':
2024-07-26 11:52:21,851 - INFO -   Memory Used: 0.00 MB
INFO:hf_for_legal._logger:  Memory Used: 0.00 MB
2024-07-26 11:52:21,854 - INFO - __call__ took 0.1117 seconds to execute.
INFO:hf_for_legal._logger:__call__ took 0.1117 seconds to execute.
2024-07-26 11:52:21,857 - INFO - Memory Usage Report for '__call__':
INFO:hf_for_legal._logger:Memory Usage Report for '__call__':
2024-07-26 11:52:21,859 - INFO -   Memory Used: 0.72 MB
INFO:hf_for_legal._logger:  Memory Used: 0.72 MB


Dataset({
    features: ['document', 'hash', 'uuid'],
    num_rows: 2
})

In [5]:
formatted_dataset[0]

{'document': 'This is a test document.',
 'hash': 'ec7f4feddd1c1349ad5a4d9c913a5c4e21a226e6719cd1b2805225df7075d22f',
 'uuid': '8a7b2834-4ebc-420f-b76b-156f552154f6'}
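
To make the two added columns more concrete, here is a standard-library sketch of how such fields could be derived for each row. It is an illustration under stated assumptions, not the package's implementation: the 64-character hex digest above is consistent with SHA-256 and the identifier has the shape of a random version-4 UUID, but the actual hash function, the text it is computed over, and the helper name `add_hash_and_uuid` are all assumptions.

```python
import hashlib
import uuid


def add_hash_and_uuid(rows, text_key="document"):
    """Return copies of `rows` with 'hash' and 'uuid' fields added.

    Assumption: SHA-256 over the UTF-8 encoded text and a random UUID4;
    the choices made inside hf-for-legal may differ.
    """
    formatted = []
    for row in rows:
        digest = hashlib.sha256(row[text_key].encode("utf-8")).hexdigest()
        # Build a new dict so the input rows are left unchanged.
        formatted.append({**row, "hash": digest, "uuid": str(uuid.uuid4())})
    return formatted


rows = [
    {"document": "This is a test document."},
    {"document": "Another test document."},
]
formatted_rows = add_hash_and_uuid(rows)
print(sorted(formatted_rows[0]))  # ['document', 'hash', 'uuid']
```

Under these assumptions the hash is deterministic for a given text, which makes it useful for spotting duplicates, while the UUID differs on every run and serves only as a row identifier.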