---
{{ card_data }}
---

In [None]:
import datasets
import transformers
import warnings
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import pandas as pd
import polars as pl
import seaborn as sns
import yaml
from datasets import load_dataset
from transformers import AutoTokenizer
from IPython.display import display


warnings.filterwarnings('ignore')
sns.set_theme("notebook")
transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()
datasets.utils.disable_progress_bars()

In [None]:
ds = load_dataset("JuDDGES/en-court-instruct") 

# Dataset Card for [JuDDGES/en-court-instruct](https://huggingface.co/datasets/JuDDGES/en-court-instruct)

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
- [Statistics](#statistics)

## Dataset Description

* **Homepage: TBA**
* **Repository: https://github.com/pwr-ai/JuDDGES**
* **Paper:  TBA**
* **Point of Contact: lukasz.augustyniak@pwr.edu.pl; jakub.binkowski@pwr.edu.pl; albert.sawczyn@pwr.edu.pl**

### Dataset Summary

The data was acquired from publicly available judgments of the Court of Appeal (Criminal Division) of England and Wales ([link](https://caselaw.nationalarchives.gov.uk/judgments/advanced_search?court=ewca/crim)). These judgments are available in HTML format on the National Archives website for online reading, and can be downloaded as XML or PDF files under the Crown copyright licence ([link](https://www.judiciary.uk/copyright/)) and the Open Government Licence (see Appendix 6 in the paper). These licences encourage free and flexible use and re-use of the information made available under them, with only a few conditions ([link](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)).

This dataset is designed for fine-tuning large language models (LLMs) on information extraction tasks and is formatted as instructions. For the raw dataset, see [`JuDDGES/en-court-raw`](https://huggingface.co/datasets/JuDDGES/en-court-raw).
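
As a rough illustration of the instruction format, the sketch below turns one record into a model input and a target, assuming the `prompt`, `context`, and `output` fields listed under Data Fields. The function name `build_training_pair` and the toy record are hypothetical and are not taken from the dataset or the repository.

```python
def build_training_pair(record: dict) -> tuple:
    """Return (model_input, target) for one instruction record."""
    # Fill the `{context}` placeholder with the judgement text.
    # `str.replace` is used instead of `str.format` so that any other
    # braces in the prompt or the judgement are left untouched.
    model_input = record["prompt"].replace("{context}", record["context"])
    return model_input, record["output"]


# Toy record for illustration only (not a real dataset entry).
toy_record = {
    "prompt": "Extract the citation as YAML.\n=====\n{context}\n======",
    "context": "Neutral Citation Number: [2020] EWCA Crim 1",
    "output": "citation: '[2020] EWCA Crim 1'",
}

model_input, target = build_training_pair(toy_record)
print(model_input)
print(target)
```

With the real dataset, the same function could be applied to each element of `ds["train"]`.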

### Supported Tasks and Leaderboards

* `information-extraction`: The dataset can be used for information extraction tasks.
* `text-generation`: The dataset can be used for text generation tasks, as the dataset is formatted as instructions.

### Languages

English (`en-GB`)

## Dataset Structure

### Data Instances

<details>
<summary> Click to expand </summary>

```json

In [None]:
display(ds["train"][0])

```

 
</details>

### Data Fields


| Feature name     | Feature description                                                                                                                       | Type       |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------|------------|
| prompt           | The prompt template provided for extracting information from the judgement. It contains a `{context}` placeholder for the judgement content. | `string`   |
| context          | The full text content of the judgement                                                                                                    | `string`   |
| output           | The extracted information in YAML format based on the provided context                                                                    | `string`   |


### Data Splits

In [None]:
data = []
for split in ds.keys():
    data.append({"split": split, "# samples": len(ds[split])})

df = pd.DataFrame(data)
df["% samples"] = (df["# samples"] / df["# samples"].sum() * 100).round(2)
# print(df.to_markdown(index=False))

| split   |   # samples |   % samples |
|:--------|------------:|------------:|
| train   |        3365 |       62.72 |
| test    |        2000 |       37.28 |



## Dataset Creation

For details on the dataset creation, see the paper [TBA]() and the code repository [here](https://github.com/pwr-ai/JuDDGES).

### Curation Rationale

Created to enable cross-jurisdictional legal analytics.

### Source Data

#### Initial Data Collection and Normalization

1. We start from the raw dataset [`JuDDGES/en-court-raw`](https://huggingface.co/datasets/JuDDGES/en-court-raw).
1. We identified the metadata fields whose values also appear in the text of the judgement. The following fields were therefore selected as extraction targets:
    * `citation`
    * `date`
    * `judges`
1. **Data filtering**: To ensure high dataset quality, we applied the preprocessing procedure described below.
    1. Removal of bad judgements: judgements whose text is too short, which we found to correspond to missing content, are removed.
    1. Date extraction: the *date* is extracted from the judgement content with a regular expression.
    1. Removal of examples whose attributes are not substrings of the judgement text: due to inherent errors in the acquired data, some attribute values may be mistyped, so we filter those examples out.
    (Data cleaning removes 685 examples, and the final instruction dataset consists of 5365 examples.)
1. **Generating instructions**: After cleaning, we generate instructions for information extraction. Specifically, we define the same prompt for each document, as follows:

    ```text
    You are extracting information from the court judgments.
    Extract specified values strictly from the provided judgement. If information is not provided in the judgement, leave the field with null value.
    Please return the response in the identical YAML format:
    '''yaml
    citation: <string containing the neutral citation number>
    date: <date in format YYYY-MM-DD>
    judges: <list of judge full names>
    '''
    =====
    {context}
    ======
    ```
    where `{context}` is replaced by the text of each judgement.
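
The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation (see the code repository for that): the length threshold `MIN_CHARS`, the pattern `DATE_RE`, and the helper names `extract_date` and `keep_example` are all assumptions made for this example, as is the choice to drop judgements in which no date is found.

```python
import re
from datetime import date
from typing import List, Optional

# Hypothetical threshold; the actual minimum length is not stated in this card.
MIN_CHARS = 50

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
# Hypothetical pattern for dates written like "5 March 2020".
DATE_RE = re.compile(r"\b(\d{1,2})\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b")


def extract_date(text: str) -> Optional[str]:
    """Return the first date found in the text as YYYY-MM-DD, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    day, month, year = match.groups()
    try:
        return date(int(year), MONTHS.index(month) + 1, int(day)).isoformat()
    except ValueError:  # impossible calendar date, e.g. "31 February 2020"
        return None


def keep_example(text: str, citation: str, judges: List[str]) -> bool:
    """Apply the three filtering steps to a single judgement."""
    if len(text) < MIN_CHARS:  # step 1: too short, likely missing content
        return False
    if extract_date(text) is None:  # step 2: no date found (assumed to drop)
        return False
    # Step 3: every attribute value must appear verbatim in the text.
    return citation in text and all(judge in text for judge in judges)


# Toy judgement text for illustration only.
text = (
    "Neutral Citation Number: [2020] EWCA Crim 123. "
    "Date: 5 March 2020. Before: LORD JUSTICE EXAMPLE."
)
print(extract_date(text))
print(keep_example(text, "[2020] EWCA Crim 123", ["LORD JUSTICE EXAMPLE"]))
# A mistyped citation is not a substring of the text, so the example is dropped.
print(keep_example(text, "[2020] EWCA Crim 132", ["LORD JUSTICE EXAMPLE"]))
```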


#### Who are the source language producers?

Produced by human legal professionals (judges, court clerks). Demographics were not analysed. Sourced from public court databases.

### Annotations

#### Annotation process

We performed no annotation. All features were provided with the source data.

#### Who are the annotators?

As above.

### Personal and Sensitive Information

The data comply with the GDPR and the Data Protection Act 2018 (DPA). (See Section 3.4 in the paper for details.)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

We license the actual packaging of these data under the [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).

### Citation Information

[More Information Needed]

## Statistics

In [None]:
data = yaml.safe_load(ds["train"]["output"][0].replace("```yaml", "").replace("```", ""))
data["date"] = pd.to_datetime(data["date"])

In [None]:
def parse_output(output: str) -> dict:
    data = yaml.safe_load(output.replace("```yaml", "").replace("```", ""))
    data["date"] = pd.to_datetime(data["date"])
    return data

ds = ds.map(parse_output, input_columns="output", num_proc=20)

In [None]:
en_ds = pl.concat([pl.from_arrow(ds["train"].data.table), pl.from_arrow(ds["test"].data.table)])
en_ds = en_ds.with_columns(pl.Series(name="subset", values=["train"] * len(ds["train"]) + ["test"] * len(ds["test"]))) 

In [None]:
num_judges = en_ds.with_columns([pl.col("judges").list.len().alias("num_judges")]).select(["subset", "num_judges"]).to_pandas()
ax = sns.histplot(data=num_judges, x="num_judges", hue="subset", bins=num_judges["num_judges"].nunique(), stat="percent", common_norm=False)
ax.set(xlabel="#Judges per judgement", ylabel="%", title="#Judges per single judgement")
plt.show()

In [None]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

def tokenize(batch: dict[str, list]) -> dict[str, list[int]]:
    tokenized = tokenizer(batch["context"], add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False, return_length=True)
    return {"length": tokenized["length"]}

ds = ds.map(tokenize, batched=True, batch_size=16, remove_columns=["context"], num_proc=20)

In [None]:
context_len_train = ds["train"].to_pandas()
context_len_train["subset"] = "train"
context_len_test = ds["test"].to_pandas()
context_len_test["subset"] = "test"
context_len = pd.concat([context_len_train, context_len_test])

ax = sns.histplot(data=context_len, x="length", bins=50, hue="subset")
ax.set(xlabel="#Tokens", ylabel="Count", title="#Tokens distribution in context (llama-3 tokenizer)", yscale="log")
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: f'{int(x/1_000)}k'))
plt.show()