# Datasets compressed via `zstandard`
https://github.com/EleutherAI/the-pile<br>
> **https://the-eye.eu/public/AI/pile/train/** > **make small versions**!!!

In the paragraph [*What is the Pile?*](https://huggingface.co/learn/nlp-course/chapter5/4?fw=pt#what-is-the-pile) in chapter 5 of the [HuggingFace course](https://huggingface.co/learn/nlp-course/chapter1/1?fw=pt), the dataset [***the Pile***](https://en.wikipedia.org/wiki/The_Pile_(dataset)) is used to load the PubMed abstracts dataset in `.jsonl.zst` format. Reading that format relies on the `zstandard` compression library.

However, the dataset source given in the course no longer serves that dataset. This notebook takes the [PubMed 200k RCT](https://www.kaggle.com/datasets/matthewjansen/pubmed-200k-rtc) Kaggle dataset and converts it to the `.jsonl.zst` format so that it can be used as a drop-in replacement for the Pile's PubMed abstracts dataset.
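For orientation, a `.jsonl` (JSON Lines) file stores one JSON object per line, and the `.zst` suffix indicates that the file has been compressed with Zstandard. The following is a minimal sketch of the JSON Lines part using only the standard library. The helper names `rows_to_jsonl` and `jsonl_to_rows` and the sample rows are hypothetical and serve only as illustration; the column names mirror the `abstract_text` and `target` columns kept in this notebook.

```python
import json


def rows_to_jsonl(rows):
    """Serialize an iterable of dicts to JSON Lines text (one object per line)."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def jsonl_to_rows(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Illustrative rows only; not taken from the real dataset.
rows = [
    {"abstract_text": "An illustrative sentence.", "target": "METHODS"},
    {"abstract_text": "Another illustrative sentence.", "target": "RESULTS"},
]

text = rows_to_jsonl(rows)
print(text, end="")

# The round trip should give back the original rows.
assert jsonl_to_rows(text) == rows
```

Because `json.dumps` escapes newline characters inside strings, each record is guaranteed to occupy exactly one physical line, which is what makes line-by-line streaming of large files possible.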

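First, create a HuggingFace `Dataset` by loading the appropriate `.csv` file, drop the columns that are not needed, write the result to `.jsonl`, and compress it to `.jsonl.zst`.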

In [1]:
import os
import zstandard as zstd
from dotenv import load_dotenv
from datasets import load_dataset
from huggingface_hub import login

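def compress_jsonl_zst(input_file, output_file):
    """Compress a .jsonl file line by line into a single Zstandard frame."""
    cctx = zstd.ZstdCompressor(level=3)  # level 3 is the zstd default
    with open(input_file, "rb") as f_in:
        with open(output_file, "wb") as f_out:
            compressor = cctx.stream_writer(f_out)
            for line in f_in:
                compressor.write(line)
            # Finish the frame so that the output is a complete .zst file
            compressor.flush(zstd.FLUSH_FRAME)
    print("Compression completed!")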

subpath = "pubmed_data/PubMed-200k-RTC_"
for version in ["train", "train_min"]:
    dataset = load_dataset("csv", data_files=f"{subpath+version}.csv").remove_columns([
        "abstract_id", "line_id", "line_number", "total_lines"
    ])
    print(dataset)
    jsonl_path = f"{subpath+version}.jsonl"
    dataset["train"].to_json(jsonl_path)
    compress_jsonl_zst(jsonl_path, jsonl_path+".zst")

!ls "pubmed_data"

DatasetDict({
    train: Dataset({
        features: ['abstract_text', 'target'],
        num_rows: 2211861
    })
})


Creating json from Arrow format:   0%|          | 0/2212 [00:00<?, ?ba/s]

Compression completed!
DatasetDict({
    train: Dataset({
        features: ['abstract_text', 'target'],
        num_rows: 7
    })
})


Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Compression completed!
PubMed-200k-RTC_train.csv	 PubMed-200k-RTC_train_min.csv
PubMed-200k-RTC_train.jsonl	 PubMed-200k-RTC_train_min.jsonl
PubMed-200k-RTC_train.jsonl.zst  PubMed-200k-RTC_train_min.jsonl.zst


Inspect a single example. After the loop, `dataset` refers to the last version processed (`train_min`).

In [2]:
i = 3
print(dataset["train"][i])

{'abstract_text': "The intervention group will participate in the online group program ` Positive Outlook ' .", 'target': 'METHODS'}


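Repeat the conversion for the legal text classification dataset, then list the working directory with `ls -al`. Note that the generated data files sit inside the two `*_data` folders, so their sizes do not appear in this listing.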

In [3]:
subpath = "legalcitation_data/LegalText-classification_"
for version in ["train", "train_min"]:
    dataset = load_dataset("csv", data_files=f"{subpath+version}.csv").remove_columns([
        "case_id"
    ])
    print(dataset)
    jsonl_path = f"{subpath+version}.jsonl"
    print(f"\njsonl_path: {jsonl_path}")
    dataset["train"].to_json(jsonl_path)
    compress_jsonl_zst(jsonl_path, jsonl_path+".zst")
!ls -al

DatasetDict({
    train: Dataset({
        features: ['case_outcome', 'case_title', 'case_text'],
        num_rows: 24985
    })
})

jsonl_path: legalcitation_data/LegalText-classification_train.jsonl


Creating json from Arrow format:   0%|          | 0/25 [00:00<?, ?ba/s]

Compression completed!
DatasetDict({
    train: Dataset({
        features: ['case_outcome', 'case_title', 'case_text'],
        num_rows: 4
    })
})

jsonl_path: legalcitation_data/LegalText-classification_train_min.jsonl


Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Compression completed!
total 40
drwxr-xr-x 5 matthias matthias  4096 Feb 18 12:55 .
drwxr-xr-x 6 matthias matthias  4096 Feb 17 01:28 ..
drwxr-xr-x 2 matthias matthias  4096 Feb 16 13:58 .ipynb_checkpoints
drwxr-xr-x 2 matthias matthias  4096 Jul 15  2023 legalcitation_data
drwxr-xr-x 2 matthias matthias  4096 Jul 15  2023 pubmed_data
-rwxr-xr-x 1 matthias matthias  1890 Jul 15  2023 README.md
-rwxr-xr-x 1 matthias matthias 12998 Feb 18 12:55 Zstandard_datasets.ipynb
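
The listing above covers only the working directory, so the sizes of the generated data files are not visible in it. A minimal sketch of how they could be reported from Python is shown below. The helpers `report_sizes` and `compression_ratio` are hypothetical names introduced for this example; the file paths follow the naming scheme used in the cells above.

```python
import os


def report_sizes(paths):
    """Return {path: size in bytes} for every path that exists as a file."""
    return {p: os.path.getsize(p) for p in paths if os.path.isfile(p)}


def compression_ratio(original_path, compressed_path):
    """Return how many times smaller the compressed file is than the original."""
    return os.path.getsize(original_path) / os.path.getsize(compressed_path)


# Paths follow the naming scheme used in this notebook; missing files are skipped.
base = "legalcitation_data/LegalText-classification_train"
candidates = [base + ext for ext in (".csv", ".jsonl", ".jsonl.zst")]
for path, size in report_sizes(candidates).items():
    print(f"{size:>12,d}  {path}")
```

Once both files exist, `compression_ratio(base + ".jsonl", base + ".jsonl.zst")` gives the factor by which compression shrank the JSON Lines file.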


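Check whether the dataset can be recreated from one of the files created above. The cell below loads the `.csv` version; to test the compressed format, point `data_files` at the `.jsonl.zst` file and use the `json` loader instead. Afterwards, log in to the HuggingFace hub before pushing the dataset to its remote repo on the hub.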

In [4]:
data_files = "legalcitation_data/LegalText-classification_train_min.csv"   # choose one of the files created above and ...
pubmed_dataset = load_dataset("csv", data_files=data_files, split="train") # ... adapt the loader: "csv" for .csv, "json" for .jsonl and .jsonl.zst
pubmed_dataset

Dataset({
    features: ['case_id', 'case_outcome', 'case_title', 'case_text'],
    num_rows: 4
})

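Log in with the token stored in the `HUGGINGFACE_TOKEN` environment variable (loaded from a `.env` file) and push the dataset to the hub.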

In [5]:
load_dotenv()
login(token=os.getenv("HUGGINGFACE_TOKEN"))
pubmed_dataset.push_to_hub("TheMiniPile")
pubmed_dataset

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/383 [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Dataset({
    features: ['case_id', 'case_outcome', 'case_title', 'case_text'],
    num_rows: 4
})

$\checkmark$