**** These pip installs must be adapted to the appropriate release level. Alternatively, the venv running JupyterLab can be preconfigured with a requirements file that includes the right release. Example for transform developers working from a git clone:

    make venv 
    source venv/bin/activate 
    pip install jupyterlab
    venv/bin/jupyter lab

In [1]:
%%capture
# This install is here for reference only.
# Users and application developers must use the right tag for the latest release from PyPI.
%pip install 'data-prep-toolkit-transforms[bloom]'
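
Since the right release level matters here, it can help to confirm which version actually ended up in the environment. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of the toolkit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# The distribution name matches the pip install above; the "[bloom]" extra
# is an install-time option and is not part of the name.
print(installed_version("data-prep-toolkit-transforms"))
```

If this prints `None`, the package is not installed in the kernel's environment, which usually means the notebook is running from a different venv than the one that was prepared.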

In [2]:
import os
import time
import glob
import pandas as pd
from hashlib import sha256
from pickle import dumps
from huggingface_hub import list_repo_files, hf_hub_download
from rbloom import Bloom
from dpk_bloom.transform import BLOOMTransform
from data_processing.data_access import DataAccessLocal

**** Specify the Hugging Face repo ID and the Bloom filter model

- REPO_ID: The Hugging Face dataset repository ID. Defaults to 'HuggingFaceFW/fineweb'.
- SNAPSHOT: The snapshot path within the repository. Defaults to 'data/CC-MAIN-2024-10'. You may specify other available snapshots of FineWeb data.
- BLOOM_MODEL: IBM's GneissWeb Bloom filter model, hosted on Hugging Face. Defaults to 'ibm-granite/GneissWeb.bloom'.
- BATCH_SIZE: The number of documents processed per batch. Adjust based on infrastructure capacity. Defaults to 1000.

In [3]:
# Configuration
REPO_ID = "HuggingFaceFW/fineweb"
SNAPSHOT = "data/CC-MAIN-2024-10"
BLOOM_MODEL = "ibm-granite/GneissWeb.bloom"
BATCH_SIZE = 1000
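
To get a feel for how the batch size relates to the number of batches reported during processing, here is a minimal sketch. It assumes the rows are split into consecutive fixed-size batches with a smaller final batch; that is an assumption for illustration, not a statement about how BLOOMTransform batches internally. `num_batches` is a hypothetical helper.

```python
import math


def num_batches(num_documents, batch_size):
    """Number of consecutive batches needed to cover num_documents rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(num_documents / batch_size)


# Hypothetical document counts, chosen only for illustration.
for n in (1_000, 1_001, 250_000):
    print(n, num_batches(n, 1000))
```

A larger batch size means fewer batches but more documents held in memory per step, so the value is a trade-off against available memory.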

**** Fetch a specific Parquet file from a snapshot of Hugging Face's FineWeb dataset. idx is the zero-based index of the Parquet file within the sorted file list of the snapshot. Defaults to the first Parquet file (idx=0).

In [4]:
def load_parquet_path(repo_id, snapshot, idx=0):
    files = sorted(
        f for f in list_repo_files(repo_id, repo_type="dataset")
        if f.startswith(snapshot) and f.endswith(".parquet")
    )
    
    if not files:
        raise FileNotFoundError(f"No Parquet files found in snapshot: {snapshot}")

    print(f"Snapshot {snapshot} contains {len(files)} Parquet files.")
    file_path = hf_hub_download(repo_id=repo_id, filename=files[idx], repo_type="dataset")
    print(f"Downloaded {idx}th Parquet file: {file_path}")
    return file_path
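
The selection step can be tried without any network access. The sketch below separates the filtering and indexing logic into a hypothetical helper, `select_parquet_file`, and adds an explicit range check on idx. Unlike plain list indexing, this version rejects negative indices, which is a design choice of the sketch.

```python
def select_parquet_file(all_files, snapshot, idx=0):
    """Pick the idx-th Parquet file (sorted by name) under a snapshot prefix."""
    files = sorted(
        f for f in all_files
        if f.startswith(snapshot) and f.endswith(".parquet")
    )
    if not files:
        raise FileNotFoundError(f"No Parquet files found in snapshot: {snapshot}")
    if not 0 <= idx < len(files):
        raise IndexError(
            f"idx={idx} is out of range; snapshot has {len(files)} Parquet files"
        )
    return files[idx]


# Hypothetical repository listing, used only to exercise the helper.
listing = [
    "README.md",
    "data/CC-MAIN-2024-10/000_00001.parquet",
    "data/CC-MAIN-2024-10/000_00000.parquet",
    "data/CC-MAIN-2023-50/000_00000.parquet",
]
print(select_parquet_file(listing, "data/CC-MAIN-2024-10", idx=1))
```

With a check like this, an out-of-range idx fails with a message that states how many files the snapshot contains.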

**** Despite its name, input_folder holds the local path of the downloaded Parquet file (the file at index idx in the snapshot), not a directory

In [5]:
# Setup paths
input_folder = load_parquet_path(repo_id=REPO_ID, snapshot=SNAPSHOT, idx=0)
output_folder = os.path.abspath(os.path.join(os.path.dirname(os.getcwd()), "output"))

# Initialize local data access
data_access = DataAccessLocal({"input_folder": input_folder, "output_folder": output_folder})

# Load table
table, _ = data_access.get_table(input_folder)

Snapshot data/CC-MAIN-2024-10 contains 300 Parquet files.
Downloaded 0th Parquet file: /Users/ian/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2024-10/000_00000.parquet


**** Initialize the Bloom transform class. BLOOM_MODEL defaults to "ibm-granite/GneissWeb.bloom", which is 28 GB in size and may take several minutes to download. **Once downloaded, it is cached and reused on subsequent calls to BLOOMTransform**.

In [6]:
# Initialize the BLOOM transform
transform = BLOOMTransform({
    "model_name_or_path": BLOOM_MODEL,
    "annotation_column": "is_in_GneissWeb",
    "doc_text_column": "contents",
    "inference_engine": "CPU",
    "batch_size": BATCH_SIZE,
    "data_access": data_access
})
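
For readers unfamiliar with Bloom filters, the sketch below shows the general idea behind a membership check: each item sets a few bit positions, and a lookup reports "possibly present" only if all of those bits are set. This is a toy illustration written for this note. It is not the implementation used by rbloom or BLOOMTransform, and `ToyBloomFilter`, its parameters, and the document IDs are all hypothetical.

```python
from hashlib import sha256


class ToyBloomFilter:
    """A small Bloom filter: a bit array plus k hash positions per item."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical document IDs, used only for illustration.
bloom = ToyBloomFilter()
for doc_id in ("doc-001", "doc-002", "doc-003"):
    bloom.add(doc_id)

print("doc-002" in bloom)  # True: added items are never reported as absent
```

A Bloom filter can return false positives but never false negatives, which is why it suits large-scale "is this document in the set?" annotation: the filter is far smaller than the full list of IDs, at the cost of a small false-positive rate.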

In [7]:
time0 = time.time()
table_list, metadata = transform.transform(table)
time1 = time.time()
print(f"it took {(time1-time0)/float(60):.1f} mins to process {len(table)} documents")

# List the files written to the output folder (glob is already imported above)
glob.glob("output/*")

Processing batch: 0/973
Processing batch: 1/973
Processing batch: 2/973
...

['output/metadata.json', 'output/test1.parquet']
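
To look at what was written, a minimal sketch like the following lists the files in the output folder and loads metadata.json if it exists. The structure of metadata.json is not shown in this notebook, so the sketch parses it as generic JSON and makes no assumption about its keys. `summarize_output` is a hypothetical helper.

```python
import json
import os


def summarize_output(folder):
    """Return (file sizes in bytes keyed by name, parsed metadata.json or None)."""
    sizes = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            sizes[name] = os.path.getsize(path)

    metadata = None
    metadata_path = os.path.join(folder, "metadata.json")
    if os.path.isfile(metadata_path):
        with open(metadata_path, "r", encoding="utf-8") as fh:
            metadata = json.load(fh)
    return sizes, metadata


# Only inspect the folder if it exists relative to the current directory.
if os.path.isdir("output"):
    sizes, metadata = summarize_output("output")
    print(sizes)
    print(metadata)
```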