##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
make venv
source venv/bin/activate && pip install jupyterlab
```

In [None]:
%%capture
## This is here as a reference only
# Users and application developers must use the right tag for the latest from pypi
#!pip install data-prep-toolkit
#!pip install data-prep-toolkit-transforms

##### ***** Import required classes and modules

In [1]:
!pip install --upgrade huggingface_hub
from huggingface_hub import hf_hub_download
import pandas as pd

REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"

hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")

Collecting huggingface_hub
  Downloading huggingface_hub-0.28.1-py3-none-any.whl.metadata (13 kB)
Collecting fsspec>=2023.5.0 (from huggingface_hub)
  Downloading fsspec-2025.2.0-py3-none-any.whl.metadata (11 kB)
Collecting tqdm>=4.42.1 (from huggingface_hub)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Downloading huggingface_hub-0.28.1-py3-none-any.whl (464 kB)
Downloading fsspec-2025.2.0-py3-none-any.whl (184 kB)
Using cached tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, fsspec, huggingface_hub
Successfully installed fsspec-2025.2.0 huggingface_hub-0.28.1 tqdm-4.67.1


000_00000.parquet:   0%|          | 0.00/2.15G [00:00<?, ?B/s]

'/home/cma/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2013-20/000_00000.parquet'
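The download above lands in the Hugging Face cache, while the transform later in this notebook reads from `test-data/hf`. The notebook does not show how the file gets from one place to the other, so the following is only a minimal sketch of one way to do it. The helper name `stage_input` is hypothetical and not part of data-prep-toolkit or `huggingface_hub`.

```python
import shutil
from pathlib import Path


def stage_input(cached_file: str, input_folder: str) -> Path:
    """Copy a downloaded parquet file into the folder the transform reads from.

    Hypothetical helper for illustration; not part of any library.
    """
    target_dir = Path(input_folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(cached_file).name
    # copy2 follows symlinks by default, so a symlinked cache entry
    # is copied as a regular file rather than as a link.
    shutil.copy2(cached_file, target)
    return target


# Assumed usage in this notebook (not executed here):
# local_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
# stage_input(local_path, "test-data/hf")
```

Copying a file of roughly 2 GB takes time and disk space; passing the cache folder itself as `input_folder` to the transform may be an alternative, but that is not verified here.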

In [1]:
from dpk_readability.runtime import Readability

##### ***** Setup runtime parameters for this transform
We will only provide a description for the parameters used in this example. For a complete list of parameters, please refer to the README.md for this transform:
|parameter:type | value | description |
|-|-|-|
| input_folder:str | \${PWD}/test-data/input/ | folder that contains the input parquet files for the readability transform |
| output_folder:str | \${PWD}/output/ | folder that receives the intermediate results and the output parquet files of the readability transform |
| readability_contents_column_name:str | text | name of the column that stores document text |
| readability_score_list:Union[str, list[str]] | mcalpine_eflaw_textstat | a single readability score, or a list of readability scores, to be computed by the transform |
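
To give a feel for what the `mcalpine_eflaw_textstat` score measures, here is a rough sketch of the McAlpine EFLAW formula: (words + miniwords) / sentences, where a miniword has three letters or fewer. Lower scores indicate text that is easier to read. The score name suggests the transform delegates to the `textstat` library; the function below is not that implementation. Its name and its naive tokenization are assumptions, so its values will not match the transform's output exactly.

```python
import re


def mcalpine_eflaw_sketch(text: str) -> float:
    """Approximate McAlpine EFLAW: (words + miniwords) / sentences.

    Illustrative only; word and sentence splitting here are naive assumptions.
    """
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words:
        return 0.0
    # Miniwords are words of three characters or fewer.
    miniwords = [w for w in words if len(w) <= 3]
    # Naive sentence split on '.', '!' or '?', ignoring empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Guard against text without sentence-ending punctuation.
    return (len(words) + len(miniwords)) / max(len(sentences), 1)


# 9 words, 8 miniwords, 2 sentences -> (9 + 8) / 2
print(mcalpine_eflaw_sketch("The cat sat on the mat. It was happy."))  # 8.5
```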

In [3]:
Readability(
    input_folder="test-data/hf",
    output_folder="output",
    readability_contents_column_name="text",
    readability_score_list=["mcalpine_eflaw_textstat"],
).transform()


10:39:03 INFO - Readability parameters are : {'readability_contents_column_name': 'text', 'readability_score_list': ['mcalpine_eflaw_textstat']}
10:39:03 INFO - pipeline id pipeline_id
10:39:03 INFO - code location None
10:39:03 INFO - data factory data_ is using local data access: input_folder - test-data/hf output_folder - output
10:39:03 INFO - data factory data_ max_files -1, n_sample -1
10:39:03 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
10:39:03 INFO - orchestrator readability started at 2025-02-07 10:39:03
10:39:03 INFO - Number of files is 1, source profile {'max_file_size': 2048.0454998016357, 'min_file_size': 2048.0454998016357, 'total_file_size': 2048.0454998016357}
10:42:55 INFO - Completed 1 files (100.0%) in 3.87 min
10:42:55 INFO - Done processing 1 files, waiting for flush() completion.
10:42:55 INFO - done flushing in 0.0 sec
10:42:55 INFO - Completed e

0

##### **** The output folder specified above will contain the transformed parquet files.

In [4]:
import glob
glob.glob("output/*")

['output/readability-test.parquet',
 'output/metadata.json',
 'output/000_00000.parquet']
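
Alongside the parquet files, the listing shows a `metadata.json`. Its contents are not shown in this notebook, so the sketch below only lists its top-level keys and assumes the file holds a JSON object. The helper name `top_level_keys` is hypothetical.

```python
import json
from pathlib import Path


def top_level_keys(metadata_path: str) -> list[str]:
    """Return the sorted top-level keys of a JSON metadata file.

    Assumes the file contains a JSON object (a dict at the top level).
    """
    with Path(metadata_path).open(encoding="utf-8") as fh:
        metadata = json.load(fh)
    return sorted(metadata)


metadata_file = Path("output/metadata.json")
if metadata_file.exists():  # only runs once the transform has produced output
    print(top_level_keys(str(metadata_file)))
```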

##### ***** Print the input data

In [5]:
import polars as pl
import os
input_df = pl.read_parquet(os.path.join(os.path.abspath(""), "test-data", "hf", "000_00000.parquet"))

with pl.Config(tbl_cols=-1):
    print(input_df)

shape: (1_091_396, 9)
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ text     ┆ id       ┆ dump     ┆ url      ┆ date     ┆ file_pat ┆ language ┆ language ┆ token_co │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ h        ┆ ---      ┆ _score   ┆ unt      │
│ str      ┆ str      ┆ str      ┆ str      ┆ str      ┆ ---      ┆ str      ┆ ---      ┆ ---      │
│          ┆          ┆          ┆          ┆          ┆ str      ┆          ┆ f64      ┆ i64      │
╞══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ How AP   ┆ <urn:uui ┆ CC-MAIN- ┆ http://% ┆ 2013-05- ┆ s3://com ┆ en       ┆ 0.972142 ┆ 717      │
│ reported ┆ d:d66bc6 ┆ 2013-20  ┆ 20jwashi ┆ 18T05:48 ┆ moncrawl ┆          ┆          ┆          │
│ in all   ┆ fe-8477- ┆          ┆ ngton@ap ┆ :54Z     ┆ /crawl-d ┆          ┆          ┆          │
│ formats… ┆ 4adf-b…  ┆          ┆ .org/C…  ┆          ┆ ata/CC…  ┆  

##### ***** Print the output result

In [6]:
import polars as pl
import os
output_df = pl.read_parquet(os.path.join(os.path.abspath(""), "output", "000_00000.parquet"))
with pl.Config(tbl_cols=-1):
    print(output_df)

shape: (1_091_396, 10)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬────────┐
│ text    ┆ id      ┆ dump    ┆ url     ┆ date    ┆ file_pa ┆ languag ┆ languag ┆ token_c ┆ mcalpi │
│ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ th      ┆ e       ┆ e_score ┆ ount    ┆ ne_efl │
│ str     ┆ str     ┆ str     ┆ str     ┆ str     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ aw_tex │
│         ┆         ┆         ┆         ┆         ┆ str     ┆ str     ┆ f64     ┆ i64     ┆ tstat  │
│         ┆         ┆         ┆         ┆         ┆         ┆         ┆         ┆         ┆ ---    │
│         ┆         ┆         ┆         ┆         ┆         ┆         ┆         ┆         ┆ f64    │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪════════╡
│ How AP  ┆ <urn:uu ┆ CC-MAIN ┆ http:// ┆ 2013-05 ┆ s3://co ┆ en      ┆ 0.97214 ┆ 717     ┆ 26.9   │
│ reporte ┆ id:d66b ┆ -2013-2 ┆ %20jwas ┆ -18T05: ┆ mmoncra ┆       