##### **** These pip installs need to be adapted to the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
make venv
source venv/bin/activate && pip install jupyterlab
```

In [None]:
%%capture
## This is here as a reference only
# Users and application developers must use the right tag for the latest from pypi
#!pip install data-prep-toolkit
#!pip install data-prep-toolkit-transforms

##### ***** Import required classes and modules

In [1]:
from dpk_readability.runtime import Readability

##### ***** Setup runtime parameters for this transform
We only describe the parameters used in this example. For a complete list of parameters, refer to the README.md for this transform:

| parameter:type | value | description |
|-|-|-|
| input_folder:str | \${PWD}/test-data/input/ | folder that contains the input parquet files for the readability transform |
| output_folder:str | \${PWD}/output/ | folder that contains all the intermediate results and the output parquet files for the readability transform |
| readability_contents_column_name:str | contents | name of the column that stores document text |
| readability_score_list:Union[str, list[str]] | mcalpine_eflaw_textstat | a single readability score or a list of readability scores to be computed by the transform |

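To give a feel for what the `mcalpine_eflaw_textstat` score measures, here is a minimal from-scratch sketch of the McAlpine EFLAW formula: (words + mini-words) / sentences, where a mini-word has three letters or fewer. The score name suggests the transform relies on the textstat library, but that is an assumption; the helper `mcalpine_eflaw_sketch` below is hypothetical, uses naive regex tokenization, and will not necessarily reproduce the values the transform reports.

```python
import re


def mcalpine_eflaw_sketch(text: str) -> float:
    """Approximate McAlpine EFLAW: (words + mini-words) / sentences.

    Hypothetical helper for illustration only; tokenization is naive
    and may differ from what the transform actually uses.
    """
    # Words: runs of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Mini-words: words of three characters or fewer.
    mini_words = [w for w in words if len(w) <= 3]
    # Sentences: non-empty chunks separated by ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return (len(words) + len(mini_words)) / len(sentences)


# 8 words, 6 mini-words, 2 sentences
print(mcalpine_eflaw_sketch("I am a cat. The dog barks loudly."))
```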
In [2]:
Readability(
    input_folder="test-data/input",
    output_folder="output",
    readability_contents_column_name="contents",
    readability_score_list=["mcalpine_eflaw_textstat"],
).transform()


19:29:24 INFO - Readability parameters are : {'readability_contents_column_name': 'contents', 'readability_score_list': ['mcalpine_eflaw_textstat']}
19:29:24 INFO - pipeline id pipeline_id
19:29:24 INFO - code location None
19:29:24 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output
19:29:24 INFO - data factory data_ max_files -1, n_sample -1
19:29:24 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
19:29:24 INFO - orchestrator readability started at 2025-02-10 19:29:24
19:29:24 INFO - Number of files is 1, source profile {'max_file_size': 0.014194488525390625, 'min_file_size': 0.014194488525390625, 'total_file_size': 0.014194488525390625}
19:29:25 INFO - Completed 1 files (100.0%) in 0.006 min
19:29:25 INFO - Done processing 1 files, waiting for flush() completion.
19:29:25 INFO - done flushing in 0.0 sec
19:29:25 INFO

0

##### **** The output folder now contains the transformed parquet files and a metadata.json file.

In [3]:
import glob
glob.glob("output/*")

['output/readability-test.parquet', 'output/metadata.json']

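Beyond listing the folder, it can help to separate the parquet files from the rest and peek at `metadata.json`. The following is a minimal sketch using only the standard library; the helper `summarize_output` is hypothetical, and no particular keys are assumed inside the metadata file, since its schema is not described here.

```python
import json
from pathlib import Path


def summarize_output(folder: str) -> dict:
    """Group files in `folder` by type and load metadata.json if present.

    Hypothetical helper for illustration; it makes no assumption about
    the contents of metadata.json beyond it being valid JSON.
    """
    out = Path(folder)
    summary = {"parquet": [], "other": [], "metadata": None}
    if not out.is_dir():
        return summary
    for path in sorted(out.iterdir()):
        if path.suffix == ".parquet":
            summary["parquet"].append(path.name)
        else:
            summary["other"].append(path.name)
    meta_path = out / "metadata.json"
    if meta_path.is_file():
        with meta_path.open() as f:
            summary["metadata"] = json.load(f)
    return summary


summary = summarize_output("output")
print(summary["parquet"])
# Only list top-level keys when the metadata is a JSON object.
if isinstance(summary["metadata"], dict):
    print(sorted(summary["metadata"].keys()))
```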
##### ***** Print the input data

In [4]:
import polars as pl
import os
input_df = pl.read_parquet(os.path.join(os.path.abspath(""), "test-data", "input", "readability-test.parquet"))

with pl.Config(fmt_str_lengths=10000000, tbl_rows=-1, tbl_cols=-1):
    print(input_df)

shape: (2, 2)
┌─────────────────────────────────────────────────┬────────────────────────────────────────────────┐
│ contents                                        ┆ id                                             │
│ ---                                             ┆ ---                                            │
│ str                                             ┆ str                                            │
╞═════════════════════════════════════════════════╪════════════════════════════════════════════════╡
│ Six Sigma Tips                                  ┆ <urn:uuid:01d4d837-d379-4466-8715-3d1edda758c1 │
│ These Six Sigma tips help you achieve high      ┆ >                                              │
│ quality every time, the desired objective of    ┆                                                │
│ most every business. However to get the desire  ┆                                                │
│ of high quality creations turn into reality     ┆                          

##### ***** Print the output result

In [5]:
import os
import polars as pl
output_df = pl.read_parquet(os.path.join(os.path.abspath(""), "output", "readability-test.parquet"))
with pl.Config(fmt_str_lengths=10000000, tbl_rows=-1, tbl_cols=-1):
    print(output_df)

shape: (2, 3)
┌────────────────────────────────────┬───────────────────────────────────┬─────────────────────────┐
│ contents                           ┆ id                                ┆ mcalpine_eflaw_textstat │
│ ---                                ┆ ---                               ┆ ---                     │
│ str                                ┆ str                               ┆ f64                     │
╞════════════════════════════════════╪═══════════════════════════════════╪═════════════════════════╡
│ Six Sigma Tips                     ┆ <urn:uuid:01d4d837-d379-4466-8715 ┆ 28.7                    │
│ These Six Sigma tips help you      ┆ -3d1edda758c1>                    ┆                         │
│ achieve high quality every time,   ┆                                   ┆                         │
│ the desired objective of most      ┆                                   ┆                         │
│ every business. However to get the ┆                                   ┆   