##### **** These pip installs need to be adapted to the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
make venv
source venv/bin/activate && pip install jupyterlab
```

In [None]:
%%capture
## These installs are here for reference only
# Users and application developers must use the right tag for the latest release on PyPI
#!pip install data-prep-toolkit
#!pip install data-prep-toolkit-transforms
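
Before running the notebook, it can help to confirm which release of each package the active environment actually has. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the package names are the ones from the commented-out pip commands.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package names are taken from the commented-out pip commands in this cell.
for name in ("data-prep-toolkit", "data-prep-toolkit-transforms"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either package is reported as not installed, or at an unexpected version, adjust the pip commands or the requirements file accordingly.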

##### ***** Import required classes and modules

In [1]:
from dpk_readability.runtime import Readability

##### ***** Setup runtime parameters for this transform
Only the parameters used in this example are described here. For the complete list of parameters, refer to the README.md for this transform.

| parameter:type | value | description |
|-|-|-|
| input_folder:str | \${PWD}/test-data/input/ | folder that contains the input parquet files for the readability transform |
| output_folder:str | \${PWD}/output/ | folder that contains all the intermediate results and the output parquet files for the readability transform |
| readability_contents_column_name:str | contents | name of the column that stores the document text |
| readability_curriculum:bool | False | curriculum parameter for the transform; either True or False |
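
The parameters can also be collected in a dictionary first, which makes it easy to resolve the folders against a base directory and to inspect the values before running. This is a sketch only: the helper `build_readability_params` is hypothetical, and the values mirror the call made in this notebook.

```python
import os


def build_readability_params(base_dir=None):
    """Collect the example's parameters, resolving folders against base_dir."""
    base_dir = base_dir or os.getcwd()
    return {
        "input_folder": os.path.join(base_dir, "test-data", "input"),
        "output_folder": os.path.join(base_dir, "output"),
        "readability_contents_column_name": "contents",
        "readability_curriculum": False,
    }


params = build_readability_params()
for key, value in params.items():
    print(f"{key} = {value!r}")

# With dpk_readability installed, the dictionary could be unpacked as keyword arguments:
# Readability(**params).transform()
```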

In [2]:
Readability(
    input_folder="test-data/input",
    output_folder="output",
    readability_contents_column_name="contents",
    readability_curriculum=False,
).transform()


11:49:27 INFO - Readability parameters are : {'contents_column_name': 'contents', 'curriculum': False}
11:49:27 INFO - pipeline id pipeline_id
11:49:27 INFO - code location None
11:49:27 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output
11:49:27 INFO - data factory data_ max_files -1, n_sample -1
11:49:27 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
11:49:27 INFO - orchestrator readability started at 2025-01-23 11:49:27
11:49:27 INFO - Number of files is 1, source profile {'max_file_size': 0.014194488525390625, 'min_file_size': 0.014194488525390625, 'total_file_size': 0.014194488525390625}
11:49:27 INFO - Completed 1 files (100.0%) in 0.003 min
11:49:27 INFO - Done processing 1 files, waiting for flush() completion.
11:49:27 INFO - done flushing in 0.0 sec
11:49:27 INFO - Completed execution in 0.003 min, execution

0
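
The `0` shown above is the value returned by `transform()`. In a script or pipeline, that value can be checked explicitly rather than read off the cell output. The sketch below assumes that any non-zero value signals a failure, which this notebook does not state; `ensure_success` is a hypothetical helper.

```python
def ensure_success(return_code, transform_name="readability"):
    """Raise if a transform reported a non-zero return code; otherwise return it."""
    # Assumption: 0 means success and anything else means failure.
    if return_code != 0:
        raise RuntimeError(
            f"{transform_name} transform failed with return code {return_code}"
        )
    return return_code


# The run above returned 0, so this check passes silently.
ensure_success(0)
```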

##### **** The specified output folder contains the transformed parquet files.

In [3]:
import glob
glob.glob("output/*")

['output/readability-test.parquet', 'output/metadata.json']
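
Alongside the parquet file, the run writes a `metadata.json`. Its structure is not documented in this notebook, so the following sketch only loads the file and lists its top-level keys; `load_metadata` is a hypothetical helper, not part of the toolkit.

```python
import json
import os


def load_metadata(output_folder="output"):
    """Load metadata.json from the output folder, or return None if it is missing."""
    path = os.path.join(output_folder, "metadata.json")
    if not os.path.isfile(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)


metadata = load_metadata("output")
if metadata is None:
    print("No metadata.json found; run the transform first.")
elif isinstance(metadata, dict):
    # The keys are not described in this notebook, so only list them.
    print(sorted(metadata))
else:
    print(type(metadata).__name__)
```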

##### ***** Print the input data

In [4]:
import polars as pl
import os
input_df = pl.read_parquet(os.path.join(os.path.abspath(""), "test-data", "input", "readability-test.parquet"))

with pl.Config(fmt_str_lengths=10000000, tbl_rows=-1, tbl_cols=-1):
    print(input_df)

shape: (2, 2)
┌─────────────────────────────────────────────────┬────────────────────────────────────────────────┐
│ contents                                        ┆ id                                             │
│ ---                                             ┆ ---                                            │
│ str                                             ┆ str                                            │
╞═════════════════════════════════════════════════╪════════════════════════════════════════════════╡
│ Six Sigma Tips                                  ┆ <urn:uuid:01d4d837-d379-4466-8715-3d1edda758c1 │
│ These Six Sigma tips help you achieve high      ┆ >                                              │
│ quality every time, the desired objective of    ┆                                                │
│ most every business. However to get the desire  ┆                                                │
│ of high quality creations turn into reality     ┆                          

##### ***** Print the output result

In [5]:
import os
import polars as pl
output_df = pl.read_parquet(os.path.join(os.path.abspath(""), "output", "readability-test.parquet"))
with pl.Config(fmt_str_lengths=10000000, tbl_rows=-1, tbl_cols=-1):
    print(output_df)

shape: (2, 16)
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬────────┐
│ con ┆ id  ┆ fle ┆ fle ┆ gun ┆ smo ┆ col ┆ aut ┆ dal ┆ dif ┆ lin ┆ tex ┆ spa ┆ mca ┆ rea ┆ avg_gr │
│ ten ┆ --- ┆ sch ┆ sch ┆ nin ┆ g_i ┆ ema ┆ oma ┆ e_c ┆ fic ┆ sea ┆ t_s ┆ che ┆ lpi ┆ din ┆ ade_le │
│ ts  ┆ str ┆ _ea ┆ _ki ┆ g_f ┆ nde ┆ n_l ┆ ted ┆ hal ┆ ult ┆ r_w ┆ tan ┆ _re ┆ ne_ ┆ g_t ┆ vel    │
│ --- ┆     ┆ se_ ┆ nca ┆ og_ ┆ x_t ┆ iau ┆ _re ┆ l_r ┆ _wo ┆ rit ┆ dar ┆ ada ┆ efl ┆ ime ┆ ---    │
│ str ┆     ┆ tex ┆ id_ ┆ tex ┆ ext ┆ _in ┆ ada ┆ ead ┆ rds ┆ e_f ┆ d_t ┆ bil ┆ aw_ ┆ _te ┆ f64    │
│     ┆     ┆ tst ┆ tex ┆ tst ┆ sta ┆ dex ┆ bil ┆ abi ┆ _te ┆ orm ┆ ext ┆ ity ┆ tex ┆ xts ┆        │
│     ┆     ┆ at  ┆ tst ┆ at  ┆ t   ┆ _te ┆ ity ┆ lit ┆ xts ┆ ula ┆ sta ┆ _te ┆ tst ┆ tat ┆        │
│     ┆     ┆ --- ┆ at  ┆ --- ┆ --- ┆ xts ┆ _in ┆ y_s ┆ tat ┆ _te ┆ t   ┆ xts ┆ at  ┆ --- ┆        │
│     ┆     ┆ f64 ┆ --- ┆ f64 ┆ f64 ┆ tat ┆ dex ┆ cor ┆ --- ┆ xts ┆ --- ┆ ta