##### **** These pip installs need to be adapted to the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
make venv
source venv/bin/activate && pip install jupyterlab
```

In [None]:
%%capture
## This cell is here for reference only
# Users and application developers must use the right tag for the latest release from PyPI
#!pip install data-prep-toolkit
#!pip install data-prep-toolkit-transforms

##### ***** Import required classes and modules

In [1]:
from dpk_extreme_tokenized.ray.runtime import ExtremeTokenized

##### ***** Setup runtime parameters for this transform
We will only provide a description for the parameters used in this example. For a complete list of parameters, please refer to the README.md for this transform:
|parameter:type | value | description |
|-|-|-|
| input_folder:str | \${PWD}/test-data/input/ | folder that contains the input parquet files for the extreme tokenized algorithm |
| output_folder:str | \${PWD}/output/ | folder that contains all the intermediate results and the output parquet files for the extreme tokenized algorithm |
| et_contents_column_name:str | text | name of the column that stores document text |
| et_arrow_path:str | \${PWD}/test-data/input/arrow/ | arrow folder location |
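
The `${PWD}` placeholders in the table stand for the directory the notebook is run from. As a minimal sketch, assuming the notebook is started from the transform's folder, the same values can be built explicitly in Python. The `build_params` helper is hypothetical and not part of the toolkit:

```python
import os


def build_params(base_dir=None):
    """Return the example parameters with paths resolved against base_dir."""
    # Assumption: base_dir defaults to the current working directory (${PWD}).
    base = os.path.abspath(base_dir or os.getcwd())
    return {
        "input_folder": os.path.join(base, "test-data", "input"),
        "output_folder": os.path.join(base, "output"),
        "et_contents_column_name": "text",
        "et_arrow_path": os.path.join(base, "test-data", "input", "arrow"),
    }


# Print each parameter next to its resolved value.
params = build_params()
for name, value in params.items():
    print(f"{name} = {value}")
```

The resulting dictionary could then be unpacked into the constructor, for example `ExtremeTokenized(run_locally=True, **params)`, assuming it accepts absolute paths the same way it accepts the relative ones used in the call below.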

In [2]:
ExtremeTokenized(
    input_folder="test-data/input",
    output_folder="output",
    run_locally=True,
    et_contents_column_name="text",
    et_arrow_path="test-data/input/arrow",
).transform()

08:31:30 INFO - data factory et_ is using local configuration without input/output path
08:31:30 INFO - data factory et_ max_files -1, n_sample -1
08:31:30 INFO - data factory et_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
08:31:30 INFO - pipeline id pipeline_id
08:31:30 INFO - code location None
08:31:30 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}
08:31:30 INFO - actor creation delay 0
08:31:30 INFO - job details {'job category': 'preprocessing', 'job name': 'et', 'job type': 'ray', 'job id': 'job_id'}
08:31:30 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output
08:31:30 INFO - data factory data_ max_files -1, n_sample -1
08:31:30 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
08:31:30 INFO -

0

##### **** The output folder specified above will contain the transformed parquet files.

In [3]:
import glob
glob.glob("output/*")

['output/test1.parquet', 'output/metadata.json']
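
To separate the transformed parquet files from `metadata.json` in that listing, here is a small sketch that uses only the standard library. The helper name `list_parquet_files` is hypothetical, and the folder name `output` is taken from the example above:

```python
from pathlib import Path


def list_parquet_files(folder):
    """Return the sorted names of the .parquet files directly inside folder."""
    folder = Path(folder)
    # A missing folder simply yields an empty list instead of raising.
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob("*.parquet"))


# Assumption: the transform has already written its results to "output".
print(list_parquet_files("output"))
```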

##### ***** Print the input data

In [4]:
import polars as pl
import os
input_df = pl.read_parquet(os.path.join(os.path.abspath(""), "test-data", "input", "test1.parquet"))

with pl.Config(fmt_str_lengths=10000000, tbl_rows=-1, tbl_cols=-1):
    print(input_df)

shape: (10, 9)
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ text     ┆ id       ┆ dump     ┆ url      ┆ date     ┆ file_pat ┆ language ┆ language ┆ token_co │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ h        ┆ ---      ┆ _score   ┆ unt      │
│ str      ┆ str      ┆ str      ┆ str      ┆ str      ┆ ---      ┆ str      ┆ ---      ┆ ---      │
│          ┆          ┆          ┆          ┆          ┆ str      ┆          ┆ f64      ┆ i64      │
╞══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ The      ┆ <urn:uui ┆ CC-MAIN- ┆ http://0 ┆ 2023-03- ┆ s3://com ┆ en       ┆ 0.975216 ┆ 989      │
│ F-word   ┆ d:d5a183 ┆ 2023-14  ┆ 00af36.n ┆ 20T09:12 ┆ moncrawl ┆          ┆          ┆          │
│ continue ┆ d8-c283- ┆          ┆ etsolhos ┆ :33Z     ┆ /crawl-d ┆          ┆          ┆          │
│ s to rev ┆ 4b30-989 ┆          ┆ t.com/wo ┆          ┆ ata/CC-M ┆         

##### ***** Print the output result

In [5]:
import os
import polars as pl
output_df = pl.read_parquet(os.path.join(os.path.abspath(""), "output", "test1.parquet"))
with pl.Config(fmt_str_lengths=10000000, tbl_rows=-1, tbl_cols=-1):
    print(output_df)

shape: (10, 14)
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬────────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ tex ┆ id  ┆ dum ┆ url ┆ dat ┆ fil ┆ lan ┆ langua ┆ token ┆ doc_n ┆ doc_s ┆ doc_n ┆ token ┆ token │
│ t   ┆ --- ┆ p   ┆ --- ┆ e   ┆ e_p ┆ gua ┆ ge_sco ┆ _coun ┆ um_to ┆ ize_k ┆ um_ch ┆ s_per ┆ s_per │
│ --- ┆ str ┆ --- ┆ str ┆ --- ┆ ath ┆ ge  ┆ re     ┆ t     ┆ kens  ┆ bs    ┆ ars   ┆ _doc_ ┆ _doc_ │
│ str ┆     ┆ str ┆     ┆ str ┆ --- ┆ --- ┆ ---    ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ size  ┆ num_c │
│     ┆     ┆     ┆     ┆     ┆ str ┆ str ┆ f64    ┆ i64   ┆ i64   ┆ f64   ┆ i64   ┆ ---   ┆ hars  │
│     ┆     ┆     ┆     ┆     ┆     ┆     ┆        ┆       ┆       ┆       ┆       ┆ f64   ┆ ---   │
│     ┆     ┆     ┆     ┆     ┆     ┆     ┆        ┆       ┆       ┆       ┆       ┆       ┆ f64   │
╞═════╪═════╪═════╪═════╪═════╪═════╪═════╪════════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ The ┆ <ur ┆ CC- ┆ htt ┆ 202 ┆ s3: ┆ en  ┆ 0.9752 ┆ 989   ┆ 1071  ┆ 4.066 