##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab could be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
cd tokenization2arrow
make venv 
source venv/bin/activate 
pip install jupyterlab
pip install -U ipywidgets
./venv/bin/jupyter lab
```
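
Whichever route is used to set up the venv, it can help to confirm which release the notebook kernel actually sees. The snippet below is a minimal sketch using only the standard library. The distribution name `data-prep-toolkit-transforms` is an assumption that does not come from this notebook, so substitute whatever package your requirements file pins.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_release(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Assumed distribution name; replace it with the package your environment pins.
print(installed_release("data-prep-toolkit-transforms"))
```

A result of `None` means the package is not installed in the environment backing the current kernel.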

##### **** Configure the transform parameters. The dictionary keys that hold the Tokenization2Arrow configuration values are as follows:
| Name | Description|
| -----|------------|
|tkn_tokenizer | Tokenizer used for tokenization. It can also be a path to a pre-trained tokenizer. By default, `hf-internal-testing/llama-tokenizer` from HuggingFace is used |
|tkn_tokenizer_args | Arguments for the tokenizer. For example, `cache_dir=/tmp/hf,use_auth_token=Your_HF_authentication_token` could be the arguments for the tokenizer `bigcode/starcoder` from HuggingFace |
|tkn_doc_id_column | Column containing the document ID, whose values should be unique across the dataset |
|tkn_doc_content_column | Column containing the document content |
|tkn_text_lang | Language of the text content, used for better text splitting if needed |
|tkn_chunk_size | Specify a value >0 to tokenize each row/doc in chunks of characters (rounded to whole words) |
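
To make the `tkn_chunk_size` behavior more concrete, here is a minimal sketch of splitting a document into chunks of roughly `chunk_size` characters that always end on a word boundary. It is an illustration under stated assumptions, not the transform's actual implementation: the helper name `split_into_chunks` is hypothetical, words are taken to be whitespace-separated, and a single word longer than `chunk_size` is kept whole.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters, on word boundaries.

    A non-positive chunk_size disables chunking and returns the text unchanged.
    """
    if chunk_size <= 0:
        return [text]

    chunks: List[str] = []
    current: List[str] = []  # words collected for the chunk being built
    current_len = 0          # character length of the current chunk, including spaces

    for word in text.split():
        # Account for the joining space when the chunk already holds a word.
        added = len(word) + (1 if current else 0)
        if current and current_len + added > chunk_size:
            chunks.append(" ".join(current))
            current = [word]
            current_len = len(word)
        else:
            current.append(word)
            current_len += added

    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_into_chunks("the quick brown fox jumps", 10))
# → ['the quick', 'brown fox', 'jumps']
```

In a chunked setup, each chunk would then be tokenized separately and the resulting token lists concatenated for the document.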
```

In [3]:
# Remove output folder
!rm -rf output01

In [4]:
from dpk_tokenization2arrow.transform_ray import Tokenization2Arrow

##### ***** Set up runtime parameters for this transform

In [5]:
Tokenization2Arrow(
    input_folder="test-data/ds01/input",
    output_folder="output01",
    tkn_tokenizer="hf-internal-testing/llama-tokenizer",
    run_locally=True,
    tkn_chunk_size=20_000,
).transform()

22:39:47 INFO - pipeline id pipeline_id
22:39:47 INFO - code location None
22:39:47 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}
22:39:47 INFO - actor creation delay 0
22:39:47 INFO - job details {'job category': 'preprocessing', 'job name': 'Tokenization2Arrow', 'job type': 'ray', 'job id': 'job_id'}
22:39:47 INFO - data factory data_ is using local data access: input_folder - test-data/ds01/input output_folder - output01
22:39:47 INFO - data factory data_ max_files -1, n_sample -1
22:39:47 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
22:39:47 INFO - Running locally
2025-02-11 22:39:48,431	INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(orchestrate pid=45135) 22:39:51 INFO - orchestrator started at 2025-02-11 22:39:51
(orchestrate pid=45135) 22:39:51 INFO -

0

### Explore output folder

In [7]:
!tree output01

output01
├── lang=en
│   ├── dataset=cybersecurity_v2.0
│   │   └── version=2.3.2
│   │       └── pq03.snappy.arrow
│   ├── pq01.arrow
│   └── pq02.arrow
├── meta
│   └── lang=en
│       ├── dataset=cybersecurity_v2.0
│       │   └── version=2.3.2
│       │       ├── pq03.snappy.docs
│       │       └── pq03.snappy.docs.ids
│       ├── pq01.docs
│       ├── pq01.docs.ids
│       ├── pq02.docs
│       └── pq02.docs.ids
└── metadata.json

8 directories, 10 files


### Check metadata.json

In [8]:
!cat output01/metadata.json

{
  "pipeline": "pipeline_id",
  "job details": {
    "job category": "preprocessing",
    "job name": "Tokenization2Arrow",
    "job type": "ray",
    "job id": "job_id",
    "start_time": "2025-02-11 22:39:51",
    "end_time": "2025-02-11 22:39:54",
    "status": "success"
  },
  "code": null,
  "job_input_params": {
    "tokenizer": "hf-internal-testing/llama-tokenizer",
    "tokenizer_args": null,
    "doc_id_column": "document_id",
    "doc_content_column": "contents",
    "text_lang": "en",
    "chunk_size": 20000,
    "checkpointing": false,
    "max_files": -1,
    "random_samples": -1,
    "files_to_use": [
      ".parquet"
    ],
    "number of workers": 1,
    "worker options": {
      "num_cpus": 0.8,
      "max_restarts": -1
    },
    "actor creation delay": 0
  },
  "execution_stats": {
    "cpus": 10,
    "gpus": 0,
    "memory": 14.796653747558594,
    "object_store": 2.0,
    "execution time, min": 0.055
  },
  "job_output_stats": {
    "source_files": 5,
    "source_