##### These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be preconfigured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
cd tokenization2arrow
make venv 
source venv/bin/activate 
pip install jupyterlab
pip install -U ipywidgets
./venv/bin/jupyter lab
```

##### Configure the transform parameters. The following dictionary keys hold the Tokenization2Arrow configuration values:
| Name | Description|
| -----|------------|
|tkn_tokenizer | Tokenizer used for tokenization. It can also be a path to a pre-trained tokenizer. By default, `hf-internal-testing/llama-tokenizer` from HuggingFace is used. |
|tkn_tokenizer_args | Arguments for the tokenizer. For example, `cache_dir=/tmp/hf,use_auth_token=Your_HF_authentication_token` could be the arguments for the `bigcode/starcoder` tokenizer from HuggingFace. |
|tkn_doc_id_column | Column containing the document ID, whose values should be unique across the dataset. |
|tkn_doc_content_column | Column containing the document content. |
|tkn_text_lang | Language of the text content, used for better text splitting if needed. |
|tkn_chunk_size | Set to a value >0 to tokenize each row/doc in chunks of that many characters (rounded to whole words). |
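
To make the `tkn_chunk_size` description more concrete, here is a minimal sketch of character-based chunking that only breaks at word boundaries. The function `split_into_chunks` is hypothetical and written for illustration; it is not the transform's actual implementation, which may split text differently (for example, depending on `tkn_text_lang`).

```python
def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Illustrative only: split text into chunks of roughly chunk_size
    characters, breaking at spaces so that words stay whole."""
    if chunk_size <= 0:
        # Assumption: a non-positive size means "do not chunk".
        return [text]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for word in text.split(" "):
        # Count one extra character for the joining space.
        added = len(word) + (1 if current else 0)
        if current and current_len + added > chunk_size:
            # The word does not fit, so close the current chunk.
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            current.append(word)
            current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_into_chunks("the quick brown fox jumps", 10))
# → ['the quick', 'brown fox', 'jumps']
```

In this sketch, a single word longer than `chunk_size` is kept whole in its own chunk rather than being cut.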


In [1]:
# Remove output folder
!rm -rf output01

In [9]:
# Set the Hugging Face token used to download the llama tokenizer (replace the placeholder with your own token)
import os
os.environ['HF_TOKEN'] = 'hf_XXX'

In [10]:
from dpk_tokenization2arrow.ray.runtime import Tokenization2Arrow

##### Set up runtime parameters for this transform

In [None]:
Tokenization2Arrow(
    input_folder="test-data/ds01/input",
    output_folder="output01",
    tkn_tokenizer="hf-internal-testing/llama-tokenizer",
    run_locally=True,
    tkn_chunk_size=20_000,
).transform()

### Explore output folder

In [12]:
!tree output01

output01
├── lang=en
│   ├── dataset=cybersecurity_v2.0
│   │   └── version=2.3.2
│   │       └── pq03.snappy.arrow
│   ├── pq01.arrow
│   └── pq02.arrow
├── meta
│   └── lang=en
│       ├── dataset=cybersecurity_v2.0
│       │   └── version=2.3.2
│       │       ├── pq03.snappy.docs
│       │       └── pq03.snappy.docs.ids
│       ├── pq01.docs
│       ├── pq01.docs.ids
│       ├── pq02.docs
│       └── pq02.docs.ids
└── metadata.json

8 directories, 10 files
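
If `tree` is not installed, the same overview can be approximated in Python. The sketch below uses only the standard library; the helper name `count_files_by_suffix` is made up for this example and is not part of the transform.

```python
from collections import Counter
from pathlib import Path


def count_files_by_suffix(root: str) -> Counter:
    """Walk the folder recursively and count files by their last suffix."""
    counts: Counter = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Path.suffix is the final extension only,
            # e.g. ".ids" for "pq01.docs.ids".
            counts[path.suffix] += 1
    return counts


if __name__ == "__main__":
    # Assumes the transform has already written to "output01".
    if Path("output01").is_dir():
        for suffix, count in sorted(count_files_by_suffix("output01").items()):
            print(f"{suffix}: {count}")
```

For the listing above, this would report three `.arrow`, three `.docs`, three `.ids`, and one `.json` file.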


### Check metadata.json

In [None]:
!cat output01/metadata.json
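
As an alternative to `cat`, the metadata file can be loaded with Python's `json` module for programmatic inspection. This is a minimal sketch: the helper names are hypothetical, and no particular keys are assumed because the file's schema is not described here.

```python
import json
from pathlib import Path


def load_metadata(path: str) -> dict:
    """Read a JSON metadata file and return its parsed content."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def top_level_keys(metadata: dict) -> list[str]:
    """Return the top-level keys in sorted order, without assuming a schema."""
    return sorted(metadata)


if __name__ == "__main__":
    meta_path = Path("output01/metadata.json")
    # Only runs if the transform has already produced the file.
    if meta_path.exists():
        metadata = load_metadata(str(meta_path))
        print(top_level_keys(metadata))
        print(json.dumps(metadata, indent=2))
```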