# Tokenization Instruction Tutorial

This notebook shows how to run the `tokenization_instruction` task with LMTK. It:

- Loads a YAML config (`tokenization_instruction.yaml`)
- Runs tokenization via the framework
- Inspects and saves the resulting tokenized dataset

**Prerequisites:**
- `tutorials/data/raw_text_data/instruction_data.txt` exists.
- All dependencies are installed (see README).
- The notebook is run from the project root, so that the relative paths used below resolve correctly.
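
The prerequisites above can be checked up front with a small helper. This is a minimal sketch, not part of LMTK: `missing_prerequisites` and `REQUIRED` are hypothetical names, and the two paths are the ones this notebook refers to.

```python
from pathlib import Path


def missing_prerequisites(root, required):
    """Return the required paths (relative to root) that do not exist."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).exists()]


# Files this notebook expects, relative to the project root.
REQUIRED = [
    "tutorials/data/raw_text_data/instruction_data.txt",
    "tutorials/configs/tokenization_instruction.yaml",
]

missing = missing_prerequisites(Path.cwd(), REQUIRED)
if missing:
    print("Missing files (is the working directory the project root?):", missing)
else:
    print("All prerequisite files found.")
```

If anything is reported missing, the most likely cause is that the notebook kernel was started from a directory other than the project root.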

In [None]:
# Install dependencies if needed (the `box` module is provided by the `python-box` package)
# %pip install pyyaml python-box datasets huggingface_hub
import os
import yaml
from box import Box

config_path = 'tutorials/configs/tokenization_instruction.yaml'
assert os.path.exists(config_path), f'Config file not found: {config_path}'
with open(config_path, 'r') as f:
    config = Box(yaml.safe_load(f))
print('Loaded config:')
print(config)
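
Later cells read `config.dataset.output_dir`, so it can help to confirm that key exists before running the task. The helper below is a sketch under assumptions: `get_nested` is a hypothetical name, and `example_cfg` is an invented stand-in for the real YAML contents. It works on plain nested dicts; since `Box` is a `dict` subclass, it should accept the loaded `config` as well.

```python
def get_nested(cfg, dotted_key):
    """Follow a dotted key such as 'dataset.output_dir' through nested dicts."""
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(f"Missing config key: {dotted_key}")
        node = node[part]
    return node


# Hypothetical config fragment; the real YAML may use a different path and more fields.
example_cfg = {"dataset": {"output_dir": "tutorials/data/tokenized"}}
print(get_nested(example_cfg, "dataset.output_dir"))
```

Failing here with a clear `KeyError` is easier to diagnose than an attribute error after the tokenization run has already finished.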

## Run Tokenization Task
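The next cell runs the `tokenization_instruction` task through the framework's CLI entry point (`src/main.py`) and raises an error if the process exits with a non-zero status.

**Note:** If LMTK exposes a Python API for this task, you can call it directly instead of spawning a subprocess.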

In [None]:
# Run the tokenization task with the same interpreter that runs this notebook
import subprocess
import sys
result = subprocess.run(
    [sys.executable, 'src/main.py', '--config', config_path],
    capture_output=True,
    text=True,
)
print(result.stdout)
if result.returncode != 0:
    print(result.stderr)
    raise RuntimeError(f'Tokenization failed with exit code {result.returncode}')

## Inspect Tokenized Dataset
Load the tokenized dataset from the output directory set in the config (`dataset.output_dir`), then print its structure and one sample.

In [None]:
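from datasets import DatasetDict, load_from_disk
tokenized_path = config.dataset.output_dir
assert os.path.exists(tokenized_path), f'Tokenized dataset not found: {tokenized_path}'
ds = load_from_disk(tokenized_path)
print(ds)
# Show a sample: a DatasetDict is keyed by split name, a Dataset by row index
if isinstance(ds, DatasetDict):
    split = 'test' if 'test' in ds else next(iter(ds))
    print(ds[split][0])
else:
    print(ds[0])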
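
A raw tokenized record can be long, so a compact per-field summary is often easier to read than the full print. The following is a minimal sketch: `summarize_record` is a hypothetical helper, and the column names `input_ids` and `attention_mask` are assumptions, since the actual columns depend on the tokenizer and the task config.

```python
def summarize_record(record, preview=8):
    """Describe each field: length and leading items for sequences, the value otherwise."""
    summary = {}
    for name, value in record.items():
        if isinstance(value, (list, tuple)):
            summary[name] = {"length": len(value), "head": list(value[:preview])}
        else:
            summary[name] = {"value": value}
    return summary


# Hypothetical record; a real one would come from the loaded dataset, e.g. ds[0].
sample = {"input_ids": [101, 2023, 2003, 102], "attention_mask": [1, 1, 1, 1]}
for name, info in summarize_record(sample).items():
    print(name, info)
```

Comparing the reported lengths across fields is a quick way to spot records where the token IDs and masks are out of step.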

## (Optional) Push to the Hub
Uncomment and configure the following cell if you want to push the tokenized dataset to the Hugging Face Hub.
Make sure you are authenticated (for example via `huggingface-cli login`) and have set `output.repo_id` in the config.

In [None]:
# `ds` is the dataset loaded in the previous cell, so no extra import is needed
# ds.push_to_hub(config.output.repo_id, commit_message=config.output.commit_message)
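
Before pushing, it may be worth validating the `output` section of the config so that a missing `repo_id` fails early with a clear message. This is a sketch under assumptions: `hub_push_kwargs` is a hypothetical helper, and the `output.repo_id` and `output.commit_message` keys are taken from the commented call above rather than from a documented LMTK schema.

```python
def hub_push_kwargs(output_cfg):
    """Build keyword arguments for push_to_hub from the config's output section."""
    repo_id = (output_cfg.get("repo_id") or "").strip()
    if not repo_id:
        raise ValueError("Set output.repo_id in the config before pushing.")
    kwargs = {"repo_id": repo_id}
    commit_message = output_cfg.get("commit_message")
    if commit_message:
        kwargs["commit_message"] = commit_message
    return kwargs


# Hypothetical values; replace with your own repository and message.
example_output = {"repo_id": "your-username/tokenized-instructions", "commit_message": "Add tokenized data"}
print(hub_push_kwargs(example_output))
# With a real dataset loaded as `ds`: ds.push_to_hub(**hub_push_kwargs(config.output))
```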