# Tokenization

This notebook presents how sequences are tokenized for ProkBERT.
It covers:
  * Tokenization process background and parameters
  * Tokenization of sequences
  * Tokenization for pretraining
  * HDF datasets for storing preprocessed sequence data


## Tokenization of Sequence Data

ProkBERT employs Local Context Aware (LCA) tokenization, which uses overlapping k-mers to capture rich local context information, enhancing model generalization and performance. The key parameters are the k-mer size and the shift, i.e. the distance between the start positions of consecutive k-mers. For instance, with a k-mer size of 6 and a shift of 1, the tokenization captures detailed sequence information, while a k-mer size of 1 represents a basic character-based approach.

### Segmentation Strategies
Before tokenization, sequences are segmented using two main approaches:
1. **Contiguous Sampling**: Divides contigs into non-overlapping segments.
2. **Random Sampling**: Fragments the input sequence into randomly sized segments.
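
The two strategies can be illustrated with a small, dependency-free sketch. It is not ProkBERT's implementation: the function names `contiguous_segments` and `random_segments` and their parameters are hypothetical and only show the idea.

```python
import random

def contiguous_segments(contig: str, length: int) -> list[str]:
    """Split `contig` into consecutive, non-overlapping segments of at most `length`."""
    return [contig[i:i + length] for i in range(0, len(contig), length)]

def random_segments(contig: str, n: int, min_length: int, max_length: int,
                    seed: int = 0) -> list[str]:
    """Draw `n` fragments with a random length and a random start position.

    Assumes min_length <= len(contig); this is an illustrative sketch only.
    """
    rng = random.Random(seed)
    segments = []
    for _ in range(n):
        length = rng.randint(min_length, min(max_length, len(contig)))
        start = rng.randint(0, len(contig) - length)
        segments.append(contig[start:start + length])
    return segments

contig = "ATGTCCGCGACCTTTCATACATACCACCGGTAC"
print(contiguous_segments(contig, 10))
print(random_segments(contig, n=3, min_length=6, max_length=12))
```

Contiguous segments cover every position exactly once, while random fragments may overlap or leave gaps.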

### Tokenization Process
After segmentation, sequences are encoded into a simpler vector format. The LCA method is pivotal in this phase, allowing the model to use a broader context and reducing computational demands while maintaining the information-rich local context.

### Context Size Limitations
It's important to note that transformer models, including ProkBERT, have a context size limitation. ProkBERT's design accommodates context sizes significantly larger than an average gene, yet smaller than the average bacterial genome.

We provide pretrained models for variants like ProkBERT-mini (k-mer size 6, shift 1), ProkBERT-mini-c (k-mer size 1, shift 1), and ProkBERT-mini-long (k-mer size 6, shift 2), catering to different sequence analysis requirements.
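
To get a feeling for how the k-mer size and shift relate the token budget to the number of nucleotides covered, the following sketch may help. The helper names and the `token_budget` value are assumptions made for illustration; they are not taken from the ProkBERT code or model configurations.

```python
def num_kmers(seq_len: int, k: int, shift: int, offset: int = 0) -> int:
    """Number of complete k-mers in a sequence of `seq_len` nucleotides."""
    usable = seq_len - offset
    return 0 if usable < k else (usable - k) // shift + 1

def covered_nucleotides(n_kmers: int, k: int, shift: int) -> int:
    """Nucleotides spanned by `n_kmers` consecutive k-mers."""
    return 0 if n_kmers <= 0 else k + (n_kmers - 1) * shift

token_budget = 1024  # hypothetical context size, including [CLS] and [SEP]
variants = {
    "k-mer size 6, shift 1": (6, 1),
    "k-mer size 1, shift 1": (1, 1),
    "k-mer size 6, shift 2": (6, 2),
}
for name, (k, shift) in variants.items():
    print(name, "->", covered_nucleotides(token_budget - 2, k, shift), "nucleotides")
```

For the same number of tokens, a shift of 2 spans roughly twice as many nucleotides as a shift of 1.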

<img src="https://github.com/nbrg-ppcu/prokbert/blob/main/assets/Figure2_tokenization.png?raw=true" width="800" alt="Segmentation Process"> 

*Figure: The tokenization process in ProkBERT.*

Note that when $shift > 1$, there are multiple possible tokenizations, depending on where the tokenization starts. The window offset specifies the starting position of the tokenization window.
For example, for the sequence `ATGTCCGCGACCTTTCATACATACCACCGGTAC` with k-mer size 6 and shift 2, there are two possible tokenizations:

Tokenization with offset=0:
```plaintext
    ATGTCCGCGACCTTTCATACATACCACCGGTAC
0.  ATGTCC  GACCTT  ATACAT  CACCGG
1.    GTCCGC  CCTTTC  ACATAC  CCGGTA
2.      CCGCGA  TTTCAT  ATACCA
3.        GCGACC  TCATAC  ACCACC
```
Tokenization with offset=1:
```plaintext
    ATGTCCGCGACCTTTCATACATACCACCGGTAC
0.   TGTCCG  ACCTTT  TACATA  ACCGGT
1.     TCCGCG  CTTTCA  CATACC  CGGTAC
2.       CGCGAC  TTCATA  TACCAC
3.         CGACCT  CATACA  CCACCG
```
By default, all possible tokenizations are returned.
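
The k-mers listed above can be reproduced with a few lines of plain Python. This is a minimal sketch of the windowing only; the `kmerize` helper is hypothetical and does not map k-mers to token IDs as the ProkBERT tokenizer does.

```python
def kmerize(sequence: str, k: int = 6, shift: int = 1, offset: int = 0) -> list[str]:
    """Return the overlapping k-mers of `sequence`, starting at `offset`."""
    last_start = len(sequence) - k  # last position where a full k-mer fits
    return [sequence[i:i + k] for i in range(offset, last_start + 1, shift)]

sequence = "ATGTCCGCGACCTTTCATACATACCACCGGTAC"
for offset in range(2):  # with shift=2 there are two distinct windows
    print(f"offset={offset}:", " ".join(kmerize(sequence, k=6, shift=2, offset=offset)))
```

In general, a shift of `s` yields `s` distinct tokenizations, one per offset in `range(s)`.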


## Key Tokenization Parameters

The most important tokenization parameters are the **k-mer size** and **shift**. 

The autotokenizer takes care of these for you, matching the right settings to the model. Easy and hassle-free.

In [None]:
from transformers import AutoTokenizer

# Load the autotokenizer for the ProkBERT-mini model
tokenizer = AutoTokenizer.from_pretrained("neuralbioinfo/prokbert-mini", trust_remote_code=True)

# Sample sequence
sequence = "ATGTCCGCGACCTTTCATACATACCACCGGTAC"

# Tokenize the sequence
tokens = tokenizer.tokenize(sequence)

# Encode the sequence to get token IDs
encoded = tokenizer(sequence, return_tensors="pt")
token_ids = encoded["input_ids"]

print("Tokens:", tokens)
print("Token IDs:", token_ids)

### Converting Token IDs Back to Sequences
Decode the token IDs to view the k-mer sequence:

In [None]:
token_ids = tokenizer("ATGTCCGCGACCTT", return_tensors="pt")["input_ids"]
decoded_sequence = tokenizer.decode(token_ids[0])
print(decoded_sequence)  # Output: [CLS] ATGTCCGCGACCTT [SEP]


## Processing Batches of Sequences

Keep in mind: the tokenizer doesn’t clean or preprocess your data. It assumes your sequences are ready to go—uppercase, properly chunked, and strictly nucleotide sequences. If your data isn’t quite there yet, you can use the **segmentation process** (see the [segmentation notebook](https://github.com/nbrg-ppcu/prokbert/blob/main/examples/Segmentation.ipynb)) and the handy **sequtils** in ProkBERT, which are built to handle large corpora of sequence data.

Already have a dataset in a Pandas DataFrame or a Hugging Face Dataset? Just define a `tokenize_function` and run the tokenization process. 

- **For training**: Make sure you prepare `input_ids`, `attention_mask`, and `labels`.
- **For inference**: You only need `input_ids` and `attention_mask`—no labels required.

If you’re working with large datasets, Hugging Face’s dataset utilities make it efficient to tokenize on the fly during training. 

For more details:
- Check out the **inference notebook** for inference examples.
- Dive into the **fine-tuning notebook** if you're preparing for model training.

Tokenizing big sequence corpora doesn’t have to be a headache! 😉


In [None]:
# Install the Hugging Face datasets library for the examples
!pip install datasets

In [None]:
from transformers import AutoTokenizer
from datasets import Dataset

# Load tokenizer and dataset
tokenizer = AutoTokenizer.from_pretrained("neuralbioinfo/prokbert-mini", trust_remote_code=True)
data = {"segment": ["ATGTCCGCGACCTT", "TGCATACCAGTCCG"]}
dataset = Dataset.from_dict(data)

# Define tokenization function
def tokenize_function(examples):
    return tokenizer(examples["segment"], padding=True, truncation=True)

# Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
print(tokenized_dataset)




### Batch Tokenization with Labels for Training

Prepare batches of sequences with labels for training.




In [None]:
def tokenize_with_labels(examples):
    encoded = tokenizer(examples["segment"], padding=True, truncation=True)
    encoded["labels"] = examples["y"]
    return encoded

# Example dataset
data = {"segment": ["ATGTCC", "TGCATC"], "y": [1, 0]}
dataset = Dataset.from_dict(data)
tokenized_dataset = dataset.map(tokenize_with_labels, batched=True)


### Tokenization Function for Training

The `tokenize_function` prepares sequence data for ProkBERT training by encoding input sequences, handling special token masks, and attaching labels.

#### Key Steps:
- **Tokenization**: Uses `batch_encode_plus` to tokenize the `segment` field, adding padding and special tokens. The resulting `input_ids` and `attention_mask` tensors are detached for further processing.
- **Masking Special Tokens**: Updates the `attention_mask` to ignore tokens with IDs `2` and `3` (e.g., padding or special tokens) for better training efficiency.
- **Label Preparation**: Converts the `y` column (class labels) into a PyTorch tensor, labeled as `labels`.

#### Notes:
- If segment lengths vary significantly, consider using a `data_collator` (e.g., Hugging Face's `DataCollatorWithPadding`) to handle padding and batching dynamically during training.

This function ensures clean and efficient data preparation for training while handling masking and label integration seamlessly.


In [None]:
from transformers import AutoTokenizer, DataCollatorWithPadding
from datasets import Dataset
import torch

# Load the tokenizer for ProkBERT-mini
tokenizer = AutoTokenizer.from_pretrained("neuralbioinfo/prokbert-mini", trust_remote_code=True)

# Sample dataset
data = {
    "segment": [
        "ATGTCCGCGACCTT",
        "TGCATACCAGTCCG",
        "ATGCC",
        "GCGTACCAG",
    ],
    "y": [1, 0, 1, 0]
}
dataset = Dataset.from_dict(data)

# Define the tokenization function
def tokenize_function(examples):
    # Tokenize and preprocess the input sequences
    encoded = tokenizer.batch_encode_plus(
        examples["segment"],
        add_special_tokens=True,
        padding=True,
        return_tensors="pt",
    )

    # Clone and modify input_ids and attention_mask for masking special tokens
    input_ids = encoded["input_ids"].clone().detach()
    attention_mask = encoded["attention_mask"].clone().detach()
    mask_tokens = (input_ids == 2) | (input_ids == 3)
    attention_mask[mask_tokens] = 0

    # Add labels
    labels = torch.tensor(examples["y"], dtype=torch.int64)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)



# The old solution: tokenization with `prokbert.sequtils`

## Tokenization parameters

The following table outlines the configuration parameters for ProkBERT, detailing their purpose, default values, types, and constraints.


| Parameter | Description | Type | Default | Constraints |
|-----------|-------------|------|---------|-------------|
| **Tokenization** |
| `type` | Describes the tokenization approach. By default, the LCA (Local Context Aware) method is used. | string | `lca` | Options: `lca` |
| `kmer` | Determines the k-mer size for the tokenization process. | integer | 6 | Options: 1-9 |
| `shift` | Shift between the start positions of consecutive k-mers. | integer | 1 | Min: 0 |
| `max_segment_length` | Maximum number of characters in a segment. This should be consistent with the language model's capability. It can be used interchangeably with `token_limit`. | integer | 2050 | Min: 6, Max: 4294967296 |
| `token_limit` | Maximum token count that the language model can process, including special tokens such as CLS and SEP. It can be used interchangeably with `max_segment_length`. | integer | 4096 | Min: 1, Max: 4294967296 |
| `max_unknown_token_proportion` | Maximum allowed proportion of unknown tokens in a sequence. For instance, with `max_unknown_token_proportion=0.1`, a segment in which 10% of the tokens are unknown won't be tokenized. | float | 0.9999 | Min: 0, Max: 1 |
| `vocabfile` | Path to the vocabulary file. If set to 'auto', the default vocabulary is used. | string | `auto` | - |
| `vocabmap` | The default vocabulary map loaded from file. | dict | `{}` | - |
| `isPaddingToMaxLength` | Determines if the tokenized sentence should be padded with [PAD] tokens to produce vectors of a fixed length. | bool | False | Options: True, False |
| `add_special_token` | Whether the tokenizer adds the special start and end tokens. | bool | True | Options: True, False |
| **Computation** |
| `cpu_cores_for_segmentation` | Specifies the number of CPU cores allocated for the segmentation process. | integer | 10 | Min: 1 |
| `cpu_cores_for_tokenization` | Allocates a certain number of CPU cores for the k-mer tokenization process. | integer | -1 | Min: 1 |
| `batch_size_tokenization` | Determines the number of segments a single core processes at a time. The input segment list will be divided into chunks of this size. | integer | 10000 | Min: 1 |
| `batch_size_fasta_segmentation` | Sets the number of fasta files processed in a single batch, useful when dealing with a large number of fasta files. | integer | 3 | Min: 1 |
| `numpy_token_integer_prec_byte` | Integer width, in bytes, used during vectorization. The default is 2; to work with larger k-mers, increase it to 4. Mapping: 1: `np.int8`, 2: `np.int16`, 4: `np.int32`, 8: `np.int64`. | integer | 2 | Options: 1, 2, 4, 8 |
| `np_tokentype` | Dummy | type | `np.int16` | - |
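
Why larger k-mers need a wider integer type can be checked with simple arithmetic. The sketch below assumes that the vocabulary contains every k-mer over `{A, C, G, T}` plus five special tokens; both the special-token count and the helper names are assumptions, not values read from the ProkBERT vocabulary files.

```python
def vocab_size(k: int, n_special: int = 5) -> int:
    """All DNA k-mers plus special tokens (the special-token count is assumed)."""
    return 4 ** k + n_special

def min_precision_bytes(k: int) -> int:
    """Smallest signed integer width, in bytes, that can hold the largest token ID."""
    max_id = vocab_size(k) - 1
    for n_bytes in (1, 2, 4, 8):
        if max_id <= 2 ** (8 * n_bytes - 1) - 1:
            return n_bytes
    raise ValueError("k is too large for a 64-bit token ID")

for k in (1, 6, 8):
    print(f"k={k}: vocabulary size {vocab_size(k)}, bytes needed {min_precision_bytes(k)}")
```

Under these assumptions, k-mer size 6 fits into `np.int16`, whereas k-mer size 8 exceeds its range and needs `np.int32`.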


## Installation of ProkBERT (if needed)



In [None]:
try:
    import prokbert
    print("ProkBERT is already installed.")
except ImportError:
    !pip install prokbert
    print("Installed ProkBERT.")


# Sequence tokenization examples

In [None]:
from prokbert.config_utils import *
from prokbert.sequtils import *
tokenization_parameters = {'kmer' : 6,
                          'shift' : 2}

segment = 'ATGTCCGCGACCT'
defconfig = SeqConfig()  # For the detailed configuration parameters, see the table above
tokenization_params = defconfig.get_and_set_tokenization_parameters(tokenization_parameters)
tokens, kmers = lca_tokenize_segment(segment, tokenization_params)

print(' '.join([str(t) for t in tokens]))
print(' '.join(kmers[0]))

results_pretty_print = pretty_print_overlapping_sequence(segment, kmers[0], tokenization_params)
print(results_pretty_print)



## Tokenizing Longer Sequences

To tokenize sequences longer than what the current ProkBERT model supports, you can adjust the `token_limit` and `max_segment_length` parameters. Keep in mind that the tokenization process is parallelized using Python's multiprocessing module at the segment level. Therefore, it's important to also consider adjusting the number of cores utilized, as well as the `batch_size_tokenization` parameter, which determines how many sequences a core should process at once. Failing to appropriately adjust these settings might lead to memory issues.


### Example Python Code for Long Sequence Tokenization

Below is an example of how you can configure and use the tokenization for longer sequences:

In [None]:
from prokbert.sequtils import lca_tokenize_segment
from prokbert.config_utils import SeqConfig

# Tokenization parameters
tokenization_parameters = {
    'kmer': 6,
    'shift': 1,
    'max_segment_length': 2000000,
    'token_limit': 2000000
}

# Example of a long sequence
segment = 'ATGTCCGCGACCT' * 100000

# Default configuration for tokenization
defconfig = SeqConfig() # For detailed configuration parameters, refer to the table above

# Get and set tokenization parameters
tokenization_params = defconfig.get_and_set_tokenization_parameters(tokenization_parameters)

# Perform tokenization
tokens, kmers = lca_tokenize_segment(segment, tokenization_params)

# Tokenization for Pretraining

Given the abundance and large size of sequence data, preprocessing this data in advance and storing it for later use is a recommended practice. The primary steps in this process are segmentation and tokenization.

The outcome of tokenization is a set of vectors, which need to be converted into a matrix-like structure, typically through padding. Additionally, randomizing these vectors is essential for effective training. The `sequtils` module in ProkBERT includes utilities to facilitate these steps. Below, we outline some examples of how to accomplish this.

## Basic Steps for Preprocessing:

1. **Load Fasta Files**: Begin by loading the raw sequence data from FASTA files.
2. **Segment the Raw Sequences**: Apply segmentation parameters to split the sequences into manageable segments.
3. **Tokenize the Segmented Database**: Use the defined tokenization parameters to convert the segments into tokenized forms.
4. **Create a Padded/Truncated Array**: Generate a uniform array structure, padding or truncating as necessary.
5. **Save the Array to HDF**: Store the processed data in an HDF (Hierarchical Data Format) file for efficient retrieval and use in training models.
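
Step 4 and the shuffling mentioned earlier can be sketched without any dependencies. The actual pipeline uses `get_rectangular_array_from_tokenized_dataset` and `save_to_hdf` from `sequtils`; the helpers below are hypothetical stand-ins, and the `[PAD]` ID of 0 and the toy token IDs are assumptions.

```python
import random

PAD_ID = 0  # assumed ID of the [PAD] token; check it against your vocabulary

def pad_or_truncate(token_lists, max_len, pad_id=PAD_ID):
    """Return rows of exactly `max_len` tokens: long rows are cut, short rows padded.

    Note that plain truncation also drops a trailing end token.
    """
    rows = []
    for tokens in token_lists:
        row = list(tokens[:max_len])
        rows.append(row + [pad_id] * (max_len - len(row)))
    return rows

def shuffled(rows, seed=42):
    """Return a shuffled copy of `rows`, reproducible for a given seed."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows

# Toy token vectors of different lengths (the IDs are made up)
tokenized = [[2, 10, 11, 12, 3], [2, 20, 3], [2, 30, 31, 32, 33, 34, 3]]
X = shuffled(pad_or_truncate(tokenized, max_len=6))
for row in X:
    print(row)
```

Every row of `X` has the same length, so the result can be converted to a 2D array and stored in a single HDF dataset.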


In [None]:
import pkg_resources
from os.path import join
from prokbert.sequtils import *

# Directory for pretraining FASTA files
pretraining_fasta_files_dir = pkg_resources.resource_filename('prokbert','data/pretraining')

# Define segmentation and tokenization parameters
segmentation_params = {
    'max_length': 256,  # Split the sequence into segments of length L
    'min_length': 6,
    'type': 'random'
}
tokenization_parameters = {
    'kmer': 6,
    'shift': 1,
    'max_segment_length': 2003,
    'token_limit': 2000
}

# Setup configuration
defconfig = SeqConfig()
segmentation_params = defconfig.get_and_set_segmentation_parameters(segmentation_params)
tokenization_params = defconfig.get_and_set_tokenization_parameters(tokenization_parameters)

# Load and segment sequences
input_fasta_files = [join(pretraining_fasta_files_dir, file) for file in get_non_empty_files(pretraining_fasta_files_dir)]
sequences = load_contigs(input_fasta_files, IsAddHeader=True, adding_reverse_complement=True, AsDataFrame=True, to_uppercase=True, is_add_sequence_id=True)
segment_db = segment_sequences(sequences, segmentation_params, AsDataFrame=True)

# Tokenization
tokenized = batch_tokenize_segments_with_ids(segment_db, tokenization_params)
expected_max_token = max(len(arr) for arrays in tokenized.values() for arr in arrays)
X, torchdb = get_rectangular_array_from_tokenized_dataset(tokenized, tokenization_params['shift'], expected_max_token)

# Save to HDF file
hdf_file = '/tmp/pretraining.h5'
save_to_hdf(X, hdf_file, database=torchdb, compression=True)


### Tokenization with tokenizer class

The tokenizer class can also be used for tokenization, and it provides several additional features. The tokenizer can operate either in the original sequence space or in k-mer space; the default is k-mer space.
Encoding and decoding sequences are important operations, and examples are given here.
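
The relationship between the sequence space and the k-mer space can be sketched in plain Python. This is not the tokenizer class's API: `kmerize` and `kmers_to_sequence` are hypothetical helpers that work on k-mer strings rather than token IDs, and they assume `shift <= k`.

```python
def kmerize(sequence: str, k: int, shift: int, offset: int = 0) -> list[str]:
    """Encode: overlapping k-mers of `sequence`, starting at `offset`."""
    return [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, shift)]

def kmers_to_sequence(kmers: list[str], shift: int) -> str:
    """Decode: the first k-mer plus the last `shift` characters of each later k-mer."""
    if not kmers:
        return ""
    return kmers[0] + "".join(kmer[-shift:] for kmer in kmers[1:])

sequence = "ATGTCCGCGACCTTTCATACATACCACCGGTAC"
kmers = kmerize(sequence, k=6, shift=2)
restored = kmers_to_sequence(kmers, shift=2)
print(restored)
print(sequence.startswith(restored))
```

With a shift greater than 1, nucleotides before the offset or after the last complete k-mer are not covered by any token, so the decoded sequence can be shorter than the input.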

