# Tokenization

This notebook presents the tokenization of sequences for ProkBERT.
It covers the following parts:
  * Tokenization process background and parameters
  * Tokenization of sequences
  * Tokenization for pretraining
  * HDF datasets for storing preprocessed sequence data


## Tokenization of Sequence Data

ProkBERT employs Local Context Aware (LCA) tokenization, which uses overlapping k-mers to capture rich local context information, improving model generalization and performance. The key parameters are the k-mer size and the shift. For instance, a k-mer size of 6 with a shift of 1 captures detailed sequence information, while a k-mer size of 1 corresponds to a basic character-based approach.

### Segmentation Strategies
Before tokenization, sequences are segmented using two main approaches:
1. **Contiguous Sampling**: Divides contigs into non-overlapping segments.
2. **Random Sampling**: Fragments the input sequence into randomly sized segments.
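
The two strategies above can be sketched in plain Python. This is only an illustration, not the ProkBERT implementation: the function names, the `segment_length`, `min_length`, and `max_length` parameters, and the choice to draw random start positions are all assumptions made for this example.

```python
import random


def contiguous_segments(sequence, segment_length):
    """Split a sequence into consecutive, non-overlapping segments."""
    return [sequence[i:i + segment_length]
            for i in range(0, len(sequence), segment_length)]


def random_segments(sequence, n_segments, min_length, max_length, seed=None):
    """Draw segments with random lengths and random start positions.

    Assumes min_length <= min(max_length, len(sequence)).
    """
    rng = random.Random(seed)
    segments = []
    for _ in range(n_segments):
        length = rng.randint(min_length, min(max_length, len(sequence)))
        start = rng.randint(0, len(sequence) - length)
        segments.append(sequence[start:start + length])
    return segments


contig = "ATGTCCGCGACCTTTCATACATACCACCGGTAC"
print(contiguous_segments(contig, 10))
print(random_segments(contig, n_segments=3, min_length=8, max_length=12, seed=0))
```

Contiguous segments can be joined back into the original contig, while randomly sampled segments may overlap or leave parts of the contig uncovered.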

### Tokenization Process
After segmentation, sequences are encoded into a simpler vector format. The LCA method is pivotal in this phase, allowing the model to use a broader context and reducing computational demands while maintaining the information-rich local context.

### Context Size Limitations
It's important to note that transformer models, including ProkBERT, have a context size limitation. ProkBERT's design accommodates context sizes significantly larger than an average gene, yet smaller than the average bacterial genome.

We provide pretrained models for variants like ProkBERT-mini (k-mer size 6, shift 1), ProkBERT-mini-c (k-mer size 1, shift 1), and ProkBERT-mini-long (k-mer size 6, shift 2), catering to different sequence analysis requirements.
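
To see how the k-mer size and shift affect the amount of sequence that fits into a fixed number of tokens, the arithmetic can be sketched as follows. The helper names are made up for this example, and `token_budget = 1000` is a hypothetical value, not the context size of any ProkBERT model; special tokens are ignored.

```python
def n_kmers(seq_length, kmer, shift, offset=0):
    """Number of k-mer windows that fit into a sequence of the given length."""
    usable = seq_length - offset
    if usable < kmer:
        return 0
    return (usable - kmer) // shift + 1


def bases_covered(n_tokens, kmer, shift):
    """Number of nucleotides spanned by n_tokens consecutive k-mer tokens."""
    if n_tokens <= 0:
        return 0
    return (n_tokens - 1) * shift + kmer


token_budget = 1000  # hypothetical token budget, for illustration only
for label, kmer, shift in [("k=6, shift=1", 6, 1),
                           ("k=1, shift=1", 1, 1),
                           ("k=6, shift=2", 6, 2)]:
    print(label, "->", bases_covered(token_budget, kmer, shift), "bases")
# k=6, shift=1 -> 1005 bases
# k=1, shift=1 -> 1000 bases
# k=6, shift=2 -> 2004 bases
```

With the same token budget, a shift of 2 spans roughly twice as many nucleotides as a shift of 1.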

<img src="https://github.com/nbrg-ppcu/prokbert/blob/main/assets/Figure2_tokenization.png?raw=true" width="800" alt="Segmentation Process"> 

*Figure: The tokenization process in ProkBERT.*

Note that when $shift > 1$, there are multiple possible tokenizations, depending on where the tokenization starts. The window offset specifies the starting position of the tokenization window.
For example, for the sequence `ATGTCCGCGACCTTTCATACATACCACCGGTAC` with a k-mer size of 6 and a shift of 2, there are two possible tokenizations:

Tokenization with offset=0:
```plaintext
    ATGTCCGCGACCTTTCATACATACCACCGGTAC
0.  ATGTCC  GACCTT  ATACAT  CACCGG
1.    GTCCGC  CCTTTC  ACATAC  CCGGTA
2.      CCGCGA  TTTCAT  ATACCA
3.        GCGACC  TCATAC  ACCACC
```
Tokenization with offset=1:
```plaintext
    ATGTCCGCGACCTTTCATACATACCACCGGTAC
0.   TGTCCG  ACCTTT  TACATA  ACCGGT
1.     TCCGCG  CTTTCA  CATACC  CGGTAC
2.       CGCGAC  TTCATA  TACCAC
3.         CGACCT  CATACA  CCACCG
```
By default, all possible tokenizations are returned.
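
The windowing shown above can be reproduced with a few lines of plain Python. This is a minimal sketch that only extracts the k-mer strings and does not map them to token IDs. The function names are hypothetical, and the assumption that there is one tokenization per offset in `0 .. shift-1` is inferred from the example above rather than taken from the library.

```python
def kmer_windows(sequence, kmer, shift, offset=0):
    """Return the k-mers starting at offset, offset + shift, offset + 2*shift, ..."""
    last_start = len(sequence) - kmer
    return [sequence[i:i + kmer] for i in range(offset, last_start + 1, shift)]


def all_tokenizations(sequence, kmer, shift):
    """One list of k-mers per window offset (assumed: offsets 0 .. shift-1)."""
    return [kmer_windows(sequence, kmer, shift, offset) for offset in range(shift)]


seq = "ATGTCCGCGACCTTTCATACATACCACCGGTAC"
for offset, kmers in enumerate(all_tokenizations(seq, kmer=6, shift=2)):
    print(f"offset={offset}:", " ".join(kmers))
```

For this sequence, each of the two offsets yields 14 k-mers, matching the two layouts shown above.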


## Tokenization parameters

The following table outlines the configuration parameters for ProkBERT, detailing their purpose, default values, types, and constraints.


| Parameter | Description | Type | Default | Constraints |
|-----------|-------------|------|---------|-------------|
| **Tokenization** |
| `type` | Describes the tokenization approach. By default, the LCA (Local Context Aware) method is used. | string | `lca` | Options: `lca` |
| `kmer` | Determines the k-mer size for the tokenization process. | integer | 6 | Options: 1-9 |
| `shift` | Sets the shift (step size) between the start positions of consecutive k-mers. The default value is 1. | integer | 1 | Min: 0 |
| `max_segment_length` | Gives the maximum number of characters in a segment. This should be consistent with the language model's capability. It can be used interchangeably with `token_limit`. | integer | 2050 | Min: 6, Max: 4294967296 |
| `token_limit` | States the maximum token count that the language model can process, inclusive of special tokens like CLS and SEP. It can be used interchangeably with `max_segment_length`. | integer | 4096 | Min: 1, Max: 4294967296 |
| `max_unknown_token_proportion` | Defines the maximum allowed proportion of unknown tokens in a sequence. For instance, with `max_unknown_token_proportion=0.1`, a segment in which more than 10% of the tokens are unknown won't be tokenized. | float | 0.9999 | Min: 0, Max: 1 |
| `vocabfile` | Path to the vocabulary file. If set to 'auto', the default vocabulary is utilized. | string | `auto` | - |
| `vocabmap` | The default vocabmap loaded from file | dict | `{}` | - |
| `isPaddingToMaxLength` | Determines if the tokenized sentence should be padded with [PAD] tokens to produce vectors of a fixed length. | bool | False | Options: True, False |
| `add_special_token` | Determines whether the tokenizer adds the special start and end-of-sentence tokens. Enabled by default. | bool | True | Options: True, False |
| **Computation** |
| `cpu_cores_for_segmentation` | Specifies the number of CPU cores allocated for the segmentation process. | integer | 10 | Min: 1 |
| `cpu_cores_for_tokenization` | Allocates a certain number of CPU cores for the k-mer tokenization process. | integer | -1 | Min: 1 |
| `batch_size_tokenization` | Determines the number of segments a single core processes at a time. The input segment list will be divided into chunks of this size. | integer | 10000 | Min: 1 |
| `batch_size_fasta_segmentation` | Sets the number of fasta files processed in a single batch, useful when dealing with a large number of fasta files. | integer | 3 | Min: 1 |
| `numpy_token_integer_prec_byte` | The integer type used during vectorization, given as a byte width. The default is 2; to work with larger k-mers, increase it to 4. 1: np.int8, 2: np.int16, 4: np.int32, 8: np.int64. | integer | 2 | Options: 1, 2, 4, 8 |
| `np_tokentype` | Dummy | type | `np.int16` | - |
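
To illustrate why `numpy_token_integer_prec_byte` may need to be increased for larger k-mers, the following sketch estimates the smallest signed integer width that can hold every token ID. It assumes a vocabulary of all $4^k$ nucleotide k-mers plus a handful of special tokens. The value `n_special_tokens=5` and the function names are assumptions made for this example, not values taken from ProkBERT.

```python
def max_signed_int(n_bytes):
    """Largest value representable by a signed integer of n_bytes bytes."""
    return 2 ** (8 * n_bytes - 1) - 1


def smallest_int_width(kmer, n_special_tokens=5, alphabet_size=4):
    """Smallest byte width (1, 2, 4 or 8) that can store the largest token ID."""
    vocab_size = alphabet_size ** kmer + n_special_tokens  # assumed vocabulary layout
    largest_id = vocab_size - 1
    for n_bytes in (1, 2, 4, 8):
        if largest_id <= max_signed_int(n_bytes):
            return n_bytes
    raise ValueError("vocabulary too large for a 64-bit integer")


for kmer in (1, 6, 8, 9):
    print(f"kmer={kmer}: {smallest_int_width(kmer)} byte(s)")
# kmer=1: 1 byte(s)
# kmer=6: 2 byte(s)
# kmer=8: 4 byte(s)
# kmer=9: 4 byte(s)
```

Under these assumptions, 2 bytes are enough up to a k-mer size of 7, while k-mer sizes of 8 and 9 need 4 bytes.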


## Installation of ProkBERT (if needed)



```python
try:
    import prokbert
    print("ProkBERT is already installed.")
except ImportError:
    # Notebook shell escape: installs the package from PyPI
    !pip install prokbert
    print("Installed ProkBERT.")
```

```plaintext
ProkBERT is already installed.
```


# Sequence tokenization examples

```python
from prokbert.config_utils import *
from prokbert.sequtils import *

# Override only the parameters that differ from the defaults
tokenization_parameters = {'kmer': 6,
                           'shift': 2}

segment = 'ATGTCCGCGACCT'
defconfig = SeqConfig()  # For the detailed configuration parameters, see the table above
tokenization_params = defconfig.get_and_set_tokenization_parameters(tokenization_parameters)
tokens, kmers = lca_tokenize_segment(segment, tokenization_params)

# One list of token IDs per window offset
print(' '.join([str(t) for t in tokens]))
# k-mers of the first tokenization (offset=0)
print(' '.join(kmers[0]))

results_pretty_print = pretty_print_overlapping_sequence(segment, kmers[0], tokenization_params)
print(results_pretty_print)
```



```plaintext
2023-11-12 11:21:22,443 - INFO - Nr. line to cover the seq:  4

[2, 954, 2910, 1437, 2442, 3] [2, 3803, 3435, 1638, 1564, 3]
ATGTCC GTCCGC CCGCGA GCGACC
    ATGTCCGCGACCT
0.  ATGTCC
1.    GTCCGC
2.      CCGCGA
3.        GCGACC
```
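
The staggered layout printed above can be reproduced with a short stand-alone function. This is a rough sketch and not the code of `pretty_print_overlapping_sequence`: the function name is hypothetical, and the row count `ceil(kmer / shift) + 1` is an assumption inferred from this single example (4 rows for a k-mer size of 6 and a shift of 2).

```python
import math


def pretty_print_kmers(sequence, kmers, kmer, shift, offset=0):
    """Lay out overlapping k-mers under the sequence, cycling through the rows."""
    n_rows = math.ceil(kmer / shift) + 1  # assumed row count, see lead-in
    prefix_width = 4                      # room for the "0.  " row labels
    rows = [[" "] * len(sequence) for _ in range(n_rows)]
    for i, token in enumerate(kmers):
        start = offset + i * shift        # position of the k-mer in the sequence
        rows[i % n_rows][start:start + kmer] = token
    lines = [" " * prefix_width + sequence]
    for r, row in enumerate(rows):
        lines.append(f"{r}.".ljust(prefix_width) + "".join(row).rstrip())
    return "\n".join(lines)


segment = "ATGTCCGCGACCT"
kmers = ["ATGTCC", "GTCCGC", "CCGCGA", "GCGACC"]
print(pretty_print_kmers(segment, kmers, kmer=6, shift=2))
```

Cycling through one more row than strictly needed leaves a gap between neighbouring k-mers on the same row, which keeps them visually separate.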


## Tokenizing longer sequences
If you wish to tokenize sequences longer than the current ProkBERT models support, adjust the `token_limit` and `max_segment_length` parameters.
Note that the tokenization process is parallelized at the segment level using the Python multiprocessing module. You might need to adjust the number of cores to be used as well as the `batch_size_tokenization` parameter (how many sequences a core should process at once); otherwise, you might run out of memory.



```python
# Raise both limits so that the long segment is not rejected
tokenization_parameters = {'kmer': 6,
                           'shift': 1,
                           'max_segment_length': 2000000,
                           'token_limit': 2000000}

segment = 'ATGTCCGCGACCT' * 100000  # long sequence (1.3 million characters)
defconfig = SeqConfig()  # For the detailed configuration parameters, see the table above
tokenization_params = defconfig.get_and_set_tokenization_parameters(tokenization_parameters)
tokens, kmers = lca_tokenize_segment(segment, tokenization_params)
```
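
The role of `batch_size_tokenization` can be pictured with a small batching sketch. It only shows how a list of segments is cut into chunks; in a real run each chunk would be handed to a worker process (for example via `multiprocessing.Pool.map`). The function names and the k-mer counting stand-in are assumptions made for this example, not ProkBERT code.

```python
def chunked(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def count_kmers_in_batch(batch, kmer=6, shift=1):
    """Stand-in for the per-batch work: count the k-mers of each segment."""
    return [max(0, (len(seg) - kmer) // shift + 1) for seg in batch]


segments = ["ATGTCCGCGACCT"] * 25
batches = list(chunked(segments, batch_size=10))
print([len(batch) for batch in batches])  # [10, 10, 5]

# Processed sequentially here; each batch is an independent unit of work
results = [n for batch in batches for n in count_kmers_in_batch(batch)]
print(results[:3])  # [8, 8, 8]
```

Smaller batches lower the peak memory used per worker, at the cost of more scheduling overhead.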


### Tokenization with tokenizer class

The tokenizer class can be used for tokenization as well. 


# Tokenization for pretraining

Sequence data is highly abundant, and it is usually not possible to store all of it in memory.
