# Preprocessing Sequence Data

The initial stage of our pipeline involves two primary steps: segmentation and tokenization.

Segmentation is depicted in the figure below. Genomic language models (GLMs) process sequence chunks of limited size, typically up to 4 kb. Longer sequences must therefore be divided into smaller parts, a process known as segmentation. Segmentation can be either contiguous, which splits the sequence into disjoint, end-to-end segments, or random, which samples segments of length L at random positions.
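To make the two strategies concrete, here is a minimal plain-Python sketch. It is not ProkBERT's implementation; the helper names `contiguous_segments` and `random_segments` are hypothetical and chosen only for illustration.

```python
import random

def contiguous_segments(sequence, max_length):
    """Split a sequence into disjoint, end-to-end segments of at most max_length."""
    return [sequence[start:start + max_length]
            for start in range(0, len(sequence), max_length)]

def random_segments(sequence, length, n_segments, seed=None):
    """Sample n_segments windows of fixed length at uniformly random start positions."""
    if length > len(sequence):
        return []  # the sequence is too short to hold a single window
    rng = random.Random(seed)
    last_start = len(sequence) - length
    starts = [rng.randint(0, last_start) for _ in range(n_segments)]
    return [sequence[s:s + length] for s in starts]

sequence = "ATCGATTTGCT"
print(contiguous_segments(sequence, 8))
print(random_segments(sequence, 4, 3, seed=0))
```

Contiguous segments never overlap and the last one may be shorter than `max_length`, whereas random segments all share the same length and may overlap each other.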

The first practical step in segmentation is loading the sequence from a FASTA file. Often, it's also beneficial to include the reverse complement of the sequence.
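The reverse complement mentioned above can be computed in a couple of lines. The sketch below is a stand-alone illustration, independent of ProkBERT, and assumes that only the bases A, C, G, T and the unknown base N occur in the sequence.

```python
# Translation table for the canonical bases; N (unknown) maps to itself.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(sequence):
    """Complement every base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATCGATTTGCT"))  # AGCAAATCGAT
```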
Segmentation process:

<img src="https://github.com/nbrg-ppcu/prokbert/blob/main/assets/Figure2_segmentation.png?raw=true" width="500" height="300" alt="Segmentation Process"> 



## Preprocessing the dataset and creating a segment database

Sequence processing is computationally expensive given the size of the available sequence data. It is good practice to create a database that contains the processed sequences together with their labels; we often refer to this as a segment database.
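One possible way to persist such a segment database is a small SQLite table. The sketch below is only an assumption about how it could be laid out: the `label` column and the example rows are hypothetical, and the other column names mirror the segment DataFrame produced later in this tutorial.

```python
import sqlite3

# Hypothetical labelled segments; in practice these come from the segmentation step.
segments = [
    (0, 0, 0, 8, "ATCGATTT", "class_a"),
    (1, 0, 8, 11, "GCT", "class_a"),
]

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the database
conn.execute(
    """CREATE TABLE segments (
           segment_id    INTEGER PRIMARY KEY,
           sequence_id   INTEGER NOT NULL,
           segment_start INTEGER NOT NULL,
           segment_end   INTEGER NOT NULL,
           segment       TEXT NOT NULL,
           label         TEXT
       )"""
)
conn.executemany("INSERT INTO segments VALUES (?, ?, ?, ?, ?, ?)", segments)
conn.commit()

# Later runs can query the stored segments instead of re-processing the FASTA files.
rows = conn.execute(
    "SELECT segment_id, segment FROM segments WHERE label = ? ORDER BY segment_id",
    ("class_a",),
).fetchall()
print(rows)
```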
 


# Loading the sequence data 

The `load_contigs` function efficiently handles sequence data from FASTA files, providing options to include reverse complements and sequence metadata. When the output is set to a DataFrame (`AsDataFrame=True`), the function organizes this data into a structured format, enhancing data accessibility and manipulation for downstream analyses.

The resulting DataFrame consists of the following columns:
- `sequence_id`: Unique identifier of the sequence (integer) if `is_add_sequence_id=True`.
- `fasta_id`: Identifier of the sequence, parsed from the FASTA file.
- `description`: Description or metadata associated with the sequence, typically extracted from the FASTA file.
- `source_file`: Path of the source FASTA file.
- `sequence`: Nucleotide sequence. Sequences are converted to uppercase if `to_uppercase=True`.
- `orientation`: Indicates 'forward' for original sequences and 'reverse' for reverse complements if `adding_reverse_complement=True`.
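
To illustrate how the columns listed above relate to the content of a FASTA file, here is a simplified, stand-alone parser that produces records with the same fields. It is a sketch, not the implementation of `load_contigs`: the function `parse_fasta` and its arguments are hypothetical, it always uppercases the sequence, and it assumes that `>` appears only at the start of header lines.

```python
def parse_fasta(text, source_file="<memory>", add_reverse_complement=True):
    """Parse FASTA text into a list of dicts, one per sequence and orientation."""
    complement = str.maketrans("ACGTN", "TGCAN")
    records = []
    # Every entry starts with '>'; anything before the first '>' is ignored.
    for entry in text.split(">")[1:]:
        lines = entry.splitlines()
        if not lines:
            continue
        description = lines[0].strip()
        fasta_id = description.split()[0] if description else ""
        sequence = "".join(line.strip() for line in lines[1:]).upper()
        variants = [("forward", sequence)]
        if add_reverse_complement:
            variants.append(("reverse", sequence.translate(complement)[::-1]))
        for orientation, seq in variants:
            records.append({
                "sequence_id": len(records),  # running integer identifier
                "fasta_id": fasta_id,
                "description": description,
                "source_file": source_file,
                "sequence": seq,
                "orientation": orientation,
            })
    return records

fasta_text = ">seq1 toy example\natcg\nATTT\n>seq2\nGCT\n"
for record in parse_fasta(fasta_text):
    print(record["sequence_id"], record["fasta_id"],
          record["orientation"], record["sequence"])
```

As in the DataFrame shown in this tutorial, the forward and reverse-complement versions of a sequence share the same `fasta_id` but receive distinct `sequence_id` values.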


## Installation of ProkBERT (if needed)



In [28]:
try:
    import prokbert
    print("ProkBERT is already installed.")
except ImportError:
    !pip install prokbert
    print("Installed ProkBERT.")


ProkBERT is already installed.


In [21]:
from prokbert.sequtils import *
import pkg_resources
fasta_file = pkg_resources.resource_filename('prokbert','data/ESKAPE_sample.fasta')

# This pandas dataframe holds the sequence data parsed from the fasta file
sequences = load_contigs(fasta_file, IsAddHeader=True, adding_reverse_complement=True, AsDataFrame=True, to_uppercase=True, is_add_sequence_id=True)
sequences.head(2)


2023-11-12 11:52:19,694 - INFO - Loading sequence data into memory!
2023-11-12 11:52:19,695 - INFO - Since the fasta_files_list is a string, not a list, we convert it to a list.


| | sequence_id | fasta_id | description | source_file | sequence | orientation |
|---|-------------|----------|-------------|-------------|----------|-------------|
| 0 | 0 | NC_000913.3 | NC_000913.3 Escherichia coli str. K-12 substr.... | /home/ligeti/github/prokbert/src/prokbert/data... | TATATTAGAAATGTCCGCGACCTTTCATACATACCACCGGTACGCC... | forward |
| 1 | 1 | NC_000913.3 | NC_000913.3 Escherichia coli str. K-12 substr.... | /home/ligeti/github/prokbert/src/prokbert/data... | CCAGCAATGGTGAAAGGAAAATCCCCAGCAGGCTGGATGCCGACGC... | reverse |


# Segmentation and tokenization

The segmentation and tokenization steps each have their own set of parameters, e.g. how large the sampled segments should be, what the minimum valid length is, and what proportion of unknown tokens is allowed.
These parameters are set through the corresponding config classes.


The following table outlines the configuration parameters for ProkBERT, detailing their purpose, default values, types, and constraints.

| Parameter | Description | Type | Default | Constraints |
|-----------|-------------|------|---------|-------------|
| **Segmentation** | | | | |
| `type` | Defines the segmentation type. 'contiguous' means non-overlapping sections of the sequence are selected end-to-end. In 'random' segmentation, fragments are uniformly sampled from the original sequence. | string | `contiguous` | Options: `contiguous`, `random` |
| `min_length` | Sets the minimum length for a segment. Any segment shorter than this will be discarded. | integer | 0 | Min: 0 |
| `max_length` | Specifies the maximum length a segment can have. | integer | 512 | Min: 0 |
| `coverage` | Indicates the expected average coverage of any position in the sequence by segments. This is only applicable for type=random. Note that because segments are uniformly sampled, the coverage might vary, especially at the sequence ends. | float | 1.0 | Min: 0.0, Max: 100.0 |
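
The `coverage` parameter can be related to the number of random segments with a simple calculation: if n segments of length L are drawn from a sequence of length S, each position is covered n·L/S times on average. The sketch below inverts this relation. It is an assumption made for illustration only; ProkBERT's internal calculation may differ, and the helper name `n_random_segments` is hypothetical.

```python
import math

def n_random_segments(sequence_length, segment_length, coverage):
    """Number of fixed-length segments needed to reach the expected average coverage."""
    if segment_length <= 0 or sequence_length < segment_length:
        return 0  # no valid window fits into the sequence
    return math.ceil(coverage * sequence_length / segment_length)

print(n_random_segments(10_000, 512, 1.0))  # 20
print(n_random_segments(10_000, 512, 2.0))  # 40
```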



### Segmentation of sequence database


In [22]:

# Segment a DataFrame:
segmentation_params = {'max_length' : 512, # We split the sequence into segments of at most L = 512 nucleotides
                       'min_length' : 6,   # Segments shorter than this are discarded
                       'type' : 'contiguous'} # Default segmentation type
segmentdb = segment_sequences(sequences, segmentation_params, AsDataFrame=True)
segmentdb.head(2)




2023-11-12 11:52:23,256 - INFO - Checking input DataFrame!
2023-11-12 11:52:23,257 - INFO - Checking input sequence_id is valid primary key in the DataFrame


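| | segment_id | sequence_id | segment_start | segment_end | segment |
|---|------------|-------------|---------------|-------------|---------|
| 0 | 0 | 0 | 0 | 512 | TATATTAGAAATGTCCGCGACCTTTCATACATACCACCGGTACGCC... |
| 1 | 1 | 0 | 512 | 1024 | TTATGCTATGAAAAAACATCTTTTAACTCTGACACTTTCCTCTATA... |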


### Segmentation of single sequence


In [23]:
from prokbert.config_utils import *
# This class provides validated input parameters for segmentation and tokenization
defconfig = SeqConfig() # For the detailed configuration parameters see: https://github.com/nbrg-ppcu/prokbert/blob/main/src/prokbert/configs/sequence_processing.yaml

# Setting up parameters for segmentation
segmentation_params = {'max_length' : 8, # We split the sequence into segments of at most L = 8 nucleotides
                       'min_length' : 3, # Segments shorter than this are discarded
                       'type' : 'contiguous'} # Default segmentation type

# Validate the parameters and fill in defaults for any that are missing
segmentation_params = defconfig.get_and_set_segmentation_parameters(segmentation_params)

# Segment single sequence:
segment_list = segment_sequence_contiguous('ATCGATTTGCT', segmentation_params)
print(segment_list)



[{'segment': 'ATCGATTT', 'segment_start': 0, 'segment_end': 8, 'sequence_id': nan}, {'segment': 'GCT', 'segment_start': 8, 'segment_end': 11, 'sequence_id': nan}]
