# Preprocessing Sequence Data

The first stage of the pipeline is sequence preprocessing, which consists of two steps: segmentation and tokenization.

Genomic language models (GLMs) can only process a limited-size chunk of sequence data at once, typically up to 4 kb. Longer sequences must therefore be split into smaller parts, a process known as segmentation, which is depicted in the figure below. Segmentation can be either contiguous, which splits the sequence into disjoint segments, or random, which samples segments of length L at random positions.

The first practical step is loading the sequence from a FASTA file. Often, it is also beneficial to include the reverse complement of each sequence.

Segmentation process:

<img src="https://github.com/nbrg-ppcu/prokbert/blob/main/assets/Figure2_segmentation.png?raw=true" width="500" height="300" alt="Segmentation Process"> 
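
The two segmentation strategies and the reverse complement can be sketched in plain Python as follows. This is a minimal illustration, not ProkBERT's implementation: the helper names (`reverse_complement`, `contiguous_segments`, `random_segments`) and their parameters are hypothetical, and dropping a trailing piece shorter than `min_length` is an assumption.

```python
import random

# Complement table for unambiguous nucleotides plus N (upper and lower case)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def contiguous_segments(seq, max_length, min_length=0):
    """Split seq into disjoint chunks of at most max_length characters.

    Assumption: a trailing chunk shorter than min_length is dropped.
    """
    chunks = [seq[i:i + max_length] for i in range(0, len(seq), max_length)]
    return [c for c in chunks if len(c) >= min_length]


def random_segments(seq, length, n_segments, seed=None):
    """Sample n_segments windows of exactly `length` characters at random starts."""
    if length > len(seq):
        return []
    rng = random.Random(seed)
    starts = [rng.randint(0, len(seq) - length) for _ in range(n_segments)]
    return [seq[s:s + length] for s in starts]


seq = "ATCGATCGAAATTTTTT"
print(contiguous_segments(seq, max_length=8, min_length=3))
print(random_segments(seq, length=5, n_segments=3, seed=0))
print(reverse_complement(seq))
```

Contiguous segments never overlap, whereas randomly sampled segments may overlap each other and may leave parts of the sequence uncovered.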



# Loading the sequence data 

The `load_contigs` function reads sequence data from one or more FASTA files, with options to add reverse complements and sequence metadata. With `AsDataFrame=True`, it returns the data as a pandas DataFrame, which is convenient for downstream analyses.

The resulting DataFrame has the following columns:
- `sequence_id`: Unique integer identifier of the sequence, added if `is_add_sequence_id=True`.
- `fasta_id`: Identifier of the sequence, parsed from the FASTA file.
- `description`: Description or metadata associated with the sequence, typically extracted from the FASTA header line.
- `source_file`: Path of the source FASTA file.
- `sequence`: Nucleotide sequence, converted to uppercase if `to_uppercase=True`.
- `orientation`: `'forward'` for original sequences and `'reverse'` for reverse complements, which are added if `adding_reverse_complement=True`.
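
To make the column layout concrete, the sketch below builds rows with the same fields from FASTA-formatted text using only the standard library. It is an illustration under stated assumptions, not how `load_contigs` is implemented: the function `fasta_to_records`, its parameters, and the toy input are hypothetical, and taking the first whitespace-delimited token of the header as `fasta_id` is an assumed convention.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]


def fasta_to_records(fasta_text, source_file,
                     add_reverse_complement=True, to_uppercase=True):
    """Parse FASTA-formatted text into a list of row dictionaries."""
    entries = []  # (header, sequence) pairs in file order
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))

    records = []
    for header, seq in entries:
        if to_uppercase:
            seq = seq.upper()
        oriented = [("forward", seq)]
        if add_reverse_complement:
            oriented.append(("reverse", reverse_complement(seq)))
        tokens = header.split()
        for orientation, oriented_seq in oriented:
            records.append({
                "sequence_id": len(records),  # running integer key
                "fasta_id": tokens[0] if tokens else "",  # assumed convention
                "description": header,
                "source_file": source_file,
                "sequence": oriented_seq,
                "orientation": orientation,
            })
    return records


toy_fasta = ">seq1 toy example\nATGC\nggta\n>seq2\nTTAA\n"
for row in fasta_to_records(toy_fasta, "toy.fasta"):
    print(row)
```

Each input sequence yields one `forward` row and, when reverse complements are requested, one `reverse` row with its own `sequence_id`.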


In [7]:
from prokbert.sequtils import *

fasta_file = 'data/ESKAPE_sample.fasta'
# Parse the FASTA file into a pandas DataFrame holding the sequence data
sequences = load_contigs(fasta_file,
                         IsAddHeader=True,
                         adding_reverse_complement=True,
                         AsDataFrame=True,
                         to_uppercase=True,
                         is_add_sequence_id=True)

sequences

2023-11-11 21:42:04,684 - INFO - Loading sequence data into memory!
2023-11-11 21:42:04,685 - INFO - Since the fasta_files_list is a string, not a list, we convert it to a list.


| | sequence_id | fasta_id | description | source_file | sequence | orientation |
|---|---|---|---|---|---|---|
| 0 | 0 | NC_000913.3_2524612_2527387 | NC_000913.3_2524612_2527387 NC_000913.3 Escher... | data/ESKAPE_sample.fasta | TATATTAGAAATGTCCGCGACCTTTCATACATACCACCGGTACGCC... | forward |
| 1 | 1 | NC_000913.3_2524612_2527387 | NC_000913.3_2524612_2527387 NC_000913.3 Escher... | data/ESKAPE_sample.fasta | CCAGCAATGGTGAAAGGAAAATCCCCAGCAGGCTGGATGCCGACGC... | reverse |
| 2 | 2 | NC_007795.1_1702041_1705080 | NC_007795.1_1702041_1705080 NC_007795.1 Staphy... | data/ESKAPE_sample.fasta | ATAATGCTAAATCGTAACCCCACTGCTTAAATGAGCCTTCTGTAAA... | forward |
| 3 | 3 | NC_007795.1_1702041_1705080 | NC_007795.1_1702041_1705080 NC_007795.1 Staphy... | data/ESKAPE_sample.fasta | GAATGGACTGTTATCAGGATCAACTTGCTGCCAAGGGATAATAGAC... | reverse |
| 4 | 4 | NC_002516.2_410266_413817 | NC_002516.2_410266_413817 NC_002516.2 Pseudomo... | data/ESKAPE_sample.fasta | GTGACCGGGGTCAGGTTCTCGGCGGCGGCGCGCATCACGTGCTTGC... | forward |
| 5 | 5 | NC_002516.2_410266_413817 | NC_002516.2_410266_413817 NC_002516.2 Pseudomo... | data/ESKAPE_sample.fasta | GCGTTTCTACCGCGAGGTCGGCCCACTCGACTGCACCCTGGAGAGT... | reverse |


# Segmentation and tokenization


In [13]:
from prokbert.config_utils import *

# SeqConfig provides validated input parameters for segmentation and tokenization.
# For the detailed configuration parameters see:
# https://github.com/nbrg-ppcu/prokbert/blob/main/src/prokbert/configs/sequence_processing.yaml
defconfig = SeqConfig()

segmentation_params = {'max_length': 8,       # maximum segment length L
                       'min_length': 3,       # minimum segment length
                       'type': 'contiguous'}  # default segmentation type
tokenization_params = {}

# Validate the user-supplied parameters against the configuration
segmentation_params = defconfig.get_and_set_segmentation_parameters(segmentation_params)
tokenization_params = defconfig.get_and_set_tokenization_parameters(tokenization_params)

# Segment a single sequence:
segment_list = segment_sequence_contiguous('ATCGATCGAAATTTTTT', segmentation_params)
print(segment_list)

# Segment a DataFrame:
segmentation_params = {'max_length': 512,     # maximum segment length L
                       'min_length': 6,       # minimum segment length
                       'type': 'contiguous'}  # default segmentation type
segmentdb = segment_sequences(sequences, segmentation_params, AsDataFrame=True)
segmentdb



2023-11-11 21:43:41,138 - INFO - Checking input DataFrame!
2023-11-11 21:43:41,139 - INFO - Checking input sequence_id is valid primary key in the DataFrame


[{'segment': 'ATCGATCG', 'segment_start': 0, 'segment_end': 8, 'sequence_id': nan}, {'segment': 'AAATTTTT', 'segment_start': 8, 'segment_end': 16, 'sequence_id': nan}]


| | segment_id | sequence_id | segment_start | segment_end | segment |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 512 | TATATTAGAAATGTCCGCGACCTTTCATACATACCACCGGTACGCC... |
| 1 | 1 | 0 | 512 | 1024 | TTATGCTATGAAAAAACATCTTTTAACTCTGACACTTTCCTCTATA... |
| 2 | 2 | 0 | 1024 | 1536 | CCGGCTGGGTTGCAGGTTACAACTTTATGCTGGGCAGCGAGAAATT... |
| 3 | 3 | 0 | 1536 | 2048 | CCGATGCCTGCGGCTACCATCGGGAACAGCGTCGCCGGATGTCCAA... |
| 4 | 4 | 0 | 2048 | 2560 | AGCCAGCTGCTGCCCTGCATCTGTAAGCACCACTTCACGCGTGGTT... |
| 5 | 5 | 0 | 2560 | 2780 | ACCTTCGTGCTGTTTCCGATTCTGGGTGTACTGTTTGCCTGGTGGA... |
| 6 | 6 | 1 | 0 | 512 | CCAGCAATGGTGAAAGGAAAATCCCCAGCAGGCTGGATGCCGACGC... |
| 7 | 7 | 1 | 512 | 1024 | CCATTCATGAAGAAAACAGATGAATTATTCTTTAAAACAATTAAAG... |
| 8 | 8 | 1 | 1024 | 1536 | AGCCTTTCTTTCTGCTTTGCCATCGCGATAGCGCTTTGGCCGTGGA... |
| 9 | 9 | 1 | 1536 | 2048 | TAAAAATCTTCGCCCAGTTTGTCATCCGCATATCGATACTGAATCC... |
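
The `segment_start` and `segment_end` values in the outputs above behave like half-open Python slice bounds. The sketch below reproduces them from the sequence length alone. It is inferred from the printed results rather than taken from the library: `segment_bounds` is a hypothetical helper, the length 2780 is read off the last `segment_end` of sequence 0, and keeping a trailing piece only when it has at least `min_length` characters (rather than strictly more) is an assumption.

```python
def segment_bounds(sequence_length, max_length, min_length):
    """Return half-open (start, end) pairs for contiguous segmentation."""
    bounds = []
    for start in range(0, sequence_length, max_length):
        end = min(start + max_length, sequence_length)
        # Assumption: pieces shorter than min_length are discarded
        if end - start >= min_length:
            bounds.append((start, end))
    return bounds


# The 17-nt toy sequence with max_length=8, min_length=3
print(segment_bounds(17, max_length=8, min_length=3))
# Sequence 0 from the table, with max_length=512, min_length=6
print(segment_bounds(2780, max_length=512, min_length=6))
```

For the toy sequence the final single nucleotide is shorter than `min_length`, so only two segments remain, while for sequence 0 the 220-nt remainder is kept as a sixth segment.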


In [None]:
params = {'min_length': 0, 'max_length': 100}
segment_sequence_contiguous('ATCGATCGA', params)




