## Développement d’une solution de mapping de données de séquençage à haut-débit sur un génome de référence

#### Mise en place d'un algorithme permettant de chercher un mot de longueur fixe dans un texte

#### Plan du notebook:
1/ Prise en main de Biopython  
2/ Implémentation d'un algorithme "Difference Cover size 3" pour construire une table de suffixe à partir d'un génome en un temps linéaire  
3/ Recherche d'un mot dans un texte grâce à la transformée de Burrows-Wheeler

### 1/ Biopython

In [1]:
from Bio import SeqIO

In [2]:
path_to_genome_file="/home/azarkua/Documents/2023-2024/omiques2/developement/omique2/genome.fna"
path_to_reads_file="/home/azarkua/Documents/2023-2024/omiques2/developement/omique2/reads.fq"

In [3]:
genome=[]
nucleotide_genome=0
for seq_record in SeqIO.parse(path_to_genome_file, "fasta"):
    genome.append(seq_record)
    nucleotide_genome+=len(seq_record.seq)

In [4]:
print("Informations contenues dans un élément de 'genome'.")
print(genome[0])

Informations contenues dans un élément de 'genome'.
ID: NC_004325.2
Name: NC_004325.2
Description: NC_004325.2 Plasmodium falciparum 3D7 genome assembly, chromosome: 1
Number of features: 0
Seq('TGAACCCTaaaacctaaaccctaaaccctaaaccctgaaccctaaaccctgaac...agg')


In [5]:
print("Notre génome contient "+str(len(genome))+" séquences ("+ str(len(genome)-1) +" chromosomes et 1 plasmide) et est constitué de "+str(nucleotide_genome)+" nucléotides.")

Notre génome contient 15 séquences (14 chromosomes et 1 plasmide) et est constitué de 23326872 nucléotides.


In [6]:
reads=[]
nucleotide_reads=0
for seq_record in SeqIO.parse(path_to_reads_file, "fastq"):
    reads.append(seq_record)
    nucleotide_reads+=len(seq_record.seq)
    #if len(seq_record.seq)!= 100:
    #    print("False")

In [7]:
print("Informations contenues dans un élément de 'reads'.")
print(reads[10])

Informations contenues dans un élément de 'reads'.
ID: NC_004325.2-99990
Name: NC_004325.2-99990
Description: NC_004325.2-99990
Number of features: 0
Per letter annotation for: phred_quality
Seq('TCTTTTAACATACCTAAGAAGGAATACATTTTACACTTTACCTATTATTTATTC...ATA')


In [8]:
print("Nous avons "+str(len(reads))+" reads qui contiennent "+str(nucleotide_reads)+" nucléotides.")
print("Chaque read a une longueur de "+str(int(nucleotide_reads/len(reads)))+" nucléotide.")

Nous avons 1500000 reads qui contiennent 150000000 nucléotides.
Chaque read a une longueur de 100 nucléotide.


### 2/ DC3
#### Première étape de DC3: division des positions de notre génome, tri des positions

In [9]:
import time

In [10]:
def asciiDC3 (seq) : 
    """
    Create sequence of ascii equivalent of each element of given list parameter.
    Add three sentinel numbers necessary for DC3 algorithm.
    
    Args:
        seq (list of str): list of elements to transform
    
    Return:
        (list of int)
    """
    asc=[]
    for i in seq : 
        asc.append(ord(i))
    
    return asc+[0,0,0]

##### Complexité:
Parcourir la séquence 1 fois implique, que asciiDC3 a une compléxité $O(n)$

In [11]:
def position1_2 (asc):
    """
    Create list of positions not multiple of 3
    
    Args:
        asc (list of int): sequence we want to extract a suffix array from
    
    Return:
         ind1+ind2 (list of int): list of positions not multiple of 3
    """
    ind1=[]
    ind2=[]
    for k in range(len(asc)-2):
        #attention on a peut etre fait de la merde ici, pas sure du -2
        if k%3==1 :
            ind1.append(k)
        if k%3==2:
            ind2.append(k)
    # dans le cas où T a un nombre d'élément multiple de 3, r12 ne contiendra pas le triplet [0,0,0].
    # Or, la fonction, "removesentinel" retire 1 élément notre suffix table, pour justement se débarrasser de [0,0,0]
    # J'ai donc décider de rajouter artificiellement [0,0,0] dans les cas où il n'est pas déjà là.
    if len(asc)%3==0:
        ind1.append(len(asc)-1-2)
        
    return ind1+ind2

##### Complexité:
Parcourir la séquence 1 fois implique, que position1_2 a une compléxité $O(n)$

In [12]:
def radix_with_p12(p,t):
    """
    Creates a list of couples containing:
    - sequences of three elements from "t" staring from positions in parameter "p"
    - the position p
    
    Args : 
        p (list of int): list of positions not multiple of 3 
        t (list of int): sequence we want to extract a suffix array from
    
    Return : 
        r (list of list): list of triplets and their starting positions
    
    """
    r=[]
    for i in range(len(p)):
        index=p[i]
        r.append([[t[index],t[index+1], t[index+2]],index])
    return r

##### Complexité:
Parcourir la séquence 1 fois implique, que radix_with_p12 a une compléxité $O(n)$

In [13]:
def sort_with_p12(array, alphabet, column_number):
    """
    Sorts parameter "array" using Radix Sort
    
    Args:
        array (list of list): list of triplets and their positions.
        
        alphabet (dictionnary): alphabet of our array. Ouput of function "alphabetT" or "alphabetR0_with_p12"
        
        columnNumber (int): number of last column of the lists in our paramater "array". 
                                    for r12, columnNumber=2
                                    for r0, columnNumber=1
    
    Return:
        array (list of list) : sorted list of triplets and their positions

    """
    if len(array) == 0:
        return array
    # Perform counting sort on each column, starting at the last

    column = column_number
    while column>=0: 
        array = counting_sort_by_digit_with_p12(array, alphabet, column)
        column-=1 #Switches column in array

    return array

def counting_sort_by_digit_with_p12(array, alphabet, column):
    """
    Sorts parameter "array" using Counting Sort
    
    Args:
        array (list of list): list of triplets and their positions.
        
        alphabet (dictionnary): alphabet of our array.O uput of function "alphabetT" or "alphabetR0_with_p12"
        
        columnNumber (int): number of last column of the lists in our paramater "array". 
                                    for r12, columnNumber=2
                                    for r0, columnNumber=1
    
    Return:
        array (list of list) : sorted list of triplets and their positions according to column number "column"

    """
    count_index = -1
    count = [0] * len(alphabet)
    output = [None] * len(array)

  # Count frequencies
    for i in range(0, len(array)):
        count_index = alphabet[array[i][0][column]]
        count[count_index] += 1

  # Compute cumulates
    for i in range(1, len(alphabet)):
        count[i] += count[i - 1]

  # Move records
    for i in range(len(array) - 1, -1, -1):
        count_index = alphabet[array[i][0][column]]
        count[count_index] -= 1
        output[count[count_index]] = array[i]
       
    return output

##### Complexité:
"sort" a la même complexité que "CountingSortByDigit", qui ,lui, a une complexité linéaire.

In [14]:
def triplets_are_equal(a,b):
    """
    Checks if a and b have the same elements.
    
    Args:
        a (list of int) 
        
        b (list of int)

    Return:
        (bool): True if the triplets are equal, False if not

    """
    for i in range(len(a)):
        if a[i]!=b[i]:
            return False
    return True


def ordre_with_p12(r12_p12_sorted, use_index_12=False):
    """
    Returns the order of each element of parameter "R12p12sorted".
    If use_index_12 is set to "True", it also returns a dictionnary, with the position of the triplet
    as a key, and its order as element.
    
    Example: 
        ordre_with_p12([ [ [0,0,0],1], [ [0,0,0],4], [ [0,0,1],7])=[1,1,2]
        
        ordre_with_p12([ [ [0,0,0],1], [ [0,0,0],4], [ [0,0,1],7], use_index_12=True)= [ [1,1,2], {1:1,4:1,7:2} ]
    
    args : 
        r12_p12_sorted (list) : list of triplets with their position in the sequence, ordered 
        
        use_index_12 (boolean) : True if you want a dictionnary back
        
    return : 
        if use_index_12=True : 
            order (list of int) :  order of each element of parameter "r12_p12_sorted"
            
            repetition (boolean) : If two triplets have the same ordre, "repetition"=True
            
            indexdict (dictionary) :keys are the positions p12 and elements are the order of the positions 
        
        if use_index_12=False : 
            order (list of int): order of each element of parameter "r12_p12_sorted"
    """
    # nous renovie la liste order du genre (1,2,2,3,4,4,5)  et un booléen indiquant s'il y a répétition
    index=1
    repetition=False
    order=[1]
    
    if use_index_12:
        index_dict={r12_p12_sorted[0][1]:1}
        
    for i in range(1, len(r12_p12_sorted)): 
        if triplets_are_equal(r12_p12_sorted[i-1][0], r12_p12_sorted[i][0]):
            order.append(index)
            repetition=True
        else : 
            index+=1
            order.append(index)
            
        if use_index_12:
            index_dict[r12_p12_sorted[i][1]]=order[i]
    if use_index_12:
        return order, repetition, index_dict
    else: 
        return order
            

##### Complexité:
triplets_are_equal a une complexité de $O(1)$ donc ordre_with_p12 a une complexité de $O(n)$ à cause la boucle for. 

In [15]:
def alphabetT(T):
    """
    Returns a dictionnary with the order of each "letter" constituting the parameter T.
    
    Example: 
        alphabetT([4,9,14,67])={4:0, 9:1, 14:2, 67:3}
    
    Args:
        T (list of int): the sequence we want a suffix array from
    
    Return:
        dic (dictionnary) : order of each "letter" constituting the parameter T

    """

    dic={}
    a=[]
    for i in range(len(T)):
        a.append(T[i])
    a.sort() 
    
    for element in a:
        if not (element in dic):
            dic[element]=len(dic)
    return dic

In [16]:
def alphabet_r0_with_p12(r0_p0):
    """
    Returns a dictionnary with the order of each "letter" constituting the parameter r0_p0.
    Similar to the function alphabetT, coded specifically for the output of function position0_R0_p0
    
    Example: 
        alphabet_r0_with_p12([[[65, 14], 0], [[67, 11], 3]]) = {11: 0, 14: 1, 65: 2, 67: 3}
    
    Args:
        r0_p0 (list of list): list of couples and their position in the sequence we want a suffix array from.
                             output of function position0_R0_p0
                             [ [couple], position multiple of 3]
    
    Return:
        dic (dictionnary) : order of each "letter" constituting the parameter r0_p0

    """
    
    dic={}
    a=[]
    for column in range(2):
        for i in range(len(r0_p0)):
            a.append(r0_p0[i][0][column])
    a.sort() 
    
    for element in a:
        if not (element in dic):
            dic[element]=len(dic)
    return dic

##### Complexité:
alphabetT a une complexité de $O(n)$ car la fonction parcourt toute la séquence une fois. Il en est de même pour alphabetR0_with_p12, car la seconde boucle for n'est que de taille 2. 

#### Deuxième étape de DC3: utilisation des premières positions triées pour construire second groupe de positions

In [17]:
def Tprime_with_p12(p12, index_dict):
    """
    Return a list of the order of each element of p12.
    The list will be the new sequence from which we want to extract a suffix array.
    
    Args : 
        p12 (list of int) : positions not multiple of 3
        
        index_dict (dictionnary) : keys are the positions p12 and elements are the order of the positions 
        
    Return :
        t(list of int): order of each element of p12
    """
    t=[]
    for p in p12:
        t.append(index_dict[p])
    return t

##### Complexité:
Parcourir la liste des positions p12 une fois implique que Tprime_with_p12 a une compléxité $O(n)$

In [18]:
def position0_R0_p0(T, index_12_dict):
    """
    Returns:
    -list of positions multiples of 3
    -list of lists, composed of: 1/ [element at position multiple of 3, order of the element at next position]
                                2/ position multiple of 3
    Args : 
        T (list of int) : sequence for which we are looking to extract the suffix array. 
                        It must contain the three sentinel numbers
        
        index_12_dict (dictionary) : dictionary which contains the positions of p12 as key and their order as elements
        
    Return :
        position (list): all positions multiple of 3
        
        R (list of list): lists each position with the elements at the next position

    """
    position=[]
    R=[]
    for i in range(len(T)-3): 
        if i%3==0:
            position.append(i)
            if i+1<len(T)-3:
                R.append([[T[i],index_12_dict[i+1]],position[-1]])
            else:
                R.append([[T[i],1],position[-1]]) 
    return position, R

##### Complexité:
Parcourir notre séquence initiale 1 fois implique que position0_R0_p0 a une compléxité $O(n)$. La tache "chercher le rang de la position suivante" a une complexité de $O(1)$ car on cherche une clé dans un dictionnaire

#### Troisième étape de DC3: combiner les positions triées pour trouver le suffix array.

In [19]:
def merge_with_p12(Tfinal, r0_p0_sorted, index_12_dict) :
    """
    Constructs a list of the positions in the right order to construct the suffix table of parameter "Tfinale"
    
    Args : 
        Tfinal(list of int) : sequence for which we want to create a suffix table 
        
        r0_p0_sorted(list of list) : sorted output of positionR0P0 
        
        index_12_dict(dictionary):  dictionary which contains the positions of p12 as key and their order as elements
    
    Return : 
        liste (list of int) : positions in the right order to construct the suffix table 
    """

    index_12_dict_keys=list(index_12_dict.keys())
    liste=[]
    A=0
    B=0
    while A<len(r0_p0_sorted) and B<len(index_12_dict_keys):
        a=r0_p0_sorted[A][1]
        b=index_12_dict_keys[B]
        if Tfinal[a]!=Tfinal[b] :
            minimum=min(Tfinal[a], Tfinal[b])
            
            if minimum == Tfinal[a]:
                A+=1
                liste.append(a)
            else: 
                B+=1
                liste.append(b)

        else :
            if b%3==1 : 
                longueur=len(liste)
                i=0
                if index_12_dict[a+1]<index_12_dict[b+1]:
                    liste.append(a)
                    A+=1
                else:
                    liste.append(b)
                
                    B+=1
                    
                    
            elif b%3==2 :
            
                if Tfinal[a+1]!=Tfinal[b+1] :
                   
                    minimum=min(Tfinal[a+1], Tfinal[b+1])
                    if minimum == Tfinal[a+1]:
                        A+=1
                    
                        liste.append(a)
                    else: 
                        B+=1
                        
                        liste.append(b)

                else:
                  
                    if index_12_dict[a+2]<index_12_dict[b+2]:
                        liste.append(a)
                        A+=1
                    else:
                        liste.append(b)
                        B+=1
                        

    if A==len(r0_p0_sorted):

        for i in range(B,len(index_12_dict_keys)):
            liste.append(index_12_dict_keys[i])
                
    if B==len(index_12_dict_keys):

        for i in range(A, len(r0_p0_sorted)):
            liste.append(r0_p0_sorted[i][1])

    return liste

##### Complexité:
La boucle while suivit de n'importe quelle des deux boucles for ne parcourt que n éléments. Les boucles ne sont pas imbriquées les unes dans les autres, et dans la boucle while, seules des opérations de complexité $O(1)$ prennent place.
Donc merge_with_p12 a une complexité de $O(n)$.

In [20]:
def remove_sentinel(index):
    """
    Return our suffix array without the position of the three sentinels numbers
    
    Args : 
        index (list of int) : list of indexes 
        
    Return :
        index[1:](list of int) : list of indexes without the sentinels
    
    """
    return index[1:]

##### Complexité
removesentinel a une complexité temporelle de $0(1)$.

In [21]:
def resumeHigherOrder_with_p12(index_012_prime, P12):
    """
    Returns a dictionnary with positions not multiple of 3 "P12" as key, and their order with 
    the positions of the suffix array "index012prime" as elements.
    
    Args : 
        index012prime (list of int) : suffix array of the positions not multiple of 3
        
        P12 (list of int) : positions not multiple of 3
    
    Return : 
        output (dictionnary) : positions not multiple of 3 "P12" as key, and their order with the positions of the suffix array "index012prime" as elements.
    """
    output={}
    for i in range(len(index_012_prime)):
        output[P12[index_012_prime[i]]]=i
        
    return output

#### Dernière étape de DC3: assemblage final de la récursion.

In [22]:
def almost_dc3_with_p12(T):
    """
    Return the suffix array of the sequence T
    
    Args : 
        T (list of int) : list of int corresponding to the ascii code of the initial sequence
        
    Return : 
        index_012 (list of int) : positions in order to construct the suffix array of T
    """
    
    column_number=2
    p12=position1_2(T)
    r12=radix_with_p12(p12,T)
    alphabet_T=alphabetT(T)
    r12_sorted=sort_with_p12(r12, alphabet_T, column_number)    
    order12,repetition, index_12_dict=ordre_with_p12(r12_sorted, True)
    if repetition:
        Tprim=Tprime_with_p12(p12, index_12_dict)+[0,0,0]
        index_012=almost_dc3_with_p12(Tprim)
        index_12_dict=resumeHigherOrder_with_p12(index_012, p12)
    p0,r0_p0=position0_R0_p0(T, index_12_dict)
    alphabet_r0=alphabet_r0_with_p12(r0_p0)
    r0_sorted=sort_with_p12(r0_p0,alphabet_r0 ,column_number-1)
    index_012=remove_sentinel(merge_with_p12(T, r0_sorted, index_12_dict))
    
    return index_012
   
    
    

##### Complexité
almost_dc3_with_p12 est une fonction récursive dont chaque sous fonction a une complexité au pire linéaire.
Elle est donc de complexité $0(n)$

##### Testons le code avec l'exemple du cours

In [23]:
S="abcabcacab"
start=time.time()
T=almost_dc3_with_p12(asciiDC3(S))
end=time.time()
column_number=2
print(T)
print("La fonction pour trouver le suffix array de S a pris "+str(end-start)+" secondes.")

[8, 0, 3, 6, 9, 1, 4, 7, 2, 5]
La fonction pour trouver le suffix array de S a pris 0.00013256072998046875 secondes.


##### Testons le code avec le génome

Suffix table du premier chromosome

In [24]:
start=time.time()
suffix_array=almost_dc3_with_p12(asciiDC3(genome[0].seq.upper()))
end=time.time()
print("Pour le premier chromosome, il nous faut "+str(end-start)+" secondes pour calculer sa suffix table.")
print("Le premier chromosome a une taille de "+str(len(genome[0].seq.upper()))+" nucléotides.")

Pour le premier chromosome, il nous faut 13.967129230499268 secondes pour calculer sa suffix table.
Le premier chromosome a une taille de 640851 nucléotides.


Mesurons le temps que prend chaque sous fonction de notre algorithme DC3 grâce à la bibliothèque cProfile.

In [25]:
import pstats
import cProfile

In [26]:
cProfile.run("almost_dc3_with_p12(asciiDC3(genome[0].seq.upper()))", "dc3_stats")
p = pstats.Stats("dc3_stats")
p.sort_stats("cumulative").print_stats()

Thu Nov  9 22:31:55 2023    dc3_stats

         21257125 function calls (21257119 primitive calls) in 19.350 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   19.350   19.350 {built-in method builtins.exec}
        1    0.277    0.277   19.350   19.350 <string>:1(<module>)
      7/1    0.690    0.099   18.811   18.811 /tmp/ipykernel_5489/3977718216.py:1(almost_dc3_with_p12)
       14    0.094    0.007    5.067    0.362 /tmp/ipykernel_5489/2430097404.py:1(sort_with_p12)
       35    4.973    0.142    4.973    0.142 /tmp/ipykernel_5489/2430097404.py:29(counting_sort_by_digit_with_p12)
        7    2.902    0.415    3.001    0.429 /tmp/ipykernel_5489/3017114688.py:1(position0_R0_p0)
        7    2.811    0.402    2.876    0.411 /tmp/ipykernel_5489/762813728.py:1(radix_with_p12)
        7    2.383    0.340    2.796    0.399 /tmp/ipykernel_5489/3105641319.py:1(merge_with_p12)
        7    1.036    0

<pstats.Stats at 0x7f7d5330bc70>

On constate que les fonctions les plus consommatrices en temps, sont les fonctions de tri, et le fonctions initiales de séparation des positions.

### Transformée de Burrow-Wheeler

In [27]:
def BWT_suffix_table(T,end_of_string=False):
    """
    Compute the BWT from the suffix table
    
    Args:
        T (str): string
        end_of_string (char): end of string character to append
    
    Return:
        bwt (str): BWT
    """
    if end_of_string==False:
        T += '!'
    suffix_array=almost_dc3_with_p12(asciiDC3(T)) 
    bwt = ""
    for i in suffix_array:
        bwt += T[i-1]
    return(bwt)

def BWT(T, suffix_table,end_of_string=False):
    """
    Compute the BWT from the suffix table
    
    Args:
        T (str): string
        end_of_string (char): end of string character to append
    
    Return:
        bwt (str): BWT
    """
    if end_of_string==False:
        T += '!'
    bwt = ""
    for i in suffix_table:
        bwt += T[i-1]
    return(bwt)

In [28]:
def run_length_encoding(S):
    """
    Encode sequence using the Run Length method
    
    Args:
        text (str): string to be shifted
    
    Return:
        str: run length
    """
    encoded_S= ""
    i=0
    number=1
    while i<len(S):
        encoded_S+=S[i]
        i+=1
        while i<len(S) and S[i-1]==S[i]:
            number+=1
            i+=1
        if number>1:
            encoded_S+=str(number)
        number=1
    return encoded_S

In [29]:
def print_suffix_table(sequence, sf, visualize_bwt=False):
    for i in range(len(sf)):
        if visualize_bwt:
            print(sequence[sf[i]-1:])
        else:
            print(sequence[sf[i]:])
    return

#### Testons notre fonction

In [30]:
test_2='ATGCTAGCTGCCCTGATCTCTCTGA!'
suffix_array_2_with_p12=almost_dc3_with_p12(asciiDC3(test_2)) 
print(suffix_array_2_with_p12)


[25, 24, 5, 15, 0, 10, 11, 3, 17, 19, 21, 12, 7, 23, 14, 9, 2, 6, 4, 16, 18, 20, 22, 13, 8, 1]


In [31]:
print_suffix_table(test_2,suffix_array_2_with_p12)

!
A!
AGCTGCCCTGATCTCTCTGA!
ATCTCTCTGA!
ATGCTAGCTGCCCTGATCTCTCTGA!
CCCTGATCTCTCTGA!
CCTGATCTCTCTGA!
CTAGCTGCCCTGATCTCTCTGA!
CTCTCTGA!
CTCTGA!
CTGA!
CTGATCTCTCTGA!
CTGCCCTGATCTCTCTGA!
GA!
GATCTCTCTGA!
GCCCTGATCTCTCTGA!
GCTAGCTGCCCTGATCTCTCTGA!
GCTGCCCTGATCTCTCTGA!
TAGCTGCCCTGATCTCTCTGA!
TCTCTCTGA!
TCTCTGA!
TCTGA!
TGA!
TGATCTCTCTGA!
TGCCCTGATCTCTCTGA!
TGCTAGCTGCCCTGATCTCTCTGA!


In [32]:
print(BWT_suffix_table(test_2, True))
print(BWT(test_2,suffix_array_2_with_p12, True))

AGTG!GCGTTTCGTTTTACACCCCCA
AGTG!GCGTTTCGTTTTACACCCCCA


In [33]:
print(run_length_encoding(BWT_suffix_table(test_2, True)))

AGTG!GCGT3CGT4ACAC5A


BWT sur notre génome

In [34]:
bwt_T=BWT_suffix_table(genome[1].seq.upper())
print(run_length_encoding(bwt_T)[:100])

ACT2AT2A5TA4CA10GA8TACA5CATACA2TA2TAGA8CA2CA6G2CA3GA11CA2TA12TA3TA4TA4TA5CA13CA2TA3CA3TA7CA3TA6TGTA2


Mesurons le temps que l'opération prend

In [35]:
cProfile.run("run_length_encoding(BWT_suffix_table(genome[1].seq.upper()))", "bwt_stats")
p = pstats.Stats("bwt_stats")
p.sort_stats("cumulative").print_stats()

Thu Nov  9 22:32:45 2023    bwt_stats

         38466892 function calls (38466887 primitive calls) in 29.428 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   29.428   29.428 {built-in method builtins.exec}
        1    0.032    0.032   29.428   29.428 <string>:1(<module>)
        1    0.780    0.780   28.926   28.926 /tmp/ipykernel_5489/3234469646.py:1(BWT_suffix_table)
      6/1    1.130    0.188   26.495   26.495 /tmp/ipykernel_5489/3977718216.py:1(almost_dc3_with_p12)
       12    0.145    0.012    7.810    0.651 /tmp/ipykernel_5489/2430097404.py:1(sort_with_p12)
       30    7.665    0.255    7.665    0.255 /tmp/ipykernel_5489/2430097404.py:29(counting_sort_by_digit_with_p12)
        6    5.534    0.922    5.627    0.938 /tmp/ipykernel_5489/762813728.py:1(radix_with_p12)
        6    3.567    0.595    4.190    0.698 /tmp/ipykernel_5489/3105641319.py:1(merge_with_p12)
        6    1.530    

<pstats.Stats at 0x7f7d5330b190>

## String search with BWT

In [36]:
from collections import Counter

In [37]:
def occurrence_indexer(S):
    K = []
    last_index = {}
    for s in S:
        if s not in last_index: 
            last_index[s] = 0
        K.append(last_index[s])
        last_index[s] += 1
    return(K)

def lettre_et_occurence(BWT,pattern_letter, position_in_BWT, occurence_index):
    output=[]
    for i in range(len(position_in_BWT)):
        if BWT[position_in_BWT[i]]==pattern_letter:
            output.append([BWT[position_in_BWT[i]],occurence_index[position_in_BWT[i]]])
    return output

def new_suffix_table_positions(BWT,letter_and_occurence, counts):
    # Counter({'T': 3, 'A': 2, 'B': 2, 'C': 2, '!': 1})
    output=[]
    for i in range(len(letter_and_occurence)):
        letter=letter_and_occurence[i][0]
        occurence=letter_and_occurence[i][1]
        new_position_to_evaluate=occurence + sum([counts[char] for char in counts if char < letter])
        output.append(new_position_to_evaluate)
    return output

def initialize_suffix_table_positions(counts, pattern_letter):
    start=sum([counts[char] for char in counts if char < pattern_letter])
    end=start+counts[pattern_letter]
    return [i for i in range(start, end)]
    
def find_patterns_in_sequence_with_dc3(sequence, pattern, sequence_has_end_character=False):
    output=[]
    
    if sequence_has_end_character==False:
        sequence += '!'
        
    sf=almost_dc3_with_p12(asciiDC3(sequence)) 
    BWT = ""
    for i in sf:
        BWT += sequence[i-1]
    
    
    #print("BWT")
    #print(BWT)
    #print("\n")
    occurence_index = occurrence_indexer(BWT)
    #print('occurence_index')
    #print(occurence_index)
    #print("\n")
    counts = Counter(BWT)
    #print("counts")
    #print(counts)
    #print("\n")
    index=len(pattern)-1
    pattern_letter=pattern[index]
    
    suffix_table_positions=initialize_suffix_table_positions(counts, pattern_letter)
    #print("suffix_table_positions")
    #print(suffix_table_positions)
    #print("\n")
    while index>0 and len(suffix_table_positions)>0:
        
        index-=1
        #print("index")
        #print(index)
        #print("\n")
        pattern_letter=pattern[index]
        #print("pattern_letter")
        #print(pattern_letter)
        #print("\n")
        
        A=lettre_et_occurence(BWT,pattern_letter,suffix_table_positions, occurence_index)
        #print("A")
        #print(A)
        #print("\n")
        
        suffix_table_positions=new_suffix_table_positions(BWT, A, counts)
        #print("suffix_table_positions")
        #print(suffix_table_positions)
        #print("\n")
        
        
    for i in range(len(suffix_table_positions)):
        output.append(sf[suffix_table_positions[i]])
        

    return output

def find_patterns_in_sequence(sf, BWT, pattern):
    output=[]

    
    
    
    #print("BWT")
    #print(BWT)
    #print("\n")
    occurence_index = occurrence_indexer(BWT)
    #print('occurence_index')
    #print(occurence_index)
    #print("\n")
    counts = Counter(BWT)
    #print("counts")
    #print(counts)
    #print("\n")
    index=len(pattern)-1
    pattern_letter=pattern[index]
    
    suffix_table_positions=initialize_suffix_table_positions(counts, pattern_letter)
    #print("suffix_table_positions")
    #print(suffix_table_positions)
    #print("\n")
    while index>0 and len(suffix_table_positions)>0:
        
        index-=1
        #print("index")
        #print(index)
        #print("\n")
        pattern_letter=pattern[index]
        #print("pattern_letter")
        #print(pattern_letter)
        #print("\n")
        
        A=lettre_et_occurence(BWT,pattern_letter,suffix_table_positions, occurence_index)
        #print("A")
        #print(A)
        #print("\n")
        
        suffix_table_positions=new_suffix_table_positions(BWT, A, counts)
        #print("suffix_table_positions")
        #print(suffix_table_positions)
        #print("\n")
        
        
    for i in range(len(suffix_table_positions)):
        output.append(sf[suffix_table_positions[i]])
        

    return output
        
        
    

In [38]:
def new_suffix_table_positions_faster(BWT,pattern_letter, position_in_BWT, occurence_index, counts):
    # Counter({'T': 3, 'A': 2, 'B': 2, 'C': 2, '!': 1})
    output=[]
    letter_and_occurence=[]
    for i in range(len(position_in_BWT)):
        if BWT[position_in_BWT[i]]==pattern_letter:
            letter_and_occurence=[BWT[position_in_BWT[i]],occurence_index[position_in_BWT[i]]]
            letter=letter_and_occurence[0]
            occurence=letter_and_occurence[1]
            new_position_to_evaluate=occurence + sum([counts[char] for char in counts if char < letter])
            output.append(new_position_to_evaluate)
    return output

def find_patterns_in_sequence_faster(sf, BWT, pattern):
    output=[]
    occurence_index = occurrence_indexer(BWT)
    counts = Counter(BWT)
    index=len(pattern)-1
    pattern_letter=pattern[index]
    
    suffix_table_positions=initialize_suffix_table_positions(counts, pattern_letter)
    while index>0 and len(suffix_table_positions)>0:
        index-=1
        pattern_letter=pattern[index]
        suffix_table_positions=new_suffix_table_positions_faster(BWT,pattern_letter,suffix_table_positions,occurence_index,counts) 
        
    for i in range(len(suffix_table_positions)):
        output.append(sf[suffix_table_positions[i]])
        
    return output

#### Testons notre fonction

In [39]:
test_mltpl='ABCDEFGHIJKLMNOPPPPPPABCDE!'
sf=almost_dc3_with_p12(asciiDC3(test_mltpl))
print(sf)

[26, 21, 0, 22, 1, 23, 2, 24, 3, 25, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 20, 19, 18, 17, 16, 15]


In [40]:
positions_of_pattern=find_patterns_in_sequence_with_dc3(test_mltpl, 'GHIJK', sequence_has_end_character=True)
print(positions_of_pattern)
for p in positions_of_pattern:
    print(test_mltpl[p:p+6])


[6]
GHIJKL


#### Testons avec notre génome

Création de la suffix table et de la transfomée de burrow wheeler de notre génome 

In [41]:
sf=almost_dc3_with_p12(asciiDC3(genome[0].seq.upper()))
bwt=BWT(genome[0].seq.upper(), sf, False)

Création du pattern que nous voulons chercher dans le génome 

In [42]:
pattern=genome[0].seq.upper()[:10]
pattern

Seq('TGAACCCTAA')

Recherche des positions des kmers créés à partir du pattern ?

In [43]:
positions_of_pattern=find_patterns_in_sequence_faster(sf, bwt, pattern)
print(sorted(positions_of_pattern))
#for p in positions_of_pattern:
    #print(genome[1].seq.upper()[p:p+len(pattern)])

[0, 35, 49, 70, 91, 119, 140, 174, 195, 209, 230, 244]


Calcul du temps nécéssaire pour trouver les positions 

In [44]:
cProfile.run("find_patterns_in_sequence_faster(sf, bwt, pattern)", "pattern_matching_stats")
p = pstats.Stats("pattern_matching_stats")
p.sort_stats("cumulative").print_stats()

Thu Nov  9 22:32:59 2023    pattern_matching_stats

         1061834 function calls (1061833 primitive calls) in 0.424 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.424    0.424 {built-in method builtins.exec}
        1    0.007    0.007    0.424    0.424 <string>:1(<module>)
        1    0.003    0.003    0.417    0.417 /tmp/ipykernel_5489/4286594490.py:14(find_patterns_in_sequence_faster)
        9    0.123    0.014    0.194    0.022 /tmp/ipykernel_5489/4286594490.py:1(new_suffix_table_positions_faster)
        1    0.152    0.152    0.187    0.187 /tmp/ipykernel_5489/1333365769.py:1(occurrence_indexer)
   140285    0.051    0.000    0.051    0.000 /tmp/ipykernel_5489/4286594490.py:10(<listcomp>)
   781148    0.043    0.000    0.043    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.025    0.025 /usr/lib/python3.10/collections/__init__.py:565(__init__)
      

<pstats.Stats at 0x7f7d52135210>

### Travaillons avec des kmers

In [45]:
def kmer(sequence, longueur):
    n = len(sequence)
    kmer=[]
    for i in range(n-longueur+1):
        kmer.append(sequence[i:i+longueur])
    return kmer
    

In [46]:
def position_with_suffix_table(liste_kmer, genome):
    output=[]
    for i in range(len(liste_kmer)):
        output.append(find_patterns_in_sequence_with_dc3(genom, liste_kmer[i],sequence_has_end_character=True))
    return output

def position(liste_kmer, bwt, suffix_table):
    output=[]
    for i in range(len(liste_kmer)):
        output.append(find_patterns_in_sequence_faster(suffix_table, bwt, liste_kmer[i]))
    return output

def position_sorted(liste_kmer, bwt, suffix_table):
    output=[]
    for i in range(len(liste_kmer)):
        output.append(sorted(find_patterns_in_sequence_faster(suffix_table, bwt, liste_kmer[i])))
    return output

In [47]:
def complementaire_inverse(sequence):
    ci = []
    for i in range(len(sequence) - 1, -1, -1):
        if sequence[i] == 'A':
            ci.append('T')
        elif sequence[i] == 'T':
            ci.append('A')
        elif sequence[i] == 'C':
            ci.append('G')
        elif sequence[i] == 'G':
            ci.append('C')
    return ''.join(ci)

In [48]:
def assembler_des_kmers(numero_du_kmer, position_ds_genome,longueur, liste_position_kmers):
    compteur_de_kmer=1 #compte le nb de kmers qu'on arrive à aligner
    last_position=0
    i=numero_du_kmer+1
    difference=1
    
    while i<len(liste_position_kmers):
        j=0
        previous_compteur=compteur_de_kmer
        k=0
        while j<len(liste_position_kmers[i]) and compteur_de_kmer==previous_compteur:
            # on ne quitte pas la boucle while tant qu'on a pas réussi
            # à aligner un kmer, où qu'on a pas parcouru la liste des positions prises par le kmer
            if position_ds_genome+difference==liste_position_kmers[i][j]:
                compteur_de_kmer+=1
                last_position=liste_position_kmers[i][j]
            j+=1
            
        if compteur_de_kmer==previous_compteur:
            # si on a pas réussi à aligner ce kmer
            i+=longueur
            # on continue l'alignement à partir du kmer 10 positions plus loin
            difference+=longueur
            if  i>= len(liste_position_kmers):
                # on vérifie que 10 positions plus loin on aie pas atteint la fin de notre read
                return [[position_ds_genome,last_position+longueur],compteur_de_kmer]
            else:
                
                #on teste si l'absence d'alignement est du à un mutation
                k=0

                while k<len(liste_position_kmers[i]) and compteur_de_kmer==previous_compteur:
                    if position_ds_genome+difference==liste_position_kmers[i][k]:
                        compteur_de_kmer+=1
                        last_position=liste_position_kmers[i][k]
                    k+=1



                if compteur_de_kmer==previous_compteur:
                    ###cas d'une addition
                    difference-=1
                    l=0
                    while l<len(liste_position_kmers[i]) and compteur_de_kmer==previous_compteur:

                        if position_ds_genome+difference==liste_position_kmers[i][l]:
                            compteur_de_kmer+=1
                            last_position=liste_position_kmers[i][l]
                        l+=1

                    if compteur_de_kmer==previous_compteur:
                        ###cas d'une déletions
                        difference+=1
                        i-=1
                        m=0
                        while m<len(liste_position_kmers[i]) and compteur_de_kmer==previous_compteur:

                            if position_ds_genome+difference==liste_position_kmers[i][m]:
                                compteur_de_kmer+=1
                                last_position=liste_position_kmers[i][m]
                            m+=1

                        if compteur_de_kmer==previous_compteur:
                            ### cas où l'alignement est terminé
                            return [[position_ds_genome,last_position+longueur],compteur_de_kmer]

        i+=1
        difference+=1   
        
    return [[position_ds_genome,last_position+longueur],compteur_de_kmer]

In [49]:
def alignement_maximum_de_kmer(liste_position_kmers, longueur):
    output=[[0,0],0]
    maximum=0
    for i in range(longueur+1):
        current_kmer=liste_position_kmers[i]
        for p in current_kmer:
            alignement=assembler_des_kmers(i, p, longueur, liste_position_kmers)
            if maximum<alignement[1]:
                output=alignement
                maximum=alignement[1]
    return output

In [50]:
def align_read(chromosome, reads_list, longueur_kmer):
    sf=almost_dc3_with_p12(asciiDC3(chromosome))
    bwt=BWT(chromosome, sf)
    output=[]
    
    for i in range(len(reads_list)): # on parcourt la liste de toutes les positions des kmers
        liste_de_kmer=kmer(reads_list[i].seq.upper(), longueur_kmer)
        position=[]
        for j in range(len(liste_de_kmer)):
            # on démarre notre recherche d'alignement à partir des 10 premiers
            # kmers, pour anticiper le cas où une SNP se trouve au début du read
            position.append(find_patterns_in_sequence_faster(sf,bwt, liste_de_kmer[j]))
        
        alignment=alignement_maximum_de_kmer(position, longueur_kmer)
        
        
        if (alignment[0][1]-alignment[0][0])<30:
            # si l'alignement est trop mauvais
            # on cherche à aligner le complémentaire inverse
            liste_de_kmer=kmer(complementaire_inverse(reads_list[i].seq.upper()), longueur_kmer)
            position=[]
            for j in range(len(liste_de_kmer)):
                position.append(find_patterns_in_sequence_faster(sf,bwt, liste_de_kmer[j]))
                
            alignment=alignement_maximum_de_kmer(position, longueur_kmer)
            
        # on crée un immense liste avec tous les alignements trouvés        
        output.append({"read":i, "alignement":alignment})
    return output
        

### Testons notre algorithme avec un exemple simple 

In [51]:
read="ABCDEFGHIJKLMNOPQRSZUV"
genom="ABCDEFGHIJKLMNOPQRSTUVWWWWWWWWWABCDEFGHWWWWWWWWWWQRSTUV!"
liste_kmer=kmer(read, 5)
positions_alphabet=position_with_suffix_table(liste_kmer, genom)

Visualisons les positions occupées par les kmers de "read"

In [52]:
print(positions_alphabet)

[[0, 31], [1, 32], [2, 33], [3, 34], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [], [], []]


Quel alignement trouve la fonction "assembler_des_kmers" lorsque la première position occupée par le premier kmer lui est fournie ?

In [53]:
assembler_des_kmers(0, positions_alphabet[0][0],5,positions_alphabet)

[[0, 19], 15]

Quel est l'alignement de kmer le plus long que trouve la fonction "alignement_maximum_de_kmer" ?

In [54]:
alignement_maximum_de_kmer(positions_alphabet, 5)

[[0, 19], 15]

Developpons une méthode d'alignement qui ne prend en compte que les kmers initiaux et finaux

In [55]:
import copy
import math

In [143]:
def assembler_premier_kmer(positions, nb_de_kmer_a_aligner):
    """
    Les positions doivent être triées pour que cette fonction marche
    """

    positions_initiales=positions[0]

    for i in range(1,nb_de_kmer_a_aligner):
        current_kmer_positions=positions[i]
        index_1=0
        index_2=0
        new_initial_positions=copy.deepcopy(positions_initiales)
        
        while index_1<len(positions_initiales) and index_2<len(current_kmer_positions):

            while positions_initiales[index_1]+i>current_kmer_positions[index_2] and index_2+1<len(current_kmer_positions):
                index_2+=1
            if positions_initiales[index_1]+i!=current_kmer_positions[index_2]:
                new_initial_positions.remove(positions_initiales[index_1])
                index_1+=1
            else:

                index_1+=1
                
        positions_initiales=new_initial_positions
                     
    return positions_initiales

def assembler_dernier_kmer(positions, nb_de_kmer_a_aligner):
    """
    Les positions doivent être triées pour que cette fonction marche
    """

    positions_finales=positions[-1]
    for i in range(1,nb_de_kmer_a_aligner):
        current_kmer_positions=positions[-1-i]
        index_1=len(positions_finales)-1
        index_2=len(current_kmer_positions)-1
        new_final_positions=copy.deepcopy(positions_finales)
        
        while index_1>=0 and index_2>=0:

            while positions_finales[index_1]-i<current_kmer_positions[index_2] and index_2>=0:
                index_2-=1
            if positions_finales[index_1]-i!=current_kmer_positions[index_2]:
                new_final_positions.remove(positions_finales[index_1])
                index_1-=1
            else:

                index_1-=1
                
        positions_finales=new_final_positions
                     
    return positions_finales

In [193]:
def aligner_un_read_faster(read, taille_kmer, positions, nb_de_kmer_a_aligner, erreurs, decalage):
    positions_initiales=assembler_premier_kmer_2(positions, nb_de_kmer_a_aligner)
    positions_finales=assembler_dernier_kmer_2(positions, nb_de_kmer_a_aligner)
    ### les positions doivent être dans l'ordre normalement

    final_positions=[]
    for i in range(len(positions_initiales)):
        one_align=0
        for j in range(len(positions_finales)):
            if positions_initiales[i]+(len(read)-2*decalage-taille_kmer)-erreurs<=positions_finales[j]<=positions_initiales[i]+(len(read)-2*decalage-taille_kmer)+erreurs:
                one_align+=1
        if one_align>0:
            final_positions.append(positions_initiales[i])
            
    return final_positions

def aligner_un_read_avec_mutation(read, taille_kmer, liste_kmer, bwt, suffix_table, comp_inv_bwt, comp_inv_suffix_table,nb_de_kmer_a_aligner, erreurs, complementaire=False):
    if complementaire:
        BWT=comp_inv_bwt
        sf=comp_inv_suffix_table
    else:
        BWT=bwt
        sf=suffix_table
    positions= position_partial_sorted_2(liste_kmer, BWT, sf, nb_de_kmer_a_aligner, 0)
    output=aligner_un_read_faster_2(read, taille_kmer, positions, nb_de_kmer_a_aligner, erreurs,0)
    if len(output)!=1:
        # si on obtient pas la position initiale à partir du premier et dernier kmer, c'est probablement dû à une
        # mutation sur les premiers/derniers nucléotides du read
        
        #on décale nos recherches sur le 10eme premier kmer et 10e kmer avant la fin.
        positions= position_partial_sorted_2(liste_kmer, BWT, sf, nb_de_kmer_a_aligner, 10)
        output=aligner_un_read_faster_2(read, taille_kmer, positions, nb_de_kmer_a_aligner, erreurs, 10)
        # si rien ne marche, on essaie d'aligner au complémentaire
        if len(output)!=1 and complementaire==False:
            output=aligner_un_read_avec_mutation(read, taille_kmer, liste_kmer, bwt, suffix_table, comp_inv_bwt, comp_inv_suffix_table, nb_de_kmer_a_aligner, erreurs, complementaire=True)
        elif len(output)!=1 and complementaire==True:
            return [-1]
        else:
            output[0]-=taille_kmer
    return output

Testons cette méthode avec notre génome

In [184]:
current_read=reads[0].seq.upper()
current_genome=genome[0].seq.upper()
comp_inv_current_genome=complementaire_inverse(genome[0].seq.upper())
#starting_position du read 2: 471737, starting_position du read 1: 143900,  starting_position du read 0: 131734


In [185]:
sf=almost_dc3_with_p12(asciiDC3(current_genome+"!")) # suffix table du 1er chromosome
bwt=BWT(current_genome+"!", sf, True)
comp_inv_sf=almost_dc3_with_p12(asciiDC3(comp_inv_current_genome+"!"))# suffix table du complémentaire inverse du 1er chromosome
comp_inv_bwt=BWT(comp_inv_current_genome+"!", sf, True)

In [186]:
liste_kmer=kmer(current_read, 10)

Comme nous ne nous interessons plus à la totalité des kmers, nous n'avons pas besoin de calculer toutes leurs positions

In [187]:
def position_partial_sorted(liste_kmer, bwt, suffix_table, nb_kmer, decalage):
    debut=[]
    fin=[0 for i in range(nb_kmer)]
    for i in range(nb_kmer):
        debut.append(sorted(find_patterns_in_sequence_faster(suffix_table, bwt, liste_kmer[decalage+i])))
        fin[-1-i]=sorted(find_patterns_in_sequence_faster(suffix_table, bwt, liste_kmer[-1-decalage-i]))
    return debut+fin

In [188]:
start=time.time()
positions=position_sorted(liste_kmer,bwt,sf)
end=time.time()
print(end-start)

19.700209856033325


In [189]:
start=time.time()
positions=position_partial_sorted(liste_kmer, comp_inv_bwt, comp_inv_sf,3, 0)
end=time.time()
print(end-start)
print(positions)

1.50341796875
[[180985, 475388], [34608, 442558, 526196, 587002], [93805], [113138, 128896, 323924, 423882, 431380], [30284, 30434, 79806, 82721, 98685, 100082, 178803, 202712, 213396, 216252, 225871, 231322, 261366, 298468, 301324, 331859, 333919, 371784, 374562, 482229, 528826, 529132, 552676, 635982, 639277], [35391, 49137, 50476, 51809, 62204, 71853, 93471, 100613, 114869, 116964, 117393, 129785, 148945, 151534, 155007, 157448, 157465, 157475, 181510, 190116, 192471, 198694, 214256, 215819, 224926, 230336, 235946, 262230, 280863, 287483, 294481, 306219, 310508, 310803, 337147, 351264, 353608, 361764, 402221, 402845, 404492, 408442, 425652, 479829, 506858, 507864, 520328, 523005, 536241, 543319, 548651, 554360, 567675, 571922, 575784, 579343, 579919, 589090, 611570]]


On constate le gain de temps non négligeable qu'apporte notre nouvelle fonction, qui ne calcule qu'une partie des positions des kmers.

Alignons notre read grâce à nos nouvelles fonctions

In [194]:
position_initiale_read=aligner_un_read_avec_mutation(current_read, 10, liste_kmer, bwt,sf, comp_inv_bwt, comp_inv_sf, 3, 2)

Visualisons l'alignement proposé par notre algorithme

In [195]:
print("position initiale proposée par l'algorithme")
print(position_initiale_read[0])
print("read")
print(current_read)
print("genome")
print(current_genome[position_initiale_read[0]:position_initiale_read[0]+len(current_read)])

position initiale proposée par l'algorithme
-1
read
TTTCCTTTTTAAGCGTTTTATTTTTTAATAAAAAAAATATAGTATTATATAGTAACGGGTGAAAAGATCCATATAAATAAATATATGAGGAATATATTAA
genome



Essayons d'aligner un grand nombre de reads à chaque fois

In [207]:
start=time.time()
align_reads=[]
for i in range(100):
    liste_kmer=kmer(reads[i].seq.upper(), 10)
    align_reads.append({"read": i, "alignement":aligner_un_read_avec_mutation(reads[i].seq.upper(), 10, liste_kmer, bwt, sf, comp_inv_bwt, comp_inv_sf,3, 2)})
end=time.time()
print(end-start)

330.5504786968231


In [197]:
print(align_reads)

[{'read': 0, 'alignement': [-1]}, {'read': 1, 'alignement': [143900]}, {'read': 2, 'alignement': [471737]}, {'read': 3, 'alignement': [65426]}, {'read': 4, 'alignement': [-1]}, {'read': 5, 'alignement': [-1]}, {'read': 6, 'alignement': [592074]}, {'read': 7, 'alignement': [270169]}, {'read': 8, 'alignement': [-1]}, {'read': 9, 'alignement': [-1]}, {'read': 10, 'alignement': [-1]}, {'read': 11, 'alignement': [-1]}, {'read': 12, 'alignement': [56757]}, {'read': 13, 'alignement': [580626]}, {'read': 14, 'alignement': [106971]}, {'read': 15, 'alignement': [-1]}, {'read': 16, 'alignement': [121139]}, {'read': 17, 'alignement': [418998]}, {'read': 18, 'alignement': [-1]}, {'read': 19, 'alignement': [-1]}]


##### Algorithme lent, mais très juste, appliqué à notre génome

In [67]:
premier_chr=genome[0].seq.upper()+"!"
sf=almost_dc3_with_p12(asciiDC3(premier_chr)) # suffix table du 1er chromosome
bwt=BWT(premier_chr, sf, True)
liste_de_kmer=kmer(reads[2].seq.upper(), 10) # liste de kmers de taille 10 sur le 2e read
positions=position(liste_de_kmer, bwt,sf)

In [68]:
alignement_maximum_de_kmer(positions, 10)

[[471737, 471837], 91]

In [69]:
start=time.time()
align=align_read(premier_chr,reads[:3], 10)
end=time.time()
print(align)
print(end-start)

[{'read': 0, 'alignement': [[131734, 131834], 81]}, {'read': 1, 'alignement': [[143900, 143992], 83]}, {'read': 2, 'alignement': [[471737, 471837], 91]}]
91.34186816215515


In [None]:
print(align[0]["alignement"][0][0])

### Testons si nos reads ont des mutations 

In [71]:
def check_for_mutation(index, read_decalage,genome_decalage, read,genome):
    return read[read_decalage+index+1]==genome[genome_decalage+index+1]


def check_for_deletion(index, read_decalage, genome_decalage, read,genome):
    return read[read_decalage+index]==genome[genome_decalage+index+1]

def check_for_addition(index, read_decalage, genome_decalage, read,genome):
    return read[read_decalage+index+1]==genome[genome_decalage+index]

def compare_read_and_genome(position_in_genome, read,genome):
    read_decalage=0
    genome_decalage=position_in_genome
    str_read=""
    str_genome=""
    i=0
    while i+read_decalage<len(read):
        if read[read_decalage+i]!=genome[genome_decalage+i]:
            if check_for_mutation(i, read_decalage,genome_decalage, read,genome):
                str_read+="*"+read[read_decalage+i]+"*"
                str_genome+="*"+genome[genome_decalage+i]+"*"
                i+=1
                
            elif check_for_deletion(i, read_decalage,genome_decalage, read,genome):
                str_read+="-"
                str_genome+=genome[genome_decalage+i]
                genome_decalage+=1
                i+=1
                
            elif check_for_addition(i, read_decalage, genome_decalage, read,genome):
                str_read+=read[read_decalage+i]
                str_genome+="+"+genome[genome_decalage+i]
                i+=1
                read_decalage+=1
                
            else:
                i+=1
                print("Nous observons plus d'une erreur d'alignement à la suite qui se suivent entre les positions "+str(genome_decalage+i-1)+" et "+str(genome_decalage+i))
        else:
            str_read+=read[read_decalage+i]
            str_genome+=genome[genome_decalage+i]
            i+=1
            
    print("read")
    print(str_read)
        
    print("genome")
    print(str_genome)
    return

#### Testons notre algorithme sur un exemple 

In [203]:
current_read=reads[1].seq.upper()
current_genome=genome[0].seq.upper()
starting_position=143900

Les fonctions suivantes nous permettent de visualiser les zones de SNP

In [204]:
compare_read_and_genome(starting_position,current_read, current_genome)

read
TATATCTTTAAAATGATGTTGCAAATTTATTGAACATGTTAATAAATCATCCTGTTCATTTTGTATGTCTACTAAATTATGTAACGTATCCT*C*TTCTTCA
genome
TATATCTTTAAAATGATGTTGCAAATTTATTGAACATGTTAATAAATCATCCTGTTCATTTTGTATGTCTACTAAATTATGTAACGTATCCT*T*TTCTTCA


### Comparaison des résultats théoriques avec les résultats expérimentaux

In [75]:
pip install pysam

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [76]:
#Importation du fichier BAM
import pysam
bam=pysam.AlignmentFile("/home/azarkua/Documents/2023-2024/omiques2/developement/omique2/single_Pfal_dat.bam", "rb")
#bam = pysam.AlignmentFile("/home/mvernier/Documents/2023-2024/omique2/projet/single_Pfal_dat.bam", "rb")


In [77]:
tab=[]
for alignment in bam:
    read_name = alignment.query_name
    reference_name = bam.getrname(alignment.reference_id)
    position = alignment.reference_start
    mapping_quality = alignment.mapping_quality
    
    tab.append(position)
    
bam.close()
#print(tab)

In [94]:
print(tab[:10])

[131734, 143900, 471737, 65426, 152677, 433417, 592074, 270169, 179004, 463570]


In [98]:
print(align)

[{'read': 0, 'alignement': [[131734, 131834], 81]}, {'read': 1, 'alignement': [[143900, 143992], 83]}, {'read': 2, 'alignement': [[471737, 471837], 91]}]


In [140]:
print(align_reads)

[{'read': 0, 'alignement': [-1]}, {'read': 1, 'alignement': [143900]}, {'read': 2, 'alignement': [471737]}, {'read': 3, 'alignement': [65426]}, {'read': 4, 'alignement': [-1]}, {'read': 5, 'alignement': [-1]}, {'read': 6, 'alignement': [592074]}, {'read': 7, 'alignement': [270169]}, {'read': 8, 'alignement': [-1]}, {'read': 9, 'alignement': [-1]}, {'read': 10, 'alignement': [-1]}, {'read': 11, 'alignement': [-1]}, {'read': 12, 'alignement': [56757]}, {'read': 13, 'alignement': [580626]}, {'read': 14, 'alignement': [106971]}, {'read': 15, 'alignement': [-1]}, {'read': 16, 'alignement': [121139]}, {'read': 17, 'alignement': [418998]}, {'read': 18, 'alignement': [-1]}, {'read': 19, 'alignement': [-1]}]


In [99]:
error=0
for i in range(len(align)):
    if (tab[i]!=align[i]["alignement"][0][0]):
        error+=1
pourcentage_erreur=(error/len(align))*100
print("Le pourcentage d'erreur est "+str(pourcentage_erreur)+"%")

Le pourcentage d'erreur est 0.0%


In [208]:
error=0
for i in range(len(align_reads)):
    if (tab[i]!=align_reads[i]["alignement"][0]):
        error+=1
pourcentage_erreur=(error/len(align_reads))*100
print("Le pourcentage d'erreur est "+str(pourcentage_erreur)+"%")

Le pourcentage d'erreur est 56.00000000000001%
