# Locating Motifs Despite Introns

In “Finding a Shared Motif”, we discussed searching through a database containing multiple genetic strings to find a longest common substring of these strings, which served as a motif shared by all of them. However, as we saw in “RNA Splicing”, coding regions of DNA are often interspersed with introns that do not code for proteins.

We therefore need to locate shared motifs that are separated across exons, which means that the motifs are not required to be contiguous. To model this situation, we need to enlist subsequences.

-> Sequences include introns. Therefore, when 2 sequences do not match completely, we need to find a sequence that matches parts of both sequences.

ex. ATTG is a common subsequence of the following 3 sequences.  
"Common" means that each sequence contains the letters of the match seq in the same order, though not necessarily next to each other.  

match seq: ATTG  
sequence 1: A T T C G  
sequence 2: A C C T T T T G  
sequence 3: C C A A T T G  
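
The example above can be checked mechanically. The following is a minimal sketch, not part of the original notebook; the helper name `is_subsequence` is hypothetical. It scans each sequence from left to right and looks for every letter of the match seq after the position of the previous match.

```python
def is_subsequence(candidate: str, sequence: str) -> bool:
    """Return True if the letters of candidate appear in sequence in the same order."""
    position = 0
    for char in candidate:
        # Search for char at or after the current position
        position = sequence.find(char, position)
        if position == -1:
            return False
        position += 1  # the next letter must come after this match
    return True


# The three sequences from the example, written without spaces
examples = {
    "sequence 1": "ATTCG",
    "sequence 2": "ACCTTTTG",
    "sequence 3": "CCAATTG",
}
results = {name: is_subsequence("ATTG", seq) for name, seq in examples.items()}
print(results)
```

Every value in `results` should be `True`, because ATTG can be read off each sequence in order while skipping the other letters.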

In [1]:
def read_fasta(fasta: str) -> dict[str, str]:
    sequences = {}
    header = None
    seq = []
    fasta = fasta.splitlines()
    for line in fasta:
        if line.startswith('>'):
            if header is not None:
                sequences[header] = ''.join(seq)
            header = line[1:]  # Remove '>'
            seq = []
        else:
            seq.append(line)
    if header is not None:
        sequences[header] = ''.join(seq)  # Add the last sequence
    return sequences

In [2]:
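# Collect the index tuples of every candidate subsequence of seq_1, longest first
# (2**n - 1 tuples in total, so this is only practical for short sequences)
from itertools import combinations


def pick_index(seq_1: str) -> list[tuple[int, ...]]:
    index_list = []
    n = len(seq_1)
    # Try lengths n, n-1, ..., 1 so that longer candidates come first
    for length in range(n, 0, -1):
        # combinations() yields indices in increasing order, which keeps the letter order
        for index_tuple in combinations(range(n), length):
            index_list.append(index_tuple)
    return index_list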

In [3]:
# Extract the candidate common subsequence from seq_1 using the index tuple
def extract_candidate_common_subseq(seq_1: str, index_tuple: tuple[int, ...]) -> str:
    candidate = []
    for i in index_tuple:
        candidate.append(seq_1[i])
    return ''.join(candidate)

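When we use `in` on an iterator, Python repeatedly calls `next()` until it finds a match, consuming the iterator as it goes. The next `in` test therefore resumes from just after the previous match, which is what enforces the order of the letters.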
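
To see how `in` consumes an iterator, here is a small demonstration that is not part of the original notebook. The variable names are only for illustration.

```python
letters = iter("ACGT")
first = "C" in letters       # calls next() until "C" is found, consuming "A" and "C"
remaining = list(letters)    # whatever was not consumed by the search
print(first, remaining)      # True ['G', 'T']

letters = iter("ACGT")
found_g = "G" in letters     # consumes "A", "C" and "G"
found_c = "C" in letters     # only "T" is left, so "C" is no longer found
print(found_g, found_c)      # True False
```

The second case shows why the order matters: "C" does occur in the string, but not after "G", so the test fails.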

In [4]:
# Compare the candidate common subsequence with the other sequence
def compare_candidate_with_other_seq(candidate: str, seq_2: str) -> bool:
    # A single shared iterator makes each `in` test resume after the previous match
    seq_2_iter = iter(seq_2)
    return all(char in seq_2_iter for char in candidate)

In [5]:
def main(fasta_txt):
    fasta_dict = read_fasta(fasta_txt)
    seqs = list(fasta_dict.values())
    seq_1 = seqs[0]
    seq_2 = seqs[1]

    index_list = pick_index(seq_1)
    for index_tuple in index_list:
        candidate = extract_candidate_common_subseq(seq_1, index_tuple)
        is_common = compare_candidate_with_other_seq(candidate, seq_2)
        if is_common:
            print(candidate)
            break

In [6]:
sample_input = """>Rosalind_23
AACCTTGG
>Rosalind_64
ACACTGTGA"""

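# expected result: AACTGG (one valid answer; another longest common subsequence of the same length, such as AACTTG, is equally valid)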

In [7]:
main(sample_input)

AACTTG
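
The approach above enumerates up to 2^n - 1 index tuples for a sequence of length n, so it only works for short inputs. A common alternative for longer sequences is dynamic programming. The following is a minimal sketch of that alternative, not the method used in this notebook; the function name `lcs_dp` is hypothetical. It fills a table of longest-common-subsequence lengths for all pairs of suffixes and then walks through the table to recover one answer.

```python
def lcs_dp(seq_1: str, seq_2: str) -> str:
    """Return one longest common subsequence of seq_1 and seq_2."""
    n, m = len(seq_1), len(seq_2)
    # length[i][j] = length of the longest common subsequence of seq_1[i:] and seq_2[j:]
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if seq_1[i] == seq_2[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])

    # Walk forward through the table to recover one optimal subsequence
    result = []
    i = j = 0
    while i < n and j < m:
        if seq_1[i] == seq_2[j]:
            result.append(seq_1[i])
            i += 1
            j += 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(result)


answer = lcs_dp("AACCTTGG", "ACACTGTGA")
print(len(answer), answer)
```

The table has (n + 1) × (m + 1) entries, so the running time grows with the product of the two lengths instead of exponentially. Which of the equally long answers is returned depends on how ties are broken in the walk.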
