# Locating Motifs Despite Introns

In “Finding a Shared Motif”, we discussed searching through a database containing multiple genetic strings to find a longest common substring of these strings, which served as a motif shared by all of the strings. However, as we saw in “RNA Splicing”, coding regions of DNA are often interspersed with introns that do not code for proteins.

We therefore need to locate shared motifs that are separated across exons, which means that the motifs are not required to be contiguous. To model this situation, we need to enlist subsequences.

-> Sequences may contain introns. So when two sequences do not match completely, we look for a sequence that matches parts of both.

ex. ATTG is a common subsequence of the following 3 sequences.  
"Common" means that each sequence contains the letters of the match seq in the same order, though not necessarily next to each other.  

match seq: ATTG  
sequence 1: A T T C G  
sequence 2: A C C T T T T G  
sequence 3: C C A A T T G  
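
The definition above can be checked mechanically. The following is a minimal sketch, assuming a helper named `is_subsequence` (a name made up for this note, not part of the original notebook) that scans each sequence once and looks for the letters of the match seq in order.

```python
def is_subsequence(motif: str, sequence: str) -> bool:
    """Return True if the letters of motif appear in sequence in the same order."""
    remaining = iter(sequence)
    # `base in remaining` advances the iterator until it finds `base`,
    # so each later letter is only searched for after the previous match.
    return all(base in remaining for base in motif)


# The three example sequences from above, with spaces removed
examples = ["ATTCG", "ACCTTTTG", "CCAATTG"]
for s in examples:
    print(s, is_subsequence("ATTG", s))
```

Because the iterator is shared across the letters of the motif, a letter that appears only before the previous match does not count, which is exactly the "same order" requirement.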

In [1]:
def read_fasta(fasta: str) -> dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    sequences = {}
    header = None
    seq = []
    for line in fasta.splitlines():
        if line.startswith('>'):
            if header is not None:
                sequences[header] = ''.join(seq)  # Store the previous record
            header = line[1:]  # Remove '>'
            seq = []
        else:
            seq.append(line.strip())  # Sequence data may span several lines
    if header is not None:
        sequences[header] = ''.join(seq)  # Add the last sequence
    return sequences

In [16]:
sample_input = """>Rosalind_23
AACCTTGG
>Rosalind_64
ACACTGTGA"""

# Sample answer: AACTGG (other common subsequences of the same length are equally valid)

In [17]:
seq1, seq2 = read_fasta(sample_input).values()
print(seq1, seq2)

AACCTTGG ACACTGTGA


## Finding the Longest Common Subsequence (LCS) with Dynamic Programming

https://www.youtube.com/watch?v=7uQ1Lehw7_k
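
The table that the dynamic programming approach fills in can be summarized by a recurrence. As a sketch, write $s$ and $t$ for the two input strings (notation introduced here for illustration) and let $dp[i][j]$ be the LCS length of the first $i$ characters of $s$ and the first $j$ characters of $t$:

$$
dp[i][j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
dp[i-1][j-1] + 1 & \text{if } s_i = t_j \\
\max\bigl(dp[i-1][j],\ dp[i][j-1]\bigr) & \text{otherwise}
\end{cases}
$$

With $m = |s|$ and $n = |t|$, the entry $dp[m][n]$ is the LCS length, and walking back from it through the table recovers one such subsequence.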

In [18]:
def lcs(seq1: str, seq2: str) -> str:
    """Return one longest common subsequence of seq1 and seq2."""
    m, n = len(seq1), len(seq2)
    # dp[i][j] = LCS length of seq1[:i] and seq2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seq1[i - 1] == seq2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Reconstruct one longest common subsequence by walking back from dp[m][n]
    lcs_str = []
    i, j = m, n
    while i > 0 and j > 0:
        if seq1[i - 1] == seq2[j - 1]:
            lcs_str.append(seq1[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] > dp[i][j - 1]:
            i -= 1
        else:
            j -= 1

    return ''.join(reversed(lcs_str))

In [19]:
seq1, seq2 = read_fasta(sample_input).values()
print(lcs(seq1, seq2))

ACCTGG
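
The printed ACCTGG differs from the AACTGG noted with the sample input, but both have length 6, and a pair of strings can have several longest common subsequences. When only the length matters, the full table is not needed. The following is a minimal sketch, assuming a helper named `lcs_length` (a hypothetical name, not from the original notebook) that keeps just the previous row of the table.

```python
def lcs_length(a: str, b: str) -> int:
    """Return the length of a longest common subsequence of a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j]
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]  # LCS with an empty prefix of b is always 0
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_length("AACCTTGG", "ACACTGTGA"))
```

This uses memory proportional to the length of the second string, at the cost of no longer being able to trace back the subsequence itself.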
