In [1]:
import numpy as np
import numpy.linalg as la
from helper_funcs import generate_genomic_sequences

# Genome classification

In computational biology, a typical problem is:

- given a small strand of DNA, determine whether that DNA belongs to a known, mapped organism.

DNA is made up of long chains of base pairs.
Four nucleobases make up the DNA of every organism:
A, T, C, and G.

Here is an example of what a short snippet of a single DNA sequence could look like:

```python
genome_seq = 'ATCGATTGAGCTCTAGCG'
```

Now, suppose we have sequenced some DNA from an unknown organism.
```python
small_sample = 'ATCG'
```

Does the small sample of DNA belong to the same organism as the provided genomic sequence?

We are going to be using our knowledge of **norms** to answer this question. But first, we need to convert the strand of DNA into an array of numbers:

### Write a function to convert a DNA sequence to an array of numbers

For this conversion, you are going to make the following assumption:

A = 1, T = 2, C = 3, G = 4.

In [11]:
#grade (enter your code in this cell - DO NOT DELETE THIS LINE) 
def seq_to_array(dna):
    # Converts the string dna into a numpy array
    # where each element corresponds to the numeric value of a nucleobase
    # dna: string
    # numeric_dna: 1d numpy array of type integer

    # Map each nucleobase to its numeric value
    DNA2Int = {
        'A': 1,
        'T': 2,
        'C': 3,
        'G': 4
    }
    # Look up every character in turn; any other character raises a KeyError
    numeric_dna = np.array([DNA2Int[base] for base in dna], dtype=int)
    return numeric_dna

Test your function by using it on the sequence `genome_seq`.

In [12]:
genome_seq = 'ATCGATTGAGCTCTAGCG'
genome_numeric = seq_to_array(genome_seq)
print(genome_numeric)

[1 2 3 4 1 2 2 4 1 4 3 2 3 2 1 4 3 4]
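
The same conversion can also be done without a Python-level loop, which may help for long genomes. The sketch below is one possible approach, not part of the exercise: the name `seq_to_array_vectorized` and the lookup table `_LOOKUP` are hypothetical, and the code assumes the input is an ASCII string.

```python
import numpy as np

# Lookup table indexed by ASCII code (hypothetical helper).
# 0 marks any character other than A, T, C, G.
_LOOKUP = np.zeros(128, dtype=np.int64)
for base, value in zip("ATCG", (1, 2, 3, 4)):
    _LOOKUP[ord(base)] = value


def seq_to_array_vectorized(dna):
    """Convert a DNA string to a 1d integer array using A=1, T=2, C=3, G=4."""
    # View the ASCII bytes of the string as an array of character codes
    codes = np.frombuffer(dna.encode("ascii"), dtype=np.uint8)
    # Translate all codes at once through the lookup table
    numeric_dna = _LOOKUP[codes]
    if np.any(numeric_dna == 0):
        raise ValueError("sequence contains characters other than A, T, C, G")
    return numeric_dna


print(seq_to_array_vectorized("ATCGATTGAGCTCTAGCG"))
```

For `genome_seq` this should print the same array as `seq_to_array`.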


Now that we have the numpy array, we can use vector norms to determine whether a small sample of DNA belongs to a larger known DNA sequence.

Suppose that $v_1$ is a contiguous piece of the larger known DNA sequence with the same length as $v_2$, the small unknown sample of DNA.

The small unknown sample belongs to the DNA sequence if we can find a $v_1$ such that:

$$
\|v_1 - v_2\|_1 = 0.
$$
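
The 1-norm is the sum of the absolute values of the entries, so it is zero only when every entry of $v_1 - v_2$ is zero, that is, when the two arrays are identical. A minimal illustration, using the first eight bases of `genome_seq` and the sample `'ATCG'` written out by hand with the mapping above:

```python
import numpy as np
import numpy.linalg as la

genome = np.array([1, 2, 3, 4, 1, 2, 2, 4])  # 'ATCGATTG'
v2 = np.array([1, 2, 3, 4])                  # 'ATCG'

v1_match = genome[0:4]  # piece starting at position 0: [1, 2, 3, 4]
v1_shift = genome[1:5]  # piece starting at position 1: [2, 3, 4, 1]

norm_match = la.norm(v1_match - v2, 1)  # |0| + |0| + |0| + |0| = 0
norm_shift = la.norm(v1_shift - v2, 1)  # |1| + |1| + |1| + |-3| = 6
print(norm_match, norm_shift)
```

Only the piece starting at position 0 gives a norm of zero, so that is where the sample matches.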


In this example, we try to find a match by comparing a DNA sample against the known sequences for a group of animals. We give you the list of animals `animals`, the dictionary `animal_dna` that maps each animal to its DNA sequence, and the smaller DNA sample `unknown_dna`.

In [4]:
# generate inputs for students
animals = ['dog', 'bear', 'giraffe', 'tiger']
animal_dna, unknown_dna = generate_genomic_sequences(animals)
print(unknown_dna)
print(animal_dna)

TTCGTAAGCAATGTAA
{'dog': 'ATCTAGTTATCTCTATCTAGATTGATCGAGCTCTCGCATGCGCTCTCGCTAGATTGATCGAGCTCTCGCATGCGCTCTCGATCGATTGAGCTCTAGCCAGCTAGGAACGCAATTCGTAAGCAATGTAACTCGCATGCGCTCTCGCCAGCTAGGAACGCAAATCGATTGAGCTCTAGATCTAGTTATCTCTATATCTAGTTATCTCTATATCTAGTTATCTCTATCTAGATTGATCGAGCTCCAGCTAGGAACGCAA', 'bear': 'ATCTAGTTATCTCTATCTCGCATGCGCTCTCGCCAGCTAGGAACGCAAATCGATTGAGCTCTAGATCTAGTTATCTCTATCTAGATTGATCGAGCTATCGATTGAGCTCTAGATCGATTGAGCTCTAGCTCGCATGCGCTCTCGCTCGCATGCGCTCTCGCTCGCATGCGCTCTCGCTCGCATGCGCTCTCGCTAGATTGATCGAGCTATCGATTGAGCTCTAGATCTAGTTATCTCTATCTAGATTGATCGAGCT', 'giraffe': 'CTCGCATGCGCTCTCGCCAGCTAGGAACGCAAATCGATTGAGCTCTAGCTAGATTGATCGAGCTCTCGCATGCGCTCTCGATCTAGTTATCTCTATATCGATTGAGCTCTAGCTAGATTGATCGAGCTCTAGATTGATCGAGCTCTCGCATGCGCTCTCGATCGATTGAGCTCTAGCCAGCTAGGAACGCAAATCGATTGAGCTCTAGCTCGCATGCGCTCTCGCCAGCTAGGAACGCAAATCGATTGAGCTCTAG', 'tiger': 'ATCGATTGAGCTCTAGATCGATTGAGCTCTAGCTCGCATGCGCTCTCGCCAGCTAGGAACGCAAATCTAGTTATCTCTATATCGATTGAGCTCTAGCCAGCTAGGAACGCAACCAGCTAGGAACGCAACTCGCATGCGCTCTCGCCAGCTAGGAACGCAACCAGCT

Take a look at the code snippet below. It uses the function `find_the_match`, which is not yet defined (so you will get an error if you try to run it now!).

In [14]:
print('Trying to find a match for ', unknown_dna)
for animal in animal_dna:
    
    known_dna = animal_dna[animal]

    numeric_genome = seq_to_array(known_dna)
    numeric_sample = seq_to_array(unknown_dna)
    
    pos = find_the_match(numeric_sample, numeric_genome)
    
    if pos >= 0:
        print('The sample DNA matches the sequence of the', animal, 'starting at position', pos)
        break
        
if pos < 0:  
    print('Could not find a match')

Trying to find a match for  TTCGTAAGCAATGTAA


KeyError: 2

### Write the function `find_the_match`

The function should use the 1-norm to find where in the DNA sequence the sample DNA matches.

```python
def find_the_match(numeric_sample, numeric_genome):
    # your code here
    return position  # non-negative integer, or -1 if there is no match
```

The function takes the two 1d numpy arrays that were converted from the DNA strings, and returns a non-negative integer if it finds a match, or -1 otherwise.

When it finds a match, the function returns the position in the DNA sequence where the match starts (recall that Python indexing starts at zero).

Run the code snippet above, which uses your now-defined function, and check your results. You can also generate other input values (a different animal list) and re-run the snippet.

In [18]:
#grade (enter your code in this cell - DO NOT DELETE THIS LINE) 
def find_the_match(numeric_sample, numeric_genome):
    n = len(numeric_sample)
    # Slide a window of the sample's length along the genome
    for i in range(len(numeric_genome) - n + 1):
        window = numeric_genome[i:i + n]
        # Debug output: the sample and the window it is compared against
        print(numeric_sample)
        print(window)
        # The 1-norm of the difference is zero only if every entry matches
        if la.norm(numeric_sample - window, 1) == 0:
            return i
    return -1
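
The window-by-window search can also be written without an explicit loop. The sketch below is one possible variant, not the graded solution: the name `find_the_match_vectorized` is hypothetical, it assumes NumPy 1.20 or newer (for `sliding_window_view`), and it treats an empty sample as "no match".

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def find_the_match_vectorized(numeric_sample, numeric_genome):
    """Return the first position where the sample matches the genome, or -1."""
    n = len(numeric_sample)
    # Assumption: an empty or over-long sample counts as "no match"
    if n == 0 or n > len(numeric_genome):
        return -1
    # Each row is one length-n window of the genome
    windows = sliding_window_view(numeric_genome, n)
    # 1-norm of (window - sample) for every window at once
    one_norms = np.abs(windows - numeric_sample).sum(axis=1)
    # Positions where the norm is exactly zero, in increasing order
    hits = np.flatnonzero(one_norms == 0)
    return int(hits[0]) if hits.size else -1


genome = np.array([1, 2, 3, 4, 1, 2, 2, 4])  # 'ATCGATTG'
print(find_the_match_vectorized(np.array([1, 2, 2, 4]), genome))  # 'ATTG'
```

This computes the norm for every window even after a match is found, so it trades the early exit of the loop for vectorized arithmetic.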

In [19]:
find_the_match(numeric_sample, numeric_genome)

[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[1 2 3 2 1 4 2 2 1 2 3 2 3 2 1 2]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[2 3 2 1 4 2 2 1 2 3 2 3 2 1 2 3]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[3 2 1 4 2 2 1 2 3 2 3 2 1 2 3 2]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[2 1 4 2 2 1 2 3 2 3 2 1 2 3 2 1]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[1 4 2 2 1 2 3 2 3 2 1 2 3 2 1 4]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[4 2 2 1 2 3 2 3 2 1 2 3 2 1 4 1]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[2 2 1 2 3 2 3 2 1 2 3 2 1 4 1 2]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[2 1 2 3 2 3 2 1 2 3 2 1 4 1 2 2]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[1 2 3 2 3 2 1 2 3 2 1 4 1 2 2 4]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[2 3 2 3 2 1 2 3 2 1 4 1 2 2 4 1]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[3 2 3 2 1 2 3 2 1 4 1 2 2 4 1 2]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[2 3 2 1 2 3 2 1 4 1 2 2 4 1 2 3]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[3 2 1 2 3 2 1 4 1 2 2 4 1 2 3 4]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[2 1 2 3 2 1 4 1 2 2 4 1 2 3 4 1]
[2 2 3 4 2 1 1 4 3 1 1 2 4 2 1 1]
[1 2 3 2 1 4 1

112
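
As a sanity check, the same position can be looked up directly on the original strings with Python's built-in `str.find`, which also returns -1 when there is no match. The sketch below uses the short `genome_seq` from earlier; the helper name `find_in_string` is hypothetical and not part of the notebook.

```python
def find_in_string(sample, genome):
    """Cross-check: index where sample first occurs in genome, or -1."""
    return genome.find(sample)


genome_seq = 'ATCGATTGAGCTCTAGCG'
pos_found = find_in_string('ATTG', genome_seq)    # occurs inside the sequence
pos_missing = find_in_string('GGGG', genome_seq)  # does not occur
print(pos_found, pos_missing)
```

Applying the same check to `unknown_dna` and the matching animal's sequence should agree with the position returned by `find_the_match`.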