# In-class exercise: imports, kmers, and de Bruijn graphs

In [1]:
import random
import toyplot
LAYOUT = toyplot.layout.FruchtermanReingold(edges=toyplot.layout.CurvedEdges())


### kmers and the de Bruijn graph

kmers are substrings of length k of a larger string. The de Bruijn graph is a mathematical construct used in genome assembly and other network-based problems. It relies on the idea that a sequence can be reconstructed from its substrings if all substrings of length k are present and consecutive substrings overlap by k-1 characters. One example of such a substring is a short genomic sequence read. Reads, however, rarely overlap by exactly one less than their length. Instead, we can break the reads into smaller kmers, which are far more likely to overlap by k-1, and use those to reconstruct the sequence. See the image below for an example.



![image.png](https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/K-mer-example.png/440px-K-mer-example.png)
https://en.wikipedia.org/wiki/K-mer

### Genome/contig assembly
In this notebook we are going to write functions to assemble a contig from a set of reads broken into kmers. 

### (1) Write a function to generate a random sequence of DNA


In [2]:
def random_sequence(nbases):
    return 
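
One possible way to fill in the stub above, as a minimal sketch. It assumes that "random" means each base is drawn independently and uniformly from `A`, `C`, `G`, `T`, which the exercise does not specify.

```python
import random

def random_sequence(nbases):
    "returns a random DNA string of length nbases (uniform base frequencies assumed)"
    return "".join(random.choice("ACGT") for _ in range(nbases))
```

Calling `random.seed(...)` before the function, as later cells in this notebook do, makes the result repeatable.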

### (2) Write a function to return all kmers for a string
Return the kmers as a count dictionary with the number of times each kmer was observed. 

In [3]:
def get_kmers(target, k):
    return 
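
A minimal sketch of one way to write this function. It takes the kmer size `k` as a second argument, matching how `get_kmers(target, 3)` is called elsewhere in this notebook, and slides a window of width `k` along the string one character at a time.

```python
def get_kmers(target, k):
    "returns a dict mapping each kmer in target to the number of times it occurs"
    kmers = {}
    # the last window starts at index len(target) - k
    for start in range(len(target) - k + 1):
        kmer = target[start:start + k]
        kmers[kmer] = kmers.get(kmer, 0) + 1
    return kmers
```

If `k` is longer than the string, the range is empty and the function returns an empty dictionary.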

### (3) Test our function on a random sequence

In [6]:
target = random_sequence(50)
kmers = get_kmers(target, 3)
kmers

{'AAG': 1,
 'AAT': 1,
 'ACC': 2,
 'AGA': 1,
 'AGC': 1,
 'AGT': 2,
 'ATA': 3,
 'ATG': 1,
 'ATT': 1,
 'CAG': 1,
 'CAT': 1,
 'CCA': 1,
 'CCT': 3,
 'CGC': 1,
 'CGT': 1,
 'CTA': 1,
 'CTC': 1,
 'CTT': 1,
 'GAA': 2,
 'GCA': 1,
 'GCC': 1,
 'GCG': 1,
 'GTG': 1,
 'GTT': 2,
 'TAC': 2,
 'TAG': 2,
 'TAT': 4,
 'TCC': 1,
 'TCG': 1,
 'TGA': 1,
 'TGC': 1,
 'TTA': 3,
 'TTC': 1}

### (4) Write a function to generate sequence reads of length N
This will take arguments `target`, `nreads` and `readlength`. 

In [4]:
def get_random_reads(target, nreads, readlength):
    return 
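
A minimal sketch of a read sampler, taking the target sequence as its first argument as in the later call `get_random_reads(target, nreads, rlen)`. The sampling model is an assumption: reads are error-free, come from the forward strand only, and start at uniformly random positions.

```python
import random

def get_random_reads(target, nreads, readlength):
    "returns a list of nreads substrings of target, each readlength long"
    max_start = len(target) - readlength
    if max_start < 0:
        raise ValueError("readlength is longer than the target sequence")
    reads = []
    for _ in range(nreads):
        # random.randint includes both endpoints
        start = random.randint(0, max_start)
        reads.append(target[start:start + readlength])
    return reads
```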

### (5) A function to break reads into kmers

In [5]:
def reads_to_kmers(reads, k):
    "stores kmers to dict uses .update() to join together kmer dict keys"
    kmers = {}
    #...
    return kmers
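
A minimal sketch of how the elided body might look. One design choice here is an assumption: it uses `collections.Counter`, whose `.update()` adds counts together, whereas `.update()` on a plain dict would overwrite the count of a kmer seen in an earlier read.

```python
from collections import Counter

def reads_to_kmers(reads, k):
    "returns a dict of kmer counts pooled across all reads"
    kmers = Counter()
    for read in reads:
        # Counter.update() increments the count of every kmer it is given
        kmers.update(read[i:i + k] for i in range(len(read) - k + 1))
    return dict(kmers)
```

Reads shorter than `k` contribute no kmers, so a kmer size larger than the read length yields an empty dictionary.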

### (6) Test our functions

In [11]:
## params 
tlen = 100
nreads = 200
rlen = 25
k = 8

## call funcs
random.seed(123)
target = random_sequence(tlen)
reads = get_random_reads(target, nreads, rlen)
kmers = reads_to_kmers(reads, k)

In [12]:
kmers.keys()

dict_keys(['GTCTCCTT', 'GGCCATAT', 'TTTGACAT', 'ATTTTGAC', 'CTCCTTTA', 'CTAGTCTC', 'ATGAATGG', 'CAGTTGTA', 'TAGGTCGA', 'GACATGCG', 'GTAAACCA', 'ATTGACAG', 'ATGCGGTA', 'GTATTGAC', 'TTTTGACA', 'ACTCAGGG', 'TAACTCAG', 'AGCTAGTC', 'AGGTCGAT', 'AGAGCTAG', 'TAAAGAAT', 'GAGCTAGT', 'TTTAACTC', 'CTCAGGGT', 'AGATGAAT', 'ATATAAGT', 'GCTAGTCT', 'GCGGTATT', 'TCTCCTTT', 'GTTAAAGA', 'ATAAGTAA', 'ATGGACCG', 'AGTAAACC', 'CATATAAG', 'TAGTCTCC', 'AAGTAAAC', 'TCCTTTAA', 'CCAGTTGT', 'AAAGAATA', 'ACCAGTTG', 'GATGAATG', 'AGTTGTAG', 'CAGGGTTA', 'GCCATATA', 'GATTTTGA', 'TCGATTTT', 'TATTGACA', 'TTAACTCA', 'ACATGCGG', 'GTAGGTCG', 'GGTCGATT', 'GACAGAGC', 'CAGAGCTA', 'CATGCGGT', 'TTGACAGA', 'GAATGGAC', 'CGGCCATA', 'TGACATGC', 'AACCAGTT', 'GGGTTAAA', 'AAGAATAT', 'TGACAGAG', 'GGACCGGC', 'TTAAAGAA', 'TCAGGGTT', 'TTGTAGGT', 'AATGGACC', 'AGAATATA', 'GACCGGCC', 'AGGGTTAA', 'GGTATTGA', 'ACAGAGCT', 'GGTTAAAG', 'TGCGGTAT', 'TGGACCGG', 'TATAAGTA', 'CCATATAA', 'TTGACATG', 'GTCGATTT', 'AGTCTCCT', 'CTTTAACT', 'TAAACCAG', 'TGAA

### (7) Write a function to return a de Bruijn graph
This should return a set of tuples, one per directed edge. Two kmers are connected when they overlap identically over k-1 of their length, i.e., the last k-1 characters of one equal the first k-1 characters of the other. This is the definition of a [de Bruijn graph](https://en.wikipedia.org/wiki/De_Bruijn_graph). Use two nested for loops to compare all kmers to each other, and use slicing to compare the `[1:]` slice of one kmer to the `[:-1]` slice of the other to test for the k-1 overlap. Note that in the output below the nodes are labelled with (k-1)-mers (7 characters for k = 8).

In [6]:
def get_debruijn_edges(kmers):
    edges = set()
    # ...
    return edges
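
A minimal sketch of one common construction. This is not the nested-loop comparison described above but a swapped-in alternative: each kmer contributes a single edge from its first k-1 characters to its last k-1 characters. It was chosen because it reproduces the edge set shown for `"AAABBBBA"` later in this notebook.

```python
def get_debruijn_edges(kmers):
    "returns a set of (prefix, suffix) tuples, one directed edge per distinct kmer"
    edges = set()
    for kmer in kmers:
        # kmer[:-1] is the (k-1)-mer prefix, kmer[1:] the (k-1)-mer suffix
        edges.add((kmer[:-1], kmer[1:]))
    return edges
```

Because the edges are stored in a set, a kmer observed more than once still yields only one edge.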

In [15]:
edges = get_debruijn_edges(kmers)
edges

{('AAACCAG', 'AACCAGT'),
 ('AAAGAAT', 'AAGAATA'),
 ('AACCAGT', 'ACCAGTT'),
 ('AACTCAG', 'ACTCAGG'),
 ('AAGAATA', 'AGAATAT'),
 ('AAGTAAA', 'AGTAAAC'),
 ('AATGGAC', 'ATGGACC'),
 ('ACAGAGC', 'CAGAGCT'),
 ('ACATGCG', 'CATGCGG'),
 ('ACCAGTT', 'CCAGTTG'),
 ('ACCGGCC', 'CCGGCCA'),
 ('ACTCAGG', 'CTCAGGG'),
 ('AGAATAT', 'GAATATA'),
 ('AGAGCTA', 'GAGCTAG'),
 ('AGCTAGT', 'GCTAGTC'),
 ('AGGGTTA', 'GGGTTAA'),
 ('AGGTCGA', 'GGTCGAT'),
 ('AGTAAAC', 'GTAAACC'),
 ('AGTCTCC', 'GTCTCCT'),
 ('AGTTGTA', 'GTTGTAG'),
 ('ATAAGTA', 'TAAGTAA'),
 ('ATATAAG', 'TATAAGT'),
 ('ATGAATG', 'TGAATGG'),
 ('ATGCGGT', 'TGCGGTA'),
 ('ATGGACC', 'TGGACCG'),
 ('ATTGACA', 'TTGACAG'),
 ('ATTTTGA', 'TTTTGAC'),
 ('CAGAGCT', 'AGAGCTA'),
 ('CAGGGTT', 'AGGGTTA'),
 ('CAGTTGT', 'AGTTGTA'),
 ('CATATAA', 'ATATAAG'),
 ('CATGCGG', 'ATGCGGT'),
 ('CCAGTTG', 'CAGTTGT'),
 ('CCATATA', 'CATATAA'),
 ('CCGGCCA', 'CGGCCAT'),
 ('CCTTTAA', 'CTTTAAC'),
 ('CGATTTT', 'GATTTTG'),
 ('CGGCCAT', 'GGCCATA'),
 ('CGGTATT', 'GGTATTG'),
 ('CTAGTCT', 'TAGTCTC'),


### (8) Plot the de Bruijn graph
Use the toyplot function below to plot the de Bruijn graph from the edges generated above. 

In [7]:
def plot_debruijn_graph(edges):
    "returns a toyplot graph of edges"
    graph = toyplot.graph(
        [i[0] for i in edges],
        [i[1] for i in edges],
        tmarker=">", 
        width=600,
        vlstyle={"font-size": "8px"},
        layout=LAYOUT)
    return graph

In [17]:
## plot as directed graph
plot_debruijn_graph(edges);

### (9) Test for an Eulerian path
When a sequence contains many repeats there may be multiple paths through the graph that traverse each edge exactly once. Conversely, if the graph is incomplete, for example because there are too few kmers to cover the sequence, then no Eulerian path through all edges can be found. The function to find the Eulerian path is a bit complicated, so for now we will just import a working version from the `eulerian` module.


In [18]:
from eulerian import eulerian_path

## this will raise an error if the path does not exist
path = eulerian_path(edges)
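
The `eulerian` module itself is not shown in this notebook. As a rough idea of what such a function might do, here is a minimal sketch of Hierholzer's algorithm for a directed graph given as a set of `(source, target)` tuples. The name `find_eulerian_path` and the choice to raise `ValueError` are assumptions for illustration, not the imported module's API.

```python
from collections import defaultdict

def find_eulerian_path(edges):
    "returns a list of nodes that walks every edge exactly once, else raises ValueError"
    edges = list(edges)
    if not edges:
        raise ValueError("graph has no edges")

    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree for each node
    for src, dst in edges:
        out_edges[src].append(dst)
        balance[src] += 1
        balance[dst] -= 1

    # at most one node may have one extra outgoing edge (the start)
    # and at most one node one extra incoming edge (the end)
    starts = [node for node, diff in balance.items() if diff == 1]
    ends = [node for node, diff in balance.items() if diff == -1]
    unbalanced = [node for node, diff in balance.items() if abs(diff) > 1]
    if unbalanced or len(starts) > 1 or len(ends) > 1:
        raise ValueError("no Eulerian path: node degrees are unbalanced")
    start = starts[0] if starts else edges[0][0]

    # Hierholzer: follow unused edges until stuck, then back up and record nodes
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    # if some edges were never reached, the graph is not connected
    if len(path) != len(edges) + 1:
        raise ValueError("no Eulerian path: graph is not connected")
    return path
```

The returned list has one more node than there are edges, since each step along the path uses one edge.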

### Exporting functions to .py files
Following the lecture instructions, copy all of the functions we defined above into a new text file, which you can create from your Jupyter dashboard by selecting `[new]/[text file]`.
We will then try using these functions again, this time imported from our new Python file. It is important that you name the file `debruijn.py`. 

### Imports: testing on a simple example 

In [19]:
import debruijn

In [20]:
target = "AAABBBBA"
kmers = debruijn.get_kmers(target, 3)
kmers

{'AAA': 1, 'AAB': 1, 'ABB': 1, 'BBA': 1, 'BBB': 2}

In [21]:
edges = debruijn.get_debruijn_edges(kmers)
edges

{('AA', 'AA'), ('AA', 'AB'), ('AB', 'BB'), ('BB', 'BA'), ('BB', 'BB')}

In [26]:
plot_debruijn_graph(edges);

### Repeats can create ambiguity in the de Bruijn graph
The de Bruijn graph represents a path to constructing the full genome by walking along directed edges from node to node. This includes cyclic walks along repeated elements (e.g., AA to AA), although such moves do not show up clearly in the plots we've generated. Repeated elements are particularly troublesome for de Bruijn graphs because they create ambiguity: there can be more than one way to walk across all edges of the graph. Let's test assembling a larger sequence by decomposing 50 bp reads, sampled at different depths, into kmers of different sizes. 

In [13]:
import debruijn
import eulerian
import random

random.seed(123)
target = debruijn.random_sequence(500)

In [19]:
## a dictionary to store our results
results = {}

## iterate over kmer sizes
for kmersize in [10, 30, 100]:
    for nreads in [50, 100, 500, 1000]:
        
        ## store zero starting value
        name = (kmersize, nreads)
        results[name] = 0
        
        ## test over multiple replicates
        for replicate in range(10):
            
            ## call funcs
            reads = debruijn.get_random_reads(target, nreads, 50)
            kmers = debruijn.reads_to_kmers(reads, kmersize)
            edges = debruijn.get_debruijn_edges(kmers)
            
            ## test for eulerian walk
            try:
                path = eulerian.eulerian_path(edges)
                results[name] += 1
                   
            except Exception:
                pass
                        

In [20]:
## show results
results

{(10, 50): 6,
 (10, 100): 10,
 (10, 500): 10,
 (10, 1000): 10,
 (30, 50): 0,
 (30, 100): 7,
 (30, 500): 10,
 (30, 1000): 10,
 (100, 50): 0,
 (100, 100): 0,
 (100, 500): 0,
 (100, 1000): 0}

### How does kmer size affect the results?

https://en.wikipedia.org/wiki/K-mer