humberto-ortiz/algomolbiol

Grassembler - graph-based assembly of short read sequences.

Copyright 2012 - Humberto Ortiz-Zuazaga, Cassandra Schaening, Jose
Carlos Bonilla, Alejandro Vientos del Valle, Richard Garcia Lebron.

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

Introduction 

As a class project, build an Eulerian assembler.

Input will be 5000 reads of about 100 bases each, drawn from a random
50 kb chunk of NC_010278, Actinobacillus pleuropneumoniae serovar 3
str. JL03 chromosome, complete genome. The chromosome is circular and
2242062 bases long.

Let's generate a random starting position, leaving room for a full
50000-base chunk before the end of the sequence:

>>> import random
>>> random.randint(0, 2242062-50000)
896878

Now we can slice out the 50000-base piece starting at position 896878
(zero-based).

$ python slice.py
$ head slice.fasta
>NC_010278.1 50000 bases starting at position 896878
GAGCTTACGCCGGAATTATTGGTCGATCTCCAATCCGGTAACTACGATCGTGGGATTGTA
ACCTTACCTTATGTACGTCAATCGGATAACCAAACCGTGTATATTCCGCAAAGCATTATC
GGCAACTTATTTGTTTCAAACGGTATGTCGGCTGGTAATACGAAAAACGAAGCGCGTGTA
CAAGGTTTGTCGGAAGTGTTCGAACGTTTTGTGAAAAACCGTATTATTACTGAAGCAATC
AGCTTACCGGAAATTCCGCAAAGCGTGATTGACGGCTATCCGACAATCAAAGCGTCTATC
GAAAAACTTGAGCAAGAAGGCTTCCCGATTTTATGCTATGACGCATCATTAGGCGGTGAA
TTCCCGGTTATTTGTGTGATCTTGTTAAATCCGAATAATGGTACTTGTTTCGCTTCATTC
GGTGCACACCCGAATTTCCAAGTGGCGTTTGAACGTACGGTAACCGAGTTATTACAAGGT
CGTAGCTTAAAAGATTTAGATGTATTTGCTCCGCCTTCATTTAATAATGAAGATGTGGCG
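
The slicing step above could look roughly like the sketch below. This
is not the project's slice.py; the function names and the input file
name in the usage comment are assumptions made for illustration. The
60-column line width matches the output shown above.

```python
def read_fasta(path):
    """Return (header, sequence) for a single-record FASTA file."""
    with open(path) as handle:
        header = handle.readline().rstrip()[1:]  # drop the leading '>'
        sequence = "".join(line.strip() for line in handle)
    return header, sequence


def slice_chunk(sequence, start, length=50000):
    """Return `length` bases beginning at zero-based position `start`."""
    return sequence[start:start + length]


def format_fasta(header, sequence, width=60):
    """Format one FASTA record with lines wrapped at `width` columns."""
    lines = [">" + header]
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


# Hypothetical usage; "NC_010278.fasta" is an assumed file name:
#   header, genome = read_fasta("NC_010278.fasta")
#   chunk = slice_chunk(genome, 896878)
#   with open("slice.fasta", "w") as out:
#       out.write(format_fasta(
#           "NC_010278.1 50000 bases starting at position 896878", chunk))
```

Because the start position was drawn from 0 to 2242062-50000, the chunk
never runs past the end of the sequence, so the sketch does not need to
handle wrap-around on the circular chromosome.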

Just to check:

$ infoseq slice.fasta 
Display basic information about sequences
USA                      Database  Name           Accession      Type Length %GC    Organism            Description 
fasta::slice.fasta:NC_010278.1 -              NC_010278.1    -              N    50000  40.55                      50000 bases starting at position 896878

Next, pick 5000 random starting positions uniformly from 0 to 50000,
and 5000 read lengths from a normal distribution with mean 100 and
standard deviation 5. Slice those fragments out of the chunk and write
them to a file.

$ simulate-reads.py
$ head reads.fasta 
>NC_010278.1 50000 bases starting at position 896878 fragment 42221:42321
ATCACGCTGATAAACGAACGGTTCATTGCTTTACACATGATGTAAGACAGAATCGCACCT
GATGAACCTACTAACGCACCGGTTACAATTAACAAGTCAT
>NC_010278.1 50000 bases starting at position 896878 fragment 12945:13039
TGGCTCGGACGTATTTACGGCGAAAGAATTCGCCAGTTTCCGCTAATTCGTAATATTGTG
ACCGAAGAACGTTACCAAATGGTGCAGGAACGTT
>NC_010278.1 50000 bases starting at position 896878 fragment 25563:25655
TTACATCGCCGTATGCCAACTGTGCCAAGCTCACATCCAAATTACCTTGAATCGTTTTCT
CGGTATTTACTTTTCCTTTTGCGACTAAGTTT
>NC_010278.1 50000 bases starting at position 896878 fragment 15165:15269
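
A minimal sketch of the read simulation described above, not the
project's simulate-reads.py. The function name and the `rng` parameter
are assumptions. One further assumption: each start is drawn so that
the whole read fits inside the chunk, which differs slightly from
drawing starts from the full 0 to 50000 range.

```python
import random


def simulate_reads(sequence, n_reads=5000, mean_len=100, sd_len=5, rng=random):
    """Return a list of (start, end, read) tuples sampled from `sequence`.

    Coordinates are zero-based and half-open, so end - start is the
    read length, as in the "fragment 42221:42321" headers.
    """
    reads = []
    for _ in range(n_reads):
        # Read length from a normal distribution, rounded to whole bases.
        length = max(1, int(round(rng.gauss(mean_len, sd_len))))
        length = min(length, len(sequence))
        # Assumption: choose a start that keeps the read inside the chunk.
        start = rng.randint(0, len(sequence) - length)
        end = start + length
        reads.append((start, end, sequence[start:end]))
    return reads
```

Each tuple carries what is needed to write a header of the form shown
above, for example "fragment %d:%d" % (start, end).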

Next step: read the reads back in and build a graph to use as input
for finding an Eulerian path.
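
One common way to do this, sketched here under assumptions, is to build
a de Bruijn graph from the k-mers in the reads and walk it with
Hierholzer's algorithm. The notes do not say which graph construction
the project uses, so the function names, the choice of k, and the
collapsing of duplicate k-mers are all illustrative. The sketch also
assumes error-free, forward-strand reads and that an Eulerian path
exists.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # simplification: ignore k-mer multiplicity
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(iter(graph))
    for u, targets in graph.items():
        if len(targets) - in_deg[u] == 1:
            start = u
            break
    remaining = {u: list(targets) for u, targets in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node joins the path
    return path[::-1]


def spell(path):
    """Spell the sequence for a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])
```

Collapsing duplicate k-mers loses the copy number of repeats, so a real
assembler would need to handle repeated regions and sequencing errors
separately.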

2012/04/03 - HOZ

Richard contributed a Graph class with some skeleton functions and
some test cases. He implemented manual graph constructors and a
breadth-first search.

The test harness can be run from the command line:

$ python Graph.py 
[['B', 'B', 'B', 'W', 'W', 'W'], [0, 1, 1, (), (), ()], ['NIL', 0, 0, 'NIL', 'NIL', 'NIL']]
[['W', 'B', 'W', 'W', 'W', 'W'], [(), 0, (), (), (), ()], ['NIL', 'NIL', 'NIL', 'NIL', 'NIL', 'NIL']]
[['W', 'W', 'B', 'W', 'W', 'W'], [(), (), 0, (), (), ()], ['NIL', 'NIL', 'NIL', 'NIL', 'NIL', 'NIL']]
[['B', 'B', 'B', 'B', 'W', 'W'], [1, 2, 1, 0, (), ()], [3, 0, 3, 'NIL', 'NIL', 'NIL']]
[['B', 'B', 'B', 'B', 'B', 'W'], [1, 2, 2, 1, 0, ()], [4, 0, 3, 4, 'NIL', 'NIL']]
[['B', 'B', 'B', 'B', 'B', 'B'], [2, 3, 2, 1, 1, 0], [3, 0, 3, 5, 5, 'NIL']]
two  predecessor  zero
two  predecessor  one
three  predecessor  two
three  predecessor  one
fourth  predecessor  three
fourth  predecessor  two
five  predecessor  three
five  predecessor  fourth
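
For reference, a breadth-first search that returns color, distance,
and predecessor lists in the same shape as the output above might look
like the sketch below. This is not Richard's Graph class; the function
name and the adjacency-list input are assumptions, and unreached
vertices get a distance of None here rather than the () shown above.

```python
from collections import deque


def bfs(adjacency, source):
    """Breadth-first search over an adjacency list.

    Returns [color, dist, pred]: 'W' marks unvisited vertices and 'B'
    visited ones, dist is the edge count from `source`, and pred is the
    predecessor on a shortest path ('NIL' if there is none).
    """
    n = len(adjacency)
    color = ["W"] * n
    dist = [None] * n
    pred = ["NIL"] * n
    color[source] = "B"
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if color[v] == "W":  # first time we reach v
                color[v] = "B"
                dist[v] = dist[u] + 1
                pred[v] = u
                queue.append(v)
    return [color, dist, pred]


# Hypothetical six-vertex graph in which only 0, 1 and 2 are connected.
example = [[1, 2], [0], [0], [], [], []]
result = bfs(example, 0)
```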
