-
Notifications
You must be signed in to change notification settings - Fork 5
/
README
96 lines (76 loc) · 4.08 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
Grassembler - graph-based assembly of short read sequences.
Copyright 2012 - Humberto Ortiz-Zuazaga, Cassandra Schaening, Jose
Carlos Bonilla, Alejandro Vientos del Valle, Richard Garcia Lebron.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Introduction
As a class project, build an eulerian assembler.
Input will be 5000 reads of length 100 bases from a random 50 Kb chunk
of NC_010278, Actinobacillus pleuropneumoniae serovar 3 str. JL03
chromosome, complete genome. The chromosome is 2242062 bases long, and circular.
Let's generate a random starting position:
>>> import ramdom
>>> random.randint(0, 2242062-50000)
896878
Now we can slice out the piece starting at 896878 (zero based).
$ python slice.py
$ head slice.fasta
>NC_010278.1 50000 bases starting at position 896878
GAGCTTACGCCGGAATTATTGGTCGATCTCCAATCCGGTAACTACGATCGTGGGATTGTA
ACCTTACCTTATGTACGTCAATCGGATAACCAAACCGTGTATATTCCGCAAAGCATTATC
GGCAACTTATTTGTTTCAAACGGTATGTCGGCTGGTAATACGAAAAACGAAGCGCGTGTA
CAAGGTTTGTCGGAAGTGTTCGAACGTTTTGTGAAAAACCGTATTATTACTGAAGCAATC
AGCTTACCGGAAATTCCGCAAAGCGTGATTGACGGCTATCCGACAATCAAAGCGTCTATC
GAAAAACTTGAGCAAGAAGGCTTCCCGATTTTATGCTATGACGCATCATTAGGCGGTGAA
TTCCCGGTTATTTGTGTGATCTTGTTAAATCCGAATAATGGTACTTGTTTCGCTTCATTC
GGTGCACACCCGAATTTCCAAGTGGCGTTTGAACGTACGGTAACCGAGTTATTACAAGGT
CGTAGCTTAAAAGATTTAGATGTATTTGCTCCGCCTTCATTTAATAATGAAGATGTGGCG
Just to check:
$ infoseq slice.fasta
Display basic information about sequences
USA Database Name Accession Type Length %GC Organism Description
fasta::slice.fasta:NC_010278.1 - NC_010278.1 - N 50000 40.55 50000 bases starting at position 896878
Now we basically want to pick 5000 random starting positions uniformly
from 0 to 50000 and 5000 read lengths from a normal distribution with
mean 100 and stddev 5. Slice those out and write to a file.
$ simulate-reads.py
$ head reads.fasta
>NC_010278.1 50000 bases starting at position 896878 fragment 42221:42321
ATCACGCTGATAAACGAACGGTTCATTGCTTTACACATGATGTAAGACAGAATCGCACCT
GATGAACCTACTAACGCACCGGTTACAATTAACAAGTCAT
>NC_010278.1 50000 bases starting at position 896878 fragment 12945:13039
TGGCTCGGACGTATTTACGGCGAAAGAATTCGCCAGTTTCCGCTAATTCGTAATATTGTG
ACCGAAGAACGTTACCAAATGGTGCAGGAACGTT
>NC_010278.1 50000 bases starting at position 896878 fragment 25563:25655
TTACATCGCCGTATGCCAACTGTGCCAAGCTCACATCCAAATTACCTTGAATCGTTTTCT
CGGTATTTACTTTTCCTTTTGCGACTAAGTTT
>NC_010278.1 50000 bases starting at position 896878 fragment 15165:15269
Next step: read reads, build a graph for input into Eulerian path.
2012/04/03 - HOZ
Richard contributed a Graph class with some skeleton functions and
some test cases. He implemented manual graph constructors and a
breadth first search.
The test harness can be run from the command line:
$ python Graph.py
[['B', 'B', 'B', 'W', 'W', 'W'], [0, 1, 1, (), (), ()], ['NIL', 0, 0, 'NIL', 'NIL', 'NIL']]
[['W', 'B', 'W', 'W', 'W', 'W'], [(), 0, (), (), (), ()], ['NIL', 'NIL', 'NIL', 'NIL', 'NIL', 'NIL']]
[['W', 'W', 'B', 'W', 'W', 'W'], [(), (), 0, (), (), ()], ['NIL', 'NIL', 'NIL', 'NIL', 'NIL', 'NIL']]
[['B', 'B', 'B', 'B', 'W', 'W'], [1, 2, 1, 0, (), ()], [3, 0, 3, 'NIL', 'NIL', 'NIL']]
[['B', 'B', 'B', 'B', 'B', 'W'], [1, 2, 2, 1, 0, ()], [4, 0, 3, 4, 'NIL', 'NIL']]
[['B', 'B', 'B', 'B', 'B', 'B'], [2, 3, 2, 1, 1, 0], [3, 0, 3, 5, 5, 'NIL']]
two predecessor zero
two predecessor one
three predecessor two
three predecessor one
fourth predecessor three
fourth predecessor two
five predecessor three
five predecessor fourth