Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
Browse files

Introducing FastaReader::emit_seq

  • Loading branch information...
commit cb204b7f8c3c95972784e904c7c59c248d32be08 1 parent b70ce0f
@pjotrp authored
View
44 README.md
@@ -8,8 +8,10 @@ computing in biology.
BigBio may use BioLib C/C++/D functions for increasing performance and
reducing memory consumption.
-This is an experimental project. If you wish to contribute subscribe
-to the BioRuby and/or BioLib mailing lists.
+In a way, this is an experimental project. I use it for
+experimentation, but what is in here should work fine. If you wish to
+contribute subscribe to the BioRuby and/or BioLib mailing lists
+instead.
# Overview
@@ -82,6 +84,44 @@ fasta = FastaWriter.new(fn)
fasta.write(mysequence)
```
+## Transform a FASTA file
+
+You can combine above FastaReader and FastaWriter to transform
+sequences, e.g.
+
+```ruby
+fasta = FastaWriter.new(in_fn)
+FastaReader.new(out_fn).each do | rec |
+ # Strip the description down to the second ID
+ (id1,id2) = /(\S+)\s+(\S+)/.match(rec.descr)
+ fasta.write(id2,rec.seq)
+end
+```
+
+The downside to this approach is the explicit file naming. What if you
+want to use STDIN or some other source instead? I have come round to
+the idea of using a combination of lambda and block. For example:
+
+```ruby
+ FastaReader::emit_seq(-> gets) { |rec|
+ print FastaWriter.to_fasta(rec)
+ }
+```
+
+which takes STDIN line by line, and outputs FASTA on STDOUT. This is
+a correct design as the FastaReader and FastaWriter know nothing of
+the mechanism fetching and displaying data. These can both be 'pure'
+functions. Note also that the data is never fully loaded into RAM.
+
+Here the transformer functional style
+
+```ruby
+ FastaReader::emit_seq(-> gets) { |rec|
+ (id1,id2) = /(\S+)\s+(\S+)/.match(rec.descr)
+ print FastaWriter.to_fasta(id2,req.seq)
+ }
+```
+
## Fetch ORFs from a sequence
BigBio can parse a sequence for ORFs. Together with the FastaReader
View
1  Rakefile
@@ -37,6 +37,7 @@ RSpec::Core::RakeTask.new(:rcov) do |spec|
spec.rcov = true
end
+task :test => :spec
task :default => :spec
require 'rake/rdoctask'
View
24 lib/bigbio/db/fasta/fastareader.rb
@@ -130,3 +130,27 @@ def indexed?
end
end
+
+# The following is actually a module/trait implementation without state
+
+class FastaReader
+
+ def FastaReader::emit_seq func
+ buf = func.call
+ seq = ""
+ id = nil
+ descr = nil
+ buf.split(/\n/).each do | line |
+ if line =~ /^>/
+ yield FastaRecord.new(id, descr, seq) if id and seq and seq.size > 0
+ descr = line[1..-1].strip
+ matched = /^(\S+)/.match(descr)
+ id = matched[0]
+ seq = ""
+ else
+ seq += line.strip
+ end
+ end
+ end
+
+end
View
4 lib/bigbio/db/fasta/fastarecord.rb
@@ -30,7 +30,9 @@ def initialize nt, aa
if nt.seq.size == aa.seq.size*3-3
aa.seq.chop!
end
- raise "Sequence size mismatch for #{nt.id} <nt:#{nt.seq.size} != #{aa.seq.size*3} (aa:#{aa.seq.size}*3)>" if nt.seq.size != aa.seq.size*3
+ nt_size = nt.seq.size
+ expected_size = aa.seq.size*3
+ # raise "Sequence size mismatch for #{nt.id} <nt:#{nt.seq.size} != #{aa.seq.size*3} (aa:#{aa.seq.size}*3)>" if expected_size - 3 > nt_size and nt_size > expected_size + 3
end
def id
View
2  lib/bigbio/db/phylip.rb
@@ -17,7 +17,7 @@ module Bio
module Big
module PhylipReader
# Define get_line as a lambda function, e.g.
- # Bio::Big::PhylipReader.emit_seq(lambda { lines.next }) { | name, seq | p [name,seq] }
+ # Bio::Big::PhylipReader.emit_seq(-> { lines.next }) { | name, seq | p [name,seq] }
def PhylipReader::emit_seq get_line
line = get_line.call.strip
View
17 spec/emitter_spec.rb
@@ -20,6 +20,23 @@
end
end
+ it "should emit functional style" do
+ count = 0
+ FastaReader::emit_seq(lambda { File.open("test/data/fasta/nt.fa").read() }) do |rec|
+ case count
+ when 0
+ rec.id.should == "PUT-157a-Arabidopsis_thaliana-1"
+ rec.seq[0..10].should == "AGGTTCGNACG"
+ when 1
+ rec.id.should == "PUT-157a-Arabidopsis_thaliana-2"
+ rec.seq[0..10].should == "AGACAAACGAC"
+ else
+ break
+ end
+ count += 1
+ end
+ end
+
it "should emit large parts" do
FastaEmitter.new("test/data/fasta/nt.fa").emit_seq do | part, index, tag, seq |
# p [index, part, tag, seq]
Please sign in to comment.
Something went wrong with that request. Please try again.