-
Notifications
You must be signed in to change notification settings - Fork 23
Introduction
In the most simple form Biopieces can be pieced together on the command line like this (using the pipe character '|'):
read_data | calculate_something | write_result
However, a more comprehensive analysis could be composed:
read_data | select_entries | convert_entries | search_database | evaluate_results | plot_diagram | plot_another_diagram | load_to_database
In the documentation a long pipe like the above will typically be written on multiple lines to give a better over view like this:
read_data |
select_entries |
convert_entries |
search_database |
evaluate_results |
plot_diagram |
plot_another_diagram |
load_to_database
You can either type these on the command line in a single line, or copy/paste multiple lines.
Each Biopiece perform only a single task and may react only to data records (or Biopiece records) containing certain information. E.g. Biopieces that analyze sequence data only react to Biopiece records containing sequences, while all other records simply are passed through, so that they may be analyzed further downstream. Sometimes Biopiece records are not passed through, e.g. if sequences are assembled into longer sequences - then the input sequences are replaced by the output sequences. Biopieces do this by following the principle of least surprise. The data stream is terminated at the end of the last analysis.
The data stream that is piped through the Biopieces consists of records of key/value pairs in the same way a hash does in order to keep as simple a structure as possible. An example record can be seen below:
REC_TYPE: PATSCAN
MATCH: AGATCAAGTG
S_BEG: 7
S_END: 16
ALIGN_LEN: 9
S_ID: piR-t6
STRAND: +
PATTERN: AGATCAAGTG
---
The ---
denotes the delimiter of the records, and each key is a word followed by a ':' and a
white-space and then the value. By convention the Biopieces only uses upper case keys. Since the
records basically are hash structures this mean that the order of the keys in the stream is
unordered, and in the above example it is pure coincidence that HIT_BEG
is displayed before
HIT_END
.
All of the Biopieces are able to read and write a data stream to and from file as long as the records are in the Biopieces format. This means that if you are undertaking a lengthy analysis where one of the steps is time consuming, you may save the stream after this step, and subsequently start one or more analysis from that last step ).
If you are running a lengthy analysis, it is highly recommended that you create a small test sample of the data and run that through the pipeline --- and once you are satisfied with the result proceed with the full data set.
All of the Biopieces can be supplied with long arguments prefixed with switches or single character
arguments prefixed with - switches that can be grouped together (e.g. -xok
). In this introduction
only the long switches are used to emphasize what these switches do.
First you need to be acquainted with some basic Unix commands, here is a tutorial (Tutorial 1-3 should do):
http://www.ee.surrey.ac.uk/Teaching/Unix/
And when you are ready to move on - move on:
In order to find out how a specific Biopiece works, you just type the program name without any
arguments and press return and the usage of the Biopiece will be displayed. E.g. read_fasta
<return>
:
Biopiece: read_fasta
Usage
read_fasta [options] -i <FASTA file(s)>
Options
[-? | --help] # Print full usage description.
[-i <file(s)> | --data_in=<file(s)>] # Comma separated list of files or glob expression to read.
[-n <int> | --num=<int>] # Limit number of records to read.
[-I <file> | --stream_in=<file>] # Read input stream from file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output stream to file - Default=STDOUT
Help
read_fasta is part of the Biopieces framework.
http://www.biopieces.org
Now, this is only a short summary of the usage, to print the full usage use the --help
switch to
get this:
Biopiece: read_fasta
Description
read_fasta read in sequence entries from FASTA files. Each sequence
entry consists of a sequence name prefixed by a '>' followed by the sequence
name on a line of its own, followed by one or my lines of sequence until the
next entry or the end of the file. The resulting biopiece record consists of
the following record type:
SEQ_NAME: test
SEQ_LEN: 10
SEQ: ATCGATCGAC
---
For more about the FASTA format:
http://en.wikipedia.org/wiki/Fasta_format
Usage
read_fasta [options] -i <FASTA file(s)>
Options
[-? | --help] # Print full usage description.
[-i <file(s)> | --data_in=<file(s)>] # Comma separated list of files or glob expression to read.
[-n <int> | --num=<int>] # Limit number of records to read.
[-I <file> | --stream_in=<file>] # Read input stream from file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output stream to file - Default=STDOUT
Examples
To read all FASTA entries from a file:
read_fasta -i test.fna
To read in only 10 records from a FASTA file:
read_fasta -n 10 -i test.fna
To read all FASTA entries from multiple files:
read_fasta -i test1.fna,test2.fna
To read FASTA entries from multiple files using a glob expression:
read_fasta -i '*.fna'
See also
read_align
write_fasta
Author
Martin Asser Hansen - Copyright (C) - All rights reserved.
mail@maasha.dk
August 2007
License
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
Help
read_fasta is part of the Biopieces framework.
http://www.biopieces.org
Or you can go the the appropriate wiki page read_fasta at
Which currently points to Github:
http://github.com/maasha/biopieces/wiki/read_fasta
Since Biopieces work by passing data records from Biopiece to Biopiece you always need more than one Biopiece (execpt
special cases). Typically, one Biopiece to parse data from some format, and one Biopiece to perform some analysis like
counting or plotting - or writing the data in a new format. Data in different formats can be read with the appropriate
Biopiece for that format. The Biopieces are typically named read_<data type>
such as read_fasta,
read_tab, etc., and all behave in a similar manner. Data can be read by supplying the --data_in
switch and a file name
to the file containing the data:
<biopiece> --data_in=<file>
Consider the following three FASTA entries in the file test.fna
:
>test1
AGATTGAGATTATACTTGCAGCACCCGTCC
>test2
ACGTATTCGCGGGACAACACTTTCTGAACT
>test3
ATTGTAACGAACACCAGATCCCTGCTTGTG
In order to read these we use read_fasta like this:
read_fasta --data_in=test.fna
SEQ_NAME: test1
SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC
SEQ_LEN: 30
---
SEQ_NAME: test2
SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT
SEQ_LEN: 30
---
SEQ_NAME: test3
SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG
SEQ_LEN: 30
---
and et viola, the sequences are emitted in the data stream as Biopiece records - and displayed in the terminal window. Now we perform the most basic analysis - we count the records using count_records:
read_fasta --data_in=test.fna | count_records
SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC
SEQ_LEN: 30
SEQ_NAME: test1
---
SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT
SEQ_LEN: 30
SEQ_NAME: test2
---
SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG
SEQ_LEN: 30
SEQ_NAME: test3
---
RECORDS_COUNT: 3
---
Now the sequence records are supplemented with an extra record holding the RECORD_COUNT
. But that is not very handy, if you
have millions of sequences, that these are all displayed. This is because the data stream never stops unless the user want
to save the stream or by applying the --no_stream
switch that will terminate the stream. The --no_stream
switch only
works with those Biopieces where it makes sense that the user might want to terminate the data stream, i.e. after an
analysis step where the user wants to output the result, but not the data stream. Thus if we use the --no_stream
with
count_records we get:
read_fasta --data_in=test.fna | count_records --no_stream
RECORDS_COUNT: 3
---
Which has the desired effect. Notice that the output from count_records is a Biopiece record, but it was not part of the data stream.
If we want to save the sequence data in a new FASTA file we have to use write_fasta:
read_fasta --data_in=test.fna | count_records --no_stream | write_fasta
RECORDS_COUNT: 3
---
But this is wrong? - there are no FASTA entries. The output from count_records is a Biopiece record with the
RECORD_COUNT
and that record is passed to write_fasta don't detect any sequence type records
. Of cause we need to remove
the --no_stream
switch to count_records:
read_fasta --data_in=test.fna | count_records | write_fasta
RECORDS_COUNT: 3
---
>test1
AGATTGAGATTATACTTGCAGCACCCGTCC
SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC
SEQ_NAME: test1
SEQ_LEN: 30
---
>test2
ACGTATTCGCGGGACAACACTTTCTGAACT
SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT
SEQ_NAME: test2
SEQ_LEN: 30
---
>test3
ATTGTAACGAACACCAGATCCCTGCTTGTG
SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG
SEQ_NAME: test3
SEQ_LEN: 30
---
But this is also wrong? We now have a mixture of FASTA type data and Biopiece records including a RECORDS_COUNT
record.
This is of cause because we forgot to terminate the stream with --no_stream
:
read_fasta --data_in=test.fna | count_records | write_fasta --no_stream
>test1
AGATTGAGATTATACTTGCAGCACCCGTCC
>test2
ACGTATTCGCGGGACAACACTTTCTGAACT
>test3
ATTGTAACGAACACCAGATCCCTGCTTGTG
Now the sequences are output nicely in FASTA format, but where did the RECORD_COUNT
go? That was passed to write_fasta,
which still does not react to records without SEQ_NAME
and SEQ
keys - thus this record was simply dropped. The trick is
to save the output of count_records to a file:
read_fasta --data_in=test.fna | count_records --data_out=count.txt
SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC
SEQ_LEN: 30
SEQ_NAME: test1
---
SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT
SEQ_LEN: 30
SEQ_NAME: test2
---
SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG
SEQ_LEN: 30
SEQ_NAME: test3
---
Now the COUNT_RECORD
entry is saved to the file count.txt
and the sequence type Biopiece records are passed through. We
can verify the content of count.txt
using the UNIX command cat
:
cat count.txt
RECORDS_COUNT: 3
---
Of cause we could still use --no_stream
with count_records while at the same time saving the output to file:
read_fasta --data_in=test.fna | count_records --data_out=count.txt --no_stream
Thus the RECORD_COUNT
is written to the count.txt
file and no Biopiece records are emitted. Now let us save the record
count and the FASTA entries:
read_fasta --data_in=test.fna | count_records --data_out=count.txt | write_fasta --data_out=test2.fna
SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC
SEQ_NAME: test1
SEQ_LEN: 30
---
SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT
SEQ_NAME: test2
SEQ_LEN: 30
---
SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG
SEQ_NAME: test3
SEQ_LEN: 30
---
Now the COUNT_RECORD
is saved to count.txt
like before, and the FASTA entries are saved to test2.fna
which we can verify with cat
:
cat test2.fna
>test1
AGATTGAGATTATACTTGCAGCACCCGTCC
>test2
ACGTATTCGCGGGACAACACTTTCTGAACT
>test3
ATTGTAACGAACACCAGATCCCTGCTTGTG
The final touch is to use --no_stream
with write_fasta to terminate the stream:
read_fasta --data_in=test.fna | count_records --data_out=count.txt | write_fasta --data_out=test2.fna --no_stream
Which is identical to the below short-hand command using the short switches:
read_fasta -i test.fna | count_records -o count.txt | write_fasta -xo test2.fna
This is the end of the introduction. Now it is a good time to head over to the HowTo for inspiration.