Process lists easily with rei
While rei was not originally intended to be an abbreviation, one may think of it as the Row Editing Interface. Working with lists is an important part of the work of many people, including data scientists and bioinformaticians, and rei aims to make that experience more pleasant.
rei can be installed from Hackage using cabal:

    cabal update
    cabal install rei
    # $PATH should contain the ~/.cabal/bin directory
    # (example for bash):
    PATH=$PATH:~/.cabal/bin
    export PATH
    rei "rname x y -> rname y x" example.csv
    rei merge example_left.csv example_right.csv
    rei unite example_top.ssv example_bottom.ssv
    rei melt2 example_condensed.csv
    rei condense2 example_melted.csv
    rei join example_foo.ssv example_bar.ssv
    rei subtract minuend.csv subtrahend.csv
    rei transpose example_matrix.ssv
    rei filter "_ _ chr => chr ~ chr21" elements.tsv
    rei reduce "chr => chr ~ chrM" exons.bed
    rei distinct "chr _ type => type chr" transcripts.gtf
Defining the rule
The main idea of rei is to apply a rule to every line of a file. The rule consists of two parts, before and after, separated by an arrow sign («->»). The before part describes the fields (columns) of one record (line) in the input file. The after part describes the desired format of the output.
The arrow sign should be surrounded by spaces, like this: ... -> ... The fields in the rule should be separated by spaces too. The field delimiter in the output file is the same as in the input file by default; however, it's possible to change it via the option -g or --newdelim. (It's easy to remember, since -f is the flag to set the delimiter in the input file.)
Providing the file
There are several ways to provide rei with the content of a file. The most obvious one is to provide the path to the file. Sometimes it is helpful to use process substitution. And if there's a need to pipe the content in, write a dash («-») in place of the file name. Here's the code:
    > rei "x -> x" 0.ssv
    ...
    > rei "x -> x" <(cat 0.ssv)
    ...
    > cat 0.ssv | rei "x -> x" -
    ...
Let's use a small sample file with spaces as delimiters for these examples (saved as 0.ssv):

    A B C D E
    F G H I J
    K L M N O
    P Q R S T
    U V W X Y
This is how easily we can address the columns:
    > rei "a b c -> c b a" 0.ssv
    C B A
    H G F
    M L K
    R Q P
    W V U
The columns are now in the reversed order.
We can extract the columns that we need:
    > rei "a b c -> b" 0.ssv
    B
    G
    L
    Q
    V
It is possible to name only the columns that are needed:

    > rei "a b -> a b" 0.ssv
    A B
    F G
    K L
    P Q
    U V
You may want to keep the rest of the columns; here's the code for that:

    > rei "a b ... -> a ..." 0.ssv
    A C D E
    F H I J
    K M N O
    P R S T
    U W X Y
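To make the rule semantics concrete, here is a minimal Python sketch that reproduces the examples above. It is only a model of the observable behaviour, not how rei is implemented; the function name `apply_rule` and its parameters are made up for illustration.

```python
def apply_rule(rule, line, delim=" ", newdelim=None):
    """Apply a 'before -> after' rule to one line (illustrative model only)."""
    before, after = (part.split() for part in rule.split(" -> "))
    fields = line.split(delim)
    out_delim = delim if newdelim is None else newdelim

    # Bind each name in the before part to the field in the same position;
    # "..." captures every remaining field.
    named, rest = {}, []
    for i, name in enumerate(before):
        if name == "...":
            rest = fields[i:]
            break
        named[name] = fields[i]

    # Build the output line from the names listed in the after part.
    out = []
    for name in after:
        if name == "...":
            out.extend(rest)
        else:
            out.append(named[name])
    return out_delim.join(out)


print(apply_rule("a b c -> c b a", "A B C D E"))    # C B A
print(apply_rule("a b ... -> a ...", "A B C D E"))  # A C D E
```

Fields that are not named in the before part are simply ignored, which is why "a b -> a b" returns only the first two columns.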
The beauty is that one may (and sometimes should) give columns descriptive titles. And that is great in so many ways, as it increases readability, productivity, descriptiveness, maintainability and awareness of what's happening with all that list processing. See some real-world examples below.
You can define a delimiter (-f or --delim for the input file, -g or --newdelim for the output file). It's important to emphasize that only single-character delimiters are supported. A tab («\t») counts as a single character too. If a multicharacter literal is provided, rei uses its first character as the delimiter.
For some common file formats rei guesses the delimiter from the file extension, so it does not have to be provided explicitly:
- .ssv → space (' '),
- .csv → comma (','),
- .tsv → tab ('\t'),
- .txt → space (' '),
- .list → space (' '),
- .sam, .vcf, .bed, .gff, .gtf → tab ('\t'),
The -g flag is powerful, as it allows for fast format conversion. This is how rei may be used to convert from .ssv to .csv:
    > rei -g "," "... -> ..." 0.ssv
    A,B,C,D,E
    F,G,H,I,J
    K,L,M,N,O
    P,Q,R,S,T
As you see,
rei guessed the delimiter in the input file by its extension — space-separated values. The output won't change in the example above if
-f " " is provided.
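The delimiter lookup can be pictured as a small table keyed by file extension. The sketch below is an assumption-laden Python model: the table is copied from the list above, while the names `DELIMITERS`, `guess_delimiter` and `convert_line`, and the choice to raise an error for unknown extensions, are hypothetical and say nothing about how rei itself behaves in that case.

```python
import os

# Extension-to-delimiter table, copied from the list in this section.
DELIMITERS = {
    ".ssv": " ", ".csv": ",", ".tsv": "\t", ".txt": " ", ".list": " ",
    ".sam": "\t", ".vcf": "\t", ".bed": "\t", ".gff": "\t", ".gtf": "\t",
}


def guess_delimiter(path, explicit=None):
    """Pick the input delimiter: an explicit one wins, else use the extension."""
    if explicit:
        # Only the first character of a multicharacter literal is used.
        return explicit[0]
    extension = os.path.splitext(path)[1]
    # Assumption of this sketch: an unknown extension raises KeyError.
    return DELIMITERS[extension]


def convert_line(line, path, newdelim):
    """Re-join the fields of one line with the output delimiter."""
    return newdelim.join(line.split(guess_delimiter(path)))


print(convert_line("A B C D E", "0.ssv", ","))  # A,B,C,D,E
```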
Sometimes there is a need to cut off the header of a file or several lines at its end. This is generally accomplished by combining the head and tail programs, piping, etc. Since rei is designed for easy list processing, this feature is built in. There are flags to define the number of lines to skip at the beginning (-s) or at the end (-t) of the file.
    > rei -s 1 -t 2 "f g h i j -> f h j" 0.ssv
    F H J
    K M O
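In terms of plain list slicing, skipping and omitting lines amounts to the following Python sketch. The helper `trim` and its parameter names are hypothetical; it only mirrors the effect of -s and -t seen in the example above.

```python
def trim(lines, skip=0, omit=0):
    """Drop `skip` lines from the start and `omit` lines from the end."""
    return lines[skip:len(lines) - omit]


rows = ["A B C D E", "F G H I J", "K L M N O", "P Q R S T", "U V W X Y"]
print(trim(rows, skip=1, omit=2))  # ['F G H I J', 'K L M N O']
```

The rule is then applied only to the lines that remain.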
Sometimes it's handy to have line numbers in the data file. For that purpose there is the -n flag (or --enum), which lets the user treat the first variable in the rule as a line number (enumeration starts with 1):

    > rei -n "# _ _ _ d -> d d #" 0.ssv
    D D 1
    I I 2
    N N 3
    S S 4
    X X 5
Addressing columns with numbers
Sometimes the columns in the file need to be addressed by their indices. For those cases there is the -a flag (short for awk-like; the long form is --colnum). When using rei -a, no before part of the rule should be provided. Please note that the arrow -> should still be preceded by a space in this case:
    > rei -a ' -> 0 3' 0.ssv
    A D
    F I
    K N
    P S
    U X
It is recommended to use -a with the -n flag so that the first column can be referred to as 1 and the line number as 0:

    > rei -an ' -> 1' 0.ssv
    A
    F
    K
    P
    U
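The two examples above suggest that column indices are zero-based and that enumeration shifts every column by one. A rough Python model of that behaviour, with a hypothetical helper name and parameters, might look like this:

```python
def select_columns(line, indices, delim=" ", line_no=None):
    """Pick fields by position; optionally put the line number at index 0."""
    fields = line.split(delim)
    if line_no is not None:
        # With enumeration on, index 0 is the line number and the
        # columns move to indices 1, 2, 3, ...
        fields = [str(line_no)] + fields
    return delim.join(fields[i] for i in indices)


print(select_columns("A B C D E", [0, 3]))          # A D
print(select_columns("A B C D E", [1], line_no=1))  # A
```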
There are some common tasks that one may want to do with lists and tables, and it seems convenient to include them in rei as magic rules: melt2, condense2, merge, unite, join, subtract, filter, reduce, distinct. Each magic rule has its own syntax.
Melting and condensing
Here, to merge several (typically two) lists means to put them side by side, line by line. With merge one can add new columns. If the lengths of the lists (or tables) differ, the result is as long as the shortest one. As usual, rei takes care of the delimiters, but it does not look up or reorder rows when data is being merged.
    > rei merge 0.ssv <(rei -s 1 "a -> a" 0.ssv)
    A B C D E F
    F G H I J K
    K L M N O P
    P Q R S T U
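Merging as described here is a positional, line-by-line pairing, which maps closely onto Python's built-in `zip`. The sketch below is a hypothetical model of the example above, not rei's code.

```python
def merge(*tables, delim=" "):
    """Glue the i-th lines of all tables together, line by line."""
    # zip stops at the shortest input, which gives the truncation
    # to the shortest list described above.
    return [delim.join(lines) for lines in zip(*tables)]


left = ["A B C D E", "F G H I J", "K L M N O", "P Q R S T", "U V W X Y"]
right = ["F", "K", "P", "U"]
for line in merge(left, right):
    print(line)
```

No keys are compared: the rows are paired purely by their position in each file.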
Uniting, or concatenating, several files can be achieved with the unite rule. This rule has a synonym: concat. While simple file concatenation can be achieved using UNIX cat, rei unite <...> takes the delimiter symbol into account (it should be the same for all input files) and can change the delimiter symbol for the whole output or skip / omit lines.
    > rei unite 0.ssv <(head -n 1 0.ssv)
    A B C D E
    F G H I J
    K L M N O
    P Q R S T
    U V W X Y
    A B C D E
Another useful thing is finding common elements in multiple lists.
rei allows that with
join. (In most cases the order of the files provided does not matter. However, if the first file contains duplicates, so will the result.)
Let's prepare a file to join with 0.ssv and save it as 01.ssv:

    A B C D E
    K L M N O
    X X X X X
The code for join is straightforward:
    > rei -g ',' join 0.ssv 01.ssv
    A,B,C,D,E
    K,L,M,N,O
Retrieving unique data with subtr
Finding differences between multiple lists with a clear and concise syntax is not a trivial task. To deal with this, rei offers a magic rule called subtract (subtr for short). It behaves exactly as its name suggests: it takes the first file and removes every row that is present in any of the following files.
    > tail -n 1 0.ssv > 02.ssv
    > rei subtr 0.ssv 01.ssv 02.ssv
    F G H I J
    P Q R S T
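Both join and subtr can be thought of as order-preserving set operations on whole lines. The following Python sketch models the two examples above under that assumption; the function names are illustrative, and treating "common elements in multiple lists" as "present in every list" is a guess about the multi-file case.

```python
def join(first, *others):
    """Keep the lines of `first` that also occur in every other list."""
    other_sets = [set(lines) for lines in others]
    # Iterating over `first` keeps its order and its duplicates.
    return [line for line in first if all(line in s for s in other_sets)]


def subtract(first, *others):
    """Keep the lines of `first` that occur in none of the other lists."""
    unwanted = set().union(*others)
    return [line for line in first if line not in unwanted]


rows0 = ["A B C D E", "F G H I J", "K L M N O", "P Q R S T", "U V W X Y"]
rows01 = ["A B C D E", "K L M N O", "X X X X X"]
rows02 = ["U V W X Y"]
print(join(rows0, rows01))              # ['A B C D E', 'K L M N O']
print(subtract(rows0, rows01, rows02))  # ['F G H I J', 'P Q R S T']
```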
When you need to transpose a list, you can do it with rei. This is nicely demonstrated with the following matrix (saved as 1.ssv):

    11 12 13 14 15
    21 22 23 24 25
    31 32 33 34 35
    41 42 43 44 45
    51 52 53 54 55
    > rei -g ',' transpose 1.ssv
    11,21,31,41,51
    12,22,32,42,52
    13,23,33,43,53
    14,24,34,44,54
    15,25,35,45,55
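For readers who think in code, transposition of a rectangular table is the classic `zip(*rows)` idiom in Python. This sketch uses a hypothetical helper and assumes every row has the same number of fields; how rei treats ragged input is not covered here.

```python
def transpose(lines, delim=" ", newdelim=","):
    """Turn the rows of a rectangular table into its columns."""
    table = [line.split(delim) for line in lines]
    # zip(*table) yields the first fields of all rows, then the second, ...
    return [newdelim.join(column) for column in zip(*table)]


matrix = ["11 12 13", "21 22 23"]
for line in transpose(matrix):
    print(line)
# 11,21
# 12,22
# 13,23
```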
To select lines that meet some condition, use the filter rule. Its syntax is simple:

    > rei filter "a b => a ~ A" 0.ssv
    A B C D E
The filter word should be followed by a rule consisting of two parts: before and pattern. It is similar to the standard rule in rei, but in the filter case the two parts should be separated by a fat arrow (=>). The pattern is defined as field_name ~ expression, where the expression may be a regular expression.
You can use the reduce rule for negative filtering:

    > rei reduce "a b => a ~ A|U" 0.ssv
    F G H I J
    K L M N O
    P Q R S T
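The filter and reduce rules differ only in whether matching lines are kept or dropped. Here is a small Python model of the two examples above. The helper name is made up, and the use of `re.search` (a match anywhere in the field) is an assumption: the text does not say whether rei anchors the expression.

```python
import re


def select(rule, lines, delim=" ", keep_matches=True):
    """Model of filter (keep_matches=True) and reduce (keep_matches=False)."""
    before, pattern = rule.split(" => ")
    field, expression = pattern.split(" ~ ", 1)
    index = before.split().index(field)  # position of the tested field
    regex = re.compile(expression)

    result = []
    for line in lines:
        matched = regex.search(line.split(delim)[index]) is not None
        if matched == keep_matches:
            result.append(line)
    return result


rows = ["A B C D E", "F G H I J", "K L M N O", "P Q R S T", "U V W X Y"]
print(select("a b => a ~ A", rows))                        # filter
print(select("a b => a ~ A|U", rows, keep_matches=False))  # reduce
```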
Selecting distinct lines
Here, consecutive lines that have the same value in the selected fields are treated as duplicates of each other, and only the first line of each such run is kept.
Let's prepare a file (2.ssv):

    ! @ ?
    ! * *
    ? * ?
Then select lines that are distinct in terms of the second field:
    > rei distinct "_ 2 _ => 2" 2.ssv
    ! @ ?
    ! * *
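This behaviour resembles grouping consecutive lines by a key and keeping the first line of each group. A minimal Python sketch with `itertools.groupby` reproduces the example above; the helper name and the rule parsing are illustrative assumptions.

```python
from itertools import groupby


def distinct(rule, lines, delim=" "):
    """Keep the first line of every run of consecutive lines with equal keys."""
    before, key_part = rule.split(" => ")
    names = before.split()
    key_indices = [names.index(name) for name in key_part.split()]

    def key(line):
        fields = line.split(delim)
        return tuple(fields[i] for i in key_indices)

    # groupby only groups neighbouring items, so equal keys that are
    # separated by a different line are not collapsed.
    return [next(group) for _, group in groupby(lines, key=key)]


print(distinct("_ 2 _ => 2", ["! @ ?", "! * *", "? * ?"]))
# ['! @ ?', '! * *']
```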
Skip rownames and colnames:
> rei --skip 1 "rownames ... -> ..." example.ssv
It's easy to unite several files and turn the output into .csv:

    > rei -f ' ' -g ',' unite <(rei -t 3 "a b c -> a c" 0.ssv) <(rei -s2 "x y z -> y z" 1.ssv)
    A,C
    F,H
    32,33
    42,43
    52,53
- .bam files stats
- .bed files: counting elements
- date and smth else
- uniting data
- merging data
Finding files that were not downloaded
You were downloading a set of fastq files from a list, files.list, when the connection was interrupted. It is handy to use rei to generate a list of the files that were not downloaded:

    > ls *fastq.gz > downloaded.list
    > rei subtract files.list downloaded.list > to_download.list
Errors and warnings
rei tries to be friendly to the user. For example, when there's a field variable in the right part of the rule that is not present in the left part,
rei hides implementation details behind the user-friendly message, trying to guess that Something's wrong with the rule...
- guessing delimiters for "bioinformatic" formats, like: .sam, .vcf, etc.
- guessing delimiters for more formats: .bed, .gff, .gtf