Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Roadmap / Todos #3

Open
12 of 25 tasks
johanneskoester opened this issue Feb 22, 2015 · 68 comments
Open
12 of 25 tasks

Roadmap / Todos #3

johanneskoester opened this issue Feb 22, 2015 · 68 comments

Comments

@johanneskoester
Copy link
Contributor

johanneskoester commented Feb 22, 2015

This is a continuously updated list of Todos. If you have a suggestion, comment to the issue. I will update the list if the suggestion fits in.
Please note that none of the items will be guaranteed to be implemented. If you want to make sure something is done, consider to contribute the code to Rust-Bio, any help is welcome! Of course you will be listed as one of the authors.

If you want to implement an item, please post it in this thread and I will mark it in the list so that we don't duplicate work.

Changes to current code

  • make SAIS implementation generic over the used alphabet data type and dynamically choose the right type depending on the number of characters. This will further reduce memory usage.
  • investigate no_mangle flag to allow easy integration of Rust-Bio into e.g. Python via ctypes.

New code

  • add indexed FASTA reader based on the seek method of current fasta reader and https://github.com/BurntSushi/rust-csv
  • allow to call current pairwise alignment implementation without traceback for distance computation (hamming, edit, affine)
  • add common score matrices (e.g. PAM250, BLOSUM62 for amino acids) _[@rilut]*
  • add (gapped) q-gram/kmer implementation for counting and index building
  • add bed reader
  • add gff3 reader
  • add gtf/gff !=3 reader based on https://github.com/BurntSushi/rust-csv [@natir]
  • Hidden-Markov-Models (Viterby, Baum-Welch, ...)
  • add Myers Algorithm for approximate pattern matching
  • basic NGS statistics (qual distribution, nucleotide distribution)?
  • add amino acid alphabet to bio::alphabets
  • add multiple sequence alignment computation (partial order alignment for now) [@bnbowman]
  • provide an interval tree, segment tree or clustertree implementation
  • add parallelization examples
  • Add speed benchmarks for all algorithms (right now we have them for pattern matching and SAIS).
  • Add mate-pair FASTQ reader (using the zip method).
  • SDP algorithms for sparse alignment [@bnbowman]
  • Add tabix interface to Rust-Htslib
  • de Bruijn Graph implementation and related algorithms [@rob-p]
  • Overlap/String graphs
  • graph genome support (see work by Veli Mäkkinen, Richard Durbin et al.)
  • BCALM algorithm (@bryceperkins)
  • rewrite gff/gtf parser (and other parsers) using nom in order to support comments (see GTF parser breaks on GTF file from GENCODE #115). Look at the ideas of BioJulia.
@johanneskoester johanneskoester changed the title TODO Roadmap / Todos Feb 22, 2015
@vadimnazarov
Copy link
Contributor

Hello! I've been thinking what could be useful for generic library for bioinformatics tasks. This is what I thought:

  1. Add to your "HMM" point Baum-Welch and Viterbi algorithms implementation just to clarify what should be done in particular.
  2. Add simple hamming and levenshtein distance, so they could be effortlessly called with just one function like "hamming(seq1, seq2);".
    2.1. Or/and more general - add a data structure for fast search of sequences with specified number of indels and mismatches like this one http://mitcr.milaboratory.com/site/apidocs/index.html?com/milaboratory/core/sequence/tree/SequenceTreeMap.html
    2.2. Add efficient by space implementation of pairwise alignment.
  3. Add various BLOSUM??? and other matrices for alignments.
  4. Add wrappers and parsers of output for various command line tools (like bwa or clustalw), on-line services (BLAST) and databases (NCBI).
  5. Various phylogenetic stuff, but, sadly, I know nothing about it.
  6. Add functions for basic NGS data processing - read trimming, quality filtering, parsing of NGS .fastq mate files (and mate pairing). Everything for such grinding tasks, just to give user ability to concentrate on research not only on processing huge NGS data by hands.
    6.1. Basic statistics for NGS data like quality distributions, nucleotide-by-position distributions and others.
  7. Add functions for kmer generating, statistics like most frequent kmers.
  8. Add functions for sequence profiles like simple profiles (by position, like in [6]), probably HMM profiles.
  9. Parallelisation for all. Because Rust. (:

I think that it would be a good idea to see what other general bioinformatics libraries like BioPython have implemented. I hope I will be able to check this in a couple of days and add more suggestions and/or specify what I wrote higher.

Also, probably the project's name "Rust-Bio, a rusty bioinformatics library" is a bit un-easy to understand at first glance. Googling "rust for bioinformatics" don't give you a link to this repo, and I think that that search query is a more frequent one than other possible queries. May I suggest as a title something like "Rust-Bio, Rust for Bioinformatics", and after that, in description, "Rust module for general bioinformatics tasks"? Probably, "BioRust" would be the best possible name because of various other libraries like BioPython, BioJava and BioRuby, but it has been already claimed: https://github.com/hirokai/BioRust and https://github.com/biorust .

Feel free to comment and criticise my suggestions, I'd be glad to discuss todos.

@parir
Copy link
Contributor

parir commented Feb 22, 2015

Some more suggestions:

  • Add multiple alignments to bio::alignment.
  • Add the amino acid alphabet to bio::alphabets.
  • Some common AA substitution matrices like PAM250 or BLOSUM62 would be nice! :)

@vadimnazarov
Copy link
Contributor

@parir Thanks! (: What algorithms for multiple alignments do you think would be the most useful?

@parir
Copy link
Contributor

parir commented Feb 22, 2015

I agree with @VD-N. It would be good to look into the other Bio[programming language of your choice] frameworks to figure out common API design principles.

@parir
Copy link
Contributor

parir commented Feb 22, 2015

@VD-N Tough question! I haven't worked with sequence alignments in the last 3-4 years, so I don't know the current state of the art algorithms. Back then Muscle, MAFFT and ProbCons where supposed to be pretty good. ClustalW could also be included, because it's the classic one.

@johanneskoester
Copy link
Contributor Author

Thank you, guys!
I have included most of your suggestions above.
Regarding parallelization, I would like to leave it out of the API right now, and let the user decide by invoking functions within threads. However, there might be cases where this does not make sense, like multiple alignments. We can see how to update the paradigm later then.

@ghost
Copy link

ghost commented Jul 14, 2015

Something similar to bx-pythons clustertrees and intervaltrees, perhaps? Simple, but indispensable...

@johanneskoester
Copy link
Contributor Author

Very good idea, I have added it to the list.

@ghost
Copy link

ghost commented Jul 16, 2015

An extremely common use case I had in Python was turning a bed file into a dict of cluster or intervaltrees, where the keys were chromosomes or chromosome/strand tuples. Might be nice to have.

Of course this is trivial to code up oneself - is Rust-bio more about implementing the difficult stuff or also having convenience functions?

@johanneskoester
Copy link
Contributor Author

The key idea is to have a toolbox for anything that reoccurs when doing bioinformatics in Rust. In this sense, your suggestion fits perfectly.

@vadimnazarov
Copy link
Contributor

Hi everyone,
i have a couple of questions and suggestions regarding Rust-Bio.

  1. do I need to somehow "take" a task and say it somewhere so no one else will take it and therefore no parallel work will be done?
  2. after careful thinking I think that implementing multiple alignment algorithms, clones of bowtie, bwa-mem, BLAST or other widely used software isn't very practical. I think the best way would be is to have an interface to run this software tools with passed arguments and parse the output of them. What is your opinion?
  3. GHMM seems to be dead. https://github.com/dichodaemon/ghmm Also it doesn't have profile-HMM, which are also very useful. But profile-HMM has been implemented here: http://hmmer.janelia.org .
  4. suggestion: speed benchmarks in comparing to other Bio* software packages. It would be great if Rust-Bio would have a greater speed than most or even all of them.
  5. suggestion: a point by point comparison of Rust-Bio to other Bio* software packages.
  6. suggestion: a couple of clustering algorithms, e.g., hierarchical clustering, means.
  7. I think if Rust-Bio is a general bioinformatics library, then it would great if it would have a rust-htslib as a dependency, so user will get all necessary stuff at once, and won't be loading endless additional packages.
  8. suggestion: transform alphabets into singletons and / or give them simple aliases (macro), because this:
let alphabet = alphabets::dna::iupac_alphabet();
let pos = suffix_array(text);
let bwt = bwt(text, &pos);
let fmindex = FMIndex::new(&bwt, 3, &alphabet);

is much complicated for a medium user and takes more time to write and analyse than this:

let pos = suffix_array(text);
let bwt = bwt(text, &pos);
let fmindex = FMIndex::new(&bwt, 3, dna_iupac!);

@johanneskoester
Copy link
Contributor Author

Hi,

thanks for the useful suggestions! Let me answer piece by piece:

  1. Good question. I think it would be best to post it here (maybe together with a link to your fork) and I will format to TODO list at the top accordingly.
  2. The alignment algorithms provided in Rust-Bio are intended for tasks where you need more control (or have different questions) than with e.g. a traditional read mapper and want to use plain rust. If someone just wants to do read mapping or multiple alignment, he can just use the proper tool. From within rust, there is already the very convenient command API, so no need to provide any special interface in Rust-Bio.
  3. I actually meant this library: http://sourceforge.net/projects/ghmm/, which is a different one I think. It is just a plain lib. Not so much activity, but it seems to be just a feature complete very general project.
  4. Yes, we definetly need some speed benchmarks.
  5. Clustering maybe is better done in a separate statistics/machine learning library which will surely show up for Rust, soon. Further, these kinds of analyses are better done in a scripting language like Python in my opinion. Rust-Bio is more for the low-level stuff where performance matters.
  6. This is an interesting point. On the other hand, one could say that a user who uses Rust-Bio for a project which has nothing to do with BAM should not be required to have that being pulled into his dependencies by Rust-Bio. Especially, since Rust-Htslib relies on compiling C code, which is an additional source of failure.
  7. Interesting idea. So far I had the impression that singletons are not really encouraged in Rust. Do you have a real code example for implementing singletons from other Rust projects?

@vadimnazarov
Copy link
Contributor

Oops, please pardon me for the late response!

  1. Ok, that's nice! I think it would be good to make a very clear note on that on the very top of the first message in this issue.
  2. Sorry, I think i didn't get it, could you point me to this API?
  3. Ah, I thought that on github it was just a clone of this repo.
  4. Ok!
  5. (at this point you have responded to my point "6" of the my original message)
    5*. Hierarchical clustering and k-means are quite simple algorithms (to implement) yet very useful. I think I totally understand your point, but from the point of view of a user I would really like to simply call something like hclust(gene_mat_dist) within Rust-Bio, e.g., if I'm doing gene expression analysis, and don't bother with external packages. Or if I want to cluster NGS reads with k-means as a preprocessing step for the further analysis of clusters. For the more complicated analysis and algorithms for sure external packages are needed.
  6. Ok, I agree, also I didn't know about dependencies. A bit off-topic: is it possible to make Rust-only htslib? Or are there some major limitations / drawbacks of doing this?
  7. I haven't seen singletons in Rust but I think they could be implemented using "unsafe code". Honestly, here I talked about singletons just as an example. I think it could be done just by copying the alphabet object every time you call it, because I don't think that it will generate a lot of overhead.

@vadimnazarov
Copy link
Contributor

Two more suggestions and one clarification to the previous one:

  • a FASTQ reader to read pair-end / mate-pair files. This could be a simple wrapper around two FASTQ readers, that will match reads' accessions because sometimes orders of reads are different between two FASTQ files.
  • add a SAM file reader.
  • add Rust speed benchmarks to the code (cargo bench)

Also I would like to take this task: allow to call current pairwise alignment implementation without traceback for distance computation (hamming, edit, affine). I think I will also try to implement Rust benchmarks of this and of the existed alignment algorithms.

@johanneskoester
Copy link
Contributor Author

Regarding 2:

The API I mean is this one:
http://doc.rust-lang.org/nightly/std/process/struct.Command.html

We have benchmarks now in the README.

Regarding 5:

I agree, that should be easy to do. But if anybody wants to implement that, I would rather suggest him to do that in a separate library. Most importantly, these clustering methods are useful for other fields as well, so why hiding them in a bioinformatics library. Depending on an additional library with Cargo is extremely easy, and does not make it more complicated for the user.

Regarding 6:

Yes, implementing BAM/CRAM/BCF in Rust only is surely possible. However, these custom binary formats are really complicated stuff. It would require quite some work. If anyboday wants to try that, I would be happy to include the code or have it as an auxilliary project within Rust-Bio. But at the moment, I would prefer to focus on stuff that is not already possible by using htslib.

Regarding 7:

There certainly is some overhead, so that you don't want to reinitialize e.g. in an inner loop.
Actually I have seen a macro based approach for static non-primitive variables somewhere. We can always provide that method at a later stage, delegating to the current object oriented implementation.

Regarding your second post

  • interesting idea. So in principle, this is just the zip method you could use. Some error checking with the read names is also nice.
  • SAM format should be supported by rust-htslib already. If not, it is certainly only one or two lines that need to be changed in the BAM reader, because htslib has a unified object for reading SAM/BAM/CRAM I think
  • We have benchmarks for suffix arrays and pattern matching, but I agree it would be nice to have them for most of the other stuff as well.

I have added your suggestions to the TODO list.

@vadimnazarov
Copy link
Contributor

Ok, thank you for your response! One more little suggestion: make rust-bio.github.io repository with all documentation so it can be available at "https://rust-bio.github.io".

@johanneskoester
Copy link
Contributor Author

Good idea. I have done that as suggested.

@luthfianto
Copy link
Contributor

Hello, I'm interested in rust-bio, but I have no experiences with Rust. I only experienced with Biopython. Any suggestion?
Can I work on the BLOSUM?

@johanneskoester
Copy link
Contributor Author

@rilut sure feel free to fork and try it out. I will mark it in the list.

@bnbowman
Copy link

I'd also suggest SDP algorithms for sparse alignment - generally handy things tools, especially when working with genomic assemblies or noisy long-read technologies.

And much like @rilut , I'm fairly new to Rust and looking to learn more and help out. Will poke around a bit and see if anything jumps out at me.

@johanneskoester
Copy link
Contributor Author

Thanks! I have added sparse alignment to the list! Looking forward to any contribution!

@brentp
Copy link

brentp commented Dec 17, 2015

This is more for rust-htslib, but it'd be nice to have an interface to the tabix stuff and potentially .csi.

@johanneskoester
Copy link
Contributor Author

That's a good point, thanks!

@ghost
Copy link

ghost commented Feb 22, 2016

Greetings!

I have just finished the basic workings of a transcriptome translator. The repo is available here: https://github.com/rundrop1/transcriptome_translator

It currently generates all possible Amino Acid sequences from Nucleotide sequences. Support for RNA is coming today. Currently getting output is hacky as the way I did it yesterday was to capture stdout by executing through vim which requires a bit of trimming at the top of the file but otherwise works temporarily. (Note that if you do it this way you will get a valid fasta file)

I didn't look before making the parser so I've implemented a FASTA parser with nom. The parser currently is only able to deserialize FASTA for further use. It also requires rust-phf which currently requires a specific nightly. I also utilize memmap for loading the file. The file is not mutated during processing and while the code is unsafe by memory safety standards but in practice it is usable if you are sure your file will not be modified before the translator has a chance to complete.

I'd like to know if the translator is of any use to your crate and how I could get it into a more usable state for acceptance into this toolkit. I'd also like to add features like automatically BLASTing results and am open to implementing others ideas as well. I plan on adding features and otherwise maintaining 'transcriptome_translation' well into the future.

@johanneskoester
Copy link
Contributor Author

Hi!
Well, thanks for considering Rust-Bio. Your project sounds very interesting. Since Rust-Bio is an algorithm library and not a command line utility, it cannot go into it directly. However, I invite you to contribute any code that is of general use as a data structure or algorithm. This way, we can maintain it as a community and you can use it in your tool without having duplicate effort. So, can you maybe elaborate a bit on decomposable parts of your code base? E.g. I could immagine that the plain amino acid translation could be a function in Rust-Bio.

@anderspitman
Copy link

Hey @johanneskoester, I'd be happy to implement a gtf/gff reader.

@johanneskoester
Copy link
Contributor Author

Great, thank you. I have added you to the list!

@anderspitman
Copy link

Great! Did you have any specific preferences for the implementation? I was just planning on starting with making it as similar as possible to what you have for FASTA. I'm still relatively new to Rust so It'll probably be fairly iterative if you don't mind giving input while I learn best practices.

@ghost
Copy link

ghost commented Feb 26, 2016

@anderspitman I've just created #rust-bio on irc.mozilla.org if you care to join me I can help answer your questions regarding rust. It seems that the reader is not too complicated and I can walk you through how to do some of the Rusty things. My implementation is verbose but easily readable with quite a few comments to explain the exact steps taken. I am not a bio-informatics professional but I essentially freelance for labs at a local university when computation can be applied to the questions they are trying to answer.

@johanneskoester Would parallelizing the serializing and deserialzing of various formats be enough of a "killer feature" to potentially convert the existing parsers to utilize the external crate nom? There are a lot of good things about staying within std but nom and rayon could allow us to support machines with many cores. Can be done with std::mpsc::channels; I will do it this way instead so it will be more inline with how things are already done.

@rhagenson
Copy link
Contributor

What are the thoughts on adding an AppVeyor build to check the library's build status on Windows systems? A template for doing this already exists at: https://github.com/japaric/trust This repo also includes a .travis.yml template for building on both Linux and MacOS, but that may muddy the waters on a TravisCI build failure being on Linux or MacOS.

@johanneskoester
Copy link
Contributor Author

@rhagenson I don't think this is needed. Rust-Bio is pure Rust, hence it should work the same on all OSes.

@rhagenson
Copy link
Contributor

rhagenson commented Jun 6, 2017 via email

@johanneskoester
Copy link
Contributor Author

So, if there is a difference, it would be a bug in Rust I think. I am hesitating to activate Windows builds, because, to be fair we would also have to activate osx builds. This is 3x the build load we have now, for almost no benefit, since it is very unlikely to be a difference. I am also thinking about handling natures resources responsibly here. What we could do is a weekly test of the master branch.

@rhagenson
Copy link
Contributor

I can appreciate the conservative approach and it certainly is a bit excessive to launch three builds with each update to master. I am not sure how we would automate a weekly Windows build test though. Do you know of a way to do this? I am not seeing anything to space builds out by a week with AppVeyor.

@johanneskoester
Copy link
Contributor Author

If it is not yet possible, we could ask for a weekly build feature. No need to rush, windows builds can wait.

@ghuls
Copy link

ghuls commented Jul 27, 2017

BioJulia project seems to have good ideas that could be copied.

Search for BioJulia on https://julialang.org/blog/ for some interesting idea (like how to store FASTA sequences efficiently in memory).

BioJulia also has proper parsers for different bioinformatic formats which could be useful to look at (when reimplementing parsers with nom):

The parsing framework in Bio.jl is unique amongst bioinformatics libraries. Rather than hand-written parsers, we use grammar specifications that are compiled into parser code using a tool called ragel.

https://homes.cs.washington.edu/~dcjones/biojl/parsing.html

Some examples: https://github.com/BioJulia/GenomicFeatures.jl/tree/master/src/

@johanneskoester
Copy link
Contributor Author

Thanks! This is certainly something to consider!

@bnbowman
Copy link

Has anyone written a Partial-Order Aligner in Rust? I was thinking of trying my hand at it over the break, but I'd hate to duplicate effort if someone has already started

@bnbowman
Copy link

In fact, I want a partial-order aligner so much I, err, already wrote one:
https://github.com/rust-bio/rust-bio/tree/partial-order-aligner

So I guess I volunteer for that?

@johanneskoester
Copy link
Contributor Author

Great!!

@wizofe
Copy link

wizofe commented May 13, 2018

@johanneskoester I just introduced myself here #27 :) Do you have any recommendations for a short and sweet project? That should take a day or couple, just to discover your code base and see what's going on!

@johanneskoester
Copy link
Contributor Author

johanneskoester commented May 14, 2018

Hi and welcome @wizofe! Unfortunately, all the currently open todos are rather big. However, feel free to propose any textbook algorithm that is still missing. Another possibility is to write a file parser for a format that is not yet available in rust-bio or rust-htslib.

@brianjimenez
Copy link
Member

Hi! Just introduced myself in the separate thread, I'd like to know which are the plans to include PDB format parsing, especially for proteins.

@ghost
Copy link

ghost commented Aug 22, 2018

Hi. I love dynamic programming algorithms, graph algorithms, and anything involving the concept of an approximation algorithm (TSP approximation). I have a few years of Haskell experience, but I have not touched Rust yet. I am kind of interested in working at the intersection of algorithms, Rust, and bioinformatics, especially if I find something that brings together distributed algorithms and sequential algorithms to scale horizontally and vertically in combination.

I also tend to enjoy short programming exercises like HackerRank or Codility, if helps in what tasks you suggest.

I'll try to look through the list of todo's, but feel free to suggest some "new to rust" tasks, for me to get my feet wet.

@johanneskoester
Copy link
Contributor Author

@brianjimenez welcome! The plan is: if there is somebody who wants to do the work, we are happy to include it. In general the dev model is that things are added as needed and whenever a volunteer does the coding.
@kanishka-azimi welcome! In the current todos, there is no simple stuff left I am afraid. However, in general every single bioinformatics text-book algorithm is suitable for starting. My suggestion is to take one of the classic books, like this and check whether there is a nice algorithm that is not yet part of Rust-Bio. I am looking forward to your contribution and I will be happy to review it!

Sorry guys for the late response, I have been on vacation.

@brianjimenez
Copy link
Member

@johanneskoester thanks a lot! I've started reading the 2018 edition Rust book a few weeks ago and messing around with a dirty approach to PDB parsing here. Still a lot of Rust to learn, but I'll happy to help with the PDB parts.

@jimrybarski
Copy link

I am looking for something that calculates the molecular weight of nucleic acids and polypeptides, and there doesn't seem to be anything that does that on crates.io right now. Is that in scope for this library? I'd be happy to implement it if so.

@johanneskoester
Copy link
Contributor Author

johanneskoester commented Jun 28, 2019

@jimrybarski Yes, absolutely!

@huangjj27
Copy link

I'm wondering If rust-overlaps could be a reference for overlaps task?

@joshwilding4444
Copy link

I would be happy to start writing more benchmark tests. I'll start with the functions related to sparse alignments. Should I write benchmarks for every non-trivial function, or should I work on a particular subset of functions?

@carrascomj
Copy link

Hello! Is similarity clustering in the scope of this library (or rust-bio-tools)? It would be nice to know if someone is working in a UCLUST clone or any of the mmseqs2 clustering methods.
Also, If someone could point me to a good resource to understand these algorithms a bit better, I could try to implement one of them myself.

@dlaehnemann
Copy link
Member

Sure, any more general bioinformatics algorithms that can be used across tools should go into rust-bio, general data types can be defined in rust-bio-types and if you want to create some simple command-line tool based on rust-bio (and/or rust-htslib), you can do this in rust-bio-tools. Any bigger tools that use general rust-bio functionality but do some bigger amounts of extra work probably warrant a separate crate.

As for understanding the algorithms, the links you provided point to papers that are probably a good point to start, and for mmseqs2 you could dig into the code -- it does not seem to contain too many comments, but looks well structured. In general, it seems to support a lot of different functionality and it's probably a good idea to pick some small functionality to start with.

@XVilka
Copy link

XVilka commented Sep 23, 2020

Do you have any plans to setup OpenCollective/GitHub Sponsors? I would certainly be happy to send a bit from time to time, like it's done with BioJulia:

@bricoletc
Copy link

Hey, @johanneskoester , could you elaborate on the point graph genome support (see work by Veli Mäkkinen, Richard Durbin et al.)? Is this reading and writing GFA?

@SladeckovaKlara
Copy link

Hi, I would like to contrbute to the Hidden Markov model algorithms, Viterbi and perhaps Baum-Welch.

@YeungOnion
Copy link

Regarding the use of the no_mangle flag to modify existing code for python support, since that was mentioned in 2015 in #7, which appears to have been discussed when Rust 1.2 was new

I found this wrapper on github associated to https://bccfe.ca/
It uses PyO3, but it's certainly not aiming to covering the API of rust-bio. Is there something like this that already exists? I do expect it should probably be a different project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests