biopyhon proposal draft
lijax committed Apr 22, 2013
1 parent f1ee611 commit e969c82
Showing 1 changed file with 81 additions and 0 deletions.
81 changes: 81 additions & 0 deletions proposal_biopython.md
@@ -0,0 +1,81 @@
# Phylogenetics in Biopython: Filling in the gaps

## Personal Information

- **Name**: Yanbo Ye
- **Email**: yeyanbo289@gmail.com
- **University**: Wuhan Institute of Virology, University of Chinese Academy of Sciences
- **Current Enrollment**: Third-year master's student in Bioinformatics
- **Biography**:
- 2005-2010: Bachelor of Science in Biotechnology, Nankai University, Tianjin, China;
- 2010-2013: Master of Science in Bioinformatics, University of Chinese Academy of Sciences, Beijing, China.

## Interests and Motivation

I'm interested in the project "Phylogenetics in Biopython: Filling in the gaps".

My current research is on the phylogenetics and genome evolution of baculoviruses. To support this work, I developed a Java program that integrates my research workflow: sequence clustering, gene content tree building, visualization, and estimation of gene gain and loss. Most importantly, I have already implemented most of the algorithms needed for this Biopython project, including UPGMA, NJ, MP, consensus tree finding, Newick tree parsing, and cladogram and circular tree visualization. Although I have never developed a Python module, I use Biopython regularly in my research for routine tasks such as sequence parsing and mapping data onto trees, so I am very familiar with both these algorithms and the Biopython packages. I believe I can port my Java code to Biopython and implement the pieces my current code lacks (tree comparison and bootstrap search for a target tree).

I have been looking for a chance to start a Python project for a long time. Given my research background and experience, this project is the right one for me. Through it, I want to gain more experience in Python programming and improve my collaboration skills in open source software development.

## Programming Experience and Skills

- **Java:** 4 years
- **BlastGraph** (https://github.com/bigwiv/BlastGraph): A comparative genomics tool that integrates a range of bioinformatics tools and algorithms. It clusters homologous sequences and estimates gene content trees and gene gain and loss events.
- **Python:** 1 year
- **Bioscript** (https://github.com/lijax/bioscript): Python and shell scripts I have used in my research.

- I also like to hack on JavaScript and Python code in my spare time. Here is my GitHub account: https://github.com/lijax

## Project Plan

### Project abstract:
Biopython is a set of open source Python packages and modules for bioinformatics. The Bio.Phylo package already implements some basic phylogenetics tasks: basic tree operations, parsers for Newick, Nexus and PhyloXML, and wrappers for PhyML, RAxML and PAML. However, some important components remain to be implemented to better support phylogenetic workflows. These include simple tree construction algorithms, consensus tree searching, tree comparison and tree visualization.

### Approach & Goals

- Implement simple tree inference algorithms: UPGMA, neighbor-joining and maximum parsimony.
- Implement consensus tree search functions for multiple trees: PHYLIP-like majority-rule consensus and Adams consensus.
- Implement branch support calculation functions given a target tree and a list of bootstrap replicate trees.
- Implement a bootstrap method for a given alignment, with two interface methods: one to generate a tree list and one to construct a consensus tree (given the parameters treeMethod, consensusMethod and bootstrapTime).
- Implement the SH test for multiple tree comparisons.
- Implement visualization functions for circular rooted trees and radial unrooted trees.
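
To make the first goal concrete, here is a minimal pure-Python sketch of UPGMA on a p-distance matrix. It is an illustration only, not the planned Bio.Phylo API: the function names `p_distance` and `upgma`, the nested-tuple tree representation, and the choice of p-distance are all assumptions made for this example, and branch lengths are omitted.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(names, seqs):
    """Return a topology-only tree as nested 2-tuples of leaf names.

    Hypothetical helper for illustration; not part of Biopython.
    """
    # size maps each current cluster (a name or a 2-tuple) to its leaf count.
    size = {name: 1 for name in names}
    dist = {}
    for (a, sa), (b, sb) in combinations(list(zip(names, seqs)), 2):
        dist[(a, b)] = p_distance(sa, sb)

    def d(x, y):
        # Distances are stored once per unordered pair.
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    while len(size) > 1:
        a, b = min(dist, key=dist.get)  # closest pair; first one wins ties
        merged = (a, b)
        others = [c for c in size if c not in (a, b)]
        # UPGMA update: average weighted by the number of leaves per cluster.
        updated = {
            c: (size[a] * d(a, c) + size[b] * d(b, c)) / (size[a] + size[b])
            for c in others
        }
        # Drop every distance that involves one of the merged clusters.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        for c, value in updated.items():
            dist[(merged, c)] = value
        size[merged] = size.pop(a) + size.pop(b)

    (tree,) = size
    return tree


tree = upgma(
    ["A", "B", "C", "D"],
    ["ACGTACGT", "ACGTACGA", "TTGTACCA", "TTGTACCT"],
)
print(tree)
```

A real implementation would also track node heights to produce branch lengths and would return a Bio.Phylo tree object rather than tuples.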

### Project Timeline

- **Week1(6.17-6.23)**: Tree inference algorithms
- **Tasks**: 1. write a method to calculate a distance matrix from an alignment; 2. port my Java code for UPGMA and NJ to Python; 3. find a solution for the maximum parsimony algorithms;
- **Difficulty**: Easy. Because I have already written those algorithms in Java, the first two tasks should be easy to complete, leaving more time to concentrate on the third.
- **Week2(6.24-6.30)**: Maximum parsimony
- **Tasks**: 1. write a method to calculate the parsimony score for a given tree and alignment; 2. implement an efficient algorithm to search for a tree that minimizes the score (branch and bound, heuristic search, etc.).
- **Difficulty**: Hard. The first task is easy, as I have written it before, so the main work is the second. If no optimal algorithm can be implemented by the end, we may just write a wrapper for external programs (such as dnapars and protpars in PHYLIP).
- **Week3(7.1-7.7)**: Consensus tree search
- **Tasks**: 1. to make consensus tree search efficient, create a binary array manipulation class to store and count clades; 2. write a PHYLIP-like majority-rule consensus tree search method for a list of trees;
- **Difficulty**: Medium to Hard. The first task is the most important and the most difficult. I have written a similar class in Java for consensus tree search in BlastGraph, but it does not actually use binary operations. The new class needs to use them to be efficient.
- **Week4(7.8-7.14)**: Consensus tree and branch support
- **Tasks**: 1. study the Adams consensus tree algorithm and implement it; 2. write a method for branch support calculation given a tree and a list of trees;
- **Difficulty**: Medium. I need to get familiar with the Adams consensus tree algorithm. Branch support calculation for a target tree is similar to consensus tree search.
- **Week5(7.15-7.21)**: Bootstrap method
- **Tasks**: 1. write the bootstrap method; 2. write an interface method to generate a list of bootstrapped trees, given the tree method (upgma, nj, mp) and bootstrap time; 3. write another one to build a consensus tree, given the tree method, consensus method and bootstrap time.
- **Difficulty**: Easy.
- **Week6(7.22-7.28)**: cleanup
- **Tasks**: clean up existing code, write tests and documentation
- **Week7(7.29-8.4)**: Mid-term evaluations
- **Tasks**: continue earlier tasks. Write and submit mid-term evaluations.
- **Week8(8.5-8.11)**: SH-test
- **Tasks**: 1. understand how the SH-test works; 2. write code for the SH-test
- **Difficulty**: Hard. Refer to *Inferring Phylogenies* Page 370. Any existing code in any language will be helpful.
- **Week9(8.12-8.18)**: circular rooted tree
- **Tasks**: 1. read the code of the Bio.Phylo._utils module and understand how draw and draw_graphviz work; 2. write an adaptable method, draw_circular, for circular rooted trees.
- **Difficulty**: Medium to Hard. BlastGraph has existing code for circular trees. I may need to learn the matplotlib API before porting it.
- **Week10(8.19-8.25)**: radial unrooted tree
- **Tasks**: 1. understand Felsenstein's Equal Daylight algorithm and find a proper solution for radial layout; 2. write code for radial unrooted trees.
- **Difficulty**: Hard. Refer to *Inferring Phylogenies* Page 582. Any existing code in any language will be helpful.
- **Week11(8.26-9.1)**: continue radial unrooted tree
- **Week12(9.2-9.8)**: clean up existing code, write tests and documentation
- **Week13(9.9-9.15)**: clean up existing code, write tests and documentation
- **Week14(9.16-9.22)**: Suggested 'pencils down' date. Take a week to scrub code, write tests, improve documentation, etc.
- **Week15(9.23-9.29)**: Firm 'pencils down' date. Write and submit final evaluations to Google. Submit required code samples to Google.
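
The Week 5 bootstrap step can be sketched as resampling alignment columns with replacement. This is a rough illustration under stated assumptions: the alignment is a plain list of equal-length strings, and the names `bootstrap_alignment` and `bootstrap_replicates` are hypothetical, not the final interface.

```python
import random


def bootstrap_alignment(seqs, rng=None):
    """Resample alignment columns with replacement, keeping the length.

    Hypothetical helper for illustration; seqs is a list of aligned strings.
    """
    rng = rng or random.Random()
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # Draw the same column indices for every sequence so columns stay intact.
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(s[i] for i in cols) for s in seqs]


def bootstrap_replicates(seqs, times, seed=None):
    """Yield `times` resampled alignments; `seed` makes runs repeatable."""
    rng = random.Random(seed)
    for _ in range(times):
        yield bootstrap_alignment(seqs, rng)


replicates = list(bootstrap_replicates(["ACGT", "ACGA", "TCGA"], 5, seed=1))
print(len(replicates))
```

Each replicate would then be passed to the chosen tree method (upgma, nj or mp) to build the bootstrapped tree list.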

## Conflict
No apparent conflicts yet.

1 comment on commit e969c82


@etal etal commented on e969c82 Apr 26, 2013

In general:

  • Looks good! Given your experience with BlastGraph, your prioritization of tasks is very reasonable.
  • Have you looked at PyCogent and DendroPy? They are more ambitious Python libraries that implement many more advanced operations for phylogenetics. I don't intend for this project to crib their work, but for some of the fundamental operations proposed here, it will be helpful to refer to them.
  • Are you planning to use NumPy for the distance-based tree estimation and consensus tree search functions? If so, it will be important to keep this as a "soft" dependency. Depending on performance (e.g. test with PyPy), it might even be nice to attempt pure-Python implementations of these.
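
One way to keep NumPy a "soft" dependency is to try the import and fall back to pure Python when it is missing. The sketch below is only an illustration of that pattern; the function name `pairwise_differences` is hypothetical, and the input is assumed to be a list of aligned, equal-length strings.

```python
try:
    import numpy as np  # optional: used only if it is installed
except ImportError:
    np = None


def pairwise_differences(seqs):
    """Return an n x n nested list of mismatch counts between aligned sequences."""
    if np is not None:
        arr = np.array([list(s) for s in seqs])
        # Broadcast every row against every other row, then count mismatches.
        return (arr[:, None, :] != arr[None, :, :]).sum(axis=2).tolist()
    # Pure-Python fallback with the same result.
    return [[sum(x != y for x, y in zip(a, b)) for b in seqs] for a in seqs]


print(pairwise_differences(["ACGT", "ACGA", "TCGA"]))
```

Both branches return the same plain nested list, so callers do not need to know which one ran.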

Week 1:

Week 2:

  • There are a couple of standard algorithms for parsimony which I think you could implement. It looks like you've managed to get a copy of Felsenstein's "Inferring Phylogenies" -- I recall there being a good description of the algorithm there, or at least a specific reference to it by name. It would be good if you could specify the algorithm you plan to use here.
  • You've specified the difficulty as hard, which is reasonable since you're not familiar with the algorithm yet. In this case, you might want to reserve 2 weeks for this task, or else do some more background reading now and justify a lower difficulty level with a more detailed description of your approach.
  • We do have wrappers for RAxML and some of the Phylip programs (via EMBOSS/Embassy), which implement maximum parsimony inference already.
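
For reference, one standard way to score a fixed tree is Fitch's algorithm; whether that is the algorithm best suited here is an assumption. A minimal sketch for a rooted binary tree, using nested tuples for the tree and a dict for the alignment (both representations, and the function names, are hypothetical):

```python
def fitch_column(tree, column):
    """Return (state_set, changes) for one alignment column.

    tree is a leaf name (str) or a 2-tuple of subtrees;
    column maps each leaf name to its state at this site.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch_column(left, column)
    right_states, right_changes = fitch_column(right, column)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # Disjoint state sets: take the union and count one change.
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_column(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


alignment = {"A": "AA", "B": "AA", "C": "CC", "D": "CC"}
print(parsimony_score((("A", "B"), ("C", "D")), alignment))
print(parsimony_score((("A", "C"), ("B", "D")), alignment))
```

This covers only the scoring step; the tree search (branch and bound or a heuristic) is the harder part of the Week 2 task.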

Week 3:

  • Would it be a significant setback to first implement the strict consensus tree calculation the "normal" way, then speed it up with your binary array approach? If not, that would increase the likelihood that you have a working implementation of strict consensus before midterm evals.
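
As a rough sketch of what the "normal" way could look like, each rooted tree can be reduced to a set of clades (frozensets of leaf names), and the strict consensus keeps the clades present in every tree. The bitmask helper shows how the same clades might later be encoded as integers. The nested-tuple tree format and all function names are assumptions made for this example.

```python
def clades(tree):
    """Return (leaves, clade_set) for a tree given as nested tuples of names."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the clade rooted at this internal node
    return leaves, found


def strict_consensus_clades(trees):
    """Return the clades shared by every tree (requires at least one tree)."""
    return set.intersection(*[clades(t)[1] for t in trees])


def to_bitmask(clade, index):
    """Encode a clade as an int, one bit per leaf, given a leaf -> bit index."""
    mask = 0
    for leaf in clade:
        mask |= 1 << index[leaf]
    return mask


t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "B"), "D"), "C")
shared = strict_consensus_clades([t1, t2])
print(sorted(sorted(c) for c in shared))
```

Counting how often each clade occurs, instead of intersecting, would give the majority-rule variant with the same data structure.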

Week 9:

  • I would recommend extracting a new module _draw.py from the existing _utils.py. I might do this split myself before the next Biopython release.
  • FYI: Phylo.draw is essentially the same layout algorithm as draw_ascii, but with a few basic matplotlib functions instead of ASCII-art. Phylo.draw_graphviz just calls out to networkx and pygraphviz for layout and rendering, and is not especially helpful for educational purposes.

Week 10-11:

  • There are two steps to this: (1) Implement radial layout, which is a fairly widely used algorithm. (2) Implement Felsenstein's Equal Daylight algorithm, which takes a radial layout and iteratively adjusts it to make it prettier. Consider moving (2) to Week 11.
  • The only widely used Equal Daylight implementation I'm aware of is in Phylip (language: C).
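
For step (1), a minimal sketch of an equal-angle radial layout is given below, assuming that is the radial layout meant here. Each subtree receives a wedge of the circle proportional to its number of leaves, and each child is drawn along the middle of its wedge. The tuple-based tree, the unit branch lengths and the function names are assumptions for illustration; only leaf coordinates are returned.

```python
import math


def count_leaves(tree):
    """Number of leaves under a node; leaves are strings, clades are tuples."""
    return 1 if isinstance(tree, str) else sum(count_leaves(c) for c in tree)


def equal_angle_layout(tree, length=1.0):
    """Return {leaf name: (x, y)} for an equal-angle radial layout.

    Every branch is drawn with the same length (an assumption of this sketch).
    """
    coords = {}

    def place(node, x, y, start, end):
        if isinstance(node, str):
            coords[node] = (x, y)
            return
        total = count_leaves(node)
        angle = start
        for child in node:
            # Wedge width is proportional to the child's share of the leaves.
            width = (end - start) * count_leaves(child) / total
            mid = angle + width / 2
            place(
                child,
                x + length * math.cos(mid),
                y + length * math.sin(mid),
                angle,
                angle + width,
            )
            angle += width

    place(tree, 0.0, 0.0, 0.0, 2 * math.pi)
    return coords


layout = equal_angle_layout(("A", "B", "C", "D"))
for name, (x, y) in sorted(layout.items()):
    print(name, round(x, 3), round(y, 3))
```

Step (2), the Equal Daylight refinement, would start from coordinates like these and iteratively rotate subtrees.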
