
Updates #1
Open
sjackman opened this issue Sep 6, 2016 · 23 comments

@sjackman

sjackman commented Sep 6, 2016

Hi, all! My name is Shaun, and I'm leading project 2 (optimization). I am a bioinformatics PhD student, a developer of Linuxbrew and ABySS, an open-source programmer, an avid traveller, a singer and an experimental amateur chef. Please introduce yourself here, and I look forward to meeting you in October!

@sjackman sjackman self-assigned this Sep 6, 2016
@sjackman

sjackman commented Sep 6, 2016

There's some existing conversation about this project over at this GitHub issue: hackseq/hackseq_projects_2016#9

@sjackman

sjackman commented Sep 8, 2016

If you're new to the command line and git, check out these particularly good interactive training sites:
http://rik.smith-unna.com/command_line_bootcamp/
https://try.github.io
https://github.github.com/on-demand/intro-to-github/

@lisabang

Hey, I'm Lisa. I'm a bioinformatics analyst with Geisinger Health System in Pennsylvania, but I'm originally from California. See you in October.

PS: Should we use Docker images? Is that mandated?

@sjackman

sjackman commented Oct 7, 2016

Hi, Lisa! We can use Docker images, but no, they're not mandated. We'll be using Amazon for compute services, so we'll need an Amazon Machine Image, and we could use a Docker image. We also have access to the ORCA Docker HPC service at the BC Cancer Agency Genome Sciences Centre.

I'm a developer of Linuxbrew, and I like to use it to install bioinformatics formulae from Homebrew-Science, so one likely Docker image to use is linuxbrew/linuxbrew.

@sjackman

sjackman commented Oct 7, 2016

Hi, all! @GlastonburyC @lisabang @hyounesy @yxblee

Here's the rough plan for Hackseq:

  1. Identify functions and data sets to optimize
    1. a toy function that can be optimized in minutes for development
    2. a real genome assembly problem that can be optimized in a few hours
  2. Evaluate optimizers for usability and speed
    1. OPAL by @dpo: Optimization of algorithms with OPAL
    2. Spearmint by @mgelbart: Predictive Entropy Search for Multi-objective Bayesian Optimization
    3. ParOpt by @sseemayer which uses scipy.optimize
    4. Possibly Python packages, like scikit-optimize
    5. Possibly R packages, a long list
  3. Generate a report of the results of the optimization
    1. Generate plots of target metric vs parameters
    2. Draw the Pareto frontier of the target metric and a second metric of interest (contiguity and correctness), likely in R using ggplot
  4. Write a short report of our experience
    1. Post on GitHub pages
    2. Possibly submit to a preprint server (bioRxiv, PeerJ, Figshare)
    3. Possibly submit for peer review, such as F1000Research Hackathons

There's a whole bunch of other optimizers discussed over at hackseq/hackseq_projects_2016#9
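Steps 1 and 2 of the plan above can be sketched end to end with a toy objective plus a naive grid-search "optimizer". Everything here is made up for illustration (the function shape, the peak at k = 32); a real objective would invoke abyss-pe and score the resulting assembly, which is why the sketch caches evaluations — each real one can take minutes to hours.

```python
from functools import lru_cache
import math

# Hypothetical stand-in for the "toy function" of step 1: a real objective
# would run `abyss-pe k=<k> ...` and return the NG50 of the assembly.
# The quadratic shape peaking at k=32 is purely illustrative.
@lru_cache(maxsize=None)
def toy_ng50(k):
    return 10000 - 40 * (k - 32) ** 2

def grid_search(objective, k_min, k_max):
    """Evaluate every integer k in [k_min, k_max]; return (best_k, best_score).

    Caching the objective matters because optimizers may revisit the same k,
    and a real evaluation is an expensive assembly run.
    """
    best_k, best_score = None, -math.inf
    for k in range(k_min, k_max + 1):
        score = objective(k)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

print(grid_search(toy_ng50, 16, 50))  # (32, 10000) for this toy objective
```

The same cached objective can then be handed to any of the optimizers under evaluation, so their answers and run times are directly comparable.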

@GlastonburyC

Hi @sjackman, this seems like a very thorough and well-thought-out plan. As I will be working remotely, I think it would be helpful to delegate tasks somewhat in advance.

I would be very happy to evaluate the optimizers for usability and speed and write these results up on GitHub. This seems like a very useful piece of work to have published, so I do think we should aim for a preprint, and I would happily contribute to the writing of such.

In regards to optimizing genome assembly, perhaps you could suggest a tool you use for genome assembly that we can focus on, and the parameters of interest? Additionally, supplying a dummy data set - perhaps a very small unassembled genome - that we can test optimization against would be good. I work in RNA-seq, so I'm unfamiliar with 1) genome assembly tools and 2) parameters that indicate a 'good' assembly.

Cheers.

@sjackman

sjackman commented Oct 9, 2016

Great to have you on board, Craig! I was thinking of assigning one optimizer to evaluate to each participant. You can always come back for more if you finish that one. Do you have familiarity with either Python or R, and would you like to pick one of the optimizers to evaluate?

> I would be very happy to evaluate the optimizers for usability and speed and write these results up on GitHub. This seems like a very useful piece of work to have published, so I do think we should aim for a preprint, and I would happily contribute to the writing of such.

Great! I'm hoping to assign one person to continuously develop the report throughout the weekend, and then we can all contribute to writing and editing on the last day.

> In regards to optimizing genome assembly, perhaps you could suggest a tool you use for genome assembly that we can focus on, and the parameters of interest?

I am a developer of the assembler ABySS, so I plan to use it for this hackathon because I'm most familiar with it, but I plan for the knowledge gained to be broadly applicable to any assembler.

> Additionally, supplying a dummy data set - perhaps a very small unassembled genome - that we can test optimization against would be good.

Here's a hands-on tutorial/activity on genome assembly that I developed. Exercise 3 shows how to assemble a small data set, a human bacterial artificial chromosome (BAC), using ABySS. This will be one of the data sets that we use.

http://sjackman.ca/abyss-activity/#exercise-3-assemble-the-reads-into-contigs-using-abyss

> I work in RNA-seq, so I'm unfamiliar with 1) genome assembly tools and 2) parameters that indicate a 'good' assembly.

The key metrics are contiguity (1) and correctness (2 through 4).

  1. contiguity (NG50, N50) and aligned contiguity (NGA50, NA50)
  2. number of breakpoints when aligned to the reference as a proxy for misassemblies
  3. number of mismatched nucleotides when aligned to the reference, Q = -10*log10(mismatches / total_aligned)
  4. completeness, number of reference bases covered by aligned contigs divided by number of reference bases

We'll be optimizing the NG50 metric (or NGA50 with a reference genome) and reporting (but probably not optimizing) the correctness metrics. The primary parameter we'll be optimizing is k, a fundamental parameter of nearly all de Bruijn graph assemblers, and there are a bunch of other parameters that we can play with (typically thresholds related to expected coverage).
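The two headline metrics above are easy to compute once you have contig lengths and alignment counts. A minimal sketch, with illustrative numbers in the example call:

```python
import math

def ng50(contig_lengths, genome_size):
    """NG50: the length of the contig at which the cumulative sum of
    descending contig lengths first reaches half the genome size.
    (N50 is the same computation with the total assembly size in place
    of the known/estimated genome size.)"""
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0  # the assembly covers less than half the genome

def mismatch_q(mismatches, total_aligned):
    """Phred-scaled mismatch rate: Q = -10 * log10(mismatches / total_aligned)."""
    return -10 * math.log10(mismatches / total_aligned)

print(ng50([400, 300, 200, 100], genome_size=1000))  # 300
print(mismatch_q(1, 1000))  # 30.0, i.e. one mismatch per kilobase
```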

@GlastonburyC

Hi @sjackman, thanks! That's all very helpful.

I'm familiar with R and Python - I would like to explore ParOpt, as it seems quite generalisable.

@sjackman

sjackman commented Oct 9, 2016

Excellent. I believe it uses scipy.optimize and Nelder-Mead, also known as the amoeba method. It's all yours!

@sjackman

sjackman commented Oct 9, 2016

Dominique Orban @dpo wrote…

> I'd like to introduce two colleagues of mine who specialize in optimization, with a special interest in derivative-free and parameter optimization: Margherita and Philippe. So everybody can find out who everybody is, I'm going to list their home pages (I don't believe they are currently GitHub users):
>
> Margherita: http://web.math.unifi.it/users/porcelli
> Philippe: http://perso.fundp.ac.be/~phtoint/toint.html
>
> Margherita and Philippe are both authors of a nonsmooth optimization package called BFO (for Brute-Force Optimization), a new kid on the block that we believe would also fit well in your testing schedule. BFO is written in Matlab and is built on principles that share similarities with the optimizer underlying OPAL:
>
> https://sites.google.com/site/bfocode
>
> BFO is being actively worked on, and I am told it should support surrogate models and categorical variables in the near future (if it doesn't already). One of its distinctive features is that it will self-tune on the test problems given.
>
> If you don't have access to Matlab, one of us will be happy to run the tests; we just have to engineer communication between Matlab and your tools.
>
> We're hoping you'll include BFO in your tests. We're available to answer any questions and to contribute, as we're always looking for interesting new applications of parameter optimization.

@sjackman

sjackman commented Oct 9, 2016

Hi, Dominique @dpo, Margherita, Philippe. Yes, please do post this e-mail on the GitHub issue.

We won't have Matlab on the AWS instances that we'll be using, so we won't be able to test BFO ourselves. If you'd like to test it yourselves on the data sets that we'll be using, I'd be happy to share the data sets and answer any questions that you have. Most info will be available on the public GitHub repo. There will be some private Slack conversations as well, and I'd be happy to invite you to the Slack, but I prefer GitHub for communication, so we'll be using that mostly.

Cheers,
Shaun

@lisabang

I'm more proficient in Python than R; I'm on Slack as lisabang. I'll spend tomorrow traveling to Vancouver. Looking forward to it.

@daisieh

daisieh commented Oct 15, 2016

Hello! I am interested in helping out with this project. I'm a general software engineer and phylogeneticist, so I work a lot with both ends of this sort of thing, as both a user and a developer of tools.

@lisabang

@GlastonburyC I'd be interested in helping out with ParOpt

@GlastonburyC

@sjackman What's a sensible range over which to optimize k with respect to N50?

@sjackman

The read length for this data set is 50 nucleotides per read. A reasonable range for k is 16 to 50.

@GlastonburyC

@sjackman Great. Could you make available the 10x smaller genome assembly problem you mentioned on Slack?

@sjackman

sjackman commented Oct 15, 2016

The 10 fold smaller dataset is on ORCA at /home/sjackman/sjackman/HS0674/200k.fq.gz

@GlastonburyC

Hi - I cannot push to this project as I don't have permission:
fatal: unable to access 'https://github.com/hackseq/2016_project_2.git/': The requested URL returned error: 403

@GlastonburyC

I've implemented grid-search optimisation for ABySS using ParOpt as a first step.

@GlastonburyC

GlastonburyC commented Oct 16, 2016

As a next step, now that I have an optimal value of k, I need to compare N50 vs the correctness metrics. I'm not sure which metrics ABySS outputs directly and which need generating by additional means (alignment). Ideas? 👍 @sjackman

@sjackman

For smallish genomes, the go-to package for evaluating correctness is QUAST: http://quast.bioinf.spbau.ru

The toy data set is human, from chromosome 3. You can use this reference genome: http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr3.fa.gz
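A QUAST run could be scripted from Python along these lines. The file names are placeholders; the flags follow QUAST's command-line interface (-R supplies the reference, -o the output directory):

```python
import subprocess

def quast_cmd(contigs, reference, outdir):
    """Build a QUAST command line for evaluating an assembly against a reference."""
    return ["quast.py", contigs, "-R", reference, "-o", outdir]

# Placeholder file names for illustration.
cmd = quast_cmd("abyss-contigs.fa", "chr3.fa.gz", "quast-out")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment where QUAST is installed
```

The misassembly and mismatch counts then come from QUAST's report files in the output directory.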

@sjackman

Another option (rather than evaluating a second metric) would be to optimize multiple parameters. I'd suggest optimizing both k and s simultaneously. You could add n next if that goes well.

There's a description of the parameters here: https://github.com/bcgsc/abyss
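A joint (k, s) sweep could start as a simple exhaustive grid before handing the problem to a real optimizer. The objective below is a made-up stand-in with its peak placed arbitrarily at (32, 500); a real one would run abyss-pe with those parameter values and measure NG50:

```python
import itertools

# Illustrative stand-in for an ABySS run with parameters k (k-mer size)
# and s (a minimum-seed-size threshold in ABySS); the surface shape and
# the peak at (32, 500) are invented for this sketch.
def toy_ng50(k, s):
    return 10000 - 40 * (k - 32) ** 2 - (s - 500) ** 2 // 100

def grid_search_2d(objective, k_values, s_values):
    """Exhaustively score every (k, s) pair and return the best one."""
    return max(itertools.product(k_values, s_values),
               key=lambda pair: objective(*pair))

best_k, best_s = grid_search_2d(toy_ng50, range(16, 51, 2), range(200, 1001, 100))
print(best_k, best_s)  # 32 500
```

With real assembly runs, a coarse grid like this is mainly useful for bounding the region that a smarter optimizer (Nelder-Mead, Bayesian optimization) should then explore.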

yxblee added a commit that referenced this issue Oct 17, 2016
Update fork from original