
Updates #1
Open
sjackman opened this issue Sep 6, 2016 · 23 comments

@sjackman

sjackman commented Sep 6, 2016

Hi, all! My name is Shaun, and I'm leading project 2 (optimization). I am a bioinformatics PhD student, a developer of Linuxbrew and ABySS, an open-source programmer, an avid traveller, a singer and an experimental amateur chef. Please introduce yourself here, and I look forward to meeting you in October!

@sjackman sjackman self-assigned this Sep 6, 2016
@sjackman

sjackman commented Sep 6, 2016

There's some existing conversation about this project over at this GitHub issue: hackseq/hackseq_projects_2016#9

@sjackman

sjackman commented Sep 8, 2016

If you're new to the command line and git, check out these particularly good interactive training sites:
http://rik.smith-unna.com/command_line_bootcamp/
https://try.github.io
https://github.github.com/on-demand/intro-to-github/

@lisabang

Hey, I'm Lisa. I'm a bioinformatics analyst with Geisinger Health System in Pennsylvania, but I'm originally from California. See you in October.

PS: Should we use Docker images? Is that mandated?

@sjackman

sjackman commented Oct 7, 2016

Hi, Lisa! We can use Docker images, but no, they're not mandated. We'll be using Amazon for compute services, so we'll need an Amazon Machine Image, and we could use a Docker image. We also have access to the ORCA Docker HPC service at the BC Cancer Agency Genome Sciences Centre.

I'm a developer of Linuxbrew, and I like to use it to install bioinformatics formulae from Homebrew-Science, so one likely Docker image to use is linuxbrew/linuxbrew.

@sjackman

sjackman commented Oct 7, 2016

Hi, all! @GlastonburyC @lisabang @hyounesy @yxblee

Here's the rough plan for Hackseq:

  1. Identify functions and data sets to optimize
    1. a toy function that can be optimized in minutes for development
    2. a real genome assembly problem that can be optimized in a few hours
  2. Evaluate optimizers for usability and speed
    1. OPAL by @dpo: Optimization of algorithms with OPAL
    2. Spearmint by @mgelbart: Predictive Entropy Search for Multi-objective Bayesian Optimization
    3. ParOpt by @sseemayer which uses scipy.optimize
    4. Possibly Python packages, like scikit-optimize
    5. Possibly R packages, a long list
  3. Generate a report of the results of the optimization
    1. Generate plots of target metric vs parameters
    2. Draw the Pareto frontier of the target metric and a second metric of interest (contiguity and correctness), likely in R using ggplot
  4. Write a short report of our experience
    1. Post on GitHub pages
    2. Possibly submit to a preprint server (bioRxiv, PeerJ, Figshare)
    3. Possibly submit for peer review, such as F1000Research Hackathons

There's a whole bunch of other optimizers discussed over at hackseq/hackseq_projects_2016#9
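Steps 1 and 2 of the plan above can be sketched end to end with a toy objective plus a naive grid-search "optimizer". Everything here is made up for illustration (the function shape, the peak at k = 32); a real objective would invoke abyss-pe and score the resulting assembly, which is why the sketch caches evaluations — each real one can take minutes to hours.

```python
from functools import lru_cache
import math

# Hypothetical stand-in for the "toy function" of step 1: a real objective
# would run `abyss-pe k=<k> ...` and return the NG50 of the assembly.
# The quadratic shape peaking at k=32 is purely illustrative.
@lru_cache(maxsize=None)
def toy_ng50(k):
    return 10000 - 40 * (k - 32) ** 2

def grid_search(objective, k_min, k_max):
    """Evaluate every integer k in [k_min, k_max]; return (best_k, best_score).

    Caching the objective matters because optimizers may revisit the same k,
    and a real evaluation is an expensive assembly run.
    """
    best_k, best_score = None, -math.inf
    for k in range(k_min, k_max + 1):
        score = objective(k)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

print(grid_search(toy_ng50, 16, 50))  # (32, 10000) for this toy objective
```

The same cached objective can then be handed to any of the optimizers under evaluation, so their answers and run times are directly comparable.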

@GlastonburyC

Hi @sjackman, this seems like a very thorough and well-thought-out plan. As I will be working remotely, I think it would be helpful to delegate tasks somewhat in advance.

I would be very happy to evaluate the optimizers for usability and speed and write these results up on GitHub. This seems like a very useful piece of work to have published, so I do think we should aim for a preprint, and I would happily contribute to the writing of such.

In regards to optimizing genome assembly, perhaps you could suggest a tool you use for genome assembly that we can focus on, and the parameters of interest? Additionally, supplying a dummy data set - perhaps a very small unassembled genome - that we can test optimization against would be good. I work in RNA-seq, so I'm unfamiliar with 1) genome assembly tools and 2) parameters that indicate a 'good' assembly.

Cheers.

@sjackman

sjackman commented Oct 9, 2016

Great to have you on board, Craig! I was thinking of assigning one optimizer to evaluate to each participant. You can always come back for more if you finish that one. Do you have familiarity with either Python or R, and would you like to pick one of the optimizers to evaluate?

> I would be very happy to evaluate the optimizers for usability and speed and write these results up on GitHub. This seems like a very useful piece of work to have published, so I do think we should aim for a preprint, and I would happily contribute to the writing of such.

Great! I'm hoping to assign one person to continuously develop the report throughout the weekend, and then we can all contribute to writing and editing on the last day.

> In regards to optimizing genome assembly, perhaps you could suggest a tool you use for genome assembly that we can focus on, and the parameters of interest?

I am a developer of the assembler ABySS, so I plan to use it for this hackathon because I'm most familiar with it, but I plan for the knowledge gained to be broadly applicable to any assembler.

> Additionally, supplying a dummy data set - perhaps a very small unassembled genome - that we can test optimization against would be good.

Here's a hands-on tutorial/activity on genome assembly that I developed. Exercise 3 shows how to assemble a small data set, a human bacterial artificial chromosome (BAC), using ABySS. This will be one of the data sets that we use.

http://sjackman.ca/abyss-activity/#exercise-3-assemble-the-reads-into-contigs-using-abyss

> I work in RNA-seq, so I'm unfamiliar with 1) genome assembly tools and 2) parameters that indicate a 'good' assembly.

The key metrics are contiguity (1) and correctness (2 through 4).

  1. contiguity (NG50, N50) and aligned contiguity (NGA50, NA50)
  2. number of breakpoints when aligned to the reference as a proxy for misassemblies
  3. number of mismatched nucleotides when aligned to the reference, Q = -10*log10(mismatches / total_aligned)
  4. completeness, number of reference bases covered by aligned contigs divided by number of reference bases

We'll be optimizing the NG50 metric (or NGA50 with a reference genome) and reporting (but probably not optimizing) the correctness metrics. The primary parameter we'll be optimizing is k, a fundamental parameter of nearly all de Bruijn graph assemblers, and there are a bunch of other parameters that we can play with (typically thresholds related to expected coverage).
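The two headline metrics above are easy to compute once you have contig lengths and alignment counts. A minimal sketch, with illustrative numbers in the example call:

```python
import math

def ng50(contig_lengths, genome_size):
    """NG50: the length of the contig at which the cumulative sum of
    descending contig lengths first reaches half the genome size.
    (N50 is the same computation with the total assembly size in place
    of the known/estimated genome size.)"""
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0  # the assembly covers less than half the genome

def mismatch_q(mismatches, total_aligned):
    """Phred-scaled mismatch rate: Q = -10 * log10(mismatches / total_aligned)."""
    return -10 * math.log10(mismatches / total_aligned)

print(ng50([400, 300, 200, 100], genome_size=1000))  # 300
print(mismatch_q(1, 1000))  # 30.0, i.e. one mismatch per kilobase
```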

@GlastonburyC

Hi @sjackman, thanks! That's all very helpful.

I'm familiar with R and Python - I would like to explore ParOpt, as it seems quite generalisable.

@sjackman

sjackman commented Oct 9, 2016

Excellent. I believe it uses scipy.optimize and Nelder-Mead, also known as the amoeba method. It's all yours!

@sjackman

sjackman commented Oct 9, 2016

Dominique Orban @dpo wrote…

> I'd like to introduce two colleagues of mine who specialize in optimization, with a special interest in derivative-free and parameter optimization: Margherita and Philippe. So everybody can find out who everybody is, I'm going to list their home pages (I don't believe they are currently GitHub users):
>
> Margherita: http://web.math.unifi.it/users/porcelli
> Philippe: http://perso.fundp.ac.be/~phtoint/toint.html
>
> Margherita and Philippe are both authors of a nonsmooth optimization package called BFO (for Brute-Force Optimization), a new kid on the block that we believe would also fit well in your testing schedule. BFO is written in Matlab and is built on principles that share similarities with the optimizer underlying OPAL:
>
> https://sites.google.com/site/bfocode
>
> BFO is being actively worked on, and I am told it should support surrogate models and categorical variables in the near future (if it doesn't already). One of its distinctive features is that it will self-tune on the test problems given.
>
> If you don't have access to Matlab, one of us will be happy to run the tests; we just have to engineer communication between Matlab and your tools.
>
> We're hoping you'll include BFO in your tests. We're available to answer any questions and to contribute, as we're always looking for interesting new applications of parameter optimization.

@sjackman

sjackman commented Oct 9, 2016

Hi, Dominique @dpo, Margherita, Philippe. Yes, please do post this e-mail on the GitHub issue.

We won't have Matlab on the AWS instances that we'll be using, so we won't be able to test BFO ourselves. If you'd like to test it yourselves on the data sets that we'll be using, I'd be happy to share the data sets and answer any questions that you have. Most info will be available on the public GitHub repo. There will be some private Slack conversations as well, and I'd be happy to invite you to the Slack, but I prefer GitHub for communication, so we'll be using that mostly.

Cheers,
Shaun

@lisabang

I'm more proficient in Python than R; I'm on Slack as lisabang. I'll spend tomorrow traveling to Vancouver. Looking forward to it.

@daisieh

daisieh commented Oct 15, 2016

Hello! I am interested in helping out with this project. I'm a general software engineer and phylogeneticist, so I work a lot with both ends of this sort of thing, as both a user and a developer of tools.

@lisabang

@GlastonburyC I'd be interested in helping out with ParOpt

@GlastonburyC

@sjackman What's a sensible range over which to optimize k with respect to N50?

@sjackman

The read length for this data set is 50 nucleotides per read. A reasonable range for k is 16 to 50.

@GlastonburyC

@sjackman Great. Could you make available the 10x smaller genome assembly problem you mentioned on Slack?

@sjackman

sjackman commented Oct 15, 2016

The 10 fold smaller dataset is on ORCA at /home/sjackman/sjackman/HS0674/200k.fq.gz

@GlastonburyC

Hi - I cannot push to this project as I don't have permission:
fatal: unable to access 'https://github.com/hackseq/2016_project_2.git/': The requested URL returned error: 403

@GlastonburyC

I've implemented grid-search optimisation for ABySS using ParOpt as a first step.

@GlastonburyC

GlastonburyC commented Oct 16, 2016

As a next step, now that I have an optimal value of k, I need to compare N50 vs the correctness metrics. I'm not sure which metrics ABySS outputs directly and which need generating by additional means (alignment). Ideas? 👍 @sjackman

@sjackman

For smallish genomes, the go-to package for evaluating correctness is QUAST: http://quast.bioinf.spbau.ru

The toy data set is human, from chromosome 3. You can use this reference genome: http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr3.fa.gz
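A QUAST run could be scripted from Python along these lines. The file names are placeholders; the flags follow QUAST's command-line interface (-R supplies the reference, -o the output directory):

```python
import subprocess

def quast_cmd(contigs, reference, outdir):
    """Build a QUAST command line for evaluating an assembly against a reference."""
    return ["quast.py", contigs, "-R", reference, "-o", outdir]

# Placeholder file names for illustration.
cmd = quast_cmd("abyss-contigs.fa", "chr3.fa.gz", "quast-out")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment where QUAST is installed
```

The misassembly and mismatch counts then come from QUAST's report files in the output directory.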

@sjackman

Another option (rather than evaluating a second metric) would be to optimize multiple parameters. I'd suggest optimizing both k and s simultaneously. You could add n next if that goes well.

There's a description of the parameters here: https://github.com/bcgsc/abyss
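A joint (k, s) sweep could start as a simple exhaustive grid before handing the problem to a real optimizer. The objective below is a made-up stand-in with its peak placed arbitrarily at (32, 500); a real one would run abyss-pe with those parameter values and measure NG50:

```python
import itertools

# Illustrative stand-in for an ABySS run with parameters k (k-mer size)
# and s (a minimum-seed-size threshold in ABySS); the surface shape and
# the peak at (32, 500) are invented for this sketch.
def toy_ng50(k, s):
    return 10000 - 40 * (k - 32) ** 2 - (s - 500) ** 2 // 100

def grid_search_2d(objective, k_values, s_values):
    """Exhaustively score every (k, s) pair and return the best one."""
    return max(itertools.product(k_values, s_values),
               key=lambda pair: objective(*pair))

best_k, best_s = grid_search_2d(toy_ng50, range(16, 51, 2), range(200, 1001, 100))
print(best_k, best_s)  # 32 500
```

With real assembly runs, a coarse grid like this is mainly useful for bounding the region that a smarter optimizer (Nelder-Mead, Bayesian optimization) should then explore.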

yxblee added a commit that referenced this issue Oct 17, 2016
Update fork from original