Updates #1
Comments
There's some existing conversation about this project over at this GitHub issue: hackseq/hackseq_projects_2016#9
If you're new to the command line and git, check out these particularly good interactive training sites:
Hey, I'm Lisa. I'm a bioinformatics analyst with Geisinger Health System in Pennsylvania, but I'm originally from California. See you in October. PS: Should we use Docker images? Is that mandated?
Hi, Lisa! We can use Docker images, but no, they're not mandated. We'll be using Amazon for compute services, so we'll need an Amazon Machine Image, and we could use a Docker image. We also have access to the ORCA Docker HPC service at the BC Cancer Agency Genome Sciences Centre. I'm a developer on Linuxbrew, and I like to use Linuxbrew to install bioinformatics formulae in Homebrew-Science, so one likely Docker image to use is linuxbrew/linuxbrew.
Hi, all! @GlastonburyC @lisabang @hyounesy @yxblee Here's the rough plan for Hackseq:
There's a whole bunch of other optimizers discussed over at hackseq/hackseq_projects_2016#9
Hi @sjackman, this seems like a very thorough and well-thought-out plan. As I will be working remotely, I think it would be helpful to delegate tasks somewhat in advance. I would be very happy to evaluate the optimizers for usability and speed and write these results up on GitHub. This seems like a very useful piece of work to have published, so I do think we should aim for a preprint, and I would happily contribute to writing it. As for optimizing genome assembly, could you suggest a tool you use for genome assembly that we can focus on, along with the parameters of interest? Additionally, supplying a dummy dataset to test optimization against, perhaps a very small unassembled genome, would be good. I work in RNA-seq, so I'm unfamiliar with 1) genome assembly tools and 2) parameters that indicate a 'good' assembly. Cheers.
Great to have you on board, Craig! I was thinking of assigning one optimizer to evaluate to each participant. You can always come back for more if you finish that one. Do you have familiarity with either Python or R, and would you like to pick one of the optimizers to evaluate?
Great! I'm hoping to assign one person to continuously develop the report throughout the weekend, and then we can all contribute to writing and editing on the last day.
I am a developer of the assembler ABySS, so I plan to use it for this hackathon because I'm most familiar with it, but I plan for the knowledge gained to be broadly applicable to any assembler.
Here's a hands-on tutorial/activity that I developed on genome assembly. Exercise 3 shows how to assemble a small data set, a human bacterial artificial chromosome (BAC), using ABySS. This will be one of the data sets that we use. http://sjackman.ca/abyss-activity/#exercise-3-assemble-the-reads-into-contigs-using-abyss
The key metrics are contiguity (1) and correctness (2 through 4).
We'll be optimizing the NG50 metric (or NGA50 with a reference genome) and reporting (but probably not optimizing) the correctness metrics. The primary parameter we'll be optimizing is k (a fundamental parameter of nearly all de Bruijn graph assemblers), and there are a bunch of other parameters that we can play with (typically thresholds related to expected coverage).
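For anyone new to the contiguity metrics mentioned above, here is a minimal sketch of how N50 and NG50 can be computed from a list of contig lengths. It uses the usual definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly size for N50, or half of the genome size for NG50). The function names and the example lengths are made up for illustration and are not taken from ABySS or any other tool.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50 for the given contig lengths and (estimated) genome size.

    Contigs are taken from longest to shortest until their total length
    reaches at least half of genome_size; the length of the last contig
    taken is the NG50. Returns 0 if the contigs never reach that total.
    """
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0


def n50(contig_lengths):
    """Return N50, which is NG50 with the assembly size as the target."""
    return ng50(contig_lengths, sum(contig_lengths))


# Illustrative contig lengths only (not real assembly output).
lengths = [80, 70, 50, 40, 30, 20]
print("N50:", n50(lengths))
print("NG50 for a genome of size 400:", ng50(lengths, 400))
```

NGA50 follows the same idea but is computed over aligned blocks (contigs broken at misassemblies) rather than whole contigs, so it needs an alignment to a reference genome and is not covered by this sketch.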
Hi @sjackman, thanks! That's all very helpful. I'm familiar with R and Python. I would like to explore ParOpt, as it seems quite generalisable.
Excellent. I believe it uses scipy.optimize and Nelder-Mead, also known as the amoeba method. It's all yours!
Dominique Orban @dpo wrote…
Hi, Dominique @dpo, Margherita, Philippe. Yes, please do post this e-mail on the GitHub issue. We won't have Matlab on the AWS instances that we'll be using, so we won't be able to test BFO ourselves. If you'd like to test it yourselves on the datasets that we'll be using, I'd be happy to share the datasets and answer any questions that you have. Most info will be available on the public GitHub repo. There will be some private Slack conversations as well, and I'd be happy to invite you to the Slack. I prefer GitHub for communication though, so we'll be using that mostly. Cheers,
I'm more proficient in Python than R; I'm on Slack as lisabang. Will spend tomorrow traveling to Vancouver, looking forward to it.
Hello! I am interested in helping out with this project. I'm a general software engineer and phylogeneticist, so I work a lot with both ends of this sort of thing, as a user and a coder of tools.
@GlastonburyC I'd be interested in helping out with ParOpt
@sjackman What's a sensible range to optimize k with respect to K50?
The read length for this data set is 50 nucleotides per read. A reasonable range for k is 16 to 50.
@sjackman Great. Could you make available the 10x smaller genome assembly problem you mentioned on Slack?
The 10 fold smaller dataset is on ORCA at
Hi - I cannot push to this project as I don't have permission:
I've implemented grid-search optimisation for ABySS using ParOpt as a first step.
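To make the grid-search idea concrete, here is a minimal sketch in plain Python. It is not the ParOpt implementation and does not run ABySS: `toy_ng50` is a hypothetical stand-in for "assemble with this k and measure NG50", and its shape (a single peak at k = 35) is invented purely for illustration. The range 16 to 50 is the one suggested earlier in this thread for 50-nucleotide reads.

```python
def grid_search(objective, k_values):
    """Evaluate objective(k) for every k and return (best_k, best_score).

    The score is maximised; on ties, the first k encountered is kept.
    """
    best_k, best_score = None, float("-inf")
    for k in k_values:
        score = objective(k)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


def toy_ng50(k):
    """Hypothetical objective, NOT real assembler output.

    In practice this would run the assembler with the given k and
    return the NG50 of the resulting contigs.
    """
    return 10000 - 25 * (k - 35) ** 2


# Search every k from 16 to 50 inclusive.
best_k, best_score = grid_search(toy_ng50, range(16, 51))
print("best k:", best_k, "score:", best_score)
```

A grid search costs one assembly per value of k, which is affordable for a single integer parameter on a small dataset but grows quickly once several parameters are optimised together. That is where the other optimizers being evaluated may pay off.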
As a next step, now that I have an optimal value of k, I need to compare N50 against the correctness metrics. I'm not sure which metrics are output directly by ABySS and which need to be generated by additional means (alignment). Ideas? 👍 @sjackman
For smallish genomes the go-to package for evaluating correctness is QUAST (http://quast.bioinf.spbau.ru). The toy data set is from human chromosome 3. You can use this reference genome: http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr3.fa.gz
Another option (rather than evaluating a second metric) would be to optimize multiple parameters. I'd suggest optimizing both There's a description of the parameters here: https://github.com/bcgsc/abyss |
Hi, all! My name is Shaun, and I'm leading project 2 (optimization). I am a bioinformatics PhD student, a developer of Linuxbrew and ABySS, an open-source programmer, an avid traveller, a singer and an experimental amateur chef. Please introduce yourself here, and I look forward to meeting you in October!