New simulated dataset, new "kevlar localize" command #83

Merged: 15 commits, Jun 13, 2017


standage
Collaborator

This PR addresses two issues that have captured my focus recently.

  • First, we're in dire need of a data set for testing where 1) we know the "correct" answers (variant calls) and 2) we can run and re-run the kevlar workflow quickly. I've simulated various data sets for testing and continuous integration purposes in the past, but these are on the side of "too trivial". I've also simulated some larger data sets recently, but these are on the side of "take too long to run".

  • Second, up until recently I've been doing all variant calling by examining alignments manually. It's past time to automate this! But this underscored the need for the first point, so there's been a bit of yak shaving going on.

I'm happy to present notebook/human-sim-pico and kevlar localize.

  • The directory notebook/human-sim-pico has the complete record for how I produced a 2.5 Mb random human-like genome from scratch. This includes a Jupyter notebook, some data files, several commands, and a bit of commentary.

  • The command kevlar localize takes a Fasta file generated by kevlar assemble, invokes BWA to localize the contigs' k-mers in the reference genome, and (assuming they map to a single region) extracts the genomic interval associated with the assembled contig(s), plus a bit of flanking sequence on each side. The assembled contig(s) will then be aligned to this genomic region with a dynamic programming solution to be implemented soon by Fereydoun.
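
For readers who want a feel for the interval step, here is a minimal sketch of the idea and not the code in this PR. It assumes the k-mer mapping positions have already been obtained (for example, parsed from BWA output) and that "a single region" can be approximated as "a single reference sequence". The names `Interval`, `localize_interval`, `ksize`, `delta`, and `seqlen` are hypothetical and chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical container for the extracted region (0-based, half-open).
Interval = namedtuple("Interval", ["seqid", "start", "end"])


def localize_interval(hits, ksize, delta=25, seqlen=None):
    """Collapse k-mer hits into one padded interval, or None if ambiguous.

    hits   -- iterable of (seqid, position) pairs; position is the 0-based
              start of a k-mer match on the reference
    ksize  -- k-mer length, so a hit at `pos` covers [pos, pos + ksize)
    delta  -- padding (the "plus a bit") added to each side of the span
    seqlen -- optional length of the reference sequence, used for clamping
    """
    hits = list(hits)
    if not hits:
        return None
    seqids = {seqid for seqid, _ in hits}
    if len(seqids) != 1:
        # Assumption: hits on more than one sequence count as ambiguous.
        return None
    positions = [pos for _, pos in hits]
    start = max(min(positions) - delta, 0)
    end = max(positions) + ksize + delta
    if seqlen is not None:
        end = min(end, seqlen)
    return Interval(seqids.pop(), start, end)


if __name__ == "__main__":
    example = [("chr1", 100), ("chr1", 140), ("chr1", 120)]
    print(localize_interval(example, ksize=31))
```

A real implementation would likely also need to reject hits that are far apart on the same sequence; this sketch deliberately leaves that out.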

This PR still needs a bit of cleanup (mostly documentation and tests) before it is merged.


codecov-io commented Jun 12, 2017

Codecov Report

Merging #83 into master will increase coverage by 0.33%.
The diff coverage is 87.27%.


@@            Coverage Diff             @@
##           master      #83      +/-   ##
==========================================
+ Coverage   82.72%   83.05%   +0.33%     
==========================================
  Files          27       29       +2     
  Lines        1430     1523      +93     
  Branches      220      239      +19     
==========================================
+ Hits         1183     1265      +82     
- Misses        205      211       +6     
- Partials       42       47       +5
Impacted Files           Coverage Δ
kevlar/overlap.py        70.5% <0%> (ø) ⬆️
kevlar/novel.py          75.96% <100%> (ø) ⬆️
kevlar/cli/localize.py   100% <100%> (ø)
kevlar/reaugment.py      91.66% <100%> (ø) ⬆️
kevlar/seqio.py          93.06% <100%> (+0.08%) ⬆️
kevlar/__init__.py       90.27% <100%> (+0.13%) ⬆️
kevlar/cli/__init__.py   100% <100%> (ø) ⬆️
kevlar/assemble.py       84.82% <50%> (ø) ⬆️
kevlar/filter.py         63.11% <50%> (ø) ⬆️
kevlar/localize.py       86.41% <86.41%> (ø)
... and 2 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 586c55e...df991ed. Read the comment docs.

standage merged commit 7e52431 into master Jun 13, 2017
standage deleted the sim/pico branch June 13, 2017 16:50
standage mentioned this pull request Jun 13, 2017