Merge pull request #296 from rmcolq/dev
Bump version to 0.9.2
leoisl committed Sep 27, 2022
2 parents 93c5410 + 66209a2 commit a496870
Showing 20 changed files with 668 additions and 675 deletions.
124 changes: 71 additions & 53 deletions CHANGELOG.md
@@ -2,93 +2,110 @@

All notable changes to this project will be documented in this file.

The format is based on
[Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this
project adheres to
[Semantic Versioning](https://semver.org/spec/v2.0.0.html).
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and
this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.9.2]

### Changed

- The VCF INFO field `SVTYPE` has now been changed to `VC` [[#249][249]]

### Fixed

- More robust TSV file parsing. Empty line no longer required at end [[#213][213]]
- Handle ambiguous bases properly instead of skipping to next read once we reach one [[#294][294]]

## [0.9.1]

### Added

- `pandora` is now installable through `conda`;
- A script to archive the `pandora` repository with git submodules;

### Changed
- Improved the sample example so now we can assert that the output produced is the expected one;
- Changes to the build process that enable `pandora` to be compiled in the `conda` environment;

- Improved the sample example so now we can assert that the output produced is the
expected one;
- Changes to the build process that enable `pandora` to be compiled in the `conda`
environment;

## [0.9.0]

### Changed

- Version bump from `0.9.0-rc2` to `0.9.0`.

## [0.9.0-rc2]

### Changed
- `pandora discover` now processes one sample at a time, but runs with several threads on the heavy tasks, i.e. when
mapping reads, finding candidate regions, and finding denovo variants. The result is that it now takes a lot less RAM to
run on multiple samples.

- `pandora discover` now processes one sample at a time, but runs with several threads
on the heavy tasks, i.e. when mapping reads, finding candidate regions, and finding
denovo variants. The result is that it now takes a lot less RAM to run on multiple
samples.

## [0.9.0-rc1]

### Changed
- `pandora discover` now receives read index files describing samples and reads, and discovers denovo sequences in these samples.
To improve performance on discovering denovo sequences on several samples, `pandora discover` is now multithreaded, but
the performance is still the same as the previous version, i.e. each sample is processed in a single-threaded way;
- `pandora discover` output changed to a proprietary format. See [example](example) for the new output;
- `pandora` can now communicate with a [`make_prg` prototype](https://github.com/leoisl/make_prg) that is able to update PRGs
without needing to realign and remake the PRG. This provides major performance upgrades to running the full `pandora` pipeline
with denovo discovery enabled, and there is no need anymore to use a `snakemake` pipeline
(see [this example](example/run_pandora.sh) for how to run the full pipeline);
- We now use [musl libc](https://musl.libc.org/) instead of [Holy Build Box](https://github.com/phusion/holy-build-box)
to build a precompiled portable binary, removing the dependency on `OpenMP 4.0+` or `GCC 4.9+`, and `GLIBC`;

- `pandora discover` now receives read index files describing samples and reads, and
discovers denovo sequences in these samples. To improve performance on discovering
denovo sequences on several samples, `pandora discover` is now multithreaded, but the
performance is still the same as the previous version, i.e. each sample is processed
in a single-threaded way;
- `pandora discover` output changed to a proprietary format. See [example](example) for
the new output;
- `pandora` can now communicate with a
[`make_prg` prototype](https://github.com/leoisl/make_prg) that is able to update PRGs
without needing to realign and remake the PRG. This provides major performance
upgrades to running the full `pandora` pipeline with denovo discovery enabled, and
there is no need anymore to use a `snakemake` pipeline (see
[this example](example/run_pandora.sh) for how to run the full pipeline);
- We now use [musl libc](https://musl.libc.org/) instead of
[Holy Build Box](https://github.com/phusion/holy-build-box) to build a precompiled
portable binary, removing the dependency on `OpenMP 4.0+` or `GCC 4.9+`, and `GLIBC`;

## [0.8.0]

### Added

- We now provide a script to build a portable precompiled binary as
another option to run `pandora` easily. The portable binary is now
provided with the release;
- `pandora` can now provide a meaningful stack trace in case of errors,
to facilitate debugging (need to pass flag `-DPRINT_STACKTRACE` to
`CMake`). Due to this, we now add debug symbols (`-g` flag) to every
`pandora` build type, but this
[does not impact performance](https://stackoverflow.com/a/39223245).
The precompiled binary has this enabled.
- We now provide a script to build a portable precompiled binary as another option to
run `pandora` easily. The portable binary is now provided with the release;
- `pandora` can now provide a meaningful stack trace in case of errors, to facilitate
debugging (need to pass flag `-DPRINT_STACKTRACE` to `CMake`). Due to this, we now add
debug symbols (`-g` flag) to every `pandora` build type, but this
[does not impact performance](https://stackoverflow.com/a/39223245). The precompiled
binary has this enabled.

### Changed

- We now use the [Hunter](https://github.com/cpp-pm/hunter) package
manager, removing the requirement of having `ZLIB` and `Boost`
system-wide installations;
- `GATB` is now a git submodule instead of an external project
downloaded and compiled during compilation time. This means that when
git cloning `pandora`, `cgranges` and `GATB` are also
downloaded/cloned, and when preparing the build (running `cmake`),
`Hunter` downloads and installs `Boost`, `GTest` and `ZLIB`. Thus we
still need internet connection to prepare the build (running `cmake`)
but not for compiling (running `make`).
- We now use the [Hunter](https://github.com/cpp-pm/hunter) package manager, removing
the requirement of having `ZLIB` and `Boost` system-wide installations;
- `GATB` is now a git submodule instead of an external project downloaded and compiled
during compilation time. This means that when git cloning `pandora`, `cgranges` and
`GATB` are also downloaded/cloned, and when preparing the build (running `cmake`),
`Hunter` downloads and installs `Boost`, `GTest` and `ZLIB`. Thus we still need
internet connection to prepare the build (running `cmake`) but not for compiling
(running `make`).
- We now use a GATB fork that accepts a `ZLIB` custom installation;
- Refactored all thirdparty libraries (`cgranges`, `GATB`, `backward`,
`CLI11`, `inthash`) into their own directory `thirdparty`.
- Refactored all thirdparty libraries (`cgranges`, `GATB`, `backward`, `CLI11`,
`inthash`) into their own directory `thirdparty`.

### Fixed

- Refactored asserts into exceptions, and now `pandora` can be compiled
correctly in the `Release` mode. The build process is thus able to
create a more optimized binary, resulting in improved performance.
- Refactored asserts into exceptions, and now `pandora` can be compiled correctly in the
`Release` mode. The build process is thus able to create a more optimized binary,
resulting in improved performance.
- Don't assume Nanopore reads are longer than loci [[#265][265]]



## [v0.7.0]

There is a significant amount of changes to the project between version
0.6 and this release. Only major things are listed here. Future releases
from this point will have their changes meticulously documented here.
There is a significant amount of changes to the project between version 0.6 and this
release. Only major things are listed here. Future releases from this point will have
their changes meticulously documented here.

### Added

@@ -98,8 +115,7 @@ from this point will have their changes meticulously documented here.
### Changed

- FASTA/Q files are now parsed with `klib` [[#223][223]]
- command-line interface is now overhauled with many breaking changes
[[#224][224]]
- command-line interface is now overhauled with many breaking changes [[#224][224]]
- global genotyping has been made default [[#220][220]]
- Various improvements to VCF-related functions

@@ -108,18 +124,20 @@ from this point will have their changes meticulously documented here.
- k-mer coverage underflow bug in `LocalPRG` [[#183][183]]

[Unreleased]: https://github.com/rmcolq/pandora/compare/0.9.1...HEAD
[0.9.2]: https://github.com/rmcolq/pandora/compare/0.9.1...0.9.2
[0.9.1]: https://github.com/rmcolq/pandora/releases/tag/0.9.1
[0.9.0]: https://github.com/rmcolq/pandora/releases/tag/0.9.0
[0.9.0-rc2]: https://github.com/rmcolq/pandora/releases/tag/0.9.0-rc2
[0.9.0-rc1]: https://github.com/rmcolq/pandora/releases/tag/0.9.0-rc1
[0.8.0]: https://github.com/rmcolq/pandora/releases/tag/0.8.0
[v0.7.0]: https://github.com/rmcolq/pandora/releases/tag/v0.7.0

[183]: https://github.com/rmcolq/pandora/issues/183
[213]: https://github.com/rmcolq/pandora/issues/213
[220]: https://github.com/rmcolq/pandora/pull/220
[223]: https://github.com/rmcolq/pandora/pull/223
[224]: https://github.com/rmcolq/pandora/pull/224
[234]: https://github.com/rmcolq/pandora/pull/234
[249]: https://github.com/rmcolq/pandora/issues/249
[265]: https://github.com/rmcolq/pandora/pull/265

[294]: https://github.com/rmcolq/pandora/issues/294
[v0.7.0]: https://github.com/rmcolq/pandora/releases/tag/v0.7.0

2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -12,7 +12,7 @@ HunterGate(

# project configuration
set(PROJECT_NAME_STR pandora)
project(${PROJECT_NAME_STR} VERSION "0.9.1" LANGUAGES C CXX)
project(${PROJECT_NAME_STR} VERSION "0.9.2" LANGUAGES C CXX)
set(ADDITIONAL_VERSION_LABELS "")
configure_file( include/version.h.in ${CMAKE_BINARY_DIR}/include/version.h )

6 changes: 3 additions & 3 deletions README.md
@@ -81,13 +81,13 @@ In this binary, all libraries are linked statically.

* **Download**:
```
wget https://github.com/rmcolq/pandora/releases/download/0.9.1/pandora-linux-precompiled-v0.9.1
wget https://github.com/rmcolq/pandora/releases/download/0.9.2/pandora-linux-precompiled-v0.9.2
```

* **Running**:
```
chmod +x pandora-linux-precompiled-v0.9.1
./pandora-linux-precompiled-v0.9.1 -h
chmod +x pandora-linux-precompiled-v0.9.2
./pandora-linux-precompiled-v0.9.2 -h
```

* **Notes**:
12 changes: 6 additions & 6 deletions example/README.md
@@ -63,9 +63,9 @@ Taking a quick look at an excerpt of `out/output_toy_example_no_denovo/pandora_m

```
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT toy_sample_1 toy_sample_2
GC00006032 146 . T C . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,41:0,52:0,41:0,52:0,83:0,105:1,0:-526.281,-18.7786:507.502 0:15,0:15,0:15,0:15,0:31,0:31,0:0,1:-3.53065,-214.155:210.624
GC00006032 160 . A C . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,26:0,40:0,33:0,50:0,106:0,160:1,0.25:-401.941,-17.9221:384.019 0:19,0:12,0:19,0:12,0:38,0:24,0:0,1:-3.32705,-218.76:215.433
GC00006032 218 . T C . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:3,11:4,14:0,11:0,14:12,23:16,28:0.75,0:-182.162,-41.9443:140.217 0:11,0:5,0:13,0:6,0:44,0:21,0:0.25,1:-19.9705,-149.683:129.712
GC00006032 146 . T C . . VC=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,41:0,52:0,41:0,52:0,83:0,105:1,0:-526.281,-18.7786:507.502 0:15,0:15,0:15,0:15,0:31,0:31,0:0,1:-3.53065,-214.155:210.624
GC00006032 160 . A C . . VC=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,26:0,40:0,33:0,50:0,106:0,160:1,0.25:-401.941,-17.9221:384.019 0:19,0:12,0:19,0:12,0:38,0:24,0:0,1:-3.32705,-218.76:215.433
GC00006032 218 . T C . . VC=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:3,11:4,14:0,11:0,14:12,23:16,28:0.75,0:-182.162,-41.9443:140.217 0:11,0:5,0:13,0:6,0:44,0:21,0:0.25,1:-19.9705,-149.683:129.712
```

We can see samples `toy_sample_1` and `toy_sample_2` genotype towards different alleles.
@@ -76,9 +76,9 @@ The VCF (`out/output_toy_example_with_denovo/pandora_multisample_genotyped.vcf`)

```
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT toy_sample_1.100x.random.illumina toy_sample_2.100x.random.illumina
GC00006032 49 . A G . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:44,0:59,0:44,0:59,0:44,0:59,0:0,1:-26.8805,-570.333:543.452 1:0,48:0,50:0,48:0,50:0,97:0,100:1,0:-537.307,-28.9415:508.365
GC00010897 44 . C T . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,11:0,16:0,11:0,16:0,23:0,32:1,0:-220.34,-8.03511:212.304 0:22,0:18,0:22,0:18,0:44,0:37,0:0,1:-2.87264,-270.207:267.334
GC00010897 422 . A T . . SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,8:0,5:0,8:0,5:0,16:0,11:1,0:-155.867,-20.2266:135.641 0:12,0:9,0:12,0:9,0:12,0:9,0:0,1:-9.39494,-182.709:173.314
GC00006032 49 . A G . . VC=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:44,0:59,0:44,0:59,0:44,0:59,0:0,1:-26.8805,-570.333:543.452 1:0,48:0,50:0,48:0,50:0,97:0,100:1,0:-537.307,-28.9415:508.365
GC00010897 44 . C T . . VC=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,11:0,16:0,11:0,16:0,23:0,32:1,0:-220.34,-8.03511:212.304 0:22,0:18,0:22,0:18,0:44,0:37,0:0,1:-2.87264,-270.207:267.334
GC00010897 422 . A T . . VC=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,8:0,5:0,8:0,5:0,16:0,11:1,0:-155.867,-20.2266:135.641 0:12,0:9,0:12,0:9,0:12,0:9,0:0,1:-9.39494,-182.709:173.314
```
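
The excerpts above show the INFO key changing from `SVTYPE` to `VC`. Downstream tools that read `pandora` VCFs from both before and after 0.9.2 may want to accept either key. The following is a minimal sketch of that idea, not code from `pandora`: the helper names `info_value` and `variant_class` are hypothetical, and it assumes a plain semicolon-separated INFO string with `key=value` entries (C++17 for `std::optional`).

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper (not part of pandora): returns the value stored under
// `key` in a semicolon-separated VCF INFO string, or std::nullopt if absent.
std::optional<std::string> info_value(const std::string& info, const std::string& key)
{
    std::istringstream fields(info);
    std::string field;
    while (std::getline(fields, field, ';')) {
        const auto eq = field.find('=');
        // Match only when the text before '=' is exactly `key`.
        if (eq != std::string::npos && field.compare(0, eq, key) == 0) {
            return field.substr(eq + 1);
        }
    }
    return std::nullopt;
}

// Hypothetical helper: prefer the new `VC` key and fall back to the
// pre-0.9.2 `SVTYPE` key when `VC` is not present.
std::optional<std::string> variant_class(const std::string& info)
{
    if (auto vc = info_value(info, "VC")) {
        return vc;
    }
    return info_value(info, "SVTYPE");
}
```

With these helpers, `variant_class("VC=SNP;GRAPHTYPE=SIMPLE")` and `variant_class("SVTYPE=SNP;GRAPHTYPE=SIMPLE")` both yield `SNP`, while an INFO string with neither key yields an empty optional.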

## Extra
17 changes: 14 additions & 3 deletions include/seq.h
@@ -11,17 +11,21 @@ class Seq {
public:
uint32_t id;
std::string name;
std::string seq;
std::set<Minimizer> sketch;

// the original sequence is split into several valid subsequences (composed of ACGT only)
std::vector<std::string> subseqs; // these are the subsequences themselves
std::vector<size_t> offsets; // these are the subsequences offsets on the original string


Seq(uint32_t, const std::string&, const std::string&, uint32_t, uint32_t);

~Seq();

void initialize(
uint32_t, const std::string&, const std::string&, uint32_t, uint32_t);

bool add_letter_to_get_next_kmer(const char&, const uint64_t&, const uint64_t&,
void add_letter_to_get_next_kmer(const char&, const uint64_t&, const uint64_t&,
uint32_t&, uint64_t (&)[2], uint64_t (&)[2]);

void add_minimizing_kmers_to_sketch(const std::vector<Minimizer>&, const uint64_t&);
@@ -30,9 +34,16 @@ class Seq {

void add_new_smallest_minimizer(std::vector<Minimizer>&, uint64_t&);

void minimizer_sketch(const uint32_t w, const uint32_t k);
uint64_t length() const;

friend std::ostream& operator<<(std::ostream& out, const Seq& data);

void minimizer_sketch(const uint32_t w, const uint32_t k);

private:
void minimizer_sketch(const std::string &s, const size_t seq_offset,
const uint32_t w, const uint32_t k);

};

#endif
4 changes: 4 additions & 0 deletions include/utils.h
@@ -8,12 +8,14 @@
#include <cstdint>
#include <string>
#include <limits>
#include <utility>
#include <boost/filesystem/path.hpp>
#include "minihits.h"
#include "pangenome/ns.cpp"
#include <boost/log/trivial.hpp>
#include <sstream>
#include "fatal_error.h"
#include "inthash.h"

namespace fs = boost::filesystem;

@@ -130,4 +132,6 @@ std::vector<std::pair<SampleIdText, SampleFpath>> load_read_index(

std::string remove_spaces_from_string(const std::string& str);

std::pair<std::vector<std::string>, std::vector<size_t>> split_ambiguous(const std::string& input_string, uint8_t delim = 4);

#endif
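
The `split_ambiguous` declaration above returns subsequences together with their offsets, matching the new `subseqs` and `offsets` members of `Seq`. Its implementation is not shown in this diff, so the following is only a sketch of the general idea under stated assumptions: the function name `split_on_non_acgt` is hypothetical, any character other than `A`, `C`, `G`, `T` (in either case) is treated as ambiguous, and offsets are 0-based positions in the original string. The real function takes a `delim` argument whose semantics are not reproduced here.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch only: the name and the character test are assumptions,
// not pandora's implementation of split_ambiguous.
// Returns the maximal ACGT-only runs of `input` and the 0-based offset of
// each run in `input`.
std::pair<std::vector<std::string>, std::vector<std::size_t>>
split_on_non_acgt(const std::string& input)
{
    const auto is_acgt = [](char c) {
        switch (c) {
        case 'A': case 'C': case 'G': case 'T':
        case 'a': case 'c': case 'g': case 't':
            return true;
        default:
            return false;
        }
    };

    std::vector<std::string> subseqs;
    std::vector<std::size_t> offsets;
    std::size_t start = 0;
    bool in_run = false;

    // Iterate one position past the end so a trailing run is also flushed.
    for (std::size_t i = 0; i <= input.size(); ++i) {
        const bool valid = i < input.size() && is_acgt(input[i]);
        if (valid && !in_run) {
            start = i; // a new valid run begins here
            in_run = true;
        } else if (!valid && in_run) {
            subseqs.push_back(input.substr(start, i - start));
            offsets.push_back(start);
            in_run = false;
        }
    }
    return { subseqs, offsets };
}
```

Under these assumptions, a position `p` inside `subseqs[i]` corresponds to position `offsets[i] + p` in the original read, which is what lets per-subsequence results be reported in original coordinates instead of abandoning the read at the first ambiguous base.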
21 changes: 13 additions & 8 deletions include/vcf.h
@@ -53,6 +53,11 @@ class VCF {
virtual bool operator!=(const VCF& y) const;
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// static constants for common VCF fields
static const std::string VARIANT_CLASS_ID;
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// adders
virtual void add_record(const std::string& chrom, uint32_t position,
@@ -106,11 +111,11 @@ class VCF {
// to_string methods
virtual std::string header() const;
virtual std::string to_string(bool genotyping_from_maximum_likelihood,
bool genotyping_from_coverage, bool output_dot_allele = false,
bool graph_is_simple = true, bool graph_is_nested = true,
bool graph_has_too_many_alts = true, bool sv_type_is_snp = true,
bool sv_type_is_indel = true, bool sv_type_is_ph_snps = true,
bool sv_type_is_complex = true);
bool genotyping_from_coverage, bool output_dot_allele = false,
bool graph_is_simple = true, bool graph_is_nested = true,
bool graph_has_too_many_alts = true, bool variant_class_is_snp = true,
bool variant_class_is_indel = true, bool variant_class_is_ph_snps = true,
bool variant_class_is_complex = true);
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
@@ -128,9 +133,9 @@ class VCF {
virtual void save(const fs::path& filepath, bool genotyping_from_maximum_likelihood,
bool genotyping_from_coverage, bool output_dot_allele = false,
bool graph_is_simple = true, bool graph_is_nested = true,
bool graph_has_too_many_alts = true, bool sv_type_is_snp = true,
bool sv_type_is_indel = true, bool sv_type_is_ph_snps = true,
bool sv_type_is_complex = true);
bool graph_has_too_many_alts = true, bool variant_class_is_snp = true,
bool variant_class_is_indel = true, bool variant_class_is_ph_snps = true,
bool variant_class_is_complex = true);

// concatenate several VCF files that were previously written to disk as .vcfs into
// a single VCF file
