leoisl edited this page Jul 9, 2019 · 10 revisions

Performance

Multi-threading

  • Scheduling
  • Mutexes/locks per site in grouped_allele_counts; or map-reduce (each thread gets its own coverage structure, and the per-thread structures are reduced afterwards)
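
The map-reduce option could be sketched roughly as follows. This is only an illustration under simplifying assumptions: `Coverage`, `Hit` and `count_coverage` are hypothetical names, and coverage is reduced to a plain per-site, per-allele counter rather than the real grouped_allele_counts structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a coverage structure: counts[site][allele].
using Coverage = std::vector<std::vector<std::uint32_t>>;

// Hypothetical stand-in for a mapped read: the site and allele it supports.
struct Hit {
    std::size_t site;
    std::size_t allele;
};

// Map step: each thread counts its share of hits into a private Coverage,
// so no locking is needed. Reduce step: the private copies are summed.
Coverage count_coverage(const std::vector<Hit>& hits,
                        std::size_t n_sites,
                        std::size_t n_alleles,
                        std::size_t n_threads) {
    assert(n_threads > 0);
    std::vector<Coverage> partial(
        n_threads,
        Coverage(n_sites, std::vector<std::uint32_t>(n_alleles, 0)));

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < n_threads; ++t) {
        workers.emplace_back([&hits, &partial, t, n_threads] {
            // Strided partition: thread t handles hits t, t + n_threads, ...
            for (std::size_t i = t; i < hits.size(); i += n_threads) {
                ++partial[t][hits[i].site][hits[i].allele];
            }
        });
    }
    for (auto& w : workers) w.join();

    // Reduce: element-wise sum of all per-thread structures.
    Coverage total(n_sites, std::vector<std::uint32_t>(n_alleles, 0));
    for (const auto& p : partial)
        for (std::size_t s = 0; s < n_sites; ++s)
            for (std::size_t a = 0; a < n_alleles; ++a)
                total[s][a] += p[s][a];
    return total;
}
```

The trade-off against per-site locks is memory: every thread holds a full copy of the coverage structure, in exchange for no contention during mapping.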

Read filtering

  • Bloom filter on kmers in (variant sites of) the PRG; a read is then mapped only if all of its constituent kmers are in the filter
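
A minimal sketch of this filter, assuming string kmers and `std::hash` with double hashing. `KmerBloomFilter` and `read_passes_filter` are hypothetical names, and a real implementation would likely use 2-bit encoded kmers and a faster hash.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical Bloom filter over kmers, for illustration only.
class KmerBloomFilter {
public:
    KmerBloomFilter(std::size_t n_bits, std::size_t n_hashes)
        : bits_(n_bits, false), n_hashes_(n_hashes) {
        assert(n_bits > 0 && n_hashes > 0);
    }

    void add(const std::string& kmer) {
        for (std::size_t i = 0; i < n_hashes_; ++i) bits_[index(kmer, i)] = true;
    }

    // May return true for a kmer that was never added (false positive),
    // but never returns false for a kmer that was added.
    bool maybe_contains(const std::string& kmer) const {
        for (std::size_t i = 0; i < n_hashes_; ++i)
            if (!bits_[index(kmer, i)]) return false;
        return true;
    }

private:
    // Double hashing: h1 + i * h2, where h2 hashes a salted copy of the kmer.
    std::size_t index(const std::string& kmer, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>{}(kmer);
        std::size_t h2 = std::hash<std::string>{}(kmer + '#') | 1;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t n_hashes_;
};

// A read passes only if every one of its kmers may be in the filter.
bool read_passes_filter(const KmerBloomFilter& filter,
                        const std::string& read,
                        std::size_t k) {
    if (k == 0 || read.size() < k) return false;
    for (std::size_t i = 0; i + k <= read.size(); ++i)
        if (!filter.maybe_contains(read.substr(i, k))) return false;
    return true;
}
```

Because a Bloom filter has no false negatives, a read whose kmers are all in the PRG is never rejected; false positives only mean that some unmappable reads still reach the mapper.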

Indexing

  • Rework indexing so that only kmers overlapping variant sites are indexed (with context up to the maximum read size)
  • [Infer kmer size for the index] Analyse the PRG string, find its densest region, and infer a kmer size for which the number of paths can still be enumerated

Mapping

  • Singleton intervals: read the BWT character directly rather than issuing two rank queries
  • SearchState: maybe don't record the allele/site combinations traversed; instead, do a round of forward mapping afterwards for successfully mapped reads.
  • Concurrent querying of alternate alleles in the same site; they are all followed by the same even site marker. Later (at the end of the read, or on encountering a site entry boundary), we can record the site/allele combination IDs.
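
The singleton-interval idea from the first item above can be illustrated on a toy FM-index. This sketch assumes a naive, linear-time rank and half-open suffix array intervals; `ToyFmIndex`, `extend_general` and `extend_singleton` are hypothetical names, not the project's API.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy FM-index with a linear-time rank; for illustration only.
struct ToyFmIndex {
    std::string bwt;
    std::array<std::size_t, 256> c_table{};  // c_table[c] = number of chars < c

    // The text must end with a unique sentinel smaller than all other chars.
    explicit ToyFmIndex(const std::string& t) {
        std::vector<std::size_t> sa(t.size());
        for (std::size_t i = 0; i < sa.size(); ++i) sa[i] = i;
        std::sort(sa.begin(), sa.end(), [&t](std::size_t a, std::size_t b) {
            return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
        });
        for (std::size_t i : sa) bwt.push_back(t[(i + t.size() - 1) % t.size()]);
        std::array<std::size_t, 256> counts{};
        for (unsigned char ch : t) ++counts[ch];
        std::size_t sum = 0;
        for (std::size_t ch = 0; ch < 256; ++ch) {
            c_table[ch] = sum;
            sum += counts[ch];
        }
    }

    // Number of occurrences of c in bwt[0, i).
    std::size_t rank(char c, std::size_t i) const {
        return static_cast<std::size_t>(
            std::count(bwt.begin(), bwt.begin() + i, c));
    }
};

using Interval = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

// General backward extension: two rank queries, one per interval bound.
Interval extend_general(const ToyFmIndex& fm, Interval iv, char c) {
    std::size_t base = fm.c_table[static_cast<unsigned char>(c)];
    return {base + fm.rank(c, iv.first), base + fm.rank(c, iv.second)};
}

// Singleton shortcut: the only character that can extend [i, i + 1) is bwt[i],
// so one BWT access decides emptiness and at most one rank query is needed.
Interval extend_singleton(const ToyFmIndex& fm, Interval iv, char c) {
    assert(iv.second == iv.first + 1);
    if (fm.bwt[iv.first] != c) return {0, 0};  // empty interval
    std::size_t lo = fm.c_table[static_cast<unsigned char>(c)] + fm.rank(c, iv.first);
    return {lo, lo + 1};
}
```

For a singleton interval the mapper also does not need to try all four bases: it can read bwt[i] once and follow only that character.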

Data structures

  • Assign each (seen) Marker/Allele ID combination a unique ID, and store that in each SearchState
  • Use std::vector, not std::list, for variant site paths, to avoid 64-bit pointer overhead. Maybe convert from list to vector after a kmer is indexed, and convert from vector to list for quasimapping. Further, the vector can be stored compressed, so that only the SearchStates relevant to a given read need to be decompressed at quasimap time.
  • All SearchState::variant_site_paths can be stored in a single vector of VariantSitePath, with each SearchState storing only a uint32_t ID referring to its variant_site_path. This is useful only if many SearchStates have exactly the same variant_site_path.
  • Store only one uint for singleton intervals!
  • Pack SearchState::variant_site_state and SearchState::invalid into 1 byte using bit masks. Add a getter for the enum so that the packing is transparent to callers.
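
The last item above, packing the state enum and the invalid flag into one byte, might look like the sketch below. The enumerator names and the `PackedFlags` class are assumptions made for illustration; in the real code the fields would live inside SearchState.

```cpp
#include <cassert>
#include <cstdint>

// Assumed enumerators; the real names and values may differ.
enum class VariantSiteState : std::uint8_t {
    unknown = 0,
    within_variant_site = 1,
    outside_variant_site = 2
};

// Hypothetical 1-byte holder: bits 0-1 store the state, bit 2 the invalid flag.
class PackedFlags {
public:
    // Getter hides the masking, so callers still see a plain enum.
    VariantSiteState variant_site_state() const {
        return static_cast<VariantSiteState>(bits_ & kStateMask);
    }

    void set_variant_site_state(VariantSiteState s) {
        assert(static_cast<std::uint8_t>(s) <= kStateMask);
        bits_ = static_cast<std::uint8_t>((bits_ & ~kStateMask) |
                                          static_cast<std::uint8_t>(s));
    }

    bool invalid() const { return (bits_ & kInvalidMask) != 0; }

    void set_invalid(bool v) {
        bits_ = static_cast<std::uint8_t>(v ? (bits_ | kInvalidMask)
                                            : (bits_ & ~kInvalidMask));
    }

private:
    static constexpr std::uint8_t kStateMask = 0x03;
    static constexpr std::uint8_t kInvalidMask = 0x04;
    std::uint8_t bits_ = 0;
};

static_assert(sizeof(PackedFlags) == 1, "flags should occupy a single byte");
```

Whether this saves memory in practice depends on the padding of the surrounding struct, so it is worth checking sizeof(SearchState) before and after.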

Grouped alleles representation

1. Represent as a set-trie (see https://hal.inria.fr/hal-01506780/document)
	Implementations are already available, but not in C++:
		Java: https://github.com/SmartDataAnalytics/TagMap
		Rust: https://github.com/makoConstruct/set_trie
		C++: us?
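
As a starting point for a C++ version, here is a minimal set-trie sketch. It assumes that each group of alleles is a set of integer allele IDs; the `SetTrie` class and its method names are hypothetical and are not taken from the implementations linked above.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <memory>
#include <set>

// Hypothetical set-trie: each stored set is a root-to-node path over its
// elements in ascending order, so sets with a common prefix share nodes.
class SetTrie {
public:
    using IdSet = std::set<std::uint32_t>;

    void insert(const IdSet& s) {
        Node* node = &root_;
        for (std::uint32_t id : s) {  // std::set iterates in ascending order
            auto& child = node->children[id];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        node->is_end = true;
    }

    // Exact membership: follow the path and check the end marker.
    bool contains(const IdSet& s) const {
        const Node* node = &root_;
        for (std::uint32_t id : s) {
            auto it = node->children.find(id);
            if (it == node->children.end()) return false;
            node = it->second.get();
        }
        return node->is_end;
    }

    // True if some stored set is a subset of `query`.
    bool exists_subset(const IdSet& query) const {
        return exists_subset_from(root_, query.begin(), query.end());
    }

private:
    struct Node {
        std::map<std::uint32_t, std::unique_ptr<Node>> children;
        bool is_end = false;
    };

    // Try each remaining query element as the next element of a stored set.
    static bool exists_subset_from(const Node& node,
                                   IdSet::const_iterator first,
                                   IdSet::const_iterator last) {
        if (node.is_end) return true;
        for (auto it = first; it != last; ++it) {
            auto child = node.children.find(*it);
            if (child != node.children.end() &&
                exists_subset_from(*child->second, std::next(it), last))
                return true;
        }
        return false;
    }

    Node root_;
};
```

To hold coverage, each end node would additionally carry a count, which is omitted here to keep the sketch short.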