Issues with partitioned DB and merging #42

Open
jaimeortiz-david opened this issue Jun 4, 2024 · 0 comments

Dear Andre,

Why does MetaCache classify almost all reads at the species level when querying a large database (1000 genomes partitioned into 28 DBs)?

Is there any inherent bias introduced by partitioning the database and merging the results afterwards?

Can you also explain the difference between -hitmin and -hitdiff? (On the GitHub page, the description is the same for both.)

Additionally, is there a way to set a threshold for species-level classification based on the number of reads (is this what you refer to as a feature)? For example, a species would be considered "present" only if at least five reads are assigned to it.
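To illustrate the kind of threshold we mean, here is a minimal post-processing sketch in Python. It does not use any MetaCache option; the `(read_id, species)` pairs, the function name `species_present`, and the `min_reads` parameter are all hypothetical and only stand in for whatever per-read classification table one would parse.

```python
from collections import Counter

def species_present(assignments, min_reads=5):
    """Return {species: read_count} for species with at least `min_reads` reads.

    `assignments` is an iterable of (read_id, species) pairs; `species` is
    None for reads that were not classified at the species level.
    This input format is an assumption, not MetaCache's actual output format.
    """
    counts = Counter(species for _, species in assignments if species is not None)
    return {species: n for species, n in counts.items() if n >= min_reads}

# Hypothetical example data: 6 reads for "Species A", 2 for "Species B",
# and 1 read without a species-level assignment.
assignments = (
    [(f"readA{i}", "Species A") for i in range(6)]
    + [(f"readB{i}", "Species B") for i in range(2)]
    + [("readX", None)]
)

present = species_present(assignments, min_reads=5)
print(present)
```

With this example data only "Species A" passes the five-read threshold.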

We have been thinking about the number of possible k-mers with k = 16: 4^16 ≈ 4.3 billion. Spread across a 1000-genome database, that would mean ~4.3 million available/possible k-mers per species if they were all in the same DB (correct?).

For a 1 Gb genome, there are about 7.9 million non-overlapping 127 bp windows; if 16 k-mers are selected from each, that gives about 125 million k-mers from a single genome, without any threshold. This is about 29 times the 4.3 million available/possible k-mers. I'm not sure of these numbers. Could you let me know if they sound right?
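To make the arithmetic above explicit, here is a small Python sketch. The assumptions are ours and may not match how MetaCache actually works: windows are taken as non-overlapping, exactly 16 k-mers are kept per window, and the k-mer space is split evenly among the 1000 genomes.

```python
# Back-of-the-envelope check of the k-mer numbers in this question.
# All constants are assumptions taken from the text above.
K = 16                       # k-mer length
GENOMES = 1000               # genomes in the (hypothetically unpartitioned) DB
GENOME_SIZE = 1_000_000_000  # 1 Gb genome
WINDOW = 127                 # window length in bp, assumed non-overlapping
KMERS_PER_WINDOW = 16        # k-mers selected from each window

possible_kmers = 4 ** K                       # size of the 16-mer space
per_genome_share = possible_kmers / GENOMES   # even split across genomes
windows = GENOME_SIZE // WINDOW               # full windows in one genome
selected = windows * KMERS_PER_WINDOW         # k-mers selected from one genome
ratio = selected / per_genome_share

print(f"possible {K}-mers:        {possible_kmers:,}")
print(f"share per genome:        {per_genome_share:,.0f}")
print(f"windows per genome:      {windows:,}")
print(f"selected k-mers/genome:  {selected:,}")
print(f"selected / share:        {ratio:.1f}")
```

Under these assumptions the k-mer space is about 4.3 billion in total (4.3 million per genome), and one 1 Gb genome alone yields roughly 126 million selected k-mers, about 29 times its share.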

We were trying to extract info from the databases using the info command and found that featureCounts produces:

63168104 -> 69
805452997 -> 2
303063736 -> 10

But we are not able to relate these numbers to species. Is this a feature number? Is a feature different from a k-mer?
Are targets synonymous with sequences?
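For reference, this is roughly how we are reading those lines on our side. It is only a sketch: since the meaning of the two columns is exactly what we are asking about, the code treats them as an opaque integer key and an integer count, and the function name `parse_counts` is our own.

```python
def parse_counts(text):
    """Parse lines of the form '<key> -> <count>' into a dict of ints.

    The interpretation of key and count is left open on purpose;
    lines without '->' are skipped.
    """
    counts = {}
    for line in text.splitlines():
        if "->" not in line:
            continue
        key, value = line.split("->", 1)
        counts[int(key.strip())] = int(value.strip())
    return counts

# The three lines quoted above
sample = """63168104 -> 69
805452997 -> 2
303063736 -> 10"""

counts = parse_counts(sample)
total = sum(counts.values())
print(len(counts), "entries, counts sum to", total)
```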

We want stats on how many k-mers and unique k-mers represent each species within the database. Is it possible to retrieve this using the current commands?
