Estimated genome size using mash sketch #114

Closed
MostafaYA opened this issue May 2, 2019 · 5 comments

Comments

@MostafaYA

Hello,

I understand that mash sketch can give a rough estimate of genome size based on the unique k-mers in the sample.

I am using the following command to roughly estimate the genome size of bacteria:
mash sketch -o tempFile -k 32 -m 3 -r read1.fastq
In most cases, the estimates are close to the real genome size (the size reported in the literature).
In some cases, however, there is a large discrepancy between the estimated and the real genome size.
Does that mean

  • the original sample may not have been fully sequenced, if the estimated genome size is far lower than expected?
  • the original sample may be contaminated, if the estimated genome size is far higher than expected?

I would appreciate it if you could point out any misunderstanding on my side.
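
The check described above can be sketched as a few shell functions. This is a minimal sketch, assuming that `mash sketch -r` reports a line of the form `Estimated genome size: <number>` on stderr; the function names, the expected size and the 20% tolerance are hypothetical and not part of Mash.

```shell
# Extract the genome size estimate from the text mash sketch writes to stderr.
# Assumes a line like "Estimated genome size: 4.9e+06" (an assumption about the output format).
parse_genome_size() {
  awk -F': *' '/^Estimated genome size/ { printf "%d\n", $2; exit }'
}

# Classify an estimate against an expected size (both in bp).
# The tolerance (default 20%) is a hypothetical threshold, not a Mash recommendation.
classify_size() {
  est=$1; expected=$2; tol=${3:-20}
  awk -v e="$est" -v x="$expected" -v t="$tol" 'BEGIN {
    lo = x * (1 - t / 100); hi = x * (1 + t / 100)
    if (e < lo)      print "low"
    else if (e > hi) print "high"
    else             print "ok"
  }'
}

# Run the estimate on one read file with the settings from the post.
# stderr goes to the pipe, stdout is discarded.
estimate_genome_size() {
  mash sketch -o "${2:-tempFile}" -k 32 -m 3 -r "$1" 2>&1 >/dev/null | parse_genome_size
}
```

For example, `classify_size "$(estimate_genome_size read1.fastq)" 5000000` would print `low`, `ok` or `high` for a genome expected to be about 5 Mbp. A `low` result would fit the incomplete-sequencing explanation and a `high` result the contamination one.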

@ondovb
Member

ondovb commented May 3, 2019

Those both sound like reasonable explanations. A good check might be to run mash screen on your reads against RefSeq (see https://mash.readthedocs.io/en/latest/tutorials.html#screening-a-read-set-for-containment-of-refseq-genomes). This will tell you, roughly, how well your genome is covered (or at least its nearest RefSeq neighbor) and if there are significant contaminants.
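
A rough sketch of that screening step, assuming a RefSeq sketch file has already been downloaded as described in the linked tutorial (its file name is passed in and is a placeholder here). The column layout in the comment (identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment) and the `-w` winner-take-all flag reflect my understanding of `mash screen`; the function names and the 0.9 identity cut-off are hypothetical.

```shell
# Screen a read set against a RefSeq sketch and save the table.
# $1 = sketch file (.msh), $2 = reads, $3 = output table (default screen.tab)
screen_reads() {
  mash screen -w "$1" "$2" > "${3:-screen.tab}"
}

# List hits whose identity (column 1) is at least a threshold, best first.
# Assumed columns: identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment.
# The default threshold of 0.9 is an illustrative choice only.
top_hits() {
  awk -F'\t' -v min="${2:-0.9}" '$1 >= min' "$1" | sort -k1,1nr
}
```

More than one high-identity hit from clearly different organisms would point towards contamination, while a low identity even for the best hit would suggest that the genome is poorly covered.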

@tseemann
Contributor

@MostafaYA are these Illumina reads?

@MostafaYA
Author

@tseemann Yes, they are MiSeq Illumina reads

@tseemann
Contributor

To estimate genome size in Shovill I do the same as you have done: -k 32 -r -m 3. If you have very high coverage you could try increasing to -m 10, or if you have a high error rate try reducing to -k 24. Also, only use R1 as it is higher quality, especially on MiSeq.
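
One way to compare those variants side by side is a small loop over the settings. This is only a sketch: the function name and the output prefix are hypothetical, and it assumes `mash sketch -r` prints an `Estimated genome size:` line on stderr.

```shell
# Run mash sketch on R1 only with each parameter variant mentioned above
# and print the reported estimate. Each settings line is "<k> <m>".
sweep_genome_size() {
  r1=$1
  printf '%s\n' '32 3' '32 10' '24 3' | while read -r k m; do
    # stderr is piped to awk, stdout is discarded; stdin is detached from the loop input
    size=$(mash sketch -o "sweep_k${k}_m${m}" -k "$k" -m "$m" -r "$r1" 2>&1 >/dev/null </dev/null |
      awk -F': *' '/^Estimated genome size/ { print $2; exit }')
    printf 'k=%s m=%s size=%s\n' "$k" "$m" "$size"
  done
}
```

Calling `sweep_genome_size read1.fastq` prints one line per setting, which makes it easier to see whether the estimate is stable or depends strongly on -k and -m.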

Secondly, just do a genome assembly of the reads with Shovill, SKESA or SPAdes. If the total contig size is too high, you probably have contamination, or not a pure colony.
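
The total contig size can be read off the assembly with a short awk call. A minimal sketch, assuming the assembler wrote a plain FASTA file; the function name is hypothetical and the path to the contigs file depends on the assembler used.

```shell
# Sum the sequence lengths in a contigs FASTA file.
# Header lines start with ">"; whitespace and carriage returns are ignored.
total_contig_size() {
  awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}
```

The number it prints (in bp) can then be compared with the expected genome size and with the mash estimate from the reads.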

@MostafaYA
Author

@tseemann @ondovb Thanks a lot for your answers.
