Skip to content

Commit

Permalink
docs: new refseq sketch links and winner-take-all example in tutorial
Browse files Browse the repository at this point in the history
  • Loading branch information
ondovb committed Sep 27, 2017
1 parent 2c1831f commit 1c4f118
Showing 1 changed file with 59 additions and 30 deletions.
89 changes: 59 additions & 30 deletions doc/sphinx/tutorials.rst
Original file line number Diff line number Diff line change
Expand Up @@ -65,67 +65,96 @@ work):

.. download::

`RefSeqSketches.msh.gz <http://gembox.cbcb.umd.edu/mash/RefSeqSketches.msh.gz>`_
`refseq.genomes.k21.s1000.msh <https://gembox.cbcb.umd.edu/mash/refseq.genomes.k21s1000.msh>`_

Plasmids also available:

.. download::

`refseq.genomes+plasmids.k21.s1000.msh <https://gembox.cbcb.umd.edu/mash/refseq.genomes%2Bplasmid.k21s1000.msh>`_
`refseq.plasmids.k21.s1000.msh <https://gembox.cbcb.umd.edu/mash/refseq.plasmid.k21s1000.msh>`_

Concatenate paired ends (this could also be piped to :code:`mash` to save space by
specifying :code:`-` for standard input, zipped or unzipped):

.. code::
cat reads_1.fastq read_2.fastq > reads.fastq
cat reads_1.fastq reads_2.fastq > reads.fastq
Sketch the reads, using :code:`-m 2` to improve results
by ignoring single-copy k-mers, which are more likely to be erroneous:

.. code::
mash sketch -m 2 -k 16 -s 400 reads.fastq
mash sketch -m 2 reads.fastq
Run :code:`mash dist` with the RefSeq archive as the reference and the read
sketch as the query:

.. code::
mash dist RefSeqSketches.msh reads.fastq.msh > distances.tab
mash dist refseq.genomes.k21.s1000.msh reads.fastq.msh > distances.tab
Sort the results to see the top hits and their p-values:

.. code ::
sort -gk3 distances.tab | head
Containment of RefSeq genomes in a read set
-------------------------------------------
Screening a read set for containment of RefSeq genomes
------------------------------------------------------

If a read set potentially has multiple genomes, it can be "screened" against the
database to estimate how well each genome is contained in the read set:
database to estimate how well each genome is contained in the read set. We can
use the `SRA Toolkit <https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/>`_ to
download ERR024951:

.. code::
fastq-dump ERR024951
...and screen it against Refseq Genomes (link above), sorting the results:

.. code::
mash screen refseq.genomes.k21s1000.msh ERR024951.fastq > screen.tab
sort -gr screen.tab | head
We see the expected organism, *Salmonella enterica*, but also an apparent contaminant, *Klebsiella pneumoniae*. The fields are [identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment]:

.. code::
0.99957 991/1000 26 0 GCF_000841985.1_ViralProj14228_genomic.fna.gz NC_004313.1 Salmonella phage ST64B, complete genome
0.99957 991/1000 24 0 GCF_002054545.1_ASM205454v1_genomic.fna.gz [57 seqs] NZ_MYON01000010.1 Salmonella enterica strain BCW_4905 NODE_10_length_152932_cov_1.77994, whole genome shotgun sequence [...]
0.999522 990/1000 102 0 GCF_900086185.1_12082_4_85_genomic.fna.gz [51 seqs] NZ_FLIP01000001.1 Klebsiella pneumoniae strain k1037, whole genome shotgun sequence [...]
0.999329 986/1000 24 0 GCF_002055205.1_ASM205520v1_genomic.fna.gz [72 seqs] NZ_MYOO01000010.1 Salmonella enterica strain BCW_4904 NODE_10_length_177558_cov_3.07217, whole genome shotgun sequence [...]
0.999329 986/1000 24 0 GCF_002054075.1_ASM205407v1_genomic.fna.gz [88 seqs] NZ_MYNK01000010.1 Salmonella enterica strain BCW_4936 NODE_10_length_177385_cov_3.78874, whole genome shotgun sequence [...]
0.999329 986/1000 24 0 GCF_000474475.1_CFSAN001184_01.0_genomic.fna.gz [45 seqs] NZ_AUQM01000001.1 Salmonella enterica subsp. enterica serovar Typhimurium str. CDC_2009K1158 isolate 2009K-1158 SEET1158_1, whole genome shotgun sequence [...]
0.999329 986/1000 24 0 GCF_000474355.1_CFSAN001186_01.0_genomic.fna.gz [46 seqs] NZ_AUQN01000001.1 Salmonella enterica subsp. enterica serovar Typhimurium str. CDC_2009K1283 isolate 2009K1283 (Typo) SEET1283_1, whole genome shotgun sequence [...]
0.999329 986/1000 24 0 GCF_000213635.1_ASM21363v1_genomic.fna.gz [2 seqs] NC_016863.1 Salmonella enterica subsp. enterica serovar Typhimurium str. UK-1, complete genome [...]
0.999281 985/1000 24 0 GCF_001271965.1_Salmonella_enterica_CVM_N43825_v1.0_genomic.fna.gz [67 seqs] NZ_LIMN01000001.1 Salmonella enterica subsp. enterica serovar Typhimurium strain CVM N43825 N43825_contig_1, whole genome shotgun sequence [...]
0.999281 985/1000 24 0 GCF_000974215.1_SALF-297-3.id2_v1.0_genomic.fna.gz [90 seqs] NZ_LAPO01000001.1 Salmonella enterica subsp. enterica serovar Typhimurium strain SALF-297-3 NODE_1, whole genome shotgun sequence [...]
Note, however, that multiple strains of *Salmonella enterica* have good identity. This is because they are each contained well when considered independently. For this reason :code:`mash screen` is not a true classifier. However, we can remove much of the redundancy
for interpreting the results using the winner-take-all strategy (:code:`-w`). And while we're at it, let's throw some more cores at
the task to speed it up (:code:`-p 4`):

.. code::
mash screen RefSeqSketches.msh reads.fastq > screen.tab
mash screen -w -p 4 refseq.genomes.k21s1000.msh ERR024951.fastq > screen.tab
sort -gr screen.tab | head
Screening ERR024951, for example, shows the expected organism, *Salmonella enterica*, but also an apparent contaminant, *Klebsiella pneumoniae*, and many contained plasmids (output sorted and abridged; the fields are [identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment]):
The output is now much cleaner, with just the two whole genomes, plus phages (a lot of other hits to viruses and assembly contigs would appear further down):

.. code::
1 400/400 93 0 ./rcn/refseq-NG-623-.-.-.-pSTR1-Shigella_flexneri.fna
1 400/400 78 0 ./rcn/refseq-NZ-573-PRJNA224116-SAMN02885365-GCF_000764615.1-pPMK1_A-Klebsiella_pneumoniae.fna
1 400/400 76 0 ./rcn/refseq-NG-1351-.-.-.-pNCC801-Enterococcus_faecalis.fna
1 400/400 54 0 ./rcn/refseq-NG-98360-.-.-.-virulence_plasmid-Salmonella_enterica_subsp._enterica_serovar_Dublin.fna
[...]
0.999844 399/400 1130 0 ./rcn/refseq-NR-90371-PRJNA188943-.-.-.-Salmonella_enterica_subsp._enterica_serovar_Typhimurium.fna
[...]
0.998897 393/400 113 0 ./rcn/refseq-NZ-1328378-PRJNA224116-SAMN02138586-GCF_000492775.1-.-Klebsiella_pneumoniae_MGH_46.fna
[...]
Building a custom RefSeq database
---------------------------------

To create the RefSeq Mash database, genomes were downloaded from NCBI
(:code:`ftp.ncbi.nlm.nih.gov/refseq/release/complete`, fasta sequence and
GenBank annotations for :code:`genomic`), and the
`refseqCollate <https://github.com/ondovb/refseqCollate/releases>`_ utility was
used to collate contigs/chromosomes into individual fasta files per genome.
Groups of these files were sketched in parallel and then pasted together with
:code:`mash paste`. This process could be repeated for more current or custom
databases.
0.99957 991/1000 24 0 GCF_002054545.1_ASM205454v1_genomic.fna.gz [57 seqs] NZ_MYON01000010.1 Salmonella enterica strain BCW_4905 NODE_10_length_152932_cov_1.77994, whole genome shotgun sequence [...]
0.99899 979/1000 26 0 GCF_000841985.1_ViralProj14228_genomic.fna.gz NC_004313.1 Salmonella phage ST64B, complete genome
0.998844 976/1000 101 0 GCF_900086185.1_12082_4_85_genomic.fna.gz [51 seqs] NZ_FLIP01000001.1 Klebsiella pneumoniae strain k1037, whole genome shotgun sequence [...]
0.923964 190/1000 40 0 GCF_000900935.1_ViralProj181984_genomic.fna.gz NC_019545.1 Salmonella phage SPN3UB, complete genome
0.900615 111/1000 100 0 GCF_001876675.1_ASM187667v1_genomic.fna.gz [137 seqs] NZ_MOXK01000132.1 Klebsiella pneumoniae strain AWD5 Contig_(1-18003), whole genome shotgun sequence [...]
0.887722 82/1000 31 3.16322e-233 GCF_001470135.1_ViralProj306294_genomic.fna.gz NC_028699.1 Salmonella phage SEN34, complete genome
0.873204 58/1000 22 1.8212e-156 GCF_000913735.1_ViralProj227000_genomic.fna.gz NC_022749.1 Shigella phage SfIV, complete genome
0.868675 52/1000 57 6.26251e-138 GCF_001744215.1_ViralProj344312_genomic.fna.gz NC_031129.1 Salmonella phage SJ46, complete genome
0.862715 45/1000 1 1.05185e-116 GCF_001882095.1_ViralProj353688_genomic.fna.gz NC_031940.1 Salmonella phage 118970_sal3, complete genome
0.856856 39/1000 21 6.70643e-99 GCF_000841165.1_ViralProj14230_genomic.fna.gz NC_004348.1 Enterobacteria phage ST64T, complete genome

0 comments on commit 1c4f118

Please sign in to comment.