Access to the 272,000 assemblies? #12

tseemann opened this Issue Oct 10, 2018 · 11 comments


tseemann commented Oct 10, 2018

As of March 2018, SKESA had been used by NCBI to assemble over 272,000 read sets available in SRA including assemblies for Salmonella (131,581 assemblies), Listeria (19,718 assemblies), Escherichia (65,307 assemblies), Shigella (10,942 assemblies), Campylobacter (32,416 assemblies), and Clostridioides (12,042 assemblies). These species are of importance for detecting pathogens in the food supply chain and in hospitals. Assemblies are publicly available in a downloadable object for each read set from the SRA website.

I am a bit unclear on where we can download these assemblies.
Can you give an example?

I looked without luck for more info here


andrewjpage commented Oct 11, 2018

It would be awesome to get these assemblies!


souvorov commented Oct 12, 2018

I'm in contact with our SRA gurus, and I will get back to you soon.


souvorov commented Oct 19, 2018

The assemblies are publicly available for practically all Illumina samples of the above bacteria. It will improve in the future, but at this point the process of downloading is not particularly user friendly.

To access the SKESA assembly in SRA for a single run, say SRR498276, one should first download the file
The command line to extract the assembly FASTA from the above object is
dump-ref-fasta SRR498276_SRR498276.realign > SRR498276_skesa.fa
Source code and pre-compiled binaries for the SRA toolkit that contain dump-ref-fasta can be downloaded from

Those who want to download multiple assemblies could use a script. For example, the following script will download all available Salmonellae

#!/bin/sh
TOOLKIT= # /-terminated path to the toolkit, or empty if it is in the PATH
ACACHE=  # /-terminated path where the FASTA assemblies should go; if empty, write to the current directory
ORGN="Salmonella"  # organism name for the query (not set in the script as posted; assumed from the text above)
# esearch and efetch are NCBI scripts around E-utilities (Entrez Direct).
esearch -db sra -query "${ORGN}[orgn]" | efetch -db sra -format acclist |
while read -r acc ; do
    # resolve the .realign object; keep field 8 only when the status field is 200
    rp=$( ${TOOLKIT}srapath -f names -r "${acc}.realign" | awk '-F|' 'NF>8 && $(NF-1)==200 { print $8;}' )
    [ "$rp" = "" ] && continue  # skip unassembled runs
    ${TOOLKIT}dump-ref-fasta "$rp" > "${ACACHE}${acc}.assm.fasta"
done

To make this script work, one has to download the SRA Toolkit and the Entrez Direct scripts.
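The `awk` filter is the least obvious part of the script, so here is a minimal sketch of what it does in isolation. It assumes only what the script itself implies: `srapath -f names -r` prints `|`-separated records, the second-to-last field is a status code, and field 8 holds the location of the `.realign` object. The function name `pick_realign_location` and the sample records are made up for illustration and are not real `srapath` output.

```shell
#!/bin/sh
# Hypothetical helper: the same awk filter as in the script, reading records on stdin.
# Prints field 8 of every '|'-separated record that has more than 8 fields
# and whose second-to-last field equals 200.
pick_realign_location() {
    awk -F'|' 'NF > 8 && $(NF-1) == 200 { print $8 }'
}

# Made-up records (NOT real srapath output), used only to exercise the filter.
sample_ok='f1|f2|f3|f4|f5|f6|f7|LOCATION_PLACEHOLDER|200|ok'
sample_missing='f1|f2|f3|f4|f5|f6|f7|LOCATION_PLACEHOLDER|404|not found'

printf '%s\n' "$sample_ok" | pick_realign_location        # prints LOCATION_PLACEHOLDER
printf '%s\n' "$sample_missing" | pick_realign_location   # prints nothing
```

A run without an assembly yields an empty result, which is why the loop skips accessions where `$rp` is empty.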


yaschenk commented Oct 19, 2018

A few notes:

  1. http path is predictable if you know SRR number
  2. You can use SRA tools directly on URL without downloading:
    dump-ref-fasta > SRR498276_skesa.fa
  3. For people familiar with SRAToolkit: you can also produce SAM and/or pileup to see how reads support the contigs. One caveat: to save space, base quality scores are discarded and replaced with 30 for recent runs. This alone saves ~80% of disk space.
  4. The SRA team is working on updating its name resolver to make guessing the URL unnecessary.

tseemann commented Oct 19, 2018

@yaschenk @souvorov @andrewjpage @lskatz

It sounds like SRR498276_SRR498276.realign is an NCBI-style BAM+REF file?
Ok it seems SRR498276_SRR498276.realign is 257 MB.

So I need to download 257 MB to get a 3 MB FASTA reference file?
I don't have the bandwidth from Australia for that.

Or can dump-ref-fasta use random indexed access over HTTP?

Also, I have sratoolkit 2.9.2 and dump-ref-fasta is not in it?
The only reference to it in the github is in two shell scripts?

I do have the align-info command:

% align-info SRR498276_SRR498276.realign


bewt85 commented Oct 20, 2018

I'm also interested in easy ways to download the data. I had a go with the suggested method and it looks like it's only downloading the assembly, but I might be wrong.

I didn't have the SRA toolkit installed so I installed them fresh from I think I found the code for dump-ref-fasta in but I wasn't 100% sure if it downloaded the full .realign so I tried running it in a docker container.

root@0a3578e0dbc2:/code# time sratoolkit.2.9.2-ubuntu64/bin/dump-ref-fasta > SRR498276_skesa.fa

real    0m7.387s
user    0m0.067s
sys     0m0.058s

root@0a3578e0dbc2:/code# ls -lh 
-rw-r--r-- 1 root root 4.7M Oct 20 15:12 SRR498276_skesa.fa
drwxrwxr-x 5 1000 1000 4.0K Jul 23 21:34 sratoolkit.2.9.2-ubuntu64

In another window I ran:

ben@ben-july2017:~/projects/ncbi_assembly$ docker stats --no-stream
CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
0a3578e0dbc2        0.00%               3.043MiB / 15.55GiB   0.02%               1.58MB / 32.8kB     0B / 0B             1

I think this means that it only downloaded 1.58 MB of data which made 4.7 MB of SRR498276_skesa.fa

The only bit that I'm now not sure about is whether there are any md5 checksums anywhere so that I can check there weren't any issues during the download (or is that also part of dump-ref-fasta)?


yaschenk commented Oct 20, 2018

When SRA Tools are used directly on a URL, they only download the pages necessary to complete the task.
dump-ref-fasta only needs the contigs, so it only fetches the parts of the SRA file containing the 2-bit-per-base-compressed contigs. If you run sam-dump or fastq-dump instead, you will see a lot more data downloaded.
Read more about download-on-demand feature of SRA Toolkit at:


yaschenk commented Oct 20, 2018

About checksums: the SRA format contains many internal checksums to ensure that data is delivered correctly. All SRA Tools perform the necessary validation and report an error when something goes wrong. If an SRA tool completes the task with return code 0, then everything is correct.
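In a script, that return-code guarantee can be checked explicitly. A minimal sketch, assuming nothing beyond standard shell behaviour: the helper `run_to_file` (a made-up name) writes a command's output to a temporary file and only moves it into place when the command exits with 0, so a failed transfer never leaves a truncated FASTA behind.

```shell
#!/bin/sh
# Hypothetical helper: run_to_file OUTFILE COMMAND [ARGS...]
# Runs COMMAND and captures its stdout in a temporary file next to OUTFILE.
# The temporary file is renamed to OUTFILE only if COMMAND returns 0;
# otherwise it is removed and COMMAND's exit status is returned.
# Note: plain sh has no local variables, so out/tmp/status are globals.
run_to_file() {
    out=$1
    shift
    tmp="${out}.partial.$$"
    if "$@" > "$tmp"; then
        mv "$tmp" "$out"
    else
        status=$?
        rm -f "$tmp"
        echo "command failed with status $status: $*" >&2
        return "$status"
    fi
}

# Example call (not executed here):
#   run_to_file SRR498276_skesa.fa dump-ref-fasta SRR498276_SRR498276.realign
```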


yaschenk commented Oct 20, 2018

Apologies for the inconsistent packaging of dump-ref-fasta.
You can do the same with a different tool, vdb-dump.
Use 'vdb-dump -T REFERENCE -f fasta2' in place of 'dump-ref-fasta'
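Put into a reusable form, the substitution might look like the sketch below. The `vdb-dump -T REFERENCE -f fasta2` invocation is taken from the line above; the function name `realign_to_fasta`, the `VDB_DUMP` override variable, and the convention of passing the `.realign` location (local path or URL) as the first argument are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: realign_to_fasta SOURCE OUTFILE
#   SOURCE  - local path or URL of a .realign object (supplied by the caller)
#   OUTFILE - FASTA file to write
# VDB_DUMP may be set to the full path of vdb-dump if it is not on the PATH.
realign_to_fasta() {
    if [ "$#" -ne 2 ]; then
        echo "usage: realign_to_fasta SOURCE OUTFILE" >&2
        return 2
    fi
    "${VDB_DUMP:-vdb-dump}" -T REFERENCE -f fasta2 "$1" > "$2"
}

# Example call (not executed here):
#   realign_to_fasta SRR498276_SRR498276.realign SRR498276_skesa.fa
```

The function returns the exit status of `vdb-dump`, so callers can test it directly.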


tseemann commented Oct 20, 2018

@yaschenk thank you so much! that worked perfectly and did not download the whole .realign file.

$ time vdb-dump -T REFERENCE -f fasta2 > SRR498276.REFERENCE.fna

real    0m9.888s
user    0m0.057s
sys     0m0.028s

$ echo $?

$ fa SRR498276.REFERENCE.fna

no=35 bp=4763249 ok=4763249 Ns=0 gaps=0 min=278 avg=136092 max=645633 N50=313057

$ head -n 4 SRR498276.REFERENCE.fna
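The summary line above comes from the `fa` utility, whose source is not shown in this thread. As a rough, independent sketch of how the contig count, total length, minimum, maximum and N50 can be derived from a FASTA file with standard tools (the function name `fasta_stats` is made up, and the output format only loosely imitates the line above):

```shell
#!/bin/sh
# Hypothetical helper: fasta_stats FILE
# Prints contig count, total bases, min/max contig length and N50.
# Assumes sequence lines contain only residues (no spaces, Unix line endings).
fasta_stats() {
    # First pass: emit one length per contig.
    awk '
        /^>/ { if (seen) print len; seen = 1; len = 0; next }
             { len += length($0) }
        END  { if (seen) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "no=0 bp=0"; exit }
            # Lengths arrive longest first; N50 is the length of the contig
            # at which the running sum reaches half of the total.
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run * 2 >= total) { n50 = lens[i]; break }
            }
            printf "no=%d bp=%d min=%d max=%d N50=%d\n", NR, total, lens[NR], lens[1], n50
        }
    '
}

# Example call (not executed here):
#   fasta_stats SRR498276.REFERENCE.fna
```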

I will need to study the VDB container format more and learn how to list the available containers etc.

Is there a way to do it with just the SRRnnnnnnnn number, without knowing the URL?
(like fastq-dump)

Or, how do I configure srapath to find .realign VDB files?
Not clear from here:


yaschenk commented Oct 21, 2018

All name resolution used by srapath is now done by the SRA name server. If you read the script above in detail, it is already using srapath, but in a somewhat awkward way. We are working on making the SRA name server resolve realign objects as painlessly as SRR accessions. This will become available in future releases of the SRA Toolkit.

vdb-dump is the tool which you can use to explore SRA format as a list of tables and columns.
It is documented here:

It may be a bit overwhelming though. For simple programmatic access, we have developed the 'ngs' API, which works in C++, Java, and Python. The best way to learn it is through examples. If you like Python, this is a reference dumper in Python:
This is the Java version:
