Access to the 272,000 assemblies? #12
Comments
It would be awesome to get these assemblies! |
I'm in contact with our SRA gurus, and I will get back to you soon. |
The assemblies are publicly available for practically all Illumina samples of the above bacteria. This will improve in the future, but at this point the download process is not particularly user-friendly. To access the SKESA assembly in SRA for a single run, say SRR498276, one should first download file Those who want to download multiple assemblies could use a script. For example, the following script will download all available Salmonellae
To make this script work, one has to download the SRA Toolkit and the Entrez Direct scripts. |
A few notes:
|
@yaschenk @souvorov @andrewjpage @lskatz It sounds like So I need to download 257 MB to get a 3 MB FASTA reference file? Or can Also, I have I do have the
|
I'm also interested in easy ways to download the data. I had a go with the suggested method and it looks like it's only downloading the assembly, but I might be wrong. I didn't have the SRA Toolkit installed, so I installed it fresh from https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software. I think I found the code for
In another window I ran:
I think this means that it only downloaded 1.58 MB of data, which made 4.7 MB of SRR498276_skesa.fa. The only bit that I'm now not sure about is whether there are any md5 checksums anywhere so that I can check there weren't any issues during the download (or is that also part of |
When SRA Tools are used directly on URL, they only download pages necessary to complete the task. |
About checksums: the SRA format contains a lot of internal checksums to ensure data is delivered correctly. All SRA Tools perform the necessary validation and generate an error when something goes wrong. If an SRA tool completes the task with return code 0, then everything is correct. |
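The return-code rule above can be wrapped in a small helper. This is only a sketch in Python: `run_checked` is a hypothetical name, not part of the SRA Toolkit, and the command passed in is whatever SRA tool invocation you already use.

```python
import subprocess


def run_checked(cmd, out_path):
    """Run cmd, write its stdout to out_path, and raise if the exit code is nonzero.

    Hypothetical helper: it relies only on the statement in this thread that
    an SRA tool returning 0 means the data was validated and delivered correctly.
    """
    with open(out_path, "wb") as out:
        result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE)
    if result.returncode != 0:
        message = result.stderr.decode(errors="replace").strip()
        raise RuntimeError(f"{cmd[0]} exited with code {result.returncode}: {message}")
    return out_path
```

Pass the command as a list of arguments (tool name first) and the file you want the FASTA written to. A raised `RuntimeError` means the output should not be trusted.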
Apologies for inconsistent packaging of dump-ref-fasta |
@yaschenk thank you so much! that worked perfectly and did not download the whole
I will need to study the VDB container format more and learn how to list the available containers etc. Is there a way to do it with just the Or, how do I configure |
All name resolutions used by srapath are now done by the SRA name server. If you read the script above in detail, it is already using srapath, but in a slightly awkward way. We are working on making the SRA name server resolve realign objects as painlessly as SRR accessions. This will become available in future releases of the SRA Toolkit. vdb-dump is the tool you can use to explore the SRA format as a list of tables and columns. It may be a bit overwhelming, though. For simple programmatic access, we have developed the 'ngs' API |
Hi, I've noticed that most of these techniques no longer work. Perhaps this is because of the name server changes @yaschenk mentioned?
I then tried changing the URL to https:
I thought the URL structure might have changed so I tried:
It looks like it's possible to download the whole file and extract the reference
I think I am using the latest SRA toolkit:
Many thanks, Ben |
Ben, as for the name resolver, it is coded but not released. We are working on preparing the package. |
Thanks @yaschenk I don't seem to be able to reproduce the issue from other machines so I think it's a local environment problem. |
How would I verify if an assembly does or does not exist? For example, SRR5424152 did not return an assembly when I used |
And this is my perl code to mimic your shell code
|
SRA is currently building assemblies only within the "Pathogen Detection Program", which means only human bacterial pathogens are included. You can find the exact list of organisms by following the link. The Pathogen Project is a very small fraction of SRA. Your example, SRR5424152, is Douglas fir. We do plan to expand "assembly building" in SRA, but it takes time since SRA is very heterogeneous. |
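On the client side, one rough way to tell "no assembly came back" from a successful download is to count the FASTA records in whatever file the tool wrote. This is a sketch in Python. It assumes the tool's output was redirected to a file, and that a run without an assembly yields an empty file or an error; the thread does not confirm either behavior. `count_fasta_records` is a hypothetical helper name.

```python
def count_fasta_records(path):
    """Return the number of FASTA header lines (lines starting with '>') in path."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

A count of zero suggests that no contigs were returned for that run, for example because the organism is outside the Pathogen Detection set.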
Is there a standard way to download many files at once? To batch them? I was trying to think of some kind of method where you
|
@yaschenk I notice that most of the |
Yes, Pathogen Pipeline is now generating submissions to Genbank and automatically annotating them for submitters who requested this service. |
@yaschenk can you describe how to use SKESA to target or guide assembly around particular reference genes of importance? |
@tseemann, as far as SKESA is concerned, the only way to target it is to preselect reads close to the gene of interest. On the other hand, we have a different assembler that is close to being published (SAUTE: Sequence Assembly Using Target Evidence). It does exactly what you asked about. If you have a particular set of references and reads, I'll be glad to try it. |
OK, I thought "SAUTE" was a feature being added to SKESA, but I now understand it is a separate tool. We look forward to it! |
Hi, the dump-ref-fasta technique has not been working for me for quite a while, and wget is now returning a 500 Internal Server Error. I wonder whether there is a new name resolver or whether it is a temporary server issue. Thanks, |
I am able to reproduce an error with dump-ref-fasta in testing. We will take a look at the problem. In the interim I was able to use |
@stineaj Thanks! I used Tongzhou |
I thought I would contribute here since I was using the methods outlined here but they did stop working. I emailed the SRA help desk and got an answer on downloading the assemblies. Below is their response:

The location of the realign objects can change over time. This one now has a new location:
https://sra-download-nfs.be-md.ncbi.nlm.nih.gov/traces/sra48/SRZ/000498/SRR498276/SRR498276.realign

So a command like this should work:
vdb-dump -T REFERENCE -f fasta2 https://sra-download-nfs.be-md.ncbi.nlm.nih.gov/traces/sra48/SRZ/000498/SRR498276/SRR498276.realign

Please note that you may need to uncheck 'enable remote access' on the first tab of the configuration utility, and save the config in some cases if it is not working. You can get the current location for a realign file using our SDL service and changing out the run accession or giving a list of comma delimited accessions:

I tested this and it worked for me, so hopefully, this helps others |
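For several runs at once, the help desk command can be looped over. The following is a minimal sketch in Python, not an official workflow. It assumes you have already looked up the current `.realign` URL for each run, since those locations can change, and that `vdb-dump` is on your PATH. The names `vdb_dump_cmd` and `dump_assemblies`, and the `<run>_skesa.fa` output naming, are illustrative choices.

```python
import subprocess
from pathlib import Path


def vdb_dump_cmd(realign_url):
    """Command quoted in the help desk reply: dump the REFERENCE table as FASTA."""
    return ["vdb-dump", "-T", "REFERENCE", "-f", "fasta2", realign_url]


def dump_assemblies(url_by_run, out_dir, build_cmd=vdb_dump_cmd):
    """Write one <run>_skesa.fa per run and return (written, failed) run lists.

    url_by_run maps a run accession to its current .realign URL, which the
    caller must supply. build_cmd is a hook for swapping in another command.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for run, url in url_by_run.items():
        target = out_dir / f"{run}_skesa.fa"
        with open(target, "wb") as out:
            result = subprocess.run(build_cmd(url), stdout=out,
                                    stderr=subprocess.DEVNULL)
        if result.returncode == 0:
            written.append(run)
        else:
            target.unlink()  # drop partial output from a failed run
            failed.append(run)
    return written, failed
```

Runs listed in `failed` can be retried after looking up their location again.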
As of March 2018, SKESA had been used by NCBI to assemble over 272,000 read sets available in SRA including assemblies for Salmonella (131,581 assemblies), Listeria (19,718 assemblies), Escherichia (65,307 assemblies), Shigella (10,942 assemblies), Campylobacter (32,416 assemblies), and Clostridioides (12,042 assemblies). These species are of importance for detecting pathogens in the food supply chain and in hospitals. Assemblies are publicly available in a downloadable object for each read set from the SRA website.
I am a bit unclear on where we can download these assemblies.
Can you give an example?
I looked without luck for more info here ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/skesa/datasets