Metadata and download #126

Open
cecilpert opened this issue Jul 27, 2020 · 3 comments

Comments

@cecilpert

Hello,

First, thanks for your tool; it's really useful and works great.
I have a remark about the metadata file. You may already be aware of this, but when several file formats are downloaded (for example, -F fasta,genbank), only the last downloaded file is written to the local_file column of the metadata file. It's a detail, but it might be better to list all the files (I'm not sure how; maybe one column per file format?).
Also, for my own needs, I wanted information to be written to the metadata file even if the genomes already exist in my output directory. To do that I made some quick-and-dirty modifications to the code, but if you're interested in this functionality I can make a proper version and open a pull request.

@fconstancias

Hi Cecile,

I am quite interested in the possibility of getting BioSample attributes (such as host, collection date, geographic location, ...) for prokaryotic genomes downloaded using ncbi-genome-download.
If you have an idea of how to do that, it would be extremely helpful.

Thanks

@cecilpert
Author

Hi,

I don't think you can get this information directly with ncbi-genome-download; it will probably require querying other databases. In general, this kind of information is not well standardized, so it can be tricky to automate the queries.

My first idea would be to extract the BioProject id and/or BioSample id from the metadata file produced by ncbi-genome-download and use the NCBI Entrez API to search the NCBI BioProject and BioSample databases. From Python, I know this can be done with the Biopython library (http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec138), but simpler libraries may exist. From a quick check, though, the BioProject and BioSample databases don't seem to hold much of this information, if any at all.
There seems to be more information in the EBI/ENA database, and an API seems to exist (https://www.ebi.ac.uk/ena/portal/api/). When I had to get things from this database, I sent HTTP requests directly with the Python requests library (https://requests.readthedocs.io/en/master/) and then parsed the result.
In your case, it seems you can access the sample information with the BioSample id, and it's downloadable as an XML file. For example:
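As a rough sketch of that first step, the BioSample ids could be pulled out of the metadata table along these lines. It assumes the table is tab-separated, has a header row, and has a column named `biosample`; the function name `read_biosample_ids` and the example rows are made up for illustration.

```python
import csv
import io


def read_biosample_ids(handle, column="biosample"):
    """Return the unique, non-empty values of `column`, in file order."""
    reader = csv.DictReader(handle, delimiter="\t")
    seen = set()
    ids = []
    for row in reader:
        # A missing column or an empty cell is treated as "no id".
        value = (row.get(column) or "").strip()
        if value and value not in seen:
            seen.add(value)
            ids.append(value)
    return ids


# Made-up table standing in for a real metadata file (assumed layout).
example = io.StringIO(
    "assembly_accession\tbiosample\tlocal_filename\n"
    "GCF_000000001.1\tSAMEA1705929\t./a.fna.gz\n"
    "GCF_000000002.1\t\t./b.fna.gz\n"
    "GCF_000000003.1\tSAMEA1705929\t./c.fna.gz\n"
)
print(read_biosample_ids(example))
```

With a real file, the same function could be called on `open(path, newline="")`, where `path` points to your metadata table.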

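import requests
# Fetch the sample record as XML, using the BioSample id as the accession
r = requests.get("https://www.ebi.ac.uk/ena/data/view/SAMEA1705929&display=xml")
r.raise_for_status()  # stop here if the server answered with an HTTP error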

Then you need to parse the XML. I know the beautifulsoup4 and xml.etree Python libraries can do that, but I don't know exactly how to use them.
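For the parsing step, a minimal sketch with the standard-library xml.etree.ElementTree module could look like the following. It assumes the sample record stores its attributes as SAMPLE_ATTRIBUTE elements, each with a TAG and a VALUE child; the embedded record and the function name `sample_attributes` are made up for illustration, so check them against what the server actually returns.

```python
import xml.etree.ElementTree as ET

# Made-up record in the shape the sample XML is assumed to have.
EXAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE accession="SAMEA0000000">
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>host</TAG><VALUE>Homo sapiens</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>collection date</TAG><VALUE>2010</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>note</TAG></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def sample_attributes(xml_data):
    """Map each attribute TAG to its VALUE ('' when VALUE is absent)."""
    root = ET.fromstring(xml_data)
    attributes = {}
    # iter() walks the whole tree, so the nesting depth does not matter.
    for attr in root.iter("SAMPLE_ATTRIBUTE"):
        tag = attr.findtext("TAG")
        if tag is None:
            continue  # skip malformed entries without a TAG
        attributes[tag.strip()] = (attr.findtext("VALUE") or "").strip()
    return attributes


print(sample_attributes(EXAMPLE_XML))
```

With the request from the example above, this would be called as `sample_attributes(r.content)`, and fields such as host or collection date could then be looked up by tag name if the record provides them.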

@fconstancias

Excellent, thanks.
