Metadata and download #126

cecilpert · 2020-07-27T10:33:30Z

Hello,

First, thanks for your tool; it's really useful and works great.
I have a remark about the metadata file. I don't know if you're aware of this, but when we download several file formats (for example: -F fasta,genbank), only the last downloaded file is written in the metadata local_file column. It's a detail, but maybe it would be better to have all files listed? (I don't really know how; maybe one column per file format?)
Also, for my own needs, I wanted information to be written to the metadata file even if the genomes already exist in my output directory. To do that, I made some dirty modifications to the code, but if you're interested in this functionality I can make a proper version and open a pull request.

fconstancias · 2020-08-18T15:58:54Z

Hi Cecile,

I am quite interested in the possibility of getting BioSample attributes (such as host, collection date, geographic location, ...) for prokaryotic genomes downloaded using ncbi-genome-download.
If you have an idea of how to do that, it would be extremely helpful.

Thanks

cecilpert · 2020-08-20T09:36:07Z

Hi,

I don't think you can get this information directly with ncbi-genome-download; it will probably require querying other databases. In general, this kind of information is not well standardized, so it can be tricky to automate the queries.

My first idea would be to extract the BioProject ID and/or BioSample ID from the metadata file provided by ncbi-genome-download and use the NCBI Entrez API to search the NCBI BioProject and BioSample databases. For Python, I know it can be used through the Biopython library (http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec138), but maybe simpler libraries exist. However, from a quick check, there doesn't seem to be much information (if any at all) in the BioProject and BioSample databases.
There seems to be more information in the EBI/ENA database, and an API seems to exist (https://www.ebi.ac.uk/ena/portal/api/). When I had to get things from this database, I sent HTTP requests directly with the Python requests library (https://requests.readthedocs.io/en/master/) and then parsed the result.
In your case, it seems you can access the sample information with the BioSample ID, and it's downloadable as an XML file. For example:

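import requests
# Fetch the ENA sample record as XML, using the BioSample ID
r = requests.get("https://www.ebi.ac.uk/ena/data/view/SAMEA1705929&display=xml")
r.raise_for_status()  # stop here if the request failed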

Then you need to parse the XML. I know the beautifulsoup4 or xml.etree Python libraries can do that, but I don't know exactly how to use them.
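
For the parsing step, here is a minimal sketch with the standard-library xml.etree.ElementTree module. It assumes the record lists its attributes as SAMPLE_ATTRIBUTE elements, each holding a TAG and a VALUE child; that layout is an assumption to check against a real response. The EXAMPLE_XML string and the parse_sample_attributes helper are made up for illustration and are not part of ncbi-genome-download or of any ENA client.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample record, used only for illustration.
# The element names are an assumption about the XML layout.
EXAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE accession="SAMEA0000000">
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>host</TAG><VALUE>Homo sapiens</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>collection_date</TAG><VALUE>2012</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>note</TAG></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def parse_sample_attributes(xml_data):
    """Return a {tag: value} dict built from a sample XML record (str or bytes)."""
    root = ET.fromstring(xml_data)
    attributes = {}
    # Walk every SAMPLE_ATTRIBUTE element, wherever it sits in the tree.
    for attr in root.iter("SAMPLE_ATTRIBUTE"):
        tag = attr.findtext("TAG")
        value = attr.findtext("VALUE")
        if tag is not None:
            # Attributes without a VALUE child are kept with a None value.
            attributes[tag.strip()] = value.strip() if value else None
    return attributes


print(parse_sample_attributes(EXAMPLE_XML))
```

With the request above, something like parse_sample_attributes(r.content) should then give a dictionary you can look up by attribute name, provided the response really follows this layout.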

fconstancias · 2020-08-24T06:57:18Z

excellent thanks.
