pyensembl is a lightweight Python script that uses the REST API to query information and download data from Ensembl Bacteria.
Note: This project is still under development, and the command names and options might change in the future.
pip install --user --upgrade git+https://github.com/mdjbru-forge/pyensembl
To download all available genomes for the taxon “Serratia”:
pyensembl genomes -t serratia > myGenomes.list
mkdir myGenomes.files
pyensembl download -g myGenomes.list -f genbank -d myGenomes.files
There are two different ways to obtain the genome information before downloading the actual data.
If the species of interest can all be found in a given taxon, querying by taxon is the preferred choice, as information retrieval is fast:
pyensembl genomes -t serratia > serratia.genomes.results
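As a rough illustration of what a taxon query over the REST API can look like, the sketch below builds a request URL for the Ensembl REST endpoint `info/genomes/taxonomy/:taxon_name`. Whether pyensembl uses this exact endpoint or host is an assumption, and the names `REST_SERVER` and `taxon_genomes_url` are hypothetical, chosen only for this example.

```python
from urllib.parse import quote

# Assumed base URL of the Ensembl REST service; the host that pyensembl
# actually contacts is not stated in this README.
REST_SERVER = "https://rest.ensembl.org"


def taxon_genomes_url(taxon_name, server=REST_SERVER):
    """Build a URL listing the genomes found beneath a taxonomy node."""
    # Percent-encode the taxon name so that names with spaces are safe.
    path = "/info/genomes/taxonomy/" + quote(taxon_name, safe="")
    return server + path + "?content-type=application/json"


print(taxon_genomes_url("serratia"))
```

A single request of this kind covers a whole taxon, which is consistent with this method being the fast one.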
The alternative method first searches the names of the available species for a string.
The first step is to get the information about the bacterial species available in EnsemblBacteria:
pyensembl refresh
This will download the information and save it in JSON format in a file named .pyensembl-bacteria-species.<timestamp> in your home folder.
One can check the JSON files available locally with:
pyensembl refresh -c
The next steps (search) will automatically use the most recent local JSON file.
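The selection of the most recent local file could be sketched as follows. This is a minimal sketch, not pyensembl's own code: the function name `most_recent_species_file` is hypothetical, and it assumes that "most recent" can be decided from the file modification time rather than from the timestamp in the file name, whose format is not described here.

```python
from pathlib import Path


def most_recent_species_file(folder=None):
    """Return the newest .pyensembl-bacteria-species.* file, or None."""
    # Default to the home folder, where the files are saved.
    folder = Path(folder) if folder is not None else Path.home()
    candidates = list(folder.glob(".pyensembl-bacteria-species.*"))
    if not candidates:
        return None
    # Assumption: the newest file is the one that was modified last.
    return max(candidates, key=lambda p: p.stat().st_mtime)


print(most_recent_species_file())
```

If no file has been downloaded yet, the function returns None, which corresponds to the case where pyensembl refresh has to be run first.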
To search among the available species for a string, use pyensembl search. The string will be searched in the name and display_name fields (case-insensitive).
pyensembl search serratia > search.results
pyensembl search "Serratia marcescens" > search.results2
The information about matching species is sent to stdout and can be saved to a file using the redirection symbol >.
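The matching rule used by the search (a case-insensitive substring match on the name and display_name fields) can be sketched as below. The layout of the species data as a list of dictionaries is an assumption, the function name `search_species` is hypothetical, and the two entries are made up for illustration.

```python
def search_species(species, query):
    """Return the entries whose name or display_name contains the query."""
    needle = query.lower()
    return [
        entry for entry in species
        if needle in entry.get("name", "").lower()
        or needle in entry.get("display_name", "").lower()
    ]


# Made-up entries, for illustration only.
species = [
    {"name": "serratia_marcescens_example",
     "display_name": "Serratia marcescens example"},
    {"name": "escherichia_coli_example",
     "display_name": "Escherichia coli example"},
]

matches = search_species(species, "SERRATIA MARCESCENS")
print([entry["name"] for entry in matches])
```

Here the first entry matches through its display_name field only, since its name field uses underscores instead of spaces.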
Genome information for the matching species is then retrieved from the saved search results. This step can take time, as the information is retrieved one genome at a time.
pyensembl genomes -f search.results2 > genomes.results
(Note: For now only the GenBank format download is implemented.)
(Note: Plasmid GenBank files are automatically filtered out and not downloaded.)
Use pyensembl download to download sequence data for a list of genomes.
Genome information is read from a tab-separated file, e.g. one produced by pyensembl genomes.
The only format of retrieved data is GenBank flat file for now.
Downloaded files can be sent to a destination folder with the -d option.
# Done previously: retrieve genome information
pyensembl genomes -t serratia > serratia.genomes.results
# Download data for those genomes
mkdir myGenomes
pyensembl download -g serratia.genomes.results -f genbank -d myGenomes
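To inspect the genome information file before downloading, its rows can be read with a few lines of Python. This is only a sketch: the function name `read_genome_table` is hypothetical, and because the column layout of the file is not documented here, the rows are returned as plain lists of fields without interpreting them.

```python
import csv


def read_genome_table(path):
    """Return the non-empty rows of a tab-separated file as lists of fields."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        # Blank lines produce empty rows, which are skipped.
        return [row for row in reader if row]
```

For example, `len(read_genome_table("serratia.genomes.results"))` gives the number of non-empty lines in the file produced above.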
The formats available on EnsemblBacteria are:
- EMBL
- FASTA
- GenBank
- GFF3
- GTF
- MySQL
- RDF
- TSV
- VEP
The GenBank flat file format might be the most convenient way to get all the information in one go. The format is rich, with the DNA sequence, CDS, translated protein sequences and external references such as GO annotations.
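To get a quick idea of what a downloaded GenBank flat file contains, the feature keys in its feature table can be counted using only the standard library. This is a minimal sketch that relies on the fixed-column layout of the feature table (feature keys start in column 6, qualifiers are indented further); the function name `count_feature_keys` is hypothetical and the sample record is made up. A dedicated parser is preferable for real analyses.

```python
from collections import Counter


def count_feature_keys(genbank_text):
    """Count the feature keys (e.g. CDS, gene) in GenBank flat file text."""
    counts = Counter()
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        # A non-indented line (e.g. ORIGIN) ends the feature table.
        if line and not line.startswith(" "):
            in_features = False
            continue
        # Feature keys start in column 6; qualifier lines are indented further.
        if line.startswith("     ") and len(line) > 5 and line[5] != " ":
            counts[line[5:21].strip()] += 1
    return counts


# Made-up record, for illustration only.
sample = """LOCUS       EXAMPLE                   60 bp    DNA
FEATURES             Location/Qualifiers
     source          1..60
                     /organism="Example organism"
     gene            1..30
     CDS             1..30
                     /translation="MK"
     CDS             31..60
ORIGIN
        1 atgaaataaa
//
"""

print(count_feature_keys(sample))
```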
For each genome, the available data on EnsemblBacteria as FASTA files is:
- DNA
  - DNA / DNA repeat-masked / DNA soft repeat-masked
  - top level / chromosome / non-chromosomal
- CDS
  - All CDS (known, novel and pseudo gene predictions)
- cDNA
  - cDNA all / cDNA ab initio
- peptides
  - pep all / pep ab initio
- ncRNA
  - non-coding RNA genes