pyensembl is a lightweight Python script that uses the REST API to query information and download data from Ensembl Bacteria.

Note: This project is still under development, and the command names and options might change in the future.

Installation

pip install --user --upgrade git+https://github.com/mdjbru-forge/pyensembl

In a nutshell

To download all available genomes for the taxon “Serratia”:

pyensembl genomes -t serratia > myGenomes.list
mkdir myGenomes.files
pyensembl download -g myGenomes.list -f genbank -d myGenomes.files
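The same two steps can also be driven from a Python script. The following is only a sketch: it uses the subcommands and options shown above, while the function name `download_taxon` and its `run` parameter (which allows swapping in a stand-in for `subprocess.run`) are hypothetical and not part of pyensembl.

```python
import subprocess
from pathlib import Path


def download_taxon(taxon, list_file, out_dir, run=subprocess.run):
    """Run the two pyensembl steps shown above for one taxon."""
    # Step 1: save the genome list (equivalent to `> list_file` in the shell).
    with open(list_file, "w") as handle:
        run(["pyensembl", "genomes", "-t", taxon], stdout=handle, check=True)
    # Step 2: create the destination folder, then download the GenBank files.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    run(
        ["pyensembl", "download", "-g", str(list_file),
         "-f", "genbank", "-d", str(out_dir)],
        check=True,
    )
```

With pyensembl installed and on the PATH, `download_taxon("serratia", "myGenomes.list", "myGenomes.files")` would mirror the shell commands above.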

Detailed usage

Get information about the genomes to download

There are two ways to obtain genome information before downloading the actual data.

Search for all genomes under a taxon node (recommended)

If the species of interest can all be found in a given taxon, this is the preferred choice as the information retrieval is fast:

pyensembl genomes -t serratia > serratia.genomes.results

Search available species using a string (not recommended)

This is an alternative method, based on searching first through the names of available species using a string.

Get information about available species

The first step is to get the information about the bacterial species available in EnsemblBacteria:

pyensembl refresh

This will download the information and save it in JSON format to a file named .pyensembl-bacteria-species.<timestamp> in your home folder.

One can check the JSON files available locally with:

pyensembl refresh -c

The next steps (search) will automatically use the most recent local JSON file.
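To illustrate what "most recent" could mean here, the sketch below picks the newest `.pyensembl-bacteria-species.*` file in a folder by modification time. This is an assumption made for illustration: the helper name `latest_species_file` is hypothetical, and pyensembl itself may select the file differently (for instance, from the timestamp in the file name).

```python
from pathlib import Path


def latest_species_file(folder=None):
    """Return the most recently modified species file in `folder`, or None."""
    # Default to the home folder, where `pyensembl refresh` saves its files.
    folder = Path(folder) if folder is not None else Path.home()
    candidates = list(folder.glob(".pyensembl-bacteria-species.*"))
    if not candidates:
        return None
    # Assumption: modification time reflects when the file was downloaded.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```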

Search among available species

To search among the available species for a string, use pyensembl search. The string will be searched in the name and display_name fields (case-insensitive).

pyensembl search serratia > search.results
pyensembl search "Serratia marcescens" > search.results2

The information about matching species is sent to stdout and can be saved to a file using the redirection symbol >.
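The matching rule described above (a case-insensitive search in the name and display_name fields) can be sketched in a few lines of Python. This is not pyensembl's actual code: it assumes the species information is a list of dictionaries and that a match means the string occurs anywhere in either field. The function name `matching_species` and the records are made up for illustration.

```python
def matching_species(species, query):
    """Return entries whose `name` or `display_name` contains `query`."""
    needle = query.lower()
    return [
        entry
        for entry in species
        if needle in str(entry.get("name", "")).lower()
        or needle in str(entry.get("display_name", "")).lower()
    ]


# Made-up records, for illustration only.
example = [
    {"name": "serratia_marcescens", "display_name": "Serratia marcescens"},
    {"name": "escherichia_coli", "display_name": "Escherichia coli"},
]
hits = matching_species(example, "SERRATIA")
print([entry["display_name"] for entry in hits])
```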

Retrieve the corresponding genome information

This step can take time as the information is retrieved one genome at a time.

pyensembl genomes -f search.results2 > genomes.results

Download sequence data

(Note: For now only the GenBank format download is implemented.)

(Note: Plasmid GenBank files are automatically filtered out and not downloaded.)

Download GenBank files for a list of genomes

Use pyensembl download to download sequence data for a list of genomes.

Genome information is read from a tab-separated file, such as the one produced by pyensembl genomes.

For now, the only format available for download is the GenBank flat file.

Downloaded files can be sent to a destination folder with the -d option.

# Done previously: retrieve genome information
pyensembl genomes -t serratia > serratia.genomes.results
# Download data for those genomes
mkdir myGenomes
pyensembl download -g serratia.genomes.results -f genbank -d myGenomes
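Before starting a long download, it can help to check how many entries the genome file contains. The sketch below relies only on what is stated above, namely that the file is tab-separated. The column layout is not described here, so rows are returned as plain lists and no header row is assumed. The function name `read_genome_table` is hypothetical.

```python
import csv


def read_genome_table(path):
    """Read a tab-separated file and return its non-empty rows as lists."""
    with open(path, newline="") as handle:
        # Blank lines come back as empty lists and are skipped.
        return [row for row in csv.reader(handle, delimiter="\t") if row]
```

For example, `len(read_genome_table("serratia.genomes.results"))` gives the number of non-empty rows in the file produced above.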

Other formats available on EnsemblBacteria FTP

Available formats are:

EMBL
FASTA
GenBank
GFF3
GTF
MySQL
RDF
TSV
VEP

GenBank format

This format (GenBank flat file) might be the most convenient to get all the information in one go. The format is rich, with DNA sequence, CDS, translated protein sequences and external references such as GO annotations.
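As a quick sanity check on a downloaded file, the number of CDS features can be counted with the standard library alone. The sketch below relies on the GenBank flat file layout, in which feature keys in the FEATURES table start in column 6 and locations start in column 22. The function name `count_cds` is hypothetical, and for anything beyond counting, a dedicated parser such as Biopython's Bio.SeqIO is likely the more robust choice.

```python
def count_cds(genbank_text):
    """Count CDS features in the text of a GenBank flat file."""
    count = 0
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            # Start of the feature table of the current record.
            in_features = True
        elif line.startswith("ORIGIN") or line.startswith("//"):
            # Sequence section or end of record: leave the feature table.
            in_features = False
        elif in_features and line[:5] == "     " and line[5:21].strip() == "CDS":
            # Feature key lines have 5 leading spaces, then the key.
            count += 1
    return count
```

Reading a file with `open(path).read()` and passing the text to `count_cds` works for both single-record and multi-record files.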

FASTA format

For each genome, the following data are available as FASTA files on EnsemblBacteria:

DNA
- DNA / DNA repeat-masked / DNA soft repeat-masked
- top level / chromosome / non-chromosomal
CDS
- All CDS (known, novel and pseudo gene predictions)
cDNA
- cDNA all / cDNA ab initio
peptides
- pep all / pep ab initio
ncRNA
- non-coding RNA genes
