Skip to content

mdjbru-forge/pyensembl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pyensembl is a light Python script using the REST API to query information and download data from Ensembl Bacteria.

Note: This is a project which is still being written and the command names and options might change in the future.

Installation

pip install --user --upgrade git+https://github.com/mdjbru-forge/pyensembl

In a nutshell

To download all available genomes for the taxon “Serratia”:

pyensembl genomes -t serratia > myGenomes.list
mkdir myGenomes.files
pyensembl download -g myGenomes.list -f genbank -d myGenomes.files

Detailled usage

Get information about the genomes to download

There are two different ways to obtain the genomes information before downloading the actual data.

Search for all genomes under a taxon node (recommended)

If the species of interest can all be found in a given taxon, this is the preferred choice as the information retrieval is fast:

pyensembl genomes -t serratia > serratia.genomes.results

Search available species using a string (not recommended)

This is an alternative method, based on searching first through the names of available species using a string.

Get information about available species

The first step is to get the information about bacteria species available in EnsemblBacteria:

pyensembl refresh

This will download and save the information in a JSON format in a file .pyensembl-bacteria-species.<timestamp> in your home folder.

One can check the JSON files available locally with:

pyensembl refresh -c

The next steps (search) will automatically use the most recent local JSON file.

Search among available species

To search among the available species for a string, use pyensembl search. The string will be searched in the name and display_name fields (case-insensitive).

pyensembl search serratia > search.results
pyensembl search "Serratia marcescens" > search.results2

The information about matching species is sent to stdout and can be saved to a file using the redirection symbol >.

Retrieve the corresponding genome information

This step can take time as the information is retrieved one genome at a time.

pyensembl genomes -f search.results2 > genomes.results

Download sequence data

(Note: For now only the GenBank format download is implemented.)

(Note: Plasmid genbank files are automatically filtered out and not downloaded.)

Download GenBank files for a list of genomes

Use pyensembl download to download sequence data for a list of genomes.

Genome information is read from a tab-separated file, e.g. produced by pyensembl genomes.

The only format of retrieved data is GenBank flat file for now.

Downloaded files can be sent to a destination folder with the -d option.

# Done previously: retrieve genome information
pyensembl genomes -t serratia > serratia.genomes.results
# Download data for those genomes
mkdir myGenomes
pyensembl download -g serratia.genomes.results -f genbank -d myGenomes

Other formats available on EnsemblBacteria ftp

Available formats are:

  • EMBL
  • FASTA
  • GenBank
  • Gff3
  • gtf
  • mySQL
  • rdf
  • tsv
  • vep

GenBank format

This format (GenBank flat file) might be the most convenient to get all the information in one go. The format is rich, with DNA sequence, CDS, translated protein sequences and external references such as GO annotations.

FASTA format

For each genome, the available data on EnsemblBacteria as FASTA files is:

  • DNA
    • DNA / DNA repeat-masked / DNA soft repeat-masked
    • top level / chromosome / non-chromosomal
  • CDS
    • All CDS (known, novel and pseudo gene predictions)
  • cDNA
    • cDNA all / cDNA ab initio
  • peptides
    • pep all / pep ab initio
  • ncRNA
    • non-coding RNA genes

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published