pybio: basic genomics toolset
- automated download of genome assemblies from Ensembl and STAR indexing,
- automated download of genome annotations from Ensembl GTF with fast-searching capabilities,
- Fasta, Fastq, bedGraph and other file format handling,
- motif sequence searches,
- alternative polyadenylation site-pair classification (same-exon, skipped-exon, composite-exon),
- and more.
A few steps on how to download and set up pybio.
Clone the GitHub repository
For now the most direct way of installing pybio is to clone the repository and add the containing folder to PYTHONPATH:
git clone https://github.com/grexor/pybio.git
If, for example, you cloned pybio to /home/user/pybio, you would add these commands to the .profile file in your home folder:
export PYTHONPATH=$PYTHONPATH:/home/user
export PATH=$PATH:/home/user/pybio/bin
There are a few software tools pybio depends on:
- STAR aligner, sudo apt-get install rna-star
- pysam, sudo apt-get install python-pysam
- numpy, sudo apt-get install python-numpy
- Salmon, download and install from the Salmon webpage
- samtools, sudo apt-get install samtools
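To confirm that the setup worked, a small check like the following can help. It is only a sketch: the executable names (STAR, salmon, samtools) are the usual binary names for these tools and are an assumption here, and the helper missing_dependencies is hypothetical, not part of pybio.

```python
import importlib.util
import shutil

# Assumed names: adjust them if your installation uses different ones.
EXECUTABLES = ["STAR", "salmon", "samtools"]
MODULES = ["pysam", "numpy", "pybio"]


def missing_dependencies(executables=EXECUTABLES, modules=MODULES):
    """Return the executables not found on PATH and the modules not importable."""
    missing = [name for name in executables if shutil.which(name) is None]
    missing += [name for name in modules if importlib.util.find_spec(name) is None]
    return missing


if __name__ == "__main__":
    missing = missing_dependencies()
    print("missing:", ", ".join(missing) if missing else "none")
```

If pybio shows up as missing, the PYTHONPATH entry is the first thing to check.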
Here we provide basic pybio usage examples.
Downloading Ensembl genomes
In the folder pybio/genomes, there are .sh scripts you can use to automatically download and pre-process Ensembl genomes. For example, to download and prepare hg38 (Ensembl v98), run:
cd pybio/genomes
./hg38.download.ensembl98.sh
This will download the FASTA sequence, GTF and TAB annotation (via Biomart) of the genome, and create several folders:
hg38.assembly.ensembl98            # FASTA files of the genome, each chromosome in a separate file
hg38.annotation.ensembl98          # Annotation in GTF and TAB format
hg38.assembly.ensembl98.star       # STAR index, GTF annotation aware
hg38.transcripts.ensembl98         # transcriptome, this is the Ensembl "cDNA" file in FASTA format
hg38.transcripts.ensembl98.salmon  # Salmon index of the transcriptome
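After the script finishes, it can be useful to verify that all five folders exist. The sketch below builds the folder names from the naming pattern shown above; the helpers expected_folders and missing_folders are hypothetical and not part of pybio, and the base directory is whatever folder the script wrote into.

```python
from pathlib import Path


def expected_folders(genome="hg38", version="ensembl98"):
    """Folder names following the <genome>.<kind>.<version> pattern listed above."""
    return [
        f"{genome}.assembly.{version}",
        f"{genome}.annotation.{version}",
        f"{genome}.assembly.{version}.star",
        f"{genome}.transcripts.{version}",
        f"{genome}.transcripts.{version}.salmon",
    ]


def missing_folders(base_dir, genome="hg38", version="ensembl98"):
    """Return the expected folder names that are not directories under base_dir."""
    base = Path(base_dir)
    return [
        name
        for name in expected_folders(genome, version)
        if not (base / name).is_dir()
    ]
```

For example, missing_folders("pybio/genomes") returns an empty list when every folder is in place.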
Retrieving genomic sequence
To retrieve stretches of genomic sequence, we use the seq(genome, chr, strand, position, upstream, downstream) method:
pybio.genomes.seq("hg38", "1", "+", 450000, -20, 20) # returns 'TACCCTGATTCTGAAACGAAAAAGCTTTACAAAATCCAAGA' for hg38, Ensembl v98
The above command fetches the chr1 sequence from 450000-20..450000+20; the resulting sequence has length 41.
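The coordinate arithmetic can be illustrated on a plain Python string. This is a sketch of the windowing only, not pybio's implementation: it assumes 1-based inclusive coordinates and treats the last two arguments as signed offsets relative to the position, as the call above suggests. The function name window is hypothetical, and strand handling (reverse complement) is left out.

```python
def window(sequence, position, start_offset, end_offset):
    """Return sequence[position+start_offset .. position+end_offset].

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    """
    start = position + start_offset
    end = position + end_offset
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError("window falls outside the sequence")
    # Convert the 1-based inclusive range to a Python slice.
    return sequence[start - 1:end]


toy = "ACGT" * 25  # 100 bases of toy sequence
fragment = window(toy, 50, -20, 20)
print(len(fragment))  # 20 bases before, the position itself, 20 bases after
```

The length is always end_offset - start_offset + 1, which gives 41 for offsets -20 and 20.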
Annotate genomic position
Given a genomic position, we can retrieve the gene at the position and the closest upstream and downstream gene on the same strand:
(gene_up, gene_id, gene_down, gene_interval, gene_feature) = pybio.genomes.annotate("hg38", "1", "-", 450000)
The above command would return:
gene_id: ENSG00000237094
gene_interval: (379972, 450701, 'i')
gene_feature: 'intron'
gene_up: 'ENSG00000284733'
gene_down: 'ENSG00000228463'
Gene ENSG00000237094 is located at position 450000. More specifically, the position lies in an intron of the gene, and the intron spans the region 379972..450701.
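The kind of lookup behind such a query can be sketched with a sorted list of intervals and a binary search. This is not how pybio implements annotate. It is a minimal illustration on toy data, assuming non-overlapping genes on a single strand sorted by start coordinate; the function locate and the gene names are hypothetical.

```python
import bisect


def locate(genes, position):
    """Find the gene containing position and its neighbours by coordinate.

    genes: list of (start, end, gene_id), sorted by start, non-overlapping.
    Returns (previous_gene, containing_gene, next_gene); each may be None.
    """
    starts = [start for start, _, _ in genes]
    # Index of the last gene that starts at or before the position.
    i = bisect.bisect_right(starts, position) - 1
    hit = None
    if i >= 0 and genes[i][0] <= position <= genes[i][1]:
        hit = genes[i][2]
        prev_i, next_i = i - 1, i + 1
    else:
        # Intergenic position: gene i lies before it, gene i + 1 after it.
        prev_i, next_i = i, i + 1
    prev_gene = genes[prev_i][2] if prev_i >= 0 else None
    next_gene = genes[next_i][2] if next_i < len(genes) else None
    return prev_gene, hit, next_gene


toy_genes = [(100, 200, "geneA"), (300, 400, "geneB"), (500, 600, "geneC")]
print(locate(toy_genes, 350))
```

On the "+" strand the previous gene by coordinate is the upstream one; on the "-" strand the two neighbours swap roles.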
Importing genome annotation
pybio imports genome annotations from Ensembl or from a GTF file. The Ensembl import reads the tab-separated file generated by querying Biomart.
b = pybio.data.bedGraph()
Use the issues page to report issues and leave suggestions.