Genome Annotation Paser (GAP) is a tool for parsing Gencode/Ensembl GTF and NCBI GFF3 genome annotation files.
Python 2.7 or Python 3
pysam: https://pypi.org/project/pysam/
Add these lines to your ~/.bash_profile
, [GAP]
is the absolute path of GAP.
export PYTHONPATH=[GAP]:$PYTHONPATH
export PATH=[GAP]:$PATH
- Fetch transcritome file from genome with GFF/GTF annotation;
- Conversion between three coordinate systems: Gene/Transcript/Genome;
- Get transcription information (chrID, gene type, length, CDS region...) given a transcription ID or gene name;
- Show mRNA structure
- ...
Gencode GTF/Genome: https://www.gencodegenes.org
Enesembl GTF/Genome: https://ensembl.org/info/data/ftp/index.html
NCBI GFF3/Genome: ftp://ftp.ncbi.nlm.nih.gov/genomes
parseGTF.py -g chicken.gtf -o chicken -s ensembl --genome chicken_ensembl.fa
The genome file --genome chicken_ensembl.fa
is optional.
It will produce three files:
chicken.genomeCoor.bed
-- A simple version of the genome-based annotation filechicken.transCoor.bed
-- A simple version of the transcript-based annotation filechicken_transcriptome.fa
-- Transcriptome file
import GAP
chicken_parser = GAP.init("chicken.genomeCoor.bed", "chicken_transcriptome.fa")
Another way is to read the GTF file directly.
chicken_parser = GAP.initGTF("chicken.gtf", genomeFile="chicken_ensembl.fa", source='Ensembl')
mRNAs = chicken_parser.getmRNATransList(); print len(mRNAs)
# 30252
print mRNAs[:5]
# ['ENSGALT00000005443', 'ENSGALT00000005447', 'ENSGALT00000013873', 'ENSGALT00000001353', 'ENSGALT00000001352']
It shows that chicken has 30252 mRNA transcripts.
GAPDH_gID = chicken_parser.getGeneByGeneName("GAPDH"); print GAPDH_gID
# ENSGALG00000014442
GAPDH_transcripts = chicken_parser.getTransByGeneID(GAPDH_gID); print GAPDH_transcripts
# ['ENSGALT00000046744', 'ENSGALT00000086833', 'ENSGALT00000023323', 'ENSGALT00000086032', 'ENSGALT00000074237', 'ENSGALT00000090208', 'ENSGALT00000051222', 'ENSGALT00000054080', 'ENSGALT00000089752', 'ENSGALT00000085687']
## Print all transcript gene type and length
for tid in GAPDH_transcripts:
trans_feature = chicken_parser.getTransFeature(tid)
print tid, trans_feature['gene_type'], trans_feature['trans_len']
# ENSGALT00000046744 protein_coding 1076
# ENSGALT00000086833 protein_coding 1122
# ENSGALT00000023323 protein_coding 1288
# ENSGALT00000086032 protein_coding 1302
# ENSGALT00000074237 protein_coding 1091
# ENSGALT00000090208 protein_coding 670
# ENSGALT00000051222 protein_coding 1179
# ENSGALT00000054080 protein_coding 1498
# ENSGALT00000089752 protein_coding 434
# ENSGALT00000085687 protein_coding 991
## Method 1
ft = chicken_parser.getTransFeature("ENSGALT00000054080")
cds_start, cds_end = ft['cds_start'], ft['cds_end']
GAPDH = chicken_parser.getTransSeq("ENSGALT00000054080")
UTR_5 = GAPDH[:cds_start-1]
CDS = GAPDH[cds_start-1:cds_end]
UTR_3 = GAPDH[cds_end:]
## Method 2
print chicken_parser.showRNAStructure("ENSGALT00000054080")
chrID, chrPos, strand = chicken_parser.transCoor2genomeCoor("ENSGALT00000054080", cds_start)
print chrID, chrPos, strand
# ['1', 76953317, '+']
It shows that GAPDH start codon is located at 76953317 of positive strand of chromosome 1
## Get Sequence
seq = chicken_parser.getTransSeq("ENSGALT00000054080")
## Collect motif sites
locs = []
start = 0
while 1:
start = seq.find("GGAC", start+1)
if start == -1: break
locs.append(start)
## Label
for loc in locs:
print str(loc)+"\t"+chicken_parser.labelRNAPosition("ENSGALT00000054080", [loc,loc])
There are two ways to init a GAP object:
- Convert the GTF or GFF3 file to a *.genomeCoor.bed file and read it;
- Read the GTF or GFF3 directly.
parseGTF.py -g chicken.gtf -o chicken -s ensembl --genome chicken_ensembl.fa
The --genome
is optional.
import GAP
chicken_parser = GAP.init("chicken.genomeCoor.bed", "chicken_transcriptome.fa")
import GAP
chicken_parser = GAP.initGTF("chicken.gtf", genomeFile="chicken_ensembl.fa", source='Ensembl')
- geneCoor2genomeCoor
- genomeCoor2geneCoor
- genomeCoor2transCoor
- transCoor2genomeCoor
- transCoor2geneCoor
- geneCoor2transCoor
- getTransList
- getGeneList
- getmRNATransList
- getmRNAGeneList
- getLenSortedTransDictForGenes
- getTransByGeneID
- getGeneByGeneName
- getTransByGeneName
- getTransByGeneType
Parse GTF/GFF3 file and produce *.genomeCoor.bed, *.transCoor.bed and *_transcriptome.fa files
-g genome annotation, Gencode/Ensembl GTF file or NCBI GFF3 -o output file prefix -s [ensembl|gencode|ncbi] data source optional: --genome fetch transcriptome from genome file, produce a prefix_transcriptome.fa file --noscaffold Remove scaffolds. Scaffolds are defined as those chromosomes with id length > 6 and not startswith chr and NC_ --rawchr Use raw chromosome ID, don't convert NC_* to chrXX. Only useful for NCBI source GFF3
*.genomeCoor.bed, *.transCoor.bed and *_transcriptome.fa
Combine single chromosomes download from NCBI. Convert the NC_* code to chr* code NC_006088.5 => chr1
./combineNCBIGenome.py chr1.fasta chr2.fasta chr3.fasta... > outFile.fa
A combined genome file with chr* code
Init a GAP object from *.genomeCoor.bed file
genomeCoorBedFile -- A *.genomeCoor.bed file produced by parseGTF.py seqFn -- Transcriptome fasta file produced by parseGTF.py showAttr -- Show an example rem_tVersion -- Remove version information. ENST000000022311.2 => ENST000000022311 rem_gVersion -- Remove version information. ENSG000000022311.2 => ENSG000000022311
GAP object
initGTF(AnnotationGTF, source, genomeFile="", showAttr=True, rem_tVersion=False, rem_gVersion=False, verbose=False)
Init a GAP object from GTF/GFF3 file
AnnotationGTF -- Ensembl/Gencode GTF file or NCBI GFF3 file genomeFile -- Genome file source -- Gencode/Ensembl/NCBI showAttr -- Show an example rem_tVersion -- Remove version information. ENST000000022311.2 => ENST000000022311 rem_gVersion -- Remove version information. ENSG000000022311.2 => ENSG000000022311 verbose -- Show process information
GAP object
If the sequence is not provided when init a GAP object, add the sequence to it.
seqFileName -- Transcriptome fasta file produced by parseGTF.py remove_tid_version -- Remove version information. ENST000000022311.2 => ENST000000022311
No
Get transcript features given the transcript id
transID -- Transcript ID showAttr -- Show an example verbose -- Print the warning information when transcript not found
Return a dictionary: chr -- Genome chromosome strand -- + or - start -- Genome start end -- Genome end gene_name -- Gene symbol gene_id -- Gene id gene_type -- Gene type trans_len -- Transcript length utr_5_start -- Transcript-based start site of 5'UTR utr_5_end -- Transcript-based end site of 5'UTR cds_start -- Transcript-based start site of CDS cds_end -- Transcript-based end site of CDS utr_3_start -- Transcript-based start site of 3'UTR utr_3_end -- Transcript-based end site of 3'UTR exon_str -- Genome-based exon string
Get gene parser with gene informations
showAttr -- Show an example
Return a list of dictionaries: { geneID => { ... }, ... } includes chr -- Genome chromosome strand -- + or - start -- Genome start end -- Genome end gene_name -- Gene symbol gene_type -- All transcript types length -- Gene length (end-start+1) transcript -- All transcripts belong to this gene
Get genome-based intron corrdinates of given gene
geneID -- Gene id
Get introns of all transcripts of this gene: { transID => [[intron1_start, intron1_end], [intron2_start, intron2_end], [intron3_start, intron3_end]],... }
Get genome-based exon corrdinates of given gene
geneID -- Gene id
Get exons of all transcripts of this gene: { transID => [[exon1_start, exon1_end], [exon2_start, exon2_end], [exon3_start, exon3_end]...],... }
Convert gene-based coordinates to genome-based coordinates
geneID -- Gene id pos -- Gene-based position
Return: [chrID, chrPos, Strand]
Convert genome-based coordinates to gene-based coordinates
chrID -- Chromosome id start -- Chromosome-based start position end -- Chromosome-based end position strand -- Chromosome strand
Return [ [chrID, chrStart, chrEnd, geneID, geneStart, geneEnd], ... ]
Convert genome-based coordinates to transcript-based coordinates
chrID -- Chromosome id start -- Chromosome-based start position end -- Chromosome-based end position strand -- Chromosome strand
Return [ [chrID, chrStart, chrEnd, transID, transStart, transEnd], ... ]
Convert transcript-based coordinates to genome-based coordinates
transID -- Transcript id pos -- Transcript-based position
Return [chrID, chrPos, Strand]
Convert gene-based coordinates to transcript-based coordinates
geneID -- Gene id start -- Gene-based start position end -- Gene-based end position
Return [chrID, chrStart, chrEnd, transID, transStart, transEnd], ... ]
Convert transcript-based coordinates to gene-based coordinates
transID -- Transcript id start -- Transcript-based start position end -- Transcript-based end position
return [ [chrID, chrStart, chrEnd, geneID, geneStart, geneEnd], ... ]
Return list of all transcript id
No
return [ transID1, transID2, ... ]
Return list of all gene id
No
return [ geneID1, geneID2, ... ]
Return list of all mRNA id
No
return [ mRNAID1, mRNAID2, ... ]
Return list of all mRNA gene id
No
return [ mRNAGeneID1, mRNAGeneID2, ... ]
Return a dictionary of geneID to sorted transID list
only -- Contraint transcript types, such as mRNA, snRNA ...
return { geneID => [ transID1, transID2, ... ], ...} transcripts are sorted by length
Return transcripts belong to specific gene
geneID -- Gene id
return [ transID1, transID2... ]
Return gene id with specific gene name
geneName -- Gene symbol
return geneID
Return transcripts belong to specific gene
geneName -- Gene symbol
return [ transID1, transID2... ]
Return a list of transcripts belong to specific gene type
geneType -- Gene type
return [ transID1, transID2... ]
Return transcript sequence
transID -- Transcript id
return sequence
Parse gene intron/exon regions: 1. Combine exons from all transcripts, they are defined as exon regions 2. Remove the exon regions from gene regions, they are defined as intron regions
geneID -- Gene id verbose -- Print error information when error occured
return: [intron_regions, exon_regions]
Return string of mRNA sequence with color labeled its UTR/CDS/Codons
transID -- Transcript id
return: colored sequence
Return string of mRNA sequence with color labeled its UTR/CDS/Codons return RNA structure and label the region -------|||||||||||||||||-------------------------- 5'UTR CDS 3'UTR
transID -- Transcript id region -- [start, end] bn -- Bin number (default: 50) bw -- Bin width bw and bn cannot be specified at the same time
return: -------|||||||||||||||||--------------------------
Get the position of region located in mRNA
transID -- Transcript id region -- [start, end]
Return any one of: -> not_mRNA -> 5UTR -> span_5UTR_CDS -> CDS -> span_CDS_3UTR -> 3UTR -> span_5UTR_CDS_3UTR -> INVALID
Complete documentation is in doc/Introduction.html
- Li Pan - Programmer - Zhanglab