Skip to content

GFF Annotation Paser (GAP) is a tool for parsing Gencode/Ensembl GTF and NCBI GFF3 genome annotation files.

Notifications You must be signed in to change notification settings

lipan6461188/GAP

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GAP

Genome Annotation Paser (GAP) is a tool for parsing Gencode/Ensembl GTF and NCBI GFF3 genome annotation files.

Installation

Prerequisites

Python 2.7 or Python 3

pysam: https://pypi.org/project/pysam/

Installing

Add these lines to your ~/.bash_profile, [GAP] is the absolute path of GAP.

export PYTHONPATH=[GAP]:$PYTHONPATH
export PATH=[GAP]:$PATH

Main functions

  • Fetch transcritome file from genome with GFF/GTF annotation;
  • Conversion between three coordinate systems: Gene/Transcript/Genome;
  • Get transcription information (chrID, gene type, length, CDS region...) given a transcription ID or gene name;
  • Show mRNA structure
  • ...

Download data

Gencode GTF/Genome: https://www.gencodegenes.org
Enesembl GTF/Genome: https://ensembl.org/info/data/ftp/index.html
NCBI GFF3/Genome: ftp://ftp.ncbi.nlm.nih.gov/genomes

Quick start

1. Convert a GTF file to *.genomeCoor.bed file

parseGTF.py -g chicken.gtf -o chicken -s ensembl --genome chicken_ensembl.fa

The genome file --genome chicken_ensembl.fa is optional.

It will produce three files:

  • chicken.genomeCoor.bed -- A simple version of the genome-based annotation file
  • chicken.transCoor.bed -- A simple version of the transcript-based annotation file
  • chicken_transcriptome.fa -- Transcriptome file

2. Read the annotation

import GAP
chicken_parser = GAP.init("chicken.genomeCoor.bed", "chicken_transcriptome.fa")

Another way is to read the GTF file directly.

chicken_parser = GAP.initGTF("chicken.gtf", genomeFile="chicken_ensembl.fa", source='Ensembl')

3. Get all mRNAs

mRNAs = chicken_parser.getmRNATransList(); print len(mRNAs)
# 30252
print mRNAs[:5]
# ['ENSGALT00000005443', 'ENSGALT00000005447', 'ENSGALT00000013873', 'ENSGALT00000001353', 'ENSGALT00000001352']

It shows that chicken has 30252 mRNA transcripts.

4. Get GAPDH gene id and transcripts

GAPDH_gID = chicken_parser.getGeneByGeneName("GAPDH"); print GAPDH_gID
# ENSGALG00000014442

GAPDH_transcripts = chicken_parser.getTransByGeneID(GAPDH_gID); print GAPDH_transcripts
# ['ENSGALT00000046744', 'ENSGALT00000086833', 'ENSGALT00000023323', 'ENSGALT00000086032', 'ENSGALT00000074237', 'ENSGALT00000090208', 'ENSGALT00000051222', 'ENSGALT00000054080', 'ENSGALT00000089752', 'ENSGALT00000085687']

## Print all transcript gene type and length
for tid in GAPDH_transcripts:
	trans_feature = chicken_parser.getTransFeature(tid)
	print tid, trans_feature['gene_type'], trans_feature['trans_len']

# ENSGALT00000046744 protein_coding 1076
# ENSGALT00000086833 protein_coding 1122
# ENSGALT00000023323 protein_coding 1288
# ENSGALT00000086032 protein_coding 1302
# ENSGALT00000074237 protein_coding 1091
# ENSGALT00000090208 protein_coding 670
# ENSGALT00000051222 protein_coding 1179
# ENSGALT00000054080 protein_coding 1498
# ENSGALT00000089752 protein_coding 434
# ENSGALT00000085687 protein_coding 991

5. Get UTR and CDS region of longest transcript of GAPDH

## Method 1
ft = chicken_parser.getTransFeature("ENSGALT00000054080")
cds_start, cds_end = ft['cds_start'], ft['cds_end']
GAPDH = chicken_parser.getTransSeq("ENSGALT00000054080")

UTR_5 = GAPDH[:cds_start-1]
CDS = GAPDH[cds_start-1:cds_end]
UTR_3 = GAPDH[cds_end:]

## Method 2
print chicken_parser.showRNAStructure("ENSGALT00000054080")

6. Get genome coordination of GAPDH start codon

chrID, chrPos, strand = chicken_parser.transCoor2genomeCoor("ENSGALT00000054080", cds_start)
print chrID, chrPos, strand
# ['1', 76953317, '+']

It shows that GAPDH start codon is located at 76953317 of positive strand of chromosome 1

7. Find and label all GGAC motifs in GAPDH

## Get Sequence
seq = chicken_parser.getTransSeq("ENSGALT00000054080")

## Collect motif sites
locs = []
start = 0
while 1:
    start = seq.find("GGAC", start+1)
    if start == -1: break
    locs.append(start)

## Label
for loc in locs:
    print str(loc)+"\t"+chicken_parser.labelRNAPosition("ENSGALT00000054080", [loc,loc])

Subjects

Init a GAP object

There are two ways to init a GAP object:

  1. Convert the GTF or GFF3 file to a *.genomeCoor.bed file and read it;
  2. Read the GTF or GFF3 directly.

Method 1 -- Parse GTF/GFF3 file

parseGTF.py -g chicken.gtf -o chicken -s ensembl --genome chicken_ensembl.fa

The --genome is optional.

import GAP
chicken_parser = GAP.init("chicken.genomeCoor.bed", "chicken_transcriptome.fa")

Method 2 -- Read GTF/GFF3 directly

import GAP
chicken_parser = GAP.initGTF("chicken.gtf", genomeFile="chicken_ensembl.fa", source='Ensembl')

Get transcript or gene information

Coordinate transformation

Get transcript list or gene list

Get transcript sequence and mRNA structure

Excutable files

parseGTF.py

Features
Parse GTF/GFF3 file and produce *.genomeCoor.bed, *.transCoor.bed and *_transcriptome.fa files
Parameters
-g         genome annotation, Gencode/Ensembl GTF file or NCBI GFF3
-o         output file prefix
-s         [ensembl|gencode|ncbi] data source

optional:
--genome   fetch transcriptome from genome file, produce a prefix_transcriptome.fa file
--noscaffold	Remove scaffolds. Scaffolds are defined as those chromosomes with id length > 6 and not startswith chr and NC_
--rawchr	Use raw chromosome ID, don't convert NC_* to chrXX. Only useful for NCBI source GFF3
Output
*.genomeCoor.bed, *.transCoor.bed and *_transcriptome.fa

combineNCBIGenome.py

Features
Combine single chromosomes download from NCBI. Convert the NC_* code to chr* code
NC_006088.5 => chr1
Parameters
./combineNCBIGenome.py chr1.fasta chr2.fasta chr3.fasta... > outFile.fa
Return
A combined genome file with chr* code

Functions

GAP.init(genomeCoorBedFile, seqFn="", showAttr=True, rem_tVersion=False, rem_gVersion=False)

Features
Init a GAP object from *.genomeCoor.bed file
Parameters
genomeCoorBedFile   -- A *.genomeCoor.bed file produced by parseGTF.py
seqFn               -- Transcriptome fasta file produced by parseGTF.py
showAttr            -- Show an example
rem_tVersion        -- Remove version information. ENST000000022311.2 => ENST000000022311
rem_gVersion        -- Remove version information. ENSG000000022311.2 => ENSG000000022311
Return
GAP object

initGTF(AnnotationGTF, source, genomeFile="", showAttr=True, rem_tVersion=False, rem_gVersion=False, verbose=False)

Features
Init a GAP object from GTF/GFF3 file
Parameters
AnnotationGTF       -- Ensembl/Gencode GTF file or NCBI GFF3 file
genomeFile          -- Genome file
source              -- Gencode/Ensembl/NCBI
showAttr            -- Show an example
rem_tVersion        -- Remove version information. ENST000000022311.2 => ENST000000022311
rem_gVersion        -- Remove version information. ENSG000000022311.2 => ENSG000000022311
verbose             -- Show process information
Return
GAP object

addSeq(seqFileName, remove_tid_version=False)

Features
If the sequence is not provided when init a GAP object, add the sequence to it.
Parameters
seqFileName         -- Transcriptome fasta file produced by parseGTF.py
remove_tid_version  -- Remove version information. ENST000000022311.2 => ENST000000022311
Return
No

getTransFeature(transID, showAttr=False, verbose=True)

Features
Get transcript features given the transcript id
Parameters
transID             -- Transcript ID
showAttr            -- Show an example
verbose             -- Print the warning information when transcript not found
Return
Return a dictionary:
    chr             -- Genome chromosome
    strand          -- + or -
    start           -- Genome start
    end             -- Genome end
    
    gene_name       -- Gene symbol
    gene_id         -- Gene id
    gene_type       -- Gene type
    
    trans_len       -- Transcript length
    
    utr_5_start     -- Transcript-based start site of 5'UTR
    utr_5_end       -- Transcript-based end site of 5'UTR
    cds_start       -- Transcript-based start site of CDS
    cds_end         -- Transcript-based end site of CDS
    utr_3_start     -- Transcript-based start site of 3'UTR
    utr_3_end       -- Transcript-based end site of 3'UTR
    
    exon_str        -- Genome-based exon string

getGeneParser(showAttr=True)

Features
Get gene parser with gene informations
Parameters
showAttr            -- Show an example
Return
Return a list of dictionaries:
    { geneID => { ... }, ... } includes
        chr         -- Genome chromosome
        strand      -- + or -
        start       -- Genome start
        end         -- Genome end
        
        gene_name   -- Gene symbol
        gene_type   -- All transcript types
        
        length      -- Gene length (end-start+1)
        
        transcript  -- All transcripts belong to this gene

getGeneIntron(geneID)

Features
Get genome-based intron corrdinates of given gene
Parameters
geneID              -- Gene id
Return
Get introns of all transcripts of this gene:
            { transID => [[intron1_start, intron1_end], [intron2_start, intron2_end], [intron3_start, intron3_end]],... }

getGeneExon(geneID)

Features
Get genome-based exon corrdinates of given gene
Parameters
geneID              -- Gene id
Return
Get exons of all transcripts of this gene:
            { transID => [[exon1_start, exon1_end], [exon2_start, exon2_end], [exon3_start, exon3_end]...],... }

geneCoor2genomeCoor(geneID, pos)

Features
Convert gene-based coordinates to genome-based coordinates
Parameters
geneID              -- Gene id
pos                 -- Gene-based position
Return
Return: [chrID, chrPos, Strand]

genomeCoor2geneCoor(chrID, start, end, strand)

Features
Convert genome-based coordinates to gene-based coordinates
Parameters
chrID               -- Chromosome id
start               -- Chromosome-based start position
end                 -- Chromosome-based end position
strand              -- Chromosome strand
Return
Return [ [chrID, chrStart, chrEnd, geneID, geneStart, geneEnd], ... ]

genomeCoor2transCoor(chrID, start, end, strand)

Features
Convert genome-based coordinates to transcript-based coordinates
Parameters
chrID               -- Chromosome id
start               -- Chromosome-based start position
end                 -- Chromosome-based end position
strand              -- Chromosome strand
Return
Return [ [chrID, chrStart, chrEnd, transID, transStart, transEnd], ... ]

transCoor2genomeCoor(transID, pos)

Features
Convert transcript-based coordinates to genome-based coordinates
Parameters
transID             -- Transcript id
pos                 -- Transcript-based position
Return
Return [chrID, chrPos, Strand]

geneCoor2transCoor(geneID, start, end)

Features
Convert gene-based coordinates to transcript-based coordinates
Parameters
geneID              -- Gene id
start               -- Gene-based start position
end                 -- Gene-based end position
Return
Return  [chrID, chrStart, chrEnd, transID, transStart, transEnd], ... ]

transCoor2geneCoor(transID, start, end)

Features
Convert transcript-based coordinates to gene-based coordinates
Parameters
transID             -- Transcript id
start               -- Transcript-based start position
end                 -- Transcript-based end position
Return
return [ [chrID, chrStart, chrEnd, geneID, geneStart, geneEnd], ... ]

getTransList()

Features
Return list of all transcript id
Parameters
No
Return
return [ transID1, transID2, ... ]

getGeneList()

Features
Return list of all gene id
Parameters
No
Return
return [ geneID1, geneID2, ... ]

getmRNATransList()

Features
Return list of all mRNA id
Parameters
No
Return
return [ mRNAID1, mRNAID2, ... ]

getmRNAGeneList()

Features
Return list of all mRNA gene id
Parameters
No
Return
return [ mRNAGeneID1, mRNAGeneID2, ... ]

getLenSortedTransDictForGenes(only=[])

Features
Return a dictionary of geneID to sorted transID list
Parameters
only                --  Contraint transcript types, such as mRNA, snRNA ...
Return
return { geneID => [ transID1, transID2, ... ], ...}
transcripts are sorted by length

getTransByGeneID(geneID)

Features
Return transcripts belong to specific gene
Parameters
geneID              -- Gene id
Return
return [ transID1, transID2... ]

getGeneByGeneName(geneName)

Features
Return gene id with specific gene name
Parameters
geneName            -- Gene symbol
Return
return geneID

getTransByGeneName(geneName)

Features
Return transcripts belong to specific gene
Parameters
geneName            -- Gene symbol
Return
return [ transID1, transID2... ]

getTransByGeneType(geneType)

Features
 Return a list of transcripts belong to specific gene type
Parameters
geneType            -- Gene type
Return
return [ transID1, transID2... ]

getTransSeq(transID)

Features
 Return transcript sequence
Parameters
transID             -- Transcript id
Return
return sequence

getGeneCombinedIntronExon(geneID, verbose=True)

Features
Parse gene intron/exon regions:
    1. Combine exons from all transcripts, they are defined as exon regions
    2. Remove the exon regions from gene regions, they are defined as intron regions
Parameters
geneID          -- Gene id
verbose         -- Print error information when error occured
Return
return: [intron_regions, exon_regions]

showRNAStructure(transID)

Features
Return string of mRNA sequence with color labeled its UTR/CDS/Codons
Parameters
transID         -- Transcript id
Return
return: colored sequence

labelRNAPosition(transID, region, bn=None, bw=None)

Features
Return string of mRNA sequence with color labeled its UTR/CDS/Codons

        return RNA structure and label the region
        -------|||||||||||||||||--------------------------
         5'UTR        CDS                3'UTR
Parameters
transID         -- Transcript id
region          -- [start, end]
bn              -- Bin number (default: 50)
bw              -- Bin width
bw and bn cannot be specified at the same time
Return
return: -------|||||||||||||||||--------------------------

getRNAPosition(transID, region)

Features
Get the position of region located in mRNA
Parameters
transID         -- Transcript id
region          -- [start, end]
Return
Return any one of:
    -> not_mRNA
    -> 5UTR
    -> span_5UTR_CDS
    -> CDS
    -> span_CDS_3UTR
    -> 3UTR
    -> span_5UTR_CDS_3UTR
    -> INVALID

Notice

Complete documentation is in doc/Introduction.html

Authors

About

GFF Annotation Paser (GAP) is a tool for parsing Gencode/Ensembl GTF and NCBI GFF3 genome annotation files.

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages