# Compare Zhenxia and FlyBase GTFs

Zhenxia has created a GTF file using StringTie. It contains many transcript models for each gene, so we do not think it is a very good representation of the annotation. Nevertheless, I want to create some metrics to compare the two GTF files.

In [142]:
# Imports
import tempfile
from subprocess import check_call, PIPE

import numpy as np
import pandas as pd

import gffutils
import gffutils.pybedtools_integration
import pybedtools

It will be best to focus on a small subset and go from there. I will start by only looking at genes on **Y**. 

## Create GTF databases for the Y chromosome

For testing, I want to keep only the Y chromosome because it is small and easy to work with. I will make a Y-only version of both the FlyBase and StringTie GTFs and then create in-memory gffutils databases from them.
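The cells that follow do this filtering with `grep`. As an alternative, here is a minimal pure-Python sketch of the same idea. The helper name `filter_gtf_by_chrom` and the toy records are hypothetical, not part of the notebook. Note one difference: `grep "^chrY"` keeps any sequence name that starts with `chrY`, while this sketch matches the first column exactly, so a contig with a name like `chrY_random` (a made-up name here) would be dropped. Which behaviour is wanted depends on the input files.

```python
import io


def filter_gtf_by_chrom(src, dst, chrom="chrY"):
    """Copy GTF records whose first column equals `chrom` from src to dst.

    `src` and `dst` are text file handles. Returns the number of records kept.
    """
    kept = 0
    for line in src:
        if line.startswith("#"):
            continue  # skip header/comment lines
        if line.split("\t", 1)[0] == chrom:
            dst.write(line)
            kept += 1
    return kept


# Toy input (hypothetical records, only for illustration)
example = io.StringIO(
    'chrY\tFlyBase\tgene\t100\t200\t.\t+\t.\tgene_id "g1";\n'
    'chrX\tFlyBase\tgene\t5\t50\t.\t-\t.\tgene_id "g2";\n'
    'chrY_random\tFlyBase\tgene\t1\t9\t.\t+\t.\tgene_id "g3";\n'
)
out = io.StringIO()
kept = filter_gtf_by_chrom(example, out)
```

With real files, `src` and `dst` would be handles returned by `open()`.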

In [237]:
# Create a temp file for FlyBase Y chrom GTF
dmel = tempfile.mkstemp()[1]

# Pull out only the Y chrom and write to tmp
cmd = 'grep -e "^chrY" ../../output/dmel-all-r6.08.chr.gtf > {0}'.format(dmel)
check_call(cmd, shell=True)

FlyBaseDB = gffutils.create_db(dmel, ":memory:", id_spec={'gene': 'gene_id', 'transcript': 'transcript_id'})

  "It appears you have a gene feature in your GTF "


In [238]:
# Create a temp file for StringTie Y chrom GTF
tie = tempfile.mkstemp()[1]

# Pull out only the Y chrom and write to tmp
cmd = 'grep -e "^chrY" ../../data/zhenxia/all.ucsc.gtf > {0}'.format(tie)
check_call(cmd, shell=True)

TieDB = gffutils.create_db(tie, ":memory:", id_spec={'gene': 'gene_id', 'transcript': 'transcript_id'})

  "It appears you have a transcript feature in your GTF "


In [232]:
print("These are the types of features in the FlyBase GTF")
list(FlyBaseDB.featuretypes())

These are the types of features in the FlyBase GTF


['3UTR',
 '5UTR',
 'CDS',
 'exon',
 'gene',
 'mRNA',
 'ncRNA',
 'pseudogene',
 'start_codon',
 'stop_codon',
 'transcript']

In [233]:
print("These are the types of features in the StringTie GTF")
list(TieDB.featuretypes())

These are the types of features in the StringTie GTF


['exon', 'gene', 'transcript']

## Basic Summary Stats

I will now look at some basic summary stats.

In [82]:
def get_feature_len(features, func=np.mean):
    # GTF coordinates are 1-based and inclusive, so length is end - start + 1
    return func(np.array([x.end - x.start + 1 for x in features]))

In [234]:
stats = {'FB': {}, 'TIE': {}, 'COMPARE':{}}

stats['FB']['num_genes'] = len(list(FlyBaseDB.features_of_type('gene')))
stats['TIE']['num_genes'] = len(list(TieDB.features_of_type('gene')))

stats['FB']['num_ts'] = len(list(FlyBaseDB.features_of_type('transcript')))
stats['TIE']['num_ts'] = len(list(TieDB.features_of_type('transcript')))

stats['FB']['avg_ts_len'] = get_feature_len(FlyBaseDB.features_of_type('transcript'))
stats['TIE']['avg_ts_len'] = get_feature_len(TieDB.features_of_type('transcript'))

stats['FB']['med_ts_len'] = get_feature_len(FlyBaseDB.features_of_type('transcript'), func=np.median)
stats['TIE']['med_ts_len'] = get_feature_len(TieDB.features_of_type('transcript'), func=np.median)

stats['FB']['max_ts_len'] = get_feature_len(FlyBaseDB.features_of_type('transcript'), func=np.max)
stats['TIE']['max_ts_len'] = get_feature_len(TieDB.features_of_type('transcript'), func=np.max)

stats['FB']['min_ts_len'] = get_feature_len(FlyBaseDB.features_of_type('transcript'), func=np.min)
stats['TIE']['min_ts_len'] = get_feature_len(TieDB.features_of_type('transcript'), func=np.min)

stats['FB']['num_exons'] = len(list(FlyBaseDB.features_of_type('exon')))
stats['TIE']['num_exons'] = len(list(TieDB.features_of_type('exon')))

stats['FB']['avg_exon_len'] = get_feature_len(FlyBaseDB.features_of_type('exon'))
stats['TIE']['avg_exon_len'] = get_feature_len(TieDB.features_of_type('exon'))

stats['FB']['med_exon_len'] = get_feature_len(FlyBaseDB.features_of_type('exon'), func=np.median)
stats['TIE']['med_exon_len'] = get_feature_len(TieDB.features_of_type('exon'), func=np.median)

stats['FB']['max_exon_len'] = get_feature_len(FlyBaseDB.features_of_type('exon'), func=np.max)
stats['TIE']['max_exon_len'] = get_feature_len(TieDB.features_of_type('exon'), func=np.max)

stats['FB']['min_exon_len'] = get_feature_len(FlyBaseDB.features_of_type('exon'), func=np.min)
stats['TIE']['min_exon_len'] = get_feature_len(TieDB.features_of_type('exon'), func=np.min)
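The cell above repeats the same pattern for every statistic. A more compact version could compute all summaries in one pass. This is only a sketch: it assumes features are reduced to `(start, end)` tuples, and the names `feature_length` and `summarize` are hypothetical. Because GTF coordinates are 1-based and inclusive, the sketch counts both endpoints. The toy intervals reuse three exon coordinates that appear in the StringTie output further down.

```python
import statistics


def feature_length(start, end):
    """Length of a GTF interval (1-based, inclusive coordinates)."""
    return end - start + 1


def summarize(intervals):
    """Return count, mean, median, max and min length for (start, end) tuples."""
    lengths = [feature_length(start, end) for start, end in intervals]
    return {
        "num": len(lengths),
        "avg_len": statistics.mean(lengths),
        "med_len": statistics.median(lengths),
        "max_len": max(lengths),
        "min_len": min(lengths),
    }


# Toy example only; not a result for the full annotation
toy_exons = [(1, 18), (80, 1197), (50330, 50352)]
toy_stats = summarize(toy_exons)
```

With gffutils, the tuples could be built as `[(x.start, x.end) for x in db.features_of_type('exon')]`, with `db` standing for either database.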

In [137]:
def get_bedtools(fbFeature='exon', tieFeature='exon'):
    fb = gffutils.pybedtools_integration.to_bedtool(FlyBaseDB.features_of_type(fbFeature))
    tie = gffutils.pybedtools_integration.to_bedtool(TieDB.features_of_type(tieFeature))
    return fb, tie

In [226]:
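# Count the StringTie exons that exactly match a FlyBase exon on the same strand.
# f=1.0 alone only requires the StringTie exon to be fully covered; r=True makes the requirement reciprocal.
fb, tie = get_bedtools(fbFeature='exon', tieFeature='exon')
inter = tie.intersect(fb, f=1.0, r=True, s=True)
stats['COMPARE']['exon_exact_match'] = inter.saveas('../../output/stringTie_FlyBase_exact_match.gtf').count()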
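To make "exact match" concrete: bedtools' `-f 1.0` on its own only requires that the query interval be completely covered by the reference, so a short exon nested inside a longer one also passes. An exact match needs identical coordinates and strand. The sketch below illustrates that distinction in plain Python. The function name and the toy intervals are hypothetical, and it is not how bedtools is implemented.

```python
def exact_matches(query, reference):
    """Return query intervals whose (chrom, start, end, strand) also occur in reference."""
    ref_keys = set(reference)
    return [interval for interval in query if interval in ref_keys]


# Toy intervals (hypothetical)
toy_tie = [
    ("chrY", 100, 200, "+"),  # identical to the reference exon
    ("chrY", 120, 180, "+"),  # nested: fully covered but not identical
    ("chrY", 100, 200, "-"),  # same coordinates, opposite strand
]
toy_fb = [("chrY", 100, 200, "+")]

hits = exact_matches(toy_tie, toy_fb)
```

Only the first toy interval is returned. The nested and opposite-strand intervals are excluded.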

In [255]:
# Exploratory: intersect the StringTie exons with themselves to look for partial matches
fb, tie = get_bedtools(fbFeature='exon', tieFeature='exon')
tie = tie.saveas()  # write the generator-backed BedTool to a temp file so it can be read twice
inter = tie.intersect(tie)
#stats['COMPARE']['exon_partial_match'] = inter.saveas('../../output/stringTie_FlyBase_fuzy_unique.gtf').count()

In [256]:
inter.head()

In [274]:
fb, tie = get_bedtools(fbFeature='exon', tieFeature='exon')

In [275]:
fb, tie2 = get_bedtools(fbFeature='exon', tieFeature='exon')

In [276]:
merged = tie.intersect(tie2, f=.9, r=True, s=True).saveas('../../output/stringTie_FlyBase_fuzy_unique.gtf')
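The `f=.9, r=True` combination asks for a reciprocal overlap: the shared region must cover at least 90% of both intervals. Below is a minimal sketch of that test for two `(start, end)` intervals, assuming 1-based inclusive coordinates. The function name is hypothetical, and strand (the `s=True` part) is ignored to keep it short. Also keep in mind that `bedtools intersect` by default reports the overlapping portion of each feature rather than the original feature, which affects how the output coordinates should be read.

```python
def reciprocal_overlap(a, b, min_frac=0.9):
    """True if the overlap covers at least `min_frac` of both a and b.

    a and b are (start, end) tuples in 1-based inclusive coordinates.
    """
    a_start, a_end = a
    b_start, b_end = b
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return False  # the intervals do not overlap at all
    len_a = a_end - a_start + 1
    len_b = b_end - b_start + 1
    return overlap / len_a >= min_frac and overlap / len_b >= min_frac


same = reciprocal_overlap((233, 486), (233, 486))     # identical intervals
nested = reciprocal_overlap((251, 274), (233, 486))   # short interval inside a long one
shifted = reciprocal_overlap((100, 200), (105, 204))  # similar length, small shift
```

Here `same` and `shifted` pass the test, while `nested` fails because the overlap is only a small fraction of the longer interval.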

In [273]:
merged.head()

chrY	StringTie	exon	1	18	1000	-	.	transcript_id "MSTRG.9704.1"; ; exon_number "1"; gene_id "MSTRG.9704"
chrY	StringTie	exon	80	1197	1000	-	.	transcript_id "MSTRG.9704.1"; ; exon_number "2"; gene_id "MSTRG.9704"
chrY	StringTie	exon	233	486	1000	-	.	transcript_id "MSTRG.9704.1"; ; exon_number "2"; gene_id "MSTRG.9704"
chrY	StringTie	exon	555	1197	1000	-	.	transcript_id "MSTRG.9704.1"; ; exon_number "2"; gene_id "MSTRG.9704"
chrY	StringTie	exon	251	274	1000	-	.	transcript_id "MSTRG.9704.1"; ; exon_number "2"; gene_id "MSTRG.9704"
chrY	StringTie	exon	394	1197	1000	-	.	transcript_id "MSTRG.9704.1"; ; exon_number "2"; gene_id "MSTRG.9704"
chrY	StringTie	exon	50330	50352	1000	-	.	transcript_id "MSTRG.9704.1"; ; exon_number "3"; gene_id "MSTRG.9704"
chrY	StringTie	exon	233	486	1000	+	.	transcript_id "MSTRG.9705.1"; ; exon_number "1"; gene_id "MSTRG.9705"
chrY	StringTie	exon	233	486	1000	+	.	transcript_id "MSTRG.9705.1"; ; exon_number "1"; gene_id "MSTRG.9705"
chrY	StringTie	exon	251	2

In [225]:
merged.count()

1482

In [None]:
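# Not yet run. `db` is undefined in this notebook; TieDB is assumed here. merge() needs position-sorted features.
merged_exons = list(TieDB.merge(TieDB.features_of_type('exon', strand='+', order_by='start')))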