# Defining Housekeeping Genes with Naive Bayes

De Ferrari, Luna, and Stuart Aitken. 2006. “Mining Housekeeping Genes with a Naive Bayes Classifier.” BMC Genomics 7 (October): 277.

They use a Naive Bayes classifier to label genes as either housekeeping or not. Here I convert the Ensembl IDs that they provide to current FBgns.

179 genes could not be mapped. After spot checking, these appear to have been removed from FlyBase.

In [2]:
import os
import sys
from pathlib import Path

from IPython.display import display, HTML, Markdown
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Project level imports
sys.path.insert(0, '../lib')
from larval_gonad.notebook import Nb
from larval_gonad.plotting import make_figs
from larval_gonad.config import memory

# Setup notebook
nbconfig = Nb.setup_notebook('2018-03-29_housekeeping_genes_Ferrari_et_al_2006')

last updated: 2018-03-29 
Git hash: f801087de045aa9513181cc6ad771b9c12b12309


In [16]:
df = pd.read_csv('../data/external/Ferrari_et_al_2006.tsv', sep='\t')

In [17]:
df.head()

Unnamed: 0,EMBL_gene_id Flybase_name,description,EMBL_transcript_id,cDNA_length,cds_length,exons_nr,5_MAR_presence,3_MAR_presence,5_polyA_18_presence,5_CCGNN_2_5_presence,perc_go_hk_match,perc_go_ts_match,is_hk,predicted_class,hk_probability
CG10000,,Putative polypeptide N-acetylgalactosaminyltra...,CG10000-RA,2528,1677,7,no,no,no,no,0,0.3333333,?,2:no,0.004
CG10001,AR-2,CG10001-PA [Source:RefSeq_peptide;Acc:NP_524544],CG10001-RA,1624,1074,4,no,yes,no,no,?,?,?,2:no,0.07
CG10002,fkh,Fork head protein. [Source:Uniprot/SWISSPROT;A...,CG10002-RA,3268,1533,1,no,yes,no,yes,?,?,?,2:no,0.22
CG10005,,CG10005-PA [Source:RefSeq_peptide;Acc:NP_650137],CG10005-RA,1049,696,4,no,no,no,no,?,?,?,2:no,0.06
CG10006,,CG10006-PA [Source:RefSeq_peptide;Acc:NP_648732],CG10006-RA,1560,1560,3,no,yes,no,yes,0,0.5,?,2:no,0.002


In [19]:
# Keep only the first column and predicted_class (the second-to-last column)
df = df.iloc[:, [0, -2]]

In [23]:
housekeeping = df[df.predicted_class == '1:yes']

In [24]:
anno = pd.read_csv('/data/LCDB/lcdb-references/dmel/r6-16/fb_annotation/dmel_r6-16.fb_annotation', sep='\t')

In [51]:
# Map every known identifier (symbol, annotation ID, secondary IDs)
# to the gene's current primary FBgn.
mapper = {}
for _, row in anno.iterrows():
    fbgn = row.primary_FBgn
    mapper[row.gene_symbol] = fbgn
    mapper[row.annotation_ID] = fbgn

    # Secondary fields are comma-separated strings, or NaN when empty
    if isinstance(row.secondary_FBgn, str):
        for sec in row.secondary_FBgn.split(','):
            mapper[sec] = fbgn

    if isinstance(row.secondary_annotation_ID, str):
        for sec in row.secondary_annotation_ID.split(','):
            mapper[sec] = fbgn
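
To make the lookup behaviour concrete, here is a minimal sketch of the same idea on a toy annotation table. The column names follow the FlyBase annotation file used above, but every ID and symbol below is made up for illustration, and `build_mapper` is a hypothetical helper rather than part of the project library.

```python
import pandas as pd

# Toy annotation table; the IDs and symbols are made up for illustration only.
anno_toy = pd.DataFrame({
    'gene_symbol': ['geneA', 'geneB'],
    'primary_FBgn': ['FBgn0000001', 'FBgn0000002'],
    'secondary_FBgn': ['FBgn0009991,FBgn0009992', None],
    'annotation_ID': ['CG00001', 'CG00002'],
    'secondary_annotation_ID': [None, 'CG09999'],
})


def build_mapper(anno):
    """Map symbols, annotation IDs, and secondary IDs to the primary FBgn."""
    mapper = {}
    for row in anno.itertuples(index=False):
        fbgn = row.primary_FBgn
        mapper[row.gene_symbol] = fbgn
        mapper[row.annotation_ID] = fbgn
        for field in (row.secondary_FBgn, row.secondary_annotation_ID):
            # Empty fields are None/NaN rather than str, so skip them.
            if isinstance(field, str):
                for sec in field.split(','):
                    mapper[sec] = fbgn
    return mapper


mapper_toy = build_mapper(anno_toy)
print(mapper_toy['CG09999'])  # a retired annotation ID resolves to the current gene
```

An outdated identifier such as the toy `CG09999` resolves to the primary FBgn of the gene that absorbed it, which is why the secondary fields are included in the mapper.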

In [58]:
# Look up each housekeeping accession; collect the ones with no current FBgn
missing = []
res = []
for key in housekeeping.index.tolist():
    try:
        res.append([key, mapper[key]])
    except KeyError:
        missing.append(key)
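
The same lookup can also be written without an explicit loop. The sketch below is one possible alternative, not what this notebook ran: it uses `Series.map`, which returns NaN for keys absent from the dict, and the accessions and mapper contents are hypothetical.

```python
import pandas as pd

# Hypothetical mapper and accession list, for illustration only.
mapper_toy = {'CG00001': 'FBgn0000001', 'CG00002': 'FBgn0000002'}
accessions = pd.Series(['CG00001', 'CG00002', 'CG00003'], name='accession')

# Series.map returns NaN for keys that are absent from the dict.
fbgn = accessions.map(mapper_toy).rename('FBgn')
mapped = pd.concat([accessions, fbgn], axis=1).dropna(subset=['FBgn'])
missing_toy = accessions[fbgn.isna()].tolist()

print(f'{len(mapped)} mapped, {len(missing_toy)} missing: {missing_toy}')
```

Reporting the two counts side by side makes it easy to spot an unexpectedly large number of unmapped accessions before writing anything to disk.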
        

In [65]:
df = pd.DataFrame(res, columns=['accession', 'FBgn'])

In [69]:
with open('../data/external/Ferrari_et_al_2006_housekeeping_FBgn.txt', 'w') as fh:
    fh.write('\n'.join(df.FBgn.values))
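
Because several old accessions may resolve to the same current FBgn, the written list could contain duplicates. The following is a small sketch, under that assumption, of de-duplicating before writing and then reading the file back as a sanity check. The IDs and the file name are hypothetical, and a temporary directory stands in for `../data/external`.

```python
import tempfile
from pathlib import Path

# Hypothetical FBgn list with a duplicate, for illustration only.
fbgns = ['FBgn0000001', 'FBgn0000002', 'FBgn0000001']

# dict.fromkeys drops duplicates while preserving first-seen order.
unique_fbgns = list(dict.fromkeys(fbgns))

with tempfile.TemporaryDirectory() as tmpdir:
    out = Path(tmpdir) / 'housekeeping_FBgn.txt'
    out.write_text('\n'.join(unique_fbgns) + '\n')
    # Read the file back and confirm nothing was lost or mangled.
    read_back = out.read_text().splitlines()

assert read_back == unique_fbgns
```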