### Goal

Rank genes by the number of associated CNEs and annotate them. Each CNE was assigned to its closest gene in the genome with find_closest_gene.py.

### Input

- closest_gene_counts_by_node.tsv: number of closest CNEs for each gene, generated with find_closest_gene.py
- GFF annotations (converted to gffutils databases)
- directories of proteome annotations (diamond blastp, pannzer, interproscan)
- gene_IPR_dict.pickle: dictionary of protein domains for each gene, generated with find_closest_gene.py
- gene_protein_dict.pickle: dictionary mapping gene IDs to protein IDs, generated with find_closest_gene.py
- all_IPR_desc.tsv: InterPro protein domain descriptions (retrieved with overrep_domains_closest_gene.py)
- summary_by_sp.tsv: output of overrep_domains_closest_gene.py

### Output

- species + "_ranked_genes.tsv": annotated list of all genes with more CNEs than average, ranked by number of associated CNEs
- species + "_immuno_ranked.tsv": same, restricted to immunoglobulin-containing genes
- species + "_homeo_ranked.tsv": same, restricted to homeodomain-containing genes
- species + "_TF_ranked.tsv": same, restricted to genes with other transcription factor domains

In [1]:
import pandas as pd
import glob
import pickle
import os
import gffutils

In [32]:
cne_counts_file = "../find_closest_gene/new_parse_gff/closest_gene_counts_by_node.tsv" 
cne_counts = pd.read_csv(cne_counts_file, sep="\t")
cne_counts

Unnamed: 0,species,gene,closest_cne_count,cne_node
0,spis,gene-LOC111326177,2,hexacorallia
1,spis,gene-LOC111327133,1,hexacorallia
2,spis,gene-LOC111328643,1,hexacorallia
3,spis,gene-LOC111331958,1,hexacorallia
4,spis,gene-LOC111325339,1,hexacorallia
...,...,...,...,...
138947,nvec,gene-LOC5521942,1,ambiguous
138948,nvec,gene-LOC5521953,1,ambiguous
138949,nvec,gene-LOC5522147,2,ambiguous
138950,nvec,gene-LOC116604813,1,ambiguous


In [33]:
orig_annot_dir = "../gff_db_files/"

In [34]:
annotations_dir = "../proteome_annotations/"
interpro_dir = "../proteome_annotations/interpro_files/"

In [35]:
gene_IPR_file = "../find_closest_gene/new_parse_gff/gene_IPR_dict.pickle"

In [36]:
with open(gene_IPR_file, 'rb') as handle:
    gene_IPR_dict = pickle.load(handle)

In [37]:
gene_IPR_dict

{'aaur': {'scaffold1.g1': [],
  'scaffold1.g2': [],
  'scaffold1.g3': ['IPR023179', 'IPR027417', 'IPR006073'],
  'scaffold1.g4': [],
  'scaffold1.g5': ['IPR011992',
   'IPR015153',
   'IPR015154',
   'IPR015154',
   'IPR000433',
   'IPR015153',
   'IPR011992',
   'IPR043145',
   'IPR015154',
   'IPR015153',
   'IPR011992'],
  'scaffold1.g6': ['IPR036020', 'IPR001202'],
  'scaffold1.g7': [],
  'scaffold1.g8': [],
  'scaffold1.g9': [],
  'scaffold1.g10': ['IPR001680',
   'IPR019775',
   'IPR036322',
   'IPR036322',
   'IPR001680',
   'IPR019775'],
  'scaffold1.g11': ['IPR036322'],
  'scaffold1.g12': [],
  'scaffold1.g13': ['IPR011598', 'IPR036638', 'IPR011598', 'IPR036638'],
  'scaffold1.g14': [],
  'scaffold1.g15': ['IPR007808', 'IPR038567'],
  'scaffold1.g16': [],
  'scaffold1.g17': [],
  'scaffold1.g18': [],
  'scaffold1.g19': [],
  'scaffold1.g20': ['IPR038187', 'IPR039370', 'IPR002715'],
  'scaffold1.g21': ['IPR018247', 'IPR011992'],
  'scaffold1.g22': [],
  'scaffold1.g23': [],
  '

In [38]:
gene_protein_file = "../find_closest_gene/new_parse_gff/gene_protein_dict.pickle"

In [39]:
with open(gene_protein_file, 'rb') as handle:
    gene_protein_dict = pickle.load(handle)

In [40]:
gene_protein_dict['pdam']

defaultdict(set,
            {'gene-LOC113673859': {'XP_027049641.1'},
             'gene-LOC113668373': {'XP_027040096.1'},
             'gene-LOC113667069': {'XP_027038702.1'},
             'gene-LOC113670418': {'XP_027042395.1'},
             'gene-LOC113666499': {'XP_027038121.1'},
             'gene-LOC113666399': {'XP_027038016.1'},
             'gene-LOC113670712': {'XP_027042728.1'},
             'gene-LOC113670806': {'XP_027042830.1'},
             'gene-LOC113668840': {'XP_027040630.1'},
             'gene-LOC113670613': {'XP_027042619.1'},
             'gene-LOC113665099': {'XP_027036909.1'},
             'gene-LOC113668074': {'XP_027039789.1'},
             'gene-LOC113666803': {'XP_027038432.1', 'XP_027038505.1'},
             'gene-LOC113666701': {'XP_027038326.1'},
             'gene-LOC113668672': {'XP_027040457.1'},
             'gene-LOC113686527': {'XP_027059994.1',
              'XP_027060065.1',
              'XP_027060134.1',
              'XP_027060212.1',
      

In [41]:
all_IPRS_desc = pd.read_csv("../overrep_domains/avg/with_conv_filt/all_IPR_desc.tsv", sep="\t")
all_IPRS_desc

Unnamed: 0,IPR_id,description
0,,
1,IPR002181,"Fibrinogen, alpha/beta/gamma chain, C-terminal..."
2,IPR036056,"Fibrinogen-like, C-terminal"
3,IPR000885,"Fibrillar collagen, C-terminal"
4,IPR002110,Ankyrin repeat
...,...,...
14329,IPR003566,T-cell surface glycoprotein CD5
14330,IPR037004,"Exonuclease VII, small subunit superfamily"
14331,IPR004124,"Glycoside hydrolase, family 33, N-terminal"
14332,IPR012480,Heparinase II/III-like


### Retrieve CNE thresholds

In [42]:
summary_by_sp_file = "../overrep_domains/new_parse_gff/avg/summary_by_sp.tsv"
summary_by_sp_df = pd.read_csv(summary_by_sp_file, sep="\t")
summary_by_sp_df

Unnamed: 0,species,cne_count,gene_count,cne_threshold,num_IPR_tested,num_sig_IPRs,num_homeo
0,hsym,6463,22022,0.293479,4375,21,0
1,hvul,2029,20058,0.101157,2064,0,0
2,pdam,81412,19935,4.083873,4895,76,4
3,ofav,42827,25929,1.651703,5081,22,3
4,epal,5026,22509,0.223288,2859,12,1
5,mvir,2654,24278,0.109317,2480,6,0
6,spis,106547,24846,4.288296,5348,71,4
7,chem,2289,45872,0.0499,1505,1,0
8,dgig,3562,22045,0.161579,1509,21,0
9,nvec,3471,23845,0.145565,2590,4,0


In [43]:
mean_cne_dict = {}
for idx, row in summary_by_sp_df.iterrows():
    species = row['species']
    mean_cne = row['cne_threshold']
    mean_cne_dict[species] = mean_cne
mean_cne_dict

{'aaur': 0.06659594921603076,
 'adig': 1.9154681087715264,
 'aten': 0.26306306306306304,
 'chem': 0.049899720962678765,
 'dgig': 0.16157858924926288,
 'epal': 0.22328846239282066,
 'hsym': 0.2934792480247025,
 'hvul': 0.10115664572739057,
 'mvir': 0.1093170771892248,
 'nvec': 0.14556510798909625,
 'ofav': 1.6517027266766942,
 'pdam': 4.083872585904189,
 'spis': 4.288295902761008}
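The rows of the summary table shown above are consistent with `cne_threshold` being `cne_count / gene_count`, i.e. the mean number of CNEs per gene. A minimal sketch on three rows copied from that table, building the same species-to-threshold mapping without an explicit loop (the names `summary_demo` and `mean_cne_demo` are illustrative only):

```python
import pandas as pd

# Three rows copied from the summary table above
summary_demo = pd.DataFrame({
    "species": ["hsym", "pdam", "spis"],
    "cne_count": [6463, 81412, 106547],
    "gene_count": [22022, 19935, 24846],
})

# Assumption: the threshold is the mean number of CNEs per gene
summary_demo["cne_threshold"] = summary_demo["cne_count"] / summary_demo["gene_count"]

# Species -> threshold mapping, equivalent to the iterrows() loop
mean_cne_demo = summary_demo.set_index("species")["cne_threshold"].to_dict()
print(mean_cne_demo)
```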

In [44]:
def format_interpro_res(interpro_file, cne_count_df):
    interpro_results = pd.read_table(interpro_file, names = ('protein_id', 'identifier', 'length', 'software', 
                                                             'software_id', 'software_prediction', 'start', 'end',
                                                             'score', 'status', 'date', 'IPR_id', 'description'))
    interpro_results = interpro_results.dropna(subset=['IPR_id'])
    #### dmel has dashes instead of NA in the IPR_id column
    interpro_results = interpro_results[interpro_results['IPR_id'] != '-']
    interpro_results = interpro_results.reset_index(drop = True)
    interpro_results = interpro_results[['protein_id', 'software','IPR_id', 'description']]
    interpro_results = interpro_results.drop_duplicates(subset=['protein_id', 'IPR_id'])
    # Merge with closest_genes_df
    #interpro_results = interpro_results.merge(closest_genes_df, how = 'left', on = 'gene_id')
    annot_df = cne_count_df.merge(interpro_results, how = 'left', on = 'protein_id' )
    #interpro_results = interpro_results.merge(cne_count_df, how = 'outer', on = 'gene_id') # how = 'left',
    #interpro_results.loc[interpro_results['closest_cne_count'].isna(), 'closest_cne_count'] = 0 
    annot_df.loc[annot_df['IPR_id'].isna(), 'IPR_id'] = 'no_IPR' 
    return annot_df
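To illustrate what the cleaning steps in `format_interpro_res` do, here is a minimal sketch on toy data. The gene and protein IDs (`g1`, `P1`, ...) are made up; the two IPR IDs and their descriptions are taken from tables in this notebook.

```python
import pandas as pd

# Hypothetical InterProScan-like rows: a duplicate hit, a dash, and a missing value
interpro_demo = pd.DataFrame({
    "protein_id": ["P1", "P1", "P1", "P2", "P3"],
    "software": ["Pfam", "SMART", "Pfam", "Pfam", "Pfam"],
    "IPR_id": ["IPR011598", "IPR011598", "IPR002110", "-", None],
    "description": ["Myc-type, basic helix-loop-helix (bHLH) domain",
                    "Myc-type, basic helix-loop-helix (bHLH) domain",
                    "Ankyrin repeat", "-", None],
})
counts_demo = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3"],
    "protein_id": ["P1", "P2", "P3"],
    "closest_cne_count": [5, 2, 1],
})

cleaned = interpro_demo.dropna(subset=["IPR_id"])                    # drop rows without an IPR
cleaned = cleaned[cleaned["IPR_id"] != "-"]                          # drop dash placeholders
cleaned = cleaned.drop_duplicates(subset=["protein_id", "IPR_id"])   # one row per protein/domain

# Left merge keeps every gene, including those without any domain
annot_demo = counts_demo.merge(cleaned, how="left", on="protein_id")
annot_demo["IPR_id"] = annot_demo["IPR_id"].fillna("no_IPR")
print(annot_demo[["gene_id", "closest_cne_count", "IPR_id"]])
```

Genes with several domains end up on several rows (here `g1`), which is why the ranking functions later collapse domains per gene.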

### Retrieve one protein per gene to merge Blast annotations

In [45]:
all_sp_protein_df = pd.DataFrame(columns=['species', 'gene_id', 'protein_id'])
for species, gene_dict in gene_protein_dict.items():
    sp_protein_dict = {}
    for gene_id, protein_set in gene_dict.items():
        sp_protein_dict[gene_id] = list(protein_set)[0] # arbitrary protein for annotation (set order is not guaranteed)
    sp_gene_protein_df = pd.DataFrame(sp_protein_dict.items(), columns=['gene_id', 'protein_id'])
    sp_gene_protein_df['species'] = species
    all_sp_protein_df = pd.concat([all_sp_protein_df, sp_gene_protein_df])
all_sp_protein_df

Unnamed: 0,species,gene_id,protein_id
0,spis,gene-LOC111326177,XP_022785962.1
1,spis,gene-LOC111327133,XP_022787155.1
2,spis,gene-LOC111327047,XP_022786908.1
3,spis,gene-LOC111328643,XP_022788889.1
4,spis,gene-LOC111328547,XP_022788781.1
...,...,...,...
23840,nvec,gene-LOC5522391,XP_032223481.1
23841,nvec,gene-LOC5522393,XP_032223620.1
23842,nvec,gene-LOC5522254,XP_032223631.1
23843,nvec,gene-LOC5522255,XP_001642144.2


In [46]:
def rank_genes_by_cne_count(cne_counts, species):
    print('Read Blast annotations')
    annotations = annotations_dir + species + "_annotations_combined.tsv"
    annotations_df = pd.read_csv(annotations, sep="\t")
    annotations_df = annotations_df.rename({'gene_id':'protein_id'}, axis=1)
    print('Merge CNE counts with BLAST annotations') 
    cne_counts_sp = cne_counts[cne_counts['species']==species]
    cne_counts_sp = cne_counts_sp.rename({'gene': 'gene_id'}, axis=1)
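    # Sum only the count column: recent pandas versions no longer drop string columns in sum()
    cne_counts_sp = cne_counts_sp.groupby('gene_id', as_index=False)['closest_cne_count'].sum()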
    cne_counts_sp = cne_counts_sp.merge(all_sp_protein_df)
    cne_counts_sp = cne_counts_sp.merge(annotations_df, how='left', on='protein_id')
    print("Merge CNE counts with Interpro annotations")
    interpro_file = interpro_dir + species + "_combined.tsv"
    cne_counts_annot = format_interpro_res(interpro_file, cne_counts_sp)
    print("Combine IPR IDs and descriptions for each gene")
    cne_counts_annot = cne_counts_annot.fillna('')
    combined_IPRid = cne_counts_annot.groupby('gene_id')['IPR_id'].apply(lambda x: ','.join(x)).reset_index()\
        .rename({'IPR_id': 'IPR_ids'}, axis=1)
    combined_descs = cne_counts_annot.groupby('gene_id')['description'].apply(lambda x: ','.join(x)).reset_index()\
        .rename({'description': 'descriptions'}, axis=1)
    output_df = cne_counts_annot.drop(['software', 'IPR_id', 'description'], axis=1).drop_duplicates('gene_id').\
        merge(combined_IPRid, how='left').merge(combined_descs, how='left')
    output_df = output_df[output_df['closest_cne_count']!='']
    output_df = output_df.sort_values('closest_cne_count',ascending=False).reset_index(drop=True)
    output_df.closest_cne_count = output_df.closest_cne_count.astype(int)
    output_df['rank'] = output_df.index + 1
    return output_df
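The core of the ranking can be sketched on toy data: counts are summed per gene across CNE nodes, domains are collapsed into one comma-separated string per gene, and genes are ranked by total count. Gene IDs and counts below are hypothetical; the IPR IDs appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical counts: one row per gene and CNE node
counts_demo = pd.DataFrame({
    "species": ["spis"] * 4,
    "gene": ["geneA", "geneA", "geneB", "geneC"],
    "closest_cne_count": [2, 3, 1, 7],
    "cne_node": ["hexacorallia", "ambiguous", "hexacorallia", "ambiguous"],
})
# Hypothetical gene-to-domain table (several rows per gene are possible)
domains_demo = pd.DataFrame({
    "gene_id": ["geneA", "geneA", "geneC"],
    "IPR_id": ["IPR011598", "IPR036638", "IPR002110"],
})

# Total CNE count per gene, summed over nodes
per_gene = (counts_demo.rename(columns={"gene": "gene_id"})
            .groupby("gene_id", as_index=False)["closest_cne_count"].sum())

# Sort by count and assign a 1-based rank
ranked = per_gene.sort_values("closest_cne_count", ascending=False).reset_index(drop=True)
ranked["rank"] = ranked.index + 1

# Collapse domains into one comma-separated string per gene
ipr_ids = domains_demo.groupby("gene_id")["IPR_id"].agg(",".join).reset_index(name="IPR_ids")
ranked = ranked.merge(ipr_ids, how="left", on="gene_id")
print(ranked)
```

Note that ties receive consecutive ranks in sort order, as in the function above.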

In [20]:
for species in set(cne_counts['species']):
    output_df = rank_genes_by_cne_count(cne_counts, species)
    output_df.to_csv(species + "_ranked_genes.tsv", sep="\t", index=False)

Read Blast annotations
Merge CNE counts with BLAST annotations
Merge CNE counts with Interpro annotations
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Merge CNE counts with Interpro annotations
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Merge CNE counts with Interpro annotations
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Merge CNE counts with Interpro annotations
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Merge CNE counts with Interpro annotations
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Merge CNE counts with Interpro annotations
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Merge CNE counts wi

### Rank genes with protein domains of interest only

In [None]:
### Take all overrepresented domains
### Classify by type (already curated IPRs for stacked barplots): homeo, TF, immuno, TE...
### Retrieve all genes with > average CNEs
### Assign gene to each type
### Calculate pct CNEs associated with each type
### Make piechart for all species
### output 
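The plan in the comments above is not implemented in this notebook. A minimal sketch of the classification and percentage steps follows, under assumptions that are not from the original analysis: each gene is assigned to a single class using a hypothetical priority order, and percentages are computed over the CNEs of the classified genes only. The three IPR-to-class pairs are taken from the curated table used later in this notebook; gene IDs and counts are made up.

```python
import pandas as pd

# IPR -> class, as in the curated IPR table
ipr_class = pd.DataFrame({
    "IPR_id": ["IPR011598", "IPR013783", "IPR029034"],
    "class": ["TF", "immunoglobulin", "other"],
})
# Hypothetical genes above the CNE threshold, one row per gene/domain pair
gene_domains = pd.DataFrame({
    "gene_id": ["g1", "g1", "g2", "g3"],
    "IPR_id": ["IPR011598", "IPR029034", "IPR013783", "IPR029034"],
    "closest_cne_count": [10, 10, 6, 4],
})

# Assumption: when a gene has domains from several classes, the first class in this list wins
priority = ["TF", "immunoglobulin", "other"]

classified = gene_domains.merge(ipr_class, on="IPR_id", how="left")
classified["class"] = pd.Categorical(classified["class"], categories=priority, ordered=True)

# Keep the highest-priority class for each gene
per_gene = classified.sort_values("class").drop_duplicates("gene_id")

# Percentage of CNEs associated with each class
pct_by_class = (per_gene.groupby("class", observed=True)["closest_cne_count"].sum()
                / per_gene["closest_cne_count"].sum() * 100)
print(pct_by_class)
```

The resulting Series could then be passed to a plotting call to draw one pie chart per species.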

In [47]:
def rank_genes_IPR_list(cne_counts, species, IPR_list):
    mean_cne = mean_cne_dict[species]
    print('Read Blast annotations')
    annotations = annotations_dir + species + "_annotations_combined.tsv"
    annotations_df = pd.read_csv(annotations, sep="\t")
    annotations_df = annotations_df.rename({'gene_id':'protein_id'}, axis=1)
    print('Merge CNE counts with BLAST annotations') 
    cne_counts_sp = cne_counts[cne_counts['species']==species]
    cne_counts_sp = cne_counts_sp.rename({'gene': 'gene_id'}, axis=1)
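    # Sum only the count column: recent pandas versions no longer drop string columns in sum()
    cne_counts_sp = cne_counts_sp.groupby('gene_id', as_index=False)['closest_cne_count'].sum()
    if species not in ['hsym']: # no original annotation available for hsym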
        print('Read original annotations')
        orig_annot_file = orig_annot_dir + species + "_orig_annot.tsv"
        orig_annot_df = pd.read_csv(orig_annot_file, sep="\t")
        cne_counts_sp = cne_counts_sp.merge(orig_annot_df)
    cne_counts_sp = cne_counts_sp.merge(all_sp_protein_df)
    cne_counts_sp = cne_counts_sp.merge(annotations_df, how='left', on='protein_id')
    print("Merge CNE counts with Interpro annotations")
    interpro_file = interpro_dir + species + "_combined.tsv"
    cne_counts_annot = format_interpro_res(interpro_file, cne_counts_sp)
    print("Filter by IPR_id")
    cne_counts_annot = cne_counts_annot[cne_counts_annot['IPR_id'].isin(IPR_list)]
    print("Combine IPR IDs and descriptions for each gene")
    cne_counts_annot = cne_counts_annot.fillna('')
    combined_IPRid = cne_counts_annot.groupby('gene_id')['IPR_id'].apply(lambda x: ','.join(x)).reset_index()\
        .rename({'IPR_id': 'IPR_ids'}, axis=1)
    combined_descs = cne_counts_annot.groupby('gene_id')['description'].apply(lambda x: ','.join(x)).reset_index()\
        .rename({'description': 'descriptions'}, axis=1)
    output_df = cne_counts_annot.drop(['software', 'IPR_id', 'description'], axis=1).drop_duplicates('gene_id').\
        merge(combined_IPRid, how='left').merge(combined_descs, how='left')
    output_df = output_df[output_df['closest_cne_count']!='']
    output_df = output_df.sort_values('closest_cne_count',ascending=False).reset_index(drop=True)
    output_df.closest_cne_count = output_df.closest_cne_count.astype(int)
    output_df['rank'] = output_df.index + 1
    # keep only genes with more CNEs than the species mean
    output_df = output_df[output_df['closest_cne_count'] > mean_cne]
    return output_df
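The two filters in `rank_genes_IPR_list` can be sketched on toy data: rows are first restricted to the domains of interest, then genes are ranked, and finally genes at or below the species mean are removed. Gene IDs and counts are hypothetical; `IPR001356` is one of the homeodomain IPRs used in this notebook and `IPR002110` is the ankyrin repeat entry.

```python
import pandas as pd

# Hypothetical annotated counts: one row per gene/domain pair
annot_demo = pd.DataFrame({
    "gene_id": ["g1", "g1", "g2", "g3"],
    "closest_cne_count": [6, 6, 9, 1],
    "IPR_id": ["IPR001356", "IPR002110", "IPR002110", "IPR001356"],
})
ipr_of_interest = ["IPR001356"]
mean_cne_demo = 4.08  # illustrative threshold, close to the pdam value

# Keep only rows matching a domain of interest
hits = annot_demo[annot_demo["IPR_id"].isin(ipr_of_interest)]

# One row per gene, ranked by CNE count among the retained genes
hits = (hits.drop_duplicates("gene_id")
            .sort_values("closest_cne_count", ascending=False)
            .reset_index(drop=True))
hits["rank"] = hits.index + 1

# Apply the threshold after ranking
above_mean = hits[hits["closest_cne_count"] > mean_cne_demo]
print(above_mean)
```

Because the domain filter is applied before domains are combined per gene, the `IPR_ids` column of the real output lists only the matching domains, not every domain of the gene.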

In [23]:
homeo_IPRs = ['IPR009057', 'IPR017970', 'IPR001356', 'IPR020479', 'IPR008422', 'IPR032967',
               'IPR032453', 'IPR000747' ] 

### Output ranked homeodomain genes

In [27]:
out_dir = 'homeo_ranked_genes/'
os.makedirs(out_dir, exist_ok=True)  # do not fail if the directory already exists
for species in set(cne_counts['species']):
    output_df = rank_genes_IPR_list(cne_counts, species, homeo_IPRs)
    output_df.to_csv(out_dir + species + "_homeo_ranked.tsv", sep="\t", index=False)

Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotatio

### Output ranked Immunoglobulin genes

In [48]:
IPRs_file = "../../cnidaria_06_22/combine_overrep_analyses/all_IPRs_curated.txt"
IPRs_df = pd.read_csv(IPRs_file, sep="\t")
IPRs_df

Unnamed: 0,IPR_id,description,class
0,IPR029034,Cystine-knot cytokine,other
1,IPR001303,Class II aldolase/adducin N-terminal,other
2,IPR033929,"Tensin, phosphotyrosine-binding domain",other
3,IPR011641,Tyrosine-protein kinase ephrin type A/B recept...,other
4,IPR036300,Mir domain superfamily,other
...,...,...,...
621,IPR010442,PET domain,TF
622,IPR038096,TEA/ATTS domain superfamily,TF
623,IPR008967,"p53-like transcription factor, DNA-binding",TF
624,IPR011598,"Myc-type, basic helix-loop-helix (bHLH) domain",TF


In [29]:
IPRs_df[IPRs_df['class']=='immunoglobulin']

Unnamed: 0,IPR_id,description,class
507,IPR043204,Basigin-like,immunoglobulin
508,IPR036179,Immunoglobulin-like domain superfamily,immunoglobulin
509,IPR007110,Immunoglobulin-like domain,immunoglobulin
510,IPR013106,Immunoglobulin V-set domain,immunoglobulin
511,IPR013783,Immunoglobulin-like fold,immunoglobulin
512,IPR013098,Immunoglobulin I-set,immunoglobulin
513,IPR003598,Immunoglobulin subtype 2,immunoglobulin
514,IPR003599,Immunoglobulin subtype,immunoglobulin
515,IPR014756,Immunoglobulin E-set,immunoglobulin
516,IPR037448,Zwei Ig domain protein zig-8,immunoglobulin


In [30]:
immuno_IPRs = list(IPRs_df[IPRs_df['class']=='immunoglobulin']['IPR_id'])
immuno_IPRs

['IPR043204',
 'IPR036179',
 'IPR007110',
 'IPR013106',
 'IPR013783',
 'IPR013098',
 'IPR003598',
 'IPR003599',
 'IPR014756',
 'IPR037448',
 'IPR013162']

In [31]:
out_dir = 'immunoglobulin/'
os.makedirs(out_dir, exist_ok=True)  # do not fail if the directory already exists
for species in set(cne_counts['species']):
    output_df = rank_genes_IPR_list(cne_counts, species, immuno_IPRs)
    output_df.to_csv(out_dir + species + "_immuno_ranked.tsv", sep="\t", index=False)

Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotatio

### Transcription factors

In [50]:
IPRs_df[IPRs_df['class']=='TF']

Unnamed: 0,IPR_id,description,class
475,IPR000664,Lethal(2) giant larvae protein,TF
558,IPR013847,POU domain,TF
559,IPR008116,Sequence-specific single-strand DNA-binding pr...,TF
560,IPR000327,POU-specific domain,TF
561,IPR037987,Four and a half LIM domains protein 2/3/5,TF
...,...,...,...
621,IPR010442,PET domain,TF
622,IPR038096,TEA/ATTS domain superfamily,TF
623,IPR008967,"p53-like transcription factor, DNA-binding",TF
624,IPR011598,"Myc-type, basic helix-loop-helix (bHLH) domain",TF


In [51]:
TF_IPRs = list(IPRs_df[IPRs_df['class']=='TF']['IPR_id'])

In [53]:
out_dir = 'TF/'
os.makedirs(out_dir, exist_ok=True)  # do not fail if the directory already exists
for species in set(cne_counts['species']):
    output_df = rank_genes_IPR_list(cne_counts, species, TF_IPRs)
    output_df.to_csv(out_dir + species + "_TF_ranked.tsv", sep="\t", index=False)

Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotations
Merge CNE counts with BLAST annotations
Read original annotations
Merge CNE counts with Interpro annotations
Filter by IPR_id
Combine IPR IDs and descriptions for each gene
Read Blast annotatio