# Computational phospho-proteomic network inference pipeline
####  by Matt Macgilvary

###### This pipeline turns a list of S. cerevesiae phospho-peptides that exhibit stress responsive abundance changes, as measured by mass spectrometry, into a hierarchical signaling network, connecting upstream kinases and phosphatases to their downstream targets. Our computational pipeline is based on the premise that kinases and phosphatases recognize target substrates through specific amino acid sequences at the phosphorylated residue, called phosphorylation motifs. This pipeline groups phospho-peptides with similar abundance changes and the same phosphorylation motif into modules. Modules are partitioned into smaller groups, called submodules, based on differences in phospho- peptide abundance in mutant strain(s) (sources). Candidate submodule regulators, called shared interactors, are identified through enrichment analysis using a protein interaction network in yeast (Chasman et al., 2014). Shared interactor-submodule pairs serve as inputs for a previously developed Integer Programming (IP) approach that connects the sources to their downstream target submodules (Chasman et al., 2014).

###### Please see our bioRxiv preprint for additional information:
    Network inference reveals novel connections in pathways regulating growth and defense in the yeast salt response.   Matthew E. MacGilvray+, Evgenia Shishkova+, Deborah Chasman, Michael Place, Anthony Gitter, Joshua J. Coon, Audrey P. Gasch. bioRxiv 2017. doi:10.1101/176230


### Prerequisites

### The user should define differentially changing phospho-peptides in the "WT" or "Parent" strain using their own criteria (eg; fold-change, p-value, etc.), followed by grouping/clustering phospho-peptides based on similar directionality of abundance change.

# __This pipeline is for the user who has already run Motifx and defined the modules manually.__


# If you wish to run Motifx here and then manually define your modules execute the Motifx.py cell just below.

# Identify motifs by Calling Motifx.py
    
    This automates submitting jobs to the Motif-x Website (http://motif-x.med.harvard.edu/)

### Expected Input  : A single Plain text file (called inputfiles in the next cell) listing excel files to process, one excel file name per line.

    text file:
    data_sheet1.xlsx
    data_sheet2.xlsx

	The Excel file format:
	Ppep	Group	Localized_Sequence	Motif_X_Input_Peptide
	YGL076C_T8_S11	Induced	AAEKILtPEsQLKK	AAEKILT*PES*QLKK

	Column order is unimportant, column names must match above.


In [None]:
%run -i 'Motifx.py' -f 'inputfiles' -u 'reference/orf_trans_all.20150113.fasta'

In [2]:
# required python libraries
import Bio
import glob
import itertools
import math
import os
import numpy as np
import pandas as pd
import random
import re
import shutil
from scipy.stats import hypergeom
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import motifs
from Bio.Alphabet import IUPAC

current_dir = os.getcwd()


# Run Identify_Modules_and_Submodules step.

### This step can be skipped if the user has already defined submodules

This script identifies co-regulated groups of phospho-peptides using the following approach:

1) First, the script identifies 'modules', which are groups of phospho-peptides that exhibit the same directionality in stress-dependent abundance change (ie, increased 'Induced', or decreased 'Repressed') and the same motif. The module nomenclature is as follows: Induced/Repressed- motif (ex: Induced..RK.s....).

2) Next, the script partitions modules into 'submodules' based on their phospho-peptide constituents dependency on a protein(s) for stress-dependent abundance changes (ie, phospho-peptides that exhibit increased 'amplified' or decreased 'defective' abundance in a deletion strain compared to the 'WT' or 'Parental' type strain). These phenotypes are user defined. If two or more mutant phenotypes are recorded for a phospho-peptide then it's placed into two separate subModules (one for each mutant phenotype). If there was not a mutant phenotype at a user defined threshold then the phenotype is 'No-Phenotype'

The submodule nomenclature is as follows: module name-mutant phenotype/No-Phenotype (ex: Induced..RK.s....Mutant_Defective).

Possible submodule phenotypes: Induced-Defective, Induced-Amplified, Repressed-Defective, Repressed-Amplified, Induced-No-Phenotype, Repressed-No-Phenotype

idModules.csv file looks like:

> Ppep,Cluster,Motif,Peptide,ire1,mkk1_2<br>
> YGR240C_S895,Induced,......SP.....,NKKNEASPNTDAK,Induced_Amplified,Induced_Defective<br>
> YMR005W_S80,Induced,...K..SP.....,VLPKNVSPTTNLR,Induced_Amplified,Induced_Defective<br>
> YPL242C_S7,Induced,......SP.....,MTAYSGSPSKPGN,Induced_Amplified, <br>



In [6]:
pd.options.mode.chained_assignment = None  # default='warn'
Data=pd.read_csv('idModules.csv') # Define path to input file

def Slicedataframe():
    '''Define a function that slices the input dataframe into independent dataframes based on the Cluster names. Next, slice these dataframes based on the presence of the same motif, generating 'modules' '''  
    ClusterLST=Data['Cluster'].unique().tolist()                            # generate a list of unique Cluster names (ie, 'Induced' and 'Repressed')
    lst=[]                                                                 
    DF=Data.copy()                                                         
    for cluster in ClusterLST:                                              # Select the first 'cluster' on the list 
        DF2=DF.loc[DF['Cluster']== cluster]                                 # Create a new dataframe by selecting only those rows that contain the selected 'cluster' in the 'Cluster' column 
        MotifLST=DF2['Motif'].unique().tolist()                             # From the newly created dataframe, place each instance of a unique motif into a list
        cleanedMotifLST = [x for x in MotifLST if str(x) != 'nan']          # drop the string 'nan' from the list. 'nan' occurs for Ppeps that did not have an identified Motif from Motif-X. 
        for motif in cleanedMotifLST:                                       # Select a motif in the list
            DF3=DF2.loc[DF['Motif']== motif]                                # Filter the dataframe, selecting only those rows that contain 'motif' in the Motif column
            DF3['freq'] = DF3.groupby('Motif')['Motif'].transform('count')  # Produce a new column, called 'freq' that contains the number of rows, and thus phospho-peptides, that contain a given motif.
            lst.append(DF3) 
        
    return lst

SlicedDF_lst=Slicedataframe()

def ConcatenateDFs():
    ''' Define a function that appends the dataframes in the SlicedDF_list together. '''
    EmptyDF = pd.DataFrame()                                                # create an empty dataframe
    for df in SlicedDF_lst:                                                
        df=df.copy() 
        EmptyDF=EmptyDF.append(df)                                          # append to the empty DF the dataframe selected and overwrite the empty dataframe
    return EmptyDF

Final_DF=ConcatenateDFs()
FinalDFV2=Final_DF.fillna(0)                                                #  fill any NaN values with '0'

#--------------------------------------------------------------------------------------------------------------------------
def Module_Motif_NoMutantPhenotypeExists(df):
    ''' Define a function that assigns no-phenotype submodules'''
    if(df['ire1'] ==0) & (df['mkk1_2']==0):
        return 'No_Phenotype_Exists'
    
FinalDFV2['Phenotype']=FinalDFV2.apply(Module_Motif_NoMutantPhenotypeExists, axis=1) 
FinalDFV2=FinalDFV2.loc[FinalDFV2['Phenotype']=='No_Phenotype_Exists']        # Select all rows for which "No_Phenotype_Exists" in the 'Phenotype' column.
FinalDFV2['subModule']=FinalDFV2.Cluster.map(str) + "_" + FinalDFV2.Motif + "_" + FinalDFV2.Phenotype                                   # create a new column, called submodule, that contains the concatenated strings in the 'Cluster', 'Motif', and 'Phenotype' columns.

# CHANGE GENE NAMES HERE
FinalDF=Final_DF.dropna(subset = ['ire1', 'mkk1_2'], how='all')        # Remove rows that have NaN in all 3 columns representing mutant phenotpes. This steps removes theNo-phenotype submodules which were creat                                                                               # ed above. 
FinalDF=FinalDF.fillna(0)                                                     # fill any NaN that remain with '0'
lstCols=['ire1', 'mkk1_2']                                             # make a list that contains the column headers for the 3 mutants. 



def DefineMutantContribution(row):
    ''' Define a function that identifies for each phospho-peptide if it has a phenotype in more than one mutant strain'''
    dictData={} 
    for colname in lstCols:    
        if not row[colname]==0:                                                # if value is not equal to zero, there is a mutant phenotype (ex; Induced_defective)
            dictData[colname]=row[colname]  
    if len(dictData.keys())==0: return 0  
    else:
        return ":".join(dictData.keys())
    
FinalDF['Contribution']=FinalDF.apply(lambda x: DefineMutantContribution(x), axis=1) 



def DefinePhenotypeFromMutants(row):
    ''' Define a function that captures the mutant phenotype for Ppeps with multiple phenotypes and places it within a column'''
    dictData={}  
    for colname in lstCols:   
        if not row[colname]==0:
            dictData[colname]=row[colname] 
    if len(dictData.keys())==0: return 0 
    else:
        return ":".join(dictData.values()) 
    
FinalDF['Phenotype']=FinalDF.apply(lambda x: DefinePhenotypeFromMutants(x), axis=1)


#--------------------------------------------------------------------------------------------------------------------------------------------------
''' Determine all Ppeps that have 2 or more mutant phenotypes (ire1/mkk1_2 phenotypes), then the script produces a new column with individual subModule names for 
the ire1 phenotype''' 

FinalDF_multiplePhenotypes=FinalDF[FinalDF['Contribution'].str.contains(":")]     # Select 'contribution column rows that contain ":", which means the Ppep has two mutant phenotypes since this is a separator between gene names
FinalDF_multiplePhenotypes_ire1=FinalDF_multiplePhenotypes[FinalDF_multiplePhenotypes['Contribution'].str.contains("ire1")] 
FinalDF_multiplePhenotypes_ire1['Ire1']='ire1' 
FinalDF_multiplePhenotypes_ire1['subModule']=FinalDF_multiplePhenotypes_ire1.Cluster.map(str) + "_" + FinalDF_multiplePhenotypes_ire1.Motif + "_" + FinalDF_multiplePhenotypes_ire1.Ire1 + "_" + FinalDF_multiplePhenotypes_ire1.ire1


#--------------------------------------------------------------------------------------------------------------------------------------------------
''' Define all Ppeps that have 2 or more mutant phenotypes (ire1/mkk1_2 phenotypes), then the script produces a new column with individual subModule names for 
the mkk1_2 phenotype''' 
FinalDF_multiplePhenotypes_mkk1_2=FinalDF_multiplePhenotypes[FinalDF_multiplePhenotypes['Contribution'].str.contains("mkk1_2")]
FinalDF_multiplePhenotypes_mkk1_2['Mkk1_2']='mkk1_2'
FinalDF_multiplePhenotypes_mkk1_2['subModule']=FinalDF_multiplePhenotypes_mkk1_2.Cluster.map(str) + "_" + FinalDF_multiplePhenotypes_mkk1_2.Motif + "_" + FinalDF_multiplePhenotypes_mkk1_2.Mkk1_2 + "_" + FinalDF_multiplePhenotypes_mkk1_2.mkk1_2

#--------------------------------------------------------------------------------------------------------------------------------------------------
''' Define all Ppeps that have 2 or more mutant phenotypes (ire1/mkk1_2 phenotypes), then the script produces a new column with individual subModule names for 
the cdc14 phenotype'''

#FinalDF_multiplePhenotypes_cdc14=FinalDF_multiplePhenotypes[FinalDF_multiplePhenotypes['Contribution'].str.contains("cdc14")]
#FinalDF_multiplePhenotypes_cdc14['Cdc14']='cdc14'
#FinalDF_multiplePhenotypes_cdc14['subModule']=FinalDF_multiplePhenotypes_cdc14.Cluster.map(str) + "_" + FinalDF_multiplePhenotypes_cdc14.Motif + "_" + FinalDF_multiplePhenotypes_cdc14.Cdc14 + "_" + FinalDF_multiplePhenotypes_cdc14.cdc14


#--------------------------------------------------------------------------------------------------------------------------------------------------

'''This section of code appends the above mutant dataframes together (ie, FinalDF_multiplePhenotypes_cdc14, etc.) (contained ":"). The result is Ppeps with phenotypes in more than one strain are listed on multiple lines rather than a single line'''

FinalDF_mutants=FinalDF_multiplePhenotypes_ire1.append(FinalDF_multiplePhenotypes_ire1) 
FinalDF_mutants_Final=FinalDF_mutants.append(FinalDF_multiplePhenotypes_mkk1_2)

# if you have more than 2 gene names add them here after Peptide
FinalDF_mutants_Final=FinalDF_mutants_Final[['Ppep','Cluster','Motif','Peptide','ire1','mkk1_2','freq','Contribution','Phenotype','subModule']] # Only retain these columns 


#-----------------------------------------------------------------------------------------------------------------------------------------------------
# Drop from the original dataframe rows containing Ppeps with multiple mutant phenotypes. 
FinalDF_minus_multiPhenotypePpeps=FinalDF[FinalDF.Contribution.str.contains(":")==False] # Removing all rows that contain ":", and thus are phospho-peptides with multiple mutant phenotypes


# Generate the final submodule names
FinalDF_minus_multiPhenotypePpeps['subModule']=FinalDF_minus_multiPhenotypePpeps.Cluster.map(str) + "_" + FinalDF_minus_multiPhenotypePpeps.Motif + "_" + FinalDF_minus_multiPhenotypePpeps.Contribution + "_" + FinalDF_minus_multiPhenotypePpeps.Phenotype


#-----------------------------------------------------------------------------------------------------------------------------------------------------
Ppeps_with_PhenotypesDF=FinalDF_minus_multiPhenotypePpeps.append(FinalDF_mutants_Final)  # Appending together the dataframes that originally had single mutant phenotypes, and the dataframe that started with multiple mutant Phentoypes, but now contains single listings for each Ppep-mutant phenotype



# Remove any submodule that only has a single Ppep constituent, since by default a submodule must contain 2 Ppeps. 
Ppeps_with_PhenotypesDF_subModules=Ppeps_with_PhenotypesDF[Ppeps_with_PhenotypesDF.duplicated(['subModule'], keep='last') | Ppeps_with_PhenotypesDF.duplicated(['subModule'])]  # only retain duplicates, get rid of single entries 



# Append to the dataframe with phenotype subModules, all No-Phenotype submodules
Ppeps_with_Phenotypes_subModules_and_noPhenotypes_DF=Ppeps_with_PhenotypesDF_subModules.append(FinalDFV2) # append to dataframe
Ppeps_with_Phenotypes_subModules_and_noPhenotypes_DF=Ppeps_with_Phenotypes_subModules_and_noPhenotypes_DF[['Ppep', 'Cluster', 'Motif', 'Peptide', 'ire1', 'mkk1_2', 'freq', 'Contribution', 'Phenotype', 'subModule']]

#-----------------------------------------------------------------------------------------------------------------------------------------------------
''' Create a column with the 'Module' name '''
Ppeps_with_Phenotypes_subModules_and_noPhenotypes_DF['Module']=Ppeps_with_Phenotypes_subModules_and_noPhenotypes_DF.Cluster.map(str) + "_" + Ppeps_with_Phenotypes_subModules_and_noPhenotypes_DF.Motif 

#-----------------------------------------------------------------------------------------------------------------------------------------------------
# define a function that will write out a dataframe as a tab separated file
def Dataframe_to_Tsv (dataframe, NewFileName):
    dataframe.to_csv (NewFileName,sep='\t')

Dataframe_to_Tsv(Ppeps_with_Phenotypes_subModules_and_noPhenotypes_DF, 'Modules_pPep.csv') 
# The above file contains all modules and subModules with and without mutant phenotypes. 

# OUTPUT: Modules_pPep.csv
#   for the case of 2 genes ire1, mkk1_2, header looks like
#   Ppep    Cluster Motif   Peptide ire1    mkk1_2  freq    Contribution    Phenotype       subModule       Module

# Prep for Identify Shared Interactors 

## ONLY USE FOR CASES IN WHICH THERE IS MANUAL CURATION OF SUBMODULES

   Kevin has motifx files where he is working w/ Strains.  He has already grouped his submodules.
   His files look like: 
    
    YJL082W_S(187),KNSSSPSPSEKSQ,Group_1,......SP.....
    YPL112C_S(304),KDDGSQSPIRKQL,Group_1,......SP.....
    YJL128C_S(83),DKGSSQSPKHIQQ,Group_1,......SP.....
    YNL118C_S(750),VSSNQQSPKSQHL,Group_1,......SP.....
    YAL035W_S(395),PTPSSASPNKKDL,Group_1,......SP.....

input file : 
Submodule,ORF
Induced_......TP....._cdc14_Repressed_Amplified,YLR319C
Induced_......TP....._cdc14_Repressed_Amplified,YJL070C

#### NOTE: ORF names have need to have the '-' removed,  YER074W-A becomes YER074A.

The other input files are provided.
'''

In [12]:

##  CHANGE INPUT FILE HERE   ##
inputFile = 'Modules_pPep.csv'

# create the input file based on the output of the previous step.
with open('Submodule_constituents.csv', 'w') as out:
    out.write('Submodule,ORF\n')
    with open(inputFile,'r') as f:
        f.readline()                            # skip header
        for line in f:
            data = line.rstrip().split('\t')    # CHECK THE FILE DELIMITER 
            name = data[1].split('_')[0]   
            name = re.sub('-', '', name)
            row  = data[10] + ',' + name + '\n'
            out.write(row)
out.close()
# OUTPUT: Submodule_constituents.csv

# Identify Shared Interactors 

This script identifies proteins enriched for interactions with Submodule constituent proteins, based on known interactions in the background network. We call these proteins 'Shared Interactors'. The background network is a protein
interaction network curated in yeast under mostly nutrient replete conditions that contains 4638 proteins and ~ 25,000 interactions, including directed (ex; kinase-substrate), and 
non-directed. 

Proteins enriched for interactions with Submodule proteins at a 5% FDR, determined by a hypergeometric test and BH correction, are considered shared interactors.

Shared Interactors represent numerous functional classes, including kinases and phosphatases. Kinase and phosphatase shared interactors represent potential Submodule regulators.
 
HyperG function:
distrib=hypergeom(N,M,n)
distrib.pmf(m)

* N - population size (4638 unique proteins in Background network file - phospho_v4_bgnet_siflike_withdirections_Matt_Modified.csv)

* M - total number of successes  (# of interactions for a given protein. ie. Protein A has 200 known interactions in the background network).

* n - the number of trials (also called sample size) -  ie. (Number of proteins that reside within a submdoule)

* m - the number of successes - for example: Protein A, a shared interactor, has 35 interactions with proteins in Submodule B. 
 
 
 Final shared interactor file:   __Final_enriched.csv__  , this contains the significant Shared Interactors based on the
 BH_significance test.
 
 A list of all shared interactors can be found:  __Network_Submodule_Nodes_background_Network.csv__
 
 

In [13]:
Submodule_DF   = pd.read_csv(current_dir + '/Submodule_constituents.csv')                                                                       # File that contains Submodule names and their protein constituents
BgNet          = pd.read_csv(current_dir + '/SI_Identification_Input_Files/Background_Network.csv')                                                                                   # Background network of protein interactions
Num_Prot_Inter = pd.read_csv(current_dir + '/SI_Identification_Input_Files/Number_Interactions_Each_Protein.csv')                                              # Number of protein interactions for each protein in the background network
Annotation_DF  = pd.read_csv(current_dir + '/SI_Identification_Input_Files/Annotation.csv')                                                   # Yeast protein annotation file
 
Submodule_List=Submodule_DF['Submodule'].unique().tolist()                                                                                                  # Send the Submodules to a list, but filter out duplicates, which there will be many, since the Submodules will have been found in many proteins.

dicOrfs={}
for Submodule in Submodule_List:                                                                                                                            # Key (Submodule), Value (Yeast ORFs that are Submodule constituents). Filter ORFs found twice to single occurence (important for enrichment analysis)
    dicOrfs[Submodule]=(Submodule_DF.loc[Submodule_DF['Submodule'] == Submodule])['ORF'].unique().tolist()
        

dicOrfsCounts={}  
for k,v in dicOrfs.items():  
    if k not in dicOrfsCounts:  
        value=len(v)            
        dicOrfsCounts[k]=value
        
df_Submodule_Size=pd.DataFrame(list(dicOrfsCounts.items()),                                                                                                  # convert dict to dataframe.
                      columns=['Submodule','n'])

def SliceDataframe():
    ''' For each Submodule identify all proteins that interact with the Submodule proteins in the backgroudn network '''
    lst = []
    for key in dicOrfs.keys():                                                                                                                             #Select the key, which is a Submodule, from the dict
        CurrentDF=BgNet.copy() 
        x=CurrentDF[CurrentDF['Protein1'].isin(dicOrfs[key])].rename(columns={'Protein1':'Submodule_Containing_Proteins', 'Protein2':'Possible_Shared_Interactors'})                              #Create a new dataframe that is a slice of the salt background network, and only contains proteins that were passed in "dicOrfs[key]". At the same time, rename the columns                                
        x['Submodule']=key 
        lst.append(x)
        
    return lst

Sliced_dataframe_list= SliceDataframe()
      
def Add_n():    
    ''' Function adds 'n', the number of proteins in the Submodule, to each dataframe'''
    lst= []
    for df in Sliced_dataframe_list:
        NewDF=df.merge(df_Submodule_Size)
        lst.append(NewDF)
        
    return lst

Sliced_dataframe_list= Add_n()

def Identify_Shared_Interactors():
    ''' Function identifies proteins that interact with at least 2 protein constituents of each submodule'''
    
    lst=[] 
    for df in Sliced_dataframe_list: 
        NewDF=df.copy()
        NewDF2=NewDF[NewDF.duplicated(['Possible_Shared_Interactors'], keep = 'last')| NewDF.duplicated(['Possible_Shared_Interactors'])]                  # Only retain proteins that interact with at least 2 submodule protein constituents
        x=NewDF2.sort_values(by='Possible_Shared_Interactors', ascending=True) 
        lst.append(x)
       
    return lst

Shared_Interactors_lst=Identify_Shared_Interactors()

def AppendDFs_that_Contain_AllSharedInteractors_and_their_targets():
    ''' Function appends all submodules and their shared interactors together into a single file'''
    EmptyDF = pd.DataFrame() 
    for df in Shared_Interactors_lst:  
        df=df.copy() 
        EmptyDF=EmptyDF.append(df)
    return EmptyDF

SI_andTargets=AppendDFs_that_Contain_AllSharedInteractors_and_their_targets()

SI_andTargets_FINAL=pd.merge(left=SI_andTargets, right=Annotation_DF, how='left',
                              left_on='Possible_Shared_Interactors', right_on='systematic_name_dash_removed')                                               # complete a merge so I can get the dashes back in the names, which are not included in the background network
del SI_andTargets_FINAL['Possible_Shared_Interactors']                                                                                                      # drop because  lacks the dashes which are needed for the correct naming convention
del SI_andTargets_FINAL['systematic_name_dash_removed']                                                                                                     # drop because carried over from the merge
del SI_andTargets_FINAL['Directed']

SI_andTargets_FINAL.columns = ['Submodule_Containing_Proteins', 'Interaction', 'Submodule', 'n','Possible_Shared_Interactors']                        # rename columns

myDF = pd.DataFrame(SI_andTargets_FINAL)
# OUTPUT NAME FOR SHARED INTERACTORS
filename = 'SI_Identification_SubmoduleS__SIs_and_Targets_FDR.csv'
myDF.to_csv(filename, index=False, encoding='utf-8' )              # All interactions between SIs and their submodule constituent proteins. No enrichment at this step.

#-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
''' Preparing dataframe for Hypergeometric test'''

def Add_N_and_m():
    ''' Function adds 'N' and calculates 'm' values, which are inputs for the hypergeometric test, to the datframe'''
    lst=[]
    for df in Shared_Interactors_lst:
        NewDF=df.copy()
        NewDF['N'] = 4638          # THIS IS THE LENGTH OF THE DATA FRAME, *******************************                                                                                                                         # of proteins in the background network
        NewDF['m'] = NewDF.groupby('Possible_Shared_Interactors')['Possible_Shared_Interactors'].transform('count')
        lst.append(NewDF)
    
    return lst

Dataframes_list_with_n_N_m=Add_N_and_m()

#-----------------------------------------------------------------------------------------------------------------------------------------------------------------------


def Drop_dups():
    ''' For each dataframe, which contains a single submodule, it's protein constituents, and shared interactors, drop duplicate entries for identified SI proteins
    . This leaves a single entry for each shared interactor protein. '''
    lst=[]
    for df in Dataframes_list_with_n_N_m:
        NewDF=df.copy()
        Final_DF=NewDF.drop_duplicates('Possible_Shared_Interactors')
        Final_DF=Final_DF.rename(columns={'Possible_Shared_Interactors':'Shared_Interactor'})
        lst.append(Final_DF)
        
    return lst

Drop_Dups_lst=Drop_dups()


def Return_M():
    ''' Function identifies 'M' (the total number of interactions for each Shared Interactor protein in the background network) and adds that number
    to the dataframe'''
    lst=[]
    for df in Drop_Dups_lst:
        NewDF=df.copy()
        NewDF2=df.copy()
        NewDF_lst=NewDF['Shared_Interactor'].tolist()                                                                                                            # place all proteins in the 'Shared_Interactor' column in a list 
        Shared_Interactors=Num_Prot_Inter[Num_Prot_Inter['Protein'].isin(NewDF_lst)].rename(columns={'Protein':'Shared_Interactor', 'Total':'M'})
        Shared_Interactor_merge=Shared_Interactors.merge(NewDF2, on='Shared_Interactor')
        Shared_Interactor_merge=Shared_Interactor_merge.sort_values(by='Shared_Interactor', ascending=True)
        lst.append(Shared_Interactor_merge)
        
    return lst

Return_M_lst=Return_M()

#-----------------------------------------------------------------------------------------------------------------------------------------
def hyper(N,M,n,m): 
    ''' Function defines the parameters for a hypergeometric test that returns a p-value representing the chances of identifying >= x, where x is the number of successes '''  
    frozendist=hypergeom(N,M,n)
    ms=np.arange(m, min(n+1, M+1))
    rv=0;
    for single_m in ms: rv=rv+frozendist.pmf(single_m)
    return rv

def run_hyper():
    ''' Function calls the hypergeometric function above  on each shared interactor for each submodule'''
    lst=[]
    for df in Return_M_lst:
        if not df.empty:
            NewDF=df.copy()
            NewDF['p-value'] = NewDF.apply(lambda row: hyper(row['N'], row['M'], row['n'], row['m']), axis=1)
            lst.append(NewDF)
        
    return lst 

run_hyper_lst=run_hyper()

#-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
def AppendDFs():
    ''' Append DFs for each submodule and it's SIs together into a single DF'''   
    EmptyDF = pd.DataFrame() #
    for df in run_hyper_lst: 
        df=df.copy() 
        EmptyDF=EmptyDF.append(df)
    return EmptyDF

Final=AppendDFs()

#-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
''' Prepping for Benjamini Hochberg procedure. Below code is ranking p-values from 1 to n based on lowest to highest p-value score'''

Final=Final.sort_values(by=['p-value'],ascending=[True])                                                                                              # Sort p-values from lowest to highest
Final_resetIndex=Final.reset_index()                                                                                                        # Reset the index after the sort
Final_resetIndex.index +=1                                                                                                                  # start numbering at 1 for index
       
NewDF=Final_resetIndex
NewDF_Allp_values=Final_resetIndex
NewDF=NewDF[['p-value']]                                                                                                                    # select only the p-value column of the dataframe 
NewDF_dropdups=NewDF.drop_duplicates('p-value')                                                                                             # drop duplicate p-values
NewDF_dropdups=NewDF_dropdups.reset_index()                                                                                                 # reset the index
NewDF_dropdups.index +=1                                                                                                                    # start numbering at 1 for index
NewDF_dropdups['Rank(i)'] = NewDF_dropdups.index                                                                                            # #Add a rank column that will be filled with index values. 
NewDF_dropdups=NewDF_dropdups.drop('index', 1)                                                                                              # Drop the additional column 'index' that is not sorted.
NewDF_merge=NewDF_Allp_values.merge(NewDF_dropdups, on='p-value')                                                                           # create a new dataframe that is a merge of the dataframe with all p-values, and the dataframe with unique p-values and their ranks. 
NewDF_merge=NewDF_merge.drop('index',1)                                                                                                     # drop the index that was added from the merge. This leaves all p-values ordered from lowest to highest with their ranking.

'''Add parameters necessary for completing Benjamini-Hochberg procedure '''

NewDF=NewDF_merge
NewDF['m_(number_of_tests)']=(len(NewDF))                                                                                                   # Add 'm (number of tests)' column 
NewDF['Q_(FDR)']=0.05      # THIS IS THE FDR VALUE, USER CAN CHANGE **************************************                                                                                                                 # Add Q (FDR) column. This can be changed manually.
NewDF['(i/m)Q']=((NewDF['Rank(i)']/NewDF['m_(number_of_tests)'])*NewDF['Q_(FDR)'])                                                          # add the (i/m)Q column 
NewDF['BH_significant']=NewDF.apply(lambda x: 1 if x['p-value']<x['(i/m)Q'] else 0, axis=1)                                                 # Identify which proteins are  significant. 
NewDF=pd.merge(left=NewDF, right=Annotation_DF, how='left', left_on='Shared_Interactor', right_on='systematic_name_dash_removed')           # complete a merge to recover dashed version of YORFs
del NewDF['Shared_Interactor'] 
del NewDF['systematic_name_dash_removed']
del NewDF['Directed']
NewDF.columns = ['M','Submodule_Containing_Proteins', 'Interaction', 'Submodule', 'n','N','m','p-value','Rank(i)', 'm_(number_of_tests)', 'Q_(FDR)','(i/m)Q','BH_significant', 'Shared_Interactor'] # rename columns

myDF = pd.DataFrame(NewDF)
filename = 'Network_Submodule_Nodes_background_Network.csv'
myDF.to_csv(filename, index=False, encoding='utf-8' )       # Write out final file with enriched shared interactors for each submodule


# FILTER FOR THE FINAL Shared Interactors.
# Open and parse Network_Submodule_Nodes_background_Network.csv 
# Only keep the identified Shared Interactors about the first zero that appears in the BH_Significant column.
with open('Final_enriched.csv', 'w') as outfile, open('Network_Submodule_Nodes_background_Network.csv', 'r') as f:
    for line in f:
        if line.startswith('M,'):
            outfile.write(line)
            continue
        dat = line.split(',')
        if dat[12] == '0':
            break
        else:
            outfile.write(line)

f.close()
outfile.close()

# OUTPUT: SI_Identification_SubmoduleS__SIs_and_Targets_FDR.csv, 
#         Network_Submodule_Nodes_background_Network.csv
#         Final_enriched.csv


# Classify Shared Interactors Inputs Outputs

For each SI and it's connections with submodule protein constituents, determine if the SI acts upon the submodule (that is, the Shared Interactor has at least 1 directional interaction, or ppi interaction, with a submodule protein), or if the submodule acts upon the SI (that is, all interactions between the SI and submodule proteins have the 'Reverse' designation', indicating that the submodule proteins act upon the SI).

- If all of the interactions are reversed, then the script will define the relationship between the SI and the submodule as "Output"

- If there is at least one interaction that is directed from SI towards submodule, or is a ppi, the relationship between the SI and the submodule is defined as "Input"

This script takes an input file that contains the following:

- All enriched Shared Interactors (SIs) (according to HyperG) and their connections to submodules.
- All known protein interactions for each SI (ppi, kinase-substrate, etc)
- Many of these interactions are directed (kinase-substrate, metabolic pathway, etc). PPI are not a directed interaction.

Input: plain csv text file

Csv format:
SI_submodule,Shared_Interactor,SI_name,Motif_Containing_Proteins,submodule_Name
,Interaction_Directionality

YLR164C_Repressed_..RR.s.No_Phenotype_Exists,YLR164C,Tpk1,YDR207C,
Repressed_..RR.s.No_Phenotype_Exists, kinase_substrate

Column order is unimportant, column names must match above.


In [17]:
import yeast_Gene_name_to_ORF as yg          # useSI_Identification_SubmoduleS__SIs_and_Targets_FDR.csv'd to get standard name

# create input files 
# SI_Identification_SubmoduleS__SIs_and_Targets_FDR.csv - input from ID shared interactors
submod = dict()
with open('SI_Identification_SubmoduleS__SIs_and_Targets_FDR.csv', 'r') as f:
    for line in f:
        row = line.rstrip().split(',')
        submod[row[0] + '_' + row[2]] = row

f.close()

# get network information
with open('classify_sharedInteractors_input.csv', 'w') as out:
    header = '%s,%s,%s,%s,%s,%s\n' %('SI_submodule','Shared_Interactor','SI_name','Motif_Containing_Proteins','submodule_Name'
,'Interaction_Directionality')
    out.write(header)
    with open('Final_enriched.csv') as f:
        for line in f:
            if line.startswith('M'):
                continue
            row = line.rstrip().split(',')
            if int(row[12]) != 1:
                continue
            name = row[1] + '_' + row[3]
            if name in submod:
                if row[1].endswith(('A','B')):
                    tmp = list(row[1])
                    tmp.insert(-1,'-')
                    row[1] = "".join(tmp)
                n = re.sub('-', '', row[1])
                ln = name + ',' + n + ',' + yg.sc_orfToGene[row[1]] + ',' + row[-1] + ',' + row[3] + ',' + row[2] + '\n'
                out.write(ln)

Input_df=pd.read_csv('classify_sharedInteractors_input.csv')

def Split_based_on_SI_submodule_Column():
    ''' Function splits the input DF into independent DFs based on the SI-submodule column pairs. Thus, each SI and it's submodule protein interactions are
    in independent dataframes '''
    DF_lst =[]
    for SI_submodule in Input_df['SI_submodule'].unique():
        DF=Input_df.loc[Input_df['SI_submodule']==SI_submodule]
        DF_lst.append(DF)
    return DF_lst

DF_lst=Split_based_on_SI_submodule_Column()

def Count_Instances_of_Reverse_Interaction():
    ''' Function counts, for each DF, and thus each SI-submodule pair, how many of the interactions are 'reversed', or facing from submodule TOWARDS SI. 
        It also counts the length of the dataframe, and then subtracts the the length of the dataframe from the counts. If the resultant value is 0, then all of the interactions 
        were reversed '''
    DF_Counts_lst=[]
    for df in DF_lst:
        df=df.copy()
        df['Counts']=df.Interaction_Directionality.str.contains('Reversed').sum()                                                               # Count the number of interactions that are "Reversed"
        x=len(df)
        df['Length']=x
        df['Counts_Length']=df['Counts']-df['Length']
        
        DF_Counts_lst.append(df)
    return DF_Counts_lst

DF_Counts_lst=Count_Instances_of_Reverse_Interaction()

def Only_Reverse_Interactions_Move_to_Outgoing_Columns():
    '''Function assigns 'Input' and 'Output' classifications based on the 'Counts_Length' column in the dataframe. '0' values are 'outputs', all other's are 'inputs' '''
    df_Modified_Outgoing_lst=[]
    for df in DF_Counts_lst:
        for value in df['Counts_Length'].unique():
           
            if value == 0:
                df['Shared_Interactor_submodule_Relationship']= 'Output'
                df_Modified_Outgoing_lst.append(df)
            else:
                df['Shared_Interactor_submodule_Relationship']= 'Input'
                df_Modified_Outgoing_lst.append(df)
           
    return df_Modified_Outgoing_lst
            
df_Modified_Outgoing_lst=Only_Reverse_Interactions_Move_to_Outgoing_Columns()


def AppendDFs(): 
    '''Function appends all dataframes back together '''
    EmptyDF = pd.DataFrame()
    for df in df_Modified_Outgoing_lst: 
        df=df.copy() 
        EmptyDF=EmptyDF.append(df) 
    return EmptyDF

Final=AppendDFs()    

Final_Keep_Columns_Needed_For_SIF=Final[['SI_submodule', 'Shared_Interactor', 'submodule_Name', 'Shared_Interactor_submodule_Relationship']]  
Final_Keep_Columns_Needed_For_SIF=Final_Keep_Columns_Needed_For_SIF.drop_duplicates('SI_submodule')                                                                     # Dropping duplicates entries, which are created because for each SI-submodule interaction there are numerous interactions with protein constituent. Only want a single interaction, input or output, for each SI and it's submodule. 

# create a new dataframe and write results to file
myDF = pd.DataFrame(Final_Keep_Columns_Needed_For_SIF)
filename = 'SIs_submodule_Relationships_Define_ClassA_Network.csv'
myDF.to_csv(filename, index=False, encoding='utf-8',sep='\t') 
    
# OUTPUT:  classify_sharedInteractors_input.csv
#          SIs_submodule_Relationships_Define_ClassA_Network.csv

# Create Fasta file

Since Kevin identified submodules manually, we use his input file:  Group_1_Motifx-results.txt 

This file is parsed to produce a file which looks like:

    Module,Name,Sequence
    Group_1_......SP.....,YJL082W_S187,KNSSSPSPSEKSQ

All peptide sequences should be the same length (13 amino acids).

Module constituents should be used here, not submodules. 
Fasta Files for each module will be created in a dir called: FastaFiles_Modules/

The output Fasta format files are named with their module designation.

In [19]:
# open & parse file
with open('pwm_input.csv', 'w') as out:
    header = 'Module,Name,Sequence\n'
    out.write(header)
    with open('Modules_pPep.csv','r') as f:               # here we used Kevin's Group_#_Motifx-results.txt
        for line in f:
            if line.startswith('\tPpep'):
                continue
            dat = line.rstrip().split('\t')      # CHECK DELIMITER usually , or tab
            outrow = '%s,%s,%s\n' %(dat[10],dat[1], dat[4])
            out.write(outrow)


Input_df=pd.read_csv('pwm_input.csv')

def Split_Into_SeparateDFs():
    ''' Function splits the input dataframe, based on the module name, into independent dataframes for each module'''
    df_lst=[]
    for Module in Input_df['Module'].unique():
        DF=Input_df.loc[Input_df['Module']==Module]
        df_lst.append(DF)
        
    return df_lst

df_lst=Split_Into_SeparateDFs()
#print (df_lst)

if not os.path.exists('FastaFiles_Modules'):
    os.mkdir('FastaFiles_Modules')

def CreateIndividualFastaFiles():
    '''Function creates individual fasta files for each module nd writes them out to a user defined directory'''
    for df in df_lst:                
        Module_lst=df["Module"].tolist()    
        for name in Module_lst:          
        # open a new file that contains the module name. USER can Change directory here.
            ofile= open("FastaFiles_Modules/"+name+".fasta", "w") 
        
            df_lstName=df['Name'].tolist()             # send the module names to a list
            df_lstSeq=df['Sequence'].tolist()          # send the peptide sequences to a list 
            
            for i in range(len(df_lstSeq)):                    
                
                ofile.write(">" + df_lstName[i] + "\n" + df_lstSeq[i] + "\n")                                            # create a fasta file where the peptide name will be followed by the peptide sequence, on a new line
           
        df_lstName=[]                                                                                                    # empty each of the lists for the next iteration
        df_lstSeq=[]
        Module_lst=[]
        ofile.close           
        
    return 
      
CreateIndividualFastaFiles()

# OUTPUT: pwm_input.csv
#         FastaFiles_Modules/*.fasta 

# Create PWMs from Module Fasta

Generate PWMs for each module, using the module Fasta files. Module PWMs
can then be compared to PWMs for 63 known kinase recognition motifs (Mok et al.,
2010).

## Always run on the Module level, i.e. Induced_sp, not Induced_sp mutant phenotype

Input: A directory containing files in Fasta format.

Script uses BioPython to generate position weight matrices from a directory containing Fasta files for
each modules phospho-peptides. 

### Note: 
Duplicate amino acid sequences should be removed from the Fasta files before running this script, if they exist, to prevent overweighting the matrix. No value can be zero in the pwm, if the script fails check.


    
output file should look like:

    Motif,AA,0,1,2,3,4,5,6,7,8,9,10,11,12
    Induced_...R.NS......,A:,0.044444444444444446,0.044444444444444446,0.044444444444444446, etc...
    Induced_...R.NS......,C:,0.022222222222222223,0.022222222222222223,0.022222222222222223, etc...


In [20]:
# Remove duplicate sequences from each fasta file


clean = dict()

for fasta in glob.glob('FastaFiles_Modules/*'):
    print('processing file: %s ' %(fasta))
    for seq_record in SeqIO.parse(fasta, 'fasta'):   # create Seq objects
        s = str(seq_record.seq)
        if s not in clean:
            clean[s] = seq_record                    # only keep unique sequences
            
    out_handle = open('tmp.fasta', 'w')              
    
    for k,v in clean.items():                        # write unique sequences to tmp file  
        SeqIO.write(v, out_handle, 'fasta')
    out_handle.close()
            
    shutil.move('tmp.fasta', fasta)                  # overwrite original fasta file    

# OUTPUT: FastaFiles_Modules/*.fasta  file have duplicates removed

processing file: FastaFiles_Modules/Induced_......SP....._ire1_Induced_Defective.fasta 
processing file: FastaFiles_Modules/Induced_......SP....._ire1_Induced_Amplified.fasta 
processing file: FastaFiles_Modules/Induced_...RR.S......_mkk1_2_Induced_Defective.fasta 
processing file: FastaFiles_Modules/Induced_......SP....._No_Phenotype_Exists.fasta 
processing file: FastaFiles_Modules/Induced_...RR.S......_No_Phenotype_Exists.fasta 
processing file: FastaFiles_Modules/Repressed_......TP....._No_Phenotype_Exists.fasta 
processing file: FastaFiles_Modules/Induced_...R..S......_No_Phenotype_Exists.fasta 
processing file: FastaFiles_Modules/Induced_...K..SP....._mkk1_2_Induced_Defective.fasta 
processing file: FastaFiles_Modules/Induced_...R..S......_ire1_Induced_Defective.fasta 
processing file: FastaFiles_Modules/Induced_...K..SP....._ire1_Induced_Amplified.fasta 
processing file: FastaFiles_Modules/Induced_....R.S......_No_Phenotype_Exists.fasta 
processing file: FastaFiles_Modules/Induc

In [21]:
alphabet = IUPAC.protein           # use protein alphabet
instances = []
# list of amino acids used to print the position weight matrix
AminoList = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y' ]
# column numbers for printing pwm, length of peptide, assumed to be 13, if different change last value
pep_Header = ','.join([str(i) for i in range(0,13)])     

# user defined directory containing Fasta files
#os.chdir("/home/mplace/projects/forMatt/Phospho_Network/")  

def CreatePWM():
    ''' Function creates PWMs for each Module '''
    instances = []
    with open('position_weight_matrix.txt', 'w') as out:
        out.write('Motif,AA,%s\n' %(pep_Header))                   
        for x in os.listdir('FastaFiles_Modules/'):                 # Iterate through the Fasta files in the directory
            if x.endswith('.fasta'):
                with open('FastaFiles_Modules/' + x, "r") as f:
                    for line in f:
                        if line.startswith('>'):                                       
                            continue
                        line = line.rstrip()                                                 
                        instances.append(Seq(line, IUPAC.protein))  # add amino acid sequence to instances
                    m = motifs.create(instances)
                    pwm = m.counts.normalize(pseudocounts = 1)      # Add a +1 pseudocount
                    instances = []
                    name = re.sub('.fasta', '', x)                  # use file name for 1st column          
                    for aa in AminoList :
                        score = [ str(i) for i in pwm[aa]]
                        score = ','.join(score)
                        out.write('%s,%s:,%s\n' %(name,aa,score))
    out.close()
                    
CreatePWM()

# OUTPUT: position_weight_matrix.txt

# Run Kullback-Leibler Module to Each Kinase

Purpose:  

To quantify similarity between the Mok et. al. kinase PWMs and the module
PWMs. Script employs a previously described quantitative motif comparison method
called Kullback-Leibler divergence (KLD) (Thijs et al., 2002, Gupta et al., 2007).
KLD generates a similarity measure by comparing the Kullback-Leiber distance, or
information content, for each amino acid at each position between a query and
comparison PWM. The more alike two PWMs are, the closer to zero the score approaches.

    KLD(X,Y) = 1/2 (E Xalog(Xa/Ya) + E Yalog(Ya/Xa))

Where ‘X’ represents a query PWM position and ‘Y’ a comparison PWM position.
Xa indicates the probability of a given amino acid a ε A in X. 
The symbol ‘A’ represents the length of the motif alphabet, which is 20, 
representing each of the naturally occurring amino acids. 



Input:
A plain text .csv file that contains all module position weight matrices. Each
module PWM should have 20 rows, representing each of the 20 naturally occurring
amino acids. They are in a column called "AA" which stands for amino acid. There
should also be 13 columns, labeled 0-12 (representing the 13 amino acid sequence length
of the phospho-peptides used to build the position weight matrix) that contain the
frequency of each amino acid at each position.

Csv file format
Motif,AA,0,1,2,3,4,5,6,7,8,9,10,11,12
Induced_...sP.,P:,0.05,0.05,0.03, 0.05,0.05,0.03,0.05,0.05,0.03, 0.05,0.05,0.03

In addition, a directory that contains the Mok et al kinase PWMs. They have the identical
format as above. They have been pre-generated and are available for download on Github.
The repository is titled, "Mok_kinase_PWMs"

Required Parameters: Pandas must be installed on your machine.

Output: A directory containing plain text .csv files named after each module (ie.
Induced_...sP..txt). Within the .csv files are 63 KLD scores representing how well the
63 Mok et al kinases match the module motif.

In [22]:
Compare_To=pd.read_csv('position_weight_matrix.txt')   # input module pwm                                                                                    # The PWMs for the Modules.

def DF_to_TSV(dataframe, NewFileName): 
    ''' Function writes out dataframes as TSV files'''
    #path ='' 
    dataframe.to_csv (NewFileName,sep='\t')  

def SplitCompareTOMotifs_df():
    ''' Function splits the Compare_To DF by Motif, which is listed in the "Motif" column, 
        and puts the new dataframes into a list
    '''
    DF_CompareTo_lst =[]
    for Motif in Compare_To['Motif'].unique():
        DF=Compare_To.loc[Compare_To['Motif']==Motif]
        DF_CompareTo_lst.append(DF)
    return DF_CompareTo_lst

DF_CompareTo_lst=SplitCompareTOMotifs_df()

def SplitInput_df_byMotif():
    ''' Split the Input dataframe by Motif and create indpendent dataframes'''
    DF_Input_lst =[]
    for Motif in Input['Motif'].unique():
        DF=Input.loc[Input['Motif']==Motif]
        DF_Input_lst.append(DF)
    return DF_Input_lst

def Copy(df):
    ''' Function makes a copy of a dataframe.  '''
    df=df.copy()
    return df

def mergeInputMotifFile_withDF_CompareTo(df_Input,df_CompareTo):
    ''' Merge the query and comparison PWMs so that KLD can be calculated by comparing column values'''
    df_merged=df_Input.merge(df_CompareTo, on='AA')
  
    return df_merged

def Calculate_log_x_y(df):
        ''' Function takes the log2 value of the amino acid frequency at each position of the query/comparison motifs'''
        df['0_log(x/y)'] = df.apply(lambda x: math.log(x['0_x'],2) - math.log(x['0_y'],2), axis=1)
        df['1_log(x/y)'] = df.apply(lambda x: math.log(x['1_x'],2) - math.log(x['1_y'],2), axis=1)
        df['2_log(x/y)'] = df.apply(lambda x: math.log(x['2_x'],2) - math.log(x['2_y'],2), axis=1)
        df['3_log(x/y)'] = df.apply(lambda x: math.log(x['3_x'],2) - math.log(x['3_y'],2), axis=1)
        df['4_log(x/y)'] = df.apply(lambda x: math.log(x['4_x'],2) - math.log(x['4_y'],2), axis=1)
        df['5_log(x/y)'] = df.apply(lambda x: math.log(x['5_x'],2) - math.log(x['5_y'],2), axis=1)
        df['6_log(x/y)'] = df.apply(lambda x: math.log(x['6_x'],2) - math.log(x['6_y'],2), axis=1)
        df['7_log(x/y)'] = df.apply(lambda x: math.log(x['7_x'],2) - math.log(x['7_y'],2), axis=1)
        df['8_log(x/y)'] = df.apply(lambda x: math.log(x['8_x'],2) - math.log(x['8_y'],2), axis=1)
        df['9_log(x/y)'] = df.apply(lambda x: math.log(x['9_x'],2) - math.log(x['9_y'],2), axis=1)
        df['10_log(x/y)'] = df.apply(lambda x: math.log(x['10_x'],2) - math.log(x['10_y'],2), axis=1)
        df['11_log(x/y)'] = df.apply(lambda x: math.log(x['11_x'],2) - math.log(x['11_y'],2), axis=1)
        df['12_log(x/y)'] = df.apply(lambda x: math.log(x['12_x'],2) - math.log(x['12_y'],2), axis=1)
        return df


def Calculate_log_y_x(df):
        ''' Function takes the log2 value of the amino acid frequency at each position of the comparison/query motifs'''
        df['0_log(y/x)'] = df.apply(lambda x: math.log(x['0_y'],2) - math.log(x['0_x'],2), axis=1)
        df['1_log(y/x)'] = df.apply(lambda x: math.log(x['1_y'],2) - math.log(x['1_x'],2), axis=1)
        df['2_log(y/x)'] = df.apply(lambda x: math.log(x['2_y'],2) - math.log(x['2_x'],2), axis=1)
        df['3_log(y/x)'] = df.apply(lambda x: math.log(x['3_y'],2) - math.log(x['3_x'],2), axis=1)
        df['4_log(y/x)'] = df.apply(lambda x: math.log(x['4_y'],2) - math.log(x['4_x'],2), axis=1)
        df['5_log(y/x)'] = df.apply(lambda x: math.log(x['5_y'],2) - math.log(x['5_x'],2), axis=1)
        df['6_log(y/x)'] = df.apply(lambda x: math.log(x['6_y'],2) - math.log(x['6_x'],2), axis=1)
        df['7_log(y/x)'] = df.apply(lambda x: math.log(x['7_y'],2) - math.log(x['7_x'],2), axis=1)
        df['8_log(y/x)'] = df.apply(lambda x: math.log(x['8_y'],2) - math.log(x['8_x'],2), axis=1)
        df['9_log(y/x)'] = df.apply(lambda x: math.log(x['9_y'],2) - math.log(x['9_x'],2), axis=1)
        df['10_log(y/x)'] = df.apply(lambda x: math.log(x['10_y'],2) - math.log(x['10_x'],2), axis=1)
        df['11_log(y/x)'] = df.apply(lambda x: math.log(x['11_y'],2) - math.log(x['11_x'],2), axis=1)
        df['12_log(y/x)'] = df.apply(lambda x: math.log(x['12_y'],2) - math.log(x['12_x'],2), axis=1)
        return df

def Calculate_Faax_times_log_x_y(df):
    ''' Function multiplies the frequency of an amino acid (Faax) "Xa" at a specific position in the query motif against the log(Xa/Ya) for that amino acid
     It is calculating this part of the function  "Xalog(Xa/Ya)" '''

    df['F(aax)*0_log(x/y)']=df['0_x']*df['0_log(x/y)']
    df['F(aax)*1_log(x/y)']=df['1_x']*df['1_log(x/y)']
    df['F(aax)*2_log(x/y)']=df['2_x']*df['2_log(x/y)']
    df['F(aax)*3_log(x/y)']=df['3_x']*df['3_log(x/y)']
    df['F(aax)*4_log(x/y)']=df['4_x']*df['4_log(x/y)']
    df['F(aax)*5_log(x/y)']=df['5_x']*df['5_log(x/y)']
    df['F(aax)*6_log(x/y)']=df['6_x']*df['6_log(x/y)']
    df['F(aax)*7_log(x/y)']=df['7_x']*df['7_log(x/y)']
    df['F(aax)*8_log(x/y)']=df['8_x']*df['8_log(x/y)']
    df['F(aax)*9_log(x/y)']=df['9_x']*df['9_log(x/y)']
    df['F(aax)*10_log(x/y)']=df['10_x']*df['10_log(x/y)']
    df['F(aax)*11_log(x/y)']=df['11_x']*df['11_log(x/y)']
    df['F(aax)*12_log(x/y)']=df['12_x']*df['12_log(x/y)']
    return df

def Calculate_Faay_times_log_y_x(df):
    ''' Function multiplies the frequency of an amino acid (Faay) "Ya" at a specific position in the query motif against the log(Ya/Xa) for that amino acid
     It is calculating this part of the function  "Yalog(Ya/Xa)" '''
    df['F(aay)*0_log(y/x)']=df['0_y']*df['0_log(y/x)']
    df['F(aay)*1_log(y/x)']=df['1_y']*df['1_log(y/x)']
    df['F(aay)*2_log(y/x)']=df['2_y']*df['2_log(y/x)']
    df['F(aay)*3_log(y/x)']=df['3_y']*df['3_log(y/x)']
    df['F(aay)*4_log(y/x)']=df['4_y']*df['4_log(y/x)']
    df['F(aay)*5_log(y/x)']=df['5_y']*df['5_log(y/x)']
    df['F(aay)*6_log(y/x)']=df['6_y']*df['6_log(y/x)']
    df['F(aay)*7_log(y/x)']=df['7_y']*df['7_log(y/x)']
    df['F(aay)*8_log(y/x)']=df['8_y']*df['8_log(y/x)']
    df['F(aay)*9_log(y/x)']=df['9_y']*df['9_log(y/x)']
    df['F(aay)*10_log(y/x)']=df['10_y']*df['10_log(y/x)']
    df['F(aay)*11_log(y/x)']=df['11_y']*df['11_log(y/x)']
    df['F(aay)*12_log(y/x)']=df['12_y']*df['12_log(y/x)']
    return df

def Column_SUM(df):
    ''' Function sums the values calculated by the previous two functions for each position, or column, in the PWMs  '''
    df['sum_0']=sum(df['F(aax)*0_log(x/y)'])+sum(df['F(aay)*0_log(y/x)'])
    df['sum_1']=sum(df['F(aax)*1_log(x/y)'])+sum(df['F(aay)*1_log(y/x)'])
    df['sum_2']=sum(df['F(aax)*2_log(x/y)'])+sum(df['F(aay)*2_log(y/x)'])
    df['sum_3']=sum(df['F(aax)*3_log(x/y)'])+sum(df['F(aay)*3_log(y/x)'])
    df['sum_4']=sum(df['F(aax)*4_log(x/y)'])+sum(df['F(aay)*4_log(y/x)'])
    df['sum_5']=sum(df['F(aax)*5_log(x/y)'])+sum(df['F(aay)*5_log(y/x)'])
    df['sum_6']=sum(df['F(aax)*6_log(x/y)'])+sum(df['F(aay)*6_log(y/x)'])
    df['sum_7']=sum(df['F(aax)*7_log(x/y)'])+sum(df['F(aay)*7_log(y/x)'])
    df['sum_8']=sum(df['F(aax)*8_log(x/y)'])+sum(df['F(aay)*8_log(y/x)'])
    df['sum_9']=sum(df['F(aax)*9_log(x/y)'])+sum(df['F(aay)*9_log(y/x)'])
    df['sum_10']=sum(df['F(aax)*10_log(x/y)'])+sum(df['F(aay)*10_log(y/x)'])
    df['sum_11']=sum(df['F(aax)*11_log(x/y)'])+sum(df['F(aay)*11_log(y/x)'])
    df['sum_12']=(sum(df['F(aax)*12_log(x/y)'])+sum(df['F(aay)*12_log(y/x)']))
    return df

def TotalScore(df):
    ''' Function calculates the total score by summing the summed values for each position in the PWM (13 positions)'''
    df['FinalScore']=df['sum_0']+df['sum_1']+df['sum_2']+df['sum_3']+df['sum_4']+df['sum_5']+df['sum_6']+df['sum_7']+df['sum_8']+df['sum_9']+df['sum_10']+df['sum_11']+df['sum_12']
    Lst=df['FinalScore'].unique()
    n=Lst[0]
    return n

# Import the Mok Kinases PWM .csv files individually and create dataframes
path=r"Mok_kinase_PWMs/"
filenames = glob.glob(path + "*.csv")

dfs_lst = []
for filename in filenames:
    dfs_lst.append(pd.read_csv(filename, sep=","))
    
ITER_NUM=1                                          # One iteration of the below function. 
dict_Final={}
for df2 in dfs_lst:                                 # This is the dataframe that has Module PWMs
    
    subModule_name=df2['Motif'].unique()
    for df in DF_CompareTo_lst:                     # select one of the compare to dataframes (Mok Kinase PWM)
        Kinase_name=[]
        Kinase_name=df['Motif'].unique()
 
        for iteration in range (ITER_NUM):                                      # for the first iteration 

            Copied=Copy(df2)                                                    # Copy Dataframe
            # Create a merged version of the dataframe for each Kinase PWM and each Module PWM
            df_merged=mergeInputMotifFile_withDF_CompareTo(Copied, df)  

            DF_1=Calculate_log_x_y(df_merged)                                   
            DF_2=Calculate_log_y_x(DF_1)
            DF_3=Calculate_Faax_times_log_x_y(DF_2)
            DF_4=Calculate_Faay_times_log_y_x(DF_3)
            DF_5=Column_SUM(DF_4)
        
            n=TotalScore(DF_5)                                                  
            #print (n)
            test_tup = (n, subModule_name[0])
            if Kinase_name[0] in dict_Final:
                dict_Final[Kinase_name[0]].append(test_tup)
            else:
                lst=[]
                dict_Final[Kinase_name[0]] = lst
                dict_Final[Kinase_name[0]].append(test_tup)


# write out the final dictionary to a folder where each key and value pair is an independent csv file. 
if not os.path.exists('ClassA_NoShuffle_KL'):
    os.mkdir('ClassA_NoShuffle_KL')
    
path="ClassA_NoShuffle_KL/"              # this is the path to the folder where the output files will be housed

for k, v in dict_Final.items():           # select each key and value pair in the dict 
    newFile=path+ k +'.csv'               # create newFile, that will have the path and name (the key, which is the kinase) associated with it
    #print (newFile)
    with open(newFile, 'w') as output:  
        output.write(k)
        output.write("\n")
        for x in v:
           
            output.write(str(x))
            output.write("\n")

# OUTPUT: ClassA_NoShuffle_KL/*.csv    

# Kullback-Leibler Module to Each Kinase Shuffled 1000x


Purpose:  The same algorithm as the last step is used but w/ 1000 Shuffles of the Mok Kinase PWMs are performed by the script, generating randomized PWMs that are compared against the Module PWMs, producing a distribution of scores.

Output: A directory containing plain text .csv files named after each module. Within
the .csv files are 63,000 KLD scores representing how well the 63 Mok et al kinases
match the module motif after 1000 permutations of each Mok kinase.

# NUMBER OF ITERATIONS SET TO 25 for testing


In [None]:
#Compare_To=pd.read_csv('position_weight_matrix.txt')
Input=pd.read_csv('position_weight_matrix.txt')

def SplitCompareTOMotifs_df():
    ''' Function splits the Compare_To DF by Motif, which is listed in the "Motif" column, and puts the new dataframes into a list'''
    DF_CompareTo_lst =[]
    for Motif in Compare_To['Motif'].unique():
        DF=Compare_To.loc[Compare_To['Motif']==Motif]
        DF_CompareTo_lst.append(DF)
    return DF_CompareTo_lst

DF_CompareTo_lst=SplitCompareTOMotifs_df()


def SplitInput_df_byMotif():
    ''' Split the Input dataframe by Motif and create indpendent dataframes'''
    DF_Input_lst =[]
    for Motif in Input['Motif'].unique():
        DF=Input.loc[Input['Motif']==Motif]
        DF_Input_lst.append(DF)
    return DF_Input_lst

#DF_Input_lst=SplitInput_df_byMotif()
#Unique_Input_Motif_Names_lst=Input['Motif'].unique()

def Shuffle(df):
    '''Shuffle each column by row, after first creating independent dataframes for each column (position within the PWM)''' 

    df_Index = df[['Motif','AA']]  
    df_data_0 = df['0']#,'1','2','3','4','5','7','8','9','10','11','12']]
    df_data_1 = df['1']#'2','3','4','5','7','8','9','10','11','12']]
    df_data_2 = df['2']
    df_data_3 = df['3']
    df_data_4 = df['4']
    df_data_5 = df['5']
    df_data_6 = df['6']
    df_data_7 = df['7']
    df_data_8 = df['8']
    df_data_9 = df['9']
    df_data_10 = df['10']
    df_data_11 = df['11']
    df_data_12 = df['12']
    Shuffled_Input_0=df_data_0.iloc[np.random.permutation(len(df_data_0))] 
    Shuffled_Input_1=df_data_1.iloc[np.random.permutation(len(df_data_1))]                                  # Shuffle the data by row
    Shuffled_Input_2=df_data_2.iloc[np.random.permutation(len(df_data_2))] 
    Shuffled_Input_3=df_data_3.iloc[np.random.permutation(len(df_data_3))] 
    Shuffled_Input_4=df_data_4.iloc[np.random.permutation(len(df_data_4))] 
    Shuffled_Input_5=df_data_5.iloc[np.random.permutation(len(df_data_5))] 
    Shuffled_Input_6=df_data_6.iloc[np.random.permutation(len(df_data_6))] 
    Shuffled_Input_7=df_data_7.iloc[np.random.permutation(len(df_data_7))] 
    Shuffled_Input_8=df_data_8.iloc[np.random.permutation(len(df_data_8))]
    Shuffled_Input_9=df_data_9.iloc[np.random.permutation(len(df_data_9))] 
    Shuffled_Input_10=df_data_10.iloc[np.random.permutation(len(df_data_10))] 
    Shuffled_Input_11=df_data_11.iloc[np.random.permutation(len(df_data_11))] 
    Shuffled_Input_12=df_data_12.iloc[np.random.permutation(len(df_data_12))] 

    Shuffled_Input_0_reset_Index= Shuffled_Input_0.reset_index(drop=True)                                   # reset the index for a later merge
    Shuffled_Input_1_reset_Index= Shuffled_Input_1.reset_index(drop=True)
    Shuffled_Input_2_reset_Index= Shuffled_Input_2.reset_index(drop=True) 
    Shuffled_Input_3_reset_Index= Shuffled_Input_3.reset_index(drop=True)
    Shuffled_Input_4_reset_Index= Shuffled_Input_4.reset_index(drop=True) 
    Shuffled_Input_5_reset_Index= Shuffled_Input_5.reset_index(drop=True)
    Shuffled_Input_6_reset_Index= Shuffled_Input_6.reset_index(drop=True) 
    Shuffled_Input_7_reset_Index= Shuffled_Input_7.reset_index(drop=True)
    Shuffled_Input_8_reset_Index= Shuffled_Input_8.reset_index(drop=True) 
    Shuffled_Input_9_reset_Index= Shuffled_Input_9.reset_index(drop=True)
    Shuffled_Input_10_reset_Index= Shuffled_Input_10.reset_index(drop=True)
    Shuffled_Input_11_reset_Index= Shuffled_Input_11.reset_index(drop=True) 
    Shuffled_Input_12_reset_Index= Shuffled_Input_12.reset_index(drop=True)
    
    result = pd.concat([df_Index, Shuffled_Input_0_reset_Index, Shuffled_Input_1_reset_Index, Shuffled_Input_2_reset_Index, Shuffled_Input_3_reset_Index,
                       Shuffled_Input_4_reset_Index, Shuffled_Input_5_reset_Index, Shuffled_Input_6_reset_Index, Shuffled_Input_7_reset_Index,
                       Shuffled_Input_8_reset_Index, Shuffled_Input_9_reset_Index, Shuffled_Input_10_reset_Index, Shuffled_Input_11_reset_Index, Shuffled_Input_12_reset_Index], axis=1) # concatenate the dataframes - they will sit side by side since have the same index (numbering)
    result_reordered=result[[ 'Motif', 'AA','0','1', '2','3','4','5','6','7','8','9','10','11','12']]       # reorder the columns so in the correct PWM order.
    
    result_reordered_Index=result_reordered[['Motif','AA']]
    result_reordered_Frame=result_reordered[['0','1', '2','3','4','5','6','7','8','9','10','11','12']]
    cols = result_reordered_Frame.columns.tolist()                                                          # send column headers to a list
   
    random.shuffle(cols)                                                                                    # shuffle the columns, which are in a list by name, and return a different order
 
    
    
    FinalDF=result_reordered_Frame[cols]                                                                    # make a new dataframe with randomly shuffled columns
    FinalDF.columns = ['0','1', '2','3','4','5','6','7','8','9','10','11','12']                             # reset the column names, so that they have the original names
    FinalDataframe= pd.concat([result_reordered_Index, FinalDF],axis=1)
    return FinalDataframe

Shuffled=Shuffle(Input)

def mergeInputMotifFile_withDF_CompareTo(df_Input,df_CompareTo):
    ''' Merge the query and comparison PWMs so that KLD can be calculated by comparing column values'''
    df_merged=df_Input.merge(df_CompareTo, on='AA')
    #df_lst.append(df_merged)
    return df_merged
   
df_merged=mergeInputMotifFile_withDF_CompareTo(Shuffled, Compare_To)

def Calculate_log_x_y(df):
        ''' Function takes the log2 value of the amino acid frequency at each position of the query/comparison motifs'''
        df['0_log(x/y)'] = df.apply(lambda x: math.log(x['0_x'],2) - math.log(x['0_y'],2), axis=1)
        df['1_log(x/y)'] = df.apply(lambda x: math.log(x['1_x'],2) - math.log(x['1_y'],2), axis=1)
        df['2_log(x/y)'] = df.apply(lambda x: math.log(x['2_x'],2) - math.log(x['2_y'],2), axis=1)
        df['3_log(x/y)'] = df.apply(lambda x: math.log(x['3_x'],2) - math.log(x['3_y'],2), axis=1)
        df['4_log(x/y)'] = df.apply(lambda x: math.log(x['4_x'],2) - math.log(x['4_y'],2), axis=1)
        df['5_log(x/y)'] = df.apply(lambda x: math.log(x['5_x'],2) - math.log(x['5_y'],2), axis=1)
        df['6_log(x/y)'] = df.apply(lambda x: math.log(x['6_x'],2) - math.log(x['6_y'],2), axis=1)
        df['7_log(x/y)'] = df.apply(lambda x: math.log(x['7_x'],2) - math.log(x['7_y'],2), axis=1)
        df['8_log(x/y)'] = df.apply(lambda x: math.log(x['8_x'],2) - math.log(x['8_y'],2), axis=1)
        df['9_log(x/y)'] = df.apply(lambda x: math.log(x['9_x'],2) - math.log(x['9_y'],2), axis=1)
        df['10_log(x/y)'] = df.apply(lambda x: math.log(x['10_x'],2) - math.log(x['10_y'],2), axis=1)
        df['11_log(x/y)'] = df.apply(lambda x: math.log(x['11_x'],2) - math.log(x['11_y'],2), axis=1)
        df['12_log(x/y)'] = df.apply(lambda x: math.log(x['12_x'],2) - math.log(x['12_y'],2), axis=1)
        return df

DF_1=Calculate_log_x_y(df_merged)


def Calculate_log_y_x(df):
        ''' Function takes the log2 value of the amino acid frequency at each position of the comparison/query motifs'''
        df['0_log(y/x)'] = df.apply(lambda x: math.log(x['0_y'],2) - math.log(x['0_x'],2), axis=1)
        df['1_log(y/x)'] = df.apply(lambda x: math.log(x['1_y'],2) - math.log(x['1_x'],2), axis=1)
        df['2_log(y/x)'] = df.apply(lambda x: math.log(x['2_y'],2) - math.log(x['2_x'],2), axis=1)
        df['3_log(y/x)'] = df.apply(lambda x: math.log(x['3_y'],2) - math.log(x['3_x'],2), axis=1)
        df['4_log(y/x)'] = df.apply(lambda x: math.log(x['4_y'],2) - math.log(x['4_x'],2), axis=1)
        df['5_log(y/x)'] = df.apply(lambda x: math.log(x['5_y'],2) - math.log(x['5_x'],2), axis=1)
        df['6_log(y/x)'] = df.apply(lambda x: math.log(x['6_y'],2) - math.log(x['6_x'],2), axis=1)
        df['7_log(y/x)'] = df.apply(lambda x: math.log(x['7_y'],2) - math.log(x['7_x'],2), axis=1)
        df['8_log(y/x)'] = df.apply(lambda x: math.log(x['8_y'],2) - math.log(x['8_x'],2), axis=1)
        df['9_log(y/x)'] = df.apply(lambda x: math.log(x['9_y'],2) - math.log(x['9_x'],2), axis=1)
        df['10_log(y/x)'] = df.apply(lambda x: math.log(x['10_y'],2) - math.log(x['10_x'],2), axis=1)
        df['11_log(y/x)'] = df.apply(lambda x: math.log(x['11_y'],2) - math.log(x['11_x'],2), axis=1)
        df['12_log(y/x)'] = df.apply(lambda x: math.log(x['12_y'],2) - math.log(x['12_x'],2), axis=1)
        return df

def Calculate_Faax_times_log_x_y(df):
    ''' Function multiplies the frequency of an amino acid (Faax) "Xa" at a specific position in the query motif against the log(Xa/Ya) for that amino acid
     It is calculating this part of the function  "Xalog(Xa/Ya)" '''
    df['F(aax)*0_log(x/y)']=df['0_x']*df['0_log(x/y)']
    df['F(aax)*1_log(x/y)']=df['1_x']*df['1_log(x/y)']
    df['F(aax)*2_log(x/y)']=df['2_x']*df['2_log(x/y)']
    df['F(aax)*3_log(x/y)']=df['3_x']*df['3_log(x/y)']
    df['F(aax)*4_log(x/y)']=df['4_x']*df['4_log(x/y)']
    df['F(aax)*5_log(x/y)']=df['5_x']*df['5_log(x/y)']
    df['F(aax)*6_log(x/y)']=df['6_x']*df['6_log(x/y)']
    df['F(aax)*7_log(x/y)']=df['7_x']*df['7_log(x/y)']
    df['F(aax)*8_log(x/y)']=df['8_x']*df['8_log(x/y)']
    df['F(aax)*9_log(x/y)']=df['9_x']*df['9_log(x/y)']
    df['F(aax)*10_log(x/y)']=df['10_x']*df['10_log(x/y)']
    df['F(aax)*11_log(x/y)']=df['11_x']*df['11_log(x/y)']
    df['F(aax)*12_log(x/y)']=df['12_x']*df['12_log(x/y)']
    return df

def Calculate_Faay_times_log_y_x(df):
    ''' Function multiplies the frequency of an amino acid (Faay) "Ya" at a specific position in the query motif against the log(Ya/Xa) for that amino acid
     It is calculating this part of the function  "Yalog(Ya/Xa)" '''
    df['F(aay)*0_log(y/x)']=df['0_y']*df['0_log(y/x)']
    df['F(aay)*1_log(y/x)']=df['1_y']*df['1_log(y/x)']
    df['F(aay)*2_log(y/x)']=df['2_y']*df['2_log(y/x)']
    df['F(aay)*3_log(y/x)']=df['3_y']*df['3_log(y/x)']
    df['F(aay)*4_log(y/x)']=df['4_y']*df['4_log(y/x)']
    df['F(aay)*5_log(y/x)']=df['5_y']*df['5_log(y/x)']
    df['F(aay)*6_log(y/x)']=df['6_y']*df['6_log(y/x)']
    df['F(aay)*7_log(y/x)']=df['7_y']*df['7_log(y/x)']
    df['F(aay)*8_log(y/x)']=df['8_y']*df['8_log(y/x)']
    df['F(aay)*9_log(y/x)']=df['9_y']*df['9_log(y/x)']
    df['F(aay)*10_log(y/x)']=df['10_y']*df['10_log(y/x)']
    df['F(aay)*11_log(y/x)']=df['11_y']*df['11_log(y/x)']
    df['F(aay)*12_log(y/x)']=df['12_y']*df['12_log(y/x)']
    return df

def Column_SUM(df):
    ''' Function sums the values calculated by the previous two functions for each position, or column, in the PWMs  '''
    df['sum_0']=sum(df['F(aax)*0_log(x/y)'])+sum(df['F(aay)*0_log(y/x)'])
    df['sum_1']=sum(df['F(aax)*1_log(x/y)'])+sum(df['F(aay)*1_log(y/x)'])
    df['sum_2']=sum(df['F(aax)*2_log(x/y)'])+sum(df['F(aay)*2_log(y/x)'])
    df['sum_3']=sum(df['F(aax)*3_log(x/y)'])+sum(df['F(aay)*3_log(y/x)'])
    df['sum_4']=sum(df['F(aax)*4_log(x/y)'])+sum(df['F(aay)*4_log(y/x)'])
    df['sum_5']=sum(df['F(aax)*5_log(x/y)'])+sum(df['F(aay)*5_log(y/x)'])
    df['sum_6']=sum(df['F(aax)*6_log(x/y)'])+sum(df['F(aay)*6_log(y/x)'])
    df['sum_7']=sum(df['F(aax)*7_log(x/y)'])+sum(df['F(aay)*7_log(y/x)'])
    df['sum_8']=sum(df['F(aax)*8_log(x/y)'])+sum(df['F(aay)*8_log(y/x)'])
    df['sum_9']=sum(df['F(aax)*9_log(x/y)'])+sum(df['F(aay)*9_log(y/x)'])
    df['sum_10']=sum(df['F(aax)*10_log(x/y)'])+sum(df['F(aay)*10_log(y/x)'])
    df['sum_11']=sum(df['F(aax)*11_log(x/y)'])+sum(df['F(aay)*11_log(y/x)'])
    df['sum_12']=(sum(df['F(aax)*12_log(x/y)'])+sum(df['F(aay)*12_log(y/x)']))
    return df

def TotalScore(df):
    ''' Function calculates the total score by summing the summed values for each position in the PWM (13 positions)'''
    df['FinalScore']=df['sum_0']+df['sum_1']+df['sum_2']+df['sum_3']+df['sum_4']+df['sum_5']+df['sum_6']+df['sum_7']+df['sum_8']+df['sum_9']+df['sum_10']+df['sum_11']+df['sum_12']
    Lst=df['FinalScore'].unique()
    n=Lst[0]
    return n

# Import the Mok Kinases PWM .csv files individually and create dataframes
path=r"Mok_kinase_PWMs/"
filenames = glob.glob(path + "*.csv")

dfs_lst = []
for filename in filenames:
    dfs_lst.append(pd.read_csv(filename, sep=","))
    
# CHANGE THE NUMBER OF ITERATIONS HERE IF DESIRED
ITER_NUM=25                                                                         # 1000 interations of this function
dict_Final={}
for df2 in dfs_lst:                                                                 # this is the dataframe that has Mok Kinase PWMs
    
    subModule_name=df2['Motif'].unique()
    for df in DF_CompareTo_lst:                                                     # select one of the compare to dataframes (Modules)
        Kinase_name=[]
        Kinase_name=df['Motif'].unique()
   
        for iteration in range (ITER_NUM):                                          # for iteration x 
    
            Shuffled=Shuffle(df2)                                                   # shuffle the dataframe by row and column
            df_merged=mergeInputMotifFile_withDF_CompareTo(Shuffled, df)   
            DF_1=Calculate_log_x_y(df_merged)
            DF_2=Calculate_log_y_x(DF_1)
            DF_3=Calculate_Faax_times_log_x_y(DF_2)
            DF_4=Calculate_Faay_times_log_y_x(DF_3)
            DF_5=Column_SUM(DF_4)
        
            n=TotalScore(DF_5)
            #print (n)
            test_tup = (n, subModule_name[0])
            if Kinase_name[0] in dict_Final:
                dict_Final[Kinase_name[0]].append(test_tup)
            else:
                lst=[]
                dict_Final[Kinase_name[0]] = lst
                dict_Final[Kinase_name[0]].append(test_tup)

# write out the final dictionary to a folder where each key and value pair is an independent csv file. 
if not os.path.exists('Shuffle_KL'):
    os.mkdir('Shuffle_KL')
    
path=r"Shuffle_KL/"  # path to output directory
              
for k, v in dict_Final.items():  
    newFile=path+ k +'.csv'       
    #print (newFile)
    with open(newFile, 'w') as output:  
        output.write(k)
        output.write("\n")
        for x in v:
            #for subModule_name in filenames:
            output.write(str(x))
            output.write("\n")

print('Kullback-Leibler Module to Each Kinase Shuffled 1000x complete')

# OUTPUT: Shuffle_KL/*.csv 


# Calculate FDR Each Module 
Purpose: Identify FDR scores for each Mok et. al. kinase and each module by comparing
the non-shuffled scores to the distribution of shuffled scores. The user can then manually
define the FDR cutoff to call kinases "motif-match" or "non-match" for a given module.

Input: Two directories and a single plain text .csv file, called "Kinases_Not_In_Mok.csv"
that is provided on the Github page. The first directory contains plain text .csv files with
KLD scores for non-shuffled Mok et al kinases and Modules. The second directory
contains plain text .csv files containing KLD scores for shuffled Mok et. al. kinases and
Modules.

Csv format (For both Input Directories)

Scores,Kinase,Module,
13.25,cdc15,Induced.sP.


Output: A table that contains for each module, all yeast kinases, including those found in
the Mok et al dataset and those that were absent, and their FDR scores for each module.
Kinases not found in the Mok et al dataset are given an FDR score of 1.

In [29]:
''' This script is calculates an FDR for each Mok Kinase to each Module. 
The script takes the 63,000 shuffled Mok et al kinase-module scores and determines for each kinase where that unshuffled
kinase-module score falls in the shuffled distribution.For example,if a Kinase has an shuffled score of 14.7 to a module,
and only 63 unshuffled kinase-module scores are below that value, then this kinase has has an FDR of 0.1% 
(63/63,000 scores). We can then use the FDR values for all kinases to a module to determine an FDR cutoff for that
module. Thus, we can say only these x kinases are a good match to the module. Calling an FDR threshold is 
done manually by the user.
'''

#Read in input files (which are the non-shuffled scores for all Mok kinases compared to each Module) 
path=r"ClassA_NoShuffle_KL/"
filenames = glob.glob(path + "*.csv")

labels = ['Scores','Kinase','Module']                # column header

def load_data(filenames, dfs_input_lst):
    ''' Read in files and parse to produce input data  '''
    for i in filenames:
        with open(i, 'r') as f:
            mod_name = f.readline().rstrip()
            for ln in f:
                ln = ln.rstrip()
                ln = re.sub('\(', '', re.sub('\)', '', re.sub('\'','', re.sub('\s', '', ln))))
                row = (ln + ',' +mod_name).split(',')
                dfs_input_lst.append(row)
        f.close()
    return dfs_input_lst

dfs_input_lst = []
Input = load_data(filenames, dfs_input_lst)
Input = pd.DataFrame(dfs_input_lst, columns=labels)

def SplitInput_df_byModule(data):
    ''' Function splits the input dataframe, by Module, into independent dataframes'''
    DF_Input_lst =[]
    for Module in data['Module'].unique():
        DF=data.loc[data['Module']==Module]
        DF_Input_lst.append(DF)
    return DF_Input_lst

DF_Input_lst=SplitInput_df_byModule(Input)


# Importing the Shuffled Mok kinase-Module Score csv files individually and creating dataframes
path=r"Shuffle_KL/"
filenames = glob.glob(path + "*.csv")

dfs_lst = []
shuffledData = load_data(filenames, dfs_lst)
shuffledData = pd.DataFrame(dfs_lst, columns=labels)

dfs_lst2= SplitInput_df_byModule(shuffledData)


def CountScores_Below():
    '''Function is calculating the number of scores in the shuffled distribution below a non-shuffled Kinase_Module score '''
    Final_DFs_lst=[]   #
    for DF_shuffled in dfs_lst2:  
        DF_shuffled=DF_shuffled.copy()
       
        for df_noShuffle in DF_Input_lst:  
            df_noShuffle=df_noShuffle.copy()
       
            if df_noShuffle['Module'].unique().all() == DF_shuffled['Module'].unique().all():      # if all of the values match in the module column for each df, then and only then, perform the below steps
                Input_Score_lst=df_noShuffle['Scores'].tolist()  
                
                Scores_lst=[]  
                for score in Input_Score_lst: 
                    Scores_lst_individual=[]
                    num_smaller_items = (DF_shuffled['Scores']<score).sum()                        # create a variable that is the sum of all scores below the score in the Shuffled_Scores dataframe
                 
                    Scores_lst_individual.append(num_smaller_items)                                # append the number of scores below a given kinase-module score to the individual list.
                    Scores_lst.append(Scores_lst_individual)                                       # append the individual scores to a list.
                    merged = list(itertools.chain(*Scores_lst))                                    # Flatten the list of lists. 
                    
                df_noShuffle['Counts_Less_Than']=merged
                df_noShuffle['Number_of_Scores']=len(DF_shuffled)                                  # take all of the summed scores, one per kinase from the kinase-module no-shuffle dataframe, and create a new column.
                Final_DFs_lst.append(df_noShuffle)  
    return Final_DFs_lst

DFs_with_CountsBelow_lst=CountScores_Below()
   
def Calculate_FDR():
    ''' Function calculates an FDR value by dividing the number of shuffled scores for a kinase-module
    that are smaller than the non-shuffled kinase-module score by all shuffled scores (63,000)  '''
    DFs_with_CountsBelow_lst2=[]
    for DF in DFs_with_CountsBelow_lst:
        DF['FDR']=DF['Counts_Less_Than']/DF['Number_of_Scores']
        DF=DF.sort_values(by=['FDR'], ascending=[True])
        DFs_with_CountsBelow_lst2.append(DF)
    return DFs_with_CountsBelow_lst2
 
DFs_with_CountsBelow_lst2=Calculate_FDR()

# Import the kinases not in the Mok et al dataset.
Kinases_Not_In_Mok_DF=pd.read_csv('Kinases_Not_In_Mok.csv')


def ConcatenateDFs_with_Kinases_Not_In_Mok():
    '''Function adds the kinases not in the Mok et al dataset to the dataframes for each module'''
    DFs_with_CountsBelow_lst3=[]
    for DF in DFs_with_CountsBelow_lst2:
        DF=DF.copy()
        FinalDF=DF.append(Kinases_Not_In_Mok_DF)
        DFs_with_CountsBelow_lst3.append(FinalDF)
    return DFs_with_CountsBelow_lst3

DFs_with_CountsBelow_lst3=ConcatenateDFs_with_Kinases_Not_In_Mok() 
  

def ConcatenateDFs(): 
    '''Function appends all of the dataframes, for each module, together into one dataframe''' 
    EmptyDF = pd.DataFrame() 
    for df in DFs_with_CountsBelow_lst3: 
        df=df.copy() 
        EmptyDF=EmptyDF.append(df) 
    return EmptyDF

Final=ConcatenateDFs()

def DF_to_CSV(dataframe, NewFileName): 
    ''' Write out dataframe as a tab separated file.'''
    dataframe.to_csv (NewFileName,sep='\t') 
    
DF_to_CSV(Final, 'FDR_Scores.csv')

# OUTPUT: FDR_Scores.csv

# Prep Merge_SI_subModule_relationships_with_FDR_Scores

Input

FDR Scores input file:

> Scores,Kinase,Group (According to Mok),Module,Counts_Less_Than,Number_of_Scores,FDR
> 11.44097292,pho85-pho80,Proline_directed,Induced_......SP.....,0,63000,0
> 11.92968433,fus3,Proline_directed,Induced_......SP.....,0,63000,0
> 11.02325264,cdc28,Proline_directed,Induced_......SP.....,0,63000,0


SI (Shared Interactors file) Final_enriched.csv from __Identify Shared Interactors__ Step
NOTE: the shared interactor name will which is last in the Final_enriched.csv file has to be moved to the
start of the line.  This is done in the next cell.

> Shared_Interactor,M,Motif_Containing_Proteins,Motif,n,N,m,p-value,Rank(i),m_(number_of_tests),Q_(FDR),>(i/m)Q,BH_significant
> YBR160W,304,YKL168C,Induced_......SP....._No_Phenotype_Exists,90,4638,24,1.37E-09,1,894,0.05,5.59E-05,1
> YNL293W,16,YLR319C,Induced_......SP....._mkk1_2_Induced_Defective,31,4638,4,2.81E-06,2,894,0.05,0.000111857,1




In [44]:
# Prep input files
# add group classification according to Mok to FDR_Scores.
Mok = dict()
with open('required/Mok_Kinase_Groups_Corrected.csv','r') as mk:
    for kns in mk:
        kns = kns.rstrip()
        group = kns.split()
        Mok[group[0]] = group[2]
mk.close()

grpName = ''

with open('FDR_Scores_merged.csv', 'w') as outfile, open('FDR_Scores.csv', 'r') as f:
    for line in f:
        line = line.rstrip()
        dat = line.split()
        if line.startswith('\tScores'):               # setup header row
            dat.insert(2,'Group (According to Mok)')
            out = ','.join(dat) + '\n'
            outfile.write(out)
            continue
        if dat[2] in Mok:
            grpName = dat[3]
            dat.insert(3,Mok[dat[2]])
            dat.pop(0)                 # removes data frame column number
            out = ','.join(dat) + '\n'
            outfile.write(out)
        else:
            dat.insert(3, grpName)
            dat.pop(0)                 # removes data frame column number
            out =','.join(dat) + '\n'
            outfile.write(out)
            
outfile.close()
f.close()

# REORDER SHARED INTERACTOR w/in the line, file Network_Submodule_Nodes_background_Network.csv
# use FINAL_enriched.csv
# Shared_Interactor	M	Motif_Containing_Proteins	Interaction	Motif	n	N	m	p-value	Rank(i)	m_(number_of_tests)	Q_(FDR)	(i/m)Q	BH_significant


with open('Final_enriched.csv','r') as f, open('All_SIs.csv','w') as outfile:
    for line in f:
        dat = line.rstrip().split(',')
        if line.startswith('M'):
            dat[1] = 'Motif_Containing_Proteins'
            dat[3] = 'Motif'
        last = dat.pop(-1)
        dat.insert(0,last)
        out = ','.join(dat) + '\n'
        outfile.write(out)
        
outfile.close()

# OUTPUT: FDR_Scores_merged.csv
#         All_SIs.csv

['M', 'Submodule_Containing_Proteins', 'Interaction', 'Submodule', 'n', 'N', 'm', 'p-value', 'Rank(i)', 'm_(number_of_tests)', 'Q_(FDR)', '(i/m)Q', 'BH_significant', 'Shared_Interactor']
['304', 'YOR188W', 'kinase_substrate:Reversed', 'Induced_......SP.....', '117', '4638', '29', '1.6549691069793215e-10', '1', '919', '0.05', '5.44069640914037e-05', '1', 'YBR160W']
['298', 'YJR049C', 'kinase_substrate:Reversed', 'Induced_...RR.S......', '27', '4638', '12', '2.891946623756871e-08', '2', '919', '0.05', '0.0001088139281828074', '1', 'YJL164C']
['212', 'YJR049C', 'kinase_substrate:Reversed', 'Induced_...RR.S......', '27', '4638', '10', '1.3749551891468198e-07', '3', '919', '0.05', '0.00016322089227421111', '1', 'YJR059W']
['49', 'YHR135C', 'kinase_substrate', 'Induced_......SP.....', '117', '4638', '10', '2.5463103225217233e-07', '4', '919', '0.05', '0.0002176278563656148', '1', 'YBL007C']
['17', 'YLR337C', 'ppi', 'Induced_......SP.....', '117', '4638', '6', '2.2362673666729923e-06', '5', '

# Merge_SI_subModule_relationships_with_FDR_Scores

FDR_Scores_DF input file:
> Scores,Kinase,Group (According to Mok),Module,Counts_Less_Than,Number_of_Scores,FDR
> 11.44097292,pho85-pho80,Proline_directed,Induced_......SP.....,0,63000,0
> 11.92968433,fus3,Proline_directed,Induced_......SP.....,0,63000,0
 
SIs_DF input file:
 > Shared_Interactor,M,Motif_Containing_Proteins,Interaction,Motif,n,N,m,p-value,Rank(i),m_(number_of_tests),Q_(FDR),(i/m)Q	BH_significant
>YBR160W,304,YOR188W,kinase_substrate:Reversed,Induced_......SP.....,117,4638,29,1.65496910697932E-10,1,919,0.05,5.44069640914037E-05	1


Kinase_Names_DF file: ( does not change )

> Kinase,Kinase_Pho85_renamed,Kinase_YORF,Mok <br>
> yck3,yck3,YER123W,yes <br>
> yck1,yck1,YHR135C,yes <br>
> yck2,yck2,YNL154C,yes <br>


In [46]:
''' This script takes the output of the KLD FDR script and adds these FDR scores to the identified SI-submodule pairs. 
'''
# FDR_Scores_merged.csv
FDR_Scores_DF=pd.read_csv('FDR_Scores_merged.csv') # All_SIs.csv  # All FDR scores for Mok kinase and module PWM comparison
# All_SIs.csv
SIs_DF=pd.read_csv('All_SIs.csv')  # SIs - All, enriched and not enriched
# DO NOT CHANGE
Kinase_Names_DF=pd.read_csv('required/Kinases_Mok_andNOT_In_Mok.csv') # Contains all kinases in Mok and not in Mok. Contains common names and YORFs, also contains modifications for Pho85 naming (ex: Pho85-Pcl is now Pho85 in one column)

def DF_to_CSV(dataframe, NewFileName):  
    dataframe.to_csv (current_dir + '/' + NewFileName,sep=',') 
####################################################################################################################################
''' Merging the FDR_Scores_DF with the Kinase_Names_DF '''
# This step is completed so that the correct Pho85 nomenclature is used for a subsquent merge with the SI dataframe. This is necessary because there are 3 pho85-cofactor variants in the Mok et al, dataset.

FDR_Scores_DF_merged_left = pd.merge(left=FDR_Scores_DF,right=Kinase_Names_DF, how='left', left_on='Kinase', right_on='Kinase') # completing a merge so that all the kinase nomenclature from the Kinase_Names_DF

####################################################################################################################################
''' Merge the Kinase name (common name,ex. Hog1) with the Module name to create a new column called "Candidate_Kinase_Regulators" '''

FDR_Scores_DF_merged_left['Candidate_Kinase_Regulators'] = FDR_Scores_DF_merged_left.Module.map(str) + "_" + FDR_Scores_DF_merged_left.Kinase_Pho85_renamed # creating a new column that is the result of a merge between Kinase_Pho85_renamed, and Module
#print (FDR_Scores_DF_merged_left.head(1))
DF_to_CSV(FDR_Scores_DF_merged_left, 'yay3.csv')
####################################################################################################################################
''' Filtering out non-enriched SIs'''

SIs_Filtered=SIs_DF.loc[SIs_DF['BH_significant'] == 1] # Filter out non-significant shared interactors (anything with a "0") 

####################################################################################################################################
''' Adding to the SIs file, the "common" name for the proteins, rather than just using the YORF designation'''
SIs_Filtered_merged_left= pd.merge(left=SIs_Filtered,right=Kinase_Names_DF, how='left', left_on='Shared_Interactor', right_on='Kinase_YORF')


####################################################################################################################################
''' Splitting 'Motif' column and producing a new column, called "Module" that only lists the Induced/Repressed WT phenotype and the motif '''

def Split_After_2nd_Occurence_In_A_String_Retaining_Beginning():
    lst=[] # create an empty list 
    for string in SIs_Filtered_merged_left['Motif']: # Select the string from the "Motif" column 
        strip_character ="_"  # define character where strip will occur
        lst.append(strip_character.join(string.split(strip_character)[:2])) # append to the list the text before the second occurence of the character "_"
    Series_Object = pd.Series(lst) # put the list into a series 
    SIs_Filtered_merged_left['Module'] = Series_Object.values # append the series values to the already existing DF in a new column
    return SIs_Filtered_merged_left
        
SIs_Filtered_merged_left_String_Split=Split_After_2nd_Occurence_In_A_String_Retaining_Beginning()
#print (SIs_Filtered_merged_left_String_Split)

####################################################################################################################################    
####################################################################################################################################    
'''Creating New Columns that can be used for a merge'''
SIs_Filtered_merged_left_String_Split['Kinase_subModules'] = SIs_Filtered_merged_left_String_Split.Motif.map(str) + "_" + SIs_Filtered_merged_left_String_Split.Kinase_Pho85_renamed


SIs_Filtered_merged_left_String_Split['Kinase_Modules'] = SIs_Filtered_merged_left_String_Split.Module.map(str) + "_" + SIs_Filtered_merged_left_String_Split.Kinase_Pho85_renamed

#################################################################################################################################### 
''' Perform a merge where of the FDR Scores Dataframe with the SIs_Filtered_merged_left_String_Split DF. 
This will reveal if a kinase-Module relationship, from the FDR Score Dataframe, which contains all possible Kinase-Module relationships, exist in the users 
kinase-subModule file (so the SIs file)'''
    
merged_left= pd.merge(left=FDR_Scores_DF_merged_left,right=SIs_Filtered_merged_left_String_Split, how='left', left_on='Candidate_Kinase_Regulators', right_on='Kinase_Modules')

#################################################################################################################################### 
''' Drop columns that are not needed or redundant '''

merged_left=merged_left[['Scores', 'Kinase_x', 'Module_x', 'Candidate_Kinase_Regulators', 'Counts_Less_Than', 'Number_of_Scores', 'FDR', 'Kinase_subModules']]

#################################################################################################################################### 
''' Drop NaN values '''
merged_left=merged_left.dropna(subset=['Kinase_subModules']) # Drop the NaN values, so that the dataframe only contains Kinases that were connected to subModules.


####################################################################################################################################
''' Drop any duplicates that occur in TWO columns - this is done only because of Pho85 being listed 3 times (because of it's co-factor interactions) and that affects the merge'''
merged_left=merged_left.drop_duplicates(subset=['Kinase_x', 'Kinase_subModules']) # only drop duplicates that are found in BOTH columns

DF_to_CSV(merged_left, 'All_FDR_Scores_and_their_Kinase_SI_subModules.csv')

# OUTPUT: All_FDR_Scores_and_their_Kinase_SI_subModules.csv

# Clean Kullback_Leibler Shuffle results

Input file:

> Scores,Kinase_x,Module_x,Candidate_Kinase_Regulators,Counts_Less_Than,Number_of_Scores,FDR,Kinase_subModules <br>
> 11.02325264,cdc28,Induced_......SP.....,Induced_......SP....._cdc28,0,63000,0,Induced_......SP....._No_Phenotype_Exists_cdc28
> 11.02325264,cdc28,Induced_......SP.....,Induced_......SP....._cdc28,0,63000,0,Induced_......SP....._mkk1_2_Induced_Defective_cdc28
> 12.51579799,slt2,Induced_......SP.....,Induced_......SP....._slt2,1,63000,1.59E-05, Induced_......SP....._mkk1_2_Induced_Defective_slt2



In [48]:

''' This is a quick script that cleans up the Output of the Kullback_Leibler Shuffle Script.
It removes unwanted names that trail the subModule name-these are leftovers from a previous
script. It then also sorts each subModule by ascending for the SI scores
'''
# Input=pd.read_csv('All_DTT_T120_Kinase_Module_FDR_Scores_and_their_Kinase_SI_subModules_Sept2017.csv')
Input=pd.read_csv('All_FDR_Scores_and_their_Kinase_SI_subModules.csv')
#print (Input.dtypes)

# Remove the last occurrence of a character, and the text that follows
def Remove_Text_After_Last_Occurence_of_Character():
    Value_lst=[]
    for value in Input['Kinase_subModules']:
        value="_".join(value.split("_")[:-1]) # return everything minus the last occurrence of the "_" and what trailed
        Value_lst.append(value)
    Input['Kinase_subModules']=Value_lst
        #sep = '_'
        #value = value.split(sep, 5)[-1]
    return (Input)
        
        
Input=Remove_Text_After_Last_Occurence_of_Character()
#print (Input)

#Split the dataframe into separate dataframes by the subModule name
def SplitInput_df_by_subModule():
    DF_Input_lst =[]
    for subModule in Input['Kinase_subModules'].unique():
        DF=Input.loc[Input['Kinase_subModules']==subModule]
        DF_Input_lst.append(DF)
    return DF_Input_lst

DF_Input_lst=SplitInput_df_by_subModule()

#Sort each dataframe within the list of dataframes by ascending for the FDR column
def Sort_by_Ascending():
    DF_Input_lst2=[]
    for DF in DF_Input_lst:
        DF=DF.copy()
        DF=DF.sort(['FDR'], ascending=[True])
        DF_Input_lst2.append(DF)
    return DF_Input_lst2

DF_Input_lst2=Sort_by_Ascending()



#Concatenate the dataframes back together into one so they can be printed out as a single dataframe.
def ConcatenateDFs():    #Concatenate the DFs together 
    EmptyDF = pd.DataFrame() # create an empty dataframe
    for df in DF_Input_lst2:  # select a dataframe in the list 
        df=df.copy() # make a copy of that dataframe 
        EmptyDF=EmptyDF.append(df) # append to the empty DF the dataframe selected and overwrite the empty dataframe
    return EmptyDF

Final=ConcatenateDFs()
#print (Final)


def DF_to_CSV(dataframe, NewFileName): 
    dataframe.to_csv (NewFileName,sep='\t')
    
DF_to_CSV(Final, 'All_Module_FDR_Scores__Kinase_SI_subModules_Sorted.csv')

# OUTPUT: All_Module_FDR_Scores__Kinase_SI_subModules_Sorted.csv

Unnamed: 0                       int64
Scores                         float64
Kinase_x                        object
Module_x                        object
Candidate_Kinase_Regulators     object
Counts_Less_Than                 int64
Number_of_Scores                 int64
FDR                            float64
Kinase_subModules               object
dtype: object
    Unnamed: 0     Scores     Kinase_x               Module_x  \
0            1  17.341918         ptk2  Induced_......TP.....   
1           20  17.102829         ark1  Induced_......TP.....   
2           46  17.558753         pkh2  Induced_......TP.....   
3          380  10.381472         yck1  Induced_....R.S......   
4          389  10.311257         rck2  Induced_....R.S......   
5          405  12.761746         atg1  Induced_....R.S......   
6          407   9.218274  pho85-pho80  Induced_....R.S......   
7          411   9.676206   pho85-pcl2  Induced_....R.S......   
8          415   9.804924         ark1  Induced_..



In [54]:
# clean up All_SIs.csv from Prep Merge_SI_subModule_relationships_with_FDR_Scores step
with open('All_SIs.csv','r') as f, open('only_enriched.csv','w') as outfile :
    for line in f:
        dat = line.rstrip().split(',')
        if line.startswith('Shared'):
            dat.pop(4)
            dat[3] = 'subModule'
        else:
            dat[3],dat[4] = dat[4],dat[3]
            dat[3] += dat[4]
            dat.pop(4)
        out = ','.join(dat) + '\n'
        outfile.write(out)

f.close()
outfile.close()
        
# OUTPUT: only_enriched.csv

In [58]:
# Remake the header for SI_Identification_SubmoduleS__SIs_and_Targets_FDR.csv (from Identify Shared Interactors step)
# for the next script
with open('SI_Identification_SubmoduleS__SIs_and_Targets_FDR.csv', 'r') as f, open('shared_Interactors.csv', 'w') as out:
    f.readline()
    newHeader = 'Motif_Containing_Proteins,Interaction,subModule,n,Possible_Shared_Interactors\n'
    out.write(newHeader)
    for line in f:
        out.write(line)
f.close()
out.close()    

# OUTPUT: shared_Interactors.csv 
    

In [61]:

''' 

The purpose of this script is to identify for each enriched SI (according to the hypergeometric and BH correction)
what it's interacting proteins, and potential "targets"  are in the background network. Next, 
the script identifies the correct orientation for interactions between a Shared Interactor and it's 
partners, which are reversed in an earlier script. Thus, the script ensures all interactions are in 
the correct orientation. 

The output file from this script can then be used to determine if a Shared Interactor is acting as an input (all interactions going towards a subModule, or is an output (interactions go away), etc. An input is a potential regulator of a submodule whereas an output is unlikely to regulate a submodoule.

'''
##################################################################################################################################
# FDR_Scores_merged.csv
#Enriched_SIs_subModules_Only_Sig=pd.read_csv('DTT_T120_SIs_All_Sept2017_Enriched.csv')  # Import ONLY enriched SIs (drop non-enriched SIs)
Enriched_SIs_subModules_Only_Sig=pd.read_csv('only_enriched.csv')  
Enriched_SIs_subModules_Only_Sig["SI_subModule"] = Enriched_SIs_subModules_Only_Sig["Shared_Interactor"].map(str) + "_" + Enriched_SIs_subModules_Only_Sig["subModule"] # creating a new column by merging two columns together.


# I think this is from SI_Identification_SubmoduleS__SIs_and_Targets_FDR.csv  from step Identify Shared Interactors
#All_enriched_and_not_SIs_andTargets=pd.read_csv('SI_Identification_T120_DTT_Sept2017_T120_Possible_SIs_and_Targets_Dashes_Removed_4638_proteins_BOTH_Reps_Ppeps_NORMALIZED.csv') # Import all identified Shared Interactors (both enriched and not) and their interacting partners from submodules.
All_enriched_and_not_SIs_andTargets=pd.read_csv('shared_Interactors.csv')

All_enriched_and_not_SIs_andTargets["SI_subModule"] = All_enriched_and_not_SIs_andTargets["Possible_Shared_Interactors"].map(str) + "_" + All_enriched_and_not_SIs_andTargets["subModule"]

Merged_left = pd.merge(left=Enriched_SIs_subModules_Only_Sig,right=All_enriched_and_not_SIs_andTargets, how='left', left_on='SI_subModule', right_on='SI_subModule') # 

Merged_left_FINAL=Merged_left[['SI_subModule', 'Shared_Interactor', 'Motif_Containing_Proteins_y', 'subModule_y']] # retain these columns only
Merged_left_FINAL.columns=['SI_Module','Shared_Interactor', 'Motif_Containing_Proteins', 'subModule_Name'] # rename columns



def DF_to_CSV(dataframe, NewFileName): 
    dataframe.to_csv (NewFileName,sep='\t')



''' Add the SI common name to the file  FILE DOES NOT CHANGE'''
Annotation_File_DF=pd.read_csv('required/Annotation_file_dashes_remain_No_duplicate_Common_names.csv') 

Merged_left_Again = pd.merge(left=Merged_left_FINAL,right=Annotation_File_DF, how='left', left_on='Shared_Interactor', right_on='Protein_Name') # complete the merge toget the common names
Merged_left_Again=Merged_left_Again[['SI_Module', 'Shared_Interactor', 'Common_Name', 'Motif_Containing_Proteins', 'subModule_Name']] # retain only these columns
Merged_left_Again.columns=['SI_Module', 'Shared_Interactor', 'SI_Name', 'Motif_Containing_Proteins', 'subModule_Name'] # rename columns 

Merged_left_Again['Protein1:Protein2']=Merged_left_Again['Shared_Interactor'].map(str) + ":" + Merged_left_Again['Motif_Containing_Proteins'] # adding column for merge section below.

Merged_left_Again['Protein1:Protein2'] = Merged_left_Again['Protein1:Protein2'].str.replace('-', '') # Removing the dashes from the names because if they remain in this column, the merge below will fail, because the background network lacks dashes in the gene annotations. 

###########################################################################################

'''FILE DOES NOT CHANGE from Debbi Chasman'''
Background_network_correct_Orientation=pd.read_csv('required/phospho_v4_bgnet_siflike_withdirections_fix_Matt_Modifications_ForPipeline.csv') # import the salt background network with the correct orientations (this file lacks dashes in gene annotations!)

Merged_left_Again_Get_Correct_Protein_Orientiations=pd.merge(left=Merged_left_Again, right=Background_network_correct_Orientation, how='left', left_on='Protein1:Protein2', right_on='Protein1:Protein2') # merge based on the columns to the left

Merged_left_Again_Get_Correct_Protein_Orientiations=Merged_left_Again_Get_Correct_Protein_Orientiations[['SI_Module', 'Shared_Interactor', 'SI_Name', 'Motif_Containing_Proteins', 'subModule_Name','Interaction']] # retain only these columns


Merged_left_Again_Get_Correct_Protein_Orientiations.columns=['SI_Module', 'Shared_Interactor', 'SI_name', 'Motif_Containing_Proteins', 'subModule_Name','Interaction_Directionality'] # rename columns 
#print (Merged_left_Again_Get_Correct_Protein_Orientiations.head(2))

DF_to_CSV(Merged_left_Again_Get_Correct_Protein_Orientiations, 'Orientation_Script.csv')

# OUTPUT:  Orientation_Script.csv  this one of the inputs to the next script

In [None]:
v = set()
with open('Orientation_Script.csv') as f:
    
    
    

In [60]:
''' 
The function of this script is as follows: For each SI and it's interactions a submodules constituents,
the script determines if the SI is a likely submodule regulator, the Shared Interactor has at least 1 
directional interaction, or ppi interaction, with a subModule protein aimed from the SI to the submodule,
or if the subModule proteins are act upon the SI , all interactions between the SI and subModule proteins
have the 'Reverse' designation', indicating that the subModule proteins act upon the SI.  

-If all of the interactions are reversed, then the script will define the relationship between the SI and the subModule
as "Output", indicating that the SI is likely downstream of the submodule and is not likely regulating the submodule.

-If there is at least one interaction that is NOT reverse (ie, kinase-substrate) or is and non-directed ppi, 
the relationship between the SI and the subMOdule is defined as "Input", suggesting there is a possibility 
that the SI can regulate the submodule protein phosphorylation state.


This script takes an input file that contains the following:
- All enriched Shared Interactors (SIs) (according to HyperG) and their connections to subModules.
- All known protein interactions for each SI (ppi, kinase-substrate, etc)
- Many of these interactions are directed (kinase-substrate, metabolic pathway, etc). PPI are not a directed interaction.

'''
Input_df=pd.read_csv('Orientation_Script.csv', delimiter='\t')
#Input_df=pd.read_csv('DTT_T120_Prep_for_Orientation_Script_Sept2017.csv', delimiter='\t')
print(Input_df.head(n=5))

# Split the input DF into independent DFs based on the term in the SI_Module column (this columns contains the SI and it's connection to each subModule). 
def Split_based_on_SI_Module_Column():
    DF_lst =[]
    for SI_Module in Input_df['SI_Module'].unique():
        DF=Input_df.loc[Input_df['SI_Module']==SI_Module]
        DF_lst.append(DF)
    return DF_lst

DF_lst=Split_based_on_SI_Module_Column()

# This function counts, for each DF, how many of the interactions are reversed. It also counts the length of the dataframe, and then
# subtracts the the length of the dataframe from the counts. If the resultant value is 0, then all of the interactions were reversed.
def Count_Instances_of_Reverse_Interaction():
    DF_Counts_lst=[]
    for df in DF_lst:
        df=df.copy()
        df['Counts']=df.Interaction_Directionality.str.contains('Reversed').sum()  # Count the number of interactions that are "Reversed"
        x=len(df)
        df['Length']=x
        df['Counts_Length']=df['Counts']-df['Length']
        
        DF_Counts_lst.append(df)
    return DF_Counts_lst

DF_Counts_lst=Count_Instances_of_Reverse_Interaction()
print (DF_Counts_lst)

# This function assigns 'Input' and 'Output' classifications based on the 'Counts_Length' column in the dataframe. 
def Only_Reverse_Interactions_Move_to_Outgoing_Columns():
    df_Modified_Outgoing_lst=[]
    for df in DF_Counts_lst:
        #print (df.dtypes)
        for value in df['Counts_Length'].unique():
            #print (value)
            if value == 0:
                df['Shared_Interactor_subModule_Relationship']= 'Output'
                df_Modified_Outgoing_lst.append(df)
            else:
                df['Shared_Interactor_subModule_Relationship']= 'Input'
                df_Modified_Outgoing_lst.append(df)
                
    return df_Modified_Outgoing_lst
    
df_Modified_Outgoing_lst=Only_Reverse_Interactions_Move_to_Outgoing_Columns()

# This function concatenates the dataframes back together, leaving a single DF. 
def ConcatenateDFs():   
    EmptyDF = pd.DataFrame() # create an empty dataframe
    for df in df_Modified_Outgoing_lst:  # select a dataframe in the list 
        df=df.copy() # make a copy of that dataframe 
        EmptyDF=EmptyDF.append(df) # append to the empty DF the dataframe selected and overwrite the empty dataframe
    return EmptyDF

Final=ConcatenateDFs()

print (Final)

#Function writes out a Dataframe to a CSV file. 
def DF_to_CSV(dataframe, NewFileName): 
    dataframe.to_csv (NewFileName,sep='\t') 
    

Final_Keep_Columns_Needed_For_SIF=Final[['SI_Module', 'Shared_Interactor', 'subModule_Name', 'Shared_Interactor_subModule_Relationship']] 
Final_Keep_Columns_Needed_For_SIF=Final_Keep_Columns_Needed_For_SIF.drop_duplicates('SI_Module')


DF_to_CSV(Final_Keep_Columns_Needed_For_SIF, 'SIs_subModule_Relationships_Defined_DTT_T120_Network_Input_for_making_SIF_Sept2017.csv')
 


   Unnamed: 0                                          SI_Module  \
0           0  YBR160W_Induced_......SP.....kinase_substrate:...   
1           1  YBR160W_Induced_......SP.....kinase_substrate:...   
2           2  YBR160W_Induced_......SP.....kinase_substrate:...   
3           3  YBR160W_Induced_......SP.....kinase_substrate:...   
4           4  YBR160W_Induced_......SP.....kinase_substrate:...   

  Shared_Interactor SI_name  Motif_Containing_Proteins  subModule_Name  \
0           YBR160W   CDC28                        NaN             NaN   
1           YBR160W   CDC28                        NaN             NaN   
2           YBR160W   CDC28                        NaN             NaN   
3           YBR160W   CDC28                        NaN             NaN   
4           YBR160W   CDC28                        NaN             NaN   

   Interaction_Directionality  
0                         NaN  
1                         NaN  
2                         NaN  
3                 

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

In [81]:
pd.options.mode.chained_assignment = None

''' The function of this script is to produce a SIF file that Debbie can use to Infer a signaling network using an ILP progamming method. The output SIF file can also be opended with Cytoscape to view a signaling network that has NOT been inferred.  



***Overview/Important Notes***
-The final SIF file indicates directionality between interactions or information flow between entitites. For example, the protein in column "A" acts upon the protein/subModule in column "B". The submodule in column "A" contains the proteins listed in column "B"

-There are 6 Interaction types in this file, listed below:

A) Motif-matched: This is a Kinase-SI that recognizes the phosphorylation motif for a subModule at a predetermined FDR cutoff. This interaction type only exists for 
kinases that are Input for a subModule (and thus can potentially regulate them). If a kinase is a match to a subModule, but is an output, it is unlikely to regulate that subModule, since
it is downstream of the subModule, and would be considered an Output (see below).

B) Unknown-recognition-motif: A Kinase or Phoshphatase SI for which we have no information about the phosphorylation motif it recognizes (ie, not in the Mok et al Dataset)
This interaction type is input only (Kinase/Phosphatase SI -> subModule)
All of our SI-Phosphatase inputs fall into this group, since we do not know their recognized phosphorylation motif.
 
C) Motif-unmatched: A Kinase SI which did NOT meet our FDR cutoff for a subModule. This is also only for Input Kinases.

D) Output: subModule -> Kinase/Phosphatase SI. Key here is that all subModule-SI interactions face TOWARDS the SI, and thus the SI is an output and unlikely to regulate the submodule.

E) Constituent: A subModule to it's protein constituents (proteins that are part of the subModule). For this group, many of the constituent proteins will NOT be SIs.
   
E) Shared_Interaction: Non-Kinase/Phosphatase SI connected to it's subModule. Can be either SI -> subModule or subModule -> SI (so an Input/Output based on interaction directionality)


Files that will always be used to create the SIF file:
# FILES DO NOT CHANGE
-Annotation_kinases.csv = this file contains all kinases, including which kinases are in the Mok Dataset.
-kinase_phosphatase_yeast.csv = this file contains all kinases and phosphatases in yeast (annotated as kinase/phosphatase catalytic-from Mike Tyers Kinome project)

# User defined
-List of enriched SIs and their subModules
-List of subModules and their protein constituents 
-KL scoring system for kinases to their subModules (which is actually based on comparing Kinases to their Modules).


'''
# from previous step
All_SI_subModule_relationships_DF= pd.read_csv('SIs_subModule_Relationships_Defined_DTT_T120_Network_Input_for_making_SIF_Sept2017.csv') 


subModule_Constituent_Proteins_DF= pd.read_csv('DTT_submodule_constituents_Sept2017.csv')

# File does not change
#DF contains all kinases, including the 3 Pho85-Co-activator varieties, and includes whether a kinase is in the Mok dataset or not.
Annotation_Kinases_DF=pd.read_csv('required/Annotation_kinases_Updated_Correct.csv') 

# File does not change
#DF contains all protein annotated as kinase catalytic or phosphatase catalytic from Mike Tyers Kinome project_base
Kinases_Phosphatases_yeast_DF=pd.read_csv('required/kinase_phosphatase_yeast.csv')


KL_Matching_Kinases_Modules_DF=pd.read_csv('MotifMatch_Scores_for_FASTA.csv')


#-------------------------------------------------------------------------------------------------------------------------------
def DF_to_CSV(dataframe, NewFileName): 
    dataframe.to_csv (NewFileName,sep='\t')  

#-------------------------------------------------------------------------------------------------------------------------------
#Add a column to the SI_File
All_SI_subModule_relationships_DF['SI']='Yes'

#-------------------------------------------------------------------------------------------------------------------------------
#Generating Interaction type: Output (subModule -> subModule constituent proteins)

subModule_Constituent_Proteins_DF=subModule_Constituent_Proteins_DF[['Protein', 'subModule']]  # Only retain the listed columns in the dataframe
subModule_Constituent_Proteins_DF['Protein_Constituent_subModule']=subModule_Constituent_Proteins_DF.Protein.map(str) + "_" + subModule_Constituent_Proteins_DF.subModule  # create a new column that merges the protein name and subModule name together
subModule_Constituent_Proteins_DF_dupes_removed=subModule_Constituent_Proteins_DF.drop_duplicates(subset='Protein_Constituent_subModule') # drop duplicates, so only a single occurrence is listed for each SI-subModule
subModule_Constituent_Proteins_DF_dupes_removed_reorganized=subModule_Constituent_Proteins_DF_dupes_removed.rename(columns={'subModule':'Interactor_A', 'Protein':'Interactor_B'}) # Change column names
subModule_Constituent_Proteins_DF_dupes_removed_reorganized['Edge_Type']='Constituent'  # Add the Edge type (Interaction type-column). This line has been modified since the original script was created.
subModule_Constituent_Proteins_DF_dupes_removed_reorganized=subModule_Constituent_Proteins_DF_dupes_removed_reorganized[['Interactor_A','Edge_Type','Interactor_B','Protein_Constituent_subModule']] # reorganize columns
subModule_Constituent_Proteins_merged_left_DF=pd.merge(left=subModule_Constituent_Proteins_DF_dupes_removed_reorganized,right=Kinases_Phosphatases_yeast_DF, how='left', left_on='Interactor_B', right_on='ORF')  # do a merge, so I get information about which proteins are kinases/phosphatases
subModule_Constituent_Proteins_merged_left_DF=subModule_Constituent_Proteins_merged_left_DF[['Interactor_A', 'Edge_Type', 'Interactor_B', 'Annotation', 'Protein_Constituent_subModule']]  # drop the columns I don't want 
subModule_Constituent_Proteins_merged_left_DF_Add_SI_Information=pd.merge(left=subModule_Constituent_Proteins_merged_left_DF, right=All_SI_subModule_relationships_DF, how='left', left_on='Protein_Constituent_subModule', right_on='SI_Module')
subModule_Constituent_Proteins_merged_left_DF_Add_SI_Information=subModule_Constituent_Proteins_merged_left_DF_Add_SI_Information[['Interactor_A', 'Edge_Type', 'Interactor_B', 'Annotation', 'SI']]

#-------------------------------------------------------------------------------------------------------------------------------
# Split the SI_subModule_relationships_DF by Input/Output

#Splitting the Dataframe based on if the SI is acting as an Input or Output
#If a SI_subModule relationship is an output, then it should be listed that Interactor_A is the subModule and Interactor_B is the SI. 

SIs_subModules_Output=All_SI_subModule_relationships_DF.loc[All_SI_subModule_relationships_DF['Shared_Interactor_subModule_Relationship']=='Output'] # All interactions that go from subModule -> SI
SIs_subModules_Input=All_SI_subModule_relationships_DF.loc[All_SI_subModule_relationships_DF['Shared_Interactor_subModule_Relationship']=='Input']

#---------------------------------------------------------------------------------------------------------------------------------
# Generating Interaction type: Output (subModule -> SIs-Kinase (that have all Interactions facing from subModule to SI)    

SIs_subModules_Output_renamed=SIs_subModules_Output.rename(columns={'subModule_Name':'Interactor_A','Shared_Interactor':'Interactor_B'}) #rename columns appropriately. 
#print (SIs_subModules_Output_renamed.head(5))
SIs_subModules_Output_renamed['Edge_Type']='Output'  # Add the edge_type, which is output in this case 

SIs_subModules_Output_renamed=SIs_subModules_Output_renamed[['Interactor_A', 'Edge_Type', 'Interactor_B', 'SI_Module']] # drop unwanted columns
SIs_subModules_Output_renamed_merge_left=pd.merge(left=SIs_subModules_Output_renamed,right=Kinases_Phosphatases_yeast_DF, how='left', left_on='Interactor_B', right_on='ORF')  # do a merge, so I get information about which proteins are kinases/phosphatases
SIs_subModules_Output_renamed_merge_left=SIs_subModules_Output_renamed_merge_left[['Interactor_A', 'Edge_Type', 'Interactor_B', 'Annotation', 'SI_Module']] #Drop unwanted columns from the above merge 

SIs_subModules_Output_renamed_merge_left_left_again=pd.merge(left=SIs_subModules_Output_renamed_merge_left, right=All_SI_subModule_relationships_DF, how='left', left_on='SI_Module', right_on='SI_Module') # merge so we get SI information 
SIs_subModules_Output_renamed_merge_left_left_again=SIs_subModules_Output_renamed_merge_left_left_again[['Interactor_A', 'Edge_Type', 'Interactor_B', 'Annotation', 'SI']] # drop columns we don't want
SIs_subModules_Output_renamed_merge_left_left_again_Kinase_SIs_Only=SIs_subModules_Output_renamed_merge_left_left_again.loc[SIs_subModules_Output_renamed_merge_left_left_again['Annotation']=='Kinase']
SIs_subModules_Output_renamed_merge_left_left_again_Kinase_SIs_Only=SIs_subModules_Output_renamed_merge_left_left_again_Kinase_SIs_Only.loc[SIs_subModules_Output_renamed_merge_left_left_again_Kinase_SIs_Only['Annotation']=='Kinase'] # only keep annotations that are Kinases!


#---------------------------------------------------------------------------------------------------------------------------------
# Generating Interaction type: Output (subModule -> SIs_Phosphatase (that have all interactions facing from submodule to SI)
SIs_subModules_Output_renamed_merge_left_left_Twice=pd.merge(left=SIs_subModules_Output_renamed_merge_left, right=All_SI_subModule_relationships_DF, how='left', left_on='SI_Module', right_on='SI_Module')
SIs_subModules_Output_renamed_merge_left_left_Twice=SIs_subModules_Output_renamed_merge_left_left_Twice[['Interactor_A', 'Edge_Type', 'Interactor_B', 'Annotation', 'SI']] # drop columns we don't want
SIs_subModules_Output_renamed_merge_left_left_Twice_Phosphatase_SIs_Only = SIs_subModules_Output_renamed_merge_left_left_Twice.loc[SIs_subModules_Output_renamed_merge_left_left_Twice['Annotation']=='Phosphatase']


#---------------------------------------------------------------------------------------------------------------------------------
# Generating Interaction type: Shared_Interaction: Non-Kinase/Phosphatase Shared Interactors and their Module associations. can be directed as follows: subModule -> non-kinase/phosphatase SI  or non-kinase/phosphatase SI -> subModule
# Note: These are All Shared Interactors, so we can simply add a SI column to this file. 

# This section of code is making the subModule -> SI direction 
SIs_subModules_Output_renamed_merge_left=SIs_subModules_Output_renamed_merge_left[['Interactor_A', 'Edge_Type', 'Interactor_B', 'Annotation', 'SI_Module']] 
SIs_subModules_Output_renamed_merge_left_non_Kin_Phos=SIs_subModules_Output_renamed_merge_left.loc[SIs_subModules_Output_renamed_merge_left['Annotation']!='Kinase']
SIs_subModules_Output_renamed_merge_left_non_Kin_Phos=SIs_subModules_Output_renamed_merge_left_non_Kin_Phos.loc[SIs_subModules_Output_renamed_merge_left_non_Kin_Phos['Annotation']!='Phosphatase'] # Return all proteins with annotations that are NOT kinases!
SIs_subModules_Output_renamed_merge_left_non_Kin_Phos['SI']='Yes' # Since all of these proteins are SIs, add a column that indicates they are SIs
SIs_subModules_Output_renamed_merge_left_non_Kin_Phos=SIs_subModules_Output_renamed_merge_left_non_Kin_Phos[['Interactor_A', 'Edge_Type', 'Interactor_B', 'Annotation', 'SI']]
SIs_subModules_Output_renamed_merge_left_non_Kin_Phos['Edge_Type']='Shared_Interaction'
#print (SIs_subModules_Output_renamed_merge_left_non_Kin_Phos)

#This section of code is making the SI -> subModule direction 

#print (len(SIs_subModules_Input))
SIs_subModules_Input_merge_left_left_again=pd.merge(left=SIs_subModules_Input,right=Kinases_Phosphatases_yeast_DF, how='left', left_on='Shared_Interactor', right_on='ORF') # Merge so we get Kinae/phosphatase information for SIs
SIs_subModules_Input_merge_left_left_again_non_Kin_Phos=SIs_subModules_Input_merge_left_left_again.loc[SIs_subModules_Input_merge_left_left_again['Annotation']!='Kinase']  # Drop kinases
SIs_subModules_Input_merge_left_left_again_non_Kin_Phos=SIs_subModules_Input_merge_left_left_again_non_Kin_Phos.loc[SIs_subModules_Input_merge_left_left_again_non_Kin_Phos['Annotation']!='Phosphatase'] # drop phosphatases 
SIs_subModules_Input_merge_left_left_again_non_Kin_Phos['Edge_Type']='Shared_Interaction' # add the edge type 

SIs_subModules_Input_merge_left_left_again_non_Kin_Phos=SIs_subModules_Input_merge_left_left_again_non_Kin_Phos[['Shared_Interactor', 'Edge_Type', 'subModule_Name', 'Annotation', 'SI']] # drop unwanted columns
SIs_subModules_Input_merge_left_left_again_non_Kin_Phos_renamed=SIs_subModules_Input_merge_left_left_again_non_Kin_Phos.rename(columns={'Shared_Interactor':'Interactor_A', 'subModule_Name':'Interactor_B'}) # rename columns so I can merge all dataframes later on.


#---------------------------------------------------------------------------------------------------------------------------------
# Generating Interaction Type: Motif-Matched  (SI-Kinase -> subModule)

SIs_subModules_Input_merge_left_left_again_Kinase_Only=SIs_subModules_Input_merge_left_left_again.loc[SIs_subModules_Input_merge_left_left_again['Annotation']== 'Kinase'] # subset the dataframe so we are only working with Kinases.['a'] = df['a'].apply(lambda x: x.split('-')[0])
SIs_subModules_Input_merge_left_left_again_Kinase_Only['SI_V2']=SIs_subModules_Input_merge_left_left_again_Kinase_Only['SI_Module'].apply(lambda x: x.split('_')[0])  # Get term before the first "_"
SIs_subModules_Input_merge_left_left_again_Kinase_Only['Cluster']=SIs_subModules_Input_merge_left_left_again_Kinase_Only['SI_Module'].apply(lambda x: x.split('_')[1]) # Get term after the first "_"
SIs_subModules_Input_merge_left_left_again_Kinase_Only['motif']=SIs_subModules_Input_merge_left_left_again_Kinase_Only['SI_Module'].apply(lambda x: x.split('_')[2]) # Get term after the second "_"
SIs_subModules_Input_merge_left_left_again_Kinase_Only['SI_MODULE']=SIs_subModules_Input_merge_left_left_again_Kinase_Only.SI_V2.map(str) + "_" + SIs_subModules_Input_merge_left_left_again_Kinase_Only.Cluster + "_" + SIs_subModules_Input_merge_left_left_again_Kinase_Only.motif
DF_to_CSV(SIs_subModules_Input_merge_left_left_again_Kinase_Only, "Test1.csv")
SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL=pd.merge(left=SIs_subModules_Input_merge_left_left_again_Kinase_Only, right=KL_Matching_Kinases_Modules_DF, how='left', left_on='SI_MODULE', right_on='Kinase_Module') # Perform a merge so I get information about matching Kinases to subModule motifs (KL script output).
#print (SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL)
#DF_to_CSV(SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL, 'Test2.csv')
SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL=SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL[['Shared_Interactor', 'motif_match', 'subModule_Name', 'Annotation', 'SI', 'FDR']] # drop columns I don't want in the final version
SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL_renamed=SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL.rename(columns={'Shared_Interactor':'Interactor_A', 'motif_match':'Edge_Type', 'subModule_Name':'Interactor_B'}) # rename columns so I can merge all dataframes in the future.
SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL_renamed['Edge_Type']=SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL_renamed['Edge_Type'].map({'yes':'motif_match', 'no':'motif_unmatched', 'unknown_recognition_motif':'unknown_recognition_motif'})
#print (SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL_renamed)
#DF_to_CSV(SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL_renamed, 'New_Script_file_for_merge.csv')
#print (KL_Matching_Kinases_Modules_DF.head(5))

#---------------------------------------------------------------------------------------------------------------------------------
# Generating Interaction Type: unknown-recognition motif  (SI-Phosphatase -> subModule)

SIs_subModules_Input_merge_left_left_again_Phosphatase_Only_unknown_recogntion_motif=SIs_subModules_Input_merge_left_left_again.loc[SIs_subModules_Input_merge_left_left_again['Annotation']== 'Phosphatase'] # subset the dataframe so we are only working with Phosphatases.
SIs_subModules_Input_merge_left_left_again_Phosphatase_Only_unknown_recogntion_motif['Edge_Type']='unknown_recognition_motif'  # add the edge type 
SIs_subModules_Input_merge_left_left_again_Phosphatase_Only_unknown_recogntion_motif=SIs_subModules_Input_merge_left_left_again_Phosphatase_Only_unknown_recogntion_motif[['Shared_Interactor', 'Edge_Type','subModule_Name', 'Annotation', 'SI']]
SIs_subModules_Input_merge_left_left_again_Phosphatase_Only_unknown_recogntion_motif_renamed=SIs_subModules_Input_merge_left_left_again_Phosphatase_Only_unknown_recogntion_motif.rename(columns={'Shared_Interactor':'Interactor_A', 'subModule_Name':'Interactor_B'})
#print (SIs_subModules_Input_merge_left_left_again_Phosphatase_Only_unknown_recogntion_motif_renamed)

#---------------------------------------------------------------------------------------------------------------------------------
#Adding empty columns to a few of the dataframes so that the final append will work (requires that all files have the same column names,otherwise you'll end up with Column_X, Column_Y)

subModule_Constituent_Proteins_merged_left_DF_Add_SI_Information['FDR_Score']=""
subModule_Constituent_Proteins_merged_left_DF_Add_SI_Information['Match_FDR']=""
#print (subModule_Constituent_Proteins_merged_left_DF_Add_SI_Information.head(5))


SIs_subModules_Output_renamed_merge_left_left_again_Kinase_SIs_Only['FDR_Score']=""
#print (SIs_subModules_Output_renamed_merge_left_left_again_Kinase_SIs_Only.head(10))
SIs_subModules_Output_renamed_merge_left_left_again_Kinase_SIs_Only['Match_FDR']=""
#print (SIs_subModules_Output_renamed_merge_left_left_again_Kinase_SIs_Only.head(2))

SIs_subModules_Output_renamed_merge_left_non_Kin_Phos['FDR_Score']=""
SIs_subModules_Output_renamed_merge_left_non_Kin_Phos['Match_FDR']=""
#print (SIs_subModules_Output_renamed_merge_left_non_Kin_Phos.head(2))

SIs_subModules_Input_merge_left_left_again_non_Kin_Phos_renamed['FDR_Score']=""
SIs_subModules_Input_merge_left_left_again_non_Kin_Phos_renamed['Match_FDR']=""
#print (SIs_subModules_Input_merge_left_left_again_non_Kin_Phos_renamed.head(4))


SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL_renamed=SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL_renamed.rename(columns={'FDR':'FDR_Score'})
SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL_renamed['Match_FDR']=""
#DF_to_CSV(SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL_renamed, 'Test101.csv')
#print (SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL_renamed.head(4))

SIs_subModules_Input_merge_left_left_again_Phosphatase_Only_unknown_recogntion_motif_renamed['FDR_Score']=""
SIs_subModules_Input_merge_left_left_again_Phosphatase_Only_unknown_recogntion_motif_renamed['Match_FDR']=""
#print (SIs_subModules_Input_merge_left_left_again_Phosphatase_Only_unknown_recogntion_motif_renamed.head(4))

SIs_subModules_Output_renamed_merge_left_left_Twice_Phosphatase_SIs_Only['FDR_Score']=""
SIs_subModules_Output_renamed_merge_left_left_Twice_Phosphatase_SIs_Only['Match_FDR']=""

FinalDF=subModule_Constituent_Proteins_merged_left_DF_Add_SI_Information.append(SIs_subModules_Output_renamed_merge_left_left_again_Kinase_SIs_Only)
#DF_to_CSV(FinalDF, "Test.csv")
FinalDF_2=FinalDF.append(SIs_subModules_Output_renamed_merge_left_non_Kin_Phos)
FinalDF_3=FinalDF_2.append(SIs_subModules_Input_merge_left_left_again_non_Kin_Phos_renamed)
FinalDF_4=FinalDF_3.append(SIs_subModules_Input_merge_left_left_again_Kinase_Only_merge_left_KL_renamed)
FinalDF_5=FinalDF_4.append(SIs_subModules_Input_merge_left_left_again_Phosphatase_Only_unknown_recogntion_motif_renamed)
FinalDF_6=FinalDF_5.append(SIs_subModules_Output_renamed_merge_left_left_Twice_Phosphatase_SIs_Only)
DF_to_CSV(FinalDF_6, 'DTT_T120_SIF_FINAL_Sept2017.csv')
