# PhosphoELM Data Formatting

This notebook takes kinase-protein interaction data from the PhosphoELM database and converts it into the .gmt format. The data were retrieved from the PhosphoELM database on Wed, Jun 7 2017 16:27:31. The resulting file will be added to the KEA2 database and is formatted for use by ENRICHR and X2K.

## Import packages necessary for the following program

In [None]:
%run /home/maayanlab/Projects/Scripts/init.ipy

## Create a dataframe from a file containing PhosphoELM data

In [None]:
#read data from excel file into dataframe 'phospho_df'
phospho_df = pd.read_excel('~/Desktop/phosphoELM_all_2015-04.xlsm')

#View dataframe
phospho_df.head()

## Filter by Organism and Kinase/Substrate columns
Keeps only the columns needed for the .gmt file format (kinases and accession numbers), plus the species column, and ensures that only data from mice and humans are included.

In [None]:
#Define a list of selected organisms
organisms = ['Mus musculus', 'Homo sapiens']

#Build a boolean mask of rows whose species is in the selected organisms
species_mask = phospho_df['species'].isin(organisms)

#Filter rows and columns, then drop rows with missing values
phospho_df_filter = phospho_df.loc[species_mask, ['acc', 'kinases', 'species']].dropna()

#View dataframe
phospho_df_filter.head()

## Convert UniProt IDs to Gene Symbols

In [None]:
#Use uniprot_to_symbol function from Scripts.py to convert
phospho_df_filter['target_symbol'] = Scripts.uniprot_to_symbol(phospho_df_filter['acc'].tolist())

#View dataframe
phospho_df_filter.head()
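
`Scripts.uniprot_to_symbol` comes from the lab's own `Scripts.py`, which is not shown in this notebook. As a rough illustration only, the sketch below assumes the conversion behaves like a dictionary lookup that yields `None` for accessions without a known symbol. The function `uniprot_to_symbol_sketch` and the table `toy_mapping` are hypothetical stand-ins, not the real implementation.

```python
# Hypothetical stand-in for Scripts.uniprot_to_symbol (the real function is not shown here)
def uniprot_to_symbol_sketch(accessions, mapping):
    """Return one gene symbol per accession, or None when no mapping is known."""
    return [mapping.get(acc) for acc in accessions]

# Tiny lookup table for illustration; a real table would be built from UniProt
toy_mapping = {'P04637': 'TP53', 'P31749': 'AKT1'}

# The last entry is deliberately not in the table
symbols = uniprot_to_symbol_sketch(['P04637', 'P31749', 'UNKNOWN_ACC'], toy_mapping)
print(symbols)
```

If the real function behaves similarly, accessions without a symbol would show up as missing values in 'target_symbol' and be removed by the later `dropna` step.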

## Create a new column combining kinases and organism

In [None]:
# Combine 'kinases' and 'species' into one column 'kinase_organism'
# (.as_matrix() was removed in pandas 1.0; the string columns can be concatenated directly)
phospho_df_filter['kinase_organism'] = phospho_df_filter['kinases'] + '_' + phospho_df_filter['species']

#View dataframe
phospho_df_filter.head()

## Perform preliminary data processing 

We must drop duplicates and NaNs, and select only the columns necessary for the .gmt file format (the target gene symbols and the combined kinase/organism labels). The species is carried inside the 'kinase_organism' label, so the data can still be filtered by species later.

In [None]:
#Select columns necessary for .gmt format into a new dataframe 'df'
#(.copy() avoids modifying a view of 'phospho_df_filter' in place)
df = phospho_df_filter[['target_symbol', 'kinase_organism']].copy()

#Drop duplicate rows in the dataframe
df.drop_duplicates(inplace = True)

#Drop all rows with a NaN value in either column
df.dropna(axis = 0, inplace = True)

#Visualize data
df.head()

## Set Index to 'Kinase_Organism' and Aggregate Kinase Targets

In [None]:
#Group rows by 'kinase_organism' and aggregate the targets of each
#kinase into a tuple; the group key becomes the index of 'kin'
kin = df.groupby('kinase_organism').agg(lambda x: tuple(x))

#Create a new column with 'PhosphoELM' as description of data
kin.insert(0, 'Description', 'PhosphoELM')

#Visualize Data
kin.head()
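
To make the aggregation step concrete, here is a small sketch on a toy dataframe with the same two-column layout as 'df'. The kinase and gene names are made up for illustration and do not come from PhosphoELM.

```python
import pandas as pd

# Toy data in the same layout as 'df' (all names are invented)
toy_df = pd.DataFrame({
    'target_symbol': ['GENE_A', 'GENE_B', 'GENE_A'],
    'kinase_organism': ['KIN1_Homo sapiens', 'KIN1_Homo sapiens', 'KIN2_Mus musculus'],
})

# One row per kinase/organism label; its targets are collected into a tuple
toy_kin = toy_df.groupby('kinase_organism').agg(lambda x: tuple(x))

# Add the constant description column in first position
toy_kin.insert(0, 'Description', 'PhosphoELM')
print(toy_kin)
```

Each row of `toy_kin` now holds everything one line of the .gmt file needs: the set name (the index), a description, and the members of the set.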

# Exploratory Data Analysis

## Calculate Number of Protein targets for each kinase
Create a new column with the number of substrates related to each kinase, and sort the dataframe by this column.

In [None]:
# Create column representing counts of protein targets per kinase
# (Series.iteritems() was removed in pandas 2.0; apply(len) gives the same counts)
kin['kinase_targets_num'] = kin['target_symbol'].apply(len)

# Sort kinases from max to min according to number of protein targets each has
kin.sort_values(by = ['kinase_targets_num'], ascending= False, inplace=True)

# Visualize data
kin.head()

## Create Histogram to display distribution of number of targets per kinase

In [None]:
# Create histogram displaying the distribution of the number
# of targets per kinase
kin['kinase_targets_num'].plot.hist(bins = 34)

#Show histogram
plt.show()

# Creation of Final .GMT File

## Create Dictionary of Tab-Separated Rows of the Dataframe

In [None]:
#Reset index of the dataframe
kin.reset_index(inplace = True)

#create column 'target_symbol_merged' in which all 'target_symbol' elements are joined by a \t symbol
kin['target_symbol_merged'] = ['\t'.join(x) for x in kin['target_symbol']]

#drop the now-unnecessary columns 'target_symbol' and 'kinase_targets_num'
kin.drop(['target_symbol', 'kinase_targets_num'], axis=1, inplace = True)

#Create dictionary 'PhosphoELM_num' with index numbers as keys
PhosphoELM_num = dict([(key, '') for key in kin.index])

# loop through rows with iterrows()
for index, rowData in kin.iterrows():
    line = ('\t'.join(rowData))
    PhosphoELM_num[index] = line

## Write Info from Dictionary into a .GMT file

In [None]:
#Transfer tab-separated info into a new txt file
with open('PhosphoELM.gmt', 'w') as openfile:
    for index in PhosphoELM_num:
        openfile.write(str(PhosphoELM_num[index]) + '\n')
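
Each line of the resulting file holds a set name, a description, and then the set members, all separated by tabs. One way to sanity-check the output is to parse it back. The sketch below uses a hypothetical helper, `read_gmt`, and a two-line toy file with made-up names so that it runs on its own; pointing it at 'PhosphoELM.gmt' should work the same way, assuming the file was written as above.

```python
import os
import tempfile

def read_gmt(path):
    """Parse a .gmt file into {set_name: (description, [members])}."""
    gene_sets = {}
    with open(path) as gmt_file:
        for line in gmt_file:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 3:
                continue  # skip blank lines or lines with no members
            gene_sets[fields[0]] = (fields[1], fields[2:])
    return gene_sets

# Write a two-line toy file (invented names) and read it back
toy_path = os.path.join(tempfile.mkdtemp(), 'toy.gmt')
with open(toy_path, 'w') as gmt_file:
    gmt_file.write('KIN1_Homo sapiens\tPhosphoELM\tGENE_A\tGENE_B\n')
    gmt_file.write('KIN2_Mus musculus\tPhosphoELM\tGENE_C\n')

parsed = read_gmt(toy_path)
print(parsed)
```

Comparing the number of parsed sets with the number of rows in 'kin' is a quick check that no kinase was lost during writing.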