# PhosphoELM Data Formatting

This notebook takes kinase-protein interaction data from the PhosphoELM database and converts it into the .gmt format. The data was retrieved from the PhosphoELM database on Wed, Jun 7 2017 16:27:31. The resulting file will be added to the KEA2 database and is formatted for use by ENRICHR and X2K.
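For orientation, each line of a .gmt file describes one gene set: a set name, a description, and then the member genes, all separated by tabs. The snippet below is a minimal sketch of building one such line. The helper name `format_gmt_line` is hypothetical and not part of this notebook; the example values are taken from the outputs shown later in the notebook.

```python
def format_gmt_line(set_name, description, genes):
    """Return one tab-separated .gmt line: set name, description, then genes."""
    return '\t'.join([set_name, description] + list(genes))

# Example values taken from the aggregated output later in this notebook
line = format_gmt_line('AAK1_Homo sapiens', 'PhosphoELM', ['NUMB', 'AP1M1', 'AP2M1'])
print(line)
```

In this notebook the set name is the kinase (with its organism), and the genes are that kinase's target proteins.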

## Import the packages required by this notebook

In [148]:
%run /home/maayanlab/Projects/Scripts/init.ipy

## Create a dataframe from a file containing PhosphoELM data

In [149]:
#read data from excel file into dataframe 'phospho_df'
phospho_df = pd.read_excel('~/Desktop/phosphoELM_all_2015-04.xlsm')
phospho_df.head()

Unnamed: 0,acc,sequence,position,code,pmids,kinases,source,species,entry_date
0,O08539,MAEMGSKGVTAGKIASNVQKKLTRAQEKVLQKLGKADETKDEQFEQ...,304,S,17114649,,HTP,Mus musculus,2005-03-14 12:16:11.108314+01
1,O08539,MAEMGSKGVTAGKIASNVQKKLTRAQEKVLQKLGKADETKDEQFEQ...,304,S,17242355,,HTP,Mus musculus,2005-03-14 12:16:11.108314+01
2,O08539,MAEMGSKGVTAGKIASNVQKKLTRAQEKVLQKLGKADETKDEQFEQ...,304,S,15345747,,HTP,Mus musculus,2005-03-14 12:16:11.108314+01
3,O08539,MAEMGSKGVTAGKIASNVQKKLTRAQEKVLQKLGKADETKDEQFEQ...,296,S,17114649,,HTP,Mus musculus,2007-07-13 15:17:45.666219+02
4,O08539,MAEMGSKGVTAGKIASNVQKKLTRAQEKVLQKLGKADETKDEQFEQ...,296,S,17242355,,HTP,Mus musculus,2007-07-13 15:17:45.666219+02


## Filter by Organism

In [150]:
# define a list of selected organisms
organisms = ['Mus musculus', 'Homo sapiens']

# build a boolean mask of rows whose species is in the selected organisms
mask = phospho_df['species'].isin(organisms)

# keep only the needed columns and drop rows with missing values
phospho_df_filter = phospho_df.loc[mask, ['acc', 'kinases', 'species']].dropna()
phospho_df_filter.head()

Unnamed: 0,acc,kinases,species
9,O08605,PAK2,Mus musculus
10,O08605,PAK2,Mus musculus
14,O14543,Lck,Homo sapiens
16,O14543,Lck,Homo sapiens
18,O14746,PKB_group,Homo sapiens


## Convert UniProt IDs to Gene Symbols

In [151]:
# Use uniprot_to_symbol function from Scripts.py to convert
phospho_df_filter['target_symbol'] = Scripts.uniprot_to_symbol(phospho_df_filter['acc'].tolist())
phospho_df_filter.head()

Unnamed: 0,acc,kinases,species,target_symbol
9,O08605,PAK2,Mus musculus,Mknk1
10,O08605,PAK2,Mus musculus,Mknk1
14,O14543,Lck,Homo sapiens,SOCS3
16,O14543,Lck,Homo sapiens,SOCS3
18,O14746,PKB_group,Homo sapiens,TERT
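The internals of `Scripts.uniprot_to_symbol` are not shown in this notebook. As a rough illustration of what such a conversion might involve, here is a minimal sketch that assumes a local accession-to-symbol dictionary; the dictionary name `uniprot_to_symbol_map` and the accession `X00000` are made up for the example, while the three real mappings are copied from the output above.

```python
import pandas as pd

# Hypothetical lookup table; the three entries are copied from the output above.
uniprot_to_symbol_map = {'O08605': 'Mknk1', 'O14543': 'SOCS3', 'O14746': 'TERT'}

# 'X00000' is a made-up accession used to show what happens to unmapped IDs
toy = pd.DataFrame({'acc': ['O08605', 'O14543', 'O14746', 'X00000']})

# Series.map yields NaN for accessions that are missing from the dictionary
toy['target_symbol'] = toy['acc'].map(uniprot_to_symbol_map)
print(toy)
```

Unmapped accessions end up as NaN, so they can be removed with `dropna()` in a later step.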


## Create a new column combining kinases and organism

In [152]:
# Combine 'kinases' and 'species' into one column 'kinase_organism'
phospho_df_filter['kinase_organism'] = phospho_df_filter['kinases'] + '_' + phospho_df_filter['species']
phospho_df_filter.head()

Unnamed: 0,acc,kinases,species,target_symbol,kinase_organism
9,O08605,PAK2,Mus musculus,Mknk1,PAK2_Mus musculus
10,O08605,PAK2,Mus musculus,Mknk1,PAK2_Mus musculus
14,O14543,Lck,Homo sapiens,SOCS3,Lck_Homo sapiens
16,O14543,Lck,Homo sapiens,SOCS3,Lck_Homo sapiens
18,O14746,PKB_group,Homo sapiens,TERT,PKB_group_Homo sapiens


## Perform preliminary data processing 

We must drop duplicates and NaNs and keep only the columns necessary for the .gmt file format: the target gene symbols and the combined kinase/organism labels. The species is retained as part of the 'kinase_organism' label so that the data can later be filtered by species.

In [153]:
#select columns necessary for .gmt format into new dataframe 'df'
#(.copy() makes 'df' independent of 'phospho_df_filter', so the in-place edits below are safe)
df = phospho_df_filter[['target_symbol', 'kinase_organism']].copy()

#drop duplicate rows in the dataframe
df.drop_duplicates(inplace = True)

#drop all rows containing an 'NaN' value
df.dropna(axis = 0, inplace = True)

#Visualize data
df.head()

Unnamed: 0,target_symbol,kinase_organism
9,Mknk1,PAK2_Mus musculus
14,SOCS3,Lck_Homo sapiens
18,TERT,PKB_group_Homo sapiens
21,TERT,SRC_Homo sapiens
27,IKBKB,IKK_group_Homo sapiens


## Set Index to 'Kinase_Organism' and Aggregate Kinase Targets

In [154]:
#Group rows by 'kinase_organism' and collect each kinase's targets into a tuple
#(groupby makes 'kinase_organism' the index of 'kin')
kin = df.groupby('kinase_organism').agg(lambda x: tuple(x))

#Insert a 'Description' column with the constant value 'PhosphoELM'
kin.insert(0, 'Description', 'PhosphoELM')

#Visualize Data
kin.head()

Unnamed: 0_level_0,Description,target_symbol
kinase_organism,Unnamed: 1_level_1,Unnamed: 2_level_1
AAK1_Homo sapiens,PhosphoELM,"(NUMB, AP1M1, AP2M1)"
ALK_Homo sapiens,PhosphoELM,"(STAT3, ALK, ZC3HC1)"
AMPK_group_Homo sapiens,PhosphoELM,"(IRS1, KPNA2, EP300, EEF2K, PFKFB2, HNF4A, PRK..."
AMPK_group_Mus musculus,PhosphoELM,"(Irs1, Mtor)"
ATM_Homo sapiens,PhosphoELM,"(CHEK2, ABL1, TP53, ATF2, RPA2, CREB1, BRCA1, ..."


## Create column representing count of Protein targets for each kinase

In [159]:
# Create column representing counts of protein targets per kinase
kin['kinase_targets_num'] = kin['target_symbol'].apply(len)
kin.sort_values(by = ['kinase_targets_num'], ascending= False, inplace=True)
kin.head()

Unnamed: 0_level_0,Description,target_symbol,kinase_targets_num
kinase_organism,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
PKA_group_Homo sapiens,PhosphoELM,"(HIST1H2BI, FGA, ESR1, RAF1, ANXA1, TH, ADRB2,...",139
CK2_group_Homo sapiens,PhosphoELM,"(C1R, JUN, SSB, PGR, CDK1, HSP90AA1, HNRNPC, S...",84
PKC_group_Homo sapiens,PhosphoELM,"(EGFR, IL2RA, LMNA, FGA, TFRC, RAF1, ANXA1, EE...",81
CDK1_Homo sapiens,PhosphoELM,"(BIRC5, EGFR, TK1, KRT8, NPM1, PPP1CA, VIM, TO...",67
PKC_alpha_Homo sapiens,PhosphoELM,"(EGFR, TFRC, ANXA1, INSR, ANXA2, PLEK, SRF, CF...",67


## Create Dictionary of Tab-Separated Rows of the Dataframe

In [None]:
#Reset index of the dataframe
kin.reset_index(inplace = True)

#create column 'target_symbol_merged' in which all target symbols are joined by a \t character
kin['target_symbol_merged'] = ['\t'.join(x) for x in kin['target_symbol']]

#drop the now-unnecessary column 'target_symbol'
kin.drop('target_symbol', axis=1, inplace = True)

#Create dictionary 'PhosphoELM_num' with index numbers as keys
PhosphoELM_num = dict([(key, '') for key in kin.index])

#columns that make up a .gmt line (the numeric 'kinase_targets_num' column is excluded,
#since str.join only accepts strings)
gmt_columns = ['kinase_organism', 'Description', 'target_symbol_merged']

# loop through rows with iterrows()
for index, rowData in kin[gmt_columns].iterrows():
    line = ('\t'.join(rowData))
    PhosphoELM_num[index] = line

## Write Info from Dictionary into a .GMT file

In [None]:
#Write each tab-separated line into a new .gmt file
with open('PhosphoELM.gmt', 'w') as openfile:
    for index in PhosphoELM_num:
        openfile.write(str(PhosphoELM_num[index]) + '\n')
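As an alternative to the intermediate dictionary, the lines can be written straight from a mapping of set names to targets. The following is a minimal sketch under that assumption; `write_gmt` is a hypothetical helper, and an in-memory buffer stands in for the real output file so the example has no side effects.

```python
import io

def write_gmt(gene_sets, description, handle):
    """Write {set_name: iterable of genes} to an open text handle in .gmt layout."""
    for set_name, genes in gene_sets.items():
        handle.write('\t'.join([set_name, description] + list(genes)) + '\n')

# Example gene sets copied from the aggregated output earlier in this notebook
gene_sets = {
    'AAK1_Homo sapiens': ('NUMB', 'AP1M1', 'AP2M1'),
    'AMPK_group_Mus musculus': ('Irs1', 'Mtor'),
}

buffer = io.StringIO()  # stands in for open('PhosphoELM.gmt', 'w')
write_gmt(gene_sets, 'PhosphoELM', buffer)
print(buffer.getvalue())
```

With a real file, the same function could be called inside a `with open(...) as openfile:` block.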

## Test: Reading in the Newly-Created .GMT File

In [160]:
#name the first two columns and number the remaining target columns (141 columns in total)
names = ['kinase', 'Description'] + ['protein.%d' % i for i in range(2, 141)]
df2 = pd.read_table('PhosphoELM.gmt', delimiter = '\t', names = names)

#show missing targets as empty strings instead of NaN
df2 = df2.fillna('')
df2


Unnamed: 0,kinase,Description,protein.2,protein.3,protein.4,protein.5,protein.6,protein.7,protein.8,protein.9,...,protein.131,protein.132,protein.133,protein.134,protein.135,protein.136,protein.137,protein.138,protein.139,protein.140
0,AAK1_Homo sapiens,PhosphoELM,NUMB,AP1M1,AP2M1,,,,,,...,,,,,,,,,,
1,ALK_Homo sapiens,PhosphoELM,STAT3,ALK,ZC3HC1,,,,,,...,,,,,,,,,,
2,AMPK_group_Homo sapiens,PhosphoELM,IRS1,KPNA2,EP300,EEF2K,PFKFB2,HNF4A,PRKAA2,LIPE,...,,,,,,,,,,
3,AMPK_group_Mus musculus,PhosphoELM,Irs1,Mtor,,,,,,,...,,,,,,,,,,
4,ATM_Homo sapiens,PhosphoELM,CHEK2,ABL1,TP53,ATF2,RPA2,CREB1,BRCA1,MRE11,...,,,,,,,,,,
5,ATM_Mus musculus,PhosphoELM,H2afx,,,,,,,,...,,,,,,,,,,
6,ATR_Homo sapiens,PhosphoELM,CREB1,BRCA1,MDM2,E2F1,CHEK1,ATRIP,,,...,,,,,,,,,,
7,Abl2_Homo sapiens,PhosphoELM,CAT,SIVA1,GPX1,,,,,,...,,,,,,,,,,
8,Abl_Homo sapiens,PhosphoELM,PLSCR1,TP73,ABL1,ANXA1,TOP1,MUC1,NFKBIA,CRKL,...,,,,,,,,,,
9,Abl_Mus musculus,PhosphoELM,Abl1,Cav1,Crk,Dok1,Apbb1,Myod1,Jak2,,...,,,,,,,,,,
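Because .gmt rows have different lengths, reading them into a rectangular dataframe pads most rows with empty cells. A line-by-line parse into a dictionary avoids the padding. The sketch below assumes the layout written above (name, description, then targets); `read_gmt` is a hypothetical helper, and the sample text reuses rows from the output above.

```python
import io

def read_gmt(handle):
    """Parse .gmt text into {set_name: [genes]}, ignoring the description field."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        set_name, _description, *genes = fields
        gene_sets[set_name] = [gene for gene in genes if gene]
    return gene_sets

# In-memory sample standing in for open('PhosphoELM.gmt')
sample = io.StringIO(
    'AAK1_Homo sapiens\tPhosphoELM\tNUMB\tAP1M1\tAP2M1\n'
    'AMPK_group_Mus musculus\tPhosphoELM\tIrs1\tMtor\n'
)
parsed = read_gmt(sample)
print(parsed)
```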


In [None]:
##Display figures regarding the data
#look into plotly
