# PhosphoELM Data Formatting

This notebook takes kinase-protein interaction data from the PhosphoELM database and converts it into the .gmt format. The data was retrieved from the PhosphoELM database on Wed, Jun 7 2017 16:27:31. The converted data will be added to the KEA2 database and formatted for use by ENRICHR and X2K.

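For orientation, a .gmt file holds one gene set per line: a set name, a description, and then the member genes, all separated by tabs. The snippet below is a minimal sketch of that layout. The helper name `gmt_line` is hypothetical, and the set and gene names are placeholders rather than PhosphoELM data.

```python
def gmt_line(set_name, description, genes):
    """Format one gene set as a tab-separated .gmt line (hypothetical helper)."""
    return '\t'.join([set_name, description, *genes])

# placeholder values, not taken from PhosphoELM
line = gmt_line('KINASE_Homo sapiens', 'PhosphoELM', ['GENE_A', 'GENE_B'])
print(repr(line))
```

In this notebook the set name would be the kinase label and the genes would be its targets.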
## Import the packages needed by this notebook

In [8]:
%run /home/maayanlab/Projects/Scripts/init.ipy

## Create a dataframe from a file containing PhosphoELM data

In [5]:
#read data from excel file into dataframe 'phospho_df'
phospho_df = pd.read_excel('~/Desktop/phosphoELM_all_2015-04.xlsm')
phospho_df.head()

Unnamed: 0,acc,sequence,position,code,pmids,kinases,source,species,entry_date
0,O08539,MAEMGSKGVTAGKIASNVQKKLTRAQEKVLQKLGKADETKDEQFEQ...,304,S,17114649,,HTP,Mus musculus,2005-03-14 12:16:11.108314+01
1,O08539,MAEMGSKGVTAGKIASNVQKKLTRAQEKVLQKLGKADETKDEQFEQ...,304,S,17242355,,HTP,Mus musculus,2005-03-14 12:16:11.108314+01
2,O08539,MAEMGSKGVTAGKIASNVQKKLTRAQEKVLQKLGKADETKDEQFEQ...,304,S,15345747,,HTP,Mus musculus,2005-03-14 12:16:11.108314+01
3,O08539,MAEMGSKGVTAGKIASNVQKKLTRAQEKVLQKLGKADETKDEQFEQ...,296,S,17114649,,HTP,Mus musculus,2007-07-13 15:17:45.666219+02
4,O08539,MAEMGSKGVTAGKIASNVQKKLTRAQEKVLQKLGKADETKDEQFEQ...,296,S,17242355,,HTP,Mus musculus,2007-07-13 15:17:45.666219+02


## Filter by Organism

In [6]:
# define a list of selected organisms
organisms = ['Mus musculus', 'Homo sapiens']

# boolean mask of rows whose species is in the selected organisms
is_selected = phospho_df['species'].isin(organisms)

# filter
phospho_df_filter = phospho_df.loc[is_selected, ['acc', 'kinases', 'species']].dropna()
phospho_df_filter.head()

Unnamed: 0,acc,kinases,species
9,O08605,PAK2,Mus musculus
10,O08605,PAK2,Mus musculus
14,O14543,Lck,Homo sapiens
16,O14543,Lck,Homo sapiens
18,O14746,PKB_group,Homo sapiens


## Convert UniProt IDs to Gene Symbols

In [11]:
# Use uniprot_to_symbol function from Scripts.py to convert
phospho_df_filter['target_symbol'] = Scripts.uniprot_to_symbol(phospho_df_filter['acc'].tolist())
phospho_df_filter.head()

Unnamed: 0,acc,kinases,species,target_symbol
9,O08605,PAK2,Mus musculus,Mknk1
10,O08605,PAK2,Mus musculus,Mknk1
14,O14543,Lck,Homo sapiens,SOCS3
16,O14543,Lck,Homo sapiens,SOCS3
18,O14746,PKB_group,Homo sapiens,TERT


## Create a new column combining kinases and organism

In [15]:
# Combine 'kinases' and 'species' into one column 'kinase_organism'
# (DataFrame.as_matrix() was removed in pandas 1.0, so the strings are joined column-wise instead)
phospho_df_filter['kinase_organism'] = phospho_df_filter['kinases'].str.cat(phospho_df_filter['species'], sep='_')
phospho_df_filter.head()

Unnamed: 0,acc,kinases,species,target_symbol,kinase_organism
9,O08605,PAK2,Mus musculus,Mknk1,PAK2_Mus musculus
10,O08605,PAK2,Mus musculus,Mknk1,PAK2_Mus musculus
14,O14543,Lck,Homo sapiens,SOCS3,Lck_Homo sapiens
16,O14543,Lck,Homo sapiens,SOCS3,Lck_Homo sapiens
18,O14746,PKB_group,Homo sapiens,TERT,PKB_group_Homo sapiens


## Perform preliminary data processing

We must drop duplicates and NaNs, and select only the columns necessary for the .gmt file format: the target gene symbols and the combined kinase/organism labels. The species is carried inside the 'kinase_organism' label, so the data can still be separated by species later.

In [20]:
#select columns necessary for .gmt format and filter into new dataframe 'df'
#(.copy() makes 'df' independent of 'phospho_df_filter', which avoids the SettingWithCopyWarning)
df = phospho_df_filter[['target_symbol', 'kinase_organism']].copy()

#drop duplicate rows in the dataframe
df.drop_duplicates(inplace = True)

#drop all rows with an 'NaN' value in any column
df.dropna(axis = 0, inplace = True)

#Visualize data
df.head()

Unnamed: 0,target_symbol,kinase_organism
9,Mknk1,PAK2_Mus musculus
14,SOCS3,Lck_Homo sapiens
18,TERT,PKB_group_Homo sapiens
21,TERT,SRC_Homo sapiens
27,IKBKB,IKK_group_Homo sapiens

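One way to turn a two-column table like this into gene sets is to group the targets by kinase label. The sketch below assumes a dataframe with the same two column names as `df`, rebuilt by hand from the five rows shown above so that the block runs on its own. It is an illustration under that assumption, not necessarily how the final .gmt file was produced.

```python
import pandas as pd

# the five rows displayed above, re-entered by hand so the example is self-contained
example_df = pd.DataFrame({
    'target_symbol': ['Mknk1', 'SOCS3', 'TERT', 'TERT', 'IKBKB'],
    'kinase_organism': ['PAK2_Mus musculus', 'Lck_Homo sapiens',
                        'PKB_group_Homo sapiens', 'SRC_Homo sapiens',
                        'IKK_group_Homo sapiens'],
})

# collect the unique targets of each kinase into a sorted list
gene_sets = (
    example_df.groupby('kinase_organism')['target_symbol']
    .apply(lambda s: sorted(set(s)))
    .to_dict()
)

for kinase, targets in gene_sets.items():
    print(kinase, targets)
```

Each key of `gene_sets` would become the first field of a .gmt line and each list would supply the gene fields.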


## Create new function to convert UniProtKB IDs to gene symbols

The function 'uniprot_to_gene' retrieves the gene symbol for a UniProt accession from the EBI Proteins API and returns it.

In [None]:

#Create dictionary 'PhosphoELM' with kinases as keys
#PhosphoELM = dict([(key, []) for key in kinases.unique()])

import numpy as np
import requests
import xmltodict

#Define url to obtain gene symbol from the EBI Proteins API
PROTEINS_API_URL = 'https://www.ebi.ac.uk/proteins/api/proteins/%s'

#Define function uniprot_to_gene which converts uniprot_id into the gene symbol
def uniprot_to_gene(protein_id):
    # ask for XML explicitly, because the response is parsed with xmltodict
    response = requests.get(PROTEINS_API_URL % protein_id, headers={'Accept': 'application/xml'})
    if not response.ok:
        return np.nan
    entry = xmltodict.parse(response.text)['entry']
    # entries without a 'gene' element fall back to the entry name
    if 'gene' not in entry:
        return entry['name']
    gene = entry['gene']
    # several 'gene' elements are ambiguous, so keep the accession itself
    if isinstance(gene, list):
        return str(protein_id)
    names = gene['name']
    # several names for one gene: take the first one listed
    first = names[0] if isinstance(names, list) else names
    # xmltodict stores the text of an element that has attributes under '#text'
    return first['#text'] if isinstance(first, dict) else first
    

In [None]:
# Create a set of UniProt IDs
uniprot_ids = kin['acc'].unique()

# Create dictionary to match UniProt IDs to gene symbols
uniprot_to_gene_dict = {x: uniprot_to_gene(x) for x in uniprot_ids[:100]}
uniprot_to_gene_dict

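Once such a dictionary exists, it can be applied to a whole accession column in one step rather than one request per row. A minimal sketch, assuming a small hand-written mapping in place of live API results (the three pairs are copied from the table output earlier in this notebook, and 'UNKNOWN_ID' is a made-up accession):

```python
import pandas as pd

# accession -> symbol pairs taken from the output shown earlier in the notebook
example_mapping = {'O08605': 'Mknk1', 'O14543': 'SOCS3', 'O14746': 'TERT'}

# 'UNKNOWN_ID' is a placeholder that is deliberately absent from the mapping
accessions = pd.Series(['O08605', 'O14543', 'O14746', 'UNKNOWN_ID'])

# Series.map looks each value up in the dict; missing keys become NaN
symbols = accessions.map(example_mapping)
print(symbols.tolist())
```

Accessions that map to NaN can then be removed with `dropna()` before the gene sets are built.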
## Convert protein UniProtKB IDs into gene symbols

In [None]:

# convert the first five accessions in place
# (Series.iteritems() was removed in pandas 2.0; .loc avoids chained assignment)
for index, protein_id in kin['acc'].head(5).items():
    geneS = uniprot_to_gene(str(protein_id))
    kin.loc[index, 'acc'] = geneS
    print(geneS)

In [None]:
#Look at 'kin' after altering accession numbers
kin.head()

In [None]:
#Group kinases in dataframe 'kin'
#Aggregate data in 'kin' according to kinase groups
kin = kin.groupby('kinases').agg(lambda x: tuple(x))

#Insert a new column 'Description' holding 'PhosphoELM' as description of data
kin.insert(0, 'Description', 'PhosphoELM')

In [None]:
# reshape the dataframe so that each row yields one .gmt line:
# kinase (the index after groupby), description, acc_merged (acc, but all elements are joined by a \t symbol)

#create column 'acc_merged' in which all 'acc' elements are joined by a \t symbol
kin['acc_merged'] = ['\t'.join(x) for x in kin['acc']]

#drop the now-unnecessary column 'acc'
kin.drop('acc', axis=1, inplace = True)

#Create dictionary 'PhosphoELM_num' mapping each kinase to its .gmt line
PhosphoELM_num = {}

# loop through rows with iterrows(); the kinase name is the index, so it is
# prepended to the tab-joined row values, and the line is stored as a plain string
for index, rowData in kin.iterrows():
    PhosphoELM_num[index] = '\t'.join([str(index)] + list(rowData))

In [None]:
#Transfer tsv info into a new txt file
with open('PhosphoELM.txt', 'w') as openfile:
    for index in PhosphoELM_num:
        openfile.write(str(PhosphoELM_num[index]) + '\n')

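A quick way to check a written file is to read it back and split each line on tabs. The sketch below first writes a tiny placeholder file to a temporary directory so that it runs on its own. `read_gmt` is a hypothetical helper, and the set and gene names are made up.

```python
import os
import tempfile

def read_gmt(path):
    """Parse a .gmt file into {set_name: (description, [genes])} (hypothetical helper)."""
    gene_sets = {}
    with open(path) as handle:
        for raw in handle:
            fields = raw.rstrip('\n').split('\t')
            if len(fields) < 3:
                continue  # skip blank lines and lines without any gene
            gene_sets[fields[0]] = (fields[1], fields[2:])
    return gene_sets

# write a two-line placeholder file, then read it back
tmp_path = os.path.join(tempfile.mkdtemp(), 'example.gmt')
with open(tmp_path, 'w') as handle:
    handle.write('KINASE_A\tPhosphoELM\tGENE_1\tGENE_2\n')
    handle.write('KINASE_B\tPhosphoELM\tGENE_3\n')

parsed = read_gmt(tmp_path)
print(parsed)
```

If a line comes back with brackets or a literal backslash-t in it, the line was probably written as the string form of a Python list rather than as plain text.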
In [None]:
##Display figures regarding the data

#look into plotly


In [None]:
#Corner case as TEST
protein_id = 'O43524'
geneS = uniprot_to_gene(protein_id)
print(geneS)

In [41]:
from xml.etree import ElementTree as ET
from urllib.request import Request, urlopen

PROTEINS_API_URL = 'https://www.ebi.ac.uk/proteins/api/proteins/%s'
protein_id = 'O43524'

# Python 3 moved urlopen into urllib.request; ask for XML so ElementTree can parse the response
request = Request(PROTEINS_API_URL % protein_id, headers={'Accept': 'application/xml'})
with urlopen(request) as response:
    tree = ET.parse(response)