# PhosphoELM Data Formatting

This notebook takes kinase-protein interaction data from the PhosphoELM database and converts it into the .gmt format. The data was retrieved from the PhosphoELM database on Wed, Jun 7 2017 16:27:31. The resulting file will be added to the KEA2 database and is formatted for use by ENRICHR and X2K.
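
For reference, a .gmt file stores one gene set per line: the set name, a description, and then the member genes, all separated by tabs. The sketch below assembles one such line; the helper name `to_gmt_line` and the gene symbols are hypothetical and only illustrate the layout.

```python
def to_gmt_line(set_name, description, genes):
    """Join a gene-set name, a description, and its genes with tabs."""
    return '\t'.join([set_name, description, *genes])

# Hypothetical example values, for illustration only
example_line = to_gmt_line('PAK2', 'PhosphoELM', ['GENE1', 'GENE2'])
print(example_line)
```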

## Import the packages needed by this notebook

In [None]:
import numpy as np
import pandas as pd
import xmltodict
import json
import requests

## Create a dataframe from a file containing PhosphoELM data

In [None]:
#read data from excel file into dataframe 'phospho_df'
phospho_df = pd.read_excel('~/Desktop/phosphoELM_all_2015-04.xlsm')

## Perform preliminary data processing 

We must drop duplicates and NaNs, as well as select only the columns necessary for the .gmt file format (the protein ids and kinase gene symbols). For future steps, we will also use the 'kinases' column as an index for the dataframe.

In [10]:
#select columns necessary for .gmt format and copy them into new dataframe 'df'
#(.copy() avoids pandas' SettingWithCopyWarning on the in-place edits below)
df = phospho_df[['acc', 'kinases']].copy()

#drop all rows with a 'NaN' value for the kinases
df.dropna(axis = 0, inplace = True)

#drop duplicate rows in the dataframe
df.drop_duplicates(inplace = True)

#use the 'kinases' column as the index
#creates new dataframe 'kin'
kin = df.set_index('kinases')

kin.head()


kinases,acc
PAK2,O08605
Lck,O14543
PKB_group,O14746
SRC,O14746
IKK_group,O14920


## Convert protein UniProtKB IDs into gene symbols

In [None]:

#Create dictionary 'PhosphoELM' with kinases as keys
PhosphoELM = {key: [] for key in kin.index}

#Define url of the EBI Proteins API, used to look up the gene symbol
PROTEINS_API_URL = 'https://www.ebi.ac.uk/proteins/api/proteins/%s'

#Define function uniprot_to_gene which converts uniprot_id into the gene symbol
def uniprot_to_gene(protein_id):
    #Request XML explicitly, since the response is parsed with xmltodict
    #(assumes the API honours the Accept header)
    response = requests.get(PROTEINS_API_URL % protein_id,
                            headers={'Accept': 'application/xml'})
    if not response.ok:
        #np.nan marks a failed lookup (the np.NaN alias was removed in NumPy 2.0)
        name = np.nan
    else:
        data = xmltodict.parse(response.text)
        entry = data['entry']
        # check if entry contains 'gene'
        if 'gene' in entry:
            gene = entry['gene']
            #Several 'gene' elements parse to a list; fall back to the accession
            if isinstance(gene, list):
                name = str(protein_id)
            else:
                names = gene['name']
                #Several 'name' elements parse to a list; use the first one
                if isinstance(names, list):
                    names = names[0]
                #xmltodict stores an element's text under the '#text' key
                name = names['#text']
        else:
            name = entry['name']
    #After processing the response, return the protein's gene symbol
    return name

acc = kin['acc']

#Series.items() replaces iteritems(), which was removed in pandas 2.0
for key, row in acc.items():
    protein_id = str(row)
    geneS = uniprot_to_gene(protein_id)
    PhosphoELM[key].append(geneS)
    #print the kinase to show progress
    print(key)
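
The same accession can appear under more than one kinase (O14746 is listed for both PKB_group and SRC in the output above), so the loop may request the same protein several times. One possible way to avoid the repeated requests is to memoize the lookup with `functools.lru_cache`. This is only a sketch: `fake_lookup` is a hypothetical stand-in for `uniprot_to_gene` so that the example runs without network access, and the gene names it returns are invented.

```python
from functools import lru_cache

call_log = []  # records every uncached lookup, to show the cache at work

def fake_lookup(protein_id):
    """Hypothetical stand-in for uniprot_to_gene; makes no network request."""
    call_log.append(protein_id)
    return 'GENE_' + protein_id

@lru_cache(maxsize=None)
def cached_lookup(protein_id):
    # Only the first call per accession reaches fake_lookup
    return fake_lookup(protein_id)

pairs = [('PKB_group', 'O14746'), ('SRC', 'O14746'), ('PAK2', 'O08605')]
result = {}
for kinase, accession in pairs:
    result.setdefault(kinase, []).append(cached_lookup(accession))

print(call_log)
```

In the notebook itself, the decorator could be applied to `uniprot_to_gene` directly, since accession strings are hashable.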
    

In [None]:
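#Using dictionary, re-create dataframe 'kin' with gene symbols rather than protein accession numbers
#One row per (kinase, gene symbol) pair, so that 'kinases' and 'acc' are columns for the grouping below
kin = pd.DataFrame(
    [(kinase, gene) for kinase, genes in PhosphoELM.items() for gene in genes],
    columns = ['kinases', 'acc'])

#drop failed lookups (NaN) and repeated kinase-gene pairs
kin = kin.dropna().drop_duplicates()

#Look at format of newly created kin dataframe
kin.head()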

In [None]:
#Group kinases in dataframe 'kin'
#Aggregate data in 'kin' according to kinase groups
kin = kin.groupby('kinases').agg(lambda x: tuple(x))

#Create a new column 'Description' holding the label 'PhosphoELM'
kin.insert(0, 'Description', 'PhosphoELM')
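
To make the grouping step concrete, here is a small sketch on a toy dataframe. The gene symbols are invented, and the frame is assumed to have one row per kinase-gene pair with columns named 'kinases' and 'acc'.

```python
import pandas as pd

# Toy data, for illustration only: one row per (kinase, gene symbol) pair
toy = pd.DataFrame({
    'kinases': ['SRC', 'SRC', 'PAK2'],
    'acc': ['GENE_A', 'GENE_B', 'GENE_C'],
})

# Collect all gene symbols of each kinase into a single tuple
grouped = toy.groupby('kinases').agg(lambda x: tuple(x))

# Put a constant description in front of the gene column
grouped.insert(0, 'Description', 'PhosphoELM')

print(grouped)
```

After this step each kinase occupies exactly one row, which is the shape a .gmt line needs.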

In [None]:
# fix the dataframe in order to have three columns:
# kinases, description, acc_merged (acc, but all elements are joined by a \t symbol)
# with a reset index

#reset index of the dataframe to integers, restores column 'kinases'
kin.reset_index(inplace = True)

#create column 'acc_merged' in which all 'acc' elements are joined by a \t symbol
kin['acc_merged'] = ['\t'.join(x) for x in kin['acc']]

#drop the now-unnecessary column 'acc'
kin.drop('acc', axis=1, inplace = True)

#Create dictionary 'PhosphoELM_num' mapping index numbers to .gmt lines
PhosphoELM_num = {}

# loop through rows with iterrows()
for index, rowData in kin.iterrows():
    #join kinase, description and genes into one tab-separated string
    PhosphoELM_num[index] = '\t'.join(rowData)

In [None]:
#Write the tab-separated lines to a new txt file
with open('PhosphoELM.txt', 'w') as openfile:
    for index in PhosphoELM_num:
        #each value is already a tab-separated string
        openfile.write(PhosphoELM_num[index] + '\n')
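
A quick way to check the written file is to read it back and confirm that every line has a name, a description, and at least one gene. The sketch below does this for a tiny file it writes itself; `read_gmt` is a hypothetical helper and the file contents are invented, so it runs without the real PhosphoELM output.

```python
import os
import tempfile

def read_gmt(path):
    """Parse a .gmt file into {set_name: (description, [genes])}."""
    gene_sets = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip('\n').split('\t')
            # A valid line needs a name, a description, and at least one gene
            if len(fields) < 3:
                raise ValueError('malformed .gmt line: %r' % line)
            gene_sets[fields[0]] = (fields[1], fields[2:])
    return gene_sets

# Write a tiny hypothetical file so the sketch runs on its own
tmp_path = os.path.join(tempfile.mkdtemp(), 'example.gmt')
with open(tmp_path, 'w') as handle:
    handle.write('SRC\tPhosphoELM\tGENE_A\tGENE_B\n')

parsed = read_gmt(tmp_path)
print(parsed)
```

Pointing `read_gmt` at 'PhosphoELM.txt' would apply the same check to the real output.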

In [None]:
##Display figures regarding the data

#look into plotly
