<h1>Project Imitamenta</h1>
<h2>Protein imitation in Dengue </h2>

<h3>Abstract</h3>

<h3>Introduction</h3>

 

<ol>
<li><h4>Dengue: A Clinical Approach</h4></li>
    <p>Dengue fever is a potentially fatal disease caused by the dengue virus. It is one of the most widespread mosquito-borne diseases (WHO and HealthMap.org, 1997). According to the World Health Organization (WHO, 2012), around 40% of the world's population is at risk of contracting dengue, especially in tropical and subtropical regions (WHO and HealthMap.org, 1997) (Fig 1.1). An estimated 100 million infections may occur annually, of which about 500,000 require hospitalization (Halstead et al., 2007; WHO, 2012).</p>
</ol>

In [1]:
# Import the packages used in the project
import pandas as pd  # pandas provides fast DataFrame analysis
import numpy
import glob
import requests
import os
from bs4 import BeautifulSoup
import pypdb
import zipfile

In [2]:
# First, retrieve the UniProt search results with the following parameters
# Taxonomy: 12637 (dengue virus)
# Format: tab-separated
# Columns: ID, protein names and review status
url="https://www.uniprot.org/uniprot/?query=taxonomy:12637&format=tab&sort=score&columns=id,protein%20names,reviewed"

# Read the tab-separated response at the URL into a DataFrame
c=pd.read_csv(url, sep="\t")
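The query URL above packs several parameters into one string. As a minimal sketch, the same URL can be assembled from a dictionary with the standard library, which makes each parameter easier to change. The helper name `build_uniprot_url` is hypothetical; the endpoint and parameter values are the ones used in the cell above.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://www.uniprot.org/uniprot/"  # endpoint taken from the cell above

def build_uniprot_url(taxonomy_id, columns, fmt="tab", sort="score"):
    """Assemble the search URL from its parts (hypothetical helper)."""
    params = {
        "query": f"taxonomy:{taxonomy_id}",
        "format": fmt,
        "sort": sort,
        "columns": ",".join(columns),
    }
    # safe=":," leaves ':' and ',' unescaped; quote encodes spaces as %20
    return BASE_URL + "?" + urlencode(params, quote_via=quote, safe=":,")

url = build_uniprot_url(12637, ["id", "protein names", "reviewed"])
print(url)
```

Keeping the parameters in a dictionary means a different taxonomy ID or column list only requires changing the function arguments.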



In [3]:
# Separate the entries into two DataFrames by review status

unreviewed_uniprot_dengue = c.loc[c['Status'] == "unreviewed"] 
reviewed_uniprot_dengue = c.loc[c['Status'] == "reviewed"] 

# c.loc already returns DataFrames; these wrappers only give them the df_ names used below
df_unreviewed_uniprot_dengue = pd.DataFrame(unreviewed_uniprot_dengue)
df_reviewed_uniprot_dengue = pd.DataFrame(reviewed_uniprot_dengue)
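The split by review status can also be written as a single `groupby`. The sketch below runs on a small in-memory table; the accessions and the helper name `split_by_status` are made up for illustration, and only the column names follow the cell above.

```python
import io
import pandas as pd

# Hypothetical tab-separated response with the three requested columns
sample_tsv = (
    "Entry\tProtein names\tStatus\n"
    "X00001\tGenome polyprotein\treviewed\n"
    "X00002\tEnvelope protein E\tunreviewed\n"
    "X00003\tNon-structural protein 1\tunreviewed\n"
)

def split_by_status(df):
    """Return a dict mapping each Status value to its own DataFrame."""
    return {
        status: group.reset_index(drop=True)
        for status, group in df.groupby("Status")
    }

table = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
by_status = split_by_status(table)
print({status: len(group) for status, group in by_status.items()})
```

A dictionary keyed by status avoids one variable per category if further status values ever appear in the table.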

In [4]:
# Save each list under its matching file name (the output is gzip-compressed despite the .csv extension)
path = "./datasets/dengue_uniprot/uniprot_dengue_list_reviewed.csv"
df_reviewed_uniprot_dengue.to_csv(path, index=False, compression='gzip')
path = "./datasets/dengue_uniprot/uniprot_dengue_list_unreviewed.csv"
df_unreviewed_uniprot_dengue.to_csv(path, index=False, compression='gzip')

In [5]:
# Load the CATH dataset of UniProt sequence annotations, downloaded from
# ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/latest-release/sequence-data/cath-uniprot-annotations.tsv.gz
# With chunksize set, read_csv returns an iterator over 10,000-row chunks
df_cath = pd.read_csv('./datasets/cath_uniprot/cath-uniprot-annotations.tsv',delimiter="\t", chunksize=10000)

In [6]:
def uniprot_to_pdb(code):
    """Return the CATH annotation rows for one UniProt accession."""
    chunksize = 10**6 # read the annotation table a million rows at a time
    columns = ['# UNIPROT_ACC','MODEL_MATCH', 'BOUNDARIES','SUPERFAMILY']
    matches = [] # matching rows collected from every chunk
    for chunk in pd.read_csv('./datasets/cath_uniprot/cath-uniprot-annotations.tsv',delimiter="\t", chunksize=chunksize):
        temp = chunk.loc[chunk["# UNIPROT_ACC"] == code, columns]
        if not temp.empty: # keep the chunk's matches, if any
            matches.append(temp)
    if not matches: # the accession does not appear in the table
        return pd.DataFrame(columns=columns + ["PDB"])
    df_code = pd.concat(matches).reset_index(drop=True)
    df_code["PDB"] = df_code["MODEL_MATCH"].str[:4] # first four characters of the model name
    path = "./datasets/cath_uniprot/"+code+".csv"
    df_code.to_csv(path, index=False, compression='gzip')
    return df_code
                

In [7]:
# Gather the PDB models for the reviewed dengue UniProt accessions
frames = [] # per-accession results that contain at least one match
for uniprot_code in df_reviewed_uniprot_dengue["Entry"]:
    a = uniprot_to_pdb(uniprot_code) # temporary result for this accession
    if a is not None and not a.empty:
        frames.append(a)
df_cath = pd.concat(frames).reset_index(drop=True) if frames else pd.DataFrame()
path = "./datasets/cath_uniprot/uniprot_to_pdb_total.csv"
df_cath.to_csv(path, index=False, compression='gzip')
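The loop above scans the full annotation file once per accession. As a hedged alternative, the sketch below filters for a whole set of accessions in one pass using `isin`. It runs on a made-up miniature table; the accessions, model names and the helper name `filter_accessions` are hypothetical, and only the `# UNIPROT_ACC` and `MODEL_MATCH` column names come from the notebook.

```python
import io
import pandas as pd

# Hypothetical miniature of the annotation table (all values are made up)
SAMPLE_TSV = (
    "# UNIPROT_ACC\tMODEL_MATCH\n"
    "ACC1\t1abcA01\n"
    "ACC2\t2defA01\n"
    "ACC2\t2defB01\n"
    "ACC3\t3ghiA01\n"
)

def filter_accessions(handle, wanted, chunksize=2):
    """Return every row whose accession is in `wanted`, reading the table once."""
    wanted = set(wanted)
    frames = []
    for chunk in pd.read_csv(handle, sep="\t", chunksize=chunksize):
        hits = chunk[chunk["# UNIPROT_ACC"].isin(wanted)]
        if not hits.empty:
            frames.append(hits)
    if not frames:  # nothing matched in any chunk
        return pd.DataFrame(columns=["# UNIPROT_ACC", "MODEL_MATCH"])
    return pd.concat(frames, ignore_index=True)

selected = filter_accessions(io.StringIO(SAMPLE_TSV), ["ACC2", "ACC3"])
print(selected)
```

With a chunk size of 2, the two `ACC2` rows fall into different chunks, so the example also shows that matches spanning a chunk boundary are kept.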

In [8]:
def find_MOL_des(pdb_name,uniprot_name): # function that finds the molecule description of a PDB entry
    pdb = pypdb.describe_pdb(pdb_name)
    A=[] # related PDB identifiers
    B=[] # details of each related PDB entry
    related = pdb.pop('relatedPDB', None) # remove the key; None when it is absent
    if related is None: # no related PDB is available
        A.append(None)
        B.append(None)
    else:
        if not isinstance(related, list): # a single entry arrives as a dict
            related = [related]
        for entry in related:
            l_key = list(entry)
            A.append(entry[l_key[0]]) # first field of the entry
            B.append(entry[l_key[1]] if entry[l_key[1]] else None) # second field, None when empty
    pdb["uniprot_code"]=uniprot_name # store the UniProt accession
    pdb["related_pdb"]= A 
    pdb["ralated_pdb_details"] = B
    file_name = uniprot_name+"-"+pdb_name
    pdb_related = pd.DataFrame.from_dict(pdb)
    path = "./datasets/pdb_descriptions/"+file_name+".csv"
    pdb_related.to_csv(path, index=False, compression='gzip')
    return pdb_related
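The function above has to cope with a field that is sometimes a single dict and sometimes a list of dicts. A minimal sketch of that normalization, independent of pypdb, is shown below. The field names `id` and `details` and both helper names are hypothetical stand-ins, not the keys of the real response.

```python
def as_list(value):
    """Wrap a single item in a list; pass lists through; map None to []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def related_entries(description, key="relatedPDB"):
    """Return (id, details) pairs; empty details become None."""
    pairs = []
    for entry in as_list(description.get(key)):
        # 'id' and 'details' are assumed field names for this illustration
        pairs.append((entry.get("id"), entry.get("details") or None))
    return pairs

single = {"relatedPDB": {"id": "1AAA", "details": ""}}
several = {"relatedPDB": [{"id": "1AAA", "details": "same protein"},
                          {"id": "2BBB", "details": ""}]}

print(related_entries(single))
print(related_entries(several))
print(related_entries({}))
```

Normalizing to a list first means the loop body is written once instead of once per input shape.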


In [9]:
df_pdb = pd.DataFrame() # DataFrame that accumulates every PDB description
for model, accession in zip(df_cath["MODEL_MATCH"], df_cath["# UNIPROT_ACC"]):
    a = find_MOL_des(model, accession)
    if not df_pdb.empty:
        df_pdb = pd.concat([df_pdb, a], ignore_index=True, sort=False)
    else:
        df_pdb = a.copy()

df_pdb = df_pdb.reset_index(drop=True)
path = "./datasets/pdb_descriptions/pdb_description_total.csv"
df_pdb.to_csv(path, index=False, compression='gzip')



In [10]:
# Organizing the data into CATH superfamilies
dengue_superfamilies = df_cath["SUPERFAMILY"].unique()
#store df_cath by superfamilies
for sf in dengue_superfamilies:
    df_superfamilies = df_cath.loc[df_cath["SUPERFAMILY"] == sf]
    path = "./datasets/info_superfamily/"+sf+".csv"
    df_superfamilies.to_csv(path, index=False, compression='gzip')
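The per-superfamily export above can also be expressed with `groupby`, which yields each group once without a separate `unique()` pass. The sketch below writes to a temporary directory; the model names, superfamily identifiers and the helper name `write_groups` are illustrative assumptions.

```python
import os
import tempfile
import pandas as pd

def write_groups(df, column, out_dir):
    """Write one CSV per distinct value of `column`; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for key, group in df.groupby(column):
        path = os.path.join(out_dir, f"{key}.csv")
        group.to_csv(path, index=False)
        paths.append(path)
    return paths

# Hypothetical rows; the identifiers only mimic the shape of the real data
df = pd.DataFrame({
    "MODEL_MATCH": ["1abcA01", "2defA01", "2defA02"],
    "SUPERFAMILY": ["2.60.40.350", "1.10.8.10", "2.60.40.350"],
})

with tempfile.TemporaryDirectory() as out_dir:
    written = write_groups(df, "SUPERFAMILY", out_dir)
    names = sorted(os.path.basename(p) for p in written)
    rows_per_file = {os.path.basename(p): len(pd.read_csv(p)) for p in written}

print(names)
print(rows_per_file)
```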

In [25]:
# Load the iRefIndex dataset in chunks of 100,000 rows
df_temp_mitb = pd.read_csv('./datasets/info_superfamily/All.mitab.01-22-2018.txt', delimiter="\t", chunksize = 100000)


In [72]:
# Inspect the first row and the first value array of the first chunk
for chunk in df_temp_mitb:
    chunk=chunk.reset_index(drop=True)
    print(chunk.loc[0])    
    print("------------------------- Values -------------------------")
    for values in chunk.values:
        print(values)
        print(len(values))
        break
    break


#uidA                                                       uniprotkb:P16070
uidB                                                    uniprotkb:A0A024RDE2
altA                       entrezgene/locuslink:960|refseq:NP_000601|unip...
altB                       entrezgene/locuslink:6696|refseq:NP_001035147|...
aliasA                     hgnc:CD44|uniprotkb:CD44_HUMAN|crogid:QOb28rF7...
aliasB                     hgnc:SPP1|uniprotkb:A0A024RDE2_HUMAN|uniprotkb...
method                            psi-mi:"MI:0813"(proximity ligation assay)
author                                                                     -
pmids                                                        pubmed:20549562
taxa                                                taxid:9606(Homo sapiens)
taxb                                                taxid:9606(Homo sapiens)
interactionType                       psi-mi:"MI:0915"(physical association)
sourcedb                                                   MI:0917(matrixdb)

In [91]:
chunk=chunk.reset_index(drop=True)
temp = {} # parsed columns, keyed by their output names

def split_id_name(values):
    """Split 'prefix:id(name)' strings into an id Series and a name Series."""
    b1 = values.str.split(':', n=1).str[1] # text after the first ':'
    ids = b1.str.replace(r"\(.*\)", "", regex=True) # keep the text outside the parentheses
    names = b1.str.replace(r'[\(\)\d]+', '', regex=True) # drop parentheses and digits, leaving the name
    return ids, names

# columns of the form 'prefix:id(name)' -> (id key, name key)
id_name_columns = {
    "biological_role_A": ("biological_role_a_id", "biological_role_a_name"),
    "biological_role_B": ("biological_role_b_id", "biological_role_b_name"),
    "experimental_role_A": ("experimental_role_a_id", "experimental_role_a_name"),
    "experimental_role_B": ("experimental_role_b_id", "experimental_role_b_name"),
    "interactor_type_A": ("interactor_type_a_id", "interactor_type_a_name"),
    "interactor_type_B": ("interactor_type_b_id", "interactor_type_b_name"),
    "Host_organism_taxid": ("host_organism_taxid", "host_organism_tax_name"),
}
# columns in which '-' marks a missing value -> output key
dash_columns = {
    "xrefs_A": "xrefs_a", "xrefs_B": "xrefs_b", "xrefs_Interaction": "xrefs_interaction",
    "Annotations_A": "annotations_a", "Annotations_B": "annotations_b",
    "Annotations_Interaction": "annotations_interaction",
    "parameters_Interaction": "parameters_interaction",
}
# columns copied unchanged -> output key
copy_columns = {"Creation_date": "creation_date", "Update_date": "update_date", "Negative": "negative"}
copy_columns.update({name: name for name in [
    "irogida", "irogidb", "irigid", "crogida", "crogidb", "crigid",
    "icrogida", "icrogidb", "icrigid", "imex_id", "numParticipants"]})

for column, values in chunk.items():
    print(column)
    if column == "#uidA": # process uidA
        parts = values.str.split(':', n=1)
        temp['a_type_id'], temp['a_id'] = parts.str[0], parts.str[1]
    elif column == "uidB": # process uidB
        parts = values.str.split(':', n=1)
        temp['b_type_id'], temp['b_id'] = parts.str[0], parts.str[1]
    elif column == "confidence":
        # each cell holds '|'-separated 'name:score' pairs; only the first pair is inspected
        df_temp_confidence = values.str.split('|', expand=True)
        a1 = df_temp_confidence[0].str.split(':', n=1).str[0]
        print(a1)
    elif column == "expansion":
        temp["expansion"] = values.mask(values == "none") # 'none' becomes a missing value
    elif column in id_name_columns:
        id_key, name_key = id_name_columns[column]
        temp[id_key], temp[name_key] = split_id_name(values)
    elif column in dash_columns:
        temp[dash_columns[column]] = values.mask(values == "-") # '-' becomes a missing value
    elif column in copy_columns:
        temp[copy_columns[column]] = values
    #if column == "altA":
    #    a = str(values[0])
    #    b = a.split("|")
    #    b_series = pd.Series(b)
    #    c,d = b_series.str.split(':', 1).str
    #    print(d)
    #    break
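Many of the columns handled above share the shape `prefix:id(name)`. As a sketch of an alternative to chained `split` and `replace` calls, one regular expression with named groups can pull out all three parts at once and leaves digits inside the name untouched. The pattern and the helper name `parse_field` are assumptions for illustration; the sample values mimic the `taxa` field shown in the earlier output.

```python
import pandas as pd

# prefix up to the first ':', then an id without parentheses, then '(name)'
FIELD = r"^(?P<prefix>[^:]+):(?P<id>[^()]+)\((?P<name>.*)\)$"

def parse_field(values):
    """Return a DataFrame with prefix, id and name columns (NaN when no match)."""
    return values.str.extract(FIELD)

fields = pd.Series(["MI:0326(protein)", "taxid:9606(Homo sapiens)", "-"])
parsed = parse_field(fields)
print(parsed)
```

Rows that do not follow the pattern, such as the `-` placeholder, come back as missing values in every column, so no separate cleanup step is needed for them.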
    
    

#uidA
uidB
altA
altB
aliasA
aliasB
method
author
pmids
taxa
taxb
interactionType
sourcedb
interactionIdentifier
confidence
0        hpr
1        hpr
2        hpr
3        hpr
4        hpr
        ... 
99995    hpr
99996    hpr
99997    hpr
99998    hpr
99999    hpr
Name: 0, Length: 100000, dtype: object
0        lpr
1        lpr
2        lpr
3     

In [None]:
# Column names for the iRefIndex MITAB table, which is read without its header row
ii_columns = ['uidA','uidB','altA','altB','aliasA','aliasB','method','author'
              ,'pmids','taxa','taxb','interactionType','sourcedb','interactionIdentifier'
              ,'confidence','expansion','biological_role_A','biological_role_B'
              ,'experimental_role_A','experimental_role_B','interactor_type_A'
              ,'interactor_type_B','xrefs_A','xrefs_B','xrefs_Interaction',
              'Annotations_A','Annotations_B','Annotations_Interaction',
              'Host_organism_taxid','parameters_Interaction','Creation_date',
              'Update_date','Checksum_A','Checksum_B','Checksum_Interaction',
              'Negative','OriginalReferenceA','OriginalReferenceB','FinalReferenceA'
              ,'FinalReferenceB','MappingScoreA','MappingScoreB','irogida'
              ,'irogidb','irigid','crogida','crogidb','crigid',
              'icrogida','icrogidb','icrigid','imex_id','edgetype','numParticipants']
# MITAB files are tab-delimited; on_bad_lines replaces error_bad_lines in pandas >= 1.3
df_irefindex = pd.read_csv('http://irefindex.org/download/irefindex/data/archive/release_15.0/psi_mitab/MITAB2.6/All.mitab.22012018.txt.zip',skiprows=1,delimiter='\t',header=None ,compression='zip',on_bad_lines='skip')
df_irefindex.columns = ii_columns

In [None]:
from io import StringIO
sample=pd.read_csv(StringIO(''.join(l.replace('', ',') for l in open('stuff.csv'))))

In [None]:
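import io
# zipfile.ZipFile needs a local file or a file-like object, so download the archive into memory first
response = requests.get('http://irefindex.org/download/irefindex/data/current/psi_mitab/MITAB2.6/All.mitab.22012018.txt.zip')
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    with z.open('All.mitab.22012018.txt') as f:
        for line in f:
            print(line)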
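Reading a member of a zip archive line by line can be tried without any download. The sketch below builds a tiny archive in memory and iterates over its text lines; the member name `demo.txt`, its two-line content and the helper name `iter_zip_lines` are made up for the example.

```python
import io
import zipfile

def iter_zip_lines(zip_bytes, member, encoding="utf-8"):
    """Yield the decoded lines of one member of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as handle:
            for raw in io.TextIOWrapper(handle, encoding=encoding):
                yield raw.rstrip("\n")

# Build a small archive in memory (hypothetical content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo.txt", "#uidA\tuidB\nuniprotkb:P16070\tuniprotkb:A0A024RDE2\n")

lines = list(iter_zip_lines(buffer.getvalue(), "demo.txt"))
print(lines)
```

Wrapping the member in `io.TextIOWrapper` yields `str` lines rather than `bytes`, which avoids the `b'...'` prefixes that appear when the raw handle is printed.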

In [None]:
import urllib3

http = urllib3.PoolManager()

R = http.request('GET', "http://irefindex.org/download/irefindex/data/current/psi_mitab/MITAB2.6/All.mitab.22012018.txt.zip")

In [None]:
# R.read is a method; the downloaded bytes are in R.data
print(R.status, len(R.data))