<h1>Project Imitamenta</h1>
<h2>Protein imitation in Dengue</h2>

<h3>Abstract</h3>

<h3>Introduction</h3>

 

<ol>
<li><h4>Dengue: A Clinical Approach</h4></li>
    <p>Dengue fever is a potentially fatal disease caused by the dengue virus. It is one of the most widespread mosquito-borne diseases (WHO and HealthMap.org, 1997). According to the World Health Organization (WHO, 2012), about 40% of the world's population is at risk of contracting dengue, especially in tropical and subtropical regions (WHO and HealthMap.org, 1997) (Fig 1.1). An estimated 100 million dengue infections may occur annually, of which about 500,000 require hospitalization (Halstead et al., 2007; WHO, 2012).</p>
</ol>

In [17]:
# Import the packages used in the project
import pandas as pd  # pandas is useful for fast DataFrame analysis
import numpy
import glob
import requests
import os
from bs4 import BeautifulSoup
import pypdb
import zipfile

In [2]:
# First, retrieve the search results from UniProt with the following parameters:
# Taxonomy: 12637 (dengue virus)
# Format: tabular
# Limit: 10000
# Columns: ID, protein names and review status
url="https://www.uniprot.org/uniprot/?query=taxonomy:12637&format=tab&limit=10000&columns=id,protein%20names,reviewed"

# Read the tab-separated response into a DataFrame
c = pd.read_csv(url, sep="\t")
c.head()

# Split the data into unreviewed and reviewed entries,
# each stored as its own pandas DataFrame

unreviewed_uniprot_dengue = c.loc[c['Status'] == "unreviewed"]
reviewed_uniprot_dengue = c.loc[c['Status'] == "reviewed"]

df_unreviewed_uniprot_dengue = pd.DataFrame(unreviewed_uniprot_dengue)
df_reviewed_uniprot_dengue = pd.DataFrame(reviewed_uniprot_dengue)
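
The reviewed/unreviewed split can be checked offline on a small inline table. The sketch below assumes the tab-separated layout and the `Entry`, `Protein names` and `Status` headers used above; the three rows and their accession codes are made-up placeholders, not real UniProt records.

```python
import io
import pandas as pd

# Hypothetical stand-in for the UniProt response: tab-separated, with a header row.
sample_tsv = (
    "Entry\tProtein names\tStatus\n"
    "X00001\tGenome polyprotein\treviewed\n"
    "X00002\tEnvelope protein (Fragment)\tunreviewed\n"
    "X00003\tNon-structural protein 1\tunreviewed\n"
)

table = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Boolean masks keep the rows whose Status matches; .copy() detaches the result
reviewed = table.loc[table["Status"] == "reviewed"].copy()
unreviewed = table.loc[table["Status"] == "unreviewed"].copy()

print(len(reviewed), len(unreviewed))
```

Because `.loc` with a boolean mask already returns a DataFrame, no further conversion is needed before iterating over `reviewed["Entry"]`.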

In [3]:
# Load the CATH UniProt annotation table from the sequence-data release folder
# ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/latest-release/sequence-data/cath-uniprot-annotations.tsv.gz
df_cath = pd.read_csv('ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/latest-release/sequence-data/cath-uniprot-annotations.tsv.gz', delimiter="\t", compression='gzip')

In [4]:
# Gather the PDB description from RCSB
def find_MOL_des(pdb_name):
    pdb = pypdb.describe_pdb(pdb_name)
    # Extract the title of the PDB entry
    name = pdb["title"]
    # Extract the keywords associated with the PDB entry
    keywords = pdb["keywords"]
    # A collects the first field of each related entry, B the second
    A = []
    B = []
    if "relatedPDB" not in pdb:
        A.append("None")
        B.append("None")
    else:
        temp = pdb["relatedPDB"]
        # A single related entry arrives as a dict; wrap it in a list
        # so that both cases share one loop
        if not isinstance(temp, list):
            temp = [temp]
        for entry in temp:
            l_key = list(entry)
            A.append(entry[l_key[0]])
            if entry[l_key[1]]:
                # a non-empty value
                B.append(entry[l_key[1]])
            else:
                # an empty value
                B.append("None")

    authors = pdb["structure_authors"].split(".,")
    return name, keywords, A, B, authors
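
The dict-or-list handling in `find_MOL_des` can be tried without contacting RCSB. The helper below is a hypothetical sketch, not part of pypdb, and the key names `pdbId` and `details` in the sample input are assumptions made for illustration only.

```python
def split_related(related):
    """Return (ids, details) from a dict, a list of dicts, or None."""
    if related is None:
        return ["None"], ["None"]
    if not isinstance(related, list):
        related = [related]  # a single entry arrives as a bare dict
    ids, details = [], []
    for entry in related:
        keys = list(entry)  # relies on key order: first key = id, second = details
        ids.append(entry[keys[0]])
        details.append(entry[keys[1]] if entry[keys[1]] else "None")
    return ids, details


# Made-up inputs covering the three cases
single = {"pdbId": "1ABC", "details": ""}
many = [{"pdbId": "1ABC", "details": "apo form"}, {"pdbId": "2XYZ", "details": None}]

print(split_related(single))
print(split_related(many))
print(split_related(None))
```

Indexing by key position mirrors the original function; if the key names were known for certain, looking them up by name would be more robust.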
    


In [5]:
n_columns = ["UNIPROT","MODEL_MATCH","BOUNDARIES","SUPERFAMILY","PDB","PDB_NAME",
           "PDB_KEYWORDS","PDB_RELATED_ID","PDB_RELATED_DETAILS","PDB_AUTHORS"]
df_code_total = pd.DataFrame(columns=n_columns)

for uniprot_code in df_reviewed_uniprot_dengue["Entry"]:
    # Select the CATH rows annotated with this UniProt accession
    temp = df_cath.loc[df_cath["# UNIPROT_ACC"] == uniprot_code]
    df_code = temp[['# UNIPROT_ACC','MODEL_MATCH', 'BOUNDARIES','SUPERFAMILY']].copy().reset_index(drop=True)
    # The first four characters of MODEL_MATCH give the PDB code
    df_code["PDB"] = df_code["MODEL_MATCH"].str[:4]
    # Convert df_code into a dictionary of column lists
    df_code = df_code.to_dict("list")
    df_code["UNIPROT"] = df_code.pop("# UNIPROT_ACC")
    # New keys in the dictionary
    df_code["PDB_NAME"] = []
    df_code["PDB_KEYWORDS"] = []
    df_code["PDB_RELATED_ID"] = []
    df_code["PDB_RELATED_DETAILS"] = []
    df_code["PDB_AUTHORS"] = []
    # Acquire the PDB summary data for the new keys
    for i in df_code["MODEL_MATCH"]:
        p_name, p_keywords, a, b, p_authors = find_MOL_des(i)
        df_code["PDB_NAME"].append(p_name)
        df_code["PDB_KEYWORDS"].append(p_keywords)
        df_code["PDB_RELATED_ID"].append(a)
        df_code["PDB_RELATED_DETAILS"].append(b)
        df_code["PDB_AUTHORS"].append(p_authors)
    # Save the per-protein table (gzip-compressed despite the .csv extension)
    df_code = pd.DataFrame.from_dict(df_code)
    path = "./datasets/info_uniprot/"+uniprot_code+".csv"
    df_code.to_csv(path, index=False, compression='gzip')
    # sort=False keeps the column order and silences the pandas sorting FutureWarning
    df_code_total = pd.concat([df_code_total,df_code], sort=False)






In [6]:
path = "./datasets/info_uniprot/uniprot_PDB_superfamily.csv"
df_code_total.to_csv(path, index=False, compression='gzip')
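
Calling `pd.concat` inside a loop copies the accumulated frame on every iteration. A common alternative, sketched here on toy frames with placeholder values, is to collect the pieces in a list and concatenate once at the end. Passing `sort=False` explicitly avoids alphabetical reordering of the columns on older pandas versions.

```python
import pandas as pd

pieces = []
for code in ["A", "B", "C"]:  # placeholder accession codes
    piece = pd.DataFrame({"UNIPROT": [code, code], "PDB": [code + "1", code + "2"]})
    pieces.append(piece)

# One concatenation at the end instead of one per iteration
total = pd.concat(pieces, ignore_index=True, sort=False)
print(total.shape)
```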

In [7]:
# Extract every known CATH entry that belongs to one of the superfamilies
# stored in the table built above.

dengue_superfamilies = df_code_total["SUPERFAMILY"].drop_duplicates()
df_sf_total = pd.DataFrame()
for sf in dengue_superfamilies:
    df_superfamilies = df_cath.loc[df_cath["SUPERFAMILY"] == sf]
    path = "./datasets/info_superfamily/"+sf+".csv"
    df_superfamilies.to_csv(path, index=False, compression='gzip')
    df_sf_total = pd.concat([df_sf_total,df_superfamilies])


In [8]:
path = "./datasets/info_superfamily/superfamily_total.csv"
df_sf_total.to_csv(path, index=False,compression='gzip')
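
If only the combined table is needed, the per-superfamily loop gives the same rows as a single `isin` filter, up to row order. The sketch below runs on a toy stand-in for `df_cath`; the accessions are placeholders and the superfamily IDs are only examples of the CATH numbering format.

```python
import pandas as pd

toy_cath = pd.DataFrame({
    "# UNIPROT_ACC": ["U1", "U2", "U3", "U4"],
    "SUPERFAMILY": ["1.10.8.10", "2.60.40.350", "1.10.8.10", "3.30.70.100"],
})
wanted = pd.Series(["1.10.8.10", "3.30.70.100", "1.10.8.10"]).drop_duplicates()

# Keep every row whose superfamily is in the wanted set
selected = toy_cath.loc[toy_cath["SUPERFAMILY"].isin(wanted)]
print(selected["# UNIPROT_ACC"].tolist())
```

The loop remains the right choice when one file per superfamily is wanted, as in the cell above.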

In [16]:
# Read the tab-delimited MITAB file and assign the column names
ii_columns = ['uidA','uidB','altA','altB','aliasA','aliasB','method','author'
              ,'pmids','taxa','taxb','interactionType','sourcedb','interactionIdentifier'
              ,'confidence','expansion','biological_role_A','biological_role_B'
              ,'experimental_role_A','experimental_role_B','interactor_type_A'
              ,'interactor_type_B','xrefs_A','xrefs_B','xrefs_Interaction',
              'Annotations_A','Annotations_B','Annotations_Interaction',
              'Host_organism_taxid','parameters_Interaction','Creation_date',
              'Update_date','Checksum_A','Checksum_B','Checksum_Interaction',
              'Negative','OriginalReferenceA','OriginalReferenceB','FinalReferenceA'
              ,'FinalReferenceB','MappingScoreA','MappingScoreB','irogida'
              ,'irogidb','irigid','crogida','crogidb','crigid',
              'icrogida','icrogidb','icrigid','imex_id','edgetype','numParticipants']
# MITAB is tab-delimited. With '\\' as the delimiter every line is parsed as a
# single field, and assigning the 54 names raises
# "ValueError: Length mismatch: Expected axis has 1 elements, new values have 54 elements".
# Note: error_bad_lines was replaced by on_bad_lines='skip' in pandas 1.3.
df_irefindex = pd.read_csv('http://irefindex.org/download/irefindex/data/archive/release_15.0/psi_mitab/MITAB2.6/All.mitab.22012018.txt.zip',skiprows=1,delimiter='\t',header=None ,compression='zip',error_bad_lines=False)
df_irefindex.columns = ii_columns
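
A delimiter that never occurs in the file makes pandas read each line as a single field, so assigning a longer list of column names fails. The sketch below reproduces the effect on a two-column toy text (made-up content, not iRefIndex data) and compares it with a tab-delimited read.

```python
import io
import pandas as pd

toy = "#uidA\tuidB\nP1\tP2\nP3\tP4\n"  # toy tab-separated text with a header line

# Delimiter absent from the data: each row becomes one field
wrong = pd.read_csv(io.StringIO(toy), skiprows=1, delimiter="\\", header=None)
# Tab delimiter: one column per field
right = pd.read_csv(io.StringIO(toy), skiprows=1, delimiter="\t", header=None)

print(wrong.shape, right.shape)

# The names can only be assigned when the column count matches
right.columns = ["uidA", "uidB"]
```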

In [14]:
# 'stuff.csv' holds tab-separated MITAB records whose header starts with "#uidA".
# Replacing characters with commas splits fields such as "taxid:9606(Homo sapiens)"
# in the wrong places, so the file is read with the tab delimiter instead.
sample = pd.read_csv('stuff.csv', sep='\t')


In [20]:
# zipfile.ZipFile needs a local path or a file-like object; given a URL it raises
# FileNotFoundError, so the archive is downloaded into memory first.
import io
response = requests.get('http://irefindex.org/download/irefindex/data/archive/release_15.0/psi_mitab/MITAB2.6/All.mitab.22012018.txt.zip')
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    with z.open('All.mitab.22012018.txt') as f:
        for _ in range(5):  # preview the first five lines only; the file is very large
            print(f.readline())
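
`zipfile.ZipFile` accepts any seekable file-like object, so downloaded bytes can be wrapped in `io.BytesIO`. The sketch below builds a tiny archive in memory in place of the real download (the member name and contents are placeholders) and reads it back line by line.

```python
import io
import zipfile

# Build a small archive in memory to stand in for the downloaded bytes
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("toy.mitab.txt", "#uidA\tuidB\nP1\tP2\n")
payload = buffer.getvalue()

# Open the bytes as an archive and decode each line of the member file
with zipfile.ZipFile(io.BytesIO(payload)) as z:
    with z.open("toy.mitab.txt") as f:
        lines = [raw.decode("utf-8").rstrip("\n") for raw in f]

print(lines)
```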

In [32]:
import urllib3

http = urllib3.PoolManager()

R = http.request('GET', "http://irefindex.org/download/irefindex/data/archive/release_15.0/psi_mitab/MITAB2.6/All.mitab.22012018.txt.zip")

In [None]:
print(R.data[:100])  # R.read without parentheses prints the bound method; the downloaded bytes are in R.data