
# Retrieving data from the Protein API using Python

Our aim is to collect all small-scale publications (large-scale = FALSE) that mention both a 3D structure and interactions for the Amyloid-beta precursor protein (gene name APP, accession number P05067) using the UniProt Protein API:

https://rest.uniprot.org/beta/docs/

(Note that from May 2022 the new URL is https://rest.uniprot.org/docs/)

To obtain the URL needed to collect the data, open the Protein API option under "Miscellaneous" and select

GET "/uniprotkb/{accession}/publications"

then enter the query below in the facetFilter field:

(categories:"Structure") AND (categories:"Interaction") AND (is_large_scale:"false")
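
The query can also be percent-encoded in Python instead of copying the finished URL from the documentation page. The following is a minimal sketch under stated assumptions: `build_publications_url` and `BASE_URL` are hypothetical names introduced here for illustration, and the base path is the beta one used in this notebook.

```python
from urllib.parse import quote

# Base path as used in this notebook (assumption: the beta endpoint).
BASE_URL = "https://rest.uniprot.org/beta/uniprotkb"


def build_publications_url(accession, facet_filter):
    """Hypothetical helper: build the publications URL for one accession."""
    # safe="" makes quote() encode every reserved character in the query.
    encoded_filter = quote(facet_filter, safe="")
    return f"{BASE_URL}/{accession}/publications?facetFilter={encoded_filter}"


facet_filter = (
    '(categories:"Structure") AND (categories:"Interaction") '
    'AND (is_large_scale:"false")'
)
url = build_publications_url("P05067", facet_filter)
print(url)
```

Keeping the readable query in one string makes it easier to change a category or the accession without re-encoding the URL by hand.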





For more information: 

https://www.ebi.ac.uk/training/events/programmatic-access-uniprot-using-python/

Speakers: Emily Bowler-Barnet and Aurélien Luciani


In [1]:
# import libraries to interact with the Protein API and process data
import requests, sys
import json
import pprint

# Paste the URL from the Protein API below between quotes "".
requestURL = "https://rest.uniprot.org/beta/uniprotkb/P05067/publications?facetFilter=%28categories%3A%22Structure%22%29%20AND%20%28categories%3A%22Interaction%22%29%20AND%20%28is_large_scale%3A%22false%22%29"

# request the data in JSON format
r = requests.get(requestURL, headers={"Accept": "application/json"})

# verify the connection with the Protein API:
# raise_for_status() raises an HTTPError (and stops the cell) on a 4xx/5xx response
r.raise_for_status()

# save data
data = r.json()

# print the result
pp = pprint.PrettyPrinter(indent=4, width=80, compact=False)
print("Pretty Printing JSON Data using pprint module")
pp.pprint(data)

Pretty Printing JSON Data using pprint module
{   'results': [   {   'citation': {   'authors': [   'Scheidig A.J.',
                                                      'Hynes T.R.',
                                                      'Pelletier L.A.',
                                                      'Wells J.A.',
                                                      'Kossiakoff A.A.'],
                                       'citationCrossReferences': [   {   'database': 'PubMed',
                                                                          'id': '9300481'},
                                                                      {   'database': 'DOI',
                                                                          'id': '10.1002/pro.5560060902'}],
                                       'citationType': 'UniProt indexed '
                                                       'literatures',
                                       'completeAuthorList': True,
 

# Reading and extracting information

In [2]:
# extract the PubMed ID of the first reference (position 0 in the 'results' list)

PubMed_Id_ref_1 = data['results'][0]['citation']['id']
print(PubMed_Id_ref_1)

9300481
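
The pretty-printed output above also lists each citation's identifiers under `citationCrossReferences`, one entry per database. As a hedged sketch, the lookup below picks an identifier by database name. `cross_reference_id` is a hypothetical helper, and `sample_citation` is a small hand-written dictionary that only mirrors the fields visible in the output, not a full API response.

```python
# Hand-written sample mirroring the fields shown in the printed output.
sample_citation = {
    "authors": ["Scheidig A.J.", "Hynes T.R."],
    "citationCrossReferences": [
        {"database": "PubMed", "id": "9300481"},
        {"database": "DOI", "id": "10.1002/pro.5560060902"},
    ],
}


def cross_reference_id(citation, database):
    """Hypothetical helper: return the id for one database, or None."""
    for reference in citation.get("citationCrossReferences", []):
        if reference.get("database") == database:
            return reference.get("id")
    return None


print(cross_reference_id(sample_citation, "PubMed"))
print(cross_reference_id(sample_citation, "DOI"))
```

With the real response, the same function could be called as `cross_reference_id(data['results'][0]['citation'], 'DOI')`.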


In [4]:
# create one list for each type of information you want to collect; examples are below:

PubMed_Id = []  # PubMed ID (PMID)
Title = []  # title of the publication
publication_date = []  # year of publication
number_of_computationally_mapped_entries = []  # number of other UniProt entries mentioned in the paper


# now parse the json file and collect information for ALL references in the data:
for each_publication in data['results']:
    
    print(each_publication['citation']['id'])
    PubMed_Id.append(each_publication['citation']['id'])
    print(each_publication['citation']['title'])
    Title.append(each_publication['citation']['title'])
    print(each_publication['citation']['publicationDate'])    
    publication_date.append(each_publication['citation']['publicationDate'])
    print(each_publication['statistics']['computationallyMappedProteinCount'])
    number_of_computationally_mapped_entries.append(each_publication['statistics']['computationallyMappedProteinCount'])

# as always print to be sure you're collecting the correct info!
# once all info is collected you can put it into a table using pandas

9300481
Crystal structures of bovine chymotrypsin and trypsin complexed to the inhibitor domain of Alzheimer's amyloid beta-protein precursor (APPI) and basic pancreatic trypsin inhibitor (BPTI): engineering of inhibitors with altered specificities.
1997
4
17051221
Structures of human insulin-degrading enzyme reveal a new substrate recognition mechanism.
2006
7
17239395
Structural studies of the Alzheimer's amyloid precursor protein copper- binding domain reveal how it binds copper ions.
2007
0
17895381
Molecular basis for passive immunotherapy of Alzheimer's disease.
2007
3
19923222
Structural correlates of antibodies associated with acute reversal of amyloid beta-related behavioral deficits in a mouse model of Alzheimer disease.
2010
0
20212142
Structure and biochemical analysis of the heparin-induced E1 dimer of the amyloid precursor protein.
2010
14
25122912
Amyloid precursor protein dimerization and synaptogenic function depend on copper binding to the growth factor-like domain.
2
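
The loop above reads every field with square brackets, so a publication that lacks one of the keys would stop it with a `KeyError`. The sketch below is one possible alternative, not the notebook's method: it collects one dictionary per publication and uses `dict.get`, which returns `None` for a missing key. `extract_record` is a hypothetical helper, and the second sample entry is invented purely to show the missing-field case.

```python
def extract_record(publication):
    """Hypothetical helper: pull the four fields of interest from one result."""
    citation = publication.get("citation", {})
    statistics = publication.get("statistics", {})
    return {
        "Publication": citation.get("id"),
        "Title": citation.get("title"),
        "Date of Publication": citation.get("publicationDate"),
        "Number of Entries mentioned in paper": statistics.get(
            "computationallyMappedProteinCount"
        ),
    }


sample_results = [
    {   # values copied from the printed output above
        "citation": {
            "id": "17895381",
            "title": "Molecular basis for passive immunotherapy of Alzheimer's disease.",
            "publicationDate": "2007",
        },
        "statistics": {"computationallyMappedProteinCount": 3},
    },
    {   # invented entry with missing fields, for illustration only
        "citation": {"id": "example-id"},
    },
]

records = [extract_record(publication) for publication in sample_results]
for record in records:
    print(record)
```

A list of such dictionaries can be passed directly to `pd.DataFrame(records)`, which keeps each row's values together instead of spreading them over four parallel lists.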

# Sort and filter information in a table

In [5]:
# for this task we are using the pandas library (see documentation https://pandas.pydata.org/pandas-docs/stable/index.html)
import pandas as pd

# Create a table by naming each column (e.g. "Publication") and assigning it a list of values

Publications_accession_table = pd.DataFrame(
    {'Publication': PubMed_Id,
     'Title': Title,
     'Date of Publication': publication_date,
     'Number of Entries mentioned in paper': number_of_computationally_mapped_entries
    })


In [None]:
#Display table
Publications_accession_table

| | Publication | Title | Date of Publication | Number of Entries mentioned in paper |
|---|---|---|---|---|
| 0 | 9300481 | Crystal structures of bovine chymotrypsin and ... | 1997 | 4 |
| 1 | 17051221 | Structures of human insulin-degrading enzyme r... | 2006 | 7 |
| 2 | 17239395 | Structural studies of the Alzheimer's amyloid ... | 2007 | 0 |
| 3 | 17895381 | Molecular basis for passive immunotherapy of A... | 2007 | 3 |
| 4 | 19923222 | Structural correlates of antibodies associated... | 2010 | 0 |
| 5 | 20212142 | Structure and biochemical analysis of the hepa... | 2010 | 14 |
| 6 | 25122912 | Amyloid precursor protein dimerization and syn... | 2014 | 0 |
| 7 | 30630874 | Recognition of the amyloid precursor protein b... | 2019 | 0 |
| 8 | 31373844 | Attenuation of amyloid-beta generation by atyp... | 2019 | 18 |
| 9 | 29894164 | The Human Amyloid Precursor Protein Binds Copp... | 2018 | 15 |
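
Once the table exists, it can be sorted and filtered with standard pandas operations. The sketch below is illustrative only: it rebuilds a few rows from the table above (titles omitted) and assumes the year may arrive as text, so it converts that column to numbers before comparing. The names `table`, `sorted_table` and `recent` are introduced here for the example.

```python
import pandas as pd

# A few rows copied from the table above (titles omitted for brevity).
table = pd.DataFrame(
    {
        "Publication": ["9300481", "17051221", "20212142", "31373844", "29894164"],
        "Date of Publication": ["1997", "2006", "2010", "2019", "2018"],
        "Number of Entries mentioned in paper": [4, 7, 14, 18, 15],
    }
)

# Assumption: the year is stored as text; convert it so comparisons are numeric.
table["Date of Publication"] = pd.to_numeric(
    table["Date of Publication"], errors="coerce"
)

# Sort: papers mentioning the most UniProt entries come first.
sorted_table = table.sort_values(
    "Number of Entries mentioned in paper", ascending=False
)

# Filter: keep only papers published in 2010 or later.
recent = sorted_table[sorted_table["Date of Publication"] >= 2010]

print(recent.to_string(index=False))
```

The same two calls can be applied to `Publications_accession_table` directly; `sort_values` returns a new table, so the original row order is preserved unless the result is assigned back.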
