<a href="https://colab.research.google.com/github/glevans/7ADD-workshop-2024/blob/main/2_GET_vs_POST.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Leveraging the power of PDBe's APIs**
<img src="https://github.com/glevans/7ADD-workshop-2024/blob/main/Images/API_image.png?raw=true" height="120" align="right">

Welcome to this notebook!

To use this notebook in Colab (link at top of the page), you will need to:

*   have a Google account
*   be logged in to Google Colab (by being logged in to your Google account)

<br>

You can also download this notebook and view it *via* a local installation of [Jupyter](https://jupyter.org/) (*i.e.* the latest JupyterLab or the original Jupyter Notebook) or a browser instance of [JupyterLab](https://jupyter.org/try-jupyter/lab/).

<br>

---

This interactive Python notebook is part of a series that will guide you through various ways of programmatically accessing Protein Data Bank in Europe (PDBe) data using APIs.

<img src="https://www.ebi.ac.uk/pdbe/docs_dev/logos/images/RGB/PDBe-logo-RGB_2013.png" height="300" align="right">


The REST API is a programmatic way to obtain information from the PDB and EMDB.

You can access details about:

* sample
* experiment
* models
* compounds
* cross-references
* publications
* quality
* assemblies
* and more...

For more information, visit http://www.ebi.ac.uk/pdbe/pdbe-rest-api

<br>

---

  ## How to use this notebook <a name="Quick Start"></a>
1. To run a code cell, click on the cell to select it. You will notice a play button (▶️) on the left side of the cell. Click on the play button or press Shift+Enter to run the code in the selected cell.
2. The code will start executing, and you will see the output, if any, displayed below the code cell.
3. Move to the next code cell and repeat steps 1 and 2 until you have executed all the desired code cells in sequence.
4. The currently running step is indicated by a circle with a stop sign next to it. If you need to stop or interrupt the execution of a code cell, you can click on the stop button (■) located next to the play button.

*Remember to run the code cells in the correct order, as their execution might depend on variables or functions defined in previous cells. You can modify the code in a code cell and re-run it to see updated results.*

<br>

---

## Contact us

If you experience any bugs please contact pdbehelp@ebi.ac.uk and put "Help with" and the title of the notebook in the subject line of the message.



# Notebook #2

This notebook is the second in the training material series. It aims to lay down the foundation for understanding how users can interact with the PDBe REST API using Python3.

## 1) Making imports and setting variables

First, we import some packages that we will use, and set some variables.

We will be using Python packages / modules:

*   [re](https://docs.python.org/3/library/re.html) - allows use of regular expression matching operations similar to those found in Perl.
*   [requests](https://requests.readthedocs.io/) - allows you to send HTTP/1.1 requests extremely easily.
*   [pprint](https://docs.python.org/3/library/pprint.html) - makes data look more readable / pretty
*   [csv](https://docs.python.org/3/library/csv.html) - enables csv file input and output


<br>



---



*FURTHER INFORMATION:*

Full list of valid PDBe API URLs / API endpoints is available from http://www.ebi.ac.uk/pdbe/api/doc/


In [1]:
# Importing Python packages / modules
import re
import requests
import pprint
import csv

# Defining variables to describe API urls
base_url = "https://www.ebi.ac.uk/pdbe/"

api_base = base_url + "api/"

summary_url = api_base + "pdb/entry/summary/"

# We have defined a variable called summary_url with the following value:
#### https://www.ebi.ac.uk/pdbe/api/pdb/entry/summary/

experiment_url = api_base + "pdb/entry/experiment/"

# We have defined a variable called experiment_url with the following value:
#### https://www.ebi.ac.uk/pdbe/api/pdb/entry/experiment/

ligands_url = api_base + "pdb/entry/ligand_monomers/"

# We have defined a variable called ligands_url with the following value:
#### https://www.ebi.ac.uk/pdbe/api/pdb/entry/ligand_monomers/

compound_summaries_url = api_base + "pdb/compound/summary/"

# We have defined a variable called compound_summaries_url with the following value:
#### https://www.ebi.ac.uk/pdbe/api/pdb/compound/summary/

## 2) Useful functions

We will start with some simple Python code to enhance the usefulness of PDBe's APIs.

### 2.1) The get_entry_info() function

This function pulls out the information subset from an information block for a single PDB id.

In [2]:
# The below function called 'get_entry_info' does two things:
# --> Checks if PDB id is listed in the information.
# --> If PDB id is listed, returns the information for this id
#     in the form of a dictionary.
# If the PDB id is not listed, it prints a message and returns None.

def get_entry_info(pdb_id, input_information):
    try:
        output_for_entry_as_list = input_information[pdb_id]
    except (KeyError, TypeError):
        # KeyError: the PDB id is not a key of the information block
        # TypeError: the information block is None (e.g. a failed API call)
        print("PDB id is NOT present")
        return None
    output_for_entry_as_dict = {}
    for item in output_for_entry_as_list:
        output_for_entry_as_dict.update(item)
    return output_for_entry_as_dict
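
To make the input shape concrete, here is a small offline sketch. It assumes, as the function above does, that the information block is a dictionary keyed by PDB id whose value is a list of dictionaries. The contents of `mock_response` and the helper name `merge_entry_blocks` are made up for illustration and are not part of the PDBe API.

```python
# Hand-written stand-in for an API response (values are illustrative only)
mock_response = {
    "1abc": [
        {"title": "Example title"},
        {"release_date": "20200101"},
    ]
}

def merge_entry_blocks(pdb_id, response):
    """Merge the list of dictionaries stored under one PDB id into one dict."""
    merged = {}
    # .get() with a default avoids a KeyError when the id is absent
    for block in response.get(pdb_id, []):
        merged.update(block)
    return merged

print(merge_entry_blocks("1abc", mock_response))
print(merge_entry_blocks("9zzz", mock_response))
```

In this sketch an unknown id gives an empty dictionary instead of a printed message; which behaviour is preferable depends on how the result is used afterwards.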

### 2.2) Functions for doing PDBe API calls.

The following functions are for doing PDBe API calls.

The **make_GET_request** function is useful when querying for information on a single PDB id.

The **make_POST_request** function needs to be used for a list of PDB ids.

With both of these functions we have added checks that the PDB id or PDB id list input is appropriately formatted. These functions also report success, failure or errors.
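
As a rough illustration of such a format check, the sketch below tests a few strings against the pattern used in this section (a digit, a letter, then two letters or digits). The helper name `looks_like_pdb_id` is hypothetical. It uses `re.fullmatch`, which only succeeds when the whole string matches, so no separate length check is needed.

```python
import re

# Pattern used in this notebook: digit, letter, then two alphanumerics
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z][A-Za-z0-9]{2}")

def looks_like_pdb_id(text):
    """Return True when text is a string that matches the whole pattern."""
    return isinstance(text, str) and PDB_ID_PATTERN.fullmatch(text) is not None

for candidate in ["4ph9", "abcd", "4rs00", "8ET0"]:
    print(candidate, looks_like_pdb_id(candidate))
```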

<br>

It is useful to be aware when using APIs that there are some requests that should use [GET](https://en.wikipedia.org/wiki/HTTP#Request_methods) calls and others that should use [POST](https://en.wikipedia.org/wiki/POST_(HTTP)) calls.

In general, one should first consider using a **GET** call.

If a **GET** call will not get the appropriate information, one should consider **POST** call options.
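
The difference between the two kinds of call can be seen without contacting the server. The sketch below only builds request objects with the standard library (`urllib.request.Request`) and never sends them; the helper names `build_get` and `build_post` are made up for this illustration. In a GET call the PDB id is part of the URL, while in a POST call the ids travel in the request body.

```python
from urllib.request import Request

summary_endpoint = "https://www.ebi.ac.uk/pdbe/api/pdb/entry/summary/"

def build_get(pdb_id):
    # GET: the single PDB id is appended to the URL, there is no body
    return Request(summary_endpoint + pdb_id, method="GET")

def build_post(pdb_ids):
    # POST: the URL stays the same, the ids go into the body
    # as a comma-separated string
    body = ",".join(pdb_ids).encode("utf-8")
    return Request(summary_endpoint, data=body, method="POST")

get_req = build_get("4ph9")
post_req = build_post(["4ph9", "4rs0", "8et0"])

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```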

In [3]:
# This function will make a GET call to the PDBe API.
# This function will use a PDB id and API url provided as arguments.

def make_GET_request(pdb_id, api_url):
    # Checking the PDB id is formatted correctly:
    # the whole string must match the four-character pattern below
    if not isinstance(pdb_id, str) or not re.fullmatch("[0-9][A-Za-z][A-Za-z0-9]{2}", pdb_id):
        print("Invalid PDB ID")
        return None

    # Reporting input contents after checking formatting
    print("Input - PDB id: " + pdb_id)

    # Making a GET call to the API URL
    get_response = requests.get(url=api_url+pdb_id)

    # If there is data returned (with HTTP status code 200)
    # then return the data in JSON format
    if get_response.status_code == 200:
        print("Data retrieved")
        return get_response.json()
    # If there is no data, print status code and response
    print("No data retrieved - ", get_response.status_code, get_response.text)
    return None

# This function checks and cleans the list of PDB ids.
# This function makes sure the PDB id list is in the correct format

def clean_pdb_id_list(pdb_id_list):
    # If input is PDB id list in string format converts to a Python list
    # (splitting on commas and removing surrounding spaces)
    if isinstance(pdb_id_list, str):
        pdb_id_list_input = [item.strip() for item in pdb_id_list.split(",")]
    else:
        pdb_id_list_input = pdb_id_list

    # Making sure the PDB id list is formatted as Python list
    if not isinstance(pdb_id_list_input, list):
        print("Invalid List")
        return None

    # Checking the PDB ids are correctly structured
    # The below removes any items from list that do not match PDB id format
    cleaned_ids = []
    for pdb_id in pdb_id_list_input:
        if isinstance(pdb_id, str) and re.fullmatch("[0-9][A-Za-z][A-Za-z0-9]{2}", pdb_id):
            cleaned_ids.append(pdb_id)

    if not cleaned_ids:
        print("Invalid PDB ID")
        return None

    # Converting the Python list into appropriate format for POST call
    pdb_list_string = ", ".join(cleaned_ids)
    return pdb_list_string

# This function will make a POST call to the PDBe API.
# This function will use a PDB id list and API url provided as arguments.

def make_POST_request(pdb_id_list, api_url):
    # Checking the PDB ids and list are formatted correctly
    pdb_list_string = clean_pdb_id_list(pdb_id_list)
    # If nothing valid is left after cleaning, do not call the API
    if pdb_list_string is None:
        return None
    # Reporting the input after removing any incorrectly formatted ids
    print("Input - PDB id: " + pdb_list_string)

    # Making a POST call to the API URL
    post_response = requests.post(url=api_url, data=pdb_list_string)

    # If there is data returned (with HTTP status code 200)
    # then return the data in JSON format
    if post_response.status_code == 200:
        print("Data retrieved")
        return post_response.json()
    # If there is no data, print status code and response
    print("No data retrieved - ", post_response.status_code, post_response.text)
    return None

*Examples to test above code*

---

1.   *For GET call:*

     PDB id for mouse cyclooxygenase-2 (COX-2) bound to ibuprofen: 4ph9

2.   *For POST call:*

      PDB ids for mouse cyclooxygenase-2 (COX-2) bound to ibuprofen: 4ph9, 4rs0, 8et0

In [None]:
my_pdb_id1 = "4ph9"
my_pdb_id2 = "abcd"
my_pdb_id_list1 = ["4ph9", "4rs0", "8et0"]
my_pdb_id_list2 = "4ph9,4rs00,8et00"
my_pdb_id_list3 = "4ph9, 4rs0, 8et0"

get_result1 = make_GET_request(my_pdb_id1, summary_url)
#pprint.pprint(get_result1)
print()

get_result2 = make_GET_request(my_pdb_id2, summary_url)
#pprint.pprint(get_result2)
print()

post_result1 = make_POST_request(my_pdb_id_list1, summary_url)
#pprint.pprint(post_result1)
print()

post_result2 = make_POST_request(my_pdb_id_list2, summary_url)
#pprint.pprint(post_result2)

print()

post_result3 = make_POST_request(my_pdb_id_list3, summary_url)
#pprint.pprint(post_result3)

### 2.3) The get_value() function

This function is to help in navigating Python dictionaries.

PDBe API calls generally return data in the form that readily can be converted to Python dictionaries, i.e. collections of key and value pairs.

<br>

More advanced options and analysis can be performed by using tools available from other Python packages, such as [pandas](https://pandas.pydata.org/docs/index.html), [NumPy](https://numpy.org/), [Matplotlib](https://matplotlib.org/), and [seaborn](https://seaborn.pydata.org/). We will not cover these packages in this notebook. They enable viewing the type of data you get from APIs in tables and graphs and are powerful for exploring and visualizing the data.

In [5]:
# Getting value using a simple function we have named 'get_value'.
# This function gets the value for a key from a dictionary.
# The function has two inputs/arguments:
#  --> the key
#  --> the dictionary/input_information
def get_value(key, input_information):
    try:
        return input_information[key]
    except KeyError as error:
        error_message1 = "no value"
        return error_message1
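
For comparison, Python dictionaries have a built-in `get` method that takes a default value and behaves much like the function above. The dictionary below is a made-up example, not real API output.

```python
# Made-up dictionary in the style of an entry summary
entry = {"title": "Example title", "experimental_method": ["X-ray diffraction"]}

# The key exists, so its value is returned
print(entry.get("title", "no value"))

# The key is missing, so the default is returned instead of raising KeyError
print(entry.get("resolution", "no value"))
```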

### 2.4) The make_entry_summary() function

This function can be used to write a brief summary of a PDB entry.

In [6]:
# The below function returns the information for this id in a summary format.

def make_entry_summary(pdb_id,input_information):
    entry_information = get_entry_info(pdb_id,input_information)

    # Getting the title of the entry
    title = get_value("title", entry_information)

    # Getting the release date of the entry
    release_date = get_value("release_date", entry_information)
    # Formatting the release date to make it more user-friendly
    formatted_release_date = "{}/{}/{}".format(release_date[:4], release_date[4:6], release_date[6:])

    # Getting the experimental methods
    # There can be multiple methods, so this is a list that
    # needs to be iterated
    experimental_methods = ""
    for experimental_method in get_value("experimental_method", entry_information):
        if experimental_methods:
            experimental_methods += " and "
        experimental_methods += experimental_method

    # Creating the summary text using all the extracted information
    summary = ("Entry is titled " + title + " and was released on " + formatted_release_date + ". ")
    summary += ("This entry was determined using " + experimental_methods + ".")
    return summary
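
The date formatting step can also be written with the standard `datetime` module. This is a sketch under the assumption, implied by the slicing in the function above, that the release date arrives as a `YYYYMMDD` string; the helper name `format_release_date` is hypothetical.

```python
from datetime import datetime

def format_release_date(raw_date):
    """Convert 'YYYYMMDD' to 'YYYY/MM/DD'; return the input unchanged if it does not parse."""
    try:
        return datetime.strptime(raw_date, "%Y%m%d").strftime("%Y/%m/%d")
    except (TypeError, ValueError):
        # TypeError: input is not a string; ValueError: not a valid date
        return raw_date

print(format_release_date("20140514"))
print(format_release_date("no value"))
```

Compared with slicing, parsing the string also rejects values that are not real dates, which avoids oddly formatted output when a key was missing.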

*Examples to test above code*

---

1.   *For GET call:*

     PDB id for mouse cyclooxygenase-2 (COX-2) bound to ibuprofen: 4ph9

2.   *For POST call:*

      PDB ids for mouse cyclooxygenase-2 (COX-2) bound to ibuprofen: 4ph9, 4rs0, 8et0

In [None]:
my_pdb_id = "4ph9"
my_pdb_id_list = "4ph9, 4rs0, 8et0"
cleaned_my_pdb_id_list = clean_pdb_id_list(my_pdb_id_list)

print("Input PDB id: ", my_pdb_id)

get_result = make_GET_request(my_pdb_id, summary_url)
print()

print("Input list: ", my_pdb_id_list)
print("PDB ids (after cleaning)", cleaned_my_pdb_id_list)

post_result = make_POST_request(my_pdb_id_list, summary_url)
print()

my_summary1 = make_entry_summary("4ph9",get_result)
print(my_summary1)
print()

my_summary2 = make_entry_summary("4rs0",post_result)
print(my_summary2)
print()

my_summary3 = make_entry_summary("8et0",post_result)
print(my_summary3)

## 3) Looking at a set of related structures

We will use as a starting point the human protein *Coagulation factor XIa light chain* that is the target of a drug currently in clinical trials.

<br>
Publication:
<br>

[Design and Preclinical Characterization Program toward Asundexian (BAY 2433334), an Oral Factor XIa Inhibitor for the Prevention and Treatment of Thromboembolic Disorders.](https://europepmc.org/article/MED/37669040)

<br>
PDBe entry page for the structure of the target with Asundexian bound:
<br>

[COAGULATION FACTOR XI PROTEASE DOMAIN IN COMPLEX WITH ACTIVE SITE INHIBITOR Asundexian](https://www.ebi.ac.uk/pdbe/entry/pdb/8bo3)

<br>

----

Searching [PDBe](https://www.ebi.ac.uk/pdbe/) with the molecule name *Coagulation factor XIa light chain* returns 100+ experimentally determined structures for this protein target.


### 3.1) Generating a PDB id list

We have previously run the following code to pull all the PDB ids from the CSV file generated from a search:

```
with open('PDBe_search.csv', 'r') as file:
    reader = csv.reader(file)
    column_1 = [(row[0]) for row in reader]

new_pdb_id_list = []
for row in column_1:
    if re.match("[0-9][A-Za-z][A-Za-z0-9]{2}", row):
        new_pdb_id_list.append(row)
    else:
        continue

new_pdb_id_list = list(dict.fromkeys(new_pdb_id_list))

print(new_pdb_id_list)
```

We have run this code and have generated a PDB id list which we have named (stored as a Python object that is a list data-type):
<br>
**Coagulation_factor_XIa_light_chain_list**
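
The same steps can be tried without a file by reading CSV text held in memory. In the sketch below the CSV content is made up (a header row, then one value per row in the first column), and `re.fullmatch` is used so that values longer than four characters are rejected.

```python
import csv
import io
import re

# Made-up search export: header row, then one candidate id per row
csv_text = "pdb_id,title\n8bo3,first\n8bo4,second\n8bo3,duplicate\nnot-an-id,other\n"

# Read the first column, skipping any empty rows
reader = csv.reader(io.StringIO(csv_text))
first_column = [row[0] for row in reader if row]

# Keep only values that look like PDB ids
pdb_ids = [value for value in first_column
           if re.fullmatch(r"[0-9][A-Za-z][A-Za-z0-9]{2}", value)]

# Remove duplicates while keeping the original order
pdb_ids = list(dict.fromkeys(pdb_ids))
print(pdb_ids)
```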

If you wish you can un-comment the code in the block below, and run the code cell tomorrow after uploading your own search results.


In [8]:
## Block of code to convert PDBe_search.csv to Python list
#with open('PDBe_search.csv', 'r') as file:
#    reader = csv.reader(file)
#    column_1 = [(row[0]) for row in reader]

#new_pdb_id_list = []
#for row in column_1:
#    if re.match("[0-9][A-Za-z][A-Za-z0-9]{2}", row):
#        new_pdb_id_list.append(row)
#    else:
#      continue

#new_pdb_id_list = list(dict.fromkeys(new_pdb_id_list ))

#new_PDB_id_list = new_pdb_id_list

Coagulation_factor_XIa_light_chain_list = ['7qot', '7cj1', '7mbo', '8bo6', '8bo4', '8bo5', '8bo7', '8bo3', '4x6m', '4x6n', '3sor', '3sos', '4x6p', '1zml', '1zrk', '1zpc', '1zsl', '1zhp', '1zpz', '1zjd', '1zhr', '1zsk', '1zlr', '1zom', '1zsj', '1ztl', '1xx9', '2fda', '1xxd', '1zpb', '1zmn', '1zhm', '1xxf', '4y8y', '4y8x', '4y8z', '4x6o', '4wxi', '3bg8', '5qcl', '5qcn', '5q0d', '5exl', '5tku', '5eok', '5qck', '5q0e', '5qtw', '5qtx', '5qty', '5qqp', '5qtv', '5i25', '5q0h', '5tkt', '5tks', '5qtu', '5qqo', '6c0s', '4cre', '4ty6', '4na7', '4na8', '4cra', '4crg', '4ty7', '6aod', '4d7g', '4d7f', '4cr5', '4crb', '4crf', '5e2p', '5q0f', '5exm', '5exn', '5eod', '5qcm', '5wb6', '4cr9', '4d76', '4crd', '4crc', '5qtt', '5e2o', '5q0g', '6i58', '6ts7', '6vlv', '6hhc', '6ts4', '6ts5', '6w50', '6vlu', '6twc', '6twb', '6usy', '6ts6', '6r8x', '2j8l', '1ztk', '1ztj', '1zmj', '7v17', '2j8j', '7v16', '7v15', '7v18', '7v11', '7v0z', '7v12', '7v13', '7v14', '7v10']


### 3.2) Getting information on the experimental method and resolution for these entries

We have made a simple **get_experimental_summary()** function to pull a subset of information for multiple entries.



In [None]:
def get_experimental_summary(pdb_id=None, pdb_list=None):
    # If neither a single PDB id, nor a list was provided,
    # exit the function
    if not pdb_id and not pdb_list:
        print("Either provide one PDB id, or a list of ids")
        return None

    if pdb_id:
        # If a single PDB id was provided, call the API with GET
        data = make_GET_request(pdb_id, experiment_url)
    else:
        # If multiple PDB ids were provided, call the API with POST
        # The POST API call expects PDB ids as a comma-separated list;
        # make_POST_request builds that string from the Python list
        data = make_POST_request(pdb_list, experiment_url)

    # When no data is returned by the API, exit the function
    if not data:
        print("No data available")
        return None


    # Loop through all the PDB entries in the retrieved data
    report = "\n"
    report += "Experimental Summary"
    report += "\n"
    for entry_id in data.keys():
        entry = data[entry_id]
        entry_table = entry[0]
        experimental_method = entry_table["experimental_method"]
        resolution = str(get_value("resolution", entry_table))
        report += (entry_id + ": " + experimental_method + ", Resolution: " + resolution + " Angstrom" + "\n")
    print(report)

    return None

new_summary = get_experimental_summary(pdb_list=Coagulation_factor_XIa_light_chain_list)
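
The function above prints its report. If the values are needed for further work, one option is to return them instead. The sketch below does this on hand-written data that mimics the structure used above (one list of dictionaries per PDB id); the ids, the values and the helper name `summarise_experiments` are made up for illustration.

```python
# Hand-written stand-in for experiment data (not real API output)
mock_experiment_data = {
    "1abc": [{"experimental_method": "X-ray diffraction", "resolution": 1.8}],
    "2xyz": [{"experimental_method": "Solution NMR"}],
}

def summarise_experiments(data):
    """Return a list of (entry id, method, resolution) tuples."""
    rows = []
    for entry_id, blocks in data.items():
        first_block = blocks[0]
        method = first_block.get("experimental_method", "no value")
        # Not every entry has a resolution, so fall back to a default
        resolution = first_block.get("resolution", "no value")
        rows.append((entry_id, method, resolution))
    return rows

for row in summarise_experiments(mock_experiment_data):
    print(row)
```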

### 3.3) Getting the ligands (that is the CC IDs for the ligands) modelled in these entries

We have made a simple **get_molecule_summary()** function to pull a subset of information for multiple entries.



In [None]:
def get_molecule_summary(pdb_id=None, pdb_list=None):
    # If neither a single PDB id, nor a list was provided,
    # exit the function
    if not pdb_id and not pdb_list:
        print("Either provide one PDB id, or a list of ids")
        return None

    if pdb_id:
        # If a single PDB id was provided, call the API with GET
        data = make_GET_request(pdb_id,ligands_url)
    else:
        # If multiple PDB ids were provided, call the API with POST
        # The POST API call expects PDB ids as a comma-separated list;
        # make_POST_request builds that string from the Python list
        data = make_POST_request(pdb_list,ligands_url)

    # When no data is returned by the API, exit the function
    if not data:
        print("No data available")
        return None


    # Loop through all the PDB entries in the retrieved data
    report = "\n"
    report += "Molecule Summary"
    report += "\n"
    for entry_id in data.keys():
        entry = data[entry_id]
        # Collect the chem_comp id of every ligand listed for this entry
        chem_comp_ids = []
        for entry_table in entry:
            chem_comp_ids.append(str(get_value("chem_comp_id", entry_table)))
        # Remove duplicate chem_comp ids while keeping their order
        chem_comp_ids = list(dict.fromkeys(chem_comp_ids))
        chem_comp_list = ", ".join(chem_comp_ids)
        # Making report
        report += (entry_id + ": " + "contains " + chem_comp_list + "\n")
    print(report)

    return None

new_summary = get_molecule_summary(pdb_list=Coagulation_factor_XIa_light_chain_list)
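
Removing duplicates while keeping the original order relies on `dict.fromkeys`, because dictionary keys are unique and keep their insertion order. A minimal sketch with a made-up list of CC ids:

```python
# Made-up list of CC ids with repeats
chem_comp_ids = ["QV3", "CIT", "QV3", "SO4", "CIT"]

# dict.fromkeys keeps the first occurrence of each id, in order
unique_ids = list(dict.fromkeys(chem_comp_ids))

print(", ".join(unique_ids))
```

Note that the ids need to be compared as clean strings: `"QV3"` and `" QV3"` (with a leading space) count as different keys.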

### 3.4) Converting list of CC ids into SMILES strings

We can get SMILES strings or descriptions for all the ligands present in an entry by using one of the APIs we have for compounds.

There is a webpage available for this here:

https://www.ebi.ac.uk/pdbe/api/compounds.html

<img src="https://github.com/glevans/7ADD-workshop-2024/blob/main/Images/REST%20API_PDBe_compound_summaries.png?raw=true" height="300" align="center">



In [None]:
# Making the code
def get_SMILES(CC_id_list=None):
    # If no CC ids were provided, exit the function
    if not CC_id_list:
        print("Provide a list of CC ids")
        return None

    # Call the API with POST
    # The POST API call expects CC ids as a comma-separated list
    CC_id_list_string = ", ".join(CC_id_list)
    print(CC_id_list_string)
    data_out = requests.post(url=compound_summaries_url, data=CC_id_list_string)

    # When the API does not return data, exit the function
    if data_out.status_code != 200:
        print("No data retrieved - ", data_out.status_code, data_out.text)
        return None
    POST_response = data_out.json()

    # Loop through all the compounds in the retrieved data
    report = "\n"
    report += "SMILES string summary"
    report += "\n"
    for CC_id in POST_response.keys():
        compound_info = POST_response[CC_id]
        compound_details = compound_info[0]
        compound_SMILES_info =  compound_details["smiles"]
        compound_SMILES_details = compound_SMILES_info[0]
        compound_SMILES = compound_SMILES_details["name"]
        # Making report
        report += (CC_id + ": " + compound_SMILES + "\n")
    print(report)

    return None

# Running the code
pdb_8bo3_summary = get_molecule_summary(pdb_id="8bo3")
CC_id_list = ["QV3", "CIT"]
compound_details_list = get_SMILES(CC_id_list=CC_id_list)
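
The nested lookups in the function above (`[0]`, then `"smiles"`, then `[0]`, then `"name"`) fail with an error if any level is missing. The sketch below shows one defensive way to walk the same structure, using hand-written data that follows the layout assumed by the code above. The SMILES value, the placeholder id `XXX` and the helper name `first_smiles` are illustrative and not taken from an API response.

```python
# Hand-written stand-in for compound data (not real API output)
mock_compound_data = {
    "CIT": [{"smiles": [{"name": "OC(=O)CC(O)(CC(O)=O)C(O)=O"}]}],
    "XXX": [{"smiles": []}],
}

def first_smiles(compound_blocks):
    """Return the first SMILES string of a compound, or None if there is none."""
    if not compound_blocks:
        return None
    smiles_entries = compound_blocks[0].get("smiles", [])
    if not smiles_entries:
        return None
    return smiles_entries[0].get("name")

for cc_id, blocks in mock_compound_data.items():
    print(cc_id, first_smiles(blocks))
```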

## 4) Summary

In this notebook we have converted information from an API call into a Python object that is a dictionary data-type.

We have performed both GET and POST API calls.

We have shown some simple summaries that can be generated from these types of API calls.

<br>

We have made 9 Python functions / definitions / methods to help get and navigate the information from PDBe API calls:

*   **get_value()**

    *- gets the value that corresponds to a key in a dictionary*

    *- if the key doesn't exist in a dictionary then it returns error message 'no value'*

    *- output is the value stored for the key (or the string 'no value')*

*   **make_GET_request()**

    *- will make a GET call to the PDBe API using the PDB id and API url as arguments*

    *- if the input is not a PDB id (aka the format is not that of the PDB id), it prints the error message 'Invalid PDB ID' and returns 'None'*

*   **clean_pdb_id_list()**

    *- if the list includes items which are not a PDB id (aka the format is not that of the PDB id), it removes the item from the list*

*   **make_POST_request()**

    *- will make a POST call to the PDBe API using a list of PDB ids and API url as arguments*

    *- uses **clean_pdb_id_list** function to 'clean' the PDB list and remove items which are not a PDB id (aka the format is not that of the PDB id)*

*   **get_entry_info()**
    
    *- gets the data from an information block that corresponds to a PDB id*

    *- if the PDB id is not present, it prints the error message 'PDB id is NOT present'*

    *- takes output from **make_GET_request** function or **make_POST_request** as input*

    *- output is a dictionary*

*   **make_entry_summary()**

    *- creates a summary for a PDB entry.*

    *- takes output from **make_GET_request** function or **make_POST_request** as input*
    
    *- uses **get_value** function as part of the method*

    *- output is a string*

*   **get_experimental_summary()**

    *- creates an experimental summary for a list of PDB entries (or a single PDB id).*

    *- uses **make_GET_request** function or **make_POST_request** as part of the method*
    
    *- uses **get_value** function as part of the method*

    *- prints the summary as text; the function itself returns None*

*   **get_molecule_summary()**

    *- creates a list of CC ids corresponding to what ligands are bound for a list of PDB entries (or a single PDB id).*

    *- uses **make_GET_request** function or **make_POST_request** as part of the method*
    
    *- uses **get_value** function as part of the method*

    *- prints the summary as text; the function itself returns None*

*   **get_SMILES()**

    *- reports SMILES that correspond to a list of CC IDs.*

    *- prints the report as text; the function itself returns None*
<br>


## This ends the second notebook - please proceed to other notebooks of your interest

Copyright 2024 EMBL - European Bioinformatics Institute

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.