<a href="https://colab.research.google.com/github/glevans/PDB_Notebooks/blob/main/FEBS_engineering_enzymes_2025/PART1_Activity_on_PDBeAPIs_ANS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧬 Exploring PDBe APIs for Enzyme Classifications

<img src="https://www.ebi.ac.uk/pdbe/docs_dev/logos/images/RGB/PDBe-logo-RGB_2013.png" height="180" align="right">

In this notebook, you'll learn how to:

- Access and parse biological data from an API
- Understand the relationship between macromolecular structures & EC (Enzyme Commission) numbers
- Use Python function definitions
- Transform nested JSON data into a structured `DataFrame`
- Use `glom`, a Python package for working with nested data
- Use AI to fix bugs in and improve Python code

<br>

---

## ℹ️ **Introduction**

### **What is the Protein Data Bank (PDB)?**

The **Protein Data Bank (PDB)** contains experimentally determined 3D structures of biological molecules, such as proteins, DNA and RNA.

The **Protein Data Bank in Europe (PDBe)** is the European branch of the PDB that helps prepare, share, organize and visualize protein structure data for scientists worldwide. PDBe is based at the **European Bioinformatics Institute (EMBL-EBI)**, a global leader in bioinformatic tools and resources. EMBL-EBI provides freely available data, tools, and services to support life science research.

### **What are EC Numbers?**

EC (Enzyme Commission) numbers classify enzymes based on the chemical reactions they catalyze. Each EC number consists of up to four parts (*e.g.* 2.7.1.1), forming a hierarchical identifier that reflects an enzyme’s function. While full four-part identifiers specify a unique enzymatic activity, two-part and three-part identifiers are often used when the exact substrate or reaction details are unknown or variable. These shorter EC numbers group enzymes by broader functional categories, making them useful for preliminary enzyme classification or when an enzyme acts on a range of substrates.

**Example: EC 2.7.1.1**

*  **2:** Transferase – an enzyme that transfers functional groups.
*  **7:** Transfers phosphorus-containing groups.
*  **1:** Transfers a phosphate group to an alcohol group.
*  **1:** Specifies the complete reaction: ATP + D-hexose → ADP + D-hexose 6-phosphate.
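
The hierarchy in this example can be illustrated with a few lines of Python. This is a minimal sketch: the function name `ec_levels` is made up for illustration and is not part of any PDBe tooling.

```python
def ec_levels(ec_number):
    """Return the progressively more specific prefixes of an EC number."""
    parts = ec_number.split(".")
    # Build '2', '2.7', '2.7.1', '2.7.1.1' from the individual parts
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


print(ec_levels("2.7.1.1"))
# ['2', '2.7', '2.7.1', '2.7.1.1']
```

Each item in the returned list corresponds to one level of the classification, from the broad enzyme class down to the specific reaction.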

<br>

Mapping EC numbers to protein chains in PDB entries helps researchers understand the biological role of a protein structure.

Information on EC numbers, enzyme reactions, and enzyme names at PDBe is sourced from:

*   [Expasy ENZYME - Enzyme nomenclature database](https://enzyme.expasy.org/)
*   [ExplorEnz - the Enzyme Database](https://www.enzyme-database.org/)

Previously, this data was obtained from [IntEnz](https://pmc.ncbi.nlm.nih.gov/articles/PMC308853/). However, this resource is no longer available.

### **What is a Python definition?**

A Python definition (a function, created with the `def` keyword) is a reusable block of code that performs a specific task.
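
As a minimal illustration (the function name and what it does are made up for this example), a definition is written once with `def` and can then be called as often as needed:

```python
def format_ec(ec_number):
    """Return an EC number with the conventional 'EC ' prefix."""
    return f"EC {ec_number}"


# The same definition is reused with different inputs
print(format_ec("2.7.1.1"))   # EC 2.7.1.1
print(format_ec("1.4.3.21"))  # EC 1.4.3.21
```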

### **What is an API?**
<img src="https://github.com/glevans/7ADD-workshop-2024/blob/main/Images/API_image.png?raw=true" height="140" align="right">

An API (Application Programming Interface) is a programmatic way to obtain information. APIs work in the background to provide the information we see on websites such as [PDBe's website](https://pdbe.org). Using Python code to access APIs enables faster analysis than viewing the information directly on websites.

For more information on PDBe's APIs, visit:

*   [http://www.ebi.ac.uk/pdbe/pdbe-rest-api](http://www.ebi.ac.uk/pdbe/pdbe-rest-api)
*   [https://www.ebi.ac.uk/pdbe/api/v2/#/](https://www.ebi.ac.uk/pdbe/api/v2/#/)

### **API Request Components**

An example of a PDBe API endpoint:

[https://www.ebi.ac.uk/pdbe/api/mappings/ec/2XFU](https://www.ebi.ac.uk/pdbe/api/mappings/ec/2XFU)

#### *Protocol*

```
https://
```

This is the communication protocol used to make the API request.

#### *Domain or Host*

www.ebi.ac.uk (European Bioinformatics Institute)

#### *API Structure or Path*

PDBe API v2 uses service-specific endpoint patterns, such as:

**Mappings:** `/pdbe/api/mappings/{type}/{id}`

**Entry Data:** `/pdbe/api/pdb/entry/{data}/{id}`

**Validation:** `/pdbe/api/validation/{type}/{id}`

Each service has its own optimized URL structure designed for that specific functionality, making the API more intuitive.
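
The components described so far can be pulled apart with Python's standard library. This is a small sketch using `urllib.parse.urlsplit` on the example endpoint; the variable names are chosen only for illustration.

```python
from urllib.parse import urlsplit

endpoint = "https://www.ebi.ac.uk/pdbe/api/mappings/ec/2XFU"
parts = urlsplit(endpoint)

protocol = parts.scheme              # 'https'
host = parts.netloc                  # 'www.ebi.ac.uk'
path = parts.path                    # '/pdbe/api/mappings/ec/2XFU'
entry_id = path.rsplit("/", 1)[-1]   # '2XFU' (last segment of the path)

print(protocol, host, path, entry_id)
```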

#### *Query string -- ID*

Some examples of IDs used for querying PDBe's API endpoints:

##### **1. 🧪 PDB Identifiers**
```
Format: [0-9][A-Za-z0-9]{3}
Examples: 1mso, 8a3h, 2ins
```
- **Primary Resource ID**: Most common identifier type
- **4-character alphanumeric**: First character is numeric (0-9), followed by 3 alphanumeric
- **Case insensitive**: API accepts both uppercase and lowercase
- **Usage**: For experimentally determined structure endpoints
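
The format above can be checked with a regular expression before sending a request. This is a sketch only: `looks_like_pdb_id` is a hypothetical helper, and it tests the four-character format, not whether the entry actually exists in the PDB.

```python
import re

# One digit followed by three letters or digits, as in the format shown above
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")


def looks_like_pdb_id(text):
    """Return True if `text` has the four-character PDB ID format."""
    return PDB_ID_PATTERN.fullmatch(text) is not None


for candidate in ["1mso", "2XFU", "ATP", "abcd"]:
    print(candidate, looks_like_pdb_id(candidate))
```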

<br>

##### **2. ⚗️ Chemical Component Dictionary Identifiers (or HET codes)**
```
Format: [A-Z0-9]{1,3} or [A-Z0-9]{5}
Examples: ATP, NAD, HEM, ZN, SO4, A1L3O (5-Hydroxyuridine 5'-phosphate)
```
- **Chemical identifiers**: codes of 1 to 3 characters, or 5 characters, for non-polymer entities
- **Standardized**: Defined by wwPDB Chemical Component Dictionary
- **Types include**:
  - **Ligands**: ATP (adenosine triphosphate), NAD (nicotinamide adenine dinucleotide)
  - **Cofactors**: HEM (heme), FAD (flavin adenine dinucleotide)
  - **Metals**: ZN (zinc), CA (calcium), MG (magnesium)
  - **Solvents**: GOL (glycerol), SO4 (sulfate)
- **Usage**: For small molecule / ligand-specific endpoints
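
A format check for these codes can be sketched in the same way. The pattern below assumes codes of one to three characters, or exactly five characters, written in uppercase, as the examples above (ZN, ATP, A1L3O) suggest; `looks_like_ccd_id` is a hypothetical helper name.

```python
import re

# One to three characters, or exactly five characters (assumed from the examples)
CCD_ID_PATTERN = re.compile(r"(?:[A-Z0-9]{1,3}|[A-Z0-9]{5})")


def looks_like_ccd_id(text):
    """Return True if `text` has the shape of a chemical component code."""
    return CCD_ID_PATTERN.fullmatch(text) is not None


for candidate in ["ATP", "ZN", "A1L3O", "ABCD"]:
    print(candidate, looks_like_ccd_id(candidate))
```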


### **What is a Notebook?**

A **Colab** or **Jupyter** notebook corresponds to a file with the extension `.ipynb`.

Notebooks are useful for sharing examples of code and exploring programmatic ways of handling data.

<br>

To use this notebook in **Colab** (link at top of the page):

*   you will need a Google account
*   you will need to be logged in to that Google account, which also logs you in to Google Colab

<br>

To use this as a **Jupyter** notebook, download it and view it with:

*   a local installation of [Jupyter](https://jupyter.org/)
*   a browser instance of [JupyterLab](https://jupyter.org/try-jupyter/lab/)

<br>

<br>

---

## How to use this notebook <a name="Quick Start"></a>
1. To run a code cell, click on the cell to select it. A play button (▶️) appears on the left side of the cell. Click the play button or press Shift+Enter to run the code in the selected cell.
2. The code starts executing, and any output is displayed below the code cell.
3. Move to the next code cell and repeat steps 1 and 2 until you have run all the desired code cells in sequence.
4. The currently running cell is indicated by a circle with a stop sign next to it. To stop or interrupt a running cell, click the stop button (■) located next to the play button.
5. The exercises and bonus challenges have empty code cells that require code to be added before they are run.

*Remember to run the code cells in the correct order, as their execution might depend on variables or functions defined in previous cells. You can modify the code in a code cell and re-run it to see updated results.*

<br>

---

## Contact us

If you experience any bugs, please contact pdbehelp@ebi.ac.uk and put "Help with" followed by the title of the notebook in the subject line of the message.


## ⚙️ **Setup**



### 📦 Step 1: Install Required Package
Ensure the `glom` package is installed.

This is used to simplify data extraction from nested data structures.

To run a shell (Bash) command in a notebook, add `!` before the command.

Many Python packages are available from [PyPI](https://pypi.org/).

The command `pip install` installs packages from PyPI.
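
Whether a package is already available can also be checked from Python itself. This is a small sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


print("glom installed:", is_installed("glom"))
```

If this prints `False`, run the install cell before continuing.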

In [1]:
# Install glom if not already installed
!pip install glom

Collecting glom
  Downloading glom-24.11.0-py3-none-any.whl.metadata (5.1 kB)
Collecting boltons>=19.3.0 (from glom)
  Downloading boltons-25.0.0-py3-none-any.whl.metadata (6.5 kB)
Collecting face>=20.1.1 (from glom)
  Downloading face-24.0.0-py3-none-any.whl.metadata (1.1 kB)
Downloading glom-24.11.0-py3-none-any.whl (102 kB)
Downloading boltons-25.0.0-py3-none-any.whl (194 kB)
Downloading face-24.0.0-py3-none-any.whl (54 kB)
Installing collected packages: boltons, face, glom
Successfully installed boltons-25.0.0 face-24.0.0 glom-24.11.0


### 📥 Step 2: Import Modules
Import all necessary Python modules for data fetching, transformation, and display.

We will be using Python packages / modules:

*   [requests](https://requests.readthedocs.io/) - for sending HTTP requests, such as calls to an API
*   [pandas](https://pandas.pydata.org/) - for working with data in tables, like spreadsheets
*   [glom](https://glom.readthedocs.io/en/stable/) - for exploring and accessing information in nested data structures, such as that from APIs

<br>

In [None]:
# Import necessary modules
import requests
from glom import glom, Coalesce
import pandas as pd

### 🌐 Step 3: Fetch Data from an API

We will be using the PDBe API to retrieve enzyme classification (EC) mapping data for a specific PDB entry (2XFU).

In [None]:
url_2xfu = "https://www.ebi.ac.uk/pdbe/api/mappings/ec/2XFU"
response_2xfu = requests.get(url_2xfu)
data_2xfu = response_2xfu.json()
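
A request can fail, for example when the ID does not exist or the server is busy. The sketch below wraps the same call in a function that sets a timeout and raises an error on a failed HTTP status. The function name `fetch_ec_mappings` and its `get` parameter are assumptions made for this example: `get` is a hook that lets the function be tried with a stand-in instead of a real network call.

```python
def fetch_ec_mappings(pdb_id, get=None, timeout=10):
    """Fetch EC mappings for one PDB entry and return the parsed JSON."""
    if get is None:
        # Imported here so that a stand-in `get` can be used without requests
        import requests
        get = requests.get

    url = f"https://www.ebi.ac.uk/pdbe/api/mappings/ec/{pdb_id}"
    response = get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of failing later
    response.raise_for_status()
    return response.json()


# Example call (needs network access):
# data_2xfu = fetch_ec_mappings("2XFU")
```

Checking the status before calling `.json()` gives a clear error message at the point where the request failed.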

To view this API endpoint's output (for query `2XFU`):

[https://www.ebi.ac.uk/pdbe/api/mappings/ec/2XFU](https://www.ebi.ac.uk/pdbe/api/mappings/ec/2XFU)


<br>

Information from this API is formatted as a JSON object.
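
Parsing JSON turns JSON objects into Python dictionaries and JSON arrays into Python lists. The sketch below uses a trimmed, hand-written sample that only mimics the shape of the response; it is not the full API output.

```python
import json

# Hand-written sample mimicking the shape of the API response (values trimmed)
sample_text = '{"2xfu": {"EC": {"1.4.3.21": {"mappings": [{"chain_id": "A"}]}}}}'

# JSON object -> dict, JSON array -> list, JSON string -> str
sample = json.loads(sample_text)

print(type(sample).__name__)                                          # dict
print(type(sample["2xfu"]["EC"]["1.4.3.21"]["mappings"]).__name__)    # list
print(sample["2xfu"]["EC"]["1.4.3.21"]["mappings"][0]["chain_id"])    # A
```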



---




## 🧭 **Exploring API endpoint data structure**


### 🔑 Step 1: Top-Level Keys

Use this to get a quick overview of the top-level division of information from the API.

In [None]:
# Print top-level keys to understand the structure
print(data_2xfu.keys())

dict_keys(['2xfu'])


### 🧰 Step 2: Map All Keys

We are using a Python function to find all the dictionary keys in the nested structure.

This is a useful approach for understanding deeply nested JSON objects.

🧠 **Function: map_keys**

In [None]:
# Define a recursive function to map all keys
def map_keys(d, level=0, path=''):
    # If the current object is a dictionary
    if isinstance(d, dict):
        for k, v in d.items():
            # Build the full path to the current key
            full_path = f"{path}.{k}" if path else k
            # Print the key with indentation based on the current level
            print("  " * level + f"- {full_path}")
            # Recursively call map_keys on the value
            map_keys(v, level + 1, full_path)

    # If the current object is a list
    elif isinstance(d, list):
        for i, item in enumerate(d):
            # Build the full path to the current list index
            full_path = f"{path}[{i}]"
            # Recursively call map_keys on the list item
            map_keys(item, level + 1, full_path)

▶️ **Run the Function**

In [None]:
# Call the function on your JSON-like data structure
map_keys(data_2xfu)

- 2xfu
  - 2xfu.EC
    - 2xfu.EC.1.4.3.21
      - 2xfu.EC.1.4.3.21.reaction
      - 2xfu.EC.1.4.3.21.systematic_name
      - 2xfu.EC.1.4.3.21.accepted_name
      - 2xfu.EC.1.4.3.21.synonyms
      - 2xfu.EC.1.4.3.21.mappings
          - 2xfu.EC.1.4.3.21.mappings[0].chain_id
          - 2xfu.EC.1.4.3.21.mappings[0].entity_id
          - 2xfu.EC.1.4.3.21.mappings[0].struct_asym_id
          - 2xfu.EC.1.4.3.21.mappings[1].chain_id
          - 2xfu.EC.1.4.3.21.mappings[1].entity_id
          - 2xfu.EC.1.4.3.21.mappings[1].struct_asym_id
      - 2xfu.EC.1.4.3.21.identifier
    - 2xfu.EC.1.4.3.4
      - 2xfu.EC.1.4.3.4.reaction
      - 2xfu.EC.1.4.3.4.systematic_name
      - 2xfu.EC.1.4.3.4.accepted_name
      - 2xfu.EC.1.4.3.4.synonyms
      - 2xfu.EC.1.4.3.4.mappings
          - 2xfu.EC.1.4.3.4.mappings[0].chain_id
          - 2xfu.EC.1.4.3.4.mappings[0].entity_id
          - 2xfu.EC.1.4.3.4.mappings[0].struct_asym_id
          - 2xfu.EC.1.4.3.4.mappings[1].chain_id
          - 2xfu.EC.1.4.
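
Printing is convenient for reading, but returning the paths as a list makes them easier to reuse, for example to search for a particular key. The variant below is a sketch, not part of the original notebook: `collect_paths` is a hypothetical name, and the sample is hand-written to mimic the shape of the response.

```python
def collect_paths(obj, path=""):
    """Return a list of every key path in a nested dict/list structure."""
    paths = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            full_path = f"{path}.{key}" if path else key
            paths.append(full_path)
            paths.extend(collect_paths(value, full_path))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            # List positions extend the path but are not reported on their own
            paths.extend(collect_paths(item, f"{path}[{index}]"))
    return paths


# Hand-written sample mimicking the shape of the API response
sample = {"2xfu": {"EC": {"1.4.3.21": {"mappings": [{"chain_id": "A"}]}}}}
for p in collect_paths(sample):
    print(p)
```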

### 🗺️ Step 3: Map all keys with added information on data structure

We are using a Python function to see all dictionary keys, with added information.

The code below reports, at each level in the nested data, whether the data is structured as a:

*   dictionary
*   list
*   string

In [None]:
def json_structure_report(data, level=0, path='root', show_values=False, max_depth=None):
    """
    Recursively reports the type of each layer in a nested JSON-like structure.

    Parameters:
    - data: The JSON-like object (dict or list) to inspect.
    - level: Current depth level (used for indentation).
    - path: String representing the path to the current node.
    - show_values: If True, prints the value of items that are not dicts or lists.
    - max_depth: If set, limits the depth of recursion.
    """
    indent = "  " * level  # Indentation for visual hierarchy

    # Stop recursion if max_depth is reached
    if max_depth is not None and level > max_depth:
        print(f"{indent}{path} ... (max depth reached)")
        return

    if isinstance(data, dict):
        print(f"{indent}{path} is a dictionary with {len(data)} keys: {list(data.keys())}")
        for key, value in data.items():
            json_structure_report(value, level + 1, f"{path}.{key}", show_values, max_depth)

    elif isinstance(data, list):
        print(f"{indent}{path} is a list with {len(data)} items")
        for i, item in enumerate(data):
            json_structure_report(item, level + 1, f"{path}[{i}]", show_values, max_depth)

    else:
        # For primitive types (str, int, etc.)
        type_name = type(data).__name__
        if show_values:
            print(f"{indent}{path} is a {type_name} with value: {repr(data)}")
        else:
            print(f"{indent}{path} is a {type_name}")

▶️ **Run the Function**

In [None]:
# Call the function on your JSON-like data structure
json_structure_report(data_2xfu, show_values=True)

root is a dictionary with 1 keys: ['2xfu']
  root.2xfu is a dictionary with 1 keys: ['EC']
    root.2xfu.EC is a dictionary with 2 keys: ['1.4.3.21', '1.4.3.4']
      root.2xfu.EC.1.4.3.21 is a dictionary with 6 keys: ['reaction', 'systematic_name', 'accepted_name', 'synonyms', 'mappings', 'identifier']
        root.2xfu.EC.1.4.3.21.reaction is a str with value: 'a primary methyl amine + O2 + H2O = an aldehyde + H2O2 + NH4(+).'
        root.2xfu.EC.1.4.3.21.systematic_name is a str with value: 'primary-amine:oxygen oxidoreductase (deaminating)'
        root.2xfu.EC.1.4.3.21.accepted_name is a str with value: 'primary-amine oxidase.'
        root.2xfu.EC.1.4.3.21.synonyms is a list with 5 items
          root.2xfu.EC.1.4.3.21.synonyms[0] is a str with value: 'CAO.'
          root.2xfu.EC.1.4.3.21.synonyms[1] is a str with value: 'amine oxidase (copper-containing).'
          root.2xfu.EC.1.4.3.21.synonyms[2] is a str with value: 'amine oxidase.'
          root.2xfu.EC.1.4.3.21.synonyms[

## 🔍 **1) EXERCISE - PROVIDED EXAMPLE**








### ❓ **TASK 1:** Suggest bug fixes for the code below.


The code was generated by AI and contains errors.

*HINT: Use insights from the functions* `map_keys` *and* `json_structure_report`.


<br>

---

🐞 **Original Code with Bugs**

In [None]:
# Extract the list of EC mappings
ecmappings = glom(data_2xfu, '2XFU.EC', default=[])
print(ecmappings)

# Create an empty list to store EC numbers
ecnumbers = []

# Loop through each entry in the ecmappings list
for details in ecmappings:
    # Extract the value associated with the 'ecnumber' key
    ec_number = details['ecnumber']
    # Append the EC number to the list
    ecnumbers.append(ec_number)

print("Extracted EC Numbers:", ecnumbers)

[]
Extracted EC Numbers: []


### 🧪 **SOLUTION 1:** Bug Fixes and Explanation


*   Issue 1: Incorrect capitalization of the PDB ID (`2XFU` should be lowercase: `2xfu`)
*   Issue 2: `'2xfu.EC'` refers to a **dictionary**, not a list
*   Issue 3: `'ecnumber'` is NOT a **dictionary key**; the EC numbers are themselves the keys of the `EC` dictionary
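
These points can be checked on a small example. The sample below is hand-written to mimic the shape of the response (values trimmed); it is an illustration, not the real API output.

```python
# Hand-written sample with the same shape as the API response (values trimmed)
sample = {"2xfu": {"EC": {"1.4.3.21": {}, "1.4.3.4": {}}}}

pdb_id = "2XFU"
print(pdb_id in sample)           # False: the key is stored in lowercase
print(pdb_id.lower() in sample)   # True

ec_dict = sample[pdb_id.lower()]["EC"]
# Iterating over a dictionary yields its keys, which here are the EC numbers
ec_numbers = list(ec_dict)
print(ec_numbers)                 # ['1.4.3.21', '1.4.3.4']
```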

In [None]:
# Extract the dictionary of EC mappings
ecmappings = glom(data_2xfu, '2xfu.EC', default={})

# Create an empty list to store EC numbers
ecnumbers = []

# Loop through each key in the ecmappings dictionary
for ec_number in ecmappings.keys():
    # Append each EC number to the list
    ecnumbers.append(ec_number)

print("Extracted EC Numbers:", ecnumbers)

Extracted EC Numbers: ['1.4.3.21', '1.4.3.4']


## 🔍 **2) EXERCISE**

### ❓ **TASK 2:** Suggest bug fixes for the code below.

The code was generated by AI and contains errors.

*HINT1: Use insights from the functions* `map_keys` *and* `json_structure_report`.

*HINT2: You can use AI to help with the bug fixing*.

<br>

---

🐞 **Original Code with Bugs**

In [None]:
# Step 1: Extract EC numbers and associated chain IDs
ec_mappings = data_2xfu.get("2XFU", {}).get("EC", [])
rows = []
for details in ec_mappings:
    ec_number = details.get("ec_number")
    for mapping in details.get("mappings", []):
        chain_id = mapping.get("chain_id")
        rows.append({"EC Number": ec_number, "Chain ID": chain_id})

# Step 2: Create a DataFrame
df = pd.DataFrame(rows)

# Step 3: Display a DataFrame / Table
display(df)

### 🧪 **SOLUTION 2:** Bug Fixes and Explanation

*  Issue 1: Incorrect capitalization of PDB ID (2XFU → 2xfu)
*  Issue 2: '2xfu.EC' is a **dictionary**, not a list
*  Issue 3: The first `for` loop must iterate over the dictionary's key-value pairs (`.items()`), because each EC number is a key and its details are the value
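
The buggy code runs without any error message, which makes these issues easy to miss. The sketch below, on a hand-written sample that mimics the shape of the response, shows why: `.get()` with a default hides a wrong key, and looping over a dictionary directly yields only its keys.

```python
# Hand-written sample mimicking the shape of the API response
sample = {"2xfu": {"EC": {"1.4.3.21": {"mappings": [{"chain_id": "A"}]}}}}

# Wrong key: no error is raised, the defaults are returned instead
silent = sample.get("2XFU", {}).get("EC", [])
print(silent)   # []

# Correct key, but looping over the dictionary directly yields strings (the keys),
# so calling .get() on each item would fail
key_types = [type(item).__name__ for item in sample["2xfu"]["EC"]]
print(key_types)   # ['str']

# .items() yields (key, value) pairs, which is what the corrected loop needs
pairs = []
for ec_number, details in sample["2xfu"]["EC"].items():
    for mapping in details["mappings"]:
        pairs.append((ec_number, mapping["chain_id"]))
print(pairs)   # [('1.4.3.21', 'A')]
```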


In [None]:
# Step 1: Extract EC numbers and associated chain IDs
ec_mappings = data_2xfu.get("2xfu", {}).get("EC", {})
rows = []
# Iterate over the items (key-value pairs) of the ec_mappings dictionary
for ec_number, details in ec_mappings.items():
    # print(details)
    # Iterate over the 'mappings' list within the details for each EC number
    for mapping in details.get("mappings", []):
        # print(mapping)
        chain_id = mapping.get("chain_id")
        # Append a dictionary with the EC Number and Chain ID to the rows list
        rows.append({"EC Number": ec_number, "Chain ID": chain_id})

# Step 2: Create a DataFrame
df = pd.DataFrame(rows)

# Step 3: Display a DataFrame / Table
display(df)

| | EC Number | Chain ID |
|---|---|---|
| 0 | 1.4.3.21 | A |
| 1 | 1.4.3.21 | B |
| 2 | 1.4.3.4 | A |
| 3 | 1.4.3.4 | B |


Equivalent code but updated to use `glom`.

In [None]:
# Step 1: Extract EC numbers and associated chain IDs using glom with Coalesce
ec_mappings = glom(data_2xfu, Coalesce("2xfu.EC", default={}))

rows = []
for ec_number, details in ec_mappings.items():
    mappings = glom(details, Coalesce("mappings", default=[]))
    for mapping in mappings:
        chain_id = mapping.get("chain_id")
        rows.append({"EC Number": ec_number, "Chain ID": chain_id})

# Step 2: Create a DataFrame
df = pd.DataFrame(rows)

# Step 3: Display the DataFrame
display(df)

| | EC Number | Chain ID |
|---|---|---|
| 0 | 1.4.3.21 | A |
| 1 | 1.4.3.21 | B |
| 2 | 1.4.3.4 | A |
| 3 | 1.4.3.4 | B |
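
To make the idea of walking a nested path with a fallback more concrete, here is a simplified stand-in written in plain Python. It is not how `glom` is implemented, and `get_nested` is a hypothetical name. It takes the keys as a list rather than a dotted string, which matters here because the EC number keys themselves contain dots.

```python
def get_nested(data, keys, default=None):
    """Walk `data` along a sequence of keys; return `default` if a step is missing."""
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hand-written sample mimicking the shape of the API response
sample = {"2xfu": {"EC": {"1.4.3.21": {"mappings": [{"chain_id": "A"}]}}}}

print(get_nested(sample, ["2xfu", "EC", "1.4.3.21", "mappings", 0, "chain_id"]))  # A
print(get_nested(sample, ["2XFU", "EC"], default={}))                             # {}
```

A dotted string such as `"2xfu.EC.1.4.3.21"` would be ambiguous, since it is unclear which dots separate keys. The `glom` documentation describes a `Path` type for keys like these; check it before relying on a dotted string that reaches below the `EC` level.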


## 📝 Quick Review Quiz 1

Test your understanding!

---

**1. What does "API" stand for?**

<select>
  <option value="Select_answer">Select answer</option>
  <option value="Automated_Program_Integration">Automated Program Integration</option>
  <option value="Advanced_Protocol_Interface">Advanced Protocol Interface</option>
  <option value="Application_Programming_Interface">Application Programming Interface</option>
  <option value="Applied Programming Instruction">Applied Programming Instruction</option>
</select>

---

**2. Which part of an API endpoint typically specifies the resource being accessed?**

<select>
  <option value="Select_answer">Select answer</option>
  <option value="Protocol">Protocol</option>
  <option value="Domain">Domain</option>
  <option value="Path">Path</option>
  <option value="Query_string">Query string</option>
</select>


---

**3.  In the endpoint `https://api.example.com/users/123`, what does 123 represent?**

<select>
  <option value="Select_answer">Select answer</option>
  <option value="API_version">API version</option>
  <option value="User_ID">User ID</option>
  <option value="Query_parameter">Query parameter</option>
  <option value="HTTP method">HTTP method</option>
</select>


---

**4. What does an EC number represent?**

<select>
  <option value="Select_answer">Select answer</option>
  <option value="Molecular_weight">The molecular weight of an enzyme</option>
  <option value="Protein_structure">The structure of a protein</option>
  <option value="Catalyzed_reaction">The type of chemical reaction an enzyme catalyzes</option>
  <option value="Cellular_location">The location of the enzyme in the cell</option>
</select>


---

**5. How many parts are there in a full EC number?**

<select>
  <option value="Select_answer">Select answer</option>
  <option value="2">2</option>
  <option value="3">3</option>
  <option value="4">4</option>
  <option value="5">5</option>
</select>

## 🔍 **3) EXERCISE**

### ❓ **TASK 3:** Replace the PDB ID in the Python code and get a different output

Replace the PDB ID with a new ID: `3DIV`


### 🧪 **SOLUTION 3:** Replacing the PDB ID




In [None]:
# Step 1: Query API endpoint
url_3div = "https://www.ebi.ac.uk/pdbe/api/mappings/ec/3DIV"
response_3div = requests.get(url_3div)
data_3div = response_3div.json()

# Step 2: Extract EC numbers and associated chain IDs
ec_mappings = data_3div.get("3div", {}).get("EC", {})
rows = []
# Iterate over the items (key-value pairs) of the ec_mappings dictionary
for ec_number, details in ec_mappings.items():
    # print(details)
    # Iterate over the 'mappings' list within the details for each EC number
    for mapping in details.get("mappings", []):
        # print(mapping)
        chain_id = mapping.get("chain_id")
        # Append a dictionary with the EC Number and Chain ID to the rows list
        rows.append({"EC Number": ec_number, "Chain ID": chain_id})

# Step 3: Create a DataFrame
df2 = pd.DataFrame(rows)

# Step 4: Display a DataFrame / Table
display(df2)

| | EC Number | Chain ID |
|---|---|---|
| 0 | 1.10.3.2 | A |
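
Because only the PDB ID changes between the two examples, the extraction step can be wrapped in a function and reused for any entry. This is a sketch: `ec_chain_rows` is a hypothetical name, and the sample data is hand-written to match the single row shown above rather than fetched from the API.

```python
def ec_chain_rows(data, pdb_id):
    """Return one {'EC Number', 'Chain ID'} row per EC number / chain pair."""
    # The response is keyed by the lowercase PDB ID
    ec_mappings = data.get(pdb_id.lower(), {}).get("EC", {})
    rows = []
    for ec_number, details in ec_mappings.items():
        for mapping in details.get("mappings", []):
            rows.append({"EC Number": ec_number, "Chain ID": mapping.get("chain_id")})
    return rows


# Hand-written sample mimicking the shape of the response for 3DIV
sample_3div = {"3div": {"EC": {"1.10.3.2": {"mappings": [{"chain_id": "A"}]}}}}
print(ec_chain_rows(sample_3div, "3DIV"))
# [{'EC Number': '1.10.3.2', 'Chain ID': 'A'}]
```

The returned list can be passed straight to `pd.DataFrame(...)`, as in the cells above.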


## 📝 Quick Review Quiz 2

Test your understanding!

**1. What is the main purpose of using an API in bioinformatics?**  
<select>
  <option value="Select_answer">Select answer</option>
  <option value="Visualize_protein_structures">To visualize protein structures</option>
  <option value="Access_biological_data">To access and retrieve biological data from online databases</option>
  <option value="Edit_DNA_sequences">To edit DNA sequences</option>
  <option value="Create_3D_models">To create 3D models of enzymes</option>
</select>

---

**2. Which Python keyword is used to define a function?**  
<select>
  <option value="Select_answer">Select answer</option>
  <option value="define">define</option>
  <option value="function">function</option>
  <option value="def">def</option>
  <option value="lambda">lambda</option>
</select>

---

**3. What Python data type is commonly used to store parsed JSON data?**  
<select>
  <option value="Select_answer">Select answer</option>
  <option value="List">List</option>
  <option value="Tuple">Tuple</option>
  <option value="Dictionary">Dictionary</option>
  <option value="String">String</option>
</select>

---

**4. What is the purpose of transforming nested JSON into a DataFrame?**  
<select>
  <option value="Select_answer">Select answer</option>
  <option value="Compress_data">To compress the data</option>
  <option value="Visualize_graph">To visualize the data as a graph</option>
  <option value="Analyze_data">To make the data easier to analyze and manipulate</option>
  <option value="Encrypt_data">To encrypt the data</option>
</select>

---

**5. Which Python package is introduced in the notebook for handling nested data?**  
<select>
  <option value="Select_answer">Select answer</option>
  <option value="pandas">pandas</option>
  <option value="glom">glom</option>
  <option value="numpy">numpy</option>
  <option value="json">json</option>
</select>

---

**6. What does the `glom` package help you do?**  
<select>
  <option value="Select_answer">Select answer</option>
  <option value="Create_plots">Create plots from data</option>
  <option value="Access_nested_data">Access deeply nested data structures easily</option>
  <option value="Connect_APIs">Connect to APIs</option>
  <option value="Clean_missing_values">Clean missing values in a DataFrame</option>
</select>

---

**7. What is the first step when working with data from an API?**  
<select>
  <option value="Select_answer">Select answer</option>
  <option value="Visualize_data">Visualize the data</option>
  <option value="Parse_JSON">Parse the JSON</option>
  <option value="Send_request">Send a request to the API</option>
  <option value="Save_to_file">Save the data to a file</option>
</select>

---

**8. Which Python library is commonly used to convert JSON data into a DataFrame?**  
<select>
  <option value="Select_answer">Select answer</option>
  <option value="matplotlib">matplotlib</option>
  <option value="glom">glom</option>
  <option value="pandas">pandas</option>
  <option value="requests">requests</option>
</select>

---

**9. How can AI help improve Python code in this notebook?**  
<select>
  <option value="Select_answer">Select answer</option>
  <option value="Generate_random_data">By generating random data</option>
  <option value="Fix_bugs">By automatically fixing bugs and suggesting improvements</option>
  <option value="Visualize_proteins">By visualizing protein structures</option>
  <option value="Encrypt_data">By encrypting sensitive data</option>
</select>

---

**10. Can a single protein chain in a structure be assigned more than one EC number?**  
<select>
  <option value="Select_answer">Select answer</option>
  <option value="No_one_EC">No, each chain can only have one EC number</option>
  <option value="Yes_multiple_reactions">Yes, if the chain catalyzes multiple distinct reactions</option>
  <option value="Only_eukaryotes">Only if the protein is from a eukaryotic organism</option>
  <option value="Only_homodimer">Only if the structure is a homodimer</option>
</select>

---

## 🔍 **4) BONUS CHALLENGE**

---



### ❓ **TASK 4:** Improve the code that uses the PDBe API, with example PDB ID `2xfu`

Enhance the previous code to generate an output table with four columns:

*   EC Number
*   Enzyme Name
*   Reaction
*   Chain ID


<br>

---



### 🧪 **SOLUTION 4:** Improved code (include more info from API)

In [None]:
# Step 1: Extract EC numbers and associated chain IDs
ec_mappings = data_2xfu.get("2xfu", {}).get("EC", {})
rows = []
# Iterate over the items (key-value pairs) of the ec_mappings dictionary
for ec_number, details in ec_mappings.items():
    # Strip the trailing '.' that the API adds to enzyme names;
    # default to "" so that a missing name does not raise an error
    accepted_name = details.get("accepted_name", "").rstrip(".")
    reaction = details.get("reaction")

    # Iterate over the 'mappings' list within the details for each EC number
    for mapping in details.get("mappings", []):
        chain_id = mapping.get("chain_id")
        # Append one row per chain with the EC number, enzyme name, reaction and chain ID
        rows.append({"EC number": ec_number, "Enzymatic name": accepted_name, "Reaction": reaction, "Chain ID": chain_id})

# Step 2: Create a DataFrame
df = pd.DataFrame(rows)

# Step 3: Display a DataFrame / Table
display(df)

| | EC number | Enzymatic name | Reaction | Chain ID |
|---|---|---|---|---|
| 0 | 1.4.3.21 | primary-amine oxidase | a primary methyl amine + O2 + H2O = an aldehyd... | A |
| 1 | 1.4.3.21 | primary-amine oxidase | a primary methyl amine + O2 + H2O = an aldehyd... | B |
| 2 | 1.4.3.4 | monoamine oxidase | a secondary aliphatic amine + O2 + H2O = a pri... | A |
| 3 | 1.4.3.4 | monoamine oxidase | a secondary aliphatic amine + O2 + H2O = a pri... | B |



## 🔍 **5) BONUS CHALLENGE**

### ❓ **TASK 5:** Use AI and the Python package `glom` to simplify the code with example PDB ID `2xfu`

*HINT: May take more than one prompt -- you can ask AI to bug fix.*

<br>

---

### 🧪 **SOLUTION 5:** Simplified by using `glom`, assisted by AI

In [None]:
# Step 1: Extract EC mappings using glom
ec_mappings = glom(data_2xfu, "2xfu.EC")

# Step 2: Build rows
rows = []
for ec_number, details in ec_mappings.items():
    accepted_name = glom(details, Coalesce("accepted_name", default="")).rstrip(".")
    reaction = glom(details, Coalesce("reaction", default=""))
    mappings = glom(details, Coalesce("mappings", default=[]))

    for mapping in mappings:
        chain_id = mapping.get("chain_id")
        rows.append({
            "EC number": ec_number,
            "Enzymatic name": accepted_name,
            "Reaction": reaction,
            "Chain ID": chain_id
        })

# Step 3: Create and display DataFrame
df = pd.DataFrame(rows)
display(df)

| | EC number | Enzymatic name | Reaction | Chain ID |
|---|---|---|---|---|
| 0 | 1.4.3.21 | primary-amine oxidase | a primary methyl amine + O2 + H2O = an aldehyd... | A |
| 1 | 1.4.3.21 | primary-amine oxidase | a primary methyl amine + O2 + H2O = an aldehyd... | B |
| 2 | 1.4.3.4 | monoamine oxidase | a secondary aliphatic amine + O2 + H2O = a pri... | A |
| 3 | 1.4.3.4 | monoamine oxidase | a secondary aliphatic amine + O2 + H2O = a pri... | B |
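
The table repeats each EC number once per chain. When a compact summary is more useful, the rows can be grouped so that each EC number lists its chains. This is a sketch with a hypothetical helper, `chains_per_ec`; the rows are typed in by hand from the EC number and chain ID columns of the table above.

```python
from collections import defaultdict


def chains_per_ec(rows):
    """Group chain IDs by EC number."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["EC number"]].append(row["Chain ID"])
    return dict(grouped)


# Rows copied by hand from the EC number and Chain ID columns
rows = [
    {"EC number": "1.4.3.21", "Chain ID": "A"},
    {"EC number": "1.4.3.21", "Chain ID": "B"},
    {"EC number": "1.4.3.4", "Chain ID": "A"},
    {"EC number": "1.4.3.4", "Chain ID": "B"},
]
print(chains_per_ec(rows))
# {'1.4.3.21': ['A', 'B'], '1.4.3.4': ['A', 'B']}
```

The grouped view also makes it easy to see that chains A and B are each mapped to two EC numbers.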


# Copyright 2025 EMBL - European Bioinformatics Institute

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

---

**1. What does "API" stand for?**

ANSWER: Application Programming Interface

---

**2. Which part of an API endpoint typically specifies the resource being accessed?**

ANSWER: Path

---

**3. In the endpoint `https://api.example.com/users/123`, what does 123 represent?**

ANSWER: User ID

---

**4. What does an EC number represent?**

ANSWER: The type of chemical reaction an enzyme catalyzes


---

**5. How many parts are there in a full EC number?**

ANSWER: 4


---

**1. What is the main purpose of using an API in bioinformatics?**  
ANSWER: To access and retrieve biological data from online databases


---

**2. Which Python keyword is used to define a function?**  
ANSWER: `def`

---

**3. What Python data type is commonly used to store parsed JSON data?**  
ANSWER: Dictionary

---

**4. What is the purpose of transforming nested JSON into a DataFrame?**  
ANSWER: To make the data easier to analyze and manipulate

---

**5. Which Python package is introduced in the notebook for handling nested data?**  
ANSWER: `glom`

---

**6. What does the `glom` package help you do?**  
ANSWER: Access deeply nested data structures easily

---

**7. What is the first step when working with data from an API?**  
ANSWER: Send a request to the API

---

**8. Which Python library is commonly used to convert JSON data into a DataFrame?**  
ANSWER: `pandas`

---

**9. How can AI help improve Python code in this notebook?**  
ANSWER: By automatically fixing bugs and suggesting improvements

---

**10. Can a single protein chain in a structure be assigned more than one EC number?**  
ANSWER: Yes, if the chain catalyzes multiple distinct reactions

---