<a href="https://colab.research.google.com/github/glevans/PDB_Notebooks/blob/main/FEBS_engineering_enzymes_2025/Activity2_Enzymes_and_APIs_ANSWERS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧬 Integrating Enzyme Data: APIs, Identifiers, and Smart Parsing

<img src="https://github.com/glevans/PDB_Notebooks/raw/main/PDBe-logo.png" height="200" align="right">

In this notebook, you'll learn how to:

- Combine information from different APIs
- Understand the relationship between protein identifiers (*e.g.* UniProt IDs), macromolecular structures & EC (Enzyme Commission) numbers
- Transform nested JSON data into a DataFrame
- Use `glom` to simplify code for nested data
- Use AI to help generate useful code & debug code

<br>

---

## ℹ️ **Introduction**

### **What is UniProt?**

[UniProt](https://www.uniprot.org/) (Universal Protein Resource) is a globally recognized, high-quality database that provides comprehensive information on protein sequences, functions, and interactions. The information is provided for proteins across all domains of life. UniProt integrates data from multiple sources, including:

*   research literature
*   experimental findings
*   computational predictions

[UniProt](https://www.uniprot.org/) is hosted and maintained by EMBL-EBI, in collaboration with other international institutions, ensuring its reliability.  Each UniProt entry is assigned a unique identifier and corresponds to a specific combination of protein sequence and source organism. The source organism is defined by taxonomy name and taxonomy ID.

---


Every 2-4 months a new version of UniProt is released with new entries, as well as additions and corrections to existing entries.

More information on the most recent release is available here:

[https://www.uniprot.org/uniprotkb/statistics](https://www.uniprot.org/uniprotkb/statistics)

### **Types of UniProt Entries and Their Uses**

---


🔹 **1. UniProtKB (Universal Protein Knowledgebase)**
UniProtKB entries are divided into UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
- **Purpose**: UniProtKB is a comprehensive resource for **protein sequence and functional information**.
- **Entry name & accession**: Each entry has unique identifiers for tracking and referencing.
- **Types**:
   - *Entry status:* Reviewed (Swiss-Prot) vs. Unreviewed (TrEMBL)
   - *Annotation level:* Manual vs. automatic
- **Example use cases**:
   - Studying the catalytic mechanism of an enzyme from different bacterial species.
   - Designing primers to amplify a specific isoform of a human protein.
   - Building a phylogenetic tree of homologous proteins across species.

---

🔹 **2. UniRef (UniProt Reference Clusters)**
- **Purpose**: UniRef provides **clustered sets of protein sequences** to reduce redundancy and improve search efficiency.
- **Types**:
  - *UniRef100:* Clusters identical sequences and fragments.
  - *UniRef90:* Clusters sequences with ≥90% identity.
  - *UniRef50:* Clusters sequences with ≥50% identity.
- **Use case**: Similarity searches and large-scale analyses.

🔗 [UniRef Overview](https://www.uniprot.org/help/uniref)

---

🔹 **3. UniParc (UniProt Archive)**
- **Purpose**: UniParc is a comprehensive repository of all known protein sequences, regardless of their source or annotation status.
- **Key Feature**: It stores **100% identical sequences** as single entries, even across different species or databases.
- **Use case**: Tracking sequence history and redundancy across databases.

🔗 [UniParc Overview](https://www.uniprot.org/help/uniparc)

### **What is UniRef90?**

**UniRef90** is a clustered set of protein sequences, grouped by sequence similarity.

Specifically:

*  *90% identity threshold:* Proteins that share 90% or higher sequence identity are grouped into the same cluster.


*  *Length coverage:* The alignment must cover at least 80% of the longest sequence.


*  *Single representative:* Each cluster has one "representative" sequence (usually the longest or best-annotated).
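
The two numeric criteria above can be written as a small check. This is only an illustrative sketch: the function name `meets_uniref90_thresholds` and its parameters are hypothetical, it assumes the identity and aligned length have already been computed by an alignment tool, and it is not how UniProt builds its clusters.

```python
def meets_uniref90_thresholds(identity, aligned_length, longest_length):
    """Return True if a pairwise alignment satisfies both thresholds listed above.

    identity        -- fraction of identical residues in the alignment (0.0 to 1.0)
    aligned_length  -- residues of the longest sequence covered by the alignment
    longest_length  -- total length of the longest sequence
    """
    coverage = aligned_length / longest_length
    return identity >= 0.90 and coverage >= 0.80


# Hypothetical numbers, chosen only to exercise the function
print(meets_uniref90_thresholds(0.95, 350, 374))  # True: both thresholds met
print(meets_uniref90_thresholds(0.95, 200, 374))  # False: coverage is below 80%
```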


---



Why Does **UniRef90** Exist?

The UniProt database contains millions of protein sequences, many of which are nearly identical (*e.g.* same protein from closely related species). **UniRef90** reduces this redundancy while preserving biological diversity.

### **Isoforms & canonical sequences**

A protein isoform is a variant of a protein that is produced from the same gene but differs in its amino acid sequence due to mechanisms like alternative splicing, alternative promoter usage, or alternative translation initiation. These isoforms can have distinct functions, localizations, or interactions within the cell.

Whenever possible, all the protein products encoded by one gene in a given species are described in a single **UniProtKB/Swiss-Prot** entry, including all isoforms generated by alternative splicing, alternative promoter usage, and alternative translation initiation.

When a **UniProtKB/Swiss-Prot** entry has isoforms, one of the gene products is chosen as the canonical sequence.

The canonical sequence in a **UniProtKB/Swiss-Prot** entry is selected based on several criteria to ensure it is representative and informative:

*  *Functionality:* The sequence must correspond to a functional protein product.
*  *Expression:* It should be widely expressed across tissues or conditions.
*  *Evolutionary Conservation:* It is encoded by conserved exons found in orthologous sequences across species.
*  *Consensus with Other Resources:* It matches consensus sequences from other annotation efforts, like:
   - [CCDS (Consensus Coding Sequence)](https://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi) - human and mouse protein-coding regions
   - [MANE (Matched Annotation from NCBI and EMBL-EBI)](https://www.ncbi.nlm.nih.gov/refseq/MANE/) - human protein-coding regions

### **What is a Python definition?**

A Python definition (a function, introduced with the `def` keyword) is a way to create a reusable block of code that does something specific.
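
As a small illustration, the sketch below defines a function and then reuses it with different inputs. The name `count_residue` and the short sequence are made up for this example and are not used elsewhere in the notebook.

```python
def count_residue(sequence, residue):
    """Count how many times a one-letter residue code occurs in a protein sequence."""
    return sequence.upper().count(residue.upper())


# The definition can now be reused with different inputs
print(count_residue("MKTAYIAKQR", "K"))  # 2
print(count_residue("mktayiakqr", "a"))  # 2
```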

### **What is an API?**

<img src="https://github.com/glevans/PDB_Notebooks/raw/main/API_graphic.png" height="200" align="right">

An API (Application Programming Interface) is a programmatic way to obtain information. APIs work in the background to provide the information we see on websites such as [PDBe's website](https://pdbe.org). Using Python code to access APIs enables faster analysis than viewing the same information directly on websites.

For more information on PDBe's APIs, visit:

*   [http://www.ebi.ac.uk/pdbe/pdbe-rest-api](http://www.ebi.ac.uk/pdbe/pdbe-rest-api)
*   [https://www.ebi.ac.uk/pdbe/api/v2/#/](https://www.ebi.ac.uk/pdbe/api/v2/#/)
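
Many web APIs, including the ones used in this notebook, return their data as JSON text. The sketch below shows how such text becomes nested Python dictionaries and lists. The JSON string is hand-written for illustration and is not copied from a real PDBe response.

```python
import json

# A hand-written JSON string standing in for the text an API might return
# (hypothetical content, not a real PDBe response)
response_text = '{"1abc": {"title": "Example entry", "chains": ["A", "B"]}}'

# json.loads converts JSON text into nested Python dictionaries and lists
data = json.loads(response_text)

print(type(data).__name__)           # dict
print(data["1abc"]["title"])         # Example entry
print(len(data["1abc"]["chains"]))   # 2
```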

### **What is a Notebook?**

A **Colab** or **Jupyter** notebook corresponds to a file with the extension `.ipynb`.

Notebooks are useful for sharing examples of code and exploring programmatic ways of handling data.

<br>

To use this notebook in **Colab** (link at top of the page):

*   you will need a Google account
*   you will need to be logged in to that Google account (this also logs you in to Google Colab)

<br>

To use it as a **Jupyter** notebook, download it and view it with either:

*   a local installation of [Jupyter](https://jupyter.org/)
*   a browser instance of [JupyterLab](https://jupyter.org/try-jupyter/lab/)

<br>

<br>

---

## How to use this notebook <a name="Quick Start"></a>
1. To run a code cell, click on the cell to select it. You will notice a play button (▶️) on the left side of the cell. Click on the play button or press Shift+Enter to run the code in the selected cell.
2. The code will start executing, and you will see the output, if any, displayed below the code cell.
3. Move to the next code cell and repeat steps 1 and 2 until you have executed all the desired code cells in sequence.
4. The currently running cell is indicated by a circle with a stop sign next to it. If you need to stop or interrupt the execution of a code cell, you can click on the stop button (■) located next to the play button.
5. The exercises & bonus challenges have empty code cells; you will need to add code to these cells before running them.

*Remember to run the code cells in the correct order, as their execution might depend on variables or functions defined in previous cells. You can modify the code in a code cell and re-run it to see updated results.*

<br>

---

## Contact us

If you experience any bugs, please contact pdbehelp@ebi.ac.uk and put "Help with" and the title of the notebook in the subject line of the message.


## ⚙️ **Setup**



### 📦 Step 1: Install Required Package
Ensure the `glom` package is installed.

This is used to simplify data extraction from nested data structures.

To run a shell (Bash) command in a notebook, add `!` before the command.

Many Python packages are available from [PyPI](https://pypi.org/).

The command `pip install` installs packages from PyPI.
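
If you would like to check whether a package is already available before installing it, a minimal sketch using only the standard library is shown below. The helper name `is_installed` is made up for this example, and the printed result depends on your own environment.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python; "glom" is only found after it has been installed
for name in ["json", "glom"]:
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```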

In [279]:
# Install glom if not already installed
!pip install glom



### 📥 Step 2: Import Modules

Import all necessary Python modules for data fetching, transformation, and display.

We will be using Python packages / modules:

*   [requests](https://requests.readthedocs.io/) - allows you to send HTTP/1.1 requests extremely easily
*   [re](https://docs.python.org/3/library/re.html) - allows use of regular expression matching operations similar to those found in Perl
*   [pprint](https://docs.python.org/3/library/pprint.html) - makes data look more readable / pretty
*   [pandas](https://pandas.pydata.org/) - for working with data in tables, like spreadsheets
*   [glom](https://glom.readthedocs.io/en/stable/) - for exploring and accessing information in nested data structures, such as that from APIs.

<br>

In [280]:
# Import necessary modules
import re
import requests
from pprint import pprint
from glom import glom, Coalesce, PathAccessError, Path
import pandas as pd

### ✔️ Step 3: Setting up functions to check identifiers

Setting up two definitions to check whether identifiers are correctly formatted and that the input is a string (not a number).

In [281]:
def is_pdb_id(identifier):
    """
    Check if a string is a valid PDB ID.

    Parameters:
        identifier (str): The string to check.

    Returns:
        bool: True if it matches PDB ID format, False otherwise.
    """
    if isinstance(identifier, str):
        # PDB IDs are 4-character strings starting with a digit
        pattern = r"^[0-9][A-Za-z0-9]{3}$"
        return bool(re.match(pattern, identifier))
    else:
        return False

In [282]:
# Example usage
print(is_pdb_id("2XFU"))           # True
print(is_pdb_id("2xfu"))           # True
print(is_pdb_id("P69905"))         # False
print(is_pdb_id("not_a_protein"))  # False
print(is_pdb_id(123))              # False
print(is_pdb_id(10.25))            # False

True
True
False
False
False
False
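
One detail worth knowing when identifiers are read from a file: in Python's `re` module, `$` also matches just before a trailing newline, so a pattern like the one in `is_pdb_id` accepts `"2xfu\n"`. The sketch below shows a stricter variant based on `re.fullmatch`. The name `is_pdb_id_strict` is made up for this example and is not part of the notebook's own code.

```python
import re

PDB_ID_PATTERN = r"[0-9][A-Za-z0-9]{3}"


def is_pdb_id_strict(identifier):
    """Like is_pdb_id, but the whole string must match, trailing newline included."""
    return isinstance(identifier, str) and re.fullmatch(PDB_ID_PATTERN, identifier) is not None


# "$" in re.match also matches just before a trailing newline
print(bool(re.match(r"^[0-9][A-Za-z0-9]{3}$", "2xfu\n")))  # True
print(is_pdb_id_strict("2xfu\n"))                          # False
print(is_pdb_id_strict("2xfu\n".strip()))                  # True
```

Stripping whitespace before validation, as in the last line, is usually the simplest fix.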


In [283]:
def is_uniprot_id(identifier):
    """
    Check if a string is a valid UniProt accession or entry name.

    Parameters:
        identifier (str): The string to check.

    Returns:
        bool: True if it matches UniProt ID patterns, False otherwise.
    """
    if isinstance(identifier, str):
      # UniProt accessions: 6 or 10 uppercase letters/digits (a loose check;
      # the pattern does not enforce that the first character is a letter)
      accession_pattern = r"^[A-Z0-9]{6}$|^[A-Z0-9]{10}$"
      # UniProt entry names: e.g., HBA_HUMAN
      entry_name_pattern = r"^[A-Z0-9]+_[A-Z0-9]+$"
      return bool(re.match(accession_pattern, identifier)) or bool(re.match(entry_name_pattern, identifier))
    else:
      return False

In [284]:
# Example usage
print(is_uniprot_id("P69905"))            # True
print(is_uniprot_id("G1L2V2"))            # True
print(is_uniprot_id("HIS4_MYCTA"))        # True
print(is_uniprot_id("A0A081HVH1_9MYCO"))  # True
print(is_uniprot_id("A0A081HVH1"))        # True
print(is_uniprot_id("not_a_protein"))     # False
print(is_uniprot_id("2xfu"))              # False
print(is_uniprot_id(123))                 # False
print(is_uniprot_id(10.25))               # False

True
True
True
True
True
False
False
False
False
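
The check in `is_uniprot_id` is deliberately loose: any 6 or 10 uppercase letters or digits pass, so a string such as `"123456"` is accepted. If a tighter check is needed, one option is the accession pattern described in UniProt's help pages. The sketch below reproduces that pattern from memory, so treat it as an assumption and compare it with the current UniProt documentation before relying on it. The name `is_uniprot_accession_strict` is made up for this example, and entry names such as `HIS4_MYCTA` are intentionally not covered.

```python
import re

# Accession pattern as described in UniProt's help pages (reproduced from memory:
# an assumption to verify against the current documentation)
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession_strict(identifier):
    """Return True only if the whole string matches the accession pattern."""
    return isinstance(identifier, str) and UNIPROT_ACCESSION.fullmatch(identifier) is not None


print(is_uniprot_accession_strict("P69905"))      # True
print(is_uniprot_accession_strict("A0A081HVH1"))  # True
print(is_uniprot_accession_strict("123456"))      # False
```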


### 🌐 Step 4: Setting up variables

Full list of PDBe API endpoints is available from: https://www.ebi.ac.uk/pdbe/api/v2/doc/

In [285]:
# Defining variables to describe API urls
ebi_host = "https://www.ebi.ac.uk/"

pdbe_api_base = ebi_host + "pdbe/api/v2/"

best_isoform_url = pdbe_api_base + "mappings/isoforms/"

# We have defined a variable called best_isoform_url with the following value:
#### https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/

structures_ranked_url = pdbe_api_base + "uniprot/best_structures/"

# We have defined a variable called structures_ranked_url with the following value:
#### https://www.ebi.ac.uk/pdbe/api/v2/uniprot/best_structures/
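
Joining URL pieces with `+` works as long as every piece has its slashes in the right place. As an alternative, the sketch below normalises the slashes and percent-encodes the identifier before joining. The helper name `build_pdbe_url` is made up for this example; the base URL and endpoint paths are the ones defined above.

```python
from urllib.parse import quote

PDBE_API_BASE = "https://www.ebi.ac.uk/pdbe/api/v2/"


def build_pdbe_url(endpoint, identifier):
    """Join base, endpoint and identifier with exactly one slash between the parts."""
    parts = [
        PDBE_API_BASE.rstrip("/"),
        endpoint.strip("/"),
        quote(identifier.strip(), safe=""),  # percent-encode anything unsafe in a URL path
    ]
    return "/".join(parts)


print(build_pdbe_url("mappings/isoforms/", "1ffy"))
# https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/1ffy
print(build_pdbe_url("/uniprot/best_structures", " P41972 "))
# https://www.ebi.ac.uk/pdbe/api/v2/uniprot/best_structures/P41972
```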

UniProt has many APIs.

List of some UniProt APIs is here:
[https://www.uniprot.org/help/api_retrieve_entries](https://www.uniprot.org/help/api_retrieve_entries)

More information on UniProt APIs:
[https://www.uniprot.org/help/api_queries](https://www.uniprot.org/help/api_queries)

In [286]:
# Defining more variables to describe API urls
uniprot_rest_host = "https://rest.uniprot.org/"

uniref_url = uniprot_rest_host + "uniref/"

# We have defined a variable called uniref_url with the following value:
#### https://rest.uniprot.org/uniref/

uniref_search_url = uniprot_rest_host + "uniref/search?query="

# We have defined a variable called uniref_search_url with the following value:
#### https://rest.uniprot.org/uniref/search?query=
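
Search URLs carry their options as query parameters (`query=...&size=...`). One way to build them safely, sketched below, is `urllib.parse.urlencode`, which also encodes spaces and other special characters. The helper name `build_uniref_search_url` is made up for this example.

```python
from urllib.parse import urlencode

UNIREF_SEARCH = "https://rest.uniprot.org/uniref/search"


def build_uniref_search_url(query, size=3):
    """Build a UniRef search URL with percent-encoded query parameters."""
    params = urlencode({"query": query, "size": size})
    return f"{UNIREF_SEARCH}?{params}"


print(build_uniref_search_url("G1L2V2"))
# https://rest.uniprot.org/uniref/search?query=G1L2V2&size=3
print(build_uniref_search_url("alcohol dehydrogenase", size=1))
# https://rest.uniprot.org/uniref/search?query=alcohol+dehydrogenase&size=1
```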

## 💻 **Setting-up data retrieval from API endpoints**

Setting-up new Python definitions to retrieve data from different PDBe and UniProt API endpoints.

### **1. Best UniProt isoform for a PDB entry**

In [287]:
def fetch_best_isoform(pdb_id):
    # Validate PDB ID format (using previous definition)
    if is_pdb_id(pdb_id) is False:
        print(f"Invalid PDB ID: {pdb_id}")
        return None
    else:
      print(f"Validated PDB ID: {pdb_id}")

      # Make GET request
      full_url = f"{best_isoform_url}{pdb_id}"
      response = requests.get(full_url)
      print("URL:", full_url)

    if response.status_code == 200:
        print("Data retrieved successfully.")
        return response.json()
    else:
        print(f"Failed to retrieve data. Status code: {response.status_code}")
        return None

In [288]:
# Example usage
pdb_id = "1ffy" # Example: Isoleucine--tRNA ligase
result_1ffy = fetch_best_isoform(pdb_id)
pprint(result_1ffy)

Validated PDB ID: 1ffy
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/1ffy
Data retrieved successfully.
{'1ffy': {'UniProt': {'P41972': {'identifier': 'SYI1_STAAU',
                                 'mappings': [{'chain_id': 'A',
                                               'end': {'author_insertion_code': '',
                                                       'author_residue_number': 917,
                                                       'residue_number': 917},
                                               'entity_id': 2,
                                               'identity': 0.99,
                                               'pdb_end': 917,
                                               'pdb_start': 1,
                                               'start': {'author_insertion_code': '',
                                                         'author_residue_number': 1,
                                                         'residue_number': 1},
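
The nesting in this result (PDB ID, then `UniProt`, then accession, then a list of mappings) can be flattened with plain loops. The sketch below does this on a trimmed, hand-copied sample of the output above, so it runs without calling the API. The function name `summarise_isoform_mappings` is made up for this example.

```python
# Trimmed sample, copied by hand from the output above
sample_isoforms = {
    "1ffy": {
        "UniProt": {
            "P41972": {
                "identifier": "SYI1_STAAU",
                "mappings": [
                    {"chain_id": "A", "entity_id": 2, "identity": 0.99,
                     "pdb_start": 1, "pdb_end": 917},
                ],
            }
        }
    }
}


def summarise_isoform_mappings(result):
    """Flatten the nested result into one small dictionary per chain mapping."""
    rows = []
    for pdb_id, databases in result.items():
        for accession, entry in databases.get("UniProt", {}).items():
            for mapping in entry.get("mappings", []):
                rows.append({
                    "pdb_id": pdb_id,
                    "accession": accession,
                    "chain_id": mapping.get("chain_id"),
                    "identity": mapping.get("identity"),
                })
    return rows


print(summarise_isoform_mappings(sample_isoforms))
```

A list of flat dictionaries like this can be passed straight to `pd.DataFrame(rows)` to obtain a table.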
         

In [289]:
def fetch_isoforms_for_pdbids(pdb_ids_string):
    """
    Process a comma-separated string of PDB IDs and fetch data for each.
    """
    pdb_ids = [pid.strip() for pid in pdb_ids_string.split(",")]
    results = {}

    for pdb_id in pdb_ids:
        data = fetch_best_isoform(pdb_id)
        if data:
            results[pdb_id] = data

    return results

In [290]:
# Example usage:

pdb_ids_as_list = [
    "3LII", "3K5V", "1U70", "2XFU", "2X91",
]

pdb_ids_string = ", ".join(pdb_ids_as_list)

print(pdb_ids_string)

results = fetch_isoforms_for_pdbids(pdb_ids_string)
pprint(results)


3LII, 3K5V, 1U70, 2XFU, 2X91
Validated PDB ID: 3LII
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/3LII
Data retrieved successfully.
Validated PDB ID: 3K5V
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/3K5V
Data retrieved successfully.
Validated PDB ID: 1U70
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/1U70
Data retrieved successfully.
Validated PDB ID: 2XFU
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/2XFU
Data retrieved successfully.
Validated PDB ID: 2X91
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/2X91
Data retrieved successfully.
{'1U70': {'1u70': {'UniProt': {'P00375': {'identifier': 'DYR_MOUSE',
                                          'mappings': [{'chain_id': 'A',
                                                        'end': {'author_insertion_code': '',
                                                                'author_residue_number': 186,
                                                                'residue

### **2. PDB structures for a UniProt entry (ranked)**

In [291]:
def fetch_ranked_structures(uniprot_id):
    # Validate UniProt ID format (using previous definition)
    if is_uniprot_id(uniprot_id) is False:
        print(f"Invalid UniProt ID: {uniprot_id}")
        return None
    else:
      print(f"Validated UniProt ID: {uniprot_id}")

      # Make GET request
      full_url = f"{structures_ranked_url}{uniprot_id}"
      response = requests.get(full_url)
      print("URL:", full_url)

    if response.status_code == 200:
        print("Data retrieved successfully.")
        return response.json()
    else:
        print(f"Failed to retrieve data. Status code: {response.status_code}")
        return None

In [292]:
# Example usage
uniprot_id = "P41972" # Example: Isoleucine--tRNA ligase
result_P41972 = fetch_ranked_structures(uniprot_id)
pprint(result_P41972)

Validated UniProt ID: P41972
URL: https://www.ebi.ac.uk/pdbe/api/v2/uniprot/best_structures/P41972
Data retrieved successfully.
{'P41972': [{'chain_id': 'B',
             'coverage': 1.0,
             'end': 917,
             'entity_id': 2,
             'experimental_method': 'X-ray diffraction',
             'observed_regions': [{'unp_end': 917, 'unp_start': 1}],
             'pdb_id': '1qu2',
             'preferred_assembly_id': 1,
             'resolution': 2.2,
             'start': 1,
             'tax_id': 1280,
             'unp_end': 917,
             'unp_start': 1},
            {'chain_id': 'B',
             'coverage': 1.0,
             'end': 917,
             'entity_id': 2,
             'experimental_method': 'X-ray diffraction',
             'observed_regions': [{'unp_end': 917, 'unp_start': 1}],
             'pdb_id': '1ffy',
             'preferred_assembly_id': 1,
             'resolution': 2.2,
             'start': 1,
             'tax_id': 1280,
             'unp
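
Each item in this list describes one chain of one PDB entry, so it is easy to filter on fields such as `resolution` and `coverage`. The sketch below works on a trimmed, hand-copied sample of the output above. The function name `filter_structures` and the default cut-offs are arbitrary choices for this example, and skipping entries without a resolution value is an assumption about how such entries might appear.

```python
# Trimmed sample, copied by hand from the output above
sample_ranked = {
    "P41972": [
        {"pdb_id": "1qu2", "chain_id": "B", "coverage": 1.0, "resolution": 2.2},
        {"pdb_id": "1ffy", "chain_id": "B", "coverage": 1.0, "resolution": 2.2},
    ]
}


def filter_structures(ranked, accession, max_resolution=2.5, min_coverage=0.9):
    """Return PDB IDs that meet the resolution and coverage cut-offs, in API order."""
    selected = []
    for structure in ranked.get(accession, []):
        resolution = structure.get("resolution")
        # Assumption: some entries may have no resolution value; skip them
        if resolution is None:
            continue
        if resolution <= max_resolution and structure.get("coverage", 0) >= min_coverage:
            selected.append(structure["pdb_id"])
    return selected


print(filter_structures(sample_ranked, "P41972"))                      # ['1qu2', '1ffy']
print(filter_structures(sample_ranked, "P41972", max_resolution=2.0))  # []
```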

### **3. Retrieve UniRef clusters for a UniProt ID**

In [293]:
def fetch_uniref_clusters(uniprot_id):
    # Validate UniProt ID format (using previous definition)
    if is_uniprot_id(uniprot_id) is False:
        print(f"Invalid UniProt ID: {uniprot_id}")
        return None
    else:
      print(f"Validated UniProt ID: {uniprot_id}")

      # Make GET request
      full_url = f"{uniref_search_url}{uniprot_id}&size=3"
      print("URL:", full_url)
      response = requests.get(full_url)

    if response.status_code == 200:
        print("Data retrieved successfully.")
        return response.json()
    else:
        print(f"Failed to retrieve data. Status code: {response.status_code}")
        return None

In [294]:
# Example usage
uniprot_id = "G1L2V2"  # Example: Alcohol dehydrogenase (from Giant Panda)
uniref_clusters_G1L2V2 = fetch_uniref_clusters(uniprot_id)
pprint(uniref_clusters_G1L2V2)

Validated UniProt ID: G1L2V2
URL: https://rest.uniprot.org/uniref/search?query=G1L2V2&size=3
Data retrieved successfully.
{'results': [{'commonTaxon': {'scientificName': 'Laurasiatheria',
                              'taxonId': 314145},
              'entryType': 'UniRef90',
              'goTerms': [{'aspect': 'GO Molecular Function',
                           'goId': 'GO:0008270'},
                          {'aspect': 'GO Cellular Component',
                           'goId': 'GO:0110165'},
                          {'aspect': 'GO Biological Process',
                           'goId': 'GO:0042572'},
                          {'aspect': 'GO Biological Process',
                           'goId': 'GO:0042573'}],
              'id': 'UniRef90_P00327',
              'memberCount': 59,
              'memberIdTypes': ['UniParc',
                                'UniProtKB Reviewed (Swiss-Prot)',
                                'UniProtKB Unreviewed (TrEMBL)'],
              'members': [
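
The cluster `id` returned here (`UniRef90_P00327`) is made of the cluster type and the identifier of the representative sequence. The sketch below pulls out the second part from a trimmed, hand-copied sample of the output above. The function name `find_cluster_accession` is made up for this example.

```python
# Trimmed sample, copied by hand from the output above
sample_search = {
    "results": [
        {"id": "UniRef90_P00327", "entryType": "UniRef90", "memberCount": 59},
    ]
}


def find_cluster_accession(search_result, entry_type="UniRef90"):
    """Return the part after the underscore for the first cluster of the given type."""
    for cluster in search_result.get("results", []):
        if cluster.get("entryType") == entry_type:
            # Cluster ids look like "<entryType>_<identifier>"
            _prefix, _sep, accession = cluster["id"].partition("_")
            return accession
    return None


print(find_cluster_accession(sample_search))               # P00327
print(find_cluster_accession(sample_search, "UniRef50"))   # None
```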

### **4. Retrieve UniRef90 clusters contents**

In [295]:
def fetch_uniref90_content(uniprot_id):
    # Validate UniProt ID format (using previous definition)
    if is_uniprot_id(uniprot_id) is False:
        print(f"Invalid UniProt ID: {uniprot_id}")
        return None
    else:
      print(f"Validated UniProt ID: {uniprot_id}")

      # Make GET request
      full_url = f"{uniref_url}UniRef90_{uniprot_id}"
      print("URL:", full_url)
      response = requests.get(full_url)

    if response.status_code == 200:
        print("Data retrieved successfully.")
        return response.json()
    else:
        print(f"Failed to retrieve data. Status code: {response.status_code}")
        return None

In [296]:
# Example usage
uniprot_id = "P00327"  # Example: Alcohol dehydrogenase (from Horse)
uniref90_content_P00327 = fetch_uniref90_content(uniprot_id)
pprint(uniref90_content_P00327)

Validated UniProt ID: P00327
URL: https://rest.uniprot.org/uniref/UniRef90_P00327
Data retrieved successfully.
{'commonTaxon': {'scientificName': 'Laurasiatheria', 'taxonId': 314145},
 'entryType': 'UniRef90',
 'goTerms': [{'aspect': 'GO Molecular Function', 'goId': 'GO:0008270'},
             {'aspect': 'GO Cellular Component', 'goId': 'GO:0110165'},
             {'aspect': 'GO Biological Process', 'goId': 'GO:0042572'},
             {'aspect': 'GO Biological Process', 'goId': 'GO:0042573'}],
 'id': 'UniRef90_P00327',
 'memberCount': 59,
 'members': [{'accessions': ['P00328'],
              'memberId': 'ADH1S_HORSE',
              'memberIdType': 'UniProtKB ID',
              'organismName': 'Equus caballus (Horse)',
              'organismTaxId': 9796,
              'proteinName': 'Alcohol dehydrogenase S chain',
              'sequenceLength': 374,
              'uniparcId': 'UPI000016C425',
              'uniref100Id': 'UniRef100_P00328',
              'uniref50Id': 'UniRef50_P0032
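
The `members` list is where the individual proteins of the cluster live. The sketch below lists the first accession, entry name and organism of each member, using a trimmed, hand-copied sample of the output above (only one of the 59 members is included). The function name `list_members` is made up for this example.

```python
# Trimmed sample, copied by hand from the output above (one member only)
sample_cluster = {
    "id": "UniRef90_P00327",
    "memberCount": 59,
    "members": [
        {"accessions": ["P00328"], "memberId": "ADH1S_HORSE",
         "organismName": "Equus caballus (Horse)", "sequenceLength": 374},
    ],
}


def list_members(cluster):
    """Return (first accession, member id, organism) for each cluster member."""
    rows = []
    for member in cluster.get("members", []):
        # Fall back to None if a member carries no accession list
        accessions = member.get("accessions") or [None]
        rows.append((accessions[0], member.get("memberId"), member.get("organismName")))
    return rows


print(list_members(sample_cluster))
# [('P00328', 'ADH1S_HORSE', 'Equus caballus (Horse)')]
```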

### Example data to test code

Example data retrieved using the 4 new Python definitions:

1.  `result_1ffy`
2.  `result_P41972`
3.  `uniref_clusters_G1L2V2`
4.  `uniref90_content_P00327`

## 🧭 **Exploring API endpoint Python definitions**


### 🔑 Step 1: Top-Level Keys

Use this to get a quick overview of the top-level division of information from the API.

In [297]:
# Print top-level keys to understand the structure
print(result_1ffy.keys())
print(result_P41972.keys())
print(uniref_clusters_G1L2V2.keys())
print(uniref90_content_P00327.keys())

dict_keys(['1ffy'])
dict_keys(['P41972'])
dict_keys(['results'])
dict_keys(['id', 'name', 'memberCount', 'updated', 'entryType', 'commonTaxon', 'seedId', 'goTerms', 'representativeMember', 'members'])


### 🧰 Step 2: Map All Keys

We are using a Python definition to find all the dictionary keys in the nested structure.

This is a useful approach for understanding deeply nested JSON objects.

🧠 **Function: map_keys**

In [298]:
# Define a recursive function to map all keys
def map_keys(d, level=0, path=''):
    # If the current object is a dictionary
    if isinstance(d, dict):
        for k, v in d.items():
            # Build the full path to the current key
            full_path = f"{path}.{k}" if path else k
            # Print the key with indentation based on the current level
            print("  " * level + f"- {full_path}")
            # Recursively call map_keys on the value
            map_keys(v, level + 1, full_path)

    # If the current object is a list
    elif isinstance(d, list):
        for i, item in enumerate(d):
            # Build the full path to the current list index
            full_path = f"{path}[{i}]"
            # Recursively call map_keys on the list item
            map_keys(item, level + 1, full_path)
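
`map_keys` prints the paths, which is handy for reading but not for reuse. A possible variant, sketched below, yields the same dotted paths so they can be stored in a list, searched or counted. The name `collect_paths` and the small `nested` sample are made up for this example.

```python
def collect_paths(data, path=""):
    """Yield the dotted path of every dictionary key in a nested structure."""
    if isinstance(data, dict):
        for key, value in data.items():
            full_path = f"{path}.{key}" if path else str(key)
            yield full_path
            yield from collect_paths(value, full_path)
    elif isinstance(data, list):
        for index, item in enumerate(data):
            yield from collect_paths(item, f"{path}[{index}]")


# Small hand-written sample using key names seen in the UniRef output
nested = {"results": [{"id": "UniRef90_P00327", "commonTaxon": {"taxonId": 314145}}]}
paths = list(collect_paths(nested))
print(paths)
# ['results', 'results[0].id', 'results[0].commonTaxon', 'results[0].commonTaxon.taxonId']
```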

▶️ **Run the Function**

#### Mapping keys for Result 1

In [299]:
# Call the function on your JSON-like data structure
map_keys(result_1ffy)

- 1ffy
  - 1ffy.UniProt
    - 1ffy.UniProt.P41972
      - 1ffy.UniProt.P41972.name
      - 1ffy.UniProt.P41972.mappings
          - 1ffy.UniProt.P41972.mappings[0].entity_id
          - 1ffy.UniProt.P41972.mappings[0].chain_id
          - 1ffy.UniProt.P41972.mappings[0].struct_asym_id
          - 1ffy.UniProt.P41972.mappings[0].unp_start
          - 1ffy.UniProt.P41972.mappings[0].unp_end
          - 1ffy.UniProt.P41972.mappings[0].pdb_start
          - 1ffy.UniProt.P41972.mappings[0].pdb_end
          - 1ffy.UniProt.P41972.mappings[0].start
            - 1ffy.UniProt.P41972.mappings[0].start.author_residue_number
            - 1ffy.UniProt.P41972.mappings[0].start.author_insertion_code
            - 1ffy.UniProt.P41972.mappings[0].start.residue_number
          - 1ffy.UniProt.P41972.mappings[0].end
            - 1ffy.UniProt.P41972.mappings[0].end.author_residue_number
            - 1ffy.UniProt.P41972.mappings[0].end.author_insertion_code
            - 1ffy.UniProt.P41972.mappings[0]

#### Mapping keys for Result 2

In [300]:
# Call the function on your JSON-like data structure
map_keys(result_P41972)

- P41972
    - P41972[0].experimental_method
    - P41972[0].tax_id
    - P41972[0].resolution
    - P41972[0].pdb_id
    - P41972[0].chain_id
    - P41972[0].entity_id
    - P41972[0].preferred_assembly_id
    - P41972[0].observed_regions
        - P41972[0].observed_regions[0].unp_start
        - P41972[0].observed_regions[0].unp_end
    - P41972[0].start
    - P41972[0].end
    - P41972[0].unp_start
    - P41972[0].unp_end
    - P41972[0].coverage
    - P41972[1].experimental_method
    - P41972[1].tax_id
    - P41972[1].resolution
    - P41972[1].pdb_id
    - P41972[1].chain_id
    - P41972[1].entity_id
    - P41972[1].preferred_assembly_id
    - P41972[1].observed_regions
        - P41972[1].observed_regions[0].unp_start
        - P41972[1].observed_regions[0].unp_end
    - P41972[1].start
    - P41972[1].end
    - P41972[1].unp_start
    - P41972[1].unp_end
    - P41972[1].coverage
    - P41972[2].experimental_method
    - P41972[2].tax_id
    - P41972[2].resolution
    - P41972[

#### Mapping keys for Result 3

In [301]:
# Call the function on your JSON-like data structure
map_keys(uniref_clusters_G1L2V2)

- results
    - results[0].id
    - results[0].name
    - results[0].updated
    - results[0].entryType
    - results[0].commonTaxon
      - results[0].commonTaxon.scientificName
      - results[0].commonTaxon.taxonId
    - results[0].memberCount
    - results[0].organismCount
    - results[0].representativeMember
      - results[0].representativeMember.memberIdType
      - results[0].representativeMember.memberId
      - results[0].representativeMember.organismName
      - results[0].representativeMember.organismTaxId
      - results[0].representativeMember.sequenceLength
      - results[0].representativeMember.proteinName
      - results[0].representativeMember.accessions
      - results[0].representativeMember.uniref50Id
      - results[0].representativeMember.uniref100Id
      - results[0].representativeMember.uniparcId
      - results[0].representativeMember.sequence
        - results[0].representativeMember.sequence.value
        - results[0].representativeMember.sequence.length


#### Mapping keys for Result 4

In [302]:
# Call the function on your JSON-like data structure
map_keys(uniref90_content_P00327)

- id
- name
- memberCount
- updated
- entryType
- commonTaxon
  - commonTaxon.scientificName
  - commonTaxon.taxonId
- seedId
- goTerms
    - goTerms[0].goId
    - goTerms[0].aspect
    - goTerms[1].goId
    - goTerms[1].aspect
    - goTerms[2].goId
    - goTerms[2].aspect
    - goTerms[3].goId
    - goTerms[3].aspect
- representativeMember
  - representativeMember.memberIdType
  - representativeMember.memberId
  - representativeMember.organismName
  - representativeMember.organismTaxId
  - representativeMember.sequenceLength
  - representativeMember.proteinName
  - representativeMember.accessions
  - representativeMember.uniref50Id
  - representativeMember.uniref100Id
  - representativeMember.uniparcId
  - representativeMember.sequence
    - representativeMember.sequence.value
    - representativeMember.sequence.length
    - representativeMember.sequence.molWeight
    - representativeMember.sequence.crc64
    - representativeMember.sequence.md5
- members
    - members[0].memberIdType


### 🗺️ Step 3: Generate a JSON **Structure** Report
We are using a Python definition to see all dictionary keys.

This code reports, at each level in the nested data, whether the data is structured as a:

*   dictionary
*   list
*   single value (string, integer, float, *etc.*)

🧠 **Function: json_structure_report**

In [303]:
def json_structure_report(data, level=0, path='root', show_values=False, max_depth=None):
    """
    Recursively reports the type of each layer in a nested JSON-like structure.

    Parameters:
    - data: The JSON-like object (dict or list) to inspect.
    - level: Current depth level (used for indentation).
    - path: String representing the path to the current node.
    - show_values: If True, prints the value for non-dict/list types.
    - max_depth: If set, limits the depth of recursion.
    """
    indent = "  " * level  # Indentation for visual hierarchy

    # Stop recursion if max_depth is reached
    if max_depth is not None and level > max_depth:
        print(f"{indent}{path} ... (max depth reached)")
        return

    if isinstance(data, dict):
        print(f"{indent}{path} is a dictionary with {len(data)} keys: {list(data.keys())}")
        for key, value in data.items():
            json_structure_report(value, level + 1, f"{path}.{key}", show_values, max_depth)

    elif isinstance(data, list):
        print(f"{indent}{path} is a list with {len(data)} items")
        for i, item in enumerate(data):
            json_structure_report(item, level + 1, f"{path}[{i}]", show_values, max_depth)

    else:
        # For primitive types (str, int, etc.)
        type_name = type(data).__name__
        if show_values:
            print(f"{indent}{path} is a {type_name} with value: {repr(data)}")
        else:
            print(f"{indent}{path} is a {type_name}")
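
For large responses the full report can get long. A compact alternative, sketched below, walks the structure in the same way but only counts how many values of each type it finds. The name `count_leaf_types` and the small sample (field names and values taken from the best-structures output) are choices made for this example.

```python
from collections import Counter


def count_leaf_types(data, counts=None):
    """Count the non-dict, non-list values in a nested structure, grouped by type name."""
    if counts is None:
        counts = Counter()
    if isinstance(data, dict):
        for value in data.values():
            count_leaf_types(value, counts)
    elif isinstance(data, list):
        for item in data:
            count_leaf_types(item, counts)
    else:
        counts[type(data).__name__] += 1
    return counts


sample_structure = {
    "pdb_id": "1qu2",
    "resolution": 2.2,
    "tax_id": 1280,
    "observed_regions": [{"unp_start": 1, "unp_end": 917}],
}
print(dict(count_leaf_types(sample_structure)))
# {'str': 1, 'float': 1, 'int': 3}
```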

▶️ **Run the Function**

#### Data structure Result 1

In [304]:
# Call the function on your JSON-like data structure
json_structure_report(result_1ffy, show_values=True)

root is a dictionary with 1 keys: ['1ffy']
  root.1ffy is a dictionary with 1 keys: ['UniProt']
    root.1ffy.UniProt is a dictionary with 1 keys: ['P41972']
      root.1ffy.UniProt.P41972 is a dictionary with 3 keys: ['name', 'mappings', 'identifier']
        root.1ffy.UniProt.P41972.name is a str with value: 'SYI1_STAAU'
        root.1ffy.UniProt.P41972.mappings is a list with 1 items
          root.1ffy.UniProt.P41972.mappings[0] is a dictionary with 10 keys: ['entity_id', 'chain_id', 'struct_asym_id', 'unp_start', 'unp_end', 'pdb_start', 'pdb_end', 'start', 'end', 'identity']
            root.1ffy.UniProt.P41972.mappings[0].entity_id is a int with value: 2
            root.1ffy.UniProt.P41972.mappings[0].chain_id is a str with value: 'A'
            root.1ffy.UniProt.P41972.mappings[0].struct_asym_id is a str with value: 'B'
            root.1ffy.UniProt.P41972.mappings[0].unp_start is a int with value: 1
            root.1ffy.UniProt.P41972.mappings[0].unp_end is a int with value:

#### Data structure Result 2

In [305]:
# Call the function on your JSON-like data structure
json_structure_report(result_P41972, show_values=True)

root is a dictionary with 1 keys: ['P41972']
  root.P41972 is a list with 3 items
    root.P41972[0] is a dictionary with 13 keys: ['experimental_method', 'tax_id', 'resolution', 'pdb_id', 'chain_id', 'entity_id', 'preferred_assembly_id', 'observed_regions', 'start', 'end', 'unp_start', 'unp_end', 'coverage']
      root.P41972[0].experimental_method is a str with value: 'X-ray diffraction'
      root.P41972[0].tax_id is a int with value: 1280
      root.P41972[0].resolution is a float with value: 2.2
      root.P41972[0].pdb_id is a str with value: '1qu2'
      root.P41972[0].chain_id is a str with value: 'B'
      root.P41972[0].entity_id is a int with value: 2
      root.P41972[0].preferred_assembly_id is a int with value: 1
      root.P41972[0].observed_regions is a list with 1 items
        root.P41972[0].observed_regions[0] is a dictionary with 2 keys: ['unp_start', 'unp_end']
          root.P41972[0].observed_regions[0].unp_start is a int with value: 1
          root.P41972[0].ob

#### Data structure Result 3

In [306]:
# Call the function on your JSON-like data structure
json_structure_report(uniref_clusters_G1L2V2, show_values=True)

root is a dictionary with 1 keys: ['results']
  root.results is a list with 3 items
    root.results[0] is a dictionary with 13 keys: ['id', 'name', 'updated', 'entryType', 'commonTaxon', 'memberCount', 'organismCount', 'representativeMember', 'seedId', 'memberIdTypes', 'members', 'organisms', 'goTerms']
      root.results[0].id is a str with value: 'UniRef90_P00327'
      root.results[0].name is a str with value: 'Cluster: Alcohol dehydrogenase E chain'
      root.results[0].updated is a str with value: '2025-06-18'
      root.results[0].entryType is a str with value: 'UniRef90'
      root.results[0].commonTaxon is a dictionary with 2 keys: ['scientificName', 'taxonId']
        root.results[0].commonTaxon.scientificName is a str with value: 'Laurasiatheria'
        root.results[0].commonTaxon.taxonId is a int with value: 314145
      root.results[0].memberCount is a int with value: 59
      root.results[0].organismCount is a int with value: 27
      root.results[0].representativeMembe

#### Data structure Result 4

In [307]:
# Call the function on your JSON-like data structure
json_structure_report(uniref90_content_P00327, show_values=True)

root is a dictionary with 10 keys: ['id', 'name', 'memberCount', 'updated', 'entryType', 'commonTaxon', 'seedId', 'goTerms', 'representativeMember', 'members']
  root.id is a str with value: 'UniRef90_P00327'
  root.name is a str with value: 'Cluster: Alcohol dehydrogenase E chain'
  root.memberCount is a int with value: 59
  root.updated is a str with value: '2025-06-18'
  root.entryType is a str with value: 'UniRef90'
  root.commonTaxon is a dictionary with 2 keys: ['scientificName', 'taxonId']
    root.commonTaxon.scientificName is a str with value: 'Laurasiatheria'
    root.commonTaxon.taxonId is a int with value: 314145
  root.seedId is a str with value: 'UPI003917D848'
  root.goTerms is a list with 4 items
    root.goTerms[0] is a dictionary with 2 keys: ['goId', 'aspect']
      root.goTerms[0].goId is a str with value: 'GO:0008270'
      root.goTerms[0].aspect is a str with value: 'GO Molecular Function'
    root.goTerms[1] is a dictionary with 2 keys: ['goId', 'aspect']
      r

## 🔍 **1) EXERCISE - PROVIDED EXAMPLE**


### ❓ **TASK 1:** Suggest bug fixes for the code below.

The code was generated by AI and contains errors.
The code below should extract a table with PDB ids, chain ids, experimental method, resolution and UniProt coverage.

*HINT: Use output from the 'Exploring API endpoints Python definitions'* `map_keys` *or* `json_structure_report` *in the prompt.*

---

🐞 **Original Code with Bugs**

In [308]:
def fetch_best_structures(uniprot_id):
    url = f"https://www.ebi.ac.uk/pdbe/api/v2/uniprot/best_structures/{uniprot_id}"
    response = requests.get(url)
    response.raise_for_status()  # Raise an error for bad responses
    data = response.json()

    records = []
    for entry in data.get(uniprot_id, []):
        for region in entry.get("observed_regions", []):
            unp_start = region.get("unp_start")
            unp_end = region.get("unp_end")
            coverage = (unp_end - unp_start + 1) / (entry.get("end", unp_end) - entry.get("start", unp_start) + 1)

            records.append({
                "experimental_method": entry.get("experimental_method"),
                "resolution": entry.get("resolution"),
                "pdb_id": entry.get("pdb_id"),
                "chain_id": entry.get("chain_id"),
                "unp_start": unp_start,
                "unp_end": unp_end,
                "coverage": coverage
            })

In [309]:
# Example usage

fetch_best_structures("P41972")

### 🧪 **SOLUTION 1:** Bug Fixes and Explanation

Issue 1. Redundant Coverage Calculation
- Problem: The code recalculates `coverage` using:
  ```python
  coverage = (unp_end - unp_start + 1) / (entry.get("end", unp_end) - entry.get("start", unp_start) + 1)
  ```
- Why it is a problem: The API already provides a `coverage` value, so this calculation is unnecessary and potentially error-prone.

Issue 2. No explicit handling for entries solved by solution NMR, where the resolution value is `null` in the JSON (`None` in Python)

Issue 3. The function never returns `records`, and the data is not loaded into a pandas DataFrame

Issue 4. Code could be simplified by using `glom`
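
Issue 2 can be illustrated without calling the API. After parsing, JSON `null` becomes Python `None`, and replacing it with `float("nan")` keeps the resolution column numeric. The sketch below is only an illustration: the JSON string is a trimmed, hand-written fragment that mimics two keys of the best-structures response, and `resolution_or_nan` is a hypothetical helper name.

```python
import json
import math

# Hand-written fragment mimicking two entries of the API response
raw = '[{"pdb_id": "1qu2", "resolution": 2.2}, {"pdb_id": "7k3s", "resolution": null}]'
entries = json.loads(raw)


def resolution_or_nan(entry):
    """Return the resolution as a float, or NaN when it is missing or null."""
    value = entry.get("resolution")
    return float("nan") if value is None else float(value)


resolutions = [resolution_or_nan(e) for e in entries]
print(resolutions)
```

Note that NaN never compares equal to itself, so `math.isnan` (or `pd.isna` in pandas) is the safer way to test for it.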

---

In [310]:
def fetch_best_structures(uniprot_id):
    url = f"https://www.ebi.ac.uk/pdbe/api/v2/uniprot/best_structures/{uniprot_id}"
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()

    target = Path(uniprot_id)
    records = []

    for entry in glom(data, target):
        for region in entry.get("observed_regions", []):
            record = {
                "experimental_method": entry.get("experimental_method"),
                "resolution": entry.get("resolution") if entry.get("resolution") is not None else float("nan"),
                "pdb_id": entry.get("pdb_id"),
                "chain_id": entry.get("chain_id"),
                "coverage": entry.get("coverage"),
                "unp_start": region.get("unp_start"),
                "unp_end": region.get("unp_end")
            }
            records.append(record)

    df = pd.DataFrame(records)
    return df

In [311]:
# Example usage 1

fetch_best_structures("P41972")

Unnamed: 0,experimental_method,resolution,pdb_id,chain_id,coverage,unp_start,unp_end
0,X-ray diffraction,2.2,1qu2,B,1.0,1,917
1,X-ray diffraction,2.2,1ffy,B,1.0,1,917
2,X-ray diffraction,2.9,1qu3,B,0.96,2,881


In [312]:
# Example usage 2

fetch_best_structures("P48754")

Unnamed: 0,experimental_method,resolution,pdb_id,chain_id,coverage,unp_start,unp_end
0,Solution NMR,,7k3s,A,0.03,1337,1388


## 🔍 **2) EXERCISE**

### ❓ **TASK 2:** Generate code that produces a list of PDB ids

With help from AI, convert the code from the previous exercise into code that produces a list of PDB ids rather than a DataFrame / table.

*HINT: Incorporate code to remove duplicates from the pdb id list.*

<br>

---



### 🧪 **SOLUTION 2:** New Python definition that outputs a list of structures that contain at least one protein chain that corresponds to input UniProt id

In [313]:
def remove_duplicates(input_list):
    """
    Removes duplicates from a list while preserving order.

    Parameters:
        input_list (list): The list from which to remove duplicates.

    Returns:
        list: A new list with duplicates removed.
    """
    seen = set()
    unique_list = []
    for item in input_list:
        if item not in seen:
            seen.add(item)
            unique_list.append(item)
    return unique_list
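
A more compact alternative to the loop above relies on the fact that Python dictionaries (3.7+) preserve insertion order. This is a sketch, assuming every item is hashable (true for PDB id strings); `remove_duplicates_compact` is a hypothetical name chosen so that it does not clash with the function above.

```python
def remove_duplicates_compact(items):
    """Drop duplicates while keeping first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(items))


example_ids = ["1qu2", "1ffy", "1qu2", "1qu3", "1ffy"]
unique_ids = remove_duplicates_compact(example_ids)
print(unique_ids)
```

Using `set(items)` would also remove duplicates, but it does not guarantee that the original order is kept.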

In [314]:
def list_of_PDB_structures(uniprot_id):
    url = f"https://www.ebi.ac.uk/pdbe/api/v2/uniprot/best_structures/{uniprot_id}"
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()

    target = Path(uniprot_id)
    pdb_ids = []

    for entry in glom(data, target):
        pdb_id = entry.get("pdb_id")
        if pdb_id:
            pdb_ids.append(pdb_id)

    pdb_ids = remove_duplicates(pdb_ids)

    return pdb_ids

In [315]:
# Example usage 1

pdb_list = list_of_PDB_structures("P38398")  # Breast cancer type 1 susceptibility protein (with E3 ubiquitin-protein ligase activity)
print(pdb_list)

['8rs8', '4y2g', '4igk', '7lyb', '7jzv', '3pxa', '1t15', '4y18', '4ofb', '3pxb', '8grq', '4ifi', '3pxe', '3pxc', '1y98', '3pxd', '1t29', '1jnx', '1n5o', '3k15', '1t2u', '3k0h', '3k0k', '3coj', '3k16', '1t2v', '4u4a', '2ing', '4jlu', '6g2i', '1jm7', '1oqa']


In [316]:
# Example usage 2

pdb_list = list_of_PDB_structures("P21802")  # Fibroblast growth factor receptor 2 (FGFR2)
print(pdb_list)
#print(len(pdb_list))

['5ugl', '7ozy', '3b2t', '8swe', '3dar', '7kia', '5ui0', '3ri1', '8w3d', '4j99', '2pvf', '6v6q', '4wv1', '3euu', '3caf', '8e1x', '2fdb', '2pz5', '4j96', '5ugx', '2pwl', '1ev2', '4j97', '2pzp', '2psq', '4j98', '8w3b', '7kie', '5eg3', '6lvk', '3cu1', '3ojm', '5uhn', '3oj2', '3cly', '2py3', '1iil', '8w38', '4j95', '8w2x', '6agx', '2pvy', '1djs', '1gjo', '1ii4', '1oec', '6lvl', '1e0o', '2q0b', '1nun', '8u1f', '2pzr', '8stg', '9u3n', '4j23', '8h75', '1wvz']


## 🔍 **3) EXERCISE**

### ❓ **TASK 3:** Improve the code and convert it into a Python definition.

Enhance the previous code to generate an output table with 14 columns that correspond to the content from the API.


<br>

---

### 🧪 **SOLUTION 3:** Improved code (include more info from API)

In [317]:
def parse_pdb_uniprot_mapping(data):
    """
    Parses a nested JSON-like dictionary containing PDB-UniProt mapping data
    and returns a pandas DataFrame with extracted fields.

    Parameters:
        data (dict): Nested dictionary with structure like:
                     { '1ffy': { 'UniProt': { 'P41972': { ... } } } }

    Returns:
        pd.DataFrame: DataFrame containing extracted mapping information.
    """
    records = []

    for pdb_id, pdb_entry in data.items():
        uniprot_entries = pdb_entry.get('UniProt', {})
        for uniprot_id, uniprot_data in uniprot_entries.items():
            identifier = uniprot_data.get('identifier')
            name = uniprot_data.get('name')
            mappings = uniprot_data.get('mappings', [])

            for mapping in mappings:
                record = {
                    "pdb_id": pdb_id,
                    "uniprot_id": uniprot_id,
                    "identifier": identifier,
                    "name": name,
                    "chain_id": mapping.get('chain_id'),
                    "entity_id": mapping.get('entity_id'),
                    "identity": mapping.get('identity'),
                    "struct_asym_id": mapping.get('struct_asym_id'),
                    "unp_start": mapping.get('unp_start'),
                    "unp_end": mapping.get('unp_end'),
                    "pdb_start": mapping.get('pdb_start'),
                    "pdb_end": mapping.get('pdb_end'),
                    "start_residue": mapping.get('start', {}).get('residue_number'),
                    "end_residue": mapping.get('end', {}).get('residue_number')
                }
                records.append(record)

    return pd.DataFrame(records)

In [318]:
# Example usage 1

parse_pdb_uniprot_mapping(result_1ffy)

Unnamed: 0,pdb_id,uniprot_id,identifier,name,chain_id,entity_id,identity,struct_asym_id,unp_start,unp_end,pdb_start,pdb_end,start_residue,end_residue
0,1ffy,P41972,SYI1_STAAU,SYI1_STAAU,A,2,0.99,B,1,917,1,917,1,917


In [319]:
# Example usage 2

pdb_id = "3ojm"  # Example: structure containing FGF1 and FGFR2 chains
result_1ffy = fetch_best_isoform(pdb_id)  # note: overwrites result_1ffy with the 3ojm data
parse_pdb_uniprot_mapping(result_1ffy)

Validated PDB ID: 3ojm
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/3ojm
Data retrieved successfully.


Unnamed: 0,pdb_id,uniprot_id,identifier,name,chain_id,entity_id,identity,struct_asym_id,unp_start,unp_end,pdb_start,pdb_end,start_residue,end_residue
0,3ojm,P05230,FGF1_HUMAN,FGF1_HUMAN,A,1,1.0,A,1,155,1,155,1,155
1,3ojm,P21802-3,FGFR2_HUMAN,FGFR2_HUMAN,B,2,1.0,B,140,369,2,231,2,231


## 🔍 **4) BONUS CHALLENGE**

### ❓ **TASK 4:** The code below was generated with AI. Update the code so it outputs a single canonical UniProt id.

The code was generated by AI after several prompts.

The code below extracts the isoform UniProt id and related information from the API: `https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/{pdb_id}`.

*HINT 1: Use the code below as part of an AI prompt to generate a Python definition that outputs a single UniProt id.*

*HINT 2: Update the code so it takes a chain id as an input.*

*HINT 3: Use an additional prompt to update the code so it converts an isoform UniProt id (e.g. P21802-3, P21802-4) to the canonical UniProt id (P21802).*

<br>

---

🤖 **AI generated Code**



In [320]:
# Extract the PDB ID key (e.g., '1ffy')
pdb_id = list(result_1ffy.keys())[0]

# Extract the UniProt ID key (e.g., 'P41972') dynamically
uniprot_id = list(result_1ffy[pdb_id]['UniProt'].keys())[0]

# Access the UniProt entry
entry = result_1ffy[pdb_id]['UniProt'][uniprot_id]

# Print extracted values
print("PDB ID:", pdb_id)
print("UniProt ID:", uniprot_id)
print("Identifier:", entry['identifier'])
print("Name:", entry['name'])

# Print mapping details
for mapping in entry['mappings']:
    print(f"Chain: {mapping['chain_id']}")
    print(f"Start residue: {mapping['start']['residue_number']}")
    print(f"End residue: {mapping['end']['residue_number']}")
    print(f"UniProt range: {mapping['unp_start']}–{mapping['unp_end']}")
    print(f"Identity: {mapping['identity']}")
    print("---")

PDB ID: 3ojm
UniProt ID: P05230
Identifier: FGF1_HUMAN
Name: FGF1_HUMAN
Chain: A
Start residue: 1
End residue: 155
UniProt range: 1–155
Identity: 1.0
---


### 🧪 **SOLUTION 4:** New Python definition that can be combined with `fetch_best_isoform(pdb_id)`


In [321]:
# input result here is the output from fetch_best_isoform(pdb_id)

def get_canonical_uniprot_id_by_chain(result, chain_id):
    """
    Extracts the canonical UniProt ID associated with a specific chain ID from a PDB-UniProt mapping result.

    Parameters:
        result (dict): Dictionary containing PDB to UniProt mapping data.
        chain_id (str): Chain ID to search for (e.g., 'A').

    Returns:
        str: Canonical UniProt ID corresponding to the given chain ID, or None if not found.
    """
    try:
        pdb_id = list(result.keys())[0]
        for uniprot_id, entry in result[pdb_id]['UniProt'].items():
            for mapping in entry.get('mappings', []):
                if mapping.get('chain_id') == chain_id:
                    # Strip isoform suffix if present (e.g., '-3', '-4')
                    canonical_id = uniprot_id.split('-')[0]
                    return canonical_id
        return None
    except (KeyError, IndexError, TypeError):
        return None
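
The suffix handling in the function above can also be tried in isolation. The sketch below is a slightly stricter variant, under the assumption that isoform ids always have the form `<accession>-<number>` (as in `P21802-3`); `to_canonical_accession` and `ISOFORM_PATTERN` are hypothetical names, not part of any UniProt or PDBe library.

```python
import re

# Assumed shape of an isoform accession: base accession, hyphen, isoform number
ISOFORM_PATTERN = re.compile(r"^(?P<base>[A-Z0-9]+)-(?P<isoform>\d+)$")


def to_canonical_accession(accession):
    """Return the base accession if an isoform suffix is present, else the input."""
    match = ISOFORM_PATTERN.match(accession)
    return match.group("base") if match else accession


canonical_ids = [to_canonical_accession(a) for a in ["P21802-3", "P21802-4", "P41972"]]
print(canonical_ids)
```

Unlike a plain `split('-')`, the pattern leaves any string that does not look like an isoform id unchanged, which makes unexpected input easier to spot.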

In [322]:
# Example usage 1

result_1ffy = fetch_best_isoform('1ffy')

get_canonical_uniprot_id_by_chain(result_1ffy, 'A')

Validated PDB ID: 1ffy
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/1ffy
Data retrieved successfully.


'P41972'

In [323]:
# Example usage 2

result_3ojm = fetch_best_isoform('3ojm')

get_canonical_uniprot_id_by_chain(result_3ojm, 'B')

Validated PDB ID: 3ojm
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/3ojm
Data retrieved successfully.


'P21802'

## 🔍 **5) BONUS CHALLENGE** -- PROVIDED EXAMPLE

### ❓ **TASK 5:** Use AI to generate code that takes a list of PDB ids and finds the best isoforms

*HINT: May take more than one prompt -- you can ask AI to bug fix.*

<br>

---

### 🧪 **SOLUTION 5:** A new Python definition that incorporates the previous Python definitions.

In [324]:
def fetch_multiple_isoforms(pdb_ids_string):
    """
    Process a comma-separated string of PDB IDs, fetch data for each,
    and return a combined DataFrame of parsed UniProt mappings.
    """
    pdb_ids = [pid.strip() for pid in pdb_ids_string.split(",")]

    frames = []

    for pdb_id in pdb_ids:
        data = fetch_best_isoform(pdb_id)
        #display(parse_pdb_uniprot_mapping(data))
        if data:
            df = parse_pdb_uniprot_mapping(data)
            if not df.empty:
                frames.append(df)

    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
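
The string-splitting step at the top of the function above can be made more forgiving of messy input. The sketch below is one possible approach, not the notebook's method: it assumes classic 4-character PDB ids (a digit followed by three alphanumeric characters), and `split_pdb_ids` is a hypothetical helper name. Judging by its printed messages, `fetch_best_isoform` performs its own validation, so this only tidies the list beforehand.

```python
import re

# Assumption: classic 4-character PDB ids (a digit followed by three alphanumerics)
PDB_ID_PATTERN = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def split_pdb_ids(pdb_ids_string):
    """Split a comma-separated string into unique, lowercase, well-formed PDB ids."""
    cleaned = []
    for token in pdb_ids_string.split(","):
        pdb_id = token.strip().lower()
        # Skip empty tokens, malformed ids and ids already seen
        if pdb_id and PDB_ID_PATTERN.match(pdb_id) and pdb_id not in cleaned:
            cleaned.append(pdb_id)
    return cleaned


parsed_ids = split_pdb_ids("3LII, 3K5V,, 1u70 , 3lii, not_an_id")
print(parsed_ids)
```

Dropping duplicates before the loop avoids requesting the same entry from the API more than once.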

In [325]:
# Example usage 1:

pdb_ids_as_list = [
    "3LII", "3K5V", "1U70", "2XFU", "2X91", "2R4F", "3B7E", "3K4V", "2JHF", "1GAL", "3B2T", "3OJM"
]

pdb_ids_string = ", ".join(pdb_ids_as_list)

results_enzyme_list = fetch_multiple_isoforms(pdb_ids_string)

Validated PDB ID: 3LII
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/3LII
Data retrieved successfully.
Validated PDB ID: 3K5V
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/3K5V
Data retrieved successfully.
Validated PDB ID: 1U70
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/1U70
Data retrieved successfully.
Validated PDB ID: 2XFU
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/2XFU
Data retrieved successfully.
Validated PDB ID: 2X91
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/2X91
Data retrieved successfully.
Validated PDB ID: 2R4F
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/2R4F
Data retrieved successfully.
Validated PDB ID: 3B7E
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/3B7E
Data retrieved successfully.
Validated PDB ID: 3K4V
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/3K4V
Data retrieved successfully.
Validated PDB ID: 2JHF
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/2JHF
Dat

In [326]:
display(results_enzyme_list)

Unnamed: 0,pdb_id,uniprot_id,identifier,name,chain_id,entity_id,identity,struct_asym_id,unp_start,unp_end,pdb_start,pdb_end,start_residue,end_residue
0,3lii,P22303,ACES_HUMAN,ACES_HUMAN,A,1,1.0,A,35,574,1,540,1,540
1,3lii,P22303,ACES_HUMAN,ACES_HUMAN,B,1,1.0,B,35,574,1,540,1,540
2,3k5v,P00520,ABL1_MOUSE,ABL1_MOUSE,A,1,0.99,A,229,515,7,293,7,293
3,3k5v,P00520,ABL1_MOUSE,ABL1_MOUSE,B,1,0.99,B,229,515,7,293,7,293
4,1u70,P00375,DYR_MOUSE,DYR_MOUSE,A,1,0.99,A,2,187,1,186,1,186
5,2xfu,P27338,AOFB_HUMAN,AOFB_HUMAN,A,1,1.0,A,2,520,1,519,1,519
6,2xfu,P27338,AOFB_HUMAN,AOFB_HUMAN,B,1,1.0,B,2,520,1,519,1,519
7,2x91,Q10714,ACE_DROME,ACE_DROME,A,1,1.0,A,17,614,1,598,1,598
8,2r4f,P04035,HMDH_HUMAN,HMDH_HUMAN,A,1,0.99,A,441,875,7,441,7,441
9,2r4f,P04035,HMDH_HUMAN,HMDH_HUMAN,B,1,0.99,B,441,875,7,441,7,441


In [327]:
# Example usage 2:

pdb_ids_as_list = ['8rs8', '4y2g', '4igk', '7lyb', '7jzv', '3pxa', '1t15', '4y18', '4ofb', '3pxb', '8grq', '4ifi', '3pxe', '3pxc', '1y98', '3pxd', '1t29', '1jnx', '1n5o', '3k15', '1t2u', '3k0h', '3k0k', '3coj', '3k16', '1t2v', '4u4a', '2ing', '4jlu', '6g2i', '1jm7', '1oqa']

pdb_ids_string = ', '.join(pdb_ids_as_list)

results_BRCA1 = fetch_multiple_isoforms(pdb_ids_string)

Validated PDB ID: 8rs8
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/8rs8
Data retrieved successfully.
Validated PDB ID: 4y2g
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/4y2g
Data retrieved successfully.
Validated PDB ID: 4igk
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/4igk
Data retrieved successfully.
Validated PDB ID: 7lyb
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/7lyb
Data retrieved successfully.
Validated PDB ID: 7jzv
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/7jzv
Data retrieved successfully.
Validated PDB ID: 3pxa
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/3pxa
Data retrieved successfully.
Validated PDB ID: 1t15
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/1t15
Data retrieved successfully.
Validated PDB ID: 4y18
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/4y18
Data retrieved successfully.
Validated PDB ID: 4ofb
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/4ofb
Dat

In [328]:
display(results_BRCA1)

Unnamed: 0,pdb_id,uniprot_id,identifier,name,chain_id,entity_id,identity,struct_asym_id,unp_start,unp_end,pdb_start,pdb_end,start_residue,end_residue
0,8rs8,P38398,BRCA1_HUMAN,BRCA1_HUMAN,A,1,1.0,A,1646,1859,4,217,4,217
1,8rs8,P38398,BRCA1_HUMAN,BRCA1_HUMAN,B,1,1.0,B,1646,1859,4,217,4,217
2,8rs8,P38398,BRCA1_HUMAN,BRCA1_HUMAN,C,1,1.0,C,1646,1859,4,217,4,217
3,8rs8,P38398,BRCA1_HUMAN,BRCA1_HUMAN,D,1,1.0,D,1646,1859,4,217,4,217
4,8rs8,Q5UIP0,RIF1_HUMAN,RIF1_HUMAN,E,2,1.0,E,2260,2270,1,11,1,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133,6g2i,P38398,BRCA1_HUMAN,BRCA1_HUMAN,W,2,1.0,R,1646,1859,27,240,27,240
134,6g2i,P38398,BRCA1_HUMAN,BRCA1_HUMAN,Y,2,1.0,Q,1646,1859,27,240,27,240
135,1jm7,P38398,BRCA1_HUMAN,BRCA1_HUMAN,A,1,1.0,A,1,110,1,110,1,110
136,1jm7,Q99728,BARD1_HUMAN,BARD1_HUMAN,B,2,1.0,B,26,140,1,115,1,115


## 📝 Quick Review Quiz 1

Test your understanding!

**1. What is a UniProt ID?**  

*A UniProt ID uniquely identifies:*

<select>
  <option value="Select_answer">Select answer</option>
  <option value="protein_in_PDB">A protein structure in the PDB.</option>
  <option value="human_gene">A gene in the human genome.</option>
  <option value="protein_seq_w_defined_organism">A specific protein sequence from a defined organism.</option>
  <option value="same_protein_multiple_species">Same protein across multiple species</option>
</select>

---

**2. What is the canonical sequence in UniProt?**

*The canonical sequence in a UniProtKB/Swiss-Prot entry is selected based on:*

<select>
  <option value="Select_answer">Select answer</option>
  <option value="longest_isoform">The longest isoform available.</option>
  <option value="recent_isoform">The most recently discovered isoform.</option>
  <option value="functionality_expression_etc">Functionality, expression, conservation, and consensus with other databases</option>
  <option value="highest_MW_isoform">The isoform with the highest molecular weight.</option>
</select>

---

**3. How often is UniProt updated?**  

*New versions of UniProt are released:*

<select>
  <option value="Select_answer">Select answer</option>
  <option value="Weekly">Weekly</option>
  <option value="Monthly">Monthly</option>
  <option value="2-4 months">Every 2-4 months</option>
  <option value="Annually">Annually</option>
</select>

---

**4. Which statement about isoforms is TRUE?**  
<select>
  <option value="Select_answer">Select answer</option>
  <option value="only_in_non-humans">Isoforms are only found in non-human proteins.</option>
  <option value="alt_splicing">Isoforms result from alternative splicing and may have different functions.</option>
  <option value="not_in_uniprot">Isoforms are not included in UniProt entries.</option>
  <option value="same_tissue">Isoforms cannot be expressed in the same tissue or organism.</option>
</select>

---


## 🔍 **6) EXERCISE**

### ❓ **TASK 6:** Make a Python definition by adapting the code below.

The code below was generated with AI.

Use this starting point to generate new code that outputs the **UniRef50** cluster instead of the **UniRef90** cluster.

<br>

---

In [329]:
def fetch_uniref90_cluster_id(query):
    """
    Fetches UniRef90 cluster IDs from the top 3 UniRef search results for a given UniProt accession or ID.

    Parameters:
        query (str): UniProt accession or ID (e.g., 'G1L2V2')

    Returns:
        list: A list of UniRef90 cluster IDs (e.g., ['UniRef90_P00327']), or an empty list if none found.
    """
    url = f"https://rest.uniprot.org/uniref/search?query={query}&size=3"
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()

    uniref90_ids = []
    try:
        for result in data.get("results", []):
            cluster_id = result.get("id", "")
            if re.match(r"^UniRef90_", cluster_id):
                uniref90_ids.append(cluster_id)
    except Exception as e:
        print("Error parsing UniRef data:", e)

    return uniref90_ids

In [330]:
# Example usage
ids = fetch_uniref90_cluster_id("G1L2V2")
print("UniRef90 IDs:", ids)

UniRef90 IDs: ['UniRef90_P00327']


### 🧪 **SOLUTION 6:** New Python definition based on previous code.

In [331]:
def fetch_uniref50_cluster_id(query):
    """
    Fetches UniRef50 cluster IDs from the top 3 UniRef search results for a given UniProt accession or ID.

    Parameters:
        query (str): UniProt accession or ID (e.g., 'G1L2V2')

    Returns:
        list: A list of UniRef50 cluster IDs (e.g., ['UniRef50_P00325']), or an empty list if none found.
    """
    url = f"https://rest.uniprot.org/uniref/search?query={query}&size=3"
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()

    uniref50_ids = []
    try:
        for result in data.get("results", []):
            cluster_id = result.get("id", "")
            if re.match(r"^UniRef50_", cluster_id):
                uniref50_ids.append(cluster_id)
    except Exception as e:
        print("Error parsing UniRef data:", e)

    return uniref50_ids
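
Because the UniRef90 and UniRef50 functions differ only in the id prefix they look for, the filtering step could be shared. The sketch below shows that idea on its own, without any network call: `filter_cluster_ids` is a hypothetical helper, and `sample_results` is an illustrative stand-in for `data["results"]` in which the UniRef100 id is invented for the example.

```python
def filter_cluster_ids(results, identity):
    """Return the ids in `results` that belong to the UniRef<identity> database."""
    if identity not in (50, 90, 100):
        raise ValueError(f"Unsupported UniRef identity level: {identity}")
    prefix = f"UniRef{identity}_"
    return [r.get("id", "") for r in results if r.get("id", "").startswith(prefix)]


# Illustrative stand-in for data["results"]; the UniRef100 id is invented
sample_results = [
    {"id": "UniRef90_P00327"},
    {"id": "UniRef50_P00325"},
    {"id": "UniRef100_EXAMPLE"},
]

uniref50 = filter_cluster_ids(sample_results, 50)
uniref90 = filter_cluster_ids(sample_results, 90)
print(uniref50, uniref90)
```

Including the trailing underscore in the prefix prevents `UniRef100_...` ids from being picked up when another identity level is requested.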

In [332]:
# Example usage

cluster_id = fetch_uniref50_cluster_id("G1L2V2")
print("UniRef Cluster ID:", cluster_id)

UniRef Cluster ID: ['UniRef50_P00325']


## 🔍 **7) EXERCISE**

### ❓ **TASK 7:** Make a Python definition by adapting the code below.

Use the code below as a starting point to generate new code that outputs a CSV file.

<br>

---

In [333]:
def uniref_cluster_details(cluster_id):
    """
    Fetches and parses UniRef cluster data from the UniProt REST API.

    Parameters:
        cluster_id (str): UniRef cluster ID (e.g., 'UniRef90_P00327')

    Returns:
        pd.DataFrame: DataFrame containing selected fields from cluster members.
    """
    # Validate cluster ID format
    pattern = r'^UniRef(50|90|100)_\w+$'
    if not re.match(pattern, cluster_id):
        raise ValueError(f"Invalid UniRef cluster ID format: {cluster_id}")

    # Getting data from API
    url = f"https://rest.uniprot.org/uniref/{cluster_id}"
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()

    # Parsing JSON from API
    cluster_name = data.get('id')
    seed = []
    seed_info = data.get('representativeMember')

    seed_accessions = seed_info.get('accessions')
    seed_accession = seed_accessions[0] if isinstance(seed_accessions, list) and seed_accessions else None

    seed.append({
        'Index': 1,
        'Cluster_name': cluster_name,
        'UniProtId': seed_accession,
        'MemberId': seed_info.get('memberId'),
        'MemberId_Type': seed_info.get('memberIdType'),
        'Organism_Name': seed_info.get('organismName'),
        'Organism_TaxId': seed_info.get('organismTaxId'),
        'Protein_Name': seed_info.get('proteinName'),
        'Seq_Alignment_Length': seed_info.get('sequence', {}).get('length')
    })

    members = data.get("members", [])
    parsed_members = seed
    for idx, member in enumerate(members):
        accessions = member.get('accessions')
        accession = accessions[0] if isinstance(accessions, list) and accessions else None
        parsed_members.append({
            'Index': idx + 2,
            'UniProtId': accession,
            'Cluster_name': cluster_name,
            'MemberId': member.get('memberId'),
            'MemberId_Type': member.get('memberIdType'),
            'Organism_Name': member.get('organismName'),
            'Organism_TaxId': member.get('organismTaxId'),
            'Protein_Name': member.get('proteinName'),
            'Seq_Alignment_Length': member.get('sequenceLength')
        })

    # Convert to DataFrame
    df = pd.DataFrame(parsed_members)
    return df

In [334]:
# Example usage

cluster_data = uniref_cluster_details("UniRef90_P11373")
display(cluster_data)

Unnamed: 0,Index,Cluster_name,UniProtId,MemberId,MemberId_Type,Organism_Name,Organism_TaxId,Protein_Name,Seq_Alignment_Length
0,1,UniRef90_P11373,P11373,CUTI1_COLGL,UniProtKB ID,Colletotrichum gloeosporioides (Anthracnose fu...,474922,Cutinase 1,224
1,2,UniRef90_P11373,T0L198,T0L198_COLGC,UniProtKB ID,Colletotrichum gloeosporioides (strain Cg-14) ...,1237896,Cutinase,224
2,3,UniRef90_P11373,A0A8H3ZKH1,A0A8H3ZKH1_9PEZI,UniProtKB ID,Colletotrichum asianum,702518,Cutinase,224
3,4,UniRef90_P11373,A0AAD9AM37,A0AAD9AM37_9PEZI,UniProtKB ID,Colletotrichum chrysophilum,1836956,Cutinase,224
4,5,UniRef90_P11373,A0A7J6INC9,A0A7J6INC9_COLFN,UniProtKB ID,Colletotrichum fructicola (strain Nara gc5) (A...,1213859,Cutinase,224
5,6,UniRef90_P11373,A0AAD9YGR3,A0AAD9YGR3_COLKA,UniProtKB ID,Colletotrichum kahawae (Coffee berry disease f...,34407,Cutinase,224
6,7,UniRef90_P11373,A0A0F7N0B9,A0A0F7N0B9_COLGL,UniProtKB ID,Colletotrichum gloeosporioides (Anthracnose fu...,474922,Cutinase,224
7,8,UniRef90_P11373,A0A8H4C6B1,A0A8H4C6B1_COLGL,UniProtKB ID,Colletotrichum gloeosporioides (Anthracnose fu...,474922,Cutinase,224
8,9,UniRef90_P11373,A0A9W4S563,A0A9W4S563_9PEZI,UniProtKB ID,Colletotrichum noveboracense,2664923,Cutinase,224
9,10,UniRef90_P11373,A0A9P5EYW3,A0A9P5EYW3_COLSI,UniProtKB ID,Colletotrichum siamense (Anthracnose fungus),690259,Cutinase,224


### 🧪 **SOLUTION 7:** New Python definition based on previous code.

In [335]:
def uniref_cluster_details_csv(cluster_id):
    """
    Fetches and parses UniRef cluster data from the UniProt REST API
    and saves the result to a CSV file named '<cluster_id>_details.csv'.

    Parameters:
        cluster_id (str): UniRef cluster ID (e.g., 'UniRef90_P00327')

    Returns:
        pd.DataFrame: DataFrame containing selected fields from cluster members.
    """
    # Validate cluster ID format
    pattern = r'^UniRef(50|90|100)_\w+$'
    if not re.match(pattern, cluster_id):
        raise ValueError(f"Invalid UniRef cluster ID format: {cluster_id}")

    # Getting data from API
    url = f"https://rest.uniprot.org/uniref/{cluster_id}"
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()

    # Parsing JSON from API
    cluster_name = data.get('id')
    seed = []
    seed_info = data.get('representativeMember')

    seed_accessions = seed_info.get('accessions')
    seed_accession = seed_accessions[0] if isinstance(seed_accessions, list) and seed_accessions else None

    seed.append({
        'Index': 1,
        'Cluster_name': cluster_name,
        'UniProtId': seed_accession,
        'MemberId': seed_info.get('memberId'),
        'MemberId_Type': seed_info.get('memberIdType'),
        'Organism_Name': seed_info.get('organismName'),
        'Organism_TaxId': seed_info.get('organismTaxId'),
        'Protein_Name': seed_info.get('proteinName'),
        'Seq_Alignment_Length': seed_info.get('sequence', {}).get('length')
    })

    members = data.get("members", [])
    parsed_members = seed
    for idx, member in enumerate(members):
        accessions = member.get('accessions')
        accession = accessions[0] if isinstance(accessions, list) and accessions else None
        parsed_members.append({
            'Index': idx + 2,
            'UniProtId': accession,
            'Cluster_name': cluster_name,
            'MemberId': member.get('memberId'),
            'MemberId_Type': member.get('memberIdType'),
            'Organism_Name': member.get('organismName'),
            'Organism_TaxId': member.get('organismTaxId'),
            'Protein_Name': member.get('proteinName'),
            'Seq_Alignment_Length': member.get('sequenceLength')
        })

    # Convert to DataFrame
    df = pd.DataFrame(parsed_members)


    # Save to CSV
    csv_filename = f"{cluster_id}_details.csv"
    df.to_csv(csv_filename, index=False)
    print(f"Data saved to {csv_filename}")

    return df
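
The function above relies on `DataFrame.to_csv`. Where pandas is not available, a roughly equivalent file can be produced with the standard library. The sketch below is an assumption-laden illustration rather than a replacement: it writes to an in-memory buffer instead of a file, and the two records are copied from the example output in this section and trimmed to three columns.

```python
import csv
import io

# Two rows copied from the example output, trimmed to three columns
records = [
    {"Index": 1, "UniProtId": "P11373", "Protein_Name": "Cutinase 1"},
    {"Index": 2, "UniProtId": "T0L198", "Protein_Name": "Cutinase"},
]

buffer = io.StringIO()  # in-memory stand-in for an open file
writer = csv.DictWriter(buffer, fieldnames=["Index", "UniProtId", "Protein_Name"])
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the buffer can be swapped for `open(csv_filename, "w", newline="")`, which is the form the `csv` module documentation recommends.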

In [336]:
# Example usage

cluster_data = uniref_cluster_details_csv("UniRef90_P11373")
display(cluster_data)

Data saved to UniRef90_P11373_details.csv


| | Index | Cluster_name | UniProtId | MemberId | MemberId_Type | Organism_Name | Organism_TaxId | Protein_Name | Seq_Alignment_Length |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | UniRef90_P11373 | P11373 | CUTI1_COLGL | UniProtKB ID | Colletotrichum gloeosporioides (Anthracnose fu... | 474922 | Cutinase 1 | 224 |
| 1 | 2 | UniRef90_P11373 | T0L198 | T0L198_COLGC | UniProtKB ID | Colletotrichum gloeosporioides (strain Cg-14) ... | 1237896 | Cutinase | 224 |
| 2 | 3 | UniRef90_P11373 | A0A8H3ZKH1 | A0A8H3ZKH1_9PEZI | UniProtKB ID | Colletotrichum asianum | 702518 | Cutinase | 224 |
| 3 | 4 | UniRef90_P11373 | A0AAD9AM37 | A0AAD9AM37_9PEZI | UniProtKB ID | Colletotrichum chrysophilum | 1836956 | Cutinase | 224 |
| 4 | 5 | UniRef90_P11373 | A0A7J6INC9 | A0A7J6INC9_COLFN | UniProtKB ID | Colletotrichum fructicola (strain Nara gc5) (A... | 1213859 | Cutinase | 224 |
| 5 | 6 | UniRef90_P11373 | A0AAD9YGR3 | A0AAD9YGR3_COLKA | UniProtKB ID | Colletotrichum kahawae (Coffee berry disease f... | 34407 | Cutinase | 224 |
| 6 | 7 | UniRef90_P11373 | A0A0F7N0B9 | A0A0F7N0B9_COLGL | UniProtKB ID | Colletotrichum gloeosporioides (Anthracnose fu... | 474922 | Cutinase | 224 |
| 7 | 8 | UniRef90_P11373 | A0A8H4C6B1 | A0A8H4C6B1_COLGL | UniProtKB ID | Colletotrichum gloeosporioides (Anthracnose fu... | 474922 | Cutinase | 224 |
| 8 | 9 | UniRef90_P11373 | A0A9W4S563 | A0A9W4S563_9PEZI | UniProtKB ID | Colletotrichum noveboracense | 2664923 | Cutinase | 224 |
| 9 | 10 | UniRef90_P11373 | A0A9P5EYW3 | A0A9P5EYW3_COLSI | UniProtKB ID | Colletotrichum siamense (Anthracnose fungus) | 690259 | Cutinase | 224 |
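
The saved CSV can also be inspected without pandas. The sketch below counts cluster members per taxonomy ID using only the standard library. It is a minimal illustration, not part of the original solution: `sample_csv` is a trimmed stand-in for the saved file, holding three of the columns and four rows copied from the output above, and no API call is made.

```python
import csv
import io
from collections import Counter

# Trimmed stand-in for UniRef90_P11373_details.csv: three columns, four rows
# copied from the output above (the real file has more columns and rows).
sample_csv = """UniProtId,Organism_TaxId,Protein_Name
P11373,474922,Cutinase 1
T0L198,1237896,Cutinase
A0A0F7N0B9,474922,Cutinase
A0A8H4C6B1,474922,Cutinase
"""

# csv.DictReader yields one dict per row, keyed by the header line
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count how many cluster members come from each taxonomy ID
# (values read from CSV are strings, so the keys are strings too)
members_per_taxon = Counter(row["Organism_TaxId"] for row in rows)
print(members_per_taxon.most_common())
```

To run this against the real file, replace `io.StringIO(sample_csv)` with an opened file handle, for example `open("UniRef90_P11373_details.csv", newline="")`.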


## 🔍 **8) BONUS CHALLENGE** - PROVIDED EXAMPLE


### ❓ **TASK 8:** Combine the previous Python functions so that the input is a PDB ID (plus a chain ID) and the output is the list of UniProt accessions in the matching UniRef90 cluster.

<br>

---

### 🧪 **SOLUTION 8:** A new Python function that calls the previously defined functions.

In [337]:
def uniref_cluster_ids(cluster_id):
    """
    Fetches and parses UniRef cluster data from the UniProt REST API.

    Parameters:
        cluster_id (str): UniRef cluster ID (e.g., 'UniRef90_P00327')

    Returns:
        list: Unique UniProt accessions of the cluster members, representative member first.
    """
    # Validate cluster ID format
    pattern = r'^UniRef(50|90|100)_\w+$'
    if not re.match(pattern, cluster_id):
        raise ValueError(f"Invalid UniRef cluster ID format: {cluster_id}")

    # Getting data from API
    url = f"https://rest.uniprot.org/uniref/{cluster_id}"
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()

    # Parsing JSON from API: representative (seed) member first, then the other members.
    # "or {}" guards against a missing representativeMember entry.
    seed_info = data.get('representativeMember') or {}
    entries = [seed_info] + data.get("members", [])

    uniprot_ids = []
    for entry in entries:
        accessions = entry.get('accessions')
        accession = accessions[0] if isinstance(accessions, list) and accessions else None
        # Skip entries without an accession and keep only the first occurrence of each ID
        if accession is not None and accession not in uniprot_ids:
            uniprot_ids.append(accession)

    return uniprot_ids
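
The format check at the top of `uniref_cluster_ids` can be tried out on its own. The sketch below reuses the same regular expression. `is_valid_uniref_id` is a hypothetical helper introduced only for this illustration. Note that `\w` does not match a hyphen, so an ID carrying an isoform-style suffix is rejected by this pattern; if such IDs need to be accepted, the character class would have to be widened.

```python
import re

# Same pattern as in uniref_cluster_ids
UNIREF_PATTERN = r'^UniRef(50|90|100)_\w+$'

def is_valid_uniref_id(cluster_id):
    """Hypothetical helper: True if cluster_id matches the pattern above."""
    return re.match(UNIREF_PATTERN, cluster_id) is not None

candidates = [
    "UniRef90_P21802",      # accepted
    "UniRef50_A0A8H3ZKH1",  # accepted
    "uniref90_P21802",      # rejected: the prefix is case-sensitive
    "UniRef70_P21802",      # rejected: only 50, 90 and 100 are allowed
    "UniRef100_P21802-2",   # rejected: '-' is not matched by \w
]
for candidate in candidates:
    print(candidate, is_valid_uniref_id(candidate))
```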

In [338]:
uniref_cluster_ids('UniRef90_P21802')

['P21802',
 'P21803',
 'A0A6J2D1E3',
 'A0A8C7ANU4',
 'A0A8U0SV68',
 'A0A673SQ31',
 'A1YYN9',
 'A0A2J8PBY1',
 'A0A8C0AIU6',
 'A0A8C6GPM7',
 'A0A4W2C5Z9',
 'J9QDM6',
 'P21802-2',
 'A0A8U0NUB3',
 'A0A8C7AP27',
 'A0A667H493',
 'A0A6J1ZR09',
 'A0A2K5E4E2',
 'A0A7J8G7X4',
 'A0A7J7UWA7',
 'A0A4W2GEQ6',
 'A0A2K6SD92',
 'A0A8C8ZGV2',
 'A0A8C3X2C3',
 'A0A2J8PBX3',
 'A0A8D2AJZ0',
 'A0A8C9DBZ1',
 'A0A6J1ZNH1',
 'A0A667HME2',
 'A0A8C7AGV2',
 'A0A8U0TDH6',
 'A0A341CAK0',
 'A0A2Y9N4Z5',
 'A0A8B8TWI3',
 'A0A9W3FLM0',
 'A0A8B8VID3',
 'A0A480ZAB5',
 'A0A6P3IT95',
 'A0A6P5BGN1',
 'A0A383ZEH3',
 'A0A7J8G7N6',
 'A0A7J7UWQ2',
 'F7I6U1',
 'A0A8D2KAB0',
 'A0A2J8PBY8',
 'A0A8C8ZE96',
 'A0A2J8W6K2',
 'A0A8C9IIZ2',
 'P21802-21',
 'A1YYP1',
 'A0A8C6CG43',
 'A0A5F5XRU1',
 'A0A8C9EES0',
 'A0A8C8Y8N5',
 'A0A8C3XBP8',
 'A0A8D2AJV5',
 'A0A452RT42']

In [340]:
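def fetch_uniref90_for_pdb(pdb_id, chain_id):
    """
    Fetches the UniProt accessions in the UniRef90 cluster for one chain of a PDB entry.

    Parameters:
        pdb_id (str): PDB ID (e.g., '3LII')
        chain_id (str): Chain ID within the PDB entry (e.g., 'B')

    Returns:
        list: UniProt accessions of the members of the UniRef90 cluster for that chain.
    """
    # PDB ID -> isoform mappings -> canonical UniProt ID for the chosen chain
    result = fetch_best_isoform(pdb_id)
    uniprot_id = get_canonical_uniprot_id_by_chain(result, chain_id)
    # UniProt ID -> UniRef90 cluster ID -> accessions of the cluster members
    uniref90_id = fetch_uniref90_cluster_id(uniprot_id)
    uniref90_per_chain_per_pdb = uniref_cluster_ids(uniref90_id[0])
    return uniref90_per_chain_per_pdb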

In [341]:
uniref90_3ojm = fetch_uniref90_for_pdb('3ojm', 'B')

Validated PDB ID: 3ojm
URL: https://www.ebi.ac.uk/pdbe/api/v2/mappings/isoforms/3ojm
Data retrieved successfully.


In [342]:
print(uniref90_3ojm)

['P21802', 'P21803', 'A0A6J2D1E3', 'A0A8C7ANU4', 'A0A8U0SV68', 'A0A673SQ31', 'A1YYN9', 'A0A2J8PBY1', 'A0A8C0AIU6', 'A0A8C6GPM7', 'A0A4W2C5Z9', 'J9QDM6', 'P21802-2', 'A0A8U0NUB3', 'A0A8C7AP27', 'A0A667H493', 'A0A6J1ZR09', 'A0A2K5E4E2', 'A0A7J8G7X4', 'A0A7J7UWA7', 'A0A4W2GEQ6', 'A0A2K6SD92', 'A0A8C8ZGV2', 'A0A8C3X2C3', 'A0A2J8PBX3', 'A0A8D2AJZ0', 'A0A8C9DBZ1', 'A0A6J1ZNH1', 'A0A667HME2', 'A0A8C7AGV2', 'A0A8U0TDH6', 'A0A341CAK0', 'A0A2Y9N4Z5', 'A0A8B8TWI3', 'A0A9W3FLM0', 'A0A8B8VID3', 'A0A480ZAB5', 'A0A6P3IT95', 'A0A6P5BGN1', 'A0A383ZEH3', 'A0A7J8G7N6', 'A0A7J7UWQ2', 'F7I6U1', 'A0A8D2KAB0', 'A0A2J8PBY8', 'A0A8C8ZE96', 'A0A2J8W6K2', 'A0A8C9IIZ2', 'P21802-21', 'A1YYP1', 'A0A8C6CG43', 'A0A5F5XRU1', 'A0A8C9EES0', 'A0A8C8Y8N5', 'A0A8C3XBP8', 'A0A8D2AJV5', 'A0A452RT42']
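
The list above contains isoform identifiers such as `P21802-2` and `P21802-21` alongside the parent accession `P21802`. If only one entry per UniProt accession is wanted, the isoform suffix can be stripped before removing duplicates. This is a minimal sketch, not part of the original solution: both helper names are hypothetical, and the sample list is a short excerpt of the output above.

```python
def strip_isoform_suffix(accession):
    """Hypothetical helper: 'P21802-2' -> 'P21802'; plain accessions are unchanged."""
    return accession.split('-', 1)[0]

def canonical_accessions(accessions):
    """Collapse isoform IDs onto their parent accession, keeping first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(strip_isoform_suffix(acc) for acc in accessions))

# Short excerpt of the cluster members printed above
sample = ['P21802', 'P21803', 'P21802-2', 'J9QDM6', 'P21802-21']
print(canonical_accessions(sample))
```

Whether to collapse isoforms depends on the analysis: keep them when isoform-specific sequences matter, collapse them when counting distinct UniProt entries.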


# Copyright 2025 EMBL - European Bioinformatics Institute

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

**Quick Review Quiz 1**

---

🧪 **Question 1: What is a UniProt ID?**

**A UniProt ID uniquely identifies:**

A. A protein structure in the PDB  
B. A gene in the human genome  
C. A specific protein sequence from a defined organism  
D. A protein isoform across all species

✅ **Correct Answer:** C

---

🧪 **Question 2: What is the canonical sequence in UniProt?**

**The canonical sequence in a UniProtKB/Swiss-Prot entry is selected based on:**

A. The longest isoform available  
B. The most recently discovered isoform  
C. Functionality, expression, conservation, and consensus with other databases  
D. The isoform with the highest molecular weight

✅ **Correct Answer:** C

---

🧪 **Question 3: How often is UniProt updated?**

**New versions of UniProt are released:**

A. Weekly  
B. Monthly  
C. Every 2-4 months  
D. Annually

✅ **Correct Answer:** C

---

🧪 **Question 4: Which statement about isoforms is TRUE?**

A. Isoforms are only found in non-human proteins.  
B. Isoforms result from alternative splicing and may have different functions.  
C. Isoforms are not included in UniProt entries.  
D. Isoforms cannot be expressed in the same tissue or organism.

✅ **Correct Answer:** B
