<a href="https://colab.research.google.com/github/paulynamagana/afdb-api-course/blob/main/1_alphafold_api_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color='#02CCFE'> Introduction to the AlphaFold Database API</font>

<img src="https://www.embl.org/about/info/communications/wp-content/uploads/2017/09/Ebi_official_logo.png" height="100" align = "right">

Welcome to this Google Colab notebook, designed to guide you through the application of EMBL-EBI APIs for protein characterization. This session will primarily focus on interacting with the AlphaFold Database API, a key resource for accessing predicted protein structures.

This notebook will demonstrate the process of retrieving data from the AlphaFold Protein Structure Database via its Application Programming Interface (API). Detailed API endpoint documentation is available on the [AlphaFold Database website](https://alphafold.ebi.ac.uk/api-docs).

The AlphaFold Database, a collaborative effort by Google DeepMind and EMBL-EBI, has significantly advanced structural biology by providing broad access to highly accurate protein structure predictions. Developing proficiency in programmatic interaction with this database is a valuable skill for researchers working with protein sequence and structural data.

## Objectives

This notebook will cover the fundamental aspects of accessing and retrieving data from the AlphaFold Database API. You will learn to:

* Formulate API requests to search for protein structures.
* Retrieve predicted 3D coordinates and associated metadata.
* Understand the format of the API responses.
* Integrate this data into your bioinformatics workflows.

By the end of this Colab, you will have a foundational understanding of how to use the AlphaFold Database API in your research, including for large-scale exploration of protein structures.

## How to use Google Colab <a name="Quick Start"></a>
1. To run a code cell, click on the cell to select it. A play button (▶️) appears on the left side of the cell. Click the play button or press Shift+Enter to run the code in the selected cell.
2. The code will start executing, and any output is displayed below the code cell.
3. Move to the next code cell and repeat steps 1 and 2 until you have executed all the desired code cells in sequence.
4. The currently running cell is indicated by a circle with a stop sign next to it. To stop or interrupt the execution of a code cell, click the stop button (■) located next to the play button.

*Remember to run the code cells in the correct order, as their execution might depend on variables or functions defined in previous cells. You can modify the code in a code cell and re-run it to see updated results.*

# <font color='#e59454'> **1. Get response from API endpoint** </font>

This section defines a function to retrieve protein structure prediction data from the AlphaFold Database using its public API.


In [None]:
import requests #to make HTTP requests to the API
import json #to work with the JSON data returned by the API

def get_afdb_response(uniprot_id):
  """
  Fetches prediction data from the AlphaFold Database API for a given UniProt ID.

  Args:
    uniprot_id (str): The UniProt accession ID for the protein of interest. Example: "P12345"

  Returns:
    list or str: The parsed JSON response (a list of prediction entries, each a dictionary) if the request is successful (HTTP status code 200).
    Returns an error message string if the request fails.
  """

  AFDB_API_BASE_URL = "https://alphafold.ebi.ac.uk/api/prediction/"
  uniprot_id = uniprot_id.strip().upper() # remove stray whitespace and make the UniProt ID uppercase

  request_url = AFDB_API_BASE_URL + uniprot_id #construct the full URL by appending the UniProt ID to the base URL
  print(f"➡️ Sending request to: {request_url}") #to see the URL

  try:
    response = requests.get(request_url, timeout=10) #query the URL, giving up after 10 seconds
  except requests.exceptions.RequestException as e: # network problems, timeouts, etc.
    error_message = f"❌ API request failed for {uniprot_id}: {e}"
    print(error_message)
    return error_message

  if response.status_code == 200: # if the request was successful (200 means OK)
    result = response.json()
    print(f"✅ Success! Received data for {uniprot_id}")
    return result

  error_message = f"❌ API request failed for {uniprot_id} with status: {response.status_code} - {response.reason}"
  print(error_message)
  if response.status_code in (400, 404):
    print(f"ℹ️  Hint: A status code of 404/400 often means the UniProt ID '{uniprot_id}' is not available in the AlphaFold Database or the ID is incorrect.")
  return error_message # return the message, as promised in the docstring
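
Because the function sends whatever string it is given, it can help to check the accession format before making a request. The sketch below is one possible approach and is not part of the AlphaFold Database API: the helper name `looks_like_uniprot_accession` is hypothetical, and the pattern is the accession regular expression given in the UniProt help pages. A match only means the string is well-formed; it does not guarantee that the entry exists in the AlphaFold Database.

```python
import re

# Accession format as described in the UniProt help pages (6 or 10 characters).
UNIPROT_ACCESSION_PATTERN = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def looks_like_uniprot_accession(uniprot_id):
    """Return True if the string is shaped like a UniProt accession (hypothetical helper)."""
    candidate = uniprot_id.strip().upper()
    return UNIPROT_ACCESSION_PATTERN.fullmatch(candidate) is not None

# Entry names such as "STRP1_HUMAN" are not accessions and are rejected.
for candidate in ["Q5VSL9", "p12345", "STRP1_HUMAN", "not-an-id"]:
    print(candidate, looks_like_uniprot_accession(candidate))
```

A check like this could run before `get_afdb_response` is called, so that obvious typos are caught without a network round trip.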


In [None]:
# @markdown We'll use a UniProt ID to specify the protein we're interested in.

uniprot_id = "Q5VSL9" #@param {type:"string"}
get_afdb_response(uniprot_id) #call the previous function

➡️ Sending request to: https://alphafold.ebi.ac.uk/api/prediction/Q5VSL9
✅ Success! Received data for Q5VSL9


[{'entryId': 'AF-Q5VSL9-F1',
  'gene': 'STRIP1',
  'sequenceChecksum': '5F9BA1D4C7DE6925',
  'sequenceVersionDate': '2004-12-07',
  'uniprotAccession': 'Q5VSL9',
  'uniprotId': 'STRP1_HUMAN',
  'uniprotDescription': 'Striatin-interacting protein 1',
  'taxId': 9606,
  'organismScientificName': 'Homo sapiens',
  'uniprotStart': 1,
  'uniprotEnd': 837,
  'uniprotSequence': 'MEPAVGGPGPLIVNNKQPQPPPPPPPAAAQPPPGAPRAAAGLLPGGKAREFNRNQRKDSEGYSESPDLEFEYADTDKWAAELSELYSYTEGPEFLMNRKCFEEDFRIHVTDKKWTELDTNQHRTHAMRLLDGLEVTAREKRLKVARAILYVAQGTFGECSSEAEVQSWMRYNIFLLLEVGTFNALVELLNMEIDNSAACSSAVRKPAISLADSTDLRVLLNIMYLIVETVHQECEGDKAEWRTMRQTFRAELGSPLYNNEPFAIMLFGMVTKFCSGHAPHFPMKKVLLLLWKTVLCTLGGFEELQSMKAEKRSILGLPPLPEDSIKVIRNMRAASPPASASDLIEQQQKRGRREHKALIKQDNLDAFNERDPYKADDSREEEEENDDDNSLEGETFPLERDEVMPPPLQHPQTDRLTCPKGLPWAPKVREKDIEMFLESSRSKFIGYTLGSDTNTVVGLPRPIHESIKTLKQHKYTSIAEVQAQMEEEYLRSPLSGGEEEVEQVPAETLYQGLLPSLPQYMIALLKILLAAAPTSKAKTDSINILADVLPEEMPTTVLQSMKLGVDVNRHKEVIVKAISAVLLLLLKHFKLNHVYQFEYMAQHLVFANCIPLILKFFNQNIMSYI
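
The response is a list of dictionaries, one per prediction entry, so individual fields can be read by key. As a minimal sketch, the hypothetical helper `summarize_entry` below builds a one-line summary from a few of the fields shown in the output above. It runs on a trimmed, hand-copied version of that entry rather than on a live API call.

```python
def summarize_entry(entry):
    """Build a one-line summary from a single prediction entry (hypothetical helper)."""
    # .get() returns None instead of raising KeyError if a field is absent.
    start = entry.get("uniprotStart")
    end = entry.get("uniprotEnd")
    return (
        f"{entry.get('entryId')}: {entry.get('uniprotDescription')} "
        f"({entry.get('gene')}, {entry.get('organismScientificName')}), "
        f"residues {start}-{end}"
    )

# A trimmed copy of the entry shown above, so the example runs offline.
sample_response = [{
    "entryId": "AF-Q5VSL9-F1",
    "gene": "STRIP1",
    "uniprotAccession": "Q5VSL9",
    "uniprotDescription": "Striatin-interacting protein 1",
    "organismScientificName": "Homo sapiens",
    "uniprotStart": 1,
    "uniprotEnd": 837,
}]

for entry in sample_response:
    print(summarize_entry(entry))
```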

# <font color='#e59454'> **2. Programmatic file download** </font>

A key advantage of API endpoints is that data retrieval can be automated, removing the need for manual downloads. The following code cells show how to programmatically download the structure files (mmCIF and PDB, which carry the per-residue pLDDT values), the Predicted Aligned Error (PAE) image, and AlphaMissense annotations, where available, directly to a folder in your Google Drive.

In [None]:
api_response = get_afdb_response(uniprot_id) #reuse previous function
urls_to_download = {} #defined up front so it exists even if the request failed

if isinstance(api_response, list) and api_response: #check it was successful and that we have a non-empty list
  for entry in api_response: #if several entries are returned, later ones overwrite earlier ones
    print("-- Processing Entry--")

    # entry.get(...) is falsy when the field is missing or empty, so only usable URLs are kept.
    # The dictionary keys are reused later as labels in the downloaded file names.
    if entry.get("cifUrl"):
      urls_to_download["cif"] = entry["cifUrl"]
      print(f"🔗 CIF file URL: {entry['cifUrl']}")

    if entry.get("pdbUrl"):
      urls_to_download["pdbUrl"] = entry["pdbUrl"]
      print(f"🔗 PDB file URL: {entry['pdbUrl']}")

    if entry.get("paeImageUrl"):
      urls_to_download["paeImageUrl"] = entry["paeImageUrl"]
      print(f"🔗 PAE Image URL: {entry['paeImageUrl']}")

    if entry.get("amAnnotationsUrl"):
      urls_to_download["amAnnotationsUrl"] = entry["amAnnotationsUrl"]
      print(f"🔗 AlphaMissense URL: {entry['amAnnotationsUrl']}")
    else:
      print("⚠️ No AlphaMissense annotations available for this entry. Is it human?")

if not urls_to_download:
  print("No downloadable URLs found in the API response.")
else:
  print(f"Collected all urls to download: {urls_to_download}")


➡️ Sending request to: https://alphafold.ebi.ac.uk/api/prediction/Q5VSL9
✅ Success! Received data for Q5VSL9
-- Processing Entry--
🔗 CIF file URL: https://alphafold.ebi.ac.uk/files/AF-Q5VSL9-F1-model_v4.cif
🔗 PDB file URL: https://alphafold.ebi.ac.uk/files/AF-Q5VSL9-F1-model_v4.pdb
🔗 PAE Image URL: https://alphafold.ebi.ac.uk/files/AF-Q5VSL9-F1-predicted_aligned_error_v4.png
🔗 AlphaMissense URL: https://alphafold.ebi.ac.uk/files/AF-Q5VSL9-F1-aa-substitutions.csv
Collected all urls to download: {'cif': 'https://alphafold.ebi.ac.uk/files/AF-Q5VSL9-F1-model_v4.cif', 'pdbUrl': 'https://alphafold.ebi.ac.uk/files/AF-Q5VSL9-F1-model_v4.pdb', 'paeImageUrl': 'https://alphafold.ebi.ac.uk/files/AF-Q5VSL9-F1-predicted_aligned_error_v4.png', 'amAnnotationsUrl': 'https://alphafold.ebi.ac.uk/files/AF-Q5VSL9-F1-aa-substitutions.csv'}
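
The dictionary above mixes a short label (`cif`) with raw API field names (`pdbUrl`, `paeImageUrl`, `amAnnotationsUrl`). One way to keep the labels consistent is a small lookup table. The sketch below is an illustration only: `URL_FIELDS`, its labels, and `collect_download_urls` are hypothetical names, and the field names are the ones visible in the response above.

```python
# Hypothetical mapping from API field names (seen in the response above)
# to short labels of our own choosing.
URL_FIELDS = {
    "cifUrl": "cif",
    "pdbUrl": "pdb",
    "paeImageUrl": "pae_image",
    "amAnnotationsUrl": "alphamissense",
}

def collect_download_urls(entry):
    """Return {label: url} for every known URL field that is present and non-empty."""
    return {
        label: entry[field]
        for field, label in URL_FIELDS.items()
        if entry.get(field)
    }

sample_entry = {
    "cifUrl": "https://alphafold.ebi.ac.uk/files/AF-Q5VSL9-F1-model_v4.cif",
    "pdbUrl": "https://alphafold.ebi.ac.uk/files/AF-Q5VSL9-F1-model_v4.pdb",
    "amAnnotationsUrl": None,  # simulates an entry without AlphaMissense data
}
print(collect_download_urls(sample_entry))
```

Adding another file type then only requires one new line in the lookup table.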


This section provides a function designed to download the extracted files into a dedicated folder on your Google Drive. This folder will be named `"AFDB_API_course_downloads"`.

In [None]:
import os
import re
from google.colab import drive

# -- Mount Google Drive
drive.mount('/content/drive', force_remount=True) # this will mount your Google Drive

def download_files(url_dictionary):
  """Downloads every URL in url_dictionary (label -> URL) into a folder on Google Drive."""
  destination_path = "/content/drive/MyDrive/AFDB_API_course_downloads"

  if not os.path.exists(destination_path): #if the folder doesn't exist
    os.makedirs(destination_path)
    print(f"✅ Directory created: {destination_path}")
  else:
    print(f"ℹ️ Directory already exists: {destination_path}")

  uniprot_id = "unknown" # fallback so the messages below work even if no URL matches the pattern

  for file_type, url in url_dictionary.items():
    if not url: # Skip if a URL is empty or None
      print(f"🚫 Skipping {file_type} as URL is missing.")
      continue
    try:
      original_filename = url.split('/')[-1]
      file_extension = ""
      match = re.search(r'AF-([A-Z0-9]+)-F\d+', url) # extract the accession, e.g. "Q5VSL9" from "AF-Q5VSL9-F1"
      if match:
        uniprot_id = match.group(1)

      if '.' in original_filename:
        file_extension = "." + original_filename.split('.')[-1]

      # Choose a file name from the dictionary key. With the keys collected in the
      # previous cell, only "cif" matches a named branch; "pdbUrl", "paeImageUrl" and
      # "amAnnotationsUrl" use the fallback, giving names such as Q5VSL9_pdbUrl.pdb.
      if file_type == "cif":
        filename = f"{uniprot_id}_model{file_extension or '.cif'}"
      elif file_type == "pdb":
        filename = f"{uniprot_id}_model{file_extension or '.pdb'}"
      elif file_type == "pae_image":
        filename = f"{uniprot_id}_pae{file_extension or '.png'}"
      elif file_type == "alphamissense_tsv":
        filename = f"{uniprot_id}_alphamissense{file_extension or '.tsv'}" # keeps the URL's own extension when it has one
      else:
        # Fallback for other types: use the file_type and original extension
        filename = f"{uniprot_id}_{file_type}{file_extension}"

      full_file_path = os.path.join(destination_path, filename)

      file_response = requests.get(url, stream=True, timeout=30) # stream=True avoids loading the whole file into memory

      if file_response.status_code == 200:
        with open(full_file_path, 'wb') as f:
          for chunk in file_response.iter_content(chunk_size=8192): # Download in 8KB chunks
            f.write(chunk)
        print(f"  ✅ Successfully downloaded: {filename}")
      else:
        print(f"  ❌ Failed to download {file_type}. Status: {file_response.status_code} - {file_response.reason}")

    except requests.exceptions.RequestException as e:
      print(f"  🚫 Error downloading {file_type} from {url}: {e}")
    except Exception as e:
      print(f"  🛑 An unexpected error occurred with {file_type}: {e}")
    print("-" * 20) # Separator for each file download attempt

  print("="*30 + f"\n🏁 All attempted downloads for {uniprot_id} complete.")


Mounted at /content/drive
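
The naming logic inside `download_files` can also be written as a small pure function, which is easier to test because nothing is downloaded. This is a minimal sketch, assuming the URLs follow the `AF-<accession>-F<n>-...` pattern seen in the output above. `build_filename` and the `"unknown"` fallback are hypothetical choices, not part of the notebook's function.

```python
import os
import re
from urllib.parse import urlparse

def build_filename(url, label):
    """Derive '<accession>_<label><extension>' from an AlphaFold file URL."""
    original_name = os.path.basename(urlparse(url).path)
    match = re.search(r"AF-([A-Z0-9]+)-F\d+", original_name)
    accession = match.group(1) if match else "unknown"
    extension = os.path.splitext(original_name)[1]  # e.g. ".cif"; "" if there is none
    return f"{accession}_{label}{extension}"

print(build_filename("https://alphafold.ebi.ac.uk/files/AF-Q5VSL9-F1-model_v4.cif", "model"))
print(build_filename("https://alphafold.ebi.ac.uk/files/AF-Q5VSL9-F1-aa-substitutions.csv", "alphamissense"))
```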


In [None]:
download_files(urls_to_download)

ℹ️ Directory already exists: /content/drive/MyDrive/AFDB_API_course_downloads
  ✅ Successfully downloaded: Q5VSL9_model.cif
--------------------
  ✅ Successfully downloaded: Q5VSL9_pdbUrl.pdb
--------------------
  ✅ Successfully downloaded: Q5VSL9_paeImageUrl.png
--------------------
  ✅ Successfully downloaded: Q5VSL9_amAnnotationsUrl.csv
--------------------
🏁 All attempted downloads for Q5VSL9 complete.
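
Once a PDB file is on disk, the per-residue pLDDT values can be read from it, because AlphaFold models store pLDDT in the B-factor column of each ATOM record. The sketch below parses that column with plain Python, assuming the standard fixed-width PDB layout. `read_plddt_per_residue` is a hypothetical helper, and the two sample records use made-up coordinates and scores purely to keep the example runnable offline.

```python
def read_plddt_per_residue(pdb_lines):
    """Map residue number -> pLDDT, using the C-alpha atom of each residue."""
    plddt = {}
    for line in pdb_lines:
        # Fixed-width PDB columns: atom name 13-16, residue number 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue_number = int(line[22:26])
            plddt[residue_number] = float(line[60:66])
    return plddt

# Two made-up ATOM records in PDB layout (trailing element columns omitted).
sample_lines = [
    "ATOM      2  CA  MET A   1      10.000  20.000  30.000  1.00 45.12",
    "ATOM     10  CA  GLU A   2      11.000  21.000  31.000  1.00 52.30",
]
print(read_plddt_per_residue(sample_lines))
```

For a downloaded file, the same function could be given the lines of the file, for example via `open(path).readlines()`.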
