<a href="https://colab.research.google.com/github/glevans/7ADD-workshop-2024/blob/main/variants_embl-ebi_july2025/Example_1_structural_assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**<H1>Structural assessment of a variant</H1>**
**Using PDBe-KB & 3D-Beacons to gain an understanding of genetic variants**
<img src="https://www.ebi.ac.uk/pdbe/docs_dev/logos/images/RGB/PDBe-logo-RGB_2013.png" height="300" align="right">

# Welcome to this notebook form!

To use this notebook form you can:

(1) Click on the link at the top of the page when viewing the file on GitHub

*   you will need to have a Google account
*   be logged in to Google Colab <br>(by being logged into Google account)

<br>

(2) Visit the Colaboratory page:<br>
https://colab.research.google.com/<br>
and access the following Github repository *via* the interface:<br>
https://github.com/PDBeurope/pdbe-notebooks/

<br>

(3) Visit the Colaboratory page:<br>
https://colab.research.google.com/<br>
and upload the file from the Colaboratory interface

*   you will need to have a Google account
*   be logged in to Google Colab <br>(by being logged into Google account)



---

  ## How to use this notebook form:
1. To run a code cell, click on the cell to select it. You will notice a play button (▶️) on the left side of the cell. Click on the play button or **press Shift+Enter** to run the code in the selected cell.
2. The code will start executing, and you will see the output, if any, displayed below the code cell.
3. Move to the next code cell and repeat steps 1 and 2 until you have executed all the desired code cells in sequence.
4. The currently running step is indicated by a circle with a stop sign next to it.
If you need to stop or interrupt the execution of a code cell, you can click on the stop button (■) located next to the play button.
5. The code behind the form can be hidden by selecting View → Show/hide code or using the toolbar above the selected code cell.
6. The cells in this notebook are primarily forms with pulldown options. In a few cases they run small bits of code to generate a visual.

*Remember to run the code cells in the correct order, as their execution might depend on variables or functions defined in previous cells. You can modify the code in a code cell and re-run it to see updated results.*

<br>

---

## Contact us

If you experience any bugs please contact pdbehelp@ebi.ac.uk and put "Help with" and the title of the notebook form in the subject line of the message.


# Example 1

**Presenilin-1** is a protein subunit of the gamma-secretase complex.

<br>

A feature from the protein perspective is: <br>
InterPro Protein Family **Presenilin/signal peptide peptidase** ID [IPR006639](https://www.ebi.ac.uk/interpro/entry/InterPro/IPR006639/protein/reviewed/#table).

*   Gene **PSEN1** encodes protein **Presenilin-1** (UniProt ID: [P49768](https://www.uniprot.org/uniprotkb/P49768/entry))
*   Gene **PSEN2** encodes protein **Presenilin-2** (UniProt ID: [P49810](https://www.uniprot.org/uniprotkb/P49810/entry))

<br>

In this notebook we will focus on: <br>
**Presenilin-1** <br>(UniProt ID: P49768)

By examining the subpages of [PDBe-KB](https://www.ebi.ac.uk/pdbe/pdbe-kb/), we can narrow down the possible causes that make a variant <i>likely pathogenic or pathogenic</i>.

---

*   **Ligands**
*   **Interactions**

In [None]:
# @title Initial Questions
# @markdown
# @markdown
UniProt_ID = 'P49768' # @param {type:"string"}
# @markdown
# @markdown ---
# @markdown Do we have 3D-structural experimental data for ligands as potential reasons for a variant's pathogenicity?
Ligands_pos = 'Yes' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Do we have 3D-structural experimental data for metals as potential reasons for a variant's pathogenicity?
Metals_pos = 'No' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Do we have 3D-structural experimental data for complexes as potential reasons for a variant's pathogenicity?
Complexes_pos = 'Yes' # @param ["Input Answer","Yes", "No"]
# @markdown ---

# Setup

In [None]:
# @title **Press shift + enter to install packages**

import io
from IPython.display import display

!pip install --upgrade pip

!pip install ipywidgets
import ipywidgets as widgets

!pip install ColabTurtlePlus
from ColabTurtlePlus.Turtle import *

print()
print("Successfully installed!")

Collecting pip
  Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.1.1
Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.19.2
Collecting ColabTurtlePlus
  Downloading ColabTurtlePlus-2.0.1-py3-none-any.whl.metadata (10 kB)
Downloading ColabTurtlePlus-2.0.1-py3-none-any.whl (31 kB)
Installing collec

In [None]:
# @title **Press shift + enter to install more packages**

import pandas as pd
import json
import csv
import os

print("Successfully installed!")

Successfully installed!


# 3D structures

There are various ways to obtain the information below.

We have another Notebook, the [Structures Available Notebook](https://colab.research.google.com/github/PDBeurope/pdbe-notebooks/blob/main/variants_embl-ebi_may2024/structures_available.ipynb), to aid in finding the information quickly. For this activity section 3.4 is the most useful.

Alternatively, a combination of querying and examining contents on the webpages of [3D Beacons](https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/) and [PDBe-KB](https://www.ebi.ac.uk/pdbe/pdbe-kb/) should enable the same information to be obtained.
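
As a rough sketch of the programmatic route, the counts asked for in the next form could be tallied from a 3D-Beacons summary response. The endpoint URL pattern and the JSON field names (`structures`, `summary`, `model_category`) are assumptions based on the public 3D-Beacons API and should be checked against its current documentation. The sample payload is a made-up toy, not real data for P49768.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Assumed URL pattern for the 3D-Beacons summary endpoint; verify before use.
SUMMARY_URL = (
    "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{uniprot_id}.json"
)

def fetch_summary(uniprot_id):
    """Download the summary JSON for one UniProt accession (needs network access)."""
    with urlopen(SUMMARY_URL.format(uniprot_id=uniprot_id)) as response:
        return json.load(response)

def count_by_category(payload):
    """Tally models per `model_category` (field names are assumed)."""
    return Counter(
        entry["summary"]["model_category"]
        for entry in payload.get("structures", [])
    )

# Toy payload with invented identifiers, used only to show the shape of the data.
toy_payload = {
    "structures": [
        {"summary": {"model_identifier": "toy1", "model_category": "EXPERIMENTALLY DETERMINED"}},
        {"summary": {"model_identifier": "toy2", "model_category": "EXPERIMENTALLY DETERMINED"}},
        {"summary": {"model_identifier": "toy3", "model_category": "TEMPLATE-BASED"}},
    ]
}
toy_counts = count_by_category(toy_payload)
print(dict(toy_counts))
```

In a live session, `count_by_category(fetch_summary("P49768"))` would replace the toy payload, provided the assumed endpoint and fields hold.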

In [None]:
# @title **Structural data availability**

UniProt_ID = 'P49768' # @param {type:"string"}
# @markdown
# @markdown ---
# @markdown How many **structures (both predicted and experimentally-determined)** are available for this UniProt ID?
No_of_all_available = 34 # @param {type:"number"}
# @markdown HINT: Use **3D-Beacons** query
# @markdown
# @markdown ---
# @markdown How many **predicted structures** (from *AlphaFoldDB*, *Swiss-Model*, *ModelArchive*, etc) are available for this UniProt ID?
No_of_predicted_available = 8 # @param {type:"number"}
# @markdown HINT: Use **3D-Beacons** query
# @markdown
# @markdown ---
# @markdown How many **experimentally-determined structures** (from *PDB* & *Small angle scattering Biological Data Bank (SASBDB)*) are available for this UniProt ID?
No_of_exp_available = 26 # @param {type:"number"}
# @markdown HINT: Use **3D-Beacons**
# @markdown
# @markdown ---
# @markdown How many **experimentally-determined structures (with PDB ids)** are available for this UniProt ID?
No_of_PDB_available = 26 # @param {type:"number"}
# @markdown HINT: Use either **3D-Beacons** or **PDBe-KB**
# @markdown
# @markdown ---
# @markdown How many experimentally-determined PDB structures are available for this UniProt ID with a **resolution better than 3 Ångstrom**?
No_of_3A_and_better_available = 14 # @param {type:"number"}
# @markdown HINT: Use **PDBe-KB**
# @markdown

if No_of_exp_available > 0:
  EXPERIMENTALLY_DETERMINED_STRUCTURES = 'Yes'
else:
  EXPERIMENTALLY_DETERMINED_STRUCTURES = 'No'

# Print the summary
print("The following is a summary based on what has been filled-in so far.")
print()
print(f"There are {No_of_all_available} structures (predicted and experimentally determined) available associated with UniProt ID {UniProt_ID}.")
if EXPERIMENTALLY_DETERMINED_STRUCTURES == 'Yes':
  print(f"There are {No_of_exp_available} experimentally-determined structures associated with UniProt ID {UniProt_ID}.")
  print(f"There are {No_of_PDB_available} experimentally-determined PDB structures associated with UniProt ID {UniProt_ID}.")
else:
  print("There are no experimentally-determined structures available.")
print()

# Initialize an empty list to store the summaries
summaries = []

## Define the file to check
#file_to_check = "/content/struct_avail.csv"
## If the file exists, reload previously saved rows for other UniProt IDs
#if os.path.exists(file_to_check):
#    print(f"File '{file_to_check}' exists.")
#    # Load the CSV file into a DataFrame
#    df = pd.read_csv(file_to_check)
#    for _, row in df.iterrows():
#        # Keep rows for other proteins; the current UniProt ID is re-added below
#        if row['UniProt_ID'] != UniProt_ID:
#            summaries.append(row.to_dict())
#else:
#    print(f"File '{file_to_check}' does not exist.")

# Function to add a new entry to the summaries list
# (all values are passed in explicitly rather than read from globals)
def add_summary(UniProt_ID, No_of_all_available, No_of_predicted_available,
                No_of_exp_available, No_of_PDB_available, No_of_3A_and_better_available):
    summary = {
        "UniProt_ID": UniProt_ID,
        "No of predicted and exp-determined available": No_of_all_available,
        "No of predicted available": No_of_predicted_available,
        "No of exp-determined available": No_of_exp_available,
        "No of PDB available": No_of_PDB_available,
        "No of PDBs with 3A or better": No_of_3A_and_better_available
    }
    summaries.append(summary)

add_summary(UniProt_ID, No_of_all_available, No_of_predicted_available,
            No_of_exp_available, No_of_PDB_available, No_of_3A_and_better_available)

# Convert the list of summaries to a pandas DataFrame
df_summaries = pd.DataFrame(summaries)

## Output the DataFrame to a CSV file
#df_summaries.to_csv('struct_avail.csv', index=False)
#print("DataFrame has been saved to struct_avail.csv file.")
#print()

# Print the DataFrame
#display(df_summaries)

The following is a summary based on what has been filled-in so far.

There are 34 structures (predicted and experimentally determined) available associated with UniProt ID P49768.
There are 26 experimentally-determined structures associated with UniProt ID P49768.
There are 26 experimentally-determined PDB structures associated with UniProt ID P49768.
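
The numbers entered in the form can be sanity-checked against each other. The sketch below assumes that every structure is either predicted or experimentally determined (so the two counts add up to the total), that PDB entries are a subset of the experimental structures, and that the 3 Å subset cannot exceed the PDB count. `check_structure_counts` is a hypothetical helper, not part of the notebook.

```python
def check_structure_counts(total, predicted, experimental, pdb, pdb_3a_or_better):
    """Return a list of problems; an empty list means the counts agree."""
    problems = []
    if predicted + experimental != total:
        problems.append(
            f"predicted ({predicted}) + experimental ({experimental}) != total ({total})"
        )
    if pdb > experimental:
        problems.append(f"PDB count ({pdb}) exceeds experimental count ({experimental})")
    if pdb_3a_or_better > pdb:
        problems.append(
            f"3 A-or-better count ({pdb_3a_or_better}) exceeds PDB count ({pdb})"
        )
    return problems

# Values entered for P49768 in the form above.
issues = check_structure_counts(
    total=34, predicted=8, experimental=26, pdb=26, pdb_3a_or_better=14
)
print(issues or "Counts are internally consistent.")
```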



In [None]:
# @title **Assessing the available structures**

UniProt_ID = 'P49768' # @param {type:"string"}
# @markdown
# @markdown ---
# @markdown What are the PDB ids for the structural data with the highest resolution that are available for this UniProt ID?
Best_resolution = "8kcs" # @param {type:"string"}
# @markdown HINT: Use **PDBe-KB** query
# @markdown
# @markdown ---
# @markdown What are the PDB ids for the structural data with the most coverage available for this UniProt ID?
Most_coverage = "7d8x, 6lr4, 8kct, 7c9i, 7y5t, 6lqg, 8kcp, 6iyc, 6idf, 8x54, 8kcs, 8kco, 8k8e, 8im7, 8kcu, 8x53, 8oqy, 8x52, 8oqz, 5a63, 5fn4, 5fn5, 5fn3, 5fn2" # @param {type:"string"}
# @markdown HINT: Use **3D-Beacons** query
# @markdown
# @markdown ---
# @markdown What are the PDB ids for the structural data without mutations available for this UniProt ID?
Has_no_mutations = "7d8x, 6lr4, 8kct, 7c9i, 7y5t, 6lqg, 8kcp, 8x54, 8kcs, 8kco, 8k8e, 8im7, 8kcu, 8x53, 8oqy, 8x52, 5a63, 5fn4, 5fn5, 2kr6" # @param {type:"string"}
# @markdown HINT: Use **PDBe-KB** query
# @markdown
# @markdown
# @markdown ---

if ',' in Best_resolution:
  Best_resolution_list = Best_resolution.split(',')
  Best_resolution_is = "List"
else:
  Best_resolution = Best_resolution
  Best_resolution_is = "Single"

if ',' in Most_coverage:
  Most_coverage_list = Most_coverage.split(',')
  Most_coverage_is = "List"
else:
  Most_coverage = Most_coverage
  Most_coverage_is = "Single"

if ',' in Has_no_mutations:
  Has_no_mutations_list = Has_no_mutations.split(',')
  Has_no_mutations_is = "List"
else:
  Has_no_mutations = Has_no_mutations
  Has_no_mutations_is = "Single"


# Print the assessment
print("The following is a summary based on what has been filled-in so far.")
print()
if Best_resolution_is == 'Single':
  print(f"There is a single structure with the best resolution associated with UniProt ID {UniProt_ID}.")
  print(f"The PDB ID is {Best_resolution} for the best resolution structure.")
else:
  print(f"There are multiple structures with the best resolution associated with UniProt ID {UniProt_ID}.")
  print(f"The PDB IDs of {Best_resolution_list} correspond to the best resolution structures.")
print()
if Most_coverage_is == 'Single':
  print(f"There is a single structure with the most coverage of the UniProt ID {UniProt_ID}.")
  print(f"The PDB ID is {Most_coverage} for the structure with most coverage of the UniProt sequence.")
else:
  print(f"There are multiple structures with the most coverage of the UniProt ID {UniProt_ID}.")
  print(f"The PDB IDs of {Most_coverage_list} correspond to the structures with most coverage of the UniProt sequence.")
print()
if Has_no_mutations_is == 'Single':
  print(f"There is a single structure with no mutations in sample sequence associated with UniProt ID {UniProt_ID}.")
  print(f"The PDB ID is {Has_no_mutations} for a structure with no mutations.")
else:
  print(f"There are multiple structures with no mutations in sample sequence associated with UniProt ID {UniProt_ID}.")
  print(f"The PDB IDs of {Has_no_mutations_list} correspond to structures with no mutations.")
print()

# Initialize an empty list to store the assessments
assessments = []

## Define the file to check
#file_to_check = "/content/struct_assess.csv"
## If the file exists, reload previously saved rows for other UniProt IDs
#if os.path.exists(file_to_check):
#    print(f"File '{file_to_check}' exists.")
#    # Load the CSV file into a DataFrame
#    df = pd.read_csv(file_to_check)
#    for _, row in df.iterrows():
#        # Keep rows for other proteins; the current UniProt ID is re-added below
#        if row['UniProt_ID'] != UniProt_ID:
#            assessments.append(row.to_dict())
#else:
#    print(f"File '{file_to_check}' does not exist.")

# Function to add a new entry to the assessments list
def add_assessment(UniProt_ID, Best_resolution, Most_coverage, Has_no_mutations):
    def convert_to_list(input_str):
        # Check if a comma is present in the string
        if ',' in input_str:
            # Split the string by commas and return the resulting list
            return input_str.split(',')
        else:
            # Return the original string if no comma is present
            return input_str
    assessment = {
        "UniProt_ID": UniProt_ID,
        "PDB id(s) with best resolution": convert_to_list(Best_resolution),
        "PDB id(s) most coverage": convert_to_list(Most_coverage),
        "PDB id(s) without any mutations": convert_to_list(Has_no_mutations)
    }
    assessments.append(assessment)

add_assessment(UniProt_ID, Best_resolution, Most_coverage, Has_no_mutations)

# Convert the list of assessments to a pandas DataFrame
df_assessments = pd.DataFrame(assessments)

## Output the DataFrame to a CSV file
#df_assessments.to_csv('struct_assess.csv', index=False)
#print("DataFrame has been saved to struct_assess.csv file.")
#print()

# Print the DataFrame
#display(df_assessments)

The following is a summary based on what has been filled-in so far.

There is a single structure with the best resolution associated with UniProt ID P49768.
The PDB ID is 8kcs for the best resolution structure.

There are multiple structures with the most coverage of the UniProt ID P49768.
The PDB IDs of ['7d8x', ' 6lr4', ' 8kct', ' 7c9i', ' 7y5t', ' 6lqg', ' 8kcp', ' 6iyc', ' 6idf', ' 8x54', ' 8kcs', ' 8kco', ' 8k8e', ' 8im7', ' 8kcu', ' 8x53', ' 8oqy', ' 8x52', ' 8oqz', ' 5a63', ' 5fn4', ' 5fn5', ' 5fn3', ' 5fn2'] correspond to the structures with most coverage of the UniProt sequence.

There are multiple structures with no mutations in sample sequence associated with UniProt ID P49768.
The PDB IDs of ['7d8x', ' 6lr4', ' 8kct', ' 7c9i', ' 7y5t', ' 6lqg', ' 8kcp', ' 8x54', ' 8kcs', ' 8kco', ' 8k8e', ' 8im7', ' 8kcu', ' 8x53', ' 8oqy', ' 8x52', ' 5a63', ' 5fn4', ' 5fn5', ' 2kr6'] correspond to structures with no mutations.
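
The lists printed above keep the space that followed each comma (for example `' 6lr4'`), because the form input is split on `,` alone. A minimal sketch of a more forgiving parser follows; `parse_pdb_ids` is a hypothetical helper, not part of the notebook, and lower-casing the IDs is a design choice made here for easy comparison.

```python
def parse_pdb_ids(text):
    """Split a comma-separated string into lower-case PDB IDs without surrounding spaces."""
    return [item.strip().lower() for item in text.split(",") if item.strip()]

ids = parse_pdb_ids("7d8x, 6lr4, 8kct")
print(ids)  # ['7d8x', '6lr4', '8kct']
```

Because the helper always returns a list, a single ID such as `"8kcs"` needs no special "Single" versus "List" handling.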



In [None]:
# @title **Structure summary**

UniProt_ID = 'P49768' # @param {type:"string"}
# @markdown
# @markdown ---
##

filter_value = UniProt_ID

# Filter DataFrames based on the 'UniProt_ID' column value
filtered_df_summaries = df_summaries[df_summaries['UniProt_ID'] == filter_value]
filtered_df_assessments = df_assessments[df_assessments['UniProt_ID'] == filter_value]

# Merge filtered DataFrames based on shared value in 'UniProt_ID' column
df_structures = pd.merge(filtered_df_summaries, filtered_df_assessments, on='UniProt_ID')

# Transpose the resulting DataFrame
df_structures_transposed = df_structures.T


# Left-align the index
df_formatted = df_structures_transposed.style.set_table_styles(
    [{'selector': 'th.row_heading', 'props': [('text-align', 'left')]}],
)

print("Table 1: General Structure Availability")
display(df_formatted)

Table 1: General Structure Availability


Unnamed: 0,0
UniProt_ID,P49768
No of predicted and exp-determined available,34
No of predicted available,8
No of exp-determined available,26
No of PDB available,26
No of PDBs with 3A or better,14
PDB id(s) with best resolution,8kcs
PDB id(s) most coverage,"['7d8x', ' 6lr4', ' 8kct', ' 7c9i', ' 7y5t', ' 6lqg', ' 8kcp', ' 6iyc', ' 6idf', ' 8x54', ' 8kcs', ' 8kco', ' 8k8e', ' 8im7', ' 8kcu', ' 8x53', ' 8oqy', ' 8x52', ' 8oqz', ' 5a63', ' 5fn4', ' 5fn5', ' 5fn3', ' 5fn2']"
PDB id(s) without any mutations,"['7d8x', ' 6lr4', ' 8kct', ' 7c9i', ' 7y5t', ' 6lqg', ' 8kcp', ' 8x54', ' 8kcs', ' 8kco', ' 8k8e', ' 8im7', ' 8kcu', ' 8x53', ' 8oqy', ' 8x52', ' 5a63', ' 5fn4', ' 5fn5', ' 2kr6']"
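
A natural next step is to shortlist the structures that appear in both the "most coverage" and the "without any mutations" answers. The sketch below uses the IDs entered in the forms above; the variable names are illustrative and the approach (a simple set membership test) is only one way to combine the criteria.

```python
most_coverage = [
    "7d8x", "6lr4", "8kct", "7c9i", "7y5t", "6lqg", "8kcp", "6iyc", "6idf",
    "8x54", "8kcs", "8kco", "8k8e", "8im7", "8kcu", "8x53", "8oqy", "8x52",
    "8oqz", "5a63", "5fn4", "5fn5", "5fn3", "5fn2",
]
no_mutations = [
    "7d8x", "6lr4", "8kct", "7c9i", "7y5t", "6lqg", "8kcp", "8x54", "8kcs",
    "8kco", "8k8e", "8im7", "8kcu", "8x53", "8oqy", "8x52", "5a63", "5fn4",
    "5fn5", "2kr6",
]
best_resolution = "8kcs"

# Keep the order of the coverage list while filtering on the mutation-free set.
no_mutation_set = set(no_mutations)
candidates = [pdb_id for pdb_id in most_coverage if pdb_id in no_mutation_set]

print(f"{len(candidates)} structures have both most coverage and no mutations.")
print(f"Best-resolution entry {best_resolution} is among them: {best_resolution in candidates}")
```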


# Variants

In [None]:
# @title **Input variants**
# @markdown
# @markdown Put variants being considered in textbox below.
# @markdown
# @markdown HINT: Use the format **(1st)** position in sequence, then **(2nd)** amino acid change, *e.g.* `6 K/M`.

# @markdown ---
##

text_input = widgets.Text(
    value='',
    placeholder='e.g. 6  K/M',
    description='Input:',
    disabled=False
)

display(text_input)

def on_text_change(change):
    print(f'You entered: {change["new"]}')

text_input.observe(on_text_change, names='value')

Text(value='', description='Input:', placeholder='e.g. 6  K/M')

You entered: 435 L/F


In [None]:
# @title **Press shift + enter to convert input to table**

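# Use a dedicated name so the built-in input() function is not shadowed
variant_text = text_input.value
print(f"Input is: {variant_text}")
# Split the string into parts, assuming whitespace separates the fields
parts = variant_text.split()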

# Create a list of tuples (residue_number, amino_acid_change)
# Iterate through the parts in steps of 2, taking the residue number (first part) and the amino acid change (second part)
data = [(int(parts[i]), parts[i+1]) for i in range(0, len(parts), 2)]

# Create DataFrame
df_variants = pd.DataFrame(data, columns=["residue_number", "amino_acid_change"])

# Split the 'amino_acid_change' column into two new columns
df_variants[["original_aa", "mutated_aa"]] = df_variants["amino_acid_change"].str.split("/", expand=True)

# Optional: Reorder or drop columns if needed
df_variants = df_variants[["original_aa", "mutated_aa", "residue_number"]]

# Dictionary for one-letter to three-letter amino acid codes
aa_dict = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL"
}

# Map the one-letter codes to three-letter codes
df_variants["original_aa"] = df_variants["original_aa"].map(aa_dict)
df_variants["mutated_aa"] = df_variants["mutated_aa"].map(aa_dict)

# Add a column with row numbers starting from 1
df_variants.insert(0, "no.", range(1, len(df_variants) + 1))

display(df_variants)

Input is: 435 L/F


Unnamed: 0,no.,original_aa,mutated_aa,residue_number
0,1,LEU,PHE,435
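
The conversion above assumes well-formed input with an even number of whitespace-separated fields. As a hedged alternative, the sketch below validates each `position original/mutated` pair with a regular expression; `parse_variants` and `VARIANT_PATTERN` are hypothetical names, and text that does not match the pattern is silently skipped.

```python
import re

# One variant looks like "435 L/F": digits, whitespace, one letter, "/", one letter.
VARIANT_PATTERN = re.compile(r"(\d+)\s+([A-Z])/([A-Z])")
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

def parse_variants(text):
    """Return (position, original, mutated) tuples; raise on unknown residue codes."""
    variants = []
    for position, original, mutated in VARIANT_PATTERN.findall(text.upper()):
        if original not in VALID_AA or mutated not in VALID_AA:
            raise ValueError(
                f"Unknown amino acid code in '{position} {original}/{mutated}'"
            )
        variants.append((int(position), original, mutated))
    return variants

parsed = parse_variants("435 L/F")
print(parsed)  # [(435, 'L', 'F')]
```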


## BLOSUM62 Matrix
<img src="https://upload.wikimedia.org/wikipedia/commons/f/f5/Blosum62-dayhoff-ordering.svg" height="500" align="right">

The BLOSUM62 (BLOcks SUbstitution Matrix) is a substitution matrix that can be used in alignment of protein sequences (*e.g.* [BLASTp](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) and [UniProt's BLAST](https://www.uniprot.org/blast)). It was generated from analysis of many protein sequences.

<br>

Positive values indicate frequent exchanges between amino acid pairs during evolution, negative values indicate amino acid pairs that rarely replace each other. The colouring in the image of BLOSUM62 is from Margaret Dayhoff's amino acid classification. Together these highlight that amino acids can be clustered in terms of more and less similar.

<br>

For an overview of the development of BLOSUM, click [here](https://en.wikipedia.org/wiki/BLOSUM)<br>
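
To make the lookup concrete, the sketch below stores a small excerpt of BLOSUM62 scores and reads them symmetrically. Only a handful of pairs are included, so this is an illustration rather than a replacement for the full 20 x 20 matrix (which is available, for instance, through Biopython's `Bio.Align.substitution_matrices.load("BLOSUM62")`). `blosum62_score` is a hypothetical helper.

```python
# Excerpt of BLOSUM62 scores for a few residue pairs (not the full matrix).
BLOSUM62_EXCERPT = {
    ("L", "L"): 4, ("L", "I"): 2, ("L", "M"): 2, ("L", "V"): 1, ("L", "F"): 0,
    ("F", "Y"): 3, ("D", "E"): 2, ("K", "R"): 2,
}

def blosum62_score(aa1, aa2, table=BLOSUM62_EXCERPT):
    """Look up a symmetric substitution score; None if the pair is not in the excerpt."""
    if (aa1, aa2) in table:
        return table[(aa1, aa2)]
    return table.get((aa2, aa1))

print(blosum62_score("L", "F"))  # 0
print(blosum62_score("I", "L"))  # 2
```

A score of 0 for leucine to phenylalanine means the exchange is neither favoured nor disfavoured by the matrix.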

<br>




## **Table of Margaret Dayhoff's encoding of amino acids**

|                     Amino acids                      | 1-letter code  |      3-letter code       |        Property        | Dayhoff  |
|:----------------------------------------------------:|:--------------:|:------------------------:|:----------------------:|:--------:|
| Cysteine                                             | C              | Cys                      | Sulphur polymerization  | a        |
| Glycine, Serine, Threonine, Alanine, Proline         | G, S, T, A, P  | Gly, Ser, Thr, Ala, Pro  | Small                  | b        |
| Aspartic acid, Glutamic acid, Asparagine, Glutamine  | D, E, N, Q     | Asp, Glu, Asn, Gln       | Acid and amide         | c        |
| Arginine, Histidine, Lysine                          | R, H, K        | Arg, His, Lys            | Basic                  | d        |
| Leucine, Valine, Methionine, Isoleucine              | L, V, M, I     | Leu, Val, Met, Ile       | Hydrophobic            | e        |
| Tyrosine, Phenylalanine, Tryptophan                  | Y, F, W        | Tyr, Phe, Trp            | Aromatic               | f        |

In [None]:
# @title **Press shift + enter to add Dayhoff information to table**

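# Define Dayhoff groups, numbered to match classes a-f in the table above:
# 1 = a (sulphur polymerization), 2 = b (small), 3 = c (acid and amide),
# 4 = d (basic), 5 = e (hydrophobic), 6 = f (aromatic)
dayhoff_groups = {
    'CYS': '1',
    'GLY': '2', 'ALA': '2', 'SER': '2', 'THR': '2', 'PRO': '2',
    'ASP': '3', 'GLU': '3', 'ASN': '3', 'GLN': '3',
    'ARG': '4', 'LYS': '4', 'HIS': '4',
    'MET': '5', 'ILE': '5', 'LEU': '5', 'VAL': '5',
    'PHE': '6', 'TYR': '6', 'TRP': '6'
}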


# Map amino acids to Dayhoff groups
df_variants['Dayhoff_for_original_aa'] = df_variants['original_aa'].map(dayhoff_groups)
df_variants['Dayhoff_for_mutated_aa'] = df_variants['mutated_aa'].map(dayhoff_groups)

# Compare groups
df_variants['SameGroup'] = df_variants['Dayhoff_for_original_aa'] == df_variants['Dayhoff_for_mutated_aa']
df_variants['SameGroup'] = df_variants['SameGroup'].map({True: 'Yes', False: 'No'})

display(df_variants)

Unnamed: 0,no.,original_aa,mutated_aa,residue_number,Dayhoff_for_original_aa,Dayhoff_for_mutated_aa,SameGroup
0,1,LEU,PHE,435,5,6,No


In [None]:
# @title Questions
# @markdown
# @markdown
# @markdown ---
# @markdown What is the BLOSUM62 value for the canonical to variant change?
BLOSUM62_value = 0 # @param {type:"number"}
# @markdown ---
# @markdown If the canonical and variant amino acids are in the same Dayhoff group, which one?
Dayhoff_group = 'Not_same_group' # @param ["Not_same_group", "Sulfur_polymerization","Small", "Acid_and_amide", "Basic", "Hydrophobic", "Aromatic"]
# @markdown ---

# Amino acid details

---


Not all genetic variants result in an amino acid change, but here we will focus on the genetic variants that do cause an amino acid change (missense variants).

We will be contrasting the amino acid present in the genetic variant, with the canonical amino acid. A canonical amino acid is the one most commonly observed at a particular position in the protein sequence.

In [None]:
# @title **Amino acid change input if not using table**
# @markdown
# @markdown Not all genetic variants result in an amino acid change, but here we will focus on the genetic variants that do cause an amino acid change (missense variants).
# @markdown
# @markdown
# @markdown We will be contrasting the amino acid present in the genetic variant, with the canonical amino acid.
# @markdown A canonical amino acid is the one most commonly observed at a particular position in the protein sequence.
# @markdown
# @markdown **Pressing run / shift-enter** on this section captures the identity of the amino acids and sets up the code for the subsequent visualisations on amino acid type.
# @markdown
# @markdown ---
# @markdown For the variant of interest, <br>what was the canonical amino acid?
Canonical_Amino_Acid = 'Select the amino acid' # @param ['Select the amino acid', 'GLY (G)', 'ALA (A)', 'VAL (V)', 'LEU (L)', 'ILE (I)', 'THR (T)', 'SER (S)', 'MET (M)', 'CYS (C)', 'PRO (P)', 'PHE (F)', 'TYR (Y)', 'TRP (W)', 'HIS (H)', 'LYS (K)', 'ARG (R)', 'ASP (D)', 'GLU (E)', 'ASN (N)', 'GLN (Q)']
# @markdown ---
# @markdown For the variant of interest, <br>what was its position in the amino acid sequence (in UniProt numbering)?
Variant_Position = 0 # @param {type:"number"}
# @markdown ---
# @markdown For the variant of interest, <br>what was the new amino acid resulting from the genetic variant?
New_Amino_Acid = 'Select the amino acid' # @param ['Select the amino acid', 'GLY (G)', 'ALA (A)', 'VAL (V)', 'LEU (L)', 'ILE (I)', 'THR (T)', 'SER (S)', 'MET (M)', 'CYS (C)', 'PRO (P)', 'PHE (F)', 'TYR (Y)', 'TRP (W)', 'HIS (H)', 'LYS (K)', 'ARG (R)', 'ASP (D)', 'GLU (E)', 'ASN (N)', 'GLN (Q)']
# @markdown ---

####### COLOURS
#defining colour variables based on amino acids identity

if Canonical_Amino_Acid == 'CYS (C)' or Canonical_Amino_Acid == 'PRO (P)' or Canonical_Amino_Acid == 'GLY (G)':
    colour1 = '#cccc00'
    colourword1 = 'yellow'
elif Canonical_Amino_Acid == 'SER (S)' or Canonical_Amino_Acid == 'THR (T)' or Canonical_Amino_Acid == 'ASN (N)' or Canonical_Amino_Acid == 'GLN (Q)':
    colour1 = '#cc99ff'
    colourword1 = 'pink-purple'
elif Canonical_Amino_Acid == 'ARG (R)' or Canonical_Amino_Acid == 'HIS (H)' or Canonical_Amino_Acid == 'LYS (K)' or Canonical_Amino_Acid == 'ASP (D)' or Canonical_Amino_Acid == 'GLU (E)':
    colour1 = '#9999ff'
    colourword1 = 'blue-purple'
elif Canonical_Amino_Acid == 'ALA (A)' or Canonical_Amino_Acid == 'VAL (V)' or Canonical_Amino_Acid == 'ILE (I)' or Canonical_Amino_Acid == 'LEU (L)':
    colour1 = '#99ff66'
    colourword1 = 'green'
elif Canonical_Amino_Acid == 'MET (M)' or Canonical_Amino_Acid == 'PHE (F)' or Canonical_Amino_Acid == 'TYR (Y)' or Canonical_Amino_Acid == 'TRP (W)':
    colour1 = '#99ff66'
    colourword1 = 'green'
else:
    colour1 = '#b3b3b3'
    colourword1 = 'grey'

if New_Amino_Acid == 'CYS (C)' or New_Amino_Acid == 'PRO (P)' or New_Amino_Acid == 'GLY (G)':
    colour2 = '#cccc00'
    colourword2 = 'yellow'
elif New_Amino_Acid == 'SER (S)' or New_Amino_Acid == 'THR (T)' or New_Amino_Acid == 'ASN (N)' or New_Amino_Acid == 'GLN (Q)':
    colour2 = '#cc99ff'
    colourword2 = 'pink-purple'
elif New_Amino_Acid == 'ARG (R)' or New_Amino_Acid == 'HIS (H)' or New_Amino_Acid == 'LYS (K)' or New_Amino_Acid == 'ASP (D)' or New_Amino_Acid == 'GLU (E)':
    colour2 = '#9999ff'
    colourword2 = 'blue-purple'
elif New_Amino_Acid == 'ALA (A)' or New_Amino_Acid == 'VAL (V)' or New_Amino_Acid == 'ILE (I)' or New_Amino_Acid == 'LEU (L)':
    colour2 = '#99ff66'
    colourword2 = 'green'
elif New_Amino_Acid == 'MET (M)' or New_Amino_Acid == 'PHE (F)' or New_Amino_Acid == 'TYR (Y)' or New_Amino_Acid == 'TRP (W)':
    colour2 = '#99ff66'
    colourword2 = 'green'
else:
    colour2 = '#b3b3b3'
    colourword2 = 'grey'

if Canonical_Amino_Acid == 'CYS (C)' or Canonical_Amino_Acid == 'ASP (D)' or Canonical_Amino_Acid == 'GLU (E)' or Canonical_Amino_Acid == 'THR (T)':
    colour3 = 'red'
    colourword3 = 'red'
elif Canonical_Amino_Acid == 'SER (S)' or Canonical_Amino_Acid == 'TYR (Y)':
    colour3 = '#ff6666'
    colourword3 = 'light red'
elif Canonical_Amino_Acid == 'ARG (R)' or Canonical_Amino_Acid == 'LYS (K)':
    colour3 = 'blue'
    colourword3 = 'blue'
elif  Canonical_Amino_Acid == 'HIS (H)' or Canonical_Amino_Acid == 'TRP (W)' or Canonical_Amino_Acid == 'ASN (N)' or Canonical_Amino_Acid == 'GLN (Q)':
    colour3 = '#6699ff'
    colourword3 = 'light blue'
elif Canonical_Amino_Acid == 'ALA (A)' or Canonical_Amino_Acid == 'VAL (V)' or Canonical_Amino_Acid == 'ILE (I)' or Canonical_Amino_Acid == 'LEU (L)':
    colour3 = '#b3b3b3'
    colourword3 = 'grey'
elif Canonical_Amino_Acid == 'MET (M)' or Canonical_Amino_Acid == 'PHE (F)' or Canonical_Amino_Acid == 'PRO (P)':
    colour3 = '#b3b3b3'
    colourword3 = 'grey'
else:
    colour3 = '#b3b3b3'
    colourword3 = 'grey'

if New_Amino_Acid == 'CYS (C)' or New_Amino_Acid == 'ASP (D)' or New_Amino_Acid == 'GLU (E)' or New_Amino_Acid == 'THR (T)':
    colour4 = 'red'
    colourword4 = 'red'
elif New_Amino_Acid == 'SER (S)' or New_Amino_Acid == 'TYR (Y)':
    colour4 = '#ff6666'
    colourword4 = 'light red'
elif New_Amino_Acid == 'ARG (R)' or New_Amino_Acid == 'LYS (K)':
    colour4 = 'blue'
    colourword4 = 'blue'
elif  New_Amino_Acid == 'HIS (H)' or New_Amino_Acid == 'TRP (W)' or New_Amino_Acid == 'ASN (N)' or New_Amino_Acid == 'GLN (Q)':
    colour4 = '#6699ff'
    colourword4 = 'light blue'
elif New_Amino_Acid == 'ALA (A)' or New_Amino_Acid == 'VAL (V)' or New_Amino_Acid == 'ILE (I)' or New_Amino_Acid == 'LEU (L)':
    colour4 = '#b3b3b3'
    colourword4 = 'grey'
elif New_Amino_Acid == 'MET (M)' or New_Amino_Acid == 'PHE (F)' or New_Amino_Acid == 'PRO (P)':
    colour4 = '#b3b3b3'
    colourword4 = 'grey'
else:
    colour4 = '#b3b3b3'
    colourword4 = 'grey'
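
The first colour scheme above (used for `colour1` and `colour2`) can also be written as a lookup table, which keeps the class membership in one place. This is a sketch of an equivalent structure, not the notebook's own code; `CLASS_COLOURS` and `colour_for` are hypothetical names, and the fallback is named 'grey' here as an assumption.

```python
# (hex colour, colour name) -> amino acids drawn in that colour
CLASS_COLOURS = {
    ("#cccc00", "yellow"): ["CYS (C)", "PRO (P)", "GLY (G)"],
    ("#cc99ff", "pink-purple"): ["SER (S)", "THR (T)", "ASN (N)", "GLN (Q)"],
    ("#9999ff", "blue-purple"): ["ARG (R)", "HIS (H)", "LYS (K)", "ASP (D)", "GLU (E)"],
    ("#99ff66", "green"): [
        "ALA (A)", "VAL (V)", "ILE (I)", "LEU (L)",
        "MET (M)", "PHE (F)", "TYR (Y)", "TRP (W)",
    ],
}
# Invert the table so each amino acid maps directly to its colour pair.
AA_COLOUR = {aa: colour for colour, members in CLASS_COLOURS.items() for aa in members}
DEFAULT_COLOUR = ("#b3b3b3", "grey")  # used when no amino acid has been selected

def colour_for(amino_acid):
    """Return (hex colour, colour name) for a label such as 'LEU (L)'."""
    return AA_COLOUR.get(amino_acid, DEFAULT_COLOUR)

example_colour, example_word = colour_for("LEU (L)")
print(example_colour, example_word)
```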

In [None]:
# @title **View the Variants table**

display(df_variants)

Unnamed: 0,no.,original_aa,mutated_aa,residue_number,Dayhoff_for_original_aa,Dayhoff_for_mutated_aa,SameGroup
0,1,LEU,PHE,435,5,6,No


In [None]:
# @title **Select a variant from the table for further analysis**

UniProt_ID = 'P49768' # @param {type:"string"}
# @markdown
# @markdown ---
# @markdown Indicate the row number (no.) of the variant to further explore:
row_number = 1 # @param {type:"number"}

# Function to get values from a specific row by 'no.'
def get_row_values(row_no):
    row = df_variants[df_variants["no."] == row_no]
    if not row.empty:
        original = row["original_aa"].values[0]
        mutated = row["mutated_aa"].values[0]
        residue = row["residue_number"].values[0]
        return original, mutated, residue
    else:
        return None, None, None

Canonical_Amino_Acid, New_Amino_Acid, Variant_Position = get_row_values(row_number)

# Function to format an amino acid as "XXX (X)", e.g. "LEU (L)"
def format_aa(aa):
    aa_3to1 = {
        "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
        "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
        "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
        "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V"
    }
    # Reverse lookup so that one-letter codes are also accepted
    aa_1to3 = {one: three for three, one in aa_3to1.items()}
    if aa in aa_3to1:
        # Three-letter code given, e.g. "LEU" -> "LEU (L)"
        return f"{aa} ({aa_3to1[aa]})"
    if aa in aa_1to3:
        # One-letter code given, e.g. "L" -> "LEU (L)"
        return f"{aa_1to3[aa]} ({aa})"
    # Unknown input (including None) is returned unchanged
    return aa

# Apply formatting
Canonical_Amino_Acid = format_aa(Canonical_Amino_Acid)
New_Amino_Acid = format_aa(New_Amino_Acid)

# Output result
print(f"Original: {Canonical_Amino_Acid}, Mutated: {New_Amino_Acid}, Residue Number: {Variant_Position}")

####### COLOURS
#defining colour variables based on amino acids identity

if Canonical_Amino_Acid == 'CYS (C)' or Canonical_Amino_Acid == 'PRO (P)' or Canonical_Amino_Acid == 'GLY (G)':
    colour1 = '#cccc00'
    colourword1 = 'yellow'
elif Canonical_Amino_Acid == 'SER (S)' or Canonical_Amino_Acid == 'THR (T)' or Canonical_Amino_Acid == 'ASN (N)' or Canonical_Amino_Acid == 'GLN (Q)':
    colour1 = '#cc99ff'
    colourword1 = 'pink-purple'
elif Canonical_Amino_Acid == 'ARG (R)' or Canonical_Amino_Acid == 'HIS (H)' or Canonical_Amino_Acid == 'LYS (K)' or Canonical_Amino_Acid == 'ASP (D)' or Canonical_Amino_Acid == 'GLU (E)':
    colour1 = '#9999ff'
    colourword1 = 'blue-purple'
elif Canonical_Amino_Acid == 'ALA (A)' or Canonical_Amino_Acid == 'VAL (V)' or Canonical_Amino_Acid == 'ILE (I)' or Canonical_Amino_Acid == 'LEU (L)':
    colour1 = '#99ff66'
    colourword1 = 'green'
elif Canonical_Amino_Acid == 'MET (M)' or Canonical_Amino_Acid == 'PHE (F)' or Canonical_Amino_Acid == 'TYR (Y)' or Canonical_Amino_Acid == 'TRP (W)':
    colour1 = '#99ff66'
    colourword1 = 'green'
else:
    colour1 = '#b3b3b3'
    colourword1 = 'grey'

if New_Amino_Acid == 'CYS (C)' or New_Amino_Acid == 'PRO (P)' or New_Amino_Acid == 'GLY (G)':
    colour2 = '#cccc00'
    colourword2 = 'yellow'
elif New_Amino_Acid == 'SER (S)' or New_Amino_Acid == 'THR (T)' or New_Amino_Acid == 'ASN (N)' or New_Amino_Acid == 'GLN (Q)':
    colour2 = '#cc99ff'
    colourword2 = 'pink-purple'
elif New_Amino_Acid == 'ARG (R)' or New_Amino_Acid == 'HIS (H)' or New_Amino_Acid == 'LYS (K)' or New_Amino_Acid == 'ASP (D)' or New_Amino_Acid == 'GLU (E)':
    colour2 = '#9999ff'
    colourword2 = 'blue-purple'
elif New_Amino_Acid == 'ALA (A)' or New_Amino_Acid == 'VAL (V)' or New_Amino_Acid == 'ILE (I)' or New_Amino_Acid == 'LEU (L)':
    colour2 = '#99ff66'
    colourword2 = 'green'
elif New_Amino_Acid == 'MET (M)' or New_Amino_Acid == 'PHE (F)' or New_Amino_Acid == 'TYR (Y)' or New_Amino_Acid == 'TRP (W)':
    colour2 = '#99ff66'
    colourword2 = 'green'
else:
    colour2 = '#b3b3b3'
    colourword2 = 'grey'

if Canonical_Amino_Acid == 'CYS (C)' or Canonical_Amino_Acid == 'ASP (D)' or Canonical_Amino_Acid == 'GLU (E)' or Canonical_Amino_Acid == 'THR (T)':
    colour3 = 'red'
    colourword3 = 'red'
elif Canonical_Amino_Acid == 'SER (S)' or Canonical_Amino_Acid == 'TYR (Y)':
    colour3 = '#ff6666'
    colourword3 = 'light red'
elif Canonical_Amino_Acid == 'ARG (R)' or Canonical_Amino_Acid == 'LYS (K)':
    colour3 = 'blue'
    colourword3 = 'blue'
elif  Canonical_Amino_Acid == 'HIS (H)' or Canonical_Amino_Acid == 'TRP (W)' or Canonical_Amino_Acid == 'ASN (N)' or Canonical_Amino_Acid == 'GLN (Q)':
    colour3 = '#6699ff'
    colourword3 = 'light blue'
elif Canonical_Amino_Acid == 'ALA (A)' or Canonical_Amino_Acid == 'VAL (V)' or Canonical_Amino_Acid == 'ILE (I)' or Canonical_Amino_Acid == 'LEU (L)':
    colour3 = '#b3b3b3'
    colourword3 = 'grey'
elif Canonical_Amino_Acid == 'MET (M)' or Canonical_Amino_Acid == 'PHE (F)' or Canonical_Amino_Acid == 'PRO (P)':
    colour3 = '#b3b3b3'
    colourword3 = 'grey'
else:
    colour3 = '#b3b3b3'
    colourword3 = 'grey'

if New_Amino_Acid == 'CYS (C)' or New_Amino_Acid == 'ASP (D)' or New_Amino_Acid == 'GLU (E)' or New_Amino_Acid == 'THR (T)':
    colour4 = 'red'
    colourword4 = 'red'
elif New_Amino_Acid == 'SER (S)' or New_Amino_Acid == 'TYR (Y)':
    colour4 = '#ff6666'
    colourword4 = 'light red'
elif New_Amino_Acid == 'ARG (R)' or New_Amino_Acid == 'LYS (K)':
    colour4 = 'blue'
    colourword4 = 'blue'
elif  New_Amino_Acid == 'HIS (H)' or New_Amino_Acid == 'TRP (W)' or New_Amino_Acid == 'ASN (N)' or New_Amino_Acid == 'GLN (Q)':
    colour4 = '#6699ff'
    colourword4 = 'light blue'
elif New_Amino_Acid == 'ALA (A)' or New_Amino_Acid == 'VAL (V)' or New_Amino_Acid == 'ILE (I)' or New_Amino_Acid == 'LEU (L)':
    colour4 = '#b3b3b3'
    colourword4 = 'grey'
elif New_Amino_Acid == 'MET (M)' or New_Amino_Acid == 'PHE (F)' or New_Amino_Acid == 'PRO (P)':
    colour4 = '#b3b3b3'
    colourword4 = 'grey'
else:
    colour4 = '#b3b3b3'
    colourword4 = 'grey'

Original: LEU (L), Mutated: PHE (F), Residue Number: 435


## Amino acid table

<img src="https://upload.wikimedia.org/wikipedia/commons/4/4f/ProteinogenicAminoAcids.svg" height="700" align="center">

In [None]:
# @title Hydrophobic *vs.* Hydrophilic
# @markdown ---
# @markdown
# @markdown Pressing run / shift+enter will generate a visualisation.
# @markdown <br>A circle representing the canonical and the variant amino acid will be generated.

clearscreen()
print(f"   canonical ({colourword1})  -->   variant({colourword2})")
setup(500,125)
T = Turtle()
T.color(colour1)
T.speed(15)
S = T.clone()
S.color(colour2)
T.jumpto(-150,50)
S.jumpto(25,50)
T.begin_fill()
S.begin_fill()
T.circle(-50)
S.circle(-50)
T.end_fill()
S.end_fill()
T.jumpto(-150,25)
S.jumpto(25,25)

print("      (comparing amino acid hydrophobicity)")


   canonical (green)  -->   variant(green)
      (comparing amino acid hydrophobicity)


The side chains on amino acids can be classed as follows:

* **<font color='#9999ff'> Polar charged side chains</font>** (hydrophilic; blue-purple)
* **<font color='#cc99ff'> Polar uncharged side chains</font>** (hydrophilic; pink-purple)
* **<font color='#99ff66'> Hydrophobic & aliphatic side chains</font>** (hydrophobic; green)
* **<font color='#99ff66'> Hydrophobic & aromatic side chains</font>** (hydrophobic: green)
* **<font color='#cccc00'> Special case side chains</font>** (yellow)

The above colour contrast shows if the variant changes hydrophobicity.
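
The classification above can also be expressed as a lookup table. The following is a minimal sketch, not part of the original notebook: the names `SIDE_CHAIN_CLASS`, `hydropathy` and `changes_hydrophobicity` are illustrative, and the grouping simply mirrors the colour groups assigned in the variant selection cell.

```python
# Side-chain classes, mirroring the colour groups used in the selection cell.
# Note: MET is not aromatic; it sits in that group only because the selection
# cell colours it together with PHE, TYR and TRP.
SIDE_CHAIN_CLASS = {
    **dict.fromkeys(["ARG", "HIS", "LYS", "ASP", "GLU"], "polar charged"),
    **dict.fromkeys(["SER", "THR", "ASN", "GLN"], "polar uncharged"),
    **dict.fromkeys(["ALA", "VAL", "ILE", "LEU"], "hydrophobic aliphatic"),
    **dict.fromkeys(["MET", "PHE", "TYR", "TRP"], "hydrophobic aromatic"),
    **dict.fromkeys(["CYS", "PRO", "GLY"], "special case"),
}

def hydropathy(aa):
    """Return 'hydrophobic', 'hydrophilic', 'special case' or 'unknown'.

    Accepts a three-letter code, or the formatted form such as 'LEU (L)'.
    """
    side_chain = SIDE_CHAIN_CLASS.get(str(aa).upper()[:3], "unknown")
    if side_chain.startswith("hydrophobic"):
        return "hydrophobic"
    if side_chain.startswith("polar"):
        return "hydrophilic"
    return side_chain

def changes_hydrophobicity(original, mutated):
    """True if the two amino acids fall into different hydropathy groups."""
    return hydropathy(original) != hydropathy(mutated)

print(changes_hydrophobicity("LEU", "PHE"))  # False
```

For the LEU to PHE example used in this notebook, both residues fall in the hydrophobic group, which matches the two green circles drawn above.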

In [None]:
# @title Acidic *vs.* Basic
# @markdown ---
# @markdown Pressing run / shift+enter will generate a visualisation.
# @markdown <br>A circle representing the canonical and the variant amino acid will be generated.

clearscreen()
print(f"      canonical ({colourword3})  -->  variant ({colourword4})")
setup(500,125)
T = Turtle()
T.color(colour3)
T.speed(15)
S = T.clone()
S.color(colour4)
T.jumpto(-150,50)
S.jumpto(25,50)
T.begin_fill()
S.begin_fill()
T.circle(-50)
S.circle(-50)
T.end_fill()
S.end_fill()
T.jumpto(-150,25)
S.jumpto(25,25)

print("      (comparing amino acid charge / polarity)")

      canonical (grey)  -->  variant (grey)


      (comparing amino acid charge / polarity)


The above colour contrast shows if the variant changes the charge / polarity of the side chain:
<br>

<font color='blue'> **basic**</font> (positively charged) --> <font color='#6699ff'> **weakly basic**</font> (uncharged but polar) -->
<br>**neutral** --> <font color='#ff6666'> **weakly acidic**</font> (uncharged but polar) --> <font color='red'> **acidic**</font> (negatively charged)
<br>
<br>
<font color='red'> **acidic**</font> (negatively charged) --> <font color='#ff6666'> **weakly acidic**</font> (uncharged but polar) -->
<br>**neutral** --> <font color='#6699ff'> **weakly basic**</font> (uncharged but polar) --> <font color='blue'> **basic**</font> (positively charged)

The next cell shows if the variant changes the size of the side chain
<br>(with the same basic vs acidic colouring).

In [None]:
# @title Tiny *vs.* Small *vs.* Big
# @markdown ---
# @markdown Pressing run / shift+enter will generate a visualisation.
# @markdown <br>A circle representing the canonical and the variant amino acid will be generated.

clearscreen()
print(f"    canonical ({colourword3})  -->  variant ({colourword4})")
setup(500,250)
T = Turtle()
T.color(colour3)
T.speed(15)
S = T.clone()
S.color(colour4)
T.jumpto(-150,50)
S.jumpto(25,50)
T.begin_fill()
S.begin_fill()

# The selection cell reformats the names as e.g. 'LEU (L)',
# so compare on the three-letter code only.
canonical_code = str(Canonical_Amino_Acid)[:3]
new_code = str(New_Amino_Acid)[:3]

# Circle radius reflects side-chain size: tiny (20), small (30), big (75), otherwise medium (50)
if canonical_code in ('ALA', 'GLY', 'SER'):
    T.circle(-20)
elif canonical_code in ('ASP', 'ASN', 'THR', 'VAL', 'PRO'):
    T.circle(-30)
elif canonical_code in ('ARG', 'HIS', 'PHE', 'TRP', 'TYR'):
    T.circle(-75)
else:
    T.circle(-50)

if new_code in ('ALA', 'GLY', 'SER'):
    S.circle(-20)
elif new_code in ('ASP', 'ASN', 'THR', 'VAL', 'PRO'):
    S.circle(-30)
elif new_code in ('ARG', 'HIS', 'PHE', 'TRP', 'TYR'):
    S.circle(-75)
else:
    S.circle(-50)

T.end_fill()
S.end_fill()
T.jumpto(-150,25)
S.jumpto(25,25)

print("      (comparing amino acid steric bulk)")


    canonical (grey)  -->  variant (grey)


      (comparing amino acid steric bulk)


# Local structural environment of the variant position

## **Buried?**

<img src="https://upload.wikimedia.org/wikipedia/commons/c/c5/Protein_folding_schematic.png" height="500" align="right">

<p>Amino acids can be positioned at the protein surface or buried inside a structural fold.</p>

<br>
<p>Often hydrophobic ('water-hating') amino acids are buried, and hydrophilic ('water-loving') amino acids are on the protein surface. The image on the right shows almost all hydrophobic residues (black) buried and all hydrophilic residues (white) on the surface. This is a simplification because in most structures a mixture of amino acid types is observed buried within the structural fold. However, the buried hydrophilic residues tend to have amino acid partners that are forming complementary  interactions (salt-bridges, hydrogen bonds, etc).
</p>

<br>
<p>If an amino acid appears at the surface when considering a single protein chain, it may be buried by protein-protein interactions when considering a higher-level complex, that is, the protein's quaternary structure. For example, an amino acid on the surface of a single chain may be buried between protein chains within a protein complex, or by a partner protein that forms a temporary complex with it for a specific purpose.</p>

<br>
<p>We will first only be considering the amino acid at the variant position within the context of a single protein chain.
</p>

In [None]:
# @title Question
# @markdown
# @markdown
# @markdown ---
# @markdown Where is the **canonical** amino acid with respect to the rest of the single protein chain?
FOUND_AT = 'Buried_site' # @param ["Input Answer","Protein_surface", "Buried_site","Unclear"]
# @markdown **[PDF Guide for using Mol* to answer this question](https://github.com/PDBeurope/pdbe-notebooks/blob/main/variants_embl-ebi_may2024/Molstar_Guide_for_Buried_or_Surface_Question.pdf)**
# @markdown <br> HINT1: Look at sequence viewer on **PDBe-KB** Structures page. Take note of which experimentally determined structures contain the sequence region with the variant position.
# @markdown <br>HINT2: Use structural superpositions on **PDBe-KB** Structures page and/or view a protein structure / structure(s) using [Mol* viewer](https://molstar.org/viewer/) to aid analysis
# @markdown <br>HINT3: In Mol* viewer, click on triple dots next to polymer and add **'Molecular surface'** or **'Spacefill'** representation.
# @markdown <br>HINT4: Use sequence view in Mol* and note the UniProt ID sequence numbering that appears in the bottom right corner.
# @markdown
# @markdown ---

## Salt bridges & hydrogen bonds
<img src="https://upload.wikimedia.org/wikipedia/commons/b/b4/Next_Revisit_Glutamic_Acid_Lysine_salt_bridge.png" height="300" align="right">

Key features contributing to protein folding and protein stability are hydrogen bonds and salt bridges. A salt bridge is a combination of hydrogen bonding and ionic bonding.

<br>

The most common amino acid pairs involved in a salt bridge:

*  ASP - LYS
*  ASP - ARG
*  GLU - LYS
*  GLU - ARG

<br>

If an acidic or basic amino acid is buried within a structural fold, it is typically forming either a salt bridge or hydrogen bonds with another amino acid.
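
When looking for a salt-bridge partner in a structure, the check comes down to two things: is the pair one of the acid / base combinations listed above, and are the charged groups close together? The following is a minimal sketch of that reasoning. The function names are illustrative, the coordinates are made up, and the 4 Å cutoff is a commonly used rule of thumb assumed here rather than a value taken from this notebook.

```python
import math

ACIDIC = {"ASP", "GLU"}
BASIC = {"LYS", "ARG"}

def is_salt_bridge_pair(res_a, res_b):
    """True if one residue is acidic (ASP/GLU) and the other is basic (LYS/ARG)."""
    a, b = res_a.upper(), res_b.upper()
    return (a in ACIDIC and b in BASIC) or (a in BASIC and b in ACIDIC)

def within_cutoff(atom_a, atom_b, cutoff=4.0):
    """True if two (x, y, z) coordinates, in angstroms, are within `cutoff`.

    The default of 4.0 is an assumed rule-of-thumb distance.
    """
    return math.dist(atom_a, atom_b) <= cutoff

# Hypothetical coordinates for a side-chain oxygen and a side-chain nitrogen
asp_oxygen = (10.0, 12.0, 8.0)
lys_nitrogen = (12.0, 13.5, 9.0)

print(is_salt_bridge_pair("ASP", "LYS") and within_cutoff(asp_oxygen, lys_nitrogen))
```

In practice the coordinates would be read from a structure file, and Mol* can display the same distance interactively.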


In [None]:
# @title Question
# @markdown
# @markdown
# @markdown ---
# @markdown If the **canonical** amino acid is basic or acidic, as well as buried, <br> is there an amino acid partner that can be identified?
ACIDIC_or_BASIC_PARTNER = 'No' # @param ["Input Answer","Yes", "No","Unclear","Canonical_aa_not_acidic_or_basic"]
# @markdown HINT: Use structural superpositions on **PDBe-KB** Structures page and/or view a protein structure / structure(s) using mol* to aid analysis
# @markdown
# @markdown ---

## Disulfide bridges & metal coordination
<img src="https://upload.wikimedia.org/wikipedia/commons/8/8c/Disulfide_Bridges_(SCHEMATIC)_V.1.svg" height="250" align="right">

<p>Key features that contribute to protein folding and stability are disulfide bridges.  The image on the right highlights how the disulfide bridges act as covalent crosslinkers to stabilise a protein's fold. Metal coordination has also been observed as important for protein folding and stability.</p>

<br>
<p>Disulfide bonds and metal coordination may also contribute to catalysis when the protein is an enzyme. This may serve as an additional role to protein stabilisation.</p>

<br>
<br>
<p>Disulfide bonds only involve cysteines, and metal coordination tends to involve cysteines and histidines, but can involve other amino acids.</p>


In [None]:
# @title Questions
# @markdown
# @markdown
# @markdown HINT: Answering these questions can be aided by utilising the structural superposition view and sequence viewer available from **PDBe-KB** Ligands page.
# @markdown
# @markdown ---
# @markdown Is the **canonical** (reference / wild-type) amino acid a cysteine?
Is_CYS = 'No' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If the answer for the previous is **yes**, <br>is the cysteine involved in a disulfide bridge?
Disulfide_bridge = 'No' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If the answer for the previous is **no**, <br>is the cysteine involved in coordinating a metal or other ion?
Metal_site_with_CYS = 'No' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Is the **canonical** amino acid a histidine?
Is_HIS = 'No' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If the answer for the previous is **yes**, <br>is the histidine involved in coordinating a metal or other ion?
Metal_site_with_His = 'No' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Examine the structural superposition on the **PDBe-KB** Ligands page. <br>How many metal or other ion sites are there?
Metal_or_ion_sites = 0 # @param {type:"number"}
# @markdown ---
# @markdown Are any metals involved in catalysis?
No_of_Metals_in_catalysis = 'No' # @param ["Input Answer","Yes", "No"]
# @markdown HINT: Visit the UniProt page to see if the enzyme catalysis involves a metal.
# @markdown
# @markdown ---
# @markdown If the **canonical** amino acid is involved in coordinating a metal / ion, <br>is the metal / ion site part of the active site (*i.e.* is the metal / ion involved in catalysis)?
Metals_at_site = 'No' # @param ["Input Answer","Yes", "No"]
# @markdown ---



## Ligand-protein interactions

<img src="https://github.com/PDBeurope/pdbe-notebooks/blob/main/variants_embl-ebi_may2024/Human_CFTR_ATP_Ligand.png?raw=true" height="490" align="right">

Small molecules (or ligands) can be found in protein structures.

These can be important for protein stabilisation or function.

Examples of small molecules of potential importance:
* lipids
* co-factors
* carbohydrates <br>(both as substrates & as covalently attached glycosylation)
* agonists / antagonists <br>(associated with protein receptors & channels)
* substrates / reactants <br>(associated with protein enzymes)
* effectors <br>(associated with protein gene regulators)
* individual units of: <br>amino acid / DNA base / RNA base <br>
* allosteric regulators <br>(*e.g.* an amino acid binding to one of the enzymes involved in its synthesis)
* inhibitors / activators
* drugs

<br>

Less important small molecules*:
* Crystallisation agents
* Buffer components -- e.g. Tris Buffer, CC ID [TRS](https://www.ebi.ac.uk/pdbe-srv/pdbechem/chemicalCompound/show/TRS)
* Cryoprotectants -- e.g. glycerol, CC ID [GOL](https://www.ebi.ac.uk/pdbe-srv/pdbechem/chemicalCompound/show/GOL)
* Solvent -- e.g. DMSO, CC ID [DMS](https://www.ebi.ac.uk/pdbe-srv/pdbechem/chemicalCompound/show/DMS)
* Detergents

<p>* The second set of small molecules are those which were present in the experimental conditions used to determine the protein structure but are not necessarily relevant to protein function. Sometimes these agents / additives will bind in protein pockets in ways that mimic how biologically relevant small molecules, which weren't present in the experimental conditions, might bind.</p>

<br>

---
The **PDBe-KB** Ligands pages analyse protein structures that contain at least one protein chain with sequence mapping to a UniProt ID, but from a 'ligand' perspective.  The **PDBe-KB** Ligands page shows a list of ligands that bind the UniProt ID mapped protein chain and maps the small molecule interactions onto the UniProt sequence. Additionally, there is a superposition view. Both views aim to highlight ligand binding sites and whether they overlap between protein structures for the same UniProt ID. Where possible, based on chemical similarity, 'reactant-like', 'drug-like' and 'cofactor-like' chemicals have been identified.

Variants that disrupt a ligand binding site that has biological relevance may impact protein function.
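
Once the interacting residues for each ligand have been noted from the **PDBe-KB** Ligands page, checking whether the variant position is among them is a simple lookup. The sketch below assumes the residue numbers have been transcribed by hand into a dictionary. The function name `ligands_contacting`, the ligand IDs and the residue lists are all hypothetical and for illustration only.

```python
def ligands_contacting(position, ligand_sites):
    """Return the ligand IDs whose interacting residues include `position`.

    `ligand_sites` maps a ligand ID to a list of UniProt residue numbers.
    """
    return sorted(
        ligand for ligand, residues in ligand_sites.items() if position in residues
    )

# Made-up example data; real values would come from the Ligands page
example_sites = {
    "LIG1": [421, 432, 435],
    "LIG2": [100, 101],
}

print(ligands_contacting(435, example_sites))  # ['LIG1']
```

An empty list does not rule out an effect on binding, because a residue can sit close to a ligand in three dimensions without making a direct contact.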

In [None]:
# @title Questions
# @markdown
# @markdown
# @markdown HINT: Answering these questions can be aided by utilising the structural superposition view and sequence viewer available from **PDBe-KB** Ligands page.
# @markdown
# @markdown ---
# @markdown Is the **canonical** amino acid near a small molecule binding site?
Near_small_molecule = 'Yes' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Are there any direct interactions between the **canonical** amino acid and small molecule(s)?
Interacting_with_small_molecule = 'Yes' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Does the experimental evidence support the small molecule's binding pose?
Map_supports_ligand_binding_pose = 'Yes' # @param ["Input Answer","Yes", "No", "Uncertain"]
# @markdown ---
# @markdown Is it likely that the variant disrupts the ligand binding site?
LIKELY_DISRUPTING_ligand_binding = 'Yes' # @param ["Input Answer","Yes", "No", "Maybe"]
# @markdown ---
# @markdown ---
# @markdown Does the small molecule at the site we are examining have a known biological function listed on the UniProt page?
UniProt_indicates_bio_relevance = 'Yes' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Is the small molecule at the site we are examining marked as 'cofactor-like' or 'drug-like' on the **PDBe-KB** Ligands page?
PDBeKB_indicates_bio_relevance = 'No' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Examine the structure/structures associated with the small molecule. Does the citation or protein structure 'entry title' for the protein structures indicate the small molecule has biological relevance?
Entry_title_or_cit_indicates_bio_relevance = 'Yes' # @param ["Input Answer","Yes", "No", "Maybe"]
# @markdown HINT: Visit the UniProt page to see if the enzyme catalysis involves a metal.
# @markdown
# @markdown ---
# @markdown Is it likely that the variant, by disrupting ligand binding, is disrupting protein function?
LIKELY_DISRUPTING_function = 'Yes' # @param ["Input Answer","Yes", "No", "Maybe"]



# Variant considered in the broader structural context

## **Secondary structure**

The most common secondary structure elements are $\beta$-sheets and $\alpha$-helices.

As shown in the image below, the formation of these features involves hydrogen bonds (black dashes). The hydrogen bond network in $\beta$-sheets or $\alpha$-helices mostly involves the backbone atoms (carboxyl and amino groups) of amino acids (*i.e.* NOT the side chain atoms, which differ between amino acids and change if a genetic variant results in a new amino acid within the protein sequence). Thus, the variant is unlikely to directly impact the hydrogen bonding network. Rather, the secondary structure is impacted indirectly when the new amino acid is repositioned with respect to adjacent amino acids because it cannot be accommodated at the same position in the same way as the canonical amino acid. The exception is proline: its side chain is fused to the backbone, so it behaves differently whether it is being replaced or introduced by a genetic variant.

<br>

<img src="https://upload.wikimedia.org/wikipedia/commons/c/c5/Alpha_beta_structure_(full).png" height="400" align="center">


In [None]:
# @title Questions
# @markdown
# @markdown
# @markdown HINT: The following subsection can be aided by the **PDBe-KB** Structures page, especially the protein sequence view.
# @markdown
# @markdown ---
# @markdown What is the secondary structure at the position of the **canonical** amino acid?
SECONDARY_STRUCTURE = 'Input Answer' # @param ["Input Answer","HELIX", "SHEET", "LOOP"]
# @markdown Selecting **LOOP** indicates NO secondary structure element is present.
# @markdown
# @markdown ---
# @markdown Is there a secondary structure element at the position directly adjacent in sequence to the position of the **canonical** amino acid?
Adjacent_has_SS = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If the answer to the above is **yes**, <br> what is the secondary structure element in the adjacent residue/residues to the position of the **canonical** amino acid in the protein sequence?
ADJACENT_SECONDARY_STRUCTURE = 'Input Answer' # @param ["Input Answer","HELIX", "SHEET", "LOOP"]
# @markdown ---
# @markdown Is it likely that the variant disrupts the secondary structure?
LIKELY_DISRUPTING_SS= 'Input Answer' # @param ["Input Answer","Yes", "No", "Maybe"]
# @markdown <br>
# @markdown Some examples that might disrupt secondary structure are:
# @markdown
# @markdown   *  any changes that involve a proline
# @markdown   *  *potentially, if an amino acid is at a buried site,* and the canonical amino acid is small and the new amino acid is large
# @markdown   *  *potentially, if an amino acid is acidic / basic,* and it changes charge and so is no longer able to form salt bridges or hydrogen bonds with appropriate amino acid partner(s).

# @markdown ---
# @markdown
# @markdown It should be noted that disrupting secondary structure will NOT necessarily impact on protein function.
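
The three rules of thumb listed in the Questions cell above can be written down as a small checklist function. This is only a sketch under stated assumptions: the name `disruption_flags` is illustrative, the small and large groups reuse the tiny and big groups from the size visualisation earlier in this notebook, and only ASP, GLU, LYS and ARG are treated as charged. A flag means the variant is worth a closer look, not that the secondary structure is disrupted.

```python
SMALL = {"ALA", "GLY", "SER"}                 # 'tiny' group from the size cell
LARGE = {"ARG", "HIS", "PHE", "TRP", "TYR"}   # 'big' group from the size cell
CHARGE = {"ASP": -1, "GLU": -1, "LYS": 1, "ARG": 1}  # assumed formal charges

def disruption_flags(original, mutated, buried=False):
    """Return the rules of thumb that a variant triggers (three-letter codes)."""
    original = str(original).upper()[:3]
    mutated = str(mutated).upper()[:3]
    flags = []
    # Rule 1: any change that involves a proline
    if original != mutated and "PRO" in (original, mutated):
        flags.append("involves proline")
    # Rule 2: small canonical residue replaced by a large one at a buried site
    if buried and original in SMALL and mutated in LARGE:
        flags.append("small to large at a buried site")
    # Rule 3: a charged canonical residue whose charge changes
    original_charge = CHARGE.get(original, 0)
    if original_charge != 0 and original_charge != CHARGE.get(mutated, 0):
        flags.append("charge change")
    return flags

print(disruption_flags("LEU", "PHE", buried=True))  # []
print(disruption_flags("GLY", "TRP", buried=True))  # ['small to large at a buried site']
```

For the LEU to PHE example, no flag is raised by these particular rules, so the answer still depends on inspecting the structure.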


## **Protein domains**

<H3>A protein domain can be thought of as a combination of secondary elements, folded in a compact manner, that is found amongst diverse proteins (<i>i.e.</i> found in proteins with different functions or from different organisms). It is typically thought to be a section of protein sequence which folds independently of other sections when the protein is being translated from RNA.</H3>

<p>However, it is not necessary that a function is contained within a single domain -- <i>e.g.</i> an enzyme's active site may be at the interface between two domains. Thus a 'feature' / 'functional unit', or 'protein family' such as might be highlighted on UniProt pages, may or may not correspond to a protein domain.</p>

Protein domains have generally been defined / understood in the context of multiple sequence and structural comparisons, such as those available from databases like [InterPro](https://www.ebi.ac.uk/interpro/), [Pfam](http://pfam.xfam.org/), [SCOPe](https://scop.berkeley.edu/), SCOP, [CATH](https://www.cathdb.info/), etc. Protein classifications such as families / domains that are defined by these resources may or may not overlap for the same protein. For example, a Pfam annotation of a protein family may correspond to two or more domains, as defined in SCOPe, SCOP or CATH. A quick overview of the type of classification / analysis performed by these databases is available [here](https://www.ebi.ac.uk/interpro/about/interpro/).<br>

<br>

---

For even more details:
[EMBL-EBI on-demand training course on protein classification](https://www.ebi.ac.uk/training/online/courses/protein-classification-intro-ebi-resources/protein-classification/)

---

<p>Investigating domains and finding related proteins with the same / similar domains may be an avenue of further analysis, especially if no rationale can be developed to understand why a variant is causing disease after answering all the questions in this guide. Additionally, if this guide reveals a rationale for understanding why a variant causes disease, related proteins may provide further evidence or insight to support the answer.</p>

---

<br>

A valuable tool for evaluating structures is the **Predicted aligned error (PAE)** plot associated with predicted protein structures, such as those available from the [AlphaFold database](https://alphafold.ebi.ac.uk/). The **PAE** plot is a relationship matrix that indicates the confidence in the position of two amino acids relative to each other.

<p>Thus, perhaps unsurprisingly, blocks of green, which show confidence in the predicted position of amino acids relative to each other, often correlate with protein domains, but not always. That is, sometimes a block of green corresponds to two or more domains.</p>
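
To make the idea of a 'block' in a **PAE** matrix concrete, the sketch below splits a chain into contiguous stretches whose residues are all confidently placed relative to the first residue of the stretch. It is a deliberately crude heuristic, not how the AlphaFold Database or PDBe-KB defines anything: the function name `folded_units`, the 5 Å cutoff and the toy matrix are assumptions made for illustration, and residue indices are zero-based.

```python
def folded_units(pae, cutoff=5.0, min_len=2):
    """Split residues 0..n-1 into contiguous blocks of low PAE.

    `pae` is a square list of lists, with pae[i][j] the predicted aligned
    error (in angstroms) between residues i and j. A block is closed when
    a residue is not confidently placed relative to the residue that
    opened the block.
    """
    blocks, start, n = [], 0, len(pae)
    for i in range(1, n + 1):
        if i == n or max(pae[start][i], pae[i][start]) > cutoff:
            if i - start >= min_len:
                blocks.append((start, i - 1))
            start = i
    return blocks

# Toy 6-residue matrix: low error within residues 0-2 and within 3-5,
# high error between the two halves
toy = [[1.0 if (i < 3) == (j < 3) else 20.0 for j in range(6)] for i in range(6)]

print(folded_units(toy))  # [(0, 2), (3, 5)]
```

As the text above notes, a block found this way may still contain two or more domains, so the result is a starting point for visual inspection rather than a domain assignment.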

<br>

---

One example of this type of occurrence (two domains corresponding to one green square in a **PAE** plot) is pyruvate kinase PKM.<br>
To explore this further:<br>
[PDBe-KB page for pyruvate kinase PKM](https://www.ebi.ac.uk/pdbe/pdbe-kb/proteins/P14618)
<br>
[AlphaFold Database page for pyruvate kinase PKM](https://alphafold.ebi.ac.uk/entry/P14618)
<br>
[Example PDBe page for one of the pyruvate kinase PKM structures](https://www.ebi.ac.uk/pdbe/entry/pdb/1t5a)<br>
[C-term domain on InterPro](https://www.ebi.ac.uk/interpro/entry/InterPro/IPR036918/)<br>
[C-term domain on CATH](http://www.cathdb.info/version/latest/superfamily/3.40.1380.20/superposition)

---

<br>

<p>In the next subsection we will examine the <strong>PAE</strong> plot from the <strong>AlphaFold Database</strong>, as well as structural superpositions from <strong>PDBe-KB</strong> and consider the experimental structures in the context of the full UniProt sequence.</p>

<p>When considering the full protein sequence (<i>i.e.</i> UniProt ID sequence) it can appear there are multiple globular proteins connected by relatively unstructured regions of the protein chain. Where the variant occurs within the sequence has relevance to how it may or may not impair protein function.</p>

<p>We will avoid the term domains in the context of the next section of analysis. This is because even though the green square on a <strong>PAE</strong> plot or the section of protein that behaved well enough to determine a structure experimentally may correspond to a single domain, it may also correspond to two or more domains. Thus, we will use the term 'folded unit'.</p>



---

<br>

Please also consider the protein processing that occurs after protein is translated.

Signal peptides and propeptides are associated with 'molecular processing' events that can occur and are listed in the Feature viewer on the corresponding UniProt page or **PDBe-KB** Structure page. These are regions removed at various stages and thus are typically not present in the mature form of the protein.

Sometimes multiple protein chains are generated from a single UniProt ID. This is due to specific additional molecular processing events that occur after the protein has been translated. When this has been annotated by UniProt, two or more chains with special subpages are shown on the **PDBe-KB** Summary page.

In [None]:
# @title Questions
# @markdown
# @markdown
# @markdown HINT: The following subsection can be aided by utilising the structural superpositions available from the **PDBe-KB** Structures page,<br> especially the **AlphaFold Superposition** feature.
# @markdown <br>HINT2: Visiting the **[AlphaFold Database](https://alphafold.ebi.ac.uk/)** provides an interactive **PAE** plot.
# @markdown
# @markdown ---
# @markdown Does the **PAE** plot for the predicted structure from the **AlphaFold Database** for this UniProt ID indicate there is more than one 'folded unit'?
More_than_one_folded_unit = 'Input Answer' # @param ["Input Answer","Yes", "No", "Maybe"]
# @markdown ---
# @markdown Do the experimentally-determined structures for the UniProt ID cover different regions of sequence?
BlueBlocks_Sequence = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Examine the structural superposition. <br> Is there more than one superposition for the same UniProt ID?
More_than_one_section_for_Superposition ='Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown HINT: Click on 'Select Segment' in **PDBe-KB** structural superposition view.
# @markdown
# @markdown ---
# @markdown Does the protein adopt a single compact fold / appear to be a single approximately spherical shape (aka globular protein)?
One_folded_unit = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If there are molecular processing events that generate more than one chain, please indicate the number of chains:
Molecular_Processing_Sections = 0 # @param {type:"number"}
# @markdown HINT: Molecule processing is on **PDBe-KB** Overview page, as well as on the UniProt page *Feature viewer*
# @markdown
# @markdown ---
# @markdown Taking into account any molecular processing events, has the analysis indicated there is more than one 'folded unit'?
Multiple_folded_units = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If the answer for the previous is **yes**, <br> is the variant in the first folded unit along the protein sequence, or is it part of folded unit 2, or another?
Variant_in_folded_unit_no = 0 # @param {type:"number"}
# @markdown ---
# @markdown **Remember the above analysis for the next section.**
# @markdown <br>**Focus the subsequent analysis on protein complexes on the folded unit which contains the canonical amino acid and ignore complexes for other folded units.**

## Protein complexes
<img src="https://github.com/PDBeurope/pdbe-notebooks/blob/main/variants_embl-ebi_may2024/TB_DHDPS_Assemblies.png?raw=true" height="690" align="right">

<p>We have been primarily considering protein structures within the context of a single protein chain. However, many proteins form complexes, and many have a biological assembly of two (dimers), three (trimers), four (tetramers), five (pentamers), six (hexamers), etc.</p>

<br>

Oligomers of protein can be broadly classified as:
* homomeric - composed of multiple chains with the same amino acid sequence.
* heteromeric - composed of multiple chains with different amino acid sequences.
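
The distinction above can be checked directly from the chain sequences of an assembly. The following sketch assumes the sequences are already available as plain strings. The function name `classify_assembly` and the short example sequences are hypothetical.

```python
def classify_assembly(chain_sequences):
    """Classify an assembly from a list of chain sequences (one string per chain)."""
    n = len(chain_sequences)
    if n == 0:
        raise ValueError("at least one chain sequence is required")
    if n == 1:
        return "monomeric (1 chain)"
    # Identical sequences across all chains means the assembly is homomeric
    kind = "homomeric" if len(set(chain_sequences)) == 1 else "heteromeric"
    return f"{kind} ({n} chains)"

# Made-up sequences for illustration
print(classify_assembly(["MKT", "MKT", "MKT", "MKT"]))  # homomeric (4 chains)
print(classify_assembly(["MKT", "GSH"]))                # heteromeric (2 chains)
```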

<br>

If the variant is at the surface, or on a secondary structure element that, if modified, could alter the shape of the surface, the biological assembly of the protein could be affected, and this could in turn affect the protein's function.

<br>

The example on the right is of an engineered mutation that disrupted protein assembly of a bacterial enzyme. Because the protein was homomeric, and because of the symmetry in the protein, the single mutation from ALA to ARG, resulted in two positively charged polar amino acids positioned opposite each other at a protein-protein interaction interface. Interestingly, although enzyme function was impacted, the engineered mutation did not fully inhibit protein function and thus indicated that if this was a genetic variant, even though the protein assembly had changed, this change would not be fatal for the organism.

<br>

Reference: <br> [A tetrameric structure is not essential for activity in dihydrodipicolinate synthase (DHDPS) from Mycobacterium tuberculosis](https://www.sciencedirect.com/science/article/abs/pii/S000398611100186X)

In [None]:
# @title Questions
# @markdown
# @markdown ---
# @markdown Does the protein form a complex with other chains with the same amino acid sequence?
More_than_one = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown HINT: Visit the **PDBe-KB** Interaction page & use link for **PDB** entry page for individual structures
# @markdown
# @markdown ---
# @markdown Does the protein mostly appear to form a complex only with other chains of the same amino acid sequence?
Homomeric = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown HINT: Visit the **PDBe-KB** Interaction page & use link for **PDB** entry page for individual structures
# @markdown
# @markdown ---
# @markdown If the answer to the previous question is **yes**, <br>how many chains with the same protein sequence are involved in the biological assembly?
Homomeric_of = 0 # @param {type:"number"}
# @markdown HINT: Visit the **PDBe-KB** Interaction page & use link for **PDB** entry page for individual structures
# @markdown
# @markdown ---
# @markdown If the biological assembly involves protein chains with different protein sequences, what is the total number of chains in the biological assembly?
Heteromeric_of = 0 # @param {type:"number"}
# @markdown ---
# @markdown If the **canonical** amino acid is at the surface of the single chain, <br>is it buried when the protein forms its biological assembly?
Variant_at_biological_assembly_interface = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown
# @markdown ---
# @markdown ---
# @markdown Are there any experimentally-determined structures in which the protein has multiple protein, DNA and/or RNA partners?
Protein_partners = 'Input Answer' # @param ["Input Answer","Yes", "No", "Maybe"]
# @markdown HINT: Visit the **PDBe-KB** Interaction page & use link for **PDB** entry page for individual structures
# @markdown
# @markdown ---
# @markdown If the **canonical** amino acid is at the surface of the single chain, or the biological assembly, is it buried when the protein interacts with its partner?
Variant_at_biological_partner_interface = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown
# @markdown ---
# @markdown ---
# @markdown If the answer to the last question was **no** then:<br>
# @markdown Compare the protein complexes listed on the UniProt page with the structures on the **PDBe-KB** Interaction page.
# @markdown How many protein partners with predicted / experimental evidence for complex formation do not yet have any protein structure?
Yet_to_have_structure_protein_complexes = 0 # @param {type:"number"}
# @markdown ---
# @markdown Is the **canonical** amino acid at the surface of the protein in all currently available experimentally-determined structures?
Surface_aa_in_all_structure = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If the answer to the previous question is **yes**, the variant may affect a protein complex for which no structure has yet been determined, although there is no direct evidence for this yet.
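
The reasoning in the last three questions can be sketched as a small check. This is an illustration only: the function name `may_affect_undetermined_complex` is hypothetical, and combining the three answers in this way is an assumption about how the form is meant to be read. In the notebook, the arguments would be the form values `Variant_at_biological_partner_interface`, `Surface_aa_in_all_structure` and `Yet_to_have_structure_protein_complexes`.

```python
def may_affect_undetermined_complex(at_known_interface,
                                    surface_in_all_structures,
                                    partners_without_structure):
    """Flag a variant that could sit at an interface with no structure yet.

    at_known_interface: "Yes"/"No", buried at a known partner interface?
    surface_in_all_structures: "Yes"/"No", at the surface in every structure?
    partners_without_structure: number of partners lacking any structure.
    """
    answers = (at_known_interface, surface_in_all_structures)
    if any(answer not in ("Yes", "No") for answer in answers):
        raise ValueError("answer both Yes/No questions before combining them")
    # Assumed rule: not at a known interface, always surface exposed,
    # and at least one partner whose complex structure is still missing.
    return (at_known_interface == "No"
            and surface_in_all_structures == "Yes"
            and partners_without_structure > 0)


print(may_affect_undetermined_complex("No", "Yes", 2))  # True
```

A `True` result is only a prompt for further investigation, not evidence that the variant disrupts a complex.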



# This ends the notebook form.<br>


---

<p>Please consider the evidence collected in this notebook form and decide on the most likely reason why the genetic variant disrupts function or is pathogenic.</p>

<p>It may not be apparent from the currently available protein structures why a genetic variant causes disease. This is also a valid conclusion.</p>

<p>If there are any experimentally-determined structures containing the variant, please examine them and read the associated citations.</p>

<br>

#**Summarize your hypothesis here:**

The mutation is moderate in terms of the change in amino acid properties (<i>e.g.</i> the BLOSUM62 score is 0, but the Dayhoff property group is not the same).

The amino acid at this position directly interacts with a ligand that was designed to be a transition state analog, so this variant likely directly impacts the activity of this protein / protein complex.

<br>
<br>

---



Copyright 2024 EMBL - European Bioinformatics Institute

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.