<a href="https://colab.research.google.com/github/glevans/PDBe_Notebooks/blob/main/variants_embl-ebi_may2024/structural_assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**<H1>Structural assessment of a variant</H1>**
**Using PDBe-KB & 3D-Beacons to gain understanding on genetic variants**
<img src="https://www.ebi.ac.uk/pdbe/docs_dev/logos/images/RGB/PDBe-logo-RGB_2013.png" height="300" align="right">

# Welcome to this notebook form!

To use this notebook in Colab (link at top of the page):

*   you will need a Google account
*   you will need to be logged in to Google Colab (by being logged in to your Google account)

<br>

You can also access this notebook *via* the GitHub repository:<br>
https://github.com/PDBeurope/pdbe-notebooks/

<br>



---

  ## How to use this notebook:
1. To run a code cell, click on the cell to select it. You will notice a play button (▶️) on the left side of the cell. Click on the play button or **press Shift+Enter** to run the code in the selected cell.
2. The code will start executing, and you will see the output, if any, displayed below the code cell.
3. Move to the next code cell and repeat steps 1 and 2 until you have executed all the desired code cells in sequence.
4. The currently running step is indicated by a circle with a stop sign next to it.
If you need to stop or interrupt the execution of a code cell, you can click on the stop button (■) located next to the play button.

*Remember to run the code cells in the correct order, as their execution might depend on variables or functions defined in previous cells. You can modify the code in a code cell and re-run it to see updated results.*

<br>

---

## Contact us

If you experience any bugs please contact pdbehelp@ebi.ac.uk and put "Help with" and the title of the notebook in the subject line of the message.


# Amino acid details

In [10]:
# @title Amino acid change
# @markdown
# @markdown Not all genetic variants result in an amino acid change, but here we will focus on the genetic variants that do cause an amino acid change.
# @markdown
# @markdown We will be contrasting the amino acid present in the genetic variant, with the canonical amino acid.
# @markdown The canonical amino acid is the one most commonly observed at this particular position in the protein sequence.
# @markdown
# @markdown ---
# @markdown For the variant, what was the canonical amino acid?
Canonical_Amino_Acid = 'LEU' # @param ['Select the amino acid', 'GLY', 'ALA', 'VAL', 'LEU', 'ILE', 'THR', 'SER', 'MET', 'CYS', 'PRO', 'PHE', 'TYR', 'TRP', 'HIS', 'LYS', 'ARG', 'ASP', 'GLU', 'ASN', 'GLN']
# @markdown ---
# @markdown For the variant, what was its position in the amino acid sequence (in UniProt numbering)?
Variant_Position = 435 # @param {type:"number"}
# @markdown ---
# @markdown For the variant, what was the amino acid changed to by the genetic variant?
New_Amino_Acid = 'PHE' # @param ['Select the amino acid', 'GLY', 'ALA', 'VAL', 'LEU', 'ILE', 'THR', 'SER', 'MET', 'CYS', 'PRO', 'PHE', 'TYR', 'TRP', 'HIS', 'LYS', 'ARG', 'ASP', 'GLU', 'ASN', 'GLN']
# @markdown ---

!pip install ColabTurtlePlus
from ColabTurtlePlus.Turtle import *

print("Successfully installed!")

####### COLOURS
# Define colour variables based on amino acid identity.

# Side-chain class colours (hydrophobic vs. hydrophilic view).
CLASS_COLOURS = {
    '#cccc00': ('CYS', 'PRO', 'GLY'),                # special case side chains
    '#cc99ff': ('SER', 'THR', 'ASN', 'GLN'),         # polar uncharged side chains
    '#9999ff': ('ARG', 'HIS', 'LYS', 'ASP', 'GLU'),  # polar charged side chains
    '#99ff66': ('ALA', 'VAL', 'ILE', 'LEU',          # hydrophobic & aliphatic side chains
                'MET', 'PHE', 'TYR', 'TRP'),         # hydrophobic & aromatic side chains
}

# Charge / polarity colours (acidic vs. basic view).
CHARGE_COLOURS = {
    'red': ('CYS', 'ASP', 'GLU', 'THR'),             # acidic end of the scale
    '#ff6666': ('SER', 'TYR'),                       # weakly acidic
    'blue': ('ARG', 'LYS'),                          # basic end of the scale
    '#6699ff': ('HIS', 'TRP', 'ASN', 'GLN'),         # weakly basic
}

# Grey is used for neutral side chains and when no amino acid has been selected.
DEFAULT_COLOUR = '#b3b3b3'


def pick_colour(amino_acid, colour_table):
    """Return the colour whose group contains amino_acid, or grey if none does."""
    for colour, members in colour_table.items():
        if amino_acid in members:
            return colour
    return DEFAULT_COLOUR


colour1 = pick_colour(Canonical_Amino_Acid, CLASS_COLOURS)
colour2 = pick_colour(New_Amino_Acid, CLASS_COLOURS)
colour3 = pick_colour(Canonical_Amino_Acid, CHARGE_COLOURS)
colour4 = pick_colour(New_Amino_Acid, CHARGE_COLOURS)

Successfully installed!


## Amino acid table

<img src="https://upload.wikimedia.org/wikipedia/commons/4/4f/ProteinogenicAminoAcids.svg" height="700" align="center">

The side chains on amino acids can be classed as follows:

* **<font color='#9999ff'> Polar charged side chains </font>** (hydrophilic)
* **<font color='#cc99ff'> Polar uncharged side chains </font>** (hydrophilic)
* **<font color='#99ff66'> Hydrophobic & aliphatic side chains </font>** (hydrophobic)
* **<font color='#99ff66'> Hydrophobic & aromatic side chains </font>** (hydrophobic)
* **<font color='#cccc00'> Special case side chains </font>**

The colour contrast below shows whether the variant changes hydrophobicity:

In [11]:
# @title Hydrophobic *vs.* Hydrophilic
# @markdown ---
# @markdown

clearscreen()
setup(500,125)
T = Turtle()
T.color(colour1)
T.speed(15)
S = T.clone()
S.color(colour2)
T.jumpto(-150,50)
S.jumpto(25,50)
T.begin_fill()
S.begin_fill()
T.circle(-50)
S.circle(-50)
T.end_fill()
S.end_fill()

The colour contrast below shows whether the variant changes the charge / polarity of the side chain:
<br>

<font color='blue'> **basic**</font> (positively charged) --> <font color='#6699ff'> **weakly basic**</font> (uncharged but polar) -->
<br>**neutral** --> <font color='#ff6666'> **weakly acidic**</font> (uncharged but polar) --> <font color='red'> **acidic**</font> (negatively charged)
<br>
<br>
<font color='red'> **acidic**</font> (negatively charged) --> <font color='#ff6666'> **weakly acidic**</font> (uncharged but polar) -->
<br>**neutral** --> <font color='#6699ff'> **weakly basic**</font> (uncharged but polar) --> <font color='blue'> **basic**</font> (positively charged)

In [12]:
# @title Acidic *vs.* Basic
# @markdown ---
# @markdown

clearscreen()
setup(500,125)
T = Turtle()
T.color(colour3)
T.speed(15)
S = T.clone()
S.color(colour4)
T.jumpto(-150,50)
S.jumpto(25,50)
T.begin_fill()
S.begin_fill()
T.circle(-50)
S.circle(-50)
T.end_fill()
S.end_fill()

The circles below show whether the variant changes the size of the side chain:
<br>(with basic vs acidic colouring).

In [13]:
# @title Tiny *vs.* Small *vs.* Big
# @markdown ---
# @markdown

clearscreen()
setup(500,250)
T = Turtle()
T.color(colour3)
T.speed(15)
S = T.clone()
S.color(colour4)
T.jumpto(-150,50)
S.jumpto(25,50)
T.begin_fill()
S.begin_fill()

if Canonical_Amino_Acid == 'ALA' or Canonical_Amino_Acid == 'GLY' or Canonical_Amino_Acid == 'SER':
    T.circle(-20)
elif Canonical_Amino_Acid == 'ASP' or Canonical_Amino_Acid == 'ASN' or Canonical_Amino_Acid == 'THR' or Canonical_Amino_Acid == 'VAL' or Canonical_Amino_Acid == 'PRO':
    T.circle(-30)
elif Canonical_Amino_Acid == 'ARG' or Canonical_Amino_Acid == 'HIS' or Canonical_Amino_Acid == 'PHE' or Canonical_Amino_Acid == 'TRP' or Canonical_Amino_Acid == 'TYR':
    T.circle(-75)
else:
    T.circle(-50)

if New_Amino_Acid == 'ALA' or New_Amino_Acid == 'GLY' or New_Amino_Acid == 'SER':
    S.circle(-20)
elif New_Amino_Acid == 'ASP' or New_Amino_Acid == 'ASN' or New_Amino_Acid == 'THR' or New_Amino_Acid == 'VAL' or New_Amino_Acid == 'PRO':
    S.circle(-30)
elif New_Amino_Acid == 'ARG' or New_Amino_Acid == 'HIS' or New_Amino_Acid == 'PHE' or New_Amino_Acid == 'TRP' or New_Amino_Acid == 'TYR':
    S.circle(-75)
else:
    S.circle(-50)

T.end_fill()
S.end_fill()

## BLOSUM62 Matrix
<img src="https://upload.wikimedia.org/wikipedia/commons/f/f5/Blosum62-dayhoff-ordering.svg" height="500" align="right">

BLOSUM62 (BLOcks SUbstitution Matrix) is a substitution matrix used in the alignment of protein sequences (*e.g.* [BLASTp](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) and [UniProt's BLAST](https://www.uniprot.org/blast)). It was derived from substitutions observed in conserved, ungapped alignment blocks of related protein sequences; the '62' indicates that sequences sharing 62% or more identity were clustered together when counting substitutions.

<br>

Positive values indicate amino acid pairs that replace each other more often than expected by chance during evolution; negative values indicate pairs that rarely replace each other. The colouring in the image of BLOSUM62 follows Margaret Dayhoff's amino acid classification. Together, these show that amino acids can be clustered into groups of more and less similar residues.

<br>

For an overview of the development of BLOSUM, click [here](https://en.wikipedia.org/wiki/BLOSUM)<br>

<br>




## **Table of Margaret Dayhoff's encoding of amino acids**

|                     Amino acids                      | 1-letter code  |      3-letter code       |        Property        | Dayhoff  |
|:----------------------------------------------------:|:--------------:|:------------------------:|:----------------------:|:--------:|
| Cysteine                                             | C              | Cys                      | Sulfur polymerization  | a        |
| Glycine, Serine, Threonine, Alanine, Proline         | G, S, T, A, P  | Gly, Ser, Thr, Ala, Pro  | Small                  | b        |
| Aspartic acid, Glutamic acid, Asparagine, Glutamine  | D, E, N, Q     | Asp, Glu, Asn, Gln       | Acid and amide         | c        |
| Arginine, Histidine, Lysine                          | R, H, K        | Arg, His, Lys            | Basic                  | d        |
| Leucine, Valine, Methionine, Isoleucine              | L, V, M, I     | Leu, Val, Met, Ile       | Hydrophobic            | e        |
| Tyrosine, Phenylalanine, Tryptophan                  | Y, F, W        | Tyr, Phe, Trp            | Aromatic               | f        |

In [None]:
# @title Questions
# @markdown
# @markdown
# @markdown ---
# @markdown What is the BLOSUM62 value for the substitution?
BLOSUM62_value = 0 # @param {type:"number"}
# @markdown ---
# @markdown If the canonical and variant amino acid are in the same Dayhoff group, which one?
Dayhoff_group = 'Not_same_group' # @param ["Not_same_group", "Sulfur_polymerization","Small", "Acid_and_amide", "Basic", "Hydrophobic", "Aromatic"]
# @markdown ---
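
The Dayhoff question above can be cross-checked with a small lookup. The following is a minimal sketch, assuming the group membership given in the table in this notebook; the names `DAYHOFF_GROUPS`, `dayhoff_group` and `shared_dayhoff_group` are illustrative and not part of the original notebook.

```python
# Dayhoff groups keyed by the labels used in the question form.
# Membership is copied from the table in this notebook (three-letter codes, upper case).
DAYHOFF_GROUPS = {
    "Sulfur_polymerization": {"CYS"},
    "Small": {"GLY", "SER", "THR", "ALA", "PRO"},
    "Acid_and_amide": {"ASP", "GLU", "ASN", "GLN"},
    "Basic": {"ARG", "HIS", "LYS"},
    "Hydrophobic": {"LEU", "VAL", "MET", "ILE"},
    "Aromatic": {"TYR", "PHE", "TRP"},
}


def dayhoff_group(amino_acid):
    """Return the Dayhoff group label for a three-letter code, or None if unknown."""
    code = amino_acid.upper()
    for label, members in DAYHOFF_GROUPS.items():
        if code in members:
            return label
    return None


def shared_dayhoff_group(canonical, new):
    """Return the shared group label, or 'Not_same_group' if the groups differ."""
    group = dayhoff_group(canonical)
    if group is not None and group == dayhoff_group(new):
        return group
    return "Not_same_group"


# Leucine is in the hydrophobic group and phenylalanine in the aromatic group.
print(shared_dayhoff_group("LEU", "PHE"))  # Not_same_group
```

The returned label uses the same spelling as the `Dayhoff_group` drop-down, so it can be compared directly with the answer selected in the form.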

# 3D structures

There are various ways to obtain the information below, including a [Structures Available Notebook](https://colab.research.google.com/github/PDBeurope/pdbe-notebooks/blob/main/variants_embl-ebi_may2024/structures_available.ipynb) we have developed for this purpose. Alternatively, the information can be obtained by querying [3D Beacons](https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/) and examining the corresponding [PDBe-KB](https://www.ebi.ac.uk/pdbe/pdbe-kb/) pages.

In [22]:
# @title Structural data availability

UniProt_ID = 'P49768' # @param {type:"string"}
# @markdown ---
# @markdown Are any experimentally determined structures available?
EXPERIMENTALLY_DETERMINED_STRUCTURES = 'Yes' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown How many structures (both predicted and experimentally-determined) are available for the UniProt ID? <br> HINT: Use **3D-Beacons** query
No_of_all_available = 26 # @param {type:"number"}
# @markdown ---
# @markdown How many experimentally-determined structures (with PDB ids) are available for the UniProt ID? <br> HINT: Use either **3D-Beacons** or **PDBe-KB**
No_of_exp_available = 18 # @param {type:"number"}
# @markdown ---
# @markdown Are any experimentally-determined structures (with PDB ids) available where the variant is present? <br> HINT use **PDBe-KB**
VARIANT_STRUCTURE = 'No' # @param ["Input Answer","Yes", "No"]
# @markdown ---

print(f"There are {No_of_all_available} structures (predicted and experimentally determined) available associated with UniProt ID {UniProt_ID}.")
print(f"There are {No_of_exp_available} experimentally-determined structures associated with UniProt ID {UniProt_ID}.")
if VARIANT_STRUCTURE == 'No':
    print(f"There are NO experimentally-determined structures available that contain {New_Amino_Acid} at {Variant_Position} instead of {Canonical_Amino_Acid}.")
elif VARIANT_STRUCTURE == 'Yes':
    print(f"There are experimentally-determined structures available that contain {New_Amino_Acid} at {Variant_Position} instead of {Canonical_Amino_Acid}.")
else:
    # The drop-down is still on its placeholder value, so no claim is made.
    print("Please select 'Yes' or 'No' for whether a structure containing the variant is available.")

There are 26 structures (predicted and experimentally determined) available associated with UniProt ID P49768.
There are 18 experimentally-determined structures associated with UniProt ID P49768.
There are NO experimentally-determined structures available that contain PHE at 435 instead of LEU.
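
The two counts entered above could also be obtained programmatically. The following is a rough sketch rather than the notebook's own method. It assumes that the 3D-Beacons API exposes a per-UniProt summary at the URL pattern shown, and that each item in the returned `structures` list carries a `summary` with a `model_category` field. Both should be checked against the current 3D-Beacons API documentation. The function names are illustrative.

```python
import json
from urllib.request import urlopen

# ASSUMPTION: URL pattern of the 3D-Beacons per-UniProt summary endpoint.
BEACONS_SUMMARY_URL = (
    "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{uniprot_id}.json"
)


def summary_url(uniprot_id):
    """Build the summary URL for a UniProt accession."""
    return BEACONS_SUMMARY_URL.format(uniprot_id=uniprot_id)


def fetch_summary(uniprot_id, timeout=30):
    """Download and parse the summary (requires network access)."""
    with urlopen(summary_url(uniprot_id), timeout=timeout) as response:
        return json.load(response)


def count_structures(summary):
    """Return (all models, experimentally determined models) for a parsed summary.

    ASSUMPTION: summary["structures"] is a list whose items hold a "summary"
    dict with a "model_category" string.
    """
    structures = summary.get("structures", [])
    experimental = sum(
        1
        for item in structures
        if item.get("summary", {}).get("model_category", "").upper()
        == "EXPERIMENTALLY DETERMINED"
    )
    return len(structures), experimental


# Offline illustration with a hand-made response (not real data).
example = {
    "structures": [
        {"summary": {"model_category": "EXPERIMENTALLY DETERMINED"}},
        {"summary": {"model_category": "AB-INITIO"}},
    ]
}
print(count_structures(example))  # (2, 1)
```

With network access, `count_structures(fetch_summary(UniProt_ID))` would give values to compare against `No_of_all_available` and `No_of_exp_available`, provided the assumptions above hold.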


# Local structural environment of the variant position

## **Buried?**

<img src="https://upload.wikimedia.org/wikipedia/commons/c/c5/Protein_folding_schematic.png" height="500" align="right">

<p>Amino acids can be positioned at the protein surface, or buried inside a structural fold.</p>

<br>
<p>Often hydrophobic ('water-hating') amino acids are buried, and hydrophilic ('water-loving') amino acids are on the protein surface. The image on the right shows almost all hydrophobic residues (black) buried and all hydrophilic residues (white) on the surface. This is a simplification, because in most structures a mixture of amino acid types is observed buried within the structural fold. However, the buried hydrophilic residues tend to have amino acid partners with which they form complementary interactions (salt bridges, hydrogen bonds, etc).
</p>

<br>
<p>If an amino acid appears at the surface when considering a single protein chain, it may actually be buried by protein-protein interactions when considering a higher-level complex. For example, it may be buried between protein chains within a protein, or by a protein partner that forms a temporary complex with one or more other proteins for a specific purpose.</p>

<br>
<p>We will first consider the amino acid at the variant position only within the context of a single protein chain.
</p>

In [15]:
# @title Question
# @markdown
# @markdown
# @markdown ---
# @markdown Where is the amino acid with respect to the rest of the protein chain?
FOUND_AT = 'Input Answer' # @param ["Input Answer","Protein_surface", "Buried_site","Unclear"]
# @markdown HINT: Use structural superpositions on **PDBe-KB** Structures page and/or view one or more protein structures using Mol* to aid analysis
# @markdown
# @markdown ---

## Salt bridges & hydrogen bonds
<img src="https://upload.wikimedia.org/wikipedia/commons/b/b4/Next_Revisit_Glutamic_Acid_Lysine_salt_bridge.png" height="300" align="right">

Hydrogen bonds and salt bridges are key features contributing to protein folding and protein stability. A salt bridge is a combination of hydrogen bonding and ionic bonding.

<br>

The most common amino acid pairs involved in a salt bridge:

*  ASP - LYS
*  ASP - ARG
*  GLU - LYS
*  GLU - ARG

<br>

If an acidic or basic amino acid is buried within a structural fold, it is typically forming either a salt bridge or hydrogen bonds with another amino acid.
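
When looking for a partner residue near a buried acidic or basic amino acid, it can help to check which residue types can pair at all. The following is a minimal sketch based only on the four common pairs listed above; it checks residue identity, not geometry, and the function name `can_form_salt_bridge` is illustrative.

```python
# Residue types taken from the list of common salt-bridge pairs above.
ACIDIC = {"ASP", "GLU"}
BASIC = {"LYS", "ARG"}


def can_form_salt_bridge(residue_a, residue_b):
    """Return True if one residue is acidic and the other basic (identity only).

    This does not check distances or orientation in a structure.
    """
    pair = {residue_a.upper(), residue_b.upper()}
    return bool(pair & ACIDIC) and bool(pair & BASIC)


print(can_form_salt_bridge("ASP", "LYS"))  # True
print(can_form_salt_bridge("GLU", "ASP"))  # False
```

Whether a candidate pair actually forms a salt bridge still has to be confirmed by inspecting the structure, for example in Mol*.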


In [16]:
# @title Question
# @markdown
# @markdown
# @markdown ---
# @markdown If the amino acid is basic or acidic, as well as buried, <br> is there an amino acid partner that can be identified?
ACIDIC_or_BASIC_PARTNER = 'Canonical_aa_not_acidic_or_basic' # @param ["Yes", "No","Unclear","Canonical_aa_not_acidic_or_basic"]
# @markdown HINT: Use structural superpositions on **PDBe-KB** Structures page and/or view one or more protein structures using Mol* to aid analysis
# @markdown
# @markdown ---

## **Secondary structure**

The most common secondary structure elements are $\beta$-sheets and $\alpha$-helices.

As shown in the image below, the formation of these features involves hydrogen bonds (black dashes). The hydrogen bond network in $\beta$-sheets and $\alpha$-helices mostly involves the amino acid backbone, not the side chain, which is the part that differs between amino acids and therefore changes when a genetic variant introduces a new amino acid into the protein sequence. Thus, the variant should not directly affect the hydrogen bonding network. Instead, the secondary structure is affected indirectly when the new amino acid has to be repositioned relative to adjacent amino acids because it cannot be accommodated in the same way as the canonical amino acid. The exception is proline, whose side chain is fused to the backbone, so it behaves differently whether it is being replaced or introduced by a genetic variant.

<br>

<img src="https://upload.wikimedia.org/wikipedia/commons/c/c5/Alpha_beta_structure_(full).png" height="400" align="center">


In [17]:
# @title Questions
# @markdown
# @markdown
# @markdown HINT: The following subsection can be aided by the **PDBe-KB** Structures page, especially the protein sequence view.
# @markdown
# @markdown ---
# @markdown What is the secondary structure where the variant occurs?
# @markdown <br> Selecting **LOOP** indicates NO secondary structure element is present.
SECONDARY_STRUCTURE = 'SHEET' # @param ["Input Answer","HELIX", "SHEET", "LOOP"]
# @markdown ---
# @markdown If the answer is **LOOP**, <br> is there a secondary structure element starting or ending in the position directly adjacent to where the variant occurs?
Adjacent_has_SS = 'Yes' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If the answer to the above is **Yes**, <br> what is the secondary structure element starting or ending in the residues adjacent to the variant position in the protein sequence?
ADJACENT_SECONDARY_STRUCTURE = 'SHEET' # @param ["Input Answer","HELIX", "SHEET", "LOOP"]
# @markdown ---
# @markdown Is it likely that the mutation disrupts the secondary structure?
LIKELY_DISRUPTING_SS= 'Maybe' # @param ["Input Answer","Yes", "No", "Maybe"]
# @markdown <br>
# @markdown Some examples of changes that would disrupt secondary structure are:
# @markdown
# @markdown   *  any changes that involve a proline
# @markdown   *  original amino acid is small and new amino acid is large
# @markdown   *  *potentially, but not always:* an acidic / basic amino acid changes charge and is thus no longer positioned to form a salt bridge or hydrogen bonds with the appropriate amino acid partner(s).

# @markdown ---
# @markdown
# @markdown It should be noted that disrupting secondary structure will NOT necessarily impact on protein function.
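
The first two example rules above can be written as a simple check. This is only a sketch. It reuses the size classes from the *Tiny vs. Small vs. Big* cell earlier in this notebook and assumes that "small to large" means a tiny or small canonical residue replaced by one from the largest class. That threshold and the function names are illustrative choices, not part of the original notebook.

```python
# Size classes as used for the circle radii in the "Tiny vs. Small vs. Big" cell.
TINY = {"ALA", "GLY", "SER"}
SMALL = {"ASP", "ASN", "THR", "VAL", "PRO"}
BIG = {"ARG", "HIS", "PHE", "TRP", "TYR"}


def size_rank(amino_acid):
    """Return 0 (tiny), 1 (small), 2 (medium, everything else) or 3 (big)."""
    code = amino_acid.upper()
    if code in TINY:
        return 0
    if code in SMALL:
        return 1
    if code in BIG:
        return 3
    return 2


def disruption_flags(canonical, new):
    """List which example rules apply to a substitution.

    ASSUMPTION: 'small to large' is read as rank <= 1 going to rank 3.
    """
    canonical, new = canonical.upper(), new.upper()
    flags = []
    if canonical != new and "PRO" in (canonical, new):
        flags.append("involves proline")
    if size_rank(canonical) <= 1 and size_rank(new) == 3:
        flags.append("small to large")
    return flags


print(disruption_flags("GLY", "TRP"))  # ['small to large']
print(disruption_flags("LEU", "PHE"))  # []
```

An empty list does not mean the substitution is harmless; it only means that neither of these two rules applies, so the answer to the question above may still be "Maybe".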


## **Protein domains**

<H3>A protein domain can be thought of as a combination of secondary structure elements, folded in a compact manner, that is found amongst diverse proteins (<i>i.e.</i> in proteins with different functions or from different organisms). It is typically thought to be a section of protein sequence which folds independently of other sections when the protein is being generated.</H3>

<p>However, a protein's function (<i>e.g.</i> an enzyme's active site) is not necessarily found within a single domain, as it may be at the interface between two domains. Thus a 'feature' / 'functional unit' highlighted on UniProt pages may or may not correspond to a protein domain.</p>

Protein domains have generally been defined / understood in the context of multiple sequence and structural comparisons, such as those available from databases like [InterPro](https://www.ebi.ac.uk/interpro/), [Pfam](http://pfam.xfam.org/), [SCOPe](https://scop.berkeley.edu/), SCOP, [CATH](https://www.cathdb.info/), etc. Protein classifications / families / superfamilies as defined by these resources reflect analysis of molecular evolution (comparison of diverse proteins' sequence and/or structure), but may be more focused on protein function, rather than shared structural and sequence features between a diverse set of proteins. For example, a Pfam annotation, focused on function, may correspond to two or more domains, found with comparison of diverse proteins, as defined in SCOPe, SCOP or CATH. A quick overview of the type of classification / analysis performed by these databases is available [here](https://www.ebi.ac.uk/interpro/about/interpro/).<br>

<br>

---

<p>Investigating domains and finding related proteins with the same / similar domains may be an avenue of further analysis if no rationale can be developed to understand why the variant causes disease after answering all the questions in this guide. Additionally, if this guide reveals a rationale for understanding why a variant causes disease, related proteins may provide further evidence or insight to support the answer.</p>

---

Valuable tools for evaluating structures are the **Predicted aligned error (PAE)** plots associated with predicted protein structures, such as those available from the [AlphaFold database](https://alphafold.ebi.ac.uk/). The **PAE** plot is a matrix that indicates, for each pair of amino acid positions, the confidence in their predicted positions relative to each other.

<p>Thus, perhaps unsurprisingly, blocks of green, which show confidence in the predicted relative positions of amino acids, often correlate with protein domains, but not always. That is, sometimes a block of green corresponds to two or more domains.</p>
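
To make the idea of a confident block concrete, the sketch below averages PAE values between two sequence segments. It assumes the PAE matrix is available as a list of rows of values in ångström (low values mean high confidence), uses zero-based residue indices, and treats a mean of 5 Å as the cut-off. The cut-off, the toy matrix and the function names are illustrative assumptions only.

```python
def mean_pae(pae, rows, cols):
    """Average the PAE values for every (row, column) pair of residue indices."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


def same_folded_unit(pae, segment_a, segment_b, threshold=5.0):
    """Return True if the two segments are confidently placed relative to each other.

    PAE matrices are not symmetric, so both directions are checked.
    ASSUMPTION: a mean PAE at or below `threshold` counts as confident.
    """
    return (
        mean_pae(pae, segment_a, segment_b) <= threshold
        and mean_pae(pae, segment_b, segment_a) <= threshold
    )


# Toy 4-residue matrix: residues 0-1 and residues 2-3 each form a confident
# block, but the error between the two blocks is high.
toy_pae = [
    [0.5, 1.0, 20.0, 22.0],
    [1.0, 0.5, 21.0, 20.0],
    [19.0, 21.0, 0.5, 1.5],
    [22.0, 20.0, 1.5, 0.5],
]
print(same_folded_unit(toy_pae, range(0, 2), range(0, 2)))  # True
print(same_folded_unit(toy_pae, range(0, 2), range(2, 4)))  # False
```

As noted above, a single confident block can still contain two or more domains, so a `True` result here says nothing about how many domains the block holds.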

<br>

---

One example of this type of occurrence (two domains corresponding to one green square in a **PAE** plot) is pyruvate kinase PKM.<br>
To explore this further:<br>
[PDBe-KB page for pyruvate kinase PKM](https://www.ebi.ac.uk/pdbe/pdbe-kb/proteins/P14618)
<br>
[AlphaFold page for pyruvate kinase PKM](https://alphafold.ebi.ac.uk/entry/P14618)
<br>
[Example PDBe page for one of the pyruvate kinase PKM structures](https://www.ebi.ac.uk/pdbe/entry/pdb/1t5a)<br>
[C-term domain on InterPro](https://www.ebi.ac.uk/interpro/entry/InterPro/IPR036918/)<br>
[C-term domain on CATH](http://www.cathdb.info/version/latest/superfamily/3.40.1380.20/superposition)

---

<br>

<p>In the next subsection we will examine the <strong>PAE</strong> plot from AlphaFold, as well as structural superpositions from PDBe-KB and consider the experimental structures in the context of the full UniProt sequence.</p>

<p>For human and other eukaryotic proteins, when considering the full protein sequence (<i>i.e.</i> UniProt ID sequence), it can appear that there are multiple globular proteins connected by relatively unstructured regions of protein chain. Where the variant occurs within the sequence is relevant to how it may or may not impair protein function.</p>

<p>We will avoid the term domains in the context of the next section of analysis. This is because even though the green square on a <strong>PAE</strong> plot or the sections of proteins that behaved well enough to determine a structure experimentally may correspond to a single domain, it may also correspond to two or more domains. Thus we will use the term 'folded unit'.</p>

Please also consider the protein processing that occurs after the protein is translated. Signal peptides and propeptides are associated with 'molecular processing' events that can occur and are listed on the Feature viewer on the corresponding UniProt page. These are regions removed at various stages and thus are typically not present in the mature form of the protein. Even more relevant to our analysis: sometimes multiple protein chains are generated from a single UniProt ID due to specific molecular processing events. When known, these are shown on the **PDBe-KB** summary page.


In [18]:
# @title Questions
# @markdown
# @markdown
# @markdown HINT: The following subsection can be aided by utilizing the structural superpositions available from **PDBe-KB** Structures page,<br> especially the **AlphaFold Superposition** feature.
# @markdown
# @markdown ---
# @markdown Does the **PAE** plot from the AlphaFold predicted structure for this UniProt ID indicate that there is more than one 'folded unit'?
GreenBlocks_PAE = 'Input Answer' # @param ["Input Answer","Yes", "No", "Maybe"]
# @markdown ---
# @markdown Do the experimentally-determined structures for the UniProt ID cover different regions of sequence?
BlueBlocks_Sequence = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Examine the structural superposition. <br> Is there more than one superposition for the same UniProt ID? <br>HINT: Click on 'Select Segment'.
No_of_sections_Superposition ='Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Does the protein adopt a single compact fold / appear to be a single approximately spherical shape (aka globular protein)?
One_folded_unit = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If there are molecular processing events that generate more than one chain, please indicate the number of chains: <br>HINT: Molecule processing is on **PDBe-KB** Overview page, as well as on the UniProt page *Feature viewer*
Molecular_Processing_Sections = 1 # @param {type:"number"}
# @markdown ---
# @markdown Taking into account any molecular processing events, has the analysis indicated that there is more than one 'folded unit'?
Multiple_folded_units = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---


## Disulfide bridges & metal coordination
<img src="https://upload.wikimedia.org/wikipedia/commons/8/8c/Disulfide_Bridges_(SCHEMATIC)_V.1.svg" height="250" align="right">

<p>Key features that contribute to protein folding and stability are disulfide bridges.  The image on the right highlights how the disulfide bridges act as covalent crosslinkers to stabilize a protein's fold. Metal coordination has also been observed as important for protein folding and stability.</p>

<br>
<p>Disulfide bonds and metal coordination may or may not contribute to catalysis when the protein is an enzyme, in addition to their role in protein stabilization. </p>

<br>
<br>
<p>Disulfide bonds only involve cysteines, and metal coordination tends to involve cysteines and histidines, but can involve other amino acids.</p>


In [19]:
# @title Questions
# @markdown
# @markdown
# @markdown HINT: The following subsection can be aided by utilizing the structural superpositions and sequence viewer available from **PDBe-KB** Ligands page. Also, revisit the UniProt page to see if the enzyme catalysis involves a metal.
# @markdown
# @markdown ---
# @markdown Is the canonical (wild-type) amino acid a cysteine?
Is_CYS = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If the answer for the previous is yes, <br>is the cysteine involved in a disulfide bridge?
Disulfide_bridge = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If the canonical amino acid is a cysteine, <br>is the cysteine involved in coordinating a metal or other ion?
Metal_site_with_CYS = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Is the canonical (wild-type) amino acid a histidine?
Is_HIS = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If the answer for the previous is yes, <br>is the histidine involved in coordinating a metal or other ion?
Metal_site_with_His = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown Examine the structural superposition on the *PDBe-KB* Ligands page. <br>How many metal or other ion sites are there?
Metal_or_ion_sites = 0 # @param {type:"number"}
# @markdown ---
# @markdown Are any metals involved in catalysis?
No_of_Metals_in_catalysis = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---
# @markdown If the variant is at a metal coordination site, <br>is the site part of the active site (*i.e.* is the metal site involved in catalysis)?
Metals_at_Variant_site = 'Input Answer' # @param ["Input Answer","Yes", "No"]
# @markdown ---


## Protein complexes

<p>We have been primarily considering protein structures within the context of a single protein chain. However, many proteins form complexes, for example of two chains (dimers) or three chains (trimers).</p>





In [21]:
# @title Press shift+enter to start
# @markdown
# @markdown ---
# @markdown Summary of information on local environment
# @markdown ---

if (LIKELY_DISRUPTING_SS == 'No' or LIKELY_DISRUPTING_SS == 'Maybe') and SECONDARY_STRUCTURE == 'LOOP' and ADJACENT_SECONDARY_STRUCTURE == 'LOOP':
    print(f"It is highly unlikely that {Canonical_Amino_Acid} at {Variant_Position} being replaced by {New_Amino_Acid} is impacting on the protein function by disrupting a secondary structure element.")
elif (LIKELY_DISRUPTING_SS == 'No' or LIKELY_DISRUPTING_SS == 'Maybe') and SECONDARY_STRUCTURE == 'LOOP' and ADJACENT_SECONDARY_STRUCTURE != 'LOOP':
    print(f"It is unlikely that {Canonical_Amino_Acid} at {Variant_Position} being replaced by {New_Amino_Acid} is impacting on the protein function by disrupting a secondary structure element.")
elif LIKELY_DISRUPTING_SS == 'No' and SECONDARY_STRUCTURE != 'LOOP' and ADJACENT_SECONDARY_STRUCTURE != 'LOOP':
    print(f"It seems unlikely that {Canonical_Amino_Acid} at {Variant_Position} being replaced by {New_Amino_Acid} found within secondary structure element ({SECONDARY_STRUCTURE}) is disrupting this\n or an adjacent secondary structure element ({ADJACENT_SECONDARY_STRUCTURE}).")
elif LIKELY_DISRUPTING_SS == 'Maybe' and SECONDARY_STRUCTURE != 'LOOP' and ADJACENT_SECONDARY_STRUCTURE != 'LOOP':
    print(f"It is possible that {Canonical_Amino_Acid} at {Variant_Position} being replaced by {New_Amino_Acid} found within secondary structure element ({SECONDARY_STRUCTURE}) is disrupting this\n or an adjacent secondary structure element ({ADJACENT_SECONDARY_STRUCTURE}).")
elif LIKELY_DISRUPTING_SS == 'Yes' and SECONDARY_STRUCTURE == 'LOOP' and ADJACENT_SECONDARY_STRUCTURE == 'LOOP':
    print(f"It seems unlikely that {Canonical_Amino_Acid} at {Variant_Position} being replaced by {New_Amino_Acid} is impacting on the protein function by disrupting a secondary structure element\n because it is neither in nor adjacent in sequence to a secondary structure element.")
elif LIKELY_DISRUPTING_SS == 'Yes' and SECONDARY_STRUCTURE == 'LOOP' and ADJACENT_SECONDARY_STRUCTURE != 'LOOP':
    print(f"It seems likely that {Canonical_Amino_Acid} at {Variant_Position} being replaced by {New_Amino_Acid} found within a loop is disrupting an adjacent secondary structure element ({ADJACENT_SECONDARY_STRUCTURE})\n and this may impact on protein function.")
# The 'Maybe' case for a non-loop position is already handled above, so only 'Yes' can reach this branch.
elif LIKELY_DISRUPTING_SS == 'Yes' and SECONDARY_STRUCTURE != 'LOOP' and ADJACENT_SECONDARY_STRUCTURE != 'LOOP':
    print(f"It seems likely that {Canonical_Amino_Acid} at {Variant_Position} being replaced by {New_Amino_Acid} found within secondary structure element ({SECONDARY_STRUCTURE}) is disrupting this\n or an adjacent secondary structure element ({ADJACENT_SECONDARY_STRUCTURE}) and this may impact on protein function.")
else:
    # Combinations not covered above (e.g. unanswered questions) end up here.
    print("Still need to describe.")

It is possible that LEU at 435 being replaced by PHE found within secondary structure element (SHEET) is disrupting this
 or an adjacent secondary structure element (SHEET).
