# Working With Proteins From the PDB

The Protein Data Bank (PDB) stores information about the 3D shapes of proteins, nucleic acids, and complex assemblies. This data bank can be found online at https://www.rcsb.org/. In this tutorial, we will go through how to visualize proteins from the PDB, and how to map mutation data from other databases to the PDB. 

## Visualizing Proteins Using py3Dmol

py3Dmol is a Python wrapper around the 3Dmol.js viewer for use in Jupyter notebooks. It downloads PDB structures using the compressed binary MMTF file format from https://mmtf.rcsb.org. For more information, read the documentation at https://pypi.org/project/py3Dmol/.

In [13]:
import py3Dmol

Here, we visualize the human protein hemoglobin from the PDB.

In [23]:
hemoglobin = py3Dmol.view(query='pdb:5WOG') # 5WOG is the PDB ID
hemoglobin.setStyle({'cartoon': {'color': 'spectrum'}}) # draw the whole structure as a cartoon, colored with the spectrum scheme
hemoglobin.setStyle({'hetflag': True}, {'stick':{'radius': 0.3, 'singleBond': False}}) # draw the hetero atoms (ligands, waters) as sticks instead
hemoglobin.zoomTo() # this command makes sure the output zooms in on the protein
hemoglobin.show() # this command shows us the protein

This protein looks a little messy! To make it easier to read, we can color the different subunits of hemoglobin differently.

In [15]:
hemoglobin.setStyle({'chain':['A','B']},{'cartoon': {'color': 'yellow'}}) # alpha subunits of hemoglobin
hemoglobin.addLabel('alpha subunits', {'fontColor':'yellow', 'backgroundColor':'lightgray'}, {'chain': ['A','B']}) # adding a label
hemoglobin.setStyle({'chain':['C','D']},{'cartoon': {'color': 'blue'}}) # beta subunits of hemoglobin
hemoglobin.addLabel('beta subunits', {'fontColor':'blue', 'backgroundColor':'lightgray'}, {'chain': ['C','D']}) # adding a label
hemoglobin.show()
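
The same coloring can also be written as a loop over a chain-to-color mapping, which may be easier to extend when a structure has many chains. The sketch below only builds the selection and style dictionaries, so it runs without a viewer. `chain_color_styles` is a hypothetical helper name (it is not part of py3Dmol), and the chain assignments are simply the ones used in the cell above.

```python
def chain_color_styles(colors_by_chain):
    """Return one (selection, style) pair per chain, ready to pass to setStyle."""
    return [
        ({'chain': chain}, {'cartoon': {'color': color}})
        for chain, color in colors_by_chain.items()
    ]

# Chain assignments copied from the cell above (assumed, not re-checked here).
subunit_colors = {'A': 'yellow', 'B': 'yellow', 'C': 'blue', 'D': 'blue'}
styles = chain_color_styles(subunit_colors)

# With an existing py3Dmol view such as `hemoglobin`, each pair would be applied as:
#     for selection, style in styles:
#         hemoglobin.setStyle(selection, style)
#     hemoglobin.show()
```

Keeping the colors in one dictionary means a single edit changes the color scheme, instead of editing each `setStyle` call by hand.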

## Visualizing Residues and Ligands

We can also visualize the different residues and ligands interacting with the protein. Here, we will look at the waters interacting with the protein.

In [19]:
hemoglobin.setStyle({'resn': 'HOH'}, {'sphere':{'radius':0.5}})
hemoglobin.show()

Now, turn the waters off by giving them an empty style.

In [17]:
hemoglobin.setStyle({'resn': 'HOH'}, {})
hemoglobin.show()

### Exercise
Try visualizing all the heme residues that interact with hemoglobin. The residue name for heme is "HEM".

### Exercise
Try visualizing the protein with the PDB ID "1JM7". This is a heterodimer (a protein composed of two different polypeptide chains) of the RING domain of the BRCA1 protein and the BARD1 protein. Mutations in BRCA1 are commonly associated with breast cancer. Additionally, visualize the water residues that interact with this heterodimer.

## Visualizing Specific Amino Acids

Now let's visualize some specific amino acids on the hemoglobin protein. Here, we are looking at the 6th amino acid in the beta chains, which is a glutamate (glutamic acid). Take a second to think about how changing this amino acid might affect the structure and function of the hemoglobin protein.

In [18]:
hemoglobin.setStyle({'chain': 'C', 'resi': '6'},{'stick': {'colorscheme': 'redCarbon'}})
hemoglobin.setStyle({'chain': 'D', 'resi': '6'},{'stick': {'colorscheme': 'redCarbon'}})
hemoglobin.show()

In people with sickle cell anemia, the amino acid at this position (normally glutamate) is replaced by valine. Molecules of the affected hemoglobin stick to one another, forming rigid rods. These rods cause a person's red blood cells to take on a deformed, sickle-like shape. These blood cells do not carry oxygen well, and they also tend to clog capillaries. So, when a person affected by sickle cell anemia exerts themselves even slightly, they often experience terrible pain, and might even suffer a heart attack or stroke, all because of a single substitution!
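
The two `setStyle` calls used to highlight residue 6 differ only in the chain. Since a selection can take a list of chains (as in the subunit coloring earlier), they can probably be collapsed into one call. A minimal sketch, where `residue_selection` is a hypothetical helper and not a py3Dmol function:

```python
def residue_selection(chains, position):
    """Build a selection for one residue position on several chains."""
    return {'chain': list(chains), 'resi': position}

# Residue 6 on the chains treated as beta subunits in this tutorial.
beta_6 = residue_selection(['C', 'D'], 6)
stick_style = {'stick': {'colorscheme': 'redCarbon'}}

# With an existing py3Dmol view such as `hemoglobin`:
#     hemoglobin.setStyle(beta_6, stick_style)
#     hemoglobin.show()
```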

## MAVE Data

MAVE stands for Multiplexed Assays of Variant Effect. A MAVE tests all possible single-nucleotide or amino acid variants in a selected genomic region. This rapidly generates a large volume of variant effect data, which can be used to assess how mutations affect protein structure and function. This data can be used in clinical applications to study and develop cures for diseases caused by mutations. MaveDB is a repository of MAVE data and can be accessed at https://www.mavedb.org. In this tutorial, you will look at five different mutations from a MAVE dataset and explore their effects on protein function.

First, we import the data from a .csv (comma-separated values) file into a pandas DataFrame.

In [22]:
import pandas as pd # import the pandas package, which is a common data analysis package
MAVEdata = pd.read_csv('BRCA1_MAVE_practicedata.csv') # import the csv data as a pandas dataframe
MAVEdata # display the dataframe

Unnamed: 0,accession,hgvs_nt,hgvs_pro,score,SE,epsilon,SE_PlusE2NewRep3,score_PlusE2NewRep3,SE_PlusE2NewRep4,score_PlusE2NewRep4,SE_PlusE2NewRep5,score_PlusE2NewRep5,SE_PlusE2Rep3,score_PlusE2Rep3,SE_PlusE2Rep4,score_PlusE2Rep4,SE_PlusE2Rep5,score_PlusE2Rep5
0,urn:mavedb:00000003-a-1#1370,77G>T,Cys26Phe,-4.134405,0.3331,0.0,0.289026,-3.680959,0.42938,-4.364103,0.353721,-3.514476,0.376292,-3.292892,0.315858,-4.385208,0.161177,-5.376609
1,urn:mavedb:00000003-a-1#1622,33A>T,Gln11His,0.796523,0.095511,2.2e-12,0.082801,0.927734,0.104117,0.720088,0.142041,1.044444,0.216357,0.198959,0.141822,0.679058,0.053409,0.942422
2,urn:mavedb:00000003-a-1#9137,159G>T,Gln53His,-2.211282,0.280966,1.11e-16,0.176838,-2.353744,0.162923,-1.45007,0.170418,-2.142533,0.354973,-3.359422,0.329391,-1.424622,0.154763,-2.623216
3,urn:mavedb:00000003-a-1#2102,136T>G,Cys46Gly,-1.6248,0.324948,0.0,0.453543,-2.041432,0.159157,-1.937265,0.205004,-2.788734,0.109212,-1.205581,0.020247,-1.338925,0.336462,-0.434591
4,urn:mavedb:00000003-a-1#1264,266T>G,Ile89Ser,-1.066312,0.224966,0.0,0.32934,-1.9574,0.113729,-0.276591,0.239786,-1.283948,0.123882,-1.166353,0.166965,-1.297198,0.087659,-0.691205


In the table above, the "accession" column holds the MaveDB identifier of each variant. The "hgvs_nt" column describes the nucleotide substitution for this mutation. For example, in the 0th row, the guanine at position 77 of the coding sequence is replaced by a thymine. The "hgvs_pro" column describes the resulting amino acid substitution. For example, in the 0th row, the cysteine at position 26 of the protein is replaced by a phenylalanine. Judging by their names, the remaining columns hold the functional score of each variant and its standard error (SE), both overall and for each experimental replicate; lower scores indicate a greater loss of function. Understanding these columns is not necessary for this tutorial, but you can find more information about them at https://www.mavedb.org/docs/mavedb/ if you are interested.
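
The two notation columns are linked by the genetic code: every three nucleotides of the coding sequence form one codon, so a 1-based nucleotide position `n` falls in codon `ceil(n / 3)`. The sketch below parses the simplified strings shown in the table and checks that relationship. The function names and regular expressions are illustrative assumptions. They cover only simple substitutions like those in this table, not the full HGVS notation.

```python
import re

# Patterns for simple substitutions such as '77G>T' and 'Cys26Phe'.
# The optional 'c.' / 'p.' prefixes are accepted in case they are present.
NT_PATTERN = re.compile(r'^(?:c\.)?(\d+)([ACGT])>([ACGT])$')
PRO_PATTERN = re.compile(r'^(?:p\.)?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$')

def parse_nt(hgvs_nt):
    """Return (position, reference base, new base) for a nucleotide substitution."""
    match = NT_PATTERN.match(hgvs_nt)
    if match is None:
        raise ValueError(f'unsupported nucleotide notation: {hgvs_nt!r}')
    position, ref, alt = match.groups()
    return int(position), ref, alt

def parse_pro(hgvs_pro):
    """Return (reference residue, position, new residue) for a protein substitution."""
    match = PRO_PATTERN.match(hgvs_pro)
    if match is None:
        raise ValueError(f'unsupported protein notation: {hgvs_pro!r}')
    ref, position, alt = match.groups()
    return ref, int(position), alt

def codon_number(nt_position):
    """Codon index containing a 1-based coding-sequence position (integer ceil of n/3)."""
    return (nt_position + 2) // 3

# The five variants from the table above.
variants = [('77G>T', 'Cys26Phe'), ('33A>T', 'Gln11His'), ('159G>T', 'Gln53His'),
            ('136T>G', 'Cys46Gly'), ('266T>G', 'Ile89Ser')]
for hgvs_nt, hgvs_pro in variants:
    nt_position, _, _ = parse_nt(hgvs_nt)
    _, residue_position, _ = parse_pro(hgvs_pro)
    print(hgvs_nt, hgvs_pro, codon_number(nt_position) == residue_position)
```

The residue positions returned by `parse_pro` are the numbers you would pass as `resi` when selecting these amino acids in a viewer, provided the structure file numbers its residues the same way as the protein sequence.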

### Exercise

Visualize the amino acids at the positions described in the MAVE data. (Hint: this is done the same way the 6th amino acid of the hemoglobin beta chains was visualized above.) All of these mutations cause a loss of function (LOF) in the BRCA1 RING domain. The mutation in the 0th row causes the largest LOF, while the mutation in the 1st row causes the smallest LOF. After visualizing all five mutations, order the remaining three mutations from largest to smallest LOF based on where they occur on the protein.

**Acknowledgements**: 
This tutorial was developed by Kriti Shukla at the University of North Carolina a

This tutorial was adapted from the MMTF-2018 Workshop & Hackathon, hosted by Dr. Peter W. Rose at the University of California, San Diego.  