# L04 notebook

Today's guided lecture explores two data sets: protein contact maps and molecular properties.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Contact maps

Here is what a protein looks like.

<div id="msegfp-view" class="mol-container"></div>
<script>
var uri = 'https://files.rcsb.org/view/8DTA.pdb';
jQuery.ajax( uri, {
    success: function(data) {
        // https://3dmol.org/doc/GLViewer.html
        let viewer = $3Dmol.createViewer(
            document.querySelector('#msegfp-view'),
            { backgroundAlpha: '0.0' }
        );
        viewer.addModel( data, 'pdb' );
        viewer.setStyle({chain: 'A'}, {cartoon: {color: 'spectrum'}});
        viewer.setStyle({chain: 'A', resn: 'CRO'}, {stick: {}, cartoon: {color: "spectrum"}});
        viewer.setStyle({chain: 'A', resi: '147'}, {stick: {}, cartoon: {color: "spectrum"}});
        viewer.setStyle({chain: 'A', resi: '202'}, {stick: {}, cartoon: {color: "spectrum"}});
        viewer.setView([ -60.64682338153259, -20.114962159611807, 0.5702077286702113, 80.5194132281471, -0.15077826938374425, 0.19679882644092048, -0.8102144809849335, -0.5311201654949984 ]);
        viewer.render();
    },
    error: function(hdr, status, err) {
        console.error( "Failed to load " + uri + ": " + err );
    },
});
</script>

In [None]:
import io
from urllib import request

NPY_PATH = "https://gitlab.com/oasci/courses/pitt/biosc1540-2024s/-/raw/main/biosc1540/files/npy/protein-contact-maps.npy"

response = request.urlopen(NPY_PATH)
content = response.read()

# Load the .npy file
contact_maps = np.load(io.BytesIO(content))
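
The layout of the loaded `contact_maps` array is not described here, so the following is only a minimal sketch of what a contact map generally is: a square matrix marking which pairs of residues lie within some distance of each other. The function name `contact_map`, the toy coordinates, and the 8 Å cutoff are assumptions chosen for illustration (8 Å is a common convention, not necessarily the one used to build the file above).

```python
import numpy as np


def contact_map(coords, cutoff=8.0):
    """Return a boolean (n, n) matrix marking pairs closer than `cutoff`.

    `cutoff` is an assumed threshold in the same units as `coords`.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise coordinate differences via broadcasting: shape (n, n, 3)
    diff = coords[:, None, :] - coords[None, :, :]
    # Euclidean distance between every pair of points: shape (n, n)
    dist = np.linalg.norm(diff, axis=-1)
    return dist < cutoff


# Toy "protein": five points on a straight line, spaced 3.8 units apart
toy_coords = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
toy_map = contact_map(toy_coords)
print(toy_map.astype(int))
```

In this toy case, points up to two positions apart are in contact and more distant pairs are not, so the printed matrix is a symmetric band around the diagonal.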

## Molecular properties

In [7]:
CSV_PATH = "https://gitlab.com/oasci/courses/pitt/biosc1540-2024s/-/raw/main/biosc1540/files/csv/mol-biodegrade-props.csv"

df_all = pd.read_csv(CSV_PATH)

print(df_all.head())

      CAS-RN             Smiles Class  SpMax_L  J_Dz(e)  nHM  F01[N-N]  \
0  1120-21-4        CCCCCCCCCCC    RB    3.919   2.6909    0         0   
1   106-88-7            CCC1CO1    RB    4.170   2.1144    0         0   
2   112-50-5       CCOCCOCCOCCO    RB    3.932   3.2512    0         0   
3    64-18-6               OC=O    RB    3.000   2.7098    0         0   
4   124-17-4  CCCCOCCOCCOC(C)=O    RB    4.236   3.3944    0         0   

   F04[C-N]  NssssC  nCb-  ...  nCrt  C-026  F02[C-N]  nHDon  SpMax_B(m)  \
0         0       0     0  ...     0      0         0      0       2.949   
1         0       0     0  ...     0      0         0      0       3.315   
2         0       0     0  ...     0      0         0      1       3.076   
3         0       0     0  ...     0      0         0      1       3.046   
4         0       0     0  ...     0      0         0      0       3.351   

   Psi_i_A  nN  SM6_B(m)  nArCOOR  nX  
0    1.591   0     7.253        0   0  
1    1.967   0    
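
Beyond `head()`, a few quick summaries help when first looking at a table like this. The sketch below rebuilds a tiny stand-in frame from the five rows printed above so it runs without a network connection; `df_demo` is a hypothetical name and only a handful of columns are copied. The same calls should work on `df_all`.

```python
import pandas as pd

# Stand-in for df_all: the first five rows shown above, a few columns only
df_demo = pd.DataFrame(
    {
        "CAS-RN": ["1120-21-4", "106-88-7", "112-50-5", "64-18-6", "124-17-4"],
        "Smiles": ["CCCCCCCCCCC", "CCC1CO1", "CCOCCOCCOCCO", "OC=O", "CCCCOCCOCCOC(C)=O"],
        "Class": ["RB", "RB", "RB", "RB", "RB"],
        "SpMax_L": [3.919, 4.170, 3.932, 3.000, 4.236],
        "nHDon": [0, 0, 1, 1, 0],
    }
)

# Number of rows and columns
print(df_demo.shape)

# How many molecules fall into each class label
class_counts = df_demo["Class"].value_counts()
print(class_counts)

# Summary statistics (mean, min, max, ...) for the numeric columns
summary = df_demo.describe()
print(summary)
```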

## Get acquainted

I define a function called `show_mol` in a hidden cell that lets you display a 3D view of a molecule from its `"Smiles"` string using [RDKit](https://www.rdkit.org/) and [py3Dmol](https://3dmol.org/).

In [3]:
# @title

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
import py3Dmol


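def show_mol(smi, style="stick"):
    """Display an interactive 3D view of a molecule given its SMILES string."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        raise ValueError(f"Could not parse SMILES string: {smi!r}")
    mol = Chem.AddHs(mol)  # Add explicit hydrogens before building 3D coordinates
    # EmbedMolecule returns -1 when it fails to generate 3D coordinates
    if AllChem.EmbedMolecule(mol) == -1:
        raise ValueError(f"Could not generate 3D coordinates for: {smi!r}")
    AllChem.MMFFOptimizeMolecule(mol, maxIters=200)  # Relax the geometry with MMFF
    mblock = Chem.MolToMolBlock(mol)

    view = py3Dmol.view(width=500, height=500)
    view.addModel(mblock, "mol")
    view.setStyle({style: {}})
    view.zoomTo()
    view.show()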

In [6]:
df_index = 62  # Change me to any integer from 0 to 1054
row = df_all.iloc[df_index]  # One molecule as a pandas Series
show_mol(row["Smiles"])
print(row)

CAS-RN              91-16-7
Smiles         COc1ccccc1OC
Class                    RB
SpMax_L               4.828
J_Dz(e)              3.3386
nHM                       0
F01[N-N]                  0
F04[C-N]                  0
NssssC                    0
nCb-                      2
C%                     40.0
nCp                       0
nO                        2
F03[C-N]                  0
SdssC                   0.0
HyWi_B(m)             3.203
LOC                   1.261
SM6_L                 9.749
F03[C-O]                  4
Me                    1.004
Mi                    1.125
nN-N                      0
nArNO2                    0
nCRX3                     0
SpPosA_B(p)           1.219
nCIR                      1
B01[C-Br]                 0
B03[C-Cl]                 0
N-073                     0
SpMax_A               2.247
Psi_i_1d              0.014
B04[C-Br]                 0
SdO                     0.0
TI2_L                 1.436
nCrt                      0
C-026               
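
As the cell above shows, `iloc` picks a row by position and returns a Series whose fields are read by column name. Here is a small sketch of that and of two other common selections, a lookup by value and a boolean filter. It again uses a hypothetical stand-in frame (`df_demo`) built from the first five rows printed earlier, so it runs on its own.

```python
import pandas as pd

# Stand-in for df_all: the first five rows shown earlier, a few columns only
df_demo = pd.DataFrame(
    {
        "CAS-RN": ["1120-21-4", "106-88-7", "112-50-5", "64-18-6", "124-17-4"],
        "Smiles": ["CCCCCCCCCCC", "CCC1CO1", "CCOCCOCCOCCO", "OC=O", "CCCCOCCOCCOC(C)=O"],
        "nHDon": [0, 0, 1, 1, 0],
    }
)

# Select by position: the row at integer position 3, returned as a Series
row = df_demo.iloc[3]
print(row["Smiles"])

# Select by value: all rows whose CAS-RN matches a given identifier
match = df_demo[df_demo["CAS-RN"] == "106-88-7"]
print(match)

# Filter with a boolean mask: molecules with at least one H-bond donor
donors = df_demo[df_demo["nHDon"] > 0]
print(donors["Smiles"].tolist())
```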

Each molecule comes with a large set of computed descriptors.

For completeness, here are the column descriptions.
You do not need to know what any of these mean; we are just using this for some realistic data.

<details><summary>Open to view data definitions</summary>

| Column | Description |
| ------ | ----------- |
| B01[C-Br] | presence/absence of C–Br at topological distance 1 |
| B03[C-Cl] | presence/absence of C–Cl at topological distance 3 |
| B04[C-Br] | presence/absence of C–Br at topological distance 4 |
| C% | percentage of C atoms |
| C-026 | R–CX–R |
| F01[N-N] | frequency of N–N at topological distance 1 |
| F02[C-N] | frequency of C–N at topological distance 2 |
| F03[C-N] | frequency of C–N at topological distance 3 |
| F03[C-O] | frequency of C–O at topological distance 3 |
| F04[C-N] | frequency of C–N at topological distance 4 |
| HyWi_B(m) | hyper-Wiener-like index (log function) from Burden matrix weighted by mass |
| J_Dz(e) | Balaban-like index from Barysz matrix weighted by Sanderson electronegativity |
| LOC | lopping centric index |
| Me | mean atomic Sanderson electronegativity (scaled on carbon atom) |
| Mi | mean first ionization potential (scaled on carbon atom) |
| N-073 | Ar2NH/Ar3N/Ar2N–Al/R···N···R |
| nArCOOR | number of esters (aromatic) |
| nArNO2 | number of nitro groups (aromatic) |
| nCb- | number of substituted benzene C(sp2) |
| nCIR | number of circuits |
| nCp | number of terminal primary C(sp3) |
| nCrt | number of ring tertiary C(sp3) |
| nCRX3 | number of CRX3 |
| nHDon | number of donor atoms for H-bonds (N and O) |
| nHM | number of heavy atoms |
| nN | number of nitrogen atoms |
| nN-N | number of N hydrazines |
| nO | number of oxygen atoms |
| NssssC | number of atoms of type ssssC |
| nX | number of halogen atoms |
| Psi_i_1d | intrinsic state pseudoconnectivity index–type 1d |
| Psi_i_A | intrinsic state pseudoconnectivity index–type S average |
| SdO | sum of dO E-states |
| SdssC | sum of dssC E-states |
| SM6_B(m) | spectral moment of order 6 from Burden matrix weighted by mass |
| SM6_L | spectral moment of order 6 from Laplace matrix |
| SpMax_A | leading eigenvalue from adjacency matrix (Lovasz–Pelikan index) |
| SpMax_B(m) | leading eigenvalue from Burden matrix weighted by mass |
| SpMax_L | leading eigenvalue from Laplace matrix |
| SpPosA_B(p) | normalized spectral positive sum from Burden matrix weighted by polarizability |
| TI2_L | second Mohar index from Laplace matrix |
</details>
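
Many of these column names contain brackets, parentheses, `%`, or `-`, so they cannot be reached with attribute access (`df.name`) and need bracket indexing instead. The sketch below uses an empty, hypothetical frame (`df_cols`) that carries only a handful of the column names from the table, and shows bracket indexing plus grouping columns by their name prefix.

```python
import pandas as pd

# A few column names copied from the table above; the frame holds no data
columns = [
    "B01[C-Br]", "B03[C-Cl]", "B04[C-Br]", "C%",
    "F01[N-N]", "F02[C-N]", "F03[C-N]", "F03[C-O]", "F04[C-N]",
    "SpMax_B(m)", "nCb-",
]
df_cols = pd.DataFrame(columns=columns)

# Bracket indexing works for any column name, including special characters
carbon_pct = df_cols["C%"]

# Group descriptors by prefix: "F0..." are frequency counts,
# "B0..." are presence/absence flags
freq_cols = [c for c in df_cols.columns if c.startswith("F0")]
presence_cols = [c for c in df_cols.columns if c.startswith("B0")]

print(freq_cols)
print(presence_cols)

# Select several columns at once by passing a list of names
freq_table = df_cols[freq_cols]
```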

## Acknowledgements

This CSV file was taken from [this paper](https://pubs.acs.org/doi/10.1021/ci4000213).