# Laboratory 5
**Tutorial on data analysis and visualization**


Authors:
    
- Prof. Marco A. Deriu (marco.deriu@polito.it)
- Lorenzo Pallante (lorenzo.pallante@polito.it)
- Eric A. Zizzi (eric.zizzi@polito.it)
- Marcello Miceli (marcello.miceli@polito.it)
- Marco Cannariato (marco.cannariato@polito.it)

<div class="alert alert-block alert-warning">
<b>WARNING:</b> You will need a working instance of VMD (Visual Molecular Dynamics) v. 1.9.3 or higher on your machine!<br>
Please make sure to download (registration required) and install the software <b>before</b> attending the lab!<br>
More information and download:<br> https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=VMD
</div>

# Table of Contents

1. Basic data formats and plotting
2. Visualizing PDB files in NGLView
3. Visualizing PDB files in VMD (local)

**Learning outcomes:** 
- loading data from text files
- being able to create meaningful plots for easy data visualization
- being able to load, visualize and render 3D representations of molecular systems

# 1. Basic data formats and plotting

## 1.1 Loading data from text files
Other than in very specific applications, scientific data is often generated and shared in some form of ASCII text file. This means that such data files can be parsed (=read) and analyzed by virtually anyone, provided the data format is known.
As you might remember, reading a text file and saving its content into appropriate arrays is a relatively easy task in most programming languages, Python included. We will see a few examples now.

**REMINDER: Linux is an <u>extensionless system</u>, and generally speaking you should not be afraid of strange file extensions. If you know that a given file is an ASCII text file, then you can open it in the usual ways discussed in the following, whatever the extension (.txt, .dat, .in, .xvg, .out, ...)**

In [None]:
# IF YOU ARE USING COLAB EXECUTE THIS CELL (to copy over data repository)
!git clone https://github.com/lorenzopallante/BiomeccanicaMultiscala.git
!mv BiomeccanicaMultiscala/LAB/05-VisualizationAnalysis/* .

Let's have a look at a basic, two-column text file containing time-series data:

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### 1.1.1 Loading an ASCII file in python

In [None]:
lines=[]
f = open("data/apl.txt","r")
for line in f:
    lines.append(line)
f.close()

Notice that you had to manually close the file at the end using the .close() method! This is rather cumbersome and easy to forget, so a more compact way is the `with` statement, which closes the file automatically when the block ends:

In [None]:
lines=[]
with open("data/apl.txt",'r') as f:
    for line in f:
        lines.append(line)

Now let's have a look at the content:

In [None]:
print("First line of the loaded file:")
print(lines[0])

This file contains two columns, separated by whitespace. The first column is the time (a floating-point value in picoseconds); the second column is a quantity called "Area per Lipid" (APL), corresponding to the average surface area (in Å^2) occupied by one phospholipid in a membrane simulation.


As we can see, we managed to separate individual lines into entries of a list, but we need to further separate the time (which could be our x value) and the APL (our y value to be analyzed). A pure python way to do this would be e.g. with the split method:

In [None]:
row1 = lines[0].split()
print(f"The time is {row1[0]} ps")
print(f"The APL value is {row1[1]} Å^2")

We would need to do this operation for each line of the file, presumably in a loop:

In [None]:
vtimes = [] # List of times (ps)
vapl = [] # List of data points, in this case the APL value (Å^2)
for el in lines:
    fields = el.split() # Split the line once, on whitespace
    vtimes.append(float(fields[0])) # Convert from string to float
    vapl.append(float(fields[1]))


And now we have the individual columns as two different lists:

In [None]:
print(vtimes[0:11])
print(vapl[0:11])

This method works, but has a few shortcomings:
* it only works for this specific file layout (i.e., two columns separated by whitespace, with no header, comments, or metadata)
* it relies on explicit "for" loops, which are slow in Python for large files
* it must be written from scratch each time (= time consuming)

Fortunately, most data processing libraries (numpy, pandas, ...) provide a built-in function to parse text files.<br>
The function bundled with numpy is<br>
```python
import numpy as np
arr = np.loadtxt(filename,usecols=...,max_rows=...,comments=...)
```

Let's see it in action on our text file:

In [None]:
vtimes = np.loadtxt("data/apl.txt",usecols=(0),max_rows=1250)
vapl = np.loadtxt("data/apl.txt",usecols=(1),max_rows=1250)
print(vtimes[0:11])
print(vapl[0:11])
vtimes.dtype

You can use the usual python magic inside the function (string substitutions, etc):
```python
root_folder = "data"
filename = "apl.txt"
vtimes = np.loadtxt(f"{root_folder}/{filename}",usecols=(0),max_rows=1250)
# Alternative:
vtimes = np.loadtxt("%s/%s" % (root_folder,filename),usecols=(0),max_rows=1250)
# ...
```
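
np.loadtxt also copes with other layouts through its `delimiter` and `skiprows` arguments. The sketch below is only an illustration on made-up, in-memory data (the column names and values are invented, not taken from the lab files); `io.StringIO` simply stands in for a file on disk:

```python
import io
import numpy as np

# Made-up comma-separated data with one header line (illustrative only)
text = io.StringIO(
    "time_ps,apl\n"
    "0.0,61.2\n"
    "10.0,60.8\n"
    "20.0,61.5\n"
)

# skiprows=1 skips the header, delimiter="," splits on commas,
# unpack=True returns one array per column
t, apl = np.loadtxt(text, delimiter=",", skiprows=1, unpack=True)
print(t)
print(apl)
```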

### 1.1.2 Loading and plotting XVG files

As you can see, np.loadtxt is much quicker to write than a manual loop. It is also far more flexible in dealing with file types that use different separators, headers, comments, etc.
Let's open, for example, an <b>xvg</b> file, the format GROMACS writes for most of its analyses; this one contains the RMSD as a function of time.

Wait...what the hell is RMSD?<br>
You will see the details, but in a nutshell it is a metric that tells you "how different" two molecular structures are. Often, it is used to quantify the change of a given (macro)molecule in time as the atoms move, bonds and angles wiggle, and so on...<br>

Example: one water molecule evolving in time:<br>
<img src="data/SupplementaryAnimationRMSD.gif" width="1500" align="center">
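
To make the idea concrete: for two sets of N atomic positions, the RMSD is the square root of the mean squared distance between corresponding atoms. Below is a minimal numpy sketch on made-up coordinates. It assumes the two structures are compared as they are, with no least-squares fitting beforehand, and the helper name `rmsd` is hypothetical (it is not a GROMACS or numpy function):

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """RMSD between two (N, 3) coordinate arrays, without any prior fitting."""
    diff = np.asarray(coords_a, dtype=float) - np.asarray(coords_b, dtype=float)
    # Squared distance per atom, averaged over atoms, then square root
    return np.sqrt((diff ** 2).sum(axis=1).mean())

# Made-up coordinates (nm) of a 3-atom molecule at two different times
frame0 = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])
frame1 = frame0 + np.array([0.03, 0.04, 0.0])  # every atom shifted by 0.05 nm

print(rmsd(frame0, frame0))  # identical structures give 0.0
print(rmsd(frame0, frame1))  # approximately 0.05
```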

GROMACS calculates this quantity automatically for us, given a trajectory of a molecule in time. It saves the results into an xvg file. Let's have a look:

In [None]:
%%bash
head -n 20 data/rmsd.xvg

A couple of things to notice:
* The file has a header of 18 lines
* Lines starting with # are comments
* Lines starting with @ contain plotting instructions (e.g. for grace), and can also be regarded as comments here
* Actual data starts at line 19, and the two columns are separated by 4 spaces

We could read this file in pure python (skipping the first 18 lines, etc.), but numpy makes it really easy:

In [None]:
comm = ["#","@"] # A list containing the starting characters of comment lines
vtimes = np.loadtxt("data/rmsd.xvg",usecols=(0),comments=comm)
vrmsd = np.loadtxt("data/rmsd.xvg",usecols=(1),comments=comm)
# Or, even more compact:
vtimes, vrmsd =  np.loadtxt("data/rmsd.xvg",usecols=(0,1),comments=comm, unpack=True)
# Let's see the data:
print(vtimes[0:11])
print(vrmsd[0:11])


<div class="alert alert-block alert-info"><b>Nice little detail:</b><br> numpy already converted small numbers into scientific notation!</div>
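
For comparison, the pure python route could look like the sketch below. It runs on a made-up xvg-like snippet (header lines and values are invented), and it skips comment lines by their first character rather than by counting header lines:

```python
# Made-up xvg-like content (header lines and values are invented)
xvg_text = """# comment line
@    title "RMSD"
@    xaxis  label "Time (ps)"
   0.0000000    0.0000005
  10.0000000    0.0781123
  20.0000000    0.0912345
"""

times, values = [], []
for line in xvg_text.splitlines():
    # Skip blank lines and header lines starting with '#' or '@'
    if not line.strip() or line.startswith(("#", "@")):
        continue
    t, v = line.split()
    times.append(float(t))
    values.append(float(v))

print(times)
print(values)
```

This is essentially what the `comments` argument of np.loadtxt does for us in one line.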

Now that we have the data, we can first of all plot it, using matplotlib and/or seaborn. Seaborn is a matplotlib wrapper, which makes plotting a little bit easier thanks to its preset plotting functions which allow you to make nice plots with little adjustments.<br>
<div class="alert alert-block alert-warning"> Keep in mind that everything you do in seaborn can also be achieved using pure matplotlib. Also, a plot generated using seaborn is in fact a matplotlib plot, so all its attributes (title, axes, legends, etc.) are accessible using the matplotlib syntax! </div>

In [None]:
# Example 1: using just matplotlib
plt.plot(vtimes,vrmsd);

This is a very basic plot. It can be further personalized though. Let's add a title, axis labels and maybe change the colors a bit:

In [None]:
plt.plot(vtimes,vrmsd,'k')
plt.title("RMSD after lsq to Backbone")
plt.xlabel("Time (ps)")
plt.ylabel("RMSD (nm)")
plt.ylim(0,1);

Now let's try with Seaborn:

In [None]:
sns.lineplot(x=vtimes,y=vrmsd);

As you can see, the plot is exactly the same as the one generated using pure matplotlib. For simple plots like these, matplotlib alone is easy enough to use, so we'll stick to that.<br>
Let's adjust the plot a bit to explore some further possibilities. For example, let's plot just one every 100 datapoints to simplify the plot:

In [None]:
plt.plot(vtimes[::100],vrmsd[::100],'k')
plt.title("RMSD after lsq to Backbone")
plt.xlabel("Time (ps)")
plt.ylabel("RMSD (nm)");

We used pure python syntax (a thing called "slicing"): `[::100]` keeps one element every 100, starting from the first.<br>


<div class="alert alert-block alert-warning"> <b>CAUTION:</b> the lists provided for x and y values must have the same length (just as in e.g. Matlab). Also, be careful that by skipping a certain number of elements, depending on the sampling frequency, you might miss some important parts of the dynamics of the system! </div>
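
Both points can be checked on a small made-up signal. In the sketch below (all values invented), x and y are sliced in the same way and therefore keep the same length, but a short spike falling between the sampled points disappears entirely:

```python
import numpy as np

# Made-up signal: flat baseline with a short spike at frames 150-152
signal = np.zeros(1000)
signal[150:153] = 1.0
frames = np.arange(signal.size)

# Keep one point every 100: x and y are sliced the same way,
# so they keep the same length
sub_frames = frames[::100]
sub_signal = signal[::100]

print(len(sub_frames), len(sub_signal))  # both have 10 elements
print(signal.max(), sub_signal.max())    # the spike is lost after subsampling
```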


### 1.1.3 Bash + Python Exercise

Let's merge what we have learned!

In lab 04-BashScripting you wrote a script that, for each pdb file in a folder, counts the number of amino acids.

Now:
- Modify the script by adding a third column that contains the number of alanines (resname ALA) in the pdb.
- Launch the script in order to create the file stats.csv.
- Write Python code that reads stats.csv and creates a scatterplot showing the number of residues on the x-axis and the number of alanines on the y-axis. Add the appropriate labels to the figure.

Do you see any relationship between the two variables?

<div class="alert alert-block alert-info"> <b>HINT:</b> plt.scatter can be used to create a scatterplot. Look at the online documentation of matplotlib!</div>


In [None]:
# Write your code here

# 2. Visualisation of molecular systems

Humans are visual learners: it turns out that, along with our other senses, our brain is highly specialized in interpreting visual cues and analysing visual patterns in an almost automatic way.<br>
This is one of the reasons why we appreciate data plots so much (and sunsets, and landscapes, and flowers, ...).<br>
The same reasoning can be also applied to molecular systems: visualising a 3D model of a protein can give us a great deal of information regarding the system, even prior to analysing it numerically.

Example: is there something wrong with this system?<br>
<img src="glitch.png" width="1500" align="left">

Something obviously went wrong, and we can catch this (and many other similar glitches) by just looking at our systems!

There are a bunch of different options available to view pdb files and molecular trajectories, from stand-alone software (VMD, PyMOL, MOE, Avogadro, Chimera, ...) to code plugins (for Python, Java, ...).
We will see two of them:
* VMD (standalone, interactive)
* NGLView (in a jupyter notebook, interactive)

## 2.1 NGLView

NGLView is a library to view molecular structures and trajectories from Python code (specifically, in an interactive Python session).
Let's see it in action:

In [None]:
# IF YOU ARE USING COLAB EXECUTE THIS CELL (to install nglview)
!pip install nglview
from google.colab import output
output.enable_custom_widget_manager()

In [None]:
import nglview as nv
from IPython.display import IFrame

with open("data/peptide.pdb") as f:
    view = nv.show_file(f, ext="pdb")
view

Note the use of the classic python syntax to open a file, pass the handle to another function (in this case nglview) and then close the file. As you can see, this creates an interactive widget that you can control with your mouse to rotate, zoom, highlight atoms, etc.<br>
The main controls are the following:

* Translation: right click + drag
* Rotation: left click + drag
* Z-axis rotation: Ctrl + right click + drag
* Zoom: scroll wheel
* Center view: left click on the desired atom (or its representation)


Sometimes you might be interested in saving a particular visualization (e.g. to create a figure). You can do this using the same **view** object:

In [None]:
view.render_image(trim=True, factor=4)

In this case we used the "trim" keyword to crop the image precisely around the peptide, and the "factor" keyword to control the quality of the image. Note that if you change the visualization in the interactive widget (e.g. by rotating the peptide above) and then run the render_image method again, a new, updated image is generated! This is very handy for a quick rendering of pdb files.

NGLView can also be very handy to quickly view distances, angles and dihedrals.<br>
1. Right-click on a single atom and a green sphere will appear on the atom.
2. Right-click on a second atom, so that a second sphere appears. You can repeat this procedure for up to four atoms.
3. Right-click one more time on the last selected atom to terminate the selection.
4. If you chose two atoms the distance in Å between the two will appear with a dashed connecting line.
5. If you chose three atoms, the angle in degrees (°) between the three will appear.
6. If you chose four atoms, the dihedral angle in degrees (°) will appear.
7. To remove a distance/angle/dihedral representation, repeat the steps above on the selected atoms.

**TRY YOURSELF!**
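
If you want to cross-check a distance shown in the widget, you can recompute it from the coordinates in the PDB file. The sketch below assumes standard fixed-column ATOM records (x, y, z in columns 31-38, 39-46 and 47-54) and uses two made-up atoms; `atom_xyz` is a hypothetical helper, not part of NGLView:

```python
import numpy as np

# Two made-up ATOM records in PDB fixed-column format (coordinates in Å)
pdb_lines = [
    "ATOM      1  N   ALA A   1       0.000   0.000   0.000",
    "ATOM      2  CA  ALA A   1       3.000   4.000   0.000",
]

def atom_xyz(line):
    """Extract x, y, z from PDB columns 31-38, 39-46 and 47-54."""
    return np.array([float(line[30:38]), float(line[38:46]), float(line[46:54])])

a, b = (atom_xyz(line) for line in pdb_lines)
distance = np.linalg.norm(a - b)  # Euclidean distance in Å
print(f"{distance:.3f} Å")        # 5.000 Å for these made-up coordinates
```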

Note that NGLView works both on local files and on PDB identifiers: it can fetch and visualize a given PDB file for you, so that you don't need to download it beforehand. Just enter the PDB ID for the entry as in the example below:

In [None]:
view2 = nv.show_pdbid("1jff")
view2

NGLView also allows you to customize the visualization (representations of atoms, colors, etc.) in many ways. Let's see a few examples (see the NGLView documentation for an exhaustive list)

In [None]:
view2.add_surface()

<div class="alert alert-block alert-warning"> <b>CAREFUL:</b> Notice how running the above command without calling "view2" again at the end actually changed the visualization in the widget above! There is no need to create a new NGLView widget every time you change the representation!<br>
We can just keep working on the same widget. </div><br>
Let's see a couple more examples:

In [None]:
# Let's make some changes:
# Quick reset:
view2.clear_representations()
# New surface, with a wireframe:
view2.add_surface(color='blue', wireframe=True, opacity=0.2, isolevel=3.)
# Some sweet licorice:
view2.add_representation('licorice', selection='not hydrogen and not protein') # Only ligands will be licorice!
# Add some cartoon back:
view2.add_representation('cartoon', selection='protein', color='green')

Pretty cool stuff huh?<br>
Actually, there are many more things you can do with NGLView, especially if you pair it to other tools that deal with molecular dynamics trajectories in Python (e.g. Pytraj, MDAnalysis,...). We will not discuss these tools here, since these are a bit more advanced topics.<br>


The <b><u>take-home message</u></b> here is that NGLView is a really handy tool to have a quick look at molecular systems within interactive python sessions. As you have seen, it can be really useful to have a python/Colab notebook where you perform data analysis, create the plots, and also view the molecular systems, all in the same workspace.<br>

**However...**

Many times it is much quicker to use **Visual Molecular Dynamics (VMD)** to look at pdb files and trajectories. Don't be fooled by its rather "vintage"-looking user interface! This piece of software is remarkably powerful to view MD data and to create high-quality renderings!<br>

Let's see it in action with a quick demo.