# UMD FIRE Stream- Genome Computing V
## Figure Generation with `matplotlib`

Here you will learn about generating figures with the `matplotlib` module.

In this notebook we will learn:
<ul>
    <li>how to make basic plots in matplotlib,</li>
    <li>what the different plotting functions do,</li>
    <li>how to make figures with subplots, and</li>
    <li>how to plot real data sets.</li>
</ul>

In [None]:

import numpy as np

import pandas as pd


# Basic Plotting

A picture is worth a thousand words. It is often easier and more informative to plot the data we are examining than to rely on descriptive statistics alone. In this notebook we'll go over the minimal `python` plotting skills you'll need to get through this boot camp.

## `matplotlib`

A number of you have experience with MATLAB. `matplotlib` was a project started by John Hunter in 2002 to enable MATLAB-like plotting in Python. If you've done a lot of plotting in MATLAB, `matplotlib` will come very naturally to you. If you've never even heard of MATLAB, don't worry! `matplotlib` is very intuitive and you'll be plotting like a pro in no time.

Let's start by importing the package.

In [None]:
# We will be using the pyplot subpackage
# it is standard to call it plt

import matplotlib.pyplot as plt


### A First Plot

Let's jump right in and make our first plot.

In [None]:
# Here is some data in a list
x = [0,1,2,3,4,5,6,7,8,9,10]

# Here is more data, using list comprehension 
# for the formula: y = 2x-3
y = [2*i - 3 for i in x]

# plt.plot will make the plot
# First put what you want on the x, then the y

plt.plot(x,y)

# Always end your plotting block with plt.show with interactive Python
# this makes sure that the plot displays properly
plt.show()

# To help reduce the memory load, clear the figure
plt.clf()


### What Happened?

So what happened when we ran the above code?

`matplotlib` creates a figure object, and on that object it places a subplot object, and finally it places the points on the subplot then connects the points with straight lines.

We'll return to the topic of subplots later in the notebook.

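The three steps described above can also be written out explicitly. The following is a minimal sketch, not what `plt.plot` literally runs internally: it uses `plt.subplots()` to create the figure and its single subplot by hand, and the names `fig`, `ax`, and `line` are chosen here for illustration.

```python
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2 * i - 3 for i in x]

# Steps 1 and 2: make the figure object and place one subplot (an Axes object) on it
fig, ax = plt.subplots()

# Step 3: place the points on the subplot and connect them with straight lines
(line,) = ax.plot(x, y)

# The figure keeps track of its subplots, and the line keeps the plotted data
n_subplots = len(fig.axes)
n_points = len(line.get_xdata())

plt.close(fig)  # release the figure once we are done with it
```

Calling `plt.plot(x, y)` on its own produces the same picture; the explicit form simply makes the figure and subplot objects visible.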
Now you try plotting the following `x` and `y`.

In [None]:
# use np.linspace to make an evenly spaced list of numbers over a specified interval
#    np.linspace(<start>, <stop>, <how many points>)
x = 10*np.linspace(-5,5,100)

# Plot function: y = x^2 - 3
y = x**2 - 3


In [None]:
# Execute your plot here

plt.plot(x,y)
plt.show()
plt.clf()


### Getting More Control

#### Making The Figure Object

We can have more control over how the plot itself looks by creating the figure object ourselves.

In [None]:
# plt.figure() will make the figure object
#    -> figsize can control how large it is (width,height)

plt.figure(figsize = (10,12))

# plt.plot creates the subplot object on the figure and draws the data on it
plt.plot(x,y)

# Add a title
plt.title("A Plot Title v1", fontsize = 20)

plt.show()
plt.clf()

In [None]:
# That's a big plot. Let's go smaller. Try (7,7). Let's also make the background color grey

plt.figure(figsize = (   7,7  ), facecolor="grey")

plt.plot(x,y)

# Add axis labels and control their fontsize
plt.xlabel("x-axis", fontsize = 16)
plt.ylabel("y-axis", fontsize = 16)

# Make a new title
plt.title("A Plot Title v2", fontsize = 20)

plt.show()
plt.clf()

# -!- notice how the background color is in the figure object and NOT in the subplot object

In [None]:
# Better, but not quite. Maybe wider is better than taller. Try 7,5, 
# a lighter shade of grey, 
# and let's try to fill out the subplot using "tight_layout"

plt.figure(figsize = ( 7,5 ), 
           facecolor="lightgrey", 
           tight_layout=True)
# -!- A nice thing about Python is that you can put parameters on separate lines
# as long as they are within parentheses. It will still do everything you ask.

plt.plot(x,y)

plt.xlabel("x-axis", fontsize = 16)
plt.ylabel("y-axis", fontsize = 16)

# Let's add some limits to our x- and y-axes.
plt.xlim((-20,20))
plt.ylim(-100,100)

# New title
plt.title("A Plot Title v3", fontsize = 20)

plt.show()
plt.clf()

In [None]:
# This looks ok. Let's save it!

plt.figure(figsize = ( 7,5 ), 
           facecolor="lightgrey", 
           edgecolor="red",
           tight_layout=True)

plt.plot(x,y)

plt.xlabel("x-axis", fontsize = 16)
plt.ylabel("y-axis", fontsize = 16)
plt.xlim((-20,20))
plt.ylim(-100,100)

plt.title("A Plot Title v3", fontsize = 20)

plt.show()
plt.savefig("My_First_Figure.png")
plt.clf()

In [None]:
# -!- Hold up!
# Why is your figure empty? 

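# Because the order of saving and viewing matters. In a notebook, plt.show() displays the figure
# and then discards it, so a plt.savefig() call that comes afterwards only has a blank figure to save.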

# If you want to both save locally AND view in the notebook, save FIRST, then view. Not the other way around

# Save the above figure by putting the code under here:





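As a hedged illustration of the save-then-view order, the sketch below writes the figure to disk before showing it. The temporary output folder and the `Agg` backend line are assumptions made so the sketch runs outside a notebook; inside Jupyter you would drop the `matplotlib.use` line and could save straight to `"My_First_Figure.png"`.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for running as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = 10 * np.linspace(-5, 5, 100)
y = x**2 - 3

fig = plt.figure(figsize=(7, 5), facecolor="lightgrey", tight_layout=True)
plt.plot(x, y)
plt.xlabel("x-axis", fontsize=16)
plt.ylabel("y-axis", fontsize=16)
plt.title("A Plot Title v3", fontsize=20)

# Save FIRST, while the figure still holds the plotted data ...
out_path = os.path.join(tempfile.mkdtemp(), "My_First_Figure.png")  # hypothetical location
plt.savefig(out_path)

# ... THEN view it, and finally close the figure
plt.show()
plt.close(fig)

saved_size = os.path.getsize(out_path)  # a non-empty file was written
```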

#### Controlling How the Plotted Data Looks

We can control the appearance of what is plotted. Here's a quick cheatsheet of easy to use options:



| Color           | Description  |
| :-------------: |:------------:|
| r               | red          |
| b               | blue         |
| k               | black        |
| g               | green        |
| y               | yellow       |
| m               | magenta      |
| c               | cyan         |
| w               | white        |

|Line Style | Description   |
|:---------:|:-------------:|
| -         | Solid line    |
| --        | Dashed line   |
| :         | Dotted line   |
| -.        | Dash-dot line |

| Marker | Description    |
|:------:|:--------------:|
| o      | Circle         |
| +      | Plus Sign      |
| *      | Star           |
| .      | Point          |
| x      | Cross          |
| s      | Square         |
| d      | Thin Diamond   |
| ^      | Up Triangle    |
| <      | Left Triangle  |
| >      | Right Triangle |
| p      | Pentagon       |
| h      | Hexagon        |

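The codes in these tables can be combined into one short format string, or spelled out with keyword arguments. The sketch below draws the same style both ways on made-up data; the variable names are illustrative only.

```python
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [2 * i - 3 for i in x]

fig, ax = plt.subplots(figsize=(6, 4))

# Format string: color 'r' + line style '--' + marker 'o'
(short_form,) = ax.plot(x, y, "r--o", label="format string")

# The same appearance written with explicit keyword arguments
(long_form,) = ax.plot(x, [v + 2 for v in y],
                       color="r", linestyle="--", marker="o",
                       label="keywords")

ax.legend()
plt.close(fig)
```

The format string is quicker to type, while the keyword form is easier to read and accepts any named color, not just the single-letter codes.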

In [None]:
# Let's now plot 2 sets of data on the same object

plt.figure(figsize = (8,5))

# First, make our plot of magenta pentagons 
# -!- We will add a LABEL so we can add a legend to the plot
plt.plot(x, y,
         'mp', 
         label="points")

# Take the x and y data and shift it, using a green dashed line
plt.plot(x+10,
         y-100,
         'g--', 
         label="shifted line")

# Add axis labels and a title that simply says "Data and its Shift"
plt.xlabel("", fontsize = 16)
plt.ylabel("", fontsize = 16)
plt.title(" ", fontsize = 20)

# plt.legend() adds the legend to the plot
plt.legend(fontsize=14)


plt.show()
plt.clf()

In [None]:
# Work with new data

x = 10*np.random.random(100) - 5

y = x**3 - x**2 + x

print(x)
print()
print(y)

In [None]:
# Make a 10x5 figure object
# Plot the raw data in black circles with label "raw data"
# Plot a shift in x by +10 in blue squares with label "shift in x"
# Plot a separate shift in y by +10 in red squares with label "shift in y"
# Plot a shift in both x and y by +10 in green pentagons with label "shift in x and y"
# No chart title
# and label your axes

plt.figure(figsize=(10,5))

plt.plot(x,
         y, 
         'ko', 
         label='')

plt.plot()

plt.plot()

plt.plot()

plt.xlim(-15, 15)
plt.ylim(-150,150)

plt.xlabel()
plt.ylabel()
plt.legend()

plt.show()
plt.clf()


### Subplots

What if we want to plot more than one thing in the same figure? We'll want to make some subplots.

In [None]:
# plt.subplots makes a figure object then populates it with subplots

fig, axes = plt.subplots(2, 2, figsize = (8,8))

# the first number is the number of rows, the second number is the number of columns
# so this makes a 2 by 2 subplot matrix

# 'fig' is the figure object while 'axes' is a matrix containing the four subplots



# We can plot like before but instead of plt.plot we use axes[i,j].plot
# Notice that this uses index notation. 
# Thus, [0,0] will be the subplot at the top row, first column
# and [1,1] will be the bottom row subplot in the second column


# Plot a "random walk" on axes[0,0] in a red dashed line
axes[0,0].plot(np.random.randn(20).cumsum(),
               'r--')

# Set the x and y labels on that individual subplot:
axes[0,0].set_xlabel("X")
axes[0,0].set_ylabel("y")


# .hist() plots a histogram, a bar plot that shows the distribution of the data. 
# You can control the number of bins with 'bins'
axes[0,1].hist(np.random.randn(1000), 
               bins = 50)

# .scatter() is a quicker way to produce a scatter plot
axes[1,0].scatter(x=np.random.random(20), 
                  y=np.random.randn(20),
                  color = 'g')

# Some text on axes[1,1]
axes[1,1].text(0.2, 0.5, "Fear the Turtle", fontsize = 14)


plt.show()
plt.clf()

Now you practice!

In [None]:
# Here is some random data for you

x1 = 2*np.random.randn(500) + 3
x2 =   np.random.randn(500) + 4

# use function y = x1 + x1^2 + log(x2) + 1/2*(noise), with noise = standard normal random numbers
y = x1 + x1**2 + np.log(x2) + 0.5*np.random.randn(500)


In [None]:
# Make a 3 by 3 subplot (a total of 9 subplots) with a square figure size that is big enough to view all 9 subplots

# Plot histograms of x1, x2, and y along the diagonal
# Plot (x1 vs x2) and (x1 vs y) scatterplots in the remaining columns of the top row
# plot (x2 vs y) scatterplot in the second row 3rd column


fig, axes = plt.subplots( figsize = (    ))


# Below are the 9 subplots. For any subplot you will NOT use, comment that line out; do not delete
axes[0,0].
axes[0,1].
axes[0,2].
axes[1,0].
axes[1,1].
axes[1,2].
axes[2,0].
axes[2,1].
axes[2,2].


# For any subplot you did NOT use, place text in the subplot that says "No data here"
.text(0.2, 0.5, "No data here", fontsize = 14)


plt.show()
plt.clf()


### Plotting Data

Finally we'll see how we can use `matplotlib` to examine some real data.

#### Test Parameter File

We are going to work with a 141-bp nucleosome parameter file that contains 6 base-pair and 6 base-pair step parameter values.

In [None]:
# Read the data into a pandas DataFrame
# (the raw string r'\s+' splits the columns on any run of whitespace)

pars = pd.read_csv("datafiles/test_dnapar.par", 
                   skiprows=2, 
                   sep=r'\s+')



In [None]:
pars.tail()

In [None]:
# Start by making a line plot of the shear, stretch, and stagger parameters along the fragment
# Make a 3x1 subplot, making sure that the figure has the "tight_layout" active
# For each subplot, set the y-limits to -/+1.5 , the x-limits from 0 to the length of the dataframe, and turn gridlines on

fig , axes = plt.subplots(3, 1, figsize=(8,6), tight_layout=True)

#Make a list of values to be used as the x-axis, based on the length of the parameter chain
xs = 

axes[0].plot()
axes[0].set_ylabel("Shear")

axes[1].plot()
axes[1].set_ylabel("Stretch")


axes[2].plot()
axes[2].set_ylabel("Stagger")


# Loops can be used to set some axis-specific settings based on indexing
for i in [0,1,2]:
    axes[i].
    axes[i].
    axes[i].
    
plt.show()
plt.clf()
del xs

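As a hedged sketch of the looping idea on made-up data (not the parameter file), the example below applies the same limits and gridlines to every subplot in a 3x1 figure. The series names and values are synthetic placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n_points = 50
xs = list(range(n_points))

# Synthetic stand-ins for three columns of a dataframe
series = {
    "Series A": rng.uniform(-1, 1, n_points),
    "Series B": rng.uniform(-1, 1, n_points),
    "Series C": rng.uniform(-1, 1, n_points),
}

fig, axes = plt.subplots(3, 1, figsize=(8, 6), tight_layout=True)

# Loop over the subplots and the data together, applying shared settings
for ax, (name, values) in zip(axes, series.items()):
    ax.plot(xs, values)
    ax.set_ylabel(name)
    ax.set_xlim(0, n_points)   # x-limits from 0 to the number of points
    ax.set_ylim(-1.5, 1.5)     # same y-limits on every subplot
    ax.grid(True)              # turn gridlines on

plt.close(fig)
```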
In [None]:
# Make the scatterplot with 1 subplot 
# Plot the tilt vs. roll data in blue circles
# Plot the twist vs. roll data in red squares
# Let the Roll parameter be the x-axis
# Note: These are base-pair STEP parameters, so the first row of data is all 0s. You should not plot it

plt.figure(figsize=(8,8))

plt.scatter(#pars.Roll[1:], 
            #pars.Tilt[1:],
            #marker='o',
            #color="blue",
            #label="Roll v. Tilt",
            
            # We will add two new methods to customize:
            # add an edgecolor to all of the datapoints
            edgecolor="black", 
            
            # Make the datapoints more transparent with alpha, from 0-1 (lower number = more transparent)
            alpha=0.5)

plt.scatter(pars.Roll[1:], 
            pars.Twist[1:],
            marker='s',
            color="red",
            label="Roll v. Twist",
            edgecolor="black", 
            alpha=0.5)
plt.legend()

plt.show()
plt.clf()

In [None]:
# Make 2 sets of histograms of the three rotational base-pair step parameters
# Make one set with 10 bins and the other with 25 bins
# Add y labels to indicate which parameter is used
# Note: These are base-pair STEP parameters, so the first row of data is all 0s. You should not plot it


fig, axes = plt.subplots(3, 1, figsize=(6,6), tight_layout=True)

bin_size=10
axes[0].hist(bins=bin_size)
axes[1].hist(   )
axes[2].hist(   )
del bin_size

plt.show()
plt.clf()


fig, axes = plt.subplots(3, 1, figsize=(6,6), tight_layout=True)

bin_size=25



del bin_size

plt.show()
plt.clf()

In [None]:
# Finally, we can make figures that help with statistical information, such as a Box & Whisker Plot.

# A box plot shows the interquartile range of the data, the median of the parameter, and any outliers 
# Let's look at the box plot for the Prop-Tw parameter:

fig = plt.figure()

plt.boxplot(pars['Prop-Tw'], 
            
            # Label the axis
            labels=["Prop-Tw"], 
            
            # There is a 'vert' option if you want the box to sit horizontally or vertically
            vert=False)

plt.show()
plt.clf()

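To make the pieces of a box plot concrete, the sketch below computes the median, interquartile range, and outliers by hand and compares them with what `boxplot` draws. It assumes matplotlib's default whisker rule of 1.5 times the interquartile range and uses synthetic data, not the parameter file.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: 200 normal values plus two deliberately extreme points
rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(loc=0.0, scale=1.0, size=200), [8.0, -7.5]])

# The box spans the first to third quartile; the line inside it is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

fig, ax = plt.subplots()
parts = ax.boxplot(values)  # returns a dict of the drawn line objects

drawn_median = parts["medians"][0].get_ydata()[0]
drawn_outliers = parts["fliers"][0].get_ydata()

plt.close(fig)
```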
In [None]:
# We can use selection options with Pandas to organize the data based on base pair sequence

fig = plt.figure()

plt.boxplot([pars['Prop-Tw'].loc[pars['#']=='A-T'], 
             pars['Prop-Tw'].loc[pars['#']=='T-A'], 
             pars['Prop-Tw'].loc[pars['#']=='G-C'], 
             pars['Prop-Tw'].loc[pars['#']=='C-G']], 
            labels=["A-T Base Pair", "T-A Base Pair", "G-C Base Pair", "C-G Base Pair"], 
            vert=False)


plt.show()
plt.clf()

In [None]:
# Make 2 box plots for all base-pair step parameters, split up based on motion type (rotational and translational)

fig, axes = plt.subplots(1, 2, figsize=(6,3), tight_layout=True)

axes[0].boxplot()
            
axes[1].boxplot()

plt.show()
plt.clf()

## Advanced: Plotting the test_dnarefframe.dat file

We have worked with the all-atom .pdb files and the parametric .par files.

In order to make base pairs and base-pair steps, we need a set of reference frame data.

A reference frame file has 5 lines for every base pair:
<ul>
    <li>line 1: the nucleotide base pair,</li>
    <li>line 2: the base-pair origin in (x, y, z),</li>
    <li>lines 3-5: the three unit vectors that describe the orientation of the reference frame.</li>
</ul>

This is an advanced task because you first need to go through the file and store only the origin values in a dataframe with x, y, and z column labels.

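One possible way to pull out only the origin lines is sketched below. The sample text is made up, and its exact layout (no header line, whitespace-separated numbers, the origin on the second line of each 5-line block) is an assumption based on the description above; check it against the real file before reusing the idea. The helper name `read_origins` is hypothetical.

```python
import io

import pandas as pd

# Made-up two-base-pair excerpt in the ASSUMED 5-lines-per-frame layout
sample_text = """\
1 A-T
0.00 0.00 0.00
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
2 G-C
0.10 -0.20 3.40
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
"""


def read_origins(handle, lines_per_frame=5):
    """Collect the origin (2nd line of every frame) into an x, y, z dataframe."""
    lines = [line for line in handle if line.strip()]  # ignore blank lines
    rows = []
    for start in range(0, len(lines) - lines_per_frame + 1, lines_per_frame):
        # Keep only the first three fields in case a trailing comment follows
        fields = lines[start + 1].split()[:3]
        rows.append([float(value) for value in fields])
    return pd.DataFrame(rows, columns=["x", "y", "z"])


origins = read_origins(io.StringIO(sample_text))
```

With a real file you would pass an open file handle instead of the `io.StringIO` object, and skip any header lines first if the file has them.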

In [None]:
# Load and organize your data





In [None]:
## Create a 3-D scatter plot of this dataframe





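A 3-D scatter plot needs a subplot created with `projection="3d"`. The sketch below shows the idea on a synthetic helix of points, which stands in for the real origin dataframe; the column names x, y, and z follow the task description.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the origins dataframe: points along a helix
t = np.linspace(0, 4 * np.pi, 60)
origins = pd.DataFrame({"x": 40 * np.cos(t),
                        "y": 40 * np.sin(t),
                        "z": 5 * t})

fig = plt.figure(figsize=(7, 7))

# projection="3d" turns the subplot into a three-dimensional axes object
ax = fig.add_subplot(projection="3d")

# Color the points by their position along the chain
ax.scatter(origins["x"], origins["y"], origins["z"],
           c=np.arange(len(origins)), cmap="viridis")

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")

plt.close(fig)
```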

In [None]:
# You can calculate the Euclidean distance of each origin relative to another
# This can be done using a specific module:

from scipy.spatial import distance 

# This module contains a function: pdist()

# Look up information on pdist() here: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

# Calculate the pdist of the refframe origins and then create a heatmap plot
# This plot will be a square plot that shows different colors based on how close or how far two points are

# Finally, save the image as a .png file





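As a hedged sketch of the distance heatmap, the example below runs `pdist` on synthetic points rather than the real origins. `pdist` returns a condensed one-dimensional array of pairwise distances, so `squareform` is used to expand it into the square matrix that `imshow` can draw. The output folder is a temporary, hypothetical location.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.spatial import distance

# Synthetic stand-in for the origins dataframe: a random walk in 3-D
rng = np.random.default_rng(0)
origins = pd.DataFrame(rng.normal(size=(20, 3)).cumsum(axis=0),
                       columns=["x", "y", "z"])

# Condensed pairwise Euclidean distances, then the full square matrix
condensed = distance.pdist(origins.to_numpy(), metric="euclidean")
dist_matrix = distance.squareform(condensed)

fig, ax = plt.subplots(figsize=(6, 6))
image = ax.imshow(dist_matrix, cmap="viridis")  # color encodes distance
fig.colorbar(image, ax=ax, label="distance")
ax.set_xlabel("point index")
ax.set_ylabel("point index")

# Save before closing the figure
out_path = os.path.join(tempfile.mkdtemp(), "distance_heatmap.png")  # hypothetical location
fig.savefig(out_path)
plt.close(fig)
```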

In [1]:
# Dr. Robert Young, University of Maryland
# In collaboration with Matthew Osborne, Erdos Institute
# UMD FIRE Genome Computing