# Analysis and Visualization

Data analysis and visualization are essential to science. This chapter shows how to load, clean, and plot data with Python, which saves time that can instead go toward research and publications.

Scientists encounter many types of data. Once those data have been collected and prepared, they must be loaded into the computer.

## Loading Data

There are numerous Python packages for loading data into memory-accessible structures. These will be discussed in detail in chapter 11. Here, we will focus on four tools: NumPy, PyTables, pandas, and Blaze.

Numerous factors determine the right tool for data analysis. The most important factor is often the size of the data. 
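
One rough way to judge whether a dataset will fit in memory is to estimate its footprint before loading it. The sketch below assumes a hypothetical table of one million rows and two 64-bit floating-point columns; those dimensions are illustrative assumptions, not properties of the data used in this chapter.

```python
import numpy as np

# Hypothetical dataset dimensions (illustrative assumptions only)
n_rows, n_cols = 1_000_000, 2

# Each float64 value occupies 8 bytes
bytes_per_value = np.dtype(np.float64).itemsize
estimated_bytes = n_rows * n_cols * bytes_per_value
print(f"Estimated size: {estimated_bytes / 1e6:.0f} MB")

# For an array that is already loaded, nbytes reports the size of its data
small_arr = np.zeros((10, 2))
print(small_arr.nbytes, "bytes")
```

If the estimate approaches the available RAM, a tool that reads data in pieces is likely a better fit than loading everything at once.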

### NumPy

For small data that can be loaded into memory all at once, NumPy is often a good choice. We will begin our discussion there. NumPy arranges data into an array of numbers. NumPy arrays are very common and very powerful. 

The file _data/decays.csv_ tabulates the results of counting a decaying isotope. The left-hand column holds the independent variable, time, and the right-hand column holds the dependent variable, the observed number of decays. The following code loads the data from this comma-separated values (CSV) file into a NumPy array:

In [25]:
import numpy as np  # Imports numpy with alias np
decays_arr = np.loadtxt('data/decays.csv', delimiter=",", skiprows=1)  # Creates an object with the loadtxt() function
decays_arr

array([[0.00000000e+00, 1.00000000e+01],
       [1.00000000e+00, 1.35335283e+00],
       [2.00000000e+00, 1.83156389e-01],
       [3.00000000e+00, 2.47875220e-02],
       [4.00000000e+00, 3.35462600e-03],
       [5.00000000e+00, 4.53999000e-04],
       [6.00000000e+00, 6.14420000e-05],
       [7.00000000e+00, 8.31500000e-06],
       [8.00000000e+00, 1.12600000e-06],
       [9.00000000e+00, 1.52000000e-07]])
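
Once the data are in an array, the two columns can be pulled apart by slicing. The sketch below is self-contained: it reads from an in-memory string that stands in for _data/decays.csv_, using only the first four rows shown above, so the variable `csv_text` is an assumption made for illustration.

```python
import io
import numpy as np

# Stand-in for data/decays.csv: a header line plus the first four rows
csv_text = """Time (s),Decays (#)
0,10.0
1,1.35335283
2,0.183156389
3,0.024787522
"""

# skiprows=1 skips the header line, which cannot be parsed as numbers
decays_arr = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
print(decays_arr.shape)  # (number of rows, number of columns)

# Slice out each column as a one-dimensional array
time = decays_arr[:, 0]
decays = decays_arr[:, 1]

# unpack=True transposes the result so the columns can be assigned directly
time2, decays2 = np.loadtxt(io.StringIO(csv_text), delimiter=",",
                            skiprows=1, unpack=True)
```

Both approaches give the same one-dimensional arrays; `unpack=True` is simply a shortcut when every column is wanted as its own variable.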

### Pandas

Pandas is a very flexible tool that provides a good alternative to NumPy or PyTables in many cases. 

It is very easy to load data into pandas. Observe:

In [26]:
import pandas as pd #  Imports pandas and aliases it as pd
decays_df = pd.read_csv('data/decays.csv') #  Creates a data frame object to hold the data loaded by read_csv()
decays_df

Unnamed: 0,Time (s),Decays (#)
0,0,10.0
1,1,1.353353
2,2,0.1831564
3,3,0.02478752
4,4,0.003354626
5,5,0.000453999
6,6,6.1442e-05
7,7,8.315e-06
8,8,1.126e-06
9,9,1.52e-07
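
Unlike a NumPy array, a data frame keeps the column names from the file's header, so columns can be selected by label. The following sketch assumes an in-memory stand-in for _data/decays.csv_ (the variable `csv_text`, with only the first four rows) so that it runs on its own.

```python
import io
import pandas as pd

# Stand-in for data/decays.csv: a header line plus the first four rows
csv_text = """Time (s),Decays (#)
0,10.0
1,1.353353
2,0.1831564
3,0.02478752
"""

# By default, read_csv uses the first line as the column names
decays_df = pd.read_csv(io.StringIO(csv_text))
print(decays_df.columns.tolist())
print(decays_df.dtypes)

# Select a column by its label rather than by position
decays = decays_df["Decays (#)"]

# Convert to a plain NumPy array when a purely numeric array is needed
decays_arr = decays_df.to_numpy()
```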


We can also use pandas to change the format of data. If we ran it, the following code would write the data to an HDF5 file called _decays.h5_, in a group node called _experimental_:

In [27]:
# import pandas as pd
# decays_df = pd.read_csv('data/decays.csv')

# decays_df.to_hdf('decays.h5', 'experimental')
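
Writing HDF5 from pandas requires the PyTables package to be installed. The same write-then-read pattern applies to other formats, so the sketch below swaps in JSON, which needs no extra packages, purely to illustrate a format round trip. The small data frame is a stand-in built by hand, not the contents of the real file.

```python
import io
import pandas as pd

# Hand-built stand-in for the decay table (first three rows only)
decays_df = pd.DataFrame({"Time (s)": [0, 1, 2],
                          "Decays (#)": [10.0, 1.353353, 0.1831564]})

# Serialize the table as a JSON string, one record per row
json_text = decays_df.to_json(orient="records")

# Read the JSON back into a new data frame
restored_df = pd.read_json(io.StringIO(json_text), orient="records")
print(restored_df)
```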

### Blaze 

Blaze is another tool. Like pandas, it can easily convert data between different formats. However, Blaze is still in active development and is not fully stable, so use it with caution. The following code reads the CSV file into a data descriptor, which it then transforms into a Blaze table. In the environment used here, the code failed with the AttributeError shown after it:

In [28]:
import blaze as bz #  Imports blaze and aliases as bz
csv_data = bz.CSV('data/decays.csv') #  Uses the CSV() constructor to transform the csv into blaze data
decays_tb = bz.Table(csv_data) #  Transforms the data descriptor csv_data into a blaze table

AttributeError: module 'pandas.core.computation' has no attribute 'expressions'

## Cleaning and Munging Data

Data munging refers to many things, but broadly means dealing with data. Typically, munging means converting data from its raw form to a well-structured format that can be used for plotting.

Suppose you performed an experiment counting the decay rate of a radioactive source. However, a few things went wrong, and you cannot repeat the experiment due to time or financial constraints. In particular, let's imagine that during the measurement, a colleague walked through the laboratory with a stronger, more stable source, so that many of the measurements are biased by this strong source. Additionally, the lab lost power for a few seconds towards the end of the measurement, so some measurements are nonexistent.

First, let's use pandas to remove the rows with missing data from our table:

In [None]:
decay_df = pd.read_csv("data/many_decays.csv")
decay_df.count() #  The count() method ignores the NaN values

In [None]:
decay_df.dropna() #  The dropna() method returns the dataframe without the NaN values
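
It is worth noting that `dropna()` returns a new data frame and leaves the original untouched, so the result has to be assigned to a name to be kept. The sketch below illustrates this on a small hand-made table; the values are invented for illustration and are not taken from _data/many_decays.csv_.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data/many_decays.csv (values invented for illustration)
decay_df = pd.DataFrame({
    "Time (s)": [0, 1, 2, 3, 4],
    "Decays (#)": [10.0, 1.35, np.nan, 0.025, np.nan],
})

# len() counts all rows; count() counts only the non-missing values
n_rows = len(decay_df)
n_valid = decay_df["Decays (#)"].count()
print(n_rows, n_valid)

# dropna() returns a new data frame and leaves decay_df unchanged,
# so assign the result to keep it
clean_df = decay_df.dropna()
print(clean_df)
```

Comparing `n_rows` with `n_valid` before dropping anything is a quick way to see how much of the measurement was lost.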

### Visualization

Now that the data are a bit cleaner, let's plot them.

#### Matplotlib

Matplotlib is a widely used plotting tool for scientific computing. The following Python script will create a plot of the decay data:

In [None]:
import numpy as np  # Imports and aliases NumPy

# as in the previous example, load decays.csv into a NumPy array
decaydata = np.loadtxt('data/decays.csv', delimiter=",", skiprows=1)

# provide handles for the x and y columns
time = decaydata[:,0]
decays = decaydata[:,1]

# import the matplotlib plotting functionality
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt  # pyplot is the recommended plotting interface

plt.plot(time, decays)  # Generates a plot of decays vs time

plt.xlabel('Time (s)')
plt.ylabel('Decays')
plt.title('Decays')
plt.grid(True)  # Adds gridlines
#plt.savefig("decays_matplotlib.png")  # saves the figure as a png
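
Because the counts fall off exponentially, most of the points in a linear plot sit close to zero. A logarithmic y-axis can make such data easier to read. The sketch below is one possible variation, not part of the original script: it uses Matplotlib's object-oriented interface and synthetic values generated as 10*exp(-2*t), which approximate the numbers tabulated earlier in this chapter. The `Agg` backend line is only needed when running as a standalone script without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the decay data (an assumption for illustration)
time = np.arange(10)
decays = 10.0 * np.exp(-2.0 * time)

fig, ax = plt.subplots()
ax.plot(time, decays, marker="o")
ax.set_yscale("log")  # exponential decay appears as a straight line
ax.set_xlabel("Time (s)")
ax.set_ylabel("Decays")
ax.set_title("Decays")
ax.grid(True)
# fig.savefig("decays_semilog.png")  # uncomment to save the figure
```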

Here is an example of a rather long script that makes a nice flyer for a talk about Matplotlib:

In [None]:
# Import various necessary Python and matplotlib packages
import numpy as np
import matplotlib.cm as cm  # Imports the colormaps library
from matplotlib.pyplot import figure, show, rc  # Imports other useful libraries
from matplotlib.patches import Ellipse  # We need the ellipse shape for our text boxes

# Create a square figure on which to place the plot
fig = figure(figsize=(8,8))

# Create square axes to hold the circular polar plot
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8], polar=True)

# Generate 20 colored, angular wedges for the polar plot
N = 20
theta = np.arange(0.0, 2*np.pi, 2*np.pi/N)
radii = 10*np.random.rand(N)
width = np.pi/4*np.random.rand(N)
bars = ax.bar(theta, radii, width=width, bottom=0.0)
for r,bar in zip(radii, bars):
    bar.set_facecolor(cm.jet(r/10.))
    bar.set_alpha(0.5)

# Using dictionaries, create a color scheme for the text boxes
bbox_args = dict(boxstyle="round, pad=0.9", fc="green", alpha=0.5)
bbox_white = dict(boxstyle="round, pad=0.9", fc="1", alpha=0.9)
patch_white = dict(boxstyle="round, pad=1", fc="1", ec="1")

# Create various boxes with text annotations in them at specific
# x and y coordinates
ax.annotate(" ",
    xy=(.5,.93),  # Places an annotation box at the desired x and y coordinates
    xycoords='figure fraction',  # Tells python to read those annotations as fractions of figure height and width
    ha="center", va="center",  # Aligns the text to the center of the box
    bbox=patch_white)  # Makes the box white

ax.annotate('Matplotlib and the Python Ecosystem for Scientific Computing',
    xy=(.5,.95),
    xycoords='figure fraction',
    xytext=(0, 0), textcoords='offset points',
    size=15,
    ha="center", va="center",
    bbox=bbox_args)

ax.annotate('Author and Lead Developer \n of Matplotlib ',
    xy=(.5,.82),
    xycoords='figure fraction',
    xytext=(0, 0), textcoords='offset points',
    ha="center", va="center",
    bbox=bbox_args)

ax.annotate('John D. Hunter',
    xy=(.5,.89),
    xycoords='figure fraction',
    xytext=(0, 0), textcoords='offset points',
    size=15,
    ha="center", va="center",
    bbox=bbox_white)

ax.annotate('Friday November 5th  \n 2:00 pm \n1106ME ',
    xy=(.5,.25),
    xycoords='figure fraction',
    xytext=(0, 0), textcoords='offset points',
    size=15,
    ha="center", va="center",
    bbox=bbox_args)

ax.annotate('Sponsored by: \n The Hacker Within, \n'
    'The University Lectures Committee, \n The Department of '
    'Medical Physics\n and \n The American Nuclear Society',
    xy=(.78,.1),
    xycoords='figure fraction',
    xytext=(0, 0), textcoords='offset points',
    size=9,
    ha="center", va="center",
    bbox=bbox_args)

#fig.savefig("plot.pdf")

Further examples of plots made with Matplotlib may be found in the Matplotlib gallery.

#### Bokeh

Bokeh is another plotting tool that is quite similar to Matplotlib, but specialized for generating interactive plots for the web. The following script makes an HTML file holding the plot of the decay data:

In [None]:
import numpy as np
# import the Bokeh plotting tools
from bokeh import plotting as bp

# as in the matplotlib example, load decays.csv into a NumPy array
decaydata = np.loadtxt('data/decays.csv',delimiter=",",skiprows=1)

# provide handles for the x and y columns
time = decaydata[:,0]
decays = decaydata[:,1]

# define some output file metadata
bp.output_file("decays.html", title="Experiment 1 Radioactivity")

# create a figure with fun Internet-friendly features (optional)
# (this form assumes a recent Bokeh release)
fig = bp.figure(title="Decays",
                tools="pan,wheel_zoom,box_zoom,reset,save",
                x_axis_label="Time (s)", y_axis_label="Decays (#)")

# on that figure, create a line plot
fig.line(time, decays, color='#1F78B4', legend_label='Decays per second')

# additional customization to the figure can be specified separately
fig.grid.grid_line_alpha = 0.3

# open a browser
bp.show(fig)