Data Visualization in Python
============================

<center><img src="http://ijstokes-public.s3.amazonaws.com/img/continuum-logo-color.png"></center>

<a href="http://about.me/ijstokes" target="_parent">Ian Stokes-Rees</a> <a href="http://twitter.com/ijstokes" target="_parent">@ijstokes</a>

<a href="http://continuum.io" target="_parent">Continuum Analytics</a> <a href="http://twitter.com/ContinuumIO" target="_parent">@ContinuumIO</a>

Course Material: <a href="http://j.mp/pyvis-3h" target="_parent">http://j.mp/pyvis-3h</a>
        
Background for those new to Python: <a href="http://j.mp/py4sci-3h" target="_parent">http://j.mp/py4sci-3h</a>

References
----------

* <a href="http://matplotlib.org/" target="_parent">matplotlib</a>

* <a href="https://github.com/olgabot/prettyplotlib" target="_parent">prettyplotlib</a>

* <a href="http://www.stanford.edu/~mwaskom/software/seaborn/" target="_parent">seaborn</a>

* <a href="http://bokeh.pydata.org/" target="_parent">bokeh</a>

* <a href="http://docs.enthought.com/mayavi/mayavi/" target="_parent">mayavi</a> <a href="https://github.com/enthought/mayavi" target="_parent">[github]</a>

* <a href="http://pandas.pydata.org/" target="_parent">pandas</a>

* <a href="http://networkx.github.io/" target="_parent">networkx</a>

* <a href="http://statsmodels.sourceforge.net/stable/" target="_parent">statsmodels</a>

* <a href="http://blog.yhathq.com/posts/ggplot-for-python.html" target="_parent">ggplot for Python</a>


Matplotlib
----------

* figure
* axis
* lines
* decorations (text, labels, LaTeX, legends)
* colormaps
* subplots
* plot styles: scatter, bar, hist
* 3D: surface, pcolor, contour
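
To make the first few topics concrete, here is a minimal sketch (not part of the original course cells) that touches figure, axes, lines, decorations, subplots and one alternative plot style. The data, labels and titles are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 100)

# one figure holding two axes stacked vertically
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, figsize=(6, 6))

# lines, with labels that the legend picks up
ax_top.plot(x, np.sin(x), 'b-', label='sin')
ax_top.plot(x, np.cos(x), 'g--', label='cos')
ax_top.set_title(r'$\sin(x)$ and $\cos(x)$')  # LaTeX-style maths in a raw string
ax_top.set_ylabel('amplitude')
ax_top.legend(loc='upper right')

# a different plot style on the second axes
ax_bottom.bar(np.arange(5), [3, 1, 4, 1, 5])
ax_bottom.set_xlabel('category')
ax_bottom.set_ylabel('count')

fig.tight_layout()
```

The `plt.*` calls used in the rest of this material act on the "current" figure and axes; the `fig`/`ax` objects above are the same things, just held explicitly.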

<a href="https://www.wakari.io/nb/url/https://raw2.github.com/jrjohansson/scientific-python-lectures/master/Lecture-4-Matplotlib.ipynb" target="_parent">matplotlib tutorial</a>

<a href="http://matplotlib.org/gallery.html" target="_parent">matplotlib gallery</a>


In [None]:
import matplotlib.pyplot as plt
import numpy as np

$\bar{x} = \cos(2\pi x) \cdot e^{\frac{-x}{T}} + \mathcal{N}(0,A_n)$

In [None]:
n      = 200
A_n    = 0.1
T      = 3

x      = np.linspace(0.0, 15.0, n)
signal = np.cos(2 * np.pi * x) * np.exp(-x/T)
noise  = A_n * np.random.randn(n)
obs    = signal + noise

In [None]:
plt.figure()
_ = plt.plot(x, obs, 'r')
plt.title(r'Exponential decay of RF circuit $\cos(2 \pi x)e^{-x/3}$', fontsize=18)
plt.xlabel('time (ms)')
plt.ylabel('dmm reading (mV)')
plt.savefig('rf-circuit-decay.png')
plt.savefig('rf-circuit-decay.pdf')

In [None]:
plt.figure()
_ = plt.hist(noise, 20)

*Warm Up Exercise*

1. Introduce yourself to the person beside you and share what kind of data visualization you are interested in.  Try to help each other out with the next few steps, and confirm that you're both on the same page.

2. Download the course material from the course URL (*Download Entire Bundle* link at the top of the page), and unzip it somewhere appropriate -- this will be your working directory.

3. Get to the terminal/command line:
   * Windows: *Start->cmd* or *Anaconda->Python Command Prompt*
   * Mac: `CMD-SPACE` (spotlight) then type *terminal* to find the application and `ENTER`
   * Linux: you probably already know how to do this

4. Confirm that you both have the Anaconda Python Distribution installed (if it doesn't say *Anaconda* then it is the wrong Python distribution)
```
    $ python -V
    Python 2.7.6 :: Anaconda 1.8.0 (x86_64)
```

5. Navigate to the directory where you uncompressed the course material.

6. Start *IPython Notebook* (or, if you are familiar with it, the *spyder* IDE)
```
    $ ipython notebook --matplotlib=inline
```

7. This should open your web browser.  If it doesn't, look for a URL that is output to the screen in the terminal window where you started the notebook from.  It should be `http://127.0.0.1:8888` (last 4 digits may differ).  Put this into your web browser.

8. Enter the code above, run it, check that there is output on the screen and also a file on disk.  Ask for help from your new friend if necessary. Failing that, ask the instructor.

9. BONUS: Experiment with updating the histogram plot to include a title and labels, then save it to disk
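
One possible take on the bonus step, as a sketch only: the title, axis labels and file name below are placeholders, and the `mV` unit is assumed from the y-axis label of the decay plot. In the notebook you would save into your working directory rather than a temporary one.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np

A_n = 0.1
noise = A_n * np.random.randn(200)   # same recipe as the noise in the cells above

fig = plt.figure()
counts, edges, patches = plt.hist(noise, 20)   # 20 bins
plt.title('Distribution of measurement noise')
plt.xlabel('noise (mV)')
plt.ylabel('count')

# placeholder output location; the file format is taken from the extension
out_path = os.path.join(tempfile.mkdtemp(), 'noise-hist.png')
plt.savefig(out_path)
```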

In [None]:
import sys
print(sys.executable)

Trojans and Spartans
--------------------

A recent careful textual, historical, and archeological study has revealed new insights into the comparative ferocity and empathy of the ancient Trojans and Spartans.

Hundreds of objects and entities have been categorized according to these two measures, and a new model was recently published in the *Journal of Computational Mythology*.

Unfortunately, everyone is having some trouble interpreting the results.  Python Data Visualization to the rescue!

In [None]:
n1   = 500
n2   = 150
# amplitude
A1 = np.array([ 3.5,  7.2])
A2 = np.array([ 1.5,  2.5])

# offset
o1 = np.array([ 1.1,  4.5])
o2 = np.array([-2.5, -0.5])

# rotation matrix
r1  = np.array([[ 0.92, -0.39],
                [ 0.39,  0.92]]) # approx. 22.5 degrees
r2  = np.array([[ 0.39,  0.92],
                [-0.92,  0.39]]) # approx. 67.5 degrees

s  = np.dot(np.random.randn(n1,2) * A1, r1) + o1
t  = np.dot(np.random.randn(n2,2) * A2, r2) + o2
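
The hard-coded matrices above look like rounded 2D rotation matrices. As a sketch (the helper name `rotation_matrix` is made up here, not part of the course code), they can be rebuilt from the angle, which also shows what `np.dot(points, r)` does when the points are stored as rows:

```python
import numpy as np

def rotation_matrix(degrees):
    """Return the 2x2 matrix [[cos, -sin], [sin, cos]] for the given angle."""
    theta = np.radians(degrees)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

r1_exact = rotation_matrix(22.5)     # close to the rounded r1
r2_exact = rotation_matrix(67.5).T   # r2 has the transposed sign pattern

# Points are rows, so the matrix is applied on the right: p' = p . R
# With this layout the unit vector along x ends up at (cos, -sin).
p = np.array([1.0, 0.0])
p_rot = np.dot(p, r1_exact)
```

Scaling each column by the amplitude before rotating is what turns the round cloud from `randn` into a tilted, elongated one; adding the offset then moves its centre.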

In [None]:
plt.figure()
plt.plot(s[:,0], s[:,1])
plt.plot(t[:,0], t[:,1])

In [None]:
plt.figure()
plt.plot(s[:,0], s[:,1], 'o')
plt.plot(t[:,0], t[:,1], 'o')

In [None]:
# change from plot to scatter
plt.figure()
plt.scatter(s[:,0], s[:,1])
plt.scatter(t[:,0], t[:,1])

In [None]:
# gah! different parameter signature!
plt.figure()
plt.scatter(t[:,0], t[:,1], c='b')
plt.scatter(s[:,0], s[:,1], c='r')
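
The difference in signature, side by side, as a small sketch on random data: `plot` packs colour and marker into a single format string and returns a list of line objects, while `scatter` takes them as separate keywords and returns one collection.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np

pts = np.random.randn(50, 2)

fig, ax = plt.subplots()

# plot(): colour and marker are packed into one format string ('r' + 'o')
(line,) = ax.plot(pts[:, 0], pts[:, 1], 'ro')

# scatter(): colour goes in the c keyword, marker in the marker keyword
points = ax.scatter(pts[:, 0], pts[:, 1], c='b', marker='o')
```

`scatter` is the one to reach for when colour or size should vary per point; `plot` with a marker format is enough when every point looks the same.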

In [None]:
plt.figure()
plt.scatter(t[:,0], t[:,1], c='b', label='Trojans')
plt.scatter(s[:,0], s[:,1], c='r', label='Spartans')
plt.legend()

In [None]:
plt.figure()
plt.scatter(t[:,0], t[:,1], c='b', label='Trojans')
plt.scatter(s[:,0], s[:,1], c='r', label='Spartans')
plt.legend(loc='upper left')
plt.xlabel('ferocity')
plt.ylabel('empathy')
plt.title('Trojan and Spartan ferocity and empathy')

In [None]:
# Hey, what about those Trojans?
plt.figure()
plt.scatter(t[:,0], t[:,1], c='b', alpha=0.3, linewidths=0, label='Trojans')
plt.scatter(s[:,0], s[:,1], c='r', alpha=0.3, linewidths=0, label='Spartans')
plt.legend(loc='lower right')
plt.xlabel('ferocity')
plt.ylabel('empathy')
plt.title('Trojan and Spartan ferocity and empathy')

In [None]:
# New data! object significance
base = 10
s1 = base + 20*np.random.randint(1,5,size=n1)
s2 = base + 20*np.random.randint(1,5,size=n2)

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(t[:,0], t[:,1], s=s2, c='b', alpha=0.3, linewidths=0, label='Trojans')
plt.scatter(s[:,0], s[:,1], s=s1, c='r', alpha=0.3, linewidths=0, label='Spartans')
plt.legend(loc='lower right')
plt.xlabel('ferocity')
plt.ylabel('empathy')
plt.title('Trojan and Spartan ferocity and empathy')
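
A note on the `s` argument, with a small sketch on made-up points: `scatter` interprets `s` as the marker area in points squared, one value per point, so the "significance" recipe above yields four distinct marker areas.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np

n = 150
base = 10
significance = np.random.randint(1, 5, size=n)   # integers 1..4 (upper bound is excluded)
sizes = base + 20 * significance                 # marker areas: 30, 50, 70 or 90 points^2

xy = np.random.randn(n, 2)                       # placeholder coordinates
fig, ax = plt.subplots(figsize=(8, 6))
pc = ax.scatter(xy[:, 0], xy[:, 1], s=sizes, c='b', alpha=0.3, linewidths=0)
```

Because `s` is an area, doubling it makes a marker about 1.4 times wider, not twice as wide.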

Prettyplotlib
-------------
* matplotlib is inspired by MATLAB
* MATLAB plotting came to fruition in the 80s
* *ergo* matplotlib figures can look a few decades out of date

Prettyplotlib to the rescue! Inspired by <a href="http://www.edwardtufte.com/tufte/" target="_parent">Edward Tufte</a>: minimalist
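
To give a feel for what "minimalist" means in practice, here is a sketch in plain matplotlib that strips the box around a plot. This is only an illustration of the idea on random data, not a description of how prettyplotlib is implemented.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np

xy = np.random.randn(100, 2)

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(xy[:, 0], xy[:, 1], c='gray', alpha=0.5, linewidths=0)

# remove the box: keep only the left and bottom axis lines
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)

# ticks only where an axis line remains
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
```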

In [None]:
import prettyplotlib as ppl

In [None]:
plt.figure(figsize=(8,6))
ppl.scatter(t[:,0], t[:,1], s=s2, c='b', alpha=0.5, linewidths=0, label='Trojans')
ppl.scatter(s[:,0], s[:,1], s=s1, c='r', alpha=0.5, linewidths=0,label='Spartans')
ppl.legend(loc='upper left')
plt.xlabel('ferocity')
plt.ylabel('empathy')
plt.title('Trojan and Spartan ferocity and empathy', fontsize=16)

In [None]:
plt.figure(figsize=(8,10))
plt.subplot(211)
ppl.hist(s[:,0], label='Spartans')
ppl.hist(t[:,0], label='Trojans')
plt.subplot(212)
ppl.hist(s[:,1], label='Spartans')
ppl.hist(t[:,1], label='Trojans')
ppl.legend()
plt.savefig('hist.svg')

In [None]:
import IPython.display
IPython.display.SVG(open('hist.svg').read())

Seaborn
-------
* R-inspired
* Works with *pandas* `DataFrame`s and *NumPy* `ndarray`s

In [1]:
from scipy import stats
import matplotlib.pyplot as plt
import pandas  as pd
import seaborn as sns

In [2]:
plt.hexbin(s[:,0],s[:,1], gridsize=12, cmap='Reds')
plt.xlabel('ferocity')
plt.ylabel('empathy')
plt.title('Spartan heat map')

In [None]:
plt.hexbin(t[:,0],t[:,1], gridsize=8, cmap='Blues')
plt.xlabel('ferocity')
plt.ylabel('empathy')
plt.title('Trojan heat map')
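
The object returned by `hexbin` carries the per-hexagon counts that drive the colours. A sketch on random data (the colorbar label is a made-up example) showing how to get at them and attach a colour scale:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np

pts = np.random.randn(500, 2)

fig, ax = plt.subplots()
hb = ax.hexbin(pts[:, 0], pts[:, 1], gridsize=12, cmap='Reds')
fig.colorbar(hb, ax=ax, label='points per hexagon')

counts = np.asarray(hb.get_array())   # one value per hexagon
```

A smaller `gridsize` means fewer, larger hexagons, which is why the smaller Trojan sample above uses 8 rather than 12.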

In [None]:
sns.regplot(s[:,0], s[:,1],color='r')
main, x_marg, y_marg = plt.gcf().axes
sns.despine(ax=main)
sns.despine(ax=x_marg, left=True)
sns.despine(ax=y_marg, bottom=True)

In [None]:
sns.regplot(t[:,0], t[:,1],color='b')
main, x_marg, y_marg = plt.gcf().axes
sns.despine(ax=main)
sns.despine(ax=x_marg, left=True)
sns.despine(ax=y_marg, bottom=True)

<a href="http://bokeh.pydata.org/" target="_parent">Bokeh</a>
-----

Goal: *a billion points, meaningfully, interactively, in the browser*
    
<a href="http://bokeh.pydata.org/index.html#technicalvision" target="_parent">Technical Vision</a>

Get Bokeh 0.3 from the command line:
```
   $ conda install bokeh
```

From inside your Python interpreter:
```
   import bokeh.sampledata
   bokeh.sampledata.download()
```

Bokeh supports:
    
* stand-alone HTML output
* server-mode (plots and data held in Redis DB)
* IPython Notebooks (stand-alone JS, and server-connected)

Mayavi - Interactive 3D Rendering
---------------------------------
* need to run these with `pythonw` due to library linking issues

In [None]:
from numpy import pi, sin, cos, mgrid

dphi, dtheta = pi/250.0, pi/250.0

[phi,theta] = mgrid[0:pi+dphi*1.5:dphi,0:2*pi +dtheta*1.5:dtheta]

m0, m1, m2, m3, m4, m5, m6, m7 = 4, 3, 2, 3, 6, 2, 6, 4

r = sin(m0*phi)**m1 + cos(m2*phi)**m3 + sin(m4*theta)**m5 + cos(m6*theta)**m7

x = r*sin(phi)*cos(theta)
y = r*cos(phi)
z = r*sin(phi)*sin(theta)

In [None]:
# won't work in Wakari -- needs native graphics
from mayavi import mlab

s = mlab.mesh(x, y, z)
mlab.show()
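
Where native graphics are not available, a rough, non-interactive stand-in is matplotlib's 3D toolkit. This is a sketch under that assumption, not a replacement for Mayavi: the grid is deliberately much coarser because `plot_surface` gets slow on large meshes, and the names `xs`, `ys`, `zs` are used to avoid clobbering the arrays above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the '3d' projection on older matplotlib)
from numpy import pi, sin, cos, mgrid

# coarser grid than the Mayavi cell
dphi, dtheta = pi / 50.0, pi / 50.0
phi, theta = mgrid[0:pi + dphi * 1.5:dphi, 0:2 * pi + dtheta * 1.5:dtheta]

m0, m1, m2, m3, m4, m5, m6, m7 = 4, 3, 2, 3, 6, 2, 6, 4
r = sin(m0 * phi)**m1 + cos(m2 * phi)**m3 + sin(m4 * theta)**m5 + cos(m6 * theta)**m7

# spherical to Cartesian, same convention as above
xs = r * sin(phi) * cos(theta)
ys = r * cos(phi)
zs = r * sin(phi) * sin(theta)

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(xs, ys, zs, rstride=1, cstride=1, cmap='viridis', linewidth=0)
```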

NetworkX
--------

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

G   = nx.house_graph()
# explicitly set positions
pos = {0:(0,0),
     1:(1,0),
     2:(0,1),
     3:(1,1),
     4:(0.5,2.0)}

nx.draw_networkx_nodes(G,pos,node_size=2000,nodelist=[4])
nx.draw_networkx_nodes(G,pos,node_size=3000,nodelist=[0,1,2,3],node_color='b')
nx.draw_networkx_edges(G,pos,alpha=0.5,width=6)

plt.axis('off')
plt.show() # display

In [None]:
G      = nx.star_graph(20)
pos    = nx.spring_layout(G)
colors = list(range(20))  # one value per edge

In [None]:
G.nodes()

In [None]:
G.edges()
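
A quick check of what `star_graph(20)` contains, as a small sketch: one hub plus 20 leaves, so 21 nodes and 20 edges. That is why 20 colour values are enough to give each edge its own shade.

```python
import networkx as nx

G = nx.star_graph(20)

n_nodes = G.number_of_nodes()   # hub plus 20 leaves
n_edges = G.number_of_edges()   # one edge per leaf
hub_degree = G.degree(0)        # node 0 is the hub
```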

In [None]:
nx.draw(G, pos, node_color='#A0CBE2', edge_color=colors, width=4, edge_cmap=plt.cm.Blues, with_labels=False)

plt.show() # display

In [None]:
G=nx.random_geometric_graph(200,0.125)
# position is stored as node attribute data for random_geometric_graph
pos=nx.get_node_attributes(G,'pos')

# find node near center (0.5,0.5)
dmin=1
ncenter=0
for n in pos:
    x,y=pos[n]
    d=(x-0.5)**2+(y-0.5)**2
    if d<dmin:
        ncenter=n
        dmin=d

# color by path length from node near center
p=nx.single_source_shortest_path_length(G,ncenter)

plt.figure(figsize=(8,8))
nx.draw_networkx_edges(G,pos,nodelist=[ncenter],alpha=0.4)
nx.draw_networkx_nodes(G,pos,nodelist=list(p.keys()),
                       node_size=80,
                       node_color=list(p.values()),
                       cmap=plt.cm.Reds_r)

plt.xlim(-0.05,1.05)
plt.ylim(-0.05,1.05)
plt.axis('off')
plt.show()
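
The colouring above rests on `single_source_shortest_path_length`, which maps every reachable node to its hop count from the source. A sketch on a tiny path graph, where the answer is easy to read off:

```python
import networkx as nx

G = nx.path_graph(5)   # 0 - 1 - 2 - 3 - 4

# hop count from node 2 to every reachable node
lengths = dict(nx.single_source_shortest_path_length(G, 2))

nodes = list(lengths.keys())
colors = [lengths[n] for n in nodes]   # one number per node, in the same order as `nodes`
```

Passing `nodes` as `nodelist` and `colors` as `node_color` keeps the two aligned, which is what the cell above does with the keys and values of `p`.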

In [None]:
import networkx as nx

# tag names specifying what game info should be
# stored in the dict on each digraph edge
game_details=["Event",
              "Date",
              "Result",
              "ECO",
              "Site"]

def chess_pgn_graph(pgn_file="WCC.pgn"):
    """Read chess games in pgn format in pgn_file.

    The file is read as plain (uncompressed) text.

    Return the MultiDiGraph of players connected by a chess game.
    Edges contain game data in a dict.

    """
    G=nx.MultiDiGraph()
    game={}
    # binary mode, so that line.decode() works on both Python 2 and 3
    with open(pgn_file, 'rb') as datafile:
        for raw_line in datafile:
            line = raw_line.decode().rstrip('\r\n')
            if line.startswith('['):
                tag,value=line[1:-1].split(' ',1)
                game[str(tag)]=value.strip('"')
            elif game:
                # empty line after tag set indicates
                # we finished reading game info
                white=game.pop('White')
                black=game.pop('Black')
                G.add_edge(white, black, **game)
                game={}
    return G


if __name__ == '__main__':
    import networkx as nx


    G=chess_pgn_graph()

    ngames=G.number_of_edges()
    nplayers=G.number_of_nodes()

    print("Loaded %d chess games between %d players\n"\
                   % (ngames,nplayers))

    # identify connected components
    # of the undirected version, largest first
    Gu=G.to_undirected()
    Gcc=[Gu.subgraph(c) for c in
         sorted(nx.connected_components(Gu), key=len, reverse=True)]
    if len(Gcc)>1:
        print("Note the disconnected component consisting of:")
        print(Gcc[1].nodes())

    # find all games with B97 opening (as described in ECO)
    openings=set([game_info['ECO']
                  for (white,black,game_info) in G.edges(data=True)])
    print("\nFrom a total of %d different openings,"%len(openings))
    print('the following games used the Sicilian opening')
    print('with the Najdorf 7...Qb6 "Poisoned Pawn" variation.\n')

    for (white,black,game_info) in G.edges(data=True):
        if game_info['ECO']=='B97':
           print(white,"vs",black)
           for k,v in game_info.items():
               print("   ",k,": ",v)
           print("\n")


    try:
        import matplotlib.pyplot as plt
    except ImportError:
        import sys
        print("Matplotlib needed for drawing. Skipping")
        sys.exit(0)

    # make new undirected graph H without multi-edges
    H=nx.Graph(G)

    # edge width is proportional number of games played
    edgewidth=[]
    for (u,v,d) in H.edges(data=True):
        edgewidth.append(len(G.get_edge_data(u,v)))

    # node size is proportional to number of games won
    wins=dict.fromkeys(G.nodes(),0.0)
    for (u,v,d) in G.edges(data=True):
        r=d['Result'].split('-')
        if r[0]=='1':
            wins[u]+=1.0
        elif r[0]=='1/2':
            wins[u]+=0.5
            wins[v]+=0.5
        else:
            wins[v]+=1.0
    try:
        pos=nx.graphviz_layout(H)
    except Exception:
        # graphviz layout unavailable: fall back to a spring layout
        pos=nx.spring_layout(H,iterations=20)

    plt.rcParams['text.usetex'] = False
    plt.figure(figsize=(8,8))
    nx.draw_networkx_edges(H,pos,alpha=0.3,width=edgewidth, edge_color='m')
    nodesize=[wins[v]*50 for v in H]
    nx.draw_networkx_nodes(H,pos,node_size=nodesize,node_color='w',alpha=0.4)
    nx.draw_networkx_edges(H,pos,alpha=0.4,node_size=0,width=1,edge_color='k')
    nx.draw_networkx_labels(H,pos,font_size=14)
    font = {'fontname'   : 'Helvetica',
            'color'      : 'k',
            'fontweight' : 'bold',
            'fontsize'   : 14}
    plt.title("World Chess Championship Games: 1886 - 1985", font)

    # change font and write text (using data coordinates)
    font = {'fontname'   : 'Helvetica',
            'color'      : 'r',
            'fontweight' : 'bold',
            'fontsize'   : 14}

    plt.text(0.5, 0.97, "edge width = # games played",
             horizontalalignment='center',
             transform=plt.gca().transAxes)
    plt.text(0.5, 0.94,  "node size = # games won",
             horizontalalignment='center',
             transform=plt.gca().transAxes)

    plt.axis('off')
    plt.savefig("chess_masters.png",dpi=75)
    print("Wrote chess_masters.png")
    plt.show() # display
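
The parsing step in `chess_pgn_graph` is compact, so here it is in isolation as a sketch. The helper name `parse_tag_pair` and the sample lines are made up for illustration; they are not taken from `WCC.pgn`.

```python
def parse_tag_pair(line):
    """Split a PGN tag line such as '[Event "..."]' into (tag, value)."""
    tag, value = line[1:-1].split(' ', 1)   # drop the brackets, split at the first space
    return tag, value.strip('"')            # drop the surrounding double quotes

# made-up sample lines
sample = ['[White "Player A"]', '[Black "Player B"]', '[Result "1/2-1/2"]']
game = dict(parse_tag_pair(line) for line in sample)

# the part of the Result tag before the dash decides how the point is shared
first = game['Result'].split('-')[0]
```

A `first` of `'1'` is a win for White and `'1/2'` is a draw; the script above treats anything else as a win for Black.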

<a href="http://networkx.github.io/documentation/latest/gallery.html" target="_parent">NetworkX Gallery</a>

Statsmodels
-----------

In [None]:
import numpy as np
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
from   statsmodels.formula.api import ols

In [None]:
prestige = sm.datasets.get_rdataset("Duncan", "car", cache=True).data

In [None]:
prestige.head(20)

In [None]:
prestige_model = ols("prestige ~ income + education", data=prestige).fit()
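
What the formula fits, spelled out as a sketch with plain NumPy: `prestige ~ income + education` is a linear model with an intercept (added automatically by the formula interface) and one coefficient per predictor. The data below are synthetic stand-ins with made-up coefficients, not the Duncan dataset, and the response is noise-free so the fit can be checked.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 45

# synthetic stand-ins for the two predictors
income = rng.uniform(10, 80, size=n)
education = rng.uniform(10, 100, size=n)

# response built from known, made-up coefficients
true_beta = np.array([-6.0, 0.6, 0.5])   # intercept, income, education
X = np.column_stack([np.ones(n), income, education])   # column of ones = intercept
y = X.dot(true_beta)

# ordinary least squares: minimise ||X b - y||^2
beta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
```

The fitted statsmodels object adds standard errors, p-values and diagnostics on top of these coefficients, which is what `summary()` reports.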

In [None]:
fig, ax = plt.subplots(figsize=(12,14))
fig = sm.graphics.plot_partregress("prestige", "income", ["education"], data=prestige, ax=ax)

In [None]:
print(prestige_model.summary())

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.plot_fit(prestige_model, "education", ax=ax)

ggplot
------

R-style composable plotting

In [None]:
import ggplot as gp
import pandas as pd

In [None]:
p = gp.ggplot(gp.mtcars, gp.aes('cyl'))
p + gp.geom_bar()

gp.plt.show(1)

In [None]:
gp.ggplot(gp.diamonds, gp.aes(x='price', color='cut')) + gp.geom_density()

In [None]:
df = pd.DataFrame({
    "x": range(100),
    "y": np.random.choice([-1, 1], 100)
})

df.y = df.y.cumsum()

p = gp.ggplot(gp.aes(x='x', y='y'), data=df)
p + gp.geom_step()
plt.show(True)

In [None]:
gp.ggplot(gp.aes(x='date', y='beef'), data=gp.meat) + gp.geom_point(alpha=0.3) + gp.stat_smooth(colour="black", se=True)

*For Next Time*

* Basemap
* Choropleths