In [3]:
%run ../talktools.py

# Data Exploration and Viz

You should be familiar with the basics of `numpy` arrays: creation and manipulation. 

Good starting points to brush up on this are:
 
 <ul>
    <li>The online <a href="http://www.scipy-lectures.org/">Scipy Lecture notes</a>, especially chapter 1.3.
    <li>Stéfan van der Walt, S. Chris Colbert, Gaël Varoquaux. <a href="https://hal.inria.fr/inria-00564007/document">The NumPy array: a structure for efficient numerical computation</a>. Computing in Science and Engineering, Institute of Electrical and Electronics Engineers, 2011, 13 (2), pp. 22&ndash;30.
</ul>


You will also find yourself using `pandas` a lot in this course (and afterwards!). Although you are not required to use `pandas` for array manipulation, it's worth understanding some of the basics. A quick primer is [here](https://github.com/profjsb/python-seminar/DataFiles_and_Notebooks/12_pandas/12_pandas.ipynb). 

**Who wants to do a `pandas` exercise?**

<img src="https://www.evernote.com/l/AUVM_nsHxwxMU4IwxVdaZlTaOxRad1hL-lAB/image.png">
Source: B. Grainger (PyData 2016) https://www.youtube.com/watch?v=aRxahWy-ul8
[declarative: what should be done, imperative: how it should be done]

Hopefully after this week you will:

- Know how to make and polish figures to the point where they can go to a journal.
- Understand matplotlib's internal model enough to:
  - know where to look for knobs to fine-tune
  - better understand the help and examples online
  - use it as a development platform for complex visualization
- Be able to build basic interactive viz (in the browser)


# Plotting and visualization: overview & motivation

Major uses of plotting and viz in (data) science workflows:

  1. (Initial) **Understanding** - What's there? What's missing? What patterns are worth exploring? What more data do I need?
  2. **Exploration** - deeper dive into the "meaning". Often cyclical/iterative.
  3. **Presentation** - results, telling stories with data (and potentially allowing others to explore)
      - different requirements for data science vs science

## Always visualize your data!


In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# If you're reading in an image (needs pillow):
# !conda install pillow -y
# import matplotlib.image as mpimg
# data = mpimg.imread("imgs/my_data_file.jpg")

# or read in a saved numpy array (numpy is already imported above)
data = np.load("imgs/my_data_file.npy")

In [5]:
data.shape

(128, 128, 3)

In [6]:
import pandas as pd
df = pd.DataFrame(data[:,:,0])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128 entries, 0 to 127
Columns: 128 entries, 0 to 127
dtypes: uint8(128)
memory usage: 16.1 KB


In [7]:
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,118,119,120,121,122,123,124,125,126,127
count,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0,...,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0,128.0
mean,103.71875,109.523438,111.523438,112.257812,113.984375,115.828125,114.023438,112.257812,122.53125,126.390625,...,135.929688,141.539062,132.476562,129.132812,123.132812,130.5,123.90625,124.515625,127.96875,122.617188
std,56.748227,57.071024,56.958469,56.43842,57.588136,63.771135,63.112257,60.526333,61.752738,60.231343,...,59.777433,59.760269,59.578964,52.287722,59.832097,56.118828,55.65663,57.670526,60.009702,61.449438
min,7.0,9.0,7.0,9.0,10.0,8.0,10.0,12.0,8.0,13.0,...,45.0,44.0,16.0,39.0,13.0,27.0,14.0,13.0,32.0,12.0
25%,55.75,62.5,60.5,67.75,64.0,56.0,49.0,62.75,70.25,72.75,...,80.0,81.75,75.0,76.0,72.0,79.75,75.75,72.0,76.5,73.5
50%,105.5,115.0,118.5,120.0,127.5,131.0,133.5,127.0,138.0,140.0,...,146.5,157.0,145.0,139.0,107.5,133.0,109.0,125.0,116.0,97.0
75%,152.25,160.0,156.25,158.25,157.0,165.0,161.25,159.25,170.0,168.0,...,185.25,191.5,181.0,168.0,169.0,175.25,170.25,166.25,177.25,174.25
max,210.0,209.0,214.0,217.0,241.0,218.0,217.0,237.0,236.0,244.0,...,246.0,244.0,243.0,236.0,246.0,245.0,245.0,251.0,249.0,237.0


In [8]:
df.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,118,119,120,121,122,123,124,125,126,127
123,17,19,27,31,115,104,108,69,35,56,...,182,202,231,148,160,168,152,153,86,74
124,17,26,128,195,90,93,51,37,38,208,...,154,239,238,205,160,164,177,146,128,85
125,36,188,190,197,26,33,38,21,132,61,...,98,85,109,214,165,177,165,156,149,181
126,186,190,63,45,25,31,24,34,44,17,...,85,93,95,230,71,157,160,146,139,143
127,145,50,58,27,22,24,21,26,8,17,...,93,91,92,204,90,163,146,144,146,137


In [None]:
plt.imshow(data)
plt.show()

Looking at your data is not just for images and arrays. It is critical for point data too.

### Anscombe's Quartet

<p><a href="https://commons.wikimedia.org/wiki/File:Anscombe%27s_quartet_3.svg#/media/File:Anscombe%27s_quartet_3.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/1200px-Anscombe%27s_quartet_3.svg.png" alt="Anscombe's quartet 3.svg"></a><br>By <a href="//commons.wikimedia.org/wiki/File:Anscombe.svg" title="File:Anscombe.svg">Anscombe.svg</a>: <a href="//commons.wikimedia.org/wiki/User:Schutz" title="User:Schutz">Schutz</a>
derivative work (label using subscripts): <a href="//commons.wikimedia.org/wiki/User:Avenue" title="User:Avenue">Avenue</a> (<a href="//commons.wikimedia.org/wiki/User_talk:Avenue" title="User talk:Avenue"><span class="signature-talk">talk</span></a>) - <a href="//commons.wikimedia.org/wiki/File:Anscombe.svg" title="File:Anscombe.svg">Anscombe.svg</a>, <a href="http://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=9838454">https://commons.wikimedia.org/w/index.php?curid=9838454</a></p>

<img src="https://www.evernote.com/l/AUX2p-SfsmVAyZk9cnT7OqaI55Ru3JOmlMkB/image.png">

https://en.wikipedia.org/wiki/Anscombe%27s_quartet
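
The point of the quartet can also be checked numerically. The sketch below types in the four datasets as they are commonly tabulated (the variable names and the `summarize` helper are illustrative, not part of the original notebook) and computes the usual summary statistics with `numpy`. The four sets should agree to about two decimal places, even though the plots above look nothing alike.

```python
import numpy as np

# Anscombe's four datasets as commonly tabulated; sets I-III share x values.
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)

quartet = {
    "I": (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                          7.24, 4.26, 10.84, 4.82, 5.68])),
    "II": (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                           6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                            6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                         5.25, 12.50, 5.56, 7.91, 6.89])),
}


def summarize(x, y):
    """Return the summary statistics that the four datasets nearly share."""
    # For deg=1, polyfit returns the coefficients as [slope, intercept].
    slope, intercept = np.polyfit(x, y, deg=1)
    return {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),   # sample variance
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],
        "slope": slope,
        "intercept": intercept,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(float(val), 2) for key, val in stats.items()})
```

Scatter-plotting each `(x, y)` pair (for example with `plt.scatter`) is what reveals the differences that these numbers hide.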

In [10]:
from IPython.display import HTML
HTML("""
<video width="620" controls>
  <source src="https://pbs.twimg.com/tweet_video/CrIDuOhWYAAVzcM.mp4" type="video/mp4">
</video>
""")

From https://twitter.com/JustinMatejka/status/770682771656368128
Called ... Anscombosaurus

## How you decide to show data is part of the story itself

Some basic thoughts:

 1. Use no more lines, colors, or points than you need to tell the story.
 2. But do not remove data merely because it doesn't support your story.
 3. Figures for talks and publications should be (almost) self-describing. An expert in your field should get the point.
 4. Figures are the centerpiece of your paper/lab: most people will remember a visual better than they'll remember your abstract.
 5. I usually build the figures first, then write the meaty sections, then the conclusions, then the abstract, then the title.
 
"Ten Simple Rules for Better Figures":
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833

## Pie Charts: no.

Only two exceptions where this is ok...

<img src="http://i1.wp.com/flowingdata.com/wp-content/uploads/2008/09/Pie-I-have-Eaten.jpg">

<img src="https://i0.wp.com/flowingdata.com/wp-content/uploads/2014/12/Pie-Pyramid-e1417455667996.png">

## Bring on the box plots

<img src="http://www.nature.com/nmeth/journal/v11/n2/images/nmeth.2813-F1.jpg">

Sample BoxPlotR plots. Top: Simple Tukey-style box plot. Bottom: Tukey-style box plot with notches, means (crosses), 83% confidence intervals (gray bars; representative of p=0.05 significance) and n values.

http://blogs.nature.com/methagora/2014/01/bring-on-the-box-plots-boxplotr.html



In [11]:
from IPython.display import HTML
HTML("""
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Animation showing how a boxplot may hide very different data patterns.<br>
<br>By <a href="https://twitter.com/JustinMatejka?ref_src=twsrc%5Etfw">@JustinMatejka</a> <a href="https://t.co/Zmk10ZTflU">pic.twitter.com/Zmk10ZTflU</a></p>&mdash; Lionel Page (@page_eco) <a href="https://twitter.com/page_eco/status/1055785592829698048?ref_src=twsrc%5Etfw">October 26, 2018</a></blockquote> 
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")