# FIVE WORLD-CLASS VISUALIZATIONS AND WHAT WE CAN LEARN FROM THEM

**Jonathan Balaban**

## LINK: github.com/ultimatist/ODSC19

>What separates the world's best data visualizations from the rest? Join [Metis](https://www.thisismetis.com) Senior Data Scientist [Jonathan Balaban](https://www.linkedin.com/in/jbalaban/) as he highlights amazing visuals and dissects best practices in visualization and storytelling. We'll cover open-source platforms and code that enable similar, compelling visuals we can use to share our work.

## Outline

1. Introduce impactful and dynamic visualizations:
    - Discuss use, design choices, and pros/cons of our visualizations
    - Review which domains or data are best suited for our visualizations
    - Familiarize ourselves with the dataset and build a quick visualization as a baseline


2. Introduce the packages we'll build with, and walk through the code for each visualization:
    - Review default arguments and method choices
    - Consider alternatives and when they would be appropriate


3. Discuss sharing and deployment of results:
    - Tips and tools for embedding into front ends
    - Review best practices from Tufte and industry experts


4. Conclusion
    - Insight into what makes for a compelling visualization
    - Experience with multiple open-source, dynamic visualization packages
    - Insight into deployment strategies

## Historical Examples

We'll review a number of the examples on the [Tableau Best Viz](https://www.tableau.com/learn/articles/best-beautiful-data-visualization-examples) post. As we do, think about the following:

1. What is this visual telling me initially?
1. What stands out on further analysis?
1. Why did the designer choose this template?
1. Do the colors mean anything?
1. Do the sizes mean anything?
1. Do the shapes mean anything?
1. Does the layout convey anything?
1. Would there have been a better way to visualize this data?
1. Is it too busy or too simple for me?
1. Is there one main point, or multiple?

In [None]:
#NOTES: http://cdnlarge.tableausoftware.com/sites/default/files/pages/the_5_most_influential_data_visualizations_of_all_time_0.pdf

>“The greatest value of a picture is
when it forces us to notice what we
never expected to see.”
    - John Tukey, 1977

## Current Examples

- [Fall Foliage](https://www.washingtonpost.com/graphics/2019/national/fall-foliage-atlas/) by Washington Post
- [Who is a Millennial](https://junkcharts.typepad.com/junk_charts/2019/10/who-is-a-millennial-an-example-of-handling-uncertainty.html) by junkcharts
- [Strange Revival of Vinyl Records](https://www.economist.com/graphic-detail/2019/10/18/the-strange-revival-of-vinyl-records) by The Economist
- [‘Daily Rituals’](https://podio.com/site/creative-routines) by Mason Currey

In [None]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns

# import plotly
import plotly.graph_objects as go

# Workshop Examples

## NYT - [A 3-D View of a Chart](https://www.nytimes.com/interactive/2015/03/19/upshot/3d-yield-curve-economic-growth.html) That Predicts The Economic Future: The Yield Curve

In [None]:
# load and view data
df = pd.read_csv('https://raw.githubusercontent.com/epogrebnyak/data-ust/master/ust.csv')

df.head()

In [None]:
# check data format
df.info()

In [None]:
# set datetime and get info
df.date = pd.to_datetime(df.date)
df.set_index(df.date, inplace=True)

In [None]:
# review index
df.index

# this needs to be sorted

In [None]:
# sort dates
df.sort_index(inplace=True, ascending=False)

In [None]:
# drop redundant columns and check info
df.drop(['date', 'BC_30YEARDISPLAY'], axis=1, inplace=True)
df.info()

In [None]:
# describe
df.describe()

In [None]:
# send to array and transpose so each row is a bond class
raw = df.to_numpy().T

In [None]:
# check shape
raw.shape
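
The transpose matters because a surface plot reads its `z` grid row by row: as far as I know, `go.Surface` pairs each row of `z` with a `y` value and each column with an `x` value. Here is a minimal sketch of that shape logic on toy data; the frame `toy`, its column names, and its values are hypothetical stand-ins for the Treasury data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 4 dates (rows) x 3 maturities (columns)
dates = pd.date_range("2020-01-01", periods=4, freq="D")
toy = pd.DataFrame(
    np.arange(12, dtype=float).reshape(4, 3),
    index=dates,
    columns=["1Y", "5Y", "10Y"],  # assumed labels, not the real column names
)

# Transpose so each row is one maturity and each column is one date
z = toy.to_numpy().T

# z[i, j] is the value for maturity i on date j
print(z.shape)  # (maturities, dates)
```

With this layout, `y=toy.columns` and `x=toy.index` line up with the rows and columns of `z`, which is the same arrangement the plotting cell relies on.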

In [None]:
# plot
fig = go.Figure(data=[ # figure makes the canvas
    go.Surface(y=df.columns, x=df.index, z=raw) # this draws the surface   
])

# set up the canvas
fig.update_layout(title='Yield Curve', autosize=False,
                   width=1000, height=800,
                   scene_camera_eye=dict(x=4, y=-2, z=.3))

fig.update_layout(scene_aspectmode='manual',
                  scene_aspectratio=dict(x=4, y=2, z=1)) # stretch the years out

fig.update_layout(scene = dict(
                    xaxis_title='Year',
                    yaxis_title='Asset',
                    zaxis_title='Percent Yield'))

fig.show()

## Climate Lab Book - [Warming Stripes](http://www.climate-lab-book.ac.uk/2018/warming-stripes/)

In [None]:
# load data and subset to GCAG source
temp = pd.read_csv('https://datahub.io/core/global-temp/r/monthly.csv')
temp = temp[temp.Source == 'GCAG']

# convert datetime and view info
temp.Date = pd.to_datetime(temp.Date)
temp.info()

In [None]:
# check min and max temp ranges
min_temp = temp.Mean.min()
max_temp = temp.Mean.max()
print(min_temp, max_temp)

In [None]:
# create month and year columns for later pivot
temp['Month'] = temp.Date.dt.month_name()
temp['Year'] = temp.Date.dt.year
#df.head()
#import calendar
#df['Month'] = df['Month'].apply(lambda x: calendar.month_abbr[x])

In [None]:
# pivot for month rows and year columns
temp = temp.pivot(index="Month", columns="Year", values="Mean") # keyword arguments are required in pandas 2.0+

In [None]:
temp.head()

In [None]:
# reindex to sort months correctly
temp = temp.reindex(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'], axis=0)
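
Why the reindex is needed: `pivot` returns its row labels sorted, so month names come back alphabetically (April first). A minimal sketch on toy data, assuming English month names and using `calendar.month_name` to build the order instead of typing the list by hand; the frame `toy` and its readings are hypothetical.

```python
import calendar
import pandas as pd

# Hypothetical toy readings: two years, three months each
toy = pd.DataFrame({
    "Month": ["January", "February", "April"] * 2,
    "Year": [2000] * 3 + [2001] * 3,
    "Mean": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

wide = toy.pivot(index="Month", columns="Year", values="Mean")
alphabetical = list(wide.index)  # sorted by name, not by calendar position

# Keep only the months present, in calendar order
month_order = [m for m in calendar.month_name[1:] if m in wide.index]
wide = wide.reindex(month_order)
print(wide)
```

Note that `calendar.month_name` follows the active locale, so the hard-coded list in the cell above is the safer choice if the locale might not be English.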


In [None]:
# graph with heatmap
sns.set(rc={'figure.figsize':(20,4)}) # widen graph

sns.heatmap(temp, center=0, cmap='seismic'); # center on zero and use blue/red cmap

>So interesting to see historical events in data, like the cold [winter of 1917](https://www.nytimes.com/1977/02/20/archives/new-jersey-weekly-the-winter-of-191718-was-a-cold-one-perhaps-the.html) and [heating during WWII](https://www.globalresearch.ca/dr-gottschalks-world-war-ii-heat-bump-did-the-war-contribute-to-air-pollution-and-global-warming/5690184)!

## RIAA - [Strange Revival of Vinyl Records](https://www.economist.com/graphic-detail/2019/10/18/the-strange-revival-of-vinyl-records)

In [None]:
vinyl = [19,18,16,15,16,13,11,10,8,5,3,2,1,1,.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.5,1,1,1,2,3,3,3]

cd = [0,0,0,0,0,1,3,5,8,11,13,16,19,24,26,27,30,33,35,37,40,41,42,42,42,40,38,35,30,24,20,14,11,10,8,8,5,3]

stream = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.1,.1,.5,1,1,1,1,2,3,5,11,15,18,21,25]

In [None]:
# import into a dataframe, optional
sales = pd.DataFrame(data={'Vinyl': vinyl, 'CD': cd, 'Streaming': stream},
                     index=range(1980, 2018))

In [None]:
# check tail
#sales.tail()

In [None]:
# visualize with stacked bar
fig2 = go.Figure(data=[
    go.Bar(name='Streaming', x=sales.index, y=sales.Streaming), # create three bars, for each feature
    go.Bar(name='CDs', x=sales.index, y=sales.CD),
    go.Bar(name='Vinyl', x=sales.index, y=sales.Vinyl)
])

# Change the bar mode
fig2.update_layout(barmode='stack', title='US Revenues by format, 2018 $bn', xaxis_tickfont_size=14,
            yaxis=dict( # define y axis
                title=dict(text='USD (billions)', font=dict(size=16)), # nested form; the titlefont shorthand is deprecated
                tickfont=dict(size=14),
                ),
            legend=dict(
                x=0, # set left side
                y=1.0, # set up top
                bgcolor='rgba(255, 255, 255, 0.5)', # use to make translucent         
                ),
            bargap=0.1 # gap between bars of adjacent location coordinates.
)
                 
fig2.show()

## Interactive [Gapminder](https://github.com/bokeh/bokeh/tree/master/examples/app/gapminder/templates) from Hans Rosling's [famous TED Talk](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen)

In [None]:
from bokeh.io import curdoc, output_notebook
from bokeh.layouts import layout
from bokeh.models import (ColumnDataSource, HoverTool, SingleIntervalTicker,
                          Slider, Button, Label, CategoricalColorMapper)
from bokeh.palettes import Spectral6
from bokeh.plotting import figure, show

# helper function for accessing data
def process_data():
    from bokeh.sampledata.gapminder import fertility, life_expectancy, population, regions

    # Make the column names ints not strings for handling
    columns = list(fertility.columns)
    years = list(range(int(columns[0]), int(columns[-1]) + 1)) # + 1 so the final year is included
    rename_dict = dict(zip(columns, years))

    # fix names and types
    fertility = fertility.rename(columns=rename_dict)
    life_expectancy = life_expectancy.rename(columns=rename_dict)
    population = population.rename(columns=rename_dict)
    regions = regions.rename(columns=rename_dict)

    regions_list = list(regions.Group.unique())

    # Turn population into bubble sizes
    scale_factor = 200
    population_size = np.sqrt(population / np.pi) / scale_factor
    # set min size and custom formula above
    min_size = 3
    population_size = population_size.where(population_size >= min_size).fillna(min_size)

    return fertility, life_expectancy, population_size, regions, years, regions_list
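
The bubble-size formula in the helper is worth unpacking: `sqrt(population / pi)` is the radius of a circle whose area equals the population, so bubble area (not diameter) grows in proportion to population. A minimal sketch with hypothetical population figures, using `np.maximum` as a stand-in for the `where(...).fillna(...)` floor:

```python
import numpy as np

# Hypothetical populations: 1 million, 100 million, 1 billion
population = np.array([1e6, 1e8, 1e9])

scale_factor = 200  # same constants as the helper above
min_size = 3

# Radius of a circle with area == population, scaled down to screen units
raw_size = np.sqrt(population / np.pi) / scale_factor

# Floor tiny bubbles so small countries stay visible
sizes = np.maximum(raw_size, min_size)
print(sizes)
```

One difference to keep in mind: `np.maximum` passes `NaN` through, while the helper's `where(...).fillna(min_size)` also replaces missing values with the minimum size.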

In [None]:
# pull data and df with helper function
fertility_df, life_expectancy_df, population_df_size, regions_df, years, regions_list = process_data()

df = pd.concat({'fertility': fertility_df, # join three distinct datasets based on a common country ID
                'life': life_expectancy_df,
                'population': population_df_size},
               axis=1)

data = {}

In [None]:
regions_df.head()

In [None]:
# group by year in range
regions_df.rename({'Group':'region'}, axis='columns', inplace=True) # simple rename
for year in years:
    df_year = df.iloc[:,df.columns.get_level_values(1)==year] # get all country data for that year
    df_year.columns = df_year.columns.droplevel(1)
    data[year] = df_year.join(regions_df.region).reset_index().to_dict('series') 
    # data is dict with key:value year:dictionary of key:values for features

source = ColumnDataSource(data=data[years[0]]) # bokeh function to format arrays

plot = figure(x_range=(1, 9), y_range=(20, 100), title='Fertility vs. Life Expectancy', height=600, width=1100) # plot_height/plot_width were removed in Bokeh 3
plot.xaxis.ticker = SingleIntervalTicker(interval=1)
plot.xaxis.axis_label = "Children per woman (total fertility)"
plot.yaxis.ticker = SingleIntervalTicker(interval=20)
plot.yaxis.axis_label = "Life expectancy at birth (years)"

label = Label(x=1.1, y=18, text=str(years[0]), text_font_size='70pt', text_color='#eeeeee') # big year label
plot.add_layout(label)

color_mapper = CategoricalColorMapper(palette=Spectral6, factors=regions_list) # allows auto-spectrum of bubbles

# plot all circles
plot.circle(
    x='fertility', # center of circle
    y='life', # center of circle
    size='population', # takes custom value we created
    source=source, # using bokeh formatter
    fill_color={'field': 'region', 'transform': color_mapper},  # modify aspects of circles and borders
    fill_alpha=0.9,
    line_color='#7c7e71',
    line_width=0.5,
    line_alpha=0.6,
)

plot.add_tools(HoverTool(tooltips="@Country", show_arrow=False, point_policy='follow_mouse')) # pulls country name
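
The per-year loop in this cell slices a frame whose columns have two levels (feature, year). As an alternative sketch, not the approach used above, `DataFrame.xs` can take the same cross-section in one call. The toy frames, country labels, and values below are hypothetical.

```python
import pandas as pd

# Hypothetical toy inputs: two countries, two years
fertility = pd.DataFrame({1964: [6.0, 2.5], 1965: [5.9, 2.4]}, index=["A", "B"])
life = pd.DataFrame({1964: [50.0, 70.0], 1965: [51.0, 70.5]}, index=["A", "B"])

# Concatenating a dict builds two-level columns: (feature, year)
combined = pd.concat({"fertility": fertility, "life": life}, axis=1)

# Cross-section on the year level; the year level is dropped from the result
by_year = {year: combined.xs(year, axis=1, level=1) for year in [1964, 1965]}
print(by_year[1965])
```

Each value in `by_year` is a plain frame with one column per feature, which is the shape a `ColumnDataSource` can consume after joining in the region column.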

In [None]:
#output_notebook()

show(plot)

## [Choropleth](https://www.ers.usda.gov/data-products/state-export-data/annual-state-agricultural-exports/) (Map) Plots

Also, [World Happiness Dashboard](https://alpha.iodide.io/notebooks/193/?viewMode=report)

In [None]:
# load
ag = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_us_ag_exports.csv')

# convert to strings
for col in ag.columns:
    ag[col] = ag[col].astype(str)

In [None]:
#ag.info()

In [None]:
# prepare hover text
ag['text'] = ag['state'] + '<br>' + \
    'Beef ' + ag['beef'] + ' Dairy ' + ag['dairy'] + '<br>' + \
    'Fruits ' + ag['total fruits'] + ' Veggies ' + ag['total veggies'] + '<br>' + \
    'Wheat ' + ag['wheat'] + ' Corn ' + ag['corn']
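
The string conversion earlier exists because `+` cannot join text and numeric columns directly. A minimal sketch of one alternative, assuming a toy frame with hypothetical values: convert a copy for the hover text and leave the original columns numeric, so nothing has to be cast back to float for the color scale.

```python
import pandas as pd

# Hypothetical toy frame with one text and two numeric columns
toy = pd.DataFrame({
    "state": ["Iowa", "Texas"],
    "beef": [1.5, 2.0],
    "corn": [3.25, 0.5],
})

# Convert a copy to strings; the original numeric columns are untouched
as_text = toy.astype(str)

# '<br>' is rendered as a line break in Plotly hover labels
toy["text"] = (as_text["state"] + "<br>"
               + "Beef " + as_text["beef"] + " Corn " + as_text["corn"])
print(toy["text"].iloc[0])
```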

In [None]:
#ag.head()

In [None]:
fig3 = go.Figure(data=go.Choropleth(
    locations=ag['code'], # two-letter state abbreviation
    z=ag['total exports'].astype(float), # this colors the map and needs to be float
    locationmode='USA-states', # connects to the kind of data you have
    colorscale='Greens', # color scheme, play with this
    autocolorscale=False,
    text=ag['text'], # hover text
    marker_line_color='black', # line markers between states
    colorbar_title="Millions USD" # legend title
))

fig3.update_layout(
    title_text='2011 US Agriculture Exports by State<br>(Hover for breakdown)', width=1000, height=1000,
    geo = dict(
        scope='world', # range to plot; try 'usa' to zoom in on the states
        projection=go.layout.geo.Projection(type = 'azimuthal equal area'), # try 'albers usa' with the 'usa' scope
        showlakes=True,
        showrivers=True,
        showocean=True,
        lakecolor='rgb(25, 25, 255)'),
)

fig3.show()

# Closing

You've seen examples across domains of visualization:

- Which stood out to you?
- Which are useful for your business?
- How will you modify or improve your visuals?
- How can these practices increase attention and engagement?

## Tufte Principles

Data:
- Above all else show data
- Maximize the data-ink ratio
- Erase non-data-ink
- Erase redundant data-ink
- Revise and edit

Scales:
- Choose the range of the tick marks to include or nearly include the range of data
- Subject to the constraints that scales have, choose the scales so that the data fill up as much of the region as possible
- It is sometimes helpful to use the pair of scale lines for a variable to show two different scales
- Choose appropriate scales when graphs are compared
- Use a logarithmic scale when it is important to understand percent change or multiplicative factors
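
The last point about logarithmic scales can be checked with a little arithmetic. In this sketch the series is hypothetical and doubles at every step: on a linear axis the gaps keep growing, while on a log axis equal percent changes are equally spaced.

```python
import math

# Hypothetical series that doubles at every step (+100% each time)
values = [100, 200, 400, 800]

# Gaps between neighbors on a linear axis
linear_gaps = [b - a for a, b in zip(values, values[1:])]

# Gaps between neighbors on a log10 axis
log_gaps = [math.log10(b) - math.log10(a) for a, b in zip(values, values[1:])]

print(linear_gaps)  # gaps grow with the values
print(log_gaps)     # every gap equals log10(2)
```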

Strategy:
- Put major conclusions into graphical form
- Make legends comprehensive and informative
- Error bars should be clearly explained
- Proofread and get feedback on graphs

## Graphing Package Comparison

1. Matplotlib
    - the original, and the foundation for many of the packages below
    - resembles MATLAB
    - highly customizable
    - lots of code
    - with customization comes huge, tedious documentation
    - takes a while to make viz look good, but has skins and styles that can be applied
    

2. Seaborn
    - very fast
    - built on matplotlib
    - looks good out of the box (but need to know matplotlib to modify)
    - lots of aggregate graphs with only one line of code
    
    
3. Pandas
    - super easy and fast, if your data is in Pandas format
    - limited options
    - looks like Matplotlib, stuck in the 90's


4. ggplot
    - comes from R
    - tight integration with Pandas, for better or worse
    - fast but low customization
    
    
5. Bokeh
    - interactive
    - easy output to JSON, HTML, or web apps
    - handles streaming data
    - verbose, lots of code
    
    
6. Plotly
    - interactive
    - web hosted graphics
    - has rare graphs like contour plots and dendrograms
    - backed by a for-profit company, so documentation is robust, but some top functionality isn't available for free


7. geoplotlib
    - map focused
    - offers choropleths, heatmaps, and dot density maps
    
    
8. D3.js
    - not Python, but plays well via a number of tools like Flask
    - extremely customizable
    - native to web: fast, responsive, and easy to share
    - have to know JavaScript
    - thousands of examples, documentation can be tough to navigate

## Additional Resources

- [Microsoft SandDance](https://github.com/Microsoft/SandDance)
- [Tableau Blog](https://www.tableau.com/learn/articles/best-beautiful-data-visualization-examples)
- [The 25 Best Data Visualizations of 2018](https://visme.co/blog/best-data-visualizations/) by Visme
- [More interactive Bokeh](https://github.com/bokeh/bokeh/tree/master/examples/app)