# Python data visualization

## Agenda

* Plotting from pandas
* Matplotlib basics
* Python visualization alternatives (briefly)
    - Matplotlib
    - Plotly
    - Bokeh
    - Altair
    - Seaborn
    - ...
* Visualization distribution options

# Setup

**To run notebooks in local jupyter lab**
* Clone repo: https://bitbucket.itgit.oneadr.net/scm/~m015696/pandas_visualization.git
* Run `pip install -r requirements.txt` in a command prompt to install all requirements
* Start Jupyter Lab with: `jupyter lab`

**To follow along in notebook in Binder (temporary web version)**
* Go to link: https://mybinder.org/v2/gh/purrepirre/visualization_demo/HEAD

**Jupyter lab basics**
* Shift + ENTER to run cell and move to next  
* Shift + TAB to get contextual documentation 

In [None]:
from vega_datasets import data
import pandas as pd

# Plotting from pandas

By default, plotting directly from pandas uses [matplotlib](https://matplotlib.org/) as the plotting backend
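
A quick way to check this is sketched below. It uses a made-up two-column frame (hypothetical data, not the stocks dataset used further down), reads the `plotting.backend` option and looks at the type of the object that `.plot()` returns.

```python
import matplotlib.axes
import pandas as pd

# Hypothetical toy data, only used to trigger a plot call
df_demo = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})

# The backend pandas dispatches .plot() calls to
backend = pd.get_option("plotting.backend")

# With the matplotlib backend, .plot() hands back a matplotlib Axes object
ax_demo = df_demo.plot(x="x", y="y")
is_mpl_axes = isinstance(ax_demo, matplotlib.axes.Axes)

print(backend, is_mpl_axes)
```

Because the return value is a regular Axes object, everything in the Matplotlib part of this notebook can also be applied to plots created from pandas.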


### Get sample data

In [None]:
df = data.stocks()
df.head()

In [None]:
df.symbol.value_counts()

## Plot individual stock

In [None]:
df.loc[df.symbol=='AAPL'].plot(x='date',y='price')

### Adjust plot size

In [None]:
df.loc[df.symbol=='AAPL'].plot(x='date',y='price', figsize=(16,8), grid=True)

### Re-structure data to plot all categories (stocks)

In [None]:
df_pivot =  df.pivot(index='date', columns='symbol', values='price')

In [None]:
df_pivot

In [None]:
df_pivot.plot(figsize=(16,8), grid=True)

## Series plotting

### Get some data about cars!

In [None]:
df_cars = data.cars()
df_cars.head()

In [None]:
df_cars.Horsepower.plot.hist(figsize=(12,6))

In [None]:
type(_)

In [None]:
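# numeric_only=True skips non-numeric columns such as Name, which recent pandas versions refuse to average
df_cars_by_origin = df_cars.groupby('Origin').mean(numeric_only=True)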
df_cars_by_origin

In [None]:

_ = df_cars_by_origin.loc[:,'Horsepower'].plot.barh(figsize=(20,8), title='Horsepower (mean)')

## pandas plotting tools

Specialized plots for special occasions

In [None]:
from pandas.plotting import scatter_matrix

### Scatter matrix

Use case: give a quick overview of how each column in a dataframe relates to the other columns

In [None]:
_ = scatter_matrix(df_cars, figsize=(24,18))

### What happened to "Year" info?

In [None]:
df_cars.head()
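
The likely explanation is that `scatter_matrix` only draws numeric columns, and `Year` is stored as a datetime. The sketch below uses a hypothetical three-row frame (made-up values, reusing two column names from the cars data) to show which columns count as numeric before and after extracting the year.

```python
import pandas as pd

# Hypothetical miniature of the cars data: one numeric and one datetime column
df_demo = pd.DataFrame({
    "Horsepower": [130.0, 165.0, 150.0],
    "Year": pd.to_datetime(["1970-01-01", "1971-01-01", "1972-01-01"]),
})

# Columns pandas treats as numeric; the datetime column is not among them
numeric_before = list(df_demo.select_dtypes("number").columns)

# Extracting the year as an integer produces a numeric column
df_demo["make_year"] = df_demo["Year"].dt.year
numeric_after = list(df_demo.select_dtypes("number").columns)

print(numeric_before)
print(numeric_after)
```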

#### Make new numeric "Year" column

In [None]:
df_cars['make_year'] = df_cars.Year.dt.year

In [None]:
_ = scatter_matrix(df_cars, figsize=(24,18))

## More info on pandas plotting 

The [Visualization Chapter](https://pandas.pydata.org/docs/user_guide/visualization.html#) of the pandas User Guide gives a good overview  
The [Cookbook](https://pandas.pydata.org/docs/user_guide/cookbook.html#plotting) in the pandas docs has a section on plotting techniques  
Overview of [Pandas visualization ecosystem](https://pandas.pydata.org/docs/ecosystem.html#visualization)  

# Matplotlib Basics

* Extensive documentation available at: https://matplotlib.org/

**Pros**
* Easy to get started
* Powerful
* Almost every aspect of plot is configurable
* Plenty of examples and good documentation available
* Used by many people for a long time, so most problems and questions already have answers on Stack Overflow etc.

**Cons**
* Two APIs, "Pyplot" and object-oriented: you need to choose one and/or be aware of the difference
* The statefulness of the Pyplot API can be a bit confusing (it was originally built to mimic MATLAB)
* Main target output is static images (interactivity is possible)

_Matplotlib makes easy things easy and hard things possible._

In [None]:
import matplotlib.pyplot as plt
from vega_datasets import data
import pandas as pd

In [None]:
x = [1,3,5,10]
y = [1,10,5,20]

plt.plot(x,y)

In [None]:
df_stocks = data.stocks()
df_stocks.head()

In [None]:
df_tmp = df_stocks[df_stocks.symbol=='MSFT']
x = df_tmp.date
y = df_tmp.price

## Pyplot API (built to mimic MATLAB)

In [None]:
plt.figure(figsize=(18,6))
plt.plot(x,y)
_ = plt.title('Stock prices')


The Pyplot API has the concept of a "currently active" figure and axes  
This can be convenient for simple cases, but makes things confusing for more complex plots
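
To make the "currently active" idea concrete, here is a minimal sketch with two empty figures (the names `fig_a` and `fig_b` are just for illustration). It shows that `plt.plot` silently draws into whichever figure was activated last.

```python
import matplotlib.pyplot as plt

fig_a = plt.figure()
fig_b = plt.figure()            # creating a figure also makes it the current one

current_is_b = plt.gcf() is fig_b

plt.plot([1, 2, 3])             # no target given: draws into fig_b's current axes
lines_in_b = len(fig_b.axes[0].lines)

plt.figure(fig_a.number)        # switch the current figure back to fig_a
current_is_a = plt.gcf() is fig_a
fig_a_is_empty = len(fig_a.axes) == 0   # nothing was ever drawn into fig_a

plt.close("all")
```

With more than one figure or axes in play, keeping track of this hidden state gets error-prone, which is the main argument for the object-oriented API.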

## Object-oriented API (recommended)

In [None]:
fig, ax = plt.subplots(figsize=(18,6))
ax.plot(x,y)
_ = ax.set_title('Stock prices')

### Terminology: Axes and Axis

Most of the terms are straightforward but the main thing to remember is that:

The _Figure_ is the final image that may contain 1 or more _Axes_.  
The Axes represent an individual plot (don't confuse this with the word "axis", which refers to the x/y axis of a plot).
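
A small sketch of the terminology, using a figure with two side-by-side plots (the titles and labels are made up for illustration):

```python
import matplotlib.axis
import matplotlib.pyplot as plt

# One Figure containing two Axes (two individual plots)
fig, axs = plt.subplots(nrows=1, ncols=2)
n_axes = len(fig.axes)

# Each Axes owns its own x axis and y axis objects
first_ax = axs[0]
x_axis = first_ax.xaxis
y_axis = first_ax.yaxis

first_ax.set_title("One Axes (a single plot)")
x_axis.set_label_text("the x axis of this Axes")
```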

## Multiple lines in same plot

In [None]:
df_stocks = data.stocks()
df_stocks

To draw multiple lines in the same axes we need to loop and draw one at a time

In [None]:
fig, ax = plt.subplots(figsize=(18,6))

for symbol in df_stocks.symbol.unique():
    df_tmp = df_stocks[df_stocks.symbol==symbol]
    ax.plot(df_tmp.date, df_tmp.price, label=symbol)

ax.legend()
_ = ax.set_title('Prices')

## Subplots

In [None]:
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(36,12), sharex=True)
axs = axs.flatten()

for i, symbol in enumerate(df_stocks.symbol.unique()):
    df_tmp = df_stocks[df_stocks.symbol==symbol]
    ax = axs[i]
    ax.plot(df_tmp.date, df_tmp.price)
    ax.set_title(f'Stock prices for {symbol}')
    ax.grid(True)

_ = fig.suptitle('STOCK PRICES')    

## Scatter plot

In [None]:
df_stocks = data.stocks()
df_stocks

Example: Scatter plot of yields over two different periods

In [None]:
df_tmp = df_stocks.pivot(index='date', columns='symbol', values='price')

df_tmp

In [None]:
def get_yield(periods):
    '''
        Calculate yield over a given number of periods
        param: periods: Number of periods back to compare with
    '''
    df_diff = df_tmp.diff(periods=periods)
    return df_diff.div(df_tmp.shift(periods), axis='columns')
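
As a worked example of the calculation above: the yield over `periods` rows is (price now - price `periods` rows back) / price `periods` rows back. The sketch below repeats that arithmetic on a hypothetical three-row price table. The numbers are made up, and `yield_over` is an illustrative standalone helper that takes the frame as an argument instead of reading the global `df_tmp`. It then compares the result with pandas' built-in `pct_change`, which should give the same values.

```python
import numpy as np
import pandas as pd

# Hypothetical prices for two made-up symbols
prices = pd.DataFrame({"AAA": [100.0, 110.0, 121.0], "BBB": [50.0, 45.0, 54.0]})

def yield_over(df, periods):
    """Relative price change compared with the value `periods` rows back."""
    return df.diff(periods=periods).div(df.shift(periods), axis="columns")

manual = yield_over(prices, periods=1)

# Built-in equivalent; fill_method=None means missing values are not forward-filled
builtin = prices.pct_change(periods=1, fill_method=None)

same = np.allclose(manual, builtin, equal_nan=True)
print(manual)
print(same)
```

The first row is NaN in both versions, since there is no earlier price to compare with.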

In [None]:
fig, ax = plt.subplots(figsize=(20,12))

X_YIELD_PERIODS = 12
Y_YIELD_PERIODS = 1

ax.scatter(x=get_yield(periods=X_YIELD_PERIODS).values.flatten(), y=get_yield(periods=Y_YIELD_PERIODS).values.flatten())

ax.grid(True)
ax.set_xlabel(f'Yield over {X_YIELD_PERIODS} month(s)')
ax.set_ylabel(f'Yield over {Y_YIELD_PERIODS} month(s)')

The data-munging capabilities of pandas + powerful visualization from matplotlib/plotly/altair + Interactive environment like Jupyter ==> Data Analysis Goodness 

## Mix of plot types

In [None]:
from matplotlib import cm
from cycler import cycler

In [None]:
df_cars = data.cars()
df_cars.head()

In [None]:
fig,ax = plt.subplots(figsize=(16,16))
s = ax.scatter(x=df_cars.Horsepower, y=df_cars.Weight_in_lbs, c=df_cars.Acceleration, cmap=cm.cividis)
plt.colorbar(s)

In [None]:
colmap = cm.tab10

In [None]:
colors = {key:colmap(i) for i,key in enumerate(df_cars.Origin.unique())}

In [None]:
colors

In [None]:
df_cylinder_plot = df_cars.groupby(['Origin','Cylinders']).count().iloc[:,0].reset_index().rename({'Name':'car_count'}, axis=1)
df_cylinder_plot

In [None]:
lo = '''
    AAD
    BCD
    ZZZ
'''

fig, axs = plt.subplot_mosaic(layout = lo, figsize=(24,16))

### Nr of cars per year
ax = axs['A']
df_by_year = df_cars.groupby('Year').count().iloc[:,0].reset_index().rename({'Name':'car_count'}, axis=1)
ax.plot(df_by_year.Year, df_by_year.car_count, '-o')
ax.set_ylim((0,65))
ax.grid(True)
ax.set_title('Nr of cars per year')

### Cars by origin pie-chart
ax = axs['B']
ax.pie(df_cars.Origin.value_counts(), labels=df_cars.Origin.value_counts().index)
ax.set_title('Nr of cars by Origin')

### Scatter-chart horse-power vs. Miles per Gallon
ax = axs['C']
ax.scatter(df_cars.Horsepower,df_cars.Miles_per_Gallon ,c=df_cars.Origin.map(colors), alpha=0.8)
ax.set_title('HorsePower vs. Miles per Gallon')
ax.set_xlabel('Horsepower')
ax.set_ylabel('Miles per Gallon')


### Plot bar-chart per cylinder count
ax = axs['D']
df_cylinder_plot = df_cars.groupby(['Origin','Cylinders']).count().iloc[:,0].reset_index().rename({'Name':'car_count'}, axis=1)

### Make sure we have values for every combination of Origin + Cylinders
mi = pd.MultiIndex.from_product([df_cylinder_plot.Origin.unique(), df_cylinder_plot.Cylinders.unique()], names=['Origin','Cylinders'])
df_cylinder_plot = df_cylinder_plot.set_index(['Origin','Cylinders']).reindex(mi, fill_value=0).reset_index()

### Iterate over each cylinder count and add it to the plot; the bars are stacked by starting each one where the previous iteration ended, using the "left=prev_end" parameter
prev_end = 0
for cyl in df_cylinder_plot.Cylinders.unique():
    df_tmp = df_cylinder_plot[df_cylinder_plot.Cylinders == cyl]
    ax.barh(df_tmp.Origin, df_tmp.car_count, left=prev_end, label=cyl)
    prev_end = df_tmp.car_count.values + prev_end
ax.legend()
ax.set_title('Nr of cars per nr of cylinders')

### Scatterchart for HorsePower, Weight, Acceleration
ax = axs['Z']
scatter = ax.scatter(x=df_cars.Horsepower, y=df_cars.Weight_in_lbs, c=df_cars.Acceleration)
ax.set_xlabel('HorsePower')
ax.set_ylabel('Weight (lbs)')
ax.set_title('HorsePower/Weight/Acceleration')
fig.colorbar(scatter, ax=ax)

_ = fig.suptitle('Misc Car Plots', fontsize=20)

# How to use matplotlib outside of Jupyter notebooks

plt.ion() to turn on interactive mode
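
In a plain script or Python shell, figures normally appear only when `plt.show()` is called, and that call blocks until the windows are closed. With interactive mode switched on, figures are drawn and updated as commands run. The sketch below only toggles the mode and updates a line; the data is made up, and it assumes a GUI backend is available if you want to see a window (in a script you may also need `plt.pause(...)` or a final `plt.show()` for the window to refresh).

```python
import matplotlib.pyplot as plt

plt.ion()                               # interactive mode on
was_interactive = plt.isinteractive()

fig, ax = plt.subplots()
(line,) = ax.plot([0, 1, 2], [0, 1, 4])

# In interactive mode, later changes are picked up by the open figure
line.set_ydata([0, 2, 8])
ax.relim()                              # recompute data limits for the new values
ax.autoscale_view()

plt.ioff()                              # back to the default, blocking behaviour
is_interactive_after = plt.isinteractive()

plt.close(fig)
```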

# Matplotlib interactivity inside Jupyter

ipympl is an extension for using matplotlib interactively inside Jupyter: https://github.com/matplotlib/ipympl

In [None]:
%matplotlib widget

In [None]:
import numpy as np

In [None]:
plt.ion()

fig = plt.figure()
plt.plot(np.sin(np.linspace(0, 20, 100)));

In [None]:
plt.plot(np.cos(np.linspace(0,20,100)))

## 3D plots

In [None]:
from mpl_toolkits.mplot3d import axes3d

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Grab some test data.
X, Y, Z = axes3d.get_test_data(0.05)

# Plot a basic wireframe.
ax.plot_wireframe(X, Y, Z, rstride=10, cstride=10)

plt.show()

In [None]:
# When using the `widget` backend from ipympl,
# fig.canvas is a proper Jupyter interactive widget, which can be embedded in
# an ipywidgets layout. See https://ipywidgets.readthedocs.io/en/stable/examples/Layout%20Templates.html

# One can bind figure attributes to other widget values.
from ipywidgets import AppLayout, FloatSlider

plt.ioff()

slider = FloatSlider(
    orientation='horizontal',
    description='Factor:',
    value=1.0,
    min=0.02,
    max=2.0
)

slider.layout.margin = '0px 30% 0px 30%'
slider.layout.width = '40%'

fig = plt.figure()
fig.canvas.header_visible = False
fig.canvas.layout.min_height = '400px'
plt.title('Plotting: y=sin({} * x)'.format(slider.value))

x = np.linspace(0, 20, 500)

lines = plt.plot(x, np.sin(slider.value * x))

def update_lines(change):
    plt.title('Plotting: y=sin({} * x)'.format(change.new))
    lines[0].set_data(x, np.sin(change.new * x))
    fig.canvas.draw()
    fig.canvas.flush_events()

slider.observe(update_lines, names='value')

AppLayout(
    center=fig.canvas,
    footer=slider,
    pane_heights=[0, 6, 1]
)

# Enhance plots from pandas using your Matplotlib skills

In [158]:
# To get back to non-interactive mode
# %matplotlib inline

In [None]:
df = data.stocks()
df_pivot =  df.pivot(index='date', columns='symbol', values='price')
df_pivot.plot(figsize=(16,8), grid=True)

## Get axes object and annotate point

In [None]:
plt.ion()

df = data.stocks()
df_pivot =  df.pivot(index='date', columns='symbol', values='price')
ax = df_pivot.plot(figsize=(16,8), grid=True)

In [None]:
bbox = dict(boxstyle="round", fc="0.8")
ax.annotate('What happened here?', ('2007-09-01',700),xytext=(-200,-20) ,textcoords='offset points', arrowprops=dict(width=4), bbox=bbox)

ax.set_title('Stocks')

In [None]:
type(ax)

# Python visualization alternatives
**(Non-exhaustive list)**

* [Matplotlib](https://matplotlib.org/stable/index.html)
    - Oldest and most established plotting library for Python
    - Pros:  
        - Control over every detail of plot
        - Can get started quickly for simple plots
        - Extensive documentation, examples, problem solutions available online
    - Cons: 
        - Sometimes you need to control every detail of the plot
        - Two APIs, sometimes confusing conventions
* [Plotly](https://plotly.com/python/)
    - Ambitious for-profit company, but with a free, open-source Python library
    - Pros:  
        - Comprehensive as well as easy to get started with; the high-level plotly.express part of the library is there for convenience
        - There are also corresponding packages for JavaScript and R
        - They have a "Dash" solution for building dashboards using plotly visualizations
    - Cons:  
        - Trying to sell their for-profit "Enterprise" products
* [Bokeh](https://bokeh.org/)
    - Interactive visualizations for the web (or inside Jupyter)
    - Pros:  
        - Focus on interactive web-based visualization
    - Cons:  
        - API has changed substantially over time
* [Altair](https://altair-viz.github.io/index.html)
    - Declarative plotting based on Vega-Lite, a visualization grammar built on Vega
    - Pros:   
        - Once grasped, the declarative style is powerful and intuitive
        - Tightly coupled to pandas DataFrames
    - Cons:   
        - The newest alternative, some functionality might be lacking  
        - Depends on the development of Vega/Vega-Lite (also a strength)
        - Data gets embedded in the plot definition, which can lead to large notebooks
        - There is a default limit of 5,000 data points (the limit can be disabled)
* [Seaborn](https://seaborn.pydata.org/)
    - Statistical plotting, based on matplotlib
    - Pros:  
        - Advanced statistical plots in few lines of code
    - Cons:
        - Not the same ambition to be a general-purpose plotting library as the others
        

## Python's Visualization Landscape
**can be a bit daunting**  
But it is not as complex as the picture below makes it look, and having choices is a good thing

In [None]:
from IPython.display import Image

Image(filename=r'./PythonVisLandscape.jpg')

# Visualization distribution options

* Share jupyter notebook including visualization
    - Nordea is creating a JupyterHub solution for sharing work in notebooks (status?)
* Export as jpg/png/svg and include in Powerpoint/Word document etc.
* Export jpg/png/svg to Confluence/Sharepoint (using Python)
* Create custom HTML reports, including visualizations
* Create dashboard using Dash or other dashboarding solution
* Include in custom web-application (Flask/Django/FastAPI are python packages supporting the web part)
* Nordea [Engineering System Platform](https://wiki.itgit.oneadr.net/display/ESP/Engineering+System+Platform) can help in setting up web-solution available internally in Nordea  
    - Further improvements in this area are possible and should be expected as part of the move of functionality to the cloud
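
Two of the options above, exporting image files and building a custom HTML report, can be sketched together. The example below is a minimal illustration with made-up data: it writes to in-memory buffers so nothing lands on disk, and the HTML layout is just a placeholder. To produce real files, pass a filename such as `'plot.png'` or `'plot.svg'` to `fig.savefig` instead of a buffer.

```python
import base64
import io

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot([1, 3, 5, 10], [1, 10, 5, 20])
ax.set_title("Demo plot")

# Raster export (PNG), e.g. for PowerPoint/Word or a wiki page
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png", dpi=100)
png_bytes = png_buffer.getvalue()

# Vector export (SVG), scales without loss of quality
svg_buffer = io.BytesIO()
fig.savefig(svg_buffer, format="svg")
svg_text = svg_buffer.getvalue().decode("utf-8")

# Minimal HTML report with the PNG embedded as a base64 data URI,
# so the report is a single self-contained file
encoded = base64.b64encode(png_bytes).decode("ascii")
html_report = (
    "<html><body><h1>Demo report</h1>"
    f'<img src="data:image/png;base64,{encoded}" alt="Demo plot">'
    "</body></html>"
)

plt.close(fig)
```

Embedding the image keeps the report portable, at the cost of a larger HTML file; linking to separate image files is the alternative.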

## Dashboarding solutions for Python


An ambitious article comparing Dash / Streamlit / Voilà / Panel  
https://medium.datadriveninvestor.com/streamlit-vs-dash-vs-voil%C3%A0-vs-panel-battle-of-the-python-dashboarding-giants-177c40b9ea57  

**TL;DR**  
Dash is currently perhaps the most compelling alternative, with Streamlit being an up-and-coming option for quickly turning Python scripts into dashboards

# Extra

## Change pandas plotting backend

A fairly recently added feature in pandas is the ability to install and use plotting backends other than matplotlib.  
Options like Bokeh and hvplot can add more interactive plots.

`pip install hvplot`  
`pip install pandas_bokeh`

In [None]:
pd.set_option("plotting.backend", "hvplot")

In [None]:
df.pivot(index='date', columns='symbol', values='price').plot(figsize=(16,8), grid=True)

In [None]:
pd.set_option("plotting.backend", "pandas_bokeh")

In [None]:
df.pivot(index='date', columns='symbol', values='price').plot(sizing_mode="scale_width")

In [None]:
df2 = df.pivot(index='date', columns='symbol', values='price')
#df2.plot_bokeh(sizing_mode="scale_width")
df2.plot_bokeh(figsize=(1200,400))

In [None]:
# To change back to use matplotlib
pd.set_option("plotting.backend", 'matplotlib')

## Most used data science libraries
as per [python-developers-survey-2020](https://www.jetbrains.com/lp/python-developers-survey-2020/)

In [None]:
from IPython.display import Image

Image('./python_survey_2020_data_science_frameworks_libraries.png')