# Sankey

**Sankey diagrams** are a [data visualisation](https://en.wikipedia.org/wiki/Data_and_information_visualization "Data and information visualization") technique or [flow diagram](https://en.wikipedia.org/wiki/Flow_diagram "Flow diagram") that emphasizes flow/movement/change from one state to another or one time to another. Sankey diagrams emphasize the major transfers or flows within a system, and help with locating the most important contributions to a flow.

> <sup>*From [Wikipedia](https://en.wikipedia.org/wiki/Sankey_diagram)*</sup>

There are implementations for this chart type in both [matplotlib](https://matplotlib.org/stable/api/sankey_api.html) and [Plotly](https://plotly.com/python/sankey-diagram/).

## Sample Data
This example dataframe represents flows that split like a tree (i.e. only in the downstream direction). It is a typical result of aggregating a hierarchical data set.

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Sankey sample data
data = dict(
    lvl1=list('AAAAAAAABBB'),
    lvl2="AP AP AP AC AC AC AB AB BE BR BA".split(),
    lvl3="APP APE APA ACT ACC ACE ABL ABO BET BRE BAK".split(),
    lvl4="APPL APEX APAR ACTO ACCE ACER ABLE ABOU".split() + [np.nan] * 3,
    count=[5, 2, 3, 8, 2, 10, 1, 3, 4, 6, 3],
)
df = pd.DataFrame(data)
df
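As a rough illustration of where such a frame can come from, the sketch below aggregates a handful of raw records into one row per path through the hierarchy. The `raw` frame and its values are made up for this example, and the level names only mirror those of the sample data.

```python
import pandas as pd

# Hypothetical raw records: one row per observation, labelled with its hierarchy levels.
raw = pd.DataFrame({
    "lvl1": ["A", "A", "A", "B", "B"],
    "lvl2": ["AP", "AP", "AC", "BE", "BE"],
    "lvl3": ["APP", "APP", "ACT", "BET", "BET"],
})

# Count the observations on each unique path through the hierarchy.
agg = (
    raw.groupby(["lvl1", "lvl2", "lvl3"])
       .size()
       .reset_index(name="count")
)
print(agg)
```

Note that `groupby` drops rows with missing keys by default, so paths of unequal depth (like the `B` rows in the sample data) would need `dropna=False` to survive such an aggregation.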

## Visualizing Hierarchical Data with Plotly
Plotly expects Sankey flows as three parallel lists: the source node of each link, its target node, and its value (weight). Nodes are referenced by their index in the list of node labels.
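To make that format concrete, here is a minimal hand-written specification as a plain dict, with two links leaving a single node. The labels and numbers are invented for illustration, and only a small subset of the available `node` and `link` properties is set.

```python
# Node labels; links refer to nodes by their position in this list.
labels = ["A", "AP", "AC"]

# Three parallel lists: link i runs from source[i] to target[i] with weight value[i].
source = [0, 0]   # both links start at "A"
target = [1, 2]   # ... and end at "AP" and "AC" respectively
value = [10, 20]

spec = dict(
    data=[dict(
        type="sankey",
        node=dict(label=labels),
        link=dict(source=source, target=target, value=value),
    )],
    layout=dict(title=dict(text="Minimal Sankey")),
)

# Print the links in readable form as a quick sanity check.
for s, t, v in zip(source, target, value):
    print(f"{labels[s]} -> {labels[t]}: {v}")
```

Because the links carry only indexes, the order of `labels` must stay fixed once the `source` and `target` lists have been built.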

The following function returns a Plotly graph specification converted from a hierarchical dataset like the one above.

In [None]:
import cmasher as cmr
import plotly.offline as py

py.init_notebook_mode(connected=False)

In [None]:
# Based on https://gist.github.com/ken333135/09f8793fff5a6df28558b17e516f91ab#file-gensankey
def sankey_spec(df, columns=None, title='Sankey Diagram', titlesize=18, labelsize=12, cmap='Set1'):
    """Wrapper for Plotly to support a 'level columns + weights' data format."""
    # default to all columns; the last one holds the weights, the others are levels
    columns = list(df.columns if columns is None else columns)
    cat_cols, weight_col = columns[:-1], columns[-1]
    palette = cmr.take_cmap_colors(cmap, len(cat_cols), return_fmt='hex')

    # create conjoined node label / color lists, so we can use numerical indexes into them
    # (assumes a label occurs in one level only, since lookups find the first match)
    labels, colors = [], []
    for color, cat_col in zip(palette, cat_cols):
        col_labels = set(df[cat_col].dropna().values)
        labels.extend(col_labels)
        colors.extend([color] * len(col_labels))  # give each level its own color

    # transform df into source-target pairs, one batch per pair of adjacent levels
    parts = []
    for source_col, target_col in zip(cat_cols, cat_cols[1:]):
        part_df = df[[source_col, target_col, weight_col]].copy()
        part_df.columns = ['source', 'target', 'weight']
        parts.append(part_df)

    # sum up the weights of identical pairs (rows with a missing target are dropped)
    links_df = pd.concat(parts).groupby(['source', 'target']).agg({'weight': 'sum'}).reset_index()

    # add index columns for source-target pairs
    links_df['sourceID'] = links_df['source'].apply(labels.index)
    links_df['targetID'] = links_df['target'].apply(labels.index)

    # return the Sankey diagram specification
    return dict(
        data=[dict(
            type='sankey',
            node=dict(
              pad=15, thickness=20, line=dict(color="black", width=0.5),
              label=labels, color=colors,
            ),
            link=dict(source=links_df['sourceID'],
                      target=links_df['targetID'],
                      value=links_df['weight']),
        )],
        layout=dict(title=dict(text=title, font=dict(size=titlesize)), font=dict(size=labelsize)),
    )
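The core of the conversion is the step that turns level columns into source-target pairs. The standalone sketch below repeats just that step on a tiny frame so the intermediate result can be inspected. The `tiny` frame and the names `levels` and `links` are made up for this illustration and are not part of the function above.

```python
import pandas as pd

# Hypothetical three-level frame; the "B" path ends one level early.
tiny = pd.DataFrame({
    "lvl1": ["A", "A", "B"],
    "lvl2": ["AP", "AC", "BE"],
    "lvl3": ["APP", "ACT", None],
    "count": [5, 8, 4],
})
levels = ["lvl1", "lvl2", "lvl3"]

# Each pair of adjacent levels contributes one batch of (source, target, weight) rows.
parts = []
for src, dst in zip(levels, levels[1:]):
    part = tiny[[src, dst, "count"]].copy()
    part.columns = ["source", "target", "weight"]
    parts.append(part)

# Merge identical pairs; groupby drops rows whose target is missing.
links = (
    pd.concat(parts, ignore_index=True)
      .groupby(["source", "target"], as_index=False)["weight"].sum()
)
print(links)
```

The shortened `B` path yields a single `B` to `BE` link and nothing further downstream, which is how the missing `lvl4` values in the sample data are handled as well.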

## Final Result
Transforming the sample data with the conversion function yields this graph.

In [None]:
fig = sankey_spec(df, title='Word Etymology')
py.iplot(fig, validate=False)