## Sankey Diagrams in Jupyter Widgets

>Sankey diagrams are named after Irish Captain Matthew Henry Phineas Riall Sankey, who used this type of diagram in 1898 in a classic figure (see diagram) showing the energy efficiency of a steam engine. The original charts in black and white displayed just one type of flow (e.g. steam); using colors for different types of flows lets the diagram express additional variables.
 
>One of the most famous Sankey diagrams is Charles Minard's Map of Napoleon's Russian Campaign of 1812. It is a flow map, overlaying a Sankey diagram onto a geographical map. It was created in 1869, so it actually predates Sankey's 'first' Sankey diagram of 1898.  --- Wikipedia



### Install iPySankey Widget

#### Using Pip
``` bash
pip install ipysankeywidget
```
#### Using Conda
``` bash
conda install -c conda-forge ipysankeywidget 
```

#### Environment.yml
Clone the repository, then create and activate the Conda environment from its `environment.yml` file.  
``` bash
git clone https://github.com/gnfrazier/sankey_tutorial.git
cd sankey_tutorial

conda env create --file=environment.yml
conda activate charts
```



### Enable Jupyter Extensions

Enable both Jupyter extensions, especially in a new environment; otherwise the chart will not display.  

``` bash
jupyter nbextension enable --py --sys-prefix ipysankeywidget
jupyter nbextension enable --py --sys-prefix widgetsnbextension
```

### Import Libraries

In [1]:
import datetime
import random

import ipysankeywidget as sk
import ipywidgets as ipw
import pandas as pd

### Generate Sample Data
Our use case is a time series data set. This faux data set represents data as it would come from an indoor tracking system of the kind retailers frequently use.

Note some of the locations are repeated to add some weighting to the dataset. 

In [2]:


locations = ['Produce', 'Dairy', 'Freezer', 'Chips', 'Cookies', 'Cereal',
             'Produce', 'Dairy', 'Produce', 'Dairy', 'Cereal']

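def generate_customers(num_custs=200):
    """Return num_custs customer IDs drawn from CUST-1..CUST-100 (repeats are expected)."""
    customer_list = []
    # Use < rather than <= so exactly num_custs IDs are generated
    while len(customer_list) < num_custs:
        customer_list.append('CUST-' + str(random.randrange(1, 101)))
    print('Unique Customers = ', len(set(customer_list)))

    return customer_list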

def generate_sample_data(sample_size=10000, num_custs=200):
    sample_data = []
    
    customers = generate_customers(num_custs)
    
    while len(sample_data) < sample_size:
        # One store visit: a random day and opening hour in January-June 2019
        month = random.randrange(1, 7)
        day = random.randrange(1, 29)
        hour = random.randrange(8, 22)

        # Choose from the whole list instead of a hard-coded index range
        rand_customer = random.choice(customers)

        # Each visit produces 0-7 location readings within the same hour
        for _ in range(random.randrange(8)):

            minute = random.randrange(1, 60)
            seconds = random.randrange(1, 60)
            rand_location = random.choice(locations)
            rand_timestamp = datetime.datetime(2019, month, day, hour, minute, seconds)

            item = {'timestamp': str(rand_timestamp), 'customer': rand_customer, 'location': rand_location}
            sample_data.append(item)

    print('Data Points Generated = ', len(sample_data))
    return sample_data
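
To check how the repeated entries weight the random choice, a quick sketch can compute the share of each name in the list. `location_weights` is a hypothetical helper name, not part of the original notebook, and the `locations` list is repeated here only so the snippet runs on its own.

```python
from collections import Counter

locations = ['Produce', 'Dairy', 'Freezer', 'Chips', 'Cookies', 'Cereal',
             'Produce', 'Dairy', 'Produce', 'Dairy', 'Cereal']

def location_weights(items):
    """Return the probability that a uniform random draw picks each name."""
    counts = Counter(items)
    total = len(items)
    return {name: count / total for name, count in counts.items()}

weights = location_weights(locations)
# Print the most likely locations first
for name, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f'{name}: {weight:.2f}')
```

Names that appear three times in the list are three times as likely to be drawn as names that appear once.
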

In [3]:
raw_data = generate_sample_data()

Unique Customers =  87
Data Points Generated =  10001


In [None]:
df = pd.DataFrame(raw_data)
df['timestamp']= pd.to_datetime(df['timestamp'])
df.sort_values(['customer', 'timestamp'], inplace=True)

### Final Sample Data

In [14]:
df.head(10)

| | timestamp | customer | location |
|---|---|---|---|
| 3081 | 2019-01-09 17:01:01 | CUST-1 | Cereal |
| 3084 | 2019-01-09 17:15:07 | CUST-1 | Produce |
| 3083 | 2019-01-09 17:40:20 | CUST-1 | Freezer |
| 3085 | 2019-01-09 17:41:34 | CUST-1 | Produce |
| 3082 | 2019-01-09 17:55:15 | CUST-1 | Produce |
| 3086 | 2019-01-09 17:55:43 | CUST-1 | Produce |
| 3080 | 2019-01-09 17:58:10 | CUST-1 | Produce |
| 82 | 2019-01-11 09:30:25 | CUST-1 | Cereal |
| 83 | 2019-01-11 09:46:48 | CUST-1 | Dairy |
| 81 | 2019-01-11 09:59:47 | CUST-1 | Freezer |


### Transform the data

The iPySankey widget needs the data as a list of dictionaries, each with a `source`, a `target`, and a `value`, structured like this:
``` Python
links = [
    {'source': 'A', 'target': 'B', 'value': 1},
    {'source': 'B', 'target': 'C', 'value': 1},
    {'source': 'A', 'target': 'D', 'value': 1},
        ]
```
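
As a minimal sketch of how such a list might be produced, the snippet below counts the transitions between consecutive stops in one ordered path. `links_from_path` is a hypothetical helper, not part of ipysankeywidget; the `'1'` suffix on targets is an assumption borrowed from the naming this notebook uses to keep source and target nodes distinct.

```python
from collections import Counter

def links_from_path(path, suffix='1'):
    """Count transitions between consecutive stops and return Sankey-style links."""
    # Pair each stop with the one that follows it: (A, B), (B, C), ...
    transitions = Counter(zip(path, path[1:]))
    return [
        {'source': source, 'target': target + suffix, 'value': count}
        for (source, target), count in sorted(transitions.items())
    ]

example_links = links_from_path(['Produce', 'Dairy', 'Produce', 'Dairy', 'Cereal'])
for link in example_links:
    print(link)
```

Repeated transitions collapse into a single link whose `value` is the number of times that move occurred.
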


In [15]:
accounts = list(df['customer'].unique())
area_types = list(df['location'].unique())

# Tag the step number of each customer, regardless of visit
all_steps = []  # for total customer tracking over time

day_steps = []  # for daily trip tracking

next_steps = []  # for single-area traversal

# Iterating over a pandas DataFrame row by row is slow. For a small sample like
# this one it is tolerable, but processing 6M rows of the real data set took a long time.

for account in accounts:
    row = {'account_num': account}
    single_account = df[df['customer'] == account]

    steps = list(single_account['location'])

    for count, locs in enumerate(steps, 1):

        row[f'col{count}'] = locs

        if count == 1:
            # The first step has no previous location, so it produces no link.
            # (`continue` replaces a bare `next`, which had no effect.)
            source = locs
            continue

        next_steps.append({'source': source, 'target': locs + '1', 'value': 1})

        source = locs

    row['value'] = 1
    all_steps.append(row)

df_next = pd.DataFrame(next_steps)
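
Because the per-customer loop was slow on the full data set, one possible alternative is to let pandas look up the previous location for each customer with `groupby(...).shift()`. This is a sketch, not the method used in this notebook: `transitions_vectorized` is a hypothetical name, the small `demo` frame stands in for `df`, and each customer's first step is dropped because it has no predecessor.

```python
import pandas as pd

def transitions_vectorized(frame):
    """Build source/target/value rows without a Python-level loop."""
    ordered = frame.sort_values(['customer', 'timestamp'])
    # Previous location within each customer; NaN marks a customer's first step
    source = ordered.groupby('customer')['location'].shift()
    out = pd.DataFrame({
        'source': source,
        'target': ordered['location'] + '1',
        'value': 1,
    })
    return out.dropna(subset=['source']).reset_index(drop=True)

demo = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-01-09 17:01', '2019-01-09 17:15',
                                 '2019-01-09 17:40', '2019-01-10 09:00',
                                 '2019-01-10 09:30']),
    'customer': ['CUST-1', 'CUST-1', 'CUST-1', 'CUST-2', 'CUST-2'],
    'location': ['Cereal', 'Produce', 'Freezer', 'Dairy', 'Chips'],
})
demo_next = transitions_vectorized(demo)
print(demo_next)
```

The result has the same `source`, `target`, and `value` columns as `df_next`, so it could be grouped and summed in the same way.
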

### Custom Colors

In [6]:

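def custom_colors(categories):
    """Map category names to hex colors from a ten-color palette.

    Names are paired with colors by position, so a name that repeats in
    the list ends up with the color of its last paired occurrence.
    """
    cat_colors = ["1f77b4",
                  "ff7f0e",
                  "2ca02c",
                  "d62728",
                  "9467bd",
                  "8c564b",
                  "e377c2",
                  "7f7f7f",
                  "bcbd22",
                  "17becf",
                  ]

    cat_colors = ['#' + color for color in cat_colors]

    # zip stops at the shorter input, so at most ten pairs are produced
    full_color_map = dict(zip(categories, cat_colors))

    return full_color_map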
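
Since `locations` contains repeated names, a variation could deduplicate the categories first and cycle through the palette when there are more categories than colors. The sketch below is an alternative under those assumptions, not the notebook's method, and it would assign different colors than the outputs shown here. `distinct_color_map` and `PALETTE` are hypothetical names, and only three palette colors are used for brevity.

```python
from itertools import cycle

PALETTE = ['#1f77b4', '#ff7f0e', '#2ca02c']  # first three colors of the palette

def distinct_color_map(categories, palette=PALETTE):
    """Map each unique category to a color, reusing colors if the palette runs out."""
    # dict.fromkeys keeps first-seen order while dropping duplicates
    unique = list(dict.fromkeys(categories))
    return dict(zip(unique, cycle(palette)))

demo_map = distinct_color_map(['Produce', 'Dairy', 'Produce', 'Freezer', 'Chips'])
print(demo_map)
```

With this approach the color of a category depends only on when it first appears, not on how often it is repeated.
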


In [7]:
colormap = custom_colors(locations)

#df_next['color']=df_next['source'].apply(lambda x: colormap[x])
df_next.head(50)

next_grouped = df_next.groupby(['source','target']).sum().reset_index()

In [8]:
next_grouped['color']=next_grouped['source'].apply(lambda x: colormap[x])
links = next_grouped.to_dict(orient='records')

In [9]:
next_grouped

| | source | target | value | color |
|---|---|---|---|---|
| 0 | Cereal | Cereal1 | 363 | #8c564b |
| 1 | Cereal | Chips1 | 170 | #8c564b |
| 2 | Cereal | Cookies1 | 165 | #8c564b |
| 3 | Cereal | Dairy1 | 486 | #8c564b |
| 4 | Cereal | Freezer1 | 178 | #8c564b |
| 5 | Cereal | Produce1 | 502 | #8c564b |
| 6 | Chips | Cereal1 | 178 | #d62728 |
| 7 | Chips | Chips1 | 99 | #d62728 |
| 8 | Chips | Cookies1 | 72 | #d62728 |
| 9 | Chips | Dairy1 | 259 | #d62728 |


In [10]:
links

[{'source': 'Cereal', 'target': 'Cereal1', 'value': 363, 'color': '#8c564b'},
 {'source': 'Cereal', 'target': 'Chips1', 'value': 170, 'color': '#8c564b'},
 {'source': 'Cereal', 'target': 'Cookies1', 'value': 165, 'color': '#8c564b'},
 {'source': 'Cereal', 'target': 'Dairy1', 'value': 486, 'color': '#8c564b'},
 {'source': 'Cereal', 'target': 'Freezer1', 'value': 178, 'color': '#8c564b'},
 {'source': 'Cereal', 'target': 'Produce1', 'value': 502, 'color': '#8c564b'},
 {'source': 'Chips', 'target': 'Cereal1', 'value': 178, 'color': '#d62728'},
 {'source': 'Chips', 'target': 'Chips1', 'value': 99, 'color': '#d62728'},
 {'source': 'Chips', 'target': 'Cookies1', 'value': 72, 'color': '#d62728'},
 {'source': 'Chips', 'target': 'Dairy1', 'value': 259, 'color': '#d62728'},
 {'source': 'Chips', 'target': 'Freezer1', 'value': 72, 'color': '#d62728'},
 {'source': 'Chips', 'target': 'Produce1', 'value': 217, 'color': '#d62728'},
 {'source': 'Cookies', 'target': 'Cereal1', 'value': 177, 'color': '#94

In [19]:
layout = ipw.Layout(width="1000", height="600")
def sankey(margin_top=10, **value):
    """Show SankeyWidget with default values for size and margins"""
    return sk.SankeyWidget(layout=layout,
                        margins=dict(top=margin_top, bottom=0, left=90, right=90),
                        **value)





In [20]:

sankey(links=links)

SankeyWidget(layout=Layout(height='600', width='1000'), links=[{'source': 'Cereal', 'target': 'Cereal1', 'valu…