<a href="https://colab.research.google.com/github/jessiejxyu2/ist526/blob/main/visualizing_hierarchical_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Essential Libraries

In [None]:
# python visualization libraries
import pandas as pd
import numpy as np

import math
import json


import plotly.express as px
import plotly.graph_objects as go

# for hierarchical data
import networkx as nx

The source code is adapted from [here](https://towardsdatascience.com/visualize-hierarchical-data-using-plotly-and-datapane-7e5abe2686e1).



# Load Data from GitHub

In [None]:
# ref: https://stackoverflow.com/questions/32400867/pandas-read-csv-from-url

url = 'https://raw.githubusercontent.com/smbillah/ist526/main/hierarchical_data.csv'

# pandas call to read csv file 
df = pd.read_csv(url)

# quickly show the dataframe
df.head()

Unnamed: 0,Indent Level,Item and Group,Weight,Parent
0,0,All items,100.0,
1,1,Food and beverages,15.157,All items
2,2,Food,14.119,Food and beverages
3,3,Food at home,7.772,Food
4,4,Cereals and bakery products,1.001,Food at home


In [None]:
# it's a good idea to peek at the tail too.
# Note: we need the display(.) function when a cell prints more than one output
display(df.head())
display(df.tail())

# get column names
display(df.columns)

Unnamed: 0,Indent Level,Item and Group,Weight,Parent
0,0,All items,100.0,
1,1,Food and beverages,15.157,All items
2,2,Food,14.119,Food and beverages
3,3,Food at home,7.772,Food
4,4,Cereals and bakery products,1.001,Food at home


Unnamed: 0,Indent Level,Item and Group,Weight,Parent
289,4,Funeral expenses,0.14,Miscellaneous personal services
290,4,Laundry and dry cleaning services,0.22,Miscellaneous personal services
291,4,Apparel services other than laundry and dry cl...,0.03,Miscellaneous personal services
292,4,Financial services,0.229,Miscellaneous personal services
293,4,Unsampled items,0.111,Miscellaneous personal services


Index(['Indent Level', 'Item and Group', 'Weight', 'Parent'], dtype='object')

"The data contains 295 different categories spread across 8 different ‘Indent Levels’, from 1— ‘Food and beverages` (15.16% weight), to 8 — ‘Uncooked ground beef’ (0.17% weight). Note that it doesn’t go all the way down to individual products. The items are arranged hierarchically, so each item’s weight will be equal to the sum of its children’s weights."

## Pre-processing

In [None]:
# replace NaN with an empty string; otherwise plotly will be upset
df.fillna('', inplace = True)
# df.dropna(axis=0, inplace = True)
display(df.head())


Unnamed: 0,Indent Level,Item and Group,Weight,Parent
0,0,All items,100.0,
1,1,Food and beverages,15.157,All items
2,2,Food,14.119,Food and beverages
3,3,Food at home,7.772,Food
4,4,Cereals and bakery products,1.001,Food at home
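
The description quoted earlier says that each item's weight equals the sum of its children's weights. A minimal sketch of how that property could be checked is shown below. It uses a small made-up table with the same column names rather than the real CSV, so the numbers are illustrative only, and real data may need a tolerance for rounding.

```python
import pandas as pd

# Toy stand-in for the weights table (values are made up for illustration).
toy = pd.DataFrame({
    "Item and Group": ["All items", "Food", "Housing",
                       "Food at home", "Food away from home"],
    "Weight": [100.0, 40.0, 60.0, 25.0, 15.0],
    "Parent": ["", "All items", "All items", "Food", "Food"],
})

# Sum the weights of each parent's direct children (the root has an empty parent).
children_total = toy[toy["Parent"] != ""].groupby("Parent")["Weight"].sum()

# Look up each parent's own weight and compare it with its children's total.
own_weight = toy.set_index("Item and Group")["Weight"]
check = pd.DataFrame({
    "own": own_weight.reindex(children_total.index),
    "children": children_total,
})
check["gap"] = (check["own"] - check["children"]).abs()
print(check)
```

Running the same comparison on `df` (after the `fillna` step) would show which groups, if any, deviate from the rule.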


# Sunburst Tree

## Basic Sunburst Tree

In [None]:
fig = px.sunburst(
  df,   
  parents = 'Parent',
  names = 'Item and Group',
  values='Weight'
  #color='Indent Level'
)

fig.update_layout(
  title_text="Sunburst Diagram", 
  font_size=12
)

fig.show()

## Circular Sunburst and path

"Hierarchical data are often stored as a rectangular dataframe, with different columns corresponding to different levels of the hierarchy. px.sunburst can take a path parameter corresponding to a list of columns. Note that id and parent should not be provided if path is given." [ref](https://plotly.com/python/sunburst-charts/)
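
To make the idea of a `path` concrete, here is a rough sketch of how a rectangular table could be rolled up into id/parent/value rows, one level at a time. This is only a conceptual illustration under assumed toy data; it is not how Plotly implements `path` internally, and the `id` format (levels joined by `/`) is an arbitrary choice made here.

```python
import pandas as pd

# Toy rectangular data: one column per hierarchy level (made-up values).
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun"],
    "time": ["Dinner", "Lunch", "Dinner", "Dinner"],
    "total_bill": [10.0, 5.0, 8.0, 2.0],
})
path = ["day", "time"]

rows = []
for depth in range(1, len(path) + 1):
    levels = path[:depth]
    # Aggregate the value column at this depth of the hierarchy.
    grouped = toy.groupby(levels, as_index=False)["total_bill"].sum()
    for _, r in grouped.iterrows():
        ids = [r[c] for c in levels]
        rows.append({
            "id": "/".join(ids),           # e.g. "Sat/Dinner"
            "parent": "/".join(ids[:-1]),  # e.g. "Sat" ("" for top level)
            "value": r["total_bill"],
        })

hier = pd.DataFrame(rows)
print(hier)
```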

In [None]:
# load tips dataset
df_tips = px.data.tips()
display(df_tips.head())
display(df_tips.tail())

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.0,Female,Yes,Sat,Dinner,2
241,22.67,2.0,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2
243,18.78,3.0,Female,No,Thur,Dinner,2


`px.data.tips()` returns a data frame with 244 observations on the following 7 variables.

`total_bill`
a numeric column, the bill amount (dollars)

`tip`
a numeric column, the tip amount (dollars)

`sex`
a categorical column with levels Female and Male, gender of the payer of the bill

`smoker`
a categorical column with levels No and Yes, whether the party included smokers

`day`
a categorical column with levels Thur, Fri, Sat, and Sun, day of the week

`time`
a categorical column with levels Lunch and Dinner, rough time of day

`size`
a numeric column, number of people in the party

In [None]:
fig = px.sunburst(
  df_tips, 
  path=['day', 'time', 'sex'], 
  values='total_bill',
  color = 'day' 
)

fig.update_layout(
  title_text="Circular Sunburst Diagram", 
  font_size=10
)

fig.show()

# Treemaps

## Regular Treemaps
[ref](https://plotly.com/python/treemaps/)

In [None]:
fig = px.treemap(
  df,   
  parents = 'Parent',
  names = 'Item and Group',
  values='Weight'
  #color='Indent Level'
)

fig.update_layout(
  title_text="Treemap Diagram", 
  font_size=12
)

fig.show()

In [None]:
# adding a root color often helps

fig = px.treemap(
  df,   
  parents = 'Parent',
  names = 'Item and Group',
  values='Weight'
  #color='Indent Level'
)

fig.update_layout(
  title_text="Treemap Diagram", 
  font_size=12,
  margin = dict(t=50, l=25, r=25, b=25)
)

fig.update_traces(
  root_color="lightgrey"
)

fig.show()

## Treemaps for rectangular dataframe

In [None]:
fig = px.treemap(
  df_tips, 
  path=[px.Constant('all'), 'day', 'time', 'sex'], # adding a dummy constant as the root
  values='total_bill'  
)

fig.update_layout(
  title_text="Treemap Diagram", 
  font_size=12,
  margin = dict(t=50, l=25, r=25, b=25)
)

fig.update_traces(
  root_color="lightgrey"
)

fig.show()

## Another example with map/geo data

In [None]:
df_country = px.data.gapminder().query("year == 2007")
display(df_country.head())

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
11,Afghanistan,Asia,2007,43.828,31889923,974.580338,AFG,4
23,Albania,Europe,2007,76.423,3600523,5937.029526,ALB,8
35,Algeria,Africa,2007,72.301,33333216,6223.367465,DZA,12
47,Angola,Africa,2007,42.731,12420476,4797.231267,AGO,24
59,Argentina,Americas,2007,75.32,40301927,12779.37964,ARG,32


In [None]:
fig = px.treemap(
  df_country, 
  path=[px.Constant("world"), 'continent', 'country'], 
  values='pop',
  color='lifeExp', 
  hover_data=['iso_alpha'], 
  color_continuous_scale='RdBu',  
)

fig.update_layout(
  margin = dict(t=50, l=25, r=25, b=25)
)

fig.show()

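The above visualization is somewhat misleading: the color scale is centered without regard to population size, so the neutral color of the diverging palette does not correspond to a typical person's life expectancy. Let's center the color scale on the population-weighted average life expectancy instead.

In plotly, we can achieve that by setting the midpoint of the color scale with `color_continuous_midpoint`.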

In [None]:
# the life expectancy is weighted by each country's population size
midpoint = np.average(df_country['lifeExp'], weights=df_country['pop'])
display(midpoint)

68.91909251904043
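
To see why weighting matters, here is a tiny worked example with two hypothetical countries (the numbers are made up). The weighted mean is pulled toward the more populous country, and `np.average` with `weights` matches the manual formula sum(x * w) / sum(w).

```python
import numpy as np

# Two hypothetical countries: life expectancy and population (made-up values).
life_exp = np.array([50.0, 80.0])
population = np.array([1_000_000, 3_000_000])

# Unweighted mean treats both countries equally.
plain_mean = life_exp.mean()

# Weighted mean gives the larger country three times the influence.
weighted_mean = np.average(life_exp, weights=population)

# The same quantity computed by hand: sum(x * w) / sum(w).
manual = (life_exp * population).sum() / population.sum()

print(plain_mean, weighted_mean, manual)
```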

In [None]:
fig = px.treemap(
  df_country, 
  path=[px.Constant("world"), 'continent', 'country'], 
  values='pop',
  color='lifeExp', 
  hover_data=['iso_alpha'], 
  color_continuous_scale='RdBu',
  color_continuous_midpoint=midpoint,
)

fig.update_layout(
  margin = dict(t=50, l=25, r=25, b=25)
)

fig.show()

# Sankey Diagram (Edge/Flow visualization)
Up until now, we haven't paid attention to the edges of a tree. Enter the [Sankey](https://en.wikipedia.org/wiki/Sankey_diagram) diagram.

A Sankey diagram is a flow diagram, in which the width of arrows is proportional to the flow quantity.

[Ref](https://plotly.com/python/sankey-diagram/)


## Basic Sankey Diagram

`source` to represent the source node, 

`target` for the target node, 

`value` to set the flow volume, and 

`label` that shows the node name
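
Because `source` and `target` hold integer positions into `label`, it can be easier to write the edges by name first and convert them afterwards. The sketch below does that for the same six nodes used in the next cell; the `edges` list of `(source, target, value)` tuples is just a convenient intermediate format chosen here, not something Plotly requires.

```python
# Node labels; a node's position in this list is its index.
labels = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Edges written by name as (source, target, value) tuples.
edges = [
    ("A1", "B1", 8),
    ("A2", "B2", 4),
    ("A1", "B2", 2),
    ("B1", "C1", 8),
    ("B2", "C1", 4),
    ("B2", "C2", 2),
]

# Map each label to its index, then translate the named edges.
index_of = {label: i for i, label in enumerate(labels)}
source = [index_of[s] for s, _, _ in edges]
target = [index_of[t] for _, t, _ in edges]
value = [v for _, _, v in edges]

print(source)
print(target)
print(value)
```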

In [None]:
line = {'color': "black", 'width': 0.5}
print(line)

node = {'pad': 15, 
        'thickness': 20, 
        'line': line,
        'label': ["A1", "A2", "B1", "B2", "C1", "C2"],
        'color': "blue"
      }
print(node)

link = {
      'source': [0, 1, 0, 2, 3, 3], # indices correspond to labels (e.g., A1=0, A2=1), and each (source_i, target_i) pair is one link
      'target': [2, 3, 3, 4, 4, 5],
      'value' : [8, 4, 2, 8, 4, 2]
    }
print(link)

{'color': 'black', 'width': 0.5}
{'pad': 15, 'thickness': 20, 'line': {'color': 'black', 'width': 0.5}, 'label': ['A1', 'A2', 'B1', 'B2', 'C1', 'C2'], 'color': 'blue'}
{'source': [0, 1, 0, 2, 3, 3], 'target': [2, 3, 3, 4, 4, 5], 'value': [8, 4, 2, 8, 4, 2]}


In [None]:
fig = go.Figure(
  data = [go.Sankey(node = node, link = link)]
)

fig.update_layout(
  title_text="Basic Sankey Diagram", 
  font_size=10
)
fig.show()

## Complex one

In [None]:
# Get the data in the format Plotly wants
label_dict = { df["Item and Group"][i] : i for i in range(0, len(df) ) }

# Initialize empty arrays
source = []
target = []
value = []

for i, row in df.iterrows():
    # Skip the root level
    if row["Item and Group"] != 'All items': 
        source.append(label_dict[row["Parent"]])
        target.append(label_dict[row["Item and Group"]])
        value.append(row["Weight"])   


# define three variables
line = {'color': "black", 'width': 0.5}

link = {
      'source': source,
      'target': target,
      'value' : value
    }

node = {'pad': 15, 
        'thickness': 20, 
        'line': line,
        'label': df["Item and Group"].to_list(),
        'color': "blue",
        'hovertemplate': '%{label} is %{value} of spending'
      }


fig = go.Figure(
  data = [go.Sankey(node = node, link = link)]
)

fig.update_layout(
  title_text="Complex Sankey Diagram", 
  font_size=10
)
fig.show()
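
As an aside, the `iterrows` loop above can also be written with pandas `map`. The sketch below shows this on a small made-up table with the same column names; like the loop, it assumes every label is unique and that the root row has an empty `Parent`.

```python
import pandas as pd

# Toy stand-in for the weights table (values are made up for illustration).
toy = pd.DataFrame({
    "Item and Group": ["All items", "Food", "Housing", "Food at home"],
    "Weight": [100.0, 40.0, 60.0, 25.0],
    "Parent": ["", "All items", "All items", "Food"],
})

# Map each label to its row position (assumes labels are unique).
label_to_index = {label: i for i, label in enumerate(toy["Item and Group"])}

# Every non-root row contributes one parent -> child link.
children = toy[toy["Parent"] != ""]
source = children["Parent"].map(label_to_index).tolist()
target = children["Item and Group"].map(label_to_index).tolist()
value = children["Weight"].tolist()

print(source, target, value)
```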

# Node/Edge Visualization

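D3 is arguably better suited to node/edge (network) visualization, but here we stay in Python and use Dash Cytoscape.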

## [Background] List, Dictionary, and JSON Basics

This question refreshes the concepts of a list, dictionary, and JSON format in Python. There are plenty of online resources (e.g., https://medium.com/analytics-vidhya/python-dictionary-and-json-a-comprehensive-guide-ceed58a3e2ed) on these topics. Feel free to check those out.

**Short version:**

In Python, square brackets (e.g., ```[..]```) indicate a list, and curly brackets (e.g., ```{...}```) indicate a dictionary. 

For example, 

```
ages = [23, 21, 40, 43]
student = {'id': 1, 'name': 'jack', 'score': 90}

```
Here, 
* ```ages``` is a list containing 4 elements that are separated by ```,```
* ```student``` is a dictionary that contains 3 key-value pairs, separated by ```,``` 
  * A key-value pair looks like ```key```:```value```. Notice that a ```:``` separates a ```key``` from its ```value```. 

So, the dictionary ```student``` has 3 keys, 'id', 'name', and 'score', with values 1, 'jack', and 90, respectively. 
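
A short sketch of how these two structures are read and updated: list elements are accessed by position, dictionary values by key.

```python
ages = [23, 21, 40, 43]
student = {'id': 1, 'name': 'jack', 'score': 90}

# Lists are indexed by position, starting at 0.
first_age = ages[0]

# Dictionaries are indexed by key.
name = student['name']

# Assigning to an existing key replaces its value.
student['score'] = 95

# The keys keep their insertion order.
keys = list(student.keys())

print(first_age, name, student['score'], keys)
```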

We can create a list of lists or a list of dictionaries. See the following codes:
```
age_group = [[0, 12], [13, 19], [20,29], [30, 39]]
students = [{'id': 1, 'name': 'jack', 'score': 90}, {'id': 2, 'name': 'nina', 'score': 91}, {'id': 3, 'name': 'robin', 'score': 84}]
```

Here,
* ```age_group``` is a list that contains 4 sub-lists
* ```students``` is a list that contains 3 dictionaries, where each dictionary contains 3 key-value pairs.

**JSON is nothing but a representation of lists and dictionaries in the above format**
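
As a small illustration, the standard `json` module converts nested lists and dictionaries to a JSON string and back. The `students` data below repeats the example above.

```python
import json

students = [
    {'id': 1, 'name': 'jack', 'score': 90},
    {'id': 2, 'name': 'nina', 'score': 91},
]

# Serialize the Python objects to a JSON-formatted string.
text = json.dumps(students, indent=2)
print(text)

# Parse the string back into Python lists and dictionaries.
restored = json.loads(text)
print(restored == students)
```

One difference worth remembering: JSON object keys are always strings, so a dictionary with non-string keys does not round-trip unchanged.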


Create a dictionary named ```node_val``` that contains 2 keys ('id', 'label') with values 1 and 'Jason', respectively.

In [None]:
node_val = {'id':1, 'label':'Jason'}

Create another dictionary named ```edge_val``` that has two keys (```source```, and ```target```) where ```source``` has value 1 and ```target``` has value 2

In [None]:
edge_val = {'source': 1, 'target': 2}

Create a dictionary named ```baby_graph``` that contains two dictionaries with keys ```'node'``` and ```'edge'``` and values ```node_val``` and ```edge_val```, respectively. 

In [None]:
baby_graph = { 'node': node_val, 'edge': edge_val}

Create a list named ```node_vals``` containing four dictionaries like ```node_val```. You can use arbitrary values for each key in the individual dictionaries.

Similarly, create a list named ```edge_vals``` containing four dictionaries like ```edge_val```. Use arbitrary values for each key in the individual ```edge_val``` dictionaries. 

Finally, create a dictionary named ```graph``` with keys ```'nodes'``` and ```'edges'``` whose values are ```node_vals``` and ```edge_vals```, respectively. 

In [None]:
# create 4 node_val dicts
node_val1 = {'id':1, 'label':'Jason'}
node_val2 = {'id':2, 'label':'Peter'}
node_val3 = {'id':3, 'label':'Jane'}
node_val4 = {'id':4, 'label':'Jasmine'}

# create node_vals list
node_vals = [node_val1, node_val2, node_val3, node_val4]


# create 4 edge_val dicts
edge_val1 = {'source': 1, 'target': 2}
edge_val2 = {'source': 1, 'target': 3}
edge_val3 = {'source': 1, 'target': 4}
edge_val4 = {'source': 2, 'target': 3}

# create edge_vals list
edge_vals = [edge_val1, edge_val2, edge_val3, edge_val4]

# create graph dict from the two lists
graph = {'nodes': node_vals, 'edges': edge_vals}

## Dash

In [None]:
!pip install jupyter-dash -q


In [None]:
!pip install dash_cytoscape -q

In [None]:
from jupyter_dash import JupyterDash 

# dash imports
import dash
from dash import html  # the standalone dash_html_components package is deprecated
from dash import dcc
from dash.dependencies import Output, Input
from dash import no_update

import dash_cytoscape as cyto

In [None]:
# this css creates columns and row layout
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']


## Use the following line when running in Google Colab
app = JupyterDash(__name__, external_stylesheets=external_stylesheets)

## Uncomment the following line (and comment out the one above) to run in a web browser
# app = dash.Dash(__name__, external_stylesheets=external_stylesheets)



# layout
app.layout = html.Div([
    html.P("Dash Cytoscape:"),
    cyto.Cytoscape(
        id='cytoscape',
        elements=[
            {'data': {'id': 'ca', 'label': 'Canada'}}, 
            {'data': {'id': 'on', 'label': 'Ontario'}}, 
            {'data': {'id': 'qc', 'label': 'Quebec'}},
            {'data': {'source': 'ca', 'target': 'on'}}, 
            {'data': {'source': 'ca', 'target': 'qc'}}
        ],
        layout={'name': 'breadthfirst'},
        #style={'width': '400px', 'height': '500px'}
    )
])

  
# run the code
# the following line runs the app inline in Google Colab
app.run_server(mode='inline', port=8030)

# uncomment the following lines to run in Browser via command line/terminal
#if __name__ == '__main__':
#  app.run_server(debug=True, host='127.0.0.1', port=8000)
#  app.run_server(debug=True)

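
The `elements` list passed to `cyto.Cytoscape` above has the same shape as the nodes-and-edges dictionary from the background section, with each entry wrapped in a `'data'` key. A minimal sketch of that conversion follows. The helper name `to_cytoscape_elements` is hypothetical (it is not part of Dash Cytoscape), and ids are cast to strings to match the string ids used in the example above.

```python
# A small graph in the nodes/edges dictionary format (made-up values).
graph = {
    'nodes': [{'id': 1, 'label': 'Jason'}, {'id': 2, 'label': 'Peter'}],
    'edges': [{'source': 1, 'target': 2}],
}

def to_cytoscape_elements(graph):
    """Hypothetical helper: wrap each node and edge in a {'data': ...} dict."""
    nodes = [
        {'data': {'id': str(n['id']), 'label': n['label']}}
        for n in graph['nodes']
    ]
    edges = [
        {'data': {'source': str(e['source']), 'target': str(e['target'])}}
        for e in graph['edges']
    ]
    return nodes + edges

elements = to_cytoscape_elements(graph)
print(elements)
```

The resulting list could then be passed as the `elements` argument in the layout above.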