# Performance Variability Boxplots

Performance variability boxplots provide insight into the runtime distribution and its variability across callsites. Each boxplot represents the range of the distribution, and outliers (dots) are the values that lie more than 1.5*IQR beyond the quartiles. Several statistical measures, such as mean, variance, kurtosis, and skewness, are also provided.
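
To make the 1.5*IQR rule concrete, here is a minimal NumPy sketch of how outliers can be separated from the rest of a distribution. It is an illustration only, not hatchet's implementation: the function name `iqr_outliers` and the sample runtimes are hypothetical, and the `iqr_scale` argument simply mirrors the default scale of 1.5.

```python
import numpy as np


def iqr_outliers(values, iqr_scale=1.5):
    """Return (outliers, (lower_fence, upper_fence)) for a 1-D set of runtimes."""
    values = np.asarray(values, dtype=float)
    # First and third quartiles (NumPy's default linear interpolation).
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Anything outside the fences is drawn as a dot (outlier).
    lower = q1 - iqr_scale * iqr
    upper = q3 + iqr_scale * iqr
    mask = (values < lower) | (values > upper)
    return values[mask], (lower, upper)


# Hypothetical runtimes; the last value is far from the rest.
runtimes = [10.0, 11.0, 12.0, 13.0, 14.0, 40.0]
outliers, (lower, upper) = iqr_outliers(runtimes)
print(outliers, lower, upper)
```

A larger `iqr_scale` widens the fences, so fewer points are reported as outliers.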

In [1]:
import os, sys
from IPython.display import HTML, display

# Hatchet imports
import hatchet as ht
from hatchet.util.unify_ensemble import unify_ensemble
from hatchet.util.boxplot import BoxPlot

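First, we will construct a **hatchet.GraphFrame** using a sample dataset in our repository, **caliper-lulesh-json**. To emulate an ensemble of runs, the cell below loads the same profile ten times, labels each copy with a dataset name, and unifies the copies into a single GraphFrame.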

In [2]:
data_dir = os.path.realpath("../../../hatchet/tests/data")
data_path = os.path.join(data_dir, "caliper-lulesh-json/lulesh-annotation-profile.json")

gf_list = []
for i in range(10):
    gf = ht.GraphFrame.from_caliper(data_path)
    gf.dataset = "dset{}".format(i)
    gf_list.append(gf)

gf_ensemble = unify_ensemble(gf_list)
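
Conceptually, the unified dataframe carries one extra index level, `dataset`, next to `node` and `rank`. The pandas sketch below imitates that shape with small hypothetical frames. It is only an analogy for the resulting index layout, not how `unify_ensemble` is implemented; the helper `make_run` and all values are made up for illustration.

```python
import pandas as pd


def make_run(offset):
    """Build a tiny stand-in for one run's dataframe, indexed by (node, rank)."""
    index = pd.MultiIndex.from_tuples(
        [("main", 0), ("main", 1), ("TimeIncrement", 0)], names=["node", "rank"]
    )
    return pd.DataFrame(
        {"time": [10.0 + offset, 12.0 + offset, 3.0 + offset]}, index=index
    )


frames = []
for i in range(3):
    df = make_run(i)
    # Tag every row with the run it came from, then move the tag into the index.
    df["dataset"] = "dset{}".format(i)
    frames.append(df.set_index("dataset", append=True))

# Stack the runs; the index is now (node, rank, dataset).
ensemble = pd.concat(frames).sort_index()
print(ensemble)
```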

In [3]:
gf_ensemble.dataframe

node,rank,dataset,name,nid,time,time (inc),hatchet_nid
"{'name': 'main', 'type': 'region'}",0,dset9,main,0,121489.0,5882425.0,0
"{'name': 'main', 'type': 'region'}",4,dset7,main,0,118953.0,5905595.0,0
"{'name': 'main', 'type': 'region'}",5,dset7,main,0,133256.0,5877613.0,0
"{'name': 'main', 'type': 'region'}",6,dset7,main,0,114035.0,5870933.0,0
"{'name': 'main', 'type': 'region'}",7,dset7,main,0,137098.0,5898724.0,0
...,...,...,...,...,...,...,...
"{'name': 'TimeIncrement', 'type': 'region'}",1,dset3,TimeIncrement,12,212402.0,212402.0,23
"{'name': 'TimeIncrement', 'type': 'region'}",2,dset3,TimeIncrement,12,171635.0,171635.0,23
"{'name': 'TimeIncrement', 'type': 'region'}",3,dset3,TimeIncrement,12,323519.0,323519.0,23
"{'name': 'TimeIncrement', 'type': 'region'}",3,dset5,TimeIncrement,12,323519.0,323519.0,23


Next, using the **hatchet.GraphFrame**, we can calculate the data required for the performance variability boxplots using an exposed hatchet API, **BoxPlot**.

The interface accepts the following attributes:
1. `tgt_gf` - Target hatchet.GraphFrame 
2. `bkg_gf` - Background hatchet.GraphFrame (optional)
3. `callsites` - List of callsite names for which we want to compute/visualize the boxplots.
4. `metrics` - Runtime metrics for which we need to calculate the boxplots.
5. `iqr_scale` - Interquartile range scale (by default = 1.5)

In [4]:
bp = BoxPlot(multi_index_gf=gf_ensemble, metrics=["time"], drop_index="dataset")


The **BoxPlot** API calculates the results and stores them as GraphFrames in a dictionary (i.e., `tgt` and `bkg`).

In [5]:
bp.gf['time'].dataframe

node,rank,hatchet_nid,dataset,q,min,max,mean,var,imb,kurt,skew,name
"{'name': 'main', 'type': 'region'}",0,0,dset9,"[105528.0, 113072.25, 116494.0, 124430.75, 137...",105528.0,137098.0,119373.50,1.044980e+08,0.148479,-0.942185,0.543673,main
"{'name': 'main', 'type': 'region'}",4,0,dset7,"[105528.0, 113072.25, 116494.0, 124430.75, 137...",105528.0,137098.0,119373.50,1.044980e+08,0.148479,-0.942185,0.543673,main
"{'name': 'main', 'type': 'region'}",7,0,dset1,"[105528.0, 113072.25, 116494.0, 124430.75, 137...",105528.0,137098.0,119373.50,1.044980e+08,0.148479,-0.942185,0.543673,main
"{'name': 'main', 'type': 'region'}",1,0,dset3,"[105528.0, 113072.25, 116494.0, 124430.75, 137...",105528.0,137098.0,119373.50,1.044980e+08,0.148479,-0.942185,0.543673,main
"{'name': 'main', 'type': 'region'}",0,0,dset6,"[105528.0, 113072.25, 116494.0, 124430.75, 137...",105528.0,137098.0,119373.50,1.044980e+08,0.148479,-0.942185,0.543673,main
...,...,...,...,...,...,...,...,...,...,...,...,...
"{'name': 'TimeIncrement', 'type': 'region'}",7,23,dset8,"[540.0, 202210.25, 269561.0, 361367.0, 423809.0]",540.0,423809.0,263538.75,1.775294e+10,0.608147,-0.558767,-0.563893,TimeIncrement
"{'name': 'TimeIncrement', 'type': 'region'}",0,23,dset9,"[540.0, 202210.25, 269561.0, 361367.0, 423809.0]",540.0,423809.0,263538.75,1.775294e+10,0.608147,-0.558767,-0.563893,TimeIncrement
"{'name': 'TimeIncrement', 'type': 'region'}",0,23,dset2,"[540.0, 202210.25, 269561.0, 361367.0, 423809.0]",540.0,423809.0,263538.75,1.775294e+10,0.608147,-0.558767,-0.563893,TimeIncrement
"{'name': 'TimeIncrement', 'type': 'region'}",7,23,dset3,"[540.0, 202210.25, 269561.0, 361367.0, 423809.0]",540.0,423809.0,263538.75,1.775294e+10,0.608147,-0.558767,-0.563893,TimeIncrement


In [6]:
bp.gf['time'].dataframe.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 240 entries, (Node({'name': 'main', 'type': 'region'}), 0) to (Node({'name': 'TimeIncrement', 'type': 'region'}), 0)
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   hatchet_nid  240 non-null    int64  
 1   dataset      240 non-null    object 
 2   q            240 non-null    object 
 3   min          240 non-null    float64
 4   max          240 non-null    float64
 5   mean         240 non-null    float64
 6   var          240 non-null    float64
 7   imb          240 non-null    float64
 8   kurt         240 non-null    float64
 9   skew         240 non-null    float64
 10  name         240 non-null    object 
dtypes: float64(7), int64(1), object(3)
memory usage: 22.3+ KB
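
As a rough guide to how these columns relate to one another, the following sketch computes the same kinds of statistics with NumPy and pandas for one hypothetical list of runtimes. It is not hatchet's implementation: the function name `summarize`, the sample values, the use of population variance, and pandas' default kurtosis and skewness estimators are all assumptions. The `imb` formula, (max - mean) / mean, is inferred from the values displayed above (for `main`, (137098.0 - 119373.5) / 119373.5 ≈ 0.148479), and `q` appears to hold the minimum, first quartile, median, third quartile, and maximum.

```python
import numpy as np
import pandas as pd


def summarize(values):
    """Return boxplot-style statistics for a 1-D collection of runtimes."""
    s = pd.Series(values, dtype="float64")
    mean = s.mean()
    return {
        # Five-number summary: min, Q1, median, Q3, max.
        "q": np.percentile(s, [0, 25, 50, 75, 100]).tolist(),
        "min": s.min(),
        "max": s.max(),
        "mean": mean,
        "var": s.var(ddof=0),  # population variance (an assumption)
        "imb": (s.max() - mean) / mean,  # imbalance, inferred from the output above
        "kurt": s.kurt(),  # pandas' default (bias-corrected) excess kurtosis
        "skew": s.skew(),  # pandas' default (bias-corrected) skewness
    }


# Hypothetical runtimes for one callsite.
stats = summarize([2.0, 4.0, 4.0, 6.0, 14.0])
print(stats)

# Check the inferred imbalance formula against the `main` row shown above.
imb_main = (137098.0 - 119373.5) / 119373.5
print(imb_main)
```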


Using the **roundtrip** interface, we can then visualize the computed boxplot information. Below, we load the roundtrip interface, which allows users to visualize plots directly in Jupyter notebook cells.

In [7]:
# This is the relative path from the notebook to Roundtrip files in hatchet/external/roundtrip/
roundtrip_path = '../../../hatchet/external/roundtrip/'
hatchet_path = "."

# Add the path so that the notebook can find the Roundtrip extension
module_path = os.path.abspath(roundtrip_path)
if module_path not in sys.path:
    sys.path.append(module_path)
    sys.path.append(hatchet_path)

    
# Uncomment this line to widen the cells to handle large trees 
#display(HTML("<style>.container { width:100% !important; }</style>"))

# Load the Roundtrip extension. This only needs to be loaded once.
%load_ext roundtrip

Since **roundtrip** expects the data in JSON format, the **BoxPlot** API exposes a method, `to_json()`, which dumps the boxplot's GraphFrames (i.e., `tgt` and `bkg`) as JSON.

In [8]:
bp_json = bp.to_json()

In [9]:
bp_json

{(0,
  'dset9'): {'time': {'q': [105528.0,
    113072.25,
    116494.0,
    124430.75,
    137098.0],
   'min': 105528.0,
   'max': 137098.0,
   'mean': 119373.5,
   'var': 104497970.25,
   'imb': 0.14847935262013764,
   'kurt': -0.9421848873183336,
   'skew': 0.5436725364039101,
   'name': 'main',
   'node': Node({'name': 'main', 'type': 'region'}),
   'rank': 0}},
 (0,
  'dset7'): {'time': {'q': [105528.0,
    113072.25,
    116494.0,
    124430.75,
    137098.0],
   'min': 105528.0,
   'max': 137098.0,
   'mean': 119373.5,
   'var': 104497970.25,
   'imb': 0.14847935262013764,
   'kurt': -0.9421848873183336,
   'skew': 0.5436725364039101,
   'name': 'main',
   'node': Node({'name': 'main', 'type': 'region'}),
   'rank': 4}},
 (0,
  'dset1'): {'time': {'q': [105528.0,
    113072.25,
    116494.0,
    124430.75,
    137098.0],
   'min': 105528.0,
   'max': 137098.0,
   'mean': 119373.5,
   'var': 104497970.25,
   'imb': 0.14847935262013764,
   'kurt': -0.9421848873183336,
   'skew': 0

Now, we can trigger the visualization using the **roundtrip** magic command `%loadVisualization`. It expects `roundtrip_path` (the path in which roundtrip resides), `"boxplot"` (the identifier of the visualization type), and the variable containing the data for the boxplots (here, `bp_json`).

Interactions on the boxplot visualization:
1. Users can select the metric of interest to visualize the corresponding runtime information.
2. Users can sort the callsites by their statistical attributes (i.e., mean, min, max, variance, imbalance, kurtosis and skewness).
3. Users can select the sorting order (i.e., ascending or descending).
4. Users can select the number of callsites that would be visualized.
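
The sorting and top-N interactions listed above can also be approximated outside the visualization with plain pandas. The sketch below assumes a dataframe with one row per callsite and columns such as `name`, `mean`, and `var`; the helper `top_callsites`, the callsite `hypothetical_callsite`, and its values are made up for illustration, and this is not part of the roundtrip interface.

```python
import pandas as pd

# One row per callsite; the third row is entirely hypothetical.
callsite_stats = pd.DataFrame(
    {
        "name": ["main", "TimeIncrement", "hypothetical_callsite"],
        "mean": [119373.5, 263538.75, 50000.0],
        "var": [1.044980e08, 1.775294e10, 2.5e07],
    }
)


def top_callsites(df, sort_by="mean", ascending=False, n=2):
    """Sort callsites by a statistic and keep the names of the first n."""
    ordered = df.sort_values(sort_by, ascending=ascending)
    return ordered.head(n)["name"].tolist()


top = top_callsites(callsite_stats)
print(top)
```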

In [10]:
%loadVisualization roundtrip_path "boxplot" bp_json

KeyError: '"boxplot"'

Once the exploration of the variability is done, users can retrieve the data shown in their visualization using the `%fetchData` magic command. As with `%loadVisualization`, we have to specify `"boxplot"` to identify the visualization type. The results are stored in the given variable (here, `result_csv`) in `.csv` format.

In [None]:
%fetchData "boxplot" result_csv

In [None]:
print(result_csv)

The `.csv` formatted output can be converted to a dataframe as shown below.

In [None]:
import pandas as pd

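# Rows are separated by ';' and fields by ','; the first row holds the column names.
columns = result_csv.split(';')[0].split(',')
data = [x.split(',') for x in result_csv.split(';')[1:]]
# Index the resulting dataframe by callsite name.
df = pd.DataFrame(data, columns=columns).set_index('name')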

In [None]:
df
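
Because the parsing above works on split strings, every column in `df` holds text. If numeric columns are needed, a conversion step can follow. The sketch below assumes the same layout (rows separated by `;`, fields by `,`, header in the first row) and uses a made-up string, `result_csv_example`, in place of the real `%fetchData` output.

```python
import pandas as pd

# Hypothetical stand-in for the string returned by %fetchData.
result_csv_example = "name,min,max,mean;main,1.0,3.0,2.0;TimeIncrement,2.0,6.0,4.0"

rows = result_csv_example.split(";")
columns = rows[0].split(",")
# Skip empty rows, e.g. from a trailing ';'.
data = [row.split(",") for row in rows[1:] if row]

df_example = pd.DataFrame(data, columns=columns).set_index("name")
# Convert the remaining columns from strings to numbers.
df_example = df_example.apply(pd.to_numeric)
print(df_example.dtypes)
```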