# Performance Variability Boxplots

Performance variability boxplots provide insight into the runtime distribution and its variability across callsites. Each boxplot represents the range of the distribution, and outliers (drawn as dots) are the values that lie more than 1.5*IQR (interquartile range) beyond the box. Additionally, several statistical measures (mean, variance, kurtosis, and skewness) are provided.
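As a minimal sketch of the boxplot quantities described above, the snippet below computes a five-number summary and the 1.5*IQR outliers using only the Python standard library. The function names and the sample runtimes are hypothetical and chosen for illustration; this is not hatchet's implementation, and the quartile interpolation method is an assumption.

```python
import statistics


def five_number_summary(values):
    """Return [min, Q1, median, Q3, max] (quartiles by linear interpolation)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [min(values), q1, q2, q3, max(values)]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k*IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up runtimes with one obvious straggler.
runtimes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 50.0]
summary = five_number_summary(runtimes)
outliers = iqr_outliers(runtimes)
print(summary)   # [10.0, 12.0, 14.0, 16.0, 50.0]
print(outliers)  # [50.0]
```

Here the IQR is 4.0, so the fences sit at 6.0 and 22.0, and only the 50.0 runtime is flagged as an outlier.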

In [1]:
import os, sys
from IPython.display import HTML, display

# Hatchet imports
import hatchet as ht
from hatchet.util.unify_ensemble import unify_ensemble
from hatchet.util.boxplot import BoxPlot

First, we will construct ten **hatchet.GraphFrame** objects from a sample dataset in our repository, **caliper-lulesh-json**, label each with a dataset name, and unify them into a single ensemble GraphFrame.

In [2]:
data_dir = os.path.realpath("../../../hatchet/tests/data")
data_path = os.path.join(data_dir, "caliper-lulesh-json/lulesh-annotation-profile.json")

gf_list = []
for i in range(10):
    gf = ht.GraphFrame.from_caliper(data_path)
    gf.dataset = "dset{}".format(i)
    
    gf_list.append(gf)

gf_ensemble = unify_ensemble(gf_list)


In [3]:
gf_ensemble.dataframe

node,rank,dataset,name,nid,time,time (inc),hatchet_nid
"{'name': 'main', 'type': 'region'}",0,dset9,main,0,121489.0,5882425.0,0
"{'name': 'main', 'type': 'region'}",4,dset7,main,0,118953.0,5905595.0,0
"{'name': 'main', 'type': 'region'}",5,dset7,main,0,133256.0,5877613.0,0
"{'name': 'main', 'type': 'region'}",6,dset7,main,0,114035.0,5870933.0,0
"{'name': 'main', 'type': 'region'}",7,dset7,main,0,137098.0,5898724.0,0
...,...,...,...,...,...,...,...
"{'name': 'TimeIncrement', 'type': 'region'}",1,dset3,TimeIncrement,12,212402.0,212402.0,23
"{'name': 'TimeIncrement', 'type': 'region'}",2,dset3,TimeIncrement,12,171635.0,171635.0,23
"{'name': 'TimeIncrement', 'type': 'region'}",3,dset3,TimeIncrement,12,323519.0,323519.0,23
"{'name': 'TimeIncrement', 'type': 'region'}",3,dset5,TimeIncrement,12,323519.0,323519.0,23


Next, using the **hatchet.GraphFrame**, we can calculate the data required for the performance variability boxplot using an exposed hatchet API, **BoxPlot**.

The interface accepts the following arguments:
- multi_index_gf - Multi-indexed GraphFrame (required).
- drop_index_levels - List of index levels to drop in the ht.GraphFrame.dataframe to compute the variability (e.g., rank, dataset) (optional).
- metrics - List of inclusive/exclusive metrics (optional) [default = inc_metrics + exc_metrics].

Case: Multi-index gf has exactly 2 indexes.

In [4]:
gf_ensemble_copy = gf_ensemble.copy()
gf_ensemble_copy.dataframe = gf_ensemble_copy.dataframe.groupby(["node","dataset"]).agg({'name': 'first', 'time': "mean", 'time (inc)': "mean", 'nid': 'first', 'hatchet_nid': 'first'})
gf_ensemble_copy.dataframe

node,dataset,name,time,time (inc),nid,hatchet_nid
"{'name': 'main', 'type': 'region'}",dset0,main,119373.50,5889901.50,0,0
"{'name': 'main', 'type': 'region'}",dset1,main,119373.50,5889901.50,0,0
"{'name': 'main', 'type': 'region'}",dset2,main,119373.50,5889901.50,0,0
"{'name': 'main', 'type': 'region'}",dset3,main,119373.50,5889901.50,0,0
"{'name': 'main', 'type': 'region'}",dset4,main,119373.50,5889901.50,0,0
...,...,...,...,...,...,...
"{'name': 'TimeIncrement', 'type': 'region'}",dset5,TimeIncrement,263538.75,263538.75,12,23
"{'name': 'TimeIncrement', 'type': 'region'}",dset6,TimeIncrement,263538.75,263538.75,12,23
"{'name': 'TimeIncrement', 'type': 'region'}",dset7,TimeIncrement,263538.75,263538.75,12,23
"{'name': 'TimeIncrement', 'type': 'region'}",dset8,TimeIncrement,263538.75,263538.75,12,23


In [5]:
bp = BoxPlot(multi_index_gf=gf_ensemble_copy, metrics=["time", "time (inc)"])

In [6]:
bp.gf

{'time': <hatchet.graphframe.GraphFrame at 0x7fb683efa640>,
 'time (inc)': <hatchet.graphframe.GraphFrame at 0x7fb683efe670>}

In [7]:
bp.gf['time'].dataframe

node,q,min,max,mean,var,imb,kurt,skew,name,nid,hatchet_nid
"{'name': 'main', 'type': 'region'}","[119373.5, 119373.5, 119373.5, 119373.5, 11937...",119373.5,119373.5,119373.5,0.0,0.0,-3.0,0.0,main,0,0
"{'name': 'LagrangeLeapFrog', 'type': 'region'}","[894.5, 894.5, 894.5, 894.5, 894.5]",894.5,894.5,894.5,0.0,0.0,-3.0,0.0,LagrangeLeapFrog,1,1
"{'name': 'CalcTimeConstraintsForElems', 'type': 'region'}","[7439.0, 7439.0, 7439.0, 7439.0, 7439.0]",7439.0,7439.0,7439.0,0.0,0.0,-3.0,0.0,CalcTimeConstraintsForElems,9,2
"{'name': 'CalcCourantConstraintForElems', 'type': 'region'}","[28915.875, 28915.875, 28915.875, 28915.875, 2...",28915.875,28915.875,28915.875,0.0,0.0,-3.0,0.0,CalcCourantConstraintForElems,10,3
"{'name': 'CalcHydroConstraintForElems', 'type': 'region'}","[10302.875, 10302.875, 10302.875, 10302.875, 1...",10302.875,10302.875,10302.875,0.0,0.0,-3.0,0.0,CalcHydroConstraintForElems,11,4
"{'name': 'LagrangeElements', 'type': 'region'}","[713.875, 713.875, 713.875, 713.875, 713.875]",713.875,713.875,713.875,0.0,0.0,-3.0,0.0,LagrangeElements,2,5
"{'name': 'ApplyMaterialPropertiesForElems', 'type': 'region'}","[14166.625, 14166.625, 14166.625, 14166.625, 1...",14166.625,14166.625,14166.625,0.0,0.0,-3.0,0.0,ApplyMaterialPropertiesForElems,3,6
"{'name': 'EvalEOSForElems', 'type': 'region'}","[254763.75, 254763.75, 254763.75, 254763.75, 2...",254763.75,254763.75,254763.75,0.0,0.0,-3.0,0.0,EvalEOSForElems,4,7
"{'name': 'CalcEnergyForElems', 'type': 'region'}","[287552.875, 287552.875, 287552.875, 287552.87...",287552.875,287552.875,287552.875,0.0,0.0,-3.0,0.0,CalcEnergyForElems,5,8
"{'name': 'CalcPressureForElems', 'type': 'region'}","[177454.875, 177454.875, 177454.875, 177454.87...",177454.875,177454.875,177454.875,0.0,0.0,-3.0,0.0,CalcPressureForElems,6,9


Case: Multi-index gf has more than 2 indexes.

In [9]:
# TODO: reword the exception, add an example. 
# TODO: drop_index => drop_index_levels.
bp = BoxPlot(multi_index_gf=gf_ensemble, metrics=["time"])

Exception: multi_index_gf contains 3 indexes = ['node', 'rank', 'dataset']. ht.util.BoxPlot is limited to processing GraphFrames with 2 indexes. Please specify the `drop_index` by which BoxPlot API will compute the distribution to avoid ambiguity.

Case: Metric not found in dataframe.

In [10]:
bp = BoxPlot(multi_index_gf=gf_ensemble, metrics=["time (incx)"], drop_index_levels=["rank"])

Exception: time (incx) not found in the gf.dataframe.
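The two failure cases above (an unknown metric, and more than two index levels with no level to drop) can be sketched as simple input checks. The helper below is hypothetical and works on plain lists of names rather than on a GraphFrame; it only mirrors the behavior shown in this notebook and is not hatchet's implementation. Treating `node` as the level that is always kept is an assumption.

```python
def level_to_aggregate(index_names, columns, metrics, drop_index_levels=None):
    """Hypothetical check: return the index level(s) whose values form the distribution."""
    # Every requested metric must be a column of the dataframe.
    for metric in metrics:
        if metric not in columns:
            raise ValueError("{} not found in the dataframe".format(metric))

    # An explicit choice removes any ambiguity.
    if drop_index_levels:
        unknown = [lvl for lvl in drop_index_levels if lvl not in index_names]
        if unknown:
            raise ValueError("unknown index level(s): {}".format(unknown))
        return list(drop_index_levels)

    # With exactly two levels, aggregate over the non-node level (assumption).
    if len(index_names) == 2:
        return [name for name in index_names if name != "node"]

    raise ValueError(
        "{} index levels {}: specify which level to drop".format(
            len(index_names), index_names
        )
    )


columns = ["name", "nid", "time", "time (inc)", "hatchet_nid"]
levels = ["node", "rank", "dataset"]

print(level_to_aggregate(["node", "dataset"], columns, ["time"]))  # ['dataset']
print(level_to_aggregate(levels, columns, ["time"], ["rank"]))     # ['rank']
```

Calling the helper with all three levels and no `drop_index_levels`, or with the misspelled metric `time (incx)`, raises a `ValueError`, matching the two exception cases above.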

Case: Drop index by 'rank'.

In [12]:
bp = BoxPlot(multi_index_gf=gf_ensemble, drop_index_levels=["rank"], metrics=["time", "time (inc)"])


The **BoxPlot** API calculates the results and stores them as GraphFrames in a dictionary keyed by metric name (here, `time` and `time (inc)`).
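To make the effect of dropping an index level concrete, the sketch below groups toy records by the levels that remain, so that the values across the dropped level form the distribution for each group. The records, their values, and the `group_distribution` helper are made up for illustration and do not reflect how hatchet stores or processes its data.

```python
from collections import defaultdict

# Toy records of the form (node, rank, dataset, time); values are invented.
records = [
    ("main", 0, "dset0", 10.0),
    ("main", 1, "dset0", 14.0),
    ("main", 0, "dset1", 11.0),
    ("main", 1, "dset1", 15.0),
    ("solve", 0, "dset0", 3.0),
    ("solve", 1, "dset0", 5.0),
]


def group_distribution(records, drop_level):
    """Collect times per (node, remaining level) after dropping 'rank' or 'dataset'."""
    keep = {"rank": 2, "dataset": 1}[drop_level]  # tuple position of the level to keep
    groups = defaultdict(list)
    for record in records:
        groups[(record[0], record[keep])].append(record[3])
    return dict(groups)


# Dropping 'rank': one distribution over ranks per (node, dataset).
by_dataset = group_distribution(records, "rank")
# Dropping 'dataset': one distribution over datasets per (node, rank).
by_rank = group_distribution(records, "dataset")

print(by_dataset[("main", "dset0")])  # [10.0, 14.0]
print(by_rank[("main", 0)])           # [10.0, 11.0]
```

Each resulting list is what a boxplot summarizes for that group.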

In [13]:
bp.gf['time'].dataframe

node,dataset,q,min,max,mean,var,imb,kurt,skew,hatchet_nid,nid,name
"{'name': 'main', 'type': 'region'}",dset9,"[105528.0, 113072.25, 116494.0, 124430.75, 137...",105528.0,137098.0,119373.50,1.044980e+08,0.148479,-0.942185,0.543673,0,0,main
"{'name': 'main', 'type': 'region'}",dset7,"[105528.0, 113072.25, 116494.0, 124430.75, 137...",105528.0,137098.0,119373.50,1.044980e+08,0.148479,-0.942185,0.543673,0,0,main
"{'name': 'main', 'type': 'region'}",dset1,"[105528.0, 113072.25, 116494.0, 124430.75, 137...",105528.0,137098.0,119373.50,1.044980e+08,0.148479,-0.942185,0.543673,0,0,main
"{'name': 'main', 'type': 'region'}",dset3,"[105528.0, 113072.25, 116494.0, 124430.75, 137...",105528.0,137098.0,119373.50,1.044980e+08,0.148479,-0.942185,0.543673,0,0,main
"{'name': 'main', 'type': 'region'}",dset6,"[105528.0, 113072.25, 116494.0, 124430.75, 137...",105528.0,137098.0,119373.50,1.044980e+08,0.148479,-0.942185,0.543673,0,0,main
...,...,...,...,...,...,...,...,...,...,...,...,...
"{'name': 'TimeIncrement', 'type': 'region'}",dset8,"[540.0, 202210.25, 269561.0, 361367.0, 423809.0]",540.0,423809.0,263538.75,1.775294e+10,0.608147,-0.558767,-0.563893,23,12,TimeIncrement
"{'name': 'TimeIncrement', 'type': 'region'}",dset9,"[540.0, 202210.25, 269561.0, 361367.0, 423809.0]",540.0,423809.0,263538.75,1.775294e+10,0.608147,-0.558767,-0.563893,23,12,TimeIncrement
"{'name': 'TimeIncrement', 'type': 'region'}",dset2,"[540.0, 202210.25, 269561.0, 361367.0, 423809.0]",540.0,423809.0,263538.75,1.775294e+10,0.608147,-0.558767,-0.563893,23,12,TimeIncrement
"{'name': 'TimeIncrement', 'type': 'region'}",dset3,"[540.0, 202210.25, 269561.0, 361367.0, 423809.0]",540.0,423809.0,263538.75,1.775294e+10,0.608147,-0.558767,-0.563893,23,12,TimeIncrement


In [14]:
bp.gf['time'].dataframe.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 240 entries, (Node({'name': 'main', 'type': 'region'}), 'dset9') to (Node({'name': 'TimeIncrement', 'type': 'region'}), 'dset4')
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   q            240 non-null    object 
 1   min          240 non-null    float64
 2   max          240 non-null    float64
 3   mean         240 non-null    float64
 4   var          240 non-null    float64
 5   imb          240 non-null    float64
 6   kurt         240 non-null    float64
 7   skew         240 non-null    float64
 8   hatchet_nid  240 non-null    object 
 9   nid          240 non-null    object 
 10  name         240 non-null    object 
dtypes: float64(7), object(4)
memory usage: 22.3+ KB
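The summary columns listed above (`mean`, `var`, `imb`, `kurt`, `skew`) can be approximated with a few lines of plain Python. This is a sketch under stated assumptions, not hatchet's code: it uses population (biased) moments, excess kurtosis, and defines imbalance as (max - mean) / mean. The imbalance formula is inferred from the values printed in this notebook, and the zero-variance convention (skew 0, kurtosis -3) is chosen to match the zero-variance rows in the outputs.

```python
def summary_stats(values):
    """Return a dict with mean, var, imb, skew and kurt for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # population variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    imb = (max(values) - mean) / mean if mean != 0 else 0.0
    if m2 == 0:
        # Degenerate distribution: assumed convention matching the outputs here.
        skew, kurt = 0.0, -3.0
    else:
        skew = m3 / m2 ** 1.5
        kurt = m4 / m2 ** 2 - 3.0  # excess (Fisher) kurtosis
    return {"mean": mean, "var": m2, "imb": imb, "skew": skew, "kurt": kurt}


stats = summary_stats([1.0, 2.0, 3.0, 4.0, 5.0])
constant = summary_stats([119373.5] * 10)

# Imbalance of 'main' recomputed from the max and mean shown in the table.
imb_main = (137098.0 - 119373.5) / 119373.5
print(round(imb_main, 6))  # 0.148479
```

For the constant input every value equals the mean, so the variance, imbalance, and skewness are zero, which is the pattern seen whenever all values in a group are identical.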


Case: Drop index by 'dataset'.

In [15]:
bp = BoxPlot(multi_index_gf=gf_ensemble, metrics=["time"], drop_index_levels=["dataset"])


In [16]:
bp.gf['time'].dataframe

node,rank,q,min,max,mean,var,imb,kurt,skew,hatchet_nid,time (inc),nid,name
"{'name': 'main', 'type': 'region'}",0,"[121489.0, 121489.0, 121489.0, 121489.0, 12148...",121489.0,121489.0,121489.0,0.0,0.0,-3.0,0.0,0,5882425.0,0,main
"{'name': 'main', 'type': 'region'}",4,"[118953.0, 118953.0, 118953.0, 118953.0, 11895...",118953.0,118953.0,118953.0,0.0,0.0,-3.0,0.0,0,5905595.0,0,main
"{'name': 'main', 'type': 'region'}",5,"[133256.0, 133256.0, 133256.0, 133256.0, 13325...",133256.0,133256.0,133256.0,0.0,0.0,-3.0,0.0,0,5877613.0,0,main
"{'name': 'main', 'type': 'region'}",6,"[114035.0, 114035.0, 114035.0, 114035.0, 11403...",114035.0,114035.0,114035.0,0.0,0.0,-3.0,0.0,0,5870933.0,0,main
"{'name': 'main', 'type': 'region'}",7,"[137098.0, 137098.0, 137098.0, 137098.0, 13709...",137098.0,137098.0,137098.0,0.0,0.0,-3.0,0.0,0,5898724.0,0,main
...,...,...,...,...,...,...,...,...,...,...,...,...,...
"{'name': 'TimeIncrement', 'type': 'region'}",2,"[171635.0, 171635.0, 171635.0, 171635.0, 17163...",171635.0,171635.0,171635.0,0.0,0.0,-3.0,0.0,23,171635.0,12,TimeIncrement
"{'name': 'TimeIncrement', 'type': 'region'}",1,"[212402.0, 212402.0, 212402.0, 212402.0, 21240...",212402.0,212402.0,212402.0,0.0,0.0,-3.0,0.0,23,212402.0,12,TimeIncrement
"{'name': 'TimeIncrement', 'type': 'region'}",0,"[418469.0, 418469.0, 418469.0, 418469.0, 41846...",418469.0,418469.0,418469.0,0.0,0.0,-3.0,0.0,23,418469.0,12,TimeIncrement
"{'name': 'TimeIncrement', 'type': 'region'}",7,"[540.0, 540.0, 540.0, 540.0, 540.0]",540.0,540.0,540.0,0.0,0.0,-3.0,0.0,23,540.0,12,TimeIncrement


Using the **roundtrip** interface, we can then visualize the computed boxplot information. The roundtrip interface allows users to visualize plots directly in Jupyter notebook cells. Below, we first export the boxplot data to JSON.

In [None]:
bp_json = bp.to_json()

In [None]:
bp_json