# Using pypop to calculate the standard POP MPI metrics with Extrae

Pypop is designed to automate the process of calculating the POP metrics, making the process fast and efficient while still providing the user with flexibility to modify the workflow as necessary for the particular task at hand.

Pypop uses Pandas DataFrames internally as they are fast, flexible, and allow the user to trivially access and use the data in their own scripts as needed.

## Setting up

Setting up pypop requires installing it with `setup.py` or `pip`, and making sure that the `Paramedir` and `Dimemas` binaries are available, ideally on the system `$PATH`.
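
Before running the analysis, it can help to confirm that both tools are discoverable. The following is a minimal sketch using only the standard library; the executable names `paramedir` and `Dimemas` are assumptions and may differ on your installation, and `find_missing_tools` is a hypothetical helper, not part of pypop.

```python
import shutil

# Assumed executable names; adjust them if your installation differs
REQUIRED_TOOLS = ("paramedir", "Dimemas")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found on PATH")
```

If a tool is reported as missing, its directory can be supplied with `set_paramedir_path` or `set_dimemas_path` instead of editing `$PATH`.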

In [None]:
import os
import sys

# If Paramedir or Dimemas are not on your PATH, you can add their directories below

from pypop.config import set_dimemas_path, set_paramedir_path
#set_paramedir_path('~/downloads/wxparaver-4.8.2-Linux_x86_64/bin/')
#set_dimemas_path('/path/to/dimemas/bin')

# Import the functions needed to calculate the standard MPI metrics
from pypop.traceset import TraceSet
from pypop.metrics import MPI_Metrics

## Traces

Traces should be captured using Extrae in the normal way. Pypop transparently supports gzipped traces, and compressed traces are recommended for systems with low I/O speeds, such as those with network-based storage.
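
If you want to inspect a trace from your own scripts, the same plain-or-gzipped convenience can be approximated with the standard library. This is a sketch only, not pypop's internal implementation; `open_trace` and `read_first_line` are hypothetical helper names.

```python
import gzip


def open_trace(path):
    """Open a trace file as text, decompressing on the fly if it ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_first_line(path):
    """Return the first line of a plain or gzipped trace file."""
    with open_trace(path) as trace:
        return trace.readline().rstrip("\n")
```

Calling `read_first_line` on either a `.prv` or a `.prv.gz` file returns the same header line, so downstream code does not need to care which form is on disk.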

The traces for this example were captured for the open-source [EPOCH](https://cfsa-pmw.warwick.ac.uk/users/sign_in) PIC code (2D variant), using the `input.deck` file provided in the trace folder, for 1 to 16 MPI processes (ranks).
 
Assuming the traces were copied along with the example notebook, we can set the trace directory location to:

In [None]:
trace_directory = './epoch_example_traces/'

## Analysis

Start by finding all the gzipped `*.prv.gz` trace files in the trace directory, then create a `TraceSet`, which uses Paramedir and Dimemas to calculate the statistics.

In [None]:
# Make a list of the tracefiles that we want
trace_files = [os.path.join(trace_directory, f) for f in os.listdir(trace_directory) if f.endswith('.prv.gz')]

# Use Paramedir and Dimemas to calculate the statistics
statistics = TraceSet(trace_files, ignore_cache=True)
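
The list comprehension above only matches gzipped traces. If a directory may hold both compressed and uncompressed traces, the discovery step could be sketched as a small helper like the one below. `find_trace_files` is a hypothetical name introduced here for illustration, not a pypop function.

```python
from pathlib import Path


def find_trace_files(directory, suffixes=(".prv", ".prv.gz")):
    """Return sorted paths of files in `directory` ending in one of `suffixes`."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.name.endswith(tuple(suffixes))
    )
```

The result can be passed to `TraceSet` in place of `trace_files`, for example `find_trace_files(trace_directory)`. Sorting keeps the order of the list reproducible between runs, since `os.listdir` makes no ordering guarantee.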

The metrics can then be calculated from the statistics. In this case, we organise the runs by MPI commsize. Note that the metrics calculated here include the speedup and raw runtime for comparison purposes. The data is in the form of a Pandas DataFrame.

In [None]:
metrics = MPI_Metrics(statistics.by_commsize())
display(metrics)
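
To give a feel for what the efficiency columns represent, the sketch below computes three of the basic POP efficiencies from per-rank useful computation time, following the standard POP definitions (load balance is average over maximum useful time, communication efficiency is maximum useful time over runtime, and parallel efficiency is their product). It is an illustration only, not pypop's implementation, and the timings are made up.

```python
def pop_efficiencies(useful_times, runtime):
    """Compute basic POP efficiencies for one run.

    useful_times: per-rank time spent in useful computation (seconds)
    runtime: wall-clock runtime of the whole run (seconds)
    """
    average_useful = sum(useful_times) / len(useful_times)
    maximum_useful = max(useful_times)
    load_balance = average_useful / maximum_useful
    communication_efficiency = maximum_useful / runtime
    return {
        "load_balance": load_balance,
        "communication_efficiency": communication_efficiency,
        # Parallel efficiency is the product of the two factors above
        "parallel_efficiency": load_balance * communication_efficiency,
    }


# Made-up timings for a 4-rank run, for illustration only
example = pop_efficiencies([8.0, 6.0, 7.0, 7.0], runtime=10.0)
print(example)
```

Because the metrics are multiplicative, a low parallel efficiency can be traced back to whichever factor is responsible.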

## Visualising the metrics

These metrics can then be simply visualised using the inbuilt plotting routines:

In [None]:
metric_table = metrics.plot_table(title="POP MPI Metrics for the EPOCH2D PIC Code")
display(metric_table)

## Plotting Scaling

Scaling can be plotted similarly using the scaling plot routines:

In [None]:
metrics.plot_scaling(title="Strong Scaling Speedup for the EPOCH2D PIC Code")
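
For reference, the quantity shown in a strong-scaling plot can be sketched as follows: speedup is the reference runtime divided by the runtime at each process count, compared against the ideal linear speedup. This is a generic illustration with made-up runtimes, and pypop's own calculation may differ in detail; `strong_scaling` is a hypothetical helper.

```python
def strong_scaling(runtimes):
    """Map each process count to (speedup, ideal speedup, efficiency).

    runtimes: dict of {process count: runtime in seconds}; the smallest
    process count is used as the reference run.
    """
    reference_count = min(runtimes)
    reference_time = runtimes[reference_count]
    results = {}
    for count in sorted(runtimes):
        speedup = reference_time / runtimes[count]
        ideal = count / reference_count
        results[count] = (speedup, ideal, speedup / ideal)
    return results


# Made-up runtimes in seconds, not taken from the EPOCH traces
scaling = strong_scaling({1: 100.0, 2: 50.0, 4: 40.0})
for count, (speedup, ideal, efficiency) in scaling.items():
    print(f"{count:>3} ranks: speedup {speedup:.2f} (ideal {ideal:.0f}), "
          f"efficiency {efficiency:.2f}")
```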

## Advanced usage: getting to the data

The actual data can be recovered from the statistics object using one of the `by_*` functions. For example, `by_commsize` returns a dictionary containing the results for all the runs (as `RunData` objects), keyed by their MPI commsize.

In [None]:
dict_by_commsize = statistics.by_commsize()

display(dict_by_commsize)

### Accessing the statistics data

The `RunData` objects contain both metadata about the run and the calculated statistics used to build the metrics. The actual statistics are contained in a Pandas DataFrame, `RunData.stats`.

These can be viewed directly if desired, e.g. for the 8-rank case:

In [None]:
display(dict_by_commsize[8].stats)

### Accessing the metrics data

The metrics behave similarly. The primary difference is that the superclass `MetricSet` is subclassed to provide both pure MPI metrics (`MPI_Metrics`) and hybrid metrics (`MPI_OpenMP_Metrics`). The raw metrics data is accessible as a Pandas DataFrame, `MetricSet.metric_data`.

In [None]:
display(metrics.metric_data)
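
Because the raw metrics are an ordinary Pandas DataFrame, the usual Pandas operations apply. The sketch below uses a small stand-in DataFrame so that it runs on its own; the column names and values are made up for illustration and are not necessarily those used by pypop.

```python
import pandas as pd

# Stand-in for metrics.metric_data; column names and values are hypothetical
metric_data = pd.DataFrame(
    {
        "Number of Processes": [1, 2, 4],
        "Parallel Efficiency": [1.00, 0.95, 0.85],
    }
)

# Select the runs that fall below a chosen efficiency threshold
low_efficiency = metric_data[metric_data["Parallel Efficiency"] < 0.9]

# Serialise to CSV text; pass a filename to to_csv to write a file instead
csv_text = metric_data.to_csv(index=False)
print(csv_text)
```

The same filtering and export calls could be applied to `metrics.metric_data` directly, after checking its actual column names with `metrics.metric_data.columns`.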