# FRESCO Anvil Database Access and Analytics Notebook

## Overview

The FRESCO Analytics Notebook supports interactive analysis of the Anvil dataset. It lets you:

- Extract filtered data from the Anvil database.
- Conduct statistical analyses on the filtered data.
- Visualize the results of the analyses.

The notebook is structured into three main sections:

### 1. Data Filtering
Define your analysis scope by selecting a specific datetime window. Customize your dataset further with various filters.

### 2. Data Analysis Options
Choose the statistics to compute (Mean, Median, Standard Deviation, PDF, CDF, Ratio of Data Outside Threshold) and, where applicable, the interval over which to compute them.

### 3. Data Analysis and Visualizations
Execute the chosen analysis on the filtered dataset and visualize the outcomes.

## Step-by-Step Instructions

- **Cell 1:** Define the dataset's temporal boundaries. This will guide the extraction of relevant host time series and job accounting data. Additional conditions can be added for more refined data filtering.

- **Cell 2:** Configure your statistical analyses by:
  - Choosing statistics (e.g., Mean, Median).
  - Setting a threshold for "Ratio of Data Outside Threshold".
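  - Selecting an interval type (Count or Time). For "Count", set the number of rows per window. For "Time", set the time unit and the number of units per window. Mean, Median, and Standard Deviation are only calculated and plotted when an interval type is selected.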

- **Cell 3:** Analyze the time series data based on Cell 2 configurations and generate visualizations.

- **Cell 4:** Select two metrics and plot their Pearson correlation.
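
The two interval types configured in Cell 2 can be illustrated with plain pandas rolling windows. This is a minimal sketch on made-up data, assuming the `nbf` helpers behave like `Series.rolling`; their actual implementation is not shown in this notebook.

```python
import pandas as pd

# Hypothetical sample: one metric sampled every 30 minutes.
index = pd.date_range("2023-01-01 00:00", periods=6, freq="30min")
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], index=index)

# "Count" interval: each window covers a fixed number of rows.
count_mean = values.rolling(window=3).mean()

# "Time" interval: each window covers a fixed time span (here 60 minutes)
# that ends at, and includes, the current timestamp.
time_mean = values.rolling(window="60min").mean()

print(count_mean.iloc[2])  # mean of the first three rows: 20.0
print(time_mean.iloc[2])   # mean of the rows at 00:30 and 01:00: 25.0
```

A count window yields no value until it holds the full number of rows, whereas a time window produces a value from the first row onward.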

## Database Table Information

### Host Table
- **jid**: Unique job identifier
- **host**: Origin node of the data point
- **event**: Resource usage metric type
- **value**: Numeric value of the metric
- **unit**: Measurement unit of the metric
- **time**: Timestamp of the data point


**Event Column Metrics:**
- **cpuuser:** Average percentage of CPU time spent in user mode (CPU %).
- **block:** Data transfer rate to and from block devices (GB/s).
- **memused:** Total physical memory in use by the OS (GB).
- **memused_minus_diskcache:** Physical memory in use, excluding caches (GB).
- **gpu_usage:** Average percentage of time the GPU is active (GPU %; GPU jobs only).
- **nfs:** Data transfer rate over NFS mounts (MB/s).
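
Because the Host Table stores one row per metric reading, working with a single metric means filtering on the `event` column. The sketch below uses a small hand-made DataFrame in the same column layout; the job ID, node name, unit strings, and values are all hypothetical.

```python
import pandas as pd

# Hypothetical rows in the Host Table layout (all values are made up).
host_df = pd.DataFrame({
    "jid": ["JOB1"] * 4,
    "host": ["NODE1"] * 4,
    "event": ["cpuuser", "memused", "cpuuser", "memused"],
    "value": [85.0, 12.5, 90.0, 13.0],
    "unit": ["CPU %", "GB", "CPU %", "GB"],
    "time": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 00:00",
        "2023-01-01 00:05", "2023-01-01 00:05",
    ]),
})

# Keep only the readings for one metric.
cpu_df = host_df[host_df["event"] == "cpuuser"]

# Reshape to one column per metric, indexed by timestamp.
wide = host_df.pivot_table(index="time", columns="event", values="value")
print(wide)
```

The wide layout is convenient when comparing metrics against each other at the same timestamps.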

### Job Table
- **account**: Account or project name
- **jid**: Unique job identifier
- **ncores**: Total cores assigned to the job
- **ngpus**: Total GPUs assigned to the job
- **nhosts**: Number of nodes assigned to the job
- **timelimit**: Requested job duration (in seconds)
- **queue**: Job submission queue name
- **end_time**: Job end time
- **start_time**: Job start time
- **submit_time**: Job submission time
- **username**: Job owner's name
- **exitcode**: Job's exit status
- **host_list**: List of nodes the job ran on
- **jobname**: Job's name
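
The three timestamp columns make it possible to derive queue wait time and run time for each job. The following is a rough sketch on made-up rows, assuming the timestamps parse as datetimes and that `timelimit` is in seconds as listed above; the derived column names (`wait_s`, `run_s`, `limit_used`) are hypothetical.

```python
import pandas as pd

# Hypothetical rows using a subset of the Job Table columns.
job_df = pd.DataFrame({
    "jid": ["JOB1", "JOB2"],
    "timelimit": [3600, 7200],
    "submit_time": pd.to_datetime(["2023-01-01 00:00:00", "2023-01-01 01:00:00"]),
    "start_time": pd.to_datetime(["2023-01-01 00:10:00", "2023-01-01 01:30:00"]),
    "end_time": pd.to_datetime(["2023-01-01 00:40:00", "2023-01-01 03:30:00"]),
})

# Time spent waiting in the queue, in seconds.
job_df["wait_s"] = (job_df["start_time"] - job_df["submit_time"]).dt.total_seconds()

# Time spent running, in seconds.
job_df["run_s"] = (job_df["end_time"] - job_df["start_time"]).dt.total_seconds()

# Fraction of the requested time limit that the job actually used.
job_df["limit_used"] = job_df["run_s"] / job_df["timelimit"]

print(job_df[["jid", "wait_s", "run_s", "limit_used"]])
```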

In [2]:
import notebook_functions as nbf
import matplotlib.pyplot as plt
import pandas as pd
import ipywidgets as widgets
from datetime import datetime
from IPython.display import display, clear_output, HTML

time_series_df, host_data_sql_query = nbf.display_widgets()


GridBox(children=(VBox(children=(HTML(value='<h1>Query the Host Data Table</h1>'), HTML(value='<h5>Please sele…

In [3]:
try:
    stats = widgets.SelectMultiple(
        options=['None', 'Mean', 'Median', 'Standard Deviation', 'PDF', 'CDF', 'Ratio of Data Outside Threshold'],
        value=['None'],
        description='Statistics',
        disabled=False
    )

    ratio_threshold = widgets.IntText(
        value=0,
        description='Value:',
        disabled=True  # disabled by default
    )

    interval_type = widgets.Dropdown(
        options=['None', 'Count', 'Time'],
        value='None',
        description='Interval Type',
        disabled=True  # disabled by default
    )

    time_units = widgets.Dropdown(
        options=['None', 'Days', 'Hours', 'Minutes', 'Seconds'],
        value='None',
        description='Interval Unit',
        disabled=True  # disabled by default
    )

    time_value = widgets.IntText(
        value=0,
        description='Value:',
        disabled=True  # disabled by default
    )

    # Enable or disable dependent widgets whenever the statistics selection changes
    def on_stats_change(change):
        selected = change['new']

        # The threshold input only applies to 'Ratio of Data Outside Threshold'
        ratio_threshold.disabled = "Ratio of Data Outside Threshold" not in selected

        # Check for an empty selection before inspecting the first entry
        if selected and selected[0] != "None":
            # enable interval_type if stats is not None
            interval_type.disabled = False
        else:
            # disable interval_type if stats is None
            interval_type.disabled = True
            interval_type.value = 'None'  # reset interval_type to 'None'

    # Only react to changes of the widget's value
    stats.observe(on_stats_change, names='value')

    # Enable the interval inputs that apply to the selected interval type
    def on_interval_type_change(change):
        if change['new'] == "Time":
            # A time window needs both a unit and a count
            time_units.disabled = False
            time_value.disabled = False
        elif change['new'] == "Count":
            # A row-count window only needs the count
            time_units.disabled = True
            time_value.disabled = False
        else:
            # 'None': disable and reset both inputs
            time_units.disabled = True
            time_value.disabled = True
            time_units.value = 'None'  # reset time_units to 'None'
            time_value.value = 0  # reset time_value to 0

    # Only react to changes of the widget's value
    interval_type.observe(on_interval_type_change, names='value')

    # Display the widgets
    print("Please select a statistic to calculate.")
    display(stats)
    print("Please provide the threshold if 'Ratio of Data Outside Threshold' was selected.")
    display(ratio_threshold)
    print("Please select an interval type to use in the statistic calculation. If count is selected, the interval will correspond to a count of rows. If time is selected, the interval will be a time window.")
    display(interval_type)
    print("If time was selected, please select the unit of time.")
    display(time_units)
    print("Please provide the interval count.")
    display(time_value)

    time_series_df = nbf.remove_columns(time_series_df)
except NameError:
    print("ERROR: Please make sure to run the previous notebook cell before executing this one.")

Please select a statistic to calculate.


SelectMultiple(description='Statistics', index=(0,), options=('None', 'Mean', 'Median', 'Standard Deviation', …

Please provide the threshold if 'Ratio of Data Outside Threshold' was selected.


IntText(value=0, description='Value:', disabled=True)

Please select an interval type to use in the statistic calculation. If count is selected, the interval will correspond to a count of rows. If time is selected, the interval will be a time window.


Dropdown(description='Interval Type', disabled=True, options=('None', 'Count', 'Time'), value='None')

If time was selected, please select the unit of time.


Dropdown(description='Interval Unit', disabled=True, options=('None', 'Days', 'Hours', 'Minutes', 'Seconds'), …

Please provide the interval count.


IntText(value=0, description='Value:', disabled=True)
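
The "Ratio of Data Outside Threshold" statistic is computed by an `nbf` helper whose implementation is not shown in this notebook. As a rough illustration only, the sketch below treats "outside" as "strictly greater than the threshold", which is an assumption; the function name `ratio_outside_threshold` is hypothetical.

```python
import pandas as pd

def ratio_outside_threshold(values: pd.Series, threshold: float) -> float:
    """Return the fraction of non-missing values strictly above `threshold`."""
    values = values.dropna()
    if values.empty:
        return 0.0
    # The mean of a boolean mask is the fraction of True entries.
    return float((values > threshold).mean())

# Made-up CPU percentages: three of the four readings exceed 50.
sample = pd.Series([10.0, 55.0, 80.0, 95.0])
ratio = ratio_outside_threshold(sample, threshold=50)
print(ratio)  # 0.75
```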

In [4]:
try:
    %matplotlib inline
    # Convert the 'time' column to datetime and use it as a sorted index
    try:
        time_series_df['time'] = pd.to_datetime(time_series_df['time'])
        time_series_df = time_series_df.set_index('time')
        time_series_df = time_series_df.sort_index()
    except KeyError:
        # 'time' is already the index if this cell has been run before
        pass

    # Map each statistic name to the helper that computes or plots it
    metric_func_map = {
        "Mean": nbf.get_mean,
        "Median": nbf.get_median,
        "Standard Deviation": nbf.get_standard_deviation,
        "PDF": nbf.plot_pdf,
        "CDF": nbf.plot_cdf,
        "Ratio of Data Outside Threshold": nbf.plot_data_points_outside_threshold
    }

    unit_map = {
        "CPU %": "cpuuser",
        "GPU %": "gpu_usage",
        "GB:memused": "memused",
        "GB:memused_minus_diskcache": "memused_minus_diskcache",
        "GB/s": "block",
        "MB/s": "nfs"
    }

    units = nbf.parse_host_data_query(host_data_sql_query, unit_map)  # get units requested in SQL query

    # set up outputs and tabbed layout
    tab = widgets.Tab()
    outputs = {}
    stat_values = []
    basic_stats = ['Mean', 'Median', 'Standard Deviation']

    # Populate the outputs dictionary
    for unit in units:
        outputs[unit] = {}
        if any(stat in stats.value for stat in basic_stats):
            for stat in stats.value + ('Box and Whisker',):
                outputs[unit][stat] = widgets.Output()
        else:
            for stat in stats.value:
                outputs[unit][stat] = widgets.Output()

    # set the tab children
    if any(stat in stats.value for stat in basic_stats):
        tab.children = [widgets.Accordion([widgets.Box([widgets.Label(stat), outputs[unit][stat]]) for stat in stats.value + ('Box and Whisker',)], titles=stats.value + ('Box and Whisker',)) for unit in units]
    else:
        tab.children = [widgets.Accordion([widgets.Box([widgets.Label(stat), outputs[unit][stat]]) for stat in stats.value], titles=stats.value) for unit in units]

    tab.titles = units

    with plt.style.context('fivethirtyeight'):
        unit_stat_dfs = {}
        time_map = {'Days': 'D', 'Hours': 'H', 'Minutes': 'T', 'Seconds': 'S'}
        for unit in units:
            unit_stat_dfs[unit] = {}
            for metric in stats.value:
                metric_df = time_series_df.query(f"`event` == '{unit_map[unit]}'")
                rolling = False

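                # Build the rolling window, skipping incomplete interval settings
                if interval_type.value == "Time" and time_units.value != "None" and time_value.value > 0:
                    rolling = True
                    window = f"{time_value.value}{time_map[time_units.value]}"
                elif interval_type.value == "Count" and time_value.value > 0:
                    rolling = True
                    window = time_value.value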

                # Handle special cases outside the rolling condition
                if metric == "PDF":
                    with outputs[unit][metric]:
                        unit_stat_dfs[unit][metric] = metric_func_map[metric](metric_df)
                    continue
                elif metric == "CDF":
                    with outputs[unit][metric]:
                        unit_stat_dfs[unit][metric] = metric_func_map[metric](metric_df)
                    continue
                elif metric == "Ratio of Data Outside Threshold":
                    with outputs[unit][metric]:
                        unit_stat_dfs[unit][metric] = metric_func_map[metric](ratio_threshold.value, metric_df)
                    continue

                # Only calculate and plot basic stats if rolling is True
                if rolling:
                    unit_stat_dfs[unit][metric] = metric_func_map[metric](metric_df, rolling=True, window=window)

                    # Plot stats
                    with outputs[unit][metric]:
                        unit_stat_dfs[unit][metric].plot()
                        x_axis_label = ""
                        if interval_type.value == "Count":
                            x_axis_label += f"Count - Rolling Window: {time_value.value} Rows"
                        elif interval_type.value == "Time":
                            x_axis_label += f"Timestamp - Rolling Window: {time_value.value}{time_map[time_units.value]}"
                        y_axis_label = unit
                        plt.gcf().autofmt_xdate()  # auto formats datetimes
                        plt.title(f"{unit} {metric}")
                        plt.legend(loc='upper left', fontsize=10)
                        plt.xlabel(x_axis_label)
                        plt.ylabel(y_axis_label)
                        plt.show()

            # Get the stats dataframes
            df_mean = unit_stat_dfs[unit].get('Mean')
            df_std = unit_stat_dfs[unit].get('Standard Deviation')
            df_median = unit_stat_dfs[unit].get('Median')

            # Plot box and whisker
            if any(df is not None for df in [df_mean, df_std, df_median]):
                with outputs[unit]['Box and Whisker']:
                    nbf.plot_box_and_whisker(df_mean, df_std, df_median)

        display(tab)
except NameError:
    print("ERROR: Please make sure to run the previous notebook cells before executing this one.")


ERROR: Please make sure to run the previous notebook cells before executing this one.
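
The PDF and CDF options in the cell above are drawn by `nbf.plot_pdf` and `nbf.plot_cdf`, whose internals are not shown here. For orientation, an empirical CDF can be sketched with NumPy as follows; the function name `empirical_cdf` and the sample values are hypothetical, and this is not necessarily how the `nbf` helpers compute it.

```python
import numpy as np

def empirical_cdf(values):
    """Return the sorted values and, for each, the fraction of samples <= it."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Made-up metric readings.
x, y = empirical_cdf([30.0, 10.0, 20.0, 40.0])
print(x)  # [10. 20. 30. 40.]
print(y)  # [0.25 0.5  0.75 1.  ]
```

Plotting `y` against `x` as a step curve gives the CDF. A histogram of the same values normalized to unit area gives a rough estimate of the PDF.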


In [None]:
try:
    def on_selection_change(change):
        if len(change.new) > 2:
            correlations.value = change.new[:2]

    def on_button_click(button):
        graph_output.clear_output()
        with graph_output:
            with plt.style.context('fivethirtyeight'):
                display(nbf.calculate_and_plot_correlation(time_series_df, correlations.value))

    correlations = widgets.SelectMultiple(
        options=['None', 'cpuuser', 'gpu_usage', 'nfs', 'block', 'memused', 'memused_minus_diskcache'],
        value=['None'],
        description='Metrics',
        disabled=False
    )

    plot_button = widgets.Button(
        description="Plot correlation",
        disabled=False,
        icon="chart-line"
    )
    plot_button.on_click(on_button_click)

    graph_output = widgets.Output()

    container = widgets.VBox([
        widgets.HBox(
            [correlations, plot_button],
            layout=widgets.Layout(
                width="50%",
                justify_content="space-between",
                align_items="center"
            )
        ),
        graph_output
    ])
    correlations.observe(on_selection_change, names='value')

    # Give the user the option to calculate correlations
    print("Please select two metrics below to find their Pearson correlation:")
    display(container)

except NameError:
    print("ERROR: Please make sure to run the previous notebook cells before executing this one.")
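
The correlation itself is delegated to `nbf.calculate_and_plot_correlation`, which is not shown in this notebook. As a minimal sketch of the underlying idea, assuming the two metrics are sampled at matching timestamps, the long-format data can be pivoted to one column per metric and passed to pandas' `Series.corr`. The sample values below are made up.

```python
import pandas as pd

# Hypothetical long-format readings for two metrics at the same timestamps.
times = pd.date_range("2023-01-01", periods=4, freq="5min")
long_df = pd.DataFrame({
    "time": list(times) * 2,
    "event": ["cpuuser"] * 4 + ["memused"] * 4,
    "value": [10.0, 20.0, 30.0, 40.0, 1.0, 2.0, 3.0, 4.0],
})

# One column per metric, aligned on timestamp.
wide = long_df.pivot_table(index="time", columns="event", values="value")

# Series.corr uses the Pearson method by default.
r = wide["cpuuser"].corr(wide["memused"])
print(round(r, 6))  # 1.0, because the made-up series are perfectly linear
```

If the two metrics are sampled at different timestamps, they would need to be resampled or aligned before correlating.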
