## FRESCO Analytics Notebook
### Overview
This notebook is designed to simplify analysis of the Anvil dataset. It lets the user load Anvil files stored locally, choose from a set of analysis options, and view the results.
### Instructions
1. Run the first cell and provide the complete directory path to the data in the 'Path' field.
2. Run the second cell. If data preprocessing is needed, select the desired options. Otherwise, skip to the "If data preprocessing options were not selected, follow these instructions" section below.
### If data preprocessing options were selected, follow these instructions
3. Run cell 3 and provide the start and end times.
4. Run cell 4 and select the units to be included in the timeseries data.
5. Run cell 5 and provide the desired options.
6. Run cell 6 and provide the desired statistic.
7. Run cell 7 and provide the desired data visualization options.
8. Run cell 8 to see the data visualizations.

### If data preprocessing options were not selected, follow these instructions
3. Run cell 4, provide the start and end times, and select the units to be included in the timeseries data.
4. Run cell 5 and provide the desired options.
5. Run cell 6 and provide the desired statistic.
6. Run cell 7 and provide the desired data visualization options.
7. Run cell 8 to see the data visualizations.


In [1]:
# -------------- CELL 1 --------------

from IPython.display import display
import ipywidgets as widgets
import pandas as pd

print(r"Please provide the directory path to the data files e.g., D:\Data")
dir_path = widgets.Text(
    value='',
    placeholder='',
    description='Path:',
    disabled=False
)
display(dir_path)

Please provide the directory path to the data files e.g., D:\Data


Text(value='', description='Path:', placeholder='')

In [2]:
# -------------- CELL 2 --------------

def get_data_files_directory(path):
    """
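    This function should return the path of the folder that contains the data files.
    :param path: directory path entered in the 'Path' field of cell 1
    :return: path of the folder that contains the data files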
    """
    pass


data_path = get_data_files_directory(dir_path.value)

# Data preprocessing: handling missing metrics
print("Data preprocessing: select this option if rows with missing metrics should be removed.")
missing_metrics = widgets.ToggleButton(
    value=False,
    description='Remove Rows with Missing Metrics?',
    disabled=False,
    button_style='',
    tooltip='Remove Rows with Missing Metrics?',
    icon='check'
)
display(missing_metrics)

print("Data preprocessing: select this option if an interval column should be added to the data.")
interval = widgets.ToggleButton(
    value=False,
    description='Add Interval Column?',
    disabled=False,
    button_style='',
    tooltip='Add Interval Column?',
    icon='check'
)
display(interval)

Data preprocessing: select this option if rows with missing metrics should be removed.


ToggleButton(value=False, description='Remove Rows with Missing Metrics?', icon='check', tooltip='Remove Rows …

Data preprocessing: select this option if an interval column should be added to the data.


ToggleButton(value=False, description='Add Interval Column?', icon='check', tooltip='Add Interval Column?')
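
The `get_data_files_directory` stub in cell 2 is left unimplemented. A minimal sketch of one way to fill it in follows, assuming the only requirement is that the text typed into the 'Path' field points to an existing directory. The error handling and the `pathlib.Path` return type are assumptions, not part of the original notebook.

```python
from pathlib import Path


def get_data_files_directory(path: str) -> Path:
    """Return the data directory as a Path, or raise if it cannot be used."""
    # An empty 'Path' field is treated as a user error (assumption).
    if not path or not path.strip():
        raise ValueError("No directory path was provided.")
    directory = Path(path.strip()).expanduser()
    # Fail early rather than letting later cells read from a bad location.
    if not directory.is_dir():
        raise NotADirectoryError(f"Not a directory: {directory}")
    return directory
```

With this version, `data_path` in cell 2 would be a `Path` object that later cells can join with file names.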

In [3]:
# -------------- CELL 3 --------------

if missing_metrics.value or interval.value:
    print("Please select the start time and end time for the data preprocessing.")
    start_time = widgets.Text(
        value='01-01-2020',
        placeholder='',
        description='Start Time:',
        disabled=False
    )

    end_time = widgets.Text(
        value='12-31-9999',
        placeholder='',
        description='End Time:',
        disabled=False
    )
    display(start_time, end_time)
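
The start and end times are collected as plain text, so they need to be parsed before they can be compared with timestamps. The sketch below assumes the `MM-DD-YYYY` format implied by the widget defaults (`'01-01-2020'`, `'12-31-9999'`); `parse_time_range` and `TIME_FORMAT` are hypothetical names. It uses the standard-library `datetime` because nanosecond-resolution pandas timestamps end in the year 2262, so the default end value may not convert with `pd.to_datetime`.

```python
from datetime import datetime

# Assumed from the widget defaults, e.g. '01-01-2020'.
TIME_FORMAT = "%m-%d-%Y"


def parse_time_range(start_text: str, end_text: str) -> tuple:
    """Parse the two text fields into (start, end) datetime objects."""
    start = datetime.strptime(start_text.strip(), TIME_FORMAT)
    end = datetime.strptime(end_text.strip(), TIME_FORMAT)
    # Reject an inverted range instead of silently returning no data.
    if start > end:
        raise ValueError("Start time must not be later than end time.")
    return start, end
```

A later cell could call `parse_time_range(start_time.value, end_time.value)` once the user has filled in both fields.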

In [4]:
# -------------- CELL 4 --------------

def handle_missing_metrics(starting_time, ending_time, path):
    """
    This function should remove the rows within the given timeframe that are missing metrics.
    :param starting_time:
    :param ending_time:
    :param path:
    :return:
    """
    pass


def add_interval_column(starting_time, ending_time, path):
    """
    This function should add an interval column to the data that falls within the given timeframe. The interval column
    should reflect the length of each timestamp.
    :param starting_time:
    :param ending_time:
    :param path:
    :return:
    """
    pass

dataframe = pd.DataFrame()

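# Note: both helpers are given data_path, so if both options are selected the
# result of add_interval_column replaces the result of handle_missing_metrics.
if missing_metrics.value:
    dataframe = handle_missing_metrics(start_time.value, end_time.value, data_path)

if interval.value:
    dataframe = add_interval_column(start_time.value, end_time.value, data_path)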

if not missing_metrics.value and not interval.value:
    print("Please enter a start time and end time.")
    start_time = widgets.Text(
        value='01-01-2020',
        placeholder='',
        description='Start Time:',
        disabled=False
    )

    end_time = widgets.Text(
        value='12-31-9999',
        placeholder='',
        description='End Time:',
        disabled=False
    )
    display(start_time, end_time)

print("Optional: select the units to be included in the timeseries data.")
units = widgets.SelectMultiple(
    options=['None', 'CPU %', 'GPU %', 'GB:memused', 'GB:memused_minus_diskcache', 'GB/s', 'MB/s'],
    value=['None'],
    description='Units:',
    disabled=False,
)

display(units)

Please enter a start time and end time.


Text(value='01-01-2020', description='Start Time:', placeholder='')

Text(value='12-31-9999', description='End Time:', placeholder='')

Optional: select the units to be included in the timeseries data.


SelectMultiple(description='Units:', index=(0,), options=('None', 'CPU %', 'GPU %', 'GB:memused', 'GB:memused_…
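
The two preprocessing stubs in cell 4 could be sketched as follows. This is an illustration under several assumptions: the data is already loaded into a DataFrame, the column names match the 'Return Columns' options used later in the notebook (`Job Id`, `Hosts`, `Values`, `Timestamps`), a "missing metric" is a missing `Values` entry, and the interval is the number of seconds since the previous sample of the same job on the same host. The function names are hypothetical and take a DataFrame rather than a path, unlike the stubs.

```python
import pandas as pd


def drop_missing_metric_rows(df: pd.DataFrame, start, end) -> pd.DataFrame:
    """Keep rows inside [start, end] whose 'Values' entry is present."""
    timestamps = pd.to_datetime(df["Timestamps"])
    in_window = timestamps.between(pd.Timestamp(start), pd.Timestamp(end))
    return df[in_window & df["Values"].notna()].reset_index(drop=True)


def with_interval_column(df: pd.DataFrame) -> pd.DataFrame:
    """Add 'Interval': seconds since the previous sample of the same job and host."""
    out = df.copy()
    out["Timestamps"] = pd.to_datetime(out["Timestamps"])
    out = out.sort_values(["Job Id", "Hosts", "Timestamps"])
    # The first sample of each (job, host) group has no predecessor, so it is NaN.
    out["Interval"] = (
        out.groupby(["Job Id", "Hosts"])["Timestamps"].diff().dt.total_seconds()
    )
    return out.reset_index(drop=True)
```

Because each function takes and returns a DataFrame, the two steps can be chained when both preprocessing options are selected.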

In [None]:
# -------------- CELL 5 --------------
# Maps each selected unit (from the units list above) to the (low_value, high_value) range chosen by the user.
unit_values = {}

def filter_values(value, low_value, high_value):
    print(f"For {value}: Low Value: {low_value}, High Value: {high_value}")
    unit_values[value] = (low_value, high_value)

value_widgets = []
for value in units.value:
    if value != 'None':
        low_value = widgets.FloatSlider(min=0.0, max=100.0, step=0.1, value=0.1)
        high_value = widgets.FloatSlider(min=0.0, max=100.0, step=0.1, value=0.1)
        interact = widgets.interactive(filter_values, value=value, low_value=low_value, high_value=high_value)
        value_widgets.append(interact)

display(widgets.VBox(value_widgets))



print("Optional: select the hosts to be included in the timeseries data e.g., 'NODE1, NODE2'")
hosts = widgets.Text(
    value='',
    placeholder='',
    description='Hosts:',
    disabled=False
)
display(hosts)

# TODO: explore using a qgrid here for nodes and jobs
# import qgrid
#
# df_nodes = pd.DataFrame(nodes_list, columns=['Nodes'])
#
# qgrid_widget = qgrid.show_grid(df_nodes, show_toolbar=False)
# qgrid_widget
# With qgrid, your list of nodes is displayed in an interactive table. You can click the filter icon in the header of the column to select multiple nodes. After making the selection, you can retrieve the selected nodes using:
# selected_nodes = qgrid_widget.get_selected_df()

print("Optional: select the jobs to be included in the timeseries data e.g., 'JOB1, JOB2'")
job_ids = widgets.Text(
    value='',
    placeholder='',
    description='Jobs:',
    disabled=False
)
display(job_ids)

print("Optional: select if you want the account logs to be returned for the Job IDs matching your query.")
return_account_logs = widgets.ToggleButton(
    value=False,
    description='Account Logs',
    disabled=False,
    button_style='',
    tooltip='Return Account Logs?',
    icon='check'
)
display(return_account_logs)

print("Optional: select the columns to be included in the timeseries data (hold control to select multiple). If no columns are "
      "selected, all columns will be included.")
timeseries_return_columns = widgets.SelectMultiple(
    options=['None', 'Job Id', 'Hosts', 'Events', 'Units', 'Values', 'Timestamps'],
    value=['None'],
    description='Return Columns',
    disabled=False
)
display(timeseries_return_columns)

VBox(children=(interactive(children=(Text(value='GPU %', description='value'), FloatSlider(value=0.1, descript…

Optional: select the hosts to be included in the timeseries data e.g., 'NODE1, NODE2'


Text(value='', description='Hosts:', placeholder='')

Optional: select the jobs to be included in the timeseries data e.g., 'JOB1, JOB2'


Text(value='', description='Jobs:', placeholder='')

Optional: select if you want the account logs to be returned for the Job IDs matching your query.


ToggleButton(value=False, description='Account Logs', icon='check', tooltip='Return Account Logs?')

Optional: select the columns to be included in the timeseries data (hold control to select multiple). If no columns are selected, all columns will be included.


SelectMultiple(description='Return Columns', index=(0,), options=('None', 'Job Id', 'Hosts', 'Events', 'Units'…
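
The hosts and jobs fields in cell 5 hold comma-separated text such as `'NODE1, NODE2'`, and the multi-select widgets use `'None'` as a placeholder entry. A small sketch of how these inputs could be normalized before querying follows; `parse_name_list` and `resolve_selection` are hypothetical helpers, not part of the original notebook.

```python
def parse_name_list(text: str) -> list:
    """Split 'NODE1, NODE2' into ['NODE1', 'NODE2'], ignoring blanks."""
    return [item.strip() for item in text.split(",") if item.strip()]


def resolve_selection(selected, all_options) -> list:
    """Drop the 'None' placeholder; fall back to every real option if nothing is left."""
    chosen = [item for item in selected if item != "None"]
    if chosen:
        return chosen
    return [option for option in all_options if option != "None"]
```

For example, `resolve_selection(timeseries_return_columns.value, timeseries_return_columns.options)` would yield all columns when the user leaves the selection at `'None'`, which matches the behavior the prompt text describes.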

In [18]:
# -------------- CELL 6 --------------

def get_timeseries_by_timestamp(begin_time: str, end_time: str, return_columns: list) -> pd.DataFrame:
    pass

def get_timeseries_by_values_and_unit(unit_vals: dict) -> pd.DataFrame:  # call this with the unit_values dict as an arg
    pass


def get_timeseries_by_hosts(hosts: str) -> pd.DataFrame:
    pass


def get_timeseries_by_job_ids(job_ids: str) -> pd.DataFrame:
    pass


def get_account_logs_by_job_ids(job_ids: str) -> pd.DataFrame:
    pass


get_timeseries_by_timestamp(start_time.value, end_time.value, timeseries_return_columns.value)

# unit_values is filled in cell 5 as {unit: (low_value, high_value)}.
if unit_values:
    get_timeseries_by_values_and_unit(unit_values)

if len(hosts.value) > 0:
    get_timeseries_by_hosts(hosts.value)
    
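if len(job_ids.value) > 0:
    get_timeseries_by_job_ids(job_ids.value)
    # Account logs are returned only when the 'Account Logs' toggle is selected.
    if return_account_logs.value:
        get_account_logs_by_job_ids(job_ids.value)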
    
stats = widgets.SelectMultiple(
    options=['Average', 'Mean', 'Median', 'Standard Deviation', 'PDF', 'CDF', 'Ratio of Data Outside Threshold'],
    value=['Mean'],
    description='Statistics',
    disabled=False
)
display(stats)

{'GPU %': (18.2, 100.0), 'GB:memused': (19.0, 66.5)}
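
The dictionary printed above has the shape `{unit: (low_value, high_value)}`. One way the unit and host filters in cell 6 could work is sketched below, assuming the timeseries is already in a DataFrame with `Units`, `Values`, and `Hosts` columns. The function names and the DataFrame argument are assumptions; the stubs in the notebook take only the filter values.

```python
import pandas as pd


def filter_by_unit_ranges(df: pd.DataFrame, unit_vals: dict) -> pd.DataFrame:
    """Keep rows whose unit is in unit_vals and whose value lies in that unit's range."""
    keep = pd.Series(False, index=df.index)
    for unit, (low, high) in unit_vals.items():
        # between() is inclusive on both ends by default.
        keep |= (df["Units"] == unit) & df["Values"].between(low, high)
    return df[keep]


def filter_by_hosts(df: pd.DataFrame, hosts: list) -> pd.DataFrame:
    """Keep rows recorded on any of the given hosts."""
    return df[df["Hosts"].isin(hosts)]
```

Rows whose unit does not appear in `unit_vals` are dropped by `filter_by_unit_ranges`; whether that is the desired behavior depends on how the filters are meant to combine.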


In [10]:
# -------------- CELL 7 --------------

def get_average():
    pass


def get_mean():
    pass


def get_median():
    pass


def get_standard_deviation():
    pass


def get_probability_density():
    pass


def get_cumulative_density():
    pass


def get_data_points_outside_threshold():
    pass


def get_ratio_of_data_points_outside_threshold():
    pass

# Display statistical data here

# Give the user the option to calculate correlations
print("If you would like to explore correlations between metrics, choose two metrics below:")
# A separate name keeps the statistics selection from cell 6 intact.
correlation_metrics = widgets.SelectMultiple(
    options=['None', 'CPU %:cpuuser', 'GPU %:gpu_usage', 'GB:memused_minus_diskcache or memused', 'GB/s:block', 'MB/s:nfs'],
    value=['CPU %:cpuuser'],
    description='Metrics',
    disabled=False
)
display(correlation_metrics)


If you would like to explore correlations between metrics, choose two metrics below:
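
The statistic stubs in cell 7 take no arguments yet. As a rough sketch, several of them can be computed from a single pandas Series of metric values. The `summarize` and `empirical_cdf` names, the inclusive threshold handling, and the use of the sample standard deviation are assumptions.

```python
import numpy as np
import pandas as pd


def summarize(values: pd.Series, low: float, high: float) -> dict:
    """Compute basic statistics and the share of points outside [low, high]."""
    outside = (values < low) | (values > high)
    return {
        "Mean": values.mean(),
        "Median": values.median(),
        "Standard Deviation": values.std(),  # sample standard deviation (ddof=1)
        "Ratio of Data Outside Threshold": outside.mean(),
    }


def empirical_cdf(values: pd.Series) -> pd.DataFrame:
    """Return sorted values with the fraction of points at or below each one."""
    ordered = values.dropna().sort_values().reset_index(drop=True)
    cdf = np.arange(1, len(ordered) + 1) / len(ordered)
    return pd.DataFrame({"value": ordered, "cdf": cdf})
```

Missing values compare as false in `summarize`, so they count as inside the threshold; dropping them first may be preferable.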


In [13]:
# -------------- CELL 8 --------------

# Display correlation visualizations here
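
Before any correlation can be plotted, the two chosen metrics have to be aligned on a common time axis. The sketch below assumes a long-format DataFrame in which the `Events` column holds the metric name (for example `cpuuser`) and computes a Pearson correlation with pandas. `metric_correlation` is a hypothetical helper, and the column layout is an assumption.

```python
import pandas as pd


def metric_correlation(df: pd.DataFrame, metric_a: str, metric_b: str) -> float:
    """Pearson correlation between two metrics, aligned by timestamp."""
    # One column per metric, one row per timestamp; duplicates are averaged.
    wide = df.pivot_table(
        index="Timestamps", columns="Events", values="Values", aggfunc="mean"
    )
    # Series.corr ignores timestamps where either metric is missing.
    return wide[metric_a].corr(wide[metric_b])
```

The `wide` table built inside the function is also a convenient input for a scatter plot of one metric against the other.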

In [19]:
# -------------- CELL 9 ---------------
# Give the user the option to download data here.
print("Select the files to be downloaded:")
files_to_provide = widgets.SelectMultiple(
    options=['job_ts_metrics_aug2022_anon', 'job_ts_metrics_dec2022_anon',
             'job_ts_metrics_jan2022_anon', 'job_ts_metrics_july2022_anon',
             'job_ts_metrics_nov2022_anon', 'job_ts_metrics_sep2022_anon'],
    value=['job_ts_metrics_aug2022_anon'],
    description='Files',
    disabled=False
)

display(files_to_provide)

Select the files to be downloaded:


SelectMultiple(description='Files', index=(0,), options=('job_ts_metrics_aug2022_anon', 'job_ts_metrics_dec202…
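
Cell 9 only collects the file names to download. One possible follow-up step is sketched below: copy every file in the data directory whose name starts with a selected entry into a destination folder. The file extension is not stated in the notebook, so the sketch matches any extension; `copy_selected_files` and the destination argument are hypothetical.

```python
import shutil
from pathlib import Path


def copy_selected_files(data_dir, names, dest_dir) -> list:
    """Copy '<name>.<any extension>' from data_dir to dest_dir for each name."""
    destination = Path(dest_dir)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        # The extension is unknown here, so match any suffix (assumption).
        for source in sorted(Path(data_dir).glob(f"{name}.*")):
            if source.is_file():
                copied.append(Path(shutil.copy2(source, destination / source.name)))
    return copied
```

A call such as `copy_selected_files(data_path, files_to_provide.value, "downloads")` would return the list of copied paths, which could then be shown to the user.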

In [None]:
# -------------- CELL 10 ---------------
