## FRESCO Analytics Notebook
### Overview
This notebook is designed to simplify analysis of the Anvil dataset. It lets the user point to Anvil files stored locally, choose analysis options, and view the results.
### Instructions
1. Run the first cell and provide the complete directory path to the data in the 'Path' field.
2. Run the second cell. If data preprocessing is needed, select the desired options. Otherwise, skip to the "If data preprocessing options were not selected, follow these instructions" section below.
#### If data preprocessing options were selected, follow these instructions
3. Run cell 3 and provide the start and end times.
4. Run cell 4 and select the units to be included in the timeseries data.
5. Run cell 5 and provide the desired options.
6. Run cell 6 and provide the desired statistic.
7. Run cell 7 and provide the desired data visualization options.
8. Run cell 8.

#### If data preprocessing options were not selected, follow these instructions
3. Run cell 4, provide the start and end times, and select the units to be included in the timeseries data.
4. Run cell 5 and provide the desired options.
5. Run cell 6 and provide the desired statistic.
6. Run cell 7 and provide the desired data visualization options.
7. Run cell 8.


In [1]:
# -------------- CELL 1 --------------

from IPython.display import display
import ipywidgets as widgets
import pandas as pd
import os

print(r"Please provide the directory path to the data files e.g., D:\Data")
dir_path = widgets.Text(
    value='',
    placeholder='',
    description='Path:',
    disabled=False
)
display(dir_path)

Please provide the directory path to the data files e.g., D:\Data


Text(value='', description='Path:', placeholder='')

In [2]:
# -------------- CELL 2 --------------

def get_data_files_directory(path):
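    """
    Return the folder that holds the data files, i.e. the 'raw' subdirectory of the given path.
    :param path: directory path entered in the 'Path' field
    :return: the given path joined with 'raw'
    """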
    return os.path.join(path, 'raw')


data_path = get_data_files_directory(dir_path.value)

# Data preprocessing: handling missing metrics
print("Data preprocessing: select this option if rows with missing metrics should be removed.")
missing_metrics = widgets.ToggleButton(
    value=False,
    description='Remove Rows with Missing Metrics?',
    disabled=False,
    button_style='',
    tooltip='Remove Rows with Missing Metrics?',
    icon='check'
)
display(missing_metrics)

print("Data preprocessing: select this option if an interval column should be added to the data.")
interval = widgets.ToggleButton(
    value=False,
    description='Add Interval Column?',
    disabled=False,
    button_style='',
    tooltip='Add Interval Column?',
    icon='check'
)
display(interval)

Data preprocessing: select this option if rows with missing metrics should be removed.


ToggleButton(value=False, description='Remove Rows with Missing Metrics?', icon='check', tooltip='Remove Rows …

Data preprocessing: select this option if an interval column should be added to the data.


ToggleButton(value=False, description='Add Interval Column?', icon='check', tooltip='Add Interval Column?')

In [3]:
# -------------- CELL 3 --------------

if missing_metrics.value or interval.value:
    print("Please select the start time and end time for the data preprocessing.")
    start_time = widgets.Text(
        value='01-01-2020',
        placeholder='',
        description='Start Time:',
        disabled=False
    )

    end_time = widgets.Text(
        value='12-31-9999',
        placeholder='',
        description='End Time:',
        disabled=False
    )
    display(start_time, end_time)
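
The start and end times are collected as plain text, so they need to be parsed before they can be compared with timestamps in the data. The following is a minimal sketch, assuming the `MM-DD-YYYY` format implied by the widget defaults; `DATE_FORMAT` and `parse_time_range` are hypothetical names that are not part of the notebook. It uses the standard library `datetime` because the default end value `12-31-9999` lies beyond the range of nanosecond-resolution pandas timestamps, which end in the year 2262.

```python
from datetime import datetime

# Assumed format, inferred from the widget defaults such as '01-01-2020'.
DATE_FORMAT = "%m-%d-%Y"


def parse_time_range(start_text: str, end_text: str) -> tuple[datetime, datetime]:
    """Parse both widget strings and check that the start does not follow the end."""
    start = datetime.strptime(start_text.strip(), DATE_FORMAT)
    end = datetime.strptime(end_text.strip(), DATE_FORMAT)
    if start > end:
        raise ValueError(f"Start time {start_text!r} is after end time {end_text!r}")
    return start, end


# Example with the widget defaults.
start, end = parse_time_range("01-01-2020", "12-31-9999")
print(start.date(), end.date())
```

In the notebook, the two arguments would come from `start_time.value` and `end_time.value`.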

In [4]:
# -------------- CELL 4 --------------

def handle_missing_metrics(starting_time, ending_time, path):
    """
    This function should remove the rows within the given timeframe that are missing metrics.
    :param starting_time:
    :param ending_time:
    :param path:
    :return:
    """
    pass


def add_interval_column(starting_time, ending_time, path):
    """
    This function should add an interval column to the data that falls within the given timeframe. The interval column
    should reflect the length of each timestamp.
    :param starting_time:
    :param ending_time:
    :param path:
    :return:
    """
    pass

dataframe = pd.DataFrame()

if missing_metrics.value:
    dataframe = handle_missing_metrics(start_time.value, end_time.value, data_path)

if interval.value:
    dataframe = add_interval_column(start_time.value, end_time.value, data_path)

if not missing_metrics.value and not interval.value:
    print("Please enter a start time and end time.")
    start_time = widgets.Text(
        value='01-01-2020',
        placeholder='',
        description='Start Time:',
        disabled=False
    )

    end_time = widgets.Text(
        value='12-31-9999',
        placeholder='',
        description='End Time:',
        disabled=False
    )
    display(start_time, end_time)

print("Optional: select the units to be included in the timeseries data.")
units = widgets.ToggleButtons(
    options=['None', 'CPU %:cpuuser', 'GPU %:gpu_usage', 'GB:memused_minus_diskcache or memused', 'GB/s:block', 'MB/s:nfs'],
    description='Units:',
    disabled=False,
    button_style='',
    tooltips=['None', 'CPU %', 'GPU %', 'GB', 'GB/s', 'MB/s']
)

display(units)

Please enter a start time and end time.


Text(value='01-01-2020', description='Start Time:', placeholder='')

Text(value='12-31-9999', description='End Time:', placeholder='')

Optional: select the units to be included in the timeseries data.


ToggleButtons(description='Units:', options=('None', 'CPU %:cpuuser', 'GPU %:gpu_usage', 'GB:memused_minus_dis…
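
The two preprocessing functions in cell 4 are left as stubs. One possible shape for them, working on a DataFrame that is already in memory, is sketched below. Everything here is an assumption made for illustration: the column names `Timestamps` and `Values` are borrowed from the 'Return Columns' labels, the helper names are hypothetical, and the interval is interpreted as the number of seconds since the previous sample.

```python
import pandas as pd


def drop_rows_missing_metrics(df: pd.DataFrame, metric_columns: list) -> pd.DataFrame:
    """Return a copy of df without rows that have NaN in any metric column."""
    return df.dropna(subset=metric_columns).reset_index(drop=True)


def with_interval_column(df: pd.DataFrame, time_column: str = "Timestamps") -> pd.DataFrame:
    """Return a time-sorted copy with an 'Interval' column (seconds since the previous row)."""
    out = df.sort_values(time_column).reset_index(drop=True)
    out["Interval"] = out[time_column].diff().dt.total_seconds()
    return out


# Small made-up sample: the second row is missing its metric value.
sample = pd.DataFrame({
    "Timestamps": pd.to_datetime([
        "2020-01-01 00:00:00",
        "2020-01-01 00:01:00",
        "2020-01-01 00:03:00",
    ]),
    "Values": [10.0, None, 30.0],
})

cleaned = drop_rows_missing_metrics(sample, ["Values"])
result = with_interval_column(cleaned)
print(result)
```

Passing the DataFrame returned by one step into the next lets both options apply to the same data when both toggles are selected.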

In [5]:
# -------------- CELL 5 --------------

if units.value != 'None':
    print(f"Enter the low value for {units.value}")
    low_value = widgets.FloatText(
        value=0.1,
        description=f'{units.value} Low Value:',
        disabled=False
    )
    display(low_value)

    print(f"Enter the high value for {units.value}")
    high_value = widgets.FloatText(
        value=99.9,
        description=f'{units.value} High Value:',
        disabled=False
    )
    display(high_value)
    
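    # The selected option label is 'GB:memused_minus_diskcache or memused', so match on its prefix.
    if units.value.startswith("GB:"):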
        print("Select the Event type for GB:")
        event = widgets.ToggleButtons(
            options=['memused_minus_diskcache', 'memused'],
            description='Event:',
            disabled=False,
            button_style='',
            tooltips=['memused_minus_diskcache', 'memused']
        )
        display(event)


print("Optional: select the hosts to be included in the timeseries data e.g., 'NODE1, NODE2'")
hosts = widgets.Text(
    value='',
    placeholder='',
    description='Hosts:',
    disabled=False
)
display(hosts)

print("Optional: select the jobs to be included in the timeseries data e.g., 'JOB1, JOB2'")
job_ids = widgets.Text(
    value='',
    placeholder='',
    description='Jobs:',
    disabled=False
)
display(job_ids)

print("Optional: select if you want the account logs to be returned for the Job IDs matching your query.")
return_account_logs = widgets.ToggleButton(
    value=False,
    description='Account Logs',
    disabled=False,
    button_style='',
    tooltip='Return Account Logs?',
    icon='check'
)
display(return_account_logs)

print("Optional: select the columns to be included in the timeseries data (hold control to select multiple). If no columns are "
      "selected, all columns will be included.")
timeseries_return_columns = widgets.SelectMultiple(
    options=['None', 'Job Id', 'Hosts', 'Events', 'Units', 'Values', 'Timestamps'],
    value=['None'],
    description='Return Columns',
    disabled=False
)
display(timeseries_return_columns)

Optional: select the hosts to be included in the timeseries data e.g., 'NODE1, NODE2'


Text(value='', description='Hosts:', placeholder='')

Optional: select the jobs to be included in the timeseries data e.g., 'JOB1, JOB2'


Text(value='', description='Jobs:', placeholder='')

Optional: select if you want the account logs to be returned for the Job IDs matching your query.


ToggleButton(value=False, description='Account Logs', icon='check', tooltip='Return Account Logs?')

Optional: select the columns to be included in the timeseries data (hold control to select multiple). If no columns are selected, all columns will be included.


SelectMultiple(description='Return Columns', index=(0,), options=('None', 'Job Id', 'Hosts', 'Events', 'Units'…
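
The values collected in cells 4 and 5 are display strings: a unit option such as `'GB:memused_minus_diskcache or memused'` combines a unit with its event names, and the hosts and jobs fields hold comma-separated text. A small sketch of how these strings might be split before querying follows; `split_unit_option` and `split_csv_field` are hypothetical helpers, and the `unit:event or event` layout is inferred from the option labels only.

```python
def split_unit_option(option: str) -> tuple[str, list[str]]:
    """Split a label such as 'GB:memused_minus_diskcache or memused' into unit and events."""
    unit, _, events_text = option.partition(":")
    events = [name.strip() for name in events_text.split(" or ") if name.strip()]
    return unit.strip(), events


def split_csv_field(text: str) -> list[str]:
    """Turn 'NODE1, NODE2' into ['NODE1', 'NODE2'], ignoring empty entries."""
    return [item.strip() for item in text.split(",") if item.strip()]


print(split_unit_option("GB:memused_minus_diskcache or memused"))
print(split_unit_option("CPU %:cpuuser"))
print(split_csv_field("NODE1, NODE2"))
```

With the unit separated from the event names, a check such as `unit == "GB"` becomes reliable regardless of which events are listed in the label.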

In [None]:
# -------------- CELL 6 --------------

def get_timeseries_by_timestamp(begin_time: str, end_time: str, return_columns: list) -> pd.DataFrame:
    pass

def get_timeseries_by_values_and_unit(units: str, low_value, high_value) -> pd.DataFrame:
    pass


def get_timeseries_by_hosts(hosts: str) -> pd.DataFrame:
    pass


def get_timeseries_by_job_ids(job_ids: str) -> pd.DataFrame:
    pass


def get_account_logs_by_job_ids(job_ids: str) -> pd.DataFrame:
    pass


# SelectMultiple returns a tuple, so convert it to match the declared list parameter.
get_timeseries_by_timestamp(start_time.value, end_time.value, list(timeseries_return_columns.value))

if units.value != "None":
    get_timeseries_by_values_and_unit(units.value, low_value.value, high_value.value)

if len(hosts.value) > 0:
    get_timeseries_by_hosts(hosts.value)
    
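if len(job_ids.value) > 0:
    get_timeseries_by_job_ids(job_ids.value)
    # Fetch account logs only when the 'Account Logs' toggle is selected.
    if return_account_logs.value:
        get_account_logs_by_job_ids(job_ids.value)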
    
stats = widgets.SelectMultiple(
    options=['Mean', 'Median', 'Mode', 'Standard Deviation', 'Variance'],
    value=['Mean'],
    description='Statistics',
    disabled=False
)

display(stats)
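
The query functions in cell 6 are stubs, and their return values are not combined. As an illustration only, the sketch below applies the same kinds of filters (time range, unit with a value range, hosts) to a single in-memory DataFrame using one boolean mask. The function name `filter_timeseries`, the sample data, and the column names (taken from the 'Return Columns' labels) are assumptions, not the notebook's actual schema.

```python
import pandas as pd


def filter_timeseries(df, begin=None, end=None, unit=None, low=None, high=None, hosts=None):
    """Apply every filter that was supplied; None (or an empty host list) means no filter."""
    mask = pd.Series(True, index=df.index)
    if begin is not None:
        mask &= df["Timestamps"] >= begin
    if end is not None:
        mask &= df["Timestamps"] <= end
    if unit is not None:
        mask &= df["Units"] == unit
    if low is not None:
        mask &= df["Values"] >= low
    if high is not None:
        mask &= df["Values"] <= high
    if hosts:
        mask &= df["Hosts"].isin(hosts)
    return df[mask]


# Made-up sample rows for demonstration.
sample = pd.DataFrame({
    "Job Id": ["JOB1", "JOB1", "JOB2", "JOB2"],
    "Hosts": ["NODE1", "NODE1", "NODE2", "NODE2"],
    "Units": ["CPU %", "CPU %", "CPU %", "GB"],
    "Values": [5.0, 50.0, 75.0, 12.0],
    "Timestamps": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
})

selected = filter_timeseries(
    sample,
    begin=pd.Timestamp("2020-01-02"),
    unit="CPU %",
    low=0.1,
    high=99.9,
    hosts=["NODE1", "NODE2"],
)
print(selected)
```

Building one mask and indexing once keeps the filters independent of each other, so any subset of the optional widgets can be left empty.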

In [None]:
# -------------- CELL 7 --------------

def get_stat():  # placeholder
    pass

# Provide data visualization options here.
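
The `get_stat` placeholder does not yet say how the selected statistics are computed. One possible approach, sketched here under the assumption that the metric values are available as a numeric pandas Series, is a lookup table keyed by the labels of the 'Statistics' widget. `STAT_FUNCTIONS` and `compute_stats` are hypothetical names.

```python
import pandas as pd

# Maps each label from the 'Statistics' widget to a function of a numeric Series.
STAT_FUNCTIONS = {
    "Mean": lambda s: s.mean(),
    "Median": lambda s: s.median(),
    "Mode": lambda s: s.mode().iloc[0],       # smallest value when several modes tie
    "Standard Deviation": lambda s: s.std(),  # sample standard deviation (ddof=1)
    "Variance": lambda s: s.var(),            # sample variance (ddof=1)
}


def compute_stats(values: pd.Series, selected) -> dict:
    """Compute each selected statistic; raises KeyError for an unknown label."""
    return {name: STAT_FUNCTIONS[name](values) for name in selected}


# Made-up values for demonstration.
values = pd.Series([1.0, 2.0, 2.0, 3.0, 7.0])
summary = compute_stats(values, ("Mean", "Median", "Mode", "Variance"))
print(summary)
```

In the notebook, `selected` would be `stats.value`. An empty Series would need separate handling, since the mode lookup fails when there are no values.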

In [None]:
# -------------- CELL 8 --------------

# Display statistical data here