# FRESCO Analytics Notebook
### Overview
This notebook simplifies analysis of the Anvil dataset. It lets you access the locally stored Anvil files, select analysis options, and view the results.

The notebook can be divided into three sections:
#### Section 1: Data Filtering
In this section you define the scope of your analysis. Select a datetime window and apply filters (preprocessing, units, hosts, and job IDs) to narrow the dataset to what you need.
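
As a rough illustration of what this filtering amounts to, the sketch below applies a time window and a host list to a toy table with pandas. The column names follow the notebook's `col_map` (`jid`, `host`, `event`, `unit`, `value`, `time`); the data and the helper `filter_window_and_hosts` are hypothetical and are not part of `notebook_functions`.

```python
import pandas as pd

# Toy stand-in for the host time series table (made-up values).
toy_df = pd.DataFrame({
    "jid": ["JOB1", "JOB1", "JOB2", "JOB3"],
    "host": ["NODE1", "NODE1", "NODE2", "NODE3"],
    "event": ["cpuuser", "cpuuser", "memused", "cpuuser"],
    "unit": ["CPU %", "CPU %", "GB", "CPU %"],
    "value": [12.5, 80.0, 3.2, 55.0],
    "time": pd.to_datetime([
        "2022-08-01 00:00:00", "2022-08-01 12:00:00",
        "2022-08-02 06:00:00", "2022-08-05 00:00:00",
    ]),
})

def filter_window_and_hosts(df, start, end, hosts=None):
    """Hypothetical helper: keep rows with start <= time < end, optionally limited to the given hosts."""
    mask = (df["time"] >= pd.Timestamp(start)) & (df["time"] < pd.Timestamp(end))
    if hosts:
        mask &= df["host"].isin(hosts)
    return df[mask]

filtered = filter_window_and_hosts(
    toy_df, "2022-08-01", "2022-08-03", hosts=["NODE1", "NODE2"]
)
print(filtered)
```

The half-open window (`start <= time < end`) is an assumption made for this sketch; the notebook's database helpers may treat the boundaries differently.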

#### Section 2: Data Analysis Options
The second section offers a set of analysis options. Select the statistics that fit your needs.

#### Section 3: Data Analysis and Visualizations
The final section runs the selected analyses on the filtered dataset and visualizes the results.
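
The statistics can be computed over the whole filtered dataset, over a rolling window of a fixed number of rows ("Count"), or over a rolling time window ("Time"). The sketch below shows the difference using plain pandas on made-up values; it is only an illustration and makes no claim about how the `notebook_functions` helpers are implemented.

```python
import pandas as pd

# Made-up metric values indexed by timestamp, with a gap after 00:20.
values = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.to_datetime([
        "2022-08-01 00:00", "2022-08-01 00:10", "2022-08-01 00:20",
        "2022-08-01 01:30", "2022-08-01 01:40",
    ]),
)

# Count-based window: each result averages the current row and the one before it.
count_mean = values.rolling(2).mean()

# Time-based window: each result averages every row in the preceding 30 minutes.
time_mean = values.rolling("30min").mean()

# No interval: a single statistic over the whole series.
overall_mean = values.mean()

print(count_mean, time_mean, overall_mean, sep="\n\n")
```

A count-based window always spans the same number of rows regardless of gaps in the data, while a time-based window can contain a varying number of rows.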

### Step-by-Step Instructions
1. **Cell 1:** Define the time window for your dataset. This window determines which host time series and job accounting records are extracted from the database.
2. **Cell 2:** Choose your preprocessing methods. Multiple methods can be combined.
3. **Cell 3:** Specify the units of the host time series data to include in the analysis.
4. **Cell 4:** Enter your desired values and select options. **Remember:** If you selected units in step 3, enter the low and high values here and click the **"Save Values"** button before moving on.
5. **Cell 5:** This step involves two actions:
   - **Download option:** Download the filtered dataset for offline use or further analysis.
   - **Analysis selection:** Choose from the data analysis options for your filtered dataset.
6. **Cell 6:** Run this cell to generate the data visualizations. To explore correlations among metrics and statistics, select from the provided options.
7. **Cell 7:** Run this cell to see the correlations.
8. **Cell 8:** TBD.
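
The correlation step reports the Pearson correlation between two selected metrics. As a minimal sketch of that idea, the example below pivots a long-format table (one row per timestamp and metric) into one column per metric and calls pandas' `Series.corr`, which uses Pearson correlation by default. The data is made up, and the pivot approach is an assumption for illustration, not a description of `nbf.calculate_and_plot_correlation`.

```python
import pandas as pd

# Made-up long-format data: two metrics sampled at the same four timestamps.
times = pd.to_datetime([
    "2022-08-01 00:00", "2022-08-01 00:10",
    "2022-08-01 00:20", "2022-08-01 00:30",
])
long_df = pd.DataFrame({
    "time": list(times) * 2,
    "event": ["cpuuser"] * 4 + ["memused"] * 4,
    "value": [10.0, 20.0, 30.0, 40.0, 1.0, 2.0, 3.0, 4.0],
})

# One column per metric, aligned on timestamp.
wide = long_df.pivot_table(index="time", columns="event", values="value", aggfunc="mean")

# Pearson correlation between the two aligned columns.
r = wide["cpuuser"].corr(wide["memused"])
print(f"Pearson r = {r:.3f}")
```

Because the two toy metrics rise in exact proportion, the coefficient here is 1; real metrics will rarely align this cleanly, and timestamps that appear for only one metric would need to be handled before correlating.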


In [None]:
# -------------- CELL 1 --------------
from IPython.display import display
import ipywidgets as widgets
import notebook_functions as nbf
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import os

print(r"Please provide a time window for your dataset.")

start_time = widgets.DatePicker(
    description='Start Date',
    disabled=False
)
end_time = widgets.DatePicker(
    description='End Date',
    disabled=False
)

# start_time = widgets.NaiveDatetimePicker(
#     value=datetime.now().replace(microsecond=0),
#     placeholder='',
#     description='Start Time:',
#     disabled=False
# )
#
# end_time = widgets.NaiveDatetimePicker(
#     value=datetime.now().replace(microsecond=0),
#     placeholder='',
#     description='End Time:',
#     disabled=False
# )

# Add a button that the user can press to validate the dates
validate_button = widgets.Button(description="Validate Dates")
output = widgets.Output()

def on_button_clicked(b):
    # Both dates must be set, and the start must come strictly before the end
    if start_time.value is None or end_time.value is None or start_time.value >= end_time.value:
        b.description = "Invalid Times"
        b.button_style = 'danger'  # The button turns red when the dates are invalid
    else:
        b.description = "Times Valid"
        b.button_style = 'success'  # The button turns green when the dates are valid

validate_button.on_click(on_button_clicked)

display(start_time, end_time, validate_button, output)

print("Data preprocessing: select one or many:")
preprocessing = widgets.SelectMultiple(
    options=['None', 'Remove Rows with Missing Metric', 'Remove Rows with Negative Value', 'Add an Interval Column'],
    value=['None'],
    description='Options:',
    disabled=False,
)

display(preprocessing)

In [None]:
# -------------- CELL 2 --------------
# get timeseries from the DB
time_series_df = nbf.get_time_series_from_database(start_time.value.strftime('%Y-%m-%d %H:%M:%S'), end_time.value.strftime('%Y-%m-%d %H:%M:%S'))

# get the account logs from the DB
account_log_df = nbf.get_account_log_from_database(start_time.value.strftime('%Y-%m-%d %H:%M:%S'), end_time.value.strftime('%Y-%m-%d %H:%M:%S'))

# do the preprocessing
for value in preprocessing.value:

    if "Missing Metric" in value:
        time_series_df = time_series_df.dropna()
    if "Add" in value:
        time_series_df = nbf.add_interval_column(end_time.value, time_series_df, account_log_df)
    if "Negative Value" in value:
        time_series_df = time_series_df[time_series_df['value'] >= 0]

print("Optional: select the units to be included in the timeseries data.")
units = widgets.SelectMultiple(
    options=['None', 'CPU %', 'GPU %', 'GB:memused', 'GB:memused_minus_diskcache', 'GB/s', 'MB/s'],
    value=['None'],
    description='Units:',
    disabled=False,
)

display(units)

In [None]:
# -------------- CELL 3 --------------
unit_values = {}  # maps each selected unit (from the units list above) to the user's (low_value, high_value) input

for value in units.value:
    if value != 'None':
        nbf.setup_widgets(unit_values, value)

print("Optional: provide the hosts to be included in the timeseries data e.g., 'NODE1, NODE2'")
hosts = widgets.Text(
    value='',
    placeholder='',
    description='Hosts:',
    disabled=False
)
display(hosts)
print("Optional: provide the jobs to be included in the timeseries data e.g., 'JOB1, JOB2'")
job_ids = widgets.Text(
    value='',
    placeholder='',
    description='Jobs:',
    disabled=False
)
display(job_ids)

print("Optional: select if you want the account logs to be returned for the Job IDs matching your query.")
return_account_logs = widgets.Button(description="Return Account Logs?")

def on_account_clicked(b):
    b.description = "Returning Account Logs!"
    b.button_style = 'success'  # The button turns green when clicked

return_account_logs.on_click(on_account_clicked)

display(return_account_logs)

print("Optional: select the columns to be included in the timeseries data (hold control to select multiple). If no columns are "
      "selected, all columns will be included. ** NOTE ** 'Units', 'Values', and 'Timestamps' are required for graphing in the cells below!")
timeseries_return_columns = widgets.SelectMultiple(
    options=['None', 'Job Id', 'Hosts', 'Events', 'Units', 'Values', 'Timestamps'],
    value=['None'],
    description='Return Columns',
    disabled=False
)
display(timeseries_return_columns)

In [None]:
# -------------- CELL 4 --------------
if 'None' not in units.value or len(units.value) > 1:
    time_series_df = nbf.get_timeseries_by_values_and_unit(unit_values, time_series_df)

if len(hosts.value) > 0:
    time_series_df = nbf.get_timeseries_by_hosts(hosts.value, time_series_df)

if len(job_ids.value) > 0:
    # the job ID filter returns time series rows, so keep the result in time_series_df
    time_series_df = nbf.get_timeseries_by_job_ids(job_ids.value, time_series_df)

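# A widget object is always truthy, so check whether the button was actually clicked
# (on_account_clicked in Cell 3 sets button_style to 'success')
if return_account_logs.button_style == 'success':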
    account_log_df = nbf.get_account_logs_by_job_ids(time_series_df, account_log_df)

if any(selection != "None" for selection in timeseries_return_columns.value):
    col_map = {'Job Id': 'jid', 'Hosts': 'host', 'Events': 'event', 'Units': 'unit', 'Values': 'value', 'Timestamps': 'time'}
    time_series_df = nbf.filter_return_columns([col_map[selection] for selection in timeseries_return_columns.value if selection != "None"], time_series_df)

# -------------- timeseries download --------------
print("Do you want to download the filtered timeseries data?")
csv_download_button = widgets.Button(description="Download as CSV")
excel_download_button = widgets.Button(description="Download as Excel")
def on_csv_button_clicked(b):
    display(nbf.create_csv_download_link(time_series_df, title="Download timeseries CSV"))

def on_excel_button_clicked(b):
    display(nbf.create_excel_download_link(time_series_df, title="Download timeseries Excel"))

csv_download_button.on_click(on_csv_button_clicked)
excel_download_button.on_click(on_excel_button_clicked)
display(csv_download_button, excel_download_button)

# -------------- account log download --------------

print("Do you want to download the filtered accounting data?")
csv_acc_download_button = widgets.Button(description="Download as CSV")
excel_acc_download_button = widgets.Button(description="Download as Excel")

def on_acc_csv_button_clicked(b):
    display(nbf.create_csv_download_link(account_log_df, title="Download accounting CSV"))

def on_acc_excel_button_clicked(b):
    display(nbf.create_excel_download_link(account_log_df, title="Download accounting Excel"))

csv_acc_download_button.on_click(on_acc_csv_button_clicked)
excel_acc_download_button.on_click(on_acc_excel_button_clicked)
display(csv_acc_download_button, excel_acc_download_button)

# -------------- stats options --------------
stats = widgets.SelectMultiple(
    options=['None', 'Mean', 'Median', 'Standard Deviation', 'PDF', 'CDF', 'Ratio of Data Outside Threshold'],
    value=['None'],
    description='Statistics',
    disabled=False
)

ratio_threshold = widgets.IntText(
    value=0,
    description='Value:',
    disabled=True  # disabled by default
)

interval_type = widgets.Dropdown(
    options=['None', 'Count', 'Time'],
    value='None',
    description='Interval Type',
    disabled=True  # disabled by default
)

time_units = widgets.Dropdown(
    options=['None', 'Days', 'Hours', 'Minutes', 'Seconds'],
    value='None',
    description='Interval Unit',
    disabled=True  # disabled by default
)

time_value = widgets.IntText(
    value=0,
    description='Value:',
    disabled=True  # disabled by default
)

# Define a function to be called when stats value changes
def on_stats_change(change):
    if change['type'] == 'change' and change['name'] == 'value':
        if "Ratio of Data Outside Threshold" in change['new']: 
            # enable ratio_threshold if 'Ratio of Data Outside Threshold' is selected
            ratio_threshold.disabled = False
        else: 
            # disable ratio_threshold if 'Ratio of Data Outside Threshold' is not selected
            ratio_threshold.disabled = True

        if change['new'] and change['new'][0] != "None":  # guard against an empty selection
            # enable interval_type if stats is not None
            interval_type.disabled = False
        else:  
            # disable interval_type if stats is None
            interval_type.disabled = True
            interval_type.value = 'None'  # reset interval_type to 'None'

stats.observe(on_stats_change)

# Define a function to be called when interval_type value changes
def on_interval_type_change(change):
    if change['type'] == 'change' and change['name'] == 'value':
        if change['new'] == "None":
            time_units.disabled = True
            time_value.disabled = True
            time_units.value = 'None'  # reset time_units to 'None'
            time_value.value = 0  # reset time_value to 0
        elif change['new'] == "Time":
            time_units.disabled = False
            time_value.disabled = False
        elif change['new'] == "Count":
            time_units.disabled = True
            time_value.disabled = False
        else:
            time_units.disabled = False
            time_value.disabled = False

interval_type.observe(on_interval_type_change)

# Display the widgets
print("Please select a statistic to calculate.")
display(stats)
print("Please provide the threshold if 'Ratio of Data Outside Threshold' was selected.")
display(ratio_threshold)
print("Please select an interval type to use in the statistic calculation. If count is selected, the interval will correspond to a count of rows. If time is selected, the interval will be a time window.")
display(interval_type)
print("If time was selected, please select the unit of time.")
display(time_units)
print("Please provide the interval count.")
display(time_value)

In [None]:
%matplotlib inline
# -------------- CELL 5 --------------
# Convert the 'time' columns to datetime
try:
    time_series_df['time'] = pd.to_datetime(time_series_df['time'])
    # account_log_df['time'] = pd.to_datetime(account_log_df['time'])

    # set the 'time' column as the index
    time_series_df = time_series_df.set_index('time')
    # account_log_df = account_log_df.set_index('time')

    # sort each by timestamp
    time_series_df = time_series_df.sort_index()
    # account_log_df = account_log_df.sort_index()
except Exception as e:
    print(f"Encountered the following error: {e}")

metric_func_map = {
    "Mean": nbf.get_mean if "Mean" in stats.value else "",
    "Median": nbf.get_median if "Median" in stats.value else "",
    "Standard Deviation": nbf.get_standard_deviation if "Standard Deviation" in stats.value else "",
    "PDF": nbf.plot_pdf if "PDF" in stats.value else "",
    "CDF": nbf.plot_cdf if "CDF" in stats.value else "",
    "Ratio of Data Outside Threshold": nbf.plot_data_points_outside_threshold if 'Ratio of Data Outside Threshold' in stats.value else ""
}

unit_map = {
    "CPU %": "cpuuser",
    "GPU %": "gpu_usage",
    "GB:memused": "memused",
    "GB:memused_minus_diskcache": "memused_minus_diskcache",
    "GB/s": "block",
    "MB/s": "nfs"
}

# set up outputs and tabbed layout
tab = widgets.Tab()
outputs = {}
for unit in units.value:
    outputs[unit] = {}
    for stat in stats.value:
        outputs[unit][stat] = widgets.Output()
tab.children = [widgets.Accordion([widgets.Box([widgets.Label(stat), outputs[unit][stat]]) for stat in stats.value], titles=stats.value) for unit in units.value]
tab.titles = units.value


with plt.style.context('fivethirtyeight'):
    unit_stat_dfs = {}
    time_map = {'Days': 'D', 'Hours': 'H', 'Minutes': 'T', 'Seconds': 'S'}
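    for unit in units.value:
        if unit == 'None':
            continue  # 'None' is a placeholder option with no entry in unit_map
        unit_stat_dfs[unit] = {}
        for metric in stats.value:
            if metric == 'None':
                continue  # 'None' is a placeholder option with no entry in metric_func_map
            metric_df = time_series_df.query(f"`event` == '{unit_map[unit]}'")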
            # handle special cases
            if metric == "PDF" or metric == "CDF":
                with outputs[unit][metric]:
                    metric_func_map[metric](metric_df)
                continue
            elif metric == "Ratio of Data Outside Threshold":
                with outputs[unit][metric]:
                    metric_func_map[metric](ratio_threshold.value, metric_df)
                continue
            
            # calculate stats
            if interval_type.value == "Time":
                unit_stat_dfs[unit][metric] = metric_func_map[metric](metric_df, True, f"{time_value.value}{time_map[time_units.value]}")
                rolling = True
            elif interval_type.value == "Count":
                unit_stat_dfs[unit][metric] = metric_func_map[metric](metric_df, True, time_value.value)
                rolling = True
            else:
                unit_stat_dfs[unit][metric] = metric_func_map[metric](metric_df, False)
                rolling = False  # without this, `rolling` is undefined or stale from a previous iteration
            
            # plot stats
            if rolling:
                with outputs[unit][metric]:
                    unit_stat_dfs[unit][metric].plot()
                    x_axis_label = ""
                    if interval_type.value == "Count":
                        x_axis_label += f"Timestamp - Rolling Window: {time_value.value:,} Rows"
                    elif interval_type.value == "Time":
                        x_axis_label += f"Timestamp - Rolling Window: {time_value.value}{time_map[time_units.value]}"
                    y_axis_label = unit
                    plt.gcf().autofmt_xdate()  # auto formats datetimes
                    plt.style.use('fivethirtyeight')
                    plt.title(f"{unit} {metric}")
                    plt.legend(loc='upper left', fontsize="10")
                    plt.xlabel(x_axis_label)
                    plt.ylabel(y_axis_label)
                    plt.show()
display(tab)

In [None]:
# -------------- CELL 6 --------------
def on_selection_change(change):
    if len(change.new) > 2:
        correlations.value = change.new[:2]
        
def on_button_click(button):
    graph_output.clear_output()
    with graph_output:
        with plt.style.context('fivethirtyeight'):
            display(nbf.calculate_and_plot_correlation(time_series_df, correlations.value))

correlations = widgets.SelectMultiple(
    options=['None', 'cpuuser', 'gpu_usage', 'nfs', 'block', 'memused', 'memused_minus_diskcache'],
    value=['None'],
    description='Metrics',
    disabled=False
)

plot_button = widgets.Button(
    description="Plot correlation",
    disabled=False,
    icon="chart-line"
)
plot_button.on_click(on_button_click)

graph_output = widgets.Output()

container = widgets.VBox([
    widgets.HBox(
        [correlations, plot_button],
        layout=widgets.Layout(
            width="50%",
            justify_content="space-between",
            align_items="center",
        ),
    ),
    graph_output,
])
correlations.observe(on_selection_change, names='value')

# Give the user the option to calculate correlations
print("Please select two metrics below to find their Pearson correlation:")
display(container)

In [None]:
# -------------- CELL 7 ---------------

# Give the user the option to download data here.
print("Select the files to be downloaded:")
files_to_provide = widgets.SelectMultiple(
    options=['None', 'job_ts_metrics_aug2022_anon', 'job_ts_metrics_dec2022_anon',
             'job_ts_metrics_jan2022_anon', 'job_ts_metrics_july2022_anon',
             'job_ts_metrics_nov2022_anon', 'job_ts_metrics_sep2022_anon'],
    value=['None'],
    description='Files',
    disabled=False
)
display(files_to_provide)

# Create and display download button
download_button = widgets.Button(description='Download File/s')
download_button.on_click(nbf.on_download_button_clicked)
display(download_button)