# graph_timeseries_rate

Graph the number of Terra workflow DRS localization operations per second,
based on `Localizing input drs://` log entry timestamps.

This data is based on workflow runs performed in the Terra `alpha` preproduction tier in the Terra workspace:   
`DRS Data Access Scale Testing - Alpha`  
as user `b.adm.firec@gmail.com`

The input data for this notebook is prepared by the `extract_drs_localization_timestamps.sh` script:  
https://github.com/mbaumann-broad/data-wrangling/blob/workflow_data_access_rate/scripts/workflow_drs_data_access_rate/extract_drs_localization_timestamps.sh

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

Uncomment the `INPUT_FILE` assignment for the workflow run you want to plot, and leave the others commented out.

In [None]:
# Configure the Terra workflow submission id from which to extract the data.
# Shape 1 - 20k inputs
# WF_SUBMISSION_ID="f67b144e-5b7c-4361-9c9f-381b4ff7f3e5"
# INPUT_FILE="./submission_f67b144e-5b7c-4361-9c9f-381b4ff7f3e5/drs_localization_timeseries.tsv"

In [None]:
# Shape 2 - 20k inputs scattered by 100 per task - Feb 1, 2022 3:30 PM
# WF_SUBMISSION_ID="a98b1b4d-25d5-489a-a955-191334c8ab32"
# INPUT_FILE="./submission_a98b1b4d-25d5-489a-a955-191334c8ab32/drs_localization_timeseries.tsv"

In [None]:
# Shape 2 - 20k inputs scattered 20 per task - Oct 6, 2021 1:13 PM - aborted
# Aborted due to end of test window. At the time, only 143 of 1,000 shards were created.
# Gen3 reported a DRS request rate of ~250 requests per second (RPS).
# See: https://nhlbi-biodatacatalyst.slack.com/archives/C01CSE5P7KM/p1633717393028700?thread_ts=1633548575.026200&cid=C01CSE5P7KM
# WF_SUBMISSION_ID="698cc797-8235-4585-873a-3d7a68192fa6"
INPUT_FILE="./submission_698cc797-8235-4585-873a-3d7a68192fa6/drs_localization_timeseries.tsv"

In [None]:
df = pd.read_csv(INPUT_FILE, sep="\t")

In [None]:
# Parse the timestamps, then sum the counts into one-second bins.
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
resampled_df = df.resample(pd.Timedelta(1, 'second'), on='Timestamp')['Count'].sum().reset_index()
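
To make the one-second binning concrete, here is a minimal sketch on made-up data. The three sample timestamps are hypothetical and not taken from any workflow run; only the column names `Timestamp` and `Count` come from the cell above. Seconds with no log entries show up as bins with a count of zero, which keeps the series evenly spaced.

```python
import pandas as pd

# Hypothetical sample: one row per localization log entry, each counted once.
sample = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-10-06 13:13:00.100",
        "2021-10-06 13:13:00.700",
        "2021-10-06 13:13:02.300",
    ]),
    "Count": [1, 1, 1],
})

# Sum the counts that fall into each one-second bin.
per_second = (
    sample.resample(pd.Timedelta(seconds=1), on="Timestamp")["Count"]
    .sum()
    .reset_index()
)
print(per_second)
```

In this sample, two entries land in the first second, none in the second, and one in the third.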

The maximum rate measured in any one-second interval:

In [None]:
resampled_df['Count'].max()
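
It can also help to know when the peak occurred. The sketch below assumes a frame shaped like `resampled_df` (columns `Timestamp` and `Count`); the helper name `peak_rate` and the demo values are hypothetical, chosen only for illustration.

```python
import pandas as pd


def peak_rate(per_second: pd.DataFrame):
    """Return (timestamp, count) for the busiest one-second bin."""
    # idxmax gives the row label of the first maximum in the Count column.
    row = per_second.loc[per_second["Count"].idxmax()]
    return row["Timestamp"], int(row["Count"])


# Hypothetical demo data, not taken from a real run.
demo = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-10-06 13:13:00",
        "2021-10-06 13:13:01",
        "2021-10-06 13:13:02",
    ]),
    "Count": [2, 7, 1],
})

peak_time, peak_count = peak_rate(demo)
print(peak_time, peak_count)
```

In the notebook itself, the equivalent call would be `peak_rate(resampled_df)`.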

Line plot for the rate per second:

In [None]:
plt.style.use("fast")
plt.figure(figsize=(12, 10))
plt.xlabel("Time (one second interval)")
plt.ylabel("Rate per Second")
plt.title("Terra Workflow DRS Localization Rate Time Series Plot")
plt.plot(resampled_df["Count"])
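
The plot above draws `Count` against the row index, so the x axis reads as seconds elapsed since the first bin. If wall-clock times are more useful, one possible variant plots against the `Timestamp` column instead. This is a sketch only: the helper name `plot_rate` and the demo data are hypothetical, and the figure size and labels are carried over from the cell above.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_rate(per_second: pd.DataFrame):
    """Plot Count against Timestamp and return the Axes."""
    fig, ax = plt.subplots(figsize=(12, 10))
    ax.plot(per_second["Timestamp"], per_second["Count"])
    ax.set_xlabel("Time")
    ax.set_ylabel("Rate per Second")
    ax.set_title("Terra Workflow DRS Localization Rate Time Series Plot")
    fig.autofmt_xdate()  # rotate the date labels so they do not overlap
    return ax


# Hypothetical demo data, not taken from a real run.
demo = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-10-06 13:13:00",
        "2021-10-06 13:13:01",
        "2021-10-06 13:13:02",
    ]),
    "Count": [2, 7, 1],
})

ax = plot_rate(demo)
```

In the notebook, `plot_rate(resampled_df)` would render the same series with timestamps on the x axis.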