# Symphony HostFactory GKE Plug-in Testing Simplified Data Treatment

This notebook walks through the data sourcing, data treatment and visualization for the instrumentation developed in the `bootstrap-gke` project.

At the end of the document, we shall have a simple dataset containing:

- Time at which GKE nodes become ready
- Time at which pods are scheduled
- Allocated vCPU across time

## Data Sources

This simplified analysis requires only the first two data sources below; the third is an optional alternative for vCPU usage:

- **HostFactory API**:
    - Data from this source is manually noted by the user.
    - This simplified analysis requires only the request time (i.e. when the user requests the HostFactory provider plugin).
    - Optionally, the user can also note the machine types and/or templates and their respective quantities.
- **Cloud Logging Pipeline**:
    - Data from this source is ingested automatically into a BigQuery table.
    - Logging sinks forward the data to a Pub/Sub topic, whose subscription writes to BigQuery after a small UDF transformation.
    - Schema for this BigQuery table can be found at the `automate/instances/orchestrator/base/bigquery-schemas/logs.json` path.
    - A complete list of the logging sink definitions can be found at `automate/instances/bootstrap-gke/observability/sinks`. Please note that these sinks are Terraform templates, rendered according to the configuration of the bootstrapping Terraform project.
    - A complete list of the UDF transforms can be found in the `automate/instances/bootstrap-gke/observability/transforms` folder. These transforms are also Terraform templates and are rendered the same way.
    - The logging sinks relevant to this analysis are defined in the following files:
        - `node_ready_patch.lql.tftpl` - Event logged when a node transitions to the Ready state.
        - `pod_scheduled.lql.tftpl` - Event logged when a pod is scheduled onto a node.
- **Cloud Monitoring**:
    - The vCPU usage of the cluster is available through Cloud Monitoring. Alternatively, it can be deduced from the node ready events of the previous data source.

## Data Treatment 

Data treatment starts by defining the parameters of the BigQuery query. The following variables must be set:

In [None]:
# Configure query by run id
PROJECT_ID = "symphony-dev-2"
DATASET_ID = "log_dataset_default"
TABLE_ID = "logs-2"

# When the request was made, as noted by the experiment metadata
REQUEST_TIMESTAMP = "2025-10-01T09:08:58+00:00"
RETURN_TIMESTAMP = None # In this example, we forcefully delete pods and do not analyse scale-down.

# The run ID, as configured by the tester
RUN_ID = "test-bigscale-14"

Next, let us query the data of interest for this simple analysis. For this, we also import the required libraries.

In [None]:
from google.cloud import bigquery
import pandas as pd
import datetime
from pathlib import Path
import json

# Set maximum width for table view
pd.set_option('max_colwidth', 60)
# Set maximum rows for table view
pd.set_option('display.max_rows',200)
pd.options.plotting.backend = "plotly"

client = bigquery.Client()


In [None]:
# Let's convert the request timestamp from a string to a datetime object in UTC.
# Note: datetime.UTC requires Python 3.11 or later.
REQUEST_TIMESTAMP = datetime.datetime.fromisoformat(REQUEST_TIMESTAMP).astimezone(datetime.UTC)

QUERY = """
SELECT time, event, node, pod
from `{project}.{dataset}.{table}`
WHERE (
    run = "{run_id}" AND
    time > "{start_time}" AND 
    time < "{end_time}" AND
    event IN ("node:ready_patch", "pod:scheduled")
)
""".format(
    project = PROJECT_ID,
    dataset = DATASET_ID,
    table = TABLE_ID,
    run_id = RUN_ID,
    start_time = REQUEST_TIMESTAMP.isoformat(),
    # We query up to 20 minutes after the start of the test
    end_time = (REQUEST_TIMESTAMP + datetime.timedelta(minutes=20)).isoformat()
)

# with open(DATA_FOLDER.joinpath("QUERY.sql"), "w") as fh:
#     fh.writelines(QUERY)

print(QUERY)
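
The time window passed to the query can be checked on its own. The sketch below uses a hypothetical helper, `query_window`, which is not part of the notebook, to show how an ISO 8601 string with any UTC offset is normalized to UTC and extended by 20 minutes. It relies on `datetime.timezone.utc`, which is the same object as `datetime.UTC` but is also available before Python 3.11.

```python
import datetime


def query_window(request_iso, minutes=20):
    """Return (start, end) ISO strings in UTC for a request timestamp.

    Hypothetical helper for illustration only; the notebook computes the
    same values inline when formatting QUERY.
    """
    start = datetime.datetime.fromisoformat(request_iso).astimezone(
        datetime.timezone.utc
    )
    end = start + datetime.timedelta(minutes=minutes)
    return start.isoformat(), end.isoformat()


# A timestamp noted with a +02:00 offset maps to the same UTC instant
# as the REQUEST_TIMESTAMP used in this notebook
start, end = query_window("2025-10-01T11:08:58+02:00")
print(start)
print(end)
```

A timestamp noted without any offset would be interpreted in the local timezone of the machine running the notebook, so writing the offset explicitly is the safer habit.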

In [None]:
# Query the data

query_job = client.query(QUERY)
rows = query_job.result()
raw_df = rows.to_dataframe()

# Optionally save the raw data for future reference
# raw_df.to_parquet(DATA_FOLDER.joinpath(f"{RUN_ID}.parquet"))

raw_df.head(n=5)

For the analysis, we pivot the table so that nodes become the index, since we are interested in the first pod scheduled event and the last ready patch event for each node.

To achieve this, we use two aggregation functions ("first" and "last") and later keep only the two columns we need.

In [None]:
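# "first" and "last" follow row order and the query has no ORDER BY,
# so sort by time first (sort_values returns a new DataFrame).
df = raw_df.sort_values("time")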

df = df.pivot_table(
    values="time",
    index="node",
    columns="event",
    aggfunc=["first","last"]
)

# Flatten the (aggfunc, event) column MultiIndex into "aggfunc:event" names
columns = [":".join(column) for column in df.columns]
df.columns = pd.Index(columns)

# Drop unused columns
df = df.drop(
    columns=[
        "first:node:ready_patch",
        "last:pod:scheduled"
    ]
)

print(f"Total lines in DataFrame: {len(df.index)}")

df.head(n=5)
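
To make the pivot step concrete, here is a minimal sketch on made-up events for two hypothetical nodes (`node-a` and `node-b`); none of these values come from a real run. Both aggregations follow row order rather than the `time` values, so the sketch sorts by `time` before pivoting.

```python
import pandas as pd

# Made-up events for two hypothetical nodes; values are illustrative only
toy = pd.DataFrame(
    {
        "time": pd.to_datetime(
            [
                "2025-10-01T09:10:30Z",
                "2025-10-01T09:10:00Z",
                "2025-10-01T09:11:00Z",
                "2025-10-01T09:10:10Z",
                "2025-10-01T09:11:30Z",
            ],
            utc=True,
        ),
        "event": [
            "node:ready_patch",
            "node:ready_patch",
            "pod:scheduled",
            "node:ready_patch",
            "pod:scheduled",
        ],
        "node": ["node-a", "node-a", "node-a", "node-b", "node-a"],
    }
)

# Sorting first makes "first"/"last" mean "earliest"/"latest"
pivoted = toy.sort_values("time").pivot_table(
    values="time", index="node", columns="event", aggfunc=["first", "last"]
)

# Flatten the (aggfunc, event) MultiIndex into "aggfunc:event" names
pivoted.columns = [":".join(col) for col in pivoted.columns]

# Keep only the two columns the analysis needs
print(pivoted[["last:node:ready_patch", "first:pod:scheduled"]])
```

In this toy data `node-b` never receives a pod, so its `first:pod:scheduled` value is missing.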

Since the only `pod:scheduled` events ingested are for the (Symphony) workload pods, we can drop the rows (nodes) which don't have a value in this column. These nodes belong to system or operator node pools.

In [None]:
df = df[~df["first:pod:scheduled"].isna()]

print(f"Total lines in DataFrame: {len(df.index)}")

We will now extract the number of cores and the machine family from the node name. Since node names follow our framework's naming pattern, both values can be deduced for each node.

In [None]:
import re

machine_config = {}

for machine in df.index:
    match = re.match(
        r"gke-cluster-test-0-(.*)-(.*)-node-pool-te-.*",
        machine
    )
    if match is None:
        raise ValueError(f"Failed to match machine config for node {machine!r}.")
    groups = match.groups()
    machine_config[machine] = {
        "machine:family": groups[0],
        "machine:cores": int(groups[1])
    }

machine_config = pd.DataFrame.from_dict(machine_config, orient="index")
df = df.join(machine_config)

df.head()
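
The name parsing can be tried in isolation. The sketch below is an assumed, slightly stricter variant of the expression above: it uses named groups and only accepts digits for the core count. The helper `parse_node_name` and the two node names are hypothetical and serve only as illustration.

```python
import re

# Assumed pattern: a stricter variant of the expression used in the
# notebook, with named groups and a digits-only core count
NODE_NAME = re.compile(
    r"gke-cluster-test-0-(?P<family>.+)-(?P<cores>\d+)-node-pool-te-.*"
)


def parse_node_name(name):
    """Return (machine family, core count) or raise ValueError."""
    match = NODE_NAME.match(name)
    if match is None:
        raise ValueError(f"Unexpected node name: {name!r}")
    return match.group("family"), int(match.group("cores"))


# Hypothetical node names that follow the naming pattern
print(parse_node_name("gke-cluster-test-0-n2-8-node-pool-te-1a2b3c4d-x1y2"))
print(parse_node_name("gke-cluster-test-0-n2-highmem-16-node-pool-te-5e6f7a8b-z3w4"))
```

Because the family group is greedy, a family that itself contains hyphens is still captured whole, and the core count is always the last number before `-node-pool-te-`.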

We now have all the data required for the plot, but not yet in a shape suited to visualization. The last step is to reshape the data so that it can be plotted.

In [None]:
# Let us sort by the time the first pod was scheduled
pod_count = (
    df["first:pod:scheduled"]
        .reset_index()
        .set_index("first:pod:scheduled")
        .sort_index()
)

# This is equivalent to counting the number of rows,
# or more precisely, generating a column corresponding to the
# running count of the first pod scheduled events
pod_count["pod:count"] = range(1, len(pod_count.index) + 1)

# We remove the node names, as we do not plot this
pod_count = pod_count.drop(columns=["node"])

pod_count.head()

In [None]:
# We do the same for the nodes, but add the cumulative sum of the cores
node_count = (
    df[["last:node:ready_patch", "machine:cores"]]
        .reset_index()
        .set_index("last:node:ready_patch")
        .sort_index()
)

# Equivalent to the count index of nodes
node_count["node:count"] = range(1, len(node_count.index) + 1)

# We generate the cumulative sum of the cores
node_count["node:cores_sum"] = node_count["machine:cores"].cumsum()

# Remove node name and individual cores values
node_count = node_count.drop(columns=["node", "machine:cores"])

node_count.head()
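
The running count and the cumulative core sum are easiest to follow on a tiny example. The sketch below uses three hypothetical nodes with made-up ready times and core counts; it mirrors the steps of the cell above without relying on real data.

```python
import pandas as pd

# Made-up ready times and core counts for three hypothetical nodes
toy = pd.DataFrame(
    {
        "last:node:ready_patch": pd.to_datetime(
            [
                "2025-10-01T09:10:40Z",
                "2025-10-01T09:10:05Z",
                "2025-10-01T09:10:20Z",
            ],
            utc=True,
        ),
        "machine:cores": [8, 4, 16],
    },
    index=pd.Index(["node-a", "node-b", "node-c"], name="node"),
)

# Order the nodes by the time they became ready
steps = toy.set_index("last:node:ready_patch").sort_index()

# Running number of ready nodes and running total of their cores
steps["node:count"] = range(1, len(steps) + 1)
steps["node:cores_sum"] = steps["machine:cores"].cumsum()

print(steps)
```

Each row then reads as "at this time, this many nodes and this many cores were ready", which is the step function drawn in the plot.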

The data is now ready to be plotted. The remaining code only builds and formats the figure.

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

parsed_data = pd.concat([
    pod_count,
    node_count
])

# Artificially add a zero row at the request time
parsed_data = pd.concat([
    pd.DataFrame([[0,0,0]], index=[REQUEST_TIMESTAMP], columns=parsed_data.columns),
    parsed_data
])

parsed_data = parsed_data.rename(columns={
    "node:count": "GkeNumberOfNodes",
    "node:cores_sum": "GkeNumberOfCores",
    "pod:count": "NumberOfSymphonyPods"
})
# parsed_data.index.name = "Timestamp"

# Convert the index to seconds elapsed since REQUEST_TIMESTAMP
parsed_data.index = (parsed_data.index - REQUEST_TIMESTAMP).total_seconds()
parsed_data.index.name = "TimeAfterSymphonyRequest"

fig = make_subplots(specs=[[{"secondary_y": True}]])


fig.add_trace(
    go.Scatter(
        x=parsed_data.index,
        y=parsed_data.GkeNumberOfNodes,
        mode="lines",
        name="GKE - Number of nodes"
    )
)

fig.add_trace(
    go.Scatter(
        x=parsed_data.index,
        y=parsed_data.GkeNumberOfCores,
        mode="lines",
        name="GKE - Number of cores"
    ),
    secondary_y=True,
)

fig.add_trace(
    go.Scatter(
        x=parsed_data.index,
        y=parsed_data.NumberOfSymphonyPods,
        mode="lines",
        name="Number of Symphony Pods"
    )
)

fig.update_layout(
    title="Scaling performance of IBM Spectrum Symphony connector for GKE",
    plot_bgcolor="white",
    legend=dict(
        x=0.005,
        y=0.95,
        bordercolor='black',
        borderwidth=1
    ),
    # xaxis_range=[
    #     REQUEST_TIMESTAMP,
    #     REQUEST_TIMESTAMP + datetime.timedelta(minutes=10)
    # ],
    xaxis_range=[0,600]
)


fig.update_xaxes(
    mirror=True,
    ticks='outside',
    showline=True,
    linecolor='black',
    gridcolor='lightgrey',
    title_text="Time after Symphony HostFactory GKE plugin request", 
    tickvals=list(range(60,660,60)),
    ticktext=[f"{x} min" for x in range(1,11)],
    # tickangle=45
)

fig.update_yaxes(
    mirror=True,
    ticks='outside',
    showline=True,
    linecolor='black',
    gridcolor='lightgrey',
    tickcolor="lightgrey",
    zerolinecolor='lightgrey',
    title_text="Number of Pods and Nodes" 

)

fig.update_yaxes(
    mirror=True,
    ticks='outside',
    showline=False,
    showgrid=False,
    linecolor=None,
    gridcolor=None,
    title_text="Number of Cores",
    secondary_y=True,   
)

fig.update_traces(
    connectgaps=True
)

fig.show()
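
Concatenating `pod_count` and `node_count` gives a sparse table: each row only has values for the columns of the frame it came from, which is why the traces are drawn with `connectgaps=True`. If a dense table is preferred, one possible approach (not used in this notebook) is to sort by time and forward-fill. The sketch below shows this on made-up values indexed by seconds after the request.

```python
import pandas as pd

# Made-up series, indexed by seconds after the request
pods = pd.DataFrame({"pod:count": [1, 2]}, index=[70.0, 95.0])
nodes = pd.DataFrame({"node:count": [1, 2]}, index=[60.0, 80.0])

# Each row only carries the columns of the frame it came from
sparse = pd.concat([pods, nodes])
print(sparse)

# Sort by time, carry the last known value forward, start counts at zero
dense = sparse.sort_index().ffill().fillna(0).astype(int)
print(dense)
```

The dense form is convenient for reading the state of the cluster at any given time, while the sparse form keeps exactly one row per observed event.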



Finally, optionally save all the data.

In [None]:
DATA_FOLDER = Path(f"/home/user/data-{RUN_ID}")
# Create the output folder if it does not exist yet
DATA_FOLDER.mkdir(parents=True, exist_ok=True)

# Save the metadata
with open(DATA_FOLDER.joinpath("metadata.json"), "w") as fh:
    json.dump(
        {
            "PROJECT_ID": PROJECT_ID,
            "DATASET_ID": DATASET_ID,
            "TABLE_ID": TABLE_ID,
            "REQUEST_TIMESTAMP": REQUEST_TIMESTAMP.isoformat() if REQUEST_TIMESTAMP is not None else None,
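            # RETURN_TIMESTAMP is never parsed above, so it may still be a string (or None)
            "RETURN_TIMESTAMP": RETURN_TIMESTAMP.isoformat() if isinstance(RETURN_TIMESTAMP, datetime.datetime) else RETURN_TIMESTAMP,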
            "RUN_ID": RUN_ID
        },
        fh,
        indent=4
    )

# Save the corresponding query
with open(DATA_FOLDER.joinpath("QUERY.sql"), "w") as fh:
    fh.write(QUERY)

# Save the corresponding raw data
raw_df.to_parquet(DATA_FOLDER.joinpath(f"{RUN_ID}-raw.parquet"))

# Save the corresponding indexed data
df.to_csv(DATA_FOLDER.joinpath(f"{RUN_ID}-indexed.csv"))

# Save the corresponding plot data
parsed_data.to_csv(DATA_FOLDER.joinpath(f"{RUN_ID}-sparse.csv"))

# Save the plot itself
with open(DATA_FOLDER.joinpath(f"{RUN_ID}.html"), "w") as fh:
    fig.write_html(fh)

# Static image export requires the `kaleido` package to be installed
fig.write_image(
    file=DATA_FOLDER.joinpath(f"{RUN_ID}.png"),
    format="png",
    width=800,
    height=500,
)

fig.write_image(
    file=DATA_FOLDER.joinpath(f"{RUN_ID}.svg"),
    format="svg",
    width=800,
    height=500,
)
