# Visualizing run logs, metrics, and cost

This notebook explores the logs and checkpoints generated during training.

Some of the logging is optional, especially during full-scale training runs. All logs share a versioned naming convention so that a specific run is easy to retrieve.

In [39]:
import json
import os
import pandas as pd 
import plotly.graph_objects as go 

## Available logs

In [40]:
ckpts = os.listdir("checkpoints")
csv_logs = os.listdir("logs/csv-logs/")
perf_logs = os.listdir("logs/perf/")
prof_logs = os.listdir("logs/profiler/")

Checkpoints are .ckpt files that hold the model weights. To use these files, we can load them directly into our custom LightningModules with [.load_from_checkpoint](https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html#lightningmodule-from-checkpoint).

Here are our available checkpoints:

In [41]:
ckpts

['microsoft-resnet-50_1xNVIDIA-A10G_LR5e-05_BS16_2024-02-06T20:41:21.727645.ckpt',
 'microsoft-resnet-50_1xTesla-T4_LR5e-05_BS16_2024-02-07T15:01:57.886575.ckpt',
 'microsoft-resnet-50_1xTesla-V100-SXM2-16GB_LR5e-05_BS16_2024-02-07T13:29:59.303672.ckpt']
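
The file names above appear to follow the pattern `model_devices_LR<lr>_BS<bs>_<timestamp>`. Assuming that pattern holds for every run, a name can be split back into its parts with a small helper. `RunInfo` and `parse_run_name` are hypothetical names introduced for illustration and are not part of the project code.

```python
from datetime import datetime
from typing import NamedTuple


class RunInfo(NamedTuple):
    model: str
    devices: str
    learning_rate: float
    batch_size: int
    started: datetime


def parse_run_name(name: str) -> RunInfo:
    """Split a run name of the assumed form model_devices_LR<lr>_BS<bs>_<iso-timestamp>."""
    # Strip a known extension explicitly. The timestamp contains a dot, so
    # os.path.splitext would misfire on names that have no extension.
    for ext in (".ckpt", ".json"):
        if name.endswith(ext):
            name = name[: -len(ext)]
    model, devices, lr, bs, stamp = name.split("_")
    return RunInfo(
        model=model,
        devices=devices,
        learning_rate=float(lr[2:]),  # drop the "LR" prefix
        batch_size=int(bs[2:]),       # drop the "BS" prefix
        started=datetime.fromisoformat(stamp),
    )


info = parse_run_name(
    "microsoft-resnet-50_1xTesla-T4_LR5e-05_BS16_2024-02-07T15:01:57.886575.ckpt"
)
print(info.devices, info.learning_rate, info.batch_size, info.started.date())
```

The sketch relies on the model name containing no underscores, which is true for the names listed here but is not guaranteed in general.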

PyTorch Lightning integrates with several experiment managers to log training metrics. My personal favorite, and perhaps the simplest to use for custom visualizations, is [CSVLogger](https://lightning.ai/docs/pytorch/stable/extensions/generated/lightning.pytorch.loggers.CSVLogger.html).

The logs saved during training runs are shown below:

In [42]:
csv_logs

['microsoft-resnet-50_1xNVIDIA-A10G_LR5e-05_BS16_2024-02-06T20:41:21.727645',
 'microsoft-resnet-50_1xTesla-T4_LR5e-05_BS16_2024-02-07T15:01:57.886575',
 'microsoft-resnet-50_1xTesla-V100-SXM2-16GB_LR5e-05_BS16_2024-02-07T13:29:59.303672']

PyTorch Lightning also comes equipped with several [profilers](https://lightning.ai/docs/pytorch/stable/api_references.html#profiler). Here, we use [PyTorchProfiler](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.profilers.PyTorchProfiler.html#pytorchprofiler). This profiler uses PyTorch’s Autograd Profiler and lets you inspect the cost of different operators inside your model - both on the CPU and GPU.

Our available profiling logs were gathered during perf runs. Running the profiler is really only necessary to identify bottlenecks and performance bugs, so the perf runs are an appropriate place for it.

> **Note**
>
> It is not necessary to use a profiler during training.

In [43]:
prof_logs

['microsoft-resnet-50_1xNVIDIA-A10G_LR5e-05_BS16_2024-02-06T20:41:21.727645',
 'microsoft-resnet-50_1xTesla-T4_LR5e-05_BS16_2024-02-07T15:01:57.886575',
 'microsoft-resnet-50_1xTesla-V100-SXM2-16GB_LR5e-05_BS16_2024-02-07T13:29:59.303672']

To get a basic sense of training times, `utils.py` provides a `log_perf` function. Using `log_perf` saves a simple JSON file with some basic metrics on the training run, notably the CPU or GPU type and the amount of time it took to complete a minimum number of epochs.

These files were created by passing the `--perf` flag to `trainer.py`:

```sh
python trainer.py --perf=True
```

Doing the above runs your Trainer for 5 epochs and times the run; note that the example runs logged below each record 15 epochs. This can help to determine which machine to use before initiating a longer run.
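
The actual `log_perf` implementation lives in `utils.py` and is not shown in this notebook. As a rough sketch only, a helper of this kind might time a run and write a JSON object under a top-level `perf` key, which is the layout the loading code later in this notebook reads (`new["perf"]["device_name"]`). The function name `write_perf_log`, its parameters, and the file name are hypothetical.

```python
import json
import os
import tempfile
import time


def write_perf_log(path: str, device_name: str, epochs: int, runtime_min: float) -> None:
    """Hypothetical stand-in for a perf logger: one JSON object under a "perf" key."""
    record = {
        "perf": {
            "device_name": device_name,
            "epochs": epochs,
            "runtime_min": runtime_min,
        }
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)


start = time.perf_counter()
# ... a training run would happen here ...
elapsed_min = (time.perf_counter() - start) / 60

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "example-run.json")
    write_perf_log(log_path, device_name="Tesla T4", epochs=15, runtime_min=elapsed_min)
    with open(log_path) as f:
        loaded = json.load(f)

print(loaded["perf"]["device_name"])
```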

Logs for the example runs are:

In [44]:
perf_logs

['microsoft-resnet-50_1xNVIDIA-A10G_LR5e-05_BS16_2024-02-06T20:41:21.727645.json',
 'microsoft-resnet-50_1xTesla-T4_LR5e-05_BS16_2024-02-07T15:01:57.886575.json',
 'microsoft-resnet-50_1xTesla-V100-SXM2-16GB_LR5e-05_BS16_2024-02-07T13:29:59.303672.json']

## Examining training times

First, let's read in the performance logs:

In [45]:
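frames = []
for results in perf_logs:
    # each file is read as a single "perf" column indexed by metric name
    new = pd.read_json(os.path.join("logs/perf", results))
    # name the column after the device so it becomes the row label after transposing
    new = new.rename(columns={"perf": new["perf"]["device_name"]})
    frames.append(new)
# one column per run, transposed to one row per run
perf_df = pd.concat(frames, axis=1).T
perf_df = perf_df.sort_values("runtime_min")
perf_df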

| | batch_size | device_name | epochs | global_step | max_epochs | min_epochs | num_devices | num_node | precision | runtime_min | strategy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NVIDIA A10G | 16 | NVIDIA A10G | 15 | 37500 | 15 | 0 | 1 | 1 | 16-mixed | 36.107499 | SingleDeviceStrategy |
| Tesla T4 | 16 | Tesla T4 | 15 | 37500 | 15 | 0 | 1 | 1 | 16-mixed | 57.729672 | SingleDeviceStrategy |
| Tesla V100-SXM2-16GB | 16 | Tesla V100-SXM2-16GB | 15 | 37500 | 15 | 0 | 1 | 1 | 16-mixed | 67.923304 | SingleDeviceStrategy |


And then, we can find the machine with the minimum training time:

In [46]:
best_run_perf = perf_df.loc[perf_df["runtime_min"] == perf_df["runtime_min"].min(), :]

display(best_run_perf)

print(f'The machine with the minimum run time is {best_run_perf.index[0]} at {round(best_run_perf["runtime_min"].iloc[0], 2)} minutes')

| | batch_size | device_name | epochs | global_step | max_epochs | min_epochs | num_devices | num_node | precision | runtime_min | strategy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NVIDIA A10G | 16 | NVIDIA A10G | 15 | 37500 | 15 | 0 | 1 | 1 | 16-mixed | 36.107499 | SingleDeviceStrategy |


The machine with the minimum run time is NVIDIA A10G at 36.11 minutes
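
To put the other machines in context, each runtime can also be expressed relative to the fastest one. The following is a plain-Python sketch using the `runtime_min` values copied from the table above; it does not depend on `perf_df`.

```python
# Runtimes in minutes, copied from the perf table in this notebook.
runtime_min = {
    "NVIDIA A10G": 36.107499,
    "Tesla T4": 57.729672,
    "Tesla V100-SXM2-16GB": 67.923304,
}

fastest = min(runtime_min, key=runtime_min.get)

# Ratio > 1 means the machine took that many times longer than the fastest.
relative = {machine: t / runtime_min[fastest] for machine, t in runtime_min.items()}

for machine, ratio in relative.items():
    print(f"{machine}: {ratio:.2f}x the runtime of {fastest}")
```

For these runs the T4 takes roughly 1.6 times as long as the A10G, and the V100 roughly 1.9 times as long.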


We can also plot the results:

In [47]:
fig = go.Figure(
    go.Scatter(
        x=perf_df.index, 
        y=perf_df["runtime_min"], 
        mode="markers", 
        marker=dict(size=20), showlegend=False
    ),
    layout_title_text="Run Time (in minutes) for 15 Epochs"
)

fig.update_layout(
    xaxis_title="Machine Type",
    yaxis_title="Run time in minutes"
)
fig.show()

## Visualizing training metrics

Given that hyperparameter optimization was not within the scope of this Studio, we can expect our training losses to be sub-optimal. Still, let's demonstrate how we can use CSVLogger's output to visualize our logged metrics.

In [48]:
runs = {}

for log_dir in csv_logs:
    # the second underscore-separated field is the device, e.g. "1xTesla-T4"
    machine = log_dir.split("_")[1]
    runs[machine] = pd.read_csv(os.path.join("logs/csv-logs", log_dir, "metrics.csv"))
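
A `metrics.csv` file written by CSVLogger typically holds one row per logging call, so rows that carry training metrics leave the validation columns empty and vice versa. That is why the plots in this notebook filter on non-missing values. The sketch below illustrates the idea with the standard `csv` module on made-up sample data; the `train-loss` and `val-acc` column names come from this notebook, while the `epoch` and `step` columns and all values are assumptions.

```python
import csv
import io

# Made-up sample in the assumed shape of a metrics.csv file.
sample = """epoch,step,train-loss,val-acc
0,49,2.31,
0,99,2.05,
0,99,,0.41
1,149,1.87,
1,199,,0.56
"""

train_loss = []  # (step, loss) pairs
val_acc = []     # (step, accuracy) pairs

for row in csv.DictReader(io.StringIO(sample)):
    step = int(row["step"])
    # An empty string marks a metric that was not logged on this row.
    if row["train-loss"]:
        train_loss.append((step, float(row["train-loss"])))
    if row["val-acc"]:
        val_acc.append((step, float(row["val-acc"])))

print(len(train_loss), "training rows,", len(val_acc), "validation rows")
```

With pandas, the same separation is what `notna()` achieves on the corresponding column.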

And let's plot the training loss. The loop below filters to a single run, the Tesla T4:

In [49]:
fig = go.Figure()

for run in runs.keys():
    if run == "1xTesla-T4":
        fig.add_trace(go.Scatter(x=runs[run].index, y=runs[run]["train-loss"], name=run))

        fig.update_layout(
            xaxis_title="Global Step (logged every 50 steps)", 
            yaxis_title="Loss", 
            title=dict(text=f"Training Loss for {run}"),
            legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="left", x=0)
        )
fig.show()

Then the validation accuracy:

In [55]:
fig = go.Figure()

for run in runs.keys():
    if run == "1xTesla-T4":
        data = runs[run].loc[runs[run]["val-acc"].notna(), "val-acc"]
        fig.add_trace(go.Scatter(x=data.index, y=data, marker=dict(size=20), name=run))

        fig.update_layout(
            xaxis_title="Step (logged every 50th global step)", 
            yaxis_title="Accuracy", 
            title=dict(text=f"Validation Accuracy for {run}"),
            legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
        )
fig.show()

## Determining cost to train

Let's assume we've done hyperparameter optimization (HPO), have our best parameter configuration, and are ready to train.

Here, we create an hourly cost column for `perf_df`. Keep in mind that `perf_df` was sorted by `runtime_min`, so each assigned value needs to match its index label:

In [51]:
# hourly machine rates in USD
v100 = 2.37
a10g = 1.8
t4 = 0.68

perf_df["cost_per_hour"] = [0., 0., 0.]

perf_df.loc[perf_df.index == "Tesla T4", "cost_per_hour"] = t4
perf_df.loc[perf_df.index == "NVIDIA A10G", "cost_per_hour"] = a10g
perf_df.loc[perf_df.index == "Tesla V100-SXM2-16GB", "cost_per_hour"] = v100

perf_df

| | batch_size | device_name | epochs | global_step | max_epochs | min_epochs | num_devices | num_node | precision | runtime_min | strategy | cost_per_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NVIDIA A10G | 16 | NVIDIA A10G | 15 | 37500 | 15 | 0 | 1 | 1 | 16-mixed | 36.107499 | SingleDeviceStrategy | 1.8 |
| Tesla T4 | 16 | Tesla T4 | 15 | 37500 | 15 | 0 | 1 | 1 | 16-mixed | 57.729672 | SingleDeviceStrategy | 0.68 |
| Tesla V100-SXM2-16GB | 16 | Tesla V100-SXM2-16GB | 15 | 37500 | 15 | 0 | 1 | 1 | 16-mixed | 67.923304 | SingleDeviceStrategy | 2.37 |


Next, let's calculate the cost of fine-tuning:

In [52]:
perf_df["actual_cost"] = (perf_df["runtime_min"] / 60 ) * perf_df["cost_per_hour"]

In [53]:
perf_df

| | batch_size | device_name | epochs | global_step | max_epochs | min_epochs | num_devices | num_node | precision | runtime_min | strategy | cost_per_hour | actual_cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NVIDIA A10G | 16 | NVIDIA A10G | 15 | 37500 | 15 | 0 | 1 | 1 | 16-mixed | 36.107499 | SingleDeviceStrategy | 1.8 | 1.083225 |
| Tesla T4 | 16 | Tesla T4 | 15 | 37500 | 15 | 0 | 1 | 1 | 16-mixed | 57.729672 | SingleDeviceStrategy | 0.68 | 0.65427 |
| Tesla V100-SXM2-16GB | 16 | Tesla V100-SXM2-16GB | 15 | 37500 | 15 | 0 | 1 | 1 | 16-mixed | 67.923304 | SingleDeviceStrategy | 2.37 | 2.682971 |
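
The `actual_cost` values can be checked by hand: cost equals runtime in hours multiplied by the hourly rate. A minimal plain-Python version of the same arithmetic, using the numbers from the tables above (`run_cost` is a hypothetical helper name):

```python
# Runtimes (minutes) and hourly rates (USD) copied from the tables in this notebook.
runtime_min = {
    "NVIDIA A10G": 36.107499,
    "Tesla T4": 57.729672,
    "Tesla V100-SXM2-16GB": 67.923304,
}
cost_per_hour = {
    "NVIDIA A10G": 1.8,
    "Tesla T4": 0.68,
    "Tesla V100-SXM2-16GB": 2.37,
}


def run_cost(minutes: float, hourly_rate: float) -> float:
    """Convert minutes to hours, then multiply by the hourly rate."""
    return minutes / 60 * hourly_rate


actual_cost = {m: run_cost(runtime_min[m], cost_per_hour[m]) for m in runtime_min}
cheapest = min(actual_cost, key=actual_cost.get)

# Print machines from cheapest to most expensive.
for machine, cost in sorted(actual_cost.items(), key=lambda kv: kv[1]):
    print(f"{machine}: ${cost:.2f}")
```

Keying both dictionaries by machine name avoids the ordering concern that arises when assigning a positional list to a sorted DataFrame.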


Let's visualize these costs:

In [54]:
fig = go.Figure()

for run in perf_df.index:
    fig.add_trace(go.Scatter(x=perf_df.loc[perf_df.index == run, "runtime_min"], y=perf_df.loc[perf_df.index == run, "actual_cost"], name=run, marker=dict(size=20)))

fig.update_layout(
    xaxis_title="runtime in minutes", 
    yaxis_title="actual cost (USD)", 
    title=dict(text="Cost by Machine"),
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
)

fig.show()

## Conclusion

This particular scenario is relatively cheap on all machines, with the V100 the most expensive at $2.68 USD and the T4 the cheapest at just $0.65 USD!

The A10G completes training in about 63% of the time the T4 needs (roughly a third less) and nearly twice as fast as the V100. The A10G is also the second cheapest machine for this scenario, costing just $1.08 USD to complete the training.