# U-Net 3D V100 Analysis 

## Notebook Overview

This notebook presents an analysis of a U-Net 3D workload using the DFAnalyzer tool. It demonstrates how to analyze I/O traces collected from a deep learning application running on a V100 system. The workflow includes:

- Setting up the environment and importing necessary libraries.
- Extracting and preparing trace data for analysis.
- Initializing DFAnalyzer with appropriate configuration.
- Running the analysis to generate summarized I/O statistics and views.
- Displaying and interpreting the results, including bandwidth and operation counts over time ranges for different I/O layers.

The notebook is intended to help users understand the I/O behavior of deep learning workloads and provides a template for similar analyses on other datasets.

## Interactive Analysis

### Prepare Environment

In this section, we set up the environment by importing required libraries, configuring warning filters, and updating the Python path to include the DFAnalyzer workspace. 

In [1]:
import os
import sys
import warnings

# Add DFAnalyzer to the path
workspace_dir = os.path.abspath("../")
sys.path.append(workspace_dir)

# Filter warnings
warnings.filterwarnings('ignore')

### Prepare Trace Data

Then, we extract the trace data archive into the designated directory to prepare it for analysis with DFAnalyzer.

In [7]:
!mkdir -p {workspace_dir}/tests/data/extracted/dftracer-dlio
!tar -xzf {workspace_dir}/tests/data/dftracer-dlio.tar.gz -C {workspace_dir}/tests/data/extracted/dftracer-dlio
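
The two shell commands above can also be written in pure Python, which may be convenient when the notebook logic is reused outside Jupyter. The following is a minimal sketch using only the standard library; the helper name `extract_archive` is hypothetical and not part of DFAnalyzer.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir, creating it if needed."""
    dest = Path(dest_dir)
    # Same effect as `mkdir -p`: create missing parents, ignore an existing directory.
    dest.mkdir(parents=True, exist_ok=True)
    # Same effect as `tar -xzf <archive> -C <dest>`.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        archive.extractall(dest)
    return dest
```

With the paths used in this notebook, the call would be `extract_archive(f"{workspace_dir}/tests/data/dftracer-dlio.tar.gz", f"{workspace_dir}/tests/data/extracted/dftracer-dlio")`.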

### Run Analysis

Next, we initialize DFAnalyzer with the configuration below. The trace analysis itself runs in a later cell and generates summarized I/O statistics and views for further exploration.

In [11]:
from dfanalyzer import init_with_hydra

percentile = 0.9
run_dir = f"{workspace_dir}/notebooks/.dfanalyzer/unet3d_v100_hdf5"
time_granularity = 5e6  # 5 seconds, expressed in microseconds
trace_path = f"{workspace_dir}/tests/data/extracted/dftracer-dlio"
view_types = ["time_range", "proc_name"]

dfa = init_with_hydra(
    hydra_overrides=[
        'analyzer=dftracer',
        'analyzer/preset=dlio',
        'analyzer.checkpoint=False',
        f"analyzer.time_granularity={time_granularity}",
        f"hydra.run.dir={run_dir}",
        f"percentile={percentile}",
        f"trace_path={trace_path}",
    ]
)
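
The value `5e6` equals 5 seconds only if trace timestamps are measured in microseconds, which is what the comment on `time_granularity` implies. As a rough illustration of what a time granularity means, the sketch below assigns a timestamp to a fixed-width window by integer division. The binning rule is an assumption made for illustration, not DFAnalyzer's confirmed implementation, and `to_time_range` is a hypothetical helper.

```python
def to_time_range(timestamp_us, granularity_us=5e6):
    """Map a microsecond timestamp to the index of its fixed-width time window.

    Assumption: windows are contiguous, start at timestamp 0, and are
    numbered from 0. This is an illustrative rule, not DFAnalyzer's code.
    """
    return int(timestamp_us // granularity_us)


# An event that starts 1,300,840 microseconds into the run lands in window 0.
print(to_time_range(1_300_840))
# An event that starts 12 seconds into the run lands in window 2.
print(to_time_range(12_000_000))
```

Under this assumption, a smaller granularity yields more, narrower windows and therefore a finer-grained `time_range` view.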

We access the underlying Dask client via our Python API.

In [3]:
dfa.client

Client:

- Connection method: Cluster object
- Cluster type: distributed.LocalCluster
- Dashboard: http://127.0.0.1:8787/status

Cluster:

- Dashboard: http://127.0.0.1:8787/status
- Workers: 4
- Total threads: 12
- Total memory: 0 B
- Status: running
- Using processes: True

Scheduler:

- Comm: tcp://127.0.0.1:33367
- Workers: 4
- Dashboard: http://127.0.0.1:8787/status
- Total threads: 12
- Started: Just now
- Total memory: 0 B

Workers:

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:40577 | http://127.0.0.1:44869/status | tcp://127.0.0.1:37289 | 3 | 0 B | /tmp/dfanalyzer-izzet/0/dask-worker-space/worker-3n9ole8w |
| tcp://127.0.0.1:45709 | http://127.0.0.1:35159/status | tcp://127.0.0.1:42279 | 3 | 0 B | /tmp/dfanalyzer-izzet/0/dask-worker-space/worker-pgzcsfb5 |
| tcp://127.0.0.1:37937 | http://127.0.0.1:38957/status | tcp://127.0.0.1:43881 | 3 | 0 B | /tmp/dfanalyzer-izzet/0/dask-worker-space/worker-jpsjv22q |
| tcp://127.0.0.1:34955 | http://127.0.0.1:45505/status | tcp://127.0.0.1:43317 | 3 | 0 B | /tmp/dfanalyzer-izzet/0/dask-worker-space/worker-1_jxxo60 |

We access the current preset configuration as follows.

In [4]:
dict(dfa.analyzer.preset.layer_defs)

{'app': 'func_name == "DLIOBenchmark.run"',
 'training': 'func_name == "DLIOBenchmark._train"',
 'compute': 'cat == "ai_framework"',
 'fetch_data': 'func_name.isin(["<module>.iter", "fetch-data.iter", "loop.iter"])',
 'data_loader': 'cat == "data_loader" & ~func_name.isin(["loop.iter", "loop.yield"])',
 'data_loader_fork': 'cat == "posix" & func_name == "fork"',
 'reader': 'cat == "reader"',
 'reader_posix_lustre': 'cat.str.contains("posix|stdio") & cat.str.contains("_reader_lustre")',
 'checkpoint': 'cat == "checkpoint"',
 'checkpoint_posix_lustre': 'cat.str.contains("posix|stdio") & cat.str.contains("_checkpoint_lustre")',
 'checkpoint_posix_ssd': 'cat.str.contains("posix|stdio") & cat.str.contains("_checkpoint_ssd")',
 'other_posix': 'cat.isin(["posix", "stdio"])'}
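
Each layer definition is a boolean expression over trace columns such as `cat` and `func_name`. To illustrate what the `reader_posix_lustre` and `other_posix` expressions select, here is a minimal pure-Python sketch of the same predicates applied to individual category strings. It assumes `str.contains` follows the pandas default of a regular-expression search; the function names are hypothetical, and this is not a description of how DFAnalyzer evaluates the expressions internally.

```python
import re


def is_reader_posix_lustre(cat):
    """Mirror of: cat.str.contains("posix|stdio") & cat.str.contains("_reader_lustre")."""
    has_io_prefix = re.search(r"posix|stdio", cat) is not None
    has_layer_suffix = re.search(r"_reader_lustre", cat) is not None
    return has_io_prefix and has_layer_suffix


def is_other_posix(cat):
    """Mirror of: cat.isin(["posix", "stdio"]), which is an exact match."""
    return cat in ("posix", "stdio")


# "stdio_reader_lustre" is a made-up category used only to exercise the pattern.
for cat in ["posix_reader_lustre", "stdio_reader_lustre", "posix", "storage"]:
    print(cat, is_reader_posix_lustre(cat), is_other_posix(cat))
```

The sketch highlights the design of the preset: suffixed categories such as `posix_reader_lustre` are attributed to a specific layer, while bare `posix` or `stdio` events fall through to `other_posix`.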

We run the analysis via the `analyze_trace` function as follows.

In [None]:
result = dfa.analyze_trace(percentile=percentile, view_types=view_types)

Then, using the `output` attribute of our analyzer instance `dfa`, we print the DFAnalyzer summary.

In [6]:
dfa.output.handle_result(result)

### Result Exploration

We access the high-level characteristics, as well as the layer-based characteristics and metrics, via our Python API as follows:

In [7]:
result._traces.head()

| | func_name | cat | type | pid | tid | time_start | time_end | time | tinterval | time_range | ... | size_bin_16kib_64kib | size_bin_64kib_256kib | size_bin_256kib_1mib | size_bin_1mib_4mib | size_bin_4mib_16mib | size_bin_16mib_64mib | size_bin_64mib_256mib | size_bin_256mib_1gib | size_bin_1gib_4gib | size_bin_4gib_plus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | start | dftracer | 0 | 1028571 | 1028571 | 0 | 0 | 0.0 | | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | FileStorage.get_uri | storage | 0 | 1028571 | 1028571 | 1300840 | 1300851 | 1.1e-05 | | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | opendir | posix_reader_lustre | 0 | 1028571 | 1028571 | 1300907 | 1304903 | 0.003996 | | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | FileStorage.walk_node | storage | 0 | 1028571 | 1028571 | 1300805 | 1305420 | 0.004615 | | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | FileStorage.get_uri | storage | 0 | 1028571 | 1028571 | 1305523 | 1305531 | 8e-06 | | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [8]:
result._main_views['reader_posix_lustre'].head()

| time_range | proc_name | bw | close_count | close_file_name | close_ops | close_time | count | data_bw | data_count | data_file_name | data_intensity | ... | write_size_bin_1gib_4gib | write_size_bin_1mib_4mib | write_size_bin_256kib_1mib | write_size_bin_256mib_1gib | write_size_bin_4gib_plus | write_size_bin_4kib_16kib | write_size_bin_4mib_16mib | write_size_bin_64kib_256kib | write_size_bin_64mib_256mib | write_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | app#corona171#1028571#1028571 | 376376400.0 | 4 | {/p/lustre3/izzet/dlio-benchmark-test/unet3d_v... | 1225.114855 | 0.003265 | 56 | 380482088.081022 | 36 | {/p/lustre3/izzet/dlio-benchmark-test/unet3d_v... | 0.0 | ... | | | | | | | | | | |
| 1 | app#corona171#1028571#1028571 | 398941000.0 | 7 | {/p/lustre3/izzet/dlio-benchmark-test/unet3d_v... | 1573.741007 | 0.004448 | 103 | 401318163.866653 | 72 | {/p/lustre3/izzet/dlio-benchmark-test/unet3d_v... | 0.0 | ... | | | | | | | | | | |
| 2 | app#corona171#1028571#1028571 | 403104900.0 | 6 | {/p/lustre3/izzet/dlio-benchmark-test/unet3d_v... | 1433.006926 | 0.004187 | 78 | 405925022.015696 | 54 | {/p/lustre3/izzet/dlio-benchmark-test/unet3d_v... | 0.0 | ... | | | | | | | | | | |
| 3 | app#corona171#1028571#1028571 | 800005400.0 | 9 | {/p/lustre3/izzet/dlio-benchmark-test/unet3d_v... | 1363.429783 | 0.006601 | 117 | 807332890.471577 | 81 | {/p/lustre3/izzet/dlio-benchmark-test/unet3d_v... | 0.0 | ... | | | | | | | | | | |
| 4 | app#corona171#1028571#1028571 | 1117199000.0 | 10 | {/p/lustre3/izzet/dlio-benchmark-test/unet3d_v... | 1370.614035 | 0.007296 | 118 | 1129234851.377007 | 81 | {/p/lustre3/izzet/dlio-benchmark-test/unet3d_v... | 0.0 | ... | | | | | | | | | | |
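
The `proc_name` index values in this output appear to join several fields with `#`. Assuming the layout is `app#hostname#pid#tid` (an inference from the sample rows, where the last two fields match the `pid` and `tid` columns of the trace table, not a documented format), a small hypothetical helper can split them so that results can be grouped by host or process.

```python
from typing import NamedTuple


class ProcName(NamedTuple):
    """Fields of a proc_name string, under the assumed app#host#pid#tid layout."""
    app: str
    host: str
    pid: int
    tid: int


def parse_proc_name(proc_name):
    """Split a '#'-separated proc_name into its four assumed components."""
    app, host, pid, tid = proc_name.split("#")
    return ProcName(app=app, host=host, pid=int(pid), tid=int(tid))


parsed = parse_proc_name("app#corona171#1028571#1028571")
print(parsed.host, parsed.pid, parsed.tid)
```

If the real format carries a different number of fields, the tuple unpacking raises a `ValueError`, which makes a wrong assumption visible immediately instead of silently mislabeling fields.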


In [9]:
result.views['reader_posix_lustre'][('time_range',)].head()

| time_range | bw_max | bw_mean | bw_min | bw_q10_q90_stats | bw_q1_q99_stats | bw_q25_q75_stats | bw_q5_q95_stats | bw_std | bw_sum | close_count_max | ... | file_name_nunique | metadata_file_name_nunique | open_file_name_nunique | other_file_name_nunique | proc_name_nunique | read_file_name_nunique | seek_file_name_nunique | stat_file_name_nunique | sync_file_name_nunique | write_file_name_nunique |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 392828900.0 | 370461800.0 | 355103100.0 | [369293761.44715196, 8288549.282188045, 6] | [369293761.44715196, 8288549.282188045, 6] | [368107013.773381, 5981644.896344845, 4] | [369293761.44715196, 8288549.282188045, 6] | 12853930.0 | 2963695000.0 | 4 | ... | 35 | 35 | 32 | 0 | 8 | 32 | 0 | 33 | 0 | 0 |
| 1 | 424096700.0 | 391412300.0 | 367942800.0 | [389876428.2907986, 8513646.629767917, 6] | [389876428.2907986, 8513646.629767917, 6] | [390122204.8223287, 7074255.927956612, 4] | [389876428.2907986, 8513646.629767917, 6] | 17188590.0 | 3131298000.0 | 8 | ... | 64 | 64 | 64 | 0 | 8 | 64 | 0 | 64 | 0 | 0 |
| 2 | 445295200.0 | 418989600.0 | 403104900.0 | [417252772.4661761, 11401168.90427462, 6] | [417252772.4661761, 11401168.90427462, 6] | [414507505.23073936, 4416206.08954769, 4] | [417252772.4661761, 11401168.90427462, 6] | 15776680.0 | 3351917000.0 | 6 | ... | 54 | 54 | 48 | 0 | 8 | 48 | 0 | 48 | 0 | 0 |
| 3 | 901037400.0 | 815126400.0 | 781064100.0 | [806484996.0587629, 13040884.978397982, 6] | [806484996.0587629, 13040884.978397982, 6] | [806698636.4345162, 5875420.858666735, 4] | [806484996.0587629, 13040884.978397982, 6] | 37814140.0 | 6521011000.0 | 10 | ... | 82 | 82 | 74 | 0 | 8 | 74 | 0 | 74 | 0 | 0 |
| 4 | 1200353000.0 | 1133135000.0 | 1018028000.0 | [1141116984.6072495, 34988538.45572946, 6] | [1141116984.6072495, 34988538.45572946, 6] | [1137214065.649239, 21984781.804897595, 4] | [1141116984.6072495, 34988538.45572946, 6] | 60350380.0 | 9065083000.0 | 10 | ... | 78 | 78 | 70 | 0 | 8 | 70 | 0 | 70 | 0 | 0 |
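
One simple way to read this view is to compare `bw_mean` across time ranges. The sketch below copies the five `bw_mean` values shown above into a plain list and expresses each window's bandwidth relative to the first. Only ratios are used, so no assumption about the bandwidth unit is needed; `relative_to_first` is a hypothetical helper.

```python
# bw_mean for time_range 0..4, copied from the reader_posix_lustre view above.
bw_mean = [370461800.0, 391412300.0, 418989600.0, 815126400.0, 1133135000.0]


def relative_to_first(values):
    """Return each value divided by the first one, so the first entry is 1.0."""
    base = values[0]
    return [value / base for value in values]


for time_range, ratio in enumerate(relative_to_first(bw_mean)):
    print(f"time_range {time_range}: {ratio:.2f}x")
```

For this sample, mean read bandwidth in the last window is about three times that of the first, which suggests that reads speed up noticeably after the initial windows.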
