# DataCrumbs Analysis 

### Run Analysis

Finally, we initialize DFAnalyzer with the specified configuration and run the trace analysis to generate summarized I/O statistics and views for further exploration.

In [1]:
from dfanalyzer import init_with_hydra

percentile = 0.9
time_granularity = 5e6  # 5 seconds, expressed in microseconds
trace_path = "/home/haridev/data/lead2-trace-buff-ts.pfw.gz"
view_types = ["time_range", "proc_name"]

dfa = init_with_hydra(
    hydra_overrides=[
        'analyzer=dftracer',
        'analyzer.checkpoint=False',
        f"analyzer.time_granularity={time_granularity}",
        f"percentile={percentile}",
        f"trace_path={trace_path}",
    ]
)

2025-07-10 09:35:53,229 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dfanalyzer-haridev/0/dask-worker-space/worker-_pifuahu', purging
2025-07-10 09:35:53,229 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dfanalyzer-haridev/0/dask-worker-space/worker-qbpq_gad', purging
2025-07-10 09:35:53,229 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dfanalyzer-haridev/0/dask-worker-space/worker-goxraqnd', purging
2025-07-10 09:35:53,229 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dfanalyzer-haridev/0/dask-worker-space/worker-girbwgyd', purging
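
The override strings passed to `init_with_hydra` above follow Hydra's `key=value` form. As a minimal sketch, assuming the settings are kept in a plain dictionary, a hypothetical helper `to_overrides` (not part of DFAnalyzer or Hydra) can build the same list:

```python
def to_overrides(settings):
    """Turn a flat dict of settings into Hydra-style 'key=value' strings.

    Hypothetical helper for illustration; not part of DFAnalyzer or Hydra.
    """
    return [f"{key}={value}" for key, value in settings.items()]


# Same values as in the cell above.
settings = {
    "analyzer": "dftracer",
    "analyzer.checkpoint": False,
    "analyzer.time_granularity": 5e6,
    "percentile": 0.9,
    "trace_path": "/home/haridev/data/lead2-trace-buff-ts.pfw.gz",
}

overrides = to_overrides(settings)
for item in overrides:
    print(item)
```

The resulting list could then be passed as `hydra_overrides=overrides`, which keeps the configuration values in one place.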


We access the underlying Dask client via our Python API.

In [2]:
dfa.client

Client
- Connection method: Cluster object
- Cluster type: distributed.LocalCluster
- Dashboard: http://127.0.0.1:8787/status

LocalCluster
- Dashboard: http://127.0.0.1:8787/status
- Workers: 4
- Total threads: 16
- Total memory: 0 B
- Status: running
- Using processes: True

Scheduler
- Comm: tcp://127.0.0.1:45585
- Dashboard: http://127.0.0.1:8787/status
- Workers: 4
- Total threads: 16
- Total memory: 0 B
- Started: Just now

Workers

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:39895 | http://127.0.0.1:40011/status | tcp://127.0.0.1:37123 | 4 | 0 B | /tmp/dfanalyzer-haridev/0/dask-worker-space/worker-j864x91a |
| tcp://127.0.0.1:45839 | http://127.0.0.1:42765/status | tcp://127.0.0.1:36275 | 4 | 0 B | /tmp/dfanalyzer-haridev/0/dask-worker-space/worker-_922mrv0 |
| tcp://127.0.0.1:43165 | http://127.0.0.1:36445/status | tcp://127.0.0.1:36133 | 4 | 0 B | /tmp/dfanalyzer-haridev/0/dask-worker-space/worker-mlqztygw |
| tcp://127.0.0.1:41863 | http://127.0.0.1:34087/status | tcp://127.0.0.1:39921 | 4 | 0 B | /tmp/dfanalyzer-haridev/0/dask-worker-space/worker-88w57vr0 |


We load the trace into a Dask DataFrame via the analyzer's `read_trace` method as follows, then inspect it with a few queries.

In [3]:
result = dfa.analyzer.read_trace(trace_path=trace_path)

In [4]:
result["cat"].unique().compute()

0               mm
1    os_page_cache
2       test_posix
3             swap
4              vfs
5              sys
6              shm
7              xfs
8            iomap
Name: cat, dtype: string

In [5]:
result.query("cat == 'sys' and file_name.str.contains('file_0')").compute()

| | func_name | cat | type | pid | tid | time_start | time_end | time | tinterval | time_range | ... | io_cat | phase | size | compute_time | checkpoint_time | read_time | hash | value | file_name | host_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10067 | openat | sys | 0 | 66349 | 66349 | 97418183 | 97925194 | 0.507011 | | 19 | ... | 6 | 0 | | | | | | | /scratch/haridev/data/file_0_0.dat | lead2 |
| 10134 | write | sys | 0 | 66349 | 66349 | 97964119 | 98155663 | 0.191544 | | 19 | ... | 6 | 0 | 4096.0 | | | | | | /scratch/haridev/data/file_0_0.dat | lead2 |
| 10201 | close | sys | 0 | 66349 | 66349 | 98165689 | 98261890 | 0.096201 | | 19 | ... | 6 | 0 | | | | | | | /scratch/haridev/data/file_0_0.dat | lead2 |
| 3972 | openat | sys | 0 | 66395 | 66395 | 5240928950 | 5240956462 | 0.027512 | | 1048 | ... | 6 | 0 | | | | | | | /scratch/haridev/data/file_0_0.dat | lead2 |
| 4031 | read | sys | 0 | 66395 | 66395 | 5240996354 | 5241142775 | 0.146421 | | 1048 | ... | 6 | 0 | 4096.0 | | | | | | /scratch/haridev/data/file_0_0.dat | lead2 |
| 4045 | close | sys | 0 | 66395 | 66395 | 5241151798 | 5241179731 | 0.027933 | | 1048 | ... | 6 | 0 | | | | | | | /scratch/haridev/data/file_0_0.dat | lead2 |
| 14524 | openat | sys | 0 | 66402 | 66402 | 5388495738 | 5388988887 | 0.493149 | | 1077 | ... | 6 | 0 | | | | | | | /scratch/haridev/data/file_0_0.dat | lead2 |
| 14638 | write | sys | 0 | 66402 | 66402 | 5389027691 | 5389275434 | 0.247743 | | 1077 | ... | 6 | 0 | 16384.0 | | | | | | /scratch/haridev/data/file_0_0.dat | lead2 |
| 14706 | close | sys | 0 | 66402 | 66402 | 5389284340 | 5389396885 | 0.112545 | | 1077 | ... | 6 | 0 | | | | | | | /scratch/haridev/data/file_0_0.dat | lead2 |
| 8491 | openat | sys | 0 | 66432 | 66432 | 10530408462 | 10530436142 | 0.02768 | | 2106 | ... | 6 | 0 | | | | | | | /scratch/haridev/data/file_0_0.dat | lead2 |
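
In the rows above, `time_range` is consistent with `time_start` divided by `time_granularity` and rounded down, and `time` matches `time_end - time_start` converted from microseconds to seconds. The sketch below checks that arithmetic on three of the rows shown. It is an observation about this output only, not a statement of how DFAnalyzer computes these columns internally, and the helper names are hypothetical.

```python
TIME_GRANULARITY = 5e6  # microseconds (5 seconds), the value configured above


def bin_index(time_start, granularity=TIME_GRANULARITY):
    """Index of the fixed-width time bin that contains time_start."""
    return int(time_start // granularity)


def duration_seconds(time_start, time_end):
    """Event duration, converting microsecond timestamps to seconds."""
    return (time_end - time_start) / 1e6


# (time_start, time_end, time, time_range) copied from the output above.
rows = [
    (97418183, 97925194, 0.507011, 19),
    (5240928950, 5240956462, 0.027512, 1048),
    (10530408462, 10530436142, 0.02768, 2106),
]

for time_start, time_end, time_s, time_range in rows:
    same_bin = bin_index(time_start) == time_range
    same_duration = abs(duration_seconds(time_start, time_end) - time_s) < 1e-9
    print(time_start, same_bin, same_duration)
```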


In [11]:
(
    result.query("time_start >= 97964119 and time_end <= 98155663")
    .groupby(["func_name"])
    .agg({'cat': 'count', 'time': 'sum'})
    .sort_values('time')
    .compute()
)

| func_name | cat | time |
|---|---|---|
| rw_verify_area | 1 | 0.000319 |
| xfs_da_hashname | 1 | 0.000382 |
| folio_unlock | 1 | 0.000383 |
| current_time | 1 | 0.000394 |
| setattr_should_drop_suidgid | 1 | 0.000394 |
| folio_add_lru | 1 | 0.000452 |
| filemap_get_entry | 1 | 0.000461 |
| generic_write_check_limits | 1 | 0.000566 |
| xfs_mod_freecounter | 2 | 0.000676 |
| xfs_iext_state_to_fork | 1 | 0.000845 |
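
The bounds in the query above are the `time_start` and `time_end` of the 4096-byte `write` call to `file_0_0.dat` shown earlier, so the result lists the events that fall entirely inside that call, with a per-function count and total time. A minimal pandas sketch of the same window-then-aggregate pattern follows. The rows and function names are made up for illustration and are not taken from the trace, and it uses plain pandas rather than the Dask DataFrame returned by `read_trace`.

```python
import pandas as pd

# Illustrative events only; timestamps are in microseconds.
events = pd.DataFrame(
    {
        "func_name": ["outer", "inner_a", "inner_b", "inner_a", "late"],
        "cat": ["sys", "vfs", "xfs", "vfs", "sys"],
        "time_start": [100, 110, 130, 150, 400],
        "time_end": [200, 120, 145, 160, 450],
    }
)
events["time"] = (events["time_end"] - events["time_start"]) / 1e6  # seconds

# Window taken from the enclosing call ("outer" in this toy frame).
window_start, window_end = 100, 200

summary = (
    events.query("time_start >= @window_start and time_end <= @window_end")
    .groupby("func_name")
    .agg(count=("cat", "count"), time=("time", "sum"))
    .sort_values("time")
)
print(summary)
```

Because the comparison is inclusive, the enclosing call itself also appears in the summary. Sorting ascending puts the cheapest functions first, as in the output above; passing `ascending=False` to `sort_values` would surface the most expensive ones instead.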
