# Nextflow optimizer notebook

The current objectives of this notebook are to:
1. Load the execution trace of a Nextflow workflow.
2. Extract timing information for each executed task.
3. (Optionally) Visualize the extracted information, similarly to what Nextflow reports show.
4. Generate a Nextflow config file that overrides each process time limit with the worst-case execution time observed.

In a future version, it might be useful to maintain a database of process runtimes to better understand how a process's runtime varies with its parameters or with the node used to run it.

## 1. Notebook parameters

In [None]:
from pathlib import Path

html_report = Path("/path/to/folder","report_file.html")
output_config_file = Path("/path/to/folder","config_file.config")

## 2. Load data

### 2.1 Load data from HTML

In [None]:
import extract_trace_from_html as parser

trace_df = parser.extract_trace_data(html_report)

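if trace_df is not None:
    print(f"Extracted {trace_df.shape[0]} process execution traces.")
else:
    # Stop early: every later cell needs a valid DataFrame.
    raise ValueError(f"No trace data could be extracted from {html_report}")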

### 2.2 Process dataframe

In [None]:
# Extract process name and path within the workflow as separate columns.
trace_df['process_name'] = trace_df['process'].str.split(':').str[-1]
trace_df['process_path'] = trace_df['process'].str.split(':').str[:-1].str.join(':')
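
To make the effect of these two lines concrete, here is a minimal plain-Python sketch of the same split applied to a single value. The helper `split_process` and the process names are hypothetical and only serve as an illustration; the notebook itself performs the split on the whole `process` column with pandas.

```python
def split_process(full_name):
    """Return (process_name, process_path) for a colon-separated process name."""
    parts = full_name.split(":")
    # Last component is the process name, the rest is its path in the workflow.
    return parts[-1], ":".join(parts[:-1])


# Hypothetical process names, for illustration only.
print(split_process("MAIN_WORKFLOW:QC:FASTQC"))  # ('FASTQC', 'MAIN_WORKFLOW:QC')
print(split_process("FASTQC"))                   # ('FASTQC', '')
```

A process called directly from the top-level workflow has no colon in its name, so its path is the empty string.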

## 3. Display useful info

### 3.1 Process execution time box plot

In [None]:
import display_process_timings as viewer

name_filter = None  # Optionally, pass a string to display only processes whose name contains it.
                    # Use None to disable filtering.

viewer.plot_realtime_boxplot(trace_df, name_filter)

### 3.2 Icicle chart of processes

In [None]:
import display_icicle_chart as visualizer

visualizer.create_icicle_chart(trace_df, include_names=True)

### 3.3 Processing times

In [None]:
# Use descriptive names to avoid shadowing the built-in sum().
total_realtime = trace_df['realtime'].sum()
total_cpu_time = (trace_df['realtime'] * trace_df['cpus']).sum()

print(f'Sum of all process execution times: {total_realtime}')
print(f'Sum of all (process exec time)*(nb cpu): {total_cpu_time}')
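
The two totals differ as soon as a task is allocated more than one CPU. The sketch below works through the arithmetic on a few hypothetical records. It assumes `realtime` is expressed in milliseconds, which may not match the unit returned by the parser used in this notebook.

```python
from datetime import timedelta

# Hypothetical trace records: realtime in milliseconds (assumed), cpus as allocated.
tasks = [
    {"process": "ALIGN", "realtime": 120_000, "cpus": 4},
    {"process": "SORT", "realtime": 30_000, "cpus": 2},
    {"process": "INDEX", "realtime": 6_000, "cpus": 1},
]

# Wall-clock time summed over tasks, and the same weighted by allocated CPUs.
total_realtime_ms = sum(t["realtime"] for t in tasks)
total_cpu_time_ms = sum(t["realtime"] * t["cpus"] for t in tasks)

print(timedelta(milliseconds=total_realtime_ms))  # 0:02:36
print(timedelta(milliseconds=total_cpu_time_ms))  # 0:09:06
```

The weighted total measures reserved CPU time, not CPU time actually used: a task that reserves four CPUs but keeps only one busy still counts four times.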

### 3.4 Average wait time

In [None]:
import display_process_timings as viewer

viewer.plot_wait_times(trace_df)
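
A task's wait time is commonly taken as the delay between its submission and its start. The sketch below computes a per-process mean on hypothetical records. It assumes `submit` and `start` fields holding epoch timestamps in milliseconds; whether `plot_wait_times` relies on these same columns is an assumption.

```python
from statistics import mean

# Hypothetical records with epoch timestamps in milliseconds (assumed unit).
tasks = [
    {"process": "ALIGN", "submit": 1_700_000_000_000, "start": 1_700_000_012_000},
    {"process": "SORT", "submit": 1_700_000_020_000, "start": 1_700_000_021_000},
    {"process": "SORT", "submit": 1_700_000_030_000, "start": 1_700_000_035_000},
]

# Group wait times (in seconds) by process name.
wait_by_process = {}
for t in tasks:
    wait_s = (t["start"] - t["submit"]) / 1000
    wait_by_process.setdefault(t["process"], []).append(wait_s)

mean_wait = {name: mean(waits) for name, waits in wait_by_process.items()}
print(mean_wait)  # {'ALIGN': 12.0, 'SORT': 3.0}
```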

## 4. Export Config File

In [None]:
import config_file_generator as generator

generator.generate_nextflow_config(trace_df, output_config_file)
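
For reference, the kind of file this step aims to produce can be sketched as follows. This is not the implementation of `config_file_generator`: the helper `build_time_config`, its `margin` parameter, and the sample values are hypothetical, and runtimes are assumed to be in milliseconds. The sketch takes the maximum observed runtime per process and writes it as a `time` directive under a `withName` selector.

```python
import math


def build_time_config(worst_case_ms, margin=1.0):
    """Render a Nextflow `process` scope with one `withName` selector per process."""
    lines = ["process {"]
    for name in sorted(worst_case_ms):
        # Round up to whole seconds so the limit is never below the observed runtime.
        seconds = math.ceil(worst_case_ms[name] * margin / 1000)
        lines.append(f"    withName: '{name}' {{")
        lines.append(f"        time = '{seconds}s'")
        lines.append("    }")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical realtime samples (milliseconds) per process name.
samples = {"FASTQC": [41_000, 58_500, 47_200], "ALIGN": [610_000, 734_250]}
worst_case = {name: max(values) for name, values in samples.items()}
print(build_time_config(worst_case))
```

In practice a `margin` above 1.0 is probably safer, since a limit set exactly to the worst case observed so far leaves no headroom for a slightly slower run.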