In [None]:
# !pip install doit papermill

In [1]:
from doit import load_ipython_extension
load_ipython_extension()

# Building Workflows of Jupyter Notebooks

## File Dependency

In `doit`, the `file_dep` key lists the files a task depends on.
`doit` re-runs the task only when at least one of those files has changed since the task's last successful run, and skips it otherwise.
This avoids redundant computation, such as re-executing a notebook whose input data is unchanged.
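
To make the change tracking concrete, here is a minimal sketch of a content-based check. According to the `doit` documentation, file dependencies are compared by MD5 checksum by default rather than by timestamp; the helper names `file_md5`, `record_state`, and `is_up_to_date` are hypothetical and are not part of `doit`'s API, and the real implementation is more involved.

```python
import hashlib
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def record_state(file_deps):
    """Snapshot the digests of all dependencies after a successful run."""
    return {dep: file_md5(dep) for dep in file_deps}


def is_up_to_date(file_deps, saved_state):
    """True only if every dependency still matches its saved digest.

    A dependency with no saved digest (task never ran) counts as changed.
    """
    return all(saved_state.get(dep) == file_md5(dep) for dep in file_deps)
```

With a check like this, editing `data/data.csv` changes its digest, so the next run is not considered up to date, while re-saving the file with identical contents leaves the task skipped.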

In [40]:
def task_download():
    return {
        'actions': ['papermill notebooks/01_data_access.ipynb 01_data_access_report.ipynb -p output_csv data/data.csv'],
    }

%doit download

**Example** Create a task called `stats` that runs the command below with a file dependency on `data/data.csv`:

`papermill notebooks/02_stats.ipynb 02_stats_report.ipynb -p csv_file data/data.csv -p output_csv data/data_stats.csv`

Here, papermill's command-line `-p` option passes each parameter to the notebook as a name and value pair.
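
Long command strings are easy to mistype, so it can help to assemble them from the parameter names. The sketch below is one possible way to do that; `papermill_cmd` is a hypothetical helper written for this notebook, not a papermill or `doit` function.

```python
import shlex


def papermill_cmd(notebook, report, **params):
    """Build a papermill command line.

    Each keyword argument becomes one `-p name value` pair.
    """
    parts = ["papermill", notebook, report]
    for name, value in params.items():
        parts += ["-p", name, str(value)]
    # Quote each part so paths containing spaces survive the shell.
    return " ".join(shlex.quote(part) for part in parts)


cmd = papermill_cmd(
    "notebooks/02_stats.ipynb",
    "02_stats_report.ipynb",
    csv_file="data/data.csv",
    output_csv="data/data_stats.csv",
)
print(cmd)
```

The printed string matches the command shown above and can be used as an entry in a task's `actions` list.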

In [14]:
def task_stats():  
    return {
        'actions': ['papermill notebooks/02_stats.ipynb 02_stats_report.ipynb  -p csv_file data/data.csv -p output_csv data/data_stats.csv'],
        'file_dep': ['data/data.csv']
    }

In [7]:
%doit list

stats   


In [8]:
%doit stats

.  stats


Input Notebook:  notebooks/02_stats.ipynb
Output Notebook: 02_stats_report.ipynb
Executing:   0%|          | 0/12 [00:00<?, ?cell/s]Executing notebook with kernel: python3
Executing: 100%|##########| 12/12 [00:02<00:00,  4.20cell/s]


Run `%doit stats` again 

In [9]:
%doit stats

-- stats


`.` means the task was executed. `--` means the task was skipped as up to date, because `data/data.csv` has not changed since the last successful run.

Add a random value to `data/data.csv` and run `%doit stats` again

In [10]:
%doit stats

.  stats


Input Notebook:  notebooks/02_stats.ipynb
Output Notebook: 02_stats_report.ipynb
Executing:   0%|          | 0/12 [00:00<?, ?cell/s]Executing notebook with kernel: python3
Executing: 100%|##########| 12/12 [00:03<00:00,  3.87cell/s]


In [12]:
%doit list

stats   


Create a task called `plot` that runs the command below with a file dependency on `data/data.csv`:

`papermill notebooks/03_visualization.ipynb 03_visualization_report.ipynb -p csv_file data/data.csv`


In [15]:
def task_plot():  
    return {
        'actions': ['papermill notebooks/03_visualization.ipynb 03_visualization_report.ipynb -p csv_file data/data.csv'],
        'file_dep': ['data/data.csv']
    }

In [16]:
%doit list

plot    
stats   


In [18]:
%doit plot

.  plot


Input Notebook:  notebooks/03_visualization.ipynb
Output Notebook: 03_visualization_report.ipynb
Executing:   0%|          | 0/11 [00:00<?, ?cell/s]Executing notebook with kernel: python3
Executing: 100%|##########| 11/11 [00:06<00:00,  1.67cell/s]


Run `%doit plot` again

In [19]:
%doit plot

-- plot


Add a new row to `data/data.csv` and run `%doit plot`

In [20]:
%doit plot

.  plot


Input Notebook:  notebooks/03_visualization.ipynb
Output Notebook: 03_visualization_report.ipynb
Executing:   0%|          | 0/11 [00:00<?, ?cell/s]Executing notebook with kernel: python3
Executing: 100%|##########| 11/11 [00:06<00:00,  1.69cell/s]


Run `%doit plot` again

In [21]:
%doit plot

-- plot


**Example** Run both tasks

In [22]:
%doit

-- plot
.  stats


Input Notebook:  notebooks/02_stats.ipynb
Output Notebook: 02_stats_report.ipynb
Executing:   0%|          | 0/12 [00:00<?, ?cell/s]Executing notebook with kernel: python3
Executing: 100%|##########| 12/12 [00:02<00:00,  4.19cell/s]


Run both tasks again

In [23]:
%doit

-- plot
-- stats


Delete the newly added rows from `data/data.csv` and run both tasks again

In [24]:
%doit

.  plot


Input Notebook:  notebooks/03_visualization.ipynb
Output Notebook: 03_visualization_report.ipynb
Executing:   0%|          | 0/11 [00:00<?, ?cell/s]Executing notebook with kernel: python3
Executing: 100%|##########| 11/11 [00:06<00:00,  1.74cell/s]


.  stats


Input Notebook:  notebooks/02_stats.ipynb
Output Notebook: 02_stats_report.ipynb
Executing:   0%|          | 0/12 [00:00<?, ?cell/s]Executing notebook with kernel: python3
Executing: 100%|##########| 12/12 [00:02<00:00,  4.12cell/s]


## Target Dependency


In `doit`, `targets` are the output files or directories a task generates, and they help determine whether a task needs to be re-executed. A task runs if any of its targets does not exist, or if any of its dependencies (`file_dep`) has changed since the last successful run; otherwise it is skipped. A task can declare one or several targets. Targets also enable chaining of tasks: when one task's target is listed in another task's `file_dep`, `doit` runs the producing task first. This allows structured, incremental workflows that avoid unnecessary re-runs.
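
The chaining of tasks through targets can be illustrated with a small sketch. It mimics the idea that a task whose `file_dep` contains another task's target implicitly depends on that task; `implicit_task_deps` is a hypothetical function written for illustration and is not how `doit` is implemented internally.

```python
def implicit_task_deps(tasks):
    """Map each task name to the tasks whose targets appear in its file_dep."""
    # Which task produces each target file?
    producer = {
        target: name
        for name, spec in tasks.items()
        for target in spec.get("targets", [])
    }
    return {
        name: sorted({producer[dep] for dep in spec.get("file_dep", []) if dep in producer})
        for name, spec in tasks.items()
    }


# Task metadata mirroring the tasks defined in this section.
tasks = {
    "download": {"file_dep": ["notebooks/01_data_access.ipynb"], "targets": ["data/data.csv"]},
    "stats": {"file_dep": ["data/data.csv"], "targets": ["data/data_stats.csv"]},
    "plot": {"file_dep": ["data/data.csv"], "targets": ["count.html"]},
}

print(implicit_task_deps(tasks))
# {'download': [], 'stats': ['download'], 'plot': ['download']}
```

Because `data/data.csv` is a target of `download`, both `stats` and `plot` pick up `download` as a prerequisite, which is why `download` appears in the output when either of them is run.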

**Example** Create a task called `download` that runs the command below, with a file dependency on `notebooks/01_data_access.ipynb` and the target `data/data.csv`:

`papermill notebooks/01_data_access.ipynb 01_data_access_report.ipynb -p output_csv data/data.csv`

In [33]:
def task_download():
    return {
        'actions': ['papermill notebooks/01_data_access.ipynb 01_data_access_report.ipynb -p output_csv data/data.csv'],
        'file_dep': ['notebooks/01_data_access.ipynb'],
        'targets': ["data/data.csv"]
    }

In [34]:
%doit download

.  download


Input Notebook:  notebooks/01_data_access.ipynb
Output Notebook: 01_data_access_report.ipynb
Executing:   0%|          | 0/9 [00:00<?, ?cell/s]Executing notebook with kernel: python3
Executing: 100%|##########| 9/9 [00:03<00:00,  2.86cell/s]


Run `%doit download` again

In [35]:
%doit download

-- download


Create a task called `stats` that runs the command below with a file dependency on `data/data.csv` and the target `data/data_stats.csv`:

`papermill notebooks/02_stats.ipynb 02_stats_report.ipynb -p csv_file data/data.csv -p output_csv data/data_stats.csv`

In [36]:
def task_stats():  
    return {
        'actions': ['papermill notebooks/02_stats.ipynb 02_stats_report.ipynb  -p csv_file data/data.csv -p output_csv data/data_stats.csv'],
        'file_dep': ['data/data.csv'],
        'targets': ['data/data_stats.csv']
    }

Run `%doit stats`

In [37]:
%doit stats

-- download
-- stats


What do you see and why?

Create a task called `plot` that runs the command below with a file dependency on `data/data.csv` and the target `count.html`:

`papermill notebooks/03_visualization.ipynb 03_visualization_report.ipynb -p csv_file data/data.csv`


In [38]:
def task_plot():  
    return {
        'actions': ['papermill notebooks/03_visualization.ipynb 03_visualization_report.ipynb -p csv_file data/data.csv'],
        'file_dep': ['data/data.csv'],
        'targets': ['count.html']
    }

Run `%doit plot`. What do you see and why?

In [39]:
%doit plot

-- download
-- plot


**Delete `data/data.csv`, `data/data_stats.csv`, and `count.html`**

**Example** Create a task called `download` that runs the command below, with a file dependency on `notebooks/01_data_access.ipynb` and the targets `data/data.csv` and `01_data_access_report.ipynb`:
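
The deletion can also be done from a notebook cell. This is a minimal sketch that assumes the notebook's working directory is the project root, so the relative paths resolve as written.

```python
from pathlib import Path

# Generated files named in the instruction above.
generated = ["data/data.csv", "data/data_stats.csv", "count.html"]

for name in generated:
    # missing_ok=True (Python 3.8+) makes this a no-op if the file is already gone.
    Path(name).unlink(missing_ok=True)
```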

`papermill notebooks/01_data_access.ipynb 01_data_access_report.ipynb -p output_csv data/data.csv`

In [41]:
def task_download():
    return {
        'actions': ['papermill notebooks/01_data_access.ipynb 01_data_access_report.ipynb -p output_csv data/data.csv'],
        'file_dep': ['notebooks/01_data_access.ipynb'],
        'targets': ["data/data.csv", "01_data_access_report.ipynb"]
    }

In [42]:
%doit download

-- download


In [43]:
%doit download

-- download


Create a task called `stats` that runs the command below with a file dependency on `data/data.csv` and the targets `data/data_stats.csv` and `02_stats_report.ipynb`:

`papermill notebooks/02_stats.ipynb 02_stats_report.ipynb -p csv_file data/data.csv -p output_csv data/data_stats.csv`

In [None]:
def task_stats():  
    return {
        'actions': ['papermill notebooks/02_stats.ipynb 02_stats_report.ipynb  -p csv_file data/data.csv -p output_csv data/data_stats.csv'],
        'file_dep': ['data/data.csv'],
        'targets': ['data/data_stats.csv', "02_stats_report.ipynb"]
    }

In [44]:
%doit stats

-- download
-- stats


In [45]:
%doit stats

-- download
-- stats


Create a task called `plot` that runs the command below with a file dependency on `data/data.csv` and the targets `count.html` and `03_visualization_report.ipynb`:

`papermill notebooks/03_visualization.ipynb 03_visualization_report.ipynb -p csv_file data/data.csv`


In [None]:
def task_plot():  
    return {
        'actions': ['papermill notebooks/03_visualization.ipynb 03_visualization_report.ipynb -p csv_file data/data.csv'],
        'file_dep': ['data/data.csv'],
        'targets': ['count.html', '03_visualization_report.ipynb']
    }

In [46]:
%doit plot

-- download
-- plot


In [47]:
%doit plot

-- download
-- plot
