In [1]:
# !pip install doit papermill pandas hvplot matplotlib

In [None]:
from doit import load_ipython_extension
load_ipython_extension()
import pandas as pd
import hvplot.pandas

In [None]:
import sys
sys.path.append('src')
import sciebo

sciebo.download_file('https://uni-bonn.sciebo.de/s/N8t6uo4mn6itdtG', 'data/steinmetz_all.csv')

# Building Workflows in a Notebook

Pydoit (`doit`) is a task management and automation tool designed to execute commands and scripts in a structured, reproducible way.
In this notebook, we will learn the structure of `doit` workflows by writing `doit` tasks and executing them.

## Functions

Functions are reusable blocks of code that perform a specific task. 
They help organize code, make it more readable, and allow you to avoid repetition by encapsulating logic that can be called multiple times throughout a program.

A function is defined using the `def` keyword followed by the function name and parentheses `()`. 
The code that performs the task is placed inside the function body, indented under the function definition.

```python
def function_name(param1, param2, ...):  
    # use param1, param2, ...
    return value  # return (optional)
```
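As a small illustration of the syntax above, here is a minimal sketch of a function with one required parameter and one optional parameter. The name `scale` and its arguments are hypothetical and are not used elsewhere in this notebook.

```python
def scale(values, factor=2):
    """Return a new list with every value multiplied by `factor`."""
    return [v * factor for v in values]

# Call the function with the default factor, then with an explicit one.
doubled = scale([1, 2, 3])
tripled = scale([1, 2, 3], factor=3)
print(doubled, tripled)
```

The same body runs for both calls; only the arguments change, which is what makes a function reusable.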

**Example** Make a function called `active_trials` that replaces the code below

```python
input_csv = 'data/steinmetz_all.csv'
output_csv = 'data/active_trials.csv'
df = pd.read_csv(input_csv)
df_active = df[df['active_trials'] == True]
df_active.to_csv(output_csv, index=False)
```

In [4]:
def active_trials():
    input_csv = 'data/steinmetz_all.csv'
    output_csv = 'data/active_trials.csv'
    df = pd.read_csv(input_csv)
    df_active = df[df['active_trials'] == True]
    df_active.to_csv(output_csv, index=False)    

active_trials()

Make a function called `descriptive_stats` that replaces the below code

```python
active_trials_csv = 'data/active_trials.csv'
stats_csv = 'data/stats.csv'
df = pd.read_csv(active_trials_csv)
df_stats = df.describe().reset_index()
df_stats.to_csv(stats_csv, index=False)
```

Make a function called `histogram_plot` that replaces the code below

```python
import matplotlib.pyplot as plt
active_trials_csv = 'data/active_trials.csv'
hist_col_name = 'response_time'
df = pd.read_csv(active_trials_csv)
df[hist_col_name].plot.hist()
plt.savefig(f'{hist_col_name}_histogram.png')
```

**Example** Make `input_csv` a parameter of `active_trials`

In [7]:
def active_trials(input_csv):
    output_csv = 'data/active_trials.csv'
    df = pd.read_csv(input_csv)
    df_active = df[df['active_trials'] == True]
    df_active.to_csv(output_csv, index=False)    

active_trials('data/steinmetz_all.csv')

Make `active_trials_csv` a parameter of `descriptive_stats`

Make `active_trials_csv` a parameter of `histogram_plot`

**Example** Make `input_csv` and `output_csv` parameters of `active_trials`

In [10]:
def active_trials(input_csv, output_csv):
    df = pd.read_csv(input_csv)
    df_active = df[df['active_trials'] == True]
    df_active.to_csv(output_csv, index=False)    

active_trials('data/steinmetz_all.csv', 'data/active_trials.csv')

Make `active_trials_csv` and `stats_csv` parameters of `descriptive_stats`

Make `active_trials_csv` and `hist_col_name` parameters of `histogram_plot`

## Building doit Workflows

`doit` is a task automation tool that manages dependencies between tasks and executes them. A `doit` workflow is made up of units called tasks; each task is defined by a Python function that returns a dictionary describing it.

Basic syntax of a `doit` task:

```python
def task_name():
    return {
        'actions': ['command to execute'],
    }
```

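`task_name` is the name of the function that defines the task. It must start with `task_`; the part after the prefix becomes the task's name (for example, `task_process` defines the task `process`).

`actions` is a list of things to run: shell commands given as strings, Python functions, etc.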
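To make the two kinds of action concrete, here is a minimal sketch of a task that mixes a shell command with a Python function. The names `say_hello` and `task_hello` are hypothetical and only serve as an illustration; the `(function, [args])` tuple form and the `verbosity` key are standard `doit` task options.

```python
def say_hello(name):
    # A Python action: returning None (or True) tells doit the action succeeded.
    print(f'hello {name}')

def task_hello():
    """Print a greeting (hypothetical example task)."""
    return {
        'actions': [
            'echo "running hello"',   # shell command, given as a string
            (say_hello, ['doit']),    # Python function with positional arguments
        ],
        'verbosity': 2,               # show the output of the actions
    }
```

Defining the function only describes the task. Nothing runs until `doit` is asked to execute it, for example with `%doit hello` in this notebook.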

Let's get some practice building `doit` tasks.

**Example** Add a doit task called `process` that implements the below code:

```python
def active_trials():
    input_csv = 'data/steinmetz_all.csv'
    output_csv = 'data/active_trials.csv'
    df = pd.read_csv(input_csv)
    df_active = df[df['active_trials'] == True]
    df_active.to_csv(output_csv, index=False)   
```

In [13]:
def task_process():
    def active_trials():
        input_csv = 'data/steinmetz_all.csv'
        output_csv = 'data/active_trials.csv'
        df = pd.read_csv(input_csv)
        df_active = df[df['active_trials'] == True]
        df_active.to_csv(output_csv, index=False)       

    return {
        'actions': [active_trials]
    }

In [None]:
%doit list

Add a doit task called `stats` for the `descriptive_stats` code from the Functions section

Add a doit task called `plot` for the code below

```python
def histogram_plot():
    import matplotlib.pyplot as plt
    active_trials_csv = 'data/active_trials.csv'
    hist_col_name = 'response_time'
    df = pd.read_csv(active_trials_csv)
    df[hist_col_name].plot.hist()
    plt.savefig(f'{hist_col_name}_histogram.png')
```

We can also run notebooks as `doit` tasks. Run the cell below to download the notebooks.

`nb_active_trials`: Same as the `active_trials` function <br>
`nb_stats`: Same as the `descriptive_stats` function <br>
`nb_plots`: Same as the `histogram_plot` function

In [None]:
import sys
sys.path.append('src')
import sciebo

sciebo.download_file('https://uni-bonn.sciebo.de/s/5ke7GSFfMErS20y', 'nb_active_trials.ipynb')
sciebo.download_file('https://uni-bonn.sciebo.de/s/UQKaks9opGYu211', 'nb_stats.ipynb')
sciebo.download_file('https://uni-bonn.sciebo.de/s/dFomib4RDkGL39A', 'nb_plots.ipynb')


**Example** Run the `nb_active_trials` notebook inside the `process` task.

In [20]:
def task_process():
    return {
        'actions': ['papermill nb_active_trials.ipynb process.ipynb']
    }

Run the `nb_stats` notebook inside the `stats` task.

Run the `nb_plots` notebook inside the `plot` task.

## Running `doit` tasks

Now that we know how to make `doit` tasks, let's see how to run them.

**Delete `data` directory and run the below cell to get only `steinmetz_all.csv`**

In [None]:
import sys
sys.path.append('src')
import sciebo

sciebo.download_file('https://uni-bonn.sciebo.de/s/N8t6uo4mn6itdtG', 'data/steinmetz_all.csv')

**Example** Run `process` task

In [None]:
%doit process

Run the `stats` task

Run the `plot` task

`doit` does not work out the logical order of your tasks unless task dependencies are explicitly defined. By default, tasks run one after another (not in parallel), normally in the order in which they were defined, which might not match the logical order you expect.
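The effect of declaring dependencies can be pictured with a small conceptual sketch. This is not how `doit` is implemented; the helper `execution_order` is hypothetical and only shows the idea that a task is scheduled after everything listed in its `task_dep`.

```python
def execution_order(tasks):
    """Order task names so that each task comes after its 'task_dep' entries."""
    order = []
    visiting = set()

    def visit(name):
        if name in order:
            return
        if name in visiting:
            raise ValueError(f'cyclic dependency at {name!r}')
        visiting.add(name)
        for dep in tasks[name].get('task_dep', []):
            visit(dep)  # schedule dependencies first
        visiting.discard(name)
        order.append(name)

    for name in tasks:
        visit(name)
    return order

# 'plot' is listed first, but it declares that it needs 'process'.
tasks = {
    'plot': {'task_dep': ['process']},
    'process': {},
}
order = execution_order(tasks)
print(order)
```

Without the `task_dep` entry, the sketch would simply keep the order in which the tasks were listed.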

**Delete `data` directory and run the below cell to get only `steinmetz_all.csv`**

In [None]:
import sys
sys.path.append('src')
import sciebo

sciebo.download_file('https://uni-bonn.sciebo.de/s/N8t6uo4mn6itdtG', 'data/steinmetz_all.csv')



**Example** Run everything (Can you?)

In [None]:
%doit

**Example** Specify that the `plot` task depends on the `process` task

In [29]:
def task_plot():
    return {
        'actions': ['papermill nb_plots.ipynb plots.ipynb'],
        'task_dep': ['process'],
    }

In [None]:
%doit

Specify that the `stats` task depends on the `process` task