# Exploration

## Prerequisites

In [None]:
from IPython.display import Markdown
import numpy
import pandas
from src import PROCESSED_DATA_DIR

In [None]:
jobs = (
    pandas.read_feather(PROCESSED_DATA_DIR / "jobs.feather")
    .set_index("id")
    .sort_index()
)
assert jobs.index.is_unique

## Nomenclature

A ***job*** is the execution of an action.
An ***action*** is a stage in a pipeline.
One job is associated with zero or one actions
(zero, because of missing pipelines and parsing errors).
Hence, an action is a concrete concept:
whilst action `a` associated with job `j1` may have the same invocation as action `a` associated with job `j2`,
`a-j1` and `a-j2` are different actions.

A ***workspace*** is a collection of jobs and, hence, a collection of actions;
it is a proxy for a study.

We could assume that actions with the same ID that are associated with the same workspace are different executions of the same invocation.
However, we should be cautious because both IDs and invocations may change.
For example:

* the same ID may have different invocations,
    such as when a jupyter action type is changed to a python action type.

* the same invocation may have different IDs,
    such as when a more general ID is replaced by a more specific ID, as more actions are added to a pipeline.
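Both cases can be sketched on toy data. This is a minimal illustration, not part of the analysis: `toy_jobs`, its values, and the `invocation` column are hypothetical, and `jobs.feather` may not have a column of that name. Note that the second check only flags candidates, since two genuinely different actions may share an invocation.

```python
import pandas

# Hypothetical toy data; "invocation" is an assumed column name.
toy_jobs = pandas.DataFrame(
    {
        "workspace_id": [1, 1, 1, 1],
        "action_id": ["analyse", "analyse", "model", "model_adjusted"],
        "invocation": ["jupyter", "python", "stata", "stata"],
    }
)

# Same ID, different invocations:
# count the distinct invocations for each (workspace, ID) pair.
invocations_per_id = toy_jobs.groupby(["workspace_id", "action_id"])[
    "invocation"
].nunique()
ids_with_changed_invocation = invocations_per_id[invocations_per_id > 1]

# Same invocation, different IDs:
# count the distinct IDs for each (workspace, invocation) pair.
ids_per_invocation = toy_jobs.groupby(["workspace_id", "invocation"])[
    "action_id"
].nunique()
invocations_with_changed_id = ids_per_invocation[ids_per_invocation > 1]
```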

In [None]:
Markdown(
    f"""
There are {len(jobs):,} jobs.
"""
)

## Analysis

How many times have actions of each type been executed?

In [None]:
(
    jobs.groupby("action_type")
    .size()
    .sort_values(ascending=False)
    .rename("count")
    .to_frame()
)

Recognising the need to be cautious,
we'd expect actions with the same ID that are associated with the same workspace to be executed more than once per workspace.
However, what is a typical number of executions?
Are some types of action executed more often than others?
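One way to answer this is to summarise the per-workspace counts by action type. The sketch below uses hypothetical toy data (`toy_jobs` and its values are made up) and reports the median rather than the mean, because the median is less sensitive to a few heavily re-run actions.

```python
import pandas

# Hypothetical toy data with the same column names as the jobs table.
toy_jobs = pandas.DataFrame(
    {
        "workspace_id": [1, 1, 1, 1, 2, 2],
        "action_id": ["a", "a", "a", "b", "a", "b"],
        "action_type": ["python", "python", "python", "stata", "python", "stata"],
    }
)

# Number of executions of each (workspace, action ID, action type) combination.
toy_counts = (
    toy_jobs.groupby(["workspace_id", "action_id", "action_type"])
    .size()
    .rename("count")
)

# Summarise the counts for each action type.
toy_summary = toy_counts.groupby("action_type").agg(["median", "max", "min"])
```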

In [None]:
num_runs_per_workspace = (
    jobs.groupby(["workspace_id", "action_id", "action_type"]).size().rename("count")
)

In [None]:
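# Pass aggregation names as strings; recent versions of pandas warn when
# given numpy.mean or the builtin max and min.
num_runs_per_workspace.groupby("action_type").aggregate(["mean", "max", "min"])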

In which cases do actions with the same ID that are associated with the same workspace have different invocations?
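A minimal sketch of the idea on hypothetical toy data, assuming that action type stands in for invocation. Each row of the per-workspace counts is one combination of workspace, action ID, and action type, so an ID that appears in more than one row for the same workspace has had more than one type. `DataFrame.duplicated` with `keep=False` flags every such row, not just the repeats.

```python
import pandas

# Hypothetical toy counts: one row per (workspace, action ID, action type).
toy_counts = pandas.DataFrame(
    {
        "workspace_id": [1, 1, 1],
        "action_id": ["analyse", "analyse", "model"],
        "action_type": ["jupyter", "python", "stata"],
        "count": [4, 2, 5],
    }
)

# keep=False marks all members of a duplicated (workspace, action ID) group.
is_repeated = toy_counts.duplicated(["workspace_id", "action_id"], keep=False)
changed = toy_counts.loc[is_repeated]
```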

In [None]:
# Keep every row whose (workspace, action ID) pair occurs with more than one
# action type.
(
    num_runs_per_workspace.reset_index()
    .loc[lambda df: df.duplicated(["workspace_id", "action_id"], keep=False)]
    .set_index(["workspace_id", "action_id", "action_type"])
)