# Exploration

## Prerequisites

In [None]:
from IPython.display import Markdown
import pandas
from src import PROCESSED_DATA_DIR

An *action* is an entry in a pipeline;
it is executed in a job.
An action's `id` is the entry's key.
Keys are unique within a pipeline but not across pipelines;
that is, keys are locally unique but not globally unique.
Consequently, an action's primary key is a composite of `id` and `job_id`.
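
As a minimal sketch of why the composite key is needed, the toy data below (hypothetical action keys and job ids, not taken from the real dataset) reuses the same key in two jobs. The names `toy_actions`, `id_is_unique`, and `composite_is_unique` are illustrative only.

```python
import pandas

# Hypothetical toy data: the key "generate_cohort" appears in two different jobs.
toy_actions = pandas.DataFrame(
    {
        "id": ["generate_cohort", "run_model", "generate_cohort"],
        "job_id": [1, 1, 2],
    }
)

# `id` alone repeats across jobs, so it cannot identify a row.
id_is_unique = toy_actions.set_index("id").index.is_unique

# The composite of `id` and `job_id` identifies each row.
composite_is_unique = toy_actions.set_index(["id", "job_id"]).index.is_unique

print(id_is_unique, composite_is_unique)
```

Under these assumptions, only the composite index passes the uniqueness check that the next cell asserts on the real data.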

In [None]:
actions = (
    pandas.read_feather(PROCESSED_DATA_DIR / "actions.feather")
    .set_index(["id", "job_id"])
    .sort_index()
)
assert actions.index.is_unique

A *job* is an execution of an action.

In [None]:
jobs = (
    pandas.read_feather(PROCESSED_DATA_DIR / "jobs.feather")
    .set_index("id")
    .sort_index()
)
assert jobs.index.is_unique

In [None]:
Markdown(
    f"""
There are {len(actions):,} actions and {len(jobs):,} jobs.
"""
)

Why isn't there a one-to-one relationship between actions and jobs?

Two effects pull in opposite directions.
Some pipelines couldn't be parsed, so some jobs aren't associated with any actions;
on its own, this would give more jobs than actions.
However, the `run_all` action has been expanded, so one job can correspond to several actions;
on its own, this would give more actions than jobs.
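
One way to see both effects at once is to count the actions associated with each job. The sketch below uses hypothetical toy data in which one job has no actions (an unparsed pipeline) and another has two (an expanded `run_all`). The names `toy_jobs`, `toy_actions`, and `actions_per_job` are assumptions for illustration.

```python
import pandas

# Hypothetical toy data: job 3 has no actions (its pipeline could not be parsed);
# job 2 has two actions (its `run_all` was expanded).
toy_jobs = pandas.DataFrame({"id": [1, 2, 3]}).set_index("id")
toy_actions = pandas.DataFrame(
    {
        "id": ["a", "a", "b"],
        "job_id": [1, 2, 2],
    }
)

# Number of actions per job, with 0 for jobs that have no actions.
actions_per_job = (
    toy_actions.groupby("job_id").size().reindex(toy_jobs.index, fill_value=0)
)

jobs_without_actions = int((actions_per_job == 0).sum())
jobs_with_several_actions = int((actions_per_job > 1).sum())

print(actions_per_job.value_counts().sort_index())
```

Applied to the real `actions` and `jobs` frames, the same distribution would show how much each effect contributes to the difference in counts.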

---

An action's `pseudo_id` indicates whether the action was run explicitly (`pseudo_id == id`) or implicitly (`pseudo_id == "run_all"`).
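
The rule can be sketched as a boolean column. The toy rows and the column name `run_explicitly` are hypothetical; only `id`, `pseudo_id`, and the `"run_all"` value come from the description above.

```python
import pandas

# Hypothetical toy data: one explicit run and two actions expanded from `run_all`.
toy_actions = pandas.DataFrame(
    {
        "id": ["generate_cohort", "generate_cohort", "run_model"],
        "pseudo_id": ["generate_cohort", "run_all", "run_all"],
    }
)

# An action was run explicitly when its `pseudo_id` equals its `id`.
toy_actions["run_explicitly"] = toy_actions["pseudo_id"] == toy_actions["id"]

# Tally explicit versus implicit runs.
explicit_counts = toy_actions["run_explicitly"].value_counts()
print(explicit_counts)
```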

---

A *workspace* is a collection of jobs.
We consider a workspace to be a proxy for a study.

In [None]:
actions = actions.join(jobs.workspace_id, on="job_id")

In [None]:
actions.head()
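
The join above relies on `DataFrame.join` accepting an index level name for `on`, so `job_id` can be matched against the index of `jobs` without resetting the composite index. A minimal sketch on hypothetical toy data (the values and the `type` labels are made up):

```python
import pandas

# Hypothetical toy data mirroring the shape of `jobs` and `actions`.
toy_jobs = pandas.DataFrame(
    {"id": [1, 2], "workspace_id": [10, 20]}
).set_index("id")
toy_actions = (
    pandas.DataFrame(
        {
            "id": ["a", "b", "a"],
            "job_id": [1, 1, 2],
            "type": ["type_a", "type_b", "type_a"],
        }
    )
    .set_index(["id", "job_id"])
    .sort_index()
)

# `on="job_id"` names a level of the caller's index;
# it is matched against the index of `toy_jobs`.
joined = toy_actions.join(toy_jobs.workspace_id, on="job_id")
print(joined)
```

Because this is a left join, every action is kept; an action whose `job_id` had no match in `toy_jobs` would get a missing `workspace_id`.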

## Analysis

How many times have actions of each type been run?

In [None]:
(
    actions.groupby("type")
    .size()
    .sort_values(ascending=False)
    .rename("count")
    .to_frame()
)

How many times have actions of each type been run, per workspace?

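* `count` is the number of workspaces within which the type of action has been run at least once
* the remaining columns (`mean`, `std`, `min`, the quartiles, and `max`) summarize the number of runs per workspace, across those workspaces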

In [None]:
actions.groupby(["workspace_id", "type"]).size().groupby("type").describe()
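
To make the resulting table easier to read, here is a sketch of the same two-step aggregation on hypothetical toy data (the workspace ids and the `type_a`/`type_b` labels are made up). The first `groupby` counts runs per workspace and type; the second summarizes those counts across workspaces for each type.

```python
import pandas

# Hypothetical toy data: workspace 1 ran type_a twice and type_b once;
# workspace 2 ran type_a once.
toy_actions = pandas.DataFrame(
    {
        "workspace_id": [1, 1, 1, 2],
        "type": ["type_a", "type_a", "type_b", "type_a"],
    }
)

# Step 1: number of runs for each (workspace, type) pair.
runs = toy_actions.groupby(["workspace_id", "type"]).size()

# Step 2: for each type, describe the per-workspace run counts.
summary = runs.groupby(level="type").describe()
print(summary[["count", "mean", "max"]])
```

Here `type_a` has a `count` of 2 because it was run in two workspaces, with a `mean` of 1.5 runs per workspace; `type_b` has a `count` of 1.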