Note: Variables that hold general notebook settings and usually do not need to be
changed by a user are written in ALL CAPS. Only change them if you know what you
are doing.

## Parameters code block

This code block usually does not need to be touched. It just extracts the parameters
passed to the notebook.

The `JOB_TESTING` flag can be used within this notebook to encode decisions that
reduce runtime to make testing easier.
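
As a minimal sketch of how such a flag might be used, the snippet below picks a smaller workload while testing. The names `n_replications_full`, `n_replications_test`, and `n_replications` and their values are made up for illustration; they are not part of this notebook.

```python
# Hypothetical example: use a smaller workload while testing.
JOB_TESTING = True  # in this notebook, the flag is set in the parameters code block

n_replications_full = 1000  # assumed value for a real run
n_replications_test = 10  # assumed value for a quick test run

# Pick the workload depending on the testing flag.
n_replications = n_replications_test if JOB_TESTING else n_replications_full
print(n_replications)
```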

Note that, while the jobdir can be discovered automatically when using the knitr
engine, I have not found a way to do that when using the jupyter engine, so
`JOB_DIR` has to be set manually in the code block below.

In [None]:
JOB_DIR: str = "jobs/002-demo-jupyter"  # <- UPDATE THIS TO YOUR JOBDIR
JOB_ROW: int = 0
JOB_TESTING: bool = True

## Setup

### Print Python executable

This code block prints the Python executable in order to make debugging easier.

In [None]:
# | label: print-python-executable
import sys

print(sys.executable)

### Print current working directory

Like printing the Python executable, this is mainly for debugging and should be executed at the beginning.

In [None]:
# | label: print-working-directory
from pathlib import Path

print(Path.cwd())

### Import libraries

This code block imports general utility libraries for this notebook.
It is usually not the best place for model-specific dependencies;
those are imported in the model section.

In [None]:
# | label: general-imports
import logging
from datetime import datetime
import pandas as pd

### Process parameters

In [None]:
JOB_PATH = Path(JOB_DIR)

In [None]:
# | label: load-parameters-dataframe
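# Select the row for this run; copy it so that later assignments work on an independent Series.
PARAMS = pd.read_csv(JOB_PATH / "params.csv").iloc[JOB_ROW, :].copy()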
print(PARAMS)

### Set identifier

In [None]:
# | label: create-identifier
now = datetime.now().strftime("%Y-%m-%d-%H%M")
JOB_IDENTIFIER = f"{JOB_PATH.name}-{JOB_ROW:04d}-{now}"
PARAMS["job"] = JOB_PATH.name
PARAMS["run"] = JOB_IDENTIFIER

### Create directories

In [None]:
DIR_LOG = JOB_PATH / "log"
DIR_OUT = JOB_PATH / "out"
DIR_FIN = JOB_PATH / "finished"

if JOB_TESTING:
    DIR_OUT = JOB_PATH / "out-test"

DIR_LOG.mkdir(exist_ok=True)
DIR_OUT.mkdir(exist_ok=True)
DIR_FIN.mkdir(exist_ok=True)

The next code block makes sure that finished jobs are not run again.

In [None]:
FINFILE = DIR_FIN / str(JOB_ROW)

if FINFILE.exists() and not JOB_TESTING:
    raise RuntimeError(
        f"Row {JOB_ROW} of job '{JOB_PATH.name}' is already finished. "
        "To run this job again, you need to delete the file "
        f"{FINFILE}."
    )
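
Because each finished row leaves a marker file named after its row number, the marker directory can also be used to check which rows still need to run. The following is a minimal sketch, assuming four rows and a temporary directory as a stand-in for `DIR_FIN`; `n_rows`, `dir_fin`, `finished`, and `unfinished` are hypothetical names.

```python
import tempfile
from pathlib import Path

n_rows = 4  # assumed number of rows in params.csv
dir_fin = Path(tempfile.mkdtemp())  # stand-in for DIR_FIN

# Pretend that rows 0 and 2 have already finished.
for row in (0, 2):
    (dir_fin / str(row)).touch()

# Marker files are named after the row number, so the names can be parsed back.
finished = {int(p.name) for p in dir_fin.iterdir() if p.name.isdigit()}
unfinished = [row for row in range(n_rows) if row not in finished]
print(unfinished)
```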

### Initialize logger

Initializing a logger is standard practice in any script. Logging messages help you
to be confident that your code is running as expected or to debug problems if necessary.

Note that the logger appends to existing log files. If you have many runs and never clean up, the files will grow very large, so make sure to clear out old log files from time to time.
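
One possible way to keep log files from growing without bound is the standard library's `RotatingFileHandler`. This is not what this notebook uses; the sketch below only illustrates the idea. The temporary directory stands in for `DIR_LOG`, and the size limit, backup count, and logger name are assumptions.

```python
import logging
import tempfile
from logging.handlers import RotatingFileHandler
from pathlib import Path

log_dir = Path(tempfile.mkdtemp())  # stand-in for DIR_LOG
log_file = log_dir / "run-0000.log"

# Roll over to a new file at roughly 1 MB and keep at most 3 old files (assumed limits).
rotating_handler = RotatingFileHandler(log_file, maxBytes=1_000_000, backupCount=3)
rotating_handler.setLevel(logging.INFO)
rotating_handler.setFormatter(
    logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
)

demo_logger = logging.getLogger("rotating-demo")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(rotating_handler)
demo_logger.info("This message goes to a size-limited log file.")

# Detach and close the handler so that the file is released.
demo_logger.removeHandler(rotating_handler)
rotating_handler.close()
```

A simpler alternative is to pass `mode="w"` to `logging.FileHandler`, which overwrites the log file on every run instead of appending to it.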

In [None]:
# | label: initialize-logger
formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

handler = logging.FileHandler(filename=DIR_LOG / f"run-{JOB_ROW:04d}.log")
handler.setLevel(logging.INFO)
handler.setFormatter(formatter)

stream_handler = logging.StreamHandler()
stream_handler.setLevel(logging.INFO)
stream_handler.setFormatter(formatter)

logger = logging.getLogger(str(JOB_PATH.name))
logger.addHandler(handler)
logger.addHandler(stream_handler)
logger.setLevel(logging.INFO)

In [None]:
logger.info(f"Job started: \t{JOB_IDENTIFIER} ")

## Model Code

This is where your serious model code starts. This notebook includes only some dummy code for testing.

### Model imports

Note that some libraries are imported here rather than in the general imports. For these notebooks, I think it is good practice to import model-specific libraries in this section, to keep them separate from the general-purpose imports.

In [None]:
# | label: model-imports
import numpy as np
import plotnine as p9

### Data generation

In [None]:
# | label: data-generation
rng = np.random.default_rng(JOB_ROW)

n = 100
b0 = 1.0
b1 = 1.5
x = rng.uniform(-2.0, 2.0, size=n)
X = np.c_[np.ones_like(x), x]
y = X @ np.r_[b0, b1] + rng.normal(size=n)

p9.qplot(x, y)

### Model fitting

In [None]:
# | label: model-fitting
# Solve the normal equations (X'X) beta = X'y directly instead of forming the inverse.
beta_estimated = np.linalg.solve(X.T @ X, X.T @ y)

### Results dictionary

Initializing the results dictionary from the `PARAMS` object, as done below,
ensures that all necessary information about this job is saved:
most importantly the job name, the job row (the row of params.csv
that was used), and the job identifier. It also includes all parameter
settings from the row of params.csv used for this run.
This is a little wasteful in terms of file size for the results objects,
because, strictly speaking, it would be sufficient to save the job name
and job row; the parameter values can then be retrieved from params.csv.
Storing everything, as done here, saves you the effort of merging.
If file size becomes an issue, consider switching to the sparser representation.
In either case, make sure that you can identify the exact
run conditions of each job.
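
As a rough sketch of what the sparser representation would require later, the snippet below merges results that store only the job name and job row back onto the parameters. The parameter columns `n` and `sigma`, the key column `job_row`, and all numbers are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical params.csv content with two made-up parameter columns.
params = pd.read_csv(io.StringIO("n,sigma\n100,1.0\n200,0.5\n"))
params["job_row"] = range(len(params))  # the row position serves as the merge key

# Sparse results: only the job name, the job row, and the computed values.
sparse_results = pd.DataFrame(
    {
        "job": ["002-demo-jupyter", "002-demo-jupyter"],
        "job_row": [0, 1],
        "b1_bias": [0.03, -0.01],  # made-up numbers
    }
)

# Recover the parameter values by merging on the job row.
full_results = sparse_results.merge(
    params, on="job_row", how="left", validate="one_to_one"
)
print(full_results)
```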

In [None]:
# | label: results
results = PARAMS.to_dict()

# Bias is computed as the estimate minus the true value.
results["b0_bias"] = beta_estimated[0] - b0
results["b1_bias"] = beta_estimated[1] - b1

## Save Results

It is up to you how exactly to save your outputs. I have had good experiences with creating one dedicated directory for each type of output dataframe that I save. In this demo, there is only one. In any case, make sure that each output dataframe contains the necessary information about the parameters used to create it.

In [None]:
# | label: save-results
(DIR_OUT / "results").mkdir(exist_ok=True)

results = pd.DataFrame(results, index=pd.Index([0]))
results.to_csv(
    DIR_OUT / "results" / f"results-row{JOB_ROW:04d}.csv",
    index=False,
)
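
Once several rows have run, the per-row files can be combined for analysis. The following is a minimal sketch of one way to do that, using a temporary directory as a stand-in for `DIR_OUT / "results"` and made-up file contents; only the file naming pattern is taken from the code block above.

```python
import tempfile
from pathlib import Path

import pandas as pd

results_dir = Path(tempfile.mkdtemp()) / "results"  # stand-in for DIR_OUT / "results"
results_dir.mkdir()

# Write two made-up one-row result files using the naming pattern from above.
for row, bias in [(0, 0.03), (1, -0.01)]:
    pd.DataFrame({"run": [f"demo-{row:04d}"], "b1_bias": [bias]}).to_csv(
        results_dir / f"results-row{row:04d}.csv", index=False
    )

# Read every per-row file and stack them into one dataframe.
files = sorted(results_dir.glob("results-row*.csv"))
all_results = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
print(all_results)
```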

## Mark Job as Finished

In [None]:
# | label: mark-finished
if not JOB_TESTING:
    FINFILE.touch()