# core

> This is the core module; I haven't decided whether any other modules are needed yet.

List of things we'd like to log:

* Code version
* Uncommitted changes
* Tree directory snapshot
* All shell output logs
* Time and date
* Docker container info
* All code, using wandb to capture it
* Any config files mentioned in argv
* nvidia-smi (this is in some experiments anyway)
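
Several of these items can be gathered with the standard library alone. The sketch below is one possible way to do that, not what this module currently does: `collect_run_info` and `_run` are hypothetical helpers, and the dictionary keys are made up for illustration. It assumes `git` and `nvidia-smi` may be missing and records `None` when a command is unavailable or fails.

```python
import datetime
import subprocess
import sys


def _run(cmd):
    """Run a command and return its stripped stdout, or None if it fails."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # command not installed, or it exited with a non-zero status
        return None
    return result.stdout.strip()


def collect_run_info(argv=None):
    """Gather a few of the listed items into a plain dict (hypothetical helper)."""
    argv = sys.argv if argv is None else argv
    return {
        "timestamp": datetime.datetime.now().astimezone().isoformat(),
        "argv": list(argv),
        # code version: the commit currently checked out
        "git_commit": _run(["git", "rev-parse", "HEAD"]),
        # uncommitted changes: one line per modified or untracked file
        "git_status": _run(["git", "status", "--porcelain"]),
        "nvidia_smi": _run(["nvidia-smi"]),
    }


info = collect_run_info(["train.py", "--config", "config.yaml"])
print(sorted(info))
```

Returning `None` instead of raising keeps the logging from breaking an experiment on a machine without a GPU or outside a git checkout.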

This is mostly a wandb wrapper. The goal is to copy the wandb directory to some location (such as OneDrive) where it can be stored indefinitely, along with a UUID. At the same time, it writes the output log and short job info to a shared git repo. We can then commit these logs and use the git repo as a continuous log of all experiments run. Because the UUIDs are recorded there, anyone can ask whoever ran an experiment for more information, and they may still have the full run saved.

It might become a huge storage hog but I'm willing to take that risk.

TODO:

* refactor to put each function/class in its own cell
* put code in sections
* instead of printing, use `logging`
* use full wandb dir run name instead of just ID
* Extract GPU logs if they're in the wandb data

## Example Usage
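
A minimal sketch of a training script, pieced together from the tests later in this notebook. It assumes `profane` and `wandb` are installed and that the setup step described under Structure has already written `~/.profanerc`. `run_experiment`, `log_root`, and the logged `loss` values are hypothetical.

```python
def run_experiment(log_root="."):
    """Hypothetical experiment body; needs `profane` and `wandb` installed."""
    import wandb
    import profane.core

    # acts like wandb.init, but forces offline mode and registers the sync hooks
    run = profane.core.init(dir=log_root)
    for step in range(10):
        wandb.log({"loss": 1.0 / (step + 1)})
        print(f"step {step}")  # terminal output ends up in the shared output.log
    # finishing the run triggers the copies to local and shared storage
    run.finish()
```

Calling `run_experiment()` should leave the full wandb run directory in the local storage directory, and `output.log` plus the metadata in the shared one.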


## Structure

Before running anything, run `profane setup` to specify where to save:

* Complete logs (some directory with enough storage space)
* Minimal shared logs (should be a shared git repo)

`profane.core.init` is the entrypoint; it is meant to act like `wandb.init`.

When initialized:

* Uses the run ID generated by wandb as the run name
* Creates an output directory with that name in the user's subdirectory of the shared repository
* Creates a mirrored output directory in the local complete log directory
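
The resulting layout can be sketched as follows. This is an illustration only: `run_dirs` is a hypothetical helper, the user name `alice` is made up, and the grouping of project-less runs under `misc` follows what the code further down does.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def run_dirs(shared_root, local_root, user, project, run_id):
    """Return (shared_dir, local_dir) for one run without creating them."""
    project = project or "misc"  # runs without a project are grouped together
    shared_dir = Path(shared_root) / user / project / run_id
    local_dir = Path(local_root) / project / run_id
    return shared_dir, local_dir


with TemporaryDirectory() as tmp:
    root = Path(tmp)
    shared_dir, local_dir = run_dirs(root / "shared", root / "local", "alice", "", "q9zgbgd7")
    for directory in (shared_dir, local_dir):
        directory.mkdir(parents=True)
    shared_rel = shared_dir.relative_to(root).as_posix()
    local_rel = local_dir.relative_to(root).as_posix()
    print(shared_rel)  # shared/alice/misc/q9zgbgd7
    print(local_rel)   # local/misc/q9zgbgd7
```

Only the shared directory is split by user, since the local directory belongs to whoever is running the experiment.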

When the run finishes:

* Extracts the `output.log` and saves it to the shared log directory from before, along with the command run and the metadata
* Copies the entire `wandb` run directory to the complete log directory
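
The copy step could also be written with `shutil.copytree`. This is an alternative to the file-by-file loop used further down, not what the module does. `mirror_run_dir` and the file names are hypothetical, and `dirs_exist_ok` requires Python 3.8 or later.

```python
import shutil
from pathlib import Path
from tempfile import TemporaryDirectory


def mirror_run_dir(wandb_dir, local_dir):
    """Copy everything under wandb_dir into local_dir, keeping the tree shape."""
    # dirs_exist_ok lets the copy target a directory that was created earlier
    shutil.copytree(wandb_dir, local_dir, dirs_exist_ok=True)
    return sorted(
        p.relative_to(local_dir).as_posix()
        for p in Path(local_dir).rglob("*")
        if p.is_file()
    )


with TemporaryDirectory() as tmp:
    # build a tiny fake run directory to copy
    src = Path(tmp) / "wandb" / "offline-run-demo"
    (src / "logs").mkdir(parents=True)
    (src / "run-demo.wandb").write_bytes(b"")
    (src / "logs" / "debug.log").write_text("hello\n")
    # the destination already exists, as it would after initialization
    dst = Path(tmp) / "complete" / "misc" / "demo"
    dst.mkdir(parents=True)
    copied = mirror_run_dir(src, dst)
    print(copied)  # ['logs/debug.log', 'run-demo.wandb']
```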

To do this, it has to add a teardown hook that runs when the run finishes.
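
Because the same hook may be triggered more than once (by the run finishing and again at interpreter exit), it helps if it only does its work the first time. A minimal standard-library sketch of that pattern, where `RunOnce` and `calls` are hypothetical names and wandb is not involved:

```python
import atexit


class RunOnce:
    """Wrap a callback so that repeated invocations only run it once."""

    def __init__(self, callback):
        self.callback = callback
        self.done = False

    def __call__(self):
        if self.done:
            return
        self.done = True
        self.callback()


calls = []
sync = RunOnce(lambda: calls.append("synced"))

# safety net: runs at interpreter exit if nothing triggered the sync earlier
atexit.register(sync)

sync()  # e.g. triggered when the run finishes
sync()  # a second trigger is a no-op
print(calls)  # ['synced']
```

This sketch marks the hook as done before running the callback, so a failing callback is not retried. The callbacks below set their flag only after the copy succeeds.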

Testing that this definitely works in a standalone script will be annoying.

In [None]:
#| default_exp core

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| export
import argparse
from pathlib import Path

# this is used for testing
RCDIR = None

def get_rcdir():
    """Function to get the rcdir."""
    rcdir = RCDIR or Path.home()
    if isinstance(rcdir, str):
        rcdir = Path(rcdir)
    return rcdir

def parse_args():
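    # pass the text as `description`; the first positional argument of ArgumentParser is `prog`
    argument_parser = argparse.ArgumentParser(description="Set up a ~/.profanerc file and register the required directories.")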
    argument_parser.add_argument('local_storage', type=Path, help="The local storage directory, everything will be stored here.")
    argument_parser.add_argument('shared_storage', type=Path, help="The shared storage directory, only the terminal logs and metadata will be stored here.")
    argument_parser.add_argument('--user', type=str, default=None, help="Optional: The user name to use for the shared storage directory.")
    return argument_parser.parse_args()

def setup():
    args = parse_args() 
    return _setup(args.local_storage, args.shared_storage, args.user)

def _setup(local_storage, shared_storage, user):
    """Function to set up the ~/.profanerc file and register the required directories."""
    # check the two directories exist
    if not local_storage.exists():
        raise FileNotFoundError(f"{local_storage} does not exist.")
    if not shared_storage.exists():
        raise FileNotFoundError(f"{shared_storage} does not exist.")
    # save the config file
    rcdir = get_rcdir()
    config_file = rcdir / ".profanerc"
    config_file.write_text(f"local_storage={str(local_storage.resolve())}\nshared_storage={str(shared_storage.resolve())}" + (f"\nuser={user}" if user else ""))

def get_config():
    """Function to get the config file."""
    rcdir = get_rcdir()
    config_file = rcdir / ".profanerc"
    if not config_file.exists():
        raise FileNotFoundError(f"{config_file} does not exist. Run profane_setup <local dir> <shared dir> to set it up.")
    config = {}
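    for line in config_file.read_text().splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # split on the first "=" only, so values containing "=" stay intact
        key, value = line.split("=", 1)
        config[key] = value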
    return config

In [None]:
#| export
import wandb
# from typing import Callable, NamedTuple
from enum import IntEnum
import shutil
import os
import atexit
from wandb.sdk.wandb_run import TeardownHook, TeardownStage
import time

# create directories using run information
def create_dirs(run_id, user, project):
    config = get_config()
    if len(project) == 0:
        project = "misc"
    shared_dir = Path(config['shared_storage']) / user / project / run_id
    shared_dir.mkdir(parents=True)
    local_dir = Path(config['local_storage']) / project / run_id
    local_dir.mkdir(parents=True)
    return shared_dir, local_dir


class SyncLocalCallback:
    def __init__(self):
        self.exited = False

    def register_dirs(self, local_dir, wandb_dir):
        self.local_dir = local_dir
        self.wandb_dir = wandb_dir

    def __call__(self):
        print("local callback called")
        if not self.exited:
            # wait until wandb has written something to the run directory
            while not any(self.wandb_dir.iterdir()):
                time.sleep(1)
                print(f"waiting for {self.wandb_dir} to be populated")
            print(f"found {len(list(self.wandb_dir.iterdir()))} files in {self.wandb_dir}")
            # copy wandb dir to the local complete log dir, nothing fancy
            for path_object in self.wandb_dir.rglob('*'):
                target = self.local_dir / path_object.relative_to(self.wandb_dir)
                if path_object.is_file():
                    # make sure the parent exists even if the file is yielded first
                    target.parent.mkdir(parents=True, exist_ok=True)
                    print(f"copying {path_object} to {target}")
                    shutil.copy(path_object, target)
                else:
                    target.mkdir(parents=True, exist_ok=True)
            print(f"copied {self.wandb_dir} to {self.local_dir}")
            self.exited = True

       
class SyncSharedCallback:
    def __init__(self):
        self.exited = False

    def register_dirs(self, shared_dir, wandb_dir):
        self.shared_dir = shared_dir
        self.wandb_dir = wandb_dir

    def __call__(self):
        print("shared callback called")
        if not self.exited:
            # find the wandb output file (the first file ending in .wandb)
            wandb_file = next(
                (p for p in self.wandb_dir.rglob('*.wandb') if p.is_file()), None)
            if wandb_file is None:
                raise FileNotFoundError(f"no .wandb file found in {self.wandb_dir}")
            # parse the wandb file
            output_log = parse_output_log(wandb_file)
            # write output log to shared_dir
            with open(self.shared_dir / 'output.log', 'w') as f:
                f.write(output_log)
            # this should be where the metadata is
            metadata_file = self.wandb_dir / "files/wandb-metadata.json"
            # copy the metadata file next to the output log
            shutil.copy(metadata_file, self.shared_dir)
            print(f"file written to {self.shared_dir / 'output.log'}")
            self.exited = True
 

def init(**kwargs):
    """
    A wrapper for `wandb.init`.
    """
    # atexit called in reverse order so these need to be created first
    local_hook = SyncLocalCallback()
    shared_hook = SyncSharedCallback()

    kwargs['mode'] = 'offline'
    run = wandb.init(**kwargs)
    config = get_config()
    wandb_dir = Path(run.dir).parent
    username = os.environ['USER'] if 'user' not in config else config['user']
    shared_dir, local_dir = create_dirs(run.id, username, run.project)
    shared_hook.register_dirs(shared_dir, wandb_dir)
    local_hook.register_dirs(local_dir, wandb_dir)
    # this will trigger if run.finish() is called
    run._teardown_hooks += [TeardownHook(local_hook, TeardownStage.LATE),
                            TeardownHook(shared_hook, TeardownStage.LATE)]
    def finish():
        print("finish called")
        run.finish()
        local_hook()
        shared_hook()
    atexit.register(finish)
    return run

In [None]:
#| export
from wandb.proto import wandb_internal_pb2
from wandb.sdk.internal import datastore


def parse_output_log(data_path):
    """
    Parse wandb data from a given path.
    Returns the terminal log typically saved as `output.log`,
    which isn't created unless you're running in online mode.
    But, the data still exists in the `.wandb` file.
    """
    # https://github.com/wandb/wandb/issues/1768#issuecomment-976786476 
    ds = datastore.DataStore()
    ds.open_for_scan(data_path)
    terminal_log = []

    data = ds.scan_record()
    while data is not None:
        pb = wandb_internal_pb2.Record()
        pb.ParseFromString(data[1])  
        record_type = pb.WhichOneof("record_type")
        if record_type == "output_raw":
            terminal_log.append(pb.output_raw.line)
            #print(pb.output_raw)
        data = ds.scan_record()
    return "".join(terminal_log)

In [None]:
# to test this I need to make a temporary directory and run a fake experiment
# then I can check that whatever gets printed is exactly what is stored in the
# wandb file
from tempfile import TemporaryDirectory
from pathlib import Path
import os
os.environ['WANDB_SILENT'] = 'true'

# import subprocess

printed = []
def _print(*args, **kwargs):
    global printed
    printed += [*args, "\n"]
    return print(*args, **kwargs)

def test_experiment():
    global printed
    global RCDIR
    with TemporaryDirectory() as tmpdirname:
        print("created ", tmpdirname)
        RCDIR = tmpdirname
        tmpdirname = Path(tmpdirname)
        local_storage = Path(tmpdirname/'profane_storage')
        local_storage.mkdir()
        shared_storage = Path(tmpdirname/'profane_shared_storage')
        shared_storage.mkdir()
        _setup(local_storage, shared_storage, None)
        with open(tmpdirname / '.profanerc', 'r') as f:
            print(f.read())
        print('created temporary directory', tmpdirname) # lol this is from the docs
        # wandb by default saves to a directory in the current working directory so
        # we need to pass a path to the directory we just created
        target_dir = tmpdirname / 'wandb'
        print(target_dir)
        run = init(dir=tmpdirname)
        wandb.log({'test': 1})
        _print("something to be logged")
        for i in range(10):
            _print(f"logging {i}")
        run.finish()
        # check that the files that should have been created are there
        wandb_file_exists = False
        for fpath in local_storage.rglob('*'):
            if fpath.suffix == '.wandb':
                wandb_file_exists = True
        assert wandb_file_exists, f"{local_storage} does not contain a .wandb file"
        output_log_exists = False
        for fpath in shared_storage.rglob('*'):
            if fpath.name == 'output.log':
                output_log_exists = True
                with open(fpath, 'r') as f:
                    output_log = f.read()
        assert output_log_exists
        # check that the output log is correct
        assert output_log == ''.join(printed), f"output log does not match printed output\n{output_log}\n{''.join(printed)}"

test_experiment()

created  /var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmp6bj74k3w
local_storage=/private/var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmp6bj74k3w/profane_storage
shared_storage=/private/var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmp6bj74k3w/profane_shared_storage
created temporary directory /var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmp6bj74k3w
/var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmp6bj74k3w/wandb
something to be logged
logging 0
logging 1
logging 2
logging 3
logging 4
logging 5
logging 6
logging 7
logging 8
logging 9
local callback called
found 4 files in /var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmp6bj74k3w/wandb/offline-run-20230801_113359-q9zgbgd7
copying /var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmp6bj74k3w/wandb/offline-run-20230801_113359-q9zgbgd7/run-q9zgbgd7.wandb to /private/var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmp6bj74k3w/profane_storage/misc/q9zgbgd7/run-q9zgbgd7.wandb
copying /var/folders/ln/1018n5357kjc745b28hf7jzm00

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()

**WARNING**: this test imports the exported `profane.core` module, so it may run stale code unless you saved the notebook and ran the export cell above first.

In [None]:
# the same test as above but run on the command line
import subprocess

os.environ['WANDB_SILENT'] = 'false'

with TemporaryDirectory() as tmpdirname:
    RCDIR = tmpdirname
    tmpdirname = Path(tmpdirname)
    local_storage = Path(tmpdirname/'profane_storage')
    local_storage.mkdir()
    shared_storage = Path(tmpdirname/'profane_shared_storage')
    shared_storage.mkdir()
    _setup(local_storage, shared_storage, None)

    script = f"""
import profane.core
profane.core.RCDIR = '{tmpdirname}'
profane.core.init(dir='{tmpdirname}')
print('done')
"""
    tmpdir = Path(tmpdirname)
    with open(tmpdir/'test.py', 'w') as f:
        f.write(script)
    subprocess.run(['python', tmpdir/'test.py'])
    #for fpath in local_storage.rglob('*'):
    #    print(fpath)
    # for fpath in shared_storage.rglob('*'):
    #     print(fpath)
    # for fpath in tmpdir.rglob('*'):
    #     print(fpath)
    wandb_file_exists = False
    for fpath in local_storage.rglob('*'):
        if fpath.suffix == '.wandb':
            wandb_file_exists = True
    assert wandb_file_exists, f"{local_storage} does not contain a .wandb file"
    output_log_exists = False
    for fpath in shared_storage.rglob('*'):
        if fpath.name == 'output.log':
            output_log_exists = True
            with open(fpath, 'r') as f:
                output_log = f.read()
    assert output_log_exists
    # check that the output log is correct
    assert output_log == "done\n", f"output log is incorrect: {output_log}"

wandb: Tracking run with wandb version 0.15.7
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Waiting for W&B process to finish... (success).


done
finish called


wandb: You can sync this run to the cloud by running:
wandb: wandb sync /var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmpsxde72hp/wandb/offline-run-20230801_113917-ga25ie39
wandb: Find logs at: /var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmpsxde72hp/wandb/offline-run-20230801_113917-ga25ie39/logs


local callback called
found 4 files in /var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmpsxde72hp/wandb/offline-run-20230801_113917-ga25ie39
copying /var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmpsxde72hp/wandb/offline-run-20230801_113917-ga25ie39/run-ga25ie39.wandb to /private/var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmpsxde72hp/profane_storage/misc/ga25ie39/run-ga25ie39.wandb
copying /var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmpsxde72hp/wandb/offline-run-20230801_113917-ga25ie39/logs/debug-internal.log to /private/var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmpsxde72hp/profane_storage/misc/ga25ie39/logs/debug-internal.log
copying /var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmpsxde72hp/wandb/offline-run-20230801_113917-ga25ie39/logs/debug.log to /private/var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmpsxde72hp/profane_storage/misc/ga25ie39/logs/debug.log
copying /var/folders/ln/1018n5357kjc745b28hf7jzm0000gn/T/tmpsxde72hp/wandb/offline-run-20230801_113