# Project

### Project Resources
**Experiment Template**  

[templates/experiments/example/experiment.yaml](templates/experiments/example/experiment.yaml)  
[templates/project.yaml](templates/project.yaml)  

[templates/experiments/example/model_config.yaml](templates/experiments/example/model_config.yaml)  
[templates/experiments/example/trainer_config.yaml](templates/experiments/example/trainer_config.yaml)  

**Project Templates**  

[meta_config.yaml](meta_config.yaml)  
[templates/directories.yaml](templates/directories.yaml)  
 

**Library Templates**  

Config Templates  
[../../templates/configs/default_train_script.yaml](../../templates/configs/default_train_script.yaml)  
[../../templates/configs/base_train_config.yaml](../../templates/configs/base_train_config.yaml)  

Model Templates  
[../../templates/models/tiny_d128_l2.yaml](../../templates/models/tiny_d128_l2.yaml)  
[../../templates/models/vanilla_transformer.yaml](../../templates/models/vanilla_transformer.yaml)  
[../../templates/models/custom_model.yaml](../../templates/models/custom_model.yaml)  
[../../templates/models/load_custom_model.yaml](../../templates/models/load_custom_model.yaml)  

Dataset Templates  
[../../templates/datasets/tiny_stories_pretokenized_2k.yaml](../../templates/datasets/tiny_stories_pretokenized_2k.yaml)  
[../../templates/datasets/base_dataset_loader.yaml](../../templates/datasets/base_dataset_loader.yaml)  

Trainer Templates  
[../../templates/trainers/base_trainer.yaml](../../templates/trainers/base_trainer.yaml)  
[../../templates/trainers/trainer.yaml](../../templates/trainers/trainer.yaml)  
[../../templates/trainers/accel_trainer.yaml](../../templates/trainers/accel_trainer.yaml)  
[../../templates/trainers/hf_trainer.yaml](../../templates/trainers/hf_trainer.yaml)  

**Whitelists**  
[templates/whitelist.yaml](templates/whitelist.yaml)  
[../../templates/whitelists/global_whitelist.yaml](../../templates/whitelists/global_whitelist.yaml)  
[../../templates/whitelists/model_zoo_whitelist.yaml](../../templates/whitelists/model_zoo_whitelist.yaml)  

**Model Code**  

[../../model_zoo/vanilla_transformer/vanilla_transformer.py](../../model_zoo/vanilla_transformer/vanilla_transformer.py)  

## Setup

In [1]:
import sys, os
modules_path = os.path.join('..', '..', 'src')
if modules_path not in sys.path: sys.path.insert(0, modules_path)
import shutil

from pprint import pformat, pp
from transformers import set_seed

from forgather.config import (
    load_config,
    ConfigEnvironment,
    default_pp_globals,
    pconfig,
)
from forgather.preprocess import LineStatementProcessor
from forgather import Latent

# Path to your project meta-config
meta_config_path = 'meta_config.yaml'
experiment_definition_file = 'example_experiment.yaml'

# Load meta-config
metacfg = load_config(meta_config_path)
print(f"{' '+meta_config_path+' ':-^40}")
pconfig(metacfg)

# Create a configuration environment from the meta-config data
config_environ = ConfigEnvironment(searchpath=metacfg.search_paths, globals=default_pp_globals())

# Get path to selected experiment
experiment_path = os.path.join(metacfg.experiment_dir, experiment_definition_file)
print(f"{' Experiment ':-^40}\n{experiment_path}")

----------- meta_config.yaml -----------
assets_dir: '../..'
datasets_dir: '../../datasets'
experiment_dir: './templates/experiments'
model_dir: './output_models'
model_src_dir: '../../model_zoo'
project_dir: '.'
project_templates: './templates'
scripts_dir: '../../scripts'
search_paths:
  - '../../templates'
  - './templates'
src_dir: '../../src'
templates: '../../templates'
tokenizer_dir: '../../tokenizers'
train_script_path: '../../scripts/train_script.py'
whitelist_path: './templates/whitelist.yaml'
-------------- Experiment --------------
./templates/experiments/example_experiment.yaml
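
The banner lines in the output above come from the format spec `-^40` used in the `print` calls: center the text in a 40-character field and pad with `-`. A minimal sketch of the same formatting, wrapped in a hypothetical helper named `banner` (not part of forgather):

```python
def banner(title, width=40, fill="-"):
    """Center ' title ' in a line of `width` characters, padded with `fill`."""
    # The nested fields build the format spec, e.g. "-^40".
    return f"{' ' + title + ' ':{fill}^{width}}"


print(banner("meta_config.yaml"))
print(banner("Experiment"))
```

With the default width, both calls produce 40-character lines that match the banners printed by the setup cell.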


### Preprocess Configuration
Run the cell and click the link to open the preprocessed file in the notebook.  
[preprocessed_config.yaml](preprocessed_config.yaml)

In [None]:
# Set 'preserve_line_numbers' to True to preserve line numbers; otherwise the line numbers reported in exceptions may be incorrect.
# Set 'pp_verbose' to True to dump all pre-pre-processed templates for diagnostics.
def debug_preprocess(preserve_line_numbers=False, pp_verbose=False):
    # These are class-level attributes which control diagnostics on this class.
    LineStatementProcessor.preserve_line_numbers = preserve_line_numbers
    LineStatementProcessor.pp_verbose = pp_verbose

    try:
        # Preprocess and print with line numbers
        pp_config = config_environ.preprocess(experiment_path)
        print(pp_config.with_line_numbers())
    finally:
        # Always restore the defaults, even if preprocessing raises.
        LineStatementProcessor.preserve_line_numbers = False
        LineStatementProcessor.pp_verbose = False
    
debug_preprocess(preserve_line_numbers=False, pp_verbose=False)
with open('preprocessed_config.yaml', 'w') as f:
    f.write(config_environ.preprocess(experiment_path))
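
`debug_preprocess` sets class-level flags and then restores them. One way to make that restoration hold even when preprocessing raises is a small context manager. The sketch below is an illustration only: `temporary_attrs` and `DummyProcessor` are hypothetical names, and `DummyProcessor` merely stands in for `LineStatementProcessor`.

```python
from contextlib import contextmanager


@contextmanager
def temporary_attrs(obj, **overrides):
    """Set attributes on `obj` for the duration of the block, then restore them."""
    saved = {name: getattr(obj, name) for name in overrides}
    for name, value in overrides.items():
        setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs on normal exit and on exceptions alike.
        for name, value in saved.items():
            setattr(obj, name, value)


class DummyProcessor:
    # Hypothetical stand-in for the class-level diagnostic flags.
    preserve_line_numbers = False
    pp_verbose = False


with temporary_attrs(DummyProcessor, preserve_line_numbers=True):
    inside = DummyProcessor.preserve_line_numbers  # flag is active here

after = DummyProcessor.preserve_line_numbers  # flag is restored here
```

Unlike hard-coding `False` on exit, this restores whatever values the flags had before the block was entered.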

### Load Configuration

In addition to preprocessing, this will parse the YAML file, but will not try to instantiate the configuration.

In [None]:
loaded_config = config_environ.load(experiment_path)

# Print parsed config
pconfig(loaded_config.config)

### Materialize the Configuration

Preprocess, parse, and instantiate the objects defined in the configuration.

In [None]:
# Set the seed, if this is to be completely deterministic.
# set_seed(42)
config = config_environ.load(experiment_path).materialize()
pconfig(config)
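
The difference between loading and materializing can be pictured as a tree of plain data and not-yet-called constructors that is later walked and instantiated. The sketch below is a conceptual illustration under that assumption, not forgather's implementation: `Deferred`, `TinyModel`, and this `materialize` function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Deferred:
    """Hypothetical node: a constructor plus its arguments, not yet called."""
    factory: Callable[..., Any]
    kwargs: dict = field(default_factory=dict)


@dataclass
class TinyModel:
    """Hypothetical object that a configuration might describe."""
    d_model: int
    n_layers: int


def materialize(node):
    """Recursively replace Deferred nodes with the objects they describe."""
    if isinstance(node, Deferred):
        return node.factory(**{k: materialize(v) for k, v in node.kwargs.items()})
    if isinstance(node, dict):
        return {k: materialize(v) for k, v in node.items()}
    if isinstance(node, list):
        return [materialize(v) for v in node]
    return node


# "Loaded" form: parsed data only; no constructor has run yet.
loaded = {
    "experiment_name": "example",
    "model": Deferred(TinyModel, {"d_model": 128, "n_layers": 2}),
}

# "Materialized" form: every Deferred node has been replaced by a real object.
config = materialize(loaded)
print(config["model"])
```

Keeping the two stages separate means a configuration can be inspected and printed without paying the cost of constructing models or datasets.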

### Run Trainer

Note: This cell only calls `trainer.train()`; it does not set the seed, evaluate, or save the model. For a complete implementation, use the training loop in the next section.

In [None]:
config = config_environ.load(experiment_path).materialize()
config.trainer.train()

### Training Loop

In [2]:
from accelerate import notebook_launcher

# This is the entry-point for the spawned processes.
def training_loop(meta_config, experiment_path):
    # Ensure that initialization is deterministic.
    set_seed(42)
    metacfg = load_config(meta_config)

    # Get Torch Distributed parameters from environ.
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    rank = int(os.environ.get('RANK', 0))
    local_rank = int(os.environ.get('LOCAL_RANK', 0))

    # Initialize the pre-processor globals
    env_globals = default_pp_globals() | dict(
        world_size=world_size,
        rank=rank,
        local_rank=local_rank,
    )
    
    # Construct config environment and inject the distributed config env vars.
    config_environ = ConfigEnvironment(
        searchpath=metacfg.search_paths,
        globals=env_globals,
    )

    # Materialize the configuration
    config = config_environ.load(experiment_path).materialize()

    # In a distributed environment, only one process (local rank 0 on each node) should print messages
    is_main_process = (local_rank == 0)
    
    if is_main_process:
        print("**** Training Started *****")
        print(f"experiment_name: {config.experiment_name}")
        print(f"experiment_description: {config.experiment_description}")
        print(f"output_dir: {config.output_dir}")
        print(f"logging_dir: {config.logging_dir}")

    # This is where the actual 'loop' is.
    metrics = config.trainer.train().metrics
    
    if is_main_process:
        print("**** Training Completed *****")
        print(metrics)

    metrics = config.trainer.evaluate()

    if is_main_process:
        print("**** Evaluation Completed *****")
        print(metrics)
    
    if config.do_save:
        config.trainer.save_model()
        if is_main_process:
            print(f"Model saved to: {config.trainer.args.output_dir}")
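
The training loop reads `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` from the environment and falls back to single-process defaults when they are absent. A minimal sketch of that step in isolation; `read_dist_env` is a hypothetical helper, and the `|` dict merge used here (and in the loop above) requires Python 3.9 or later.

```python
import os


def read_dist_env(environ=None):
    """Return the distributed parameters, defaulting to a single process."""
    env = os.environ if environ is None else environ
    return dict(
        world_size=int(env.get("WORLD_SIZE", 1)),
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
    )


# No launcher: the variables are unset, so the defaults apply.
single = read_dist_env({})

# Second of two workers, as a launcher such as torchrun would describe it.
worker = read_dist_env({"WORLD_SIZE": "2", "RANK": "1", "LOCAL_RANK": "1"})

# Merge into an existing globals dict; keys on the right win on conflict.
base_globals = {"project": "demo", "rank": -1}
merged = base_globals | worker
print(single, worker, merged, sep="\n")
```

Passing the mapping in as an argument keeps the helper easy to test without modifying the real process environment.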

#### Run Training Loop Directly

In [None]:
training_loop(meta_config_path, experiment_path)

#### Launch with Notebook Launcher

Note: This will fail if CUDA has already been initialized in this notebook session. Restart the notebook first.

In [None]:
notebook_launcher(
    training_loop,
    args=(meta_config_path, experiment_path,),
    num_processes=2
)

### Train from a Training Script

This cell defines code for generating a bash command line that runs this configuration from the shell.

In [28]:
import stat
# Output train-script command line as a string
def train_cmdline(metacfg, nproc='gpu', cuda_devices=None):
    includes = ''.join(f"-I '{inc}' " for inc in metacfg.search_paths)
    s = (f"torchrun --standalone --nproc-per-node '{nproc}' '{metacfg.train_script_path}'"
        + f" -w '{metacfg.whitelist_path}' {includes} -s '{metacfg.src_dir}'"
    )
    if cuda_devices is not None:
        s = f"CUDA_VISIBLE_DEVICES='{cuda_devices}' " + s
    return s

# Write the train-script command line to a bash script
# Usage: ./train.sh [<other-cli-args>] <experiment-config-file>
def make_bash_script(metacfg, script_path='train.sh', nproc='gpu', cuda_devices=None):
    with open(script_path, 'w') as f:
        f.write('#!/bin/bash\n' + train_cmdline(metacfg, nproc, cuda_devices) + ' "${@}"\n')
        # Owner-only read, write, and execute (S_IREAD is just a synonym for S_IRUSR).
        os.chmod(f.fileno(), stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR)
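
`train_cmdline` wraps each argument in single quotes by hand, which works as long as no path contains a single quote. As an alternative sketch, the standard library's `shlex.join` can quote an argument list for a POSIX shell. The function name `quoted_cmdline` and the `SimpleNamespace` stand-in for the meta-config are assumptions made for illustration; the flags mirror those used above.

```python
import shlex
from types import SimpleNamespace


def quoted_cmdline(metacfg, nproc="gpu"):
    """Build the torchrun command line, letting shlex handle the quoting."""
    argv = [
        "torchrun", "--standalone", "--nproc-per-node", str(nproc),
        metacfg.train_script_path,
        "-w", metacfg.whitelist_path,
    ]
    for inc in metacfg.search_paths:
        argv += ["-I", inc]
    argv += ["-s", metacfg.src_dir]
    # shlex.join quotes only the arguments that need it.
    return shlex.join(argv)


# Hypothetical meta-config; one path contains a space to exercise quoting.
demo_cfg = SimpleNamespace(
    train_script_path="../../scripts/train_script.py",
    whitelist_path="./my templates/whitelist.yaml",
    search_paths=["../../templates", "./templates"],
    src_dir="../../src",
)

cmd = quoted_cmdline(demo_cfg)
print(cmd)
```

Because `shlex.split` reverses `shlex.join`, the resulting string can be split back into the exact argument list, which makes the quoting easy to check.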

#### Generate Bash Script

This writes a shell script that invokes the training script with the arguments for this project.

```bash
./train.sh path/to/experiment.yaml
```

If 'cuda_devices' is not None, the script sets `CUDA_VISIBLE_DEVICES`, which restricts execution to a subset of the available GPUs.
```python
# Restrict training to GPUs 0 and 1
make_bash_script(metacfg, cuda_devices="0,1")
```

In [21]:
make_bash_script(metacfg, cuda_devices="0")

# Read back to verify
with open('train.sh', 'r') as f:
    print(f.read())

#!/bin/bash
CUDA_VISIBLE_DEVICES='0' torchrun --standalone --nproc-per-node 'gpu' '../../scripts/train_script.py' -w './templates/whitelist.yaml' -I '.' -I './templates' -I '../../templates'  -s '../../src' "${@}"
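
Besides reading the script back, it can be useful to confirm that the permission bits were applied. The sketch below writes a throwaway script to a temporary directory and inspects its mode; `write_executable` is a hypothetical helper, and the check assumes a POSIX filesystem.

```python
import os
import stat
import tempfile


def write_executable(path, text):
    """Write `text` to `path` and make it readable, writable, and executable by the owner."""
    with open(path, "w") as f:
        f.write(text)
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR)


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "train.sh")
    write_executable(script, "#!/bin/bash\necho ok\n")
    # S_IMODE strips the file-type bits, leaving only the permission bits.
    mode = stat.S_IMODE(os.stat(script).st_mode)

print(oct(mode))
```

The owner-only mode means other users on a shared machine cannot read or run the generated script.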



#### Run Script from Notebook

In [None]:
!{train_cmdline(metacfg, cuda_devices="0,1")} '{experiment_path}'

### View in Tensorboard

In [11]:
!tensorboard --bind_all --logdir "{metacfg.model_dir}"

TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

TensorBoard 2.16.2 at http://hal9000:6006/ (Press CTRL+C to quit)
^C


### Cleanup

Delete all of the output models produced by the demo and start over.

In [24]:
print(f"Removing '{metacfg.model_dir}'")
shutil.rmtree(metacfg.model_dir, ignore_errors=True)

Removing './output_models'
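
Because the cleanup cell passes `ignore_errors=True`, it is safe to run even when the output directory does not exist yet. A small sketch demonstrating this on a temporary directory; the directory names are illustrative only.

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
out = os.path.join(root, "output_models")
os.makedirs(os.path.join(out, "run_0"))

# First call removes the directory tree.
shutil.rmtree(out, ignore_errors=True)
exists_after_first = os.path.exists(out)

# Second call finds nothing to remove and does not raise.
shutil.rmtree(out, ignore_errors=True)

# Remove the temporary root as well.
shutil.rmtree(root, ignore_errors=True)
print(exists_after_first)
```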
