# Project

[forgather/config.py](../../forgather/config.py)  
[forgather/latent.py](../../forgather/latent.py)  
[forgather/dynamic.py](../../forgather/dynamic.py)  

**Experiment Definitions**  
[meta_config.yaml](meta_config.yaml)

[templates/experiment.yaml](templates/experiment.yaml)  
[templates/paths.yaml](templates/paths.yaml)  
[templates/whitelist.yaml](templates/whitelist.yaml)  
[templates/trainer_config.yaml](templates/trainer_config.yaml)    

[templates/common/whitelist.yaml](../../templates/common/whitelist.yaml)  
[templates/common/defaults.yaml](../../templates/common/defaults.yaml)  
[templates/common/helpers.yaml](../../templates/common/helpers.yaml)  
[templates/common/trainer/base_trainer.yaml](../../templates/common/trainer/base_trainer.yaml)  
[templates/common/causal_lm/base_train.yaml](../../templates/common/causal_lm/base_train.yaml)  
[model_zoo/models/vanilla_transformer/vanilla_transformer.yaml](../../model_zoo/models/vanilla_transformer/vanilla_transformer.yaml)  
[model_zoo/models/model_zoo_whitelist.yaml](../../model_zoo/models/model_zoo_whitelist.yaml)  

**Model Code**  

[model_zoo/models/vanilla_transformer/vanilla_transformer.py](../../model_zoo/models/vanilla_transformer/vanilla_transformer.py)  



## Setup

In [1]:
import sys, os
modules_path = os.path.join('..', '..')
if modules_path not in sys.path: sys.path.insert(0, modules_path)
import shutil

from pprint import pformat, pp
from transformers import set_seed

from forgather.config import (
    preprocess_config,
    load_config,
    load_whitelist_as_set,
    materialize_config,
    enumerate_whitelist_exceptions,
    pconfig,
)
from forgather import Latent
from aiws.dotdict import DotDict

# Path to your project meta-config
meta_config_path = 'meta_config.yaml'

# Path to an experiment config to run.
experiment_path = os.path.join("templates", 'experiment.yaml')

print(f'Meta Config: {meta_config_path}')
print(f'Experiment: {experiment_path}')
print('*' * 40)
metacfg = DotDict(load_config(meta_config_path).config)
pconfig(metacfg)

Meta Config: meta_config.yaml
Experiment: templates/experiment.yaml
****************************************
project_templates: 'templates'
templates: '../../templates'
tokenizer_dir: '../../tokenizers'
datasets_dir: '../../datasets'
assets_dir: '../..'
search_paths:
  - 'templates'
  - '../../templates'
  - '../../model_zoo'
whitelist_path: 'templates/whitelist.yaml'
model_src_dir: '../../model_zoo'
script_dir: '../../scripts'
train_script_path: '../../scripts/train_script.py'
models_dir: 'output_models'
dataset_id: 'roneneldan/TinyStories'
tokenizer_def: '../../templates/common/tokenizers/tiny_2k_bpe.yaml'
tokenizer_path: '../../tokenizers/tiny_stories_2k'
tokenizers_whitelist: '../../templates/common/tokenizers/whitelist.yaml'


### Preprocess Configuration

#### preprocess_config() : Preprocess a configuration file.

```python
def preprocess_config(
    config:  os.PathLike | str, *,
    search_path: str | List[str] = '.',
    load_method: LoadMethod = DEFAULT_LOAD_METHOD,
    **kwargs,
) -> str:
```
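
The `search_path` argument takes one directory or a list of directories. As a rough illustration of how a list of search paths is conventionally resolved, the sketch below returns the first directory that contains the requested file. `resolve_template` is a hypothetical helper written for this example; it is not part of forgather, and the library's actual lookup may differ.

```python
import os
import tempfile
from typing import List, Optional


def resolve_template(name: str, search_path: List[str]) -> Optional[str]:
    """Return the first existing file called `name` under `search_path`, else None."""
    for directory in search_path:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


# Two throw-away directories stand in for a project and a shared template directory.
with tempfile.TemporaryDirectory() as project_dir, tempfile.TemporaryDirectory() as common_dir:
    # Both directories provide whitelist.yaml; only the shared one has defaults.yaml.
    for directory in (project_dir, common_dir):
        with open(os.path.join(directory, "whitelist.yaml"), "w") as f:
            f.write("# placeholder\n")
    with open(os.path.join(common_dir, "defaults.yaml"), "w") as f:
        f.write("# placeholder\n")

    paths = [project_dir, common_dir]
    found_whitelist = resolve_template("whitelist.yaml", paths)
    found_defaults = resolve_template("defaults.yaml", paths)
    found_missing = resolve_template("missing.yaml", paths)

    # The earlier entry in the list shadows the later one.
    shadowed = os.path.dirname(found_whitelist) == project_dir
    from_common = os.path.dirname(found_defaults) == common_dir

print(shadowed, from_common, found_missing)
```

Under this assumption, listing the project's `templates` directory first in `search_paths` would let a project-level file override a shared one with the same name.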

Run the cell and click the link to open the preprocessed file in the notebook.  
[preprocessed_config.yaml](preprocessed_config.yaml)

In [8]:
# Only preprocess the experiment template
pp_config = preprocess_config(experiment_path, search_path=metacfg.search_paths)

with open('preprocessed_config.yaml', 'w') as f:
    f.write(pp_config.with_line_numbers(False))

#### Check Whitelist Requirements

To see which import-specs a configuration uses (or which are missing from the whitelist), call `enumerate_whitelist_exceptions()`.

In [4]:
enumerate_whitelist_exceptions(load_config(experiment_path, search_path=metacfg.search_paths).config)

- 'aiws.accel_trainer:AccelTrainer'
- 'aiws.construct:register_for_auto_class'
- 'transformers:DataCollatorForLanguageModeling'
- 'torch.utils.tensorboard:SummaryWriter'
- '/home/dinalt/ai_assets/aiworkshop/model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformerConfig'
- 'aiws.default_callbacks:JsonLogger'
- 'aiws.accel_trainer:AccelTrainingArguments'
- 'accelerate:DataLoaderConfiguration'
- 'aiws.tb_logger:TBLogger'
- 'transformers:AutoTokenizer.from_pretrained'
- 'forgather.construct:get_attr'
- '/home/dinalt/ai_assets/aiworkshop/model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformer'
- 'forgather.construct:get_item'
- 'datasets:load_from_disk'
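
The listing above is a set of import-spec strings in `module:name` form. A minimal sketch of the underlying idea, comparing the specs a configuration uses against a whitelist set, might look like the following. `missing_from_whitelist` is a hypothetical name introduced here; this is not how `enumerate_whitelist_exceptions()` is implemented.

```python
from typing import Iterable, List, Set


def missing_from_whitelist(used_specs: Iterable[str], whitelist: Set[str]) -> List[str]:
    """Return the import-specs that are used but not whitelisted, sorted for stable output."""
    return sorted(spec for spec in set(used_specs) if spec not in whitelist)


# Specs taken from the listing above; the whitelist is deliberately incomplete.
used = [
    "transformers:AutoTokenizer.from_pretrained",
    "datasets:load_from_disk",
    "torch.utils.tensorboard:SummaryWriter",
]
whitelist = {
    "transformers:AutoTokenizer.from_pretrained",
    "datasets:load_from_disk",
}

missing = missing_from_whitelist(used, whitelist)
print(missing)  # ['torch.utils.tensorboard:SummaryWriter']
```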


### Materialize the Configuration

#### materialize_config() : Materialize the Latent objects in the configuration
```python
def materialize_config(
    config: Any,
    whitelist: Container | os.PathLike | str = None,
    preprocess: bool = True,
    search_path: str | List[str] = '.',
    load_method: LoadMethod=DEFAULT_LOAD_METHOD,
    pp_kwargs: Dict[str, Any] = {},
    kwargs: Dict[str, Callable] = {},
) -> MaterializedOutput:
```
- config: An instantiated, but Latent, configuration; a preprocessed configuration string; or a path to a configuration file.  
- whitelist: A Container type, meaning any object that supports `str in container`.  
- preprocess: Preprocess the string or file. Only applies if input is a path or string.  
- load_method: One of "from_file", "from_string", "from_file_search"  
- search_path: A str or List\[str\] paths to search for templates; also applies to "from_file_search" load method.  
- pp_kwargs: Arguments to pass to the template, if preprocessing is to be performed.  
- kwargs: A mapping str -> Callable to substitute when materializing the final config. This allows passing already instantiated objects into the config.
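
Because the whitelist only needs to support `str in container`, it does not have to be a `set`. The sketch below defines a hypothetical `PrefixWhitelist` class that allows every import-spec from a given module. Whether `materialize_config()` accepts such an object unchanged is an assumption based on the parameter description above, and prefix matching is far more permissive than an explicit list, so treat it as an illustration of the protocol rather than a recommendation.

```python
from collections.abc import Container


class PrefixWhitelist(Container):
    """Hypothetical whitelist that accepts any import-spec starting with an allowed prefix."""

    def __init__(self, allowed_prefixes):
        # str.startswith accepts a tuple of candidate prefixes.
        self.allowed_prefixes = tuple(allowed_prefixes)

    def __contains__(self, spec):
        return isinstance(spec, str) and spec.startswith(self.allowed_prefixes)


whitelist = PrefixWhitelist(["transformers:", "datasets:"])

print("datasets:load_from_disk" in whitelist)  # True
print("os:system" in whitelist)                # False
```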

In [9]:
set_seed(42)

config_output = materialize_config(experiment_path, search_path=metacfg.search_paths)
config = DotDict(config_output.config)
#pconfig(config)

### Run Trainer

In [None]:
set_seed(42)
config.trainer.train()

### Training Loop

In [4]:
# This is the entry-point for the spawned processes.
def training_loop(meta_config, experiment_name):
    set_seed(42)
    metacfg = DotDict(load_config(meta_config).config)
    config_output = materialize_config(os.path.join(metacfg.project_templates, experiment_name),
        metacfg.whitelist_path, search_path=metacfg.search_paths)
    config = DotDict(config_output.config)
    
    # If you don't want all processes to print to the console...
    if config.trainer.accelerator.is_main_process:
        print("**** Training Started *****")
        print(f"experiment_name: {config.experiment_name}")
        print(f"experiment_description: {config.experiment_description}")
        print(f"output_dir: {config.output_dir}")
        print(f"logging_dir: {config.logging_dir}")

    # This is where the actual 'loop' is.
    metrics = config.trainer.train().metrics
    
    if config.trainer.accelerator.is_main_process:
        print("**** Training Completed *****")
        print(metrics)

    metrics = config.trainer.evaluate()

    if config.trainer.accelerator.is_main_process:
        print("**** Evaluation Completed *****")
        print(metrics)
    
    if config.save:
        config.trainer.save_model()
        if config.trainer.accelerator.is_main_process:
            print(f"Model saved to: {config.trainer.args.output_dir}")

#### Run Training Loop Directly

In [5]:
training_loop(meta_config_path, experiment_path)

**** Training Started *****
experiment_name: Multi-GPU
experiment_description: Try training on multiple GPUs
output_dir: output_models/test_model
logging_dir: output_models/test_model/runs/Multi-GPU_1720460385137788484


total_examples: 2,119,712
total_train_samples: 2,119,712
per_device_train_batch_size: 32
actual_per_device_batch_size: 32
total_train_batch_size: 32
max_steps: 2,000
total_parameters: 0.9M
trainable_parameters: 0.9M
model:
VanillaTransformer(
  (embedding): Embedding(2000, 128)
  (positional_encoder): PositionalEncoder()
  (layers): ModuleList(
    (0-1): 2 x TransformerLayer(
      (attention): MultiheadAttention(
        (query_linear): Linear(in_features=128, out_features=128, bias=True)
        (key_linear): Linear(in_features=128, out_features=128, bias=True)
        (value_linear): Linear(in_features=128, out_features=128, bias=True)
      )
      (feedforward): FeedforwardLayer(
        (linear1): Linear(in_features=128, out_features=512, bias=True)
        (activation): ReLU()
        (linear2): Linear(in_features=512, out_features=128, bias=True)
      )
      (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((128,), eps=1e-05, elementwise

2024-07-08 17:39:49          200  0.0   eval-loss:  3.90324   
2024-07-08 17:39:50          300  0.0   train-loss: 3.74679   learning-rate: 1.00e-03
2024-07-08 17:39:51          400  0.01  train-loss: 3.62816   learning-rate: 1.00e-03


2024-07-08 17:39:52          400  0.01  eval-loss:  3.56554   
2024-07-08 17:39:53          500  0.01  train-loss: 3.45819   learning-rate: 1.00e-03
2024-07-08 17:39:55          600  0.01  train-loss: 3.38539   learning-rate: 1.00e-03


2024-07-08 17:39:55          600  0.01  eval-loss:  3.36802   
2024-07-08 17:39:56          700  0.01  train-loss: 3.28423   learning-rate: 1.00e-03
2024-07-08 17:39:58          800  0.01  train-loss: 3.28043   learning-rate: 1.00e-03


2024-07-08 17:39:58          800  0.01  eval-loss:  3.22433   
2024-07-08 17:39:59          900  0.01  train-loss: 3.16703   learning-rate: 1.00e-03
2024-07-08 17:40:01        1,000  0.02  train-loss: 3.03671   learning-rate: 1.00e-03


2024-07-08 17:40:01        1,000  0.02  eval-loss:  3.10673   
2024-07-08 17:40:03        1,100  0.02  train-loss: 3.06074   learning-rate: 1.00e-03
2024-07-08 17:40:04        1,200  0.02  train-loss: 3.01994   learning-rate: 1.00e-03


2024-07-08 17:40:04        1,200  0.02  eval-loss:  3.0159    
2024-07-08 17:40:06        1,300  0.02  train-loss: 2.97168   learning-rate: 1.00e-03
2024-07-08 17:40:07        1,400  0.02  train-loss: 2.93187   learning-rate: 1.00e-03


2024-07-08 17:40:07        1,400  0.02  eval-loss:  2.90429   
2024-07-08 17:40:09        1,500  0.02  train-loss: 2.91681   learning-rate: 1.00e-03
2024-07-08 17:40:10        1,600  0.02  train-loss: 2.86761   learning-rate: 1.00e-03


2024-07-08 17:40:10        1,600  0.02  eval-loss:  2.82027   
2024-07-08 17:40:12        1,700  0.03  train-loss: 2.82465   learning-rate: 1.00e-03
2024-07-08 17:40:13        1,800  0.03  train-loss: 2.77523   learning-rate: 1.00e-03


2024-07-08 17:40:14        1,800  0.03  eval-loss:  2.78279   
2024-07-08 17:40:15        1,900  0.03  train-loss: 2.75825   learning-rate: 1.00e-03
2024-07-08 17:40:16        2,000  0.03  train-loss: 2.77675   learning-rate: 1.00e-03


2024-07-08 17:40:17        2,000  0.03  eval-loss:  2.7029    
train_runtime: 31.09
train_samples: 64,000
step: 2,000
train_samples_per_second: 2.059e+03
train_steps_per_second: 64.34
train_loss: 3.235
epoch: 0.03019

**** Training Completed *****
{'train_runtime': 31.086211919784546, 'train_samples': 64000, 'step': 2000, 'train_samples_per_second': 2058.791, 'train_steps_per_second': 64.337, 'train_loss': 3.2351248264312744, 'epoch': 0.030192780906085355}


2024-07-08 17:40:17        2,000  0.03  eval-loss:  2.7029    
**** Evaluation Completed *****
{'eval_loss': 2.7028963565826416}


#### Launch with Notebook Launcher

In [None]:
from accelerate import notebook_launcher

notebook_launcher(
    training_loop,
    args=(meta_config_path, experiment_path,),
    num_processes=2
)

### Train from a Training Script

In [7]:
import stat

# Build the train-script command line as a string
def train_cmdline(metacfg, nproc='gpu'):
    includes = ''.join(f"-I '{inc}' " for inc in metacfg.search_paths)
    return f"torchrun --standalone --nproc-per-node {nproc} '{metacfg.train_script_path}' -w '{metacfg.whitelist_path}' {includes} -s '{metacfg.assets_dir}'"

# Write the train-script command line to a bash script
# ./train.sh [<other-cli-args>] <experiment-config-file>
def make_bash_script(metacfg, script_path='train.sh', nproc='gpu'):
    with open(script_path, 'w') as f:
        # Forward nproc so the script honors the caller's process count
        f.write('#!/bin/bash\n' + train_cmdline(metacfg, nproc) + ' "${@}"\n')
        # Make the script readable, writable, and executable by the owner
        os.chmod(f.fileno(), stat.S_IRUSR|stat.S_IWUSR|stat.S_IXUSR)


#### Generate Bash Script

This will output a shell-script which will invoke the training script with the arguments for this project.

```bash
# Optional: Restrict the GPUs to use to a subset of those available.
export CUDA_VISIBLE_DEVICES="0,1"

./train.sh path_to_experiment.yaml
```

In [5]:
make_bash_script(metacfg)

# Read back to verify
with open('train.sh', 'r') as f:
    print(f.read())

#!/bin/bash
torchrun --standalone --nproc-per-node gpu '../scripts/train_script.py' -w 'forgather_demo/whitelist.yaml' -I 'forgather_demo' -I '../templates' -I '../model_zoo'  -s '..' "${@}"
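
The generated command wraps each path in single quotes, which breaks if a path itself contains a single quote. One alternative, sketched below, is to build the arguments as a list and let `shlex.join` do the quoting. `train_argv` and `demo_cfg` are hypothetical names for this example; the flags (`-w`, `-I`, `-s`) are copied from `train_cmdline()` above, and a `SimpleNamespace` stands in for the real meta-config.

```python
import shlex
from types import SimpleNamespace


def train_argv(metacfg, nproc="gpu"):
    """Return the torchrun invocation as a list of arguments (no shell quoting yet)."""
    argv = [
        "torchrun", "--standalone", "--nproc-per-node", str(nproc),
        metacfg.train_script_path,
        "-w", metacfg.whitelist_path,
    ]
    for inc in metacfg.search_paths:
        argv += ["-I", inc]
    argv += ["-s", metacfg.assets_dir]
    return argv


# Stand-in meta-config; one search path contains a space on purpose.
demo_cfg = SimpleNamespace(
    train_script_path="../../scripts/train_script.py",
    whitelist_path="templates/whitelist.yaml",
    search_paths=["templates", "my templates"],
    assets_dir="../..",
)

# shlex.join quotes only the arguments that need it.
cmdline = shlex.join(train_argv(demo_cfg, nproc=2))
print(cmdline)
```

Splitting the result with `shlex.split` gives back the original argument list, which makes the quoting easy to check.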



#### Run Training Script from Notebook

In [None]:
# By default, this will run on all available GPUs. To restrict it to a subset, you can use this environment variable.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

!{train_cmdline(metacfg)} 'forgather_demo/hf_trainer_experiment.yaml'
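
Setting `os.environ` changes the variable for the rest of the notebook session. If that is not wanted, one option is to pass a modified copy of the environment to the child process only. The sketch below assumes this approach; `run_with_visible_gpus` is a hypothetical helper, and a small Python one-liner stands in for the real training command so the example runs anywhere.

```python
import os
import subprocess
import sys


def run_with_visible_gpus(argv, visible_devices):
    """Run `argv` with CUDA_VISIBLE_DEVICES set for the child process only."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible_devices)
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Stand-in for the training command: the child just echoes the variable it received.
result = run_with_visible_gpus(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    "0,1",
)
print(result.stdout.strip())  # 0,1
```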

### View in Tensorboard

In [11]:
!tensorboard --bind_all --logdir output_models/test_model/runs/

TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

TensorBoard 2.16.2 at http://hal9000:6006/ (Press CTRL+C to quit)
^C


### Cleanup

Delete all of the output models produced by the demo and start over.

In [25]:
print(f"Removing '{metacfg.models_dir}'")
shutil.rmtree(metacfg.models_dir, ignore_errors=True)

Removing 'forgather_demo/output_models'
