# Project

[forgather/config.py](../../forgather/config.py)  
[forgather/latent.py](../../forgather/latent.py)  
[forgather/dynamic.py](../../forgather/dynamic.py)  

**Experiments Definitions**  
[meta_config.yaml](meta_config.yaml)

[templates/experiment.yaml](forgather_demo/experiment%201.yaml)  
[templates/paths.yaml](templates/paths.yaml)  
[templates/whitelist.yaml](templates/whitelist.yaml)  
[templates/trainer_config.yaml](templates/trainer_config.yaml)    

[templates/common/whitelist.yaml](../../templates/common/whitelist.yaml)  
[templates/common/defaults.yaml](../../templates/common/defaults.yaml)  
[templates/common/helpers.yaml](../../templates/common/helpers.yaml)  
[templates/common/trainer/base_trainer.yaml](../../templates/common/trainer/base_trainer.yaml)  
[templates/common/causal_lm/base_train.yaml](../../templates/common/causal_lm/base_train.yaml)  
[model_zoo/models/vanilla_transformer/vanilla_transformer.yaml](../../model_zoo/models/vanilla_transformer/vanilla_transformer.yaml)  
[model_zoo/models/model_zoo_whitelist.yaml](../../model_zoo/models/model_zoo_whitelist.yaml)  

**Model Code**  

[model_zoo/models/vanilla_transformer/vanilla_transformer.py](../../model_zoo/models/vanilla_transformer/vanilla_transformer.py)  



## Setup

In [1]:
import sys, os
modules_path = os.path.join('..', '..')
if modules_path not in sys.path: sys.path.insert(0, modules_path)
import shutil

from pprint import pformat, pp
from transformers import set_seed

from forgather.config import (
    preprocess_config,
    load_config,
    load_whitelist_as_set,
    materialize_config,
    enumerate_whitelist_exceptions,
    pconfig,
)
from forgather import Latent
from aiws.dotdict import DotDict

# Path to your project meta-config
meta_config_path = 'meta_config.yaml'
# Name of the experiment to run; its config path is resolved below.
experiment_name = 'example'

metacfg = DotDict(load_config(meta_config_path).config)
pconfig(metacfg)

# Get path to selected experiment
experiment_path = os.path.join(metacfg.experiment_dir, experiment_name, 'experiment.yaml')

print('*' * 40)
print(f'Meta Config: {meta_config_path}')
print(f'Experiment Path: {experiment_path}')

def preprocess():
    return preprocess_config(experiment_path, search_path=metacfg.search_paths)

assets_dir: '../..'
datasets_dir: '../../datasets'
experiment_dir: './templates/experiments/'
model_dir: './output_models'
model_src_dir: '../../model_zoo'
project_dir: '.'
project_templates: './templates'
scripts_dir: '../../scripts'
search_paths:
  - '../../templates'
  - './templates'
templates: '../../templates'
tokenizer_dir: '../../tokenizers'
train_script_path: '../../scripts/train_script.py'
whitelist_path: './templates/whitelist.yaml'
****************************************
Meta Config: meta_config.yaml
Experiment Path: ./templates/experiments/example/experiment.yaml


### Preprocess Configuration

#### preprocess_config() : Preprocess a configuration file.

```python
def preprocess_config(
    config:  os.PathLike | str, *,
    search_path: str | List[str] = '.',
    load_method: LoadMethod = DEFAULT_LOAD_METHOD,
    **kwargs,
) -> str:
```
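
Although the signature is annotated `-> str`, the cells in this section call `.with_line_numbers()` on the result, so the returned object presumably behaves like a string with one extra method. A minimal sketch of such a type follows. `NumberedText` is a hypothetical name, and the 6-character number column is inferred from the printed output, not taken from forgather's source.

```python
class NumberedText(str):
    """Hypothetical str subclass; not forgather's actual return type."""

    def with_line_numbers(self, show: bool = True) -> str:
        # With show=False, return the plain text (useful when writing to a file).
        if not show:
            return str(self)
        # Right-align each 1-based line number in a 6-character column.
        return "\n".join(
            f"{number:>6}: {line}"
            for number, line in enumerate(self.splitlines(), start=1)
        )


text = NumberedText("# Experiment\n\nmodel: demo")
print(text.with_line_numbers())
```

Subclassing `str` keeps the result usable anywhere a plain string is expected, while the numbered view stays available for debugging template errors by line.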

In [2]:
print(preprocess().with_line_numbers())

     1: ############## Experiment ##############
     2: 
     3: # The latest example
     4: # 2024-07-10 04:29:25
     5: # Description: It's not supid, it's advanced!
     6: # Model: anonymous_model
     7: # World Size: 1
     8: # Hostname: hal9000
     9: # Script Args: N/A
    10: 
    11: ############# Config Vars ##############
    12: 
    13: # ns.TOKENIZERS_DIR: "../../tokenizers"
    14: # ns.MODELS_DIR: "./output_models"
    15: # ns.DATASETS_DIR: "../../datasets"
    16: # ns.SCRIPTS_DIR: "../../scripts"
    17: # ns.MODEL_SOURCE_DIR: "../../model_zoo"
    18: # ns.OUTPUT_DIR: "./output_models/anonymous_model"
    19: # ns.LOGGING_DIR: path = "./output_models/anonymous_model/runs/The latest example_1720585765000114070"
    20: # ns.CREATE_NEW_MODEL: path = False
    21: # ns.SAVE_MODEL: path = True
    22: 
    23: ####### Additional Dependencies ########
    24: 
    25: # Experiment tracking: Tensorboard SummaryWriter
    26: .define: &summary_writer !callable:torch.

Run the cell below, then click the link to open the preprocessed file in the notebook.  
[preprocessed_config.yaml](preprocessed_config.yaml)

In [None]:
# Only preprocess the experiment template
pp_config = preprocess()

with open('preprocessed_config.yaml', 'w') as f:
    f.write(pp_config.with_line_numbers(False))

### Preprocess and Load the Whitelist
```python
def load_whitelist_as_set(
    config: os.PathLike | str, *,
    preprocess: bool = True,
    search_path: str | List[str] = '.',
    load_method: LoadMethod = DEFAULT_LOAD_METHOD
) -> Set[str]:
```
Load a whitelist configuration from a file or string.

This is essentially `load_config`, but it also normalizes the paths in the whitelist and converts the list to a set, which makes membership checks faster.
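
The normalize-then-set idea can be sketched with the standard library alone. This assumes import-specs have the `module_or_path:Name` shape seen in the whitelist output; `normalize_spec` and `to_whitelist_set` are hypothetical helpers, not forgather functions, and the real normalization rules may differ.

```python
import os


def normalize_spec(spec: str, base_dir: str = ".") -> str:
    """Make the file part of a 'path.py:Name' spec absolute; keep 'module:Name' as-is."""
    target, sep, name = spec.rpartition(":")
    if not sep:
        raise ValueError(f"expected 'module_or_path:Name', got {spec!r}")
    if target.endswith(".py"):
        # abspath also collapses './' and '../' segments.
        target = os.path.abspath(os.path.join(base_dir, target))
    return f"{target}:{name}"


def to_whitelist_set(specs, base_dir="."):
    # A set gives average O(1) membership tests and drops duplicates.
    return {normalize_spec(spec, base_dir) for spec in specs}


whitelist = to_whitelist_set(
    [
        "datasets:load_dataset",
        "./model_zoo/../model_zoo/vanilla_transformer.py:VanillaTransformer",
        "model_zoo/vanilla_transformer.py:VanillaTransformer",
    ],
    base_dir="/tmp/project",
)
print(len(whitelist))
```

The two file-based entries spell the same file differently; after normalization they collapse into a single set member, which is why normalizing before the lookup matters.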

In [10]:
whitelist_out = load_whitelist_as_set(metacfg.whitelist_path, search_path=metacfg.search_paths)
pconfig(whitelist_out)

config:
  - '/home/dinalt/ai_assets/aiworkshop/model_zoo/attention_only/attention_only.py:TransformerConfig'
  - '/home/dinalt/ai_assets/aiworkshop/model_zoo/attention_only/attention_only.py:TransformerModel'
  - '/home/dinalt/ai_assets/aiworkshop/model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformer'
  - '/home/dinalt/ai_assets/aiworkshop/model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformerConfig'
  - 'accelerate:DataLoaderConfiguration'
  - 'aiws.accel_trainer:AccelTrainer'
  - 'aiws.accel_trainer:AccelTrainingArguments'
  - 'aiws.construct:register_for_auto_class'
  - 'aiws.default_callbacks:InfoCallback'
  - 'aiws.default_callbacks:JsonLogger'
  - 'aiws.default_callbacks:ProgressCallback'
  - 'aiws.tb_logger:TBLogger'
  - 'aiws.trainer:Trainer'
  - 'aiws.trainer_types:TrainingArguments'
  - 'datasets:load_dataset'
  - 'datasets:load_from_disk'
  - 'forgather.construct:flatten'
  - 'forgather.construct:get_attr'
  - 'forgather.construct:get_item'

#### Check Whitelist Requirements

If you would like to see which import-specs a configuration uses (or which are missing from the whitelist), call `enumerate_whitelist_exceptions()`.
```python
def enumerate_whitelist_exceptions(config: Any, whitelist: Container = set())
```
Print all import-specs not matching the whitelist
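
Conceptually, this is a recursive walk over the configuration that collects every import-spec absent from the whitelist. The sketch below assumes a made-up node type, `SpecNode`, as a stand-in for however forgather represents deferred objects; `iter_specs` and `missing_specs` are likewise hypothetical names, and the example config is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Container, Iterator


@dataclass
class SpecNode:
    """Hypothetical stand-in for a deferred object that carries an import-spec."""
    import_spec: str
    args: list = field(default_factory=list)
    kwargs: dict = field(default_factory=dict)


def iter_specs(node: Any) -> Iterator[str]:
    # Yield the spec of every SpecNode found anywhere in the nested structure.
    if isinstance(node, SpecNode):
        yield node.import_spec
        yield from iter_specs(node.args)
        yield from iter_specs(node.kwargs)
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_specs(value)
    elif isinstance(node, (list, tuple)):
        for value in node:
            yield from iter_specs(value)


def missing_specs(config: Any, whitelist: Container = frozenset()) -> list:
    return sorted({spec for spec in iter_specs(config) if spec not in whitelist})


config = {
    "trainer": SpecNode(
        "aiws.trainer:Trainer",
        kwargs={"dataset": SpecNode("datasets:load_dataset", args=["roneneldan/TinyStories"])},
    ),
    "do_save": True,
}
print(missing_specs(config, {"aiws.trainer:Trainer"}))
```

Passing an empty whitelist lists every spec the configuration needs, which mirrors the commented-out `whitelist = set()` line in the cell that follows.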

In [9]:
whitelist = whitelist_out.config
#whitelist = set()
enumerate_whitelist_exceptions(load_config(experiment_path, search_path=metacfg.search_paths).config, whitelist)




### Materialize the Configuration

#### materialize_config() : Materialize the Latent objects in the configuration
```python
def materialize_config(
    config: Any,
    whitelist: Container | os.PathLike | str = None,
    preprocess: bool = True,
    search_path: str | List[str] = '.',
    load_method: LoadMethod=DEFAULT_LOAD_METHOD,
    pp_kwargs: Dict[str, Any] = {},
    kwargs: Dict[str, Callable] = {},
) -> MaterializedOutput:
```
- config: An instantiated but still Latent configuration, a preprocessed configuration string, or a path to a configuration file.  
- whitelist: A Container type, meaning any object that supports the `spec in container` membership test. Per the signature, a path to a whitelist file is also accepted.  
- preprocess: Preprocess the string or file. Only applies if input is a path or string.  
- load_method: One of "from_file", "from_string", "from_file_search"  
- search_path: A str or List\[str\] paths to search for templates; also applies to "from_file_search" load method.  
- pp_kwargs: Arguments to pass to the template, if preprocessing is to be performed.  
- kwargs: A mapping str -> Callable to substitute when materializing the final config. This allows passing already instantiated objects into the config.
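
To make "materializing a Latent object" concrete, here is a minimal sketch of deferred construction gated by a whitelist. It is not forgather's implementation: `LatentSketch` and `materialize` are hypothetical, the `module:Name` spec format is assumed from the whitelist output, and the standard-library `fractions:Fraction` stands in for a model or trainer class.

```python
import importlib
from dataclasses import dataclass, field


@dataclass
class LatentSketch:
    """Hypothetical deferred call: what to import and how to call it, not the result."""
    import_spec: str
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)


def materialize(node, whitelist):
    # Replace every LatentSketch in a nested structure with the object it describes.
    if isinstance(node, LatentSketch):
        if node.import_spec not in whitelist:
            raise PermissionError(f"{node.import_spec!r} is not whitelisted")
        module_name, _, attr = node.import_spec.partition(":")
        factory = getattr(importlib.import_module(module_name), attr)
        args = [materialize(arg, whitelist) for arg in node.args]
        kwargs = {key: materialize(value, whitelist) for key, value in node.kwargs.items()}
        return factory(*args, **kwargs)
    if isinstance(node, dict):
        return {key: materialize(value, whitelist) for key, value in node.items()}
    if isinstance(node, (list, tuple)):
        return [materialize(value, whitelist) for value in node]
    return node


config = {"ratio": LatentSketch("fractions:Fraction", (3, 4)), "name": "demo"}
result = materialize(config, whitelist={"fractions:Fraction"})
print(result["ratio"])
```

Nothing is imported or constructed until `materialize` runs, so a configuration can be loaded and inspected safely first, and only whitelisted callables are ever invoked.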

In [3]:
set_seed(42)

config_output = materialize_config(experiment_path, whitelist=metacfg.whitelist_path, search_path=metacfg.search_paths)
config = DotDict(config_output.config)
pconfig(config)

do_save: True
experiment_description: 'It's not supid, it's advanced!'
experiment_name: 'The latest example'
logging_dir: './output_models/anonymous_model/runs/The latest example_1720586293852204866'
output_dir: './output_models/anonymous_model'
trainer:
  Trainer(model=VanillaTransformer(
    (embedding): Embedding(2000, 256)
    (positional_encoder): PositionalEncoder()
    (layers): ModuleList(
      (0-1): 2 x TransformerLayer(
        (attention): MultiheadAttention(
          (query_linear): Linear(in_features=256, out_features=256, bias=True)
          (key_linear): Linear(in_features=256, out_features=256, bias=True)
          (value_linear): Linear(in_features=256, out_features=256, bias=True)
        )
        (feedforward): FeedforwardLayer(
          (linear1): Linear(in_features=256, out_features=512, bias=True)
          (activation): ReLU()
          (linear2): Linear(in_features=512, out_features=256, bias=True)
        )
        (norm1): LayerNorm((256,), eps=1e-05, el

### Run Trainer

In [4]:
set_seed(42)
config.trainer.train()

total_examples: 2,119,712
total_train_samples: 2,119,712
per_device_train_batch_size: 16
actual_per_device_batch_size: 16
total_train_batch_size: 16
max_steps: 2,000
total_parameters: 1.9M
trainable_parameters: 1.9M
model:
VanillaTransformer(
  (embedding): Embedding(2000, 256)
  (positional_encoder): PositionalEncoder()
  (layers): ModuleList(
    (0-1): 2 x TransformerLayer(
      (attention): MultiheadAttention(
        (query_linear): Linear(in_features=256, out_features=256, bias=True)
        (key_linear): Linear(in_features=256, out_features=256, bias=True)
        (value_linear): Linear(in_features=256, out_features=256, bias=True)
      )
      (feedforward): FeedforwardLayer(
        (linear1): Linear(in_features=256, out_features=512, bias=True)
        (activation): ReLU()
        (linear2): Linear(in_features=512, out_features=256, bias=True)
      )
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise

2024-07-09 23:01:06          500  0.0   eval-loss:  3.24676   
2024-07-09 23:01:10        1,000  0.01  train-loss: 2.97355   learning-rate: 1.00e-03


2024-07-09 23:01:11        1,000  0.01  eval-loss:  2.84039   
2024-07-09 23:01:15        1,500  0.01  train-loss: 2.68934   learning-rate: 1.00e-03


2024-07-09 23:01:15        1,500  0.01  eval-loss:  2.63415   
2024-07-09 23:01:19        2,000  0.02  train-loss: 2.51999   learning-rate: 1.00e-03


2024-07-09 23:01:20        2,000  0.02  eval-loss:  2.52777   
train_runtime: 18.08
train_samples: 32,000
step: 2,000
train_samples_per_second: 1.77e+03
train_steps_per_second: 110.6
train_loss: 2.997
epoch: 0.0151



TrainOutput(global_step=2000, training_loss=2.519994020462036, metrics={'train_runtime': 18.078115224838257, 'train_samples': 32000, 'step': 2000, 'train_samples_per_second': 1770.096, 'train_steps_per_second': 110.631, 'train_loss': 2.997377872467041, 'epoch': 0.015096390453042677})
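
The summary figures follow directly from the run parameters reported in the output. A quick check, using only numbers copied from this cell's output:

```python
# Values copied from the trainer output above.
total_train_samples = 2_119_712
total_train_batch_size = 16
max_steps = 2_000
train_runtime = 18.078115224838257  # seconds

# Samples seen = optimizer steps x samples per step.
train_samples = max_steps * total_train_batch_size
# Fraction of one pass over the training set.
epoch = train_samples / total_train_samples
samples_per_second = train_samples / train_runtime
steps_per_second = max_steps / train_runtime

print(train_samples)
print(f"{epoch:.4f}")
print(f"{samples_per_second:.3f}")
print(f"{steps_per_second:.3f}")
```

This reproduces `train_samples: 32,000` and `epoch: 0.0151`, and explains why the epoch column in the log barely moves: 2,000 steps at batch size 16 cover about 1.5% of the dataset.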

### Training Loop

In [4]:
from accelerate import notebook_launcher

# This is the entry point for the spawned processes.
def training_loop(meta_config, experiment_name):
    set_seed(42)
    metacfg = DotDict(load_config(meta_config).config)

    # Get Torch Distributed parameters from environ.
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    rank = int(os.environ.get('RANK', 0))
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    
    config_output = materialize_config(
        experiment_name,
        metacfg.whitelist_path,
        search_path=metacfg.search_paths,
        pp_kwargs = dict(
            world_size=world_size,
            rank=rank,
            local_rank=local_rank,
        )
    )
    config = DotDict(config_output.config)
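    # Assumes config.trainer.accelerator, when present, is an accelerate.Accelerator.
    accelerator = getattr(config.trainer, "accelerator", None)
    is_main_process = accelerator.is_main_process if accelerator is not None else True
    # If you don't want all processes to print to the console...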
    if is_main_process:
        print("**** Training Started *****")
        print(f"experiment_name: {config.experiment_name}")
        print(f"experiment_description: {config.experiment_description}")
        print(f"output_dir: {config.output_dir}")
        print(f"logging_dir: {config.logging_dir}")

    # This is where the actual 'loop' is.
    metrics = config.trainer.train().metrics
    
    if is_main_process:
        print("**** Training Completed *****")
        print(metrics)

    metrics = config.trainer.evaluate()

    if is_main_process:
        print("**** Evaluation Completed *****")
        print(metrics)
    
    if config.do_save:
        config.trainer.save_model()
        if is_main_process:
            print(f"Model saved to: {config.trainer.args.output_dir}")
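
The environment lookups at the top of `training_loop` can be factored into a small helper. The sketch below is one possible shape, assuming the launcher sets `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` as `torchrun` does; `DistEnv` and `read_dist_env` are hypothetical names, not part of forgather or accelerate.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical container for the distributed settings read from the environment."""
    world_size: int = 1
    rank: int = 0
    local_rank: int = 0

    @property
    def is_main_process(self) -> bool:
        # Convention: global rank 0 does the console printing.
        return self.rank == 0


def read_dist_env(environ: Mapping[str, str] = os.environ) -> DistEnv:
    # Fall back to single-process defaults when the variables are absent.
    return DistEnv(
        world_size=int(environ.get("WORLD_SIZE", 1)),
        rank=int(environ.get("RANK", 0)),
        local_rank=int(environ.get("LOCAL_RANK", 0)),
    )


single = read_dist_env({})
worker = read_dist_env({"WORLD_SIZE": "2", "RANK": "1", "LOCAL_RANK": "1"})
print(single, worker.is_main_process)
```

Accepting the environment as a parameter keeps the helper testable without mutating `os.environ`, and the defaults match the single-process case used when the loop is called directly.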

#### Run Training Loop Directly

In [5]:
training_loop(meta_config_path, experiment_path)

**** Training Started *****
experiment_name: The latest example
experiment_description: It's not supid, it's advanced!
output_dir: ./output_models/anonymous_model
logging_dir: ./output_models/anonymous_model/runs/The latest example_1720586319070237488


total_examples: 2,119,712
total_train_samples: 2,119,712
per_device_train_batch_size: 16
actual_per_device_batch_size: 16
total_train_batch_size: 16
max_steps: 2,000
total_parameters: 1.9M
trainable_parameters: 1.9M
model:
VanillaTransformer(
  (embedding): Embedding(2000, 256)
  (positional_encoder): PositionalEncoder()
  (layers): ModuleList(
    (0-1): 2 x TransformerLayer(
      (attention): MultiheadAttention(
        (query_linear): Linear(in_features=256, out_features=256, bias=True)
        (key_linear): Linear(in_features=256, out_features=256, bias=True)
        (value_linear): Linear(in_features=256, out_features=256, bias=True)
      )
      (feedforward): FeedforwardLayer(
        (linear1): Linear(in_features=256, out_features=512, bias=True)
        (activation): ReLU()
        (linear2): Linear(in_features=512, out_features=256, bias=True)
      )
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise

2024-07-10 04:38:43          500  0.0   eval-loss:  3.24676   
2024-07-10 04:38:47        1,000  0.01  train-loss: 2.97355   learning-rate: 1.00e-03


2024-07-10 04:38:48        1,000  0.01  eval-loss:  2.84039   
2024-07-10 04:38:51        1,500  0.01  train-loss: 2.68934   learning-rate: 1.00e-03


2024-07-10 04:38:52        1,500  0.01  eval-loss:  2.63415   
2024-07-10 04:38:56        2,000  0.02  train-loss: 2.51999   learning-rate: 1.00e-03


2024-07-10 04:38:56        2,000  0.02  eval-loss:  2.52777   
train_runtime: 17.29
train_samples: 32,000
step: 2,000
train_samples_per_second: 1.851e+03
train_steps_per_second: 115.7
train_loss: 2.997
epoch: 0.0151

**** Training Completed *****
{'train_runtime': 17.29047727584839, 'train_samples': 32000, 'step': 2000, 'train_samples_per_second': 1850.73, 'train_steps_per_second': 115.671, 'train_loss': 2.997377872467041, 'epoch': 0.015096390453042677}


2024-07-10 04:38:56        2,000  0.02  eval-loss:  2.52777   
**** Evaluation Completed *****
{'eval_loss': 2.527772903442383}
[2024-07-10 04:38:57,086] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Model saved to: ./output_models/anonymous_model


#### Launch with Notebook Launcher

In [None]:
notebook_launcher(
    training_loop,
    args=(meta_config_path, experiment_path,),
    num_processes=2
)

### Train from a Training Script

In [7]:
import stat

# Build the train-script command line as a string
def train_cmdline(metacfg, nproc='gpu'):
    includes = ''.join(f"-I '{inc}' " for inc in metacfg.search_paths)
    return f"torchrun --standalone --nproc-per-node {nproc} '{metacfg.train_script_path}' -w '{metacfg.whitelist_path}' {includes} -s '{metacfg.assets_dir}'"

# Write the train-script command line to a bash script
# ./train.sh [<other-cli-args>] <experiment-config-file>
def make_bash_script(metacfg, script_path='train.sh', nproc='gpu'):
    with open(script_path, 'w') as f:
        f.write('#!/bin/bash\n' + train_cmdline(metacfg, nproc) + ' "${@}"\n')
        # Make the script readable, writable, and executable by its owner
        os.chmod(f.fileno(), stat.S_IRUSR|stat.S_IWUSR|stat.S_IXUSR)

project_templates: 'templates'
templates: '../../templates'
tokenizer_dir: '../../tokenizers'
datasets_dir: '../../datasets'
assets_dir: '../..'
search_paths:
  - 'templates'
  - '../../templates'
  - '../../model_zoo'
whitelist_path: 'templates/whitelist.yaml'
model_src_dir: '../../model_zoo'
script_dir: '../../scripts'
train_script_path: '../../scripts/train_script.py'
models_dir: 'output_models'
dataset_id: 'roneneldan/TinyStories'
tokenizer_def: '../../templates/common/tokenizers/tiny_2k_bpe.yaml'
tokenizer_path: '../../tokenizers/tiny_stories_2k'
tokenizers_whitelist: '../../templates/common/tokenizers/whitelist.yaml'


#### Generate Bash Script

This will output a shell-script which will invoke the training script with the arguments for this project.

```bash
# Optional: Restrict the GPUs to use to a subset of those available.
export CUDA_VISIBLE_DEVICES="0,1"

./train.sh path_to_experiment.yaml
```

In [5]:
make_bash_script(metacfg)

# Read back to verify
with open('train.sh', 'r') as f:
    print(f.read())

#!/bin/bash
torchrun --standalone --nproc-per-node gpu '../scripts/train_script.py' -w 'forgather_demo/whitelist.yaml' -I 'forgather_demo' -I '../templates' -I '../model_zoo'  -s '..' "${@}"
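
Wrapping each path in single quotes by hand breaks if a path itself contains a quote. As an alternative sketch, the command can be assembled as an argument list and quoted with the standard library's `shlex.join`. `build_train_cmd` is a hypothetical helper; the `torchrun` flags and the `-w`, `-I`, `-s` options are copied from the command shown above, and the path with a space is invented to show the quoting.

```python
import shlex


def build_train_cmd(train_script, whitelist, search_paths, assets_dir, nproc="gpu"):
    # Build argv as a list first; quoting is applied once, at the end.
    argv = ["torchrun", "--standalone", "--nproc-per-node", str(nproc),
            train_script, "-w", whitelist]
    for include in search_paths:
        argv += ["-I", include]
    argv += ["-s", assets_dir]
    # shlex.join quotes only the arguments that need it.
    return shlex.join(argv)


cmd = build_train_cmd(
    "../../scripts/train_script.py",
    "templates/whitelist.yaml",
    ["templates", "../../my templates"],
    "../..",
)
print(cmd)
```

Because the result round-trips through `shlex.split`, the same argument list can also be handed to `subprocess.run` directly, without going through a shell at all.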



#### Run Training Script from Notebook

In [None]:
# By default, this will run on all available GPUs. To restrict it to a subset, you can use this environment variable.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

!{train_cmdline(metacfg)} 'forgather_demo/hf_trainer_experiment.yaml'

### View in Tensorboard

In [11]:
!tensorboard --bind_all --logdir output_models/test_model/runs/

TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

TensorBoard 2.16.2 at http://hal9000:6006/ (Press CTRL+C to quit)
^C


### Cleanup

Delete all of the output models produced by the demo and start over.

In [25]:
print(f"Removing '{metacfg.models_dir}'")
shutil.rmtree(metacfg.models_dir, ignore_errors=True)

Removing 'forgather_demo/output_models'
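
Since `shutil.rmtree` with `ignore_errors=True` deletes silently, a misconfigured `models_dir` could remove the wrong directory. One optional safeguard, sketched below, is to refuse to delete anything that does not resolve to a location strictly inside the project directory. `remove_inside` is a hypothetical helper, and the temporary directory only stands in for a project.

```python
import os
import shutil
import tempfile


def remove_inside(target, root):
    """Delete target only if it resolves to a path strictly inside root."""
    root_abs = os.path.realpath(root)
    target_abs = os.path.realpath(target)
    inside = os.path.commonpath([root_abs, target_abs]) == root_abs
    if target_abs == root_abs or not inside:
        raise ValueError(f"refusing to remove {target_abs!r}: not inside {root_abs!r}")
    shutil.rmtree(target_abs, ignore_errors=True)


# Demonstration in a throwaway directory standing in for the project.
project = tempfile.mkdtemp()
models = os.path.join(project, "output_models")
os.makedirs(os.path.join(models, "runs"))
remove_inside(models, project)
removed = not os.path.exists(models)
shutil.rmtree(project, ignore_errors=True)
print(removed)
```

In the notebook, the equivalent call would pass `metacfg.models_dir` as the target and `metacfg.project_dir` as the root, assuming both keys are present in your meta-config.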
