# Training Notebook
Configuration details: [Configuration Notebook](project_config.ipynb)

Run project training configurations and generate training scripts from project meta-data.

## Setup

In [1]:
import os

projects_directory = "/home/dinalt/ai_assets/forgather/tutorials/project_delta"
config_template = ""

# Path to training script to use.
train_script_path = os.path.join('..', 'scripts', 'train_script.py')

## Project Info

In [2]:
import sys, os
modules_path = os.path.join('..', 'src')
if modules_path not in sys.path: sys.path.insert(0, modules_path)

from pprint import pp, pformat
from IPython import display as ds
from forgather import Project
import forgather.nb.notebooks as nb

# Load the project
proj = Project(projects_directory, config_template)

# Show project info
md = ""
md += nb.render_project_readme(proj.project_dir)
md += nb.render_meta(proj.meta, "### Meta Config\n")
md += nb.render_template_list(proj.meta.find_templates(proj.meta.config_prefix), "### Available Configurations\n")

# Only construct the meta object
config_meta = proj.config.meta()
md += f"### {config_meta['config_name']}:\n\n"
md += nb.render_codeblock("python", pformat(config_meta))
display(ds.Markdown(md))

# Tiny Generative Language Model

This project demonstrates how to use the templates library to construct a tiny causal transformer.

Most of the remaining examples use the Tiny Stories dataset, as it allows one to quickly train a relatively small transformer model (< 10M parameters) that can generate relatively coherent text.

- Dataset: datasets/tiny/tiny_stories_abridged.yaml
    - Dataset ID: roneneldan/TinyStories
    - Reference: https://arxiv.org/abs/2305.07759

Unlike the previous examples, the project meta-config now makes use of the templates library, thus many more templates are now available to the project.

The project configuration itself is now derived from a common "Tiny Experiments" project template, which defines defaults for a number of projects with similar setups. See "projects/tiny.yaml".

---

### Meta Config
Meta Config: [/home/dinalt/ai_assets/forgather/tutorials/project_delta/meta.yaml](../tutorials/project_delta/meta.yaml)

- [meta.yaml](../tutorials/project_delta/meta.yaml)

Template Search Paths:
- [/home/dinalt/ai_assets/forgather/tutorials/project_delta/templates](../tutorials/project_delta/templates)
- [/home/dinalt/ai_assets/forgather/templates/tiny_experiments](../templates/tiny_experiments)
- [/home/dinalt/ai_assets/forgather/templates/modellib](../templates/modellib)
- [/home/dinalt/ai_assets/forgather/templates/base](../templates/base)

### Available Configurations
- [tiny_causal.yaml](../tutorials/project_delta/templates/configs/tiny_causal.yaml)

### Tiny Causal:

```python
{'config_description': 'A tiny causal transformer.',
 'config_name': 'Tiny Causal',
 'create_new_model': 'True',
 'datasets_dir': '/home/dinalt/ai_assets/forgather/datasets',
 'eval': 'False',
 'logging_dir': '/home/dinalt/ai_assets/forgather/tutorials/project_delta/output_models/tiny_causal/runs/log_2024-08-12T07-03-35',
 'model_src_dir': '/home/dinalt/ai_assets/forgather/model_src',
 'models_dir': '/home/dinalt/ai_assets/forgather/tutorials/project_delta/output_models',
 'output_dir': '/home/dinalt/ai_assets/forgather/tutorials/project_delta/output_models/tiny_causal',
 'project_dir': '/home/dinalt/ai_assets/forgather/tutorials/project_delta',
 'save_model': 'False',
 'tokenizers_dir': '/home/dinalt/ai_assets/forgather/tokenizers',
 'train': 'True'}

```
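
Note that the flags in the meta dictionary above ('train', 'eval', 'save_model', 'create_new_model') are rendered as the strings 'True' and 'False', not as booleans. As a minimal sketch, assuming a consumer wants real booleans, a helper like the hypothetical `as_bool` below avoids the pitfall that `bool('False')` evaluates to `True`. Both `as_bool` and `config_meta_sample` are illustrative names, not part of forgather.

```python
# Hypothetical helper for illustration; not part of forgather.
def as_bool(value):
    """Convert 'True'/'False' style strings (or real bools) to bool."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise ValueError(f"Not a boolean flag: {value!r}")

# A subset of the meta dictionary shown above.
config_meta_sample = {
    "create_new_model": "True",
    "eval": "False",
    "save_model": "False",
    "train": "True",
}

# Any non-empty string is truthy, so bool() alone is not enough.
print(bool("False"))  # True

flags = {key: as_bool(value) for key, value in config_meta_sample.items()}
print(flags)
```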



### Launch Notebook Trainer

In [None]:
from accelerate import notebook_launcher
from forgather.ml.training_script import training_loop

notebook_launcher(
    training_loop,
    args=(proj.project_dir, proj.config_name),
    num_processes=1
)

Launching training on one GPU.


Repo card metadata block was not found. Setting CardData to empty.


**** Training Script Started *****
config_name: Tiny Causal
config_description: A tiny causal transformer.
output_dir: /home/dinalt/ai_assets/forgather/tutorials/project_delta/output_models/tiny_causal
logging_dir: /home/dinalt/ai_assets/forgather/tutorials/project_delta/output_models/tiny_causal/runs/log_2024-08-12T07-03-55
not compiling model


total_examples: 212,000
total_train_samples: 212,000
per_device_train_batch_size: 32
actual_per_device_batch_size: 32
total_train_batch_size: 32
max_steps: 6,625
total_parameters: 4.2M
trainable_parameters: 4.2M
model:
DynamicCasualLM(
  (causal_lm): CasualLM(
    loss_fn=CausalLoss(), init_weights=InitWeights(std=0.02)
    (input_encoder): InputEncoder(
      d_model=256, vocab_size=2000, embedding_scale=16.0
      (dropout): Identity()
      (embedding): Embedding(2000, 256)
      (positional_encoder): SinusoidalPE(d_model=256, max_sequence_length=2048)
    )
    (output_decoder): Linear(in_features=256, out_features=2000, bias=True)
    (layer_stack): LayerStack(
      (layers): ModuleList(
        (0-3): 4 x PostLNLayer(
          (feedforward): FeedforwardLayer(
            d_model=256, d_feedforward=1024
            (linear1): Linear(in_features=256, out_features=1024, bias=True)
            (dropout): Identity()
            (activation): ReLU()
            (linear2): Linear(in_f

2024-08-12 07:04:14          500  0.08  eval-loss:  3.0656    
2024-08-12 07:04:18          600  0.09  train-loss: 2.91868   learning-rate: 9.80e-04
2024-08-12 07:04:20          700  0.11  train-loss: 2.78024   learning-rate: 9.73e-04
2024-08-12 07:04:23          800  0.12  train-loss: 2.72547   learning-rate: 9.64e-04
2024-08-12 07:04:25          900  0.14  train-loss: 2.5799    learning-rate: 9.55e-04
2024-08-12 07:04:28        1,000  0.15  train-loss: 2.43488   learning-rate: 9.45e-04


2024-08-12 07:04:28        1,000  0.15  eval-loss:  2.47814   
2024-08-12 07:04:31        1,100  0.17  train-loss: 2.44562   learning-rate: 9.34e-04
2024-08-12 07:04:33        1,200  0.18  train-loss: 2.40362   learning-rate: 9.21e-04
2024-08-12 07:04:35        1,300  0.2   train-loss: 2.37533   learning-rate: 9.08e-04
2024-08-12 07:04:37        1,400  0.21  train-loss: 2.35209   learning-rate: 8.94e-04
2024-08-12 07:04:40        1,500  0.23  train-loss: 2.32055   learning-rate: 8.79e-04


2024-08-12 07:04:40        1,500  0.23  eval-loss:  2.20528   
2024-08-12 07:04:42        1,600  0.24  train-loss: 2.28048   learning-rate: 8.63e-04
2024-08-12 07:04:45        1,700  0.26  train-loss: 2.2436    learning-rate: 8.46e-04
2024-08-12 07:04:47        1,800  0.27  train-loss: 2.1873    learning-rate: 8.29e-04
2024-08-12 07:04:50        1,900  0.29  train-loss: 2.17336   learning-rate: 8.10e-04
2024-08-12 07:04:52        2,000  0.3   train-loss: 2.22112   learning-rate: 7.91e-04


2024-08-12 07:04:52        2,000  0.3   eval-loss:  2.1152    
2024-08-12 07:04:57        2,100  0.32  train-loss: 2.16833   learning-rate: 7.72e-04
2024-08-12 07:04:59        2,200  0.33  train-loss: 2.11852   learning-rate: 7.52e-04
2024-08-12 07:05:01        2,300  0.35  train-loss: 2.07779   learning-rate: 7.31e-04
2024-08-12 07:05:03        2,400  0.36  train-loss: 2.1432    learning-rate: 7.10e-04
2024-08-12 07:05:05        2,500  0.38  train-loss: 2.10482   learning-rate: 6.88e-04


2024-08-12 07:05:05        2,500  0.38  eval-loss:  2.00863   
2024-08-12 07:05:08        2,600  0.39  train-loss: 2.11472   learning-rate: 6.66e-04
2024-08-12 07:05:10        2,700  0.41  train-loss: 2.04447   learning-rate: 6.43e-04
2024-08-12 07:05:12        2,800  0.42  train-loss: 2.09566   learning-rate: 6.20e-04
2024-08-12 07:05:14        2,900  0.44  train-loss: 2.00747   learning-rate: 5.97e-04
2024-08-12 07:05:17        3,000  0.45  train-loss: 1.89774   learning-rate: 5.74e-04


2024-08-12 07:05:17        3,000  0.45  eval-loss:  1.89916   
2024-08-12 07:05:20        3,100  0.47  train-loss: 1.99214   learning-rate: 5.50e-04
2024-08-12 07:05:21        3,200  0.48  train-loss: 2.07619   learning-rate: 5.27e-04
2024-08-12 07:05:24        3,300  0.5   train-loss: 1.96536   learning-rate: 5.03e-04
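
The numbers in the log above can be cross-checked with a little arithmetic. The sketch below recomputes `max_steps` from `total_train_samples` and `total_train_batch_size`, and the epoch fraction shown in the third log column. The learning-rate part is an inference rather than something the log states: the logged values happen to be consistent with a cosine decay from an assumed peak of 1e-3 to zero over `max_steps` with no warmup, but the actual scheduler configuration is not shown here.

```python
import math

# Values copied from the training log above.
total_train_samples = 212_000
total_train_batch_size = 32

# One optimizer step consumes one batch.
max_steps = total_train_samples // total_train_batch_size
print(max_steps)  # 6625

def epoch_at(step):
    """Fraction of one epoch completed at a given step."""
    return step / max_steps

print(round(epoch_at(500), 2))   # 0.08
print(round(epoch_at(3300), 2))  # 0.5

# Assumption: cosine decay from a peak of 1e-3 to 0, no warmup.
# This matches the logged values but is not confirmed by the log itself.
def cosine_lr(step, peak_lr=1e-3):
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / max_steps))

print(f"{cosine_lr(600):.2e}")   # 9.80e-04
print(f"{cosine_lr(3300):.2e}")  # 5.03e-04
```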


### Run All Configurations

In [None]:
from accelerate import notebook_launcher
from forgather.ml.training_script import training_loop

#os.environ['CUDA_VISIBLE_DEVICES'] = str(3)
def run_all_configurations():
    # Each iteration assigns the next configuration name to proj.config_name.
    for proj.config_name, _ in proj.meta.find_templates(proj.meta.config_prefix):
        print(f"{ ' Starting ' + proj.config_name + ' ':-^60}")
        notebook_launcher(
            training_loop,
            args=(proj.project_dir, proj.config_name,),
            num_processes=1
        )

run_all_configurations()

### Generate Training Script

```python
def make_train_script(
    project_directory,
    config_template=None,
    script_name='train.sh',
    nproc='gpu',
    cuda_devices=None
):
```
Generate a bash training script from a project meta-config.

The generated script is written to 'project_directory', and all paths are relative to this location.

- project_directory: The project directory. Assumes meta-config is 'meta_config.yaml'.
- script_name: The name of the output script. If None, the script can be specified on the command-line.
- nproc: Number of processes; 'gpu' is the number of available GPUs.
- cuda_devices: List of CUDA devices to limit training to.

In [None]:
def generate_script(cuda_devices=None):
    script_name = os.path.splitext(os.path.basename(proj.config_name))[0] + ".sh"
    nb.make_train_script(
        train_script_path=os.path.abspath(train_script_path),
        project_directory=proj.project_dir,
        config_template=proj.config_name,
        script_name=script_name,
        cuda_devices=cuda_devices)

    # Read back to verify
    script_path = os.path.join(proj.project_dir, script_name)
    with open(script_path, 'r') as f:
        md = (
            f"#### Generated Shell Script\n"
            f"[{script_name}]({os.path.relpath(script_path)})\n"
            f"```bash\n{f.read()}\n```"
        )
        display(ds.Markdown(md))
generate_script("3")
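
The script name in `generate_script` is derived from the configuration name by dropping any directory part and swapping the extension for ".sh". The sketch below isolates that step; `script_name_for` is a hypothetical helper name and the input paths are made-up examples. It also shows what happens when `os.path.basename` is omitted, as in the `generate_all_scripts` cell: the directory part is kept.

```python
import os

# Hypothetical helper that isolates the name derivation used in generate_script.
def script_name_for(config_name):
    return os.path.splitext(os.path.basename(config_name))[0] + ".sh"

print(script_name_for("tiny_causal.yaml"))          # tiny_causal.sh
print(script_name_for("configs/tiny_causal.yaml"))  # tiny_causal.sh

# Without basename, the directory part of the name survives.
print(os.path.splitext("configs/tiny_causal.yaml")[0] + ".sh")  # configs/tiny_causal.sh
```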

In [None]:
# Assign sequential GPUs to each configuration
def sequential_devices(i=0):
    while True:
        yield str(i)
        i += 1

# Assign the same fixed set of GPUs to each config
def same_devices(devices="0,1"):
    while True:
        yield devices

# Assign all GPUs to all configs
def all_devices():
    while True:
        yield None

def generate_all_scripts(device_iter=all_devices()):
    for devices, (proj.config_name, _) in zip(device_iter, proj.meta.find_templates(proj.meta.config_prefix)):
        script_name = os.path.splitext(proj.config_name)[0] + ".sh"
        nb.make_train_script(
            train_script_path=os.path.abspath(train_script_path),
            project_directory=proj.project_dir,
            config_template=proj.config_name,
            script_name=script_name,
            cuda_devices=devices)
        script_path = os.path.join(proj.project_dir, script_name)
        with open(script_path, 'r') as f:
            md = (
                f"[{script_name}]({os.path.relpath(script_path)})\n"
                f"```bash\n{f.read()}\n```"
            )
            display(ds.Markdown(md))

generate_all_scripts(sequential_devices(3))
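
The device generators above are infinite, so `zip` is what ends the loop: it stops as soon as the list of configurations is exhausted. The sketch below shows the pairing in isolation, using a made-up list of configuration names in place of `proj.meta.find_templates(...)`.

```python
# Same generators as in the cell above.
def sequential_devices(i=0):
    while True:
        yield str(i)
        i += 1

def same_devices(devices="0,1"):
    while True:
        yield devices

# Hypothetical configuration names, standing in for the project's templates.
config_names = ["config_a.yaml", "config_b.yaml", "config_c.yaml"]

# zip stops when the finite list runs out, even though the generator never does.
pairs = list(zip(sequential_devices(3), config_names))
print(pairs)
# [('3', 'config_a.yaml'), ('4', 'config_b.yaml'), ('5', 'config_c.yaml')]

shared = list(zip(same_devices("0,1"), config_names))
print(shared)
# [('0,1', 'config_a.yaml'), ('0,1', 'config_b.yaml'), ('0,1', 'config_c.yaml')]
```

One design point to keep in mind: a default argument such as `device_iter=all_devices()` is evaluated once, when the function is defined. That is harmless for `all_devices()`, which yields `None` forever, but a stateful default such as `sequential_devices()` would continue counting across calls.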

### Run Script from Notebook
Launch the training script from the notebook.

Note: The notebook's terminal emulation is limited, so progress bars may not render correctly.

In [None]:
print(f"{nb.get_train_cmdline(train_script_path, proj.meta, cuda_devices='0')} '{proj.config_name}'")

In [None]:
!{nb.get_train_cmdline(train_script_path, proj.meta, cuda_devices="0")} '{proj.config_name}'
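
The cells above wrap the configuration name in single quotes by hand, which works as long as the name itself contains no single quote. As a minimal sketch, assuming a name might contain spaces or quotes, `shlex.quote` from the standard library produces a safely quoted argument. The `base_cmd` string is a hypothetical placeholder for whatever `nb.get_train_cmdline(...)` returns, and `with_config` is an illustrative helper name.

```python
import shlex

# Hypothetical placeholder for the string returned by nb.get_train_cmdline(...).
base_cmd = "python train_script.py"

def with_config(base_cmd, config_name):
    """Append the configuration name as a single, safely quoted shell argument."""
    return f"{base_cmd} {shlex.quote(config_name)}"

simple = with_config(base_cmd, "tiny_causal.yaml")
tricky = with_config(base_cmd, "it's a config.yaml")
print(simple)
print(tricky)

# Splitting the command line again recovers the original name as one argument.
print(shlex.split(tricky)[-1])
```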

### View in Tensorboard
Note: If the notebook is running on the same machine as the trainer, remove "--bind_all".

In [None]:
# All models
!tensorboard --bind_all --logdir "{config_meta['models_dir']}"

In [None]:
# Current model only
!tensorboard --bind_all --logdir "{config_meta['output_dir']}"

### Cleanup
Note: Each of these cells shows the target directory and asks for confirmation before deleting anything.

#### Delete All

In [None]:
nb.delete_dir(config_meta['models_dir'], "Delete all models in project")

#### Delete Configuration Output Directory

In [None]:
nb.delete_dir(config_meta['output_dir'], "Delete output directory")
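
The confirm-before-delete pattern described above can be sketched as follows. This is not the implementation of `nb.delete_dir`, which is not shown here; `confirm_and_delete` is a hypothetical stand-in built on `shutil.rmtree`, with the prompt function passed in so the demonstration can run without user input.

```python
import os
import shutil
import tempfile

# Hypothetical stand-in; not the implementation of nb.delete_dir.
def confirm_and_delete(target_dir, prompt, ask=input):
    """Show the target directory and delete it only if the answer is 'y'."""
    if not os.path.isdir(target_dir):
        print(f"Nothing to delete: {target_dir}")
        return False
    answer = ask(f"{prompt}: '{target_dir}'? [y/N] ")
    if answer.strip().lower() != "y":
        print("Aborted.")
        return False
    shutil.rmtree(target_dir)
    print(f"Deleted {target_dir}")
    return True

# Demonstration on a throwaway directory, with canned answers instead of input().
demo_dir = tempfile.mkdtemp()
declined = confirm_and_delete(demo_dir, "Delete output directory", ask=lambda _: "n")
accepted = confirm_and_delete(demo_dir, "Delete output directory", ask=lambda _: "y")
```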