# Training Notebook

Configuration details: [Configuration Notebook](project_config.ipynb)

Run project training configurations and generate training scripts from project meta-data.

https://huggingface.co/blog/codeparrot

In [6]:
import sys, os
modules_path = os.path.join('..', 'src')
if modules_path not in sys.path: sys.path.insert(0, modules_path)
from IPython import display

from forgather.config import load_config, ConfigEnvironment
from aiws.notebooks import get_train_cmdline, make_train_script
from aiws.config import base_preprocessor_globals, MetaConfig
from aiws.training_loop import TrainingScriptConfig
import aiws.notebooks as nb

# Set project:
project_directory = "example"

# Set configuration:
config_template = "example_experiment.yaml"


nb.show_project_readme(project_directory)
meta = MetaConfig(project_directory)
nb.display_meta(meta, "### Meta Config\n")
nb.list_templates(meta.find_templates(meta.config_prefix), "### Available Configurations\n")
config_template_path = os.path.join(meta.config_prefix, config_template)
environment = ConfigEnvironment(
    searchpath=meta.searchpath,
    globals = base_preprocessor_globals() | dict(project_directory=project_directory)
)

config = environment.load(config_template_path).config
print(f"{' Active Configuration ':-^60}")
print(f"Project: {project_directory}")
print(f"Configuration: {config_template_path}")
print(f"Name: {config.experiment_name}")
print(f"Description: {config.experiment_description}")
print(f"Output Directory: {config.output_dir}")
print(f"Logging Directory: {config.logging_dir}")
print(f"Save Model: {config.do_save}")

## Example Project

This is a simple example project which can be used as a template.

### Meta Config
Project Directory: example

Meta Config: [example/meta.yaml](example/meta.yaml)

Template Search Paths:
- [example/templates](example/templates)
- [../templates](../templates)


### Available Configurations
- [example_experiment.yaml](example/templates/experiments/example_experiment.yaml)
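
How a relative template name such as `experiments/example_experiment.yaml` maps onto the search paths listed above can be sketched as follows. This is a minimal illustration, not the project's loader: `resolve_template` is a hypothetical helper, and it assumes the paths are tried in order with the first match winning, as is common for template loaders.

```python
import os
import tempfile


def resolve_template(searchpath, name):
    """Return the first existing file `name` under the directories in `searchpath`.

    Hypothetical helper: assumes directories are tried in order and the
    first match wins.
    """
    for directory in searchpath:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(f"{name!r} not found in {list(searchpath)}")


# Throwaway directories stand in for 'example/templates' (project)
# and '../templates' (shared).
with tempfile.TemporaryDirectory() as root:
    project_templates = os.path.join(root, "project_templates")
    shared_templates = os.path.join(root, "shared_templates")
    name = os.path.join("experiments", "example_experiment.yaml")

    # Create the same template name in both locations.
    for directory in (project_templates, shared_templates):
        os.makedirs(os.path.join(directory, "experiments"))
        with open(os.path.join(directory, name), "w") as f:
            f.write("# placeholder\n")

    resolved = resolve_template([project_templates, shared_templates], name)
    # The project directory is listed first, so its copy is the one found.
    resolved_in_project = resolved.startswith(project_templates)
    print(resolved_in_project)
```

Under this assumption, a project can override a shared template by placing a file with the same relative name in its own template directory.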


------------------- Active Configuration -------------------
Project: example
Configuration: experiments/example_experiment.yaml
Name: My ML Science Project
Description: It's not supid, it's advanced!
Output Directory: example/output_models/default_model
Logging Directory: example/output_models/default_model/runs/My ML Science Project_1721342344546780567
Save Model: True


### Launch Notebook Trainer

In [7]:
from accelerate import notebook_launcher
from aiws.training_loop import training_loop

notebook_launcher(
    training_loop,
    args=(project_directory, config_template,),
    num_processes=1
)

Launching training on one GPU.
Creating directory: example/output_models/default_model
Creating directory: example/output_models/default_model/runs/My ML Science Project_1721342358550527490
**** Training Started *****
experiment_name: My ML Science Project
experiment_description: It's not supid, it's advanced!
output_dir: example/output_models/default_model
logging_dir: example/output_models/default_model/runs/My ML Science Project_1721342358550527490


Repo card metadata block was not found. Setting CardData to empty.


Map:   0%|          | 0/2119719 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

total_examples: 2,119,680
total_train_samples: 2,119,680
per_device_train_batch_size: 64
actual_per_device_batch_size: 64
total_train_batch_size: 64
max_steps: 1,000
total_parameters: 2.4M
trainable_parameters: 2.4M
model:
VanillaTransformer(
  (embedding): Embedding(2000, 256)
  (positional_encoder): PositionalEncoder()
  (layers): ModuleList(
    (0-2): 3 x TransformerLayer(
      (attention): MultiheadAttention(
        (query_linear): Linear(in_features=256, out_features=256, bias=True)
        (key_linear): Linear(in_features=256, out_features=256, bias=True)
        (value_linear): Linear(in_features=256, out_features=256, bias=True)
      )
      (feedforward): FeedforwardLayer(
        (linear1): Linear(in_features=256, out_features=512, bias=True)
        (activation): ReLU()
        (linear2): Linear(in_features=512, out_features=256, bias=True)
      )
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise

2024-07-18 22:42:40          500  0.02  eval-loss:  3.00351   
2024-07-18 22:42:43          600  0.02  train-loss: 2.74256   learning-rate: 1.00e-03
2024-07-18 22:42:46          700  0.02  train-loss: 2.64946   learning-rate: 1.00e-03
2024-07-18 22:42:49          800  0.02  train-loss: 2.56928   learning-rate: 1.00e-03
2024-07-18 22:42:52          900  0.03  train-loss: 2.4667    learning-rate: 1.00e-03
2024-07-18 22:42:55        1,000  0.03  train-loss: 2.44162   learning-rate: 1.00e-03


2024-07-18 22:42:55        1,000  0.03  eval-loss:  2.52315   
train_runtime: 31.13
train_samples: 64,000
step: 1,000
train_samples_per_second: 2.056e+03
train_steps_per_second: 32.12
train_loss: 3.019
epoch: 0.03019

**** Training Completed *****
{'train_runtime': 31.131957054138184, 'train_samples': 64000, 'step': 1000, 'train_samples_per_second': 2055.765, 'train_steps_per_second': 32.121, 'train_loss': 3.0191736221313477, 'epoch': 0.030193236714975844}
[2024-07-18 22:42:55,744] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Model saved to: example/output_models/default_model
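
The summary figures above follow from the step count, batch size and runtime. The sketch below re-derives them from the values printed in the log. Treating `total_examples` as the dataset size rounded down to a whole number of batches is an inference from the two numbers shown (2,119,719 mapped examples versus 2,119,680), not something the log states.

```python
# Values copied from the training log above.
dataset_examples = 2_119_719        # size reported by the first "Map" progress bar
total_train_batch_size = 64
max_steps = 1_000
train_runtime = 31.131957054138184  # seconds

# Assumption: only whole batches are counted.
steps_per_epoch = dataset_examples // total_train_batch_size
total_examples = steps_per_epoch * total_train_batch_size

# Samples seen, fraction of an epoch completed, and throughput.
train_samples = max_steps * total_train_batch_size
epoch = train_samples / total_examples
samples_per_second = train_samples / train_runtime
steps_per_second = max_steps / train_runtime

print(f"total_examples: {total_examples:,}")
print(f"train_samples: {train_samples:,}")
print(f"epoch: {epoch:.5f}")
print(f"train_samples_per_second: {samples_per_second:.1f}")
print(f"train_steps_per_second: {steps_per_second:.2f}")
```

With 1,000 steps of 64 samples each, the run covers only about 3% of one epoch, which is why the `epoch` column in the log stays at 0.02 to 0.03.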


### Generate Training Script

```python
def make_train_script(
    project_directory,
    config_template=None,
    script_name='train.sh',
    nproc='gpu',
    cuda_devices=None
):
```
Generates a bash training script from a project meta-config.

The generated script is written to 'project_directory', and all paths in it are relative to that location.

- project_directory: The project directory. Assumes the meta-config is 'meta_config.yaml'.
- config_template: The configuration template to train. If None, the configuration can be specified on the command line.
- script_name: The name of the output script.
- nproc: Number of processes; 'gpu' uses the number of available GPUs.
- cuda_devices: Comma-separated list of CUDA devices to limit training to. For example, to use only CUDA devices 0 and 1, pass "0,1".

Run the generated script from the project directory:

```
.../my_project$ ./train.sh
```

In [4]:
# Select the name of the generated script.
script_name = 'base_2gpu.sh'

make_train_script(
    project_directory=project_directory,
    config_template=config_template,
    script_name=script_name,
    cuda_devices="0,1")

# Read back to verify
with open(os.path.join(project_directory, script_name), 'r') as f:
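    md = (
        f"#### Generated Shell Script\n"
        # Link to the script that was actually written, not a fixed 'train.sh'.
        f"[{script_name}]({os.path.join(project_directory, script_name)})\n"
        f"```bash\n{f.read()}\n```"
    )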
    display.display(display.Markdown(md))

#### Generated Shell Script
[train.sh](train.sh)
```bash
#!/bin/bash
CUDA_VISIBLE_DEVICES='0,1' torchrun --standalone --nproc-per-node 'gpu' '../../scripts/train_script.py' -p '.' -s '../../src' "base_config.yaml"

```
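
The shape of the generated command line above can be reproduced with a few lines of Python. This is only a sketch of how such a command might be assembled, not the implementation of `make_train_script`; the helper `build_train_cmdline` and its parameters are hypothetical, and the paths are copied from the output above.

```python
import shlex


def build_train_cmdline(
    train_script,
    project_dir=".",
    src_dir=None,
    nproc="gpu",
    cuda_devices=None,
):
    """Assemble a torchrun command line (hypothetical helper)."""
    parts = []
    if cuda_devices is not None:
        # Restrict the GPUs visible to the launched processes.
        parts.append(f"CUDA_VISIBLE_DEVICES={shlex.quote(cuda_devices)}")
    parts += [
        "torchrun",
        "--standalone",
        "--nproc-per-node", shlex.quote(str(nproc)),
        shlex.quote(train_script),
        "-p", shlex.quote(project_dir),
    ]
    if src_dir is not None:
        parts += ["-s", shlex.quote(src_dir)]
    return " ".join(parts)


cmdline = build_train_cmdline(
    "../../scripts/train_script.py",
    src_dir="../../src",
    cuda_devices="0,1",
)
print(cmdline)
```

`shlex.quote` only adds quotes where the shell needs them, so the printed command may carry fewer quotes than the generated script while meaning the same thing to the shell.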

### Run Script from Notebook
Launch the training script from the notebook.

Note: The notebook's terminal emulation is limited, so progress bars may render incorrectly.

In [None]:
print(f"{get_train_cmdline(meta, cuda_devices='0')} '{config_template}'")

In [None]:
!{get_train_cmdline(meta, cuda_devices="0")} '{config_template}'
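
If the `!` shell escape renders output poorly, one alternative is to launch the command with `subprocess` and print its output line by line. The sketch below is a generic pattern rather than part of this project; `stream_command` is a hypothetical helper, and a short Python one-liner stands in for the real training command.

```python
import subprocess
import sys


def stream_command(args):
    """Run a command, echo its combined output line by line.

    Returns the exit code and the list of lines that were printed.
    """
    lines = []
    with subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,                 # decode bytes; '\r' is treated as a line end
        bufsize=1,                 # line-buffered
    ) as process:
        for line in process.stdout:
            line = line.rstrip("\n")
            lines.append(line)
            print(line)
    # Leaving the 'with' block waits for the process, so returncode is set.
    return process.returncode, lines


# Stand-in for the training command line.
exit_code, lines = stream_command(
    [sys.executable, "-c", "print('step 1'); print('step 2')"]
)
```

Because text mode treats carriage returns as line ends, each redraw of a progress bar arrives as its own line instead of overwriting the previous one.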

### View in Tensorboard
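Note: "--bind_all" makes TensorBoard listen on all network interfaces so that it can be reached from other machines. If the notebook is running on the same machine as the trainer, remove it.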

In [None]:
!tensorboard --bind_all --logdir "{config.output_dir}"

#### Generate Bash Script

This writes a shell script that invokes the training script with the arguments for this project.

```bash
./train.sh path/to/experiment.yaml
```

If 'cuda_devices' is not None, execution is restricted to the given subset of the available GPUs.
```python
# Restrict training to GPUs 0 and 1
make_bash_script(meta, cuda_devices="0,1")
```

In [None]:
make_bash_script(meta, cuda_devices="0")

# Read back to verify
with open('train.sh', 'r') as f:
    md = (
        f"#### Generated Shell Script\n"
        f"[train.sh](train.sh)\n"
        f"```bash\n{f.read()}\n```"
    )
    display.display(display.Markdown(md))

### Cleanup
Note: Each of these cells shows the target directory and asks for confirmation before deleting anything.

#### Delete All

In [None]:
nb.delete_dir(meta.model_dir, "Delete all models in project")

#### Delete Configuration Output Directory

In [None]:
nb.delete_dir(config.output_dir, "Delete output directory")
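
The confirm-before-delete behaviour described above can be sketched as follows. This is not the implementation of `nb.delete_dir`; `confirm_and_delete` and its `ask` parameter are hypothetical, and the demonstration uses a throwaway temporary directory.

```python
import os
import shutil
import tempfile


def confirm_and_delete(target_dir, prompt, ask=input):
    """Show the target directory, ask for confirmation, then delete it.

    Returns True if the directory was deleted.
    """
    if not os.path.isdir(target_dir):
        print(f"Nothing to delete: {target_dir}")
        return False
    answer = ask(f"{prompt}: '{target_dir}' [y/N]? ")
    if answer.strip().lower() not in ("y", "yes"):
        print("Cancelled.")
        return False
    shutil.rmtree(target_dir)
    print(f"Deleted {target_dir}")
    return True


# Demonstration: answers are supplied by lambdas instead of keyboard input.
scratch = tempfile.mkdtemp()
kept = confirm_and_delete(scratch, "Delete output directory", ask=lambda _: "n")
deleted = confirm_and_delete(scratch, "Delete output directory", ask=lambda _: "y")
```

Defaulting to "no" on anything other than an explicit "y" or "yes" keeps an accidental Enter key press from removing trained models.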