# Project Index

In [1]:
import forgather.nb.notebooks as nb
nb.display_project_index(show_available_templates=True)

## Tiny LLama

In this tutorial we will train a very small Llama model (about 4M parameters) on 10% of the Tiny Stories dataset. On a single RTX-4090, this takes about three minutes. Once training is complete, we will load the model and use it for text generation -- and the generation will be reasonably coherent for a three-minute-old model.

#### Project Directory: "/home/dinalt/ai_assets/forgather/examples/tutorials/tiny_llama"

## Meta Config
Meta Config: [/home/dinalt/ai_assets/forgather/examples/tutorials/tiny_llama/meta.yaml](meta.yaml)

- [meta.yaml](meta.yaml)
    - [meta_defaults.yaml](../../../forgather_workspace/meta_defaults.yaml)
        - [base_directories.yaml](../../../forgather_workspace/base_directories.yaml)

Template Search Paths:
- [/home/dinalt/ai_assets/forgather/examples/tutorials/tiny_llama/templates](templates)
- [/home/dinalt/ai_assets/forgather/forgather_workspace](../../../forgather_workspace)
- [/home/dinalt/ai_assets/forgather/templatelib/modellib](../../../templatelib/modellib)
- [/home/dinalt/ai_assets/forgather/templatelib/examples](../../../templatelib/examples)
- [/home/dinalt/ai_assets/forgather/templatelib/base](../../../templatelib/base)

## Available Configurations
- [train_hf_llama.yaml](templates/configs/train_hf_llama.yaml)
- [train_tiny_llama.yaml](templates/configs/train_tiny_llama.yaml)
- [experimental_llama.yaml](templates/configs/experimental_llama.yaml)

Default Configuration: train_tiny_llama.yaml

## Available Templates
- [base_directories.yaml](../../../forgather_workspace/base_directories.yaml)
- [meta_defaults.yaml](../../../forgather_workspace/meta_defaults.yaml)
- [datasets/llm_dataset_project.yaml](../../../templatelib/examples/datasets/llm_dataset_project.yaml)
- [prompts/tiny_stories.yaml](../../../templatelib/examples/prompts/tiny_stories.yaml)
- [prompts/short_stories.yaml](../../../templatelib/examples/prompts/short_stories.yaml)
- [tokenizers/tiny_2k.yaml](../../../templatelib/examples/tokenizers/tiny_2k.yaml)
- [tokenizers/tiny_8k.yaml](../../../templatelib/examples/tokenizers/tiny_8k.yaml)
- [tokenizers/wikitext/32k.yaml](../../../templatelib/examples/tokenizers/wikitext/32k.yaml)
- [tokenizers/wikitext/8k.yaml](../../../templatelib/examples/tokenizers/wikitext/8k.yaml)
- [config_type.yaml](../../../templatelib/base/config_type.yaml)
    - [datasets/dataset_type.yaml](../../../templatelib/base/datasets/dataset_type.yaml)
        - [datasets/tokenized_dataset.yaml](../../../templatelib/base/datasets/tokenized_dataset.yaml)
    - [models/model_type.yaml](../../../templatelib/base/models/model_type.yaml)
    - [training_script/training_script_type.yaml](../../../templatelib/base/training_script/training_script_type.yaml)
        - [training_script/causal_lm/causal_lm.yaml](../../../templatelib/base/training_script/causal_lm/causal_lm.yaml)
            - [project.yaml](templates/project.yaml)
                - [configs/train_hf_llama.yaml](templates/configs/train_hf_llama.yaml)
                - [configs/train_tiny_llama.yaml](templates/configs/train_tiny_llama.yaml)
                - [configs/experimental_llama.yaml](templates/configs/experimental_llama.yaml)
    - [tokenizers/tokenizer_type.yaml](../../../templatelib/base/tokenizers/tokenizer_type.yaml)
        - [tokenizers/bpe/bpe.yaml](../../../templatelib/base/tokenizers/bpe/bpe.yaml)
- [trainers/base_trainer.yaml](../../../templatelib/base/trainers/base_trainer.yaml)
    - [trainers/trainer.yaml](../../../templatelib/base/trainers/trainer.yaml)
        - [project.trainer_config](templates/project.yaml)
        - [trainers/accel_trainer.yaml](../../../templatelib/base/trainers/accel_trainer.yaml)
        - [trainers/pipeline_trainer.yaml](../../../templatelib/base/trainers/pipeline_trainer.yaml)
    - [trainers/hf_trainer.yaml](../../../templatelib/base/trainers/hf_trainer.yaml)
- [models/base_language_model.yaml](../../../templatelib/base/models/base_language_model.yaml)
    - [models/causal_lm/from_pretrained.yaml](../../../templatelib/base/models/causal_lm/from_pretrained.yaml)
    - [models/causal_lm/from_pretrained_config.yaml](../../../templatelib/base/models/causal_lm/from_pretrained_config.yaml)
    - [models/causal_lm/custom.yaml](../../../templatelib/base/models/causal_lm/custom.yaml)
        - [models/causal_lm/custom_dynamic.yaml](../../../templatelib/base/models/causal_lm/custom_dynamic.yaml)
            - [models/transformers/deepone.yaml](../../../templatelib/examples/models/transformers/deepone.yaml)
            - [models/transformers/dynamic_causal_transformer.yaml](../../../templatelib/examples/models/transformers/dynamic_causal_transformer.yaml)
            - [models/transformers/dynamic_llama.yaml](../../../templatelib/examples/models/transformers/dynamic_llama.yaml)
                - [models/tiny_dynamic_llama.yaml](templates/models/tiny_dynamic_llama.yaml)
                    - [project.model_config](templates/project.yaml)
                    - [experiment.model_config](templates/configs/experimental_llama.yaml)
    - [models/causal_lm/from_config.yaml](../../../templatelib/base/models/causal_lm/from_config.yaml)
        - [models/transformers/gpt2.yaml](../../../templatelib/examples/models/transformers/gpt2.yaml)
        - [models/transformers/llama.yaml](../../../templatelib/examples/models/transformers/llama.yaml)
            - [models/tiny_hf_llama.yaml](templates/models/tiny_hf_llama.yaml)
- [models/causal_lm/import_model_project.yaml](../../../templatelib/base/models/causal_lm/import_model_project.yaml)
- [callbacks/base_callbacks.yaml](../../../templatelib/base/callbacks/base_callbacks.yaml)
    - [callbacks/loggers.yaml](../../../templatelib/base/callbacks/loggers.yaml)
        - [project.logger_config](templates/project.yaml)


---
This example makes extensive use of the Forgather templates library. Take a look at the various files which go into the configuration and compare these to the pre-processed output.

In [2]:
nb.display_config(config_template="", show_pp_config=True, show_generated_code=False)

## Included Templates
- [configs/train_tiny_llama.yaml](templates/configs/train_tiny_llama.yaml)
    - [project.yaml](templates/project.yaml)
        - [datasets/llm_dataset_project.yaml](../../../templatelib/examples/datasets/llm_dataset_project.yaml)
        - [models/causal_lm/import_model_project.yaml](../../../templatelib/base/models/causal_lm/import_model_project.yaml)
        - [project.logger_config](templates/project.yaml)
            - [callbacks/loggers.yaml](../../../templatelib/base/callbacks/loggers.yaml)
                - [callbacks/base_callbacks.yaml](../../../templatelib/base/callbacks/base_callbacks.yaml)
                    - [inc/formatting.jinja](../../../templatelib/base/inc/formatting.jinja)
            - [prompts/tiny_stories.yaml](../../../templatelib/examples/prompts/tiny_stories.yaml)
        - [project.trainer_config](templates/project.yaml)
            - [trainers/trainer.yaml](../../../templatelib/base/trainers/trainer.yaml)
                - [trainers/base_trainer.yaml](../../../templatelib/base/trainers/base_trainer.yaml)
        - [training_script/causal_lm/causal_lm.yaml](../../../templatelib/base/training_script/causal_lm/causal_lm.yaml)
            - [training_script/training_script_type.yaml](../../../templatelib/base/training_script/training_script_type.yaml)
                - [config_type.yaml](../../../templatelib/base/config_type.yaml)
                    - [base_directories.yaml](../../../forgather_workspace/base_directories.yaml)
### Config Metadata:

```python
{'config_class': 'type.training_script.causal_lm',
 'config_description': 'A demo of training a tiny llama model from scratch',
 'config_name': 'Tiny Llama',
 'datasets_dir': '/home/dinalt/ai_assets/forgather/datasets',
 'forgather_dir': '/home/dinalt/ai_assets/forgather',
 'logging_dir': './output_models/tiny_llama/runs/log_2025-09-19T10-05-42',
 'model_src_dir': '/home/dinalt/ai_assets/forgather/model_src',
 'models_dir': './output_models',
 'nproc_per_node': 1,
 'output_dir': './output_models/tiny_llama',
 'project_dir': '.',
 'tokenizers_dir': '/home/dinalt/ai_assets/forgather/tokenizers',
 'workspace_root': '/home/dinalt/ai_assets/forgather'}

```

## Modules
## Output Targets
- distributed_env
- model_constructor_args
- tokenizer
- model
- tokenizer_args
- train_dataset
- eval_dataset
- data_collator
- experiment_info
- testprompts
- generation_config
- trainer_callbacks
- optimizer
- lr_scheduler
- trainer_args
- model_preprocessor
- trainer
- dynamic_args
- meta
- main

## Preprocessed Config

```yaml
#---------------------------------------
#               Tiny Llama               
#---------------------------------------
# 2025-09-19T10:05:42
# Description: A demo of training a tiny llama model from scratch
# Project Dir: /home/dinalt/ai_assets/forgather/examples/tutorials/tiny_llama
# Current Working Dir: "/home/dinalt/ai_assets/forgather/examples/tutorials/tiny_llama"
# Forgather Config Dir: "/home/dinalt/.config/forgather"
# Model: tiny_llama
# Hostname: hal9000
# Versions:
#     python: 3.10.13
#     torch: 2.8.0
#     transformers: 4.56.1
#     accelerate: 1.10.1

############# Config Vars ##############

# ns.forgather_dir: "/home/dinalt/ai_assets/forgather"
# ns.models_dir: "/home/dinalt/ai_assets/forgather/examples/tutorials/tiny_llama/output_models"
# ns.project_model_src_dir: "/home/dinalt/ai_assets/forgather/examples/tutorials/tiny_llama/model_src"
# ns.tokenizers_dir: "/home/dinalt/ai_assets/forgather/tokenizers"
# ns.datasets_dir: "/home/dinalt/ai_assets/forgather/datasets"
# ns.model_src_dir: "/home/dinalt/ai_assets/forgather/model_src"
# ns.output_dir: "./output_models/tiny_llama"
# ns.logging_dir: "./output_models/tiny_llama/runs/log_2025-09-19T10-05-42"
# ns.nproc_per_node: 1
# ns.trust_remote_code: False

####### Distributed Environment ########

distributed_env: &distributed_env !singleton:forgather.ml.distributed:DistributedEnvironment@distributed_env

############# Dependencies #############



################ Model #################

# https://huggingface.co/docs/transformers/en/model_doc/auto
model_constructor_args: &model_constructor_args {}

# Import a model definition from another Forgather project
.define: &model_dict !call:forgather:from_project
    project_dir: "/home/dinalt/ai_assets/forgather/examples/models/llama"
    config_template: "4M.yaml"
    targets: [  "pretrained_tokenizer", "pretrained_model_ctor" ] 
    pp_kwargs:
        output_dir: "./output_models/tiny_llama"
    pp_debug: False
    model_constructor_args: *model_constructor_args

tokenizer: &tokenizer !call:getitem [ *model_dict, 'pretrained_tokenizer' ]
model: &model !call:getitem [ *model_dict, 'pretrained_model_ctor' ]

############### Datasets ###############

tokenizer_args: &tokenizer_args !dict
    truncation: True
    max_length: 512    

# Load dataset from sub-project
.define: &dataset_dict !call:forgather:from_project
    project_dir: "/home/dinalt/ai_assets/forgather/examples/datasets/roneneldan"
    config_template: "tinystories-abridged.yaml"
    targets: [  "train_dataset", "eval_dataset" ] 
    preprocess_args: *tokenizer_args
    tokenizer: *tokenizer

train_dataset: &train_dataset !call:getitem [ *dataset_dict, 'train_dataset' ]
eval_dataset: &eval_dataset !call:getitem [ *dataset_dict, 'eval_dataset' ]

############ Data Collator #############

# Data collator for causal model
# Batches are dynamically padded to longest sequence
# labels are set to input_ids, with pad tokens set to -100
data_collator: &data_collator !singleton:forgather.ml.data_collator:DataCollatorForCausalLM@DataCollatorForCausalLM
    tokenizer: *tokenizer
    return_tensors: pt

    # Tiny Llama
    truncation: True
    max_length: 512

########## Trainer Callbacks ###########

# **Dependencies**

# Experiment tracking: Tensorboard SummaryWriter
.define: &summary_writer !singleton:torch.utils.tensorboard:SummaryWriter
    - "./output_models/tiny_llama/runs/log_2025-09-19T10-05-42"

# Additional data to record to experiment loggers
experiment_info: &experiment_info !dict:@experiment_info
    date: "2025-09-19T10:05:42"
    name: "Tiny Llama"
    description: "A demo of training a tiny llama model from scratch"
    config: !var "pp_config"
    versions: {'python': '3.10.13', 'torch': '2.8.0', 'transformers': '4.56.1', 'accelerate': '1.10.1'}

# **Callback List**

# The model will be given the following prompts for text-gen at regular intervals.
testprompts: &testprompts !list:@testprompts
    # Test prompts from "https://arxiv.org/abs/2305.07759"
    - "Alice was so tired when she got back home so she went"
    - "Jack and Lily liked to watch the moon at night. They noticed that the moon changed its shape every night. Sometimes the moon was big and round, and sometimes it was"
    - "Jack and Lily saw a rainbow after a rainy day.They were amazed by the colors. Jack said, \"Look, Lily. A rainbow has"
    - "Jack wanted to read a book, so he went to"
    - "\"Can cows fly?\" Alice asked her mother."
    - "\"What do birds like to eat?\" Tom asked his mother."
    - "\"What language do they speak in France?\" Tom asked his mother."
    - "If I throw a ball up in the air, eventually it will"
    - "It was winter and cold outside so his mother told him, \"You should"
    - "Lily likes cats and dogs. She asked her mom for a dog and her mom said no, so instead she asked"
    - "Jack told Mary, \"If you give me your banana, I'll give you my apple.\" Mary gave Jack her Banana, so"
    - "On weekends Jack went to visit his grandmother whereas on weekdays he would go to school. Last weekend, when Jack was on his way to"
    - "Lily and Ben were having an argument. Ben said that cake is much better than ice cream and Lily said that"
    - "Lily and Ben are having an argument. They are trying to decide between the park and the swimming pool. Ben says, \"I want to go to the park\". Lily says"
    - "Jack's mother was not home, and his father was at home. When Jack came home, he said hello to"
    - "Lily doesn't like swimming. When her father wants to take her to the swimming pool, she says"
    - "Both Ben and Lily wanted cake. Father said that there was only one piece of cake left. They"
    - "Ben went to visit Lily in her house, but she was not at home. Ben knocked on the door,"

# Conservative text-generation parameters.
generation_config: &generation_config !dict:@generation_config
    identity: generation_config
    do_sample: True
    top_k: 20
    top_p: 0.9
    temperature: 0.7
    repitition_penalty: 1.15
trainer_callbacks: &trainer_callbacks !dlist:@trainer_callbacks
    null: ~
    # Log all training output to JSON
    json_logger: !singleton:forgather.ml.trainer.callbacks:JsonLogger
        <<: *experiment_info
    # Log configuration and metrics to Tensorboard file
    tb_logger: !singleton:forgather.ml.trainer.callbacks:TBLogger
        args: [ *summary_writer ]
        kwargs:
            <<: *experiment_info
    text_gen_callback: !singleton:forgather.ml.trainer.callbacks:TextgenCallback
        summary_writer: *summary_writer
        prompts: *testprompts
        generation_config: *generation_config
        max_new_tokens: 40
        generation_steps: 1000
    
    # Allow remote control of the training process
    trainer_control: !singleton:forgather.ml.trainer.callbacks:TrainerControlCallback

############## Optimizer ###############

optimizer: &optimizer !partial:torch:optim.AdamW
    lr: 1.0e-3

############# LR Scheduler #############

# https://arxiv.org/html/2503.02844v1
lr_scheduler: &lr_scheduler !lambda:forgather.ml.optim.infinite_lr_scheduler:InfiniteLRScheduler@lr_scheduler
    warmup_steps: 500
    cooldown_steps: 50000
    constant_lr: 1.0e-4

############### Trainer ################

# Name: Forgather Trainer
# Description: A lightweight, extensible trainer; does not support multiple GPUs
# Trainer Config Class: forgather.ml.trainer:TrainingArguments
# Trainer Class: forgather.ml.trainer:Trainer
# nproc_per_node: 1

# **Trainer Args**



trainer_args: &trainer_args !singleton:forgather.ml.trainer:TrainingArguments@trainer_args
    save_strategy: "no"
    max_steps: -1
    output_dir: "./output_models/tiny_llama"
    logging_dir: "./output_models/tiny_llama/runs/log_2025-09-19T10-05-42"
    # Tiny Llama Project Overrides
    eval_strategy: "steps"
    save_strategy: "steps"
    save_steps: 10000
    # Safetensors can't handle tied parameters/buffers, so fallback to PyTorch format.
    save_safetensors: False
    seed: 42
    per_device_train_batch_size: 32
    per_device_eval_batch_size: 64
    logging_steps: 100
    eval_steps: 500
    num_train_epochs: 1
    dataloader_num_workers: 1

model_preprocessor: &model_preprocessor !partial:call
    - *model

# **Trainer Constructor**

trainer: &trainer !singleton:forgather.ml.trainer:Trainer@trainer
    args: *trainer_args
    model_init: *model_preprocessor
    data_collator: *data_collator
    train_dataset: *train_dataset
    eval_dataset: *eval_dataset
    processing_class: *tokenizer
    callbacks: *trainer_callbacks

    # **Trainer**
    compute_loss_func: !singleton:forgather.ml.loss:CausalLoss
    distributed_env: *distributed_env
    optimizer_factory: *optimizer
    lr_scheduler_factory: *lr_scheduler

# **Dynamic Args**
dynamic_args: !dlist
    null: ~
    max_steps:
        names: "--max-steps"
        type: "int"
        help: "Set maximum training steps"
    save_strategy:
        names: [ "--save-strategy", "-S" ]
        choices: [ "no", "steps", "epoch" ]
        type: "str"
        help: "When to save checkpoints"

#---------------------------------------
#          Configuration Output          
#---------------------------------------
meta: &meta_output !dict:@meta
    config_name: "Tiny Llama"
    config_description: "A demo of training a tiny llama model from scratch"
    config_class: "type.training_script.causal_lm"
    project_dir: "."
    workspace_root: "/home/dinalt/ai_assets/forgather"
    forgather_dir: "/home/dinalt/ai_assets/forgather"
    models_dir: "./output_models"
    tokenizers_dir: "/home/dinalt/ai_assets/forgather/tokenizers"
    datasets_dir: "/home/dinalt/ai_assets/forgather/datasets"
    output_dir: "./output_models/tiny_llama"
    model_src_dir: "/home/dinalt/ai_assets/forgather/model_src"
    logging_dir: "./output_models/tiny_llama/runs/log_2025-09-19T10-05-42"
    nproc_per_node: 1

main: !singleton:forgather.ml.training_script:TrainingScript@training_script
    meta: *meta_output
    do_train: True
    do_save: False
    do_eval: False
    distributed_env: *distributed_env
    trainer: *trainer

```
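The data collator comments in the preprocessed config describe dynamic padding and label masking: each batch is padded to its longest sequence, and the labels copy the input ids with padding positions set to -100. A minimal sketch of that behaviour in plain Python follows. It is not Forgather's `DataCollatorForCausalLM`; the function name `collate_causal_lm`, the right-side padding, and the explicit `pad_token_id` argument are assumptions made for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def collate_causal_lm(batch, pad_token_id, max_length=512):
    """Pad a batch of token-id lists and build matching labels (hypothetical helper)."""
    # Truncate first, then pad only up to the longest remaining sequence.
    batch = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in batch)

    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels mirror the inputs; padded positions are excluded from the loss.
        labels.append(ids + [IGNORE_INDEX] * n_pad)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


example = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(example["labels"])
```

Padding per batch rather than to a fixed `max_length` keeps short batches small, which matters for a dataset of short stories.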



## Load Project

Load the default configuration.

In [3]:
from forgather.project import Project
import forgather.nb.notebooks as nb

# Load the default project, which is "train_tiny_llama.yaml"
proj = Project()

## Start Tensorboard

This project has been configured to log training to Tensorboard (TB). To watch the model's training progress, start a TB server from a shell using one of the commands below.

Tensorboard can be started from a terminal like this:

```bash
# By default, Tensorboard binds only to localhost. To bind to all interfaces, add --bind_all
tensorboard --logdir "/path/to/model/log/directory" [--bind_all]
```

Alternatively, the CLI can launch TB for you and will determine the path to the log directory automatically:

```bash
# --all : Watch all output model directories, otherwise just the one for the current configuration.
# -- : Any arguments after '--' are passed directly to tensorboard, for example "--bind_all"
cd PROJECT_DIR
cfcli.py tb [--all] [-- <tensorboard-args>]
```

When TB starts, it should print the URL to access it, e.g.:

```
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.16.2 at http://localhost:6006/ (Press CTRL+C to quit)
```
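If you prefer to build the `--logdir` argument yourself, the following sketch locates the most recent run directory. It assumes the layout seen in the config metadata, where each run logs to `<output_dir>/runs/log_<timestamp>`; the helper name `latest_log_dir` is hypothetical and not part of Forgather.

```python
from pathlib import Path


def latest_log_dir(output_dir):
    """Return the newest `log_*` directory under `<output_dir>/runs`, or None."""
    runs = Path(output_dir) / "runs"
    # The timestamps are ISO-like, so lexicographic order is chronological order.
    candidates = sorted(p for p in runs.glob("log_*") if p.is_dir())
    return candidates[-1] if candidates else None


log_dir = latest_log_dir("./output_models/tiny_llama")
if log_dir is None:
    print("No runs found yet; train the model first.")
else:
    print(f'tensorboard --logdir "{log_dir}"')
```

Pointing `--logdir` at the parent `runs` directory instead lets Tensorboard show every run side by side.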

## Train Model

You have a few options for training the model.

1. Run it directly from the notebook. This should work fine with this example, although for projects using multiple GPUs, you will want to use one of the other options. To train from the notebook, just run the following cell.
2. You can generate a training script and run it from the shell. To do so, run the cell with "generate_trainingscript()", then run the generated shell script from a terminal.
3. You can use the Forgather CLI.

```bash
# Open a shell in this project's directory, then run this command:
cd PROJECT_DIR
forgather train

# See forgather --help for more details.
```
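The preprocessed config declares a `dynamic_args` section with two entries, `--max-steps` and `--save-strategy`. How Forgather consumes that section is not shown in this tutorial; the sketch below only illustrates how a specification of that shape could be mapped onto Python's `argparse`. The `build_parser` helper, the `TYPES` table, and the `train-sketch` program name are hypothetical.

```python
import argparse

# Copied from the `dynamic_args` section of the preprocessed config.
dynamic_args = {
    "max_steps": {
        "names": "--max-steps",
        "type": "int",
        "help": "Set maximum training steps",
    },
    "save_strategy": {
        "names": ["--save-strategy", "-S"],
        "choices": ["no", "steps", "epoch"],
        "type": "str",
        "help": "When to save checkpoints",
    },
}

# Map the type names used in the config to Python callables (assumed mapping).
TYPES = {"int": int, "str": str}


def build_parser(spec):
    """Translate a dynamic-args style dict into an argparse parser."""
    parser = argparse.ArgumentParser(prog="train-sketch")
    for dest, opts in spec.items():
        names = opts["names"]
        if isinstance(names, str):
            names = [names]
        parser.add_argument(
            *names,
            dest=dest,
            type=TYPES[opts["type"]],
            choices=opts.get("choices"),
            help=opts.get("help"),
            default=None,  # None means "keep the value from the config"
        )
    return parser


args = build_parser(dynamic_args).parse_args(["--max-steps", "1000", "-S", "no"])
print(vars(args))
```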

Once training starts, switch to Tensorboard in your browser. One of the first things you will want to do is enable automatic refresh. To do so, click the gear in the upper-right corner and check "Reload Data."

Once training has started, take a look at the "Text" tab. You will see that we have automatically logged the preprocessed configuration as well as having dumped the primary training artifacts.

Next, switch to the "Scalars" tab. You will see a plot of train and evaluation loss which will automatically update every 30 seconds. If you are not familiar with Tensorboard, now would be a good time to play with the UI elements to see how they work.

When training completes, the model will be automatically saved to the output directory ("./output_models/tiny_llama").

In [4]:
# Train model in notebook.

# Construct the default target, "main," which is a training script.
training_script = proj()

# Start training the model.
training_script.run()

# Release resources
training_script = None

Tokenizing train:   0%|          | 0/211972 [00:00<?, ? examples/s]

Tokenizing validation:   0%|          | 0/500 [00:00<?, ? examples/s]

total_examples: 212,000
total_train_samples: 212,000
per_device_train_batch_size: 32
actual_per_device_batch_size: 32
total_train_batch_size: 32
max_steps: 6,625
total_parameters: 4.2M
trainable_parameters: 4.2M
model:
DynamicCasualLM(
  (causal_lm): CasualLM(
    loss_fn=CausalLoss()
    (input_encoder): InputEncoder(
      d_model=256, vocab_size=2000
      (dropout): Identity()
      (embedding): Embedding(2000, 256)
    )
    (output_decoder): Linear(in_features=256, out_features=2000, bias=False)
    (layer_stack): LayerStack(
      gradient_checkpointing=False, checkpoint_stride=1
      (layers): ModuleList(
        (0-3): 4 x PreLNLayer(
          (feedforward): GLUFeedforwardLayer(
            d_model=256, d_feedforward=676
            (up_proj): Linear(in_features=256, out_features=676, bias=False)
            (gate_proj): Linear(in_features=256, out_features=676, bias=False)
            (down_proj): Linear(in_features=676, out_features=256, bias=False)
            (activation

2025-09-19 10:07:00          500  0.08  eval-loss:  2.69354   
2025-09-19 10:07:04          600  0.09  train-loss: 2.66498   grad-norm: 0.45028   learning-rate: 1.00e-03
2025-09-19 10:07:06          700  0.11  train-loss: 2.478     grad-norm: 0.44827   learning-rate: 1.00e-03
2025-09-19 10:07:08          800  0.12  train-loss: 2.41238   grad-norm: 0.4311    learning-rate: 1.00e-03
2025-09-19 10:07:11          900  0.14  train-loss: 2.27736   grad-norm: 0.4189    learning-rate: 1.00e-03
2025-09-19 10:07:13        1,000  0.15  train-loss: 2.12937   grad-norm: 0.41794   learning-rate: 1.00e-03


2025-09-19 10:07:13        1,000  0.15  eval-loss:  2.0445    
2025-09-19 10:07:17        1,100  0.17  train-loss: 2.11906   grad-norm: 0.41671   learning-rate: 1.00e-03
2025-09-19 10:07:20        1,200  0.18  train-loss: 2.08488   grad-norm: 0.4173    learning-rate: 1.00e-03
2025-09-19 10:07:22        1,300  0.2   train-loss: 2.04039   grad-norm: 0.40087   learning-rate: 9.99e-04
2025-09-19 10:07:24        1,400  0.21  train-loss: 2.00084   grad-norm: 0.39857   learning-rate: 9.99e-04
2025-09-19 10:07:26        1,500  0.23  train-loss: 1.96937   grad-norm: 0.40235   learning-rate: 9.99e-04


2025-09-19 10:07:26        1,500  0.23  eval-loss:  1.7836    
2025-09-19 10:07:29        1,600  0.24  train-loss: 1.92732   grad-norm: 0.39787   learning-rate: 9.99e-04
2025-09-19 10:07:31        1,700  0.26  train-loss: 1.88687   grad-norm: 0.3942    learning-rate: 9.99e-04
2025-09-19 10:07:33        1,800  0.27  train-loss: 1.838     grad-norm: 0.4001    learning-rate: 9.98e-04
2025-09-19 10:07:35        1,900  0.29  train-loss: 1.81779   grad-norm: 0.39304   learning-rate: 9.98e-04
2025-09-19 10:07:37        2,000  0.3   train-loss: 1.85246   grad-norm: 0.38301   learning-rate: 9.98e-04


2025-09-19 10:07:38        2,000  0.3   eval-loss:  1.66765   
2025-09-19 10:07:42        2,100  0.32  train-loss: 1.80197   grad-norm: 0.37383   learning-rate: 9.98e-04
2025-09-19 10:07:44        2,200  0.33  train-loss: 1.75699   grad-norm: 0.38189   learning-rate: 9.97e-04
2025-09-19 10:07:46        2,300  0.35  train-loss: 1.72324   grad-norm: 0.39111   learning-rate: 9.97e-04
2025-09-19 10:07:49        2,400  0.36  train-loss: 1.7833    grad-norm: 0.39717   learning-rate: 9.97e-04
2025-09-19 10:07:51        2,500  0.38  train-loss: 1.74988   grad-norm: 0.39627   learning-rate: 9.96e-04


2025-09-19 10:07:51        2,500  0.38  eval-loss:  1.58031   
2025-09-19 10:07:53        2,600  0.39  train-loss: 1.74971   grad-norm: 0.38706   learning-rate: 9.96e-04
2025-09-19 10:07:55        2,700  0.41  train-loss: 1.69475   grad-norm: 0.37888   learning-rate: 9.96e-04
2025-09-19 10:07:57        2,800  0.42  train-loss: 1.73729   grad-norm: 0.37353   learning-rate: 9.95e-04
2025-09-19 10:08:00        2,900  0.44  train-loss: 1.65546   grad-norm: 0.3872    learning-rate: 9.95e-04
2025-09-19 10:08:02        3,000  0.45  train-loss: 1.56566   grad-norm: 0.36852   learning-rate: 9.94e-04


2025-09-19 10:08:02        3,000  0.45  eval-loss:  1.48591   
2025-09-19 10:08:06        3,100  0.47  train-loss: 1.6476    grad-norm: 0.38164   learning-rate: 9.94e-04
2025-09-19 10:08:08        3,200  0.48  train-loss: 1.7147    grad-norm: 0.36078   learning-rate: 9.94e-04
2025-09-19 10:08:11        3,300  0.5   train-loss: 1.62638   grad-norm: 0.35737   learning-rate: 9.93e-04
2025-09-19 10:08:13        3,400  0.51  train-loss: 1.56545   grad-norm: 0.368     learning-rate: 9.93e-04
2025-09-19 10:08:15        3,500  0.53  train-loss: 1.58629   grad-norm: 0.35055   learning-rate: 9.92e-04


2025-09-19 10:08:15        3,500  0.53  eval-loss:  1.4569    
2025-09-19 10:08:17        3,600  0.54  train-loss: 1.63956   grad-norm: 0.36239   learning-rate: 9.91e-04
2025-09-19 10:08:20        3,700  0.56  train-loss: 1.56118   grad-norm: 0.36122   learning-rate: 9.91e-04
2025-09-19 10:08:22        3,800  0.57  train-loss: 1.56943   grad-norm: 0.36209   learning-rate: 9.90e-04
2025-09-19 10:08:24        3,900  0.59  train-loss: 1.61105   grad-norm: 0.35517   learning-rate: 9.90e-04
2025-09-19 10:08:26        4,000  0.6   train-loss: 1.64238   grad-norm: 0.35412   learning-rate: 9.89e-04


2025-09-19 10:08:26        4,000  0.6   eval-loss:  1.42438   
2025-09-19 10:08:31        4,100  0.62  train-loss: 1.54302   grad-norm: 0.34402   learning-rate: 9.89e-04
2025-09-19 10:08:33        4,200  0.63  train-loss: 1.53296   grad-norm: 0.35615   learning-rate: 9.88e-04
2025-09-19 10:08:35        4,300  0.65  train-loss: 1.56578   grad-norm: 0.35344   learning-rate: 9.87e-04
2025-09-19 10:08:38        4,400  0.66  train-loss: 1.61292   grad-norm: 0.34805   learning-rate: 9.87e-04
2025-09-19 10:08:40        4,500  0.68  train-loss: 1.55118   grad-norm: 0.34782   learning-rate: 9.86e-04


2025-09-19 10:08:40        4,500  0.68  eval-loss:  1.41177   
2025-09-19 10:08:42        4,600  0.69  train-loss: 1.4922    grad-norm: 0.34156   learning-rate: 9.85e-04
2025-09-19 10:08:44        4,700  0.71  train-loss: 1.49984   grad-norm: 0.3425    learning-rate: 9.84e-04
2025-09-19 10:08:46        4,800  0.72  train-loss: 1.52592   grad-norm: 0.33863   learning-rate: 9.84e-04
2025-09-19 10:08:49        4,900  0.74  train-loss: 1.52044   grad-norm: 0.32511   learning-rate: 9.83e-04
2025-09-19 10:08:51        5,000  0.75  train-loss: 1.5379    grad-norm: 0.34537   learning-rate: 9.82e-04


2025-09-19 10:08:51        5,000  0.75  eval-loss:  1.3994    
2025-09-19 10:08:56        5,100  0.77  train-loss: 1.53196   grad-norm: 0.34065   learning-rate: 9.81e-04
2025-09-19 10:08:58        5,200  0.78  train-loss: 1.45921   grad-norm: 0.34047   learning-rate: 9.81e-04
2025-09-19 10:09:00        5,300  0.8   train-loss: 1.45618   grad-norm: 0.33018   learning-rate: 9.80e-04
2025-09-19 10:09:02        5,400  0.82  train-loss: 1.48482   grad-norm: 0.3375    learning-rate: 9.79e-04
2025-09-19 10:09:05        5,500  0.83  train-loss: 1.46733   grad-norm: 0.32711   learning-rate: 9.78e-04


2025-09-19 10:09:05        5,500  0.83  eval-loss:  1.35102   
2025-09-19 10:09:07        5,600  0.85  train-loss: 1.50472   grad-norm: 0.33294   learning-rate: 9.77e-04
2025-09-19 10:09:09        5,700  0.86  train-loss: 1.52715   grad-norm: 0.33114   learning-rate: 9.76e-04
2025-09-19 10:09:11        5,800  0.88  train-loss: 1.48738   grad-norm: 0.33355   learning-rate: 9.75e-04
2025-09-19 10:09:14        5,900  0.89  train-loss: 1.49892   grad-norm: 0.31996   learning-rate: 9.74e-04
2025-09-19 10:09:16        6,000  0.91  train-loss: 1.43931   grad-norm: 0.33945   learning-rate: 9.73e-04


2025-09-19 10:09:16        6,000  0.91  eval-loss:  1.35941   
2025-09-19 10:09:20        6,100  0.92  train-loss: 1.42192   grad-norm: 0.34209   learning-rate: 9.72e-04
2025-09-19 10:09:23        6,200  0.94  train-loss: 1.46797   grad-norm: 0.3288    learning-rate: 9.71e-04
2025-09-19 10:09:25        6,300  0.95  train-loss: 1.44657   grad-norm: 0.33299   learning-rate: 9.70e-04
2025-09-19 10:09:27        6,400  0.97  train-loss: 1.43737   grad-norm: 0.34078   learning-rate: 9.69e-04
2025-09-19 10:09:29        6,500  0.98  train-loss: 1.45573   grad-norm: 0.33638   learning-rate: 9.68e-04


2025-09-19 10:09:29        6,500  0.98  eval-loss:  1.32964   
2025-09-19 10:09:32        6,600  1.0   train-loss: 1.39928   grad-norm: 0.31088   learning-rate: 9.67e-04


Server thread error: Event loop stopped before Future completed.


2025-09-19 10:09:32        6,625  1.0   train_runtime: 163.9 train_samples: 212,000 step: 6,625 train_samples_per_second: 1.294e+03 train_steps_per_second: 40.43 epoch: 1.0 effective_batch_size: 32 
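Two numbers in the log above can be reproduced from the configuration. The step count follows from the dataset size and batch size, and the slowly decaying learning rate is consistent with a linear warmup to the optimizer's `lr` followed by a cosine cooldown toward `constant_lr`. The sketch below is an assumption about the schedule's shape, not Forgather's `InfiniteLRScheduler` code; the warmup form in particular cannot be checked against the log, and the function names are hypothetical.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def learning_rate(step, peak_lr=1.0e-3, constant_lr=1.0e-4,
                  warmup_steps=500, cooldown_steps=50_000):
    """Assumed schedule: linear warmup, cosine cooldown, then a constant floor."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the cooldown completed, clamped so the rate stays at the floor.
    progress = min((step - warmup_steps) / cooldown_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return constant_lr + (peak_lr - constant_lr) * cosine


print(steps_per_epoch(212_000, 32))
for step in (600, 1_300, 5_000, 6_600):
    print(step, f"{learning_rate(step):.2e}")
```

With the config values (`lr: 1.0e-3`, `warmup_steps: 500`, `cooldown_steps: 50000`, `constant_lr: 1.0e-4`), the sketch gives 6,625 steps and rates that round to the logged values at steps 1,300 and 6,600. Because one epoch covers only a small part of the 50,000-step cooldown, the rate barely moves during this run.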


## Load Trained Model

You can use the regular HF APIs to load the saved model and tokenizer.

In [5]:
from forgather.project import Project
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig, StoppingCriteria
from forgather.ml.sharded_checkpoint import create_pretrained_symlinks
import torch

model_path = "./output_models/tiny_llama"

# Create symlinks to latest checkpoint model output directory
# This is required for .from_pretrained() to find the latest checkpoint.
create_pretrained_symlinks(model_path)

# Set device to run inference on
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

## Text Generation

This loop uses the newly trained model to generate text, seeded with the prompts defined in the cell below.

In [6]:
import torch

def generate_text(model, tokenizer, prompts, gen_config, max_new_tokens, device):
    model.to(device)
    model.eval()
    
    with torch.inference_mode():
        for prompt in prompts:
            tokenizer_outputs = tokenizer(
                [prompt],
                truncation=False,
                return_length=True,
                return_tensors="pt",
                return_attention_mask=True,
            )
        
            input_ids = tokenizer_outputs["input_ids"].to(device)
            attention_mask = tokenizer_outputs["attention_mask"].to(device)
            use_cache = getattr(model, "_supports_cache_class", False)
            outputs = model.generate(
                input_ids,
                attention_mask=attention_mask,
                generation_config=gen_config,
                return_dict_in_generate=True,
                use_cache=use_cache,
                past_key_values=None,
                max_new_tokens=max_new_tokens,
            )
    
            output_text = tokenizer.decode(
                outputs.sequences[0],
                skip_special_tokens=True,
            )
            yield prompt + " [START] " + output_text[len(prompt) + 1 :]

prompts = [
    'Alice was so tired when she got back home so she went',
    'Jack and Lily liked to watch the moon at night. They noticed that the moon changed its shape every night. Sometimes the moon was big and round, and sometimes it was',
    'Jack and Lily saw a rainbow after a rainy day.They were amazed by the colors. Jack said, "Look, Lily. A rainbow has',
    'Jack wanted to read a book, so he went to',
    '"Can cows fly?" Alice asked her mother.',
]

gen_config = GenerationConfig(
    pad_token_id=model.config.pad_token_id,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.7,
    repetition_penalty=1.15,
)

for s in generate_text(model, tokenizer, prompts, gen_config, 100, device):
    print(s)
    print(f"{'-' * 40}")

Alice was so tired when she got back home so she went [START] to sleep. She slept and slept, until she saw a big piece of paper on the table. She saw a piece of paper on the floor. The paper was so happy! She smiled and thanked the paper. She picked up the paper and put it on. She snuggled up and sle
----------------------------------------
Jack and Lily liked to watch the moon at night. They noticed that the moon changed its shape every night. Sometimes the moon was big and round, and sometimes it was [START] different.

One night, Jack saw a big balloon in the sky. He was very hungry and wanted to eat it. He tried to pick it, but it was too heavy. He felt sad.

Luckily, he saw a little girl. She looked sad. She did not know what to do. She looked sad. She w
----------------------------------------
Jack and Lily saw a rainbow after a rainy day.They were amazed by the colors. Jack said, "Look, Lily. A rainbow has [START] a big, red bow on it."

Jack asked, "What's inside, sweetie?" He 
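
The `temperature`, `top_k`, and `top_p` settings in `gen_config` shape the distribution that each new token is sampled from. The following is a minimal pure-Python sketch of that filtering on a toy logit vector. It is an illustration only, not the Transformers implementation; the function name `filter_logits` is hypothetical, and the order of operations (temperature, then top-k, then top-p) is an assumption.

```python
import math

def filter_logits(logits, temperature=0.7, top_k=20, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]

    # Softmax over the surviving tokens (shifted by the max for stability).
    peak = max(scaled[i] for i in kept)
    exps = {i: math.exp(scaled[i] - peak) for i in kept}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    # Top-p: keep the smallest high-probability prefix whose mass reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the remaining probabilities sum to 1.
    norm = sum(probs[i] for i in nucleus)
    return {i: probs[i] / norm for i in nucleus}

toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_logits(toy_logits, temperature=1.0, top_k=3, top_p=0.9))
```

With these toy values, top-k drops the lowest-scoring token and top-p then drops the third, so sampling would choose between the two most likely tokens only.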

## Train Hugging Face Llama Model

Next, let's try training a Llama model using the Hugging Face implementation.

Train the model from the CLI:

```bash
forgather -t train_hf_llama.yaml train
```

In [None]:
nb.display_config(config_template="train_hf_llama.yaml", show_pp_config=True, show_generated_code=False)

## Let's See What Happens...

...if we replace the post-layer-norm implementation with a pre-layer-norm implementation.

In [None]:
nb.display_config(config_template="experimental_llama.yaml", show_pp_config=True, show_generated_code=False)

```bash
forgather -t experimental_llama.yaml train
```
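
For readers unfamiliar with the two terms, the difference between pre-layer-norm and post-layer-norm lies in where normalization sits relative to the residual connection. The sketch below is a generic illustration on plain Python lists; it does not reflect the contents of the Forgather templates. `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward layer, and a plain layer norm without learned parameters is used, whereas Llama models normally use RMSNorm.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_block(x, sublayer):
    """Pre-LN: normalize the sublayer input; the residual path stays unnormalized."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x, sublayer):
    """Post-LN: add the residual first, then normalize the sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def toy_sublayer(x):
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return [2.0 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
print("pre-LN: ", pre_ln_block(x, toy_sublayer))
print("post-LN:", post_ln_block(x, toy_sublayer))
```

In the pre-LN form the input passes through the residual path unchanged, while in the post-LN form every block output is renormalized.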

## Test Model With the Inference Server

There is a simple OpenAI-compatible inference server implementation in "tools/inference_server".

To host your newly trained model on the inference server:

```bash
./server.py server_configs/tiny_llama.yaml
```

From another session, you can perform text completion like this:

```bash
./client.py client_configs/tiny_llama.yaml --stream --completion "Once upon a time,"
```

The Tiny Llama model, trained on Tiny Stories, will not be very good at interactive chat, but you can test this with the following command:

```bash
./client.py client_configs/tiny_llama.yaml --stream --interactive
```

This server should work with other OpenAI-compatible clients as well.
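
As a rough illustration of what such a client does, here is a minimal standard-library sketch that builds and sends a text-completion request. The base URL, port, and model name are assumptions, as is the `/v1/completions` path, which follows the usual OpenAI convention; check the server configuration for the actual values. The helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request

def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="tiny_llama", max_tokens=100):
    """Build a POST request for an OpenAI-style completions endpoint."""
    # base_url and model are assumed values, not taken from the server config.
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]

# Requires the inference server to be running:
# print(complete("Once upon a time,"))
```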