# Project Index

In [1]:
import forgather.nb.notebooks as nb
nb.display_project_index(show_available_templates=True)

## Tiny LLama

In this tutorial we will train a very small Llama model (about 4.4M parameters) on 10% of the Tiny Stories dataset. On a single RTX-4090, this takes about three minutes. Once training is complete, we will load the model and use it for text generation -- and the generation will be reasonably coherent for a three-minute-old model.

#### Project Directory: "/home/dinalt/rust/forgather/examples/tutorials/tiny_llama"

## Meta Config
Meta Config: [/home/dinalt/rust/forgather/examples/tutorials/tiny_llama/meta.yaml](meta.yaml)

- [meta.yaml](meta.yaml)
    - [meta_defaults.yaml](../../../forgather_workspace/meta_defaults.yaml)
        - [base_directories.yaml](../../../forgather_workspace/base_directories.yaml)

Template Search Paths:
- [/home/dinalt/rust/forgather/examples/tutorials/tiny_llama/templates](templates)
- [/home/dinalt/rust/forgather/forgather_workspace](../../../forgather_workspace)
- [/home/dinalt/rust/forgather/templatelib/modellib](../../../templatelib/modellib)
- [/home/dinalt/rust/forgather/templatelib/examples](../../../templatelib/examples)
- [/home/dinalt/rust/forgather/templatelib/base](../../../templatelib/base)

## Available Configurations
- [train_hf_llama.yaml](templates/configs/train_hf_llama.yaml)
- [experimental_llama.yaml](templates/configs/experimental_llama.yaml)
- [full_dataset.yaml](templates/configs/full_dataset.yaml)
- [train_tiny_llama.yaml](templates/configs/train_tiny_llama.yaml)

Default Configuration: train_tiny_llama.yaml

## Available Templates
- [base_directories.yaml](../../../forgather_workspace/base_directories.yaml)
- [meta_defaults.yaml](../../../forgather_workspace/meta_defaults.yaml)
- [datasets/llm_dataset_project.yaml](../../../templatelib/examples/datasets/llm_dataset_project.yaml)
- [tokenizers/tiny_8k.yaml](../../../templatelib/examples/tokenizers/tiny_8k.yaml)
- [tokenizers/tiny_2k.yaml](../../../templatelib/examples/tokenizers/tiny_2k.yaml)
- [tokenizers/wikitext/32k.yaml](../../../templatelib/examples/tokenizers/wikitext/32k.yaml)
- [tokenizers/wikitext/8k.yaml](../../../templatelib/examples/tokenizers/wikitext/8k.yaml)
- [prompts/tiny_stories.yaml](../../../templatelib/examples/prompts/tiny_stories.yaml)
- [prompts/short_stories.yaml](../../../templatelib/examples/prompts/short_stories.yaml)
- [attn_functions/default.yaml](../../../templatelib/examples/attn_functions/default.yaml)
- [flex_kernel_options/default.yaml](../../../templatelib/examples/flex_kernel_options/default.yaml)
- [config_type.yaml](../../../templatelib/base/config_type.yaml)
    - [datasets/dataset_type.yaml](../../../templatelib/base/datasets/dataset_type.yaml)
        - [datasets/tokenized_dataset.yaml](../../../templatelib/base/datasets/tokenized_dataset.yaml)
    - [tokenizers/tokenizer_type.yaml](../../../templatelib/base/tokenizers/tokenizer_type.yaml)
        - [tokenizers/bpe/bpe.yaml](../../../templatelib/base/tokenizers/bpe/bpe.yaml)
    - [models/model_type.yaml](../../../templatelib/base/models/model_type.yaml)
        - [projects/causal_lm_def.yaml](../../../templatelib/examples/projects/causal_lm_def.yaml)
    - [training_script/training_script_type.yaml](../../../templatelib/base/training_script/training_script_type.yaml)
        - [training_script/causal_lm/causal_lm.yaml](../../../templatelib/base/training_script/causal_lm/causal_lm.yaml)
            - [project.yaml](templates/project.yaml)
                - [configs/train_hf_llama.yaml](templates/configs/train_hf_llama.yaml)
                - [configs/experimental_llama.yaml](templates/configs/experimental_llama.yaml)
                - [configs/full_dataset.yaml](templates/configs/full_dataset.yaml)
                - [configs/train_tiny_llama.yaml](templates/configs/train_tiny_llama.yaml)
- [trainers/base_trainer.yaml](../../../templatelib/base/trainers/base_trainer.yaml)
    - [trainers/trainer.yaml](../../../templatelib/base/trainers/trainer.yaml)
        - [project.trainer_config](templates/project.yaml)
        - [trainers/pipeline_trainer.yaml](../../../templatelib/base/trainers/pipeline_trainer.yaml)
            - [trainers/auto_pipeline_trainer.yaml](../../../templatelib/base/trainers/auto_pipeline_trainer.yaml)
        - [trainers/accel_trainer.yaml](../../../templatelib/base/trainers/accel_trainer.yaml)
    - [trainers/hf_trainer.yaml](../../../templatelib/base/trainers/hf_trainer.yaml)
- [models/base_language_model.yaml](../../../templatelib/base/models/base_language_model.yaml)
    - [models/causal_lm/from_pretrained_model.yaml](../../../templatelib/base/models/causal_lm/from_pretrained_model.yaml)
    - [models/causal_lm/from_pretrained_config.yaml](../../../templatelib/base/models/causal_lm/from_pretrained_config.yaml)
    - [models/causal_lm/from_pretrained_class.yaml](../../../templatelib/base/models/causal_lm/from_pretrained_class.yaml)
        - [models/transformers/llama.yaml](../../../templatelib/examples/models/transformers/llama.yaml)
            - [config.model_config](templates/configs/train_hf_llama.yaml)
        - [models/transformers/gpt2.yaml](../../../templatelib/examples/models/transformers/gpt2.yaml)
    - [models/causal_lm/custom.yaml](../../../templatelib/base/models/causal_lm/custom.yaml)
        - [models/causal_lm/custom_dynamic.yaml](../../../templatelib/base/models/causal_lm/custom_dynamic.yaml)
            - [models/transformers/dynamic_causal_transformer.yaml](../../../templatelib/examples/models/transformers/dynamic_causal_transformer.yaml)
            - [models/transformers/dynamic_llama.yaml](../../../templatelib/examples/models/transformers/dynamic_llama.yaml)
            - [models/transformers/deepone.yaml](../../../templatelib/examples/models/transformers/deepone.yaml)
- [models/causal_lm/import_model_project.yaml](../../../templatelib/base/models/causal_lm/import_model_project.yaml)
- [callbacks/base_callbacks.yaml](../../../templatelib/base/callbacks/base_callbacks.yaml)
    - [callbacks/loggers.yaml](../../../templatelib/base/callbacks/loggers.yaml)
        - [project.logger_config](templates/project.yaml)


---
This example makes extensive use of the Forgather templates library. Take a look at the various files which go into the configuration and compare these to the pre-processed output.

In [2]:
nb.display_config(config_template="", show_pp_config=True, show_generated_code=False)

## Included Templates
- [configs/train_tiny_llama.yaml](templates/configs/train_tiny_llama.yaml)
    - [project.yaml](templates/project.yaml)
        - [datasets/llm_dataset_project.yaml](../../../templatelib/examples/datasets/llm_dataset_project.yaml)
        - [models/causal_lm/import_model_project.yaml](../../../templatelib/base/models/causal_lm/import_model_project.yaml)
        - [project.logger_config](templates/project.yaml)
            - [callbacks/loggers.yaml](../../../templatelib/base/callbacks/loggers.yaml)
                - [callbacks/base_callbacks.yaml](../../../templatelib/base/callbacks/base_callbacks.yaml)
                    - [inc/formatting.jinja](../../../templatelib/base/inc/formatting.jinja)
            - [prompts/tiny_stories.yaml](../../../templatelib/examples/prompts/tiny_stories.yaml)
        - [project.trainer_config](templates/project.yaml)
            - [trainers/trainer.yaml](../../../templatelib/base/trainers/trainer.yaml)
                - [trainers/base_trainer.yaml](../../../templatelib/base/trainers/base_trainer.yaml)
        - [training_script/causal_lm/causal_lm.yaml](../../../templatelib/base/training_script/causal_lm/causal_lm.yaml)
            - [training_script/training_script_type.yaml](../../../templatelib/base/training_script/training_script_type.yaml)
                - [config_type.yaml](../../../templatelib/base/config_type.yaml)
                    - [base_directories.yaml](../../../forgather_workspace/base_directories.yaml)
### Config Metadata:

```python
{'config_class': 'type.training_script.causal_lm',
 'config_description': 'A demo of training a tiny llama model from scratch',
 'config_name': 'Tiny Llama',
 'datasets_dir': '/home/dinalt/rust/forgather/datasets',
 'forgather_dir': '/home/dinalt/rust/forgather',
 'logging_dir': './output_models/tiny_llama/runs/log_2025-12-29T07-23-26',
 'model_src_dir': '/home/dinalt/rust/forgather/model_src',
 'models_dir': './output_models',
 'nproc_per_node': 1,
 'output_dir': './output_models/tiny_llama',
 'project_dir': '.',
 'tokenizers_dir': '/home/dinalt/rust/forgather/tokenizers',
 'workspace_root': '/home/dinalt/rust/forgather'}

```

## Modules
## Output Targets
- distributed_env
- tokenizer
- model
- tokenizer_args
- train_dataset
- eval_dataset
- data_collator
- experiment_info
- testprompts
- generation_config
- trainer_callbacks
- optimizer
- lr_scheduler
- trainer_args
- model_preprocessor
- trainer
- dynamic_args
- meta
- main

## Preprocessed Config

```yaml
#---------------------------------------
#               Tiny Llama               
#---------------------------------------
# 2025-12-29T07:23:26
# Description: A demo of training a tiny llama model from scratch
# Project Dir: /home/dinalt/rust/forgather/examples/tutorials/tiny_llama
# Current Working Dir: "/home/dinalt/rust/forgather/examples/tutorials/tiny_llama"
# Forgather Config Dir: "/home/dinalt/.config/forgather"
# Model: tiny_llama
# Hostname: hal9000
# Versions:
#     python: 3.12.3
#     torch: 2.9.1+cu130
#     transformers: 4.57.1
#     accelerate: 1.12.0

############# Config Vars ##############

# ns.forgather_dir: "/home/dinalt/rust/forgather"
# ns.models_dir: "/home/dinalt/rust/forgather/examples/tutorials/tiny_llama/output_models"
# ns.project_model_src_dir: "/home/dinalt/rust/forgather/examples/tutorials/tiny_llama/model_src"
# ns.tokenizers_dir: "/home/dinalt/rust/forgather/tokenizers"
# ns.datasets_dir: "/home/dinalt/rust/forgather/datasets"
# ns.model_src_dir: "/home/dinalt/rust/forgather/model_src"
# ns.output_dir: "./output_models/tiny_llama"
# ns.logging_dir: "./output_models/tiny_llama/runs/log_2025-12-29T07-23-26"
# ns.nproc_per_node: 1
# ns.trust_remote_code: False

####### Distributed Environment ########

distributed_env: &distributed_env !singleton:forgather.ml.distributed:DistributedEnvironment@distributed_env

############# Dependencies #############



################ Model #################

# https://huggingface.co/docs/transformers/en/model_doc/auto
.define: &model_constructor_args
    # See: https://huggingface.co/docs/transformers/en/attention_interface
    attn_implementation: "sdpa"

# Import a model definition from another Forgather project
.define: &model_dict !call:forgather:from_project
    project_dir: "/home/dinalt/rust/forgather/examples/models/llama"
    config_template: "4M.yaml"
    targets: [  "pretrained_tokenizer", "model" ] 
    pp_kwargs:
        output_dir: "./output_models/tiny_llama"
    pp_debug: False
    model_constructor_args: *model_constructor_args

tokenizer: &tokenizer !call:getitem [ *model_dict, 'pretrained_tokenizer' ]
model: &model !call:getitem [ *model_dict, 'model' ]

############### Datasets ###############

tokenizer_args: &tokenizer_args !dict
    truncation: True
    max_length: 512    

# Load dataset from sub-project
.define: &dataset_dict !call:forgather:from_project
    project_dir: "/home/dinalt/rust/forgather/examples/datasets/roneneldan"
    config_template: "tinystories-abridged.yaml"
    targets: [  "train_dataset", "eval_dataset" ] 
    preprocess_args: *tokenizer_args
    tokenizer: *tokenizer

train_dataset: &train_dataset !call:getitem [ *dataset_dict, 'train_dataset' ]
eval_dataset: &eval_dataset !call:getitem [ *dataset_dict, 'eval_dataset' ]

############ Data Collator #############

# Data collator for causal model
# Batches are dynamically padded to longest sequence
# labels are set to input_ids, with pad tokens set to -100
data_collator: &data_collator !singleton:forgather.ml.data_collator:DataCollatorForCausalLM@DataCollatorForCausalLM
    tokenizer: *tokenizer
    return_tensors: pt

    # Tiny Llama
    truncation: True
    max_length: 512

########## Trainer Callbacks ###########

# **Dependencies**

# Experiment tracking: Tensorboard SummaryWriter
.define: &summary_writer !singleton:torch.utils.tensorboard:SummaryWriter
    - "./output_models/tiny_llama/runs/log_2025-12-29T07-23-26"

# Additional data to record to experiment loggers
experiment_info: &experiment_info !dict:@experiment_info
    date: "2025-12-29T07:23:26"
    name: "Tiny Llama"
    description: "A demo of training a tiny llama model from scratch"
    config: !var "pp_config"
    versions: {'python': '3.12.3', 'torch': '2.9.1+cu130', 'transformers': '4.57.1', 'accelerate': '1.12.0'}


# **Callback List**

# The model will be given the following prompts for text-gen at regular intervals.
testprompts: &testprompts !list:@testprompts
    # Test prompts from "https://arxiv.org/abs/2305.07759"
    - "Alice was so tired when she got back home so she went"
    - "Jack and Lily liked to watch the moon at night. They noticed that the moon changed its shape every night. Sometimes the moon was big and round, and sometimes it was"
    - "Jack and Lily saw a rainbow after a rainy day.They were amazed by the colors. Jack said, \"Look, Lily. A rainbow has"
    - "Jack wanted to read a book, so he went to"
    - "\"Can cows fly?\" Alice asked her mother."
    - "\"What do birds like to eat?\" Tom asked his mother."
    - "\"What language do they speak in France?\" Tom asked his mother."
    - "If I throw a ball up in the air, eventually it will"
    - "It was winter and cold outside so his mother told him, \"You should"
    - "Lily likes cats and dogs. She asked her mom for a dog and her mom said no, so instead she asked"
    - "Jack told Mary, \"If you give me your banana, I'll give you my apple.\" Mary gave Jack her Banana, so"
    - "On weekends Jack went to visit his grandmother whereas on weekdays he would go to school. Last weekend, when Jack was on his way to"
    - "Lily and Ben were having an argument. Ben said that cake is much better than ice cream and Lily said that"
    - "Lily and Ben are having an argument. They are trying to decide between the park and the swimming pool. Ben says, \"I want to go to the park\". Lily says"
    - "Jack's mother was not home, and his father was at home. When Jack came home, he said hello to"
    - "Lily doesn't like swimming. When her father wants to take her to the swimming pool, she says"
    - "Both Ben and Lily wanted cake. Father said that there was only one piece of cake left. They"
    - "Ben went to visit Lily in her house, but she was not at home. Ben knocked on the door,"

# Conservative text-generation parameters.
generation_config: &generation_config !dict:@generation_config
    identity: generation_config
    do_sample: True
    top_k: 20
    top_p: 0.9
    temperature: 0.7
    repitition_penalty: 1.15
trainer_callbacks: &trainer_callbacks !dlist:@trainer_callbacks
    progress_callback: !singleton:forgather.ml.trainer.callbacks:ProgressCallback
        use_tqdm: null # Optional[bool] : Use TQDM, Auto, if unspecified
        output_stream: "stdout" #Literal["stderr", "stdout"]
    info_callback: !singleton:forgather.ml.trainer.callbacks:InfoCallback
        verbose: False
    # Log all training output to JSON
    json_logger: !singleton:forgather.ml.trainer.callbacks:JsonLogger
        <<: *experiment_info

    # Log configuration and metrics to Tensorboard file
    tb_logger: !singleton:forgather.ml.trainer.callbacks:TBLogger
        arg0: *summary_writer
        <<: *experiment_info

    text_gen_callback: !singleton:forgather.ml.trainer.callbacks:TextgenCallback
        summary_writer: *summary_writer
        prompts: *testprompts
        generation_config: *generation_config
        max_new_tokens: 40
        generation_steps: 1000
    # Allow remote control of the training process
    trainer_control: !singleton:forgather.ml.trainer.callbacks:TrainerControlCallback

############## Optimizer ###############

optimizer: &optimizer !partial:torch:optim.AdamW
    lr: 1.0e-3

############# LR Scheduler #############

# https://arxiv.org/html/2503.02844v1
lr_scheduler: &lr_scheduler !lambda:forgather.ml.optim.infinite_lr_scheduler:InfiniteLRScheduler@lr_scheduler
    warmup_steps: 500
    cooldown_steps: 50000
    constant_lr: 1.0e-6

############### Trainer ################

# Name: Forgather Trainer
# Description: A lightweight, extensible trainer; does not support multiple GPUs
# Trainer Config Class: forgather.ml.trainer:TrainingArguments
# Trainer Class: forgather.ml.trainer:Trainer
# nproc_per_node: 1

# **Trainer Args**



trainer_args: &trainer_args !singleton:forgather.ml.trainer:TrainingArguments@trainer_args
    save_strategy: "no"
    max_steps: -1
    output_dir: "./output_models/tiny_llama"
    logging_dir: "./output_models/tiny_llama/runs/log_2025-12-29T07-23-26"
    # Tiny Llama Project Overrides
    eval_strategy: "steps"
    save_strategy: "steps"
    save_steps: 10000
    # Safetensors can't handle tied parameters/buffers, so fallback to PyTorch format.
    save_safetensors: False
    seed: 42
    per_device_train_batch_size: 32
    per_device_eval_batch_size: 64
    logging_steps: 100
    eval_steps: 500
    num_train_epochs: 1
    dataloader_num_workers: 1

model_preprocessor: &model_preprocessor !partial:call
    - *model

# **Trainer Constructor**

trainer: &trainer !singleton:forgather.ml.trainer:Trainer@trainer
    args: *trainer_args
    model_init: *model_preprocessor
    data_collator: *data_collator
    train_dataset: *train_dataset
    eval_dataset: *eval_dataset
    processing_class: *tokenizer
    callbacks: *trainer_callbacks
    # **Trainer**
    compute_loss_func: !singleton:forgather.ml.loss:CausalLoss
    distributed_env: *distributed_env
    optimizer_factory: *optimizer
    lr_scheduler_factory: *lr_scheduler

# **Dynamic Args**
dynamic_args: !dlist
    null: ~
    max_steps:
        names: "--max-steps"
        type: "int"
        help: "Set maximum training steps"
    save_strategy:
        names: [ "--save-strategy", "-S" ]
        choices: [ "no", "steps", "epoch" ]
        type: "str"
        help: "When to save checkpoints"
    attn_implementation:
        names: "--attn-implementation"
        type: "str"
        choices: [ "eager", "sdpa", "flash_attention_2", "flex_attention" ]
        help: "Attention implementation"

#---------------------------------------
#          Configuration Output          
#---------------------------------------
meta: &meta_output !dict:@meta
    config_name: "Tiny Llama"
    config_description: "A demo of training a tiny llama model from scratch"
    config_class: "type.training_script.causal_lm"
    project_dir: "."
    workspace_root: "/home/dinalt/rust/forgather"
    forgather_dir: "/home/dinalt/rust/forgather"
    models_dir: "./output_models"
    tokenizers_dir: "/home/dinalt/rust/forgather/tokenizers"
    datasets_dir: "/home/dinalt/rust/forgather/datasets"
    output_dir: "./output_models/tiny_llama"
    model_src_dir: "/home/dinalt/rust/forgather/model_src"
    logging_dir: "./output_models/tiny_llama/runs/log_2025-12-29T07-23-26"
    nproc_per_node: 1

main: !singleton:forgather.ml.training_script:TrainingScript@training_script
    meta: *meta_output
    do_train: True
    do_save: False
    do_eval: False
    distributed_env: *distributed_env
    trainer: *trainer

```
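
The `data_collator` comments in the preprocessed config describe dynamic padding with pad positions in the labels set to -100. The following is a minimal pure-Python sketch of that idea, not Forgather's `DataCollatorForCausalLM` implementation. The function name `collate_causal_lm`, right-side padding, and masking by position (rather than by pad-token value) are assumptions made for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def collate_causal_lm(sequences, pad_token_id, max_length=512):
    """Hypothetical helper: pad token-id lists to the longest one and build labels."""
    # Truncate first, mirroring truncation=True / max_length=512 in the config.
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)

    input_ids, attention_mask, labels = [], [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        # Assumption: pad on the right.
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs; padded positions are excluded from the loss.
        labels.append(seq + [IGNORE_INDEX] * n_pad)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, 9, -100]]
```

Because padding is only as long as the longest sequence in each batch, short batches cost less compute than they would if every sequence were padded to `max_length`.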



## Load Project

Load the default configuration.

In [3]:
from forgather.project import Project
import forgather.nb.notebooks as nb

# Load the default project, which is "train_tiny_llama.yaml"
proj = Project()

## Start Tensorboard

This project has been configured to log training to Tensorboard (TB). To watch the model's training progress, start a TB server from a shell using one of the commands below.

Tensorboard can be started from a terminal like this:

```bash
# By default, Tensorboard binds only to localhost. To bind to all interfaces, add --bind_all
tensorboard --logdir "/path/to/model/log/directory" [--bind_all]
```

You can use the CLI to launch TB for you, where it will automatically determine the path to the log directory:

```bash
# --all : Watch all output model directories, otherwise just the one for the current configuration.
# -- : Any arguments after '--' are passed directly to tensorboard, for example "--bind_all"
cd PROJECT_DIR
forgather tb [--all] [-- <tensorboard-args>]
```

When TB starts, it should print the URL to access it, for example:

```
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.16.2 at http://localhost:6006/ (Press CTRL+C to quit)
```

## Train Model

You have a few options for training the model.

1. Run it directly from the notebook. This should work fine with this example, although for projects using multiple GPUs, you will want to use one of the other options. To train from the notebook, just run the training cell below.
2. You can generate a training script and run it from the shell. To do so, run the cell with "generate_trainingscript()", then run the generated shell script from a terminal.
3. You can use the Forgather CLI.

```bash
# Open a shell in this project's directory, then run this command:
cd PROJECT_DIR
forgather train

# See forgather --help for more details.
```

Once training starts, switch to Tensorboard in your browser. One of the first things you will want to do is enable automatic refresh. To do so, click the gear in the upper-right corner and check "Reload Data."

Once training has started, take a look at the "Text" tab. You will see that we have automatically logged the preprocessed configuration and dumped the primary training artifacts.

Next, switch to the "Scalars" tab. You will see a plot of train and evaluation loss which will automatically update every 30 seconds. If you are not familiar with Tensorboard, now would be a good time to play with the UI elements to see how they work.

When training completes, the model will be automatically saved to the output directory ("./output_models/tiny_llama").

In [4]:
# Train model in notebook.

# Construct the default target, "main," which is a training script.
training_script = proj()

# Start training the model.
training_script.run()

# Release resources
training_script = None

Skipping import of cpp extensions due to incompatible torch version 2.9.1+cu130 for torchao version 0.14.1             Please see https://github.com/pytorch/ao/issues/2919 for more info
Fused loss factory not provided. Provide a fused-loss factory for enhanced performance.


[INFO|info_logger] 
total_examples: 212,000
total_train_samples: 212,000
per_device_train_batch_size: 32
actual_per_device_batch_size: 32
total_train_batch_size: 32
max_steps: 6,625
total_parameters: 4.4M
trainable_parameters: 4.4M

2025-12-29 07:23:47          100  0.02  train-loss: 6.88765   grad-norm: 0.65267   learning-rate: 2.00e-04
2025-12-29 07:23:49          200  0.03  train-loss: 4.30503   grad-norm: 0.48353   learning-rate: 4.00e-04
2025-12-29 07:23:51          300  0.05  train-loss: 3.36338   grad-norm: 0.50338   learning-rate: 6.00e-04
2025-12-29 07:23:53          400  0.06  train-loss: 3.06899   grad-norm: 0.49006   learning-rate: 8.00e-04
2025-12-29 07:23:55          500  0.08  train-loss: 2.77843   grad-norm: 0.42629   learning-rate: 1.00e-03


2025-12-29 07:23:55          500  0.08  eval-loss:  2.62731   
2025-12-29 07:23:59          600  0.09  train-loss: 2.59655   grad-norm: 0.42681   learning-rate: 1.00e-03
2025-12-29 07:24:02          700  0.11  train-loss: 2.40455   grad-norm: 0.41618   learning-rate: 1.00e-03
2025-12-29 07:24:04          800  0.12  train-loss: 2.34082   grad-norm: 0.40722   learning-rate: 1.00e-03
2025-12-29 07:24:06          900  0.14  train-loss: 2.212     grad-norm: 0.39784   learning-rate: 1.00e-03
2025-12-29 07:24:08        1,000  0.15  train-loss: 2.07026   grad-norm: 0.40033   learning-rate: 1.00e-03


2025-12-29 07:24:08        1,000  0.15  eval-loss:  1.99104   
2025-12-29 07:24:12        1,100  0.17  train-loss: 2.06937   grad-norm: 0.40356   learning-rate: 1.00e-03
2025-12-29 07:24:14        1,200  0.18  train-loss: 2.03831   grad-norm: 0.39567   learning-rate: 1.00e-03
2025-12-29 07:24:16        1,300  0.2   train-loss: 2.00233   grad-norm: 0.38852   learning-rate: 9.99e-04
2025-12-29 07:24:18        1,400  0.21  train-loss: 1.97046   grad-norm: 0.37826   learning-rate: 9.99e-04
2025-12-29 07:24:20        1,500  0.23  train-loss: 1.94784   grad-norm: 0.38883   learning-rate: 9.99e-04


2025-12-29 07:24:21        1,500  0.23  eval-loss:  1.74975   
2025-12-29 07:24:23        1,600  0.24  train-loss: 1.90823   grad-norm: 0.38616   learning-rate: 9.99e-04
2025-12-29 07:24:25        1,700  0.26  train-loss: 1.86886   grad-norm: 0.37719   learning-rate: 9.99e-04
2025-12-29 07:24:27        1,800  0.27  train-loss: 1.82517   grad-norm: 0.38816   learning-rate: 9.98e-04
2025-12-29 07:24:29        1,900  0.29  train-loss: 1.8071    grad-norm: 0.37966   learning-rate: 9.98e-04
2025-12-29 07:24:31        2,000  0.3   train-loss: 1.84373   grad-norm: 0.37438   learning-rate: 9.98e-04


2025-12-29 07:24:31        2,000  0.3   eval-loss:  1.65692   
2025-12-29 07:24:36        2,100  0.32  train-loss: 1.79691   grad-norm: 0.36641   learning-rate: 9.97e-04
2025-12-29 07:24:38        2,200  0.33  train-loss: 1.74844   grad-norm: 0.36792   learning-rate: 9.97e-04
2025-12-29 07:24:40        2,300  0.35  train-loss: 1.71255   grad-norm: 0.37593   learning-rate: 9.97e-04
2025-12-29 07:24:42        2,400  0.36  train-loss: 1.77328   grad-norm: 0.37834   learning-rate: 9.96e-04
2025-12-29 07:24:44        2,500  0.38  train-loss: 1.74461   grad-norm: 0.39035   learning-rate: 9.96e-04


2025-12-29 07:24:44        2,500  0.38  eval-loss:  1.56892   
2025-12-29 07:24:46        2,600  0.39  train-loss: 1.74146   grad-norm: 0.37168   learning-rate: 9.96e-04
2025-12-29 07:24:48        2,700  0.41  train-loss: 1.68803   grad-norm: 0.37144   learning-rate: 9.95e-04
2025-12-29 07:24:50        2,800  0.42  train-loss: 1.73194   grad-norm: 0.3629    learning-rate: 9.95e-04
2025-12-29 07:24:52        2,900  0.44  train-loss: 1.64809   grad-norm: 0.37706   learning-rate: 9.94e-04
2025-12-29 07:24:54        3,000  0.45  train-loss: 1.55665   grad-norm: 0.35236   learning-rate: 9.94e-04


2025-12-29 07:24:54        3,000  0.45  eval-loss:  1.47587   
2025-12-29 07:24:59        3,100  0.47  train-loss: 1.6416    grad-norm: 0.37025   learning-rate: 9.93e-04
2025-12-29 07:25:01        3,200  0.48  train-loss: 1.70948   grad-norm: 0.34801   learning-rate: 9.93e-04
2025-12-29 07:25:03        3,300  0.5   train-loss: 1.61597   grad-norm: 0.34541   learning-rate: 9.92e-04
2025-12-29 07:25:05        3,400  0.51  train-loss: 1.55825   grad-norm: 0.36052   learning-rate: 9.92e-04
2025-12-29 07:25:07        3,500  0.53  train-loss: 1.5827    grad-norm: 0.35073   learning-rate: 9.91e-04


2025-12-29 07:25:07        3,500  0.53  eval-loss:  1.45303   
2025-12-29 07:25:09        3,600  0.54  train-loss: 1.63534   grad-norm: 0.3483    learning-rate: 9.91e-04
2025-12-29 07:25:11        3,700  0.56  train-loss: 1.55805   grad-norm: 0.3547    learning-rate: 9.90e-04
2025-12-29 07:25:13        3,800  0.57  train-loss: 1.56522   grad-norm: 0.35229   learning-rate: 9.89e-04
2025-12-29 07:25:15        3,900  0.59  train-loss: 1.60674   grad-norm: 0.34669   learning-rate: 9.89e-04
2025-12-29 07:25:17        4,000  0.6   train-loss: 1.63565   grad-norm: 0.34443   learning-rate: 9.88e-04


2025-12-29 07:25:17        4,000  0.6   eval-loss:  1.42217   
2025-12-29 07:25:22        4,100  0.62  train-loss: 1.53881   grad-norm: 0.33184   learning-rate: 9.87e-04
2025-12-29 07:25:24        4,200  0.63  train-loss: 1.52706   grad-norm: 0.34786   learning-rate: 9.87e-04
2025-12-29 07:25:26        4,300  0.65  train-loss: 1.55825   grad-norm: 0.34447   learning-rate: 9.86e-04
2025-12-29 07:25:28        4,400  0.66  train-loss: 1.60822   grad-norm: 0.34165   learning-rate: 9.85e-04
2025-12-29 07:25:30        4,500  0.68  train-loss: 1.54526   grad-norm: 0.3382    learning-rate: 9.84e-04


2025-12-29 07:25:30        4,500  0.68  eval-loss:  1.40635   
2025-12-29 07:25:32        4,600  0.69  train-loss: 1.48855   grad-norm: 0.33326   learning-rate: 9.84e-04
2025-12-29 07:25:35        4,700  0.71  train-loss: 1.49441   grad-norm: 0.33088   learning-rate: 9.83e-04
2025-12-29 07:25:37        4,800  0.72  train-loss: 1.51925   grad-norm: 0.32635   learning-rate: 9.82e-04
2025-12-29 07:25:39        4,900  0.74  train-loss: 1.51519   grad-norm: 0.31614   learning-rate: 9.81e-04
2025-12-29 07:25:41        5,000  0.75  train-loss: 1.53234   grad-norm: 0.33127   learning-rate: 9.80e-04


2025-12-29 07:25:41        5,000  0.75  eval-loss:  1.3907    
2025-12-29 07:25:45        5,100  0.77  train-loss: 1.52618   grad-norm: 0.32921   learning-rate: 9.79e-04
2025-12-29 07:25:47        5,200  0.78  train-loss: 1.45455   grad-norm: 0.33294   learning-rate: 9.78e-04
2025-12-29 07:25:50        5,300  0.8   train-loss: 1.45012   grad-norm: 0.32411   learning-rate: 9.77e-04
2025-12-29 07:25:52        5,400  0.82  train-loss: 1.48363   grad-norm: 0.33564   learning-rate: 9.77e-04
2025-12-29 07:25:54        5,500  0.83  train-loss: 1.46404   grad-norm: 0.32302   learning-rate: 9.76e-04


2025-12-29 07:25:54        5,500  0.83  eval-loss:  1.34799   
2025-12-29 07:25:56        5,600  0.85  train-loss: 1.50253   grad-norm: 0.32968   learning-rate: 9.75e-04
2025-12-29 07:25:58        5,700  0.86  train-loss: 1.52154   grad-norm: 0.32171   learning-rate: 9.74e-04
2025-12-29 07:26:00        5,800  0.88  train-loss: 1.48362   grad-norm: 0.32428   learning-rate: 9.73e-04
2025-12-29 07:26:02        5,900  0.89  train-loss: 1.49482   grad-norm: 0.31245   learning-rate: 9.72e-04
2025-12-29 07:26:04        6,000  0.91  train-loss: 1.43278   grad-norm: 0.32528   learning-rate: 9.70e-04


2025-12-29 07:26:05        6,000  0.91  eval-loss:  1.35024   
2025-12-29 07:26:09        6,100  0.92  train-loss: 1.4147    grad-norm: 0.33273   learning-rate: 9.69e-04
2025-12-29 07:26:11        6,200  0.94  train-loss: 1.46516   grad-norm: 0.33047   learning-rate: 9.68e-04
2025-12-29 07:26:13        6,300  0.95  train-loss: 1.43909   grad-norm: 0.32528   learning-rate: 9.67e-04
2025-12-29 07:26:15        6,400  0.97  train-loss: 1.43014   grad-norm: 0.32405   learning-rate: 9.66e-04
2025-12-29 07:26:17        6,500  0.98  train-loss: 1.44907   grad-norm: 0.32634   learning-rate: 9.65e-04



2025-12-29 07:26:17        6,500  0.98  eval-loss:  1.32157   
2025-12-29 07:26:19        6,600  1.0   train-loss: 1.39438   grad-norm: 0.30354   learning-rate: 9.64e-04


Server thread error: Event loop stopped before Future completed.


2025-12-29 07:26:20        6,625  1.0  train_runtime: 155.5
train_samples: 211,968
step: 6,624
train_samples_per_second: 1.363e+03
train_steps_per_second: 42.59
epoch: 1.0
effective_batch_size: 32
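
The summary figures above can be cross-checked against each other, which is a quick sanity check for any run. The sketch below recomputes the throughput numbers from the logged values and converts the last eval loss to perplexity. It assumes, although the log does not say so, that the loss is mean cross-entropy in nats per token.

```python
import math

# Values copied from the training summary and the last eval line above.
train_runtime_s = 155.5
steps = 6_624
effective_batch_size = 32
train_samples = 211_968
last_eval_loss = 1.32157

# Every optimizer step consumes one effective batch.
samples_from_steps = steps * effective_batch_size

samples_per_second = train_samples / train_runtime_s
steps_per_second = steps / train_runtime_s

# Assumption: eval-loss is mean cross-entropy in nats per token.
perplexity = math.exp(last_eval_loss)

print(f"samples from steps: {samples_from_steps:,}")
print(f"samples/s:          {samples_per_second:.1f}")
print(f"steps/s:            {steps_per_second:.2f}")
print(f"perplexity:         {perplexity:.2f}")
```

Small differences from the logged rates are expected, because the logged runtime is rounded to one decimal place.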



## Load Trained Model

You can use the regular HF APIs to load the saved model and tokenizer.

In [5]:
from forgather.project import Project
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig, StoppingCriteria
from forgather.ml.sharded_checkpoint import create_pretrained_symlinks
import torch

model_path = "./output_models/tiny_llama"

# Create symlinks to latest checkpoint model output directory
# This is required for .from_pretrained() to find the latest checkpoint.
create_pretrained_symlinks(model_path)

# Set device to run inference on
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

The equivalent CLI for creating symbolic links to the latest checkpoint in the output directory is:
```bash
forgather checkpoint link
```

## Text Generation

This loop uses the newly trained model to generate text, seeded with the prompts defined in the cell below.

In [6]:
import torch

def generate_text(model, tokenizer, prompts, gen_config, max_new_tokens, device):
    model.to(device)
    model.eval()
    
    with torch.inference_mode():
        for prompt in prompts:
            tokenizer_outputs = tokenizer(
                [prompt],
                truncation=False,
                return_length=True,
                return_tensors="pt",
                return_attention_mask=True,
            )
        
            input_ids = tokenizer_outputs["input_ids"].to(device)
            attention_mask = tokenizer_outputs["attention_mask"].to(device)
            outputs = model.generate(
                input_ids,
                attention_mask=attention_mask,
                generation_config=gen_config,
                return_dict_in_generate=True,
                past_key_values=None,
                max_new_tokens=max_new_tokens,
            )
    
            output_text = tokenizer.decode(
                outputs.sequences[0],
                skip_special_tokens=True,
            )
            yield prompt + " [START] " + output_text[len(prompt) + 1 :]

prompts = [
    'Alice was so tired when she got back home so she went',
    'Jack and Lily liked to watch the moon at night. They noticed that the moon changed its shape every night. Sometimes the moon was big and round, and sometimes it was',
    'Jack and Lily saw a rainbow after a rainy day.They were amazed by the colors. Jack said, "Look, Lily. A rainbow has',
    'Jack wanted to read a book, so he went to',
    '"Can cows fly?" Alice asked her mother.',
]

gen_config = GenerationConfig(
    pad_token_id=model.config.pad_token_id,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.7,
    repetition_penalty=1.15,
)

# Use the device selected when the model was loaded, so this also runs without a GPU.
for s in generate_text(model, tokenizer, prompts, gen_config, 100, device):
    print(s)
    print(f"{'-' * 40}")

Alice was so tired when she got back home so she went [START] to her mum and said, "Mum, we need to rest."

Mum looked at her and said, "I have a surprise for you." She said, "Okay, you can try."

Alice hopped and jumped in the park. She felt a bit bitter, but she got a bit bit. She felt a little bit
----------------------------------------
Jack and Lily liked to watch the moon at night. They noticed that the moon changed its shape every night. Sometimes the moon was big and round, and sometimes it was [START] cool.

One day, Jack and Lily went to the moon. They saw a big box with a lock on it. They looked at it and saw a man with a big box. He had a lock on it.

"What do you want?" he asked.

"I have a box," Lily said. "Let's open the box.
----------------------------------------
Jack and Lily saw a rainbow after a rainy day.They were amazed by the colors. Jack said, "Look, Lily. A rainbow has [START] a big sun!"

Lily liked the rainbow and wanted to play with it. She ran to the sun a
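
The sampling settings in `gen_config` (`temperature`, `top_k`, `top_p`) each narrow the next-token distribution in a different way. The following is a minimal pure-Python sketch of the idea, applied in the order temperature, top-k, top-p. It is an illustration only, not the Transformers implementation, and the function name `filter_and_sample` is hypothetical.

```python
import math
import random


def filter_and_sample(logits, temperature=0.7, top_k=20, top_p=0.9, rng=random):
    """Sample one token index using temperature, then top-k, then top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring token indices.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]

    # Softmax over the kept tokens (subtract the max for numerical stability).
    max_logit = scaled[kept[0]]
    weights = [math.exp(scaled[i] - max_logit) for i in kept]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p (nucleus): keep the smallest prefix whose cumulative
    # probability reaches top_p.
    cumulative = 0.0
    nucleus = []
    for idx, p in zip(kept, probs):
        nucleus.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Sample from the nucleus; random.choices renormalizes the weights.
    indices = [i for i, _ in nucleus]
    nucleus_probs = [p for _, p in nucleus]
    return rng.choices(indices, weights=nucleus_probs, k=1)[0]


# A strongly peaked distribution: the nucleus collapses to the top token.
print(filter_and_sample([5.0, 1.0, 0.5, -2.0]))
```

Lower `temperature`, smaller `top_k`, and smaller `top_p` all make the output more deterministic. Raising them gives more varied text, which for a model this small usually also means less coherent text.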

## Train Hugging Face Llama Model

Next, let's try training a Llama model using the Hugging Face implementation.

Train the model from the CLI:

```bash
forgather -t train_hf_llama.yaml train
```

In [None]:
nb.display_config(config_template="train_hf_llama.yaml", show_pp_config=True, show_generated_code=False)

## Let's See What Happens...

...if we replace the post-layer-norm implementation with a pre-layer-norm implementation. This configuration uses a [custom model definition](./custom_models/llama/README.md) in the custom_models directory.

```bash
forgather -t experimental_llama.yaml train
```
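
For readers unfamiliar with the terms, the difference between pre- and post-layer-norm is where the normalization sits relative to the residual connection. The sketch below shows the two orderings on plain Python lists. It assumes RMS normalization without a learned gain, and the function names are illustrative. It is not taken from the custom model definition and makes no claim about the exact code that `experimental_llama.yaml` selects.

```python
import math


def rms_norm(x, eps=1e-6):
    """Scale a vector to unit root-mean-square (learned gain omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def pre_norm_block(x, sublayer):
    """Pre-norm: normalize the sublayer input; the residual path is untouched."""
    y = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, y)]


def post_norm_block(x, sublayer):
    """Post-norm: add the residual first, then normalize the sum."""
    y = sublayer(x)
    return rms_norm([a + b for a, b in zip(x, y)])


def double(v):
    """Toy stand-in for an attention or feedforward sublayer."""
    return [2.0 * e for e in v]


x = [3.0, -4.0]
print(pre_norm_block(x, double))
print(post_norm_block(x, double))
```

In the pre-norm form the input passes through the residual path unchanged, while in the post-norm form every block output is renormalized. Comparing the loss curves of the two training runs shows how much this ordering matters in practice.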

## Test Model With the Inference Server

A simple OpenAI-compatible inference server implementation is available in "tools/inference_server".

To host your newly trained model on the inference server:

```bash
# Manual start
forgather inf server -c -m ./output_models/tiny_llama/

# Config with YAML file
forgather inf server ./tiny_llama_server.yaml
```

From another session, you can perform text completion like this:

```bash
# Text completion request
forgather inf client --completion "Once upon a time"

# With manual generation settings
forgather inf client --temperature 0.7 --no-repeat-ngram-size 2 --repetition-penalty 1.2 --top-k 40 --completion "Once upon a time" --max-tokens 512

# From YAML config
forgather inf client ./tiny_llama_client.yaml --completion "Once upon a time"
```

The model has not been trained on a chat format, so it will not be very good at chat, but you can try it with:

```bash
forgather inf client ./tiny_llama_client.yaml
```

This server should work with other OpenAI-compatible clients as well.
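
As an illustration of a third-party client, the sketch below builds a text-completion request using only the Python standard library. The base URL, port, and model name are placeholders and are assumptions, not values taken from the Forgather documentation. The `/v1/completions` path and the response shape follow the usual OpenAI completions convention, which the server is assumed to implement.

```python
import json
import urllib.request

# Assumption: placeholder address; replace with your server's host and port.
BASE_URL = "http://localhost:8000/v1"


def build_completion_request(prompt, max_tokens=64, temperature=0.7, model="tiny_llama"):
    """Build an OpenAI-style completion request (the model name is a placeholder)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("Once upon a time", max_tokens=128))` should print a continuation, provided the placeholder address matches your setup.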

## Train on the Full Dataset

The examples so far have been limited to training on only the first 10% of the dataset. You can train on the complete dataset with this configuration:
```bash
forgather -t full_dataset.yaml train
```