# Advanced Forgather Syntax

Below the "project" abstraction lies a lower-level API. We use it here because it is more convenient for experimenting with syntax.

We will start with a few simple examples and work our way to constructing a modular transformer model and feeding a training example through it.

In [1]:
import sys, os
modules_path = os.path.join('..', 'src')
if modules_path not in sys.path: sys.path.insert(0, modules_path)

from pprint import pp, pformat

from IPython import display as ds

from forgather.latent import Latent
from forgather.config import ConfigEnvironment
from forgather.preprocess import PPEnvironment
from forgather.codegen import generate_code
from forgather.yaml_encoder import to_yaml
import forgather.nb.notebooks as nb

## Trivial Examples

In [2]:
# Imports
from forgather.config import ConfigEnvironment

# Construct a configuration environment
env = ConfigEnvironment()

# Define a configuration
# Here, we construct a 2x2 random tensor.
document = """
!call:torch:randn [ 2, 2 ]
"""

# Convert the configuration to a graph
graph = env.load_from_string(document).config

# Construct the graph
graph()

tensor([[-1.0921,  0.2509],
        [-1.2511,  1.3328]])

In [3]:
# Construct a function which computes the square-root of its argument.
graph = env.load_from_string("main: !partial:math:sqrt []").config
graph.main(16)

4.0
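
As a rough mental model (not Forgather's actual implementation), the two tags above map closely onto plain Python: `!call` imports the named callable and invokes it with the given arguments, while `!partial` imports it and defers the call. The `resolve` helper below is hypothetical and handles only `module:name` specs:

```python
import importlib
from functools import partial


def resolve(spec: str):
    """Hypothetical helper: turn 'module:attr' into the Python object it names."""
    module_name, attr_path = spec.split(":", 1)
    obj = importlib.import_module(module_name)
    for name in attr_path.split("."):
        obj = getattr(obj, name)
    return obj


# Roughly what "!call:math:sqrt [ 16 ]" would do: import, then call with the arguments.
called = resolve("math:sqrt")(16)

# Roughly what "!partial:math:sqrt []" does: import, then defer the call.
deferred = partial(resolve("math:sqrt"))
print(called, deferred(16))
```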

## Complex Example

The following template defines a fairly simple causal transformer model. We will make use of the reusable model components library.

[./templates/model_def.yaml](./templates/model_def.yaml)

In [4]:
template_path = os.path.join('templates', 'model_def.yaml')
with open(template_path, 'r') as f:
    nb.display_codeblock("yaml", f.read(), "### Configuration Template")

### Configuration Template
```yaml
-- set ns = namespace()
-- from 'templates/formatting.jinja' import h1, h2, h3
-- filter trim() ## This removes whitespace before the header.

## Jinja2 block definitions; we can override these in derived templates.
-- block meta_config
    -- set ns.model_src = '../../../modelsrc/transformer/'
    -- set ns.config_name = 'Control'
    -- set ns.config_description = "Baseline Control"
    ## Example of variable set by jinja2 template.
    -- set ns.vocab_size = 1024
<< endblock meta_config


-- endfilter
-- block header
== h1(ns.config_name)
# {{ utcisotime() }}
# Description: {{ ns.config_description }}
# model_src = {{ ns.model_src }}
# Current Working Dir: "{{ getcwd() }}"
# Forgather Config Dir: "{{ abspath(forgather_config_dir()) }}"
<< endblock header


== h2("Model Definition")

== h3("Layer Norm Factory")

-- block layer_norm_factory
.define: &layer_norm_factory !lambda:torch.nn:LayerNorm@layer_norm_factory
    - !var "hidden_size"
<< endblock layer_norm_factory


== h3("Activation Factory")

-- block activation_factory
.define: &activation_factory !partial:torch.nn:ReLU@activation_factory []
<< endblock activation_factory


== h3("Feedforward Factory")

-- block feedforward_factory
.define: &feedforward_factory !partial:{{ns.model_src}}feedforward_layer.py:FeedforwardLayer@feedforward_factory
    activation_factory: *activation_factory
    d_model: !var "hidden_size"
    d_feedforward: !var "dim_feedforward"
<< endblock feedforward_factory


== h3("Attention Factory")

-- block attention_factory
.define: &attention_factory !partial:{{ns.model_src}}single_head_attn.py:SingleHeadAttn@attention_factory
    d_model: !var "hidden_size"
<< endblock attention_factory


== h3("Layer Factory")

-- block layer_factory
.define: &layer_factory !partial:{{ns.model_src}}pre_ln_layer.py:PreLNLayer@layer_factory
    feedforward_factory: *feedforward_factory
    attention_factory: *attention_factory
    norm_factory: *layer_norm_factory
<< endblock layer_factory


== h3("Layer Stack Factory")

-- block layer_stack_factory
.define: &layer_stack_factory !factory:{{ns.model_src}}layer_stack.py:LayerStack@layer_stack_factory
    layer_factory: *layer_factory
    post_norm_factory: *layer_norm_factory
    num_hidden_layers: !var "n_layers"
<< endblock layer_stack_factory


== h3("Model")

-- block model
## This block is not nearly as factored-out as the others, using inline definitions.
.define: &model !call:{{ns.model_src}}causal_lm.py:CasualLM@model
    loss_fn: !factory:{{ns.model_src}}causal_loss.py:CausalLoss
    input_encoder: !factory:{{ns.model_src}}input_encoder.py:InputEncoder
        d_model: !var "hidden_size"
        vocab_size: {{ ns.vocab_size }}
    output_decoder: !factory:torch.nn:Linear [ !var "hidden_size", {{ ns.vocab_size }} ]
    init_weights: !partial:{{ns.model_src}}init_weights.py:simple_weight_init
    layer_stack: *layer_stack_factory
<< endblock model

== h3("Optimizer")

-- block optimizer
## Define an optimizer
optimizer: !partial:torch.optim:AdamW
    lr: 1.0e-3
<< endblock optimizer

meta:
    d_model: !var "hidden_size"
    vocab_size: {{ ns.vocab_size }}

## Main output
main: *model

```



## Preprocess the Template

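This runs only the Jinja preprocessor. The result looks much like the original template, but with the Jinja statements resolved: blocks are expanded, `ns` variables are substituted, and `##` comments are stripped.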

In [5]:
env = ConfigEnvironment()

pp_config = env.preprocess(template_path)
nb.display_codeblock("yaml", pp_config, "### Pre Processed Template")

### Pre Processed Template
```yaml
#---------------------------------------
#                 Control                
#---------------------------------------
# 2025-06-17T08:07:23
# Description: Baseline Control
# model_src = ../../../modelsrc/transformer/
# Current Working Dir: "/home/dinalt/ai_assets/forgather/examples/basic/syntax"
# Forgather Config Dir: "/home/dinalt/.config/forgather"

########### Model Definition ###########

# **Layer Norm Factory**

.define: &layer_norm_factory !lambda:torch.nn:LayerNorm@layer_norm_factory
    - !var "hidden_size"

# **Activation Factory**

.define: &activation_factory !partial:torch.nn:ReLU@activation_factory []

# **Feedforward Factory**

.define: &feedforward_factory !partial:../../../modelsrc/transformer/feedforward_layer.py:FeedforwardLayer@feedforward_factory
    activation_factory: *activation_factory
    d_model: !var "hidden_size"
    d_feedforward: !var "dim_feedforward"

# **Attention Factory**

.define: &attention_factory !partial:../../../modelsrc/transformer/single_head_attn.py:SingleHeadAttn@attention_factory
    d_model: !var "hidden_size"

# **Layer Factory**

.define: &layer_factory !partial:../../../modelsrc/transformer/pre_ln_layer.py:PreLNLayer@layer_factory
    feedforward_factory: *feedforward_factory
    attention_factory: *attention_factory
    norm_factory: *layer_norm_factory

# **Layer Stack Factory**

.define: &layer_stack_factory !factory:../../../modelsrc/transformer/layer_stack.py:LayerStack@layer_stack_factory
    layer_factory: *layer_factory
    post_norm_factory: *layer_norm_factory
    num_hidden_layers: !var "n_layers"

# **Model**

.define: &model !call:../../../modelsrc/transformer/causal_lm.py:CasualLM@model
    loss_fn: !factory:../../../modelsrc/transformer/causal_loss.py:CausalLoss
    input_encoder: !factory:../../../modelsrc/transformer/input_encoder.py:InputEncoder
        d_model: !var "hidden_size"
        vocab_size: 1024
    output_decoder: !factory:torch.nn:Linear [ !var "hidden_size", 1024 ]
    init_weights: !partial:../../../modelsrc/transformer/init_weights.py:simple_weight_init
    layer_stack: *layer_stack_factory
# **Optimizer**

optimizer: !partial:torch.optim:AdamW
    lr: 1.0e-3
meta:
    d_model: !var "hidden_size"
    vocab_size: 1024

main: *model

```



## Construct Model Instance

In [6]:
graph = env.load(template_path).config

model_args = dict(
    hidden_size=64,
    dim_feedforward=256,
    n_layers=2,
)

graph.main(context_vars=model_args)

CasualLM(
  loss_fn=CausalLoss()
  (input_encoder): InputEncoder(
    d_model=64, vocab_size=1024
    (dropout): Dropout(p=0.1, inplace=False)
    (embedding): Embedding(1024, 64)
  )
  (output_decoder): Linear(in_features=64, out_features=1024, bias=True)
  (layer_stack): LayerStack(
    (layers): ModuleList(
      (0-1): 2 x PreLNLayer(
        (feedforward): FeedforwardLayer(
          d_model=64, d_feedforward=256
          (linear1): Linear(in_features=64, out_features=256, bias=True)
          (dropout): Identity()
          (activation): ReLU()
          (linear2): Linear(in_features=256, out_features=64, bias=True)
        )
        (attention): SingleHeadAttn(
          d_model=64, bias=True
          (query_key_linear): Linear(in_features=64, out_features=64, bias=True)
          (value_linear): Linear(in_features=64, out_features=64, bias=True)
        )
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, element
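
The `!var` tags in the template are placeholders that are only bound when `context_vars` is supplied at construction time. A minimal sketch of that late-binding idea, using a hypothetical `Var` class and `materialize` function rather than Forgather's real node classes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Hypothetical placeholder for a value supplied at construction time."""
    name: str


def materialize(node, context_vars: dict):
    """Recursively replace every Var in nested dicts/lists with its bound value."""
    if isinstance(node, Var):
        if node.name not in context_vars:
            raise KeyError(f"undefined variable: {node.name}")
        return context_vars[node.name]
    if isinstance(node, dict):
        return {key: materialize(value, context_vars) for key, value in node.items()}
    if isinstance(node, (list, tuple)):
        return [materialize(item, context_vars) for item in node]
    return node


# Mirrors the 'meta' section of the template: one variable, one literal.
meta_template = {"d_model": Var("hidden_size"), "vocab_size": 1024}
meta = materialize(meta_template, {"hidden_size": 64, "n_layers": 2})
print(meta)
```

Because the placeholders are resolved per call, the same template can be constructed repeatedly with different hyperparameters.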

### Override Something

For our experiment, we will want to change just one variable.

In [7]:
experiment_config = """
-- extends 'templates/model_def.yaml'

-- block meta_config
    ## This includes the definition from the parent.
    == super()
    -- set ns.config_name = "No Bias"
    -- set ns.config_description = "Disabled bias in attention. Does it matter?"
<< endblock meta_config

-- block attention_factory
    == super()

    ## And add an override. We are essentially just appending more arguments to the definition.
    # Experiment override.
    bias: False
<< endblock attention_factory
"""

output = env.load_from_string(experiment_config)
nb.display_codeblock("yaml", output.pp_config, "#### Pre Processed Experiment Config")

#### Pre Processed Experiment Config
```yaml

#---------------------------------------
#                 No Bias                
#---------------------------------------
# 2025-06-17T08:07:25
# Description: Disabled bias in attention. Does it matter?
# model_src = ../../../modelsrc/transformer/
# Current Working Dir: "/home/dinalt/ai_assets/forgather/examples/basic/syntax"
# Forgather Config Dir: "/home/dinalt/.config/forgather"

########### Model Definition ###########

# **Layer Norm Factory**

.define: &layer_norm_factory !lambda:torch.nn:LayerNorm@layer_norm_factory
    - !var "hidden_size"

# **Activation Factory**

.define: &activation_factory !partial:torch.nn:ReLU@activation_factory []

# **Feedforward Factory**

.define: &feedforward_factory !partial:../../../modelsrc/transformer/feedforward_layer.py:FeedforwardLayer@feedforward_factory
    activation_factory: *activation_factory
    d_model: !var "hidden_size"
    d_feedforward: !var "dim_feedforward"

# **Attention Factory**

.define: &attention_factory !partial:../../../modelsrc/transformer/single_head_attn.py:SingleHeadAttn@attention_factory
    d_model: !var "hidden_size"

    # Experiment override.
    bias: False

# **Layer Factory**

.define: &layer_factory !partial:../../../modelsrc/transformer/pre_ln_layer.py:PreLNLayer@layer_factory
    feedforward_factory: *feedforward_factory
    attention_factory: *attention_factory
    norm_factory: *layer_norm_factory

# **Layer Stack Factory**

.define: &layer_stack_factory !factory:../../../modelsrc/transformer/layer_stack.py:LayerStack@layer_stack_factory
    layer_factory: *layer_factory
    post_norm_factory: *layer_norm_factory
    num_hidden_layers: !var "n_layers"

# **Model**

.define: &model !call:../../../modelsrc/transformer/causal_lm.py:CasualLM@model
    loss_fn: !factory:../../../modelsrc/transformer/causal_loss.py:CausalLoss
    input_encoder: !factory:../../../modelsrc/transformer/input_encoder.py:InputEncoder
        d_model: !var "hidden_size"
        vocab_size: 1024
    output_decoder: !factory:torch.nn:Linear [ !var "hidden_size", 1024 ]
    init_weights: !partial:../../../modelsrc/transformer/init_weights.py:simple_weight_init
    layer_stack: *layer_stack_factory
# **Optimizer**

optimizer: !partial:torch.optim:AdamW
    lr: 1.0e-3
meta:
    d_model: !var "hidden_size"
    vocab_size: 1024

main: *model

```
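
The `== super()` call in the override behaves much like `super()` in a Python subclass: the parent's block is emitted first, and the child then appends to it. The classes below are only an analogy, with each method standing in for a template block; they are not part of Forgather:

```python
class ModelDef:
    """Stand-in for the parent template; each method plays the role of a block."""

    def attention_factory(self) -> dict:
        return {"d_model": "hidden_size"}


class NoBias(ModelDef):
    """Stand-in for the derived template."""

    def attention_factory(self) -> dict:
        args = super().attention_factory()  # like "== super()"
        args["bias"] = False                # the appended override
        return args


parent_args = ModelDef().attention_factory()
child_args = NoBias().attention_factory()
print(parent_args, child_args)
```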



## Construct Experiment Model

The model has now been modified: bias is disabled on the attention module, as shown by `bias=False` in the output below.

In [8]:
graph = output.config
graph.main(context_vars=model_args)

CasualLM(
  loss_fn=CausalLoss()
  (input_encoder): InputEncoder(
    d_model=64, vocab_size=1024
    (dropout): Dropout(p=0.1, inplace=False)
    (embedding): Embedding(1024, 64)
  )
  (output_decoder): Linear(in_features=64, out_features=1024, bias=True)
  (layer_stack): LayerStack(
    (layers): ModuleList(
      (0-1): 2 x PreLNLayer(
        (feedforward): FeedforwardLayer(
          d_model=64, d_feedforward=256
          (linear1): Linear(in_features=64, out_features=256, bias=True)
          (dropout): Identity()
          (activation): ReLU()
          (linear2): Linear(in_features=256, out_features=64, bias=True)
        )
        (attention): SingleHeadAttn(
          d_model=64, bias=False
          (query_key_linear): Linear(in_features=64, out_features=64, bias=False)
          (value_linear): Linear(in_features=64, out_features=64, bias=False)
        )
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elem

### Implementation Override

Unlike most configuration systems, we can change not only numerical parameters but also the implementation itself!

Let's replace the simple single-head attention module with a multihead-attention module.

In [9]:
experiment_config = """
-- extends 'templates/model_def.yaml'

-- block meta_config
    == super()
    -- set ns.config_name = "Multihead Attention"
    -- set ns.config_description = "Swapped singlehead attention for multihead attention."
    -- set ns.attention_heads = 2
<< endblock meta_config


-- block attention_factory
# Experiment Override.
.define: &attention_factory !partial:{{ns.model_src}}causal_multihead_attn.py:CausalMultiheadAttn@attention_factory
    d_model: !var "hidden_size"
    num_heads: {{ ns.attention_heads }}
<< endblock attention_factory
"""

output = env.load_from_string(experiment_config)
nb.display_codeblock("yaml", output.pp_config, "#### Pre Processed Experiment Config")

#### Pre Processed Experiment Config
```yaml

#---------------------------------------
#           Multihead Attention          
#---------------------------------------
# 2025-06-17T08:07:28
# Description: Swapped singlehead attention for multihead attention.
# model_src = ../../../modelsrc/transformer/
# Current Working Dir: "/home/dinalt/ai_assets/forgather/examples/basic/syntax"
# Forgather Config Dir: "/home/dinalt/.config/forgather"

########### Model Definition ###########

# **Layer Norm Factory**

.define: &layer_norm_factory !lambda:torch.nn:LayerNorm@layer_norm_factory
    - !var "hidden_size"

# **Activation Factory**

.define: &activation_factory !partial:torch.nn:ReLU@activation_factory []

# **Feedforward Factory**

.define: &feedforward_factory !partial:../../../modelsrc/transformer/feedforward_layer.py:FeedforwardLayer@feedforward_factory
    activation_factory: *activation_factory
    d_model: !var "hidden_size"
    d_feedforward: !var "dim_feedforward"

# **Attention Factory**

# Experiment Override.
.define: &attention_factory !partial:../../../modelsrc/transformer/causal_multihead_attn.py:CausalMultiheadAttn@attention_factory
    d_model: !var "hidden_size"
    num_heads: 2

# **Layer Factory**

.define: &layer_factory !partial:../../../modelsrc/transformer/pre_ln_layer.py:PreLNLayer@layer_factory
    feedforward_factory: *feedforward_factory
    attention_factory: *attention_factory
    norm_factory: *layer_norm_factory

# **Layer Stack Factory**

.define: &layer_stack_factory !factory:../../../modelsrc/transformer/layer_stack.py:LayerStack@layer_stack_factory
    layer_factory: *layer_factory
    post_norm_factory: *layer_norm_factory
    num_hidden_layers: !var "n_layers"

# **Model**

.define: &model !call:../../../modelsrc/transformer/causal_lm.py:CasualLM@model
    loss_fn: !factory:../../../modelsrc/transformer/causal_loss.py:CausalLoss
    input_encoder: !factory:../../../modelsrc/transformer/input_encoder.py:InputEncoder
        d_model: !var "hidden_size"
        vocab_size: 1024
    output_decoder: !factory:torch.nn:Linear [ !var "hidden_size", 1024 ]
    init_weights: !partial:../../../modelsrc/transformer/init_weights.py:simple_weight_init
    layer_stack: *layer_stack_factory
# **Optimizer**

optimizer: !partial:torch.optim:AdamW
    lr: 1.0e-3
meta:
    d_model: !var "hidden_size"
    vocab_size: 1024

main: *model

```



### Examine the Graph

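Internally, the processed configuration is represented as an abstract node graph. Note that `graph` was not reassigned after the multihead experiment, so the cells below still show the "No Bias" configuration.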

In [10]:
nb.display_codeblock("python", pformat(graph), "### Node Graph")

### Node Graph
```python
{'main': SingletonNode('../../../modelsrc/transformer/causal_lm.py:CasualLM', *(), identity='model', **{'loss_fn': FactoryNode('../../../modelsrc/transformer/causal_loss.py:CausalLoss', *(), identity=140135030590480, **{}), 'input_encoder': FactoryNode('../../../modelsrc/transformer/input_encoder.py:InputEncoder', *(), identity=140135030589424, **{'d_model': VarNode('hidden_size', identity=140135030602960, value=Undefined), 'vocab_size': 1024}), 'output_decoder': FactoryNode('torch.nn:Linear', *(VarNode('hidden_size', identity=140135030589664, value=Undefined), 1024), identity=140135030589472, **{}), 'init_weights': LambdaNode('../../../modelsrc/transformer/init_weights.py:simple_weight_init', *(), identity=140135030589232, **{}), 'layer_stack': FactoryNode('../../../modelsrc/transformer/layer_stack.py:LayerStack', *(), identity='layer_stack_factory', **{'layer_factory': LambdaNode('../../../modelsrc/transformer/pre_ln_layer.py:PreLNLayer', *(), identity='layer_factory', **{'feedforward_factory': LambdaNode('../../../modelsrc/transformer/feedforward_layer.py:FeedforwardLayer', *(), identity='feedforward_factory', **{'activation_factory': LambdaNode('torch.nn:ReLU', *(), identity='activation_factory', **{}), 'd_model': VarNode('hidden_size', identity=140135030603728, value=Undefined), 'd_feedforward': VarNode('dim_feedforward', identity=140135030600608, value=Undefined)}), 'attention_factory': LambdaNode('../../../modelsrc/transformer/single_head_attn.py:SingleHeadAttn', *(), identity='attention_factory', **{'d_model': VarNode('hidden_size', identity=140135030590192, value=Undefined), 'bias': False}), 'norm_factory': LambdaNode('torch.nn:LayerNorm', *(VarNode('hidden_size', identity=140135030591872, value=Undefined),), identity='layer_norm_factory', **{})}), 'post_norm_factory': LambdaNode('torch.nn:LayerNorm', *(VarNode('hidden_size', identity=140135030591872, value=Undefined),), identity='layer_norm_factory', **{}), 'num_hidden_layers': VarNode('n_layers', 
identity=140135030600416, value=Undefined)})}),
 'meta': {'d_model': VarNode('hidden_size', identity=140135030589520, value=Undefined),
          'vocab_size': 1024},
 'optimizer': LambdaNode('torch.optim:AdamW', *(), identity=140135030589376, **{'lr': 0.001})}

```
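
Judging by the generated code later in this notebook, the three node types differ in when and how often the target is called: a `LambdaNode` yields a callable, a `FactoryNode` calls the target each time it is referenced, and a `SingletonNode` appears to call it once and share the result. The classes below are hypothetical stand-ins that only mimic that behaviour; they are not Forgather's node classes:

```python
from functools import partial


class Lambda:
    """Yields a callable; the target is not called during construction."""

    def __init__(self, fn, *args, **kwargs):
        self.fn, self.args, self.kwargs = fn, args, kwargs

    def construct(self):
        return partial(self.fn, *self.args, **self.kwargs)


class Factory(Lambda):
    """Calls the target anew each time the node is constructed."""

    def construct(self):
        return self.fn(*self.args, **self.kwargs)


class Singleton(Factory):
    """Calls the target once and returns the cached result afterwards."""

    _unset = object()

    def __init__(self, fn, *args, **kwargs):
        super().__init__(fn, *args, **kwargs)
        self._value = self._unset

    def construct(self):
        if self._value is self._unset:
            self._value = super().construct()
        return self._value


factory, singleton, lam = Factory(list), Singleton(list), Lambda(dict, lr=1e-3)
fresh_each_time = factory.construct() is not factory.construct()
shared = singleton.construct() is singleton.construct()
optimizer_args = lam.construct()(weight_decay=0.0)
print(fresh_each_time, shared, optimizer_args)
```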



### Convert Graph to YAML

Convert the node graph to a YAML representation. This may not be exactly the same as the source template, but it should be semantically equivalent.

In [11]:
nb.display_codeblock("yaml", to_yaml(graph))

```yaml
.define: &activation_factory !lambda:torch.nn:ReLU@activation_factory []

.define: &feedforward_factory !lambda:../../../modelsrc/transformer/feedforward_layer.py:FeedforwardLayer@feedforward_factory
    activation_factory: *activation_factory
    d_model: !var 'hidden_size'
    d_feedforward: !var 'dim_feedforward'

.define: &attention_factory !lambda:../../../modelsrc/transformer/single_head_attn.py:SingleHeadAttn@attention_factory
    d_model: !var 'hidden_size'
    bias: False

.define: &layer_norm_factory !lambda:torch.nn:LayerNorm@layer_norm_factory
    - !var 'hidden_size'

.define: &layer_factory !lambda:../../../modelsrc/transformer/pre_ln_layer.py:PreLNLayer@layer_factory
    feedforward_factory: *feedforward_factory
    attention_factory: *attention_factory
    norm_factory: *layer_norm_factory

.define: &layer_stack_factory !factory:../../../modelsrc/transformer/layer_stack.py:LayerStack@layer_stack_factory
    layer_factory: *layer_factory
    post_norm_factory: *layer_norm_factory
    num_hidden_layers: !var 'n_layers'

.define: &model !singleton:../../../modelsrc/transformer/causal_lm.py:CasualLM@model
    loss_fn: !factory:../../../modelsrc/transformer/causal_loss.py:CausalLoss []
    input_encoder: !factory:../../../modelsrc/transformer/input_encoder.py:InputEncoder
        d_model: !var 'hidden_size'
        vocab_size: 1024
    output_decoder: !factory:torch.nn:Linear
        - !var 'hidden_size'
        - 1024
    init_weights: !lambda:../../../modelsrc/transformer/init_weights.py:simple_weight_init []
    layer_stack: *layer_stack_factory


optimizer: !lambda:torch.optim:AdamW
    lr: 0.001
meta: 
    d_model: !var 'hidden_size'
    vocab_size: 1024
main: *model

```



### Convert Graph to Python

`generate_code` takes the output from `Latent.to_py(graph)` and renders it as Python code using a Jinja2 template. If no template is specified, a built-in template is used, which generates the import and dynamic-import statements where required.

In [12]:
from forgather.graph_encoder import NamePolicy # NamePolicy.REQUIRED | NamePolicy.ALL | NamePolicy.NAMED
generated_code = generate_code(graph.main, name_policy=None)
nb.display_codeblock("python", generated_code, "### Generated Code", )

### Generated Code
```python
from torch.nn import LayerNorm
from torch.nn import Linear
from torch.nn import ReLU
from importlib.util import spec_from_file_location, module_from_spec
import os
import sys
from functools import partial

# Import a dynamic module.
def dynimport(module, name, searchpath):
    module_path = module
    module_name = os.path.basename(module).split(".")[0]
    module_spec = spec_from_file_location(
        module_name,
        module_path,
        submodule_search_locations=searchpath,
    )
    mod = module_from_spec(module_spec)
    sys.modules[module_name] = mod
    module_spec.loader.exec_module(mod)
    for symbol in name.split("."):
        mod = getattr(mod, symbol)
    return mod

CasualLM = lambda: dynimport("../../../modelsrc/transformer/causal_lm.py", "CasualLM", ())
CausalLoss = lambda: dynimport("../../../modelsrc/transformer/causal_loss.py", "CausalLoss", ())
FeedforwardLayer = lambda: dynimport("../../../modelsrc/transformer/feedforward_layer.py", "FeedforwardLayer", ())
simple_weight_init = lambda: dynimport("../../../modelsrc/transformer/init_weights.py", "simple_weight_init", ())
InputEncoder = lambda: dynimport("../../../modelsrc/transformer/input_encoder.py", "InputEncoder", ())
LayerStack = lambda: dynimport("../../../modelsrc/transformer/layer_stack.py", "LayerStack", ())
PreLNLayer = lambda: dynimport("../../../modelsrc/transformer/pre_ln_layer.py", "PreLNLayer", ())
SingleHeadAttn = lambda: dynimport("../../../modelsrc/transformer/single_head_attn.py", "SingleHeadAttn", ())

def construct(
    dim_feedforward,
    hidden_size,
    n_layers,
):
    activation_factory = partial(ReLU, )

    feedforward_factory = partial(FeedforwardLayer(), 
        activation_factory=activation_factory,
        d_model=hidden_size,
        d_feedforward=dim_feedforward,
    )

    attention_factory = partial(SingleHeadAttn(), 
        d_model=hidden_size,
        bias=False,
    )

    layer_norm_factory = partial(LayerNorm, 
        hidden_size,
    )

    layer_factory = partial(PreLNLayer(), 
        feedforward_factory=feedforward_factory,
        attention_factory=attention_factory,
        norm_factory=layer_norm_factory,
    )

    layer_stack_factory = partial(LayerStack(), 
        layer_factory=layer_factory,
        post_norm_factory=layer_norm_factory,
        num_hidden_layers=n_layers,
    )

    model = CasualLM()(
        loss_fn=CausalLoss()(),
        input_encoder=InputEncoder()(
            d_model=hidden_size,
            vocab_size=1024,
        ),
        output_decoder=Linear(
            hidden_size,
            1024,
        ),
        init_weights=partial(simple_weight_init(), ),
        layer_stack=layer_stack_factory(),
    )
    
    return model

```
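
The generated `dynimport` helper loads a class from a file path rather than from an installed package. The sketch below exercises the same `importlib` mechanism on a throwaway module written to a temporary directory; `load_symbol`, `demo_layer.py`, and `DemoLayer` are made-up names for illustration only:

```python
import os
import sys
import tempfile
from importlib.util import module_from_spec, spec_from_file_location


def load_symbol(path: str, name: str):
    """Load the module at 'path' and return the attribute called 'name'."""
    module_name = os.path.basename(path).split(".")[0]
    spec = spec_from_file_location(module_name, path)
    module = module_from_spec(spec)
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return getattr(module, name)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for a file such as single_head_attn.py.
    module_path = os.path.join(tmp_dir, "demo_layer.py")
    with open(module_path, "w") as f:
        f.write(
            "class DemoLayer:\n"
            "    def __init__(self, d_model):\n"
            "        self.d_model = d_model\n"
        )

    DemoLayer = load_symbol(module_path, "DemoLayer")
    layer = DemoLayer(d_model=64)

print(type(layer).__name__, layer.d_model)
```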



### Custom Code Template

The above code is pretty generic. How about we wrap this class with a HF PreTrainedModel?  
[./templates/causal_lm.py](./templates/causal_lm.py)

In [13]:
generated_code = generate_code(graph.main, template_name="templates/causal_lm.py", model_type="my_model")
nb.display_codeblock("python", generated_code, "### Generated Code", )

### Generated Code
```python
# See: https://huggingface.co/docs/transformers/custom_models
# This is a template model, with the details filled-in by the code-generator.
from typing import Optional, Tuple

from functools import partial
from torch import nn, Tensor, LongTensor, FloatTensor
import torch
from transformers.modeling_outputs import CausalLMOutput
from transformers import (
    PreTrainedModel,
    PretrainedConfig,
    AutoConfig,
    AutoModelForCausalLM,
    GenerationMixin,
)

from torch.nn import LayerNorm
from torch.nn import Linear
from torch.nn import ReLU

from importlib.util import spec_from_file_location, module_from_spec
import os
import sys
from functools import partial

# Import a dynamic module.
def dynimport(module, name, searchpath):
    module_path = module
    module_name = os.path.basename(module).split(".")[0]
    module_spec = spec_from_file_location(
        module_name,
        module_path,
        submodule_search_locations=searchpath,
    )
    mod = module_from_spec(module_spec)
    sys.modules[module_name] = mod
    module_spec.loader.exec_module(mod)
    for symbol in name.split("."):
        mod = getattr(mod, symbol)
    return mod

CasualLM = lambda: dynimport("../../../modelsrc/transformer/causal_lm.py", "CasualLM", ())
CausalLoss = lambda: dynimport("../../../modelsrc/transformer/causal_loss.py", "CausalLoss", ())
FeedforwardLayer = lambda: dynimport("../../../modelsrc/transformer/feedforward_layer.py", "FeedforwardLayer", ())
simple_weight_init = lambda: dynimport("../../../modelsrc/transformer/init_weights.py", "simple_weight_init", ())
InputEncoder = lambda: dynimport("../../../modelsrc/transformer/input_encoder.py", "InputEncoder", ())
LayerStack = lambda: dynimport("../../../modelsrc/transformer/layer_stack.py", "LayerStack", ())
PreLNLayer = lambda: dynimport("../../../modelsrc/transformer/pre_ln_layer.py", "PreLNLayer", ())
SingleHeadAttn = lambda: dynimport("../../../modelsrc/transformer/single_head_attn.py", "SingleHeadAttn", ())

model_type = "my_model"


class DynamicCausalLMConfig(PretrainedConfig):
    model_type = model_type


class DynamicCasualLM(PreTrainedModel, GenerationMixin):
    config_class = DynamicCausalLMConfig
    model_type = model_type

    def __init__(self, config: PretrainedConfig):
        super().__init__(config)
        self.causal_lm = self.construct_model(**config.to_dict())
        if "torch_dtype" in config:
            self.to(config.torch_dtype)

    @staticmethod
    def construct_model(
        dim_feedforward,
        hidden_size,
        n_layers,
        **kwargs
    ):
        activation_factory = partial(ReLU, )

        feedforward_factory = partial(FeedforwardLayer(), 
            activation_factory=activation_factory,
            d_model=hidden_size,
            d_feedforward=dim_feedforward,
        )

        attention_factory = partial(SingleHeadAttn(), 
            d_model=hidden_size,
            bias=False,
        )

        layer_norm_factory = partial(LayerNorm, 
            hidden_size,
        )

        layer_factory = partial(PreLNLayer(), 
            feedforward_factory=feedforward_factory,
            attention_factory=attention_factory,
            norm_factory=layer_norm_factory,
        )

        layer_stack_factory = partial(LayerStack(), 
            layer_factory=layer_factory,
            post_norm_factory=layer_norm_factory,
            num_hidden_layers=n_layers,
        )

        model = CasualLM()(
            loss_fn=CausalLoss()(),
            input_encoder=InputEncoder()(
                d_model=hidden_size,
                vocab_size=1024,
            ),
            output_decoder=Linear(
                hidden_size,
                1024,
            ),
            init_weights=partial(simple_weight_init(), ),
            layer_stack=layer_stack_factory(),
        )
        
        return model

    def forward(
        self,
        input_ids: LongTensor,
        labels: Optional[LongTensor] = None,
        position_ids: Optional[LongTensor] = None,
        attention_mask: Optional[FloatTensor] = None,
        return_dict: bool = False,
        **kwargs,
    ) -> CausalLMOutput | Tuple[FloatTensor, dict[str, FloatTensor]] | FloatTensor:

        outputs = self.causal_lm(
            input_ids=input_ids,
            labels=labels,
            position_ids=position_ids,
            attention_mask=attention_mask,
            **kwargs,
        )

        # Return type depends on arguments.
        if return_dict:
            return CausalLMOutput(**outputs)
        elif labels is not None:
            return (outputs["loss"], outputs["logits"])
        else:
            return outputs["logits"]

    # Bare-minimum for HF text generation interface to work.
    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        attention_mask = kwargs.get("attention_mask", None)
        model_inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
        return model_inputs


AutoConfig.register(model_type, DynamicCausalLMConfig)
AutoModelForCausalLM.register(DynamicCausalLMConfig, DynamicCasualLM)

```
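
Compared with the generic output, `construct_model` gains a trailing `**kwargs`. This matters because `config.to_dict()` returns every attribute of the configuration, not just the three the model needs. A small sketch of the effect, with a made-up dictionary standing in for the real config:

```python
def construct_model(dim_feedforward, hidden_size, n_layers, **kwargs):
    """Keep the three hyperparameters; silently ignore anything else."""
    return {
        "dim_feedforward": dim_feedforward,
        "hidden_size": hidden_size,
        "n_layers": n_layers,
    }


# Hypothetical config dictionary; a real PretrainedConfig.to_dict() has many more keys.
config_dict = {
    "hidden_size": 128,
    "dim_feedforward": 512,
    "n_layers": 3,
    "model_type": "my_model",
}

used = construct_model(**config_dict)
print(used)
```

Without the `**kwargs` catch-all, the extra `model_type` key would raise a `TypeError`.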



## Execute Generated Code

Execute the generated code, then instantiate the generated config and model classes; the model's constructor calls the generated `construct_model` function to build the objects.

Note: When constructing objects directly from a graph, there is no intermediate code-generation step; the objects are built straight from the graph. This example is primarily to illustrate that you can export a graph as code, which can be useful if you wish to export the code along with the model weights.

In [14]:
exec(generated_code)

In [15]:
model_config = DynamicCausalLMConfig(hidden_size=128, dim_feedforward=512, n_layers=3)
model = DynamicCasualLM(model_config)
model

DynamicCasualLM(
  (causal_lm): CasualLM(
    loss_fn=CausalLoss()
    (input_encoder): InputEncoder(
      d_model=128, vocab_size=1024
      (dropout): Dropout(p=0.1, inplace=False)
      (embedding): Embedding(1024, 128)
    )
    (output_decoder): Linear(in_features=128, out_features=1024, bias=True)
    (layer_stack): LayerStack(
      (layers): ModuleList(
        (0-2): 3 x PreLNLayer(
          (feedforward): FeedforwardLayer(
            d_model=128, d_feedforward=512
            (linear1): Linear(in_features=128, out_features=512, bias=True)
            (dropout): Identity()
            (activation): ReLU()
            (linear2): Linear(in_features=512, out_features=128, bias=True)
          )
          (attention): SingleHeadAttn(
            d_model=128, bias=False
            (query_key_linear): Linear(in_features=128, out_features=128, bias=False)
            (value_linear): Linear(in_features=128, out_features=128, bias=False)
          )
          (norm1): LayerNorm((12

In [17]:
# Let's get the metadata and the optimizer constructor.
# We can construct other targets in the graph like this...
outputs = Latent.materialize(
    graph, mtargets=["optimizer", "meta"], context_vars=model_args
)
optim_ctor = outputs["optimizer"]
meta = outputs["meta"]
optim_ctor, meta

(functools.partial(LambdaNode('torch.optim:AdamW', *(), identity=140135030589376, **{'lr': 0.001}), context_vars={'hidden_size': 64, 'dim_feedforward': 256, 'n_layers': 2}),
 {'d_model': 64, 'vocab_size': 1024})

This is where partial functions are useful. We created a callable object in which the hyperparameters are already specified: `lr` has been baked in, so all we have to do is pass in the missing argument, the model's parameters.

In [18]:
optimizer = optim_ctor(model.named_parameters())
optimizer

AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    decoupled_weight_decay: True
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.001
    maximize: False
    param_names: ['causal_lm.input_encoder.embedding.weight', 'causal_lm.output_decoder.weight', 'causal_lm.output_decoder.bias', 'causal_lm.layer_stack.layers.0.feedforward.linear1.weight', 'causal_lm.layer_stack.layers.0.feedforward.linear1.bias', 'causal_lm.layer_stack.layers.0.feedforward.linear2.weight', 'causal_lm.layer_stack.layers.0.feedforward.linear2.bias', 'causal_lm.layer_stack.layers.0.attention.query_key_linear.weight', 'causal_lm.layer_stack.layers.0.attention.value_linear.weight', 'causal_lm.layer_stack.layers.0.norm1.weight', 'causal_lm.layer_stack.layers.0.norm1.bias', 'causal_lm.layer_stack.layers.0.norm2.weight', 'causal_lm.layer_stack.layers.0.norm2.bias', 'causal_lm.layer_stack.layers.1.feedforward.linear1.weight', 'causal_lm.layer_sta
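
The partial-application pattern used for the optimizer can be reproduced with `functools.partial` alone. `ToyOptimizer` below is a hypothetical stand-in for a real optimizer class, so the sketch runs without PyTorch:

```python
from functools import partial


class ToyOptimizer:
    """Hypothetical stand-in for an optimizer such as torch.optim.AdamW."""

    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr


# Bind the hyperparameter now...
toy_ctor = partial(ToyOptimizer, lr=1.0e-3)

# ...and supply the parameters later, once a model exists.
toy_optimizer = toy_ctor(["weight", "bias"])
print(toy_optimizer.lr, toy_optimizer.params)
```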

In [19]:
import torch

# We can then grab the vocab_size from the meta-data and construct a batch of random input-ids.
# Create a batch of 2 x 16 input-ids
input_ids = torch.randint(meta['vocab_size'], (2, 16,))
input_ids

tensor([[ 225,   62,  989,  864,  337,  196,   24,  847,  584, 1019,  447,  484,
          187,  861,  433,  640],
        [ 816,  304,  517,  326,  277,  393,  160,  560,  927,   19,  906,   28,
          517,  146,   86,   38]])

In [21]:
# Feed the inputs through the model and get loss and logits.
loss, logits = model(input_ids, labels=input_ids)

# Backward step
loss.backward()

# Step the optimizer
optimizer.step()

# Reset grad
optimizer.zero_grad()

# Show loss
loss.item()

7.038168907165527
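
A loss near 7 is plausible for an untrained model. Assuming the loss is a mean cross-entropy in nats, and that a freshly initialized model predicts close to uniformly over the vocabulary, the expected value is about ln(1024), or roughly 6.93. A quick sanity check under those assumptions:

```python
import math

vocab_size = 1024
observed_loss = 7.038168907165527  # value printed by the cell above

# Cross-entropy of a uniform distribution over the vocabulary.
uniform_loss = math.log(vocab_size)
print(f"uniform: {uniform_loss:.3f}, observed: {observed_loss:.3f}")
```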