# Forgather

[forgather/config.py](../forgather/config.py)  
[forgather/latent.py](../forgather/latent.py)  
[forgather/dynamic.py](../forgather/dynamic.py)  

---
What exactly is this "Forgather" thing? What is it good for?

That's a good question. It's probably easiest to just demonstrate...

## forgather.config


In [1]:
import sys, os
modules_path = os.path.join('..')
if modules_path not in sys.path: sys.path.insert(0, modules_path)

from pprint import pformat, pp
from transformers import set_seed

from forgather.config import (
    preprocess_config,
    load_config,
    load_whitelist_as_set,
    materialize_config,
    enumerate_whitelist_exceptions,
    pconfig,
)
#from forgather import Latent
from aiws.dotdict import DotDict

### A Quick Demo

This demonstrates what this package is about...

In [2]:
# Define a training configuration with YAML, including specifying object types.
#
# We use the Yaml SafeLoader, which disallows the creation of arbitrary Python
# objects, but we have added the '!callable' tag. More on that later...
#
# Note that this is not pure Yaml. Jinja (sandboxed) is used as a pre-processor.
yaml_config = """
-- set output_dir = path_join('forgather_demo', 'output_models', 'quick_demo')
# Define the tokenizer to use
.define: &tokenizer !callable:transformers:AutoTokenizer.from_pretrained
    - "../tokenizers/tiny_stories_2k"

# Define a model -- note how we can specify the file path to the module...
.define: &model !callable:../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformer
    - !callable:../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformerConfig
        kwargs:
            hidden_size: 128
            num_hidden_layers: 2

# Define the train and eval datasets
.define: &dataset !callable:datasets:load_from_disk [ "../datasets/tiny_stories_tokenized" ]
.define: &train_dataset !callable:forgather.construct:get_item [ *dataset, "train" ]
.define: &eval_dataset !callable:forgather.construct:get_item [ *dataset, "validation" ]

# Define a trainer
trainer: !callable:aiws.trainer:Trainer
    kwargs:
        model: *model
        train_dataset: *train_dataset
        eval_dataset: *eval_dataset
        tokenizer: *tokenizer
        args: !callable:aiws.trainer_types:TrainingArguments
            kwargs:
                output_dir: "{{ output_dir }}"
                eval_steps: 250
                logging_steps: 50
                max_steps: 500
                eval_strategy: "steps"
                save_strategy: "no"
"""

# As you may have guessed, the '!callable' tags call Python code.
# Which 'Callables' are allowed is controlled by defining a whitelist.
#
# Note: Passing a whitelist is optional, but VERY strongly recommended.
#
# It should go without saying that you should NEVER use an untrusted
# config file with an untrusted whitelist without careful examination.
#
# Care has been taken to try to make this as safe as possible, but I can't promise
# that the security is perfect. I'm not aware of any flaws, but that does not mean that
# they don't exist.
whitelist_yaml = """
- transformers:AutoTokenizer.from_pretrained
- datasets:load_from_disk
- forgather.construct:get_item
- aiws.trainer:Trainer
- aiws.trainer_types:TrainingArguments
- ../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformer
- ../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformerConfig
"""

In [None]:
# Materialize the object definition and use it!
trainer = materialize_config(yaml_config, whitelist_yaml, load_method="from_string").config["trainer"]
trainer.train()

### Digging a little deeper...
That's pretty much what it's for, in a nutshell.

But wait. There's more!

One of the things which has bugged me when working on ML projects is the proliferation of training scripts and configurations. Before long you are working with a copy-of-a-copy-of-a-copy of a configuration and they keep getting longer, more complex, and harder to maintain. Each tends to be a subtle variation of a previous version and as your code-base evolves, compatibility of an older config with a newer script tends to break.

Ultimately, the whole process is directly at odds with the principle of "[Don't Repeat Yourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)."

Using Yaml to define the configuration is definitely a step up from defining long strings of command-line arguments or even JSON, but it still does not solve the DRY problem.

We can improve on this by using [Jinja](https://jinja.palletsprojects.com/en/3.1.x/) as a pre-processor and templatizing the configurations.

The template library is still a work-in-progress, but it is coming along.

**Experiment Definitions**  

[forgather_demo/experiment 1.yaml](forgather_demo/experiment%201.yaml)  
[forgather_demo/experiment 2.yaml](forgather_demo/experiment%202.yaml)  

**Project Definitions**  

[forgather_demo/paths.yaml](forgather_demo/paths.yaml)  
[forgather_demo/defaults.yaml](forgather_demo/defaults.yaml)  
[forgather_demo/whitelist.yaml](forgather_demo/whitelist.yaml)  

**Library Definitions**  

[templates/common/whitelist.yaml](../templates/common/whitelist.yaml)  
[templates/common/defaults.yaml](../templates/common/defaults.yaml)  
[templates/common/helpers.yaml](../templates/common/helpers.yaml)  
[templates/common/trainer/base_trainer.yaml](../templates/common/trainer/base_trainer.yaml)  
[templates/common/causal_lm/base_train.yaml](../templates/common/causal_lm/base_train.yaml)  
[model_zoo/models/vanilla_transformer/vanilla_transformer.yaml](../model_zoo/models/vanilla_transformer/vanilla_transformer.yaml)  
[model_zoo/models/model_zoo_whitelist.yaml](../model_zoo/models/model_zoo_whitelist.yaml)  

**Model Code**  

[model_zoo/models/vanilla_transformer/vanilla_transformer.py](../model_zoo/models/vanilla_transformer/vanilla_transformer.py)  


### Loading a Template Config

Let's start by using a config to get the paths we will need. As we would like to avoid defining the same thing more than once, this config makes use of the project's 'paths.yaml' file.

[forgather_config.yaml](forgather_config.yaml)  
[forgather_demo/paths.yaml](forgather_demo/paths.yaml)  

#### load_config() : Load Jinja/Yaml configuration
```python
load_config(
    config: os.PathLike | str, *,
    preprocess: bool = True,
    search_path: str | List[str] = '.',
    load_method: LoadMethod = DEFAULT_LOAD_METHOD,
    **kwargs,
) -> LoadConfigOutput
```

By default, the single argument is a relative path to a config file which will be preprocessed with Jinja2 and parsed with Yaml.

Any additional keyword-args are passed to the Jinja template.
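
The two stages described above can be sketched with nothing but Jinja2 and PyYAML. This is a minimal illustration, not the actual implementation of load_config(): the function name `load_config_sketch` and the template variables (`base_dir`, `model_name`, `max_steps`) are made up for the example.

```python
import jinja2
import yaml


def load_config_sketch(template_text: str, **kwargs):
    """Hypothetical stand-in for load_config(): render with Jinja2, then parse as YAML."""
    env = jinja2.Environment(line_statement_prefix="--", line_comment_prefix="##")
    # Stage 1: the keyword arguments become template variables.
    rendered = env.from_string(template_text).render(**kwargs)
    # Stage 2: the rendered text is ordinary YAML.
    return yaml.safe_load(rendered)


template_text = """
-- set output_dir = base_dir ~ '/' ~ model_name
output_dir: "{{ output_dir }}"
max_steps: {{ max_steps }}
"""

config = load_config_sketch(
    template_text, base_dir="output_models", model_name="demo", max_steps=500
)
print(config)
```

The point of the split is that the YAML parser never sees any template syntax; by the time it runs, the document is plain data.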

In [2]:
# The 'DotDict' just allows access to the dictionary keys using attribute dot-notation.
metacfg = DotDict(load_config('forgather_config.yaml').config)

# pconfig() is from forgather.config; it just pretty-formats a configuration.
pconfig(metacfg)

project_templates: 'forgather_demo'
templates: '../templates'
tokenizer_dir: '../tokenizers'
datasets_dir: '../datasets'
assets_dir: '..'
search_paths:
  - 'forgather_demo'
  - '../templates'
  - '../model_zoo'
whitelist_path: 'forgather_demo/whitelist.yaml'
model_src_dir: '../model_zoo'
script_dir: '../scripts'
train_script_path: '../scripts/train_script.py'
models_dir: 'forgather_demo/output_models'
dataset_id: 'roneneldan/TinyStories'
tokenizer_def: '../templates/common/tokenizers/tiny_2k_bpe.yaml'
tokenizer_path: '../tokenizers/tiny_stories_2k'
tokenizers_whitelist: '../templates/common/tokenizers/whitelist.yaml'


In [3]:
# Now, each experiment configuration has been reduced to something like this.
with open(os.path.join(metacfg.project_templates, 'experiment 1.yaml'), 'r') as f:
    print(f.read())

-- set experiment = namespace()
-- include 'project_defaults.yaml'
-- set experiment.EXPERIMENT_NAME = 'Single Layer'

-- set experiment.MODEL_CONFIG
{{ experiment.MODEL_CONFIG }}
    # The single variable under study.
    num_hidden_layers: 1
-- endset

-- include 'common/causal_lm/base_train.yaml'


### Preprocess Configuration
First, let's take a closer look at the pre-processed configuration file.

As configured, this generated file will be saved in the test-run directory for the experiment. This should allow for reproducibility; even if you muck about with the template definitions afterwards, the generated configuration will still be available to use on its own.

Also note that this automatically generated a number of comments about the experiment details. Overall, this is much better than hand-crafting a heap of command-line arguments to feed to a script!

#### preprocess_config() : Preprocess a configuration file.
```python
def preprocess_config(
    config:  os.PathLike | str, *,
    search_path: str | List[str] = '.',
    load_method: LoadMethod = DEFAULT_LOAD_METHOD,
    **kwargs,
) -> str:
```

In [4]:
# Select an experiment template
experiment_path = os.path.join(metacfg.project_templates, 'experiment 1.yaml')

# Only preprocess the experiment template
pp_config = preprocess_config(experiment_path, search_path=metacfg.search_paths)

# The pre-processed config is a sub-class of 'str' with an added '.with_line_numbers()' method.
# This can be helpful when trying to diagnose YAML parse errors.
print(pp_config.with_line_numbers(True))

     1: # Single Layer
     2: # 2024-07-08 04:49:42
     3: # Description: Compare the impact of different numbers of layers on model performance
     4: # World Size: 1
     5: # Hostname: hal9000
     6: # Script Args:: N/A
     7: 
     8: # experiment.TOKENIZERS_DIR: "../tokenizers"
     9: # experiment.DATASETS_DIR: "../datasets"
    10: # experiment.MODEL_SRC_DIR: "../model_zoo"
    11: # experiment.MODELS_DIR: "forgather_demo/output_models"
    12: 
    13: # experiment.DATASET: "tiny_stories_tokenized"
    14: # experiment.EXPERIMENT_NAME: "Single Layer"
    15: # experiment.EXPERIMENT_DESCRIPTION: "Compare the impact of different numbers of layers on model performance"
    16: # experiment.MODEL_NAME: "test_model"
    17: # experiment.CREATE_NEW_MODEL: True
    18: # experiment.SAVE_MODEL: True
    19: 
    20: # config.OUTPUT_DIR: path = "forgather_demo/output_models/test_model"
    21: # config.DATASET_PATH: path = "../datasets/tiny_stories_tokenized"
    22: # config.LOGGI

### Configuration Syntax

We use [Jinja2](https://jinja.palletsprojects.com/) for preprocessing and [YAML](https://pyyaml.org/wiki/PyYAMLDocumentation) for the actual configuration. I won't go into details, as these are well covered in the links, but it may be helpful to point out a few non-standard and non-obvious things.

#### Jinja

Jinja is running in a sandboxed environment. This limits what functions and data may be accessed.

We have enabled line-statements and line-comments.

```jinja2
## This is a line-comment. The next line is a line-statement.
-- set foo = 'bar'

## Line comments are shorthand for...
{# ...regular comments and line-statements are short for... #}
{% macro foobar() %}
```
While both of these are regular Jinja features, there is not a standard prefix for either. The prefix is set when the environment is created.
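
As a rough sketch of what "set when the environment is created" means, the prefixes are passed as constructor arguments. `line_statement_prefix` and `line_comment_prefix` are real Jinja2 options; the template text is only an illustration, and Forgather's own environment may be configured with additional settings.

```python
from jinja2.sandbox import SandboxedEnvironment

# The prefixes are fixed per environment; '--' and '##' match the ones used in this document.
env = SandboxedEnvironment(
    line_statement_prefix="--",
    line_comment_prefix="##",
)

template = env.from_string(
    "## This line comment is dropped from the output.\n"
    "-- set foo = 'bar'\n"
    "value: {{ foo }}\n"
)
rendered = template.render()
print(rendered)
```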

Line-comments don't show up in the pre-processed configuration, while regular Yaml comments will show up.  
Warning: Something appears to be broken in Jinja. While line-comments are removed, the associated newline is not and regular whitespace control appears to be unable to remove it. This can result in non-obvious failures.

We have injected a number of symbols into the environment.
```
now : datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    local time
utcnow : datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")
    UTC time
time_ns : str(time.time_ns())
    Timestamp, integer nanoseconds
path_join(...) : os.path.join(...)
    Join path names in an os independent way.
dirname(dir) : os.path.dirname(dir)
    Get the directory part of a path.

# Defaults for the following have been set, but can be overridden by the training-script.
world_size : 1
    The number of concurrent processes used in distributed training.
rank: 0
    The global multiprocess rank. See torch.distributed
local_rank:
    The local multiprocess rank. See torch.distributed
script_args : 'N/A'
    The args passed to the configuration script.
hostname: platform.node()
    The hostname of the machine.
```
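
For illustration, symbols like these can be injected through the environment's `globals` mapping, which is a standard Jinja2 feature. This is only a sketch of the idea, assuming a subset of the symbols listed above; it is not a copy of how Forgather builds its environment.

```python
import os
import platform
import time
from datetime import datetime, timezone

from jinja2.sandbox import SandboxedEnvironment

env = SandboxedEnvironment(line_statement_prefix="--", line_comment_prefix="##")

# Anything placed in env.globals is visible to every template.
env.globals.update(
    now=datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    utcnow=datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
    time_ns=str(time.time_ns()),
    path_join=os.path.join,
    dirname=os.path.dirname,
    world_size=1,
    rank=0,
    script_args="N/A",
    hostname=platform.node(),
)

template = env.from_string(
    "-- set output_dir = path_join('forgather_demo', 'output_models', 'quick_demo')\n"
    "output_dir: {{ output_dir }}\n"
    "parent: {{ dirname(output_dir) }}\n"
    "world_size: {{ world_size }}\n"
)

# Values passed to render() take precedence over the globals,
# which is one way a training script could override the defaults.
rendered = template.render(world_size=2)
print(rendered)
```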

*White-space control*

If you wish to strip leading and/or trailing whitespace surrounding a Jinja statement, add the '-' symbol to the start and/or end tokens.
```jinja2
## Strip both left and right sides
{{- foo + bar -}}
{#- strip only left #}
## Strip only right-side.
{% if foo > 12 -%}
```
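
A quick way to see the effect is to render the same expression with and without the '-' markers. The variable names `foo` and `bar` are just placeholders.

```python
from jinja2 import Environment

env = Environment()
values = {"foo": 1, "bar": 2}

# No whitespace control: the spaces around the expression are kept.
plain = env.from_string("[ {{ foo + bar }} ]").render(values)

# '-' on both tokens strips the whitespace on both sides.
stripped = env.from_string("[ {{- foo + bar -}} ]").render(values)

# '-' on the start token only strips the left side.
left_only = env.from_string("[ {{- foo + bar }} ]").render(values)

print(repr(plain), repr(stripped), repr(left_only))
```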

*Namespaces*

By default, 'included' templates inherit the namespace of the caller, while 'imported' templates do not.

Despite what you may assume, 'include' is not quite the same as directly substituting text. If root template 'A' includes templates 'B' and 'C', the namespace of A is visible to both B and C and vice-versa. What is not obvious is that B and C will not have access to each other's namespaces.

```jinja2
## Content of A.jinja
-- include 'B.jinja'
-- include 'C.jinja'
## Contents of B.jinja
-- set FOO = 1
## Contents of C.jinja
{{ FOO }}
```

This will not work, as 'FOO' will not be visible in C. To work around this, you can declare a namespace in A, which will be visible to both B and C. Any changes by one will be visible to the other.

```jinja2
## Content of A.jinja
-- set experiment = namespace()
-- include 'B.jinja'
-- include 'C.jinja'
## Contents of B.jinja
-- set experiment.FOO = 1
## Contents of C.jinja
{{ experiment.FOO }}
```
This will work.
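
If you want to try both variants without creating files, the following sketch loads the same three templates from an in-memory `DictLoader`. It uses a plain (non-sandboxed) environment with the same line prefixes as above; the helper name `render_root` is made up for the example.

```python
from jinja2 import DictLoader, Environment


def render_root(templates: dict) -> str:
    """Render 'A.jinja' from an in-memory set of templates."""
    env = Environment(
        loader=DictLoader(templates),
        line_statement_prefix="--",
        line_comment_prefix="##",
    )
    return env.get_template("A.jinja").render().strip()


# Plain 'set' in B: the variable stays local to B, so C sees it as undefined.
without_namespace = render_root({
    "A.jinja": "-- include 'B.jinja'\n-- include 'C.jinja'\n",
    "B.jinja": "-- set FOO = 1\n",
    "C.jinja": "{{ FOO }}\n",
})

# Namespace declared in A: B mutates the shared object and C reads it back.
with_namespace = render_root({
    "A.jinja": (
        "-- set experiment = namespace()\n"
        "-- include 'B.jinja'\n"
        "-- include 'C.jinja'\n"
    ),
    "B.jinja": "-- set experiment.FOO = 1\n",
    "C.jinja": "{{ experiment.FOO }}\n",
})

print(repr(without_namespace), repr(with_namespace))
```

With Jinja's default handling of undefined variables, the first render comes back empty rather than raising an error, which is part of what makes this mistake hard to spot.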

Another namespace oddity is that you can 'set' a variable on a namespace, but you can't directly define a macro in a namespace.

```jinja2
## Contents of A.jinja
-- set experiment = namespace()
-- set experiment.foobar = "foobar"
-- include 'B.jinja'
{{ experiment.some_macro() }}
## Contents of B.jinja
-- macro experiment.some_macro()
{{ experiment.foobar }}
-- endmacro
```
This will not work, as you can't declare a macro in a namespace... but you can assign one!

```jinja2
## Contents of A.jinja
-- set experiment = namespace()
-- set experiment.foobar = "foobar"
-- include 'B.jinja'
{{ experiment.some_macro() }}
## Contents of B.jinja
-- macro B__some_macro()
{{ experiment.foobar }}
-- endmacro
-- set experiment.some_macro = B__some_macro
```
Seems like a bug, but at least it can be worked around.
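
Here is a small, self-contained check of both behaviours, again using an in-memory `DictLoader` and a plain environment. The helper `make_env` is hypothetical; the template contents follow the examples above.

```python
from jinja2 import DictLoader, Environment, TemplateSyntaxError


def make_env(templates=None) -> Environment:
    return Environment(
        loader=DictLoader(templates or {}),
        line_statement_prefix="--",
        line_comment_prefix="##",
    )


# Declaring a macro directly on a namespace is rejected when the template is compiled.
try:
    make_env().from_string("-- macro experiment.some_macro()\n-- endmacro\n")
    declaration_rejected = False
except TemplateSyntaxError:
    declaration_rejected = True

# Declaring the macro under an ordinary name and then assigning it is accepted.
env = make_env({
    "A.jinja": (
        "-- set experiment = namespace()\n"
        "-- set experiment.foobar = 'foobar'\n"
        "-- include 'B.jinja'\n"
        "{{ experiment.some_macro() }}\n"
    ),
    "B.jinja": (
        "-- macro B__some_macro()\n"
        "{{ experiment.foobar }}\n"
        "-- endmacro\n"
        "-- set experiment.some_macro = B__some_macro\n"
    ),
})
rendered = env.get_template("A.jinja").render().strip()
print(declaration_rejected, repr(rendered))
```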

#### YAML

We are using PyYAML, with all of its warts. This follows the Yaml 1.1 specification... mostly.

The loader is the 'yaml.SafeLoader,' which prohibits constructing arbitrary Python objects.

There is one custom tag present: '!callable'

!callable constructs 'Latent' objects. That is to say, a Latent holds the definition for a Python Callable, but does not immediately load any module code or construct anything.

The Latent objects must be explicitly 'materialized,' at which point the safety of the types is checked, the symbols are resolved, and only then is anything actually constructed.
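
To make the order of operations concrete, here is a minimal sketch of "check, resolve, construct" for a 'module:symbol' import-spec. It is an assumption-laden illustration, not Forgather's code: the functions `resolve_spec` and `materialize` are hypothetical, and the sketch only handles importable module names, not file paths.

```python
import importlib
from functools import reduce


def resolve_spec(spec: str, whitelist):
    """Hypothetical helper: check the whitelist first, import only afterwards."""
    if spec not in whitelist:
        raise PermissionError(f"'{spec}' is not in the whitelist")
    module_name, _, attr_path = spec.partition(":")
    module = importlib.import_module(module_name)
    # Walk dotted attribute paths such as 'AutoTokenizer.from_pretrained'.
    return reduce(getattr, attr_path.split("."), module)


def materialize(spec: str, args=(), kwargs=None, *, whitelist):
    """Hypothetical helper: resolve the callable, then call it with the stored arguments."""
    return resolve_spec(spec, whitelist)(*args, **(kwargs or {}))


whitelist = {"math:sqrt", "datetime:timedelta"}
result = materialize("math:sqrt", [16.0], whitelist=whitelist)
print(result)

# A spec which is not whitelisted is refused before any module is imported.
try:
    materialize("os:system", ["echo unsafe"], whitelist=whitelist)
    blocked = False
except PermissionError:
    blocked = True
print(blocked)
```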

!callable must always be followed by either a list or a mapping. If a list is given, it contains positional args and may be empty.  
If a mapping is used, there are two defined keys: 'args' and 'kwargs'; neither is required.

```yaml
- !callable:datetime:datetime.now [] # No arguments
- !callable:torch:tensor [1, 2, 3] # Only positional arguments
- !callable:torch:tensor { args: [4, 5, 6], kwargs: { requires_grad: True }} # Both positional and keyword
```
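
One way such a tag could be wired into PyYAML is with a multi-constructor on a SafeLoader subclass. The sketch below is an assumption about the general approach, not the actual Forgather implementation: `LatentSketch`, `ConfigLoader` and `construct_callable` are hypothetical names. Note that nothing from torch is imported; the loader only records the definition.

```python
from dataclasses import dataclass, field

import yaml


@dataclass
class LatentSketch:
    """Hypothetical stand-in for a Latent: a definition only, nothing is imported."""
    spec: str
    args: list = field(default_factory=list)
    kwargs: dict = field(default_factory=dict)


class ConfigLoader(yaml.SafeLoader):
    """SafeLoader subclass, so the extra tag does not leak into yaml.SafeLoader."""


def construct_callable(loader, tag_suffix, node):
    # A list holds positional arguments only.
    if isinstance(node, yaml.SequenceNode):
        return LatentSketch(tag_suffix, loader.construct_sequence(node, deep=True))
    # A mapping may hold 'args' and/or 'kwargs'.
    if isinstance(node, yaml.MappingNode):
        mapping = loader.construct_mapping(node, deep=True)
        return LatentSketch(tag_suffix, mapping.get("args", []), mapping.get("kwargs", {}))
    raise yaml.constructor.ConstructorError(
        None, None, "!callable must be followed by a list or a mapping", node.start_mark
    )


# Everything after the '!callable:' prefix is passed in as tag_suffix.
ConfigLoader.add_multi_constructor("!callable:", construct_callable)

document = """
- !callable:torch:tensor [1, 2, 3]
- !callable:torch:tensor { args: [4, 5, 6], kwargs: { requires_grad: True }}
"""
latents = yaml.load(document, Loader=ConfigLoader)
for latent in latents:
    print(latent)
```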

One feature of Yaml which you may not be familiar with is 'anchors' and 'aliases.' An anchor defines a symbolic reference which may be used again later in the definition.
```yaml
point: &my_anchor
    x: 1
    y: -5

line:
    start: *my_anchor
    end: *my_anchor 
```
Note that using an alias does not create a copy, it refers to the same instance!
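
This is easy to verify with PyYAML directly; the snippet below loads the example above and checks object identity.

```python
import yaml

document = """
point: &my_anchor
    x: 1
    y: -5

line:
    start: *my_anchor
    end: *my_anchor
"""

config = yaml.safe_load(document)

# Both aliases and the anchored mapping are the very same dict object.
same_object = config["line"]["start"] is config["line"]["end"] is config["point"]

# Consequently, a change made through one reference shows up in all of them.
config["line"]["start"]["x"] = 100
print(same_object, config["point"]["x"], config["line"]["end"]["x"])
```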

We also make use of Yaml's esoteric 'merge' operator, '<<:'
```yaml
defaults: &defaults
    x: 1
    y: 2

point:
    <<: *defaults
    x: 5
    z: 10
# The above results in:
point:
    x: 5
    y: 2
    z: 10
```
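
The merge behaviour can be checked with PyYAML's safe loader. Unlike a plain alias, a merge produces a new mapping, so the anchored defaults are left untouched.

```python
import yaml

document = """
defaults: &defaults
    x: 1
    y: 2

point:
    <<: *defaults
    x: 5
    z: 10
"""

config = yaml.safe_load(document)

# Keys given explicitly in 'point' win over the merged-in defaults.
print(config["point"])
print(config["defaults"])
```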

Yaml does not allow abstract anchor definitions; an anchor must refer to an actual object in the graph.

This is rather annoying when you just want to define something which is only for later use. To address this, any keys at the root level which start with '.' will be pruned before the configuration is returned. By convention, I name all of these keys '.define,' as PyYAML has no issues with using the same key more than once, but anything starting with '.' will do.

```yaml
.define: &x !callable:torch:tensor [ 2 ]
.define: &y !callable:torch:tensor [ 3 ]
sum: !callable:torch:add [ *x, *y ]
```
```
After loading, only the key for 'sum' will be in the dictionary.
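
The pruning step itself is simple. The following sketch uses plain scalars instead of '!callable' tags so that it runs with stock PyYAML; `prune_private_keys` is a hypothetical helper name.

```python
import yaml


def prune_private_keys(config: dict) -> dict:
    """Hypothetical helper: drop root-level keys which start with '.'."""
    return {key: value for key, value in config.items() if not str(key).startswith(".")}


document = """
.define: &x 2
.define: &y 3
pair: [ *x, *y ]
"""

# PyYAML keeps only the last '.define' value, but both anchors still resolve.
raw = yaml.safe_load(document)
config = prune_private_keys(raw)
print(config)
```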

*File Extensions*

You may use whatever extension you like. All of the pre-defined templates end in '.yaml', as this produces the best syntax-highlighting compromise between Jinja2 and YAML syntax.

### Parse Configuration

We can feed the pre-processed configuration into the YAML parser with load_config().

By default, this will pre-process the input; this can be skipped by setting 'preprocess=False'.

Also note the 'load_method' argument. The default is to assume the input string is a file path, but we can tell it that the string is the actual input by setting load_method to 'from_string'.

In [5]:
config_out = load_config(pp_config, preprocess=False, load_method="from_string")
pconfig(config_out.config)
# Note: The config_out also has a 'pp_config' member, which would have the preprocessed file, if we had combined parsing and pre-processing.

output_dir: 'forgather_demo/output_models/test_model'
logging_dir: 'forgather_demo/output_models/test_model/runs/Single Layer_1720414182151205578'
experiment_name: 'Single Layer'
experiment_description: 'Compare the impact of different numbers of layers on model performance'
trainer:
  Latent 'aiws.trainer:Trainer'
    model:
      Latent 'aiws.construct:register_for_auto_class'
        - Latent '../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformer'
          - Latent 'aiws.construct:register_for_auto_class'
            - Latent '../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformerConfig'
              vocab_size:
                Latent 'forgather.construct:get_attr'
                  - Latent 'transformers:AutoTokenizer.from_pretrained'
                    - '../tokenizers/tiny_stories_2k'
                  - 'vocab_size'
              hidden_size: 128
              dim_feedforward: 512
              num_attention_heads: 1
              num_

### Load Whitelist

#### load_whitelist_as_set() : Load a whitelist configuration from a file or string
```python
def load_whitelist_as_set(
    config: os.PathLike | str, *,
    preprocess: bool = True,
    search_path: str | List[str] = '.',
    load_method: LoadMethod = DEFAULT_LOAD_METHOD
) -> Set[str]:
```
This is essentially just load_config, but it normalizes the paths in the whitelist and converts the list to a set, to improve search performance.
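
As a sketch of what the normalization might look like: the rule below (anything ending in '.py' is treated as a file path and made absolute) is my assumption for illustration, and `normalize_spec` and `load_whitelist_sketch` are hypothetical names, not the library's functions.

```python
import os


def normalize_spec(spec: str, base_dir: str = ".") -> str:
    """Make file-based import-specs absolute; leave module-based specs alone."""
    target, sep, symbol = spec.rpartition(":")
    # Assumption for this sketch: a target ending in '.py' is a file path.
    if target.endswith(".py"):
        target = os.path.abspath(os.path.join(base_dir, target))
    return f"{target}{sep}{symbol}"


def load_whitelist_sketch(entries, base_dir: str = ".") -> set:
    # A set gives constant-time membership checks and removes duplicates.
    return {normalize_spec(entry, base_dir) for entry in entries}


entries = [
    "datasets:load_from_disk",
    "../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformer",
    # The same file, written differently; it normalizes to the same entry.
    "../model_zoo/./vanilla_transformer/vanilla_transformer.py:VanillaTransformer",
]
whitelist = load_whitelist_sketch(entries)
for spec in sorted(whitelist):
    print(spec)
```

Normalizing paths matters because the same model file can be referenced by different relative paths, depending on where a template lives.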

In [6]:
whitelist_out = load_whitelist_as_set(metacfg.whitelist_path, search_path=metacfg.search_paths)
pconfig(whitelist_out.config)

- 'forgather.construct:flatten'
- 'aiws.accel_trainer:AccelTrainingArguments'
- 'aiws.trainer:Trainer'
- '/home/dinalt/ai_assets/aiworkshop/model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformer'
- '/home/dinalt/ai_assets/aiworkshop/model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformerConfig'
- 'forgather.construct:get_item'
- 'accelerate:DataLoaderConfiguration'
- 'datasets:load_from_disk'
- 'aiws.default_callbacks:JsonLogger'
- 'aiws.default_callbacks:InfoCallback'
- 'transformers:AutoTokenizer.from_pretrained'
- 'aiws.construct:register_for_auto_class'
- 'aiws.accel_trainer:AccelTrainer'
- '/home/dinalt/ai_assets/aiworkshop/model_zoo/attention_only/attnonly.py:TransformerConfig'
- '/home/dinalt/ai_assets/aiworkshop/model_zoo/attention_only/attnonly.py:TransformerModel'
- 'datasets:load_dataset'
- 'aiws.trainer_types:TrainingArguments'
- 'aiws.tb_logger:TBLogger'
- 'aiws.default_callbacks:ProgressCallback'
- 'torch.utils.tensorboard:SummaryWriter'
-

#### Check Whitelist Requirements

If you would like to see which import-specs are used in a configuration (or which are missing), you can use enumerate_whitelist_exceptions().

In [7]:
enumerate_whitelist_exceptions(config_out.config)

- '/home/dinalt/ai_assets/aiworkshop/model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformer'
- 'aiws.trainer:Trainer'
- 'datasets:load_from_disk'
- 'aiws.default_callbacks:JsonLogger'
- 'aiws.construct:register_for_auto_class'
- 'transformers:AutoTokenizer.from_pretrained'
- 'aiws.trainer_types:TrainingArguments'
- 'transformers:DataCollatorForLanguageModeling'
- 'aiws.tb_logger:TBLogger'
- 'forgather.construct:get_attr'
- '/home/dinalt/ai_assets/aiworkshop/model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformerConfig'
- 'forgather.construct:get_item'
- 'torch.utils.tensorboard:SummaryWriter'
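
Conceptually, this kind of check is a walk over the nested definition which collects every import-spec and compares the result with the whitelist. The sketch below shows the idea on a hypothetical `LatentSketch` record; it is not the real Latent class, and `iter_specs` and `missing_from_whitelist` are made-up names.

```python
from dataclasses import dataclass, field


@dataclass
class LatentSketch:
    """Hypothetical stand-in for a Latent definition."""
    spec: str
    args: list = field(default_factory=list)
    kwargs: dict = field(default_factory=dict)


def iter_specs(obj):
    """Yield every import-spec found in a nested config."""
    if isinstance(obj, LatentSketch):
        yield obj.spec
        yield from iter_specs(obj.args)
        yield from iter_specs(obj.kwargs)
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from iter_specs(value)
    elif isinstance(obj, (list, tuple)):
        for value in obj:
            yield from iter_specs(value)


def missing_from_whitelist(config, whitelist) -> set:
    return set(iter_specs(config)) - set(whitelist)


dataset = LatentSketch("datasets:load_from_disk", ["../datasets/tiny_stories_tokenized"])
config = {
    "trainer": LatentSketch(
        "aiws.trainer:Trainer",
        kwargs={"train_dataset": LatentSketch("forgather.construct:get_item", [dataset, "train"])},
    )
}

used = set(iter_specs(config))
missing = missing_from_whitelist(config, {"aiws.trainer:Trainer", "datasets:load_from_disk"})
print(sorted(used))
print(sorted(missing))
```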


### Materialize the Configuration

#### materialize_config() : Materialize the Latent objects in the configuration
```python
def materialize_config(
    config: Any,
    whitelist: Container | os.PathLike | str = None,
    preprocess: bool = True,
    search_path: str | List[str] = '.',
    load_method: LoadMethod=DEFAULT_LOAD_METHOD,
    pp_kwargs: Dict[str, Any] = {},
    kwargs: Dict[str, Callable] = {},
) -> MaterializedOutput:
```
- config: An instantiated, but Latent, configuration; a preprocessed configuration string; or a path to a configuration file.  
- whitelist: A Container type, which means any object which supports 'str in container'  
- preprocess: Preprocess the string or file. Only applies if input is a path or string.  
- load_method: One of "from_file", "from_string", "from_file_search"  
- search_path: A str or List\[str\] paths to search for templates; also applies to "from_file_search" load method.  
- pp_kwargs: Arguments to pass to the template, if preprocessing is to be performed.  
- kwargs: A mapping str -> Callable to substitute when materializing the final config. This allows passing already instantiated objects into the config.


We will just materialize the config we have already loaded, which is to be checked against the whitelist.

In [7]:
# For reproducible experiments, it's probably best to make sure all of your random-seeds have been initialized
# to a consistent value. This is especially important in a multi-process environment.
set_seed(42)

config_output = materialize_config(config_out.config, whitelist=whitelist_out.config)
pconfig(config_output.config)



output_dir: 'forgather_demo/output_models/test_model'
logging_dir: 'forgather_demo/output_models/test_model/runs/Single Layer_1720386364132348531'
experiment_name: 'Single Layer'
experiment_description: 'Compare the impact of different numbers of layers on model performance'
trainer:
  Trainer(model=VanillaTransformer(
    (embedding): Embedding(2000, 128)
    (positional_encoder): PositionalEncoder()
    (layers): ModuleList(
      (0): TransformerLayer(
        (attention): MultiheadAttention(
          (query_linear): Linear(in_features=128, out_features=128, bias=True)
          (key_linear): Linear(in_features=128, out_features=128, bias=True)
          (value_linear): Linear(in_features=128, out_features=128, bias=True)
        )
        (feedforward): FeedforwardLayer(
          (linear1): Linear(in_features=128, out_features=512, bias=True)
          (activation): ReLU()
          (linear2): Linear(in_features=512, out_features=128, bias=True)
        )
        (norm1): LayerNo

### Run Experiment 1

In [None]:
# Wrapping the dict in a DotDict just allows using attribute dot-notation to access the values.
config = DotDict(config_output.config)
config.trainer.train()

### Run Experiment 2

This time we will skip all of the intermediate steps and go straight to instantiating the config.

In [None]:
set_seed(42)

materialize_config(
    os.path.join(metacfg.project_templates, 'experiment 2.yaml'), metacfg.whitelist_path, search_path=metacfg.search_paths
).config['trainer'].train()

### View in Tensorboard
Assuming that you have Tensorboard installed (you do, right?), you can take a look at the information collected by the logger.

Ideally, start Tensorboard from a console, but to take a quick peek, you can launch it from the notebook. If using the Notebook, you will have to stop the command when done.

In [11]:
# Use this version if you are training on the same machine that your web-browser is running on.
#!tensorboard --logdir forgather_demo/output_models/test_model/runs/

# Use this version if you are not running training on the same machine as your web-browser.
!tensorboard --bind_all --logdir forgather_demo/output_models/test_model/runs/

TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

TensorBoard 2.16.2 at http://hal9000:6006/ (Press CTRL+C to quit)
^C


### Running on Multiple GPUs

Running a trainer with multiple GPUs inside of a notebook has a number of issues.

In the least complex mode, Torch DataParallel is used. This runs each GPU on a separate Python thread. Unfortunately, the performance is terrible, thanks to the Global Interpreter Lock.

Running training with Torch Distributed solves this problem by running each node in a separate process. This makes things difficult for running in a notebook. The Accelerate library attempts to solve this by offering a [notebook launcher](https://huggingface.co/docs/accelerate/en/basic_tutorials/notebook), but there are complications: see below.

### Multi-GPU Training in a Notebook

First, we will need to switch over to a Trainer implementation which supports the Accelerate library.
This is easy enough. We can just override the trainer definition in the experiment to use one which supports Accelerate:

[forgather_demo/accel_experiment.yaml](forgather_demo/accel_experiment.yaml)  

Running in a notebook entails a few additional complications:

- To run the experiment with the notebook_launcher, we need to perform all of our initialization within the 'training_loop' function passed to the launcher.
- An additional complication is that you will likely need to restart your notebook's kernel, should you have already used the GPUs.
- It tends to be unstable, crashing without an obvious cause -- and the crash can't be reproduced from a training script!
- If something goes wrong, you will get a 'SIGTERM' and poor diagnostic info. It's best to run the 'training_loop' on its own for better diagnostics.
- If you want to work with the model after training, you will need to save it and load it back into the notebook.

In [1]:
import sys, os
modules_path = os.path.join('..')
if modules_path not in sys.path: sys.path.insert(0, modules_path)
from accelerate import notebook_launcher
from forgather.config import load_config, materialize_config
from aiws.dotdict import DotDict
from transformers import set_seed

# This is the entry-point for the spawned processes.
def training_loop(dir_config, experiment_name):
    set_seed(42)
    metacfg = DotDict(load_config(dir_config).config)
    config_output = materialize_config(os.path.join(metacfg.project_templates, experiment_name),
        metacfg.whitelist_path, search_path=metacfg.search_paths)
    config = DotDict(config_output.config)
    
    # If you don't want all processes to print to the console...
    if config.trainer.accelerator.is_main_process:
        print("**** Training Started *****")
        print(f"experiment_name: {config.experiment_name}")
        print(f"experiment_description: {config.experiment_description}")
        print(f"output_dir: {config.output_dir}")
        print(f"logging_dir: {config.logging_dir}")

    # This is where the actual 'loop' is.
    metrics = config.trainer.train().metrics
    
    if config.trainer.accelerator.is_main_process:
        print("**** Training Completed *****")
        print(metrics)

    metrics = config.trainer.evaluate()

    if config.trainer.accelerator.is_main_process:
        print("**** Evaluation Completed *****")
        print(metrics)
    
    if config.save:
        config.trainer.save_model()
        if config.trainer.accelerator.is_main_process:
            print(f"Model saved to: {config.trainer.args.output_dir}")

#### Launch Accelerate Trainer Directly

This will use Accelerate, but if you have multiple GPUs, this will only use one.

In [None]:
training_loop('forgather_config.yaml', 'accel_experiment.yaml')

#### Launch Accelerate Trainer with Notebook Launcher

If you have already trained anything in the notebook, without using notebook_launcher, this
will fail with "ValueError: To launch a multi-GPU training from your notebook ..."

After restarting your notebook, you can just run the prior cell again to reinitialize.

In [None]:
notebook_launcher(
    training_loop,
    args=('forgather_config.yaml', 'accel_experiment.yaml',),
    num_processes=2
)

#### Launch Huggingface Trainer directly in Notebook

This will use multiple GPUs, but will be hideously crippled on account of contention for the Global Interpreter Lock.

In [None]:
training_loop('forgather_config.yaml', 'hf_trainer_experiment.yaml')

#### Launch Huggingface Trainer with Notebook Launcher

In theory, this should work with the Huggingface Trainer...

[forgather_demo/hf_trainer_experiment.yaml](forgather_demo/hf_trainer_experiment.yaml)  

At present, it does not appear to detect that it is running in a multi-GPU configuration. The same config works just fine from a regular training script. Cause TBD.

In [None]:
notebook_launcher(
    training_loop,
    args=('forgather_config.yaml', 'hf_trainer_experiment.yaml',),
    num_processes=2
)

### Train from a Training Script

The preferred way to run non-trivial training tasks is from the command-line.

The following code can help you get started. It will take the path configuration and build a command-line, which can either be executed from the notebook or can be exported as a bash-script.

This is especially important for long-running training sessions, as various issues with the notebook could interrupt training.

In [2]:
import sys, os
modules_path = os.path.join('..')
if modules_path not in sys.path: sys.path.insert(0, modules_path)
from aiws.dotdict import DotDict
from forgather.config import load_whitelist_as_set, load_config, pconfig
from forgather import Latent
import stat

# Load project directory definitions
metacfg = DotDict(load_config('forgather_config.yaml').config)
pconfig(metacfg)

# Output train-script command line as a string
def train_cmdline(metacfg, nproc='gpu'):
    includes = ''.join(f"-I '{inc}' " for inc in metacfg.search_paths)
    return f"torchrun --standalone --nproc-per-node {nproc} '{metacfg.train_script_path}' -w '{metacfg.whitelist_path}' {includes} -s '{metacfg.assets_dir}'"

# Write the train-script command line to a bash-script.
# Usage: ./train.sh [<other-cli-args>] <experiment-config-file>
def make_bash_script(metacfg, script_path='train.sh', nproc='gpu'):
    with open(script_path, 'w') as f:
        # Pass 'nproc' through, so the argument is not silently ignored.
        f.write('#!/bin/bash\n' + train_cmdline(metacfg, nproc) + ' "${@}"\n')
        # Make the script readable, writable, and executable by the owner.
        os.chmod(f.fileno(), stat.S_IRUSR|stat.S_IWUSR|stat.S_IXUSR)

templates_dir: 'forgather_demo'
tokenizer_dir: 'experiment.TOKENIZERS_DIR'
datasets_dir: 'experiment.DATASETS_DIR'
assets_dir: '..'
search_paths: '['forgather_demo', '../templates', '../model_zoo']'
whitelist_path: 'forgather_demo/whitelist.yaml'
model_src_dir: '../model_zoo'
script_dir: '../scripts'
train_script_path: '../scripts/train_script.py'
models_dir: 'forgather_demo/output_models'


#### Generate Bash Script

This will output a shell-script which will invoke the training script with the arguments for this project.

```bash
# Optional: Restrict the GPUs to use to a subset of those available.
export CUDA_VISIBLE_DEVICES="0,1"

# Run Accelerate Trainer experiment
./train.sh forgather_demo/accel_experiment.yaml

# Run Huggingface Trainer experiment
./train.sh forgather_demo/hf_trainer_experiment.yaml
```

In [5]:
make_bash_script(metacfg)

# Read back to verify
with open('train.sh', 'r') as f:
    print(f.read())

#!/bin/bash
torchrun --standalone --nproc-per-node gpu '../scripts/train_script.py' -w 'forgather_demo/whitelist.yaml' -I 'forgather_demo' -I '../templates' -I '../model_zoo'  -s '..' "${@}"



#### Run Training Script from Notebook

This will execute a shell command to run the training script, where the notebook will act as the shell console.  

Note: The tqdm progress bars do not render properly in the notebook. It will train, but it's ugly. Running this from a real terminal looks much better!

In [None]:
# By default, this will run on all available GPUs. To restrict it to a subset, you can use this environment variable.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

!{train_cmdline(metacfg)} 'forgather_demo/hf_trainer_experiment.yaml'
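
Outside of a notebook, the `!` shell escape is not available. As a minimal sketch, the same kind of command line can be launched with the standard library. The helper names `build_argv` and `run_and_capture` are hypothetical, and a Python one-liner stands in for the real `torchrun` command so that the example runs without GPUs or the training script.

```python
import shlex
import subprocess
import sys

def build_argv(cmdline: str, *extra_args: str) -> list[str]:
    """Split a shell-style command line into an argv list and append extra arguments."""
    return shlex.split(cmdline) + list(extra_args)

def run_and_capture(argv: list[str]) -> tuple[int, str]:
    """Run the command, merging stderr into stdout, and return (exit code, output)."""
    proc = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return proc.returncode, proc.stdout

# Stand-in for train_cmdline(metacfg): a command which echoes its first argument.
demo_cmdline = f"{shlex.quote(sys.executable)} -c 'import sys; print(sys.argv[1])'"

# The experiment config is appended as the last argument, as in the cell above.
argv = build_argv(demo_cmdline, "forgather_demo/hf_trainer_experiment.yaml")
returncode, output = run_and_capture(argv)
print(returncode, output.strip())
```

Passing an argv list, rather than a single string with `shell=True`, avoids a second round of shell quoting for paths which contain spaces.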

### Injecting Callables

In [1]:
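import sys, os
modules_path = os.path.join('..')
if modules_path not in sys.path: sys.path.insert(0, modules_path)
# These are used in the cells below
from forgather.config import preprocess_config, materialize_config, pconfig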

tokenize_config = """
-- set tokenizer_path = path_join('..', 'tokenizers', 'tiny_stories_2k')
-- set dataset_id = "roneneldan/TinyStories"

# Define the tokenizer to use
.define: &tokenizer !callable:transformers:AutoTokenizer.from_pretrained
    - "{{ tokenizer_path }}"

# Load a dataset from the hub
.define: &raw_dataset !callable:datasets:load_dataset [ "{{ dataset_id }}" ]

# Call the injected Callable, 'tokenize_dataset'
# Note that we can pass objects created in the config into the callable.
dataset: &dataset !callable:tokenize_dataset [ *raw_dataset, *tokenizer, [ 0.01, 1.0 ] ]

# Get splits
train_dataset: &train_dataset !callable:forgather.construct:get_item [ *dataset, "train" ]
eval_dataset: &eval_dataset !callable:forgather.construct:get_item [ *dataset, "validation" ]
"""

tokenize_whitelist = """
- transformers:AutoTokenizer.from_pretrained
- datasets:load_dataset
- forgather.construct:get_item
"""

# Dump a preview of the pre-processed configs
print(preprocess_config(tokenize_config, load_method="from_string"))
print('-' * 40)
print(preprocess_config(tokenize_whitelist, load_method="from_string"))
print('-' * 40)

# Define the tokenizer to use
.define: &tokenizer !callable:transformers:AutoTokenizer.from_pretrained
    - "../tokenizers/tiny_stories_2k"

# Load a dataset from the hub
.define: &raw_dataset !callable:datasets:load_dataset [ "roneneldan/TinyStories" ]

# Call the injected Callable, 'tokenize_dataset'
# Note that we can pass objects created in the config into the callable.
dataset: &dataset !callable:tokenize_dataset [ *raw_dataset, *tokenizer, [ 0.01, 1.0 ] ]

# Get splits
train_dataset: &train_dataset !callable:forgather.construct:get_item [ *dataset, "train" ]
eval_dataset: &eval_dataset !callable:forgather.construct:get_item [ *dataset, "validation" ]
----------------------------------------
- transformers:AutoTokenizer.from_pretrained
- datasets:load_dataset
- forgather.construct:get_item
----------------------------------------


Note that the tag '!callable:tokenize_dataset' does not carry a valid import-spec: 'tokenize_dataset' contains no ':' separating a module from an attribute.

This is a 'stand-in' tag, which needs to be filled in when the config is materialized.

Let's define the stand-in and inject it into materialize_config().

In [2]:
# Define a Callable to inject.
def tokenize_dataset(dataset_dict, tokenizer, select: list[float]):
    """
    Given a DatasetDict and tokenizer, tokenize each split and return it in a new dictionary.

    select: A list of floats, one per split, each specifying the fraction of that split to include.
        e.g. [0.1, 1.0] = 10% and 100%
    """
    output = {}
    
    def map_fn(element, tokenizer):
        outputs = tokenizer(
            element["text"],
            truncation=True,
        )
        return {"input_ids": outputs["input_ids"]}
    
    for i, (split, dataset) in enumerate(dataset_dict.items()):
        if select[i] < 1.0:
            dataset = dataset.select(range(0, int(len(dataset) * select[i])))
        
        tokenized_data = dataset.map(
            map_fn,
            batched=True,
            remove_columns=dataset.column_names,
            fn_kwargs=dict(tokenizer=tokenizer)
        )
        output[split] = tokenized_data
    return output

# Inject the Callable via kwargs.
config_output = materialize_config(
    tokenize_config,
    tokenize_whitelist,
    load_method="from_string",
    kwargs=dict(tokenize_dataset=tokenize_dataset),
)

pconfig(config_output)

Repo card metadata block was not found. Setting CardData to empty.


config:
  dataset:
    train:
      Dataset({
          features: ['input_ids'],
          num_rows: 21197
      })
    validation:
      Dataset({
          features: ['input_ids'],
          num_rows: 21990
      })
  train_dataset:
    Dataset({
        features: ['input_ids'],
        num_rows: 21197
    })
  eval_dataset:
    Dataset({
        features: ['input_ids'],
        num_rows: 21990
    })
pp_config:
       1: # Define the tokenizer to use
       2: .define: &tokenizer !callable:transformers:AutoTokenizer.from_pretrained
       3:     - "../tokenizers/tiny_stories_2k"
       4: 
       5: # Load a dataset from the hub
       6: .define: &raw_dataset !callable:datasets:load_dataset [ "roneneldan/TinyStories" ]
       7: 
       8: # Call the injected Callable, 'tokenize_dataset'
       9: # Note that we can pass objects created in the config into the callable.
      10: dataset: &dataset !callable:tokenize_dataset [ *raw_dataset, *tokenizer, [ 0.01, 1.0 ] ]
      11: 
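
The distinction between an import-spec and a stand-in can be sketched in a few lines of plain Python. This is only an illustration of the idea, not forgather's implementation: `resolve_spec` is a hypothetical helper, it handles module names only (not file paths), and it performs no whitelist check.

```python
import importlib
from typing import Any, Callable, Mapping

def resolve_spec(spec: str, injected: Mapping[str, Callable[..., Any]]) -> Any:
    """Resolve 'module:attr.path' by importing it; otherwise treat spec as a stand-in."""
    if ":" in spec:
        module_name, _, attr_path = spec.partition(":")
        obj: Any = importlib.import_module(module_name)
        # Walk dotted attributes, e.g. 'OrderedDict.fromkeys'
        for name in attr_path.split("."):
            obj = getattr(obj, name)
        return obj
    # No ':' means the name must have been injected by the caller.
    if spec not in injected:
        raise KeyError(f"No callable was injected for stand-in '{spec}'")
    return injected[spec]

# A toy replacement for the injected callable used in this notebook.
def tokenize_dataset(text: str) -> list[str]:
    return text.split()

sqrt = resolve_spec("math:sqrt", {})
fromkeys = resolve_spec("collections:OrderedDict.fromkeys", {})
stand_in = resolve_spec("tokenize_dataset", {"tokenize_dataset": tokenize_dataset})
print(sqrt(9.0), list(fromkeys("ab")), stand_in("one two"))
```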
    

Because instances of 'Latent' are themselves Callables, which materialize their definition when called, definitions can be chained.

Here, we define a config for a raw dataset and a config for tokenizing an injected dataset.

In [3]:
from forgather.config import load_config
from forgather import Latent

raw_dataset_def = """
-- set dataset_id = "roneneldan/TinyStories"
!callable:datasets:load_dataset [ "{{ dataset_id }}" ]
"""

tokenize_dataset_def = """
-- set tokenizer_path = path_join('..', 'tokenizers', 'tiny_stories_2k')
.define: &tokenizer !callable:transformers:AutoTokenizer.from_pretrained
    - "{{ tokenizer_path }}"
dataset: &dataset !callable:tokenize_dataset [ !callable:raw_dataset [], *tokenizer, [ 0.01, 1.0 ] ]
"""

whitelist = load_config(tokenize_whitelist, load_method="from_string").config
raw_dataset = load_config(raw_dataset_def, load_method="from_string").config

dataset = load_config(
    tokenize_dataset_def,
    load_method="from_string",
    tokenize_dataset=tokenize_dataset,
    raw_dataset=raw_dataset
).config

print(raw_dataset)
print('-' * 40)
pconfig(dataset)
print('-' * 40)

Latent('datasets:load_dataset', *['roneneldan/TinyStories'], **{})
----------------------------------------
dataset:
  Latent 'tokenize_dataset'
    - Latent 'raw_dataset'
    - Latent 'transformers:AutoTokenizer.from_pretrained'
      - '../tokenizers/tiny_stories_2k'
    - - 0.01
    - 1.0
----------------------------------------


When we materialize the dataset, we pass the raw_dataset definition to the tokenize_dataset definition.

We could also have first materialized the raw_dataset and then injected it as a lambda expression.

In [4]:
output_config = Latent.materialize(dataset, whitelist=whitelist, raw_dataset=raw_dataset, tokenize_dataset=tokenize_dataset)
pconfig(output_config)

Repo card metadata block was not found. Setting CardData to empty.


dataset:
  train:
    Dataset({
        features: ['input_ids'],
        num_rows: 21197
    })
  validation:
    Dataset({
        features: ['input_ids'],
        num_rows: 21990
    })


### Cleanup

Delete all of the output models produced by the demo and start over.

In [25]:
import sys, os, shutil
modules_path = os.path.join('..')
if modules_path not in sys.path: sys.path.insert(0, modules_path)
from aiws.dotdict import DotDict
from forgather.config import load_config, pconfig
metacfg = DotDict(load_config('forgather_config.yaml').config)

print(f"Removing '{metacfg.models_dir}'")
shutil.rmtree(metacfg.models_dir, ignore_errors=True)

Removing 'forgather_demo/output_models'


## forgather.latent

A Latent \[object\] separates what to create from when to create it.

The primary intended use-case is for safely constructing objects from a
configuration file. Consider the case where a configuration file may define objects which
can take a considerable amount of time to construct (e.g. processing a dataset).

In this case, it's useful to allow the complete file to be parsed before attempting a
lengthy task, as there may still be errors present which will cause the operation to
abort. It's much better to first fully parse the file, validate the safety of
all the types, and only then materialize the definition. This is far less
painful than having to fix a single error, wait for the long operation to complete (again),
and then hit another error. Fun times...

Allowing deferral can also avoid materializing expensive objects which are not needed, as per
runtime logic. For example, a definition may define several datasets, whereas only a single
one is actually selected, contingent upon 'whatever.'

If an object is never materialized, this also avoids loading the associated modules.

Finally, this allows one to lazily construct objects in whatever order makes sense.

In [1]:
import sys, os
modules_path = os.path.join('..')
if modules_path not in sys.path: sys.path.insert(0, modules_path)
from forgather import Latent
from forgather.config import pconfig

# Define the object to construct
latent_tensor = Latent("torch:Tensor", [1 ,2, 3], as_callable=True, is_singleton=False)
print(latent_tensor)

Latent('torch:Tensor', *([1, 2, 3],), **{}, as_callable=True, is_singleton=False)


In [3]:
# ... and some time later, materialize the object instance.
tensor = latent_tensor()
print(tensor)

tensor([1., 2., 3.])


This can also extend to graphs of objects...

In [4]:
data = dict(
    total = Latent("torch:sum", Latent("torch:Tensor", [1 ,2, 3]))
)
print(data)

obj = Latent.materialize(data)
print(obj)

{'total': Latent('torch:sum', *(Latent('torch:Tensor', *([1, 2, 3],), **{}, as_callable=False, is_singleton=False),), **{}, as_callable=False, is_singleton=False)}
{'total': tensor(6.)}


We can also specify modules by path-name.

In [5]:
model = Latent(
    "../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformer",
    Latent(
        "../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformerConfig",
        hidden_size=64, num_hidden_layers=3
    )
)
print(model)
print('*' * 20 + " or pretty-printed... " + "*" * 20)
pconfig(model)

Latent('../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformer', *(Latent('../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformerConfig', *(), **{'hidden_size': 64, 'num_hidden_layers': 3}, as_callable=False, is_singleton=False),), **{}, as_callable=False, is_singleton=False)
******************** or pretty-printed... ********************
Latent '../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformer'
  - Latent '../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformerConfig'
    hidden_size: 64
    num_hidden_layers: 3


In [6]:
# And materialize the definition...
# The __call__ method is short-hand for Latent.materialize(model)
# As is_singleton defaults to False (see the repr above), calling (or materializing) again constructs a new instance.
materialized_model = model()
print(materialized_model)

VanillaTransformer(
  (embedding): Embedding(2000, 64)
  (positional_encoder): PositionalEncoder()
  (layers): ModuleList(
    (0-2): 3 x TransformerLayer(
      (attention): MultiheadAttention(
        (query_linear): Linear(in_features=64, out_features=64, bias=True)
        (key_linear): Linear(in_features=64, out_features=64, bias=True)
        (value_linear): Linear(in_features=64, out_features=64, bias=True)
      )
      (feedforward): FeedforwardLayer(
        (linear1): Linear(in_features=64, out_features=512, bias=True)
        (activation): ReLU()
        (linear2): Linear(in_features=512, out_features=64, bias=True)
      )
      (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    )
  )
  (output_projection): Linear(in_features=64, out_features=2000, bias=True)
)


You can restrict which types of objects can be materialized by specifying a whitelist.

In [7]:
from forgather.dynamic import normalize_import_spec
# Note: Any import-specs with paths should be normalized
# with forgather.dynamic.normalize_import_spec(). This
# ensures that all equivalent paths have the same representation.
whitelist = set((
    "torch:Tensor",
    "torch:add",
    normalize_import_spec("../model_zoo/vanilla_transformer/vanilla_transformer.py:VanillaTransformer"),
))

allowed_instance = Latent("torch:Tensor", [4, 5, 6])
allowed_instance(whitelist=whitelist)

tensor([4., 5., 6.])
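
The reason for normalizing is that a whitelist is a set of strings, so two spellings of the same file path would otherwise compare as different entries. The following sketch shows one way such a normalization could work. It is an assumption for illustration only: `normalize_spec_sketch` is a hypothetical function and may differ from what `forgather.dynamic.normalize_import_spec` actually does.

```python
import os

def normalize_spec_sketch(spec: str) -> str:
    """Rewrite the file-path part of 'path.py:Name' as an absolute, normalized path."""
    module_part, sep, attr = spec.rpartition(":")
    if not sep:
        # No ':' at all: a stand-in name, nothing to normalize.
        return spec
    if module_part.endswith(".py"):
        # abspath() also collapses '.', '..' and duplicate separators.
        module_part = os.path.abspath(module_part)
    return f"{module_part}:{attr}"

a = normalize_spec_sketch("../model_zoo/m.py:Net")
b = normalize_spec_sketch("../model_zoo/./sub/../m.py:Net")
print(a == b, normalize_spec_sketch("torch:Tensor"))
```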

If something is not in the whitelist, an exception is raised which lists all of the prohibited types.

In [8]:
prohibited_instance = Latent("torch:mul", Latent("torch:sum", Latent("torch:Tensor", [1 ,2, 3])), 3)
prohibited_instance(whitelist=whitelist)

LatentException: The following dynamic imports were not found in the whitelist: {'torch:mul', 'torch:sum'}

Alternatively, we can just get the list of disallowed types.

In [9]:
invalid_set = Latent.validate_whitelist(prohibited_instance, whitelist)
if len(invalid_set):
    # Show all disallowed types in the graph
    print(f"Disallowed: {invalid_set}")

Disallowed: {'torch:mul', 'torch:sum'}
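
Conceptually, validating a graph against a whitelist is a recursive walk which collects every import-spec that is not in the set. The sketch below illustrates this with a toy `Node` class. Both `Node` and `collect_disallowed` are hypothetical names, and this is not how `Latent.validate_whitelist` is necessarily implemented.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Node:
    """Toy stand-in for a deferred constructor: an import-spec plus its arguments."""
    spec: str
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)

def collect_disallowed(obj: Any, whitelist: set[str]) -> set[str]:
    """Return every spec in the (nested) graph which is missing from the whitelist."""
    found: set[str] = set()
    if isinstance(obj, Node):
        if obj.spec not in whitelist:
            found.add(obj.spec)
        found |= collect_disallowed(obj.args, whitelist)
        found |= collect_disallowed(obj.kwargs, whitelist)
    elif isinstance(obj, dict):
        for value in obj.values():
            found |= collect_disallowed(value, whitelist)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            found |= collect_disallowed(item, whitelist)
    return found

# Same shape as 'prohibited_instance' above, built from toy nodes.
graph = Node("torch:mul", (Node("torch:sum", (Node("torch:Tensor", ([1, 2, 3],)),)), 3))
disallowed = collect_disallowed(graph, {"torch:Tensor", "torch:add"})
print(sorted(disallowed))
```

Collecting the full set before raising means that all problems can be reported at once, rather than one per attempt.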


#### Object Identity

By default, each call returns a new object instance.

In [10]:
latent_tensor = Latent("torch:Tensor", [1 ,2, 3])
tensor = latent_tensor()
assert(id(tensor) != id(latent_tensor()))

This can be overridden by setting the "is_singleton" flag.

In [11]:
latent_tensor = Latent("torch:Tensor", [1 ,2, 3], is_singleton=True)
tensor = latent_tensor()
assert(id(tensor) == id(latent_tensor()))

#### Passing Arguments

Arguments can be injected into the graph at the point of materialization.

In [12]:
import torch

# Notice how the second import-spec, "arg," does not have a ':' character.
# This is a placeholder for a real value to be specified when the object is materialized.
deferred_sum = Latent("torch:sum", Latent("arg"))

# Let's create an object to substitute for 'arg'.
tensor = torch.tensor([1 ,2, 3])

# Materialize the value
deferred_sum(arg=tensor)

tensor(6)

### Corner Cases

#### Tied Parameters
As mentioned above, a Latent created with 'is_singleton=True' is a singleton; that is, there is really only one instance, no matter how many times you call the object.

With 'is_singleton=False', you will get a different instance each time the object is materialized, but what happens when a non-singleton object exists in more than one place in the graph?

For example, here we have a simple ML model which takes an input tensor and an output tensor as arguments, which are then used as parameters. If we create a single tensor and pass it as both the input and output arguments, this ties the weights together, as they share the same instance.

If the shared parameter is not a singleton, won't this 'untie' the shared parameter?

No. When constructing the object graph, we keep track of which objects have already been instantiated with an object-id map. If an object with the same ID is 'constructed' a second time, the 'cached' object will be returned, rather than a new one.

The difference only appears when the graph is materialized more than once: the 'cache' is flushed between calls, so each call constructs a new instance of a non-singleton object.

In this example, we construct the model described above and print the values of the input and output weights.

If the shared_weights are non-singleton, then each call will initialize different weights, but they will still be tied.

If configured as a singleton, each call will produce the same weights.

In [16]:

# Define a simple model
class Net(torch.nn.Module):
    def __init__(self, input, output):
        super().__init__()
        self.input = torch.nn.Parameter(input)
        self.output = torch.nn.Parameter(output)

    def forward(self, x):
        x = x @ self.input
        x = x @ self.output.t()
        return x

# Create a 'shared' tensor for both input and output networks.
# Try changing 'is_singleton'
shared_weights = Latent('torch:randn', 3, 4, requires_grad=True, is_singleton=False)
latent_model = Latent(Net, input=shared_weights, output=shared_weights, is_singleton=False)
print(latent_model)

model = latent_model()
print(model.input.data)
print(model.output.data)
print("\n\n")
model = latent_model()
print(model.input.data)
print(model.output.data)

Latent(<class '__main__.Net'>, *(), **{'input': Latent('torch:randn', *(3, 4), **{'requires_grad': True}, as_callable=False, is_singleton=False), 'output': Latent('torch:randn', *(3, 4), **{'requires_grad': True}, as_callable=False, is_singleton=False)}, as_callable=False, is_singleton=False)
tensor([[-1.0802, -0.2785, -0.5239,  1.4973],
        [-1.2288,  0.7229,  1.9864,  0.4387],
        [ 1.3913, -1.0829, -0.9404, -0.3665]])
tensor([[-1.0802, -0.2785, -0.5239,  1.4973],
        [-1.2288,  0.7229,  1.9864,  0.4387],
        [ 1.3913, -1.0829, -0.9404, -0.3665]])



tensor([[ 1.1079,  0.3517, -0.6129,  0.7602],
        [-0.4242, -0.4379,  0.8860,  0.6022],
        [ 0.0201,  0.4126, -1.1070,  0.5717]])
tensor([[ 1.1079,  0.3517, -0.6129,  0.7602],
        [-0.4242, -0.4379,  0.8860,  0.6022],
        [ 0.0201,  0.4126, -1.1070,  0.5717]])
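
The object-id cache described above can be illustrated with a toy class. This is a sketch of the idea only: `Deferred` is a hypothetical class, not forgather's `Latent`, and the real implementation may differ in its details.

```python
import itertools
from typing import Any, Callable

class Deferred:
    """Toy deferred constructor with a per-pass cache keyed by object id."""

    def __init__(self, fn: Callable[..., Any], *args: Any, is_singleton: bool = False):
        self.fn = fn
        self.args = args
        self.is_singleton = is_singleton
        self._built = False
        self._instance: Any = None

    def __call__(self) -> Any:
        # Each top-level call starts a new pass with an empty cache.
        return self._build({})

    def _build(self, cache: dict) -> Any:
        if self.is_singleton and self._built:
            return self._instance
        key = id(self)
        if key in cache:
            # Same node seen again within this pass: reuse the constructed object.
            return cache[key]
        args = [a._build(cache) if isinstance(a, Deferred) else a for a in self.args]
        value = self.fn(*args)
        cache[key] = value
        if self.is_singleton:
            self._built, self._instance = True, value
        return value

counter = itertools.count()
shared = Deferred(lambda: [next(counter)])  # builds a new list on every pass
pair = Deferred(lambda a, b: (a, b), shared, shared)

first, second = pair(), pair()
# Tied within one pass, but not shared across passes.
print(first[0] is first[1], first[0] is second[0])
```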


#### Factory Objects

One use-case calls for providing 'factory' arguments to an object, where each call produces a new object instance.

Consider this use-case:

The Huggingface Trainer class allows you to pass a "model initializer," rather than a model instance, to the Trainer. The Trainer will then explore various hyper-parameters, initializing a new model instance on each iteration.

If the Trainer is part of a configuration and the model is also in the configuration, this makes it rather difficult to pass a "model initializer" to the Trainer; when the Latent graph is constructed, the initializer will be a concrete model instance, not a callable constructor.

This can be solved by setting the 'as_callable' flag on the model constructor, which results in an unmaterialized callable being passed to the Trainer.

Now, when the Trainer calls the model initializer, the model will be materialized.

This does not fully solve the problem if the model is a singleton, as subsequent calls would return the same model. Setting 'is_singleton=False' results in a new model instance each time it is called.

If the model has any other Latent objects, these too can be independently configured as singletons or callables.

Finally, if the called function is expected to take any arguments, these can be mapped to arguments anywhere in the graph of objects.

In [17]:
from typing import Callable

# Factory class. Given a callable, when called it uses the provided callable to create new objects.
class TensorFactory:
    def __init__(self, factory: Callable):
        assert isinstance(factory, Callable)
        self.factory = factory
        self.n_cols = 1

    def make_tensor(self, **argv):
        # Pass a combination of arguments from the caller and from the factory
        # Increase the number of columns by two each time
        self.n_cols += 2
        return self.factory(cols=self.n_cols, **argv)

    def __repr__(self):
        return f"TensorFactory({self.factory})"

# Experiment with changing the arguments to see how this works.
print("\nConstruct Latent; all Latent objects are still latent.\n")
latent_factory = Latent(
    TensorFactory,
    Latent(
        "torch:randn", # Initialize a random tensor with the specified dimensions.
        Latent(
            "rows", # This is an argument which can be specified when called.
            is_singleton=False, # Create a new instance each time.
        ),
        Latent(
            "cols", # This is an argument which can be specified when called.
            is_singleton=False, # Create a new instance each time.
        ),
        is_singleton=False, # Each call should return a new instance.
        as_callable=True, # Pass object as a Callable, rather than immediately materializing it.
    )
)
print(latent_factory)

print('-' * 40)
print("\nMaterialized Latent; as the 2nd level Latent is 'as_callable,' it was passed to the factory untouched.")
print("\nFurthermore, this isolated the 3rd level Latent, so it was also not materialized.\n")
factory = latent_factory()
print(factory)

print('-' * 40)
print("\nFactory calls Latent, passing different arguments each time.")
print(" On each call, the arguments are resolved and a new instance is returned\n")
print("Tensor 1: ", factory.make_tensor(rows=6))
print("Tensor 2: ", factory.make_tensor(rows=3))

Construct Latent; all Latent objects are still latent.

Latent(<class '__main__.TensorFactory'>, *(Latent('torch:randn', *(Latent('rows', *(), **{}, as_callable=False, is_singleton=False), Latent('cols', *(), **{}, as_callable=False, is_singleton=False)), **{}, as_callable=True, is_singleton=False),), **{}, as_callable=False, is_singleton=False)
----------------------------------------

Materialized Latent; as the 2nd level Latent is 'as_callable,' it was passed to the factory untouched.

Furthermore, this isolated the 3rd level Latent, so it was also not materialized.

TensorFactory(Latent(<built-in method randn of type object at 0x7f262e6a4760>, *(Latent('rows', *(), **{}, as_callable=False, is_singleton=False), Latent('cols', *(), **{}, as_callable=False, is_singleton=False)), **{}, as_callable=True, is_singleton=False))
----------------------------------------

Factory calls Latent, passing different arguments each time.
 On each call, the arguments are resolved and a new instance is returned
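
The "initializer" pattern itself does not depend on forgather. As a rough analogy only, a Latent with `as_callable=True` and `is_singleton=False` plays a role similar to a `functools.partial`: some arguments are fixed up front, the caller supplies the rest, and every call builds a new object. `TinyModel` and `SearchLoop` below are hypothetical classes invented for this sketch; `SearchLoop` is not the Huggingface Trainer.

```python
from functools import partial
from typing import Callable

class TinyModel:
    """Hypothetical model with one fixed and one searched hyper-parameter."""
    def __init__(self, hidden_size: int, dropout: float = 0.0):
        self.hidden_size = hidden_size
        self.dropout = dropout

class SearchLoop:
    """Hypothetical consumer which expects an initializer, not an instance."""
    def __init__(self, model_init: Callable[..., TinyModel]):
        if not callable(model_init):
            raise TypeError("model_init must be callable")
        self.model_init = model_init

    def run(self, dropout_trials: list[float]) -> list[TinyModel]:
        # A new model is constructed for every trial.
        return [self.model_init(dropout=d) for d in dropout_trials]

# hidden_size is fixed by the configuration; dropout is supplied per call.
loop = SearchLoop(partial(TinyModel, hidden_size=64))
models = loop.run([0.0, 0.1])
print([(m.hidden_size, m.dropout) for m in models], models[0] is models[1])
```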

## forgather.dynamic

The 'dynamic' module can dynamically import attributes from Python modules, identified either by a module name which can be found via sys.path or directly by a file path.

In [18]:
from types import SimpleNamespace
from forgather.dynamic import dynamic_import
import os

# Create a simple namespace to put our dynamic imports in
# The global namespace works too, but I want to avoid cluttering it with our demo imports.
ns = SimpleNamespace()

We can import an attribute from a module.

In this example, we get the torch.tensor function and the torch 'nn' namespace. Once imported, we can use them just like a regular import.

In [19]:
ns.tensor = dynamic_import("torch:tensor")
ns.nn = dynamic_import("torch:nn")

# Create a tensor
tensor = ns.tensor([1, 2, 3])
print(type(tensor), tensor)

# Create a torch.nn.Module
module = ns.nn.Module()
print(module)

<class 'torch.Tensor'> tensor([1, 2, 3])
Module()


We can also import modules directly from a Python source file, even if it's not in our sys.path.

For example, let's get a transformer config and model definition and instantiate the model.

In [20]:
module_path = os.path.join('..', 'model_zoo', 'vanilla_transformer', 'vanilla_transformer.py')
print(module_path)
ns.ModelConfig = dynamic_import(module_path + ':VanillaTransformerConfig')
ns.TransformerModel = dynamic_import(module_path + ':VanillaTransformer')

model = ns.TransformerModel(ns.ModelConfig(hidden_size=128, num_hidden_layers=2))
print(model)

../model_zoo/vanilla_transformer/vanilla_transformer.py
VanillaTransformer(
  (embedding): Embedding(2000, 128)
  (positional_encoder): PositionalEncoder()
  (layers): ModuleList(
    (0-1): 2 x TransformerLayer(
      (attention): MultiheadAttention(
        (query_linear): Linear(in_features=128, out_features=128, bias=True)
        (key_linear): Linear(in_features=128, out_features=128, bias=True)
        (value_linear): Linear(in_features=128, out_features=128, bias=True)
      )
      (feedforward): FeedforwardLayer(
        (linear1): Linear(in_features=128, out_features=512, bias=True)
        (activation): ReLU()
        (linear2): Linear(in_features=512, out_features=128, bias=True)
      )
      (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
    )
  )
  (output_projection): Linear(in_features=128, out_features=2000, bias=True)
)
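
For reference, importing an attribute from a source file which is not on sys.path can be done with the standard library's `importlib.util`. The sketch below shows the general technique under the assumption that something similar happens inside `dynamic_import`; `import_from_file` is a hypothetical helper, and the temporary module exists only to make the example self-contained.

```python
import importlib.util
import os
import tempfile

def import_from_file(path: str, attr: str):
    """Load the Python source file at 'path' as a module and return one attribute."""
    module_name = os.path.splitext(os.path.basename(path))[0]
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load a module from '{path}'")
    module = importlib.util.module_from_spec(spec)
    # Execute the module's code; it is not registered in sys.modules here.
    spec.loader.exec_module(module)
    return getattr(module, attr)

with tempfile.TemporaryDirectory() as tmp_dir:
    source_path = os.path.join(tmp_dir, "demo_module.py")
    with open(source_path, "w") as f:
        f.write("class Greeter:\n    def greet(self):\n        return 'hello'\n")
    Greeter = import_from_file(source_path, "Greeter")

print(Greeter().greet())
```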
