"I am not afraid of storms, for I am learning how to sail my ship."
– Louisa May Alcott
Marin is an open-source framework for the research and development of foundation models.
A key feature of Marin is reproducibility: every step, from raw data to the final model, is recorded, not just the end result. This includes failed experiments, so the entire research process is transparent.
Marin's primary use case is training language models like Llama, DeepSeek, and Qwen. This includes data curation, transformation, filtering, tokenization, training, and evaluation.
We used Marin to train the first open-source 8B parameter model to outperform Llama 3.1 8B. You can see the training script or read the retrospective.
The documentation for Marin is available on ReadTheDocs or in the docs/ folder.
To get started with Marin:
- Install Marin.
- Train a tiny language model using Marin.
- See how to run a much larger DCLM 1B/1x experiment using Marin.
- See a summary of the experiments we've run.
- Participate in the Marin Speedrun competition to try to find the most efficient way to train a language model.
- Try out the Marin Datashop to contribute and create data for your use case.
- Join the Marin Discord to chat with the community.
Marin experiments are defined as a set of steps that can depend on each other and are executed in a topological order, like a Makefile.
As a brief example of how you can use Marin, here is a complete script for training a tiny model on TinyStories. You can check out the full script for more details.
from experiments.defaults import default_tokenize, default_train
from experiments.llama import llama3_tokenizer, llama_nano
from experiments.simple_train_config import SimpleTrainConfig
from marin.execution.executor import executor_main
from marin.resources import CpuOnlyConfig
# 1. Choose a dataset
tinystories_hf_id = "roneneldan/TinyStories"
# 2. Tokenize the dataset
tinystories_tokenized = default_tokenize(
    name=tinystories_hf_id,  # path to write tokenized files (tokenized/ will be prepended)
    dataset=tinystories_hf_id,  # HF dataset id
    tokenizer=llama3_tokenizer,
)
# 3. Define training configuration
nano_train_config = SimpleTrainConfig(
    # Here we define the hardware resources we need.
    resources=CpuOnlyConfig(num_cpus=1),
    train_batch_size=4,
    num_train_steps=100,
    # set hyperparameters
    learning_rate=6e-4,
    weight_decay=0.1,
    # keep eval quick for tutorial
    max_eval_batches=4,
)
# 4. Train the model
nano_tinystories_model = default_train(
    name="marin-nano-tinystories",
    # Steps can depend on other steps: nano_tinystories_model depends on tinystories_tokenized
    tokenized=tinystories_tokenized,
    model_config=llama_nano,
    train_config=nano_train_config,
    # wandb tags
    tags=["llama", "nano", "tinystories", "tutorial"],
    # We can run many [eval_harness](https://github.com/EleutherAI/lm-evaluation-harness) tasks in the loop
    # during training, but there's no point in running evals on such a tiny model
    eval_harness_tasks=[],
    # to keep tutorial fast, skip default validation sets
    use_default_validation=False,
)
if __name__ == "__main__":
    executor_main(steps=[
        nano_tinystories_model,
    ])
Here, we create two steps: one for tokenizing the dataset and one for training the model. The training step depends on the tokenized dataset step, so it will be executed after the tokenization step is completed.
With slight modifications, you can extend this script to train a larger model on a larger dataset or a mixture of datasets, scaling up to very large TPU pods (multislice TPU and, soon, multi-node GPUs!).
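As a minimal sketch of what such a modification might look like, the script below reuses only the helpers from the tutorial above. The FineWeb-Edu dataset id, batch size, and step count are illustrative assumptions rather than a tuned recipe, and for a real large-scale run you would swap `llama_nano` and `CpuOnlyConfig` for a larger model config and an accelerator resource config as described in the docs.

from experiments.defaults import default_tokenize, default_train
from experiments.llama import llama3_tokenizer, llama_nano
from experiments.simple_train_config import SimpleTrainConfig
from marin.execution.executor import executor_main
from marin.resources import CpuOnlyConfig

# Illustrative choice of a larger pretraining corpus (an assumption, not a recommendation).
fineweb_edu_hf_id = "HuggingFaceFW/fineweb-edu"

fineweb_edu_tokenized = default_tokenize(
    name=fineweb_edu_hf_id,
    dataset=fineweb_edu_hf_id,
    tokenizer=llama3_tokenizer,
)

larger_train_config = SimpleTrainConfig(
    # Swap this for an accelerator resource config when scaling up; CPU is only for illustration.
    resources=CpuOnlyConfig(num_cpus=8),
    train_batch_size=256,   # illustrative values, not a tuned recipe
    num_train_steps=10_000,
    learning_rate=3e-4,
    weight_decay=0.1,
)

larger_model = default_train(
    name="marin-fineweb-edu-sketch",
    tokenized=fineweb_edu_tokenized,
    model_config=llama_nano,  # replace with a larger model config for a real run
    train_config=larger_train_config,
    tags=["llama", "fineweb-edu", "sketch"],
)

if __name__ == "__main__":
    # Passing only the final step is enough: the executor resolves the dependency
    # and runs tokenization before training.
    executor_main(steps=[larger_model])

The structure is identical to the TinyStories script; only the dataset, training budget, and (in a real run) the model and resource configs change.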