Metaflow-Trainium Examples

This repository contains examples that demonstrate how to use Metaflow to define and run machine learning training jobs on AWS Trainium. The training jobs run as AWS Batch jobs on Amazon EC2 trn1 instances.

To run these examples, first provision the AWS resources for Metaflow and AWS Batch. The installation guide (install_metaflow_and_batch.md) explains how to deploy the required resources with CloudFormation and finalize your Metaflow setup.

Step 1: Deploy infrastructure

[Video: deploy-infra.mp4]

Step 2: Configure Metaflow

[Video: metaflow-configure.mp4]
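After configuring Metaflow, it can help to confirm that the resulting configuration contains the settings an AWS Batch run relies on. The following is a minimal sketch, not part of this repository. It assumes the configuration is stored as JSON at Metaflow's default location, ~/.metaflowconfig/config.json, and the list of expected keys is an assumption chosen for illustration; adjust it to match what your deployment actually writes. The helper names `load_config` and `missing_keys` are hypothetical.

```python
import json
from pathlib import Path

# Assumed default location of the Metaflow configuration file.
DEFAULT_CONFIG_PATH = Path.home() / ".metaflowconfig" / "config.json"

# Assumption: keys an S3 + AWS Batch setup is expected to define.
# Adjust this tuple to match your own deployment.
EXPECTED_KEYS = (
    "METAFLOW_DEFAULT_DATASTORE",
    "METAFLOW_DATASTORE_SYSROOT_S3",
    "METAFLOW_BATCH_JOB_QUEUE",
)


def load_config(path=DEFAULT_CONFIG_PATH):
    """Read the Metaflow configuration file and return it as a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def missing_keys(config, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent or empty, in order."""
    return [key for key in expected if not config.get(key)]


if __name__ == "__main__":
    problems = missing_keys(load_config())
    if problems:
        print("Missing or empty settings:", ", ".join(problems))
    else:
        print("All expected settings are present.")
```

An empty result from `missing_keys` only shows that the settings exist, not that the referenced AWS resources are reachable; the allreduce example remains the real end-to-end test.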

Step 3: Run experiments

Once the required resources have been created and configured, run the included allreduce example (allreduce-trn) as a basic test of the Metaflow/Trainium/Batch setup. When the allreduce example runs successfully, you can proceed to more realistic workflows such as Llama2-7b pretraining.

AWS Trainium is currently supported in us-east-1, us-east-2, and us-west-2. Please make sure that you are working in one of these supported regions.
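A quick pre-flight check can catch a wrong region before any job is submitted. The sketch below is illustrative only: the region set is copied from the sentence above, and the helper names `is_supported_region` and `current_region` are hypothetical. It assumes the region is provided through the standard `AWS_REGION` or `AWS_DEFAULT_REGION` environment variables, which may not be how your environment is configured.

```python
import os

# Regions listed above as supporting AWS Trainium.
SUPPORTED_TRAINIUM_REGIONS = frozenset({"us-east-1", "us-east-2", "us-west-2"})


def is_supported_region(region):
    """Return True if the given region name is in the supported set."""
    return region.strip().lower() in SUPPORTED_TRAINIUM_REGIONS


def current_region(env=None):
    """Look up the region from environment variables, or return None."""
    env = os.environ if env is None else env
    return env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")


if __name__ == "__main__":
    region = current_region()
    if region is None:
        print("No region found in AWS_REGION or AWS_DEFAULT_REGION.")
    elif is_supported_region(region):
        print(f"{region} is in the supported list.")
    else:
        print(f"{region} is not in the supported list.")
```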

Example registry

We have included the following examples, and are happy to take requests to expand the list. Note that some sub-directories for Trainium have counterpart implementations for running comparisons on GPUs. This is not intended to be a benchmarking repository, but running a comparison against GPUs you have access to is useful for understanding general performance characteristics relative to other hardware architectures.

allreduce-trn
bert-finetune-gpu
bert-finetune-trn
cfn
docker
llama2-7b-finetune-gpu-single-node
llama2-7b-finetune-gpu
llama2-7b-finetune-trn
llama2-7b-pretrain-trn
static
.gitignore
README.md
install_metaflow_and_batch.md
requirements.txt

outerbounds/metaflow-trainium