<a href="https://colab.research.google.com/github/khalilhimura/oumi-explore/blob/main/notebooks/Oumi%20-%20A%20Tour.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# A Tour of Oumi

This tutorial will give you a brief overview of Oumi's core functionality. We'll cover:

1. Training a model
1. Performing model inference
1. Evaluating a model against common benchmarks
1. Launching jobs
1. Customizing datasets and clusters

# 📋 Prerequisites
## Oumi Installation

First, let's install Oumi. You can find more detailed instructions [here](https://oumi.ai/docs/en/latest/get_started/installation.html).

If you have a GPU, you can run the following commands to install Oumi:


In [4]:
%pip install uv -q
!uv pip install oumi --no-progress --system

Using Python 3.11.11 environment at: /usr
Resolved 144 packages in 550ms
Uninstalled 14 packages in 679ms
Installed 14 packages in 352ms
 - nvidia-cublas-cu12==12.4.5.8
 + nvidia-cublas-cu12==12.1.3.1
 - nvidia-cuda-cupti-cu12==12.4.127
 + nvidia-cuda-cupti-cu12==12.1.105
 - nvidia-cuda-nvrtc-cu12==12.4.127
 + nvidia-cuda-nvrtc-cu12==12.1.105
 - nvidia-cuda-runtime-cu12==12.4.127
 + nvidia-cuda-runtime-cu12==12.1.105
 - nvidia-cufft-cu12==11.2.1.3
 + nvidia-cufft-cu12==11.0.2.54
 - nvidia-curand-cu12==10.3.5.147
 + nvidia-curand-cu12==10.3.2.106
 - nvidia-cusolver-cu12==11.6.1.9

In [5]:
import os
from pathlib import Path

tutorial_dir = "tour_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Disable warnings from HF.

# ⚒️ Training a Model

Oumi supports training both custom and out-of-the-box models. Want to try out a model from Hugging Face? You can do that. Want to train your own custom PyTorch model? No problem.

## A Quick Demo

Let's try training a pre-existing model from Hugging Face. We'll use SmolLM2 135M because it's small and trains quickly.

Oumi uses [training configuration files](https://oumi.ai/docs/en/latest/api/oumi.core.configs.html#oumi.core.configs.TrainingConfig) to specify training parameters. The next cell writes a small training config for SmolLM2 to `train.yaml`; let's give it a try!

In [6]:
yaml_content = f"""
model:
  model_name: "HuggingFaceTB/SmolLM2-135M-Instruct"
  torch_dtype_str: "bfloat16"
  trust_remote_code: True

data:
  train:
    datasets:
      - dataset_name: "yahma/alpaca-cleaned"
    target_col: "prompt"

training:
  trainer_type: "TRL_SFT"
  per_device_train_batch_size: 2
  max_steps: 10 # Quick "mini" training, for demo purposes only.
  run_name: "smollm2_135m_sft"
  output_dir: "{tutorial_dir}/output"
"""

with open(f"{tutorial_dir}/train.yaml", "w") as f:
    f.write(yaml_content)
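
The config above is a Python f-string, so `{tutorial_dir}` is substituted before the text is written to disk. The sketch below shows that substitution in isolation, along with one pitfall: literal braces inside an f-string have to be doubled. The `flow_style` line is a hypothetical example and is not part of the tutorial's config.

```python
tutorial_dir = "tour_tutorial"

# The f-string replaces {tutorial_dir} before the YAML text is saved.
yaml_snippet = f"""
training:
  output_dir: "{tutorial_dir}/output"
"""

# Hypothetical line using YAML flow style: literal braces must be written
# as {{ and }} so Python does not treat them as placeholders.
flow_style = f"envs: {{LOG_DIR: {tutorial_dir}/logs}}"

print(yaml_snippet.strip())
print(flow_style)
```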

In [7]:
from oumi.core.configs import TrainingConfig
from oumi.train import train

config = TrainingConfig.from_yaml(str(Path(tutorial_dir) / "train.yaml"))

train(config)

[2025-01-30 09:44:55,279][oumi][rank0][pid:4366][MainThread][INFO]][torch_utils.py:66] Torch version: 2.6.0+cu124. NumPy version: 1.26.4
[2025-01-30 09:44:55,313][oumi][rank0][pid:4366][MainThread][INFO]][torch_utils.py:72] CUDA version: 12.4 CuDNN version: 90.1.0
[2025-01-30 09:44:55,469][oumi][rank0][pid:4366][MainThread][INFO]][torch_utils.py:106] CPU cores: 2 CUDA devices: 1
device(0)='Tesla T4' Capability: (7, 5) Memory: [Total: 14.74GiB Free: 14.64GiB Allocated: 0.0GiB Cached: 0.0GiB]
[2025-01-30 09:44:55,482][oumi][rank0][pid:4366][MainThread][INFO]][train.py:133] Oumi version: 0.1.3
[2025-01-30 09:44:55,490][oumi][rank0][pid:4366][MainThread][INFO]][train.py:174] TrainingConfig:
TrainingConfig(data=DataParams(train=DatasetSplitParams(datasets=[DatasetParams(dataset_name='yahma/alpaca-cleaned',
                                                                                dataset_path=None,
                                                                                subset=N

max_steps is given, it will override any value given in num_train_epochs


[2025-01-30 09:45:01,249][oumi][rank0][pid:4366][MainThread][INFO]][device_utils.py:283] GPU Metrics Before Training: GPU runtime info: None.
[2025-01-30 09:45:01,251][oumi][rank0][pid:4366][MainThread][INFO]][train.py:312] Training init time: 5.972s
[2025-01-30 09:45:01,253][oumi][rank0][pid:4366][MainThread][INFO]][train.py:313] Starting training... (TrainerType.TRL_SFT, transformers: 4.45.2)


Step,Training Loss


[2025-01-30 09:45:09,276][oumi][rank0][pid:4366][MainThread][INFO]][train.py:320] Training is Complete.
[2025-01-30 09:45:09,278][oumi][rank0][pid:4366][MainThread][INFO]][device_utils.py:283] GPU Metrics After Training: GPU runtime info: None.
[2025-01-30 09:45:09,281][oumi][rank0][pid:4366][MainThread][INFO]][torch_utils.py:117] Peak GPU memory usage: 2.06 GB
[2025-01-30 09:45:09,283][oumi][rank0][pid:4366][MainThread][INFO]][train.py:327] Saving final state...
[2025-01-30 09:45:09,285][oumi][rank0][pid:4366][MainThread][INFO]][train.py:332] Saving final model...
[2025-01-30 09:45:12,429][oumi][rank0][pid:4366][MainThread][INFO]][hf_trainer.py:102] Model has been saved at tour_tutorial/output
[2025-01-30 09:45:12,432][oumi][rank0][pid:4366][MainThread][INFO]][train.py:339] 

» We're always looking for feedback. What's one thing we can improve? https://oumi.ai/feedback


Congratulations, you've trained your first model using Oumi!

You can also train your own custom PyTorch model. We cover that in depth in our [Finetuning Tutorial](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Finetuning%20Tutorial.ipynb).
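
The training log reports that the model was saved at `tour_tutorial/output`. If you want to see what was written there, a small standard-library helper along these lines can list the files. The function name `list_artifacts` is our own and is not part of Oumi.

```python
from pathlib import Path


def list_artifacts(output_dir: str) -> list[str]:
    """Return the relative paths of all files under output_dir, sorted."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing to list if training has not been run yet.
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


for name in list_artifacts("tour_tutorial/output"):
    print(name)
```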

# 🧠 Model Inference

Now that you've trained a model, let's run inference.

In [8]:
yaml_content = f"""
model:
  model_name: "{tutorial_dir}/output"
  torch_dtype_str: "bfloat16"

generation:
  max_new_tokens: 128
  batch_size: 1
"""

with open(f"{tutorial_dir}/infer.yaml", "w") as f:
    f.write(yaml_content)

In [9]:
from oumi.core.configs import InferenceConfig
from oumi.infer import infer

config = InferenceConfig.from_yaml(str(Path(tutorial_dir) / "infer.yaml"))

input_text = (
    "Remember that we didn't train for long, so the results might not be great."
)

results = infer(config=config, inputs=[input_text])

print(results[0])

[2025-01-30 09:45:28,413][oumi][rank0][pid:4366][MainThread][INFO]][models.py:185] Building model using device_map: auto (DeviceRankInfo(world_size=1, rank=0, local_world_size=1, local_rank=0))...
[2025-01-30 09:45:28,416][oumi][rank0][pid:4366][MainThread][INFO]][models.py:255] Using model class: <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'> to instantiate model.
[2025-01-30 09:45:29,239][oumi][rank0][pid:4366][MainThread][INFO]][native_text_inference_engine.py:111] Setting EOS token id to `2`


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


conversation_id=None messages=[USER: Remember that we didn't train for long, so the results might not be great., ASSISTANT: I'm sorry for the inconvenience, but as a chatbot, I don't have the ability to access or process data from external sources. I'm designed to provide information and guidance based on the information I have available. I'm designed to be helpful and informative, but I don't have the capability to access or process data from external sources. If you have any specific questions or need help with a particular topic, feel free to ask.] metadata={}


We can also run inference using the pretrained model by slightly tweaking our config:

In [10]:
base_model_config = InferenceConfig.from_yaml(str(Path(tutorial_dir) / "infer.yaml"))
base_model_config.model.model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

input_text = "Input for the pretrained model: What is your name? "

results = infer(config=base_model_config, inputs=[input_text])

print(results[0])

[2025-01-30 09:45:46,768][oumi][rank0][pid:4366][MainThread][INFO]][models.py:185] Building model using device_map: auto (DeviceRankInfo(world_size=1, rank=0, local_world_size=1, local_rank=0))...
[2025-01-30 09:45:46,888][oumi][rank0][pid:4366][MainThread][INFO]][models.py:255] Using model class: <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'> to instantiate model.
[2025-01-30 09:45:47,907][oumi][rank0][pid:4366][MainThread][INFO]][native_text_inference_engine.py:111] Setting EOS token id to `2`
conversation_id=None messages=[USER: Input for the pretrained model: What is your name? , ASSISTANT: My name is Alex Chen. I'm a data scientist and AI assistant, trained on a vast dataset of text data, which I use to train my models for various tasks.] metadata={}
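
Each result printed above is a conversation object holding the user message followed by the assistant's reply. To pull out only the reply text, a duck-typed helper like the one below may be enough. It assumes each message exposes `role` and `content` attributes; only the `messages` field is visible in the output above, so treat the other two names as assumptions and check them against the Oumi API reference. The demo uses stand-in objects rather than real `infer` results.

```python
from types import SimpleNamespace


def last_assistant_text(conversation):
    """Return the content of the last assistant message, or None if absent."""
    for message in reversed(conversation.messages):
        # Accept either a plain string role or an enum with a .value attribute.
        role = getattr(message.role, "value", message.role)
        if str(role).lower() == "assistant":
            return message.content
    return None


# Stand-in conversation for illustration only; real ones come from infer().
demo = SimpleNamespace(
    messages=[
        SimpleNamespace(role="user", content="What is your name?"),
        SimpleNamespace(role="assistant", content="My name is Alex Chen."),
    ]
)
print(last_assistant_text(demo))
```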


# 📊 Evaluating a Model against Common Benchmarks

You can use Oumi to evaluate pretrained and tuned models against standard benchmarks. For example, let's evaluate our tuned model against the MMLU college computer science task (`mmlu_college_computer_science`):

In [11]:
yaml_content = f"""
model:
  model_name: "{tutorial_dir}/output"
  torch_dtype_str: "bfloat16"

tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu_college_computer_science

generation:
  batch_size: null # This will let LM HARNESS find the maximum possible batch size.
output_dir: "{tutorial_dir}/output/evaluation"
"""

with open(f"{tutorial_dir}/eval.yaml", "w") as f:
    f.write(yaml_content)

In [12]:
from oumi.core.configs import EvaluationConfig
from oumi.evaluate import evaluate

eval_config = EvaluationConfig.from_yaml(str(Path(tutorial_dir) / "eval.yaml"))

# Uncomment the following line to run evals against the V1 HuggingFace Leaderboard.
# This may take a while.
# eval_config.data.datasets[0].dataset_name = "huggingface_leaderboard_v1"

evaluate(eval_config)

[2025-01-30 09:46:18,944][oumi][rank0][pid:4366][MainThread][INFO]][lm_harness.py:110] Starting evaluation...
[2025-01-30 09:46:18,946][oumi][rank0][pid:4366][MainThread][INFO]][lm_harness.py:111] 	LM Harness `model_params`:
{'device_map': 'auto',
 'dtype': torch.bfloat16,
 'parallelize': False,
 'pretrained': 'tour_tutorial/output',
 'trust_remote_code': False}
[2025-01-30 09:46:18,948][oumi][rank0][pid:4366][MainThread][INFO]][lm_harness.py:112] 	LM Harness `task_params`:
LMHarnessTaskParams(evaluation_platform='lm_harness',
                    task_name='mmlu_college_computer_science',
                    num_samples=None,
                    eval_kwargs={},
                    num_fewshot=None)


INFO:lm-eval:Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
INFO:lm-eval:Initializing hf model, with arguments: {'pretrained': 'tour_tutorial/output', 'trust_remote_code': False, 'parallelize': False, 'dtype': torch.bfloat16, 'device_map': 'auto'}
INFO:lm-eval:Using device 'cuda:0'
INFO:lm-eval:Model parallel was set to False.


README.md:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

mmlu_no_train.py:   0%|          | 0.00/5.86k [00:00<?, ?B/s]

data.tar:   0%|          | 0.00/166M [00:00<?, ?B/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating dev split: 0 examples [00:00, ? examples/s]

INFO:lm-eval:Building contexts for mmlu_college_computer_science on rank 0...
100%|██████████| 100/100 [00:00<00:00, 642.71it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests:   0%|          | 0/400 [00:00<?, ?it/s]

Passed argument batch_size = auto:1. Detecting largest batch size
Determined largest batch size: 64


Running loglikelihood requests: 100%|██████████| 400/400 [00:12<00:00, 30.86it/s]


[2025-01-30 09:46:47,773][oumi][rank0][pid:4366][MainThread][INFO]][lm_harness.py:132] mmlu_college_computer_science's metric dict is {'acc,none': 0.26,
 'acc_stderr,none': 0.044084400227680794,
 'alias': 'college_computer_science'}


[{'results': {'mmlu_college_computer_science': {'alias': 'college_computer_science',
    'acc,none': 0.26,
    'acc_stderr,none': 0.044084400227680794}}}]
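
To put the reported accuracy in context, it helps to compare it with random guessing. MMLU questions have four answer choices (the log shows 400 loglikelihood requests for 100 questions), so chance accuracy is 0.25. The sketch below copies the result dictionary from the output above and expresses the gap to chance in units of the reported standard error; the `summarize` helper is our own.

```python
# Copied from the evaluation output above.
results = [
    {
        "results": {
            "mmlu_college_computer_science": {
                "alias": "college_computer_science",
                "acc,none": 0.26,
                "acc_stderr,none": 0.044084400227680794,
            }
        }
    }
]

CHANCE = 0.25  # four answer choices per MMLU question


def summarize(results, chance=CHANCE):
    """Return (task, accuracy, stderr, distance from chance in stderr units)."""
    rows = []
    for entry in results:
        for task, metrics in entry["results"].items():
            acc = metrics["acc,none"]
            err = metrics["acc_stderr,none"]
            rows.append((task, acc, err, (acc - chance) / err))
    return rows


for task, acc, err, z in summarize(results):
    print(f"{task}: acc={acc:.2f} +/- {err:.3f} ({z:+.2f} stderr from chance)")
```

An accuracy well within one standard error of 0.25 is statistically indistinguishable from guessing, which is unsurprising for a 135M-parameter model after only 10 training steps.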

In [13]:
!pip install --upgrade Pillow torchvision

Collecting Pillow
  Downloading pillow-11.1.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (9.1 kB)
Collecting torchvision
  Downloading torchvision-0.21.0-cp311-cp311-manylinux1_x86_64.whl.metadata (6.1 kB)
Collecting torch==2.6.0 (from torchvision)
  Downloading torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.6.0->torchvision)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.6.0->torchvision)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch==2.6.0->torchvision)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch==2.6.0->torchvision)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86

# ☁️ Launching Jobs

Oftentimes you'll need to run various tasks (training, evaluation, etc.) on remote hardware that's better suited for the task. Oumi can handle this for you by launching jobs on various compute clusters. For more information about running jobs, see our [Running Jobs Remotely tutorial](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Running%20Jobs%20Remotely.ipynb). For running jobs on custom clusters, see our [Launching Jobs on Custom Clusters tutorial](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Launching%20Jobs%20on%20Custom%20Clusters.ipynb).


Today, Oumi supports running jobs on several cloud provider platforms.

For the latest list, we can call the `which_clouds` function:

In [13]:
import oumi.launcher as launcher

print("Supported Clouds in Oumi:")
for cloud in launcher.which_clouds():
    print(cloud)

Supported Clouds in Oumi:
local
polaris
runpod
gcp
lambda
aws
azure


Let's run a simple "Hello World" job locally to demonstrate how to use the Oumi job launcher. This job will echo `Hello World`, then run the same training job executed above. Running this job on a cloud provider like GCP simply involves changing the `cloud` field.

In [14]:
yaml_content = f"""
name: hello-world
resources:
  cloud: local

working_dir: .

envs:
  TEST_ENV_VARIABLE: '"Hello, World!"'
  OUMI_LOGGING_DIR: "{tutorial_dir}/logs"

run: |
  echo "$TEST_ENV_VARIABLE"
  oumi train -c {tutorial_dir}/train.yaml
"""

with open(f"{tutorial_dir}/job.yaml", "w") as f:
    f.write(yaml_content)

In [15]:
import time

job_config = launcher.JobConfig.from_yaml(str(Path(tutorial_dir) / "job.yaml"))
cluster, job_status = launcher.up(job_config, cluster_name=None)

while job_status and not job_status.done:
    print("Job is running...")
    time.sleep(15)
    job_status = cluster.get_job(job_status.id)
print("Job is done!")

Job is running...
Job is running...
Job is running...
Job is done!
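
The polling loop above runs until the job reports `done`, with no upper bound on the wait. For longer jobs you may want a timeout. The helper below is a minimal sketch of that idea. It is not part of Oumi, and it only assumes that the status object has a `done` attribute, as used in the loop above. The demo drives it with stand-in status objects.

```python
import time
from types import SimpleNamespace


def wait_for_job(get_status, poll_seconds=15.0, timeout_seconds=3600.0,
                 sleep=time.sleep):
    """Poll get_status() until status.done is true or the timeout elapses.

    Returns the final status, or raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status and not status.done:
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(poll_seconds)
        status = get_status()
    return status


# Stand-in statuses for illustration: two "running" polls, then "done".
fake = iter([
    SimpleNamespace(done=False),
    SimpleNamespace(done=False),
    SimpleNamespace(done=True),
])
final = wait_for_job(lambda: next(fake), poll_seconds=0.0)
print(final.done)
```

With the objects from the cell above, the call would look like `wait_for_job(lambda: cluster.get_job(job_status.id))`.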


The job created logs under our tutorial directory. Let's take a look at the directory:

In [16]:
logs_dir = f"{tutorial_dir}/logs"
os.listdir(logs_dir)

['2025_01_30_09_52_19_598_0.stdout', '2025_01_30_09_52_19_598_0.stderr']

Now let's print the contents of the log files.

In [17]:
for log_file in Path(logs_dir).iterdir():
    print(f"Log file: {log_file}")
    with open(log_file) as f:
        print(f.read())

Log file: tour_tutorial/logs/2025_01_30_09_52_19_598_0.stdout
"Hello, World!"

@@@@@@@@@@@@@@@@@@@
@                 @
@   @@@@@  @  @   @
@   @   @  @  @   @
@   @@@@@  @@@@   @
@                 @
@   @@@@@@@   @   @
@   @  @  @   @   @
@   @  @  @   @   @
@                 @
@@@@@@@@@@@@@@@@@@@

[2025-01-30 09:52:27,261][oumi][rank0][pid:6786][MainThread][INFO]][distributed.py:546] Setting random seed to 42 on rank 0.
[2025-01-30 09:52:31,697][oumi][rank0][pid:6786][MainThread][INFO]][torch_utils.py:66] Torch version: 2.4.1+cu121. NumPy version: 1.26.4
[2025-01-30 09:52:31,699][oumi][rank0][pid:6786][MainThread][INFO]][torch_utils.py:72] CUDA version: 12.1 CuDNN version: 90.1.0
[2025-01-30 09:52:31,710][oumi][rank0][pid:6786][MainThread][INFO]][torch_utils.py:106] CPU cores: 2 CUDA devices: 1
device(0)='Tesla T4' Capability: (7, 5) Memory: [Total: 14.74GiB Free: 6.17GiB Allocated: 0.0GiB Cached: 0.0GiB]
[2025-01-30 09:52:31,712][oumi][rank0][pid:6786][MainThread][INFO]][train.py:133
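
Training logs can get long, so printing whole files is not always practical. As an alternative, the sketch below keeps only the last few lines of each `.stdout` and `.stderr` file. The `tail_logs` helper is our own, built on the standard library.

```python
from pathlib import Path


def tail_logs(logs_dir: str, max_lines: int = 20) -> dict[str, list[str]]:
    """Map each .stdout/.stderr file name to its last max_lines lines."""
    root = Path(logs_dir)
    if not root.is_dir():
        return {}
    tails = {}
    for log_file in sorted(root.glob("*.std*")):
        lines = log_file.read_text(errors="replace").splitlines()
        tails[log_file.name] = lines[-max_lines:]
    return tails


for name, lines in tail_logs("tour_tutorial/logs", max_lines=5).items():
    print(f"== {name} ==")
    print("\n".join(lines))
```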

# ⚙️ Customizing Datasets and Clusters

Oumi offers rich customization that allows users to build custom solutions on top of our existing building blocks. Several of Oumi's primary resources (Datasets, Clouds, etc.) leverage the Oumi Registry when invoked.

This registry allows users to build custom classes that function as drop-in replacements for core functionality.

For more details on registering custom datasets, see the [tutorial here](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Datasets%20Tutorial.ipynb).

For a tutorial on writing a custom cloud/cluster for running jobs, see the [tutorial here](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Launching%20Jobs%20on%20Custom%20Clusters.ipynb).

You can find further information about the required registry decorators [here](https://oumi.ai/docs/en/latest/api/oumi.core.registry.html#oumi.core.registry.register_cloud_builder).
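
If the registry idea is new to you, the toy example below illustrates the general decorator-based pattern: a decorator records a class under a name, and a builder looks it up later. This is a simplified illustration in plain Python, not Oumi's implementation; every name in it (`register`, `build`, `ToyDataset`) is hypothetical. See the linked tutorials for the actual Oumi decorators.

```python
from typing import Callable, Dict, Type

# Toy registry mapping a string name to a class.
_REGISTRY: Dict[str, Type] = {}


def register(name: str) -> Callable[[Type], Type]:
    """Class decorator that stores the class under the given name."""
    def decorator(cls: Type) -> Type:
        if name in _REGISTRY:
            raise ValueError(f"'{name}' is already registered")
        _REGISTRY[name] = cls
        return cls
    return decorator


def build(name: str, **kwargs):
    """Instantiate the class registered under name."""
    return _REGISTRY[name](**kwargs)


@register("toy_dataset")
class ToyDataset:
    def __init__(self, size: int = 3):
        self.rows = [f"example {i}" for i in range(size)]


dataset = build("toy_dataset", size=2)
print(dataset.rows)
```

Because lookup happens by name, a config file can refer to `"toy_dataset"` as a string, and the framework never needs to import the custom class directly.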

# 🧭 What's Next?

Now that you've completed the basic tour, you're ready to tackle the other [notebook guides & tutorials](https://oumi.ai/docs/en/latest/get_started/tutorials.html).

If you have not already, make sure to take a look at the [Quickstart](https://oumi.ai/docs/en/latest/get_started/quickstart.html) for an overview of our CLI.