<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models, from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# A Tour of Oumi

This tutorial will give you a brief overview of Oumi's core functionality. We'll cover:

1. Training a model
1. Performing model inference
1. Evaluating a model against common benchmarks
1. Launching jobs
1. Customizing datasets and clusters

# 📋 Prerequisites
## Oumi Installation

First, let's install Oumi. You can find more detailed instructions [here](https://oumi.ai/docs/en/latest/get_started/installation.html).

If you have a GPU, you can run the following commands to install Oumi:


In [1]:
%pip install uv -q
!uv pip install oumi --no-progress --system


❗**WARNING:** After the first `pip install`, you may have to restart the notebook for the package updates to take effect (Colab Menu: `Runtime` -> `Restart Session`).
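
Once the session has restarted, it can be worth confirming that the kernel actually sees the freshly installed package. Below is a minimal sketch using only the standard library's `importlib.metadata`; `installed_version` is a helper name of our own, not part of Oumi.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it isn't installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# If this reports "not installed" right after the pip cell, restart the session.
print("oumi version:", installed_version("oumi") or "not installed")
```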

In [2]:
import os
from pathlib import Path

tutorial_dir = "tour_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Disable warnings from HF.

# ⚒️ Training a Model

Oumi supports training both custom and out-of-the-box models. Want to try out a model from HuggingFace? You can do that. Want to train your own custom PyTorch model? No problem.

## A Quick Demo

Let's try training a pre-existing model from HuggingFace. We'll use SmolLM2 135M, as it's small and trains quickly.

Oumi uses [training configuration files](https://oumi.ai/docs/en/latest/api/oumi.core.configs.html#oumi.core.configs.TrainingConfig) to specify training parameters. Here's a ready-made training config for SmolLM2; let's give it a try!

In [3]:
yaml_content = f"""
model:
  model_name: "HuggingFaceTB/SmolLM2-135M-Instruct"
  torch_dtype_str: "bfloat16"
  trust_remote_code: True

data:
  train:
    datasets:
      - dataset_name: "yahma/alpaca-cleaned"
    target_col: "prompt"

training:
  trainer_type: "TRL_SFT"
  per_device_train_batch_size: 2
  max_steps: 10 # Quick "mini" training, for demo purposes only.
  run_name: "smollm2_135m_sft"
  output_dir: "{tutorial_dir}/output"
"""

with open(f"{tutorial_dir}/train.yaml", "w") as f:
    f.write(yaml_content)
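
If you want to experiment with other settings (for example, a longer run), it can help to generate the config from a function instead of editing the string by hand. The sketch below assumes that workflow; `render_train_yaml` is a hypothetical helper that reproduces the YAML above with `max_steps`, the per-device batch size, and `output_dir` exposed as parameters.

```python
def render_train_yaml(
    max_steps: int = 10,
    batch_size: int = 2,
    output_dir: str = "tour_tutorial/output",
) -> str:
    """Return the SmolLM2 SFT config from the cell above, with a few knobs exposed."""
    return f"""
model:
  model_name: "HuggingFaceTB/SmolLM2-135M-Instruct"
  torch_dtype_str: "bfloat16"
  trust_remote_code: True

data:
  train:
    datasets:
      - dataset_name: "yahma/alpaca-cleaned"
    target_col: "prompt"

training:
  trainer_type: "TRL_SFT"
  per_device_train_batch_size: {batch_size}
  max_steps: {max_steps}
  run_name: "smollm2_135m_sft"
  output_dir: "{output_dir}"
"""
```

For example, `render_train_yaml(max_steps=25)` yields the same config with a different step count; write it to a new file and pass that path to `TrainingConfig.from_yaml`.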

In [4]:
from oumi.core.configs import TrainingConfig
from oumi.train import train

config = TrainingConfig.from_yaml(str(Path(tutorial_dir) / "train.yaml"))

train(config)

[2025-02-04 06:20:12,239][oumi][rank0][pid:466][MainThread][INFO]][train.py:96] Creating training.output_dir: tour_tutorial/output...
[2025-02-04 06:20:12,242][oumi][rank0][pid:466][MainThread][INFO]][train.py:98] Created training.output_dir absolute path: /content/tour_tutorial/output
[2025-02-04 06:20:12,244][oumi][rank0][pid:466][MainThread][INFO]][train.py:96] Creating training.telemetry_dir: tour_tutorial/output/telemetry...
[2025-02-04 06:20:12,246][oumi][rank0][pid:466][MainThread][INFO]][train.py:98] Created training.telemetry_dir absolute path: /content/tour_tutorial/output/telemetry
[2025-02-04 06:20:12,248][oumi][rank0][pid:466][MainThread][INFO]][torch_utils.py:66] Torch version: 2.4.1+cu121. NumPy version: 1.26.4
[2025-02-04 06:20:12,303][oumi][rank0][pid:466][MainThread][INFO]][torch_utils.py:72] CUDA version: 12.1 CuDNN version: 90.1.0
[2025-02-04 06:20:12,618][oumi][rank0][pid:466][MainThread][INFO]][torch_utils.py:106] CPU cores: 2 CUDA devices: 1
device(0)='Tesla T4' 

[2025-02-04 06:20:20,706][oumi][rank0][pid:466][MainThread][INFO]][models.py:185] Building model using device_map: auto (DeviceRankInfo(world_size=1, rank=0, local_world_size=1, local_rank=0))...
[2025-02-04 06:20:20,708][oumi][rank0][pid:466][MainThread][INFO]][models.py:255] Using model class: <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'> to instantiate model.


[2025-02-04 06:20:29,415][oumi][rank0][pid:466][MainThread][INFO]][base_map_dataset.py:68] Creating map dataset (type: AlpacaDataset) dataset_name: 'yahma/alpaca-cleaned', dataset_path: 'None'...


[2025-02-04 06:20:34,522][oumi][rank0][pid:466][MainThread][INFO]][base_map_dataset.py:472] Dataset Info:
	Split: train
	Version: 0.0.0
	Dataset size: 40283906
	Download size: 44307561
	Size: 84591467 bytes
	Rows: 51760
	Columns: ['output', 'input', 'instruction']
[2025-02-04 06:20:34,967][oumi][rank0][pid:466][MainThread][INFO]][base_map_dataset.py:411] Loaded DataFrame with shape: (51760, 3). Columns:
output         object
input          object
instruction    object
dtype: object


[2025-02-04 06:20:35,035][oumi][rank0][pid:466][MainThread][INFO]][base_map_dataset.py:297] AlpacaDataset: features=dict_keys(['input_ids', 'attention_mask'])


[2025-02-04 06:21:29,520][oumi][rank0][pid:466][MainThread][INFO]][base_map_dataset.py:361] Finished transforming dataset (AlpacaDataset)! Speed: 950.06 examples/sec. Examples: 51760. Duration: 54.5 sec. Transform workers: 1.
[2025-02-04 06:21:29,898][oumi][rank0][pid:466][MainThread][INFO]][torch_profiler_utils.py:150] PROF: Torch Profiler disabled!
[2025-02-04 06:21:29,923][oumi][rank0][pid:466][MainThread][INFO]][training.py:49] SFTConfig(output_dir='tour_tutorial/output',
          overwrite_output_dir=False,
          do_train=False,
          do_eval=False,
          do_predict=False,
          eval_strategy=<IntervalStrategy.NO: 'no'>,
          prediction_loss_only=False,
          per_device_train_batch_size=2,
          per_device_eval_batch_size=8,
          per_gpu_train_batch_size=None,
          per_gpu_eval_batch_size=None,
          gradient_accumulation_steps=1,
          eval_accumulation_steps=None,
          eval_delay=0,
          torch_empty_cache_steps=None,
    

max_steps is given, it will override any value given in num_train_epochs


[2025-02-04 06:21:29,940][oumi][rank0][pid:466][MainThread][INFO]][device_utils.py:283] GPU Metrics Before Training: GPU runtime info: None.
[2025-02-04 06:21:29,943][oumi][rank0][pid:466][MainThread][INFO]][train.py:312] Training init time: 77.703s
[2025-02-04 06:21:29,944][oumi][rank0][pid:466][MainThread][INFO]][train.py:313] Starting training... (TrainerType.TRL_SFT, transformers: 4.45.2)


[2025-02-04 06:21:45,150][oumi][rank0][pid:466][MainThread][INFO]][train.py:320] Training is Complete.
[2025-02-04 06:21:45,153][oumi][rank0][pid:466][MainThread][INFO]][device_utils.py:283] GPU Metrics After Training: GPU runtime info: None.
[2025-02-04 06:21:45,156][oumi][rank0][pid:466][MainThread][INFO]][torch_utils.py:117] Peak GPU memory usage: 1.86 GB
[2025-02-04 06:21:45,158][oumi][rank0][pid:466][MainThread][INFO]][train.py:327] Saving final state...
[2025-02-04 06:21:45,160][oumi][rank0][pid:466][MainThread][INFO]][train.py:332] Saving final model...
[2025-02-04 06:21:46,596][oumi][rank0][pid:466][MainThread][INFO]][hf_trainer.py:102] Model has been saved at tour_tutorial/output
[2025-02-04 06:21:46,598][oumi][rank0][pid:466][MainThread][INFO]][train.py:339] 

» We're always looking for feedback. What's one thing we can improve? https://oumi.ai/feedback


Congratulations, you've trained your first model using Oumi!
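
How much of the dataset did that run actually see? The training log reports 51,760 rows, `per_device_train_batch_size=2`, `gradient_accumulation_steps=1`, and a world size of 1, while the config caps training at `max_steps: 10`. Assuming the usual Hugging Face Trainer convention (examples per optimizer step = per-device batch size x accumulation steps x number of devices), the arithmetic is:

```python
max_steps = 10
per_device_batch_size = 2
grad_accum_steps = 1
world_size = 1
dataset_rows = 51_760

# Examples consumed per optimizer step, times the number of steps.
examples_seen = max_steps * per_device_batch_size * grad_accum_steps * world_size
fraction = examples_seen / dataset_rows
print(f"{examples_seen} examples, {fraction:.4%} of the dataset")  # 20 examples, 0.0386% of the dataset
```

That tiny fraction is why the config comment calls this a "mini" training, for demo purposes only.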

You can also train your own custom PyTorch model. We cover that in depth in our [Finetuning Tutorial](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Finetuning%20Tutorial.ipynb).

# 🧠 Model Inference

Now that you've trained a model, let's run inference.

In [5]:
yaml_content = f"""
model:
  model_name: "{tutorial_dir}/output"
  torch_dtype_str: "bfloat16"

generation:
  max_new_tokens: 128
  batch_size: 1
"""

with open(f"{tutorial_dir}/infer.yaml", "w") as f:
    f.write(yaml_content)

In [6]:
from oumi.core.configs import InferenceConfig
from oumi.infer import infer

config = InferenceConfig.from_yaml(str(Path(tutorial_dir) / "infer.yaml"))

input_text = (
    "Remember that we didn't train for long, so the results might not be great."
)

results = infer(config=config, inputs=[input_text])

print(results[0])

[2025-02-04 06:49:22,456][oumi][rank0][pid:466][MainThread][INFO]][models.py:185] Building model using device_map: auto (DeviceRankInfo(world_size=1, rank=0, local_world_size=1, local_rank=0))...
[2025-02-04 06:49:22,459][oumi][rank0][pid:466][MainThread][INFO]][models.py:255] Using model class: <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'> to instantiate model.
[2025-02-04 06:49:23,340][oumi][rank0][pid:466][MainThread][INFO]][native_text_inference_engine.py:111] Setting EOS token id to `2`


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


conversation_id=None messages=[USER: Remember that we didn't train for long, so the results might not be great., ASSISTANT: I'm sorry for the inconvenience, but as a chatbot, I don't have the ability to access or process data from external sources. I'm designed to provide information and guidance based on your input, not to collect or store any data. I recommend checking the official website or platform for the most accurate and up-to-date information.] metadata={}


We can also run inference with the original pretrained model by changing a single field in our config:

In [7]:
base_model_config = InferenceConfig.from_yaml(str(Path(tutorial_dir) / "infer.yaml"))
base_model_config.model.model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

input_text = "Input for the pretrained model: What is your name? "

results = infer(config=base_model_config, inputs=[input_text])

print(results[0])

[2025-02-04 06:49:35,529][oumi][rank0][pid:466][MainThread][INFO]][models.py:185] Building model using device_map: auto (DeviceRankInfo(world_size=1, rank=0, local_world_size=1, local_rank=0))...
[2025-02-04 06:49:35,792][oumi][rank0][pid:466][MainThread][INFO]][models.py:255] Using model class: <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'> to instantiate model.
[2025-02-04 06:49:37,054][oumi][rank0][pid:466][MainThread][INFO]][native_text_inference_engine.py:111] Setting EOS token id to `2`
conversation_id=None messages=[USER: Input for the pretrained model: What is your name? , ASSISTANT: My name is Alex Chen. I'm a data scientist and AI assistant, trained on a vast dataset of text data from various sources. I'm here to help you with your data analysis and interpretation tasks.] metadata={}


# 📊 Evaluating a Model against Common Benchmarks

You can use Oumi to evaluate pretrained and tuned models against standard benchmarks. For example, let's evaluate our tuned model on `mmlu_college_computer_science`, the college computer science subset of MMLU, using LM Harness:
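
For multiple-choice tasks such as MMLU, LM Harness does not grade free-form generations. As we understand its `multiple_choice` task type, it scores each answer choice by the log-likelihood the model assigns to it and counts a question as correct when the highest-scoring choice is the gold answer. The sketch below shows that accuracy computation on made-up log-likelihoods; `mc_accuracy` is a hypothetical name, and real scores would come from the model.

```python
from typing import Sequence


def mc_accuracy(choice_loglikelihoods: Sequence[Sequence[float]],
                gold_indices: Sequence[int]) -> float:
    """Fraction of questions whose highest-log-likelihood choice is the gold one."""
    if len(choice_loglikelihoods) != len(gold_indices):
        raise ValueError("need exactly one gold index per question")
    correct = 0
    for scores, gold in zip(choice_loglikelihoods, gold_indices):
        predicted = max(range(len(scores)), key=lambda i: scores[i])  # argmax
        correct += int(predicted == gold)
    return correct / len(gold_indices)


# Illustrative numbers only: 3 questions with 4 answer choices each.
toy_scores = [
    [-4.2, -1.3, -5.0, -3.8],  # model prefers choice 1
    [-2.0, -2.5, -0.9, -3.1],  # model prefers choice 2
    [-1.1, -1.4, -2.2, -0.7],  # model prefers choice 3
]
print(mc_accuracy(toy_scores, gold_indices=[1, 2, 0]))  # 2 of 3 correct
```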

In [8]:
yaml_content = f"""
model:
  model_name: "{tutorial_dir}/output"
  torch_dtype_str: "bfloat16"

tasks:
  - evaluation_platform: lm_harness
    task_name: mmlu_college_computer_science

generation:
  batch_size: null # Let LM Harness find the largest batch size that fits in memory.
output_dir: "{tutorial_dir}/output/evaluation"
"""

with open(f"{tutorial_dir}/eval.yaml", "w") as f:
    f.write(yaml_content)
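
Setting `batch_size: null` hands batch sizing to LM Harness. A common strategy for automatic batch sizing is to start from a large batch and halve it after each out-of-memory failure until a batch fits. The sketch below simulates that idea with a hypothetical `fake_run_batch` that raises `MemoryError` above a made-up limit; it illustrates the technique and is not LM Harness's own implementation.

```python
from typing import Callable


def find_max_batch_size(run_batch: Callable[[int], None], start: int = 512) -> int:
    """Halve the batch size until `run_batch` succeeds; return the first size that fits."""
    batch_size = start
    while batch_size >= 1:
        try:
            run_batch(batch_size)
            return batch_size
        except MemoryError:
            batch_size //= 2  # too big: retry with half as many examples
    raise RuntimeError("even batch_size=1 does not fit")


def fake_run_batch(batch_size: int, memory_limit: int = 100) -> None:
    """Stand-in for a forward pass that 'runs out of memory' above the limit."""
    if batch_size > memory_limit:
        raise MemoryError(f"batch of {batch_size} exceeds simulated limit")


print(find_max_batch_size(fake_run_batch))  # 64 (512, 256 and 128 fail)
```

Halving only tries sizes of the form `start / 2**k`, so it can settle below the true maximum (64 here, even though 100 would fit); the payoff is that it needs only a handful of trial runs.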

In [9]:
from oumi.core.configs import EvaluationConfig
from oumi.evaluate import evaluate

eval_config = EvaluationConfig.from_yaml(str(Path(tutorial_dir) / "eval.yaml"))

# Uncomment the following line to run evals against the V1 HuggingFace Leaderboard.
# This may take a while.
# eval_config.tasks[0].task_name = "huggingface_leaderboard_v1"

evaluate(eval_config)

RuntimeError: Failed to import transformers.pipelines because of the following error (look up to see its traceback):
cannot import name 'is_directory' from 'PIL._util' (/usr/local/lib/python3.11/dist-packages/PIL/_util.py)

# ☁️ Launching Jobs

Oftentimes you'll need to run various tasks (training, evaluation, etc.) on remote hardware that's better suited for the task. Oumi can handle this for you by launching jobs on various compute clusters. For more information about running jobs, see our [Running Jobs Remotely tutorial](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Running%20Jobs%20Remotely.ipynb). For running jobs on custom clusters, see our [Launching Jobs on Custom Clusters tutorial](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Launching%20Jobs%20on%20Custom%20Clusters.ipynb).


Today, Oumi supports running jobs on several cloud provider platforms.

For the latest list, we can call the `which_clouds` function:

In [10]:
import oumi.launcher as launcher

print("Supported Clouds in Oumi:")
for cloud in launcher.which_clouds():
    print(cloud)

Supported Clouds in Oumi:
local
polaris
runpod
gcp
lambda
aws
azure


Let's run a simple "Hello World" job locally to demonstrate how to use the Oumi job launcher. This job will echo `"Hello, World!"`, then run the same training job we executed above. Running this job on a cloud provider like GCP simply involves changing the `cloud` field.

In [11]:
yaml_content = f"""
name: hello-world
resources:
  cloud: local

working_dir: .

envs:
  TEST_ENV_VARIABLE: '"Hello, World!"'
  OUMI_LOGGING_DIR: "{tutorial_dir}/logs"

run: |
  echo "$TEST_ENV_VARIABLE"
  oumi train -c {tutorial_dir}/train.yaml
"""

with open(f"{tutorial_dir}/job.yaml", "w") as f:
    f.write(yaml_content)
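
Because the target platform is just the `resources.cloud` field, retargeting a job is a one-line change. Here is a minimal sketch using the standard library's `re` module, assuming the field appears exactly once in the form `cloud: <name>` as in the YAML above; `with_cloud` is a hypothetical helper, and the valid names are whatever `launcher.which_clouds()` reports.

```python
import re

# Matches a line like "  cloud: local"; [ \t] (not \s) keeps the match on one line.
CLOUD_LINE = re.compile(r"(?m)^([ \t]*cloud:[ \t]*)\S+[ \t]*$")


def with_cloud(job_yaml: str, cloud: str) -> str:
    """Return a copy of `job_yaml` whose `cloud:` value is replaced by `cloud`."""
    new_yaml, count = CLOUD_LINE.subn(lambda m: m.group(1) + cloud, job_yaml)
    if count != 1:
        raise ValueError(f"expected exactly one 'cloud:' line, found {count}")
    return new_yaml
```

For example, `with_cloud(yaml_content, "gcp")` returns the same job definition targeted at GCP. Remote clouds also need credentials set up, which this sketch does not handle.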

In [12]:
import time

job_config = launcher.JobConfig.from_yaml(str(Path(tutorial_dir) / "job.yaml"))
cluster, job_status = launcher.up(job_config, cluster_name=None)

while job_status and not job_status.done:
    print("Job is running...")
    time.sleep(15)
    job_status = cluster.get_job(job_status.id)
print("Job is done!")

Job is running...
Job is running...
Job is running...
Job is running...
Job is done!
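
The polling loop above waits indefinitely. If you'd rather give up after a deadline, the same pattern can be wrapped with a timeout. This sketch takes a hypothetical `get_status` callable returning an object with a `done` attribute, mirroring how `job_status.done` is used above; in the notebook you could pass `lambda: cluster.get_job(job_status.id)`.

```python
import time
from typing import Any, Callable


def wait_for_job(get_status: Callable[[], Any], poll_seconds: float = 15.0,
                 timeout_seconds: float = 3600.0) -> Any:
    """Poll `get_status()` until its `.done` is true, raising TimeoutError at the deadline."""
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status is not None and not status.done:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job not done after {timeout_seconds} s")
        time.sleep(poll_seconds)
        status = get_status()
    return status
```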


The job wrote its logs to `OUMI_LOGGING_DIR`, which we set to a `logs` folder under our tutorial directory. Let's take a look at that directory:

In [13]:
logs_dir = f"{tutorial_dir}/logs"
os.listdir(logs_dir)

['2025_02_04_06_50_21_924_0.stdout', '2025_02_04_06_50_21_924_0.stderr']

Now let's print the contents of each log file:

In [14]:
for log_file in Path(logs_dir).iterdir():
    print(f"Log file: {log_file}")
    with open(log_file) as f:
        print(f.read())

Log file: tour_tutorial/logs/2025_02_04_06_50_21_924_0.stdout
"Hello, World!"

@@@@@@@@@@@@@@@@@@@
@                 @
@   @@@@@  @  @   @
@   @   @  @  @   @
@   @@@@@  @@@@   @
@                 @
@   @@@@@@@   @   @
@   @  @  @   @   @
@   @  @  @   @   @
@                 @
@@@@@@@@@@@@@@@@@@@

[2025-02-04 06:50:26,293][oumi][rank0][pid:8794][MainThread][INFO]][distributed.py:546] Setting random seed to 42 on rank 0.
[2025-02-04 06:50:32,010][oumi][rank0][pid:8794][MainThread][INFO]][torch_utils.py:66] Torch version: 2.4.1+cu121. NumPy version: 1.26.4
[2025-02-04 06:50:32,011][oumi][rank0][pid:8794][MainThread][INFO]][torch_utils.py:72] CUDA version: 12.1 CuDNN version: 90.1.0
[2025-02-04 06:50:32,020][oumi][rank0][pid:8794][MainThread][INFO]][torch_utils.py:106] CPU cores: 2 CUDA devices: 1
device(0)='Tesla T4' Capability: (7, 5) Memory: [Total: 14.74GiB Free: 14.22GiB Allocated: 0.0GiB Cached: 0.0GiB]
[2025-02-04 06:50:32,023][oumi][rank0][pid:8794][MainThread][INFO]][train.py:13
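
The cell above only prints the files. To actually parse Oumi's log lines, for example to pull out warnings or the messages from a single source file, a regular expression over the line format visible above is enough. The pattern is inferred from these sample lines, not from a documented format, so treat it as an assumption; `parse_log_line` and `parse_log_file` are hypothetical helpers.

```python
import re
from pathlib import Path
from typing import Iterator, Optional

# Inferred from lines such as:
# [2025-02-04 06:50:26,293][oumi][rank0][pid:8794][MainThread][INFO]][distributed.py:546] Setting ...
LOG_LINE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\[(?P<logger>[^\]]+)\]\[(?P<rank>[^\]]+)\]"
    r"\[pid:(?P<pid>\d+)\]\[(?P<thread>[^\]]+)\]\[(?P<level>[A-Z]+)\]\]?"
    r"\[(?P<source>[^\]]+)\] ?(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return the fields of one Oumi-style log line, or None for any other line."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def parse_log_file(path: Path) -> Iterator[dict]:
    """Yield parsed records, skipping banner art, echoed output and continuation lines."""
    with open(path) as f:
        for line in f:
            record = parse_log_line(line)
            if record is not None:
                yield record
```

For example, `[r for r in parse_log_file(p) if r["level"] == "WARNING"]` collects the warnings from one file.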

# ⚙️ Customizing Datasets and Clusters

Oumi offers rich customization that allows users to build custom solutions on top of our existing building blocks. Several of Oumi's primary resources (Datasets, Clouds, etc.) leverage the Oumi Registry when invoked.

This registry allows users to build custom classes that function as drop-in replacements for core functionality.
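
To make the idea concrete, here is a minimal, self-contained sketch of the decorator-based registry pattern. Everything in it (`REGISTRY`, `register`, `build`, `MyToyDataset`) is hypothetical and for illustration only; Oumi's actual decorators are described in its registry API reference.

```python
from typing import Callable, Dict, Tuple, Type

# Maps a (kind, name) pair to the class registered under it.
REGISTRY: Dict[Tuple[str, str], Type] = {}


def register(name: str, kind: str) -> Callable[[Type], Type]:
    """Class decorator that records `cls` under (kind, name) and returns it unchanged."""
    def decorator(cls: Type) -> Type:
        key = (kind, name)
        if key in REGISTRY:
            raise KeyError(f"{kind} {name!r} is already registered")
        REGISTRY[key] = cls
        return cls
    return decorator


def build(kind: str, name: str, **kwargs):
    """Look up a registered class by name and instantiate it, as a config loader might."""
    try:
        cls = REGISTRY[(kind, name)]
    except KeyError:
        raise KeyError(f"no {kind} registered as {name!r}") from None
    return cls(**kwargs)


@register("my_toy_dataset", kind="dataset")
class MyToyDataset:
    def __init__(self, rows: int = 3):
        self.rows = [f"example {i}" for i in range(rows)]


ds = build("dataset", "my_toy_dataset", rows=2)
print(ds.rows)  # ['example 0', 'example 1']
```

The point of the pattern is that a config only needs the registered name string, while the class providing the behavior can live in user code; that indirection is what lets custom classes act as drop-in replacements.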

For more details on registering custom datasets, see the [tutorial here](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Datasets%20Tutorial.ipynb).

For a tutorial on writing a custom cloud/cluster for running jobs, see the [tutorial here](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Launching%20Jobs%20on%20Custom%20Clusters.ipynb).

You can find further information about the required registry decorators [here](https://oumi.ai/docs/en/latest/api/oumi.core.registry.html#oumi.core.registry.register_cloud_builder).

# 🧭 What's Next?

Now that you've completed the basic tour, you're ready to tackle the other [notebook guides & tutorials](https://oumi.ai/docs/en/latest/get_started/tutorials.html).

If you have not already, make sure to take a look at the [Quickstart](https://oumi.ai/docs/en/latest/get_started/quickstart.html) for an overview of our CLI.