# Oumi Colab

We recommend the Python 3 + T4 GPU runtime for faster training.
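
To confirm that the runtime actually has a GPU attached, a quick check along these lines may help. This is a minimal sketch that only relies on the `nvidia-smi` tool being on the `PATH` of GPU runtimes; `gpu_summary` is a hypothetical helper name, not part of `oumi`.

```python
import shutil
import subprocess
from typing import Optional


def gpu_summary() -> Optional[str]:
    """Return the GPU name(s) reported by nvidia-smi, or None if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        # No NVIDIA tooling on PATH: most likely a CPU-only runtime.
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # The tool exists but could not talk to a GPU.
        return None
    return result.stdout.strip() or None


print(gpu_summary() or "No GPU detected; switch the runtime type to T4 GPU.")
```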

## Installation
The first step is to install the `oumi` module and its dependencies.


Once we are out of stealth and the package is published on PyPI, installation will be as simple as `pip install oumi[all]`.


For now, however, the repo is still private, so we need a workaround:
- **Manual upload**: The simplest option is to manually upload the zipped repo, either to Google Drive or directly to the Colab filesystem.
    - If you choose this option, you can skip to step 3.
- **Git pull with read token**: A more convenient alternative is to generate a read-only GitHub token for the repo.
    - The setup only needs to be done once; after that, you can quickly pull new code changes.

#### 1. Setting up a read-only GitHub token
Since the GitHub repository is private, we need to generate a `Read-only` user token scoped to the `oumi` repo.
1. On GitHub.com, go to `Settings -> Developer settings -> Personal access tokens -> Fine-grained tokens -> Generate new token`.
1. See [this example](https://drive.google.com/file/d/1zxd8r7qkPfl34mfGK83m_13oLGFGghW1/view?usp=share_link) for how to fill in the form. The only permission that should be granted is `Repository permissions -> Contents -> Read-only`.
1. Click `Generate token`, copy the token, and save it somewhere safe (you cannot view it again later).
1. Message Oussama or Nikolai on Slack to get the token approved.
1. Create a Colab environment secret (key icon in the left menu) with `repo-token` as the name and your token as the value.

This only needs to be done once!

#### 2. Cloning Oumi repository

In [None]:
from google.colab import userdata

github_repo_token = userdata.get("repo-token")  # Set up the token in your notebook secrets first
github_username = "<GITHUB_USERNAME>"  # Change to your GitHub username

!git clone https://$github_username:$github_repo_token@github.com/oumi-ai/oumi.git
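
A common failure in the cell above is leaving the `<GITHUB_USERNAME>` placeholder in place or passing an empty token. The sketch below builds the same clone URL but fails early with a readable error. `build_clone_url` is a hypothetical helper introduced for illustration only; the repository path is taken from the clone command above.

```python
from urllib.parse import quote


def build_clone_url(username: str, token: str, repo: str = "oumi-ai/oumi") -> str:
    """Build an HTTPS clone URL with embedded credentials, validating inputs first."""
    if not username or username.startswith("<"):
        raise ValueError("Set github_username to your real GitHub username.")
    if not token:
        raise ValueError("The repo-token secret is empty or missing.")
    # Percent-encode both parts so special characters cannot break the URL.
    user = quote(username, safe="")
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@github.com/{repo}.git"


# Example with dummy values; in the notebook, pass github_username and github_repo_token.
clone_url = build_clone_url("alice", "tok_123")
```

In a notebook cell, the result can then be passed to git with IPython's brace interpolation, for example `!git clone {clone_url}`.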

#### 3. Installing Oumi module & dependencies

In [None]:
%pip install -e 'oumi[all]'
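
Before moving on, it can be useful to check that the package is importable from the current kernel. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name. If it reports `False` right after installation, restarting the runtime is often worth trying, although that is an assumption about the Colab environment rather than something this notebook guarantees.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


print("oumi importable:", is_installed("oumi"))
```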

#### 4. Importing Oumi

In [2]:
import oumi
from oumi.core.configs import (
    DataParams,
    DatasetParams,
    DatasetSplitParams,
    EvaluationConfig,
    ModelParams,
    TrainerType,
    TrainingConfig,
    TrainingParams,
)

## Training

#### Using `oumi` module

In [None]:
config = TrainingConfig(
    data=DataParams(
        train=DatasetSplitParams(
            datasets=[
                DatasetParams(
                    dataset_name="yahma/alpaca-cleaned",
                    preprocessing_function_name="alpaca",
                )
            ],
            target_col="prompt",
        )
    ),
    model=ModelParams(
        model_name="microsoft/Phi-3-mini-4k-instruct",
        trust_remote_code=True,
    ),
    training=TrainingParams(
        trainer_type=TrainerType.TRL_SFT,
        output_dir="train/",
    ),
)
oumi.train(config)
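
After training finishes, it may help to inspect what was written to `output_dir`. The sketch below simply lists the files under a directory; it assumes the trainer saves its artifacts under `train/` as configured above, and makes no claim about which files will actually appear. `list_artifacts` is a hypothetical helper, not part of `oumi`.

```python
from pathlib import Path
from typing import List


def list_artifacts(output_dir: str) -> List[str]:
    """Return the sorted relative paths of all files under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing has been written yet, or the path is wrong.
        return []
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


for name in list_artifacts("train/"):
    print(name)
```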

#### Using `oumi` CLI

In [1]:
!oumi-train \
    "data.train.datasets.0.dataset_name=yahma/alpaca-cleaned" \
    "data.train.datasets.0.preprocessing_function_name=alpaca" \
    "data.train.target_col=prompt" \
    "model.model_name=microsoft/Phi-3-mini-4k-instruct" \
    "model.trust_remote_code=true" \
    "training.trainer_type=TRL_SFT" \
    "training.output_dir=train/"

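The command-line arguments appear to mirror the nested structure of the `TrainingConfig` built in the module example: each dotted path names a field, and list items are addressed by index. The sketch below illustrates that mapping by flattening a plain nested dictionary into `key=value` strings. `to_overrides` is a hypothetical helper for illustration only; it is not part of `oumi`, and the exact override syntax accepted by `oumi-train` is assumed from the examples in this notebook.

```python
from typing import Any, List


def to_overrides(node: Any, prefix: str = "") -> List[str]:
    """Flatten nested dicts and lists into dotted `path=value` strings."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)  # List elements are addressed by index.
    else:
        # Leaf value: booleans are lowercased to match the CLI examples.
        value = str(node).lower() if isinstance(node, bool) else str(node)
        return [f"{prefix}={value}"]
    overrides: List[str] = []
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        overrides.extend(to_overrides(child, path))
    return overrides


nested = {
    "data": {
        "train": {
            "datasets": [
                {
                    "dataset_name": "yahma/alpaca-cleaned",
                    "preprocessing_function_name": "alpaca",
                }
            ],
            "target_col": "prompt",
        }
    },
    "model": {
        "model_name": "microsoft/Phi-3-mini-4k-instruct",
        "trust_remote_code": True,
    },
    "training": {"trainer_type": "TRL_SFT", "output_dir": "train/"},
}

overrides = to_overrides(nested)
for line in overrides:
    print(line)
```
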
## Evaluation

#### Using `oumi` module

In [None]:
config = EvaluationConfig(
    data=DatasetSplitParams(
        datasets=[
            DatasetParams(
                dataset_name="yahma/alpaca-cleaned",
                preprocessing_function_name="alpaca",
            )
        ],
    ),
    model=ModelParams(
        model_name="train/best.pt",
        trust_remote_code=True,
    ),
)

oumi.evaluate_oumi(config)

#### Using `oumi` CLI

In [None]:
!oumi-evaluate \
    "data.train.datasets.0.dataset_name=yahma/alpaca-cleaned" \
    "data.train.datasets.0.preprocessing_function_name=alpaca" \
    "data.train.datasets.0.target_col=prompt" \
    "model.model_name=microsoft/Phi-3-mini-4k-instruct" \
    "model.trust_remote_code=true"