<a href="https://colab.research.google.com/github/mshumer/gpt-oracle-trainer/blob/main/gpt_oracle_trainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## gpt-oracle-trainer
Documentation -> fine-tuned LLaMA 2 Q/A system
By Matt Shumer (https://twitter.com/mattshumer_)

This notebook experiments with an easy way to build a task-specific model for answering questions about products and services.

To get started:
- First, use the best GPU available (go to Runtime -> change runtime type). Add your OpenAI key.

- To create your model, paste each document from your documentation into the `docs` list (each as its own string).

- Adjust the `service_name_and_description` to match your product/service.

- Select a temperature (high=creative, low=precise), and the number of training examples to generate per doc (the more, the better -- ideally do 50+ for this value) to train the model. From there, just run all the cells.

You can change the model you want to fine-tune by changing `model_name` in the `Define Hyperparameters` cell.
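
For the OpenAI key step above, one option is to read the key from an environment variable instead of pasting it into the notebook. The sketch below is illustrative only: the helper name `resolve_openai_key` and the `OPENAI_API_KEY` variable name are assumptions, not part of this notebook.

```python
import os


def resolve_openai_key(env=None, placeholder="OPENAI API KEY HERE"):
    """Return the key stored in OPENAI_API_KEY (an assumed variable name).

    Raises RuntimeError if the variable is unset, empty, or still the
    placeholder text used in the key cell of this notebook.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == placeholder:
        raise RuntimeError(
            "Set the OPENAI_API_KEY environment variable before running the notebook."
        )
    return key
```

In the key cell you could then write `openai.api_key = resolve_openai_key()`, which keeps the key out of the saved notebook.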

In [None]:
!pip install openai

In [None]:
import openai
openai.api_key = "OPENAI API KEY HERE"

# Paste your documentation here. Each file should be a separate string within the `docs` list.
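
If your documentation already lives in files, a small helper can build the list instead of pasting each document by hand. This is a minimal sketch under stated assumptions: the function name `load_docs`, the folder name in the usage note, and the file suffixes are all hypothetical.

```python
from pathlib import Path


def load_docs(folder, suffixes=(".md", ".txt")):
    """Read every matching file under `folder` into its own string."""
    # Sort the paths so the order of the resulting list is stable.
    paths = sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )
    loaded = []
    for path in paths:
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip files that are empty or whitespace-only
            loaded.append(text)
    return loaded
```

For example, `docs = load_docs("my_docs")` would give one string per file, which matches the shape of the `docs` list in the next cell.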

In [None]:
docs = [
    """Getting Started
Let’s get set up with MCLI, the MosaicML command line interface (CLI) and SDK. Install MCLI via pip into your python3 environment:

pip install --upgrade mosaicml-cli
There are two options listed below for configuring an API key. To manage your account or create new keys, visit the MosaicML console.

CLI access
This command will walk you through getting started and configuring mcli:

mcli init
You can also reset your API key through the command line at any time by running:

mcli set api-key <new-value>
Access through environment variables
The MOSAICML_API_KEY environment variable can also be used to configure access to the MosaicML platform:

export MOSAICML_API_KEY=<value>
Note that the environment variable takes precedence over the API key set through the CLI.

Advanced configuration
CLI Autocomplete
We support tab completion in bash and zsh shells through argcomplete. To register tab completion:

eval "$(register-python-argcomplete mcli)"
Depending on your shell configuration, you may see an error, zsh: command not found: compdef. In that case, you need to run two commands before register-python-argcomplete:

autoload -Uz compinit
compinit
eval "$(register-python-argcomplete mcli)"
Other environment variables
Below are all environment variables that can be used to configure MCLI. Most defaults can be left as is:

Variable          Default                Description
MCLI_CONFIG_DIR   ~/.mosaic              Folder used to store MCLI configuration file
MCLI_CONFIG_PATH  ~/.mosaic/mcli_config  File used to store MCLI configuration file
MCLI_TIMEOUT      10                     Timeout (seconds) for queries against the MosaicML API
MOSAICML_API_KEY                         Config set API key override""",
    """Set up your environment
Setting up the environment for your code to run is easily configurable in the MosaicML platform.

Secrets
Secrets are credentials or other sensitive information used to configure access to a variety of services. Secrets can enable you to:

Access a private docker image

Access a private github repo

Configure API keys, for example, an API key from Weights and Biases for experiment tracking or from the MosaicML platform to launch runs within runs

Access storage: AWS S3, GCP, OCI, Coreweave, Cloudflare

All secrets are stored securely in a vault, maintained across your clusters, and added to every run and deployment. Your secrets are never shared with other users.

For more information, see the Secrets Page

Docker
Build a Docker image with all the required system packages for your code. Especially for large dependencies, including them in your Docker image will speed up the run start time. For more information, see the Docker documentation.

We maintain a set of public docker images for PyTorch, PyTorch Vision, and Composer on DockerHub.

To run with an existing docker image, use the image field:


YAML
image: mosaicml/composer:latest

Docker Tags

We strongly recommend using a fixed tag instead of latest for docker images to ensure reproducibility. Create and use versioned tag names (e.g. v1.7.0) for your docker images.

Private images require setting up Docker Secrets with:

mcli create secret docker
Environment Variables
Create your own
To add non-sensitive environment variables, use the env_variables field in your YAML:

name: using-env-variables
image: bash
env_variables:
  - key: FOO
    value: 'Hello World!'
command: |
  echo "$FOO"
MosaicML Platform Environment Variables
We automatically set the following environment variables in your run container.

Variable           Description
MASTER_ADDR        The network address of the node with rank 0 in the training job
MASTER_PORT        The network port of the node with rank 0 in the training job
NODE_RANK          The rank of the node the container is running on, indexed at zero
RUN_NAME           The name of your run as seen in the output of mcli get runs
COMPOSER_RUN_NAME  Identical to RUN_NAME, used by composer
WORLD_SIZE         The total number of GPUs being used for the training run
MOSAICML_PLATFORM  true if you are using the MosaicML Platform, used by composer
PARAMETERS         The path that your run parameters are stored in
RESUMPTION_INDEX   The index of the number of times your run has resumed, starting at zero
NUM_NODES          The total number of nodes the run is scheduled on
LOCAL_WORLD_SIZE   The number of GPUs available to the run on each node

Many integrations and secrets will also set environment variables automatically; for instance, AWS S3 secrets will set AWS_CONFIG_FILE and AWS_SHARED_CREDENTIALS_FILE. Refer to the secret documentation to learn more.""",
    """Training Quickstart
You can easily train your model with the MosaicML platform with just a few simple steps. Before starting, make sure you’ve configured MosaicML access

Run “Hello World”
To submit your first run, copy the YAML below into a file called ‘hello_world.yaml’:

name: hello-world
compute:
  gpus: 0
image: bash
command: |
  sleep 2
  echo Hello World!
Then, run:

mcli run -f hello_world.yaml --follow
If you see “Hello World!”, congratulations on setting up MCLI!

Specifying a cluster

If you have access to more than one cluster, the --cluster keyword is a required argument to launch runs. You can find all the clusters you have access to with the following command:

mcli get clusters
And then specify the cluster in the yaml or through the command line:

mcli run -f hello_world.yaml --follow --cluster [your_cluster_name]""",
    """Inference Quickstart
You can easily deploy your model with the MosaicML platform with just a few simple steps. Before starting, make sure you’ve configured MosaicML access

Creating Your First Deployment
For this tutorial, we’re going to deploy the MPT-7B Instruct model. To submit your first deployment, copy the below yaml into a file called mpt_instruct_deploy.yaml:

name: mpt-7b-instruct
compute:
  gpus: 1
  gpu_type: a100_40gb
replicas: 1
image: mosaicml/inference:0.0.96
command: |
  export PYTHONPATH=$PYTHONPATH:/code/examples
integrations:
  - integration_type: git_repo
    git_repo: mosaicml/examples
    ssh_clone: false
    git_commit: 0b348f765c8ba6b2896a6a0834446bfc4a333811
model:
  download_parameters:
    hf_path: mosaicml/mpt-7b-instruct
  model_handler: examples.inference-deployments.mpt.mpt_7b_handler.MPTModelHandler
  model_parameters:
    model_name: mosaicml/mpt-7b-instruct
This yaml tells the MosaicML platform that you are requesting a single a100_40gb gpu and would like to download the mpt-7b-instruct model from the HuggingFace hub. The deployment uses the model handler defined in the Mosaic examples repo here. The model_parameters configure the out-of-the-box model handler provided by the MosaicML platform.

Then, run:

mcli deploy -f mpt_instruct_deploy.yaml
Specifying a cluster

If you have access to more than one cluster, you’ll need to specify which cluster to deploy with using --cluster <name>. You can check which clusters you have access to using mcli get clusters

After you’ve run the deploy command, you’ll see the following output in your terminal (note that the hash after mpt-7b-instruct- is a unique identifier that we append to the deployment name you provided in your yaml):

✔  Deployment mpt-7b-instruct-0t30xo submitted.

To see the deployment's status, use:

mcli get deployments
If you run mcli get deployments, you’ll see the following output:

NAME                    USER             CLUSTER  GPU_TYPE   GPU_NUM  CREATED_TIME         STATUS
mpt-7b-instruct-0t30xo  user.email.com   r7z13    a100_40gb  1        2023-05-17 07:24 PM  PENDING
The mcli get deployments command shows you all the deployments in your organization, so you may see deployments that were not created by you.

You can also get more details about a specific deployment by running mcli describe deployment mpt-7b-instruct-0t30xo.

Interacting With Your Deployment
You’ve created your first deployment, congrats! From here, MCLI has a few convenience commands that make it easier for you to interact with your deployment.

First, you may want to check your deployment’s status to see if it’s ready to start serving traffic. You can do that by running the following command:

mcli ping mpt-7b-instruct-0t30xo
If your deployment is ready, you should see the output:

mpt-7b-instruct-0t30xo's status:
{'status': 200}
where the status is an HTTP status code. If your status code is 200, your deployment is ready to serve traffic!

Let’s try sending a request to your deployment using the Python SDK:

from mcli import predict, get_inference_deployment

deployment = get_inference_deployment("mpt-7b-instruct-0t30xo")
predict(deployment, {"inputs": ["hello world!"]})
You can also make the same request via the command line:

mcli predict mpt-7b-instruct-0t30xo --input '{"inputs": ["hello world!"]}'
You can also do the same with a basic curl command:

curl https://mpt-7b-instruct-0t30xo.inf.hosted-on.mosaicml.hosting/predict_stream \
-H "Authorization: <your_api_key>" \
-d '{"inputs": ["hello world!"]}'
The address above is for the example; you can look up the address for your own deployment using mcli describe deployment <name>.

Once you’re done with your deployment, you can delete it with the following command:

mcli delete deployments --name mpt-7b-instruct-0t30xo
Next Steps
There are many more ways you can customize your deployments. We support downloading checkpoint files from any remote storage such as s3 and you can customize your model’s forward logic by implementing a custom model handler. You can even write your own webserver and replace the mosaicml/inference image with your own. Take a look at the Inference Schema Page for more information.""",
    """Managing Compute
The MosaicML platform configures and manages clusters for you automatically.

To view clusters you have access to:

mcli get clusters
View current cluster utilization:

mcli util
Requesting compute resources
When submitting a run or deployment on a cluster, the MosaicML platform will try to infer which compute resources to use automatically. Which fields are required depends on which clusters, and which types of clusters, are available to you or your organization. If those resources are not valid or if there are multiple options still available, an error will be raised on run submission, and the run will not be created.

Field     Type  Details
gpus      int   Typically required, unless you specify nodes or a cpu-only run
cluster   str   Required if you have multiple clusters
gpu_type  str   Optional
instance  str   Optional. Only needed if the cluster has multiple GPU instances
nodes     int   Optional. Alternative to gpus - typically there are 8 GPUs per node
cpus      int   Optional

For example, you can launch a multi-node cluster my-cluster with 16 A100 GPUs:

compute:
  cluster: my-cluster
  gpus: 16
  gpu_type: a100_80gb
Most compute fields are also optional CLI arguments""",
    """Common Commands
mcli run -f <your_yaml>
Submits a run with the provided YAML configuration.

mcli run --clone <existing_run_name>
Submits a new run using the existing run’s configuration

mcli get runs
Lists all of your submitted runs (see mcli get runs --help to view the many filters available)

mcli describe run <run_name>
Get detailed information about a run, including the config that was used to launch it.

mcli logs <run_name>
Retrieves the console log of the latest resumption of the indicated run.

mcli logs <run_name> --resumption <N>
Retrieves the console log for a given resumption of the indicated run.

mcli stop run <run_name>
Stops the provided run. The run will be stopped but not deleted from the cluster.

mcli run -r <stopped_run>
Restarts a stopped run. See Composer’s Auto Resumption guide!

mcli delete run <run_name>
Deletes the run (and its associated logs) from the cluster.

mcli update run <run_name> --max-duration <hours>
Updates the max time (in hours) that a run can run for.

Full documentation for the mcli update run command
Run sharing
If run sharing is enabled, users within the same organization have read access to other users’ runs. Ask your administrator if you would like this feature enabled!

This enables easier collaboration, so a user can fetch other users’ runs with:

mcli get runs --user <another_users_email>
Users can also tail the logs and describe another user’s runs with:

mcli logs <run_name>
Watchdog
When training large-scale runs, there may be hardware failures (e.g. node failures). We’ve developed a system called watchdog that will automatically resume your run if our system detects any failures. This is not enabled by default because gracefully resuming models during training requires careful consideration. If you are using Composer or the LLM foundry, this can be easily enabled.

You can enable watchdog on an existing and active run.

To enable watchdog, use:

mcli watchdog <run_name>
To disable watchdog, use:

mcli watchdog --disable <run_name>
If watchdog is enabled for your run, you’ll see a 🐕 icon next to your run_name in the mcli get runs display.

By default, enabling watchdog will automatically retry your run 10 times.

You can configure this default in your yaml by overriding the max_retries scheduling parameter.""",
    """Configure a run
Run submissions to the MosaicML platform can be configured through a YAML file or using our Python API’s RunConfig class.

The fields are identical across both methods:

Field          Required  Type
name           required  str
image          required  str
command        required  str
compute        required  ComputeConfig
scheduling     optional  SchedulingConfig
integrations   optional  List[Dict]
env_variables  optional  List[Dict]
parameters     optional  Dict[str, Any]
metadata       optional  Dict[str, Any]

Here’s an example run configuration:


YAML
name: hello-composer
image: mosaicml/pytorch:latest
command: 'echo $MESSAGE'
compute:
  cluster: <fill-in-with-cluster-name>
  gpus: 0
scheduling:
  priority: low
integrations:
  - integration_type: git_repo
    git_repo: mosaicml/benchmarks
    git_branch: main
env_variables:
  - key: MESSAGE
    value: hello composer!

Setting up YAML Schema in VSCode
Autocomplete suggestions and static checking for training YAML files can be supported by using a JSON Schema in VSCode. To configure this in your local VSCode environment:

Download the YAML extension in VSCode.

Download the JSON Schema file training.json here.

Go to Settings from Code → Preferences → Settings.

Search for YAML Schema and go to Edit in settings.json.

Under "yaml.schemas", add the code listed below. The key is a link to the JSON file specifying the YAML schema, and the value specifies which YAML files are targeted.

"yaml.schemas": {
        "https://raw.githubusercontent.com/mosaicml/examples/melissa/yaml_schema/training.json": "**/mcli/**/*.yaml"
    }
Restart VSCode.

Now, CTRL + Space should enable autocomplete for any file located in mcli/.

Field Types
Run Name
A run name is the primary identifier for working with runs. For each run, a unique identifier is automatically appended to the provided run name. After submitting a run, the finalized unique name is displayed in the terminal, and can also be viewed with mcli get runs or on the Run object.

Image
Runs are executed within Docker containers defined by a Docker image. Images on DockerHub can be configured as <organization>/<image name>. For private Dockerhub repositories, add a docker secret with:

mcli create secret docker
For more details, see the Docker Secret Page.

Using Alternative Docker Registries

While we default to DockerHub, custom registries are supported; see Docker’s documentation and the Docker Secret Page for more details.

Command
The command is what’s executed when the run starts, typically to launch your training jobs and scripts. For example, the following command:

command: |
  echo Hello World!
will result in a run that prints “Hello World” to the console.

If you are training models with Composer, then the command field is where you will write your Composer launch command.

Compute Fields
The compute field specifies which compute resources to request for your run. The MosaicML platform will try to infer which compute resources to use automatically. Which fields are required depends on which clusters, and which types of clusters, are available to your organization. If those resources are not valid or if there are multiple options still available, an error will be raised on run submission, and the run will not be created.

Field     Type  Details
gpus      int   Typically required, unless you specify nodes or a cpu-only run
cluster   str   Required if you have multiple clusters
gpu_type  str   Optional
instance  str   Optional. Only needed if the cluster has multiple GPU instances
nodes     int   Optional. Alternative to gpus - typically there are 8 GPUs per node
cpus      int   Optional. Typically not used other than for debugging small deployments.

You can see clusters, instances, and compute resources available to you using:

mcli get clusters
For example, you can launch a multi-node cluster my-cluster with 16 A100 GPUs:

compute:
  cluster: my-cluster
  gpus: 16
  gpu_type: a100_80gb
Scheduling
The scheduling field governs how the MosaicML platform’s scheduler will manage your run. It is a simple dictionary containing the optional keys listed below.

Field                    Required  Type
priority                 optional  str
preemptible              optional  bool
max_retries              optional  int
retry_on_system_failure  optional  bool
max_duration_seconds     optional  int

priority: Runs in the platform’s scheduling queue are first sorted by their priority, then by their creation time. The priority field can be one of 3 values: low, default and high. When omitted, the default value is used. Best practices usually dictate that large numbers of more experimental runs (think exploratory hyperparameter sweeps) should be run at low priority, whereas important “hero” runs should be run at high priority.

preemptible: If your run can be retried, you can set preemptible to True.

max_retries: This is the maximum number of times our system will attempt to retry your run.

retry_on_system_failure: If you want your run to be retried if it encounters a system failure, you can set retry_on_system_failure to True

max_duration_seconds: This is the time duration (in seconds) that your run can run for before it is stopped.

Integrations
We support many Integrations to customize aspects of both the run setup and environment.

Integrations are specified as a list in the YAML. Each item in the list must specify a valid integration_type along with the relevant fields for the requested integration.

Some examples of integrations, including automatically cloning a GitHub repository, installing Python packages, and setting up logging to a Weights and Biases project, are shown below:

integrations:
  - integration_type: git_repo
    git_repo: org/my_repo
    git_branch: my-work-branch
  - integration_type: pip_packages
    packages:
      - numpy>=1.22.1
      - requests
  - integration_type: wandb
    project: my_weight_and_biases_project
    entity: mosaicml
You can read more about integrations on the Integrations Page.

Some integrations may require adding secrets. For example, pulling from a private github repository would require the git-ssh secret to be configured. See the Secrets Page.

Environment Variables
Environment variables can also be injected into each run at runtime through the env_variables field. Each environment variable in the list must have a key and value configured.

key: name used to access the value of the environment variable

value: value of the environment variable.

For example, the below YAML will print “Hello MOSAICML my name is MOSAICML_TWO!”:

name: hello-world
image: python
command: |
  sleep 2
  echo Hello $NAME my name is $SECOND_NAME!
env_variables:
  - key: NAME
    value: MOSAICML
  - key: SECOND_NAME
    value: MOSAICML_TWO
The command accesses the value of the environment variable by the key field (in this case $NAME and $SECOND_NAME)

Parameters
The provided parameters are mounted into your run as a YAML file at /mnt/config/parameters.yaml for your code to access. Parameters are a popular way to easily configure your training run.

Metadata
Metadata is meant to be a multi-purposed, unstructured place to put information about a run. It can be set at the beginning of the run, for example to add custom run-level tags or groupings:

name: hello-world
image: bash
command: echo 'hello world'
metadata:
  run_type: test
Metadata on your run is readable through the CLI or SDK:


BASH
> mcli describe run hello-world-VC5nFs
Run Details
Run Name      hello-world-VC5nFs
Image         bash
...
Run Metadata
KEY         VALUE
run_type    test

You can also update metadata when the run is running, which can be helpful for exporting metrics or information from the run:

from mcli import update_run_metadata

run = update_run_metadata("hello-world-VC5nFs", {"run_type": "test_but_updated"})
print("New metadata values:", run.metadata)
Metadata size constraints

Metadata is not intended for large amounts of data such as time series data. Each key is limited to 200 characters and each value is limited to 0.1 MB. Metadata cannot have more than 200 keys. A MAPIException will be raised on creation or updates if any of these limits are exceeded.""",
    """Run Lifecycle

BASH
mcli run -f example.yaml

What happens next? The MosaicML platform manages submitting the run and orchestrates all run requests automatically. The status of a run (RunStatus object) can be monitored using:


BASH
mcli get runs

This status represents unique phases the run will enter during its lifecycle:



Between the “Starting” and “Terminating” phases, your run is assigned node resources on the cluster and is consuming them. For this reason, GPU usage is computed as the difference between:

Start time: The time the run enters the STARTING status

End time: The time the run exits the TERMINATING status

Note that runs will never share GPUs, but could be assigned different GPUs on the same node if the cluster supports it.

Pending (PENDING)
The run has been submitted to the MosaicML platform, but hasn’t been sent to the compute plane or assigned a space in the queue

Queued (QUEUED)
The run has been placed in queue to be picked up by the specified cluster.

If there is space on the cluster, the run will likely appear to skip this phase entirely. In other cases, the run may remain in queued for a long time due to the size of the resource request and/or current cluster utilization. You can view active and queued runs on all clusters using:


BASH
mcli util

Starting (STARTING)
After a run has been scheduled, it goes from the pending to the starting status. In this phase, the scheduler has assigned the run to node(s) in the cluster and has started setting up everything needed to run the workload.

Starting a run includes setting up platform-specific containers and pulling the docker image you specified when you configured the run. This phase can be time consuming for large images that have not been used recently in the cluster (caching is done with the Always image pull policy in kubernetes).

Running (RUNNING)
After finishing the setup, the run goes into the Running phase, which is the core phase where the run is executed inside containers on one or more nodes.

First, any integrations configured for the run are executed, such as cloning a GitHub repo or installing a PyPI package. Integrations are executed in order to produce the required run environment.

Once integrations finish building, the command configured for the run is executed.

During this phase, you can view the stdout and stderr of any commands run using:


BASH
mcli logs <run>

Terminating (TERMINATING)
After exiting the running or starting phase, runs will always enter a “terminating” status. This phase will typically last up to about 30-40 seconds as the platform kills any remaining processes and removes the run’s assignment to the node(s). Until this phase is complete, the run can still be consuming resources and other runs cannot be scheduled on these node(s).

From terminating, there are several terminal phases the run may enter into:

Completed: Everything ran smoothly and the run has successfully finished! 🙌

Stopped: The run was stopped and is no longer executing

Failed: Something went wrong while the run was starting or running

Below are details of each terminal phase

Completed (COMPLETED)
The run has executed the full command and finished without any errors. You can now view the full run logs, examine final run metrics, or access saved checkpoints and data.

If you no longer need this run, you can clean it up using:


BASH
mcli delete run <run>

Stopped (STOPPED)
The run started running but did not complete entirely. This state can be entered by stopping the run using:


BASH
mcli stop run <run>

A stopped run can then be restarted using:


BASH
mcli run -r <run>

When a run restarts, the platform does not automatically save the state of the previous run. Instead, the user code is left responsible for this. If you’re using Composer this is easy to enable through checkpointing and auto-resumption.

On restart, the run will begin the run lifecycle again and execute the series of commands from the very beginning. To see all attempts of a run, you can view the entire lifecycle using:


BASH
mcli describe run <run>

Failed (FAILED)
Unfortunately, there are several potential reasons why a run may have failed. This section will go over in depth different failures you may encounter, and how to recover from them.

First, make sure you identify that the run has failed and potentially the reason using:


BASH
mcli get run "run"

The sections below outline how to debug each reason. Take note of the exit code as well, if provided.

Reason: FailedImagePull
This means the run failed during the Starting phase when trying to pull the image you’ve specified in the run configuration.

There are a few reasons this could happen:

The image is private and docker secrets are not configured or do not have access. To fix this, set up docker secrets and confirm you can pull the image with this combination of username and password

The image name is not valid. Double check the image name you entered in the run configuration by describing the run


BASH
mcli describe run <run>

Reason: Error
This is the catch-all run failure that means something failed when the run was being executed. You’ll want to look at the run logs to debug:


BASH
mcli logs <run>

The --failed flag will default to showing the logs of the first failed node rank. Note that since runs execute in a unique process for each node, the logs for each rank could be different (e.g. one node could have raised an exit code, which would have triggered all other nodes to fail). You can manually specify which node rank to view the logs of using the --rank flag:


BASH
mcli logs <run> --rank 2""",
    """Manage a run with the SDK
Runs can be managed through the Python API. The sections below outline how to work with runs, including creating, following, getting, stopping, and deleting runs. Before getting started, familiarize yourself with Run Lifecycle.

Creating a run
mcli.api.runs.create_run(run, *, timeout=10, future=False)[source]
Launch a run in the MosaicML platform

The provided run must contain enough information to fully detail the run

PARAMETERS
run – A fully-configured run to launch. The run will be queued and persisted in the run database.

timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

future – Return the output as a concurrent.futures.Future. If True, the call to create_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

RETURNS
A Run that includes the launched run details and the run status

Runs can be created programmatically, giving you flexibility to define custom workflows or create similar runs in quick succession. create_run() takes a RunConfig object, which is a fully-configured run ready to launch. The method will launch the run and then return a Run object, which includes the RunConfig data in Run.config but also data received at the time the run was launched.

The RunConfig object
The RunConfig object holds configuration data needed to launch a run. This is the underlying python data structure MCLI uses, so before beginning make sure to familiarize yourself with the Run schema.

class mcli.api.runs.RunConfig(run_name=None, name=None, gpu_type=None, gpu_num=None, cpus=None, platform=None, cluster=None, image=None, partitions=None, optimization_level=None, integrations=<factory>, env_variables=<factory>, scheduling=<factory>, compute=<factory>, metadata=<factory>, command='', parameters=<factory>, entrypoint='')[source]
A run configuration for the MosaicML platform

Values in here are not yet validated and some required values may be missing.

PARAMETERS
name (Optional[str]) – User-defined name of the run

gpu_type (Optional[str]) – GPU type (optional if only one gpu type for your cluster)

gpu_num (Optional[int]) – Number of GPUs

cpus (Optional[int]) – Number of CPUs

cluster (Optional[str]) – Cluster to use (optional if you only have one)

image (Optional[str]) – Docker image (e.g. mosaicml/composer)

integrations (List[Dict[str, Any]]) – List of integrations

env_variables (List[Dict[str, str]]) – List of environment variables

command (str) – Command to use when a run starts

parameters (Dict[str, Any]) – Parameters to mount into the environment

entrypoint (str) – Alternative to command

There are two ways to initialize a RunConfig object that can be used to configure and create a run. The first is by referencing a YAML file, equivalent to the file argument in MCLI:

from mcli.api.runs import RunConfig, create_run

run_config = RunConfig.from_file('hello_world.yaml')
created_run = create_run(run_config)
Alternatively, you can instantiate the RunConfig object directly in python:

from mcli.api.runs import RunConfig, create_run

cluster = "<your-cluster>"
run_config = RunConfig(
    name='hello-world',
    image='bash',
    command='echo "Hello World!" && sleep 60',
    gpu_type='none',
    cluster=cluster,
)
created_run = create_run(run_config)
These can also be used in combination, for example loading a base configuration file and modifying select fields:

from mcli.api.runs import RunConfig, create_run

special_config = RunConfig.from_file('base_config.yaml')
special_config.gpu_num = 8
created_run = create_run(special_config)
Changing parameters for parameter sweeps

If you are trying to kick off a bunch of runs with similar configurations and different training parameters, make sure you copy the parameters (and any other dict field) instead of modifying them directly:

import copy

from mcli.api.runs import RunConfig, create_run

config = RunConfig.from_file('base_config.yaml')

params = { ... }
for lr in (0.1, 0.01, 0.001):
    new_params = copy.deepcopy(params)
    new_params['optimizers']['sgd']['lr'] = lr
    config.parameters = new_params
    created_run = create_run(config)
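
The reason for copying can be seen with plain dictionaries alone. The following is a minimal sketch that makes no MCLI calls; the helper name sweep_parameters and the sample base dictionary are illustrative assumptions, with the optimizers/sgd/lr layout borrowed from the snippet above.

```python
import copy

def sweep_parameters(base_params, learning_rates):
    # Return one independent parameter dict per learning rate.
    # deepcopy keeps nested dicts from being shared between variants.
    variants = []
    for lr in learning_rates:
        new_params = copy.deepcopy(base_params)
        new_params['optimizers']['sgd']['lr'] = lr
        variants.append(new_params)
    return variants

base = {'optimizers': {'sgd': {'lr': 1.0}}}
variants = sweep_parameters(base, (0.1, 0.01, 0.001))
print([v['optimizers']['sgd']['lr'] for v in variants])
print(base['optimizers']['sgd']['lr'])
```

Each variant keeps its own learning rate and the base dictionary is left untouched, which is what a sweep needs when every dict is handed to a separate run.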
The Run object
create_run() returns the created run as a Run object. This object can be used as input to any subsequent run function. For example, you can start a run and then immediately follow its logs:

from mcli.api.runs import create_run, follow_run_logs

created_run = create_run(config)
for line in follow_run_logs(created_run):
    print(line)
class mcli.api.runs.Run(run_uid, name, status, created_at, updated_at, created_by, priority, preemptible, retry_on_system_failure, cluster, gpus, gpu_type, cpus, node_count, latest_resumption, max_retries=None, reason=None, nodes=<factory>, submitted_config=None, metadata=None, last_resumption_id=None, resumptions=<factory>, lifecycle=<factory>, image=None, _required_properties=('id', 'name', 'status', 'createdAt', 'updatedAt', 'reason', 'createdByEmail', 'priority', 'preemptible', 'retryOnSystemFailure', 'resumptions'))[source]
A run that has been launched on the MosaicML platform

PARAMETERS
run_uid (str) – Unique identifier for the run

name (str) – User-defined name of the run

status (RunStatus) – Status of the run at a moment in time

created_at (datetime) – Date and time when the run was created

updated_at (datetime) – Date and time when the run was last updated

created_by (str) – Email of the user who created the run

priority (str) – Priority of the run

preemptible (bool) – Whether the run can be stopped and re-queued by higher priority jobs

retry_on_system_failure (bool) – Whether the run should be retried on system failure

cluster (str) – Cluster the run is running on

gpus (int) – Number of GPUs the run is using

gpu_type (str) – Type of GPU the run is using

cpus (int) – Number of CPUs the run is using

node_count (int) – Number of nodes the run is using

latest_resumption (Resumption) – Latest resumption of the run

max_retries (Optional[int]) – Maximum number of times the run can be retried

reason (Optional[str]) – Reason the run was stopped

nodes (List[Node]) – Nodes the run is using

submitted_config (Optional[RunConfig]) – Submitted run configuration

metadata (Optional[Dict[str, Any]]) – Metadata associated with the run

last_resumption_id (Optional[str]) – ID of the last resumption of the run

resumptions (List[Resumption]) – Resumptions of the run

lifecycle (List[RunLifecycle]) – Lifecycle of the run

image (Optional[str]) – Image the run is using

clone(name=None, image=None, cluster=None, instance=None, nodes=None, gpu_type=None, gpus=None, priority=None, preemptible=None, max_retries=None)[source]
Submits a new run with the same configuration as this run

PARAMETERS
name (str) – Override the name of the run

image (str) – Override the image of the run

cluster (str) – Override the cluster of the run

instance (str) – Override the instance of the run

nodes (int) – Override the number of nodes of the run

gpu_type (str) – Override the GPU type of the run

gpus (int) – Override the number of GPUs of the run

priority (str) – Override the priority of the run

preemptible (bool) – Override whether the run can be stopped and re-queued by higher priority jobs

max_retries (int) – Override the max number of times the run can be retried

RETURNS
New Run object

property completed_at
The time the run was completed

If there are multiple resumptions, this will be the last end time. completed_at will be None if the last resumption has not been completed.

RETURNS
The time the run was last completed

property cumulative_pending_time
Cumulative time spent in the PENDING state

RETURNS
The cumulative time (seconds)

property cumulative_running_time
Cumulative time spent in the RUNNING state

RETURNS
The cumulative time (seconds)

delete()[source]
Deletes the run

RETURNS
Deleted Run object

property display_name
The name of the run to display in the CLI

RETURNS
The name of the run

refresh()[source]
Refreshes the data on the run object

RETURNS
Refreshed Run object

property resumption_count
Number of times the run has been resumed

RETURNS
The number of times the run has been resumed

property started_at
The time the run was first started

If there are multiple resumptions, this will be the earliest start time. started_at will be None if the first resumption has not been started.

RETURNS
The time the run was first started

stop()[source]
Stops the run

RETURNS
Stopped Run object

update(preemptible=None, priority=None, max_retries=None, retry_on_system_failure=None)[source]
Updates the run’s data

PARAMETERS
preemptible (bool) – Update whether the run can be stopped and re-queued by higher priority jobs; default is False

priority (str) – Update the priority of the run to low, medium, or high; default is medium

max_retries (int) – Update the max number of times the run can be retried; default is 0

retry_on_system_failure (bool) – Update whether the run should be retried on system failure (i.e. a node failure); default is False

RETURNS
Updated Run object

update_metadata(metadata)[source]
Updates the run’s metadata

PARAMETERS
metadata (Dict[str, Any]) – The metadata to update the run with. This will be merged with the existing metadata. Keys not specified in this dictionary will not be modified.

RETURNS
Updated Run object
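
The merge behaviour described for update_metadata can be pictured with ordinary dictionaries. This is only a sketch: it assumes a shallow, top-level merge (the text above does not say how nested values are handled), and merge_metadata is a hypothetical helper, not part of MCLI.

```python
def merge_metadata(existing, update):
    # Shallow merge: keys in update win, all other keys are kept as-is.
    merged = dict(existing)
    merged.update(update)
    return merged

current = {'owner': 'alice', 'stage': 'dev'}
merged = merge_metadata(current, {'stage': 'prod', 'ticket': 42})
print(merged)
```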

Observing a run
Getting a run’s logs
There are two functions for fetching run logs:

get_run_logs(): Gets currently available logs for any run. Ideal for completed runs or checking progress of an active run

follow_run_logs(): Follows logs line-by-line for any run. Ideal for monitoring active runs in real time or waiting until a condition is reached (see also wait_for_run_status())

mcli.api.runs.get_run_logs(run, rank=None, *, timeout=None, future=False, failed=False, resumption=None)[source]
Get the current logs for an active or completed run

Get the current logs for an active or completed run in the MosaicML platform. This returns the full logs as a str, as they exist at the time the request is made. If you want to follow the logs for an active run line-by-line, use follow_run_logs().

PARAMETERS
run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().

rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.

timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

future (bool) – Return the output as a Future. If True, the call to get_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the log text, use return_value.result() with an optional timeout argument.

failed (bool) – Return the logs of the first failed rank for the provided resumption if True. False by default.

resumption (Optional[int]) – Resumption (0-indexed) of a run to get logs for. Defaults to the last resumption

RETURNS
If future is False – The full log text for a run at the time of the request as a str

Otherwise – A Future for the log text
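
The future=True pattern follows the standard library's concurrent.futures model. The sketch below shows that model in isolation: fetch_logs_stub is a hypothetical stand-in for a slow platform request, and nothing here calls MCLI.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_logs_stub(run_name):
    # Hypothetical stand-in for a slow platform request.
    return 'logs for ' + run_name

executor = ThreadPoolExecutor(max_workers=1)
# Comparable in spirit to calling get_run_logs(run, future=True):
# submit() returns immediately with a Future.
log_future = executor.submit(fetch_logs_stub, 'hello-world')
# result() blocks until the text is ready; timeout is in seconds.
log_text = log_future.result(timeout=5)
executor.shutdown()
print(log_text)
```

The timeout passed to result() plays the role that the timeout argument plays in the blocking call.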

mcli.api.runs.follow_run_logs(run, rank=None, *, timeout=None, future=False, resumption=None)[source]
Follow the logs for an active or completed run in the MosaicML platform

This returns a generator of individual log lines, line-by-line, and will wait until new lines are produced if the run is still active.

PARAMETERS
run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().

rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.

timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored. A run may take some time to generate logs, so you likely do not want to set a timeout.

future (bool) – Return the output as a Future. If True, the call to follow_run_logs() will return immediately and the request will be processed in the background. The generator returned by the Future will yield a Future for each new log string returned from the cloud. This takes precedence over the timeout argument. To get the generator, use return_value.result() with an optional timeout argument, and log_future.result() for each new log string.

RETURNS
If future is False – A line-by-line Generator of the logs for a run

Otherwise – A Future of a line-by-line generator of the logs for a run
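
Because the logs arrive as a generator, waiting for a condition amounts to consuming lines until one matches. A minimal sketch, assuming any iterable of strings: first_matching_line and sample_lines are hypothetical names, and sample_lines merely stands in for follow_run_logs(run).

```python
def first_matching_line(lines, predicate):
    # Consume an iterable of log lines until predicate(line) is true.
    # Returns the matching line, or None if the stream ends first.
    for line in lines:
        if predicate(line):
            return line
    return None

def sample_lines():
    # Hypothetical stand-in for follow_run_logs(run).
    yield 'epoch 0 loss 2.1'
    yield 'epoch 1 loss 1.4'
    yield 'checkpoint saved'
    yield 'epoch 2 loss 1.1'

hit = first_matching_line(sample_lines(), lambda s: 'checkpoint' in s)
print(hit)
```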

Monitoring a run throughout its lifecycle
mcli.api.runs.wait_for_run_status(run, status, timeout=None, future=False)[source]
Wait for a launched run to reach a specific status

PARAMETERS
run (str | Run) – The run whose status should be watched. This can be provided using the run’s name or an existing Run object.

status (str | RunStatus) – Status to wait for. This can be any valid RunStatus value. If the status is short-lived, or the run terminates, it is possible the run will reach a LATER status than the one requested. If the run never reaches this state (e.g. it stops early or the wait times out), then an error will be raised. See exception details below.

timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

future (bool) – Return the output as a Future. If True, the call to wait_for_run_status() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

RAISES
MAPIException – Raised if the run does not exist or there is an issue connecting to the MAPI service.

RunStatusNotReached – Raised in the event that the watch closes before the run reaches the desired status. If this happens, the connection to MAPI may have dropped, so try again.

TimeoutError – Raised if the run did not reach the correct status in the specified time

RETURNS
If future is False – A Run object once it has reached the requested status

Otherwise:
A Future for the run. This will not resolve until the run reaches the requested status
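
The advice to try again after RunStatusNotReached can be wrapped in a small retry loop. The sketch below is generic Python under stated assumptions: WatchClosed is a hypothetical stand-in for RunStatusNotReached, and flaky_wait stands in for a call to wait_for_run_status().

```python
class WatchClosed(Exception):
    # Hypothetical stand-in for RunStatusNotReached.
    pass

def retry_wait(wait_fn, attempts=3):
    # Call wait_fn until it succeeds or the attempts are used up.
    last_error = None
    for _ in range(attempts):
        try:
            return wait_fn()
        except WatchClosed as err:
            last_error = err
    raise last_error

calls = {'n': 0}

def flaky_wait():
    # Fails twice, then pretends the run reached the status.
    calls['n'] += 1
    if calls['n'] < 3:
        raise WatchClosed('watch closed early')
    return 'RUNNING'

result = retry_wait(flaky_wait)
print(result, calls['n'])
```

Only the retryable error is caught here, so a TimeoutError or any other exception would still propagate to the caller.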

The RunStatus object
The RunStatus object is attached to each Run object and reflects the most recent status the run has been observed with.

class mcli.api.runs.RunStatus(value)[source]
Possible statuses of a run

PENDING = 'PENDING'
The run has been submitted and is waiting to be scheduled

QUEUED = 'QUEUED'
The run is awaiting execution

STARTING = 'STARTING'
The run is starting up and preparing to run

RUNNING = 'RUNNING'
The run is actively running

TERMINATING = 'TERMINATING'
The run is in the process of being terminated

COMPLETED = 'COMPLETED'
The run has finished without any errors

STOPPED = 'STOPPED'
The run has stopped

FAILED = 'FAILED'
The run has failed due to an issue at runtime

UNKNOWN = 'UNKNOWN'
A valid run status cannot be found

before(other, inclusive=False)[source]
Returns True if this state usually comes “before” the other

PARAMETERS
other – Another RunStatus

inclusive – If True, equality evaluates to True. Default False.

RETURNS
If this state is “before” the other

EXAMPLE

RunStatus.RUNNING.before(RunStatus.COMPLETED)
True
RunStatus.PENDING.before(RunStatus.RUNNING)
True
after(other, inclusive=False)[source]
Returns True if this state usually comes “after” the other

PARAMETERS
other – Another RunStatus

inclusive – If True, equality evaluates to True. Default False.

RETURNS
If this state is “after” the other

EXAMPLE

RunStatus.COMPLETED.after(RunStatus.RUNNING)
True
RunStatus.RUNNING.after(RunStatus.PENDING)
True
is_terminal()[source]
Returns True if this state is terminal

RETURNS
If this state is terminal

EXAMPLE

RunStatus.RUNNING.is_terminal()
False
RunStatus.COMPLETED.is_terminal()
True
classmethod from_string(run_status)[source]
Convert a string to a valid RunStatus Enum

If the run status string is not recognized, will return RunStatus.UNKNOWN instead of raising a KeyError
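
The fallback and ordering behaviour can be pictured with a small stand-in enum. This is not MCLI's implementation: MiniStatus is a hypothetical class with only a few members, and it orders states by declaration order as an assumption.

```python
from enum import Enum

class MiniStatus(Enum):
    # Stand-in enum; members are declared in their usual order.
    PENDING = 'PENDING'
    RUNNING = 'RUNNING'
    COMPLETED = 'COMPLETED'
    UNKNOWN = 'UNKNOWN'

    @classmethod
    def from_string(cls, text):
        # Fall back to UNKNOWN instead of raising on bad input.
        try:
            return cls(text)
        except ValueError:
            return cls.UNKNOWN

    def before(self, other, inclusive=False):
        # Compare positions in declaration order.
        order = list(type(self))
        mine, theirs = order.index(self), order.index(other)
        return mine <= theirs if inclusive else mine < theirs

print(MiniStatus.from_string('RUNNING'))
print(MiniStatus.from_string('not-a-status'))
print(MiniStatus.RUNNING.before(MiniStatus.COMPLETED))
```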

Listing runs
All runs that you have launched in the MosaicML platform and have not deleted can be accessed using the get_runs() function. Optional filters allow you to specify a subset of runs to list by name, cluster, gpu type, gpu number, or status.

mcli.api.runs.get_runs(runs=None, *, cluster_names=None, before=None, after=None, gpu_types=None, gpu_nums=None, statuses=None, timeout=10, future=False, clusters=None, user_emails=None, include_details=False, limit=None, include_interactive=None)[source]
List runs that have been launched in the MosaicML platform

The returned list will contain all of the details stored about the requested runs.

PARAMETERS
runs – List of runs on which to get information

cluster_names – List of cluster names to filter runs. This can be a list of str or Cluster objects. Only runs submitted to these clusters will be returned.

before – Only runs created strictly before this time will be returned. This can be a str in ISO 8601 format (e.g. 2023-03-31T12:23:04.34+05:30) or a datetime object.

after – Only runs created at or after this time will be returned. This can be a str in ISO 8601 format (e.g. 2023-03-31T12:23:04.34+05:30) or a datetime object.

gpu_types – List of GPU types to filter runs. This can be a list of str or GPUType enums. Only runs scheduled on these GPUs will be returned.

gpu_nums – List of GPU counts to filter runs. Only runs scheduled on this number of GPUs will be returned.

statuses – List of run statuses to filter runs. This can be a list of str or RunStatus enums. Only runs currently in these phases will be returned.

timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

future – Return the output as a concurrent.futures.Future. If True, the call to get_runs will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of runs, use return_value.result() with an optional timeout argument.

include_details – If true, will fetch detailed information like run input for each run.

limit – Maximum number of runs to return. If None, all runs will be returned.

include_interactive – Whether the run is interactive or not. If None, all runs will be returned.

RAISES
MAPIException – If connecting to MAPI, raised when a MAPI communication error occurs
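
The before and after filters differ at their boundaries: after includes the boundary time, before excludes it. The sketch below mirrors that rule locally with the standard library; in_window is a hypothetical helper for illustration and does not query the platform.

```python
from datetime import datetime, timedelta, timezone

def in_window(created_at, after=None, before=None):
    # after is inclusive, before is strict, as described above.
    if after is not None and created_at < after:
        return False
    if before is not None and created_at >= before:
        return False
    return True

end = datetime(2023, 3, 31, 12, 0, tzinfo=timezone.utc)
start = end - timedelta(days=7)
# isoformat() produces an ISO 8601 string for a datetime.
print(start.isoformat(), end.isoformat())
print(in_window(start, after=start, before=end))
print(in_window(end, after=start, before=end))
```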

Updating runs
mcli.api.runs.update_run(run, update_run_data=None, *, preemptible=None, priority=None, max_retries=None, retry_on_system_failure=None, timeout=10, future=False, max_duration=None)[source]
Update a run’s data in the MosaicML platform.

Any values that are not specified will not be modified.

PARAMETERS
run (Optional[str | Run]) – A run or run name to update. Using Run objects is most efficient. See the note below.

update_run_data (Dict[str, Any]) – DEPRECATED: Use the individual named-arguments instead. The data to update the run with. This can include preemptible, priority, maxRetries, and retryOnSystemFailure

preemptible (bool) – Update whether the run can be stopped and re-queued by higher priority jobs; default is False

priority (str) – Update the priority of the run to low, medium, or high; default is medium

max_retries (int) – Update the max number of times the run can be retried; default is 0

retry_on_system_failure (bool) – Update whether the run should be retried on system failure (i.e. a node failure); default is False

timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

max_duration – Update the max time that a run can run for (in hours).

future (bool) – Return the output as a Future. If True, the call to update_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

RAISES
MAPIException – Raised if updating the requested run failed

RETURNS
If future is False – Updated Run object

Otherwise – A Future for the object

Stopping runs
mcli.api.runs.stop_run(run, *, timeout=10, future=False)[source]
Stop a run

Stop a run currently running in the MosaicML platform.

PARAMETERS
run (Optional[str | Run]) – A run or run name to stop. Using Run objects is most efficient. See the note below.

timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

future (bool) – Return the output as a Future. If True, the call to stop_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

RAISES
MAPIException – Raised if stopping the requested run failed. A successfully stopped run will have the status `RunStatus.STOPPED`.

RETURNS
If future is False – Stopped Run object

Otherwise – A Future for the object

mcli.api.runs.stop_runs(runs, *, timeout=10, future=False)[source]
Stop a list of runs

Stop a list of runs currently running in the MosaicML platform.

PARAMETERS
runs (Optional[List[str] | List[Run]]) – A list of runs or run names to stop. Using Run objects is most efficient. See the note below.

timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

future (bool) – Return the output as a Future. If True, the call to stop_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.

RAISES
MAPIException – Raised if stopping any of the requested runs failed. All successfully stopped runs will have the status `RunStatus.STOPPED`. You can freely retry any stopped and unstopped runs if this error is raised due to a connection issue.

RETURNS
If future is False – A list of stopped Run objects

Otherwise – A Future for the list

Deleting runs
To delete runs, you must supply the run names or Run objects. To delete a set of runs, you can use the output of get_runs() or even define your own filters directly:

import datetime as dt

from mcli.api.runs import delete_run, delete_runs, get_runs

# delete a run by name
delete_run('delete-this-run')

# delete failed runs on cluster xyz using 1 or 2 GPUs
failed_runs = get_runs(statuses=['FAILED'], cluster_names=['xyz'], gpu_nums=[1, 2])
delete_runs(failed_runs)

# delete completed runs older than a month with name pattern
completed = get_runs(statuses=['COMPLETED'])
ref_date = dt.datetime.now() - dt.timedelta(days=30)
old_runs = [r for r in completed if 'experiment1' in r.name and r.created_at < ref_date]
delete_runs(old_runs)
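
The age-and-name filter in the last example can be tried out without touching the platform. In this sketch FakeRun is a hypothetical stand-in exposing only name and created_at, and the timestamps are timezone-aware; whether created_at on a real Run is timezone-aware is not stated here, so treat that as an assumption and match your reference date to it, since Python refuses to compare naive and aware datetimes.

```python
import datetime as dt
from dataclasses import dataclass

@dataclass
class FakeRun:
    # Hypothetical stand-in exposing only the fields used here.
    name: str
    created_at: dt.datetime

def select_old_runs(runs, name_part, max_age_days, now):
    # Keep runs whose name contains name_part and that predate the cutoff.
    cutoff = now - dt.timedelta(days=max_age_days)
    return [r for r in runs if name_part in r.name and r.created_at < cutoff]

now = dt.datetime(2023, 6, 1, tzinfo=dt.timezone.utc)
runs = [
    FakeRun('experiment1-a', now - dt.timedelta(days=45)),
    FakeRun('experiment1-b', now - dt.timedelta(days=5)),
    FakeRun('other-run', now - dt.timedelta(days=90)),
]
selected = select_old_runs(runs, 'experiment1', 30, now)
print([r.name for r in selected])
```
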
mcli.api.runs.delete_run(run, *, timeout=10, future=False)[source]
Delete a run in the MosaicML platform

If a run is currently running, it will first be stopped.

PARAMETERS
run – A run to delete

timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

future – Return the output as a concurrent.futures.Future. If True, the call to delete_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

RETURNS
The Run object for the run that was deleted

mcli.api.runs.delete_runs(runs, *, timeout=10, future=False)[source]
Delete a list of runs in the MosaicML platform

Any runs that are currently running will first be stopped.

PARAMETERS
runs – A list of runs or run names to delete

timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

future – Return the output as a concurrent.futures.Future. If True, the call to delete_runs will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run objects, use return_value.result() with an optional timeout argument.

RETURNS
A list of Run objects for the runs that were deleted""",
    """Interactive Runs
Interactive runs let you debug and iterate quickly inside your cluster in a secure way. Interactivity works on top of existing MosaicML runs, so a run workload must be submitted to the cluster before you can connect. For security purposes, storage is not persisted, so we recommend using your own cloud storage and git repositories to stream and save data between runs.

Launch an interactive run
Launching new runs

All runs on reserved clusters can be connected to, regardless of how they were launched. This section goes over mcli interactive, which is a helpful alias for creating simple “sleeper” runs for interactive purposes. You can also create a custom run configuration for interactive purposes through the normal mcli run entrypoint.

Launch an interactive run by running:

mcli interactive --max-duration 1 --gpus 1 --tmux --cluster <cluster-name>
This command creates a “sleeper” run that will last for 1 hour (--max-duration 1), request 1 GPU (--gpus 1) and connect to a tmux session (--tmux) within your run. The --max-duration or --hours argument is required to avoid any large, accidental charges from a forgotten run. The --tmux argument is strongly recommended to allow your session to persist through any temporary disconnects. mcli will automatically try to reconnect you to your run whenever you disconnect, so utilizing tmux dramatically improves this experience.
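
If you launch interactive runs from a script, the same command line can be assembled programmatically. A minimal sketch using only the flags shown above: interactive_command is a hypothetical helper that builds the argument list and does not execute anything.

```python
import shlex

def interactive_command(cluster, hours, gpus=1, tmux=True):
    # Assemble the argument list for mcli interactive.
    args = ['mcli', 'interactive', '--max-duration', str(hours),
            '--gpus', str(gpus)]
    if tmux:
        args.append('--tmux')
    args += ['--cluster', cluster]
    return args

cmd = interactive_command('my-cluster', hours=1)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run as-is, which avoids shell quoting issues with cluster names.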

Note that interactive runs act like normal runs:


BASH
# see interactive runs on the cluster
mcli util <cluster-name>

# your interactive runs will show up when you call "get runs"
mcli get runs --cluster <cluster-name>

# get more info about your run
mcli describe run <interactive-run-name>

# stop your interactive run early
mcli stop run <interactive-run-name>

# delete it
mcli delete run <interactive-run-name>

Full documentation for the interactive command
Update a run’s max duration
After creating an interactive run, you can change its maximum duration.

mcli update run <interactive-run-name> --max-duration <hours>
Connect to a run in the terminal
Regardless of how you launched the run, you can connect to any running run using:

mcli connect <run-name> --tmux
By default, the session will connect inside a bash shell. We highly recommend using tmux as the entrypoint for your run so your session is robust to disconnects (such as a local internet outage). You can also configure a command other than bash or tmux to execute in the run:

mcli connect --command "top"
If you are running multi-node interactive runs, you can specify the zero-indexed node rank via:

mcli connect --rank 2
Connect to a run with VSCode
Disclaimer

Due to VSCode Server licensing, we cannot integrate directly with the native VS Code remote development extensions. This guide documents how to get started with the VSCode server using tunneling.

First time local setup: Install VSCode and the remote development extension pack. We recommend reviewing the system requirements and installation guide for the extension pack as some requirements are highly dependent on your operating system.

Step 1: Create an interactive run as documented above

Step 2: Connect to that run via mcli connect

Step 3: Run the following commands to download VS Code server and start it:

trap '/tmp/code tunnel unregister' EXIT
cd /tmp && curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' --output vscode_cli.tar.gz
tar -xf vscode_cli.tar.gz
/tmp/code tunnel --accept-server-license-terms --no-sleep --name mml-dev-01
This will output something like:

*
* Visual Studio Code Server
*
* By using the software, you agree to
* the Visual Studio Code Server License Terms (https://aka.ms/vscode-server-license) and
* the Microsoft Privacy Statement (https://privacy.microsoft.com/en-US/privacystatement).
*
To grant access to the server, please log into https://github.com/login/device and use code ABCD-1234
Step 4: Authenticate using the code provided at https://github.com/login/device and authorize your GitHub account

Step 5: From an existing VSCode window, connect using a remote tunnel by selecting the blue remote window button at the far left of the bottom bar. Select “Connect to tunnel” from “Remote-Tunnels” and then select the tunnel name (default: “mml-dev-01”)



Alternatively, you can connect in the browser using: https://vscode.dev/tunnel/mml-dev-01/tmp""",
    """Common Commands
mcli deploy -f <your_yaml>
Submits an inference deployment with the provided YAML configuration.

mcli get deployments
Lists all of your inference deployments (see mcli get deployments --help to view the many filters available)

mcli describe deployment <deployment_name>
Shows detailed information about an inference deployment, including the config that was used to launch it.

mcli get deployment logs <deployment_name>
Retrieves the console logs of the inference deployment.

mcli delete deployment <deployment_name>
Deletes the inference deployment from the cluster.

mcli update deployment <deployment_name> --image <image>
Updates the image of a deployment.""",
    """Configure a deployment
Deployment submissions to the MosaicML platform can be configured through a YAML file or using our Python API’s InferenceDeploymentConfig class.

The fields are identical across both methods:

Field (Type) – Required or optional
name (str) – required
compute (ComputeConfig) – required
replicas (int) – optional (default 1)
image (str) – optional (default mosaicml/inference)
command (str) – optional (default '')
model (ModelConfig) – optional (default None)
batching (BatchingConfig) – optional (default {max_batch_size: 1, max_timeout_ms: 1000})
integrations (List[Dict]) – optional (default [])
env_variables (List[Dict]) – optional (default [])
metadata (Dict[str, Any]) – optional (default {})

Here’s an example deployment configuration:


YAML
name: deployment-name
compute:
  cluster: <my-cluster>
  gpu_type: <my-gpu_type>
  gpus: 1
integrations:
  - integration_type: git_repo
    git_repo: mosaicml/examples
    ssh_clone: false
model:
  download_parameters:
    s3_path: s3://my-checkpoint-path
  model_parameters:
    task: text-generation
    model_dtype: fp16
    autocast_dtype: bf16
    model_name_or_path: my/local/s3_path
metadata:
  model_version: 2
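
The same configuration can be held as a plain Python dictionary, which makes it easy to check the two required fields before submitting. This is a sketch only: missing_required is a hypothetical helper, and the dictionary mirrors the YAML above rather than the InferenceDeploymentConfig class, whose constructor is not shown here.

```python
REQUIRED_FIELDS = ('name', 'compute')

def missing_required(config):
    # Return the required top-level fields absent from the config.
    return [f for f in REQUIRED_FIELDS if f not in config]

deployment = {
    'name': 'deployment-name',
    'compute': {'cluster': '<my-cluster>', 'gpu_type': '<my-gpu_type>', 'gpus': 1},
    'model': {
        'download_parameters': {'s3_path': 's3://my-checkpoint-path'},
        'model_parameters': {'task': 'text-generation', 'model_dtype': 'fp16'},
    },
    'metadata': {'model_version': 2},
}
print(missing_required(deployment))
print(missing_required({'name': 'only-a-name'}))
```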

Field Types
Deployment Name
A deployment name is the primary identifier for working with deployments. For each deployment, a unique identifier is automatically appended to the provided deployment name. After submitting a deployment, the finalized unique deployment name is displayed in the terminal, and can also be viewed with mcli get deployments or on the InferenceDeployment object.

Compute Fields
The compute field specifies which compute resources to request for a single replica of your inference deployment. See the replicas section for details on how replicas interacts with compute.

In cases where you underspecify compute, the MosaicML platform will try to infer which compute resources to use automatically. Which fields are required depends on which clusters, and what types of clusters, are available to your organization. If the inferred resources are not valid, or if multiple options are still available, an error will be raised on submission and the deployment will not be created.

Field (Type) – Details
cluster (str) – Required
gpus (int) – Typically required, unless you specify instance or a cpu-only run
gpu_type (str) – Optional. Not needed if you specify instance.
instance (str) – Optional. Use if the cluster has multiple instances with the same GPU type (ex. 1-wide and 2-wide A10 instances)
cpus (int) – Optional. Typically not used other than for debugging small deployments.

You can see clusters, instances, and compute resources available to you using:

mcli get clusters
For example, you can launch on a multi-node cluster my-cluster with 16 A100 GPUs:

compute:
  cluster: my-cluster
  gpus: 16
  gpu_type: a100_80gb
You can also specify a cluster and instance name within that cluster as follows:

compute:
  cluster: my-cluster
  instance: oci.vm.gpu.a10.2
In the above case, the deployment will use all the GPUs on the instance by default. If you want to use fewer GPUs, you can also specify the gpus field using a value up to the total number of GPUs available on the instance.

Replicas
If the value of replicas is n > 1 in your deployment YAML, then the deployment will spawn n copies of whatever you request in the compute field.

For example, if your YAML looks like this:

compute:
  cluster: my-cluster
  gpus: 1
  gpu_type: a100_40gb
replicas: 2
then your deployment will spawn 2 replicas each using 1 GPU. Since you did not specify an instance in the compute field, each replica will run on any instance that has a matching GPU type and 1 free GPU.

As another example, if your deployment YAML looks like this:

compute:
  cluster: my-cluster
  instance: oci.vm.gpu.a10.2
replicas: 2
then your deployment will spawn 2 replicas, each on an oci.vm.gpu.a10.2 instance. Since that instance has 2 GPUs, your deployment will use 4 GPUs total (2 replicas x 2 GPUs per replica).
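
The arithmetic behind both examples is a single multiplication. A tiny sketch, with total_gpus as a hypothetical helper name:

```python
def total_gpus(replicas, gpus_per_replica):
    # Each replica requests its own copy of the compute block.
    return replicas * gpus_per_replica

# First example: 2 replicas with gpus: 1 each.
print(total_gpus(replicas=2, gpus_per_replica=1))
# Second example: 2 replicas, each on a 2-GPU instance.
print(total_gpus(replicas=2, gpus_per_replica=2))
```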

Model
The provided model parameters are mounted as a YAML file of your deployment at /mnt/model/model_config.yaml for your code to access. These parameters configure the out-of-the-box MosaicML inference server. If you choose not to provide a model config, we submit the deployment under the assumption that you’re specifying your own inference server code in the provided image under port 8080.

The model schema fields are as follows:

Field (Type) – Details
downloader (str) – The module path to the function that downloads any necessary model files (i.e. checkpoint files). If not provided, uses the default downloader described below.
download_parameters (Dict[str, Any]) – Kwargs passed into the downloader function
model_handler (str) – The module path to the model handler class. If not provided, defaults to the HuggingFace model handler that comes with the inference server.
model_parameters (Dict[str, Any]) – Kwargs used to initialize your model handler

Default Download Parameters

If you’re using the default downloader that comes with the inference server, the parameters are as follows:

| Field | Type | Details |
| --- | --- | --- |
| hf_path | str | The name of the HuggingFace model repo |
| s3_path | str | The s3 path to a model checkpoint in the HuggingFace format |
| gcp_path | str | The gcp path to a model checkpoint in the HuggingFace format |

You can only specify one of the above options.

Default HF Model Parameters

If you’re using the default HuggingFace model handler that comes with the inference server, the parameters are as follows:

| Field | Type | Details |
| --- | --- | --- |
| task | str | Required. Determines how the forward pass is computed. Currently only text-generation and feature-extraction are supported, with more to come! |
| model_dtype | str | The dtype that a Hugging Face model gets loaded as. Defaults to bf16 |
| autocast_dtype | str | The dtype that the model gets autocasted to if provided. Defaults to None |
| model_name_or_path | str | The name of the HuggingFace repo of the model to load or the path of the locally downloaded HuggingFace model |

Please note that the out-of-the-box MosaicML webserver does not support multi-gpu inference for the following Hugging Face model families: CodeGen, DeBERTa, FlauBERT, FSMT, GPT-2, LED, Longformer, XLM, XLNet.

Custom Model Handler Format

See the docs on custom model handlers for details on how to implement your own model handler class.

Batching
The configuration for dynamic batching in the web server.

| Field | Type | Details |
| --- | --- | --- |
| max_batch_size | int | The maximum batch size to create before sending requests to the model. |
| max_timeout_ms | int | The maximum time to wait from the first request before sending requests to the model. |

Setting max_batch_size to 1 is equivalent to turning dynamic batching off, which is the default behavior if batching is not specified.

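As a rough illustration of how these two fields interact, the sketch below flushes a batch when it reaches max_batch_size or when max_timeout_ms has elapsed since the first queued request. It is a simplified model written for this page, not the web server's actual implementation; should_flush and its arguments are hypothetical names.

```python
# Simplified model of dynamic batching (assumption: not the real server code).
def should_flush(batch_len: int, waited_ms: float,
                 max_batch_size: int, max_timeout_ms: int) -> bool:
    # Flush when the batch is full ...
    if batch_len >= max_batch_size:
        return True
    # ... or when the first request has waited long enough.
    return batch_len > 0 and waited_ms >= max_timeout_ms

# With max_batch_size=1 every request is sent immediately (batching is off).
print(should_flush(1, 0, max_batch_size=1, max_timeout_ms=50))   # True
print(should_flush(3, 10, max_batch_size=8, max_timeout_ms=50))  # False
print(should_flush(3, 50, max_batch_size=8, max_timeout_ms=50))  # True
```
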
Image
Deployments are executed within Docker containers defined by a Docker image. Images on DockerHub can be configured as <organization>/<image name>. For private DockerHub repositories, add a docker secret with:

mcli create secret docker
For more details, see the Docker Secret Page.

Using Alternative Docker Registries

While we default to DockerHub, custom registries are also supported. See Docker’s documentation and the Docker Secret Page for more details.

Command
The command is what’s executed when the deployment starts, typically to start the inference server. For example, the following command:

command: |
  echo Hello World!
will result in a deployment that prints “Hello World!” to the console.

If you are using a supported model format (Hugging Face, Custom Model), then the command field is optional and defaults to the launch command for starting the MosaicML inference server.

Integrations
We support many Integrations to customize aspects of both the deployment setup and environment.

Integrations are specified as a list in the YAML. Each item in the list must specify a valid integration_type along with the relevant fields for the requested integration.

Examples of integrations include automatically cloning a GitHub repository and installing Python packages. The following YAML clones a repository:

integrations:
  - integration_type: git_repo
    git_repo: org/my_repo
    git_branch: my-work-branch
You can read more about integrations on the Integrations Page.

Some integrations may require adding secrets. For example, pulling from a private GitHub repository requires the git-ssh secret to be configured. See the Secrets Page.

Environment Variables
Environment variables can also be injected into each deployment at runtime through the env_variables field. Each environment variable in the list must have a key and value configured.

key: name used to access the value of the environment variable

value: value of the environment variable.

For example, the below YAML will print “Hello MOSAICML my name is MOSAICML_TWO!”:

name: hello-world
image: python
command: |
  sleep 2
  echo Hello $NAME my name is $SECOND_NAME!
env_variables:
  - key: NAME
    value: MOSAICML
  - key: SECOND_NAME
    value: MOSAICML_TWO
The command accesses the value of each environment variable through its key (in this case $NAME and $SECOND_NAME).

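If the command launches a Python program, the same variables can be read with os.environ. This is a minimal sketch, assuming the NAME and SECOND_NAME keys from the YAML above; build_greeting is a hypothetical helper, and the defaults are placeholders so the snippet also runs outside a deployment.

```python
import os
from typing import Mapping

def build_greeting(env: Mapping[str, str]) -> str:
    # Look up each variable by its key; fall back to a placeholder if unset.
    name = env.get("NAME", "unknown")
    second_name = env.get("SECOND_NAME", "unknown")
    return f"Hello {name} my name is {second_name}!"

# Inside a deployment, os.environ holds the injected values.
print(build_greeting(os.environ))
```
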
Metadata
Metadata is meant to be a multi-purpose, unstructured place to put information about a deployment. It can be set at the beginning of the deployment, for example to add custom version tags:

name: hello-world
image: bash
command: echo 'hello world'
metadata:
  model_version: 2
Metadata on your deployment is readable through the CLI or SDK:


from mcli import get_deployment

deployment = get_deployment('hello-world-VC5nFs')
print(deployment.metadata)
# {"model_version": 2}
Metadata size constraints

Metadata is not intended for large amounts of data such as time series data. Each key is limited to 200 characters and each value to 0.1 MB. Metadata cannot have more than 200 keys. A MAPIException will be raised on creation or update if any of these limits is exceeded.""",
    """Deployments
This page outlines how to work with deployments: creating, updating, getting, and deleting them, as well as pinging a deployment and sending requests to it.

Creating a deployment
Deployments can be created programmatically, giving you flexibility to define custom workflows or create similar deployments in quick succession. create_inference_deployment() takes an InferenceDeploymentConfig object, which is a fully-configured deployment ready to launch. The method launches the inference deployment and returns an InferenceDeployment object, which includes the InferenceDeploymentConfig data in InferenceDeployment.config as well as data received at the time the deployment was launched.

The InferenceDeploymentConfig object
The InferenceDeploymentConfig object holds configuration data needed to launch a deployment. This is the underlying python data structure MCLI uses, so before beginning make sure to familiarize yourself with the inference schema. Take a look at the API Reference for the full list of fields on the InferenceDeploymentConfig object.

There are two ways to initialize an InferenceDeploymentConfig object that can be used to configure and create a deployment. The first is by referencing a YAML file, equivalent to the file argument in MCLI:

from mcli import InferenceDeploymentConfig, create_inference_deployment

deployment_config = InferenceDeploymentConfig.from_file('hello_world.yaml')
created_deployment = create_inference_deployment(deployment_config)
Alternatively, you can instantiate the InferenceDeploymentConfig object directly in python:

from mcli import InferenceDeploymentConfig, create_inference_deployment

cluster = "<your-cluster>"
inference_deployment_config = InferenceDeploymentConfig(
    name='hello-world',
    image='bash',
    command='echo "Hello World!" && sleep 60',
    gpu_type='none',
    cluster=cluster,
)
created_deployment = create_inference_deployment(inference_deployment_config)
These can also be used in combination, for example loading a base configuration file and modifying select fields:

from mcli import InferenceDeploymentConfig, create_inference_deployment

special_config = InferenceDeploymentConfig.from_file('base_config.yaml')
special_config.metadata = {"version": 1}
created_deployment = create_inference_deployment(special_config)
The InferenceDeployment object
Created deployments are returned by create_inference_deployment() as an InferenceDeployment object. This object can be used as input to any subsequent deployment function. For example, you can start a deployment and then immediately ping it to see if it’s ready:

from mcli import create_inference_deployment, ping_inference_deployment as ping

created_deployment = create_inference_deployment(config)
ping(created_deployment)
Querying a deployment
When querying your inference deployment, you must provide a JSON object with a key called inputs in the request. This will typically be a list of inputs to the model. For example, in a text-to-text language model the inputs field will contain a list of strings to be tokenized and fed into the model.

Optionally, you can also provide a parameters field which contains hyperparameters used in the forward pass of your model. An example of where one might use the parameters field is to pass arguments to the generation pipeline in a text-to-text language model. See our docs on this for more details.

The reason parameters is separated out from inputs in the request is so that the webserver’s dynamic batching functionality can automatically group requests with the same sets of parameters together in the batches it creates. This is important because in some cases different sets of parameters cannot be grouped together when running inference. For example, consider grouping different max_output_sequence_length parameters together in a text-to-text language model. The result would be that the user’s model handler class would have to implement logic to handle this. Separating out parameters makes it possible for the user to write a handler class without having to consider these details.

An example request is shown below:

{
  "inputs": ["(required) <any JSON value>"],
  "parameters": "(optional) <any JSON value>"
}
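
To make the batching argument concrete, here is a small Python sketch that builds request bodies in the shape shown above and groups them by identical parameters. The grouping only illustrates the idea and is not the web server's code; build_request and group_by_parameters are hypothetical helpers, and max_new_tokens is a made-up parameter name.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List, Optional

def build_request(inputs: List[Any], parameters: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # "inputs" is required; "parameters" is optional.
    request: Dict[str, Any] = {"inputs": inputs}
    if parameters is not None:
        request["parameters"] = parameters
    return request

def group_by_parameters(requests: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    # Requests with the same parameters share a key and can share a batch.
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for request in requests:
        key = json.dumps(request.get("parameters"), sort_keys=True)
        groups[key].append(request)
    return dict(groups)

sample_requests = [
    build_request(["first prompt"], {"max_new_tokens": 16}),
    build_request(["second prompt"], {"max_new_tokens": 16}),
    build_request(["third prompt"], {"max_new_tokens": 64}),
]
groups = group_by_parameters(sample_requests)
print(len(groups))  # 2
```
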
Observing a deployment
Getting a deployment’s logs
get_inference_deployment_logs() gets currently available logs for any deployment.

from mcli import create_inference_deployment, get_inference_deployment_logs

created_deployment = create_inference_deployment(config)
logs = get_inference_deployment_logs(created_deployment)
Listing deployments
All deployments from your organization that have been launched through the MosaicML platform and have not been deleted can be accessed using the get_inference_deployments() function. Optional filters allow you to specify a subset of deployments to list by name, cluster, GPU type, number of GPUs, or status.

from mcli import get_inference_deployments

listed_deployments = get_inference_deployments(gpu_nums=1)
Updating a deployment
To update a deployment, you must supply the deployment name or InferenceDeployment object and the fields that need to be updated. To update a set of deployments, you can use the output of get_inference_deployments() or even define your own filters directly.

Currently, the following fields can be updated:

image: Takes a string value.

replicas: Takes an int value.

metadata: Takes a dict value of metadata keys (strings) and values (any).

from mcli import update_inference_deployment

update_inference_deployment('deployment-name', {"metadata":'{"name":"my_first_model"}', "replicas":2, "image":"my_new_image"})

from mcli import update_inference_deployments, get_inference_deployments

to_update = get_inference_deployments(cluster="name")
update_inference_deployments(to_update, {"replicas": 3})
Deleting deployments
To delete deployments, you must supply the deployment names or InferenceDeployment object. To delete a set of deployments, you can use the output of get_inference_deployments() or even define your own filters directly:

from mcli import delete_inference_deployment

delete_inference_deployment('delete-this-deployment')

from mcli import delete_inference_deployments, get_inference_deployments


to_delete = get_inference_deployments(cluster="name")
delete_inference_deployments(to_delete)
Pinging a deployment
You can ping a deployment to determine the server status. We return a status code 200 when the server is live, which indicates the model has finished loading and is ready to accept requests. You can pass in either a name or an InferenceDeployment object.

from mcli import ping

ping('deployment-name')
Sending predictions to a deployment
You can send predictions to your deployment programmatically. There are 3 ways you can specify the deployment you’d like to send your request to:

You can pass in the deployment object returned from create_inference_deployment or get_inference_deployment.

from mcli import get_inference_deployments, predict

deployment = get_inference_deployments(name='your-deployment-name')
predict(deployment, {'inputs': ['some input']})
You can pass in the url to the deployment.

from mcli import predict

predict('https://your-deployment.inf.hosted-on.mosaicml.hosting', {'inputs': ['some input']})
You can pass in the name of the deployment.

from mcli import predict

predict('your-deployment-name', {'inputs': ['some input']})
Getting metrics for a deployment
You can retrieve latency, throughput, error rate, and CPU utilization metrics from the /metrics endpoint on the deployment. These metrics are computed over the past hour at 1-minute intervals.

curl https://{deployment-name}.inf.hosted-on.mosaicml.hosting/metrics -H "Authorization: {api-key}"
Sample response here:

{
  "status": 200,
  "metrics": {
    "error_rate": [
      ["2023-05-01 16:24:57", "10"],
      ["2023-05-01 16:23:57", "10"],
      ...
    ],
    "cpu_seconds": [
      ["2023-05-01 16:24:57", "0.006"],
      ["2023-05-01 16:23:57", "0.001"],
      ...
    ],
    "avg_latency": [
      ["2023-05-01 16:24:57", "1.2"],
      ["2023-05-01 16:23:57", "1.5"],
      ...
    ],
    "requests_per_second": [
      ["2023-05-01 16:24:57", "0.5"],
      ["2023-05-01 16:23:57", "1.2"],
      ...
    ]
  }
}""",
    """Custom Deployments
There are two ways we allow you to customize your inference deployment. You can provide your own downloader function or model handler implementation. In this section, we’ll cover the interface for each of these and show you how to use them.

Model Handlers
Model handlers allow you to define how your model should be loaded and what should happen in a forward pass. They allow for the MosaicML platform to support a wide variety of models and use cases. This is configured by the model_handler field in your deployment input yaml, which expects a python path to your model handler class, and the model_parameters field which expects a key-value mapping of parameters that gets passed as kwargs to initialize your model handler class.

Default
We provide a model handler that is built into the webserver by default. This model handler is used when no model handler is specified in the deployment input yaml. It loads a model from a checkpoint file (expected to be in the HuggingFace checkpoint format) and runs a forward pass on the model. It is a good starting point for most text generation or text embedding use cases.

The parameters for the default model handler are as follows:

| Field | Type | Details |
| --- | --- | --- |
| task | str | Required. Determines how the forward pass is computed. Supported values are text-generation and feature-extraction |
| model_dtype | str | The dtype that a Hugging Face model gets loaded as. Defaults to fp16. Note that bf16 is not supported by DeepSpeed, which our default model handler uses. |
| autocast_dtype | str | The dtype that the model gets autocasted to if provided. Defaults to None |
| model_name_or_path | str | The name of the HuggingFace repo of the model to load or the path of the locally downloaded HuggingFace checkpoint. |

Custom Model Handlers
You may have a use case that is not covered by our default model handler (e.g. you want to deploy a vision model).

If you’d like to define your own model handler, you can implement a class that exposes the interface below. Note that the format of the requests passed into the handler follows the exact same format as the requests you send to the webserver. See Querying a Deployment for details on the input request format of the webserver.

from typing import Any, Dict, List

class ModelHandlerInterface:

    def __init__(self, **kwargs):
        '''
        The init function you define can have keyword arguments equal
        to the values passed in the `model_parameters` section of the deployment YAML.
        '''

    def predict(self, model_requests: List[Dict[str, Any]]):
        '''
        Specify the logic of your model's forward pass.
        For example for Hugging Face models for text generation, this would be a call to generate().

        The `model_requests` is a list of dictionaries where each dictionary
        is an individual request and the list represents a batch of requests.

        Note that each dictionary in the list is guaranteed to have
        two keys: `input` and `parameters`. These are almost the same `inputs` and `parameters`
        that you pass to the webserver when making a request, however here the `input` key
        represents a singular input.
        '''

    def predict_stream(self, model_request: Dict[str, Any]):
        '''
        Optional. If your model supports streaming, implement your model's
        behavior for streaming outputs in this method.

        `model_request` is a dictionary which has two keys: `input` and `parameters`.
        These are almost the same `inputs` and `parameters` that you pass to the webserver
        when making a request, however here the `input` key represents a singular input.
        '''
There are some examples you can follow in the examples repo here.

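Before the walk-through, a minimal sketch of the batch shape may help. The hypothetical EchoModelHandler below is a standalone class (it does not import the real interface) whose predict reads each dictionary's input key, as described in the docstring above. Returning one output per request is an assumption made for this illustration, and prefix is a made-up parameter.

```python
from typing import Any, Dict, List

# Hypothetical standalone handler, for illustration only.
class EchoModelHandler:

    def __init__(self, **kwargs):
        # kwargs would come from model_parameters in the deployment YAML.
        self.prefix = kwargs.get('prefix', '')

    def predict(self, model_requests: List[Dict[str, Any]]) -> List[str]:
        # Assumption: one output per request in the batch.
        return [self.prefix + str(request['input']) for request in model_requests]

handler = EchoModelHandler(prefix='echo: ')
batch = [
    {'input': 'hello', 'parameters': {}},
    {'input': 'world', 'parameters': {}},
]
print(handler.predict(batch))  # ['echo: hello', 'echo: world']
```
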
Let’s walk through a concrete example. Here’s a very simple model handler implementation that just returns a configured string as the output of the forward pass:

# Saved as hello_world_handler.py
from typing import Any, Dict, List

class HelloWorldModelHandler(ModelHandlerInterface):

    def __init__(self, **kwargs):
        self.print_string = kwargs.get("print_string", "hello world!")

    def predict(self, model_requests: List[Dict[str, Any]]):
        return [self.print_string]
Suppose your model handler is saved in a git repo with this structure:

```
hello_world/
├── hello_world_handler.py
└── __init__.py
```
And here is a sample yaml for how you can configure your deployment to use your custom model handler:

name: hello-world-model
compute:
  gpus: 1
  gpu_type: a100_40gb
replicas: 1
image: mosaicml/inference
integrations:
  - integration_type: git_repo
    git_repo: hello_world
model:
  model_handler: hello_world.hello_world_handler.HelloWorldModelHandler
  model_parameters:
    print_string: "hello world!"
Downloader Function
Default
The downloader function allows you to customize how your model checkpoint is downloaded. This is configured by the downloader field in your deployment input yaml, which expects a python path to your downloader module, and the download_parameters field which expects a key-value mapping of parameters that gets passed as kwargs to your downloader function.

If you don’t provide a custom downloader, the downloader built into the webserver is used. It can download checkpoint files in the HuggingFace format from the HuggingFace hub, S3, or GCP. You may provide at most one of the parameters in the following table to download_parameters.

| Parameter | Description | Default | Example | Output Path |
| --- | --- | --- | --- | --- |
| hf_path | The name/path of the model on HuggingFace hub. | None | mosaicml/mpt-7b | Huggingface cache directory |
| s3_path | The path to the model on s3. | None | s3://my-bucket/checkpoint | /mosaicml/local_model |
| gcp_path | The path to the model on GCP. | None | gs://my-bucket/checkpoint | /mosaicml/local_model |

Custom Downloader
If you’d like to download your checkpoint from a custom location, you can implement a function with the following interface where my_custom_location is passed in under the download_parameters field in your deployment YAML:

def download_model(my_custom_location: str) -> None:
    print("My custom location:", my_custom_location)
You can also take a look at this diffusion example for reference here.

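The way download_parameters reaches the function can be pictured as a keyword-argument call. The sketch below is an assumption about the mechanism, based only on the statement above that the parameters are passed as kwargs. It is not the server's code, and it returns the location only so the result is easy to inspect.

```python
# Stand-in for the mapping parsed from the deployment YAML.
download_parameters = {"my_custom_location": "my_custom_location"}

def download_model(my_custom_location: str) -> str:
    # A real downloader would fetch checkpoint files here.
    print("My custom location:", my_custom_location)
    return my_custom_location

# Each key in download_parameters becomes a keyword argument.
result = download_model(**download_parameters)
```
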
Again, let’s walk through a concrete example and add to the custom repo in the earlier model handler example by saving the download function to custom_downloader.py:

```
hello_world/
├── hello_world_handler.py
├── custom_downloader.py
└── __init__.py
```
Let’s hook up the downloader to the input yaml:

name: hello-world-model
compute:
  gpus: 1
  gpu_type: a100_40gb
replicas: 1
image: mosaicml/inference
integrations:
  - integration_type: git_repo
    git_repo: hello_world
model:
  downloader: hello_world.custom_downloader.download_model
  download_parameters:
    my_custom_location: my_custom_location
  model_handler: hello_world.hello_world_handler.HelloWorldModelHandler
  model_parameters:
    print_string: hello world!"""
]

In [None]:
service_name_and_description = "MosaicML, a platform that makes it easier to train and fine-tune large AI models"
temperature = .7
number_of_examples_per_doc = 10

Run this to generate the dataset.

In [None]:
import os
import random
import time

def generate_example(service_name_and_description, doc, prev_examples, temperature=.5):
    messages=[
        {
            "role": "system",
            "content": f"""You are generating data which will be used to train a machine learning model.

Specifically, you will be creating Q/A data to train a model to answer questions about a given service.

You will be given a high-level description of a service, as well as a page of documentation from that service, and from that, you will generate data samples, each with a prompt/response pair. The prompt will be a question, the response will be the answer to that question.

You will do so in this format:
```
prompt
-----------
$prompt_goes_here
-----------

response
-----------
$response_goes_here
-----------
```

Only one prompt/response pair should be generated per turn.

For each turn, make the example slightly more complex than the last, while ensuring diversity.

Make sure your samples are unique and diverse, yet high-quality and complex enough to train a well-performing model. Make sure each sample covers a different aspect of the doc you are looking at. This is essential.

Here is the service you will be generating data about: `{service_name_and_description}`

Here is the document you will generate Q/A samples from. Make sure the samples are completely based on this doc, with no outside information:
```
{doc}
```

Okay, now get started generating samples. Remember to keep quality and sample diversity in mind, and make each one unique."""
        }
    ]

    if len(prev_examples) > 0:
        if len(prev_examples) > 10:
            prev_examples = random.sample(prev_examples, 10)
        for example in prev_examples:
            messages.append({
                "role": "assistant",
                "content": example
            })

    # openai.ChatCompletion is the pre-1.0 interface; it requires openai<1.0.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=messages,
        temperature=temperature,
        max_tokens=1354,
    )

    return [choice.message['content'] for choice in response.choices]

def generate_examples_for_doc(service_name_and_description, doc, number_of_examples_per_doc=50, temperature=.5):
    examples = []
    while len(examples) < number_of_examples_per_doc:
        new_examples = generate_example(service_name_and_description, doc, examples, temperature)
        for new_example in new_examples:
            if new_example not in examples:
                examples.append(new_example)

        # Pause after every 8th example
        if len(examples) % 8 == 0:
            print(f"Generated {len(examples)} examples so far. Pausing for 61 seconds.")
            time.sleep(61)
    return examples

examples = []
for i, doc in enumerate(docs, start=1):
  time.sleep(61)  # pause between docs
  print(f"Generating doc {i}/{len(docs)}'s examples")
  # Use the temperature chosen above instead of a hard-coded value
  new_examples = generate_examples_for_doc(service_name_and_description, doc, number_of_examples_per_doc=number_of_examples_per_doc, temperature=temperature)
  examples.extend(new_examples)

  # Pause again whenever the overall example count is a multiple of 20
  if len(examples) % 20 == 0:
    print(f"Generated {len(examples)} examples so far. Pausing for 61 seconds.")
    time.sleep(61)

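The cell above spaces out requests with fixed 61-second pauses. If you still hit rate-limit errors, one common alternative is to retry with exponential backoff. The sketch below is a generic helper, not part of the original notebook: `with_backoff` and its parameters are hypothetical, and in practice you would narrow `retry_on` to the rate-limit exception raised by your installed `openai` version.

```python
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying with exponentially growing pauses on failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # base_delay, then 2x, 4x, ...
```

To use it, wrap the API call in a zero-argument function, for example `with_backoff(lambda: generate_example(service_name_and_description, doc, examples, temperature))`.
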
We also need to generate a system message.

In [None]:
def generate_system_message(service_name_and_description):

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
          {
            "role": "system",
            "content": "You will be given a high-level description of the model we are training, and from that, you will generate a simple system prompt for that model to use. Remember, you are not generating the system message for data generation -- you are generating the system message to use for inference. A good format to follow is `Given $INPUT_DATA, you will $WHAT_THE_MODEL_SHOULD_DO.`.\n\nMake it as concise as possible. Include nothing but the system prompt in your response.\n\nFor example, never write: `\"$SYSTEM_PROMPT_HERE\"`.\n\nIt should be like: `$SYSTEM_PROMPT_HERE`."
          },
          {
              "role": "user",
              "content": f'Answer questions about this service: `{service_name_and_description.strip()}`',
          }
        ],
        temperature=temperature,
        max_tokens=500,
    )

    return response.choices[0].message['content']

system_message = generate_system_message(service_name_and_description)

print(f'The system message is: `{system_message}`. Feel free to re-run this cell if you want a better result.')

Now let's put our examples into a dataframe and turn them into a final pair of datasets.

In [None]:
import pandas as pd

# Initialize lists to store prompts and responses
prompts = []
responses = []

# Parse out prompts and responses from examples
for example in examples:
  split_example = example.split('-----------')
  # Skip malformed examples so prompts and responses stay the same length
  if len(split_example) < 4:
    continue
  prompts.append(split_example[1].strip())
  responses.append(split_example[3].strip())

# Create a DataFrame
df = pd.DataFrame({
    'prompt': prompts,
    'response': responses
})

# Remove duplicates
df = df.drop_duplicates()

print('There are ' + str(len(df)) + ' successfully-generated examples. Here are the first few:')

df.head()

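The parsing step above relies on the delimiter layout requested in the generation prompt: splitting on `-----------` puts the prompt at index 1 and the response at index 3. Here is a small standalone sketch of that logic. The `parse_example` helper and the sample text are illustrative and not part of the notebook.

```python
from typing import Optional, Tuple

DELIMITER = '-----------'

def parse_example(example: str) -> Optional[Tuple[str, str]]:
    """Return (prompt, response), or None if the text is malformed."""
    parts = example.split(DELIMITER)
    if len(parts) < 4:
        return None
    return parts[1].strip(), parts[3].strip()

sample = """prompt
-----------
How do I install MCLI?
-----------

response
-----------
Run pip install --upgrade mosaicml-cli.
-----------
"""
print(parse_example(sample))
# ('How do I install MCLI?', 'Run pip install --upgrade mosaicml-cli.')
```
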
Split into train and test sets.

In [None]:
# Split the data into train and test sets, with 90% in the train set
train_df = df.sample(frac=0.9, random_state=42)
test_df = df.drop(train_df.index)

# Save the dataframes to .jsonl files
train_df.to_json('train.jsonl', orient='records', lines=True)
test_df.to_json('test.jsonl', orient='records', lines=True)

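`df.sample(frac=0.9, random_state=42)` draws 90% of the rows at random, and `df.drop(train_df.index)` keeps the rest, so the two sets are disjoint and together cover every row. The same idea in plain Python is sketched below without pandas. `split_rows` is a hypothetical helper and will not reproduce pandas' exact row selection.

```python
import random

def split_rows(rows, train_frac=0.9, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(len(rows) * train_frac)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

rows = [{'prompt': f'q{i}', 'response': f'a{i}'} for i in range(30)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 27 3
```
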
# Install necessary libraries

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Define Hyperparameters

In [None]:
model_name = "NousResearch/llama-2-7b-hf" # use this if you have access to the official LLaMA 2 model "meta-llama/Llama-2-7b-chat-hf", though keep in mind you'll need to pass a Hugging Face key argument
dataset_name = "/content/train.jsonl"
new_model = "llama-2-7b-custom"
lora_r = 64
lora_alpha = 16
lora_dropout = 0.1
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False
output_dir = "./results"
num_train_epochs = 1
fp16 = False
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "constant"
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 25
logging_steps = 5
max_seq_length = None
packing = False
device_map = {"": 0}

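Two derived quantities are worth checking before training. The effective batch size is `per_device_train_batch_size * gradient_accumulation_steps` (times the number of devices, which is one here since `device_map` places the whole model on GPU 0), and LoRA scales its update by `lora_alpha / lora_r`. The sketch below recomputes both from the values above; `num_train_examples` is a made-up figure for illustration.

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 1
lora_r = 64
lora_alpha = 16
num_train_examples = 27  # assumption: e.g. 90% of 30 generated examples

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
lora_scaling = lora_alpha / lora_r

print(effective_batch_size)  # 4
print(steps_per_epoch)       # 7
print(lora_scaling)          # 0.25
```

With a dataset this small, a single epoch finishes in fewer steps than `save_steps = 25`, so generating more examples per doc gives the trainer more optimization steps to work with.
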
# Load Datasets and Train

In [None]:
# Load datasets
train_dataset = load_dataset('json', data_files='/content/train.jsonl', split="train")
valid_dataset = load_dataset('json', data_files='/content/test.jsonl', split="train")

# Preprocess datasets into the LLaMA 2 chat prompt format
def format_batch(examples):
    return {'text': [f'[INST] <<SYS>>\n{system_message.strip()}\n<</SYS>>\n\n' + prompt + ' [/INST] ' + response for prompt, response in zip(examples['prompt'], examples['response'])]}

train_dataset_mapped = train_dataset.map(format_batch, batched=True)
valid_dataset_mapped = valid_dataset.map(format_batch, batched=True)

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="all",
    evaluation_strategy="steps",
    eval_steps=5  # Evaluate every 5 steps
)
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset_mapped,
    eval_dataset=valid_dataset_mapped,  # Pass validation dataset here
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
trainer.train()
trainer.model.save_pretrained(new_model)

# Test the model
logging.set_verbosity(logging.CRITICAL)
prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\nHow do I create a new run? [/INST]" # replace the command here with something relevant to your task
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(prompt)
print(result[0]['generated_text'].replace(prompt, '').split('<</response>>')[0])

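Training and inference both rely on the same LLaMA 2 chat layout: the system message wrapped in `<<SYS>>` tags inside an `[INST] ... [/INST]` block, followed by the response. A small helper can keep the two in sync. `build_prompt` is a hypothetical name and is not part of the notebook.

```python
def build_prompt(system_message: str, question: str, response: str = '') -> str:
    """Format one example the same way the training data is formatted."""
    prompt = f'[INST] <<SYS>>\n{system_message.strip()}\n<</SYS>>\n\n{question} [/INST]'
    # Training text appends the response after a space; inference leaves it off.
    return f'{prompt} {response}' if response else prompt

print(build_prompt('Answer questions about MosaicML.', 'How do I create a new run?'))
```

Building prompts through one function avoids hand-typed markers drifting between cells, such as a missing `[` before `/INST]`.
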
# Run Inference

In [None]:
from transformers import pipeline

prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\nWhat might a typical training YAML look like? [/INST]" # replace the command here with something relevant to your task
num_new_tokens = 250  # change to the number of new tokens you want to generate

# Count the number of tokens in the prompt
num_prompt_tokens = len(tokenizer(prompt)['input_ids'])

# Calculate the maximum length for the generation
max_length = num_prompt_tokens + num_new_tokens

gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)
result = gen(prompt)
print(result[0]['generated_text'].replace(prompt, '').split('<</response>>')[0])

# Merge the model and store in Google Drive

In [None]:
# Merge and save the fine-tuned model
from google.colab import drive
drive.mount('/content/drive')

model_path = "/content/drive/MyDrive/llama-2-7b-custom"  # change to your preferred path

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Save the merged model
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

# Load a fine-tuned model from Drive and run inference

In [None]:
from google.colab import drive
from transformers import AutoModelForCausalLM, AutoTokenizer

drive.mount('/content/drive')

model_path = "/content/drive/MyDrive/llama-2-7b-custom"  # change to the path where your model is saved

model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [None]:
from transformers import pipeline

prompt = "What is 2 + 2?"  # change to your desired prompt
gen = pipeline('text-generation', model=model, tokenizer=tokenizer)
result = gen(prompt)
print(result[0]['generated_text'])