# Transfer learning with Hugging Face using CodeFlare

In this notebook you will learn how to use the **[Hugging Face](https://huggingface.co/)** support in the Ray ecosystem to carry out a text classification task with transfer learning. It draws on the examples **[here](https://huggingface.co/docs/transformers/tasks/sequence_classification)** and **[here](https://docs.ray.io/en/latest/train/getting-started-transformers.html)**.

The example runs a text classification task on the **[IMDb dataset](https://huggingface.co/datasets/imdb)**, classifying movie reviews as positive or negative. The Hugging Face libraries make it easy to load both the dataset and a pretrained model for this task. Here we use **distilbert-base-uncased**, a smaller, distilled version of **BERT**.

### Getting all the requirements in place

In [12]:
# Import pieces from codeflare-sdk
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication

In [None]:
# Create authentication object for user permissions
# If unused, the SDK will automatically check for a default kubeconfig, then in-cluster config
# KubeConfigFileAuthentication can also be used to specify kubeconfig path manually
auth = TokenAuthentication(
    token = "XXXX",
    server = "XXXX",
    skip_tls = False
)
auth.login()
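
Hard-coding a token in a notebook makes it easy to leak. The following is a minimal sketch, assuming the token and server URL are exported as environment variables. The variable names `CODEFLARE_TOKEN` and `CODEFLARE_SERVER` and the helper `auth_kwargs_from_env` are hypothetical and not part of the SDK; the returned dictionary only reuses the keyword names passed to `TokenAuthentication` above.

```python
import os


def auth_kwargs_from_env(env=None):
    """Collect TokenAuthentication keyword arguments from environment variables.

    CODEFLARE_TOKEN and CODEFLARE_SERVER are hypothetical names chosen for
    this sketch; use whatever your deployment defines.
    """
    env = os.environ if env is None else env
    required = ("CODEFLARE_TOKEN", "CODEFLARE_SERVER")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "token": env["CODEFLARE_TOKEN"],
        "server": env["CODEFLARE_SERVER"],
        "skip_tls": False,  # keep TLS verification on, as in the cell above
    }
```

With the variables set, the authentication object could then be created as `TokenAuthentication(**auth_kwargs_from_env())`.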

Here, we want to define our cluster by specifying the resources we require for our batch workload. Below, we define our cluster object (which generates a corresponding Ray Cluster).

NOTE: We must specify the `image` that will be used in our RayCluster. We recommend bringing your own image that suits your purposes; the example here uses a community image.

In [13]:
# Create our cluster and submit
# The SDK will try to find the name of your default local queue based on the annotation "kueue.x-k8s.io/default-queue": "true" unless you specify the local queue manually below
cluster = Cluster(ClusterConfiguration(name='hfgputest', 
                                       namespace="default", # Update to your namespace
                                       head_gpus=1, # For GPU enabled workloads set the head_gpus and num_gpus
                                       num_gpus=1,
                                       num_workers=1,
                                       min_cpus=8, 
                                       max_cpus=8, 
                                       min_memory=16, 
                                       max_memory=16, 
                                       image="quay.io/rhoai/ray:2.23.0-py39-cu121",
                                       write_to_file=False, # When enabled Ray Cluster yaml files are written to /HOME/.codeflare/resources 
                                       # local_queue="local-queue-name" # Specify the local queue manually
                                       ))

Written to: hfgputest.yaml


Next, we want to bring our cluster up, so we call the `up()` function below to submit our Ray Cluster onto the queue, and begin the process of obtaining our resource cluster.

In [None]:
cluster.up()

Now, we want to check on the initial status of our resource cluster, then wait until it is finally ready for use.

In [17]:
cluster.status()

(False, <CodeFlareClusterStatus.QUEUED: 2>)

In [None]:
cluster.wait_ready()

In [None]:
cluster.status()
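
Waiting for readiness amounts to polling a status check until it succeeds or a deadline passes. The sketch below illustrates that pattern in isolation; `wait_until` and its parameters are hypothetical names for illustration, and the SDK's own `wait_ready()` may be implemented differently. The `check` callable stands in for any readiness probe.

```python
import time


def wait_until(check, timeout_s=300.0, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout_s elapses.

    Returns the number of polls performed; raises TimeoutError otherwise.
    """
    deadline = clock() + timeout_s
    polls = 0
    while True:
        polls += 1
        if check():
            return polls
        if clock() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s} seconds")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the helper testable without real waiting.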

Let's quickly verify that the specs of the cluster are as expected.

In [18]:
cluster.details()

<RayClusterStatus.READY: 'ready'>

In [19]:
ray_cluster_uri = cluster.cluster_uri()

**NOTE**: Now we have our resource cluster with the desired GPUs, so we can interact with it to train the HuggingFace model.
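
Before launching training, it helps to confirm that the GPUs the training job will request actually fit into the cluster. This is a minimal sketch, assuming one GPU per training worker (Ray Train's default when `use_gpu=True`) and that the head node's GPU can be scheduled; the helper names are hypothetical.

```python
def total_cluster_gpus(head_gpus, num_workers, gpus_per_worker):
    """GPUs exposed by a cluster with one head node and num_workers worker nodes."""
    return head_gpus + num_workers * gpus_per_worker


def fits(train_workers, head_gpus, num_workers, gpus_per_worker):
    """True if train_workers single-GPU training workers can all be scheduled."""
    return train_workers <= total_cluster_gpus(head_gpus, num_workers, gpus_per_worker)


# Values from the ClusterConfiguration above: head_gpus=1, num_workers=1, num_gpus=1.
print(total_cluster_gpus(1, 1, 1))  # 2
print(fits(2, 1, 1, 1))  # True
print(fits(3, 1, 1, 1))  # False
```

Under these assumptions, a training job asking for more GPU workers than the cluster exposes cannot be fully scheduled, so either the cluster or the worker count needs adjusting.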

In [20]:
#before proceeding make sure the cluster exists and the uri is not empty
assert ray_cluster_uri, "Ray cluster needs to be started and set before proceeding"

import ray

# Reset the Ray context in case one already exists.
ray.shutdown()

# Install the additional libraries required for this training run
runtime_env = {"pip": ["transformers==4.41.2", "datasets==2.17.0", "accelerate==0.31.0", "scikit-learn==1.5.0"]}

# Establish the connection to the Ray cluster.
# NOTE: This will work for in-cluster notebook servers (RHODS/ODH), but not for local machines
# To see how to connect from your laptop, go to demo-notebooks/additional-demos/local_interactive.ipynb
ray.init(address=ray_cluster_uri, runtime_env=runtime_env)

print("Ray cluster is up and running: ", ray.is_initialized())

Ray cluster is up and running:  True


**NOTE**: This task needs additional pip packages, which we install on the cluster by passing them in the `runtime_env` argument.
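
Keeping the version pins in one mapping makes them easier to review and update. A small sketch of building the same `runtime_env` dictionary from such a mapping; `pip_runtime_env` is a hypothetical helper, and only the `{"pip": [...]}` shape comes from the cell above.

```python
def pip_runtime_env(pins):
    """Build a runtime_env dict from a {package: version} mapping."""
    return {"pip": [f"{name}=={version}" for name, version in pins.items()]}


# Same packages and versions as in the cell above.
pins = {
    "transformers": "4.41.2",
    "datasets": "2.17.0",
    "accelerate": "0.31.0",
    "scikit-learn": "1.5.0",
}
runtime_env = pip_runtime_env(pins)
```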

### Transfer learning code from huggingface

The code below is based on the examples **[here](https://huggingface.co/docs/transformers/tasks/sequence_classification)** and **[here](https://docs.ray.io/en/latest/train/getting-started-transformers.html)**.

In [21]:
@ray.remote
def train_fn():
    import os
    import numpy as np
    from datasets import load_dataset, load_metric
    import transformers
    from transformers import (
        Trainer,
        TrainingArguments,
        AutoTokenizer,
        AutoModelForSequenceClassification,
    )
    import ray.train.huggingface.transformers
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # When running in a multi-node cluster you will need persistent storage that is accessible across all worker nodes. 
    # See www.github.com/project-codeflare/codeflare-sdk/tree/main/docs/s3-compatible-storage.md for more information.
    
    def train_func():
        # Datasets
        dataset = load_dataset("imdb")
        tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

        def tokenize_function(examples):
            return tokenizer(examples["text"], padding="max_length", truncation=True)

        small_train_dataset = (
            dataset["train"].select(range(100)).map(tokenize_function, batched=True)
        )
        small_eval_dataset = (
            dataset["test"].select(range(100)).map(tokenize_function, batched=True)
        )

        # Model
        model = AutoModelForSequenceClassification.from_pretrained(
            "distilbert-base-uncased", num_labels=2
        )

        def compute_metrics(eval_pred):
            metric = load_metric("accuracy")
            logits, labels = eval_pred
            predictions = np.argmax(logits, axis=-1)
            return metric.compute(predictions=predictions, references=labels)

        # Hugging Face Trainer
        training_args = TrainingArguments(
            output_dir="test_trainer",
            evaluation_strategy="epoch",
            save_strategy="epoch",
            report_to="none",
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=small_train_dataset,
            eval_dataset=small_eval_dataset,
            compute_metrics=compute_metrics,
        )


        callback = ray.train.huggingface.transformers.RayTrainReportCallback()
        trainer.add_callback(callback)

        trainer = ray.train.huggingface.transformers.prepare_trainer(trainer)

        trainer.train()


    ray_trainer = TorchTrainer(
        train_func,
        # With use_gpu=True each training worker requests one GPU by default, so
        # num_workers must not exceed the number of GPUs available in the cluster.
        scaling_config=ScalingConfig(num_workers=3, use_gpu=True),
        # Configure persistent storage that is accessible across
        # all worker nodes.
        # Uncomment and update the RunConfig below to include your storage details.
        # run_config=ray.train.RunConfig(storage_path="storage path"),
    )
    result: ray.train.Result = ray_trainer.fit()
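
The `compute_metrics` function above turns model logits into class predictions with an argmax and then scores them against the reference labels. The same calculation can be sketched in plain Python, with no NumPy or `datasets` dependency; the logits below are made-up illustration values, not model output.

```python
def argmax(row):
    """Index of the largest value in a sequence (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Two classes: index 0 = negative review, index 1 = positive review.
logits = [[2.1, -0.3], [0.2, 1.7], [0.9, 0.4], [-1.0, 0.5]]
labels = [0, 1, 1, 1]
print(accuracy(logits, labels))  # 0.75
```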

**NOTE:** This code produces a lot of output and runs for **approximately 2 minutes**. During execution it downloads the `imdb` dataset and the `distilbert-base-uncased` model, then fine-tunes the model on the dataset.
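
In the output captured for this notebook, the final log line has the form `Total run time: 142.70 seconds (...)`. If you save the output to a string, a regular expression can pull that figure out. This is a sketch that assumes exactly that wording; other Ray versions may phrase the summary differently, and `total_run_time_seconds` is a hypothetical helper.

```python
import re

# Matches the summary line printed in the run captured in this notebook.
_RUN_TIME = re.compile(r"Total run time: ([0-9]+(?:\.[0-9]+)?) seconds")


def total_run_time_seconds(log_text):
    """Return the last reported total run time in seconds, or None if absent."""
    matches = _RUN_TIME.findall(log_text)
    return float(matches[-1]) if matches else None


sample = "INFO tune.py:747 -- Total run time: 142.70 seconds (142.40 seconds for the tuning loop)."
print(total_run_time_seconds(sample))  # 142.7
```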

In [22]:
# Call the function defined above as a remote Ray task
ray.get(train_fn.remote())

Downloading builder script: 100%|██████████| 4.31k/4.31k [00:00<00:00, 5.60MB/s]
Downloading metadata: 100%|██████████| 2.17k/2.17k [00:00<00:00, 3.13MB/s]
Downloading readme: 100%|██████████| 7.59k/7.59k [00:00<00:00, 9.75MB/s]


(train_fn pid=250) Downloading and preparing dataset imdb/plain_text to /home/ray/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...

(train_fn pid=250) Dataset imdb downloaded and prepared to /home/ray/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 696.30it/s]                                               
Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 32.1kB/s]
Downloading: 100%|██████████| 483/483 [00:00<00:00, 600kB/s]
Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]
Downloading: 100%|██████████| 232k/232k [00:00<00:00, 4.80MB/s]
Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]
Downloading: 100%|██████████| 466k/466k [00:00<00:00, 7.88MB/s]

(train_fn pid=250) len of train Dataset({
(train_fn pid=250)     features: ['text', 'label', 'input_ids', 'attention_mask'],
(train_fn pid=250)     num_rows: 100
(train_fn pid=250) }) and test Dataset({
(train_fn pid=250)     features: ['text', 'label', 'input_ids', 'attention_mask'],
(train_fn pid=250)     num_rows: 100
(train_fn pid=250) })


 98%|█████████▊| 49/50 [00:32<00:00,  1.53ba/s]


(train_fn pid=250) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
(train_fn pid=250) 	- Avoid using `tokenizers` before the fork if possible
(train_fn pid=250) 	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
(train_fn pid=250) == Status ==
(train_fn pid=250) Current time: 2022-11-04 07:55:58 (running for 00:00:05.07)
(train_fn pid=250) Memory usage on this node: 6.4/240.1 GiB
(train_fn pid=250) Using FIFO scheduling algorithm.
(train_fn pid=250) Resources requested: 5.0/10 CPUs, 4.0/4 GPUs, 0.0/22.35 GiB heap, 0.0/6.59 GiB objects
(train_fn pid=250) Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2022-11-04_07-55-53
(train_fn pid=250) Number of trials: 1/1 (1 RUNNING)
(train_fn pid=250) +--------------------

(BaseWorkerMixin pid=182, ip=10.129.66.16) 2022-11-04 07:56:02,047	INFO torch.py:346 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=184, ip=10.129.66.16) 2022-11-04 07:56:02,045	INFO torch.py:346 -- Setting up process group for: env:// [rank=2, world_size=4]
(BaseWorkerMixin pid=183, ip=10.129.66.16) 2022-11-04 07:56:02,047	INFO torch.py:346 -- Setting up process group for: env:// [rank=1, world_size=4]
(BaseWorkerMixin pid=185, ip=10.129.66.16) 2022-11-04 07:56:02,048	INFO torch.py:346 -- Setting up process group for: env:// [rank=3, world_size=4]


(train_fn pid=250) == Status ==
(train_fn pid=250) Current time: 2022-11-04 07:56:03 (running for 00:00:10.07)
(train_fn pid=250) Memory usage on this node: 7.2/240.1 GiB
(train_fn pid=250) Using FIFO scheduling algorithm.
(train_fn pid=250) Resources requested: 5.0/10 CPUs, 4.0/4 GPUs, 0.0/22.35 GiB heap, 0.0/6.59 GiB objects
(train_fn pid=250) Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2022-11-04_07-55-53
(train_fn pid=250) Number of trials: 1/1 (1 RUNNING)
(train_fn pid=250) +--------------------------------+----------+------------------+
(train_fn pid=250) | Trial name                     | status   | loc              |
(train_fn pid=250) |--------------------------------+----------+------------------|
(train_fn pid=250) | HuggingFaceTrainer_c7d60_00000 | RUNNING  | 10.129.66.16:146 |
(train_fn pid=250) +-----------

Downloading: 100%|██████████| 483/483 [00:00<00:00, 588kB/s]

(train_fn pid=250) == Status ==
(train_fn pid=250) Current time: 2022-11-04 07:56:08 (running for 00:00:15.07)
(train_fn pid=250) Memory usage on this node: 7.5/240.1 GiB
(train_fn pid=250) Using FIFO scheduling algorithm.
(train_fn pid=250) Resources requested: 5.0/10 CPUs, 4.0/4 GPUs, 0.0/22.35 GiB heap, 0.0/6.59 GiB objects
(train_fn pid=250) Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2022-11-04_07-55-53
(train_fn pid=250) Number of trials: 1/1 (1 RUNNING)
(train_fn pid=250) +--------------------------------+----------+------------------+
(train_fn pid=250) | Trial name                     | status   | loc              |
(train_fn pid=250) |--------------------------------+----------+------------------|
(train_fn pid=250) | HuggingFaceTrainer_c7d60_00000 | RUNNING  | 10.129.66.16:146 |
(train_fn pid=250) +-----------

Downloading: 100%|██████████| 268M/268M [00:04<00:00, 63.9MB/s]
(BaseWorkerMixin pid=182, ip=10.129.66.16) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight']

(train_fn pid=250) == Status ==
(train_fn pid=250) Current time: 2022-11-04 07:56:13 (running for 00:00:20.08)
(train_fn pid=250) Memory usage on this node: 12.3/240.1 GiB
(train_fn pid=250) Using FIFO scheduling algorithm.
(train_fn pid=250) Resources requested: 5.0/10 CPUs, 4.0/4 GPUs, 0.0/22.35 GiB heap, 0.0/6.59 GiB objects
(train_fn pid=250) Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2022-11-04_07-55-53
(train_fn pid=250) Number of trials: 1/1 (1 RUNNING)
(train_fn pid=250) +--------------------------------+----------+------------------+
(train_fn pid=250) | Trial name                     | status   | loc              |
(train_fn pid=250) |--------------------------------+----------+------------------|
(train_fn pid=250) | HuggingFaceTrainer_c7d60_00000 | RUNNING  | 10.129.66.16:146 |
(train_fn pid=250) +----------

(train_fn pid=250) == Status ==
(train_fn pid=250) Current time: 2022-11-04 07:56:18 (running for 00:00:25.08)
(train_fn pid=250) Memory usage on this node: 13.7/240.1 GiB
(train_fn pid=250) Using FIFO scheduling algorithm.
(train_fn pid=250) Resources requested: 5.0/10 CPUs, 4.0/4 GPUs, 0.0/22.35 GiB heap, 0.0/6.59 GiB objects
(train_fn pid=250) Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2022-11-04_07-55-53
(train_fn pid=250) Number of trials: 1/1 (1 RUNNING)
(train_fn pid=250) +--------------------------------+----------+------------------+
(train_fn pid=250) | Trial name                     | status   | loc              |
(train_fn pid=250) |--------------------------------+----------+------------------|
(train_fn pid=250) | HuggingFaceTrainer_c7d60_00000 | RUNNING  | 10.129.66.16:146 |
(train_fn pid=250) +----------

(BaseWorkerMixin pid=182, ip=10.129.66.16) Saving model checkpoint to /tmp/hf_imdb/test/checkpoint-391
(BaseWorkerMixin pid=182, ip=10.129.66.16) Configuration saved in /tmp/hf_imdb/test/checkpoint-391/config.json
(BaseWorkerMixin pid=182, ip=10.129.66.16) Model weights saved in /tmp/hf_imdb/test/checkpoint-391/pytorch_model.bin


(train_fn pid=250) Result for HuggingFaceTrainer_c7d60_00000:
(train_fn pid=250)   _time_this_iter_s: 118.07144260406494
(train_fn pid=250)   _timestamp: 1667573883
(train_fn pid=250)   _training_iteration: 1
(train_fn pid=250)   date: 2022-11-04_07-58-03
(train_fn pid=250)   done: false
(train_fn pid=250)   epoch: 1.0
(train_fn pid=250)   experiment_id: 7bc6ab25d0414fcbb589bcb5d0f29b99
(train_fn pid=250)   hostname: hfgputest-worker-small-group-hfgputest-q4758
(train_fn pid=250)   iterations_since_restore: 1
(train_fn pid=250)   node_ip: 10.129.66.16
(train_fn pid=250)   pid: 146
(train_fn pid=250)   should_checkpoint: true
(train_fn pid=250)   step: 391
(train_fn pid=250)   time_since_restore: 124.55581378936768
(train_fn pid=250)   time_this_iter_s: 124.55581378936768

(BaseWorkerMixin pid=182, ip=10.129.66.16) Training completed. Do not forget to share your model on huggingface.co/models =)


(train_fn pid=250) == Status ==
(train_fn pid=250) Current time: 2022-11-04 07:58:13 (running for 00:02:19.36)
(train_fn pid=250) Memory usage on this node: 16.0/240.1 GiB
(train_fn pid=250) Using FIFO scheduling algorithm.
(train_fn pid=250) Resources requested: 5.0/10 CPUs, 4.0/4 GPUs, 0.0/22.35 GiB heap, 0.0/6.59 GiB objects
(train_fn pid=250) Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2022-11-04_07-55-53
(train_fn pid=250) Number of trials: 1/1 (1 RUNNING)
(train_fn pid=250) +--------------------------------+----------+------------------+--------+------------------+-----------------+----------------------------+--------------------------+
(train_fn pid=250) | Trial name                     | status   | loc              |   iter |   total time (s) |   train_runtime |   train_samples_per_second |   train_steps_per_second |
(train_fn pid=250) |


(train_fn pid=250) Result for HuggingFaceTrainer_c7d60_00000:
(train_fn pid=250)   _time_this_iter_s: 118.07144260406494
(train_fn pid=250)   _timestamp: 1667573883
(train_fn pid=250)   _training_iteration: 1
(train_fn pid=250)   date: 2022-11-04_07-58-03
(train_fn pid=250)   done: true
(train_fn pid=250)   epoch: 1.0
(train_fn pid=250)   experiment_id: 7bc6ab25d0414fcbb589bcb5d0f29b99
(train_fn pid=250)   experiment_tag: '0'
(train_fn pid=250)   hostname: hfgputest-worker-small-group-hfgputest-q4758
(train_fn pid=250)   iterations_since_restore: 1
(train_fn pid=250)   node_ip: 10.129.66.16
(train_fn pid=250)   pid: 146
(train_fn pid=250)   should_checkpoint: true
(train_fn pid=250)   step: 391
(train_fn pid=250)   time_since_restore: 124.55581378936768

(train_fn pid=250) 2022-11-04 07:58:16,398	INFO tune.py:747 -- Total run time: 142.70 seconds (142.40 seconds for the tuning loop).


Finally, we bring our resource cluster down and release/terminate the associated resources, bringing everything back to the way it was before our cluster was brought up.

In [None]:
cluster.down()

In [None]:
auth.logout()
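
If a cell fails midway, it is easy to forget the teardown and leave GPUs allocated. One way to guard against that is a context manager that always calls `down()`. The sketch below assumes only that the cluster object exposes `up()` and `down()` as used in this notebook; `running` and `FakeCluster` are hypothetical names, and the fake exists purely to show the call order.

```python
from contextlib import contextmanager


@contextmanager
def running(cluster):
    """Bring a cluster up, yield it, and always bring it down afterwards.

    cluster is any object with up() and down() methods.
    """
    cluster.up()
    try:
        yield cluster
    finally:
        cluster.down()  # runs even if the body raises


class FakeCluster:
    """Stand-in used only to demonstrate the call order."""

    def __init__(self):
        self.calls = []

    def up(self):
        self.calls.append("up")

    def down(self):
        self.calls.append("down")


fake = FakeCluster()
try:
    with running(fake):
        raise RuntimeError("training failed")
except RuntimeError:
    pass
print(fake.calls)  # ['up', 'down']
```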

## Conclusion
As the example above shows, you can run Hugging Face transfer learning tasks natively on CodeFlare. You can scale them from 1 to n GPUs without significant code changes while still using the native Hugging Face trainer.