Scaling transfer learning tasks using CodeFlare on OpenShift Container Platform (OCP)

Foundation models (e.g., BERT, GPT-3, RoBERTa) are trained on a large corpus of data and enable a wide variety of downstream tasks such as sentiment analysis, Q&A, and classification. This repository demonstrates how an enterprise can take a foundation model and run downstream tasks in parallel on a Hybrid Cloud platform.

We use RoBERTa as our base model and run the GLUE benchmark, which consists of 10 downstream tasks, each with 10 seeds. Each of these tasks is transformed into a Ray task using the @ray.remote decorator, with a single GPU allocated per task.
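As an illustration of the pattern (a minimal sketch; run_glue_task and its body are hypothetical placeholders, not the repository's actual driver code):

import random
import ray

ray.init(address="auto")  # connect to the running Ray cluster

@ray.remote(num_gpus=1)  # reserve one GPU per task; Ray queues calls until a GPU is free
def run_glue_task(task_name: str, seed: int) -> float:
    # Hypothetical placeholder: the real driver fine-tunes roberta-base
    # on task_name with this seed and returns the evaluation score.
    random.seed(seed)
    return random.random()

# Fan out all task/seed pairs; each runs as soon as a GPU becomes available.
futures = [run_glue_task.remote(task, seed)
           for task in ["WNLI", "RTE"] for seed in range(10)]
scores = ray.get(futures)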

Setting up an OpenShift cluster

We assume that the user of this repository has an OpenShift cluster set up with the GPU operator. We also assume that the user has the OpenShift CLI installed and has their data in an S3-compatible object store. Python scripts for downloading all GLUE data are available here.

Creating the S3 objects for roberta-base and glue_data

Both objects should be placed into the same S3 bucket.

Create the RoBERTa base model S3 object with key="roberta-base" and contents=roberta-base.tgz

- git clone https://huggingface.co/roberta-base
- tar -czf roberta-base.tgz roberta-base

Create the S3 object for the GLUE datasets with key="glue_data" and contents=glue_data.tgz

- python download_glue_data.py --data_dir glue_data --tasks all
- tar -czf glue_data.tgz glue_data
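If you prefer to script the uploads, a minimal sketch using boto3 (the bucket name, credentials, and endpoint below are placeholders):

import boto3

# Placeholders: substitute your bucket, credentials, and (for non-AWS stores) endpoint.
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    endpoint_url="https://s3.example.com",  # omit this argument for AWS S3
)

# Upload both tarballs into the same bucket under the keys the driver expects.
s3.upload_file("roberta-base.tgz", "your-bucket", "roberta-base")
s3.upload_file("glue_data.tgz", "your-bucket", "glue_data")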

Running glue_benchmark

  1. Log into OCP using the oc login command (on IBM Cloud, go to the menu under IAM → <your username/email>, then "Copy Login Command").

  2. Use oc project to confirm your namespace is as desired. You can switch to your desired namespace by running:

$ oc project {your-namespace}
  3. Use the provided template-s3-creds.yaml to create a personal YAML secrets file with your namespace and S3 credentials. Note that to use AWS S3 storage, the value for ENDPOINT_URL should be empty. The program simple_check_s3.py can be used to validate S3 access from the head node; a sketch of such a check appears after this step.
    Then register the secrets:
$ oc create -f {your-handle}-s3-creds.yaml
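For reference, a check along these lines (a hedged sketch; the environment variable names are assumptions drawn from the secrets template, not necessarily the actual contents of simple_check_s3.py) can confirm the credentials work:

import os
import boto3

# The secret registered above exposes these values to the pods; the exact
# variable names here are assumptions for illustration.
endpoint = os.environ.get("ENDPOINT_URL") or None  # empty means AWS S3
s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["S3_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["S3_SECRET_ACCESS_KEY"],
    endpoint_url=endpoint,
)

# Listing the bucket confirms that the endpoint and credentials are valid.
for obj in s3.list_objects_v2(Bucket="your-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])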
  4. [Required only once] Check whether the Ray CRD is installed:
$ oc get crd | grep ray

If it is not present, you can install the Ray CRD using:

$ oc apply -f cluster_crd.yaml  
  5. Create a Ray operator in your namespace:
$ oc apply -f glue-operator.yaml
  6. Create a Ray cluster in your namespace. Change the minimum and maximum number of workers as needed (around line 100 of glue-cluster.yaml):
$ oc apply -f glue-cluster.yaml
  7. If the container images are not already cached on the OCP nodes, they will be pulled; this can take 5-10 minutes or more. When the Ray cluster head and worker pods are in the Ready state, copy the application driver to the head node:
$ oc get po --watch
$ oc cp glue_benchmark.py glue-cluster-head-XXXXX:/home/ray/glue/
  8. Exec into the head node and run the application. For example:
$ oc exec -it glue-cluster-head-cjgzk -- /bin/bash
(base) 1000650000@glue-cluster-head-cjgzk:~/glue$ nohup ./glue_benchmark -b {bucket-name} -m roberta-base -t WNLI -M &

This runs the GLUE benchmark on the RoBERTa base model against the WNLI task with 10 different seeds, saving the model from the seed with the best score. Before the computation starts, the GLUE datasets and base model must be loaded onto each worker node. Data loading is a two-step process: first the S3 objects are pulled and cached locally in Plasma (Ray's shared-memory object store), and then each worker pulls the data from Plasma and unpacks it into its local filesystem. Additional processing with the same cluster reuses the local data.
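A sketch of that two-step pattern (function names, bucket, and worker count are illustrative placeholders, not the repository's exact code):

import io
import os
import tarfile
import boto3
import ray

def fetch_tarball(bucket: str, key: str) -> bytes:
    # Step 1: pull the S3 object once on the driver.
    s3 = boto3.client("s3")  # placeholder client configuration
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

@ray.remote
def unpack_locally(blob: bytes, dest: str) -> str:
    # Step 2: each worker reads the cached bytes from the object store
    # and unpacks them into its local filesystem.
    os.makedirs(dest, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(blob)) as tar:
        tar.extractall(dest)
    return dest

ray.init(address="auto")
# Cache the tarball in the Ray object store (backed by Plasma shared memory).
blob_ref = ray.put(fetch_tarball("your-bucket", "glue_data"))
# Fan the same object-store reference out to multiple tasks (one per worker
# node in the real driver); later tasks on the same node reuse the local copy.
ray.get([unpack_locally.remote(blob_ref, "/tmp/glue_data") for _ in range(4)])

Passing blob_ref rather than the raw bytes lets Ray share a single copy per node through the object store instead of re-downloading from S3 for every task.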

  9. Monitor progress using nohup.out. The evaluation results, along with the remote console output in log.log files, will be in /tmp/summary.

  10. When finished, clean up the active resources in your project:

$ oc delete -f glue-cluster.yaml
$ oc delete -f glue-operator.yaml

Conclusion

This demonstrates how downstream fine-tuning tasks can run in parallel on a GPU-enabled OpenShift cluster. Users can take arbitrary fine-tuning tasks written by data scientists and, following the pattern in this repository, scale them out on their Hybrid Cloud environment. The data never leaves the user's environment, and all available GPUs can be leveraged during the transfer learning process. In our experiments, all 8 GPUs across four nodes were used for training the various downstream tasks.
