353 changes: 353 additions & 0 deletions distributed_training/llama2/README.md

Large diffs are not rendered by default.

191 changes: 0 additions & 191 deletions distributed_training/llama2/demo.md

This file was deleted.

113 changes: 113 additions & 0 deletions distributed_training/llama2/load-back-FSDP-checkpoints.ipynb
@@ -0,0 +1,113 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "58901f38",
"metadata": {},
"source": [
"# Loading back FSDP checkpoints\n",
"\n",
"For more information: https://github.com/facebookresearch/llama-recipes/blob/main/docs/inference.md#loading-back-fsdp-checkpoints"
]
},
{
"cell_type": "markdown",
"id": "98b57d30",
"metadata": {},
"source": [
"## All of the code in this notebook should be run in the OCI Data Science Notebook Terminal!"
]
},
{
"cell_type": "markdown",
"id": "05a59132",
"metadata": {},
"source": [
"Before you start make sure that you've installed the `pytorch20_p39_gpu_v2` Conda and activate it in the `Terminal`\n",
"\n",
"```bash\n",
"odsc conda install -s pytorch20_p39_gpu_v2\n",
"```\n",
"\n",
"... then activate it\n",
"\n",
"```bash\n",
"conda activate /home/datascience/conda/pytorch20_p39_gpu_v2\n",
"```\n",
"\n",
"Then install all of the required dependancies\n",
"\n",
"```bash\n",
"!pip install tokenizers==0.13.3 -U && pip install transformers -U && pip install llama-recipes==0.0.1\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "6e2aecea",
"metadata": {},
"source": [
"Following commands work best when you execute them in the `terminal` too!\n",
"\n",
"First you have to login to access the Llama2 model\n",
"```bash\n",
"!huggingface-cli login\n",
"```\n",
"\n",
"Then run the checkpoint conververter, it looks like following\n",
"\n",
"```bash\n",
"python -m llama_recipes.inference.checkpoint_converter_fsdp_hf --fsdp_checkpoint_path /mnt/llama2/outputs/lvp-7b/ocid1.datasciencejob.oc1.eu-frankfurt-1.amaaaaaan/fine-tuned-meta-llama/Llama-2-7b-hf --consolidated_model_path /mnt/llama2/fsdp_consolidated_checkpoints --HF_model_path_or_name \"meta-llama/Llama-2-13b-hf\"\n",
"```\n",
"\n",
"Replace the `--fsdp_checkpoint_path` with the folder you specified by the `--dist_checkpoint_root_folder` which will be the location at your object storage bucket, as per the example above. Notice that we ran this in OCI Data Science Notebooks and mounted the object storage bucket used to store the FSDP checkpoints under `/mnt/llama2`. The `--consolidated_model_path` is the path where the consolidated weights will be stored back. The `--HF_model_path_or_name` is the name of the model used for the fine-tuning, or if you downloaded the model locally, the location of the downloaded model.\n",
"\n",
"If the merging process was successful, you should see in your `--consolidated_model_path` folder something like this:\n",
"\n",
"```bash\n",
" 0 drwxr-xr-x. 1 datascience users 0 Oct 18 15:48 .\n",
" 0 drwxr-xr-x. 1 datascience users 0 Oct 18 14:38 ..\n",
" 512 -rw-r--r--. 1 datascience users 42 Oct 18 16:35 added_tokens.json\n",
"1.0K -rw-r--r--. 1 datascience users 656 Oct 18 16:35 config.json\n",
" 512 -rw-r--r--. 1 datascience users 111 Oct 18 16:35 generation_config.json\n",
"9.2G -rw-r--r--. 1 datascience users 9.2G Oct 18 16:35 pytorch_model-00001-of-00003.bin\n",
"9.3G -rw-r--r--. 1 datascience users 9.3G Oct 18 16:36 pytorch_model-00002-of-00003.bin\n",
"6.7G -rw-r--r--. 1 datascience users 6.7G Oct 18 16:36 pytorch_model-00003-of-00003.bin\n",
" 24K -rw-r--r--. 1 datascience users 24K Oct 18 16:36 pytorch_model.bin.index.json\n",
" 512 -rw-r--r--. 1 datascience users 72 Oct 18 16:35 special_tokens_map.json\n",
"1.5K -rw-r--r--. 1 datascience users 1.2K Oct 18 16:35 tokenizer_config.json\n",
"489K -rw-r--r--. 1 datascience users 489K Oct 18 16:35 tokenizer.model\n",
"```"
]
},
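{
"cell_type": "markdown",
"id": "3f7c9a21",
"metadata": {},
"source": [
"As a quick sanity check, you can load the consolidated weights back with `transformers`. The cell below is a minimal sketch, not part of the original recipe; it assumes the hypothetical `--consolidated_model_path` from the example above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d4e8b02",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: load the consolidated checkpoint to verify the merge worked.\n",
"# The path is the hypothetical --consolidated_model_path used above.\n",
"from transformers import AutoModelForCausalLM, AutoTokenizer\n",
"\n",
"consolidated_path = \"/mnt/llama2/fsdp_consolidated_checkpoints\"\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(consolidated_path)\n",
"model = AutoModelForCausalLM.from_pretrained(consolidated_path)\n",
"\n",
"# Generate a few tokens to confirm the weights load and run end to end.\n",
"prompt = \"Hello, my name is\"\n",
"inputs = tokenizer(prompt, return_tensors=\"pt\")\n",
"outputs = model.generate(**inputs, max_new_tokens=20)\n",
"print(tokenizer.decode(outputs[0], skip_special_tokens=True))"
]
},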
{
"cell_type": "code",
"execution_count": null,
"id": "2407ae40",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:pytorch20_p39_gpu_v2]",
"language": "python",
"name": "conda-env-pytorch20_p39_gpu_v2-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}