# Running stanford_alpaca on Amazon SageMaker

This is a sample code to run stanford_alpaca on Amazon SageMaker, for demo or research use only!

In [3]:
## Update sagemaker python sdk version
!pip install -U sagemaker

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting sagemaker
  Downloading sagemaker-2.146.0.tar.gz (718 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m718.5/718.5 kB[0m [31m57.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... [?25ldone
[?25h  Created wheel for sagemaker: filename=sagemaker-2.146.0-py2.py3-none-any.whl size=964936 sha256=c0f4b2d2051f88eadc74daf58f1c91368e4c86cf6993ef37c91f5aba83f35759
  Stored in directory: /home/ec2-user/.cache/pip/wheels/08/f6/9a/3abd169a1b427683e78872b737fbab7831c8310fbec4c0acef
Successfully built sagemaker
Installing collected packages: sagemaker
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.132.0
    Uninstalling sagemaker-2.132.0:
      Successfully uninstalled sagemaker-2.132.0
Successfully installed sagemaker-2.146.0


In [1]:
import sagemaker
import boto3
from sagemaker import get_execution_role

sess = sagemaker.Session()
role = get_execution_role()
sagemaker_default_bucket = sess.default_bucket()

account = sess.boto_session.client("sts").get_caller_identity()["Account"]
region = sess.boto_session.region_name

In [3]:
## download training script from github
!git clone https://github.com/tatsu-lab/stanford_alpaca.git

Cloning into 'stanford_alpaca'...
remote: Enumerating objects: 111, done.[K
remote: Counting objects: 100% (63/63), done.[K
remote: Compressing objects: 100% (18/18), done.[K
remote: Total 111 (delta 49), reused 47 (delta 45), pack-reused 48[K
Receiving objects: 100% (111/111), 9.14 MiB | 4.85 MiB/s, done.
Resolving deltas: 100% (53/53), done.


In [4]:
## check data
!head -n 44 stanford_alpaca/alpaca_data.json

[
    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
    },
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow."
    },
    {
        "instruction": "Describe the structure of an atom.",
        "input": "",
        "output": "An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom."
    },
    {
        "instruction": "How can we reduce 

**Generate sample dataset for debug use.**

In [5]:
%%writefile sample_dataset.json
[
    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
    },
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow."
    },
    {
        "instruction": "Describe the structure of an atom.",
        "input": "",
        "output": "An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom."
    },
    {
        "instruction": "How can we reduce air pollution?",
        "input": "",
        "output": "There are a number of ways to reduce air pollution, such as shifting to renewable energy sources, encouraging the use of public transportation, prohibiting the burning of fossil fuels, implementing policies to reduce emissions from industrial sources, and implementing vehicle emissions standards. Additionally, individuals can do their part to reduce air pollution by reducing car use, avoiding burning materials such as wood, and changing to energy efficient appliances."
    },
    {
        "instruction": "Describe a time when you had to make a difficult decision.",
        "input": "",
        "output": "I had to make a difficult decision when I was working as a project manager at a construction company. I was in charge of a project that needed to be completed by a certain date in order to meet the client\u2019s expectations. However, due to unexpected delays, we were not able to meet the deadline and so I had to make a difficult decision. I decided to extend the deadline, but I had to stretch the team\u2019s resources even further and increase the budget. Although it was a risky decision, I ultimately decided to go ahead with it to ensure that the project was completed on time and that the client\u2019s expectations were met. The project was eventually successfully completed and this was seen as a testament to my leadership and decision-making abilities."
    },
    {
        "instruction": "Identify the odd one out.",
        "input": "Twitter, Instagram, Telegram",
        "output": "Telegram"
    },
    {
        "instruction": "Explain why the following fraction is equivalent to 1/4",
        "input": "4/16",
        "output": "The fraction 4/16 is equivalent to 1/4 because both numerators and denominators are divisible by 4. Dividing both the top and bottom numbers by 4 yields the fraction 1/4."
    },
    {
        "instruction": "Write a short story in third person narration about a protagonist who has to make an important career decision.",
        "input": "",
        "output": "John was at a crossroads in his life. He had just graduated college and was now facing the big decision of what career to pursue. After much deliberation, he decided that he wanted to be an accountant and help the financially disadvantaged. He had always been good with numbers and enjoyed seeing the tangible results of his work. \n\nJohn enrolled in accounting courses and initially found it quite challenging. He had to learn multiple systems and regulations quickly, but he worked hard and eventually excelled in his studies. After a few years, John started working at an accounting firm in his city. He was eager to put his knowledge of taxes and accounting to use in a real-world setting.\n\nJohn loved his job, as it let him express his creativity in finding strategies to save his clients money. After a few years at the firm, he became a senior accountant and was asked to manage bigger and more challenging cases. He was now a respected figure in the financial industry, but he still remembers when he was just a recent college graduate, unsure of the direction in which his life would take him."
    }
]

Writing sample_dataset.json


## Download pretrained model from HuggingFace Hub

To avoid download model from Huggingface hub failure, we download first and push those model files to S3 bucket first.

In [4]:
!pip install huggingface_hub

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting huggingface_hub
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m200.1/200.1 kB[0m [31m25.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: huggingface_hub
Successfully installed huggingface_hub-0.13.4


In [5]:
from huggingface_hub import snapshot_download
from pathlib import Path
path_str = r"../13bmodel"
local_cache_path = Path(path_str)
local_cache_path.mkdir(exist_ok=True)

model_name = "decapoda-research/llama-13b-hf"
model_name_s3 = "ds-llama-13b"

# Only download pytorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.model"]

model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_cache_path,
    allow_patterns=allow_patterns,
)

Fetching 48 files:   0%|          | 0/48 [00:00<?, ?it/s]

Downloading (…)5d4de5ce/config.json:   0%|          | 0.00/427 [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading (…)l-00002-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00004-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00001-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00000-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00005-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00003-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00007-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00006-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00008-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00009-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00010-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00011-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00012-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00013-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00014-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00015-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00016-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00017-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00018-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00019-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00020-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00021-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00022-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00023-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00024-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00025-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00026-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00027-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00028-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00029-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00031-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00030-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00033-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00032-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00034-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00035-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00036-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00037-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00038-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00039-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00040-of-00041.bin:   0%|          | 0.00/952M [00:00<?, ?B/s]

Downloading (…)l-00041-of-00041.bin:   0%|          | 0.00/983M [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/31.8k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/141 [00:00<?, ?B/s]

**Upload model files to S3**

In [6]:
# Get the model files path
import os
from glob import glob

paths = os.walk(path_str)#glob(r'./model/*')
for root, dirs, files in paths:
    for file in files:
        if file == 'config.json':
            print(os.path.join(root,file))
            local_model_path = str(os.path.join(root,file))[0:-11]
            print(local_model_path)

./13bmodel/models--decapoda-research--llama-13b-hf/snapshots/438770a656712a5072229b62256521845d4de5ce/config.json
./13bmodel/models--decapoda-research--llama-13b-hf/snapshots/438770a656712a5072229b62256521845d4de5ce/


In [14]:
%set_env region=$region
%set_env sagemaker_default_bucket=$sagemaker_default_bucket 
%set_env local_model_path=$local_model_path 
%set_env model_name_s3 = $model_name_s3

env: region=us-east-1
env: sagemaker_default_bucket=sagemaker-us-east-1-348052051973
env: local_model_path=./13bmodel/models--decapoda-research--llama-13b-hf/snapshots/438770a656712a5072229b62256521845d4de5ce/
env: model_name_s3=$model_name_s3


In [None]:
%%script bash

chmod +x ./s5cmd
./s5cmd sync ${local_model_path} s3://${sagemaker_default_bucket}/${model_name_s3}/pretrain/ 

rm -rf $local_model_path

## Prepare a docker image

In [10]:
%%writefile requirements.txt
numpy
rouge_score
fire
openai
transformers>=4.26.1
torch
sentencepiece
tokenizers==0.12.1
wandb

Writing requirements.txt


In [16]:
%%writefile Dockerfile2
## You should change below region code to the region you used, here sample is use us-west-2
From 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04 

ENV LANG=C.UTF-8
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE

COPY requirements.txt ./
RUN python3 -m pip install -r requirements.txt 
RUN python3 -m pip install git+https://github.com/huggingface/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176
RUN pip3 uninstall -y deepspeed && pip3 install deepspeed

# Make all local GPUs visible
ENV NVIDIA_VISIBLE_DEVICES="all"

Writing Dockerfile2


In [12]:
## You should change below region code to the region you used, here sample is use us-west-2
!aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.${region}.amazonaws.com

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


**Build image and push to ECR.**

In [13]:
%%sh

#!/usr/bin/env bash

# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.

# The argument to this script is the image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.
# The name of our algorithm
algorithm_name=sagemaker-alpaca-demo

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-east-1}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Login Succeeded
Sending build context to Docker daemon  52.06MB
Step 1/9 : From 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04
1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04: Pulling from huggingface-pytorch-training
Digest: sha256:6465c5dd6672419b1a60cb47dab82a0f4f1cca22abe3ba7ed9af0c313836df26
Status: Downloaded newer image for 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04
 ---> c5a6ef695006
Step 2/9 : ENV LANG=C.UTF-8
 ---> Running in bff6b98f3cb0
Removing intermediate container bff6b98f3cb0
 ---> f86efda73432
Step 3/9 : ENV PYTHONUNBUFFERED=TRUE
 ---> Running in aa64d3d4994d
Removing intermediate container aa64d3d4994d
 ---> 65932c4792b1
Step 4/9 : ENV PYTHONDONTWRITEBYTECODE=TRUE
 ---> Running in 00a3eb98575f
Removing intermediate container 00a3eb98575f
 ---> f2bb3986f066
Step 5/9 : COPY requirements.txt ./
 ---> d49

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



**Generate deepspeed config file.**

In [14]:
%%writefile ds.json
{
  "fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Writing ds.json


**Generate training entrypoint script.**

**Note: DO NOT CHANGE BELOW VAlUE OF "output_dir" and "cache_dir", keep it "/tmp/llama_out" and "/tmp".**

In [10]:
%%writefile train-13b.sh
#!/bin/bash

chmod +x ./s5cmd
./s5cmd sync s3://$MODEL_S3_BUCKET/$MODEL_NAME_S3/pretrain/* /tmp/llama_pretrain/

deepspeed --num_gpus=8 stanford_alpaca/train.py \
    --deepspeed ds.json \
    --model_name_or_path "/tmp/llama_pretrain/" \
    --data_path stanford_alpaca/alpaca_data.json \
    --output_dir "/tmp/llama_out" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size  1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --save_steps 2000 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --cache_dir '/tmp' \
    --fp16_full_eval \
    --fp16 \
    --report_to "none"

./s5cmd sync /tmp/llama_out s3://$MODEL_S3_BUCKET//$MODEL_NAME_S3/output/$(date +%Y-%m-%d-%H-%M-%S)/

Overwriting train-13b.sh


In [11]:
## The image uri which is build and pushed above
image_uri = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-alpaca-demo:latest".format(account, region)
image_uri

'348052051973.dkr.ecr.us-east-1.amazonaws.com/sagemaker-alpaca-demo:latest'

In [17]:
## set train_data_path to your training dataset path in s3
# train_data_path = f's3://{sagemaker_default_bucket}/ds-llama/train_data/'

# inputs = {'train': train_data_path}

### Modify train.py a little about how to save model

Modify the model save methods in training script, change from 

```
trainer.save_state()
safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
```

to

```
tokenizer.save_pretrained(training_args.output_dir)
trainer.save_model(training_args.output_dir)
```

In [18]:
## rename orignal train.py, in case to use further
!mv stanford_alpaca/train.py stanford_alpaca/train_bak.py

**The modified training script**

In [19]:
%%writefile stanford_alpaca/train.py

#    Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
#
#    Licensed under the Apache License, Version 2.0 (the "License");
#    you may not use this file except in compliance with the License.
#    You may obtain a copy of the License at
#
#        http://www.apache.org/licenses/LICENSE-2.0
#
#    Unless required by applicable law or agreed to in writing, software
#    distributed under the License is distributed on an "AS IS" BASIS,
#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#    See the License for the specific language governing permissions and
#    limitations under the License.

import copy
import logging
from dataclasses import dataclass, field
from typing import Optional, Dict, Sequence

import torch
import transformers
from torch.utils.data import Dataset
from transformers import Trainer

import utils

IGNORE_INDEX = -100
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "</s>"
DEFAULT_UNK_TOKEN = "</s>"
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}


@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="facebook/opt-125m")


@dataclass
class DataArguments:
    data_path: str = field(default=None, metadata={"help": "Path to the training data."})


@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(
        default=512,
        metadata={"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."},
    )


def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
    """Collects the state dict and dump to disk."""
    state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
        del state_dict
        trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa


def smart_tokenizer_and_embedding_resize(
    special_tokens_dict: Dict,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg


def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Tokenize a list of strings."""
    tokenized_list = [
        tokenizer(
            text,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        )
        for text in strings
    ]
    input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
    input_ids_lens = labels_lens = [
        tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list
    ]
    return dict(
        input_ids=input_ids,
        labels=labels,
        input_ids_lens=input_ids_lens,
        labels_lens=labels_lens,
    )


def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)


class SupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer):
        super(SupervisedDataset, self).__init__()
        logging.warning("Loading data...")
        list_data_dict = utils.jload(data_path)

        logging.warning("Formatting inputs...")
        prompt_input, prompt_no_input = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"]
        sources = [
            prompt_input.format_map(example) if example.get("input", "") != "" else prompt_no_input.format_map(example)
            for example in list_data_dict
        ]
        targets = [f"{example['output']}{tokenizer.eos_token}" for example in list_data_dict]

        logging.warning("Tokenizing inputs... This may take some time...")
        data_dict = preprocess(sources, targets, tokenizer)

        self.input_ids = data_dict["input_ids"]
        self.labels = data_dict["labels"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        return dict(input_ids=self.input_ids[i], labels=self.labels[i])


@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )


def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer, data_args) -> Dict:
    """Make dataset and collator for supervised fine-tuning."""
    train_dataset = SupervisedDataset(tokenizer=tokenizer, data_path=data_args.data_path)
    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    return dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator)


def train():
    parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
    )

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
        model_max_length=training_args.model_max_length,
        padding_side="right",
        use_fast=False,
    )
    if tokenizer.pad_token is None:
        smart_tokenizer_and_embedding_resize(
            special_tokens_dict=dict(pad_token=DEFAULT_PAD_TOKEN),
            tokenizer=tokenizer,
            model=model,
        )
    if "llama" in model_args.model_name_or_path:
        tokenizer.add_special_tokens(
            {
                "eos_token": DEFAULT_EOS_TOKEN,
                "bos_token": DEFAULT_BOS_TOKEN,
                "unk_token": DEFAULT_UNK_TOKEN,
            }
        )

    data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
    trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
    trainer.train()
#     trainer.save_state()
#     safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
    tokenizer.save_pretrained(training_args.output_dir)
    trainer.save_model(training_args.output_dir)


if __name__ == "__main__":
    train()


Writing stanford_alpaca/train.py


Everything is ready, let's launch the training job.

## Create SageMaker Training Job

In [None]:
import time
from sagemaker.estimator import Estimator

environment = {
              'MODEL_S3_BUCKET': sagemaker_default_bucket, # The bucket to store pretrained model and fine-tune model
              'MODEL_NAME_S3': model_name_s3
}

base_job_name = f'stanford_alpaca-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}',          

instance_type = 'ml.p4d.24xlarge'

estimator = Estimator(role=role,
                      entry_point='train-13b.sh',
                      source_dir='./',
                      instance_count=1,
                      instance_type=instance_type,
                      image_uri=image_uri,
                      environment=environment,
                      disable_profiler=True,
                      debugger_hook_config=False)

estimator.fit()
# estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: sagemaker-alpaca-demo-2023-04-14-08-09-04-934


2023-04-14 08:09:10 Starting - Starting the training job......
2023-04-14 08:09:55 Starting - Preparing the instances for training.........
2023-04-14 08:11:36 Downloading - Downloading input data...
2023-04-14 08:11:51 Training - Downloading the training image.....................
2023-04-14 08:15:33 Training - Training image download completed. Training in progress.......[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2023-04-14 08:16:27,684 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2023-04-14 08:16:27,778 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-04-14 08:16:27,787 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[34m2023-04-14 08:16:27,789 sagemaker_pytorch_container.training INFO     Invoking user training script.[0m
[34m2023-04-14 

## Reference

[SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase)

[DeepSpeed Configuration JSON](https://www.deepspeed.ai/docs/config-json/)

[SageMaker Examples](https://github.com/aws/amazon-sagemaker-examples)