## Dependency preparation

Install the `huggingface_hub` package, which is used later to download the model artifacts.

In [None]:
!pip install huggingface_hub

Prepare the training data. Here we use the aligned corpus from the stanford_alpaca repo.

In [None]:
!wget https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
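
Before training, it can help to sanity-check the downloaded file. The sketch below assumes `alpaca_data.json` is a JSON list of records with `instruction`, `input`, and `output` fields. `summarize_records` is a hypothetical helper name, and the two sample records are made up so the cell runs without the download.

```python
import json
from collections import Counter


def summarize_records(records):
    """Count records, records with a non-empty 'input', and how often each key appears."""
    key_counts = Counter()
    with_input = 0
    for rec in records:
        key_counts.update(rec.keys())
        if rec.get("input"):
            with_input += 1
    return {"total": len(records), "with_input": with_input, "keys": dict(key_counts)}


# Illustrative in-memory sample (not taken from the real dataset).
sample = [
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Name a primary color.", "input": "", "output": "Red"},
]
summary = summarize_records(sample)
print(summary)

# With the real file (assumed to be in the working directory):
# with open("alpaca_data.json") as f:
#     print(summarize_records(json.load(f)))
```

If a key other than the three expected ones shows up in `keys`, or `total` is unexpectedly small, the download is worth re-checking before launching a training job.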

Download s5cmd, which transfers data to S3 faster than `aws s3 cp`.

In [None]:
!curl -L https://github.com/peak/s5cmd/releases/download/v2.0.0/s5cmd_2.0.0_Linux-64bit.tar.gz | tar -xz s5cmd

Use the SageMaker default bucket, or any other S3 bucket you have access to.

In [None]:
import sagemaker
sess = sagemaker.Session()
sagemaker_default_bucket = sess.default_bucket()
print(sagemaker_default_bucket)

## Downloading Model from Hugging Face

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path

local_cache_path = Path("./llama2-model")
local_cache_path.mkdir(exist_ok=True)

model_name = "TheBloke/Llama-2-7B-fp16"  # a third-party Hugging Face model repo

# Download only the config/tokenizer files and the PyTorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.model"]

model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_cache_path,
    allow_patterns=allow_patterns,
)

Find the directory that contains the model artifacts (e.g. config.json, *.bin) and copy its path into the following variables.

In [None]:
snapshot_model_path = 'llama2-model/models--TheBloke--Llama-2-7B-fp16/snapshots/ec92360670debf11267ccece99b8e65a8c723802'  # change to the directory where the model files actually are
s3_destination_path = f's3://{sagemaker_default_bucket}/bloke-llama2-7b-fp16/'  # change to your own S3 path
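
The snapshot hash in the path above can differ between downloads. `snapshot_download` returns the snapshot folder, so `model_download_path` from the download cell should already point at the right directory. As an alternative, the minimal sketch below searches a cache directory for folders that contain a `config.json`. `find_snapshot_dirs` is a hypothetical helper, and the demo directory names are made up to mimic the cache layout.

```python
import tempfile
from pathlib import Path


def find_snapshot_dirs(cache_root):
    """Return the directories under cache_root that directly contain a config.json."""
    root = Path(cache_root)
    return sorted({p.parent for p in root.rglob("config.json")})


# Demo on a throwaway directory that imitates the Hugging Face cache layout.
with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "models--org--name" / "snapshots" / "abc123"
    snap.mkdir(parents=True)
    (snap / "config.json").write_text("{}")
    found = find_snapshot_dirs(tmp)
    found_relative = [p.relative_to(tmp).as_posix() for p in found]

print(found_relative)

# In this notebook one might call: find_snapshot_dirs(local_cache_path)
```

If more than one directory is returned, pick the one whose snapshot matches the revision you intend to train on.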


Copy the model files from the notebook instance to S3, because training instances read the model artifacts from S3, not from this notebook.

In [None]:
!aws s3 cp {snapshot_model_path} {s3_destination_path} --recursive

Delete the local copy to free up storage on the notebook instance.

In [None]:
!rm -rf llama2-model

## Downloading Model from an S3 presigned link

Hugging Face may throttle downloads when many requests arrive within a short time. To avoid this, we use an S3 presigned link, which will be distributed during the workshop.

In [None]:
!wget -O llama2-model.zip "PASTE-THE-S3-PRESIGN-LINK-HERE"
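
If the download fails with a 403 error, the link may have expired. The sketch below assumes the link is a Signature Version 4 presigned URL, which carries its signing time and lifetime in the `X-Amz-Date` and `X-Amz-Expires` query parameters. `presigned_url_expiry` is a hypothetical helper, and the example URL is made up.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def presigned_url_expiry(url):
    """Return the UTC expiry time of a SigV4 presigned URL, or None if it cannot be read."""
    query = parse_qs(urlparse(url).query)
    try:
        signed_at = datetime.strptime(query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
        lifetime_seconds = int(query["X-Amz-Expires"][0])
    except (KeyError, ValueError):
        return None
    return signed_at.replace(tzinfo=timezone.utc) + timedelta(seconds=lifetime_seconds)


# Made-up URL for illustration only; it does not point at a real object.
example_url = (
    "https://example-bucket.s3.amazonaws.com/llama2-model.zip"
    "?X-Amz-Date=20240101T000000Z&X-Amz-Expires=3600&X-Amz-Signature=abc"
)
expiry = presigned_url_expiry(example_url)
is_expired = expiry is not None and expiry < datetime.now(timezone.utc)
print(expiry, is_expired)
```

An expired link cannot be refreshed on the client side; a new one has to be requested from whoever issued it.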

In [None]:
!unzip llama2-model.zip

In [None]:
zip_model_path = 'llama2-model/'  # change to the directory where the unzipped model files are
s3_destination_path = f's3://{sagemaker_default_bucket}/bloke-llama2-7b-fp16/'  # change to your own S3 path

We use s5cmd instead of `aws s3 cp` for faster transfer.

In [None]:
!chmod +x ./s5cmd
!./s5cmd sync {zip_model_path} {s3_destination_path}

In [None]:
!rm -rf llama2-model.zip
!rm -rf llama2-model