## Merge base model and LoRA adapters for Llama 2 13B

Make sure you pick an instance type with enough memory. Llama 2 13B needs about 26 GB of memory to process the merge, because the base model is loaded in FP16.
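The 26 GB figure follows from storing each of the roughly 13 billion parameters as a 2-byte FP16 value. A back-of-the-envelope sketch of that arithmetic is shown below; the helper name `estimate_weight_memory_gb` is hypothetical, and the estimate covers the weights only, not the adapter or any temporary buffers used during the merge.

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed to hold model weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: ~13 billion parameters, 2 bytes each (FP16)
llama2_13b_fp16_gb = estimate_weight_memory_gb(13e9)
print(f"{llama2_13b_fp16_gb:.0f} GB")
```

Treat the result as a lower bound when choosing an instance type and leave some headroom on top of it.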

In [1]:
!pip install -Uq peft==0.4.0
!pip install -Uq bitsandbytes==0.40.2
!pip install -Uq sentencepiece

### > Setup

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

### > Download a LoRA adapter as an example

In [3]:
from huggingface_hub import snapshot_download
lora_adapter_id = "Mikael110/llama-2-13b-guanaco-qlora"
revision = "main"
lora_local_dir = "lora-adapter"

snapshot_download(repo_id=lora_adapter_id,
                  revision=revision,
                  local_dir=lora_local_dir,
                  local_dir_use_symlinks=False)

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

adapter_config.json:   0%|          | 0.00/521 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/964 [00:00<?, ?B/s]

adapter_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

'/home/ec2-user/SageMaker/sagemaker-lora-finetuning/llama-2-13b-qlora-hosting-sagemaker-DLC/lora-adapter'

### Merge the model with LoRA weights

Save the combined model and tokenizer.

In [4]:
model_name = "NousResearch/Llama-2-13b-hf"
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

model = PeftModel.from_pretrained(base_model, lora_local_dir)
model = model.merge_and_unload()
# Note: despite the directory name, the merged weights are saved in FP16
save_dir = "merged-4bit"
model.save_pretrained(save_dir, safe_serialization=True, max_shard_size="2GB")

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

tokenizer.save_pretrained(save_dir)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]


Thrown during validation:
`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.


('merged-4bit/tokenizer_config.json',
 'merged-4bit/special_tokens_map.json',
 'merged-4bit/tokenizer.model',
 'merged-4bit/added_tokens.json',
 'merged-4bit/tokenizer.json')
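Before uploading, it can be useful to confirm that the merged weights were actually written to `save_dir`. The sketch below counts the `.safetensors` shards in a directory and sums their sizes; the helper name `summarize_shards` is hypothetical and not part of `transformers` or `peft`.

```python
from pathlib import Path
from typing import Tuple


def summarize_shards(save_dir: str) -> Tuple[int, int]:
    """Return (number of .safetensors files, their total size in bytes)."""
    shards = sorted(Path(save_dir).glob("*.safetensors"))
    return len(shards), sum(p.stat().st_size for p in shards)


# Only report if the merged directory from the cell above exists
if Path("merged-4bit").is_dir():
    count, total_bytes = summarize_shards("merged-4bit")
    print(f"{count} shards, {total_bytes / 1e9:.1f} GB")
```

For an FP16 13B model the total should be in the neighborhood of the 26 GB mentioned earlier; a much smaller number suggests an incomplete save.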

### > Upload the combined model to S3

In [7]:
import sagemaker
bucket = sagemaker.Session().default_bucket()
prefix = f"{model_name}-qlora/models"
model_data_s3_location = f"s3://{bucket}/{prefix}"
!cd {save_dir} && aws s3 cp --recursive . {model_data_s3_location}

upload: ./config.json to s3://sagemaker-us-west-2-376678947624/NousResearch/Llama-2-13b-hf-qlora/models/config.json
upload: ./model-00001-of-00014.safetensors to s3://sagemaker-us-west-2-376678947624/NousResearch/Llama-2-13b-hf-qlora/models/model-00001-of-00014.safetensors
upload: ./model-00003-of-00014.safetensors to s3://sagemaker-us-west-2-376678947624/NousResearch/Llama-2-13b-hf-qlora/models/model-00003-of-00014.safetensors
upload: ./model-00004-of-00014.safetensors to s3://sagemaker-us-west-2-376678947624/NousResearch/Llama-2-13b-hf-qlora/models/model-00004-of-00014.safetensors
upload: ./model-00006-of-00014.safetensors to s3://sagemaker-us-west-2-376678947624/NousResearch/Llama-2-13b-hf-qlora/models/model-00006-of-00014.safetensors
upload: ./model-00002-of-00014.safetensors to s3://sagemaker-us-west-2-376678947624/NousResearch/Llama-2-13b-hf-qlora/models/model-00002-of-00014.safetensors
upload: ./model-00005-of-00014.safetensors to s3://sagemaker-us-west-2-376678947624/NousResear
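Because `model_name` contains a `/`, the prefix expands into a nested S3 path (`NousResearch/Llama-2-13b-hf-qlora/models`), as the upload log above shows. The sketch below reproduces that string construction and, as an optional alternative, flattens the name into a single path segment. The helper `build_model_s3_uri`, its `flatten` flag, and the bucket name `my-bucket` are hypothetical and only illustrate the behavior.

```python
def build_model_s3_uri(bucket: str, model_name: str, flatten: bool = False) -> str:
    """Build the S3 location for the merged model.

    With flatten=True, any "/" in the model name is replaced by "-" so the
    model lands under a single path segment instead of a nested one.
    """
    name = model_name.replace("/", "-") if flatten else model_name
    return f"s3://{bucket}/{name}-qlora/models"


# "my-bucket" is a placeholder, not the notebook's real bucket
print(build_model_s3_uri("my-bucket", "NousResearch/Llama-2-13b-hf"))
print(build_model_s3_uri("my-bucket", "NousResearch/Llama-2-13b-hf", flatten=True))
```

Either layout works for hosting; the nested form is what the notebook produces as written.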

### > Store the parameters in the environment for downstream processes

In [8]:
%store model_data_s3_location
%store model_name

Stored 'model_data_s3_location' (str)
Stored 'model_name' (str)
