## Fine-tune Llama-2-7b using QLoRA on Google Colab

Welcome to this Google Colab notebook, which shows how to fine-tune the recent Llama-2-7b model on a single Google Colab instance and turn it into a chatbot.

We will use the PEFT library from the Hugging Face ecosystem, together with QLoRA, for more memory-efficient fine-tuning.
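
To see why 4-bit quantization matters on a single Colab GPU, the sketch below estimates the memory taken by the model weights alone. It is a rough back-of-the-envelope calculation, not a measurement: the parameter count is assumed to be exactly 7 billion, `weight_memory_gib` is a hypothetical helper, and activations, the KV cache, quantization constants, and adapter weights are all ignored.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # assumption: "7b" taken as exactly 7 billion parameters

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # half-precision weights
int4_gib = weight_memory_gib(N_PARAMS, 4)   # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions the half-precision weights alone would take up most of a 15 GiB GPU, while the 4-bit weights need about a quarter of that, which leaves headroom for training or inference.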

In [None]:
# Log in with a Hugging Face access token
from huggingface_hub import notebook_login

notebook_login()

In [None]:
!nvidia-smi

Sun May 26 23:06:18 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P8              10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
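import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

PEFT_MODEL = "jasonweber99/Llama3-8b-qlora-test"

# 4-bit quantization settings. bnb_config is not defined in this section; the values
# below are assumed, typical QLoRA settings, not necessarily those used in training.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the quantized base model named in the adapter config
config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

# Attach the fine-tuned LoRA adapter weights to the base model
model = PeftModel.from_pretrained(model, PEFT_MODEL)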

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Experiment with the generation parameters below.

In [None]:
# Generation settings; try changing these values
generation_config = model.generation_config
generation_config.max_new_tokens = 50

# Uncomment to set a sampling temperature as well.
# Note: temperature and top_p only take effect when sampling is enabled (do_sample=True).
# generation_config.temperature = 0.3
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id
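
To build intuition for what `top_p = 0.7` does, here is a toy sketch of nucleus (top-p) filtering over a made-up next-token distribution. It illustrates the idea only and is not the `transformers` implementation; `top_p_filter` and the probabilities in `toy_probs` are hypothetical.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities to sum to 1."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    # Walk through tokens from most to least likely
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Hypothetical next-token distribution, for illustration only
toy_probs = {"the": 0.5, "a": 0.25, "an": 0.15, "this": 0.1}
filtered = top_p_filter(toy_probs, top_p=0.7)
print(filtered)
```

With this distribution only "the" and "a" survive, because together they already cover 0.75 of the probability mass. A lower `top_p` therefore restricts sampling to fewer, more likely tokens.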

In [None]:
# device configuration
DEVICE = "cuda:0"

In [None]:

%%time
prompt = """
: How can I create an account?
:
""".strip()

encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

: How can I create an account?
: You can create an account on our website by clicking on the 'Sign In' or 'Create Account' button and following the prompts. During the registration process, you will be asked to provide personal information such as your name, email address, and password
CPU times: user 4.8 s, sys: 277 ms, total: 5.07 s
Wall time: 5.17 s


In [None]:
# Helper function to generate a response for a question
def generate_response(question: str) -> str:
    """Wrap the question in the prompt format, generate, and return the decoded text."""
    prompt = f"""
: {question}
:
""".strip()
    encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config,
        )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # find() returns the first ":" in the decoded text, which is the prompt's own
    # leading marker, so the returned string still contains the echoed question.
    assistant_start = ":"
    response_start = response.find(assistant_start)
    return response[response_start + len(assistant_start) :].strip()
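
Because the decoded output starts with the prompt itself, searching for the first `":"` leaves the echoed question in the result. The sketch below shows one possible alternative that strips the prompt prefix instead. `strip_prompt` is a hypothetical helper, and the `response` string is a shortened, hand-written stand-in for real model output.

```python
def strip_prompt(response: str, prompt: str) -> str:
    """Return only the text that follows the prompt (hypothetical helper)."""
    if response.startswith(prompt):
        return response[len(prompt):].strip()
    # Fallback: the decoded text did not reproduce the prompt verbatim
    return response.strip()


prompt = ": How can I create an account?\n:"
# Stand-in for tokenizer.decode(...) output: the prompt followed by the answer
response = prompt + " You can create an account on our website."

# Searching for the first ":" keeps the echoed question
first_colon = response[response.find(":") + 1 :].strip()
print(repr(first_colon))

# Stripping the prompt prefix keeps only the answer
print(repr(strip_prompt(response, prompt)))
```

String matching assumes the tokenizer decodes the prompt back to exactly the same text, which is not guaranteed. A more robust variant is to slice off the prompt tokens before decoding, for example `outputs[0][encoding.input_ids.shape[-1]:]`, since `generate` returns the prompt tokens followed by the new ones for decoder-only models.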

In [None]:
# prompt
prompt = "Question: Can I return a product if it was a clearance or final sale item?"
print(generate_response(prompt))

Question: Can I return a product if it was a clearance or final sale item?
: Clearance or final sale items are typically non-returnable and non-refundable. Please review the product description or contact our customer support team for specific return policies.


In [None]:
# prompt
prompt = "Question: What happens when I return a clearance item?"
print(generate_response(prompt))

Question: What happens when I return a clearance item?
: If you return a clearance item, you will receive a refund for the amount paid, minus any applicable discounts. The final refund amount will depend on the specific terms and conditions of the clearance sale. Please refer to the product description or contact our customer support


In [None]:
# prompt
prompt = "Question: How do I know when I'll receive my order?"
print(generate_response(prompt))

Question: How do I know when I'll receive my order?
: You will receive an email with tracking information once your order has shipped. The email will include the estimated delivery date.


In [None]:
################ Falcon with Llama 2
# https://github.com/curiousily/Get-Things-Done-with-Prompt-Engineering-and-LangChain/blob/master/07.falcon-qlora-fine-tuning.ipynb