# Alpaca (LLaMA) LoRA 4-bit quantized weights
This is an inference-only notebook for trying out my Alpaca-trained LoRA adapter (4-bit).

The adapter took a little over 5 days to train on a single Titan RTX (24G) GPU.

Notes:
* If you have more than one GPU, you might need to restrict the notebook to one of them with `CUDA_VISIBLE_DEVICES=<gpu_to_use>`.
* I used python==3.8; I'm a little hazy, but I think I ran into a snag when trying 3.9 (my usual).
* I think the training did not handle the EOS token properly, so the model doesn't like to stop and runs on a bit. I was already halfway through training when I figured this out.
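
For the first note, the variable can also be set from inside the notebook instead of the shell. This is a minimal sketch; `restrict_to_gpu` is a hypothetical helper name, and it assumes the cell runs before `torch` is imported, since CUDA reads the variable when it initializes.

```python
import os


def restrict_to_gpu(gpu_index: int) -> str:
    """Expose a single GPU to CUDA (hypothetical helper, not part of this repo).

    Call this before importing torch; once CUDA has been initialized,
    changing the variable has no effect on the running process.
    """
    value = str(gpu_index)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Example: make only the first GPU visible to this process.
restrict_to_gpu(0)
```

With this in place, the selected card shows up as device 0 inside the process, which matches the `device_map={'': 0}` used when the adapter is loaded.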

I also included my training script (beware the faulty EOS handling) and a pieced-together generation script in this repo. This notebook is my attempt at cleaning things up for others to use.
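
For anyone retraining, the usual way to teach the model to stop is to end every training example with the EOS token. The sketch below is not taken from the training script in this repo; `build_training_text` is a hypothetical name, and the EOS string is assumed to be `</s>` (check `tokenizer.eos_token` for your checkpoint).

```python
# Assumption: the LLaMA tokenizer's end-of-sequence token string is "</s>".
EOS_TOKEN = "</s>"


def build_training_text(prompt: str, response: str, eos_token: str = EOS_TOKEN) -> str:
    """Join prompt and response and terminate the example with EOS.

    Hypothetical helper: with the EOS token as the final target, the model
    gets a training signal for where a response should end.
    """
    return f"{prompt}{response}{eos_token}"


example = build_training_text("### Response:\n", "Paris is the capital of France.")
```

When working with token ids rather than strings, the same idea would be to append `tokenizer.eos_token_id` to each tokenized example.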



In [None]:
#GPU memory usage for inference on a 24G Titan RTX
#21763MiB / 24576MiB

In [1]:
!pip install -Uqq torch
!pip install -Uqq accelerate
!pip install -Uqq bitsandbytes
!pip install -Uqq git+https://github.com/huggingface/transformers.git
#!pip install -Uqq git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit
#!pip install -Uqq git+https://github.com/sterlind/peft.git
!pip install -Uqq sentencepiece

In [None]:
# The model file format is changing. The checkpoints I have are
# pre-"v2" and the newer code is not backwards compatible,
# so we need to install these pinned commits.
!pip install -Uqq git+https://github.com/sterlind/GPTQ-for-LLaMa.git@d9e903072b507e3d01ced58ccc221641abe14c93
!pip install -Uqq git+https://github.com/sterlind/peft.git@ee2ddee858dc1983d5590d939505e60896aa6789

In [None]:
# Despite the repo name, this contains the 7b, 13b, 30b and 65b 4-bit quantized LLaMA model weights.
# This takes a while: 29G of files in total.
!git lfs clone https://huggingface.co/maderix/llama-65b-4bit llama_4bit_quantized

In [1]:
import time
import torch
from peft import PeftModel
import autograd_4bit
from autograd_4bit import load_llama_model_4bit_low_ram, Autograd4bitQuantLinear
from peft.tuners.lora import Linear4bitLt


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
CUDA SETUP: CUDA runtime path found: /home/jr/anaconda3/envs/alpaca_lora_30b_4bit_2/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /home/jr/anaconda3/envs/alpaca_lora_30b_4bit_2/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...


In [2]:
def generate_prompt(instruction, input=None):
    p1 = "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"\
        "### Instruction:\n"\
        f"{instruction}\n\n"
    p2 = "### Input:\n"\
        f"{input}\n\n"
    p3 = "### Response:\n"

    # join the parts
    return p1 + (p2 if input else "") + p3

In [3]:
def load_model_llama(*args, **kwargs):
    
    # quantized (int4) llama base model
    model_path = './llama_4bit_quantized/llama30b-4bit.pt'
    
    # llama base model configuration
    config_path = 'decapoda-research/llama-30b-hf'
    
    # trained lora adapter
    lora_path = 'johnrobinsn/alpaca-llama-30b-4bit'    

    print("Loading {} ...".format(model_path))
    t0 = time.time()
    
    model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path)
    
    model = PeftModel.from_pretrained(model, lora_path, device_map={'': 0}, torch_dtype=torch.float32)
    print('{} Lora Applied.'.format(lora_path))
    
    print('Apply auto switch and half')
    for n, m in model.named_modules():
        if isinstance(m, Autograd4bitQuantLinear) or isinstance(m, Linear4bitLt):
            m.zeros = m.zeros.half()
            m.scales = m.scales.half()
            m.bias = m.bias.half()
    autograd_4bit.use_new = True
    autograd_4bit.auto_switch = True
    
    return model, tokenizer

In [4]:
model,tokenizer = load_model_llama()

print('Fitting 4bit scales and zeros to half')
for n, m in model.named_modules():
    if '4bit' in str(type(m)):
        m.zeros = m.zeros.half()
        m.scales = m.scales.half()

_ = model.eval()

Loading ./llama_4bit_quantized/llama30b-4bit.pt ...
Loading Model ...


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.


Loaded the model in 11.95 seconds.
johnrobinsn/alpaca-llama-30b-4bit Lora Applied.
Apply auto switch and half
Fitting 4bit scales and zeros to half


In [5]:
def evaluate(
        model, 
        tokenizer,
        instruction,
        input=None,
        temperature=0.1,
        top_p=0.75,
        top_k=40,
        num_beams=4,
        max_new_tokens=150,
        **kwargs,
):
    prompt = generate_prompt(
        instruction, 
        input if input != "" else None
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to("cuda")

    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            repetition_penalty=1.1,
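            # Note: without do_sample=True, generate() runs beam search here
            # and ignores temperature, top_p and top_k.
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,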
            num_beams=num_beams,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
            early_stopping=True,         
        )

    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    return output.split("### Response:")[1].strip()
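
Because the adapter tends to run past its natural stopping point, the returned text can be cleaned up after generation. The following is a rough sketch of one way to do that, not something this notebook does; `trim_runaway` is a hypothetical helper. It cuts the text at the first repeated sentence and drops a trailing fragment that has no closing punctuation.

```python
import re


def trim_runaway(text: str) -> str:
    """Cut generated text at the first repeated sentence (hypothetical helper)."""
    # If the model starts a new prompt section, drop everything from there on.
    text = text.split("### ")[0]
    # Split into sentences, keeping terminating punctuation and whitespace.
    # A trailing fragment without ., ! or ? is not matched and is dropped.
    sentences = re.findall(r"[^.!?]+[.!?]+\s*", text)
    if not sentences:
        # Nothing that looks like a sentence; return the text unchanged.
        return text.strip()
    seen = set()
    kept = []
    for sentence in sentences:
        key = sentence.strip().lower()
        if key in seen:
            break  # the model has started looping
        seen.add(key)
        kept.append(sentence)
    return "".join(kept).strip()
```

It could be applied to the string returned by `evaluate`, for example `trim_runaway(evaluate(model, tokenizer, instruction))`. Note that it is a blunt heuristic: a legitimately repeated line (such as a poem refrain) would also end the output.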

In [6]:
def generate(instruction,input=None):  
    output = evaluate(model,tokenizer,
        instruction,input)

    print(f"Instruction: {instruction}\n")
    if (input): print(f"Input: {input}\n")
    print(f"Output: {output}\n")

In [7]:
generate("Write a poem about a cat.")

Instruction: Write a poem about a cat.

Output: A furry feline with a tail so long,
Who purrs and purrs and purrs all day long.
With whiskers so long and eyes so bright,
This kitty is a delight.

The paws so soft and the fur so fine,
This cat is a joy to behold.
With a meow so sweet and a purr so loud,
This kitty is a real crowd-pleaser.

So if you're looking for a cuddly friend,
This cat is the one you should send.
With a purr so loud and a meow so sweet,
This kitty is the perfect pet.



In [15]:
generate('Who was George Washington? and did he cut down a cherry tree?')

Instruction: Who was George Washington? and did he cut down a cherry tree?

Output: George Washington was the first President of the United States. He was born on February 22, 1732 in Westmoreland County, Virginia. He served as the Commander-in-Chief of the Continental Army during the American Revolutionary War and was unanimously elected as the first President of the United States in 1789. 

The story of George Washington chopping down a cherry tree is a legend. It is said that when he was a young boy, he chopped down a cherry tree with his father's hatchet. When his father asked him who did it, he replied, "I cannot tell a lie, I did it with my hatchet



In [16]:
# example with an instruction + input
generate('Identify the odd one out', 'Twitter, Instagram, Telegram')

Instruction: Identify the odd one out

Input: Twitter, Instagram, Telegram

Output: Telegram is the odd one out because it is not a social media platform. Twitter and Instagram are both social media platforms. Telegram is a messaging app. Therefore, Telegram is the odd one out. Twitter and Instagram are both social media platforms. Telegram is a messaging app. Therefore, Telegram is the odd one out. Twitter and Instagram are both social media platforms. Telegram is a messaging app. Therefore, Telegram is the odd one out. Twitter and Instagram are both social media platforms. Telegram is a messaging app. Therefore, Telegram is the odd one out. Twitter and Instagram are both social media platforms. Telegram is a messaging app. Therefore, Tele

