<a href="https://colab.research.google.com/github/kaiu85/llm-workshop/blob/main/Transformers/05_Open_Source_Conversation_Model_Playground.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Large Language Models at your fingertips

This notebook demonstrates how to download the trained weights of a (quite) large language model to this local Google Colab instance, and how to encapsulate the model interface in a small function, __alpaca_talk__. This function takes a text string as a prompt, the trained model, and a matching tokenizer as inputs, and returns a string that contains both the input and the model's predicted continuation. There are no specific tasks here, but feel free to use this function together with code from other notebooks or your private projects. It literally places the power of a large language model, running on local hardware, at your fingertips.

# Model background

This notebook uses a small version (7 billion trainable parameters) of the LLaMA family of models, which was trained by the FAIR team at Meta on publicly available datasets (in contrast to other large language models, whose proprietary training data is still a matter of concern and public debate). The code below additionally loads the `timdettmers/guanaco-7b` adapter weights on top of this base model. Please take some time to have a proper look at the corresponding Hugging Face [model card](https://huggingface.co/decapoda-research/llama-7b-hf), which also provides some basic information on potential biases, risks, and harms.

For further information on this model family, feel free to also have a look at the corresponding [publication page](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/).

# Reminder: Using a large language model as a coding resource

You can also go with the flow and ask one of the many available large language models to help you, e.g., by copying some code into the model's prompt and asking it to find errors and/or improve your code. Here you can also experiment with different ways of **prompting**, i.e., asking or instructing your model. Asking the model to first think through a problem sequentially before providing the final answer often improves its performance substantially on more complex reasoning tasks (similar to asking a human to first think through a problem carefully before giving a definite answer). One very impressive model in this regard is the one by [Perplexity AI](https://www.perplexity.ai/).
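As a minimal sketch of the "think first, then answer" idea, the snippet below builds two variants of the same question in the `### Human:` / `### Assistant:` format used later in this notebook. The helper name `build_prompt`, the flag `think_step_by_step`, and the exact wording of the added instruction are assumptions made for illustration; other phrasings may work better or worse.

```python
# Hypothetical helper for illustration; not part of any library.
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def build_prompt(question, think_step_by_step=False):
    """Return a single-turn prompt string in the Human/Assistant chat format."""
    question = question.strip()
    if think_step_by_step:
        # Assumed wording: ask for the reasoning before the final answer
        question += " First think through the problem step by step, then state the final answer."
    return f"{SYSTEM_PROMPT}\n### Human: {question}\n### Assistant: "

direct_prompt = build_prompt("What is 17 * 23?")
stepwise_prompt = build_prompt("What is 17 * 23?", think_step_by_step=True)
print(stepwise_prompt)
```

Both strings can be passed as the `text` argument of `alpaca_talk` to compare the answers side by side.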

In [None]:
# Install the latest bitsandbytes release, and transformers, peft, and accelerate from source
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
# Other requirements for the demo
!pip install gradio
!pip install sentencepiece

In [None]:
# Load the model.
# Note: It can take a while to download LLaMA and add the adapter modules.
# The 13B model can also be used if it is loaded in 4-bit precision
# (see the commented-out load_in_4bit option below).

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer

model_name = "decapoda-research/llama-7b-hf"
adapters_name = 'timdettmers/guanaco-7b'

print(f"Starting to load the model {model_name} into memory")

m = AutoModelForCausalLM.from_pretrained(
    model_name,
    #load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map={"": 0}
)
m = PeftModel.from_pretrained(m, adapters_name)
m = m.merge_and_unload()
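tok = LlamaTokenizer.from_pretrained(model_name)
# Explicitly set the beginning-of-sequence token id used by LLaMA
tok.bos_token_id = 1

# Not used by alpaca_talk below; kept for experiments with custom stopping criteria
stop_token_ids = [0]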

print(f"Successfully loaded the model {model_name} into memory")
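The remark about loading the 13B model in 4-bit precision can be made concrete with a rough back-of-the-envelope estimate. The sketch below only counts the memory for the weights, assuming nominal parameter counts of 7 and 13 billion; it ignores activations, the attention cache, and any quantization overhead, so real usage will be higher. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

# Nominal parameter counts (assumption); 16 bits corresponds to bfloat16
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    for bits in (16, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions, the 13B weights in 4-bit precision need less memory than the 7B weights in bfloat16, which is why the larger model can fit on the same GPU.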

In [None]:
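from transformers import GenerationConfig

def alpaca_talk(text, model, tokenizer):
    """Generate a continuation of `text` and return prompt plus continuation as one string."""
    # Tokenize the prompt and move the token ids to the device the model lives on
    inputs = tokenizer(
        text,
        return_tensors="pt",
    )
    input_ids = inputs["input_ids"].to(model.device)

    generation_config = GenerationConfig(
        do_sample=True,  # temperature and top_p only take effect when sampling is enabled
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.2,
    )
    print("Generating...")
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=1256,
    )
    # Decode every returned sequence (prompt tokens included) into a single string
    output_string = ''
    for s in generation_output.sequences:
        output_string += tokenizer.decode(s)

    return output_string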

In [None]:
input_text = """A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
### Human: List mountainbike brands, which are headquartered in California, USA.
### Assistant: """

output_text = alpaca_talk(input_text, m, tok)

print(output_text)

In [None]:
new_input_text = output_text + '''
### Human: Please check your previous answer carefully, fix any errors, and remove all entries from the list, which are not headquartered in CA!
### Assistant: '''

new_output_text = alpaca_talk(new_input_text, m, tok)

print(new_output_text)
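Because `alpaca_talk` returns the prompt together with the continuation, it can be handy to isolate only the newest reply. The following is a minimal sketch under the assumption that the prompt ends with an open `### Assistant:` turn, as in the cells above; the helper name `extract_reply` is hypothetical. It also cuts the reply at the next `### Human:` tag, in case the model keeps writing further turns on its own, and removes a decoded `</s>` marker. The demo strings are hand-written stand-ins, not real model output.

```python
def extract_reply(prompt, full_text, human_tag="### Human:", assistant_tag="### Assistant:"):
    """Return only the reply to the last assistant turn that was opened in `prompt`."""
    n_turns = prompt.count(assistant_tag)
    if n_turns == 0:
        return ""
    # Locate the same assistant turn in the generated text
    position = -1
    for _ in range(n_turns):
        position = full_text.find(assistant_tag, position + 1)
        if position == -1:
            return ""
    reply = full_text[position + len(assistant_tag):]
    # Drop any follow-up turn the model may have written on behalf of the human
    reply = reply.split(human_tag, 1)[0]
    # Remove the end-of-sequence marker if it was decoded into the text
    return reply.replace("</s>", "").strip()

# Hand-written example strings (not produced by the model)
demo_prompt = "### Human: Name a primary color.\n### Assistant: "
demo_output = (
    "<s> ### Human: Name a primary color.\n### Assistant: Red.\n"
    "### Human: Another one?\n### Assistant: Blue.</s>"
)
print(extract_reply(demo_prompt, demo_output))  # prints "Red."
```

With the variables from the cells above, `extract_reply(new_input_text, new_output_text)` would return only the model's revised list.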