In [1]:
# Setup: Required Libraries

# Run the following code in Google Colab to install the necessary libraries:

!pip install bitsandbytes datasets loralib sentencepiece transformers --quiet
!pip install git+https://github.com/huggingface/peft.git --quiet

# These libraries provide the data processing, language modeling, and Parameter-Efficient Fine-Tuning (PEFT) functionality used in this project.

# After executing the above code, you can import and use the installed libraries in your project.


  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [2]:
# Standard library imports
import textwrap

# Third-party imports
from peft import PeftModel
import torch
from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


## Loading the LLaMA Tokenizer and Fine-Tuned Alpaca-LoRA Model

There are various options available for selecting the base and fine-tuned models. Here are some examples (not an exhaustive list) that you can choose from:

- For `base_model`, you can consider the following:
    - [decapoda-research/llama-7b-hf](https://huggingface.co/decapoda-research/llama-7b-hf)
    - [decapoda-research/llama-13b-hf](https://huggingface.co/decapoda-research/llama-13b-hf)
    - [decapoda-research/llama-30b-hf](https://huggingface.co/decapoda-research/llama-30b-hf)

- For `finetuned_model`, you can choose from:
    - [tloen/alpaca-lora-7b](https://huggingface.co/tloen/alpaca-lora-7b)
    - [chansung/gpt4-alpaca-lora-7b](https://huggingface.co/chansung/gpt4-alpaca-lora-7b)
    - [chansung/alpaca-lora-13b](https://huggingface.co/chansung/alpaca-lora-13b)

*Note: The size of the model you can run depends on the available RAM and GPU memory.*

In this section, we will use [tloen/alpaca-lora-7b](https://huggingface.co/tloen/alpaca-lora-7b) by Eric J. Wang. Once the desired model is selected, we proceed with setting up the tokenizer and model objects as follows:

- The tokenizer is created using `LlamaTokenizer` from the latest `transformers` library and loaded with the LLaMA tokenizer checkpoint from the selected `base_model`.
- The `model` is created using `LlamaForCausalLM` from the latest `transformers` library and loaded with the `base_model` checkpoint. The `load_in_8bit` parameter is set to `True`, which loads the weights in 8-bit precision and roughly halves memory usage compared with 16-bit weights, typically with little loss in quality. This is particularly useful when GPU memory is limited. The `device_map` parameter is set to `"auto"` so that the model's layers are placed automatically on the available devices (GPU first, then CPU if needed).
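
To make the memory saving from 8-bit loading concrete, here is a rough back-of-the-envelope estimate. It is only a sketch: the helper `approx_weight_memory_gib` is a hypothetical name, the parameter count is an assumed round figure of 7 billion, and the estimate covers the weights alone, not activations or framework overhead.

```python
def approx_weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Rough size of the model weights alone, in GiB (hypothetical helper)."""
    return n_params * bytes_per_param / 1024**3


n_params = 7e9  # assumed round figure for a 7B-parameter checkpoint

fp16_gib = approx_weight_memory_gib(n_params, 2)  # 16-bit weights: 2 bytes each
int8_gib = approx_weight_memory_gib(n_params, 1)  # 8-bit weights: 1 byte each

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f" 8-bit weights: ~{int8_gib:.1f} GiB")
```

Actual usage at runtime will be higher than these figures, since generation also needs memory for activations and the key/value cache.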


In [3]:
# Choose which model to run
base_model = "decapoda-research/llama-7b-hf"  # @param ["decapoda-research/llama-7b-hf", "decapoda-research/llama-13b-hf", "decapoda-research/llama-30b-hf"]
finetuned_model = "tloen/alpaca-lora-7b"  # @param ["tloen/alpaca-lora-7b", "chansung/alpaca-lora-13b", "chansung/gpt4-alpaca-lora-7b"]

# Load tokenizer, base model, and fine-tuned model
tokenizer = LlamaTokenizer.from_pretrained(base_model)
base_model = LlamaForCausalLM.from_pretrained(base_model, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(base_model, finetuned_model)


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.


Loading checkpoint shards:   0%|          | 0/33 [00:00<?, ?it/s]

In [4]:
def generate_prompt(instruction, input=None):
    """
    Generate a prompt for a given instruction and optional input.

    Args:
        instruction (str): The main instruction for the prompt.
        input (str, optional): Additional input that provides context for the task. Defaults to None.

    Returns:
        str: A prompt that includes the instruction, input (if provided), and a space for the response.
    """
    # Build the prompt from explicit lines so the function's indentation does not leak into the template.
    if input:
        return (
            "Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input}\n\n"
            "### Response:"
        )
    else:
        return (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            "### Response:"
        )


def alpaca_chat(context=None, temperature=0.7, top_p=0.95, repetition_penalty=1.2, max_new_tokens=512, width=100):
    """
    Prompt the user for an instruction and print the response generated by the fine-tuned Alpaca-LoRA model.

    Args:
        context (str): Optional. Additional context passed to the prompt as the input field. Default is None.
        temperature (float): Optional. Scales the token distribution before sampling; higher values make the output more random. Default is 0.7.
        top_p (float): Optional. Nucleus sampling threshold: sampling is restricted to the smallest set of most likely tokens whose cumulative probability reaches this value. Default is 0.95.
        repetition_penalty (float): Optional. Penalizes tokens that have already appeared in the sequence; values above 1.0 discourage repetition. Default is 1.2.
        max_new_tokens (int): Optional. The maximum number of new tokens generated for each response. Default is 512.
        width (int): Optional. The maximum number of characters per line when wrapping the printed response. Default is 100.

    Example usage:
    # Generate a response using a prompt and additional context
    alpaca_chat(context="I love to play video games")
    """
    input_prompt = input("Prompt: ")
    print("-" * 100)
    prompt = generate_prompt(input_prompt, context)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_config = GenerationConfig(
        do_sample=True,  # temperature and top_p only take effect when sampling is enabled
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repetition_penalty
    )
    print("Response:\n")
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=max_new_tokens
    )
    for s in generation_output.sequences:
        output = tokenizer.decode(s)
        print(
            textwrap.fill(
                output.split("### Response:")[1].strip(),
                width=width
            )
        )
    print("-" * 100)
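
The `temperature` and `top_p` arguments of `alpaca_chat` can be illustrated on a toy distribution. The following is a minimal pure-Python sketch, not the implementation used by `transformers`: the helper names `softmax_with_temperature` and `top_p_filter` and the four example logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalise over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary

sharp = softmax_with_temperature(logits, temperature=0.7)
flat = softmax_with_temperature(logits, temperature=1.5)
nucleus = top_p_filter(sharp, top_p=0.95)

print("temperature=0.7:", [round(p, 3) for p in sharp])
print("temperature=1.5:", [round(p, 3) for p in flat])
print("tokens kept by top_p=0.95:", sorted(nucleus))
```

With the lower temperature, more probability mass concentrates on the top token, and the nucleus filter then drops the least likely token before sampling.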


In [5]:
alpaca_chat()

Prompt: how can i use chatgpt
----------------------------------------------------------------------------------------------------
Response:

You can use ChatGPT to generate natural language dialogue between two or more participants in a
conversation.
----------------------------------------------------------------------------------------------------


In [6]:
alpaca_chat()

Prompt: What is the relativityTheory?
----------------------------------------------------------------------------------------------------
Response:

The Relativity Theory is a branch of physics which studies how space and time are affected by
gravity, mass and energy. It attempts to explain why objects move in curved paths around each other
instead of moving straight lines through one another as Newton's Laws would predict.
----------------------------------------------------------------------------------------------------
