In [12]:
!pip install transformers
!pip install accelerate
!pip install -U bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch



# Load model and tokenizer

### Using LLaMA for This Notebook

1. **Apply for model access**: Visit [LLaMA-2-7B-Chat on Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) to request access to the model. Please note that it may take a few days for your application to be approved. Once approved, you will see the following message on the website:

    > **Gated model** You have been granted access to this model
   
2. **Create your Hugging Face Access Key**: Go to your [Hugging Face settings](https://huggingface.co/settings/tokens) to create an access token. When creating the token, ensure you check the box:

    - `Read access to contents of all public gated repos you can access` under **Permissions**.

3. **Provide your Hugging Face Access Key**: Once you have your access token, paste it below to authenticate the notebook with Hugging Face.

In [13]:
hf_access_key = "hf_..."  # paste your own access token here; never share or commit a real token
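A token pasted into a notebook cell is easy to leak when the notebook is shared. As a minimal sketch of one alternative, the helper below reads the token from an environment variable. It assumes the variable is named `HF_TOKEN`; the helper name `read_hf_token` is illustrative and not part of any library.

```python
import os

def read_hf_token(env_var="HF_TOKEN"):
    """Return the access token stored in the given environment variable.

    `read_hf_token` is a hypothetical helper written for this notebook.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token.")
    return token

# Usage (uncomment once the variable is set):
# hf_access_key = read_hf_token()
```

With this in place, the notebook can be shared without the token appearing in any cell.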

### Quantization

The free Colab tier provides a GPU with roughly **16 GB of memory**. Loaded in full precision (float32), the 7B-parameter LLaMA model needs about 28 GB for its weights alone, so it does not fit. With `device_map="auto"`, Accelerate then offloads the layers that do not fit to CPU memory or disk and moves them to the GPU on demand, which makes inference **extremely slow**.

To address this limitation, we **quantize** the model to 4 bits using the `bitsandbytes` integration in Transformers. This cuts the memory needed for the weights to roughly 4 GB, so the entire model fits on the GPU and no offloading is required.
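The memory argument above can be checked with a little arithmetic. The sketch below estimates the size of the weights alone at different precisions. It assumes "7B" means exactly 7e9 parameters and ignores activations, the KV cache, and quantization overhead, so the numbers are rough lower bounds.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: "7B" is treated as exactly 7e9 parameters for a rough estimate.
NUM_PARAMS = 7e9

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

Only the 4-bit and 8-bit variants leave comfortable headroom on a GPU of this size. float16 fits on paper but leaves little room for anything else.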


In [14]:
# Create a BitsAndBytesConfig for 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,       # Also quantize the quantization constants to save more memory
    bnb_4bit_quant_type='nf4',            # 4-bit NormalFloat (NF4) data type
    bnb_4bit_compute_dtype=torch.float16  # Set compute dtype to float16 for faster inference
)

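Now, load the model and tokenizer. If this cell raises an `ImportError` asking for Accelerate or `bitsandbytes` right after the `pip install` cell, restarting the runtime and rerunning the notebook usually resolves it.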

In [15]:
model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    token=hf_access_key,
    quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, token=hf_access_key)

ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes`

Let’s check the GPU information and verify that all parts of the model are loaded onto the GPU.

In [5]:
# Check GPU info
if torch.cuda.is_available():
    # Get the current GPU memory usage in MB
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1024 ** 2
    reserved_memory = torch.cuda.memory_reserved(0) / 1024 ** 2
    allocated_memory = torch.cuda.memory_allocated(0) / 1024 ** 2
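    # Memory PyTorch has reserved but not yet handed out to tensors. This is not the
    # free memory of the whole device (torch.cuda.mem_get_info reports that).
    free_memory = reserved_memory - allocated_memory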
    print(f"Total GPU memory: {total_memory:.2f} MB")
    print(f"Reserved GPU memory: {reserved_memory:.2f} MB")
    print(f"Allocated GPU memory: {allocated_memory:.2f} MB")
    print(f"Free GPU memory: {free_memory:.2f} MB")
else:
    print("No GPU available.")

# Check the device of each parameter of the model
for name, param in model.named_parameters():
    print(f"{name} is on device: {param.device}")

Total GPU memory: 15102.06 MB
Reserved GPU memory: 3996.00 MB
Allocated GPU memory: 3734.77 MB
Free GPU memory: 261.23 MB
model.embed_tokens.weight is on device: cuda:0
model.layers.0.self_attn.q_proj.weight is on device: cuda:0
model.layers.0.self_attn.k_proj.weight is on device: cuda:0
model.layers.0.self_attn.v_proj.weight is on device: cuda:0
model.layers.0.self_attn.o_proj.weight is on device: cuda:0
model.layers.0.mlp.gate_proj.weight is on device: cuda:0
model.layers.0.mlp.up_proj.weight is on device: cuda:0
model.layers.0.mlp.down_proj.weight is on device: cuda:0
model.layers.0.input_layernorm.weight is on device: cuda:0
model.layers.0.post_attention_layernorm.weight is on device: cuda:0
model.layers.1.self_attn.q_proj.weight is on device: cuda:0
model.layers.1.self_attn.k_proj.weight is on device: cuda:0
model.layers.1.self_attn.v_proj.weight is on device: cuda:0
model.layers.1.self_attn.o_proj.weight is on device: cuda:0
model.layers.1.mlp.gate_proj.weight is on device: cuda:

Let's try the model.

In [10]:
# Get the token ID for "\n". The Llama tokenizer prepends a BOS token,
# so skip special tokens and take the last ID rather than the first.
newline_token_id = tokenizer.encode("\n", add_special_tokens=False)[-1]

# Generate chat response
def generate_response(prompt, max_length=30, temperature=0.2, num_beams=1):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs,  # passes input_ids and attention_mask
                             eos_token_id=newline_token_id,  # stop generation if "\n" occurs
                             max_length=max_length,  # counts prompt tokens too
                             do_sample=True,  # temperature only takes effect when sampling
                             temperature=temperature,
                             num_beams=num_beams)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Try it out
generate_response("Hi mate!")

"Hi mate! I'm just an AI, I don't have personal experiences, but I can help you with any questions or tasks"

### Key parameters in `model.generate()`



#### **Length Control**
- **`max_length`**: The maximum total length of the sequence in tokens, counting both the prompt and the generated tokens. Generation stops once this limit is reached. Use `max_new_tokens` to limit only the newly generated tokens.
- **`min_length`**: The minimum total length of the sequence in tokens, again counting the prompt. The EOS token is suppressed until this length is reached, which keeps the output from stopping too early.
- **`eos_token_id`**: The ID of the end-of-sequence (EOS) token. The generation will stop once the model generates this token, marking the end of the sequence.
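The two stopping rules above can be sketched with a toy loop. This is not how `model.generate()` is implemented. `toy_generate` and `count_up` are hypothetical names, and the "model" is a stand-in function that maps the current token IDs to the next one.

```python
def toy_generate(prompt_ids, next_token_fn, max_length, eos_token_id=None):
    """Toy decoding loop that mirrors the max_length and EOS stopping rules."""
    ids = list(prompt_ids)
    while len(ids) < max_length:       # max_length counts prompt tokens too
        token = next_token_fn(ids)
        ids.append(token)
        if token == eos_token_id:      # stop as soon as EOS is produced
            break
    return ids

def count_up(ids):
    """Hypothetical "model": always emits the previous token ID plus one."""
    return ids[-1] + 1

print(toy_generate([1, 2, 3], count_up, max_length=6))                   # [1, 2, 3, 4, 5, 6]
print(toy_generate([1, 2, 3], count_up, max_length=10, eos_token_id=5))  # [1, 2, 3, 4, 5]
```

In the first call, a three-token prompt leaves room for only three new tokens. In the second, the EOS token ends generation well before the length limit.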

#### **Diversity and Quality**
- **`temperature`**: Controls the randomness of sampling by rescaling the logits before the softmax. Values below 1.0 (e.g., 0.2 or 0.7) sharpen the distribution and make the output more deterministic, 1.0 leaves it unchanged, and values above 1.0 flatten it and make the output more diverse. It only takes effect when `do_sample=True`.
- **`top_k`**: Limits sampling of the next token to the `k` most likely tokens. A higher value allows for more variety in the generated text, while a lower value makes it more deterministic.
- **`top_p` (nucleus sampling)**: Limits sampling to the smallest set of most likely tokens whose cumulative probability reaches `p`. For example, `top_p=0.9` keeps the tokens covering the top 90% of the probability mass, promoting diverse but controlled generation.
- **`do_sample`**: Enables random sampling of tokens instead of greedy decoding (which selects the highest-probability token). This is essential for generating diverse outputs, and `temperature`, `top_k`, and `top_p` have no effect without it.
- **`num_beams`**: The number of beams for beam search. Higher values explore more candidate sequences and often find higher-probability outputs, at the cost of increased computation. `num_beams=1` disables beam search.

![beam](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/beam_search.png)
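To make the sampling parameters concrete, the sketch below applies temperature, top-k, and top-p filtering to a made-up four-token distribution in plain Python. It illustrates the ideas only; the helper names are hypothetical and the library's own implementation differs in detail.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
for t in (0.2, 1.0, 2.0):
    print(f"temperature={t}:", [round(q, 3) for q in softmax(logits, t)])
print("top_k=2:  ", [round(q, 3) for q in top_k_filter(softmax(logits), 2)])
print("top_p=0.8:", [round(q, 3) for q in top_p_filter(softmax(logits), 0.8)])
```

A low temperature concentrates almost all probability on the top token, while both filters zero out the unlikely tail before a token is sampled.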

#### **Repetition and Token Constraints**
- **`repetition_penalty`**: Penalizes repeated tokens, discouraging the model from generating repetitive sequences. A value greater than 1.0 reduces the likelihood of repeating the same token.
- **`no_repeat_ngram_size`**: Prevents repetition of n-grams of a specified size. For example, `no_repeat_ngram_size=3` ensures that trigrams (3-grams) do not repeat in the generated output.
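Both constraints can be sketched on plain Python lists. The helpers below are illustrative (their names are hypothetical) and operate on a single sequence of token IDs. The penalty divides positive logits and multiplies negative ones, so the score of a repeated token always moves down.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the logits of tokens that were already generated (penalty > 1.0)."""
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def banned_next_tokens(generated_ids, n):
    """Tokens that would complete an n-gram already present in the sequence."""
    if n < 1 or len(generated_ids) < n - 1:
        return set()
    prefix = tuple(generated_ids[len(generated_ids) - (n - 1):])
    banned = set()
    for i in range(len(generated_ids) - n + 1):
        if tuple(generated_ids[i:i + n - 1]) == prefix:
            banned.add(generated_ids[i + n - 1])
    return banned

print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0))
print(banned_next_tokens([5, 6, 7, 5, 6], n=3))
```

In the second call, the sequence ends in `5, 6`, and `5, 6, 7` has already occurred, so token `7` is banned as the next token.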

#### **Output Control**
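- **`num_return_sequences`**: The number of different sequences to generate. For example, `num_return_sequences=3` generates three separate outputs from the same prompt. Without sampling, this requires beam search with `num_beams` at least as large as `num_return_sequences`.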


In [7]:
# # Chat loop with history
# print("Start chatting! Type 'exit' to end the conversation.\n")

# # Initialize conversation history
# conversation_history = ""

# while True:
#     # Get input from the user
#     user_input = input("You: ")

#     # Exit the chat loop
#     if user_input.lower() == "exit":
#         print("Ending chat. Goodbye!")
#         break

#     # Append the user input to the conversation history
#     conversation_history += f"You: {user_input}\n"

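#     # Generate response using the updated conversation history.
#     # Note: generate_response returns the prompt plus the continuation, and its
#     # default max_length=30 counts prompt tokens, so pass a larger max_length
#     # as the history grows.
#     response = generate_response(conversation_history)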

#     # Append the model's response to the conversation history
#     conversation_history += f"Bot: {response}\n"

#     # Print the model's response
#     print(f"Bot: {response}\n")
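The history handling in the loop above can be factored into small pure functions that are easy to test without a GPU. The sketch below is one possible arrangement, not the notebook's method: `build_prompt`, `extract_reply`, `chat_turn`, and `fake_generate` are hypothetical names, and `fake_generate` stands in for `generate_response`, which returns the prompt followed by the continuation.

```python
def build_prompt(history, user_input):
    """Format the turns so far plus the new user message, ending with the bot cue."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"You: {user_input}")
    lines.append("Bot:")
    return "\n".join(lines)

def extract_reply(full_text, prompt):
    """Drop the echoed prompt, if present, and keep only the first line of the reply."""
    reply = full_text[len(prompt):] if full_text.startswith(prompt) else full_text
    reply = reply.strip()
    return reply.splitlines()[0] if reply else ""

def chat_turn(history, user_input, generate_fn):
    """Run one turn: build the prompt, generate, and record both messages."""
    prompt = build_prompt(history, user_input)
    reply = extract_reply(generate_fn(prompt), prompt)
    history.append(("You", user_input))
    history.append(("Bot", reply))
    return reply

def fake_generate(prompt):
    """Stand-in for generate_response so the sketch runs without a model."""
    return prompt + " Hello there!\nYou: ..."

history = []
print(chat_turn(history, "Hi mate!", fake_generate))
print(build_prompt(history, "How are you?"))
```

Storing only the extracted reply keeps the history from growing by a full copy of the prompt on every turn.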