# Unlocking Llama 3: Your Ultimate Guide to Mastering Llama 3!

[Link](https://medium.com/pythoneers/unlocking-llama-3-your-ultimate-guide-to-mastering-the-llama-3-77712d1a0915)

## What is Llama 3

Llama 3 is a Large Language Model (LLM) from Meta built on a decoder-only transformer architecture.

## Experimenting with Llama 3: Google Colab and HuggingFace

### Step 1: Enable Llama 3 access on HuggingFace

Go to the [Meta-Llama-3-8B model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B/) and request access.

### Step 2: Get the HuggingFace access token to access the model [here](https://huggingface.co/settings/tokens)
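
Once you have a token, one way to make it available to the notebook is sketched below. It assumes the token is stored under the name `HF_TOKEN`, either as a Colab secret or as an environment variable; the helper names `resolve_hf_token` and `login_if_possible` are hypothetical and only for illustration.

```python
import os


def resolve_hf_token(name="HF_TOKEN"):
    """Return the token from Colab secrets or the environment, else None."""
    try:
        # google.colab only exists inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            token = userdata.get(name)
            if token:
                return token
        except Exception:
            # Secret missing or notebook access not granted: fall through.
            pass

    return os.environ.get(name) or None


def login_if_possible():
    """Log in to the Hugging Face Hub when a token is available."""
    token = resolve_hf_token()
    if token is None:
        return False
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_if_possible()` in a cell returns `False` when no token is found, so it is safe to run before the secret has been set.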

### Step 3: Change runtime to T4 GPU

A runtime is a Google-provisioned virtual machine (VM) that can run the code in your notebook (IPYNB file).
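
To confirm that the runtime really has a GPU attached, a quick check along these lines can help. This is a minimal sketch that assumes the `nvidia-smi` tool is on the path when an NVIDIA GPU is present; `gpu_summary` is a hypothetical helper name.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one line per visible GPU (name, total memory), or None."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


print(gpu_summary() or "No GPU detected: check Runtime > Change runtime type")
```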

### Step 4: Install dependencies

You need to reinstall the dependencies whenever you start a new runtime, because a fresh VM does not keep previously installed packages.

In [1]:
!pip install -U "transformers==4.40.0"
!pip install accelerate bitsandbytes

Collecting transformers==4.40.0
  Downloading transformers-4.40.0-py3-none-any.whl.metadata (137 kB)
Downloading transformers-4.40.0-py3-none-any.whl (9.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.44.2
    Uninstalling transformers-4.44.2:
      Successfully uninstalled transformers-4.44.2
Successfully installed transformers-4.40.0
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.3-py3-none-manylinu
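
After the install finishes, it can be useful to confirm which versions actually ended up in the environment. The sketch below uses only the standard library; `installed_versions` is a hypothetical helper name, and the package list is an assumption based on the libraries used in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("transformers", "accelerate", "bitsandbytes", "torch")):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```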

### Step 5: Download and Load the Model

Download the Llama 3 model weights and set up the text-generation pipeline.

In [2]:
import transformers
import torch

"""
pipeline: https://huggingface.co/docs/transformers/en/main_classes/pipelines

Pipelines are an easy way to use HuggingFace models for inference.

They are objects that abstract away complex library code and offer a simple API dedicated to specific tasks.

The general pipeline() abstraction is a wrapper around all the other available pipelines.
- A wrapper is a design pattern: it wraps another piece of code to add behaviour on top of the wrapped code.

Arguments used below:
- task (str): The task defining which pipeline will be returned.
- model (str): The model that the pipeline will use to make predictions. This can be a model identifier.
- model_kwargs (Dict[str, Any]): Additional keyword arguments passed along to the model’s from_pretrained(..., **model_kwargs) function.
  https://huggingface.co/docs/transformers/en/main_classes/model

Returns the text-generation pipeline.

More on pipelines:

https://huggingface.co/docs/transformers/v4.44.2/en/pipeline_tutorial

Text-Generation Pipeline: https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/pipelines#transformers.TextGenerationPipeline
Its special method __call__ receives text_inputs (among other arguments)
and returns the generated text.

Using llama instruct:
- https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
- https://github.com/meta-llama/llama-recipes?tab=readme-ov-file
- https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
- https://medium.com/@renjuhere/llama-3-first-look-c19d99b4933b

Quantization reduces the hardware requirements by loading the model weights at lower precision.
Instead of 16 bits (float16), the weights are loaded in 4 bits,
which reduces memory usage from ~20GB to ~8GB.
https://medium.com/@manuelescobar-dev/implementing-and-running-llama-3-with-hugging-faces-transformers-library-40e9754d8c80
"""

# Load a 4-bit (bitsandbytes) build of Llama 3 8B Instruct
model_id = "unsloth/llama-3-8b-Instruct-bnb-4bit"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True,
    },
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.26k [00:00<?, ?B/s]

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/220 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/51.1k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
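
The effect of quantization described in the cell above can be illustrated with a back-of-the-envelope calculation. This sketch assumes roughly 8 billion parameters and counts only the weights; activations, the KV cache, and framework overhead are ignored, so real memory usage will be higher than these figures.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory needed for the weights alone, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: "8B" taken as a round 8 billion

for label, bits in [("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Going from 16 bits to 4 bits per weight cuts the weight storage to a quarter, which is what makes an 8B model fit comfortably on a T4.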


### Step 6: Send Queries to the Model for Inference

**HuggingFace Chat Templates**

https://huggingface.co/docs/transformers/main/en/chat_templating

https://huggingface.co/docs/transformers/en/chat_templating

Different models expect different input formats for chat.

Chat templates are part of the tokenizer. They specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects.

Chat templates handle the details of formatting for you, allowing you to write universal code that works for any model.

Simply build a list of messages, each with `role` and `content` keys, and pass it to the [apply_chat_template()](https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template) method. The output is ready to feed to the model. When using chat templates as input for model generation, it is also a good idea to set `add_generation_prompt=True`, which tells the template to append a [generation prompt](https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts), i.e. the tokens that mark the start of a bot response.
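
To make concrete what a chat template does, here is a hand-written approximation of the Llama 3 format. It is only a sketch for illustration, modelled on the raw output printed at the end of this notebook; `format_llama3_chat` is a hypothetical helper, and in real code you should rely on `apply_chat_template()` rather than building the string yourself.

```python
def format_llama3_chat(messages, add_generation_prompt=True):
    """Approximate the Llama 3 chat format for a list of role/content dicts."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: a role header, a blank line, the content, an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a helpful assistant!"},
    {"role": "user", "content": "Hey how are you doing today?"},
]
print(format_llama3_chat(example))
```

With `add_generation_prompt=False` the string ends after the user's `<|eot_id|>`, so the model has no cue that an assistant turn should start.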

In [3]:
messages = [
    {"role": "system", "content": "You are a helpful assistant!"},
    {"role": "user", "content": """Hey how are you doing today?"""},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,  # return a formatted string instead of token IDs
    add_generation_prompt=True,
)

# Stop on the tokenizer's EOS token or on Llama 3's end-of-turn token
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

"""
Arguments to the pipeline's __call__:

text_inputs: the prompt. When chats are passed, the model’s chat template is used to format them before they are passed to the model.

https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/pipelines#transformers.TextGenerationPipeline

The pipeline call is flexible and accepts many additional keyword arguments.

Arguments that are not listed in the main documentation for the text-generation pipeline
(for example max_new_tokens, eos_token_id, do_sample, temperature, top_p)
are forwarded to the model's generate() method.

return_text defaults to True

"""
outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,  # generation stops when any of these tokens is produced
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

"""

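The pipeline returns a list of dictionaries, one per generated sequence.

Each dictionary contains:

generated_text (str, present when return_text=True): The generated text.
"""
# The generated text starts with the prompt, so slice it off and print only the reply
print(outputs[0]["generated_text"][len(prompt):])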

I'm doing great, thanks for asking! I'm a helpful assistant, so I'm always ready to assist you with any questions or tasks you may have. How about you? How's your day going so far?
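
The call above passes `do_sample=True`, `temperature=0.6` and `top_p=0.9`. As a rough intuition for what those settings do, here is a simplified pure-Python illustration of temperature scaling followed by nucleus (top-p) sampling over a toy list of logits. It is a sketch of the general technique, not the code that `transformers` runs; `sample_top_p` is a hypothetical name.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample a token index using temperature scaling and top-p filtering."""
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [value / total for value in exps]

    # Keep the smallest set of most likely tokens whose probabilities sum to >= top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens; random.choices renormalises the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


print(sample_top_p([2.0, 1.0, 0.1]))
```

Because sampling is random, running the generation cell again can produce a different reply; setting `do_sample=False` would make decoding greedy instead.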


In [5]:
print("View the Output")
print(outputs)

View the Output
[{'generated_text': "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant!<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHey how are you doing today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI'm doing great, thanks for asking! I'm a helpful assistant, so I'm always ready to assist you with any questions or tasks you may have. How about you? How's your day going so far?"}]
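
As the raw output shows, `generated_text` contains the whole formatted prompt followed by the reply. Besides slicing with `len(prompt)`, the text-generation pipeline also accepts `return_full_text=False` to return only the newly generated text. When all you have is the raw string, a small helper like the one below can recover the reply; `extract_reply` is a hypothetical name and the approach assumes the Llama 3 header format seen above.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"


def extract_reply(generated_text):
    """Return the text after the last assistant header, without the end-of-turn token."""
    reply = generated_text.rpartition(ASSISTANT_HEADER)[2]
    return reply.removesuffix("<|eot_id|>").strip()


raw = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Hey how are you doing today?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "I'm doing great, thanks for asking!"
)
print(extract_reply(raw))
```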
