<a href="https://colab.research.google.com/github/jman4162/LLM-Tutorials/blob/main/Getting_Started_with_Llama_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with Llama 3

Name: John Hodge

Date: 04/30/24

# Introduction

Welcome to the "Getting Started with Llama 3" Jupyter notebook! This notebook is designed as a practical guide to help you understand and utilize the Llama 3 model, a powerful transformer-based model available on the Hugging Face platform. Whether you are a beginner or have some experience with machine learning models, this notebook will provide valuable insights into setting up and running text generation tasks using Llama 3.

This notebook is structured to walk you through several key steps:

1. **Initializing the Environment**: Set up your coding environment by installing and importing the libraries needed to run the model.
2. **Model Definition**: Learn how to define and configure the Llama 3 model for text generation, including setting it up to run on a GPU for faster computation.
3. **Creating and Utilizing a Pipeline**: Create a text generation pipeline and walk through the parameters and configurations that control the model's behavior and output quality.
4. **Practical Examples**: See a worked example of text generation that shows the quality and variety of output Llama 3 can produce.

This notebook aims to give you the knowledge and tools to start experimenting with Llama 3 in your own projects or research. Let's dive in and explore the capabilities of this model!

Source: https://huggingface.co/meta-llama

## Initialize environment and import necessary libraries

In [1]:
!pip install -U "transformers==4.40.0" torch

Collecting transformers==4.40.0
  Using cached transformers-4.40.0-py3-none-any.whl (9.0 MB)
Collecting torch
  Downloading torch-2.3.0-cp310-cp310-manylinux1_x86_64.whl (779.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 779.1/779.1 MB 1.7 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.

In [2]:
import transformers
import torch

In [3]:
# Check if GPU is available
if torch.cuda.is_available():
  # Get the device name
  device_name = torch.cuda.get_device_name(0)
  # Print the GPU type
  print(f"GPU type: {device_name}")
else:
  # Print a message if GPU is not available
  print("GPU is not available.")

GPU type: NVIDIA A100-SXM4-40GB


## Define model

In [4]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

## Create pipeline

The code cell below initializes a text generation pipeline using the `transformers` library, which is widely used for working with pre-trained transformer models such as BERT and GPT. Here's a breakdown of what each part of the call does:

1. **`transformers.pipeline()`**: This function creates a pipeline for a specific task, which in this case is `"text-generation"`. Pipelines simplify the process of using pre-trained models for specific tasks. By specifying `"text-generation"`, it sets up the necessary components for generating text using a specified model.

2. **`model=model_id`**: This argument specifies the model to be used for the pipeline. The `model_id` should be a string that identifies a pre-trained model available in the Hugging Face model hub or a path to a model locally stored. This model will be used to generate text.

3. **`model_kwargs={"torch_dtype": torch.bfloat16}`**: This is a dictionary of keyword arguments that are passed to the model during its instantiation. In this case, it sets the `torch_dtype` to `torch.bfloat16`. The dtype `torch.bfloat16` refers to a 16-bit floating point representation that uses less memory than standard 32-bit floating point (float32), which can lead to faster computation and lower memory usage, particularly on compatible GPUs.

4. **`device="cuda"`**: This argument places the model on a CUDA-compatible GPU. GPU acceleration greatly improves performance for compute-intensive tasks like text generation. If no CUDA-compatible GPU is available, the call fails with this setting; change it to `"cpu"` to run the pipeline on the central processing unit instead, at a much lower speed.

Overall, this pipeline is configured for generating text using a specific transformer model on a GPU, with reduced precision to optimize for performance and resource use.
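To make the memory argument for `torch.bfloat16` concrete, here is a rough back-of-the-envelope sketch. It assumes nominal parameter counts read off the model names (8 billion and 70 billion), which are approximations, and it counts only the weights; activations and the key-value cache need additional memory. The shard sizes are the ones reported in this notebook's download output.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: nominal parameter counts taken from the model names.
NOMINAL_PARAMS = {"8B": 8e9, "70B": 70e9}

for name, num_params in NOMINAL_PARAMS.items():
    for dtype in BYTES_PER_PARAM:
        size = weight_memory_gb(num_params, dtype)
        print(f"{name} in {dtype}: ~{size:.0f} GB")

# Shard sizes (GB) as reported in this notebook's download output.
shard_sizes_gb = [4.98, 5.00, 4.92, 1.17]
total_download_gb = sum(shard_sizes_gb)
print(f"Downloaded 8B checkpoint: ~{total_download_gb:.2f} GB")
```

Under these assumptions the 8B weights need roughly 16 GB in bfloat16, which matches the size of the downloaded shards and fits on the 40 GB A100 detected earlier. The 70B variant would need roughly 140 GB for its weights alone, so it would not fit on a single GPU of that size without further techniques.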

In [5]:
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/51.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Write input text for prompt

In [6]:
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

## Create prompt

The next cell prepares a `prompt` string for the text generation model by formatting the chat messages in the layout the model was trained on. Let's break down the method and its parameters:

1. **pipeline.tokenizer.apply_chat_template**:
   - `pipeline` is the text generation pipeline created above. It holds a `tokenizer`, the component that converts text into the numerical tokens the model understands.
   - `apply_chat_template` is a standard method on Hugging Face `transformers` tokenizers. It formats a list of chat messages using the chat template stored with the model's tokenizer, so the prompt matches the structure the model expects for conversations.

2. **messages**:
   - This is the conversation to format: a list of dictionaries, each with a `role` (here `system` or `user`) and a `content` string. It can hold a single message or a full conversation history.

3. **tokenize=False**:
   - This parameter controls whether the method returns token IDs or text. With `tokenize=False`, it returns the formatted prompt as a plain string; tokenization happens later, when the pipeline processes the prompt.

4. **add_generation_prompt=True**:
   - This parameter tells the method to append the tokens that open an assistant turn to the end of the formatted conversation. For Llama 3 this is the assistant header, `<|start_header_id|>assistant<|end_header_id|>` followed by a blank line. It cues the model to write the assistant's reply rather than continue the user's message.

Overall, this line of code is preparing the `prompt` by structuring the input `messages` according to a specific template required by the model, without tokenizing the messages yet but adding a cue for generation. This prepared prompt will then be used to generate a response in a conversational model setup.
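To show what the formatted prompt looks like, here is a hand-written approximation of the Llama 3 Instruct chat layout. The function `render_llama3_prompt` is a hypothetical helper written for illustration and is not part of `transformers`; the real template ships with the tokenizer and may differ in details, so treat this as a sketch and prefer `apply_chat_template` in practice.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Hand-written approximation of the Llama 3 Instruct chat format."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: a role header, a blank line, the content, an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply next.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

sketch_prompt = render_llama3_prompt(example_messages)
print(sketch_prompt)
```

Printing `prompt` from the actual notebook cell and comparing it with this sketch is a quick way to check what the tokenizer's own template produces.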

In [7]:
prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)

## Create terminators

The code snippet defines a list named `terminators` that includes special token IDs related to the tokenizer used in the `pipeline`. Here's a detailed explanation of each line:

1. **`pipeline.tokenizer.eos_token_id`**:
   - This line accesses the tokenizer associated with the previously defined `pipeline`.
   - `eos_token_id` stands for "end-of-sequence token ID." It is a specific integer that the tokenizer uses to signify the end of a text sequence. This is common in text generation tasks to indicate when the model should consider the output to be complete.
   - Including this ID in the `terminators` list means that when this token is generated, it can be used to determine that the model has finished generating text, serving as a stopping condition.

2. **`pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")`**:
   - This line calls the tokenizer's `convert_tokens_to_ids` method, which converts a token string (or a list of token strings) to the corresponding numeric ID(s).
   - Here it looks up `<|eot_id|>`, the special end-of-turn token that Llama 3 Instruct models emit when they finish a message in a conversation.
   - Including it as a second terminator matters for chat use: the model normally ends its reply with `<|eot_id|>` rather than the end-of-sequence token, so without it generation could run on past the end of the assistant's turn.

These token IDs mark the points at which text generation should stop. The list is passed as `eos_token_id` in the generation call, so generation halts as soon as the model emits any of these IDs, rather than continuing until the token limit is reached.
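The stopping rule can be sketched in plain Python. This illustrates the idea only and is not how `transformers` implements it: `token_stream` is a hypothetical stand-in for the model's output, 128001 is the `eos_token_id` value reported in this notebook's generation log, and 128009 is assumed here to be the ID of `<|eot_id|>`.

```python
EOS_TOKEN_ID = 128001  # eos_token_id as reported in this notebook's generation log
EOT_TOKEN_ID = 128009  # assumption: the ID that <|eot_id|> maps to


def generate_until_terminator(token_stream, terminators, max_new_tokens):
    """Collect tokens until a terminator appears or the token budget is spent."""
    stop_ids = set(terminators)
    generated = []
    for token_id in token_stream:
        if len(generated) >= max_new_tokens:
            break
        generated.append(token_id)
        if token_id in stop_ids:
            # Any ID in the list ends generation, not only the first one.
            break
    return generated


example_terminators = [EOS_TOKEN_ID, EOT_TOKEN_ID]
fake_stream = [10, 11, EOT_TOKEN_ID, 12, 13]  # made-up token IDs
print(generate_until_terminator(fake_stream, example_terminators, max_new_tokens=256))
```

In this sketch the terminator itself is kept in the result, and tokens after it are never consumed.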

In [8]:
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

## Generate outputs

The next cell calls the `pipeline` object with the prompt and several generation parameters to produce the model's reply. Let's break down each parameter:

1. **prompt**:
   - This is the input text or query that you provide to the model. The model uses this prompt as a starting point to generate output.

2. **max_new_tokens=256**:
   - This parameter specifies the maximum number of new tokens (words or pieces of words) that the model can generate, not counting the prompt. Setting it to 256 caps the reply at 256 tokens; generation may stop earlier if a terminator token is produced.

3. **eos_token_id=terminators**:
   - `eos_token_id` stands for "end of sequence token ID." This parameter specifies the token IDs that signal the model to stop generating. It accepts a single ID or a list; here `terminators` holds both the tokenizer's end-of-sequence ID and the ID of `<|eot_id|>`, so generation stops when either one is produced.

4. **do_sample=True**:
   - This parameter, when set to `True`, enables probabilistic sampling of the output tokens based on the model's predictions. This means the model will randomly pick the next token based on the probability distribution provided by the model, which makes the output more varied and less deterministic.

5. **temperature=0.6**:
   - The `temperature` parameter controls the randomness of sampling by rescaling the model's scores before they are turned into probabilities. Values below 1 sharpen the distribution, making the output more predictable and sometimes repetitive; values above 1 flatten it, making the output more diverse and less predictable. Setting it to 0.6 favors coherence while leaving room for some variety.

6. **top_p=0.9**:
   - `top_p`, also known as nucleus sampling, is another parameter that controls the randomness of text generation. It limits the model's choices to the smallest set of most likely tokens whose cumulative probability reaches the threshold of 0.9. The unlikely tokens that make up the remaining 10% of the probability mass are filtered out, reducing the risk of generating irrelevant or nonsensical content.

Together, these parameters configure how the `pipeline` call generates text, giving control over length, randomness, and stopping conditions.
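To see how `temperature` and `top_p` interact, here is a minimal pure-Python sketch of one sampling step over a toy set of scores. The function names and the example logits are made up for illustration, and the library's own implementation differs in detail; the sketch only shows the technique.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize within that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_next_token(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling and nucleus filtering."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    ids = list(nucleus)
    weights = [nucleus[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -5.0]  # made-up scores for four candidate tokens
print(softmax_with_temperature(toy_logits, 1.0))
print(softmax_with_temperature(toy_logits, 0.6))
print(sample_next_token(toy_logits, rng=random.Random(0)))
```

Comparing the two printed distributions shows the lower temperature concentrating probability on the top-scoring token, and the nucleus filter then removes the long tail before a token is drawn.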

In [9]:
outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


## Print results

In [10]:
print(outputs[0]["generated_text"][len(prompt):])

Arrrr, me hearty! Me name be Captain Chat, the scurviest chatbot to ever sail the Seven Seas o' conversation! Me and me trusty crew o' code be here to swab the decks o' yer questions and serve ye a bounty o' answers, savvy? So hoist the sails and set course fer a swashbucklin' good time, me hearty!
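The slice `[len(prompt):]` in the print cell is there because, by default, the pipeline's `generated_text` contains the prompt followed by the completion. A small sketch with a stand-in result (the strings below are made up and no model is called) shows the effect:

```python
# Stand-in for the pipeline's return value: a list with one dict per generated sequence.
example_prompt = "PROMPT: Who are you?\n"
example_outputs = [{"generated_text": example_prompt + "Arrrr, me hearty!"}]

full_text = example_outputs[0]["generated_text"]
reply = full_text[len(example_prompt):]  # drop the echoed prompt, keep the completion
print(reply)
```

As an alternative, the text generation pipeline accepts a `return_full_text=False` argument, which should return only the newly generated text and make the slice unnecessary.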
