<a href="https://colab.research.google.com/github/jman4162/LLM-Tutorials/blob/main/Harnessing_Llama_2_with_PyTorch_A_Tutorial_for_Leveraging_LLMs_in_Your_Applications.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Harnessing Llama 2 with PyTorch: A Comprehensive Tutorial for Leveraging LLMs in Your Applications

Name: John Hodge

Date: 04/29/2024

# Introduction to Leveraging Llama 2 with PyTorch

Welcome to this tutorial on using the Llama 2 Large Language Model (LLM) with PyTorch. In the evolving landscape of natural language processing (NLP), LLMs have emerged as powerful tools for a variety of applications, from text generation to semantic analysis. Llama 2, Meta's family of openly available pre-trained models, is a practical choice for these tasks because its weights can be downloaded and run locally through the Hugging Face `transformers` library.

In this tutorial, we will guide you through the essential steps to integrate Llama 2 into your PyTorch projects. We'll begin with setting up your Python environment, ensuring you have all the necessary libraries and dependencies installed. Following this, we will dive into the specifics of importing and utilizing the Llama 2 model within the PyTorch framework, demonstrating its capabilities through practical examples.

Whether you're a seasoned developer or new to using LLMs, this tutorial aims to provide you with a solid foundation to exploit the full potential of Llama 2 in your projects. Let's get started with setting up our environment to ensure everything is in place for a seamless experience.

# Environment Setup

First, you need to set up your Python environment and install the necessary packages. You can use a virtual environment if you prefer:

In [1]:
!pip install torch transformers

Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch)
  Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
Collecting nvidia-curand-cu12==10.3.2.106 (from torch)
  Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)

# Import Libraries

This section imports `torch` and the classes we need from the `transformers` package, which provides pre-trained models and tooling for natural language processing tasks.

## Explanation:

- **AutoTokenizer**: This component preprocesses text so the model can consume it, converting text into token IDs, the integers the model operates on. Given a model name, `AutoTokenizer` automatically loads the tokenizer class and vocabulary that match that pre-trained checkpoint.

- **LlamaForCausalLM**: This import is the model class for the Llama model configured for causal language modeling (also known as autoregressive text generation). It is designed to predict the probability distribution of the next token in a sequence, given the tokens that preceded it. This capability is essential for tasks that involve generating text, such as continuing a given text segment.
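
To make "predict the next token given the preceding ones" concrete, here is a minimal sketch that uses a hand-written table of scores in place of a real network. `TOY_LOGITS`, `softmax`, and `greedy_generate` are illustrative names invented for this example and are not part of `transformers`; a real model also conditions on the whole prefix, not just the last token.

```python
import math

# Hypothetical toy "model": maps the previous token to raw scores (logits)
# for each candidate next token. A real LLM conditions on the whole prefix.
TOY_LOGITS = {
    "the": {"cat": 2.0, "dog": 1.0, "<eos>": 0.1},
    "cat": {"sat": 2.5, "ran": 0.5, "<eos>": 0.2},
    "dog": {"ran": 2.0, "sat": 0.3, "<eos>": 0.2},
    "sat": {"<eos>": 3.0, "the": 0.5},
    "ran": {"<eos>": 3.0, "the": 0.5},
}


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def greedy_generate(prompt, max_length=10):
    """Repeatedly append the most probable next token (greedy decoding)."""
    tokens = list(prompt)
    while len(tokens) < max_length:
        probs = softmax(TOY_LOGITS[tokens[-1]])
        next_token = max(probs, key=probs.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


print(greedy_generate(["the"]))  # ['the', 'cat', 'sat']
```

The loop has the same shape as autoregressive generation in a causal language model: score the candidates, pick one, append it, and repeat until an end-of-sequence token or the length limit is reached.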

## Importance:

By importing these two components, you are setting up the foundational tools needed to perform a wide range of natural language processing tasks using the Llama model within the PyTorch environment. This step is essential for preparing the system to process and generate text, leveraging the powerful pre-trained Llama model for your specific needs.

Import the required libraries in your Python script or Jupyter notebook:

In [2]:
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

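## Import Hugging Face Token

The Llama 2 checkpoints on the Hugging Face Hub are gated, so downloading them requires an access token from an account that has been granted access. The code below retrieves a secret named `HF_TOKEN` from Google Colab's secrets storage and assigns it to the `token` variable for use later in the notebook.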

In [3]:
from google.colab import userdata
token = userdata.get('HF_TOKEN')
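
Outside Colab, `google.colab` cannot be imported, so the cell above fails. One possible workaround, sketched here under the assumption that the token is also exported as an `HF_TOKEN` environment variable, is to fall back to the environment. `get_hf_token` is a hypothetical helper name, not a library function.

```python
import os


def get_hf_token(name="HF_TOKEN"):
    """Return the access token from Colab secrets, else from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to env.
        return os.environ.get(name)


token = get_hf_token()
print("Token found" if token else "No token configured")
```

Keeping the token out of the notebook source, whether in Colab secrets or in an environment variable, avoids committing it by accident.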

Check for GPU and print GPU model if present.

In [12]:
# Report the GPU model if CUDA is available
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    print(f"GPU type: {device_name}")
else:
    print("GPU is not available.")

GPU type: NVIDIA A100-SXM4-40GB


# Load the Model and Tokenizer

Before we can begin working with the Llama 2 model, we need to load the appropriate model and its corresponding tokenizer. The model variant you choose may depend on several factors, including the computational resources available and the specific requirements of your application. Llama 2 comes in three sizes (7B, 13B, and 70B parameters), each available as a base model and as a chat-tuned model. Smaller variants are faster and need less memory; larger ones are more capable but computationally expensive.

## Choosing a Model Variant

For this tutorial, we use the smallest variant, the 7B base model (`meta-llama/Llama-2-7b-hf`). This choice lets us demonstrate the model's capabilities without extensive computational resources. Smaller models are particularly useful during development and testing, where speed and resource efficiency matter most. Keep in mind, however, that larger models may be necessary for higher accuracy on complex NLP tasks.
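
As a rough guide to what "computational resources" means here, the memory needed just to hold the weights is the parameter count times the bytes per parameter. The sketch below uses nominal parameter counts (7, 13, and 70 billion) rather than exact ones, and `weight_memory_gib` is an illustrative helper name.

```python
# Nominal parameter counts for the three Llama 2 sizes (7B, 13B, 70B).
PARAMS = {"7B": 7e9, "13B": 13e9, "70B": 70e9}
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params, dtype="float16"):
    """Approximate memory for the weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for name, count in PARAMS.items():
    fp32 = weight_memory_gib(count, "float32")
    fp16 = weight_memory_gib(count, "float16")
    print(f"Llama 2 {name}: ~{fp32:.0f} GiB in float32, ~{fp16:.0f} GiB in float16")
```

These figures cover the weights alone; activations and the attention cache used during generation add to them, so treat the numbers as a lower bound.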

## Loading the Model

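To load the model, we will use the `transformers` library, which provides a straightforward API for working with Llama 2 and other LLMs. The first call downloads the weights from the Hugging Face Hub, which can take several minutes. Here's how you can load the model: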

In [4]:
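# Pass the access token explicitly, since the Llama 2 repositories are gated
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", token=token)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token=token)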
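
The cell above loads the weights with default settings, which keeps the model on the CPU. If a GPU is available, one option is to load in half precision and move the model there. The following is a sketch under assumptions, not the notebook's original method: `pick_device_and_dtype` and `load_llama` are hypothetical helper names, and it assumes a `transformers` version whose `from_pretrained` accepts the `torch_dtype` and `token` arguments.

```python
def pick_device_and_dtype(cuda_available):
    """Half precision on GPU, full precision on CPU."""
    if cuda_available:
        return "cuda", "float16"
    return "cpu", "float32"


def load_llama(model_id="meta-llama/Llama-2-7b-hf", token=None):
    """Load the tokenizer and model, placing the model on the GPU if present."""
    # Imported here so the helper above can be used without these packages.
    import torch
    from transformers import AutoTokenizer, LlamaForCausalLM

    device, dtype_name = pick_device_and_dtype(torch.cuda.is_available())
    tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
    model = LlamaForCausalLM.from_pretrained(
        model_id,
        torch_dtype=getattr(torch, dtype_name),
        token=token,
    ).to(device)
    return tokenizer, model


# Usage (downloads the full set of weights on first call):
# tokenizer, model = load_llama(token=token)
# inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
# generate_ids = model.generate(**inputs, max_new_tokens=20)
```

When the model lives on the GPU, the tokenized inputs must be moved to the same device before calling `generate`, as the commented usage shows.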

# Prepare the Input

Prepare the text you want the model to respond to. This involves encoding the text to a format that the model can understand:

In [5]:
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")

In [6]:
# Generate text
generate_ids = model.generate(inputs.input_ids, max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

"Hey, are you conscious? Can you talk to me?\nI'm sorry, I'm not allowed to talk to you.\n"

This section of code is used for generating text using the Llama 2 model within the PyTorch framework. Here's a breakdown of what each line of code does:

1. **Model Generation**: The `generate` call on the loaded model (`model`) takes the tokenized input (`inputs.input_ids`) and returns a sequence of token IDs that begins with the prompt and continues with the model's prediction. For a decoder-only model such as Llama 2, `max_length=30` caps the total sequence length, prompt tokens included, so the number of newly generated tokens is at most 30 minus the prompt length. To limit only the generated portion, `generate` also accepts `max_new_tokens`.

2. **Decoding the Output**: The second line of code uses the `tokenizer` to convert the generated token IDs (`generate_ids`) back into human-readable text. The `batch_decode` function is called with `skip_special_tokens=True` to remove any special tokens (like padding or end-of-sequence tokens) from the output, ensuring that only meaningful text is presented. The parameter `clean_up_tokenization_spaces=False` is set to retain the original spacing of the tokens as they were generated by the model, which can be important for maintaining the intended formatting and readability of the output.

3. **Retrieving the Text**: The `[0]` at the end of the second line extracts the first (and typically, only) item from the list returned by `batch_decode`, which contains the generated text. This is useful when generating a single piece of text, ensuring that the output is directly usable as a string rather than a list.

Together, these lines of code allow for generating a concise piece of text using the Llama 2 model, with precise control over the length and format of the output, making it suitable for a variety of text generation tasks.
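
The length budget and the decoding step can be illustrated without loading the model. The sketch below uses a made-up six-entry vocabulary and an illustrative prompt length of 13 tokens (not the measured length of the prompt above); `new_token_budget` and `toy_batch_decode` are hypothetical names that only mimic the behavior described.

```python
# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = {0: "<pad>", 1: "<s>", 2: "</s>", 3: "Hello", 4: "world", 5: "!"}
SPECIAL_IDS = {0, 1, 2}


def new_token_budget(prompt_length, max_length):
    """Tokens left for generation when max_length counts the prompt too."""
    return max(0, max_length - prompt_length)


def toy_batch_decode(batch_ids, skip_special_tokens=True):
    """Decode each sequence in a batch to a string, one string per sequence."""
    texts = []
    for ids in batch_ids:
        kept = [i for i in ids if not (skip_special_tokens and i in SPECIAL_IDS)]
        texts.append(" ".join(VOCAB[i] for i in kept))
    return texts


print(new_token_budget(prompt_length=13, max_length=30))  # 17
batch = [[1, 3, 4, 5, 2, 0, 0]]
print(toy_batch_decode(batch))     # ['Hello world !']
print(toy_batch_decode(batch)[0])  # Hello world !
```

The decode step always returns a list, one entry per sequence in the batch, which is why the notebook indexes it with `[0]` to get a plain string.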

# Example 1: Text Completion

Text completion is a common use case for language models. You provide the beginning of a sentence or paragraph, and the model generates the continuation.

In [7]:
input_text = "In a surprising turn of events, the main character discovers that"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
outputs = model.generate(input_ids, max_length=100, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In a surprising turn of events, the main character discovers that he's actually the villain.
The main character is a villain in the story.
The main character is a villain in the story. The story is about how the villain tries to stop the hero from saving the day.
The main character is a villain in the story. The story is about how the villain tries to stop the hero from saving the day. The main character is a villa


# Example 2: Question Answering

Language models can also be used to answer questions based on their training data. Here's how you can ask Llama a question:

In [7]:
input_text = "What are the benefits of using electric vehicles over gasoline cars?"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
outputs = model.generate(input_ids, max_length=80, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What are the benefits of using electric vehicles over gasoline cars?
Electric cars are a lot more environmentally friendly than their gasoline counterparts. They produce less pollution and emit fewer greenhouse gases. They also require less maintenance and are easier to drive.
What are the different types of electric vehicles?
There are a variety of electric vehicles available on the market today


# Example 3: Style Transfer

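You can ask the model to rewrite text in a different style, such as formal to informal, or vice versa. Note that a base model treats the prompt as text to continue, so without an explicit instruction or a few examples it may answer the question instead of rewriting it, as the output of this cell shows: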

In [8]:
input_text = "Could you perhaps inform me about the current weather conditions?"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
outputs = model.generate(input_ids, max_length=80, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Could you perhaps inform me about the current weather conditions?
Sure. It’s currently 37°C (99°F).
The temperature is the same as it was yesterday.
The temperature is the same as it was yesterday. It’s currently 37°C (99°F).
I see. Thank you for the information.
I


# Example 4: Creative Writing

Prompt Llama to help with creative writing tasks, such as composing a poem or a short story:

In [9]:
input_text = "Write a short poem about the sunrise in the mountains:"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
outputs = model.generate(input_ids, max_length=100, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Write a short poem about the sunrise in the mountains:
I love the sunrise in the mountains,
It's like a dream come true.
The air is crisp and clean,
And the view is breathtaking.
The sunrise in the mountains is a sight to behold,
The colors are so vibrant and bold.
It's like a painting come to life,
It's like a dream come true


# Example 5: Sentiment Analysis

Though typically not used directly for sentiment analysis, you can creatively prompt Llama to express sentiment about a topic:

In [10]:
input_text = "Describe how people feel about using public transportation:"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
outputs = model.generate(input_ids, max_length=100, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Describe how people feel about using public transportation:
The public transportation system is an important part of the city’s infrastructure and should be a priority. The city has a number of bus routes, and the buses are clean and well-maintained. The buses run on time, and the drivers are friendly and helpful. The buses are also equipped with GPS systems, so passengers can track their progress. The buses are also equipped with cam
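
The output above describes a bus system rather than stating a sentiment. A common way to steer a base model toward a label is a few-shot prompt that ends exactly where the label should appear. The sketch below only builds such a prompt and parses a continuation; the example reviews, `build_sentiment_prompt`, and `parse_label` are all made up for illustration, and how well the model follows the pattern is not guaranteed.

```python
# Hypothetical labeled examples used as in-context demonstrations.
FEW_SHOT = [
    ("The bus was on time and the driver was friendly.", "positive"),
    ("My train was cancelled again with no explanation.", "negative"),
]
LABELS = ("positive", "negative", "neutral")


def build_sentiment_prompt(text):
    """Format demonstrations plus the new text, ending where the label goes."""
    lines = ["Classify the sentiment of each review as positive, negative, or neutral.", ""]
    for review, label in FEW_SHOT:
        lines += [f"Review: {review}", f"Sentiment: {label}", ""]
    lines += [f"Review: {text}", "Sentiment:"]
    return "\n".join(lines)


def parse_label(completion):
    """Return the label if the continuation starts with one, else None."""
    words = completion.strip().lower().split()
    first = words[0].strip(".,!") if words else ""
    return first if first in LABELS else None


prompt = build_sentiment_prompt("The subway is crowded but it gets me to work.")
print(prompt)
```

The resulting `prompt` can be passed to `tokenizer.encode` and `model.generate` exactly like `input_text` in the cell above, and the text generated after the prompt can be passed to `parse_label`.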


# Example 6: Simulating Dialogue

Simulate a dialogue by having Llama take one side of a conversation:

In [11]:
input_text = "User: How do I reset my password? \nAssistant:"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
outputs = model.generate(input_ids, max_length=80, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

User: How do I reset my password? 
Assistant: Click the link below to reset your password.

### Link
https://www.example.com/reset_password

### Link
https://www.example.com/reset_password?email=myemail@example.com

### Link
https://www.example.com
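
As the output shows, the model keeps generating after the assistant's first reply. A simple post-processing step is to cut the continuation at the first marker that signals a new turn or section. The sketch below assumes the `User:`/`Assistant:` layout used in this example; `format_dialogue`, `first_reply`, and the choice of stop markers are illustrative, not a standard API.

```python
def format_dialogue(turns, next_speaker="Assistant"):
    """Render (speaker, text) turns and leave the next speaker's line open."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    lines.append(f"{next_speaker}:")
    return "\n".join(lines)


def first_reply(completion, stop_markers=("\nUser:", "\n###", "\n\n")):
    """Cut the continuation at the earliest stop marker."""
    cut = len(completion)
    for marker in stop_markers:
        position = completion.find(marker)
        if position != -1:
            cut = min(cut, position)
    return completion[:cut].strip()


prompt = format_dialogue([("User", "How do I reset my password?")])
raw = " Click the link below to reset your password.\n\n### Link\nhttps://www.example.com/reset_password"
print(first_reply(raw))  # Click the link below to reset your password.
```

Appending each trimmed reply to the list of turns and calling `format_dialogue` again gives a basic multi-turn loop.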


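Each of these examples showcases a different use of the Llama 2 model. Adjust `max_length` and `num_return_sequences` to suit the task. With the default greedy decoding used here, `num_return_sequences` must stay at 1; requesting several sequences requires sampling (`do_sample=True`) or beam search (`num_beams`).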
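
Several outputs above loop on the same sentence, which is typical of greedy decoding. Sampling is one common remedy: in `generate` it is enabled with `do_sample=True` and tuned with `temperature` and `top_p`. The sketch below shows what those two settings do to a next-token distribution, using made-up logits for four tokens; the function names are illustrative and this is not the library's implementation.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_next(logits, temperature=0.7, top_p=0.9, rng=random):
    """Draw one token index using temperature scaling and nucleus sampling."""
    candidates = top_p_filter(softmax(logits, temperature), top_p)
    indices = list(candidates)
    weights = [candidates[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


toy_logits = [3.0, 2.0, 0.5, -1.0]  # hypothetical scores for four tokens
rng = random.Random(0)
print([sample_next(toy_logits, rng=rng) for _ in range(10)])
```

Lower `temperature` and `top_p` values make the output more predictable; higher values add variety at the cost of coherence.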

# Conclusion

Congratulations on completing this tutorial on integrating the Llama 2 Large Language Model with PyTorch. By now, you should have a solid understanding of how to set up your Python environment, import the necessary libraries, load the model and tokenizer, and generate text. We've applied the model to text completion, question answering, style transfer, creative writing, sentiment prompts, and simulated dialogue, all with the same `generate` and `decode` calls.

As you continue to develop your skills and build more complex applications, remember that the flexibility and power of Llama 2 combined with the robustness of PyTorch offer a versatile toolkit for tackling challenging NLP problems. Experiment with different configurations and approaches to find what works best for your specific needs.

We encourage you to delve deeper into the model's documentation and explore the wider community resources to further enhance your understanding and capabilities. Keep pushing the boundaries of what you can achieve with AI, and most importantly, enjoy the process of learning and creating innovative solutions.

Thank you for following along with this tutorial, and happy coding!