# 📚📌 Introduction: Chatting with LLaMA 2 using Hugging Face 🤗🦙
In this notebook, you'll learn how to build a chat interface with the LLaMA 2 (7B) chat model, using two repositories hosted on Hugging Face: `NousResearch/Llama-2-7b-chat-hf` and Meta's original `meta-llama/Llama-2-7b-chat-hf`. The chat model is a fine-tuned version of Meta's LLaMA 2 base model, optimized for engaging, instruction-following conversations, similar to ChatGPT.

🚀 Objectives:

- Comparing Meta's LLaMA 2 and NousResearch's Fine-Tuned Variant

- Load and use a conversational LLM from Hugging Face

- Structure chat history for multi-turn dialogue

- Generate coherent and context-aware responses using transformers and pipeline

- Explore the power of open-source LLMs without running them locally

👨‍💻 Whether you're a developer, data scientist, or AI enthusiast, this notebook helps you quickly start chatting with one of the most capable open-access LLMs available. 🧠💬

📹 This video tutorial on my YouTube channel walks you through each step of the process — with explanations, code execution, and real-time results. 🔍💡 Make sure to follow along, try it out, and see how easy it is to chat with LLaMA 2 using Hugging Face! 🤖✨






## 🔍 Comparing Meta's LLaMA 2 and NousResearch's Fine-Tuned Variant 🧠💬

When working with open LLMs on Hugging Face, it's important to understand the difference between the **base models** and their **fine-tuned chat-ready variants**. Both repositories used in this notebook are chat variants: `meta-llama/Llama-2-7b-chat-hf` is Meta's own chat fine-tune of the LLaMA 2 (7B) base model (the base model is published separately as `meta-llama/Llama-2-7b-hf`), and `NousResearch/Llama-2-7b-chat-hf` is a community-hosted copy of that chat model.

Here's how the two repositories compare:

| Feature | 🧪 [Meta's LLaMA 2](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) (`meta-llama/Llama-2-7b-chat-hf`) | 🛠️ [NousResearch](https://huggingface.co/NousResearch/Llama-2-7b-chat-hf) (`NousResearch/Llama-2-7b-chat-hf`) |
|--------|------------------------------------------------------|-----------------------------------------------------|
| **Source** | Meta AI | Re-hosted by NousResearch |
| **Type** | Chat fine-tune of LLaMA 2 (7B) | Copy of the same chat fine-tune |
| **Use Case** | Conversational AI, instruction following | Conversational AI, instruction following |
| **Access** | Requires access token + approval 🔐 | Publicly available on Hugging Face 🆓 |
| **Optimized for Chat?** | ✅ Yes | ✅ Yes |
| **Best for** | Users who want the official repository and already have access | Anyone who wants to start without waiting for approval |

### ✅ Summary:
The `NousResearch` repository gives you **Meta's LLaMA 2 (7B) chat model** in a **more accessible** form: no access request is needed. If you're building chat applications or testing dialogue systems, it's the faster and friendlier starting point! 🚀

🧰 Install Required Libraries


In [1]:
#%pip install -U torch==2.0.1 \
#  transformers==4.33.0 \
#  sentencepiece==0.1.99 \
#  accelerate==0.22.0 # needed for low_cpu_mem_usage parameter

🔧 Load the `NousResearch/Llama-2-7b-chat-hf` model and tokenizer from Hugging Face

In [2]:
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_checkpoint = "NousResearch/Llama-2-7b-chat-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_checkpoint)

model = LlamaForCausalLM.from_pretrained(
    model_checkpoint,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### 🧩 Preparing Messages for LLaMA Chat Format

This utility function transforms a list of message histories into properly formatted input prompts for LLaMA-style chat models, following the instruction formatting used in many fine-tuned Hugging Face models.

### 🔍 What the Code Does:
- Defines a `Message` structure with roles (`system`, `user`, `assistant`) and content.
- Prepares messages with system instructions using special tokens like `<<SYS>>` and `[INST]...[/INST]`.
- Verifies correct message ordering:
  - A `system` message (optional, must be first)
  - Followed by alternating `user` and `assistant` messages
  - Ending with a `user` message
- Builds input strings by interleaving user and assistant turns, wrapped in `[INST]` tags, and adds `bos_token` and `eos_token` as required by the tokenizer.
- Ensures the format is compatible with models expecting instruction-style inputs (like LLaMA-2 chat variants).

🛠️ This function is adapted from [llama-cpp-chat-completion-wrapper](https://github.com/viniciusarruda/llama-cpp-chat-completion-wrapper/blob/1c9e29b70b1aaa7133d3c7d7b59a92d840e92e6d/llama_cpp_chat_completion_wrapper.py)



In [3]:
# based on https://github.com/viniciusarruda/llama-cpp-chat-completion-wrapper/blob/1c9e29b70b1aaa7133d3c7d7b59a92d840e92e6d/llama_cpp_chat_completion_wrapper.py

from typing import List
from typing import Literal
from typing import TypedDict

from transformers import PreTrainedTokenizer

Role = Literal["system", "user", "assistant"]

class Message(TypedDict):
    role: Role
    content: str

MessageList = List[Message]

BEGIN_INST, END_INST = "[INST] ", " [/INST] "
BEGIN_SYS, END_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def convert_list_of_message_lists_to_input_prompt(list_of_message_lists: List[MessageList], tokenizer: PreTrainedTokenizer) -> List[str]:
    input_prompts: List[str] = []
    print(type(list_of_message_lists))
    print(type(list_of_message_lists[0]))
    for message_list in list_of_message_lists:
        if message_list[0]["role"] == "system":
            content = "".join([BEGIN_SYS, message_list[0]["content"], END_SYS, message_list[1]["content"]])
            message_list = [{"role": message_list[1]["role"], "content": content}] + message_list[2:]

        if not (
            all([msg["role"] == "user" for msg in message_list[::2]])
            and all([msg["role"] == "assistant" for msg in message_list[1::2]])
        ):
            raise ValueError(
                "Format must be in this order: 'system', 'user', 'assistant' roles.\nAfter that, you can alternate between user and assistant multiple times"
            )

        eos = tokenizer.eos_token
        bos = tokenizer.bos_token
        input_prompt = "".join(
            [
                "".join([bos, BEGIN_INST, (prompt["content"]).strip(), END_INST, (answer["content"]).strip(), eos])
                for prompt, answer in zip(message_list[::2], message_list[1::2])
            ]
        )

        if message_list[-1]["role"] != "user":
            raise ValueError(f"Last message must be from user role. Instead, you sent from {message_list[-1]['role']} role")

        input_prompt += "".join([bos, BEGIN_INST, (message_list[-1]["content"]).strip(), END_INST])

        input_prompts.append(input_prompt)

    return input_prompts

 🧪 Creating and Formatting a Simple Chat Prompt

Here we construct a basic chat scenario using the `Message` format defined earlier. It consists of:

- 🛠 A **system message** instructing the assistant to respond only with emojis
- 👤 A **user message** asking a question
- 🧱 A list holding both messages, which is passed to our `convert_list_of_message_lists_to_input_prompt()` function to generate a LLaMA-compatible chat prompt

This shows how to structure a minimal, valid input for models expecting `[INST]`-formatted chat inputs.

🧵 The resulting `prompt` can then be passed into the model for response generation.

In [4]:
# Build each message as a typed dict with its role and content.
system_message = Message(role="system", content="Answer only with emojis")
print(system_message)

user_message = Message(role="user", content="Who won the 2019 Stanley Hockey Cup?")
print(user_message)

# One conversation: an optional system message followed by the user turn.
list_of_messages = [system_message, user_message]

# The converter expects a batch (a list) of conversations.
list_of_message_lists = [list_of_messages]

prompt = convert_list_of_message_lists_to_input_prompt(list_of_message_lists, tokenizer)
print(prompt)

{'role': 'system', 'content': 'Answer only with emojis'}
{'role': 'user', 'content': 'Who won the 2019 Stanley Hockey Cup?'}
<class 'list'>
<class 'list'>
['<s>[INST] <<SYS>>\nAnswer only with emojis\n<</SYS>>\n\nWho won the 2019 Stanley Hockey Cup? [/INST] ']
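The example above has a single user turn, while the objectives also mention multi-turn dialogue. As a rough illustration of how a longer history would render under the same layout, here is a self-contained sketch. `render_llama2_prompt` is a hypothetical helper name, the conversation is made up, and `<s>` / `</s>` are hard-coded as assumed BOS/EOS strings rather than read from a tokenizer.

```python
from typing import Dict, List

BEGIN_INST, END_INST = "[INST] ", " [/INST] "
BEGIN_SYS, END_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def render_llama2_prompt(messages: List[Dict[str, str]], bos: str = "<s>", eos: str = "</s>") -> str:
    """Render one chat history as a Llama 2 style prompt string (illustrative only)."""
    msgs = list(messages)
    # Fold an optional leading system message into the first user turn.
    if msgs and msgs[0]["role"] == "system":
        if len(msgs) < 2:
            raise ValueError("A system message must be followed by a user message")
        merged = BEGIN_SYS + msgs[0]["content"] + END_SYS + msgs[1]["content"]
        msgs = [{"role": msgs[1]["role"], "content": merged}] + msgs[2:]
    # Roles must alternate user/assistant and the history must end on a user turn.
    expected = ["user" if i % 2 == 0 else "assistant" for i in range(len(msgs))]
    if not msgs or [m["role"] for m in msgs] != expected or msgs[-1]["role"] != "user":
        raise ValueError("Expected [system,] user, assistant, ..., user")
    parts = []
    # Completed exchanges are closed with the EOS string.
    for user, assistant in zip(msgs[:-1:2], msgs[1::2]):
        parts.append(bos + BEGIN_INST + user["content"].strip() + END_INST + assistant["content"].strip() + eos)
    # The final user turn is left open for the model to complete.
    parts.append(bos + BEGIN_INST + msgs[-1]["content"].strip() + END_INST)
    return "".join(parts)


# Made-up conversation used purely for illustration.
history = [
    {"role": "system", "content": "Answer only with emojis"},
    {"role": "user", "content": "What is the weather like on a sunny day?"},
    {"role": "assistant", "content": "☀️"},
    {"role": "user", "content": "And on a rainy day?"},
]
multi_turn_prompt = render_llama2_prompt(history)
print(multi_turn_prompt)
```

Each completed user/assistant exchange becomes its own `[INST] ... [/INST]` segment, and only the last user message is left without an answer so the model continues from there.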


🧠 Generate Text from a Prompt Using the `pipeline` API

- Tokenize the input prompt.

- Print the length of the prompt in tokens for reference.

- Use a `GenerationConfig` object to control generation parameters such as `max_new_tokens`.

- Finally, call the `pipeline` API to generate text with the model and tokenizer provided.

In [5]:
from transformers import pipeline

tokenized_prompt = tokenizer(prompt)

print(f'prompt is {len(tokenized_prompt["input_ids"][0])} tokens')

prompt is 41 tokens
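Knowing the prompt length helps when choosing a generation limit. In `transformers`, `max_new_tokens` counts only generated tokens, whereas `max_length` counts prompt and generated tokens together. The sketch below mirrors that arithmetic only; it is not the library's internal logic. `new_token_budget` is a hypothetical helper, and the 4096-token context window for Llama 2 is stated here as an assumption rather than read from the model config.

```python
from typing import Optional

LLAMA2_CONTEXT_TOKENS = 4096  # assumed Llama 2 context window


def new_token_budget(
    prompt_tokens: int,
    max_new_tokens: Optional[int] = None,
    max_length: Optional[int] = None,
    context_tokens: int = LLAMA2_CONTEXT_TOKENS,
) -> int:
    """Upper bound on how many tokens can be generated for a given prompt length."""
    room = context_tokens - prompt_tokens  # space left in the context window
    if max_new_tokens is not None:
        limit = max_new_tokens  # counts generated tokens only
    elif max_length is not None:
        limit = max_length - prompt_tokens  # max_length includes the prompt
    else:
        limit = room
    return max(0, min(limit, room))


print(new_token_budget(41, max_new_tokens=2000))
print(new_token_budget(41, max_length=50))
```

For a 41-token prompt, a `max_length` of 50 would leave room for only a handful of new tokens, which is why `max_new_tokens` is often the easier setting to reason about.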


In [6]:
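from transformers import GenerationConfig

generation_config = GenerationConfig(max_new_tokens=2000)

# Note: this assignment rebinds the name `pipeline` from the imported factory
# function to the pipeline object, so re-run the import cell before re-running this one.
pipeline = pipeline("text-generation",
                    model=model,
                    tokenizer=tokenizer,
                    generation_config=generation_config, return_full_text=False)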

Device set to use cuda:0


In [7]:
response = pipeline(prompt)

In [8]:
response[0]

[{'generated_text': ' Here is my answer:\n\n🏒🏈🏆'}]

🧪 Let's try a simpler prompt to show what happens inside the `pipeline` API


In [9]:
system_prompt = "<<SYS>>\nYou are a helpful assistant that provides concise and informative answers.\n<<SYS>>"

user_prompt = "What is the capital of Canada?"

# Prompt format
prompt = system_prompt + "\n" + user_prompt

print(prompt)

# 1. tokenize the prompt into tokens
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

# 2. Pass the tokens to the model to generate a response in tokens
output = model.generate(input_ids, max_length=50)

# 3. Decode the response back
response = tokenizer.decode(output[0], skip_special_tokens=True)

# Print the response
response

<<SYS>>
You are a helpful assistant that provides concise and informative answers.
<<SYS>>
What is the capital of Canada?


'<<SYS>>\nYou are a helpful assistant that provides concise and informative answers.\n<<SYS>>\nWhat is the capital of Canada?\n<</PARAGRAPH>>\nThe capital of Canada is Ottawa'


🧪 Calling `pipeline` with the same prompt to generate a response in a single call


In [11]:
response = pipeline(prompt)

In [12]:
response[0]

{'generated_text': '\n<</PAGE>>\nThe capital of Canada is Ottawa.'}
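The stray `<</PARAGRAPH>>` and `<</PAGE>>` tags in the outputs above are probably a side effect of the hand-written prompt: it closes the system block with `<<SYS>>` instead of `<</SYS>>` and has no `[INST] ... [/INST]` wrapper, so it does not match the layout produced by `convert_list_of_message_lists_to_input_prompt`. A minimal sketch of a single-turn prompt in that layout could look like this. `build_single_turn_prompt` is a hypothetical helper, and the BOS token is left out on the assumption that the tokenizer adds it during encoding.

```python
def build_single_turn_prompt(system_prompt: str, user_prompt: str) -> str:
    """Wrap a system instruction and one user question in Llama 2 chat markers."""
    return (
        "[INST] <<SYS>>\n"
        f"{system_prompt.strip()}\n"
        "<</SYS>>\n\n"  # note the closing tag carries a slash
        f"{user_prompt.strip()} [/INST]"
    )


formatted_prompt = build_single_turn_prompt(
    "You are a helpful assistant that provides concise and informative answers.",
    "What is the capital of Canada?",
)
print(formatted_prompt)
```

The resulting string can be passed to `tokenizer.encode` or to the pipeline in place of the plain concatenation used above.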

### 🔁 Repeating the Code with `meta-llama/Llama-2-7b-chat-hf`

In this section, we're reusing the earlier code, but this time loading the **LLaMA 2 model** from **Meta AI** instead of a community model like the one from `NousResearch`.

⚠️ **Note**: Access to Meta's LLaMA models requires logging into the Hugging Face Hub and requesting access on the model page.

Make sure you have:
- 🔐 Logged into Hugging Face using `notebook_login()`
- ✅ Been granted access to `meta-llama/Llama-2-7b-chat-hf`

This version loads the model and tokenizer from Meta, formats a prompt using a system message and user question, tokenizes it, and generates a concise response using the LLaMA 2 model.

📌 The setup uses `torch.float16` for efficient inference and `device_map="auto"` to automatically assign model parts to available devices.
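To give a feel for why half precision matters here, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. The parameter count of roughly 6.74 billion for the 7B model is an assumption taken from outside this notebook, `weight_memory_gb` is a hypothetical helper, and activations and the KV cache are ignored.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

ASSUMED_PARAM_COUNT = 6.74e9  # approximate size of the 7B model (assumption)


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gb(ASSUMED_PARAM_COUNT, dtype):.1f} GB")
```

Under these assumptions, 16-bit weights take half the memory of 32-bit weights, which is often the difference between fitting on a single GPU or not.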

#### 1. 🐱‍💻 Hugging Face Hub Login

The `notebook_login()` function will prompt for your access token 🔑, giving you access to the Hub's resources.

#### How to Generate Tokens from Your Hugging Face Account

1. 🖥️ **Go to Hugging Face Website**
   - Visit [Hugging Face](https://huggingface.co/).

2. 🔑 **Log In to Your Account**

3. 👤 **Navigate to Your Settings**
   - Click your profile icon and select **Settings**.

4. 🔐 **Generate a New Token** (Access Token)
   - In the **Access Tokens** section on the left side of the settings page, click on **New Token**.
   - Give your token a name (e.g., "Jupyter Notebook") and select its type; a **read** token is sufficient for downloading models.
   - Click **Generate**.

5. 📄 **Copy Your Token**

6. 🔄 **Use the Token in Your Code**
   - You can now use this token in your code, like in `notebook_login()` or when interacting with the Hugging Face Hub via the `transformers` library.
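As an alternative to pasting the token into the interactive prompt, it can be read from an environment variable. The sketch below is one possible arrangement, not the notebook's own method: `read_hf_token` and `login_if_token_available` are hypothetical helper names, and storing the token in a variable called `HF_TOKEN` is an assumption about your setup. `huggingface_hub.login(token=...)` is imported lazily so the helper for reading the token works even where the library is not installed.

```python
import os
from typing import Optional


def read_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in `env_var`, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_if_token_available() -> bool:
    """Log in to the Hub when a token is available; return whether login was attempted."""
    token = read_hf_token()
    if token is None:
        return False
    from huggingface_hub import login  # imported lazily; needs huggingface_hub installed

    login(token=token)
    return True
```

Keeping the token out of the notebook source avoids accidentally sharing it when the notebook is published.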

In [13]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

#### 2. 🔁 Using `meta-llama/Llama-2-7b-chat-hf` Instead of NousResearch

We're repeating the same code, but this time using **Meta's LLaMA 2** model from Hugging Face: `meta-llama/Llama-2-7b-chat-hf`.

- Loads the Meta model
- Formats a prompt
- Tokenizes it
- Generates a concise response.

**Note** : We use `torch.float16` for efficiency and `device_map="auto"` for device placement.









In [15]:
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# 1. Load the model and tokenizer
model_dir = "meta-llama/Llama-2-7b-chat-hf"
model = LlamaForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.float16)
tokenizer = LlamaTokenizer.from_pretrained(model_dir)

# 2. Create the system prompt
system_prompt = "<<SYS>>\nYou are a helpful assistant that provides concise and informative answers.\n<<SYS>>"

# 3. Define the user prompt
user_prompt = "What is the capital of Canada?"

# 4. Format the whole prompt
prompt = system_prompt + "\n" + user_prompt

# 5. Tokenize the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

# 6. Generate a response (Tokens)
output = model.generate(input_ids, max_length=50)

# 7. Decode the response (from generated tokens)
response = tokenizer.decode(output[0], skip_special_tokens=True)

# 8. Print the response
print(response)


<<SYS>>
You are a helpful assistant that provides concise and informative answers.
<<SYS>>
What is the capital of Canada?
<</PAGE>>
The capital of Canada is Ottawa.
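The decoded text above begins with the prompt because, for decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones. One way to keep only the answer is to drop the first `len(prompt)` token ids before decoding. The sketch below shows the idea on plain Python lists with made-up token ids; `strip_prompt_tokens` is a hypothetical helper. With the tensors from the cell above, the equivalent slice would be `output[0][input_ids.shape[-1]:]`.

```python
from typing import List, Sequence


def strip_prompt_tokens(output_ids: Sequence[int], prompt_length: int) -> List[int]:
    """Return only the token ids that were generated after the prompt."""
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return list(output_ids[prompt_length:])


prompt_ids = [1, 101, 102, 103]            # made-up ids standing in for the encoded prompt
output_ids = prompt_ids + [201, 202, 2]    # prompt followed by the continuation
new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)
```

Decoding only the sliced ids yields the model's answer without the echoed system and user text.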
