# Comparison between Llama-2 and Llama-2-Chat

This notebook compares the base Llama-2 model with the fine-tuned Llama-2-Chat model. Both models are loaded and given exactly the same prompts, and the differences in their responses are observed. Llama-2-Chat emits an EOS (end-of-sequence) token at the right time, so it stops generating at a logical point, whereas the base Llama-2 model keeps generating text (often repeating itself) even after it has answered the question.
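One way to make the "stops at the right time" claim checkable is to look at the last generated token id rather than at the decoded text. The sketch below is a minimal illustration, not part of the original notebook: `stopped_on_eos` is a hypothetical helper, and the token ids in the example are made up (only the EOS id `2` matches the Llama-2 tokenizer; in practice read it from `tokenizer.eos_token_id`).

```python
def stopped_on_eos(output_ids, prompt_len, eos_token_id):
    """Return True if generation ended because the model emitted EOS.

    output_ids   : list of token ids from generate() (prompt + completion)
    prompt_len   : number of prompt tokens at the start of output_ids
    eos_token_id : the tokenizer's end-of-sequence id
    """
    # generate() echoes the prompt first, so drop it before inspecting the tail
    new_ids = list(output_ids[prompt_len:])
    # If nothing was generated, or the last token is not EOS, the model was
    # most likely cut off by max_new_tokens instead of stopping on its own
    return bool(new_ids) and new_ids[-1] == eos_token_id


# Illustrative ids only: two prompt tokens, then a completion ending in EOS (2)
print(stopped_on_eos([1, 10, 11, 12, 2], prompt_len=2, eos_token_id=2))
# Same prompt, but the completion ran until the token budget was used up
print(stopped_on_eos([1, 10, 11, 12, 13], prompt_len=2, eos_token_id=2))
```

With the functions defined later in this notebook, the arguments would come from `outputs[0].tolist()`, `inputs["input_ids"].shape[1]`, and `chat_tokenizer.eos_token_id`.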


In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [2]:
# Base (pretrained-only) model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    # cache_dir="/data/yash/base_models",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    # cache_dir="/data/yash/base_models",
)

# Chat (fine-tuned) model and its tokenizer
chat_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    # cache_dir="/data/yash/base_models",
    device_map="auto",
)
chat_tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    # cache_dir="/data/yash/base_models",
)


In [3]:
def get_llama2_reponse(prompt, max_new_tokens=50):
    # Move the inputs to whichever device device_map="auto" placed the model on
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Greedy decoding (do_sample=False) keeps the output deterministic
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # The decoded text contains the prompt followed by the completion
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

def get_llama2_chat_reponse(prompt, max_new_tokens=50):
    # Same pipeline as above, using the chat model and its tokenizer
    inputs = chat_tokenizer(prompt, return_tensors="pt").to(chat_model.device)
    outputs = chat_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    response = chat_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

In [4]:
prompt = "She is"
get_llama2_reponse(prompt, max_new_tokens=50)

'She is a 2016 graduate of the University of North Carolina at Chapel Hill, where she majored in English and minored in Creative Writing. She is currently a graduate student at the University of North Carolina at Chapel Hill,'

In [5]:
prompt = "Q:what is the capital of India? A:"
get_llama2_reponse(prompt, max_new_tokens=50)

'Q:what is the capital of India? A:New Delhi. Q:what is the capital of India? A:New Delhi. Q:what is the capital of India? A:New Delhi. Q:what is the capital of India? A:New Delhi. Q'

In [6]:
prompt = "Q:what is the capital of India? A:"
get_llama2_chat_reponse(prompt, max_new_tokens=50)

'Q:what is the capital of India? A:The capital of India is New Delhi. Q:What is the currency of India? A:The currency of India is Indian rupee (INR). Q:What is the population of India? A:The population of India is approximately'

In [7]:
prompt = "Keep answer short. Q: what is the capital of India?"
get_llama2_chat_reponse(prompt, max_new_tokens=200)

'Keep answer short. Q: what is the capital of India? A: New Delhi.'

In [8]:
# Try the same prompt with the base model to compare the responses

prompt = "Keep answer short. Q: what is the capital of India?"
get_llama2_reponse(prompt, max_new_tokens=200)

'Keep answer short. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi. Q: what is the capital of India? A: New Delhi'
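Because the base model does not stop on its own here, a common workaround is to cut the decoded text at a stop string after generation. The sketch below is one possible post-processing step, not something the original notebook does: `truncate_at_stop` is a hypothetical helper, and using `"Q:"` as the stop string is an assumption that only fits this Q/A prompt style.

```python
def truncate_at_stop(text, prompt, stop_strings=("Q:",)):
    """Strip the echoed prompt and cut the completion at the first stop string."""
    # The decoded output starts with the prompt; remove it when it matches exactly
    completion = text[len(prompt):] if text.startswith(prompt) else text
    cut = len(completion)
    for stop in stop_strings:
        idx = completion.find(stop)
        if idx != -1:
            # Keep the earliest stop position across all stop strings
            cut = min(cut, idx)
    return completion[:cut].strip()


prompt = "Keep answer short. Q: what is the capital of India?"
raw = prompt + " A: New Delhi. Q: what is the capital of India? A: New Delhi."
print(truncate_at_stop(raw, prompt))  # prints: A: New Delhi.
```

This only hides the repetition; the base model still spends the full `max_new_tokens` budget generating it.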

In [9]:
prompt = "[INST] Keep answer short. Q: what is the capital of India? [/INST]"
get_llama2_chat_reponse(prompt, max_new_tokens=200)

'[INST] Keep answer short. Q: what is the capital of India? [/INST]  The capital of India is New Delhi.'
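The `[INST] ... [/INST]` wrapper used above is the instruction format Llama-2-Chat was fine-tuned on, which may be why the model answers once and stops. A small helper can build such prompts consistently. This is a sketch under assumptions: `build_llama2_chat_prompt` is a hypothetical name, the optional `<<SYS>>` block follows the single-turn format published for Llama-2-Chat, and the BOS token is left to the tokenizer.

```python
def build_llama2_chat_prompt(user_message, system_prompt=None):
    """Wrap a single user turn in the Llama-2-Chat instruction tags."""
    user_message = user_message.strip()
    if system_prompt:
        # The optional system prompt goes inside the first [INST] block
        return (
            f"[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"


# Reproduces the prompt used in the cell above
print(build_llama2_chat_prompt("Keep answer short. Q: what is the capital of India?"))
```

Recent versions of `transformers` also expose `tokenizer.apply_chat_template`, which builds the prompt from a list of role/content messages; whether it is available depends on the installed version.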

In [10]:
prompt = "[INST] write a python code to print numbers from 1 to 10 without any explanation? [/INST]"
print(get_llama2_chat_reponse(prompt, max_new_tokens=200))

[INST] write a python code to print numbers from 1 to 10 without any explanation? [/INST]  Sure! Here is a simple Python code to print numbers from 1 to 10:
```
for i in range(1, 11):
    print(i)
```
This will output:

1
2
3
4
5
6
7
8
9
10

I hope this helps! Let me know if you have any questions.
