# Running the Llama-2-7b model locally

This notebook shows how to run the model locally to generate sentence completions, and tries a few sample prompts to see how well it solves simple problems out of the box.

Note that it may not run at all on a system without a GPU, although it might work on CPU alone if there is enough RAM.
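
As a rough guide to whether the model will fit, the weights alone need about (number of parameters × bytes per parameter) of memory. The sketch below assumes a round figure of 7 billion parameters and ignores activations and the generation cache, so the numbers are lower bounds. The helper name `estimate_weight_memory_gib` is hypothetical.

```python
def estimate_weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 7e9  # assumption: round figure for a "7b" model

for dtype_name, nbytes in [("float32", 4), ("float16", 2)]:
    gib = estimate_weight_memory_gib(NUM_PARAMS, nbytes)
    print(f"{dtype_name}: ~{gib:.1f} GiB")
# float32: ~26.1 GiB
# float16: ~13.0 GiB
```

If memory is tight, passing `torch_dtype=torch.float16` to `from_pretrained` is one common way to roughly halve the footprint.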


In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [4]:
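# device_map="auto" lets the library decide where to place the weights
# (GPU if available); it relies on the `accelerate` package being installed.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    cache_dir="/data/yash/base_models",
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    cache_dir="/data/yash/base_models",
)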


In [5]:
inputs = tokenizer("She is", return_tensors="pt").to(device)

In [6]:
inputs

{'input_ids': tensor([[   1, 2296,  338]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1]], device='cuda:0')}

In [7]:
outputs = model.generate(**inputs, max_new_tokens=10)

In [8]:
outputs

tensor([[    1,  2296,   338,   263,  7826,   411,   263, 12561, 29889,  2296,
         10753,   304,   367]], device='cuda:0')
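
The tensor returned by `generate` holds the prompt tokens followed by the newly generated ones: here 3 prompt ids plus 10 new ids. A minimal sketch of separating the two, using plain Python lists copied from the output above (the helper name `split_prompt_and_completion` is hypothetical):

```python
def split_prompt_and_completion(output_ids, prompt_len):
    """Split generated ids into (prompt part, newly generated part)."""
    return output_ids[:prompt_len], output_ids[prompt_len:]


# Ids copied from the cells above.
prompt_ids = [1, 2296, 338]
output_ids = [1, 2296, 338, 263, 7826, 411, 263, 12561, 29889, 2296, 10753, 304, 367]

prompt_part, new_tokens = split_prompt_and_completion(output_ids, len(prompt_ids))
print(len(new_tokens))  # 10, matching max_new_tokens=10
```

The same slice should work directly on the tensor, for example `outputs[0][inputs["input_ids"].shape[1]:]`, if only the continuation is to be decoded.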

In [11]:
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

In [12]:
response

'She is a girl with a dream. She wants to be'

In [36]:
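def get_llama2_reponse(prompt, max_new_tokens=50):
    """Return the prompt plus the model's greedy continuation as text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # do_sample=False requests greedy decoding, which is what a near-zero
    # temperature approximates; temperature only matters when sampling.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response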

In [37]:
prompt = "She is"
get_llama2_reponse(prompt, max_new_tokens=50)

'She is a 2016 graduate of the University of North Carolina at Chapel Hill, where she majored in English and minored in Creative Writing. She is currently a graduate student at the University of North Carolina at Chapel Hill,'

In [38]:
prompt = "Q:what is the capital of India? A:"
get_llama2_reponse(prompt, max_new_tokens=50)

'Q:what is the capital of India? A:New Delhi. Q:what is the capital of India? A:New Delhi. Q:what is the capital of India? A:New Delhi. Q:what is the capital of India? A:New Delhi. Q'
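
The base model keeps going until it runs out of tokens, so it repeats the question/answer pattern. One simple workaround is to post-process the text: drop the echoed prompt and cut at the first stop string. This is only a sketch; the helper name `truncate_completion` and the choice of `"Q:"` as the stop string are assumptions made for this prompt format.

```python
def truncate_completion(text, prompt, stop_strings=("Q:",)):
    """Strip the echoed prompt and cut the completion at the first stop string."""
    completion = text[len(prompt):] if text.startswith(prompt) else text
    cut = len(completion)
    for stop in stop_strings:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut].strip()


prompt = "Q:what is the capital of India? A:"
# Shortened copy of the generated text shown above.
generated = (
    "Q:what is the capital of India? A:New Delhi. "
    "Q:what is the capital of India? A:New Delhi. Q"
)
print(truncate_completion(generated, prompt))  # New Delhi.
```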

In [39]:
prompt = "translation of sentence 'i want to eat' in hindi is"
get_llama2_reponse(prompt, max_new_tokens=50)

"translation of sentence 'i want to eat' in hindi is 'मैं खाना चाहता हूं'\nI want to eat.\nमैं खाना चाहता हूं\nI want to eat"

In [40]:
prompt = "translation of sentence 'i want to eat' in french is"
get_llama2_reponse(prompt, max_new_tokens=50)

"translation of sentence 'i want to eat' in french is 'je veux manger'\nI want to eat.\nI want to eat. I want to eat.\nI want to eat. I want to eat. I want to eat.\nI want to eat. I want to eat"

In [41]:
prompt = "python code to loop from 1 to 10 and print the numbers is:"
print(get_llama2_reponse(prompt, max_new_tokens=50))

python code to loop from 1 to 10 and print the numbers is:

\begin{code}
for i in range(1, 11):
    print(i)
\end{code}

I want to write a code that will loop from 1 to 100
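
The model wraps its answer in `\begin{code}` ... `\end{code}` markers and then drifts into unrelated text. A minimal sketch of pulling out just the code with a regular expression, assuming the markers appear exactly as in the output above (the helper name `extract_code_blocks` is hypothetical):

```python
import re


def extract_code_blocks(text):
    """Return the contents of every \\begin{code} ... \\end{code} span."""
    pattern = re.compile(r"\\begin\{code\}\n(.*?)\\end\{code\}", re.DOTALL)
    return [block.strip("\n") for block in pattern.findall(text)]


# Copy of the generated text shown above.
sample = (
    "python code to loop from 1 to 10 and print the numbers is:\n\n"
    "\\begin{code}\n"
    "for i in range(1, 11):\n"
    "    print(i)\n"
    "\\end{code}\n\n"
    "I want to write a code that will loop from 1 to 100"
)

blocks = extract_code_blocks(sample)
print(blocks[0])
```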
