# Usinging Llama3 Locally
=============================

    *M. Zia Rasa*

* Note 1: Using Llama3 locally is recommonded if you have a powerful GPU installed on your PC.
* Note2: A C++ compiler or Visuall Studio Development Tools must be installed on your PC too.

## Table of Contents

1. **Setting up Llama3 locally**
2. **Working with "llama_cpp" locally**
3. **Generating text or Completion with Llama3**
    * 1. Basic Completion
    * 2. Streaming Completion
    * 3. Chat completion
4. **Setting Some Hyperparameter**
    * 1. Setting some input pararameters
    * 2. Setting some output parameters
    * 3. tempratures
    * 4. top-k
    * 5. top-p
5. **Generate Embeddings**
6. **Summary**

## 1. Setting up Llama3 locally

### Step 1: 
You need to install the "llama-cpp" by:
          "pip install llama-cpp-python"
which is a C++ class that interacts with Llama3 locally. 

In [1]:
%pip install llama-cpp-python --upgrade

Note: you may need to restart the kernel to use updated packages.


### Step 2:
import "llama-cpp" package and and check out its version.

In [2]:
import llama_cpp
llama_cpp.__version__

'0.3.1'

### Step 3:
1. You need to download the "Meta-Llama-3-8B-Instruct-GGUF" file form below link:
        "https://huggingface.co/PawanKrd/Meta-Llama-3-8B-Instruct-GGUF/tree/main"
there are different vesions with different sizes (2GB-8GB), download your desired vesion.

2. Then set up the "Meta-Llama-3-8B-Instruct-GGUF" file path directory locally on your machine.

In [3]:
from llama_cpp import Llama
path_to_model="./Llama3-model/llama-3-8b-instruct.Q8_0.gguf"

### Step 4:
Check out if the llama3 file is set up correclty.
if it is set up correclty, you will get an output response

In [4]:
llm=Llama(model_path=path_to_model)
output = llm ("where does llamas live?")
print(output)

llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from ./Llama3-model/llama-3-8b-instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = workspace
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimen

{'id': 'cmpl-6d82dee8-4b10-43e8-8ced-2012fc49c50f', 'object': 'text_completion', 'created': 1728640734, 'model': './Llama3-model/llama-3-8b-instruct.Q8_0.gguf', 'choices': [{'text': ' - Llamas are originally from South America, particularly in the Andean region', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 6, 'completion_tokens': 16, 'total_tokens': 22}}


## 2. Working with "llama_cpp" locally
    * Now enjoy working with llama3 language model locally

## Generating text or completion with Llama3

**Types of completions with Llama3**
1. Basic completion
2. Streaming completion
3. Chat completion
      * Simple chat completion
      * Json mode chat completion
      * Json schema mode chat completion
      * Creating an LLM interface
      * Creating an agent class

### 1. Basic completion

In [5]:
from llama_cpp import Llama

llm = Llama(model_path="./Llama3-model/llama-3-8b-instruct.Q8_0.gguf", verbose=False)

llm

<llama_cpp.llama.Llama at 0x2dc3be0ee90>

In [6]:
resp = llm("Q: Write a short paragraph introducing Elon Musk. A: ",
           max_tokens=256,
           stop=["Q:", "\n"])

In [7]:
resp

{'id': 'cmpl-e4606afa-9614-4ec9-a2f3-39fc47a5541a',
 'object': 'text_completion',
 'created': 1728640750,
 'model': './Llama3-model/llama-3-8b-instruct.Q8_0.gguf',
 'choices': [{'text': "\xa0Elon Musk is a South African-born entrepreneur and business magnate who has revolutionized the way we live and work. With a net worth of over $200 billion, Musk is one of the richest people in the world. He is best known for his innovative ventures, including SpaceX, Tesla, and Neuralink, which are pushing the boundaries of technology and space exploration. Musk's vision is to make humanity a multi-planetary species and to accelerate the world's transition to sustainable energy. Through his companies, he has already achieved significant milestones, such as developing the electric car, pioneering private space travel, and making solar energy more accessible. With his relentless pursuit of innovation, Musk continues to inspire and captivate the world. ...read more",
   'index': 0,
   'logprobs': None

In [8]:
resp["choices"][0]["text"]

"\xa0Elon Musk is a South African-born entrepreneur and business magnate who has revolutionized the way we live and work. With a net worth of over $200 billion, Musk is one of the richest people in the world. He is best known for his innovative ventures, including SpaceX, Tesla, and Neuralink, which are pushing the boundaries of technology and space exploration. Musk's vision is to make humanity a multi-planetary species and to accelerate the world's transition to sustainable energy. Through his companies, he has already achieved significant milestones, such as developing the electric car, pioneering private space travel, and making solar energy more accessible. With his relentless pursuit of innovation, Musk continues to inspire and captivate the world. ...read more"

In [9]:
output = llm (
    "Q:Name 5 species of llamas? A:",
    max_tokens=32,
    stop=["Q:","\n"],
    )
print(output ['choices'][0]['text'])

Name 5 species of llamas?


### 2. Streaming completion of response

In [10]:
from llama_cpp import Llama
llm = Llama(model_path="./Llama3-model/llama-3-8b-instruct.Q8_0.gguf", verbose=False)

resp = llm("Q: Write a short paragraph introducing Elon Musk. A: ",
           max_tokens=256,
           stop=["Q:", "\n"],
           stream=True)

resp

<generator object Llama._create_completion at 0x000002DC3B484550>

In [11]:
for r in resp:
    print(r["choices"][0]["text"], end="")

 Elon Musk is a South African-born entrepreneur and business magnate who has revolutionized the way we live and work. With a net worth of over $200 billion, Musk is one of the richest people in the world. He is best known for his innovative ventures, including SpaceX, Tesla, and Neuralink, which are pushing the boundaries of technology and space exploration. Musk's vision is to make humanity a multi-planetary species and to accelerate the world's transition to sustainable energy. Through his companies, he has already achieved significant milestones, such as developing the electric car, pioneering private space travel, and making solar energy more accessible. With his relentless pursuit of innovation, Musk continues to inspire and captivate the world. ...read more

In [12]:
output = llm (
    "Q: How to imporve my English? A:",
    max_tokens=128,
    stop=["Q:","\n"],
    stream=True,
    )
for token in output:
    print(token['choices'][0]['text'], end='')

 To improve your English, I would recommend the following:

### Chat Completion

1. Simple chat completion

In [13]:
from llama_cpp import Llama
llm = Llama(model_path="./Llama3-model/llama-3-8b-instruct.Q8_0.gguf", verbose=False)

output = llm.create_chat_completion(
    messages =[
        {"role":"system",
         "content":"you are an assistan who speaks only Shakespearean"
        },
        {
            "role":"user",
            "content":"Dercribe New York in 10 words"
        }
    ]
)

In [14]:
print(output['choices'][0]['message']['content'])

"Fairest New York, thy steel canyons touch the heavens."


In [15]:
resp = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who perfectly describes individuals."},
          {
              "role": "user",
              "content": "Write a short paragraph introducing Elon Musk."
          }
      ]
)

resp

{'id': 'chatcmpl-161126b4-3b0a-45c2-8228-77be131704a5',
 'object': 'chat.completion',
 'created': 1728640933,
 'model': './Llama3-model/llama-3-8b-instruct.Q8_0.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': 'Elon Musk is a visionary entrepreneur, inventor, and business magnate who embodies the essence of innovation and disruption. With his unyielding passion for pushing boundaries and challenging the status quo, Musk has revolutionized the electric car industry with Tesla, made space exploration accessible with SpaceX, and transformed the way we communicate with Neuralink. His unparalleled drive, unwavering dedication, and relentless pursuit of progress have earned him a reputation as a trailblazing pioneer, with a presence that is both imposing and charismatic. With his sharp mind and quick wit, Musk navigates the complexities of the 21st century with ease, leaving a trail of groundbreaking achievements that continue to redefine the very fabric of

In [16]:
resp["choices"][0]["message"]

{'role': 'assistant',
 'content': 'Elon Musk is a visionary entrepreneur, inventor, and business magnate who embodies the essence of innovation and disruption. With his unyielding passion for pushing boundaries and challenging the status quo, Musk has revolutionized the electric car industry with Tesla, made space exploration accessible with SpaceX, and transformed the way we communicate with Neuralink. His unparalleled drive, unwavering dedication, and relentless pursuit of progress have earned him a reputation as a trailblazing pioneer, with a presence that is both imposing and charismatic. With his sharp mind and quick wit, Musk navigates the complexities of the 21st century with ease, leaving a trail of groundbreaking achievements that continue to redefine the very fabric of modern society.'}

In [17]:
new_resp = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who perfectly describes individuals."},
          {
              "role": "user",
              "content": "Write a short paragraph introducing Elon Musk."
          },
          resp["choices"][0]["message"],
          {
              "role": "user",
              "content": "Can you please rephrase your previous response?"
          }
      ]
)

new_resp

{'id': 'chatcmpl-b2e18bb8-d7bd-478e-b26e-701dc900a9d0',
 'object': 'chat.completion',
 'created': 1728641005,
 'model': './Llama3-model/llama-3-8b-instruct.Q8_0.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': 'Here is a revised paragraph introducing Elon Musk:\n\nElon Musk is a trailblazing entrepreneur and inventor who has made a profound impact on the world through his innovative ventures. With a relentless drive to push the boundaries of what is possible, Musk has disrupted industries and transformed the way we live, work, and explore. As the co-founder of SpaceX, he has made space travel more accessible, and as the CEO of Tesla, he has revolutionized the electric car industry. With his sharp intellect and entrepreneurial spirit, Musk has built a reputation as a visionary leader, and his influence continues to shape the future of technology and beyond.'},
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 196,
  'comple

In [18]:
new_resp["choices"][0]["message"]

{'role': 'assistant',
 'content': 'Here is a revised paragraph introducing Elon Musk:\n\nElon Musk is a trailblazing entrepreneur and inventor who has made a profound impact on the world through his innovative ventures. With a relentless drive to push the boundaries of what is possible, Musk has disrupted industries and transformed the way we live, work, and explore. As the co-founder of SpaceX, he has made space travel more accessible, and as the CEO of Tesla, he has revolutionized the electric car industry. With his sharp intellect and entrepreneurial spirit, Musk has built a reputation as a visionary leader, and his influence continues to shape the future of technology and beyond.'}

2. Jsome mode chat completion

In [19]:
from llama_cpp import Llama
llm = Llama(model_path="./Llama3-model/llama-3-8b-instruct.Q8_0.gguf", verbose=False)

output = llm.create_chat_completion(
    messages =[
        {"role":"system",
         "content":"you are a helpful assistant that outputs in JSON"
        },
        {
            "role":"user",
            "content":"Who won the 2020 Nobel prize in pyysic? Provie a list in 'names'"
        },
    ],
    response_format={"type":"json_objects",}
)
print(output['choices'][0]['message']['content'])

{
"names": [
"Roger Penrose",
"Andrea Ghez",
"Reinhard Genzel"
]
}


3. Json schema mode chat completion

In [20]:
output = llm.create_chat_completion(
    messages =[
        {"role":"system",
         "content":"you are a helpful assistant that outputs in JSON"
        },
        {
            "role":"user",
            "content":"Who won the 2020 Nobel prize in pyysic? Provie a list in 'names'"
        },
    ],

    response_format={"type":"json_objects",
                     "schema":{
                         "type":"object",
                         "properties":{"prize_type":{"type":"string"}},
                         "name": {"type":"string"}},
                     "required":["prize_type","name"]}
)
print(output['choices'][0]['message']['content'])

Here is the information you requested:

```json
{
    "2020 Nobel Prize in Physics": {
        "names": ["Roger Penrose", "Andrea Ghez", "Reinhard Genzel"]
    }
}
```


4. Creating  an LLM interface class

In [41]:
from llama_cpp import Llama
llm = Llama(model_path="./Llama3-model/llama-3-8b-instruct.Q8_0.gguf", verbose=False)

output = llm.create_chat_completion(
    messages =[
        {"role":"system",
         "content":"you are a helpful assistant that outputs in JSON"
        },
        {
            "role":"user",
            "content":"Who won the 2020 Nobel prize in pyysic? Provie a list in 'names'"
        },
        {
             "role":"assistant",
            "content":"Ti is a beautiful building"
        },
    ],
    response_format={"type":"json_objects",}
)
print(output['choices'][0]['message']['content'])

I think there may be a small mistake! You asked about the Nobel Prize in Physics, not "pyysic".

Here is the correct answer:

{
"names": [
"Roger Penrose",
"Reinhard Genzel",
"Andrea Ghez"
]
}

These three scientists won the 2020 Nobel Prize in Physics for their discoveries about black holes, particularly the discovery that black holes have a property known as "event horizons".


5. Creating an agent class

In [42]:
class Agent:
    def __init__(self, llm, system_prompt: str = '', history: list = []):
        self.llm = llm
        self.system_prompt = system_prompt
        self.history = [{"role": "system", "content": self.system_prompt}] + history

    def create_completion(self, user_prompt: str = '', max_tokens: int = 20):
        self.history.append({"role": "user", "content": user_prompt})
        output = self.llm.create_chat_completion(messages=self.history, max_tokens=max_tokens)
        agent_result = output['choices'][0]['message']
        self.history.append(agent_result)
        return agent_result['content']

In [43]:
agent = Agent(llm,
              system_prompt="You only speak in the voice of Shakespeare")
rest= agent.create_completion ("Describe the Eiffel tower")
print(rest)

Fair Paris, thou city of love and of woe,
Whereon doth stand the Eiff


### Setting Some Hyperparameter

1. Setting some input parameters
* n_gpu_layers
* seed
* n_ctx

In [21]:
from llama_cpp import Llama
llm = Llama(
    model_path="./Llama3-model/llama-3-8b-instruct.Q8_0.gguf",
    n_gpu_layers=-1,
    seed=1337,
    n_ctx=2048,
    )

llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from ./Llama3-model/llama-3-8b-instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = workspace
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimen

2. Setting some output parameters
* max_tokens
* stop

In [22]:
output = llm (
    "Q:Name 5 species of llamas? A:",
    max_tokens=32,
    stop=["Q:","\n"],
    )

print(output["choices"][0]["text"], end="")

llama_perf_context_print:        load time =    3294.63 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    12 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     9 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    7435.90 ms /    21 tokens


Name 5 species of llamas? 

3. Temprature

In [29]:
from llama_cpp import Llama
llm = Llama(model_path="./Llama3-model/llama-3-8b-instruct.Q8_0.gguf", verbose=False)

output = llm (
    " Descripbe the universe.",
    max_tokens=20,
    temperature =1.5,
)
print(output['choices'][0]['text'])

 **What is the universe?** The universe is all of the matter and energy that exists in space


4. Top-k

In [27]:
from llama_cpp import Llama
llm = Llama(model_path="./Llama3-model/llama-3-8b-instruct.Q8_0.gguf", verbose=False)

output = llm (
    " Descripbe the universe.",
    max_tokens=20,
    top_k=50,
)
print(output['choices'][0]['text'])

llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from ./Llama3-model/llama-3-8b-instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = workspace
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimen

 He wrote that the universe was infinite, and that the stars and planets were all part of a vast


5. Top-p

In [31]:
from llama_cpp import Llama
llm = Llama(model_path="./Llama3-model/llama-3-8b-instruct.Q8_0.gguf", verbose=False)

output = llm (
    " Descripbe the universe.",
    max_tokens=20,
    top_p=0.8,
)
print(output['choices'][0]['text'])

 Use a combination of descriptive words and phrases to paint a vivid picture of the universe.
The universe is


# Generating Embeddings

In [None]:
from llama_cpp import Llama
llama2_chat = Llama(model_path="./Llama3-model/llama-3-8b-instruct.Q8_0.gguf", embedding=True, verbose=False)

embeddings = llama2_chat.create_embedding("Hello, world!")

embeddings

In [None]:
len(embeddings["data"][0]["embedding"])

In [None]:
embeddings = llama2_chat.create_embedding(["Hello, world!", "Goodbye, world!"])

len(embeddings["data"]), len(embeddings["data"][0]["embedding"]), len(embeddings["data"][1]["embedding"])

## Summary

To sum up, we have learnt how you can access **open source LLMs** on **Local Machine** using Python library **llama-cpp-python**. Its a wrapper around **llama.cpp** library.