# LLAMA.CPP
* project: https://github.com/ggerganov/llama.cpp
* python bindings: https://llama-cpp-python.readthedocs.io/en/latest/#development
* article, "Run Llama 2 locally in 7 lines (Apple Silicon Mac)": https://blog.lastmileai.dev/run-llama-2-locally-in-7-lines-apple-silicon-mac-c3f46143f327

In [1]:
LLAMA2_PATH = "/Users/juanluis/Documents/llama.cpp/models/7B/llama-2-7b-chat.Q5_K_M.gguf"

In [2]:
from llama_cpp import Llama
LLM = Llama(model_path=LLAMA2_PATH)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /Users/juanluis/Documents/llama.cpp/models/7B/llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q5_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q5_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.

In [3]:
output = LLM("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)


{'id': 'cmpl-04d53c22-9616-444c-bfa7-042aa5e4eb46', 'object': 'text_completion', 'created': 1694614971, 'model': '/Users/juanluis/Documents/llama.cpp/models/7B/llama-2-7b-chat.Q5_K_M.gguf', 'choices': [{'text': 'Q: Name the planets in the solar system? A: 1. Moon 2. Mercury 3. Venus 4. Earth 5. Mars 6. Jupiter 7. Saturn ', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 15, 'completion_tokens': 32, 'total_tokens': 47}}



llama_print_timings:        load time =  8462.79 ms
llama_print_timings:      sample time =    23.12 ms /    32 runs   (    0.72 ms per token,  1383.84 tokens per second)
llama_print_timings: prompt eval time =  8462.70 ms /    15 tokens (  564.18 ms per token,     1.77 tokens per second)
llama_print_timings:        eval time = 36407.51 ms /    31 runs   ( 1174.44 ms per token,     0.85 tokens per second)
llama_print_timings:       total time = 44950.85 ms
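
The per-token figures in the timing lines follow directly from the total time and the token count. A minimal sketch of that arithmetic, parsing one line copied from the output above (the `parse_timing` helper and its regular expression are illustrative, not part of llama.cpp):

```python
import re

# One timing line copied verbatim from the run above.
line = "llama_print_timings:        eval time = 36407.51 ms /    31 runs   ( 1174.44 ms per token,     0.85 tokens per second)"

def parse_timing(timing_line):
    """Return (ms per token, tokens per second) recomputed from a timing line."""
    match = re.search(r"=\s*([\d.]+) ms /\s*(\d+) (?:runs|tokens)", timing_line)
    if match is None:
        raise ValueError("not a per-token timing line")
    total_ms = float(match.group(1))
    count = int(match.group(2))
    # ms per token = total / count; tokens per second = count / total seconds
    return total_ms / count, count / (total_ms / 1000.0)

ms_per_token, tokens_per_second = parse_timing(line)
print(f"{ms_per_token:.2f} ms per token, {tokens_per_second:.2f} tokens per second")
```

Comparing the `eval time` rate across runs is a quick way to see whether a configuration change actually sped up generation.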


In [4]:
LLM("Q: What is the most heavy element in the periodic table? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)

Llama.generate: prefix-match hit

llama_print_timings:        load time =  8462.79 ms
llama_print_timings:      sample time =    22.71 ms /    32 runs   (    0.71 ms per token,  1409.32 tokens per second)
llama_print_timings: prompt eval time =  5626.77 ms /    14 tokens (  401.91 ms per token,     2.49 tokens per second)
llama_print_timings:        eval time =  3410.53 ms /    31 runs   (  110.02 ms per token,     9.09 tokens per second)
llama_print_timings:       total time =  9110.21 ms


{'id': 'cmpl-395f089b-3c1e-4220-a130-248937ff99bc',
 'object': 'text_completion',
 'created': 1694615423,
 'model': '/Users/juanluis/Documents/llama.cpp/models/7B/llama-2-7b-chat.Q5_K_M.gguf',
 'choices': [{'text': 'Q: What is the most heavy element in the periodic table? A:  Element #112, Oganesson (Og). It has an atomic mass of 437 u (unified atomic mass units),',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 17, 'completion_tokens': 32, 'total_tokens': 49}}

In [6]:
output.keys()

dict_keys(['id', 'object', 'created', 'model', 'choices', 'usage'])
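
The completion is a plain dictionary with the keys listed above, so the generated text and the stop reason can be read with ordinary indexing. A minimal sketch, using an abbreviated hand-copied stand-in for the real response (the `completion` variable and the `first_choice_text` helper are illustrative names):

```python
# Abbreviated copy of the response printed earlier; a real one comes from LLM(...).
completion = {
    "id": "cmpl-04d53c22-9616-444c-bfa7-042aa5e4eb46",
    "object": "text_completion",
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: 1. Moon 2. Mercury ",
            "index": 0,
            "logprobs": None,
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 15, "completion_tokens": 32, "total_tokens": 47},
}

def first_choice_text(response):
    """Return the text of the first choice in a completion-style dict."""
    return response["choices"][0]["text"]

text = first_choice_text(completion)
# With echo=True the prompt is part of the text, so split it off to keep the answer.
answer = text.split("A:", 1)[1].strip()
# finish_reason == "length" means generation stopped because max_tokens was reached.
truncated = completion["choices"][0]["finish_reason"] == "length"
print(answer)
print("hit max_tokens:", truncated)
```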

In [6]:
output = LLM("Q: Please what are some famous boxers A: ", max_tokens=4096, stop=["Q:", "\n"], stream=True)
print(output)

<generator object Llama._create_completion at 0x1111ae8f0>


In [7]:
for o in output:
    print(o["choices"][0]["text"], end="")

Llama.generate: prefix-match hit


1. ultimately he had to face off against his archrival Muhammad Ali in the Rumble in the Jungle fight, and although Foreman lost that particular bout, he remained a dominant force in the sport until his retirement in 1977.


llama_print_timings:        load time =  6846.39 ms
llama_print_timings:      sample time =    37.77 ms /    53 runs   (    0.71 ms per token,  1403.34 tokens per second)
llama_print_timings: prompt eval time =   691.80 ms /    10 tokens (   69.18 ms per token,    14.46 tokens per second)
llama_print_timings:        eval time =  3057.81 ms /    52 runs   (   58.80 ms per token,    17.01 tokens per second)
llama_print_timings:       total time =  3857.58 ms
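
When streaming, each chunk carries only a fragment of text, and the generator can be consumed only once. A minimal sketch of printing fragments as they arrive while also keeping the full text, using a hypothetical `fake_stream` generator in place of `LLM(..., stream=True)` so it runs without a model:

```python
def fake_stream():
    """Stand-in for the generator returned by LLM(..., stream=True)."""
    for piece in ["Hello", ",", " world", "!"]:
        # Each chunk mirrors the shape used in the loop above: choices[0]["text"].
        yield {"choices": [{"text": piece, "index": 0, "finish_reason": None}]}

pieces = []
for chunk in fake_stream():
    piece = chunk["choices"][0]["text"]
    print(piece, end="", flush=True)  # show tokens as they arrive
    pieces.append(piece)              # keep them for later use
print()

full_text = "".join(pieces)
```

Collecting the fragments matters because looping over the same generator a second time yields nothing.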


## Using it with LangChain
https://python.langchain.com/docs/integrations/llms/llamacpp

In [5]:
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

In [6]:
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])
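
To see the text the model will actually receive, it helps to render the template by hand. A minimal sketch using Python's own `str.format`, which for a simple single-variable template like this one should produce the same string as the `PromptTemplate` (that equivalence is an assumption here, not something checked against LangChain):

```python
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

# Fill the single placeholder exactly as the chain would with the user's question.
rendered = template.format(
    question="What NFL team won the Super Bowl in the year Justin Bieber was born?"
)
print(rendered)
```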

In [7]:
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# Note: LlamaCpp must be created with verbose=True for output to reach the callback manager

In [None]:
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path=LLAMA2_PATH,
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,
)

In [None]:
llm("hello there", stop=["\n"])  # `stop` takes a list of stop strings, not an integer

In [None]:
n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path=LLAMA2_PATH,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)

In [10]:
n_gpu_layers = 1  # Metal set to 1 is enough.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path=LLAMA2_PATH,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    f16_kv=True,  # MUST be set to True, otherwise you will run into problems after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /Users/juanluis/Documents/llama.cpp/models/7B/llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q5_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q5_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.

In [11]:
llm_chain = LLMChain(prompt=prompt, llm=llm)



In [12]:
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

llm_chain.run(question)


Step 1: Check when Justin Bieber was born.
Justin Bieber was born on March 1, 1994.
Step 2: Find the NFL team that won the Super Bowl in 1994.
The team that won Super Bowl XXVIII (the Super Bowl in 1994) was the Dallas Cowboys.
Therefore, the answer to the question is: The Dallas Cowboys.


llama_print_timings:        load time =  3534.83 ms
llama_print_timings:      sample time =    76.60 ms /   100 runs   (    0.77 ms per token,  1305.55 tokens per second)
llama_print_timings: prompt eval time =  3534.69 ms /    45 tokens (   78.55 ms per token,    12.73 tokens per second)
llama_print_timings:        eval time =  4613.90 ms /    99 runs   (   46.61 ms per token,    21.46 tokens per second)
llama_print_timings:       total time =  8413.10 ms


'\nStep 1: Check when Justin Bieber was born.\nJustin Bieber was born on March 1, 1994.\nStep 2: Find the NFL team that won the Super Bowl in 1994.\nThe team that won Super Bowl XXVIII (the Super Bowl in 1994) was the Dallas Cowboys.\nTherefore, the answer to the question is: The Dallas Cowboys.'

In [16]:
llm_chain.run("How can I build a computer?")

Llama.generate: prefix-match hit


 

Here are the steps that would help you in building a new computer:

1. Identify your budget and needs: Deciding how much you want to spend on the computer beforehand will help narrow down the choices and avoid overspending. Consider what tasks you will be using the computer for, such as gaming, graphics design, video editing, or just general use.
2. Choose a CPU (Central Processing Unit): The CPU is the brain of your computer, handling all calculations and operations. Select a processor from reputable brands like Intel or AMD based on your budget and intended usage. Certain CPUs may also require specific motherboards, so be sure to check compatibility before making a final decision.
3. Pick out a Motherboard: The motherboard is the main circuit board of your computer, connecting all the components together. Choose one that matches your CPU and has enough slots for RAM, USB ports, and other features you need. Make sure it also supports the latest technologies like USB 3.0 or SATA III


llama_print_timings:        load time =  4803.62 ms
llama_print_timings:      sample time =   181.42 ms /   256 runs   (    0.71 ms per token,  1411.08 tokens per second)
llama_print_timings: prompt eval time =  8919.69 ms /    32 tokens (  278.74 ms per token,     3.59 tokens per second)
llama_print_timings:        eval time = 15422.28 ms /   255 runs   (   60.48 ms per token,    16.53 tokens per second)
llama_print_timings:       total time = 24981.72 ms


' \n\nHere are the steps that would help you in building a new computer:\n\n1. Identify your budget and needs: Deciding how much you want to spend on the computer beforehand will help narrow down the choices and avoid overspending. Consider what tasks you will be using the computer for, such as gaming, graphics design, video editing, or just general use.\n2. Choose a CPU (Central Processing Unit): The CPU is the brain of your computer, handling all calculations and operations. Select a processor from reputable brands like Intel or AMD based on your budget and intended usage. Certain CPUs may also require specific motherboards, so be sure to check compatibility before making a final decision.\n3. Pick out a Motherboard: The motherboard is the main circuit board of your computer, connecting all the components together. Choose one that matches your CPU and has enough slots for RAM, USB ports, and other features you need. Make sure it also supports the latest technologies like USB 3.0 or S

In [21]:
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-7b-chat-hf"
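# OAUTH is not defined in this notebook; it must hold the full header value,
# e.g. "Bearer <your Hugging Face token>"
headers = {"Authorization": OAUTH}

def query(payload):
    # POST the payload to the hosted inference endpoint and return the parsed JSON body
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Can you please let us know more details about your ",
})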

In [22]:
output

{'error': 'Model requires a Pro subscription'}
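
As the response above shows, a failed request still returns JSON, just with an `error` key instead of generated text. A minimal sketch of a guard that turns such a body into an exception (the `check_inference_response` helper is a hypothetical name, and the successful-response shape in the usage note is an assumption):

```python
def check_inference_response(payload):
    """Raise if the parsed JSON body is an error object, else return it unchanged."""
    if isinstance(payload, dict) and "error" in payload:
        raise RuntimeError(f"Inference API error: {payload['error']}")
    return payload

try:
    # The body returned in the cell above.
    check_inference_response({"error": "Model requires a Pro subscription"})
except RuntimeError as exc:
    message = str(exc)
    print(message)
```

Anything without an `error` key, for example a list of generations, passes through untouched, so the guard can wrap the return value of `query` directly.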

In [23]:
llm("create a python function which returns random combinations of letters and numbers")

Llama.generate: prefix-match hit




```
def combine_letters_and_numbers(n):
    # Initialize an empty list to store the combinations
    combinations = []
    
    # Create all possible combinations of letters and numbers up to n
    for i in range(n + 1):
        for j in range(i + 1, n + 1):
            # Combine the letters and numbers into a single string
            combination = f"{random.choice('ABC')}{random.randint(1, n-1)}{random.choice('0123456789')}"
            combinations.append(combination)
    
    return combinations
```

Explanation:

The function first initializes an empty list `combinations` to store the random combinations of letters and numbers. Then, it uses a for loop to create all possible combinations of letters and numbers up to the input value `n`. For each combination, it creates a string by concatenating a randomly chosen letter from `'ABC'`, a randomly chosen number between 1 and `n - 1`, and a randomly chosen digit from `'0123456789'


llama_print_timings:        load time =  4803.62 ms
llama_print_timings:      sample time =   183.35 ms /   256 runs   (    0.72 ms per token,  1396.21 tokens per second)
llama_print_timings: prompt eval time =  6963.58 ms /    12 tokens (  580.30 ms per token,     1.72 tokens per second)
llama_print_timings:        eval time = 15842.40 ms /   255 runs   (   62.13 ms per token,    16.10 tokens per second)
llama_print_timings:       total time = 23473.30 ms


'\n\n```\ndef combine_letters_and_numbers(n):\n    # Initialize an empty list to store the combinations\n    combinations = []\n    \n    # Create all possible combinations of letters and numbers up to n\n    for i in range(n + 1):\n        for j in range(i + 1, n + 1):\n            # Combine the letters and numbers into a single string\n            combination = f"{random.choice(\'ABC\')}{random.randint(1, n-1)}{random.choice(\'0123456789\')}"\n            combinations.append(combination)\n    \n    return combinations\n```\n\nExplanation:\n\nThe function first initializes an empty list `combinations` to store the random combinations of letters and numbers. Then, it uses a for loop to create all possible combinations of letters and numbers up to the input value `n`. For each combination, it creates a string by concatenating a randomly chosen letter from `\'ABC\'`, a randomly chosen number between 1 and `n - 1`, and a randomly chosen digit from `\'0123456789\''
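
The generated function above calls `random` without importing it, draws letters only from `'ABC'`, and its explanation is cut off at the token limit. For comparison, a minimal hand-written sketch of one possible reading of the prompt (the function name, its parameters, and the choice of alphabet are all assumptions):

```python
import random
import string

def random_combinations(count, length=8, seed=None):
    """Return `count` random strings of `length` ASCII letters and digits."""
    rng = random.Random(seed)  # a seed makes the output reproducible
    alphabet = string.ascii_letters + string.digits
    return [
        "".join(rng.choice(alphabet) for _ in range(length))
        for _ in range(count)
    ]

samples = random_combinations(3, length=6, seed=0)
print(samples)
```

The `random` module is fine for test data like this; for passwords or tokens, the standard library's `secrets` module is the safer choice.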

In [31]:
from urllib.parse import urlparse, parse_qs

def extract_parameters_from_url(url):
    # Parse the URL
    parsed_url = urlparse(url)

    # Get the query parameters as a dictionary
    query_params = parse_qs(parsed_url.query)

    # Extract the "style" and "environment" parameters if present
    style = query_params.get('style', [None])[0]
    environment = query_params.get('environment', [None])[0]

    return style, environment

# Example URL
url = "https://example.com/endpoint"

style, environment = extract_parameters_from_url(url)

In [32]:
environment
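
The example URL above has no query string, so both values come back as `None`. A short sketch contrasting that with a URL that does carry the two parameters (the parameter values `dark` and `staging` are made up for illustration; the function is repeated so the block runs on its own):

```python
from urllib.parse import urlparse, parse_qs

def extract_parameters_from_url(url):
    # parse_qs maps each parameter name to a list of values
    query_params = parse_qs(urlparse(url).query)
    style = query_params.get("style", [None])[0]
    environment = query_params.get("environment", [None])[0]
    return style, environment

without_query = extract_parameters_from_url("https://example.com/endpoint")
with_query = extract_parameters_from_url(
    "https://example.com/endpoint?style=dark&environment=staging"
)
print(without_query)
print(with_query)
```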