# First sample: llama-cpp-python (local model deployment)

## Installation
1. Fork the repository at https://github.com/abetlen/llama-cpp-python/fork and clone your fork:
```
git clone https://github.com/<your-git-id>/llama-cpp-python llama
```
2. Update the source tree and fetch the llama.cpp submodule:
```
cd llama
git pull origin
git submodule init
git submodule update
```
3. Install llama-cpp-python from the local source tree:
```
python -m pip install --upgrade --force-reinstall --no-cache-dir .
```
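
As a quick check after step 3, the sketch below asks Python's package metadata whether the distribution is visible in the current environment. The helper name `installed_version` is illustrative and not part of llama-cpp-python; the check only confirms that the package is registered, not that the native library loads.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "llama-cpp-python" is the distribution name installed by the pip command above.
version = installed_version("llama-cpp-python")
if version is None:
    print("llama-cpp-python is not installed in this environment")
else:
    print(f"llama-cpp-python {version} is installed")
```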

## Installation and model loading

In [1]:
!pip3 uninstall wasabi -y
!pip3 install wasabi==0.9.1 

Found existing installation: wasabi 0.9.1
Uninstalling wasabi-0.9.1:
  Successfully uninstalled wasabi-0.9.1
Collecting wasabi==0.9.1
  Obtaining dependency information for wasabi==0.9.1 from https://files.pythonhosted.org/packages/f6/77/736fa303d2efb5b640aad8abad323c23c83c184ce95c4df25e8a8e435d2e/wasabi-0.9.1-py3-none-any.whl.metadata
  Using cached wasabi-0.9.1-py3-none-any.whl.metadata (28 kB)
Using cached wasabi-0.9.1-py3-none-any.whl (26 kB)
Installing collected packages: wasabi
Successfully installed wasabi-0.9.1





In [2]:
import llama_cpp
import llama_cpp.llama_tokenizer
import sys

llama = llama_cpp.Llama.from_pretrained(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q8_0.gguf",
    verbose=True
)



llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from C:\Users\hongweixli\.cache\huggingface\hub\models--TheBloke--Llama-2-7B-GGUF\snapshots\b4e04e128f421c93a5f1e34ac4d7ca9b0af47b80\.\llama-2-7b.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimensio

In [3]:
llama.verbose = False

## Simple response

In [4]:
def completion(prompt, temperature=0.2, top_p=0.95):
    response = llama.create_completion(
        prompt=prompt,
        temperature=temperature,
        top_p=top_p,
        max_tokens=None,  # None: generate until a stop sequence or the end of the context window
        stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
        echo=True,  # Echo the prompt back in the output
    )

    # Concatenate the text of every returned choice.
    res = ""
    for chunk in response["choices"]:
        if "text" not in chunk:
            continue
        res += chunk["text"]

    return res
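
The helper above concatenates the `text` field of every entry in `choices`. As a sketch of that step in isolation, the block below runs the same extraction on a hand-written dictionary (a hypothetical sample, not model output) and optionally strips the echoed prompt, since `echo=True` returns the prompt as a prefix of the text. The names `extract_text` and `sample` are illustrative.

```python
def extract_text(response, prompt=None):
    """Join the text of every choice; optionally drop an echoed prompt prefix."""
    text = "".join(choice.get("text", "") for choice in response.get("choices", []))
    if prompt is not None and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Hypothetical response, hand-written to mirror the dictionary shape used above.
sample = {
    "choices": [
        {"text": "Q: What is 2 + 2? A: 4", "index": 0, "finish_reason": "stop"}
    ]
}

print(extract_text(sample))                                 # prompt and answer
print(extract_text(sample, prompt="Q: What is 2 + 2? A:"))  # answer only
```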

In [5]:
prompt = "Q: Name the planets in the solar system? A:"

response = completion(prompt)

print(response)

Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune.


For comparison, the cells below query a remote deployment of Llama 2 hosted on Replicate, which returns almost the same response.

In [12]:
import os

from langchain.llms import Replicate

# Get a free API key from https://replicate.com/account/api-tokens
# and paste it here. Never commit a real token to a shared notebook or repository.
os.environ["REPLICATE_API_TOKEN"] = "<your-replicate-api-token>"

def remote_completion(
    prompt: str,
    model: str = "meta/llama-2-7b-chat",
    temperature: float = 0.2,
    top_p: float = 0.95,
) -> str:
    print(prompt)
    llm = Replicate(
        model=model,
        model_kwargs={"temperature": temperature, "top_p": top_p, "max_new_tokens": 1000},
    )
    return llm(prompt)


In [13]:
response = remote_completion(prompt = "The typical color of a llama is:", model = "meta/llama-2-7b-chat")

The typical color of a llama is:


ValueError: not enough values to unpack (expected 2, got 1)
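
One plausible cause of this `ValueError`, stated here as an assumption rather than a verified diagnosis, is that some versions of LangChain's Replicate wrapper expect the model as `owner/name:version` and split the string on `:`. A reference without a version then yields one value where two are expected. The sketch below shows such a check with a clearer error message; `split_model_ref` is a hypothetical helper and `abc123` is a placeholder, not a real version hash.

```python
def split_model_ref(model):
    """Split 'owner/name:version' into ('owner/name', 'version').

    Raises a descriptive ValueError when either part is missing.
    """
    name, sep, version = model.partition(":")
    if not sep or not name or not version:
        raise ValueError(f"expected 'owner/name:version', got {model!r}")
    return name, version


# The first reference has no version suffix; the second uses a placeholder version.
for ref in ("meta/llama-2-7b-chat", "meta/llama-2-7b-chat:abc123"):
    try:
        print(split_model_ref(ref))
    except ValueError as err:
        print(f"invalid model reference: {err}")
```

If this is the cause, passing the full `owner/name:version` string copied from the model's page on Replicate as `model` should avoid the error.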

## Chat response

In [None]:
def chat_completion(chat, temperature=0.6, top_p=0.9, stream=False):
    # With stream=True the call returns an iterator of chunks instead of a single dict.
    response = llama.create_chat_completion(
        messages=chat,
        temperature=temperature,
        top_p=top_p,
        stream=stream,
    )

    return response


In [None]:
chat = [
    {
        "role": "user",
        "content": "What is the capital of France?"
    }
]
# Request a streamed response so the chunks can be printed as they arrive.
response = chat_completion(chat, stream=True)

print("============ Response: ===========")
for chunk in response:
    delta = chunk["choices"][0]["delta"]
    if "content" not in delta:
        continue
    print(delta["content"], end="", flush=True)

print()
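
The loop above reads each streamed chunk's `delta` and prints its `content` when present. When the full reply is needed as a single string, the same traversal can collect the pieces instead of printing them. This is a minimal sketch on hand-written chunks that mimic the shape used in the loop; `collect_stream` and `sample_chunks` are illustrative names, not part of llama-cpp-python.

```python
def collect_stream(chunks):
    """Concatenate the 'content' pieces found in a sequence of streamed chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        content = delta.get("content")
        if content:  # skip chunks that only carry a role or are empty
            parts.append(content)
    return "".join(parts)


# Hypothetical chunks, hand-written to mirror the structure read by the loop above.
sample_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Paris"}}]},
    {"choices": [{"delta": {"content": "."}}]},
    {"choices": [{"delta": {}}]},
]

print(collect_stream(sample_chunks))
```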

In [None]:
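# Ask for JSON output constrained to the schema below; reuses the `chat` messages defined above.
response = llama.create_chat_completion(
    messages=chat,
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "country": {"type": "string"},
                "capital": {"type": "string"}
            },
            "required": ["country", "capital"],
        }
    },
    stream=True,
)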

print("============ Response: ===========")
for chunk in response:
    delta = chunk["choices"][0]["delta"]
    if "content" not in delta:
        continue
    print(delta["content"], end="", flush=True)

print()
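
Once the streamed pieces are joined into one string, the JSON can be parsed and checked against the two required keys from the schema. The block below is a sketch of that step on a hand-written string (a hypothetical sample, not actual model output); `parse_capital_json` is an illustrative helper.

```python
import json

# Keys listed as "required" in the schema above.
REQUIRED_KEYS = ("country", "capital")


def parse_capital_json(raw):
    """Parse a JSON object and verify that the required keys are present."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Hypothetical model output that satisfies the schema.
raw = '{"country": "France", "capital": "Paris"}'
result = parse_capital_json(raw)
print(result["country"], result["capital"])
```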