# ChatLlamaCpp

This notebook provides a quick overview for getting started with the chat model integrated with [llama cpp python](https://github.com/abetlen/llama-cpp-python).

The example below demonstrates how to use it with the open-source Llama3 Instruct 8B model.

## Instantiation

Now we can instantiate our model object and generate chat completions:

In [1]:
import sys

sys.path.append("/home/tni5hc/Documents/langchain_llamacpp")

In [2]:
import multiprocessing

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Import from the installed package rather than from a local repository checkout.
from langchain_community.chat_models.llamacpp import ChatLlamaCpp

llm = ChatLlamaCpp(
    temperature=0.3,
    model_path="/home/tni5hc/Documents/langchain_llamacpp/SanctumAI-meta-llama-3-8b-instruct.Q8_0.gguf",
    n_ctx=10000,
    n_gpu_layers=4,
    n_batch=200,  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    max_tokens=512,
    n_threads=multiprocessing.cpu_count() - 1,
    callback_manager=CallbackManager(
        [StreamingStdOutCallbackHandler()]
    ),  # Callbacks support token-wise streaming
    streaming=True,
    repeat_penalty=1.5,
    top_p=0.5,
    stop=["<|end_of_text|>", "<|eot_id|>"],
    verbose=True,
)

llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/tni5hc/Documents/langchain_llamacpp/SanctumAI-meta-llama-3-8b-instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_mod
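
The thread and batch settings above depend on the host machine. As a minimal sketch, the values could be derived defensively so that a single-core machine does not end up with zero threads and `n_batch` stays within the 1 to `n_ctx` range mentioned in the inline comment. The helper names `safe_thread_count` and `clamp_batch` are hypothetical and are not part of LangChain or llama-cpp-python.

```python
import multiprocessing


def safe_thread_count(reserve: int = 1) -> int:
    """Leave `reserve` cores free, but never return fewer than one thread."""
    return max(1, multiprocessing.cpu_count() - reserve)


def clamp_batch(n_batch: int, n_ctx: int) -> int:
    """Keep n_batch within [1, n_ctx], as the inline comment above suggests."""
    return max(1, min(n_batch, n_ctx))


# Hypothetical usage: these values could be passed to ChatLlamaCpp(...).
n_threads = safe_thread_count()
n_batch = clamp_batch(200, 10000)
print(n_threads, n_batch)
```

The results could then be passed as `n_threads=n_threads` and `n_batch=n_batch` when constructing the model.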

## Invocation

In [3]:
messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]

ai_msg = llm.invoke(messages)
ai_msg

Je adore le programmation.

(Note: "programmation" is used instead of just saying you like programing, as it's more formal and accurate in this context)


llama_print_timings:        load time =    1119.07 ms
llama_print_timings:      sample time =      19.71 ms /    36 runs   (    0.55 ms per token,  1826.02 tokens per second)
llama_print_timings: prompt eval time =    1119.00 ms /    37 tokens (   30.24 ms per token,    33.07 tokens per second)
llama_print_timings:        eval time =    5555.43 ms /    35 runs   (  158.73 ms per token,     6.30 tokens per second)
llama_print_timings:       total time =    6883.54 ms /    72 tokens


AIMessage(content='Je adore le programmation.\n\n(Note: "programmation" is used instead of just saying you like programing, as it\'s more formal and accurate in this context)', response_metadata={'finish_reason': 'stop'}, id='run-2fbd654b-7424-4f79-b43b-711da3fc8905-0')

In [4]:
print(ai_msg.content)

Je adore le programmation.

(Note: "programmation" is used instead of just saying you like programing, as it's more formal and accurate in this context)


## Chaining

We can [chain](/docs/how_to/sequence/) our model with a prompt template like so:

In [5]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that translates {input_language} to {output_language}.",
        ),
        ("human", "{input}"),
    ]
)

chain = prompt | llm
chain.invoke(
    {
        "input_language": "English",
        "output_language": "German",
        "input": "I love programming.",
    }
)

Llama.generate: prefix-match hit


Ich liebe auch Programmieren! (Translation: I also like coding!) What kind of programs do you enjoy writing?


llama_print_timings:        load time =    1119.07 ms
llama_print_timings:      sample time =      13.02 ms /    24 runs   (    0.54 ms per token,  1843.18 tokens per second)
llama_print_timings: prompt eval time =     964.79 ms /    17 tokens (   56.75 ms per token,    17.62 tokens per second)
llama_print_timings:        eval time =    4455.69 ms /    23 runs   (  193.73 ms per token,     5.16 tokens per second)
llama_print_timings:       total time =    5554.79 ms /    40 tokens


AIMessage(content='Ich liebe auch Programmieren! (Translation: I also like coding!) What kind of programs do you enjoy writing?', response_metadata={'finish_reason': 'stop'}, id='run-4602aed1-6d2b-4120-8c80-7f4507c3a176-0')
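
To make the chaining step more concrete, the sketch below imitates in plain Python what the prompt template contributes before the model is called: it fills the placeholders in each `(role, text)` pair with the input variables. This is only a conceptual illustration; `render_messages` is a hypothetical helper, not LangChain's implementation.

```python
from typing import Dict, List, Tuple

# Same (role, template) pairs as in the ChatPromptTemplate above.
TEMPLATE: List[Tuple[str, str]] = [
    (
        "system",
        "You are a helpful assistant that translates {input_language} to {output_language}.",
    ),
    ("human", "{input}"),
]


def render_messages(
    template: List[Tuple[str, str]], variables: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Substitute the variables into every message template (hypothetical helper)."""
    return [(role, text.format(**variables)) for role, text in template]


messages = render_messages(
    TEMPLATE,
    {
        "input_language": "English",
        "output_language": "German",
        "input": "I love programming.",
    },
)
for role, text in messages:
    print(f"{role}: {text}")
```

The rendered list has the same shape as the `messages` list passed to `llm.invoke` in the Invocation section.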

## Tool calling

Tool calling with `ChatLlamaCpp` works mostly the same way as OpenAI function calling.

OpenAI has a [tool calling](https://platform.openai.com/docs/guides/function-calling) (we use "tool calling" and "function calling" interchangeably here) API that lets you describe tools and their arguments, and have the model return a JSON object with a tool to invoke and the inputs to that tool. Tool calling is extremely useful for building tool-using chains and agents, and for getting structured outputs from models more generally.

With `ChatLlamaCpp.bind_tools`, we can easily pass in Pydantic classes, dict schemas, LangChain tools, or even functions as tools to the model. Under the hood these are converted to OpenAI tool schemas, which look like:
```
{
    "name": "...",
    "description": "...",
    "parameters": {...}  # JSONSchema
}
```
and passed in every model invocation.


However, the model cannot trigger a function/tool automatically; we need to force it by specifying the `tool_choice` parameter, which is typically formatted as follows:

`{"type": "function", "function": {"name": <<tool_name>>}}`
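
As a small illustration of that format, the sketch below builds the value from a tool name, standing in for the `<<tool_name>>` placeholder. The helper `make_tool_choice` is hypothetical and only wraps the dictionary shape shown above.

```python
import json


def make_tool_choice(tool_name: str) -> dict:
    """Build a tool-choice dict that names the single tool to force (hypothetical helper)."""
    if not tool_name:
        raise ValueError("tool_name must be a non-empty string")
    return {"type": "function", "function": {"name": tool_name}}


tool_choice = make_tool_choice("get_current_weather")
print(json.dumps(tool_choice))
```

The name must match the name the tool is registered under, which is `get_current_weather` in the cell that follows.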

In [6]:
from langchain.tools import tool
from langchain_core.pydantic_v1 import BaseModel, Field


class WeatherInput(BaseModel):
    location: str = Field(description="The city and state, e.g. San Francisco, CA")
    unit: str = Field(enum=["celsius", "fahrenheit"])


@tool("get_current_weather", args_schema=WeatherInput)
def get_weather(location: str, unit: str):
    """Get the current weather in a given location"""
    return f"Now the weather in {location} is 22 {unit}"


llm_with_tools = llm.bind_tools(
    tools=[get_weather],
    tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
)

In [7]:
ai_msg = llm_with_tools.invoke(
    "what is the weather like in HCMC in celsius",
)
ai_msg

Llama.generate: prefix-match hit

llama_print_timings:        load time =    1119.07 ms
llama_print_timings:      sample time =     873.61 ms /    20 runs   (   43.68 ms per token,    22.89 tokens per second)
llama_print_timings: prompt eval time =    1173.58 ms /    21 tokens (   55.88 ms per token,    17.89 tokens per second)
llama_print_timings:        eval time =    3283.58 ms /    19 runs   (  172.82 ms per token,     5.79 tokens per second)
llama_print_timings:       total time =    5477.53 ms /    40 tokens


AIMessage(content='', additional_kwargs={'function_call': {'name': 'get_current_weather', 'arguments': '{ "location": "Ho Chi Minh City", "unit" : "celsius"}'}, 'tool_calls': [{'index': 0, 'id': 'call__0_get_current_weather_cmpl-eeb56e5e-dd5a-4fd2-8a18-03f799cab23a', 'type': 'function', 'function': {'name': 'get_current_weather', 'arguments': '{ "location": "Ho Chi Minh City", "unit" : "celsius"}'}}]}, response_metadata={'finish_reason': 'tool_calls'}, id='run-f7f88b08-b0e0-4e90-ba63-07923ecb8d27-0', tool_calls=[{'name': 'get_current_weather', 'args': {'location': 'Ho Chi Minh City', 'unit': 'celsius'}, 'id': 'call__0_get_current_weather_cmpl-eeb56e5e-dd5a-4fd2-8a18-03f799cab23a'}])

In [8]:
ai_msg.tool_calls

[{'name': 'get_current_weather',
  'args': {'location': 'Ho Chi Minh City', 'unit': 'celsius'},
  'id': 'call__0_get_current_weather_cmpl-eeb56e5e-dd5a-4fd2-8a18-03f799cab23a'}]
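
The model only returns the tool name and arguments; running the tool is left to the caller. A minimal sketch of that dispatch step follows, assuming a plain-function stand-in for the tool and a hypothetical `TOOL_REGISTRY` lookup table. The `tool_calls` list is copied from the output above.

```python
from typing import Any, Callable, Dict, List


def get_weather(location: str, unit: str) -> str:
    """Plain-function stand-in for the `get_current_weather` tool defined above."""
    return f"Now the weather in {location} is 22 {unit}"


# Hypothetical registry mapping tool names to callables.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {"get_current_weather": get_weather}


def run_tool_calls(tool_calls: List[Dict[str, Any]]) -> List[Any]:
    """Look up each requested tool by name and call it with the parsed arguments."""
    results = []
    for call in tool_calls:
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            raise KeyError(f"Unknown tool: {call['name']}")
        results.append(func(**call["args"]))
    return results


# Same structure as `ai_msg.tool_calls` in the output above.
tool_calls = [
    {
        "name": "get_current_weather",
        "args": {"location": "Ho Chi Minh City", "unit": "celsius"},
        "id": "call__0_get_current_weather_cmpl-eeb56e5e-dd5a-4fd2-8a18-03f799cab23a",
    }
]
results = run_tool_calls(tool_calls)
print(results[0])
```

With the real model, `ai_msg.tool_calls` could be passed to `run_tool_calls` in place of the hard-coded list.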

## Structured output

In [27]:
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.utils.function_calling import convert_to_openai_tool


class AnswerWithJustification(BaseModel):
    """An answer to the user question along with justification for the answer."""

    answer: str
    justification: str


dict_schema = convert_to_openai_tool(AnswerWithJustification)

structured_llm = llm.with_structured_output(dict_schema)

structured_llm.invoke("What weighs more a pound of bricks or a pound of feathers")

Llama.generate: prefix-match hit

llama_print_timings:        load time =    1119.07 ms
llama_print_timings:      sample time =    2009.09 ms /    44 runs   (   45.66 ms per token,    21.90 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    6640.81 ms /    44 runs   (  150.93 ms per token,     6.63 tokens per second)
llama_print_timings:       total time =    8967.47 ms /    45 tokens


{'answer': "a pound is always the same weight, regardless of what it's made up of. So in this case both options weigh exactly one pound.",
 'justification': ''}
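
In the result above the `justification` field came back empty, so it can be worth checking the returned dictionary before using it. The sketch below is one simple way to do that in plain Python; `find_problems` and `REQUIRED_FIELDS` are hypothetical names, and the sample dictionary is copied from the output above.

```python
from typing import Any, Dict, List

# Field names taken from the AnswerWithJustification schema above.
REQUIRED_FIELDS = ("answer", "justification")


def find_problems(result: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the result looks usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = result.get(field)
        if not isinstance(value, str):
            problems.append(f"{field}: missing or not a string")
        elif not value.strip():
            problems.append(f"{field}: empty")
    return problems


result = {
    "answer": "a pound is always the same weight, regardless of what it's made up of. "
    "So in this case both options weigh exactly one pound.",
    "justification": "",
}
problems = find_problems(result)
print(problems)
```

A caller could retry the request or fall back to a default when the list is not empty.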

## Streaming

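Each token appears twice in the output below. Because the model runs in verbose mode with a streaming stdout callback attached, llama_cpp prints each token as it is generated, in addition to the `print` call in the loop.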

In [42]:
for chunk in llm.stream("what is 25x5"):
    print(chunk.content, end="\n", flush=True)


TheThe


Llama.generate: prefix-match hit


 answer answer
 to to
 the the
 multiplication multiplication
 problem problem
 " "
WhatWhat
's's
  
2525
 x x
  
55
?"?"
 would would
 be be
:

:


125125



llama_print_timings:        load time =    1119.07 ms
llama_print_timings:      sample time =      10.67 ms /    20 runs   (    0.53 ms per token,  1874.41 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    2980.27 ms /    20 runs   (  149.01 ms per token,     6.71 tokens per second)
llama_print_timings:       total time =    3093.39 ms /    21 tokens
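
When the complete answer is needed as a single string rather than token by token, the chunks can be joined as they arrive. The sketch below shows the idea with a stand-in generator so that it runs without a model; `FakeChunk`, `fake_stream`, and the shortened token list are hypothetical, and only the `.content` attribute mirrors the chunks used in the loop above.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeChunk:
    """Stand-in for a streamed message chunk; only `.content` is modelled."""

    content: str


def fake_stream(prompt: str) -> Iterator[FakeChunk]:
    """Hypothetical replacement for `llm.stream(prompt)` that yields fixed tokens."""
    for piece in ["The", " answer", " is", " 125"]:
        yield FakeChunk(piece)


def collect(stream: Iterable[FakeChunk]) -> str:
    """Concatenate the content of every chunk into one string."""
    return "".join(chunk.content for chunk in stream)


text = collect(fake_stream("what is 25x5"))
print(text)
```

With the real model, `collect(llm.stream("what is 25x5"))` would follow the same pattern. Printing each chunk with `end=""` instead of `end="\n"` keeps the streamed tokens on one line.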



