In [21]:
! pip install -U --quiet llama-cpp-python

### Repo

Clone and build llama.cpp, following the instructions here:

https://github.com/ggerganov/llama.cpp

### Model

Download a local LLM, ideally one capable of tool calling so that you can use all the features discussed below:

* Weights: `meta-llama-3-8b-instruct.Q8_0.gguf`
* Link: https://huggingface.co/SanctumAI/Meta-Llama-3-8B-Instruct-GGUF

### Function calling

llama-cpp-python supports function calling; see its documentation:

https://llama-cpp-python.readthedocs.io/en/latest/

In [1]:
local_model = "/Users/rlm/Desktop/Code/llama.cpp/models/meta-llama-3-8b-instruct.Q8_0.gguf"

In [2]:
from langchain_community.chat_models import ChatLlamaCpp

In [3]:
import multiprocessing

from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler

llm = ChatLlamaCpp(
    temperature=0.3,
    model_path=local_model,
    n_ctx=10000,  # Note: larger than the model's llama.context_length (8192 in the loader log)
    n_gpu_layers=4,
    n_batch=200,  # Should be between 1 and n_ctx; tune it to the amount of VRAM on your GPU.
    max_tokens=512,
    n_threads=max(1, multiprocessing.cpu_count() - 1),  # Leave one core free, but never drop to 0
    callback_manager=CallbackManager(
        [StreamingStdOutCallbackHandler()]
    ),  # Callbacks support token-wise streaming
    streaming=True,
    repeat_penalty=1.5,
    top_p=0.5,
    stop=["<|end_of_text|>", "<|eot_id|>"],
    verbose=True,
)

llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /Users/rlm/Desktop/Code/llama.cpp/models/meta-llama-3-8b-instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32

### Invoke

In [4]:
messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]

ai_msg = llm.invoke(messages)
ai_msg

llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?


Je adore le programmation.

(Note: "programming" is translated as both an activity ("le programme") and also referring specifically to computer code, which in this case I chose not translate.)


llama_print_timings:        load time =    2102.08 ms
llama_print_timings:      sample time =      10.86 ms /    39 runs   (    0.28 ms per token,  3591.16 tokens per second)
llama_print_timings: prompt eval time =    2101.88 ms /    36 tokens (   58.39 ms per token,    17.13 tokens per second)
llama_print_timings:        eval time =    4366.03 ms /    38 runs   (  114.90 ms per token,     8.70 tokens per second)
llama_print_timings:       total time =    6516.86 ms /    74 tokens


AIMessage(content='Je adore le programmation.\n\n(Note: "programming" is translated as both an activity ("le programme") and also referring specifically to computer code, which in this case I chose not translate.)', response_metadata={'finish_reason': 'stop'}, id='run-c6d117d2-0478-4f45-9a8b-f84749225ecb-0')

### Chain

In [5]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that translates {input_language} to {output_language}.",
        ),
        ("human", "{input}"),
    ]
)

chain = prompt | llm
chain.invoke(
    {
        "input_language": "English",
        "output_language": "German",
        "input": "I love programming.",
    }
)

llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
Llama.generate: prefix-match hit


Ich liebe auch Programmieren! (Translation: I also like coding!) What kind of programs do you enjoy working on?


llama_print_timings:        load time =    2102.08 ms
llama_print_timings:      sample time =       6.91 ms /    25 runs   (    0.28 ms per token,  3616.37 tokens per second)
llama_print_timings: prompt eval time =     605.25 ms /    16 tokens (   37.83 ms per token,    26.44 tokens per second)
llama_print_timings:        eval time =    2692.54 ms /    24 runs   (  112.19 ms per token,     8.91 tokens per second)
llama_print_timings:       total time =    3327.85 ms /    40 tokens


AIMessage(content='Ich liebe auch Programmieren! (Translation: I also like coding!) What kind of programs do you enjoy working on?', response_metadata={'finish_reason': 'stop'}, id='run-895aeed1-2460-4f46-9f99-6ce724bf5a8d-0')

### Structured output

In [6]:
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.utils.function_calling import convert_to_openai_tool

class AnswerWithJustification(BaseModel):
    '''An answer to the user question along with justification for the answer.'''
    answer: str
    justification: str

dict_schema = convert_to_openai_tool(AnswerWithJustification)
structured_llm = llm.with_structured_output(dict_schema)
result = structured_llm.invoke("What weighs more a pound of bricks or a pound of feathers")
result

from_string grammar:
answer-kv ::= ["] [a] [n] [s] [w] [e] [r] ["] space [:] space string 
space ::= space_7 
string ::= ["] string_8 ["] space 
char ::= [^"\] | [\] char_4 
char_4 ::= ["\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] 
justification-kv ::= ["] [j] [u] [s] [t] [i] [f] [i] [c] [a] [t] [i] [o] [n] ["] space [:] space string 
root ::= [{] space answer-kv [,] space justification-kv [}] space 
space_7 ::= [ ] | 
string_8 ::= char string_8 | 

llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
Llama.generate: prefix-match hit

llama_print_timings:        load time =    2102.08 ms
llama_print_timings:      sample time =     274.45 ms /    31 runs   (    8.85 ms per token,   112.95 tokens per second)
llama_print_timings: prompt eval time =     673.28 ms /    20 tokens (   33.66 ms per token,    29.71 to
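
The `from_string grammar` lines in the log above describe a grammar whose `root` rule only admits a JSON object with an `answer` string followed by a `justification` string. As a minimal sketch, using only the standard library, the check below tests whether a piece of text has that shape. The `matches_answer_schema` helper and the sample string are hypothetical illustrations, not the model's actual output.

```python
import json

# Keys in the order the grammar's root rule requires them.
EXPECTED_KEYS = ["answer", "justification"]


def matches_answer_schema(text: str) -> bool:
    """Return True if text is a JSON object with exactly the expected string fields, in order."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(obj, dict)
        and list(obj.keys()) == EXPECTED_KEYS
        and all(isinstance(obj[key], str) for key in EXPECTED_KEYS)
    )


# Hypothetical sample, written by hand for illustration.
sample = '{"answer": "They weigh the same.", "justification": "Both are one pound."}'
print(matches_answer_schema(sample))
print(matches_answer_schema('{"answer": "missing justification"}'))
```

A check like this is only a sanity test on the final text; the grammar itself is what restricts the tokens the model may emit during generation.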

### Tool calling

* `ChatLlamaCpp` can bind tools, but the model cannot trigger a function/tool automatically.
* We need to force the call by passing the `tool_choice` parameter to `bind_tools`. Its value is an OpenAI-style dict that names the tool to call, as in the cell below.
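
The forced `tool_choice` value is a nested dict with a fixed `"type"` of `"function"` and the tool's name under `"function"`. A minimal sketch of building it, assuming that shape; `forced_tool_choice` is a hypothetical helper and not part of LangChain or llama-cpp-python:

```python
def forced_tool_choice(tool_name: str) -> dict:
    """Build an OpenAI-style tool_choice dict that forces one named tool."""
    return {"type": "function", "function": {"name": tool_name}}


choice = forced_tool_choice("get_current_weather")
print(choice)
```

The name passed in has to match the name the tool was registered under (`get_current_weather` here), not the Python function name.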

In [7]:
from langchain_core.tools import tool
from langchain_core.pydantic_v1 import BaseModel, Field

class WeatherInput(BaseModel):
    location: str = Field(description="The city and state, e.g. San Francisco, CA")
    unit: str = Field(enum=["celsius", "fahrenheit"])

@tool("get_current_weather", args_schema=WeatherInput)
def get_weather(location: str, unit: str):
    """Get the current weather in a given location"""
    return f"Now the weather in {location} is 22 {unit}"

llm_with_tools = llm.bind_tools(
    tools=[get_weather],
    tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
)

ai_msg = llm_with_tools.invoke(
    "what is the weather in San Francisco, CA",
)

ai_msg

from_string grammar:
char ::= [^"\] | [\] char_1 
char_1 ::= ["\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] 
location-kv ::= ["] [l] [o] [c] [a] [t] [i] [o] [n] ["] space [:] space string 
space ::= space_7 
string ::= ["] string_8 ["] space 
root ::= [{] space location-kv [,] space unit-kv [}] space 
unit-kv ::= ["] [u] [n] [i] [t] ["] space [:] space unit 
space_7 ::= [ ] | 
string_8 ::= char string_8 | 
unit ::= ["] [c] [e] [l] [s] [i] [u] [s] ["] | ["] [f] [a] [h] [r] [e] [n] [h] [e] [i] [t] ["] 

llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
Llama.generate: prefix-match hit

llama_print_timings:        load time =    2102.08 ms
llama_print_timings:      sample time =     142.78 ms /    18 runs   (    7.93 ms per token,   126.06 tokens per second)
llama_print_timings: prompt eval time =     501.87 m

AIMessage(content='', additional_kwargs={'function_call': {'name': 'get_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weatherget_current_weather', 'arguments': '{ "location": "Napa Valley", "unit" : "fahrenheit"}'}, 'tool_calls': [{'index': 0, 'id': 'call__0_get_current_weather_cmpl-3e181b53-e014-4d5e-b9b0-c5313eebae2ccall__0_get_current_weather_cmpl-3e181b53-e014-4d5e-b9b0-c5313eebae2ccall__0_get_current_weather_cmpl-3e181b53-e014-4d5e-b9b0-c5313eebae2ccall__0_get_current_weather_cmpl-3e181b53-e014-4d5e-b9b0-c5313eebae2ccall__0_get_current_weather_cmpl-3e181b53-e014-4d5e-b9b0-c5313eebae2ccall__0_get_current_weather_cmpl-3e181b53-e014-4d5e-b9b0-c5313eebae2ccall__0_get_current_weather_cmpl-3e181b53-e014-4d5e-b9b0-c5313eebae2c
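
The message above carries the tool arguments as a JSON string under `additional_kwargs['function_call']['arguments']`. A minimal sketch of turning such a payload into an actual call, under these assumptions: `get_weather_impl`, `TOOL_REGISTRY`, and `run_function_call` are hypothetical plain-Python stand-ins (not the LangChain tool object), and the `function_call` dict is copied by hand from the output with the tool name written once rather than repeated.

```python
import json


def get_weather_impl(location: str, unit: str) -> str:
    """Plain-Python stand-in for the get_current_weather tool body."""
    return f"Now the weather in {location} is 22 {unit}"


# Map tool names (as registered with the model) to local callables.
TOOL_REGISTRY = {"get_current_weather": get_weather_impl}


def run_function_call(function_call: dict) -> str:
    """Look up the named tool and call it with the JSON-decoded arguments."""
    name = function_call["name"]
    if name not in TOOL_REGISTRY:
        raise KeyError(f"Unknown tool: {name}")
    kwargs = json.loads(function_call["arguments"])
    return TOOL_REGISTRY[name](**kwargs)


# Hand-copied payload; the name is assumed to be a single 'get_current_weather'.
function_call = {
    "name": "get_current_weather",
    "arguments": '{ "location": "Napa Valley", "unit" : "fahrenheit"}',
}
print(run_function_call(function_call))
```

Note that the arguments name "Napa Valley" although the prompt asked about San Francisco, CA, so the decoded arguments are worth validating before the tool is actually run.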