## Basic LangChain Llama.cpp Usage

https://python.langchain.com/docs/integrations/llms/llamacpp

In [6]:
%pip install langchain langchain-core

Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install langchain-community

Collecting langchain-community
  Downloading langchain_community-0.0.20-py3-none-any.whl (1.7 MB)
Collecting langsmith<0.1,>=0.0.83
  Downloading langsmith-0.0.92-py3-none-any.whl (56 kB)
Collecting langchain-core<0.2,>=0.1.21
  Downloading langchain_core-0.1.23-py3-none-any.whl (241 kB)
Collecting packaging<24.0,>=23.2
  Using cached packaging-23.2-py3-none-any.whl (53 kB)
Collecting langsmith<0.1,>=0.0.83
  Downloading langsmith-0.0.87-py3-none-any.whl (55 kB)
Installing collect

In [6]:
from langchain.callbacks.manager import CallbackManager
from langchain_core.callbacks.base import BaseCallbackManager, BaseCallbackHandler
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import LlamaCpp

In [8]:
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate.from_template(template)
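
To inspect the text the model will receive, the template can be filled in without loading a model. This is a minimal sketch using plain `str.format`, which should mirror the substitution for this single-variable template; the sample question is made up for illustration.

```python
# Same template text as in the cell above.
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

# Hypothetical sample question, only for illustration.
filled = template.format(question="What is 17 * 3?")
print(filled)
```

With the `prompt` object from the cell above, `prompt.format(question=...)` should produce the same string.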

Note: The example in the LangChain documentation did not work for me because it uses `langchain.callbacks.manager.CallbackManager`. Replacing it with `langchain_core.callbacks.base.BaseCallbackManager` fixed the problem. I found the working example [here](https://medium.com/@jayanthd04/leveraging-llama-to-talk-to-your-codebase-1fc83ed4728c).

In [9]:
# Callbacks support token-wise streaming
callback_manager = BaseCallbackManager([StreamingStdOutCallbackHandler()])

In [10]:
# Make sure the model path is correct for your system. The path must be an absolute path.
llm = LlamaCpp(
    model_path="/Users/mitjamartini/Developer/models/mistral-7b-instruct-v0.1.Q6_K.gguf",
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/mitjamartini/Developer/models/mistral-7b-instruct-v0.1.Q6_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: -
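
Since the model path has to be absolute, it can help to build it programmatically instead of typing it by hand. The following is a small sketch using only the standard library; `to_absolute_model_path` and the `~/models/...` location are hypothetical and not part of LangChain or llama.cpp.

```python
from pathlib import Path


def to_absolute_model_path(path_str: str) -> str:
    """Expand '~' and return an absolute path string (hypothetical helper)."""
    path = Path(path_str).expanduser()
    # resolve() makes a relative path absolute against the current directory.
    return str(path.resolve())


# Hypothetical location; replace with wherever your GGUF file lives.
model_path = to_absolute_model_path("~/models/mistral-7b-instruct-v0.1.Q6_K.gguf")
print(model_path)
```

The resulting string can then be passed as `model_path` when constructing the model.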

In [11]:
prompt = """
Question: A rap battle between Stephen Colbert and John Oliver
"""
llm.invoke(prompt)


[INTRO]

Stephen Colbert: Ladies and gentlemen, tonight we have a special treat for you. The one and only John Oliver is here to take on the king of late-night comedy, me! That's right, it's time for a rap battle between two of your favorite comedians. So let's get started without further ado.

John Oliver: (laughs) Alright Steve, I'm ready to go down swinging. But first, let me just say that I've always been impressed by your ability to turn complex issues into something so entertaining. It's like you're giving us all the news we need but making it fun at the same time. But enough about me, let's get into this rap battle.

Stephen Colbert: Yo, yo, yo, listen up, people. We've got two of the funniest guys in town goin' head to head in a rap battle royale. And if you don't know who these guys are, well then you must be living under a rock. But let's not waste any more time on pleasantries, let's get into it!

John Oliver: Alright Steve, here's my first line. I may be from England, but 


llama_print_timings:        load time =    7128.06 ms
llama_print_timings:      sample time =      40.53 ms /   496 runs   (    0.08 ms per token, 12238.15 tokens per second)
llama_print_timings: prompt eval time =    7555.79 ms /    16 tokens (  472.24 ms per token,     2.12 tokens per second)
llama_print_timings:        eval time =   31352.22 ms /   495 runs   (   63.34 ms per token,    15.79 tokens per second)
llama_print_timings:       total time =   39904.06 ms /   511 tokens
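
The per-token figures in the `llama_print_timings` lines above follow directly from the token counts and elapsed times. A short sketch of that arithmetic, with the numbers copied from the output; the `throughput` helper is introduced here only for illustration.

```python
from typing import Tuple


def throughput(n_tokens: int, elapsed_ms: float) -> Tuple[float, float]:
    """Return (ms per token, tokens per second) for one timing line."""
    ms_per_token = elapsed_ms / n_tokens
    tokens_per_second = n_tokens / (elapsed_ms / 1000.0)
    return ms_per_token, tokens_per_second


# Values copied from the llama_print_timings output above.
prompt_ms, prompt_tps = throughput(16, 7555.79)
eval_ms, eval_tps = throughput(495, 31352.22)

print(f"prompt eval: {prompt_ms:.2f} ms/token, {prompt_tps:.2f} tokens/s")
print(f"eval:        {eval_ms:.2f} ms/token, {eval_tps:.2f} tokens/s")
```

The `eval` line is the one that reflects generation speed, here roughly 16 tokens per second.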


"\n[INTRO]\n\nStephen Colbert: Ladies and gentlemen, tonight we have a special treat for you. The one and only John Oliver is here to take on the king of late-night comedy, me! That's right, it's time for a rap battle between two of your favorite comedians. So let's get started without further ado.\n\nJohn Oliver: (laughs) Alright Steve, I'm ready to go down swinging. But first, let me just say that I've always been impressed by your ability to turn complex issues into something so entertaining. It's like you're giving us all the news we need but making it fun at the same time. But enough about me, let's get into this rap battle.\n\nStephen Colbert: Yo, yo, yo, listen up, people. We've got two of the funniest guys in town goin' head to head in a rap battle royale. And if you don't know who these guys are, well then you must be living under a rock. But let's not waste any more time on pleasantries, let's get into it!\n\nJohn Oliver: Alright Steve, here's my first line. I may be from Eng

## With an Output Parser

In [7]:
# adapted from https://python.langchain.com/docs/modules/model_io/output_parsers/types/json to work with llama.cpp and mistral-7b-instruct-v0.1.Q6_K.gguf

from typing import List

from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain_community.llms import LlamaCpp

In [8]:
callback_manager = BaseCallbackManager([StreamingStdOutCallbackHandler()])
model = LlamaCpp(
    model_path="/Users/mitjamartini/Developer/models/mistral-7b-instruct-v0.1.Q6_K.gguf",
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    #callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/mitjamartini/Developer/models/mistral-7b-instruct-v0.1.Q6_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: -

In [11]:
# Define your desired data structure.
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

    # You can add custom validation logic easily with Pydantic.
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if field[-1] != "?":
            raise ValueError("Badly formed question!")
        return field


# And a query intended to prompt a language model to populate the data structure.
joke_query = "Tell me a joke."

# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=Joke)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt #| model | parser

from pprint import pprint
print(chain.invoke({"query": joke_query}).text)

Answer the user query.
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"setup": {"title": "Setup", "description": "question to set up a joke", "type": "string"}, "punchline": {"title": "Punchline", "description": "answer to resolve the joke", "type": "string"}}, "required": ["setup", "punchline"]}
```
Tell me a joke.
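
The chain above stops at the prompt, so the parsing step is never exercised. As a rough illustration of what that step has to do, here is a plain-Python sketch that assumes the model returns a JSON string matching the schema. It does not use LangChain's parser; `ParsedJoke`, `parse_joke`, and the sample response are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedJoke:
    """Plain-Python stand-in for the Pydantic Joke model above."""
    setup: str
    punchline: str


def parse_joke(raw: str) -> ParsedJoke:
    """Parse a JSON string and apply the same question-mark check."""
    data = json.loads(raw)
    joke = ParsedJoke(setup=data["setup"], punchline=data["punchline"])
    if not joke.setup.endswith("?"):
        raise ValueError("Badly formed question!")
    return joke


# Hypothetical model response, written by hand for illustration.
sample_response = (
    '{"setup": "Why did the chicken cross the road?", '
    '"punchline": "To get to the other side."}'
)
print(parse_joke(sample_response))
```

In practice a local model may wrap the JSON in extra text or miss a field, in which case `json.loads` or the key lookup raises an error.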



## The better approach

In [1]:
%pip install py-llm-core --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [4]:
# For local inference with GGUF models, store your models in MODELS_CACHE_DIR
!#mkdir -p ~/.cache/py-llm-core/models
!cd ~/.cache/py-llm-core/models && wget -c https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf


--2024-02-17 12:43:47--  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
Resolving huggingface.co (huggingface.co)… 2600:9000:225f:5c00:17:b174:6d00:93a1, 2600:9000:225f:4800:17:b174:6d00:93a1, 2600:9000:225f:9a00:17:b174:6d00:93a1, ...
Connecting to huggingface.co (huggingface.co)|2600:9000:225f:5c00:17:b174:6d00:93a1|:443 … connected.
HTTP request sent, awaiting response … 302 Found
Location: https://cdn-lfs.huggingface.co/repos/46/12/46124cd8d4788fd8e0879883abfc473f247664b987955cc98a08658f7df6b826/14466f9d658bf4a79f96c3f3f22759707c291cac4e62fea625e80c7d32169991?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27mistral-7b-instruct-v0.1.Q4_K_M.gguf%3B+filename%3D%22mistral-7b-instruct-v0.1.Q4_K_M.gguf%22%3B&Expires=1708427738&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwODQyNzczOH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW

The llama-cpp-python dependency may detect the architecture incorrectly and raise an incompatible-architecture error (have 'x86_64', need 'arm64').

If that's the case, run the following in your virtual env:

In [6]:
!CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64" pip3 install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python

Using pip 22.3.1 from /opt/homebrew/anaconda3/lib/python3.10/site-packages/pip (python 3.10)
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.44.tar.gz (36.6 MB)
  Installing build dependencies ...
  Running command pip subprocess to install build dependencies
  Collecting scikit-build-core[pyproject]>=0.5.1
    Using cached scikit_build_core-0.8.1-py3-none-any.whl (139 kB)
  Collecting exceptiongroup
    Using cached exceptiongroup-1.2.0-py3-none-any.whl (16 kB)
  Collecting packaging>=20.9
    Using cached packaging-23.2-py3-none-any.whl (53 kB)
  Collecting tomli>=1.1
    Using cached tomli-2.0.1-py3-none-any.whl (12 kB)
  Collecting pathspec>=0.10.1
    Using cached pathspec-0.12.1-py3-none-any.whl (31 kB)
  Collecting pyproject-metadata>=0.5
    Using cached pyproject_metadata-0.7.1-py3-none-any.whl (7.4 kB)
  Installin
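
To check whether the interpreter itself is the source of the mismatch, the standard library can report the architecture the running Python process sees. This is only a diagnostic sketch; `describe_interpreter` is a hypothetical helper. On an Apple Silicon machine, a value of `x86_64` typically suggests that Python is running under emulation rather than natively.

```python
import platform


def describe_interpreter() -> str:
    """Report the OS, architecture, and version of the running Python process."""
    machine = platform.machine()  # e.g. 'arm64' or 'x86_64'
    return f"{platform.system()} / {machine} / Python {platform.python_version()}"


print(describe_interpreter())
```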

This is an example directly from the [py-llm-core](py-llm-core) readme. 

In [32]:
from pprint import pprint
from dataclasses import dataclass
from llm_core.parsers import LLaMACPPParser

@dataclass
class Book:
    title: str
    summary: str
    author: str
    published_year: int

@dataclass
class Joke:
    setup: str
    punchline: str

text = """Foundation is a science fiction novel by American writer
Isaac Asimov. Foundation is a cycle of five
interrelated short stories, first published as a single book by Gnome Press
in 1951. Collectively they tell the early story of the Foundation,
an institute founded by psychohistorian Hari Seldon to preserve the best
of galactic civilization after the collapse of the Galactic Empire.
"""

model = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"

with LLaMACPPParser(Book, model=model) as parser:
    book = parser.parse(text)
    pprint(book)

Book(title='Foundation',
     summary='Foundation is a science fiction novel by Isaac Asimov. It is a '
             'cycle of five interrelated short stories that tell the early '
             'story of the Foundation, an institute founded by psychohistorian '
             'Hari Seldon to preserve the best of galactic civilization after '
             'the collapse of the Galactic Empire.',
     author='Isaac Asimov',
     published_year=1951)


Now adapt this to the joke example. For this we need to split the task into two steps:

1. Tell me a joke.
2. Parse the output to JSON.

In [37]:
from llama_cpp import Llama
llm = Llama(
      model_path="/Users/mitjamartini/.cache/py-llm-core/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf"
)
output = llm(
      "Q: Tell me a joke. A: ", # Prompt
      max_tokens=100, # Generate up to 100 tokens; set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=False # Do not echo the prompt back in the output
)
#pprint(output)
the_joke = output['choices'][0]['text']
pprint(the_joke)

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/mitjamartini/.cache/py-llm-core/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_mode

('10 years ago I said to my wife she was drawing her eyebrows too high she '
 'looked surprised.')
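
The `stop` list is why the result is a single line: generation ends as soon as the model would emit a newline or a new `Q:`. The sketch below imitates that visible effect on an already generated string. It is an approximation, since llama-cpp-python halts generation itself rather than trimming afterwards; `truncate_at_stop` and the sample text are hypothetical.

```python
from typing import List


def truncate_at_stop(text: str, stop: List[str]) -> str:
    """Cut text just before the earliest occurrence of any stop sequence."""
    cut = len(text)
    for marker in stop:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# Hypothetical raw continuation a model might produce without stop sequences.
raw = "Why did the chicken cross the road?\nQ: Tell me another joke."
print(truncate_at_stop(raw, stop=["Q:", "\n"]))
```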


In [58]:
prompt = """
Q: Which part is the setup and which the punchline of the following joke? "10 years ago I said to my wife she was drawing her eyebrows too high. She looked surprised." The setup is 
"""

output = llm(
      prompt,
      max_tokens=300, # Generate up to 300 tokens; set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=False # Do not echo the prompt back in the output
)
#pprint(output)
the_joke_separated = output['choices'][0]['text']
pprint(the_joke_separated)

Llama.generate: prefix-match hit


('"10 years ago I said to my wife she was drawing her eyebrows too high" and '
 'the punchline is "She looked surprised."')


In [59]:
@dataclass
class Joke:
    setup: str
    punchline: str

with LLaMACPPParser(Joke, model=model) as parser:
    joke_json = parser.parse(the_joke_separated)
    pprint(joke_json)

Joke(setup='10 years ago I said to my wife she was drawing her eyebrows too '
           'high',
     punchline='She looked surprised.')
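
The parser returns a `Joke` dataclass instance rather than a JSON string. If actual JSON text is needed, the standard library can serialize it. A minimal sketch, with the dataclass redefined so the block runs on its own and the field values copied from the output above:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Joke:
    setup: str
    punchline: str


# Values copied from the parser output above.
joke = Joke(
    setup="10 years ago I said to my wife she was drawing her eyebrows too high",
    punchline="She looked surprised.",
)

# asdict() turns the dataclass into a plain dict that json.dumps can handle.
joke_as_json = json.dumps(asdict(joke), indent=2)
print(joke_as_json)
```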
