# Llama.cpp

[Llama.cpp](https://github.com/ggerganov/llama.cpp) lets you run open-source large language models, such as Llama 2 or Mistral, locally.


## Setup

Follow the [instructions](https://github.com/ggerganov/llama.cpp#prepare-data--run) to obtain a local model in GGUF format.
You can optionally quantize the model to reduce its memory requirements.

Alternatively, you can download an existing model that is already in GGUF format.
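
As a rough illustration of why quantization reduces memory requirements, the sketch below estimates the size of the weights alone from the parameter count and the average number of bits per weight. The 7-billion parameter count and the 4.5 bits-per-weight figure are assumptions chosen for illustration (real GGUF quantization schemes mix bit widths across tensors), and the estimate ignores the KV cache and any runtime overhead.

```python
def estimate_weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed values, for illustration only.
N_PARAMS = 7e9

fp16_gib = estimate_weight_memory_gib(N_PARAMS, 16)
q4_gib = estimate_weight_memory_gib(N_PARAMS, 4.5)

print(f"16-bit weights:   {fp16_gib:.1f} GiB")
print(f"~4.5-bit weights: {q4_gib:.1f} GiB")
```

Under these assumptions, the quantized weights need a little over a quarter of the memory of the 16-bit weights.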

## Usage

You can see a full list of supported parameters on the [API reference page](https://api.python.langchain.com/en/latest/llms/langchain.llms.llamacpp.LlamaCpp.html).


In [1]:
%pip install llama-cpp-python


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [1]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chat_models.llamacpp import ChatLlamacpp


model = ChatLlamacpp(
    model_path="./models/ggml-model-Q4_K_M.gguf",
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    n_ctx=32000,
)

llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from ./models/ggml-model-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q6_K     [ 14336,  4096,     1,    

Because `StreamingStdOutCallbackHandler` is attached to the model, tokens are printed to standard output as they are generated.
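
To make the streaming behavior concrete, here is a minimal sketch of the idea in plain Python. It does not import LangChain: `PrintingTokenHandler` and `fake_generate` are hypothetical names, and the handler is only a duck-typed stand-in that mirrors the `on_llm_new_token` hook that streaming callback handlers implement.

```python
import sys
from typing import Iterable, List


class PrintingTokenHandler:
    """Hypothetical stand-in for a streaming callback handler."""

    def __init__(self, stream=None) -> None:
        self.stream = stream if stream is not None else sys.stdout
        self.tokens: List[str] = []

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Called once per generated token: record it and echo it immediately.
        self.tokens.append(token)
        self.stream.write(token)
        self.stream.flush()


def fake_generate(tokens: Iterable[str], handler: PrintingTokenHandler) -> str:
    """Hypothetical model loop: emit tokens one by one, then return the full text."""
    pieces = []
    for token in tokens:
        handler.on_llm_new_token(token)
        pieces.append(token)
    return "".join(pieces)


handler = PrintingTokenHandler()
text = fake_generate([" Artificial", " Intelligence", " (AI)"], handler)
```

Because the handler prints each token as it arrives and the call also returns the complete text, a notebook cell that ends with the call will typically show the text twice: once streamed, and once as the return value.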

In [2]:
from langchain.schema import HumanMessage

messages = [HumanMessage(content="Tell me about the history of AI")]
model(messages)

 Artificial Intelligence (AI) is a branch of computer science that aims to develop intelligent machines capable of performing tasks that normally require human intelligence. The field has a rich history, dating back to ancient civilizations.

One of the earliest examples of artificial intelligence can be found in ancient Greece, where engineers built machines that could mimic human speech and movement. These machines were designed by using simple automatons, which were mechanical devices programmed to perform specific tasks.

During the Middle Ages, scholars and monks developed machines that were capable of performing basic mathematical calculations, as well as reading and writing text. These machines used an abacus, which is a counting device that was widely used in mathematics and science.

In the 17th and 18th centuries, philosophers such as René Descartes and Leibniz proposed theories of artificial intelligence, which were based on the idea that machines could be designed to think 

AIMessage(content=' Artificial Intelligence (AI) is a branch of computer science that aims to develop intelligent machines capable of performing tasks that normally require human intelligence. The field has a rich history, dating back to ancient civilizations.\n\nOne of the earliest examples of artificial intelligence can be found in ancient Greece, where engineers built machines that could mimic human speech and movement. These machines were designed by using simple automatons, which were mechanical devices programmed to perform specific tasks.\n\nDuring the Middle Ages, scholars and monks developed machines that were capable of performing basic mathematical calculations, as well as reading and writing text. These machines used an abacus, which is a counting device that was widely used in mathematics and science.\n\nIn the 17th and 18th centuries, philosophers such as René Descartes and Leibniz proposed theories of artificial intelligence, which were based on the idea that machines co

## RAG

Llama.cpp can also be used for retrieval-augmented generation (RAG). The example below uses the same 7B model together with GPT4All embeddings and Chroma as the vector store.


In [3]:
%pip install chromadb gpt4all


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [4]:
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the blog post that will serve as the knowledge source.
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

# Split the page into chunks of at most 500 characters, with no overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
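
To illustrate what `chunk_size` and `chunk_overlap` control, the following sketch splits a string into fixed-size windows. It is a deliberate simplification and not the algorithm `RecursiveCharacterTextSplitter` uses, which tries to break on natural separators first; `split_fixed` is a hypothetical helper.

```python
from typing import List


def split_fixed(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "x" * 1200
chunks = split_fixed(sample, chunk_size=500, chunk_overlap=0)
print([len(c) for c in chunks])  # [500, 500, 200]
```

With `chunk_overlap=0`, as in the cell above, no text is repeated across chunks; a positive overlap repeats the tail of each chunk at the start of the next, which can help when a relevant sentence straddles a boundary.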

In [5]:
from langchain.embeddings import GPT4AllEmbeddings
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())

bert_load_from_file: gguf version     = 2
bert_load_from_file: gguf alignment   = 32
bert_load_from_file: gguf data offset = 695552
bert_load_from_file: model name           = BERT
bert_load_from_file: model architecture   = bert
bert_load_from_file: model file type      = 1
bert_load_from_file: bert tokenizer vocab = 30522


In [6]:
question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
len(docs)

4
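
Conceptually, a similarity search embeds the question and returns the stored chunks whose embeddings are closest to it. The sketch below shows that idea with brute-force cosine similarity over toy two-dimensional vectors. It is an assumption-laden illustration: Chroma's actual index and distance metric may differ, `top_k` is a hypothetical helper, and `k=4` is chosen only to match the four documents returned above.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 4
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, for illustration only.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
hits = top_k([1.0, 0.0], toy_vectors, k=4)
print([index for index, _ in hits])  # [0, 1, 3, 2]
```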

In [7]:
from langchain.prompts import PromptTemplate

# Prompt
template = """[INST] <<SYS>> Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use three sentences maximum and keep the answer as concise as possible. <</SYS>>
{context}
Question: {question}
Helpful Answer:[/INST]"""
QA_CHAIN_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)

In [8]:
# QA chain
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    model,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)
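
The chain above retrieves relevant chunks and passes them to the model through the prompt. Assuming the chain simply concatenates the retrieved chunks into the `{context}` slot (the "stuff" strategy), the prompt the model receives can be sketched in plain Python as follows. `TEMPLATE` is a shortened stand-in for the template defined earlier, and `build_stuffed_prompt` is a hypothetical helper, not a LangChain function.

```python
from typing import List

# Shortened, hypothetical stand-in for the prompt template defined earlier.
TEMPLATE = (
    "[INST] <<SYS>> Use the following pieces of context to answer the question. <</SYS>>\n"
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:[/INST]"
)


def build_stuffed_prompt(template: str, chunks: List[str], question: str) -> str:
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


prompt = build_stuffed_prompt(
    TEMPLATE,
    ["chunk one", "chunk two"],
    "What are the approaches to Task Decomposition?",
)
print(prompt)
```

Since every retrieved chunk ends up in a single prompt, the combined length of the chunks, the question, and the instructions has to fit within the model's context window (`n_ctx`).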

In [9]:
question = "What are the various approaches to Task Decomposition for AI Agents?"
result = qa_chain({"query": question})

 The various approaches to task decomposition for AI agents are (1) by LLM with simple prompting, (2) using task-specific instructions, and (3) with human inputs.