## Llama.cpp

llama-cpp-python is a Python binding for llama.cpp.

It supports inference for many LLMs, which are available on Hugging Face.

You don’t need an API token because the LLM runs locally.

__[TheBloke’s](https://huggingface.co/TheBloke)__ Hugging Face model pages include a Provided files section that lists the RAM required to run each quantisation size and method (e.g. __[Llama2-7B-Chat-GGUF](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF#provided-files)__).

### How to download a .gguf file

1. Install the huggingface-hub Python library: 

    ``` pip3 install huggingface-hub ```
    
    
2. Download an individual model file to the current directory:

    ``` huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False ```
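The same download can also be scripted from Python. The sketch below assumes the `hf_hub_download` function from the `huggingface_hub` package installed in step 1; the wrapper name `download_gguf` is hypothetical. The import is deferred so that defining the function does not trigger any network access.

```python
def download_gguf(repo_id: str, filename: str, local_dir: str = ".") -> str:
    """Download one file from a Hugging Face repo and return its local path."""
    # Imported lazily so the function can be defined before the library is used.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)


# Example call, left commented out because the file is several gigabytes:
# path = download_gguf("TheBloke/Llama-2-7B-Chat-GGUF", "llama-2-7b-chat.Q4_K_M.gguf")
```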

### Load all PDF files from a directory

In [None]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("data")
pages_from_pdf = loader.load()

### Define the embedding model

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name = 'all-MiniLM-L6-v2')

### Set up the FAISS document store

In [None]:
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len,
)

docs = text_splitter.split_documents(pages_from_pdf)
db = FAISS.from_documents(docs, embedding=embeddings)
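To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch in plain Python. It is not LangChain's actual algorithm (`CharacterTextSplitter` splits on a separator first and then merges pieces); the function name `sliding_chunks` is hypothetical, and the small sizes are chosen only to keep the example readable.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)

# Neighbouring chunks share their last/first 4 characters.
print(chunks[0], chunks[1])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.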

In [None]:
retriever = db.as_retriever()

### Test the retriever

In [None]:
query = ''
results = retriever.invoke(query)  # returns the documents most similar to the query
print(results)
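Conceptually, the retriever embeds the query and returns the stored chunks whose vectors lie closest to it. The toy sketch below illustrates that idea in plain Python with hand-written 2-D vectors. It assumes Euclidean distance and a default of four results, which are believed to match the LangChain FAISS defaults; the function name `top_k` and the vectors are hypothetical.

```python
import math


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return [idx for _, idx in sorted(distances)[:k]]


# Four made-up "document embeddings" and a query vector.
toy_docs = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1), (0.5, 0.5)]
nearest = top_k((1.0, 0.0), toy_docs, k=2)
print(nearest)
```

A real index does the same comparison over vectors with hundreds of dimensions and uses optimised data structures instead of a full sort.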

### Load the local LLM

In [None]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate

In [None]:
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate.from_template(template)

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
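To see what text the model will actually receive, the template can be filled in by hand. This sketch uses plain `str.format`, which is a stand-in for the f-string style substitution that `PromptTemplate.from_template` is assumed to perform by default; the example question is hypothetical.

```python
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

# Substitute the placeholder to preview the final prompt.
filled = template.format(question="What is llama.cpp?")
print(filled)
```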

In [None]:
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # forward slashes avoid backslash-escape problems
    temperature=0.75,
    max_tokens=2000,  # maximum number of tokens to generate
    n_ctx=4096,  # context window size, in tokens
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)
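Since the model path has to match your system, it can help to validate it before handing it to `LlamaCpp`, so that a wrong path fails with a clear message. The following is a minimal sketch using only the standard library; the helper name `check_model_path` is hypothetical.

```python
from pathlib import Path


def check_model_path(model_path: str) -> Path:
    """Return the path if it points to an existing .gguf file, otherwise raise."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"No model file at {path.resolve()}")
    if path.suffix != ".gguf":
        raise ValueError(f"Expected a .gguf file, got {path.name}")
    return path


# Example call, commented out because it depends on your local files:
# check_model_path("models/mistral-7b-instruct-v0.1.Q4_K_M.gguf")
```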

In [None]:
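from langchain.chains import RetrievalQA

# The "stuff" chain fills {context} with the retrieved chunks, so the prompt must declare it
qa_template = "Context: {context}\n\n" + template
QA_CHAIN_PROMPT = PromptTemplate.from_template(qa_template)
qa = RetrievalQA.from_chain_type(llm, retriever=retriever, return_source_documents=True, chain_type_kwargs={'prompt': QA_CHAIN_PROMPT})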

In [None]:
query = 'my question'

In [None]:
result = qa.invoke({'query': query})
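Because the chain was created with `return_source_documents=True`, the returned dictionary should contain the answer under `result` and the retrieved chunks under `source_documents`. The sketch below formats both; the helper name `summarize_result` is hypothetical, and the `source` and `page` metadata keys are assumed to be those set by the PDF loader.

```python
def summarize_result(result: dict) -> str:
    """Format the answer followed by one line per source document."""
    lines = [f"Answer: {result['result'].strip()}"]
    for doc in result.get("source_documents", []):
        # Fall back to placeholders if a loader did not set these keys.
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        lines.append(f"- {source} (page {page})")
    return "\n".join(lines)


# Usage with the `result` from the cell above:
# print(summarize_result(result))
```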