# Llama-cpp-python

Create a local Llama2 server, embed documents, and chat with them.

Install RAGStack and additional dependencies.

In [2]:
!pip3 install ragstack-ai



Install llama-cpp-python and the dependencies for its built-in server, with Metal inference enabled for Apple Silicon (M1) Macs.

Use Metal if you are running on an M1/M2 MacBook.

Use cuBLAS if you have an NVIDIA GPU with CUDA installed.

Use CLBlast if you are running on an AMD or Intel GPU.

The command below enables Metal. For another backend, change the `CMAKE_ARGS` value.
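Choosing a backend comes down to passing a different `CMAKE_ARGS` value at install time. The sketch below is a hypothetical helper (`guess_backend`, `detect_backend`, and `install_command` are not part of llama-cpp-python) that builds the pip command. The flag names `-DLLAMA_METAL`, `-DLLAMA_CUBLAS`, and `-DLLAMA_CLBLAST` are the ones llama-cpp-python documented in its 2023 releases; newer releases renamed these options, so check the project README for your version. The detection logic is only a rough heuristic.

```python
import platform
import shutil
from typing import Optional

# CMake flags as documented by 2023-era llama-cpp-python releases (assumption:
# newer releases use different option names).
CMAKE_FLAGS = {
    "metal": "-DLLAMA_METAL=on",      # Apple Silicon (M1/M2)
    "cublas": "-DLLAMA_CUBLAS=on",    # NVIDIA GPU with CUDA
    "clblast": "-DLLAMA_CLBLAST=on",  # AMD/Intel GPU via OpenCL
}


def guess_backend(system: str, machine: str, has_nvidia_smi: bool) -> Optional[str]:
    """Rough heuristic; returns None for a CPU-only build."""
    if system == "Darwin" and machine == "arm64":
        return "metal"  # native Python on Apple Silicon reports arm64
    if has_nvidia_smi:
        return "cublas"  # nvidia-smi on PATH suggests a CUDA setup
    return None  # CLBlast can't be detected reliably here; pick it manually


def detect_backend() -> Optional[str]:
    """Apply guess_backend to the current machine."""
    return guess_backend(
        platform.system(), platform.machine(), shutil.which("nvidia-smi") is not None
    )


def install_command(backend: Optional[str]) -> str:
    """Build the pip command in the same form as the install cell."""
    pip = "pip install llama-cpp-python 'llama-cpp-python[server]'"
    if backend is None:
        return pip
    return f'CMAKE_ARGS="{CMAKE_FLAGS[backend]}" FORCE_CMAKE=1 {pip}'
```

For example, `print(install_command(detect_backend()))` prints a suggested command for the current machine, which you can then run with `!` in a cell.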

In [3]:
!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python 'llama-cpp-python[server]'



Download an LLM in GGUF format, for example from Hugging Face.
For more on managing GGUF models, see [](link).

In [4]:
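# Uncomment to download the model into models/7B/, the path the cells below load it from.
# !wget -P models/7B https://huggingface.co/TheBloke/CodeLlama-7B-GGUF/resolve/main/codellama-7b.Q4_0.gguf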
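If you'd rather fetch the file from Python, here is a minimal standard-library sketch. It assumes the `https://huggingface.co/<repo>/resolve/main/<file>` URL pattern used by the wget command, and saves into `models/7B`, the directory the later cells read from. It also checks the first four bytes of the file, since GGUF files start with the ASCII magic `GGUF`. The function names are hypothetical.

```python
import os
import urllib.request


def hf_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    # Same URL pattern as the wget command: <repo>/resolve/<revision>/<file>
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_model(repo_id: str, filename: str, dest_dir: str = "models/7B") -> str:
    """Download a model file into dest_dir (created if needed) and return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, filename)
    if os.path.exists(path):
        return path  # already downloaded; skip a multi-gigabyte transfer
    tmp_path = path + ".part"
    urllib.request.urlretrieve(hf_file_url(repo_id, filename), tmp_path)
    os.replace(tmp_path, path)  # only expose the file once it is complete
    return path


def is_gguf(path: str) -> bool:
    """GGUF files start with the 4-byte ASCII magic b"GGUF"."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

Calling `download_model("TheBloke/CodeLlama-7B-GGUF", "codellama-7b.Q4_0.gguf")` downloads a multi-gigabyte file, and `is_gguf` on the returned path is a quick sanity check that you did not save an HTML error page instead.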

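Start a local Llama server. By default it listens at http://localhost:8000/.

The `--n_gpu_layers` option sets how many model layers are offloaded to the GPU. Set it to `0` to run on the CPU only.

The command runs until you stop it, so it blocks the notebook while the server is up. You may prefer to run it in a separate terminal.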

In [5]:
!python3 -m llama_cpp.server --model models/7B/codellama-7b.Q4_0.gguf --n_gpu_layers 1

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from models/7B/codellama-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32016,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]

Visit http://localhost:8000/docs to see the Swagger UI for your server.
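Because the server cell runs in the foreground, the notebook cannot execute later cells until the server stops. A minimal sketch, assuming you want to start the server from Python instead: launch the same command with `subprocess.Popen` and poll the `/docs` page until it answers. The helper names `server_command`, `start_server`, and `wait_until_ready` are hypothetical, and the 120-second timeout is an arbitrary guess at how long a 7B model takes to load.

```python
import subprocess
import sys
import time
import urllib.request


def server_command(model_path: str, n_gpu_layers: int = 1) -> list[str]:
    """Same invocation as the notebook cell; n_gpu_layers=0 keeps inference on the CPU."""
    return [
        sys.executable, "-m", "llama_cpp.server",
        "--model", model_path,
        "--n_gpu_layers", str(n_gpu_layers),
    ]


def start_server(model_path: str, n_gpu_layers: int = 1) -> subprocess.Popen:
    """Launch the server in the background; call .terminate() on the result to stop it."""
    return subprocess.Popen(server_command(model_path, n_gpu_layers))


def wait_until_ready(url: str = "http://localhost:8000/docs", timeout: float = 120.0) -> bool:
    """Poll url until it answers with HTTP 200, or give up after timeout seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # URLError and connection errors subclass OSError: not up yet
        time.sleep(0.5)
    return False
```

For example, `proc = start_server("models/7B/codellama-7b.Q4_0.gguf")` followed by `wait_until_ready()` returns `True` once the Swagger UI is reachable; stop the server later with `proc.terminate()`.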

## Load local PDFs
Load a PDF from a local folder into LangChain `Document` objects, then split the text into chunks of about 500 characters.

In [6]:
!pip3 install pypdf

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load one PDF from the local folder; each page becomes a Document.
loader = PyPDFLoader("frasier-pdfs/11.pdf")
pages = loader.load()

# Split the pages into chunks of up to 500 characters, with no overlap between chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(pages)
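The `chunk_size=500, chunk_overlap=0` settings control how the pages are cut. As a rough illustration, here is a simplified, pure-Python sketch of recursive character splitting: split on paragraph breaks first, then line breaks, then spaces, then single characters, and greedily merge the pieces back together while they fit in `chunk_size`. This is not LangChain's implementation (which also supports overlap and treats whitespace differently); `recursive_split` is a hypothetical name.

```python
from typing import List, Sequence

DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def recursive_split(
    text: str, chunk_size: int = 500, separators: Sequence[str] = DEFAULT_SEPARATORS
) -> List[str]:
    """Simplified recursive character splitting with no overlap between chunks.

    Chunks are guaranteed to fit in chunk_size only when the last separator is "".
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    sep, finer = separators[0], separators[1:]
    # An empty separator means "split into individual characters".
    pieces = text.split(sep) if sep else list(text)
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if len(piece) > chunk_size and finer:
            # Piece is too big on its own: flush what we have, then recurse
            # with the next, finer-grained separator.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, finer))
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
        else:
            if current:
                chunks.append(current)
            current = piece  # start a new chunk
    if current:
        chunks.append(current)
    return chunks
```

For example, `recursive_split("aaa bbb ccc", chunk_size=7)` has no paragraph or line breaks to use, so it falls back to spaces and merges words until the next one would overflow, giving `["aaa bbb", "ccc"]`.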



In [7]:
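from langchain.embeddings import LlamaCppEmbeddings

# Load the same GGUF model in this notebook process to compute embeddings
# (independent of the server started earlier).
llama = LlamaCppEmbeddings(model_path="models/7B/codellama-7b.Q4_0.gguf")

text = "Test document"

# Embed a single query string: returns one vector (a list of floats).
query_result = llama.embed_query(text)

# Embed a list of texts: returns one vector per text.
doc_result = llama.embed_documents([text])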

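`embed_query` returns one vector and `embed_documents` returns one vector per input text. To find the chunks most relevant to a question, you can rank the document vectors by cosine similarity to the query vector. The sketch below is a brute-force, in-memory stand-in for a vector store; `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k documents most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]
```

With the objects from the cells above, `top_k(query_result, llama.embed_documents([d.page_content for d in all_splits]))` would return the indices of the three best-matching chunks in `all_splits`. Embedding every chunk with a 7B model can be slow.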

Post a question to the chat completions endpoint:

In [None]:
%%bash
curl -X 'POST' \
  'http://localhost:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "messages": [
    {
      "content": "You are a helpful assistant.",
      "role": "system"
    },
    {
      "content": "How do I compile Llama.cpp locally?",
      "role": "user"
    }
  ]
}'
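The same request can be sent from Python with only the standard library. This sketch assumes the server is listening on http://localhost:8000 and returns the OpenAI-style `choices[0].message.content` response shape. The optional `context_chunks` argument, which pastes retrieved text into the system message, is a hypothetical prompt format for chatting with your documents; the server does not do this for you.

```python
import json
import urllib.request
from typing import List, Optional


def build_chat_request(
    question: str,
    context_chunks: Optional[List[str]] = None,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the same POST as the curl command, optionally adding retrieved text."""
    system = "You are a helpful assistant."
    if context_chunks:
        # Hypothetical prompt format: append retrieved chunks to the system message.
        system += " Answer using this context:\n\n" + "\n\n".join(context_chunks)
    payload = {
        "messages": [
            {"content": system, "role": "system"},
            {"content": question, "role": "user"},
        ]
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, context_chunks: Optional[List[str]] = None) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(question, context_chunks)) as resp:
        body = json.load(resp)
    # OpenAI-compatible response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

For example, `ask("How do I compile Llama.cpp locally?")` sends the same question as the curl command and returns only the reply text.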