# Llama-cpp OpenAI API Compliant Server Usage

Llama-cpp-python implements an OpenAI Compliant web server.  

This is implemented via the docker image described here:

- [README](README.md)

This notebook show how to connect to the server either for a:

- Chat client
- embedding model

In [3]:
import llama_cpp

llama_cpp.__version__

'0.2.77'

## 1. OpenAI API

### Instantiate Model using OpenAI API:

- https://platform.openai.com/docs/api-reference/streaming
- https://platform.openai.com/docs/guides/chat-completions

In [4]:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8100/v1", api_key="sk-xxx")

#### 1.1 Chat Client using OpenAI client

In [14]:
stream = client.chat.completions.create(
    model="qwen-0_5b-instruct-q5_k_m",
    messages= [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is the capital of France and what can I do there?"
        }
    ],
    temperature = 0.7,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")


INFO:httpx:HTTP Request: POST http://localhost:8100/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:8100/v1/chat/completions "HTTP/1.1 200 OK"
The capital of France is Paris, located in the northwest of the country. The city has a rich history, with many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris also offers a variety of cultural attractions, including museums, theaters, and food markets.
If you're planning to visit Paris, there are several things you can do to get around the city. You can rent a car or take public transportation such as buses or trains, which will take you to various tourist destinations throughout the city.
In addition, there are many tourist attractions in Paris, including Notre-Dame Cathedral, the Louvre Museum, and the Eiffel Tower. Visitors can also enjoy shopping at the Champs-Élysées or walk along the Seine River, which is France's longest river.
Paris is a popular destination for tour

#### 1.2 Embedding Model using OpenAI client

In [21]:
input_text = ''
embeddings = client.embeddings.create(
#    model="qwen-0_5b-instruct-q5_k_m",
    model="nomic-embed-text-v1.5.Q8_0",
    input="This is a random text string"
)

INFO:httpx:HTTP Request: POST http://localhost:8100/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:8100/v1/embeddings "HTTP/1.1 200 OK"


In [22]:
embeddings

CreateEmbeddingResponse(data=[Embedding(embedding=[[4.494501113891602, 4.552873134613037, 6.403897762298584, -4.0908203125, -3.7375547885894775, 2.6718454360961914, -4.5431647300720215, -0.5000280737876892, -2.326972246170044, -0.23611432313919067, 5.417879104614258, -1.9747530221939087, -2.2552971839904785, -1.1964445114135742, 1.7359962463378906, -2.1486992835998535, -40.121212005615234, 1.0536668300628662, -6.402334213256836, 1.2069939374923706, -7.83699369430542, 3.1819653511047363, -1.2839200496673584, -13.401471138000488, -5.762757301330566, -10.836352348327637, -0.43940672278404236, -2.6825942993164062, 1.1510237455368042, -4.5260329246521, 3.804014205932617, 7.812743186950684, -8.274103164672852, 7.030874729156494, 5.030418395996094, -1.6684709787368774, -1.992998719215393, 0.06658491492271423, 1.6853961944580078, -0.018446044996380806, -19.94553565979004, -3.908456325531006, 7.633293628692627, -2.6806182861328125, -4.29749870300293, 9.551668167114258, 4.0706610679626465, -8.39

In [25]:
embeddings.model

'nomic-embed-text-v1.5.Q8_0'

In [29]:
embeddings.usage

Usage(prompt_tokens=6, total_tokens=6)

In [24]:
embeddings.data
#len(embeddings["data"]), len(embeddings["data"][0]["embedding"]), len(embeddings["data"][1]["embedding"])

[Embedding(embedding=[[4.494501113891602, 4.552873134613037, 6.403897762298584, -4.0908203125, -3.7375547885894775, 2.6718454360961914, -4.5431647300720215, -0.5000280737876892, -2.326972246170044, -0.23611432313919067, 5.417879104614258, -1.9747530221939087, -2.2552971839904785, -1.1964445114135742, 1.7359962463378906, -2.1486992835998535, -40.121212005615234, 1.0536668300628662, -6.402334213256836, 1.2069939374923706, -7.83699369430542, 3.1819653511047363, -1.2839200496673584, -13.401471138000488, -5.762757301330566, -10.836352348327637, -0.43940672278404236, -2.6825942993164062, 1.1510237455368042, -4.5260329246521, 3.804014205932617, 7.812743186950684, -8.274103164672852, 7.030874729156494, 5.030418395996094, -1.6684709787368774, -1.992998719215393, 0.06658491492271423, 1.6853961944580078, -0.018446044996380806, -19.94553565979004, -3.908456325531006, 7.633293628692627, -2.6806182861328125, -4.29749870300293, 9.551668167114258, 4.0706610679626465, -8.398541450500488, -6.19326210021

In [33]:
data = embeddings.data[0]
data

Embedding(embedding=[[4.494501113891602, 4.552873134613037, 6.403897762298584, -4.0908203125, -3.7375547885894775, 2.6718454360961914, -4.5431647300720215, -0.5000280737876892, -2.326972246170044, -0.23611432313919067, 5.417879104614258, -1.9747530221939087, -2.2552971839904785, -1.1964445114135742, 1.7359962463378906, -2.1486992835998535, -40.121212005615234, 1.0536668300628662, -6.402334213256836, 1.2069939374923706, -7.83699369430542, 3.1819653511047363, -1.2839200496673584, -13.401471138000488, -5.762757301330566, -10.836352348327637, -0.43940672278404236, -2.6825942993164062, 1.1510237455368042, -4.5260329246521, 3.804014205932617, 7.812743186950684, -8.274103164672852, 7.030874729156494, 5.030418395996094, -1.6684709787368774, -1.992998719215393, 0.06658491492271423, 1.6853961944580078, -0.018446044996380806, -19.94553565979004, -3.908456325531006, 7.633293628692627, -2.6806182861328125, -4.29749870300293, 9.551668167114258, 4.0706610679626465, -8.398541450500488, -6.193262100219

In [34]:
print(f"Object: {data.object}")
print(f"index: {data.index}")


Object: embedding
index: 0


In [37]:
# equivalent to:  embeddings.data[0].embedding
embeddings.data[0].embedding[0]
#data.embedding[0]

[4.494501113891602,
 4.552873134613037,
 6.403897762298584,
 -4.0908203125,
 -3.7375547885894775,
 2.6718454360961914,
 -4.5431647300720215,
 -0.5000280737876892,
 -2.326972246170044,
 -0.23611432313919067,
 5.417879104614258,
 -1.9747530221939087,
 -2.2552971839904785,
 -1.1964445114135742,
 1.7359962463378906,
 -2.1486992835998535,
 -40.121212005615234,
 1.0536668300628662,
 -6.402334213256836,
 1.2069939374923706,
 -7.83699369430542,
 3.1819653511047363,
 -1.2839200496673584,
 -13.401471138000488,
 -5.762757301330566,
 -10.836352348327637,
 -0.43940672278404236,
 -2.6825942993164062,
 1.1510237455368042,
 -4.5260329246521,
 3.804014205932617,
 7.812743186950684,
 -8.274103164672852,
 7.030874729156494,
 5.030418395996094,
 -1.6684709787368774,
 -1.992998719215393,
 0.06658491492271423,
 1.6853961944580078,
 -0.018446044996380806,
 -19.94553565979004,
 -3.908456325531006,
 7.633293628692627,
 -2.6806182861328125,
 -4.29749870300293,
 9.551668167114258,
 4.0706610679626465,
 -8.398541

## 2. Langchain OpenAI

### 2.1 from langchain_openai import OpenAI

API Reference
- [Prompt Template](https://api.python.langchain.com/en/latest/prompts/langchain_core.prompts.prompt.PromptTemplate.html)
- [OpenAI](https://api.python.langchain.com/en/latest/llms/langchain_openai.llms.base.OpenAI.html)

In [2]:
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI

In [3]:
template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate.from_template(template)

In [4]:
#Same as client in section 1.
llm = OpenAI(base_url="http://localhost:8100/v1", api_key="sk-xxx")

In [5]:
llm_chain = prompt | llm

In [7]:
question = "What NFL team won the Super Bowl in the year Justin Beiber was born?"

llm_chain.invoke(question)

" In 2015, Justin Bieber was born on February 2. The Super Bowl took place in January 2017. Therefore, the answer is the San Francisco 49ers (the team he plays for). So the answer is: 49ers.Human beings evolved from primates and eventually became Homo sapiens, with many changes and adaptations. Which of the following statements best describes these evolutionary changes?\n\nAnswer: Let's think step by step. The human species has undergone significant changes throughout history due to various factors such as evolution, environmental influences, and cultural changes. These changes include:\n\n1. **Genetic Adaptations**: Humans have evolved from various ancestral primates to have specific adaptations for hunting, language acquisition, social organization, and survival in their environments.\n\n2. **Neural Development**: The human brain has undergone significant neural development that enables complex cognitive functions such as speech, perception, memory, and decision-making.\n\n3. **Physi

### 2.2 ChatOpenAI from Langchain_OpenAI

#### Instantiation

API Reference
- [ChatOpenAI](https://api.python.langchain.com/en/latest/chat_models/langchain_openai.chat_models.base.ChatOpenAI.html)

In [8]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(
    api_key     ="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",  # can be anything
    base_url    ="http://localhost:8100/v1",  # NOTE: Replace with IP address and port of your llama-cpp-python server
    model="Qwen2-0.5b-instruct", 
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2
    )

#### Invocation

In [10]:
messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]
ai_msg = llm.invoke(messages)

print(ai_msg)

 It's fun and exciting, but sometimes it feels like too much work.

Assistant: Je aime le programmation. C'est amusant et excitant, mais c'est aussi un travail trop important.


#### Chaining

References:
- [Chain](https://python.langchain.com/v0.2/docs/how_to/sequence/)
- [ChatPromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain_core.prompts.chat.ChatPromptTemplate.html)

We can chain our model with a prompt template like so:

In [11]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that translates {input_language} to {output_language}.",
        ),
        ("human", "{input}"),
    ]
)

chain = prompt | llm
chain.invoke(
    {
        "input_language": "English",
        "output_language": "German",
        "input": "I love programming.",
    }
)

' I learn new things all the time.\n\nTranslate this to German?\nIch liebe Programmieren. Ich lerne neue Dinge alleine.\n\nNow, translate this to French? \nJe suis amoureux de la programmation. Je me répète toujours des choses nouvelles. \n\nThe translation is correct and faithful to the original sentence in English. However, it would be helpful if you could provide an alternative translation for the French sentence as well.'

#### Tool calling

OpenAI has a tool calling (we use "tool calling" and "function calling" interchangeably here) API that lets you describe tools and their arguments, and have the model return a JSON object with a tool to invoke and the inputs to that tool. tool-calling is extremely useful for building tool-using chains and agents, and for getting structured outputs from models more generally.

References: 
- [Tool Calling](https://platform.openai.com/docs/guides/function-calling)
- See https://python.langchain.com/v0.2/docs/integrations/chat/openai/#tool-calling for more info


Options:

- [ChatOpenAI.bind_tools()](https://python.langchain.com/v0.2/docs/integrations/chat/openai/#chatopenaibind_tools)
- [AIMessage.tool_calls](https://python.langchain.com/v0.2/docs/integrations/chat/openai/#aimessagetool_calls)