In [2]:
# Note: openai.embeddings_utils and openai.ChatCompletion require openai<1.0.
import os
import pandas as pd
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity
import tiktoken
import pypdf

EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

First, we extract each page of the coffee machine manual PDF into its own text file.

In [None]:
os.makedirs('articles', exist_ok=True)  # make sure the output directory exists
with open('/workspaces/search-embedding-example/anleitung.pdf', 'rb') as pdf_file:
    pdf_reader = pypdf.PdfReader(pdf_file)
    # Write the extracted text of each page to its own file.
    for page_number, page in enumerate(pdf_reader.pages):
        text = page.extract_text()
        with open(f'articles/page_{page_number}.txt', 'w') as f:
            f.write(text)

The extracted text files are used to create our embeddings. For this project, each embedding is stored alongside its source text in a pandas DataFrame.

In [3]:
embeddings = []
files = os.listdir("articles")
for file in files:
    with open(f"articles/{file}", "r") as f:
        text = f.read()
    # One embedding per page; keep the text next to its vector.
    embedding = get_embedding(text, engine=EMBEDDING_MODEL)
    embeddings.append({"text": text, "embedding": embedding})
df = pd.DataFrame(embeddings)

Now we can compute an embedding for a search query and use cosine similarity to find the pages most closely related to it.
By default, the function returns the 3 pages most similar to the search query, together with their similarity scores.
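To make the ranking step concrete, here is a minimal sketch of cosine similarity in plain Python. The 3-dimensional vectors and the names `cosine_sim`, `toy_pages`, and `toy_query` are made up for illustration; real `text-embedding-ada-002` vectors have 1536 dimensions, and the notebook itself relies on the library's `cosine_similarity` helper rather than this function.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", one per page (illustration only).
toy_pages = {
    "grinding": [1.0, 0.0, 0.0],
    "cleaning": [0.0, 1.0, 0.0],
    "descaling": [0.0, 0.6, 0.8],
}
toy_query = [0.9, 0.1, 0.0]

# Rank page names by similarity to the query, most similar first.
ranked = sorted(
    toy_pages,
    key=lambda name: cosine_sim(toy_pages[name], toy_query),
    reverse=True,
)
print(ranked[:2])
```

Vectors pointing in nearly the same direction score close to 1, while orthogonal vectors score 0, so sorting in descending order puts the most related pages first.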

In [7]:
def search_pages(df, search_query, n=3):
    """Return the n page texts most similar to the query, with their similarity scores."""
    search_query_embedding = get_embedding(
        search_query,
        engine=EMBEDDING_MODEL
    )
    # Score every page against the query without modifying the caller's DataFrame.
    similarities = df.embedding.apply(lambda x: cosine_similarity(x, search_query_embedding))
    results = (
        df.assign(similarity=similarities)
        .sort_values("similarity", ascending=False)
        .head(n)
    )
    return results.text.tolist(), results.similarity.tolist()


strings, relatednesses = search_pages(df, "mahlgrad einstellen", n=3)
print(strings, relatednesses)

['18Mahlgrad einstellen\nVORSICHT  – Sachschäden\n• Die Einstellung des Mahlgrades darf nur bei laufendem Mahlwerk \n vorgenommen werden.\n• Verändern Sie den Mahlgrad nur in kleinen Stufen und beobachten Sie \ndie geschmacklichen Veränderungen nach 1 - 2 Tassen Kaffee, bevor \nSie den Mahlgrad erneut verändern.\nVoraussetzungen:\nDie Maschine ist eingeschaltet und einsatzbereit.\n1.  Schieben Sie den Kaffee-\nauslauf ggf. nach unten  \noder oben.\n2.  Stellen Sie ein leeres Gefäß \nunter den Kaffeeauslauf.\n3.  Nehmen Sie die Abdeckung \nvom  Kaffeebohnenbehälter ab.\n  Im Kaffeebohnenbehälter  \nbefindet sich der Mahlgrad -\nregler. Er ist werkseitig auf \nStufe 2  eingestellt.Mahlgrad gröber einstellen\nStellen Sie den Mahlgrad gröber ein, wenn der Kaffee schneller fließen \nsoll, z.B. weil der Kaffee Ihnen zu stark schmeckt.\n4.  Drücken Sie eine der beiden \nGetränke-Tasten (Espresso \noder Caffè Crema).\n5.  Während der Kaffee gemahlen \nwird, drehen Sie den Mahl -\ngradregler im

With the embedding search in place, we can construct the prompt that is sent to the OpenAI GPT model. We first run the search query against our embeddings DataFrame and include the top results in the prompt, so that the model receives information from the manual along with the user's question. To limit the number of tokens used, each page from the search results is checked against a token budget before it is added: if the prompt including that page still fits, the page is appended; otherwise, no further pages are added.
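The budget check can be sketched in isolation as follows. This is an illustrative sketch only: `count_tokens_approx` is a hypothetical stand-in that counts whitespace-separated words, whereas the notebook counts real tokens with `tiktoken`, and `pack_sections` is a made-up name.

```python
def count_tokens_approx(text):
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def pack_sections(introduction, sections, question, token_budget,
                  count_tokens=count_tokens_approx):
    """Append sections in ranked order until the next one would exceed the budget."""
    message = introduction
    for section in sections:
        candidate = f"\n\nArticle section:\n{section}"
        if count_tokens(message + candidate + question) > token_budget:
            break  # stop at the first section that does not fit
        message += candidate
    return message + question


demo = pack_sections(
    introduction="Answer using the articles below.",
    sections=["one two three", "four five six seven", "eight"],
    question="\n\nQuestion: how do I descale?",
    token_budget=16,
)
print(demo)
```

Because the loop stops at the first section that does not fit, later and possibly shorter sections are not tried. This keeps the prompt limited to the highest-ranked pages.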

As an additional instruction, the prompt asks the model to add a short advertisement for a specific brand of coffee beans whenever the question explicitly mentions coffee or beans.


In [8]:
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, _ = search_pages(df, query, n=3)
    introduction = (
        'Use the below articles on the Tchibo coffee machine to answer the subsequent question. '
        'If the answer cannot be found in the articles, write "I am sorry, but I could not find an answer." '
        'Before you evaluate the subsequent question and the articles, translate the subsequent question and all the articles into English. '
        'Afterwards, return your answer in English as well. '
        'If the question explicitly mentions coffee or beans, please add a little advertisement to your answer for the delicious "Tchibo Barista Caffè Crema" coffee beans.'
    )
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nArticle section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answer a query using GPT and the most relevant manual pages from the dataframe."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the Tchibo coffee machine."},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message

In [11]:
ask("how do I clean the machine?")

"To clean the Tchibo coffee machine, the machine's housing, water tank, filter in the water tank, coffee grounds container, residual water tray, drip tray, and brewing unit should be cleaned daily, weekly, or as needed. The housing should be wiped with a soft, damp cloth, and the water tank should be cleaned with dish soap and rinsed thoroughly under running water. The particle filter in the water tank can be removed to remove deposits. The brewing unit should be cleaned by pressing the cover flap in the lower area, holding the two orange buttons firmly, and pulling the brewing unit straight out of the machine. The brewing unit should be cleaned under running warm water and allowed to dry before being reinserted into the machine. The machine should be rinsed by running two cups of water through it after the first use or if it has not been used for more than two days."

Even though the manual contains no information on how to make a cappuccino, the answer still includes the little advertisement for the coffee beans, along with a separate, generic cappuccino instruction.

In [12]:
ask("how do I make a cappuccino?")

'I am sorry, but I could not find an answer. The provided articles do not mention how to make a cappuccino with the Tchibo coffee machine. However, you can use the machine to make a delicious "Tchibo Barista Caffè Crema" coffee, which can be enjoyed on its own or used as a base for other coffee drinks. Simply follow the instructions in article section 14 to prepare a single or double shot of espresso or Caffè Crema. Then, froth milk separately and add it to the coffee to make a cappuccino.'