# Retrieval Augmented Generation from Scratch

![image.png](https://miro.medium.com/v2/resize:fit:1400/format:webp/0*WYv0_CaBmCTt7FXc)
Image source: https://gradientflow.com/techniques-challenges-and-future-of-augmented-language-models/

Retrieval Augmented Generation (RAG) refers to the set of techniques that aim to **contextualize** the prompts to an LLM using information from existing sources.

The basic steps to create a RAG pipeline are:

1. Data ingestion: Consume data from a set of *textual* sources (webpages, PDFs, etc.)
2. Text chunking: Split the consumed text into chunks of manageable length
3. Vector database indexing: Embed each chunk and store the embeddings in a vector database
4. Similarity retrieval: Retrieve the most relevant texts based on the user's query
5. Prompt formation: Format the prompt to the LLM with the collected passages
6. LLM query: Query the LLM using the contextualized prompt

In [None]:
%%capture

!pip install openai pymupdf tqdm chromadb spacy sentence-transformers
!python -m spacy download el_core_news_sm
!python -m spacy download en_core_web_sm

## 1. Data ingestion

Fetch one PDF of **your** choice. We will parse it manually using `pymupdf` to extract the text.

If you want to ingest HTML documents from URLs, you can follow this tutorial: https://docs.llamaindex.ai/en/stable/examples/data_connectors/WebPageDemo/
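
For simple pages, a dependency-free alternative is to strip the markup yourself. The sketch below is only an illustration using Python's standard `html.parser`; the `TextExtractor` class and `html2text` helper are hypothetical names, not part of this tutorial's pipeline, and real webpages usually need more careful cleaning.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html2text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>RAG</h1><p>Retrieval <b>augmented</b> generation.</p></body></html>"
)
print(html2text(sample_html))
```

The resulting string can then go through the same preprocessing and chunking steps as the PDF text.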

In [None]:
import re
import os
import urllib.request
import uuid
import shutil
import fitz
from tqdm.autonotebook import tqdm


In [None]:
# Basic functions to download and extract text from a pdf

def download_pdf(url, output_path):
    os.makedirs(output_path, exist_ok=True)
    local_pdf = f"{output_path}/{uuid.uuid4().hex}.pdf"
    if url == output_path:
        return
    try:
        urllib.request.urlretrieve(url, local_pdf)
    except ValueError:
        # urlretrieve rejects strings without a URL scheme; treat the input as a local file path
        shutil.copy(url, local_pdf)
    return local_pdf


def preprocess(text):
    text = text.replace("-\n", "")  # no word breaks
    text = text.replace("\n", " ")
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\.+", ".", text)
    return text


def pdf2text(path, start_page=0, end_page=-1):
    print("Parsing PDF")
    doc = fitz.open(path)
    total_pages = doc.page_count
    print(f"PDF contains {total_pages} pages")
    if end_page <= 0:
        end_page = total_pages

    text_list = []
    for i in tqdm(
        range(start_page, end_page),
        desc=f"Converting PDF to text. Pages: {start_page}-{end_page}",
    ):
        text = doc.load_page(i).get_text("text")
        text = preprocess(text)
        text_list.append(text)
    doc.close()
    return " ".join(text_list)

In [None]:
input_pdf = "https://aclanthology.org/N19-1423.pdf"
local_pdf = download_pdf(input_pdf, "data")

text = pdf2text(local_pdf, start_page=0, end_page=-1)

print(text)

# Greek example
input_pdf_greek = "http://ebooks.edu.gr/ebooks/d/8547/5306/22-0081-02_Istoria-tou-Neoterou-kai-Sygchronou-Kosmou_G-Lykeiou_Vivlio-Mathiti.pdf"
local_pdf_greek = download_pdf(input_pdf_greek, "data")


text_greek = pdf2text(local_pdf_greek, start_page=0, end_page=-1)
print(text_greek)


Parsing PDF
PDF contains 16 pages


Converting PDF to text. Pages: 0-16:   0%|          | 0/16 [00:00<?, ?it/s]

Proceedings of NAACL-HLT 2019, pages 4171–4186 Minneapolis, Minnesota, June 2 - June 7, 2019. c⃝2019 Association for Computational Linguistics 4171 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova Google AI Language {jacobdevlin,mingweichang,kentonl,kristout}@google.com Abstract We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be ﬁnetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspeciﬁc architectur

Converting PDF to text. Pages: 0-258:   0%|          | 0/258 [00:00<?, ?it/s]

ΙΝΣΤΙΤΟΥΤΟ ΤΕΧΝΟΛΟΓΙΑΣ ΥΠΟΛΟΓΙΣΤΩΝ ΚΑΙ ΕΚΔΟΣΕΩΝ «ΔΙΟΦΑΝΤΟΣ» Ιστορία του νεότερου και του σύγχρονου κόσμου (από το 1815 έως σήμερα) Γ΄ ΓΕΝΙΚΟΥ ΛΥΚΕΙΟΥ ΥΠΟΥΡΓΕΙΟ ΠΑΙΔΕΙΑΣ, ΘΡΗΣΚΕΥΜΑΤΩΝ ΚΑΙ ΑΘΛΗΤΙΣΜΟΥ ΙΝΣΤΙΤΟΥΤΟ ΕΚΠΑΙΔΕΥΤΙΚΗΣ ΠΟΛΙΤΙΚΗΣ Ομάδας Προσανατολισμού Θετικών Σπουδών, Σπουδών Υγείας και Σπουδών Οικονομίας & Πληροφορικής Ιστορία του νεότερου και του σύγχρονου κόσμου (από το 1815 έως σήμερα)  ΙΣΤOPIA TOY ΝΕΟΤΕΡΟΥ ΚΑΙ TOY ΣΥΓΧΡΟΝΟΥ ΚΟΣΜΟΥ  ΣΤΟΙΧΕΙΑ ΑΡΧΙΚΗΣ ΕΚ∆ΟΣΗΣ ΣΤΟΙΧΕΙΑ ΕΠΑΝΕΚ∆ΟΣΗΣ Η επανέκδοση του παρόντος βιβλίου πραγματοποιήθηκε από το Ινστιτούτο Τεχνολογίας Υπολογιστών & Εκδόσεων «Διόφαντος» μέσω ψηφιακής μακέτας, η οποία δημιουργήθηκε με χρηματοδότηση από το ΕΣΠΑ / ΕΠ «Εκπαίδευση & Διά Βίου Μάθηση» / Πράξη «ΣΤΗΡΙΖΩ». Οι διορθώσεις πραγματοποιήθηκαν κατόπιν έγκρισης του Δ.Σ. του Ινστιτούτου Εκπαιδευτικής Πολιτικής ΣΥΓΓΡΑΦΕΙΣ Ιωάννης Κολιόπουλος Καθηγητής του Πανεπιστημίου Θεσσαλονίκης Κωνσταντίνος Σβολόπουλος Ακαδημαϊκός Καθηγητής του Πανεπιστημίου Αθηνών Ευάνθης Χατζηβασιλείου

In [None]:
text

'Proceedings of NAACL-HLT 2019, pages 4171–4186 Minneapolis, Minnesota, June 2 - June 7, 2019. c⃝2019 Association for Computational Linguistics 4171 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova Google AI Language {jacobdevlin,mingweichang,kentonl,kristout}@google.com Abstract We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be ﬁnetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspeciﬁc architectu

In [None]:
text_greek

'ΙΝΣΤΙΤΟΥΤΟ ΤΕΧΝΟΛΟΓΙΑΣ ΥΠΟΛΟΓΙΣΤΩΝ ΚΑΙ ΕΚΔΟΣΕΩΝ «ΔΙΟΦΑΝΤΟΣ» Ιστορία του νεότερου και του σύγχρονου κόσμου (από το 1815 έως σήμερα) Γ΄ ΓΕΝΙΚΟΥ ΛΥΚΕΙΟΥ ΥΠΟΥΡΓΕΙΟ ΠΑΙΔΕΙΑΣ, ΘΡΗΣΚΕΥΜΑΤΩΝ ΚΑΙ ΑΘΛΗΤΙΣΜΟΥ ΙΝΣΤΙΤΟΥΤΟ ΕΚΠΑΙΔΕΥΤΙΚΗΣ ΠΟΛΙΤΙΚΗΣ Ομάδας Προσανατολισμού Θετικών Σπουδών, Σπουδών Υγείας και Σπουδών Οικονομίας & Πληροφορικής Ιστορία του νεότερου και του σύγχρονου κόσμου (από το 1815 έως σήμερα)  ΙΣΤOPIA TOY ΝΕΟΤΕΡΟΥ ΚΑΙ TOY ΣΥΓΧΡΟΝΟΥ ΚΟΣΜΟΥ  ΣΤΟΙΧΕΙΑ ΑΡΧΙΚΗΣ ΕΚ∆ΟΣΗΣ ΣΤΟΙΧΕΙΑ ΕΠΑΝΕΚ∆ΟΣΗΣ Η επανέκδοση του παρόντος βιβλίου πραγματοποιήθηκε από το Ινστιτούτο Τεχνολογίας Υπολογιστών & Εκδόσεων «Διόφαντος» μέσω ψηφιακής μακέτας, η οποία δημιουργήθηκε με χρηματοδότηση από το ΕΣΠΑ / ΕΠ «Εκπαίδευση & Διά Βίου Μάθηση» / Πράξη «ΣΤΗΡΙΖΩ». Οι διορθώσεις πραγματοποιήθηκαν κατόπιν έγκρισης του Δ.Σ. του Ινστιτούτου Εκπαιδευτικής Πολιτικής ΣΥΓΓΡΑΦΕΙΣ Ιωάννης Κολιόπουλος Καθηγητής του Πανεπιστημίου Θεσσαλονίκης Κωνσταντίνος Σβολόπουλος Ακαδημαϊκός Καθηγητής του Πανεπιστημίου Αθηνών Ευάνθης Χατζηβασιλείο

## 2. Chunking

The parsed document is a long text string. We want to chunk it into multiple shorter segments, in order to adhere to model context sizes and have a predictable token consumption.

The most popular chunking algorithms are:

1. Sentence-based chunking: Split the document into sentences. Each chunk corresponds to 1 sentence.
2. Fixed-size chunking: Split the document into chunks of (nearly) equal size (e.g., aim for a balanced number of tokens).
3. Semantic chunking: First split into sentences, then merge chunks that are semantically coherent using embedding similarity.
4. Document-specific chunking: For some document types (e.g., Markdown, HTML, code), use the known structure to split the document.
5. Agent chunking: Use an LLM to do the heavy lifting and decide how best to split the document. Not yet production ready.

In this tutorial we implement a combination of 1 and 2. First we split the document into sentences and then we merge adjacent sentences until we reach a maximum number of tokens per chunk.
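
To make the merging logic concrete before bringing in spaCy, here is a minimal sketch of the same idea. It makes simplifying assumptions that differ from the tutorial code: sentences are split with a naive regular expression, tokens are counted by whitespace, and a running token count is kept instead of re-tokenizing each candidate chunk. The name `simple_chunk` is hypothetical.

```python
import re


def simple_chunk(text, chunk_size=8):
    # Naive sentence split on ., ! or ? followed by whitespace (spaCy is far more robust)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    current, current_len = [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and current_len + n_tokens > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "BERT is a language model. It uses masked tokens. It is bidirectional. GPT is left to right."
for chunk in simple_chunk(demo, chunk_size=8):
    print(repr(chunk))
```

A sentence that is longer than `chunk_size` on its own still becomes a chunk, so the limit is a target rather than a hard guarantee.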

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
# Greek spaCy model
nlp_greek = spacy.load("el_core_news_sm")



In [None]:
def chunk_text(text, spacy_nlp=nlp, chunk_size=128):
    print("Split document into sentences using spacy")
    sentences = [sent.text for sent in spacy_nlp(text).sents]  # use the model passed in, not the global English one
    print(f"Document contains #{len(sentences)} sentences")

    print(f"Merge consecutive sentences up to chunk size = {chunk_size}")
    chunks = [sentences[0]]
    for index in tqdm(range(1, len(sentences)), desc="Chunking text"):
        potential_chunk = "\n".join([chunks[-1], sentences[index]])
        num_tokens_in_chunk = len([token for token in spacy_nlp(potential_chunk)])
        if num_tokens_in_chunk < chunk_size:
            chunks[-1] = potential_chunk
        else:
            chunks.append(sentences[index])

    print(f"Created #{len(chunks)} chunks")
    return chunks

In [None]:
chunks = chunk_text(text, spacy_nlp=nlp, chunk_size=128)

Split document into sentences using spacy
Document contains #636 sentences
Merge consecutive sentences up to chunk size = 128


Chunking text:   0%|          | 0/635 [00:00<?, ?it/s]

Created #118 chunks


In [None]:
print(chunks[8])

Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations.
This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.
•
We show that pre-trained representations reduce the need for many heavily-engineered taskspeciﬁc architectures.


In [None]:
chunks_greek = chunk_text(text_greek, spacy_nlp=nlp_greek, chunk_size=128)

Split document into sentences using spacy


KeyboardInterrupt: 

## 3. Vector DB indexing

Next we explore embeddings, vector databases, and the indexing process used to organize and manage embeddings efficiently.

### What are Embeddings?

Embeddings are numerical representations of text, converted into vectors in a high-dimensional space. These vectors encapsulate the semantic meaning of the text inputs, allowing for the comparison of different chunks of text based on their meanings rather than just their literal content. By capturing semantic information, embeddings enable advanced tasks such as similarity searches, clustering, and various types of machine learning applications.
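
As a toy illustration of comparing texts by their vectors, the sketch below computes cosine similarity in plain Python. The three-dimensional vectors are made up for the example; real embedding models produce hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings"
cat = [1.0, 0.9, 0.0]
kitten = [0.9, 1.0, 0.1]
car = [0.0, 0.1, 1.0]

print("cat vs kitten:", cosine_similarity(cat, kitten))
print("cat vs car:", cosine_similarity(cat, car))
```

Vectors pointing in similar directions score close to 1, while unrelated ones score close to 0.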

There are multiple models to extract embeddings from text. Some APIs offer embedding models (e.g., OpenAI), or you can use a local model. The code below supports two multilingual models through `sentence-transformers`: `paraphrase-multilingual-MiniLM-L12-v2` (384-dimensional embeddings), which offers a good balance between performance and computational efficiency, and `BAAI/bge-m3` (1024-dimensional embeddings), which is the default set in the code.

### What is a Vector Database?

A vector database is a specialized type of database designed to store, manage, and query vector representations of data. Unlike traditional databases that store structured data, vector databases are optimized for handling high-dimensional vectors, making them ideal for applications that require efficient similarity searches and other operations on embeddings. They provide the necessary infrastructure to perform fast and scalable vector operations, such as nearest neighbor searches, which are crucial for tasks like semantic search and recommendation systems.

#### Indexing Process

1. Generating Embeddings: Each text chunk is transformed into a vector representation (embedding) using a pre-trained model. This model could be based on neural networks such as BERT, GPT, or similar architectures, which are designed to capture the semantic nuances of the text.
2. Storing Embeddings: The generated embeddings are stored in the vector database. Alongside the embeddings, metadata (such as the original text chunk) is often stored to facilitate easy retrieval and interpretation of the results.
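
The two steps above can be sketched with a brute-force, in-memory store. Everything here is a hypothetical stand-in: `ToyVectorStore` is not a ChromaDB class, and `bag_of_words` replaces the neural embedding model with word counts over a tiny vocabulary. A real vector database adds approximate nearest-neighbor indexes and persistence on top of this idea.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat an all-zero vector as maximally distant instead of dividing by zero
    return 1.0 - dot / norm if norm else 1.0


class ToyVectorStore:
    """Keeps (id, embedding, document) records and scans all of them on query."""

    def __init__(self, embed):
        self.embed = embed
        self.records = []

    def add(self, documents, ids):
        for doc_id, doc in zip(ids, documents):
            self.records.append((doc_id, self.embed(doc), doc))

    def query(self, text, n_results=1):
        q = self.embed(text)
        ranked = sorted(self.records, key=lambda r: cosine_distance(q, r[1]))
        return [doc for _, _, doc in ranked[:n_results]]


VOCAB = ["bert", "masked", "token", "vienna", "congress", "peace"]


def bag_of_words(text):
    # Hypothetical embedder: counts of each vocabulary term in the text
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


store = ToyVectorStore(bag_of_words)
store.add(
    documents=["bert predicts each masked token", "the congress of vienna restored peace"],
    ids=["id0", "id1"],
)
print(store.query("what is a masked token", n_results=1))
```

Storing the original text next to each embedding is what lets the query return readable passages rather than raw vectors.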


In this tutorial we use ChromaDB as a vector database, mainly due to its simplicity and ability to run through a file (similar to sqlite) without the need of an external server.




In [None]:
import chromadb
from chromadb.api.types import Documents, EmbeddingFunction, Embeddings

In [None]:
class MultilingualSentenceTransformer(EmbeddingFunction):
    """ Create a ChromaDB embedding function based on the chosen model """
    def __init__(
        self,
        # model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",  # 384d.
        model="BAAI/bge-m3",  # 1024d.
    ):
        self.model = model
        self.embedder = self.initialize_model(model)

    def initialize_model(self, model):
        from sentence_transformers import SentenceTransformer

        embedder = SentenceTransformer(model)  # no explicit device: uses a GPU if one is available, else the CPU
        return embedder

    def __call__(self, sentences: Documents) -> Embeddings:
        return self.embedder.encode(sentences, convert_to_numpy=True).tolist()  # type: ignore


def batchify(iterable, n=1):
    """ For efficient embedding extraction """
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx : min(ndx + n, l)]


def create_collection(chunks, collection_name="rag_from_scratch", batch_size=12):
    print("Create and index vector store")

    client = chromadb.EphemeralClient()  # DB is created in memory. Use persistent client for a persistent DB
    # client = chromadb.PersistentClient("./chroma")
    embedding_function = MultilingualSentenceTransformer()
    collection = client.create_collection(
        collection_name,
        metadata={"hnsw:space": "cosine"},
        embedding_function=embedding_function,
    )

    id_num = 0
    for batch in tqdm(
        batchify(chunks, n=batch_size),
        desc="Indexing vector database",
        total=(len(chunks) + batch_size - 1) // batch_size,  # round up so a final partial batch is counted
    ):
        texts = batch
        collection.add(
            documents=texts,  # type: ignore
            ids=[f"id{idx}" for idx in range(id_num, id_num + len(texts))],
        )
        id_num += len(texts)

    return collection

In [None]:
COLLECTION_NAME = "rag_from_scratch_en3"
collection = create_collection(chunks, collection_name=COLLECTION_NAME)

Create and index vector store


Indexing vector database:   0%|          | 0/9 [00:00<?, ?it/s]

In [None]:
COLLECTION_NAME_GREEK = "rag_from_scratch_greek3"
collection_greek = create_collection(chunks_greek, collection_name=COLLECTION_NAME_GREEK)

Create and index vector store


Indexing vector database:   0%|          | 0/102 [00:00<?, ?it/s]

## 4. Similarity-based retrieval

After indexing the database we can retrieve semantically relevant chunks for user prompts. We can configure the `n_results` parameter to determine how many relevant results we want to obtain. Balancing the chunk size and the number of retrieved results is a key aspect of a well-performing RAG system.
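
A rough way to reason about this balance is to estimate how many tokens the retrieved context can add to the prompt. The helper below is a hypothetical illustration, not part of the pipeline. It assumes every retrieved chunk holds at most `chunk_size` tokens, which is true for the chunking used here unless a single sentence is longer than that. These are spaCy tokens, so the LLM's own tokenizer will report different counts.

```python
def max_context_tokens(chunk_size, n_results):
    # Estimate: every retrieved chunk is completely full
    return chunk_size * n_results


for chunk_size, n_results in [(128, 4), (256, 4), (128, 8)]:
    budget = max_context_tokens(chunk_size, n_results)
    print(f"chunk_size={chunk_size}, n_results={n_results} -> up to {budget} context tokens")
```

Doubling either setting doubles the context budget, but larger chunks give fewer, coarser passages while more results give many small ones.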

Here we have configured our ChromaDB collection to retrieve relevant chunks based on the cosine similarity between the query and chunk embeddings.

Let's see an example:

In [None]:
import pprint

In [None]:
def retrieve_relevant_passages(query, chroma_collection=collection, n_results=4):
    # Use the caller's query and n_results instead of hardcoded values
    result = chroma_collection.query(
        query_texts=[query],
        n_results=n_results,
    )
    return result["documents"][0]

In [None]:
passages = retrieve_relevant_passages("What is MLM?", chroma_collection=collection)

for passage in passages:
    print(passage)
    print("--------------------------------------")


former is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.
In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens.
We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953).
--------------------------------------
In the table, MASK means that we replace the target token with the [MASK] symbol for MLM; SAME means that we keep the target token as is; RND means that we replace the target token with another random token.
The numbers in the left part of the table represent the probabilities of the speciﬁc strategies used during MLM pre-training (BERT uses 80%, 10%, 10%).
The right part of the paper represents the Dev set results.
For the feature-based approach, we concatenate the last 4 layers of BERT as t

In [None]:
# For the Greek example (query: "Who was Joseph Vissarionovich Dzhugashvili?")
passages_greek = retrieve_relevant_passages("Ποιος ήταν ο Ιωσήφ Βησσαριόνοβιτς Τζουγκασβίλι;", chroma_collection=collection_greek)

for passage in passages_greek:
    print(passage)
    print("--------------------------------------")


Με στόχο την εκτόνωση της κατάστασης η κυβέρνηση προωθούσε ορισμένες μεταρρυθμίσεις, επέμεινε όμως στη συνέχιση του πολέμου και διακήρυξε ότι μία και αδιαίρετη ήταν η Ρωσία, αποκλείοντας οποιεσδήποτε παραχωρήσεις προς τις διάφορες εθνότητες.
Η πολιτική αυτή ευνοούσε τη θέση της πλειονότητας των σοσιαλιστών, των Μπολσεβίκων, οι οποίοι απαιτούσαν την άμεση κατάπαυση των εχθροπραξιών, την ελευθερία των εθνοτήτων, την εθνικοποίηση των γαιών, των μεγάλων επιχειρήσεων και των τραπεζών, καθώς και τον έλεγχο της βιομηχανικής παραγωγής από τους εργάτες.
Ο Βλαντιμίρ Ίλιτς Ουλιάνοφ (1870-1924), που έλαβε το ψευδώνυμο Λένιν, υπήρξε ο ηγέτης του κινήματος των Μπολσεβίκων και του νέου κομμουνιστικού καθεστώτος.
--------------------------------------
Το πρότυπο κρατικής οργάνωσης που επικράτησε στην Ευρώπη κατά την περίοδο του Μεσοπολέμου σε διάφορες παραλλαγές.
Κύρια χαρακτηριστικά του ήταν η αυταρχική άσκηση της εξουσίας, η μαζοποίη­ση του ατόμου και η προσωπολατρία του ηγέτη.
Ορθολογισμός Γνωστός 

## 5. Prompt formation

After we retrieve the relevant passages, we can use them to contextualize an input prompt for the LLM.

In [None]:
PROMPT_TEMPLATE = """Passages:\n\n
{passages}\n\n
Query: {query}"""


def format_prompt(query, collection, n_neighbors=3):
    results = collection.query(query_texts=query, n_results=n_neighbors)

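    # query() returns one list of documents per query text, so also check the inner list
    if not results["documents"] or not results["documents"][0]: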
        print("No relevant documents found. Returning original query")
        return query

    passages = "\n\n".join(results["documents"][0])

    prompt = PROMPT_TEMPLATE.format(passages=passages, query=query)

    return prompt


test_prompt = format_prompt("What is MLM;", collection)

print(test_prompt)

Passages:


In the table, MASK means that we replace the target token with the [MASK] symbol for MLM; SAME means that we keep the target token as is; RND means that we replace the target token with another random token.
The numbers in the left part of the table represent the probabilities of the speciﬁc strategies used during MLM pre-training (BERT uses 80%, 10%, 10%).
The right part of the paper represents the Dev set results.
For the feature-based approach, we concatenate the last 4 layers of BERT as the features, which was shown to be the best approach in Section 5.3.

former is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.
In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens.
We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cl

In [None]:
test_prompt_greek = format_prompt("What is MLM;", collection_greek)

print(test_prompt_greek)

Passages:


Με στόχο την εκτόνωση της κατάστασης η κυβέρνηση προωθούσε ορισμένες μεταρρυθμίσεις, επέμεινε όμως στη συνέχιση του πολέμου και διακήρυξε ότι μία και αδιαίρετη ήταν η Ρωσία, αποκλείοντας οποιεσδήποτε παραχωρήσεις προς τις διάφορες εθνότητες.
Η πολιτική αυτή ευνοούσε τη θέση της πλειονότητας των σοσιαλιστών, των Μπολσεβίκων, οι οποίοι απαιτούσαν την άμεση κατάπαυση των εχθροπραξιών, την ελευθερία των εθνοτήτων, την εθνικοποίηση των γαιών, των μεγάλων επιχειρήσεων και των τραπεζών, καθώς και τον έλεγχο της βιομηχανικής παραγωγής από τους εργάτες.
Ο Βλαντιμίρ Ίλιτς Ουλιάνοφ (1870-1924), που έλαβε το ψευδώνυμο Λένιν, υπήρξε ο ηγέτης του κινήματος των Μπολσεβίκων και του νέου κομμουνιστικού καθεστώτος.

Το πρότυπο κρατικής οργάνωσης που επικράτησε στην Ευρώπη κατά την περίοδο του Μεσοπολέμου σε διάφορες παραλλαγές.
Κύρια χαρακτηριστικά του ήταν η αυταρχική άσκηση της εξουσίας, η μαζοποίη­ση του ατόμου και η προσωπολατρία του ηγέτη.
Ορθολογισμός Γνωστός και ως ρασιοναλισμός.
Επισ

## 6. LLM query

Finally, we send the query to the LLM, both with and without the retrieved passages, and compare the answers.

In [None]:
import os

from openai import OpenAI

# For GPT-3.5: read the key from the environment instead of hardcoding it in the notebook
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
GPT_CLIENT = OpenAI(api_key=OPENAI_API_KEY)

# For Meltemi: accessed through the OpenAI client with a custom base URL
MELTEMI_API_KEY = os.environ["MELTEMI_API_KEY"]
MELTEMI_BASE_URL = "http://ec2-3-19-37-251.us-east-2.compute.amazonaws.com:4000/"
MELTEMI_CLIENT = OpenAI(api_key=MELTEMI_API_KEY, base_url=MELTEMI_BASE_URL)

In [None]:
RAG_SYSTEM_PROMPT = """Instructions: Compose a comprehensive reply to the query, based on the provided passages.
If the search results mention multiple subjects with the same name, create separate answers for each.
In your answers, maintain the appropriate knowledge level for a person that is consuming this material.
You MAY include appropriate extra information for providing clear, simple and comprehensive clarifications,
if the user explicitly asks for it, else you only respond based on the provided passages.
Make the responses neither too advanced nor too simplistic.
If the text does not relate at all to the query, propose more appropriate prompt queries based on the book contents.
Provide your answer in the same language as the user's query."""

PLAIN_SYSTEM_PROMPT = """Instructions: Answer the user's questions to the best of your knowledge."""


def query_llm(query, system_prompt, model="gpt"):
    if "gpt" in model:
        model_name = "gpt-3.5-turbo"
        client = GPT_CLIENT
    else:
        model_name = "meltemi"
        client = MELTEMI_CLIENT
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content, response.usage



def plain_query(query, model="gpt"):
    print("---------------------------------------------------------------")
    print("Plain query without using  RAG")
    print("---------------------------------------------------------------")

    prompt = query
    response, usage = query_llm(prompt, PLAIN_SYSTEM_PROMPT, model=model)
    print()
    print("QUERY:")
    print(prompt)
    print("---------------------------------------------------------------")
    print()
    print("LLM RESPONSE:")
    print(response)
    print("---------------------------------------------------------------")
    print()
    print("TOKEN USAGE:")
    print(usage)
    print("---------------------------------------------------------------")


def rag_query(query, model="gpt", collection=collection):
    print("---------------------------------------------------------------")
    print("Query using  RAG")
    print("---------------------------------------------------------------")

    prompt = format_prompt(query, collection)
    response, usage = query_llm(prompt, RAG_SYSTEM_PROMPT, model=model)
    print()
    print("QUERY:")
    print(prompt)
    print("---------------------------------------------------------------")
    print()
    print("LLM RESPONSE:")
    print(response)
    print("---------------------------------------------------------------")
    print()
    print("TOKEN USAGE:")
    print(usage)
    print("---------------------------------------------------------------")

In [None]:
QUERY1 = "What is MLM?"
QUERY2 = "How does BERT perform on MNLI?"
QUERY3 = "How many goals has Lionel Messi scored over his career?"

# For the Greek example
QUERY1_GREEK = "Τι ξέρεις για το συνέδριο της Βιέννης;"  # "What do you know about the Congress of Vienna?"
QUERY2_GREEK = "Πόσα γκολ έχει βάλει ο Λιονέλ Μέσι;"  # "How many goals has Lionel Messi scored?"

In [None]:
plain_query(QUERY1, model="gpt")

---------------------------------------------------------------
Plain query without using  RAG
---------------------------------------------------------------

QUERY:
What is MLM?
---------------------------------------------------------------

LLM RESPONSE:
MLM, or Multi-Level Marketing, is a business strategy where a company recruits non-salaried salespeople, known as distributors or representatives, to market and sell products or services directly to consumers. Distributors are encouraged to also recruit new distributors, creating a multi-level structure where the distributor earns a commission not only on their own sales but also on the sales made by the distributors they have recruited. This creates a hierarchical network of distributors, hence the term "multi-level marketing." MLM companies often use word-of-mouth marketing and direct selling strategies to promote their products and compensate distributors based on their sales performance and the sales activity of their downline 

In [None]:
rag_query(QUERY1, model="gpt")

---------------------------------------------------------------
Query using  RAG
---------------------------------------------------------------

QUERY:
Passages:


former is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.
In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens.
We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953).

In the table, MASK means that we replace the target token with the [MASK] symbol for MLM; SAME means that we keep the target token as is; RND means that we replace the target token with another random token.
The numbers in the left part of the table represent the probabilities of the speciﬁc strategies used during MLM pre-training (BERT uses 80%, 10%, 10%).
The right 

In [None]:
plain_query(QUERY2, model="gpt")

---------------------------------------------------------------
Plain query without using  RAG
---------------------------------------------------------------

QUERY:
How does BERT perform on MNLI?
---------------------------------------------------------------

LLM RESPONSE:
BERT, short for Bidirectional Encoder Representations from Transformers, has shown impressive performance on the MultiNLI (MNLI) dataset. MNLI is a large-scale natural language inference dataset where the model is tasked with determining the relationship between a premise and hypothesis as entailment, neutral, or contradiction.

BERT achieves state-of-the-art results on the MNLI dataset by using its bidirectional architecture to capture contextual information from both directions, which helps in understanding the relationship between the premise and the hypothesis. By pre-training on a large corpus of text data and fine-tuning on the MNLI dataset, BERT is able to leverage its deep representation learning capabilit

In [None]:
rag_query(QUERY2, model="gpt")

---------------------------------------------------------------
Query using  RAG
---------------------------------------------------------------

QUERY:
Passages:


BERT is the ﬁrst ﬁnetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-speciﬁc architectures.
•
BERT advances the state of the art for eleven NLP tasks.
The code and pre-trained models are available at https://github.com/ google-research/bert.
2 Related Work There is a long history of pre-training general language representations, and we brieﬂy review the most widely-used approaches in this section.

We ﬁne-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16.
Results are presented in Table 4. BERTLARGE outperforms the authors’ baseline ESIM+ELMo system by +27.1% and OpenAI GPT by 8.3%.
5 Ablation Studies In this section, we perform ablation experiments over a number of facets of BERT in ord

In [None]:
plain_query(QUERY3, model="gpt")

---------------------------------------------------------------
Plain query without using  RAG
---------------------------------------------------------------

QUERY:
How many goals has Lionel Messi scored over his career?
---------------------------------------------------------------

LLM RESPONSE:
As of September 2021, Lionel Messi has scored over 700 career goals for both club and country. Messi has been playing professional football for over 17 years and is considered one of the greatest footballers of all time.
---------------------------------------------------------------

TOKEN USAGE:
CompletionUsage(completion_tokens=46, prompt_tokens=36, total_tokens=82, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))
---------------------------------------------------------------


In [None]:
rag_query(QUERY3, model="gpt")

---------------------------------------------------------------
Query using  RAG
---------------------------------------------------------------

QUERY:
Passages:


#L = the number of layers; #H = hidden size; #A = number of attention heads.
“LM (ppl)” is the masked LM perplexity of held-out training data.
System Dev F1 Test F1 ELMo (Peters et al., 2018a)
95.7 92.2
CVT (Clark et al., 2018) 92.6 CSE (Akbik et al., 2018)
93.1 Fine-tuning approach BERTLARGE 96.6 92.8 BERTBASE 96.4 92.4 Feature-based approach (BERTBASE)

In Section C.1 we demonstrate that MLM does converge marginally slower than a leftto-right model (which predicts every token), but the empirical improvements of the MLM model far outweigh the increased training cost.
Next Sentence Prediction
The next sentence prediction task can be illustrated in the following examples.
Input =
[CLS] the man went to [MASK] store
[SEP] he bought a gallon
[MASK] milk
[SEP] Label =
IsNext Input =
[CLS] the man [MASK] to the store
[SEP] pengui

In [None]:
plain_query(QUERY1_GREEK, model="meltemi")

---------------------------------------------------------------
Plain query without using  RAG
---------------------------------------------------------------

QUERY:
Τι ξέρεις για το συνέδριο της Βιέννης;
---------------------------------------------------------------

LLM RESPONSE:
Το Συνέδριο της Βιέννης ήταν ένα διεθνές διπλωματικό γεγονός που διεξήχθη από τις 14 Σεπτεμβρίου έως τις 19 Σεπτεμβρίου 1815 στην Βιέννη της Αυστρίας. Οι κυρίαρχες δυνάμεις του Συνασπισμού—το Βρετανικό Στέμμα του Γεωργίου Δ΄, η Ρωσική Αυτοκρατορία, η Αυστριακή Αυτοκρατορία, η Γαλλική Επανάσταση και οι Ηνωμένες Πολιτείες— συναντήθηκαν ξανά από την αρχή της ιστορίας για να συζητήσουν σημαντικά παγκόσμια ζητήματα. Τα κύρια θέματα του συνεδρίου ήταν το μέλλον της ηττημένης Ναπολεόντειας Αυτοκρατορίας,
---------------------------------------------------------------

TOKEN USAGE:
CompletionUsage(completion_tokens=100, prompt_tokens=61, total_tokens=161, completion_tokens_details=None, prompt_tokens_details=None)

In [None]:
rag_query(QUERY1_GREEK, model="meltemi", collection=collection_greek)

---------------------------------------------------------------
Query using  RAG
---------------------------------------------------------------

QUERY:
Passages:


Οι τέσσερις μεγάλες δυνάμεις της Ευρώπης που είχαν νικήσει τη Γαλλία του Ναπολέοντα, η Αυστρία, η Ρωσία, η Πρωσία και η Βρετανία, συγκάλεσαν την 1η Νοεμβρίου 1814 συνέδριο στη Βιέννη, για να τερματίσουν και επίσημα τον πόλεμο και για να λύσουν τα
προβλήματα που αυτός είχε προκαλέσει.
Ήταν το Συνέδριο Ειρήνης της Βιέννης (18141815), το οποίο αποτελεί
σταθμό στην ιστορία της Ευρώπης και του κόσμου.
Το συνέδριο κατέληξε στην ομώνυμη Συνθήκη Ειρήνης (1815).

Το Συνέδριο Ειρήνης της Βιέννης (1814-1815).9 2.
Τα εθνικά και φιλελεύθερα κινήματα στην Ευρώπη.13 3.
Η Ελληνική Επανάσταση του 1821 - Ένα μήνυμα ελευθερίας για την Ευρώπη.16 4.
Το ελληνικό κράτος και η εξέλιξή του (1830-1881).34 5.
Το Ανατολικό Ζήτημα και ο Κριμαϊκός Πόλεμος.38 6.
Η Βιομηχανική Επανάσταση.41 7.
Η κρίση της Αυτοκρατορίας των Αψβούργων - Η ιταλική και η γερμ

In [None]:
plain_query(QUERY2_GREEK, model="meltemi")

---------------------------------------------------------------
Plain query without using  RAG
---------------------------------------------------------------

QUERY:
Πόσα γκολ έχει βάλει ο Λιονέλ Μέσι;
---------------------------------------------------------------

LLM RESPONSE:
Ο Λιονέλ Μέσι είναι ένας διάσημος Αργεντινός ποδοσφαιριστής, γνωστός για την εκπληκτική του ικανότητα με την μπάλα. Έχει πετύχει σημαντικά γκολ στην πορεία του.

Από το 2023, ο Μέσι έχει βάλει 778 γκολ σε επίσημους αγώνες με την FC Barcelona και την Αργεντινή εθνική ομάδα. Αυτό περιλαμβάνει 73 γκολ στο Τσάμπιονς Λιγκ, 45 γκολ στο Κόπα Άμερικα, 39 γκολ στο ΟΥΕΦΑ Σούπερ Καπ, και πολλά άλλα σε διαφορετικά τουρνουά.

Ο Μέσι πέτυχε ένα εντυπωσιακό γκολ νούμερο
---------------------------------------------------------------

TOKEN USAGE:
CompletionUsage(completion_tokens=100, prompt_tokens=63, total_tokens=163, completion_tokens_details=None, prompt_tokens_details=None)
---------------------------------------------

In [None]:
rag_query(QUERY2_GREEK, model="meltemi", collection=collection_greek)

---------------------------------------------------------------
Query using  RAG
---------------------------------------------------------------

QUERY:
Passages:


Μποντλέρ (Charles Baudelaire, 1821-1867).
Μαλαρμέ (Stéphan Mallarmé, 1842-1898).
Βερλέν (Paul Verlaine, 1844-1896).
Ρεμπό (Arthur Rimbaud, 1854-1891).
Κλοντέλ (Paul Claudel, 1868-1955).
Μωρεάς (Jean Moreas - Ιωάννης Παπαδιαμαντόπουλος, 1856-1910).
Βαλερύ (Paul Valéry, 1871-1945) κ.ά. 19ος ΑΙΩΝΑΣ 3.
ΜΕΓΑΛΟΙ ΔΗΜΙΟΥΡΓΟΙ ΤΟΥ ΡΟΜΑΝΤΙΣΜΟΥ 19ος ΑΙΩΝΑΣ 4.
ΜΕΓΑΛΟΙ ΔΗΜΙΟΥΡΓΟΙ ΤΟΥ ΡΕΑΛΙΣΜΟΥ 19ος
ΑΙΩΝΑΣ 5.
ΜΕΓΑΛΟΙ ΠΟΙΗΤΕΣ ΤΟΥ ΣΥΜΒΟΛΙΣΜΟΥ  234 ΠΙΝΑΚΕΣ Πισαρό (Camille Pissaro, 1830-1903).
Μανέ (Édouard Manet, 1832-1883).
Ντεγκά (Edgar Degas, 1834-1917).

Σισλέ (Alfred Sisley, 1839-1899).
Μονέ (Claude Monet, 1840-1926).
Ρενουάρ (Pierre-Auguste Renoir, 1841-1919).
Κάσατ (Mary Cassatt, 1844-1926).
Σερά (Georges Seurat, 1859-1891).
Σεζάν (Paul Cézanne, 1839-1906).
Γκογκέν (Paul Gauguin, 1848-1903).
Βαν Γκογκ (Vincent Van Gogh, 1853-1890).
Μο

# Building a RAG pipeline with LlamaIndex

So far we have implemented a RAG pipeline from low-level components. This approach works well for teaching and for small projects, but larger projects usually benefit from the versatility of existing open-source frameworks.

Some good candidate frameworks are:

* **LlamaIndex**: Framework focused on efficient indexing and retrieval, with components for every step of the RAG pipeline.
* **Langchain**: General-purpose framework for LLM applications. Its breadth makes it more versatile, but it can suffer from the "jack of all trades, master of none" problem, and the code tends to be more complex.
* **Haystack**: High-level framework focused on question answering, with RAG capabilities.


In this tutorial we choose LlamaIndex for its balance of versatility, clean API, and features.

In [None]:
%%capture

!pip install llama-index llama-index-readers-file llama-index-llms-openai llama-index-embeddings-huggingface llama-index-llms-litellm

In [None]:
!rm -rf example_data

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.llms.openai import OpenAI as LLamaIndexOpenAI
from llama_index.llms.litellm import LiteLLM

# 1. Data ingestion

`SimpleDirectoryReader` selects the best file reader for a given file extension.

You can have any document type inside `DATA_DIR` (e.g., pdf, txt, html, ppt) and it will be parsed by the appropriate parser.
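The idea of choosing a reader by file extension can be sketched in plain Python. This is only an illustration of the dispatch pattern, not how `SimpleDirectoryReader` is implemented: the `READERS` registry, the two reader functions, and `load_directory` are hypothetical names, and falling back to plain-text reading for unknown extensions is an assumption of this sketch.

```python
import re
import tempfile
from pathlib import Path


def read_plain_text(path):
    """Return the file content unchanged."""
    return Path(path).read_text(encoding="utf-8")


def read_markdown(path):
    """Return the file content with leading '#' heading markers removed."""
    text = Path(path).read_text(encoding="utf-8")
    return re.sub(r"^#+\s*", "", text, flags=re.MULTILINE)


# Hypothetical registry: file extension -> reader function.
READERS = {".txt": read_plain_text, ".md": read_markdown}


def load_directory(data_dir, readers=READERS, default_reader=read_plain_text):
    """Load every file in data_dir with the reader registered for its extension."""
    documents = []
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_file():
            continue
        # Unknown extensions fall back to the default reader (an assumption here).
        reader = readers.get(path.suffix.lower(), default_reader)
        documents.append({"file_name": path.name, "text": reader(path)})
    return documents


# Usage: two files with different extensions in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "paper.txt").write_text("BERT is a language model.", encoding="utf-8")
    Path(tmp, "notes.md").write_text("# Notes\nRAG adds context.", encoding="utf-8")
    docs = load_directory(tmp)

for doc in docs:
    print(doc["file_name"], "->", repr(doc["text"]))
```

Keeping the registry as a plain dictionary makes it easy to add a reader for a new file type without touching the loading loop.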



In [None]:
# Download pdf inside DATA_DIR

DATA_DIR = "example_data/"

input_pdf = "https://aclanthology.org/N19-1423.pdf"
local_pdf = download_pdf(input_pdf, DATA_DIR)

documents = SimpleDirectoryReader(DATA_DIR).load_data()

`SimpleDirectoryReader` creates one document per PDF page, so we expect one document for each page of the paper.

In [None]:
print(len(documents))

16


# 2. Chunking


Here we use the more advanced semantic chunker provided by LlamaIndex (`SemanticSplitterNodeParser`).

"Semantic chunking" is a concept proposed by Greg Kamradt in his video tutorial on 5 levels of embedding chunking: https://youtu.be/8OJC21T2SL4?t=1933.

Instead of chunking text with a fixed chunk size, the semantic splitter adaptively picks the breakpoint in-between sentences using embedding similarity. This ensures that a "chunk" contains sentences that are semantically related to each other.
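The breakpoint selection described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the LlamaIndex implementation: it compares single adjacent sentences (no buffer of neighbouring sentences), the embeddings are hand-made 2-d toy vectors rather than model outputs, and `semantic_chunks`, `cosine_distance`, and `percentile` are hypothetical helper names.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences, embeddings, threshold_percentile=95):
    """Start a new chunk wherever the distance between adjacent
    sentence embeddings exceeds the given percentile of all distances."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(sentences) - 1)
    ]
    cutoff = percentile(distances, threshold_percentile)
    chunks, current = [], [sentences[0]]
    for i, dist in enumerate(distances):
        if dist > cutoff:  # large semantic jump: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks


# Toy example: the first two and the last two sentences point in similar directions.
sentences = ["BERT is pre-trained.", "It uses MLM.", "Vienna hosted a congress.", "It ended in 1815."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_chunks(sentences, embeddings))
```

A higher percentile produces fewer, longer chunks, because only the largest semantic jumps become breakpoints.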


PS: A chunker similar to the one we implemented previously is provided by `llama_index.core.node_parser.SentenceSplitter`.


In [None]:
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")

splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

nodes = splitter.get_nodes_from_documents(documents)


In [None]:
nodes[1]

TextNode(id_='e6c603f5-747b-4856-9e40-7a527ffe0e8e', embedding=None, metadata={'page_label': '1', 'file_name': '2843a57377ff4623ad2fb3002b754bfe.pdf', 'file_path': '/content/example_data/2843a57377ff4623ad2fb3002b754bfe.pdf', 'file_type': 'application/pdf', 'file_size': 786279, 'creation_date': '2024-11-30', 'last_modified_date': '2024-11-30'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='ad4c1eb2-39e5-49ff-b97d-b9e501a597ef', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '1', 'file_name': '2843a57377ff4623ad2fb3002b754bfe.pdf', 'file_path': '/content/example_data/2843a57377ff4623ad2fb3002b754bfe.pdf', 'file_type': 'application/pdf', 'file_size': 786279, 'creation_date': '2024-11-30', 'la

# 3. Create vector store index

LlamaIndex can connect to multiple vector stores (e.g., Chroma, Pinecone, Weaviate, pgvector).

Here we use the default in-memory vector store for simplicity.
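To make the idea of an in-memory vector store concrete, here is a minimal sketch that keeps normalized vectors in a Python list and ranks them by cosine similarity. It is an illustration only: `InMemoryVectorIndex` is a hypothetical class, not the LlamaIndex store, and the 2-d vectors stand in for real embeddings.

```python
import math


class InMemoryVectorIndex:
    """Brute-force cosine-similarity search over vectors kept in a list."""

    def __init__(self):
        self._items = []  # list of (unit_vector, text) pairs

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot index or query a zero vector")
        return [x / norm for x in vector]

    def add(self, vector, text):
        self._items.append((self._normalize(vector), text))

    def query(self, vector, top_k=3):
        """Return the top_k (similarity, text) pairs, most similar first."""
        q = self._normalize(vector)
        scored = [
            (sum(a * b for a, b in zip(q, v)), text) for v, text in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


# Toy usage with hand-made 2-d "embeddings".
toy_index = InMemoryVectorIndex()
toy_index.add([1.0, 0.0], "Masked language modeling")
toy_index.add([0.0, 1.0], "Congress of Vienna")
toy_index.add([0.8, 0.6], "Next sentence prediction")

for score, text in toy_index.query([1.0, 0.1], top_k=2):
    print(f"{score:.3f}  {text}")
```

A linear scan like this is fine for a few thousand chunks; dedicated vector databases become useful when the collection is large or must persist across sessions.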

In [None]:
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
index = VectorStoreIndex(nodes, embed_model=embed_model)

# 4. Get answers from LLM

Create a query engine from the vector store index and ask questions about your data.
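Conceptually, a query engine chains three steps: retrieve the most similar chunks, format them into a prompt, and call the LLM. The sketch below shows that flow under stated assumptions: `retrieve` and `llm` are hypothetical callables standing in for the index and the model, and the prompt wording is illustrative rather than the template LlamaIndex uses.

```python
def build_prompt(passages, question):
    """Format retrieved passages and the user question into one prompt."""
    context = "\n\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question, retrieve, llm, top_k=3):
    """Retrieve top_k passages, build the prompt, and return the LLM output."""
    passages = retrieve(question, top_k)
    return llm(build_prompt(passages, question))


# Stand-ins so the sketch runs without a vector store or an API key.
def toy_retrieve(question, top_k):
    corpus = [
        "MLM masks some input tokens and predicts them.",
        "NSP predicts whether sentence B follows sentence A.",
    ]
    return corpus[:top_k]


def echo_llm(prompt):
    return f"(an LLM would answer from a {len(prompt)}-character prompt)"


print(build_prompt(toy_retrieve("What is MLM?", 2), "What is MLM?"))
print(answer("What is MLM?", toy_retrieve, echo_llm))
```

Instructing the model to admit when the context is insufficient is what lets a RAG system decline questions that the indexed documents do not cover.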

In [None]:
# gpt35 = LLamaIndexOpenAI(api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")
# meltemi = LLamaIndexOpenAI(api_key=MELTEMI_API_KEY, base_url=MELTEMI_BASE_URL, model="meltemi-instruct-7b")

gpt = LiteLLM(
    "openai/gpt-4o",
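    api_base="http://ec2-3-19-37-251.us-east-2.compute.amazonaws.com:4000/",
    # Read the key from the environment instead of hardcoding it.
    # LITELLM_API_KEY is an assumed variable name; set it before running this cell.
    api_key=os.environ["LITELLM_API_KEY"],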
)

In [None]:
query_engine = index.as_query_engine(
    llm=gpt,
    similarity_top_k=3
)

In [None]:
query_engine.query("What is MLM?").response

'MLM stands for "masked language model." It is a pre-training objective used in BERT where some of the tokens in the input are randomly masked, and the model\'s task is to predict the original vocabulary id of these masked tokens. This approach allows the model to learn bidirectional representations by considering both the left and right context of a word.'

In [None]:
query_engine.query("How does BERT perform on MNLI?").response

'BERT achieves high performance on the MNLI task, with BERT BASE showing almost a 1.0% additional accuracy when trained on 1 million steps compared to 500,000 steps. Additionally, BERT LARGE performs competitively with state-of-the-art methods, demonstrating its effectiveness for both fine-tuning and feature-based approaches.'

In [None]:
query_engine.query("How does BERT perform on Squad?").response

'BERT achieves a Test F1 score of 93.2 on SQuAD v1.1 and a Test F1 score of 83.1 on SQuAD v2.0.'

In [None]:
query_engine.query("How many goals did Lionel Messi score over his career?").response

"The provided context does not contain information about Lionel Messi's career goals."

**RAG hands-on exercise**

In this exercise we will implement a RAG system and test different embedding models and LLMs. We will provide access to the OpenAI API and Meltemi for this exercise. If you speak Greek, we encourage you to implement your pipeline with Greek sources and give Meltemi a test.

***Step 1***: Implement the data ingestion pipeline. This pipeline will be more elaborate than the one we implemented before, since it requires crawling a real website. We recommend using one of the following knowledge sources [1,2,3], but feel free to use any other data source you prefer:

[1] Greek government procedures (available in both Greek and English): https://mitos.gov.gr/index.php/Αρχική_σελίδα

[2] Greek school ebooks (available in Greek; use more than one, and include both the teacher’s manual and the student’s textbook):
http://ebooks.edu.gr/ebooks/

[3] Feynman’s lectures (available in English): https://www.feynmanlectures.caltech.edu

***Step 2***: Build an evaluation set of 10-20 examples using ragas. Make sure to check the generated examples. If you are building a Greek system you may need to localize some of the data creation prompts (https://github.com/explodinggradients/ragas/blob/main/src/ragas/testset/prompts.py).

***Step 3***: Try different chunking algorithms

***Step 4***: Create the vector db index. Try different embeddings.

***Step 5***: Create a query engine and a chat engine from the vector index. The chat engine is a stateful query engine where you can ask follow-up questions and refine your queries. If you are building a Greek system you may need to localize the corresponding prompts in LlamaIndex (see the prompt localization example in this notebook).

***Step 6***: Evaluate the different configurations of the system you built on the test set and report your results.
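When comparing configurations, a cheap retrieval-only check can complement the ragas evaluation. The sketch below computes hit rate and mean reciprocal rank (MRR) over an evaluation set. It rests on assumptions not in the exercise text: each example is a dict with a `question` and a single `relevant_id`, and `retrieve` is a hypothetical callable that returns ranked chunk ids for one configuration of your system.

```python
def evaluate_retrieval(eval_set, retrieve, top_k=3):
    """Return hit rate and MRR of a retriever over an evaluation set."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one example")
    hits = 0
    reciprocal_rank_sum = 0.0
    for example in eval_set:
        retrieved_ids = retrieve(example["question"], top_k)
        if example["relevant_id"] in retrieved_ids:
            hits += 1
            rank = retrieved_ids.index(example["relevant_id"]) + 1
            reciprocal_rank_sum += 1.0 / rank
    n = len(eval_set)
    return {"hit_rate": hits / n, "mrr": reciprocal_rank_sum / n}


# Toy evaluation set and two hypothetical retriever configurations.
eval_set = [
    {"question": "q1", "relevant_id": "c2"},
    {"question": "q2", "relevant_id": "c7"},
]
config_a = {"q1": ["c1", "c2", "c3"], "q2": ["c9", "c8", "c7"]}
config_b = {"q1": ["c2", "c1", "c3"], "q2": ["c4", "c5", "c6"]}

for name, ranking in [("config_a", config_a), ("config_b", config_b)]:
    scores = evaluate_retrieval(eval_set, lambda q, k, r=ranking: r[q][:k])
    print(name, scores)
```

Hit rate tells you whether the relevant chunk was retrieved at all, while MRR also rewards ranking it near the top; reporting both per configuration makes the effect of chunking and embedding choices easier to see.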


# Example prompt localization for LlamaIndex

See also [[1]](https://docs.llamaindex.ai/en/stable/examples/customization/prompts/chat_prompts/) and [[2]](https://docs.llamaindex.ai/en/stable/module_guides/models/prompts/) for more detailed guides on prompting in LlamaIndex.

In [None]:
from llama_index.core.prompts import (
    SelectorPromptTemplate,
    PromptType,
    PromptTemplate,
    MessageRole,
    ChatPromptTemplate,
    ChatMessage,
)

QA_PROMPT = """BEGININPUT
{context_str}

ENDINPUT
BEGININSTRUCTION
{query_str}
ENDINSTRUCTION"""

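# Greek system prompt. In English, roughly: "You are an honest, unbiased assistant.
# Answer the user's instructions using only the document between BEGININPUT and ENDINPUT
# (and its metadata); if they cannot be answered from it, say you do not know."
QA_SYSTEM_PROMPT = """Είσαι ένας ειλικρινής και αμερόληπτος βοηθός που πάντα απαντάει με ακρίβεια σε αυτά που του ζητούνται.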
Σου δίνεται ένα έγγραφο το οποίο βρίσκεται μεταξύ του BEGININPUT και του ENDINPUT.
Επίσης, σου δίνονται μεταδεδομένα για το συγκεκριμένο έγγραφο μεταξύ του BEGINCONTEXT και του ENDCONTEXT.
Το βασικό κείμενο του εγγράφου βρίσκεται μεταξύ του ENDCONTEXT και του ENDINPUT.
Απάντα στις οδηγίες του χρήστη που βρίσκονται μεταξύ του BEGININSTRUCTION και του ENDINSTRUCTION χρησιμοποιώντας μόνο το βασικό κείμενο
και τα μεταδεδομένα του εγγράφου που σου δίνονται παρακάτω. Αν οι οδηγίες που σου ζητάει ο χρήστης δεν μπορούν να απαντηθούν με το βασικό
κείμενο ή τα μεταδεδομένα του εγγράφου, ενημέρωσε τον χρήστη ότι δεν ξέρεις την σωστή απάντηση."""

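# Greek refine prompt. In English, roughly: "The original question is {query_str}; we gave
# the answer {existing_answer}. Improve it with the new information only if needed;
# otherwise repeat the previous answer."
REFINE_PROMPT = """Η αρχική ερώτηση είναι η εξής: {query_str}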
Έχουμε δώσει την παρακάτω απάντηση: {existing_answer}
Έχουμε την ευκαιρία να βελτιώσουμε την προηγούμενη απάντηση (μόνο αν χρειάζεται) με τις παρακάτω νέες πληροφορίες.
------------
{context_str}
------------
Με βάση τις νέες πληροφορίες, βελτίωσε την απάντησή σου για να απαντά καλύτερα την ερώτηση.
Αν οι πληροφορίες δεν είναι χρήσιμες, απλά επανέλαβε την προηγούμενη απάντηση.
Βελτιωμένη απάντηση: """

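# Greek chat refine prompt. In English, roughly: "You are a trusted question-answering system.
# Either rewrite the answer using the new information or repeat the original answer if the
# new information is not useful. Never mention the original answer or the context directly."
REFINE_CHAT_PROMPT = """Είσαι ένα έμπιστο σύστημα που απαντάει ερωτήσεις χρηστών. Λειτουργείς με τους εξής δύο τρόπους όταν βελτιώνεις υπάρχουσες απαντήσεις: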
1. **Ξαναγράφεις** την απάντηση με βάση νέες πληροφορίες που σου παρέχονται
2. **Επαναλαμβάνεις** την προηγούμενη απάντηση αν δεν είναι χρήσιμες οι νέες πληροφορίες

Ποτέ δεν αναφέρεις την αρχική απάντηση ή το συγκείμενο απευθείας στην απάντησή σου.
Αν υπάρχει αμφιβολία για το τι πρέπει να απαντήσεις, απλά επανέλαβε την αρχική απάντηση.
Νέες πληροφορίες:
------------
{context_str}
------------
Ερώτηση: {query_str}
Αρχική απάντηση: {existing_answer}
Βελτιωμένη απάντηση: """


default_template = PromptTemplate(
    metadata={"prompt_type": PromptType.QUESTION_ANSWER},
    # template_vars=["context_str", "query_str"],
    template=QA_PROMPT,
)
chat_template = ChatPromptTemplate(
    metadata={"prompt_type": PromptType.CUSTOM},
    # template_vars=["context_str", "query_str"],
    message_templates=[
        ChatMessage(role=MessageRole.SYSTEM, content=QA_SYSTEM_PROMPT),
        ChatMessage(role=MessageRole.USER, content=QA_PROMPT),
    ],
)
text_qa_prompt = SelectorPromptTemplate(
    default_template=default_template, conditionals=[(lambda llm: True, chat_template)]
)

default_refine_template = PromptTemplate(
    metadata={"prompt_type": PromptType.REFINE},
    # template_vars=["query_str", "existing_answer", "context_str"],
    template=REFINE_PROMPT,
)
chat_refine_template = ChatPromptTemplate(
    metadata={"prompt_type": PromptType.CUSTOM},
    # template_vars=["context_str", "query_str", "existing_answer"],
    message_templates=[ChatMessage(role=MessageRole.USER, content=REFINE_CHAT_PROMPT)],
)
text_refine_prompt = SelectorPromptTemplate(
    default_template=default_refine_template,
    conditionals=[(lambda llm: True, chat_refine_template)],
)

### .... code ....

index = VectorStoreIndex(nodes, embed_model=embed_model)

meltemi = LiteLLM(
    "hosted_vllm/meltemi-vllm",
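    api_base="http://ec2-3-19-37-251.us-east-2.compute.amazonaws.com:4000/",
    # Read the key from the environment instead of hardcoding it.
    # LITELLM_API_KEY is an assumed variable name; set it before running this cell.
    api_key=os.environ["LITELLM_API_KEY"],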
)
query_engine = index.as_query_engine(
    llm=meltemi, text_qa_prompt=text_qa_prompt, text_refine_prompt=text_refine_prompt
)
