<!-- TABS -->
# Retrieval augmented generation

The first step in any SuperDuperDB application is to connect to your data-backend with SuperDuperDB:

<!-- TABS -->
## Connect to SuperDuperDB

In [None]:
# <tab: MongoDB>
from pinnacledb import pinnacle

db = pinnacle('mongodb://localhost:27017/documents')

In [None]:
# <tab: SQLite>
from pinnacledb import pinnacle

db = pinnacle('sqlite://my_db.db')

Once connected, you are ready to define the datatype(s) that you would like to search.

<!-- TABS -->
## Insert data

To insert data, we first need to create a `Schema` that encodes our special `DataType` column(s) in the databackend.

Here's some sample data to work with:

In [None]:
# <tab: Text>
!curl -O https://jupyter-sessions.s3.us-east-2.amazonaws.com/text.json

import json
with open('text.json') as f:
    data = json.load(f)

In [None]:
# <tab: Images>
!curl -O https://jupyter-sessions.s3.us-east-2.amazonaws.com/images.zip
!unzip images.zip

import os
# Directory name matches the one listed below ('./images')
data = [{'image': f'file://images/{file}'} for file in os.listdir('./images')]

In [None]:
# <tab: Audio>
!curl -O https://jupyter-sessions.s3.us-east-2.amazonaws.com/audio.zip
!unzip audio.zip

import os
data = [{'audio': f'file://audio/{file}'} for file in os.listdir('./audio')]

The next code-block is only necessary if you're working with a custom `DataType`:

In [None]:
from pinnacledb import Schema, Document

schema = Schema(
    'my_schema',
    fields={
        'my_key': dt  # `dt` is your custom `DataType` instance
    }
)

data = [
    Document({'my_key': item}) for item in data
]

In [None]:
# <tab: MongoDB>
from pinnacledb.backends.mongodb import Collection

collection = Collection('documents')

db.execute(collection.insert_many(data))

In [None]:
# <tab: SQL>
from pinnacledb.backends.ibis import Table

table = Table(
    'my_table',
    schema=schema,
)

db.add(table)
db.execute(table.insert(data))

<!-- TABS -->
## Build text embedding model

In [None]:
# <tab: OpenAI>
...

In [None]:
# <tab: JinaAI>
...

In [None]:
# <tab: Sentence-Transformers>
...

In [None]:
# <tab: Transformers>
...

<!-- TABS -->
## Perform a vector search

To search, first wrap the query in a `Document`, where:

- `item` is the item to be encoded
- `dt` is the `DataType` instance to apply

In [None]:
from pinnacledb import Document

item = Document({'my_key': dt(item)})

Once we have this search target, we can execute a search as follows:

In [None]:
# <tab: MongoDB>
from pinnacledb.backends.mongodb import Collection

collection = Collection('documents')

select = collection.find().like(item)

In [None]:
# <tab: SQL>

# Table was created earlier, before preparing vector-search
table = db.load('table', 'my_table')

select = table.like(item)

In [None]:
results = db.execute(select)
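
To make the idea behind `.like(...)` more concrete, here is a minimal, library-independent sketch of a vector search: every record carries an embedding vector, and the search returns the records whose vectors are most similar to the query vector. The toy vectors and the `cosine_similarity` and `top_k` helpers are hypothetical and purely illustrative; this is not how the library implements search internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k=2):
    """Return the k records whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, r['vector']), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:k]]


# Toy embeddings (made-up numbers, for illustration only)
records = [
    {'my_key': 'cats are mammals', 'vector': [1.0, 0.0, 0.0]},
    {'my_key': 'dogs are mammals', 'vector': [0.9, 0.1, 0.0]},
    {'my_key': 'the stock market fell', 'vector': [0.0, 0.0, 1.0]},
]

nearest = top_k([1.0, 0.0, 0.0], records, k=2)
print([r['my_key'] for r in nearest])
```

In a real application the vectors would come from the embedding model built in the previous section rather than being written by hand.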

<!-- TABS -->
## Build LLM

In [None]:
# <tab: OpenAI>

...

In [None]:
# <tab: Anthropic>

...

In [None]:
# <tab: vLLM>

...

In [None]:
# <tab: Transformers>

...

In [None]:
# <tab: Llama.cpp>

...

In [None]:
llm.predict_one(X='Tell me about SuperDuperDB')

In [None]:
# <tab: MongoDB>
from pinnacledb.components.model import QueryModel
from pinnacledb import Variable, Document

query_model = QueryModel(
    select=collection.find().like(Document({'my_key': Variable('item')}))
)

In [None]:
from pinnacledb.components.graph import Graph, Input
from pinnacledb import pinnacle


@pinnacle
class PromptBuilder:
    def __init__(self, initial_prompt, post_prompt, key):
        self.initial_prompt = initial_prompt
        self.post_prompt = post_prompt
        self.key = key

    def __call__(self, X, context):
        # Join the retrieved records into one block of text,
        # then append the question itself.
        return (
            self.initial_prompt + '\n\n'
            + '\n'.join(r[self.key] for r in context) + '\n\n'
            + self.post_prompt + '\n\n'
            + X
        )


prompt_builder = PromptBuilder(
    initial_prompt='Answer the following question based on the following facts:',
    post_prompt='Here\'s the question:',
    key='my_key',
)

with Graph() as G:
    input = Input('X')
    query_results = query_model(item=input)
    prompt = prompt_builder(X=input, context=query_results)
    output = llm(X=prompt)
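
The data flow that the graph wires together can be sketched in plain Python: retrieve context for a question, build a prompt from it, and pass the prompt to a model. Everything in this sketch is a stand-in chosen for illustration: `retrieve` ranks by simple word overlap instead of vector similarity, `fake_llm` is a placeholder rather than a real model call, and the sample documents are invented.

```python
def retrieve(question, documents, key='my_key', n=2):
    """Stand-in for the query model: rank documents by word overlap."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d[key].lower().split())),
        reverse=True,
    )
    return ranked[:n]


def build_prompt(question, context, key='my_key'):
    """Combine the retrieved facts and the question into one prompt."""
    facts = '\n'.join(r[key] for r in context)
    return (
        'Answer the following question based on the following facts:\n\n'
        + facts + '\n\n'
        + "Here's the question:\n\n"
        + question
    )


def fake_llm(prompt):
    """Placeholder for a real LLM call; only reports the prompt size."""
    return f'(model output for a prompt of {len(prompt)} characters)'


# Invented sample documents, for illustration only
documents = [
    {'my_key': 'Vector search finds items with similar embeddings.'},
    {'my_key': 'A prompt combines retrieved facts with the question.'},
    {'my_key': 'Bananas are yellow.'},
]

question = 'What is vector search?'
context = retrieve(question, documents)      # step 1: retrieval
prompt = build_prompt(question, context)     # step 2: prompt construction
answer = fake_llm(prompt)                    # step 3: generation
print(prompt)
```

The three steps mirror the three calls inside the `with Graph()` block: the query model, the prompt builder, and the LLM.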