# Basic RAG tutorial with templates

:::info
In this tutorial we show you how to do retrieval augmented generation (RAG) with `pinnacledb`.
Note that this is just one example of the flexibility and power that `pinnacledb` gives
to developers; `pinnacledb` is about much more than RAG and LLMs.
:::

As in the vector-search tutorial, we'll use the `pinnacledb` documentation as our data.
We'll add it to a test database after downloading the data snapshot:

```python
!curl -O https://pinnacledb-public-demo.s3.amazonaws.com/text.json
```


```python
import json

from pinnacledb import pinnacle, Document

# Connect to an in-memory MongoDB mock, so no database server is needed
db = pinnacle('mongomock://test')

with open('text.json') as f:
    data = json.load(f)

# Store each documentation page as a document with a single 'txt' field
_ = db['docu'].insert_many([{'txt': r} for r in data]).execute()
```

```text
2024-Jun-06 14:03:14.74| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.build:69   | Data Client is ready. mongomock.MongoClient('localhost', 27017)
2024-Jun-06 14:03:14.74| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.build:42   | Connecting to Metadata Client with engine:  mongomock.MongoClient('localhost', 27017)
2024-Jun-06 14:03:14.74| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.build:155  | Connecting to compute client: None
2024-Jun-06 14:03:14.74| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.datalayer:86   | Building Data Layer
2024-Jun-06 14:03:14.74| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.build:220  | Configuration: 
+---------------+------------------+
| Configuration |      Value       |
+---------------+------------------+
|  Data Backend | mongomock://test |
+---------------+------------------+
2024-Jun-06 14:03:14.75| INFO     | Duncans-MBP.fritz.box| pinnacledb.backends.mongodb.data_backend:191  | Table docu does not exist, auto creating..
```

Let's verify that the data is in `db` by querying one document:

```python
db['docu'].find_one().execute()
```

```text
Document({'txt': "---\nsidebar_position: 5\n---\n\n# Encoding data\n\nIn AI, typical types of data are:\n\n- **Numbers** (integers, floats, etc.)\n- **Text**\n- **Images**\n- **Audio**\n- **Videos**\n- **...bespoke in house data**\n\nMost databases don't support any data other than numbers and text.\nSuperDuperDB enables the use of these more interesting data-types using the `Document` wrapper.\n\n### `Document`\n\nThe `Document` wrapper, wraps dictionaries, and is the container which is used whenever \ndata is exchanged with your database. That means inputs, and queries, wrap dictionaries \nused with `Document` and also results are returned wrapped with `Document`.\n\nWhenever the `Document` contains data which is in need of specialized serialization,\nthen the `Document` instance contains calls to `DataType` instances.\n\n### `DataType`\n\nThe [`DataType` class](../apply_api/datatype), allows users to create and encoder custom datatypes, by providing \ntheir own encoder/decoder pairs
```

The first step in a RAG application is to create a `VectorIndex`. The results of searching
with this index will be passed to the LLM as context for answering questions.

Read about `VectorIndex` [here](../apply_api/vector_index.md), and follow along with the
vector-search tutorial [here](./vector_search.md).
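Conceptually, searching an index configured with `measure="cosine"` ranks the stored embeddings by their cosine similarity to the query embedding and returns the best matches. The following is a minimal pure-Python sketch of that idea using made-up 3-dimensional vectors (the real index stores 384-dimensional embeddings). It is not how `pinnacledb` implements vector search; `cosine`, `top_k` and the toy index are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k):
    """Return the ids of the k stored vectors most similar to `query`."""
    scored = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Hypothetical toy index: document id -> embedding
index = {
    "intro": [1.0, 0.0, 0.0],
    "vector-index": [0.1, 0.9, 0.1],
    "datatypes": [0.0, 0.2, 0.9],
}

print(top_k([0.0, 1.0, 0.0], index, k=2))  # ['vector-index', 'datatypes']
```

The ids returned by such a search are what a RAG pipeline turns into context for the LLM.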

```python
from pinnacledb import Stack, Document, VectorIndex, Listener, vector
from pinnacledb.ext.sentence_transformers.model import SentenceTransformer
from pinnacledb.base.code import Code

# Convert the array returned by sentence-transformers into a plain list
def postprocess(x):
    return x.tolist()

# all-MiniLM-L6-v2 produces 384-dimensional embeddings
datatype = vector(shape=384, identifier="my-vec")

# Embedding model; `Code.from_object` serializes `postprocess` together with the model
model = SentenceTransformer(
    identifier="my-embedding",
    datatype=datatype,
    predict_kwargs={"show_progress_bar": True},
    signature="*args,**kwargs",
    model="all-MiniLM-L6-v2",
    device="cpu",
    postprocess=Code.from_object(postprocess),
)

# Compute embeddings of the 'txt' field for every document in 'docu',
# at most 50 documents per chunk
listener = Listener(
    identifier="my-listener",
    model=model,
    key='txt',
    select=db['docu'].find(),
    predict_kwargs={'max_chunk_size': 50},
)

# Index the listener's outputs and compare vectors by cosine similarity
vector_index = VectorIndex(
    identifier="my-index",
    indexing_listener=listener,
    measure="cosine"
)

db.apply(vector_index)
```
```text
2024-Jun-06 14:03:36.07| INFO     | Duncans-MBP.fritz.box| pinnacledb.backends.local.compute:37   | Submitting job. function:<function method_job at 0x1142f5620>
210it [00:00, 80277.42it/s]
2024-Jun-06 14:03:37.27| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.model:730  | Computing chunk 0/4
2024-Jun-06 14:03:37.91| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.model:754  | Adding 50 model outputs to `db`
2024-Jun-06 14:03:37.93| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.model:730  | Computing chunk 1/4
2024-Jun-06 14:03:38.49| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.model:754  | Adding 50 model outputs to `db`
2024-Jun-06 14:03:38.51| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.model:730  | Computing chunk 2/4
2024-Jun-06 14:03:39.07| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.model:754  | Adding 50 model outputs to `db`
2024-Jun-06 14:03:39.09| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.model:730  | Computing chunk 3/4
2024-Jun-06 14:03:39.68| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.model:754  | Adding 50 model outputs to `db`
2024-Jun-06 14:03:39.71| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.model:730  | Computing chunk 4/4
2024-Jun-06 14:03:39.84| INFO     | Duncans-MBP.fritz.box| pinnacledb.components.model:754  | Adding 10 model outputs to `db`
2024-Jun-06 14:03:39.84| SUCCESS  | Duncans-MBP.fritz.box| pinnacledb.backends.local.compute:43   | Job submitted on <pinnacledb.backends.local.compute.LocalComputeBackend object at 0x16fad6fd0>.  function:<function method_job at 0x1142f5620> future:ea790267-73ea-4b5c-a9e9-e3bef259fe6f
2024-Jun-06 14:03:39.84| INFO     | Duncans-MBP.fritz.box| pinnacledb.backends.local.compute:37   | Submitting job. function:<function callable_job at 0x1142f5580>
2024-Jun-06 14:03:41.00| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.datalayer:170  | Loading vectors of vector-index: 'my-index'
2024-Jun-06 14:03:41.00| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.datalayer:180  | docu.find(documents[0], documents[1])
Loading vectors into vector-table...: 210it [00:00, 5126.62it/s]
2024-Jun-06 14:03:41.04| SUCCESS  | Duncans-MBP.fritz.box| pinnacledb.backends.local.compute:43   | Job submitted on <pinnacledb.backends.local.compute.LocalComputeBackend object at 0x16fad6fd0>.  function:<function callable_job at 0x1142f5580> future:8dc44a22-f2b5-43bc-a410-0328b8bf6fc1
```

```text
([<pinnacledb.jobs.job.ComponentJob at 0x2eb37a7d0>,
  <pinnacledb.jobs.job.FunctionJob at 0x2e8bc2990>],
 VectorIndex(identifier='my-index', uuid='a32aae9a-465c-4041-aa82-ecbebbb4e0fb', indexing_listener=Listener(identifier='my-listener', uuid='f2a5cc60-9308-4146-8370-4c0b787292e3', key='txt', model=SentenceTransformer(preferred_devices=('cuda', 'mps', 'cpu'), device='cpu', identifier='my-embedding', uuid='a34389e8-12bf-4bf3-bca7-bbe5c027d859', signature='*args,**kwargs', datatype=DataType(identifier='my-vec', uuid='61eb1d6c-aa94-4f12-8823-55732675b6ce', encoder=None, decoder=None, info=None, shape=(384,), directory=None, encodable='native', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermediate_type='bytes', media_type=None), output_schema=None, flatten=False, model_update_kwargs={}, predict_kwargs={'show_progress_bar': True}, compute_kwargs={}, validation=None, metric_values={}, num_workers=0, object=SentenceTransformer(
   (0): Transformer({'max_seq_length': 256, 'do_lower_cas
```
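The log above shows the listener's `predict_kwargs={'max_chunk_size': 50}` at work: the 210 documents were embedded in five chunks, four of 50 and a final one of 10. A minimal sketch of that chunking arithmetic follows; `chunk` is a hypothetical helper for illustration, not `pinnacledb` code.

```python
def chunk(items, max_chunk_size):
    """Split `items` into consecutive lists of at most `max_chunk_size` elements."""
    return [items[i:i + max_chunk_size] for i in range(0, len(items), max_chunk_size)]

documents = list(range(210))  # stand-in for the 210 documents in 'docu'
sizes = [len(c) for c in chunk(documents, 50)]
print(sizes)  # [50, 50, 50, 50, 10]
```

Smaller chunks bound how much data is held in memory per prediction batch, at the cost of more round trips to the database.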

Now that we've set up a `VectorIndex`, we can connect this index with an LLM in a number of ways.
A simple way to do that is with the `SequentialModel`. The first part of the `SequentialModel`
executes a query and provides the results to the LLM in the second part. 
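Conceptually, this kind of chaining is function composition: each model's output becomes the next model's input. Here is a minimal pure-Python sketch of that idea; `run_sequential`, `fake_prompt` and `fake_llm` are hypothetical stand-ins, not `pinnacledb` APIs, and the real `SequentialModel` may pass data between steps differently.

```python
from typing import Callable, Sequence

def run_sequential(models: Sequence[Callable], x):
    """Feed `x` through each model in turn; each output becomes the next input."""
    for model in models:
        x = model(x)
    return x

# Hypothetical stand-in for the prompt step: wraps the question in a prompt
def fake_prompt(question: str) -> str:
    return f"FACTS: ...\nQUESTION: {question}"

# Hypothetical stand-in for the LLM step: "answers" by echoing the last prompt line
def fake_llm(prompt: str) -> str:
    return "ANSWER to " + prompt.splitlines()[-1]

print(run_sequential([fake_prompt, fake_llm], "Tell me about vector-indexes"))
# ANSWER to QUESTION: Tell me about vector-indexes
```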

The `RetrievalPrompt` component takes as input a query containing a "free" `Variable`, which is
only filled in (here, with the user's question) when the model is called. This gives users great
flexibility in how they fetch the context for their downstream models.
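To illustrate the idea of a "free" variable, here is a minimal pure-Python sketch: a query template holds a named placeholder that is only bound to a concrete value at call time. `Var` and `bind` are hypothetical helpers written for this illustration; `pinnacledb`'s own `Variable` handling will differ in detail.

```python
from dataclasses import dataclass
from typing import Any, Mapping

@dataclass(frozen=True)
class Var:
    """A named placeholder, standing in for pinnacledb's `Variable` (illustrative only)."""
    name: str

def bind(template: Any, values: Mapping[str, Any]) -> Any:
    """Recursively replace every `Var` in a nested dict/list template with its value."""
    if isinstance(template, Var):
        return values[template.name]
    if isinstance(template, dict):
        return {k: bind(v, values) for k, v in template.items()}
    if isinstance(template, list):
        return [bind(v, values) for v in template]
    return template

# Template mirroring `Document({'txt': Variable('prompt')})`, plus a fixed parameter
query_template = {"txt": Var("prompt"), "n": 5}
print(bind(query_template, {"prompt": "Tell me about vector-indexes"}))
# {'txt': 'Tell me about vector-indexes', 'n': 5}
```

Because `bind` builds a new structure, the same template can be reused for every question.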

We're using OpenAI here, but you can use any type of LLM with `pinnacledb`. We have several
native integrations (see [here](../ai_integraitons/)), and you can also [bring your own model](../models/bring_your_own_model.md).

```python
from pinnacledb import Document
from pinnacledb.base.code import Code
from pinnacledb.base.variables import Variable
from pinnacledb.components.model import SequentialModel
from pinnacledb.ext.llm.prompter import RetrievalPrompt
from pinnacledb.ext.openai import OpenAIChatCompletion

# Vector-search query over 'docu'; `Variable('prompt')` stays "free" until the model is called
q = db['docu'].like(Document({'txt': Variable('prompt')}), vector_index='my-index', n=5).find().limit(10)

# Extract the raw text of each retrieved document to use as context
def get_output(c):
    return [r['txt'] for r in c]

prompt_template = RetrievalPrompt('my-prompt', select=q, postprocess=Code.from_object(get_output))

# Requires OpenAI credentials (typically the OPENAI_API_KEY environment variable)
llm = OpenAIChatCompletion('gpt-3.5-turbo')

# Step 1 builds the prompt from the retrieved documents; step 2 sends it to the LLM
seq = SequentialModel('rag', models=[prompt_template, llm])

db.apply(seq)
```

```text
([],
 SequentialModel(identifier='rag', uuid='c95f5400-c9d4-4cf7-99e3-ec86b685e7f1', signature='**kwargs', datatype=None, output_schema=None, flatten=False, model_update_kwargs={}, predict_kwargs={}, compute_kwargs={}, validation=None, metric_values={}, num_workers=0, models=[RetrievalPrompt(identifier='my-prompt', uuid='5e165724-d78f-47b9-b4e0-f782f21eecf7', signature='**kwargs', datatype=None, output_schema=None, flatten=False, model_update_kwargs={}, predict_kwargs={}, compute_kwargs={}, validation=None, metric_values={}, num_workers=0, preprocess=None, postprocess=Code(identifier='', uuid='9d1e182d-42bc-4ec2-857f-06fb4008dee2', code="from pinnacledb import code\n\n@code\ndef get_output(c):\n    return [r['txt'] for r in c]\n"), select=docu.like(documents[0], vector_index="my-index", n=5).find().limit(10), prompt_explanation="HERE ARE SOME FACTS SEPARATED BY '---' IN OUR DATA REPOSITORY WHICH WILL HELP YOU ANSWER THE QUESTION.", prompt_introduction='HERE IS THE QUESTION WHICH YOU SH
```
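The repr above shows part of the template `RetrievalPrompt` uses, including its `prompt_explanation`, which says the retrieved facts are separated by `'---'`. As a rough sketch only, assuming the facts are joined with that separator and placed between the explanation and the question (the exact layout is not shown here, and the introduction text is truncated above), a prompt could be assembled like this. `build_prompt`, `QUESTION_INTRO` and the sample facts are hypothetical.

```python
# Copied from the `prompt_explanation` shown in the repr above
PROMPT_EXPLANATION = (
    "HERE ARE SOME FACTS SEPARATED BY '---' IN OUR DATA REPOSITORY "
    "WHICH WILL HELP YOU ANSWER THE QUESTION."
)
# Hypothetical placeholder: the real introduction text is truncated above
QUESTION_INTRO = "QUESTION:"

def build_prompt(facts, question):
    """Assemble a RAG prompt: explanation, facts joined by '---', then the question."""
    context = "\n---\n".join(facts)
    return f"{PROMPT_EXPLANATION}\n\n{context}\n\n{QUESTION_INTRO}\n\n{question}"

# In the tutorial, `facts` would be the list of texts returned by `get_output`
facts = ["A VectorIndex wraps a Listener.", "Use `.like` for vector search."]
print(build_prompt(facts, "Tell me about vector-indexes"))
```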

Now we can test the `SequentialModel` with a sample question:

```python
seq.predict('Tell me about vector-indexes')
```

```text
2024-Jun-06 14:04:54.06| INFO     | Duncans-MBP.fritz.box| pinnacledb.base.datalayer:1073 | {}
```

```text
'Vector-indexes in SuperDuperDB allow users to make their data searchable based on vector embeddings generated by models. Users can wrap a `Listener` or a `Model` instance with a `VectorIndex` to enable vector search capabilities. These indexes can be applied to the data layer to make vector-search queries using the `.like` operator. Additionally, the `LanceVectorSearcher` class in the `pinnacledb.vector_search.lance` module provides an implementation of a vector index with specific parameters such as dimensions, seed vectors, index IDs, and similarity measures. By setting up and using vector-indexes, users can enhance the search functionality of their database by enabling searching based on vector similarities.'
```

:::tip
Did you know that you can use any tool from the Python ecosystem with `pinnacledb`?
That includes `langchain` and `llamaindex`, which can be very useful for RAG applications.
:::

Finally, we can bundle the vector index and the RAG model into a single `Stack`, and export it to the local directory `rag`:

```python
stack = Stack('rag', components=[vector_index, seq])
stack.export('rag')
```

The exported `rag/component.json` describes each component of the stack:

```python
!cat rag/component.json | jq .
```

```text
{
  "_base": "?:component:stack/rag/37e587f3-91e8-4a4f-b286-2d32398dd0b3",
  "_leaves": {
    "pinnacledb/components/vector_index/vector/232a2649119b9619411948dc32785b5addb1549a": {
      "_path": "pinnacledb/components/vector_index/vector",
      "shape": 384,
      "identifier": "my-vec"
    },
    ":component:model/my-embedding/a34389e8-12bf-4bf3-bca7-bbe5c027d859": {
      "_path": "pinnacledb/ext/sentence_transformers/model/SentenceTransformer",
      "preferred_devices": [
        "cuda",
        "mps",
        "cpu"
      ],
      ...
```
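If `jq` isn't available, the exported file can also be inspected from Python with the standard `json` module. The sketch below lists each entry under `_leaves` with its `_path`; so that it runs standalone, it parses an inline excerpt of the JSON shown above, trimmed to two leaves. It relies only on the `_base`, `_leaves` and `_path` keys visible in that output.

```python
import json

# Trimmed excerpt of the exported rag/component.json shown above
exported = json.loads("""
{
  "_base": "?:component:stack/rag/37e587f3-91e8-4a4f-b286-2d32398dd0b3",
  "_leaves": {
    "pinnacledb/components/vector_index/vector/232a2649119b9619411948dc32785b5addb1549a": {
      "_path": "pinnacledb/components/vector_index/vector",
      "shape": 384,
      "identifier": "my-vec"
    },
    ":component:model/my-embedding/a34389e8-12bf-4bf3-bca7-bbe5c027d859": {
      "_path": "pinnacledb/ext/sentence_transformers/model/SentenceTransformer"
    }
  }
}
""")

# With the real export instead:
#   with open("rag/component.json") as f:
#       exported = json.load(f)
for key, leaf in exported["_leaves"].items():
    print(f"{leaf['_path']}  <-  {key}")
```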
