# Ask the docs anything about SuperDuperDB

In this notebook we show you how to implement the much-loved document Q&A task, using SuperDuperDB
together with MongoDB.

In [None]:
!pip install superduperdb

In [None]:
import os

if 'OPENAI_API_KEY' not in os.environ:
    raise Exception('Environment variable "OPENAI_API_KEY" not set')

In [1]:
import os
from superduperdb import superduper
from superduperdb.db.mongodb.query import Collection

# Uncomment one of the lines below to connect to your own MongoDB deployment.
# For testing, the default connection falls back to mongomock.

mongodb_uri = os.getenv("MONGODB_URI","mongomock://test")
# mongodb_uri = "mongodb://localhost:27017/documents"
# mongodb_uri = "mongodb://superduper:superduper@mongodb:27017/documents"
# mongodb_uri = "mongodb://<user>:<pass>@<mongo_cluster>/<database>"
# mongodb_uri = "mongodb+srv://<username>:<password>@<atlas_cluster>/<database>"

# Super-Duper your Database!
db = superduper(mongodb_uri, artifact_store='filesystem://./models')

collection = Collection('questiondocs')



In this example we use the text of the `superduperdb` project's own API documentation, with the "meta" goal of
building a chatbot that can answer questions about the very project we are using.

Run the following cell only if you have a local checkout of the superduperdb project and would like to load the latest version of the API documentation.
Otherwise, skip it and load the prepared data with the two cells after it.

In [2]:
import glob

ROOT = '../../superduperdb/docs/hr/content/docs'

STRIDE = 5       # stride in numbers of lines
WINDOW = 10       # length of window in numbers of lines

content = sum([open(file).readlines() 
               for file in glob.glob(f'{ROOT}/*/*.md') 
               + glob.glob(f'{ROOT}/*.md')], [])
chunks = ['\n'.join(content[i: i + WINDOW]) for i in range(0, len(content), STRIDE)]
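To make the `STRIDE` and `WINDOW` settings concrete, here is a minimal sketch of the same sliding-window idea on a toy input. The helper `chunk_lines` and the sample lines are hypothetical and only for illustration. Note one deliberate difference: because `readlines()` keeps each trailing newline, this sketch joins with `''`, whereas the cell above joins with `'\n'` and therefore leaves a blank line between source lines.

```python
def chunk_lines(lines, window, stride):
    """Join overlapping windows of `window` lines, starting a new window every `stride` lines."""
    return [''.join(lines[i:i + window]) for i in range(0, len(lines), stride)]


# Hypothetical toy input: six lines, each ending in a newline like readlines() output
toy_lines = [f'line {n}\n' for n in range(6)]

# With window=4 and stride=2, consecutive chunks share two lines
toy_chunks = chunk_lines(toy_lines, window=4, stride=2)

for chunk in toy_chunks:
    print(repr(chunk))
```

With a stride smaller than the window, each passage appears in more than one chunk, so a sentence that straddles a chunk boundary is still retrievable as a whole. The last chunks are shorter because the window runs past the end of the input.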

In [None]:
!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/superduperdb_docs.json

In [None]:
import json

with open('superduperdb_docs.json') as f:
    chunks = json.load(f)

The chunks of text contain snippets of code as well as explanations,
both of which are useful when building a document Q&A chatbot.

In [3]:
from IPython.display import Markdown
Markdown(chunks[0])

# Mission

At SuperDuperDB, our goal is to massively facilitate and accelerate the developer journey between data and AI models. We aim to:

- Create an **easy-to-use**, **extensible** and **comprehensive** Python framework for integrating AI and
  ML directly to the datastore: to databases, object-storage, data-lakes and data-warehouses.
- Empower developers, data scientists and architects to leverage the vast AI
  **open-source ecosystem** in their datastore deployments.
- Enable ways-of-working with AI and data which **enable scalability** and industrial scale deployment,
  as well as providing easy-to-use tools for the **individual developer**.


As usual we insert the data:

In [4]:
from superduperdb.container.document import Document

db.execute(collection.insert_many([Document({'txt': chunk}) for chunk in chunks]))

INFO:root:Adding model text-embedding-ada-002 to db
INFO:root:Done.
INFO:root:Computing chunk 0/0
0it [00:00, ?it/s]


([ObjectId('653169786de1e5edf99d7824'),
  ObjectId('653169786de1e5edf99d7825'),
  ObjectId('653169786de1e5edf99d7826'),
  ObjectId('653169786de1e5edf99d7827'),
  ObjectId('653169786de1e5edf99d7828'),
  ObjectId('653169786de1e5edf99d7829'),
  ObjectId('653169786de1e5edf99d782a'),
  ObjectId('653169786de1e5edf99d782b'),
  ObjectId('653169786de1e5edf99d782c'),
  ObjectId('653169786de1e5edf99d782d'),
  ObjectId('653169786de1e5edf99d782e'),
  ObjectId('653169786de1e5edf99d782f'),
  ObjectId('653169786de1e5edf99d7830'),
  ObjectId('653169786de1e5edf99d7831'),
  ObjectId('653169786de1e5edf99d7832'),
  ObjectId('653169786de1e5edf99d7833'),
  ObjectId('653169786de1e5edf99d7834'),
  ObjectId('653169786de1e5edf99d7835'),
  ObjectId('653169786de1e5edf99d7836'),
  ObjectId('653169786de1e5edf99d7837'),
  ObjectId('653169786de1e5edf99d7838'),
  ObjectId('653169786de1e5edf99d7839'),
  ObjectId('653169786de1e5edf99d783a'),
  ObjectId('653169786de1e5edf99d783b'),
  ObjectId('653169786de1e5edf99d783c'),


We set up a standard `superduperdb` vector-search index using `openai` (many other options are available
here: `torch`, `sentence_transformers`, `transformers`, ...).

In [5]:
from superduperdb.container.vector_index import VectorIndex
from superduperdb.container.listener import Listener
from superduperdb.ext.openai.model import OpenAIEmbedding

db.add(
    VectorIndex(
        identifier='my-index',
        indexing_listener=Listener(
            model=OpenAIEmbedding(model='text-embedding-ada-002'),
            key='txt',
            select=collection.find(),
        ),
    )
)

{
  "createSearchIndexes": "questiondocs",
  "indexes": [
    {
      "name": "my-index",
      "definition": {
        "mappings": {
          "dynamic": true,
          "fields": {
            "_outputs": {
              "fields": {
                "txt": {
                  "fields": {
                    "text-embedding-ada-002": [
                      {
                        "dimensions": 1536,
                        "similarity": "cosine",
                        "type": "knnVector"
                      }
                    ]
                  },
                  "type": "document"
                }
              },
              "type": "document"
            }
          }
        }
      }
    }
  ]
}


INFO:root:Adding model text-embedding-ada-002 to db
INFO:root:Done.
433it [00:01, 366.37it/s]
100%|██████████| 5/5 [00:06<00:00,  1.27s/it]


[]
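The index definition printed above specifies `cosine` similarity over 1536-dimensional vectors. As a rough illustration of what that ranking means, the sketch below scores a few made-up 3-dimensional vectors against a query in plain Python. The names `cosine_similarity`, `top_n` and `toy_vectors` are hypothetical, and this is not how `superduperdb` or MongoDB implement the search internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vec, doc_vecs, n):
    """Return the ids of the n stored vectors most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:n]


# Made-up embeddings standing in for the 1536-dimensional OpenAI vectors
toy_vectors = {
    'doc-a': [1.0, 0.0, 0.0],
    'doc-b': [0.9, 0.1, 0.0],
    'doc-c': [0.0, 1.0, 0.0],
}

nearest = top_n([1.0, 0.0, 0.0], toy_vectors, n=2)
print(nearest)
```

In the notebook, the listener computes one embedding per `txt` field, and the index then performs this kind of nearest-neighbour lookup over the stored outputs.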

Now we create a chat-completion component, and add this to the system:

In [6]:
from superduperdb.ext.openai.model import OpenAIChatCompletion

chat = OpenAIChatCompletion(
    model='gpt-3.5-turbo',
    prompt=(
        'Use the following description and code-snippets about SuperDuperDB to answer this question about SuperDuperDB\n'
        'Do not use any other information you might have learned about other python packages\n'
        'Only base your answer on the code-snippets retrieved\n'
        '{context}\n\n'
        'Here\'s the question:\n'
    ),
)

db.add(chat)

[]
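The `{context}` placeholder in the prompt is where retrieved chunks end up before the question is appended. The sketch below shows one plausible way such a prompt could be assembled. The function `build_prompt`, the shortened template, and the choice to join chunks with newlines are assumptions for illustration, not the library's actual implementation.

```python
# Shortened, hypothetical version of the prompt template used above
PROMPT_TEMPLATE = (
    'Only base your answer on the code-snippets retrieved\n'
    '{context}\n\n'
    "Here's the question:\n"
)


def build_prompt(template, context_chunks, question):
    """Fill the {context} slot with the retrieved chunks, then append the question."""
    context = '\n'.join(context_chunks)  # assumption: chunks are joined by newlines
    return template.format(context=context) + question


# Made-up retrieved chunks and question
toy_context = ['VectorIndex wraps a Listener.', 'A Listener applies a model to a key.']
toy_question = 'What does a VectorIndex wrap?'

prompt = build_prompt(PROMPT_TEMPLATE, toy_context, toy_question)
print(prompt)
```

Keeping the instructions, the context, and the question in clearly separated sections makes it easier for the chat model to restrict its answer to the retrieved material.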

We can confirm that this is now registered in the system:

In [7]:
print(db.show('model'))

['gpt-3.5-turbo', 'text-embedding-ada-002']


Finally, we can ask questions about the documents, using a query to select the context that is passed to the model.
Because the query runs on MongoDB, vector-search can be combined with
ordinary filtering rules:

In [8]:
from superduperdb.container.document import Document
from IPython.display import display, Markdown

q = 'Can you show me a code-snippet demonstrating setting up a `VectorIndex` component?'

output, context = db.predict(
    model_name='gpt-3.5-turbo',
    input=q,
    context_select=(
        collection
            .like(Document({'txt': q}), vector_index='my-index', n=5)
            .find()
    ),
    context_key='txt',
)

Markdown(output.content)

Certainly! Here's a code snippet demonstrating how to set up a `VectorIndex` component using the SuperDuperDB package:

```python
from superduperdb.container.listener import Listener
from superduperdb.db import DB
from superduperdb.indexes import VectorIndex
from superduperdb.models import OpenAIEmbedding

# Create a listener to keep vectors up-to-date
indexing_listener = Listener(model=OpenAIEmbedding(), key='text', select=collection.find())

# Create a DB instance
db = DB()

# Add a VectorIndex to the DB using the indexing_listener
db.add(VectorIndex('my-index', indexing_listener=indexing_listener))
```

In the code above, `Listener` is used to create an indexing listener that ensures vectors stay up-to-date. It takes a model (in this case, `OpenAIEmbedding`) and a key (`text`) to select the data from the collection. Then, a `DB` instance is created, and a `VectorIndex` named `'my-index'` is added to the DB using the `indexing_listener`.

This code snippet demonstrates the setup process for a `VectorIndex` component in SuperDuperDB.