# Ask the docs anything about SuperDuperDB

In [1]:
import os
os.environ['OPENAI_API_KEY'] = '<YOUR-OPENAI-API-KEY>'

In [2]:
from pinnacledb import pinnacle
from pinnacledb.db.mongodb.query import Collection
import pymongo

db = pymongo.MongoClient().documents
db = pinnacle(db)

collection = Collection('questiondocs')

In [4]:
import glob

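STRIDE = 5        # number of lines to advance between consecutive chunks
WINDOW = 10       # number of lines in each chunk

# Read all markdown files in the parent directory and its immediate subdirectories into one flat list of lines
content = sum([open(file).readlines() for file in glob.glob('../*/*.md') + glob.glob('../*.md')], [])
# Overlapping windows: each chunk shares WINDOW - STRIDE lines with the next one
chunks = ['\n'.join(content[i: i + WINDOW]) for i in range(0, len(content), STRIDE)]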
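
The cell above chunks the docs with a sliding window over lines. The sketch below isolates that idea on toy data so the overlap is easy to inspect. `sliding_window_chunks` is a hypothetical helper name, not part of pinnacledb, and it assumes the input lines have already had their trailing newlines stripped.

```python
def sliding_window_chunks(lines, window=10, stride=5):
    """Join `lines` into chunks of up to `window` lines, starting a new chunk every `stride` lines."""
    return ['\n'.join(lines[i: i + window]) for i in range(0, len(lines), stride)]


# Toy input: 12 short lines instead of real markdown files
lines = [f'line {n}' for n in range(12)]
chunks = sliding_window_chunks(lines, window=10, stride=5)

print(len(chunks))   # windows start at lines 0, 5 and 10
print(chunks[-1])    # the final window is shorter than `window`
```

With `stride` smaller than `window`, consecutive chunks overlap, so a passage cut off at the end of one chunk still appears whole in a neighbouring one.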

In [5]:
from IPython.display import Markdown
Markdown(chunks[2])

- We have data in production populated by users accessing a popular website, and which sends JSON records to MongoDB, with references to web URLs hosted on a separate image server.
- Each record contains some data left behind by users which may be useful for training a classification model.

Given this data, we would like to accomplish the following:

- We would like to use our data hosted in MongoDB to train a model to classify images
- We want to use the probabilistic estimates for the classifications in a production scenario

To do this, we need to be able to implement these high level steps:

In [6]:
from pinnacledb.container.document import Document

db.execute(collection.insert_many([Document({'txt': chunk}) for chunk in chunks]))

INFO:root:found 0 uris


(<pymongo.results.InsertManyResult at 0x110ecc490>,
 TaskWorkflow(database=<pinnacledb.db.base.db.DB object at 0x19b248dc0>, G=<networkx.classes.digraph.DiGraph object at 0x110edb460>))

In [7]:
db.execute(collection.find_one())

Document({'_id': ObjectId('64d750f0606e13d1ad232b38'), 'txt': "# Common issues in AI-data development\n\n\n\nTraditionally, AI development and databases have lived in separate silo-ed worlds, which \n\nonly interact as an afterthought at the point where a production system is required to \n\napply an AI model to a row or table in a database and store and serve the resulting predictions.\n\n\n\nLet's see how this can play out in practice.\n\n\n\nSuppose our situation is as follows:\n\n\n", '_fold': 'train'})

In [8]:
from pinnacledb.container.vector_index import VectorIndex
from pinnacledb.container.listener import Listener
from pinnacledb.ext.openai.model import OpenAIEmbedding

db.add(
    VectorIndex(
        identifier='my-index',
        indexing_listener=Listener(
            model=OpenAIEmbedding(model='text-embedding-ada-002'),
            key='txt',
            select=collection.find(),
        ),
    )
)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.81s/it]
INFO:root:loading hashes: 'my-index'
Loading vectors into vector-table...: 375it [00:00, 924.50it/s]


[]

In [36]:
from pinnacledb.ext.openai.model import OpenAIChatCompletion

chat = OpenAIChatCompletion(
    model='gpt-3.5-turbo',
    prompt=(
        'Use the following description and code-snippets about SuperDuperDB to answer this question about SuperDuperDB\n'
        'Do not use any other information you might have learned about other python packages\n'
        'Only base your answer on the code-snippets retrieved\n'
        '{context}\n\n'
        'Here\'s the question:\n'
    ),
)

db.add(chat)

print(db.show('model'))

['gpt-3.5-turbo', 'text-embedding-ada-002']


In [37]:
db.show('model', 'gpt-3.5-turbo')

[0, 1, 2]

In [50]:
from pinnacledb.container.document import Document
from IPython.display import display, Markdown


q = 'Can you give me a code-snippet to set up a `VectorIndex`?'

output, context = db.predict(
    model='gpt-3.5-turbo',
    input=q,
    context_select=(
        collection
            .like(Document({'txt': q}), vector_index='my-index', n=5)
            .find()
    ),
    context_key='txt',
)

Markdown(output.content)

Sure! Here's a code snippet to set up a `VectorIndex`:

```python
from pinnacledb.container.vector_index import VectorIndex
from pinnacledb.core.listener import listener

# First, define a listener to keep vectors up-to-date
indexing_listener = listener(model=OpenAIEmbedding(), key='text', select=collection.find())

# Then, create a VectorIndex and link it with the indexing listener
db.add(VectorIndex('my-index', indexing_listener=indexing_listener))
```

This code snippet sets up a `VectorIndex` named `'my-index'` and associates it with an indexing listener that uses a model called `OpenAIEmbedding`. The indexing listener will ensure that the vectors stay up-to-date.
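
The prompt passed to `OpenAIChatCompletion` contains a `{context}` placeholder, and `db.predict` is given a `context_select` query plus a `context_key`. The sketch below shows one plausible way such a prompt could be assembled from retrieved chunks. It is an assumption for illustration only: `build_prompt` is a hypothetical helper, and pinnacledb's internal formatting may differ.

```python
# Shortened stand-in for the prompt template used in the notebook
PROMPT = (
    'Only base your answer on the code-snippets retrieved\n'
    '{context}\n\n'
    "Here's the question:\n"
)


def build_prompt(template, context_chunks, question):
    """Fill `{context}` with the retrieved chunks, then append the question (hypothetical helper)."""
    context = '\n'.join(context_chunks)
    return template.format(context=context) + question


# Stand-ins for the 'txt' values of the documents returned by the vector search
retrieved = ['chunk about VectorIndex', 'chunk about Listener']
question = 'Can you give me a code-snippet to set up a `VectorIndex`?'

full_prompt = build_prompt(PROMPT, retrieved, question)
print(full_prompt)
```

Because the model is told to rely only on the retrieved snippets, the quality of the answer depends directly on which chunks the vector search returns.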