# Vector-search with SuperDuperDB

In [None]:
!pip install pinnacledb
!pip install sentence_transformers

Set your OpenAI key if it is not already in your `.env` variables by uncommenting the lines below and setting the `OPENAI_API_KEY` environment variable:

In [None]:
#import os
#os.environ['OPENAI_API_KEY'] = 'sk-...'

In [None]:
import os

if 'OPENAI_API_KEY' not in os.environ:
    raise Exception('You need to set an OpenAI key as environment variable: "export OPENAI_API_KEY=sk-..."')

The next cell connects `pinnacledb` to MongoDB. Under the hood, `pinnacledb` sets up configurations
for where to store:
- models
- outputs
- metadata

In addition, `pinnacledb` configures how vector-search is to be performed.

In [None]:
import os

# Uncomment one of the following lines to use a bespoke MongoDB deployment.
# For testing, the default connection is to mongomock.

mongodb_uri = os.getenv("MONGODB_URI", "mongomock://test")
# mongodb_uri = "mongodb://localhost:27017"
# mongodb_uri = "mongodb://pinnacle:pinnacle@mongodb:27017/documents"
# mongodb_uri = "mongodb://<user>:<pass>@<mongo_cluster>/<database>"
# mongodb_uri = "mongodb+srv://<username>:<password>@<atlas_cluster>/<database>"

# Super-Duper your Database!
from pinnacledb import pinnacle
db = pinnacle(mongodb_uri)
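
Before connecting, it can help to confirm which deployment a URI actually points at. The following is a minimal sketch using only the standard library; `describe_uri` is a hypothetical helper, not part of `pinnacledb`, and it assumes that the `mongomock` scheme denotes an in-memory test database, as the comments in the cell above suggest.

```python
from urllib.parse import urlsplit


def describe_uri(uri: str) -> dict:
    """Summarize a MongoDB-style URI without exposing the password."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        # The path component carries the database name, if any
        "database": parts.path.lstrip("/") or None,
        "has_credentials": parts.username is not None,
        # Assumption: the "mongomock" scheme means an in-memory test database
        "is_mock": parts.scheme == "mongomock",
    }


print(describe_uri("mongomock://test"))
print(describe_uri("mongodb://pinnacle:pinnacle@mongodb:27017/documents"))
```

Printing the summary rather than the raw URI keeps credentials out of the notebook output.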

We've prepared some data - it's the inline documentation of the `pymongo` API!

In [None]:
!curl -O https://pinnacledb-public.s3.eu-west-1.amazonaws.com/pymongo.json

We can insert this data into MongoDB using the `pinnacledb` API, which supports `pymongo` commands.

In [None]:
import json
from pinnacledb.backends.mongodb.query import Collection
from pinnacledb import Document as D

with open('pymongo.json') as f:
    data = json.load(f)

In [None]:
data[0]
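
Later cells read the `value`, `parent` and `res` fields of each record, so a quick check that every record carries them can save a confusing error further down. This is a sketch only: `find_incomplete` is a hypothetical helper, and the `sample` records are made up for illustration rather than taken from `pymongo.json`.

```python
from typing import Iterable, List, Mapping, Set, Tuple

# Fields that the search and display cells of this notebook read
REQUIRED_FIELDS = {"value", "parent", "res"}


def find_incomplete(
    records: Iterable[Mapping], required: Set[str] = REQUIRED_FIELDS
) -> List[Tuple[int, Set[str]]]:
    """Return (index, missing_fields) for every record lacking a required field."""
    problems = []
    for index, record in enumerate(records):
        missing = set(required) - set(record)
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical records, for illustration only
sample = [
    {"parent": "pymongo.collection.Collection", "res": "find", "value": "Query the database."},
    {"parent": None, "res": "MongoClient"},
]
print(find_incomplete(sample))
```

On the real data, `find_incomplete(data)` would be the equivalent call.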

In [None]:
db.execute(
    Collection('documents').insert_many([D(r) for r in data])
)

In the remainder of the notebook you can choose between `openai` and `sentence_transformers` to
perform vector-search. Run only one of the next two cells, since each assigns `model`; once the model wrapper is instantiated, the rest of the notebook is identical.

In [None]:
from pinnacledb.ext.openai.model import OpenAIEmbedding

model = OpenAIEmbedding(model='text-embedding-ada-002')

In [None]:
import sentence_transformers
from pinnacledb import Model, vector

model = Model(
    identifier='all-MiniLM-L6-v2',
    object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),
    encoder=vector(shape=(384,)),
    predict_method='encode',
    postprocess=lambda x: x.tolist(),
    batch_predict=True,
)

In [None]:
model.predict('This is a test', one=True)
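
The vector returned above is what gets compared during search. The notebook does not state which similarity measure `pinnacledb` uses, so the sketch below assumes cosine similarity, a common choice for text embeddings; `cosine_similarity` is a hypothetical helper written in plain Python.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a zero vector matches nothing
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Two outputs of `model.predict(..., one=True)` could be passed to this function to see how close two pieces of text are under that assumption.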

Now we can configure the Atlas vector-search index.
This command saves a model and sets it up to "listen" to a particular subfield (or the whole document) for
new text, converting it on the fly to vectors which are then indexed by Atlas vector-search.

In [None]:
from pinnacledb import Listener, VectorIndex

db.add(
    VectorIndex(
        identifier='pymongo-docs',
        indexing_listener=Listener(
            model=model,
            key='value',
            select=Collection('documents').find(),
            predict_kwargs={'max_chunk_size': 1000},
        ),
    )
)

In [None]:
db.show('vector_index')
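
To make the "listening" idea concrete, here is a toy, in-memory sketch of the pattern: each inserted document has one field embedded and stored as a vector. `ToyListener` and `toy_embed` are hypothetical stand-ins invented for illustration; they do not reflect how `pinnacledb` implements `Listener`.

```python
from typing import Callable, Dict, List, Mapping


def toy_embed(text: str) -> List[float]:
    """Placeholder "model": counts of each vowel, not a real embedding."""
    lowered = text.lower()
    return [float(lowered.count(c)) for c in "aeiou"]


class ToyListener:
    """Embeds one field of every inserted document (illustration only)."""

    def __init__(self, embed: Callable[[str], List[float]], key: str):
        self.embed = embed
        self.key = key
        self.vectors: Dict[int, List[float]] = {}

    def on_insert(self, doc_id: int, doc: Mapping) -> None:
        text = doc.get(self.key)
        if text is None:
            return  # nothing to embed for this document
        self.vectors[doc_id] = self.embed(text)


listener = ToyListener(embed=toy_embed, key="value")
listener.on_insert(0, {"value": "Query the database"})
listener.on_insert(1, {"other": "no value field"})
print(listener.vectors)
```

Only the first document ends up with a vector, because the second has no `value` field to listen to.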

Now that the index is set up, we can use it in a query. `pinnacledb` provides some syntactic sugar for
the `aggregate` search pipelines, which can trip developers up. It also handles
all conversion of inputs to vectors under the hood.

In [None]:
from pinnacledb.backends.mongodb import Collection
from pinnacledb import Document as D
from IPython.display import Markdown, display

query = 'Query the database'

result = db.execute(
    Collection('documents')
        .like(D({'value': query}), vector_index='pymongo-docs', n=5)
        .find()
)

display(Markdown('---'))

for r in result:
    display(Markdown(f'### `{r["parent"] + "." if r["parent"] else ""}{r["res"]}`'))
    display(Markdown(r['value']))
    display(Markdown('---'))
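
Conceptually, the `n=5` argument above asks for the five stored vectors that score highest against the query vector. A brute-force version of that ranking can be sketched as follows; `top_n` is a hypothetical helper, the vectors are made up, and the dot product is used as the score purely for brevity. A real index such as Atlas vector-search typically avoids scanning every vector.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product, used here as a simple relevance score."""
    return sum(x * y for x, y in zip(a, b))


def top_n(
    query_vector: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    n: int = 5,
) -> List[Tuple[str, float]]:
    """Return the n (doc_id, score) pairs with the highest scores."""
    scored = ((doc_id, dot(query_vector, v)) for doc_id, v in vectors.items())
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Made-up two-dimensional vectors, for illustration only
vectors = {
    "find": [1.0, 0.0],
    "insert_one": [0.0, 1.0],
    "aggregate": [0.8, 0.6],
}
print(top_n([1.0, 0.0], vectors, n=2))
```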