# Turn your classical database into a vector database with SuperDuperDB

In this notebook we show how you can use SuperDuperDB to turn your classical database into a vector-search database.

In this example, we'll use the `sentence_transformers` and `pinnacledb` Python packages.
We'll also be accessing the OpenAI API. To get these working, you'll need to install the packages:

In [None]:
!pip install sentence-transformers
!pip install pinnacledb

Then set `OPENAI_API_KEY` as an environment variable:

In [1]:
import os
os.environ['OPENAI_API_KEY'] = '<YOUR-OPENAI-KEY>'
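Before calling the OpenAI API, it can help to fail early if `OPENAI_API_KEY` is missing or still holds the placeholder text. The following is a minimal sketch using only the standard library; `require_env` is a hypothetical helper written for this notebook, not part of `pinnacledb` or the OpenAI client.

```python
import os


def require_env(name, placeholder_prefix="<"):
    """Return the value of environment variable `name`, or raise if unusable.

    Hypothetical helper: treats an empty value, or one that still looks
    like a '<PLACEHOLDER>', as an error.
    """
    value = os.environ.get(name, "")
    if not value or value.startswith(placeholder_prefix):
        raise RuntimeError(f"Environment variable {name} is not set to a real value")
    return value
```

Calling `require_env('OPENAI_API_KEY')` right after the cell above would raise until the placeholder is replaced with a real key.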

To access SuperDuperDB, we pass our database connection (here a MongoDB URI) to the `pinnacle` function.
This turns your database into a **super-duper** database:

In [2]:
import os

# Uncomment one of the following lines to use a bespoke MongoDB deployment
# For testing the default connection is to mongomock

mongodb_uri = os.getenv("MONGODB_URI","mongomock://test")
# mongodb_uri = "mongodb://localhost:27017"
# mongodb_uri = "mongodb://pinnacle:pinnacle@mongodb:27017/documents"
# mongodb_uri = "mongodb://<user>:<pass>@<mongo_cluster>/<database>"
# mongodb_uri = "mongodb+srv://<username>:<password>@<atlas_cluster>/<database>"

# Super-Duper your Database!
from pinnacledb import pinnacle
db = pinnacle(mongodb_uri)

INFO:numexpr.utils:NumExpr defaulting to 8 threads.
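The connection string used above defaults to the `mongomock` scheme for testing and can be swapped for a real MongoDB URI. As a rough illustration of what each part of such a URI carries, the sketch below parses it with the standard library; `describe_uri` is a hypothetical helper and says nothing about how `pinnacledb` itself interprets the string.

```python
import os
from urllib.parse import urlparse


def describe_uri(uri):
    """Summarize a MongoDB-style URI without exposing credentials."""
    parsed = urlparse(uri)
    return {
        "scheme": parsed.scheme,
        "host": parsed.hostname,
        # The path component, if present, names the database.
        "database": parsed.path.lstrip("/") or None,
        "is_mock": parsed.scheme == "mongomock",
    }


# Same default as the cell above: fall back to the in-memory test backend.
uri = os.getenv("MONGODB_URI", "mongomock://test")
print(describe_uri(uri))
```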


In this notebook we upload some Wikipedia documents from a Wikipedia dump. You can find the raw data at https://dumps.wikimedia.org/enwiki/.

We've preprocessed the data, extracting titles and abstracts from each document. We can use this as a test bed for search, by simulating a "typed query" using the title, and indexing the document based on the abstracts only.
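To make the test-bed idea concrete, the sketch below scores how often a title, used as a query, retrieves its own abstract among the top `k` results. It is only an illustration: `bow_vector` is a toy word-count stand-in for a real embedding model, and `recall_at_k` and the two sample records are hypothetical.

```python
import math
from collections import Counter


def bow_vector(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recall_at_k(records, k=1):
    """Fraction of titles whose own abstract ranks in the top k results."""
    index = [bow_vector(r["abstract"]) for r in records]  # index abstracts only
    hits = 0
    for i, r in enumerate(records):
        query = bow_vector(r["title"])  # the title simulates a typed query
        ranked = sorted(range(len(index)), key=lambda j: cosine(query, index[j]), reverse=True)
        hits += i in ranked[:k]
    return hits / len(records)


toy = [
    {"title": "Anjos", "abstract": "Anjos is a former parish in Lisbon"},
    {"title": "Table tennis", "abstract": "Table tennis is a racket sport"},
]
print(recall_at_k(toy, k=1))
```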

In [4]:
!curl -O https://pinnacledb-public.s3.eu-west-1.amazonaws.com/wikipedia-sample.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6546k  100 6546k    0     0  2039k      0  0:00:03  0:00:03 --:--:-- 2045k


In [6]:
import json
import random

#with open(f'{os.environ["HOME"]}/data/wikipedia/abstracts.json') as f:
#    data = json.load(f)
with open('wikipedia-sample.json') as f:
    data = json.load(f)
data = random.sample(data, 1000)
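Because `random.sample` draws a different subset on every run, the records and search results you see may differ from the outputs shown in this notebook. If reproducibility matters, one option is a seeded generator. The helper `sample_records` and the seed value are assumptions for illustration, not something the original notebook uses.

```python
import random


def sample_records(records, n, seed=0):
    """Draw up to n records with a private, seeded random generator."""
    rng = random.Random(seed)  # does not disturb the global random state
    return rng.sample(records, min(n, len(records)))


# The same seed always selects the same subset.
subset = sample_records(list(range(10_000)), 1000, seed=42)
print(len(subset))
```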

Here's a snapshot of the data:

In [7]:
data[:2]

[{'title': 'Anjos',
  'abstract': 'Anjos is a former parish (freguesia) in the municipality of Lisbon, Portugal. At the administrative reorganization of Lisbon on 8 December 2012 it became part of the parish Arroios.'},
 {'title': 'Nathaniel Dexter',
  'abstract': 'Nathaniel Dexter, "town father" of Lancaster, Massachusetts, USA, donated what is known as Dexter Drumlin to The Trustees of Reservations.Town of Lancaster, Massachusetts: Conservation The Trustees of Reservations" Dexter Drumlin After his death he was hailed as "a beloved supporter of The Trustees and active'}]

We now insert the data into MongoDB using the SuperDuperDB client:

In [8]:
from pinnacledb.db.mongodb.query import Collection

collection = Collection(name='wikipedia')

In [9]:
from pinnacledb.container.document import Document

db.execute(collection.insert_many([Document(r) for r in data]))

INFO:root:found 0 uris


(<pymongo.results.InsertManyResult at 0x7fcfeaae8730>,
 TaskWorkflow(database=<pinnacledb.db.base.db.DB object at 0x7fcfea270670>, G=<networkx.classes.digraph.DiGraph object at 0x7fcff96a0610>))

We can verify that the documents are in the database:

In [10]:
r = db.execute(collection.find_one())
r.unpack()

{'title': 'Anjos',
 'abstract': 'Anjos is a former parish (freguesia) in the municipality of Lisbon, Portugal. At the administrative reorganization of Lisbon on 8 December 2012 it became part of the parish Arroios.',
 '_fold': 'train',
 '_id': ObjectId('651e84ff9d74c4da2361c9ad')}

Creating a vector-index in SuperDuperDB involves two things:

- Creating a model which is used to compute vectors (in this case `OpenAIEmbedding`)
- Attaching this model to a key via a `Listener`, so that when new data are added, the value under that key is vectorized automatically
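
The two ingredients above can be pictured with a toy example. This is a conceptual sketch only: `ToyListener` and `toy_embed` are hypothetical names, and the real `Listener` in `pinnacledb` is configured quite differently, as the cells further down show.

```python
class ToyListener:
    """Hypothetical stand-in for a listener: vectorizes one key of each new record."""

    def __init__(self, embed, key):
        self.embed = embed   # the "model": any callable mapping text to a vector
        self.key = key       # which field of the record to vectorize
        self.vectors = {}    # record id -> vector

    def on_insert(self, records):
        """Called whenever new records arrive; stores a vector per record."""
        for record in records:
            self.vectors[record["_id"]] = self.embed(record[self.key])


def toy_embed(text):
    # Not a real embedding: just (character count, word count).
    return [float(len(text)), float(len(text.split()))]


listener = ToyListener(toy_embed, key="abstract")
listener.on_insert([{"_id": 1, "title": "Anjos", "abstract": "A former parish"}])
print(listener.vectors)
```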

SuperDuperDB supports Sentence Transformers through the `Model` wrapper, which lets the chosen model
communicate directly with SuperDuperDB. The `encoder` argument specifies how the model's outputs
are saved in the `Datalayer`.

In [11]:
import sentence_transformers
from pinnacledb.container.model import Model
from pinnacledb.ext.numpy.array import array

model = Model(
    identifier='all-MiniLM-L6-v2',
    object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),
    encoder=array('float32', shape=(384,)),
    predict_method='encode',
    batch_predict=True,
)

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /var/folders/jt/hrc4w0jj3fdcz0hfhg15fq0m0000gn/T/tmpvvl_jrt2
INFO:torch.distributed.nn.jit.instantiator:Writing /var/folders/jt/hrc4w0jj3fdcz0hfhg15fq0m0000gn/T/tmpvvl_jrt2/_remote_module_non_scriptable.py
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu
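
To give a feel for what an encoder for fixed-shape `float32` vectors has to do, here is a sketch that packs a vector into bytes and unpacks it again with a shape check. How `pinnacledb` actually serializes arrays is not shown in this notebook, so treat `encode_vector` and `decode_vector` as hypothetical illustrations built on the standard library.

```python
from array import array as typed_array


def encode_vector(values):
    """Pack a list of floats into float32 bytes."""
    return typed_array("f", values).tobytes()


def decode_vector(blob, dim):
    """Unpack float32 bytes and check the expected dimensionality."""
    out = typed_array("f")
    out.frombytes(blob)
    if len(out) != dim:
        raise ValueError(f"expected {dim} values, got {len(out)}")
    return list(out)


blob = encode_vector([0.5, -1.0, 2.0])
print(len(blob), decode_vector(blob, dim=3))
```

Fixing the shape up front, as `shape=(384,)` does in the cell above, lets a mismatch between model and index be caught when decoding rather than at search time.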


SuperDuperDB also has built-in support for OpenAI, shown below. Other API clients, such as the Cohere
client, can be integrated using the `Model` wrapper. Note that running this cell replaces the `model` defined above.

In [12]:
from pinnacledb.ext.openai.model import OpenAIEmbedding

model = OpenAIEmbedding(model='text-embedding-ada-002')

We can test our model (whichever we've chosen) like this:

In [13]:
model.predict('This is a test', one=True)

[-0.008059182204306126,
 -0.003603511257097125,
 -0.000528058095369488,
 -0.005753727629780769,
 -0.024468205869197845,
 0.016131576150655746,
 -0.014929304830729961,
 -0.004634029697626829,
 -0.0009636337636038661,
 -0.03445630520582199,
 0.015920188277959824,
 0.01726778782904148,
 -0.008997217752039433,
 0.0022311382927000523,
 0.008713165298104286,
 1.3005340406380128e-05,
 0.02448141761124134,
 0.0005771893775090575,
 0.008336629718542099,
 -0.007444834802299738,
 0.005446553695946932,
 0.0075637404806911945,
 -0.011547090485692024,
 0.02483813464641571,
 -0.028352467343211174,
 -0.02319987490773201,
 0.0035044229589402676,
 -0.03522258996963501,
 0.019421307370066643,
 -0.009941860102117062,
 0.021878696978092194,
 -0.0173470601439476,
 0.001747257076203823,
 -0.0363323800265789,
 0.0007807332440279424,
 -0.012676697224378586,
 -0.010609054937958717,
 -0.01729421131312847,
 0.00801954697817564,
 -0.010886501520872116,
 0.009162365458905697,
 0.016686471179127693,
 0.0071475696749
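
One cheap sanity check on an embedding like the one above is that it has the expected length and contains only finite numbers. The sketch below is illustrative; `check_embedding` is a hypothetical helper, and the expected dimension must be supplied by you (the Sentence Transformers model above is configured with 384).

```python
import math


def check_embedding(vector, expected_dim):
    """Validate an embedding and return its Euclidean norm."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    if not all(math.isfinite(x) for x in vector):
        raise ValueError("embedding contains NaN or infinite values")
    return math.sqrt(sum(x * x for x in vector))


# Tiny hand-written vector; in the notebook you would pass the output
# of model.predict('This is a test', one=True) instead.
print(check_embedding([3.0, 4.0], expected_dim=2))
```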

We've verified that our model produces vector outputs. Now let's add search functionality using this model:

In [14]:
from pinnacledb.container.vector_index import VectorIndex
from pinnacledb.container.listener import Listener

db.add(
    VectorIndex(
        identifier=f'wiki-index-{model.identifier}',
        indexing_listener=Listener(
            model=model,
            key='abstract',
            select=collection.find(),
            predict_kwargs={'max_chunk_size': 1000},
        ),
        compatible_listener=Listener(
            model=model,
            key='title',
            select=collection.find(),
            active=False,
        ),
    )
)

INFO:root:Adding model text-embedding-ada-002 to db
INFO:root:Done.
1000it [00:00, 26381.93it/s]


Computing chunk 0/1


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:25<00:00,  2.51s/it]
INFO:root:loading hashes: 'wiki-index-text-embedding-ada-002'
Loading vectors into vector-table...: 1000it [00:02, 377.11it/s]


[]

The above command creates several components in a single call, each of which we can inspect with `db.show`:

- *model*
- *listener*
- *vector_index*

In [15]:
db.show('model')

['text-embedding-ada-002']

In [16]:
db.show('listener')

['text-embedding-ada-002/abstract', 'text-embedding-ada-002/title']

In [17]:
db.show('vector_index')

['wiki-index-text-embedding-ada-002']

We can now test a few vector searches. To combine vector search with your classical database
(in this case MongoDB), prepend a similarity comparison via `like` to the standard query.

The item inside `like` is vectorized and compared with the stored vectors. For this to work, the keys in the
first parameter to `like` must match those configured in the `Listener` instances inside the `VectorIndex`. The results are then filtered
using the classical query part:

In [18]:
cur = db.execute(
    collection
        .like({'title': 'articles about sport'}, n=10, vector_index=f'wiki-index-{model.identifier}')
        .find({}, {'title': 1})
)

for r in cur:
    print(r)

INFO:root:loading hashes: 'wiki-index-text-embedding-ada-002'
Loading vectors into vector-table...: 1000it [00:02, 401.94it/s]


Document({'title': 'Table tennis at the 2012 Summer Olympics', '_id': ObjectId('651e84ff9d74c4da2361ca10'), '_score': 0.8001081208590772})
Document({'title': 'Czech Republic at the 2018 Summer Youth Olympics', '_id': ObjectId('651e84ff9d74c4da2361cae4'), '_score': 0.7847875707473778})
Document({'title': 'Finnish pesäpallo match-fixing scandal', '_id': ObjectId('651e84ff9d74c4da2361cbd2'), '_score': 0.788250935132824})
Document({'title': "Athletics at the 2015 Parapan American Games – Men's 100 metres T35", '_id': ObjectId('651e84ff9d74c4da2361cc0c'), '_score': 0.7908315465743161})
Document({'title': "Swimming at the 2010 Summer Youth Olympics – Boys' 100 metre backstroke", '_id': ObjectId('651e84ff9d74c4da2361cc13'), '_score': 0.7861748711634465})
Document({'title': "Judo at the 2010 Asian Games – Women's 63 kg", '_id': ObjectId('651e84ff9d74c4da2361cc2e'), '_score': 0.7844640915718223})
Document({'title': 'Hungary national football team results (2010–2019)', '_id': ObjectId('651e84ff9

The benefit of this combination is demonstrated by the following query, which restricts the 100 nearest neighbors to titles matching a regular expression:

In [19]:
cur = db.execute(
    collection
        .like({'title': 'articles about sport'}, n=100, vector_index=f'wiki-index-{model.identifier}')
        .find({'title': {'$regex': '.*Australia'}})
)

for r in cur:
    print(r['title'])

2018 Australian Men's Curling Championship
Candidates of the 1990 Australian federal election
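
The two-stage behavior described above (similarity first, classical filter second) can be sketched in plain Python. This is a conceptual illustration under simplifying assumptions, not how `pinnacledb` or MongoDB execute the query: `like_then_find`, the two-dimensional vectors, and the result ordering are all hypothetical.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def like_then_find(query_vec, docs, n, title_pattern):
    """Keep the n most similar docs, then apply a regex filter on the title."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    regex = re.compile(title_pattern)
    return [d["title"] for d in ranked[:n] if regex.search(d["title"])]


docs = [
    {"title": "2018 Australian Men's Curling Championship", "vector": [0.9, 0.1]},
    {"title": "Table tennis at the 2012 Summer Olympics", "vector": [1.0, 0.0]},
    {"title": "Candidates of the 1990 Australian federal election", "vector": [0.1, 0.9]},
]
print(like_then_find([1.0, 0.0], docs, n=2, title_pattern=".*Australia"))
```

With a small `n`, the filter only sees the closest matches; raising `n` lets less similar documents through to the filter, which is consistent with the election article appearing in the `n=100` results above.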
