<a href="https://colab.research.google.com/github/osaeed-ds/vector-hello/blob/main/Osaeed_Mongo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MongoDB Atlas as a Vector Database**
This is a hello world exercise based on the Vector Search quickstart on the MongoDB website.

## **Prerequisites**

In [None]:
%pip install pymongo sentence-transformers datasets

## **Connect to Atlas**
Need to setup db in UI first, and import the sample dataset.  Then grab your MongoDB URI

In [8]:
from getpass import getpass
# Input your Mongo Client URI
MONGODB_URI = getpass('Your MongoDB URI: ')

In [10]:
import pymongo

client = pymongo.MongoClient(MONGODB_URI)
db = client.sample_mflix
collection = db.movies

## **Embedding Engine**
We will use sentence_transformers to generate our embeddings.  MongoDB's example uses Hugging Face, but I ran into some issues with the sample python code provided by Mongo.

In [22]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print(f"You are using {device}. This is much slower than using "
          "a CUDA-enabled GPU. If on Colab you can change this by "
          "clicking Runtime > Change runtime type > GPU.")

model = SentenceTransformer('sangmini/msmarco-cotmae-MiniLM-L12_en-ko-ja', device=device)
model

You are using cpu. This is much slower than using a CUDA-enabled GPU. If on Colab you can change this by clicking Runtime > Change runtime type > GPU.


Downloading (…)590e6/.gitattributes: 100%|██████████| 1.58k/1.58k [00:00<00:00, 4.26MB/s]
Downloading (…)_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 485kB/s]
Downloading (…)/2_Dense/config.json: 100%|██████████| 115/115 [00:00<00:00, 307kB/s]
Downloading pytorch_model.bin: 100%|██████████| 2.37M/2.37M [00:00<00:00, 7.82MB/s]
Downloading (…)32435590e6/README.md: 100%|██████████| 2.41k/2.41k [00:00<00:00, 6.31MB/s]
Downloading (…)435590e6/config.json: 100%|██████████| 701/701 [00:00<00:00, 1.50MB/s]
Downloading (…)ce_transformers.json: 100%|██████████| 124/124 [00:00<00:00, 297kB/s]
Downloading (…)v.tsv.gz_results.csv: 100%|██████████| 513/513 [00:00<00:00, 1.24MB/s]
Downloading (…)v.tsv.gz_results.csv: 100%|██████████| 510/510 [00:00<00:00, 1.25MB/s]
Downloading (…)n-en.txt_results.csv: 100%|██████████| 3.31k/3.31k [00:00<00:00, 12.3MB/s]
Downloading (…)v.tsv.gz_results.csv: 100%|██████████| 385/385 [00:00<00:00, 1.12MB/s]
Downloading (…)v.tsv.gz_results.csv: 100%|█████

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 384, 'out_features': 1536, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)

In [23]:
def generate_embedding(text):
  return model.encode(text).tolist()

In [24]:
generate_embedding ("AstraDB is Awesome")

[0.089430071413517,
 -0.06920432299375534,
 0.09861543029546738,
 0.11942709237337112,
 0.11436723172664642,
 -0.30147692561149597,
 -0.01487584039568901,
 0.3228060305118561,
 -0.08128578215837479,
 -0.12116913497447968,
 0.026900317519903183,
 0.07737396657466888,
 0.06480281054973602,
 0.11849971860647202,
 0.27250516414642334,
 0.08069834113121033,
 0.18955343961715698,
 -0.09829431772232056,
 0.1343601942062378,
 -0.2579577565193176,
 -0.09847153723239899,
 0.0248506348580122,
 0.06210676208138466,
 0.10260748118162155,
 -0.02691064216196537,
 -0.12934879958629608,
 -0.044433075934648514,
 -0.04372734948992729,
 -0.2497698813676834,
 0.15587151050567627,
 0.18668486177921295,
 0.14014063775539398,
 -0.04893195256590843,
 -0.20854319632053375,
 -0.14901547133922577,
 -0.05616605281829834,
 -0.2470139116048813,
 -0.07211043685674667,
 -0.17724701762199402,
 -0.1234440729022026,
 -0.20385140180587769,
 -0.0900653749704361,
 0.09927608072757721,
 0.10438819974660873,
 -0.2969210743904

## **Generate Embeddings for our Movie dataset**
Geneate the embeddings, create the vector index, then query

In [25]:
for doc in collection.find({'plot':{"$exists": True}}).limit(50):
	doc['plot_embedding_hf'] = generate_embedding(doc['plot'])
	collection.replace_one({'_id': doc['_id']}, doc)

At this stage you have to manually create the vector index in the Atlas UI.  Then you query.

In [30]:
query = "Movie about football, overcoming obsticles, and winning."

queryEmbedding = generate_embedding(query)

display(queryEmbedding)

[0.027141794562339783,
 -0.08355841785669327,
 -0.21839921176433563,
 -0.01808204874396324,
 0.18326720595359802,
 0.06405038386583328,
 -0.008068809285759926,
 0.2195902019739151,
 -0.17345398664474487,
 -0.12317837029695511,
 -0.19032196700572968,
 0.20500041544437408,
 0.12821975350379944,
 -0.033515505492687225,
 0.301954448223114,
 0.15881311893463135,
 0.1838749349117279,
 -0.04693184792995453,
 0.09843286871910095,
 -0.1907309591770172,
 -0.163761168718338,
 0.013262473978102207,
 0.15579754114151,
 0.08998873829841614,
 0.14344806969165802,
 -0.1465996652841568,
 -0.01141342893242836,
 -0.02900363877415657,
 -0.23538421094417572,
 0.0798146203160286,
 0.013410077430307865,
 -0.00012876372784376144,
 0.08953499048948288,
 -0.20670276880264282,
 -0.1513492316007614,
 -0.031548868864774704,
 -0.48853519558906555,
 -0.19288000464439392,
 -0.18662671744823456,
 -0.17667663097381592,
 -0.2602727711200714,
 -0.21210359036922455,
 -0.2372191995382309,
 -0.01388615183532238,
 -0.0361194

In [31]:


results = collection.aggregate([
    {
        '$search': {
            "index": "default",
            "knnBeta": {
                "vector":queryEmbedding,
                "k": 4,
                "path": "plot_embedding_hf"}
        }
    }
])

for document in results:
    print(f'Movie Name: {document["title"]},\nMovie Plot: {document["plot"]}\n')

Movie Name: Dames,
Movie Plot: Multi-millionaire Ezra Ounce wants to start a campaign against 'filthy' forms of entertainment, like Broadway-Shows. He comes to his relatives families and makes them members of his ...

Movie Name: Tabu: A Story of the South Seas,
Movie Plot: In his final film, F.W. Murnau presents the tale of two young lovers on the idyllic island of Bora Bora in the South Pacific. Their life is shattered when the old warrior declares the girl ...

Movie Name: Dante's Inferno,
Movie Plot: Jim Carter moves in on the McWade's carnival concession which shows scenes from Dante's "Inferno". He makes it a going concern, marrying Betty along the way. An inspector calls the ...

Movie Name: Hallelujah,
Movie Plot: In a juke joint, sharecropper Zeke falls for a beautiful dancer, Chick, but she's only setting him up for a rigged craps game. He loses $100, the money he got for the sale of his family's ...

