GitHub - mongodb-developer/movie-vector-embedding-lab

MongoDB Atlas Vector Search Workshop

Looking to power an artificial intelligence with long-term memory that could take over the world? Or maybe something simpler, like retrieval-augmented generation (RAG), semantic search, recommendation engines, or dynamic personalization? It all starts with the ability to search across the vector embeddings of your data.

In this lesson, you will learn how to create vector embeddings to store inside MongoDB Atlas with machine learning models such as the ones provided by OpenAI and Hugging Face. Then you will see how easy it is to implement vector search across your data by using the $vectorSearch operator in the easy-to-learn, ever-extensible MongoDB aggregation framework you already know and love. 💚

With just a few code functions, a sample movie dataset, and a free-forever MongoDB Atlas cluster, you will semantically search for movies based on their plot. Semantic search means we search across data using intent and contextual meaning for more relevant results. For instance, you can search for cop instead of policeman and find both (and then some). Compare this to full-text search, which only allows you to find words by spelling, not the general meaning. You know what I mean. 😉

This project is based on the fantastic YouTube tutorial by Jesse Hall. Click to open the link to the tutorial and code along with Jesse.

Atlas Vector Search works by sending your data, of any type, through an encoder to obtain a vector: simply an array of floats that serves as a numeric representation of that data in n-dimensional space. Each number in the array represents a property of that data object.

Similar data is mapped closer to each other. In the image below, you can see that sports movies and superhero movies map to proximal clusters once vectorized.
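To make "closer" concrete, here is a minimal sketch of cosine similarity, the measure this workshop later selects for its index. The three-dimensional vectors are made up for illustration; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). Values near 1 mean "pointing the same way".
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical toy "embeddings", not produced by any real model.
const sportsMovieA = [0.9, 0.1, 0.0];
const sportsMovieB = [0.8, 0.2, 0.1];
const superheroMovie = [0.1, 0.9, 0.2];

console.log(cosineSimilarity(sportsMovieA, sportsMovieB));   // high: same cluster
console.log(cosineSimilarity(sportsMovieA, superheroMovie)); // lower: different cluster
```

The two sports vectors score much higher against each other than against the superhero vector, which is the clustering effect described above.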

Leveraging the MongoDB document data model, this n-dimensional array is then stored and indexed alongside your other data inside a document.

{
    "_id": "573a1390f29313caabcd5293",
    "title": "Hoosiers",
    "plot": "A coach with a checkered past and a local drunk train a small town ...",
    "year": 1996,
    "plot_embedding": [0.0729123, -0.0268332, -0.0315214, ...]
}

On the read side, you encode your query using the same encoder, and submit that vectorized query via the $vectorSearch aggregation stage to your data. The nearest neighbors in vector space to your query are returned as your search results.
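Conceptually, the read side can be sketched as an exhaustive nearest-neighbor scan. This is only an illustration with made-up three-dimensional vectors: Atlas relies on a vector index rather than comparing the query against every document, and `nearestNeighbors` is a hypothetical helper, not part of the workshop code.

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every document against the query, sort by score, keep the top k.
function nearestNeighbors(queryVector, docs, k) {
  return docs
    .map((doc) => ({ title: doc.title, score: cosine(queryVector, doc.plot_embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hypothetical documents with toy embeddings.
const toyDocs = [
  { title: "Sports movie", plot_embedding: [0.9, 0.1, 0.0] },
  { title: "Superhero movie", plot_embedding: [0.1, 0.9, 0.2] },
  { title: "Another sports movie", plot_embedding: [0.8, 0.2, 0.1] },
];
const toyQuery = [1.0, 0.0, 0.0];

const neighbors = nearestNeighbors(toyQuery, toyDocs, 2);
console.log(neighbors); // the two sports movies, best match first
```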

This workshop is broken down into 4 parts to teach you how to create and perform vector search on your MongoDB Atlas data.

Create vector embeddings from the plots of movie documents using the encoding model all-MiniLM-L6-v2 found on Hugging Face.
Store those vector embeddings alongside your other data fields in your sample movie document.
Index your movie documents using knn, the cosine similarity function, as well as the dimensions of the all-MiniLM-L6-v2 model.
Query for your choice of movie using the $vectorSearch aggregation operator!

Prerequisites

This application was created using:

Node.js
Hugging Face sentence-transformers/all-MiniLM-L6-v2 model
The Atlas sample dataset of sample_mflix.movies

As such, you will need the following:

A MongoDB Atlas account. Get one for free here. See how it is done: https://www.youtube.com/watch?v=jXgJyuBeb_o
A recent version of Node.js and npm.
Atlas sample dataset downloaded from Atlas UI
A HuggingFace access token.

Getting Set Up

Clone the repo.
Navigate into the directory: `cd movie-vector-embedding-lab`
Run `npm install`.
Create a .env file in the root directory with the following environment variables:
MONGODB_CONNECTION_STRING=
HF_ACCESS_TOKEN=
Click the Connect button in the Atlas UI to find your MongoDB Atlas connection string for the Node.js driver. Replace the username and password placeholders with your own before pasting it into your .env file. It should look like this: mongodb+srv://<username>:<password>@Cluster0.ecmzvfs.mongodb.net/?retryWrites=true&w=majority
Add your Hugging Face access token as well. You can obtain an access token from the Hugging Face website by following the steps in the gif below:
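A missing or misspelled variable in the .env file tends to surface later as a confusing connection or authorization error. As a hedged sketch, a small check like the one below can fail fast instead. `requireEnv` is a hypothetical helper that is not part of the repo, and the example passes a stand-in object rather than the real `process.env`.

```javascript
// Returns the requested variables from `env`, or throws listing the missing ones.
function requireEnv(names, env = process.env) {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return Object.fromEntries(names.map((name) => [name, env[name]]));
}

// Stand-in values for illustration only.
const exampleEnv = {
  MONGODB_CONNECTION_STRING: "mongodb+srv://example",
  HF_ACCESS_TOKEN: "hf_example",
};

const config = requireEnv(["MONGODB_CONNECTION_STRING", "HF_ACCESS_TOKEN"], exampleEnv);
console.log(Object.keys(config));
```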

Let's have a look at the main.js file. This is where we will execute all the functionality needed for Atlas Vector Search in this workshop. Notice the access to the environment variables. Line 5 reads your connection string:
`const uri = process.env.MONGODB_CONNECTION_STRING;`
and line 8 reads your Hugging Face access token:
`const hf_token = process.env.HF_ACCESS_TOKEN;`

Testing Atlas Cluster Connection

Let's make sure you have your **.env** file set up with your Atlas cluster connection string by executing the main file in the terminal.
Typing `node main` runs the **main** function, which is called on line 27:
main().catch(console.dir);
In the try statement, the application will ping the client. If successful, the following message will appear in the console:
Pinged deployment. You successfully connected to your MongoDB Atlas cluster.
The app will finally close the connection to the Atlas cluster when finished:
Closing connection.
If this is not working, make sure you have correctly whitelisted your IP address. If you are connecting successfully, we can start searching for vectors!
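For reference, the connect, ping, and close flow described above can be sketched as follows. This assumes main.js follows the usual pattern of pinging the `admin` database, which the README does not show in full. The function takes any object shaped like the Node.js driver's `MongoClient`, and the stand-in client below exists only so the sketch runs without a cluster.

```javascript
// `client` is expected to behave like the Node.js driver's MongoClient.
async function pingDeployment(client) {
  try {
    await client.connect();
    // A ping command succeeds only if the deployment is reachable.
    await client.db("admin").command({ ping: 1 });
    console.log("Pinged deployment. You successfully connected to your MongoDB Atlas cluster.");
    return true;
  } finally {
    // Runs whether or not the ping succeeded.
    console.log("Closing connection.");
    await client.close();
  }
}

// Hypothetical stand-in client that records which methods were called.
const calls = [];
const fakeClient = {
  connect: async () => { calls.push("connect"); },
  db: (name) => ({
    command: async (cmd) => {
      calls.push(`command:${name}:${Object.keys(cmd)[0]}`);
      return { ok: 1 };
    },
  }),
  close: async () => { calls.push("close"); },
};

const pingResult = pingDeployment(fakeClient);
```

Because the close call sits in `finally`, the connection is released even when the ping throws, for example when your IP address is not allowed to connect.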

Let's Get Started!

If you got this far, then you have successfully connected to your Atlas cluster, and we can start creating vector embeddings. Go ahead and comment out line 27 of the main.js file, `main().catch(console.dir);`, since we no longer need to execute that functionality.

All of the code for the following steps can be found in the functionDefinitions.js file.

Step 1: Create vector embeddings for movie plot.

In the functionDefinitions.js file, the `generateEmbeddings` function is on lines 2 - 28.

```javascript
async function generateEmbeddings(text) {
  const data = { inputs: text };
  try {
    const response = await axios({
      url: embeddingUrl,
      method: "POST",
      headers: {
        Authorization: `Bearer ${hf_token}`,
        "Content-Type": "application/json",
        Accept: "application/json",
      },
      data: data,
    });
    if (response.status !== 200) {
      throw new Error(
        `Request failed with status code: ${response.status}: ${response.data}`
      );
    }
    // LOG JUST TO TEST IF EMBEDDINGS ARE RETURNED
    console.log(response.data);
    // IF EMBEDDINGS WORK, UNCOMMENT THE FOLLOWING
    // return response.data;
  } catch (error) {
    console.error(error);
  }
}
```

Copy this function and paste it into your main.js file. Notice that it makes a POST call to the Hugging Face hosted embedding URL using your Hugging Face access token. If successful, the function will log the array of floats to the console.

Test the generateEmbeddings functionality by executing `generateEmbeddings("MongoDB is AWESOME!!!");`

Now re-run the application by typing `node main` in the console.

Et voilà!

Since we see that the embeddings are generated, we will need to return them from the function. Before moving on, let's COMMENT OUT `console.log(response.data);` and let's UNCOMMENT `return response.data;` inside the generateEmbeddings function. Also, let's DELETE `generateEmbeddings("MongoDB is AWESOME!!!")`.
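Note that `generateEmbeddings` logs the error and returns nothing when the request fails, so a caller can end up with `undefined` instead of a vector. A small guard such as the following sketch can catch that before anything is stored. `assertEmbedding` is a hypothetical helper, not part of the repo; the default of 384 matches the dimension count this workshop uses for all-MiniLM-L6-v2.

```javascript
// Throws unless `vector` is an array of `expectedDimensions` finite numbers.
function assertEmbedding(vector, expectedDimensions = 384) {
  if (!Array.isArray(vector)) {
    throw new TypeError("Embedding must be an array of numbers");
  }
  if (vector.length !== expectedDimensions) {
    throw new RangeError(
      `Expected ${expectedDimensions} dimensions but got ${vector.length}`
    );
  }
  if (!vector.every((value) => typeof value === "number" && Number.isFinite(value))) {
    throw new TypeError("Embedding must contain only finite numbers");
  }
  return vector;
}

// Stand-in vector for illustration; a real one comes from generateEmbeddings.
const sampleEmbedding = new Array(384).fill(0.01);
console.log(assertEmbedding(sampleEmbedding).length);
```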
Step 2: Store newly acquired plot embeddings directly in your movie documents.

In the functionDefinitions.js file, the `saveEmbeddings` function is on lines 31 - 52.

```javascript
async function saveEmbeddings() {
  try {
    await client.connect();
    const db = client.db("sample_mflix");
    const collection = db.collection("movies");
    const docs = await collection
      .find({ plot: { $exists: true }, genres: "Horror" })
      .limit(100)
      .toArray();
    for (let doc of docs) {
      doc.plot_embedding_hf = await generateEmbeddings(doc.plot);
      await collection.replaceOne({ _id: doc._id }, doc);
      console.log(`Updated ${doc._id}`);
    }
  } finally {
    console.log("Closing connection.");
    await client.close();
  }
}
```

Copy this function and paste it into your `main.js` file. Notice this function will look in the sample_mflix.movies collection for 100 scary movies 🧟 with a plot field.

`const docs = await collection.find({ plot: { $exists: true }, genres: "Horror" }).limit(100).toArray();`

Feel free to change the filter to look for other movie types that suit you. Comedy movies can be fun, too. 🎭 🤣

For each of these 100 movies, this function will use the recently created `generateEmbeddings` function to obtain vectorized embeddings for the plot field and save them in a new `plot_embedding_hf` field before replacing the movie document.

Execute this function by pasting the call `saveEmbeddings();` in the main.js file.

Now re-run the application by typing `node main` in the console.

You should see the updated documents being logged in the console. Inside the Atlas UI, you can use the Data Explorer in the Collections tab to filter for movies with your new vectorized plot fields using the filter: `{plot_embedding_hf:{$exists:true}}`

Before moving to the next step, COMMENT OUT the call that executes saveEmbeddings: `saveEmbeddings();`
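`replaceOne` rewrites the whole document just to add one field. As an alternative sketch, and not the workshop's own method, the driver's `updateOne` with `$set` touches only the new field, and a guard can skip movies whose embedding request failed. `saveEmbedding` and the stand-in objects below are hypothetical and exist only so the example runs without a cluster or an access token.

```javascript
// Adds plot_embedding_hf to one document; returns false if no embedding came back.
async function saveEmbedding(collection, doc, embed) {
  const embedding = await embed(doc.plot);
  if (!Array.isArray(embedding)) {
    console.warn(`Skipping ${doc._id}: no embedding returned`);
    return false;
  }
  // $set leaves every other field of the document untouched.
  await collection.updateOne(
    { _id: doc._id },
    { $set: { plot_embedding_hf: embedding } }
  );
  return true;
}

// Stand-ins: a collection that records updates and an embedder that fails on empty text.
const updates = [];
const fakeCollection = {
  updateOne: async (filter, update) => { updates.push({ filter, update }); },
};
const fakeEmbed = async (text) => (text ? [0.1, 0.2, 0.3] : undefined);

const saved = Promise.all([
  saveEmbedding(fakeCollection, { _id: 1, plot: "A haunted house." }, fakeEmbed),
  saveEmbedding(fakeCollection, { _id: 2, plot: "" }, fakeEmbed),
]);
saved.then((flags) => console.log(flags, updates.length));
```

In main.js, `collection` would be the `movies` collection and `embed` would be `generateEmbeddings`.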
Step 3: Create a vector index on the plot embedding field leveraging the Atlas UI.

Now that we have the plots of 100 different movies vectorized and stored as an array of floats, we will need to index the new `plot_embedding_hf` fields before we can search through them. Still in our Atlas UI on the Collections tab:

- Go to Search Indexes
- Click Create Search Index
- Under Atlas Vector Search, use the JSON editor
- Name the index `vectorIndex`, the name the query in Step 4 refers to
- From the `indexDefinition.txt` file, copy the index definition:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "plot_embedding_hf",
      "numDimensions": 384,
      "similarity": "cosine"
    }
  ]
}
```

Notice this vector type index will use the cosine similarity function, which is great for mapping text data, and 384 dimensions, the length of the vector arrays produced by Hugging Face's `all-MiniLM-L6-v2` encoding model. With this definition, "plot_embedding_hf" is the only field indexed.
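If you prefer code over the Atlas UI, recent versions of the Node.js driver expose `collection.createSearchIndex`, which accepts a `type` of `"vectorSearch"`. The sketch below assumes such a driver version and is not part of the workshop. `createVectorIndex` is a hypothetical wrapper, and the stand-in collection only records the request so the example runs without a cluster. Whatever name you pass must match the `index` value used by the query in Step 4.

```javascript
// Requests a vector search index equivalent to the JSON definition above.
async function createVectorIndex(collection, name, path, numDimensions) {
  return collection.createSearchIndex({
    name,
    type: "vectorSearch",
    definition: {
      fields: [{ type: "vector", path, numDimensions, similarity: "cosine" }],
    },
  });
}

// Stand-in collection that records the index description it receives.
const requested = [];
const stubCollection = {
  createSearchIndex: async (description) => {
    requested.push(description);
    return description.name;
  },
};

const created = createVectorIndex(stubCollection, "vectorIndex", "plot_embedding_hf", 384);
created.then((indexName) => console.log(`Requested index: ${indexName}`));
```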
Step 4: Search semantically with the `$vectorSearch` aggregation operator.

We are finally ready to use `$vectorSearch` to search for that horror flick whose name is on the tip of our tongue... You know the one... 🤔

Find the `queryEmbeddings` function in functionDefinitions.js and paste it into the `main` file.

```javascript
async function queryEmbeddings(query) {
  try {
    await client.connect();
    const db = client.db("sample_mflix");
    const collection = db.collection("movies");
    const vectorizedQuery = await generateEmbeddings(query);
    const results = await collection.aggregate([
      {
        $vectorSearch: {
          index: "vectorIndex",
          queryVector: vectorizedQuery,
          path: "plot_embedding_hf",
          numCandidates: 100,
          limit: 8,
        },
      },
      {
        $project: {
          _id: 0,
          title: 1,
          plot: 1,
        },
      },
    ]).toArray();
    console.log(results);
  } finally {
    console.log("Closing connection.");
    await client.close();
  }
}
```

Notice the function parameter called "query." This is the description of the movie we provide. In order to perform vector search, we need to vectorize that description using the `generateEmbeddings` function and store the resulting vector in the constant `vectorizedQuery`. We have to vectorize the query using the same `all-MiniLM-L6-v2` embedding model we used to vectorize our movie plots in order to compare them.

Now we can run an aggregation on the `sample_mflix.movies` collection. The 1st stage uses the `$vectorSearch` operator along with our `vectorIndex` to search for our query in the `plot_embedding_hf` path and returns the closest 8 matches. The 2nd stage uses `$project` to return to the client only the title and the plot fields. We then convert the matching movie results from a cursor to an array before printing them to the console.

Without further ado, let's search for a good horror flick by calling: `queryEmbeddings("enormous creatures attacking earth");`

Now returning to the terminal, type `node main` one last time. Drumroll, please!
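To see how close each match is, the `$project` stage can also expose the similarity score through `$meta: "vectorSearchScore"`. The sketch below only builds the pipeline array, so it runs without a cluster. `buildVectorSearchPipeline` is a hypothetical helper, not part of the repo, and the short query vector is a placeholder for the 384 floats returned by `generateEmbeddings`.

```javascript
// Builds the same two-stage pipeline as queryEmbeddings, plus a score field.
function buildVectorSearchPipeline(queryVector, limit = 8) {
  return [
    {
      $vectorSearch: {
        index: "vectorIndex",
        queryVector,
        path: "plot_embedding_hf",
        numCandidates: 100, // candidates considered before the top `limit` are kept
        limit,
      },
    },
    {
      $project: {
        _id: 0,
        title: 1,
        plot: 1,
        score: { $meta: "vectorSearchScore" },
      },
    },
  ];
}

// Placeholder vector for illustration only.
const pipeline = buildVectorSearchPipeline([0.1, 0.2, 0.3], 4);
console.log(JSON.stringify(pipeline, null, 2));
```

Passing the result to `collection.aggregate(pipeline).toArray()` inside `queryEmbeddings` would return each movie with its score alongside the title and plot.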

No additional servers or software needed. No need to keep data in sync. Everything is done in MongoDB Atlas.

If you have any questions or feedback about this repo, feel free to create an Issue or PR in this repo.

Also please join our online MongoDB Community to interact with our product and engineering teams along with thousands of other MongoDB and Realm users.

Have fun and happy coding!
