mongodb-developer/movie-vector-embedding-lab

MongoDB Atlas Vector Search Workshop

Looking to power an artificial intelligence with long term memory that could take over the world? Or maybe something simpler like retrieval augmented generation (RAG), semantic search, recommendation engines, or dynamic personalization. It all starts with the ability to search across the vector embeddings of your data.

In this lesson, you will learn how to create vector embeddings to store inside MongoDB Atlas with machine learning models such as the ones provided by OpenAI and Hugging Face. Then you will see how easy it is to implement vector search across your data by using the $vectorSearch stage in the easy-to-learn, ever-extensible MongoDB aggregation framework you already know and love. 💚

With just a few functions, a sample movie dataset, and a free-forever MongoDB Atlas cluster, you will semantically search for movies based on their plot. Semantic search means searching across data using intent and contextual meaning for more relevant results. For instance, you can search for "cop" instead of "policeman" and find both (and then some). Compare this to full-text search, which only allows you to find words by spelling, not by general meaning. You know what I mean. 😉

This project is based on the fantastic YouTube tutorial by Jesse Hall. Click to open the link to the tutorial and code along with Jesse.


Atlas Vector Search works by sending your data of any type through an encoder to obtain a vector, simply an array of floats, as a numeric representation of that data in n-dimensional space. Each number in the array captures some property of that data object.

Similar data is mapped closer to each other. In the image below, you can see that sports movies and superhero movies map to proximal clusters once vectorized.

Leveraging the MongoDB document data model, this n-dimensional array is then stored and indexed alongside your other data inside a document.

{
    "_id": "573a1390f29313caabcd5293",
    "title": "Hoosiers",
    "plot": "A coach with a checkered past and a local drunk train a small town ...",
    "year": 1996,
    "plot_embedding": [0.0729123, -0.0268332, -0.0315214, ...]
}

On the read side, you encode your query using the same encoder, and submit that vectorized query via the $vectorSearch aggregation stage to your data. The nearest neighbors in vector space to your query are returned as your search results.
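To make the idea of "nearest neighbors in vector space" concrete, here is a minimal sketch of a brute-force search using cosine similarity. The 3-dimensional vectors, the titles, and the helper names cosineSimilarity and nearestNeighbors are made up for illustration. Atlas itself searches a vector index rather than scanning every document like this.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest neighbors: score every document, sort by score, keep the top k.
function nearestNeighbors(queryVector, docs, k) {
  return docs
    .map((doc) => ({
      title: doc.title,
      score: cosineSimilarity(queryVector, doc.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hypothetical 3-dimensional embeddings, invented for this example only.
const toyDocs = [
  { title: "Sports movie A", embedding: [0.9, 0.1, 0.0] },
  { title: "Sports movie B", embedding: [0.8, 0.2, 0.1] },
  { title: "Superhero movie", embedding: [0.1, 0.9, 0.2] },
];

const toyResults = nearestNeighbors([1, 0, 0], toyDocs, 2);
console.log(toyResults);
```

The two sports movies score highest because their vectors point in nearly the same direction as the query vector.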

This workshop is broken down into 4 parts to teach you how to create and perform vector search on your MongoDB Atlas data.

  • Create vector embeddings from the plots of movie documents using the encoding model all-MiniLM-L6-v2 found on Hugging Face.
  • Store those vector embeddings alongside your other data fields in your sample movie document.
  • Index your movie documents using knn, the cosine similarity function, as well as the dimensions of the all-MiniLM-L6-v2 model.
  • Query for your choice of movie using the $vectorSearch aggregation stage!

Prerequisites

This application was created using:
  • Node.js

  • Hugging Face sentence-transformers/all-MiniLM-L6-v2 model

  • The Atlas sample dataset of sample_mflix.movies

As such, you will need the following:

Getting Set Up

  1. Clone the repo.
  2. Navigate into the directory: cd movie-vector-embedding-lab
  3. Run npm install .
  4. Create a .env file in the root directory with the following environment variables:
    MONGODB_CONNECTION_STRING=
    HF_ACCESS_TOKEN=
  5. Click the Connect button in the Atlas UI to find your MongoDB Atlas connection string for the Node.js driver. Replace the username and password placeholders with your own before pasting it into your .env file. It should look like this: mongodb+srv://<username>:<password>@Cluster0.ecmzvfs.mongodb.net/?retryWrites=true&w=majority
  6. Add your Hugging Face access token as well. You can obtain an access token from the Hugging Face website by following the steps in the gif below:

Let's have a look at the main.js file. This is where we will execute all the functionality needed for Atlas Vector Search in this workshop. Notice the access to the environment variables on line 5:
const uri = process.env.MONGODB_CONNECTION_STRING;
and on line 8 for your Hugging Face access token:
const hf_token = process.env.HF_ACCESS_TOKEN;
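If either variable is missing, the failure only shows up later as a connection or authorization error. As a small optional sketch (the helper name findMissingEnv is hypothetical and not part of the repo), you could check for both variables up front:

```javascript
// Returns the names of required environment variables that are missing or empty.
// The default list matches the two variables this workshop's .env file defines.
function findMissingEnv(
  env,
  required = ["MONGODB_CONNECTION_STRING", "HF_ACCESS_TOKEN"]
) {
  return required.filter((name) => !env[name] || env[name].trim() === "");
}

const missing = findMissingEnv(process.env);
if (missing.length > 0) {
  console.warn(`Missing environment variables: ${missing.join(", ")}`);
}
```

The function takes the environment as a parameter so it can be tried with a plain object as well as with process.env.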

Testing Atlas Cluster Connection

Let's make sure you have your **.env** file set up with your Atlas cluster connection string by executing the main file in the terminal.
Typing node main will execute the **main** function on line 27.
main().catch(console.dir);
In the try statement, the application will ping the client. If successful, the following message will appear in the console:
Pinged deployment. You successfully connected to your MongoDB Atlas cluster.
The app will finally close the connection to the Atlas cluster when finished:
Closing connection.
If this is not working, make sure you have correctly whitelisted your IP address. If you are connecting successfully, we can start searching for vectors!

Let's Get Started!

If you got this far, then you have successfully connected to your Atlas cluster, and we can start creating vector embeddings. Go ahead and comment out the call main().catch(console.dir); on line 27 of the main.js file, since we no longer need to execute that functionality.

All of the code for the following steps can be found in the functionDefinitions.js file.

Step 1: Create vector embeddings for movie plot.

In the **functionDefinitions.js** file, the generateEmbeddings function is on lines 2 - 28.
async function generateEmbeddings(text) {
    const data = { inputs: text };
    try {
        const response = await axios({
            url: embeddingUrl,
            method: "POST",
            headers: {
                Authorization: `Bearer ${hf_token}`,
                "Content-Type": "application/json",
                Accept: "application/json",
            },
            data: data,
        });
        if (response.status !== 200) {
            throw new Error(
                `Request failed with status code: ${response.status}: ${response.data}`
            );
        }
        // LOG JUST TO TEST IF EMBEDDINGS ARE RETURNED
        console.log(response.data);

        // IF EMBEDDINGS WORK, UNCOMMENT THE FOLLOWING
        // return response.data;

    } catch (error) {
        console.error(error);
    }
}


Copy this function and paste it into your main.js file.
Notice that it makes a POST call to the Hugging Face hosted embedding URL using your Hugging Face access token. If successful, the function will log the array of floats to the console.

Test the generateEmbeddings functionality by executing generateEmbeddings("MongoDB is AWESOME!!!");

Now re-run the application by typing node main in the console.

Et voilà!

Since we see that the embeddings are generated, we will need to return them from the function. Before moving on, let's COMMENT OUT console.log(response.data); and let's UNCOMMENT return response.data; inside the generateEmbeddings function. Also, let's DELETE generateEmbeddings("MongoDB is AWESOME!!!").
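Because generateEmbeddings catches errors and logs them, it returns undefined when a request fails. A minimal sketch of a sanity check you could run on the returned value is shown below. The helper name isValidEmbedding is hypothetical, and the expected length of 384 assumes the all-MiniLM-L6-v2 model used in this workshop.

```javascript
// all-MiniLM-L6-v2 produces 384-dimensional vectors (see the index definition in Step 3).
const EXPECTED_DIMENSIONS = 384;

// True only for a flat array of finite numbers with the expected length.
function isValidEmbedding(embedding, dimensions = EXPECTED_DIMENSIONS) {
  return (
    Array.isArray(embedding) &&
    embedding.length === dimensions &&
    embedding.every(
      (value) => typeof value === "number" && Number.isFinite(value)
    )
  );
}

console.log(isValidEmbedding(new Array(384).fill(0.1))); // a well-formed vector
console.log(isValidEmbedding(undefined)); // what a failed request would return
```

Checking the value before saving it avoids storing an empty or wrongly sized field that the vector index could not use.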

Step 2: Store newly acquired plot embeddings directly in your movie documents.

In the functionDefinitions.js file, the saveEmbeddings function is on lines 31 - 52.
async function saveEmbeddings() {
  try {
    await client.connect();

    const db = client.db("sample_mflix");
    const collection = db.collection("movies");

    const docs = await collection
      .find({ plot: { $exists: true }, genres: "Horror" })
      .limit(100)
      .toArray();

    for (let doc of docs) {
      doc.plot_embedding_hf = await generateEmbeddings(doc.plot);
      await collection.replaceOne({ _id: doc._id }, doc);
      console.log(`Updated ${doc._id}`);
    }
  } finally {
    console.log("Closing connection.");
    await client.close();
  }
}

Copy this function and paste it into your main.js file.
Notice this function will look in the sample_mflix.movies collection for 100 scary movies 🧟 with a plot field.
const docs = await collection.find({ plot: { $exists: true }, genres: "Horror" }).limit(100).toArray();

Feel free to change the filter to look for other movie types that suit you. Comedies can be fun, too. 🎭 🤣
For each of these 100 movies, this function will use the recently created generateEmbeddings function to obtain vectorized embeddings for the plot field and save them in a new plot_embedding_hf field before replacing the movie document.
Execute this function by pasting the call:
saveEmbeddings(); in the main.js file.

Now re-run the application by typing node main in the console.

You should see the updated documents being logged in the console.
Inside the Atlas UI, you can use the Data Explorer in the Collections tab to filter for movies with your new vectorized plot fields using the filter:{plot_embedding_hf:{$exists:true}}

Before moving to the next step, COMMENT OUT the call saveEmbeddings();
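The saveEmbeddings function above rewrites each whole document with replaceOne. As an alternative sketch, not the approach used in the repo, you could set only the new field with an updateOne operation. The helper name buildEmbeddingUpdate is hypothetical. The object it returns follows the operation shape accepted by the Node.js driver's collection.bulkWrite method.

```javascript
// Builds one bulkWrite operation that sets only the embedding field,
// leaving every other field of the movie document untouched.
function buildEmbeddingUpdate(doc, embedding, field = "plot_embedding_hf") {
  return {
    updateOne: {
      filter: { _id: doc._id },
      update: { $set: { [field]: embedding } },
    },
  };
}

// Illustrative values only: a shortened vector and the _id from the sample document above.
const exampleOp = buildEmbeddingUpdate(
  { _id: "573a1390f29313caabcd5293" },
  [0.1, 0.2]
);
console.log(JSON.stringify(exampleOp));
```

Collecting these operations in an array and passing them to bulkWrite once would send the updates in a single batch instead of one round trip per movie.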

Step 3: Create a vector index on the plot embedding field leveraging the Atlas UI.
Now that we have the plots of 100 different movies vectorized and stored as an array of floats, we will need to index the new plot_embedding_hf fields before we can search through them.
Still in our Atlas UI on the Collections tab:
  • Go to Search Indexes.
  • Click Create Search Index.
  • Under Atlas Vector Search, use the JSON editor.
  • Name the index vectorIndex, the name that the queryEmbeddings function in Step 4 expects.
  • From the indexDefinition.txt file, copy the index definition:
{
  "fields": [
    {
      "type": "vector",
      "path": "plot_embedding_hf",
      "numDimensions": 384,
      "similarity": "cosine"
    }
  ]
}

Notice this vector type index will use the cosine similarity function, which is well suited to comparing text embeddings, and 384 dimensions, the length of the vectors produced by Hugging Face's all-MiniLM-L6-v2 encoding model.

With this definition, "plot_embedding_hf" is the only field indexed.
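If you later switch to a different embedding model, numDimensions has to change with it. The sketch below builds the same definition from parameters so that the values live in one place. The helper name buildVectorIndexDefinition is hypothetical, and the list of similarity functions reflects the three options Atlas Vector Search documents (cosine, euclidean, and dotProduct).

```javascript
// Similarity functions accepted in an Atlas Vector Search index definition.
const SUPPORTED_SIMILARITIES = ["cosine", "euclidean", "dotProduct"];

// Builds an index definition with a single vector field.
function buildVectorIndexDefinition(path, numDimensions, similarity = "cosine") {
  if (!Number.isInteger(numDimensions) || numDimensions <= 0) {
    throw new Error("numDimensions must be a positive integer");
  }
  if (!SUPPORTED_SIMILARITIES.includes(similarity)) {
    throw new Error(`Unsupported similarity function: ${similarity}`);
  }
  return {
    fields: [{ type: "vector", path, numDimensions, similarity }],
  };
}

// Reproduces the definition used in this step.
console.log(
  JSON.stringify(buildVectorIndexDefinition("plot_embedding_hf", 384), null, 2)
);
```

The printed JSON can be pasted into the JSON editor in the Atlas UI.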

Step 4: Search semantically with the $vectorSearch aggregation stage.
We are *finally* ready to use $vectorSearch to search for that horror flick whose name is on the tip of our tongue... You know the one... 🤔
Find the queryEmbeddings function in the functionDefinitions.js file and paste it into the main.js file.
async function queryEmbeddings(query) {
    try {
        await client.connect();
        const db = client.db("sample_mflix");
        const collection = db.collection("movies");

        const vectorizedQuery = await generateEmbeddings(query);

        const results = await collection.aggregate([
            {
                $vectorSearch: {
                    index: "vectorIndex",
                    queryVector: vectorizedQuery,
                    path: "plot_embedding_hf",
                    numCandidates: 100,
                    limit: 8,
                },
            },
            {
                $project: {
                    _id: 0,
                    title: 1,
                    plot: 1,
                },
            },
        ]).toArray();
        console.log(results);

    } finally {
        console.log("Closing connection.");
        await client.close();
    }
}
Notice the function parameter called "query." This is the description of the movie we provide. In order to perform vector search, we need to vectorize that description query using the generateEmbeddings function and store those vectors in the constant vectorizedQuery. We have to vectorize the query using the same all-MiniLM-L6-v2 embedding model we used to vectorize our movie plots in order to compare them.

Now we can run an aggregation on the sample_mflix.movies collection.
  • The 1st stage uses $vectorSearch along with our vectorIndex to search for our query in the plot_embedding_hf path and returns the closest 8 matches, as set by limit.
  • The 2nd stage uses $project to return to the client only the title and the plot fields.
We then convert the matching movies results from a cursor to an array before printing them to the console.
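To experiment with different limits or to see how close each match is, the pipeline can be built by a small helper. This is a sketch under assumptions: the name buildVectorSearchPipeline is hypothetical, and the defaults mirror the values used in queryEmbeddings. The score field uses the vectorSearchScore metadata that $vectorSearch makes available to later stages.

```javascript
// Builds the two-stage aggregation pipeline described above.
// Defaults mirror queryEmbeddings: index "vectorIndex", 100 candidates, 8 results.
function buildVectorSearchPipeline(
  queryVector,
  {
    index = "vectorIndex",
    path = "plot_embedding_hf",
    numCandidates = 100,
    limit = 8,
  } = {}
) {
  if (limit > numCandidates) {
    throw new Error("limit must not exceed numCandidates");
  }
  return [
    { $vectorSearch: { index, path, queryVector, numCandidates, limit } },
    {
      $project: {
        _id: 0,
        title: 1,
        plot: 1,
        // Similarity score of each match, exposed by $vectorSearch.
        score: { $meta: "vectorSearchScore" },
      },
    },
  ];
}

// A shortened, made-up query vector, just to show the resulting shape.
const examplePipeline = buildVectorSearchPipeline([0.1, 0.2, 0.3], { limit: 4 });
console.log(JSON.stringify(examplePipeline, null, 2));
```

Inside queryEmbeddings, the returned array could be passed to collection.aggregate in place of the inline pipeline.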

Without further ado, let's search for a good horror flick by calling:

queryEmbeddings("enormous creatures attacking earth");

Now returning to the terminal, type node main one last time.
Drumroll, please!


No additional servers or software needed. No need to keep data in sync. Everything is done in MongoDB Atlas.

If you have any questions or feedback about this repo, feel free to create an Issue or PR in this repo.

Also please join our online MongoDB Community to interact with our product and engineering teams along with thousands of other MongoDB and Realm users.

Have fun and happy coding!
