# Vectorstores and embeddings

We'll continue walking through the steps of a flow to set up retrieval augmented generation. As a refresher, here are the steps we previously outlined:

1. Load documents from a source.
2. Split the docs into chunks small enough to fit into an LLM's context window and avoid distraction.
3. Embed the chunks and store them in a vectorstore to allow for later retrieval based on input queries.
4. Retrieve relevant previously split chunks.
5. Generate a final output with the retrieved chunks as context.

![](./static/images/rag_diagram.png)

The previous lesson covered various ways of loading and splitting documents. Now we'll dive into the next step of our document preparation pipeline: storage in a vectorstore.

A vectorstore is a specialized type of database with natural language search capabilities. We'll show how to embed our previously split document chunks so that we can later take advantage of those capabilities.

## Vectorstore ingestion

Adding documents to a vectorstore is commonly called ingestion. It generally involves using another type of ML model called a text embeddings model to convert our document contents into a representation called a vector, which the vectorstore can then search over.

For this lesson, we will use OpenAI's hosted embeddings and an in-memory vectorstore. For production deployments, you'll likely want to use a cloud solution which you can access from web environments. However, you can use any combination of vectorstore and embeddings you prefer.

First, let's look at an embeddings model in isolation to get a sense of what it does. We'll embed a simple string:

In [5]:
import "dotenv/config";

[Module: null prototype] { default: {} }

In [6]:
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

const embeddings = new OpenAIEmbeddings();

await embeddings.embedQuery("This is some sample text.");

[
  -0.0042987, 0.0006434934, -0.0007414519, -0.007843242, -0.009226957,
  0.015607789, -0.012984631, -0.002354284, -0.016866904, -0.01878181,
  0.0010131947, 0.028146483, -0.0073186103, 0.0006717743, 0.004226563,
  0.00719401, 0.023660883, 0.0021608262, 0.010892662, -0.010715599,
  -0.0034101051, 0.0062923, -0.0046331524, 0.016591473, -0.010669694,
  -0.007633389, 0.0010550013, -0.013535494, 0.009856516, -0.0039970367,
  0.012663295, -0.017089874, -0.0022493578, -0.016316041, 0.0035871682,
  0.008794136, -0.0019788446, -0.011240231, 0.026244694,
  ...

The result is a vector in the form of an array of numbers.

You can think of these generated numbers as capturing various abstract features of the embedded text. Search then becomes a matter of finding the stored vectors that are closest to the vector of the query.

For a concrete example, let's use a JavaScript library to compare similarity between some different embeddings:

In [23]:
import { similarity } from "ml-distance";

const vector1 = await embeddings.embedQuery("What are vectors useful for in machine learning?");
const unrelatedVector = await embeddings.embedQuery("A group of parrots is called a pandemonium.");

similarity.cosine(vector1, unrelatedVector);

0.6962144676957391

Now, let's compare two more closely related texts and see what their similarity score is:

In [24]:
const similarVector = await embeddings.embedQuery("Vectors are representation of information.");

similarity.cosine(vector1, similarVector);

0.8600749683959877

The score is higher because the two texts are about closely related subjects.
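
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so vectors pointing in the same direction score close to 1. The sketch below computes it by hand on made-up three-dimensional vectors, not real embeddings. The helper name `cosineSimilarity` is ours, and `ml-distance` may differ in implementation details:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Hypothetical helper for illustration, not the ml-distance implementation.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors, made up for illustration (real embeddings have many more dimensions).
const query = [1, 2, 3];
const sameDirection = [2, 4, 6]; // query scaled by 2
const orthogonal = [3, 0, -1]; // dot product with query is 0

console.log(cosineSimilarity(query, sameDirection)); // very close to 1
console.log(cosineSimilarity(query, orthogonal)); // 0
```

Because the score depends only on direction and not on length, scaling a vector does not change its similarity to another vector.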

Next, we'll prepare our documents using the techniques covered in the previous lesson. Let's set the chunk size small for demo purposes:

In [3]:
// Peer dependency
import * as parse from "pdf-parse";
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const loader = new PDFLoader("./static/docs/MachineLearning-Lecture01.pdf");

const rawCS229Docs = await loader.load();

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 128,
  chunkOverlap: 0,
});

const splitDocs = await splitter.splitDocuments(rawCS229Docs);
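
To make the `chunkSize` and `chunkOverlap` options concrete, here is a deliberately simplified fixed-size splitter. It is only a sketch of what the two numbers mean: `splitByCharacters` is a hypothetical helper, and `RecursiveCharacterTextSplitter` behaves differently because it prefers to split on separators such as paragraph and line breaks:

```typescript
// Simplified fixed-size splitter (hypothetical, not the LangChain implementation).
// Each chunk has at most `chunkSize` characters, and consecutive chunks
// share `chunkOverlap` characters.
function splitByCharacters(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const sample = "abcdefghij";
console.log(splitByCharacters(sample, 4, 0)); // ["abcd", "efgh", "ij"]
console.log(splitByCharacters(sample, 4, 2)); // ["abcd", "cdef", "efgh", "ghij"]
```

With an overlap of 0, as in the cell above, no text is repeated between neighboring chunks.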

Next, let's initialize our vectorstore.

Note that we pass in an embeddings model on initialization. The LangChain vectorstore implementation will use it to generate vector representations for each added document's content:


In [4]:
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const vectorstore = new MemoryVectorStore(embeddings);

And finally add the documents to our vectorstore!

In [5]:
await vectorstore.addDocuments(splitDocs);

And we've now got a populated, searchable vectorstore!

Because LangChain vectorstores expose an interface for searching directly with a natural language query, we can immediately try it and see what results we get:

In [6]:
// Retrieve 4 documents
const retrievedDocs = await vectorstore.similaritySearch("What is deep learning?", 4);

const pageContents = retrievedDocs.map((doc) => doc.pageContent);

pageContents

[
  "piece of research in machine learning, okay?",
  "are using a learning algorithm, perhaps without even being aware of it.",
  "some of my own excitement about machine learning to you.",
  "of the class, and then we'll start to talk a bit about machine learning."
]

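We can see that the returned chunks are the ones closest in meaning to the query: they all touch on machine learning, even though none of them mentions deep learning by name.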
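
Conceptually, `addDocuments` embeds each chunk and stores the resulting vector, while `similaritySearch` embeds the query and returns the `k` closest chunks. The toy store below sketches that flow under heavy simplifying assumptions: `toyEmbed` is a hypothetical word-count stand-in for a real embeddings model, and `ToyVectorStore` is our own illustration, not LangChain's `MemoryVectorStore`:

```typescript
type Doc = { pageContent: string };

// Hypothetical stand-in for an embeddings model: counts occurrences of a few
// fixed vocabulary words. Real models return dense vectors learned from data.
const VOCABULARY = ["machine", "learning", "parrots", "vectors"];
function toyEmbed(text: string): number[] {
  const words = text.toLowerCase().split(/\W+/);
  return VOCABULARY.map((term) => words.filter((w) => w === term).length);
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

class ToyVectorStore {
  private entries: { doc: Doc; vector: number[] }[] = [];

  // Ingestion: embed each document once and keep the vector next to it.
  addDocuments(docs: Doc[]): void {
    for (const doc of docs) {
      this.entries.push({ doc, vector: toyEmbed(doc.pageContent) });
    }
  }

  // Search: embed the query, score every stored vector, return the top k.
  similaritySearch(query: string, k: number): Doc[] {
    const queryVector = toyEmbed(query);
    return this.entries
      .map((e) => ({ doc: e.doc, score: cosine(queryVector, e.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k)
      .map((e) => e.doc);
  }
}

const store = new ToyVectorStore();
store.addDocuments([
  { pageContent: "A group of parrots is called a pandemonium." },
  { pageContent: "Machine learning models often operate on vectors." },
  { pageContent: "This class is about machine learning." },
]);

const results = store.similaritySearch("What is machine learning?", 2);
console.log(results.map((doc) => doc.pageContent));
// [
//   "This class is about machine learning.",
//   "Machine learning models often operate on vectors."
// ]
```

Scanning every stored vector is fine for a small demo like this one. Larger collections typically rely on dedicated vectorstores with indexes built for fast nearest-neighbor search.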

## Retrievers

Vectorstore search is just one way to fetch data for an LLM. LangChain encapsulates this with a broader `Retriever` abstraction that returns documents related to a given natural language query.

We can instantiate a retriever from our vectorstore with a simple function call:

In [7]:
const retriever = vectorstore.asRetriever();

One nice trait of retrievers is that, unlike vectorstores, they implement `.invoke()`. They are themselves Expression Language runnables, so they can be chained with other modules:

In [8]:
await retriever.invoke("What is deep learning?")

[
  Document {
    pageContent: "piece of research in machine learning, okay?",
    metadata: {
      source: "./static/docs/MachineLearning-Lecture01.pdf",
      pdf: {
        version: "1.10.100",
        info: {
          PDFFormatVersion: "1.4",
          IsAcroFormPresent: false,
          IsXFAPresent: false,
          Title: "",
          Author: "",
          Creator: "PScript5.dll Version 5.2.2",
          Producer: "Acrobat Distiller 8.1.0 (Windows)",
          CreationDate: "D:20080711112523-07'00'",
          ModDate: "D:20080711112523-07'00'"
        },
        metadata: Metadata { _metadata: [Object: null prototype] },
        totalPages: 22
      },
      loc: { pageNumber: 8, lines: { from: 2, to: 2 } }
    }
  },
  Document {
    pageContent: "are using a learning algorithm, perhaps without even b...
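
To see why a shared `.invoke()` method makes chaining possible, here is a stripped-down sketch. The `Step` interface, the `pipe` helper, and the `KeywordRetriever` class are hypothetical names for illustration, not LangChain APIs, and the retriever uses naive keyword matching rather than vector search:

```typescript
type Doc = { pageContent: string };

// Hypothetical minimal interface: anything with an async `invoke` method.
interface Step<Input, Output> {
  invoke(input: Input): Promise<Output>;
}

// A toy retriever that matches keywords instead of comparing vectors.
// The caller only sees `invoke`, so the lookup strategy can be swapped freely.
class KeywordRetriever implements Step<string, Doc[]> {
  private docs: Doc[];

  constructor(docs: Doc[]) {
    this.docs = docs;
  }

  async invoke(query: string): Promise<Doc[]> {
    // Ignore very short words such as "is".
    const terms = query.toLowerCase().split(/\W+/).filter((t) => t.length > 3);
    return this.docs.filter((doc) =>
      terms.some((term) => doc.pageContent.toLowerCase().includes(term))
    );
  }
}

// Chain two steps: the output of `first` becomes the input of `second`.
function pipe<A, B, C>(first: Step<A, B>, second: Step<B, C>): Step<A, C> {
  return {
    invoke: async (input: A) => second.invoke(await first.invoke(input)),
  };
}

// A second step that joins retrieved documents into one context string.
const formatDocs: Step<Doc[], string> = {
  invoke: async (docs) => docs.map((doc) => doc.pageContent).join("\n"),
};

const toyRetriever = new KeywordRetriever([
  { pageContent: "some of my own excitement about machine learning to you." },
  { pageContent: "A group of parrots is called a pandemonium." },
]);

const chain = pipe(toyRetriever, formatDocs);

// `.then` is used so the snippet also runs where top-level await is unavailable.
chain.invoke("What is deep learning?").then((context) => console.log(context));
// some of my own excitement about machine learning to you.
```

Because every step exposes the same method, the chain itself is also a `Step` and can be composed further.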

We'll take advantage of this in the next lesson in combination with what we learned in the first lessons to create a retrieval chain.