![Generating Embeddings](../../images/headings/02_retrieval_augmented_generation_01_01_vectorstores.png)

# Vectorstores: LangChain + PGVector

This notebook demonstrates how to use the `PGVector` vectorstore from the `langchain_postgres` package. `PGVector` is an implementation of LangChain's vectorstore abstraction using PostgreSQL as the backend and utilizing the `pgvector` extension.

## Setup
### Import required classes

In [1]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_aws import BedrockEmbeddings
from langchain_core.documents import Document
from langchain_postgres import PGVector

### Identify current user

In [2]:
import os

user = os.getenv('LOGNAME')
print(f'Hello, {user}')

Hello, mbklein


### Connect and initialize the `PGVector` vectorstore

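This example uses the Amazon Titan text embedding model through `BedrockEmbeddings`. A model from [HuggingFace](https://huggingface.co) is included as a commented-out alternative. The connection string assumes a local PostgreSQL server where the username, password, and database name all match the current user.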

In [3]:
connection = f'postgresql+psycopg://{user}:{user}@localhost:5432/{user}'
collection_name = "code4lib2024"
# embeddings = HuggingFaceEmbeddings(model_name='nomic-ai/nomic-embed-text-v1.5', model_kwargs={'trust_remote_code':True})
embeddings = BedrockEmbeddings(model_id='amazon.titan-embed-text-v2:0')

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)
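
Reusing the username as the password is convenient for a local demo only. As a minimal sketch, the connection string could instead be assembled from environment variables. The helper `build_connection_url` and the variable name `PGVECTOR_PASSWORD` are hypothetical and not part of `langchain_postgres`.

```python
import os
from urllib.parse import quote_plus


def build_connection_url(user, password, host="localhost", port=5432, database=None):
    """Assemble a SQLAlchemy-style URL for the psycopg driver.

    Hypothetical helper: the user and password are percent-encoded so that
    characters such as '@' or '/' do not break the URL.
    """
    database = database or user  # same convention as the cell above
    return (
        f"postgresql+psycopg://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


user = os.getenv("LOGNAME", "demo")
# PGVECTOR_PASSWORD is an assumed variable name; fall back to the demo convention.
password = os.getenv("PGVECTOR_PASSWORD", user)
connection = build_connection_url(user, password)
```

The resulting `connection` string can be passed to `PGVector(...)` in the same way as in the cell above.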

## Populate vectorstore
### Initialize documents

Create a list of `Document` objects, each with a `page_content` string and a `metadata` dictionary.

In [4]:
docs = [
    Document(
        page_content="Interlibrary loan requests can be made online or at the service desk",
        metadata={"id": 1, "location": "library", "topic": "borrowing"},
    ),
    Document(
        page_content="Course reserves are available for checkout at the circulation desk",
        metadata={"id": 2, "location": "library", "topic": "borrowing"},
    ),
    Document(
        page_content="Study rooms can be reserved up to two weeks in advance",
        metadata={"id": 3, "location": "library", "topic": "reservations"},
    ),
    Document(
        page_content="Library workshops on database research are held monthly",
        metadata={"id": 4, "location": "library", "topic": "workshops"},
    ),
    Document(
        page_content="Access to digital archives is available through the library portal",
        metadata={"id": 5, "location": "library", "topic": "online resources"},
    ),
    Document(
        page_content="Renew your borrowed items online or at any library kiosk",
        metadata={"id": 6, "location": "library", "topic": "borrowing"},
    ),
    Document(
        page_content="Special collections can be accessed in the reading room",
        metadata={"id": 7, "location": "library", "topic": "borrowing"},
    ),
    Document(
        page_content="Library orientation tours are available for new users",
        metadata={"id": 8, "location": "library", "topic": "facilities"},
    ),
    Document(
        page_content="The library offers free Wi-Fi to all visitors",
        metadata={"id": 9, "location": "library", "topic": "facilities"},
    ),
    Document(
        page_content="Photocopying and printing services are available on the ground floor",
        metadata={"id": 10, "location": "library", "topic": "printing services"},
    ),
]

### Add documents to vectorstore

Add the documents to the vectorstore using `add_documents()`. The list of unique document IDs is specified by the `ids` parameter.

In [5]:
vectorstore.add_documents(docs, ids=[doc.metadata["id"] for doc in docs])

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

## Similarity Search

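Perform a similarity search using the `similarity_search()` method, which returns only the matching documents. You can specify the number of results to return using the `k` parameter. The related `similarity_search_with_relevance_scores()` method, used in the first cell below, also returns a relevance score for each document, where a *higher* score means a closer match.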
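
Conceptually, a similarity search embeds the query, measures its distance to every stored vector, and keeps the `k` closest. The sketch below illustrates that idea in plain Python with made-up two-dimensional vectors and Euclidean distance. It is not how `PGVector` is implemented (the database does this work), and the function `top_k` and the toy vectors are assumptions for illustration.

```python
import heapq
import math


def top_k(query_vec, doc_vecs, k=4):
    """Return the k (doc_id, distance) pairs closest to query_vec, nearest first."""
    scored = ((doc_id, math.dist(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Toy "embeddings"; real embedding vectors have hundreds of dimensions.
toy_vectors = {
    "tours": [0.9, 0.1],
    "wifi": [0.2, 0.8],
    "printing": [0.1, 0.9],
}

nearest = top_k([1.0, 0.0], toy_vectors, k=2)
print([doc_id for doc_id, _ in nearest])
```

Raising `k` returns more results, but the additional ones are progressively farther from the query.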

### Return results with relevance scores

In [8]:
vectorstore.similarity_search_with_relevance_scores("guided viewing of library spaces", k=10)

[(Document(page_content='Library orientation tours are available for new users', metadata={'id': 8, 'topic': 'facilities', 'location': 'library'}),
  0.4981667102475944),
 (Document(page_content='Special collections can be accessed in the reading room', metadata={'id': 7, 'topic': 'borrowing', 'location': 'library'}),
  0.3830259442329407),
 (Document(page_content='Access to digital archives is available through the library portal', metadata={'id': 5, 'topic': 'online resources', 'location': 'library'}),
  0.31946558712320305),
 (Document(page_content='The library offers free Wi-Fi to all visitors', metadata={'id': 9, 'topic': 'facilities', 'location': 'library'}),
  0.2769853472709656),
 (Document(page_content='Library workshops on database research are held monthly', metadata={'id': 4, 'topic': 'workshops', 'location': 'library'}),
  0.25827664136886597),
 (Document(page_content='Interlibrary loan requests can be made online or at the service desk', metadata={'id': 1, 'topic': 'borro

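Perform a similarity search using the `similarity_search_with_score()` method.
- This method returns not only the documents but also the distance between the query and each document.
- The score is a distance measured under the vectorstore's configured distance strategy. The outputs in this notebook are consistent with cosine distance: each relevance score above equals 1 minus the corresponding distance below (for example, 0.498 and 0.502 for the orientation tours document).
- Cosine distance is 0 for vectors that point in the same direction and grows as they diverge, up to 2 for vectors that point in opposite directions.
- A *LOWER* score is better (i.e., more similar).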
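
The following is a minimal sketch of how a distance score relates to a relevance score, assuming cosine distance and the conversion `relevance = 1 - distance`, which is the relationship the paired outputs in this notebook exhibit. The function names are illustrative and are not part of the LangChain API.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def relevance_from_distance(distance):
    """Assumed conversion: higher relevance corresponds to lower distance."""
    return 1.0 - distance


query = [1.0, 0.0]
print(cosine_distance(query, [2.0, 0.0]))   # same direction
print(cosine_distance(query, [0.0, 3.0]))   # orthogonal
print(cosine_distance(query, [-1.0, 0.0]))  # opposite direction

# Distance reported for the orientation tours document in this notebook.
print(relevance_from_distance(0.5018332897524056))
```

Vector length does not affect the result: only the direction of each vector matters.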

### Return results with similarity scores

In [9]:
vectorstore.similarity_search_with_score("guided viewing of library spaces", k=5)

[(Document(page_content='Library orientation tours are available for new users', metadata={'id': 8, 'topic': 'facilities', 'location': 'library'}),
  0.5018332897524056),
 (Document(page_content='Special collections can be accessed in the reading room', metadata={'id': 7, 'topic': 'borrowing', 'location': 'library'}),
  0.6169740557670593),
 (Document(page_content='Access to digital archives is available through the library portal', metadata={'id': 5, 'topic': 'online resources', 'location': 'library'}),
  0.680534412876797),
 (Document(page_content='The library offers free Wi-Fi to all visitors', metadata={'id': 9, 'topic': 'facilities', 'location': 'library'}),
  0.7230146527290344),
 (Document(page_content='Library workshops on database research are held monthly', metadata={'id': 4, 'topic': 'workshops', 'location': 'library'}),
  0.741723358631134)]

In [10]:
vectorstore.similarity_search_with_score("tours", k=5)

[(Document(page_content='Library orientation tours are available for new users', metadata={'id': 8, 'topic': 'facilities', 'location': 'library'}),
  0.7052507629542173),
 (Document(page_content='Course reserves are available for checkout at the circulation desk', metadata={'id': 2, 'topic': 'borrowing', 'location': 'library'}),
  0.8708834061900601),
 (Document(page_content='Interlibrary loan requests can be made online or at the service desk', metadata={'id': 1, 'topic': 'borrowing', 'location': 'library'}),
  0.9248611957165729),
 (Document(page_content='Study rooms can be reserved up to two weeks in advance', metadata={'id': 3, 'topic': 'reservations', 'location': 'library'}),
  0.9336128612026137),
 (Document(page_content='Photocopying and printing services are available on the ground floor', metadata={'id': 10, 'topic': 'printing services', 'location': 'library'}),
  0.9338406850289477)]

## Filtering Support

`PGVector` supports filtering documents based on their metadata fields. You can use various operators to define the filters. If you provide a dictionary with multiple fields but no operators, the top level will be interpreted as a logical AND filter.

| Operator   | Meaning/Category               |
|------------|--------------------------------|
| `$eq`      | Equality (`==`)                |
| `$ne`      | Inequality (`!=`)              |
| `$lt`      | Less than (`<`)                |
| `$lte`     | Less than or equal (`<=`)      |
| `$gt`      | Greater than (`>`)             |
| `$gte`     | Greater than or equal (`>=`)   |
| `$in`      | Special Cased (`in`)           |
| `$nin`     | Special Cased (`not in`)       |
| `$between` | Special Cased (`between`)      |
| `$like`    | Text (`like`)                  |
| `$ilike`   | Text (case-insensitive `like`) |
| `$and`     | Logical (`and`)                |
| `$or`      | Logical (`or`)                 |
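
To get a feel for what the operators mean, here is a small pure-Python evaluator that applies a filter dictionary to a metadata dictionary. It is only a sketch of the semantics described in the table: `PGVector` evaluates filters in the database rather than in Python, and the function `matches`, the inclusive reading of `$between`, and the handling of missing fields are assumptions made for illustration.

```python
import re
from collections.abc import Mapping


def _like(value, pattern, flags=0):
    """SQL LIKE semantics: % matches any run of characters, _ matches one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch) for ch in pattern
    )
    return re.fullmatch(regex, str(value), flags | re.DOTALL) is not None


COMPARATORS = {
    "$eq": lambda v, x: v == x,
    "$ne": lambda v, x: v != x,
    "$lt": lambda v, x: v < x,
    "$lte": lambda v, x: v <= x,
    "$gt": lambda v, x: v > x,
    "$gte": lambda v, x: v >= x,
    "$in": lambda v, x: v in x,
    "$nin": lambda v, x: v not in x,
    "$between": lambda v, x: x[0] <= v <= x[1],  # assumed inclusive on both ends
    "$like": lambda v, x: _like(v, x),
    "$ilike": lambda v, x: _like(v, x, re.IGNORECASE),
}


def matches(metadata, filter_):
    """Return True if metadata satisfies every top-level entry of filter_."""
    for key, condition in filter_.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:
                return False  # simplification: a missing field never matches
            if not isinstance(condition, Mapping):
                condition = {"$eq": condition}  # bare value means equality
            value = metadata[key]
            if not all(COMPARATORS[op](value, arg) for op, arg in condition.items()):
                return False
    return True


# A subset of the metadata from the documents added earlier.
metadata_rows = [
    {"id": 1, "topic": "borrowing"},
    {"id": 2, "topic": "borrowing"},
    {"id": 5, "topic": "online resources"},
    {"id": 8, "topic": "facilities"},
    {"id": 9, "topic": "facilities"},
]


def matching_ids(filter_):
    return [row["id"] for row in metadata_rows if matches(row, filter_)]


print(matching_ids({"id": {"$in": [1, 5, 2, 9]}, "topic": {"$in": ["borrowing"]}}))
print(matching_ids({"$or": [{"topic": "facilities"}, {"id": {"$lt": 2}}]}))
```

Because the top level of the first filter has two fields and no logical operator, both conditions must hold, which mirrors the implicit AND described above.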

### Filter on a list of specific IDs

In [11]:
vectorstore.similarity_search_with_score("borrowing books for a course", k=10, filter={"id": {"$in": [1, 5, 2, 9]}})

[(Document(page_content='Course reserves are available for checkout at the circulation desk', metadata={'id': 2, 'topic': 'borrowing', 'location': 'library'}),
  0.6277142610890527),
 (Document(page_content='Interlibrary loan requests can be made online or at the service desk', metadata={'id': 1, 'topic': 'borrowing', 'location': 'library'}),
  0.6807321690640225),
 (Document(page_content='Access to digital archives is available through the library portal', metadata={'id': 5, 'topic': 'online resources', 'location': 'library'}),
  0.9020237595446038),
 (Document(page_content='The library offers free Wi-Fi to all visitors', metadata={'id': 9, 'topic': 'facilities', 'location': 'library'}),
  0.9494203217327595)]

In [12]:
vectorstore.similarity_search(
    "ILL requests",
    k=10,
    filter={"id": {"$in": [1, 5, 2, 9]}, "topic": {"$in": ["borrowing"]}},
)

[Document(page_content='Interlibrary loan requests can be made online or at the service desk', metadata={'id': 1, 'topic': 'borrowing', 'location': 'library'}),
 Document(page_content='Course reserves are available for checkout at the circulation desk', metadata={'id': 2, 'topic': 'borrowing', 'location': 'library'})]

### Combining filters using `$and` / `$or`

In [13]:
vectorstore.similarity_search(
    "books",
    k=10,
    filter={
        "$and": [
            {"id": {"$in": [1, 5, 2, 9]}},
            {"topic": {"$in": ["borrowing", "online resources"]}},
        ]
    },
)

[Document(page_content='Interlibrary loan requests can be made online or at the service desk', metadata={'id': 1, 'topic': 'borrowing', 'location': 'library'}),
 Document(page_content='Course reserves are available for checkout at the circulation desk', metadata={'id': 2, 'topic': 'borrowing', 'location': 'library'}),
 Document(page_content='Access to digital archives is available through the library portal', metadata={'id': 5, 'topic': 'online resources', 'location': 'library'})]

### Excluding documents using `$ne`

In [None]:
vectorstore.similarity_search("reserves", k=10, filter={"topic": {"$ne": "borrowing"}})

## Exercises

- Add documents to this collection (or a new one!) and run several similarity searches to get a feel for how they work.
- Add documents to your vectorstore, then experiment with the search methods that return scores.

## Discussion Questions

- Which of your collections could you use to begin experimenting with storing and retrieving embeddings?
- How would you keep documents in the vectorstore up-to-date?
- Are there other vectorstores you can find on the Internet besides Postgres + PGVector?