## Creating an index and populating it with documents using PostgreSQL+pgvector

A simple example of how to ingest documents from a local folder (AsciiDoc files in this notebook), then web page content, into a PostgreSQL+pgvector vector store.

Requirements:
- A PostgreSQL cluster with the pgvector extension installed (https://github.com/pgvector/pgvector)
- A database created in the cluster with the extension enabled (in this example, the database is named `vectordb`). To enable it, run the following command in the database as a superuser:
`CREATE EXTENSION vector;`

Note: if your PostgreSQL is deployed on OpenShift, you can run the command directly from inside the Pod (using the Terminal view in the Console, or `oc rsh` to log into the Pod): `psql -d vectordb -c "CREATE EXTENSION vector;"`


### Needed packages

In [1]:
!pip install -q pgvector


### Base parameters: PostgreSQL connection info

In [1]:
CONNECTION_STRING = "postgresql+psycopg://vectordb:vectordb@postgresql:5432/vectordb"
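Hard-coding credentials is convenient for a demo but easy to leak. A minimal sketch, assuming hypothetical environment variables `PGVECTOR_USER`, `PGVECTOR_PASSWORD`, `PGVECTOR_HOST`, `PGVECTOR_PORT`, and `PGVECTOR_DB` (none of these names come from this notebook), builds the same connection string and falls back to the demo values:

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=os.environ):
    """Assemble a SQLAlchemy-style URL for the psycopg driver.

    The PGVECTOR_* variable names are hypothetical; the defaults mirror
    the demo values used in this notebook.
    """
    user = env.get("PGVECTOR_USER", "vectordb")
    password = env.get("PGVECTOR_PASSWORD", "vectordb")
    host = env.get("PGVECTOR_HOST", "postgresql")
    port = env.get("PGVECTOR_PORT", "5432")
    database = env.get("PGVECTOR_DB", "vectordb")
    # Percent-encode the credentials so characters such as '@' or '/'
    # do not break URL parsing.
    return (
        f"postgresql+psycopg://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


CONNECTION_STRING = build_connection_string()
```

With no variables set, this yields the same string as the literal above.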

#### Imports

In [2]:
import os
import pathlib
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores.pgvector import PGVector
from langchain.document_loaders import DirectoryLoader, TextLoader, GitLoader

## Initial index creation and document ingestion

#### Document loading from a folder containing AsciiDoc files

In [3]:
files_folder_path = './markdown/do188/'

loader = DirectoryLoader(files_folder_path, glob="**/*.adoc", loader_cls=TextLoader)
docs = loader.load()
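Before loading, it can help to confirm which files the `**/*.adoc` glob will actually match. A minimal sketch using only `pathlib`; the helper name `list_matching_files` is hypothetical, and this only approximates the loader's file selection rather than reproducing how `DirectoryLoader` is implemented:

```python
from pathlib import Path


def list_matching_files(folder, pattern="**/*.adoc"):
    """Return the sorted files under `folder` that match a recursive glob."""
    root = Path(folder)
    if not root.is_dir():
        # A missing folder simply yields no matches.
        return []
    return sorted(p for p in root.glob(pattern) if p.is_file())


for path in list_matching_files("./markdown/do188/"):
    print(path)
```

An empty listing here usually points to a wrong folder path or file extension.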

#### Split documents into chunks with some overlap

In [8]:
# First pass: split each file on AsciiDoc section headers ("==" and "===").
# A second, size-based pass with RecursiveCharacterTextSplitter follows further down.

headers_to_split_on = [
    ("==", "section"),
    ("===", "subsection")
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)


def clean_split(split):
    # PostgreSQL rejects NUL characters in TEXT fields; also drop the ':gls_prefix' marker.
    split.page_content = split.page_content.replace('\x00', "").replace(':gls_prefix', '')
    return split

def add_metadata(split, metadata):
    split.metadata.update(metadata)
    return split

def is_valid_split(split):
    # Skip lab start/finish command snippets and near-empty chunks.
    return (
        "~]$ *lab start" not in split.page_content and
        "~]$ *lab finish" not in split.page_content and
        len(split.page_content) > 10
    )

md_header_splits = []
for doc in docs:
    splits = markdown_splitter.split_text(doc.page_content)
    splits = [clean_split(s) for s in splits]
    splits = [add_metadata(s, {"sku": "DO188", "file": doc.metadata["source"]}) for s in splits]
    splits = [s for s in splits if is_valid_split(s)]
    md_header_splits += splits



chunk_size = 1024
chunk_overlap = 128
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
all_splits = text_splitter.split_documents(md_header_splits)
all_splits

for s in all_splits:
    if s.metadata.get("section") == "Containerfile Instructions":
        print(s.page_content)


Containerfiles use a small domain-specific language (DSL) consisting of basic instructions for crafting container images. The following are the most common instructions.  
`FROM`::
Sets the base image for the resulting container image.
Takes the name of the base image as an argument.  
`WORKDIR`::
Sets the current working directory within the container.
Instructions that follow the `WORKDIR` instruction run within this directory.  
`COPY` and `ADD`::
+
--
Copy files from the build host into the file system of the resulting container image. Relative paths use the host current working directory, known as the build context.
Both instructions use the working directory within the container as defined by the `WORKDIR` instruction.  
The `ADD` instruction adds the following functionality:  
[compact]
* Copying files from URLs.
* Unpacking `tar` archives in the destination image.
[compact]
* Copying files from URLs.
* Unpacking `tar` archives in the destination image.  
Because the `ADD` instr
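
The header-based pass in the cell above relies on AsciiDoc section titles starting with `==` and `===`, which the splitter is told to treat as header markers. As a rough illustration of the idea (a simplified sketch, not LangChain's implementation; the function name `split_on_headers` is hypothetical), the text is cut at each header line and the header titles are recorded as metadata on the chunk that follows:

```python
def split_on_headers(text, headers=(("==", "section"), ("===", "subsection"))):
    """Split text at header lines; return (metadata, content) pairs.

    `headers` must be ordered from the highest level to the lowest.
    """
    chunks, metadata, buffer = [], {}, []

    def flush():
        # Emit the buffered lines as one chunk with the current metadata.
        content = "\n".join(buffer).strip()
        if content:
            chunks.append((dict(metadata), content))
        buffer.clear()

    for line in text.splitlines():
        level = next(
            (i for i, (marker, _) in enumerate(headers)
             if line.startswith(marker + " ")),
            None,
        )
        if level is None:
            buffer.append(line)
            continue
        flush()
        # A new header resets every deeper level.
        for _, name in headers[level + 1:]:
            metadata.pop(name, None)
        metadata[headers[level][1]] = line[len(headers[level][0]):].strip()
    flush()
    return chunks


sample = "== Containerfile Instructions\nIntro text.\n=== FROM\nSets the base image."
for meta, content in split_on_headers(sample):
    print(meta, "->", content)
```

Carrying the section title in the metadata is what makes a filter such as `s.metadata.get("section") == "Containerfile Instructions"` possible.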

#### Document cleanup: PostgreSQL won't accept the NUL character, '\x00', in TEXT fields (already stripped in `clean_split` above)

In [80]:
len(all_splits)

503

#### Create the index and ingest the documents

In [84]:
embeddings = HuggingFaceEmbeddings()

COLLECTION_NAME = "adoc_test"

db = PGVector.from_documents(
    documents=all_splits,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    pre_delete_collection=True
)
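
Embedding several hundred chunks in a single call can take a while. One possible approach, sketched here under assumptions, is to feed the store in fixed-size batches so that progress is visible along the way. The `batched` helper and the batch size of 64 are hypothetical; the commented-out lines reuse the `PGVector` constructor and `add_documents` method that appear later in this notebook:

```python
def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Usage sketch (needs a reachable database, so it is left commented out):
# store = PGVector(connection_string=CONNECTION_STRING,
#                  collection_name=COLLECTION_NAME,
#                  embedding_function=embeddings)
# for number, batch in enumerate(batched(all_splits, 64), start=1):
#     store.add_documents(batch)
#     print(f"Ingested batch {number}")
```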


In [82]:
query = "How do you run a container with a specific user?"
docs_with_score = db.similarity_search_with_score(query)

In [83]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.3407857018474999
See the references section for more information about rootless Podman setup.  
=== Changing the Container User
When you create a Containerfile, the user tends to be root. This is because you require elevated privileges for certain operations, such as installing packages or making configuration changes.  
Determine the current user by running the `id` command:  
[subs="+quotes,+macros"]
----
[user@host ~]$ *podman run registry.access.redhat.com/ubi9/ubi id*
uid=0(root) gid=0(root) groups=0(root)
----  
The following container image uses the root user to start an HTTP server:  
[subs="+quotes,+macros"]
----
FROM registry.access.redhat.com/ubi9/ubi  
CMD ["python3", "-m", "http.server"]
----
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.35849994195997525
[sub
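
The scores printed above are distances rather than similarities, so lower values indicate closer matches. This assumes the store is using its default cosine distance strategy, which the notebook does not override. A minimal sketch of that metric in plain Python (the vectors are made-up toy values, not real embeddings):

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


query_vec = [1.0, 0.0]
print(cosine_distance(query_vec, [1.0, 0.0]))  # same direction
print(cosine_distance(query_vec, [0.0, 1.0]))  # orthogonal vectors
```

Vectors pointing in the same direction give a distance of 0, and orthogonal vectors give 1.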

## Ingesting new documents

#### Example with Web pages

In [10]:
from langchain.document_loaders import WebBaseLoader

In [11]:
loader = WebBaseLoader(["https://ai-on-openshift.io/getting-started/openshift/",
                        "https://ai-on-openshift.io/getting-started/opendatahub/",
                        "https://ai-on-openshift.io/getting-started/openshift-data-science/",
                        "https://ai-on-openshift.io/odh-rhods/configuration/",
                        "https://ai-on-openshift.io/odh-rhods/custom-notebooks/",
                        "https://ai-on-openshift.io/odh-rhods/nvidia-gpus/",
                        "https://ai-on-openshift.io/odh-rhods/custom-runtime-triton/",
                        "https://ai-on-openshift.io/odh-rhods/openshift-group-management/",
                        "https://ai-on-openshift.io/tools-and-applications/minio/minio/"
                       ])

In [12]:
data = loader.load()

In [13]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(data)
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')

In [14]:
embeddings = HuggingFaceEmbeddings()
store = PGVector(
    connection_string=CONNECTION_STRING,
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings)

In [16]:
store.add_documents(all_splits)