## Creating an index and populating it with documents using PostgreSQL+pgvector

A simple example of how to ingest PDF documents, and then web page content, into a PostgreSQL+pgvector vector store.

Requirements:
- A PostgreSQL cluster with the pgvector extension installed (https://github.com/pgvector/pgvector)
- A database created in the cluster with the extension enabled (in this example, the database is named `vectordb`). To enable the extension, run the following command in the database as a superuser:
`CREATE EXTENSION vector;`

Note: if your PostgreSQL is deployed on OpenShift, you can run the command directly from inside the Pod (using the Terminal view in the Console, or `oc rsh` to log into the Pod): `psql -d vectordb -c "CREATE EXTENSION vector;"`


### Needed packages

In [1]:
!pip install -q pgvector langchain pypdf sentence-transformers psycopg langchain-community




### Base parameters: the PostgreSQL connection info

In [2]:

CONNECTION_STRING = "postgresql+psycopg://vectordb:vectordb@postgresql.angent-workshop.svc.cluster.local:5432/vectordb"
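
The connection string above follows the SQLAlchemy URL form `dialect+driver://user:password@host:port/database`. As a minimal sketch, assuming you want to assemble it from separate settings, the helper below percent-encodes the credentials so that characters such as `@` or `/` in a password do not break the URL. The function name `build_connection_string` and the host `db.example.internal` are hypothetical and are not part of the notebook or of LangChain.

```python
from urllib.parse import quote


def build_connection_string(user, password, host, port, database, driver="psycopg"):
    """Assemble a SQLAlchemy-style PostgreSQL URL (hypothetical helper)."""
    # safe="" makes quote() also encode "/" so the credentials cannot be
    # mistaken for URL structure.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql+{driver}://{safe_user}:{safe_password}@{host}:{port}/{database}"


# Placeholder values for illustration only.
example = build_connection_string(
    "vectordb", "p@ss/word", "db.example.internal", 5432, "vectordb"
)
print(example)
```

In a real deployment the credentials would more likely come from environment variables or a Secret than from literals in the notebook.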

#### Imports

In [3]:
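# Loaders, embeddings and vector stores live in the langchain-community
# package (installed above); the older `langchain.*` paths are deprecated.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.pgvector import PGVector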

## Initial index creation and document ingestion

#### Document loading from a folder containing PDFs

In [4]:
# pdf_folder_path = './rhods-doc'

# loader = PyPDFDirectoryLoader(pdf_folder_path)
# docs = loader.load()

In [5]:
import requests
import re

response = requests.get(
    "https://storage.googleapis.com/benchmarks-artifacts/travel-db/swiss_faq.md"
)
response.raise_for_status()
faq_text = response.text

docs = [{"page_content": txt} for txt in re.split(r"(?=\n##)", faq_text)]
# print(docs)
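
The `re.split` call above uses a zero-width lookahead, `(?=\n##)`, so the text is cut just before each second-level heading and the heading stays attached to the section that follows it. A small sketch on a made-up sample (not the actual FAQ content) shows the behaviour:

```python
import re

# Invented sample text, standing in for the downloaded FAQ.
sample = "# FAQ\nIntro text\n## Baggage\nOne bag allowed.\n## Refunds\nContact support."

# The lookahead matches the empty position before "\n##", so nothing is consumed.
sections = re.split(r"(?=\n##)", sample)

for section in sections:
    print(repr(section))
```

Because nothing is consumed, joining the pieces reproduces the original text. Note that the pattern also matches deeper headings such as `\n###`, since they start with the same characters.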

#### Split documents into chunks with some overlap

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
# `docs` is a list of dicts, so pass the raw strings to create_documents().
# Passing the dicts themselves raises:
#   TypeError: expected string or bytes-like object, got 'dict'
# split_documents() would instead expect a list of Document objects
# (for example, the output of PyPDFDirectoryLoader.load()).
all_splits = text_splitter.create_documents(
    [doc["page_content"] for doc in docs]
)
all_splits[0]
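
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch of fixed-window chunking with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraphs and sentences; the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters; consecutive
    windows share chunk_overlap characters (simplified illustration)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk repeats the last character of the previous one, which is the role the 40-character overlap plays in the cell above: text near a chunk boundary appears in both neighbouring chunks.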

#### Clean up documents, as PostgreSQL won't accept the NUL character, '\x00', in TEXT fields.

In [None]:
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')
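
The same cleanup can be tried without a database or LangChain installed. The sketch below uses a stand-in `FakeDocument` dataclass (hypothetical, exposing only the `page_content` attribute that the loop above relies on) and a hypothetical `strip_nul` helper:

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    # Stand-in for a LangChain Document; only page_content is modelled.
    page_content: str


def strip_nul(documents):
    """Remove NUL characters in place and return the same list."""
    for doc in documents:
        doc.page_content = doc.page_content.replace("\x00", "")
    return documents


sample_docs = [FakeDocument("clean text"), FakeDocument("bad\x00byte")]
strip_nul(sample_docs)
print([doc.page_content for doc in sample_docs])
```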

#### Create the index and ingest the documents

In [None]:
embeddings = HuggingFaceEmbeddings()

COLLECTION_NAME = "documents_google"

db = PGVector.from_documents(
    documents=all_splits,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

In [None]:
# query = "How do you install OpenShift Data Science?"
query = "Consult the company policies to check whether certain options are permitted. Use this before making any flight changes performing other 'write' events."
docs_with_score = db.similarity_search_with_score(query)

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)
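
When reading the scores, it helps to know which metric the store uses. Assuming it is configured for cosine distance (an assumption worth checking against your installed version and settings), the score is a distance rather than a similarity, so lower values mean closer matches. The sketch below computes cosine distance in plain Python on toy two-dimensional vectors; the vectors and names are invented for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy embeddings, not produced by a real model.
query_vec = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Sort ascending: the smallest distance is the best match.
ranked = sorted(candidates, key=lambda name: cosine_distance(query_vec, candidates[name]))
for name in ranked:
    print(name, cosine_distance(query_vec, candidates[name]))
```

Under this metric the distance ranges from 0 (same direction) to 2 (opposite direction), and vector length does not affect it.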

## Ingesting new documents

#### Example with Web pages

In [None]:
# WebBaseLoader lives in the langchain-community package (installed above).
from langchain_community.document_loaders import WebBaseLoader

In [None]:
# loader = WebBaseLoader(["https://ai-on-openshift.io/getting-started/openshift/",
#                         "https://ai-on-openshift.io/getting-started/opendatahub/",
#                         "https://ai-on-openshift.io/getting-started/openshift-data-science/",
#                         "https://ai-on-openshift.io/odh-rhods/configuration/",
#                         "https://ai-on-openshift.io/odh-rhods/custom-notebooks/",
#                         "https://ai-on-openshift.io/odh-rhods/nvidia-gpus/",
#                         "https://ai-on-openshift.io/odh-rhods/custom-runtime-triton/",
#                         "https://ai-on-openshift.io/odh-rhods/openshift-group-management/",
#                         "https://ai-on-openshift.io/tools-and-applications/minio/minio/"
#                        ])

In [None]:
# data = loader.load()

In [None]:
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
#                                                chunk_overlap=40)
# all_splits = text_splitter.split_documents(data)
# for doc in all_splits:
#     doc.page_content = doc.page_content.replace('\x00', '')

In [None]:
# embeddings = HuggingFaceEmbeddings()
# store = PGVector(
#     connection_string=CONNECTION_STRING,
#     collection_name=COLLECTION_NAME,
#     embedding_function=embeddings)

In [None]:
# store.add_documents(all_splits);