<a href="https://colab.research.google.com/github/osaeed-ds/vector-hello/blob/main/Osaeed_pgVector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **pgVector as a Vector Database**
This notebook is a "hello world" exercise based on the PGVector quickstart on the LangChain website:
https://python.langchain.com/docs/integrations/vectorstores/pgvector

The quickstart does not say where to get the data file it uses, so I substituted my own dataset.



## **Prerequisites**

In [1]:
!pip install pgVector openai tiktoken langchain psycopg2-binary

Collecting pgVector
  Downloading pgvector-0.2.2-py2.py3-none-any.whl (9.2 kB)
Collecting openai
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Collecting tiktoken
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Collecting langchain
  Downloading langchain-0.0.300-py3-none-any.whl (1.7 MB)
Collecting psycopg2-binary
  Downloading psycopg2_binary-2.9.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Collecting dataclasses-json<

In [2]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import PGVector
from langchain.document_loaders import TextLoader
from langchain.docstore.document import Document

## **Embedding Engine**
We will use OpenAI to generate the embeddings.

In [3]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········
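
When the notebook is re-run, it can be convenient to prompt for the key only if it is not already set. The following is a minimal sketch of that idea; `ensure_api_key` and its `prompt` and `env` parameters are hypothetical names introduced here for illustration, not part of LangChain or the OpenAI client.

```python
import os
from getpass import getpass


def ensure_api_key(name="OPENAI_API_KEY", prompt=getpass, env=os.environ):
    """Return env[name], asking the user for it only when it is missing or empty."""
    if not env.get(name):
        # getpass hides the typed value, so the key does not end up in the cell output.
        env[name] = prompt(f"{name}: ")
    return env[name]
```

Calling `ensure_api_key()` once near the top of the notebook would leave an already exported key untouched and fall back to the hidden prompt otherwise.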


## **Dataset**
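We will use the US Constitution as our dataset, downloaded from govinfo.gov. The source is an HTML page (`.htm`) that we save as `constitution.txt`, so the file may still contain some markup.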

In [4]:
!curl https://www.govinfo.gov/content/pkg/CDOC-110hdoc50/html/CDOC-110hdoc50.htm > constitution.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  291k    0  291k    0     0   616k      0 --:--:-- --:--:-- --:--:--  617k


## **Generate Embeddings**
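Use LangChain to split the dataset into chunks of up to about 1,000 characters and set up the OpenAI embedding model. The embeddings themselves are computed in the next step, when the chunks are loaded into the database.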

In [5]:
loader = TextLoader("constitution.txt")

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
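
To make the chunking step less of a black box, here is a simplified sketch of separator-based splitting with no overlap. It is not LangChain's implementation: `split_text` is a hypothetical helper, and the `"\n\n"` separator is an assumption about the splitter's default. It only illustrates the general idea of merging paragraph-sized pieces greedily until the next piece would push a chunk past `chunk_size`.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole as its own oversized chunk.
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate  # the piece still fits (or starts a new chunk)
        else:
            chunks.append(current)  # close the full chunk and start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

With `chunk_overlap=0`, as in the cell above, no text is repeated between neighbouring chunks, which this sketch mirrors.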



## **Connect to pgVector and load the embeddings**

In [7]:
# PGVector needs a connection string for the database, for example:
# CONNECTION_STRING = "postgresql+psycopg2://harrisonchase@localhost:5432/test3"
# CONNECTION_STRING = PG_VECTOR_URI

# Alternatively, build it from environment variables.
# The second argument to os.environ.get() is the fallback used when a variable is not set;
# "MYPASSWORD" is a placeholder, not a real password.
import os

CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver=os.environ.get("PGVECTOR_DRIVER", "psycopg2"),
    host=os.environ.get("PGVECTOR_HOST", "osaeed-vector-test-do-user-14702791-0.b.db.ondigitalocean.com"),
    port=int(os.environ.get("PGVECTOR_PORT", "25060")),
    database=os.environ.get("PGVECTOR_DATABASE", "defaultdb"),
    user=os.environ.get("PGVECTOR_USER", "doadmin"),
    password=os.environ.get("PGVECTOR_PASSWORD", "MYPASSWORD"),
)
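
For reference, the resulting string has the same SQLAlchemy-style URL shape as the commented-out example above. The sketch below assembles such a URL by hand; `build_connection_string` is a hypothetical helper, and URL-quoting the user and password is an addition of this sketch (useful when a password contains characters such as `@` or `/`), not something the notebook or the library helper is known to do.

```python
from urllib.parse import quote_plus


def build_connection_string(driver, host, port, database, user, password):
    """Assemble a postgresql+<driver>://user:password@host:port/database URL."""
    # quote_plus escapes characters that would otherwise break URL parsing.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    return f"postgresql+{driver}://{credentials}@{host}:{port}/{database}"


if __name__ == "__main__":
    # Placeholder values only; substitute your own database parameters.
    print(build_connection_string("psycopg2", "localhost", 5432, "test3", "harrisonchase", "p@ss"))
```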

In [8]:
# The PGVector Module will try to create a table with the name of the collection.
# So, make sure that the collection name is unique and the user has the permission to create a table.

COLLECTION_NAME = "constitution_test"

db = PGVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

## **Query the DB**

In [9]:
query = "What is the role of the Vice President?"
docs_with_score = db.similarity_search_with_score(query)
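
Conceptually, the search embeds the query and ranks the stored chunks by how close their vectors are to it. The sketch below shows that ranking in plain Python, assuming cosine distance and a default of four results; the actual metric depends on how the vector store is configured, and `cosine_distance` and `top_k` are hypothetical helpers, not pgvector or LangChain functions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def top_k(query_vec, stored, k=4):
    """Rank (text, vector) pairs by distance to query_vec and keep the k closest."""
    scored = [(text, cosine_distance(query_vec, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1])[:k]
```

Under this assumption a smaller score means a closer match. In practice the database performs the ranking itself rather than Python looping over every row.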

In [10]:
# Each result is a (Document, score) pair. The score is a distance, so lower means a closer match.
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.15975599528813245
Section 1. In case of the removal of the President from 
office or of his death or resignation, the Vice President shall 
become President.
    Section 2. Whenever there is a vacancy in the office of the 
Vice President, the President shall nominate a Vice President 
who shall take office upon confirmation by a majority vote of 
both Houses of Congress.
    Section 3. Whenever the President transmits to the 
President pro tempore of the Senate and the Speaker of the 
House of Representatives his written declaration that he is 
unable to discharge the powers and duties of his office, and 
until he transmits to them a written declaration to the 
contrary, such powers and duties shall be discharged by the 
Vice President as Acting President.
    Section 4. Whenever the Vice President and a majority of 
either the principal officers of the executive departments or 
of such other bod