# PostgreSQL as Vector Database: Getting Started With pgvector

PostgreSQL is an open source relational database known for its extensibility. The pgvector extension adds the essential capabilities of a vector database to PostgreSQL. With pgvector, you can store vectors (embeddings) in PostgreSQL, run similarity searches over vectorized data, and speed up those searches with IVFFlat and HNSW indexes.

## Prerequisites

* Docker with Docker Compose, to run PostgreSQL with pgvector
* Python 3 with pip
* An OpenAI API key

## Install Required Modules

The notebook uses the following libraries:
- `openai` - provides access to the OpenAI Embedding API.
- `psycopg2` - PostgreSQL database driver for Python.
- `wget` - allows downloading files and datasets.

Install the libraries with pip:

In [None]:
! pip install openai psycopg2 wget

## Start PostgreSQL With pgvector

The fastest way to start with PostgreSQL as a vector database is to run a database container with the pgvector extension pre-installed. Run the command below to start PostgreSQL in Docker; it expects a Docker Compose file for the container in the current directory:

In [None]:
! docker compose up -d

Enable the pgvector extension by running the `CREATE EXTENSION vector` command with the psql tool inside the container:

In [7]:
! docker exec -it postgres-pgvector psql -U postgres -c 'CREATE EXTENSION vector'

CREATE EXTENSION


## Provide OpenAI API Key

Provide your OpenAI API key through the `OPENAI_API_KEY` environment variable. The snippet below reads the variable and, if it is not set, prompts you for the key, which is then used for this session only:

In [28]:
import os
import openai
from getpass import getpass

openai_key = os.getenv('OPENAI_API_KEY')

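# Fall back to an interactive prompt when the environment variable is not set
if openai_key is None:
    openai_key = getpass('Provide your OpenAI API key: ')

# Stop early if the key is still empty
if not openai_key:
    raise ValueError('No OpenAI API key provided. Please set the OPENAI_API_KEY environment variable or provide it when prompted.')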

openai.api_key = openai_key

print('OpenAI API key set.')

OpenAI API key set.


## Load Sample Dataset

The notebook uses a [movies dataset](https://huggingface.co/datasets/denismagda/movies/blob/main/README.md) from Hugging Face with over 45,000 movies and 26 million ratings from over 270,000 users. The dataset comes with pre-generated embeddings of the movies' overviews, created with the OpenAI `text-embedding-ada-002` model.

Download the dataset:

In [15]:
import wget

schema_file = "https://huggingface.co/datasets/denismagda/movies/raw/main/movie_schema.sql"
data_file = "https://huggingface.co/datasets/denismagda/movies/resolve/main/movie_data_with_openai_embeddings.sql"

print('Downloading the schema file...')
wget.download(schema_file)

# This file is 900 MB, so the download might take a minute
print('Downloading the data file...')
wget.download(data_file)

print('Finished downloading the files.')

Downloading the schema file...
Downloading the data file...
Finished downloading the files.
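
Because the data file is large, it may be worth skipping the download when the file is already on disk. The sketch below is one way to do that. The `download_if_missing` helper and its `download` parameter are hypothetical names introduced here, not part of the notebook. With the `wget` package, the callable could be `lambda url, path: wget.download(url, out=path)`.

```python
import os


def download_if_missing(url, download, target_dir="."):
    """Download `url` into `target_dir` unless the file already exists.

    `download` is any callable that takes (url, out_path). It is a
    placeholder for the actual download function (an assumption here).
    """
    # Use the last URL path segment as the local file name
    filename = url.rsplit("/", 1)[-1]
    path = os.path.join(target_dir, filename)

    if os.path.exists(path):
        print(f"{filename} already exists, skipping the download.")
        return path

    download(url, path)
    return path
```

Calling the helper once for the schema file and once for the data file would then leave existing files untouched when the cell is re-run.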


Load the schema and data into the PostgreSQL instance using the psycopg2 driver:

In [26]:
import psycopg2

print('Connecting to PostgreSQL...')
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres password=password")
cursor = conn.cursor()

print('Creating the schema...')
# Context managers close the files once their contents have been read
with open('movie_schema.sql', 'r') as schema_sql:
    cursor.execute(schema_sql.read())
conn.commit()

print('Loading the data. It might take a minute...')
with open('movie_data_with_openai_embeddings.sql', 'r') as data_sql:
    cursor.execute(data_sql.read())
conn.commit()

cursor.execute('SELECT COUNT(*) FROM movie')
result = cursor.fetchone()

print(f'{result[0]} movies loaded.')

Connecting to PostgreSQL...
Creating the schema...
Loading the data. It might take a minute...
45426 movies loaded.
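
With the data loaded, an approximate nearest neighbor index on the embedding column could speed up similarity searches. The sketch below builds an HNSW or IVFFlat index for cosine distance. It assumes the embeddings live in the `overview_vector` column of the `movie` table and that the installed pgvector version supports the chosen index type. The `create_vector_index` helper, the index names, and `lists = 100` are illustrative choices, not values taken from the dataset.

```python
def create_vector_index(cursor, method="hnsw", table="movie", column="overview_vector"):
    """Create a cosine-distance index on an embedding column.

    `table` and `column` are interpolated into the SQL text, so they
    must be trusted identifiers, never user input.
    """
    index_name = f"{table}_{column}_{method}_idx"

    if method == "hnsw":
        sql = (
            f"CREATE INDEX IF NOT EXISTS {index_name} ON {table} "
            f"USING hnsw ({column} vector_cosine_ops)"
        )
    elif method == "ivfflat":
        # `lists` is a tuning parameter; 100 is only a starting point
        sql = (
            f"CREATE INDEX IF NOT EXISTS {index_name} ON {table} "
            f"USING ivfflat ({column} vector_cosine_ops) WITH (lists = 100)"
        )
    else:
        raise ValueError(f"Unsupported index method: {method}")

    cursor.execute(sql)
    return sql
```

In the notebook, this could be called as `create_vector_index(cursor)` followed by `conn.commit()`. The `vector_cosine_ops` operator class matches the cosine distance operator (`<=>`); a search that uses a different distance operator would need an index built with the corresponding operator class.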


## Perform Vector Similarity Search

The movie dataset stores a vectorized representation of a movie overview in the `overview_vector` column. Each vector is a 1536-dimensional embedding generated with the OpenAI `text-embedding-ada-002` model.

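The pgvector extension supports several distance operators for comparing embeddings, including L2 distance (`<->`), inner product (`<#>`, which returns the negative inner product), and cosine distance (`<=>`).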
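
To make these distance measures concrete, the plain-Python sketch below computes L2 distance, inner product, and cosine distance for two tiny vectors. It only illustrates the math and is not how pgvector implements the operators; the function names are introduced here for the example.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1 - inner_product(a, b) / (norm_a * norm_b)


a = [1.0, 0.0]
b = [0.0, 1.0]

print(l2_distance(a, b))      # orthogonal unit vectors are sqrt(2) apart
print(cosine_distance(a, b))  # orthogonal vectors: 1.0
print(cosine_distance(a, a))  # identical direction: 0.0
```

A smaller cosine distance means the two embeddings point in a more similar direction, which is why results are ordered by ascending distance.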

Run the snippet below to find the most relevant movies for the user's prompt using the cosine distance operator (`<=>`):


In [39]:
prompt = 'A family comedy for the Christmas holidays.'

response = openai.embeddings.create(
    input=prompt,
    model='text-embedding-ada-002')

# Converting the embedding to the pgvector format by adding brackets
prompt_vector = '[' + ','.join(map(str, response.data[0].embedding)) + ']'

# Minimum cosine similarity (1 - cosine distance) and maximum number of results
match_threshold = 0.7
match_cnt = 3

# Keep the movies that meet the threshold, closest to the prompt first
cursor.execute(
    'SELECT title, overview '
    'FROM movie WHERE 1 - (overview_vector <=> %(prompt_vector)s) >= %(match_threshold)s '
    'ORDER BY overview_vector <=> %(prompt_vector)s LIMIT %(match_cnt)s',
    {'prompt_vector': prompt_vector, 'match_threshold': match_threshold, 'match_cnt': match_cnt}
)

result = cursor.fetchall()

for row in result:
    print(row)

('Better Living', 'A comedy about families, the elements that bind them together, and about hope in the face of hardship.')
('Shared Rooms', 'A new romantic comedy feature film that brings together three interrelated tales of gay men seeking family, love and sex during the holiday season.')
('The Ref', 'A cat burglar is forced to take a bickering, dysfunctional family hostage on Christmas Eve.')
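
The search above could be wrapped in a small reusable function. The sketch below is one possible shape, assuming the same `movie` table and `overview_vector` column. The `to_pgvector_literal` and `find_similar_movies` names are hypothetical helpers introduced here. Two details differ from the query above: the parameter is cast with `::vector` to make its type explicit, and the computed similarity is returned as an extra column.

```python
def to_pgvector_literal(embedding):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in embedding) + "]"


SEARCH_SQL = (
    "SELECT title, overview, "
    "1 - (overview_vector <=> %(prompt_vector)s::vector) AS similarity "
    "FROM movie "
    "WHERE 1 - (overview_vector <=> %(prompt_vector)s::vector) >= %(match_threshold)s "
    "ORDER BY overview_vector <=> %(prompt_vector)s::vector "
    "LIMIT %(match_cnt)s"
)


def find_similar_movies(cursor, embedding, match_threshold=0.7, match_cnt=3):
    """Return up to `match_cnt` movies whose cosine similarity to
    `embedding` is at least `match_threshold`, most similar first."""
    cursor.execute(
        SEARCH_SQL,
        {
            "prompt_vector": to_pgvector_literal(embedding),
            "match_threshold": match_threshold,
            "match_cnt": match_cnt,
        },
    )
    return cursor.fetchall()
```

In the notebook, it could be called as `find_similar_movies(cursor, response.data[0].embedding)`, which would return rows of `(title, overview, similarity)`.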
