# Exercise 8.: Vector Databases

In this exercise you'll build the back end of a simple image search engine powered by CLIP. It searches a database of images using either an image or a text prompt as the query.

**Before you begin, make sure that your credentials are stored in a .json file in the same folder as this notebook, and that those credentials are correct.**
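
A minimal sketch of how the credentials file could be read and checked before connecting. The file name `credentials.json` and the helper `load_settings` are assumptions made for illustration; the key names are the ones the connection helper further down in this notebook reads.

```python
import json

# Keys read by the notebook's get_engine_from_settings() helper
REQUIRED_KEYS = ('pguser', 'pgpasswd', 'pghost', 'pgport', 'pgdb', 'schema')


def load_settings(path='credentials.json'):  # hypothetical file name
    """Read database settings from a JSON file and check that no key is missing."""
    with open(path) as f:
        settings = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise KeyError(f'Missing keys in {path}: {missing}')
    return settings
```

Failing early with the list of missing keys is usually easier to debug than a connection error raised later by the database driver.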

In [None]:
# Install dependencies. Ignore the model libraries if you don't want to melt down k8plex-edu...
!pip install -U pgvector
# !pip install -U transformers

In [None]:
import json
import os
import io

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from PIL import Image

In [None]:
from sqlalchemy import create_engine

# Functions to connect to the server using SQLAlchemy
def get_engine(user, passwd, host, port, db, schema):
    """
    Get SQLAlchemy engine using credentials.
    Input:
        user: Username
        passwd: Password for the database
        host: Hostname of the database server
        port: Port number
        db: Database name
        schema: Database schema
    Returns:
        Database engine
    """

    url = 'postgresql://{user}:{passwd}@{host}:{port}/{db}'.format(
        user=user, passwd=passwd, host=host, port=port, db=db)
    engine = create_engine(url,connect_args={'options' : f'--search_path={schema}'}, pool_size=50, echo=False)
    return engine


def get_engine_from_settings(settings):
    """
    Sets up database connection from local settings.
    Input:
        settings: Dictionary containing pguser, pgpasswd, pghost, pgport, pgdb and schema.
    Returns:
        Call to get_engine returning engine
    """
    keys = ['pguser','pgpasswd','pghost','pgport','pgdb','schema']
    # Every required key must be present in the settings
    if not all(key in settings for key in keys):
        raise Exception('Bad config file')

    return get_engine(settings['pguser'],
                      settings['pgpasswd'],
                      settings['pghost'],
                      settings['pgport'],
                      settings['pgdb'],
                      settings['schema'])

In [None]:
imgs_folder = './imgs'
queries_folder = './queries'

# Image Embeddings

A short example of how to embed the images with CLIP. You do not need to run it; it is too heavy for k8plex-edu...

In [None]:
'''import torch
from transformers import CLIPProcessor, CLIPModel
# Load the model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Function to create image embeddings
def create_image_embeddings(folder):
    embeddings_dict = {}
    for filename in os.listdir(folder):
        if filename.endswith(".jpg") or filename.endswith(".png"):
            image_path = os.path.join(folder, filename)
            image = Image.open(image_path)
            inputs = processor(images=image, return_tensors="pt", padding=True)
            image_embeddings = model.get_image_features(**inputs)
            embeddings_dict[filename] = image_embeddings[0].detach().tolist()
    return embeddings_dict'''

In [None]:
'''# Create embeddings for images in './imgs'
imgs_embeddings = create_image_embeddings('./imgs')
with open('imgs_embeddings.json', 'w') as f:
    json.dump(imgs_embeddings, f)

# and './queries'
queries_embeddings = create_image_embeddings('./queries')
with open('queries_embeddings.json', 'w') as f:
    json.dump(queries_embeddings, f)

# Create embedding for the text
text = "An astronaut holding a gun."
inputs = processor(text=text, return_tensors="pt", padding=True)
text_embeddings = model.get_text_features(**inputs)
text_embeddings_dict = {text: text_embeddings[0].detach().tolist()}
with open('text_embeddings.json', 'w') as f:
    json.dump(text_embeddings_dict, f)'''

## 1-a.) Load the image embeddings from ./imgs/imgs_embeddings.json and upload them along with the file name and the (binary) image data into a new table called 'memes_your_username'. (3 points)

* The embedding dimension is 512.
* Make sure to use a unique table name to avoid clashes with the other students. You're all using the public schema, as that's where the pgvector extension is enabled.
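
One possible way to prepare the upload is to first collect, for every image, its file name, its embedding and its raw bytes. The sketch below does only that preparation step. The helper name `build_rows` and the dictionary keys are assumptions for illustration, and the embeddings file is assumed to map each file name to a list of floats, as in the CLIP example above.

```python
import json
import os

EMBEDDING_DIM = 512  # dimension stated in the exercise


def build_rows(embeddings_path, imgs_folder):
    """Pair every embedding with its file name and the raw bytes of the image."""
    with open(embeddings_path) as f:
        embeddings = json.load(f)  # assumed layout: {file name: list of floats}

    rows = []
    for filename, embedding in embeddings.items():
        if len(embedding) != EMBEDDING_DIM:
            raise ValueError(
                f'{filename}: expected {EMBEDDING_DIM} values, got {len(embedding)}')
        # Read the image as bytes so it can be stored in a binary column
        with open(os.path.join(imgs_folder, filename), 'rb') as img:
            rows.append({'filename': filename,
                         'embedding': embedding,
                         'image': img.read()})
    return rows
```

Each dictionary then corresponds to one table row with a `Text`, a `Vector(512)` and a `LargeBinary` column, the types imported in the next cell.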

In [None]:
from sqlalchemy import Table, Column, Text, LargeBinary
from sqlalchemy.orm import sessionmaker
from pgvector.sqlalchemy import Vector
from sqlalchemy.ext.declarative import declarative_base

In [None]:
# YOUR database settings
pgsql_settings = {
    'pguser' : 'database_username',
    'pgpasswd' : 'your neptune code in UPPERCASE',
    'pghost' : 'postgres-datasci.db-test',
    'pgport' : 5432,
    'pgdb' : 'database_username_homework',
    'schema' : 'public'
}

In [None]:
engine = get_engine_from_settings(pgsql_settings)
Session = sessionmaker(bind=engine)
session = Session()

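from sqlalchemy import text

# Set the search_path to 'public', where the pgvector extension is enabled
session.execute(text('SET search_path TO public'))

# Pass the engine explicitly when creating tables, e.g. Base.metadata.create_all(engine)
Base = declarative_base()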

## 1-b.) Create an HNSW index on the 'memes' table with cosine similarity! (1 point)
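
In pgvector, an HNSW index for cosine distance is declared with the `hnsw` access method and the `vector_cosine_ops` operator class. The sketch below only builds the statement. The helper name, the index name and the column name `embedding` are assumptions, so adapt them to your own table.

```python
import re


def hnsw_cosine_index_sql(table, column='embedding'):
    """Build the DDL for an HNSW index that uses pgvector's cosine operator class."""
    for name in (table, column):
        # Identifiers cannot be bound parameters, so only accept plain names
        if not re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', name):
            raise ValueError(f'Unsafe identifier: {name!r}')
    return (f'CREATE INDEX IF NOT EXISTS {table}_{column}_hnsw_idx '
            f'ON {table} USING hnsw ({column} vector_cosine_ops)')
```

The resulting string can be run through the session, for example with `session.execute(text(sql))` followed by `session.commit()`, where `text` comes from `sqlalchemy`.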

## 2.) Using the embeddings for the query images in './queries/queries_embeddings.json', find the most similar image in the database for each query image and plot the pairs side by side. (3 points)

**Fetch the (binary) image data from the 'memes' table! Do NOT use the local version!**

**Make sure that you write PARAMETRIZED queries!**
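
A sketch of what a parametrized nearest-neighbour query could look like. pgvector's `<=>` operator returns the cosine distance, so ordering by it in ascending order puts the most similar row first. The helper names and the column names `filename`, `image` and `embedding` are assumptions. The query vector is passed as a bound parameter in pgvector's text form and cast with `CAST(... AS vector)`.

```python
import re


def to_vector_literal(embedding):
    """Format a list of numbers in pgvector's text form, e.g. '[0.1,0.2]'."""
    return '[' + ','.join(str(float(x)) for x in embedding) + ']'


def nearest_image_query(table, query_embedding, k=1):
    """Return (sql, params) for the k rows closest to the query by cosine distance."""
    # The table name cannot be a bound parameter, so validate it instead
    if not re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', table):
        raise ValueError(f'Unsafe table name: {table!r}')
    sql = (f'SELECT filename, image FROM {table} '
           f'ORDER BY embedding <=> CAST(:query AS vector) LIMIT :k')
    params = {'query': to_vector_literal(query_embedding), 'k': k}
    return sql, params
```

With the session from above this could be executed as `session.execute(text(sql), params).first()`. The returned bytes can then be decoded with `Image.open(io.BytesIO(row.image))` before plotting.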

## 3.) Using the embedding corresponding to the text prompt 'An astronaut holding a gun', stored in './queries/text_embeddings.json', find and plot the image in the 'memes' table that is the most similar to it! (3 points)

**Fetch the (binary) image data from the 'memes' table! Do NOT use the local version!**

**Make sure that you write PARAMETRIZED queries!**
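
CLIP maps text and images into the same embedding space, so the text embedding can be compared with the image embeddings using the same cosine ranking as in task 2. As an optional sanity check for the database result, the brute-force reference below ranks a few embeddings locally. The function names are assumptions, and this is a checking aid only, not a replacement for the parametrized query.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, i.e. 1 - cosine similarity, between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the name in candidates ({name: embedding}) closest to the query."""
    return min(candidates, key=lambda name: cosine_distance(query, candidates[name]))
```

If the file name returned by the database differs from the local result, keep in mind that an HNSW index performs an approximate search, so small discrepancies are possible.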