# Step 1: Install Required Libraries
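Install the libraries this notebook uses. Step 4 also imports `langchain`, which the cell below does not install, so add it if your environment lacks it.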

In [7]:
!pip install requests
!pip install python-dotenv
!pip install pgvector psycopg
!pip install openai
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (6.6 kB)
Collecting regex>=2022.1.18 (from tiktoken)
  Downloading regex-2024.9.11-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (40 kB)
Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB)
Downloading regex-2024.9.11-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (791 kB)
Installing collected packages: regex, tiktoken
Successfully installed regex-2024.9.11 tiktoken-0.8.0


# Step 2: Set up REST API interaction
In this step, we define a function that sends a document's file path to the Flask document processor API and returns the API's JSON response.

In [9]:
import requests
import os

def process_document_via_api(file_path):
    """
    This function sends a request to the document processor API to process the document.
    It sends the file path as a payload to the API.
    """
    # Define the API URL (adjust if running on a different host)
    api_url = "http://doc_processor:5000/process_document"

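    # Optional local existence check, left disabled: the API resolves the path
    # on its own filesystem, which may differ from the one this notebook sees.
    # if not os.path.exists(file_path):
    #     raise FileNotFoundError(f"The file at {file_path} does not exist.")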

    # Create the payload with the file path
    payload = {"file_path": file_path}

    # Send the POST request to the API
    response = requests.post(api_url, json=payload)

    # Check if the request was successful
    if response.status_code == 200:
        print("Document processed successfully.")
        return response.json()
    else:
        print(f"Error: {response.status_code}")
        print(response.text)  # Show the error body returned by the API
        response.raise_for_status()

# Step 3: Process a Document via the REST API
Provide the file path to the document you want to process.
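
Because only the path string is sent, it has to be valid where the API runs, not necessarily where this notebook runs. The sketch below shows one way to translate a local path, assuming the documents folder is mounted at `/app/documents` on the API side (suggested by the example path in the next cell, but not confirmed by the source). `to_api_path` and `local_root` are hypothetical names used only for illustration.

```python
from pathlib import PurePosixPath


def to_api_path(local_path: str, local_root: str, api_root: str = "/app/documents") -> str:
    """Map a path under local_root to the same relative path under api_root.

    Raises ValueError if local_path is not located under local_root.
    api_root is an assumed mount point, not a documented setting.
    """
    relative = PurePosixPath(local_path).relative_to(PurePosixPath(local_root))
    return str(PurePosixPath(api_root) / relative)


# Example with made-up local paths
print(to_api_path("/home/me/docs/reports/q1.pdf", "/home/me/docs"))
```

If the notebook and the API share the same mount, no translation is needed and the path can be passed through unchanged.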

In [10]:
# Example file path (adjust this to point to your document)
file_path = "/app/documents/carbon-free-energy.pdf"

# Call the REST API to process the document
response = process_document_via_api(file_path)

# Output the API response
response

Document processed successfully.


{'message': 'Document processed successfully. Number of chunks: 47'}
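
The API reports the chunk count only inside the `message` string. If you need the number itself, for example to sanity-check a run, a small helper can extract it. This is a sketch that assumes the message format shown above stays stable; `extract_chunk_count` is a hypothetical helper, not part of the API.

```python
import re
from typing import Optional


def extract_chunk_count(api_response: dict) -> Optional[int]:
    """Return the chunk count from the API's message, or None if it is absent."""
    message = api_response.get("message", "")
    match = re.search(r"Number of chunks:\s*(\d+)", message)
    return int(match.group(1)) if match else None


# Response copied from the output above
example = {'message': 'Document processed successfully. Number of chunks: 47'}
print(extract_chunk_count(example))  # prints 47
```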

# Step 4: Perform Semantic Search Query (Optional)
Once the document is processed, you can query the stored embeddings directly in the database.
The cell below uses `psycopg` (version 3) together with `pgvector` to run a similarity search; extend it to build additional query functionality.
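
The query in the cell below orders rows with pgvector's `<->` operator, which is Euclidean (L2) distance; pgvector also offers `<=>` for cosine distance. For unit-length vectors the two produce the same ranking, because the squared L2 distance equals twice the cosine distance. The following pure-Python sketch illustrates this with made-up 2D vectors rather than real embeddings; whether your embeddings are unit-length depends on the model, so treat that as an assumption to verify.

```python
import math


def l2_distance(a, b):
    """Euclidean distance, the metric behind pgvector's <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity, the metric behind pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Toy data: a query and three "documents", all normalized to unit length
query = normalize([1.0, 0.2])
docs = {
    "a": normalize([0.9, 0.1]),
    "b": normalize([0.1, 1.0]),
    "c": normalize([-1.0, 0.3]),
}

by_l2 = sorted(docs, key=lambda name: l2_distance(query, docs[name]))
by_cosine = sorted(docs, key=lambda name: cosine_distance(query, docs[name]))
print(by_l2, by_cosine)
```

If the stored vectors are not normalized, the two operators can rank rows differently, so pick the one that matches how the embeddings were meant to be compared.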

In [15]:
from dotenv import load_dotenv
from pgvector.psycopg import register_vector
import psycopg
import os
import numpy as np
from langchain.embeddings.openai import OpenAIEmbeddings

# Load environment variables from .env file
load_dotenv()

# Initialize the embedding model (assuming you are using OpenAI embeddings)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
embedding_model = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

# Connect to PostgreSQL database
conn = psycopg.connect(
    dbname=os.getenv("POSTGRES_DB"),
    host=os.getenv("POSTGRES_HOST"),
    user=os.getenv("POSTGRES_USER"),
    password=os.getenv("POSTGRES_PASSWORD")
)

# Register pgvector's type adapters on this connection
# (the "vector" extension itself must already be enabled in the database)
register_vector(conn)

# Function to query the database for similar documents
def get_top_similar_documents(user_input: str):
    # Embed the user input using embed_query for single queries
    query_vector = embedding_model.embed_query(user_input)

    # Convert the query_vector to a format that PostgreSQL expects
    query_vector_str = '[' + ','.join(map(str, query_vector)) + ']'

    # Fetch the 2 nearest stored chunks, ordered by L2 distance (pgvector's <-> operator)
    response = conn.execute(
        '''
        SELECT lc.name AS collection_name, le.document, le.cmetadata 
        FROM langchain_pg_embedding le
        JOIN langchain_pg_collection lc ON le.collection_id = lc.uuid
        ORDER BY le.embedding <-> %s::vector LIMIT 2
        ''',
        (query_vector_str,)
    ).fetchall()

    # Display the results
    for hit in response:
        # Accessing the values by their index in the tuple
        print(f"Collection: {hit[0]}")  # collection_name
        print(f"Document: {hit[1]}")    # document
        print(f"Metadata: {hit[2]}")    # cmetadata
        print("---------")



# Example usage
user_input = input("Enter a description to search: ")
get_top_similar_documents(user_input)


Enter a description to search:  Hi, what are Google plans for the future?


Collection: documents
Document: SEPTEMBER 2020 24/7 BY 2030: REALIZING A CARBON-FREE FUTUREdaunting challenge, it’s also an epic opportunity—a once-in-history 
chance to fundamentally reshape the world’s energy systems for  
the better.
Google will continue to lead the way toward a clean energy future in 
our own operations, but to create broader change we need your 
help. Let’s work together and make a carbon-free economy a reality, 
this decade. The planet can’t wait any longer. 
Notes
1. To ensure that Google is the driver for bringing new clean energy onto the grid, we 
insist that all projects we buy electricity from be “ additional .” This means that we 
seek to purchase energy from not yet constructed generation facilities that will be 
built above and beyond what’s required by existing energy regulations.
2. Google remains unwavering in our commitment to the United Nations Framework 
Convention on Climate Change’s 2015 Paris Agreement , which targets aggressive
Metadata: {'sour