## **Nugen Intelligence**
<img src="https://nugen.in/logo.png" alt="Nugen Logo" width="200"/>

Domain-aligned foundational models at industry-leading speeds with zero data retention. To learn more, visit the [Nugen documentation](https://docs.nugen.in/introduction).

### **Embedding Government Documents for Enhanced Query Resolution**
**Introduction**

Welcome to the Nugen API Guide! This notebook will help you use Nugen’s embedding and completion APIs to extract information from PDF documents and answer questions based on the content. 

By the end of this guide, you'll be able to:

* Extract text from a PDF file.
* Generate embeddings for chunks of text.
* Find relevant information from a document using embeddings.
* Generate answers based on the relevant information.

**Step 1: Installing Required Libraries**

Before starting, make sure the required libraries are installed. Run the following command to install them:

In [None]:
!pip install --quiet PyPDF2==3.0.1 requests numpy==1.26.0

These libraries will help us:

* **PyPDF2**: For extracting text from PDF documents.
* **requests**: For making API calls to Nugen.
* **numpy**: For handling embeddings and similarity calculations.

**Step 2: Importing Libraries and Helper Functions**

Let's begin by importing the libraries and defining helper functions for interacting with the Nugen API.

In [None]:
import json
import os
import re
import time
from datetime import datetime

import numpy as np
import PyPDF2
import requests

**Step 3: Using Nugen APIs for Embeddings**

We’ll define a function that sends text data to Nugen’s embedding model and retrieves embeddings.

To learn more about the Nugen API and get a free API key, visit the [Nugen Dashboard](https://nugen-platform-frontend.azurewebsites.net/dashboard).

In [None]:
# Load the API key from an environment variable rather than hard-coding it
api_key = os.getenv("NUGEN_API_KEY")
if not api_key:
    raise RuntimeError("Set the NUGEN_API_KEY environment variable before running this notebook.")


In [None]:
def sanitize_text(text):
    """Remove non-printable/control characters while preserving whitespace."""
    # str.isprintable() is False for newlines and tabs, so whitespace is kept explicitly;
    # otherwise words and sentences on adjacent lines would be fused together
    sanitized = ''.join(char for char in text if char.isprintable() or char.isspace())
    return sanitized

In [None]:
def get_nugen_embeddings(texts, model="nugen-flash-embed", dimensions=768):
    """Fetch embeddings for a list of texts from Nugen API."""
    embedding_url = "https://api.nugen.in/inference/embeddings"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    data = {
        "input": texts,
        "model": model,
        "dimensions": dimensions
    }
    # A timeout keeps the notebook from hanging indefinitely if the API is unreachable
    response = requests.post(embedding_url, headers=headers, data=json.dumps(data), timeout=30)
    if response.status_code == 200:
        response_json = response.json()
        return [entry["embedding"] for entry in response_json["data"]]
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return None
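
Because the request body takes a list of texts in `input`, several chunks can share one request instead of one call per chunk. The sketch below shows one way to group texts into batches. The helper names `batched` and `embed_in_batches` and the default batch size of 16 are illustrative assumptions, not documented Nugen limits; `embed_fn` stands in for `get_nugen_embeddings` so the logic can be tried offline.

```python
from typing import Callable, Iterator, List, Optional, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], Optional[List[List[float]]]],
    batch_size: int = 16,  # assumed value; check the API's real request limits
) -> Optional[List[List[float]]]:
    """Embed `texts` batch by batch; return None as soon as any batch fails."""
    embeddings: List[List[float]] = []
    for batch in batched(texts, batch_size):
        result = embed_fn(batch)
        if result is None:
            return None
        embeddings.extend(result)
    return embeddings


# Offline stand-in for get_nugen_embeddings: one fake 2-d vector per text
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(text)), 0.0] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)  # [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
```

In the notebook, passing `get_nugen_embeddings` as `embed_fn` would send each batch in a single request, provided the service accepts that many inputs at once.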

**Step 4: Downloading the PDF**

In [None]:
!wget -O legal_service_authorities_act_1987.pdf https://www.indiacode.nic.in/bitstream/123456789/19023/1/legal_service_authorities_act,_1987.pdf

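Alternatively, download a PDF with `requests`. The rest of this guide uses the file saved by this cell, `registration_act_1908.pdf`: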

In [None]:
# Download the PDF from a trusted source and run basic sanity checks
pdf_url = "https://www.indiacode.nic.in/bitstream/123456789/13236/1/the_registration_act,_1908.pdf"
pdf_file = "registration_act_1908.pdf"
response = requests.get(pdf_url, timeout=60)
response.raise_for_status()  # Stop on HTTP errors instead of saving an error page
# A PDF starts with the "%PDF" magic bytes; this is a format check, not a malware scan
if not response.content.startswith(b"%PDF"):
    raise ValueError("Downloaded content is not a PDF file.")
with open(pdf_file, "wb") as f:
    f.write(response.content)
print("PDF downloaded successfully.")

PDF downloaded successfully.


**Step 5: Extracting Text from a PDF Document**

The next step is to extract text from the PDF file. We’ll loop through all the pages of the PDF document and extract the text.

In [None]:
def extract_text_from_pdf(file_path):
    """Extract text from every page of the PDF document."""
    pdf_text = ""
    with open(file_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        for page in reader.pages:
            text = page.extract_text()
            if text:  # Skip pages with no extractable text (e.g. scanned images)
                pdf_text += text + "\n"
    return pdf_text

**Step 6: Chunking the Text**

To handle large documents, it’s helpful to split the text into smaller chunks. This function splits the extracted text into sentences and groups them into chunks of `chunk_size` sentences each (50 by default).

In [None]:
# Sentence-based chunking keeps each chunk more coherent than fixed character windows
def chunk_text(text, chunk_size=50):
    """Split text into chunks of up to `chunk_size` sentences each."""
    sentences = re.split(r'(?<=[.?!])\s+', text)
    chunks, chunk = [], []
    for sentence in sentences:
        chunk.append(sentence)
        if len(chunk) >= chunk_size:
            chunks.append(" ".join(chunk))
            chunk = []
    if chunk:
        chunks.append(" ".join(chunk))
    return chunks
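
Fixed sentence windows can cut a provision in half at a chunk boundary. One common mitigation is to let consecutive chunks share a few sentences. The following is a minimal sketch of that idea and is not part of the original pipeline; `chunk_text_with_overlap` and its `overlap` parameter are hypothetical names introduced here.

```python
import re
from typing import List


def chunk_text_with_overlap(text: str, chunk_size: int = 50, overlap: int = 5) -> List[str]:
    """Split text into sentence chunks where neighbours share `overlap` sentences."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Same sentence-splitting rule as chunk_text: break after ., ? or !
    sentences = [s for s in re.split(r'(?<=[.?!])\s+', text.strip()) if s]
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        # Stop once the window has reached the final sentence
        if start + chunk_size >= len(sentences):
            break
    return chunks


sample = "One. Two? Three! Four. Five."
print(chunk_text_with_overlap(sample, chunk_size=3, overlap=1))
# ['One. Two? Three!', 'Three! Four. Five.']
```

Overlap increases the number of chunks, and therefore the number of embedding requests, so the value is a trade-off rather than a fixed recommendation.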

**Step 7: Processing the Document and Generating Embeddings**

In this step, we’ll combine everything to process the document, chunk the text, and generate embeddings for each chunk.

In [None]:
def process_document(file_path, chunk_size=50):
    """Process the document and return aligned lists of chunks and their embeddings."""
    pdf_text = extract_text_from_pdf(file_path)
    pdf_text = sanitize_text(pdf_text)  # Strip control characters before chunking
    chunks = chunk_text(pdf_text, chunk_size)
    embedded_chunks, doc_embeddings = [], []
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i + 1} of {len(chunks)}")
        doc_embds = get_nugen_embeddings([chunk], model="nugen-flash-embed", dimensions=768)
        if doc_embds:
            # Keep a chunk only when its embedding succeeded, so that
            # embedded_chunks[i] always corresponds to doc_embeddings[i]
            embedded_chunks.append(chunk)
            doc_embeddings.extend(doc_embds)
        else:
            print(f"Failed to retrieve embeddings for chunk {i + 1}")
    return embedded_chunks, doc_embeddings

**Step 8: Finding Relevant Chunks**

Now we need to find the chunk most relevant to a query. We embed the query, take its dot product with every chunk embedding, and return the chunk with the highest score.

In [None]:
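def find_relevant_chunk(query, chunks, doc_embeddings):
    """Find the most relevant chunk for the query."""
    # np.argmax raises on an empty array, so bail out early if nothing was embedded
    if not doc_embeddings:
        print("No document embeddings available.")
        return None
    query_embd = get_nugen_embeddings([query], model="nugen-flash-embed", dimensions=768)
    if query_embd: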
        query_embd = np.array(query_embd[0]).reshape(1, -1)
        similarities = np.dot(np.array(doc_embeddings), query_embd.T).flatten()
        retrieved_id = np.argmax(similarities)
        if retrieved_id < len(chunks):
            print(f"Most relevant chunk (similarity: {similarities[retrieved_id]:.4f}):\n")
            return chunks[retrieved_id]
        else:
            print("Error: Retrieved ID out of range.")
            return None
    else:
        print("Failed to retrieve query embedding.")
        return None
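
The dot product used in `find_relevant_chunk` equals cosine similarity only when the embeddings have unit length, which this guide does not confirm for `nugen-flash-embed`. A hedged variant that normalizes explicitly and returns the top `k` chunks instead of a single one is sketched below. `top_k_chunks` is a hypothetical helper, and the vectors are toy values rather than real embeddings.

```python
import numpy as np


def top_k_chunks(query_vec, doc_vecs, chunks, k=3):
    """Return the k chunks with the highest cosine similarity to the query."""
    docs = np.asarray(doc_vecs, dtype=float)    # shape (n_chunks, dim)
    query = np.asarray(query_vec, dtype=float)  # shape (dim,)
    # Cosine similarity = dot product divided by the product of the norms
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    denom = np.maximum(denom, 1e-12)            # guard against zero vectors
    scores = docs @ query / denom
    order = np.argsort(scores)[::-1][:k]        # indices sorted by descending score
    return [(chunks[i], float(scores[i])) for i in order]


toy_chunks = ["registration fees", "powers of the Registrar", "unrelated text"]
toy_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 2.0]]
results = top_k_chunks([1.0, 0.0], toy_vecs, toy_chunks, k=2)
for chunk, score in results:
    print(f"{score:.2f}  {chunk}")
```

Returning several chunks lets the completion prompt include more context when the answer spans a chunk boundary, at the cost of a longer prompt.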

**Step 9: Generating a Completion Based on the Relevant Chunk**

After finding the relevant text chunk, we can generate an answer to the query using Nugen’s completion API.

In [None]:

def get_nugen_completion(prompt, model="nugen-flash-instruct", max_tokens=400, temperature=1.0):
    """Fetch a completion using Nugen API."""
    completion_url = "https://api.nugen.in/inference/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    data = {
        "prompt": prompt,
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature
    }
    # A timeout keeps the notebook from hanging indefinitely if the API is unreachable
    response = requests.post(completion_url, headers=headers, data=json.dumps(data), timeout=60)
    if response.status_code == 200:
        return response.json()["choices"][0]["text"].strip()
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return None


In [None]:
def log_qa(question, answer):
    """
    Log Q&A pairs to a file with timestamps for auditing.
    This function appends each question and its corresponding answer 
    along with a timestamp to a log file for record-keeping purposes.
    """

    # Get the current date and time in the format YYYY-MM-DD HH:MM:SS
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Open the log file in append mode ("a").
    # If the file does not exist, it will be created automatically.
    with open("qna_log.txt", "a") as log_file:
        # Write the timestamp, question, and answer to the log file in a structured format
        log_file.write(f"{timestamp} - Question: {question}\nAnswer: {answer}\n\n")

**Step 10: Putting It All Together**

We can now combine all the steps into a function that extracts text, finds relevant chunks, and generates answers for a given query.

In [None]:
def answer_query_from_pdf(pdf_file, query):
    """Answer a query based on the content of the PDF file."""
    chunks, doc_embeddings = process_document(pdf_file, chunk_size=50)
    relevant_text = find_relevant_chunk(query, chunks, doc_embeddings)
    if relevant_text:
        print("Relevant text found:")
        print(relevant_text)
        # Send the question together with the retrieved context; the context
        # alone gives the model nothing to answer
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{relevant_text}\n\n"
            f"Question: {query}\nAnswer:"
        )
        answer = get_nugen_completion(prompt=prompt, model="nugen-flash-instruct")
        if answer:
            print("Generated answer:", answer)
            log_qa(query, answer)  # Log Q&A pair with timestamp
        else:
            print("Failed to generate an answer.")
    else:
        print("No relevant text found.")
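
As written, `answer_query_from_pdf` calls `process_document` on every query, so the whole document is re-embedded each time a question is asked. One hedged way to avoid that is to store the chunks and embeddings on disk once and reload them for later queries. The sketch below uses a plain JSON file; `save_index`, `load_index`, and the file name are hypothetical and not part of the Nugen API.

```python
import json
import os
import tempfile


def save_index(path, chunks, embeddings):
    """Write chunks and their embeddings to a JSON file."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"chunks": chunks, "embeddings": embeddings}, f)


def load_index(path):
    """Return (chunks, embeddings) from disk, or None if no index exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as f:
        payload = json.load(f)
    return payload["chunks"], payload["embeddings"]


# Round-trip demo with toy data in a temporary directory
index_path = os.path.join(tempfile.mkdtemp(), "registration_act_index.json")
save_index(index_path, ["chunk one", "chunk two"], [[0.1, 0.2], [0.3, 0.4]])
cached = load_index(index_path)
print(cached[0])  # ['chunk one', 'chunk two']
```

In the pipeline, one would call `load_index` first and fall back to `process_document` followed by `save_index` only when it returns `None`.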

**Step 11: Example Usage**

You can now use the following example to test the entire process:

In [None]:
pdf_file = "registration_act_1908.pdf"
# Interactive loop: keep answering questions until the user types 'exit'
while True:
    # Prompt the user to enter a question or type 'exit' to quit the loop
    query = input("\nEnter your question (or type 'exit' to quit): ")
    
    # Check if the user wants to exit the loop
    if query.lower() == "exit":
        break  # Exit the loop
    
    # Process the user's query and fetch the answer from the PDF
    answer_query_from_pdf(pdf_file, query)

Processing chunk 1 of 23
Processing chunk 2 of 23
Processing chunk 3 of 23
Processing chunk 4 of 23
Processing chunk 5 of 23
Processing chunk 6 of 23
Processing chunk 7 of 23
Processing chunk 8 of 23
Processing chunk 9 of 23
Processing chunk 10 of 23
Processing chunk 11 of 23
Processing chunk 12 of 23
Processing chunk 13 of 23
Processing chunk 14 of 23
Processing chunk 15 of 23
Processing chunk 16 of 23
Processing chunk 17 of 23
Processing chunk 18 of 23
Processing chunk 19 of 23
Processing chunk 20 of 23
Processing chunk 21 of 23
Processing chunk 22 of 23
Processing chunk 23 of 23
Relevant text found:
 (2) The Registrar shall also forward a copy of such document, together with a copy of the map or plan  (if any) mentioned in section 21, to every other Registrar in whose district any part of such property is  situate.   (3) Such Registrar on receiving any such copy s hall file it in his Book No. 1, and shall also send a  memorandum of the copy to each of the Sub -Registrars subordinate

In [None]:
# Clean up the downloaded file if it is still present
if os.path.exists(pdf_file):
    os.remove(pdf_file)
    print(f"Cleaned up temporary file: {pdf_file}")

**Conclusion**

This guide walks through how to use Nugen APIs to extract information from documents, generate embeddings, and answer queries. You can use this template to work with other PDF documents and queries.