## **Nugen Intelligence**
<img src="https://nugen.in/logo.png" alt="Nugen Logo" width="200"/>

Domain-aligned foundational models at industry-leading speeds with zero data retention!

### **Embedding Government Documents for Enhanced Query Resolution**
**Introduction**

Welcome to the Nugen API Guide! This notebook shows how to use Nugen’s embedding and completion APIs to extract information from PDF documents and answer questions based on their content.

By the end of this guide, you'll be able to:

* Extract text from a PDF file.
* Generate embeddings for chunks of text.
* Find relevant information from a document using embeddings.
* Generate answers based on the relevant information.

**Step 1: Installing Required Libraries**

Before starting, make sure the necessary libraries are installed. Run the following command to install them:

In [1]:
!pip install --quiet PyPDF2 requests numpy

These libraries will help us:

* PyPDF2: For extracting text from PDF documents.
* requests: For making API calls to Nugen.
* numpy: For handling embeddings and similarity calculations.

**Step 2: Importing Libraries and Helper Functions**

Let's begin by importing the libraries. The helper functions for interacting with the Nugen API are defined in the steps that follow.

In [2]:
import PyPDF2
import requests
import json
import numpy as np

**Step 3: Using Nugen APIs for Embeddings**

We’ll define a function that sends text data to Nugen’s embedding model and retrieves embeddings.

In [3]:
def get_nugen_embeddings(texts, model="nugen-flash-embed", dimensions=768):
    """Fetch embeddings for a list of texts from Nugen API."""
    api_key = "<--API KEY-->"  # Replace with your API key
    embedding_url = "https://api.nugen.in/inference/embeddings"
    
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    
    data = {
        "input": texts,
        "model": model,
        "dimensions": dimensions
    }
    
    response = requests.post(embedding_url, headers=headers, data=json.dumps(data))
    
    if response.status_code == 200:
        response_json = response.json()
        return [entry["embedding"] for entry in response_json["data"]]
    else:
        print("Error:", response.status_code, response.text)
        return None
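
The API helpers in this guide hard-code the key inside the function body. A minimal sketch of reading it from the environment instead is shown here; the variable name `NUGEN_API_KEY` and the helper `load_api_key` are assumptions made for illustration and do not come from Nugen's documentation.

```python
import os

def load_api_key(env_var="NUGEN_API_KEY"):
    """Return the API key stored in an environment variable.

    Assumption: the key was exported under `env_var` before the notebook
    started; the variable name is a hypothetical choice.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API.")
    return api_key

# Possible use inside a helper, replacing the hard-coded placeholder:
#     api_key = load_api_key()
```

Keeping the key out of the notebook makes it safer to share or commit the file.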

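**Step 4: Downloading the PDF**

The example document is the Legal Services Authorities Act, 1987, hosted on India Code. Fetch it with `wget`: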

In [None]:
!wget -O legal_service_authorities_act_1987.pdf https://www.indiacode.nic.in/bitstream/123456789/19023/1/legal_service_authorities_act,_1987.pdf

Alternatively, download the PDF with `requests`:

In [4]:
import requests

# URL of the PDF document
pdf_url = "https://www.indiacode.nic.in/bitstream/123456789/19023/1/legal_service_authorities_act,_1987.pdf"

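# Send a GET request to fetch the PDF and stop here on an HTTP error,
# so that an error page is never saved as if it were the PDF
response = requests.get(pdf_url, timeout=60)
response.raise_for_status()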

# Save the PDF to a file
with open("legal_service_authorities_act_1987.pdf", "wb") as pdf_file:
    pdf_file.write(response.content)

print("PDF downloaded successfully.")

PDF downloaded successfully.


**Step 5: Extracting Text from a PDF Document**

The next step is to extract text from the PDF file. We’ll loop through all the pages of the PDF document and extract the text.

In [5]:
def extract_text_from_pdf(file_path):
    """Extract text from the entire PDF document."""
    pdf_text = ""
    with open(file_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        for page in reader.pages:
            pdf_text += page.extract_text() + "\n"
    return pdf_text

**Step 6: Chunking the Text**

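To handle large documents, it’s helpful to split the text into smaller chunks. The function below drops blank lines and joins every `chunk_size` consecutive lines into one chunk, so the chunk size is measured in lines rather than in characters or tokens.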

In [6]:
def chunk_text(text, chunk_size):
    """Split text into manageable chunks."""
    all_lines = [line for line in text.split('\n') if line.strip()]
    chunks = []
    
    for i in range(0, len(all_lines), chunk_size):
        chunk = all_lines[i:i + chunk_size]
        chunks.append(' '.join(chunk))
    
    return chunks
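
As a quick check of the line-based behaviour, here is a small worked example. The function is repeated in compact form only so that the cell runs on its own, and the sample text is made up for illustration.

```python
def chunk_text(text, chunk_size):
    """Same line-based chunking as above, repeated so this cell is self-contained."""
    lines = [line for line in text.split("\n") if line.strip()]
    return [" ".join(lines[i:i + chunk_size]) for i in range(0, len(lines), chunk_size)]

# Five lines, one of them blank; the blank line is dropped before grouping
sample = "Section 1\n\nShort title.\nSection 2\nDefinitions.\n"
chunks = chunk_text(sample, chunk_size=2)
print(chunks)
# ['Section 1 Short title.', 'Section 2 Definitions.']
```

With `chunk_size=50`, as used later in this guide, each chunk therefore holds up to 50 non-blank lines of the PDF.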

**Step 7: Processing the Document and Generating Embeddings**

In this step, we’ll combine everything to process the document, chunk the text, and generate embeddings for each chunk.

In [7]:
def process_document(file_path, chunk_size=50):
    """Process the document, generate embeddings, and return chunks with embeddings."""
    pdf_text = extract_text_from_pdf(file_path)
    chunks = chunk_text(pdf_text, chunk_size)
    
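    # Keep chunks and embeddings index-aligned: a chunk is kept only if it was embedded,
    # otherwise a failed request would shift every later embedding onto the wrong chunk
    embedded_chunks = []
    doc_embeddings = []
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i + 1} of {len(chunks)}")
        doc_embds = get_nugen_embeddings([chunk], model="nugen-flash-embed", dimensions=768)
        if doc_embds:
            embedded_chunks.append(chunk)
            doc_embeddings.extend(doc_embds)
        else:
            print(f"Failed to retrieve embeddings for chunk {i + 1}")
    
    return embedded_chunks, doc_embeddings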
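
The loop above sends one request per chunk. Because `get_nugen_embeddings` takes a list of texts, several chunks can probably share one request. The sketch below works under that assumption; `embed_in_batches`, `embed_fn`, and `batch_size` are hypothetical names, and this guide does not state the API's per-request limit, so the batch size is a guess to tune.

```python
def embed_in_batches(chunks, embed_fn, batch_size=8):
    """Embed chunks in groups, keeping chunks and vectors index-aligned.

    `embed_fn` is any callable that maps a list of texts to a list of vectors
    (or None on failure), for example get_nugen_embeddings.
    """
    kept_chunks, embeddings = [], []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors = embed_fn(batch)
        # Drop the whole batch if the call failed or returned the wrong count
        if not vectors or len(vectors) != len(batch):
            print(f"Skipping chunks {start + 1} to {start + len(batch)}: embedding request failed")
            continue
        kept_chunks.extend(batch)
        embeddings.extend(vectors)
    return kept_chunks, embeddings

def fake_embed(texts):
    # Stand-in for the real API call so the sketch runs offline
    return [[float(len(t))] for t in texts]

kept, vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(kept, vectors)
# ['a', 'bb', 'ccc'] [[1.0], [2.0], [3.0]]
```

For the 16-chunk document used in this guide, a batch size of 8 would reduce the number of requests from 16 to 2, provided the API accepts that many inputs at once.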

**Step 8: Finding Relevant Chunks**

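Now, we need to find the most relevant chunk for a query. We embed the query, score every chunk by the dot product between its embedding and the query embedding, and return the chunk with the highest score. Note that the dot product matches cosine similarity only when the embeddings have unit length.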

In [8]:
def find_relevant_chunk(query, chunks, doc_embeddings):
    """Find the most relevant chunk for the query."""
    query_embd = get_nugen_embeddings([query], model="nugen-flash-embed", dimensions=768)
    if query_embd:
        query_embd = np.array(query_embd[0]).reshape(1, -1)
        similarities = np.dot(np.array(doc_embeddings), query_embd.T).flatten()
        retrieved_id = np.argmax(similarities)
        if retrieved_id < len(chunks):
            return chunks[retrieved_id]
        else:
            print("Error: Retrieved ID out of range.")
            return None
    else:
        print("Failed to retrieve query embedding.")
        return None
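
The function above ranks chunks by raw dot product and returns only the single best match. If the embeddings are not unit-length (this guide does not say whether they are), longer vectors can win regardless of direction. A minimal sketch of cosine-normalised top-k retrieval follows; `top_k_cosine` is a hypothetical helper, and the toy vectors are invented for illustration.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=3):
    """Return (index, cosine score) pairs for the k most similar rows of doc_vecs."""
    docs = np.asarray(doc_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Normalise by vector lengths; the small floor avoids division by zero
    norms = np.maximum(np.linalg.norm(docs, axis=1) * np.linalg.norm(query), 1e-12)
    scores = (docs @ query) / norms
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]

docs = [[10.0, 0.0], [1.0, 1.0]]
query = [1.0, 1.0]
print(int(np.argmax(np.dot(docs, query))))   # 0: raw dot product favours the longer vector
print(top_k_cosine(query, docs, k=1)[0][0])  # 1: cosine favours the better-aligned vector
```

Returning several chunks instead of one also lets the completion step see more context when the answer spans a chunk boundary.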

**Step 9: Generating a Completion Based on the Relevant Chunk**

After finding the relevant text chunk, we can generate an answer to the query using Nugen’s completion API.

In [9]:
def get_nugen_completion(prompt, model="nugen-flash-instruct", max_tokens=400, temperature=1.0):
    """Fetch a completion using Nugen API."""
    api_key = "<--API KEY-->"  # Replace with your API key
    completion_url = "https://api.nugen.in/inference/completions"
    
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    
    data = {
        "prompt": prompt,
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature
    }
    
    response = requests.post(completion_url, headers=headers, data=json.dumps(data))
    if response.status_code == 200:
        return response.json()["choices"][0]["text"].strip()
    else:
        print("Error:", response.status_code, response.text)
        return None

**Step 10: Putting It All Together**

We can now combine all the steps into a function that extracts text, finds relevant chunks, and generates answers for a given query.

In [10]:
def answer_query_from_pdf(pdf_file, query):
    """Answer a query based on the content of the PDF file."""
    chunks, doc_embeddings = process_document(pdf_file, chunk_size=50)
    relevant_text = find_relevant_chunk(query, chunks, doc_embeddings)
    
    if relevant_text:
        print("Relevant text found:")
        print(relevant_text)
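        # Send both the retrieved context and the question; the chunk alone
        # gives the model nothing to answer
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{relevant_text}\n\n"
            f"Question: {query}\nAnswer:"
        )
        answer = get_nugen_completion(prompt=prompt, model="nugen-flash-instruct")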
        if answer:
            print("Generated answer:", answer)
        else:
            print("Failed to generate an answer.")
    else:
        print("No relevant text found.")
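
`answer_query_from_pdf` calls `process_document` on every query, so the whole PDF is re-embedded each time. One way to avoid that is to store the chunks and embeddings once and reload them for later queries. The sketch below uses a JSON file; `save_index`, `load_index`, and the file name are hypothetical, and the demo data is made up.

```python
import json
import os
import tempfile

def save_index(path, chunks, doc_embeddings):
    """Write chunks and their embeddings to a JSON file."""
    payload = {
        "chunks": chunks,
        "embeddings": [[float(x) for x in emb] for emb in doc_embeddings],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f)

def load_index(path):
    """Read back the (chunks, embeddings) pair written by save_index."""
    with open(path, "r", encoding="utf-8") as f:
        payload = json.load(f)
    return payload["chunks"], payload["embeddings"]

# Round trip with toy data; in the notebook the inputs would come from process_document
demo_path = os.path.join(tempfile.mkdtemp(), "index.json")
save_index(demo_path, ["chunk one", "chunk two"], [[0.1, 0.2], [0.3, 0.4]])
loaded_chunks, loaded_embeddings = load_index(demo_path)
print(loaded_chunks)
# ['chunk one', 'chunk two']
```

The reloaded pair can be passed straight to `find_relevant_chunk`, leaving one embedding request per query.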

**Step 11: Example Usage**

You can now use the following example to test the entire process:

In [11]:
pdf_file = "legal_service_authorities_act_1987.pdf"
query = "What should the Central Authority consist of?"
answer_query_from_pdf(pdf_file, query)

Processing chunk 1 of 16
Processing chunk 2 of 16
Processing chunk 3 of 16
Processing chunk 4 of 16
Processing chunk 5 of 16
Processing chunk 6 of 16
Processing chunk 7 of 16
Processing chunk 8 of 16
Processing chunk 9 of 16
Processing chunk 10 of 16
Processing chunk 11 of 16
Processing chunk 12 of 16
Processing chunk 13 of 16
Processing chunk 14 of 16
Processing chunk 15 of 16
Processing chunk 16 of 16
Relevant text found:
29. Power of Central Authority to make regulations. —(1) The Central Authority may, by  notification, make regulations not inconsistent wit h the provisions of this Act and the rules made  thereunder, to provide for all matters for which pr ovisions is necessary or expedient for the purposes  of  giving effect to the provisions of this Act.  (2) In particular, and without prejudice to the gener ality of the foregoing power, such regulations may  provide for all or any of the following matters, na mely: —  (a) the powers and functions of the Supreme Court Leg al Serv

**Conclusion**

This guide walks through how to use Nugen APIs to extract information from documents, generate embeddings, and answer queries. You can use this template to work with other PDF documents and queries.