## **Nugen Intelligence**
<img src="https://nugen.in/logo.png" alt="Nugen Logo" width="200"/>

Domain-aligned foundational models at industry-leading speeds with zero data retention.

### **Using Nugen's Embedding Model with LlamaIndex for PDF Content Retrieval**

### **Introduction**
In this cookbook, you will learn how to use Nugen’s powerful embedding models to convert PDF content into embeddings and how to use LlamaIndex to index and retrieve that data efficiently. This guide provides step-by-step instructions, from extracting text from PDFs to performing semantic searches using the generated embeddings.

Nugen offers state-of-the-art embedding models for natural language understanding that can transform unstructured text into meaningful vectors. LlamaIndex is an efficient tool for indexing and querying text data based on semantic similarity, making it an excellent choice for creating search engines or knowledge retrieval systems.


### Key Terms

* Embedding: A numerical representation of text, allowing machines to understand and process language in a meaningful way.
* Nugen API: An API that provides embedding and completion models for text processing.
* LlamaIndex: A framework for building retrieval-augmented generation (RAG) systems that index and retrieve information based on embeddings.
* Vector Store: A data structure used to store embeddings for fast, similarity-based retrieval.
* Semantic Search: A search method that uses the meaning of the query rather than just keyword matching.

### Step 1: Set Up the Environment

**Install Required Libraries**

Before you begin, ensure you have the necessary Python libraries installed. These libraries include requests (for making HTTP requests to Nugen’s API), llama_index (for indexing and querying), and PyMuPDF (for extracting text from PDF files).

In [3]:
%pip install --quiet requests llama-index PyMuPDF

Note: you may need to restart the kernel to use updated packages.

### Step 2: Get Your Nugen API Key

To use Nugen's embedding models, you will need to obtain an API key. 
You can get a **Nugen API** key for free from the **[Nugen documentation](https://docs.nugen.in/)**.

Once you have the API key, store it securely, as it will be used to authenticate requests to Nugen’s API.
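
One common way to keep the key out of the notebook is to read it from an environment variable. The sketch below assumes a variable named `NUGEN_API_KEY`; both that name and the helper `load_nugen_api_key` are illustrative choices for this guide, not something Nugen's documentation prescribes.

```python
import os


def load_nugen_api_key(env_var="NUGEN_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this example.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Nugen API key."
        )
    return api_key
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the API later on.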

### Step 3: Extract Text from PDF

To extract the text from a PDF, we will use the PyMuPDF library (also known as fitz). It allows us to load a PDF file and extract text from each page.


In [None]:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: The extracted text from the PDF.
    """
    # Open the PDF; the context manager closes the file when we are done
    with fitz.open(pdf_path) as document:
        # Extract the plain text of every page, in page order
        page_texts = [page.get_text("text") for page in document]

    return "".join(page_texts)

# Example usage (the PDF is expected in the current working directory)
pdf_path = "legal_service_authorities_act_1987.pdf"
extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text[:500])  # Print the first 500 characters of the extracted text


1 
 
THE LEGAL SERVICES AUTHORITIES ACT, 1987 
________ 
ARRANGEMENT OF SECTIONS 
________ 
CHAPTER I 
PRELIMINARY 
SECTIONS 
1. Short title, extent and commencement. 
2. Definitions. 
 
CHAPTER II 
THE NATIONAL LEGAL SERVICES AUTHORITY 
3. Constitution of the National Legal Services Authority. 
3A. Supreme Court Legal Services Committee. 
4. Functions of the Central Authority. 
5. Central Authority to work in coordination with other agencies. 
 
CHAPTER III 
STATE LEGAL SERVICES AUTHORITY 
6. C
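
The raw output above contains trailing spaces and blank lines left over from the PDF layout. If that whitespace is not wanted in the chunks, an optional cleanup pass along the following lines could be applied before splitting. The helper name `normalize_extracted_text` is hypothetical, and the function only tidies whitespace; it does not remove page numbers or ruled lines.

```python
import re


def normalize_extracted_text(text):
    """Trim each line, collapse runs of spaces and tabs, and drop empty lines."""
    cleaned_lines = []
    for line in text.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


# A short sample in the same shape as the extracted text
sample = "1 \n \nTHE LEGAL SERVICES AUTHORITIES ACT, 1987 \n________ \nCHAPTER I \n"
print(normalize_extracted_text(sample))
```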


### Step 4: Split Text into Chunks

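To avoid sending too much text to the model in a single request, we split the extracted text into smaller chunks. The function below uses fixed-size character windows rather than paragraph or section boundaries, so a chunk can end in the middle of a word. Its default of 50 characters is very small: each chunk costs one API call in Step 5, so a larger `chunk_size` reduces the number of requests.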

**Split Text into Chunks Function:**

In [10]:
def split_text_into_chunks(text, chunk_size=50):
    """
    Splits the extracted text into smaller chunks for embedding.
    
    Args:
        text (str): The full text to split.
        chunk_size (int): The maximum size of each chunk in characters.
        
    Returns:
        list: A list of text chunks.
    """
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
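
A common variation is to let consecutive chunks overlap, so that a sentence cut at one chunk boundary still appears whole in the neighbouring chunk. The sketch below is one way to do that; the function name `split_text_with_overlap` and the defaults of 500 and 50 characters are assumptions chosen for illustration, not values recommended by Nugen or LlamaIndex.

```python
def split_text_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so neighbouring chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be between 0 and chunk_size - 1")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Compare how many chunks (and therefore API calls) two settings produce
text = "x" * 10_000
print(len(split_text_with_overlap(text, chunk_size=50, overlap=0)))
print(len(split_text_with_overlap(text, chunk_size=500, overlap=50)))
```

Because every chunk becomes one embedding request, printing the chunk count before calling the API is a cheap way to estimate how much quota a document will use.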


### Step 5: Generate Embeddings from PDF Content

We’ll use Nugen’s embedding model to transform the text chunks into embeddings. This will convert the textual content into numerical vectors, making it possible to perform semantic searches.

**Generate Embeddings Function:**

In [21]:
import os

import requests

def get_nugen_embeddings(text, model="nugen-flash-embed", dimensions=123):
    """
    Calls Nugen's embedding API to fetch the embeddings for the given text.
    
    Args:
        text (str): The text to embed.
        model (str): The embedding model name (default is "nugen-flash-embed").
        dimensions (int): The number of dimensions for the embedding (default is 123).
        
    Returns:
        list: Embedding vectors for the input text.
    """
    url = "https://api.nugen.in/inference/embeddings"

    # Read the key from the environment instead of hard-coding it in the notebook.
    # NUGEN_API_KEY is a variable name chosen for this guide; set it to your own key.
    api_key = os.environ.get("NUGEN_API_KEY")
    if not api_key:
        raise Exception("Set the NUGEN_API_KEY environment variable to your Nugen API key.")

    payload = {
        "input": text,
        "model": model,
        "dimensions": dimensions
    }

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    # The timeout keeps the notebook from hanging if the API does not respond
    response = requests.post(url, json=payload, headers=headers, timeout=30)

    if response.status_code == 200:
        embeddings = response.json().get('embeddings')
        return embeddings
    else:
        raise Exception(f"Error fetching embeddings: {response.text}")
    

def generate_embeddings_from_pdf(pdf_path, model="nugen-flash-embed", dimensions=123):
    """
    Extracts text from a PDF, splits it into chunks, and generates embeddings for each chunk.
    
    Args:
        pdf_path (str): Path to the PDF file.
        model (str): The embedding model name (default is "nugen-flash-embed").
        dimensions (int): The number of dimensions for the embedding (default is 123).
        
    Returns:
        list: A list of embedding vectors for the PDF content.
    """
    # Step 1: Extract text from PDF
    text = extract_text_from_pdf(pdf_path)
    
    # Step 2: Split text into chunks
    chunks = split_text_into_chunks(text)
    
    # Step 3: Generate embeddings for each chunk
    all_embeddings = []
    for chunk in chunks:
        embeddings = get_nugen_embeddings(chunk, model=model, dimensions=dimensions)
        all_embeddings.append(embeddings)
    
    return all_embeddings
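
Because every chunk costs one API request, it can help to avoid embedding the same text twice, for example when a document repeats headers or ruled lines. The sketch below wraps any embedding function in a small in-memory cache. The name `make_cached_embedder` is hypothetical, and `fake_embed` is a stand-in for `get_nugen_embeddings` so that the example runs without network access.

```python
import hashlib


def make_cached_embedder(embed_fn):
    """Wrap embed_fn so each distinct text is embedded only once per session."""
    cache = {}

    def cached_embed(text):
        # Hash the text so the cache key stays small even for long chunks
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = embed_fn(text)
        return cache[key]

    return cached_embed


# Stand-in embedding function that records how often it is called
calls = []

def fake_embed(text):
    calls.append(text)
    return [float(len(text))]


embed = make_cached_embedder(fake_embed)
embed("Short title")
embed("Short title")  # served from the cache, no second call
embed("Definitions")
print(len(calls))
```

In the workflow above, the same wrapper could be applied as `make_cached_embedder(get_nugen_embeddings)`, with the caveat that the cache lives only as long as the Python session.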


### Step 6: Create a Vector Store Using LlamaIndex

Next, we will index the embeddings using LlamaIndex, which allows us to efficiently store and query the embeddings based on their similarity.

**Create Vector Store Function:**

In [22]:
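from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import SimpleVectorStore, VectorStoreQuery

def create_vector_store(embeddings):
    """
    Creates a LlamaIndex SimpleVectorStore from the provided embeddings.

    Each embedding is stored under a node ID equal to its position in the
    list, so the ID of a search hit is the index of the matching chunk.
    
    Args:
        embeddings (list): The embedding vectors to store, one per chunk.
            Each vector is assumed to be a flat list of floats.
    
    Returns:
        SimpleVectorStore: A vector store containing the embeddings.
    """
    # SimpleVectorStore keeps only IDs and vectors, so the node text is left empty
    nodes = [
        TextNode(id_=str(i), text="", embedding=embedding)
        for i, embedding in enumerate(embeddings)
    ]
    vector_store = SimpleVectorStore()
    vector_store.add(nodes)
    return vector_store


### Step 7: Perform Semantic Search on Indexed PDF Content

Once the embeddings are indexed, we can query the vector store using a text input (query). The query is embedded with the same model and dimensions as the chunks, and the vector store returns the IDs and similarity scores of the closest chunks.

**Query Function:**

In [23]:
def query_vector_store(query, vector_store, model="nugen-flash-embed", dimensions=123, top_k=3):
    """
    Queries the vector store to find the chunks most similar to the input query.
    
    Args:
        query (str): The query to search for in the vector store.
        vector_store (SimpleVectorStore): The vector store containing embeddings.
        model (str): The embedding model name (default is "nugen-flash-embed").
        dimensions (int): The number of dimensions for the query embedding;
            must match the value used when the store was built (default is 123).
        top_k (int): The number of matches to return (default is 3).
    
    Returns:
        VectorStoreQueryResult: The node IDs (`ids`) and similarity scores
            (`similarities`) of the closest embeddings.
    """
    query_embedding = get_nugen_embeddings(query, model=model, dimensions=dimensions)
    results = vector_store.query(
        VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=top_k)
    )
    return results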
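
To make the retrieval step concrete, the sketch below shows what a similarity query does conceptually: score every stored embedding against the query embedding with cosine similarity and keep the best matches. It is a plain-Python illustration, not LlamaIndex's implementation, and the three-dimensional vectors are made-up toy values rather than real Nugen embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_embedding, chunk_embeddings, chunks, k=2):
    """Return the k (score, chunk) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_embedding, embedding), chunk)
        for embedding, chunk in zip(chunk_embeddings, chunks)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: chunk texts paired with invented 3-dimensional vectors
chunks = [
    "Constitution of the National Legal Services Authority.",
    "Definitions.",
    "Short title, extent and commencement.",
]
chunk_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.2, 0.9]]
query_embedding = [1.0, 0.0, 0.0]

for score, chunk in top_k_chunks(query_embedding, chunk_embeddings, chunks):
    print(f"{score:.3f}  {chunk}")
```

Keeping the chunk texts alongside their embeddings, as this sketch does, is what allows a search hit to be turned back into readable text.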

### Step 8: Putting It All Together

Now we’ll combine all the functions into one cohesive process, from extracting text from the PDF to querying the indexed content.

**Complete Workflow:**

In [24]:
# Full process: from PDF extraction to query
def process_pdf_for_query(pdf_path, query):
    """
    Processes a PDF by extracting text, generating embeddings, and querying the vector store.
    
    Args:
        pdf_path (str): Path to the PDF file.
        query (str): The query to search for in the PDF content.
        
    Returns:
        The search results returned by `query_vector_store` for the query.
    """
    # Generate embeddings from PDF and create vector store
    embeddings = generate_embeddings_from_pdf(pdf_path)
    vector_store = create_vector_store(embeddings)
    
    # Query the vector store with the input query
    results = query_vector_store(query, vector_store)
    
    return results

# Sample PDF and query
pdf_path = "legal_service_authorities_act_1987.pdf"  # Replace with the actual PDF path
query = "What should the Central Authority consist of?"

# Run the process
results = process_pdf_for_query(pdf_path, query)

print(f"Query Results: {results}")


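When this notebook was run, the call failed with the error below rather than returning results. The message reports that the quota limit for the API key had been reached. With the default `chunk_size=50`, the workflow sends one embedding request per 50 characters of the PDF, so a long document consumes quota quickly.

Exception: Error fetching embeddings: {"detail":"Could not validate credentials. Reason: Quota Limit Reached"}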