**Mr.HelpMate AI**

**1. Embedding Layer**

Key Steps:

**Document Processing:**

Extract text from the PDF.
Clean the text (remove unnecessary symbols, normalize spaces, etc.).

**Chunking Strategies:**

Implement multiple chunking strategies: fixed-size, sentence-based, and semantic chunking.
Experiment with different chunk sizes for the fixed-size strategy and different similarity thresholds for semantic chunking.

**Embedding Choices:**

Try an OpenAI embedding model (requires an API key) and SentenceTransformers models (such as all-MiniLM-L6-v2).
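
The cleaning and sentence-splitting steps listed above could be sketched as follows. This is a minimal sketch, not the notebook's implementation: `clean_text` and `split_sentences` are hypothetical helper names, and which symbols count as "unnecessary" (control characters and a few bullet glyphs here) is an assumption that should be adapted to the actual PDF.

```python
import re

# Assumption: "unnecessary symbols" means control characters and bullet glyphs.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_BULLETS = re.compile(r"[\u2022\u25aa\u25cf\u00b7]")
# Rejoins words split across a line break ("cover-\nage" -> "coverage").
# Caveat: this also joins genuine hyphenated compounds that happen to wrap.
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")


def clean_text(raw: str) -> str:
    """Remove extraction debris and collapse all whitespace to single spaces."""
    text = _HYPHEN_BREAK.sub(r"\1\2", raw)
    text = _CONTROL.sub(" ", text)
    text = _BULLETS.sub(" ", text)
    return " ".join(text.split())


def split_sentences(text: str) -> list[str]:
    """Split after ., ! or ? when the next token starts like a new sentence.

    Keeping the punctuation attached avoids the dangling fragments that a
    plain split on ". " produces.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9(])", text)
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    raw = "Cover-\nage  \u2022 applies\x0c to\tMembers. No other benefits are included."
    cleaned = clean_text(raw)
    print(cleaned)
    print(split_sentences(cleaned))
```

A regex splitter like this is only a heuristic; abbreviations followed by a capitalized word will still be split incorrectly.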

In [2]:
pip install PyPDF2




In [4]:
pip install sentence-transformers




In [5]:
pip install chromadb


In [6]:
pip install faiss-cpu



In [7]:
pip install transformers



In [36]:
pip install --upgrade openai




In [9]:
pip install diskcache



In [10]:
pip install rich



In [50]:
import re

from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer

class EmbeddingLayer:
    """Extracts text from a PDF, splits it into chunks, and embeds the chunks."""

    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def process_document(self, pdf_path):
        reader = PdfReader(pdf_path)
        # Guard against pages that yield no text (for example, scanned images)
        text = " ".join((page.extract_text() or "") for page in reader.pages)
        return " ".join(text.split())  # Normalize whitespace

    def chunk_text(self, text, strategy="fixed", chunk_size=100):
        if strategy == "fixed":
            # chunk_size is measured in words, not characters or tokens
            words = text.split()
            return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
        elif strategy in ("sentence", "semantic"):
            # "semantic" is a placeholder that currently falls back to sentence splitting.
            # Split after sentence-ending punctuation so each sentence keeps its period.
            return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
        else:
            raise ValueError(f"Invalid strategy: {strategy!r}")

    def embed_chunks(self, chunks):
        return self.model.encode(chunks, convert_to_tensor=True)
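
The `semantic` strategy above is only a placeholder. One common way to fill it in, sketched here under assumptions, is to embed each sentence and start a new chunk whenever the cosine similarity between neighbouring sentences drops below a threshold. The names `semantic_chunks` and `toy_embed` are hypothetical; `toy_embed` is a keyword-count stand-in so the example runs without downloading a model, and the default threshold of 0.5 is an arbitrary starting point, not a tuned value.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: Sequence[str],
    embed: Callable[[Sequence[str]], Sequence[Vector]],
    threshold: float = 0.5,
) -> list[str]:
    """Group consecutive sentences until adjacent similarity falls below threshold."""
    if not sentences:
        return []
    vectors = embed(sentences)
    groups = [[sentences[0]]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) >= threshold:
            groups[-1].append(sentence)   # same topic: extend the current chunk
        else:
            groups.append([sentence])     # topic shift: start a new chunk
    return [" ".join(group) for group in groups]


def toy_embed(sentences: Sequence[str]) -> list[list[float]]:
    """Hypothetical stand-in embedding: counts of two keywords per sentence."""
    return [
        [float(s.lower().count("premium")), float(s.lower().count("claim"))]
        for s in sentences
    ]


if __name__ == "__main__":
    demo = [
        "Premium is due monthly.",
        "Premium rates may change.",
        "A claim must be filed in writing.",
        "Claim forms are available online.",
    ]
    for chunk in semantic_chunks(demo, toy_embed):
        print(chunk)
```

With the real model, `embed` could be the `encode` method of the `SentenceTransformer` instance held by `EmbeddingLayer`, since it accepts a list of sentences and returns one vector per sentence.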


**2. Search Layer**

Key Steps:

**Vector Database:**

Use ChromaDB for indexing and similarity searches.

**Query Embedding:**

Experiment with different query embedding models. The query must be embedded with the same model as the indexed chunks, so switching models means re-indexing the document.

**Re-ranking:**

Use cross-encoders such as cross-encoder/ms-marco-MiniLM-L-6-v2.

**Cache Mechanism:**

Store previously processed embeddings and search results.
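
The re-ranking step is not implemented in the notebook's code. A minimal sketch of how it could work: retrieve more candidates than needed, score each (query, chunk) pair, and keep the best few. The function `rerank` and the word-overlap `overlap_scorer` are hypothetical; the scorer exists only so the example runs offline. The commented lines show how a sentence-transformers `CrossEncoder` could be plugged in, assuming the Hugging Face model id `cross-encoder/ms-marco-MiniLM-L-6-v2` is the model the text refers to.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    chunks: Sequence[str],
    score_pairs: Callable[[list], Sequence[float]],
    top_k: int = 3,
) -> list:
    """Return the top_k (chunk, score) pairs, highest score first."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_pairs(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda item: item[1], reverse=True)
    return [(chunk, float(score)) for chunk, score in ranked[:top_k]]


def overlap_scorer(pairs: list) -> list:
    """Hypothetical stand-in scorer: fraction of query words found in the chunk."""
    scores = []
    for query, chunk in pairs:
        query_words = set(query.lower().split())
        chunk_words = set(chunk.lower().split())
        scores.append(len(query_words & chunk_words) / len(query_words) if query_words else 0.0)
    return scores


# With sentence-transformers installed, a cross-encoder could replace the stand-in:
#   from sentence_transformers import CrossEncoder
#   cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   reranked = rerank(query, candidate_chunks, cross_encoder.predict)

if __name__ == "__main__":
    candidates = [
        "premium due dates",
        "the claim process requires written notice",
        "claim forms",
    ]
    for chunk, score in rerank("claim process", candidates, overlap_scorer, top_k=2):
        print(f"{score:.2f}  {chunk}")
```

Re-ranking only helps if the first-stage search returns more candidates than the final `top_k`, so the vector search would typically be asked for around 10 to 20 results before re-ranking down to 3.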

In [51]:
import chromadb

class SearchLayer:
    """Indexes chunk embeddings in a persistent ChromaDB collection and queries them."""

    def __init__(self, model, db_path="./vector_db"):
        self.model = model
        self.client = chromadb.PersistentClient(path=db_path)
        self.collection = self.client.get_or_create_collection(name="policy_docs")

    def index_chunks(self, chunks, embeddings):
        # upsert overwrites entries with the same id, so re-running the notebook
        # refreshes the stored chunks. Entries with higher ids left over from an
        # earlier, longer run are not removed.
        for idx, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
            self.collection.upsert(
                documents=[chunk],
                metadatas=[{"id": idx}],
                embeddings=[embedding.tolist()],
                ids=[str(idx)]
            )

    def search(self, query, top_k=3):
        # Convert the NumPy vector to a plain list before passing it to ChromaDB
        query_embedding = self.model.encode(query).tolist()
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )
        return results
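
The cache mechanism named in the key steps is not part of `SearchLayer` as written. One possible sketch is a thin wrapper that keys results on the normalized query and `top_k`. `CachedSearch` is a hypothetical name, and the in-memory dict is an assumption; any object supporting `in`, item reads, and item writes could serve as the store, which should include a `diskcache.Cache` instance for persistence across sessions.

```python
import hashlib
import json


class CachedSearch:
    """Wraps a search callable and reuses results for repeated queries."""

    def __init__(self, search_fn, store=None):
        self.search_fn = search_fn                    # callable(query, top_k) -> results
        self.store = {} if store is None else store   # dict-like cache backend
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query, top_k):
        # Normalize case and whitespace so trivially different queries share a key
        normalized = " ".join(query.lower().split())
        payload = json.dumps({"q": normalized, "k": top_k}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def search(self, query, top_k=3):
        key = self._key(query, top_k)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        result = self.search_fn(query, top_k)
        self.store[key] = result
        return result


if __name__ == "__main__":
    def fake_search(query, top_k):
        # Stand-in for SearchLayer.search so the example runs without ChromaDB
        print(f"searching for: {query!r}")
        return {"documents": [["chunk"] * top_k]}

    cached = CachedSearch(fake_search)
    cached.search("What is covered?")
    cached.search("  what is   covered? ")   # served from the cache
    print(cached.hits, cached.misses)
```

In the notebook this could wrap the existing layer as `CachedSearch(search_layer.search)`. The cache has to be cleared whenever the collection is re-indexed with a different chunking or embedding setup, otherwise stale results are returned.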



**3. Generation Layer**

Key Steps:

**Prompt Engineering:**

Include retrieved chunks, the user query, and detailed instructions.
Experiment with few-shot examples for better responses.

**Language Model:**

Use OpenAI's GPT-3.5 API or HuggingFace's LLMs (like bigscience/T0pp).
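
The prompt-engineering steps above could be sketched as a small message builder that combines instructions, optional few-shot examples, the retrieved chunks, and the query. `build_messages` is a hypothetical helper, the wording of the system instruction is an assumption, and the few-shot pair in the usage example is a made-up placeholder rather than content from the policy document.

```python
def build_messages(query, chunks, examples=()):
    """Build a chat-format message list for a retrieval-augmented prompt.

    examples: iterable of (question, answer) pairs used as few-shot turns.
    """
    system = (
        "You are a helpful assistant providing concise answers based on the "
        "given policy details. Answer only from the provided excerpts and say "
        "so if the answer is not in them."
    )
    messages = [{"role": "system", "content": system}]

    # Few-shot examples are replayed as earlier user/assistant turns
    for question, answer in examples:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})

    # Number the excerpts so the model (and the reader) can tell them apart
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    messages.append({
        "role": "user",
        "content": (
            f"Policy excerpts:\n\n{context}\n\n"
            f"Question: {query}\n\n"
            "Provide a concise and clear answer."
        ),
    })
    return messages


if __name__ == "__main__":
    # Placeholder few-shot pair, not taken from the policy document
    shots = [("Example question?", "Example answer grounded in an excerpt.")]
    for message in build_messages("What is the claim process?", ["Chunk A", "Chunk B"], shots):
        print(message["role"], "->", message["content"][:60])
```

The returned list has the shape expected by the `messages` parameter of a chat completion call, so prompt variants can be compared by changing only the arguments to this function.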

In [58]:
from itertools import chain

from openai import OpenAI

class GenerationLayer:
    def __init__(self, api_key):
        # Client for the openai>=1.0 SDK; pass your OpenAI API key
        self.client = OpenAI(api_key=api_key)

    def generate_answer(self, query, retrieved_chunks):
        # ChromaDB returns one list of documents per query, so flatten the nested lists
        flattened_chunks = list(chain.from_iterable(retrieved_chunks))

        # Create the system and user messages for the API call
        system_message = "You are a helpful assistant providing concise answers based on the given policy details."
        # The final segment must be an f-string, otherwise "{query}" is sent literally
        user_message = (
            "The policy document contains the following relevant details:\n\n"
            + " ".join(flattened_chunks)
            + f"\n\nQuestion: {query}\n\nProvide a concise and clear answer."
        )

        # Call the chat completions endpoint using the 'messages' format
        response = self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            max_tokens=100,  # Caps the answer length; raise it if answers are cut off
            temperature=0.7,  # Lower values give more deterministic answers
        )

        # Extract the answer text from the first choice
        return response.choices[0].message.content.strip()





**4. Experimentation Plan**

Design three test queries based on the document content, such as:

"What are the coverage benefits of this policy?"
"Are there any exclusions in the policy?"
"What is the claim process for this policy?"

Evaluate each layer:

Compare the top 3 retrieved results for each query using different chunking and embedding strategies.
Analyze re-ranking improvements using cross-encoders.
Measure LLM output quality using variations in the prompt.
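
One way to make the retrieval comparison concrete is to score each configuration by how many expected keywords appear in its top 3 chunks. This is a rough sketch under assumptions, not a metric the plan prescribes: `keyword_recall` and `compare_configs` are hypothetical helpers, the expected keywords have to be chosen by hand after reading the policy, and the retrievers in the usage example are canned stand-ins for real search calls.

```python
def keyword_recall(chunks, keywords):
    """Fraction of expected keywords that occur anywhere in the retrieved chunks."""
    if not keywords:
        return 0.0
    text = " ".join(chunks).lower()
    found = sum(1 for keyword in keywords if keyword.lower() in text)
    return found / len(keywords)


def compare_configs(retrievers, expectations, top_k=3):
    """Average keyword recall per configuration.

    retrievers:   name -> callable(query, top_k) returning a list of chunk strings
    expectations: query -> list of keywords a good result should contain
    """
    table = {}
    for name, retrieve in retrievers.items():
        scores = [
            keyword_recall(retrieve(query, top_k), keywords)
            for query, keywords in expectations.items()
        ]
        table[name] = sum(scores) / len(scores) if scores else 0.0
    return table


if __name__ == "__main__":
    # Canned stand-ins; in the notebook each would call a differently configured search
    retrievers = {
        "fixed-100": lambda query, k: ["Exclusions apply to war.", "Premium is due."][:k],
        "sentence": lambda query, k: ["Premium is due."][:k],
    }
    # Hand-picked placeholder keywords for illustration only
    expectations = {"Are there any exclusions in the policy?": ["exclusions", "war"]}
    for name, score in compare_configs(retrievers, expectations).items():
        print(f"{name}: {score:.2f}")
```

Keyword recall is a blunt measure and does not replace reading the retrieved chunks, but it gives a repeatable number for comparing chunk sizes, embedding models, and re-ranking on the same queries.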

**5. Screenshot Requirements**

**Top 3 Results from Search Layer:**

Capture outputs showcasing the retrieved chunks for each query.

**Final Generated Answer:**

Capture answers generated by the LLM for each query.
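
Raw query results print as one long nested list, which is hard to read in a screenshot. A small formatter, sketched here as a hypothetical helper named `format_top_results`, could print one ranked line per chunk. It assumes the result is a dict with a `documents` key and an optional `distances` key, each holding one inner list per query, which is the shape ChromaDB's `query` returns by default.

```python
def format_top_results(query, results, width=160):
    """Render ranked chunks for a single query as readable text.

    width: maximum number of characters shown per chunk.
    """
    documents = results["documents"][0]
    # Distances may be absent if they were not requested from the collection
    distances = (results.get("distances") or [[None] * len(documents)])[0]

    lines = [f"Query: {query}"]
    for rank, (doc, dist) in enumerate(zip(documents, distances), start=1):
        snippet = doc if len(doc) <= width else doc[: width - 3] + "..."
        score = "n/a" if dist is None else f"{dist:.4f}"
        lines.append(f"{rank}. (distance {score}) {snippet}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hand-made result dict standing in for a real ChromaDB response
    sample = {
        "documents": [["First chunk of policy text.", "Second chunk. " * 20]],
        "distances": [[0.4211, 0.5378]],
    }
    print(format_top_results("What is covered?", sample, width=60))
```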

In [59]:
import os

def main(pdf_path, queries):
    # Initialize layers
    embedding_layer = EmbeddingLayer()
    search_layer = SearchLayer(embedding_layer.model)

    # Read the OpenAI API key from the OPENAI_API_KEY environment variable
    # instead of hard-coding it in the notebook
    generation_layer = GenerationLayer(api_key=os.environ.get("OPENAI_API_KEY"))

    # Process document
    text = embedding_layer.process_document(pdf_path)
    chunks = embedding_layer.chunk_text(text, strategy="fixed", chunk_size=100)
    embeddings = embedding_layer.embed_chunks(chunks).cpu().numpy()

    # Index chunks
    search_layer.index_chunks(chunks, embeddings)

    # Test queries
    for query in queries:
        print(f"\nQuery: {query}")
        results = search_layer.search(query)

        # results["documents"] holds one list of chunks per query
        retrieved_chunks = results["documents"]
        print("Top Retrieved Chunks:", retrieved_chunks)

        # Generate answer
        answer = generation_layer.generate_answer(query, retrieved_chunks)
        print("Generated Answer:", answer)


**Usage**

Save your document as a PDF and set pdf_path to its location.
Provide at least three queries in the queries list.

In [62]:
if __name__ == "__main__":
    pdf_path = "/content/Principal-Sample-Life-Insurance-Policy.pdf"
    queries = [
        "What are the benefits included under the Group Policy for Life Insurance?",
        "What is the coverage for Dependent Life Insurance?",
        "What is considered a 'Qualifying Event' for Accelerated Benefits under this policy?"
    ]
    main(pdf_path, queries)





Query: What are the benefits included under the Group Policy for Life Insurance?
Top Retrieved Chunks: [["qualifies and makes timely application, he or she may convert the group coverage by purchasing an individual policy of life insurance under these terms: (1) The Member will not be required to submit Proof of Good Health. (2) The policy will be for life insurance only. No disabilit y or other benefits will be included. (3) The policy will be on one of the forms, other than term insurance, then issued by The Principal to persons in the risk class to which the Member belongs on the individual policy's effective date. (4) Premium will be based on the Member's age", 'Policyholder The entity to whom this Group Policy is issued (see Title Page). Prior Policy The Group Term Life coverage of either: a. the Policyholder; or b. a business entity which has been obtained by the Policyholder through a merger or acquisition; for which this Group Policy is a replacement. Proof of Good Health Writ