**Problem Statement**

Insurance policy documents are long and dense, which makes it difficult for a reader to go through them and find answers to specific questions.

Let's build a robust generative search system that can answer questions from a policy document effectively and accurately.

For this project we use a single long life insurance policy document, stored at /content/drive/MyDrive/HelpMateAssignment/Principal-Sample-Life-Insurance-Policy.pdf


Approach:
1.	The embedding layer - Here we read and process the PDF document and divide it into chunks. We use two ways of chunking: fixed-size chunks and paragraph chunks. After chunking, we generate embeddings for the chunks using a pre-trained SentenceTransformer model, "all-MiniLM-L6-v2".
2.	The search layer - Here we build a semantic search. We take the user query, embed it, compute its cosine similarity with every chunk embedding, and list the top 3 chunks with the highest similarity. Those are the chunks that best match the user query. We also store the embeddings and chunks in ChromaDB, a vector database that handles storage and nearest-neighbour retrieval for us.
3.	The generative layer - Here we create prompts that help generate fast and accurate answers to the query.


## 1. <font color='red'> The Embedding Layer </font>



In [1]:
!pip install -U -q pdfplumber tiktoken openai chromaDB sentence-transformers

In [2]:
# Import required libraries

import numpy as np
import pandas as pd
import pdfplumber

### 1.1 <font color = 'green'> Document Chunking </font>

We will generate embeddings for the text of the insurance policy. Because the document consists of large blocks of text, we first need to split it into chunks before generating the embeddings. Let's start with a basic technique, fixed-size chunking, and then chunk the text by paragraph.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
input_path = '/content/drive/MyDrive/HelpMateAssignment/Principal-Sample-Life-Insurance-Policy.pdf'

#### 1.1.1 <font color = 'orange'> Fixed-Size Chunking </font>

In fixed-size chunking, the document is split into fixed-size windows with each window representing a separate document chunk.
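
As a rough illustration of the idea, the sketch below greedily packs whitespace-separated words into chunks of at most `chunk_size` characters. The function name `pack_into_chunks` and the toy sentence are assumptions made for this example; the notebook's own implementation follows in the next cells.

```python
def pack_into_chunks(text, chunk_size, sep=None):
    """Greedily pack pieces of `text` into chunks of at most `chunk_size` characters."""
    pieces = text.split(sep)  # sep=None splits on any whitespace
    chunks, current, used = [], [], 0
    for piece in pieces:
        cost = len(piece) + 1  # +1 accounts for the joining space
        # Start a new chunk when the next piece would exceed the budget
        if current and used + cost > chunk_size:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(piece)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy sentence, invented for this example
toy = "The Date of Issue is shown on the title page of this Group Policy"
chunks = pack_into_chunks(toy, chunk_size=20)
for chunk in chunks:
    print(repr(chunk))
# 'The Date of Issue'
# 'is shown on the'
# 'title page of this'
# 'Group Policy'
```

A small budget such as 20 characters is used here only to make the splits visible; the notebook uses 100 characters.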

In [5]:
# Function to extract the text of each PDF page and split it into chunks

def process_page(spliton):
    # Collect the chunks together with the page each chunk came from
    data = {'Title': [], 'Chunk Text': []}
    chunk_size = 100  # Set your desired chunk size (in characters)

    # Open the PDF file
    with pdfplumber.open(input_path) as pdf:
        for pdf_page in pdf.pages:
            single_page_text = pdf_page.extract_text()
            text_chunks = split_text_into_chunks(single_page_text, chunk_size, spliton)
            for chunk in text_chunks:
                # The page object is displayed as <Page:N>
                data['Title'].append(pdf_page)
                data['Chunk Text'].append(chunk)

    return pd.DataFrame(data)

In [6]:
# Function to split text into fixed-size chunks

def split_text_into_chunks(text, chunk_size, spliton):
    chunks = []
    if(spliton==''):
      words = text.split()  # Split the text into words
    else:
      words = text.split(spliton)  # Split the text on the given delimiter (e.g. '\n' for lines)

    current_chunk = []  # Store the pieces for the current chunk
    current_chunk_word_count = 0  # Number of characters (including separators) in the current chunk

    for word in words:
        if current_chunk_word_count + len(word) + 1 <= chunk_size:
            current_chunk.append(word)
            current_chunk_word_count += len(word) + 1
        else:
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
            current_chunk_word_count = len(word)

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

In [7]:
fixed_chunk_df = process_page('')

In [8]:
fixed_chunk_df

Unnamed: 0,Title,Chunk Text
0,<Page:1>,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...
1,<Page:1>,FOR: RHODE ISLAND JOHN DOE ALL MEMBERS Group M...
2,<Page:2>,This page left blank intentionally
3,<Page:3>,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...
4,<Page:3>,Effective on the later of the Date of Issue of...
...,...,...
1049,<Page:62>,required to be filed. Article 8 - Time Limits ...
1050,<Page:62>,required by law. This policy has been updated ...
1051,<Page:62>,"Section D - Claim Procedures, Page 2"
1052,<Page:63>,This page left blank intentionally


#### 1.1.2 <font color = 'orange'> Chunking by Paragraph </font>

Here, we split each page on the newline character and pack the resulting lines into chunks, so that every chunk is made of whole lines of the document.

In [9]:
# Chunk the document on newline characters
para_chunk_df = process_page('\n')

In [10]:
para_chunk_df

Unnamed: 0,Title,Chunk Text
0,<Page:1>,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...
1,<Page:1>,GROUP POLICY FOR: RHODE ISLAND JOHN DOE ALL ME...
2,<Page:1>,Print Date: 07/16/2014
3,<Page:2>,This page left blank intentionally
4,<Page:3>,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...
...,...,...
1255,<Page:62>,filed. Article 8 - Time Limits Any time limits...
1256,<Page:62>,This policy has been updated effective January...
1257,<Page:62>,"GC 6018 Section D - Claim Procedures, Page 2"
1258,<Page:63>,This page left blank intentionally


### 1.2 <font color = 'green'> Generating Embeddings </font>


In [11]:
# Install the sentence transformers library

!pip install -q -U sentence-transformers


In [12]:
from sentence_transformers import SentenceTransformer, util

In [13]:
# Load pre-trained Sentence Transformer model

model_name = "all-MiniLM-L6-v2"
embedder = SentenceTransformer(model_name)


In [14]:
# Function to generate embeddings for text
def generate_embeddings(texts):
    embeddings = embedder.encode(texts, convert_to_tensor=True)
    return embeddings

In [15]:
def generate_embeddings_on_df(df):
  df['Embeddings'] = df['Chunk Text'].apply(lambda x: generate_embeddings([x])[0])
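
Encoding one chunk per call works, but an encoder that accepts a list of texts can also be given all chunks at once. The sketch below shows that pattern in plain Python. `add_embeddings` and `toy_encode` are hypothetical names, and `toy_encode` is only a stand-in so the example runs without downloading a model; in this notebook the `encode` argument would be something like `lambda texts: embedder.encode(texts, convert_to_tensor=True)`.

```python
def add_embeddings(rows, encode):
    """Encode every chunk text in a single call and attach one vector per row."""
    texts = [row["Chunk Text"] for row in rows]
    vectors = encode(texts)  # one call for the whole list of texts
    for row, vector in zip(rows, vectors):
        row["Embeddings"] = vector
    return rows


def toy_encode(texts):
    """Stand-in encoder (an assumption for this sketch): maps a text to [length, number of spaces]."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


rows = [
    {"Chunk Text": "Print Date: 07/16/2014"},
    {"Chunk Text": "This page left blank intentionally"},
]
rows = add_embeddings(rows, toy_encode)
print(rows[0]["Embeddings"])  # [22.0, 2.0]
```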

In [16]:
# Create embeddings for the 'Chunk Text' column on both dataframes

generate_embeddings_on_df(fixed_chunk_df)

In [17]:
generate_embeddings_on_df(para_chunk_df)

In [18]:
fixed_chunk_df

Unnamed: 0,Title,Chunk Text,Embeddings
0,<Page:1>,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,"[tensor(-0.0528), tensor(0.0316), tensor(0.065..."
1,<Page:1>,FOR: RHODE ISLAND JOHN DOE ALL MEMBERS Group M...,"[tensor(-0.0326), tensor(0.0292), tensor(0.018..."
2,<Page:2>,This page left blank intentionally,"[tensor(0.0291), tensor(0.0606), tensor(0.0464..."
3,<Page:3>,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,"[tensor(-0.0216), tensor(0.0828), tensor(0.006..."
4,<Page:3>,Effective on the later of the Date of Issue of...,"[tensor(0.0179), tensor(-0.0087), tensor(0.077..."
...,...,...,...
1049,<Page:62>,required to be filed. Article 8 - Time Limits ...,"[tensor(-0.0459), tensor(0.0872), tensor(0.062..."
1050,<Page:62>,required by law. This policy has been updated ...,"[tensor(-0.0976), tensor(0.0637), tensor(0.059..."
1051,<Page:62>,"Section D - Claim Procedures, Page 2","[tensor(-0.1458), tensor(0.0946), tensor(0.090..."
1052,<Page:63>,This page left blank intentionally,"[tensor(0.0291), tensor(0.0606), tensor(0.0464..."


In [19]:
para_chunk_df

Unnamed: 0,Title,Chunk Text,Embeddings
0,<Page:1>,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,"[tensor(-0.0614), tensor(0.0300), tensor(0.064..."
1,<Page:1>,GROUP POLICY FOR: RHODE ISLAND JOHN DOE ALL ME...,"[tensor(-0.0047), tensor(0.0447), tensor(0.040..."
2,<Page:1>,Print Date: 07/16/2014,"[tensor(-0.0460), tensor(0.1036), tensor(-0.01..."
3,<Page:2>,This page left blank intentionally,"[tensor(0.0291), tensor(0.0606), tensor(0.0464..."
4,<Page:3>,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,"[tensor(-0.0216), tensor(0.0828), tensor(0.006..."
...,...,...,...
1255,<Page:62>,filed. Article 8 - Time Limits Any time limits...,"[tensor(-0.0404), tensor(0.0736), tensor(0.062..."
1256,<Page:62>,This policy has been updated effective January...,"[tensor(-0.0315), tensor(0.0708), tensor(0.094..."
1257,<Page:62>,"GC 6018 Section D - Claim Procedures, Page 2","[tensor(-0.1968), tensor(0.0632), tensor(0.066..."
1258,<Page:63>,This page left blank intentionally,"[tensor(0.0291), tensor(0.0606), tensor(0.0464..."


In [20]:
fixed_chunk_df['Embeddings'][0]

tensor([-5.2836e-02,  3.1606e-02,  6.5942e-02, -5.3049e-05,  1.2810e-02,
         9.5933e-02,  2.3919e-02, -3.7878e-02, -1.0634e-01,  4.6007e-02,
         3.7477e-02,  5.5533e-02, -1.2827e-02, -6.5894e-02, -6.8076e-02,
        -2.1931e-02, -2.0584e-02,  1.6472e-02, -4.4307e-02,  8.9845e-03,
         3.1904e-02,  3.0087e-02, -7.9046e-02, -2.9789e-02,  3.1667e-02,
         1.7966e-02,  4.1620e-02, -1.5310e-02,  1.5556e-02, -2.1093e-03,
         2.0278e-02, -5.7765e-02, -5.1948e-03,  3.4381e-02,  6.5676e-02,
         1.9977e-02, -1.5457e-02,  5.9743e-03, -5.3741e-03,  5.4316e-02,
        -5.1816e-02, -2.1929e-02,  6.5593e-02,  1.6982e-02, -5.4109e-02,
         5.0850e-02,  1.1968e-02, -3.2410e-02, -1.7589e-02, -1.4468e-02,
         3.2112e-02, -1.3447e-02,  3.7639e-02,  2.7480e-02,  3.3600e-02,
         2.9604e-02,  1.3419e-02,  2.4344e-02, -8.2679e-02, -9.0444e-02,
        -1.0205e-02,  1.1198e-02, -5.3772e-02, -7.0996e-02, -2.4880e-02,
         7.7738e-03, -5.3612e-03, -4.5682e-02,  7.1

In [21]:
# Save the embeddings in a CSV

output_path = '/content/drive/My Drive/HelpMateAssignment/'


# Save the dataframe with embeddings
fixed_chunk_df.to_csv(output_path+"fixed_chunk_embeddings.csv", index=False)
para_chunk_df.to_csv(output_path+"para_chunk_embeddings.csv", index=False)
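
Writing a column of tensors with `to_csv` stores each tensor as its printed text, which keeps only the printed digits and has to be parsed back by hand when the file is reloaded. One alternative, sketched below under the assumption that each embedding is first converted to a plain list (for example with `tensor.tolist()`), is to serialize the list as JSON inside the CSV cell. The in-memory buffer and the sample values are made up for illustration.

```python
import csv
import io
import json

# Made-up row; in the notebook the list would come from tensor.tolist()
rows = [
    {"Chunk Text": "Print Date: 07/16/2014",
     "Embeddings": [-0.0460453, 0.1036512, -0.0181138]},
]

buffer = io.StringIO()  # stands in for a CSV file on disk
writer = csv.DictWriter(buffer, fieldnames=["Chunk Text", "Embeddings"])
writer.writeheader()
for row in rows:
    # Store the embedding as a JSON string so it survives the round trip
    writer.writerow({"Chunk Text": row["Chunk Text"],
                     "Embeddings": json.dumps(row["Embeddings"])})

buffer.seek(0)
restored = [
    {"Chunk Text": r["Chunk Text"], "Embeddings": json.loads(r["Embeddings"])}
    for r in csv.DictReader(buffer)
]
print(restored == rows)  # True
```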

## 2. <font color = 'red'> The Search Layer </font>

### 2.1 <font color = 'green'> Defining Semantic Search </font>


In [23]:
# Read user input query
user_query = input() #Date of Issue

Date of Issue


In [25]:
# Define the function for calculating cosine similarity

def calculate_similarity(embedding1, embedding2):
    cosine_score = util.pytorch_cos_sim(embedding1, embedding2)
    # Convert the result to a Python float
    similarity = cosine_score.item()

    return similarity
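
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The plain-Python sketch below spells that out on toy 2-dimensional vectors; the function name and the vectors are assumptions for illustration and are not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]   # longer vector, same direction
orthogonal = [0.0, 2.0]
opposite = [-1.0, 0.0]

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
print(cosine_similarity(query, opposite))        # -1.0
```

Because only the direction matters, a longer chunk embedding is not favored over a shorter one.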

In [26]:
# Function to perform semantic search and return ranked chunks
def semantic_search(user_query, df, embedder):

    # Calculate the query embedding
    query_embedding = embedder.encode(user_query, convert_to_tensor=True)

    # Calculate similarity scores between the query embedding and all chunk embeddings
    df['Similarity'] = df['Embeddings'].apply(lambda x: calculate_similarity(query_embedding, x))

    # Sort the DataFrame by similarity scores in descending order
    df = df.sort_values(by='Similarity', ascending=False).reset_index(drop=True)

    # Return only the top 3 rows of the sorted dataframe
    df = df.head(3)

    return df

In [27]:
# Perform semantic search on each DataFrame
fixed_chunk_results = semantic_search(user_query, fixed_chunk_df, embedder)
para_chunk_results = semantic_search(user_query, para_chunk_df, embedder)

In [28]:
fixed_chunk_results

Unnamed: 0,Title,Chunk Text,Embeddings,Similarity
0,<Page:5>,in this Group Policy) The Date of Issue is Nov...,"[tensor(-0.0226), tensor(-0.0187), tensor(0.03...",0.627044
1,<Page:33>,REQUIREMENTS AND RIGHTS GC 6007 Section B - Ef...,"[tensor(-0.0961), tensor(0.0163), tensor(0.062...",0.598996
2,<Page:53>,Subject to the Effective Date provisions of PA...,"[tensor(-0.0307), tensor(0.0370), tensor(0.106...",0.589228


In [29]:
para_chunk_results

Unnamed: 0,Title,Chunk Text,Embeddings,Similarity
0,<Page:21>,date of change.,"[tensor(-0.0332), tensor(0.0838), tensor(0.053...",0.62725
1,<Page:5>,(called the Policyholder in this Group Policy)...,"[tensor(-0.0199), tensor(-0.0290), tensor(0.01...",0.622679
2,<Page:42>,c. Application/Effective Date,"[tensor(-0.0038), tensor(0.0362), tensor(-0.00...",0.604534


In [30]:
pip install plotly umap-learn

Collecting umap-learn
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading umap_learn-0.5.7-py3-none-any.whl (88 kB)
Downloading pynndescent-0.5.13-py3-none-any.whl (56 kB)
Installing collected packages: pynndescent, umap-learn
Successfully installed pynndescent-0.5.13 umap-learn-0.5.7


In [34]:
import plotly.express as px
import umap

In [36]:
# Combine all results into a single DataFrame for visualization
all_results = pd.concat([fixed_chunk_results, para_chunk_results], ignore_index=True)

# Assuming 'Embeddings' is the column containing embeddings as PyTorch tensors
embeddings = all_results['Embeddings'].tolist()

# Convert PyTorch tensors to NumPy arrays
embeddings = [embedding.cpu().numpy() if hasattr(embedding, 'cpu') else embedding.numpy() for embedding in embeddings]

# Convert the list of embeddings into a 2D NumPy array
X = np.array(embeddings)

# Reduce dimensionality using UMAP
umap_model = umap.UMAP(n_components=2)
umap_embeddings = umap_model.fit_transform(X)

# Add UMAP embeddings as new columns in your dataframe
all_results['umap_x'] = umap_embeddings[:, 0]
all_results['umap_y'] = umap_embeddings[:, 1]



In [31]:
# Inspect the combined results (without the UMAP coordinate columns)
all_results[['Title', 'Chunk Text', 'Embeddings', 'Similarity']]

Unnamed: 0,Title,Chunk Text,Embeddings,Similarity
0,<Page:5>,in this Group Policy) The Date of Issue is Nov...,"[tensor(-0.0226), tensor(-0.0187), tensor(0.03...",0.627044
1,<Page:33>,REQUIREMENTS AND RIGHTS GC 6007 Section B - Ef...,"[tensor(-0.0961), tensor(0.0163), tensor(0.062...",0.598996
2,<Page:53>,Subject to the Effective Date provisions of PA...,"[tensor(-0.0307), tensor(0.0370), tensor(0.106...",0.589228
3,<Page:21>,date of change.,"[tensor(-0.0332), tensor(0.0838), tensor(0.053...",0.62725
4,<Page:5>,(called the Policyholder in this Group Policy)...,"[tensor(-0.0199), tensor(-0.0290), tensor(0.01...",0.622679
5,<Page:42>,c. Application/Effective Date,"[tensor(-0.0038), tensor(0.0362), tensor(-0.00...",0.604534


In [32]:
print(all_results['Embeddings'].apply(lambda x: len(x)).unique())

[384]


In [37]:
# Compute the user query embedding
user_query_embedding = embedder.encode(user_query, convert_to_tensor=True)

# The embedding is a PyTorch tensor of shape (384,); move it to the CPU and convert it to a NumPy array of shape (1, 384)
# Reduce dimensionality of the user query embedding using the same UMAP model
user_query_umap = umap_model.transform(user_query_embedding.cpu().numpy().reshape(1, -1))

# Add the user query embedding to the DataFrame, using the same column names as the results
user_query_df = pd.DataFrame({
    'umap_x': user_query_umap[:, 0],
    'umap_y': user_query_umap[:, 1],
    'Title': 'User Query',
    'Chunk Text': 'User Query'
})

# Concatenate the user query DataFrame with the original results
all_results = pd.concat([all_results, user_query_df], ignore_index=True)

# Mark the user query so that it can be colored differently from the chunks
all_results['color'] = np.where(all_results['Title'] == 'User Query', 'User Query Color', 'Other Data Color')

In [38]:
# Visualize the UMAP embeddings with the user query using Plotly
fig = px.scatter(all_results, x='umap_x', y='umap_y', hover_data=['Chunk Text'],
                 color='color', color_discrete_map={'User Query Color': 'red', 'Other Data Color': 'blue'})
fig.update_traces(marker=dict(size=3))  # Adjust marker size as needed
fig.show()

In [39]:
para_chunk_embeddings = pd.read_csv(output_path  + 'para_chunk_embeddings.csv')

In [40]:
para_chunk_embeddings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1260 entries, 0 to 1259
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Title       1260 non-null   object
 1   Chunk Text  1260 non-null   object
 2   Embeddings  1260 non-null   object
dtypes: object(3)
memory usage: 29.7+ KB


In [41]:
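# The CSV stores each tensor as its printed form, "tensor([...])"; strip the leading "tensor(" and the trailing ")"
para_chunk_embeddings['Embeddings'] = para_chunk_embeddings['Embeddings'].apply(lambda x: x[7:][:-1])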

In [42]:
para_chunk_embeddings['Embeddings'][0]

'[-6.1396e-02,  2.9963e-02,  6.4252e-02,  6.8732e-03, -1.0564e-02,\n         6.7935e-02, -7.4636e-03, -2.4438e-03, -7.4920e-02,  6.7305e-03,\n        -2.5005e-03,  3.3360e-02, -6.8428e-03, -7.6574e-02, -9.0893e-02,\n        -7.7191e-03, -3.5130e-03,  3.5431e-02,  7.2207e-03,  6.9286e-03,\n         1.0508e-02,  5.7313e-02, -7.0764e-02, -5.3176e-02,  3.4646e-02,\n        -2.0038e-02,  4.1927e-02, -8.2218e-03,  1.1843e-02, -2.1650e-02,\n         2.4854e-03, -5.9339e-02, -7.2907e-03,  6.1877e-02,  8.8751e-02,\n         1.8346e-02,  1.6169e-03,  7.1061e-02,  2.0243e-02,  4.2763e-02,\n        -5.1957e-02, -1.3018e-02,  6.6724e-02,  8.0048e-03, -7.0898e-02,\n         2.0814e-02, -1.1877e-02, -3.8930e-02,  1.7741e-02, -3.4378e-02,\n        -1.9286e-02, -2.2275e-02, -3.0604e-03,  2.1922e-02,  3.1945e-02,\n         5.5488e-02,  3.4999e-02,  2.2936e-02, -8.6991e-02, -9.6582e-02,\n        -3.1036e-02,  3.9063e-02, -3.6698e-02, -8.5156e-02, -1.4042e-02,\n        -6.6821e-03,  4.3142e-03, -6.8921e-0

In [43]:
import ast
para_chunk_embeddings['Embeddings'] = para_chunk_embeddings['Embeddings'].apply(lambda x: ast.literal_eval(x))
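
The fixed slice `x[7:][:-1]` relies on the string being exactly `tensor(` followed by the list and a closing `)`. A slightly more tolerant variant, sketched below, takes everything between the first `[` and the last `]` before parsing. `parse_tensor_repr` is a hypothetical helper, and the second sample string assumes the printed form carries a trailing `device=...` argument, as is typically the case for tensors that live on a GPU.

```python
import ast


def parse_tensor_repr(text):
    """Extract the bracketed list from a printed tensor and parse it into floats."""
    start = text.index("[")
    end = text.rindex("]")
    return ast.literal_eval(text[start:end + 1])


# Sample strings written by hand for this example
cpu_repr = "tensor([-6.1396e-02,  2.9963e-02,\n         6.4252e-02])"
gpu_repr = "tensor([-6.1396e-02,  2.9963e-02], device='cuda:0')"

print(parse_tensor_repr(cpu_repr))  # [-0.061396, 0.029963, 0.064252]
print(parse_tensor_repr(gpu_repr))  # [-0.061396, 0.029963]
```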

In [44]:
para_chunk_embeddings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1260 entries, 0 to 1259
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Title       1260 non-null   object
 1   Chunk Text  1260 non-null   object
 2   Embeddings  1260 non-null   object
dtypes: object(3)
memory usage: 29.7+ KB


In [45]:
para_chunk_embeddings

Unnamed: 0,Title,Chunk Text,Embeddings
0,<Page:1>,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,"[-0.061396, 0.029963, 0.064252, 0.0068732, -0...."
1,<Page:1>,GROUP POLICY FOR: RHODE ISLAND JOHN DOE ALL ME...,"[-0.004747, 0.044749, 0.040475, 0.01712, 0.068..."
2,<Page:1>,Print Date: 07/16/2014,"[-0.046045, 0.10365, -0.018114, 0.093488, -0.0..."
3,<Page:2>,This page left blank intentionally,"[0.029119, 0.060574, 0.046415, 0.037793, 0.046..."
4,<Page:3>,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,"[-0.021588, 0.082771, 0.0064722, 0.011061, 0.0..."
...,...,...,...
1255,<Page:62>,filed. Article 8 - Time Limits Any time limits...,"[-0.040359, 0.073557, 0.062815, 0.0031853, 0.0..."
1256,<Page:62>,This policy has been updated effective January...,"[-0.031512, 0.07082, 0.09493, -0.039622, 0.044..."
1257,<Page:62>,"GC 6018 Section D - Claim Procedures, Page 2","[-0.19685, 0.063235, 0.066546, -0.03799, 0.009..."
1258,<Page:63>,This page left blank intentionally,"[0.029119, 0.060574, 0.046415, 0.037793, 0.046..."


In [46]:
para_chunk_embeddings['Embeddings'][0]

[-0.061396,
 0.029963,
 0.064252,
 0.0068732,
 -0.010564,
 0.067935,
 -0.0074636,
 -0.0024438,
 -0.07492,
 0.0067305,
 -0.0025005,
 0.03336,
 -0.0068428,
 -0.076574,
 -0.090893,
 -0.0077191,
 -0.003513,
 0.035431,
 0.0072207,
 0.0069286,
 0.010508,
 0.057313,
 -0.070764,
 -0.053176,
 0.034646,
 -0.020038,
 0.041927,
 -0.0082218,
 0.011843,
 -0.02165,
 0.0024854,
 -0.059339,
 -0.0072907,
 0.061877,
 0.088751,
 0.018346,
 0.0016169,
 0.071061,
 0.020243,
 0.042763,
 -0.051957,
 -0.013018,
 0.066724,
 0.0080048,
 -0.070898,
 0.020814,
 -0.011877,
 -0.03893,
 0.017741,
 -0.034378,
 -0.019286,
 -0.022275,
 -0.0030604,
 0.021922,
 0.031945,
 0.055488,
 0.034999,
 0.022936,
 -0.086991,
 -0.096582,
 -0.031036,
 0.039063,
 -0.036698,
 -0.085156,
 -0.014042,
 -0.0066821,
 0.0043142,
 -0.068921,
 0.075058,
 -0.091671,
 0.059208,
 0.047607,
 -0.063194,
 0.013008,
 -0.076709,
 0.024845,
 -0.0083532,
 0.10589,
 0.043595,
 0.05525,
 -0.0039114,
 -0.064159,
 0.0053383,
 0.080849,
 -0.042372,
 -0.01150

In [47]:
type(para_chunk_embeddings['Embeddings'][0])

list

### 2.2 <font color = "green"> ChromaDB </font>


In [48]:
# Pip install chromaDB

!pip install chromadb



In [49]:
# Import ChromaDB and get the Chroma client

import chromadb
chroma_client = chromadb.PersistentClient()

In [50]:
# Create a collection to store the embeddings. Collections in Chroma are where you can store your embeddings, documents, and any additional metadata.

collection = chroma_client.get_or_create_collection(name="Semantic_Search_with_Chromadb")

In [51]:
# Add the documents, embeddings, and ids into the collection

collection.add(
    embeddings = para_chunk_embeddings['Embeddings'].to_list(),
    documents = para_chunk_embeddings['Chunk Text'].to_list(),
    ids = [str(i) for i in range(0, len(para_chunk_embeddings['Embeddings']))],
)

In [52]:
# Peek at the initial elements of the collection

collection.peek()

{'ids': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
 'embeddings': array([[-0.061396 ,  0.029963 ,  0.064252 , ..., -0.0014633, -0.06248  ,
         -0.011137 ],
        [-0.004747 ,  0.044749 ,  0.040475 , ..., -0.073135 , -0.01493  ,
         -0.013911 ],
        [-0.046045 ,  0.10365  , -0.018114 , ...,  0.019926 ,  0.038    ,
         -0.015041 ],
        ...,
        [-0.055894 , -0.026484 ,  0.022527 , ..., -0.0028857,  0.02477  ,
          0.0090112],
        [-0.089855 ,  0.016125 , -0.038506 , ..., -0.049839 , -0.0018254,
          0.019995 ],
        [-0.11554  ,  0.04064  , -0.07478  , ..., -0.050221 ,  0.077053 ,
          0.023387 ]]),
 'documents': ['DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/01/2014 711 HIGH STREET GEORGE RI 02903',
  'GROUP POLICY FOR: RHODE ISLAND JOHN DOE ALL MEMBERS Group Member Life Insurance',
  'Print Date: 07/16/2014',
  'This page left blank intentionally',
  'POLICY RIDER GROUP INSURANCE POLICY NO: S655 COVERAGE: Life EMPLOYER: RHODE

In [53]:
# Retrieve items from the collection

collection.get(
    ids = ['0','1']
)

{'ids': ['0', '1'],
 'embeddings': None,
 'documents': ['DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/01/2014 711 HIGH STREET GEORGE RI 02903',
  'GROUP POLICY FOR: RHODE ISLAND JOHN DOE ALL MEMBERS Group Member Life Insurance'],
 'uris': None,
 'data': None,
 'metadatas': [None, None],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

### 2.3 <font color = "green"> Querying on ChromaDB </font>


In [54]:
# Read the query from the user
query = input()

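# Query the ChromaDB collection using `collection.query()`.
# `query_texts` is embedded with the collection's default embedding function (all-MiniLM-L6-v2), the same model used for the chunks.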
results = collection.query(
    query_texts=query,
    n_results=3,
    include = ['documents', 'distances']
)

#What is the Group Policy?

What is the Group Policy?




In [55]:
results

{'ids': [['357', '636', '138']],
 'embeddings': None,
 'documents': [['this Group Policy.',
   'Group Policy; or',
   'The policy of group insurance issued to the Policyholder by The Principal, which describes']],
 'uris': None,
 'data': None,
 'metadatas': None,
 'distances': [[0.27872034907341003, 0.3511675000190735, 0.5212141871452332]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>]}

In [56]:
# Format the results to return in a more structured and readable form

data = {'Document': results['documents'][0], 'Distance': results['distances'][0]}
results_df = pd.DataFrame.from_dict(data)
results_df

Unnamed: 0,Document,Distance
0,this Group Policy.,0.27872
1,Group Policy; or,0.351168
2,The policy of group insurance issued to the Po...,0.521214
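
The `Distance` column is a distance, not a similarity: smaller values mean a closer match. Assuming the collection uses Chroma's default squared L2 distance and that both embeddings have unit length (assumptions, since this notebook does not configure the distance explicitly), a distance d corresponds to a cosine similarity of 1 - d/2. The sketch below checks that identity on toy vectors; `distance_to_cosine` is a hypothetical helper name.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def distance_to_cosine(distance):
    """Valid only for squared L2 distances between unit-length vectors."""
    return 1.0 - distance / 2.0


a = normalize([3.0, 4.0])  # [0.6, 0.8]
b = normalize([4.0, 3.0])  # [0.8, 0.6]
distance = squared_l2(a, b)
cosine = dot(a, b)
print(math.isclose(distance_to_cosine(distance), cosine))  # True
```

Under those assumptions, the top distance of about 0.279 in the table would correspond to a cosine similarity of about 0.86.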


### 2.4 <font color = "green"> Updating Collections </font>


In [57]:
# Converting all text chunks to lowercase

para_chunk_embeddings['Chunk Text'] = para_chunk_embeddings['Chunk Text'].apply(lambda x: x.lower())

In [58]:
# Adding a metadata column containing the page that each chunk came from

para_chunk_embeddings['Metadata'] = para_chunk_embeddings.apply(lambda x: {'Title': x['Title']}, axis=1)

In [59]:
para_chunk_embeddings

Unnamed: 0,Title,Chunk Text,Embeddings,Metadata
0,<Page:1>,dorothea glause s655 rhode island john doe 01/...,"[-0.061396, 0.029963, 0.064252, 0.0068732, -0....",{'Title': '<Page:1>'}
1,<Page:1>,group policy for: rhode island john doe all me...,"[-0.004747, 0.044749, 0.040475, 0.01712, 0.068...",{'Title': '<Page:1>'}
2,<Page:1>,print date: 07/16/2014,"[-0.046045, 0.10365, -0.018114, 0.093488, -0.0...",{'Title': '<Page:1>'}
3,<Page:2>,this page left blank intentionally,"[0.029119, 0.060574, 0.046415, 0.037793, 0.046...",{'Title': '<Page:2>'}
4,<Page:3>,policy rider group insurance policy no: s655 c...,"[-0.021588, 0.082771, 0.0064722, 0.011061, 0.0...",{'Title': '<Page:3>'}
...,...,...,...,...
1255,<Page:62>,filed. article 8 - time limits any time limits...,"[-0.040359, 0.073557, 0.062815, 0.0031853, 0.0...",{'Title': '<Page:62>'}
1256,<Page:62>,this policy has been updated effective january...,"[-0.031512, 0.07082, 0.09493, -0.039622, 0.044...",{'Title': '<Page:62>'}
1257,<Page:62>,"gc 6018 section d - claim procedures, page 2","[-0.19685, 0.063235, 0.066546, -0.03799, 0.009...",{'Title': '<Page:62>'}
1258,<Page:63>,this page left blank intentionally,"[0.029119, 0.060574, 0.046415, 0.037793, 0.046...",{'Title': '<Page:63>'}


In [60]:
# Convert the metadata column to a list to feed it to ChromaDB

metadata_list = para_chunk_embeddings['Metadata'].tolist()

In [61]:
# Update the collection using `collection.upsert()`
collection.upsert(
    embeddings = para_chunk_embeddings['Embeddings'].to_list(),
    documents = para_chunk_embeddings['Chunk Text'].to_list(),
    metadatas = metadata_list,
    ids = [str(i) for i in range(0, len(para_chunk_embeddings['Embeddings']))],
)

In [62]:
collection.peek()

{'ids': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
 'embeddings': array([[-0.061396 ,  0.029963 ,  0.064252 , ..., -0.0014633, -0.06248  ,
         -0.011137 ],
        [-0.004747 ,  0.044749 ,  0.040475 , ..., -0.073135 , -0.01493  ,
         -0.013911 ],
        [-0.046045 ,  0.10365  , -0.018114 , ...,  0.019926 ,  0.038    ,
         -0.015041 ],
        ...,
        [-0.055894 , -0.026484 ,  0.022527 , ..., -0.0028857,  0.02477  ,
          0.0090112],
        [-0.089855 ,  0.016125 , -0.038506 , ..., -0.049839 , -0.0018254,
          0.019995 ],
        [-0.11554  ,  0.04064  , -0.07478  , ..., -0.050221 ,  0.077053 ,
          0.023387 ]]),
 'documents': ['dorothea glause s655 rhode island john doe 01/01/2014 711 high street george ri 02903',
  'group policy for: rhode island john doe all members group member life insurance',
  'print date: 07/16/2014',
  'this page left blank intentionally',
  'policy rider group insurance policy no: s655 coverage: life employer: rhode


In [64]:
collection.get(
    ids = ['0','1']
)

{'ids': ['0', '1'],
 'embeddings': None,
 'documents': ['dorothea glause s655 rhode island john doe 01/01/2014 711 high street george ri 02903',
  'group policy for: rhode island john doe all members group member life insurance'],
 'uris': None,
 'data': None,
 'metadatas': [{'Title': '<Page:1>'}, {'Title': '<Page:1>'}],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

## 3. <font color = "red"> The Generation Layer  </font>

In [65]:
# Query on the collection again, this time with filters on the chunk texts and the metadata

query = input()

results = collection.query(
    query_texts=query,
    n_results=3,
    where = {'Title': '<Page:30>'},
    where_document = {"$contains": "effective date"},
)

#Effective Date for Benefit Changes Due to Change in Insurance Class

Effective Date for Benefit Changes Due to Change in Insurance Class


In [66]:
results

{'ids': [['497', '513', '520']],
 'embeddings': None,
 'documents': [['f. effective date for benefit changes due to change in insurance class',
   'g. effective date for benefit changes due to change by policy amendment',
   'gc 6007 section b - effective dates, page 3']],
 'uris': None,
 'data': None,
 'metadatas': [[{'Title': '<Page:30>'},
   {'Title': '<Page:30>'},
   {'Title': '<Page:30>'}]],
 'distances': [[0.08876600861549377, 0.448520302772522, 0.9981523752212524]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}
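
One way to think about the two filters is as a pre-selection of candidate chunks before they are ranked by distance: `where` keeps chunks whose metadata matches, and `where_document` with `$contains` keeps chunks whose text includes the given substring. The plain-Python imitation below is only a mental model, not Chroma's implementation, and it assumes the substring match is case-sensitive, which is why lowercased chunks are filtered with a lowercase string. The third record is invented to show that effect.

```python
records = [
    {"id": "497",
     "document": "f. effective date for benefit changes due to change in insurance class",
     "metadata": {"Title": "<Page:30>"}},
    {"id": "1251",
     "document": "pay for any such autopsy. article 7 - legal action",
     "metadata": {"Title": "<Page:62>"}},
    # Hypothetical mixed-case record, not from the policy document
    {"id": "demo",
     "document": "Effective Date provisions",
     "metadata": {"Title": "<Page:30>"}},
]


def passes_filters(record, where, contains):
    """True if every metadata key matches and the document includes the substring."""
    metadata_ok = all(record["metadata"].get(k) == v for k, v in where.items())
    return metadata_ok and contains in record["document"]


kept = [r["id"] for r in records
        if passes_filters(r, {"Title": "<Page:30>"}, "effective date")]
print(kept)  # ['497']
```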

In [67]:
# Query on the collection again, this time with filters on the chunk texts and the metadata

query = input()

results = collection.query(
    query_texts=query,
    n_results=3,
    where = {'Title': '<Page:62>'},
    where_document = {"$contains": "autopsy"},
)
#When is an autopsy required

When is an autopsy required


In [68]:
results

{'ids': [['1251', '1249', '1250']],
 'embeddings': None,
 'documents': [['pay for any such autopsy. article 7 - legal action',
   'examinations and will choose the physician to perform them. article 6 - autopsy',
   'if payment for loss of life is claimed, the principal may require an autopsy. the principal will']],
 'uris': None,
 'data': None,
 'metadatas': [[{'Title': '<Page:62>'},
   {'Title': '<Page:62>'},
   {'Title': '<Page:62>'}]],
 'distances': [[0.6738941376748242, 0.7397393386576389, 0.8016852329311567]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [76]:
# Query on the collection again, this time with filters on the chunk texts and the metadata

query = input()

results = collection.query(
    query_texts=query,
    n_results=3,
    where = {'Title': '<Page:14>'},
    where_document = {"$contains": "symbol"},
)
#What is the meaning of Signed or Signature

What is the meaning of Signed or Signature?


In [77]:
results

{'ids': [['186']],
 'embeddings': None,
 'documents': [['any symbol or method executed or adopted by a person with the present intention to authenticate']],
 'uris': None,
 'data': None,
 'metadatas': [[{'Title': '<Page:14>'}]],
 'distances': [[1.126270055770874]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}
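
The retrieved chunks still have to be turned into a prompt for a language model. The following is only a sketch of that step, not the notebook's implementation: `build_prompt` is a hypothetical helper, the instruction wording is an assumption, and the `results` dictionary reuses the documents and metadata returned for the autopsy query above. No model call is made.

```python
def build_prompt(question, results):
    """Assemble a grounded prompt from a Chroma-style query result."""
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    context_lines = []
    for i, (doc, meta) in enumerate(zip(documents, metadatas), start=1):
        # Number each excerpt and keep its page so the answer can cite it
        context_lines.append(f"[{i}] ({meta['Title']}) {doc}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the policy excerpts below. "
        "If the excerpts do not contain the answer, say so.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


results = {
    "documents": [[
        "pay for any such autopsy. article 7 - legal action",
        "examinations and will choose the physician to perform them. article 6 - autopsy",
        "if payment for loss of life is claimed, the principal may require an autopsy. the principal will",
    ]],
    "metadatas": [[{"Title": "<Page:62>"}, {"Title": "<Page:62>"}, {"Title": "<Page:62>"}]],
}

prompt = build_prompt("When is an autopsy required", results)
print(prompt)
```

Numbering the excerpts and including the page label is one possible design choice; it lets the generated answer point back to the chunk it relied on.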