# Exploratory Data Analysis - Insurance Policy Generation Chatbot

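Exploratory Data Analysis (EDA) for textual data from PDFs is a multi-step process. In this notebook it involves extracting the data, loading the documents, preprocessing the text, splitting it into chunks, and storing the chunks as embeddings.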


## Setup & Initialization:

Let's import the dependencies and initialize some variables.


In [155]:
%pip install -r requirements.txt > /dev/null

Note: you may need to restart the kernel to use updated packages.


In [156]:
%load_ext autoreload
%autoreload 2

# import src modules (config must be imported before it is used below)
from src import config
from src import ETL

# import openai api and set api key
import openai

openai.api_key = config.OPENAI_API_KEY

# import langchain related modules
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Data Extraction:

In this section, we will download the data from an S3 bucket and save it locally.


In [157]:
ETL.extract_from_s3(
    config.S3_BUCKET_NAME,
    config.S3_BUCKET_PREFIX,
    config.AWS_ACCESS_KEY_ID,
    config.AWS_SECRET_ACCESS_KEY,
    config.DATASET_ROOT_PATH,
)

Files from anyoneai-datasets/queplan_insurance/ already downloaded to /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset.


## Document Loading:

Now let's load the PDFs containing Insurance Policies using PyPDFDirectoryLoader from LangChain.


In [158]:
loader = PyPDFDirectoryLoader(config.DATASET_ROOT_PATH)

documents = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.


In [159]:
print(f"Number of pages: {len(documents)}")

Number of pages: 267


Let's take a look at its metadata. We can see that it contains the following fields:

- `source` : The path of the file the page came from
- `page` : The page number within that file (zero-based, as the output below shows)

This information can be used to trace a page back to its source for debugging purposes.


In [160]:
print("The Page's File name is: ", documents[0].metadata["source"])

print("The Page number is: ", documents[0].metadata["page"])

The Page's File name is:  /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset/POL320130223.pdf
The Page number is:  0
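
As a small illustration of how this metadata can support EDA, the sketch below counts pages per source file. It uses plain dictionaries as stand-ins for `Document.metadata`; the helper name `pages_per_file` and the sample paths are hypothetical, not part of the project.

```python
from collections import Counter
from pathlib import Path


def pages_per_file(metadatas):
    """Count how many pages each source file contributes."""
    return Counter(Path(meta["source"]).name for meta in metadatas)


# Stand-in metadata records shaped like the fields shown above (hypothetical paths).
sample_metadata = [
    {"source": "dataset/a.pdf", "page": 0},
    {"source": "dataset/a.pdf", "page": 1},
    {"source": "dataset/b.pdf", "page": 0},
]

counts = pages_per_file(sample_metadata)
print(counts.most_common())
```

In the notebook, the same helper could presumably be called as `pages_per_file(doc.metadata for doc in documents)`.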


Now, let's take a glimpse at the first page of the first document.


In [161]:
print(documents[0].page_content[0:500], "...")

SEGURO COLECTIVO COMPLEMENTARIO DE SALUD 
Incorporada al Depósito de Pólizas bajo el código POL320130223
ARTICULO 1°: REGLAS APLICABLES AL CONTRATO
 
 
 
 
 
 
 
Se aplicarán al presente contrato de seguro las disposiciones contenidas en los artículos siguientes y las
normas legales de carácter imperativo establecidas en el Título VIII, del Libro II, del Código de Comercio. Sin
embargo, se entenderán válidas las estipulaciones contractuales que sean más beneficiosas para el
asegurado o beneficia ...


We explore the first page of the first document to get a sense of the document's structure. This analysis will help us in the next step, the Document Splitting phase, where we will split the documents into chunks to feed our vector store. To ensure that each chunk holds coherent, self-contained information, we need to define an appropriate chunk size.

Given the content of our dataset, the logical division is by articles and major subheadings, since each seems to cover a single topic or concept.
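
To make the idea of splitting by articles concrete, here is a minimal sketch based only on the two heading styles visible in the excerpt above (`ARTICULO 1°:` and `ARTÍCULO Nº 2:`). The regular expression and the helper `split_by_article` are assumptions for illustration; other policies in the dataset may format headings differently.

```python
import re

# Matches headings such as "ARTICULO 1°:" and "ARTÍCULO Nº 2:" at the start of a line.
ARTICLE_HEADING = re.compile(
    r"^ART[IÍ]CULO\s+(?:N[º°]\s*)?\d+\s*[º°]?\s*:", re.MULTILINE
)


def split_by_article(text):
    """Split text into pieces that each begin at an article heading."""
    starts = [match.start() for match in ARTICLE_HEADING.finditer(text)]
    if not starts:
        return [text]
    pieces = []
    # Keep any text that precedes the first heading (for example, the policy title).
    preamble = text[: starts[0]].strip()
    if preamble:
        pieces.append(preamble)
    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        pieces.append(text[begin:end].strip())
    return pieces


sample = (
    "SEGURO COLECTIVO COMPLEMENTARIO DE SALUD\n"
    "ARTICULO 1°: REGLAS APLICABLES AL CONTRATO\n"
    "Se aplicarán al presente contrato de seguro las disposiciones contenidas\n"
    "ARTÍCULO Nº 2: COBERTURA\n"
    "La compañía de seguros bajo las condiciones y términos"
)

pieces = split_by_article(sample)
for piece in pieces:
    print(repr(piece.splitlines()[0]))
```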


## Preprocessing:

However, the extracted text contains a lot of stray whitespace and empty lines, as the excerpt above shows. This might interfere with the chunking process and produce poor results. Therefore, we will preprocess the data to remove the empty lines and extra whitespace.


In [None]:
# Clean extra whitespace and empty lines from every page
ETL.preprocess(documents)
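
The implementation of `ETL.preprocess` is not shown in this notebook. As a rough idea of what such a cleaning step might do, the sketch below collapses repeated spaces and drops whitespace-only lines. The function `clean_page_text` is hypothetical and may differ from the project's actual code.

```python
import re


def clean_page_text(text):
    """Collapse runs of spaces/tabs and drop whitespace-only lines."""
    cleaned_lines = []
    for line in text.splitlines():
        # Squeeze repeated spaces or tabs, then trim both ends of the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


raw = (
    "ARTICULO 1°: REGLAS APLICABLES AL CONTRATO\n"
    " \n \n \n"
    "Se aplicarán al presente   contrato de seguro"
)
print(clean_page_text(raw))
```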

In [168]:
print(documents[3].page_content)

 
E) BENEFICIO DE SALUD MENTAL
F) BENEFICIOS ESPECIALES
Para los efectos de tener derecho a los beneficios que emanan de estas coberturas, éstas deberán estar
expresamente señaladas en las Condiciones Particulares de la póliza, en las cuales se establecerán además
los porcentajes y límites de reembolso o pago correspondientes a cada una de ellas.
A) BENEFICIO DE HOSPITALIZACIÓN:
Bajo este beneficio se cubren los gastos médicos incurridos en complemento de lo que cubra el sistema de
salud previsional o de bienestar u otro seguro o convenio, de acuerdo a los porcentajes y límites de
reembolso o pago definidos para este beneficio en el Cuadro de Beneficios de las Condiciones Particulares
de la póliza.



## Document Splitting

Now, let's split the document into chunks of text for further processing.


### Chunking Considerations

Several variables play a role in determining the best chunking strategy, and these variables vary depending on the use case. Here are some key aspects to keep in mind:

1. **What is the nature of the content being indexed?**

   We're working with insurance policies, so the logical division is by articles and major subheadings, since each seems to cover a single topic or concept.

2. **Which embedding model will be used, and what chunk sizes does it perform optimally on?**

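   We will be using OpenAI's `text-embedding-ada-002`, the default model for LangChain's `OpenAIEmbeddings` (GPT-3.5 Turbo is a chat model, not an embedding model). It is commonly reported to perform well on chunks of around 512 tokens.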

3. **What are your expectations for the length and complexity of user queries?**

   Since this solution will act as a chatbot, we can expect the queries to be mostly short and simple. However, we should also consider the possibility of more complex queries, such as "¿Cuál es la diferencia entre el seguro de vida y el seguro de salud?" (What is the difference between life insurance and health insurance?).

4. **How will the retrieved results be utilized within your specific application?**

   The retrieved results will be used to answer user queries. The user will be able to ask questions about the content of the documents, and the chatbot will respond with the most relevant information.


### Chunking methods

There are different methods for chunking, and each of them might be appropriate for different situations. By examining the strengths and weaknesses of each method, our goal is to identify the right scenario to apply them to.


#### Fixed-size chunking

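This is the most common and straightforward approach to chunking: we simply decide the size of each chunk and, optionally, whether consecutive chunks should overlap. Note that LangChain's `CharacterTextSplitter` measures `chunk_size` in characters by default (not tokens) and only splits on its separator, which defaults to `"\n\n"`.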


In [163]:
c_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

c_splitter_result = c_splitter.split_documents(documents)

print("Chunk Size count: ", len(c_splitter_result))

print(c_splitter_result[0].page_content, "...")

Chunk Size count:  267
SEGURO COLECTIVO COMPLEMENTARIO DE SALUD 
Incorporada al Depósito de Pólizas bajo el código POL320130223
ARTICULO 1°: REGLAS APLICABLES AL CONTRATO
Se aplicarán al presente contrato de seguro las disposiciones contenidas en los artículos siguientes y las
normas legales de carácter imperativo establecidas en el Título VIII, del Libro II, del Código de Comercio. Sin
embargo, se entenderán válidas las estipulaciones contractuales que sean más beneficiosas para el
asegurado o beneficiario.
ARTÍCULO Nº 2: COBERTURA
La compañía de seguros bajo las condiciones y términos que más adelante se establecen, conviene en
reembolsar o pagar al beneficiario, los gastos médicos razonables y acostumbrados en que haya incurrido
efectivamente un asegurado, en complemento de lo que cubra el sistema de salud previsional o de bienestar
u otro seguro o convenio, a consecuencia de una incapacidad cubierta. ...
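
Note that the chunk count above (267) equals the number of pages, which suggests that no page was actually split, likely because `CharacterTextSplitter` only cuts at its separator. To show what strictly fixed-size chunking with overlap looks like, here is a minimal stdlib sketch. The helper `fixed_size_chunks` is hypothetical and is not how LangChain implements its splitter.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Unlike a separator-based splitter, this approach may cut words and sentences in half, which is one reason to prefer structure-aware splitting for legal text.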


#### Recursive Chunking

Recursive chunking divides the input text into smaller chunks in a hierarchical and iterative manner using a set of separators. If the initial attempt at splitting the text doesn’t produce chunks of the desired size or structure, the method recursively calls itself on the resulting chunks with a different separator or criterion until the desired chunk size or structure is achieved. This means that while the chunks aren’t going to be exactly the same size, they’ll still “aspire” to be of a similar size.
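
The recursion described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: it ignores overlap, uses a hypothetical separator list, and is not LangChain's actual implementation of `RecursiveCharacterTextSplitter`.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Coarser separators are tried first; a piece that is still too long is
    split again with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Merge neighbouring parts while they still fit in one chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


demo_chunks = recursive_split("aaaa bbbb\ncccc dddd\n\neeee", chunk_size=9)
print(demo_chunks)
```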


In [164]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
    strip_whitespace=True,
)

r_splitter_result = r_splitter.split_documents(documents)

print("Chunk Size count: ", len(r_splitter_result))

print(r_splitter_result[0].page_content, "...")

Chunk Size count:  734
SEGURO COLECTIVO COMPLEMENTARIO DE SALUD 
Incorporada al Depósito de Pólizas bajo el código POL320130223
ARTICULO 1°: REGLAS APLICABLES AL CONTRATO
Se aplicarán al presente contrato de seguro las disposiciones contenidas en los artículos siguientes y las
normas legales de carácter imperativo establecidas en el Título VIII, del Libro II, del Código de Comercio. Sin
embargo, se entenderán válidas las estipulaciones contractuales que sean más beneficiosas para el
asegurado o beneficiario.
ARTÍCULO Nº 2: COBERTURA
La compañía de seguros bajo las condiciones y términos que más adelante se establecen, conviene en
reembolsar o pagar al beneficiario, los gastos médicos razonables y acostumbrados en que haya incurrido
efectivamente un asegurado, en complemento de lo que cubra el sistema de salud previsional o de bienestar
u otro seguro o convenio, a consecuencia de una incapacidad cubierta. ...
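
Chunk counts alone say little about chunk quality. A quick way to compare splitters is to summarize the distribution of chunk lengths. The sketch below does this for a list of strings; the helper `length_summary` is hypothetical.

```python
import statistics


def length_summary(chunks):
    """Return basic statistics about chunk lengths, measured in characters."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


summary = length_summary(["a", "bbb", "cc"])
print(summary)
```

In the notebook, this could presumably be applied to `[doc.page_content for doc in c_splitter_result]` and `[doc.page_content for doc in r_splitter_result]` to compare the two strategies.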


## Storage

Now that we have our chunks, we can feed them to our vector store. But first, we need to convert them into embeddings.

### Embedding

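We will use LangChain's `OpenAIEmbeddings` to generate embeddings for our chunks. By default it calls OpenAI's `text-embedding-ada-002` model (not GPT-3.5 Turbo, which is a chat model).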


In [165]:
embedding = OpenAIEmbeddings()

### Vector Store

We will use Chroma to store our embeddings. Chroma is a vector store that allows us to store and query embeddings.
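
To illustrate what "querying embeddings" means, the sketch below ranks stored vectors by cosine similarity to a query vector. The tiny two-dimensional vectors and the helper names are made up for illustration; real embeddings have many more dimensions, and cosine similarity is only one common choice, not necessarily the distance metric Chroma uses by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, store, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vector), key) for key, vector in store.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Toy "vector store": chunk label -> made-up embedding.
toy_store = {
    "hospitalization": [1.0, 0.0],
    "mental_health": [0.0, 1.0],
    "special_benefits": [0.7, 0.7],
}

nearest = top_k([0.9, 0.1], toy_store, k=2)
print(nearest)
```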


In [166]:
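vector_store = Chroma.from_documents(
    # Embed the chunks from the recursive splitter rather than the whole pages
    documents=r_splitter_result,
    embedding=embedding,
    persist_directory=config.CHROMA_PATH,
)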

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-NZFzrEuZsJ8hLl82wd85Ec4b on tokens per min. Limit: 150000 / min. Current: 0 / min. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-NZFzrEuZsJ8hLl82wd85Ec4b on tokens per min. Limit: 150000 / min. Current: 1 / min. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/

KeyboardInterrupt: 

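Let's check how many embeddings were stored. The LangChain `Chroma` wrapper does not appear to expose a `count()` method, so we query the underlying collection instead:


In [None]:
print(vector_store._collection.count())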
