# Exploratory Data Analysis

Exploratory Data Analysis (EDA) for textual data from PDFs is a multi-step process. This notebook walks through setup, data extraction, document loading, and document splitting.

## Setup & Initialization:

Let's import the dependencies and initialize some variables.

In [56]:
%pip install -r requirements.txt > /dev/null

Note: you may need to restart the kernel to use updated packages.


In [57]:
%load_ext autoreload
%autoreload 2

# import src modules
from src import config
from src import ETL

# import langchain related modules
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    MarkdownTextSplitter
)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Data Extraction:

In this section, we will download the data from an S3 bucket and save it locally.

In [58]:
ETL.extract_from_s3(
    config.S3_BUCKET_NAME,
    config.S3_BUCKET_PREFIX,
    config.AWS_ACCESS_KEY_ID,
    config.AWS_SECRET_ACCESS_KEY,
    config.DATASET_ROOT_PATH,
)

Files from anyoneai-datasets/queplan_insurance/ already downloaded to /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset.


## Document Loading:
Now let's load the PDFs containing the insurance policies using ```PyPDFDirectoryLoader``` from LangChain.

In [59]:
loader = PyPDFDirectoryLoader(config.DATASET_ROOT_PATH)
documents = loader.load()

Each page is loaded as a ```Document```.

A ```Document``` contains text (```page_content```) and ```metadata```.

In [60]:
print(f"Number of pages: {len(documents)}")

Number of pages: 267


Let's take a look at the metadata of a page. It contains the following fields:

- ```source``` : The source of the document
- ```page``` : The page number of the document

This information can be used to trace a page back to its source file for debugging purposes.

In [61]:
print("The Page's File name is: ", documents[0].metadata["source"])
print("The Page number is: ", documents[0].metadata["page"])

The Page's File name is:  /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset/POL320130223.pdf
The Page number is:  0
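
Because every page carries its ```source```, the same metadata can also be aggregated to see how many pages each PDF contributes. The snippet below is a minimal sketch rather than part of the original pipeline: ```Page``` is a hypothetical stand-in for a LangChain ```Document```, ```pages_per_source``` is a helper name introduced here, and ```otra_poliza.pdf``` is a made-up file name used only as toy data.

```python
from collections import Counter
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Page:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(pages):
    """Count how many pages were loaded from each source file."""
    return Counter(Path(p.metadata["source"]).name for p in pages)


# Toy data; with the real loader output you would pass `documents` instead.
sample_pages = [
    Page("texto a", {"source": "dataset/POL320130223.pdf", "page": 0}),
    Page("texto b", {"source": "dataset/POL320130223.pdf", "page": 1}),
    Page("texto c", {"source": "dataset/otra_poliza.pdf", "page": 0}),
]

for name, count in pages_per_source(sample_pages).most_common():
    print(f"{name}: {count} pages")
```

The helper only relies on the ```metadata["source"]``` field, so it should work on any object that exposes that field.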


Now, let's take a glimpse at the first page of the first document.

In [62]:
print(documents[0].page_content[0:500], "...")

SEGURO COLECTIVO COMPLEMENTARIO DE SALUD 
Incorporada al Depósito de Pólizas bajo el código POL320130223
ARTICULO 1°: REGLAS APLICABLES AL CONTRATO
 
 
 
 
 
 
 
Se aplicarán al presente contrato de seguro las disposiciones contenidas en los artículos siguientes y las
normas legales de carácter imperativo establecidas en el Título VIII, del Libro II, del Código de Comercio. Sin
embargo, se entenderán válidas las estipulaciones contractuales que sean más beneficiosas para el
asegurado o beneficia ...


We explore the first page of the first document to get a sense of the document's structure. This analysis will help us in the next step, the Document Splitting phase, where we will split the documents into chunks to feed our vector store. To ensure that each chunk holds coherent, self-contained information, we need to define the appropriate chunk size.

Given the content of our dataset, the logical division is by article and major subheading, since each appears to cover a single topic or concept.
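
As an illustration of what splitting by article could look like, here is a minimal sketch using a regular expression. The heading pattern is an assumption based only on the single heading visible in the preview above (```ARTICULO 1°: ...```); other policies may format their headings differently. The second article in the sample text is invented for the example, and ```split_by_article``` is a helper name introduced here.

```python
import re

# Assumed heading pattern, inferred from the one heading seen in the preview
# ("ARTICULO 1°: ..."). Real policies may need a broader pattern.
ARTICLE_HEADING = re.compile(r"^ART[IÍ]CULO\s+\d+\s*[°º]?\s*:", re.MULTILINE)


def split_by_article(text):
    """Return (preamble, articles), cutting the text at each article heading."""
    starts = [m.start() for m in ARTICLE_HEADING.finditer(text)]
    if not starts:
        return text, []
    bounds = starts + [len(text)]
    articles = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return text[: starts[0]].strip(), articles


# Toy text: the first heading mirrors the preview, the second is made up.
sample = (
    "SEGURO COLECTIVO COMPLEMENTARIO DE SALUD\n"
    "ARTICULO 1°: REGLAS APLICABLES AL CONTRATO\n"
    "Se aplicarán al presente contrato...\n"
    "ARTICULO 2°: COBERTURA\n"
    "Texto de ejemplo.\n"
)

preamble, articles = split_by_article(sample)
print(preamble)
for article in articles:
    print("---")
    print(article)
```

Since the loader returns one ```Document``` per page, an article may span several pages; the page texts of a file would need to be concatenated before applying a split like this.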

# Document Splitting

Now, let's split the document into chunks of text for further processing.

### Chunking Considerations

Several variables play a role in determining the best chunking strategy, and these variables vary depending on the use case. Here are some key aspects to keep in mind:

1. **What is the nature of the content being indexed?** 

    The dataset consists of insurance policies. As noted above, the logical division is by article and major subheading, since each appears to cover a single topic or concept.

2. **Which embedding model will be used, and what chunk sizes does it perform optimally on?**

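    We will be using OpenAI GPT-3.5 Turbo to generate the answers. Note that GPT-3.5 Turbo is a chat model, not an embedding model, so the embedding model is a separate choice. Here we target chunks of about 512 tokens, a figure that should be validated against the embedding model actually used.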

3. **What are your expectations for the length and complexity of user queries?** 

    Since this solution will act as a chatbot, we can expect the queries to be mostly short and simple. However, we should also consider the possibility of more complex queries, such as "¿Cuál es la diferencia entre el seguro de vida y el seguro de salud?" (What is the difference between life insurance and health insurance?)

4. **How will the retrieved results be utilized within your specific application?**

    The retrieved results will be used to answer user queries. The user will be able to ask questions about the content of the documents, and the chatbot will respond with the most relevant information.
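
Considerations 2 and 4 interact: the retrieved chunks are pasted into the chatbot prompt, so the chunk size limits how many of them fit next to the question and the answer. The arithmetic below is a rough sketch. Apart from the 512-token chunk size mentioned above, every number is an illustrative assumption (context window, query length, answer budget, template overhead), and ```max_chunks_in_prompt``` is a helper name introduced here.

```python
def max_chunks_in_prompt(
    context_window, chunk_tokens, query_tokens, answer_tokens, template_tokens=0
):
    """Return how many whole chunks fit in the prompt after reserving space
    for the user query, the model's answer, and the prompt template."""
    available = context_window - query_tokens - answer_tokens - template_tokens
    return max(available // chunk_tokens, 0)


# Illustrative numbers only; check the limits of the model you actually use.
n_chunks = max_chunks_in_prompt(
    context_window=4096,  # assumed context window
    chunk_tokens=512,     # chunk size targeted in this notebook
    query_tokens=64,      # assumed length of a short user question
    answer_tokens=512,    # assumed budget reserved for the answer
    template_tokens=200,  # assumed prompt template overhead
)
print(n_chunks)
```

Under these assumed numbers, six chunks of 512 tokens fit in the prompt; smaller chunks would allow more, but each would carry less context.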

### Chunking methods

There are different methods for chunking, and each of them might be appropriate for different situations. By examining the strengths and weaknesses of each method, our goal is to identify the right scenario to apply them to.

#### Fixed-size chunking

This is the most common and straightforward approach to chunking: we simply decide the size of each chunk and, optionally, whether consecutive chunks should overlap. Size is usually measured in tokens, although ```CharacterTextSplitter``` below counts characters by default.
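
To make the idea concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It measures size in characters and cuts at fixed offsets. It is not how LangChain's ```CharacterTextSplitter``` works internally (that class splits on a separator first), and ```fixed_size_chunks``` is a helper name introduced here.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters, where consecutive
    windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap. The downside is visible even in this toy case: the cut points ignore word and sentence boundaries.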

In [91]:
c_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

c_splitter_result = c_splitter.split_documents(documents)

print("Chunk Size count: ", len(c_splitter_result))
print(c_splitter_result[10].page_content)


Chunk Size count:  267
Bajo este beneficio se podrán contratar las siguientes categorías de medicamentos ambulatorios, las cuales
deberán estar expresamente indicadas en las Condiciones Particulares de la póliza. Las categorías de
medicamentos podrán ser las siguientes:
 
 
 
 
i. Medicamentos Ambulatorios Genéricos: Se entienden incluidos en esta categoría los medicamentos que
se comercializan bajo la denominación del principio activo que incorpora, siendo igual en composición y
forma farmacéutica a la marca original, pero sin marca comercial, figurando en su lugar el nombre de su
principio activo;
 
 
 
 
ii. Medicamentos Ambulatorios No Genéricos: Se entienden incluidos en esta categoría los medicamentos no
comprendidos en la categoría anterior, que se comercializan bajo un nombre comercial específico sujeto a la
protección comercial que otorgan las agencias internacionales de patentes y que han sido registrados por un
laboratorio farmacéutico, los que pueden corresponder a la fórmu

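As you can see, the chunk count is the same as the page count, which means that each page ended up as a single chunk. ```CharacterTextSplitter``` splits on a single separator (a blank line, ```\n\n```, by default), and the extracted pages do not seem to contain it, so no page is subdivided regardless of ```chunk_size```. This is not ideal because it doesn't take into account the structure of the document.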
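
One plausible explanation for the one-chunk-per-page result is the separator. In the printed output, paragraphs appear to be separated by lines containing a single space rather than by truly empty lines, so a split on ```\n\n``` finds nothing to split on. The sketch below reproduces this with toy text (an assumption about the extracted text, not a verified fact about every page) and shows one possible whitespace normalization.

```python
import re

# Toy page text mimicking the output above: paragraphs are separated by
# lines that contain a single space, not by empty lines.
page_text = "Primer párrafo.\n \n \nSegundo párrafo.\n \n \nTercer párrafo."

# Splitting on a blank-line separator leaves the page in one piece.
pieces_before = page_text.split("\n\n")
print(len(pieces_before))  # 1

# Collapse runs of whitespace-only lines into a single blank line.
normalized = re.sub(r"\n(?:[ \t]*\n)+", "\n\n", page_text)
pieces_after = normalized.split("\n\n")
print(pieces_after)
```

After normalization the paragraph boundaries become visible to a ```\n\n``` split. Passing a different ```separator``` to ```CharacterTextSplitter``` would be another option to try.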

## Context-aware splitting

Context-aware chunking aims to keep text with common context together.

#### Recursive Chunking

Recursive chunking divides the input text into smaller chunks in a hierarchical and iterative manner using an ordered set of separators. If splitting on the first separator doesn’t produce chunks of the desired size, the method recursively splits the oversized chunks with the next separator until the desired chunk size is reached. This means that while the chunks aren’t going to be exactly the same size, they’ll still “aspire” to be of a similar size.
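
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's actual implementation: it ignores overlap, measures size in characters, and ```recursive_split``` is a helper name introduced here. The separator order (paragraph, line, word, character) is an assumption chosen for the example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters, trying
    coarse separators first and finer ones only for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "uno dos tres\n\ncuatro cinco seis siete\n\nocho"
result = recursive_split(text, chunk_size=12)
print(result)  # ['uno dos tres', 'cuatro cinco', 'seis siete', 'ocho']
```

The first paragraph fits as is, while the second one is too long and is broken on spaces instead. That is why the resulting chunks have similar but not identical sizes.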

In [98]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, length_function=len
)

r_splitter_result = r_splitter.split_documents(documents)

print("Chunk Size count: ", len(r_splitter_result))

print(r_splitter_result[11].page_content)
print("-------------------")
print(r_splitter_result[12].page_content)


Chunk Size count:  735
complemento de lo que cubra el sistema de salud previsional o de bienestar u otro seguro o convenio, de
acuerdo a los porcentajes y límites de reembolso o pago definidos para este beneficio en el Cuadro de
Beneficios de las Condiciones Particulares de la póliza.
 
 
 
 
Bajo este beneficio se podrán contratar las siguientes prestaciones, las cuales deberán estar expresamente
indicadas en las Condiciones Particulares de la póliza. Las prestaciones podrán ser las siguientes:
 
 
 
 
i) Consultas médicas generales;
 
 
ii) Consultas médicas especialistas;
 
 
iii) Consultas médicas domiciliarias;
 
 
iv) Exámenes de laboratorio;
 
 
v) Exámenes de imagenología, ultrasonografía y medicina nuclear;
-------------------
vi) Procedimientos de diagnóstico no quirúrgicos;
 
 
vii) Procedimientos terapéuticos no quirúrgicos;
 
 
viii) Cirugía ambulatoria: Bajo esta prestación se re-em-bolsarán los gastos médicos en que se incurra sólo
por concepto de las prestaciones descri