# Exploratory Data Analysis

Exploratory Data Analysis (EDA) for textual data extracted from PDFs is a multi-step process. The sections below walk through each step in turn.

## Setup & Initialization:

Let's import the dependencies and initialize some variables.

In [1]:
import os
import sys
from io import BytesIO

import boto3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from dotenv import load_dotenv

# import src modules
from src import config
from src import extract 

# import langchain related modules
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

## Data Loading:

In this section, we download the data from an S3 bucket and save it locally.

In [2]:
extract.extract_from_s3(
    config.S3_BUCKET_NAME,
    config.S3_BUCKET_PREFIX,
    config.AWS_ACCESS_KEY_ID,
    config.AWS_SECRET_ACCESS_KEY,
    config.DATASET_ROOT_PATH,
)

Downloading files from s3://anyoneai-datasets/queplan_insurance/ to /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset ...
Downloaded queplan_insurance/POL120190177.pdf to /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset/POL120190177.pdf
Downloaded queplan_insurance/POL320130223.pdf to /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset/POL320130223.pdf
Downloaded queplan_insurance/POL320150503.pdf to /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset/POL320150503.pdf
Downloaded queplan_insurance/POL320180100.pdf to /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset/POL320180100.pdf
Downloaded queplan_insurance/POL320190074.pdf to /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset/POL320190074.pdf
Downloaded queplan_insurance/POL320200071.pdf to /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset/POL320200071.pdf
Downloaded queplan_insurance/POL320200214.pdf to /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset/POL320200214.pdf
Downloaded queplan_insurance/POL320210063.pdf to /home/gio/ANYONEA
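
The source of ```extract.extract_from_s3``` is not shown in this notebook. As a rough illustration of what such a helper might do, here is a minimal sketch that lists every object under a prefix and downloads it to a local folder. The function name ```download_prefix``` and its parameters are hypothetical, and the ```client``` argument is assumed to behave like ```boto3.client("s3")```.

```python
import os


def download_prefix(client, bucket, prefix, dest_dir):
    """Download every object under `prefix` in `bucket` into `dest_dir`.

    Hypothetical helper, not the project's actual implementation.
    `client` is assumed to behave like boto3.client("s3").
    Returns the list of local paths that were written.
    """
    os.makedirs(dest_dir, exist_ok=True)
    downloaded = []

    # list_objects_v2 returns at most 1000 keys per call, so paginate.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching keys have no "Contents" entry.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                # Skip "folder" placeholder keys.
                continue
            local_path = os.path.join(dest_dir, os.path.basename(key))
            client.download_file(bucket, key, local_path)
            downloaded.append(local_path)
    return downloaded
```

With boto3 installed, a client could be created with ```boto3.client("s3", aws_access_key_id=..., aws_secret_access_key=...)``` and passed in as the first argument. Taking the client as a parameter also keeps the function easy to test with a stand-in object.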

Now let's load the PDFs containing the insurance policies using ```PyPDFDirectoryLoader``` from LangChain.

In [3]:
loader = PyPDFDirectoryLoader(config.DATASET_ROOT_PATH)
documents = loader.load()

Each page is loaded as a separate ```Document```.

A ```Document``` contains the page text (```page_content```) and a ```metadata``` dictionary.

In [6]:
print(f"Number of pages: {len(documents)}")

Number of pages: 267


Now, let's look at the first page of the first document.

In [9]:
page = documents[0]
print(page.page_content[0:500])

SEGURO COLECTIVO COMPLEMENTARIO DE SALUD 
Incorporada al Depósito de Pólizas bajo el código POL320130223
ARTICULO 1°: REGLAS APLICABLES AL CONTRATO
 
 
 
 
 
 
 
Se aplicarán al presente contrato de seguro las disposiciones contenidas en los artículos siguientes y las
normas legales de carácter imperativo establecidas en el Título VIII, del Libro II, del Código de Comercio. Sin
embargo, se entenderán válidas las estipulaciones contractuales que sean más beneficiosas para el
asegurado o beneficia


Let's take a look at its metadata. It contains the following fields:

- ```source``` : The path of the file the page was loaded from
- ```page``` : The zero-based page number within that file

This information can be used to trace a page back to its source file for debugging purposes.

In [17]:
print("The Page's File name is: ", page.metadata["source"])
print("The Page number is: ", page.metadata["page"])

The Page's File name is:  /home/gio/ANYONEAI/InsurancePolicyChatbot/dataset/POL320130223.pdf
The Page number is:  0
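
The same metadata can also be aggregated to see how many pages each policy file contributes. The snippet below is a minimal sketch: ```pages_per_file``` is a hypothetical helper, and the sample pages are illustrative stand-ins with the same ```metadata``` shape as the loader's output, so the block runs without LangChain. In the notebook, the ```documents``` list loaded earlier would be passed instead.

```python
import os
from collections import Counter
from types import SimpleNamespace


def pages_per_file(documents):
    """Count how many pages were loaded from each source file."""
    return Counter(os.path.basename(doc.metadata["source"]) for doc in documents)


# Illustrative stand-ins (made-up file names), not real loader output.
sample_pages = [
    SimpleNamespace(page_content="...", metadata={"source": "dataset/A.pdf", "page": 0}),
    SimpleNamespace(page_content="...", metadata={"source": "dataset/A.pdf", "page": 1}),
    SimpleNamespace(page_content="...", metadata={"source": "dataset/B.pdf", "page": 0}),
]

# Print files from most to fewest pages.
for name, n_pages in pages_per_file(sample_pages).most_common():
    print(f"{name}: {n_pages} page(s)")
```

Keying the count on the file's base name keeps the summary readable when the dataset lives under a long absolute path.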


## Data Cleaning & Preprocessing:

In this section, we will clean the data and preprocess it for further analysis.