# Load Document Data from PDF into Hive Catalog in watsonx.data

## Overview
This Jupyter Notebook provides a step-by-step guide on how to prepare PDF document data for RAG using Milvus as a vector database (in watsonx.data).

In this notebook, as the first step, we will prepare document data from the PDF files and populate it into the Hive Catalog (in watsonx.data). Here are the steps:

1. Install and import libraries.
2. Connect to watsonx.data.
3. Create Schema and Table in Hive Catalog.
4. Chunk the PDF documents and load into Hive Table.
5. Check the loaded documents data in Hive Table.

- Author: ahmad.muzaffar@ibm.com (APAC Ecosystem Technical Enablement).
- This material has been adapted from material originally produced by Katherine Ciaravalli, Ken Bailey and George Baklarz.

## 1. Install and import libraries

In [None]:
# Install libraries
!pip install ipython-sql==0.4.1
!pip install sqlalchemy==1.4.46
!pip install pyhive[presto]
!pip install python-dotenv
!pip install grpcio==1.60.0
!pip install langchain
!pip install pandas==2.1.4
!pip install langchain_community
!pip install PyPDF2
!pip install pymilvus
!pip install sentence_transformers

In [None]:
# Import libraries
import PyPDF2
import pandas as pd
import warnings
import ssl
import urllib3
import os

from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

warnings.filterwarnings('ignore')
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) # disable https warning

## 2. Connect to watsonx.data
The following code uses the Presto magic commands to load data into watsonx.data.

In [None]:
%run presto.ipynb

The connection details should not change unless you are attempting to run this script from a Jupyter environment that is outside of the developer system.

In [None]:
%%sql
   connect
   userid=ibmlhadmin
   password=password
   hostname=watsonxdata
   port=8443
   catalog=tpch
   schema=tiny
   certfile=/certs/lh-ssl-ts.crt
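
If the `%%sql` magic from `presto.ipynb` is not available, the same connection details can in principle be passed to SQLAlchemy through the PyHive Presto dialect installed earlier. The following is a minimal sketch and not the notebook's own method: the helper names `presto_url` and `make_engine` are hypothetical, and the `connect_args` keys (`protocol`, `requests_kwargs`) are assumed to be forwarded to PyHive unchanged.

```python
from urllib.parse import quote_plus


def presto_url(user, password, host, port, catalog, schema):
    """Build a SQLAlchemy URL for the PyHive Presto dialect (hypothetical helper)."""
    # quote_plus protects special characters in the credentials
    return (
        f"presto://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{catalog}/{schema}"
    )


def make_engine(url, certfile):
    """Create an engine that talks HTTPS and verifies against the given certificate.

    Assumption: PyHive accepts `protocol` and `requests_kwargs` as connect arguments.
    """
    # Imported lazily so that building the URL does not require SQLAlchemy
    from sqlalchemy import create_engine

    return create_engine(
        url,
        connect_args={
            "protocol": "https",
            "requests_kwargs": {"verify": certfile},
        },
    )


# Values copied from the connect cell above
url = presto_url("ibmlhadmin", "password", "watsonxdata", 8443, "tpch", "tiny")
```

Calling `make_engine(url, "/certs/lh-ssl-ts.crt")` would then return an engine usable with `pandas.read_sql`, provided the host is reachable from the Jupyter environment.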

## 3. Create Schema and Table in Hive Catalog

In [None]:
%%sql
DROP TABLE IF EXISTS hive_data.rag_pdf.pdf_watsonx;
DROP SCHEMA IF EXISTS hive_data.rag_pdf;

In [None]:
# The next step deletes any existing data under rag_pdf in the hive-bucket bucket.
# A DROP TABLE command does not remove the files in the bucket.
# You may see error messages displayed if no data or bucket exists.

minio_host    = "watsonxdata"
minio_port    = "9000"
hive_host     = "watsonxdata"
hive_port     = "9083"

hive_id           = None
hive_password     = None
minio_access_key  = None
minio_secret_key  = None
keystore_password = None

try:
    with open('/certs/passwords') as fd:
        certs = fd.readlines()
    for line in certs:
        args = line.split()
        if (len(args) >= 3):
            system   = args[0].strip()
            user     = args[1].strip()
            password = args[2].strip()
            if (system == "Minio"):
                minio_access_key = user
                minio_secret_key = password
            elif (system == "Thrift"):
                hive_id = user
                hive_password = password
            elif (system == "Keystore"):
                keystore_password = password
            else:
                pass
except OSError as e:
    # open() raises OSError (e.g. FileNotFoundError) when the file is missing or unreadable
    print(f"Certificate file with passwords could not be read: {e}")

%system mc alias set watsonxdata http://{minio_host}:{minio_port} {minio_access_key} {minio_secret_key}

%system mc rm --recursive --force watsonxdata/hive-bucket/rag_pdf
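
The credential lookup in the cell above relies on each line of `/certs/passwords` holding at least three whitespace-separated fields: system, user and password. A minimal sketch of that parsing step as a reusable function is shown here; the function name `parse_passwords` and the sample values are hypothetical and only illustrate the assumed file layout.

```python
def parse_passwords(lines):
    """Return {system: (user, password)} for lines with at least three fields."""
    creds = {}
    for line in lines:
        args = line.split()  # split() also strips surrounding whitespace
        if len(args) >= 3:
            creds[args[0]] = (args[1], args[2])
    return creds


# Made-up sample lines in the assumed "system user password" layout
sample = [
    "Minio    demo_access  demo_secret\n",
    "Thrift   hive_user    hive_pw\n",
    "malformed\n",
]

creds = parse_passwords(sample)
minio_access_key, minio_secret_key = creds.get("Minio", (None, None))
hive_id, hive_password = creds.get("Thrift", (None, None))
```

Using `dict.get` with a `(None, None)` default keeps the variables defined even when a system is missing from the file, which matches the `None` initialisation in the original cell.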

### Create Schema (rag_pdf)

In [None]:
%%sql
CREATE SCHEMA IF NOT EXISTS 
  hive_data.rag_pdf 
WITH (location = 's3a://hive-bucket/rag_pdf')

### Create Table (pdf_watsonx)

In [None]:
%%sql
CREATE TABLE hive_data.rag_pdf.pdf_watsonx
  (
    "id" varchar,
    "text" varchar,
    "title" varchar  
  )
WITH 
  (
  format = 'PARQUET',
  external_location = 's3a://hive-bucket/rag_pdf' 
  )

## 4. Chunk the PDF documents and load into Hive Table

In [None]:
# List the PDF files to process
pdf_files=['DB2 and DB2 Warehouse on Cloud [TechTalks].PDF',
          'IBM Watsonx.Data [TechTalks].PDF',
          'IBM watsonx.data Milvus Vector Database Competitive - Mar 2024 Final.pdf'
          ]

In [None]:
# Set parameter for chunking
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50   

# Use text splitter function by LangChain
text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=CHUNK_SIZE,
            chunk_overlap=CHUNK_OVERLAP
        )
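
To build intuition for `CHUNK_SIZE` and `CHUNK_OVERLAP`, the sketch below uses a plain sliding window over characters. It is a simplified illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, so its chunks will not line up exactly with these. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# 1200 characters of dummy text
sample_text = "".join(str(i % 10) for i in range(1200))

chunks = sliding_chunks(sample_text, chunk_size=500, chunk_overlap=50)
chunk_lengths = [len(c) for c in chunks]

# The last 50 characters of one chunk reappear at the start of the next
shared = chunks[0][-50:] == chunks[1][:50]
```

With a chunk size of 500 and an overlap of 50, the window advances 450 characters at a time, so a sentence that straddles a boundary still appears complete in at least one chunk when it is shorter than the overlap.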

In [None]:
# Initialize the index counter
index_counter = 1  

# For each pdf file, we conduct text extraction, processing, chunking and insertion into Hive table
if pdf_files:
    for pdf in pdf_files:
        pdf_reader = PyPDF2.PdfReader(pdf)

        # Start each file with an empty string so that text from
        # previously processed PDFs is not chunked and inserted again
        all_data = ''

        # Extract text from each page
        for page in pdf_reader.pages:
            text = page.extract_text()
            all_data += text
            
        # Escapes special characters and replaces newlines with spaces
        all_data = all_data.replace("'", "''").replace("%", "%%").replace("\n", " ")

        # Chunking data
        texts = text_splitter.split_text(all_data)
       
        # Insert data
        for chunk in texts:
            insert_stmt = f"insert into hive_data.rag_pdf.pdf_watsonx values ('{index_counter}', '{chunk}', '{pdf}')"
            %sql --quiet {insert_stmt}
            print(f"{pdf} chunk {index_counter} Inserted.")  
            index_counter += 1  # Increment the index counter after each insertion

print('Data insertion completed.')
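
The escaping and statement building in the loop above can also be expressed as small helpers, which makes the quoting rules easier to check in isolation. This is a sketch under the same assumptions as the original cell: single quotes are doubled for SQL, and `%` is doubled because the statement is passed through the `%sql` magic. The helper names `sql_literal` and `build_insert` are hypothetical.

```python
def sql_literal(value):
    """Wrap a value in single quotes, doubling embedded quotes and percent signs."""
    escaped = str(value).replace("'", "''").replace("%", "%%")
    return f"'{escaped}'"


def build_insert(table, row_id, chunk, title):
    """Build one INSERT statement for the (id, text, title) table layout."""
    return (
        f"insert into {table} values "
        f"({sql_literal(row_id)}, {sql_literal(chunk)}, {sql_literal(title)})"
    )


stmt = build_insert(
    "hive_data.rag_pdf.pdf_watsonx", 1, "It's 100% done", "a.pdf"
)
```

Escaping each value at the point where the statement is built means the file name is escaped as well, and a chunk is never split in the middle of an already doubled quote.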

## 5. Check the loaded documents data in Hive Table

In [None]:
%%sql
   SELECT * FROM hive_data.rag_pdf.pdf_watsonx