# Load Document Data from Web into Hive Catalog in watsonx.data (Web)

## Overview
This Jupyter Notebook provides a step-by-step guide on how to prepare Web document data for RAG using Milvus as a vector database (in watsonx.data).

In this notebook, we will prepare document data from the Web (Wikipedia) using a URL and populate it into the Hive Catalog (in watsonx.data). Here are the steps:

1. Install and import libraries.
2. Fetch web articles (Wikipedia articles) and populate into dataframe.
3. Connect to watsonx.data.
4. Create Schema and Table in Hive Catalog.
5. Chunk the web documents and load into Hive Table.
6. Check the loaded documents data in Hive Table.

- Author: ahmad.muzaffar@ibm.com (APAC Ecosystem Technical Enablement).
- This material has been adapted from material originally produced by Katherine Ciaravalli, Ken Bailey and George Baklarz.

## 1. Install and import libraries

In [None]:
# Install libraries
!pip install python-dotenv
!pip install wikipedia
!pip install pymilvus
!pip install sentence_transformers
!pip install grpcio==1.60.0 

In [None]:
# Import libraries
import wikipedia
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

## 2. Fetch web articles (Wikipedia articles) and populate into dataframe

This section walks through loading Wikipedia articles into a watsonx.data relational database table. We use the [Wikipedia Python library](https://pypi.org/project/wikipedia/) to retrieve the articles, create a table in the database to store them, and then load the articles into that table.

For details on the copyright issues when extracting data, please refer to the [Wikipedia Copyrights](https://en.wikipedia.org/wiki/Wikipedia:Copyrights) page.

The following code will search Wikipedia articles and display a list of the articles by title. The initial search will return a list of up to 10 titles, while the subsequent call will retrieve the summary of the article. The two results are combined into one dataframe for easy scrolling.

Update the `topic` variable in the next cell with the term you want to search for.

In [None]:
topic = "climate"

### Retrieve 10 Articles
The next call will retrieve a maximum of 10 titles and display the list.

In [None]:
search_results = wikipedia.search(topic)
print("Article Title")
print("-------------------------------------------------")
for result in search_results: print(result)

### Retrieve Article Summary
Now that we have a list of articles, we can request a summary of each article and display them. Note that if an article is ambiguous, the program will not attempt to retrieve the article. An ambiguous article is an article which could refer to multiple topics. The summary output from an ambiguous article will display possible searches that you may want to try. Since we are only interested in direct articles, the ambiguous titles will be ignored.

In [None]:
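# Search using the topic set earlier
search_results = wikipedia.search(topic)

display_articles = []
for title in search_results:
    try:
        summary = wikipedia.summary(title)
    except wikipedia.exceptions.DisambiguationError:
        # The title could refer to several topics, so it is ignored
        print(f"Skipped article '{title}' because of ambiguity.")
        continue
    except Exception as err:
        # Any other failure (for example a missing page) is reported and skipped
        print(f"Skipped article '{title}': {err}")
        continue

    display_articles.append({
        "title"   : title,
        "summary" : summary
    })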

#print(display_articles)

df = pd.DataFrame.from_dict(display_articles)
df.style.set_properties(**{'text-align': 'left'})

The next cells select the articles to load into watsonx.data and fetch their full text. Since we are only interested in climate change, we select the first two articles in the list. To load different documents, change the indexes in the `documents` variable in the next cell.

In [None]:
documents = [0,1]

In [None]:
# fetch wikipedia articles
articles = {}
for document in documents:
    articles.update({display_articles[document]["title"] : None})

for k,v in articles.items():
    article = wikipedia.page(k)
    articles[k] = article.content
    print(f"Successfully fetched article {k}")

print(f"Successfully fetched {len(articles)} articles ")
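The fetch loop above stops at the first article that cannot be retrieved. As a minimal sketch of a more tolerant variant, the hypothetical helper below (`fetch_articles` is not part of the notebook or the `wikipedia` library) takes the fetch function as a parameter and collects failures instead of raising. In this notebook the parameter could be something like `lambda t: wikipedia.page(t, auto_suggest=False).content`; whether disabling auto-suggest is desirable here is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def fetch_articles(
    titles: Iterable[str],
    fetch_content: Callable[[str], str],
) -> Tuple[Dict[str, str], List[str]]:
    """Fetch each title and return (articles, failed_titles).

    `fetch_content` is a hypothetical hook that returns the article text
    for a title and raises an exception when the article is unavailable.
    """
    articles: Dict[str, str] = {}
    failed: List[str] = []
    for title in titles:
        try:
            articles[title] = fetch_content(title)
        except Exception as err:  # e.g. an ambiguous or missing page
            print(f"Could not fetch '{title}': {err}")
            failed.append(title)
    return articles, failed
```

Articles that fail are reported and listed in the second return value, so the remaining articles can still be loaded.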

## 3. Connect to watsonx.data
The following code uses the Presto magic commands to load data into watsonx.data.

In [None]:
%run presto.ipynb

The connection details should not change unless you are attempting to run this script from a Jupyter environment that is outside of the developer system.

In [None]:
%%sql
   connect
   userid=ibmlhadmin
   password=password
   hostname=watsonxdata
   port=8443
   catalog=tpch
   schema=tiny
   certfile=/certs/lh-ssl-ts.crt

## 4. Create Schema and Table in Hive Catalog

In [None]:
%%sql
DROP TABLE IF EXISTS hive_data.rag_web.web_wikipedia;
DROP SCHEMA IF EXISTS hive_data.rag_web;

In [None]:
# The next step will delete any existing data in the rag_web bucket. 
# A DROP table command does not remove the files in the bucket. 
# You may see error messages displayed if no data or bucket exists.

minio_host    = "watsonxdata"
minio_port    = "9000"
hive_host     = "watsonxdata"
hive_port     = "9083"

hive_id           = None
hive_password     = None
minio_access_key  = None
minio_secret_key  = None
keystore_password = None

try:
    with open('/certs/passwords') as fd:
        certs = fd.readlines()
    for line in certs:
        args = line.split()
        if (len(args) >= 3):
            system   = args[0].strip()
            user     = args[1].strip()
            password = args[2].strip()
            if (system == "Minio"):
                minio_access_key = user
                minio_secret_key = password
            elif (system == "Thrift"):
                hive_id = user
                hive_password = password
            elif (system == "Keystore"):
                keystore_password = password
            else:
                pass
except OSError as e:
    # Raised when the file is missing or unreadable
    print("Certificate file with passwords could not be found")

%system mc alias set watsonxdata http://{minio_host}:{minio_port} {minio_access_key} {minio_secret_key}

%system mc rm --recursive --force watsonxdata/hive-bucket/rag_web
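The credential parsing in the cell above can also be expressed as a small function, which makes it easier to check in isolation. This is a minimal sketch, assuming the file holds whitespace-separated lines of the form `<system> <user> <password>` as the cell implies; `parse_credentials` is a hypothetical helper and the sample values are made up.

```python
from typing import Dict, Iterable, Tuple


def parse_credentials(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each system name to a (user, password) pair.

    Lines with fewer than three whitespace-separated fields are ignored.
    """
    creds: Dict[str, Tuple[str, str]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            creds[fields[0]] = (fields[1], fields[2])
    return creds


# Made-up sample lines, for illustration only
sample = ["Minio demo_key demo_secret\n", "Thrift demo_user demo_pw\n", "\n"]
creds = parse_credentials(sample)
minio_access_key, minio_secret_key = creds.get("Minio", (None, None))
print(minio_access_key)
```

With the real file, the call would be `parse_credentials(open('/certs/passwords'))`, and a missing system name leaves its values as `None`, matching the defaults in the cell above.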

#### Create Schema (rag_web)

In [None]:
%%sql
CREATE SCHEMA IF NOT EXISTS 
  hive_data.rag_web
WITH (location = 's3a://hive-bucket/rag_web')

#### Create Table (web_wikipedia)

In [None]:
%%sql
CREATE TABLE hive_data.rag_web.web_wikipedia
  (
    "id" varchar,
    "text" varchar,
    "title" varchar  
  )
WITH 
  (
  format = 'PARQUET',
  external_location = 's3a://hive-bucket/rag_web' 
  )

## 5. Chunk the web documents and load into the Table
Each Wikipedia article is written into the watsonx.data database in chunks of up to 225 words. Chunking the data makes it more efficient to populate the Milvus system from watsonx.data.
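To make the chunking concrete, here is a small sketch that applies the same word-based split used in this notebook to a made-up ten-word string, with a chunk size of 4 instead of 225. The sample text and chunk size are chosen only for illustration.

```python
def split_into_chunks(text, chunk_size):
    # Split on whitespace, then rejoin consecutive groups of chunk_size words
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


sample = "one two three four five six seven eight nine ten"
chunks = split_into_chunks(sample, 4)
for n, chunk in enumerate(chunks, start=1):
    # Print the chunk number, its word count, and its text
    print(n, len(chunk.split()), chunk)
```

The last chunk holds only the remaining two words, so chunks are at most `chunk_size` words long rather than exactly that size.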

In [None]:
# Chunk data
def split_into_chunks(text, chunk_size):
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

split_articles = {}
for k,v in articles.items():
    split_articles[k] = split_into_chunks(v, 225)

# Insert data
for article_title, article_chunks in split_articles.items():
    for i, chunk in enumerate(article_chunks):
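        escaped_chunk = chunk.replace("'", "''").replace("%", "%%")
        # Escape the title the same way so an apostrophe in it cannot break the statement
        escaped_title = article_title.replace("'", "''").replace("%", "%%")
        insert_stmt = f"insert into hive_data.rag_web.web_wikipedia values ('{i+1}', '{escaped_chunk}', '{escaped_title}')"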
        %sql --quiet {insert_stmt}
        print(f"{article_title} {i+1}/{len(article_chunks)} inserted",end="\r")
            
    print(f"\n{article_title} Insertion complete")
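The loop above issues one `insert` per chunk. As a hedged sketch of an alternative, the hypothetical helpers below (`sql_literal` and `build_insert` are not part of the notebook) build a single multi-row `insert` statement and double the single quotes in every column. Whether the `%sql` magic accepts large statements has not been verified here, and the `%` escaping used in the cell above is left to the caller.

```python
def sql_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_insert(table: str, rows: list) -> str:
    """Build one multi-row INSERT statement from (id, text, title) tuples."""
    values = ", ".join(
        "(" + ", ".join(sql_literal(str(col)) for col in row) + ")"
        for row in rows
    )
    return f"insert into {table} values {values}"


# Made-up rows, for illustration only
rows = [
    (1, "It's warm", "Earth's climate"),
    (2, "second chunk", "Earth's climate"),
]
stmt = build_insert("hive_data.rag_web.web_wikipedia", rows)
print(stmt)
```

Sending a modest number of rows per statement would reduce the number of round trips, at the cost of longer statements.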

## 6. Check the loaded documents data in the Table

In [None]:
%%sql
   SELECT * FROM hive_data.rag_web.web_wikipedia
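One way to sanity-check the result is to compare the number of rows per title in the table with the number of chunks each article should have produced. The sketch below computes that expected count for a word-based split; `expected_chunk_count` is a hypothetical helper and the sample articles are made up.

```python
import math


def expected_chunk_count(text: str, chunk_size: int = 225) -> int:
    """Number of chunks a word-based split with this chunk size produces."""
    return math.ceil(len(text.split()) / chunk_size)


# Made-up articles of 100 and 500 words, for illustration only
sample_articles = {"Short": "word " * 100, "Long": "word " * 500}
for title, content in sample_articles.items():
    print(title, expected_chunk_count(content))
```

In the notebook, the same function could be applied to each entry of `articles` and the counts compared with the rows returned for each title.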