# Load Document Data from Web into Hive Catalog in watsonx.data

## Overview
This Jupyter Notebook provides a step-by-step guide on how to prepare Web document data for RAG using Milvus as a vector database (in watsonx.data).

In this notebook, we will prepare document data from the Web (Wikipedia) using a URL and populate it into the Hive Catalog (in watsonx.data). Here are the steps:

1. Install and import libraries.
2. Fetch web articles (Wikipedia articles) and populate into dataframe.
3. Connect to watsonx.data.
4. Create Schema and Table in Hive Catalog.
5. Chunk the web documents and load into Hive Table.
6. Check the loaded documents data in Hive Table.

- Author: ahmad.muzaffar@ibm.com (APAC Ecosystem Technical Enablement).
- This material has been adapted from material originally produced by Katherine Ciaravalli, Ken Bailey and George Baklarz.

## 1. Install and import libraries

In [1]:
# Install libraries
!pip install python-dotenv
!pip install wikipedia
!pip install pymilvus
!pip install sentence_transformers
!pip install grpcio==1.60.0 



In [2]:
# Import libraries
import wikipedia
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

## 2. Fetch web articles (Wikipedia articles) and populate into dataframe

This notebook walks through the process of loading Wikipedia articles into a watsonx.data relational database table. We use the [Wikipedia Python library](https://pypi.org/project/wikipedia/) to retrieve the articles. We then create a table in the database to store them. Finally, we load the articles into the table.

For details on the copyright issues when extracting data, please refer to the [Wikipedia Copyrights](https://en.wikipedia.org/wiki/Wikipedia:Copyrights) page.

The following code will search Wikipedia articles and display a list of the articles by title. The initial search will return a list of up to 10 titles, while the subsequent call will retrieve the summary of the article. The two results are combined into one dataframe for easy scrolling.

Update the next field to include what you are searching for.

In [3]:
topic = "climate"

### Retrieve 10 Articles
The next call will retrieve a maximum of 10 titles and display the list.

In [4]:
search_results = wikipedia.search(topic)
print("Article Title")
print("-------------------------------------------------")
for result in search_results: print(result)

Article Title
-------------------------------------------------
Climate
Climate change
Köppen climate classification
Mediterranean climate
Oceanic climate
Climate action
Subarctic climate
Climatize
Temperate climate
Climate classification


### Retrieve Article Summary
Now that we have a list of articles, we can request a summary of each article and display them. Note that if an article is ambiguous, the program will not attempt to retrieve the article. An ambiguous article is an article which could refer to multiple topics. The summary output from an ambiguous article will display possible searches that you may want to try. Since we are only interested in direct articles, the ambiguous titles will be ignored.

In [5]:
# Search using the topic set above, then fetch a summary for each title
search_results = wikipedia.search(topic)

display_articles = []
for title in search_results:
    try:
        summary = wikipedia.summary(title)
    except Exception as err:
        # Any title whose summary cannot be retrieved is skipped
        print(f"Skipped article '{title}' because of ambiguity.")
        continue

    display_articles.append({
        "title"   : title,
        "summary" : summary
    })

df = pd.DataFrame.from_dict(display_articles)
df.style.set_properties(**{'text-align': 'left'})

Skipped article 'Oceanic climate' because of ambiguity.
Skipped article 'Climatize' because of ambiguity.
Skipped article 'Temperate climate' because of ambiguity.


Unnamed: 0,title,summary
0,Climate,"Climate is the long-term weather pattern in a region, typically averaged over 30 years. More rigorously, it is the mean and variability of meteorological variables over a time spanning from months to millions of years. Some of the meteorological variables that are commonly measured are temperature, humidity, atmospheric pressure, wind, and precipitation. In a broader sense, climate is the state of the components of the climate system, including the atmosphere, hydrosphere, cryosphere, lithosphere and biosphere and the interactions between them. The climate of a location is affected by its latitude, longitude, terrain, altitude, land use and nearby water bodies and their currents. Climates can be classified according to the average and typical variables, most commonly temperature and precipitation. The most widely used classification scheme is the Köppen climate classification. The Thornthwaite system, in use since 1948, incorporates evapotranspiration along with temperature and precipitation information and is used in studying biological diversity and how climate change affects it. The major classifications in Thornthwaite's climate classification are microthermal, mesothermal, and megathermal. Finally, the Bergeron and Spatial Synoptic Classification systems focus on the origin of air masses that define the climate of a region. Paleoclimatology is the study of ancient climates. Paleoclimatologists seek to explain climate variations for all parts of the Earth during any given geologic period, beginning with the time of the Earth's formation. Since very few direct observations of climate were available before the 19th century, paleoclimates are inferred from proxy variables. They include non-biotic evidence—such as sediments found in lake beds and ice cores—and biotic evidence—such as tree rings and coral. Climate models are mathematical models of past, present, and future climates. 
Climate change may occur over long and short timescales due to various factors. Recent warming is discussed in terms of global warming, which results in redistributions of biota. For example, as climate scientist Lesley Ann Hughes has written: ""a 3 °C [5 °F] change in mean annual temperature corresponds to a shift in isotherms of approximately 300–400 km [190–250 mi] in latitude (in the temperate zone) or 500 m [1,600 ft] in elevation. Therefore, species are expected to move upwards in elevation or towards the poles in latitude in response to shifting climate zones."""
1,Climate change,"In common usage, climate change describes global warming—the ongoing increase in global average temperature—and its effects on Earth's climate system. Climate change in a broader sense also includes previous long-term changes to Earth's climate. The current rise in global average temperature is primarily caused by humans burning fossil fuels since the Industrial Revolution. Fossil fuel use, deforestation, and some agricultural and industrial practices add to greenhouse gases. These gases absorb some of the heat that the Earth radiates after it warms from sunlight, warming the lower atmosphere. Carbon dioxide, the primary greenhouse gas driving global warming, has grown by about 50% and is at levels unseen for millions of years. Climate change has an increasingly large impact on the environment. Deserts are expanding, while heat waves and wildfires are becoming more common. Amplified warming in the Arctic has contributed to thawing permafrost, retreat of glaciers and sea ice decline. Higher temperatures are also causing more intense storms, droughts, and other weather extremes. Rapid environmental change in mountains, coral reefs, and the Arctic is forcing many species to relocate or become extinct. Even if efforts to minimize future warming are successful, some effects will continue for centuries. These include ocean heating, ocean acidification and sea level rise. Climate change threatens people with increased flooding, extreme heat, increased food and water scarcity, more disease, and economic loss. Human migration and conflict can also be a result. The World Health Organization calls climate change one of the biggest threats to global health in the 21st century. Societies and ecosystems will experience more severe risks without action to limit warming. 
Adapting to climate change through efforts like flood control measures or drought-resistant crops partially reduces climate change risks, although some limits to adaptation have already been reached. Poorer communities are responsible for a small share of global emissions, yet have the least ability to adapt and are most vulnerable to climate change. Many climate change impacts have been felt in recent years, with 2023 the warmest on record at +1.48 °C (2.66 °F) since regular tracking began in 1850. Additional warming will increase these impacts and can trigger tipping points, such as melting all of the Greenland ice sheet. Under the 2015 Paris Agreement, nations collectively agreed to keep warming ""well under 2 °C"". However, with pledges made under the Agreement, global warming would still reach about 2.7 °C (4.9 °F) by the end of the century. Limiting warming to 1.5 °C would require halving emissions by 2030 and achieving net-zero emissions by 2050. Fossil fuel use can be phased out by conserving energy and switching to energy sources that do not produce significant carbon pollution. These energy sources include wind, solar, hydro, and nuclear power. Cleanly generated electricity can replace fossil fuels for powering transportation, heating buildings, and running industrial processes. Carbon can also be removed from the atmosphere, for instance by increasing forest cover and farming with methods that capture carbon in soil."
2,Köppen climate classification,"The Köppen climate classification is one of the most widely used climate classification systems. It was first published by German-Russian climatologist Wladimir Köppen (1846–1940) in 1884, with several later modifications by Köppen, notably in 1918 and 1936. Later, German climatologist Rudolf Geiger (1894–1981) introduced some changes to the classification system in 1954 and 1961, which is thus sometimes called the Köppen–Geiger climate classification. The Köppen climate classification divides climates into five main climate groups, with each group being divided based on patterns of seasonal precipitation and temperature. The five main groups are A (tropical), B (arid), C (temperate), D (continental), and E (polar). Each group and subgroup is represented by a letter. All climates are assigned a main group (the first letter). All climates except for those in the E group are assigned a seasonal precipitation subgroup (the second letter). For example, Af indicates a tropical rainforest climate. The system assigns a temperature subgroup for all groups other than those in the A group, indicated by the third letter for climates in B, C, D, and the second letter for climates in E. For example, Cfb indicates an oceanic climate with warm summers as indicated by the ending b. Climates are classified based on specific criteria unique to each climate type. As Köppen designed the system based on his experience as a botanist, his main climate groups are based on the types of vegetation occurring in a given climate classification region. In addition to identifying climates, the system can be used to analyze ecosystem conditions and identify the main types of vegetation within climates. Due to its association with the plant life of a given region, the system is useful in predicting future changes of plant life within that region. 
The Köppen climate classification system was modified further within the Trewartha climate classification system in 1966 (revised in 1980). The Trewartha system sought to create a more refined middle latitude climate zone, which was one of the criticisms of the Köppen system (the climate group C was too general).: 200–1"
3,Mediterranean climate,"A Mediterranean climate ( MED-ih-tə-RAY-nee-ən), also called a dry summer climate, described by Köppen as Cs, is a temperate climate type that occurs in the lower mid-latitudes (normally 30 to 44 north and south latitude). Such climates typically have dry summers and wet winters, with summer conditions being hot and winter conditions typically being mild. These weather conditions are typically experienced in the majority of Mediterranean-climate regions and countries, but remain highly dependent on proximity to the ocean, altitude and geographical location. The dry summer climate is found throughout the warmer middle latitudes, affecting almost exclusively the western portions of continents in relative proximity to the coast. The climate type's name is in reference to the coastal regions of the Mediterranean Sea, which mostly share this type of climate, but it can also be found in the Atlantic portions of Iberia and Northwest Africa, the Pacific portions of the United States and Chile, extreme west areas of Argentina, around Cape Town in South Africa, parts of Southwest and South Australia, and parts of Central Asia. They tend to be found in proximity (both poleward and near the coast) of desert and semi-arid climates, and equatorward of oceanic climates. Mediterranean climate zones are typically located along the western coasts of landmasses, between roughly 30 and 45 degrees north or south of the equator. The main cause of Mediterranean, or dry summer, climate is the subtropical ridge, which extends towards the pole of the hemisphere in question during the summer and migrates towards the equator during the winter. This is due to the seasonal poleward-equatorward variations of temperatures. The resulting vegetation of Mediterranean climates are the garrigue or maquis in the European Mediterranean Basin, the chaparral in California, the fynbos in South Africa, the mallee in Australia, and the matorral in Chile. 
Areas with this climate are also where the so-called ""Mediterranean trinity"" of major agricultural crops have traditionally been successfully grown (wheat, grapes and olives). As a result, these regions are notable for their high-quality wines, grapeseed/olive oils, and bread products."
4,Climate action,"Climate action (or climate change action) refers to a range of activities, mechanisms, policy instruments, and so forth that aim at reducing the severity of human-induced climate change and its impacts. ""More climate action"" is a central demand of the climate movement. Climate inaction is the absence of climate action."
5,Climate classification,"The Köppen climate classification is one of the most widely used climate classification systems. It was first published by German-Russian climatologist Wladimir Köppen (1846–1940) in 1884, with several later modifications by Köppen, notably in 1918 and 1936. Later, German climatologist Rudolf Geiger (1894–1981) introduced some changes to the classification system in 1954 and 1961, which is thus sometimes called the Köppen–Geiger climate classification. The Köppen climate classification divides climates into five main climate groups, with each group being divided based on patterns of seasonal precipitation and temperature. The five main groups are A (tropical), B (arid), C (temperate), D (continental), and E (polar). Each group and subgroup is represented by a letter. All climates are assigned a main group (the first letter). All climates except for those in the E group are assigned a seasonal precipitation subgroup (the second letter). For example, Af indicates a tropical rainforest climate. The system assigns a temperature subgroup for all groups other than those in the A group, indicated by the third letter for climates in B, C, D, and the second letter for climates in E. For example, Cfb indicates an oceanic climate with warm summers as indicated by the ending b. Climates are classified based on specific criteria unique to each climate type. As Köppen designed the system based on his experience as a botanist, his main climate groups are based on the types of vegetation occurring in a given climate classification region. In addition to identifying climates, the system can be used to analyze ecosystem conditions and identify the main types of vegetation within climates. Due to its association with the plant life of a given region, the system is useful in predicting future changes of plant life within that region. 
The Köppen climate classification system was modified further within the Trewartha climate classification system in 1966 (revised in 1980). The Trewartha system sought to create a more refined middle latitude climate zone, which was one of the criticisms of the Köppen system (the climate group C was too general).: 200–1"
6,Desert climate,"The desert climate or arid climate (in the Köppen climate classification BWh and BWk) is a dry climate sub-type in which there is a severe excess of evaporation over precipitation. The typically bald, rocky, or sandy surfaces in desert climates are dry and hold little moisture, quickly evaporating the already little rainfall they receive. Covering 14.2% of Earth's land area, hot deserts are the second most common type of climate on Earth after the polar climate. There are two variations of a desert climate according to the Köppen climate classification: a hot desert climate (BWh), and a cold desert climate (BWk). To delineate ""hot desert climates"" from ""cold desert climates"", a mean annual temperature of 18 °C (64.4 °F) is used as an isotherm so that a location with a BW type climate with the appropriate temperature above this isotherm is classified as ""hot arid subtype"" (BWh), and a location with the appropriate temperature below the isotherm is classified as ""cold arid subtype"" (BWk). Most desert/arid climates receive between 25 and 200 mm (1 and 8 in) of rainfall annually, although some of the most consistently hot areas of Central Australia, the Sahel and Guajira Peninsula can be, due to extreme potential evapotranspiration, classed as arid with the annual rainfall as high as 430 millimetres or 17 inches."
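
The cell above reports every failure as ambiguity, although other errors (for example, a missing page) are caught by the same `except` clause. The following is a minimal sketch of how the skip reason could be recorded per title. The names `summarize_titles` and `fake_fetch` are hypothetical and not part of the original notebook; in the notebook itself, `wikipedia.summary` would presumably be passed in as `fetch_summary`.

```python
def summarize_titles(titles, fetch_summary):
    """Fetch a summary per title and record why any title was skipped.

    fetch_summary is any callable that takes a title and returns text.
    """
    rows, skipped = [], {}
    for title in titles:
        try:
            summary = fetch_summary(title)
        except Exception as err:
            # Keep the exception class name so that ambiguity can be told
            # apart from other failures, such as a page that does not exist.
            skipped[title] = type(err).__name__
            continue
        rows.append({"title": title, "summary": summary})
    return rows, skipped


# Stand-in fetcher (hypothetical) so the sketch runs without network access.
def fake_fetch(title):
    if title == "Climatize":
        raise LookupError("ambiguous or missing page")
    return f"Summary of {title}"


rows, skipped = summarize_titles(["Climate", "Climatize"], fake_fetch)
print(rows)
print(skipped)
```

The `rows` list has the same shape as `display_articles`, so it can be passed to `pd.DataFrame.from_dict` in the same way.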


This step will load selected articles into watsonx.data. Since we are only interested in climate change, we will select the first two articles in the list. You can change the documents loaded by changing the document indexes in the variable found in the next cell.

In [6]:
documents = [0,1]

In [7]:
# fetch wikipedia articles
articles = {}
for document in documents:
    articles.update({display_articles[document]["title"] : None})

for k,v in articles.items():
    article = wikipedia.page(k)
    articles[k] = article.content
    print(f"Successfully fetched article {k}")

print(f"Successfully fetched {len(articles)} articles ")

Successfully fetched article Climate
Successfully fetched article Climate change
Successfully fetched 2 articles 


## 3. Connect to watsonx.data
The following code will use the Presto Magic commands to load data into watsonx.data.

In [8]:
%run presto.ipynb

Presto Extensions Loaded.


The connection details should not change unless you are attempting to run this script from a Jupyter environment that is outside of the developer system.

In [9]:
%%sql
   connect
   userid=ibmlhadmin
   password=password
   hostname=watsonxdata
   port=8443
   catalog=tpch
   schema=tiny
   certfile=/certs/lh-ssl-ts.crt

Connection successful.


## 4. Create Schema and Table in Hive Catalog

In [10]:
%%sql
DROP TABLE IF EXISTS hive_data.rag_web.web_wikipedia;
DROP SCHEMA IF EXISTS hive_data.rag_web;

Command completed.
Command completed.


In [11]:
# The next step will delete any existing data in the rag_web bucket. 
# A DROP table command does not remove the files in the bucket. 
# You may see error messages displayed if no data or bucket exists.

minio_host    = "watsonxdata"
minio_port    = "9000"
hive_host     = "watsonxdata"
hive_port     = "9083"

hive_id           = None
hive_password     = None
minio_access_key  = None
minio_secret_key  = None
keystore_password = None

try:
    with open('/certs/passwords') as fd:
        certs = fd.readlines()
    for line in certs:
        args = line.split()
        if (len(args) >= 3):
            system   = args[0].strip()
            user     = args[1].strip()
            password = args[2].strip()
            if (system == "Minio"):
                minio_access_key = user
                minio_secret_key = password
            elif (system == "Thrift"):
                hive_id = user
                hive_password = password
            elif (system == "Keystore"):
                keystore_password = password
            else:
                pass
except OSError as e:
    # OSError covers a missing or unreadable file; "Error" is not a built-in name
    print("Certificate file with passwords could not be found")

%system mc alias set watsonxdata http://{minio_host}:{minio_port} {minio_access_key} {minio_secret_key}

%system mc rm --recursive --force watsonxdata/hive-bucket/rag_web

[]
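
The credential lookup in the cell above can also be expressed as a small function, which makes the expected file layout easier to see. This is only a sketch: it assumes, as the cell's parsing logic implies, that each line holds whitespace-separated `system user password` fields. The function name `parse_credentials` and the sample values are made up for illustration.

```python
def parse_credentials(lines):
    """Map each system name to its (user, password) pair.

    Lines with fewer than three whitespace-separated fields are ignored,
    mirroring the length check in the cell above.
    """
    credentials = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            system, user, password = fields[0], fields[1], fields[2]
            credentials[system] = (user, password)
    return credentials


# Made-up sample content; the real values live in /certs/passwords.
sample = [
    "Minio example-access-key example-secret-key\n",
    "Thrift example-user example-password\n",
    "incomplete-line\n",
]
creds = parse_credentials(sample)

# Fall back to None when a system is absent, as the original cell does.
minio_access_key, minio_secret_key = creds.get("Minio", (None, None))
print(sorted(creds))
```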

#### Create Schema (rag_web)

In [12]:
%%sql
CREATE SCHEMA IF NOT EXISTS 
  hive_data.rag_web
WITH (location = 's3a://hive-bucket/rag_docs')

Command completed.


#### Create Table (web_wikipedia)

In [16]:
%%sql
CREATE TABLE hive_data.rag_web.web_wikipedia
  (
    "id" varchar,
    "text" varchar,
    "title" varchar  
  )
WITH 
  (
  format = 'PARQUET',
  external_location = 's3a://hive-bucket/rag_web' 
  )

SQL Error: line 1:1: Table 'hive_data.rag_web.web_wikipedia' already exists


## 5. Chunk the web documents and load into the Table
Each Wikipedia article is written into the watsonx.data table in chunks of up to 225 words. The reason for chunking the data is to make it more efficient when populating the Milvus system from watsonx.data.
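
To make the chunk sizes concrete, here is a small worked example that applies the same word-based split to a synthetic 500-word text. The sample text is invented for illustration; the function body matches the one used in the next cell.

```python
import math


def split_into_chunks(text, chunk_size):
    # Split on whitespace and regroup the words into fixed-size windows
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


# Synthetic text of exactly 500 words: w0 w1 ... w499
sample_text = " ".join(f"w{n}" for n in range(500))

chunks = split_into_chunks(sample_text, 225)
sizes = [len(chunk.split()) for chunk in chunks]
print(sizes)  # [225, 225, 50]

# The number of chunks is the word count divided by the chunk size, rounded up
assert len(chunks) == math.ceil(500 / 225)
```

Only the last chunk can be shorter than 225 words, and no words are shared between adjacent chunks.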

In [17]:
# Chunk data
def split_into_chunks(text, chunk_size):
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

split_articles = {}
for k,v in articles.items():
    split_articles[k] = split_into_chunks(v, 225)

# Insert data
for article_title, article_chunks in split_articles.items():
    # Apply the same escaping to the title as to the chunk text,
    # so that a title containing a single quote does not break the statement
    escaped_title = article_title.replace("'", "''").replace("%", "%%")
    for i, chunk in enumerate(article_chunks):
        escaped_chunk = chunk.replace("'", "''").replace("%", "%%")
        insert_stmt = f"insert into hive_data.rag_web.web_wikipedia values ('{i+1}', '{escaped_chunk}', '{escaped_title}')"
        %sql --quiet {insert_stmt}
        print(f"{article_title} {i+1}/{len(article_chunks)} inserted",end="\r")

    print(f"\n{article_title} Insertion complete")

Climate 11/11 inserted
Climate Insertion complete
Climate change 42/42 inserted
Climate change Insertion complete
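
Because the INSERT statements are assembled as strings, any single quote in the data must be doubled to stay inside the SQL string literal. The sketch below isolates that step so it can be checked without a database connection. The helper names `sql_literal` and `build_insert` are hypothetical. The `%` doubling used in the cell above is left out here, on the assumption that it is specific to how the magic command processes the statement.

```python
def sql_literal(value):
    """Quote a Python string as a SQL string literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_insert(chunk_id, text, title, table="hive_data.rag_web.web_wikipedia"):
    """Build one INSERT statement for a chunk, escaping every string value."""
    values = ", ".join([sql_literal(str(chunk_id)), sql_literal(text), sql_literal(title)])
    return f"insert into {table} values ({values})"


stmt = build_insert(1, "Earth's climate system", "Thornthwaite's classification")
print(stmt)
```

Escaping by hand is adequate for trusted text such as these articles; where the client library supports parameterized statements, those are generally the safer choice.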


## 6. Check the loaded documents data in the Table

In [18]:
%%sql
   SELECT * FROM hive_data.rag_web.web_wikipedia

DataGrid(auto_fit_columns=True, auto_fit_params={'area': 'all', 'padding': 30, 'numCols': None}, corner_render…