### Advanced RAG with LlamaIndex
- https://www.youtube.com/watch?v=oDzWsynpOyI&t=615s

In [1]:
import os
import sys
import getpass
import json
import time

sys.path.insert(0, '../../libs')  # make the local helper modules importable
import pandas as pd
import tqdm
from pprint import pprint
from utils import load_json, extract_json_string, get_all_files
from oai_ast_utils import OpenAIAssistant_Base

import logging
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    load_index_from_storage,
    StorageContext,
    ServiceContext,
    Document
)

In [2]:
from llama_index.llms import OpenAI, Anthropic
from llama_index.node_parser import SentenceWindowNodeParser, HierarchicalNodeParser, get_leaf_nodes
from llama_index.text_splitter import SentenceSplitter
from llama_index.embeddings import OpenAIEmbedding, HuggingFaceEmbedding
from llama_index.schema import MetadataMode
from llama_index.postprocessor import MetadataReplacementPostProcessor

In [3]:
#from openai import OpenAI
## load API Key
#key = load_json('/root/workspace/key/openai_key.json') 
os.environ["OPENAI_API_KEY"] = getpass.getpass(prompt='OpenAI API Token:')

In [4]:
data_folder = '/root/workspace/data/Adv_RAG_temp_data'
## if data is not downloaded, download them first 
# !wget 'https://www.gutenberg.org/cache/epub/72306/pg72306.txt' -O '/root/workspace/data/Adv_RAG_temp_data/teahistory.txt'
# !wget 'https://www.gutenberg.org/cache/epub/11367/pg11367.txt' -O '/root/workspace/data/Adv_RAG_temp_data/chinahistory.txt'


- Load and explore the input data

In [5]:
documents = SimpleDirectoryReader(data_folder).load_data()

In [6]:
print("length of doc: "+ str(len(documents)))
print("----")
print(documents[0].metadata)
print(documents[1].metadata)

length of doc: 2
----
{'file_path': '/root/workspace/data/Adv_RAG_temp_data/chinahistory.txt', 'file_name': 'chinahistory.txt', 'file_type': 'text/plain', 'file_size': 977274, 'creation_date': '2024-01-07', 'last_modified_date': '2024-01-05', 'last_accessed_date': '2024-01-07'}
{'file_path': '/root/workspace/data/Adv_RAG_temp_data/teahistory.txt', 'file_name': 'teahistory.txt', 'file_type': 'text/plain', 'file_size': 493827, 'creation_date': '2024-01-07', 'last_modified_date': '2023-12-31', 'last_accessed_date': '2024-01-07'}


- ### Node Parsing & Indexing (Base & Sentence Window Method)

In [8]:
# create the sentence window node parser w/ default settings
sentence_node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text"
)

base_node_parser = SentenceSplitter()

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

In [9]:
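## sentence-window parsing: one node per sentence, with the surrounding sentences stored in metadata
nodes = sentence_node_parser.get_nodes_from_documents(documents)

## baseline parsing: larger text chunks produced by SentenceSplitter with its default settings
base_nodes = base_node_parser.get_nodes_from_documents(documents)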

In [10]:
print("---------")
print("SENTENCE NODES")
print("---------")
print(nodes[100])
print("---------")
print("BASE NODES")
print("---------")
print(base_nodes[100])

---------
SENTENCE NODES
---------
Node ID: 48f05624-32f2-4c63-9c4e-d7fa7ceb4740
Text: We have no desire to show that China's history is the most
glorious or her civilization the oldest in the world.
---------
BASE NODES
---------
Node ID: 88d58e32-f8a3-4dd1-88af-33d12cef8002
Text: The government now proceeded to convert also its own Toba tribes
into military formations. The tribal men of noble rank were brought to
the court as military officers, and so were separated from the common
tribesmen and the slaves who had to remain with the herds. This
change, which robbed the tribes of all means of independent action,
was not c...
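
The output above shows the key difference: a sentence-window node holds a single sentence, while a base node holds a multi-sentence chunk. The sketch below illustrates the idea in plain Python so it can run without LlamaIndex. It is not the library's implementation: `sentence_windows` is a hypothetical helper, the regex splitter is deliberately naive, and it assumes `window_size` counts sentences on each side of the current one.

```python
import re


def sentence_windows(text, window_size=3):
    """Split text into sentences and attach a window of neighbours to each one."""
    # Naive splitter for illustration only: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)                   # clamp at the start of the text
        hi = min(len(sentences), i + window_size + 1)  # clamp at the end of the text
        nodes.append({
            "text": sentence,  # the single sentence is what gets embedded
            "metadata": {
                "window": " ".join(sentences[lo:hi]),  # neighbours kept for later use
                "original_text": sentence,
            },
        })
    return nodes


demo_nodes = sentence_windows("A one. B two. C three. D four. E five.", window_size=1)
print(demo_nodes[2]["text"])
print(demo_nodes[2]["metadata"]["window"])
```

Embedding one sentence at a time keeps the vector focused on a single idea, and the stored window lets a later step hand the LLM more context than the sentence alone.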


- Use the OpenAI embedding model to embed the nodes

In [11]:
ctx_sentence = ServiceContext.from_defaults(llm=llm, embed_model=OpenAIEmbedding(embed_batch_size=50), node_parser=sentence_node_parser)
ctx_base = ServiceContext.from_defaults(llm=llm, embed_model=OpenAIEmbedding(embed_batch_size=50), node_parser=base_node_parser)

sentence_index = VectorStoreIndex(nodes, service_context=ctx_sentence)
base_index = VectorStoreIndex(base_nodes, service_context=ctx_base)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.

- Save to Persistent Storage

In [12]:
### save both indexes to local disk so the embeddings do not have to be recomputed
sentence_index.storage_context.persist(persist_dir=os.path.join(data_folder,"sentence_index"))
base_index.storage_context.persist(persist_dir=os.path.join(data_folder,"base_index"))
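
Embedding every node costs API calls, so it can be worth skipping the build when a persisted copy already exists. The helper below is one possible guard, not part of LlamaIndex: `load_or_build`, `build_fn`, and `load_fn` are hypothetical names, and the demonstration uses stand-in callables and a placeholder file instead of real indexes.

```python
import os
import tempfile


def load_or_build(persist_dir, build_fn, load_fn):
    """Return load_fn(persist_dir) if the directory has content, else build_fn()."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load_fn(persist_dir)
    return build_fn()


# Demonstration with stand-in callables; no index is built or loaded here.
with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(tmp, build_fn=lambda: "built", load_fn=lambda d: "loaded")
    # Simulate a previous persist() call by dropping a placeholder file in the folder.
    open(os.path.join(tmp, "placeholder.json"), "w").close()
    second = load_or_build(tmp, build_fn=lambda: "built", load_fn=lambda d: "loaded")

print(first, second)
```

In this notebook, `build_fn` could wrap the `VectorStoreIndex(...)` construction plus the `persist(...)` call, and `load_fn` could wrap `StorageContext.from_defaults(persist_dir=...)` followed by `load_index_from_storage(...)`.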

- Reload the indexes from persistent storage

In [13]:
# rebuild storage context
SC_retrieved_sentence = StorageContext.from_defaults(persist_dir=os.path.join(data_folder,"sentence_index"))
SC_retrieved_base = StorageContext.from_defaults(persist_dir=os.path.join(data_folder,"base_index"))
                                                 
# load index
retrieved_sentence_index = load_index_from_storage(SC_retrieved_sentence)
retrieved_base_index = load_index_from_storage(SC_retrieved_base)

Loading all indices.
Loading all indices.


In [14]:
from llama_index.postprocessor import MetadataReplacementPostProcessor

In [15]:
sentence_query_engine = retrieved_sentence_index.as_query_engine(
    similarity_top_k=5,
    verbose=True,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
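
The postprocessor in the cell above is what makes the sentence-window method work at query time: retrieval matches on single sentences, and each hit's text is then swapped for the wider window stored in its metadata before the LLM sees it. The sketch below mimics that swap on plain dictionaries. It is an illustration under assumptions, not the library code: `replace_with_window` is a hypothetical function, and the sample node is made up.

```python
def replace_with_window(retrieved, target_metadata_key="window"):
    """Return copies of retrieved nodes whose text is swapped for the stored window."""
    replaced = []
    for node in retrieved:
        # Fall back to the original text when the metadata key is missing.
        new_text = node["metadata"].get(target_metadata_key, node["text"])
        replaced.append({**node, "text": new_text})
    return replaced


# Made-up retrieval hits: one with a stored window, one without.
hits = [
    {"text": "B two.", "metadata": {"window": "A one. B two. C three."}},
    {"text": "Z last.", "metadata": {}},
]
expanded = replace_with_window(hits)
print(expanded[0]["text"])
print(expanded[1]["text"])
```

The base query engine needs no such step because its chunks already contain several sentences of context.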

In [16]:
base_query_engine = retrieved_base_index.as_query_engine(
    similarity_top_k=5,
    verbose=True
)

- ### Run Inference

In [17]:
question = """Something happened in the United States 10 years after the first American ships sailed for China which 
could have made it more expensive to purchase tea. what happened that year? Try to break down your answer into steps."""

In [18]:
base_response = base_query_engine.query(
    question
)
print(base_response)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
1. The first American ships sailed for China in 1784.
2. Two more vessels were dispatched the following year, bringing back 880,000 pounds of Tea.
3. During 1786-87, five other ships brought to the United States over 1,000,000 pounds of Tea.
4. In 1790, the earliest official record of the importation of Tea into the United States was made.
5. The order of increase for its importation, value, and consumption in the country by decades since 1790 is provided.
6. Based on the provided table, the importation of Tea into the United States increased over the years.
7. In 1794, the rates of duty on tea were increased by 75% on direct importations and 100% on all teas shipped from Europe.
8. The rates were reduced in 1796, but doubled during the War of 1812.
9. In 

In [19]:
sentence_response = sentence_query_engine.query(
    question
)
print(sentence_response)

HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
1. The first American ships sailed for China in 1784.
2. Two more vessels were dispatched the following year, bringing back 880,000 pounds of Tea.
3. During 1786-87, five other ships brought to the United States over 1,000,000 pounds of Tea.
4. Based on this information, it can be inferred that the United States started importing a significant amount of tea from China.
5. The increased importation of tea from China could have led to higher demand and potentially higher prices for tea in the United States.
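
The two engines give different answers: the base engine reaches the 1794 duty increase, while the sentence-window engine stops at the import figures. One way to investigate is to look at which chunks each engine retrieved. The helper below is a sketch that assumes the response object exposes `source_nodes`, with each entry carrying `.score` and `.node` (with `.text` and `.metadata`), as LlamaIndex query responses do. `summarize_sources` is a hypothetical name, and the stand-in response holds placeholder values only so the block runs without API calls.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars=80):
    """Collect file name, similarity score, and a text preview for each retrieved node."""
    rows = []
    for hit in response.source_nodes:
        rows.append({
            "file_name": hit.node.metadata.get("file_name"),
            "score": hit.score,
            "preview": hit.node.text[:max_chars],
        })
    return rows


# Stand-in response with placeholder values, used only to exercise the helper.
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.5,
            node=SimpleNamespace(
                text="placeholder chunk text",
                metadata={"file_name": "teahistory.txt"},
            ),
        )
    ]
)
rows = summarize_sources(fake_response)
print(rows)
```

Calling `summarize_sources(base_response)` and `summarize_sources(sentence_response)` on the real responses would show whether the passage about the 1794 duty was retrieved by each engine at all.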
