# RAG Indexing and Querying Together Example

This is a working notebook for writing and testing the code used in our Google Cloud Function.

## Indexing Stage
In the initial indexing stage, text data must first be collected as documents and metadata. In this implementation, this is done by scraping a website. The data must then be split into "nodes", each of which represents a "chunk" of the data containing a certain portion of information. Nodes are then indexed via an embedding model; we plan on using OpenAI's Ada v2 embedding model. The embeddings and metadata together create a rich representation to aid in retrieval.
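
As a rough illustration of what splitting into overlapping chunks means, here is a minimal sketch in plain Python. It is not how `SimpleNodeParser` works internally: it splits on whitespace-separated words rather than tokens or sentences, and the helper name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(words, chunk_size, chunk_overlap):
    """Split a list of words into overlapping windows (toy stand-in for a node parser)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window has reached the end of the input
        if start + chunk_size >= len(words):
            break
    return chunks


text = "one two three four five six seven eight nine ten"
chunks = split_into_chunks(text.split(), chunk_size=4, chunk_overlap=1)
for i, chunk in enumerate(chunks):
    print(i, " ".join(chunk))
```

Each chunk starts with the last word of the previous one, so a sentence cut at a boundary still appears whole in at least one neighbouring chunk.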

In [1]:
# Suppress Pydantic deprecation warnings raised through LlamaIndex
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

## Hard-coded stuff in this cell that will be replaced in the cloud function
* The OpenAI key will be an environment variable
* The Weaviate IP address, which we will work on finding programmatically
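
One possible shape for reading both values from the environment is sketched below. The variable name `WEAVIATE_IP_ADDRESS`, the `localhost` fallback, and the helper `load_settings` are assumptions for illustration, not part of the current code.

```python
import os


def load_settings(env=os.environ):
    """Read settings from a mapping and fail early with a clear message."""
    openai_key = env.get("OPENAI_API_KEY")
    if not openai_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Hypothetical variable name; the notebook currently hard-codes the IP address
    weaviate_ip = env.get("WEAVIATE_IP_ADDRESS", "localhost")
    return {"openai_key": openai_key, "weaviate_url": f"http://{weaviate_ip}:8080"}


# Passing a plain dict keeps the example independent of the real environment
settings = load_settings({"OPENAI_API_KEY": "sk-placeholder", "WEAVIATE_IP_ADDRESS": "10.0.0.5"})
print(settings["weaviate_url"])
```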

In [2]:
!pip install weaviate-client
!pip install openai
!pip install llama-index
!pip install python-dotenv

import weaviate
import pandas as pd
import os

from dotenv import load_dotenv
from datetime import datetime, timezone
# Suppress warnings before importing llama_index, since its Pydantic models can emit them at import time
import warnings
warnings.simplefilter(action='ignore', category=Warning)

from llama_index import Document, StorageContext, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index.vector_stores import WeaviateVectorStore
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

# Load the .env file
load_dotenv()

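# Retrieve the OpenAI API key from the environment variables
OPENAI_KEY = os.getenv("OPENAI_KEY")
if OPENAI_KEY is None:
    raise RuntimeError("OPENAI_KEY is not set; add it to the .env file or the environment.")

# Set the OpenAI key as an Environment Variable (for when it's run on GCS)
os.environ["OPENAI_API_KEY"] = OPENAI_KEY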

# Current Weaviate IP
WEAVIATE_IP_ADDRESS = "34.42.138.162"






In [3]:
schema = {
    "classes": [
        {
            "class": "Document",
            "description": "A full document of text from a scraped webpage with full details.",
            "invertedIndexConfig": {
                "indexTimestamps": True
            },
            "vectorizer": "text2vec-openai",
            "moduleConfig": {
                "generative-openai": {
                    "model": "gpt-3.5-turbo"
                }
            },
            "properties": [
                {
                    "name": "text",
                    "dataType": ["string"],
                    "description": "The content of the document.",
                    "indexInverted": True
                },
                {
                    "name": "websiteAddress",
                    "dataType": ["string"],
                    "description": "The address of the website this document comes from.",
                    "indexInverted": True
                },
                {
                    "name": "timestamp",
                    "dataType": ["date"],
                    "description": "The date and time when the document was scraped.",
                    "indexInverted": True
                }
            ]
        }
    ]
}

In [4]:
def create_date(date_string):
    """
    Convert a date string to RFC 3339 formatted string with timezone.

    Parameters:
    - date_string (str): Input date string in the format "%Y-%m-%dT%H-%M-%S".

    Returns:
    - str: RFC 3339 formatted date-time string.
    """
    dt_object = datetime.strptime(date_string, "%Y-%m-%dT%H-%M-%S")
    # convert datetime object to RFC 3339 string (with timezone)
    rfc3339_string = dt_object.replace(tzinfo=timezone.utc).isoformat()
    return rfc3339_string
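
For reference, the same conversion applied to the timestamp format used in the scraped filenames is shown below. The steps are inlined so the snippet runs on its own. Note that `strptime` returns a naive datetime, so attaching UTC is an assumption about the clock the scraper used.

```python
from datetime import datetime, timezone

raw = "2023-10-06T18-11-24"
# Parse the hyphen-separated time, then mark it as UTC (assumed)
parsed = datetime.strptime(raw, "%Y-%m-%dT%H-%M-%S").replace(tzinfo=timezone.utc)
rfc3339 = parsed.isoformat()
print(rfc3339)  # 2023-10-06T18:11:24+00:00
```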

In [5]:
client = weaviate.Client(url="http://" + WEAVIATE_IP_ADDRESS + ":8080")

# Delete existing schema (caution: this deletes the current structure)
client.schema.delete_all()

# Here we use the schema defined in cell 3 above.
client.schema.create(schema)
print("Schema was created.")

Schema was created.


## Hard-coded stuff in this cell that will be replaced in the cloud function
* data_directory will be the bucket
* csv_file will be the new file added to the bucket

In [6]:
data_directory = "./sample_data"
csv_file = 'ai21.com_2023-10-06T18-11-24.csv'
# Get the website address and timestamp from the filename.
# Split on the last underscore only, so the timestamp is always the final part.
websiteAddress, timestamp = csv_file.rsplit('.', 1)[0].rsplit('_', 1)

# Read in the CSV
df = pd.read_csv(os.path.join(data_directory, csv_file))

# Manually assemble the documents
documents = []
for _, row in df.iterrows():
    document = Document(
        text=row['text'],
        metadata={
            'websiteAddress': websiteAddress,
            'timestamp': timestamp
        }
    )
    document.doc_id = row['key']
    documents.append(document)

In [7]:
# Create the parser and nodes
parser = SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)

# Construct the vector store.
# Note: index_name="Pages" differs from the "Document" class defined in the schema above.
vector_store = WeaviateVectorStore(weaviate_client=client, index_name="Pages", text_key="text")
# Set up the storage for the embeddings
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Set up the index
index = VectorStoreIndex(nodes, storage_context=storage_context)

## Querying Stage

In this stage, the RAG pipeline retrieves the most pertinent context for a user's query and forwards it, along with the query, to the LLM to generate a response. This equips the LLM with current knowledge that wasn't included in its original training data. It also reduces the likelihood of hallucinations, where an LLM invents answers about topics its training data covered insufficiently. The main challenges in this phase revolve around retrieval, coordination, and analysis across one or several knowledge bases.
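
The retrieval step can be pictured with a toy example: filter nodes by metadata, rank the rest by cosine similarity to the query embedding, and join the best matches into a context string. This is a minimal sketch with hand-made three-dimensional vectors and a hypothetical `retrieve` helper; the real pipeline delegates this work to Weaviate and LlamaIndex.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, nodes, filters, top_k=2):
    """Keep nodes whose metadata matches every filter, then rank by similarity."""
    candidates = [
        n for n in nodes
        if all(n["metadata"].get(k) == v for k, v in filters.items())
    ]
    ranked = sorted(candidates, key=lambda n: cosine(query_vec, n["vector"]), reverse=True)
    return ranked[:top_k]


# Made-up nodes and vectors, purely for illustration
nodes = [
    {"text": "chunk A", "vector": [1.0, 0.0, 0.0], "metadata": {"websiteAddress": "ai21.com"}},
    {"text": "chunk B", "vector": [0.9, 0.1, 0.0], "metadata": {"websiteAddress": "ai21.com"}},
    {"text": "chunk C", "vector": [0.0, 1.0, 0.0], "metadata": {"websiteAddress": "ai21.com"}},
    {"text": "chunk D", "vector": [1.0, 0.0, 0.0], "metadata": {"websiteAddress": "example.com"}},
]
hits = retrieve([1.0, 0.0, 0.0], nodes, {"websiteAddress": "ai21.com"})
context_str = "\n".join(n["text"] for n in hits)
print(context_str)
```

Chunk D matches the query vector exactly but is excluded by the metadata filter, which is the same role the exact-match filters play in the query engine.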

## Hard-coded stuff in this cell that will be replaced in the cloud function
* The websiteAddress will come from the query string of the HTTPS request
* The timestamp will come from the query string of the HTTPS request
* The query will come from the query string of the HTTPS request
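
A sketch of pulling these three values out of a query string with the standard library follows. The parameter names `websiteAddress`, `timestamp`, and `query`, the example URL, and the helper `parse_request_url` are assumptions; the deployed function would more likely read them from the framework's request object.

```python
from urllib.parse import urlparse, parse_qs

# Assumed parameter names, mirroring the metadata keys used in this notebook
REQUIRED = ("websiteAddress", "timestamp", "query")


def parse_request_url(url):
    """Extract the required query-string parameters, or raise if any is missing."""
    params = parse_qs(urlparse(url).query)
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError(f"missing query parameters: {', '.join(missing)}")
    # parse_qs returns a list per key; take the first value of each
    return {name: params[name][0] for name in REQUIRED}


url = ("https://example.com/query?websiteAddress=ai21.com"
       "&timestamp=2023-10-06T18-11-24&query=How+was+AI21+Studio+a+game+changer%3F")
args = parse_request_url(url)
print(args)
```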

In [8]:
# Custom prompt that tells the model to decline when the answer is not in the retrieved context
from llama_index.prompts import PromptTemplate

template = ("We have provided context information below. If the answer to a query is not contained in this context, "
            "please only reply that it is not in the context."
            "\n---------------------\n"
            "{context_str}"
            "\n---------------------\n"
            "Given this information, please answer the question: {query_str}\n"
)
qa_template = PromptTemplate(template)
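
To see roughly what the LLM receives, the same template text can be filled in with plain `str.format`. This assumes `PromptTemplate` substitutes `{context_str}` and `{query_str}` in the same way; the context sentence below is a placeholder, not real retrieved text.

```python
# Same wording as the notebook's template
template = (
    "We have provided context information below. If the answer to a query is not contained in this context, "
    "please only reply that it is not in the context."
    "\n---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Given this information, please answer the question: {query_str}\n"
)

# Placeholder context; in the pipeline this would be the text of the retrieved nodes
prompt = template.format(
    context_str="AI21 Studio helped Verb.ai build a writing tool for authors.",
    query_str="How was AI21 Studio a game changer?",
)
print(prompt)
```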

In [9]:
# Create exact match filters for websiteAddress and timestamp
website_address_filter = ExactMatchFilter(key="websiteAddress", value="ai21.com")
timestamp_filter = ExactMatchFilter(key="timestamp", value="2023-10-06T18-11-24")

# Create a metadata filters instance with the above filters
metadata_filters = MetadataFilters(filters=[website_address_filter, timestamp_filter])

# Create a query engine with the custom prompt and filters
query_engine = index.as_query_engine(text_qa_template=qa_template,
                                     streaming=True,
                                     filters=metadata_filters)

# Execute the query
query_str = "How was AI21 Studio a game changer?"
streaming_response = query_engine.query(query_str)

# Print the response as it arrives
streaming_response.print_response_stream()

AI21 Studio was a game changer by helping Verb.ai create a revolutionary writing tool for authors. It improved brainstorming and expression, making the process of completing long-form narratives faster, easier, and more fun. AI21 Studio assisted with all key stages of creation, including brainstorming, writing, and editing. It provided a feature for planning the novel scene by scene and chapter by chapter, generating plot points and ideas to spark the writer's imagination. This tool proved to be incredibly useful for writers and kept them coming back for more.

In [10]:
from llama_index.schema import ObjectType

def extract_document_urls(streaming_response):
    """Return the IDs (here, page URLs) of the source documents behind a response."""
    urls = []
    for node_with_score in streaming_response.source_nodes:
        relationships = node_with_score.node.relationships
        for related_node_info in relationships.values():
            # Keep only relationships that point at a whole document
            if related_node_info.node_type == ObjectType.DOCUMENT:
                urls.append(related_node_info.node_id)
    return urls

extracted_urls = extract_document_urls(streaming_response)
print("The following websites were used as references:\n")
print(extracted_urls)

The following websites were used as references:

['https://www.ai21.com/blog/verb-ai-case-study', 'https://www.ai21.com/blog']


In [11]:
query_str = "Who is Kim Kardashian?"
streaming_response = query_engine.query(query_str)

# Print the response as it arrives
streaming_response.print_response_stream()

The information provided does not contain any information about Kim Kardashian.

In [12]:
extracted_urls = extract_document_urls(streaming_response)
print("The following websites were used as references:\n")
print(extracted_urls)

The following websites were used as references:

['https://www.ai21.com/blog/ubisoft-case-study', 'https://www.ai21.com/blog/retail-personalization-using-ai-will-transform-the-industry']
