This notebook guides you through the steps to integrate the vector database (Oracle Database 23ai in our case) and retrieve a list of text chunks that are close to the "question" in vector space. Then, we will use the most relevant text chunks to create an LLM prompt and ask the Oracle Generative AI Service to create a nicely worded response for us.

In [1]:
table_name = 'faqs'
topK = 3

sql = f"""select payload, vector_distance(vector, :vector, COSINE) as score
from {table_name}
order by score
fetch approx first {topK} rows only"""

In the SQL query above, topK is the number of results to retrieve. The query selects the payload column together with the cosine distance between the vector column of the table {table_name} and the bind parameter :vector, aliasing that distance as score.
Ordering by score (ascending) and using fetch approx first {topK} rows only returns the topK rows with the smallest cosine distance to the provided vector, that is, the closest matches. The approx keyword permits an approximate, index-based search when a vector index is available.
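
To make the ordering concrete, here is a minimal pure-Python sketch of what the query computes, assuming the usual definition of cosine distance as 1 minus cosine similarity. The helper names cosine_distance and top_k and the sample rows are illustrative only and are not part of the notebook; the database performs this work itself.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, rows, k):
    """rows is a list of (payload, vector); return k (payload, score) pairs, closest first."""
    scored = [(payload, cosine_distance(vec, query)) for payload, vec in rows]
    # Smaller distance means a closer match, so sort ascending like "order by score".
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Illustrative data: three 2-dimensional vectors compared with the query [1, 0].
rows = [
    ("same direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 1.0]),
    ("opposite", [-1.0, 0.0]),
]
hits = top_k([1.0, 0.0], rows, k=2)
```

A vector pointing in the same direction as the query has distance 0, an orthogonal one has distance 1, and an opposite one has distance 2, which is why the lowest scores are the best hits.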

In [2]:
# Define the query
question = 'What is Always Free?'

We will then define the user query.

In [3]:
# Connect to the Oracle Database 23ai
un = "vector"
pw = "vector"
cs = "localhost/FREEPDB1"

import oracledb

connection = oracledb.connect(user=un, password=pw, dsn=cs)

The oracledb.connect() function establishes a connection to an Oracle database using the provided credentials and connection details. It takes a user name, a password and a DSN. The DSN (data source name) specifies the host, an optional port and the database service name to connect to.
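
As a small illustration of the DSN format, the sketch below assembles an Easy Connect string of the form host:port/service_name. The helper build_dsn is hypothetical and not part of oracledb; it assumes the default listener port 1521, which is what the database uses when the port is omitted, as in "localhost/FREEPDB1".

```python
DEFAULT_PORT = 1521  # assumed default Oracle listener port when the DSN omits one

def build_dsn(host, service_name, port=DEFAULT_PORT):
    """Hypothetical helper: format an Easy Connect string host:port/service_name."""
    return f"{host}:{port}/{service_name}"

dsn = build_dsn("localhost", "FREEPDB1")
```

The resulting string could then be passed as the dsn argument, for example oracledb.connect(user=un, password=pw, dsn=dsn).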

In [4]:
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer('all-MiniLM-L12-v2')


We need an encoder to handle the vectorization for us. all-MiniLM-L12-v2 is a pre-trained sentence-embedding model. It is based on the MiniLM (Mini Language Model) architecture, a lightweight alternative to larger transformer models such as BERT.
Note: Ignore the warning saying "IProgress not found", among others.

In [5]:
# Retrieval Code
import array
import json

with connection.cursor() as cursor:
  embedding = list(encoder.encode(question))
  vector = array.array("f", embedding)

  results  = []

  for (info, score, ) in cursor.execute(sql, vector=vector):
      text_content = info.read()
      results.append((score, json.loads(text_content)))

Next, we write the retrieval code. We use the same encoder that was used for the text chunks to generate a vector representation of the question.
The SQL query is executed with this vector as the bind parameter. For each row returned, the code reads the payload, which is stored as JSON, parses it, and appends it to the results list together with its distance score.
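
The same loop can be tried without a database. The sketch below is an assumption-laden stand-in: embedding replaces the output of encoder.encode(question), and fake_rows replaces the cursor, using io.StringIO objects to mimic payloads that expose a read() method. The sample texts and scores are made up.

```python
import array
import io
import json

# Stand-in for encoder.encode(question): a short list of floats.
embedding = [0.25, -0.5, 1.0]
# "f" is the typecode for single-precision floats, as in the retrieval cell.
vector = array.array("f", embedding)

# Stand-in rows: (payload, score), where payload mimics a LOB with .read().
fake_rows = [
    (io.StringIO('{"text": "faq | first chunk", "path": "faq"}'), 0.34),
    (io.StringIO('{"text": "faq | second chunk", "path": "faq"}'), 0.41),
]

results = []
for info, score in fake_rows:
    text_content = info.read()                             # raw JSON string
    results.append((score, json.loads(text_content)))      # (distance, parsed payload)
```

Each entry in results is a (score, dict) pair, which matches the shape printed in the next cell.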


In [6]:
# Check results
import pprint
pprint.pp(results)

[(0.342059164223519,
  {'text': 'faq | What are Always Free services?\n'
           '\n'
           'Always Free services are part of Oracle Cloud Free Tier. Always '
           'Free services are available for an unlimited time. Some '
           'limitations apply. As new Always Free services become available, '
           'you will automatically be able to use those as well.\n'
           '\n'
           'The following services are available as Always Free:\n'
           '\n'
           'AMD-based Compute\n'
           'Arm-based Ampere A1 Compute\n'
           'Block Volume\n'
           'Object Storage\n'
           'Archive Storage\n'
           'Flexible Load Balancer\n'
           'Flexible Network Load Balancer\n'
           'VPN Connect\n'
           'Autonomous Data Warehouse\n'
           'Autonomous Transaction Processing\n'
           'Autonomous JSON Database\n'
           'NoSQL Database (Phoenix Region only)\n'
           'APEX Application Development\n'
           'Re

Next we print the results. We should have the "score" of each hit, which is essentially the distance in vector space between the question and the text chunk, as well as the metadata JSON embedded in each chunk.
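
Because the score is a distance, lower values mean closer matches. One optional refinement, not part of the original notebook, is to discard hits that are too far away. The function keep_close_hits and the 0.6 cutoff below are hypothetical and would need tuning for real data.

```python
def keep_close_hits(results, max_distance):
    """Keep (score, payload) pairs whose distance is at most max_distance."""
    return [(score, payload) for score, payload in results if score <= max_distance]

# Made-up sample in the same (score, payload) shape as results.
sample = [(0.34, {"text": "a"}), (0.58, {"text": "b"}), (0.91, {"text": "c"})]
close = keep_close_hits(sample, max_distance=0.6)
```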

In [7]:
from transformers import LlamaTokenizerFast
import sys

tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")


tokenizer.model_max_length = sys.maxsize

def truncate_string(string, max_tokens):
    # Tokenize the text and count the tokens
    tokens = tokenizer.encode(string, add_special_tokens=True) 
    # Truncate the tokens to a maximum length
    truncated_tokens = tokens[:max_tokens]
    # transform the tokens back to text
    truncated_text = tokenizer.decode(truncated_tokens)
    return truncated_text

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


Before sending anything to the LLM, we must ensure that our prompt does not exceed the maximum context length of the model. We are planning to use LLaMA 2, so the context is limited to 4,096 tokens. Note that the context is shared by the input tokens (the prompt) and the response.
The code above uses the Hugging Face Transformers library to tokenize text with the LlamaTokenizerFast class. The tokenizer is loaded from the pre-trained hf-internal-testing/llama-tokenizer, and its model_max_length attribute is set to sys.maxsize so that very large inputs can be tokenized without length constraints.
The truncate_string function takes a string and a maximum token count as inputs. It tokenizes the input string, truncates the tokenized sequence to the specified maximum length, and then decodes the truncated tokens back into a string. This function effectively shortens the text to a specified token limit while preserving its readable format, useful for tasks requiring length constraints on input text.
Note: Ignore the legacy warning.
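
The budgeting logic can be sketched without downloading a tokenizer. In the example below, truncate_words is a toy stand-in for truncate_string that treats every whitespace-separated word as one token, which real tokenizers do not do. The 4,096 figure comes from the text above, and the 1,000-token response allowance is an assumption for illustration.

```python
CONTEXT_WINDOW = 4096  # context length stated in the text

def prompt_budget(context_window, max_response_tokens):
    """Tokens left for the prompt once the response allowance is reserved."""
    return context_window - max_response_tokens

def truncate_words(text, max_tokens):
    """Toy stand-in for truncate_string: one whitespace-separated word = one token."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

budget = prompt_budget(CONTEXT_WINDOW, max_response_tokens=1000)
short = truncate_words("one two three four five", 3)
```

With these numbers the prompt, including instructions, question and sources, has 3,096 tokens available.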


In [8]:
import os

def loadFAQs(directory_path):
   faqs = {}

   for filename in os.listdir(directory_path):
      if filename.endswith(".txt"):  # assuming FAQs are in .txt files
         file_path = os.path.join(directory_path, filename)

         with open(file_path) as f:
            raw_faq = f.read()

         filename_without_ext = os.path.splitext(filename)[0]  # remove .txt extension
         faqs[filename_without_ext] = [text.strip() for text in raw_faq.split('=====')]

   return faqs

faqs = loadFAQs('./txt-docs')

docs = [{'text': filename + ' | ' + section, 'path': filename} for filename, sections in faqs.items() for section in sections]

We will now read, split and store the data.
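
To see what the split produces, here is a minimal sketch that applies the same '=====' separator to an in-memory string instead of a file. The FAQ text and the file name "faq" are made up for illustration.

```python
# Made-up FAQ content with two sections separated by '====='.
raw_faq = """What is A?

A is the first thing.
=====
What is B?

B is the second thing.
"""

filename = "faq"  # hypothetical file name without the .txt extension

# Same splitting and stripping as in loadFAQs.
sections = [text.strip() for text in raw_faq.split("=====")]

# Same document shape as in the cell above: the text is prefixed with the file name.
docs = [{"text": filename + " | " + section, "path": filename} for section in sections]
```

Each section becomes one document whose text starts with the file name, matching the 'faq | ...' prefix visible in the retrieved results.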

In [9]:
# Join the "text" field of every doc into a single string
docs_as_one_string = "\n=========\n".join([doc["text"] for doc in docs])
docs_truncated = truncate_string(docs_as_one_string, 1000)

We will truncate our chunks to 1000 tokens, to leave plenty of space for the rest of the prompt and the answer.

In [10]:
# Create the LLM Prompt
prompt = f"""\
    <s>[INST] <<SYS>>
    You are a helpful assistant named Oracle chatbot. 
    USE ONLY the sources below and ABSOLUTELY IGNORE any previous knowledge.
    Use Markdown if appropriate.
    Assume the customer is highly technical.
    <</SYS>> [/INST]

    [INST]
    Respond PRECISELY to this question: "{question}", USING ONLY the following information and IGNORING ANY PREVIOUS KNOWLEDGE.
    Include code snippets and commands where necessary.
    NEVER mention the sources, always respond as if you have that knowledge yourself. Do NOT provide warnings or disclaimers.
    =====
    Sources: {docs_truncated}
    =====
    Answer (Three paragraphs, maximum 50 words each, 90% spartan):
    [/INST]
    """

The prompt will include the retrieved top chunks, the question posed by the user, and the custom instructions.
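
Note that cell In [9] builds the sources string from every loaded FAQ section rather than from the results list. If the goal is to pass only the retrieved chunks, one possible approach is sketched below. The function sources_from_results is hypothetical, and it assumes results holds (score, payload) pairs whose payload has a "text" key, as in the printed output.

```python
def sources_from_results(results, separator="\n=========\n"):
    """Join the 'text' field of each (score, payload) hit, closest first."""
    ordered = sorted(results, key=lambda hit: hit[0])  # ascending distance
    return separator.join(payload["text"] for _, payload in ordered)

# Made-up hits in the same shape as results.
sample_results = [
    (0.41, {"text": "faq | second chunk"}),
    (0.34, {"text": "faq | first chunk"}),
]
sources = sources_from_results(sample_results)
```

The returned string could then be passed through truncate_string in place of docs_as_one_string before it is placed in the prompt.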

In [11]:
import oci
from LoadProperties import LoadProperties

# Setup basic variables
properties = LoadProperties()

# Use Instance Principals for Authentication
signer = oci.auth.signers.InstancePrincipalsSecurityTokenSigner()

generative_ai_inference_client = oci.generative_ai_inference.GenerativeAiInferenceClient(
    config={},
    signer=signer,
    service_endpoint=properties.getEndpoint(),
    retry_strategy=oci.retry.NoneRetryStrategy(),
    timeout=(10, 240),
)
chat_detail = oci.generative_ai_inference.models.ChatDetails()
chat_request = oci.generative_ai_inference.models.CohereChatRequest()
chat_request.message = prompt
chat_request.max_tokens = 1000
chat_request.temperature = 0.0
chat_request.frequency_penalty = 0
chat_request.top_p = 0.75
chat_request.top_k = 0

chat_detail.serving_mode = oci.generative_ai_inference.models.OnDemandServingMode(model_id=properties.getModelName())
chat_detail.chat_request = chat_request
chat_detail.compartment_id = properties.getCompartment()
chat_response = generative_ai_inference_client.chat(chat_detail)

We will now call the OCI Generative AI Chat model and store the chat model’s response in chat_response.

In [12]:
pprint.pp(
    chat_response.data.chat_response.chat_history[1].message
)

('Always Free is a program offered by Oracle Cloud. It provides users with '
 'access to a range of services that are completely free to use indefinitely. '
 'These include compute, storage, and networking resources. \n'
 '\n'
 'Always Free is ideal for developers who want to experiment, build, and test '
 'applications in the cloud without incurring any costs. The program is '
 'available to anyone, and you can sign up for it on the Oracle Cloud Free '
 'Tier webpage.')


The response text is extracted from the chat history and pretty-printed in a readable format.