# Chatbot v1.1

- Embedding model: OpenAI 
- LLM model: OpenAI

## Requirements

In [2]:
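# !pip install langchain
# !pip install langchain-community
# !pip install langchain-openai
# !pip install langchain-text-splitters
# !pip install python-dotenv
# !pip install chromadb
# !pip install pypdf
# !pip install pytest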

In [19]:
import dotenv
import os

dotenv.load_dotenv()
openai_api_key=os.getenv("openai_api_key")

## Embedding Function

In [1]:
# Directory where the Chroma index is persisted.
chromaPath = 'chroma/open'

In [20]:
from langchain_community.embeddings.bedrock import BedrockEmbeddings
from langchain_community.embeddings.ollama import OllamaEmbeddings
from langchain_community.embeddings.openai import OpenAIEmbeddings
from langchain_community.embeddings.huggingface import HuggingFaceInferenceAPIEmbeddings

def get_embedding_function():
    """Return the OpenAI embedding client shared by indexing and querying."""
    embeddings = OpenAIEmbeddings(
        # model="text-embedding-ada-002",
        api_key=openai_api_key
    )
    return embeddings


In [21]:
embeddings=get_embedding_function()
text = "This is a test document."
query_result = embeddings.embed_query(text)
query_result[:3]

[-0.00318459689536844, 0.0110777294721545, -0.0041049622618212454]
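
The vector above is only useful in comparison with other vectors. As a minimal sketch, one common way to compare two embeddings is cosine similarity; the toy 3-dimensional vectors below are assumptions standing in for real embeddings, and the metric a vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical); real embeddings have far more dimensions.
doc_vec = [1.0, 0.0, 1.0]
query_vec = [1.0, 0.0, 0.0]
print(cosine_similarity(doc_vec, query_vec))
```

Values near 1 indicate vectors pointing in nearly the same direction; values near 0 indicate unrelated directions.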

## Chroma-DB

In [4]:
import argparse
import os
import shutil
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
# from get_embedding_function import get_embedding_function
from langchain.vectorstores.chroma import Chroma

In [5]:
from langchain.document_loaders.pdf import PyPDFDirectoryLoader

def load_documents():
    DATA_PATH = 'Data'
    document_loader = PyPDFDirectoryLoader(DATA_PATH)
    return document_loader.load()


In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema.document import Document

def split_document(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=80,
        length_function=len,
        is_separator_regex=False
    )
    return text_splitter.split_documents(documents)
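
To see what `chunk_size=800` and `chunk_overlap=80` mean, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); the helper name `fixed_window_chunks` is hypothetical and only illustrates the size and overlap arithmetic.

```python
def fixed_window_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 80) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


# 2000 characters of repeating letters as stand-in document text.
sample_text = "".join(chr(97 + i % 26) for i in range(2000))
pieces = fixed_window_chunks(sample_text)
print([len(p) for p in pieces])
```

Each window starts 720 characters after the previous one, so the last 80 characters of one chunk reappear at the start of the next.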

In [7]:
documents=load_documents()
chunks=split_document(documents)
print(chunks[0])

page_content='How update a certificate into WAF ?\nCurrently below endpoints are protected by WAF.\nProduction:\nhttps://aero-suite-prod-airarabia.accelaero.com/\nhttps://aero-pay-prod-airarabia.accelaero.com\nhttps://aero-pay-callbackapi-prod-airarabia.accelaero.com\nStagingX\nhttps://aero-suite-stage1-airarabia.isaaviation.net/ \nWhat is the procedure to upload certificate into WAF ?\n \n1. Log into Oracle Console and navigate to Edge Policy Resources\n   Oracle Console => Security and Identity => Web Application Firewall => OCI Edge Policy Resources\n    Example: Oracle \n2. Create a Certificate by providing SSL Certificate and Private Key.\nWe need to do below additional step in our use cases as Oracle is unable to identify the encrypted version.' metadata={'source': 'Data/AVN-How update a certificate into WAF _-020524-111955.pdf', 'page': 0}


In [8]:
def calculate_chunk_ids(chunks):

    # This will create IDs like "data/monopoly.pdf:6:2"
    # Page Source : Page Number : Chunk Index

    last_page_id = None
    current_chunk_index = 0

    for chunk in chunks:
        source = chunk.metadata.get("source")
        page = chunk.metadata.get("page")
        current_page_id = f"{source}:{page}"

        # If the page ID is the same as the last one, increment the index.
        if current_page_id == last_page_id:
            current_chunk_index += 1
        else:
            current_chunk_index = 0

        # Calculate the chunk ID.
        chunk_id = f"{current_page_id}:{current_chunk_index}"
        last_page_id = current_page_id

        # Add it to the page meta-data.
        chunk.metadata["id"] = chunk_id

    return chunks
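
The numbering rule can be checked without loading any PDFs. The sketch below repeats the same logic on a hypothetical `FakeChunk` class that stands in for a LangChain `Document`; the file name `Data/a.pdf` is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a Document; only metadata matters here."""
    metadata: dict = field(default_factory=dict)


def assign_ids(chunks):
    """Same rule as calculate_chunk_ids: restart the index on each new page."""
    last_page_id = None
    index = 0
    for chunk in chunks:
        page_id = f"{chunk.metadata.get('source')}:{chunk.metadata.get('page')}"
        index = index + 1 if page_id == last_page_id else 0
        chunk.metadata["id"] = f"{page_id}:{index}"
        last_page_id = page_id
    return chunks


sample = [
    FakeChunk({"source": "Data/a.pdf", "page": 0}),
    FakeChunk({"source": "Data/a.pdf", "page": 0}),
    FakeChunk({"source": "Data/a.pdf", "page": 1}),
]
ids = [c.metadata["id"] for c in assign_ids(sample)]
print(ids)  # ['Data/a.pdf:0:0', 'Data/a.pdf:0:1', 'Data/a.pdf:1:0']
```

Note that the scheme assumes chunks from the same page arrive next to each other; if the list were shuffled, two chunks of one page could receive the same ID.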

In [9]:
from langchain.vectorstores.chroma import Chroma

def add_to_chroma(chunks: list[Document]):
    # Load the existing database.
    db = Chroma(
        persist_directory=chromaPath, embedding_function=get_embedding_function()
    )

    # Calculate chunk IDs.
    chunks_with_ids = calculate_chunk_ids(chunks)

    # Fetch the IDs already stored so that existing chunks can be skipped.
    existing_items = db.get(include=[])  # IDs are always included by default
    existing_ids = set(existing_items["ids"])
    print(f"Number of existing documents in DB: {len(existing_ids)}")

    # Only add documents that don't exist in the DB.
    new_chunks = []
    for chunk in chunks_with_ids:
        if chunk.metadata["id"] not in existing_ids:
            new_chunks.append(chunk)

    if len(new_chunks):
        print(f"👉 Adding new documents: {len(new_chunks)}")
        new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
        db.add_documents(new_chunks, ids=new_chunk_ids)
        db.persist()
    else:
        print("✅ No new documents to add")
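
The filtering step in `add_to_chroma` boils down to a set-membership test on IDs. A minimal sketch with made-up IDs (the helper name `select_new_ids` is hypothetical):

```python
def select_new_ids(candidate_ids, existing_ids):
    """Return the candidate IDs that are not stored yet, keeping their order."""
    existing = set(existing_ids)  # set lookup keeps the check fast
    return [cid for cid in candidate_ids if cid not in existing]


# Hypothetical IDs in the source:page:index format used above.
existing_ids = ["Data/a.pdf:0:0", "Data/a.pdf:0:1"]
candidate_ids = ["Data/a.pdf:0:0", "Data/a.pdf:0:1", "Data/a.pdf:1:0"]
new_ids = select_new_ids(candidate_ids, existing_ids)
print(new_ids)  # ['Data/a.pdf:1:0']
```

Because only the ID is compared, a chunk whose text changed but whose ID stayed the same is skipped rather than updated; resetting the database is the way to pick up such edits.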

In [10]:
def clear_database():
    if os.path.exists(chromaPath):
        shutil.rmtree(chromaPath)

In [11]:
def main(mode=None):
    """Build or update the Chroma store; pass 'reset' to wipe it first."""
    if mode == 'reset':
        print("✨ Clearing Database")
        clear_database()
    # Create (or update) the data store.
    documents = load_documents()
    chunks = split_document(documents)
    add_to_chroma(chunks)

In [12]:
main(1)  # any value other than 'reset' keeps the existing database

Number of existing documents in DB: 65
✅ No new documents to add


## Query Data

In [13]:
import argparse
from langchain.vectorstores.chroma import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain_community.llms.ollama import Ollama


In [14]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

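As a rough illustration of how the two placeholders get filled, the sketch below uses plain `str.format` instead of `ChatPromptTemplate` (which additionally prefixes the message with its role). The chunk texts are hypothetical stand-ins for retrieved passages.

```python
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

# Hypothetical retrieved passages; the real ones come from the vector store.
retrieved_chunks = [
    "Grafana is a data visualization tool.",
    "Grafana listens on port 3000 by default.",
]

# Passages are joined with the same separator that query_rag uses.
context_text = "\n\n---\n\n".join(retrieved_chunks)
prompt = PROMPT_TEMPLATE.format(context=context_text, question="What is Grafana")
print(prompt)
```
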
In [22]:
from langchain_openai import OpenAI

def query_rag(query_text: str):
    # Prepare the DB.
    embedding_function = get_embedding_function()
    db = Chroma(persist_directory=chromaPath, embedding_function=embedding_function)

    # Search the DB.
    results = db.similarity_search_with_score(query_text, k=5)

    context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt = prompt_template.format(context=context_text, question=query_text)
    print(prompt)

    model = OpenAI(openai_api_key=openai_api_key, temperature=0.5)
    response_text = model.invoke(prompt)

    sources = [doc.metadata.get("id", None) for doc, _score in results]
    formatted_response = f"Response: {response_text}\nSources: {sources}"
    print(formatted_response)
    return response_text
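
`query_rag` passes all five hits to the model and ignores `_score`. One possible refinement, sketched below, is to drop weak matches first. This assumes the score is a distance where lower means closer, which is my understanding of the Chroma integration; the cutoff `0.5`, the helper name, and the sample pairs are all hypothetical.

```python
def keep_close_matches(results, max_distance):
    """Keep (text, distance) pairs whose distance is at or below the cutoff."""
    return [(text, dist) for text, dist in results if dist <= max_distance]


# Hypothetical (chunk text, distance) pairs; real results hold Document objects.
results = [
    ("Grafana is an analytics tool.", 0.21),
    ("Open your browser and go to localhost:3000", 0.38),
    ("Unrelated certificate instructions.", 0.74),
]
close = keep_close_matches(results, max_distance=0.5)
context_text = "\n\n---\n\n".join(text for text, _dist in close)
print(context_text)
```

A suitable cutoff depends on the embedding model and the distance metric, so it would need to be tuned against real queries.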

In [23]:
# Uses the default model of langchain_openai.OpenAI, since query_rag sets no model name
query_rag("What is Grafana")

Human: 
Answer the question based only on the following context:

Grafana
 
Grafana is an analytics and interactive data visualization tool that is multi-platform and open-source. Mainly used when large amounts of 
data are needed to be monitored. The data is shown as time series analytics meaning shown using timestamps and these data can be 
taken from many data sources such as Prometheus, Graphite, ElasticSearch, MySQL etc. Grafana provides the capability of creating 
dashboards which can be customized by the user to include graphs, charts, stats and many more.
 
 
Step 1:
 
Install Grafana in a Docker container.
An example of this is shown below.
Have port 3000 mapped for Grafana since it is the default port number.
 
Step 2:
 
Open your browser and go to localhost:3000

---

Step 2:
 
Open your browser and go to localhost:3000
You will see the above page and type ‘admin’ for both username and password.What is Grafana?
Getting Started
1docker run -it -p 3000:3000 --net=host grafana/

'\nGrafana is an analytics and interactive data visualization tool that is multi-platform and open-source. It is used for monitoring large amounts of data from various sources, such as Prometheus, Graphite, ElasticSearch, and MySQL, and displaying it as time series analytics. Grafana also allows users to create customized dashboards with graphs, charts, and statistics.'

In [17]:
query_rag("What is Prometheus")

Human: 
Answer the question based only on the following context:

Prometheus
 
            Prometheus, originally founded in 2012, is an open source system monitoring and alerting tool that has a active and large 
developer/user community. Prometheus is capable of collecting and storing metrics (numeric measurements) generated by the systems 
as time series data, meaning data are stored with the timestamp as well as alongside key-value pair optional label.  
 
 
 
Prometheus Server will store the metrics that are scraped from Prometheus Targets and a Pushgateway will be needed as an 
intermediary for Short-lived Jobs.
Alertmanager can be used for sending alerts to users.
These metrics can be finally shown in the Prometheus web UI itself or third party tools such as Grafana.

---

Multi-dimensional data model with time series data identified by metric name and key/value pairs.
PromQL, a flexible query language to leverage this dimensionality.What is Prometheus?
Architecture
Features

--

'\nPrometheus is an open source system monitoring and alerting tool that is capable of collecting and storing metrics generated by systems as time series data. It has a multi-dimensional data model with a flexible query language and can be used for monitoring and alerting in single server nodes. It supports both pull and push models for collecting time series data and can be integrated with third party tools such as Grafana for visualization.'