<a href="https://colab.research.google.com/github/popolome/SmartDoc-A-Private-AI-Assistant-for-Your-Company-Files/blob/main/Enterprise_Ready_RAG_Assistant_with_Intent_Guardrails.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
* This notebook uses a **hybrid-cloud architecture**: **Colab's L4 GPU** handles vectorization, and **Groq's API** handles low-latency inference.
* Hardware requirements are minimal: less than 8 GB of VRAM and 16 GB of RAM is enough, because the API does the heavy lifting.
* I fine-tuned the guardrail prompt to handle various shortcomings of the chatbot.
* I also raised the retriever's `k` from 3 to 10 so the bot sees more context for each question.
* I added memory and streaming so the output flows more naturally and the bot remembers the past 3 exchanges.
* I tested the chatbot in conversation and deployed it to Streamlit.

# Install dependencies

In [None]:
%%writefile requirements.txt
langchain
langchain-classic
langchain-groq
langchain-community
langchain-huggingface
langchain-text-splitters
langchain-chroma
langchain-core
chromadb
pypdf
sentence-transformers
streamlit
python-dotenv

In [None]:
!pip install -r requirements.txt

# Set up Groq API Key

In [None]:
import os
from getpass import getpass

# Use getpass to hide your Groq API key as you type it
os.environ["GROQ_API_KEY"] = getpass("Enter your Groq API Key here: ")
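
For non-interactive runs, such as the Streamlit deployment, the key could be read from the environment first and prompted for only when it is missing. The following is a minimal sketch of that idea; `resolve_api_key` and its parameters are hypothetical names introduced for illustration, not part of any library.

```python
import os
from getpass import getpass


def resolve_api_key(name="GROQ_API_KEY", env=None, prompt=getpass):
    """Return the key from the environment, prompting only when it is absent.

    Hypothetical helper for illustration; `env` and `prompt` are injectable
    so the function can be exercised without a real terminal.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        # Fall back to a hidden interactive prompt
        key = prompt(f"Enter your {name}: ").strip()
    if not key:
        raise ValueError(f"{name} is required")
    return key
```

In the notebook this would be used as `os.environ["GROQ_API_KEY"] = resolve_api_key()`.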

# Initialize the Embedding Model (Hugging Face)

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

# This converts text into numerical vectors so the vector store can find
# chunks that are semantically similar to a question
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

print("Embedding model loaded successfully!")

# Initialize the Language Model (Processor)

In [None]:
# We use Groq LPU here for fast inference
from langchain_groq import ChatGroq

# Low temperature = less random output; high temperature = more random output
llm = ChatGroq(
    model_name="llama-3.1-8b-instant",
    temperature=0,
    max_tokens=None,
    timeout=30,
    max_retries=2
)

print("LLM initialized with ChatGroq!")
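
To make the effect of temperature concrete, here is a small sketch of temperature-scaled softmax over toy scores. It illustrates the general idea only and is an assumption about how sampling typically works, not a description of Groq's internals. As the temperature approaches 0, almost all probability moves to the highest-scoring token.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # toy scores for three candidate tokens
low = softmax_with_temperature(logits, 0.1)   # sharply peaked distribution
high = softmax_with_temperature(logits, 2.0)  # flatter distribution
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```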

# Upload Apple_10K_Report.pdf (or any other PDF) to the Colab folder
* The folder is in the Files panel on the left.

# Load and chunk the document

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Use PyPDFLoader to load our pdf report
loader = PyPDFLoader("Apple_10K_Report.pdf")
docs = loader.load()

# Use RecursiveCharacterTextSplitter to split the text into chunks of up to
# 1000 characters, with 100 characters of overlap between neighboring chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(docs)

print(f"Document loaded. Total chunks created: {len(chunks)}")
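
The roles of `chunk_size` and `chunk_overlap` can be seen in a plain sliding-window sketch. This is a simplification written for illustration: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and spaces, which the hypothetical `sliding_window_chunks` below does not do.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
parts = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(parts)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.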

# Create Vector store

In [None]:
from langchain_chroma import Chroma

# Convert our chunks from earlier into vectors
# Save the vector database as well
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./apple_chroma_db"
)

print("Vector database has been created and saved!")
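
Conceptually, a retriever ranks stored vectors by how close they are to the query vector and returns the top `k`. The sketch below shows that idea with hand-written toy vectors and cosine similarity. Both are assumptions made for illustration; the metric Chroma actually applies depends on how the collection is configured, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the texts of the k entries most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index: (chunk text, made-up 3-dimensional embedding)
toy_index = [
    ("net sales by segment", [0.9, 0.1, 0.0]),
    ("supply chain risks", [0.1, 0.9, 0.1]),
    ("share repurchases", [0.2, 0.1, 0.9]),
]
results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print(results)
# → ['net sales by segment', 'supply chain risks']
```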

# Create the Guardrail Chatbot

In [None]:
from langchain_classic.memory import ConversationBufferWindowMemory
from langchain_classic.chains import ConversationalRetrievalChain
from langchain_core.prompts import PromptTemplate

# This is the guardrail template: it gives the bot a fallback reply for
# questions it cannot answer from the context, which reduces hallucination
template = """
### ROLE
You are Popo, a Senior Financial Analyst specializing in Apple Inc. Your tone is professional, objective, and precise.

### INSTRUCTIONS
1. **Scope Control**: Use ONLY the provided context and chat history. If the information isn't there, say: "I'm sorry, I only have the ability to answer questions about the provided Apple 10-K report."
2. **Precision**: Always specify the exact fiscal year (e.g., 'In fiscal year 2025...').
3. **Social Guardrail**: If the user greets you or says 'thank you', respond warmly as Popo and offer to assist with further analysis of the 10-K.
4. **Context Awareness**: Use the history to handle follow-up questions accurately.
5. **Ethical Boundary**: Strictly refuse to give personal investment advice. If asked, politely explain that your expertise is limited to analyzing the facts within the Apple 10-K report and suggest the user consult a certified financial advisor.
6. **Formatting**: Use bullet points for lists of risks or financial metrics to improve readability.
7. **Identity**: Do not assume the user's name or identity. Address the user respectfully as "User" or simply dive into the analysis unless they explicitly introduce themselves.

### CONTEXT
{context}

### CHAT HISTORY
{chat_history}

### USER QUERY
{question}

### POPO's ANALYSIS:
"""

qa_prompt = PromptTemplate(
    template=template,
    input_variables=['context', 'chat_history', 'question']
)

# This is the memory for the chatbot; it remembers the past 3 exchanges
memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=3,
    return_messages=True,
    output_key='answer'
)

# The ConversationalRetrievalChain that powers the Apple bot
# It uses the Groq-hosted Llama 3.1 model as the LLM, retrieves the top 10
# chunks from the vector DB, and follows my prompt instead of the default
apple_bot = ConversationalRetrievalChain.from_llm(
    llm=llm,
    chain_type='stuff',
    retriever=vector_db.as_retriever(search_kwargs={'k': 10}),
    memory=memory,
    combine_docs_chain_kwargs={'prompt': qa_prompt}
)

print("Popo is online and ready!")
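
The windowed memory above keeps only the most recent `k` exchanges and drops older ones. A rough sketch of that behavior using a bounded `deque` follows. `WindowMemory` is a hypothetical class written for illustration and is not LangChain's implementation.

```python
from collections import deque


class WindowMemory:
    """Keep only the most recent k (question, answer) exchanges."""

    def __init__(self, k=3):
        # A deque with maxlen discards the oldest item automatically
        self.turns = deque(maxlen=k)

    def save(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        """Render the remembered exchanges as a plain-text history."""
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)


window = WindowMemory(k=3)
for i in range(1, 6):
    window.save(f"question {i}", f"answer {i}")

# Only exchanges 3, 4, and 5 remain after five saves
print(window.as_text())
```

A small window keeps the prompt short, which matters here because the retriever already adds 10 chunks of context to every request.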

# Test the Guardrails

In [None]:
import sys
import time

print("--- Apple 10-K Financial Analyst (v1.0) ---")
print("Hello! My name is Popo, how may I assist you today? Type 'exit' to quit.")

while True:
  user_query = input("\nYou: ")

  if user_query.lower() in ['exit', 'quit', 'q']:
    print("Goodbye! Popo is signing off.")
    break

  # Show a temporary status line until the reply arrives
  print("Popo is thinking...", end="\r", flush=True)

  try:
    first_chunk = True

    # Loop through each chunk in the stream and look for the answer
    for chunk in apple_bot.stream({"question": user_query}):
      if 'answer' in chunk:
        if first_chunk:
          # Clear the status line before printing the reply
          print(" " * 20, end="\r")
          print("Popo: ", end="")
          first_chunk = False

        # Typewriter effect: a lower time.sleep value prints faster
        for char in chunk['answer']:
          sys.stdout.write(char)
          sys.stdout.flush()
          time.sleep(0.01)

    # End the line once the answer has been printed
    print()

  except Exception as e:
    if "429" in str(e):
      print("\n[Rate Limit] Popo needs a short breather, pausing for 5 seconds...")
      time.sleep(5)
    else:
      print(f"\nAn error has occurred: {e}")

  print("-" * 50)
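
The loop above pauses once when it sees a rate-limit error and then waits for the next question. One alternative is to retry the same call with exponentially growing delays. The sketch below assumes, as the loop does, that a rate-limit error can be recognized by "429" in its message; `call_with_backoff` and its parameters are hypothetical names, not part of LangChain or the Groq client.

```python
import time


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with doubling delays on errors that mention '429'.

    Hypothetical helper for illustration; `sleep` is injectable for testing.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            is_rate_limit = "429" in str(exc)
            is_last_attempt = attempt == max_attempts - 1
            if not is_rate_limit or is_last_attempt:
                raise  # unrelated error, or retries exhausted
            sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, ...
```

A possible use would be `call_with_backoff(lambda: apple_bot.invoke({"question": user_query}))`. Note that `ChatGroq` is already configured with `max_retries=2` above, so this would add a second, application-level layer of retries.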

# Export the vector database for deployment

In [None]:
import shutil

# This is for zipping my vector database for download
shutil.make_archive('apple_chroma_db_export', 'zip', 'apple_chroma_db')
print("Database zipped! Look for 'apple_chroma_db_export.zip' in your file menu.")
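
On the deployment side, the archive has to be unpacked again before the database can be opened. The sketch below does a zip and unzip round trip on a throwaway folder using only the standard library. The folder and file names are placeholders, and `roundtrip_archive` is a hypothetical helper; it only demonstrates that `shutil.unpack_archive` reverses `shutil.make_archive`.

```python
import shutil
import tempfile
from pathlib import Path


def roundtrip_archive(source_dir, work_dir):
    """Zip source_dir, unpack it into work_dir/restored, and return that path."""
    archive_base = Path(work_dir) / "export"
    archive_path = shutil.make_archive(str(archive_base), "zip", root_dir=source_dir)
    restored = Path(work_dir) / "restored"
    shutil.unpack_archive(archive_path, extract_dir=restored)
    return restored


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the real database folder
    src = Path(tmp) / "toy_db"
    src.mkdir()
    (src / "index.bin").write_bytes(b"placeholder")

    restored = roundtrip_archive(src, tmp)
    restored_names = sorted(p.name for p in restored.iterdir())
    print(restored_names)
    # → ['index.bin']
```

In the deployed app, the unpacked folder would be the one to pass as `persist_directory`, together with the same embedding model used to build the database; that wiring is not shown here.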