# RAG Chatbot Demo

This notebook demonstrates how to use the RAG (Retrieval-Augmented Generation) chatbot for document processing and question answering. The chatbot can process Word documents, Excel files, and PDFs, and answer questions based on their content using OpenAI's language models.

## Setup

First, let's set up our environment and import the necessary modules.

In [1]:
import os
import sys
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Add the parent directory to the path to import our modules
sys.path.append('..')

# Import our RAG chatbot modules
from utils.document_processor import DocumentProcessor
from utils.vector_store import VectorStore
from rag_chatbot import RAGChatbot

print("Modules imported successfully!")

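ModuleNotFoundError: No module named 'langchain_text_splitters'

Note: the recorded run of this cell failed with the error above. Install the missing dependency in the notebook's environment (`pip install langchain-text-splitters`) and rerun the cell before continuing.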
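
Because the import cell stops at the first missing dependency, it can help to check up front which modules are importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the `required` list contains only the module named in the error above, so extend it to match your environment.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path.

    Hypothetical helper for illustration; it is not part of the chatbot code.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# Only the module from the recorded error is listed; add others as needed.
required = ["langchain_text_splitters"]
missing = missing_modules(required)

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```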

## Setting up the OpenAI API Key

To use the RAG chatbot, you need an OpenAI API key. You can set it as an environment variable or provide it directly to the chatbot.

In [None]:
# Prefer a key that is already set in the environment so it is not overwritten.
# If none is set, fall back to the placeholder below (replace it with your actual key).
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "your-openai-api-key")

# Set the API key as an environment variable
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print("API key set!")
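
The cell above prints "API key set!" even when the key is still the placeholder. As a hedged sketch, a small check like the one below could catch that before any request is sent. The helper name `api_key_is_usable` is hypothetical, and it only tests that a non-placeholder value is present, not that the key is valid.

```python
import os

# The placeholder string used in the cell above.
PLACEHOLDER = "your-openai-api-key"


def api_key_is_usable(env=None):
    """Return True if OPENAI_API_KEY is set to something other than the placeholder.

    Hypothetical helper: it does not contact OpenAI, so a True result
    does not prove that the key is accepted by the API.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    return bool(key) and key != PLACEHOLDER


if not api_key_is_usable():
    print("OPENAI_API_KEY is missing or still set to the placeholder.")
```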

## Creating Sample Documents

For demonstration purposes, let's create some sample text files that we can use to test the chatbot. In a real-world scenario, you would use your own PDF, Word, or Excel files.

In [None]:
import os

# Create a sample directory
sample_dir = "./sample_docs"
os.makedirs(sample_dir, exist_ok=True)

# Create a sample text file about artificial intelligence
ai_content = """
# Artificial Intelligence Overview

Artificial Intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using the rules to reach approximate or definite conclusions), and self-correction.

## Machine Learning

Machine Learning is a subset of AI that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow the computers to learn automatically without human intervention or assistance and adjust actions accordingly.

## Deep Learning

Deep Learning is a subset of machine learning that uses neural networks with many layers (hence "deep") to analyze various factors of data. Deep learning is a key technology behind driverless cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost.

## Natural Language Processing

Natural Language Processing (NLP) is a field of AI that gives computers the ability to understand text and spoken words in much the same way human beings can. NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models.
"""

# Write as UTF-8 explicitly: the text contains non-ASCII characters
with open(os.path.join(sample_dir, "ai_overview.txt"), "w", encoding="utf-8") as f:
    f.write(ai_content)

# Create a sample text file about data science
ds_content = """
# Data Science Fundamentals

Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science.

## Data Analysis

Data Analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains.

## Big Data

Big Data refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source.

## Data Visualization

Data Visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers. Visualizing data is an important step in data analysis and is critical for understanding patterns, trends, and outliers in data.
"""

with open(os.path.join(sample_dir, "data_science.txt"), "w", encoding="utf-8") as f:
    f.write(ds_content)

print(f"Created sample documents in {sample_dir}")

## Document Processing

Now, let's explore how the document processor works. The `DocumentProcessor` class is responsible for extracting text from different document types and splitting it into chunks.

In [None]:
# Initialize the document processor
processor = DocumentProcessor(chunk_size=500, chunk_overlap=100)
print("Document processor initialized with chunk_size=500, chunk_overlap=100")

# Process a sample document
ai_file_path = os.path.join(sample_dir, "ai_overview.txt")
chunks = processor.process_file(ai_file_path)

print(f"Processed {ai_file_path}")
print(f"Extracted {len(chunks)} chunks")

# Display the first chunk
print("\nFirst chunk:")
print(chunks[0])
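
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal character-based sliding-window sketch. It is not the `DocumentProcessor` implementation, which is not shown in this notebook and may well split on separators such as paragraphs or sentences. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size character windows that share an overlap.

    Illustrative only: a real splitter typically also respects
    paragraph or sentence boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.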

## Vector Database

Next, let's see how the vector database works. The `VectorStore` class uses Chroma to store and retrieve document embeddings.

In [None]:
# Initialize the vector store
vector_store = VectorStore(persist_directory="./demo_chroma_db")
print("Vector store initialized")

# Add the chunks to the vector store
metadatas = [{"source": "ai_overview.txt"} for _ in chunks]
ids = vector_store.add_texts(chunks, metadatas)
print(f"Added {len(ids)} chunks to vector store")

# Process and add another document
ds_file_path = os.path.join(sample_dir, "data_science.txt")
ds_chunks = processor.process_file(ds_file_path)
ds_metadatas = [{"source": "data_science.txt"} for _ in ds_chunks]
ds_ids = vector_store.add_texts(ds_chunks, ds_metadatas)
print(f"Added {len(ds_ids)} chunks from data science document")

# Perform a similarity search
query = "What is machine learning?"
results = vector_store.similarity_search(query, k=2)

print(f"\nSimilarity search results for query: '{query}'")
for i, doc in enumerate(results, start=1):
    print(f"Result {i}:")
    print(f"Source: {doc.metadata['source']}")
    print(f"Content: {doc.page_content[:150]}...")
    print()
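
Conceptually, a similarity search turns the query into a vector, scores it against the stored vectors, and returns the `k` best matches. The sketch below illustrates that idea with toy word-count vectors and cosine similarity. It is an assumption-laden stand-in: the real `VectorStore` uses Chroma with learned embeddings, and the names `embed`, `cosine_similarity`, and `top_k` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, documents, k=2):
    """Return the k documents most similar to the query, best match first."""
    query_vector = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


docs = [
    "machine learning is a subset of ai",
    "big data refers to very large data sets",
    "deep learning is a subset of machine learning",
]
best = top_k("what is machine learning", docs, k=2)
print(best)
```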

## RAG Chatbot

Now, let's use the RAG chatbot to answer questions based on our documents.

In [None]:
# Initialize the RAG chatbot
chatbot = RAGChatbot(persist_directory="./demo_rag_db")
print("RAG chatbot initialized")

# Load the sample documents
file_paths = [
    os.path.join(sample_dir, "ai_overview.txt"),
    os.path.join(sample_dir, "data_science.txt")
]
num_chunks = chatbot.load_documents(file_paths=file_paths)
print(f"Loaded {num_chunks} chunks from {len(file_paths)} documents")

## Asking Questions

Let's ask the chatbot some questions and see how it responds based on the documents we've loaded.

In [None]:
# Define some questions to ask
questions = [
    "What is artificial intelligence?",
    "Explain machine learning and how it relates to AI.",
    "What is data science and how does it differ from AI?",
    "What are the main components of data analysis?",
    "How are deep learning and natural language processing related?"
]

# Ask each question and display the answer
for i, question in enumerate(questions, start=1):
    print(f"Question {i}: {question}")
    answer = chatbot.ask(question)
    print(f"Answer: {answer}\n")
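
The internals of `chatbot.ask` are not shown in this notebook. In a typical RAG setup, though, the retrieved chunks are placed into a prompt alongside the question before the language model is called. The sketch below shows one plausible way to assemble such a prompt; the function name `build_prompt` and the prompt wording are assumptions, not the chatbot's actual template.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and a question into a single prompt string.

    Hypothetical helper: the wording and layout are illustrative only.
    """
    # Number each chunk so the model (and the reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What is machine learning?",
    [
        "Machine Learning is a subset of AI.",
        "Deep Learning uses neural networks with many layers.",
    ],
)
print(prompt)
```

Restricting the model to the supplied context is what ties the answers to the loaded documents rather than to the model's general knowledge.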

## Using the Chatbot with Your Own Documents

To use the chatbot with your own documents, follow these steps:

1. Initialize the chatbot with your OpenAI API key
2. Load your documents (PDF, Word, Excel)
3. Ask questions about the content of your documents

Here's an example of how you would do this:

In [None]:
# Example code for using your own documents
# (wrapped in a string literal so that it is displayed but not executed)
'''
# Initialize the chatbot
my_chatbot = RAGChatbot(openai_api_key="your-api-key")

# Load your documents
my_documents = [
    "path/to/your/document.pdf",
    "path/to/your/document.docx",
    "path/to/your/spreadsheet.xlsx"
]
my_chatbot.load_documents(file_paths=my_documents)

# Or load all documents in a directory
# my_chatbot.load_documents(directory_path="path/to/your/documents")

# Ask questions
question = "What information is in these documents?"
answer = my_chatbot.ask(question)
print(f"Question: {question}")
print(f"Answer: {answer}")
'''

print("Example code for using your own documents is shown above.")
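
If you prefer to build the `file_paths` list yourself rather than pass a directory, a sketch like the following collects matching files from a folder. The helper name `collect_documents` and the extension set are assumptions based on the file types listed above; the demo writes placeholder files to a temporary directory only to show the filtering.

```python
import tempfile
from pathlib import Path

# Assumed set of supported extensions, taken from the example paths above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx"}


def collect_documents(directory, extensions=SUPPORTED_EXTENSIONS):
    """Return sorted paths of files under `directory` with a supported extension.

    Hypothetical helper; matching is case-insensitive and recursive.
    """
    root = Path(directory)
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("report.pdf", "notes.txt", "budget.XLSX"):
        (Path(tmp) / name).write_text("placeholder")
    found = [Path(p).name for p in collect_documents(tmp)]

print(found)  # ['budget.XLSX', 'report.pdf']
```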

## Cleaning Up

Finally, let's clean up by clearing the documents from both the chatbot and the standalone vector store.

In [None]:
# Clear documents from the chatbot
chatbot.clear_documents()
print("Cleared documents from the chatbot")

# Clear the vector store
vector_store.clear()
print("Cleared the vector store")

print("\nDemo completed successfully!")
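
This notebook also creates directories on disk (`./sample_docs`, `./demo_chroma_db`, and `./demo_rag_db`). Whether `clear_documents()` and `clear()` remove the persisted files is not shown here, so the sketch below deletes those directories explicitly if they still exist. The helper name `remove_directories` is hypothetical. Run it only after the chatbot and vector store are no longer in use, since open database files may block deletion on some platforms.

```python
import shutil
from pathlib import Path


def remove_directories(paths):
    """Delete each existing directory in `paths` and return the ones removed.

    Hypothetical helper; paths that do not exist are skipped silently.
    """
    removed = []
    for path in map(Path, paths):
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(str(path))
    return removed


# Directory names taken from the cells earlier in this notebook.
removed = remove_directories(["./sample_docs", "./demo_chroma_db", "./demo_rag_db"])
print(f"Removed {len(removed)} directories")
```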

## Conclusion

In this notebook, we've demonstrated how to use the RAG chatbot to process documents and answer questions based on their content. The chatbot uses:

- Document processing to extract text from different file types
- Vector database (Chroma) to store and retrieve document embeddings
- OpenAI API to generate answers based on retrieved context

You can use this chatbot with your own documents by following the steps outlined above. For a more user-friendly interface, you can also use the Streamlit app provided in the repository.