# RAG Chatbot Demo

This notebook demonstrates how to use the RAG (Retrieval-Augmented Generation) chatbot for document processing and question answering. The chatbot can process Word documents, Excel files, and PDFs, and answer questions based on their content using OpenAI's language models.

## Setup

First, let's set up our environment and import the necessary modules.

In [None]:
import os
import sys
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Make sure we're in the root directory of the repository
# so that the imports below resolve correctly
if os.path.basename(os.getcwd()) in ('test', 'notebooks'):
    os.chdir('..')

# Import our RAG chatbot modules
try:
    from utils.document_processor import DocumentProcessor
    from utils.vector_store import VectorStore
    from rag_chatbot import RAGChatbot
    print("Modules imported successfully!")
except ImportError as e:
    print(f"Error importing modules: {e}")
    print("\nMake sure you have installed all required dependencies.")
    print("You can install them using: conda env create -f environment.yml")

## Setting up the OpenAI API Key

To use the RAG chatbot, you need an OpenAI API key. You can set it as an environment variable or provide it directly to the chatbot.

In [None]:
from IPython.display import display
from ipywidgets import widgets

# Create a simple password input widget.
# continuous_update=False means `value` only changes when the user
# presses Enter or the field loses focus, not on every keystroke.
api_key_input = widgets.Password(
    description='OpenAI API Key:',
    placeholder='Enter your key',
    continuous_update=False,
    layout=widgets.Layout(width='400px')
)

# Store the key in the environment once the user submits it
def on_key_entered(change):
    key = change['new'].strip()
    if key:
        os.environ["OPENAI_API_KEY"] = key
        print("✅ API key set")

# Register the callback for changes to the widget's value
api_key_input.observe(on_key_entered, names='value')

# Display status
if os.environ.get("OPENAI_API_KEY"):
    print("API key already set in environment")
else:
    print("Enter your OpenAI API key and press Enter")

# Display the widget
display(api_key_input)

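If the widget is not available (for example, when the code is run outside Jupyter), the key can also be read from the environment with a hidden prompt as a fallback. The following is a minimal sketch; `resolve_api_key` is a hypothetical helper and is not part of the repository.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI API key from `env`, prompting only if it is missing.

    `resolve_api_key` is a hypothetical helper used for illustration.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed characters, similar to a password field
        key = prompt("OpenAI API key: ").strip()
        if key:
            env["OPENAI_API_KEY"] = key
    return key or None
```

Calling `resolve_api_key()` returns the existing key if one is set, and otherwise asks for it once and stores it in `os.environ` for later cells.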

## Checking Test Data

The repository includes test data in the `test/test_data` directory. Let's check if these files are available for us to use in our demo.

In [None]:
# Path to test data directory
test_data_dir = "test/test_data"

# Check if the directory exists
if os.path.exists(test_data_dir):
    print(f"Test data directory found at {test_data_dir}")
    
    # List the available files
    files = []
    for root, _, filenames in os.walk(test_data_dir):
        for filename in filenames:
            if filename.endswith(('.pdf', '.docx', '.xlsx', '.xls')):
                files.append(os.path.join(root, filename))
    
    if files:
        print("\nAvailable test files:")
        for file in files:
            print(f"- {file}")
    else:
        print("\nNo suitable test files found. We'll create sample files for the demo.")
else:
    print(f"Test data directory not found at {test_data_dir}")
    print("We'll create sample files for the demo.")
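The same scan can be written more compactly with `pathlib`. This is a sketch that assumes the same set of extensions as the cell above; `find_documents` and `SUPPORTED_EXTENSIONS` are hypothetical names.

```python
from pathlib import Path

# Extensions taken from the cell above
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".xls"}


def find_documents(directory):
    """Recursively list files under `directory` with a supported extension."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Unlike the `endswith` check, this version lowercases the suffix, so a file named `REPORT.PDF` is also matched.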

## Creating Sample Documents (if needed)

If we don't have test files available, let's create some sample text files that we can use to test the chatbot.

In [None]:
# Create sample files only if we don't have test files already
if not os.path.exists(test_data_dir) or not files:
    # Create a sample directory
    sample_dir = "./sample_docs"
    os.makedirs(sample_dir, exist_ok=True)
    
    # Create a sample text file about artificial intelligence
    ai_content = """
    # Artificial Intelligence Overview
    
    Artificial Intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using the rules to reach approximate or definite conclusions), and self-correction.
    
    ## Machine Learning
    
    Machine Learning is a subset of AI that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
    
    The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow the computers to learn automatically without human intervention or assistance and adjust actions accordingly.
    
    ## Deep Learning
    
    Deep Learning is a subset of machine learning that uses neural networks with many layers (hence "deep") to analyze various factors of data. Deep learning is a key technology behind driverless cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost.
    
    ## Natural Language Processing
    
    Natural Language Processing (NLP) is a field of AI that gives computers the ability to understand text and spoken words in much the same way human beings can. NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models.
    """
    
    with open(os.path.join(sample_dir, "ai_overview.txt"), "w") as f:
        f.write(ai_content)
    
    # Create a sample text file about data science
    ds_content = """
    # Data Science Fundamentals
    
    Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science.
    
    ## Data Analysis
    
    Data Analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains.
    
    ## Big Data
    
    Big Data refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source.
    
    ## Data Visualization
    
    Data Visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers. Visualizing data is an important step in data analysis and is critical for understanding patterns, trends, and outliers in data.
    """
    
    with open(os.path.join(sample_dir, "data_science.txt"), "w") as f:
        f.write(ds_content)
    
    print(f"Created sample documents in {sample_dir}")
    
    # Update our files list to use these samples
    files = [
        os.path.join(sample_dir, "ai_overview.txt"),
        os.path.join(sample_dir, "data_science.txt")
    ]
else:
    print("Using existing test files")

## Document Processing

Now, let's explore how the document processor works. The `DocumentProcessor` class is responsible for extracting text from different document types and splitting it into chunks.

In [None]:
# Initialize the document processor
processor = DocumentProcessor(chunk_size=500, chunk_overlap=100)
print("Document processor initialized with chunk_size=500, chunk_overlap=100")

# Select a document to process
if files:
    test_file = files[0]  # Choose the first available file
    print(f"Processing {test_file}...")
    
    try:
        chunks = processor.process_file(test_file)
        print(f"Successfully processed {test_file}")
        print(f"Extracted {len(chunks)} chunks")
        
        # Display the first chunk
        if chunks:
            print("\nFirst chunk:")
            preview_length = min(len(chunks[0]), 500)
            print(chunks[0][:preview_length] + ("..." if preview_length < len(chunks[0]) else ""))
    except Exception as e:
        print(f"Error processing file: {str(e)}")
else:
    print("No files available to process")
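To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes simple character-based splitting; the actual `DocumentProcessor` may split differently (for example, on sentence or paragraph boundaries), and `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Split `text` into character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 1200
pieces = split_with_overlap(sample)
print([len(p) for p in pieces])
```

Running this prints `[500, 500, 400]`: each window starts 400 characters after the previous one, so consecutive chunks share 100 characters. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.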

## Vector Database

Next, let's see how the vector database works. The `VectorStore` class uses Chroma to store and retrieve document embeddings.

In [None]:
# Check if OpenAI API key is set (the key cell above stores it in the environment)
api_key = os.environ.get("OPENAI_API_KEY")

if not api_key or api_key == "your-openai-api-key":
    print("WARNING: OpenAI API key is not set. Vector store operations will be skipped.")
    print("Please set your API key in the cell above and run it again.")
else:
    try:
        # Initialize the vector store
        vector_store = VectorStore(persist_directory="./demo_chroma_db")
        print("Vector store initialized successfully")
        
        # Add the chunks to the vector store if we have any
        if 'chunks' in locals() and chunks:
            metadatas = [{"source": os.path.basename(test_file)} for _ in chunks]
            ids = vector_store.add_texts(chunks, metadatas)
            print(f"Added {len(ids)} chunks to vector store")
            
            # Try processing a second file if available
            if len(files) > 1:
                second_file = files[1]
                print(f"\nProcessing second file: {second_file}")
                try:
                    second_chunks = processor.process_file(second_file)
                    second_metadatas = [{"source": os.path.basename(second_file)} for _ in second_chunks]
                    second_ids = vector_store.add_texts(second_chunks, second_metadatas)
                    print(f"Added {len(second_ids)} chunks from second file")
                except Exception as e:
                    print(f"Error processing second file: {str(e)}")
            
            # Perform a similarity search
            print("\nPerforming similarity search...")
            query = "What is artificial intelligence?"
            results = vector_store.similarity_search(query, k=2)
            
            print(f"Similarity search results for query: '{query}'")
            for i, doc in enumerate(results):
                print(f"Result {i+1}:")
                print(f"Source: {doc.metadata.get('source', 'Unknown')}")
                preview = doc.page_content[:150]
                # Only add an ellipsis when the content was actually truncated
                suffix = "..." if len(doc.page_content) > 150 else ""
                print(f"Content: {preview}{suffix}")
                print()
        else:
            print("No chunks available to add to vector store")
    except Exception as e:
        print(f"Error with vector store operations: {str(e)}")
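Conceptually, a similarity search embeds the query and ranks the stored chunks by how close their embeddings are to it. The sketch below illustrates the idea with cosine similarity over hand-written toy vectors. It does not call Chroma or OpenAI, and `cosine_similarity`, `top_k`, and the vectors themselves are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k (text, score) pairs most similar to the query."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions
toy_store = [
    ("AI simulates human intelligence", [0.9, 0.1, 0.0]),
    ("Big data is too large for traditional tools", [0.1, 0.9, 0.2]),
    ("Machine learning is a subset of AI", [0.8, 0.2, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]  # stands in for the embedded query

for text, score in top_k(query_vector, toy_store):
    print(f"{score:.3f}  {text}")
```

With these toy vectors, the two AI-related texts rank above the big data text because their vectors point in nearly the same direction as the query vector.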

## RAG Chatbot

Now, let's use the RAG chatbot to answer questions based on our documents.

In [None]:
# Check if OpenAI API key is set (the key cell above stores it in the environment)
api_key = os.environ.get("OPENAI_API_KEY")

if not api_key or api_key == "your-openai-api-key":
    print("WARNING: OpenAI API key is not set. RAG chatbot operations will be skipped.")
    print("Please set your API key in the cell above and run it again.")
else:
    try:
        # Initialize the RAG chatbot
        chatbot = RAGChatbot(persist_directory="./demo_rag_db")
        print("RAG chatbot initialized successfully")
        
        # Load the documents
        if files:
            print(f"Loading {len(files)} documents...")
            num_chunks = chatbot.load_documents(file_paths=files)
            print(f"Loaded {num_chunks} chunks from {len(files)} documents")
        else:
            print("No files available to load")
    except Exception as e:
        print(f"Error with RAG chatbot operations: {str(e)}")
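In a RAG system, answering a question typically means retrieving the most relevant chunks and placing them in the prompt sent to the language model. The sketch below shows only that prompt-assembly step. The prompt wording and the `build_rag_prompt` helper are assumptions made for illustration, not the repository's actual implementation.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Combine retrieved (source, text) pairs and the question into one prompt."""
    context = "\n\n".join(
        f"[{i}] (source: {source})\n{text}"
        for i, (source, text) in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_rag_prompt(
    "What is machine learning?",
    [
        ("ai_overview.txt", "Machine Learning is a subset of AI ..."),
        ("data_science.txt", "Data Science is an interdisciplinary field ..."),
    ],
)
print(prompt)
```

Numbering the chunks and including their source makes it easier to trace an answer back to the document it came from.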

## Asking Questions

Let's ask some questions to the chatbot and see how it responds based on the documents we've loaded.

In [None]:
# Check if the chatbot is initialized
if 'chatbot' in locals() and api_key and api_key != "your-openai-api-key":
    # Define some questions to ask
    questions = [
        "What is artificial intelligence?",
        "Explain machine learning and how it relates to AI.",
        "What is data science and how does it differ from AI?",
        "What are the main components of data analysis?",
        "How are deep learning and natural language processing related?"
    ]
    
    # Ask each question and display the answer
    for i, question in enumerate(questions):
        print(f"Question {i+1}: {question}")
        answer = chatbot.ask(question)
        print(f"Answer: {answer}\n")
else:
    print("Chatbot is not available. Please make sure you have set your OpenAI API key and initialized the chatbot.")

## Using the Chatbot with Your Own Documents

To use the chatbot with your own documents, you would follow these steps:

1. Initialize the chatbot with your OpenAI API key
2. Load your documents (PDF, Word, Excel)
3. Ask questions about the content of your documents

Here's an example of how you would do this:

In [None]:
# Example code for using your own documents
'''
# Initialize the chatbot
my_chatbot = RAGChatbot(openai_api_key="your-api-key")

# Load your documents
my_documents = [
    "path/to/your/document.pdf",
    "path/to/your/document.docx",
    "path/to/your/spreadsheet.xlsx"
]
my_chatbot.load_documents(file_paths=my_documents)

# Or load all documents in a directory
# my_chatbot.load_documents(directory_path="path/to/your/documents")

# Ask questions
question = "What information is in these documents?"
answer = my_chatbot.ask(question)
print(f"Question: {question}")
print(f"Answer: {answer}")
'''

print("Example code for using your own documents is shown above.")

## Cleaning Up

Finally, let's clean up by clearing the documents from the vector store.

In [None]:
# Clean up resources if they were created
if 'chatbot' in locals() and api_key and api_key != "your-openai-api-key":
    # Clear documents from the chatbot
    chatbot.clear_documents()
    print("Cleared documents from the chatbot")

if 'vector_store' in locals() and api_key and api_key != "your-openai-api-key":
    # Clear the vector store
    vector_store.clear()
    print("Cleared the vector store")

print("\nDemo completed.")
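The directories created during the demo may remain on disk after the stores are cleared. Below is a sketch for removing them, assuming the paths used earlier in this notebook (`./demo_chroma_db`, `./demo_rag_db`, `./sample_docs`) and that nothing else is stored in them; `remove_demo_dirs` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def remove_demo_dirs(paths):
    """Delete each existing directory in `paths` and return the ones removed."""
    removed = []
    for path in map(Path, paths):
        if path.is_dir():
            shutil.rmtree(path)  # irreversible: removes the directory and its contents
            removed.append(str(path))
    return removed
```

For example, `remove_demo_dirs(["./demo_chroma_db", "./demo_rag_db", "./sample_docs"])` deletes whichever of the three exist and skips the rest.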

## Conclusion

In this notebook, we've demonstrated how to use the RAG chatbot to process documents and answer questions based on their content. The chatbot uses:

- Document processing to extract text from different file types
- Vector database (Chroma) to store and retrieve document embeddings
- OpenAI API to generate answers based on retrieved context

You can use this chatbot with your own documents by following the steps outlined above. For a more user-friendly interface, you can also use the Streamlit app provided in the repository by running:

```bash
python run_app.py
```

or directly with Streamlit:

```bash
streamlit run app.py
```