## AI Document Retrieval System using Pinecone and LangChain

This code demonstrates how to:
1. Download various document types (md, ipynb, pdf, docx, xlsx, html)
2. Process and chunk documents
3. Create embeddings using OpenAI
4. Store embeddings in a Pinecone vector database
5. Retrieve relevant documents based on queries

### Installation

In [1]:
# !pip install langchain langchain-openai langchain-pinecone openai pinecone-client tiktoken python-dotenv requests pandas beautifulsoup4 unstructured pypdf docx2txt openpyxl

### Environment setup

In [2]:
import os
import requests
import tempfile
import pandas as pd
from bs4 import BeautifulSoup
from langchain.document_loaders import (
    TextLoader,
    UnstructuredMarkdownLoader,
    PyPDFLoader,
    Docx2txtLoader,
    UnstructuredHTMLLoader,
)
from langchain.document_loaders.excel import UnstructuredExcelLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
# from langchain.vectorstores import Pinecone
from langchain_pinecone import PineconeVectorStore
from langchain.chains import RetrievalQA
# from langchain.llms import OpenAI
import pinecone
from pinecone import Pinecone, ServerlessSpec
from dotenv import load_dotenv
import time

In [3]:
# Load environment variables
load_dotenv()

# Get API keys from environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_ENVIRONMENT = os.getenv("PINECONE_ENVIRONMENT", "gcp-starter")

# Set up directories
# BASE_DIR = os.path.dirname(os.path.abspath(__file__))  # Python script
BASE_DIR = os.getcwd()  # Jupyter Notebook
DATA_DIR = os.path.join(BASE_DIR, "data-002")
os.makedirs(DATA_DIR, exist_ok=True)

MODEL_GPT = 'gpt-4o-mini' # 'gpt-3.5-turbo'
PINECONE_INDEX_NAME = "document-retrieval"

print("BASE_DIR:", BASE_DIR)
print("DATA_DIR:", DATA_DIR)
print("MODEL_GPT:", MODEL_GPT)
print("PINECONE_ENVIRONMENT:", PINECONE_ENVIRONMENT)
print("PINECONE_INDEX_NAME:", PINECONE_INDEX_NAME)

BASE_DIR: C:\Users\pavel\projects\ai-llm-agents\rag
DATA_DIR: C:\Users\pavel\projects\ai-llm-agents\rag\data-002
MODEL_GPT: gpt-4o-mini
PINECONE_ENVIRONMENT: us-east-1
PINECONE_INDEX_NAME: document-retrieval
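
The setup cell reads the API keys with `os.getenv`, which silently returns `None` when a key is missing. A minimal sketch of a fail-fast check follows; the helper name `missing_env_vars` is hypothetical and not part of the original notebook.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Report missing keys before any OpenAI or Pinecone call is attempted.
missing = missing_env_vars(["OPENAI_API_KEY", "PINECONE_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` should surface a misconfigured `.env` file earlier than an authentication error from either service would.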


### Download documents
(MD, IPYNB, PDF, DOCX, XLSX, HTML)

In [4]:
# Sample documents to download (URL, filename, type)
DOCUMENTS = [
    ("https://raw.githubusercontent.com/langchain-ai/langchain/master/README.md", "langchain_readme.md", "markdown"),
    # ("https://raw.githubusercontent.com/pinecone-io/examples/master/learn/generation/langchain/handbook/05-langchain-retrieval-augmentation.ipynb", "pinecone_example.ipynb", "text"),
    ("https://raw.githubusercontent.com/pinecone-io/examples/master/learn/generation/langchain/handbook/05-langchain-retrieval-augmentation.ipynb", "pinecone_example.ipynb", "notebook"),
    ("https://arxiv.org/pdf/2005.11401.pdf", "neural_networks.pdf", "pdf"),
    ("https://calibre-ebook.com/downloads/demos/demo.docx", "calibre_demo.docx", "docx"),
    ("https://filesamples.com/samples/document/xlsx/sample1.xlsx", "sample_data.xlsx", "excel"),
    ("https://www.w3.org/WAI/tutorials/page-structure/", "web_accessibility.html", "html"),
]
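
Each entry in `DOCUMENTS` carries its type label by hand. As an illustration only, the label could instead be derived from the file extension; the mapping and the `guess_doc_type` helper below are assumptions that mirror the labels used in this notebook.

```python
import os

# Assumed mapping from file extension to the type labels used in DOCUMENTS.
EXTENSION_TO_TYPE = {
    ".txt": "text",
    ".md": "markdown",
    ".ipynb": "notebook",
    ".pdf": "pdf",
    ".docx": "docx",
    ".xlsx": "excel",
    ".html": "html",
}

def guess_doc_type(filename, default="text"):
    """Return the type label for a filename, falling back to `default`."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_TYPE.get(ext, default)

print(guess_doc_type("langchain_readme.md"))
print(guess_doc_type("pinecone_example.ipynb"))
```

Keeping the label explicit, as the notebook does, is still the safer choice when a URL has no extension, such as the W3C page above.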

In [5]:
def download_file(url, file_path):
    """Download a file from a URL and save it locally."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        
        with open(file_path, 'wb') as f:
            f.write(response.content)
        
        print(f"Successfully downloaded {url} to {file_path}")
        return True
    except Exception as e:
        print(f"Error downloading {url}: {e}")
        return False
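
`download_file` fetches every URL on each run. One possible refinement, sketched here under the assumption that a non-empty local file is good enough to reuse, is to check the target path first. The helper `needs_download` is hypothetical.

```python
import os

def needs_download(file_path):
    """Return True when the file is absent or empty, so it should be fetched."""
    return not (os.path.isfile(file_path) and os.path.getsize(file_path) > 0)

# Hypothetical usage inside the download loop:
# if needs_download(file_path):
#     download_file(url, file_path)
```

This does not detect stale or partially written files; it only avoids repeating a download that already produced some content.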

In [6]:
# def load_document(file_path, doc_type):
#     """Load a document based on its type."""
#     try:
#         if doc_type == "text":
#             return TextLoader(file_path).load()
#         if doc_type == "notebook":
#             return TextLoader(file_path).load()
#         elif doc_type == "markdown":
#             return UnstructuredMarkdownLoader(file_path).load()
#         elif doc_type == "pdf":
#             return PyPDFLoader(file_path).load()
#         elif doc_type == "docx":
#             return Docx2txtLoader(file_path).load()
#         elif doc_type == "excel":
#             return UnstructuredExcelLoader(file_path).load()
#         elif doc_type == "html":
#             return UnstructuredHTMLLoader(file_path).load()
#         else:
#             print(f"Unsupported document type: {doc_type}")
#             return []
#     except Exception as e:
#         print(f"Error loading {file_path}: {e}")
#         return []

In [7]:
# import pandas as pd
# from langchain_core.documents import Document
# from langchain.document_loaders import (
#     TextLoader,
#     UnstructuredMarkdownLoader,
#     PyPDFLoader,
#     Docx2txtLoader,
#     UnstructuredHTMLLoader
# )

# def load_document(file_path, doc_type):
#     """Load a document based on its type."""
#     try:
#         if doc_type == "text":
#             return TextLoader(file_path).load()
        
#         elif doc_type == "notebook":
#             # Currently loading as plain text; can be replaced with smarter logic later
#             return TextLoader(file_path).load()
        
#         elif doc_type == "markdown":
#             return UnstructuredMarkdownLoader(file_path).load()
        
#         elif doc_type == "pdf":
#             return PyPDFLoader(file_path).load()
        
#         elif doc_type == "docx":
#             return Docx2txtLoader(file_path).load()
        
#         elif doc_type == "excel":
#             df = pd.read_excel(file_path)
#             text_content = f"Excel file with {len(df)} rows and {len(df.columns)} columns.\n\n"
#             text_content += f"Column names: {', '.join(df.columns.astype(str))}\n\n"
#             text_content += "Sample data (first 5 rows):\n"
#             text_content += df.head().to_string()
            
#             return [Document(
#                 page_content=text_content,
#                 metadata={"source": file_path, "file_type": "excel", "rows": len(df), "columns": len(df.columns)}
#             )]
        
#         elif doc_type == "html":
#             return UnstructuredHTMLLoader(file_path).load()
        
#         else:
#             print(f"Unsupported document type: {doc_type}")
#             return []

#     except Exception as e:
#         print(f"Error loading {file_path}: {e}")
#         return []

In [8]:
# import pandas as pd
# import json
# from langchain_core.documents import Document
# from langchain.document_loaders import (
#     TextLoader,
#     UnstructuredMarkdownLoader,
#     PyPDFLoader,
#     Docx2txtLoader,
#     UnstructuredHTMLLoader
# )

# def load_document(file_path, doc_type):
#     """Load a document based on its type."""
#     try:
#         if doc_type == "text":
#             return TextLoader(file_path).load()
        
#         elif doc_type == "markdown":
#             return UnstructuredMarkdownLoader(file_path).load()
        
#         elif doc_type == "pdf":
#             return PyPDFLoader(file_path).load()
        
#         elif doc_type == "docx":
#             return Docx2txtLoader(file_path).load()
        
#         elif doc_type == "excel":
#             df = pd.read_excel(file_path)
            
#             content = f"Excel file with {len(df)} rows and {len(df.columns)} columns.\n\n"
#             content += f"Columns: {', '.join(df.columns.astype(str))}\n\n"
#             content += f"Sample (first 5 rows):\n{df.head().to_string()}\n\n"
            
#             try:
#                 content += f"Summary stats:\n{df.describe().to_string()}\n\n"
#             except:
#                 pass
            
#             return [Document(
#                 page_content=content,
#                 metadata={"source": file_path, "file_type": "excel", "rows": len(df), "columns": len(df.columns)}
#             )]
        
#         elif doc_type == "html":
#             return UnstructuredHTMLLoader(file_path).load()
        
#         elif doc_type == "notebook" or file_path.endswith(".ipynb"):
#             with open(file_path, 'r', encoding='utf-8') as f:
#                 notebook = json.load(f)
            
#             content = ""
#             for cell in notebook.get('cells', []):
#                 if cell.get('cell_type') == 'markdown':
#                     content += "".join(cell.get('source', [])) + "\n\n"
#                 elif cell.get('cell_type') == 'code':
#                     content += "```python\n" + "".join(cell.get('source', [])) + "\n```\n\n"
            
#             return [Document(
#                 page_content=content,
#                 metadata={"source": file_path, "file_type": "jupyter_notebook"}
#             )]
        
#         else:
#             print(f"Unsupported document type: {doc_type}")
#             return []

#     except Exception as e:
#         print(f"Error loading {file_path}: {e}")
#         return []

In [10]:
import pandas as pd
import json
from langchain_core.documents import Document
from langchain.document_loaders import (
    TextLoader,
    UnstructuredMarkdownLoader,
    PyPDFLoader,
    Docx2txtLoader,
    UnstructuredHTMLLoader
)

def load_document(file_path, doc_type):
    """Load a document based on its type."""
    try:
        if doc_type == "text":
            return TextLoader(file_path).load()
        
        elif doc_type == "markdown":
            return UnstructuredMarkdownLoader(file_path).load()
        
        elif doc_type == "pdf":
            return PyPDFLoader(file_path).load()
        
        elif doc_type == "docx":
            return Docx2txtLoader(file_path).load()
        
        elif doc_type == "excel":
            df = pd.read_excel(file_path)
            
            content = f"Excel file containing {len(df)} rows and {len(df.columns)} columns.\n\n"
            content += f"Column names: {', '.join(df.columns.astype(str))}\n\n"
            content += f"Sample data (first 5 rows):\n{df.head().to_string()}\n\n"
            
            try:
                content += f"Numeric column statistics:\n{df.describe().to_string()}\n\n"
            except Exception:
                pass
            
            return [Document(
                page_content=content,
                metadata={"source": file_path, "file_type": "excel", "rows": len(df), "columns": len(df.columns)}
            )]
        
        elif doc_type == "notebook" or file_path.endswith(".ipynb"):
            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    notebook = json.load(f)
                
                content = f"Jupyter Notebook with {len(notebook.get('cells', []))} cells\n\n"
                
                for i, cell in enumerate(notebook.get('cells', [])):
                    cell_type = cell.get('cell_type')
                    cell_source = "".join(cell.get('source', []))
                    
                    if cell_type == 'markdown':
                        content += f"[MARKDOWN CELL {i+1}]\n{cell_source}\n\n"
                    elif cell_type == 'code':
                        content += f"[CODE CELL {i+1}]\n```python\n{cell_source}\n```\n\n"
                        
                        outputs = cell.get('outputs', [])
                        if outputs:
                            output_text = ""
                            for output in outputs:
                                if 'text' in output:
                                    output_text += "".join(output['text'])
                                elif 'data' in output and 'text/plain' in output['data']:
                                    data_output = output['data']['text/plain']
                                    if isinstance(data_output, list):
                                        output_text += "".join(data_output)
                                    else:
                                        output_text += str(data_output)
                            if output_text:
                                content += f"[OUTPUT]\n{output_text}\n\n"
                
                return [Document(
                    page_content=content,
                    metadata={"source": file_path, "file_type": "jupyter_notebook"}
                )]
            
            except json.JSONDecodeError:
                print(f"Warning: Could not parse {file_path} as JSON, falling back to text loader")
                return TextLoader(file_path).load()
        
        elif doc_type == "html":
            return UnstructuredHTMLLoader(file_path).load()
        
        else:
            print(f"Unsupported document type: {doc_type}")
            return []

    except Exception as e:
        print(f"Error loading {file_path}: {e}")
        return []
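
The notebook branch of `load_document` is the most involved one, so it may help to see the same flattening logic in isolation. The sketch below works on an already parsed `.ipynb` dictionary and needs no LangChain. The function name `notebook_to_text` is an assumption, and unlike the original it does not wrap code cells in Markdown fences.

```python
def notebook_to_text(notebook):
    """Flatten a parsed .ipynb dict into plain text, one labelled block per cell."""
    cells = notebook.get("cells", [])
    parts = [f"Jupyter Notebook with {len(cells)} cells"]
    for i, cell in enumerate(cells, start=1):
        # 'source' is usually a list of lines; joining also works for a plain string.
        source = "".join(cell.get("source", []))
        if cell.get("cell_type") == "markdown":
            parts.append(f"[MARKDOWN CELL {i}]\n{source}")
        elif cell.get("cell_type") == "code":
            parts.append(f"[CODE CELL {i}]\n{source}")
            # Stream outputs carry 'text'; rich outputs carry data['text/plain'].
            out = ""
            for output in cell.get("outputs", []):
                if "text" in output:
                    out += "".join(output["text"])
                elif "text/plain" in output.get("data", {}):
                    plain = output["data"]["text/plain"]
                    out += "".join(plain) if isinstance(plain, list) else str(plain)
            if out:
                parts.append(f"[OUTPUT]\n{out}")
    return "\n\n".join(parts)

sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["print(1)"], "outputs": [{"text": ["1\n"]}]},
    ]
}
print(notebook_to_text(sample))
```

Cells of other types, such as raw cells, are skipped, which matches the behaviour of the loader above.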

In [11]:
print("Downloading documents...")
downloaded_files = []

for url, filename, doc_type in DOCUMENTS:
    file_path = os.path.join(DATA_DIR, filename)
    if download_file(url, file_path):
        downloaded_files.append((file_path, doc_type))

Downloading documents...
Successfully downloaded https://raw.githubusercontent.com/langchain-ai/langchain/master/README.md to C:\Users\pavel\projects\ai-llm-agents\rag\data-002\langchain_readme.md
Successfully downloaded https://raw.githubusercontent.com/pinecone-io/examples/master/learn/generation/langchain/handbook/05-langchain-retrieval-augmentation.ipynb to C:\Users\pavel\projects\ai-llm-agents\rag\data-002\pinecone_example.ipynb
Successfully downloaded https://arxiv.org/pdf/2005.11401.pdf to C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
Successfully downloaded https://calibre-ebook.com/downloads/demos/demo.docx to C:\Users\pavel\projects\ai-llm-agents\rag\data-002\calibre_demo.docx
Successfully downloaded https://filesamples.com/samples/document/xlsx/sample1.xlsx to C:\Users\pavel\projects\ai-llm-agents\rag\data-002\sample_data.xlsx
Successfully downloaded https://www.w3.org/WAI/tutorials/page-structure/ to C:\Users\pavel\projects\ai-llm-agents\rag\data-00

In [12]:
len(downloaded_files)

6

### Processing documents and creating chunks

In [13]:
documents = []

for file_path, doc_type in downloaded_files:
    docs = load_document(file_path, doc_type)
    documents.extend(docs)
    print(f"Loaded {len(docs)} document(s) from {file_path}")

Loaded 1 document(s) from C:\Users\pavel\projects\ai-llm-agents\rag\data-002\langchain_readme.md
Loaded 1 document(s) from C:\Users\pavel\projects\ai-llm-agents\rag\data-002\pinecone_example.ipynb
Loaded 19 document(s) from C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
Loaded 1 document(s) from C:\Users\pavel\projects\ai-llm-agents\rag\data-002\calibre_demo.docx
Loaded 1 document(s) from C:\Users\pavel\projects\ai-llm-agents\rag\data-002\sample_data.xlsx
Loaded 1 document(s) from C:\Users\pavel\projects\ai-llm-agents\rag\data-002\web_accessibility.html


In [14]:
len(documents)

24
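
The total of 24 comes from 19 PDF pages plus one document for each of the other five files. A quick way to confirm such a breakdown is to count documents per `source` value. The sketch below is an illustration: `count_by_source` is a hypothetical helper, and `SimpleNamespace` stands in for LangChain `Document` objects so the example runs on its own.

```python
from collections import Counter
from types import SimpleNamespace

def count_by_source(docs):
    """Count Document-like objects per value of metadata['source']."""
    return Counter(d.metadata.get("source", "unknown") for d in docs)

# Stand-in objects; in the notebook, pass the real `documents` list instead.
fake_docs = [
    SimpleNamespace(metadata={"source": "neural_networks.pdf"}),
    SimpleNamespace(metadata={"source": "neural_networks.pdf"}),
    SimpleNamespace(metadata={"source": "sample_data.xlsx"}),
]
for source, n in count_by_source(fake_docs).items():
    print(f"{n:>3}  {source}")
```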

### Debug document loading [FIXING]

In [16]:
def get_content_preview(doc):
    return doc.page_content[:1500] + "..." if len(doc.page_content) > 1500 else doc.page_content

def debug_document_loading():
    """Debug function to check if Excel and Jupyter Notebook files are properly loaded."""
    print("\n🔍 DEBUGGING DOCUMENT LOADING...")
    
    # 1. Check Excel files
    excel_files = [doc for doc in documents if "sample_data.xlsx" in doc.metadata.get("source", "")]
    print(f"\n📊 Excel Files: {len(excel_files)} documents found")
    
    if excel_files:
        for idx, doc in enumerate(excel_files):
            print(f"  Excel Doc #{idx+1}:")
            print(f"  - Source: {doc.metadata.get('source')}")
            print(f"  - Content length: {len(doc.page_content)} characters")
            # get_content_preview already appends "..." when it truncates
            print(f"  - Content preview: {get_content_preview(doc)}\n")
    else:
        print("  ❌ No Excel documents were loaded!")
        
    # 2. Check Jupyter Notebook files
    notebook_files = [doc for doc in documents if "ipynb" in doc.metadata.get("source", "")]
    print(f"\n📓 Jupyter Notebooks: {len(notebook_files)} documents found")
    
    if notebook_files:
        for idx, doc in enumerate(notebook_files):
            print(f"  Notebook Doc #{idx+1}:")
            print(f"  - Source: {doc.metadata.get('source')}")
            print(f"  - Content length: {len(doc.page_content)} characters")
            # get_content_preview already appends "..." when it truncates
            print(f"  - Content preview: {get_content_preview(doc)}\n")
    else:
        print("  ❌ No Jupyter Notebook documents were loaded!")
        
    # 3. Test direct loading of files
    print("\n🧪 TESTING DIRECT FILE LOADING...")
    
    excel_path = os.path.join(DATA_DIR, "sample_data.xlsx")
    if os.path.exists(excel_path):
        print(f"\nTesting Excel loader on: {excel_path}")
        try:
            import pandas as pd
            df = pd.read_excel(excel_path)
            print(f"✅ Successfully loaded Excel with pandas: {len(df)} rows, {len(df.columns)} columns")
            print(f"Column names: {list(df.columns)}")
        except Exception as e:
            print(f"❌ Error loading Excel with pandas: {e}")
    
    notebook_path = os.path.join(DATA_DIR, "pinecone_example.ipynb")
    if os.path.exists(notebook_path):
        print(f"\nTesting Jupyter Notebook loader on: {notebook_path}")
        try:
            with open(notebook_path, 'r', encoding='utf-8') as f:
                import json
                notebook = json.load(f)
                cell_count = len(notebook.get('cells', []))
                print(f"✅ Successfully loaded Notebook: {cell_count} cells found")
                
                # Count cell types
                md_cells = sum(1 for cell in notebook.get('cells', []) if cell.get('cell_type') == 'markdown')
                code_cells = sum(1 for cell in notebook.get('cells', []) if cell.get('cell_type') == 'code')
                print(f"  - Markdown cells: {md_cells}")
                print(f"  - Code cells: {code_cells}")
        except Exception as e:
            print(f"❌ Error loading Jupyter Notebook: {e}")

In [17]:
# Call the debug function
debug_document_loading()


🔍 DEBUGGING DOCUMENT LOADING...

📊 Excel Files: 1 documents found
  Excel Doc #1:
  - Source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\sample_data.xlsx
  - Content length: 1077 characters
  - Content preview: Excel file containing 390 rows and 5 columns.

Column names: Postcode, Sales_Rep_ID, Sales_Rep_Name, Year, Value

Sample data (first 5 rows):
   Postcode  Sales_Rep_ID Sales_Rep_Name  Year         Value
0      2121           456           Jane  2011  84219.497311
1      2092           789         Ashish  2012  28322.192268
2      2128           456           Jane  2013  81878.997241
3      2073           123           John  2011  44491.142121
4      2134           789         Ashish  2012  71837.720959

Numeric column statistics:
          Postcode  Sales_Rep_ID         Year         Value
count   390.000000    390.000000   390.000000    390.000000
mean   2098.430769    456.000000  2012.000000  49229.388305
std      58.652206    272.242614     0.817545  28251.271309
min 

In [18]:
# documents[0].page_content
documents[0].page_content[:300]

'Release Notes\n\nCI\n\nPyPI - License\n\nPyPI - Downloads\n\nGitHub star chart\n\nOpen Issues\n\nOpen in Dev Containers\n\n\n\nTwitter\n\nCodSpeed Badge\n\n[!NOTE] Looking for the JS/TS library? Check out LangChain.js.\n\nLangChain is a framework for building LLM-powered applications. It helps you chain together interope'

In [19]:
source = documents[0].metadata.get("source", "No source metadata found")
print(source)

C:\Users\pavel\projects\ai-llm-agents\rag\data-002\langchain_readme.md


In [20]:
for i, doc in enumerate(documents):
    source = doc.metadata.get("source", "No source metadata found")
    print(f"📄 Document {i + 1} source: {source}")

📄 Document 1 source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\langchain_readme.md
📄 Document 2 source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\pinecone_example.ipynb
📄 Document 3 source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
📄 Document 4 source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
📄 Document 5 source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
📄 Document 6 source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
📄 Document 7 source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
📄 Document 8 source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
📄 Document 9 source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
📄 Document 10 source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
📄 Document 11 source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neu

In [21]:
# for i, doc in enumerate(documents):
#     source = doc.metadata.get("source", "No source metadata found")
#     preview = doc.page_content[:200].replace('\n', ' ')
#     print(f"📄 Document {i + 1} source: {source}")
#     print(f"   🔍 Preview: {preview}\n")

for i, doc in enumerate(documents):
    source = doc.metadata.get("source", "No source metadata found")
    content = doc.page_content
    preview = content[:200].replace('\n', ' ')
    total_chars = len(content)
    
    print(f"📄 Document {i + 1}")
    print(f"   ✏️ Total characters: {total_chars}")
    print(f"   🔗 Source: {source}")
    print(f"   🔍 Preview: {preview}\n")

📄 Document 1
   ✏️ Total characters: 2782
   🔗 Source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\langchain_readme.md
   🔍 Preview: Release Notes  CI  PyPI - License  PyPI - Downloads  GitHub star chart  Open Issues  Open in Dev Containers    Twitter  CodSpeed Badge  [!NOTE] Looking for the JS/TS library? Check out LangChain.js.  

📄 Document 2
   ✏️ Total characters: 27923
   🔗 Source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\pinecone_example.ipynb
   🔍 Preview: Jupyter Notebook with 44 cells  [MARKDOWN CELL 1] [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master

📄 Document 3
   ✏️ Total characters: 2900
   🔗 Source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
   🔍 Preview: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks Patrick Lewis†‡, Ethan Perez⋆, Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinr

In [22]:
import tiktoken

def estimate_embedding_cost(texts, model="text-embedding-3-small"):
    """
    Estimate total number of tokens and embedding cost for OpenAI embedding models.
    """
    if isinstance(texts, str):
        texts = [texts]

    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")

    total_tokens = sum(len(encoding.encode(text)) for text in texts)

    prices_per_1k = {
        "text-embedding-3-small": 0.00002,
        "text-embedding-3-large": 0.00013,
    }

    price_per_1k = prices_per_1k.get(model, 0.00002)
    estimated_cost = (total_tokens / 1000) * price_per_1k

    return {
        "total_tokens": total_tokens,
        "estimated_cost_usd": round(estimated_cost, 6)
    }

# 🧾 Extract text from your 24 documents and estimate
document_texts = [doc.page_content for doc in documents]

result = estimate_embedding_cost(document_texts, model="text-embedding-3-small")

print(f"📚 Total documents: {len(documents)}")
print(f"🔢 Total tokens: {result['total_tokens']}")
print(f"💰 Estimated embedding cost: ${result['estimated_cost_usd']}")

📚 Total documents: 24
🔢 Total tokens: 28862
💰 Estimated embedding cost: $0.000577
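
The reported figure can be checked by hand: 28,862 tokens at $0.00002 per 1,000 tokens is 28.862 × 0.00002 ≈ $0.000577. The same arithmetic as a small sketch, using only the token count and the price already shown above (the function name is hypothetical):

```python
def embedding_cost_usd(total_tokens, price_per_1k_tokens):
    """Cost in USD for a token count at a given price per 1,000 tokens."""
    return (total_tokens / 1000) * price_per_1k_tokens

cost = embedding_cost_usd(28862, 0.00002)
print(f"${cost:.6f}")  # prints $0.000577
```

Provider prices change over time, so the values in `prices_per_1k` are best treated as a snapshot rather than a constant.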


In [23]:
# print("\n📄 Per-document token breakdown:")
# for i, doc in enumerate(documents):
#     text = doc.page_content
#     tokens = len(tiktoken.encoding_for_model("text-embedding-3-small").encode(text))
#     cost = round((tokens / 1000) * 0.00002, 6)
#     # print(f"   Doc {i+1:>2}: {tokens} tokens | ${cost} | Source: {doc.metadata.get('source', 'N/A')}")
#     print(f"   Doc {i+1:>2}: {tokens:>5} tokens | Cost: ${cost:.6f} | Source: {doc.metadata.get('source', 'N/A')}")

encoding = tiktoken.encoding_for_model("text-embedding-3-small")
price_per_1k = 0.00002

print("\n📄 Per-document token & cost breakdown:")
for i, doc in enumerate(documents):
    text = doc.page_content
    tokens = len(encoding.encode(text))
    cost = (tokens / 1000) * price_per_1k
    source = doc.metadata.get("source", "N/A")
    print(f"   Doc {i+1:>2}: {tokens:>5} tokens | Cost: ${cost:.6f} | Source: {source}")


📄 Per-document token & cost breakdown:
   Doc  1:   538 tokens | Cost: $0.000011 | Source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\langchain_readme.md
   Doc  2:  6863 tokens | Cost: $0.000137 | Source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\pinecone_example.ipynb
   Doc  3:   688 tokens | Cost: $0.000014 | Source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
   Doc  4:  1148 tokens | Cost: $0.000023 | Source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
   Doc  5:   938 tokens | Cost: $0.000019 | Source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
   Doc  6:  1058 tokens | Cost: $0.000021 | Source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
   Doc  7:  1034 tokens | Cost: $0.000021 | Source: C:\Users\pavel\projects\ai-llm-agents\rag\data-002\neural_networks.pdf
   Doc  8:  1203 tokens | Cost: $0.000024 | Source: C:\Users\pavel\projects\ai-llm-agents\rag\da
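
`PyPDFLoader` returns one document per page, which is why the PDF shows up as several rows in the breakdown. To see the cost per file rather than per page, the rows can be grouped by source. The sketch below is only an illustration: it uses the token counts for Docs 1 to 7 from the output above with the file names shortened, and `tokens_by_source` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

def tokens_by_source(rows):
    """Sum token counts per source file; `rows` is an iterable of (source, tokens)."""
    totals = defaultdict(int)
    for source, tokens in rows:
        totals[source] += tokens
    return dict(totals)

# Token counts for Docs 1-7 copied from the breakdown (paths shortened to file names).
rows = [
    ("langchain_readme.md", 538),
    ("pinecone_example.ipynb", 6863),
    ("neural_networks.pdf", 688),
    ("neural_networks.pdf", 1148),
    ("neural_networks.pdf", 938),
    ("neural_networks.pdf", 1058),
    ("neural_networks.pdf", 1034),
]
totals = tokens_by_source(rows)
for source, tokens in totals.items():
    print(f"{source}: {tokens} tokens")
```

In the notebook itself, the pairs could be built from `doc.metadata.get("source")` and the per-document token count computed in the loop above.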

### Create chunks

In [24]:
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=1000,
#     chunk_overlap=100,
#     length_function=len,
# )

# chunks = text_splitter.split_documents(documents)
# print(f"Created {len(chunks)} chunks from {len(documents)} documents")

### Create chunks [FIXING]

In [25]:
# # Modify Chunking for Excel Data
# # - For Excel data, which is more structured than regular text, consider using a different chunking strategy:

# # Before creating chunks
# regular_docs = [doc for doc in documents if "sample_data.xlsx" not in doc.metadata.get("source", "")]
# excel_docs = [doc for doc in documents if "sample_data.xlsx" in doc.metadata.get("source", "")]

# # Process regular documents with RecursiveCharacterTextSplitter
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=1000,
#     chunk_overlap=100,
#     length_function=len,
# )
# regular_chunks = text_splitter.split_documents(regular_docs)

# # Process Excel documents with less aggressive chunking to preserve table context
# excel_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=3000,  # Larger chunk size for tabular data
#     chunk_overlap=200,
#     length_function=len,
# )
# excel_chunks = excel_splitter.split_documents(excel_docs)

# # Combine chunks
# chunks = regular_chunks + excel_chunks
# print(f"Created {len(chunks)} chunks from {len(documents)} documents")

In [26]:
# Modify Document Processing for Better Retrieval
# - When creating chunks, add special handling for Excel data:

# Before chunking, separate Excel documents for special handling
regular_docs = [doc for doc in documents if "sample_data.xlsx" not in doc.metadata.get("source", "")]
excel_docs = [doc for doc in documents if "sample_data.xlsx" in doc.metadata.get("source", "")]

# Process regular documents with RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
)
regular_chunks = text_splitter.split_documents(regular_docs)

# For Excel documents, use larger chunks with less overlap
excel_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,  # Larger to keep more Excel context together
    chunk_overlap=50,
    length_function=len,
)
excel_chunks = excel_splitter.split_documents(excel_docs)

# Combine all chunks
chunks = regular_chunks + excel_chunks
print(f"Created {len(chunks)} chunks from {len(documents)} documents")
print(f"  - Regular documents: {len(regular_chunks)} chunks")
print(f"  - Excel documents: {len(excel_chunks)} chunks")

Created 149 chunks from 24 documents
  - Regular documents: 148 chunks
  - Excel documents: 1 chunks
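
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs, lines, and spaces); it only illustrates how consecutive chunks share `chunk_overlap` characters. The function name `sliding_chunks` is an assumption made for this sketch.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

chunks = sliding_chunks("abcdefghijklmnopqrstuvwxy", chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last two characters of the previous one, which is the same idea as the 100-character overlap configured for the regular documents.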


### Add Metadata [FIXING]

In [27]:
# Add Metadata
# - Adding specific metadata tags can help improve retrieval

for chunk in chunks:
    source = chunk.metadata.get("source", "")
    if "sample_data.xlsx" in source:
        chunk.metadata["content_type"] = "excel_data"
        chunk.metadata["document_type"] = "spreadsheet"
        print(chunk.metadata)
    elif source.endswith(".ipynb"):
        chunk.metadata["content_type"] = "notebook_content"
        chunk.metadata["document_type"] = "jupyter_notebook"
        print(chunk.metadata)

{'source': 'C:\\Users\\pavel\\projects\\ai-llm-agents\\rag\\data-002\\pinecone_example.ipynb', 'file_type': 'jupyter_notebook', 'content_type': 'notebook_content', 'document_type': 'jupyter_notebook'}
{'source': 'C:\\Users\\pavel\\projects\\ai-llm-agents\\rag\\data-002\\pinecone_example.ipynb', 'file_type': 'jupyter_notebook', 'content_type': 'notebook_content', 'document_type': 'jupyter_notebook'}
{'source': 'C:\\Users\\pavel\\projects\\ai-llm-agents\\rag\\data-002\\pinecone_example.ipynb', 'file_type': 'jupyter_notebook', 'content_type': 'notebook_content', 'document_type': 'jupyter_notebook'}
{'source': 'C:\\Users\\pavel\\projects\\ai-llm-agents\\rag\\data-002\\pinecone_example.ipynb', 'file_type': 'jupyter_notebook', 'content_type': 'notebook_content', 'document_type': 'jupyter_notebook'}
{'source': 'C:\\Users\\pavel\\projects\\ai-llm-agents\\rag\\data-002\\pinecone_example.ipynb', 'file_type': 'jupyter_notebook', 'content_type': 'notebook_content', 'document_type': 'jupyter_notebo
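
One way these tags can help is by restricting a search to a single kind of content. The sketch below shows the idea on plain dictionaries rather than LangChain `Document` objects; `filter_by_metadata` and the sample chunks are hypothetical and exist only for illustration.

```python
def filter_by_metadata(chunks, **criteria):
    """Return the chunks whose metadata matches every key/value pair in `criteria`."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in criteria.items())
    ]

# Hypothetical chunks, represented as plain dicts instead of LangChain Documents.
chunks = [
    {"text": "rows and columns", "metadata": {"content_type": "excel_data"}},
    {"text": "import pinecone", "metadata": {"content_type": "notebook_content"}},
    {"text": "plain README text", "metadata": {}},
]
excel_only = filter_by_metadata(chunks, content_type="excel_data")
print([chunk["text"] for chunk in excel_only])
```

With the vector store used in this notebook, a comparable restriction would typically be passed as `search_kwargs={"k": 4, "filter": {"content_type": "excel_data"}}` to `as_retriever`; treat that as an assumption to verify against the installed `langchain_pinecone` version.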

### Create embeddings

In [28]:
print("\nCreating embeddings using OpenAI...")

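# Initialize OpenAI embeddings with the same model used in the cost estimate above.
# text-embedding-3-small produces 1536-dimensional vectors, matching the index dimension below.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", openai_api_key=OPENAI_API_KEY)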


Creating embeddings using OpenAI...


### Initialize pinecone

In [29]:
print("\nInitializing Pinecone...")

# Create Pinecone instance
pc = Pinecone(api_key=PINECONE_API_KEY)

# Create or use existing index
index_name = PINECONE_INDEX_NAME
dimension = 1536  # Must match the embedding model's output size (1536 for text-embedding-3-small and text-embedding-ada-002)

# Check if index exists
if index_name not in pc.list_indexes().names():
    print(f"Creating new Pinecone index: {index_name}")
    pc.create_index(
        name=index_name,
        dimension=dimension,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region=PINECONE_ENVIRONMENT  # Must be an AWS region (e.g. "us-east-1"); the "gcp-starter" fallback is not one
        )
    )
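    # Wait until the index reports ready instead of sleeping for a fixed time
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)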


Initializing Pinecone...
Creating new Pinecone index: document-retrieval
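
Waiting for a freshly created index is easier to reason about when the wait is bounded by a timeout. The helper below is a generic sketch using only the standard library; `wait_until` and the fake readiness check are assumptions made for this example, not part of the Pinecone client.

```python
import time

def wait_until(is_ready, timeout_s=60.0, interval_s=1.0):
    """Poll `is_ready()` until it returns True or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Stand-in readiness check that succeeds on the third call.
calls = {"n": 0}

def fake_ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until(fake_ready, timeout_s=5.0, interval_s=0.01))
```

In this notebook the predicate could be something like `lambda: pc.describe_index(index_name).status["ready"]`, assuming the installed Pinecone client exposes the index status that way.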


### Create vector store

In [30]:
print("\nStoring embeddings in Pinecone...")

# Connect to the index with Pinecone API
index = pc.Index(index_name)

# Create Pinecone vector store with LangChain
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=index_name
)

print("Documents loaded into Pinecone successfully!")


Storing embeddings in Pinecone...
Documents loaded into Pinecone successfully!


### Create retrieval chain

In [31]:
print("\nSetting up retrieval system...")

# Create a retriever from the vector store
# retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# Create a RetrievalQA chain
# qa_chain = RetrievalQA.from_chain_type(
#     llm=OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY),
#     chain_type="stuff",
#     retriever=retriever
# )
llm = ChatOpenAI(temperature=0, model_name=MODEL_GPT)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)


Setting up retrieval system...
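
The retriever asks the index for the `k` vectors most similar to the query embedding, and the index above was created with `metric="cosine"`. The following brute-force sketch shows what that ranking means on toy two-dimensional vectors. A hosted index does not necessarily scan every vector like this; the function names and sample vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b` (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k):
    """Return the names of the `k` vectors most similar to `query`, best first."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in vectors.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Toy "embeddings"; real ones have 1536 dimensions in this notebook.
vectors = {"doc_a": [1.0, 0.0], "doc_b": [0.7, 0.7], "doc_c": [0.0, 1.0]}
print(top_k([1.0, 0.1], vectors, k=2))
```

The query points almost along the first axis, so `doc_a` ranks first and `doc_b` second.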


### Test retrieval with sample questions

In [33]:
# Function to query and display results
def ask_question(question):
    print(f"\n\n❓ Question: {question}")
    print("-" * 80)
    
    result = qa_chain.invoke({"query": question})
    
    print("📘 Answer:")
    print(result["result"])

    sources = [doc.metadata.get("source", "N/A") for doc in result['source_documents']]
    print("\n📁 Sources used in the answer:")
    for i, src in enumerate(sources, 1):
        print(f"{i:>2}. {src}")
    
    # print("\nSource Documents:")
    # for i, doc in enumerate(result["source_documents"]):
    #     print(f"\nDocument {i+1}:")
    #     source = doc.metadata.get('source', 'Unknown')
    #     if source.endswith('.txt'):
    #         doc_type = "Alice in Wonderland"
    #     elif source.endswith('.md'):
    #         doc_type = "Flask Documentation"
    #     elif source.endswith('.pdf'):
    #         doc_type = "Attention Research Paper"
    #     else:
    #         doc_type = "Unknown"
            
    #     print(f"Source: {doc_type} ({source})")
    #     if 'page' in doc.metadata:
    #         print(f"Page: {doc.metadata['page']}")
    #     print(f"Content: {doc.page_content[:200]}...")
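
Because several retrieved chunks often come from the same file, the source list printed by `ask_question` can contain repeats. A small sketch of order-preserving de-duplication follows. It uses plain dicts as stand-ins for the returned documents, and `unique_sources` is a hypothetical helper name.

```python
def unique_sources(source_documents):
    """Return each source path once, in first-seen order."""
    sources = (doc["metadata"].get("source", "N/A") for doc in source_documents)
    return list(dict.fromkeys(sources))

# Stand-ins for result["source_documents"]; real entries are LangChain Documents.
docs = [
    {"metadata": {"source": "neural_networks.pdf"}},
    {"metadata": {"source": "langchain_readme.md"}},
    {"metadata": {"source": "neural_networks.pdf"}},
    {"metadata": {}},
]
print(unique_sources(docs))
```

With real `Document` objects the generator would read `doc.metadata.get("source", "N/A")` instead of indexing into a dict.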

In [34]:
questions = [
    # Markdown questions (LangChain README)
    "What is LangChain and what problems does it solve?",
    "What are the main components or modules of the LangChain framework?",
    
    # Notebook questions (Pinecone RAG Notebook)
    "How does Retrieval Augmented Generation (RAG) work with LangChain and Pinecone?",
    "What are the key steps to implement a RAG system using Pinecone?",
    
    # PDF questions (Neural Networks Paper)
    "What are the key findings or contributions in the neural networks paper?",
    "What methodologies or algorithms are discussed for training neural networks?",
    
    # DOCX questions (Calibre Demo)
    "What features of document formatting are showcased in the Calibre demo document?",
    "What is the main content or purpose of the Calibre demo document?",
    
    # Excel questions (Sample Spreadsheet)
    "What kind of data is in the Excel file?",
    "How many rows and columns are in the sample Excel data?",
    # "What types of data are contained in the sample Excel spreadsheet?",
    # "How is the data structured in the Excel file in terms of organization?",
    
    # HTML questions (W3C Accessibility)
    "What are the key principles for creating accessible web page structures?",
    "What HTML elements or techniques does W3C recommend for web accessibility?"
]

# questions = [
#     # Markdown questions
#     "What is LangChain and what are its main components?",
#     "How do LangChain agents work?",
    
#     # Text questions
#     "What is RAG and how does it work with Pinecone?",
#     "What are the key features of Pinecone for vector search?",
    
#     # PDF questions
#     "What are the main types of neural networks discussed in the PDF?",
#     "What applications of neural networks are mentioned in the paper?",
    
#     # DOCX questions
#     "What accessibility issues are mentioned in the report?",
#     "What recommendations are made for improving accessibility?",
    
#     # Excel questions
#     "What kind of data is in the Excel file?",
#     "How many rows and columns are in the sample Excel data?",
    
#     # HTML questions
#     "What are the main principles of web accessibility?",
#     "How should page structure be organized for accessibility?"
# ]

In [35]:
# for i, question in enumerate(questions):
#     print(f"\nQuestion {i+1}: {question}")
#     # result = qa_chain.run(question)
#     result = qa_chain.invoke({"query": question})
#     print(f"Answer: {result}")

# Ask each question
for question in questions:
    ask_question(question)



❓ Question: What is LangChain and what problems does it solve?
--------------------------------------------------------------------------------
📘 Answer:
LangChain is a framework for building applications powered by Large Language Models (LLMs). It helps developers chain together interoperable components and third-party integrations to simplify AI application development. 

The problems LangChain solves include:

1. **Real-time Data Augmentation**: It allows easy connection of LLMs to diverse data sources and systems, enabling the use of up-to-date information in applications.

2. **Model Interoperability**: Developers can swap models in and out as needed, allowing for experimentation and adaptation to find the best model for specific application needs.

3. **Simplified Development**: By providing a standard interface for models, embeddings, vector stores, and more, LangChain streamlines the development process for LLM applications.

4. **Future-proofing**: As the underlying technolo

### Clean up

In [36]:
print("\nDone!")

# Uncomment below if you want to delete the index after testing
# print("Cleaning up...")
# pc.delete_index(index_name)


Done!


## DEBUG: FIXING

In [37]:
# print(excel_docs)

### DEBUG [1]: Fixing Excel File Retrieval in Your RAG System
The system isn't loading or processing the Excel file correctly: `UnstructuredExcelLoader` may not extract the content in a form that works well for vector storage and retrieval.

In [38]:
# # 1. Replace the Excel Loader

# import pandas as pd
# from langchain_core.documents import Document

# def load_document(file_path, doc_type):
#     """Load a document based on its type."""
#     try:
#         if doc_type == "text":
#             return TextLoader(file_path).load()
#         elif doc_type == "markdown":
#             return UnstructuredMarkdownLoader(file_path).load()
#         elif doc_type == "pdf":
#             return PyPDFLoader(file_path).load()
#         elif doc_type == "docx":
#             return Docx2txtLoader(file_path).load()
#         elif doc_type == "excel":
#             # Replace the UnstructuredExcelLoader with a better approach
#             df = pd.read_excel(file_path)
#             # Convert DataFrame to a more descriptive text format
#             text_content = f"Excel file with {len(df)} rows and {len(df.columns)} columns.\n\n"
#             text_content += f"Column names: {', '.join(df.columns.astype(str))}\n\n"
#             text_content += "Sample data (first 5 rows):\n"
#             text_content += df.head().to_string()
            
#             # Create a document with this content
#             from langchain_core.documents import Document
#             return [Document(page_content=text_content, metadata={"source": file_path})]
#         elif doc_type == "html":
#             return UnstructuredHTMLLoader(file_path).load()
#         else:
#             print(f"Unsupported document type: {doc_type}")
#             return []
#     except Exception as e:
#         print(f"Error loading {file_path}: {e}")
#         return []
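
The commented loader above embeds only the first five rows, so questions about later rows have nothing to match against. One possible alternative, sketched here under stated assumptions, is to write every row as its own labelled line. The example uses plain lists so it runs without pandas; `rows_to_text` and the sample data are hypothetical.

```python
def rows_to_text(header, rows):
    """Describe a table as text: a summary line, the column names, then one line per row."""
    lines = [f"Table with {len(rows)} rows and {len(header)} columns."]
    lines.append("Column names: " + ", ".join(header))
    for number, row in enumerate(rows, start=1):
        cells = "; ".join(f"{name}: {value}" for name, value in zip(header, row))
        lines.append(f"Row {number}: {cells}")
    return "\n".join(lines)

# Hypothetical sample data standing in for the spreadsheet contents.
header = ["name", "country", "age"]
rows = [["Ada", "UK", 36], ["Linus", "Finland", 28]]
text = rows_to_text(header, rows)
print(text)
```

With a DataFrame, the inputs could be supplied as `list(df.columns.astype(str))` and `df.values.tolist()`. For large sheets the resulting text would need chunking like any other document.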

In [39]:
# # 2. Debug the Excel Loading
# # - After modifying the loader, add this code to verify the Excel content was properly extracted:

# # After loading documents, check specifically for Excel content:
# excel_docs = [doc for doc in documents if "sample_data.xlsx" in doc.metadata.get("source", "")]
# if excel_docs:
#     print("\n📊 Excel Document Content:")
#     for doc in excel_docs:
#         print(f"Source: {doc.metadata.get('source')}")
#         print(f"Content length: {len(doc.page_content)} characters")
#         print(f"Content preview:\n{doc.page_content[:500]}...\n")
# else:
#     print("\n❌ No Excel documents were successfully loaded!")

In [40]:
# # 3. Modify Chunking for Excel Data
# # - For Excel data, which is more structured than regular text, consider using a different chunking strategy:

# # Before creating chunks
# regular_docs = [doc for doc in documents if "sample_data.xlsx" not in doc.metadata.get("source", "")]
# excel_docs = [doc for doc in documents if "sample_data.xlsx" in doc.metadata.get("source", "")]

# # Process regular documents with RecursiveCharacterTextSplitter
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=1000,
#     chunk_overlap=100,
#     length_function=len,
# )
# regular_chunks = text_splitter.split_documents(regular_docs)

# # Process Excel documents with less aggressive chunking to preserve table context
# excel_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=3000,  # Larger chunk size for tabular data
#     chunk_overlap=200,
#     length_function=len,
# )
# excel_chunks = excel_splitter.split_documents(excel_docs)

# # Combine chunks
# chunks = regular_chunks + excel_chunks
# print(f"Created {len(chunks)} chunks from {len(documents)} documents")

In [41]:
# # 4. Add Metadata Tags
# # - Adding specific metadata tags can help improve retrieval:

# # After creating chunks, but before vector store creation
# for chunk in chunks:
#     if "sample_data.xlsx" in chunk.metadata.get("source", ""):
#         chunk.metadata["content_type"] = "excel_data"
#         chunk.metadata["document_type"] = "spreadsheet"

```
These changes should help your system better process, store, and retrieve Excel data, allowing it to answer questions about the Excel file content.
```

### DEBUG [2]: Fixing Excel and Jupyter Notebook File Processing in RAG System
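There are two separate issues: `UnstructuredExcelLoader` doesn't extract Excel content in a form that retrieves well, and the Jupyter notebook is configured as a plain "text" document, so its raw JSON is loaded instead of its cell contents. Let's implement a more robust solution for both file types: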

In [42]:
# # Fix 1: Replace the Excel Loader with a Better Implementation

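# def load_document(file_path, doc_type):
#     """Load a document based on its type."""
#     # Import up front: an import inside one branch would leave these names
#     # unbound (UnboundLocalError) when a different branch runs.
#     import json
#     from langchain_core.documents import Document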
#     try:
#         if doc_type == "text":
#             return TextLoader(file_path).load()
#         elif doc_type == "markdown":
#             return UnstructuredMarkdownLoader(file_path).load()
#         elif doc_type == "pdf":
#             return PyPDFLoader(file_path).load()
#         elif doc_type == "docx":
#             return Docx2txtLoader(file_path).load()
#         elif doc_type == "excel":
#             # Better Excel processing using pandas
#             import pandas as pd
#             from langchain_core.documents import Document
            
#             df = pd.read_excel(file_path)
            
#             # Create a detailed text description of the Excel content
#             content = f"Excel file containing {len(df)} rows and {len(df.columns)} columns.\n\n"
#             content += f"Column names: {', '.join(df.columns.astype(str))}\n\n"
            
#             # Add sample data information
#             content += f"Sample data (first 5 rows):\n{df.head().to_string()}\n\n"
            
#             # Add summary statistics if applicable
#             try:
#                 content += f"Numeric column statistics:\n{df.describe().to_string()}\n\n"
#             except Exception:
#                 pass
                
#             # Create a document with metadata flagging it as Excel
#             return [Document(
#                 page_content=content,
#                 metadata={"source": file_path, "file_type": "excel", "rows": len(df), "columns": len(df.columns)}
#             )]
#         elif doc_type == "html":
#             return UnstructuredHTMLLoader(file_path).load()
#         elif doc_type == "notebook" or file_path.endswith(".ipynb"):
#             # Process Jupyter notebooks as text files
#             with open(file_path, 'r', encoding='utf-8') as f:
#                 import json
#                 notebook = json.load(f)
                
#                 # Extract text from markdown and code cells
#                 content = ""
#                 for cell in notebook.get('cells', []):
#                     if cell.get('cell_type') == 'markdown':
#                         content += "".join(cell.get('source', [])) + "\n\n"
#                     elif cell.get('cell_type') == 'code':
#                         content += "```python\n" + "".join(cell.get('source', [])) + "\n```\n\n"
                
#                 return [Document(
#                     page_content=content,
#                     metadata={"source": file_path, "file_type": "jupyter_notebook"}
#                 )]
#         else:
#             print(f"Unsupported document type: {doc_type}")
#             return []
#     except Exception as e:
#         print(f"Error loading {file_path}: {e}")
#         return []

In [43]:
# # Fix 2: Add Debug Code to Check Excel Content
# # - Add this code after loading the documents:

# # Debug: Check Excel content
# excel_docs = [doc for doc in documents if "sample_data.xlsx" in doc.metadata.get("source", "")]
# if excel_docs:
#     print("\n📊 Excel file loaded successfully:")
#     for doc in excel_docs:
#         content_preview = doc.page_content[:500] + "..." if len(doc.page_content) > 500 else doc.page_content
#         print(f"Content length: {len(doc.page_content)} characters")
#         print(f"Content preview:\n{content_preview}\n")
#         print(f"Metadata: {doc.metadata}")
# else:
#     print("\n❌ No Excel documents were successfully loaded!")

In [44]:
# # Fix 3: Modify Document Processing for Better Retrieval
# # When creating chunks, add special handling for Excel data:

# # Before chunking, separate Excel documents for special handling
# regular_docs = [doc for doc in documents if "sample_data.xlsx" not in doc.metadata.get("source", "")]
# excel_docs = [doc for doc in documents if "sample_data.xlsx" in doc.metadata.get("source", "")]

# # Process regular documents with RecursiveCharacterTextSplitter
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=1000,
#     chunk_overlap=100,
#     length_function=len,
# )
# regular_chunks = text_splitter.split_documents(regular_docs)

# # For Excel documents, use larger chunks with less overlap
# excel_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=2000,  # Larger to keep more Excel context together
#     chunk_overlap=50,
#     length_function=len,
# )
# excel_chunks = excel_splitter.split_documents(excel_docs)

# # Combine all chunks
# chunks = regular_chunks + excel_chunks
# print(f"Created {len(chunks)} chunks from {len(documents)} documents")
# print(f"  - Regular documents: {len(regular_chunks)} chunks")
# print(f"  - Excel documents: {len(excel_chunks)} chunks")

In [45]:
# # Fix 4: Enhance Metadata for Better Retrieval
# # After creating chunks but before vector store creation:

# # Add more specific metadata for better retrieval
# for chunk in chunks:
#     source = chunk.metadata.get("source", "")
#     if "sample_data.xlsx" in source:
#         chunk.metadata["content_type"] = "excel_data"
#         chunk.metadata["document_type"] = "spreadsheet"
#     elif source.endswith(".ipynb"):
#         chunk.metadata["content_type"] = "notebook_content"
#         chunk.metadata["document_type"] = "jupyter_notebook"

```
The primary issue is that UnstructuredExcelLoader doesn't convert Excel files into a format that's good for semantic search. The new implementation:

1. Uses pandas to properly extract table data
2. Creates a text representation that describes the Excel structure
3. Preserves important metadata about the file
4. Uses special chunking for tabular data

This should significantly improve your ability to retrieve information from Excel files and Jupyter notebooks in your RAG system.
```

### DEBUG [3]: Debugging Jupyter Notebook and Excel Loading in RAG System
Here's a dedicated debug function to check both your Excel and Jupyter Notebook processing:

In [46]:
# # Add this after your document loading code

# def debug_document_loading():
#     """Debug function to check if Excel and Jupyter Notebook files are properly loaded."""
#     print("\n🔍 DEBUGGING DOCUMENT LOADING...")
    
#     # 1. Check Excel files
#     excel_files = [doc for doc in documents if "sample_data.xlsx" in doc.metadata.get("source", "")]
#     print(f"\n📊 Excel Files: {len(excel_files)} documents found")
    
#     if excel_files:
#         for idx, doc in enumerate(excel_files):
#             print(f"  Excel Doc #{idx+1}:")
#             print(f"  - Source: {doc.metadata.get('source')}")
#             print(f"  - Content length: {len(doc.page_content)} characters")
#             print(f"  - Content preview: {doc.page_content[:200]}...\n")
#     else:
#         print("  ❌ No Excel documents were loaded!")
        
#     # 2. Check Jupyter Notebook files
#     notebook_files = [doc for doc in documents if "ipynb" in doc.metadata.get("source", "")]
#     print(f"\n📓 Jupyter Notebooks: {len(notebook_files)} documents found")
    
#     if notebook_files:
#         for idx, doc in enumerate(notebook_files):
#             print(f"  Notebook Doc #{idx+1}:")
#             print(f"  - Source: {doc.metadata.get('source')}")
#             print(f"  - Content length: {len(doc.page_content)} characters")
#             print(f"  - Content preview: {doc.page_content[:200]}...\n")
#     else:
#         print("  ❌ No Jupyter Notebook documents were loaded!")
        
#     # 3. Test direct loading of files
#     print("\n🧪 TESTING DIRECT FILE LOADING...")
    
#     excel_path = os.path.join(DATA_DIR, "sample_data.xlsx")
#     if os.path.exists(excel_path):
#         print(f"\nTesting Excel loader on: {excel_path}")
#         try:
#             import pandas as pd
#             df = pd.read_excel(excel_path)
#             print(f"✅ Successfully loaded Excel with pandas: {len(df)} rows, {len(df.columns)} columns")
#             print(f"Column names: {list(df.columns)}")
#         except Exception as e:
#             print(f"❌ Error loading Excel with pandas: {e}")
    
#     notebook_path = os.path.join(DATA_DIR, "pinecone_example.ipynb")
#     if os.path.exists(notebook_path):
#         print(f"\nTesting Jupyter Notebook loader on: {notebook_path}")
#         try:
#             with open(notebook_path, 'r', encoding='utf-8') as f:
#                 import json
#                 notebook = json.load(f)
#                 cell_count = len(notebook.get('cells', []))
#                 print(f"✅ Successfully loaded Notebook: {cell_count} cells found")
                
#                 # Count cell types
#                 md_cells = sum(1 for cell in notebook.get('cells', []) if cell.get('cell_type') == 'markdown')
#                 code_cells = sum(1 for cell in notebook.get('cells', []) if cell.get('cell_type') == 'code')
#                 print(f"  - Markdown cells: {md_cells}")
#                 print(f"  - Code cells: {code_cells}")
#         except Exception as e:
#             print(f"❌ Error loading Jupyter Notebook: {e}")

# # Call the debug function
# debug_document_loading()

In [47]:
# # Improved Jupyter Notebook Loading Function
# # Replace your existing notebook loading code with this improved version:

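# def load_document(file_path, doc_type):
#     """Load a document based on its type."""
#     # Import up front: an import inside one branch would leave these names
#     # unbound (UnboundLocalError) when a different branch runs.
#     import json
#     from langchain_core.documents import Document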
#     try:
#         if doc_type == "text":
#             return TextLoader(file_path).load()
#         elif doc_type == "markdown":
#             return UnstructuredMarkdownLoader(file_path).load()
#         elif doc_type == "pdf":
#             return PyPDFLoader(file_path).load()
#         elif doc_type == "docx":
#             return Docx2txtLoader(file_path).load()
#         elif doc_type == "excel":
#             # Better Excel processing using pandas
#             import pandas as pd
#             from langchain_core.documents import Document
            
#             df = pd.read_excel(file_path)
            
#             # Create a detailed text description of the Excel content
#             content = f"Excel file containing {len(df)} rows and {len(df.columns)} columns.\n\n"
#             content += f"Column names: {', '.join(df.columns.astype(str))}\n\n"
            
#             # Add sample data information
#             content += f"Sample data (first 5 rows):\n{df.head().to_string()}\n\n"
            
#             # Add summary statistics where possible
#             try:
#                 content += f"Numeric column statistics:\n{df.describe().to_string()}\n\n"
#             except Exception:
#                 pass
                
#             # Create a document with metadata flagging it as Excel
#             return [Document(
#                 page_content=content,
#                 metadata={"source": file_path, "file_type": "excel", "rows": len(df), "columns": len(df.columns)}
#             )]
#         elif doc_type == "notebook" or file_path.endswith(".ipynb"):
#             # Improved Jupyter Notebook processing
#             try:
#                 with open(file_path, 'r', encoding='utf-8') as f:
#                     import json
#                     notebook = json.load(f)
                    
#                     # Extract text from markdown and code cells
#                     content = f"Jupyter Notebook with {len(notebook.get('cells', []))} cells\n\n"
                    
#                     # Process each cell
#                     for i, cell in enumerate(notebook.get('cells', [])):
#                         cell_type = cell.get('cell_type')
#                         cell_source = "".join(cell.get('source', []))
                        
#                         if cell_type == 'markdown':
#                             content += f"[MARKDOWN CELL {i+1}]\n{cell_source}\n\n"
#                         elif cell_type == 'code':
#                             content += f"[CODE CELL {i+1}]\n```python\n{cell_source}\n```\n\n"
                            
#                             # If there's output, include it
#                             outputs = cell.get('outputs', [])
#                             if outputs:
#                                 output_text = ""
#                                 for output in outputs:
#                                     if 'text' in output:
#                                         output_text += "".join(output['text'])
#                                     elif 'data' in output and 'text/plain' in output['data']:
#                                         output_text += "".join(output['data']['text/plain'])
                                
#                                 if output_text:
#                                     content += f"[OUTPUT]\n{output_text}\n\n"
                    
#                     return [Document(
#                         page_content=content,
#                         metadata={"source": file_path, "file_type": "jupyter_notebook"}
#                     )]
#             except json.JSONDecodeError:
#                 # If JSON parsing fails, fall back to text loader
#                 print(f"Warning: Could not parse {file_path} as JSON, falling back to text loader")
#                 return TextLoader(file_path).load()
#         elif doc_type == "html":
#             return UnstructuredHTMLLoader(file_path).load()
#         else:
#             print(f"Unsupported document type: {doc_type}")
#             return []
#     except Exception as e:
#         print(f"Error loading {file_path}: {e}")
#         return []
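
The notebook branch above can be tried in isolation. Here is a condensed, runnable sketch of the same idea that works on a JSON string instead of a file. It assumes the usual notebook layout in which `source` and text outputs may be stored either as a single string or as a list of strings; `notebook_to_text` and the sample notebook are made up for illustration.

```python
import json

def _as_text(value):
    """Notebook JSON stores text either as one string or as a list of strings."""
    return value if isinstance(value, str) else "".join(value)

def notebook_to_text(notebook_json):
    """Flatten markdown cells, code cells, and plain-text outputs into one string."""
    notebook = json.loads(notebook_json)
    parts = []
    for number, cell in enumerate(notebook.get("cells", []), start=1):
        source = _as_text(cell.get("source", ""))
        if cell.get("cell_type") == "markdown":
            parts.append(f"[MARKDOWN CELL {number}]\n{source}")
        elif cell.get("cell_type") == "code":
            parts.append(f"[CODE CELL {number}]\n{source}")
            for output in cell.get("outputs", []):
                if "text" in output:
                    parts.append("[OUTPUT]\n" + _as_text(output["text"]))
                elif "text/plain" in output.get("data", {}):
                    parts.append("[OUTPUT]\n" + _as_text(output["data"]["text/plain"]))
    return "\n\n".join(parts)

# Tiny hypothetical notebook with one markdown cell and one code cell.
sample = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Intro text"]},
        {"cell_type": "code", "source": "1 + 1",
         "outputs": [{"output_type": "execute_result", "data": {"text/plain": ["2"]}}]},
    ]
})
text = notebook_to_text(sample)
print(text)
```

Normalizing both string and list forms avoids a `TypeError` when an output's `text/plain` value is a list.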

In [48]:
# # Fix Your DOCUMENTS Configuration
# # Make sure the file type for the notebook is correctly set:

# DOCUMENTS = [
#     ("https://raw.githubusercontent.com/langchain-ai/langchain/master/README.md", "langchain_readme.md", "markdown"),
#     # Change the file type from "text" to "notebook"
#     ("https://raw.githubusercontent.com/pinecone-io/examples/master/learn/generation/langchain/handbook/05-langchain-retrieval-augmentation.ipynb", "pinecone_example.ipynb", "notebook"),
#     ("https://arxiv.org/pdf/2005.11401.pdf", "neural_networks.pdf", "pdf"),
#     ("https://calibre-ebook.com/downloads/demos/demo.docx", "calibre_demo.docx", "docx"),
#     ("https://filesamples.com/samples/document/xlsx/sample1.xlsx", "sample_data.xlsx", "excel"),
#     ("https://www.w3.org/WAI/tutorials/page-structure/", "web_accessibility.html", "html"),
# ]
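
Since the notebook was originally mislabelled as "text", one way to avoid that class of mistake is to derive the type from the file name rather than typing it by hand. This is a sketch, not the notebook's method: the mapping reuses the type labels that appear in `DOCUMENTS`, and `infer_doc_type` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Extension-to-type mapping using the type labels from the DOCUMENTS list.
EXTENSION_TO_TYPE = {
    ".txt": "text",
    ".md": "markdown",
    ".pdf": "pdf",
    ".docx": "docx",
    ".xlsx": "excel",
    ".html": "html",
    ".ipynb": "notebook",
}

def infer_doc_type(filename, default="text"):
    """Pick a document type from the file extension, falling back to `default`."""
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTENSION_TO_TYPE.get(suffix, default)

for name in ["pinecone_example.ipynb", "sample_data.xlsx", "README"]:
    print(name, "->", infer_doc_type(name))
```

An explicit type in the configuration can still override the inferred one where the extension is misleading.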

```
The debug function will help you identify exactly what's happening with both file types, and the improved loading functions should properly handle your Excel and Jupyter Notebook files for better retrieval results.
```

```
The improved document loaders will help your RAG system properly process both Excel files and Jupyter notebooks by:

1. Using pandas to extract structured data from Excel spreadsheets
2. Converting tabular data into descriptive text that's better for semantic search
3. Adding rich metadata to help with retrieval accuracy
4. Using special chunking strategies for different document types

When you implement these changes, your system should be able to properly answer questions about all document types, including the Excel spreadsheet data.
```