# 📚 Documentation Processing Pipeline

This notebook runs every step needed to process a documentation site: crawl → chunk → embed and upload → save completion status.

## Overview
- **Step 1**: Crawl documentation from URLs
- **Step 2**: Chunk documents into smaller pieces
- **Step 3**: Create embeddings and upload to Pinecone
- **Step 4**: Save the completion status so the documentation appears in the frontend

## Prerequisites
Make sure you have:
- `.env` file with your API keys
- Appwrite database and storage set up
- Pinecone index created
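
Before running the pipeline, it can help to confirm that the keys from `.env` were actually loaded. The sketch below is one way to do that; the variable names in `REQUIRED_ENV_VARS` are assumptions for illustration, not names taken from this project, so replace them with whatever your `.env` defines.

```python
import os

# Hypothetical variable names; replace with the keys your .env actually defines.
REQUIRED_ENV_VARS = (
    "PINECONE_API_KEY",
    "APPWRITE_ENDPOINT",
    "APPWRITE_PROJECT_ID",
    "APPWRITE_API_KEY",
)


def missing_env_vars(required=REQUIRED_ENV_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars()` right after `load_dotenv()` returns an empty list when every expected key is present, and otherwise lists the names still to be set.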


In [9]:
!pip install -r requirements.txt

Collecting langchain-pinecone (from -r requirements.txt (line 2))
  Downloading langchain_pinecone-0.2.8-py3-none-any.whl.metadata (5.3 kB)
Collecting langchain-huggingface (from -r requirements.txt (line 3))
  Downloading langchain_huggingface-0.3.0-py3-none-any.whl.metadata (996 bytes)
Collecting langchain-openai (from -r requirements.txt (line 5))
  Downloading langchain_openai-0.3.27-py3-none-any.whl.metadata (2.3 kB)
Collecting rank_bm25 (from -r requirements.txt (line 11))
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Collecting pinecone-client (from -r requirements.txt (line 19))
  Downloading pinecone_client-6.0.0-py3-none-any.whl.metadata (3.4 kB)
Collecting pinecone<8.0.0,>=6.0.0 (from pinecone[asyncio]<8.0.0,>=6.0.0->langchain-pinecone->-r requirements.txt (line 2))
  Downloading pinecone-7.3.0-py3-none-any.whl.metadata (9.5 kB)
Collecting langchain-tests<1.0.0,>=0.3.7 (from langchain-pinecone->-r requirements.txt (line 2))
  Downloading langchain_tests-0.

In [4]:
import os
import sys
import json
import time
import logging
from dotenv import load_dotenv
from appwrite_service import appwrite_service

# Load environment variables
load_dotenv()

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("✅ Setup complete")


✅ Setup complete


In [3]:
# Documentation URL to process
DOCUMENTATION_URL = "https://react.dev/learn"

print(f"📚 Processing: {DOCUMENTATION_URL}")


📚 Processing: https://react.dev/learn


## Step 1: Crawl Documentation

This step crawls the documentation website and extracts all the content.
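
The `crawl_docs` module is not shown in this notebook, so the following is only a sketch of one piece a crawler of this kind typically needs: collecting the links on a page that stay inside the documentation root. The function name `extract_doc_links` and the prefix-based scoping rule are assumptions, not the project's actual implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_doc_links(html, page_url, root_url):
    """Return absolute, fragment-free links under root_url, in first-seen order."""
    collector = _LinkCollector()
    collector.feed(html)
    root = urlparse(root_url)
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links against the current page and drop "#anchor" parts.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        # Simple scoping rule (an assumption): same host, path starts with the root path.
        in_scope = parsed.netloc == root.netloc and parsed.path.startswith(root.path)
        if in_scope and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

A crawler could call this on each fetched page and queue the returned URLs, which keeps it from wandering off to external sites or unrelated sections.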

In [5]:
# Import crawling functions
from crawl_docs import crawl_documentation

print("🕷️  Step 1: Crawling Documentation")
print("=" * 50)
print(f"📚 URL: {DOCUMENTATION_URL}")

# Check if raw docs already exist
if appwrite_service.docs_already_exist(DOCUMENTATION_URL):
    print("✅ Raw documents already exist, skipping crawl...")
    crawl_success = True
else:
    # Start crawling
    start_time = time.time()
    crawl_success = crawl_documentation(DOCUMENTATION_URL)
    crawl_time = time.time() - start_time

    if crawl_success:
        print(f"✅ Crawl completed successfully in {crawl_time:.2f} seconds!")
        print("📄 Raw documents saved to storage")
    else:
        print("❌ Crawl failed!")

print(f"📊 Crawl step result: {'✅ Success' if crawl_success else '❌ Failed'}")


🕷️  Step 1: Crawling Documentation
📚 URL: https://react.dev/learn
✅ Raw documents already exist, skipping crawl...
📊 Crawl step result: ✅ Success


## Step 2: Chunk Documents

This step splits the raw documents into smaller chunks so that each one can be embedded and retrieved on its own.
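
The `chunk_docs` module is not shown here, so its splitting strategy is unknown. As a rough illustration of the idea, the sketch below splits text into fixed-size character windows with some overlap between neighbours; the function name `chunk_text` and the default sizes are assumptions, and a real splitter may well cut on headings, sentences, or tokens instead.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Each chunk after the first starts `overlap` characters before the
    previous chunk ended, so context at the boundaries is not lost.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap trades some duplicated storage for a better chance that a sentence spanning a boundary appears whole in at least one chunk.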


In [6]:
# Import chunking functions
from chunk_docs import chunk_and_save_docs

print("🔪 Step 2: Chunking Documents")
print("=" * 50)
print(f"📚 URL: {DOCUMENTATION_URL}")

# Check if chunks already exist
if appwrite_service.chunks_already_exist(DOCUMENTATION_URL):
    print("✅ Chunks already exist, skipping chunking...")
    chunk_success = True
else:
    # Start chunking
    start_time = time.time()
    chunk_success = chunk_and_save_docs(DOCUMENTATION_URL)
    chunk_time = time.time() - start_time

    if chunk_success:
        print(f"✅ Chunking completed successfully in {chunk_time:.2f} seconds!")
        print("🔪 Documents chunked and saved to storage")
    else:
        print("❌ Chunking failed!")

print(f"📊 Chunk step result: {'✅ Success' if chunk_success else '❌ Failed'}")


🔪 Step 2: Chunking Documents
📚 URL: https://react.dev/learn
✅ Chunks already exist, skipping chunking...
📊 Chunk step result: ✅ Success


## Step 3: Create Embeddings and Upload

This step creates embeddings for the chunks and uploads them to Pinecone for vector search.
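
How `embed_upload` sends vectors is not shown in this notebook. With thousands of chunks, uploads of this kind are usually sent in batches rather than one request per chunk or one giant request. The sketch below illustrates that pattern only; `upload_in_batches`, the `upsert` callable, and the batch size of 100 are all hypothetical and do not describe the project's code or any specific Pinecone API.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def upload_in_batches(records, upsert, batch_size=100):
    """Call upsert(batch) once per batch and return how many records were sent.

    `upsert` is a hypothetical callable standing in for whatever client
    method actually writes vectors to the index.
    """
    sent = 0
    for batch in batched(records, batch_size):
        upsert(batch)
        sent += len(batch)
    return sent
```

For scale: if the 15841 chunks reported in Step 4 were sent 100 at a time, this would make 159 calls.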

In [10]:
# Import embedding functions
from embed_upload import embed_and_upload_chunks

print("🧠 Step 3: Creating Embeddings and Uploading")
print("=" * 50)
print(f"📚 URL: {DOCUMENTATION_URL}")

# Start embedding and uploading
start_time = time.time()
embed_success = embed_and_upload_chunks(DOCUMENTATION_URL)
embed_time = time.time() - start_time

if embed_success:
    print(f"✅ Embedding and upload completed successfully in {embed_time:.2f} seconds!")
    print("🧠 Embeddings created and uploaded to Pinecone")
else:
    print("❌ Embedding and upload failed!")

print(f"📊 Embed step result: {'✅ Success' if embed_success else '❌ Failed'}")


🧠 Step 3: Creating Embeddings and Uploading
📚 URL: https://react.dev/learn


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✅ Embedding and upload completed successfully in 585.43 seconds!
🧠 Embeddings created and uploaded to Pinecone
📊 Embed step result: ✅ Success


## Step 4: Save Completion Status

This step saves the completion status to the database so the documentation appears as available in the frontend.

In [11]:
print("💾 Step 4: Saving Completion Status")
print("=" * 50)
print(f"📚 URL: {DOCUMENTATION_URL}")

# Get chunk count for completion status
chunks = appwrite_service.get_all_chunks(DOCUMENTATION_URL)
chunks_count = len(chunks) if chunks else 0
print(f"📊 Found {chunks_count} chunks")

# Save completion status
if embed_success and chunks_count > 0:
    completion_success = appwrite_service.save_completion_status(DOCUMENTATION_URL, chunks_count)

    if completion_success:
        print("✅ Completion status saved successfully!")
        print("📝 Documentation is now available for chat")
    else:
        print("❌ Failed to save completion status")
else:
    print("⚠️  Skipping completion status - embedding failed or no chunks found")
    completion_success = False

print(f"📊 Completion step result: {'✅ Success' if completion_success else '❌ Failed'}")


💾 Step 4: Saving Completion Status
📚 URL: https://react.dev/learn
📊 Found 15841 chunks
✅ Completion status saved successfully!
📝 Documentation is now available for chat
📊 Completion step result: ✅ Success


## Final Results

Let's check the overall pipeline results and verify everything worked correctly.


In [12]:
print("🎉 Pipeline Results")
print("=" * 50)

# Check all steps
steps = {
    "Crawl": crawl_success,
    "Chunk": chunk_success,
    "Embed": embed_success,
    "Completion": completion_success
}

all_success = all(steps.values())

print("📊 Step Results:")
for step, success in steps.items():
    status = "✅ Success" if success else "❌ Failed"
    print(f"   {step}: {status}")

print(f"\n🎯 Overall Result: {'✅ SUCCESS' if all_success else '❌ FAILED'}")

if all_success:
    print(f"\n🎉 Congratulations! {DOCUMENTATION_URL} is now fully processed and available for chat!")
    print("📝 You can now use this documentation in the frontend interface.")
else:
    print("\n⚠️  Some steps failed. Please check the logs above and try again.")

# Verify completion status
is_available = appwrite_service.is_fully_processed(DOCUMENTATION_URL)
print(f"\n🔍 Verification: {'✅ Available for chat' if is_available else '❌ Not available'}")


🎉 Pipeline Results
📊 Step Results:
   Crawl: ✅ Success
   Chunk: ✅ Success
   Embed: ✅ Success
   Completion: ✅ Success

🎯 Overall Result: ✅ SUCCESS

🎉 Congratulations! https://react.dev/learn is now fully processed and available for chat!
📝 You can now use this documentation in the frontend interface.

🔍 Verification: ✅ Available for chat


## Utility Functions

The utility function below reports which pipeline stages have already completed for a given URL.


In [13]:
def check_documentation_status(url):
    """Check the current status of documentation processing"""
    print(f"\n📊 Status Check for: {url}")
    print("=" * 50)

    # Check raw docs
    raw_exists = appwrite_service.docs_already_exist(url)
    print(f"📄 Raw documents: {'✅ Exists' if raw_exists else '❌ Not found'}")

    # Check chunks
    chunks_exist = appwrite_service.chunks_already_exist(url)
    print(f"🔪 Chunks: {'✅ Exists' if chunks_exist else '❌ Not found'}")

    # Check completion status
    is_processed = appwrite_service.is_fully_processed(url)
    print(f"✅ Completion status: {'✅ Exists' if is_processed else '❌ Not found'}")

    # Overall status
    if is_processed:
        print("\n🎉 Documentation is fully processed and ready for questions!")
    elif raw_exists and chunks_exist:
        print("\n⚠️  Raw documents and chunks exist but need embedding.")
    elif raw_exists:
        print("\n⚠️  Raw documents exist but need chunking and embedding.")
    else:
        print("\n❌ Documentation needs to be crawled first.")

# Check status of current documentation
check_documentation_status(DOCUMENTATION_URL)



📊 Status Check for: https://react.dev/learn
📄 Raw documents: ✅ Exists
🔪 Chunks: ✅ Exists
✅ Completion status: ✅ Exists

🎉 Documentation is fully processed and ready for questions!


## Process Different Documentation

To process a different documentation set, change `DOCUMENTATION_URL` and run the pipeline cells again.

In [None]:
# Predefined documentation sets
PREDEFINED_DOCS = {
    "React": "https://react.dev/learn",
    "Python": "https://docs.python.org/3/",
    "Node.js": "https://nodejs.org/en/docs/",
    "Vue.js": "https://vuejs.org/guide/",
    "Django": "https://docs.djangoproject.com/en/stable/",
    "FastAPI": "https://fastapi.tiangolo.com/",
    "MongoDB": "https://docs.mongodb.com/",
    "PostgreSQL": "https://www.postgresql.org/docs/",
}

print("📚 Available Documentation Sets:")
for name, url in PREDEFINED_DOCS.items():
    is_available = appwrite_service.is_fully_processed(url)
    status = "✅ Available" if is_available else "❌ Not Available"
    print(f"  {name}: {status}")

print("\n💡 To process a different documentation set:")
print("1. Change DOCUMENTATION_URL to the desired URL")
print("2. Run the pipeline cells again")


## Summary

This notebook provides a complete pipeline for processing documentation:

1. **Crawl**: Extract content from documentation websites
2. **Chunk**: Split content into manageable pieces
3. **Embed**: Create vector embeddings and upload to Pinecone
4. **Complete**: Save completion status for frontend availability
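
The cells in this notebook run each stage independently, so a later stage still executes if an earlier one reported failure. If you would rather stop at the first failure, the ordering can be made explicit with a small runner. This is a sketch under that assumption; `run_pipeline` is a hypothetical helper, not part of the project.

```python
def run_pipeline(url, steps):
    """Run (name, func) steps in order and stop at the first failure.

    Each func is called with `url` and is expected to return a truthy value
    on success. The result maps every attempted step name to True or False;
    steps after a failure are not attempted and do not appear in the result.
    """
    results = {}
    for name, step in steps:
        results[name] = bool(step(url))
        if not results[name]:
            break
    return results


# Possible use with the functions imported earlier in this notebook:
# results = run_pipeline(DOCUMENTATION_URL, [
#     ("Crawl", crawl_documentation),
#     ("Chunk", chunk_and_save_docs),
#     ("Embed", embed_and_upload_chunks),
# ])
```

Unlike the notebook cells, this runner does not check whether raw documents or chunks already exist, so it would redo work that the cells skip.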

### Usage:
- Change `DOCUMENTATION_URL` to process different documentation
- Run cells in order from top to bottom
- Check results and status using utility functions

### Next Steps:
- Run the frontend app to chat with processed documentation
- Process additional documentation sets as needed
- Monitor and maintain the pipeline
