# PageIndex - HuggingFace Edition
## Build Document Tree Structures with Free Local Models

This notebook demonstrates how to use PageIndex with free HuggingFace models instead of the OpenAI API.

**Features:**
- ✅ No OpenAI API key required
- ✅ Free and open-source models
- ✅ Run completely locally
- ✅ Good for document indexing and RAG systems

## Step 1: Install Dependencies

In [None]:
!pip install -q pymupdf pyyaml transformers torch accelerate sentencepiece protobuf
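
After the install cell finishes, it can help to confirm that each package is importable before moving on. The sketch below uses only the standard library. The import names are the usual ones for these packages (for example, `pymupdf` is imported as `fitz`), and the helper names `is_importable` and `missing_packages` are illustrative, not part of PageIndex.

```python
import importlib.util

# Map each pip package from the install cell to the module name it is imported as.
PIP_TO_MODULE = {
    "pymupdf": "fitz",
    "pyyaml": "yaml",
    "transformers": "transformers",
    "torch": "torch",
    "accelerate": "accelerate",
    "sentencepiece": "sentencepiece",
    "protobuf": "google.protobuf",
}


def is_importable(module_name):
    """Return True if Python can locate the module."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. "google") raises instead of returning None.
        return False


def missing_packages(mapping=PIP_TO_MODULE):
    """List the pip package names whose modules cannot be found."""
    return [pip_name for pip_name, module in mapping.items() if not is_importable(module)]


print("Missing packages:", missing_packages() or "none")
```

If the list is not empty, re-running the install cell (or restarting the runtime) is usually the first thing to try.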

## Step 2: Clone Repository (or Upload Files)

You can either clone from GitHub or upload the files manually.

In [None]:
# Option 1: Clone from GitHub (replace with your repo)
# !git clone https://github.com/your-username/pageindex-hf.git
# %cd pageindex-hf

# Option 2: Use uploaded files
import os
print("Current directory:", os.getcwd())
print("Files:", os.listdir())

## Step 3: Upload Your PDF

Upload the PDF you want to process.

In [None]:
from google.colab import files
import shutil

# Upload PDF
print("Please upload your PDF file:")
uploaded = files.upload()

# Get the uploaded filename
pdf_path = list(uploaded.keys())[0]
print(f"\n✅ Uploaded: {pdf_path}")
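
Because the upload widget accepts any file type, a quick sanity check on the uploaded file can save a long model run on the wrong input. The sketch below looks for the `%PDF-` marker near the start of the file. The helper name `looks_like_pdf` is hypothetical, and scanning the first 1024 bytes rather than only the first five is a leniency choice, not something PageIndex requires.

```python
def looks_like_pdf(path, scan_bytes=1024):
    """Return True if the '%PDF-' header marker appears near the start of the file."""
    with open(path, "rb") as f:
        head = f.read(scan_bytes)
    return b"%PDF-" in head


# Typical use after the upload cell (pdf_path comes from that cell):
# if not looks_like_pdf(pdf_path):
#     raise ValueError(f"{pdf_path} does not look like a PDF")
```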

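## Step 4: Check Required Files

Check that the configuration and Python files PageIndex needs are present. The cell below only verifies them; if any are missing, upload them or clone the repository in Step 2.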

In [None]:
import os

# Check that the files PageIndex needs are in the working directory.
# This cell does not create them; upload them or clone the repository first.
required_files = ['config.yaml', 'utils.py', 'page_index.py', 'run_pageindex.py']
missing_files = [f for f in required_files if not os.path.exists(f)]

if missing_files:
    print(f"❌ Missing files: {missing_files}")
    print("Please upload these files or clone the repository.")
else:
    print("✅ All required files present!")

## Step 5: Run PageIndex

Generate the tree structure for your PDF.

**Note:** The first run downloads the model weights (about 14 GB for Mistral-7B), which may take 5-10 minutes.
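
The download size follows from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes roughly 7.2 billion parameters stored as 16-bit floats (2 bytes each); both figures are approximations, and the helper `weights_size_gb` is illustrative only.

```python
def weights_size_gb(num_params, bytes_per_param=2):
    """Approximate size of model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: about 7.2 billion parameters, stored as 16-bit floats (2 bytes each).
print(f"{weights_size_gb(7.2e9):.1f} GB")
```

The same estimate is a useful lower bound on the GPU memory needed to hold the weights, before activations and caches are counted.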

In [None]:
# Option 1: Use command line (recommended)
!python run_pageindex.py \
    --pdf_path "{pdf_path}" \
    --model "mistralai/Mistral-7B-Instruct-v0.2" \
    --device cuda \
    --max-pages-per-node 10
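
The command above hard-codes `--device cuda`, which presumably needs a GPU runtime. A minimal sketch for choosing the device at run time is shown below. It relies on PyTorch's `torch.cuda.is_available()`; whether `run_pageindex.py` accepts `--device cpu` is an assumption, since the notebook only shows `cuda`. The helper name `pick_device` is hypothetical.

```python
def pick_device():
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is no GPU backend to use.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Using device:", device)
```

In the command cell, `--device {device}` would then pass the detected value. Expect a 7B model to run very slowly on CPU.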

## Step 6: View Results

Load and display the generated tree structure.

In [None]:
import json
import os

# Load the output JSON (assumes the default name <pdf name>_pageindex.json)
output_file = os.path.splitext(pdf_path)[0] + '_pageindex.json'

with open(output_file, 'r', encoding='utf-8') as f:
    tree = json.load(f)

# Display summary
print("=" * 60)
print("📄 DOCUMENT SUMMARY")
print("=" * 60)
print(f"Description: {tree.get('document_description', 'N/A')}")
print(f"Total Pages: {tree.get('total_pages', 'N/A')}")
print(f"Total Nodes: {len(tree.get('nodes', []))}")
print("\n" + "=" * 60)
print("🌲 TREE STRUCTURE")
print("=" * 60)

# Display nodes
for i, node in enumerate(tree.get('nodes', [])[:10]):  # Show first 10 nodes
    print(f"\n{i+1}. {node.get('title', 'Untitled')}")
    print(f"   Pages: {node.get('start_index', 'N/A')} - {node.get('end_index', 'N/A')}")
    if 'node_id' in node:
        print(f"   ID: {node['node_id']}")
    if 'summary' in node:
        summary = node['summary'][:150] + "..." if len(node['summary']) > 150 else node['summary']
        print(f"   Summary: {summary}")

if len(tree.get('nodes', [])) > 10:
    print(f"\n... and {len(tree['nodes']) - 10} more nodes")
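
The cell above lists only the top-level entries of `tree['nodes']`. If the generated tree nests subsections, a recursive walk shows the full hierarchy. The sketch below assumes that child nodes, when present, are stored as a list under a `nodes` key on each node; that layout is an assumption about the output format, and `sample_tree` is made-up data for illustration.

```python
def walk(nodes, depth=0):
    """Yield (depth, node) pairs in document order, descending into children."""
    for node in nodes:
        yield depth, node
        # Assumption: children, if any, are stored as a list under "nodes".
        yield from walk(node.get("nodes", []), depth + 1)


# Hypothetical miniature tree, used only to illustrate the traversal.
sample_tree = {
    "nodes": [
        {"title": "Introduction", "start_index": 1, "end_index": 2},
        {"title": "Methods", "start_index": 3, "end_index": 8, "nodes": [
            {"title": "Data", "start_index": 3, "end_index": 5},
        ]},
    ]
}

for depth, node in walk(sample_tree["nodes"]):
    indent = "  " * depth
    print(f"{indent}{node['title']} (pages {node['start_index']}-{node['end_index']})")
```

Calling `walk(tree['nodes'])` on the loaded output works the same way, and falls back to a flat listing when no node has children.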

## Step 7: Download Results

Download the generated JSON file.

In [None]:
import os
from google.colab import files

# Download the output (assumes the default name <pdf name>_pageindex.json)
output_file = os.path.splitext(pdf_path)[0] + '_pageindex.json'
files.download(output_file)
print(f"✅ Downloaded: {output_file}")

## Optional: Use Programmatically

You can also use PageIndex as a Python module.

In [None]:
# Import the module
from page_index import build_pageindex
from utils import ConfigLoader

# Load configuration
config_loader = ConfigLoader()
config = config_loader.load()

# Build index
tree = build_pageindex(
    pdf_path=pdf_path,
    config=config,
    output_path="my_custom_output.json"
)

print(f"Generated tree with {len(tree['nodes'])} nodes")

## Optional: Try Different Models

Experiment with different HuggingFace models.

In [None]:
# Try a different model
alternative_model = "HuggingFaceH4/zephyr-7b-beta"

!python run_pageindex.py \
    --pdf_path "{pdf_path}" \
    --model {alternative_model} \
    --output "output_zephyr.json" \
    --device cuda

## Simple RAG Example

Use the tree structure for basic retrieval.

In [None]:
def simple_retrieve(query, tree, top_k=3):
    """Simple keyword-based retrieval from tree structure"""
    query_words = set(query.lower().split())
    
    results = []
    for node in tree['nodes']:
        # Score based on keyword overlap
        title_words = set(node.get('title', '').lower().split())
        summary_words = set(node.get('summary', '').lower().split())
        
        overlap = len(query_words & (title_words | summary_words))
        
        if overlap > 0:
            results.append({
                'node': node,
                'score': overlap
            })
    
    # Sort by score and return top k
    results.sort(key=lambda x: x['score'], reverse=True)
    return results[:top_k]

# Example query
query = "financial results revenue"
results = simple_retrieve(query, tree, top_k=3)

print(f"\n🔍 Query: '{query}'")
print("=" * 60)
for i, result in enumerate(results, 1):
    node = result['node']
    print(f"\n{i}. {node.get('title', 'Untitled')} (Score: {result['score']})")
    print(f"   Pages: {node.get('start_index')}-{node.get('end_index')}")
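    if 'summary' in node:
        summary = node['summary']
        # Append an ellipsis only when the summary is actually truncated
        print(f"   Summary: {summary[:200]}{'...' if len(summary) > 200 else ''}")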
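
Splitting on whitespace, as `simple_retrieve` does, leaves punctuation attached to words, so "Revenue," in a summary never matches "revenue" in a query. One possible refinement, sketched below, extracts alphanumeric words with a regular expression before computing the overlap. The helpers `tokenize` and `overlap_score` and the sample node are illustrative, not part of PageIndex.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, node):
    """Count query words that appear in the node's title or summary."""
    node_words = tokenize(node.get("title", "")) | tokenize(node.get("summary", ""))
    return len(tokenize(query) & node_words)


# Hypothetical node, used only to show the effect of stripping punctuation.
node = {"title": "Financial Results", "summary": "Revenue, costs, and margins."}
print(overlap_score("financial results revenue", node))
```

For this node the score is 3, whereas whitespace splitting would give 2 because of the comma after "Revenue".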

## 📚 Next Steps

1. **Integrate with your RAG system**: Use the tree structure for context retrieval
2. **Try different models**: Experiment with various HuggingFace models
3. **Customize configuration**: Adjust settings in `config.yaml`
4. **Process multiple PDFs**: Loop through a directory of PDFs
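
For the last step, processing a directory can be sketched as a loop that derives one output path per PDF. The helper `index_directory` and the default folder name `pageindex_out` are hypothetical. The builder is passed in as a function so the sketch runs on its own; with this notebook's module it could wrap `build_pageindex`, using the keyword arguments shown in the programmatic example.

```python
from pathlib import Path


def index_directory(pdf_dir, build_fn, output_dir="pageindex_out"):
    """Run build_fn on every PDF in pdf_dir and return {pdf name: output path}."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}
    # Note: "*.pdf" is case-sensitive on Linux, so "REPORT.PDF" is not matched.
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        out_path = out_dir / f"{pdf.stem}_pageindex.json"
        try:
            build_fn(str(pdf), str(out_path))
            outputs[pdf.name] = str(out_path)
        except Exception as exc:
            # Keep going if a single document fails.
            print(f"Skipping {pdf.name}: {exc}")
    return outputs


# Possible use with this notebook's module (names taken from the cells above):
# results = index_directory(
#     "pdfs",
#     lambda pdf, out: build_pageindex(pdf_path=pdf, config=config, output_path=out),
# )
```

Catching the exception per file means one malformed PDF does not stop the whole batch; the returned dictionary lists only the documents that succeeded.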

## 🔗 Resources

- [Original PageIndex](https://github.com/VectifyAI/PageIndex)
- [HuggingFace Models](https://huggingface.co/models)
- [PageIndex Documentation](https://docs.pageindex.ai)