# 🚀 Invoice Extraction Demo App - Google Colab

Run the Streamlit demo app in Google Colab!

## ⚠️ IMPORTANT: Select GPU Runtime First!

**Before running any cells:**
1. Click **Runtime** → **Change runtime type**
2. Select **T4 GPU** (required for the Qwen2-VL model)
3. Click **Save**
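
To confirm that the runtime really has a GPU attached before going further, a check along these lines can be run in a scratch cell. This is a minimal sketch rather than part of the repository: `gpu_available` is a hypothetical helper, and it assumes `nvidia-smi` is on the PATH of GPU runtimes, as is typically the case in Colab.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # no NVIDIA driver tools on this runtime
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L prints one line per detected GPU
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU detected" if gpu_available() else "No GPU found: select T4 GPU under Runtime")
```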

---

## 📋 Instructions:

1. **Run Cell 1**: Setup repository and install dependencies (~2-3 min)
2. **Run Cell 2**: Configure ngrok token (optional but recommended)
3. **Run Cell 3**: Start the app and get public URL
4. **Click the URL** to access your demo app!

**Note:** The first invoice extraction downloads the Qwen2-VL model (16GB), so be patient!

## Step 1: Setup Repository & Dependencies

This will:
- Clone the repository
- Install Python packages
- Install system dependencies (poppler for PDFs)
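
Once the setup cell below has finished, it can be useful to confirm that the command-line tools it relies on are actually on the PATH. The sketch below is an illustration only: `missing_tools` is a hypothetical helper, and `pdftoppm` is used as the probe for poppler because it ships with `poppler-utils`.

```python
import shutil


def missing_tools(names: list[str]) -> list[str]:
    """Return the subset of `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


# pdftoppm comes from poppler-utils; git and streamlit are used by the setup cell.
missing = missing_tools(["git", "pdftoppm", "streamlit"])
print("All tools found" if not missing else f"Missing: {', '.join(missing)}")
```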

In [None]:
import os

# Clone or update repository
if not os.path.exists('/content/orbit_challenge'):
    print("📥 Cloning repository...")
    !git clone https://github.com/marvin-schumann/orbit_challenge.git /content/orbit_challenge
else:
    print("📥 Repository exists, pulling latest changes...")
    %cd /content/orbit_challenge
    !git pull origin claude/capabilities-overview-01BzAZxMUjPBveeHos3gVvok

%cd /content/orbit_challenge

# Install dependencies quietly
print("\n📦 Installing dependencies (this may take 2-3 minutes)...")
!pip install -q streamlit==1.28.0 pyngrok requests 2>&1 | grep -v "already satisfied" || true
!pip install -q -r requirements.txt 2>&1 | grep -v "already satisfied" || true

# Install poppler for PDF support (stdout is redirected first so that stderr is silenced too)
!apt-get update -qq > /dev/null 2>&1
!apt-get install -y -qq poppler-utils > /dev/null 2>&1

# Verify installation
print("\n✅ Setup complete! Verifying...")
!streamlit --version
print(f"📁 Working directory: {os.getcwd()}")
print(f"📄 App file exists: {os.path.exists('app.py')}")
print("\n✅ Ready to proceed to Step 2!")

## Step 2: Configure Ngrok (Recommended)

**Option A: Use ngrok** (recommended - stable URL, no security warnings)
1. Sign up at https://ngrok.com (free)
2. Get your auth token from https://dashboard.ngrok.com/get-started/your-authtoken
3. Paste it below and run this cell

**Option B: Skip this cell** to use localtunnel instead (less stable, shows security warning)
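
A token pasted into a shared notebook is easy to leak. As an alternative sketch, the token could be read from an environment variable and used in place of the hard-coded assignment in the cell below. The helper `resolve_ngrok_token` and the use of an environment variable named `NGROK_AUTH_TOKEN` are assumptions for illustration; the repository does not define them.

```python
import os


def resolve_ngrok_token(explicit: str = "", env_var: str = "NGROK_AUTH_TOKEN") -> str:
    """Prefer an explicitly pasted token, otherwise fall back to the environment."""
    return explicit.strip() or os.environ.get(env_var, "").strip()


# An empty result means the notebook falls back to localtunnel.
NGROK_AUTH_TOKEN = resolve_ngrok_token()
print("ngrok" if NGROK_AUTH_TOKEN else "localtunnel", "will be used for the tunnel")
```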

In [None]:
# Paste your ngrok auth token here (or leave empty to use localtunnel)
NGROK_AUTH_TOKEN = ""  # Get from https://dashboard.ngrok.com/get-started/your-authtoken

if NGROK_AUTH_TOKEN:
    from pyngrok import ngrok, conf
    conf.get_default().auth_token = NGROK_AUTH_TOKEN
    print("✅ Ngrok configured!")
else:
    print("⚠️  No ngrok token provided - will use localtunnel instead")
    print("   (You'll see a security page - just click 'Continue')")

## Step 3: Run the App!

**This will:**
1. Start the Streamlit app in the background
2. Create a public tunnel (ngrok or localtunnel)
3. Show you the public URL

**⚠️ Important:**
- Keep this cell running while using the app
- First startup may take 1-2 minutes
- The first invoice extraction downloads the Qwen2-VL model (16GB), so be patient!

In [None]:
import subprocess
import time
import requests
from IPython.display import display, HTML

# Kill any existing streamlit processes
!pkill -9 -f streamlit 2>/dev/null || true
time.sleep(2)

# Check if ngrok is configured
USE_NGROK = 'NGROK_AUTH_TOKEN' in globals() and NGROK_AUTH_TOKEN

print("🚀 Starting Streamlit app...")
print("   This may take 30-60 seconds on first run\n")

# Start Streamlit as a background process with subprocess.Popen, logging to /tmp/streamlit.log
with open('/tmp/streamlit.log', 'w') as log:
    streamlit_process = subprocess.Popen(
        [
            'streamlit', 'run', '/content/orbit_challenge/app.py',
            '--server.port', '8501',
            '--server.headless', 'true',
            '--server.enableCORS', 'false',
            '--server.enableXsrfProtection', 'false',
            '--browser.gatherUsageStats', 'false'
        ],
        stdout=log,
        stderr=subprocess.STDOUT,
        cwd='/content/orbit_challenge'
    )

# Poll the health endpoint until Streamlit responds
print("⏳ Waiting for Streamlit to start...")
started = False
for i in range(90):  # up to 90 attempts, roughly one per second
    try:
        response = requests.get('http://localhost:8501/_stcore/health', timeout=2)
        if response.status_code == 200:
            print(f"✅ Streamlit started successfully! (took {i+1}s)\n")
            started = True
            break
    except requests.exceptions.RequestException:
        pass
    
    # Show progress every 10 seconds
    if (i + 1) % 10 == 0:
        print(f"   Still waiting... ({i+1}s elapsed)")
    
    time.sleep(1)

if not started:
    print("\n❌ Streamlit failed to start after 90 seconds")
    print("\n📋 Checking logs for errors:\n")
    !tail -50 /tmp/streamlit.log
    print("\n💡 Troubleshooting:")
    print("   1. Make sure you selected the T4 GPU runtime")
    print("   2. Try Runtime → Restart runtime and run all cells again")
    print("   3. Check if app.py has syntax errors: !python -m py_compile app.py")
    raise RuntimeError("Streamlit failed to start")

# Create tunnel
print("="*70)
print("🌐 Creating public tunnel...\n")

if USE_NGROK:
    # Use ngrok
    from pyngrok import ngrok
    
    # Kill any existing ngrok tunnels
    ngrok.kill()
    
    # In pyngrok 5+, ngrok.connect() returns a tunnel object; .public_url holds the URL string
    public_url = ngrok.connect(8501, bind_tls=True).public_url
    print("✅ APP IS LIVE!")
    print("="*70)
    print(f"\n🌐 Public URL (ngrok):\n   {public_url}")
    print("\n👇 Click the link below to access the app!\n")
    print("="*70)
    
    # Display clickable link
    display(HTML(f'''
    <div style="padding: 20px; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); border-radius: 10px; text-align: center;">
        <h2 style="color: white; margin: 0;">🚀 Invoice Extraction Demo App</h2>
        <a href="{public_url}" target="_blank" style="display: inline-block; margin-top: 15px; padding: 12px 30px; background: white; color: #667eea; text-decoration: none; border-radius: 5px; font-weight: bold; font-size: 16px;">
            Open App →
        </a>
    </div>
    '''))
    
else:
    # Use localtunnel
    print("📦 Using localtunnel (no ngrok token provided)...\n")
    !npm install -g localtunnel 2>&1 | grep -v "npm WARN" || true
    
    print("\n✅ APP IS LIVE!")
    print("="*70)
    print("\n🌐 Starting localtunnel...\n")
    print("⚠️  You'll see a URL like: https://******.loca.lt")
    print("⚠️  Click it, then click 'Continue' on the security page\n")
    print("="*70)
    print("\n👇 Your public URL will appear below:\n")
    
    # Run localtunnel (this will block and show the URL)
    !lt --port 8501

print("\n⏳ Keep this cell running while using the app")
print("   Press ⏹️ (stop button) to shutdown")

## 🔧 Troubleshooting

### "Streamlit failed to start"

**Solutions:**
1. **Check GPU**: Runtime → Change runtime type → T4 GPU
2. **Restart runtime**: Runtime → Restart runtime → Run all cells
3. **Check logs**: Look at the error output above for specific issues
4. **Verify files**: Make sure repository cloned correctly
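
For the log check in particular, the Step 3 cell writes Streamlit's output to `/tmp/streamlit.log`. The following sketch shows one way to pull the last lines of that file and surface the ones that look like errors. The `tail` helper and the `"Error"` / `"Traceback"` keywords are illustrative assumptions, not something the app defines.

```python
from collections import deque
from pathlib import Path


def tail(path: str, n: int = 50) -> list[str]:
    """Return the last `n` lines of a text file, or [] if it does not exist."""
    p = Path(path)
    if not p.is_file():
        return []
    with p.open(encoding="utf-8", errors="replace") as fh:
        # deque with maxlen keeps only the final n lines while streaming the file
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


for line in tail("/tmp/streamlit.log", n=50):
    if "Error" in line or "Traceback" in line:
        print(line)
```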

### "Can't access the URL"

**Solutions:**
1. Wait 10-15 seconds after the URL appears
2. Make sure Step 3 cell is still running (has spinning indicator)
3. For localtunnel: Click "Continue" on the security page
4. Try using ngrok instead (add token in Step 2)
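
When the public URL does not load, it helps to separate tunnel problems from app problems by checking whether anything is listening on Streamlit's local port (8501 in this notebook). A minimal sketch, assuming a plain TCP connect is a good enough probe; `port_is_open` is a hypothetical helper:

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 8501, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unreachable
        return False


print("Streamlit port reachable:", port_is_open())
```

If this prints `False`, the app itself is down and the Step 3 cell needs to be rerun. If it prints `True`, the tunnel is the more likely culprit.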

### "GPU out of memory"

**Solutions:**
1. Runtime → Restart runtime
2. Run all cells again
3. In the app sidebar: Disable "Use Qwen2-VL" and use Claude API only
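
To see how much GPU memory is in use before and after a restart, `nvidia-smi` can be queried from Python. This is a sketch under the assumption that `nvidia-smi` is available on the runtime; `parse_gpu_memory` and `gpu_memory_mib` are hypothetical helpers, and only the first GPU is reported.

```python
import shutil
import subprocess
from typing import Optional, Tuple


def parse_gpu_memory(csv_line: str) -> Tuple[int, int]:
    """Parse 'used, total' (in MiB) from one line of nvidia-smi CSV output."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def gpu_memory_mib() -> Optional[Tuple[int, int]]:
    """Return (used, total) MiB for the first GPU, or None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=10,
        )
        lines = result.stdout.strip().splitlines()
        if result.returncode != 0 or not lines:
            return None
        return parse_gpu_memory(lines[0])
    except (OSError, subprocess.TimeoutExpired, ValueError):
        return None


print("GPU memory (used, total) in MiB:", gpu_memory_mib())
```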

### "Model download is slow"

**This is normal!** Qwen2-VL is 16GB and downloads on first extraction:
- Be patient (5-10 minutes on Colab)
- Model is cached for future runs
- Alternative: Use "Claude API Only" mode in app sidebar
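
Because the download is large, it is worth checking free disk space before the first extraction. The sketch below uses the 16GB figure quoted above. The 2 GiB safety margin, the choice of `/` as the path to inspect, and the helper names are assumptions for illustration.

```python
import shutil


def free_gib(path: str = "/") -> float:
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def has_room_for_model(path: str = "/", needed_gib: float = 16.0,
                       margin_gib: float = 2.0) -> bool:
    """Return True if the model plus a safety margin fits on the disk."""
    return free_gib(path) >= needed_gib + margin_gib


print(f"Free disk space: {free_gib():.1f} GiB")
print("Enough room for the model:", has_room_for_model())
```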

---

## 📚 Next Steps

Once the app is running:

1. **Configure** extraction methods in sidebar (Qwen2-VL + Claude API)
2. **Upload** sample invoices (PDF, PNG, or JPG)
3. **Extract** data and watch real-time progress
4. **View** results with color-coded sources
5. **Download** CSV export

**Sample invoices** are in the `Invoices/` folder if you cloned with data.
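
To see which sample files are available for upload, the folder can be listed from a notebook cell. This is a sketch only: the absolute path assumes the clone location used in Step 1, `find_invoices` is a hypothetical helper, and accepting `.jpeg` alongside `.jpg` is an assumption.

```python
from pathlib import Path

# Extensions matching the upload formats listed above (PDF, PNG, JPG)
SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg"}


def find_invoices(folder: str) -> list[Path]:
    """Return supported invoice files in `folder`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


samples = find_invoices("/content/orbit_challenge/Invoices")
print(f"Found {len(samples)} sample invoice(s)")
for path in samples:
    print("  ", path.name)
```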

**For presentations**: Use the "📊 Results" and "💡 How It Works" tabs!