# 🚀 Quick Start: Invoice Extraction on Google Colab

This notebook sets up everything automatically and runs the enhanced extraction.

**To use**:
1. Open this notebook in Google Colab
2. Enable GPU: `Runtime` → `Change runtime type` → `GPU`
3. Run all cells in order

---

## Step 1: Clone Repository & Setup

In [None]:
# Clone your repository
!git clone https://github.com/marvin-schumann/orbit_challenge.git
%cd orbit_challenge

# Checkout the enhanced branch
!git checkout claude/capabilities-overview-01BzAZxMUjPBveeHos3gVvok

print("\n✅ Repository cloned and branch checked out!")

## Step 2: Verify Files

In [None]:
import os
from pathlib import Path

# Check current directory
print(f"📂 Current directory: {os.getcwd()}")
print("\n📄 Notebooks found:")
!ls -lh *.ipynb

print("\n📋 Invoices found:")
invoice_dir = Path("Invoices")
if invoice_dir.exists():
    invoices = list(invoice_dir.glob("*"))
    print(f"Found {len(invoices)} files in Invoices/")
    for inv in invoices:
        print(f"  - {inv.name}")
else:
    print("⚠️  Invoices directory not found!")
    print("You may need to upload your invoice files manually.")
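
If the `Invoices` directory is missing, one option is to upload the files through Colab's upload widget and write them into that directory. The sketch below assumes `google.colab.files.upload()` returns a dict mapping each filename to its bytes; `save_uploaded` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path
from typing import Dict, List


def save_uploaded(uploaded: Dict[str, bytes], target_dir: str = "Invoices") -> List[Path]:
    """Write uploaded files (name -> bytes) into target_dir and return their paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, content in uploaded.items():
        # Keep only the base name so an odd filename cannot escape target_dir
        path = target / Path(name).name
        path.write_bytes(content)
        saved.append(path)
    return saved


# In Colab (assumption: this runs inside a Colab runtime):
# from google.colab import files
# saved = save_uploaded(files.upload())
# print(f"Saved {len(saved)} files to Invoices/")
```

Re-running the Step 2 cell afterwards should then list the uploaded invoices.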

## Step 3: Check GPU Availability

In [None]:
import torch

if torch.cuda.is_available():
    print(f"✅ GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    !nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
else:
    print("⚠️  No GPU detected!")
    print("   Go to: Runtime → Change runtime type → Hardware accelerator → GPU")
    print("   Then re-run this notebook from Step 1 (changing the runtime type resets the session).")

## Step 4: Run Enhanced Extraction

The first cell below rewrites `INVOICE_DIR` in `exercise_v04_enhanced.ipynb` to the Colab path; the second cell runs that notebook with all the improvements.

In [None]:
# Update the notebook to use correct Colab paths
import json

notebook_path = "exercise_v04_enhanced.ipynb"

# Read the notebook
with open(notebook_path, 'r', encoding='utf-8') as f:
    nb = json.load(f)

# Update the INVOICE_DIR path in the extraction cell
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = ''.join(cell['source'])
        if 'INVOICE_DIR = Path' in source:
            # Update path for Colab
            cell['source'] = [line.replace(
                'INVOICE_DIR = Path("/content/Invoices")',
                'INVOICE_DIR = Path("/content/orbit_challenge/Invoices")'
            ).replace(
                'INVOICE_DIR = Path("/Users/marvinschumann/orbit_challenge/Invoices")',
                'INVOICE_DIR = Path("/content/orbit_challenge/Invoices")'
            ) for line in cell['source']]

# Save updated notebook
with open(notebook_path, 'w', encoding='utf-8') as f:
    json.dump(nb, f)

print("✅ Notebook paths updated for Colab")
print("\n🚀 Starting extraction...\n")
print("="*70)
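
The cell above only matches two hard-coded paths. A more general variant is sketched here, assuming the target notebook assigns `INVOICE_DIR = Path("...")` on a single line. It rewrites whatever path is present using a regular expression. `patch_invoice_dir` is a hypothetical helper name, not something the repository provides.

```python
import re

# Matches INVOICE_DIR = Path("...") or Path('...'), capturing the text around the path
_INVOICE_DIR_RE = re.compile(r'(INVOICE_DIR\s*=\s*Path\()(["\']).*?\2(\))')


def patch_invoice_dir(nb: dict, new_dir: str) -> int:
    """Point every INVOICE_DIR assignment in code cells at new_dir; return lines changed."""
    changed = 0
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell["source"]
        # The notebook JSON allows source to be one string or a list of lines
        lines = source.splitlines(keepends=True) if isinstance(source, str) else source
        patched = []
        for line in lines:
            # A function replacement avoids backslash-escape surprises in new_dir
            new_line = _INVOICE_DIR_RE.sub(
                lambda m: f'{m.group(1)}"{new_dir}"{m.group(3)}', line
            )
            changed += new_line != line
            patched.append(new_line)
        cell["source"] = patched
    return changed
```

A return value of 0 means no assignment matched (or the path was already correct), which is worth checking before reporting that the paths were updated.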

In [None]:
# Run the enhanced extraction notebook
%run exercise_v04_enhanced.ipynb

## Step 5: Review Results

Once Step 4 finishes, the `df` DataFrame should hold your extracted invoice data.

In [None]:
# Display results
print("📊 Extraction Results:")
print("="*70)
print(df.to_string(index=False))

print("\n📈 Summary:")
print(f"Total invoices: {len(df)}")
print(f"Fields extracted: {list(df.columns)}")

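# Check completeness: a field counts as empty if it is missing (NaN/None),
# an empty string, or the "00000000000" value
empty_count = int((df.isna() | df.isin(["", "00000000000"])).sum().sum())
total_fields = len(df) * len(df.columns)
# Guard against an empty DataFrame to avoid dividing by zero
completeness = ((total_fields - empty_count) / total_fields) * 100 if total_fields else 0.0

print(f"\n✅ Completeness: {completeness:.1f}%")
print(f"   ({total_fields - empty_count}/{total_fields} fields filled)")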
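
To see which fields pull the completeness figure down, the emptiness check can also be applied per field. The sketch below works on plain records (for example `df.to_dict("records")`), so it runs without pandas. It also treats `None` and NaN as empty, which is an assumption on top of the two string values checked above. `is_empty` and `empty_counts_by_field` are hypothetical helper names.

```python
import math
from collections import Counter

EMPTY_MARKERS = {"", "00000000000"}  # string values the completeness check treats as empty


def is_empty(value) -> bool:
    """True for None, NaN, or one of the empty marker strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value in EMPTY_MARKERS


def empty_counts_by_field(records) -> Counter:
    """Count empty values per field across an iterable of dict records."""
    counts = Counter()
    for row in records:
        for field, value in row.items():
            if is_empty(value):
                counts[field] += 1
    return counts


# Usage with the notebook's DataFrame:
# for field, n in empty_counts_by_field(df.to_dict("records")).most_common():
#     print(f"{field}: {n} empty")
```

Fields at the top of that list are the natural candidates to report when noting which fields had issues.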

## Step 6 (Optional): Export Results

In [None]:
# Save to CSV for download
output_file = "extracted_invoices_v04.csv"
df.to_csv(output_file, index=False)
print(f"✅ Results saved to: {output_file}")
print("   Download it from the Files panel on the left 📁")

## Step 7 (Optional): Push to Celonis

**Note**: Only run this if your Celonis credentials are configured in `push.ipynb`.

In [None]:
# Uncomment to push to Celonis
# %run push.ipynb

---

## 🎉 Done!

### What to do next:

1. **Review the results** above - check accuracy
2. **Compare with v03** (Claude API results) if you have them
3. **Note any errors** - which invoices or fields had issues?
4. **Share feedback** - so I can help improve the prompts/logic

### Troubleshooting:

- **Out of memory**: Restart runtime, ensure GPU is enabled
- **Invoices not found**: Check the path in Step 2, upload manually if needed
- **Low accuracy**: Share which fields are problematic - I can enhance prompts

### Performance:

- **Expected time**: ~10-15 seconds per invoice on GPU
- **Expected accuracy**: 90-95% (with validation & retry)
- **Cost**: $0 (completely free!)
