## Phase 2: arXiv API Integration & PDF Processing

### Core Objectives
- arXiv API Integration: Build a robust client with rate limiting and retry logic
- PDF Processing Pipeline: Download and parse scientific PDFs with structured content extraction
- Database Storage: Persist paper metadata and content in PostgreSQL
- Error Handling: Implement comprehensive error handling and graceful degradation
- Automation Ready: Prepare components for Airflow orchestration
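
The rate limiting and retry objective can be sketched as follows. This is a minimal illustration, not the project's actual client: `RateLimitedFetcher`, its parameters, and `fetch_with_urllib` are hypothetical names, and the 3-second default spacing is an assumption based on arXiv's published guidance for API clients. The fetch function, clock, and sleep are injected so the behavior can be exercised without network access.

```python
import time
import urllib.request
from typing import Callable, Optional


class RateLimitedFetcher:
    """Hypothetical helper: spaces out calls and retries failures with backoff."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        min_interval: float = 3.0,   # assumed delay between requests, in seconds
        max_retries: int = 3,        # retries after the first attempt
        backoff: float = 2.0,        # base of the exponential retry delay
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.fetch = fetch
        self.min_interval = min_interval
        self.max_retries = max_retries
        self.backoff = backoff
        self.clock = clock
        self.sleep = sleep
        self._last_call: Optional[float] = None

    def _wait_for_slot(self) -> None:
        # Sleep until at least min_interval has passed since the previous call.
        if self._last_call is not None:
            remaining = self.min_interval - (self.clock() - self._last_call)
            if remaining > 0:
                self.sleep(remaining)
        self._last_call = self.clock()

    def get(self, url: str) -> str:
        last_error: Optional[Exception] = None
        for attempt in range(self.max_retries + 1):
            self._wait_for_slot()
            try:
                return self.fetch(url)
            except Exception as exc:  # a real client would catch narrower errors
                last_error = exc
                if attempt < self.max_retries:
                    # Exponential backoff: 1s, 2s, 4s, ... with the defaults.
                    self.sleep(self.backoff ** attempt)
        raise RuntimeError(
            f"Giving up after {self.max_retries + 1} attempts"
        ) from last_error


def fetch_with_urllib(url: str) -> str:
    """Plain standard-library fetch; defined here but not called."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")
```

A client built as `RateLimitedFetcher(fetch_with_urllib)` could then call `get("http://export.arxiv.org/api/query?search_query=cat:cs.AI&max_results=5")` to retrieve an Atom feed of recent cs.AI papers. Injecting `fetch` keeps the throttling logic independent of whichever HTTP library the project ends up using.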

### 🔧 What We'll Test in This Notebook
- arXiv API Client - Fetch CS.AI papers with proper rate limiting
- PDF Download System - Download and cache PDFs with error handling
- Docling PDF Parser - Extract structured content (sections, tables, figures)
- Database Integration - Store and retrieve papers from PostgreSQL
- Complete Pipeline - End-to-end processing from arXiv to database
- Production Readiness - Error handling, logging, and performance metrics

### 📊 Success Metrics
- arXiv API calls succeed with proper rate limiting
- PDF download and caching works reliably
- Docling extracts structured content from scientific PDFs
- Database stores complete paper metadata
- Pipeline handles errors gracefully and continues processing
- All components ready for Airflow automation (Week 2+)
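
The "handles errors gracefully and continues processing" criterion can be illustrated with a small loop that isolates failures per paper. This is a sketch under stated assumptions: `run_pipeline`, `PipelineReport`, and the `process_one` callback are hypothetical names standing in for the real download, parse, and store steps, and the paper identifiers in any usage are placeholders.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger("pipeline_sketch")


@dataclass
class PipelineReport:
    """Outcome of one run: which papers succeeded and why others failed."""
    succeeded: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)


def run_pipeline(
    paper_ids: Iterable[str],
    process_one: Callable[[str], None],
) -> PipelineReport:
    """Process each paper independently so one failure does not stop the rest."""
    report = PipelineReport()
    for paper_id in paper_ids:
        try:
            process_one(paper_id)
        except Exception as exc:
            # Record the error and move on to the next paper.
            logger.warning("Paper %s failed: %s", paper_id, exc)
            report.failed[paper_id] = f"{type(exc).__name__}: {exc}"
        else:
            report.succeeded.append(paper_id)
    return report
```

The returned report gives a simple performance metric (success count versus failure count) and a per-paper error message that can be logged or stored for later retries.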


In [None]:
# Check if Fresh Containers are Built and All Services Healthy
import subprocess
import requests
from pathlib import Path

print("WEEK 2 CONTAINER & SERVICE HEALTH CHECK")
print("=" * 50)

# Find project root
current_dir = Path.cwd()
if current_dir.name == "week2" and current_dir.parent.name == "notebooks":
    project_root = current_dir.parent.parent
elif (current_dir / "compose.yml").exists():
    project_root = current_dir
else:
    print("✗ Could not find project root")
    # exit() can shut down the Jupyter kernel; SystemExit only stops this cell
    raise SystemExit(1)

print(f"Project root: {project_root}")

# Step 1: Check if containers are built and running
print("\n1. Checking container status...")
try:
    result = subprocess.run(
        ["docker", "compose", "ps", "--format", "table"],
        cwd=str(project_root),
        capture_output=True,
        text=True,
        timeout=10
    )
    
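    # The table format prints a header row, so require at least one more line
    lines = result.stdout.strip().splitlines()
    if result.returncode == 0 and len(lines) > 1:
        print("✓ Containers are running:")
        for line in lines:
            print(f"   {line}")
    else:
        print("✗ No containers running or docker compose failed")
        print("Please run the build commands from the markdown cell above")
        # exit() can shut down the Jupyter kernel; SystemExit only stops this cell
        raise SystemExit(1)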
        
except Exception as e:
    print(f"✗ Error checking containers: {e}")
    print("Please run the build commands from the markdown cell above")
    # exit() can shut down the Jupyter kernel; SystemExit only stops this cell
    raise SystemExit(1)

# Step 2: Check all service health (corrected endpoints)
print("\n2. Checking service health...")
services_to_test = {
    "FastAPI": "http://localhost:8000/api/v1/health",
    "PostgreSQL (via API)": "http://localhost:8000/api/v1/health", 
    "Ollama": "http://localhost:11434/api/version",
    "OpenSearch": "http://localhost:9200/_cluster/health",
    "Airflow": "http://localhost:8080/health"
}

all_healthy = True
for service_name, url in services_to_test.items():
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            print(f"✓ {service_name}: Healthy")
        else:
            print(f"✗ {service_name}: HTTP {response.status_code}")
            all_healthy = False
    except requests.exceptions.ConnectionError:
        print(f"✗ {service_name}: Not accessible")
        all_healthy = False
    except Exception as e:
        print(f"✗ {service_name}: {type(e).__name__}")
        all_healthy = False

print("\n" + "=" * 50)
if all_healthy:
    print("✓ ALL SERVICES HEALTHY! Ready for Week 2 development.")
else:
    print("✗ Some services need attention.")
    print("If you just rebuilt containers, wait 1-2 minutes and run this cell again.")
    print("Airflow and OpenSearch take longest to start up.")
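
Instead of re-running the cell by hand while Airflow and OpenSearch start up, the check can be polled until it passes or a deadline expires. The following is a minimal sketch, not part of the project code: `wait_until_healthy` and `http_ok` are hypothetical helpers, the 120-second timeout and 5-second interval are assumed values, and the standard library is used in place of `requests` to keep the example self-contained.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def wait_until_healthy(
    check: Callable[[], bool],
    timeout: float = 120.0,    # assumed upper bound on startup time, in seconds
    interval: float = 5.0,     # assumed pause between checks, in seconds
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() repeatedly until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For example, `wait_until_healthy(lambda: http_ok("http://localhost:9200/_cluster/health"))` would block until OpenSearch responds or the assumed two-minute limit is reached.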

In [1]:
# Environment Check
import sys
from pathlib import Path

print(f"Python Version: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"Environment: {sys.executable}")

# Find project root
current_dir = Path.cwd()
if current_dir.name == "notebooks" and current_dir.parent.name == "ArXivPaperCurator":
    project_root = current_dir.parent
elif (current_dir / "compose.yml").exists():
    project_root = current_dir
else:
    project_root = None

if project_root and (project_root / "compose.yml").exists():
    print(f"✓ Project root: {project_root}")
    # Add project to Python path
    sys.path.insert(0, str(project_root))
else:
    print("✗ Missing compose.yml - check directory")
    # exit() can shut down the Jupyter kernel; SystemExit only stops this cell
    raise SystemExit(1)

Python Version: 3.13.3
Environment: c:\Users\hp\Desktop\Portfolio project\ArXivPaperCurator\.venv\Scripts\python.exe
✓ Project root: c:\Users\hp\Desktop\Portfolio project\ArXivPaperCurator


### Service Health Verification