## Phase 2: arXiv API Integration & PDF Processing

### Core Objectives
- arXiv API Integration: Build a robust client with rate limiting and retry logic
- PDF Processing Pipeline: Download and parse scientific PDFs with structured content extraction
- Database Storage: Persist paper metadata and content in PostgreSQL
- Error Handling: Implement comprehensive error handling and graceful degradation
- Automation Ready: Prepare components for Airflow orchestration
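
The rate limiting and retry objective can be sketched as a small wrapper. This is a minimal illustration and not the project's actual client: the class name `RateLimitedCaller` and its parameters are hypothetical, and the 3-second default is an assumption chosen to match the delay the client reports later in this notebook.

```python
import asyncio
import time


class RateLimitedCaller:
    """Space out calls by a minimum delay and retry failures with backoff."""

    def __init__(self, delay_seconds: float = 3.0, max_retries: int = 3):
        self.delay_seconds = delay_seconds
        self.max_retries = max_retries
        self._last_call = None  # monotonic timestamp of the previous attempt

    async def _wait_for_slot(self) -> None:
        # Sleep only for the part of the delay that has not yet elapsed.
        if self._last_call is not None:
            elapsed = time.monotonic() - self._last_call
            if elapsed < self.delay_seconds:
                await asyncio.sleep(self.delay_seconds - elapsed)
        self._last_call = time.monotonic()

    async def call(self, func, *args, **kwargs):
        """Await func(*args, **kwargs), retrying with exponential backoff."""
        for attempt in range(1, self.max_retries + 1):
            await self._wait_for_slot()
            try:
                return await func(*args, **kwargs)
            except Exception:
                if attempt == self.max_retries:
                    raise  # out of retries: surface the last error
                # Back off 1x, 2x, 4x ... the base delay between attempts.
                await asyncio.sleep(self.delay_seconds * 2 ** (attempt - 1))
```

Inside a notebook cell the wrapper would be awaited directly, for example `await RateLimitedCaller().call(some_async_fetch)`, where `some_async_fetch` stands in for any coroutine function.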

### 🔧 What We'll Test in This Notebook
- arXiv API Client - Fetch cs.AI papers with proper rate limiting
- PDF Download System - Download and cache PDFs with error handling
- Docling PDF Parser - Extract structured content (sections, tables, figures)
- Database Integration - Store and retrieve papers from PostgreSQL
- Complete Pipeline - End-to-end processing from arXiv to database
- Production Readiness - Error handling, logging, and performance metrics

### 📊 Success Metrics
- arXiv API calls succeed with proper rate limiting
- PDF download and caching works reliably
- Docling extracts structured content from scientific PDFs
- Database stores complete paper metadata
- Pipeline handles errors gracefully and continues processing
- All components ready for Airflow automation (Week 2+)
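
The PDF caching metric could be met with a file-based cache along these lines. This is a sketch under stated assumptions: `cached_pdf_path`, `get_pdf`, and the injected `fetch` callable are hypothetical names, and the project's real download system may work differently.

```python
from pathlib import Path
from typing import Callable


def cached_pdf_path(cache_dir: Path, arxiv_id: str) -> Path:
    """Map an arXiv ID to a file name that is safe on disk."""
    # Old-style IDs such as "hep-th/9901001" contain a slash.
    return cache_dir / f"{arxiv_id.replace('/', '_')}.pdf"


def get_pdf(cache_dir: Path, arxiv_id: str, fetch: Callable[[str], bytes]) -> Path:
    """Return the cached PDF path, calling fetch(arxiv_id) only on a cache miss."""
    path = cached_pdf_path(cache_dir, arxiv_id)
    if path.exists() and path.stat().st_size > 0:
        return path  # cache hit: no download needed
    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".part")
    tmp.write_bytes(fetch(arxiv_id))
    tmp.replace(path)  # rename last so a failed download never looks cached
    return path
```

Passing the downloader in as `fetch` keeps the cache logic testable without network access.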


In [2]:
# Check if Fresh Containers are Built and All Services Healthy
import subprocess
import requests
from pathlib import Path

print("WEEK 2 CONTAINER & SERVICE HEALTH CHECK")
print("=" * 50)

# Find project root
current_dir = Path.cwd()
if current_dir.name == "notebooks" and current_dir.parent.name == "ArxivPaperCurator":
    project_root = current_dir.parent
elif (current_dir / "compose.yml").exists():
    project_root = current_dir
else:
    print("✗ Could not find project root")
    exit()

print(f"Project root: {project_root}")



WEEK 2 CONTAINER & SERVICE HEALTH CHECK
Project root: /teamspace/studios/this_studio/ArxivPaperCurator
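
The root detection above depends on the exact directory names. A more general variant, sketched here with the hypothetical helper `find_project_root`, walks up from the starting directory until it finds the `compose.yml` marker the cell already checks for.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "compose.yml") -> Optional[Path]:
    """Return the nearest directory at or above start that contains marker."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None  # no ancestor contains the marker file
```

Calling `find_project_root(Path.cwd())` would then work from the project root, from `notebooks/`, or from any deeper subdirectory.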


In [3]:
# Environment Check
import sys
from pathlib import Path

print(f"Python Version: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"Environment: {sys.executable}")

Python Version: 3.12.11
Environment: /home/zeus/miniconda3/envs/cloudspace/bin/python


### Service Health Verification

In [4]:
# Step 1: Check if containers are built and running
print("\n1. Checking container status...")
try:
    result = subprocess.run(
        ["docker", "compose", "ps", "--format", "table"],
        cwd=str(project_root),
        capture_output=True,
        text=True,
        timeout=10
    )
    
    if result.returncode == 0 and result.stdout.strip():
        print("✓ Containers are running:")
        for line in result.stdout.strip().split('\n'):
            print(f"   {line}")
    else:
        print("✗ No containers running or docker compose failed")
        print("Please run the build commands from the markdown cell above")
        exit()
        
except Exception as e:
    print(f"✗ Error checking containers: {e}")
    print("Please run the build commands from the markdown cell above")
    exit()


1. Checking container status...
✓ Containers are running:
   NAME                        IMAGE                                            COMMAND                   SERVICE                 CREATED          STATUS                     PORTS
   rag-airflow                 apache/airflow:3.0.3                             "/bin/bash -c '\n  pi…"   airflow                 11 minutes ago   Up 7 minutes (unhealthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp
   rag-api                     arxivpapercurator-api                            "uvicorn src.main:ap…"    api                     11 minutes ago   Up 7 minutes (unhealthy)   0.0.0.0:8000->8000/tcp, :::8000->8000/tcp
   rag-ollama                  ollama/ollama:0.11.2                             "/bin/ollama serve"       ollama                  11 minutes ago   Up 7 minutes (healthy)     0.0.0.0:11434->11434/tcp, :::11434->11434/tcp
   rag-opensearch              opensearchproject/opensearch:2.19.0              "./opensearch-docker…
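
The cell reports that containers are running, yet the table lists `rag-airflow` and `rag-api` as `(unhealthy)`. A small helper can flag such rows. This is a sketch that scans the table text only: `unhealthy_services` is a hypothetical name, and it assumes NAME is the first column, as in the output above.

```python
from typing import List


def unhealthy_services(ps_table: str) -> List[str]:
    """Return container names whose row reports '(unhealthy)'."""
    names = []
    for line in ps_table.strip().splitlines()[1:]:  # skip the header row
        if "(unhealthy)" in line:
            names.append(line.split()[0])  # NAME is the first column
    return names
```

It could be called as `unhealthy_services(result.stdout)` right after the `subprocess.run` call.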

In [7]:
# Step 2: Check all service health (corrected endpoints)
print("\n2. Checking service health...")
services_to_test = {
    "FastAPI": "https://8000-01khp40rhz7zc0vjm8t9gsh5j1.cloudspaces.litng.ai/api/v1/ping/health",
    "PostgreSQL (via API)": "https://8000-01khp40rhz7zc0vjm8t9gsh5j1.cloudspaces.litng.ai/api/v1/ping/health", 
    "Ollama": "https://11434-01khp40rhz7zc0vjm8t9gsh5j1.cloudspaces.litng.ai/api/version",
    "OpenSearch": "https://9200-01khp40rhz7zc0vjm8t9gsh5j1.cloudspaces.litng.ai/_cluster/health",
    "Airflow": "https://8080-01khp40rhz7zc0vjm8t9gsh5j1.cloudspaces.litng.ai/api/v2/monitor/health"
}

all_healthy = True
for service_name, url in services_to_test.items():
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            print(f"✓ {service_name}: Healthy")
        else:
            print(f"✗ {service_name}: HTTP {response.status_code}")
            all_healthy = False
    except requests.exceptions.ConnectionError:
        print(f"✗ {service_name}: Not accessible")
        all_healthy = False
    except Exception as e:
        print(f"✗ {service_name}: {type(e).__name__}")
        all_healthy = False

print("\n" + "=" * 50)
if all_healthy:
    print("✓ ALL SERVICES HEALTHY! Ready for Week 2 development.")
else:
    print("✗ Some services need attention.")
    print("If you just rebuilt containers, wait 1-2 minutes and run this cell again.")
    print("Airflow and OpenSearch take longest to start up.")


2. Checking service health...


✓ FastAPI: Healthy
✓ PostgreSQL (via API): Healthy
✓ Ollama: Healthy
✓ OpenSearch: Healthy
✓ Airflow: Healthy

✓ ALL SERVICES HEALTHY! Ready for Week 2 development.
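
Instead of re-running the cell by hand while slow services start, the check can be polled. The helper below is a generic sketch: `wait_until` is a hypothetical name, and the 120-second timeout is an assumption based on the "wait 1-2 minutes" hint in the cell.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 120.0, interval: float = 5.0) -> bool:
    """Call check() every interval seconds until it returns True or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() + interval > deadline:
            return False  # the next attempt would land past the deadline
        time.sleep(interval)
```

The `check` callable would wrap one health request and return `False` on any exception, so a service that is still starting counts as "not ready" rather than aborting the wait.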


In [6]:
# Airflow admin credentials for this environment
airflow_password = "FkfGnM9ZPup7YTre"
airflow_username = "admin"

In [1]:
%cd ..


/teamspace/studios/this_studio/ArxivPaperCurator


In [4]:
import asyncio
from datetime import datetime, timedelta

# Import our arXiv client
from src.services.arxiv.factory import make_arxiv_client

print("TESTING ARXIV API CLIENT")
print("=" * 40)

# Create client
arxiv_client = make_arxiv_client()
print(f"✓ Client created: {arxiv_client.base_url}")
print(f"   Rate limit: {arxiv_client.rate_limit_delay}s")
print(f"   Max results: {arxiv_client.max_results_per_query}")
print(f"   Category: {arxiv_client.search_category}")
print()

TESTING ARXIV API CLIENT
✓ Client created: https://export.arxiv.org/api/query
   Rate limit: 3.0s
   Max results: 100
   Category: cs.AI



In [5]:
async def test_paper_fetching():
    """Test fetching papers from arXiv with rate limiting."""
    
    print("Test 1: Fetch Recent CS.AI Papers")
    try:
        papers = await arxiv_client.fetch_papers(
            max_results=2, 
            sort_by="submittedDate",
            sort_order="descending"
        )
        
        print(f"✓ Fetched {len(papers)} papers")
        
        if papers:
            for i, paper in enumerate(papers[:2], 1):
                print(f"   {i}. [{paper.arxiv_id}] {paper.title[:60]}...")
                print(f"      Authors: {', '.join(paper.authors[:2])}{'...' if len(paper.authors) > 2 else ''}")
                print(f"      Categories: {', '.join(paper.categories)}")
                print(f"      Published: {paper.published_date}")
                print()
        
        return papers
        
    except Exception as e:
        print(f"✗ Error fetching papers: {e}")
        if "503" in str(e):
            print("   arXiv API temporarily unavailable (normal)")
            print("   Rate limiting and error handling working correctly")
        return []

# Run the test
papers = await test_paper_fetching()

Test 1: Fetch Recent CS.AI Papers
✓ Fetched 0 papers
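
When a fetch returns zero papers, it helps to inspect the request URL directly. The sketch below shows how the arguments used in the test might map onto arXiv API query parameters (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`). The function `build_query_url` is hypothetical; how the project's client builds its request is not shown in this notebook.

```python
from urllib.parse import urlencode

ARXIV_API_URL = "https://export.arxiv.org/api/query"  # base URL reported by the client


def build_query_url(category: str = "cs.AI", start: int = 0, max_results: int = 2,
                    sort_by: str = "submittedDate", sort_order: str = "descending") -> str:
    """Build an arXiv API query URL for the newest papers in one category."""
    params = {
        "search_query": f"cat:{category}",  # restrict results to a category
        "start": start,                     # zero-based paging offset
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return f"{ARXIV_API_URL}?{urlencode(params)}"
```

Opening the resulting URL in a browser shows the raw Atom feed, which makes it easier to tell an empty result set apart from a parsing problem.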


In [6]:
# Test Date Filtering
async def test_date_filtering():
    """Test date range filtering functionality."""
    
    print("Test 2: Date Range Filtering")
    
    # Use a fixed date range in YYYYMMDD format
    from_date = "20250808"
    to_date = "20260201"
    try:
        date_papers = await arxiv_client.fetch_papers(
            max_results=5,
            from_date=from_date,
            to_date=to_date
        )
        
        print(f"✓ Date filtering test: {len(date_papers)} papers from {from_date}-{to_date}")
        
        if date_papers:
            for i, paper in enumerate(date_papers, 1):
                print(f"   {i}. [{paper.arxiv_id}] {paper.title[:60]}...")
                print(f"      Authors: {', '.join(paper.authors[:2])}{'...' if len(paper.authors) > 2 else ''}")
                print(f"      Categories: {', '.join(paper.categories)}")
                print(f"      Published: {paper.published_date}")
                print()
        
        return date_papers
        
    except Exception as e:
        print(f"✗ Date filtering error: {e}")
        return []

# Run date filtering test
date_papers = await test_date_filtering()

Test 2: Date Range Filtering
✓ Date filtering test: 0 papers from 20250808-20260201
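
The arXiv API expresses date ranges inside `search_query` as `submittedDate:[YYYYMMDDHHMM TO YYYYMMDDHHMM]`, with times in GMT. The sketch below expands the notebook's `YYYYMMDD` strings into that form. `build_search_query` is a hypothetical helper, and covering the full first and last day (`0000` to `2359`) is an assumption, not necessarily what the project's client does.

```python
from datetime import datetime


def build_search_query(category: str, from_date: str, to_date: str) -> str:
    """Combine a category filter with a submittedDate range (dates as YYYYMMDD)."""
    # Validate early so a malformed date fails here instead of returning no papers.
    start = datetime.strptime(from_date, "%Y%m%d")
    end = datetime.strptime(to_date, "%Y%m%d")
    if start > end:
        raise ValueError("from_date must not be after to_date")
    # arXiv expects minute precision: append HHMM to each day.
    return f"cat:{category} AND submittedDate:[{from_date}0000 TO {to_date}2359]"
```

With the dates used above, `build_search_query("cs.AI", "20250808", "20260201")` yields `cat:cs.AI AND submittedDate:[202508080000 TO 202602012359]`.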
