feat(indexing): convert connector to async streaming with SDK v1.0.0b1 #480

Merged
steve-calvert-glean merged 5 commits into main from scalvert/async-indexing-connector
Apr 22, 2026

@steve-calvert-glean
Contributor

Summary

  • Converts the indexing connector from sync to async streaming using BaseAsyncStreamingDatasourceConnector and BaseAsyncStreamingDataClient
  • Replaces requests + concurrent.futures with aiohttp + asyncio for info page fetching
  • Replaces sync_playwright with async_playwright, using a shared browser with isolated contexts and semaphore-based concurrency (3 concurrent pages) for API reference scraping
  • Updates glean-indexing-sdk to v1.0.0b1, removes deprecated [studio] extra and local path source
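
For readers unfamiliar with the semaphore pattern mentioned in the summary, here is a minimal sketch using only `asyncio`. It is not the connector's code: `scrape_page` is a hypothetical stand-in for a Playwright page scrape (the sleep simulates page load), and `scrape_with_limit` and the `stats` dictionary are names invented for this illustration. Only the limit of 3 comes from the PR.

```python
import asyncio

MAX_CONCURRENT_PAGES = 3  # the limit stated in the summary


async def scrape_page(url: str, semaphore: asyncio.Semaphore, stats: dict) -> str:
    """Hypothetical stand-in for a real page scrape."""
    async with semaphore:  # at most MAX_CONCURRENT_PAGES bodies run at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            await asyncio.sleep(0.01)  # simulated page load
            return f"content of {url}"
        finally:
            stats["active"] -= 1


async def scrape_with_limit(urls: list[str]) -> tuple[list[str], int]:
    """Scrape every URL, returning the results and the peak concurrency seen."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_PAGES)
    stats = {"active": 0, "peak": 0}
    # gather() returns results in the same order as the input URLs.
    results = await asyncio.gather(
        *(scrape_page(url, semaphore, stats) for url in urls)
    )
    return list(results), stats["peak"]
```

Calling `asyncio.run(scrape_with_limit(urls))` schedules all tasks up front, but the semaphore keeps the number of simultaneously active scrapes at or below the limit.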

Verification

  • Dry-run tested end-to-end: 221/221 documents fetched and transformed, 0 failures
    • 107 info pages via async aiohttp + trafilatura
    • 114 API reference pages via async Playwright
  • All SDK imports verified compatible with v1.0.0b1
  • Both infoPage and apiReference document pipelines produce correct DocumentDefinition objects with proper custom properties

Test plan

  • mise run indexing:dry-run completes with 221/221 documents, 0 failures
  • SDK imports verified: BaseAsyncStreamingDataClient, BaseAsyncStreamingDatasourceConnector
  • Info page pipeline: fetch → transform → DocumentDefinition (infoPage)
  • API reference pipeline: Playwright scrape → transform → DocumentDefinition (apiReference) with 17 custom properties
  • Full indexing run against Glean (requires GLEAN_INDEXING_API_TOKEN)

🤖 Generated with Claude Code

@steve-calvert-glean steve-calvert-glean requested a review from a team as a code owner April 13, 2026 18:41
@vercel

vercel Bot commented Apr 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
glean-developer-site | Ready | Preview, Comment | Apr 22, 2026 3:02pm

steve-calvert-glean and others added 5 commits April 22, 2026 08:00
- Migrate from BaseConnectorDataClient to AsyncBaseStreamingDataClient
- Replace sync requests with aiohttp for async HTTP fetching
- Replace sync_playwright with async_playwright for JS-rendered pages
- Use asyncio.as_completed() instead of gather() to yield results immediately
- Process info pages inline during classification, not as separate phase
- Rename connector to DeveloperDocsConnector for consistency

This dramatically improves time-to-first-result from 30-60s to 200-500ms
by streaming results as they complete instead of waiting for all tasks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
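
The difference between `asyncio.gather()` and `asyncio.as_completed()` described in the commit above can be sketched as follows. This is an illustration only, not the connector's implementation: `fetch`, `stream_results`, and `collect` are hypothetical names, and the delays simulate network latency.

```python
import asyncio
from collections.abc import AsyncIterator


async def fetch(name: str, delay: float) -> str:
    """Hypothetical fetch; the sleep simulates network latency."""
    await asyncio.sleep(delay)
    return name


async def stream_results(jobs: dict[str, float]) -> AsyncIterator[str]:
    """Yield each result as soon as its task finishes, not in submission order."""
    tasks = [asyncio.create_task(fetch(name, delay)) for name, delay in jobs.items()]
    # as_completed() hands back awaitables in completion order, so the first
    # result is available after the fastest task, not after the slowest one.
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def collect(jobs: dict[str, float]) -> list[str]:
    """Drain the stream into a list, preserving arrival order."""
    return [result async for result in stream_results(jobs)]
```

With `gather()`, a consumer would see nothing until every task had finished; with the generator above, the consumer can start processing after the fastest task completes, which is the time-to-first-result effect the commit message describes.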
- Use isolated browser contexts instead of shared pages to prevent
  IPC channel contention between concurrent Playwright operations
- Reduce concurrency from 5 to 3 pages for better stability
- Collect all results before yielding to avoid generator interruptions
  mid-browser-operation causing pipe breakage
- Add robust cleanup with try/except around page and context close
  to prevent cascading failures

This fixes the "Error: write EPIPE" crashes when scraping API reference
pages with concurrent Playwright page operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
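
A rough sketch of the isolation and cleanup approach described in the commit above. It assumes `browser` exposes the async Playwright `Browser` interface (`new_context()`, then `context.new_page()`, `page.goto()`, `page.content()`, and `close()` on both); the function names and logging choices are hypothetical, and the concurrency limit is left out to keep the example short.

```python
import asyncio
import logging
from typing import Optional

logger = logging.getLogger(__name__)


async def scrape_in_isolated_context(browser, url: str) -> Optional[str]:
    """Scrape one URL in its own browser context; return None on failure."""
    context = await browser.new_context()  # one context per URL, nothing shared
    page = None
    try:
        page = await context.new_page()
        await page.goto(url)
        return await page.content()
    except Exception:
        logger.exception("scrape failed for %s", url)
        return None
    finally:
        # Close the page first, then the context. Each close is guarded so a
        # failure in one cleanup step cannot cascade into the other.
        for resource in (page, context):
            if resource is None:
                continue
            try:
                await resource.close()
            except Exception:
                logger.warning("cleanup failed for %s", url, exc_info=True)


async def scrape_all(browser, urls: list[str]) -> list[Optional[str]]:
    """Collect every result before returning, rather than yielding mid-scrape."""
    return await asyncio.gather(
        *(scrape_in_isolated_context(browser, url) for url in urls)
    )
```

Because `scrape_all` returns only after every context has been closed, no caller can interrupt a generator while a browser operation is still in flight.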
Renamed:
- AsyncBaseStreamingDataClient → BaseAsyncStreamingDataClient
- AsyncBaseStreamingDatasourceConnector → BaseAsyncStreamingDatasourceConnector

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Pin to glean-indexing-sdk>=1.0.0b1 (from local editable path)
- Remove [studio] extra (no longer exists in v1.0.0b1)
- Remove studio dev dependencies (sse-starlette, starlette, uvicorn)
- Remove [tool.uv.sources] local path override
- Regenerate uv.lock with stable transitive deps

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@steve-calvert-glean steve-calvert-glean force-pushed the scalvert/async-indexing-connector branch from 0b66895 to d84089a Compare April 22, 2026 15:01
@steve-calvert-glean steve-calvert-glean merged commit 2daeacd into main Apr 22, 2026
4 checks passed
@steve-calvert-glean steve-calvert-glean deleted the scalvert/async-indexing-connector branch April 22, 2026 18:36