DocsQA - Documentation Quality Assurance System

A Python-based system that automatically scans W&B documentation for issues like typos, grammar errors, broken links, and outdated content, then proposes fixes via AI-powered suggestions.

Key Features

  • Multi-analyzer Pipeline: Detects broken links, outdated versions, deprecated APIs, and style issues
  • LLM-Powered Improvements: Uses GPT-4 for clarity, grammar, and accuracy suggestions with citations
  • Automated PR Creation: Creates GitHub PRs with verified fixes for review
  • Safety Guardrails: Validates patches before auto-applying changes
  • REST API: Full FastAPI backend with issue management and workflow controls

Quick Start

Development Setup

  1. One-time setup:
# Run the setup script (installs uv, dependencies, sets up database)
./setup.sh

# Set environment variables (optional for LLM features)
export OPENAI_API_KEY="your-openai-key"
export GITHUB_APP_ID="your-github-app-id"
  2. Start the API server:
# Using uv (recommended)
uv run docsqa-server

# Server will be available at:
# - API docs: http://localhost:8080/docs
# - Health check: http://localhost:8080/health
  3. Run analysis:
# Rule-based analysis only (no LLM required)
uv run docsqa-analyze --source manual --no-llm

# Full analysis with LLM improvements (requires OPENAI_API_KEY)
uv run docsqa-analyze --source manual

# Debug mode
uv run docsqa-analyze --source manual --debug
  4. Run tests:
uv run pytest

Docker Deployment

  1. Set up environment:
# Create .env file
cat > .env << EOF
OPENAI_API_KEY=your-openai-key
GITHUB_APP_ID=your-github-app-id
GITHUB_INSTALLATION_ID=your-installation-id
GITHUB_PRIVATE_KEY=your-base64-encoded-private-key
EOF
  2. Start services:
docker-compose -f docsqa/docker/docker-compose.yml up -d

Configuration

Main configuration in docsqa/configs/config.yml:

repo:
  url: https://github.com/wandb/docs.git
  branch: main

paths:
  include:
    - content/en/guides/**/*.md
    - content/en/guides/**/*.mdx

llm:
  provider: openai
  model: gpt-4o-mini
  temperature: 0.1

guardrails:
  require_citations: true
  allow_code_edits: false
  max_whitespace_delta_lines: 3
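
As a rough sketch of how these settings might be read in Python (assuming PyYAML is installed; the load_config helper is hypothetical, not part of the package):

import yaml  # requires PyYAML

# Hypothetical helper: load the settings shown above into a plain dict.
def load_config(path: str = "docsqa/configs/config.yml") -> dict:
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

config = load_config()
print(config["llm"]["model"])                      # "gpt-4o-mini"
print(config["guardrails"]["require_citations"])   # True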

Architecture

Backend Components

  • Core Services: Git utils, MDX parsing, document chunking
  • Rule Analyzers: Link checking, version drift, API/CLI validation, style
  • LLM Engine: Provider-agnostic client with JSON schema validation
  • Verifier: Safety checks for automated patches
  • API: REST endpoints for all operations

Analysis Pipeline

  1. Repository Sync: Clone/pull W&B docs, detect changed files
  2. Document Processing: Parse MDX, extract structure, create chunks
  3. Rule Analysis: Run fast heuristic checks (links, versions, etc.)
  4. LLM Analysis: Context-aware clarity/accuracy suggestions
  5. Verification: Safety checks for auto-apply eligibility
  6. Storage: Persist issues with provenance and patches
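
The six stages can be summarized in a short orchestration sketch. Every helper below is a placeholder supplied by the caller, not the project's real pipeline API:

# Illustrative orchestration of the six stages; all helpers are caller-supplied placeholders.
def run_pipeline(sync, parse, analyze_rules, analyze_llm, verify, store, use_llm=True):
    changed_files = sync()                                          # 1. repository sync
    chunks = [c for path in changed_files for c in parse(path)]     # 2. parse MDX + chunk
    issues = analyze_rules(chunks)                                  # 3. fast heuristic checks
    if use_llm:
        issues += analyze_llm(chunks)                               # 4. clarity/accuracy suggestions
    verified = [verify(issue) for issue in issues]                  # 5. auto-apply eligibility
    store(verified)                                                 # 6. persist with provenance
    return verified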

Issue Lifecycle

[Detected] → [Verified] → [Staged] → [PR Created] → [Merged] → [Resolved]
                ↓
           [Auto-apply] ←→ [Manual Review]
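
One way to model these states in code is a small enum; the class below only mirrors the diagram and is not the actual model used by the backend:

from enum import Enum

class IssueState(Enum):
    """Hypothetical model of the lifecycle diagram above."""
    DETECTED = "detected"
    VERIFIED = "verified"      # branches into auto-apply or manual review
    STAGED = "staged"
    PR_CREATED = "pr_created"
    MERGED = "merged"
    RESOLVED = "resolved"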

API Usage

List Issues

curl "http://localhost:8080/api/issues?severity=high&can_auto_apply=true"

Get Issue Details

curl "http://localhost:8080/api/issues/123"

Create PR

curl -X POST "http://localhost:8080/api/prs" \
  -H "Content-Type: application/json" \
  -d '{
    "issue_ids": [1, 2, 3],
    "title": "docs: automated fixes",
    "branch_name": "docs/fixes/2025-01-14",
    "commit_strategy": "one-per-file",
    "open_draft": true
  }'

Trigger Analysis

curl -X POST "http://localhost:8080/api/runs?source=manual"
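
The same endpoints can be driven from Python with the requests library. The payloads mirror the curl examples above; only the client code itself is new:

import requests

BASE = "http://localhost:8080"

# List high-severity issues that are eligible for auto-apply
issues = requests.get(f"{BASE}/api/issues",
                      params={"severity": "high", "can_auto_apply": "true"}).json()

# Open a draft PR for a batch of issues (same payload as the curl example)
resp = requests.post(f"{BASE}/api/prs", json={
    "issue_ids": [1, 2, 3],
    "title": "docs: automated fixes",
    "branch_name": "docs/fixes/2025-01-14",
    "commit_strategy": "one-per-file",
    "open_draft": True,
})
print(resp.status_code, resp.json())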

Rule Catalog

Rule Code          Description                Severity  Auto-Apply
LINK_404           Broken link (404)          High
SDKVER_OLD         Outdated package version   Medium
API_DEPRECATED     Deprecated API usage       High      ✅*
STYLE_TERMINOLOGY  Non-canonical terms        Low
LLM_CLARITY        Clarity improvement        Medium    ✅*
LLM_ACCURACY       Accuracy correction        High      ✅*

*Auto-apply only with proper citations and verification
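
A catalog entry like the rows above could be represented with a small dataclass; the field names are illustrative assumptions, not the actual schema in api/rules.py:

from dataclasses import dataclass

@dataclass
class Rule:
    """Hypothetical shape of a rule-catalog entry."""
    code: str                  # e.g. "LINK_404"
    description: str
    severity: str              # "low" | "medium" | "high"
    auto_apply: bool = False   # only with citations and verification

LINK_404 = Rule("LINK_404", "Broken link (404)", "high")
LLM_CLARITY = Rule("LLM_CLARITY", "Clarity improvement", "medium", auto_apply=True)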

CLI Usage

Run Analysis

# Full analysis
python -m crawler.run_analysis --source manual

# Rule-based only
python -m crawler.run_analysis --llm=off

# Specific commit
python -m crawler.run_analysis --commit abc123

GitHub PR Creation

python -m services.github_app --open-pr --issues 1,2,3 --branch docs/fixes/test

Development

Package Management with uv

This project uses uv for fast, reliable Python package management.

# Add a new dependency
uv add requests

# Add a development dependency  
uv add --group dev pytest

# Add optional dependency
uv add --optional embeddings faiss-cpu

# Sync dependencies
uv sync

# Install with all optional dependencies
uv sync --all-extras

# Run commands in the virtual environment
uv run python -c "import requests; print('works!')"

# Run scripts defined in pyproject.toml
uv run docsqa-server
uv run docsqa-analyze --help

# Development tools
uv run black .           # Format code
uv run ruff check .      # Lint code
uv run pytest          # Run tests
uv run mypy docsqa      # Type checking

Project Structure

docsqa/
├── backend/
│   ├── api/              # FastAPI endpoints
│   ├── core/             # Core services & models
│   ├── crawler/          # Analysis pipeline
│   │   └── analyzers/    # Rule-based analyzers
│   ├── migrations/       # Database migrations
│   └── services/         # LLM, embeddings, GitHub
├── configs/              # Configuration & catalogs
│   ├── catalogs/         # API/CLI validation data
│   └── dictionaries/     # Terminology lists
└── docker/               # Docker setup
tests/                    # Test suite

Adding New Rules

  1. Create analyzer in backend/crawler/analyzers/ (see the sketch after this list)
  2. Add rule definition to api/rules.py
  3. Integrate in crawler/pipeline.py
  4. Update documentation
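
A minimal analyzer might look roughly like the sketch below. The rule code, base interface, and chunk attributes are assumptions about the internal API, so treat it as a starting template rather than the real thing:

# backend/crawler/analyzers/todo_marker.py (illustrative; interface assumed)
import re

TODO_RE = re.compile(r"\bTODO\b|\bFIXME\b")

class TodoMarkerAnalyzer:
    """Flags leftover TODO/FIXME markers in published docs."""

    rule_code = "STYLE_TODO_MARKER"   # hypothetical rule code
    severity = "low"

    def analyze(self, chunk):
        # `chunk` is assumed to expose .path, .start_line, and .text
        issues = []
        for offset, line in enumerate(chunk.text.splitlines()):
            if TODO_RE.search(line):
                issues.append({
                    "rule": self.rule_code,
                    "severity": self.severity,
                    "path": chunk.path,
                    "line": chunk.start_line + offset,
                    "message": "Leftover TODO/FIXME marker in published docs",
                })
        return issues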

Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=docsqa

# Run specific test file
uv run pytest tests/test_api.py

Acceptance Tests

The system passes these key acceptance criteria:

  • Crawl: Processes all .md(x) files under content/en/guides
  • Link Validation: Detects broken external links with retries
  • Version Drift: Flags outdated wandb==X.Y.Z references (illustrated in the sketch after this list)
  • API Deprecation: Validates API usage against catalogs
  • LLM Quality: Provides clarity/accuracy improvements with citations
  • Safety: Verifies patches before auto-apply
  • GitHub PR: Creates draft PRs with proper formatting
  • Auto-resolve: Resolves issues after PR merge
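
For example, the version-drift criterion boils down to spotting pinned wandb==X.Y.Z references older than the current release. This standalone sketch uses an assumed regex and a placeholder "latest" version, not the shipped analyzer:

# Illustrative version-drift check; the regex and CURRENT version are placeholders.
import re

WANDB_PIN_RE = re.compile(r"wandb==(\d+)\.(\d+)\.(\d+)")
CURRENT = (0, 19, 0)   # placeholder "latest" release for illustration

def find_outdated_pins(text):
    for match in WANDB_PIN_RE.finditer(text):
        if tuple(int(part) for part in match.groups()) < CURRENT:
            yield match.group(0)

print(list(find_outdated_pins("pip install wandb==0.12.1")))  # ['wandb==0.12.1']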

Monitoring & Observability

  • Structured Logs: JSON format with component/rule/latency (sketched after this list)
  • Token Usage: OpenAI API cost tracking per run
  • Health Checks: Service availability monitoring
  • Performance: Link checker timing, LLM response latency
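
Structured JSON logs of this kind can be produced with the standard library alone; the formatter below is a generic sketch, not the project's actual logging setup:

import json, logging, time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record (component, rule, latency, message)."""
    def format(self, record):
        payload = {
            "ts": time.time(),
            "level": record.levelname,
            "component": getattr(record, "component", None),
            "rule": getattr(record, "rule", None),
            "latency_ms": getattr(record, "latency_ms", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("docsqa")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("link check finished", extra={"component": "link_checker",
                                          "rule": "LINK_404", "latency_ms": 412})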

Security

  • No Secrets: Environment variables only, no hardcoded keys
  • Minimal Permissions: GitHub App scoped to docs repo only
  • Content Filtering: Redacts sensitive patterns from LLM prompts (see the sketch after this list)
  • Verification: Multi-layer safety checks for automated changes
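
Content filtering can be as simple as a list of regex substitutions applied before any text reaches the LLM; the patterns below are illustrative examples, not the shipped filter:

# Illustrative redaction pass; the patterns are examples, not the actual filter.
import re

REDACTIONS = [
    (re.compile(r"sk-[A-Za-z0-9]{20,}"), "[REDACTED_OPENAI_KEY]"),    # API keys
    (re.compile(r"ghp_[A-Za-z0-9]{36}"), "[REDACTED_GITHUB_TOKEN]"),  # GitHub tokens
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),     # email addresses
]

def redact(prompt: str) -> str:
    for pattern, replacement in REDACTIONS:
        prompt = pattern.sub(replacement, prompt)
    return prompt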

Built with Claude Code
