refactor(dagster): clarify bootstrap vs runtime config by spideystreet · Pull Request #37 · opensource-together/ost-linker

spideystreet · 2026-04-22T12:38:15Z

Summary

- Keep bootstrap config in settings.py for import-time Dagster setup
- Extract assets, resources, jobs, schedules, and sensors assembly into helper functions in definitions.py
- Align GO_TRENDING_PATH with the other required runtime environment variables
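
To illustrate what treating GO_TRENDING_PATH as a required runtime variable could look like, here is a minimal sketch that fails fast when a required variable is missing. It uses only the standard library rather than pydantic-settings, and `REQUIRED_RUNTIME_VARS` and `load_runtime_config` are hypothetical names, not code from this PR.

```python
import os
from typing import Mapping

# Hypothetical list: only GO_TRENDING_PATH is named in this PR.
REQUIRED_RUNTIME_VARS = ("GO_TRENDING_PATH",)


def load_runtime_config(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the required runtime variables, raising if any is missing or empty."""
    missing = [name for name in REQUIRED_RUNTIME_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {name: env[name] for name in REQUIRED_RUNTIME_VARS}
```

Calling a loader like this when a resource is built, rather than at import time, keeps import-time (bootstrap) settings separate from values that only matter at run time.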

Validation

python -m py_compile src/linker/definitions.py

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

- Isolate API container env (no Dagster/GitHub/LLM secrets)
- Move healthcheck to production compose (not just override)
- Remove shared volumes from API service
- Add rate limiting (60 req/min/IP) on all routes via slowapi decorators
- Extract limiter to rate_limit.py to avoid circular imports

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
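
The PR delegates rate limiting to slowapi. To show the idea behind a 60 req/min/IP limit, the stand-alone sketch below implements a sliding-window limiter. It is not slowapi's implementation, and `SlidingWindowLimiter` is a hypothetical name.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each key (e.g. a client IP)."""

    def __init__(
        self,
        limit: int = 60,
        window: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps of accepted calls

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A route handler would call `allow(client_ip)` and respond with HTTP 429 when it returns `False`.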

- CLAUDE.md: add API run command, env vars (API_HOST, API_PORT, API_RATE_LIMIT)
- architecture.md: add REST API section, update Docker services count, add serving layer to data flow
- Bump docs submodule with rest-api.mdx page

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

Verify API response shapes match ost-mcp TypeScript types exactly. Catches breaking changes in field names, types, or structure before they reach the MCP server.

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

Use snake_case column names (created_at) matching the Prisma model, add gen_random_uuid() for id, remove non-existent updatedAt column.

…line

Trending repos now also upsert into github.raw_github_project alongside raw_trending_project, so they flow through the standard pipeline (fetcher → dbt → classification → sync) and land in public.Project.

Adds stg_public__project_bookmark staging and a LEFT JOIN-based exclusion in match_user_recommendation so users never receive a recommendation for a project they have already bookmarked. A singular test enforces the invariant in CI.
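
The LEFT JOIN-based exclusion can be sketched as an anti-join. The example below runs against an in-memory SQLite database with simplified, hypothetical table and column names (`recommendation`, `bookmark`). The real dbt model targets Postgres and is not shown in this PR page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE recommendation (user_id TEXT, project_id TEXT, score REAL);
    CREATE TABLE bookmark (user_id TEXT, project_id TEXT);
    INSERT INTO recommendation VALUES
        ('u1', 'p1', 0.9), ('u1', 'p2', 0.8), ('u2', 'p1', 0.7);
    INSERT INTO bookmark VALUES ('u1', 'p1');
    """
)

# Anti-join: keep only recommendations that have no matching bookmark row.
rows = conn.execute(
    """
    SELECT r.user_id, r.project_id
    FROM recommendation AS r
    LEFT JOIN bookmark AS b
      ON b.user_id = r.user_id AND b.project_id = r.project_id
    WHERE b.project_id IS NULL
    ORDER BY r.user_id, r.project_id
    """
).fetchall()
```

Here `u1` keeps only `p2`, while `u2` still receives `p1` because the bookmark belongs to a different user. A singular test would express the inverse: an inner join between the final mart and the bookmarks that must return zero rows.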

… SDK

Replace the OpenAI-client-over-OpenRouter wrapper with the official mistralai Python SDK. Simpler stack, direct provider, typed messages for mypy strict. Default model is now `mistral-small-latest` (tracks current Small release). Env var renamed from OPENROUTER_API_KEY to MISTRAL_API_KEY and updated across code, compose, docs, and tests.

New endpoint embeds free-text queries with the same SentenceTransformer used by the pipeline (MiniLM-L6-v2, 384d) and ranks projects by pgvector cosine similarity. Optional hard filters (language, domain, category, techstack) narrow the candidate set before ranking. Model is eagerly loaded in the FastAPI lifespan to avoid cold-request latency. Unit tests cover query validation, ranking shape, and filter propagation to SQL.
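
The filter-then-rank flow described above can be sketched in pure Python. In the PR the ranking happens in SQL through pgvector (whose `<=>` operator returns cosine distance, i.e. 1 minus cosine similarity). The function names, the dict layout, and the 2-d toy vectors below are assumptions for illustration only.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_projects(query_vec, projects, language: Optional[str] = None, limit: int = 10):
    """Apply the hard filter first, then rank the remaining candidates."""
    candidates = [
        p for p in projects if language is None or p["language"] == language
    ]
    scored = [
        (cosine_similarity(query_vec, p["embedding"]), p["name"]) for p in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:limit]]
```

Filtering before ranking means the similarity computation only runs over projects that can actually be returned.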

New RecommendationEvent Prisma model (mirrors the one in ost-backend) and dbt staging expose a feedback signal: projects shown ≥N times in the lookback window without being clicked or starred are excluded from the mart. Keeps the reco surface fresh instead of repeating projects the user has already dismissed implicitly. Lookback and threshold are configurable via dbt vars (ignored_lookback_days=30, ignored_min_shown=3). A singular test enforces the invariant in CI.
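
The exclusion rule can be sketched as follows. The mart implements it in dbt SQL; this Python version only mirrors the logic. The event type names (`SHOWN`, `CLICKED`, `STARRED`) and the tuple layout are assumptions, as is the choice to count clicks and stars only inside the same lookback window.

```python
from collections import Counter
from datetime import datetime, timedelta


def ignored_projects(events, now, lookback_days=30, min_shown=3):
    """Return (user_id, project_id) pairs shown >= min_shown times with no click or star.

    `events` is an iterable of (user_id, project_id, event_type, occurred_at) tuples.
    """
    cutoff = now - timedelta(days=lookback_days)
    shown = Counter()
    engaged = set()
    for user_id, project_id, event_type, occurred_at in events:
        if occurred_at < cutoff:
            continue  # outside the lookback window
        key = (user_id, project_id)
        if event_type == "SHOWN":
            shown[key] += 1
        elif event_type in ("CLICKED", "STARRED"):
            engaged.add(key)
    return {
        key for key, count in shown.items()
        if count >= min_shown and key not in engaged
    }
```

The defaults match the dbt vars quoted above (ignored_lookback_days=30, ignored_min_shown=3).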

… add user FK

Addresses 3 schema gaps flagged in review:

1. `eventType` is now a Postgres enum (RecommendationEventType): the database rejects typos and unknown event types at INSERT time instead of relying on application-level validation alone.
2. `source` promoted from `context.source` to a dedicated enum column (RecommendationSource: PERSONALIZED | TRENDING | SIMILAR | SEMANTIC_SEARCH), plus an index on `(source, occurredAt)` for analytics. `rank` promoted to an Int column too. `context` stays jsonb for unstructured metadata (A/B variant, session_id, etc.).
3. Added FK on `userId` with ON DELETE CASCADE, which prevents orphan events and ensures GDPR-compliant deletion when users leave.

dbt staging, mart exclusion, and singular test updated to match uppercase enum values. 46 dbt tests pass.

Hardens the LLM classification harness for production scale:

1. Parallelization via ThreadPoolExecutor (5 workers default): ~7× speedup measured on 23 projects (4.9s vs ~35s sequential).
2. Cost tracking: ClassificationResult now carries token usage + model; asset aggregates into Output metadata (prompt_tokens, completion_tokens, estimated_cost_usd, model_version) so a bad prompt change is visible at the next run.
3. DLQ (`match.project_classification_failure`): persistent failures stop consuming LLM budget on each run. Exponential backoff (2h, 4h, 8h, ..., capped at 7d), max 5 attempts. RateLimitError is a distinct exception so 429s don't look like ordinary errors in the logs.

Also: modelVersion column added to match.project_classification. Every classification now carries the model that produced it, so future model migrations are auditable.
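
The DLQ backoff schedule (2h, 4h, 8h, ..., capped at 7d, max 5 attempts) can be sketched as a small function. `next_retry_delay` is a hypothetical name, and the reading that no retry is scheduled once the fifth attempt has failed is an assumption.

```python
from datetime import timedelta
from typing import Optional

BASE_DELAY = timedelta(hours=2)
MAX_DELAY = timedelta(days=7)


def next_retry_delay(attempt: int, max_attempts: int = 5) -> Optional[timedelta]:
    """Delay before the next try after the `attempt`-th failure (1-based).

    Returns None once `max_attempts` failures have been recorded.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt >= max_attempts:
        return None
    # Double the delay on each failure, never exceeding the cap.
    return min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY)
```

With these numbers the 7-day cap is not reached within 5 attempts (the longest delay is 16h); it only matters if the attempt limit is raised.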

When a worker hits Mistral's 429, the previous behavior was to DLQ the project immediately. A 1-minute 429 window hitting 5 concurrent workers would poison the DLQ with 5+ projects whose only fault was being in the wrong second. _classify_one now sleeps _RATE_LIMIT_COOLDOWN_SECONDS + jitter on the first RateLimitError and retries once. Only the second failure (or any non-429 error) is routed to the DLQ. Successful retries are logged as warnings via the `rate_limit_hits` counter so operators can see 429 pressure at a glance.
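
The retry-once behavior can be sketched like this. `classify_with_cooldown` and the injectable `sleep`/`jitter` parameters are hypothetical, the local `RateLimitError` is a stand-in for the exception named in the PR, and the 60-second cooldown and 5-second jitter bound are assumed values (the PR does not state them).

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for the distinct 429 exception mentioned in the PR."""


COOLDOWN_SECONDS = 60.0  # assumed value, not taken from the PR


def classify_with_cooldown(classify, project, sleep=time.sleep, jitter=random.uniform):
    """Call `classify(project)`; on the first RateLimitError, wait and retry once."""
    try:
        return classify(project)
    except RateLimitError:
        # Jitter spreads the retries of concurrent workers over a few seconds.
        sleep(COOLDOWN_SECONDS + jitter(0.0, 5.0))
        # A second RateLimitError (or any other error) propagates to the caller,
        # which would then route the project to the DLQ.
        return classify(project)
```

Passing `sleep` and `jitter` as parameters keeps the function testable without real waiting.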

spideystreet and others added 30 commits March 10, 2026 14:49

chore(deps): add fastapi, uvicorn, and slowapi

9536f78

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

feat(api): add API config module with pydantic-settings

cd43304

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

feat(api): add connection pool with psycopg2 SimpleConnectionPool

6550e67

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

feat(api): add pydantic response schemas

4c54745

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

feat(api): add FastAPI app with health endpoint and rate limiting

95692eb

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

feat(api): add categories, domains, and techstacks endpoints

9a7998a

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

feat(api): add project search, detail, and similarity endpoints

14cc9a3

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

feat(api): add trending recommendations endpoint

7bed7c9

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

fix(api): escape ILIKE wildcards and add consistent type::text cast

b47437c

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>
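
The wildcard escaping from commit b47437c can be illustrated with a small helper. `escape_like` is a hypothetical name, not the function used in the repository; it relies on backslash being the default escape character for LIKE/ILIKE in PostgreSQL.

```python
def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE/ILIKE metacharacters so user input is matched literally."""
    return (
        term.replace(escape, escape + escape)  # escape the escape character first
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


# Wrap the escaped term in wildcards for a "contains" search.
pattern = f"%{escape_like('100%_done')}%"
```

The resulting pattern should still be passed as a bound query parameter rather than interpolated into the SQL string.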

chore(docker): add FastAPI service to compose stack

6d8af99

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

test(api): add auto-marker for api test directory

59fdaa9

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

docs(env): add API configuration variables to .env.example

775ed4b

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

style(api): fix lint and type issues

b798d07

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

test(api): verify SQL params and response relations in project tests

f53dd64

Co-Authored-By: spidecode-bot <263227865+spicode-bot@users.noreply.github.com>

docs: add GitHub Trending scraper design spec

fa3f9e7

docs: add GitHub Trending scraper implementation plan

286604a

feat(trending): initialize Go module with dependencies

ca421cf

feat(trending): implement HTML parsing for GitHub Trending page

9d9308b

feat(trending): add GitHub API client with retry logic

30fc625

feat(trending): implement main orchestration — scrape, enrich, upsert

cdf6b7c

feat(trending): add RawTrendingProject to Prisma schema

4f034ce

feat(trending): add Dagster asset and config for trending scraper

869e5da

feat(api): add /recommendations/github-trending endpoint

5814431

chore(docker): add trending scraper to build and compose

cf70c10

chore: add trending scraper to build scripts, env, and CI

1587ba3

fix(trending): align SQL columns with Prisma schema

f0f9ea9

spideystreet added 17 commits April 21, 2026 11:18

feat(classification): harden harness, add prompt registry, and streaming IO manager

2ffdde7

merge(reco-events): resolve conflict in Project model relations

c0470f7

merge(reco-events): integrate recommendation events and resolve schema conflicts

9b2055d

merge(reco-exclusion): combine bookmarked and ignored filters in dbt models

ce82ed4

merge(trending): integrate GitHub Trending Scraper and resolve API conflicts

129a682

docs(dagster): up cfg_resource definitions

ab6c943

refactor(config): dagster use pydantic settings

795c7ef

refactor(dagster): add helper functions for composition, add docstring for docs

4f209c5

refactor(dagster): GO_TRENDING_PATH mandatory, commentary about CPU usage for embeddings

6bbff74

docs(dagster): cpu usage info

05a67f9

spideystreet merged commit 349baf9 into develop Apr 22, 2026

spideystreet deleted the feat/pydantic-settings-dbt-config branch April 22, 2026 12:40

spideystreet merged 47 commits into develop from feat/pydantic-settings-dbt-config
