| 1 | +# The Orchestrator (CLI & Pipeline) |
| 2 | + |
| 3 | +The DevIndex Data Factory is essentially a collection of specialized micro-services (Spider, Updater, Storage, etc.). To coordinate these services into a cohesive, automated workflow, the system relies on the **Orchestrator** layer. |
| 4 | + |
| 5 | +This layer consists of three distinct parts: |
| 6 | +1. **The Entry Point:** `apps/devindex/services/cli.mjs` |
| 7 | +2. **The Command Router:** [`DevIndex.services.Manager`](https://github.com/neomjs/neo/blob/dev/apps/devindex/services/Manager.mjs) |
| 8 | +3. **The Automated Pipeline:** `.github/workflows/devindex-pipeline.yml` |
| 9 | + |
| 10 | +--- |
| 11 | + |
| 12 | +## The Entry Point (`cli.mjs`) |
| 13 | + |
| 14 | +The entry point for the backend services is minimal and leans entirely on the native Neo.mjs component lifecycle. |
| 15 | + |
| 16 | +```javascript readonly |
| 17 | +import Manager from './Manager.mjs'; |
| 18 | + |
| 19 | +async function start() { |
| 20 | + await Manager.ready(); |
| 21 | +} |
| 22 | + |
| 23 | +start().catch(console.error); |
| 24 | +``` |
| 25 | + |
| 26 | +Because `Manager` is a Neo.mjs singleton (`Neo.setupClass(Manager)`), importing the module triggers its instantiation. The `start()` function then awaits the native `Manager.ready()` promise, which resolves once the Manager's asynchronous initialization, including execution of the requested CLI command, is complete. |
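The flow described above can be illustrated without the framework. The following is a minimal sketch, not the Neo.mjs implementation: `ManagerSketch`, `initAsync`, `runCommand`, and the `log` array are hypothetical names used only to show an instance that starts its asynchronous initialization on creation and exposes a `ready()` promise.

```javascript
// Hypothetical stand-in for a singleton service. This is NOT the Neo.mjs
// implementation; it only mimics the "instantiate on import, await ready()" flow.
class ManagerSketch {
    constructor(argv) {
        this.log = [];
        // Asynchronous initialization starts as soon as the instance is created
        this.readyPromise = this.initAsync(argv);
    }

    async initAsync(argv) {
        this.log.push('init:start');
        await this.runCommand(argv[0]); // the requested CLI command is part of init
        this.log.push('init:done');
    }

    async runCommand(name) {
        // Placeholder for real work such as running a service
        await new Promise(resolve => setTimeout(resolve, 10));
        this.log.push(`command:${name}`);
    }

    // Resolves once initialization, including the command, has finished
    ready() {
        return this.readyPromise;
    }
}

// In a real module this instance would be the default export, created at import time.
const manager = new ManagerSketch(['update']);

manager.ready().then(() => console.log(manager.log));
```

Under these assumptions the entry point needs no logic of its own: awaiting `ready()` is enough to know that the command has run.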
| 27 | + |
| 28 | +--- |
| 29 | + |
| 30 | +## The Command Router (`Manager.mjs`) |
| 31 | + |
| 32 | +The `Manager` service uses the `commander` library to parse command-line arguments and `inquirer` to provide interactive prompts for a robust Developer Experience (DX). |
| 33 | + |
| 34 | +Its primary responsibility is mapping high-level commands to specific service executions. |
| 35 | + |
| 36 | +### Available Commands |
| 37 | +* `update`: Triggers the **Updater** to process a batch of pending users. |
| 38 | +* `add [username]`: Manually adds or forces an update for a specific user. |
| 39 | +* `spider`: Triggers the **Spider** to discover new candidates. Offers interactive strategy selection if run without flags. |
| 40 | +* `cleanup`: Manually triggers the **Data Hygiene** routine. |
| 41 | +* `optin` / `optout`: Processes issue-based and star-based privacy requests. |
| 42 | + |
| 43 | +### The "Pre-Run Cleanup" Pattern |
| 44 | +A critical architectural pattern enforced by the Manager is the "Pre-Run Cleanup". Before executing any command that reads or modifies the index (like `spider` or `update`), the Manager automatically triggers `Cleanup.run()`. |
| 45 | + |
| 46 | +```javascript readonly |
| 47 | +program |
| 48 | + .command('update') |
| 49 | + .action(async (options) => { |
| 50 | + await Cleanup.run(); // Pre-run hygiene |
| 51 | + await this.runUpdate(options.limit); |
| 52 | + }); |
| 53 | +``` |
| 54 | +This guarantees that the services always operate on valid, sorted, and pruned data, preventing dirty data from polluting the discovery or enrichment processes. |
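One way to apply this pattern uniformly is to wrap each command handler in a small helper. The sketch below is an illustration only and is not taken from `Manager.mjs`: `withPreRunCleanup`, `fakeCleanup`, and `fakeUpdate` are hypothetical names, and the stand-in functions merely record the order in which they run.

```javascript
// Wraps a command handler so that the cleanup step always runs first.
// `cleanup` and `action` are hypothetical async functions supplied by the caller.
function withPreRunCleanup(cleanup, action) {
    return async (...args) => {
        await cleanup();        // pre-run hygiene
        return action(...args); // the actual command
    };
}

// Stand-ins that only record the order in which they run
const calls = [];
const fakeCleanup = async () => { calls.push('cleanup'); };
const fakeUpdate  = async (limit) => { calls.push(`update:${limit}`); return limit; };

const runUpdate = withPreRunCleanup(fakeCleanup, fakeUpdate);

const finished = runUpdate(800).then(() => console.log(calls)); // [ 'cleanup', 'update:800' ]
```

A wrapper like this keeps the ordering guarantee in one place, so a newly added command cannot skip the cleanup step by accident.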
| 55 | + |
| 56 | +### Smart Scheduling |
| 57 | +When the `update` command is run, the Manager does not pass the whole queue to the Updater. It first narrows and orders the queue in three steps: |
| 58 | +1. It filters out any user who has already been successfully updated *today* (based on the `lastUpdate` timestamp). |
| 59 | +2. It sorts the remaining backlog, prioritizing completely new users (`lastUpdate: null`) and the oldest records first. |
| 60 | +3. It slices the queue to the requested batch limit to respect API quotas. |
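The three steps above can be sketched as a single pure function. This is a minimal illustration under stated assumptions, not the Manager's actual code: `selectBatch` and the `login` field are hypothetical, `lastUpdate` is assumed to be either `null` or an ISO 8601 UTC string, and "today" is assumed to mean the current UTC date.

```javascript
/**
 * Hypothetical batch selection.
 * @param {Array<{login: string, lastUpdate: string|null}>} users pending queue
 * @param {number} limit maximum batch size
 * @param {Date} [now] injectable clock, defaults to the current time
 */
function selectBatch(users, limit, now = new Date()) {
    const today = now.toISOString().slice(0, 10); // UTC date, YYYY-MM-DD

    return users
        // 1. Skip users that were already updated today
        .filter(user => user.lastUpdate === null || user.lastUpdate.slice(0, 10) !== today)
        // 2. Never-updated users first, then the oldest records
        .sort((a, b) => {
            if (a.lastUpdate === null && b.lastUpdate === null) return 0;
            if (a.lastUpdate === null) return -1;
            if (b.lastUpdate === null) return 1;
            return Date.parse(a.lastUpdate) - Date.parse(b.lastUpdate);
        })
        // 3. Respect the requested batch limit
        .slice(0, limit);
}

const queue = [
    {login: 'alice', lastUpdate: '2025-01-10T08:00:00Z'},
    {login: 'bob',   lastUpdate: null},
    {login: 'carol', lastUpdate: '2025-01-15T01:00:00Z'}, // already updated "today"
    {login: 'dave',  lastUpdate: '2025-01-02T12:00:00Z'}
];

const batch = selectBatch(queue, 2, new Date('2025-01-15T12:00:00Z'));
console.log(batch.map(user => user.login)); // [ 'bob', 'dave' ]
```

Because `filter` returns a new array, the sort does not mutate the original queue, and passing `now` as a parameter keeps the function deterministic in tests.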
| 61 | + |
| 62 | +--- |
| 63 | + |
| 64 | +## The Automated Pipeline (GitHub Actions) |
| 65 | + |
| 66 | +While a developer can run commands manually via the CLI, the DevIndex is designed to be fully autonomous. The ultimate orchestrator is the GitHub Actions workflow defined in `.github/workflows/devindex-pipeline.yml`. |
| 67 | + |
| 68 | +This workflow runs on an **hourly schedule** and strings the individual services together into a single, atomic "Data Factory" assembly line: |
| 69 | + |
| 70 | +```yaml readonly |
| 71 | +jobs: |
| 72 | + run-pipeline: |
| 73 | + steps: |
| 74 | + # 1. Process Privacy Requests First |
| 75 | + - name: Run DevIndex Opt-In |
| 76 | + run: npm run devindex:optin |
| 77 | + |
| 78 | + - name: Run DevIndex Opt-Out |
| 79 | + run: npm run devindex:optout |
| 80 | + |
| 81 | + # 2. Aggressive Discovery (3x Loop) |
| 82 | + - name: Run DevIndex Spider |
| 83 | + run: | |
| 84 | + for i in 1 2 3; do |
| 85 | + npm run devindex:spider -- --strategy random |
| 86 | + done |
| 87 | + |
| 88 | + # 3. Enrichment & Processing |
| 89 | + - name: Run DevIndex Updater |
| 90 | + run: npm run devindex:update -- --limit=800 |
| 91 | + |
| 92 | + # 4. Atomic Persistence |
| 93 | + - name: Commit, Rebase and Push |
| 94 | + run: | |
| 95 | + git add apps/devindex/resources/*.json* |
| 96 | + git commit -m "chore(devindex): Hourly pipeline update [skip ci]" |
| 97 | + git pull origin dev --rebase |
| 98 | + git push origin dev |
| 99 | +``` |
| 100 | + |
| 101 | +### Key Pipeline Concepts |
| 102 | + |
| 103 | +1. **Privacy-First Execution:** The `optin` and `optout` services run *before* any discovery or enrichment. This ensures we never accidentally index a user who requested removal in the same hour. |
| 104 | +2. **The 3x Spider Loop:** Because the Spider uses a random-walk algorithm, running it several times in a row (before the Updater) significantly broadens the discovery net while using very little API quota. |
| 105 | +3. **Atomic Commits:** Rather than each service committing its own changes independently (which would cause massive Git conflicts), the services modify the local JSON files on the runner. Only at the very end of the hourly pipeline are all changes bundled into a single, atomic commit containing the fully processed data. The `[skip ci]` marker in the commit message tells GitHub Actions to skip push-triggered workflow runs for that commit, so the pipeline's own push cannot set off an infinite loop of CI runs. |