Commit 5a84ca4

docs: Add DevIndex Data Factory Guides: Orchestrator & Data Hygiene (#9244)
1 parent 892e41a commit 5a84ca4

5 files changed

Lines changed: 167 additions & 2 deletions

File tree

apps/devindex/services/Manager.mjs

Lines changed: 3 additions & 1 deletion
```diff
@@ -45,7 +45,9 @@ class Manager extends Base {
      * Entry point for the application.
      * @returns {Promise<void>}
      */
-    async main() {
+    async initAsync() {
+        await super.initAsync();
+
         const program = new Command();

         program
```

apps/devindex/services/cli.mjs

Lines changed: 1 addition & 1 deletion
```diff
@@ -6,7 +6,7 @@ import Manager from './Manager.mjs';
 dotenv.config({quiet: true});

 async function start() {
-    await Manager.main();
+    await Manager.ready();
 }

 start().catch(console.error);
```
Lines changed: 56 additions & 0 deletions
# Data Hygiene & Cleanup

The **Cleanup Service** ([`DevIndex.services.Cleanup`](https://github.com/neomjs/neo/blob/dev/apps/devindex/services/Cleanup.mjs)) acts as the **Garbage Collector** and **State Enforcer** for the DevIndex data pipeline.

Because the Data Factory operates autonomously—discovering thousands of users and constantly writing to JSON files—data entropy is inevitable. The Cleanup service is invoked automatically by the Orchestrator before any major operation to ensure the system starts with a clean, consistent, and optimized state.

---
## Core Responsibilities

The Cleanup service performs a strict, multi-pass filtering and formatting routine on the entire JSON dataset.

### 1. Threshold Pruning (The Meritocracy Filter)

To prevent the index from bloating with inactive or low-value data, the system enforces a strict `minTotalContributions` threshold (configured in `config.mjs`).

During cleanup, every profile in the rich `users.jsonl` store is evaluated. If a user's total contributions fall below this threshold, their profile is **permanently deleted**. Subsequently, they are also purged from the `tracker.json` index, ensuring the Updater doesn't waste API quota continually re-evaluating them.
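A minimal sketch of this pruning pass, assuming the data has been loaded into in-memory `Map`s (the function name, variable names, and threshold value here are illustrative, not the actual Cleanup API):

```javascript
// Illustrative sketch of threshold pruning (hypothetical names; the real
// Cleanup service reads users.jsonl / tracker.json from disk).
const MIN_TOTAL_CONTRIBUTIONS = 100; // stands in for minTotalContributions in config.mjs

function pruneBelowThreshold(users, tracker, allowlist) {
    for (const [login, profile] of users) {
        // Allowlisted users are exempt from threshold pruning
        if (allowlist.has(login)) continue;

        if (profile.total_contributions < MIN_TOTAL_CONTRIBUTIONS) {
            users.delete(login);   // permanently delete the rich profile
            tracker.delete(login); // purge from the tracker index too
        }
    }
}
```

Deleting entries from a `Map` while iterating it is safe in JavaScript, which keeps the pass to a single loop.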
### 2. Blocklist Enforcement

Privacy is paramount. If a user's GitHub handle appears in `blocklist.json` (usually populated by the Opt-Out service), the Cleanup routine will aggressively hard-delete any trace of that user from all data files (`users.jsonl`, `tracker.json`, etc.). This is an absolute override that happens on every execution.
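The hard-delete behavior can be sketched as a single pass over every store (a hedged illustration; in the real service the stores are JSON files, not in-memory `Map`s, and these names are assumptions):

```javascript
// Hypothetical sketch: remove every blocklisted login from every data store.
function enforceBlocklist(blocklist, stores) {
    for (const login of blocklist) {
        for (const store of stores) {
            store.delete(login); // absolute override: hard-delete any trace
        }
    }
}
```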
### 3. Allowlist Protection & "Resurrection"

Conversely, the `allowlist.json` provides an absolute protective barrier. It serves two distinct functions during cleanup:

* **Protection:** If a user is on the allowlist, they are completely exempt from Threshold Pruning. They will remain in the index even if they have 0 contributions.
* **Resurrection:** Before pruning begins, the service cross-references the allowlist against the tracker queue. If an allowlisted VIP is somehow missing from the tracker, the Cleanup service will instantly "resurrect" them, injecting a new pending entry (`lastUpdate: null`) into the queue to ensure they are scheduled for the next Updater run.
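The resurrection step can be sketched like this (illustrative names; the actual entry shape in `tracker.json` may differ):

```javascript
// Hypothetical sketch of the allowlist "resurrection" pass.
function resurrectAllowlisted(allowlist, tracker) {
    for (const login of allowlist) {
        if (!tracker.has(login)) {
            // lastUpdate: null marks the entry as pending, so the
            // next Updater run will schedule this user.
            tracker.set(login, {login, lastUpdate: null});
        }
    }
}
```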
### 4. The 30-Day "Penalty Box" TTL

When the Updater encounters an error analyzing a user (e.g., a GraphQL timeout or a 404), that user is placed in `failed.json`—the "Penalty Box."

Users in the Penalty Box are temporarily protected from being completely pruned from the tracker, giving them a chance to be successfully processed in a future run. However, the Cleanup service enforces a strict **30-Day Time-To-Live (TTL)**.

```javascript readonly
// Enforce Retention Policy (Penalty Box TTL)
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;
const now            = Date.now();

for (const [login, timestamp] of failed) {
    const ts = new Date(timestamp).getTime();

    if (now - ts > THIRTY_DAYS_MS) {
        console.log(`[Cleanup] Expiring failed user (TTL > 30d): ${login}`);
        failed.delete(login); // Removes Tracker protection
    }
}
```

If a user remains in a failed state for more than 30 days, their protection is revoked, and the standard pruning logic will expunge them from the system.
### 5. Canonical Sorting

The final step of the Cleanup routine is purely for Developer Experience (DX) and Git repository health.

When JSON files are modified by asynchronous services, the order of keys or array elements is often unpredictable. If committed as-is, this creates massive, noisy Git diffs, making code review impossible.

The Cleanup service applies **Canonical Sorting** before writing any data back to disk:

* `users.jsonl` is strictly sorted by `total_contributions` (Descending).
* `tracker.json` is sorted alphabetically by `login` (Ascending).
* `blocklist.json`, `allowlist.json`, and `visited.json` are sorted alphabetically (Ascending).

This guarantees that any Git diff generated by the pipeline accurately reflects *meaningful changes* rather than arbitrary structural shuffling.
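The ordering rules can be expressed as plain comparators (a hedged illustration; the in-memory shapes and function names are assumptions, and the real service serializes the sorted data back to disk):

```javascript
// Illustrative comparators for canonical sorting.
const byContributionsDesc = (a, b) => b.total_contributions - a.total_contributions;
const byLoginAsc          = (a, b) => a.login.localeCompare(b.login);

function canonicalSort(users, tracker, stringLists) {
    users.sort(byContributionsDesc);            // users.jsonl: descending contributions
    tracker.sort(byLoginAsc);                   // tracker.json: alphabetical by login
    stringLists.forEach(list => list.sort());   // blocklist/allowlist/visited: alphabetical
}
```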
Lines changed: 105 additions & 0 deletions
# The Orchestrator (CLI & Pipeline)

The DevIndex Data Factory is essentially a collection of specialized micro-services (Spider, Updater, Storage, etc.). To coordinate these services into a cohesive, automated workflow, the system relies on the **Orchestrator** layer.

This layer comprises three distinct parts:

1. **The Entry Point:** `apps/devindex/services/cli.mjs`
2. **The Command Router:** [`DevIndex.services.Manager`](https://github.com/neomjs/neo/blob/dev/apps/devindex/services/Manager.mjs)
3. **The Automated Pipeline:** `.github/workflows/devindex-pipeline.yml`

---
## The Entry Point (`cli.mjs`)

The entry point for the backend services is incredibly minimal, leaning entirely on the native Neo.mjs component lifecycle.

```javascript readonly
import Manager from './Manager.mjs';

async function start() {
    await Manager.ready();
}

start().catch(console.error);
```

Because `Manager` is a Neo.mjs singleton (`Neo.setupClass(Manager)`), simply importing the module triggers its instantiation. The `start()` function then awaits the native `Manager.ready()` promise, which resolves when the Manager's asynchronous initialization—including executing the requested CLI command—is complete.
---

## The Command Router (`Manager.mjs`)

The `Manager` service uses the `commander` library to parse command-line arguments and `inquirer` to provide interactive prompts for a robust Developer Experience (DX).

Its primary responsibility is mapping high-level commands to specific service executions.

### Available Commands

* `update`: Triggers the **Updater** to process a batch of pending users.
* `add [username]`: Manually adds or forces an update for a specific user.
* `spider`: Triggers the **Spider** to discover new candidates. Offers interactive strategy selection if run without flags.
* `cleanup`: Manually triggers the **Data Hygiene** routine.
* `optin` / `optout`: Processes issue-based and star-based privacy requests.
### The "Pre-Run Cleanup" Pattern

A critical architectural pattern enforced by the Manager is the "Pre-Run Cleanup". Before executing any command that reads or modifies the index (like `spider` or `update`), the Manager automatically triggers `Cleanup.run()`.

```javascript readonly
program
    .command('update')
    .action(async (options) => {
        await Cleanup.run(); // Pre-run hygiene
        await this.runUpdate(options.limit);
    });
```

This guarantees that the services always operate on valid, sorted, and pruned data, preventing dirty data from polluting the discovery or enrichment processes.
### Smart Scheduling

When the `update` command is run, the Manager doesn't just blindly pass the whole queue to the Updater. It implements a smart scheduling algorithm:

1. It filters out any user who has already been successfully updated *today* (based on the `lastUpdate` timestamp).
2. It sorts the remaining backlog, prioritizing completely new users (`lastUpdate: null`) and the oldest records first.
3. It slices the queue to the requested batch limit to respect API quotas.
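The three steps can be sketched as a single filter-sort-slice chain (a simplified illustration; the entry shapes and names are assumptions, not the Manager's actual implementation):

```javascript
// Hypothetical sketch of the smart scheduling algorithm.
function scheduleBatch(queue, limit) {
    const today = new Date().toDateString();

    return queue
        // 1. Skip users already successfully updated today
        .filter(u => !u.lastUpdate || new Date(u.lastUpdate).toDateString() !== today)
        // 2. Brand-new users (lastUpdate: null) first, then oldest records
        .sort((a, b) => {
            if (a.lastUpdate === null && b.lastUpdate === null) return 0;
            if (a.lastUpdate === null) return -1;
            if (b.lastUpdate === null) return 1;
            return new Date(a.lastUpdate) - new Date(b.lastUpdate);
        })
        // 3. Respect the batch limit / API quota
        .slice(0, limit);
}
```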
---

## The Automated Pipeline (GitHub Actions)

While a developer can run commands manually via the CLI, the DevIndex is designed to be fully autonomous. The ultimate orchestrator is the GitHub Actions workflow defined in `.github/workflows/devindex-pipeline.yml`.

This workflow runs on an **hourly schedule** and strings the individual services together into a single, atomic "Data Factory" assembly line:
```yaml readonly
jobs:
  run-pipeline:
    steps:
      # 1. Process Privacy Requests First
      - name: Run DevIndex Opt-In
        run: npm run devindex:optin

      - name: Run DevIndex Opt-Out
        run: npm run devindex:optout

      # 2. Aggressive Discovery (3x Loop)
      - name: Run DevIndex Spider
        run: |
          for i in 1 2 3; do
            npm run devindex:spider -- --strategy random
          done

      # 3. Enrichment & Processing
      - name: Run DevIndex Updater
        run: npm run devindex:update -- --limit=800

      # 4. Atomic Persistence
      - name: Commit, Rebase and Push
        run: |
          git add apps/devindex/resources/*.json*
          git commit -m "chore(devindex): Hourly pipeline update [skip ci]"
          git pull origin dev --rebase
          git push origin dev
```
### Key Pipeline Concepts

1. **Privacy-First Execution:** The `optin` and `optout` services run *before* any discovery or enrichment. This ensures we never accidentally index a user who requested removal in the same hour.
2. **The 3x Spider Loop:** Because the Spider uses a random-walk algorithm, running it multiple times consecutively (before the Updater) significantly broadens the discovery net while utilizing very little API quota.
3. **Atomic Commits:** Rather than each service committing its own changes independently (which would cause massive Git conflicts), the services modify the local JSON files on the runner. Only at the very end of the hourly pipeline are all changes bundled into a single, atomic commit containing the fully processed data. The `[skip ci]` flag prevents infinite loops.

learn/guides/devindex/tree.json

Lines changed: 2 additions & 0 deletions
```diff
@@ -5,7 +5,9 @@
     {"name": "Privacy & Opt-Out", "parentId": null, "id": "OptOut"},
     {"name": "The Data Factory", "parentId": null, "isLeaf": false, "id": "DataFactory"},
     {"name": "Introduction", "parentId": "DataFactory", "id": "data-factory/Intro"},
+    {"name": "The Orchestrator", "parentId": "DataFactory", "id": "data-factory/Orchestrator"},
     {"name": "Spider Engine", "parentId": "DataFactory", "id": "data-factory/Engine"},
+    {"name": "Data Hygiene & Cleanup", "parentId": "DataFactory", "id": "data-factory/DataHygiene"},
     {"name": "Opt-In Service Architecture", "parentId": "DataFactory", "id": "data-factory/OptIn"},
     {"name": "Opt-Out Service Architecture", "parentId": "DataFactory", "id": "data-factory/OptOut"},
     {"name": "Architecture", "parentId": null, "id": "Architecture"}
```
