Commit 1817d5e: docs: Add Data Factory guide for Updater Service (#9248)
# The Updater Service (Enrichment Engine)

The **Updater Service** ([`DevIndex.services.Updater`](https://github.com/neomjs/neo/blob/dev/apps/devindex/services/Updater.mjs)) is the most complex component of the DevIndex Data Factory. While the Spider discovers *who* to track, the Updater is responsible for fetching, aggregating, and minifying the deep historical data required for the frontend visualizations.

It acts as a highly resilient "Worker Bee," taking a batch of usernames from the `tracker.json` queue and converting them into the rich, minified JSON objects stored in `users.jsonl`.

---

## Core Workflow

The Updater processes users in parallel batches. For each user, it executes a strict four-step workflow.

```mermaid
flowchart TD
    Start[Start Batch] --> Fetch

    subgraph Fetch [1. Fetch]
        direction TB
        F1[GraphQL: Profile Data]
        F2[GraphQL: Multi-Year Contribution Matrix]
    end

    Fetch --> Enrich

    subgraph Enrich [2. Enrich]
        direction TB
        E1[REST: Public Orgs]
        E2[Heuristics: 'Cyborg' Score]
        E3[Location Normalization]
    end

    Enrich --> Filter

    subgraph Filter [3. Filter Meritocracy]
        direction TB
        M1{Allowlisted?}
        M2{Total Contributions > Min?}
        M3[Pass]
        M4[Drop & Prune]

        M1 -- No --> M2
        M1 -- Yes --> M3
        M2 -- Yes --> M3
        M2 -- No --> M4
    end

    Filter --> Persist[4. Persist]
```
### 1. Fetch (The GraphQL Matrix)

To build the historical charts on the frontend, the DevIndex requires the total contribution count for *every year* since the user created their account.

Fetching this sequentially (one API call per year) is too slow. Fetching it all at once often results in GitHub API timeouts (`502 Bad Gateway` or `504 Gateway Timeout`) for prolific users. The Updater therefore implements a smart **Chunking Strategy**:

```javascript readonly
// We split the years into chunks of 4 to prevent 502/504 errors on large accounts.
const yearChunks = [];
const chunkSize  = 4;

for (let y = startYear; y <= currentYear; y += chunkSize) {
    const end = Math.min(y + chunkSize - 1, currentYear);
    yearChunks.push({ start: y, end });
}

// Fetch year chunks sequentially to be safe
for (const chunk of yearChunks) {
    try {
        await fetchYears(chunk.start, chunk.end); // Fast Path
    } catch (err) {
        // Fallback: If the 4-year chunk times out, try year-by-year
        for (let y = chunk.start; y <= chunk.end; y++) {
            await fetchYears(y, y);
        }
    }
}
```
### 2. Enrich

While GraphQL is powerful, it has limitations. Specifically, querying an organization via GraphQL requires the `read:org` scope, even for public organizations. To keep the required permissions minimal, the Updater fetches public organization memberships via the v3 REST API simultaneously.
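The REST lookup can be sketched as follows. This is a minimal illustration, not the Updater's actual implementation: `fetchJson` is a hypothetical helper standing in for the project's GitHub API client, while the `GET /users/{login}/orgs` endpoint itself is the real v3 route that returns only public memberships.

```javascript
// Sketch: fetch public org memberships without the read:org scope.
// `fetchJson` is a hypothetical helper (e.g. a thin wrapper around fetch()).
async function fetchPublicOrgs(login, fetchJson) {
    // GET /users/{login}/orgs lists only *public* organization memberships
    const orgs = await fetchJson(`https://api.github.com/users/${login}/orgs`);

    // Keep just the org login names for the minified profile
    return orgs.map(org => org.login);
}
```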
### 3. Filter (The Meritocracy Logic)

Once the data is aggregated, the Updater enforces the DevIndex threshold. If a user's `total_contributions` falls below the dynamically calculated minimum threshold (and they are not on the allowlist), they are marked for immediate deletion. This ensures that only the highest-performing developers remain as the index approaches its `maxUsers` cap.
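The decision itself boils down to a simple predicate, mirroring the flowchart above. This is an illustrative sketch; the function and parameter names are assumptions, not the Updater's internals.

```javascript
// Sketch of the meritocracy gate: allowlisted users always pass,
// everyone else must clear the dynamically calculated minimum.
function passesThreshold(user, minContributions, allowlist) {
    if (allowlist.has(user.login.toLowerCase())) {
        return true; // allowlisted users are never dropped
    }
    // Matches the diagram: "Total Contributions > Min?"
    return user.total_contributions > minContributions;
}
```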
---

## Safe Purge Protocol (Self-Healing)

The internet is chaotic. Users change their names, accounts get suspended, and APIs fail. The Updater is built with extreme defensiveness, implementing a "Safe Purge Protocol" to categorize and handle errors without manual intervention.

```mermaid
flowchart TD
    Error((API Error)) --> Type{Error Type?}

    Type -- 5xx / Rate Limit --> Transient[Transient Error]
    Transient --> PB[Move to Penalty Box]

    Type -- 404 / Not Found --> Fatal{Has History?}

    Fatal -- Yes --> Rename{Check Database ID}
    Rename -- New Login Found --> Recover[Prune Old, Fetch New Immediately]
    Rename -- No New Login --> Protected[Assume Suspended, Protect in Penalty Box]

    Fatal -- No --> BadSeed[Bad Seed / Typo]
    BadSeed --> Prune[Prune from Tracker Immediately]
```
### 1. Transient Errors (The Penalty Box)

If a fetch fails due to a network timeout or a GitHub rate limit, the user is moved to the `failed.json` map (the "Penalty Box") and their `lastUpdate` timestamp in the tracker is refreshed. This pushes them to the back of the queue, allowing the system to retry them naturally on a subsequent pass.
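The bookkeeping amounts to two map updates. The object shapes below are assumptions for illustration, not the real `failed.json` or `tracker.json` layouts.

```javascript
// Sketch of the Penalty Box move: record the failure and refresh
// the tracker timestamp so the user drops to the back of the queue.
function moveToPenaltyBox(login, failed, tracker, now = Date.now()) {
    failed[login] = { failedAt: now }; // entry in the failed.json map

    if (tracker[login]) {
        tracker[login].lastUpdate = now; // pushed behind fresher users
    }
}
```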
### 2. The Rename Problem (Database ID Resolution)

The most sophisticated recovery mechanism handles account renames. If a user changes their GitHub handle, querying their old login returns a fatal `404 NOT_FOUND`. A naive scraper would delete their history.

Instead, the Updater leverages the immutable integer `databaseId` (stored as `i` in the rich profile):

```javascript readonly
// ID-Based Rename Handling
if (isFatal && richUser && richUser.i) {
    try {
        const newLogin = await GitHub.getLoginByDatabaseId(richUser.i);

        if (newLogin && newLogin.toLowerCase() !== lowerLogin) {
            console.log(`[${login}] 🔄 RENAME DETECTED -> ${newLogin}`);

            // 1. Mark the old login for removal
            indexUpdates.push({ login, delete: true });
            prunedLogins.push(login);

            // 2. Fetch data for the new login immediately to preserve their spot
            const newData = await this.fetchUserData(newLogin);
            // ...
        }
    } catch (err) {
        // ...
    }
}
```
### 3. Bad Seeds vs. Protected Users

If a `404` occurs and it is *not* a rename:

* **No History (Bad Seed):** If the user is pending (`lastUpdate: null`) and has never been indexed, they are likely a typo or an organization mistaken for a user by the Spider. They are aggressively **Pruned**.
* **Has History (Protected):** If the user exists in the rich data store but suddenly returns a `404`, we assume their account was temporarily flagged or suspended by GitHub (common for very active OSS contributors). They are **Protected** and moved to the Penalty Box to avoid erasing years of historical tracking data over a temporary glitch.
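The whole `404` triage can be condensed into one decision function. This is a sketch only: the function name, the `lookupLoginById` parameter, and the string return values are illustrative stand-ins for the Updater's actual control flow.

```javascript
// Sketch of the 404 triage: rename recovery, protection, or pruning.
// Returns one of 'recover-rename', 'protect', 'prune'.
async function triage404(login, richUser, lookupLoginById) {
    if (richUser && richUser.i) {
        // Has indexed history: try to resolve the immutable databaseId
        const newLogin = await lookupLoginById(richUser.i);

        if (newLogin && newLogin.toLowerCase() !== login.toLowerCase()) {
            return 'recover-rename'; // prune old login, fetch new one now
        }
        return 'protect'; // assume suspended/flagged: Penalty Box
    }
    return 'prune'; // bad seed / typo: drop from the tracker immediately
}
```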
140+
141+
---
142+
143+
## Checkpointing & State Persistence
144+
145+
Processing complex GraphQL queries and REST calls takes time. To ensure that progress is not lost if the process is interrupted (e.g., by a runner timeout or a network failure), the Updater uses a chunked persistence strategy.
146+
147+
Instead of writing to disk after every single user, or waiting until the end of a massive run, it flushes its internal state to the `Storage` layer at regular intervals (defined by `config.updater.saveInterval`, typically every 10 users):
148+
149+
```javascript readonly
150+
// Checkpoint Save
151+
if (results.length >= saveInterval) {
152+
await this.saveCheckpoint(results, indexUpdates, failedLogins, recoveredLogins, prunedLogins);
153+
// Reset arrays...
154+
}
155+
```
156+
157+
The `saveCheckpoint` method is an atomic operation that synchronizes:
158+
1. New enriched profiles to `users.jsonl`
159+
2. Updated timestamps to `tracker.json`
160+
3. Erased profiles from pruning
161+
4. Addition/Removals from the `failed.json` Penalty Box
162+
163+
This guarantees that the database remains perfectly consistent, even if the Node.js process is forcefully terminated mid-run.
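As a rough sketch, a checkpoint flush covering those four synchronization points might look like the following. The `storage` method names here are hypothetical; the real `saveCheckpoint` lives in the Updater and talks to the project's own `Storage` layer.

```javascript
// Sketch of a checkpoint flush hitting all four stores in one pass.
// All storage method names are illustrative assumptions.
async function saveCheckpoint(state, storage) {
    await storage.appendUsers(state.results);        // 1. users.jsonl
    await storage.updateTracker(state.indexUpdates); // 2. tracker.json timestamps
    await storage.pruneUsers(state.prunedLogins);    // 3. erase pruned profiles
    await storage.updateFailed(                      // 4. failed.json Penalty Box
        state.failedLogins, state.recoveredLogins);
}
```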
---

## Rate Limit Protection

Because the Updater consumes significant GraphQL query points, it constantly monitors the quota. Before processing *every chunk*, it checks the `core` limit. If it drops below a critical threshold (e.g., 50 requests remaining), it instantly triggers a graceful shutdown.

```javascript readonly
if (GitHub.rateLimit.core.remaining < 50) {
    console.warn(`[Updater] ⚠️ RATE LIMIT CRITICAL: Stopping gracefully.`);
    break;
}
```

This prevents the Action from failing violently and ensures that all progress made up to that point is safely checkpointed to disk.

---

**learn/guides/devindex/tree.json** (1 addition, 0 deletions)

```diff
 {"name": "Storage & Configuration", "parentId": "DataFactory", "id": "data-factory/Storage"},
 {"name": "GitHub API Client", "parentId": "DataFactory", "id": "data-factory/GitHubAPI"},
 {"name": "Spider Engine", "parentId": "DataFactory", "id": "data-factory/Engine"},
+{"name": "Updater Engine", "parentId": "DataFactory", "id": "data-factory/Updater"},
 {"name": "Data Hygiene & Cleanup", "parentId": "DataFactory", "id": "data-factory/DataHygiene"},
 {"name": "Opt-In Service Architecture", "parentId": "DataFactory", "id": "data-factory/OptIn"},
 {"name": "Opt-Out Service Architecture", "parentId": "DataFactory", "id": "data-factory/OptOut"},
```