# The Updater Service (Enrichment Engine)

The **Updater Service** ([`DevIndex.services.Updater`](https://github.com/neomjs/neo/blob/dev/apps/devindex/services/Updater.mjs)) is the most complex component of the DevIndex Data Factory. While the Spider discovers *who* to track, the Updater is responsible for fetching, aggregating, and minifying the deep historical data required for the frontend visualizations.

It acts as a highly resilient "Worker Bee," taking a batch of usernames from the `tracker.json` queue and converting them into the rich, minified JSON objects stored in `users.jsonl`.
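
For orientation, each line of `users.jsonl` is one self-contained, minified record. The shape below is a purely hypothetical illustration; the only key confirmed on this page is `i`, which holds the immutable GitHub `databaseId`:

```json readonly
{"login":"octocat","i":583231,"years":{"2023":1204,"2024":2875},"orgs":["github"]}
```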

---

## Core Workflow

The Updater processes users in parallel batches. For each user, it executes a strict four-step workflow.

```mermaid
flowchart TD
    Start[Start Batch] --> Fetch

    subgraph Fetch [1. Fetch]
        direction TB
        F1[GraphQL: Profile Data]
        F2[GraphQL: Multi-Year Contribution Matrix]
    end

    Fetch --> Enrich

    subgraph Enrich [2. Enrich]
        direction TB
        E1[REST: Public Orgs]
        E2[Heuristics: 'Cyborg' Score]
        E3[Location Normalization]
    end

    Enrich --> Filter

    subgraph Filter [3. Filter Meritocracy]
        direction TB
        M1{Allowlisted?}
        M2{Total Contributions > Min?}
        M3[Pass]
        M4[Drop & Prune]

        M1 -- No --> M2
        M1 -- Yes --> M3
        M2 -- Yes --> M3
        M2 -- No --> M4
    end

    Filter --> Persist[4. Persist]
```

### 1. Fetch (The GraphQL Matrix)
To build the historical charts on the frontend, the DevIndex requires the total contribution count for *every year* since the user created their account.

Fetching this sequentially (one API call per year) is too slow. Fetching it all at once often results in GitHub API timeouts (`502 Bad Gateway` or `504 Gateway Timeout`) for prolific users. The Updater implements a smart **Chunking Strategy**:

```javascript readonly
// We split the years into chunks of 4 to prevent 502/504 errors on large accounts.
const yearChunks = [];
const chunkSize = 4;

for (let y = startYear; y <= currentYear; y += chunkSize) {
    const end = Math.min(y + chunkSize - 1, currentYear);
    yearChunks.push({ start: y, end });
}

// Fetch year chunks sequentially to be safe
for (const chunk of yearChunks) {
    try {
        await fetchYears(chunk.start, chunk.end); // Fast Path
    } catch (err) {
        // Fallback: If the 4-year chunk times out, try year-by-year
        for (let y = chunk.start; y <= chunk.end; y++) {
            await fetchYears(y, y);
        }
    }
}
```
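
How does a single call cover four years? The GitHub GraphQL schema limits `contributionsCollection` to a one-year `from`/`to` window, so a chunked request needs one alias per year. The sketch below illustrates the query a `fetchYears` helper might send; `buildYearsQuery` and its exact shape are assumptions, not the actual implementation:

```javascript readonly
// Hypothetical sketch: one aliased contributionsCollection field per year,
// resolved by GitHub in a single GraphQL request.
function buildYearsQuery(login, start, end) {
    let fields = '';

    for (let y = start; y <= end; y++) {
        // contributionsCollection accepts at most a 1-year range,
        // so each year becomes its own alias (y2021, y2022, ...).
        fields += `
            y${y}: contributionsCollection(
                from: "${y}-01-01T00:00:00Z"
                to:   "${y}-12-31T23:59:59Z"
            ) {
                contributionCalendar { totalContributions }
            }`;
    }

    return `query { user(login: "${login}") { ${fields} } }`;
}
```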

### 2. Enrich
While GraphQL is powerful, it has limitations. Specifically, querying an organization via GraphQL requires the `read:org` scope, even for public organizations. To keep the required permissions minimal, the Updater fetches public organization memberships through the v3 REST API, in parallel with the GraphQL calls.
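
A minimal sketch of that REST call, assuming a plain `fetch` and a token; `GET /users/{username}/orgs` is the documented v3 endpoint for listing a user's *public* organization memberships:

```javascript readonly
// Sketch: public org memberships require no read:org scope via REST.
async function fetchPublicOrgs(login, token) {
    const res = await fetch(`https://api.github.com/users/${login}/orgs`, {
        headers: {
            Accept       : 'application/vnd.github+json',
            Authorization: `Bearer ${token}`
        }
    });

    if (!res.ok) throw new Error(`Org fetch failed: ${res.status}`);

    // Keep only the org logins; the Updater stores minified data.
    return (await res.json()).map(org => org.login);
}
```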

### 3. Filter (The Meritocracy Logic)
Once the data is aggregated, the Updater enforces the DevIndex threshold. If a user's `total_contributions` falls below the dynamically calculated minimum threshold (and they are not on the allowlist), they are marked for immediate deletion. This ensures that only the highest-performing developers remain as the index approaches its `maxUsers` cap.
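
Reduced to code, the gate is a single guard. This is a hypothetical sketch; `allowlist`, `minThreshold`, and the exact member names are assumptions taken from the prose above:

```javascript readonly
// Hypothetical sketch of the meritocracy gate.
function passesThreshold(user, allowlist, minThreshold) {
    // Allowlisted users are always kept, regardless of volume.
    if (allowlist.has(user.login.toLowerCase())) {
        return true;
    }

    // Everyone else must clear the dynamically calculated minimum.
    return user.total_contributions >= minThreshold;
}
```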

---

## Safe Purge Protocol (Self-Healing)

The internet is chaotic. Users change their names, accounts get suspended, and APIs fail. The Updater is built with extreme defensiveness, implementing a "Safe Purge Protocol" to categorize and handle errors without manual intervention.

```mermaid
flowchart TD
    Error((API Error)) --> Type{Error Type?}

    Type -- 5xx / Rate Limit --> Transient[Transient Error]
    Transient --> PB[Move to Penalty Box]

    Type -- 404 / Not Found --> Fatal{Has History?}

    Fatal -- Yes --> Rename{Check Database ID}
    Rename -- New Login Found --> Recover[Prune Old, Fetch New immediately]
    Rename -- No New Login --> Protected[Assume Suspended / Protected in Penalty Box]

    Fatal -- No --> BadSeed[Bad Seed / Typo]
    BadSeed --> Prune[Prune from Tracker immediately]
```

### 1. Transient Errors (The Penalty Box)
If a fetch fails due to a network timeout or a GitHub rate limit, the user is moved to the `failed.json` map (the "Penalty Box") and their `lastUpdate` timestamp in the tracker is refreshed. This pushes them to the back of the queue, allowing the system to retry them naturally on a subsequent pass.
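
The bookkeeping is deliberately simple. A hypothetical sketch, assuming `failed` is the in-memory map behind `failed.json` and `tracker` holds the per-user timestamps:

```javascript readonly
// Hypothetical sketch: transient failure -> Penalty Box + requeue.
function penalize(login, err, failed, tracker) {
    // Record the failure so repeat offenders can be spotted later.
    failed[login] = {
        reason   : err.message,
        timestamp: Date.now()
    };

    // Refreshing lastUpdate pushes the user to the back of the queue,
    // so they get retried naturally on a subsequent pass.
    tracker[login].lastUpdate = Date.now();
}
```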

### 2. The Rename Problem (Database ID Resolution)
The most sophisticated recovery mechanism handles account renames. If a user changes their GitHub handle, querying their old login returns a fatal `404 NOT_FOUND`. A naive scraper would delete their history.

Instead, the Updater leverages the immutable integer `databaseId` (stored as `i` in the rich profile):

```javascript readonly
// ID-Based Rename Handling
if (isFatal && richUser && richUser.i) {
    try {
        const newLogin = await GitHub.getLoginByDatabaseId(richUser.i);

        if (newLogin && newLogin.toLowerCase() !== lowerLogin) {
            console.log(`[${login}] 🔄 RENAME DETECTED -> ${newLogin}`);

            // 1. Mark old login for removal
            indexUpdates.push({ login, delete: true });
            prunedLogins.push(login);

            // 2. Fetch data for new login immediately to preserve their spot
            const newData = await this.fetchUserData(newLogin);
            // ...
        }
    } catch (err) {
        // ...
    }
}
```

### 3. Bad Seeds vs. Protected Users
If a `404` occurs and it is *not* a rename, two outcomes are possible (see the sketch after this list):
* **No History (Bad Seed):** If the user is pending (`lastUpdate: null`) and has never been indexed, they are likely a typo or an organization mistaken for a user by the Spider. They are aggressively **Pruned**.
* **Has History (Protected):** If the user exists in the rich data store but suddenly returns a `404`, we assume their account was temporarily flagged or suspended by GitHub (common for very active OSS contributors). They are **Protected** and moved to the Penalty Box to avoid erasing years of historical tracking data over a temporary glitch.
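
A hypothetical sketch of that classification, reusing the names from elsewhere on this page (`richUser` for the stored profile, `lastUpdate` for the pending marker):

```javascript readonly
// Hypothetical sketch: classify a 404 that is not a rename.
function classify404(login, richUser, tracker) {
    const neverIndexed = !richUser && tracker[login].lastUpdate === null;

    if (neverIndexed) {
        // Bad Seed: likely a typo or an org mistaken for a user. Prune.
        return { action: 'prune' };
    }

    // Protected: real history, sudden 404. Assume a temporary flag or
    // suspension and park the user in the Penalty Box instead of erasing data.
    return { action: 'penaltyBox' };
}
```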

---

## Checkpointing & State Persistence

Processing complex GraphQL queries and REST calls takes time. To ensure that progress is not lost if the process is interrupted (e.g., by a runner timeout or a network failure), the Updater uses a chunked persistence strategy.

Instead of writing to disk after every single user, or waiting until the end of a massive run, it flushes its internal state to the `Storage` layer at regular intervals (defined by `config.updater.saveInterval`, typically every 10 users):

```javascript readonly
// Checkpoint Save
if (results.length >= saveInterval) {
    await this.saveCheckpoint(results, indexUpdates, failedLogins, recoveredLogins, prunedLogins);
    // Reset arrays...
}
```

The `saveCheckpoint` method is an atomic operation that synchronizes (see the sketch after this list):
1. New enriched profiles to `users.jsonl`
2. Updated timestamps to `tracker.json`
3. Erasure of pruned profiles
4. Additions to and removals from the `failed.json` Penalty Box
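
A hypothetical sketch of that flush; the `Storage` method names are assumptions, not the real API:

```javascript readonly
// Hypothetical sketch of an atomic checkpoint flush.
async function saveCheckpoint(results, indexUpdates, failedLogins, recoveredLogins, prunedLogins) {
    await Storage.writeUsers(results);           // 1. enriched profiles -> users.jsonl
    await Storage.updateTracker(indexUpdates);   // 2. refreshed timestamps -> tracker.json
    await Storage.pruneUsers(prunedLogins);      // 3. erase profiles removed by pruning
    await Storage.addFailed(failedLogins);       // 4a. add transient failures to failed.json
    await Storage.removeFailed(recoveredLogins); // 4b. release recovered users from the Penalty Box
}
```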

This guarantees that the database remains perfectly consistent, even if the Node.js process is forcefully terminated mid-run.

---

## Rate Limit Protection

Because the Updater consumes significant GraphQL query points, it constantly monitors the quota. Before processing *every chunk*, it checks the `core` limit. If it drops below a critical threshold (e.g., 50 requests remaining), it instantly triggers a graceful shutdown.

```javascript readonly
if (GitHub.rateLimit.core.remaining < 50) {
    console.warn(`[Updater] ⚠️ RATE LIMIT CRITICAL: Stopping gracefully.`);
    break;
}
```

This prevents the Action from failing violently and ensures that all progress made up to that point is safely checkpointed to disk.