MLE-28226, MLE-28260: Fix silent NPE in getSslContext() and IndexOutOfBoundsException in ThreadManager thread scaling#600

Merged
vshaniga merged 1 commit into marklogic:develop from tposham:develop
May 5, 2026

Conversation

tposham commented Apr 30, 2026

Summary

Fixes two bugs in MLCP error handling:

  • MLE-28226: ContentReader.getSslContext() swallowed NoSuchAlgorithmException/KeyManagementException, causing a silent NPE when SSL initialization failed. It now propagates these exceptions to the caller, which already handles them via XccConfigException.
  • MLE-28260: ThreadManager could crash with IndexOutOfBoundsException when completed tasks appeared before active tasks in taskList. The randomIndexes list (sized to the active task count) was indexed with the full loop index. The fix adds a separate activeIdx counter that increments only for active tasks.

Changes

| File | Change |
| --- | --- |
| ContentReader.java | Removed try-catch in getSslContext(), added throws clause to match SslConfigOptions interface contract |
| ThreadManager.java | Added int activeIdx = 0 and used randomIndexes.get(activeIdx++) in scaleOutThreadPool(), scaleInThreadPool(), and assignIdleThreads() |
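
To illustrate the ContentReader change, here is a minimal sketch of the two shapes of getSslContext(). The class SslContextSketch and its method names are hypothetical stand-ins, not the MLCP source; only the javax.net.ssl.SSLContext calls are real JDK API.

```java
import java.security.KeyManagementException;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;

// Hypothetical stand-in for the SSL options class in ContentReader; not the MLCP source.
class SslContextSketch {

    // Old shape: the failure is swallowed, so the caller receives null
    // and fails later with a NullPointerException that hides the real cause.
    static SSLContext swallowing(String protocol) {
        try {
            SSLContext ctx = SSLContext.getInstance(protocol);
            ctx.init(null, null, null); // default key managers, trust managers, and RNG
            return ctx;
        } catch (NoSuchAlgorithmException | KeyManagementException e) {
            return null;
        }
    }

    // New shape: the checked exceptions reach the caller, which can report them.
    static SSLContext propagating(String protocol)
            throws NoSuchAlgorithmException, KeyManagementException {
        SSLContext ctx = SSLContext.getInstance(protocol);
        ctx.init(null, null, null);
        return ctx;
    }
}
```

With an unsupported protocol name, swallowing() returns null while propagating() throws NoSuchAlgorithmException at the point of failure.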

Tests

  • Ran unit tests → All passed
  • Ran 06mlcp test suite → No regression failures were found

Copilot AI review requested due to automatic review settings April 30, 2026 07:10

Copilot AI left a comment

Pull request overview

This PR fixes two runtime failure modes in MLCP’s SSL initialization and auto-scaling thread assignment logic to avoid silent failures and IndexOutOfBoundsExceptions during thread redistribution.

Changes:

  • ContentReader.SslOptions#getSslContext() now propagates NoSuchAlgorithmException/KeyManagementException instead of swallowing them, preventing a silent NPE when SSL setup fails.
  • ThreadManager now uses an activeIdx counter when consuming randomIndexes so completed tasks earlier in taskList don’t cause randomIndexes to be indexed with the full loop index.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/main/java/com/marklogic/mapreduce/examples/ContentReader.java | Removes exception-swallowing in SSLContext creation and aligns getSslContext() with the SslConfigOptions throws contract. |
| src/main/java/com/marklogic/contentpump/ThreadManager.java | Fixes incorrect indexing into randomIndexes by introducing activeIdx in scaling/idle-thread assignment paths. |

Comment thread src/main/java/com/marklogic/contentpump/ThreadManager.java
Contributor

NeoSaber left a comment

The change for ContentReader looks fine to me.

I think copilot is correct that the change to ThreadManager doesn't quite fix the issue, but I don't really like its suggested fixes. They seem too much like overengineering to me.

My first thought was to try a modulo op to limit the activeIdx to the size of randomIndexes, but when I ran that suggestion by copilot it pointed out that wouldn't really work in the context of how the scaling functions are using it. It suggested adding something like this:

if (activeIdx >= randomIndexes.size()) {
    if (LOG.isDebugEnabled()) {
        LOG.debug("Skipping task added after active task snapshot; "
                + "it will be considered in the next polling cycle.");
    }
    continue;
}

I don't know what the best approach might be here. I am not a fan of any of the copilot suggestions I've seen for it, but my ideas might not work either.

…fBoundsException in ThreadManager Thread Scaling
Author

tposham commented May 4, 2026

I used Copilot to generate fixes and suggestions. It recommends moving the threadManager.runThreadPoller() call to after the tasks are submitted, which prevents the race condition in which activeIdx >= randomIndexes.size().

LocalJobRunner.java

MLE-28260: Moved runThreadPoller() from before the task submission loop to after it.

BEFORE:
  pool = threadManager.initThreadPool();
  threadManager.runThreadPoller();       // Poller timer starts here
  ...
  for (each split) {
      threadManager.submitTask(task);    // Adds tasks to taskList
  }
  threadManager.shutdownThreadPool();

The poller was scheduled (with a 1-minute initial delay) BEFORE tasks were
submitted. runThreadPoller() registers a repeating timer via
scheduleWithFixedDelay — it returns instantly, but the timer fires on a
background thread every minute. If task submission took longer than the
1-minute delay (e.g., very large jobs with thousands of splits), the poller
would fire while submitTask() was still adding tasks to taskList. This
caused a race condition: the poller's scale methods would see more active
tasks than randomIndexes had entries, causing an IndexOutOfBoundsException
that silently killed the poller and stopped all auto-scaling for the
remainder of the job.

AFTER:
  pool = threadManager.initThreadPool();
  ...
  for (each split) {
      threadManager.submitTask(task);    // Adds tasks to taskList
  }
  if (pool != null) {
      threadManager.runThreadPoller();   // Poller timer starts here
  }
  threadManager.shutdownThreadPool();

Now the poller starts AFTER all tasks are in taskList. Since submitTask()
is the only method that adds to taskList, and it's only called from this
loop, taskList is fully populated and stable before the poller ever fires.
This closes the race between task submission and polling.

Tasks are not affected by the delayed poller start — submitTask() assigns
initial threads and submits tasks to the pool immediately via
pool.submit(). The poller's role is only to REBALANCE threads as server
capacity changes, not to start tasks.

The if (pool != null) guard prevents starting the poller in single-threaded
mode. When pool is null, the for-loop runs each mapper synchronously to
completion — all work finishes inside the loop, so a poller is unnecessary.
In the original code this guard wasn't needed because runThreadPoller() was
called unconditionally before the branch point, and the poller's internal
runAutoScaling() check made it a no-op in single-threaded mode. But now
that we're after the loop, starting a pointless timer that would only delay
shutdownThreadPool() is wasteful, so we skip it entirely.
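
The reordering can be pictured with a small, self-contained sketch. Everything here is an assumption made for illustration: PollerOrderingSketch, its plain list of strings, and the zero initial delay are hypothetical and do not come from LocalJobRunner. It only shows that a poller scheduled after the submission loop observes the complete task list on its first run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical miniature of the runner; names do not come from LocalJobRunner.
class PollerOrderingSketch {

    // Submits every task first, then starts the poller, and returns the
    // task count the poller observed on its first run.
    static int firstObservedSize(int splits) throws InterruptedException {
        List<String> taskList = new ArrayList<>();
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch firstRun = new CountDownLatch(1);
        int[] observed = new int[1];
        try {
            for (int i = 0; i < splits; i++) {
                taskList.add("task-" + i); // stands in for submitTask(task)
            }
            // The poller is scheduled only after the list is fully populated.
            timer.scheduleWithFixedDelay(() -> {
                observed[0] = taskList.size();
                firstRun.countDown();
            }, 0, 1, TimeUnit.SECONDS);
            firstRun.await(5, TimeUnit.SECONDS);
            return observed[0];
        } finally {
            timer.shutdownNow();
        }
    }
}
```

Because the list is never modified after the poller is scheduled, the poller thread reads a stable snapshot without extra synchronization in this sketch.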

ThreadManager.java

MLE-28260: Fixed IndexOutOfBoundsException in scaleOutThreadPool(),
scaleInThreadPool(), and assignIdleThreads().

THE BUG:
These three methods iterate taskList (all tasks, including done ones) and
access randomIndexes (sized to active task count only). Previously they used
the loop variable `i` to index into randomIndexes:

    randomIndexes.get(i)

But `i` iterates over ALL tasks (taskList.size()), while randomIndexes only
has entries for ACTIVE tasks (activeTaskCounts). When a done task appears
before an active task in the list, `i` advances past the done task but
randomIndexes doesn't have an entry for that position. Once `i` exceeds
randomIndexes.size(), the access throws IndexOutOfBoundsException.

Example: taskList = [active, DONE, active, DONE, active]
  taskList.size() = 5, activeTaskCounts = 3, randomIndexes has 3 entries
  i=0: active → randomIndexes.get(0) ✓
  i=1: DONE   → skip (continue)
  i=2: active → randomIndexes.get(2) ✓
  i=3: DONE   → skip (continue)
  i=4: active → randomIndexes.get(4) ← CRASH (only 3 entries)

This crashed the ThreadPoller, silently killing all auto-scaling for the
rest of the job. When a periodic task throws, ScheduledExecutorService
suppresses its subsequent executions and stores the exception in the
returned Future, so unless that Future is inspected there is no visible
error, just degraded throughput.
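
The silent failure can be reproduced outside MLCP. The sketch below is not the ThreadPoller code; SilentPollerSketch and the simulated exception are hypothetical. It relies only on documented java.util.concurrent behavior: when a task scheduled with scheduleWithFixedDelay throws, later executions are suppressed and the exception surfaces only through the returned ScheduledFuture.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not the MLCP ThreadPoller.
class SilentPollerSketch {

    // Runs a periodic task that throws on its second execution and returns
    // how many times it ran before the executor stopped rescheduling it.
    static int runsBeforeSilentStop() throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        try {
            ScheduledFuture<?> future = timer.scheduleWithFixedDelay(() -> {
                if (runs.incrementAndGet() == 2) {
                    throw new IndexOutOfBoundsException("simulated poller failure");
                }
            }, 0, 10, TimeUnit.MILLISECONDS);
            try {
                future.get(); // the only place the failure becomes visible
            } catch (ExecutionException e) {
                // e.getCause() is the IndexOutOfBoundsException; nothing was logged.
            }
            Thread.sleep(100); // leave time for further runs, which never happen
            return runs.get();
        } finally {
            timer.shutdownNow();
        }
    }
}
```

One possible mitigation, not part of this change, is to wrap the body of a periodic task in a try-catch that logs the failure, so one bad cycle does not end all later cycles.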

THE FIX (two parts):

1. Separate activeIdx counter:
   Added `int activeIdx = 0` before each loop. Changed
   `randomIndexes.get(i)` to `randomIndexes.get(activeIdx++)`.
   activeIdx only increments when processing an active task, not when
   skipping a done task. This keeps it in sync with randomIndexes.size().

   Same example with the fix:
     i=0: active → randomIndexes.get(activeIdx=0) ✓, activeIdx becomes 1
     i=1: DONE   → skip, activeIdx stays 1
     i=2: active → randomIndexes.get(activeIdx=1) ✓, activeIdx becomes 2
     i=3: DONE   → skip, activeIdx stays 2
     i=4: active → randomIndexes.get(activeIdx=2) ✓, activeIdx becomes 3

2. Bounds guard (defense-in-depth):
   Added `if (activeIdx >= randomIndexes.size()) break` before each
   randomIndexes.get() call. This protects against any remaining edge case
   where the active task count at iteration time exceeds the count used to
   build randomIndexes (e.g., if a task's done-status changes between
   getActiveTaskCounts() and the loop). Tasks beyond the snapshot are
   safely skipped and handled in the next polling cycle.

   Uses `break` (not `continue`) because once activeIdx reaches the limit,
   it can never become valid again — activeIdx only increments, and
   randomIndexes.size() is fixed for the cycle. Continuing would just
   hit the same guard on every remaining active task. break exits the
   for-loop (not the method), so post-loop code like
   pool.setCorePoolSize() in scaleInThreadPool() still executes.
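
Both parts of the fix can be sketched in isolation. This is a simplified model under stated assumptions, not the ThreadManager source: ActiveIdxSketch is hypothetical, a task is reduced to a done flag, and the loop only collects the looked-up values instead of assigning threads.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: a task is just a done flag; not the ThreadManager source.
class ActiveIdxSketch {

    // Old shape: indexes randomIndexes with the loop variable over ALL tasks.
    static List<Integer> pickWithLoopIndex(boolean[] done, List<Integer> randomIndexes) {
        List<Integer> picked = new ArrayList<>();
        for (int i = 0; i < done.length; i++) {
            if (done[i]) continue;
            picked.add(randomIndexes.get(i)); // i can exceed randomIndexes.size() - 1
        }
        return picked;
    }

    // New shape: a counter that advances only for active tasks, plus the bounds guard.
    static List<Integer> pickWithActiveIdx(boolean[] done, List<Integer> randomIndexes) {
        List<Integer> picked = new ArrayList<>();
        int activeIdx = 0;
        for (int i = 0; i < done.length; i++) {
            if (done[i]) continue;
            if (activeIdx >= randomIndexes.size()) break; // more active tasks than the snapshot
            picked.add(randomIndexes.get(activeIdx++));
        }
        return picked;
    }
}
```

For the task list [active, DONE, active, DONE, active] and three random indexes, pickWithLoopIndex throws IndexOutOfBoundsException at i=4, while pickWithActiveIdx consumes the three entries in order. If a fourth active task is present, the guard stops the loop after three lookups instead of throwing.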


Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.


Comment thread src/main/java/com/marklogic/contentpump/LocalJobRunner.java
@tposham tposham requested a review from NeoSaber May 4, 2026 10:30
@vshaniga vshaniga merged commit c44e043 into marklogic:develop May 5, 2026
7 of 8 checks passed
4 participants