
feat: Reconnect feature on stream error#419

Merged
CarlosGamero merged 25 commits into main from
feat/reconnect_on_error
Mar 19, 2026

Conversation


@CarlosGamero CarlosGamero commented Mar 19, 2026

Summary by CodeRabbit

  • New Features

    • Automatic consumer reconnection with retries, exponential backoff, max-attempt reporting, and reconnection state tracking.
  • Bug Fixes

    • Safer shutdown/stream handling, unified error and commit routing, hardened commit behavior, and simplified publisher error logging.
  • Tests

    • Added reconnect behavior tests (success and exhaustion) and updated consumer/publisher assertions.
  • Chores

    • Bumped local package dependency versions.

@CarlosGamero CarlosGamero self-assigned this Mar 19, 2026

coderabbitai bot commented Mar 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0717cb65-05e3-477e-9e57-1a9117d35ad6

📥 Commits

Reviewing files that changed from the base of the PR and between d006089 and 2b2b9af.

📒 Files selected for processing (2)
  • packages/kafka/lib/AbstractKafkaConsumer.ts
  • packages/kafka/test/publisher/PermissionPublisher.spec.ts

📝 Walkthrough

Walkthrough

Adds lazy consumer initialization and a reconnect mechanism (up to 5 attempts with exponential backoff) to AbstractKafkaConsumer, consolidates stream consumption into a single async handler, renames handlerError to handleError, and tightens commit, logging, and shutdown/error flows.

Changes

  • Reconnect & Stream Handling — packages/kafka/lib/AbstractKafkaConsumer.ts: Made consumer optional and created in init(); added isReconnecting; unified handleStream(...) for single/batch streams; stream errors now trigger reconnect(error) with exponential backoff (max 5 attempts); replaced commitMessage(...) with commit(...); hardened close() to tolerate a missing consumer and swallow close errors.
  • Service & Publisher Error Handling / Logging — packages/kafka/lib/AbstractKafkaService.ts, packages/kafka/lib/AbstractKafkaPublisher.ts: Renamed protected handlerError to handleError; logger derived as child with { origin: this.constructor.name }; publisher catch paths call handleError and error payloads trimmed.
  • Tests (reconnect coverage & assertions) — packages/kafka/test/consumer/PermissionConsumer.reconnect.spec.ts, packages/kafka/test/consumer/PermissionConsumer.spec.ts, packages/kafka/test/consumer/PermissionBatchConsumer.spec.ts: Added reconnect tests covering success and max-attempt exhaustion with error reporting; updated assertions to use resolves.not.toThrow() in several specs.
  • Package version bumps — packages/kafka/package.json, packages/kafka/load-tests/package.json: Bumped @platformatic/kafka from 1.30.0 → 1.31.0 in package manifests.
  • Publisher tests — packages/kafka/test/publisher/PermissionPublisher.spec.ts: Removed InternalError import and related nested error assertions; updated expected publish-failure snapshot message.
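The reconnect flow summarized above (close, wait with exponential backoff, re-init, up to 5 attempts) can be sketched roughly as follows. Only init, close, isReconnecting, and MAX_RECONNECT_ATTEMPTS mirror names from the PR; the class and its constructor parameter are illustrative stand-ins, not the actual implementation:

```typescript
// Hedged sketch of the reconnect loop this PR describes.
const MAX_RECONNECT_ATTEMPTS = 5

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

class ReconnectingConsumer {
  isReconnecting = false
  private initCalls = 0

  // failuresBeforeSuccess simulates a broker that comes back after N failed inits.
  constructor(private failuresBeforeSuccess: number) {}

  async init(): Promise<void> {
    if (this.initCalls++ < this.failuresBeforeSuccess) {
      throw new Error('broker unavailable')
    }
  }

  async close(): Promise<void> {
    // The real close() tolerates a missing consumer and swallows close errors.
  }

  async reconnect(_cause: unknown): Promise<boolean> {
    this.isReconnecting = true
    for (let attempt = 0; attempt < MAX_RECONNECT_ATTEMPTS; attempt++) {
      await this.close()
      // Backoff delay starting with 1s in the real code (2^attempt * 1000 ms);
      // scaled down to bare milliseconds here so the sketch runs instantly.
      await sleep(Math.pow(2, attempt))
      try {
        await this.init()
        this.isReconnecting = false
        return true // reconnected; stream handling resumes
      } catch {
        // swallow and retry until attempts are exhausted
      }
    }
    this.isReconnecting = false
    // The real code reports exhaustion via handleError(..., { maxAttempts: 5 }).
    return false
  }
}
```

On success the loop returns early and stream handling resumes; on exhaustion the real consumer reports through its error reporter instead of returning a boolean.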

Sequence Diagram

sequenceDiagram
    actor Client
    participant Consumer as AbstractKafkaConsumer
    participant Stream as ConsumerStream
    participant Handler as MessageHandler
    participant ErrorReporter as ErrorReporter

    Client->>Consumer: init()
    activate Consumer
    Consumer->>Consumer: create consumer instance
    Consumer->>Stream: start async iteration (handleStream)
    deactivate Consumer

    loop stream processing
        Stream->>Consumer: yield message(s)
        alt single or batch
            Consumer->>Handler: process(message(s), topic)
            Handler-->>Consumer: result
            Consumer->>Stream: commit(offset of last message)
        else stream error
            Stream->>Consumer: throw error
            Consumer->>Consumer: reconnect(error)
            activate Consumer
            Consumer->>Consumer: for attempt in 1..5
            Consumer->>Consumer: close()
            Consumer->>Consumer: wait backoff (2^attempt * 1000ms)
            Consumer->>Consumer: init()
            alt init succeeds
                Consumer->>Stream: resume handleStream
                break
            end
            end
            alt max attempts exhausted
                Consumer->>ErrorReporter: handleError({ message: "Consumer failed to reconnect after max attempts", maxAttempts: 5 })
                Consumer->>Consumer: isReconnecting = false
            end
            deactivate Consumer
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • kjamrog
  • kibertoad

Poem

🐰 I hopped the stream when it went dark,
I twitched my whiskers, found a spark.
Five small hops, patient and bright,
I try again until it's right.
The pipeline hums — the rabbit's light.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: The title 'feat: Reconnect feature on stream error' directly and accurately summarizes the main change: implementing a reconnect mechanism triggered by stream errors, as evidenced by the new reconnect() method, reconnect test suite, and error-handling flow updates across all Kafka service classes.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@CarlosGamero CarlosGamero changed the title from "feat: Reconnect retries on error" to "feat: Reconnect retries on stream error" Mar 19, 2026
const topics = Object.keys(this.options.handlers)
if (topics.length === 0) throw new Error('At least one topic must be defined')

this.consumer = new Consumer({
Collaborator Author:

The consumer needs to be recreated: once you call close() it ends in a final state, so we need to start from scratch.

Owner:

should we add this as a comment?

Collaborator Author:

Good point! added :D

messageOrBatch: MessageOrBatch<SupportedMessageValues<TopicsConfig>>,
): Promise<void> {
const messageProcessingStartTimestamp = Date.now()
this.logger.debug({ origin: this.constructor.name, topic }, 'Consuming message(s)')
Collaborator Author:

Origin is now added with logger.child to avoid repetition
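As a rough illustration of that pattern, here is a minimal pino-style stand-in showing why binding origin once via child() removes the repetition; the logger interface and class below are illustrative, not the project's actual types:

```typescript
// Minimal pino-style logger stand-in: child() merges bindings into every record.
type LogRecord = Record<string, unknown>

interface Logger {
  debug(ctx: LogRecord, msg: string): void
  child(bindings: LogRecord): Logger
}

const records: LogRecord[] = []

const makeLogger = (bindings: LogRecord = {}): Logger => ({
  debug: (ctx, msg) => records.push({ ...bindings, ...ctx, msg }),
  child: (extra) => makeLogger({ ...bindings, ...extra }),
})

class ExampleKafkaService {
  protected logger: Logger

  constructor(baseLogger: Logger) {
    // Bind origin once; every subsequent log line carries it automatically.
    this.logger = baseLogger.child({ origin: this.constructor.name })
  }

  consume(topic: string): void {
    // No need to spread { origin: this.constructor.name } into each call.
    this.logger.debug({ topic }, 'Consuming message(s)')
  }
}
```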

consumer = new PermissionBatchConsumer(testContext.cradle, { handlers: {} })
// When - Then
await expect(consumer.close()).resolves.not.toThrowError()
await expect(consumer.close()).resolves.not.toThrow()
Collaborator Author:

toThrowError is deprecated

expect(consumer.isConnected).toBe(true)
})

it('should handle errors on reconnection', { timeout: 40_000 }, async () => {
Collaborator Author:

Not super happy with this, I tried using vitest fake timers but they do not work with the stream for some reason 😢

@CarlosGamero CarlosGamero marked this pull request as ready for review March 19, 2026 14:28
@CarlosGamero CarlosGamero changed the title from "feat: Reconnect retries on stream error" to "feat: Reconnect feature on stream error" Mar 19, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/kafka/lib/AbstractKafkaService.ts (1)

128-136: ⚠️ Potential issue | 🟠 Major

Keep handlerError() as a compatibility hook or treat this as a breaking change.

Renaming a protected method on an exported base class is semver-breaking. Downstream subclasses can still compile with handlerError(...), but after this change their override is silently bypassed because internal callers now dispatch to handleError(...) instead.

Suggested compatibility shim
+  protected handleError(error: unknown, context: Record<string, unknown> = {}): void {
+    this.handlerError(error, context)
+  }
+
+  /** `@deprecated` Use handleError() */
-  protected handleError(error: unknown, context: Record<string, unknown> = {}): void {
+  protected handlerError(error: unknown, context: Record<string, unknown> = {}): void {
     const resolvedErrorLog = resolveGlobalErrorLogObject(error)
     this.logger.error({ ...resolvedErrorLog, ...context })
     if (isError(error))
       this.errorReporter.report({
         error,
         context: { ...context, error: resolvedErrorLog.error },
       })
   }
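To see why the shim preserves behavior, here is a stripped-down runnable model of the suggestion; the class names and return strings are purely illustrative:

```typescript
// Model of the suggested shim: internal callers use handleError, which
// delegates to the deprecated handlerError so legacy overrides still run.
class BaseService {
  protected handleError(error: unknown): string {
    return this.handlerError(error)
  }

  /** @deprecated Use handleError() */
  protected handlerError(error: unknown): string {
    return `base handled: ${(error as Error).message}`
  }

  // Stand-in for an internal error dispatch site.
  report(error: unknown): string {
    return this.handleError(error)
  }
}

// A downstream subclass written against the old name keeps working unchanged.
class LegacySubclass extends BaseService {
  protected override handlerError(error: unknown): string {
    return `legacy handled: ${(error as Error).message}`
  }
}
```

Without the shim, LegacySubclass would still compile, but internal callers dispatching to handleError would silently bypass its override.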
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/kafka/lib/AbstractKafkaService.ts` around lines 128 - 136, The
protected method was renamed from handlerError to handleError which is a
breaking change for downstream subclasses overriding handlerError; restore
compatibility by adding a protected handlerError(...) shim that delegates to the
new handleError(...) (or alternatively update internal callers to invoke
handlerError instead). Specifically, in AbstractKafkaService add a protected
handlerError(error: unknown, context: Record<string, unknown> = {}) method that
calls this.handleError(error, context) so existing subclasses overriding
handlerError continue to be invoked, and ensure handleError remains the
canonical implementation used by internal callers like any existing error
dispatch sites.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/kafka/lib/AbstractKafkaConsumer.ts`:
- Around line 228-245: Set and check a shutdown flag to cancel in-flight
reconnects: in close() set a durable boolean (e.g. this._closed = true or
this._closing = true), clear any scheduled reconnect timer (e.g.
this._reconnectTimer/this._reconnectTimeout) and then proceed to close
consumer/streams; in reconnect() check that flag before awaiting backoff and
again immediately before calling init() and bail out (throw or return) if the
flag is set. Update the code paths that schedule reconnection (the reconnect()
loop and any backoff scheduling) to store the timer id so close() can clear it,
and ensure both reconnect() and init() honor the same this._closed/this._closing
flag to avoid reopening the consumer after an explicit close.
- Around line 155-170: The init() guard currently sets this.consumer before
startup completes, which makes subsequent init() calls no-ops if joinGroup() or
consume() throws; change the flow so you only assign this.consumer after the
consumer has fully started (i.e., after joinGroup() and consume() complete
successfully). Concretely, create the Consumer instance in a local variable, run
await localConsumer.joinGroup() and await localConsumer.consume(), then set
this.consumer = localConsumer; apply the same pattern for the other consumer
assignment site referenced around the second block (the code near the 202-207
region) so no partial initialization can be left on error.

In `@packages/kafka/test/consumer/PermissionConsumer.reconnect.spec.ts`:
- Around line 43-51: The test currently only waits for initSpy to be called
which can pass before the reconnect completes; update the test to wait for
init() to resolve and then assert a post-reconnect behavior (e.g., successfully
consuming a new message) to prove reconnect finished. Specifically, use
waitAndRetry or a Promise that waits until initSpy.mock.calls.length > 0 AND the
initSpy call has resolved (inspect the Promise returned by the spied init or
replace the spy with one that returns a controllable Promise), then publish a
fresh message and assert the consumer consumed it (reference consumer, initSpy,
closeSpy, waitAndRetry, and init()). Ensure the assertion verifies message
receipt after init() resolves rather than only that initSpy was invoked.
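The distinction this comment draws — waiting for the call to start versus waiting for it to complete — can be sketched with a hypothetical polling helper in the spirit of the waitAndRetry referenced above; everything below is illustrative, not the project's actual test code:

```typescript
// Hypothetical polling helper: retries a predicate until it holds or gives up.
async function waitFor(cond: () => boolean, tries = 100, delayMs = 5): Promise<boolean> {
  for (let i = 0; i < tries; i++) {
    if (cond()) return true
    await new Promise((resolve) => setTimeout(resolve, delayMs))
  }
  return cond()
}

let initCalled = false
let initResolved = false

// Simulated slow init(): the call starts immediately, but completion lags.
async function init(): Promise<void> {
  initCalled = true
  await new Promise((resolve) => setTimeout(resolve, 50))
  initResolved = true
}

void init()
// A test that only waits for initCalled can pass while reconnection is still
// in flight; a robust test waits for the resolution flag and then asserts a
// post-reconnect behavior such as consuming a fresh message.
```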


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c91381fa-2d77-47c3-a5f4-ad985b737cc5

📥 Commits

Reviewing files that changed from the base of the PR and between f21d661 and 08a6975.

📒 Files selected for processing (6)
  • packages/kafka/lib/AbstractKafkaConsumer.ts
  • packages/kafka/lib/AbstractKafkaPublisher.ts
  • packages/kafka/lib/AbstractKafkaService.ts
  • packages/kafka/test/consumer/PermissionBatchConsumer.spec.ts
  • packages/kafka/test/consumer/PermissionConsumer.reconnect.spec.ts
  • packages/kafka/test/consumer/PermissionConsumer.spec.ts

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
packages/kafka/lib/AbstractKafkaConsumer.ts (1)

216-227: Consider guarding against empty batches.

If messageOrBatch is an empty array, accessing messageOrBatch[0].topic will throw. While KafkaMessageBatchStream shouldn't yield empty batches, defensive coding would guard this edge case.

🛡️ Optional defensive check
   for await (const messageOrBatch of stream) {
+    if (Array.isArray(messageOrBatch) && messageOrBatch.length === 0) continue
     await this.consume(
       Array.isArray(messageOrBatch) ? messageOrBatch[0].topic : messageOrBatch.topic,
       messageOrBatch,
     )
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/kafka/lib/AbstractKafkaConsumer.ts` around lines 216 - 227, The
handleStream method can throw if a yielded KafkaMessageBatchStream batch is an
empty array; add a defensive guard in handleStream before accessing
messageOrBatch[0].topic to skip or error on empty arrays (check
Array.isArray(messageOrBatch) && messageOrBatch.length > 0), and only call
this.consume with the topic when the batch has at least one message; update
logic around the Array.isArray(messageOrBatch) conditional so consume is invoked
with the correct topic for both single messages and non-empty batches and empty
batches are safely ignored or logged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f2bd4cb4-3780-45a8-badf-31901926a6da

📥 Commits

Reviewing files that changed from the base of the PR and between 08a6975 and 8659a3e.

📒 Files selected for processing (1)
  • packages/kafka/lib/AbstractKafkaConsumer.ts

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
packages/kafka/lib/AbstractKafkaConsumer.ts (2)

155-183: ⚠️ Potential issue | 🟠 Major

Delay assigning this.consumer until startup is fully successful.

Line 155 uses this.consumer as the init guard, but Line 160 sets it before joinGroup()/consume(). If startup fails, later init() calls become no-ops on a half-initialized instance.

Suggested fix
-    this.consumer = new Consumer({
+    const consumer = new Consumer({
       ...this.options.kafka,
       ...this.options,
       autocommit: false, // Handling commits manually
       deserializers: {
         key: stringDeserializer,
         value: safeJsonDeserializer,
         headerKey: stringDeserializer,
         headerValue: stringDeserializer,
       },
     })

     try {
       const { handlers: _, ...consumeOptions } = this.options // Handlers cannot be passed to consume method

-      await this.consumer.joinGroup({
+      await consumer.joinGroup({
         sessionTimeout: consumeOptions.sessionTimeout,
         rebalanceTimeout: consumeOptions.rebalanceTimeout,
         heartbeatInterval: consumeOptions.heartbeatInterval,
       })

-      this.consumerStream = await this.consumer.consume({ ...consumeOptions, topics })
+      const consumerStream = await consumer.consume({ ...consumeOptions, topics })
+      this.consumer = consumer
+      this.consumerStream = consumerStream

@@
     } catch (error) {
+      await consumer.close().catch(() => undefined)
       throw new InternalError({
         message: 'Consumer init failed',
         errorCode: 'KAFKA_CONSUMER_INIT_ERROR',
         cause: error,
       })
     }

Also applies to: 202-208

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/kafka/lib/AbstractKafkaConsumer.ts` around lines 155 - 183, The init
guard currently sets this.consumer early which makes subsequent failed starts
permanent; instead instantiate a local Consumer (e.g., const consumer = new
Consumer(...)) and use that local for joinGroup() and consume(); only after
joinGroup() and consume() succeed assign this.consumer = consumer and
this.consumerStream = consumerStream (or assign consumerStream to
this.consumerStream after successful consume). Also apply the same pattern for
the second block referenced (where this.consumer and this.consumerStream are
currently assigned around lines 202-208) so you never set instance fields until
startup completes successfully.

228-245: ⚠️ Potential issue | 🟠 Major

Make reconnect single-flight and cancelable by explicit close().

Right now, reconnect attempts can overlap, and an explicit close() can still be followed by a later init() from an in-flight reconnect loop.

Suggested direction
+  private isClosed = false
+  private reconnectPromise?: Promise<void>

   async init(): Promise<void> {
+    if (this.isClosed) return Promise.resolve()
     if (this.consumer) return Promise.resolve()
@@
   async close(): Promise<void> {
+    this.isClosed = true
     if (!this.consumer) return Promise.resolve()
@@
   private async reconnect(error: unknown): Promise<void> {
+    if (this.reconnectPromise) return this.reconnectPromise
+    this.reconnectPromise = (async () => {
+      if (this.isClosed) return
       this.isReconnecting = true
@@
-      await setTimeout(Math.pow(2, attempt) * 1000) // Backoff delay starting with 1s
+      await setTimeout(Math.pow(2, attempt) * 1000) // Backoff delay starting with 1s
+      if (this.isClosed) return
       await this.init()
@@
-    this.handleError(new Error('Consumer failed to reconnect after max attempts'), {
+    this.handleError(new Error('Consumer failed to reconnect after max attempts'), {
       maxAttempts: MAX_RECONNECT_ATTEMPTS,
     })
+    })().finally(() => {
+      this.isReconnecting = false
+      this.reconnectPromise = undefined
+    })
+    return this.reconnectPromise
   }

Also applies to: 247-261

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/kafka/lib/AbstractKafkaConsumer.ts` around lines 228 - 245, The
reconnect logic allows overlapping reconnect attempts and does not cancel
in-flight reconnect loops when close() is called; make reconnect single-flight
and cancelable by introducing a cancelation and single-flight guard: add an
AbortController or boolean flag (e.g., this._closing / this._closed and
this._reconnectAbort) and a single-flight promise/lock (e.g.,
this._reconnectPromise) used by init() and the reconnect loop to ensure only one
reconnect runs at a time; update init() and any reconnect loop function
(referenced as the reconnect loop around init/reconnect logic, and the init()
method) to check the abort flag/AbortSignal before performing work and to await
or reuse this._reconnectPromise to prevent concurrent runs; modify close() to
set the abort flag/signal, call this._reconnectAbort.abort() (if using
AbortController), await the in-flight this._reconnectPromise to finish or
cancel, and ensure further init() calls return early when this._closed is true
so no new reconnect starts after close() completes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d166066f-c010-4527-9f29-d4413dc63f21

📥 Commits

Reviewing files that changed from the base of the PR and between 8659a3e and a14d4b3.

📒 Files selected for processing (1)
  • packages/kafka/lib/AbstractKafkaConsumer.ts

@CarlosGamero CarlosGamero merged commit d5d4cbe into main Mar 19, 2026
7 checks passed
@CarlosGamero CarlosGamero deleted the feat/reconnect_on_error branch March 19, 2026 17:44