Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Consuming right after eviction results in No such file or directory #3891

Closed
SecretPocketCat opened this issue Mar 7, 2024 · 10 comments · Fixed by #3900
Closed

[Bug]: Consuming right after eviction results in No such file or directory #3891

SecretPocketCat opened this issue Mar 7, 2024 · 10 comments · Fixed by #3900
Assignees
Labels
bug Something isn't working

Comments

@SecretPocketCat
Copy link

SecretPocketCat commented Mar 7, 2024

What happened
Trying to consume right after records have been evicted results in an FS error (I assume that's for the segment containing that record).
The fetched offsets used for resolving logical offsets are stale.

Other error: offset error: Io(
    Os {
        code: 2,
        kind: NotFound,
        message: "No such file or directory",
    },
)

Expected behavior
The consumer will use continue from the first valid/non-evicted offset

Describe the setup

  • client crate ver: 0.21.5
  • local fluvio cluster
 Fluvio CLI                     : 0.11.4                                                           
 Fluvio CLI Arch                : x86_64-unknown-linux-musl                                        
 Fluvio CLI SHA256              : 13ba05c6ef4e37b8c7eb7ea6d703a7cacb5f82fbb31ac28c9b1b617f3884f25b 
 Fluvio Channel Frontend SHA256 : 13ba05c6ef4e37b8c7eb7ea6d703a7cacb5f82fbb31ac28c9b1b617f3884f25b 
 Fluvio Platform                : Not available (docker)                                           
 Git Commit                     : 45c0089dbcd57123ba27470dd410c7b00b93e2ed  

How to reproduce it (as minimally and precisely as possible)

repro repo

Steps to reproduce the behavior:

  1. Create a topic with minimal retention (10s) & small segment size
  2. Produce a bunch of records
  3. Wait for 11s
  4. Start consuming (either with an absolute offset of 0 or from the beginning)
  5. Hit IO error
...
2024-03-07T13:47:47.061990Z DEBUG dispatch_loop{spec="Partition"}:sync_metadata{epoch=98 spec=Partition}: fluvio_stream_model::store::dual_store: Apply changes: Partition no changes, reverting back epoch to: 1
2024-03-07T13:47:57.958475Z DEBUG fluvio::fluvio: Creating consumer topic="17794220-fa77-4fca-9ab8-901aa318d0d7"
2024-03-07T13:47:57.958733Z DEBUG stream:stream_with_config:inner_stream_batches_with_config:request_stream{offset=Offset { inner: Absolute(0) }}: fluvio::offset: fetching offset for replica: 17794220-fa77-4fca-9ab8-901aa318d0d7-0
2024-03-07T13:47:57.958791Z DEBUG stream:stream_with_config:inner_stream_batches_with_config:request_stream{offset=Offset { inner: Absolute(0) }}:send_and_receive{self=MultiplexerSocket 16}: fluvio_socket::multiplexing: sending request api=1002 correlation_id=9
2024-03-07T13:47:57.959061Z DEBUG dispatcher_loop{self=MultiplexDisp(16)}: fluvio_socket::multiplexing: received frame correlation_id=9 len=68 remaining=68
2024-03-07T13:47:57.959133Z DEBUG stream:stream_with_config:inner_stream_batches_with_config:request_stream{offset=Offset { inner: Absolute(0) }}:send_and_receive{self=MultiplexerSocket 16}: fluvio_socket::multiplexing: receive serial message correlation_id=9 len=68
2024-03-07T13:47:57.959198Z DEBUG stream:stream_with_config:inner_stream_batches_with_config:request_stream{offset=Offset { inner: Absolute(0) }}: fluvio::offset: replica: 17794220-fa77-4fca-9ab8-901aa318d0d7-0, fetch offset: error: None, partition: 0, start: 0, last: 352
2024-03-07T13:47:57.959221Z DEBUG stream:stream_with_config:inner_stream_batches_with_config:request_stream{offset=Offset { inner: Absolute(0) }}: fluvio::consumer: start_absolute_offset=0 end_absolute_offset=352 record_count=352
2024-03-07T13:47:57.959247Z DEBUG stream:stream_with_config:inner_stream_batches_with_config:request_stream{offset=Offset { inner: Absolute(0) }}: fluvio::consumer: stream_fetch_version stream_fetch_version=22
2024-03-07T13:47:57.959642Z DEBUG dispatcher_loop{self=MultiplexDisp(16)}: fluvio_socket::multiplexing: received frame correlation_id=10 len=201 remaining=201
2024-03-07T13:47:57.959697Z DEBUG poll_next{api_key=1003 correlation_id=10}: fluvio_socket::multiplexing: response len>>> response_len=201 remaining=201 version=22
2024-03-07T13:47:57.959733Z DEBUG fluvio::consumer: first stream response stream_id=1 last_offset=None
2024-03-07T13:47:57.959776Z DEBUG fluvio_socket::multiplexing: multiplexer stream: 10 closed
2024-03-07T13:47:57.959827Z  INFO run: fluvio::producer::partition_producer: partition producer end event received
2024-03-07T13:47:57.959855Z  INFO run: fluvio::producer::partition_producer: partition producer end
2024-03-07T13:47:57.959931Z DEBUG dispatch_loop{spec="SPU"}: fluvio::sync::controller: SPU terminated
2024-03-07T13:47:57.959945Z  INFO dispatcher_loop{self=MultiplexDisp(9)}: fluvio_socket::multiplexing: multiplexer terminated
2024-03-07T13:47:57.959957Z DEBUG dispatch_loop{spec="SPU"}: fluvio_socket::multiplexing: multiplexer stream: 5 closed
2024-03-07T13:47:57.959998Z DEBUG dispatch_loop{spec="Partition"}: fluvio::sync::controller: Partition terminated
2024-03-07T13:47:57.960025Z DEBUG dispatch_loop{spec="Topic"}: fluvio::sync::controller: received request spec=Topic
2024-03-07T13:47:57.960034Z DEBUG dispatch_loop{spec="Partition"}: fluvio_socket::multiplexing: multiplexer stream: 6 closed
2024-03-07T13:47:57.960058Z DEBUG dispatch_loop{spec="Topic"}: fluvio::sync::controller: No more items to receive from stream!
2024-03-07T13:47:57.960059Z DEBUG dispatch_loop{spec="SPU"}: fluvio::sync::controller: received request spec=SPU
2024-03-07T13:47:57.960073Z DEBUG dispatch_loop{spec="Topic"}: fluvio::sync::controller: Topic terminated
2024-03-07T13:47:57.960002Z DEBUG dispatch_loop{spec="Topic"}: fluvio::sync::controller: Topic terminated
2024-03-07T13:47:57.960078Z DEBUG dispatch_loop{spec="SPU"}: fluvio::sync::controller: No more items to receive from stream!
2024-03-07T13:47:57.960102Z DEBUG dispatch_loop{spec="Partition"}: fluvio::sync::controller: received request spec=Partition
2024-03-07T13:47:57.960108Z DEBUG dispatch_loop{spec="SPU"}: fluvio::sync::controller: SPU terminated
2024-03-07T13:47:57.960115Z DEBUG dispatch_loop{spec="Topic"}: fluvio_socket::multiplexing: multiplexer stream: 3 closed
2024-03-07T13:47:57.960120Z DEBUG dispatch_loop{spec="Partition"}: fluvio::sync::controller: No more items to receive from stream!
2024-03-07T13:47:57.960128Z DEBUG dispatch_loop{spec="SPU"}: fluvio_socket::multiplexing: multiplexer stream: 1 closed
2024-03-07T13:47:57.960140Z DEBUG dispatch_loop{spec="Partition"}: fluvio::sync::controller: Partition terminated
2024-03-07T13:47:57.960184Z DEBUG dispatch_loop{spec="Partition"}: fluvio_socket::multiplexing: multiplexer stream: 2 closed
2024-03-07T13:47:57.960164Z DEBUG dispatch_loop{spec="Topic"}: fluvio_socket::multiplexing: multiplexer stream: 7 closed
Error: Record error: Other error: offset error: Io(
    Os {
        code: 2,
        kind: NotFound,
        message: "No such file or directory",
    },
)

Additional context
I'm using an absolute offset to track offsets (before fluvio offset management lands).

@SecretPocketCat SecretPocketCat added the bug Something isn't working label Mar 7, 2024
@SecretPocketCat
Copy link
Author

Added a repro link.
I've also tried clamping the offset, but the result is the same. I'm not sure whether this'd be considered a bug or if it works as intended, but then I'd be at a loss on how to use absolute (stored) offsets.

@SecretPocketCat SecretPocketCat changed the title [Bug]: Absolute offset is not clamped [Bug]: Absolute offset after eviction results in No such file or directory Mar 7, 2024
@SecretPocketCat SecretPocketCat changed the title [Bug]: Absolute offset after eviction results in No such file or directory [Bug]: Consuming right after eviction results in No such file or directory Mar 7, 2024
@SecretPocketCat
Copy link
Author

I was mistaken in the fact it happens just with an absolute offset - the same thing occurs with Offset::beginning(). The fetched offsets used for the offset resolve seem to be stale.

@ajhunyady
Copy link
Contributor

ajhunyady commented Mar 7, 2024

@SecretPocketCat thanks for reporting.
@galibey is this issue related to the other issue you are working on? If yes, please share the proposed solution here.

@galibey
Copy link
Contributor

galibey commented Mar 12, 2024

Looks like there are two problems here.

  1. When the user attempts to read an evicted offset. The error, in that case, is the valid behavior, but, we do need to improve reporting there. I created a separate ticket for it.
  2. The storage layer does not recognize that the segment is gone, and tries to read from it. That's why we have File not found error. This issue will be fixed within this ticket.

@SecretPocketCat After all these fixes, you will still need to manually restart the consuming. We can't do it implicitly, because this is the kind of decision that every user has to make according to project specifics.

@ajhunyady
Copy link
Contributor

@SecretPocketCat can you please confirm if the fix worked?

Please use latest to check:

fvm install latest

To swith back to the production version, please use:

fvm switch stable

@SecretPocketCat
Copy link
Author

@ajhunyady It does work.
I'm somewhat confused that using an absolute offset works too. I thought that was just supposed to result in the OffsetEvicted error #3902 but with the above mentioned repro running with latest version of fluvio both absolute and beginning work the same.

@ajhunyady
Copy link
Contributor

ajhunyady commented Mar 19, 2024

@ajhunyady It does work. I'm somewhat confused that using an absolute offset works too. I thought that was just supposed to result in the OffsetEvicted error #3902 but with the above mentioned repro running with latest version of fluvio both absolute and beginning work the same.

My assumption is that this was done as a convenience to the client to avoid an error, though it does seem to defeat the purpose of the error. @galibey can you please shed some light on this?

@galibey
Copy link
Contributor

galibey commented Mar 20, 2024

@SecretPocketCat In your repro example there is no eviction occurring there. The segment_size is smaller than batch_size so most of the produced batches are rejected. You can see it by waiting on ProduceOutput returning from send. Only the last records are stored on the topic.
In the example, to call an eviction by size, you can set the batch size and the max partition size:

...
    spec.set_storage(TopicStorageConfig {
        segment_size: Some(10000),
        max_partition_size: Some(20000),
    });
...
    let producer = fluvio
        .topic_producer_with_config(
            topic_name.clone(),
            TopicProducerConfigBuilder::default()
                .batch_size(9000)
                .build()?,
        )
        .await?;
...

Also, make sure you use the latest fluvio crate in your Cargo.toml. Since the last version is not released yet, you need to point it to git master or local path.

@SecretPocketCat
Copy link
Author

@galibey Alright, now it makes sense. Thank you.

@digikata
Copy link
Contributor

FYI you can also do fvm install latest to try out the git master fluvio (fvm switch stable) to go back to the the stable fluvio after.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants