SPU stuck with large consumer and producer #2116

Closed
sehz opened this issue Jan 17, 2022 · 6 comments
Labels: bug (Something isn't working), SPU (SPU related), Stability/Scalability

sehz commented Jan 17, 2022

run long-running longevity test:

CI=true nohup ./target/release/fluvio-test longevity --timeout 90000 -- --runtime-seconds=72000  --producers 100 --consumers 100 > /tmp/long.out 2>/tmp/long.err  &

After a while, the test failed. It seems that the SPU is stuck. Consume is stuck:

$ fluvio consume longevity -T 10
Consuming records starting 10 from the end of topic 'longevity'
⠒

Produce seems to be working:

$ fluvio produce longevity 
> xxx
Ok!
> ^C

Consuming from and producing to a different topic still works.

Logs:

$ kubectl exec fluvio-spg-main-0 -- ls -lh /var/lib/fluvio/data/spu-logs-0/longevity-0
total 6G     
-rw-r--r--    1 root     root        3.5M Jan 16 21:44 00000000000000000000.index
-rw-r--r--    1 root     root     1024.0M Jan 16 21:44 00000000000000000000.log
-rw-r--r--    1 root     root        3.5M Jan 16 22:05 00000000000000906481.index
-rw-r--r--    1 root     root     1024.0M Jan 16 22:05 00000000000000906481.log
-rw-r--r--    1 root     root        3.5M Jan 16 22:26 00000000000001812182.index
-rw-r--r--    1 root     root     1024.0M Jan 16 22:26 00000000000001812182.log
-rw-r--r--    1 root     root        3.5M Jan 16 22:48 00000000000002717805.index
-rw-r--r--    1 root     root     1024.0M Jan 16 22:48 00000000000002717805.log
-rw-r--r--    1 root     root        3.5M Jan 16 23:09 00000000000003623429.index
-rw-r--r--    1 root     root     1024.0M Jan 16 23:09 00000000000003623429.log
-rw-r--r--    1 root     root       10.0M Jan 17 05:12 00000000000004529054.index
-rw-r--r--    1 root     root      694.9M Jan 17 05:12 00000000000004529054.log
-rw-r--r--    1 root     root           8 Jan 17 05:12 replication.chk
@sehz sehz added this to the 0.9.18 milestone Jan 17, 2022
@sehz sehz added bug Something isn't working SPU SPU related Stability/Scalability labels Jan 17, 2022
@sehz sehz self-assigned this Jan 20, 2022

sehz commented Jan 20, 2022

This happens when there is no more disk space:

total 10G
-rw-r--r--    1 root     root        3.5M Jan 20 07:19 00000000000000000000.index
-rw-r--r--    1 root     root     1024.0M Jan 20 07:19 00000000000000000000.log
-rw-r--r--    1 root     root        3.5M Jan 20 07:40 00000000000000906490.index
-rw-r--r--    1 root     root     1024.0M Jan 20 07:40 00000000000000906490.log
-rw-r--r--    1 root     root        3.5M Jan 20 08:01 00000000000001812200.index
-rw-r--r--    1 root     root     1024.0M Jan 20 08:01 00000000000001812200.log
-rw-r--r--    1 root     root        3.5M Jan 20 08:23 00000000000002717832.index
-rw-r--r--    1 root     root     1024.0M Jan 20 08:23 00000000000002717832.log
-rw-r--r--    1 root     root        3.5M Jan 20 08:44 00000000000003623463.index
-rw-r--r--    1 root     root     1024.0M Jan 20 08:44 00000000000003623463.log
-rw-r--r--    1 root     root        3.5M Jan 20 09:06 00000000000004529095.index
-rw-r--r--    1 root     root     1024.0M Jan 20 09:06 00000000000004529095.log
-rw-r--r--    1 root     root        3.5M Jan 20 09:27 00000000000005434727.index
-rw-r--r--    1 root     root     1024.0M Jan 20 09:27 00000000000005434727.log
-rw-r--r--    1 root     root        3.5M Jan 20 09:49 00000000000006340359.index
-rw-r--r--    1 root     root     1024.0M Jan 20 09:49 00000000000006340359.log
-rw-r--r--    1 root     root        3.5M Jan 20 10:10 00000000000007245991.index
-rw-r--r--    1 root     root     1024.0M Jan 20 10:10 00000000000007245991.log
-rw-r--r--    1 root     root        3.5M Jan 20 10:31 00000000000008151623.index
-rw-r--r--    1 root     root     1024.0M Jan 20 10:31 00000000000008151623.log
-rw-r--r--    1 root     root       10.0M Jan 20 10:32 00000000000009057254.index
-rw-r--r--    1 root     root       18.5M Jan 20 10:32 00000000000009057254.log
-rw-r--r--    1 root     root           8 Jan 20 10:32 replication.chk

Produce works fine.
Consume from the beginning works:

 fluvio consume longevity  -B

But consume from the tail doesn't work:

$ fluvio consume longevity --tail 20

Memory is stable after reaching ~900M at one point:

NAME                        CPU(cores)   MEMORY(bytes)
fluvio-sc-5677dd6f4-9nwjm   1m           28Mi
fluvio-spg-main-0           3001m        546Mi

Restarting the SPU fixed the issue with tail:

$ kubectl rollout restart statefulset fluvio-spg-main


sehz commented Jan 20, 2022

In another run, the SPU got stuck with only half of the storage occupied:

total 4G
-rw-r--r--    1 root     root        3.5M Jan 20 17:28 00000000000000000000.index
-rw-r--r--    1 root     root     1024.0M Jan 20 17:28 00000000000000000000.log
-rw-r--r--    1 root     root        3.5M Jan 20 17:50 00000000000000906489.index
-rw-r--r--    1 root     root     1024.0M Jan 20 17:50 00000000000000906489.log
-rw-r--r--    1 root     root        3.5M Jan 20 18:11 00000000000001812198.index
-rw-r--r--    1 root     root     1024.0M Jan 20 18:11 00000000000001812198.log
-rw-r--r--    1 root     root        3.5M Jan 20 18:33 00000000000002717828.index
-rw-r--r--    1 root     root     1024.0M Jan 20 18:33 00000000000002717828.log
-rw-r--r--    1 root     root       10.0M Jan 20 18:40 00000000000003623459.index
-rw-r--r--    1 root     root      368.4M Jan 20 18:40 00000000000003623459.log
-rw-r--r--    1 root     root           8 Jan 20 18:40 replication.chk

NAME                        CPU(cores)   MEMORY(bytes)
fluvio-sc-5677dd6f4-vtrnv   1m           28Mi
fluvio-spg-main-0           5m           535Mi


sehz commented Jan 24, 2022

Looks like the disconnect is due to a failure in the sendfile (zero-copy) call:

2022-01-24T01:02:19.038160Z ERROR fluvio_spu::services::public::stream_fetch: error starting stream fetch handler: Io(
    Custom {
        kind: Other,
        error: "zero copy failed: Nix error: ENOTCONN: Transport endpoint is not connected",
    },
)

There are two problems when this happens:

  • The TCP connection is dropped, and the consumer gets a disconnect network event.
  • The consumer doesn't recover gracefully (see the sketch after the code below).

The handler is spawned in https://github.com/infinyon/fluvio/blob/master/crates/fluvio-spu/src/services/public/stream_fetch.rs:

spawn(async move {
    if let Err(err) = StreamFetchHandler::fetch(
        ctx,
        sink,
        end_event.clone(),
        leader_state,
        stream_id,
        header,
        replica,
        consumer_offset_listener,
        msg,
    )
    .await
    {
        // the error is only logged and the end event notified here,
        // so the client just sees the stream close without an error response
        error!("error starting stream fetch handler: {:#?}", err);
        end_event.notify();
    }
});
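For the second problem, one option on the client side would be to re-open the stream with backoff instead of treating the disconnect as fatal. This is only a rough sketch under that assumption; consume_with_retry and run_stream are hypothetical names, not Fluvio's actual consumer API:

use std::thread::sleep;
use std::time::Duration;

// Hypothetical sketch only: wrap the consume loop so that a dropped
// connection leads to a retry with backoff instead of a silent hang.
// `run_stream` stands in for whatever re-opens and drives the stream.
fn consume_with_retry<E>(mut run_stream: impl FnMut() -> Result<(), E>) {
    let mut delay = Duration::from_millis(100);
    loop {
        match run_stream() {
            // stream ended normally
            Ok(()) => break,
            Err(_disconnect) => {
                // connection dropped (e.g. the ENOTCONN case above):
                // back off and retry instead of giving up
                sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(5));
            }
        }
    }
}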

Also, the SPU only occupies around 32M of physical memory; the 535Mi reported for the pod is virtual memory size.


sehz commented Jan 24, 2022

Attempting to consume with the tail option results in:

$ fluvio consume  longevity --tail 10
Consuming records starting 10 from the end of topic 'longevity'
Error: 
   0: Dataplane error: Offset out of range
   1: Offset out of range

This occurs on both macOS (M1 MacBook Pro) and Linux.


sehz commented Jan 25, 2022

Was able to reproduce the issue by running only a large number of producers, as suspected.

 nohup ./target/release/fluvio-test longevity --timeout 90000  -- --runtime-seconds=72000 --producers 300 --consumers 0 > /tmp/test.out 2> /tmp/test.err &

Consuming from the start works:

$ fluvio consume  longevity -B

But after a certain point, consuming with tail either hangs (the SPU doesn't respond) or returns this error message:

 $ fluvio consume  longevity --tail 10
Consuming records starting 10 from the end of topic 'longevity'
Error: 
   0: Dataplane error: Other error: no records found in active replica, start: 2033342, max: None, active: Segment(base=1813065,end=2033353)
   1: Other error: no records found in active replica, start: 2033342, max: None, active: Segment(base=1813065,end=2033353)

When it hangs, disconnecting the client causes the following error in the SPU:

2022-01-25T20:28:46.201530Z  INFO accept_incoming{self=FluvioApiServer("0.0.0.0:9010")}: fluvio_service::server: Received connection, spawning request handler
2022-01-25T20:43:26.447819Z  INFO handle_request{host="0.0.0.0:9010"}: fluvio_service::server: Response sent successfully, closing connection host=0.0.0.0:9010 peer_addr=127.0.0.1:59974
2022-01-25T20:44:26.497113Z ERROR fluvio_future::zero_copy: error sendfile: ENOTCONN: Transport endpoint is not connected    
2022-01-25T20:44:26.497234Z ERROR fluvio_spu::services::public::stream_fetch: error starting stream fetch handler: Io(
    Custom {
        kind: Other,
        error: "zero copy failed: Nix error: ENOTCONN: Transport endpoint is not connected",
    },
)



@sehz sehz modified the milestones: 0.9.18, 0.9.19 Jan 31, 2022
@nacardin nacardin modified the milestones: 0.9.19, 0.9.20 Feb 9, 2022

sehz commented Feb 15, 2022

The underlying issue is that, under high load, a "batch" is overwritten by the next batch, as seen here:

$ cargo run --bin storage-cli --features=cli --release  validate ~/.fluvio/data/spu-logs-5001/longevity-0 --verbose

.....
index offset = 774845, idx_pos = 917793704, diff_pos=2370
found batch offset = 774846, pos = 917794889, diff_pos = 1185
found batch offset = 774847, pos = 917796074, diff_pos = 1185
index offset = 774847, idx_pos = 917796074, diff_pos=2370
found batch offset = 774848, pos = 917797259, diff_pos = 1185
found batch offset = 774849, pos = 917798444, diff_pos = 1185
index offset = 774849, idx_pos = 917798444, diff_pos=2370
found batch offset = 774851, pos = 917799629, diff_pos = 1185
index offset = 774851, idx_pos = 917800814, diff_pos=2370
-- index mismatch: diff pos = 1185, diff from prev index pos=2370
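To make the mismatch concrete, the positions from the output above line up as follows (worked check only; the constants are copied from this run and the snippet is not part of storage-cli):

fn main() {
    // Batches in this run are 1185 bytes and the index records every other
    // batch (hence diff_pos = 2370 between index entries).
    let batch_len = 1185u64;
    let pos_774849 = 917_798_444u64; // where batch 774849 was found
    let expected_pos_774850 = pos_774849 + batch_len; // = 917_799_629
    let found_pos_774851 = 917_799_629u64; // but batch 774851 sits there instead
    let indexed_pos_774851 = 917_800_814u64; // where the index claims 774851 starts
    assert_eq!(expected_pos_774850, found_pos_774851);
    // the on-disk batch is exactly one batch earlier than the index entry says
    assert_eq!(indexed_pos_774851 - found_pos_774851, batch_len);
}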

Here, batch 774850 is overwritten by 774851 even though its index entry was written. This happens because we are using the stock async file implementation:

impl AsyncWrite for File {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        // repositions the file cursor before every write; under concurrent
        // writes this seek can land on an offset that was already written
        ready!(self.poll_reposition(cx))?;
        self.is_dirty = true;
        Pin::new(self.unblock.get_mut()).poll_write(cx, buf)
    }
    // ...
}

The write here is designed for general-purpose file access; the Tokio file implementation suffers from the same problem. In our case, we open the file in append mode in order to ensure immutable batches. There is no need to reposition, since the write system call atomically positions to the end of the file when it is opened in append mode.
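A minimal sketch of that append-mode write path (append_batch is a hypothetical helper, not the actual fluvio-storage code): with O_APPEND the kernel positions every write() at the current end of file, so there is no user-space reposition step left to race:

use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

// Hypothetical helper illustrating an append-only write path;
// not the actual fluvio-storage implementation.
fn append_batch(path: &Path, batch: &[u8]) -> std::io::Result<()> {
    // append(true) maps to O_APPEND: each write() is atomically positioned
    // at the current end of file, so a batch cannot land on an offset that
    // another write already used.
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(batch)?;
    file.flush()?;
    Ok(())
}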

With std file system writes, the problem seems to be gone. The test now runs 10+ hours without interruption across multiple runs.
