SPU stuck with large consumer and producer #2116

Closed
sehz opened this issue Jan 17, 2022 · 6 comments
Labels: bug (Something isn't working), SPU (SPU related), Stability/Scalability

sehz commented Jan 17, 2022

run long-running longevity test:

CI=true nohup ./target/release/fluvio-test longevity --timeout 90000 -- --runtime-seconds=72000  --producers 100 --consumers 100 > /tmp/long.out 2>/tmp/long.err  &

After a while, the test failed. It seems that the SPU is stuck. Consume is stuck:

$ fluvio consume longevity -T 10
Consuming records starting 10 from the end of topic 'longevity'
⠒

Produce seems to be working:

$ fluvio produce longevity 
> xxx
Ok!
> ^C

Consuming from and producing to a different topic still works.

Logs:

$ kubectl exec fluvio-spg-main-0 -- ls -lh /var/lib/fluvio/data/spu-logs-0/longevity-0
total 6G     
-rw-r--r--    1 root     root        3.5M Jan 16 21:44 00000000000000000000.index
-rw-r--r--    1 root     root     1024.0M Jan 16 21:44 00000000000000000000.log
-rw-r--r--    1 root     root        3.5M Jan 16 22:05 00000000000000906481.index
-rw-r--r--    1 root     root     1024.0M Jan 16 22:05 00000000000000906481.log
-rw-r--r--    1 root     root        3.5M Jan 16 22:26 00000000000001812182.index
-rw-r--r--    1 root     root     1024.0M Jan 16 22:26 00000000000001812182.log
-rw-r--r--    1 root     root        3.5M Jan 16 22:48 00000000000002717805.index
-rw-r--r--    1 root     root     1024.0M Jan 16 22:48 00000000000002717805.log
-rw-r--r--    1 root     root        3.5M Jan 16 23:09 00000000000003623429.index
-rw-r--r--    1 root     root     1024.0M Jan 16 23:09 00000000000003623429.log
-rw-r--r--    1 root     root       10.0M Jan 17 05:12 00000000000004529054.index
-rw-r--r--    1 root     root      694.9M Jan 17 05:12 00000000000004529054.log
-rw-r--r--    1 root     root           8 Jan 17 05:12 replication.chk
@sehz sehz added this to the 0.9.18 milestone Jan 17, 2022
@sehz sehz added bug Something isn't working SPU SPU related Stability/Scalability labels Jan 17, 2022
@sehz sehz self-assigned this Jan 20, 2022

sehz commented Jan 20, 2022

This happens when there is no more disk space:

total 10G
-rw-r--r--    1 root     root        3.5M Jan 20 07:19 00000000000000000000.index
-rw-r--r--    1 root     root     1024.0M Jan 20 07:19 00000000000000000000.log
-rw-r--r--    1 root     root        3.5M Jan 20 07:40 00000000000000906490.index
-rw-r--r--    1 root     root     1024.0M Jan 20 07:40 00000000000000906490.log
-rw-r--r--    1 root     root        3.5M Jan 20 08:01 00000000000001812200.index
-rw-r--r--    1 root     root     1024.0M Jan 20 08:01 00000000000001812200.log
-rw-r--r--    1 root     root        3.5M Jan 20 08:23 00000000000002717832.index
-rw-r--r--    1 root     root     1024.0M Jan 20 08:23 00000000000002717832.log
-rw-r--r--    1 root     root        3.5M Jan 20 08:44 00000000000003623463.index
-rw-r--r--    1 root     root     1024.0M Jan 20 08:44 00000000000003623463.log
-rw-r--r--    1 root     root        3.5M Jan 20 09:06 00000000000004529095.index
-rw-r--r--    1 root     root     1024.0M Jan 20 09:06 00000000000004529095.log
-rw-r--r--    1 root     root        3.5M Jan 20 09:27 00000000000005434727.index
-rw-r--r--    1 root     root     1024.0M Jan 20 09:27 00000000000005434727.log
-rw-r--r--    1 root     root        3.5M Jan 20 09:49 00000000000006340359.index
-rw-r--r--    1 root     root     1024.0M Jan 20 09:49 00000000000006340359.log
-rw-r--r--    1 root     root        3.5M Jan 20 10:10 00000000000007245991.index
-rw-r--r--    1 root     root     1024.0M Jan 20 10:10 00000000000007245991.log
-rw-r--r--    1 root     root        3.5M Jan 20 10:31 00000000000008151623.index
-rw-r--r--    1 root     root     1024.0M Jan 20 10:31 00000000000008151623.log
-rw-r--r--    1 root     root       10.0M Jan 20 10:32 00000000000009057254.index
-rw-r--r--    1 root     root       18.5M Jan 20 10:32 00000000000009057254.log
-rw-r--r--    1 root     root           8 Jan 20 10:32 replication.chk

Produce works fine.
Consume from the beginning works:

 fluvio consume longevity  -B

But consume from the tail doesn't work:

$ fluvio consume longevity --tail 20

Memory is stable after reaching ~900M at one point:

NAME                        CPU(cores)   MEMORY(bytes)
fluvio-sc-5677dd6f4-9nwjm   1m           28Mi
fluvio-spg-main-0           3001m        546Mi

Restarting the SPU fixed the issue with tail:

$ kubectl rollout restart statefulset fluvio-spg-main


sehz commented Jan 20, 2022

In another run, the SPU got stuck with only half of the storage occupied:

total 4G
-rw-r--r--    1 root     root        3.5M Jan 20 17:28 00000000000000000000.index
-rw-r--r--    1 root     root     1024.0M Jan 20 17:28 00000000000000000000.log
-rw-r--r--    1 root     root        3.5M Jan 20 17:50 00000000000000906489.index
-rw-r--r--    1 root     root     1024.0M Jan 20 17:50 00000000000000906489.log
-rw-r--r--    1 root     root        3.5M Jan 20 18:11 00000000000001812198.index
-rw-r--r--    1 root     root     1024.0M Jan 20 18:11 00000000000001812198.log
-rw-r--r--    1 root     root        3.5M Jan 20 18:33 00000000000002717828.index
-rw-r--r--    1 root     root     1024.0M Jan 20 18:33 00000000000002717828.log
-rw-r--r--    1 root     root       10.0M Jan 20 18:40 00000000000003623459.index
-rw-r--r--    1 root     root      368.4M Jan 20 18:40 00000000000003623459.log
-rw-r--r--    1 root     root           8 Jan 20 18:40 replication.chk

NAME                        CPU(cores)   MEMORY(bytes)
fluvio-sc-5677dd6f4-vtrnv   1m           28Mi
fluvio-spg-main-0           5m           535Mi


sehz commented Jan 24, 2022

Looks like the disconnect is due to a failure in the sendfile (zero-copy) call:

2022-01-24T01:02:19.038160Z ERROR fluvio_spu::services::public::stream_fetch: error starting stream fetch handler: Io(
    Custom {
        kind: Other,
        error: "zero copy failed: Nix error: ENOTCONN: Transport endpoint is not connected",
    },
)

There are two problems when this happens:

  • The TCP connection is dropped, and the consumer gets a disconnect network event.
  • The consumer doesn't recover gracefully (see the sketch after the code below).

The handler is spawned in https://github.com/infinyon/fluvio/blob/master/crates/fluvio-spu/src/services/public/stream_fetch.rs:

spawn(async move {
    if let Err(err) = StreamFetchHandler::fetch(
        ctx,
        sink,
        end_event.clone(),
        leader_state,
        stream_id,
        header,
        replica,
        consumer_offset_listener,
        msg,
    )
    .await
    {
        // the error is only logged and the end event notified here,
        // so the client just sees the stream close without an error response
        error!("error starting stream fetch handler: {:#?}", err);
        end_event.notify();
    }
});
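For the second problem, one option on the client side would be to re-open the stream with backoff instead of treating the disconnect as fatal. This is only a rough sketch under that assumption; consume_with_retry and run_stream are hypothetical names, not Fluvio's actual consumer API:

use std::thread::sleep;
use std::time::Duration;

// Hypothetical sketch only: wrap the consume loop so that a dropped
// connection leads to a retry with backoff instead of a silent hang.
// `run_stream` stands in for whatever re-opens and drives the stream.
fn consume_with_retry<E>(mut run_stream: impl FnMut() -> Result<(), E>) {
    let mut delay = Duration::from_millis(100);
    loop {
        match run_stream() {
            // stream ended normally
            Ok(()) => break,
            Err(_disconnect) => {
                // connection dropped (e.g. the ENOTCONN case above):
                // back off and retry instead of giving up
                sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(5));
            }
        }
    }
}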

Also, the SPU only occupies around 32M of physical memory; the 535Mi reported for the pod is virtual memory size.


sehz commented Jan 24, 2022

Attempting to consume with the tail option results in:

$ fluvio consume  longevity --tail 10
Consuming records starting 10 from the end of topic 'longevity'
Error: 
   0: Dataplane error: Offset out of range
   1: Offset out of range

This occurs on both macOS (M1 MacBook Pro) and Linux.


sehz commented Jan 25, 2022

Was able to reproduce the issue by running only a large number of producers, as suspected.

 nohup ./target/release/fluvio-test longevity --timeout 90000  -- --runtime-seconds=72000 --producers 300 --consumers 0 > /tmp/test.out 2> /tmp/test.err &

Consuming from the start works:

$ fluvio consume  longevity -B

But after a certain point, consuming with tail either hangs (the SPU doesn't respond) or returns this error message:

 $ fluvio consume  longevity --tail 10
Consuming records starting 10 from the end of topic 'longevity'
Error: 
   0: Dataplane error: Other error: no records found in active replica, start: 2033342, max: None, active: Segment(base=1813065,end=2033353)
   1: Other error: no records found in active replica, start: 2033342, max: None, active: Segment(base=1813065,end=2033353)

When it hangs, disconnecting the client causes the following error in the SPU:

2022-01-25T20:28:46.201530Z  INFO accept_incoming{self=FluvioApiServer("0.0.0.0:9010")}: fluvio_service::server: Received connection, spawning request handler
2022-01-25T20:43:26.447819Z  INFO handle_request{host="0.0.0.0:9010"}: fluvio_service::server: Response sent successfully, closing connection host=0.0.0.0:9010 peer_addr=127.0.0.1:59974
2022-01-25T20:44:26.497113Z ERROR fluvio_future::zero_copy: error sendfile: ENOTCONN: Transport endpoint is not connected    
2022-01-25T20:44:26.497234Z ERROR fluvio_spu::services::public::stream_fetch: error starting stream fetch handler: Io(
    Custom {
        kind: Other,
        error: "zero copy failed: Nix error: ENOTCONN: Transport endpoint is not connected",
    },
)



@sehz sehz modified the milestones: 0.9.18, 0.9.19 Jan 31, 2022
@nacardin nacardin modified the milestones: 0.9.19, 0.9.20 Feb 9, 2022

sehz commented Feb 15, 2022

The underlying issue is that, under high load, a "batch" is overwritten by the next batch, as seen here:

$ cargo run --bin storage-cli --features=cli --release  validate ~/.fluvio/data/spu-logs-5001/longevity-0 --verbose

.....
index offset = 774845, idx_pos = 917793704, diff_pos=2370
found batch offset = 774846, pos = 917794889, diff_pos = 1185
found batch offset = 774847, pos = 917796074, diff_pos = 1185
index offset = 774847, idx_pos = 917796074, diff_pos=2370
found batch offset = 774848, pos = 917797259, diff_pos = 1185
found batch offset = 774849, pos = 917798444, diff_pos = 1185
index offset = 774849, idx_pos = 917798444, diff_pos=2370
found batch offset = 774851, pos = 917799629, diff_pos = 1185
index offset = 774851, idx_pos = 917800814, diff_pos=2370
-- index mismatch: diff pos = 1185, diff from prev index pos=2370
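To make the mismatch concrete, the positions from the output above line up as follows (worked check only; the constants are copied from this run and the snippet is not part of storage-cli):

fn main() {
    // Batches in this run are 1185 bytes and the index records every other
    // batch (hence diff_pos = 2370 between index entries).
    let batch_len = 1185u64;
    let pos_774849 = 917_798_444u64; // where batch 774849 was found
    let expected_pos_774850 = pos_774849 + batch_len; // = 917_799_629
    let found_pos_774851 = 917_799_629u64; // but batch 774851 sits there instead
    let indexed_pos_774851 = 917_800_814u64; // where the index claims 774851 starts
    assert_eq!(expected_pos_774850, found_pos_774851);
    // the on-disk batch is exactly one batch earlier than the index entry says
    assert_eq!(indexed_pos_774851 - found_pos_774851, batch_len);
}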

Here, batch 774850 is overwritten by 774851 even though its index entry was written. This happens because we are using the stock async file implementation:

impl AsyncWrite for File {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        // repositions the file cursor before every write; under concurrent
        // writes this seek can land on an offset that was already written
        ready!(self.poll_reposition(cx))?;
        self.is_dirty = true;
        Pin::new(self.unblock.get_mut()).poll_write(cx, buf)
    }
    // ...
}

The write here is designed for general-purpose file access; the Tokio file implementation suffers from the same problem. In our case, we open the file in append mode in order to ensure immutable batches. There is no need to reposition, since the write system call atomically positions to the end of the file when it is opened in append mode.
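A minimal sketch of that append-mode write path (append_batch is a hypothetical helper, not the actual fluvio-storage code): with O_APPEND the kernel positions every write() at the current end of file, so there is no user-space reposition step left to race:

use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

// Hypothetical helper illustrating an append-only write path;
// not the actual fluvio-storage implementation.
fn append_batch(path: &Path, batch: &[u8]) -> std::io::Result<()> {
    // append(true) maps to O_APPEND: each write() is atomically positioned
    // at the current end of file, so a batch cannot land on an offset that
    // another write already used.
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(batch)?;
    file.flush()?;
    Ok(())
}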

With std file system writes, the problem seems to be gone. The test now runs 10+ hours without interruption across multiple runs.
