This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

regression with 2.3.3 and 2.2.10: huge disk reads #10361

Closed

c0deright opened this issue Feb 15, 2019 · 47 comments
Labels
F7-footprint 🐾 An enhancement to provide a smaller (system load, memory, network or disk) footprint. M4-core ⛓ Core client code / Rust. M4-io 💾 Interaction with filesystem/databases. P2-asap 🌊 No need to stop dead in your tracks, however issue should be addressed as soon as possible.
Comments

@c0deright

c0deright commented Feb 15, 2019

  • Parity Ethereum version: 2.3.3-beta and 2.2.10-stable
  • Operating system: Linux
  • Installation: provided binaries
  • Fully synchronized: yes
  • Network: ethereum
  • Restarted: yes

Yesterday I upgraded my stable boxes from 2.2.9 to 2.2.10 and my test box from 2.3.2-beta to 2.3.3-beta.

IO seems to be much higher than with 2.2.9 and 2.3.2: a couple of hours later, all the EBS gp2 volumes the EC2 instances use had exhausted their BurstBalance credits.

From the monitoring I can see that versions 2.2.10 and 2.3.3 do far more disk reads than the versions before. Switching back to 2.2.9 and 2.3.2 resolved the issue: after letting the systems run for several hours, there was no heavy IO.

Switching back to 2.3.3 and 2.2.10: heavy disk reads happened again.

@c0deright c0deright changed the title from "regression with 2.3.3 and 2.2.10: high IO" to "regression with 2.3.3 and 2.2.10: high read IO" Feb 15, 2019
@c0deright c0deright changed the title from "regression with 2.3.3 and 2.2.10: high read IO" to "regression with 2.3.3 and 2.2.10: huge disk reads" Feb 15, 2019
@5chdn 5chdn added this to the 2.4 milestone Feb 15, 2019
@5chdn 5chdn added P5-sometimesoon 🌲 Issue is worth doing soon. F7-footprint 🐾 An enhancement to provide a smaller (system load, memory, network or disk) footprint. Z0-unconfirmed 🤔 Issue might be valid, but it’s not yet known. M4-core ⛓ Core client code / Rust. labels Feb 15, 2019
@c0deright
Author

c0deright commented Feb 15, 2019

Info: BurstBalance is an AWS EBS gp2 volume's remaining I/O credit, expressed as a percentage from 0 to 100. While BurstBalance is above 0% the volume can burst above its provisioned baseline IOPS; once it hits 0% the volume is throttled to a low baseline IOPS limit.
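For anyone who wants to watch this on their own volumes, the metric can be pulled from CloudWatch; a minimal sketch using the AWS CLI (the volume ID is a placeholder):

# BurstBalance is reported in the AWS/EBS CloudWatch namespace, per volume
VOLUME_ID=vol-0123456789abcdef0
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value="$VOLUME_ID" \
  --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average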

[monitoring graph: parity]

@Serpent6877

I concur, I am seeing this as well. I went from zero iowait to upwards of 24%. This is on a hardware RAID-10 of Samsung Pro SSDs.

@tzapu
Contributor

tzapu commented Feb 16, 2019

Same here as well, on 3 archive+trace nodes. This seems to affect performance in general in a very bad way, and now, due to the RPC attack issue, we can't downgrade either.

@jam10o-new
Contributor

Can people here mention which version they are coming from? Specifically, whether they are updating from before 2.2.9/2.3.2 or only by a single patch version.

@tzapu
Contributor

tzapu commented Feb 16, 2019

@joshua-mir in our case at least 3 of 4 were from v2.2.7 to v2.2.10

(v2.2.7 also had some kind of performance drop compared to v2.1.x, but I did not have a chance to document it and it was not nearly this big.)

@jam10o-new
Contributor

@tzapu the earlier performance regression was #10230; it was most noticeable on smaller networks (testnets and private networks), and the fix for it was part of the latest release. We're still trying to diagnose which change in the latest release might be the source of this regression.

@c0deright
Author

c0deright commented Feb 18, 2019

@joshua-mir I only saw the issue going from 2.2.9 to 2.2.10 and from 2.3.2 to 2.3.3.

In my case, when the volume's BurstBalance hits 0% and IOPS is limited to a very low baseline, peers are immediately dropped (I guess parity-ethereum isn't replying fast enough to incoming packets from its peers because it is stuck in IOWAIT for long stretches).
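A quick way to see this from inside the instance, using standard sysstat/procps tools (nothing Parity-specific):

# Extended per-device stats every 5 s: watch r/s and %util on the EBS device
iostat -x 5

# CPU-level view: the 'wa' column is the share of time spent waiting on IO
vmstat 5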

@Serpent6877

Mine was from 2.2.9 to 2.2.10 as well.

@jleeh

jleeh commented Feb 26, 2019

I'm experiencing the same issue. Read operations on 2.2.10 are significantly higher on average than on 2.2.9. Read ops over 3 days below.

2.2.9:
[read ops graph]

2.2.10:
[read ops graph]

Block propagation times are suffering because of it: 2.2.9 averaged around 500-600 ms, while 2.2.10 averages around 1.3 s on the same specs.

This is something that worries me for later versions. I really don't want to have to worry about the integrity and performance of our Parity boxes on every new version I upgrade to, especially as 2.2.10 carried a critical security fix.

@c0deright
Author

Could we remove the Z0-unconfirmed label, since the issue has been confirmed by several users?

This issue currently prevents us from upgrading beyond 2.2.9/2.3.2, and I guess the soon-to-be-released 2.3.5-stable and 2.4.0-beta won't have a fix for it.

@jam10o-new jam10o-new removed the Z0-unconfirmed 🤔 Issue might be valid, but it’s not yet known. label Feb 28, 2019
@jleeh

jleeh commented Feb 28, 2019

To give more evidence of the real impact of this issue:
[read ops graph]
4 boxes, 3 running 2.2.10 and 1 running 2.2.9, all at the exact same specs.

@gituser

gituser commented Mar 7, 2019

Any update on this?
I can also confirm that on v2.3.5 the situation hasn't changed and I observe huge disk reads/writes.

@jam10o-new jam10o-new added P2-asap 🌊 No need to stop dead in your tracks, however issue should be addressed as soon as possible. and removed P5-sometimesoon 🌲 Issue is worth doing soon. labels Mar 8, 2019
@tzapu
Contributor

tzapu commented Mar 26, 2019

We're seeing some encouraging signs with v2.3.8, can anyone else take a look? It looks almost the same as v2.2.7.

@c0deright
Author

I was waiting to hear something like this. Will test ASAP. Thanks @tzapu

@c0deright
Author

@tzapu Can't confirm. Still heavy IO with 2.3.8.

@jam10o-new
Contributor

k, uploaded.

sha256sums:

463e4ab444142f4c48c4f962489b58e7d4362a7f301f16aca60a2bcee35dc705  parity-no-10051
691fca282e68b8a4f2bf3ff7c66f51a828ccd9b23789a711428f0ce3b7ea420d  parity-no-10167-10305
f676b9909b5967297b61384007f6058a36748a6186fd54f58c67b8a56e74266f  parity-no-10167-10305-10051
b8efb141287c926f9bbed4049650eccfc428788ebd0b48af0c7e841d55719d9a  parity-no-10305

Note: these binaries are master (2.5.0) without the specified commits (or any dependent commits). I haven't checked whether they work properly in normal use cases, to be honest, only that they run. You will likely have the best experience with "parity-no-10167-10305", but 10167 was a pretty important security fix, so I wouldn't recommend using it in production. no-10305 is likely worse than plain master.

@c0deright
Author

If anyone is interested in testing: you can download the binaries from my dropbox.

But please check the sha256sums posted above by @joshua-mir before executing binaries downloaded from someone off the internet!
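For reference, the verification with coreutils could look like this (sha256sums.txt is just a file holding the lines posted above):

# Recompute the digests locally and compare them to the posted sums
sha256sum parity-no-10051 parity-no-10167-10305 parity-no-10167-10305-10051 parity-no-10305

# Or save the posted lines to a file and let sha256sum do the comparison
sha256sum -c sha256sums.txt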

Testing right now, will report back on Monday.

@Serpent6877

@c0deright Can you use my script and do some RPC port load testing as well? It requires GNU parallel installed ("apt install parallel" or "yum install parallel"). Maybe do 10000 requests or so. This would emulate more of what a pool does.

#!/bin/bash
#
# Test Parity RPC - Brad Dameron (serpent6877@hotmail.com)
#
# parity-test.sh <RPC PORT> <# requests per test> <# of tests> <delay between tests in seconds>
# Use CTRL+C to exit program
#
# Example ./parity-test.sh 8545 1000 15 5
# Test port 8545 with 1000 requests of each type 15 times with a 5 second delay between tests
#
# Requires parallel installed.


PORT=$1
REQUESTS=$2
TESTS=$3
DELAY=$4
LOGFILE=parity-test.log

# Set some defaults if no command line option
if [ -z "$PORT" ]
then
  PORT=8545
fi
if [ -z "$REQUESTS" ]
then
  REQUESTS=100
fi
if [ -z "$TESTS" ]
then
  TESTS=10
fi
if [ -z "$DELAY" ]
then
  DELAY=5
fi

getBalance() {
   # Despite the name, this exercises several read-only RPC methods
   PORT=$1
   LOGFILE=$2
   # Test eth_getBalance
   curl -s -X POST --data '{"jsonrpc":"2.0","method":"eth_getBalance","params":["0xc94770007dda54cF92009BFF0dE90c06F603a09f", "latest"],"id":1}' -H "Content-Type: application/json" localhost:"$PORT" >> "$LOGFILE"
   # Test eth_getWork
   curl -s -X POST --data '{"jsonrpc":"2.0","method":"eth_getWork","params":[],"id":73}' -H "Content-Type: application/json" localhost:"$PORT" >> "$LOGFILE"
   # Test current eth_blockNumber
   curl -s -X POST --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' -H "Content-Type: application/json" localhost:"$PORT" >> "$LOGFILE"
   # Test net_peerCount
   curl -s -X POST --data '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":74}' -H "Content-Type: application/json" localhost:"$PORT" >> "$LOGFILE"
   # Test eth_accounts
   curl -s -X POST --data '{"jsonrpc":"2.0","method":"eth_accounts","params":[],"id":1}' -H "Content-Type: application/json" localhost:"$PORT" >> "$LOGFILE"
}

export -f getBalance

for run in $(seq 1 "$TESTS")
do
   START=$(date +%s)
   echo "$START - Starting testing ---------" >> "$LOGFILE"
   seq "$REQUESTS" | parallel -j0 -N0 getBalance "$PORT" "$LOGFILE" > /dev/null
   END=$(date +%s)
   DIFF=$(( END - START ))
   echo "$END - Stopping testing ---------" >> "$LOGFILE"
   echo "It took $DIFF seconds for $REQUESTS requests"
   echo "Delaying for $DELAY seconds."
   sleep "$DELAY"
done
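For reference, an invocation matching the header comment (the values are the ones suggested above, not a recommendation):

chmod +x parity-test.sh
./parity-test.sh 8545 10000 15 5   # port 8545, 10000 requests per test, 15 tests, 5 s delay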

@c0deright
Author

@Serpent6877

Can't do it right now. I'm away from the keyboard until Monday and just have the 4 binaries running for ~18 hours each on that t3.medium instance.

Maybe afterwards, if one binary sticks out, I can run your test.

@c0deright
Author

  • parity-no-10051: looks good
  • parity-no-10167-10305: looked good for 3 hours, then went crazy on disk reads
  • parity-no-10167-10305-10051: looks good
  • parity-no-10305: looks good

I'm currently retesting parity-no-10167-10305 to see whether it goes crazy on disk reads again.

[graph: 10361_first_results]

@jam10o-new
Contributor

wow, so the binary I thought would be the most stable is the most unstable.

@c0deright
Author

c0deright commented Apr 1, 2019

Let's not jump to conclusions too easily. I still think that peers connected to my node might play a part here. When, for example, a peer requests old blocks that aren't in my node's cache, it has to read them from disk, right? So maybe that was the case here, with one or more peers requesting old blocks frequently. That's why I'm retesting parity-no-10167-10305 right now.

From the old graphs in this thread one can see that the read spike did in fact start right after one of the affected parity versions was started, not several hours into normal operation. So this might really be a stupid coincidence caused by a peer after 3 hours of runtime.

@soc1c soc1c modified the milestones: 2.5, 2.6 Apr 2, 2019
@c0deright
Author

c0deright commented Apr 2, 2019

[graph: 10361_retest]

As I said, I retested parity-no-10167-10305 and, what can I say, it looks good!

I think it's safe to say that none of the 4 binaries @joshua-mir supplied shows this behaviour, and parity-no-10167-10305 going crazy on disk reads after 3 hours of normal operation in the first test might be related to a peer and not to parity itself.

Right now I'm testing 2.4.4-beta and, without promising too much, it does look good.

@jam10o-new
Contributor

Thanks for all that work regardless 🙏
If this is an issue with peer behaviour, we still need to identify what that behaviour is and account for it, though.

@c0deright
Author

I just looked at the network graphs: there is no out-of-the-ordinary traffic anomaly that correlates with the performance degradation caused by heavy disk reads. If it's really caused by peers, then it's not because we're sending huge numbers of blocks read from disk to our peers.

@c0deright
Author

c0deright commented Apr 2, 2019

@Serpent6877

I used a modified version (diff) of your script to run on my test machine against parity-2.4.4-beta (see the sketch after this list):

  • without the 'eth_getWork' RPC call, because my parity doesn't run with --author
  • only 620 curl requests in parallel, because there aren't enough resources for 1000 (raised ulimit -n and ulimit -u, still a t3.medium)
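Roughly, the modification amounts to something like the following (a sketch, not the exact diff; the ulimit values are illustrative):

# In getBalance(), drop the eth_getWork request, e.g. by commenting out that curl line.
# Raise per-process limits before running (illustrative values):
ulimit -n 65535   # open files
ulimit -u 16384   # processes/threads
# Run with 620 requests per test instead of 1000:
./parity-test.sh 8545 620 15 5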

Results:

  • Load/CPU usage skyrocketed during the test (due to many processes being created in parallel)
  • disk reads increased
  • after the script ran, everything went back to normal
  • parity-2.4.4-beta has been running without problems for 1.5 hours now
1554199781 - Starting testing ---------
1554199802 - Stopping testing ---------
It took 21 seconds for 620 requests
Delaying for 5 seconds.
1554199807 - Starting testing ---------
1554199825 - Stopping testing ---------
It took 18 seconds for 620 requests
Delaying for 5 seconds.
1554199830 - Starting testing ---------
1554199848 - Stopping testing ---------
It took 18 seconds for 620 requests
Delaying for 5 seconds.
1554199853 - Starting testing ---------
1554199871 - Stopping testing ---------
It took 18 seconds for 620 requests
Delaying for 5 seconds.
1554199876 - Starting testing ---------
1554199894 - Stopping testing ---------
It took 18 seconds for 620 requests
Delaying for 5 seconds.
1554199899 - Starting testing ---------
1554199916 - Stopping testing ---------
It took 17 seconds for 620 requests
Delaying for 5 seconds.
1554199921 - Starting testing ---------
1554199942 - Stopping testing ---------
It took 21 seconds for 620 requests
Delaying for 5 seconds.
1554199947 - Starting testing ---------
1554199967 - Stopping testing ---------
It took 20 seconds for 620 requests
Delaying for 5 seconds.
1554199972 - Starting testing ---------
1554199990 - Stopping testing ---------
It took 18 seconds for 620 requests
Delaying for 5 seconds.
1554199995 - Starting testing ---------
1554200013 - Stopping testing ---------
It took 18 seconds for 620 requests
Delaying for 5 seconds.
1554200018 - Starting testing ---------
1554200041 - Stopping testing ---------
It took 23 seconds for 620 requests
Delaying for 5 seconds.
1554200046 - Starting testing ---------
1554200063 - Stopping testing ---------
It took 17 seconds for 620 requests
Delaying for 5 seconds.
1554200068 - Starting testing ---------
1554200089 - Stopping testing ---------
It took 21 seconds for 620 requests
Delaying for 5 seconds.
1554200094 - Starting testing ---------
1554200113 - Stopping testing ---------
It took 19 seconds for 620 requests
Delaying for 5 seconds.
1554200118 - Starting testing ---------
1554200141 - Stopping testing ---------
It took 23 seconds for 620 requests
Delaying for 5 seconds.

@jleeh

jleeh commented Apr 2, 2019

I've rolled out v2.4.5-stable to our Parity instances and it's looking a lot better on my end as well (presuming 2.4.5 has the fix in?). I'm seeing around 500 IOPS on average compared to around 1200 IOPS before.

This is only the first 5 hours of running the version, but on the previous bad version I was already seeing high disk usage by now.

My only complaint, which has always been the case with Parity, is that the read size per op seems quite inefficient. Each read is around 20 KB, when even an SSD with a small block size per op can typically fit around 256 KB per op. If this were batched a bit better, you could really reduce the cost of running a full node.

@jam10o-new
Contributor

@jleeh same request as #10108?

@jleeh

jleeh commented Apr 2, 2019

Yeah same. Thanks for linking.

@c0deright
Author

@jleeh 2.4.5-stable? Are you sure? The latest stable is 2.3.9, the latest beta 2.4.4.

After testing 2.4.4-beta I ran 2.3.9-stable overnight and it looks good, too:

[graph: parity_unstable_and_stable]

@tzapu did you test 2.3.9 yet?

@jleeh

jleeh commented Apr 3, 2019

@c0deright I'm sure. I did raise an eyebrow at it, considering you were testing 2.4.4-beta. This is from the latest docker stable tag:
Parity-Ethereum/v2.4.5-stable-b6a764d-20190402/x86_64-linux-gnu/rustc1.33.0

@c0deright
Author

c0deright commented Apr 3, 2019

@c0deright I'm sure. I did raise an eyebrow at it, considering you were testing 2.4.4-beta. This is from the latest docker stable tag:
Parity-Ethereum/v2.4.5-stable-b6a764d-20190402/x86_64-linux-gnu/rustc1.33.0

@joshua-mir How is this possible when in fact there isn't even a 2.4.5 tag, let alone a 2.4.5 release?

% docker run --rm -it parity/parity:stable --version
  version Parity-Ethereum/v2.4.5-stable-b6a764d-20190402/x86_64-linux-gnu/rustc1.33.0

% docker run --rm -it parity/parity:beta --version
  version Parity-Ethereum/v2.5.0-beta-3c85f29-20190402/x86_64-linux-gnu/rustc1.33.0

@jam10o-new
Contributor

jam10o-new commented Apr 3, 2019

@c0deright #10561 has a change to our CI pipeline which must have accidentally pushed to Docker Hub; I'll let our DevOps team know.

Edit: that PR is probably unrelated.

Edit 2: this is expected; stable and beta are always ahead of releases on Docker @jleeh

@soc1c
Contributor

soc1c commented Apr 3, 2019

Confirming 2.4 is stable now: b6a764d

@c0deright
Author

For me this issue is fixed with v2.3.9-stable and v2.4.4-beta.

Does anybody else see heavy disk IO with these versions?

@c0deright
Author

v2.3.9-stable has been running for ~43 hours now, and in the morning there was a 2-hour window where read operations per second increased by a factor of 8 and then went back to normal:

[graph: parity-239]

The logfile doesn't indicate anything at all besides 3 reorgs at the time read ops began to increase.

I still think this issue is fixed, but there still seems to be an issue with a potentially peer-induced read spike.

@jam10o-new
Contributor

So you are maybe seeing IO spiking on reorgs? I'll close this for now; if we see evidence that the issue hasn't been resolved in ordinary cases I'll reopen, or perhaps open a new issue if we can confirm this is reorg-related (although high IO on a reorg is probably expected).

@jam10o-new jam10o-new added the M4-io 💾 Interaction with filesystem/databases. label Apr 4, 2019
@c0deright
Author

I can't confirm if it has anything to do with reorgs, just that there were 3 in the log at the time. Might be completely unrelated. I'll test some more with v2.3.9-stable and then v2.4.5-stable and might then upgrade prod servers in a week or two.

Thanks for helping debug this, @joshua-mir

@tzapu
Contributor

tzapu commented Apr 9, 2019

[IO graph]
In our case the red line marks the move to v2.3.9; we switched to v2.4.5 about 6 hours ago.

Edit: lower is better :P

@c0deright
Author

c0deright commented Aug 12, 2019

I see this issue again with 2.6.0-beta.

Parity v2.6.0 is constantly reading from disk even with only 5 peers connected. I'm comparing disk IO to v2.2.9, which we're still running in prod because of #10371.

Running parity on AWS EBS volumes of type gp2 is almost impossible because it drains the IO credits so fast.

I have v2.6.0 now configured to only connect to my private nodes (reserved_only = true) and will report back.
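For anyone who wants to reproduce that setup, the equivalent CLI flags should be --reserved-peers and --reserved-only (the file path here is hypothetical; the file holds one enode:// URL per line):

parity --reserved-peers /etc/parity/reserved-peers.txt --reserved-only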

Edit: Seems to be a memory issue rather than peer-induced disk reads: #10962

Upgrading the VM from 4 GB of RAM to 8 GB fixes the massive reads.
