This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

regression with 2.3.3 and 2.2.10: huge disk reads #10361

Closed

c0deright opened this issue Feb 15, 2019 · 47 comments
Labels
F7-footprint 🐾 An enhancement to provide a smaller (system load, memory, network or disk) footprint. M4-core ⛓ Core client code / Rust. M4-io 💾 Interaction with filesystem/databases. P2-asap 🌊 No need to stop dead in your tracks, however issue should be addressed as soon as possible.
Comments

@c0deright

c0deright commented Feb 15, 2019

  • Parity Ethereum version: 2.3.3-beta and 2.2.10-stable
  • Operating system: Linux
  • Installation: provided binaries
  • Fully synchronized: yes
  • Network: ethereum
  • Restarted: yes

Yesterday I upgraded my stable boxes from 2.2.9 to 2.2.10 and my test box from 2.3.2-beta to 2.3.3-beta.

IO seems to be much higher than with 2.2.9 and 2.3.2: a couple of hours later, all the EBS gp2 volumes the EC2 instances use had exhausted their BurstBalance credits.

From the monitoring I can see that versions 2.2.10 and 2.3.3 do far more disk reads than the versions before. Switching back to 2.2.9 and 2.3.2 resolved the issue: after letting the systems run for several hours, there was no heavy IO.

Switching back to 2.3.3 and 2.2.10: heavy disk reads happened again.

@c0deright c0deright changed the title from "regression with 2.3.3 and 2.2.10: high IO" to "regression with 2.3.3 and 2.2.10: high read IO" Feb 15, 2019
@c0deright c0deright changed the title from "regression with 2.3.3 and 2.2.10: high read IO" to "regression with 2.3.3 and 2.2.10: huge disk reads" Feb 15, 2019
@5chdn 5chdn added this to the 2.4 milestone Feb 15, 2019
@5chdn 5chdn added P5-sometimesoon 🌲 Issue is worth doing soon. F7-footprint 🐾 An enhancement to provide a smaller (system load, memory, network or disk) footprint. Z0-unconfirmed 🤔 Issue might be valid, but it’s not yet known. M4-core ⛓ Core client code / Rust. labels Feb 15, 2019
@c0deright
Author

c0deright commented Feb 15, 2019

Info: BurstBalance is an AWS EBS gp2 volume's remaining I/O credit, expressed as a percentage from 0 to 100. While BurstBalance is above 0% the volume can burst above its provisioned baseline IOPS; once it hits 0% the volume is throttled to a low baseline IOPS limit.
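For anyone who wants to watch this on their own volumes, the metric can be pulled from CloudWatch; a minimal sketch using the AWS CLI (the volume ID is a placeholder):

# BurstBalance is reported in the AWS/EBS CloudWatch namespace, per volume
VOLUME_ID=vol-0123456789abcdef0
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value="$VOLUME_ID" \
  --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average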

[monitoring graph: parity]

@Serpent6877

I concur, I am seeing this as well. I went from zero iowait to upwards of 24%. This is on a hardware RAID-10 of Samsung Pro SSDs.

@tzapu
Contributor

tzapu commented Feb 16, 2019

Same here as well, on 3 archive+trace nodes. This seems to affect performance in general in a very bad way, and now, due to the RPC attack issue, we can't downgrade either.

@jam10o-new
Contributor

Can people here mention which version they are coming from? Specifically, whether they are updating from before 2.2.9/2.3.2 or only by a single patch version.

@tzapu
Contributor

tzapu commented Feb 16, 2019

@joshua-mir in our case at least 3 of 4 were from v2.2.7 to v2.2.10

(v2.2.7 also had some kind of performance drop compared to v2.1.x, but I did not have a chance to document it and it was not nearly this big.)

@jam10o-new
Contributor

@tzapu the earlier performance regression was #10230; it was most noticeable on smaller networks (testnets and private networks), and the fix for it was part of the latest release. We're still trying to diagnose which change in the latest release might be the source of this regression.

@c0deright
Author

c0deright commented Feb 18, 2019

@joshua-mir I only saw the issue going from 2.2.9 to 2.2.10 and from 2.3.2 to 2.3.3.

In my case, when the volume's BurstBalance hits 0% and IOPS is limited to a very low baseline, peers are immediately dropped (I guess parity-ethereum isn't replying fast enough to incoming packets from its peers because it is stuck in IOWAIT for long stretches).
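A quick way to see this from inside the instance, using standard sysstat/procps tools (nothing Parity-specific):

# Extended per-device stats every 5 s: watch r/s and %util on the EBS device
iostat -x 5

# CPU-level view: the 'wa' column is the share of time spent waiting on IO
vmstat 5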

@Serpent6877

Mine was from 2.2.9 to 2.2.10 as well.

@jleeh

jleeh commented Feb 26, 2019

I'm experiencing the same issue. Read operations on 2.2.10 are significantly higher on average than on 2.2.9. Read ops over 3 days below.

2.2.9:
[read ops graph]

2.2.10:
[read ops graph]

Block propagation times are suffering because of it: 2.2.9 averaged around 500-600 ms, while 2.2.10 averages around 1.3 s on the same specs.

This is something that worries me for later versions. I really don't want to have to worry about the integrity and performance of our Parity boxes on every new version I upgrade to, especially as 2.2.10 carried a critical security fix.

@c0deright
Author

Could we remove the Z0-unconfirmed label, since the issue has been confirmed by several users?

This issue currently prevents us from upgrading beyond 2.2.9/2.3.2, and I guess the soon-to-be-released 2.3.5-stable and 2.4.0-beta won't have a fix for it.

@jam10o-new jam10o-new removed the Z0-unconfirmed 🤔 Issue might be valid, but it’s not yet known. label Feb 28, 2019
@jleeh

jleeh commented Feb 28, 2019

To give more evidence of the real impact of this issue:
[read ops graph]
4 boxes, 3 running 2.2.10 and 1 running 2.2.9, all at the exact same specs.

@gituser

gituser commented Mar 7, 2019

Any update on this?
I can also confirm that on v2.3.5 the situation hasn't changed and I observe huge disk reads/writes.

@jam10o-new jam10o-new added P2-asap 🌊 No need to stop dead in your tracks, however issue should be addressed as soon as possible. and removed P5-sometimesoon 🌲 Issue is worth doing soon. labels Mar 8, 2019
@tzapu
Contributor

tzapu commented Mar 26, 2019

We're seeing some encouraging signs with v2.3.8, can anyone else take a look? It looks almost the same as v2.2.7.

@c0deright
Author

I was waiting to hear something like this. Will test ASAP. Thanks @tzapu

@c0deright
Author

@tzapu Can't confirm. Still heavy IO with 2.3.8.

@jam10o-new
Contributor

k, uploaded.

sha256sums:

463e4ab444142f4c48c4f962489b58e7d4362a7f301f16aca60a2bcee35dc705  parity-no-10051
691fca282e68b8a4f2bf3ff7c66f51a828ccd9b23789a711428f0ce3b7ea420d  parity-no-10167-10305
f676b9909b5967297b61384007f6058a36748a6186fd54f58c67b8a56e74266f  parity-no-10167-10305-10051
b8efb141287c926f9bbed4049650eccfc428788ebd0b48af0c7e841d55719d9a  parity-no-10305

Note: these binaries are master (2.5.0) without the specified commits (or any dependent commits). I haven't checked whether they work properly in normal use cases, to be honest, only that they run. You will likely have the best experience with "parity-no-10167-10305", but 10167 was a pretty important security fix, so I wouldn't recommend using it in production. no-10305 is likely worse than plain master.

@c0deright
Author

If anyone is interested in testing: you can download the binaries from my dropbox.

But please check the sha256sums posted above by @joshua-mir before executing binaries downloaded from someone off the internet!
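For reference, the verification with coreutils could look like this (sha256sums.txt is just a file holding the lines posted above):

# Recompute the digests locally and compare them to the posted sums
sha256sum parity-no-10051 parity-no-10167-10305 parity-no-10167-10305-10051 parity-no-10305

# Or save the posted lines to a file and let sha256sum do the comparison
sha256sum -c sha256sums.txt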

Testing right now, will report back on Monday.

@Serpent6877

@c0deright Can you use my script and do some RPC port load testing as well? It requires GNU parallel installed ("apt install parallel" or "yum install parallel"). Maybe do 10000 requests or so. This would emulate more of what a pool does.

#!/bin/bash
#
# Test Parity RPC - Brad Dameron (serpent6877@hotmail.com)
#
# parity-test.sh <RPC PORT> <# requests per test> <# of tests> <delay between tests in seconds>
# Use CTRL+C to exit program
#
# Example ./parity-test.sh 8545 1000 15 5
# Test port 8545 with 1000 requests of each type 15 times with a 5 second delay between tests
#
# Requires parallel installed.


PORT=$1
REQUESTS=$2
TESTS=$3
DELAY=$4
LOGFILE=parity-test.log

# Set some defaults if no command line option
if [ -z "$PORT" ]
then
  PORT=8545
fi
if [ -z "$REQUESTS" ]
then
  REQUESTS=100
fi
if [ -z "$TESTS" ]
then
  TESTS=10
fi
if [ -z "$DELAY" ]
then
  DELAY=5
fi

getBalance() {
   # Despite the name, this exercises several read-only RPC methods
   PORT=$1
   LOGFILE=$2
   # Test eth_getBalance
   curl -s -X POST --data '{"jsonrpc":"2.0","method":"eth_getBalance","params":["0xc94770007dda54cF92009BFF0dE90c06F603a09f", "latest"],"id":1}' -H "Content-Type: application/json" localhost:"$PORT" >> "$LOGFILE"
   # Test eth_getWork
   curl -s -X POST --data '{"jsonrpc":"2.0","method":"eth_getWork","params":[],"id":73}' -H "Content-Type: application/json" localhost:"$PORT" >> "$LOGFILE"
   # Test current eth_blockNumber
   curl -s -X POST --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' -H "Content-Type: application/json" localhost:"$PORT" >> "$LOGFILE"
   # Test net_peerCount
   curl -s -X POST --data '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":74}' -H "Content-Type: application/json" localhost:"$PORT" >> "$LOGFILE"
   # Test eth_accounts
   curl -s -X POST --data '{"jsonrpc":"2.0","method":"eth_accounts","params":[],"id":1}' -H "Content-Type: application/json" localhost:"$PORT" >> "$LOGFILE"
}

export -f getBalance

for run in $(seq 1 "$TESTS")
do
   START=$(date +%s)
   echo "$START - Starting testing ---------" >> "$LOGFILE"
   seq "$REQUESTS" | parallel -j0 -N0 getBalance "$PORT" "$LOGFILE" > /dev/null
   END=$(date +%s)
   DIFF=$(( END - START ))
   echo "$END - Stopping testing ---------" >> "$LOGFILE"
   echo "It took $DIFF seconds for $REQUESTS requests"
   echo "Delaying for $DELAY seconds."
   sleep "$DELAY"
done
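For reference, an invocation matching the header comment (the values are the ones suggested above, not a recommendation):

chmod +x parity-test.sh
./parity-test.sh 8545 10000 15 5   # port 8545, 10000 requests per test, 15 tests, 5 s delay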

@c0deright
Author

@Serpent6877

Can't do it right now. I'm away from the keyboard until Monday and just have the 4 binaries running for ~18 hours each on that t3.medium instance.

Maybe afterwards, if one binary sticks out, I can run your test.

@c0deright
Author

  • parity-no-10051: looks good
  • parity-no-10167-10305: looked good for 3 hours, then went crazy on disk reads
  • parity-no-10167-10305-10051: looks good
  • parity-no-10305: looks good

I'm currently retesting parity-no-10167-10305 to see whether it goes crazy on disk reads again.

[graph: 10361_first_results]

@jam10o-new
Contributor

wow, so the binary I thought would be the most stable is the most unstable.

@c0deright
Author

c0deright commented Apr 1, 2019

Let's not jump to conclusions too easily. I still think that peers connected to my node might play a part here. When, for example, a peer requests old blocks that aren't in my node's cache, it has to read them from disk, right? So maybe that was the case here, with one or more peers requesting old blocks frequently. That's why I'm retesting parity-no-10167-10305 right now.

From the old graphs in this thread one can see that the read spike did in fact start right after one of the affected parity versions was started, not several hours into normal operation. So this might really be a stupid coincidence caused by a peer after 3 hours of runtime.

@soc1c soc1c modified the milestones: 2.5, 2.6 Apr 2, 2019
@c0deright
Author

c0deright commented Apr 2, 2019

[graph: 10361_retest]

As I said, I retested parity-no-10167-10305 and, what can I say, it looks good!

I think it's safe to say that none of the 4 binaries @joshua-mir supplied shows this behaviour, and parity-no-10167-10305 going crazy on disk reads after 3 hours of normal operation in the first test might be related to a peer and not to parity itself.

Right now I'm testing 2.4.4-beta and, without promising too much, it does look good.

@jam10o-new
Contributor

Thanks for all that work regardless 🙏
If this is an issue with peer behaviour, we still need to identify what that behaviour is and account for it, though.

@c0deright
Author

I just looked at the network graphs: there is no out-of-the-ordinary traffic anomaly that correlates with the performance degradation caused by heavy disk reads. If it's really caused by peers, then it's not because we're sending huge numbers of blocks read from disk to our peers.

@c0deright
Author

c0deright commented Apr 2, 2019

@Serpent6877

I used a modified version (diff) of your script to run on my test machine against parity-2.4.4-beta (see the sketch after this list):

  • without the 'eth_getWork' RPC call, because my parity doesn't run with --author
  • only 620 curl requests in parallel, because there aren't enough resources for 1000 (raised ulimit -n and ulimit -u, still a t3.medium)
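Roughly, the modification amounts to something like the following (a sketch, not the exact diff; the ulimit values are illustrative):

# In getBalance(), drop the eth_getWork request, e.g. by commenting out that curl line.
# Raise per-process limits before running (illustrative values):
ulimit -n 65535   # open files
ulimit -u 16384   # processes/threads
# Run with 620 requests per test instead of 1000:
./parity-test.sh 8545 620 15 5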

Results:

  • Load/CPU usage skyrocketed during the test (due to many processes being created in parallel)
  • disk reads increased
  • after the script ran, everything went back to normal
  • parity-2.4.4-beta has been running without problems for 1.5 hours now
1554199781 - Starting testing ---------
1554199802 - Stopping testing ---------
It took 21 seconds for 620 requests
Delaying for 5 seconds.
1554199807 - Starting testing ---------
1554199825 - Stopping testing ---------
It took 18 seconds for 620 requests
Delaying for 5 seconds.
1554199830 - Starting testing ---------
1554199848 - Stopping testing ---------
It took 18 seconds for 620 requests
Delaying for 5 seconds.
1554199853 - Starting testing ---------
1554199871 - Stopping testing ---------
It took 18 seconds for 620 requests
Delaying for 5 seconds.
1554199876 - Starting testing ---------
1554199894 - Stopping testing ---------
It took 18 seconds for 620 requests
Delaying for 5 seconds.
1554199899 - Starting testing ---------
1554199916 - Stopping testing ---------
It took 17 seconds for 620 requests
Delaying for 5 seconds.
1554199921 - Starting testing ---------
1554199942 - Stopping testing ---------
It took 21 seconds for 620 requests
Delaying for 5 seconds.
1554199947 - Starting testing ---------
1554199967 - Stopping testing ---------
It took 20 seconds for 620 requests
Delaying for 5 seconds.
1554199972 - Starting testing ---------
1554199990 - Stopping testing ---------
It took 18 seconds for 620 requests
Delaying for 5 seconds.
1554199995 - Starting testing ---------
1554200013 - Stopping testing ---------
It took 18 seconds for 620 requests
Delaying for 5 seconds.
1554200018 - Starting testing ---------
1554200041 - Stopping testing ---------
It took 23 seconds for 620 requests
Delaying for 5 seconds.
1554200046 - Starting testing ---------
1554200063 - Stopping testing ---------
It took 17 seconds for 620 requests
Delaying for 5 seconds.
1554200068 - Starting testing ---------
1554200089 - Stopping testing ---------
It took 21 seconds for 620 requests
Delaying for 5 seconds.
1554200094 - Starting testing ---------
1554200113 - Stopping testing ---------
It took 19 seconds for 620 requests
Delaying for 5 seconds.
1554200118 - Starting testing ---------
1554200141 - Stopping testing ---------
It took 23 seconds for 620 requests
Delaying for 5 seconds.

@jleeh

jleeh commented Apr 2, 2019

I've rolled out v2.4.5-stable to our Parity instances and it's looking a lot better on my end as well (presuming 2.4.5 has the fix in?). I'm seeing around 500 IOPS on average compared to around 1200 IOPS before.

This is only the first 5 hours of running the version, but on the previous bad version I was already seeing high disk usage by now.

My only complaint, which has always been the case with Parity, is that the read size per op seems quite inefficient. Each read is around 20 KB, when even an SSD with a small block size per op can typically fit around 256 KB per op. If this were batched a bit better, you could really reduce the cost of running a full node.

@jam10o-new
Contributor

@jleeh same request as #10108?

@jleeh

jleeh commented Apr 2, 2019

Yeah same. Thanks for linking.

@c0deright
Author

@jleeh 2.4.5-stable? Are you sure? The latest stable is 2.3.9, the latest beta 2.4.4.

After testing 2.4.4-beta I ran 2.3.9-stable overnight and it looks good, too:

[graph: parity_unstable_and_stable]

@tzapu did you test 2.3.9 yet?

@jleeh

jleeh commented Apr 3, 2019

@c0deright I'm sure. I did raise an eyebrow at it, considering you were testing 2.4.4-beta. This is from the latest docker stable tag:
Parity-Ethereum/v2.4.5-stable-b6a764d-20190402/x86_64-linux-gnu/rustc1.33.0

@c0deright
Author

c0deright commented Apr 3, 2019

@c0deright I'm sure. I did raise an eyebrow at it, considering you were testing 2.4.4-beta. This is from the latest docker stable tag:
Parity-Ethereum/v2.4.5-stable-b6a764d-20190402/x86_64-linux-gnu/rustc1.33.0

@joshua-mir How is this possible when in fact there isn't even a 2.4.5 tag, let alone a 2.4.5 release?

% docker run --rm -it parity/parity:stable --version
  version Parity-Ethereum/v2.4.5-stable-b6a764d-20190402/x86_64-linux-gnu/rustc1.33.0

% docker run --rm -it parity/parity:beta --version
  version Parity-Ethereum/v2.5.0-beta-3c85f29-20190402/x86_64-linux-gnu/rustc1.33.0

@jam10o-new
Contributor

jam10o-new commented Apr 3, 2019

@c0deright #10561 has a change to our CI pipeline which must have accidentally pushed to Docker Hub; I'll let our DevOps team know.

Edit: that PR is probably unrelated.

Edit 2: this is expected; stable and beta are always ahead of releases on Docker @jleeh

@soc1c
Contributor

soc1c commented Apr 3, 2019

Confirming 2.4 is stable now: b6a764d

@c0deright
Author

For me this issue is fixed with v2.3.9-stable and v2.4.4-beta.

Does anybody else see heavy disk IO with these versions?

@c0deright
Author

v2.3.9-stable has been running for ~43 hours now, and in the morning there was a 2-hour window where read operations per second increased by a factor of 8 and then went back to normal:

[graph: parity-239]

The logfile doesn't indicate anything at all besides 3 reorgs at the time read ops began to increase.

I still think this issue is fixed, but there still seems to be an issue with a potentially peer-induced read spike.

@jam10o-new
Contributor

So you are maybe seeing IO spiking on reorgs? I'll close this for now; if we see evidence that the issue hasn't been resolved in ordinary cases I'll reopen, or perhaps open a new issue if we can confirm this is reorg-related (although high IO on a reorg is probably expected).

@jam10o-new jam10o-new added the M4-io 💾 Interaction with filesystem/databases. label Apr 4, 2019
@c0deright
Author

I can't confirm if it has anything to do with reorgs, just that there were 3 in the log at the time. Might be completely unrelated. I'll test some more with v2.3.9-stable and then v2.4.5-stable and might then upgrade prod servers in a week or two.

Thanks for helping debug this, @joshua-mir

@tzapu
Contributor

tzapu commented Apr 9, 2019

[IO graph]
In our case the red line marks the move to v2.3.9; we switched to v2.4.5 about 6 hours ago.

Edit: lower is better :P

@c0deright
Author

c0deright commented Aug 12, 2019

I see this issue again with 2.6.0-beta.

Parity v2.6.0 is constantly reading from disk even with only 5 peers connected. I'm comparing disk IO to v2.2.9, which we're still running in prod because of #10371.

Running parity on AWS EBS volumes of type gp2 is almost impossible because it drains the IO credits so fast.

I have v2.6.0 now configured to only connect to my private nodes (reserved_only = true) and will report back.
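For anyone who wants to reproduce that setup, the equivalent CLI flags should be --reserved-peers and --reserved-only (the file path here is hypothetical; the file holds one enode:// URL per line):

parity --reserved-peers /etc/parity/reserved-peers.txt --reserved-only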

Edit: Seems to be a memory issue rather than peer-induced disk reads: #10962

Upgrading the VM from 4 GB of RAM to 8 GB fixes the massive reads.
