Skip to content
This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Parity does not close domain socket properly, runs out of memory #8774

Closed
santtu opened this issue Jun 3, 2018 · 7 comments · Fixed by paritytech/jsonrpc#277 or #8876
Closed

Parity does not close domain socket properly, runs out of memory #8774

santtu opened this issue Jun 3, 2018 · 7 comments · Fixed by paritytech/jsonrpc#277 or #8876
Assignees
Labels
F2-bug 🐞 The client fails to follow expected behavior. M6-rpcapi 📣 RPC API.
Milestone

Comments

@santtu
Copy link

santtu commented Jun 3, 2018

I'm running:

  • Which Parity version?: 1.11.1
  • Which operating system?: Linux, Ubuntu 16.04
  • How installed?: from source
  • Are you fully synchronized?: no
  • Which network are you connected to?: ethereum
  • Did you try to restart the node?: yes

While running analysis program over IPC, parity started crashing due to Linux OOM killer (like in #8618). I checked around and based on #6516 took a look at lsof on the parity program and it shows a huge number of open sockets:

# lsof -p 11486 | wc -l
238699
<some time passes>
# lsof -p 11486 | wc -l
308174

The program doing the analysis isn't optimized and opens and closes an IPC socket for each request. But while it is itself working stupidly, it is correctly closing the sockets (lsof shows only one active IPC socket open at a time on the analysis program, e.g. the one currently being used).

Thus it seems that parity is not properly closing IPC sockets after they have been closed by the remote end. I killed the analysis program, e.g. there is no IPC/RPC activity and the number of open files on parity did not decrease (but didn't increase either).

(The memory use goes up simultanously, probably due to resources such as buffers etc. associated with the open sockets. At the end of the test parity was hogging 8GB of memory.)

The files left open on parity are IPC sockets:

parity  11486 parity  494u     unix 0xffff88031bc82000       0t0   4077212 /srv/parity/parity.ipc type=STREAM
parity  11486 parity  495u     unix 0xffff880485db1c00       0t0   4077428 /srv/parity/parity.ipc type=STREAM
parity  11486 parity  496u     unix 0xffff8803b8283800       0t0   4080870 /srv/parity/parity.ipc type=STREAM
parity  11486 parity  497u     unix 0xffff88028cbff000       0t0   4077072 /srv/parity/parity.ipc type=STREAM
parity  11486 parity  498u     unix 0xffff8803c6fa1400       0t0   4081311 /srv/parity/parity.ipc type=STREAM
parity  11486 parity  499u     unix 0xffff8804897d8c00       0t0   4083075 /srv/parity/parity.ipc type=STREAM
parity  11486 parity  500u     unix 0xffff8803afe16800       0t0   4081886 /srv/parity/parity.ipc type=STREAM
parity  11486 parity  501u     unix 0xffff88048b5fa800       0t0   4083255 /srv/parity/parity.ipc type=STREAM

I can reproduce this reliably by just opening and closing the IPC socket, even with a shell script (give the socket path as argument):

#!/bin/sh -e
for i in `seq 0 1000`
do
    echo $i
    nc -U $1 </dev/null >/dev/null &
    pid=$!
    sleep 0.1
    kill $pid
done

Monitoring the number of files before starting the script shows a ~650 files open quite stable:

# while lsof -n -p 18186 | wc -l; do; sleep 5; done
630
651
647
652
648
656

When the script is started the numbers start to go up and do not come down even after the script terminates:

721
753
820
860
917
944
1007
1051
1116
1157
1215
1254
1315
1369
1409
1467
1517
1536
1592
1635
1631
1630

So it seems quite clear that parity is not properly either detecting that the remote end has closed the IPC socket, or not properly closing them after that.

@santtu
Copy link
Author

santtu commented Jun 3, 2018

I also observed the same problem with 1.10.1. Also, this issue seems to affect only IPC sockets as using HTTP instead will not cause a file descriptor leakage.

@debris
Copy link
Collaborator

debris commented Jun 4, 2018

thanks for the report!

@debris debris added F2-bug 🐞 The client fails to follow expected behavior. M6-rpcapi 📣 RPC API. labels Jun 4, 2018
@gnunicorn
Copy link
Contributor

gnunicorn commented Jun 4, 2018

This sounds very much like it might be the underlying cause of the errors we've seen and discussed in #6791 .

Thanks @santtu for the specific details, I'll try to reproduce this.


I can confirm Parity (on master) leaking ipc-socket-connections, when they are closed unexepectedly as per the script provided. Currently trying to create a minimal replicatable example to see if this is in parity/jsonrpc crate or within the local code base.

@gnunicorn
Copy link
Contributor

gnunicorn commented Jun 4, 2018

Issues seems to be on the side of parity or the specific way we use jsonrpc. With the script against a minimal rpc socket (gist), I see that every connection is properly closed:

TRACE 2018-06-04T18:28:29Z: jsonrpc_ipc_server::server: Accepted incoming IPC connection: 1
TRACE 2018-06-04T18:28:29Z: jsonrpc_ipc_server::server: Peer: service finished
TRACE 2018-06-04T18:28:29Z: jsonrpc_ipc_server::server: Accepted incoming IPC connection: 2
TRACE 2018-06-04T18:28:29Z: jsonrpc_ipc_server::server: Peer: service finished
TRACE 2018-06-04T18:28:29Z: jsonrpc_ipc_server::server: Accepted incoming IPC connection: 3
TRACE 2018-06-04T18:28:29Z: jsonrpc_ipc_server::server: Peer: service finished
TRACE 2018-06-04T18:28:29Z: jsonrpc_ipc_server::server: Accepted incoming IPC connection: 4
TRACE 2018-06-04T18:28:30Z: jsonrpc_ipc_server::server: Peer: service finished

A message, we are lacking, when run against Parity master (RUST_LOG=ipc=trace cargo run)

2018-06-04 20:36:03   TRACE ipc  Accepted incoming IPC connection: 1035
2018-06-04 20:36:03   TRACE ipc  Accepted incoming IPC connection: 1036
2018-06-04 20:36:03   TRACE ipc  Accepted incoming IPC connection: 1037
2018-06-04 20:36:03   TRACE ipc  Accepted incoming IPC connection: 1038
2018-06-04 20:36:03   TRACE ipc  Accepted incoming IPC connection: 1039
2018-06-04 20:36:03   TRACE ipc  Accepted incoming IPC connection: 1040
2018-06-04 20:36:03   TRACE ipc  Accepted incoming IPC connection: 1041
2018-06-04 20:36:04   TRACE ipc  Accepted incoming IPC connection: 1042
2018-06-04 20:36:04   TRACE ipc  Accepted incoming IPC connection: 1043
2018-06-04 20:36:04   TRACE ipc  Accepted incoming IPC connection: 1044
2018-06-04 20:36:04   TRACE ipc  Accepted incoming IPC connection: 1045
2018-06-04 20:36:04   TRACE ipc  Accepted incoming IPC connection: 1046
2018-06-04 20:36:04   TRACE ipc  Accepted incoming IPC connection: 1047
2018-06-04 20:36:04   TRACE ipc  Accepted incoming IPC connection: 1048
2018-06-04 20:36:04   TRACE ipc  Accepted incoming IPC connection: 1049
2018-06-04 20:36:04   TRACE ipc  Accepted incoming IPC connection: 1050
2018-06-04 20:36:05   TRACE ipc  Accepted incoming IPC connection: 1051
2018-06-04 20:36:05   TRACE ipc  Accepted incoming IPC connection: 1052

We are leaking those connections within parity.

@Tbaut Tbaut added this to the 1.12 milestone Jun 4, 2018
@gnunicorn
Copy link
Contributor

We are leaking those connections within parity.

Small update. I figured out the root Arc causing the leak. Our extractor, run at the establishing of every ipc request, also creates a Session for PubSub-Facilities. However, this Arc is somewhere not being returned properly (or we might even have a cycle somewhere) and stays around even after the connection has closed. However, as the closing-code is supposted to be run during drop and that never happens because someone still holds a strong reference to that Arc (not yet figured out where that happens), drop never happens, thus clean-up never takes place.

Simple way to show this being the cause: replace this Some(...) with None and you'll see the leak is gone . The leak happens within or in interaction with that Session-Arc.

@gnunicorn
Copy link
Contributor

gnunicorn commented Jun 5, 2018

I was able to create an example with jsonrpc alone in the way the objects are structured there. Therefore I consider this a jsonrpc bug (as it must be fixed there) and not strictly parity (as the underlying crate causes the issue although it is being used properly by parity).

@debris debris self-assigned this Jun 6, 2018
@debris
Copy link
Collaborator

debris commented Jun 6, 2018

working on a solution now

@debris debris added this to Needs Assignment in Core via automation Jun 6, 2018
@debris debris moved this from Needs Assignment to Being Actively Worked On in Core Jun 6, 2018
debris added a commit that referenced this issue Jun 12, 2018
debris added a commit that referenced this issue Jun 13, 2018
tavakyan referenced this issue in C4Coin/c4coin-parity Jun 14, 2018
dvdplm added a commit that referenced this issue Jun 14, 2018
* master:
  Handle removed logs in filter changes and add geth compatibility field (#8796)
  fixed ipc leak, closes #8774 (#8876)
  scripts: remove md5 checksums (#8884)
  hardware_wallet/Ledger `Sign messages` + some refactoring (#8868)
andresilva pushed a commit that referenced this issue Jun 18, 2018
ascjones added a commit that referenced this issue Jun 18, 2018
Instead of cherrypicking 8b78141, just ran cargo update -p jsonrpc-core
@ordian ordian moved this from Being Actively Worked On to Done in Core Jun 18, 2018
5chdn pushed a commit that referenced this issue Jun 19, 2018
* Fixed AuthorityRound deadlock on shutdown, closes #8088 (#8803)

* CI: Fix docker tags (#8822)

* scripts: enable docker builds for beta and stable

* scripts: docker latest should be beta not master

* scripts: docker latest is master

* Fix concurrent access to signer queue (#8854)

* Fix concurrent access to signer queue

* Put request back to the queue if confirmation failed

* typo: fix docs and rename functions to be more specific

`request_notify` does not need to be public, and it's renamed to `notify_result`.
`notify` is renamed to `notify_message`.

* Change trace info "Transaction" -> "Request"

* Add new ovh bootnodes and fix port for foundation bootnode 3.2 (#8886)

* Add new ovh bootnodes and fix port for foundation bootnode 3.2

* Remove old bootnodes.

* Remove duplicate 1118980bf48b0a3640bdba04e0fe78b1add18e1cd99bf22d53daac1fd9972ad650df52176e7c7d89d1114cfef2bc23a2959aa54998a46afcf7d91809f0855082

* Block 0 is valid in queries (#8891)

Early exit for block nr 0 leads to spurious error about pruning: `…your node is running with state pruning…`.

Fixes #7547, #8762

* update jsonrpc libs, fixed ipc leak, closes #8774 (#8876)

Instead of cherrypicking 8b78141, just ran cargo update -p jsonrpc-core

* Add ETC Cooperative-run load balanced parity node (#8892)

* Minor fix in chain supplier and light provider (#8906)

* fix chain supplier increment

* fix light provider block_headers
5chdn pushed a commit that referenced this issue Jun 19, 2018
* `duration_ns: u64 -> duration: Duration` (#8457)

* duration_ns: u64 -> duration: Duration

* format on millis {:.2} -> {}

* Keep all enacted blocks notify in order (#8524)

* Keep all enacted blocks notify in order

* Collect is unnecessary

* Update ChainNotify to use ChainRouteType

* Fix all ethcore fn defs

* Wrap the type within ChainRoute

* Fix private-tx and sync api

* Fix secret_store API

* Fix updater API

* Fix rpc api

* Fix informant api

* Eagerly cache enacted/retracted and remove contain_enacted/retracted

* Fix indent

* tests: should use full expr form for struct constructor

* Use into_enacted_retracted to further avoid copy

* typo: not a function

* rpc/tests: ChainRoute -> ChainRoute::new

* Handle removed logs in filter changes and add geth compatibility field (#8796)

* Add removed geth compatibility field in log

* Fix mocked tests

* Add field block hash in PollFilter

* Store last block hash info for log filters

* Implement canon route

* Use canon logs for fetching reorg logs

Light client removed logs fetching is disabled. It looks expensive.

* Make sure removed flag is set

* Address grumbles

* Fixed AuthorityRound deadlock on shutdown, closes #8088 (#8803)

* CI: Fix docker tags (#8822)

* scripts: enable docker builds for beta and stable

* scripts: docker latest should be beta not master

* scripts: docker latest is master

* ethcore: fix ancient block error msg handling (#8832)

* Disable parallel verification and skip verifiying already imported txs. (#8834)

* Reject transactions that are already in pool without verifying them.

* Avoid verifying already imported transactions.

* Fix concurrent access to signer queue (#8854)

* Fix concurrent access to signer queue

* Put request back to the queue if confirmation failed

* typo: fix docs and rename functions to be more specific

`request_notify` does not need to be public, and it's renamed to `notify_result`.
`notify` is renamed to `notify_message`.

* Change trace info "Transaction" -> "Request"

* Don't allocate in expect_valid_rlp unless necessary (#8867)

* don't allocate via format! in case there's no error

* fix test?

* fixed ipc leak, closes #8774 (#8876)

* Add new ovh bootnodes and fix port for foundation bootnode 3.2 (#8886)

* Add new ovh bootnodes and fix port for foundation bootnode 3.2

* Remove old bootnodes.

* Remove duplicate 1118980bf48b0a3640bdba04e0fe78b1add18e1cd99bf22d53daac1fd9972ad650df52176e7c7d89d1114cfef2bc23a2959aa54998a46afcf7d91809f0855082

* Block 0 is valid in queries (#8891)

Early exit for block nr 0 leads to spurious error about pruning: `…your node is running with state pruning…`.

Fixes #7547, #8762

* Add ETC Cooperative-run load balanced parity node (#8892)

* Minor fix in chain supplier and light provider (#8906)

* fix chain supplier increment

* fix light provider block_headers

* Check whether we need resealing in miner and unwrap has_account in account_provider (#8853)

* Remove unused Result wrap in has_account

* Check whether we need to reseal for external transactions

* Fix reference to has_account interface

* typo: missing )

* Refactor duplicates to prepare_and_update_sealing

* Fix build

* Allow disabling local-by-default for transactions with new config entry (#8882)

* Add tx_queue_allow_unknown_local config option

- Previous commit messages:

dispatcher checks if we have the sender account

Add `tx_queue_allow_unknown_local` to MinerOptions

Add `tx_queue_allow_unknown_local` to config

fix order in MinerOptions to match Configuration

add cli flag for tx_queue_allow_unknown_local

Update refs to `tx_queue_allow_unknown_local`

Add tx_queue_allow_unknown_local to config test

revert changes to dispatcher

Move tx_queue_allow_unknown_local to `import_own_transaction`

Fix var name

if statement should return the values

derp de derp derp derp semicolons

Reset dispatch file to how it was before

fix compile issues + change from FLAG to ARG

add test and use `into`

import MinerOptions, clone the secret

Fix tests?

Compiler/linter issues fixed

Fix linter msg - case of constants

IT LIVES

refactor to omit yucky explict return

update comments

Fix based on diff AccountProvider.has_account method

* Refactor flag name + don't change import_own_tx behaviour

fix arg name

Note: force commit to try and get gitlab tests working again 😠

* Add fn to TestMinerService

* Avoid race condition from trusted sources

- refactor the miner tests a bit to cut down on code reuse
- add `trusted` param to dispatch_transaction and import_claimed_local_transaction

Add param to `import_claimed_local_transaction`

Fix fn sig in tests
ordian added a commit to ordian/parity that referenced this issue Jun 20, 2018
…rp_sync_on_light_client

* 'master' of https://github.com/paritytech/parity: (29 commits)
  Block 0 is valid in queries (openethereum#8891)
  fixed osx permissions (openethereum#8901)
  Atomic create new files with permissions to owner in ethstore (openethereum#8896)
  Add ETC Cooperative-run load balanced parity node (openethereum#8892)
  Add support for --chain tobalaba (openethereum#8870)
  fix some warns on nightly (openethereum#8889)
  Add new ovh bootnodes and fix port for foundation bootnode 3.2 (openethereum#8886)
  SecretStore: service pack 1 (openethereum#8435)
  Handle removed logs in filter changes and add geth compatibility field (openethereum#8796)
  fixed ipc leak, closes openethereum#8774 (openethereum#8876)
  scripts: remove md5 checksums (openethereum#8884)
  hardware_wallet/Ledger `Sign messages` + some refactoring (openethereum#8868)
  Check whether we need resealing in miner and unwrap has_account in account_provider (openethereum#8853)
  docker: Fix alpine build (openethereum#8878)
  Remove mac os installers etc (openethereum#8875)
  README.md: update the list of dependencies (openethereum#8864)
  Fix concurrent access to signer queue (openethereum#8854)
  Tx permission contract improvement (openethereum#8400)
  Limit the number of transactions in pending set (openethereum#8777)
  Use sealing.enabled to emit eth_mining information (openethereum#8844)
  ...
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
F2-bug 🐞 The client fails to follow expected behavior. M6-rpcapi 📣 RPC API.
Projects
None yet
5 participants
@gnunicorn @santtu @debris @Tbaut and others