
Problem: bridge dies when Parity is stopped #25

Merged: 3 commits merged into omni:poa-dev-2 on Mar 10, 2018

Conversation

@yrashk (Contributor) commented Feb 27, 2018

Reproduced on the latest (3961987) POA master branch.

Steps to reproduce:

  1. Run two Parity-based nodes responsible for Home and Foreign chains.
  2. Run bridge: RUST_LOG=info bridge --config ... --database ....
  3. Kill parity process responsible for Foreign chain.

Expected results:
The bridge handles the death of the Parity node gracefully: it warns about the connection loss, shuts down all operations (deposit_relay, withdraw_confirm and withdraw_relay), waits until the connection reappears, and then resumes all operations.

Actual results:
After killing the Parity process, the following appears in the terminal where the bridge is running:

WARN:<unknown>: Unexpected IO error: Error { repr: Os { code: 32, message: "Broken pipe" } }

No messages appear from withdraw_confirm and withdraw_relay.
Then, after some time (a few seconds to a few minutes), the following appears in the terminal and the bridge dies:

Request eth_blockNumber timed out

Solution: once the "Broken pipe" error is caught, attempt to reconnect repeatedly, with a pause of 1 second between attempts.

When other errors are caught, simply restart the bridge, as there is no indication that the connection has been severed.
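
A minimal sketch of this retry policy, assuming a hypothetical run_bridge entry point and plain std::io::Error in place of the bridge's error-chain types (not the actual cli/src/main.rs code):

use std::io::ErrorKind;
use std::thread::sleep;
use std::time::Duration;

fn supervise<F>(mut run_bridge: F)
where
    F: FnMut() -> Result<(), std::io::Error>,
{
    loop {
        match run_bridge() {
            // Clean shutdown: stop supervising.
            Ok(()) => return,
            // "Broken pipe" means the node went away; pause 1 second and retry.
            Err(ref e) if e.kind() == ErrorKind::BrokenPipe => {
                sleep(Duration::from_secs(1));
            }
            // Any other error: restart immediately, since there is no
            // indication that the connection has been severed.
            Err(_) => continue,
        }
    }
}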

Fixes #22

cc @akolotov

@ghost assigned yrashk Feb 27, 2018
@ghost added the in progress label Feb 27, 2018
cli/src/main.rs (Outdated)

loop {
    result = match &result {
        // Match the io::Error nested inside the web3 error and check for "Broken pipe" (EPIPE).
        &Err(Error(ErrorKind::Web3(web3::error::Error(web3::error::ErrorKind::Io(ref e), _)), _)) if e.kind() == ::std::io::ErrorKind::BrokenPipe => {
            warn!("Connection to Parity has been severed, attempting to reconnect");
@akolotov (Contributor):

The bridge can work with geth too, so does it make sense to use a more generic message, like 'Connection to the node ...'?

@yrashk (Contributor, author):

Sure. I was just printing this because the repository is called parity-bridge :)

@akolotov (Contributor):

Understood, but it is not from Parity anymore :) Our repo's name is more a legacy matter than a real restriction. Would you mind changing the message?

@yrashk (Contributor, author) commented Mar 2, 2018 via email

@akolotov (Contributor) commented Mar 2, 2018

Yurii, did you test the scenario where the bridge is run before the parity instances are started? I got

INFO:bridge: Parsing cli arguments
INFO:bridge: Loading config
INFO:bridge: Starting event loop
INFO:bridge: Establishing ipc connection
Cannot connect to home node ipc

Caused by:
  Error(Io(Error { repr: Os { code: 2, message: "No such file or directory" } }), State { next_error: None, backtrace: None })

in that case.

The bridge binary is built from your repo (ipc-lost branch).

@akolotov (Contributor) commented Mar 2, 2018

Another observation: I followed the steps to reproduce the issue described in #22 and found that

WARN:bridge: Bridge is down with Request eth_blockNumber timed out, attempting to restart
WARN:bridge: Connection to Parity has been severed, attempting to reconnect

appears only after the timeout expires, rather than right after WARN:<unknown>: Unexpected IO error: Error { repr: Os { code: 32, message: "Broken pipe" } } appears. Do you think it is possible to detect the connection loss earlier than request_timeout expires?

@yrashk (Contributor, author) commented Mar 3, 2018

Yes, I did test the scenario before starting the instances. This happens during the attempted deployment scenario. I am happy to cover it as well if you want.

As for the order of failures, I am afraid I can't control that, to the best of my knowledge.

@akolotov (Contributor) commented Mar 3, 2018

Ok. If it is for deployment only, we don’t need to fix it since a separate deployment script will be implemented.

@akolotov (Contributor) commented Mar 4, 2018

@yrashk, I have tried the case where the bridge is run while at least one parity instance does not exist:

  1. Run 2 parity instances and the bridge. The bridge produces:
INFO:bridge: Parsing cli arguments
INFO:bridge: Loading config
INFO:bridge: Starting event loop
INFO:bridge: Establishing ipc connection
INFO:bridge: Deploying contracts (if needed)
INFO:bridge: Deployed new bridge contracts
  2. Stop the bridge and one parity instance, so only one instance remains:
$ ps ax | grep parity | grep PoA | grep -v screen
32465 pts/8    Ssl+   0:03 /opt/parity/parity --config /home/koal/parity/PoA_foreign.toml
32538 pts/9    Ssl+   0:02 /opt/parity/parity --config /home/koal/parity/PoA_home.toml
$ kill 32538
$ ps ax | grep parity | grep PoA | grep -v screen
32465 pts/8    Ssl+   0:03 /opt/parity/parity --config /home/koal/parity/PoA_foreign.toml
  3. Run the bridge instance:
$ RUST_LOG=info ./bridge --config erc20.toml --database erc20_db.toml
INFO:bridge: Parsing cli arguments
INFO:bridge: Loading config
INFO:bridge: Starting event loop
INFO:bridge: Establishing ipc connection
Cannot connect to home node ipc

Caused by:
  Error(Io(Error { repr: Os { code: 2, message: "No such file or directory" } }), State { next_error: None, backtrace: None })

As you can see, the bridge instance does not start when one parity instance is not accessible through IPC, even when it is not a contract-deployment scenario.

@yrashk (Contributor, author) commented Mar 4, 2018

The bridge always tries to deploy (if necessary) upon startup; that is how that part of the code works. I am wondering, since you mentioned that deployment will be handled elsewhere, does that mean that contract deployment will no longer be the bridge's responsibility?

Bridge should be usable with other node implementations.

Solution: remove reference to Parity
@yrashk (Contributor, author) commented Mar 5, 2018

@akolotov I updated the log message. As for deployment, let's figure out the exact deployment scenario of the bridge contracts to make sure both sides play well together?

@akolotov (Contributor) commented Mar 6, 2018

@yrashk that's correct. Deployment of the bridge contracts will be separated from the bridge binary in order to add more flexibility in managing the bridge contracts:

  • deploy contracts
  • upgrade contracts
  • configure contracts after deployment (gas limits, list of authorities etc.)

So, my suggestion is to not improve any logic in Rust that is used only in the deployment scenario, but definitely to resolve issues that appear during ordinary bridge operations.

@yrashk (Contributor, author) commented Mar 6, 2018 via email

@akolotov (Contributor) commented Mar 6, 2018

@yrashk we will use JS for deployment since the deployment procedure is tied to the main development tool, truffle.

@akolotov (Contributor) commented Mar 8, 2018

Please address the issue with the error appearing when the bridge is started while parity is not running.

@yrashk (Contributor, author) commented Mar 8, 2018

So, are we reverting this decision? I am fine either way, I just need a clear indication.

Ok. If it is for deployment only, we don’t need to fix it since a separate deployment script will be implemented.

@akolotov (Contributor) commented Mar 8, 2018

As I confirmed with the test described above, the issue is reproduced even if the bridge contracts are already deployed. So it is not about reverting the decision, but about addressing the connectivity issue in more cases.

Solution: attempt to connect to the nodes until successful
@yrashk (Contributor, author) commented Mar 8, 2018

I've added functionality to let the bridge keep trying to connect to the nodes at startup (for the deployment/verification of contract deployment) until successful.
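
A sketch of that startup behaviour, assuming a hypothetical connect closure for the IPC transport (the 1-second pause is an illustrative choice, not taken from the actual code):

use std::thread::sleep;
use std::time::Duration;

// Keep calling `connect` until it succeeds, pausing between attempts.
// Covers errors such as "No such file or directory" while the IPC
// socket does not exist yet because parity has not been started.
fn connect_until_successful<T, E, F>(mut connect: F) -> T
where
    F: FnMut() -> Result<T, E>,
{
    loop {
        match connect() {
            Ok(transport) => return transport,
            Err(_) => sleep(Duration::from_secs(1)),
        }
    }
}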

As a side note, I have been thinking a lot about the logic we're putting into this. While it is fine to go with what we have right now, it is worth exploring the idea that, in the longer term, it would be better to rely on a smarter process supervisor, especially considering that contract deployment will soon no longer be part of the startup. Namely, if we just use systemd to supervise the service, we can use the Restart, RestartSec, ExecStop, RestartPreventExitStatus and RestartForceExitStatus settings (systemd.service), as well as StartLimitIntervalSec and StartLimitBurst from systemd.unit, to enable more flexible handling of these errors that also does not require recompiling the bridge.
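
For illustration, a unit using the settings named above might look like this (a sketch only; the binary path, config paths and limit values are assumptions, not part of this PR):

[Unit]
Description=poa-bridge
# From systemd.unit: stop restarting if the service fails more than
# 10 times within 5 minutes.
StartLimitIntervalSec=300
StartLimitBurst=10

[Service]
Environment=RUST_LOG=info
# Illustrative paths only.
ExecStart=/usr/local/bin/bridge --config /etc/bridge/erc20.toml --database /var/lib/bridge/erc20_db.toml
# From systemd.service: restart on failure after a 1-second pause.
Restart=on-failure
RestartSec=1

[Install]
WantedBy=multi-user.target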

@akolotov (Contributor) commented Mar 9, 2018

@yrashk I tried to test your changes one more time and found an issue: the bridge relays the same transactions again if I kill the foreign parity instance and then start it again:

  1. Run both instances of parity
  2. Run bridge
  3. Do deposit and withdraw
  4. Bridge logs display the transactions
  5. Kill the foreign parity instance
  6. Wait a few seconds
  7. Run the foreign parity instance
  8. Bridge logs display new transactions

Could you check it?

@yrashk (Contributor, author) commented Mar 9, 2018 via email

@yrashk (Contributor, author) commented Mar 9, 2018 via email

@akolotov (Contributor):

@yrashk the database is updated with new block numbers, but the transactions are re-sent anyway.

Here are the logs from the foreign parity instance before it is killed:

2018-03-10 09:02:53     0/25 peers   87 KiB chain 5 MiB db 0 bytes queue 448 bytes sync  RPC:  0 conn,  1 req/s, 175 µs
2018-03-10 09:03:10  Transaction mined (hash e0ac09e000484637fb5af07649d90103edbe6dc48640caa2e81956ac11d88b36)
|-------- deposit() invocation
2018-03-10 09:03:10  Imported #6501 fa4a…cf4e (1 txs, 0.08 Mgas, 0.99 ms, 0.77 KiB)
2018-03-10 09:03:20  Transaction mined (hash 7914bbbcc10c105823689628b9815c340eb1a923433b8e5c64ebd8664df31eeb)
|-------- approveAndCall() invocation
2018-03-10 09:03:20  Imported #6502 f517…7239 (1 txs, 0.06 Mgas, 1.01 ms, 0.83 KiB)
2018-03-10 09:03:25  Transaction mined (hash 2c6324d205e276170013f8f4450a7379998b4b4fb4b06001d3b296257b84e103)
|-------- submitSignature() invocation
2018-03-10 09:03:25  Imported #6503 50b8…91df (1 txs, 0.26 Mgas, 1.12 ms, 1.03 KiB)
2018-03-10 09:03:28     0/25 peers   98 KiB chain 5 MiB db 0 bytes queue 448 bytes sync  RPC:  0 conn,  2 req/s, 186 µs

The database file contains the following after that:

$ cat ../erc20_db.toml
home_contract_address = "0x2ace2268ed7a96713e6cd18e9b2c2ef0c306c22c"
foreign_contract_address = "0x5e702ea5d81e14ba807fdd0ac923e811a12bfef1"
home_deploy = 6209
foreign_deploy = 6488
checked_deposit_relay = 6218
checked_withdraw_relay = 6503
checked_withdraw_confirm = 6503

As soon as the parity instance is killed and started again, it produces the following logs:

2018-03-10 09:04:29  Starting Parity/v1.9.2-unstable-0feb0bb-20180201/x86_64-linux-gnu/rustc1.20.0
2018-03-10 09:04:29  Keys path /home/koal/parity/keys/PoA_foreign
2018-03-10 09:04:29  DB path /home/koal/parity/PoA_foreign/chains/PoA_foreign/db/d87bc73279457e60
2018-03-10 09:04:29  Path to dapps /home/koal/parity/PoA_foreign/dapps
2018-03-10 09:04:29  State DB configuration: fast
2018-03-10 09:04:29  Operating mode: active
2018-03-10 09:04:29  Configured for PoA_foreign using AuthorityRound engine
2018-03-10 09:04:30  Public node URL: enode://96b8fa10c0504cb3edb182786fcc8baf6cec3ff5bc4b5f9f82a24c49fe6bd5d09b64a8a2a595ec8d215d0523ec254cf31dda6151cf0a4669edc7bb964fd50af8@192.168.1.15:30343
2018-03-10 09:04:35  Transaction mined (hash 2faba8376df4632da7b4a0ce4cb984b7e51cc90849bf94e8a56a635195253537)
|-------- deposit() invocation
2018-03-10 09:04:35  Transaction mined (hash 00da53a569c639617e23854a2fb3e80745690c69230e07df92e7159f1fe1c958)
|-------- submitSignature() invocation
2018-03-10 09:04:35  Imported #6504 783d…e97e (2 txs, 4.20 Mgas, 1.15 ms, 1.23 KiB)

and the database file contains:

$ cat ../erc20_db.toml
home_contract_address = "0x2ace2268ed7a96713e6cd18e9b2c2ef0c306c22c"
foreign_contract_address = "0x5e702ea5d81e14ba807fdd0ac923e811a12bfef1"
home_deploy = 6209
foreign_deploy = 6488
checked_deposit_relay = 6219
checked_withdraw_relay = 6504
checked_withdraw_confirm = 6504

@yrashk (Contributor, author) commented Mar 10, 2018 via email

@akolotov (Contributor):

@yrashk I cannot agree that we should merge this PR without a fix for the issue I described above. My thoughts are:

  1. The issue does not appear on the original bridge: the bridge process dies after updating the database, so as soon as the bridge starts again it reads checked_deposit_relay, checked_withdraw_relay and checked_withdraw_confirm from the database, and no new transactions appear.
  2. If we merge the PR, we will introduce behaviour where the bridge makes a double spend possible without authorization by a user. That is definitely a bad state for the source code, especially when we don't have any estimate on a fix to address this double spend.

So, the question is whether it is possible to either update the internal state of the bridge properly, so that no new transactions appear when it loses connectivity to the parity instances (the fix I would prefer), or re-read the database when parity connectivity is re-established (I would consider this a quick workaround, and an issue for a proper fix would need to be opened).
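
A minimal sketch of the quick-workaround option, assuming the counters live in a TOML database file like the erc20_db.toml shown above (the struct, function names and serde/toml usage are illustrative, not the bridge's actual code):

// Field names mirror the erc20_db.toml shown above; the remaining
// fields in the file are ignored during deserialization.
#[derive(Debug, serde::Deserialize)]
struct CheckedBlocks {
    checked_deposit_relay: u64,
    checked_withdraw_relay: u64,
    checked_withdraw_confirm: u64,
}

// On reconnect, discard the in-memory counters and trust the file on
// disk, so block ranges that were already relayed are not processed again.
fn reload_database(path: &str) -> Result<CheckedBlocks, Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&raw)?)
}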

@akolotov (Contributor):

I thought about the scenario further and recalled that the bridge contracts currently have protection against double spends arising from handling the same events more than once.

So I will merge the request and create a new issue to update the internal state of the bridge properly when parity connectivity is re-established.

@akolotov merged commit 9c375cc into omni:poa-dev-2 Mar 10, 2018
@ghost removed the in progress label Mar 10, 2018
@yrashk (Contributor, author) commented Mar 10, 2018 via email

@yrashk (Contributor, author) commented Mar 10, 2018 via email

@yrashk (Contributor, author) commented Mar 10, 2018 via email

@yrashk (Contributor, author) commented Mar 10, 2018

@akolotov I'm away from my computer for the weekend, so I don't have proper testing (or even editing) facilities. Would you mind checking yrashk@19241a9? It's a bit crude, but I think it should solve the issue you brought up. Please let me know if you have a chance to try it out.

@akolotov (Contributor):

I will only be able to check the changes late in the evening.

@yrashk (Contributor, author) commented Mar 10, 2018 via email

@akolotov (Contributor):

I have compiled the bridge with the changes proposed above, and the issue with duplicated transactions is no longer reproduced. So we can conclude that yrashk@19241a9 contains the fix.

To track the remaining work on this issue, #27 was created.

noot pushed a commit to noot/poa-bridge that referenced this pull request on Jul 18, 2018:
Problem: bridge dies when Parity is stopped