Problem: bridge dies when Parity is stopped #25
Conversation
Steps to reproduce: run two Parity-based nodes responsible for the Home and Foreign chains, run the bridge with RUST_LOG=info bridge --config ... --database ..., and kill the parity process responsible for the Foreign chain.

Expected results: the bridge handles the death of the Parity node gracefully: it warns about the connection loss, shuts down all operations (deposit_relay, withdraw_confirm and withdraw_relay) for a while, waits until the connection reappears, and then resumes all operations.

Actual results: after killing the Parity process, the following appears in the terminal where the bridge is running:

WARN:<unknown>: Unexpected IO error: Error { repr: Os { code: 32, message: "Broken pipe" } }

No messages appear from withdraw_confirm and withdraw_relay. Then after some time (a few seconds to a few minutes) the following appears in the terminal and the bridge dies:

Request eth_blockNumber timed out

Solution: once the "Broken pipe" error is caught, attempt to reconnect repeatedly with a pause of 1 second between attempts. When other errors are caught, simply restart the bridge, as there is no indication that the connection has been severed.

Fixes omni#22
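For readers skimming the thread, here is a minimal, self-contained sketch of the control flow the solution describes: retry with a one-second pause on a broken pipe, bail out on anything else. The helper name run_until_fatal, the closure, and the bare std::io::Error are illustrative assumptions; the actual change, shown in the diff below, matches a web3 error wrapped by error-chain around event_loop.run(bridge).

```rust
use std::io;
use std::thread;
use std::time::Duration;

// Sketch only: `run_bridge` stands in for building the bridge future and
// driving it to completion on the event loop.
fn run_until_fatal<F>(mut run_bridge: F) -> io::Result<()>
where
    F: FnMut() -> io::Result<()>,
{
    loop {
        match run_bridge() {
            // Connection to the node was severed: wait a second, then retry.
            Err(ref e) if e.kind() == io::ErrorKind::BrokenPipe => {
                eprintln!("Connection to the node has been severed, attempting to reconnect");
                thread::sleep(Duration::from_secs(1));
            }
            // Success or any other error ends the loop; for other errors an
            // external restart of the bridge is expected.
            other => return other,
        }
    }
}

fn main() -> io::Result<()> {
    // Hypothetical runner that succeeds immediately, just to show the call shape.
    run_until_fatal(|| Ok(()))
}
```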
cli/src/main.rs
Outdated
@@ -86,8 +87,34 @@ fn execute<S, I>(command: I) -> Result<String, Error> where I: IntoIterator<Item
 };
 info!(target: "bridge", "Starting listening to events");
-let bridge = create_bridge(app_ref, &database).and_then(|_| future::ok(true)).collect();
-event_loop.run(bridge)?;
+let bridge = create_bridge(app_ref.clone(), &database).and_then(|_| future::ok(true)).collect();
+let mut result = event_loop.run(bridge);
+loop {
+    result = match &result {
+        &Err(Error(ErrorKind::Web3(web3::error::Error(web3::error::ErrorKind::Io(ref e), _)), _)) if e.kind() == ::std::io::ErrorKind::BrokenPipe => {
+            warn!("Connection to Parity has been severed, attempting to reconnect");
The bridge can work with geth too, so does it make sense to use a more generic message like 'Connection to the node ...'?
Sure. I was just printing this because the repository is called parity-bridge :)
Understood. But it is not from Parity anymore :) Our repo's name is more of a legacy matter than a real restriction. Would you mind changing the message?
Ok, will do this tomorrow.
In general, I suggest that such trivial changes be done by the merger, reducing the need for back-and-forth communication over a triviality.
Yurii, did you test the scenario where the bridge is run before the parity instances are started? I got
in that case. The bridge binary is built from your repo (ipc-lost branch).
Another observation: I followed the steps to reproduce the issue described in #22 and found that
appears just after the timeout expires rather than after the appearance of
Yes, I did test the scenario of running the bridge before starting the instances. This happens during the attempted deployment scenario. I am happy to cover it as well if you want. As for the order of failures, I am afraid I can't control that, to the best of my knowledge.
Ok. If it is for deployment only, we don’t need to fix it since a separate deployment script will be implemented.
@yrashk, I have tried the case when the bridge is run while at least one parity instance does not exist:
As you can see, the bridge instance does not start when one parity instance is not accessible through IPC, even if it is not a contract deployment scenario.
The bridge always tries to deploy (if necessary) upon startup; that's how that part of the code works. I am wondering, since you mentioned that deployment will be handled elsewhere, does that mean that contract deployment will no longer be the bridge's responsibility?
Bridge should be usable with other node implementations. Solution: remove reference to Parity
@akolotov I updated the log message. As for deployment, let's figure out the exact deployment scenario for the bridge contracts to make sure both sides play well together?
@yrashk that's correct. Deployment of the bridge contracts will be separated from the bridge binary in order to add more flexibility in managing the bridge contracts:
- deploy contracts
- upgrade contracts
- configure contracts after deployment (gas limits, list of authorities etc.)
So, my suggestion is not to improve any logic on the Rust side that is used only in the deployment scenario, but to definitely resolve the issues that appear during ordinary bridge operation.
What technology stack will be used for contract provisioning? Asking because I want to make sure it's a good fit for integration tests.
@yrashk we will use JS for deployment since the deployment procedure is tied to the main development tool -
Please address the issue with the error appearing when the bridge is started while parity is not running.
> Ok. If it is for deployment only, we don’t need to fix it since a separate deployment script will be implemented.

So, are we reverting this decision? I am fine either way, just need a clear indication.
As I confirmed by the test described above, the issue is reproduced even if the bridge contracts are already deployed. So it is not about reverting the decision, but about addressing the connectivity issue in more cases.
Solution: attempt to connect to the nodes until successful
I've added functionality to let the bridge keep attempting to reconnect to the nodes at its startup (the deployment, or verification of deployment, of the contracts) until it succeeds. As a side note, I have been thinking a lot about the logic we're putting into this. While it is fine to go with what we have right now, it is worth exploring the idea that in the longer term it would be better to rely on a smarter process supervisor (especially considering that contract deployment will soon no longer be part of the startup). Namely, if we just use
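To make the startup behaviour concrete, here is a small sketch of the "retry until successful" idea. The helper connect_until_successful and the io::Error-returning closure are placeholder assumptions, not the bridge's real transport setup, which goes through web3 IPC transports.

```rust
use std::io;
use std::thread;
use std::time::Duration;

// Sketch only: `connect` stands in for establishing the IPC connection and
// running the deploy/verify step.
fn connect_until_successful<T, F>(mut connect: F, pause: Duration) -> T
where
    F: FnMut() -> io::Result<T>,
{
    loop {
        match connect() {
            Ok(handle) => return handle,
            Err(e) => {
                eprintln!("Node is not reachable yet ({}), retrying in {:?}", e, pause);
                thread::sleep(pause);
            }
        }
    }
}

fn main() {
    // Hypothetical usage: keep trying once per second until the call succeeds.
    let _handle: () = connect_until_successful(|| Ok(()), Duration::from_secs(1));
}
```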
@yrashk I tried to test your changes one more time and found an issue: the bridge relays the same transactions again if I kill the foreign parity instance and then start it again:
1. Run both instances of parity
2. Run the bridge
3. Do a deposit and a withdraw
4. The bridge logs display the transactions
5. Kill the foreign parity instance
6. Wait a few seconds
7. Run the foreign parity instance again
8. The bridge logs display new transactions
Could you check it?
Will do.
Can you confirm whether the aforementioned transaction's block number gets persisted by the bridge?
@yrashk the database is updated with new blocks but the transactions are re-sent anyway. Here are the logs from the foreign parity instance before it is killed:

2018-03-10 09:02:53 0/25 peers 87 KiB chain 5 MiB db 0 bytes queue 448 bytes sync RPC: 0 conn, 1 req/s, 175 µs
2018-03-10 09:03:10 Transaction mined (hash e0ac09e000484637fb5af07649d90103edbe6dc48640caa2e81956ac11d88b36)
    |-------- deposit() invocation
2018-03-10 09:03:10 Imported #6501 fa4a…cf4e (1 txs, 0.08 Mgas, 0.99 ms, 0.77 KiB)
2018-03-10 09:03:20 Transaction mined (hash 7914bbbcc10c105823689628b9815c340eb1a923433b8e5c64ebd8664df31eeb)
    |-------- approveAndCall() invocation
2018-03-10 09:03:20 Imported #6502 f517…7239 (1 txs, 0.06 Mgas, 1.01 ms, 0.83 KiB)
2018-03-10 09:03:25 Transaction mined (hash 2c6324d205e276170013f8f4450a7379998b4b4fb4b06001d3b296257b84e103)
    |-------- submitSignature() invocation
2018-03-10 09:03:25 Imported #6503 50b8…91df (1 txs, 0.26 Mgas, 1.12 ms, 1.03 KiB)
2018-03-10 09:03:28 0/25 peers 98 KiB chain 5 MiB db 0 bytes queue 448 bytes sync RPC: 0 conn, 2 req/s, 186 µs

The database file contains the following after that:

$ cat ../erc20_db.toml
home_contract_address = "0x2ace2268ed7a96713e6cd18e9b2c2ef0c306c22c"
foreign_contract_address = "0x5e702ea5d81e14ba807fdd0ac923e811a12bfef1"
home_deploy = 6209
foreign_deploy = 6488
checked_deposit_relay = 6218
checked_withdraw_relay = 6503
checked_withdraw_confirm = 6503

As soon as the parity instance is killed and run again, it produces the following logs:

2018-03-10 09:04:29 Starting Parity/v1.9.2-unstable-0feb0bb-20180201/x86_64-linux-gnu/rustc1.20.0
2018-03-10 09:04:29 Keys path /home/koal/parity/keys/PoA_foreign
2018-03-10 09:04:29 DB path /home/koal/parity/PoA_foreign/chains/PoA_foreign/db/d87bc73279457e60
2018-03-10 09:04:29 Path to dapps /home/koal/parity/PoA_foreign/dapps
2018-03-10 09:04:29 State DB configuration: fast
2018-03-10 09:04:29 Operating mode: active
2018-03-10 09:04:29 Configured for PoA_foreign using AuthorityRound engine
2018-03-10 09:04:30 Public node URL: ***@***.***:30343
2018-03-10 09:04:35 Transaction mined (hash 2faba8376df4632da7b4a0ce4cb984b7e51cc90849bf94e8a56a635195253537)
    |-------- deposit() invocation
2018-03-10 09:04:35 Transaction mined (hash 00da53a569c639617e23854a2fb3e80745690c69230e07df92e7159f1fe1c958)
    |-------- submitSignature() invocation
2018-03-10 09:04:35 Imported #6504 783d…e97e (2 txs, 4.20 Mgas, 1.15 ms, 1.23 KiB)

and the database file contains:

$ cat ../erc20_db.toml
home_contract_address = "0x2ace2268ed7a96713e6cd18e9b2c2ef0c306c22c"
foreign_contract_address = "0x5e702ea5d81e14ba807fdd0ac923e811a12bfef1"
home_deploy = 6209
foreign_deploy = 6488
checked_deposit_relay = 6219
checked_withdraw_relay = 6504
checked_withdraw_confirm = 6504
You raise a very good point, I was just thinking about this last night. The underlying issue here is that you can't physically have an atomic transaction AND a record of it on the bridge's side.

There are perhaps multiple ways we can address this (but none of them, I believe, blocks this PR, as they are different issues).

1. The bridge can split preparation and sending of transactions. If the bridge reliably persists raw transactions before sending them out, and ensures no new transaction is sent out before the outbox is empty, this might be a reasonable way to ensure we don't repeat transactions (see the sketch below).
2. Ensuring idempotency of these transfers in some form on the contract side, but I'm afraid this might drive the storage costs into the "prohibitively expensive" territory.

Either way, I don't think this is the same issue as the original one, as it might happen in a number of other scenarios.
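A minimal sketch of what item 1 could look like, assuming a single-entry, file-backed outbox; the function send_with_outbox, the outbox path, the byte-slice transaction format and the send closure are all hypothetical, and a real implementation would need fsync, atomic renames and room for several pending transactions.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Persist a raw transaction before sending it, and refuse to produce a new one
// while a previously persisted transaction is still pending.
fn send_with_outbox<F>(outbox: &Path, raw_tx: &[u8], send: F) -> io::Result<()>
where
    F: Fn(&[u8]) -> io::Result<()>,
{
    if outbox.exists() {
        let pending = fs::read(outbox)?;
        send(&pending[..])?;       // resend the persisted transaction first
        fs::remove_file(outbox)?;  // clear the outbox once it went out
    }
    fs::write(outbox, raw_tx)?;    // persist before sending
    send(raw_tx)?;
    fs::remove_file(outbox)        // clear the outbox after a successful send
}

fn main() -> io::Result<()> {
    // Hypothetical no-op sender; a real one would submit the raw transaction
    // over RPC/IPC.
    send_with_outbox(Path::new("outbox.raw"), b"raw-tx-bytes", |_tx| Ok(()))
}
```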
@yrashk I cannot agree that we need to merge this PR without having a fix for the issue I described above. My thoughts are:
So, the question is whether it is possible to either update the internal state of the bridge properly, so that no new transactions appear when it loses connectivity to the parity instances (the fix I would prefer), or re-read the database when parity connectivity is re-established (I would consider this a quick workaround, and an issue for introducing a proper fix would need to be opened).
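For the "quick workaround" branch of that question, here is a minimal sketch of re-reading the on-disk database after connectivity is restored. The Database struct mirrors the erc20_db.toml fields shown earlier in the thread; the reload_database helper and the serde/toml dependencies (serde with the derive feature, plus the toml crate) are assumptions, not the bridge's actual API.

```rust
use serde::Deserialize;
use std::fs;
use std::io;
use std::path::Path;

// Field names mirror the erc20_db.toml contents shown above.
#[derive(Debug, Deserialize)]
struct Database {
    home_contract_address: String,
    foreign_contract_address: String,
    home_deploy: u64,
    foreign_deploy: u64,
    checked_deposit_relay: u64,
    checked_withdraw_relay: u64,
    checked_withdraw_confirm: u64,
}

// After reconnecting, reload the persisted database so the in-memory
// checked-block counters match what was last written to disk, instead of
// reusing a stale in-memory copy.
fn reload_database(path: &Path) -> io::Result<Database> {
    let raw = fs::read_to_string(path)?;
    toml::from_str(&raw).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

fn main() -> io::Result<()> {
    let db = reload_database(Path::new("erc20_db.toml"))?;
    println!("resuming from checked_deposit_relay = {}", db.checked_deposit_relay);
    Ok(())
}
```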
I thought about the scenario further and recalled that the bridge contracts currently have protection against double spends caused by handling the same events more than once. So I will merge the request and create a new issue to update the internal state of the bridge properly when parity connectivity is re-established.
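The protection mentioned here lives in the bridge contracts themselves (Solidity), which reject transfers they have already processed. Purely as an illustration of that dedup idea, and with hypothetical types rather than anything from the contracts or the bridge, a Rust sketch:

```rust
use std::collections::HashSet;

// Toy illustration: remember which transaction hashes were already handled so
// that replaying the same event a second time becomes a no-op.
struct Processed {
    seen: HashSet<[u8; 32]>,
}

impl Processed {
    fn new() -> Self {
        Processed { seen: HashSet::new() }
    }

    /// Returns true only the first time a given transaction hash is offered.
    fn should_relay(&mut self, tx_hash: [u8; 32]) -> bool {
        self.seen.insert(tx_hash)
    }
}

fn main() {
    let mut processed = Processed::new();
    let tx = [0u8; 32];
    assert!(processed.should_relay(tx));   // first occurrence: relay it
    assert!(!processed.should_relay(tx));  // replayed event: skip it
}
```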
On this note, I agree with you on this particular regression and I'm already working on resolving it. I have already found the source of it and will send a separate PR.
Just want to emphasize that other ways for this to go wrong still exist, so the fact that we will deal with this particular scenario doesn't mean it will always go well. There's still a gap between a transaction and the persistence event, however small it looks to us.
Another side note on this: if we relied on an external supervisor, as per my suggestion above, this accidental complexity wouldn't have happened. What happened in this case is an unnoticed subtlety in how the database gets cloned away from its deserialized value inside one of the auxiliary structures, which makes it harder to recover during an internal restart.
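A toy illustration of that subtlety, using hypothetical types rather than the bridge's actual structures: if an auxiliary component clones the deserialized database, progress recorded in the clone never reaches the copy that an internal restart would reuse.

```rust
// Hypothetical types for illustration only.
#[derive(Clone, Debug)]
struct Database {
    checked_deposit_relay: u64,
}

struct Relay {
    db: Database, // owns a clone, not a shared handle
}

impl Relay {
    fn new(db: &Database) -> Self {
        Relay { db: db.clone() }
    }
    fn process_block(&mut self, block: u64) {
        self.db.checked_deposit_relay = block; // only the clone advances
    }
}

fn main() {
    let original = Database { checked_deposit_relay: 6218 };
    let mut relay = Relay::new(&original);
    relay.process_block(6219);
    // An internal restart that rebuilds everything from `original` would still
    // see 6218, so already-relayed events get picked up again.
    assert_eq!(original.checked_deposit_relay, 6218);
    assert_eq!(relay.db.checked_deposit_relay, 6219);
}
```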
@akolotov I'm away from my computer for the weekend, so I don't have proper testing (or even editing) facilities. Would you mind checking yrashk@19241a9? It's a bit crude, but I think it should solve the issue you brought up. Please let me know if you'll have a chance to try it out.
I will only be able to check the changes late in the evening.
That's sooner than me.
I have compiled the bridge with the changes proposed above, and the issue with duplicated transactions is not reproduced. So we can conclude that yrashk@19241a9 contains the fix. In order to track the work on this issue, #27 was created.
Problem: bridge dies when Parity is stopped

Reproduced on the latest (3961987) POA master branch.

Steps to reproduce:
1. Run two Parity-based nodes responsible for the Home and Foreign chains.
2. Run the bridge: RUST_LOG=info bridge --config ... --database ...
3. Kill the parity process responsible for the Foreign chain.

Expected results:
The bridge handles the death of the Parity node gracefully: it warns about the connection loss, shuts down all operations (deposit_relay, withdraw_confirm and withdraw_relay) for a while, waits until the connection reappears, and then resumes all operations.

Actual results:
After killing the Parity process, the following appears in the terminal where the bridge is running:

WARN:<unknown>: Unexpected IO error: Error { repr: Os { code: 32, message: "Broken pipe" } }

No messages appear from withdraw_confirm and withdraw_relay. Then after some time (a few seconds to a few minutes) the following appears in the terminal and the bridge dies:

Request eth_blockNumber timed out

Solution: once the "Broken pipe" error is caught, attempt to reconnect repeatedly with a pause of 1 second between attempts. When other errors are caught, simply restart the bridge, as there is no indication that the connection has been severed.

Fixes #22

cc @akolotov