Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LND mis-implements gossip_queries #3728

Open
rustyrussell opened this issue Nov 15, 2019 · 3 comments · May be fixed by #3785

Comments

@rustyrussell
Copy link

@rustyrussell rustyrussell commented Nov 15, 2019

Background

c-lightning has started sending errors to LND nodes because of their reply_channel_range answers for large queries.

Tested against @bitconner's node.

Steps to reproduce

We send 01076fe28c0ab6f1b372c1a6a246ae63f74f931e8365e15a089c68d61900000000000007b2b400018402 which is:

QUERY_CHANNEL_RANGE:
chain_hash=000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
first_blocknum=504500
number_of_blocks=99330

LND replies with four messages (because of non-zlib):

REPLY_CHANNEL_RANGE:
chain_hash=000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
first_blocknum=504500
number_of_blocks=99330
complete=0
...
REPLY_CHANNEL_RANGE:
chain_hash=000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
first_blocknum=504500
number_of_blocks=99330
complete=0
...
REPLY_CHANNEL_RANGE:
chain_hash=000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
first_blocknum=504500
number_of_blocks=99330
complete=0
...
REPLY_CHANNEL_RANGE:
chain_hash=000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
first_blocknum=504500
number_of_blocks=99330
complete=1

These responses are incorrect.

Expected behaviour

There are two problems: first, the range of blocks is wrong, second, the complete flags is wrong.

From the spec:

  - MUST respond with one or more `reply_channel_range` whose combined range
	cover the requested `first_blocknum` to `first_blocknum` plus
	`number_of_blocks` minus one.
  - For each `reply_channel_range`:
    - MUST set with `chain_hash` equal to that of `query_channel_range`,
    - MUST encode a `short_channel_id` for every open channel it knows in blocks `first_blocknum` to `first_blocknum` plus `number_of_blocks` minus one.
    - MUST limit `number_of_blocks` to the maximum number of blocks whose
      results could fit in `encoded_short_ids`

You do not limit number_of_blocks to the maximum number of blocks which fit in the encoded_short_ids field, but instead, reuse the same value. The intention was that you return (non-overlapping, though that is not stated) first_blocknum + number_of_blocks ranges. Instead you return the same number over and over, which is wrong.

Secondly:

    - if does not maintain up-to-date channel information for `chain_hash`:
      - MUST set `complete` to 0.
    - otherwise:
      - SHOULD set `complete` to 1.

complete is NOT an end marker. It is used to indicate a lite node (or node bootstrapping) which doesn't have full gossip information.

Actual behaviour

Eclair and c-lightning now need to figure out how to detect and mitigate this for now, and make sure we test protocol compatibility properly in the future!

@Roasbeef

This comment has been minimized.

Copy link
Member

@Roasbeef Roasbeef commented Nov 20, 2019

Thanks for reporting this! We should have the fix for setting the block related fields properly in our next release (0.9). However the fix for our usage of complete will take longer to roll out, since our nodes today use that as a termination signal. Based on my current understanding, we only need to fix the block fields in order to achieve better compatibility amongst us all, since the complete flag as defined in the spec is more or less just advisory.

What does CL use the complete flag for today? Will it take that as a signal to keep querying other nodes until they find one w/ the flag set? Looking back, if this was actually how it was intended to be used (signal partial storage of the channel graph), then we should've added a global feature bit (in the node announcement), as otherwise you only find out I don't have all the data when I'm queried.

@araspitzu

This comment has been minimized.

Copy link

@araspitzu araspitzu commented Dec 2, 2019

I reproduced this in testnet with the latest eclair from master and LND latest release v0.8.1-beta. From the eclair side i sent a query_channel_range with first_blocknum=0 and number_of_blocks=2147483647, LND replied with 2 reply_channel_range:

reply_channel_range first_blocknum=0 number_of_blocks=2147483647 complete=0 <-- 8000 SCIDs
reply_channel_range first_blocknum=0 number_of_blocks=2147483647 complete=1 <-- 737  SCIDs

The issue seem to lay out pretty much as @rustyrussell describes it, with the main difference that eclair is actually more forgiving than c-lightning and it ends up being able to sync from LND, i performed a full sync from both the LND node and endurance and the resulting routing tables are almost identical, with the exception of the number channel announcements but this might be because of different heuristics for flappy channels.

Roasbeef added a commit to Roasbeef/lnd that referenced this issue Dec 3, 2019
In this commit we fix in a bug in `lnd` that could cause other
implementations which implement a strict version of the spec to
disconnect when trying to sync their channel graph using the gossip
query feature. Before this commit, we would embed the request to a
`QueryChannelRange` in the response, causing some clients to reject the
response as the `FirstBlockHeight` and `NumBlocks` field would be
identical for each chunk of the response.

In order to remedy this, we now properly set these two fields with each
returned chunk. Note that even after this commit, we keep our existing
behavior surrounding the `Complete` field as is. Otherwise, current
`lnd` clients which rely on this field (rather than the two
aforementioned fields) wouldn't be able to properly detect when a set of
responses to their query was "complete".

Partially fixes lightningnetwork#3728.
@Roasbeef

This comment has been minimized.

Copy link
Member

@Roasbeef Roasbeef commented Dec 3, 2019

Half the fix can be found here: #3785 (still needs proper interop testing)

This doesn't effect any lnd clients in the wild, but should help existing c-lightning clients in the wild. As for our usage of the Complete field as I mentioned above, we likely won't change it for a version or two as we depend on it currently (knowing when to process the chunks). Also AFAICT, no one actually does anything with the Complete field as its defined in the spec today?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.