
Database format does not allow pruning #82

Closed
ecdsa opened this issue Dec 18, 2016 · 17 comments

@ecdsa
Contributor

ecdsa commented Dec 18, 2016

History pruning is needed in order to be able to efficiently request and spend funds on addresses with long histories.

A pruned history does not contain the full set of transactions touching an address. It contains all the transactions that have UTXOs, but only a limited number of transactions with STXOs (spent transaction outputs). The transactions that spend pruned outputs are removed from the history as well, so pruning one spent output removes two items from the list.

The current database format stores histories in the exact form required by the get_history RPC: a list of (tx_num, height) pairs. This means the information required for pruning is lost: the server does not know which outgoing input spends which incoming output. In order to be able to prune histories, the server must retain this information.

electrum-server used to store spent histories as (txout, txin) pairs, where txin spends txout. This of course is likely to impact performance, but I think not being able to serve pruned histories is a regression. There might be ways of doing this more efficiently than how it is done in electrum-server.
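The (txout, txin) pairing described above can be sketched as follows. This is an illustrative model only, not electrum-server's actual code; the function and parameter names (`prune_history`, `max_stxo_pairs`) are hypothetical:

```python
# Hypothetical sketch: a history stored as (funding_tx, spending_tx) pairs
# plus the set of txs that still hold unspent outputs. Pruning drops the
# oldest spent pairs; dropping one pair removes two items from the flat
# history a client would otherwise receive.

def prune_history(pairs, utxos, max_stxo_pairs):
    """Keep every tx with an unspent output, but at most
    `max_stxo_pairs` of the oldest-spent (funding, spending) pairs.

    pairs : list of (funding_tx, spending_tx) tuples, oldest first
    utxos : set of tx numbers that still have unspent outputs
    """
    kept = pairs[-max_stxo_pairs:] if max_stxo_pairs else []
    # The flat history is the surviving pairs plus the UTXO-bearing txs.
    history = sorted(utxos | {tx for pair in kept for tx in pair})
    return kept, history

pairs = [(1, 2), (3, 5), (4, 6)]   # tx numbers: output spent by input
utxos = {7, 8}                     # txs with still-unspent outputs
kept, history = prune_history(pairs, utxos, max_stxo_pairs=1)
print(kept)      # [(4, 6)]
print(history)   # [4, 6, 7, 8]
```

Note how txs 1, 2, 3 and 5 all disappear from the served history once their pairs are pruned, while the UTXO-bearing txs 7 and 8 always survive.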

@kyuupichan
Owner

This is what I understood. I believe it's simple with the info currently in the DB, but I might be missing a detail. Can you perhaps give an example of an address and an example of it being pruned by e-s?

@kyuupichan
Owner

I understand the goal here now.

@kyuupichan
Owner

kyuupichan commented Dec 18, 2016

@ecdsa incidentally, I think this is perhaps not a good approach. It pushes expensive processing onto the server, which is dangerous; it belongs on the client. If someone requests the history of an address with a long history and can't handle it because they're on mobile, the answer perhaps should be: don't do that, then.
We can serve large histories smoothly with some new protocol based on height. I'm not sure we should be catering to people who want to download abbreviated histories of their overused addresses and push the responsibility for the filtering onto the server. The right outcome is for them to bear their own expense. So I'm not very sympathetic to this use case at all.

@ecdsa
Contributor Author

ecdsa commented Jan 8, 2017

Following our discussion on IRC: I think processing in electrum-server is expensive because spent histories are sorted by the height of the incoming transaction. If we change the chronology and use the height of the outgoing (spending) transaction instead, then adding new transaction pairs becomes an append-only operation. I believe such a database would be no more expensive than what ElectrumX is currently doing. It would actually be lighter than the current database, because it would remove the redundancy between the UTXO set and histories.
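The append-only property being claimed can be sketched like this (an assumed data model, not ElectrumX code; `record_spend` and `spent_log` are illustrative names):

```python
# Sketch: spent-history entries keyed by the height of the *spending*
# transaction. Block heights only increase as the chain grows, so new
# entries always land at the tip and nothing earlier is ever rewritten.

spent_log = []  # (spend_height, funding_tx, spending_tx), append-only

def record_spend(spend_height, funding_tx, spending_tx):
    # The log stays sorted by spend_height without ever re-sorting,
    # because each new block's spends have height >= all previous ones.
    assert not spent_log or spend_height >= spent_log[-1][0]
    spent_log.append((spend_height, funding_tx, spending_tx))

record_spend(100, funding_tx=1, spending_tx=2)
record_spend(105, funding_tx=3, spending_tx=4)
record_spend(105, funding_tx=5, spending_tx=6)

# Serving "the most recent N spends" is then a simple tail slice:
print(spent_log[-2:])  # [(105, 3, 4), (105, 5, 6)]
```

By contrast, ordering by the height of the *funding* transaction would force inserts into the middle of the log whenever an old output is spent, which is the expensive case being avoided.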

@ecdsa
Contributor Author

ecdsa commented Jan 8, 2017

Also, a new protocol that serves histories incrementally should be designed around pruning, because a pruned history has the same set of UTXOs as the final result. This means the wallet can be used before the whole history is downloaded.

@kyuupichan
Owner

It would (asymptotically) double the history size because you would be storing tx pairs, no?

@kyuupichan
Owner

In fact it can be worse: consider 3 receive txs with 3 payments each to the address, and 3 spend txs each spending one output from each receive tx. At present this is 6 txs in the history; storing pairs, it would be 9 pairs, for 3x the space cost. I realize this is not common, but there do exist many large txs in the DB with large numbers of payments to the same address in one tx.

@kyuupichan
Owner

There may be room to compress the pair data though. Any changes here should be post 1.0 I think.

@ecdsa
Contributor Author

ecdsa commented Jan 8, 2017

No, storing pairs does not by itself double the size:
[(a,b),(c,d)] is the same size as [a,b,c,d]

note: it would indeed increase the size in your second example.
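A toy count makes both size claims concrete (illustrative only; tx names are placeholders):

```python
# Case 1: each receive tx is spent by exactly one spend tx.
# The flat history and the pair representation reference the same
# number of transactions, so there is no doubling.
flat  = ["a", "b", "c", "d"]        # 4 history entries
pairs = [("a", "b"), ("c", "d")]    # 2 pairs = 4 tx references
assert sum(len(p) for p in pairs) == len(flat)

# Case 2 (the pathological one): 3 receive txs paying the address
# 3 times each, spent by 3 txs that each take one output from every
# receive tx. The flat history has 6 entries, but every
# (receive, spend) combination becomes its own pair: 9 pairs.
receives = ["r1", "r2", "r3"]
spends   = ["s1", "s2", "s3"]
history  = receives + spends                           # 6 entries
pair_db  = [(r, s) for s in spends for r in receives]  # 9 pairs
print(len(history), len(pair_db))  # 6 9
```

So the blow-up only appears when a single spending tx consumes outputs from many distinct receiving txs (or vice versa); the common one-in/one-out case costs the same either way.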

@kyuupichan
Owner

True, so perhaps this might not be so bad.

How do you see the protocol regarding incremental history? Starting from the most recent spend, and working backwards? If we serve history in pairs, the same receiving tx could be sent multiple times that way, which may not be an issue.

If we just serve tx hashes and heights like now, it could be expensive to sort it properly.

@ecdsa
Contributor Author

ecdsa commented Jan 8, 2017

If we change the protocol, we do not need to sort histories before we serve them. The reason the server sorts histories is to compute the same hash as the client; this hash was defined before pruning was introduced in electrum-server, and has not been updated since.

In a new protocol, I would like to have two hashes per address: one for the UTXO set, one for the spent history. Each of them would be computed as the merkle root of a linear tree. That way, the hash does not depend on the length of the history that is served (or on the number of UTXOs served, if we make that part incremental too).
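One way to read "merkle root of a linear tree" is a running chained hash, where each item is folded into the previous root; this is a hedged sketch of that reading, not a defined protocol (`chain_root` is a hypothetical name):

```python
# Sketch: a linear (chained) root. Appending item n+1 never changes the
# intermediate root after item n, so a server can store a checkpoint and
# extend the history incrementally without recomputing from scratch.
import hashlib

def chain_root(items, prev_root=b"\x00" * 32):
    for item in items:
        prev_root = hashlib.sha256(prev_root + item).digest()
    return prev_root

history = [b"tx1", b"tx2", b"tx3"]
full = chain_root(history)

# From a stored intermediate root after the first two items, folding in
# only the last item reaches the same final root:
partial = chain_root(history[2:], prev_root=chain_root(history[:2]))
assert partial == full
```

The useful property is that a server holding only a suffix of the history, plus the checkpoint root that precedes it, can still produce the same hash as a server holding everything.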

@ecdsa
Contributor Author

ecdsa commented Jan 8, 2017

note: the merkle root hash also allows a server to prune its database if it wants to, and still serve the same hash as other servers.

@ecdsa
Contributor Author

ecdsa commented Jan 8, 2017

note: this new protocol idea would actually require a lot more space, because the database would need to store all these hashes (unless they are computed when the history is served, but I guess that is what we want to avoid).

@ecdsa
Contributor Author

ecdsa commented Jan 8, 2017

note 2: to mitigate that, we could store one hash per chunk of 100 items.
In that case the server would serve history ranges that are multiples of 100.
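The chunked-checkpoint idea can be sketched like this (assumed design; `checkpoint_roots` is an illustrative name, and the chained-hash construction is one possible choice):

```python
# Sketch: store one checkpoint hash per full chunk of 100 history items.
# The server then only serves history ranges aligned to 100-item
# boundaries, so every served range ends at a stored checkpoint.
import hashlib

CHUNK = 100

def checkpoint_roots(items):
    roots, running = [], b"\x00" * 32
    for i, item in enumerate(items, 1):
        running = hashlib.sha256(running + item).digest()
        if i % CHUNK == 0:
            roots.append(running)  # one stored hash per full chunk
    return roots

items = [b"%d" % i for i in range(250)]
roots = checkpoint_roots(items)
print(len(roots))  # 2 checkpoints, after items 100 and 200
```

This trades a 100x reduction in stored hashes for some granularity: the tail of the history past the last full chunk has no checkpoint and must be hashed on the fly.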

@kyuupichan
Owner

I think computing the merkle tree when serving is fine; we already do that for each tx.

@kyuupichan kyuupichan self-assigned this Jan 9, 2017
@kyuupichan kyuupichan added this to the 1.1 milestone Jan 9, 2017
@erasmospunk
Contributor

erasmospunk commented Feb 24, 2017

@ecdsa I think that if we allow querying by block height ranges, it would reduce bandwidth usage, which is what matters most for mobile users. It is fine if the server never supports pruning, as the historical data is useful in many cases.

@kyuupichan when DB.get_history is executed, does self.hist_db.iterator(prefix=hashX) return the transactions in order of height, or in arbitrary order?

@kyuupichan kyuupichan removed this from the 1.1 milestone Mar 30, 2017
@kyuupichan
Owner

@ecdsa I don't think I want to implement pruning. I suspect the vast majority of transactions and DB space is occupied by addresses with minimal re-use, and that this will only become more true going forwards.

Each little-used address is more expensive for the server: it requires a key and small value in the history DB, whereas a heavily-used address requires only a handful of keys and uses many larger values.

I think the reason electrum-server implemented pruning was that it stored all the history in a single value, and the read-append-write cycle was inefficient. ElectrumX doesn't have this issue.

Also I have plans in the not-too-distant future to improve ElectrumX's history handling by fixing #348 and #185.

So I see this as ultimately a non-issue: the cost of supporting big histories will be small compared to the cost of supporting histories for all the little-used addresses. Please re-open if you disagree.

3 participants