Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get locality plots #257

Open
adiabat opened this issue Feb 22, 2021 · 18 comments
Open

Get locality plots #257

adiabat opened this issue Feb 22, 2021 · 18 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@adiabat
Copy link
Contributor

adiabat commented Feb 22, 2021

On the 2021-02-22 call, we discussed locality and realized... we don't actually know what kind of locality we're getting! Are all the leaves getting proofs all clustered at the right, with occasional leaves on the left? We hope so, but don't have much data there.

We have the data though; we just need to visualize / analyze it. A good place to get the data is in accumulator.BatchProof.Targets, either on the server end when it's being built, or the client end when it's being consumed.

Or actually, just reading it off of disk should also be easy. It's just a list of uint64s for every utxo consumed in a block. (Could be uint32 but that's another issue).

We should start with a histogram, or 3D histogram. For ever block, just make a histogram of accumulator.BatchProof.Targets. Since the values range from 0 to tens of millions, and there's only a few thousand entries, visualizing it directly with 1 horizontal pixel per position, and 1 vertical pixel per number of targets at that position would be too sparse. So we can group ranges together, such that 1 horizontal pixel represents 100 position values, and then pixels can pile up on top of each other in the bin. There's probably some matplotlib stuff for this, and also to make a 3D plot of these histograms over time each block gets one row on the z-axis)

This would give a good starting point for understanding the locality we're getting. Are there big hills on the right, and a few little dots on the left? Is it scattered randomly and uniformly? How does it change over time? This could lead to better ideas about measuring the locality and proof sizes, and hopefully ideas about how to improve locality to get faster performance and smaller proofs.

@adiabat adiabat added enhancement New feature or request good first issue Good for newcomers labels Feb 22, 2021
@kcalvinalvin
Copy link
Member

I'm currently generating the csv for all the targets. Let me know if anyone needs the csv to plot it

@kcalvinalvin
Copy link
Member

kcalvinalvin commented Feb 24, 2021

Initial plotting. I'll get a better computer to do the rest of the blocks but I think this shows the picture pretty well. Each dot represents a leaf access.

utreexo-leaf-accesses-150000

Remap was done to row 27 in NewForest() to account for remaps (switching up leaf positions).

code to do the plotting

  import matplotlib.pyplot as plt                                                                                                                               
  import numpy as np                                                                                                                                            
  import csv                                                                                                                                                    
                                                                                                                                                                
  localityFile = open("./locality.csv")                                                                                                                         
  data = csv.reader(localityFile)                                                                                                                               
                                                                                                                                                                
  x = []                                                                                                                                                        
  y = []                                                                                                                                                        
  for idx, row in enumerate(data):                                                                                                                              
      if int(row[0]) > 150000:                                                                                                                                  
          break                                                                                                                                                 
      if idx % 1000 == 0:                                                                                                                                       
          print("on", idx)                                                                                                                                      
      for element in row[1:len(row)]:                                                                                                                           
          if element != '':                                                                                                                                     
              #if int(row[0]) > 80000:                                                                                                                          
              #    if int(element) < 20000:                                                                                                                     
              #        print("block", row[0])                                                                                                                   
              x.append(int(row[0]))                                                                                                                             
              y.append(int(element))                                                                                                                            
                                                                                                                                                                
  print("plotting...")                                                                                                                                          
  plt.title("Utreexo leaf accesses. (mainnet to block #150,000)")                                                                                               
  plt.xlabel("block height")                                                                                                                                    
  plt.ylabel("leaf positions")                                                                                                                                  
                                                                                                                                                                
  plt.scatter(x, y, 0.01)                                                                                                                                       
  plt.show()

You can check this really quickly by looping through targets and printing if a target is less than 20,000 or something. Ex: block 149796 position accesses that are to the right of the forest: 3645,18770,39935,22413,5460,24055,15002,19616,43166,16032,18813,7304,41912,16383,25441,24913

@adiabat
Copy link
Contributor Author

adiabat commented Feb 24, 2021

Very cool. The things that jump out at me:
It's a bit too concentrated to really get a feel for how much is on the high-postion edge (in this case the top of the curve) and how much of the density is down below that curve. It's definitely denser at the top than bottom but since the blue squares overlap a lot we can't see how much. Maybe bin the dots and use different colors or something?

Also then there's these weird empty chunks, where there are, I guess, 100,000 leaves that are a total dead-zone; the whole range gets created (and created very quickly, as you can tell by the fact that the curve "jumps" upwards in the region where the gap is created) and then never accessed. What are these UTXOs? Spam I guess... is this mainnet of testnet? Maybe we could look at better unspendable detection if this is just spam.

Then again, clustered together spam is better than spread apart spam. Also interesting is that in the ~20K or so block after those regions are created, nothing diffuses into those empty regions, and the regions themselves don't move. With a high-up enough swap they might eventually.

@kcalvinalvin
Copy link
Member

kcalvinalvin commented Mar 1, 2021

-- to be updated with blocks closer to the tip --

utreexo_0-100,000
utreexo_100,001-200,000
utreexo_200,001-300,000
utreexo_300,001-400,000

  import matplotlib.pyplot as plt
  import numpy as np
  import csv

  localityFile = open("./locality.csv")
  data = csv.reader(localityFile)

  x = []
  y = []
  for idx, row in enumerate(data):
      #if int(row[0]) > 100000:
      #    break
      if int(row[0]) < 400000:
          if idx % 1000 == 0:
              print("skipping", idx)
          continue
      if int(row[0]) > 500000:
          break
      if idx % 1000 == 0:
          print("on", idx)
      for element in row[1:len(row)]:
          if element != '':
              #if int(row[0]) > 140000:
              #    if int(element) < 20000:
              #        print("block", row[0])
              x.append(int(row[0]))
              y.append(int(element))

  print("plotting...")
  plt.title("Utreexo leaf accesses. (mainnet to block #150,000)")
  plt.xlabel("block height")
  plt.ylabel("leaf positions")
  #plt.scatter(x, y, 0.0001)
  plt.hist2d(x, y, bins=(1000, 1000), cmap=plt.cm.jet) # histogram
  plt.show()

@kcalvinalvin
Copy link
Member

I think these really came out well. Think I can stop plotting now :)

utreexo-hexhist-0-1
utreexo-hexhist-1-2
utreexo-hexhist-2-3
utreexo-hexhist-3-4
utreexo-hexhist-4-5
utreexo-hexhist-5-6
utreexo-hexhist-6-tip

code:

import matplotlib.pyplot as plt
import csv

localityFile = open("./locality.csv")
data = csv.reader(localityFile)

x = []
y = []
for idx, row in enumerate(data):
    #if int(row[0]) > 150000:
    #    break
    if int(row[0]) < 600000:
        if idx % 1000 == 0:
            print("skipping", idx)
        continue
    #if int(row[0]) > 300000:
    #    break
    if idx % 1000 == 0:
        print("on", idx)
    for element in row[1:len(row)]:
        if element != '':
            #if int(row[0]) > 140000:
            #    if int(element) < 20000:
            #        print("block", row[0])
            y.append(int(row[0]))
            x.append(int(element))

print("plotting...")
plt.title("Utreexo leaf accesses. (mainnet to block 600,001-668,900)")
plt.ylabel("block height")
plt.xlabel("leaf positions")


plt.gca().invert_yaxis()
plt.hexbin(x, y, gridsize=30, cmap='hot')
cb = plt.colorbar()
cb.set_label('leaf accesses')
print("plotted")
plt.show()

@kcalvinalvin
Copy link
Member

kcalvinalvin commented Mar 4, 2021

ok I lied. But these really should be it.

utreexo-hist-log-0-1
utreexo-hist-log-1-2
utreexo-hist-log-2-3
utreexo-hist-log-3-4
utreexo-hist-log-4-5
utreexo-hist-log-5-6
utreexo-hist-log-6-tip
utreexo-hist-log-1-tip

@kcalvinalvin
Copy link
Member

utreexo-list-line-5-6

@Shymaa-Arafat
Copy link

Have u read/ viewed the presentation of the paper "Merkle Trees Optimized for Stateless Clients in Bitcoin"?
https://youtu.be/HEKtDILPeaI
PDF version
https://ia.cr/2021/340
They're examining closely the effect of Locality of reference using a probabilistic model; they compared their work to Utreexo 2019 paper and saying that keeping the initial order of UTXOs in a "Trie" (not Tree) gives a better performance due to such principle ( they don't have to reorg after insert/delete operations)
.
I have been thinking could be using VEB Trees a half way between Tree/Trie?
and I don't think u 'll have to reorg at all this way?

@kcalvinalvin
Copy link
Member

Have u read/ viewed the presentation of the paper "Merkle Trees Optimized for Stateless Clients in Bitcoin"?

We’re aware. You may want to read our discussion on a different accumulator design
#249

@Shymaa-Arafat
Copy link

Thanks, I added a comment there.
However, another thought from here do u think is it possible to assign a time duration/freq of access weight to each TXO?
either defined by owner, or thru these curves?

@kcalvinalvin
Copy link
Member

However, another thought from here do u think is it possible to assign a time duration/freq of access weight to each TXO?

Not quite sure what you mean by time duration/freq of access weight to each TXO. Each UTXO (Unspent Transaction Output), by definition, can only be accessed once so there wouldn't be any freq of access. The charts you see currently are representing time to live or time duration, as you put it.

either defined by owner, or thru these curves?

Not sure what you mean by thru these curves so I can't quite answer that. defined by owner is doable if you define "owner" as an address. Since there is address reuse, one could keep track of that as well as how often each "owner" spends their coins after receiving it. This information wouldn't really be of help to what we were trying to find out but since I copied the exact code I used to find this, one could make a new chart with this information.

@Shymaa-Arafat
Copy link

I meant freq of access of each Merkle Tree leaf (how frequent its proof is requested), I was thinking would it be more efficient to make the Merkle Tree weighted?
Could this be predicted by the system(by a formula derived from these curves, u started by asking which leaves r getting proved most of the time) or better by the creator of the TXO?
If so would be efficient for short living leaves to have shorter paths to fasten deletion with less swaps?Or for long livings to be the shorter for faster proofs (hash validation)?
.
I think we already know some durations r frequent like 6 and 100

@kcalvinalvin
Copy link
Member

If so would be efficient for short living leaves to have shorter paths to fasten deletion with less swaps?Or for long livings to be the shorter for faster proofs (hash validation)?

In utreexo, the shortest trees and the newest leaves are on the rightmost side. In Bitcoin, there is a spending pattern in which the newest utxos are spent the quickest. Therefore, newest leaves are the shortest lived.

With this in mind, we know that we already have the shortest path to prove a leaf. Since the rightmost trees are the shortest and the shortest lived leaves are at the rightmost trees, we already have this efficiency. There’s no need to add additional complexity which will hurt readability and speed of the code.

With a swapless design, the locality (newest leaves being on the rightmost side) will improve even more.

@Shymaa-Arafat
Copy link

Shymaa-Arafat commented Apr 28, 2021

In Bitcoin, there is a spending pattern in which the newest utxos are spent the quickest.

-I thought the whole idea from these plots is that this statement is not quite accurate and needs more investigation?
In Bolton Bailey paper they assume a probabilistic model where a TXO duration follows a zeta distribution and most likely lasts for 2 blocks, while the analysis from Utreexo 2019 paper (or in the presentation if not written) mentions 3 types of TXOs those likely live for 1-2 blocks, those who stays for 6 (to prevent double spending), those who live for 100 (miners have to wait 100 blocks)

-So is it useful to distinguish bet these kinds?
can we predict with high probability the expected life time of each TXO and put it in a different cluster based on that???
-These the weights I meant from the beginning, thru this conversation I started to think why not like say 3 kinds of trees (a forest for each cluster if the original design is still the same) instead of a weighted tree?
.
(Ps.
I don't want to be pursuing so if u think the discussion is not fruitful/useful u may not reply)

@adiabat
Copy link
Contributor Author

adiabat commented Apr 29, 2021

@Shymaa-Arafat
one thing that may not be clear, based on your mention of "freq of access of each Merkle Tree leaf"
Because this is a UTXO set, unlike the account model in ethereum, every UTXO will be accessed only once. Once the UTXO is read, it is deleted. So we don't have different frequency of access for each leaf.

What we do know, in the case of IBD, is when each leaf will be accessed, since the whole set of transactions is known ahead of time. Clustering many leaves together so that they're all simultaneously accessed at the same time would reduce proof sizes, and that's what these plots are trying to look at.

So far we're not looking at "predicting" TXO lifespan, as for IBD we already know them with certainty.

@Shymaa-Arafat
Copy link

unlike the account model in ethereum, every UTXO will be accessed only once. Once the UTXO is read, it is deleted.

Maybe I do have a misunderstanding here, u mean if a TXO has value of say 2 btc then 1btc is spent/transformed/exchanged/..., this one is dead and u create another TXO to hold the remaining value????
-Even though, the freq of access could still relate to its life duration as it could be needed(node accessed to fetch hash value) as part of other proofs.

So far we're not looking at "predicting" TXO lifespan, as for IBD we already know them with certainty.

It is just an idea of using these past values in the IBD to predict the future, but maybe it is not best suited here as it address the structure of the original forest not the cashing mechanism

@adiabat
Copy link
Contributor Author

adiabat commented Apr 30, 2021

Maybe I do have a misunderstanding here, u mean if a TXO has value of say 2 btc then 1btc is spent/transformed/exchanged/..., this one is dead and u create another TXO to hold the remaining value????

Yes, that's how bitcoin and UTXOs work, entries are created, then read and deleted. Nothing gets modified or read without being deleted (except maybe for mempool transactions that don't make it into a block).

There might be ways to predict TXO durations based on heuristics but we're not doing that yet. Duration of the input TXOs likely does correlate with duration of output TXOs so that could help give hints on what to cache, but so far we're focusing more on IBD where we have all the data for certain and we can likely get a bigger improvement.

@Shymaa-Arafat
Copy link

Yes, that's how bitcoin and UTXOs work, entries are created, then read and deleted.

Although I'm still not sure this the best suited issue for this discussion, but a first thought idea would be to replace the old TXO that became input with the new one; ie no swaps just modify the corresponding hashes in the proof
-If one old TXO led to more than one make the old leaf node a root of a corresponding subtree
-Only if n1 inputs created n2 outputs where n1>n2 u will have to choose n1-n2 leaves to delete according to some criteria/heuristics?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

3 participants