
wip - catchup to last known block #31

Merged 3 commits into master on Aug 5, 2016

Conversation

@yusefnapora (Collaborator)

This updates the SimpleClient and receive_blockchain_into_indexer functions to use the new canonical_stream output format from mediachain/oldchain-client#88

I added a CLI flag so I could test catching up to a known block (--last-known-block=QmF00...), but I plan to remove that soon. It seems like we should be writing the last known block ref out somewhere so we can read it back in after a restart.

@autoencoder, where do you think that value should get stored? We could just spit it out to a file somewhere...
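
A minimal sketch of how that temporary flag might be wired up, assuming argparse is used for the CLI; only the option name comes from the comment above, everything else is illustrative:

```python
import argparse

parser = argparse.ArgumentParser(description='mediachain indexer client (sketch)')
# --last-known-block: ref of the last block already ingested; catchup stops
# once this block is reached while walking back through the chain.
parser.add_argument('--last-known-block', dest='last_known_block', default=None,
                    help='ref of the last block already processed')
args = parser.parse_args()

if args.last_known_block is not None:
    print('catching up until block %s is reached' % args.last_known_block)
```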

@parkan (Collaborator) commented Aug 4, 2016

Does the indexer have a rocksdb so far?

@parkan (Collaborator) commented Aug 4, 2016

This needs a corresponding requirements.txt change.

@yusefnapora (Collaborator, Author)

the testnet indexer has rocksdb installed, yes.

@parkan (Collaborator) commented Aug 4, 2016

should probably put it in there, then?

            started = True
        else:
            continue

-       yield ref, obj
+       yield obj_info
Collaborator

Presumably these returns are different now; are the call sites updated?

Collaborator (Author)

yeah, the get_artefacts and get_entities functions just pass through this return value unchanged, and only get_artefacts is used ATM. I updated receive_blockchain_into_indexer to unpack the dictionary instead of the tuple.

The only other place it's used is in the debug tail_blockchain method, which just prints without unpacking anything.

@autoencoder (Collaborator), Aug 5, 2016

Even better: since this is a public API, it'd be nice to have some kind of machine-interpretable versioning, and each API caller would assert against the version number each time the program runs, prior to actually using the API.

E.g. a major number increment indicates backward-compatibility-breaking changes, a minor number increment indicates behind-the-scenes enhancements, and each API caller asserts that the major number == the last major number for which someone manually reviewed the list of breaking changes.

This would help us avoid subtle bugs that automated tests may not catch and help keep all devs in sync.
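
A rough sketch of what that machine-checkable version assert could look like; CANONICAL_STREAM_VERSION and EXPECTED_MAJOR are hypothetical names, not part of the existing client:

```python
# Hypothetical version tuple exported by the client library.
CANONICAL_STREAM_VERSION = (1, 0)  # (major, minor)

# The last major version whose breaking changes this caller has reviewed.
EXPECTED_MAJOR = 1

def check_api_version(version=CANONICAL_STREAM_VERSION):
    major, minor = version
    # Major bump == backward-compatibility break: fail fast until a human
    # reviews the changes and bumps EXPECTED_MAJOR.
    assert major == EXPECTED_MAJOR, (
        'canonical_stream major version is %d, expected %d; review breaking changes'
        % (major, EXPECTED_MAJOR))
    # Minor bumps are behind-the-scenes enhancements and are accepted as-is.

check_api_version()  # run once at startup, before using the API
```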

@yusefnapora (Collaborator, Author)

Yeah, I haven't so far since we need to coordinate with @autoencoder; it's only needed for the blockchain stream, and I didn't want to break any other deployments he's got :) But it would be cool to have it as a hard requirement, since the blockchain catchup will be memory-hungry and transient without rocksdb installed.

@parkan (Collaborator) commented Aug 4, 2016

OK, deferring to @autoencoder then; it would also be OK to drop it into ~/.mediachain or whatever.

@yusefnapora (Collaborator, Author)

yeah, the rocksdb blockchain cache gets stored in ~/.mediachain anyway, so we could just write the block ref to a file there.
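
A minimal sketch of that approach, assuming the block ref is just written to a small text file under ~/.mediachain (the filename is illustrative):

```python
import os

STATE_DIR = os.path.expanduser('~/.mediachain')
BLOCK_REF_FILE = os.path.join(STATE_DIR, 'last_known_block')  # illustrative name

def save_last_known_block(ref):
    # Persist the ref so catchup can resume from here after a restart.
    if not os.path.exists(STATE_DIR):
        os.makedirs(STATE_DIR)
    with open(BLOCK_REF_FILE, 'w') as f:
        f.write(ref)

def load_last_known_block():
    try:
        with open(BLOCK_REF_FILE) as f:
            return f.read().strip()
    except IOError:
        return None  # no saved state yet; catch up from the start of the chain
```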

@parkan (Collaborator) commented Aug 4, 2016

I'm down to just drop a file, then. Do we need any locking semantics? I guess not, given that only one thread would write, though now that I think about it we probably need a pidfile/lock for the whole thing.

@yusefnapora (Collaborator, Author)

Yeah, at the moment we shouldn't need locking, since we'll just read once before we start tailing the blockchain and only write from one thread. But it might get more complicated as time goes on if we end up doing some kind of parallelization, etc.

@parkan (Collaborator) commented Aug 4, 2016

Yeah, for now we should probably drop a pidfile and refuse to run if another process is present.
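
A rough sketch of that pidfile check, assuming a single indexer process per machine; the path and error handling are illustrative:

```python
import os
import sys

PIDFILE = os.path.expanduser('~/.mediachain/indexer.pid')  # illustrative path

def acquire_pidfile():
    # Refuse to start if a pidfile is already present.
    if os.path.exists(PIDFILE):
        with open(PIDFILE) as f:
            existing_pid = f.read().strip()
        sys.exit('another indexer appears to be running (pid %s); '
                 'remove %s if it is stale' % (existing_pid, PIDFILE))
    with open(PIDFILE, 'w') as f:
        f.write(str(os.getpid()))

def release_pidfile():
    # Call on clean shutdown.
    if os.path.exists(PIDFILE):
        os.remove(PIDFILE)
```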

@autoencoder (Collaborator) commented Aug 4, 2016

For the current block count, we may end up sticking that counter (edit: probably more of a transaction ID) in ES, since the contents of ES would be the reason for the Indexer to be tracking this number.

Sounding good. Will take a closer look / merge in the AM.

@yusefnapora (Collaborator, Author)

@autoencoder the thing is, we can't easily do a simple block-height counter without changing the transactor RPC API to include that. At the moment we just have a ref to the block, which isn't ordered at all.

We could keep our own count on the client, but that would get tricky with the partial catchup, etc. Adding the block height or some kind of sequence number to the API probably makes sense.

@yusefnapora (Collaborator, Author)

Actually, I guess we could pull the index of the first entry in the block and use that as a sequence number... will think about that some more.

@autoencoder (Collaborator) commented Aug 4, 2016

Ideally it'd be an ID that could later be used to identify exactly what position of which fork the chain was on... and then, if the API caller later tries to resume from a position on an abandoned fork, the client API would replay the necessary inserts / updates / deletes into the Indexer to get it back in sync with the proper fork. Something like that.

Maybe not needed yet.

@vyzo (Collaborator) commented Aug 5, 2016

I echo @parkan on having a lock/pidfile for the local block cache.
BTW, is rocksdb safe for concurrent processes?

@autoencoder (Collaborator)

OK, took a look. Looks good. Noting some of the parts that are still WIP:

  • Persistent state recording strategy. BTW, why not just record this in RocksDB? Looks like we'll have at least two "current block position" IDs recorded: one for the Client cache, and one or more for Indexer instances.
  • Multi-process parallelism - am I right that the proposed strategy is that the Client reader does single-process reconciliation / downloading / caching of the blockchain, and then multiple reader processes will be able to access that cache? @yusef, it looks like RocksDB does support multiple reader processes, but you have to explicitly open the other readers in read-only mode (see the sketch after this list): Does RocksDB support multi-process read access? facebook/rocksdb#908
  • We should do a refreshed, top-down look at this API as a whole, from the API user's perspective in typical use-cases. Perhaps next week.
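
A small sketch of the single-writer / multiple-reader pattern from the second bullet, assuming the python rocksdb bindings; the cache path is illustrative:

```python
import os
import rocksdb

CACHE_PATH = os.path.expanduser('~/.mediachain/block_cache')  # illustrative path

# The cache-maintainer process opens the block cache normally (read-write):
#   db = rocksdb.DB(CACHE_PATH, rocksdb.Options(create_if_missing=True))

# Additional catchup/reader processes open the same DB in read-only mode,
# which is the multi-process pattern described in facebook/rocksdb#908:
reader = rocksdb.DB(CACHE_PATH, rocksdb.Options(), read_only=True)

block_bytes = reader.get(b'some-block-ref')  # None if the block isn't cached yet
```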

@autoencoder merged commit 444f9fd into master on Aug 5, 2016
@yusefnapora (Collaborator, Author)

yeah, it probably does make more sense for us to track the current block in rocksdb in the client code. I can set that up and see about exposing the index so we can use it as our "block height".

We could do multi-process catchup by opening the block cache in read-only mode; that's a good idea. It would need a little bit of extension to the current BlockCache API. Right now the BlockCache is just a read-through cache that doesn't keep track of block ordering, etc., so there's no way to "seek" to a particular block; instead we're always walking back through the chain from the current block. But I think if we track the block index numbers we should be able to spawn multiple processes that each take a range of blocks.

We'll also need to track the particular blockchain that the blocks are part of; it would be nice if each chain had a unique id or genesis block ref or something that we could identify the chain with. Right now the block cache will store blocks from any chain, since it's just a K/V store. But if we want to keep track of the structure / sequence of blocks, we also need to differentiate between different chains and either store them in separate rocksdb instances or prefix the keys, etc.
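
An illustrative sketch of the key-prefixing option mentioned above: namespace block-cache keys by a per-chain identifier (e.g. a genesis block ref) so a single rocksdb instance can hold blocks from more than one chain. All names here are hypothetical:

```python
def chain_key(chain_id, block_ref):
    # e.g. chain_key('QmGenesis...', 'QmBlock...') -> b'QmGenesis.../QmBlock...'
    return ('%s/%s' % (chain_id, block_ref)).encode('utf-8')

def put_block(db, chain_id, block_ref, block_bytes):
    db.put(chain_key(chain_id, block_ref), block_bytes)

def get_block(db, chain_id, block_ref):
    # Returns None if this chain has no cached block with that ref.
    return db.get(chain_key(chain_id, block_ref))
```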

@yusefnapora deleted the yn-last-known-block branch on August 8, 2016 at 18:34