[CODE] Basic Inferencing API on hivemind.Server aka rpc_inference #3

Closed · justheuristic opened this issue May 31, 2022 · 2 comments

justheuristic (Collaborator):

Why: talk to a 176B model run on hundreds of small devices.

Implement an extended hivemind.Server that has forward/backward as usual, plus an additional bidirectional streaming RPC named forward_incremental (stream <-> stream).
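For concreteness, the step 1-2 handshake in the protocol below could be carried by messages with roughly these fields. This is only a sketch with illustrative names, not the final protobuf schema:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class InferenceRequest:
    """Step 1: what the client asks for when opening a session (illustrative names)."""
    requested_layers: Sequence[str]   # which of this server's layers the client wants to run
    max_sequence_length: int          # requested max sequence length
    bid: Optional[float] = None       # [optional: bid?]


@dataclass
class InferenceInfo:
    """Step 2: the server's reply (illustrative names)."""
    accepted: bool                    # if True, server awaits the first request for T=10 seconds
    queue_length: float = 0.0         # 0 if accepted right now, N if N other nodes must finish first
    throughput: float = 0.0           # server's estimated computation time, including time in queue
```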

Here's the protocol for forward_incremental:

  1. Client sends a request containing:
     • requested layers
     • requested max sequence length
     • [optional: bid?]
  2. Server responds with an info protobuf that contains:
     • bool accepted: if True, the server decides to let the client run inference and will await the first request for T=10 seconds.
     • [optional: queue?]
     • [float queue length: 0 if accepted right now, N if the client must wait for N other nodes to finish before running]
     • [float throughput: the server's estimated computation time, including time in queue]
  3. Client sends prefix embeddings:
     • Tensor prefix input_embeddings [1, prefix_length, hidden_size] with compression
     • [optional prefix attention mask [prefix_length, prefix_length], default = tril]
  4. Server runs a forward pass, saves the attention caches, and returns:
     • Tensor prefix output_embeddings [1, prefix_length, hidden_size] with compression
  5. Client sends the next token's input embeddings:
     • Tensor input_embeddings [1, 1, hidden_size] with compression
     • [optional attention mask [1, prefix_length + prev_tokens], default = tril]
  6. Server runs a forward pass with the attention cache and returns:
     • Tensor output_embeddings
     • current length

GOTO step 5 while current length <= max length
If the client does not send a ping within T seconds (possibly an empty message if it has no data yet), the server closes the connection.
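A minimal server-side sketch of this loop (steps 3-6) is below. It assumes the step 1-2 handshake already happened, that the block follows a transformers-style interface where use_cache=True returns (hidden_states, past_key_values), and that request fields like input_embeddings use the illustrative names from above; none of this is the final hivemind API.

```python
import asyncio
from typing import AsyncIterator

import torch

MAX_STEPS = 256       # inference steps per session, excluding the prefix
PING_TIMEOUT = 10.0   # T seconds to wait for the next client message


async def rpc_inference(block: torch.nn.Module, requests: AsyncIterator, max_length: int):
    """One inference session: prefix pass (steps 3-4), then token-by-token steps (5-6)."""
    # steps 3-4: prefix embeddings -> prefix outputs; the attention cache is kept for reuse
    first = await asyncio.wait_for(requests.__anext__(), timeout=PING_TIMEOUT)
    prefix_embeddings = first.input_embeddings                     # [1, prefix_length, hidden_size]
    outputs, attention_cache = block(prefix_embeddings, use_cache=True)
    current_length = prefix_embeddings.shape[1]
    yield outputs                                                  # prefix output_embeddings

    # steps 5-6: one token per request, reusing the cache; loop until max length or timeout
    for _ in range(MAX_STEPS):
        try:
            request = await asyncio.wait_for(requests.__anext__(), timeout=PING_TIMEOUT)
        except (StopAsyncIteration, asyncio.TimeoutError):
            break                                                  # client is done or missed its ping
        token_embeddings = request.input_embeddings                # [1, 1, hidden_size]
        outputs, attention_cache = block(
            token_embeddings, past_key_values=attention_cache, use_cache=True
        )
        current_length += 1
        yield outputs, current_length
        if current_length >= max_length:
            break
```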

Don't think about these yet (open questions for later):

  • support fixed max length for now, e.g. 1024 or 2048?
  • inference up to 256 steps excluding prefix - to ensure we don't spend too long with the same node?
  • select one or more of that node's consecutive layers to run at once?
  • send more than one token at a time?
  • option to backtrack for a few tokens for beam search inference?
  • beam search with multiple hypotheses - and an option to reorder them internally?
@justheuristic justheuristic self-assigned this May 31, 2022
@justheuristic justheuristic transferred this issue from another repository Jun 12, 2022
justheuristic (Collaborator, Author) commented Jun 19, 2022:

Things that should be done before the first demo:

  • rpc_inference actually computes something :)
  • client has a way of running rpc_inference for more than one step :)
  • test cache basic functionality (no leaks, values are properly reused)
  • test session basic functionality (closes properly, no leaks, multi-input)
  • rpc_inference works without an attention mask and other optional inputs
  • rpc_inference uses cache correctly

[list of non-required steps merged into a separate issue]
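As a sanity check for the cache and session items above, something like the following could compare stepwise inference against one full forward pass; the inference_session / step client API used here is hypothetical, with illustrative names:

```python
import torch


def test_stepwise_inference_matches_forward(remote_block, embeddings):
    """Stepwise rpc_inference over a sequence should reproduce a single full forward pass."""
    full_outputs = remote_block(embeddings)             # [1, seq_len, hidden_size] via ordinary forward

    stepwise = []
    with remote_block.inference_session(max_length=embeddings.shape[1]) as session:
        for t in range(embeddings.shape[1]):
            stepwise.append(session.step(embeddings[:, t : t + 1, :]))
    stepwise = torch.cat(stepwise, dim=1)

    # if the attention cache is stored and reused correctly, both paths agree
    assert torch.allclose(full_outputs, stepwise, atol=1e-4)
```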

@justheuristic justheuristic changed the title [CODE] Inferencing API on hivemind.Server aka forward_incremental [CODE] Inferencing API on hivemind.Server aka rpc_inference Jun 19, 2022
@justheuristic justheuristic changed the title [CODE] Inferencing API on hivemind.Server aka rpc_inference [CODE] Basic Inferencing API on hivemind.Server aka rpc_inference Jun 19, 2022
justheuristic added a commit that referenced this issue Jun 19, 2022
justheuristic (Collaborator, Author) commented:

awaiting post-merge review by @GreenFatGuy
