[BIT-484] Add TextCausalLMNext #848

Merged: 16 commits merged into Synapse on Jul 20, 2022

Conversation

@opentaco opentaco (Contributor) commented Jul 20, 2022

Adds the TextCausalLMNext Synapse, as well as the tokenizer_utils needed to encode/decode the synapse. Both core_server and core_validator are running successfully on Nobunaga with this branch.

TextCausalLMNext sends a user-defined topk of server token phrases for continuing the input context. The validator/client uses these for validated generation, evaluating the server's phrase probabilities against a groundtruth continuation. Backward gradients from the validator/client to the server are supported.
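
For intuition, one batch row of such a response can be pictured as a topk list of (probability, token phrase) pairs in the standard (bittensor) tokenizer. The structure below is purely illustrative (made-up numbers and token ids), not the actual compacted wire format discussed later in this thread:

# Illustrative only: conceptual shape of one batch row of a TextCausalLMNext
# response. Token ids and probabilities are made up; the real response is
# compacted into a 1-D tensor (see the review discussion below).
topk_row = [
    (0.41, [318, 257]),   # (probability, token phrase as std-tokenizer ids)
    (0.22, [318]),
    (0.07, [550]),
    # ... up to topk entries
]
# The validator scores these candidate phrases against the groundtruth
# continuation tokens to validate the server's prediction.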

See the detailed commit explanations below.

Specifies messaging of topk server token phrases with probabilities. Server last-position token predictions are retokenized into token phrases with the bittensor tokenizer, allowing zero-translation-loss CausalLM next generation between different tokenizers.

Also adds a comment specifying the proto compile command, which is useful as a manual compilation instruction.
Tokenizer utilities add functions to compact and unravel topk server token phrases (standard tokenized), to be used with the TextCausalLMNext synapse (a toy sketch of the idea follows below).
Unit tests the new tokenizer utility functions that compact and unravel topk server token phrases (standard tokenized), to be used with the TextCausalLMNext synapse.
Calculates the cross entropy of a phrase prediction against a target phrase, a multi-token extension of the cross entropy typically calculated for next-token prediction, to be used with the TextCausalLMNext synapse (a minimal sketch follows below).
Adds a unit test for the phrase cross entropy calculation against a target phrase, the multi-token extension of the typical next-token cross entropy, to be used with the TextCausalLMNext synapse.
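
As a rough illustration of the compact/unravel idea, variable-length phrases can be flattened into a single 1-D tensor and recovered again. This is a toy scheme with hypothetical helper names, not the actual tokenizer_utils layout:

import torch

# Toy compaction of variable-length token phrases into a 1-D tensor, and its
# inverse. Hypothetical helpers for illustration only; the real tokenizer_utils
# functions use their own layout and handle many more details.

def compact_phrases(batch_phrases):
    """batch_phrases: per sample, a list of (prob, token_phrase) pairs."""
    flat = []
    for row in batch_phrases:
        for prob, phrase in row:
            flat += [prob, float(len(phrase))] + [float(t) for t in phrase]
    return torch.tensor(flat)

def unravel_phrases(compact, batch_size, topk):
    """Recover the nested (prob, token_phrase) structure from the 1-D tensor."""
    out, i = [], 0
    for _ in range(batch_size):
        row = []
        for _ in range(topk):
            prob = compact[i].item()
            n = int(compact[i + 1].item())
            phrase = [int(t) for t in compact[i + 2:i + 2 + n].tolist()]
            row.append((prob, phrase))
            i += 2 + n
        out.append(row)
    return out

The point is that no padding is needed: each phrase carries its own length, which is why the response can stay 1-D.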
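
Similarly, the phrase cross entropy idea can be sketched in bare-bones form under the hypothetical (prob, phrase) structure above. The actual utility added in tokenizer_utils is more involved; the floor-probability fallback here is an assumption made for this sketch:

import math

# Bare-bones sketch of phrase-level cross entropy: the negative log of the
# probability the server assigned to the groundtruth token phrase. The floor
# fallback for unmatched phrases is an assumption of this sketch.

def phrase_cross_entropy_sketch(target_phrase, topk_row, floor_prob=1e-6):
    for prob, phrase in topk_row:
        if phrase == target_phrase:            # exact multi-token match
            return -math.log(max(prob, floor_prob))
    return -math.log(floor_prob)               # groundtruth not in topk

This is the multi-token analogue of the usual next-token cross entropy: instead of scoring a single target token against a full-vocabulary distribution, the target is a whole token phrase scored against the server's topk phrase candidates.
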
@coveralls

coveralls commented Jul 20, 2022

Pull Request Test Coverage Report for Build 9de67db5-bc8e-4bd7-a246-c05490ce7857

  • 163 of 182 (89.56%) changed or added relevant lines in 8 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-13.02%) to 60.515%

Changes Missing Coverage
File                                            Covered Lines   Changed/Added Lines   %
bittensor/utils/tokenizer_utils.py              90              93                    96.77%
bittensor/_dendrite/dendrite_impl.py            2               9                     22.22%
bittensor/_synapse/text_causallmnext_impl.py    53              62                    85.48%
Totals Coverage Status
Change from base Build 335e0d95-b9f6-4311-b6b9-f7a4f511d31e: -13.02%
Covered Lines: 3646
Relevant Lines: 6025

💛 - Coveralls

@opentaco opentaco requested a review from Eugene-hu July 20, 2022 20:29
@Eugene-hu Eugene-hu (Contributor) left a comment

Approved, left you a few questions

@@ -436,7 +436,80 @@ def text_causal_lm (
self.update_stats( formatted_endpoints, synapses, formatted_inputs, outputs, codes, times )
return outputs[0], codes[0], times[0]

def text_last_hidden_state(
def text_causal_lm_next(
@Eugene-hu (Contributor):

Should we expect this synapse to be called on its own or always in conjunction with CausalLM?

@opentaco (Contributor, Author):

TextCausalLMNext can be called independently of TextCausalLM, but for initial tests they'll both run so comparisons can be performed.

""" Factory function which returns a TextCausalLMNext synapse adapter given arguments.
Args:
topk (:obj:`int`):
Specifies the number of topk server token phrases to return.
@Eugene-hu (Contributor):

A little bit confused about the topk variable. Are they the topk logits of the server's tokenizer?

@opentaco (Contributor, Author):

Yes, topk logits/phrases, under the assumption that each logit represents a text phrase: the server represents this with a single token and an associated probability, but at the validator it can become multiple std tokens, i.e. a token phrase.
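
To make the retokenization concrete, a rough sketch, assuming HuggingFace-style tokenizer objects on both sides (variable names are illustrative, and the real tokenizer_utils handle whitespace/offset subtleties omitted here):

# Sketch: map one server-tokenizer token to a token phrase under the standard
# (bittensor) tokenizer. Assumes HuggingFace-style tokenizer objects exist as
# server_tokenizer and std_tokenizer.
server_token_id = 42                                            # illustrative id
text = server_tokenizer.decode([server_token_id])               # token -> text
phrase = std_tokenizer.encode(text, add_special_tokens=False)   # text -> 1..n std token ids
# A single server token can thus expand into several std tokens, so each topk
# "logit" becomes a variable-length token phrase with its probability attached.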

return forward_request_tensor

def decode_forward_request_tensor(self, forward_request_tensor: torch.Tensor) -> torch.Tensor:
    return forward_request_tensor
@Eugene-hu Eugene-hu (Contributor) Jul 20, 2022:

Do we need an encoding or decoding function?

@opentaco (Contributor, Author):

Currently, encoding and decoding are best performed at the origin, i.e. in the server and validator forward() functions respectively. Otherwise, the variable-length phrases would need to be converted to padded tensors simply to pass them to the synapse encode_forward_request_tensor(), where they would then have to be decoded, re-encoded, and compacted, duplicating work best done immediately at the origin.


def check_forward_response_tensor(self, forward_request_tensor, forward_response_tensor):
    # forward_request_tensor: [batch_size, sequence_len]
    # forward_response_tensor: [ >= batch_size * (2 * topk + 1)]
@Eugene-hu (Contributor):

The response tensor is a 1-D tensor?

@opentaco (Contributor, Author):

Yes, the response is 1-D to efficiently compact variable length token phrases by omitting padding that would otherwise be required to make it n-D.
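
For reference, a small sketch of where the >= batch_size * (2 * topk + 1) lower bound could come from, under the assumption that each sample contributes topk (probability, at-least-one-token phrase) entries plus one extra float per sample. This is an assumed accounting, not the exact wire format:

# Assumed accounting behind the minimum response length; not the exact format.
def min_response_length(batch_size: int, topk: int) -> int:
    per_phrase = 2                      # one probability + at least one token id
    per_sample = topk * per_phrase + 1  # plus one extra float per sample (assumption)
    return batch_size * per_sample

assert min_response_length(batch_size=2, topk=4) == 2 * (2 * 4 + 1)  # 18

Any real compaction would then add the remaining phrase tokens on top of this minimum, which is why the check is a lower bound rather than an equality.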

@opentaco opentaco merged commit b14578d into Synapse Jul 20, 2022