[BIT-484] Add TextCausalLMNext #848

Merged: 16 commits merged into Synapse on Jul 20, 2022

Conversation

@opentaco opentaco (Contributor) commented Jul 20, 2022

Adds the TextCausalLMNext Synapse, as well as the tokenizer_utils needed to encode/decode the synapse. Both core_server and core_validator are running successfully on Nobunaga with this branch.

TextCausalLMNext sends a user-defined topk of server token phrases for continuing the input context. The validator/client uses these for validated generation, evaluating the server's phrase probabilities against a groundtruth continuation. Backward gradients from the validator/client to the server are supported.
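
For intuition, one batch row of such a response can be pictured as a topk list of (probability, token phrase) pairs in the standard (bittensor) tokenizer. The structure below is purely illustrative (made-up numbers and token ids), not the actual compacted wire format discussed later in this thread:

# Illustrative only: conceptual shape of one batch row of a TextCausalLMNext
# response. Token ids and probabilities are made up; the real response is
# compacted into a 1-D tensor (see the review discussion below).
topk_row = [
    (0.41, [318, 257]),   # (probability, token phrase as std-tokenizer ids)
    (0.22, [318]),
    (0.07, [550]),
    # ... up to topk entries
]
# The validator scores these candidate phrases against the groundtruth
# continuation tokens to validate the server's prediction.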

See the detailed commit explanations below.

Specifies messaging of topk server token phrases with probabilities. Server last-position token predictions are retokenized into token phrases with the bittensor tokenizer, allowing zero-translation-loss CausalLM next generation between different tokenizers.

Also adds a comment specifying the proto compile command, which is useful as a manual compilation instruction.
Tokenizer utilities add functions to compact and unravel topk server token phrases (standard tokenized), to be used with the TextCausalLMNext synapse (a toy sketch of the idea follows below).
Unit tests the new tokenizer utility functions that compact and unravel topk server token phrases (standard tokenized), to be used with the TextCausalLMNext synapse.
Calculates the cross entropy of a phrase prediction against a target phrase, a multi-token extension of the cross entropy typically calculated for next-token prediction, to be used with the TextCausalLMNext synapse (a minimal sketch follows below).
Adds a unit test for the phrase cross entropy calculation against a target phrase, the multi-token extension of the typical next-token cross entropy, to be used with the TextCausalLMNext synapse.
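
As a rough illustration of the compact/unravel idea, variable-length phrases can be flattened into a single 1-D tensor and recovered again. This is a toy scheme with hypothetical helper names, not the actual tokenizer_utils layout:

import torch

# Toy compaction of variable-length token phrases into a 1-D tensor, and its
# inverse. Hypothetical helpers for illustration only; the real tokenizer_utils
# functions use their own layout and handle many more details.

def compact_phrases(batch_phrases):
    """batch_phrases: per sample, a list of (prob, token_phrase) pairs."""
    flat = []
    for row in batch_phrases:
        for prob, phrase in row:
            flat += [prob, float(len(phrase))] + [float(t) for t in phrase]
    return torch.tensor(flat)

def unravel_phrases(compact, batch_size, topk):
    """Recover the nested (prob, token_phrase) structure from the 1-D tensor."""
    out, i = [], 0
    for _ in range(batch_size):
        row = []
        for _ in range(topk):
            prob = compact[i].item()
            n = int(compact[i + 1].item())
            phrase = [int(t) for t in compact[i + 2:i + 2 + n].tolist()]
            row.append((prob, phrase))
            i += 2 + n
        out.append(row)
    return out

The point is that no padding is needed: each phrase carries its own length, which is why the response can stay 1-D.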
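
Similarly, the phrase cross entropy idea can be sketched in bare-bones form under the hypothetical (prob, phrase) structure above. The actual utility added in tokenizer_utils is more involved; the floor-probability fallback here is an assumption made for this sketch:

import math

# Bare-bones sketch of phrase-level cross entropy: the negative log of the
# probability the server assigned to the groundtruth token phrase. The floor
# fallback for unmatched phrases is an assumption of this sketch.

def phrase_cross_entropy_sketch(target_phrase, topk_row, floor_prob=1e-6):
    for prob, phrase in topk_row:
        if phrase == target_phrase:            # exact multi-token match
            return -math.log(max(prob, floor_prob))
    return -math.log(floor_prob)               # groundtruth not in topk

This is the multi-token analogue of the usual next-token cross entropy: instead of scoring a single target token against a full-vocabulary distribution, the target is a whole token phrase scored against the server's topk phrase candidates.
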
@coveralls

coveralls commented Jul 20, 2022

Pull Request Test Coverage Report for Build 9de67db5-bc8e-4bd7-a246-c05490ce7857

  • 163 of 182 (89.56%) changed or added relevant lines in 8 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-13.02%) to 60.515%

Changes Missing Coverage
File                                            Covered Lines   Changed/Added Lines   %
bittensor/utils/tokenizer_utils.py              90              93                    96.77%
bittensor/_dendrite/dendrite_impl.py            2               9                     22.22%
bittensor/_synapse/text_causallmnext_impl.py    53              62                    85.48%
Totals Coverage Status
Change from base Build 335e0d95-b9f6-4311-b6b9-f7a4f511d31e: -13.02%
Covered Lines: 3646
Relevant Lines: 6025

💛 - Coveralls

@opentaco opentaco requested a review from Eugene-hu July 20, 2022 20:29
@Eugene-hu Eugene-hu (Contributor) left a comment

Approved, left you a few questions

@@ -436,7 +436,80 @@ def text_causal_lm (
self.update_stats( formatted_endpoints, synapses, formatted_inputs, outputs, codes, times )
return outputs[0], codes[0], times[0]

def text_last_hidden_state(
def text_causal_lm_next(
@Eugene-hu (Contributor):

Should we expect this synapse to be called on its own or always in conjunction with CausalLM?

@opentaco (Contributor, Author):

TextCausalLMNext can be called independently of TextCausalLM, but for initial tests they'll both run so comparisons can be performed.

""" Factory function which returns a TextCausalLMNext synapse adapter given arguments.
Args:
topk (:obj:`int`):
Specifies the number of topk server token phrases to return.
@Eugene-hu (Contributor):

A little bit confused about the topk variable. Are they the topk logits of the server's tokenizer?

@opentaco (Contributor, Author):

Yes, topk logits/phrases, under the assumption that each logit represents a text phrase: the server represents this with a single token and an associated probability, but at the validator it can become multiple std tokens, i.e. a token phrase.
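
To make the retokenization concrete, a rough sketch, assuming HuggingFace-style tokenizer objects on both sides (variable names are illustrative, and the real tokenizer_utils handle whitespace/offset subtleties omitted here):

# Sketch: map one server-tokenizer token to a token phrase under the standard
# (bittensor) tokenizer. Assumes HuggingFace-style tokenizer objects exist as
# server_tokenizer and std_tokenizer.
server_token_id = 42                                            # illustrative id
text = server_tokenizer.decode([server_token_id])               # token -> text
phrase = std_tokenizer.encode(text, add_special_tokens=False)   # text -> 1..n std token ids
# A single server token can thus expand into several std tokens, so each topk
# "logit" becomes a variable-length token phrase with its probability attached.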

return forward_request_tensor

def decode_forward_request_tensor(self, forward_request_tensor: torch.Tensor) -> torch.Tensor:
    return forward_request_tensor
@Eugene-hu Eugene-hu (Contributor) Jul 20, 2022:

Do we need an encoding or decoding function?

@opentaco (Contributor, Author):

Currently, encoding and decoding are best performed at the origin, i.e. in the server and validator forward() functions respectively. Otherwise, the variable-length phrases would need to be converted to padded tensors simply to pass them to the synapse encode_forward_request_tensor(), where they would then have to be decoded, re-encoded, and compacted, duplicating work best done immediately at the origin.


def check_forward_response_tensor(self, forward_request_tensor, forward_response_tensor):
    # forward_request_tensor: [batch_size, sequence_len]
    # forward_response_tensor: [ >= batch_size * (2 * topk + 1)]
@Eugene-hu (Contributor):

The response tensor is a 1-D tensor?

@opentaco (Contributor, Author):

Yes, the response is 1-D to efficiently compact variable length token phrases by omitting padding that would otherwise be required to make it n-D.
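
For reference, a small sketch of where the >= batch_size * (2 * topk + 1) lower bound could come from, under the assumption that each sample contributes topk (probability, at-least-one-token phrase) entries plus one extra float per sample. This is an assumed accounting, not the exact wire format:

# Assumed accounting behind the minimum response length; not the exact format.
def min_response_length(batch_size: int, topk: int) -> int:
    per_phrase = 2                      # one probability + at least one token id
    per_sample = topk * per_phrase + 1  # plus one extra float per sample (assumption)
    return batch_size * per_sample

assert min_response_length(batch_size=2, topk=4) == 2 * (2 * 4 + 1)  # 18

Any real compaction would then add the remaining phrase tokens on top of this minimum, which is why the check is a lower bound rather than an equality.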

@opentaco opentaco merged commit b14578d into Synapse Jul 20, 2022