Small optimizations #5
Conversation
This will only enable compilation for decoding. Note that there is not a big speedup for now, probably because the slot's buffer size increases over time, triggering recompilation.
Logits post-processing is not very heavyweight, and doing it on the CPU actually accelerates decoding, because compilation is not re-triggered.
LGTM! 🤗
@@ -512,8 +523,11 @@ def _generate_token(
 # Save KV cache
 self.past_key_values = outputs.past_key_values
 # Barrier for XLA model
-xm.mark_step(wait=False)
+xm.mark_step()
For my knowledge: we were not waiting before, so why has this been changed here?
The default is wait=False, and I did not want to give the false impression that I am changing the default behaviour, so I just removed the default parameter.
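The point can be shown with a plain-Python stand-in (mark_step below is a hypothetical stub, not the real torch_xla function): omitting an argument that equals the declared default leaves behaviour unchanged.

```python
# Hypothetical stub mirroring a function whose default is wait=False.
def mark_step(wait=False):
    return "sync" if wait else "async"

# Calling without the argument is identical to passing the default explicitly.
assert mark_step() == mark_step(wait=False)
print(mark_step())  # async
```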
@@ -44,13 +45,19 @@ def create_request(
 seed: int = 0,
 repetition_penalty: float = 1.0,
 ):
+# For these tests we can safely set typical_p to 1.0 (default)
+typical_p = 1.0
+if do_sample == False:
if not do_sample
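The suggested idiom can be sketched as follows (pick_strategy is a hypothetical helper, used only to contrast the two styles):

```python
def pick_strategy(do_sample: bool) -> str:
    # Truthiness check, as the reviewer suggests, instead of `do_sample == False`.
    if not do_sample:
        return "greedy"
    return "sample"

print(pick_strategy(False))  # greedy
print(pick_strategy(True))   # sample
```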
@@ -35,13 +36,19 @@ def create_request(
 seed: int = 0,
 repetition_penalty: float = 1.0,
 ):
+# For these tests we can safely set typical_p to 1.0 (default)
+typical_p = 1.0
+if do_sample == False:
if not do_sample
4dda7e9 to 401eea6
What does this PR do?
This PR adds a few optimizations made in preparation for compiling the model for decoding. Note that compilation is still not enabled by default, due to a bug I am currently investigating.
On the other hand, I spent some time profiling the code with the XLA profiling API, and I was able to understand that adding a few xm.mark_step calls improved performance, and that the token processing code actually runs faster when executed on the CPU, because that avoids recompilation.