
[SYCL] Support SYCL layer for LLaMA2 model #272

Merged
merged 77 commits into intel:main on Jun 21, 2024

Conversation

@ThanatosShinji (Contributor)

Type of Change

  • Convert BesTLA int4 weight to SYCL weight
  • Load SYCL weight into ne_tensor
  • Activation host->device conversion (see the sketch after this list)
  • Device workspace design
  • Convert 'output.weight' to SYCL tensor; run LLaMA2 on Arc A750
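
A minimal sketch of the activation host->device step above, assuming a `sycl::queue` and USM device allocations; the function name and shape are illustrative, not this PR's actual API:

```cpp
// Upload one activation tensor to the device with SYCL USM.
// `upload_activation` is a hypothetical helper, not this PR's API.
#include <sycl/sycl.hpp>
#include <vector>

float* upload_activation(sycl::queue& q, const std::vector<float>& host_act) {
  // Device USM allocation sized to the activation tensor.
  float* dev_act = sycl::malloc_device<float>(host_act.size(), q);
  // memcpy is asynchronous; wait() so a consuming kernel can launch safely.
  q.memcpy(dev_act, host_act.data(), host_act.size() * sizeof(float)).wait();
  return dev_act;  // caller releases with sycl::free(dev_act, q)
}
```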

@ThanatosShinji ThanatosShinji force-pushed the sync_int4 branch 2 times, most recently from 4dd6636 to 11772b9 on June 1, 2024 03:33
@luoyu-intel luoyu-intel force-pushed the sync_int4 branch 2 times, most recently from ecca7b9 to 16231e3 on June 4, 2024 03:26
@luoyu-intel luoyu-intel marked this pull request as ready for review June 4, 2024 08:11
@luoyu-intel (Contributor)

SYCL unit tests passed, and the llama2 SYCL model_eval runs on the i7-1185G7's Iris Xe iGPU, the 12900K (as a CPU device), and the A770.

@luoyu-intel (Contributor) commented Jun 6, 2024

QKV+FFN on GPU, MHA on CPU, 9600K+A770:
from 150 ms/token (CPU only) to 60 ms/token (CPU+SYCL). The memcpy between device and host takes most of the time.

Skip MHA: 13.6 ms/token (SYCL only)
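
For scale, running MHA on the CPU means each decoder layer downloads its attention input to the host and uploads the MHA result back, so a 7B model pays 32 synchronous round trips per token. A hedged micro-benchmark of just those staging copies (illustrative sizes for LLaMA-7B at batch 1, not this PR's code):

```cpp
// Hedged micro-benchmark of the host<->device staging copies a CPU-side
// MHA forces per token. n_layer = 32 and n_embd = 4096 are assumptions
// matching LLaMA-7B at batch 1, not measurements from this PR.
#include <sycl/sycl.hpp>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  sycl::queue q{sycl::gpu_selector_v};
  constexpr size_t n_layer = 32, n_embd = 4096;
  std::vector<float> host_buf(n_embd);
  float* dev_buf = sycl::malloc_device<float>(n_embd, q);

  auto t0 = std::chrono::steady_clock::now();
  for (size_t l = 0; l < n_layer; ++l) {
    // Download the attention input, then upload the MHA output.
    q.memcpy(host_buf.data(), dev_buf, n_embd * sizeof(float)).wait();
    q.memcpy(dev_buf, host_buf.data(), n_embd * sizeof(float)).wait();
  }
  auto t1 = std::chrono::steady_clock::now();
  std::printf("staging copies per token: %.3f ms\n",
              std::chrono::duration<double, std::milli>(t1 - t0).count());
  sycl::free(dev_buf, q);
  return 0;
}
```

On a PCIe-attached GPU each of these small transfers is latency-bound rather than bandwidth-bound, which is why skipping the CPU-side MHA helps far more than the copied byte count alone would suggest.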

@ThanatosShinji (Contributor, Author) commented Jun 10, 2024

Skip CPU MHA (to avoid memcpy between host and device) and run all other layers on the GPU.
LLaMA-7B int4 g128 sym:
A750-8GB: 15 ms/token
MTL-155H: 51 ms/token

This latency plus the MHA latency should give the final end-to-end performance (e.g., with MHA+RoPE+RMSNorm at roughly 5 ms, as measured below, about 20 ms/token on the A750).

@a32543254 (Contributor) left a comment


LGTM

@ThanatosShinji (Contributor, Author)

Run LLaMA2 with all layers on the A750: 19.5 ms/token.
MHA+RoPE+RMSNorm take 5 ms of that and need further optimization.
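
For a sense of what one op in that 5 ms bucket looks like on device, here is a minimal, unoptimized SYCL RMSNorm sketch (y = x / sqrt(mean(x^2) + eps) * w); it is a hedged illustration, not the kernel this PR ships:

```cpp
// Minimal RMSNorm: y[r] = x[r] / sqrt(mean(x[r]^2) + eps) * w.
// x, w, y are USM device pointers. One work-item per row keeps the
// sketch simple; a real kernel would use a work-group reduction.
#include <sycl/sycl.hpp>

void rmsnorm(sycl::queue& q, const float* x, const float* w, float* y,
             size_t rows, size_t cols, float eps = 1e-5f) {
  q.parallel_for(sycl::range<1>{rows}, [=](sycl::id<1> idx) {
    const size_t r = idx[0];
    const float* row = x + r * cols;
    float ss = 0.f;
    for (size_t c = 0; c < cols; ++c) ss += row[c] * row[c];
    const float scale = sycl::rsqrt(ss / cols + eps);
    for (size_t c = 0; c < cols; ++c) y[r * cols + c] = row[c] * scale * w[c];
  }).wait();
}
```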

@ThanatosShinji (Contributor, Author)

n_context = 1024, in = 512, out = 512: only 5.8 GB of GPU memory.
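
A rough accounting of that footprint (assumed, not profiled): ~6.7B weights at int4 g128 cost about 4.125 bits each (a 4-bit value plus a shared fp16 scale per 128 weights), roughly 3.5 GB; an fp16 KV cache at n_context = 1024 adds 2 x 32 layers x 1024 tokens x 4096 dims x 2 bytes ≈ 0.5 GB; the remaining ~1.8 GB would be activations, the device workspace, and runtime overhead.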

@ThanatosShinji (Contributor, Author) commented Jun 20, 2024

A750

Once again, here's a picture of the two sides of my face. It
model_print_timings:        load time =   939.34 ms
model_print_timings:      sample time =     1.96 ms /    16 runs   (    0.12 ms per token)
model_print_timings: prompt eval time =   939.30 ms /     2 tokens (  469.65 ms per token)
model_print_timings:        eval time =   257.18 ms /    15 runs   (   17.15 ms per token)
model_print_timings:       total time =  1200.16 ms
========== eval time log of each prediction ==========
prediction   0, time: 939.30ms
prediction   1, time: 17.77ms
prediction   2, time: 17.15ms
prediction   3, time: 17.13ms
prediction   4, time: 16.85ms

@ThanatosShinji (Contributor, Author) commented Jun 20, 2024

MTL 155H: ~55 ms/token; the total latency of the int4 GEMMs is 47 ms.

Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have many different experiences.
The girl had lots of dreams. She imagined herself as the greatest writer on earth, or an astronaut, an artist,
model_print_timings:        load time =   693.66 ms
model_print_timings:      sample time =     5.88 ms /    32 runs   (    0.18 ms per token)
model_print_timings: prompt eval time =   692.23 ms /    32 tokens (   21.63 ms per token)
model_print_timings:        eval time =  1752.01 ms /    31 runs   (   56.52 ms per token)
model_print_timings:       total time =  2458.35 ms
========== eval time log of each prediction ==========
prediction   0, time: 692.23ms
prediction   1, time: 58.09ms
prediction   2, time: 55.58ms
prediction   3, time: 55.13ms
prediction   4, time: 54.90ms

@luoyu-intel (Contributor) commented Jun 21, 2024

llama2-7b on A770+9900K:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have exciting experiences. However, no matter how hard she tried, she always got bored and restless wherever she went.
One day, while wandering
model_print_timings:        load time =   231.00 ms
model_print_timings:      sample time =     6.60 ms /    32 runs   (    0.21 ms per token)
model_print_timings: prompt eval time =   230.26 ms /    32 tokens (    7.20 ms per token)
model_print_timings:        eval time =   523.42 ms /    31 runs   (   16.88 ms per token)
model_print_timings:       total time =   765.82 ms
========== eval time log of each prediction ==========
prediction   0, time: 230.26ms
prediction   1, time: 17.19ms
prediction   2, time: 16.78ms
prediction   3, time: 16.82ms
prediction   4, time: 16.65ms
 Once upon a time, there was a little girl named Maria. obviously, you can see the beauty of this story as it unfolds. There are many different elements
model_print_timings:        load time =   773.71 ms
model_print_timings:      sample time =     6.50 ms /    32 runs   (    0.20 ms per token)
model_print_timings: prompt eval time =   773.64 ms /     2 tokens (  386.82 ms per token)
model_print_timings:        eval time =   501.96 ms /    31 runs   (   16.19 ms per token)
model_print_timings:       total time =  1285.98 ms
========== eval time log of each prediction ==========
prediction   0, time: 773.64ms
prediction   1, time: 16.78ms
prediction   2, time: 16.00ms
prediction   3, time: 15.93ms
prediction   4, time: 16.00ms

@luoyu-intel luoyu-intel merged commit dceba67 into intel:main Jun 21, 2024
12 of 13 checks passed