
[SYCL] Support SYCL layer for LLaMA2 model #272

Merged
merged 77 commits into intel:main on Jun 21, 2024

Conversation

@ThanatosShinji (Contributor)

Type of Change

  • Convert BesTLA int4 weight to SYCL weight
  • Load SYCL weight into ne_tensor
  • Activation host->device conversion (see the sketch after this list)
  • Device workspace design
  • Convert 'output.weight' to SYCL tensor; run LLaMA2 on Arc A750
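
A minimal sketch of the activation host->device step above, assuming a `sycl::queue` and USM device allocations; the function name and shape are illustrative, not this PR's actual API:

```cpp
// Upload one activation tensor to the device with SYCL USM.
// `upload_activation` is a hypothetical helper, not this PR's API.
#include <sycl/sycl.hpp>
#include <vector>

float* upload_activation(sycl::queue& q, const std::vector<float>& host_act) {
  // Device USM allocation sized to the activation tensor.
  float* dev_act = sycl::malloc_device<float>(host_act.size(), q);
  // memcpy is asynchronous; wait() so a consuming kernel can launch safely.
  q.memcpy(dev_act, host_act.data(), host_act.size() * sizeof(float)).wait();
  return dev_act;  // caller releases with sycl::free(dev_act, q)
}
```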

@ThanatosShinji ThanatosShinji force-pushed the sync_int4 branch 2 times, most recently from 4dd6636 to 11772b9 on June 1, 2024 03:33
@luoyu-intel luoyu-intel force-pushed the sync_int4 branch 2 times, most recently from ecca7b9 to 16231e3 on June 4, 2024 03:26
@luoyu-intel luoyu-intel marked this pull request as ready for review June 4, 2024 08:11
@luoyu-intel (Contributor)

SYCL unit tests passed, and the llama2 SYCL model_eval runs on the i7-1185G7's Iris Xe iGPU, the 12900K (as a CPU device), and the A770.

@luoyu-intel (Contributor) commented Jun 6, 2024

QKV+FFN on GPU, MHA on CPU, 9600K+A770:
from 150 ms/token (CPU only) to 60 ms/token (CPU+SYCL). The memcpy between device and host takes most of the time.

Skip MHA: 13.6 ms/token (SYCL only)
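
For scale, running MHA on the CPU means each decoder layer downloads its attention input to the host and uploads the MHA result back, so a 7B model pays 32 synchronous round trips per token. A hedged micro-benchmark of just those staging copies (illustrative sizes for LLaMA-7B at batch 1, not this PR's code):

```cpp
// Hedged micro-benchmark of the host<->device staging copies a CPU-side
// MHA forces per token. n_layer = 32 and n_embd = 4096 are assumptions
// matching LLaMA-7B at batch 1, not measurements from this PR.
#include <sycl/sycl.hpp>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  sycl::queue q{sycl::gpu_selector_v};
  constexpr size_t n_layer = 32, n_embd = 4096;
  std::vector<float> host_buf(n_embd);
  float* dev_buf = sycl::malloc_device<float>(n_embd, q);

  auto t0 = std::chrono::steady_clock::now();
  for (size_t l = 0; l < n_layer; ++l) {
    // Download the attention input, then upload the MHA output.
    q.memcpy(host_buf.data(), dev_buf, n_embd * sizeof(float)).wait();
    q.memcpy(dev_buf, host_buf.data(), n_embd * sizeof(float)).wait();
  }
  auto t1 = std::chrono::steady_clock::now();
  std::printf("staging copies per token: %.3f ms\n",
              std::chrono::duration<double, std::milli>(t1 - t0).count());
  sycl::free(dev_buf, q);
  return 0;
}
```

On a PCIe-attached GPU each of these small transfers is latency-bound rather than bandwidth-bound, which is why skipping the CPU-side MHA helps far more than the copied byte count alone would suggest.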

@ThanatosShinji (Contributor, Author) commented Jun 10, 2024

Skip CPU MHA (to avoid memcpy between host and device) and run all other layers on the GPU.
LLaMA-7B int4 g128 sym:
A750-8GB: 15 ms/token
MTL-155H: 51 ms/token

This latency plus the MHA latency should give the final end-to-end performance (e.g., with MHA+RoPE+RMSNorm at roughly 5 ms, as measured below, about 20 ms/token on the A750).

@a32543254 (Contributor) left a comment


LGTM

@ThanatosShinji (Contributor, Author)

Run LLaMA2 with all layers on the A750: 19.5 ms/token.
MHA+RoPE+RMSNorm take 5 ms of that and need further optimization.
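
For a sense of what one op in that 5 ms bucket looks like on device, here is a minimal, unoptimized SYCL RMSNorm sketch (y = x / sqrt(mean(x^2) + eps) * w); it is a hedged illustration, not the kernel this PR ships:

```cpp
// Minimal RMSNorm: y[r] = x[r] / sqrt(mean(x[r]^2) + eps) * w.
// x, w, y are USM device pointers. One work-item per row keeps the
// sketch simple; a real kernel would use a work-group reduction.
#include <sycl/sycl.hpp>

void rmsnorm(sycl::queue& q, const float* x, const float* w, float* y,
             size_t rows, size_t cols, float eps = 1e-5f) {
  q.parallel_for(sycl::range<1>{rows}, [=](sycl::id<1> idx) {
    const size_t r = idx[0];
    const float* row = x + r * cols;
    float ss = 0.f;
    for (size_t c = 0; c < cols; ++c) ss += row[c] * row[c];
    const float scale = sycl::rsqrt(ss / cols + eps);
    for (size_t c = 0; c < cols; ++c) y[r * cols + c] = row[c] * scale * w[c];
  }).wait();
}
```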

@ThanatosShinji (Contributor, Author)

n_context = 1024, in = 512, out = 512: only 5.8 GB of GPU memory.
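
A rough accounting of that footprint (assumed, not profiled): ~6.7B weights at int4 g128 cost about 4.125 bits each (a 4-bit value plus a shared fp16 scale per 128 weights), roughly 3.5 GB; an fp16 KV cache at n_context = 1024 adds 2 x 32 layers x 1024 tokens x 4096 dims x 2 bytes ≈ 0.5 GB; the remaining ~1.8 GB would be activations, the device workspace, and runtime overhead.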

@ThanatosShinji (Contributor, Author) commented Jun 20, 2024

A750

Once again, here's a picture of the two sides of my face. It
model_print_timings:        load time =   939.34 ms
model_print_timings:      sample time =     1.96 ms /    16 runs   (    0.12 ms per token)
model_print_timings: prompt eval time =   939.30 ms /     2 tokens (  469.65 ms per token)
model_print_timings:        eval time =   257.18 ms /    15 runs   (   17.15 ms per token)
model_print_timings:       total time =  1200.16 ms
========== eval time log of each prediction ==========
prediction   0, time: 939.30ms
prediction   1, time: 17.77ms
prediction   2, time: 17.15ms
prediction   3, time: 17.13ms
prediction   4, time: 16.85ms

@ThanatosShinji (Contributor, Author) commented Jun 20, 2024

MTL 155H: ~55 ms/token; the total latency of the int4 GEMMs is 47 ms.

Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have many different experiences.
The girl had lots of dreams. She imagined herself as the greatest writer on earth, or an astronaut, an artist,
model_print_timings:        load time =   693.66 ms
model_print_timings:      sample time =     5.88 ms /    32 runs   (    0.18 ms per token)
model_print_timings: prompt eval time =   692.23 ms /    32 tokens (   21.63 ms per token)
model_print_timings:        eval time =  1752.01 ms /    31 runs   (   56.52 ms per token)
model_print_timings:       total time =  2458.35 ms
========== eval time log of each prediction ==========
prediction   0, time: 692.23ms
prediction   1, time: 58.09ms
prediction   2, time: 55.58ms
prediction   3, time: 55.13ms
prediction   4, time: 54.90ms

@luoyu-intel (Contributor) commented Jun 21, 2024

llama2-7b on A770+9900K:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have exciting experiences. However, no matter how hard she tried, she always got bored and restless wherever she went.
One day, while wandering
model_print_timings:        load time =   231.00 ms
model_print_timings:      sample time =     6.60 ms /    32 runs   (    0.21 ms per token)
model_print_timings: prompt eval time =   230.26 ms /    32 tokens (    7.20 ms per token)
model_print_timings:        eval time =   523.42 ms /    31 runs   (   16.88 ms per token)
model_print_timings:       total time =   765.82 ms
========== eval time log of each prediction ==========
prediction   0, time: 230.26ms
prediction   1, time: 17.19ms
prediction   2, time: 16.78ms
prediction   3, time: 16.82ms
prediction   4, time: 16.65ms
 Once upon a time, there was a little girl named Maria. obviously, you can see the beauty of this story as it unfolds. There are many different elements
model_print_timings:        load time =   773.71 ms
model_print_timings:      sample time =     6.50 ms /    32 runs   (    0.20 ms per token)
model_print_timings: prompt eval time =   773.64 ms /     2 tokens (  386.82 ms per token)
model_print_timings:        eval time =   501.96 ms /    31 runs   (   16.19 ms per token)
model_print_timings:       total time =  1285.98 ms
========== eval time log of each prediction ==========
prediction   0, time: 773.64ms
prediction   1, time: 16.78ms
prediction   2, time: 16.00ms
prediction   3, time: 15.93ms
prediction   4, time: 16.00ms

@luoyu-intel luoyu-intel merged commit dceba67 into intel:main Jun 21, 2024
12 of 13 checks passed