
Performance Regression in DML #641

Open
contentis opened this issue Jun 25, 2024 · 2 comments

@contentis

Comparing performance from 0.3.0rc2 to the current 0.3.0 release, there seems to be a 7-10% drop in token generation performance when using the benchmark script (benchmark.py).
Profiling indicated that the inference calls themselves haven't changed; the delta originates during the sampling process. Notably, in rc2 there doesn't seem to be any PCI traffic.

Setup:

  • RTX4090
  • DML 1.14.1
  • ORT 1.16
  • Intel 13900k

Model:

  • Llama2 7B int4

Raw Data

0.3.0rc2

100/100
Average Tokenization Latency (per token): 0.9491149859968573 ms
Average Tokenization Throughput (per token): 1053.6131182774425 tps
Average Prompt Processing Latency (per token): 0.45547810004791245 ms
Average Prompt Processing Throughput (per token): 2195.495238727852 tps
Average Token Generation Latency (per token): 5.24359868934632 ms
Average Token Generation Throughput (per token): 190.7087210986893 tps
Average Sampling Latency (per token): 0.007989048026502132 ms
Average Sampling Throughput (per token): 125171.35917604853 tps
Average Wall Clock Time: 0.5673357248306274 s
Average Wall Clock Throughput: 352.52495347390305 tps
Results saved in genai_e2e!

1000/100
Average Tokenization Latency (per token): 38.520145014626905 ms
Average Tokenization Throughput (per token): 25.96044224704448 tps
Average Prompt Processing Latency (per token): 0.14640189501224085 ms
Average Prompt Processing Throughput (per token): 6830.512678243603 tps
Average Token Generation Latency (per token): 5.873470604854796 ms
Average Token Generation Throughput (per token): 170.2570877214294 tps
Average Sampling Latency (per token): 0.0082304006209597 ms
Average Sampling Throughput (per token): 121500.7684380977 tps
Average Wall Clock Time: 0.7697359204292298 s
Average Wall Clock Throughput: 1429.0615402053788 tps
Results saved in genai_e2e!

2000/100
Average Tokenization Latency (per token): 57.92622998706065 ms
Average Tokenization Throughput (per token): 17.26333649235202 tps
Average Prompt Processing Latency (per token): 0.17663256999803706 ms
Average Prompt Processing Throughput (per token): 5661.470022267768 tps
Average Token Generation Latency (per token): 6.459128232367073 ms
Average Token Generation Throughput (per token): 154.81965429776434 tps
Average Sampling Latency (per token): 0.008355751167982817 ms
Average Sampling Throughput (per token): 119678.04927362534 tps
Average Wall Clock Time: 1.0549691677093507 s
Average Wall Clock Throughput: 1990.5795015409972 tps
Results saved in genai_e2e!


0.3.0

100/100
Average Tokenization Latency (per token): 0.9885049948934466 ms
Average Tokenization Throughput (per token): 1011.6286768058188 tps
Average Prompt Processing Latency (per token): 0.7301077501033433 ms
Average Prompt Processing Throughput (per token): 1369.6608478111002 tps
Average Token Generation Latency (per token): 5.755390152876087 ms
Average Token Generation Throughput (per token): 173.7501669630997 tps
Average Sampling Latency (per token): 0.008945500769186765 ms
Average Sampling Throughput (per token): 111788.04024527628 tps
Average Wall Clock Time: 0.6463512778282166 s
Average Wall Clock Throughput: 309.4292637155656 tps
Results saved in genai_e2e!

1000/100
Average Tokenization Latency (per token): 36.456640012329444 ms
Average Tokenization Throughput (per token): 27.42984541805842 tps
Average Prompt Processing Latency (per token): 0.1962642399885226 ms
Average Prompt Processing Throughput (per token): 5095.171693317536 tps
Average Token Generation Latency (per token): 6.411659899649841 ms
Average Token Generation Throughput (per token): 155.96585215859827 tps
Average Sampling Latency (per token): 0.008837299712467939 ms
Average Sampling Throughput (per token): 113156.73707310943 tps
Average Wall Clock Time: 0.8736567020416259 s
Average Wall Clock Throughput: 1259.0757873538178 tps
Results saved in genai_e2e!

2000/100
Average Tokenization Latency (per token): 271.83306501829065 ms
Average Tokenization Throughput (per token): 3.6787283398828383 tps
Average Prompt Processing Latency (per token): 0.21109468999493403 ms
Average Prompt Processing Throughput (per token): 4737.21058556233 tps
Average Token Generation Latency (per token): 7.096416060283611 ms
Average Token Generation Throughput (per token): 140.91620213711576 tps
Average Sampling Latency (per token): 0.0104329998139292 ms
Average Sampling Throughput (per token): 95849.70936785509 tps
Average Wall Clock Time: 1.4074696063995362 s
Average Wall Clock Throughput: 1492.0393239410928 tps
Results saved in genai_e2e!
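For reference, the 7-10% estimate can be reproduced from the token generation throughput numbers quoted above (a minimal sketch; the dictionaries simply restate the raw data, rounded):

```python
# Average Token Generation Throughput (tps) from the raw data above,
# keyed by "prompt length/generation length" configuration.
rc2 = {
    "100/100": 190.71,
    "1000/100": 170.26,
    "2000/100": 154.82,
}
release = {
    "100/100": 173.75,
    "1000/100": 155.97,
    "2000/100": 140.92,
}

for cfg, base in rc2.items():
    drop = (base - release[cfg]) / base * 100
    print(f"{cfg}: {drop:.1f}% drop in token generation throughput")
```

Each configuration shows roughly an 8-9% throughput drop, consistent with the 7-10% estimate.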

@zhangxiang1993 (Member)

Hi @contentis, thanks for reporting this, we are looking into it.

@zhangxiang1993 (Member)

Quick update: this performance regression is related to a DirectML dependency on a lower-level D3D library that was introduced to fix a memory leak; the regression is the trade-off of that fix. We are still looking into a proper fix for this.
