
Question about the latency speedup! #26

Open

ybai62868 opened this issue Oct 26, 2023 · 4 comments

@ybai62868

Hi,

Thanks for the great work!
I am curious whether you plan to provide a script to measure end-to-end inference latency on a single GPU for the Llama family of models?

Thanks,
Yang

@Eric-mingjie
Collaborator

I am adding this to my TODO list for this repo; I'm not sure when I can get back to it. In the meantime, feel free to check out this blog post on end-to-end speedup evaluation of Hugging Face Transformers models with structured sparsity.
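
For reference, below is a rough sketch of what such an end-to-end latency measurement could look like with Hugging Face Transformers; the checkpoint name, prompt, and generation settings are just placeholders, not the script for this repo:

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in whichever Llama-family model you want to measure.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.inference_mode():
    # Warm-up run so CUDA kernels and caches are initialized before timing.
    model.generate(**inputs, max_new_tokens=32)

    torch.cuda.synchronize()
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=128)
    torch.cuda.synchronize()
    elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"Generated {new_tokens} tokens in {elapsed:.2f} s ({new_tokens / elapsed:.1f} tokens/s)")

The warm-up run and the torch.cuda.synchronize() calls around the measured region matter, since CUDA kernel launches are asynchronous and would otherwise make the wall-clock timing misleading.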

@llCurious

Hi @Eric-mingjie. I tried to benchmark the efficiency gain from sparsity, but I found that sparse matmul seems to be slower than dense matmul.

import time

import torch

sparsity_ratio = 0.5

linear = torch.nn.Linear(1024, 3072, bias=False).float().cpu().eval()

# Build a 50% unstructured mask: mark the smallest-magnitude half of each row for pruning.
sort_res = torch.sort(torch.abs(linear.weight), dim=-1, stable=True)
indices = sort_res[1][:, :int(linear.weight.shape[1] * sparsity_ratio)]
mask = torch.zeros_like(linear.weight, dtype=torch.bool)
mask.scatter_(1, indices, True)

# Apply the mask so the dense and sparse benchmarks use the same pruned weight.
linear.weight = torch.nn.Parameter(linear.weight.masked_fill(mask, 0.0))

x = torch.rand(3072, 1024).float().cpu()

with torch.inference_mode():
    start = time.time()
    dense_output = linear(x)
    print(f"Dense linear {(time.time() - start) * 1000} ms")

    # convert the pruned weight to CSR sparse format
    sparse_weight = linear.weight.to_sparse_csr()

    start = time.time()
    sparse_output = torch.sparse.mm(sparse_weight, x.t()).t()
    print(f"Sparse linear {(time.time() - start) * 1000} ms")

    # sparse and dense matmul should be numerically equivalent
    assert torch.allclose(sparse_output, dense_output, atol=1e-3)

Running the above code yields the following output:

Dense linear 13.79251480102539 ms
Sparse linear 155.81130981445312 ms

The sparse matmul is about 10x slower. Do you have any idea why this happens?

@Eric-mingjie
Collaborator

Eric-mingjie commented Nov 9, 2023

Did you set the sparse kernel in torch.sparse as they did here https://pytorch.org/tutorials/prototype/semi_structured_sparse.html?

import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.utils.benchmark import Timer
SparseSemiStructuredTensor._FORCE_CUTLASS = True
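
For reference, here is a rough sketch of the kind of comparison that tutorial runs, timing a dense vs. 2:4 semi-structured sparse linear layer on GPU (it needs a fairly recent PyTorch build and an Ampere-or-newer GPU; the shapes below are arbitrary):

import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.utils.benchmark import Timer

SparseSemiStructuredTensor._FORCE_CUTLASS = True

# Impose a 2:4 pattern (two zeros in every block of four) on the weight.
mask = torch.Tensor([0, 0, 1, 1]).tile((3072, 256)).cuda().bool()
linear = torch.nn.Linear(1024, 3072).half().cuda().eval()
linear.weight = torch.nn.Parameter(mask * linear.weight)

x = torch.rand(3072, 1024).half().cuda()

with torch.inference_mode():
    dense_ms = Timer(stmt="linear(x)",
                     globals={"linear": linear, "x": x}).blocked_autorange().median * 1e3

    # Swap the dense weight for its 2:4 semi-structured sparse representation.
    linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight))

    sparse_ms = Timer(stmt="linear(x)",
                      globals={"linear": linear, "x": x}).blocked_autorange().median * 1e3

print(f"Dense: {dense_ms:.3f} ms | Sparse: {sparse_ms:.3f} ms | Speedup: {dense_ms / sparse_ms:.2f}x")

The speedup in this setting comes from the hardware's 2:4 structured-sparsity support, which unstructured CSR weights do not benefit from.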

@llCurious

> Did you set the sparse kernel in torch.sparse as they did here https://pytorch.org/tutorials/prototype/semi_structured_sparse.html?
>
> import torch
> from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
> from torch.utils.benchmark import Timer
> SparseSemiStructuredTensor._FORCE_CUTLASS = True

Nope. I intend to use the sparse matmul on the CPU; to_sparse_semi_structured seems to be designed for GPUs.
BTW, after extensive analysis (link), an initial conclusion is: dense matmul has been heavily optimized, so sparse matmul is only advantageous when the sparsity ratio is large enough (roughly above 90%).

In this regard, weight sparsity mainly helps memory usage and does not by itself improve throughput.
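
For what it's worth, here is a rough sketch of how one could check where CSR matmul starts to beat dense matmul on CPU as the sparsity ratio grows; the exact crossover point depends on the machine and PyTorch build:

import time

import torch

def bench(fn, iters=20):
    fn()  # warm-up call
    start = time.time()
    for _ in range(iters):
        fn()
    return (time.time() - start) / iters * 1000  # mean ms per call

x = torch.rand(3072, 1024)
for ratio in (0.5, 0.9, 0.95, 0.99):
    w = torch.rand(3072, 1024)
    w[torch.rand_like(w) < ratio] = 0.0  # random unstructured sparsity
    w_csr = w.to_sparse_csr()
    dense_ms = bench(lambda: x @ w.t())
    sparse_ms = bench(lambda: torch.sparse.mm(w_csr, x.t()).t())
    print(f"sparsity {ratio:.2f}: dense {dense_ms:.2f} ms | sparse {sparse_ms:.2f} ms")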
