
Question about the latency speedup! #26

Open

ybai62868 opened this issue Oct 26, 2023 · 4 comments

@ybai62868

Hi,

Thanks for the great work!
I am curious whether you plan to provide a script to measure end-to-end inference latency on a single GPU for the Llama family of models?

Thanks,
Yang

@Eric-mingjie
Collaborator

I am adding this to my TODO list for this repo; I'm not sure when I can get back to it. In the meantime, feel free to check out this blog post on end-to-end speedup evaluation of Hugging Face Transformers models with structured sparsity.
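
For reference, below is a rough sketch of what such an end-to-end latency measurement could look like with Hugging Face Transformers; the checkpoint name, prompt, and generation settings are just placeholders, not the script for this repo:

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in whichever Llama-family model you want to measure.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.inference_mode():
    # Warm-up run so CUDA kernels and caches are initialized before timing.
    model.generate(**inputs, max_new_tokens=32)

    torch.cuda.synchronize()
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=128)
    torch.cuda.synchronize()
    elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"Generated {new_tokens} tokens in {elapsed:.2f} s ({new_tokens / elapsed:.1f} tokens/s)")

The warm-up run and the torch.cuda.synchronize() calls around the measured region matter, since CUDA kernel launches are asynchronous and would otherwise make the wall-clock timing misleading.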

@llCurious

Hi @Eric-mingjie. I tried to benchmark the efficiency gain from sparsity, but I found that sparse matmul seems to be slower than dense matmul.

import time

import torch

sparsity_ratio = 0.5

linear = torch.nn.Linear(1024, 3072, bias=False).float().cpu().eval()

# Build a 50% unstructured mask: mark the smallest-magnitude half of each row for pruning.
sort_res = torch.sort(torch.abs(linear.weight), dim=-1, stable=True)
indices = sort_res[1][:, :int(linear.weight.shape[1] * sparsity_ratio)]
mask = torch.zeros_like(linear.weight, dtype=torch.bool)
mask.scatter_(1, indices, True)

# Apply the mask so the dense and sparse benchmarks use the same pruned weight.
linear.weight = torch.nn.Parameter(linear.weight.masked_fill(mask, 0.0))

x = torch.rand(3072, 1024).float().cpu()

with torch.inference_mode():
    start = time.time()
    dense_output = linear(x)
    print(f"Dense linear {(time.time() - start) * 1000} ms")

    # convert the pruned weight to CSR sparse format
    sparse_weight = linear.weight.to_sparse_csr()

    start = time.time()
    sparse_output = torch.sparse.mm(sparse_weight, x.t()).t()
    print(f"Sparse linear {(time.time() - start) * 1000} ms")

    # sparse and dense matmul should be numerically equivalent
    assert torch.allclose(sparse_output, dense_output, atol=1e-3)

Running the above code yields the following output:

Dense linear 13.79251480102539 ms
Sparse linear 155.81130981445312 ms

The sparse matmul is about 10x slower. Do you have any idea why this happens?

@Eric-mingjie
Collaborator

Eric-mingjie commented Nov 9, 2023

Did you set the sparse kernel in torch.sparse as they did here https://pytorch.org/tutorials/prototype/semi_structured_sparse.html?

import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.utils.benchmark import Timer
SparseSemiStructuredTensor._FORCE_CUTLASS = True
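
For reference, here is a rough sketch of the kind of comparison that tutorial runs, timing a dense vs. 2:4 semi-structured sparse linear layer on GPU (it needs a fairly recent PyTorch build and an Ampere-or-newer GPU; the shapes below are arbitrary):

import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.utils.benchmark import Timer

SparseSemiStructuredTensor._FORCE_CUTLASS = True

# Impose a 2:4 pattern (two zeros in every block of four) on the weight.
mask = torch.Tensor([0, 0, 1, 1]).tile((3072, 256)).cuda().bool()
linear = torch.nn.Linear(1024, 3072).half().cuda().eval()
linear.weight = torch.nn.Parameter(mask * linear.weight)

x = torch.rand(3072, 1024).half().cuda()

with torch.inference_mode():
    dense_ms = Timer(stmt="linear(x)",
                     globals={"linear": linear, "x": x}).blocked_autorange().median * 1e3

    # Swap the dense weight for its 2:4 semi-structured sparse representation.
    linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight))

    sparse_ms = Timer(stmt="linear(x)",
                      globals={"linear": linear, "x": x}).blocked_autorange().median * 1e3

print(f"Dense: {dense_ms:.3f} ms | Sparse: {sparse_ms:.3f} ms | Speedup: {dense_ms / sparse_ms:.2f}x")

The speedup in this setting comes from the hardware's 2:4 structured-sparsity support, which unstructured CSR weights do not benefit from.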

@llCurious

> Did you set the sparse kernel in torch.sparse as they did here https://pytorch.org/tutorials/prototype/semi_structured_sparse.html?
>
> import torch
> from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
> from torch.utils.benchmark import Timer
> SparseSemiStructuredTensor._FORCE_CUTLASS = True

Nope. I intend to use the sparse matmul on the CPU; to_sparse_semi_structured seems to be designed for GPUs.
BTW, after extensive analysis (link), an initial conclusion is: dense matmul has been heavily optimized, so sparse matmul is only advantageous when the sparsity ratio is large enough (roughly above 90%).

In this regard, weight sparsity mainly helps memory usage and does not by itself improve throughput.
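
For what it's worth, here is a rough sketch of how one could check where CSR matmul starts to beat dense matmul on CPU as the sparsity ratio grows; the exact crossover point depends on the machine and PyTorch build:

import time

import torch

def bench(fn, iters=20):
    fn()  # warm-up call
    start = time.time()
    for _ in range(iters):
        fn()
    return (time.time() - start) / iters * 1000  # mean ms per call

x = torch.rand(3072, 1024)
for ratio in (0.5, 0.9, 0.95, 0.99):
    w = torch.rand(3072, 1024)
    w[torch.rand_like(w) < ratio] = 0.0  # random unstructured sparsity
    w_csr = w.to_sparse_csr()
    dense_ms = bench(lambda: x @ w.t())
    sparse_ms = bench(lambda: torch.sparse.mm(w_csr, x.t()).t())
    print(f"sparsity {ratio:.2f}: dense {dense_ms:.2f} ms | sparse {sparse_ms:.2f} ms")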
