About the implementation of .cpu() #96

Open · reflectionie opened this issue Jul 19, 2024 · 1 comment

@reflectionie

Thanks for your work! May I ask when you expect to implement the .cpu() method of HQQLinear? Or could you briefly describe how to implement it? I could implement it myself and submit a PR:

```python
def cpu(self):
```

@mobicham (Collaborator) commented Jul 19, 2024

Thanks! It should be similar to .cuda(), but would use .to('cpu') instead:

hqq/hqq/core/quantize.py, lines 472 to 535 in b1a7c06:

```python
def cuda(self, device):
    self.meta["compute_dtype"] = self.compute_dtype

    if type(self.W_q) == nn.parameter.Parameter:
        self.W_q.data, self.meta = Quantizer.cuda(self.W_q.data, self.meta, device)
    else:
        self.W_q, self.meta = Quantizer.cuda(self.W_q, self.meta, device)

    if self.meta["quant_zero"]:
        if "zero_q" in self.meta:
            self.meta["zero_q"], self.meta["meta_zero"] = Quantizer.cuda(
                self.meta["zero_q"], self.meta["meta_zero"], device
            )
        else:
            _, self.meta["meta_zero"] = Quantizer.cuda(
                None, self.meta["meta_zero"], device
            )
    elif "zero" in self.meta:
        self.meta["zero"] = self.meta["zero"].to(device)

    if self.meta["quant_scale"]:
        if "scale_q" in self.meta:
            self.meta["scale_q"], self.meta["meta_scale"] = Quantizer.cuda(
                self.meta["scale_q"], self.meta["meta_scale"], device
            )
        else:
            _, self.meta["meta_scale"] = Quantizer.cuda(
                None, self.meta["meta_scale"], device
            )
    elif "scale" in self.meta:
        self.meta["scale"] = self.meta["scale"].to(device)

    # # Use zero/scale with streams for dequantization is faster than packing in "zero_scale"
    # for key in ["zero", "zero_q", "scale", "scale_q"]:
    #     if (key in self.meta) and self.offload_meta:
    #         self.meta[key] = self.meta[key].contiguous().cpu().pin_memory()

    if self.offload_meta:
        if "zero_scale" not in self.meta:
            if self.meta["quant_scale"] and self.meta["quant_zero"]:
                self.meta["zero_scale"] = torch.stack(
                    (self.meta["zero_q"], self.meta["scale_q"])
                )
                del self.meta["scale_q"], self.meta["zero_q"]
            else:
                self.meta["zero_scale"] = torch.stack(
                    (self.meta["zero"], self.meta["scale"])
                ).to(self.compute_dtype)
                del self.meta["scale"], self.meta["zero"]

        self.meta["zero_scale"] = (
            self.meta["zero_scale"].contiguous().cpu().pin_memory()
        )

    if self.bias is not None:
        self.bias = self.bias.to(device=device, dtype=self.compute_dtype)

    self.W_q = nn.Parameter(self.W_q, requires_grad=False)
    self.device = device
    self.in_gpu = True

    torch.cuda.empty_cache()

    return self
```
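For reference, here is a rough, untested sketch of what a .cpu() counterpart could look like as a method of HQQLinear, following the structure of .cuda() above. It is not from the repo: it only moves tensors with .to('cpu') and does not replicate the dtype/packing work that Quantizer.cuda performs, so treat it as a starting point. The meta key names (zero_q, meta_zero, etc.) are taken from the .cuda() code above; everything else is an assumption.

```python
# Sketch only (not from the repo). Assumes the packed W_q layout is usable
# on CPU as-is, and that meta_zero/meta_scale are dicts that may hold tensors.
def cpu(self):
    device = 'cpu'
    self.meta["compute_dtype"] = self.compute_dtype

    # Move the packed quantized weights, mirroring the .cuda() branches
    if type(self.W_q) == nn.parameter.Parameter:
        self.W_q.data = self.W_q.data.to(device)
    else:
        self.W_q = self.W_q.to(device)

    # Move any top-level zero/scale tensors (quantized or not, packed or not)
    for key in ["zero", "zero_q", "scale", "scale_q", "zero_scale"]:
        if key in self.meta and isinstance(self.meta[key], torch.Tensor):
            self.meta[key] = self.meta[key].to(device)

    # The nested meta for quantized zero/scale may also hold tensors
    for sub in ["meta_zero", "meta_scale"]:
        if sub in self.meta:
            for k, v in self.meta[sub].items():
                if isinstance(v, torch.Tensor):
                    self.meta[sub][k] = v.to(device)

    if self.bias is not None:
        self.bias = self.bias.to(device=device, dtype=self.compute_dtype)

    self.W_q = nn.Parameter(self.W_q, requires_grad=False)
    self.device = device
    self.in_gpu = False
    return self
```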

Right now it is a mess because we support quantizing the scale/zero values and offloading them to the CPU.
I think we are going to remove this in the future, which should make things much easier: #93 (comment)

May I ask why you need the .cpu() call? If you just want to use HQQLinear on the CPU, you can simply pass HQQLinear(..., device='cpu').
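For example, a layer can be quantized directly on the CPU. This sketch follows the constructor shown in the hqq README (HQQLinear(linear_layer, quant_config, compute_dtype=..., device=...)); the BaseQuantizeConfig arguments and layer sizes are illustrative defaults, not taken from this thread, so double-check against the current API.

```python
import torch
import torch.nn as nn
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

linear = nn.Linear(1024, 1024)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Quantize and keep everything on the CPU from the start; float32 is the
# safer compute dtype for CPU inference.
hqq_layer = HQQLinear(linear, quant_config, compute_dtype=torch.float32, device='cpu')

out = hqq_layer(torch.randn(1, 1024))
```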
