About the implementation of .cpu() #96

Open · reflectionie opened this issue Jul 19, 2024 · 1 comment

@reflectionie

Thanks for your work! May I ask when you expect to implement the .cpu() method of HQQLinear? Or could you briefly describe how to implement it? I could implement it myself and submit a PR:

```python
def cpu(self):
```

@mobicham (Collaborator) commented Jul 19, 2024

Thanks! It should be similar to .cuda(), but would use .to('cpu') instead:

hqq/hqq/core/quantize.py, lines 472 to 535 in b1a7c06:

```python
def cuda(self, device):
    self.meta["compute_dtype"] = self.compute_dtype

    if type(self.W_q) == nn.parameter.Parameter:
        self.W_q.data, self.meta = Quantizer.cuda(self.W_q.data, self.meta, device)
    else:
        self.W_q, self.meta = Quantizer.cuda(self.W_q, self.meta, device)

    if self.meta["quant_zero"]:
        if "zero_q" in self.meta:
            self.meta["zero_q"], self.meta["meta_zero"] = Quantizer.cuda(
                self.meta["zero_q"], self.meta["meta_zero"], device
            )
        else:
            _, self.meta["meta_zero"] = Quantizer.cuda(
                None, self.meta["meta_zero"], device
            )
    elif "zero" in self.meta:
        self.meta["zero"] = self.meta["zero"].to(device)

    if self.meta["quant_scale"]:
        if "scale_q" in self.meta:
            self.meta["scale_q"], self.meta["meta_scale"] = Quantizer.cuda(
                self.meta["scale_q"], self.meta["meta_scale"], device
            )
        else:
            _, self.meta["meta_scale"] = Quantizer.cuda(
                None, self.meta["meta_scale"], device
            )
    elif "scale" in self.meta:
        self.meta["scale"] = self.meta["scale"].to(device)

    # # Use zero/scale with streams for dequantization is faster than packing in "zero_scale"
    # for key in ["zero", "zero_q", "scale", "scale_q"]:
    #     if (key in self.meta) and self.offload_meta:
    #         self.meta[key] = self.meta[key].contiguous().cpu().pin_memory()

    if self.offload_meta:
        if "zero_scale" not in self.meta:
            if self.meta["quant_scale"] and self.meta["quant_zero"]:
                self.meta["zero_scale"] = torch.stack(
                    (self.meta["zero_q"], self.meta["scale_q"])
                )
                del self.meta["scale_q"], self.meta["zero_q"]
            else:
                self.meta["zero_scale"] = torch.stack(
                    (self.meta["zero"], self.meta["scale"])
                ).to(self.compute_dtype)
                del self.meta["scale"], self.meta["zero"]

        self.meta["zero_scale"] = (
            self.meta["zero_scale"].contiguous().cpu().pin_memory()
        )

    if self.bias is not None:
        self.bias = self.bias.to(device=device, dtype=self.compute_dtype)

    self.W_q = nn.Parameter(self.W_q, requires_grad=False)
    self.device = device
    self.in_gpu = True

    torch.cuda.empty_cache()

    return self
```
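For reference, here is a rough, untested sketch of what a .cpu() counterpart could look like as a method of HQQLinear, following the structure of .cuda() above. It is not from the repo: it only moves tensors with .to('cpu') and does not replicate the dtype/packing work that Quantizer.cuda performs, so treat it as a starting point. The meta key names (zero_q, meta_zero, etc.) are taken from the .cuda() code above; everything else is an assumption.

```python
# Sketch only (not from the repo). Assumes the packed W_q layout is usable
# on CPU as-is, and that meta_zero/meta_scale are dicts that may hold tensors.
def cpu(self):
    device = 'cpu'
    self.meta["compute_dtype"] = self.compute_dtype

    # Move the packed quantized weights, mirroring the .cuda() branches
    if type(self.W_q) == nn.parameter.Parameter:
        self.W_q.data = self.W_q.data.to(device)
    else:
        self.W_q = self.W_q.to(device)

    # Move any top-level zero/scale tensors (quantized or not, packed or not)
    for key in ["zero", "zero_q", "scale", "scale_q", "zero_scale"]:
        if key in self.meta and isinstance(self.meta[key], torch.Tensor):
            self.meta[key] = self.meta[key].to(device)

    # The nested meta for quantized zero/scale may also hold tensors
    for sub in ["meta_zero", "meta_scale"]:
        if sub in self.meta:
            for k, v in self.meta[sub].items():
                if isinstance(v, torch.Tensor):
                    self.meta[sub][k] = v.to(device)

    if self.bias is not None:
        self.bias = self.bias.to(device=device, dtype=self.compute_dtype)

    self.W_q = nn.Parameter(self.W_q, requires_grad=False)
    self.device = device
    self.in_gpu = False
    return self
```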

Right now it is a mess because we support quantizing the scale/zero values and offloading them to the CPU.
I think we are going to remove this in the future, which should make things much easier: #93 (comment)

May I ask why you need the .cpu() call? If you just want to use HQQLinear on the CPU, you can simply pass HQQLinear(..., device='cpu').
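For example, a layer can be quantized directly on the CPU. This sketch follows the constructor shown in the hqq README (HQQLinear(linear_layer, quant_config, compute_dtype=..., device=...)); the BaseQuantizeConfig arguments and layer sizes are illustrative defaults, not taken from this thread, so double-check against the current API.

```python
import torch
import torch.nn as nn
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

linear = nn.Linear(1024, 1024)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Quantize and keep everything on the CPU from the start; float32 is the
# safer compute dtype for CPU inference.
hqq_layer = HQQLinear(linear, quant_config, compute_dtype=torch.float32, device='cpu')

out = hqq_layer(torch.randn(1, 1024))
```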
