Commit

Update index.html
mobicham committed Nov 13, 2023
1 parent de47317 commit a358480
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions index.html
@@ -138,7 +138,7 @@ <h2 id="hqq" class="">Half-Quadratic Quantization</h2>
& \beta^{(t+1)}\leftarrow\kappa\beta^{(t)}
\end{array}$$

-<h4>Sub-problem (\( (sp}_{1} \))</h4>
+<h4>Sub-problem \text{(sp}_{1})</h4>

This problem takes the form of a <a href="https://web.stanford.edu/~boyd/papers/pdf/prox_algs.pdf">Proximal Operator</a>. When \( \phi() \) is the \( l_{1} \) norm, the solution is the <a href="https://sparse-plex.readthedocs.io/en/latest/book/opt/soft_thresholding.html">soft-thresholding operator</a>. A more general thresholding solution exists for the \( l_{p}\)-norm with \( 0 \le p \leq 1 \), known as the <a href="https://inria.hal.science/hal-01317151/file/lowrank_ieee_tip.pdf">generalized soft-thresholding operator</a>, which we adopt:

@@ -148,7 +148,7 @@ <h4>Sub-problem (\( (sp}_{1} \))</h4>
\end{array}$$
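
As a rough illustration of the thresholding operators named in the text above, here is a minimal scalar sketch in plain Python. The \( l_{1} \) case is the standard soft-thresholding formula. The \( l_{p} \) variant is written in the form sign(x) * max(|x| - |x|^(p-1) / beta, 0), which is an assumption here because the formula itself falls outside the lines shown in this diff. The function names and the parameters `lam`, `beta` and `p` are hypothetical.

```python
import math


def soft_threshold(x: float, lam: float) -> float:
    """Proximal operator of lam * |x| (the l1 case): shrink x toward 0 by lam."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


def generalized_soft_threshold(x: float, beta: float, p: float) -> float:
    """Assumed l_p shrinkage: sign(x) * max(|x| - |x|**(p - 1) / beta, 0)."""
    if x == 0.0:
        # |x|**(p - 1) is undefined at 0 when p < 1; the shrunk value is 0 anyway.
        return 0.0
    shrunk = abs(x) - abs(x) ** (p - 1.0) / beta
    return math.copysign(max(shrunk, 0.0), x)


# Element-wise use on a small list of residuals (illustrative values only).
residuals = [3.0, -0.5, 0.0, -2.0]
l1_result = [soft_threshold(r, 1.0) for r in residuals]
lp_result = [generalized_soft_threshold(r, beta=1.0, p=0.7) for r in residuals]
```

With `p = 1` and `beta = 1` the assumed \( l_{p} \) form reduces to soft-thresholding with threshold 1, which is a quick sanity check on the sketch.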


-<h4>Sub-problem (\( (sp}_{2} \))</h4>
+<h4>Sub-problem \text{(sp}_{2})</h4>
The second sub-problem can be rewritten as follows:
$$\begin{array}{c}
z^{(t+1)}\leftarrow\underset{z}{\text{argmin}}\,\frac{1}{2}||z-\left(W_{q}^{(t+1)}-\frac{(W-W_{e}^{(t+1)})}{s}\right)||_{2}^{2}\\
@@ -165,8 +165,8 @@ <h4>Sub-problem (\( (sp}_{2} \))</h4>
<h2 id="processing_time" class="">Processing Time</h2>
<p>We report the processing time to quantize the <a href="https://ai.meta.com/llama/">Llama2</a> models. We noticed that the processing time for GPTQ and AWQ varies drastically from one machine to another. GPTQ relies heavily on the CPU, which causes issues on virtual machines, so we limit the number of threads to those available in the virtual machine (32) to keep the process from hanging for hours. Our method performs the whole quantization on the GPU in half-precision and only uses the CPU to transfer data to the GPU once the solver is finished. </p>
<center><img src="figs/llama2-7b_time.png" /></center>
-center><img src="figs/llama2-13b_time.png" /></center>
-center><img src="figs/llama2-70b_time.png" /></center>
+<center><img src="figs/llama2-13b_time.png" /></center>
+<center><img src="figs/llama2-70b_time.png" /></center>

<h2 id="benchmark" class="">Benchmark</h2>
