Cheap-ML-models

A list of methods, resources and links on how to optimise the training, inference and throughput of expensive ML models.


Motivation

  • Relying on scale alone to improve performance means that resource consumption grows with it. This motivates research into more efficient methods.
  • This project is an attempt to collect such methods and findings in one place.
  • Efficiency is especially important on resource-constrained devices such as smartphones and embedded systems.

Methods

  • Pruning removes unnecessary weights from a network by zeroing them out (see the sketch after this list).
  • Quantisation reduces the computational cost of a model by lowering the precision of the weights' representation.
  • Distillation (teacher-student training) trains a smaller neural network to reproduce the outputs of a larger one.
  • ONNX Runtime was designed with a focus on performance and scalability in order to support heavy workloads in high-scale production scenarios.
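
As a concrete illustration of pruning, here is a minimal sketch using PyTorch's torch.nn.utils.prune module; the toy model, the layer choice and the 30% sparsity level are placeholder assumptions, not recommendations from this list.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder toy model; any nn.Module with Linear/Conv layers works the same way.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Zero out the 30% of weights with the smallest L1 magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)

# Make the pruning permanent (drops the mask and bakes the zeros into the weights).
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")

# Roughly 30% of the weights are now exactly zero.
linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
total = sum(m.weight.numel() for m in linears)
zeros = sum((m.weight == 0).sum().item() for m in linears)
print(f"sparsity: {zeros / total:.2%}")
```

Note that zeroed-out weights only translate into real speed-ups when the runtime can exploit them (sparse kernels or structured pruning); unstructured zeros mainly shrink the model after compression.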

Quantisation

Quantisation does not have to be applied uniformly to all parts of a model. For example, the forward and backward passes can be done in half precision while the parameters are stored and updated in full precision (mixed-precision training; see the sketch below). In neural networks, the time taken to process inputs and generate outputs (latency) is the sum of two components: data movement and arithmetic operations. Quantisation improves both facets: lower precision lets data move through the GPU faster, and it enables the specialised hardware in modern GPUs that accelerates low-precision matrix multiplications. However, quantising LLMs has proven significantly more challenging as they grow in size.
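
A minimal sketch of that mixed-precision recipe using PyTorch's automatic mixed precision (torch.cuda.amp); the model, optimiser and random data are placeholder assumptions, and a CUDA GPU is required.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()   # rescales the loss to avoid fp16 gradient underflow
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                # stand-in for a real data loader
    x = torch.randn(32, 512, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    # Forward pass runs selected ops in half precision...
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(x), y)
    # ...while the master weights and the optimiser update stay in full precision.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```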
There are several different types of quantisation method:

  • Fixed-point quantisation, in which each parameter or computation is represented with a fixed number of bits (e.g. 8-bit integers).
  • Floating-point (mixed-precision) quantisation, in which some parameters or computations are represented with higher precision than others.
  • Dynamic quantisation, in which the weights are quantised ahead of time and the activations are quantised on the fly during inference, using value ranges observed at runtime (see the sketch after this list).
  • Post-training quantisation, in which the weights and activations of an already-trained model are quantised without any retraining, typically using a small calibration set.
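
A minimal sketch of dynamic quantisation using PyTorch's torch.ao.quantization.quantize_dynamic; the toy model is a placeholder assumption.

```python
import torch
import torch.nn as nn

# Placeholder float32 model; dynamic quantisation targets Linear/LSTM layers.
model_fp32 = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model_fp32.eval()

# Weights are converted to int8 ahead of time; activations are quantised
# on the fly at inference using ranges observed at runtime.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 784)
print(model_int8(x).shape)  # same interface, smaller and typically faster on CPU
```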

Articles


Blogs