<p style="font-size: 15px; font-family: Verdana;">
    <b>Translations:</b>
    <a name="English version" color="#000">English</a>, <a name="Chinese version" href="https://zhuanlan.zhihu.com/p/555283334">Chinese</a>
</p>


<a name="introduction"></a>
<h1 style="font-family: Verdana; font-weight: bold;">Introduction</h1>
<p style="font-size: 15px; font-family: Verdana;">
    Despite the stunning success of Transformers in Natural Language Processing (NLP) tasks, it is still challenging to train them even on modern Graphics Processing Units (GPUs) or to deploy them in production, due to their massive number of parameters. When training or running inference with such large models, we are likely to run <a href="https://en.wikipedia.org/wiki/Out_of_memory">out of memory (OOM)</a>, or the process takes a very long time.<br><br>
Nevertheless, there are many approaches for avoiding such problems, so the main contribution of this article is to describe them and show how they can be applied in training and inference scripts. First, the article goes through basic approaches such as Gradient Accumulation, Freezing, Automatic Mixed Precision, 8-bit Optimizers, and Gradient Checkpointing, and then describes NLP-specific optimization approaches such as Dynamic Padding, Uniform Dynamic Padding, and Fast Tokenizers.

</p>

<a name="table_of_contents"></a>
<h1 style="font-family: Verdana; font-weight: bold;">Table of contents</h1>

<ul>
    <li style="font-size: 17px; font-family: Verdana;"><b><a href="#introduction">Introduction</a></b></li>
    <li style="font-size: 17px; font-family: Verdana;"><b><a href="#table_of_contents">Table of contents</a></b></li>
    <li style="font-size: 17px; font-family: Verdana;"><b><a href="#gradient_accumulation">Gradient Accumulation</a></b>
        <ul>
            <li style="font-size: 15px; font-family: Verdana;"><a href="#gradient_accumulation_vanilla_training">Vanilla training loop</a></li>
            <li style="font-size: 15px; font-family: Verdana;"><a href="#gradient_accumulation_training">Training loop with Gradient Accumulation</a></li>
        </ul>
    </li>
    <li style="font-size: 17px; font-family: Verdana;"><b><a href="#freezing">Freezing</a></b>
        <ul>
            <li style="font-size: 15px; font-family: Verdana;"><a href="#freezing_implementation">Implementation</a></li>
        </ul>
    </li>
    <li style="font-size: 17px; font-family: Verdana;"><b><a href="#amp">Automatic Mixed Precision</a></b>
        <ul>
            <li style="font-size: 15px; font-family: Verdana;"><a href="#amp_vanilla_training">Vanilla training loop</a></li>
            <li style="font-size: 15px; font-family: Verdana;"><a href="#amp_training">Training loop with Automatic Mixed Precision</a></li>
        </ul>
    </li>
    <li style="font-size: 17px; font-family: Verdana;"><b><a href="#8bit_optimizers">8-bit Optimizers</a></b>
        <ul>
            <li style="font-size: 15px; font-family: Verdana;"><a href="#8bit_optimizers_pytorch_training">Initializing optimizers via PyTorch</a></li>
            <li style="font-size: 15px; font-family: Verdana;"><a href="#8bit_optimizers_bytesandbites_training">Initializing optimizers via bitsandbytes</a></li>
        </ul>
    </li>
    <li style="font-size: 17px; font-family: Verdana;"><b><a href="#gradient_checkpointing">Gradient Checkpointing</a></b>
        <ul>
            <li style="font-size: 15px; font-family: Verdana;"><a href="#gradient_checkpointing_implementation">Implementation</a></li>
        </ul>
    </li>
            <li style="font-size: 17px; font-family: Verdana;"><b><a href="#fast_tokenizers">Fast Tokenizers</a></b>
        <ul>
            <li style="font-size: 15px; font-family: Verdana;"><a href="#fast_tokenizers_implementation">Implementation</a></li>
        </ul>
    </li>
            <li style="font-size: 17px; font-family: Verdana;"><b><a href="#dynamic_padding">Dynamic Padding</a></b>
        <ul>
<!--             <li style="font-size: 15px; font-family: Verdana;">Implementation</li> -->
        </ul>
    </li>
        <li style="font-size: 17px; font-family: Verdana;"><b><a href="#uniform_dynamic_padding">Uniform Dynamic Padding</a></b>
        <ul>
<!--             <li style="font-size: 15px; font-family: Verdana;">Implementation</li> -->
        </ul>
    </li>
        <li style="font-size: 17px; font-family: Verdana;"><b><a href="#conclusion">Conclusion</a></b></li>
        <li style="font-size: 17px; font-family: Verdana;"><b><a href="#references">References</a></b></li>
        <li style="font-size: 17px; font-family: Verdana;"><b><a href="#releases">Releases</a></b></li>
</ul>

In [None]:
!pip uninstall -q -y transformers

In [None]:
import sys
sys.path.append("../input/torch-components-library/torch-components-main")
sys.path.append("../input/transformers/src")
import transformers
import warnings
import os

os.environ["TOKENIZERS_PARALLELISM"] = "true"

warnings.simplefilter("ignore")
transformers.logging.set_verbosity_error()

<h1 style="font-family: Verdana; font-weight: bold;">Gradient Accumulation</h1>

<p style="font-size: 15px; font-family: Verdana;">
The idea behind Gradient Accumulation is very simple: simulate a larger batch size. A large batch size is sometimes necessary for better convergence or better performance, but it often requires a lot of memory. One possible solution is to use a smaller batch size. However, a small batch size increases training or inference time, and gradient descent algorithms are very sensitive to the choice of batch size, so a small batch may lead to unstable convergence and reduced performance. Instead, we can run multiple steps (accumulation steps), accumulate (average) the gradients over them, and perform the optimization step only once enough gradients have been accumulated.
</p>

<center><img src="https://miro.medium.com/max/1400/1*rJIH9gPhctTLCk5G5iQ_oA.png" width="1000px" alt="Gradient Accumulation image"></center>
<p style="font-size: 20px; font-family: Verdana; color: grey;"><center style="font-size: 17px; font-family: Verdana; color: grey;">Visualization of how Gradient Accumulation works</center></p>

<h3 style="font-family: Verdana; font-weight: bold;">Vanilla training loop</h3>


In [None]:
for step, batch in enumerate(loader, 1):
    
    # prepare inputs and targets for the model and loss function respectively.
    
    # forward pass
    outputs = model(inputs)
    
    # computing loss
    loss = loss_fn(outputs, targets)
    
    # backward pass
    loss.backward()
    
    # perform optimization step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    model.zero_grad()
    
    # perform validation loop
    if step % validation_steps == 0:
        validation_loop()

<h3 style="font-family: Verdana; font-weight: bold;">Training loop with Gradient Accumulation</h3>

In [None]:
steps = len(loader)

# `validation_steps` counts optimizer steps, so convert it to loader steps:
# validation then runs once every `validation_steps` optimizer updates.
validation_steps = int(validation_steps * gradient_accumulation_steps)

for step, batch in enumerate(loader, 1):
    
    # prepare inputs and targets for the model and loss function respectively.
    
    # forward pass
    outputs = model(inputs)
    
    # computing loss
    loss = loss_fn(outputs, targets)
    
    # accumulating gradients over steps
    if gradient_accumulation_steps > 1:
        loss = loss / gradient_accumulation_steps
    
    # backward pass
    loss.backward()
    
    # perform optimization step after certain number of accumulating steps and at the end of epoch
    if step % gradient_accumulation_steps == 0 or step == steps:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
        model.zero_grad()
        
    # perform validation loop
    if step % validation_steps == 0:
        validation_loop()
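
<p style="font-size: 15px; font-family: Verdana;">
To check that dividing the loss by the number of accumulation steps really simulates a larger batch, here is a minimal sketch on a toy linear model. The data, the model, and the helper make_model are made up for illustration. It assumes equally sized micro-batches and a loss with mean reduction, which is what makes the two gradients match.
</p>

```python
import torch

torch.manual_seed(0)

# Toy regression problem: 8 samples with 4 features (all values are made up).
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)
loss_fn = torch.nn.MSELoss()  # mean reduction


def make_model():
    """Hypothetical helper: builds the same initial weights for both runs."""
    torch.manual_seed(1)
    return torch.nn.Linear(4, 1)


# Reference: a single backward pass over the full batch of 8 samples.
full_model = make_model()
loss_fn(full_model(inputs), targets).backward()

# Gradient Accumulation: 4 micro-batches of 2 samples each.
gradient_accumulation_steps = 4
accumulated_model = make_model()
for micro_inputs, micro_targets in zip(inputs.chunk(4), targets.chunk(4)):
    loss = loss_fn(accumulated_model(micro_inputs), micro_targets)
    # each backward() call adds to the gradients already stored in `.grad`
    (loss / gradient_accumulation_steps).backward()

gradients_match = torch.allclose(
    full_model.weight.grad, accumulated_model.weight.grad, atol=1e-6
)
print(f"Accumulated gradients match full-batch gradients: {gradients_match}")
```

<p style="font-size: 15px; font-family: Verdana;">
If the last group of an epoch contains fewer micro-batches than the others, its averaged gradient is scaled down accordingly, so the equivalence above holds only for complete groups.
</p>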

<h1 style="font-family: Verdana; font-weight: bold;">Freezing</h1>

<p style="font-size: 15px; font-family: Verdana;">
Freezing is an effective way to speed up training and decrease memory utilization, almost without losing final quality, by disabling gradient computation in certain layers of the model.<br><br>
A common observation in Deep Learning is that lower layers learn general patterns of the input data, while top layers learn high-level features that are specific to the target task. When performing the optimization step with an optimization algorithm (e.g. SGD, AdamW, or RMSprop), the lower layers receive small gradients, so their parameters barely change. This is called <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">Gradient Vanishing</a>. Instead of computing such "useless" gradients and optimizing these low-gradient parameters, which can require a lot of time and computational power, we can simply freeze them.<br><br>
PyTorch provides a convenient API for toggling gradient computation. This behavior is controlled by the <a href="https://pytorch.org/docs/stable/generated/torch.Tensor.requires_grad.html#torch.Tensor.requires_grad">requires_grad</a> property of <a href="https://pytorch.org/docs/stable/tensors.html">torch.Tensor</a>.

</p>

<h3 style="font-family: Verdana; font-weight: bold;">Implementation</h3>

In [None]:
def freeze(module):
    """
    Freezes module's parameters.
    """
    
    for parameter in module.parameters():
        parameter.requires_grad = False
        
def get_freezed_parameters(module):
    """
    Returns names of freezed parameters of the given module.
    """
    
    freezed_parameters = []
    for name, parameter in module.named_parameters():
        if not parameter.requires_grad:
            freezed_parameters.append(name)
            
    return freezed_parameters

In [None]:
import torch
from transformers import AutoConfig, AutoModel


# initializing model
model_path = "microsoft/deberta-v3-base"
config = AutoConfig.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, config=config)


# freezing embeddings and first 2 layers of encoder
freeze(model.embeddings)
freeze(model.encoder.layer[:2])

freezed_parameters = get_freezed_parameters(model)
print(f"Freezed parameters: {freezed_parameters}")

# select only the parameters that require gradients and initialize the optimizer with them
model_parameters = filter(lambda parameter: parameter.requires_grad, model.parameters())
optimizer = torch.optim.AdamW(params=model_parameters, lr=2e-5, weight_decay=0.0)
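
<p style="font-size: 15px; font-family: Verdana;">
The effect of freezing can also be inspected without downloading a pretrained model. The sketch below uses a small made-up network (toy_model is a stand-in for a Transformer, not the DeBERTa model above), freezes its embedding layer, and checks that the frozen weights receive no gradient after a backward pass.
</p>

```python
import torch

# Hypothetical stand-in for a Transformer: an embedding followed by two layers.
toy_model = torch.nn.Sequential(
    torch.nn.Embedding(100, 16),
    torch.nn.Linear(16, 16),
    torch.nn.Linear(16, 2),
)

# Freeze the embedding layer by disabling gradients for its parameters.
for parameter in toy_model[0].parameters():
    parameter.requires_grad = False

trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in toy_model.parameters() if not p.requires_grad)

# One backward pass: frozen parameters never receive a gradient.
token_ids = torch.randint(0, 100, (4, 8))
toy_model(token_ids).sum().backward()

print(f"trainable parameters: {trainable}, frozen parameters: {frozen}")
print(f"embedding gradient is None: {toy_model[0].weight.grad is None}")
```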

<h1 style="font-family: Verdana; font-weight: bold;">Automatic Mixed Precision</h1>

<p style="font-size: 15px; font-family: Verdana;">
Automatic Mixed Precision (AMP) is another very simple way of reducing memory consumption and training time without losing final quality. It was introduced in the <a href="https://arxiv.org/abs/1710.03740">"Mixed Precision Training"</a> paper by NVIDIA and Baidu researchers in 2017. The key idea is to use lower precision for the tensors that dominate memory and compute: instead of full precision (e.g. float32), most operations, activations, and gradients use half precision (e.g. float16), while a float32 master copy of the weights is kept for the update step. However, when computing gradients in lower precision, some values can be so small that they are treated as zeros; this phenomenon is called <a href="https://en.wikipedia.org/wiki/Arithmetic_underflow">"underflow"</a>. To prevent underflow, the authors of the original paper proposed a gradient (loss) scaling method.<br><br>
PyTorch provides a package with the necessary functionality (from lowering precision to gradient scaling) for Automatic Mixed Precision, called <a href="https://pytorch.org/docs/stable/amp.html">torch.cuda.amp</a>. Autocasting is implemented as a context manager, so it can easily be inserted into training and inference scripts.


</p>

<center><img src="https://developer-blogs.nvidia.com/wp-content/uploads/2019/01/pasted-image-0-21.png" alt="Automatic Mixed Precision image"></center>
<p style="font-size: 20px; font-family: Verdana; color: grey;"><center style="font-size: 17px; font-family: Verdana; color: grey;">Visualization of how Automatic Mixed Precision works</center></p>
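
<p style="font-size: 15px; font-family: Verdana;">
The reason scaling is needed can be seen with a single number. The sketch below is an illustration, not part of the AMP implementation: it casts a very small made-up gradient value to float16 with and without scaling. The scale factor 65536 is chosen for illustration and, as far as I know, equals the default initial scale of GradScaler.
</p>

```python
import torch

# A made-up gradient value that is too small for float16.
small_gradient = torch.tensor(1e-8, dtype=torch.float32)

# Casting directly to float16 loses the value: it is below the smallest
# positive float16 number (about 6e-8), so it is rounded to zero.
unscaled_half = small_gradient.to(torch.float16)

# Multiplying by a scale factor first keeps the value representable.
scale = 65536.0  # 2**16, used here purely for illustration
scaled_half = (small_gradient * scale).to(torch.float16)

# Dividing by the same factor in float32 recovers the original magnitude.
recovered = scaled_half.to(torch.float32) / scale

print(f"float16 without scaling: {unscaled_half.item()}")
print(f"recovered after scaling: {recovered.item():.3e}")
```

<p style="font-size: 15px; font-family: Verdana;">
This is the same order of operations as in a training loop with a gradient scaler: the loss is scaled before the backward pass, and the gradients are unscaled before clipping and the optimizer step.
</p>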

<h3 style="font-family: Verdana; font-weight: bold;">Vanilla training loop</h3>

In [None]:
for step, batch in enumerate(loader, 1):
    
    # prepare inputs and targets for the model and loss function respectively.
    
    # forward pass
    outputs = model(inputs)
    
    # computing loss
    loss = loss_fn(outputs, targets)
    
    # backward pass
    loss.backward()
    
    # perform optimization step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    model.zero_grad()

<h3 style="font-family: Verdana; font-weight: bold;">Training loop with Automatic Mixed Precision</h3>

In [None]:
from torch.cuda.amp import autocast, GradScaler


scaler = GradScaler()

for step, batch in enumerate(loader, 1):
    
    # prepare inputs and targets for the model and loss function respectively.

    # forward pass and loss computation under the `autocast` context manager
    with autocast(enabled=True):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    
    # scale the loss and perform the backward pass (the gradients are scaled too)
    scaler.scale(loss).backward()
    
    # the gradients must be unscaled before gradient clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    
    # perform optimization step (skipped if gradients contain inf/NaN) and update the scale
    scaler.step(optimizer)
    scaler.update()
    model.zero_grad()

<h1 style="font-family: Verdana; font-weight: bold;">8-bit Optimizers</h1>

<p style="font-size: 15px; font-family: Verdana;">
    The idea of 8-bit Optimizers is similar to Automatic Mixed Precision, where the model's parameters and gradients are kept in lower precision, but 8-bit Optimizers additionally keep the optimizer's state in lower precision. The authors (Meta Research) described 8-bit Optimizers in detail in the original paper <a href="https://arxiv.org/abs/2110.02861">"8-bit Optimizers via Block-wise Quantization"</a>, and showed that they significantly decrease memory utilization and slightly speed up training. Additionally, the authors studied the impact of different hyperparameter settings and showed that 8-bit Optimizers are stable across different choices of learning rate, betas, and weight decay, without losing performance or hurting convergence. The authors also provide a convenient high-level library for 8-bit Optimizers, called <a href="https://github.com/facebookresearch/bitsandbytes">bitsandbytes</a>. 
    
</p>

<center><img src="https://i.ibb.co/9bj3JqG/Screenshot-3.png" alt="8-bit Optimizers table" border="0"></center>
<p style="font-size: 20px; font-family: Verdana; color: grey;"><center style="font-size: 17px; font-family: Verdana; color: grey;">Comparison table of different optimizers</center></p>
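
<p style="font-size: 15px; font-family: Verdana;">
A rough back-of-the-envelope estimate shows where the memory saving comes from. The sketch below assumes an Adam-style optimizer that keeps two state values per parameter, and it ignores the small overhead of the quantization constants as well as any parameters kept in 32 bits. The function optimizer_state_megabytes and the parameter count are hypothetical, chosen only to make the numbers concrete.
</p>

```python
def optimizer_state_megabytes(num_parameters, bits_per_value, states_per_parameter=2):
    """Approximate optimizer-state size in MB.

    Adam-style optimizers keep two values (first and second moment)
    for every trainable parameter.
    """
    total_bytes = num_parameters * states_per_parameter * bits_per_value / 8
    return total_bytes / 1024 ** 2


# Hypothetical model size, not a measurement of any particular model.
num_parameters = 100_000_000

state_32bit = optimizer_state_megabytes(num_parameters, bits_per_value=32)
state_8bit = optimizer_state_megabytes(num_parameters, bits_per_value=8)

print(f"32-bit optimizer state: {state_32bit:.0f} MB")
print(f" 8-bit optimizer state: {state_8bit:.0f} MB")
print(f"reduction factor: {state_32bit / state_8bit:.0f}x")
```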

<h3 style="font-family: Verdana; font-weight: bold;">Initializing optimizer via PyTorch API</h3>

In [None]:
import torch
from transformers import AutoConfig, AutoModel

# initializing model
model_path = "microsoft/deberta-v3-base"
config = AutoConfig.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, config=config)


# select only the parameters that require gradients
model_parameters = filter(lambda parameter: parameter.requires_grad, model.parameters())

# initializing optimizer
optimizer = torch.optim.AdamW(params=model_parameters, lr=2e-5, weight_decay=0.0)
print(f"32-bit Optimizer:\n\n{optimizer}")

<h3 style="font-family: Verdana; font-weight: bold;">Initializing optimizer via bitsandbytes API</h3>

In [None]:
!pip install -q bitsandbytes-cuda110

In [None]:
import bitsandbytes as bnb


def set_embedding_parameters_bits(embeddings_path, optim_bits=32):
    """
    Overrides the optimizer-state precision for the embedding weights
    (32-bit by default), so that they are not quantized to 8 bits.

    https://github.com/huggingface/transformers/issues/14819#issuecomment-1003427930
    """
    
    embedding_types = ("word", "position", "token_type")
    for embedding_type in embedding_types:
        attr_name = f"{embedding_type}_embeddings"
        
        # skip embedding types that the model does not have
        if hasattr(embeddings_path, attr_name): 
            bnb.optim.GlobalOptimManager.get_instance().register_module_override(
                getattr(embeddings_path, attr_name), 'weight', {'optim_bits': optim_bits}
            )

In [None]:
import bitsandbytes as bnb


# select only the parameters that require gradients
model_parameters = filter(lambda parameter: parameter.requires_grad, model.parameters())

# initializing optimizer 
bnb_optimizer = bnb.optim.AdamW(params=model_parameters, lr=2e-5, weight_decay=0.0, optim_bits=8)
# bnb_optimizer = bnb.optim.AdamW8bit(params=model_parameters, lr=2e-5, weight_decay=0.0) # equivalent to the above line

# setting embeddings parameters
set_embedding_parameters_bits(embeddings_path=model.embeddings)

print(f"8-bit Optimizer:\n\n{bnb_optimizer}")

<h1 style="font-family: Verdana; font-weight: bold;">Gradient Checkpointing</h1>

<p style="font-size: 15px; font-family: Verdana;">
Sometimes, even with a small batch size and other optimization techniques (e.g. Gradient Accumulation, Freezing, or Automatic Mixed Precision), we can still run out of memory, especially when the model is very large. One powerful solution to this issue is Gradient Checkpointing, which was first introduced in the <a href="https://arxiv.org/abs/1604.06174">"Training Deep Nets With Sublinear Memory Cost"</a> paper in 2016. The authors demonstrated that Gradient Checkpointing can reduce memory utilization from $ O(n) $ to $ O(\sqrt{n}) $, where $ n $ is the number of layers in the model. This approach allows training large models on a single GPU, or frees memory for a larger batch size and thus better and faster convergence.

</p>

<center><img src="https://miro.medium.com/max/1400/0*nMSeZxl6ppnrivgv."></center>
<p style="font-size: 20px; font-family: Verdana; color: grey;"><center style="font-size: 17px; font-family: Verdana; color: grey;">Number of blocks (layers) versus memory utilization in megabytes</center></p>

<p style="font-size: 15px; font-family: Verdana;">
The idea behind Gradient Checkpointing is to process the network in small chunks: only a subset of intermediate activations (the checkpoints) is kept in memory during the forward pass, and the discarded activations are recomputed chunk by chunk during backpropagation. This reduces memory utilization, although it requires extra computation to rebuild the parts of the backpropagation graph that were dropped.

</p>

<center><img src="https://miro.medium.com/max/1082/0*s7U1QDfSXuVd1LrF." width="1000px"></center>

<p style="font-size: 20px; font-family: Verdana; color: grey;"><center style="font-size: 17px; font-family: Verdana; color: grey;">Demonstration of how Gradient Checkpointing works in forward and backpropagation passes</center></p>

<p style="font-size: 15px; font-family: Verdana;">
PyTorch provides gradient checkpointing out of the box via the <a href="https://pytorch.org/docs/stable/checkpoint.html#torch.utils.checkpoint.checkpoint">torch.utils.checkpoint.checkpoint</a> and <a href="https://pytorch.org/docs/stable/checkpoint.html#torch.utils.checkpoint.checkpoint_sequential">torch.utils.checkpoint.checkpoint_sequential</a> functions. The PyTorch documentation describes the mechanism as follows:<br><br>
    <i>"Specifically, in the forward pass, function will run in torch.no_grad() manner, i.e., not storing the intermediate activations. Instead, the forward pass saves the inputs tuple and the function parameter. In the backwards pass, the saved inputs and function is retrieved, and the forward pass is computed on function again, now tracking the intermediate activations, and then the gradients are calculated using these activation values."</i><br><br>
Additionally, HuggingFace Transformers supports Gradient Checkpointing as well. It can be enabled with the <a href="https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.gradient_checkpointing_enable">gradient_checkpointing_enable</a> method of a <a href="https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel">PreTrainedModel</a> instance.
</p>
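
<p style="font-size: 15px; font-family: Verdana;">
As a minimal sketch of the plain PyTorch route, the example below wraps a small made-up block (a stand-in for one Transformer layer) in torch.utils.checkpoint.checkpoint and checks that the gradients are the same as in a regular backward pass. It assumes a PyTorch version recent enough to accept the use_reentrant argument. It only shows that the result is unchanged and does not measure memory.
</p>

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Hypothetical block standing in for one Transformer layer.
block = torch.nn.Sequential(
    torch.nn.Linear(32, 32),
    torch.nn.GELU(),
    torch.nn.Linear(32, 32),
)
inputs = torch.randn(4, 32, requires_grad=True)

# Regular forward/backward: intermediate activations are stored for backward.
block(inputs).sum().backward()
reference_grad = inputs.grad.clone()

# Checkpointed forward/backward: activations inside `block` are not stored,
# they are recomputed from `inputs` during the backward pass.
inputs.grad = None
block.zero_grad()
checkpoint(block, inputs, use_reentrant=False).sum().backward()

same_gradients = torch.allclose(reference_grad, inputs.grad)
print(f"Checkpointed gradients match regular gradients: {same_gradients}")
```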


<h3 style="font-family: Verdana; font-weight: bold;">Implementation</h3>

In [None]:
from transformers import AutoConfig, AutoModel

# https://github.com/huggingface/transformers/issues/9919
from torch.utils.checkpoint import checkpoint


# initializing model
model_path = "microsoft/deberta-v3-base"
config = AutoConfig.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, config=config)


# gradient checkpointing
model.gradient_checkpointing_enable()
print(f"Gradient Checkpointing: {model.is_gradient_checkpointing}")

<h1 style="font-family: Verdana; font-weight: bold;">Fast Tokenizers</h1>

<p style="font-size: 15px; font-family: Verdana;">
<a href="https://huggingface.co/docs/transformers/v4.19.3/en/index">HuggingFace Transformers</a> provides two types of Tokenizers: Base and Fast. The main difference between them is that Fast Tokenizers are written in Rust, since Python is very slow at the loops that tokenization requires. This gives us an additional speed-up during the tokenization process. The type of Tokenizer can be easily changed through the HuggingFace Transformers API, in the <a href="https://huggingface.co/docs/transformers/v4.19.3/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained">from_pretrained</a> method of <a href="https://huggingface.co/docs/transformers/v4.19.3/en/model_doc/auto#transformers.AutoTokenizer">transformers.AutoTokenizer</a>, by setting the <a href="https://huggingface.co/docs/transformers/v4.19.3/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained.use_fast">use_fast</a> argument to True.
</p>

<center><img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter6/tokenization_pipeline.svg" width="1000px" alt="Tokenization process"></center>

<p style="font-size: 20px; font-family: Verdana; color: grey;"><center style="font-size: 17px; font-family: Verdana; color: grey;">Visualization of how Tokenization works</center></p>

<h3 style="font-family: Verdana; font-weight: bold;">Implementation</h3>

In [None]:
from transformers import AutoTokenizer

# initializing Base version of Tokenizer
model_path = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
print(f"Base version Tokenizer:\n\n{tokenizer}", end="\n"*3)

# initializing Fast version of Tokenizer
fast_tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
print(f"Fast version Tokenizer:\n\n{fast_tokenizer}")

<h1 style="font-family: Verdana; font-weight: bold;">Dynamic Padding</h1>

<p style="font-size: 15px; font-family: Verdana;">
Generally, models are trained with batches of inputs, and each input in a batch must have a fixed size, i.e. the batch must be representable as a matrix. The fixed size is often selected based on the length distribution in the dataset, the number of features, or other factors. In NLP tasks, the input size is the length of the text in tokens and is called the max length. Unfortunately, different texts have different lengths, so to handle such cases, researchers proposed padding tokens and truncation. Truncation is applied when the max length is smaller than the input text's length, so some tokens (often the last ones) are removed. Padding tokens are special tokens that are added to the end of the input text when its length is smaller than the max length. It is also worth noting that padding tokens should not be included in the loss calculation in some tasks (e.g. Masked Language Modeling or Named Entity Recognition).
</p>

<img src="https://i.ibb.co/J2j895v/fixed-padding-length-1.png" alt="fixed-padding-length-1" border="0">
<p style="font-size: 20px; font-family: Verdana; color: grey;"><center style="font-size: 17px; font-family: Verdana; color: grey;">Visualization of Fixed Padding in the batch</center></p>

<p style="font-size: 15px; font-family: Verdana;">
 However, padding tokens have a significant drawback: they are very inefficient and require additional memory when the input text is very short relative to the chosen max length. To avoid these extra computations, developers proposed a very effective approach: pad the inputs of a batch only to the maximum input length within that batch. Although the change seems insignificant, such an approach can speed up training by 35% or even 50%! The actual speed-up and memory savings, however, depend on the batch size and the distribution of lengths.

</p>

<center><img src="https://i.ibb.co/BzRMzgx/dynamic-padding.png" alt="dynamic-padding" border="0"></center>

<p style="font-size: 20px; font-family: Verdana; color: grey;"><center style="font-size: 17px; font-family: Verdana; color: grey;">Visualization of how Dynamic Padding works in the batch</center></p>
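
<p style="font-size: 15px; font-family: Verdana;">
The difference between fixed and dynamic padding can be sketched in plain Python. The function pad_batch and the token ids below are made up for illustration. In a real training script this logic would typically live in the collate function of the data loader, or be delegated to a ready-made collator such as DataCollatorWithPadding from HuggingFace Transformers.
</p>

```python
def pad_batch(token_id_lists, pad_token_id=0, max_length=None):
    """Pads a batch of token-id lists.

    With `max_length` set, every text is padded (or truncated) to that fixed
    length. With `max_length=None`, the batch is padded only to its own
    longest text, which is Dynamic Padding.
    """
    if max_length is None:
        max_length = max(len(ids) for ids in token_id_lists)

    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        ids = ids[:max_length]  # truncation, if the text is too long
        padding = max_length - len(ids)
        input_ids.append(ids + [pad_token_id] * padding)
        attention_mask.append([1] * len(ids) + [0] * padding)

    return input_ids, attention_mask


# Made-up token ids for three short texts.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 3, 102]]

fixed_ids, fixed_mask = pad_batch(batch, max_length=16)  # Fixed Padding
dynamic_ids, dynamic_mask = pad_batch(batch)             # Dynamic Padding

print(f"fixed length: {len(fixed_ids[0])}, dynamic length: {len(dynamic_ids[0])}")
```

<p style="font-size: 15px; font-family: Verdana;">
The attention mask marks padding positions with zeros, so the model can ignore them and the loss can skip them where required.
</p>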

<h1 style="font-family: Verdana; font-weight: bold;">Uniform Dynamic Padding</h1>

<p style="font-size: 15px; font-family: Verdana;">
There is one additional approach based on Dynamic Padding, called Uniform Dynamic Padding. The idea is to sort texts by length beforehand and not shuffle samples during training or inference, so that each batch contains texts of similar length. This approach is very effective and needs fewer computations during training or inference than Dynamic Padding. However, it is not recommended to use Uniform Dynamic Padding during training, since training relies on shuffling the inputs.

</p>

<img src="https://i.ibb.co/MCBJj71/uniform-length-batching.png" alt="uniform-length-batching" border="0">
<p style="font-size: 20px; font-family: Verdana; color: grey;"><center style="font-size: 17px; font-family: Verdana; color: grey;">Visualization of how Uniform Dynamic Padding works in the batch</center></p>
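
<p style="font-size: 15px; font-family: Verdana;">
To see why sorting helps, the sketch below counts how many padding tokens are needed with and without sorting by length. The helper count_padding_tokens and the list of lengths are made up for illustration, and each batch is assumed to be padded to its own longest text.
</p>

```python
def count_padding_tokens(lengths, batch_size):
    """Total padding tokens when each batch is padded to its own longest text."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - length for length in batch)
    return total


# Made-up text lengths (in tokens), in their original dataset order.
lengths = [12, 90, 15, 85, 10, 95, 14, 88]

# Dynamic Padding: batches are formed in the original order.
dynamic_padding = count_padding_tokens(lengths, batch_size=2)

# Uniform Dynamic Padding: texts are sorted by length first.
uniform_padding = count_padding_tokens(sorted(lengths), batch_size=2)

print(f"padding tokens without sorting: {dynamic_padding}")
print(f"padding tokens with sorting:    {uniform_padding}")
```

<p style="font-size: 15px; font-family: Verdana;">
When sorting is used for inference, the original sample indices should be kept so that the predictions can be put back into the dataset order afterwards.
</p>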

<h1 style="font-family: Verdana; font-weight: bold;">Conclusion</h1>

<p style="font-size: 15px; font-family: Verdana;">
Optimization is a necessary step in developing models, even on modern GPUs. For this reason, this article went through some of the most powerful and popular approaches for speeding up training and reducing the memory consumption of large models such as Transformers.
</p>

<a name="references"></a>
<h1 style="font-family: Verdana; font-weight: bold;">References</h1>

<p style="font-size: 15px; font-family: Verdana;">The following links were used while writing this article:</p>
<ul>
    <li style="font-size: 15px; font-family: Verdana;"><a href="https://huggingface.co/docs/transformers/performance">Performance and Scalability: How To Fit a Bigger Model and Train It Faster</a></li>
    <li style="font-size: 15px; font-family: Verdana;"><a href="https://www.kaggle.com/code/rhtsingh/speeding-up-transformer-w-optimization-strategies">Speeding up Transformer w/ Optimization Strategies</a></li>
    <li style="font-size: 15px; font-family: Verdana;"><a href="https://www.kaggle.com/competitions/AI4Code/discussion/327777">Things you can try to speed up training speed and preventing memory shortage if you are using transformers.</a></li>
    <li style="font-size: 15px; font-family: Verdana;"><a href="https://www.kaggle.com/competitions/feedback-prize-2021/discussion/303131">8-bit Adam and other memory optimizations</a></li>
    <li style="font-size: 15px; font-family: Verdana;"><a href="https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9">Fitting larger networks into memory.</a></li>
    
</ul>

<a name="releases"></a>
<h1 style="font-family: Verdana; font-weight: bold;">Releases</h1>
<ul>
    <li style="font-size: 15px; font-family: Verdana;"><b>26.06.2022</b> - updated "Training loop with Gradient Accumulation", added validation loop and changed some symbols.</li>
    <li style="font-size: 15px; font-family: Verdana;"><b>18.08.2022</b> - added Chinese translation provided by <a href="https://www.kaggle.com/zachary666">@zachary666</a></li>
</ul>