# Getting started


<div class="alert alert-info">

<b>Warning</b>

Precision debug tools with Nvidia-DLFramework-Inspect for Transformer Engine are currently supported only for PyTorch.

</div>

Transformer Engine supports FP8 training, which can significantly speed up training for most transformers without any noticeable difference in accuracy compared to higher precision. However, in some cases, reducing precision may lead to a drop in accuracy or cause instability during training. To address this, we provide specialized precision debug tools that help identify potential issues. They allow, for example:

- switching a specific GEMM operation from FP8 to higher precision,
- selecting between immediate or delayed scaling of FP8 tensors,
- detailed logging of tensor statistics.

Moreover, you can use Nvidia-DLFramework-Inspect with Transformer Engine to add your own debug features. Fake casting to a user-defined precision is described in [link](link).

There are 3 things one needs to do to use Transformer Engine debug features:

1. Define `TransformerLayer` and other TE layers with option `debug=True`.
2. Create a **config.yaml** file to configure the desired features.
3. Install, import, and initialize the Nvidia-DLFramework-Inspect tool.

<figure align="center">
<img src="./img/introduction.svg">
    <figcaption> Fig 1: To use debug features, one needs to init the Transformer Layer with `debug=True` and specify the configuration in the `config.yaml` file. This configuration can affect the behavior of the GEMMs. </figcaption>
</figure>

Let's look at a simple example of training a Transformer layer using Transformer Engine with FP8 precision. This example demonstrates how to set up the layer, define an optimizer, and perform a few training iterations using dummy data.

```python
# train.py

from transformer_engine.pytorch import TransformerLayer
import torch
import torch.nn as nn
import torch.optim as optim
import transformer_engine.pytorch as te  # fp8_autocast lives in the PyTorch submodule

hidden_size = 512
num_attention_heads = 8

transformer_layer = TransformerLayer(
    hidden_size=hidden_size,
    num_attention_heads=num_attention_heads
).cuda()

dummy_input = torch.randn(10, 32, hidden_size).cuda()
criterion = nn.MSELoss()
optimizer = optim.Adam(transformer_layer.parameters(), lr=1e-4)
dummy_target = torch.randn(10, 32, hidden_size).cuda()

for epoch in range(5):
    transformer_layer.train()
    optimizer.zero_grad()
    with te.fp8_autocast(enabled=True):
        output = transformer_layer(dummy_input)
    loss = criterion(output, dummy_target)
    loss.backward()
    optimizer.step()
```

We will demonstrate two debug features on the code above:

1. Disabling FP8 precision for specific GEMM operations, such as the FC1 and FC2 forward propagation GEMMs.
2. Logging tensor statistics for other layers, such as activation statistics for the LayerNormLinear layer.

#### Requirements

To use the debug features of Transformer Engine, you need to install the [Nvidia-DLFramework-Inspect](link) package provided by NVIDIA. You can install it with the following commands:

```
git clone [link]
cd Nvidia-DLFramework-Inspect
pip install .
```
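After installation, a quick sanity check can confirm that the package is importable before wiring it into a training script. The sketch below assumes the package exposes the `nvtorch_inspect` module that is imported later in this guide; the helper name `is_installed` is hypothetical and not part of the tool.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: `nvtorch_inspect` is the module name used later in this guide.
    name = "nvtorch_inspect"
    status = "found" if is_installed(name) else "NOT found"
    print(f"{name}: {status}")
```

If the module is reported as not found, re-run `pip install .` inside the cloned repository using the same Python environment that runs the training script.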

#### Config file

We need to prepare a **config.yaml** file, as shown below:

```yaml
# config.yaml

fc1_fprop_to_fp8:
  enabled: True
  layers:
    layer_types: [fc1, fc2] # contains fc1 or fc2 in name
  transformer_engine:
    DisableFp8Gemm:
      enabled: True
      gemms: [fprop]

log_tensor_stats:
  enabled: True
  layers:
    layer_types: [layernorm_linear] # contains layernorm_linear in name
  transformer_engine:
    LogTensorStats:
      enabled: True
      stats: [max, min, mean, std, l1_norm]
      tensors: [activation]
      freq: 1
      start_step: 2
      end_step: 5
```

Further explanation on how to create config files is in the next section of this tutorial.
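The comments next to `layer_types` in the config say that a section applies to layers whose name contains one of the listed strings. A minimal sketch of that selection rule follows, assuming plain substring matching; the actual matching logic inside Nvidia-DLFramework-Inspect may differ. The helper `matches_layer_types` is hypothetical, and the layer names are the ones that appear in the logs later in this guide.

```python
def matches_layer_types(layer_name: str, layer_types: list[str]) -> bool:
    """Assumed rule: a layer is selected if its name contains any listed string."""
    return any(layer_type in layer_name for layer_type in layer_types)


# Layer names as they appear in the debug logs of this tutorial.
layer_names = [
    "transformer_layer.self_attn.layernorm_linear_qkv",
    "transformer_layer.self_attn.proj",
    "transformer_layer.layernorm_mlp.fc1",
    "transformer_layer.layernorm_mlp.fc2",
]

# Section with `layer_types: [fc1, fc2]` (DisableFp8Gemm).
disable_fp8 = [n for n in layer_names if matches_layer_types(n, ["fc1", "fc2"])]

# Section with `layer_types: [layernorm_linear]` (LogTensorStats).
log_stats = [n for n in layer_names if matches_layer_types(n, ["layernorm_linear"])]

print("DisableFp8Gemm applies to:", disable_fp8)
print("LogTensorStats applies to:", log_stats)
```

Under this assumption, `layernorm_mlp.fc1` is not selected by `layernorm_linear`, because the substring `layernorm_linear` does not occur in its name.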

#### Adjusting Python file

```python
# (...)

import nvtorch_inspect.api as nvinspect_api
nvinspect_api.initialize(
    config_file="./config.yaml",
    feature_dirs=["/path/to/transformer_engine/debug/features"],
    log_dir="./log",
    default_logging_enabled=True)

# (...)

transformer_layer = TransformerLayer(..., debug=True, debug_name="transformer_layer").cuda()

# (...)
```

In the modified code above, the following changes were made:

1. Added an import for `nvtorch_inspect.api`.
2. Initialized Nvidia-DLFramework-Inspect by calling `nvinspect_api.initialize()`, specifying the path to the config file, the feature directories, and the log directory.
3. Modified the instantiation of `TransformerLayer` to pass `debug=True` and `debug_name="transformer_layer"`. The name helps identify the specific layer in the debug output.

#### Inspecting the logs

Let's look at the log files. Two files are created:

1. A main debug log.
2. A statistics log.

Let's look inside them!

```
# log/debug_logs/debug_log_globalrank-0.log

INFO - [DEBUG-INFO] Default logging to file enabled at ./log
INFO - transformer_layer.self_attn.layernorm_linear_qkv: Fprop Activation: Delayed Scaling
INFO - transformer_layer.self_attn.layernorm_linear_qkv: Fprop Weight: Delayed Scaling
INFO - transformer_layer.self_attn.proj: Fprop Activation: Delayed Scaling
INFO - transformer_layer.self_attn.proj: Fprop Weight: Delayed Scaling
INFO - transformer_layer.layernorm_mlp.fc1: Feature=DisableFp8Gemm, API=is_fp8_gemm_enabled: fprop: FP8 GEMM: False
INFO - transformer_layer.layernorm_mlp.fc1: Fprop: BF16
INFO - transformer_layer.layernorm_mlp.fc2: Feature=DisableFp8Gemm, API=is_fp8_gemm_enabled: fprop: FP8 GEMM: False
INFO - transformer_layer.layernorm_mlp.fc2: Fprop: BF16
INFO - transformer_layer.layernorm_mlp.fc2: Fprop Activation: Delayed Scaling
INFO - transformer_layer.layernorm_mlp.fc2: Fprop Weight: Delayed Scaling
INFO - transformer_layer.layernorm_mlp.fc2: Dgrad Gradient: Delayed Scaling
INFO - transformer_layer.layernorm_mlp.fc2: Dgrad Weight: Delayed Scaling
INFO - transformer_layer.layernorm_mlp.fc2: Wgrad Gradient: Delayed Scaling
INFO - transformer_layer.layernorm_mlp.fc2: Wgrad Activation: Delayed Scaling
INFO - transformer_layer.layernorm_mlp.fc1: Dgrad Gradient: Delayed Scaling
INFO - transformer_layer.layernorm_mlp.fc1: Dgrad Weight: Delayed Scaling
INFO - transformer_layer.layernorm_mlp.fc1: Wgrad Gradient: Delayed Scaling
INFO - transformer_layer.layernorm_mlp.fc1: Wgrad Activation: Delayed Scaling
INFO - transformer_layer.self_attn.proj: Dgrad Gradient: Delayed Scaling
INFO - transformer_layer.self_attn.proj: Dgrad Weight: Delayed Scaling
INFO - transformer_layer.self_attn.proj: Wgrad Gradient: Delayed Scaling
INFO - transformer_layer.self_attn.proj: Wgrad Activation: Delayed Scaling
INFO - transformer_layer.self_attn.layernorm_linear_qkv: Dgrad Gradient: Delayed Scaling
INFO - transformer_layer.self_attn.layernorm_linear_qkv: Dgrad Weight: Delayed Scaling
INFO - transformer_layer.self_attn.layernorm_linear_qkv: Wgrad Gradient: Delayed Scaling
INFO - transformer_layer.self_attn.layernorm_linear_qkv: Wgrad Activation: Delayed Scaling
....
```

In the main log file, you can find detailed information about the behavior of the transformer layer's GEMMs. You can see that the `fc1` and `fc2` fprop GEMMs are run in high precision, as intended.
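To check quickly which GEMMs had FP8 disabled, the debug log can be filtered for `DisableFp8Gemm` entries. This is a minimal sketch that assumes the line format shown in the log excerpt above stays stable; the regular expression and the helper name `disabled_fp8_gemms` are illustrative and not part of the tool.

```python
import re

# Matches lines such as:
# INFO - <layer>: Feature=DisableFp8Gemm, API=is_fp8_gemm_enabled: fprop: FP8 GEMM: False
PATTERN = re.compile(
    r"INFO - (?P<layer>\S+): Feature=DisableFp8Gemm, "
    r"API=is_fp8_gemm_enabled: (?P<gemm>\w+): FP8 GEMM: (?P<enabled>True|False)"
)


def disabled_fp8_gemms(lines):
    """Return (layer, gemm) pairs for which the log reports FP8 GEMM: False."""
    result = []
    for line in lines:
        match = PATTERN.search(line)
        if match and match.group("enabled") == "False":
            result.append((match.group("layer"), match.group("gemm")))
    return result


# Sample lines copied from the log excerpt above.
sample = [
    "INFO - transformer_layer.self_attn.proj: Fprop Weight: Delayed Scaling",
    "INFO - transformer_layer.layernorm_mlp.fc1: Feature=DisableFp8Gemm, "
    "API=is_fp8_gemm_enabled: fprop: FP8 GEMM: False",
    "INFO - transformer_layer.layernorm_mlp.fc2: Feature=DisableFp8Gemm, "
    "API=is_fp8_gemm_enabled: fprop: FP8 GEMM: False",
]

for layer, gemm in disabled_fp8_gemms(sample):
    print(f"{layer}: {gemm} runs without FP8")
```

To run it on a real log, pass an open file object instead of `sample`, for example `open("log/debug_logs/debug_log_globalrank-0.log")`.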

```
# log/debug_statistics_logs/statistics_log_globalrank-0.log

INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_max              iteration=000002 				 value=4.4874
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_min              iteration=000002 				 value=-4.1867
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_mean             iteration=000002 				 value=-0.0000
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_std              iteration=000002 				 value=0.9999
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_l1_norm          iteration=000002 				 value=130665.7031
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_max              iteration=000003 				 value=4.4872
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_min              iteration=000003 				 value=-4.1864
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_mean             iteration=000003 				 value=-0.0000
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_std              iteration=000003 				 value=0.9998
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_l1_norm          iteration=000003 				 value=130654.3047
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_max              iteration=000004 				 value=4.4872
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_min              iteration=000004 				 value=-4.1861
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_mean             iteration=000004 				 value=-0.0000
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_std              iteration=000004 				 value=0.9997
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_l1_norm          iteration=000004 				 value=130643.2422
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_max              iteration=000005 				 value=4.4876
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_min              iteration=000005 				 value=-4.1858
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_mean             iteration=000005 				 value=-0.0000
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_std              iteration=000005 				 value=0.9997
INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_l1_norm          iteration=000005 				 value=130632.5781
```

The second log file (`statistics_log_globalrank-0.log`) contains statistics for tensors we requested in `config.yaml`.
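For further analysis, the statistics log can be parsed into per-statistic time series. The sketch below assumes the `name  iteration=N  value=X` layout shown in the excerpt above; the regular expression and the helper `parse_statistics` are illustrative, not an API of the tool.

```python
import re
from collections import defaultdict

# Matches lines such as:
# INFO - <layer>_<tensor>_<stat>    iteration=000002    value=4.4874
LINE_RE = re.compile(
    r"INFO - (?P<name>\S+)\s+iteration=(?P<iteration>\d+)\s+value=(?P<value>\S+)"
)


def parse_statistics(lines):
    """Map each statistic name to a dict of {iteration: value}."""
    series = defaultdict(dict)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that are not statistics entries
        iteration = int(match.group("iteration"))
        series[match.group("name")][iteration] = float(match.group("value"))
    return dict(series)


# Sample lines with values copied from the log excerpt above.
sample = [
    "INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_max  iteration=000002  value=4.4874",
    "INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_std  iteration=000002  value=0.9999",
    "INFO - transformer_layer.self_attn.layernorm_linear_qkv_activation_max  iteration=000003  value=4.4872",
]

stats = parse_statistics(sample)
key = "transformer_layer.self_attn.layernorm_linear_qkv_activation_max"
print(sorted(stats[key].items()))
```

Run over the full file, the set of iterations found for each statistic should line up with the `start_step`, `end_step`, and `freq` values in `config.yaml`; in the excerpt above, iterations 2 through 5 are logged.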


#### Logging using TensorBoard

The precision debug tools support logging to [TensorBoard](https://www.tensorflow.org/tensorboard). To enable it, pass the `tb_writer` argument to `nvinspect_api.initialize()`. Let's modify the `train.py` file.

```python

# (...)

from torch.utils.tensorboard import SummaryWriter
tb_writer = SummaryWriter('./tensorboard_dir/run1')

# add tb_writer to the Debug API initialization
nvinspect_api.initialize(
    config_file="./config.yaml",
    feature_dirs=["/path/to/transformer_engine/debug/features"],
    log_dir="./log",
    tb_writer=tb_writer)

# (...)
```

Let's run the training and open TensorBoard with `tensorboard --logdir=./tensorboard_dir/run1`:

<figure align="center">
<img src="./img/tensorboard.png">
    <figcaption> Fig 2: TensorBoard with plotted stats.</figcaption>
</figure>


#### Conclusion

Transformer Engine's precision debug tools help address FP8 training issues. Properly setting up the configuration files and debug layers will help you gain better insights into model behavior and optimize the training process.

