# HARDWARE AGNOSTIC TRAINING (PREPARATION)

To train on CPU/GPU/TPU without changing your code, we need to build a few good habits :)

# Delete .cuda() or .to() calls

**Delete any calls to .cuda() or .to(device).**

```python
# before lightning
def forward(self, x):
    x = x.cuda(0)
    self.layer_1.cuda(0)
    x_hat = self.layer_1(x)
    return x_hat


# after lightning
def forward(self, x):
    x_hat = self.layer_1(x)
    return x_hat
```

# Init tensors using Tensor.to and register_buffer

When you need to create a new tensor, use `Tensor.to`. This lets your code scale to any number of GPUs or TPUs with Lightning.

```python
# before lightning
def forward(self, x):
    z = torch.Tensor(2, 3)
    z = z.cuda(0)


# with lightning
def forward(self, x):
    z = torch.Tensor(2, 3)
    z = z.to(x)
```
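
As a minimal sketch of why `z.to(x)` is device agnostic: passing a tensor to `Tensor.to` returns a tensor with that tensor's device and dtype. The example below runs on CPU and uses `float64` as a visible stand-in for "wherever `x` lives"; the names `x`, `z`, and `w` are illustrative only.

```python
import torch

# Stand-in for an input batch; in real training it may already sit on a GPU or TPU
x = torch.ones(4, 3, dtype=torch.float64)

# Option 1: create the tensor, then match the device and dtype of x
z = torch.zeros(2, 3).to(x)

# Option 2: allocate directly with the device and dtype of x
w = torch.zeros(2, 3, device=x.device, dtype=x.dtype)

print(z.device, z.dtype)
print(w.device, w.dtype)
```

Option 2 avoids the intermediate allocation on the default device, which can matter for large tensors.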

The `LightningModule` knows what device it is on. You can access the reference via `self.device`. Sometimes it is necessary to store tensors as module attributes. **However, if they are not parameters, they will remain on the CPU even if the module is moved to a new device.** To prevent that and remain device agnostic, register the tensor as a buffer in your module's `__init__` method with `register_buffer()`.

```python
class LitModel(LightningModule):
    def __init__(self):
        ...
        self.register_buffer("sigma", torch.eye(3))
        # you can now access self.sigma anywhere in your module
```
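
The difference between a buffer and a plain attribute can be checked without a GPU. The sketch below makes two assumptions: it uses a plain `torch.nn.Module` (which `LightningModule` subclasses) so it runs without Lightning installed, and it uses a dtype conversion as a CPU-friendly stand-in for a device move, since `Module.to` handles both through the same mechanism. The class name `Demo` and the attribute `plain` are hypothetical.

```python
import torch
from torch import nn


class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered buffer: tracked by the module, follows .to() calls
        self.register_buffer("sigma", torch.eye(3))
        # Plain attribute: invisible to the module, left behind by .to()
        self.plain = torch.eye(3)


model = Demo()
model.to(torch.float64)  # same code path as model.to("cuda")

print(model.sigma.dtype)  # converted together with the module
print(model.plain.dtype)  # unchanged
print(list(model.state_dict().keys()))
```

The buffer also shows up in `state_dict()`, so it is saved and restored with checkpoints, while the plain attribute is not.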

# Remove samplers

`DistributedSampler` is automatically handled by Lightning, so you do not need to add one yourself for distributed training.

# Synchronize validation and test logging

When running in distributed mode, we have to ensure that the validation and test step logging calls are synchronized across processes. This is done by adding `sync_dist=True` to all `self.log` calls in the validation and test step. This ensures that each GPU worker behaves the same way when tracking model checkpoints, which is important for downstream tasks such as testing the best checkpoint across all workers. The `sync_dist` option can also be used in logging calls during the step methods, but be aware that this can lead to significant communication overhead and slow down your training.

Note that if you use any built-in metrics or custom metrics that use TorchMetrics, these do not need to be updated and are handled for you automatically.

```python
def validation_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = self.loss(logits, y)
    # Add sync_dist=True to sync logging across all GPU workers (may have performance impact)
    self.log("validation_loss", loss, on_step=True, on_epoch=True, sync_dist=True)


def test_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = self.loss(logits, y)
    # Add sync_dist=True to sync logging across all GPU workers (may have performance impact)
    self.log("test_loss", loss, on_step=True, on_epoch=True, sync_dist=True)
```

It is also possible to perform some computation manually and log the reduced result on rank 0 as follows:

```python
def __init__(self):
    super().__init__()
    self.outputs = []


def test_step(self, batch, batch_idx):
    x, y = batch
    tensors = self(x)
    self.outputs.append(tensors)
    return tensors


def on_test_epoch_end(self):
    # Concatenate the per-batch outputs into one tensor before gathering:
    # all_gather on a list returns a list, which torch.mean cannot reduce.
    gathered = self.all_gather(torch.cat(self.outputs))
    mean = torch.mean(gathered)
    self.outputs.clear()  # free memory

    # When you call `self.log` only on rank 0, don't forget to add
    # `rank_zero_only=True` to avoid deadlocks on synchronization.
    # Caveat: monitoring this is unimplemented, see https://github.com/Lightning-AI/lightning/issues/15852
    if self.trainer.is_global_zero:
        self.log("my_reduced_metric", mean, rank_zero_only=True)
```

# Make models pickleable

It's very likely your code is already pickleable, in which case no change is necessary. However, if you run a distributed model and get the following error:

```
self._launch(process_obj)
File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47,
in _launch reduction.dump(process_obj, fp)
File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function <lambda> at 0x2b599e088ae8>:
attribute lookup <lambda> on __main__ failed
```

This means something in your model definition, transforms, optimizer, dataloader, or callbacks cannot be pickled, and the following code will fail:

```python
import pickle

# some_object stands for your model, transform, optimizer, dataloader, or callback
pickle.dumps(some_object)
```

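This is a limitation of using multiple processes for distributed training in PyTorch. To fix the issue, find the piece of code that cannot be pickled. The end of the stack trace is usually helpful. For example, in the stack trace above, there seems to be a lambda function somewhere in the code that cannot be pickled.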
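
The lambda failure from the stack trace above can be reproduced with the standard library alone. The sketch below uses a hypothetical helper, `is_picklable`, and shows one possible replacement for a lambda: a `functools.partial` over a standard-library function, which pickle can serialize by reference. A module-level `def` in an importable module is another common option.

```python
import functools
import operator
import pickle


def is_picklable(obj):
    """Return True if pickle can serialize obj (hypothetical helper)."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# A lambda has no importable name, so pickle cannot look it up
double_lambda = lambda x: x * 2

# A partial over a standard-library function pickles by reference
double_partial = functools.partial(operator.mul, 2)

print(is_picklable(double_lambda))
print(is_picklable(double_partial))

# The partial survives a pickle round trip and still works
restored = pickle.loads(pickle.dumps(double_partial))
print(restored(21))
```

Running such a check on your transforms, callbacks, and collate functions before launching a multi-process job can locate the offending object faster than reading the spawn traceback.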

# GPU TRAINING (BASIC)

## Train on GPUs

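By default, the Trainer runs on all available GPUs. Make sure you are running on a machine with at least one GPU. There is no need to specify any NVIDIA flags, as Lightning does that for you.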

```python
import lightning as L
from lightning import Trainer

# run on as many GPUs as available by default
trainer = Trainer(accelerator="auto", devices="auto", strategy="auto")
# equivalent to
trainer = Trainer()

# run on one GPU
trainer = Trainer(accelerator="gpu", devices=1)
# run on multiple GPUs
trainer = Trainer(accelerator="gpu", devices=8)
# choose the number of devices automatically
trainer = Trainer(accelerator="gpu", devices="auto")
```

Output on a machine with a single GPU, where the `devices=8` request fails:

```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

MisconfigurationException: You requested gpu: [0, 1, 2, 3, 4, 5, 6, 7]
 But your machine only has: [0]
```

## Choosing GPU devices
https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html


You can select the GPU devices using ranges, a list of indices, or a string containing a comma-separated list of GPU ids:

```python
# DEFAULT (int) specifies how many GPUs to use per node
Trainer(accelerator="gpu", devices=k)

# Above is equivalent to
Trainer(accelerator="gpu", devices=list(range(k)))

# Specify which GPUs to use (don't use when running on cluster)
Trainer(accelerator="gpu", devices=[0, 1])

# Equivalent using a string
Trainer(accelerator="gpu", devices="0, 1")

# To use all available GPUs put -1 or '-1'
# equivalent to `list(range(torch.cuda.device_count()))` and `"auto"`
Trainer(accelerator="gpu", devices=-1)
```
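
To make the rules above concrete, here is a small sketch of how the different `devices` values map to GPU indices. It is not Lightning's implementation: `normalize_devices` and `num_available` are hypothetical names, the mapping covers only the cases listed in this section, and a plain `ValueError` stands in for Lightning's `MisconfigurationException`.

```python
from typing import List, Union


def normalize_devices(devices: Union[int, str, List[int]], num_available: int) -> List[int]:
    """Map a `devices` value to a list of GPU indices (illustrative only)."""
    if isinstance(devices, str):
        if devices.strip() in ("auto", "-1"):
            devices = -1
        elif "," in devices:
            # "0, 1" selects specific GPU ids
            devices = [int(part) for part in devices.split(",") if part.strip()]
        else:
            # a string without a comma is treated as a count here
            devices = int(devices)

    if isinstance(devices, int):
        # -1 means all available GPUs; k means the first k GPUs
        count = num_available if devices == -1 else devices
        devices = list(range(count))

    missing = [i for i in devices if i < 0 or i >= num_available]
    if missing:
        raise ValueError(
            f"You requested gpu: {devices}, but only {num_available} available"
        )
    return list(devices)


print(normalize_devices(2, num_available=8))
print(normalize_devices("0, 1", num_available=8))
print(normalize_devices(-1, num_available=3))
```

Requesting more GPUs than the machine has, for example `normalize_devices(8, num_available=1)`, raises an error in this sketch, mirroring the misconfiguration error Lightning reports in that situation.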

## Find usable CUDA devices

If you want to run several experiments at the same time on your machine, for example for a hyperparameter sweep, you can use the following utility function to pick GPU indices that are "accessible", without having to change your code every time.

This is especially useful when GPUs are configured to be in "exclusive compute mode", such that only one process at a time is allowed access to the device. This special mode is often enabled on server GPUs or systems shared among multiple users.

```python
import lightning as L
from lightning import Trainer
from lightning.pytorch.accelerators import find_usable_cuda_devices

# Find one GPU on the system that is not already occupied
trainer = Trainer(accelerator="cuda", devices=find_usable_cuda_devices(1))

from lightning.fabric.accelerators import find_usable_cuda_devices

# Works with Fabric too
fabric = L.Fabric(accelerator="cuda", devices=find_usable_cuda_devices(1))
```

Output:

```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
```
