# How does Lightning help me debug?

The Lightning Trainer has a lot of arguments devoted to maximizing your debugging productivity.

# Set a breakpoint
A breakpoint pauses execution so you can inspect variables and step through your code one line at a time.
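
As a minimal sketch of the idea, the snippet below uses Python's built-in `breakpoint()`, which opens the standard `pdb` prompt. The function name and the `debug` flag are illustrative assumptions, not part of Lightning; the flag only exists so the example runs straight through unless you ask it to pause.

```python
def function_to_debug(debug: bool = False) -> int:
    x = 2
    if debug:
        # Execution pauses here and drops into the pdb prompt.
        # Useful commands: `p x` (print x), `n` (next line), `c` (continue).
        breakpoint()
    y = x**2
    return y


# Runs without pausing; call function_to_debug(debug=True) to stop at the breakpoint.
print(function_to_debug())
```

The same call can be placed inside any method you want to inspect, for example a `training_step`.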

# Run all your model code once quickly
If you’ve ever trained a model for days only to have it crash during validation or testing, then this trainer argument is about to become your best friend.

The fast_dev_run argument in the trainer runs a single batch of training, validation, test and prediction data through your trainer to see if there are any bugs:


In [None]:
import lightning as L
trainer = L.Trainer(fast_dev_run=True)
# To change how many batches to use, change the argument to an integer. Here we run 7 batches of each:
trainer = L.Trainer(fast_dev_run=7)

# Shorten the epoch length

Sometimes it’s helpful to use only a fraction of your training, validation, test, or predict data (or a set number of batches). For example, you can use 10% of the training set and 1% of the validation set.
On larger datasets like ImageNet, this lets you debug or test a few things faster than waiting for a full epoch.

In [None]:
# Use only 10% of the training data and 1% of the validation data
trainer = L.Trainer(limit_train_batches=0.1, limit_val_batches=0.01)
# Use 10 training batches and 5 validation batches
trainer = L.Trainer(limit_train_batches=10, limit_val_batches=5)
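
To get a feel for what these limits mean in practice, the sketch below converts a limit into a batch count. The helper `limited_batches` is hypothetical, and rounding a fraction down is an assumption of this sketch rather than a statement about Lightning's internals. The dataset sizes are example numbers for MNIST with a batch size of 64.

```python
from typing import Union


def limited_batches(total_batches: int, limit: Union[int, float]) -> int:
    """Return how many batches run for a limit_*_batches-style value."""
    if isinstance(limit, float):
        # A float is read as a fraction of the loader; this sketch rounds down.
        return int(total_batches * limit)
    # An int is read as an absolute batch count, capped at what the loader has.
    return min(total_batches, limit)


# MNIST with batch_size=64: 60,000 training and 10,000 test images.
train_batches = -(-60_000 // 64)  # ceiling division
val_batches = -(-10_000 // 64)

print(limited_batches(train_batches, 0.1), limited_batches(val_batches, 0.01))
print(limited_batches(train_batches, 10), limited_batches(val_batches, 5))
```

Note that a very small fraction on a small loader can leave only a single batch, which is worth keeping in mind when a validation metric looks noisy during debugging.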

# Run a sanity check

Lightning runs 2 steps of validation at the beginning of training. This surfaces bugs in the validation loop immediately instead of letting them crash a run deep into a lengthy training loop.

In [None]:
trainer = L.Trainer(num_sanity_val_steps=2)

# Print LightningModule weights summary

Whenever the .fit() function gets called, the Trainer will print the weights summary for the LightningModule.

In [4]:
trainer.fit(...)

This generates a table like:
```
  | Name  | Type        | Params
----------------------------------
0 | net   | Sequential  | 132 K
1 | net.0 | Linear      | 131 K
2 | net.1 | BatchNorm1d | 1.0 K
```

To add the child modules to the summary, add a ModelSummary callback:

In [None]:
from lightning.pytorch.callbacks import ModelSummary

trainer = L.Trainer(callbacks=[ModelSummary(max_depth=-1)])

To print the model summary if .fit() is not called:

In [None]:
from lightning.pytorch.utilities.model_summary import ModelSummary

model = LitModel()
summary = ModelSummary(model, max_depth=-1)
print(summary)

To turn off the automatic summary, use:

In [None]:
trainer = L.Trainer(enable_model_summary=False)

# Print input and output layer dimensions

Another debugging tool is to display the intermediate input and output sizes of all your layers by setting the example_input_array attribute in your LightningModule.

```python
class LitModel(LightningModule):
    def __init__(self, *args, **kwargs):
        super().__init__()
        # Example batch used to trace input/output sizes for the summary
        self.example_input_array = torch.rand(32, 1, 28, 28)
```

With the input array, the summary table will include the input and output layer dimensions:

```
  | Name  | Type        | Params | In sizes  | Out sizes
--------------------------------------------------------------
0 | net   | Sequential  | 132 K  | [10, 256] | [10, 512]
1 | net.0 | Linear      | 131 K  | [10, 256] | [10, 512]
2 | net.1 | BatchNorm1d | 1.0 K  | [10, 512] | [10, 512]
```
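
The Params column can be checked by hand from the In and Out sizes. The sketch below assumes, based only on the sizes shown in the table, that `net.0` is a linear layer from 256 to 512 features with a bias and that `net.1` is a batch norm over 512 features with a learnable scale and shift. The helper names are hypothetical.

```python
def linear_params(in_features: int, out_features: int) -> int:
    # Weight matrix plus one bias per output feature.
    return in_features * out_features + out_features


def batchnorm1d_params(num_features: int) -> int:
    # One learnable scale and one learnable shift per feature.
    return 2 * num_features


net_0 = linear_params(256, 512)
net_1 = batchnorm1d_params(512)
print(net_0, net_1, net_0 + net_1)  # 131584 1024 132608
```

Rounded to thousands, these match the 131 K, 1.0 K, and 132 K entries in the table.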

# Find bottlenecks in your code (basic)

## Find training loop bottlenecks

The most basic profiler measures all the key methods across Callbacks, DataModules and the LightningModule in the training loop.

```python
trainer = L.Trainer(profiler="simple")
```

Once the .fit() function has completed, you’ll see an output like this:

```
FIT Profiler Report

-------------------------------------------------------------------------------------------
|  Action                                          |  Mean duration (s) |  Total time (s) |
-------------------------------------------------------------------------------------------
|  [LightningModule]BoringModel.prepare_data       |  10.0001           |  20.00          |
|  run_training_epoch                              |  6.1558            |  6.1558         |
|  run_training_batch                              |  0.0022506         |  0.015754       |
|  [LightningModule]BoringModel.optimizer_step     |  0.0017477         |  0.012234       |
|  [LightningModule]BoringModel.val_dataloader     |  0.00024388        |  0.00024388     |
|  on_train_batch_start                            |  0.00014637        |  0.0010246      |
|  [LightningModule]BoringModel.teardown           |  2.15e-06          |  2.15e-06       |
|  [LightningModule]BoringModel.on_train_start     |  1.644e-06         |  1.644e-06      |
|  [LightningModule]BoringModel.on_train_end       |  1.516e-06         |  1.516e-06      |
|  [LightningModule]BoringModel.on_fit_end         |  1.426e-06         |  1.426e-06      |
|  [LightningModule]BoringModel.setup              |  1.403e-06         |  1.403e-06      |
|  [LightningModule]BoringModel.on_fit_start       |  1.226e-06         |  1.226e-06      |
-------------------------------------------------------------------------------------------
```

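In this report, we can see that the slowest function is prepare_data. Now you can figure out why data preparation is slowing down your training.

The simple profiler automatically measures all the standard methods used in the training loop, including: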
- on_train_epoch_start
- on_train_epoch_end
- on_train_batch_start
- model_backward
- on_after_backward
- optimizer_step
- on_train_batch_end
- on_train_end
- etc…
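
To make the report columns concrete, here is a minimal sketch of a timing profiler written with the standard library only. `TinyProfiler` is a hypothetical stand-in for illustration, not Lightning's implementation, and the sleeps stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class TinyProfiler:
    """Hypothetical stand-in that records wall-clock time per named action."""

    def __init__(self):
        self.durations = defaultdict(list)

    @contextmanager
    def profile(self, action: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations[action].append(time.perf_counter() - start)

    def summary(self):
        # Rows of (action, mean duration, number of calls, total time),
        # ordered by total time with the slowest action first.
        rows = [
            (action, sum(times) / len(times), len(times), sum(times))
            for action, times in self.durations.items()
        ]
        return sorted(rows, key=lambda row: row[3], reverse=True)


profiler = TinyProfiler()
with profiler.profile("prepare_data"):
    time.sleep(0.1)  # pretend this is a slow download
for _ in range(3):
    with profiler.profile("training_step"):
        time.sleep(0.001)

for action, mean, calls, total in profiler.summary():
    print(f"{action:<15} mean={mean:.4f}s calls={calls} total={total:.4f}s")
```

An action that is called once but dominates the total, like the pretend `prepare_data` here, is usually the first place to look.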

## Profile the time within every function

To profile the time within every function, use the AdvancedProfiler built on top of Python’s cProfile.
```python
trainer = L.Trainer(profiler="advanced")
```

Once the .fit() function has completed, you’ll see an output like this:
```
Profiler Report

Profile stats for: get_train_batch
        4869394 function calls (4863767 primitive calls) in 18.893 seconds
Ordered by: cumulative time
List reduced from 76 to 10 due to restriction <10>
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
3752/1876    0.011    0.000   18.887    0.010 {built-in method builtins.next}
    1876     0.008    0.000   18.877    0.010 dataloader.py:344(__next__)
    1876     0.074    0.000   18.869    0.010 dataloader.py:383(_next_data)
    1875     0.012    0.000   18.721    0.010 fetch.py:42(fetch)
    1875     0.084    0.000   18.290    0.010 fetch.py:44(<listcomp>)
    60000    1.759    0.000   18.206    0.000 mnist.py:80(__getitem__)
    60000    0.267    0.000   13.022    0.000 transforms.py:68(__call__)
    60000    0.182    0.000    7.020    0.000 transforms.py:93(__call__)
    60000    1.651    0.000    6.839    0.000 functional.py:42(to_tensor)
    60000    0.260    0.000    5.734    0.000 transforms.py:167(__call__)
```
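
To read a report like this, it helps to produce one yourself on a tiny function. The sketch below uses Python's `cProfile` and `pstats` directly rather than Lightning; `slow_sum` is a hypothetical workload chosen only for illustration. Sorting by cumulative time and limiting the output to 10 rows mirrors the "Ordered by" and "List reduced" lines in the report above.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    # Deliberately simple workload so the profile has something to measure.
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Collect the report in a string instead of printing straight to stdout.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)  # keep at most the 10 most expensive entries
report = stream.getvalue()
print(report)
```

In the output, `tottime` excludes time spent in sub-calls while `cumtime` includes it, so a function with a small `tottime` and a large `cumtime` is slow because of what it calls.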

If the profiler report becomes too long, you can stream the report to a file:
```python
from lightning.pytorch.profilers import AdvancedProfiler

profiler = AdvancedProfiler(dirpath=".", filename="perf_logs")
trainer = L.Trainer(profiler=profiler)
```

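## Measure accelerator usage
Another helpful technique for detecting bottlenecks is to make sure you are using the full capacity of your accelerator (GPU/TPU/IPU/HPU). This can be measured with the DeviceStatsMonitor: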
```python
from lightning.pytorch.callbacks import DeviceStatsMonitor

trainer = L.Trainer(callbacks=[DeviceStatsMonitor()])
```

CPU metrics are tracked by default on the CPU accelerator. To enable them for other accelerators, set DeviceStatsMonitor(cpu_stats=True). To disable logging CPU metrics, specify DeviceStatsMonitor(cpu_stats=False).

# Experiment

In [11]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import lightning as L

In [12]:
# Load data sets
transform = transforms.ToTensor()
train_set = datasets.MNIST(root="MNIST", download=True, train=True, transform=transform)
test_set = datasets.MNIST(root="MNIST", download=True, train=False, transform=transform)
train_loader = DataLoader(train_set, num_workers=0, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, num_workers=0, batch_size=64, shuffle=True)

In [13]:
# PyTorch
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, x):
        return self.l1(x)


class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def forward(self, x):
        return self.l1(x)

In [31]:
# PyTorch-Lightning
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        # Example input array for logging and model tracing
        self.example_input_array = torch.rand(16, 1, 28, 28)
    
    def forward(self, x):
        # Define the forward pass
        x = x.view(x.size(0), -1)  # Flatten the input
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        x, y = batch
        x_hat = self(x)  # Use the forward method
        loss = F.mse_loss(x_hat, x.view(x.size(0), -1))
        return loss
    
    def test_step(self, batch, batch_idx):
        # this is the test loop
        x, y = batch
        x_hat = self(x)  # Use the forward method
        test_loss = F.mse_loss(x_hat, x.view(x.size(0), -1))
        self.log("test_loss", test_loss)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

In [32]:
model = LitAutoEncoder.load_from_checkpoint(
    "ckpts/lightning_logs/version_0/checkpoints/epoch=2-step=2250.ckpt",
    encoder=Encoder(),
    decoder=Decoder(),
)


In [50]:
trainer = L.Trainer(num_sanity_val_steps=2, accelerator="gpu", max_epochs=3, profiler="simple")
trainer.fit(model, train_loader, test_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type    | Params | In sizes  | Out sizes
------------------------------------------------------------
0 | encoder | Encoder | 50.4 K | [16, 784] | [16, 3]  
1 | decoder | Decoder | 51.2 K | [16, 3]   | [16, 784]
------------------------------------------------------------
101 K     Trainable params
0         Non-trainable params
101 K     Total params
0.407     Total estimated model params size (MB)


`Trainer.fit` stopped: `max_epochs=3` reached.
FIT Profiler Report

------------------------------------------------------------------------------------
|  Action  |  Mean duration (s)  |  Num calls  |  Total time (s)  |  Percentage %  |
------------------------------------------------------------------------------------
|  Total

In [47]:
trainer = L.Trainer(num_sanity_val_steps=2, accelerator="gpu", max_epochs=3, profiler="advanced")
trainer.fit(model, train_loader, test_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type    | Params | In sizes  | Out sizes
------------------------------------------------------------
0 | encoder | Encoder | 50.4 K | [16, 784] | [16, 3]  
1 | decoder | Decoder | 51.2 K | [16, 3]   | [16, 784]
------------------------------------------------------------
101 K     Trainable params
0         Non-trainable params
101 K     Total params
0.407     Total estimated model params size (MB)


`Trainer.fit` stopped: `max_epochs=3` reached.
FIT Profiler Report
Profile stats for: [LightningModule]LitAutoEncoder.configure_callbacks
         7 function calls in 0.000 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 contextlib.py:141(__exit__)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.next}
        1    0.000    0.000    0.000    0.000 profiler.py:55(profile)
        1    0.000    0.000    0.000    0.000 advanced.py:71(stop)
        1    0.000    0.000    0.000    0.000 module.py:923(configure_callbacks)
        1    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}



Profile stats for: [LightningModule]LitAutoEncoder.prepare_data
         7 function calls in 0.000 seconds

   Ordered by: cumulative time

   ncalls  tottime  per

In [48]:
from lightning.pytorch.callbacks import DeviceStatsMonitor
trainer = L.Trainer(num_sanity_val_steps=2, accelerator="gpu", max_epochs=3, callbacks=[DeviceStatsMonitor()])
trainer.fit(model, train_loader, test_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type    | Params | In sizes  | Out sizes
------------------------------------------------------------
0 | encoder | Encoder | 50.4 K | [16, 784] | [16, 3]  
1 | decoder | Decoder | 51.2 K | [16, 3]   | [16, 784]
------------------------------------------------------------
101 K     Trainable params
0         Non-trainable params
101 K     Total params
0.407     Total estimated model params size (MB)


`Trainer.fit` stopped: `max_epochs=3` reached.


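# Track and visualize experiments

This section is for users who want to visualize and monitor their model development.
In model development, we track values of interest, such as the validation loss, to visualize how our models learn. Model development is like driving a car without windows: charts and logs provide the windows that show where to steer. With Lightning, you can visualize virtually anything you can think of: numbers, text, images, audio. Your creativity and imagination are the only limiting factors.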

## Track metrics

Metric visualization is the most basic but powerful way of understanding how your model is doing throughout the model development process.

To track a metric, use the self.log method available inside the LightningModule:

```python
class LitModel(L.LightningModule):
    def training_step(self, batch, batch_idx):
        value = ...
        self.log("some_value", value)
```

To log multiple metrics at once, use self.log_dict:

```python
values = {"loss": loss, "acc": acc, "metric_n": metric_n}  # add more items if needed
self.log_dict(values)
```

To view metrics in the command-line progress bar, set the prog_bar argument to True.

```python
self.log(..., prog_bar=True)
```

```
Epoch 3:  33%|███▉        | 307/938 [00:01<00:02, 289.04it/s, loss=0.198, v_num=51, acc=0.211, metric_n=0.937]
```