# Experiment Tracking and Visualization Tutorial (TensorBoard, Weights & Biases, MLflow)

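Welcome to this tutorial on experiment tracking and visualization for machine learning! In a machine learning project, tracking experiments, recording key metrics, and visualizing results are essential for reproducibility, for comparing different attempts, and for understanding model behavior in depth.

This tutorial introduces three widely used tools in turn. Each has a different focus, but all of them help you manage and understand your machine learning experiments:

1.  **TensorBoard**: A visualization toolkit developed by Google. It excels at monitoring training metrics in real time and at visualizing model graphs and data, and it usually runs locally.
2.  **Weights & Biases (WandB)**: A popular cloud platform (free for individual and academic use) that provides experiment tracking, advanced visualization, collaboration, and model management.
3.  **MLflow (Tracking component)**: An open-source, end-to-end MLOps platform. Its Tracking component focuses on recording and querying experiment parameters, metrics, code, and models, and it can be deployed locally or remotely.

Using a simple CNN trained to classify the FashionMNIST dataset as a running example, we show how to integrate and use each of the three tools.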
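The CNN used in all three sections feeds its convolutional stack into `nn.Linear(32 * 7 * 7, 128)`. As a quick sanity check on where the `7` comes from, the sketch below applies the standard output-size formula `floor((n + 2p - k) / s) + 1` to a 28x28 FashionMNIST image. The helper `out_size` is a hypothetical name introduced only for this illustration.

```python
def out_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a conv or pooling layer (hypothetical helper)."""
    return (n + 2 * padding - kernel) // stride + 1


size = 28                                   # FashionMNIST images are 28x28
size = out_size(size, kernel=3, padding=1)  # Conv2d(1, 16, 3, padding=1)  keeps 28
size = out_size(size, kernel=2, stride=2)   # MaxPool2d(2, 2)              halves to 14
size = out_size(size, kernel=3, padding=1)  # Conv2d(16, 32, 3, padding=1) keeps 14
size = out_size(size, kernel=2, stride=2)   # MaxPool2d(2, 2)              halves to 7

flattened = 32 * size * size                # 32 channels after the second conv
print(size, flattened)                      # 7 1568
```

If you change the kernel size, padding, or input resolution, rerunning this arithmetic tells you the new input size for the first `nn.Linear` layer.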

## Preparation: Installing the Required Libraries

Make sure the required libraries are installed before running each section.

```bash
# Common dependencies
pip install torch torchvision scikit-learn pandas matplotlib numpy

# TensorBoard (if not already installed alongside PyTorch/TensorFlow)
pip install tensorboard

# Weights & Biases
pip install wandb

# MLflow
pip install mlflow
```

**Important**: 
*   **WandB**: Requires a free account; run `wandb login` before first use.
*   **MLflow**: Logs to the local `./mlruns` directory by default; view the results with `mlflow ui`.

## Section A: Visualizing with TensorBoard

TensorBoard visualizes training by reading event files. PyTorch provides `SummaryWriter` to generate these files conveniently.
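Before the full training example, here is a minimal sketch of the event-file mechanism on its own. It logs a synthetic decaying curve with `SummaryWriter.add_scalar` and then lists the event files that were produced. The temporary log directory and the tag name `Loss/demo` are assumptions made for this illustration; the tutorial itself logs under `runs_tb/`.

```python
import math
import os
import tempfile

from torch.utils.tensorboard import SummaryWriter

# Hypothetical log directory for this sketch; the tutorial itself uses runs_tb/.
log_dir = os.path.join(tempfile.mkdtemp(), "demo_run")

writer = SummaryWriter(log_dir=log_dir)
for step in range(100):
    # Synthetic decaying "loss"; a real loop would log loss.item() here.
    writer.add_scalar("Loss/demo", math.exp(-step / 20), step)
writer.close()  # flushes pending events to disk

# Each run directory holds one or more event files that TensorBoard reads.
event_files = [f for f in os.listdir(log_dir) if f.startswith("events.out.tfevents")]
print(event_files)
```

Pointing `tensorboard --logdir` at the parent of `log_dir` should then show a single run with one scalar chart.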

In [None]:
# --- TensorBoard: imports and setup ---
print("--- Setting up for TensorBoard ---")
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import os
import time

device_tb = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"TensorBoard section using device: {device_tb}")

tb_config = {
    "learning_rate": 0.001,
    "epochs": 2,
    "batch_size": 64,
    "optimizer": "Adam",
}

# --- TensorBoard: data preparation ---
tb_dataset_loaded = False
try:
    print("TensorBoard: Preparing FashionMNIST dataset...")
    tb_transform = transforms.Compose([
        transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))
    ])
    tb_trainset = torchvision.datasets.FashionMNIST(root='./data_tb', train=True, download=True, transform=tb_transform)
    tb_testset = torchvision.datasets.FashionMNIST(root='./data_tb', train=False, download=True, transform=tb_transform)
    tb_trainloader = DataLoader(tb_trainset, batch_size=tb_config['batch_size'], shuffle=True, num_workers=0)
    tb_testloader = DataLoader(tb_testset, batch_size=tb_config['batch_size']*2, shuffle=False, num_workers=0)
    tb_classes = ('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 
                  'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot')
    print("TensorBoard: Dataset prepared.")
    tb_dataset_loaded = True
except Exception as e:
    print(f"TensorBoard: Error loading dataset: {e}")

# --- TensorBoard: model definition ---
class SimpleCNN_TB(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )
    def forward(self, x): return self.network(x)
print("TensorBoard: SimpleCNN_TB model defined.")

In [None]:
# --- TensorBoard: training loop and logging ---
def train_with_tensorboard(cfg, train_loader, test_loader, classes):
    print("\n--- Running Training Loop for TensorBoard --- ")
    if not tb_dataset_loaded: return None
        
    model = SimpleCNN_TB().to(device_tb)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=cfg['learning_rate'])
    
    run_name = f"TB_LR={cfg['learning_rate']}_{int(time.time())}"
    tb_log_dir = os.path.join("runs_tb", run_name)
    writer = SummaryWriter(log_dir=tb_log_dir)
    print(f"TB Train: Logging to {writer.log_dir}")
    
    global_step = 0
    for epoch in range(cfg['epochs']):
        model.train()
        running_loss = 0.0
        for i, data in enumerate(train_loader, 0):
            inputs, labels = data[0].to(device_tb), data[1].to(device_tb)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            global_step += 1
            
            if i % 400 == 399:
                 batch_loss = running_loss / 400
                 writer.add_scalar('Loss/train_step_tb', batch_loss, global_step)
                 running_loss = 0.0
        
        # Epoch evaluation & logging
        model.eval()
        correct, total, test_loss = 0, 0, 0.0
        sample_images = None
        with torch.no_grad():
            for batch_idx, data in enumerate(test_loader):
                images, labels = data[0].to(device_tb), data[1].to(device_tb)
                if batch_idx == 0: sample_images = images.cpu()
                outputs = model(images)
                loss = criterion(outputs, labels)
                test_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
                    
        epoch_accuracy = 100 * correct / total
        epoch_test_loss = test_loss / len(test_loader)
        print(f'TB Run - Epoch {epoch + 1} Test Acc: {epoch_accuracy:.2f}%, Test Loss: {epoch_test_loss:.4f}')
        
        writer.add_scalar('Accuracy/test_tb', epoch_accuracy, epoch)
        writer.add_scalar('Loss/test_tb', epoch_test_loss, epoch)
        if epoch == cfg['epochs'] - 1:
            if sample_images is not None:
                 img_grid = torchvision.utils.make_grid(sample_images[:16], nrow=4)
                 writer.add_image('Test_Samples_tb', img_grid, epoch)
            for name, param in model.named_parameters():
                if param.requires_grad:
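                     # Detach first, then move to CPU: for a CUDA parameter, param.cpu() is a new tensor whose .grad is None.
                     writer.add_histogram(f"Weights_tb/{name.replace('.', '/')}", param.detach().cpu().numpy(), epoch)
                     if param.grad is not None: 
                          writer.add_histogram(f"Gradients_tb/{name.replace('.', '/')}", param.grad.detach().cpu().numpy(), epoch)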
                          
    writer.close()
    print("TB Train: TensorBoard training finished.")
    return model

# --- Run Training --- 
if tb_dataset_loaded:
    model_tb = train_with_tensorboard(tb_config, tb_trainloader, tb_testloader, tb_classes)
else:
    print("Skipping TensorBoard training run due to dataset loading error.")

### A.4 Viewing the TensorBoard UI

1.  Open a terminal.
2.  Navigate to the directory that contains `runs_tb`.
3.  Run `tensorboard --logdir runs_tb`.
4.  Open `http://localhost:6006/` in your browser.
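If you want to check what was logged without opening the UI, the event files can also be read back in Python. The sketch below is one possible way to do this with TensorBoard's `EventAccumulator`; it writes a tiny synthetic run first so that it is self-contained. The log directory and the tag `Loss/demo` are assumptions for this illustration, not names used by the tutorial's training code.

```python
import os
import tempfile

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
from torch.utils.tensorboard import SummaryWriter

# Write a tiny synthetic run so the sketch is self-contained.
log_dir = os.path.join(tempfile.mkdtemp(), "demo_run")
writer = SummaryWriter(log_dir=log_dir)
for step, value in enumerate([1.0, 0.5, 0.25]):
    writer.add_scalar("Loss/demo", value, step)
writer.close()

# Load the event files back and list what was logged.
acc = EventAccumulator(log_dir)
acc.Reload()
scalar_tags = acc.Tags()["scalars"]
points = [(event.step, event.value) for event in acc.Scalars("Loss/demo")]
print(scalar_tags)
print(points)
```

To inspect a run from this tutorial instead, point `EventAccumulator` at one of the subdirectories of `runs_tb` and use a tag such as `Loss/test_tb`.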

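## Section B: Tracking with Weights & Biases (WandB)

### B.1 WandB Overview and Setup
WandB provides a cloud service for tracking experiments and requires you to sign up and log in. Tracking starts with `wandb.init()`, and data is recorded with `wandb.log()`.

**Important**: Before running the code below, make sure you have logged in with `wandb login` in your environment or set the `WANDB_API_KEY` environment variable; otherwise logging runs in `disabled` mode.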

In [None]:
# --- WandB: imports and setup ---
print("\n--- Setting up for Weights & Biases ---")
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import wandb
import numpy as np
import os
import time

device_wandb = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"WandB section using device: {device_wandb}")

# --- Configuration (defined within this section) ---
wandb_config = {
    "learning_rate": 0.0015,
    "epochs": 2,
    "batch_size": 128,
    "optimizer": "RMSprop",
}

# --- Data preparation (performed within this section) ---
wandb_dataset_loaded = False
try:
    print("WandB Section: Preparing FashionMNIST dataset...")
    wandb_transform = transforms.Compose([
        transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))
    ])
    wandb_trainset = torchvision.datasets.FashionMNIST(root='./data_wandb', train=True, download=True, transform=wandb_transform)
    wandb_testset = torchvision.datasets.FashionMNIST(root='./data_wandb', train=False, download=True, transform=wandb_transform)
    wandb_trainloader = DataLoader(wandb_trainset, batch_size=wandb_config['batch_size'], shuffle=True, num_workers=0)
    wandb_testloader = DataLoader(wandb_testset, batch_size=wandb_config['batch_size']*2, shuffle=False, num_workers=0)
    wandb_classes = ('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                     'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot')
    print("WandB Section: Dataset prepared.")
    wandb_dataset_loaded = True
except Exception as e:
    print(f"WandB Section: Error loading dataset: {e}")

# --- Model definition (defined within this section) ---
class SimpleCNN_WandB(nn.Module):
    # (same architecture as before)
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128), nn.ReLU(),
            nn.Linear(128, 10)
        )
    def forward(self, x): return self.network(x)
print("WandB Section: SimpleCNN_WandB model defined.")

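# Pick the WandB mode. Heuristic: log online only if credentials appear to be available,
# i.e. WANDB_API_KEY is set or a ~/.netrc file exists (where `wandb login` stores the key).
wandb_online_mode_possible = bool(os.environ.get("WANDB_API_KEY")) or os.path.exists(os.path.expanduser("~/.netrc"))
wandb_mode = "online" if wandb_online_mode_possible else "disabled"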

In [None]:
# --- B.3 WandB training loop ---
def train_with_wandb(cfg, train_loader, test_loader, classes):
    print("\n--- Running Training Loop for WandB --- ")
    if not wandb_dataset_loaded: return None
        
    run_timestamp = int(time.time())
    run_name = f"WandB_{cfg['optimizer']}_lr{cfg['learning_rate']}_{run_timestamp}"
    
    wandb_run = None
    try:
        wandb_run = wandb.init(
            project="pytorch-tracking-tutorial-revised", 
            config=cfg, name=run_name, reinit=True, mode=wandb_mode
        )
        print(f"WandB Train: Run initialized (mode: {wandb_mode}). URL: {wandb_run.url if wandb_run and wandb_mode == 'online' else 'N/A'}")
    except Exception as e:
        print(f"WandB Train: Could not initialize WandB: {e}. Aborting training for WandB.")
        return None

    model = SimpleCNN_WandB().to(device_wandb)
    criterion = nn.CrossEntropyLoss()
    if wandb.config.optimizer == 'Adam':
         optimizer = optim.Adam(model.parameters(), lr=wandb.config.learning_rate)
    else:
         optimizer = optim.RMSprop(model.parameters(), lr=wandb.config.learning_rate)

    if wandb_mode == 'online':
        wandb.watch(model, log="all", log_freq=100)
        
    global_step = 0
    for epoch in range(wandb.config.epochs):
        model.train()
        running_loss = 0.0
        for i, data in enumerate(train_loader, 0):
            inputs, labels = data[0].to(device_wandb), data[1].to(device_wandb)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            global_step += 1
            
            if i % 400 == 399:
                batch_loss = running_loss / 400
                if wandb_mode == 'online':
                    wandb.log({"step_loss_wandb": batch_loss, "global_step_wandb": global_step})
                running_loss = 0.0
        
        # Epoch evaluation & logging
        model.eval()
        correct, total, test_loss = 0, 0, 0.0
        sample_images, sample_labels, sample_preds = [], [], []
        with torch.no_grad():
            for batch_idx, data in enumerate(test_loader):
                images, labels = data[0].to(device_wandb), data[1].to(device_wandb)
                if batch_idx == 0: 
                    sample_images = images.cpu()
                    sample_labels = labels.cpu()
                outputs = model(images)
                loss = criterion(outputs, labels)
                test_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                if batch_idx == 0:
                    sample_preds = predicted.cpu()
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        
        epoch_accuracy = 100 * correct / total
        epoch_test_loss = test_loss / len(test_loader)
        print(f'WandB Run - Epoch {epoch + 1} Test Acc: {epoch_accuracy:.2f}%, Test Loss: {epoch_test_loss:.4f}')
        
        if wandb_mode == 'online':
            wandb_logs = {"epoch": epoch + 1, "test_accuracy_wandb": epoch_accuracy, "test_loss_wandb": epoch_test_loss}
            if len(sample_images) > 0:
                 wandb_images = []
                 num_samples_to_log = min(16, len(sample_images))
                 for idx in range(num_samples_to_log):
                     wandb_images.append(wandb.Image(
                         sample_images[idx],
                         caption=f"Pred: {classes[sample_preds[idx]]}, True: {classes[sample_labels[idx]]}"
                     ))
                 wandb_logs["test_samples_wandb"] = wandb_images
            wandb.log(wandb_logs)
            
    if wandb_run:
        wandb_run.finish()
    print("WandB Train: WandB training finished.")
    return model

# --- Run Training with WandB ---
if wandb_dataset_loaded:
    model_wandb = train_with_wandb(wandb_config, wandb_trainloader, wandb_testloader, wandb_classes)
else:
    print("Skipping WandB training run due to dataset loading error.")

### B.4 Viewing the WandB Dashboard

1.  If `wandb_mode` is `online`, go to your WandB account ([https://wandb.ai/](https://wandb.ai/)).
2.  Find the project named `pytorch-tracking-tutorial-revised`.
3.  Open the corresponding run.

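## Section C: Using MLflow Tracking

### C.1 MLflow Tracking Overview
MLflow Tracking records the parameters, metrics, and artifacts of experiment runs. It can save logs to a local `mlruns` directory or to a configured remote server.

**Core steps**:
1. `mlflow.set_experiment()`: Set the experiment name.
2. `with mlflow.start_run():`: Start a run.
3. Inside the `with` block, use the `mlflow.log_*` methods to record information.
4. `mlflow.pytorch.log_model()`: (Optional) Log the PyTorch model.
5. Run `mlflow ui` in a terminal to view the results.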

In [None]:
# --- C.2 MLflow setup and imports ---
print("\n--- Setting up for MLflow ---")
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import mlflow
import mlflow.pytorch
import numpy as np
import matplotlib.pyplot as plt # For logging plot artifact
import os
import time

device_mlflow = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"MLflow section using device: {device_mlflow}")

# --- Configuration (defined within this section) ---
mlflow_config = {
    "learning_rate": 0.005,
    "epochs": 2,
    "batch_size": 64,
    "optimizer": "SGD",
}

# --- Data preparation (performed within this section) ---
mlflow_dataset_loaded = False
try:
    print("MLflow Section: Preparing FashionMNIST dataset...")
    mlflow_transform = transforms.Compose([
        transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))
    ])
    mlflow_trainset = torchvision.datasets.FashionMNIST(root='./data_mlflow', train=True, download=True, transform=mlflow_transform)
    mlflow_testset = torchvision.datasets.FashionMNIST(root='./data_mlflow', train=False, download=True, transform=mlflow_transform)
    mlflow_trainloader = DataLoader(mlflow_trainset, batch_size=mlflow_config['batch_size'], shuffle=True, num_workers=0)
    mlflow_testloader = DataLoader(mlflow_testset, batch_size=mlflow_config['batch_size']*2, shuffle=False, num_workers=0)
    mlflow_classes = ('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                      'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot')
    print("MLflow Section: Dataset prepared.")
    mlflow_dataset_loaded = True
except Exception as e:
    print(f"MLflow Section: Error loading dataset: {e}")

# --- Model definition (defined within this section) ---
class SimpleCNN_MLflow(nn.Module):
    # (same architecture as before)
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128), nn.ReLU(),
            nn.Linear(128, 10)
        )
    def forward(self, x): return self.network(x)
print("MLflow Section: SimpleCNN_MLflow model defined.")

In [None]:
# --- C.3 MLflow training loop ---
def train_with_mlflow(cfg, train_loader, test_loader, classes):
    print("\n--- Running Training Loop for MLflow --- ")
    if not mlflow_dataset_loaded: return None

    mlflow.set_experiment("FashionMNIST Classification Revised")
    run_timestamp = int(time.time())
    run_name = f"MLflow_{cfg['optimizer']}_lr{cfg['learning_rate']}_{run_timestamp}"

    with mlflow.start_run(run_name=run_name) as run:
        print(f"MLflow Train: Run started. Run ID: {run.info.run_id}")
        mlflow.log_params(cfg)
        
        model = SimpleCNN_MLflow().to(device_mlflow)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=cfg['learning_rate'], momentum=0.9)

        global_step = 0
        final_epoch_accuracy = 0 
        for epoch in range(cfg['epochs']):
            model.train()
            running_loss = 0.0
            for i, data in enumerate(train_loader, 0):
                inputs, labels = data[0].to(device_mlflow), data[1].to(device_mlflow)
                optimizer.zero_grad()
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
                running_loss += loss.item()
                global_step += 1
                
                if i % 400 == 399:
                    batch_loss = running_loss / 400
                    mlflow.log_metric("step_loss_mlflow", batch_loss, step=global_step)
                    running_loss = 0.0
            
            # Epoch evaluation & logging
            model.eval()
            correct, total, test_loss = 0, 0, 0.0
            with torch.no_grad():
                for data in test_loader:
                    images, labels = data[0].to(device_mlflow), data[1].to(device_mlflow)
                    outputs = model(images)
                    loss = criterion(outputs, labels)
                    test_loss += loss.item()
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted == labels).sum().item()
            
            epoch_accuracy = 100 * correct / total
            epoch_test_loss = test_loss / len(test_loader)
            final_epoch_accuracy = epoch_accuracy
            print(f'MLflow Run - Epoch {epoch + 1} Test Acc: {epoch_accuracy:.2f}%, Test Loss: {epoch_test_loss:.4f}')
            
            mlflow.log_metric("test_accuracy_mlflow", epoch_accuracy, step=epoch)
            mlflow.log_metric("test_loss_mlflow", epoch_test_loss, step=epoch)

        # --- End of Training (within 'with' block) ---
        print("MLflow Train: Logging model and artifacts...")
        mlflow.pytorch.log_model(model, "model_mlflow")
        mlflow.log_metric("final_accuracy", final_epoch_accuracy)
        
        # Log a simple config file as artifact
        cfg_path = "mlflow_config.txt"
        with open(cfg_path, "w") as f: f.write(str(cfg))
        mlflow.log_artifact(cfg_path)
        if os.path.exists(cfg_path): os.remove(cfg_path) 
        
    print(f"MLflow Train: MLflow run finished. Run ID: {run.info.run_id}")
    return model

# --- Run Training with MLflow ---
if mlflow_dataset_loaded:
    model_mlflow = train_with_mlflow(mlflow_config, mlflow_trainloader, mlflow_testloader, mlflow_classes)
else:
    print("Skipping MLflow training run due to dataset loading error.")

### C.4 Viewing the MLflow UI

1.  Open a terminal.
2.  Navigate to the directory that contains `mlruns`.
3.  Run `mlflow ui`.
4.  Open `http://localhost:5000` in your browser.
5.  Find the experiment named `FashionMNIST Classification Revised` to view the run results.

## Comparison and Summary

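This tutorial showed, independently for each tool, how to use TensorBoard, Weights & Biases, and MLflow Tracking to track and visualize a simple PyTorch training process. Each tool has its own strengths and suitable use cases.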

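| Feature | TensorBoard | Weights & Biases (WandB) | MLflow Tracking |
|---------|-------------|--------------------------|-----------------|
| **Type** | Open-source visualization toolkit | Commercial cloud platform (free for individual/academic use) | Open-source MLOps platform (Tracking component) |
| **Core features** | Real-time monitoring, visualization (graphs, charts, embeddings) | Experiment tracking, visualization, collaboration, model management | Experiment tracking; logging of parameters, metrics, code, and models |
| **Setup** | Simple (usually installed with the framework) | Requires sign-up and login, `pip install wandb` | `pip install mlflow`, can run locally |
| **UI hosting** | Run the `tensorboard` command locally | Cloud dashboard | Run `mlflow ui` locally, or use a remote server |
| **Collaboration** | Limited (sharing log files) | Strong (teams, reports, projects) | Good (shared Tracking Server) |
| **Integrations** | PyTorch, TensorFlow, JAX, etc. | PyTorch, TF, Keras, Sklearn, XGBoost, etc. | Many frameworks (PyTorch, TF, Sklearn, etc.) |
| **Hyperparameter sweeps** | HParams plugin (fairly basic) | Powerful built-in Sweeps | Requires integration with other libraries (e.g., Hyperopt) |
| **Model/data versioning** | Not directly supported | Supports Artifacts (versioned) | Supports Artifacts (versioned) |
| **Deployment/registry** | None | Limited (integrates with deployment tools) | MLflow Models & Registry |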

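**Recommendations**: 
*   **TensorBoard**: The first choice for quick local visualization and debugging of training.
*   **WandB**: A strong fit when you need powerful cloud collaboration, advanced visualization, and integrated hyperparameter sweeps.
*   **MLflow**: Suited to scenarios that call for an open-source, self-hostable tool with a focus on experiment reproducibility, code/model versioning, and MLOps integration.

Which tool to choose depends on your specific needs and workflow preferences.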