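# Visualizing the Training Process with TensorBoard

This tutorial shows how to use TensorBoard to monitor and visualize the training of a neural network.

## Learning Objectives

1. Understand the basic features of TensorBoard
2. Learn how to set up log directories
3. Learn to use the TensorBoard callback
4. Get to know the advanced visualization features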

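## TensorBoard Feature Overview

| Feature | Description | Use |
|---------|-------------|-----|
| Scalars | Curves of scalar metrics | Monitor loss and metrics |
| Graphs | Computation graph visualization | Understand the model structure |
| Distributions | Weight distributions | Detect gradient problems |
| Histograms | Histograms | Analyze how weights change |
| Images | Image data | Visualize inputs/outputs |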
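
The Histograms and Images dashboards can also be fed directly through `tf.summary`, without going through a Keras callback. The following is a minimal sketch, assuming TensorFlow 2.x is installed; the directory prefix `tb_demo_`, the tags `demo/values` and `demo/noise`, and the random data are placeholders invented for illustration.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical scratch directory, used only for this demonstration
demo_logdir = tempfile.mkdtemp(prefix="tb_demo_")
writer = tf.summary.create_file_writer(demo_logdir)
rng = np.random.default_rng(0)

with writer.as_default():
    for step in range(3):
        # Histogram of synthetic values whose spread grows with the step
        values = rng.normal(loc=0.0, scale=1.0 + step, size=1000)
        tf.summary.histogram("demo/values", values, step=step)

        # Batch of two 16x16 grayscale images, shape [batch, height, width, channels]
        images = rng.random((2, 16, 16, 1)).astype("float32")
        tf.summary.image("demo/noise", images, step=step, max_outputs=2)

writer.flush()

# The writer stores its data in event files inside the log directory
event_files = [f for f in os.listdir(demo_logdir) if "tfevents" in f]
print(event_files)
```

Pointing `tensorboard --logdir` at the printed directory should then show the two tags under the Histograms, Distributions, and Images tabs.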

## 1. Environment Setup and Data Preparation

In [None]:
import os
import time
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Set the random seeds
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

print(f"TensorFlow version: {tf.__version__}")

In [None]:
# Load the data
housing = fetch_california_housing()

# Split the dataset into training, validation, and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=RANDOM_SEED
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=RANDOM_SEED
)

# Standardize (fit on the training set only, then apply to the other sets)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_valid.shape}")
print(f"Test set: {X_test.shape}")
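
The two chained splits above yield a 60/20/20 partition: 20% is held out for testing, and 25% of the remaining 80% (20% of the total) goes to validation. The arithmetic can be checked without loading any data. This is a small sketch: `split_sizes` is a hypothetical helper that mirrors how scikit-learn rounds a fractional `test_size` (ceiling for the test part, remainder for training), and 20640 is the number of samples in the California housing dataset.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper: the test part is ceil(test_size * n_samples)
    and the training part is whatever remains.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 20640  # samples in the California housing dataset
n_train_full, n_test = split_sizes(n_total, 0.2)
n_train, n_valid = split_sizes(n_train_full, 0.25)

print(n_train, n_valid, n_test)  # 12384 4128 4128
```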

## 2. Setting Up the Log Directory

Create a unique log directory for each training run so that different experiments can be compared.

In [None]:
# Set the root log directory
root_logdir = os.path.join(os.curdir, "my_logs")

def get_run_logdir(name=None):
    """
    Build a timestamped log directory path
    
    Parameters:
    -----------
    name : str, optional
        Experiment name, used to tell experiments apart
    
    Returns:
    --------
    str : Path of the log directory
    """
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    if name:
        run_id = f"{name}_{run_id}"
    return os.path.join(root_logdir, run_id)

# Build the log directory path for this run
run_logdir = get_run_logdir("baseline")
print(f"Log directory: {run_logdir}")
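
The timestamp has one-second resolution, so two runs with the same name that start within the same second would share a directory. One possible way to guard against that is sketched below. `get_unique_run_logdir` is a hypothetical variant, not part of the tutorial code; unlike `get_run_logdir`, it creates the directory on disk and appends a numeric suffix on collision.

```python
import os
import tempfile
import time


def get_unique_run_logdir(root, name=None):
    """Create and return a run directory under `root` that does not exist yet."""
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    if name:
        run_id = f"{name}_{run_id}"

    candidate = os.path.join(root, run_id)
    suffix = 1
    # Append _1, _2, ... until the path is unused
    while os.path.exists(candidate):
        candidate = os.path.join(root, f"{run_id}_{suffix}")
        suffix += 1

    os.makedirs(candidate)
    return candidate


# Demonstration in a scratch directory: two back-to-back calls never collide
demo_root = tempfile.mkdtemp(prefix="my_logs_")
first = get_unique_run_logdir(demo_root, "baseline")
second = get_unique_run_logdir(demo_root, "baseline")
print(first != second)  # True
```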

## 3. Basic TensorBoard Usage

Use the TensorBoard callback to record the training process.

In [None]:
def create_model(n_hidden=2, n_neurons=30, activation='relu'):
    """
    Build a regression model
    
    Parameters:
    -----------
    n_hidden : int
        Number of hidden layers
    n_neurons : int
        Number of neurons per layer
    activation : str
        Activation function
    
    Returns:
    --------
    keras.Model : The compiled model
    """
    model = keras.Sequential()
    model.add(keras.layers.InputLayer(input_shape=[8]))
    
    for _ in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation=activation))
    
    model.add(keras.layers.Dense(1))
    
    model.compile(
        loss='mse',
        optimizer='sgd',
        metrics=['mae']
    )
    
    return model

In [None]:
# Create the model
model = create_model()

# Create the TensorBoard callback
tensorboard_cb = keras.callbacks.TensorBoard(
    log_dir=run_logdir,
    histogram_freq=1,      # Log histograms every epoch
    write_graph=True,      # Log the computation graph
    write_images=False,    # Do not log weights as images
    update_freq='epoch',   # Write metrics once per epoch
    profile_batch=0        # Disable profiling (set to 2 to profile the second batch)
)

# Train the model
history = model.fit(
    X_train, y_train,
    epochs=30,
    validation_data=(X_valid, y_valid),
    callbacks=[tensorboard_cb],
    verbose=1
)

print(f"\nLogs saved to: {run_logdir}")
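
The `history` object returned by `fit` exposes the same per-epoch values that the Scalars dashboard plots, as a dictionary of lists in `history.history`. A quick way to locate the best epoch without opening TensorBoard is sketched below; `best_epoch` is a hypothetical helper and the numbers in `example_history` are made up for illustration.

```python
def best_epoch(history_dict, metric="val_loss"):
    """Return (1-based epoch, value) where `metric` is lowest."""
    values = history_dict[metric]
    index = min(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Made-up values standing in for history.history
example_history = {
    "loss": [0.90, 0.60, 0.50, 0.45],
    "val_loss": [0.80, 0.55, 0.60, 0.65],
}

epoch, value = best_epoch(example_history)
print(epoch, value)  # 2 0.55
```

With a real run, the call would be `best_epoch(history.history)`.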

## 4. Comparing Model Configurations

Use TensorBoard to compare how different hyperparameter configurations train.

In [None]:
# Define the configurations to compare
configs = [
    {'name': 'shallow', 'n_hidden': 1, 'n_neurons': 30},
    {'name': 'deep', 'n_hidden': 3, 'n_neurons': 30},
    {'name': 'wide', 'n_hidden': 1, 'n_neurons': 100},
]

# Train each configuration
for config in configs:
    print(f"\nTraining configuration: {config['name']}")
    print("="*40)
    
    # Create the model
    model = create_model(
        n_hidden=config['n_hidden'],
        n_neurons=config['n_neurons']
    )
    
    # Create a separate log directory for this configuration
    logdir = get_run_logdir(config['name'])
    
    # TensorBoard callback
    tensorboard_cb = keras.callbacks.TensorBoard(
        log_dir=logdir,
        histogram_freq=1
    )
    
    # Train
    history = model.fit(
        X_train, y_train,
        epochs=20,
        validation_data=(X_valid, y_valid),
        callbacks=[tensorboard_cb],
        verbose=0
    )
    
    # Evaluate
    test_loss, test_mae = model.evaluate(X_test, y_test, verbose=0)
    print(f"Test MSE: {test_loss:.4f}, test MAE: {test_mae:.4f}")
    print(f"Logs saved to: {logdir}")

## 5. Logging Custom Metrics

Use `tf.summary` to log custom metrics.

In [None]:
class CustomMetricsCallback(keras.callbacks.Callback):
    """
    Custom metrics callback
    
    Logs the learning rate, the train/validation loss ratio, and per-layer weight statistics
    """
    
    def __init__(self, log_dir):
        super().__init__()
        self.log_dir = log_dir
        self.writer = None
    
    def on_train_begin(self, logs=None):
        """Create the summary writer when training starts"""
        self.writer = tf.summary.create_file_writer(self.log_dir)
    
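    def on_epoch_end(self, epoch, logs=None):
        """Log custom metrics at the end of each epoch"""
        logs = logs or {}  # Guard against logs being None
        with self.writer.as_default():
            # Log the learning rate
            lr = float(self.model.optimizer.learning_rate)
            tf.summary.scalar('learning_rate', lr, step=epoch)
            
            # Compute and log the train/validation loss ratio (to detect overfitting:
            # a ratio well below 1 means validation loss is much higher than training loss)
            train_loss = logs.get('loss', 0)
            val_loss = logs.get('val_loss', 1)
            if val_loss > 0:
                overfit_ratio = train_loss / val_loss
                tf.summary.scalar('overfit_ratio', overfit_ratio, step=epoch)
            
            # Log weight statistics for every layer that has a kernel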
            for layer in self.model.layers:
                if hasattr(layer, 'kernel'):
                    weights = layer.kernel
                    tf.summary.scalar(
                        f'{layer.name}/weight_mean', 
                        tf.reduce_mean(weights), 
                        step=epoch
                    )
                    tf.summary.scalar(
                        f'{layer.name}/weight_std', 
                        tf.math.reduce_std(weights), 
                        step=epoch
                    )
            
            self.writer.flush()
    
    def on_train_end(self, logs=None):
        """Close the writer when training ends"""
        if self.writer:
            self.writer.close()

In [None]:
# Use the custom metrics callback
model = create_model(n_hidden=2, n_neurons=50)

custom_logdir = get_run_logdir("custom_metrics")

callbacks = [
    keras.callbacks.TensorBoard(log_dir=custom_logdir, histogram_freq=1),
    CustomMetricsCallback(log_dir=custom_logdir)
]

history = model.fit(
    X_train, y_train,
    epochs=30,
    validation_data=(X_valid, y_valid),
    callbacks=callbacks,
    verbose=1
)

print(f"\nCustom metrics logged to: {custom_logdir}")

## 6. Recording Hyperparameters with HParams

Use TensorBoard's HParams feature to record and compare hyperparameter experiments.

In [None]:
from tensorboard.plugins.hparams import api as hp

# Define the hyperparameter space
HP_NUM_HIDDEN = hp.HParam('num_hidden', hp.IntInterval(1, 4))
HP_NUM_NEURONS = hp.HParam('num_neurons', hp.Discrete([30, 50, 100]))
HP_LEARNING_RATE = hp.HParam('learning_rate', hp.RealInterval(0.001, 0.1))

# Define the metrics to track
METRIC_MSE = 'mse'
METRIC_MAE = 'mae'

def run_hparams_experiment(hparams, run_dir):
    """
    Run one hyperparameter experiment
    
    Parameters:
    -----------
    hparams : dict
        Dictionary of hyperparameters
    run_dir : str
        Log directory
    
    Returns:
    --------
    tuple : (test_mse, test_mae)
    """
    # Create the model
    model = keras.Sequential([
        keras.layers.InputLayer(input_shape=[8])
    ])
    
    for _ in range(hparams[HP_NUM_HIDDEN]):
        model.add(keras.layers.Dense(hparams[HP_NUM_NEURONS], activation='relu'))
    
    model.add(keras.layers.Dense(1))
    
    model.compile(
        loss='mse',
        optimizer=keras.optimizers.SGD(learning_rate=hparams[HP_LEARNING_RATE]),
        metrics=['mae']
    )
    
    # Train
    model.fit(
        X_train, y_train,
        epochs=20,
        validation_data=(X_valid, y_valid),
        callbacks=[
            keras.callbacks.TensorBoard(log_dir=run_dir),
            hp.KerasCallback(run_dir, hparams)
        ],
        verbose=0
    )
    
    # Evaluate
    test_mse, test_mae = model.evaluate(X_test, y_test, verbose=0)
    return test_mse, test_mae

In [None]:
# Run the hyperparameter search experiments
hparams_logdir = os.path.join(root_logdir, "hparams_search")

# Configure HParams
with tf.summary.create_file_writer(hparams_logdir).as_default():
    hp.hparams_config(
        hparams=[HP_NUM_HIDDEN, HP_NUM_NEURONS, HP_LEARNING_RATE],
        metrics=[hp.Metric(METRIC_MSE, display_name='MSE'),
                 hp.Metric(METRIC_MAE, display_name='MAE')]
    )

# Define the experiment configurations
experiments = [
    {HP_NUM_HIDDEN: 1, HP_NUM_NEURONS: 30, HP_LEARNING_RATE: 0.01},
    {HP_NUM_HIDDEN: 2, HP_NUM_NEURONS: 50, HP_LEARNING_RATE: 0.01},
    {HP_NUM_HIDDEN: 2, HP_NUM_NEURONS: 100, HP_LEARNING_RATE: 0.005},
    {HP_NUM_HIDDEN: 3, HP_NUM_NEURONS: 50, HP_LEARNING_RATE: 0.01},
]

# Run each experiment
print("Hyperparameter search experiments:")
print("="*60)

for i, hparams in enumerate(experiments):
    run_name = f"run_{i}"
    run_dir = os.path.join(hparams_logdir, run_name)
    
    print(f"\n{run_name}: hidden={hparams[HP_NUM_HIDDEN]}, "
          f"neurons={hparams[HP_NUM_NEURONS]}, lr={hparams[HP_LEARNING_RATE]}")
    
    test_mse, test_mae = run_hparams_experiment(hparams, run_dir)
    
    # Log the final metrics
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)
        tf.summary.scalar(METRIC_MSE, test_mse, step=1)
        tf.summary.scalar(METRIC_MAE, test_mae, step=1)
    
    print(f"  MSE: {test_mse:.4f}, MAE: {test_mae:.4f}")

print(f"\nAll experiment logs saved to: {hparams_logdir}")
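
The `experiments` list above is written out by hand. For a full grid search, the combinations can be generated instead. The sketch below uses only the standard library; `search_space` and `expand_grid` are hypothetical names, the candidate values are illustrative, and plain strings are used as keys so the snippet runs without TensorBoard installed.

```python
import itertools

# Illustrative candidate values for each hyperparameter
search_space = {
    "num_hidden": [1, 2, 3],
    "num_neurons": [30, 50, 100],
    "learning_rate": [0.005, 0.01],
}


def expand_grid(space):
    """Yield one dict per combination of the values in `space`."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


grid = list(expand_grid(search_space))
print(len(grid))  # 18 combinations (3 * 3 * 2)
print(grid[0])    # {'num_hidden': 1, 'num_neurons': 30, 'learning_rate': 0.005}
```

To reuse this with the experiment loop, each combination would need to be re-keyed with the `hp.HParam` objects, for example `{HP_NUM_HIDDEN: combo["num_hidden"], HP_NUM_NEURONS: combo["num_neurons"], HP_LEARNING_RATE: combo["learning_rate"]}`.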

## 7. Launching TensorBoard

### Option 1: Launch from the command line

Run in a terminal:
```bash
tensorboard --logdir=./my_logs --port=6006
```

Then open `http://localhost:6006` in a browser.
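
If port 6006 is already taken, for instance by an earlier TensorBoard instance, a different `--port` value is needed. One way to pick it is to probe for a free port first. This is a rough standard-library sketch; `port_is_free` and `first_free_port` are hypothetical helpers, and a port reported as free could still be claimed by another process before TensorBoard starts.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(start=6006, attempts=10):
    """Return the first free port in [start, start + attempts), or None."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    return None


port = first_free_port()
if port is not None:
    print(f"tensorboard --logdir=./my_logs --port={port}")
```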

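### Option 2: Launch inside Jupyter

In [None]:
# Load the TensorBoard extension in Jupyter
# Note: the %load_ext tensorboard magic ships with the tensorboard package itself
try:
    %load_ext tensorboard
    print("TensorBoard extension loaded")
except Exception:
    print("Could not load the TensorBoard extension; launch it from the command line instead")

In [None]:
# Launch TensorBoard (inside Jupyter)
# Uncomment the line below to launch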
# %tensorboard --logdir ./my_logs

## 8. Cleaning Up Log Files

In [None]:
import shutil

# List all log directories
print("Log directories created so far:")
if os.path.exists(root_logdir):
    for item in sorted(os.listdir(root_logdir)):
        item_path = os.path.join(root_logdir, item)
        if os.path.isdir(item_path):
            # Compute the directory size in KB
            size = sum(
                os.path.getsize(os.path.join(dirpath, f))
                for dirpath, dirnames, filenames in os.walk(item_path)
                for f in filenames
            ) / 1024
            print(f"  {item} ({size:.1f} KB)")

# Uncomment the code below to delete the logs
# if os.path.exists(root_logdir):
#     shutil.rmtree(root_logdir)
#     print("\nAll log files removed")
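
Deleting the whole root directory also removes runs that are still worth comparing. A gentler alternative is to prune only run directories older than some age. The sketch below is one possible approach, not part of the tutorial code: `prune_old_runs` is a hypothetical helper, it judges age by the directory's modification time, and it defaults to a dry run so nothing is deleted by accident.

```python
import os
import shutil
import tempfile
import time


def prune_old_runs(root, max_age_days=30, dry_run=True):
    """Return run directories under `root` older than `max_age_days`.

    The directories are only deleted when dry_run is False.
    """
    cutoff = time.time() - max_age_days * 86400
    selected = []
    if not os.path.isdir(root):
        return selected
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path) and os.path.getmtime(path) < cutoff:
            selected.append(path)
            if not dry_run:
                shutil.rmtree(path)
    return selected


# Demonstration in a scratch directory with one old and one recent run
demo_root = tempfile.mkdtemp(prefix="my_logs_")
old_run = os.path.join(demo_root, "old_run")
new_run = os.path.join(demo_root, "new_run")
os.makedirs(old_run)
os.makedirs(new_run)

# Backdate the old run by 60 days
sixty_days_ago = time.time() - 60 * 86400
os.utime(old_run, (sixty_days_ago, sixty_days_ago))

candidates = prune_old_runs(demo_root, max_age_days=30)  # dry run
print(candidates == [old_run])  # True
```

Calling `prune_old_runs(root_logdir, max_age_days=30, dry_run=False)` would then actually remove the selected directories.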

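## Summary

### TensorBoard Callback Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| log_dir | 'logs' | Directory where logs are saved |
| histogram_freq | 0 | Histogram logging frequency in epochs (0 disables it) |
| write_graph | True | Whether to log the computation graph |
| write_images | False | Whether to visualize weights as images |
| update_freq | 'epoch' | Update frequency ('batch' / 'epoch' / integer) |
| profile_batch | 0 | Batch(es) to profile (0 disables profiling; some older TensorFlow 2.x releases defaulted to 2) |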

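### Best Practices

1. **Naming convention**: Give every experiment a descriptive name
2. **Versioning**: Include a timestamp in the log directory name
3. **Regular cleanup**: Delete old logs you no longer need
4. **Record hyperparameters**: Use the HParams feature to make comparisons easier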

### Common Commands

```bash
# Basic launch
tensorboard --logdir=./my_logs

# Use a specific port
tensorboard --logdir=./my_logs --port=6007

# Bind to all network interfaces (remote access)
tensorboard --logdir=./my_logs --host=0.0.0.0
```