## Step 1: Check the GPU (no Drive mount required)

In [None]:
# Check the GPU
!nvidia-smi

KeyboardInterrupt: 
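
The `nvidia-smi` cell above ended in a `KeyboardInterrupt`, which suggests the command was stopped by hand (possibly because it hung). A minimal sketch of running the same check with a timeout, so the cell returns on its own. The helper name `run_with_timeout` and the 15-second limit are illustrative choices, not part of the original notebook.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=15):
    """Run a command and return its stdout, or None if it is missing, fails, or times out."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except (OSError, subprocess.SubprocessError):
        # OSError: binary not found; SubprocessError: timeout or non-zero exit.
        return None
    return result.stdout.strip()


# Ask nvidia-smi only for the GPU name and total memory, in CSV form.
gpu_info = run_with_timeout(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]
)
print(gpu_info if gpu_info is not None else "nvidia-smi unavailable or timed out")
```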

In [3]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU model: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
print(f"GPU count: {torch.cuda.device_count()}")

CUDA available: True
GPU model: Tesla T4
GPU count: 1


## Step 2: Install Miniconda3

In [4]:
%%bash
# Install Miniconda3 (if not already installed)
if [ ! -d "/root/miniconda3" ]; then
    echo "Installing Miniconda3..."
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
    bash /tmp/miniconda.sh -b -p /root/miniconda3
    rm /tmp/miniconda.sh
    echo "Miniconda3 installation complete"
else
    echo "Miniconda3 is already installed"
fi

# Initialize conda for bash. This only affects shells that read /root/.bashrc;
# later cells call conda by its full path or set PATH explicitly.
/root/miniconda3/bin/conda init bash
source /root/.bashrc

Installing Miniconda3...
PREFIX=/root/miniconda3
Unpacking bootstrapper...
Unpacking payload...

Installing base environment...

Preparing transaction: ...working... done
Executing transaction: ...working... done
installation finished.
    You currently have a PYTHONPATH environment variable set. This may cause
    unexpected behavior when running the Python interpreter in Miniconda3.
    For best results, please verify that your PYTHONPATH only points to
    directories of packages that are compatible with the Python interpreter
    in Miniconda3: /root/miniconda3
Miniconda3 installation complete
no change     /root/miniconda3/condabin/conda
no change     /root/miniconda3/bin/conda
no change     /root/miniconda3/bin/conda-env
no change     /root/miniconda3/bin/activate
no change     /root/miniconda3/bin/deactivate
no change     /root/miniconda3/etc/profile.d/conda.sh
no change     /root/miniconda3/etc/fish/conf.d/conda.fish
no change     /root/miniconda3/shell/condabin/Conda.psm1
no change     /root/miniconda

--2025-12-11 11:31:53--  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.191.158, 104.16.32.241, 2606:4700::6810:20f1, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.191.158|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 157891003 (151M) [application/octet-stream]
Saving to: ‘/tmp/miniconda.sh’


In [5]:
# Verify the conda installation
!/root/miniconda3/bin/conda --version

conda 25.9.1


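## Step 3: Upload the code to Colab

**Choose one of the following ways to upload the code:**

### Method A: Upload directly from VS Code (recommended)
In the VS Code file explorer, drag the `tess-diffusion` folder into the `/content/` directory.

### Method B: Use the Colab file upload
Run the code below and select the zip file to upload.

### Method C: Download from GitHub or a URL
If the code is hosted on GitHub, you can download it directly with `wget`.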

In [None]:
import os
import zipfile

# Set paths
zip_path = '/content/tess-diffusion-colab.zip'  # used only with the zip upload method
extract_to = '/content/tess-diffusion'

# Method A: the tess-diffusion folder was already dragged in via VS Code
if os.path.isdir(extract_to):
    print(f"✓ Code directory already exists: {extract_to}")
    %cd {extract_to}
    print(f"Current directory: {os.getcwd()}")

# Method B: extract from the zip file
elif os.path.exists(zip_path):
    print(f"Extracting {zip_path}...")
    if os.path.exists(extract_to):
        # Reaching here means the path exists but is not a directory: remove the stray file
        os.remove(extract_to)
    with zipfile.ZipFile(zip_path, 'r') as zf:
        zf.extractall(extract_to)
    print("✓ Extraction complete")
    %cd {extract_to}
    print(f"Current directory: {os.getcwd()}")

# Neither found: print upload instructions
else:
    print("Choose an upload method:")
    print("1. In VS Code, drag the tess-diffusion folder into /content/")
    print("2. Or run the code below to upload the zip file:")
    print("")
    print("from google.colab import files")
    print("uploaded = files.upload()")
    print("# Then re-run this cell")
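
Whether `tess-diffusion-colab.zip` wraps its files in a top-level folder is not stated. If it does, `extractall(extract_to)` leaves the project one level deeper than expected, and the relative paths used in later cells will not resolve. A hedged sketch of detecting that case; `single_top_level_dir` is a hypothetical helper, not part of the repository.

```python
import zipfile
from pathlib import PurePosixPath


def single_top_level_dir(zip_path):
    """Return the archive's only top-level folder name, or None if files sit at the root."""
    with zipfile.ZipFile(zip_path) as zf:
        names = [n for n in zf.namelist() if n.strip("/")]
    if not names:
        return None
    tops = {PurePosixPath(n).parts[0] for n in names}
    # A lone top-level entry counts as a wrapper only if something is nested inside it.
    has_nested = any(len(PurePosixPath(n).parts) > 1 for n in names)
    return tops.pop() if len(tops) == 1 and has_nested else None
```

If the function returns a name, change into `os.path.join(extract_to, name)` rather than `extract_to` before running the later cells.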

In [6]:
# Verify key files
# Paths are relative, so the current directory must be the project folder
import os

check_files = [
    'tess_train1_oneline.txt',
    'tess_valid1_oneline.txt',
    'run_mlm.py',
    'extend_tokenizer_vocab.py',
    'configs/tess_gpu_oneline_sc.json',
    'environment.yaml'
]

print("="*60)
print("File check")
print("="*60)
all_exists = True
for f in check_files:
    exists = os.path.exists(f)
    status = "✓" if exists else "✗"
    print(f"{status} {f}")
    if not exists:
        all_exists = False

if all_exists:
    print("\n✓ All files exist!")
else:
    print("\n✗ Some files are missing; check the archive")

File check
✗ tess_train1_oneline.txt
✗ tess_valid1_oneline.txt
✗ run_mlm.py
✗ extend_tokenizer_vocab.py
✗ configs/tess_gpu_oneline_sc.json
✗ environment.yaml

✗ Some files are missing; check the archive
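
Every file in the output above is reported missing. The check uses relative paths, so one plausible cause (an assumption; the notebook does not say) is that the working directory is not the project folder, for instance because the upload cell was not run first. A small sketch that searches below a starting directory for a folder containing all required files; `find_project_root` is a hypothetical helper.

```python
import os


def find_project_root(start, required):
    """Return the first directory under `start` that contains every path in `required`."""
    for dirpath, _dirnames, _filenames in os.walk(start):
        if all(os.path.exists(os.path.join(dirpath, rel)) for rel in required):
            return dirpath
    return None
```

Calling `find_project_root('/content', check_files)` would return the folder to `%cd` into, or `None` if the files were never uploaded.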


## Step 4: Create the Conda environment and install dependencies (~10 minutes)

Uses the updated `environment.yaml` configuration.

In [None]:
%%bash
# Create the environment from environment.yaml
export PATH="/root/miniconda3/bin:$PATH"

# Remove the old environment (if it exists)
conda env remove -n sdlm -y 2>/dev/null || true

# Create the new environment
echo "Creating conda environment sdlm..."
conda env create -f environment.yaml

echo "✓ Environment created"

In [None]:
# Verify the environment and dependency versions
!/root/miniconda3/envs/sdlm/bin/python -c "
import sys
import torch
import transformers
import diffusers
import datasets
import accelerate
import numpy
import scipy

print('='*60)
print('Environment check')
print('='*60)
print(f'Python: {sys.version}')
print(f'PyTorch: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'Transformers: {transformers.__version__}')
print(f'Diffusers: {diffusers.__version__}')
print(f'Datasets: {datasets.__version__}')
print(f'Accelerate: {accelerate.__version__}')
print(f'Numpy: {numpy.__version__}')
print(f'Scipy: {scipy.__version__}')
print('='*60)
"

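The check above imports each package, so a single broken install aborts the whole report. A sketch of an alternative that reads installed versions from package metadata with the standard-library `importlib.metadata` (Python 3.8+) and lists missing packages instead of failing. The distribution names are assumed to match the import names used above (`torch`, `transformers`, and so on).

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(
    ["torch", "transformers", "diffusers", "datasets", "accelerate", "numpy", "scipy"]
)
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it with the environment's interpreter (`/root/miniconda3/envs/sdlm/bin/python`) so that it inspects `sdlm` rather than the Colab kernel.
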
## Step 5: Extend the tokenizer vocabulary (~3 minutes)

In [None]:
# Inspect the data files
!head -n 3 tess_train1_oneline.txt
!echo "---"
!wc -l tess_*.txt

In [None]:
# Extend the tokenizer
!/root/miniconda3/envs/sdlm/bin/python extend_tokenizer_vocab.py \
    --train_file tess_train1_oneline.txt \
    --base_model roberta-base \
    --output_dir extended_tokenizer

print("\n✓ Tokenizer extension complete")

# View the statistics
!cat extended_tokenizer/vocab_extension_stats.json

In [None]:
# Verify the tokenizer extension
!/root/miniconda3/envs/sdlm/bin/python -c "
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('extended_tokenizer')
test_entities = ['South_Korea', 'North_Korea', 'Cleric_(Iraq)', 'Government_Official_(Turkey)']

print('='*60)
print('Tokenizer extension check')
print('='*60)
print(f'Vocabulary size: {len(tokenizer)}')
print()

all_single = True
for entity in test_entities:
    tokens = tokenizer.tokenize(entity)
    is_single = len(tokens) == 1 and tokens[0] == entity
    status = '✓' if is_single else '✗'
    print(f'{status} {entity} → {tokens}')
    if not is_single:
        all_single = False

print('='*60)
if all_single:
    print('✓ Every entity is a single token; extension succeeded!')
else:
    print('✗ Some entities were split into multiple tokens; the extension has a problem')
"

## Step 6: Configure training parameters

Use `tess_gpu_oneline_sc.json` as the base configuration.

In [None]:
import json
import os

# Load the GPU config
with open('configs/tess_gpu_oneline_sc.json', 'r') as f:
    config = json.load(f)

# Override settings for Colab local storage (no Drive)
config.update({
    'tokenizer_name': 'extended_tokenizer',
    'output_dir': '/content/tess_outputs',  # local path
    'per_device_train_batch_size': 8,  # adjust to fit GPU memory
    'per_device_eval_batch_size': 8,
    'num_train_epochs': 3,
    'fp16': True,
    'time_save_interval_seconds': 1800,
    'gdrive_backup_dir': None,  # no Drive backup
    'backup_keep_last': 2,
    'save_total_limit': 5
})

# Save the Colab-specific config
with open('configs/tess_colab_updated.json', 'w') as f:
    json.dump(config, f, indent=2)

print("✓ Config updated")
print("\nKey settings:")
print(f"  - tokenizer: {config['tokenizer_name']}")
print(f"  - output_dir: {config['output_dir']}")
print(f"  - batch_size: {config['per_device_train_batch_size']}")
print(f"  - epochs: {config['num_train_epochs']}")
print(f"  - fp16: {config['fp16']}")
print(f"  - save_total_limit: {config['save_total_limit']}")
print("\n⚠️ Note: outputs are saved to /content/tess_outputs/")
print("Download the results promptly after training; data is lost when the session ends!")

## Step 7: Train the model

### Option A: Quick validation (1 epoch, ~2 hours)

In [None]:
# Quick training: 1 epoch to validate the setup
!/root/miniconda3/envs/sdlm/bin/python run_mlm.py \
    --model_name_or_path roberta-base \
    --tokenizer_name extended_tokenizer \
    --train_file tess_train1_oneline.txt \
    --validation_file tess_valid1_oneline.txt \
    --output_dir /content/tess_outputs_quick \
    --line_by_line True \
    --max_seq_length 256 \
    --pad_to_max_length True \
    --per_device_train_batch_size 8 \
    --num_train_epochs 1 \
    --save_steps 500 \
    --save_total_limit 5 \
    --eval_steps 500 \
    --logging_steps 50 \
    --fp16 True \
    --simplex_value 5 \
    --num_diffusion_steps 500 \
    --self_condition logits_addition \
    --self_condition_zeros_after_softmax True \
    --overwrite_output_dir True
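
To judge whether `--save_steps 500` and `--save_total_limit 5` suit a run, it helps to estimate the number of optimizer steps. The sketch below assumes each non-empty line of the training file is one example (the usual meaning of `--line_by_line` in Hugging Face's `run_mlm.py`; this repository's script may differ). The line count in the usage example is made up.

```python
import math


def training_steps(num_examples, batch_size, epochs=1):
    """Estimate total optimizer steps, where batch_size is examples per optimizer step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


def checkpoints_written(total_steps, save_steps):
    """Number of periodic checkpoints saved, before any pruning by save_total_limit."""
    return total_steps // save_steps


# Hypothetical dataset of 20,000 lines with the quick-run settings above.
steps = training_steps(num_examples=20_000, batch_size=8, epochs=1)
print(steps, checkpoints_written(steps, save_steps=500))  # 2500 5
```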

### Option B: Full training (3 epochs, ~6-7 hours)

In [None]:
# Full training, using the config file
!/root/miniconda3/envs/sdlm/bin/python run_mlm.py configs/tess_colab_updated.json

## Step 8: Monitor training

In [None]:
# Launch TensorBoard (for the quick run, point --logdir at /content/tess_outputs_quick)
%load_ext tensorboard
%tensorboard --logdir /content/tess_outputs

In [None]:
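# List the most recent checkpoint directories.
# The fallback is grouped with ls so it fires when no checkpoint matches;
# in the original pipeline the exit status came from head, so the message never printed.
!(ls -dlht /content/tess_outputs/checkpoint-* 2>/dev/null || echo "No checkpoints yet") | head -5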

## Step 9: Evaluate the model

In [None]:
# Find the latest checkpoint
import os
import glob

output_dir = '/content/tess_outputs'
checkpoint_dirs = glob.glob(f"{output_dir}/checkpoint-*")

if checkpoint_dirs:
    latest_checkpoint = max(checkpoint_dirs, key=lambda x: int(x.split('-')[-1]))
    print(f"✓ Using checkpoint: {latest_checkpoint}")
    checkpoint_path = latest_checkpoint
else:
    print("✗ No checkpoint found; finish training first")
    checkpoint_path = None
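
The `int(x.split('-')[-1])` key raises `ValueError` if any matching directory has a non-numeric suffix (a hypothetical `checkpoint-best`, say). A slightly more defensive sketch that only considers `checkpoint-<number>` directories; `latest_checkpoint` here is an illustrative helper name.

```python
import re
from pathlib import Path

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if match and child.is_dir():
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare by step number so that 3000 beats 500 (a string sort would not).
    return str(max(candidates, key=lambda item: item[0])[1])
```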

In [None]:
# Quick evaluation (if run_optimized_eval.py is present)
if checkpoint_path and os.path.exists('run_optimized_eval.py'):
    !/root/miniconda3/envs/sdlm/bin/python run_optimized_eval.py \
        --checkpoint {checkpoint_path} \
        --mode tail \
        --quick
else:
    print("Skipping evaluation (missing checkpoint or evaluation script)")

## Step 10: Download the training results

Download the results to your local machine promptly after training finishes.

In [None]:
# Package all results for download
import os
import subprocess

if os.path.exists('/content/tess_outputs'):
    print("Packaging training results...")
    !cd /content && tar -czf tess_training_results.tar.gz tess_outputs/
    
    print("\nDownload using either of the following:")
    print("1. In the VS Code file explorer, right-click /content/tess_training_results.tar.gz → Download")
    print("2. Or run the code below:")
    print("")
    print("from google.colab import files")
    print("files.download('/content/tess_training_results.tar.gz')")
    
    # Show the archive size
    size_output = subprocess.check_output(['du', '-sh', '/content/tess_training_results.tar.gz']).decode('utf-8')
    print(f"\nArchive size: {size_output.split()[0]}")
else:
    print("✗ Training output directory not found")

---

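## FAQ

### 1. How do I upload the code?
**Recommended: upload directly from VS Code:**
1. In the VS Code file explorer, locate the `/content/` directory
2. Drag the entire `tess-diffusion` folder into `/content/`
3. After the upload finishes, run the Step 3 code to verify

**Alternative: use the Colab upload:**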
```python
from google.colab import files
uploaded = files.upload()  # select the zip file
```

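### 2. How are training results saved?
**⚠️ Important**: the `/content/` directory is wiped when the Colab session ends!

**How to download:**
- In the VS Code file explorer, right-click `/content/tess_outputs/` → Download
- Or use the packaging code from Step 10

**Tip**: download checkpoint backups periodically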
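
For periodic backups, archiving a single checkpoint is typically smaller than archiving all of `/content/tess_outputs/`. A sketch using the standard-library `tarfile` module; the function name `archive_directory` is illustrative.

```python
import os
import tarfile


def archive_directory(src_dir, archive_path):
    """Write src_dir into a gzip-compressed tarball and return the archive size in bytes."""
    # arcname stores only the folder's own name, not its absolute path, inside the archive.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(os.path.normpath(src_dir)))
    return os.path.getsize(archive_path)
```

For example, `archive_directory('/content/tess_outputs/checkpoint-3000', '/content/checkpoint-3000.tar.gz')` produces one file that can then be downloaded from the file explorer.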

### 3. Resuming interrupted training
```python
!/root/miniconda3/envs/sdlm/bin/python run_mlm.py \
    --resume_from_checkpoint /content/tess_outputs/checkpoint-3000 \
    configs/tess_colab_updated.json
```
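
The command above passes both a flag and a JSON file. Scripts built on Hugging Face's `HfArgumentParser` commonly read a JSON config only when it is the sole argument, so the combination may not parse; whether this repository's `run_mlm.py` behaves that way is an assumption to verify. If it does, one alternative is to write `resume_from_checkpoint` (a standard `TrainingArguments` field) into a copy of the config. The helper name and the output filename in the usage note are hypothetical.

```python
import json


def write_resume_config(config_path, checkpoint_dir, out_path):
    """Copy a JSON config to out_path with resume_from_checkpoint set."""
    with open(config_path, "r") as f:
        config = json.load(f)
    config["resume_from_checkpoint"] = checkpoint_dir
    with open(out_path, "w") as f:
        json.dump(config, f, indent=2)
    return config
```

After calling `write_resume_config('configs/tess_colab_updated.json', '/content/tess_outputs/checkpoint-3000', 'configs/tess_colab_resume.json')`, the script would be launched with the new JSON file as its only argument.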

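### 4. Out of memory
Reduce the per-device batch size and compensate with gradient accumulation (4 × 2 gives the same effective batch size as 8 × 1):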
```python
# Update the config, then re-save configs/tess_colab_updated.json
config['per_device_train_batch_size'] = 4
config['gradient_accumulation_steps'] = 2
```
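
The arithmetic behind this change: the effective batch size is the per-device batch size times the gradient accumulation steps times the number of GPUs, so halving one factor and doubling another leaves it unchanged while fewer examples sit in GPU memory at once. A worked check, assuming a single GPU (the notebook reports one T4); the function name is illustrative.

```python
def effective_batch_size(per_device_batch, grad_accum_steps=1, num_gpus=1):
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_gpus


original = effective_batch_size(per_device_batch=8)
reduced = effective_batch_size(per_device_batch=4, grad_accum_steps=2)
print(original, reduced)  # 8 8
```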

### 5. Dependency conflicts
```bash
# Recreate the environment
conda env remove -n sdlm -y
conda env create -f environment.yaml
```

---

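## Expected performance

With the updated dependencies and configuration:
- **Training time**: 6-7 hours (3 epochs, T4 GPU)
- **Tail MRR**: target 35-45%
- **Tail Hits@10**: target 55-65%
- **Number of checkpoints**: at most 5 (managed automatically)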
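
For reference, the two metrics above are conventionally computed from the rank of the correct tail entity for each test query: MRR is the mean of 1/rank, and Hits@10 is the fraction of queries with rank at most 10. The sketch below uses these standard definitions with made-up ranks; the repository's evaluation script may differ in details such as candidate filtering.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries (ranks start at 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k=10):
    """Fraction of queries whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


example_ranks = [1, 2, 10, 50]  # made-up ranks, for illustration only
print(f"MRR: {mean_reciprocal_rank(example_ranks):.3f}")  # MRR: 0.405
print(f"Hits@10: {hits_at_k(example_ranks):.2f}")  # Hits@10: 0.75
```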

---

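## Workflow summary

1. **Upload the code** → drag `tess-diffusion` into `/content/`
2. **Set up the environment** → install Miniconda and create the conda environment
3. **Extend the tokenizer** → handles the entity vocabulary automatically
4. **Start training** → choose quick or full training
5. **Monitor progress** → watch live in TensorBoard
6. **Download the results** → package and download right after training finishes

**No Google Drive needed; everything is done from VS Code!**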