# Distilling knowledge in models pretrained on CIFAR-10/100 datasets, using ***torchdistill***

## 1. Make sure you have access to GPU/TPU
Google Colab: *Runtime* -> *Change runtime type* -> *Hardware accelerator*: "GPU" or "TPU"

In [1]:
!pwd

/content


In [1]:
!nvidia-smi

Wed Jul 16 02:41:30 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   38C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
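
If you prefer a programmatic check over reading the table above, the following is a minimal sketch. It only assumes that `nvidia-smi` is on the `PATH` when a GPU runtime is attached; the helper name `gpu_visible` is hypothetical and not part of ***torchdistill***.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if `nvidia-smi -L` runs successfully and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tools on PATH (e.g., a CPU or TPU runtime).
        return False
    try:
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU <index>:".
    return result.returncode == 0 and result.stdout.lstrip().startswith("GPU ")


print("GPU visible:", gpu_visible())
```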
                                                

## 2. Install ***torchdistill***

In [None]:
!pip install torchdistill

Collecting torchdistill
  Downloading torchdistill-1.1.3-py3-none-any.whl.metadata (24 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.5.1->torchdistill)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.5.1->torchdistill)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.5.1->torchdistill)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.5.1->torchdistill)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=2.5.1->torchdistill)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (fro

## 3. Clone ***torchdistill*** repository to use its example code and configuration files

In [3]:
!git clone https://github.com/yoshitomo-matsubara/torchdistill.git

Cloning into 'torchdistill'...
remote: Enumerating objects: 11067, done.
remote: Counting objects: 100% (2030/2030), done.
remote: Compressing objects: 100% (516/516), done.
remote: Total 11067 (delta 1596), reused 1532 (delta 1514), pack-reused 9037 (from 5)
Receiving objects: 100% (11067/11067), 10.81 MiB | 27.88 MiB/s, done.
Resolving deltas: 100% (6778/6778), done.
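
Later cells in this notebook run `examples/torchvision/image_classification.py`, so it can help to confirm where a script lives in the cloned checkout before calling it. The sketch below searches the `examples/` directory of a checkout; the helper name `find_scripts` is hypothetical.

```python
from pathlib import Path


def find_scripts(repo_root, script_name="image_classification.py"):
    """Return sorted paths (relative to repo_root) of files named script_name under examples/."""
    examples_dir = Path(repo_root) / "examples"
    # rglob walks examples/ recursively; a missing directory simply yields no matches.
    return sorted(p.relative_to(repo_root).as_posix() for p in examples_dir.rglob(script_name))


# Run from the directory that contains the cloned `torchdistill` folder.
print(find_scripts("torchdistill"))
```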


## 4. Distill knowledge in models pretrained on CIFAR-10

Note that the hyperparameters of ResNet, WRN (Wide ResNet), and DenseNet-BC were chosen based on either train/val (splitting 50k samples into train:val = 45k:5k) or cross-validation, according to the original papers.  
For the final run (once the hyperparameters are finalized), the authors used all the training images (50k samples).  
- ResNet: https://github.com/facebookarchive/fb.resnet.torch
- WRN (Wide ResNet): https://github.com/szagoruyko/wide-residual-networks
- DenseNet-BC: https://github.com/liuzhuang13/DenseNet
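
The 45k:5k split described above can be pictured with a short sketch. This is only an illustration of partitioning 50k sample indices with a fixed seed; it is not how ***torchdistill*** builds the split (the sample config handles that), and the function name and seed are hypothetical.

```python
import random


def split_indices(num_samples=50_000, num_val=5_000, seed=0):
    """Shuffle sample indices with a fixed seed and hold out num_val of them for validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return indices[num_val:], indices[:num_val]


train_idx, val_idx = split_indices()
print(len(train_idx), len(val_idx))  # 45000 5000
```

The two index lists are disjoint, so no validation sample leaks into the training subset.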

The following examples demonstrate how to 1) tune hyperparameters and 2) perform the final run with ResNet-20 on the CIFAR-10 dataset.

### 4.1 Hyperparameter tuning based on train:val = 45k:5k
Let's start with a small **student model**, ResNet-20, and a pretrained DenseNet-BC (k=12, depth=100) as the **teacher model** for this tutorial.  

Open `torchdistill/configs/sample/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-hyperparameter_tuning.yaml` and update the hyperparameters as you wish, e.g., the number of epochs (*num_epochs*), batch size (*batch_size* in the *train_data_loader* entry), learning rate (*lr* within the *optimizer* entry), and so on.
By default, the hyperparameters in the example config are identical to those in the final run config.
  
The config refers to many modules from [PyTorch](https://pytorch.org/docs/stable/index.html) and [torchvision](https://pytorch.org/docs/stable/torchvision/), such as [`SGD`](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD), [`MultiStepLR`](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.MultiStepLR), [`CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss), [`CIFAR10`](https://pytorch.org/docs/stable/torchvision/datasets.html#torchvision.datasets.CIFAR10), [`RandomCrop`](https://pytorch.org/docs/stable/torchvision/transforms.html#torchvision.transforms.RandomCrop), and more. You can update their parameters or replace them with other modules from those packages. For instance, `SGD` could be replaced with [`Adam`](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam), in which case you also need to change the parameters under the optimizer's `kwargs` entry (`params` in legacy configs); at minimum, delete the `momentum` entry, which `Adam` does not accept.
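
To make the optimizer swap concrete, here is a minimal sketch that edits an in-memory copy of the `optimizer` entry. The dictionary mirrors the `key`/`kwargs` layout and the SGD values of the final-run config used in this notebook; the helper `swap_to_adam` and the learning rate `0.001` are hypothetical choices for illustration, not recommended settings.

```python
import copy

# In-memory stand-in for the `optimizer` entry of the YAML config.
sgd_entry = {"key": "SGD", "kwargs": {"lr": 0.1, "momentum": 0.9, "weight_decay": 0.0001}}


def swap_to_adam(optimizer_entry, lr=0.001):
    """Return a copy of the entry that selects Adam and drops the SGD-only `momentum` argument."""
    entry = copy.deepcopy(optimizer_entry)
    entry["key"] = "Adam"
    entry["kwargs"].pop("momentum", None)  # Adam has no `momentum` parameter
    entry["kwargs"]["lr"] = lr  # hypothetical value; tune it for your setup
    return entry


adam_entry = swap_to_adam(sgd_entry)
print(adam_entry)
```

In practice you would make the same change directly in the YAML file; the copy here just keeps the original entry intact for comparison.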

In [4]:
!python torchdistill/examples/torchvision/image_classification.py --config torchdistill/configs/sample/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-hyperparameter_tuning.yaml --run_log log/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-hyperparameter_tuning.log


### 4.2 Final run with hyperparameters determined by the above hyperparameter tuning
Once you have tuned the hyperparameters, you can update the values in **a config file whose name ends with "-final_run.yaml"**. Notice that the only difference between the default example configs for hyperparameter tuning and the final run is the `datasets` entry.
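
The final-run config used in this notebook trains for 182 epochs with `MultiStepLR` (`milestones: [91, 136]`, `gamma: 0.1`) and an initial `lr` of 0.1. The sketch below reproduces the closed-form schedule documented for `MultiStepLR` in plain Python, so you can see what a change to these values does to the learning rate before launching a long run. The function name `multistep_lr` is hypothetical.

```python
from bisect import bisect_right


def multistep_lr(epoch, base_lr=0.1, milestones=(91, 136), gamma=0.1):
    """Learning rate at `epoch`: base_lr is multiplied by gamma once per milestone already reached."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


for epoch in (0, 90, 91, 135, 136, 181):
    print(epoch, multistep_lr(epoch))
```

If you shorten or lengthen `num_epochs`, you will probably want to move the milestones accordingly.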

In [6]:
%cd torchdistill

/content/torchdistill


In [8]:
!pwd

/content/torchdistill


In [11]:
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.ToTensor()])
dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
print("CIFAR-10 train set size:", len(dataset))


100%|██████████| 170M/170M [00:02<00:00, 76.0MB/s]


CIFAR-10 train set size: 50000
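
The cell above applies only `ToTensor`, while the final-run config used in this notebook also applies `Normalize` with per-channel statistics. As documented for torchvision, `Normalize` computes `(x - mean) / std` per channel. The sketch below applies that formula to a single RGB pixel; the helper name `normalize_pixel` is hypothetical.

```python
# Per-channel statistics copied from the `Normalize` entry in the config used in this notebook.
MEAN = (0.49139968, 0.48215841, 0.44653091)
STD = (0.24703223, 0.24348513, 0.26158784)


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (x - mean) / std to one RGB pixel whose values are already scaled to [0, 1]."""
    return tuple((x - m) / s for x, m, s in zip(rgb, mean, std))


print(normalize_pixel((0.5, 0.5, 0.5)))
```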


In [12]:
!cat configs/sample/cifar10/kd/resnet20_from_resnet56.yaml

cat: configs/sample/cifar10/kd/resnet20_from_resnet56.yaml: No such file or directory


In [16]:
!python /content/torchdistill/examples/torchvision/image_classification.py --config /content/torchdistill/configs/sample/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-final_run.yaml --run_log /content/torchdistill/log/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-final_run.log

2025/07/16 03:29:06	INFO	torchdistill.common.main_util	Not using distributed mode
2025/07/16 03:29:06	INFO	__main__	Namespace(config='/content/torchdistill/configs/sample/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-final_run.yaml', device='cuda', run_log='/content/torchdistill/log/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-final_run.log', start_epoch=0, seed=None, disable_cudnn_benchmark=False, test_only=False, student_only=False, log_config=False, world_size=1, dist_url='env://', adjust_lr=False)
2025/07/16 03:29:06	INFO	torchdistill.common.main_util	Getting `RandomSampler` from `torch.utils.data`
2025/07/16 03:29:06	INFO	torchdistill.common.main_util	Getting `SequentialSampler` from `torch.utils.data`
2025/07/16 03:29:06	INFO	torchdistill.common.main_util	ckpt file path is None
2025/07/16 03:29:06	INFO	torchdistill.common.main_util	ckpt file path is None
2025/07/16 03:29:06	INFO	__main__	Start training
2025/07/16 03:29:06	INFO	torchdistill.core.distillation	[teacher mode

At the end of training, you should see improved accuracy for the student model (ResNet-20) compared with the same model trained without a teacher in another example notebook and/or the accuracy reported in [the ResNet paper (Table 6)](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf).
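
The `KDLoss` entry in the final-run config used in this notebook sets `temperature: 4.0` and `alpha: 0.9`. The following is a minimal pure-Python sketch of the standard (Hinton-style) knowledge distillation loss for a single sample, not ***torchdistill***'s implementation. It combines a cross-entropy term on the ground-truth label with a temperature-softened KL term between teacher and student, scaled by the squared temperature. The function names and the explicit `hard_weight`/`soft_weight` parameters are assumptions made for illustration; check the library source for how `alpha` maps onto the two terms.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits divided by `temperature`."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, target, temperature=4.0,
            hard_weight=0.5, soft_weight=0.5):
    """Weighted sum of cross-entropy (hard) and temperature-scaled KL divergence (soft)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    soft = sum(pt * (math.log(pt) - math.log(ps))
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Cross-entropy between the student's unsoftened prediction and the true label.
    hard = -math.log(softmax(student_logits)[target])
    return hard_weight * hard + soft_weight * temperature ** 2 * soft


print(kd_loss([0.2, 1.5, -0.3], [0.1, 2.0, -1.0], target=1))
```

The soft term vanishes when student and teacher logits coincide, so only the hard-label term remains in that case.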

## 5. More sample configurations, models, datasets...
For CIFAR-10/100 datasets, you can find more [sample configurations](https://github.com/yoshitomo-matsubara/torchdistill/tree/master/configs/legacy/sample/) and [models](https://github.com/yoshitomo-matsubara/torchdistill/tree/master/torchdistill/models/classification) in the [***torchdistill***](https://github.com/yoshitomo-matsubara/torchdistill) repository.
If you would like to use larger datasets, e.g., **ImageNet** and **COCO**, and models in `torchvision` (or your own modules), refer to the [official configurations](https://github.com/yoshitomo-matsubara/torchdistill/tree/master/configs/legacy/official) used in some published papers.
Experiments with such large datasets and models will require you to use your own machine due to the limited disk space and session time (12 hours for the free version and 24 hours for Colab Pro) on Google Colab.


# Colab examples for training student models without teacher models
You can find Colab examples for training models without teachers in the [***torchdistill***](https://github.com/yoshitomo-matsubara/torchdistill) repository.

In [15]:
!cat <<EOF > /content/torchdistill/configs/sample/cifar10/kd/resnet20_from_densenet_bc_k12_depth100-final_run.yaml
datasets:
  cifar10:
    name: &dataset_name 'cifar10'
    key: 'CIFAR10'
    root: &root_dir !join ['./resource/dataset/', *dataset_name]
    splits:
      train:
        dataset_id: &cifar10_train !join [*dataset_name, '/train']
        kwargs:
          root: *root_dir
          train: True
          download: True
          transform_configs:
            - key: 'RandomCrop'
              kwargs:
                size: 32
                padding: 4
            - key: 'RandomHorizontalFlip'
              kwargs:
                p: 0.5
            - key: 'ToTensor'
              kwargs:
            - &normalize
              key: 'Normalize'
              kwargs:
                mean: [0.49139968, 0.48215841, 0.44653091]
                std: [0.24703223, 0.24348513, 0.26158784]
      val:
        dataset_id: &cifar10_val !join [ *dataset_name, '/val' ]
        kwargs:
          root: *root_dir
          train: False
          download: True
          transform_configs: &val_transform
            - key: 'ToTensor'
              kwargs:
            - *normalize
      test:
        dataset_id: &cifar10_test !join [*dataset_name, '/test']
        kwargs:
          root: *root_dir
          train: False
          download: True
          transform_configs: *val_transform

models:
  teacher_model:
    key: &teacher_model_key 'densenet_bc_k12_depth100'
    kwargs:
      num_classes: 10
      memory_efficient: False
      pretrained: True
    src_ckpt:
  student_model:
    key: &student_model_key 'resnet20'
    kwargs:
      num_classes: 10
      pretrained: False
    _experiment: &student_experiment !join [*dataset_name, '-', *student_model_key, '_from_', *teacher_model_key]
    src_ckpt:
    dst_ckpt: !join ['./resource/ckpt/', *dataset_name, '/kd/', *student_experiment, '-final_run.pt']

train:
  log_freq: 100
  num_epochs: 182
  train_data_loader:
    dataset_id: *cifar10_train
    sampler:
      class_or_func: !import_get
        key: 'torch.utils.data.RandomSampler'
      kwargs:
    kwargs:
      batch_size: 64
      num_workers: 16
      pin_memory: True
      drop_last: False
    cache_output:
  val_data_loader:
    dataset_id: *cifar10_val # Changed to use the val dataset split
    sampler: &val_sampler
      class_or_func: !import_get
        key: 'torch.utils.data.SequentialSampler'
      kwargs:
    kwargs:
      batch_size: 128
      num_workers: 16
      pin_memory: True
      drop_last: False
  teacher:
    forward_proc: 'forward_batch_only'
    sequential: []
    wrapper: 'DataParallel'
    requires_grad: False
    frozen_modules: []
  student:
    forward_proc: 'forward_batch_only'
    adaptations:
    sequential: []
    wrapper: 'DistributedDataParallel'
    requires_grad: True
    frozen_modules: []
  optimizer:
    key: 'SGD'
    kwargs:
      lr: 0.1
      momentum: 0.9
      weight_decay: 0.0001
  scheduler:
    key: 'MultiStepLR'
    kwargs:
      milestones: [91, 136]
      gamma: 0.1
  criterion:
    key: 'WeightedSumLoss'
    kwargs:
      sub_terms:
        kd:
          criterion:
            key: 'KDLoss'
            kwargs:
              student_module_path: '.'
              student_module_io: 'output'
              teacher_module_path: '.'
              teacher_module_io: 'output'
              temperature: 4.0
              alpha: 0.9
              reduction: 'batchmean'
          weight: 1.0

test:
  test_data_loader: # Added test_data_loader
    dataset_id: *cifar10_test
    sampler: *val_sampler
    kwargs:
      batch_size: 1
      num_workers: 16
      pin_memory: True
      drop_last: False
EOF

datasets:
  cifar10:
    name: &dataset_name 'cifar10'
    key: 'CIFAR10'
    root: &root_dir !join ['./resource/dataset/', *dataset_name]
    splits:
      train:
        dataset_id: &cifar10_train !join [*dataset_name, '/train']
        kwargs:
          root: *root_dir
          train: True
          download: True
          transform_configs:
            - key: 'RandomCrop'
              kwargs:
                size: 32
                padding: 4
            - key: 'RandomHorizontalFlip'
              kwargs:
                p: 0.5
            - key: 'ToTensor'
              kwargs:
            - &normalize
              key: 'Normalize'
              kwargs:
                mean: [0.49139968, 0.48215841, 0.44653091]
                std: [0.24703223, 0.24348513, 0.26158784]
      val:
        dataset_id: &cifar10_val !join [ *dataset_name, '/val' ]
        kwargs:
          root: *root_dir
          train: False
          download: True
          transform_configs: &val_transform
 