## torch.no_grad() vs param.requires_grad
- torch.no_grad()
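    - A context manager that implicitly disables gradient computation inside its block; it does not change any parameter's requires_grad flag
    - No computation graph is recorded inside the block, so outputs have requires_grad=False and no grad is stored for any layer
    - Suited to the eval stage, where the model only runs forward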
- param.requires_grad
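    - Explicitly freezes selected modules (layers): parameters with requires_grad=False receive no gradient and are therefore not updated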
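The difference between the two mechanisms can be checked on a tiny model. The following is a minimal sketch, assuming a small `nn.Linear` layer as a stand-in for a real network; the variable names are illustrative and not part of the original notebook.

```python
import torch
from torch import nn

layer = nn.Linear(4, 2)   # small stand-in for a real model (assumption)
x = torch.randn(3, 4)

# 1) torch.no_grad(): no graph is recorded inside the block,
#    but the parameters' requires_grad flags are left untouched.
with torch.no_grad():
    y_no_grad = layer(x)

output_tracks_grad = y_no_grad.requires_grad
flags_unchanged = all(p.requires_grad for p in layer.parameters())

# 2) requires_grad = False: explicitly freeze one parameter.
layer.bias.requires_grad = False
loss = layer(x).sum()
loss.backward()

weight_has_grad = layer.weight.grad is not None   # still trainable
bias_has_grad = layer.bias.grad is not None       # frozen, no grad stored

print(output_tracks_grad, flags_unchanged, weight_has_grad, bias_has_grad)
```

In this sketch the output computed under `torch.no_grad()` is detached from autograd while every parameter flag stays as it was, whereas setting `requires_grad = False` permanently excludes that one parameter from `backward()` until the flag is set back.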

In [1]:
from transformers import BertModel
import torch
from torch import nn


In [2]:
bert = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
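def calc_learnable_params(model):
    """Return the number of scalar parameters with requires_grad=True."""
    total_param = 0
    for name, param in model.named_parameters():
        # frozen parameters (requires_grad=False) are skipped
        if param.requires_grad:
            total_param += param.numel()
    return total_param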

In [6]:
# trainable parameter count in units of 2**20 (1024*1024), not 10**6
calc_learnable_params(bert)/1024/1024

104.410400390625
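To see how freezing changes this count, the same counting logic can be run before and after setting `requires_grad = False`. This is a minimal sketch, assuming a small `nn.Sequential` model as a hypothetical stand-in for `bert` so that nothing has to be downloaded; `calc_learnable_params` is redefined here only to keep the block self-contained.

```python
from torch import nn

def calc_learnable_params(model):
    """Count scalar parameters that still have requires_grad=True."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hypothetical toy model standing in for a pretrained network.
model = nn.Sequential(
    nn.Linear(8, 4),   # 8*4 weights + 4 biases = 36 parameters
    nn.ReLU(),
    nn.Linear(4, 2),   # 4*2 weights + 2 biases = 10 parameters
)

before = calc_learnable_params(model)

# Freeze the first layer explicitly; only the last layer stays trainable.
for param in model[0].parameters():
    param.requires_grad = False

after = calc_learnable_params(model)

print(before, after)   # 46 10
```

The same loop applied to a submodule of a larger model (for example its embedding layer) would freeze that submodule in the same way, and the counter would then report only the remaining trainable parameters.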