In [7]:
import torch
torch.__version__

'1.13.0.dev20220721+cu113'

FSDP currently does not support layer-level fine-tuning. The options are therefore whole-model fine-tuning or, for large language models, Child Tuning.

Child Tuning was developed in the paper: 
Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning


@inproceedings{xu-etal-2021-childtuning,
    title = "Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning",
    author = "Runxin Xu and
    Fuli Luo and Zhiyuan Zhang and
    Chuanqi Tan and Baobao Chang and
    Songfang Huang and Fei Huang",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2021",
    publisher = "Association for Computational Linguistics",
}

https://arxiv.org/abs/2109.05687

![Paper intro](./images/child_tuning.png)

![Improved results](./images/child_tuning_gains.png)

In [8]:
from torch.distributions import Bernoulli

In [43]:
torch.manual_seed(2022)
torch.cuda.manual_seed(2022)

In [53]:
grad = torch.randn(3,3)
reserve_p = .30

In [54]:
grad

tensor([[ 2.0203,  0.1361, -0.9314],
        [ 1.3920,  0.7097, -2.1463],
        [ 0.9796,  0.2208, -0.3193]])

In [46]:
r = grad.new_full(size=grad.size(), fill_value=reserve_p)
r


tensor([[0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000]])

In [47]:
rdist = Bernoulli(r)
rdist

Bernoulli(probs: torch.Size([3, 3]))

In [48]:
rp = rdist.sample() 
rp

tensor([[0., 1., 0.],
        [0., 0., 1.],
        [1., 0., 1.]])

In [49]:
amplifier = rp/reserve_p
amplifier

tensor([[0.0000, 3.3333, 0.0000],
        [0.0000, 0.0000, 3.3333],
        [3.3333, 0.0000, 3.3333]])

In [50]:
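# Note: the output below came from an earlier random draw of grad (the cells
# were run out of order), so it is not the grad printed above times amplifier.
newgrad = grad * amplifier
newgrad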

tensor([[ 0.0000,  1.1021,  0.0000],
        [ 0.0000, -0.0000, -3.0270],
        [-7.7042,  0.0000,  6.8715]])

In [55]:
grad_mask = Bernoulli(grad.new_full(size=grad.size(), fill_value=reserve_p))
grad *= grad_mask.sample() / reserve_p
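
The division by reserve_p in the cell above is what keeps the masked gradient on the same scale as the original one on average. The following is a minimal sketch of that property, not part of the original notebook; the constant all-ones gradient and the tensor size are assumptions chosen only to make the averages easy to read.

```python
import torch
from torch.distributions import Bernoulli

torch.manual_seed(0)

reserve_p = 0.30
grad = torch.ones(1_000_000)  # assumed toy gradient: every entry is 1.0

# Same masking step as above: keep each entry with probability reserve_p,
# then divide by reserve_p to compensate for the entries that were zeroed.
mask = Bernoulli(grad.new_full(size=grad.size(), fill_value=reserve_p)).sample()
masked_grad = grad * mask / reserve_p

kept_fraction = mask.mean().item()      # should be close to reserve_p
mean_after = masked_grad.mean().item()  # should be close to the original mean of 1.0
print(f"kept fraction: {kept_fraction:.3f}, mean after masking: {mean_after:.3f}")
```

Each surviving entry is amplified by 1 / reserve_p, so individual updates are larger but the expected gradient is unchanged.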

There are two versions of Child Tuning: task-free and task-dependent. In our T5 testing, the task-free version gave better results, so that is what we show here.

(For reference: in the task-dependent version, you train for one epoch while monitoring the parameters to build a Fisher Information Matrix, which identifies the most 'active' parameters for that task. These are then isolated and are the only ones updated during fine-tuning.)
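
As a rough illustration of the task-dependent idea, the sketch below scores each parameter by its accumulated squared gradient (a common diagonal approximation of the Fisher information) and keeps the top reserve_p fraction. This is an assumption-laden sketch, not the paper's or the ChildTuningOptimizer implementation: the helper name `fisher_mask`, the toy model, the random batches, and the MSE loss are all hypothetical.

```python
import torch

torch.manual_seed(0)

def fisher_mask(model, batches, loss_fn, reserve_p):
    """Hypothetical helper: return {name: 0/1 mask} keeping roughly the top
    reserve_p fraction of parameters, ranked by accumulated squared gradient."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for inputs, targets in batches:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
    model.zero_grad()
    # One global threshold across all parameters, not one per layer.
    scores = torch.cat([f.flatten() for f in fisher.values()])
    k = max(1, int(reserve_p * scores.numel()))
    threshold = torch.topk(scores, k).values.min()
    return {n: (f >= threshold).to(f.dtype) for n, f in fisher.items()}

# Toy stand-ins for a real model and a real task (assumptions for the demo).
model = torch.nn.Linear(10, 5)
batches = [(torch.randn(8, 10), torch.randn(8, 5)) for _ in range(4)]
masks = fisher_mask(model, batches, torch.nn.functional.mse_loss, reserve_p=0.30)

kept = sum(m.sum().item() for m in masks.values())
total = sum(m.numel() for m in masks.values())
print(f"kept {int(kept)} of {total} parameters")
```

Unlike the task-free mask, this mask is computed once and then stays fixed, so the same subset of parameters is updated at every step.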

Task-free is more akin to a strong regularizer, because it randomly masks a subset of the model parameters.


In [64]:
# usage:

from ChildTuningOptimizer import ChildTuningAdamW

model = torch.nn.Linear(100, 200)  # toy stand-in for a real model

In [65]:
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    StateDictType,
)

In [None]:
# ----- main FSDP init -----------
# my_auto_wrap_policy, mp_policy, prefetch_policy and cfg are defined elsewhere
# in the training script.
model = FSDP(
    model,
    auto_wrap_policy=my_auto_wrap_policy,
    mixed_precision=mp_policy,
    backward_prefetch=prefetch_policy,
    sharding_strategy=cfg.sharding_strategy,
    device_id=torch.cuda.current_device(),
    forward_prefetch=cfg.forward_prefetch,
)


In [None]:
optimizer = ChildTuningAdamW(model.parameters(), lr=4e-8, reserve_p=0.35, mode="taskfree")

Child Tuning otherwise works the same as AdamW, but with the masked tuning applied. It provides finer-grained tuning than hard layer freezing, because it operates both horizontally and vertically within the entire model rather than on whole layers.

Note: during Child Tuning you can adjust the reserve_p percentage. A value of 1.0 is equivalent to normal AdamW.
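
To make the relationship to AdamW concrete, here is a minimal sketch that masks gradients and then calls the stock `torch.optim.AdamW`. It is not the ChildTuningAdamW implementation; the helpers `mask_gradients_` and `one_step`, the MSE loss, the learning rate, and the random data are assumptions for illustration. Weight decay is set to 0.0 only so that untouched parameters stay exactly unchanged and can be counted.

```python
import copy
import torch
from torch.distributions import Bernoulli

def mask_gradients_(params, reserve_p):
    """Hypothetical helper: in place, keep each gradient entry with
    probability reserve_p and rescale the survivors by 1 / reserve_p."""
    for p in params:
        if p.grad is not None:
            probs = p.grad.new_full(size=p.grad.size(), fill_value=reserve_p)
            p.grad.mul_(Bernoulli(probs).sample() / reserve_p)

def one_step(model, inputs, targets, reserve_p=None):
    """Run a single AdamW step, optionally masking gradients first."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(inputs), targets).backward()
    if reserve_p is not None:
        mask_gradients_(model.parameters(), reserve_p)
    opt.step()

torch.manual_seed(0)
base = torch.nn.Linear(100, 200)
inputs, targets = torch.randn(16, 100), torch.randn(16, 200)

plain, full, child = (copy.deepcopy(base) for _ in range(3))
one_step(plain, inputs, targets)                  # ordinary AdamW
one_step(full, inputs, targets, reserve_p=1.0)    # mask keeps everything
one_step(child, inputs, targets, reserve_p=0.35)  # masked update

matches_adamw = all(
    torch.allclose(a, b) for a, b in zip(plain.parameters(), full.parameters())
)
changed = sum((a != b).sum().item() for a, b in zip(base.parameters(), child.parameters()))
total = sum(p.numel() for p in base.parameters())
updated_fraction = changed / total
print(f"reserve_p=1.0 matches AdamW: {matches_adamw}")
print(f"fraction of parameters updated at reserve_p=0.35: {updated_fraction:.3f}")
```

With reserve_p = 1.0 the sampled mask is all ones, so the step is identical to plain AdamW. With reserve_p = 0.35, roughly 35% of the individual weights move on this first step, scattered across the weight matrix rather than grouped by layer.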

General best practice is to use around 30% to 35% of the network for the fine-tuning task (i.e. reserve_p = 0.30 to 0.35), but you can run and compare values for your specific task.
Child Tuning will often lag behind whole-model fine-tuning for the first few epochs, but it usually catches up and then exceeds it.