PLSC: loss issue when training with multiple datasets #58

Closed
gobigrassland opened this issue Jun 16, 2020 · 10 comments
@gobigrassland

gobigrassland commented Jun 16, 2020

PLSC solves large-scale classification training very well. In practice, model training uses multiple datasets, these datasets have overlapping IDs to varying degrees, and cleaning up the overlapping IDs across datasets is rather troublesome. So during training I feed multiple datasets, share the backbone network parameters, and give each dataset its own classification layer. In the current PLSC code,

shard_logit = loss._get_info('shard_logit')
shard_prob = loss._get_info('shard_prob')
shard_label = loss._get_info('shard_label')
shard_dim = loss._get_info('shard_dim')

is roughly the part that splits the weights of the large classification layer across multiple GPUs (my description here is not precise).

The current PLSC code cannot handle the connection weights of multiple classification layers. I modified the minimize function in plsc/models/dist_algo.py, roughly into the following form:

def compute_gradient_multi_branches(self,
                                    loss,
                                    dataset_name,
                                    startup_program=None,
                                    parameter_list=None,
                                    no_grad_set=None,
                                    callbacks=None):
    assert loss._get_info('shard_logit_{}'.format(dataset_name))

    shard_logit = loss._get_info('shard_logit_{}'.format(dataset_name))
    shard_prob = loss._get_info('shard_prob_{}'.format(dataset_name))
    shard_label = loss._get_info('shard_label_{}'.format(dataset_name))
    shard_dim = loss._get_info('shard_dim_{}'.format(dataset_name))

    op_maker = fluid.core.op_proto_and_checker_maker
    op_role_key = op_maker.kOpRoleAttrName()
    op_role_var_key = op_maker.kOpRoleVarAttrName()
    backward_role = int(op_maker.OpRole.Backward)
    loss_backward_role = int(op_maker.OpRole.Loss) | int(
        op_maker.OpRole.Backward)

    # minimize a scalar of reduce_sum to generate the backward network
    scalar = fluid.layers.reduce_sum(shard_logit)
    block = loss.block

    if not self._use_fp16:
        # ret = self._optimizer.minimize(scalar)
        params_grads = self._optimizer.backward(scalar)
        print(loss, scalar, dataset_name)
        # remove the unnecessary ops
        index = 0
        """
        for i, op in enumerate(block.ops):
            if op.all_attrs()[op_role_key] == loss_backward_role:
                index = i
                break
        """
        for i, op in enumerate(block.ops):
            print(i, dataset_name, block.ops[i])

I hope to compute gradients separately for the classification loss of each branch and then aggregate the gradients across the branches. However, I found that the ops of the branches corresponding to different datasets differ greatly. I previously implemented this in TensorFlow, where only the final classification layers differ and need not be shared, so I could, for example, average the gradients of the shared parameters and then update. In my experiment, taking webface and vggface2 as an example, the ops of these two branches differ a lot: one branch has many more ops than the other.

Some of the extra ops:

inputs {
  parameter: "X"
  arguments: "prelu_32.w_0@GRAD"
}
outputs {
  parameter: "Out"
  arguments: "prelu_32.w_0@GRAD"
}
type: "c_sync_calc_stream"
attrs {
  name: "op_device"
  type: STRING
  s: ""
}
attrs {
  name: "op_role"
  type: INT
  i: 1
}
attrs {
  name: "op_callstack"
  type: STRINGS
  strings: "
}
attrs {
  name: "op_namescope"
  type: STRING
  s: "/"
}
attrs {
  name: "op_role_var"
  type: STRINGS
}

inputs {
  parameter: "X"
  arguments: "prelu_24.w_0@GRAD"
}
outputs {
  parameter: "Out"
  arguments: "prelu_24.w_0@GRAD"
}
type: "c_allreduce_sum"
attrs {
  name: "op_device"
  type: STRING
  s: ""
}
attrs {
  name: "ring_id"
  type: INT
  i: 0
}
attrs {
  name: "use_calc_stream"
  type: BOOLEAN
  b: false
}
attrs {
  name: "op_role"
  type: INT
  i: 1
}
attrs {
  name: "op_role_var"
  type: STRINGS
}

The above is joint training with multiple datasets, using a different classification layer for each dataset. But I ran into the problem above and hope you can give me some advice.

@sandyhouse

Could you also roughly paste the network-building code?

@gobigrassland
Author

gobigrassland commented Jun 16, 2020

@sandyhouse Mainly I want to use the following 3 functions so that each dataset uses its own classification layer, achieving joint training of one shared network:

  1. build_program_multi_branch is written with reference to the build_program function in plsc/entry.py. Its main job:
    split the emb features in fixed proportions. For example, if the batch size of both webface and vggface2 is 10 during training, then after splitting, the first 10 embs are features of images from the webface dataset and the last 10 embs are features of images from the vggface2 dataset. Then the classification losses for webface and vggface2 are computed separately.

  2. minimize_multi_branches is a function redefined in paddle/fluid/incubate/fleet/collective/__init__.py, modified with reference to the minimize function in that file, so that it can take multiple classification losses as input, compute their gradients separately, and then aggregate them (not implemented yet).

  3. compute_gradient_multi_branches is modified from the minimize function in plsc/models/dist_algo.py so that it can compute gradients with respect to the different classification losses.

def build_program_multi_branch(self,
                               is_train=True,
                               use_parallel_test=False,
                               dist_strategy=None):
            # part omitted here; it is the same as in the build_program function
            emb = model.build_network(input=image, label=label, is_train=True)
            emb_split = fluid.layers.split(emb, batch_size_multi_branch, dim=0)
            label_split = fluid.layers.split(label, batch_size_multi_branch, dim=0)
            loss_split = []
            name_split = []
            for ind in range(len(batch_size_multi_branch)):
                if self.loss_type == "dist_arcface":
                    avg_loss = dist_algo.distributed_arcface_classify(
                        emb_split[ind], label_split[ind],
                        int(self.datasets_info[ind][-2]),
                        num_trainers, trainer_id,
                        self.margin, self.scale, self.param_attr,
                        self.datasets_info[ind][0])
                loss_split.append(avg_loss)
                name_split.append(self.datasets_info[ind][0])

            optimizer = None
            if is_train:
                # initialize optimizer
                optimizer = self._get_optimizer()
                if self.num_trainers > 1:
                    dist_optimizer = fleet.distributed_optimizer(
                        optimizer, strategy=dist_strategy)
                    # dist_optimizer.minimize(loss_split[ind], self.datasets_info[ind][0])
                    dist_optimizer.minimize_multi_branches(loss_split, name_split)
def minimize_multi_branches(self,
             losses,
             names,
             startup_program=None,
             parameter_list=None,
             no_grad_set=None):

    for ind, name in enumerate(names):
        loss = losses[ind]
        main_program = loss.block.program
        if startup_program is None:
            startup_program = fluid.default_startup_program()
        fleet.startup_program = startup_program

        self._loss = loss

        self._check_collective_mode(main_program, self._optimizer,
                                    self._strategy)

        param_grads = self._optimizer.compute_gradient_multi_branches(
            loss,
            name,
            startup_program=startup_program,
            parameter_list=parameter_list,
            no_grad_set=no_grad_set)

        fleet._origin_program = main_program.clone(for_test=False)
        fleet._transpiled_program = main_program
        fleet.main_program = self._try_to_compile(startup_program, main_program)
def compute_gradient_multi_branches(self,
             loss,
             dataset_name,
             startup_program=None,
             parameter_list=None,
             no_grad_set=None,
             callbacks=None):
    assert loss._get_info('shard_logit_{}'.format(dataset_name))

    shard_logit = loss._get_info('shard_logit_{}'.format(dataset_name))
    shard_prob = loss._get_info('shard_prob_{}'.format(dataset_name))
    shard_label = loss._get_info('shard_label_{}'.format(dataset_name))
    shard_dim = loss._get_info('shard_dim_{}'.format(dataset_name))

    op_maker = fluid.core.op_proto_and_checker_maker
    op_role_key = op_maker.kOpRoleAttrName()
    op_role_var_key = op_maker.kOpRoleVarAttrName()
    backward_role = int(op_maker.OpRole.Backward)
    loss_backward_role = int(op_maker.OpRole.Loss) | int(
        op_maker.OpRole.Backward)

    # minimize a scalar of reduce_sum to generate the backward network
    scalar = fluid.layers.reduce_sum(shard_logit)
    block = loss.block

    if not self._use_fp16:
        #ret = self._optimizer.minimize(scalar)
        params_grads = self._optimizer.backward(scalar)
        print(loss, scalar, dataset_name)
        # remove the unnecessary ops
        index = 0
        """
        for i, op in enumerate(block.ops):
            #print(i, op)
            if op.all_attrs()[op_role_key] == loss_backward_role:
                index = i
                break
        """
        for i, op in enumerate(block.ops):
            print(i, dataset_name, block.ops[i])

        """
        assert block.ops[index - 1].type == 'reduce_sum'
        assert block.ops[index].type == 'fill_constant'
        assert block.ops[index + 1].type == 'reduce_sum_grad'
        block._remove_op(index + 1)
        block._remove_op(index)
        block._remove_op(index - 1)

        self.insert_commom_backward_op(block, index, shard_logit, shard_prob,
                                        shard_label, shard_dim, op_role_key,
                                        backward_role, loss_backward_role)
        """
        return params_grads

@sandyhouse

The reason for the extra OPs: calling backward for the first loss automatically inserts backward operations (OPs) into the program, and the resulting program becomes program1; calling backward for the second loss then inserts backward operations (OPs) on top of program1, which is why there are many more OPs. Let me think about how this use case can be supported.
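A minimal toy sketch of this accumulation effect (the tiny network below is only for illustration, not PLSC code): calling Optimizer.backward() twice on the same Program appends a second set of backward ops after the first, so the two branches end up looking at different op lists.

import paddle.fluid as fluid

main_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(main_prog, startup_prog):
    x = fluid.data(name='x', shape=[None, 8], dtype='float32')
    loss1 = fluid.layers.reduce_sum(fluid.layers.fc(input=x, size=4))
    loss2 = fluid.layers.reduce_sum(fluid.layers.fc(input=x, size=4))

    opt = fluid.optimizer.SGD(learning_rate=0.1)
    opt.backward(loss1)
    n1 = len(main_prog.global_block().ops)  # all forward ops + backward ops of loss1
    opt.backward(loss2)
    n2 = len(main_prog.global_block().ops)  # backward ops of loss2 appended on top

print(n1, n2)  # n2 > n1: the second branch's block also contains the first branch's backward ops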

@gobigrassland
Author

OK, thanks a lot. In many of my application scenarios I need to jointly train with multiple data sources. Your federated learning framework PaddleFL also uses different data sources to train different tasks.

@sandyhouse

I thought of a rather rough solution: build a separate program for each dataset (using with program_guard together with unique_name), so that during the backward pass the relevant operations are effectively applied only to the program corresponding to each loss.

@gobigrassland
Author

@sandyhouse OK, I'll give it a try first and come back with questions.

@sandyhouse

Has your problem been resolved? @gobigrassland

@gobigrassland
Author

@sandyhouse There are two approaches:
1. Keep all branches in a single graph, and in dist_algo.py handle the extra ops introduced by the multiple classification layers the way the original code does: delete them first and then re-insert them. This runs; I am currently verifying whether the metric values are normal.
2. Write it as a multi-task setup with reference to your PALM framework; this is still in progress.

Do you have any other, better solution?

@sandyhouse

I thought of a rather rough solution: build a separate program for each dataset (using with program_guard together with unique_name), so that during the backward pass the relevant operations are effectively applied only to the program corresponding to each loss.

For now I still think this approach is relatively simpler, but I'm not sure whether other issues will come up in the implementation.

@gobigrassland
Author

I also looked at the program_guard and unique_name APIs at the start, but my understanding of them is probably insufficient and I still haven't managed to implement what you described. I'll try this approach again over the next while. If it's convenient for you, could you write me a simple example to reference?
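My current rough understanding of the program_guard + unique_name idea is the sketch below (this is only my guess; the data layers and the commented-out calls are placeholders, not PLSC's real code):

import paddle.fluid as fluid

programs = {}
for name in ['webface', 'vggface2']:
    main_prog = fluid.Program()
    startup_prog = fluid.Program()
    with fluid.program_guard(main_prog, startup_prog):
        with fluid.unique_name.guard():
            image = fluid.data(name='image', shape=[None, 3, 112, 112], dtype='float32')
            label = fluid.data(name='label', shape=[None, 1], dtype='int64')
            # emb = model.build_network(input=image, label=label, is_train=True)
            # avg_loss = dist_algo.distributed_arcface_classify(emb, label, ...)
            # optimizer.minimize(avg_loss)  # backward ops are appended to this program only
    programs[name] = (main_prog, startup_prog)

# Each dataset gets its own program, so appending backward ops for one loss never
# changes the other dataset's graph. Because unique_name.guard() restarts the name
# generator, the shared backbone layers receive identical parameter names in every
# program, so a single executor scope holds one copy of those parameters shared by
# all branches.

Is this roughly what you meant?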
