# Multi-GPU Training with Gluon


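Data parallelism is easy to use in Gluon. In [Multi-GPU Training from Scratch](./multiple-gpus-scratch.md) we implemented several data synchronization functions by hand to get data parallelism; Gluon provides the same functionality built in.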


## Initialization on Multiple Devices

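Earlier we showed how to use the `ctx` argument of `initialize()` to initialize a model on the CPU or on a specific GPU. In fact, `ctx` also accepts a list of devices, in which case the initialized parameters are copied to every device in the list.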

Here we use the ResNet-18 introduced earlier as the demonstration model.

In [8]:
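import sys
sys.path.append('..')  # make the tutorial's shared utils module importable
import utils
from mxnet import gpu, cpu

net = utils.resnet18(10)
# ctx = [gpu(0), gpu(0)]
ctx = [gpu(0), cpu(1)]  # this run pairs one GPU with a CPU context
net.initialize(ctx=ctx)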


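Recall the [deferred initialization](../chapter_gluon-basics/parameters.md) mentioned earlier: at this point the parameters have not actually been initialized yet. We first need to run a forward pass on some data.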
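
Deferred initialization can be pictured with a small pure-Python sketch. `LazyDense` is a hypothetical class written only for illustration (it is not a Gluon API, and the initialization range is arbitrary): `initialize()` merely records the request, and the weight is allocated on the first forward call, once the input size is known.

```python
import random


class LazyDense:
    """Toy dense layer whose weight shape is inferred from the first input."""

    def __init__(self, units):
        self.units = units
        self.weight = None            # not allocated yet
        self._init_requested = False

    def initialize(self):
        # Only remember that initialization was requested; the input
        # dimension is still unknown, so nothing can be allocated.
        self._init_requested = True

    def __call__(self, x):
        if self.weight is None:
            if not self._init_requested:
                raise RuntimeError('call initialize() first')
            in_units = len(x)         # the shape becomes known here
            self.weight = [
                [random.uniform(-0.07, 0.07) for _ in range(in_units)]
                for _ in range(self.units)
            ]
        # y_j = sum_i w_ji * x_i
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


layer = LazyDense(3)
layer.initialize()
was_unallocated = layer.weight is None   # still no weight after initialize()
y = layer([1.0, 2.0])                    # first forward pass allocates it
print(was_unallocated, len(layer.weight), len(layer.weight[0]))
```

This is why the network above has to see one batch before its parameters can be inspected.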

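Gluon provides the `split_and_load` function that we implemented by hand before. It splits the data along the batch axis and returns one slice on each device. The computation then runs on whichever device holds the input.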
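
To make the splitting step concrete, here is a minimal pure-Python sketch of dividing a batch evenly along its first axis. `split_batch` is a hypothetical helper written for illustration, not Gluon's implementation; the real `split_and_load` additionally copies each slice to its device. Requiring an even split is an assumption of this sketch.

```python
def split_batch(batch, num_devices):
    """Split a list of examples into num_devices equal, contiguous slices."""
    n = len(batch)
    if n % num_devices != 0:
        # Assumption for this sketch: the batch must divide evenly.
        raise ValueError(
            'batch size %d is not divisible by %d devices' % (n, num_devices))
    step = n // num_devices
    return [batch[i * step:(i + 1) * step] for i in range(num_devices)]


batch = list(range(8))          # stand-in for a batch of 8 examples
parts = split_batch(batch, 2)   # one slice per device
print(parts)
```

With a batch of 4 and two devices, as in the cell below, each device would receive 2 examples.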


In [11]:
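from mxnet import nd
from mxnet import gluon

x = nd.random.uniform(shape=(4, 1, 28, 28))
# Split the batch of 4 into one slice per device in ctx.
x_list = gluon.utils.split_and_load(x, ctx)
print(x_list)
# Each forward pass runs on the device that holds its input slice.
print(net(x_list[0]))
print(net(x_list[1]))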

[
[[[[0.5946477  0.76241076 0.9286425  ... 0.55939174 0.01368841
    0.42221504]
   [0.20991999 0.13124892 0.6999888  ... 0.7284153  0.06722905
    0.00324187]
   [0.03553848 0.04015033 0.42025933 ... 0.00680894 0.36719233
    0.72177184]
   ...
   [0.25489825 0.7400993  0.80378944 ... 0.14468406 0.06724603
    0.61052424]
   [0.59011513 0.6301808  0.715846   ... 0.99822533 0.366307
    0.6627515 ]
   [0.6474741  0.16908157 0.2554662  ... 0.731404   0.2669775
    0.94236565]]]


 [[[0.3346171  0.29929847 0.04035985 ... 0.9418107  0.14765576
    0.7996131 ]
   [0.41482422 0.93953705 0.82133496 ... 0.71857834 0.19046582
    0.2492019 ]
   [0.16201365 0.57839215 0.40134495 ... 0.5595262  0.05132892
    0.47892925]
   ...
   [0.3079302  0.12343674 0.0858999  ... 0.5605767  0.2879493
    0.9880256 ]
   [0.03409025 0.5199983  0.45360327 ... 0.16639715 0.29464877
    0.38487077]
   [0.81168246 0.7024145  0.6347502  ... 0.16443568 0.47478315
    0.74929905]]]]
<NDArray 2x1x28x28 @gpu(0)>, 
[[[

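Now we can look at what happened during initialization. Recall that parameter values are accessed through `data`, which by default returns the value on the CPU. Here, however, the model was initialized only on the two devices in `ctx` (`gpu(0)` and `cpu(1)` in this run), so we have to name the device explicitly when reading a value.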


In [12]:
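weight = net[1].params.get('weight')
# Print the first filter of the copy held on each device.
print(weight.data(ctx[0])[0])
print(weight.data(ctx[1])[0])
try:
    weight.data(cpu())  # cpu(0) is not in ctx, so this raises
except Exception:
    print('Not initialize on ', cpu())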


[[[ 0.04197619 -0.05456534  0.06447314]
  [ 0.06561633  0.04185344 -0.0067775 ]
  [-0.05908101  0.04418553  0.04269098]]]
<NDArray 1x3x3 @gpu(0)>

[[[ 0.04197619 -0.05456534  0.06447314]
  [ 0.06561633  0.04185344 -0.0067775 ]
  [-0.05908101  0.04418553  0.04269098]]]
<NDArray 1x3x3 @cpu(1)>
Not initialize on  cpu(0)



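In the previous chapter we described how to copy gradients across multiple GPUs, sum them, and broadcast the result. `gluon.Trainer` does this by default, so we can now write the complete training function.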
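
As a reminder of what that synchronization does, the following pure-Python sketch sums per-device gradients, gives every device the same total, and applies one SGD step. The helper names `allreduce` and `sgd_step` are hypothetical, and the division by the batch size mirrors the role of the argument passed to `trainer.step(batch_size)`. It is a sketch of the idea under these assumptions, not Gluon's internal code.

```python
def allreduce(grads_per_device):
    """Sum gradients elementwise and return one copy of the sum per device."""
    total = [sum(vals) for vals in zip(*grads_per_device)]
    return [list(total) for _ in grads_per_device]


def sgd_step(weights_per_device, grads_per_device, lr, batch_size):
    """Apply w -= lr * grad / batch_size to each device's copy in place."""
    for w, g in zip(weights_per_device, grads_per_device):
        for i in range(len(w)):
            w[i] -= lr * g[i] / batch_size


# Two devices hold identical weights but saw different halves of the batch.
weights = [[1.0, 2.0], [1.0, 2.0]]
grads = [[0.5, -1.0], [1.5, 3.0]]   # per-device gradient sums

grads = allreduce(grads)            # every device now holds the same total
sgd_step(weights, grads, lr=0.1, batch_size=4)
print(grads)
print(weights)
```

Because every device starts from the same weights and applies the same summed gradient, the copies stay identical after the update.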

## Training


In [16]:
from mxnet import gluon
from mxnet import autograd
from time import time
from mxnet import init

def train(num_gpus, batch_size, lr):
    train_data, test_data = utils.load_data_fashion_mnist(batch_size)
    # ctx = [gpu(i) for i in range(num_gpus)]
    ctx = [cpu(i) for i in range(num_gpus)]  # the recorded runs used CPU contexts
    print('Running on', ctx)
    net = utils.resnet18(10)
    net.initialize(init=init.Xavier(), ctx=ctx)
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    trainer = gluon.Trainer(
        net.collect_params(), 'sgd', {'learning_rate': lr}
    )
    for epoch in range(5):
        start = time()
        total_loss = 0
        for data, label in train_data:
            # One slice of the batch (and its labels) per device.
            data_list = gluon.utils.split_and_load(data, ctx)
            label_list = gluon.utils.split_and_load(label, ctx)
            with autograd.record():
                losses = [loss(net(X), y) for X, y in zip(data_list, label_list)]
            for l in losses:
                l.backward()
            total_loss += sum([l.sum().asscalar() for l in losses])
            # Aggregates gradients across devices, then updates every copy.
            trainer.step(batch_size)
        nd.waitall()  # wait for asynchronous work before reading the clock
        print('Epoch %d, training time = %.1f sec' % (
            epoch, time() - start))
        test_acc = utils.evaluate_accuracy(test_data, net, ctx[0])
        print('       validation accuracy = %.4f' % (test_acc))

    

In [17]:
train(1,256,.1)

Running on [cpu(0)]
Epoch 0, training time = 71.9 sec
       validation accuracy = 0.8761
Epoch 1, training time = 71.3 sec
       validation accuracy = 0.8957
Epoch 2, training time = 73.9 sec
       validation accuracy = 0.9126
Epoch 3, training time = 74.3 sec
       validation accuracy = 0.9135
Epoch 4, training time = 72.3 sec
       validation accuracy = 0.9057


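The same hyperparameters, but with two devices. Note that the run recorded below used two CPU contexts rather than two GPUs, so each epoch is slower, not faster.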

In [None]:
train(2,256,.1) 

Running on [cpu(0), cpu(1)]
Epoch 0, training time = 262.0 sec
       validation accuracy = 0.8803
Epoch 1, training time = 258.5 sec
       validation accuracy = 0.8955
Epoch 2, training time = 262.7 sec
       validation accuracy = 0.8944
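
The recorded timings can be compared with a few lines of arithmetic. The numbers below are copied from the two logs above. Because both runs used CPU contexts (`cpu(0)` and `cpu(1)`), they say nothing about the speedup that two real GPUs would give.

```python
one_device = [71.9, 71.3, 73.9, 74.3, 72.3]   # sec per epoch, train(1, 256, .1)
two_devices = [262.0, 258.5, 262.7]           # sec per epoch, train(2, 256, .1)

mean_one = sum(one_device) / len(one_device)
mean_two = sum(two_devices) / len(two_devices)
ratio = mean_two / mean_one                   # > 1 means the second run is slower

print('mean epoch time: %.1f sec vs %.1f sec' % (mean_one, mean_two))
print('slowdown factor: %.1f' % ratio)
```

In these logs the two-context run takes roughly 3.6 times as long per epoch as the single-context run.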
