
Is model parallelism supported? #96

Open
Dapid opened this Issue Nov 16, 2017 · 14 comments

Dapid commented Nov 16, 2017

All the examples seem to implement data parallelism. Can I train a model that is too big for my card?

alsrgv (Collaborator) commented Nov 17, 2017

@Dapid, yes, you can. There are a few changes you should make:

  1. Manually place portions of the model on the appropriate GPUs using with tf.device('/gpu:N'):

  2. Modify or remove the portion of the code that pins GPUs to be visible to TensorFlow: config.gpu_options.visible_device_list = str(hvd.local_rank())

  3. Modify your mpirun command to adjust the number of processes you run per machine. If your model uses all GPUs, you'd run one process per machine.

The rest of the code and documentation should work as is.
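
A minimal sketch of those three changes for a machine with two GPUs (the build_first_half / build_second_half helpers are hypothetical placeholders, not part of this thread):

# Sketch only: Horovod data parallelism across machines,
# manual model parallelism across the two GPUs inside each process.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# (1) Manually place portions of the model on specific GPUs.
with tf.device('/gpu:0'):
    hidden = build_first_half()            # hypothetical helper
with tf.device('/gpu:1'):
    loss = build_second_half(hidden)       # hypothetical helper

opt = hvd.DistributedOptimizer(tf.train.AdamOptimizer(0.001))
train_op = opt.minimize(loss)

# (2) Do not pin one GPU per process; this process needs to see both GPUs.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# config.gpu_options.visible_device_list = str(hvd.local_rank())  # removed

# (3) Launch one process per machine, e.g.:
#   mpirun -np 2 -H server1:1,server2:1 python train.py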

@alsrgv alsrgv added the question label Nov 17, 2017

Dapid (Author) commented Nov 30, 2017

Thanks! I have it working now.

This is a rough sketch of how I am defining my model:

with tf.device('/gpu:1'):
    inputs = get_inputs()
    layer = inputs
    for _ in range(n_layers // 2):
        layer = add_layer(layer)

with tf.device('/gpu:0'):
    for _ in range(n_layers - n_layers // 2):
        layer = add_layer(layer)

model = keras.models.Model(inputs=inputs, outputs=layer)
model.compile()  # optimizer/loss omitted in this sketch

It runs, but nvidia-smi reports GPU0 utilization between 70 and 99%, while GPU1 stays between 0 and 4%. Shouldn't the usage alternate between GPU0 and GPU1? Any insights on this?

TensorBoard tells me the model is indeed distributed, and it wouldn't fit on a single card anyway.

alsrgv (Collaborator) commented Nov 30, 2017

@Dapid, looks very reasonable. Maybe the layers on GPU1 are easier to compute, so its utilization is lower?

Nimi42 commented Jan 25, 2018

Would you be so kind as to elaborate further on 1., 2., and 3.?

This is how I understand the Horovod workflow. Please correct me if I'm wrong.

  1. If I want to split my model across GPUs, I would do so independently of Horovod.

I would need a cluster of nodes with similar specs (number of GPUs)? And then I could split my model across the GPUs of the respective node, while Horovod would take care of the data parallelism across the nodes.

  2. What do you mean by "modify or remove"?
    config.gpu_options.visible_device_list = str(hvd.local_rank())

Is this part of the code necessary to make the GPUs visible to the processes? So if I remove that line, will the scatter-reduce / allgather only happen between the nodes, but not between the GPUs on a node?

What does it do, and why do I have to remove or change it?

  3. Are you talking about the -np option?

So if I have four machines with an arbitrary number of GPUs, which I use for model parallelism, I would start MPI like this (one process per machine, -np 4):

mpirun -np 4 \
    -H server1:4,server2:4,server3:4,server4:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    python train.py

And what about the -H option? Is the server1:4 part a declaration of the available GPUs, or can I leave the number out?

And a fourth and last question: does OpenMPI use the NCCL 2 API provided by NVIDIA? I just want to make sure I understood correctly.

alsrgv (Collaborator) commented Jan 27, 2018

@Nimi42, take a look at this example - https://gist.github.com/alsrgv/cbbb2e25f09c1df983098098511d16c4. Relevant parts are marked with "Horovod Model Parallelism" comment.

In that example, each process uses two GPUs. I have 4 GPUs per server, so I launch two processes per server. I run it with:

$ mpirun -np 8 \
    -H server1:2,server2:2,server3:2,server4:2 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    python train.py
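
For the GPU-pinning piece that the gist marks with the "Horovod Model Parallelism" comments, a plausible sketch (my assumption, not copied verbatim from the gist) is:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Each process claims GPUs 2*local_rank and 2*local_rank + 1, so with
# 4 GPUs per server you launch two processes per server.
config.gpu_options.visible_device_list = '%d,%d' % (2 * hvd.local_rank(),
                                                    2 * hvd.local_rank() + 1)

Inside each process the two visible GPUs then appear as '/gpu:0' and '/gpu:1' for the manual with tf.device(...) placement.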

The open-source version of OpenMPI does not use NCCL 2 directly. Some of the proprietary forks/extensions do. Horovod uses NCCL 2 independently of OpenMPI.

sumedhvdatar commented Aug 10, 2018

Is model parallelism supported in Keras using Horovod?

alsrgv (Collaborator) commented Aug 10, 2018

sumedhvdatar commented Aug 10, 2018

@alsrgv I just made my question clearer by raising issue #437. Can you clarify that one?

brksrl commented Aug 13, 2018

Hello @alsrgv,

for the given example: https://gist.github.com/alsrgv/cbbb2e25f09c1df983098098511d16c4

As far as I understand, this is hybrid parallelism, since there is model parallelism within each model replica. Horovod provides MPI communication between replicas, but what about communication between the GPUs inside a replica? For the example above, do "/gpu:0" and "/gpu:1" communicate via MPI between send/receive nodes?

alsrgv (Collaborator) commented Aug 13, 2018

@brksrl, TensorFlow internally manages communication between "/gpu:0" <-> "/gpu:1". Typically it uses direct CUDA IPC.
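
A small stand-alone TF 1.x snippet (illustrative only, not from the gist) showing TensorFlow inserting that cross-GPU transfer by itself:

import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.random_normal([1024, 1024])
    b = tf.matmul(a, a)            # computed on GPU 0
with tf.device('/gpu:1'):
    c = tf.reduce_sum(b)           # TF copies b from GPU 0 to GPU 1 for us

# log_device_placement prints where each op ran, including the implicit copies.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                      log_device_placement=True)) as sess:
    print(sess.run(c))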

brksrl commented Aug 14, 2018

@alsrgv, what if we use CPUs instead of GPUs? In that case, is the communication between "/cpu:0" <-> "/cpu:1" gRPC (the default) or MPI? If it is gRPC, is there a way to force the communication between "/cpu:0" <-> "/cpu:1" to be MPI?

alsrgv (Collaborator) commented Aug 17, 2018

@brksrl, if you use CPUs, you don't need to manually specify devices within a server for model parallelism. It may still be beneficial to run one process per socket and use the --bind-to socket option in the mpirun command, but that would do data parallelism instead of model parallelism.
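
As an illustration, assuming hypothetical two-socket servers (my numbers, not from this thread), the earlier launch command would become something like:

$ mpirun -np 8 \
    -H server1:2,server2:2,server3:2,server4:2 \
    --bind-to socket -map-by socket \
    -x LD_LIBRARY_PATH \
    python train.py

That gives one Horovod rank per socket doing data parallelism on CPU, with no manual device placement needed.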

mgoundge11 commented Jan 28, 2019

Hi @alsrgv @Dapid, I have two systems, each with a single GPU (the same GPU model).
I currently have distributed training running with Horovod on these systems, using the data-parallelism approach on my own dataset for object detection with Keras RetinaNet.
But I want to train RetinaNet with a larger batch size. Would model parallelism be helpful?
If yes, how do I do it with Horovod?
Do I need at least two GPUs per system to do model parallelism?
Please help, I'm stuck here. Thank you very much in advance.

alsrgv (Collaborator) commented Feb 12, 2019

@mgoundge11, Horovod currently supports model parallelism only within a server. It would help if you could move the GPU from one server to the other.
