How should I distribute the data? #7

Open
shinesun130 opened this issue Oct 26, 2017 · 1 comment
Comments

@shinesun130

I want to train on an HDFS cluster with distributed TensorFlow. I have started the same code on the ps, the master, and each worker, using 'run_config' to specify their roles, and I use the 'estimator' with tf.contrib.learn.Experiment.
But I don't know whether I should split the whole training dataset across the workers (so that each worker gets different data),
or just point all the workers to the same path (the whole training dataset)?
If I point all the workers to the same path, then all the data will be loaded into memory, right? I think that would cause some issues.
Forgive my poor English.
Thanks in advance!
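
For reference, here's a minimal sketch of the kind of setup described above, assuming the usual tf.contrib.learn pattern in which RunConfig reads the cluster spec and this process's role from the TF_CONFIG environment variable; the host names, ports, and task index are placeholders:

```python
import json
import os

import tensorflow as tf

# Hypothetical cluster spec: the same script runs on every machine, and
# TF_CONFIG tells this process which role (ps, master, or worker) it plays.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps":     ["ps0.example.com:2222"],
        "master": ["master0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

run_config = tf.contrib.learn.RunConfig()  # parses TF_CONFIG from the environment
```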

@hereismari
Owner

That's a good question.

I'm not an expert, but there are two approaches I know of.

  1. If each worker (including the master) has access to the same data at the same path (this can be HDFS or Google Cloud Storage), I assume each worker will read a random batch from this path, load it into memory, and train on that batch. This can be done and it works. The problem is that as you add more workers they may see the same data over and over, since each one draws random batches from the same source, so it's harder to guarantee good convergence and to scale linearly as you add more machines.

  2. Another option is to split your dataset into N datasets: if you have N workers, you can configure each one to read from a specific dataset. This guarantees that each worker trains on different data (see the sketch after this list). The problem with this approach is that it makes it harder to scale: if you later have N+1 workers, you'll need N+1 dataset partitions. Also, make sure you do a "good" split of your original dataset into the N datasets.
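
As a rough illustration of option 2 (not the exact code from this repo), here is a tf.data-based input_fn where each worker keeps only its own slice of the input files; the file pattern, worker count, and batch size are placeholders:

```python
import tensorflow as tf

def make_input_fn(file_pattern, num_workers, worker_index, batch_size=128):
    """Builds an input_fn that reads only this worker's shard of the files."""
    def input_fn():
        # Every worker lists the same TFRecord files, but keeps only every
        # num_workers-th file starting at its own index, so no two workers
        # read the same file.
        files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
        files = files.shard(num_shards=num_workers, index=worker_index)
        dataset = files.flat_map(tf.data.TFRecordDataset)
        dataset = dataset.shuffle(buffer_size=10000).repeat().batch(batch_size)
        return dataset  # parsing the records into features/labels is omitted
    return input_fn
```

For option 1 you would drop the .shard(...) call and point every worker at the same file_pattern, so each worker draws its own shuffled batches from the full dataset.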

> If I point all the workers to the same path, then all the data will be loaded into memory, right? I think that would cause some issues.

Not exactly. Only part of the data is loaded into memory at any time, not all of it, so even if your dataset doesn't fit in memory you won't have a problem.
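
To make that concrete, here's a rough tf.data sketch (file names are placeholders): the pipeline is evaluated lazily, so only the shuffle buffer and the prefetched batches sit in RAM at any time, regardless of how large the files are.

```python
import tensorflow as tf

dataset = tf.data.TFRecordDataset(["train-00000.tfrecord", "train-00001.tfrecord"])
dataset = dataset.shuffle(buffer_size=10000)  # at most 10,000 records buffered
dataset = dataset.batch(128)
dataset = dataset.prefetch(1)                 # plus one batch prepared in advance
```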

It's possible there are other approaches and other trade-offs to consider, but hopefully this helps you 😄!
