How should I distribute the data? #7

Open
shinesun130 opened this issue Oct 26, 2017 · 1 comment
Comments

@shinesun130

I want to train on an HDFS cluster with distributed TensorFlow. I have started the same code on the ps, the master, and each worker, using 'run_config' to specify their roles, and I use the 'estimator' with tf.contrib.learn.Experiment.
But I don't know whether I should split the whole training dataset across the workers (so that each worker gets different data),
or just point all the workers to the same path (the whole training dataset)?
If I point all the workers to the same path, then all the data will be loaded into memory, right? I think that would cause some issues.
Forgive my poor English.
Thanks in advance!
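
For reference, here's a minimal sketch of the kind of setup described above, assuming the usual tf.contrib.learn pattern in which RunConfig reads the cluster spec and this process's role from the TF_CONFIG environment variable; the host names, ports, and task index are placeholders:

```python
import json
import os

import tensorflow as tf

# Hypothetical cluster spec: the same script runs on every machine, and
# TF_CONFIG tells this process which role (ps, master, or worker) it plays.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps":     ["ps0.example.com:2222"],
        "master": ["master0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

run_config = tf.contrib.learn.RunConfig()  # parses TF_CONFIG from the environment
```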

@hereismari
Owner

That's a good question.

I'm not an expert, but there are two approaches I know of.

  1. If each worker (including the master) has access to the same data at the same path (this can be HDFS or Google Cloud Storage), I assume each worker will read a random batch from this path, load it into memory, and train on that batch. This can be done and it works. The problem is that as you add more workers they may see the same data over and over, since each one draws random batches from the same source, so it's harder to guarantee good convergence and to scale linearly as you add more machines.

  2. Another option is to split your dataset into N datasets: if you have N workers, you can configure each one to read from a specific dataset. This guarantees that each worker trains on different data (see the sketch after this list). The problem with this approach is that it makes it harder to scale: if you later have N+1 workers, you'll need N+1 dataset partitions. Also, make sure you do a "good" split of your original dataset into the N datasets.
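
As a rough illustration of option 2 (not the exact code from this repo), here is a tf.data-based input_fn where each worker keeps only its own slice of the input files; the file pattern, worker count, and batch size are placeholders:

```python
import tensorflow as tf

def make_input_fn(file_pattern, num_workers, worker_index, batch_size=128):
    """Builds an input_fn that reads only this worker's shard of the files."""
    def input_fn():
        # Every worker lists the same TFRecord files, but keeps only every
        # num_workers-th file starting at its own index, so no two workers
        # read the same file.
        files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
        files = files.shard(num_shards=num_workers, index=worker_index)
        dataset = files.flat_map(tf.data.TFRecordDataset)
        dataset = dataset.shuffle(buffer_size=10000).repeat().batch(batch_size)
        return dataset  # parsing the records into features/labels is omitted
    return input_fn
```

For option 1 you would drop the .shard(...) call and point every worker at the same file_pattern, so each worker draws its own shuffled batches from the full dataset.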

> If I point all the workers to the same path, then all the data will be loaded into memory, right? I think that would cause some issues.

Not exactly. Only part of the data is loaded into memory at any time, not all of it, so even if your dataset doesn't fit in memory you won't have a problem.
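
To make that concrete, here's a rough tf.data sketch (file names are placeholders): the pipeline is evaluated lazily, so only the shuffle buffer and the prefetched batches sit in RAM at any time, regardless of how large the files are.

```python
import tensorflow as tf

dataset = tf.data.TFRecordDataset(["train-00000.tfrecord", "train-00001.tfrecord"])
dataset = dataset.shuffle(buffer_size=10000)  # at most 10,000 records buffered
dataset = dataset.batch(128)
dataset = dataset.prefetch(1)                 # plus one batch prepared in advance
```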

It's possible there are other approaches and other trade-offs to consider, but hopefully this helps you 😄!
