chunk_size value #27

davoudisaeedeh · 2022-06-19T18:06:58Z

I figured out that model.fit consumes batch_size * batches_per_epoch samples per call. However, we load 100,000 samples (chunk_size) each time we need new data. Can we reduce chunk_size to batch_size * batches_per_epoch samples to lower memory usage (assuming a fixed batch_size = 64)?

nadavbra · 2022-06-20T15:20:01Z

Hi @dsaeedeh, can you please provide more context about your question? Which script/function of ProteinBERT are you using exactly?

davoudisaeedeh · 2022-06-25T06:33:41Z

Hi,
The ModelTrainer class in pretraining.py has this method:
    def train_next_epoch(self, autosave = True):
        changed_episode, episode = self.epoch_generator.determine_episode_and_ready_next_epoch()
        if changed_episode:
            log('Starting a new episode with seq_len = %d.' % episode.seq_len)
            self.model_generator.dummy_epoch = self.epoch_generator.create_dummpy_epoch()[:2]
            self.model_generator.update_state(self.model)
            self.model = self.model_generator.create_model(episode.seq_len)
        X, Y, sample_weigths = self.epoch_generator.create_next_epoch()
        log('Epoch %d (current sample %d):' % (self.current_epoch_index, self.epoch_generator.current_sample_index))
        self.model.fit(X, Y, sample_weight = sample_weigths, batch_size = episode.batch_size, callbacks = self.fit_callbacks)

model.fit takes X and Y, each holding batch_size * batches_per_epoch samples. This means we only need to load that many samples into memory at a time. So, can we reduce chunk_size from 100,000 samples to that number?
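
To make the sizes concrete, here is a rough back-of-the-envelope sketch. batch_size = 64 and chunk_size = 100,000 come from this thread; batches_per_epoch = 100 is only an assumed value for illustration, and the helper name samples_per_epoch is hypothetical (it is not part of ProteinBERT).

```python
def samples_per_epoch(batch_size: int, batches_per_epoch: int) -> int:
    """Number of samples one model.fit call consumes."""
    return batch_size * batches_per_epoch


batch_size = 64           # from this thread
batches_per_epoch = 100   # assumed value, for illustration only
chunk_size = 100_000      # from this thread

needed = samples_per_epoch(batch_size, batches_per_epoch)
print(needed)               # 6400
print(chunk_size / needed)  # 15.625
```

Under that assumption, one chunk holds roughly 15 epochs' worth of samples, which is the memory overhead in question.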

nadavbra · 2022-06-25T19:29:49Z

What dataset are you training on? Are you using the same seq_len throughout the entire pretraining (without switching between episodes of different protein lengths)? A larger chunk_size makes the process more efficient and faster by requiring fewer storage reads, but sure, you can make it smaller if you want.
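
The trade-off described here can be sketched with a toy buffered loader. This is not ProteinBERT's actual loading code: iter_epochs, count_reads, and fake_read are hypothetical names, and the sample counts are made up. The sketch only shows that a smaller chunk size means more storage reads for the same number of epochs.

```python
from typing import Callable, Iterator, List, Tuple


def iter_epochs(read_chunk: Callable[[int, int], List[int]],
                total_samples: int,
                chunk_size: int,
                epoch_size: int) -> Iterator[List[int]]:
    """Yield epochs of epoch_size samples, refilling a buffer chunk by chunk."""
    buffer: List[int] = []
    offset = 0
    while True:
        # Refill the buffer until it holds a full epoch or the data runs out.
        while len(buffer) < epoch_size and offset < total_samples:
            n = min(chunk_size, total_samples - offset)
            buffer.extend(read_chunk(offset, n))
            offset += n
        if len(buffer) < epoch_size:
            return  # drop the incomplete final epoch
        yield buffer[:epoch_size]
        buffer = buffer[epoch_size:]


def count_reads(chunk_size: int,
                total_samples: int = 1000,
                epoch_size: int = 64) -> Tuple[int, int]:
    """Return (storage reads, epochs produced) for a given chunk size."""
    reads = 0

    def fake_read(offset: int, n: int) -> List[int]:
        # Stand-in for a real storage read; it only counts calls.
        nonlocal reads
        reads += 1
        return list(range(offset, offset + n))

    epochs = sum(1 for _ in iter_epochs(fake_read, total_samples, chunk_size, epoch_size))
    return reads, epochs


print(count_reads(500))  # (2, 15)
print(count_reads(64))   # (16, 15)
```

Both settings produce the same 15 epochs, but the small chunk needs 16 reads instead of 2, while the buffer never holds much more than one epoch of samples.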

davoudisaeedeh · 2022-07-01T03:52:47Z

My dataset is the same as yours, but with a different annotation vector. I am using a fixed seq_len throughout the entire pretraining. Thanks for your reply. I agree with you; however, in terms of memory usage, I think a smaller chunk_size would be more efficient.

nadavbra closed this as completed Jul 2, 2022
