Loading real data on subset of hosts #187

khatwanimohit · 2023-09-29T00:41:19Z

No description provided.

rwitten

Just unassigning myself. We will talk live to discuss landing this CR.

rwitten · 2024-03-18T18:59:10Z

MaxText/pyconfig.py

@@ -317,12 +317,16 @@ def get_individual_scales(scale):
 def calculate_global_batch_sizes(raw_keys):
  """ Calculates target global batch size from target devices and per_device_batch"""
  per_device_batch_size = raw_keys['per_device_batch_size']
+  expansion_factor_real_data = raw_keys['expansion_factor_real_data']
  num_devices = get_num_target_devices(raw_keys)
  if per_device_batch_size < 1.0:
    # For per_device_batch_size<1, we load the data as if per_device_batch_size=1
    global_batch_size_to_load = num_devices


But couldn't fewer than that number of hosts load the data?

Like I think we should still ramp the data?
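
For readers following the question above, here is a minimal sketch of the fractional-batch rule visible in the diff. Only the `per_device_batch_size < 1.0` branch appears in the hunk. The `else` branch, the training batch size formula, and the name `sketch_batch_sizes` are assumptions made for illustration, and the effect of `expansion_factor_real_data` is left out because the hunk is truncated before it is used.

```python
from typing import Tuple


def sketch_batch_sizes(num_devices: int, per_device_batch_size: float) -> Tuple[int, int]:
    """Return (batch_size_to_load, batch_size_to_train_on). Illustrative only."""
    if num_devices <= 0 or per_device_batch_size <= 0:
        raise ValueError("num_devices and per_device_batch_size must be positive")

    if per_device_batch_size < 1.0:
        # From the diff: load data as if per_device_batch_size were 1,
        # i.e. one example per device.
        batch_size_to_load = num_devices
    else:
        # Assumption: this branch is not shown in the diff.
        batch_size_to_load = int(num_devices * per_device_batch_size)

    # Assumption: training always uses the true per-device batch size.
    batch_size_to_train_on = int(num_devices * per_device_batch_size)
    return batch_size_to_load, batch_size_to_train_on


if __name__ == "__main__":
    print(sketch_batch_sizes(256, 0.5))
    print(sketch_batch_sizes(256, 2.0))
```

Under these assumptions, a fractional per-device batch loads more examples than it trains on, which is the case the review question is probing.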

MaxText/input_pipeline/input_pipeline_interface.py

rwitten

This has my approval but please discuss with Aireen and Roshani before merging.

@aireenmei @RoshaniN -- this CR lets MaxText make balancing decisions about the number of hosts that read from GCS. We find in practice this is a useful lever because the thundering herd of VMs can crush GCS even though the aggregate data volume isn't large.

aireenmei · 2024-03-22T23:17:27Z

Thanks for adding me to the thread. I think I missed some context, so I'm not sure I understand the whole idea. I see that only a subset of hosts loads data; are they going to pass the data to the hosts that are not loading real data? Why are the rest of the hosts returning synthetic data?

RoshaniN · 2024-03-25T20:26:13Z

I don't see any issues as standalone_dataloader would be using the same input pipeline as train. I would like to understand the recommendations on the expansion_factor_real_data, will do that offline.

aireenmei · 2024-04-01T18:51:28Z

Could you share some convergence results when expansion_factor_real_data != -1 ?

khatwanimohit · 2024-04-01T19:20:47Z

> Could you share some convergence results when expansion_factor_real_data != -1 ?

Convergence test with expansion_factor_real_data=4 (i.e. 16 hosts out of 64 hosts will load the real data)
https://cloudlogging.app.goo.gl/DPDSXu2tSM3ga8hG8
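
The arithmetic in the parenthetical above (factor 4 on 64 hosts gives 16 loading hosts) can be sketched as follows. This is a hedged illustration, not the PR's code. The function names are hypothetical, treating `-1` as "all hosts load" is inferred from the thread, and both the divisibility check and the rule that the lowest-indexed hosts are the ones that load are assumptions.

```python
def hosts_loading_real_data(num_hosts: int, expansion_factor_real_data: int) -> int:
    """Number of hosts that read real data (hypothetical helper)."""
    if num_hosts <= 0:
        raise ValueError("num_hosts must be positive")
    if expansion_factor_real_data == -1:
        # Inferred from the thread: -1 means the feature is off, all hosts load.
        return num_hosts
    if expansion_factor_real_data <= 0:
        raise ValueError("expansion_factor_real_data must be -1 or positive")
    if num_hosts % expansion_factor_real_data != 0:
        # Assumption: the factor is required to divide the host count evenly.
        raise ValueError("num_hosts must be divisible by expansion_factor_real_data")
    return num_hosts // expansion_factor_real_data


def host_loads_real_data(host_index: int, num_hosts: int, expansion_factor_real_data: int) -> bool:
    """Assumed selection rule: the lowest-indexed hosts load real data."""
    return host_index < hosts_loading_real_data(num_hosts, expansion_factor_real_data)


if __name__ == "__main__":
    print(hosts_loading_real_data(64, 4))
    print(host_loads_real_data(16, 64, 4))
```

With a factor of 4, each loading host would have to cover the data for four hosts' worth of devices, which is why the factor trades fewer GCS readers for more work per reader.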

aireenmei · 2024-04-01T22:56:49Z

Thanks! Could you also run a convergence test with grain? bash end_to_end/test_convergence_1b_params.sh DATASET_TYPE="c4-array_record" ...

khatwanimohit · 2024-04-02T16:07:41Z

> Thanks! Could you also run a convergence test with grain? bash end_to_end/test_convergence_1b_params.sh DATASET_TYPE="c4-array_record" ...

Convergence run with c4-array_record: https://cloudlogging.app.goo.gl/jiFMzAx8SDRw4nM46

@aireenmei I will also add airflow tests for both convergence tests.

RoshaniN

Thanks Mohit!

MaxText/configs/base.yml

MaxText/input_pipeline/input_pipeline_interface.py

aireenmei

Thanks Mohit!

khatwanimohit requested a review from rwitten as a code owner September 29, 2023 00:41

khatwanimohit assigned rwitten Sep 29, 2023

khatwanimohit force-pushed the mohit/hosts_real_data branch 2 times, most recently from 92cd603 to 3841984 Compare September 29, 2023 23:06

rwitten removed their assignment Sep 30, 2023

rwitten reviewed Sep 30, 2023

khatwanimohit force-pushed the mohit/hosts_real_data branch from 3841984 to f064388 Compare January 19, 2024 21:54

khatwanimohit force-pushed the mohit/hosts_real_data branch 3 times, most recently from a19360b to d41cbb6 Compare January 31, 2024 19:59

khatwanimohit force-pushed the mohit/hosts_real_data branch 2 times, most recently from 7db70f5 to 9d6ae76 Compare February 27, 2024 20:40

khatwanimohit force-pushed the mohit/hosts_real_data branch 6 times, most recently from 60f3a1d to b4578c1 Compare March 12, 2024 23:43

khatwanimohit assigned rwitten Mar 13, 2024

rwitten requested changes Mar 18, 2024

rwitten removed their assignment Mar 18, 2024

khatwanimohit force-pushed the mohit/hosts_real_data branch 4 times, most recently from c108249 to f92ec46 Compare March 21, 2024 23:01

khatwanimohit assigned rwitten Mar 21, 2024

rwitten requested a review from aireenmei March 22, 2024 18:06

rwitten approved these changes Mar 22, 2024

github-actions bot added the pull ready label Mar 22, 2024

RoshaniN self-requested a review March 25, 2024 20:26

rwitten removed their assignment Mar 26, 2024

khatwanimohit force-pushed the mohit/hosts_real_data branch from f92ec46 to 24d9512 Compare April 1, 2024 18:46

RoshaniN approved these changes Apr 2, 2024

khatwanimohit force-pushed the mohit/hosts_real_data branch from 24d9512 to 0681c96 Compare April 2, 2024 18:05

aireenmei approved these changes Apr 2, 2024

khatwanimohit force-pushed the mohit/hosts_real_data branch from 0681c96 to 4272b6a Compare April 2, 2024 18:13

load real data in subset of hosts

2a0972b

khatwanimohit force-pushed the mohit/hosts_real_data branch from 4272b6a to 2a0972b Compare April 2, 2024 19:19

copybara-service bot merged commit 5cb6052 into main Apr 2, 2024
8 checks passed

copybara-service bot deleted the mohit/hosts_real_data branch April 2, 2024 21:21

A9isha pushed a commit that referenced this pull request Apr 11, 2024

[PT/XLA] Install expecttest in CI test script (#187)

e6d8d77

A9isha mentioned this pull request Apr 11, 2024

Converting checkpoints #551

Closed
