
Commit 498691a

Authored by Andrey Velichkevich <andrey.velichkevich@gmail.com>
trainer: Update the get_job_logs() API (#4198)
* trainer: Update the get_job_logs() API

  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Use follow=True for getting started example

  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
1 parent 7651dc5 commit 498691a
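
In short, the docs move from the old dict-based return value of `get_job_logs()` (indexed by node name) to the new API that yields log lines and can stream them with `follow=True`. Below is a minimal sketch of the updated usage, pieced together from the examples in the diffs that follow; `job_name` is a placeholder for a TrainJob created with `TrainerClient().train()`, and the import path is an assumption based on the Kubeflow Trainer SDK docs rather than part of this commit.

```python
from kubeflow.trainer import TrainerClient  # import path assumed from the Kubeflow Trainer SDK docs

client = TrainerClient()
job_name = "my-trainjob"  # placeholder: use the TrainJob name returned by client.train(...)

# Old pattern removed by this commit: logs came back as a dict keyed by node,
# e.g. client.get_job_logs(name=job_name)["node-0"].

# New pattern: get_job_logs() yields log lines; follow=True streams them as they arrive.
for logline in client.get_job_logs(job_name, follow=True):
    print(logline)

# Without follow, join the returned lines to print a single step's log at once.
print("\n".join(client.get_job_logs(job_name, step="node-0")))
```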

File tree

5 files changed, +14 −17 lines changed


content/en/docs/components/trainer/getting-started.md

Lines changed: 2 additions & 3 deletions
@@ -179,9 +179,8 @@ Step: node-3, Status: Succeeded, Devices: gpu x 1
 Finally, you can check the training logs from the master node:
 
 ```python
-logs = TrainerClient().get_job_logs(name=job_id)
-
-print(logs["node-0"])
+for logline in TrainerClient().get_job_logs(job_name, follow=True):
+    print(logline)
 ```
 
 Since training was run on 4 GPUs, each PyTorch node processes 60,000 / 4 = 15,000 images

content/en/docs/components/trainer/user-guides/builtin-trainer/torchtune.md

Lines changed: 5 additions & 7 deletions
@@ -107,8 +107,7 @@ We can use the `get_job_logs()` API to get the TrainJob logs.
 ```python
 from kubeflow.trainer.constants import constants
 
-log_dict = client.get_job_logs(job_name, step=constants.DATASET_INITIALIZER)
-print(log_dict[constants.DATASET_INITIALIZER])
+print("\n".join(client.get_job_logs(job_name, step=constants.DATASET_INITIALIZER)))
 ```
 
 Output:
@@ -124,8 +123,7 @@ Fetching 3 files: 100%|██████████| 3/3 [00:01<00:00, 1.82it
 #### Model Initializer
 
 ```python
-log_dict = client.get_job_logs(job_name, step=constants.MODEL_INITIALIZER)
-print(log_dict[constants.MODEL_INITIALIZER])
+print("\n".join(client.get_job_logs(job_name, step=constants.MODEL_INITIALIZER)))
 ```
 
 Output:
@@ -141,8 +139,7 @@ Fetching 8 files: 100%|██████████| 8/8 [01:02<00:00, 7.87s/
 #### Training Node
 
 ```python
-log_dict = client.get_job_logs(job_name, follow=False)
-print(log_dict[f"{constants.NODE}-0"])
+print("\n".join(client.get_job_logs(job_name)))
 ```
 
 Output:
@@ -160,7 +157,7 @@ INFO:torchtune.utils._logging:Memory stats after model init:
 GPU peak memory allocation: 2.33 GiB
 GPU peak memory reserved: 2.34 GiB
 GPU peak memory active: 2.33 GiB
-/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
+/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
 warnings.warn( # warn only once
 INFO:torchtune.utils._logging:Optimizer is initialized.
 INFO:torchtune.utils._logging:Loss is initialized.
@@ -207,6 +204,7 @@ Currently, we support:
 
 1. Data Directory: Use all data files under this directory. For example, `hf://tatsu-lab/alpaca/data` uses all data files under the `/data` directory of `tatsu-lab/alpaca` repo in HuggingFace.
 2. Single Data File: Use the single data file given the path. For example, `hf://tatsu-lab/alpaca/data/xxx.parquet` uses the single `/data/xxx.parquet` data file of `tatsu-lab/alpaca` repo in HuggingFace.
+
 {{% /alert %}}
 
 #### Model Initializer

content/en/docs/components/trainer/user-guides/deepspeed.md

Lines changed: 2 additions & 2 deletions
@@ -137,7 +137,7 @@ job_id = TrainerClient().train(
 TrainerClient().wait_for_job_status(job_id)
 
 # Since we launch DeepSpeed with `mpirun`, all logs should be consumed from the node-0.
-print(TrainerClient().get_job_logs(name=job_id, node_rank=0)["node-0"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id)))
 ```
 
 You should see the distributed environment across the two training nodes as follows:
@@ -284,7 +284,7 @@ job_id = TrainerClient().train(
 You can use the `get_job_logs()` API to see your TrainJob logs:
 
 ```py
-print(TrainerClient().get_job_logs(name=job_id)["node-0"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id)))
 ```
 
 {{% alert title="Note" color="info" %}}

content/en/docs/components/trainer/user-guides/mlx.md

Lines changed: 2 additions & 2 deletions
@@ -139,7 +139,7 @@ TrainerClient().wait_for_job_status(job_id)
 
 
 # Since we launch MLX with `mpirun`, all logs should be consumed from the node-0.
-print(TrainerClient().get_job_logs(name=job_id, node_rank=0)["node-0"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id)))
 ```
 
 You should see the distributed environment as follows:
@@ -238,7 +238,7 @@ job_id = TrainerClient().train(
 You can use the `get_job_logs()` API to see your TrainJob logs:
 
 ```py
-print(TrainerClient().get_job_logs(name=job_id)["node-0"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id)))
 ```
 
 {{% alert title="Note" color="info" %}}

content/en/docs/components/trainer/user-guides/pytorch.md

Lines changed: 3 additions & 3 deletions
@@ -106,10 +106,10 @@ job_id = TrainerClient().train(
 TrainerClient().wait_for_job_status(job_id)
 
 print("Distributed PyTorch env on node-0")
-print(TrainerClient().get_job_logs(name=job_id, node_rank=0)["node-0"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id, step="node-0")))
 
 print("Distributed PyTorch env on node-1")
-print(TrainerClient().get_job_logs(name=job_id, node_rank=1)["node-1"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id, step="node-1")))
 ```
 
 You should see the distributed environment across the two training nodes as follows:
@@ -212,7 +212,7 @@ job_id = TrainerClient().train(
 You can use the `get_job_logs()` API to see your TrainJob logs:
 
 ```py
-print(TrainerClient().get_job_logs(name=job_id)["node-0"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id)))
 ```
 
 ## Next Steps
