
Commit 498691a

Authored by Andrey Velichkevich <andrey.velichkevich@gmail.com>
trainer: Update the get_job_logs() API (#4198)
* trainer: Update the get_job_logs() API

  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Use follow=True for getting started example

  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
1 parent 7651dc5 commit 498691a
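
In short, the docs move from the old dict-based return value of `get_job_logs()` (indexed by node name) to the new API that yields log lines and can stream them with `follow=True`. Below is a minimal sketch of the updated usage, pieced together from the examples in the diffs that follow; `job_name` is a placeholder for a TrainJob created with `TrainerClient().train()`, and the import path is an assumption based on the Kubeflow Trainer SDK docs rather than part of this commit.

```python
from kubeflow.trainer import TrainerClient  # import path assumed from the Kubeflow Trainer SDK docs

client = TrainerClient()
job_name = "my-trainjob"  # placeholder: use the TrainJob name returned by client.train(...)

# Old pattern removed by this commit: logs came back as a dict keyed by node,
# e.g. client.get_job_logs(name=job_name)["node-0"].

# New pattern: get_job_logs() yields log lines; follow=True streams them as they arrive.
for logline in client.get_job_logs(job_name, follow=True):
    print(logline)

# Without follow, join the returned lines to print a single step's log at once.
print("\n".join(client.get_job_logs(job_name, step="node-0")))
```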

File tree

5 files changed, +14 −17 lines changed


content/en/docs/components/trainer/getting-started.md

Lines changed: 2 additions & 3 deletions
@@ -179,9 +179,8 @@ Step: node-3, Status: Succeeded, Devices: gpu x 1
 Finally, you can check the training logs from the master node:
 
 ```python
-logs = TrainerClient().get_job_logs(name=job_id)
-
-print(logs["node-0"])
+for logline in TrainerClient().get_job_logs(job_name, follow=True):
+    print(logline)
 ```
 
 Since training was run on 4 GPUs, each PyTorch node processes 60,000 / 4 = 15,000 images

content/en/docs/components/trainer/user-guides/builtin-trainer/torchtune.md

Lines changed: 5 additions & 7 deletions
@@ -107,8 +107,7 @@ We can use the `get_job_logs()` API to get the TrainJob logs.
 ```python
 from kubeflow.trainer.constants import constants
 
-log_dict = client.get_job_logs(job_name, step=constants.DATASET_INITIALIZER)
-print(log_dict[constants.DATASET_INITIALIZER])
+print("\n".join(client.get_job_logs(job_name, step=constants.DATASET_INITIALIZER)))
 ```
 
 Output:
@@ -124,8 +123,7 @@ Fetching 3 files: 100%|██████████| 3/3 [00:01<00:00, 1.82it
 #### Model Initializer
 
 ```python
-log_dict = client.get_job_logs(job_name, step=constants.MODEL_INITIALIZER)
-print(log_dict[constants.MODEL_INITIALIZER])
+print("\n".join(client.get_job_logs(job_name, step=constants.MODEL_INITIALIZER)))
 ```
 
 Output:
@@ -141,8 +139,7 @@ Fetching 8 files: 100%|██████████| 8/8 [01:02<00:00, 7.87s/
 #### Training Node
 
 ```python
-log_dict = client.get_job_logs(job_name, follow=False)
-print(log_dict[f"{constants.NODE}-0"])
+print("\n".join(client.get_job_logs(job_name)))
 ```
 
 Output:
@@ -160,7 +157,7 @@ INFO:torchtune.utils._logging:Memory stats after model init:
 GPU peak memory allocation: 2.33 GiB
 GPU peak memory reserved: 2.34 GiB
 GPU peak memory active: 2.33 GiB
-/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
+/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
 warnings.warn( # warn only once
 INFO:torchtune.utils._logging:Optimizer is initialized.
 INFO:torchtune.utils._logging:Loss is initialized.
@@ -207,6 +204,7 @@ Currently, we support:
 
 1. Data Directory: Use all data files under this directory. For example, `hf://tatsu-lab/alpaca/data` uses all data files under the `/data` directory of `tatsu-lab/alpaca` repo in HuggingFace.
 2. Single Data File: Use the single data file given the path. For example, `hf://tatsu-lab/alpaca/data/xxx.parquet` uses the single `/data/xxx.parquet` data file of `tatsu-lab/alpaca` repo in HuggingFace.
+
 {{% /alert %}}
 
 #### Model Initializer

content/en/docs/components/trainer/user-guides/deepspeed.md

Lines changed: 2 additions & 2 deletions
@@ -137,7 +137,7 @@ job_id = TrainerClient().train(
 TrainerClient().wait_for_job_status(job_id)
 
 # Since we launch DeepSpeed with `mpirun`, all logs should be consumed from the node-0.
-print(TrainerClient().get_job_logs(name=job_id, node_rank=0)["node-0"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id)))
 ```
 
 You should see the distributed environment across the two training nodes as follows:
@@ -284,7 +284,7 @@ job_id = TrainerClient().train(
 You can use the `get_job_logs()` API to see your TrainJob logs:
 
 ```py
-print(TrainerClient().get_job_logs(name=job_id)["node-0"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id)))
 ```
 
 {{% alert title="Note" color="info" %}}

content/en/docs/components/trainer/user-guides/mlx.md

Lines changed: 2 additions & 2 deletions
@@ -139,7 +139,7 @@ TrainerClient().wait_for_job_status(job_id)
 
 
 # Since we launch MLX with `mpirun`, all logs should be consumed from the node-0.
-print(TrainerClient().get_job_logs(name=job_id, node_rank=0)["node-0"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id)))
 ```
 
 You should see the distributed environment as follows:
@@ -238,7 +238,7 @@ job_id = TrainerClient().train(
 You can use the `get_job_logs()` API to see your TrainJob logs:
 
 ```py
-print(TrainerClient().get_job_logs(name=job_id)["node-0"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id)))
 ```
 
 {{% alert title="Note" color="info" %}}

content/en/docs/components/trainer/user-guides/pytorch.md

Lines changed: 3 additions & 3 deletions
@@ -106,10 +106,10 @@ job_id = TrainerClient().train(
 TrainerClient().wait_for_job_status(job_id)
 
 print("Distributed PyTorch env on node-0")
-print(TrainerClient().get_job_logs(name=job_id, node_rank=0)["node-0"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id, step="node-0")))
 
 print("Distributed PyTorch env on node-1")
-print(TrainerClient().get_job_logs(name=job_id, node_rank=1)["node-1"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id, step="node-1")))
 ```
 
 You should see the distributed environment across the two training nodes as follows:
@@ -212,7 +212,7 @@ job_id = TrainerClient().train(
 You can use the `get_job_logs()` API to see your TrainJob logs:
 
 ```py
-print(TrainerClient().get_job_logs(name=job_id)["node-0"])
+print("\n".join(TrainerClient().get_job_logs(name=job_id)))
 ```
 
 ## Next Steps
