
Proposal to reduce the ranging mode duration #315

Closed
arjunsuresh opened this issue Jun 22, 2023 · 1 comment

@arjunsuresh (Contributor)

An email was sent to the power and inference WGs about this; I'm still adding it as an issue for better tracking.

During the last LLM taskforce meeting, the long runtime of the LLMs was raised as a concern, and this is doubled in the case of power runs. I'm giving one of our systems as an example for the gpt-j model, where we would like to submit some open variants.

  1. Runtime for the offline scenario: ~13 hours (at 450 W GPU power). If we run 4 models, that is more than 2 days for just the offline scenario on a single system, and a complete ranging mode run adds at least another 2 days. This will be far worse for those doing closed models, and we are not even talking about the GPT-3 model here.

To find the optimal duration for the ranging mode, I collected the data below from the 3.0 inference results. With just a 2-minute ranging mode run, the worst-case power delta compared to the full duration is ~10%. Unlike the inference 3.0 round, the current master branch already adds a 10% margin to the current measured during ranging mode, so even if we reduce the ranging mode duration to 2 minutes, the results should not change, and it can benefit all power submitters. Of course, to be extra safe we could raise the margin on the ranging-mode current to 1.25x or extend the ranging mode to 5 minutes, but if we are forced to do a full ranging mode run, we won't be able to do any power submissions for LLMs.
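To make the comparison concrete, here is a minimal sketch (not the actual analysis script; the trace format and numbers are made up) of how the delta between a short ranging window and the full-run average power can be computed from a series of power samples:

```python
import math

def avg_power_delta(samples, window_s, interval_s=1.0):
    """samples: power readings (watts) at fixed interval_s spacing.
    Returns the relative delta between the windowed average and the
    full-run average power."""
    n = max(1, int(window_s / interval_s))
    window = samples[:n]
    window_avg = sum(window) / len(window)
    full_avg = sum(samples) / len(samples)
    return abs(window_avg - full_avg) / full_avg

# Toy trace (assumed, 1 sample/second for ~13 hours): roughly steady
# ~450 W with a slow oscillation.  A 2-minute window then lands close
# to the full-run average.
samples = [450 + 5 * math.sin(i / 600) for i in range(46800)]
delta = avg_power_delta(samples, window_s=120)
print(f"2-minute window delta: {delta:.2%}")
```

Running this across the actual 3.0 logs per system is what produced the worst-case ~10% figure above.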

Action item (AI): No code change is needed to reduce the ranging mode duration, but submitters must be allowed to use different user_conf files for the ranging and testing mode runs. We already have code in power-dev that checks the avg_power delta between the ranging and testing modes, so this change should be completely safe.

Power Data from inference 3.0

In the graph below, the x-axis shows the avg_power for the specified durations, and the y-axis shows the delta of the avg_power for the given duration compared to the average power over the entire run.

[Graph: avg_power delta vs. ranging duration, from the inference 3.0 power data]

@arjunsuresh (Contributor, Author)

This could be done in a much cleaner way inside loadgen, making it transparent to users, if loadgen could identify the ranging and testing mode runs. Since that is not currently possible, we tried this mechanism and it worked well.
