Skip to content

feat: add total_samples as a field to logs being emitted#326

Merged
JamesKunstle merged 1 commit intoinstructlab:mainfrom
RobotSail:track-total-samples
Nov 11, 2024
Merged

feat: add total_samples as a field to logs being emitted#326
JamesKunstle merged 1 commit intoinstructlab:mainfrom
RobotSail:track-total-samples

Conversation

@RobotSail
Copy link
Copy Markdown
Member

Today we emit a number of important values in the logs, but we do not explicitly announce how many samples
we are planning on iterating through. This commit emits total_samples as a field in both the initial log
as well as the subsequent logs during training. This way, this value can be accessed consistently by
any program interested in displaying information about its progress for convenience.

Fixes #325

Signed-off-by: Oleg S 97077423+RobotSail@users.noreply.github.com

Today we emit a number of important values in the logs, but we do not explicitly announce how many samples
we are planning on iterating through. This commit emits total_samples as a field in both the initial log
as well as the subsequent logs during training. This way, this value can be accessed consistently by
any program interested in displaying information about its progress for convenience.

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
@mergify mergify Bot added the one-approval label Nov 8, 2024
@JamesKunstle
Copy link
Copy Markdown
Contributor

I approved then put the hold- quick q, is len(dataloader.dataset) a cached length value or is it computed each time? we could cache it before we start the loop if it's not cached already.

@RobotSail
Copy link
Copy Markdown
Member Author

RobotSail commented Nov 11, 2024

@JamesKunstle What is your concern with the calculation of len(dataloader.dataset) if it's not cached? In Python, the __len__ implementation should be $O(1)$. If this isn't the case then it would be a bug that needs to be fixed elsewhere.

@JamesKunstle
Copy link
Copy Markdown
Contributor

@RobotSail that tends to be the case for inbuilt stuff but this would be a length calculation for our custom Dataset objects. I confirmed that it's O(1) as best as I can tell so this should be good

@JamesKunstle JamesKunstle merged commit ee14d11 into instructlab:main Nov 11, 2024
@JamesKunstle JamesKunstle removed the hold label Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Log the total number of samples

3 participants