feat: add log exporting to e2e tests by RobotSail · Pull Request #308 · instructlab/training

RobotSail · 2024-10-25T22:08:27Z

Currently, the training library runs through a series of end-to-end tests which ensure that the
code under test runs without hard errors. However, we do not perform any form of validation to ensure that
the training logic and quality have not degraded.

This presents an issue where we can be "correct" in the sense that no hard errors are hit,
while silent bugs are introduced that cause models to regress in training quality, or that
otherwise degrade the models themselves.

This commit addresses that problem by introducing the ability to export the training loss data itself
from the test and render the loss curve using matplotlib.
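As a rough illustration of the idea (not the code from this PR), the sketch below reads loss values from a training log and renders them with matplotlib. It assumes the log is a JSON Lines file in which each record may carry a `total_loss` field. The field name, the file layout, and the helper names `read_losses` and `render_loss_curve` are all hypothetical.

```python
import json
from pathlib import Path


def read_losses(log_path, loss_key="total_loss"):
    """Collect loss values from a JSON Lines log.

    Assumption: one JSON object per line; `loss_key` is a hypothetical
    field name. Blank lines and records without the key are skipped.
    """
    losses = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if loss_key in record:
                losses.append(float(record[loss_key]))
    return losses


def render_loss_curve(losses, output_path):
    """Plot loss against step index and save the figure to `output_path`."""
    # Imported lazily so that log parsing works even without matplotlib.
    import matplotlib

    matplotlib.use("Agg")  # non-interactive backend, suitable for CI runners
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(range(len(losses)), losses)
    ax.set_xlabel("step")
    ax.set_ylabel("loss")
    ax.set_title("Training loss")
    fig.savefig(output_path)
    plt.close(fig)
    return Path(output_path)
```

A CI step could then call `render_loss_curve(read_losses("train_log.jsonl"), "loss.png")` and publish the resulting image; the file names here are placeholders.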

Once the results are generated, they can be found under the "Summary" tab of a GitHub Actions run.
For example:

Resolves #179

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>

JamesKunstle

lgtm!

RobotSail · 2024-11-13T14:47:04Z

@nathan-weinberg I've updated the CI scripts with your feedback. Please take another pass when you get a chance and make sure that we didn't miss anything.

nathan-weinberg

I'd like the version number commenting to be consistent with how it is everywhere else, but otherwise LGTM

nathan-weinberg · 2024-11-13T17:34:56Z

Can we squash commits before merging? Great work on this @RobotSail, excited to see it in action!

RobotSail · 2024-11-13T19:26:47Z

@nathan-weinberg This has been squashed. I'll remove the hold since that was the only issue.

mergify Bot added the CI/CD (Affects CI/CD configuration), ci-failure, and dependencies (Pull requests that update a dependency file) labels Oct 25, 2024

RobotSail force-pushed the official-loss-printing branch from 4742edf to 8f77076 on October 25, 2024 22:11

mergify Bot added ci-failure and removed ci-failure labels Oct 25, 2024

RobotSail force-pushed the official-loss-printing branch from 8f77076 to 00e0231 on October 25, 2024 22:13

mergify Bot removed the ci-failure label Oct 25, 2024

RobotSail force-pushed the official-loss-printing branch from 00e0231 to 4d3e3a7 on October 25, 2024 22:17

RobotSail requested review from JamesKunstle, Maxusmusti, aldopareja, cdoern, danmcp and nathan-weinberg October 25, 2024 22:18

danmcp suggested changes Oct 25, 2024

Comment thread .github/workflows/e2e-nvidia-l4-x1.yml

Comment thread .github/workflows/e2e-nvidia-l4-x1.yml Outdated

RobotSail force-pushed the official-loss-printing branch from 4d3e3a7 to 82d5711 on October 26, 2024 17:25

danmcp reviewed Oct 26, 2024

Comment thread .github/workflows/e2e-nvidia-l4-x1.yml Outdated

JamesKunstle approved these changes Oct 28, 2024

mergify Bot added the one-approval label Oct 28, 2024

nathan-weinberg requested changes Oct 29, 2024

RobotSail force-pushed the official-loss-printing branch 2 times, most recently from 1fd7c48 to 387828b on November 6, 2024 14:23

RobotSail force-pushed the official-loss-printing branch from 387828b to 039b743 on November 13, 2024 14:43

nathan-weinberg approved these changes Nov 13, 2024

Comment thread .github/workflows/e2e-nvidia-l4-x1.yml Outdated

Comment thread .github/workflows/e2e-nvidia-l4-x1.yml Outdated

Comment thread .github/workflows/e2e-nvidia-l4-x1.yml Outdated

mergify Bot removed the one-approval label Nov 13, 2024

danmcp approved these changes Nov 13, 2024

nathan-weinberg added the hold label Nov 13, 2024

RobotSail force-pushed the official-loss-printing branch from ab6151d to c809c73 on November 13, 2024 19:26

RobotSail removed the hold label Nov 13, 2024

nathan-weinberg removed request for Maxusmusti, aldopareja and cdoern November 13, 2024 19:53

mergify Bot merged commit ff36e64 into instructlab:main Nov 13, 2024

4 participants