
"Beginner Example" Notebook shows no performance improvement after many epochs #74

Closed
SamShowalter opened this issue Feb 19, 2023 · 12 comments
Labels: bug (Something isn't working), question (Further information is requested), Stale

Comments


SamShowalter commented Feb 19, 2023

Bug

When run, the beginner example notebook shows no sign that the algorithm is learning. After 30 epochs, every evaluation report is identical (included below): the confusion matrix shows every sample assigned to a single class, and accuracy stays at the chance level for 10 classes (~0.10). This is after adjusting the config parameters to match the recommendations (e.g. 100 epochs, 102400 iterations).

Reproduce the Bug

System: Ubuntu 20.04.4

  1. git clone the repository
  2. cd into the repository
  3. Create an anaconda environment with Python 3.8 and cudatoolkit 11.3.
  4. pip install -r requirements.txt
  5. Verify torch works and can access the GPUs
  6. Install jupyterlab
  7. Update the config with the recommended hyperparameters. The only changed values are as follows (a minimal sketch of the resulting config follows this list):
    'epoch': 100,  # set to 100
    'num_train_iter': 102400,  # set to 102400
    'num_eval_iter': 1024,   # set to 1024
    'num_log_iter': 256,    # set to 256
  8. Open JupyterLab and run notebooks/Beginner_Example.ipynb
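
For reference, a minimal sketch of the modified config, assuming the dict-plus-get_config setup from USB's quickstart (the backbone name vit_tiny_patch2_32 and the exact key set are assumptions; the notebook's full dict contains more entries):

from semilearn import get_config

config = get_config({
    'algorithm': 'fixmatch',
    'net': 'vit_tiny_patch2_32',   # assumed backbone name (the notebook's default ViT-tiny)
    'use_pretrain': True,
    'epoch': 100,                  # changed from the notebook default
    'num_train_iter': 102400,      # changed
    'num_eval_iter': 1024,         # changed
    'num_log_iter': 256,           # changed
})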

Error Messages and Logs

Output:

Epoch: 0
/home/.conda/envs/nlpenv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
[2023-02-19 15:33:47,914 INFO] confusion matrix
[2023-02-19 15:33:47,915 INFO] [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
[2023-02-19 15:33:47,917 INFO] evaluation metric
[2023-02-19 15:33:47,918 INFO] acc: 0.1011
[2023-02-19 15:33:47,919 INFO] precision: 0.0101
[2023-02-19 15:33:47,919 INFO] recall: 0.1000
[2023-02-19 15:33:47,920 INFO] f1: 0.0184
model saved: ./saved_models/fixmatch/latest_model.pth
(... truncated by poster ...)
Epoch: 17
/home/.conda/envs/nlpenv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
[2023-02-19 15:34:40,825 INFO] confusion matrix
[2023-02-19 15:34:40,826 INFO] [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
[2023-02-19 15:34:40,828 INFO] evaluation metric
[2023-02-19 15:34:40,829 INFO] acc: 0.1011
[2023-02-19 15:34:40,830 INFO] precision: 0.0101
[2023-02-19 15:34:40,830 INFO] recall: 0.1000
[2023-02-19 15:34:40,831 INFO] f1: 0.0184
model saved: ./saved_models/fixmatch/latest_model.pth
Epoch: 18 through Epoch: 24 print the identical warning, confusion matrix, and metrics (acc: 0.1011, precision: 0.0101, recall: 0.1000, f1: 0.0184); only the timestamps change.

Can someone offer guidance on why no learning occurs? I kept the notebook as-is apart from the config updates noted above.

@Hhhhhhao (Contributor)

Can you provide more information? Which model are you using, and what's the learning rate?

Hhhhhhao added the question label on Feb 22, 2023
@SamShowalter (Author)

Hi, the beginner notebook is exactly as provided except for the parameters I noted above. So the model is vit_tiny_patch and the learning rate is 5e-4, both as provided. The learning rate looks like it could be low, but I assumed the provided example was tuned to train well.


ppsmk388 commented Mar 4, 2023

I'm hitting the same problem.

@hldqiuye123

> Hi, the beginner notebook is exactly as provided except for the parameters I noted above. So the model is vit_tiny_patch and the learning rate is 5e-4, both as provided. The learning rate looks like it could be low, but I assumed the provided example was tuned to train well.

Have you solved it? I've encountered the same problem; even after increasing the learning rate there is still no change, and the model consistently outputs the same results.


wzc32 commented Mar 27, 2023

I've hit the same issue. Do the other notebooks run correctly?

@SamShowalter (Author)

Any update? Could someone who has this working share their config file and the resulting performance as a sanity check?


5v3D3 commented Apr 3, 2023

If this issue is still open: from what I can tell, it happens because the Trainer class's train function only calls train_step, which no longer triggers the ParamUpdateHook, so the optimizer never steps. You can either add the backward/step calls yourself after train_step, or use the train function inherited from AlgorithmBase via algorithm.train() instead; that one runs all the necessary hooks (though I had to move my model to the GPU manually). A sketch of the first option follows.
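
A minimal sketch of that first workaround, hedged: the names ParamUpdateHook and process_batch, and the assumption that train_step returns (out_dict, log_dict) with the total loss under out_dict['loss'], reflect semilearn's internals at the time and may differ in your version:

for data_lb, data_ulb in zip(train_lb_loader, train_ulb_loader):
    out_dict, log_dict = algorithm.train_step(
        **algorithm.process_batch(**data_lb, **data_ulb)
    )
    out_dict['loss'].backward()       # backprop the combined FixMatch loss
    algorithm.optimizer.step()        # apply the gradient update
    if algorithm.scheduler is not None:
        algorithm.scheduler.step()    # advance the learning-rate schedule
    algorithm.model.zero_grad()       # clear gradients for the next batch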


hijihyo commented Apr 13, 2023

> If this issue is still open: from what I can tell, it happens because the Trainer class's train function only calls train_step, which no longer triggers the ParamUpdateHook, so the optimizer never steps. You can either add the backward/step calls yourself after train_step, or use the train function inherited from AlgorithmBase via algorithm.train() instead; that one runs all the necessary hooks (though I had to move my model to the GPU manually).

I had the same problem, and 5v3D3's solution helped: when I used algorithm.train() instead of trainer.fit(), performance started improving. I used the following code:

algorithm.model = algorithm.model.cuda()
algorithm.train()
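
That is, the inherited AlgorithmBase.train() runs the full hook chain (including the parameter update), whereas Trainer.fit at the time stopped at train_step, so the loss was computed but never propagated into the weights.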

@sachinmotwani20

> algorithm.model = algorithm.model.cuda()
> algorithm.train()

The above code works for me, but I'm not sure how to perform evaluation in this context. I tried the following:
algorithm.eval()

AttributeError: 'FixMatch' object has no attribute 'eval'

Any suggestions?
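
A hedged suggestion, assuming the semilearn AlgorithmBase of that era (whose evaluate method is what printed the confusion-matrix logs above; the exact signature may differ in your version): the FixMatch object is an algorithm wrapper rather than an nn.Module, so torch's eval mode lives on its .model, and metrics come from evaluate:

algorithm.model.eval()                  # torch eval mode on the underlying network
eval_dict = algorithm.evaluate('eval')  # assumed signature: eval_dest name selects the eval loader
print(eval_dict)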

@Hhhhhhao (Contributor)

Fixed in PR #135

Hhhhhhao added a commit that referenced this issue Jul 20, 2023
* [Update] resolve requirements.txt conflicts

* [Fix] Fix mean teacher bug in #102

* [Fix] Fix DebiasPL bug

* [Fix] Fix potential sample data bug in #119

* [Update] Add auto issue/pr closer

* [Update] Update requirements.txt

* [Fix] Fix bug in #74

* [Fix] Fix amp lighting bug in #123

* [Fix] Fix notebook bugs

* [Update] release semilearn 0.3.1
@github-actions (bot)

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Aug 20, 2023
@github-actions (bot)

This issue was closed because it has been stalled for 5 days with no activity.

github-actions bot closed this as not planned (won't fix / can't repro / duplicate / stale) on Aug 25, 2023