
"Beginner Example" Notebook shows no performance improvement after many epochs #74

Closed
SamShowalter opened this issue Feb 19, 2023 · 12 comments
Labels
bug Something isn't working question Further information is requested Stale

Comments


SamShowalter commented Feb 19, 2023

Bug

When run, the beginner example notebook shows no sign that the algorithm is learning. After 30 epochs, every evaluation report produces exactly the same output, included below. This is after adjusting the config parameters to match the recommendations (e.g., 100 epochs, 102400 training iterations).

Reproduce the Bug

System: Ubuntu 20.04.4

  1. git clone the repository
  2. cd into the repository
  3. Create an Anaconda environment with Python 3.8 and cudatoolkit 11.3
  4. pip install -r requirements.txt
  5. Verify that torch works and can access the GPUs
  6. Install jupyterlab
  7. Update the config with the recommended hyperparameters. The only values changed are:
    'epoch': 100,  # set to 100
    'num_train_iter': 102400,  # set to 102400
    'num_eval_iter': 1024,   # set to 1024
    'num_log_iter': 256,    # set to 256
  8. Open JupyterLab and run notebooks/Beginner_Example.ipynb

Error Messages and Logs

Output:

Epoch: 0
/home/.conda/envs/nlpenv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
[2023-02-19 15:33:47,914 INFO] confusion matrix
[2023-02-19 15:33:47,915 INFO] [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
[2023-02-19 15:33:47,917 INFO] evaluation metric
[2023-02-19 15:33:47,918 INFO] acc: 0.1011
[2023-02-19 15:33:47,919 INFO] precision: 0.0101
[2023-02-19 15:33:47,919 INFO] recall: 0.1000
[2023-02-19 15:33:47,920 INFO] f1: 0.0184
model saved: ./saved_models/fixmatch/latest_model.pth
(... truncated by poster ...)
Epoch: 17
/home/.conda/envs/nlpenv/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
[2023-02-19 15:34:40,825 INFO] confusion matrix
[2023-02-19 15:34:40,826 INFO] [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
[2023-02-19 15:34:40,828 INFO] evaluation metric
[2023-02-19 15:34:40,829 INFO] acc: 0.1011
[2023-02-19 15:34:40,830 INFO] precision: 0.0101
[2023-02-19 15:34:40,830 INFO] recall: 0.1000
[2023-02-19 15:34:40,831 INFO] f1: 0.0184
model saved: ./saved_models/fixmatch/latest_model.pth
Epochs 18 through 24 repeat exactly the same output as above (the same UndefinedMetricWarning, the same single-column confusion matrix, and the same metrics: acc 0.1011, precision 0.0101, recall 0.1000, f1 0.0184), differing only in timestamps.

Can someone provide guidance on why no learning occurs? I kept the notebook as-is except for the config updates noted above.

@Hhhhhhao
Contributor

Can you provide more information? Which model are you using? What's the learning rate?

@Hhhhhhao Hhhhhhao added the question Further information is requested label Feb 22, 2023
@SamShowalter
Author

Hi, the beginner notebook is exactly as provided except for the parameters I noted above. So the model is vit_tiny_patch and the learning rate is 5e-4, as provided. The learning rate looks like it could be low to me, but I assumed the provided example was tuned to train well.


ppsmk388 commented Mar 4, 2023

I'm hitting the same problem.

@hldqiuye123

> Hi, the beginner notebook is exactly as provided except for the parameters I noted above. So the model is vit_tiny_patch and the learning rate is 5e-4, as provided. The learning rate looks like it could be low to me, but I assumed the provided example was tuned to train well.

Have you solved it? I have encountered the same problem; even after increasing the learning rate, there is still no change. The model consistently outputs the same results.


wzc32 commented Mar 27, 2023

I've hit the same problem. Do the other notebooks run all right?

@SamShowalter
Author

Any update? Could someone who has it working provide their config file as well as the output performance to serve as a sanity check?


5v3D3 commented Apr 3, 2023

If this issue is still open: from what I can tell, it happens because the train function of the Trainer class only calls train_step (which no longer calls the ParamUpdateHook), so the loss is computed but the parameters are never updated. You could either add backward and optimizer step calls after train_step, or use the train function inherited from AlgorithmBase via algorithm.train() instead (that one runs all the necessary hooks; I needed to send my model to the GPU manually, though).
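The failure mode described above (train_step computes the loss, but the parameter-update hook never fires, so every epoch reports identical metrics) can be illustrated with a minimal, framework-free sketch. Everything below is a toy stand-in for illustration only, not semilearn's actual API:

```python
# Toy illustration of the bug pattern: a training loop that calls
# train_step (computing loss and gradient) but never applies the update,
# so the model parameter, and thus its predictions, never changes.
# This fits the 1-parameter model y_hat = w * x to the target y = 3.

def train_step(w, x, y):
    """Compute squared-error loss and its gradient w.r.t. w."""
    pred = w * x
    loss = (pred - y) ** 2
    grad = 2.0 * (pred - y) * x
    return loss, grad

def broken_fit(w, steps=30):
    # Mirrors the reported bug: train_step runs, but no update is applied.
    for _ in range(steps):
        loss, grad = train_step(w, x=1.0, y=3.0)
    return w  # w is unchanged, so every "epoch" reports the same metrics

def fixed_fit(w, lr=0.1, steps=30):
    # The fix: apply the parameter update after each train_step
    # (analogous to the skipped ParamUpdateHook / optimizer step).
    for _ in range(steps):
        loss, grad = train_step(w, x=1.0, y=3.0)
        w -= lr * grad
    return w

print(broken_fit(0.0))  # stays at 0.0: no learning, as in the logs above
print(fixed_fit(0.0))   # converges toward the target value 3.0
```

The same structure explains the constant 0.1011 accuracy in the logs: with frozen weights, the model keeps predicting one class for everything.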


hijihyo commented Apr 13, 2023

> If this issue is still open: from what I can tell, it happens because the train function of the Trainer class only calls train_step (which no longer calls the ParamUpdateHook). You could either add backward and optimizer step calls after train_step, or use the train function inherited from AlgorithmBase via algorithm.train() instead (that one runs all the necessary hooks; I needed to send my model to the GPU manually, though).

I hit the same problem, and @5v3D3's solution did help. When I used algorithm.train() instead of trainer.fit(), the performance improved. I used the following code:

algorithm.model = algorithm.model.cuda()
algorithm.train()

@SachinLearns

> algorithm.model = algorithm.model.cuda()
> algorithm.train()

The above code works for me, but I'm not sure how to perform the evaluation in this context. I tried:

algorithm.eval()

which raises:

AttributeError: 'FixMatch' object has no attribute 'eval'

Any suggestions?

@Hhhhhhao
Contributor

Fixed in PR #135

Hhhhhhao added a commit that referenced this issue Jul 20, 2023
* [Update] resolve requirements.txt conflicts

* [Fix] Fix mean teacher bug in #102

* [Fix] Fix DebiasPL bug

* [Fix] Fix potential sample data bug in #119

* [Update] Add auto issue/pr closer

* [Update] Update requirements.txt

* [Fix] Fix bug in #74

* [Fix] Fix amp lighting bug in #123

* [Fix] Fix notebook bugs

* [Update] release semilearn 0.3.1
@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Aug 20, 2023
@github-actions

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as not planned Aug 25, 2023