CUDA out of memory #22
This is just trying to train the handwriting recognition network, right?
Thank you very much, I'll try those things and respond to you soon.
Hello again! You're right, the problem was the width of the CVL images. Some of the CVL line bounding boxes were incorrect, and when those lines were passed into the model their width exploded. So I removed all of them from the database and the model worked well.
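For anyone hitting the same issue, here is a minimal sketch of that kind of pre-filter, assuming the line images sit in a flat directory as PNGs; the directory name and width threshold are hypothetical and would need tuning to your normalized line height:

```python
# Hypothetical pre-filter: drop line images whose width exploded because
# of a bad bounding box. Directory name and MAX_WIDTH are placeholders.
from pathlib import Path

from PIL import Image

MAX_WIDTH = 2000  # hypothetical cutoff for "exploded" crops

for path in sorted(Path('cvl_lines').glob('*.png')):
    with Image.open(path) as img:
        width, height = img.size
    if width > MAX_WIDTH:
        print(f'dropping {path.name}: {width}x{height}')
        path.unlink()  # remove the badly-cropped line from the dataset
```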
The charset file defines the characters the model can see and produce. Be sure each model component gets the same charset file.
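As an illustration, here is a hedged sketch of building such a charset mapping; the exact key layout of the repo's charset.json may differ, but the important property (discussed below) is that character indices start at 1, leaving 0 for the CTC blank:

```python
# Hypothetical charset builder. Indices start at 1 so that 0 stays
# reserved for the CTC blank symbol; the repo's actual charset.json
# layout may differ from this flat char-to-index mapping.
import json

alphabet = sorted(set("abcdefghijklmnopqrstuvwxyz0123456789 .,'-"))
charset = {char: idx + 1 for idx, char in enumerate(alphabet)}

with open('charset.json', 'w', encoding='utf-8') as f:
    json.dump(charset, f, ensure_ascii=False, indent=2)
```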
Thank you so much, Sir.
Sorry for disturbing you again, sir. I have now finished training the model, but I don't know how to calculate FID scores as in your paper. Could you give me some instructions, please? Thanks for your help.
I modified someone's FID code to resize handwriting images properly: https://github.com/herobd/pytorch-fid-handwriting You'll need to generate a bunch of handwriting images (which I think was the point of the …
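For reference, a sketch of how a pytorch-fid-style score is typically computed between two image folders. This assumes the fork keeps upstream pytorch-fid's Python entry point, which may not match the linked repo exactly; the directory names are placeholders:

```python
# Hypothetical FID computation between folders of real and generated
# handwriting, using upstream pytorch-fid's API. The linked fork adds
# handwriting-appropriate resizing, so its entry point may differ.
from pytorch_fid.fid_score import calculate_fid_given_paths

fid = calculate_fid_given_paths(
    ['real_lines/', 'generated_lines/'],  # hypothetical image directories
    batch_size=50,
    device='cuda',
    dims=2048,  # InceptionV3 pool3 features, the standard choice
)
print(f'FID: {fid:.2f}')
```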
Thank you very much, Sir.
Hello again, I've changed the charset.json file to the characters of my language's handwritten text database. Then I started the recognizer training process, but every loss was NaN at the first iteration and the process stopped. The encoder training had the same result as the recognizer. Could you give me some advice, please? Thank you in advance.
Hmm, double check the charset.json to be sure it looks like mine (especially starting at an index of 1 instead of 0). If nothing's wrong, I would print the inputs to the CTC loss call and be sure they look fine. You could even run my original setup to get an example of what the inputs should look like. Also, this is unlikely, but if the predictions given to the CTC loss are shorter than the target, that causes problems (CTC can't be computed properly, as it assumes a longer input).
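A small sanity check along those lines, assuming the tensor shapes that torch.nn.CTCLoss expects (this is an illustrative helper, not code from the repo):

```python
# Hypothetical pre-CTC sanity check: CTC assumes the (downsampled)
# prediction sequence is at least as long as the target, and any
# NaN/inf in the log-probabilities makes the loss NaN immediately.
import torch

def check_ctc_inputs(log_probs, input_lengths, target_lengths):
    # log_probs: (T, batch, num_classes), lengths: 1-D int tensors,
    # as torch.nn.CTCLoss expects
    if torch.isnan(log_probs).any() or torch.isinf(log_probs).any():
        print('WARNING: NaN/inf in predictions fed to CTC')
    too_short = (input_lengths < target_lengths).nonzero(as_tuple=True)[0]
    if len(too_short) > 0:
        print('targets longer than inputs at batch indices:',
              too_short.tolist())
```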
This is my charset.json: { …
I went to CTCloss in model/loss.py and printed input_length and target_length, and this is the result: …
And some of the inputs are NaN.
I ran debug mode in PyCharm on the recognizer training and found that the problem is the predicted label. After prediction and conversion with label2string_single, the output character list only contains English characters, even though we're using a Latin-script language database. I think the model for the IAM dataset doesn't have the capability to learn our Latin characters, so we're considering the model for the RIMES dataset instead, because our language has some French characters. Another option is to add layers to make the model deeper. Could you give me some advice? Thank you in advance.
Input length is the image width (after downsampling); target length is the length of the target string. The model predicts something for each image unit (after the network downsamples it), which should be more predictions than the target string has characters (as each written character is multiple image units long). The capacity of the network won't be causing the NaNs. Trace those NaNs back to where they originate. The models for the IAM and RIMES datasets are identical with the exception of the output classes. Be sure that num_class in the config matches your charset.
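One hedged way to do that tracing is with forward hooks; this is an illustrative helper, not part of the repo:

```python
# Hypothetical NaN tracer: register a forward hook on every submodule
# and report the modules whose outputs contain NaN, which narrows down
# where the NaNs originate in the network.
import torch

def add_nan_hooks(model):
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            print('NaN in output of:', module.__class__.__name__, module)
    for submodule in model.modules():
        submodule.register_forward_hook(hook)
```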
Oh, I didn't notice the num_class in the config. I changed num_class to 216 and the model ran smoothly. I really appreciate it, thank you.
Hello, we have to disturb you again. We've trained both the recognizer and the encoder and they're OK, but when we train the generator, at 40k iterations the recon_sample and recon_gt_mask images are all blank; only recon_gt has text. Could you give us some advice on this problem? Thank you very much.
You're referring to the output images? You're sure it's writing new images? I've never seen it generate blank images; generally at the beginning of training it generates weird, blurry images.
[Screenshot "Capture1": https://user-images.githubusercontent.com/91046245/145511900-eaa63e97-0e9b-4dfd-a284-5ecce2644c84. (URL truncated)]
This is our text file.
What do the losses look like? You can use graph.py to plot them.
Also, do you have any of the generated ("samples") images from the beginning of training? |
Maybe turn down the learning rate? It would be good to see what the loss is doing. |
This is the result after running `python graph.py -c path/to/latest/snapshot.pth`:
loaded iteration 40750
Can you give me advice about this? Thank you so much.
The losses seem more chaotic than what the IAM model had (you can use graph.py on the pretrained IAM snapshot to compare).
OK, so it is generating good things at the beginning. Try dropping the main optimizer's learning rate but keeping the discriminator's the same. And the reverse.
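A sketch of what dropping only one side's learning rate could look like, assuming separate optimizer objects for the generator side and the discriminator; the names here are hypothetical, since the repo configures this through its JSON config rather than code like this:

```python
# Hypothetical in-place LR drop: scale down one optimizer while leaving
# the other untouched. Optimizer variable names are placeholders.
def drop_lr(optimizer, factor=0.1):
    for group in optimizer.param_groups:
        group['lr'] *= factor  # e.g. 1e-4 -> 1e-5

# drop_lr(main_optimizer)            # try this first...
# drop_lr(discriminator_optimizer)   # ...or the reverse, as suggested above
```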
Hi, thanks a lot for your efforts. It's great work. I was trying to train your model with the CVL dataset on Google Colab Pro. I converted the images from .tif to .png and fed them to the model, but after 200 iterations a CUDA out of memory error appears. This is the output:
```
NumExpr defaulting to 4 threads.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 6 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 6 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
model style_extrator
Trainable parameters: 10136912
HWWithStyle(
  (hwr): CNNOnlyHWR(
    (cnn): Sequential(
      (conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (relu0): ReLU(inplace=True)
      (pooling0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (relu1): ReLU(inplace=True)
      (pooling1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (conv2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (batchnorm2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu2): ReLU(inplace=True)
      (conv3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (relu3): ReLU(inplace=True)
      (pooling2): MaxPool2d(kernel_size=(2, 2), stride=(2, 1), padding=(0, 1), dilation=1, ceil_mode=False)
      (conv4): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (batchnorm4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu4): ReLU(inplace=True)
      (conv5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
      (relu5): ReLU(inplace=True)
      (pooling3): MaxPool2d(kernel_size=(2, 2), stride=(2, 1), padding=(0, 1), dilation=1, ceil_mode=False)
      (conv6): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
      (batchnorm6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu6): ReLU(inplace=True)
    )
    (cnn1d): Sequential(
      (0): Conv1d(512, 512, kernel_size=(3,), stride=(1,), padding=(2,), dilation=(2,))
      (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
      (3): Conv1d(512, 512, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(4,))
      (4): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU(inplace=True)
      (6): Conv1d(512, 512, kernel_size=(3,), stride=(1,))
      (7): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU(inplace=True)
      (9): Conv1d(512, 512, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(8,))
      (10): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): ReLU(inplace=True)
      (12): Conv1d(512, 80, kernel_size=(3,), stride=(1,))
      (13): LogSoftmax(dim=1)
    )
  )
)
Begin training
WARNING: upsampling image to fit size
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
WARNING: upsampling image to fit size
WARNING: upsampling image to fit size
WARNING: upsampling image to fit size
WARNING: upsampling image to fit size
WARNING: upsampling image to fit size
Train iteration: 100, loss: 3.290166, recogLoss: 3.290166, CER: 1.000000, WER: 1.000000,
Train iteration: 200, loss: 2.875033, recogLoss: 2.875033, CER: 1.000000, WER: 1.000000, sec_per_iter: 0.623328, avg_loss: 3.082350, avg_recogLoss: 3.082350, avg_CER: 1.000000, avg_WER: 1.000000,
Traceback (most recent call last):
  File "train.py", line 133, in <module>
    main(config, args.resume)
  File "train.py", line 79, in main
    trainer.train()
  File "/content/drive/.shortcut-targets-by-id/1gLhWu0Me1satHwX83jnDJLd9nHAl9Mp6/mockproject/paper1/base/base_trainer.py", line 219, in train
    result = self._train_iteration(self.iteration)
  File "/content/drive/.shortcut-targets-by-id/1gLhWu0Me1satHwX83jnDJLd9nHAl9Mp6/mockproject/paper1/trainer/hw_with_style_trainer.py", line 378, in _train_iteration
    pred, recon, losses = self.run(instance)
  File "/content/drive/.shortcut-targets-by-id/1gLhWu0Me1satHwX83jnDJLd9nHAl9Mp6/mockproject/paper1/trainer/hw_with_style_trainer.py", line 736, in run
    pred = self.model.hwr(image, style)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/.shortcut-targets-by-id/1gLhWu0Me1satHwX83jnDJLd9nHAl9Mp6/mockproject/paper1/model/cnn_only_hwr.py", line 131, in forward
    conv = self.cnn(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 443, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 440, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 3.33 GiB (GPU 0; 15.90 GiB total capacity; 11.91 GiB already allocated; 3.02 GiB free; 11.97 GiB reserved in total by PyTorch)
```
I didn't change anything in your architecture or loss functions.
Could you advise me on how to fix this issue?
Thank you again.
Sorry for my bad English.
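A hedged probe that could be dropped into the training loop to see whether an abnormally wide batch triggers the allocation (the helper and its placement are hypothetical, not part of the repo):

```python
# Hypothetical probe: log the input shape and current GPU memory before
# the forward pass; a sudden jump in width points at a bad line crop.
import torch

def log_batch(image):
    print('batch shape:', tuple(image.shape),
          '| allocated: %.2f GiB' % (torch.cuda.memory_allocated() / 2**30))
```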