
Training too slow #2

Closed
ZhuFengdaaa opened this issue Jun 22, 2018 · 2 comments

Comments

@ZhuFengdaaa

My machine has 3 Titan 1080 Ti GPUs and 12 Intel i7 CPU cores, with 65 GB of total memory. However, the program takes more than 5800 s per epoch.
My command is python3 main.py --use_both True --use_vg True --batch_size 128, because a batch size of 256 runs out of memory.

epoch 1, time: 5844.42
        train_loss: 3.32, norm: 4.2468, score: 51.21
gradual warmup lr: 0.0010
epoch 2, time: 5844.72
        train_loss: 3.05, norm: 2.5201, score: 55.44
gradual warmup lr: 0.0014
epoch 3, time: 5839.73
        train_loss: 2.90, norm: 1.7370, score: 58.02
lr: 0.0014
epoch 4, time: 5835.09
        train_loss: 2.75, norm: 1.3749, score: 60.45
lr: 0.0014
epoch 5, time: 5837.11
        train_loss: 2.64, norm: 1.2232, score: 62.33
lr: 0.0014
epoch 6, time: 5829.90
        train_loss: 2.54, norm: 1.1545, score: 63.88
lr: 0.0014
epoch 7, time: 5832.88
        train_loss: 2.46, norm: 1.1238, score: 65.32
lr: 0.0014
epoch 8, time: 5834.77
        train_loss: 2.39, norm: 1.1157, score: 66.59
@jnhwkim
Owner

jnhwkim commented Jun 22, 2018

We used 4 Titan XPs with a batch size of 256 (about 4000 s/epoch), so your 5800 s seems to be within the expected range of running time. For the (possibly) degraded performance when the batch size is changed, please refer to this link.
p.s. Our training log is here. Please refer to it to compare the training curves.

@jnhwkim jnhwkim closed this as completed Jun 22, 2018
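
The log above shows a gradual learning-rate warmup, and the reply points at the usual coupling between batch size and learning rate. A minimal sketch of how such a warmup-plus-linear-scaling scheme is commonly wired up, assuming illustrative values for base_lr, base_batch_size, and warmup_epochs rather than this repository's actual settings:

```python
import torch

# Illustrative sketch only: base_lr, base_batch_size, warmup_epochs and the
# optimizer choice are assumptions, not this repository's actual settings.
base_lr = 0.002          # learning rate tuned for the reference batch size
base_batch_size = 256    # batch size the learning rate was tuned for
warmup_epochs = 4

def lr_for_epoch(epoch, batch_size):
    # Linear scaling heuristic: scale the learning rate with the batch size.
    scaled_lr = base_lr * batch_size / base_batch_size
    if epoch < warmup_epochs:
        # Gradual warmup: ramp linearly up to the target learning rate.
        return scaled_lr * (epoch + 1) / warmup_epochs
    return scaled_lr

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adamax(model.parameters(), lr=base_lr)

for epoch in range(8):
    for group in optimizer.param_groups:
        group['lr'] = lr_for_epoch(epoch, batch_size=128)
    # ... one training epoch would run here ...
```

With batch_size=128 this settles at half the batch-size-256 learning rate after warmup, which is one common way to compensate when the batch size has to be reduced for memory reasons.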
@tchaton

tchaton commented Jan 9, 2019

Hello guys,

I have added some code there to be reviewed.
I am trying to optimize the code using torch.einsum.
However, I am not sure about everything there.
#15

Best,
T.C
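
For readers wondering what a torch.einsum rewrite aims at in general: a chain of unsqueeze/matmul calls computing a bilinear interaction can usually be collapsed into a single einsum call. A minimal, self-contained sketch with illustrative shapes and names, not code taken from this repository or from #15:

```python
import torch

# Illustrative only: a batched bilinear form x^T W y written two ways.
# Shapes: x is (B, Dx), y is (B, Dy), W is (Dx, Dy); the output is (B,).
B, Dx, Dy = 4, 8, 16
x = torch.randn(B, Dx)
y = torch.randn(B, Dy)
W = torch.randn(Dx, Dy)

# Explicit matmul version: (B, 1, Dx) @ (Dx, Dy) @ (B, Dy, 1) -> (B, 1, 1)
out_matmul = (x.unsqueeze(1) @ W @ y.unsqueeze(2)).reshape(B)

# The same contraction as a single einsum call.
out_einsum = torch.einsum('bi,ij,bj->b', x, W, y)

assert torch.allclose(out_matmul, out_einsum, atol=1e-5)
```

Whether the einsum form is actually faster depends on the tensor shapes and the backend; its main benefit is that the contraction pattern is stated explicitly in one place.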
