Slow validation during training #80

Closed
kobeyy opened this issue Jan 28, 2020 · 4 comments

kobeyy commented Jan 28, 2020

I've been using the transformer_small.yaml configuration to train a model.
During training, the validate_on_data() method takes 5 times longer than in 'test' mode. I adapted the test-mode code slightly so that it loads all lines from a file and batches them the same way as in training mode.
I can't find a good explanation for this, since I'm using the same config file and the same validation data.

Data set sizes:
train 90727,
valid 926,
test 926

Expected behavior
I would expect it to take more or less the same time, since the metric calculation is only done after that method.

System:

  • Ubuntu 18
  • CPU
  • python 3.7.4

As my knowledge of transformers is rather limited, I was hoping someone had some insight into this.
Thank you for this really nice code base!
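
To make the comparison described above reproducible, one option is to wrap the validation call in both modes with the same wall-clock timer. The following is only a minimal sketch: `timed` and `fake_validate` are hypothetical names introduced for illustration, and `fake_validate` is a stand-in for the real validation call, not part of the code base.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def fake_validate(n_sentences):
    # Hypothetical stand-in for the real validation call.
    return sum(i * i for i in range(n_sentences))


durations = {}
with timed("training-mode validation", durations):
    fake_validate(926)
with timed("test-mode validation", durations):
    fake_validate(926)

for label, seconds in durations.items():
    print(f"{label}: {seconds:.4f}s")
```

Timing both paths with the same helper rules out differences in how the two durations are measured.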

Collaborator

juliakreutzer commented Jan 31, 2020

Hi @kobeyy, thanks so much! That is indeed a weird phenomenon, and I am not sure where it comes from. One difference is that validation during training uses greedy decoding instead of beam search, but that should only speed things up. I'll look into it!
Which version of the code were you running, i.e., what is the latest commit?

Author

kobeyy commented Feb 1, 2020

I've only recently discovered your repository so it was with the following code
commit: a7cff61

Since then I've been doing more tests and discovered something else that could be related. When training on the same dataset for 50 epochs, the time to validate the dev set changes dramatically. In the beginning it takes more than 600s to validate the 926 inputs. After some training it suddenly drops to about 50s for the same inputs.

Does this have something to do with the initialization of the weights? Or is it perhaps a specific property of transformers?

Validation result (greedy) at epoch   1, step      400: bleu:   0.00, loss: 32008.7148, ppl:   4.3102, duration: 671.1562s
Validation result (greedy) at epoch   2, step      800: bleu:   6.26, loss: 15272.6094, ppl:   2.0079, duration: 675.3320s
Validation result (greedy) at epoch   3, step     1200: bleu:   9.61, loss: 11819.4004, ppl:   1.7151, duration: 40.2812s
Validation result (greedy) at epoch   4, step     1600: bleu:  19.98, loss: 8547.9072, ppl:   1.4772, duration: 669.6965s
Validation result (greedy) at epoch   5, step     2000: bleu:  32.51, loss: 5353.7832, ppl:   1.2768, duration: 697.3001s
Validation result (greedy) at epoch   6, step     2400: bleu:  37.37, loss: 4489.4004, ppl:   1.2274, duration: 537.8312s
Validation result (greedy) at epoch   7, step     2800: bleu:  38.12, loss: 4351.2305, ppl:   1.2197, duration: 671.8434s
Validation result (greedy) at epoch   8, step     3200: bleu:  41.90, loss: 3476.2017, ppl:   1.1719, duration: 41.2239s
Validation result (greedy) at epoch   8, step     3600: bleu:  40.06, loss: 3471.8345, ppl:   1.1717, duration: 46.6077s
Validation result (greedy) at epoch   9, step     4000: bleu:  42.01, loss: 2777.8691, ppl:   1.1352, duration: 722.4088s
Validation result (greedy) at epoch  10, step     4400: bleu:  43.41, loss: 3119.1055, ppl:   1.1530, duration: 678.4914s
Validation result (greedy) at epoch  11, step     4800: bleu:  47.62, loss: 2516.2986, ppl:   1.1217, duration: 183.5982s
Validation result (greedy) at epoch  12, step     5200: bleu:  46.87, loss: 2443.6604, ppl:   1.1180, duration: 47.7615s
Validation result (greedy) at epoch  13, step     5600: bleu:  51.19, loss: 2202.4766, ppl:   1.1058, duration: 66.7764s
Validation result (greedy) at epoch  14, step     6000: bleu:  51.08, loss: 2038.4586, ppl:   1.0975, duration: 195.3814s
Validation result (greedy) at epoch  15, step     6400: bleu:  50.86, loss: 2025.7654, ppl:   1.0969, duration: 68.2886s
Validation result (greedy) at epoch  15, step     6800: bleu:  54.32, loss: 2014.4696, ppl:   1.0963, duration: 669.3979s
Validation result (greedy) at epoch  16, step     7200: bleu:  53.02, loss: 2027.4260, ppl:   1.0970, duration: 345.0356s
Validation result (greedy) at epoch  17, step     7600: bleu:  54.21, loss: 1696.3250, ppl:   1.0805, duration: 63.9025s
Validation result (greedy) at epoch  18, step     8000: bleu:  53.67, loss: 1767.0493, ppl:   1.0840, duration: 115.9756s
Validation result (greedy) at epoch  19, step     8400: bleu:  55.62, loss: 1683.1099, ppl:   1.0799, duration: 184.7958s
Validation result (greedy) at epoch  20, step     8800: bleu:  55.62, loss: 1680.7856, ppl:   1.0797, duration: 74.7237s
Validation result (greedy) at epoch  21, step     9200: bleu:  53.13, loss: 1638.7617, ppl:   1.0777, duration: 201.6353s
Validation result (greedy) at epoch  22, step     9600: bleu:  55.51, loss: 1904.4341, ppl:   1.0908, duration: 127.1939s
Validation result (greedy) at epoch  23, step    10000: bleu:  56.80, loss: 1537.6284, ppl:   1.0727, duration: 48.0172s
Validation result (greedy) at epoch  23, step    10400: bleu:  57.24, loss: 1485.4012, ppl:   1.0701, duration: 53.8170s
Validation result (greedy) at epoch  24, step    10800: bleu:  54.43, loss: 1584.8862, ppl:   1.0750, duration: 49.0060s
Validation result (greedy) at epoch  25, step    11200: bleu:  56.70, loss: 1465.6007, ppl:   1.0692, duration: 46.4052s
Validation result (greedy) at epoch  26, step    11600: bleu:  57.45, loss: 1452.8262, ppl:   1.0686, duration: 50.4125s
Validation result (greedy) at epoch  27, step    12000: bleu:  56.70, loss: 1488.7253, ppl:   1.0703, duration: 44.9463s
Validation result (greedy) at epoch  28, step    12400: bleu:  57.88, loss: 1439.3315, ppl:   1.0679, duration: 51.4236s
Validation result (greedy) at epoch  29, step    12800: bleu:  57.45, loss: 1384.6335, ppl:   1.0652, duration: 45.4507s
Validation result (greedy) at epoch  30, step    13200: bleu:  57.67, loss: 1414.4309, ppl:   1.0667, duration: 50.1187s
Validation result (greedy) at epoch  30, step    13600: bleu:  60.04, loss: 1348.5345, ppl:   1.0635, duration: 43.3917s
Validation result (greedy) at epoch  31, step    14000: bleu:  58.64, loss: 1366.7507, ppl:   1.0644, duration: 43.5456s
Validation result (greedy) at epoch  32, step    14400: bleu:  57.78, loss: 1329.0974, ppl:   1.0625, duration: 43.6657s
Validation result (greedy) at epoch  33, step    14800: bleu:  58.75, loss: 1336.3790, ppl:   1.0629, duration: 53.3218s
Validation result (greedy) at epoch  34, step    15200: bleu:  57.78, loss: 1321.9717, ppl:   1.0622, duration: 49.5135s
Validation result (greedy) at epoch  35, step    15600: bleu:  57.88, loss: 1360.4719, ppl:   1.0641, duration: 46.0718s
Validation result (greedy) at epoch  36, step    16000: bleu:  59.50, loss: 1285.9434, ppl:   1.0605, duration: 53.3062s
Validation result (greedy) at epoch  37, step    16400: bleu:  60.26, loss: 1312.4065, ppl:   1.0617, duration: 45.7327s
Validation result (greedy) at epoch  38, step    16800: bleu:  60.15, loss: 1306.4736, ppl:   1.0614, duration: 46.0401s
Validation result (greedy) at epoch  38, step    17200: bleu:  58.96, loss: 1293.1626, ppl:   1.0608, duration: 46.2133s
Validation result (greedy) at epoch  39, step    17600: bleu:  60.48, loss: 1269.6205, ppl:   1.0597, duration: 47.1750s
Validation result (greedy) at epoch  40, step    18000: bleu:  60.37, loss: 1248.1321, ppl:   1.0586, duration: 46.4387s
Validation result (greedy) at epoch  41, step    18400: bleu:  59.83, loss: 1252.2852, ppl:   1.0588, duration: 45.3749s
Validation result (greedy) at epoch  42, step    18800: bleu:  60.15, loss: 1252.2458, ppl:   1.0588, duration: 47.2642s
Validation result (greedy) at epoch  43, step    19200: bleu:  60.15, loss: 1243.7896, ppl:   1.0584, duration: 46.7049s
Validation result (greedy) at epoch  44, step    19600: bleu:  59.07, loss: 1226.6882, ppl:   1.0576, duration: 46.5180s
Validation result (greedy) at epoch  45, step    20000: bleu:  60.15, loss: 1231.0714, ppl:   1.0578, duration: 44.7068s
Validation result (greedy) at epoch  45, step    20400: bleu:  60.91, loss: 1210.8223, ppl:   1.0568, duration: 46.6818s
Validation result (greedy) at epoch  46, step    20800: bleu:  58.96, loss: 1215.4613, ppl:   1.0570, duration: 46.7588s
Validation result (greedy) at epoch  47, step    21200: bleu:  61.23, loss: 1208.6156, ppl:   1.0567, duration: 46.6501s
Validation result (greedy) at epoch  48, step    21600: bleu:  61.23, loss: 1191.9607, ppl:   1.0559, duration: 47.3566s
Validation result (greedy) at epoch  49, step    22000: bleu:  61.12, loss: 1211.2007, ppl:   1.0568, duration: 45.6916s
Validation result (greedy) at epoch  50, step    22400: bleu:  61.56, loss: 1207.3948, ppl:   1.0567, duration: 50.6894s
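
The durations in a log like the one above are easier to compare once they are extracted programmatically. The sketch below parses lines in that format and flags steps whose duration is far above the median. The function names and the threshold `factor=3.0` are assumptions chosen for illustration; the three sample lines are copied from the log.

```python
import re
from statistics import median

# Matches lines in the "Validation result" format shown in the log.
PATTERN = re.compile(
    r"epoch\s+(?P<epoch>\d+), step\s+(?P<step>\d+): "
    r"bleu:\s+(?P<bleu>[\d.]+),.*duration:\s+(?P<duration>[\d.]+)s"
)


def parse_validation_log(lines):
    """Return (step, bleu, duration) tuples for every matching log line."""
    rows = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            rows.append((int(m["step"]), float(m["bleu"]), float(m["duration"])))
    return rows


def slow_steps(rows, factor=3.0):
    """Steps whose duration exceeds `factor` times the median duration."""
    typical = median(d for _, _, d in rows)
    return [step for step, _, d in rows if d > factor * typical]


sample = [
    "Validation result (greedy) at epoch   1, step      400: bleu:   0.00, loss: 32008.7148, ppl:   4.3102, duration: 671.1562s",
    "Validation result (greedy) at epoch   3, step     1200: bleu:   9.61, loss: 11819.4004, ppl:   1.7151, duration: 40.2812s",
    "Validation result (greedy) at epoch  50, step    22400: bleu:  61.56, loss: 1207.3948, ppl:   1.0567, duration: 50.6894s",
]
rows = parse_validation_log(sample)
print(slow_steps(rows))  # prints [400]
```

Running this over the full log file would show which validation steps are outliers, independent of BLEU or loss.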

@juliakreutzer
Collaborator

Hi @kobeyy
Thanks for the additional insights. Could you try again with the latest version? I added code that stops greedy decoding after EOS has been generated, so it should be faster now.
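
For readers unfamiliar with the idea, early stopping on EOS in greedy decoding can be sketched as follows. This is a generic illustration of the technique in plain Python, not the code that was added to the repository. `next_token_fn` and `toy_model` are hypothetical stand-ins for a real model. Without the early exit, the loop always runs for `max_len` steps, and a model that never emits EOS still does.

```python
def greedy_decode(next_token_fn, batch_size, bos_id, eos_id, max_len):
    """Greedy decoding that stops once every sequence has produced EOS.

    `next_token_fn(sequences)` is a hypothetical stand-in for the model: it
    takes the partial sequences and returns one next-token id per sequence.
    """
    sequences = [[bos_id] for _ in range(batch_size)]
    finished = [False] * batch_size
    steps = 0
    for _ in range(max_len):
        next_ids = next_token_fn(sequences)
        steps += 1
        for i, token in enumerate(next_ids):
            if finished[i]:
                continue  # Sequence already ended; ignore further predictions.
            sequences[i].append(token)
            if token == eos_id:
                finished[i] = True
        if all(finished):
            break  # Early exit: no need to keep decoding until max_len.
    return sequences, steps


# Toy "model": sequence i emits token 5 until it holds i + 2 tokens, then EOS (id 2).
def toy_model(sequences):
    return [2 if len(seq) >= i + 2 else 5 for i, seq in enumerate(sequences)]


outputs, steps = greedy_decode(toy_model, batch_size=3, bos_id=1, eos_id=2, max_len=100)
print(outputs, steps)  # prints [[1, 5, 2], [1, 5, 5, 2], [1, 5, 5, 5, 2]] 4
```

In this toy run the batch finishes after 4 steps instead of 100, because the loop ends as soon as the longest sequence has emitted EOS.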

@juliakreutzer
Collaborator

Closing this due to inactivity.
