
Train with M40 card but got OOM message #8

Closed
chesterkuo opened this issue Mar 28, 2018 · 14 comments

Comments

@chesterkuo

I'm testing this model on an M40 card, which has 24 GB of memory on board.

What default batch size did you use on the 1080 card? TF reports OOM when I increase the batch size to 64.

@chesterkuo chesterkuo changed the title from "try with M40" to "Train with M40 card but got OOM message" Mar 28, 2018
@ghost

ghost commented Mar 28, 2018

Hi, here are the hyperparameters used in the original paper, which I haven't tried due to memory issues.

flags.DEFINE_integer("char_dim", 200, "Embedding dimension for char")
flags.DEFINE_integer("batch_size", 32, "Batch size")
flags.DEFINE_integer("num_steps", 150000, "Number of steps")
flags.DEFINE_integer("hidden", 128, "Hidden size")
flags.DEFINE_integer("num_heads", 8, "Number of heads in self attention")
flags.DEFINE_boolean("q2c", True, "Whether to use query to context attention or not")

Please change those lines in config.py and let us know the results. Thanks for your contribution!
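Assuming config.py registers these settings through TensorFlow's flags (so they can be overridden at run time), the same values could probably also be passed on the command line instead of editing the file; the flag names below just mirror the definitions above and the `--mode test` invocation used later in this thread:

python config.py --mode train --char_dim 200 --batch_size 32 --num_steps 150000 --hidden 128 --num_heads 8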

@chesterkuo
Author

OK, I'm trying a smaller batch size to avoid the OOM.

Also, I'm using the default char_dim from config.py instead of 200 here; I will update with results after 60000 steps.

@chesterkuo
Author

After training for 60000 steps and running the test case, here are the results:

python config.py --mode test
Exact Match: 69.9716177862, F1: 79.4625328804

@chesterkuo
Author

Here is my default config:

flags.DEFINE_integer("char_dim", 64, "Embedding dimension for char")

flags.DEFINE_integer("para_limit", 400, "Limit length for paragraph")
flags.DEFINE_integer("ques_limit", 50, "Limit length for question")
flags.DEFINE_integer("ans_limit", 30, "Limit length for answers")
flags.DEFINE_integer("test_para_limit", 1000, "Limit length for paragraph in test file")
flags.DEFINE_integer("test_ques_limit", 100, "Limit length for question in test file")
flags.DEFINE_integer("char_limit", 16, "Limit length for character")
flags.DEFINE_integer("word_count_limit", -1, "Min count for word")
flags.DEFINE_integer("char_count_limit", -1, "Min count for char")

flags.DEFINE_integer("capacity", 15000, "Batch size of dataset shuffle")
flags.DEFINE_integer("num_threads", 4, "Number of threads in input pipeline")
flags.DEFINE_boolean("is_bucket", False, "build bucket batch iterator or not")
flags.DEFINE_list("bucket_range", [40, 401, 40], "the range of bucket")

flags.DEFINE_integer("batch_size", 32, "Batch size")
flags.DEFINE_integer("num_steps", 60000, "Number of steps")
flags.DEFINE_integer("checkpoint", 1000, "checkpoint to save and evaluate the model")
flags.DEFINE_integer("period", 100, "period to save batch loss")
flags.DEFINE_integer("val_num_batches", 150, "Number of batches to evaluate the model")
flags.DEFINE_float("dropout", 0.1, "Dropout prob across the layers")
flags.DEFINE_float("grad_clip", 5.0, "Global Norm gradient clipping rate")
flags.DEFINE_float("learning_rate", 0.001, "Learning rate")
flags.DEFINE_float("decay", 0.9999, "Exponential moving average decay")
flags.DEFINE_float("l2_norm", 3e-7, "L2 norm scale")
flags.DEFINE_integer("hidden", 128, "Hidden size")
flags.DEFINE_integer("num_heads", 8, "Number of heads in self attention")
flags.DEFINE_boolean("q2c", True, "Whether to use query to context attention or not")

@ghost

ghost commented Mar 29, 2018

@chesterkuo nice! Could you share with us the training curve from TensorBoard? It could possibly keep improving until 150k steps, just like the paper's results.

@chesterkuo
Author

[TensorBoard screenshot: training loss curve]

@ghost

ghost commented Mar 29, 2018

It seems the model is overfitting. We need more dropouts and better regularization.
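One common way to strengthen regularization in a TF 1.x setup like this is an explicit L2 penalty over the trainable weights, scaled by the l2_norm flag already present in config.py. A rough sketch, not this repo's actual code (add_l2_penalty is a hypothetical helper):

import tensorflow as tf

def add_l2_penalty(data_loss, l2_scale=3e-7):
    # Hypothetical helper: add an L2 penalty over all trainable variables,
    # scaled by the l2_norm flag value from config.py (3e-7 by default).
    l2_penalty = l2_scale * tf.add_n(
        [tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    return data_loss + l2_penalty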

@chesterkuo
Author

Hi @minsangkim142,

Did you see a similar issue in your training environment?

@ghost

ghost commented Mar 30, 2018

Overfitting does occur even with hidden size = 96, but it is not as bad and the dev loss stays quite low (around 3.1). There are a few possible reasons why our model performs lower than the original paper (by about 2~3%):

  1. The dropouts are not placed in the right places. From the first author: "The dropout is applied between every two sub-layers, and also between every two blocks. I would say we applied dropout whenever there is a new layer."
  2. The dropout rate is too low: the original paper suggests 0.1, but I find this too low at times. Increasing the dropout to 0.15 or 0.2 and training longer might help.
  3. The model architecture is different and we are missing an important feature that helps regularization: this may be hard to pin down, but some architectural choices regularize better, such as depthwise-separable convolution vs. normal convolution (see the sketch after this list).
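To make points 1 and 3 concrete, here is a minimal TF 1.x sketch with hypothetical names (conv_block is not this repo's actual layer code): dropout applied right after each new sub-layer inside a residual connection, using a depthwise-separable convolution in place of a normal one.

import tensorflow as tf

def conv_block(x, filters, kernel_size, dropout, is_train, scope):
    # Rough sketch: layer-norm -> depthwise-separable conv -> dropout,
    # wrapped in a residual connection ("dropout whenever there is a new layer").
    # filters should match x's last dimension so the residual add works.
    with tf.variable_scope(scope):
        y = tf.contrib.layers.layer_norm(x)
        y = tf.layers.separable_conv1d(y, filters, kernel_size,
                                       padding="same", activation=tf.nn.relu)
        y = tf.layers.dropout(y, rate=dropout, training=is_train)
        return x + y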

@ghost ghost closed this as completed Mar 30, 2018
@ghost ghost reopened this Mar 30, 2018
@chesterkuo
Author

After 150000 steps it seems to overfit as well.

Exact Match: 69.0066225166, F1: 78.594759759

@chesterkuo
Author

chesterkuo commented Apr 2, 2018

Hi @minsangkim142

I changed the dropout to 0.2 in config.py as follows; after running for 160000 steps, here are the eval results.

flags.DEFINE_float("dropout", 0.2, "Dropout prob across the layers")

{"f1": 79.11957637723766, "exact_match": 69.49858088930937}

@ghost

ghost commented Apr 4, 2018

Hi @chesterkuo, thanks for sharing your results. Would it be possible to share your TensorBoard plot?

I'm trying different ways of applying dropout to the network to reduce the overfitting even further, and will push new commits soon. The goal is to achieve the best EM/F1 performance of all the open-source repositories on GitHub.

@chesterkuo
Author

[TensorBoard screenshots after 200000 steps]

@ghost ghost added help wanted and removed help wanted labels Apr 26, 2018
@ghost

ghost commented Apr 26, 2018

Made a new issue for this. #13

@ghost ghost closed this as completed Apr 26, 2018