
Reduce search space for AllenNLP example. #1542

Merged
merged 9 commits into from
Jul 31, 2020
Conversation

himkt
Member

@himkt himkt commented Jul 21, 2020

Motivation

The current allennlp_simple.py may be computationally expensive, and it sometimes causes a timeout.
https://github.com/optuna/optuna/actions/runs/175854473

Description of the changes

In this PR, I reduce the hyperparameter search space for the model.
Additionally, I extracted the dataset reader from the script into its own module because I'll also use it in allennlp_jsonnet.py.

I tested this new example on GitHub Actions to confirm that it reduced the execution time for each trial.
https://github.com/himkt/optuna/runs/894041267

optimizer = torch.optim.SGD(model.parameters(), lr=lr)

 data_loader = torch.utils.data.DataLoader(
-    train_dataset, batch_size=64, collate_fn=allennlp.data.allennlp_collate
+    train_dataset, batch_size=16, collate_fn=allennlp.data.allennlp_collate
Member Author

I changed the batch size, as I found that a smaller batch size improved accuracy.

@@ -105,11 +100,11 @@ def objective(trial):
if DEVICE > -1:
model.to(torch.device("cuda:{}".format(DEVICE)))

-lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
+lr = trial.suggest_float("lr", 1e-2, 1e-1, log=True)
Member Author

I increased the lower bound of learning rate, as I found that model training didn't succeed at all.
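For context, `log=True` makes `suggest_float` sample uniformly in log space, so raising the lower bound simply cuts off the region where training failed without biasing the rest of the range. A minimal standard-library sketch of log-uniform sampling (illustrative only; the `log_uniform` helper is hypothetical, not Optuna's sampler internals):

```python
import math
import random

def log_uniform(low, high, rng=random):
    # Sample uniformly in log space, the behavior of
    # trial.suggest_float(..., log=True).
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
samples = [log_uniform(1e-3, 1e-1, rng) for _ in range(10_000)]
assert all(1e-3 <= s <= 1e-1 for s in samples)

# Roughly half the draws fall below the geometric midpoint 1e-2,
# i.e. each decade of the range gets equal probability mass.
below = sum(s < 1e-2 for s in samples)
print(f"{below / len(samples):.3f}")  # close to 0.5
```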

Member Author

74c8a23 1e-2 -> 1e-3

Member

Sounds good 👍

@@ -123,7 +118,7 @@ def objective(trial):
validation_data_loader=validation_data_loader,
validation_metric="+" + TARGET_METRIC,
patience=None, # `patience=None` since it could conflict with AllenNLPPruningCallback
-            num_epochs=50,
+            num_epochs=30,
Member Author

I decreased the number of epochs to reduce the execution time of each trial.

examples/allennlp/allennlp_simple.py (resolved)
@codecov-commenter

Codecov Report

Merging #1542 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #1542   +/-   ##
=======================================
  Coverage   89.15%   89.15%           
=======================================
  Files         104      104           
  Lines        7891     7891           
=======================================
  Hits         7035     7035           
  Misses        856      856           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4f2fa9c...b35111f.

@toshihikoyanase toshihikoyanase self-assigned this Jul 22, 2020
Member

@toshihikoyanase toshihikoyanase left a comment

Thank you for your PR. I confirmed that the trials finished in ten minutes, and it basically looks good to me except for a typo in the class name.



@DatasetReader.register("subsample")
class SubsampledDatasetReader(allennlp.data.dataset_readers.TextClassificationJsonReader):
Member
This class is called SubsampleDatasetReader in allennlp_simple.py. Could you tell me which is correct?

Suggested change
-class SubsampledDatasetReader(allennlp.data.dataset_readers.TextClassificationJsonReader):
+class SubsampleDatasetReader(allennlp.data.dataset_readers.TextClassificationJsonReader):

Member Author

Thank you for the review.
Initially, I created SubSampledDataset, which describes a reader for a subsampled dataset.
But I renamed it to SubsampleDataset since I can describe this class as a sub-sample dataset.
(And I thought it would be better to use only present-tense words in the filenames if possible.)

I'm not confident about this convention, so I'd like to get your feedback.
Do you think SubsampledDatasetReader is better?

Member

I think either one is OK, because I found use cases of both.

I prefer SubsampleDatasetReader for its simplicity, but it is not a strong opinion.

Member Author

@himkt himkt Jul 22, 2020

OK, so let me leave the dataset reader name as is.

islice ... It can be useful if the dataset is too large to fit in memory.
Thank you for pointing that out, you're right!
I gave islice a try in 4437f48 and found that the dataset is quite balanced.

train: Counter({'0': 1003, '1': 997})
validation: Counter({'1': 489, '0': 511})

I also ran HPO on a GPU to confirm that training went well.

python allennlp_simple.py
....
[I 2020-07-22 13:38:58,896] Trial 21 finished with value: 0.704 and parameters: {'dropout': 0.18415084463564746, 'embedding_dim': 33, 'output_dim': 69, 'max_filter_size': 4, 'num_filters': 55, 'lr': 0.09093981266227305}. Best is trial 12 with value: 0.725.
Number of finished trials:  22
Best trial:
  Value:  0.725
  Params:
    dropout: 0.31246648830863727
    embedding_dim: 22
    output_dim: 98
    max_filter_size: 4
    num_filters: 38
    lr: 0.09838055114180111

So I decided to remove train_test_split and simply use islice.
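The islice approach above can be sketched without AllenNLP; this is a toy stand-in (the `subsample` helper and the alternating label stream are hypothetical, chosen so the balance check mirrors the Counter output reported earlier):

```python
from collections import Counter
from itertools import islice

def subsample(instances, count):
    # Take only the first `count` instances, as with itertools.islice in
    # the PR; works on any iterable, so the full dataset never needs to
    # fit in memory — no train_test_split pass over all examples.
    return list(islice(instances, count))

# Toy stand-in for the dataset reader's label stream (hypothetical data:
# labels alternate, so the subsample is exactly balanced).
labels = ("0" if i % 2 == 0 else "1" for i in range(100_000))
train = subsample(labels, 2000)

print(Counter(train))  # exactly balanced for this toy stream
```

On the real IMDB stream the counts are only roughly balanced, matching the `Counter` numbers reported above.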

Member

@HideakiImamura HideakiImamura left a comment

Thanks for the PR! The changes look basically good to me. I have a minor comment.

examples/allennlp/subsample_dataset_reader.py (resolved)
Member

@HideakiImamura HideakiImamura left a comment

LGTM!

Member

@toshihikoyanase toshihikoyanase left a comment

LGTM! Thank you for your contribution.

@toshihikoyanase toshihikoyanase added this to the v2.1.0 milestone Jul 31, 2020
@toshihikoyanase toshihikoyanase merged commit 02651af into optuna:master Jul 31, 2020
@himkt himkt deleted the patch/reduce-allennlp-simple-search-space branch July 31, 2020 11:28
4 participants