Add allennlp example. #949

Merged: 30 commits, merged on Mar 18, 2020.
Changes shown below are from 18 commits.

Commits (30)
435aa2d  Add simple example of allennlp (Feb 20, 2020)
2982f8d  Implement simple example to use allennlp (Feb 21, 2020)
afeb9d3  Add search space used in allentune example (Feb 21, 2020)
d08e2a6  Add create_model to separate defining model from objective (Feb 22, 2020)
85f8586  Use CPU by default (Feb 22, 2020)
cd7d3e8  Remove test_dataset (it is not used) (Feb 22, 2020)
6fbea99  Update README (Feb 23, 2020)
764bcdb  Merge branch 'allennlp-example' of github.com:himkt/optuna into allen… (Feb 23, 2020)
e72dc3d  Update order of import (Feb 26, 2020)
782ac6a  Update examples/allennlp_simple.py (himkt, Feb 26, 2020)
0f9b3f4  Update examples/allennlp_simple.py (himkt, Feb 26, 2020)
f97cfe4  Update examples/allennlp_simple.py (himkt, Feb 26, 2020)
07a2437  Update examples/allennlp_simple.py (himkt, Feb 26, 2020)
fe08bad  Merge branch 'allennlp-example' of github.com:himkt/optuna into allen… (Feb 26, 2020)
037d12a  Reduce data size (Feb 26, 2020)
0db55bc  Adjust experimental settings for CI (Feb 26, 2020)
095358d  Tune num_epoch for computational time (Feb 26, 2020)
ec1e8ea  Update setup.py (Feb 26, 2020)
fadc5cd  Apply suggestions from code review (himkt, Mar 2, 2020)
c94080f  Ignore allennlp if Python version is 3.5 or 3.8 (Mar 2, 2020)
86e6891  Apply suggestions from code review (himkt, Mar 2, 2020)
0113b9c  Apply suggestions from code review (himkt, Mar 2, 2020)
0819437  Merge branch 'allennlp-example' of github.com:himkt/optuna into allen… (Mar 2, 2020)
d11af78  Remove comment (Mar 3, 2020)
e534b92  Use trial.number to create unique serialization_dir (Mar 3, 2020)
8f62e22  Move prepare_data inside objective for optuna cli (Mar 12, 2020)
4277a9a  Add description for allennlp example (Mar 12, 2020)
ef99a13  Merge branch 'master' into allennlp-example (Mar 17, 2020)
108d746  Apply black (Mar 17, 2020)
1e59f0e  Apply feedback (Mar 18, 2020)
Files changed (3)
README.md (1 addition, 0 deletions)
@@ -89,6 +89,7 @@ study.optimize(objective, n_trials=100) # Invoke optimization of the objective
* [PyTorch Ignite](./examples/pytorch_ignite_simple.py)
* [PyTorch Lightning](./examples/pytorch_lightning_simple.py)
* [FastAI](./examples/fastai_simple.py)
* [AllenNLP](./examples/allennlp_simple.py)

## Installation

examples/allennlp_simple.py (127 additions, 0 deletions)
@@ -0,0 +1,127 @@
import uuid

import allennlp
import allennlp.data
import allennlp.models
import allennlp.modules
import optuna
import torch


DEVICE = -1 # If you want to use GPU, use DEVICE = 0

# Run tuning with small portion of data
# to reduce computational time.
# https://github.com/optuna/optuna/pull/949#pullrequestreview-364110499
Review comment (Member): IMO, we can remove this link, judging from past PRs. Maybe it would be helpful if we shared code conventions like this in a design document.

MAX_DATA_SIZE = 3000

GLOVE_FILE_PATH = (
    'https://s3-us-west-2.amazonaws.com/'
    'allennlp/datasets/glove/glove.6B.50d.txt.gz'
)


def prepare_data():
    glove_indexer = allennlp.data.token_indexers.SingleIdTokenIndexer(
        lowercase_tokens=True
    )
    tokenizer = allennlp.data.tokenizers.WordTokenizer(
        word_splitter=allennlp.data.tokenizers.word_splitter.JustSpacesWordSplitter(),
    )

    reader = allennlp.data.dataset_readers.TextClassificationJsonReader(
        token_indexers={'tokens': glove_indexer},
        tokenizer=tokenizer,
    )
    train_dataset = reader.read(
        'https://s3-us-west-2.amazonaws.com/allennlp/datasets/imdb/train.jsonl'
    )
    train_dataset = train_dataset[:MAX_DATA_SIZE]

    valid_dataset = reader.read(
        'https://s3-us-west-2.amazonaws.com/allennlp/datasets/imdb/dev.jsonl'
    )
    valid_dataset = valid_dataset[:MAX_DATA_SIZE]

    vocab = allennlp.data.Vocabulary.from_instances(train_dataset)
    return train_dataset, valid_dataset, vocab


def create_model(vocab, trial: optuna.Trial):
    embedding = allennlp.modules.Embedding(
        embedding_dim=50,
        trainable=True,
        pretrained_file=GLOVE_FILE_PATH,
        num_embeddings=vocab.get_vocab_size('tokens'),
    )

    embedder = allennlp.modules.text_field_embedders.BasicTextFieldEmbedder(
        {'tokens': embedding}
    )

    output_dim = trial.suggest_int('output_dim', 80, 120)
    max_filter_size = trial.suggest_int('max_filter_size', 3, 6)
    num_filters = trial.suggest_int('num_filters', 64, 256)
    encoder = allennlp.modules.seq2vec_encoders.CnnEncoder(
        ngram_filter_sizes=range(1, max_filter_size),
        num_filters=num_filters,
        embedding_dim=50,
        output_dim=output_dim,
    )

    dropout = trial.suggest_uniform('dropout', 0, 0.5)
    model = allennlp.models.BasicClassifier(
        text_field_embedder=embedder,
        seq2vec_encoder=encoder,
        dropout=dropout,
        vocab=vocab,
    )

    print(output_dim, max_filter_size, num_filters, dropout)
    return model


def objective(trial: optuna.Trial):
    model = create_model(vocab, trial)

    if DEVICE > -1:
        print(f'send model to GPU #{DEVICE}')
        model.cuda(DEVICE)
Review comment (Contributor): Nit: model.to(torch.device('cuda:{}'.format(DEVICE))). A runnable sketch of this idiom is included after the file diff below.


    lr = trial.suggest_loguniform('lr', 1e-1, 1e0)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    iterator = allennlp.data.iterators.BasicIterator(
        batch_size=10,
    )
    iterator.index_with(vocab)

    trainer = allennlp.training.Trainer(
        model=model,
        optimizer=optimizer,
        iterator=iterator,
        train_dataset=train_dataset,
        validation_dataset=valid_dataset,
        patience=3,
        num_epochs=6,
        cuda_device=DEVICE,
        serialization_dir=f'/tmp/xx/{uuid.uuid1()}',
Review comment (Member): It would be difficult for users to find the model corresponding to a specific trial, partly because the directory names are generated from a uuid and partly because the directory is not displayed in log messages. What do you think about removing it for simplicity? If you remove it, please delete import uuid too.
Suggested change: remove the line serialization_dir=f'/tmp/xx/{uuid.uuid1()}',
Reply (himkt, Author, Mar 3, 2020): Based on pytorch_lightning_simple.py, I revised the PR to use trial.number for creating the model serialization directory (commit e534b92). A sketch of that approach is included after the file diff below.

    )
    metrics = trainer.train()
    return metrics['best_validation_accuracy']


if __name__ == '__main__':
    train_dataset, valid_dataset, vocab = prepare_data()

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=2)

    print('Number of finished trials: ', len(study.trials))
    print('Best trial:')
    trial = study.best_trial

    print(' Value: ', trial.value)
    print(' Params: ')
    for key, value in trial.params.items():
        print(' {}: {}'.format(key, value))
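The device-handling nit above prefers torch.device over model.cuda. Below is a minimal, self-contained sketch of that idiom, using a toy torch.nn.Linear model as a stand-in for the AllenNLP classifier; the example as merged keeps model.cuda(DEVICE).

import torch

DEVICE = -1  # -1 keeps the model on the CPU; 0 or higher selects that GPU index

model = torch.nn.Linear(8, 2)  # toy stand-in for the AllenNLP model
device = torch.device('cuda:{}'.format(DEVICE)) if DEVICE > -1 else torch.device('cpu')
model.to(device)  # equivalent to model.cuda(DEVICE) when a GPU is selected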
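The serialization_dir thread above was resolved by switching to trial.number (commit e534b92). The following is a minimal sketch of that approach, assuming a hypothetical base directory MODEL_DIR and helper make_serialization_dir; neither name is taken from the PR, and the merged code may differ.

import os

MODEL_DIR = '/tmp/allennlp_serialization'  # hypothetical base directory


def make_serialization_dir(trial):
    # One sub-directory per Optuna trial, keyed by trial.number, so that the
    # best trial reported at the end of the study maps back to a saved model.
    path = os.path.join(MODEL_DIR, 'trial_{}'.format(trial.number))
    os.makedirs(path, exist_ok=True)
    return path

Inside objective, the Trainer would then receive serialization_dir=make_serialization_dir(trial) instead of the uuid-based path.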
setup.py (1 addition, 0 deletions)
@@ -73,6 +73,7 @@ def get_extras_require():
            'sphinx_rtd_theme',
        ],
        'example': [
            'allennlp',
Review comment (Member): allennlp does not support Python 3.5 (c.f. https://pypi.org/project/allennlp/). Please exclude it from the installation targets if the Python version is 3.5. In addition, please exclude examples/allennlp from the CI targets if the job is examples-python35. A sketch of such a version guard is included after this diff.

            'catboost',
            'chainer',
            'lightgbm',
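As a rough illustration of the review comment above, a version guard in get_extras_require could look like the sketch below. Only the 'example' extra is shown, and the exact condition in the merged change (commit c94080f ignores allennlp on Python 3.5 and 3.8) may differ.

import sys


def get_extras_require():
    example_deps = [
        'catboost',
        'chainer',
        'lightgbm',
    ]
    # allennlp supported only Python 3.6 and 3.7 at the time of this PR,
    # so skip it on other interpreter versions.
    if (3, 6) <= sys.version_info[:2] <= (3, 7):
        example_deps.append('allennlp')
    return {'example': example_deps}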