
Commit cbad305
[docs] The use of do_lower_case in scripts is on its way to deprecation (#3738)
julien-c committed Apr 10, 2020
1 parent b169ac9 commit cbad305
Showing 4 changed files with 4 additions and 20 deletions.
README.md (3 changes: 0 additions & 3 deletions)
@@ -337,7 +337,6 @@ python ./examples/run_glue.py \
   --task_name $TASK_NAME \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/$TASK_NAME \
   --max_seq_length 128 \
   --per_gpu_eval_batch_size=8 \
@@ -391,7 +390,6 @@ python -m torch.distributed.launch --nproc_per_node 8 ./examples/run_glue.py \
   --task_name MRPC \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/MRPC/ \
   --max_seq_length 128 \
   --per_gpu_eval_batch_size=8 \
@@ -424,7 +422,6 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
   --model_name_or_path bert-large-uncased-whole-word-masking \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --train_file $SQUAD_DIR/train-v1.1.json \
   --predict_file $SQUAD_DIR/dev-v1.1.json \
   --learning_rate 3e-5 \
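Note: the flag is dropped from these commands because an uncased checkpoint's saved tokenizer configuration already encodes the lowercasing behaviour. A minimal sketch of what the scripts rely on instead (the checkpoint name is only an illustrative example):

    from transformers import AutoTokenizer

    # The uncased checkpoint ships with do_lower_case=True in its saved
    # tokenizer configuration, so no command-line flag is needed.
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    print(tokenizer.tokenize('Hello World'))  # expected: ['hello', 'world']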
docs/source/serialization.rst (8 changes: 4 additions & 4 deletions)
@@ -58,14 +58,14 @@ where

``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.

- When using an ``uncased model``\ , make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to FullTokenizer if you're using your own script and loading the tokenizer your-self.).
+ When using an ``uncased model``\ , make sure your tokenizer has ``do_lower_case=True`` (either in its configuration, or passed as an additional parameter).

Examples:

.. code-block:: python

     # BERT
-    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
+    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=True)
     model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

     # OpenAI GPT
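Note: after this change, lowercasing is driven by the pretrained tokenizer's configuration rather than an explicit argument. A minimal sketch of the two options the updated docs describe (checkpoint name is illustrative):

    from transformers import BertTokenizer

    # Configuration-driven: the uncased checkpoint already enables
    # lowercasing (and accent stripping); nothing extra to pass.
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Explicit parameter: still accepted as an override if needed.
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)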
@@ -140,13 +140,13 @@ Here is the recommended way of saving the model, configuration and vocabulary to
     torch.save(model_to_save.state_dict(), output_model_file)
     model_to_save.config.to_json_file(output_config_file)
-    tokenizer.save_vocabulary(output_dir)
+    tokenizer.save_pretrained(output_dir)

     # Step 2: Re-load the saved model and vocabulary

     # Example for a Bert model
     model = BertForQuestionAnswering.from_pretrained(output_dir)
-    tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case)  # Add specific options if needed
+    tokenizer = BertTokenizer.from_pretrained(output_dir)  # Add specific options if needed

     # Example for a GPT model
     model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
     tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
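Note: the switch from save_vocabulary() to save_pretrained() matters here. save_vocabulary() wrote only the vocabulary files, while save_pretrained() also writes the tokenizer configuration (do_lower_case included), which is presumably why reloading no longer needs the argument. A minimal end-to-end sketch of the round trip, assuming a local output directory ('./my_finetuned_model' is hypothetical):

    import os
    from transformers import BertForQuestionAnswering, BertTokenizer

    output_dir = './my_finetuned_model'  # hypothetical path
    os.makedirs(output_dir, exist_ok=True)

    model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # save_pretrained() writes the weights/vocabulary plus the
    # corresponding configuration files into output_dir.
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    # The saved configuration is picked up automatically on reload, so
    # do_lower_case does not have to be passed by hand.
    model = BertForQuestionAnswering.from_pretrained(output_dir)
    tokenizer = BertTokenizer.from_pretrained(output_dir)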
examples/README.md (10 changes: 0 additions & 10 deletions)
@@ -168,7 +168,6 @@ python run_glue.py \
   --task_name $TASK_NAME \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/$TASK_NAME \
   --max_seq_length 128 \
   --per_gpu_train_batch_size 32 \
@@ -209,7 +208,6 @@ python run_glue.py \
   --task_name MRPC \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/MRPC/ \
   --max_seq_length 128 \
   --per_gpu_train_batch_size 32 \
@@ -236,7 +234,6 @@ python run_glue.py \
   --task_name MRPC \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/MRPC/ \
   --max_seq_length 128 \
   --per_gpu_train_batch_size 32 \
@@ -261,7 +258,6 @@ python -m torch.distributed.launch \
   --task_name MRPC \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/MRPC/ \
   --max_seq_length 128 \
   --per_gpu_train_batch_size 8 \
@@ -295,7 +291,6 @@ python -m torch.distributed.launch \
   --task_name mnli \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $GLUE_DIR/MNLI/ \
   --max_seq_length 128 \
   --per_gpu_train_batch_size 8 \
@@ -336,7 +331,6 @@ python ./examples/run_multiple_choice.py \
   --model_name_or_path roberta-base \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --data_dir $SWAG_DIR \
   --learning_rate 5e-5 \
   --num_train_epochs 3 \
@@ -382,7 +376,6 @@ python run_squad.py \
   --model_name_or_path bert-base-uncased \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --train_file $SQUAD_DIR/train-v1.1.json \
   --predict_file $SQUAD_DIR/dev-v1.1.json \
   --per_gpu_train_batch_size 12 \
@@ -411,7 +404,6 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
   --model_name_or_path bert-large-uncased-whole-word-masking \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --train_file $SQUAD_DIR/train-v1.1.json \
   --predict_file $SQUAD_DIR/dev-v1.1.json \
   --learning_rate 3e-5 \
@@ -447,7 +439,6 @@ python run_squad.py \
   --model_name_or_path xlnet-large-cased \
   --do_train \
   --do_eval \
-  --do_lower_case \
   --train_file $SQUAD_DIR/train-v1.1.json \
   --predict_file $SQUAD_DIR/dev-v1.1.json \
   --learning_rate 3e-5 \
@@ -597,7 +588,6 @@ python examples/hans/test_hans.py \
   --task_name hans \
   --model_type $MODEL_TYPE \
   --do_eval \
-  --do_lower_case \
   --data_dir $HANS_DIR \
   --model_name_or_path $MODEL_PATH \
   --max_seq_length 128 \
valohai.yaml (3 changes: 0 additions & 3 deletions)
@@ -89,6 +89,3 @@
     description: Run evaluation during training at each logging step.
     type: flag
     default: true
-  - name: do_lower_case
-    description: Set this flag if you are using an uncased model.
-    type: flag
