run_language_modeling.py #4

Closed · jayachaturvedi opened this issue Jul 5, 2021 · 1 comment

@jayachaturvedi

Hello, I am facing issues running the run_language_modeling.py script when following the example for pretraining Bio_ClinicalBERT, using this command:
python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1 --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --mlm --do_train --learning_rate 5e-5 --max_steps 150000 --block_size 128 --save_steps 1000 --per_gpu_train_batch_size 32 --seed 42 --line_by_line --train_data_file mimic_string.txt --umls --config_name config.json --med_document ./voc/vocab_updated.txt

Issue 1 - the script said the tokenizer did not have an attribute called max_len. This was the error:
AttributeError: 'BertTokenizerFast' object has no attribute 'max_len'
Based on advice online, I updated it from tokenizer.max_len to tokenizer.model_max_length, which seems to have resolved this issue; the change is sketched below.
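
For reference, this is roughly the one-line change (a sketch of the idea, not an exact copy of the script; the variable on the left is my paraphrase):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Old attribute, removed in recent transformers releases:
# block_size = tokenizer.max_len
# Replacement that resolved the AttributeError:
block_size = tokenizer.model_max_length
```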

Issue 2 - the current error message I am getting is:
TypeError: __init__() got an unexpected keyword argument 'tui_ids'

While looking for answers to these online, I came across a comment on the Hugging Face transformers issue tracker at huggingface/transformers#8739.
They said: 'It is actually due to #8604, where we removed several deprecated arguments. The run_language_modeling.py script is deprecated in favor of language-modeling/run_{clm, plm, mlm}.py.'

Does this apply to the script for UmlsBERT as well? If so, how can I access the updated script? If not, how can I resolve the tui_ids issue?
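
In case it helps narrow things down, here is a minimal sketch of the kind of workaround I have in mind, assuming the problem is just that tui_ids reaches the stock transformers dataset class, whose __init__ no longer accepts it (the wrapper class name is mine, not from the repo):

```python
from transformers import LineByLineTextDataset

class TuiLineByLineTextDataset(LineByLineTextDataset):
    """Hypothetical wrapper: keep the UMLS-specific tui_ids on the dataset
    object instead of passing it to the transformers __init__, which rejects it."""

    def __init__(self, tokenizer, file_path, block_size, tui_ids=None):
        self.tui_ids = tui_ids  # presumably consumed elsewhere in the script
        super().__init__(tokenizer=tokenizer, file_path=file_path, block_size=block_size)
```

I have not verified this against the UmlsBERT code, so please correct me if the dataset class is defined elsewhere.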

I am running the scripts on Google Colab. This is the complete output I get when I run the command above:

2021-07-05 09:47:55.207129: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
07/05/2021 09:47:57 - WARNING - __main__ - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
07/05/2021 09:47:57 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1/runs/Jul05_09-47-57_d7624bb0fdc5,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=150000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=clinicalBert-v1,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1,
save_on_each_node=False,
save_steps=1000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
/usr/local/lib/python3.7/dist-packages/transformers/models/auto/modeling_auto.py:847: FutureWarning: The class AutoModelWithLMHead is deprecated and will be removed in a future version. Please use AutoModelForCausalLM for causal language models, AutoModelForMaskedLM for masked language models and AutoModelForSeq2SeqLM for encoder-decoder models.
FutureWarning,
Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']

- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Traceback (most recent call last):
  File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 355, in <module>
    main()
  File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 248, in main
    tui_ids=tui_ids) if training_args.do_train else None
  File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 136, in get_dataset
    tui_ids=tui_ids)
TypeError: __init__() got an unexpected keyword argument 'tui_ids'

P.S. I am not an expert programmer, so do let me know if I should provide any further information, as this is the first time I'm submitting an issue.

Thank you.

Best,
Jaya

@jayachaturvedi
Author

Update: I have found the replacement files (run_clm.py, run_mlm.py, run_plm.py). When running run_clm_no_trainer.py (since I'm using MIMIC data to train), I get this error:

ModuleNotFoundError: No module named 'datasets_modules.datasets.mimic_string'

Running on Colab. This is the command:

!python3 'gdrive/My Drive/UmlsBERT-master/language-modeling/run_clm_no_trainer.py' --output_dir 'gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1' --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --learning_rate 5e-5 --block_size 128 --seed 42 --dataset_config_name 'gdrive/My Drive/UmlsBERT-master/language-modeling/config.json' --dataset_name 'gdrive/My Drive/UmlsBERT-master/language-modeling/mimic_string.txt'

Here is the full output:

2021-07-06 10:08:00.087779: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
07/06/2021 10:08:01 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Use FP16 precision: False

Traceback (most recent call last):
  File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_clm_no_trainer.py", line 472, in <module>
    main()
  File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_clm_no_trainer.py", line 241, in main
    raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name)
  File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 838, in load_dataset
    **config_kwargs,
  File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 687, in load_dataset_builder
    builder_cls = import_main_class(module_path, dataset=True)
  File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 91, in import_main_class
    module = importlib.import_module(module_path)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'datasets_modules.datasets.mimic_string'
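
My guess at the cause: load_dataset() is treating the path I passed as --dataset_name as the name of a dataset loading script, whereas mimic_string.txt is just a plain text file. If that is right, I believe the generic "text" builder is the intended route for a local file, along these lines (the split name and path are just my setup):

```python
from datasets import load_dataset

# Load a local plain-text file with the generic "text" builder instead of
# passing its path as a dataset name (which makes datasets look for a
# loading script named mimic_string).
raw_datasets = load_dataset(
    "text",
    data_files={"train": "gdrive/My Drive/UmlsBERT-master/language-modeling/mimic_string.txt"},
)
print(raw_datasets["train"][0])  # e.g. {'text': '...first line of the file...'}
```

On the command line, I think the equivalent would be passing the file via --train_file instead of --dataset_name, but I may be misreading the script.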
