New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
run_language_modeling.py #4
Comments
Update: I have found the replacement files (run_clm.py, run_mlm.py, run_plm.py). When running the run_clm_no_trainer.py (since I'm using mimic data to train), I get this error: ModuleNotFoundError: No module named 'datasets_modules.datasets.mimic_string' Running on colab - This is the code: !python3 'gdrive/My Drive/UmlsBERT-master/language-modeling/run_clm_no_trainer.py' --output_dir 'gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1' --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --learning_rate 5e-5 --block_size 128 --seed 42 --dataset_config_name 'gdrive/My Drive/UmlsBERT-master/language-modeling/config.json' --dataset_name 'gdrive/My Drive/UmlsBERT-master/language-modeling/mimic_string.txt' Here is the full output: 2021-07-06 10:08:00.087779: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0 Traceback (most recent call last): |
Hello, I am facing issues running the run_language_modeling.py script when running the example for pretraining Bio_clinicalbert using this line:
python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1 --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --mlm --do_train --learning_rate 5e-5 --max_steps 150000 --block_size 128 --save_steps 1000 --per_gpu_train_batch_size 32 --seed 42 --line_by_line --train_data_file mimic_string.txt --umls --config_name config.json --med_document ./voc/vocab_updated.txt
issue 1 - it said the tokenizer did not have an argument called max_len. this was the error:
'AttributeError: 'BertTokenizerFast' object has no attribute 'max_len' '
Based on advice online, I updated It from 'tokenizer.max_len' to 'tokenizer.model_max_length' which seems to have resolved this issue
issue 2 - the current error message i am getting is:
'TypeError: init() got an unexpected keyword argument 'tui_ids''
When looking for answers to these online, I came across a comment on the huggingface transformers issue forum at huggingface/transformers#8739
They said - 'It is actually due to #8604, where we removed several deprecated arguments. The run_language_modeling.py script is deprecated in favor of language-modeling/run_{clm, plm, mlm}.py.'
Does this apply to the scrpit for UmlsBERT as well? If so, how can I access the updated script? If not, how can I resolve the tui_ids issue?
I am running the scripts on google colab. This is the complete output I get when i run:
python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1 --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --mlm --do_train --learning_rate 5e-5 --max_steps 150000 --block_size 128 --save_steps 1000 --per_gpu_train_batch_size 32 --seed 42 --line_by_line --train_data_file mimic_string.txt --umls --config_name config.json --med_document ./voc/vocab_updated.txt
output:
2021-07-05 09:47:55.207129: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
07/05/2021 09:47:57 - WARNING - main - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
07/05/2021 09:47:57 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1/runs/Jul05_09-47-57_d7624bb0fdc5,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=150000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=clinicalBert-v1,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1,
save_on_each_node=False,
save_steps=1000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
/usr/local/lib/python3.7/dist-packages/transformers/models/auto/modeling_auto.py:847: FutureWarning: The class
AutoModelWithLMHead
is deprecated and will be removed in a future version. Please useAutoModelForCausalLM
for causal language models,AutoModelForMaskedLM
for masked language models andAutoModelForSeq2SeqLM
for encoder-decoder models.FutureWarning,
Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
Traceback (most recent call last):
File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 355, in
main()
File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 248, in main
tui_ids=tui_ids) if training_args.do_train else None
File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_language_modeling.py", line 136, in get_dataset
tui_ids=tui_ids)
TypeError: init() got an unexpected keyword argument 'tui_ids'
P.S. I am not an expert programmer so do let me know if I should provide any further information as this is the first time i'm submitting an issue.
Thank you.
Best,
Jaya
The text was updated successfully, but these errors were encountered: