
[BUG] Error While Trying to Start the Training #595

Closed · 2 tasks done
pjahoorkar opened this issue Apr 23, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments


pjahoorkar commented Apr 23, 2024

I get the following error while trying to train the Llama 3 model. I'd appreciate any thoughts. Thanks.

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Hugging Face Space/Endpoints

Interface Used

UI

CLI Command

No response

UI Screenshots & Parameters

[Screenshot 2024-04-23 121251]
[Screenshot 2024-04-23 121511]

Error Logs

Device 0: NVIDIA A10G - 307.6MiB/22.49GiB


INFO | 2024-04-23 11:12:23 | autotrain.app:handle_form:454 - hardware: Local

INFO | 2024-04-23 11:11:16 | autotrain.app:fetch_params:212 - Task: llm:sft

INFO | 2024-04-23 11:10:40 | autotrain.app:<module>:154 - AutoTrain started successfully

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: model, warmup_ratio, optimizer, scheduler, push_to_hub, tags_column, weight_decay, save_strategy, token, repo_id, batch_size, max_grad_norm, data_path, max_seq_length, seed, save_total_limit, username, gradient_accumulation, logging_steps, lr, train_split, tokens_column, valid_split, evaluation_strategy, epochs, auto_find_batch_size, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: adam_beta2, warmup_steps, scheduler, class_image_path, adam_epsilon, checkpoints_total_limit, revision, text_encoder_use_attention_mask, image_path, seed, prior_preservation, xl, adam_beta1, prior_loss_weight, validation_images, prior_generation_precision, tokenizer_max_length, model, logging, push_to_hub, rank, center_crop, allow_tf32, local_rank, num_validation_images, token, validation_prompt, repo_id, scale_lr, checkpointing_steps, sample_batch_size, class_labels_conditioning, class_prompt, max_grad_norm, adam_weight_decay, num_class_images, username, tokenizer, resume_from_checkpoint, lr_power, num_cycles, pre_compute_text_embeddings, validation_epochs, epochs, dataloader_num_workers, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: model, push_to_hub, task, numerical_columns, num_trials, token, repo_id, id_column, data_path, time_limit, seed, username, train_split, valid_split, categorical_columns, target_columns, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: scheduler, lora_alpha, lora_dropout, max_target_length, target_column, text_column, data_path, seed, save_total_limit, peft, gradient_accumulation, model, warmup_ratio, optimizer, push_to_hub, weight_decay, lora_r, token, repo_id, batch_size, max_grad_norm, quantization, max_seq_length, username, logging_steps, lr, train_split, valid_split, evaluation_strategy, epochs, auto_find_batch_size, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: model, warmup_ratio, optimizer, scheduler, push_to_hub, image_column, weight_decay, save_strategy, token, target_column, repo_id, batch_size, max_grad_norm, data_path, seed, save_total_limit, username, gradient_accumulation, logging_steps, lr, train_split, valid_split, evaluation_strategy, epochs, auto_find_batch_size, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: model, warmup_ratio, optimizer, scheduler, push_to_hub, weight_decay, save_strategy, token, target_column, repo_id, text_column, batch_size, max_grad_norm, data_path, max_seq_length, seed, save_total_limit, username, gradient_accumulation, logging_steps, lr, train_split, valid_split, evaluation_strategy, epochs, auto_find_batch_size, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: trainer, scheduler, use_flash_attention_2, lora_alpha, lora_dropout, merge_adapter, model_ref, text_column, data_path, dpo_beta, add_eos_token, seed, save_total_limit, prompt_text_column, gradient_accumulation, model, warmup_ratio, optimizer, push_to_hub, model_max_length, weight_decay, lora_r, token, repo_id, disable_gradient_checkpointing, rejected_text_column, batch_size, max_grad_norm, username, logging_steps, evaluation_strategy, train_split, valid_split, lr, max_prompt_length, auto_find_batch_size, project_name

INFO | 2024-04-23 11:10:39 | autotrain.app:<module>:31 - Starting AutoTrain...

Your installed package nvidia-ml-py is corrupted. Skip patch functions nvmlDeviceGetMemoryInfo. You may get incorrect or incomplete results. Please consider reinstall package nvidia-ml-py via pip3 install --force-reinstall nvidia-ml-py nvitop.

Your installed package nvidia-ml-py is corrupted. Skip patch functions nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses. You may get incorrect or incomplete results. Please consider reinstall package nvidia-ml-py via pip3 install --force-reinstall nvidia-ml-py nvitop.
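
Note: the only non-INFO lines above are the nvidia-ml-py warnings; the log itself suggests the remedy (run on the training machine, assuming pip3 points at the same Python environment AutoTrain uses):

pip3 install --force-reinstall nvidia-ml-py nvitop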

Additional Information

No response

@pjahoorkar added the bug (Something isn't working) label on Apr 23, 2024
@abhishekkrthakur
Member

What's the error?

@abhishekkrthakur
Member

Closing the issue since there is no error.
