
[BUG] Error While Trying to Start the Training #595

Closed · 2 tasks done
pjahoorkar opened this issue Apr 23, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments


pjahoorkar commented Apr 23, 2024

I get the following error while trying to train the Llama 3 model. I'd appreciate any thoughts. Thanks.

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Hugging Face Space/Endpoints

Interface Used

UI

CLI Command

No response

UI Screenshots & Parameters

[Screenshot 2024-04-23 121251]
[Screenshot 2024-04-23 121511]

Error Logs

Device 0: NVIDIA A10G - 307.6MiB/22.49GiB


INFO | 2024-04-23 11:12:23 | autotrain.app:handle_form:454 - hardware: Local

INFO | 2024-04-23 11:11:16 | autotrain.app:fetch_params:212 - Task: llm:sft

INFO | 2024-04-23 11:10:40 | autotrain.app:<module>:154 - AutoTrain started successfully

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: model, warmup_ratio, optimizer, scheduler, push_to_hub, tags_column, weight_decay, save_strategy, token, repo_id, batch_size, max_grad_norm, data_path, max_seq_length, seed, save_total_limit, username, gradient_accumulation, logging_steps, lr, train_split, tokens_column, valid_split, evaluation_strategy, epochs, auto_find_batch_size, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: adam_beta2, warmup_steps, scheduler, class_image_path, adam_epsilon, checkpoints_total_limit, revision, text_encoder_use_attention_mask, image_path, seed, prior_preservation, xl, adam_beta1, prior_loss_weight, validation_images, prior_generation_precision, tokenizer_max_length, model, logging, push_to_hub, rank, center_crop, allow_tf32, local_rank, num_validation_images, token, validation_prompt, repo_id, scale_lr, checkpointing_steps, sample_batch_size, class_labels_conditioning, class_prompt, max_grad_norm, adam_weight_decay, num_class_images, username, tokenizer, resume_from_checkpoint, lr_power, num_cycles, pre_compute_text_embeddings, validation_epochs, epochs, dataloader_num_workers, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: model, push_to_hub, task, numerical_columns, num_trials, token, repo_id, id_column, data_path, time_limit, seed, username, train_split, valid_split, categorical_columns, target_columns, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: scheduler, lora_alpha, lora_dropout, max_target_length, target_column, text_column, data_path, seed, save_total_limit, peft, gradient_accumulation, model, warmup_ratio, optimizer, push_to_hub, weight_decay, lora_r, token, repo_id, batch_size, max_grad_norm, quantization, max_seq_length, username, logging_steps, lr, train_split, valid_split, evaluation_strategy, epochs, auto_find_batch_size, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: model, warmup_ratio, optimizer, scheduler, push_to_hub, image_column, weight_decay, save_strategy, token, target_column, repo_id, batch_size, max_grad_norm, data_path, seed, save_total_limit, username, gradient_accumulation, logging_steps, lr, train_split, valid_split, evaluation_strategy, epochs, auto_find_batch_size, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: model, warmup_ratio, optimizer, scheduler, push_to_hub, weight_decay, save_strategy, token, target_column, repo_id, text_column, batch_size, max_grad_norm, data_path, max_seq_length, seed, save_total_limit, username, gradient_accumulation, logging_steps, lr, train_split, valid_split, evaluation_strategy, epochs, auto_find_batch_size, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: trainer, scheduler, use_flash_attention_2, lora_alpha, lora_dropout, merge_adapter, model_ref, text_column, data_path, dpo_beta, add_eos_token, seed, save_total_limit, prompt_text_column, gradient_accumulation, model, warmup_ratio, optimizer, push_to_hub, model_max_length, weight_decay, lora_r, token, repo_id, disable_gradient_checkpointing, rejected_text_column, batch_size, max_grad_norm, username, logging_steps, evaluation_strategy, train_split, valid_split, lr, max_prompt_length, auto_find_batch_size, project_name

INFO | 2024-04-23 11:10:39 | autotrain.app:<module>:31 - Starting AutoTrain...

Your installed package nvidia-ml-py is corrupted. Skip patch functions nvmlDeviceGetMemoryInfo. You may get incorrect or incomplete results. Please consider reinstall package nvidia-ml-py via pip3 install --force-reinstall nvidia-ml-py nvitop.

Your installed package nvidia-ml-py is corrupted. Skip patch functions nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses. You may get incorrect or incomplete results. Please consider reinstall package nvidia-ml-py via pip3 install --force-reinstall nvidia-ml-py nvitop.
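
Note: the only non-INFO lines above are the nvidia-ml-py warnings; the log itself suggests the remedy (run on the training machine, assuming pip3 points at the same Python environment AutoTrain uses):

pip3 install --force-reinstall nvidia-ml-py nvitop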

Additional Information

No response

@pjahoorkar added the bug (Something isn't working) label on Apr 23, 2024
@abhishekkrthakur
Member

What's the error?

@abhishekkrthakur
Member

Closing the issue since there is no error.
