Description
Hi,
What is the best way to run this on my high-performance laptop?
Should this work at all, and can I estimate roughly how many days or weeks it will take to finish?
Thanks in advance.
Specs:
OS: Win 11 (WSL2)
CPU: Intel Core i7 12850HX
Make: Lenovo Thinkpad P16 gen 1
Memory: 128GB DDR5-4800 (2400MHz)
GPU: Nvidia RTX A5500 16GB
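Here is my rough back-of-the-envelope check of whether a 4-bit LoRA run of a 7B model fits in 16GB of VRAM. The trainable-parameter count (54,525,952) is taken from the trainer log below; the bytes-per-parameter figures and the activation allowance are my own assumptions, so this is only an order-of-magnitude sketch:

```python
# Rough VRAM estimate for 4-bit (QLoRA-style) SFT of Mistral-7B on a 16 GB card.
# All figures are approximations, not measurements.

base_params = 7.24e9          # approximate parameter count of Mistral-7B
lora_params = 54_525_952      # "Number of trainable parameters" from the trainer log below

weights_4bit_gb = base_params * 0.5 / 1e9   # ~0.5 byte/param for NF4 weights (plus small overhead)
lora_weights_gb = lora_params * 2 / 1e9     # LoRA adapter weights in bf16
lora_grads_gb   = lora_params * 2 / 1e9     # gradients for the LoRA params only
optimizer_gb    = lora_params * 12 / 1e9    # assume ~12 bytes/param for AdamW states
activations_gb  = 2.0                       # guess for batch size 1, seq len 2048, gradient checkpointing on

total_gb = weights_4bit_gb + lora_weights_gb + lora_grads_gb + optimizer_gb + activations_gb
print(f"~{total_gb:.1f} GB estimated peak VRAM vs. 16 GB available")
# -> roughly 6-7 GB, so the run should fit on the RTX A5500.
```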
It seems the following command works on my laptop:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_lora.yaml --load_in_4bit=true --gradient_accumulation_steps=1024 --per_device_eval_batch_size=1 --per_device_train_batch_size=1
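As a quick sanity check on what those flags imply (the 207,865 train examples come from the trainer log further down; everything else is taken straight from the command):

```python
# Effective batch size and optimization steps per epoch for the command above.
num_examples = 207_865               # "Num examples" from the trainer log
per_device_train_batch_size = 1
gradient_accumulation_steps = 1024
num_processes = 1                    # --num_processes=1

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_processes
steps_per_epoch = num_examples // effective_batch

print(effective_batch)   # 1024 -> matches "Total train batch size ... = 1,024" in the log
print(steps_per_epoch)   # 202  -> matches "Total optimization steps = 202" in the log
```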
I have now run it for roughly 1-2 hours:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_lora.yaml --load_in_4bit=true --gradient_accumulation_steps=1024 --per_device_eval_batch_size=1 --per_device_train_batch_size=1
INFO:root:Using nproc_per_node=1.
2023-11-27 15:41:33.914308: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-11-27 15:41:33.941565: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-27 15:41:34.582753: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2023-11-27 15:41:35,164] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/usr/local/lib/python3.11/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead.
warnings.warn(
2023-11-27 15:41:35 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
2023-11-27 15:41:35 - INFO - main - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='mistralai/Mistral-7B-v0.1', model_revision='main', model_code_revision=None, torch_dtype='auto', trust_remote_code=False, use_flash_attention_2=True, use_peft=True, lora_r=64, lora_alpha=16, lora_dropout=0.1, lora_target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'], lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=True, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False)
2023-11-27 15:41:35 - INFO - main - Data parameters DataArguments(chat_template=None, dataset_mixer={'HuggingFaceH4/ultrachat_200k': 1.0}, dataset_splits=['train_sft', 'test_sft'], max_train_samples=None, max_eval_samples=None, preprocessing_num_workers=12, truncation_side=None)
2023-11-27 15:41:35 - INFO - main - Training/evaluation parameters SFTConfig(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.EPOCH,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1024,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={'use_reentrant': False},
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=zephyr-7b-sft-lora,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=data/zephyr-7b-sft-lora/runs/Nov27_15-41-35,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=5,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.COSINE,
max_grad_norm=1.0,
max_seq_length=2048,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
output_dir=data/zephyr-7b-sft-lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=True,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=data/zephyr-7b-sft-lora,
save_on_each_node=False,
save_safetensors=True,
save_steps=500,
save_strategy=IntervalStrategy.NO,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
Overwrite dataset info from restored data version if exists.
2023-11-27 15:41:38 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
2023-11-27 15:41:38 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
Found cached dataset ultrachat_200k (/root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458)
2023-11-27 15:41:38 - INFO - datasets.builder - Found cached dataset ultrachat_200k (/root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458)
Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
2023-11-27 15:41:38 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
Overwrite dataset info from restored data version if exists.
2023-11-27 15:41:40 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
2023-11-27 15:41:40 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
Found cached dataset ultrachat_200k (/root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458)
2023-11-27 15:41:40 - INFO - datasets.builder - Found cached dataset ultrachat_200k (/root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458)
Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
2023-11-27 15:41:40 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-91f7f728fecb2505.arrow
2023-11-27 15:41:40 - INFO - datasets.arrow_dataset - Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-91f7f728fecb2505.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-83009ff6f17d65d0.arrow
2023-11-27 15:41:40 - INFO - datasets.arrow_dataset - Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-83009ff6f17d65d0.arrow
2023-11-27 15:41:40 - INFO - main - Training on the following datasets and their proportions: ['train : 207865', 'test : 23110']
[INFO|tokenization_utils_base.py:2022] 2023-11-27 15:41:40,744 >> loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/tokenizer.model
[INFO|tokenization_utils_base.py:2022] 2023-11-27 15:41:40,744 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/tokenizer.json
[INFO|tokenization_utils_base.py:2022] 2023-11-27 15:41:40,744 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2022] 2023-11-27 15:41:40,744 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/special_tokens_map.json
[INFO|tokenization_utils_base.py:2022] 2023-11-27 15:41:40,744 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/tokenizer_config.json
Loading cached processed dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-3e95fae9b410a2c7.arrow
2023-11-27 15:41:40 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-3e95fae9b410a2c7.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-84dc14e69dab5370.arrow
2023-11-27 15:41:40 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/HuggingFaceH4___ultrachat_200k/default/0.0.0/e9d36c4d9da46458/cache-84dc14e69dab5370.arrow
2023-11-27 15:41:40 - INFO - main - Sample 167621 of the processed training set:
........
2023-11-27 15:41:40 - INFO - main - *** Load pretrained model ***
2023-11-27 15:41:40 - INFO - main - *** Model loaded! ***
/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py:145: UserWarning: You passed a model_id to the SFTTrainer. This will automatically create an AutoModelForCausalLM or a PeftModel (if you passed a peft_config) for you.
warnings.warn(
[INFO|configuration_utils.py:717] 2023-11-27 15:41:40,964 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/config.json
[INFO|configuration_utils.py:777] 2023-11-27 15:41:40,964 >> Model config MistralConfig {
"_name_or_path": "mistralai/Mistral-7B-v0.1",
"architectures": [
"MistralForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "mistral",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-05,
"rope_theta": 10000.0,
"sliding_window": 4096,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.35.0",
"use_cache": false,
"vocab_size": 32000
}
[INFO|modeling_utils.py:3121] 2023-11-27 15:41:40,972 >> loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/pytorch_model.bin.index.json
[INFO|modeling_utils.py:3184] 2023-11-27 15:41:40,974 >> Will use torch_dtype=torch.bfloat16 as defined in model's config object
[INFO|modeling_utils.py:1222] 2023-11-27 15:41:40,974 >> Instantiating MistralForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:791] 2023-11-27 15:41:40,976 >> Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2,
"use_cache": false
}
[INFO|modeling_utils.py:3257] 2023-11-27 15:41:41,631 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00, 4.75s/it]
[INFO|modeling_utils.py:3950] 2023-11-27 15:41:51,332 >> All model checkpoint weights were used when initializing MistralForCausalLM.
[INFO|modeling_utils.py:3958] 2023-11-27 15:41:51,332 >> All the weights of MistralForCausalLM were initialized from the model checkpoint at mistralai/Mistral-7B-v0.1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.
[INFO|configuration_utils.py:751] 2023-11-27 15:41:51,488 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/generation_config.json
[INFO|configuration_utils.py:791] 2023-11-27 15:41:51,488 >> Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2
}
[INFO|training_args.py:1784] 2023-11-27 15:41:51,646 >> PyTorch: setting up devices
/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py:247: UserWarning: You passed a tokenizer with padding_side not equal to right to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding tokenizer.padding_side = 'right' to your code.
warnings.warn(
[INFO|trainer.py:593] 2023-11-27 15:41:52,619 >> Using auto half precision backend
2023-11-27 15:41:52 - INFO - main - *** Train ***
[INFO|trainer.py:1723] 2023-11-27 15:41:53,614 >> ***** Running training *****
[INFO|trainer.py:1724] 2023-11-27 15:41:53,614 >> Num examples = 207,865
[INFO|trainer.py:1725] 2023-11-27 15:41:53,614 >> Num Epochs = 1
[INFO|trainer.py:1726] 2023-11-27 15:41:53,614 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1729] 2023-11-27 15:41:53,614 >> Total train batch size (w. parallel, distributed & accumulation) = 1,024
[INFO|trainer.py:1730] 2023-11-27 15:41:53,614 >> Gradient Accumulation steps = 1024
[INFO|trainer.py:1731] 2023-11-27 15:41:53,614 >> Total optimization steps = 202
[INFO|trainer.py:1732] 2023-11-27 15:41:53,616 >> Number of trainable parameters = 54,525,952
0%|          | 0/202 [00:00<?, ?it/s]
[WARNING|tokenization_utils_base.py:3831] 2023-11-27 15:41:54,956 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2377 > 2048). Running this sequence through the model will result in indexing errors
[WARNING|logging.py:314] 2023-11-27 15:41:55,018 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
[WARNING|logging.py:329] 2023-11-27 15:41:55,763 >> The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.
[W reducer.cpp:1346] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
0%|▌ | 1/202 [4:36:47<927:14:16, 16607.25s/it]
{'loss': 1.1453, 'learning_rate': 1.9998790632601496e-05, 'epoch': 0.0}
0%|▌ | 1/202 [4:36:47<927:14:16, 16607.25s/it]
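Answering my own question about the runtime: projecting from the ~16,607 s/it reported for the first step (the first iteration can include one-off overhead, so this is only a ballpark):

```python
# Ballpark total runtime from the first logged iteration time.
seconds_per_step = 16_607.25    # "16607.25s/it" from the progress bar above
total_steps = 202               # "Total optimization steps = 202" from the trainer log

total_seconds = seconds_per_step * total_steps
print(f"~{total_seconds / 3600:.0f} hours (~{total_seconds / 86400:.0f} days)")
# -> roughly 930 hours, i.e. about 39 days for one epoch at this speed,
#    consistent with the 927:14:16 ETA shown for the remaining 201 steps.
```

So at this pace a single epoch would take well over a month on this laptop.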