Stuck on training: Created a PretokDataset with rng seed 42 #311
Comments
I have exactly the same issue. "Created a PretokDataset with rng seed 42" and then nothing.
@CatTimson @madroidmaq can you try setting …?
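For anyone hitting the same hang, here is a minimal sketch of the kind of change being suggested, assuming the suggestion is `num_workers=0` (a common workaround when a DataLoader hangs before yielding its first batch; the `pin_memory=False` variant is tried later in the thread). The function body mirrors the `iter_batches` shown in the diff further down:

```python
import torch
from tinystories import PretokDataset  # dataset class from this repo

def iter_batches(batch_size, device, num_workers=0, **dataset_kwargs):
    ds = PretokDataset(**dataset_kwargs)
    dl = torch.utils.data.DataLoader(
        ds, batch_size=batch_size, pin_memory=True,
        num_workers=0,  # force single-process loading regardless of the argument
    )
    for x, y in dl:
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        yield x, y
```

With `num_workers=0`, batch collation runs in the main process instead of worker subprocesses, which sidesteps a class of multiprocessing deadlocks.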
#296 should be the same problem.
@RahulSChand With your method, my problem disappeared; I can now train and see the detailed data for each training step, thank you very much. Like this: …

I'd like to know why this tweak works, or what sources I should look at for this information. Looking forward to your reply.
No, it did not resolve my issue. I was still observing "Created a PretokDataset with rng seed 42" for about an hour before I cancelled the script. What if the issue is software-package dependent? Whoever can run the training successfully, would you mind sharing your software config? Mine is: …
@CatTimson my device …
@madroidmaq since training gets stuck at the dataloader, I looked up PyTorch issues for it and found a similar issue other people reported when …
@CatTimson what is your PyTorch version? Use …
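For anyone following along, the standard way to report the PyTorch version from Python (nothing repo-specific here):

```python
import torch

print(torch.__version__)          # e.g. "2.0.0+cu117"
print(torch.cuda.is_available())  # sanity-check that the CUDA build sees a GPU
```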
I also encountered the same problem on another device, but modifying … did not resolve it.
@madroidmaq can you remove this (Line 249 in bd18228) and this (Line 322 in bd18228)? Just put t0=0 and t1=0 and check. Also, you can change the print below to use flush=True (Line 262 in bd18228). I can't reproduce the error, so there is no way for me to check whether the suggestion is actually correct, but it's worth a try.
@RahulSChand Thank you for your reply. I modified the files according to your suggestion, but it still doesn't work; it still gets stuck after outputting "Created a PretokDataset with rng seed 42". Here is my staged diff:

```diff
(base) jupyter@umn-20230612-000220:~/llama2/llama2.c$ git diff --cached
diff --git a/tinystories.py b/tinystories.py
index 690cb02..7e46ee8 100644
--- a/tinystories.py
+++ b/tinystories.py
@@ -235,7 +235,7 @@ class Task:
     def iter_batches(batch_size, device, num_workers=0, **dataset_kwargs):
         ds = PretokDataset(**dataset_kwargs)
         dl = torch.utils.data.DataLoader(
-            ds, batch_size=batch_size, pin_memory=True, num_workers=num_workers
+            ds, batch_size=batch_size, pin_memory=False, num_workers=num_workers
         )
         for x, y in dl:
             x = x.to(device, non_blocking=True)
diff --git a/train.py b/train.py
index b1972dc..ace7e02 100644
--- a/train.py
+++ b/train.py
@@ -246,7 +246,7 @@ if wandb_log and master_process:
 # training loop
 train_batch_iter = iter_batches(split="train")
 X, Y = next(train_batch_iter)  # fetch the very first batch
-t0 = time.time()
+t0 = 0
 local_iter_num = 0  # number of iterations in the lifetime of this process
 raw_model = model.module if ddp else model  # unwrap DDP container if needed
 running_mfu = -1.0
@@ -259,7 +259,7 @@ while True:
     # evaluate the loss on train/val sets and write checkpoints
     if iter_num % eval_interval == 0 and master_process:
         losses = estimate_loss()
-        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
+        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}", flush=True)
         if wandb_log:
             try:
                 wandb.log(
@@ -319,7 +319,7 @@ while True:
     optimizer.zero_grad(set_to_none=True)
     # timing and logging
-    t1 = time.time()
+    t1 = 0
     dt = t1 - t0
     t0 = t1
     if iter_num % log_interval == 0 and master_process:
```

The full output is: …
My original configuration was 2.0.0+cu117. Then I did a complete cleanup/reinstall of everything. Now (nvidia-smi output): … The result is exactly the same.
Getting a similar issue with the same CUDA config; totally lost as to what could be wrong here!
@CatTimson @kunwar-vikrant @madroidmaq I was able to reproduce the error when using a custom dataset. It happens because the data_dir path … You can confirm whether this is the case for you by adding a print at Line 200 in bd18228. If running train.py now prints …, then change Line 95 in bd18228. For example, my json data is

```json
{
  "query": "My order hasn't arrived yet.",
  "response": "We apologize for the inconvenience. Can you please provide your order number so we can investigate?"
}
```

so I change "story" to "response" and then run the vocab/tokenization steps again.
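A sketch of the two checks described above. The glob pattern and the `example["story"]` access follow what the thread says tinystories.py does around the cited lines; the `data_dir` value and file names here are placeholders, not the repo's exact code:

```python
import glob
import json
import os

data_dir = "data/tok4096"  # placeholder: wherever your pretokenized .bin shards live

# Check 1: does data_dir actually contain pretokenized shards? An empty glob
# means PretokDataset has nothing to iterate, so training sits silently right
# after the "rng seed 42" print.
shard_filenames = sorted(glob.glob(os.path.join(data_dir, "*.bin")))
print(f"found {len(shard_filenames)} shard(s) in {data_dir}")
assert shard_filenames, f"no pretokenized shards in {data_dir} - rerun pretokenize"

# Check 2: does your JSON use the key the pretokenizer reads? The thread says
# tinystories.py indexes example["story"], so a custom dataset keyed differently
# (e.g. "response") produces no text until that line is edited.
with open("data00.json") as f:  # placeholder file name
    example = json.load(f)
text = example.get("story") or example.get("response")
print(repr(text)[:80])
```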
@RahulSChand I re-tested following your method and found that it is indeed as you said, and … However, I am not training on my own dataset; rather, I am using a custom train_vocab method. The specific execution sequence is as follows: …
@madroidmaq what is inside your …?
@RahulSChand You are right, I encountered …

To solve this, my change is as follows:

```diff
diff --git a/train_vocab.sh b/train_vocab.sh
index 7803af8..454cb12 100755
--- a/train_vocab.sh
+++ b/train_vocab.sh
@@ -111,16 +111,29 @@ echo "Vocabulary Size: $vocab_size"
 # --byte_fallback is true, default in spm is false
 # --normalization_rule_name is identity, default in spm is nmt_nfkc
-spm_train --input="$input" \
-    --model_prefix="$model_prefix" \
-    --model_type=bpe \
-    --vocab_size="$vocab_size" \
-    --self_test_sample_size=0 \
-    --input_format="text" \
-    --character_coverage=1.0 \
-    --num_threads="$(nproc)" \
-    --split_digits=true \
-    --allow_whitespace_only_pieces=true \
-    --byte_fallback=true \
-    --unk_surface=" \342\201\207 " \
-    --normalization_rule_name=identity \
+python3 << END
+import sentencepiece as spm
+import os
+
+input_path = "$input"
+model_prefix = "$model_prefix"
+vocab_size = "$vocab_size"
+num_threads = os.cpu_count()
+
+spm.SentencePieceTrainer.train(
+    f'--input={input_path} '
+    f'--model_prefix={model_prefix} '
+    '--model_type=bpe '
+    f'--vocab_size={vocab_size} '
+    '--self_test_sample_size=0 '
+    '--input_format=text '
+    '--character_coverage=1.0 '
+    f'--num_threads={num_threads} '
+    '--split_digits=true '
+    '--allow_whitespace_only_pieces=true '
+    '--byte_fallback=true '
+    '--unk_surface= \342\201\207 '
+    '--normalization_rule_name=identity'
+)
+END
```

@karpathy I'm not sure whether the above code changes can be accepted; if so, I will submit a PR.
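One detail in the patch above may be worth double-checking: joining all flags into one space-separated string is fragile for values that themselves contain spaces (the space-padded `--unk_surface` value is the risky one and may be split incorrectly). The sentencepiece Python API also accepts keyword arguments, which avoids the quoting question entirely. A sketch of the same training call in that style, with placeholder values where the shell script interpolates variables:

```python
import os
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",       # placeholder for "$input"
    model_prefix="tok4096",   # placeholder for "$model_prefix"
    model_type="bpe",
    vocab_size=4096,          # placeholder for "$vocab_size"
    self_test_sample_size=0,
    input_format="text",
    character_coverage=1.0,
    num_threads=os.cpu_count(),
    split_digits=True,
    allow_whitespace_only_pieces=True,
    byte_fallback=True,
    unk_surface=r" \342\201\207 ",  # raw string; spm unescapes these octal bytes to the ⁇ token
    normalization_rule_name="identity",
)
```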
@madroidmaq You can do …
@RahulSChand You are correct, the spm_train command line tool can be installed using … The version number in the project is …, but the … So it seems I still need to manually compile the spm_train tool myself.
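For comparing versions, the Python binding reports its own version (standard sentencepiece attribute):

```python
import sentencepiece as spm

print(spm.__version__)  # version of the pip-installed binding, e.g. "0.1.99"
```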
I have discovered another bug. Example below (single data00.json): …
I have a side question: when will the while loop at Lines 206 to 223 in c7a2626 ever break? …
@mvuthegoat It doesn't need to break. It keeps yielding (running) for as long as we keep calling next() on it (Line 309 in c7a2626). This isn't a normal while loop; it has a yield inside, which makes the function a generator that only advances on demand.
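A toy illustration of that point, separate from the repo code: a generator's `while True` body only runs when the consumer asks for the next value, so it never needs a `break`:

```python
def infinite_batches():
    """Toy stand-in for PretokDataset's iterator: loops forever, yields on demand."""
    epoch = 0
    while True:          # no break anywhere...
        yield f"batch from epoch {epoch}"
        epoch += 1       # ...execution pauses at yield until next() is called

it = infinite_batches()
print(next(it))  # batch from epoch 0
print(next(it))  # batch from epoch 1
# train.py consumes it the same way: X, Y = next(train_batch_iter) each step,
# and simply stops calling next() once max_iters is reached.
```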
When I try to train the model, I run into some problems; I don't know if anyone else has the same problem or how I should solve it.

When I execute the training code (below), the log always gets stuck on the output of `Created a PretokDataset with rng seed 42`, and there is no change for several hours. Below are some key steps I performed, along with my device information.
The corresponding output is roughly as follows: …
The GPU information of my machine is roughly as follows: …
The CPU information of my machine is roughly as follows: 12 vCPUs, 85GB RAM.