Training nanoGPT on COVID-19 Dataset #391
Dude, you are on a single A100; you need more to scale. Also, do you have a link to the dataset? It might need filtering for metadata removal, and formatting. A 5-million-parameter model is below toy level. I'm surprised you have a loss of 5 on scientific papers. Let's look at the examples you've cited. BioBERT: I couldn't find concrete information on its parameter count, but its distilled version, CompactBioBERT (https://huggingface.co/nlpie/compact-biobert), is a 65M model, 13x bigger. Comparing the sizes of the model weights (both are from the fp32 era), the real model is about twice as large as the distilled one, approx. ~130M, 26x bigger than your current example. Also, @dbl001, I recommend using Karpathy's llama2.c. It's practically the same as nanoGPT, but based on the more modern Llama architecture and integrates better with the current ecosystem, if you want people to use your research: it has better inference, quants, and HF conversions. You can also train a custom tokenizer with llama2.c, one more suited to your data. I trained a model of the same size with llama2.c on the TinyStories dataset and got a loss of ~2. |
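For a rough sense of these scale comparisons, a common rule of thumb puts a GPT's non-embedding parameter count at about 12 · n_layer · n_embd². A quick sketch (the two configs below are hypothetical illustrations, not configs taken from this thread):

```python
def approx_params(n_layer: int, n_embd: int, vocab_size: int = 50257) -> int:
    """Rough GPT parameter count: 12*L*d^2 non-embedding plus vocab*d embedding.
    Rule-of-thumb approximation; ignores biases and layernorm weights."""
    return 12 * n_layer * n_embd * n_embd + vocab_size * n_embd

# Hypothetical configs for illustration (GPT-2 vocab assumed):
tiny = approx_params(n_layer=4, n_embd=256)    # ~16M total, embedding-dominated
small = approx_params(n_layer=12, n_embd=768)  # ~124M, GPT-2-small scale
```

At small widths the embedding table dominates the count, which is one reason a smaller domain-specific vocabulary can matter at toy scale.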
Thanks! I'll take a look. |
Ah, I see. The dataset should be fine, and the Llama 2 tokenizer should work, but you would need to change the dataloader to tokenize the PDFs. |
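A minimal sketch of that dataloader change: encode each paper's extracted text and pack the token ids into fixed-length training blocks. The encoder below is a stand-in (the real one would be the Llama 2 SentencePiece tokenizer loaded from tokenizer.model), and the function name is mine:

```python
def pack_into_blocks(docs, encode, block_size=512, eos_id=2):
    """Tokenize documents and pack them into contiguous fixed-length blocks,
    separating documents with an EOS token (id 2 in the Llama 2 vocab)."""
    buf, blocks = [], []
    for doc in docs:
        buf.extend(encode(doc))
        buf.append(eos_id)
        while len(buf) >= block_size:
            blocks.append(buf[:block_size])
            buf = buf[block_size:]
    return blocks  # leftover tokens in buf are dropped (or carried to the next shard)

# Stand-in encoder for illustration only; swap in the real tokenizer.
toy_encode = lambda text: [ord(c) % 100 for c in text]
blocks = pack_into_blocks(["paper one text", "paper two text"], toy_encode, block_size=8)
```

PDF text extraction itself (e.g. with a PDF library) would happen upstream of `docs`; the packing step is what train.py's dataloader consumes.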
llama2.c hasn't done much better than nanoGPT:
Model parameters:
Learning peaked with loss ~4 after ~2800 iterations:
|
Hmm, 25M params, but also a max sequence length of 1024; try lowering to 512? A +1 loss is pretty good at that scale. Also, if you're using an A100, why is device = mps and torch.compile off? |
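The suggested overrides in a llama2.c-style train.py would look something like this (a sketch; the exact variable names are assumed from the public repo and may differ between versions):

```python
# llama2.c train.py-style settings (names assumed)
max_seq_len = 512   # lower context length, as suggested above
device = "cuda"     # use the A100, not mps
compile = True      # torch.compile, usually a large speedup on Ampere GPUs
```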
I have tried training on 'mps' on an AMD Radeon Pro 5700XT and on an A100 on Google Colab Pro.
Next, I'll try an H100. |
{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-05, 'vocab_size': -1} - that seems big for an A6000. On 'mps', compile=True fails, and compile=True also failed on the V100 and A100 on Colab. - That's interesting; check on a T4? Raise an issue on that. |
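That config is Llama-7B scale, which is why it seems big for a single workstation GPU. A quick arithmetic check (vocab_size = -1 is a placeholder, so 32000 is assumed here to match the Llama 2 tokenizer; the FFN hidden-size rounding follows llama2.c's convention):

```python
def llama_param_count(dim, n_layers, vocab_size, multiple_of):
    """Approximate Llama parameter count (untied embeddings, SwiGLU FFN)."""
    # FFN hidden size, rounded up to a multiple_of boundary as in llama2.c
    hidden = int(2 * (4 * dim) / 3)
    hidden = multiple_of * ((hidden + multiple_of - 1) // multiple_of)
    attn = 4 * dim * dim       # wq, wk, wv, wo
    ffn = 3 * dim * hidden     # w1, w2, w3
    norms = 2 * dim            # two RMSNorms per layer
    per_layer = attn + ffn + norms
    # input embedding + output head (untied) + final RMSNorm
    return n_layers * per_layer + 2 * vocab_size * dim + dim

total = llama_param_count(dim=4096, n_layers=32, vocab_size=32000, multiple_of=256)
print(f"{total / 1e9:.2f}B parameters")  # ~6.74B: this is the Llama-7B config
```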
I am training nanoGPT on a dataset with ~800,000 COVID-19 research papers on an A100 GPU.
I can't get the loss to go any lower than ~5. The generated output looks like a COVID-19 research paper, but it is nonsensical, contains duplicate adjacent phrases, etc.
Here's an example input line:
Here's an example of generated output with prompt (e.g. The HIV-1 genomic RNA (gRNA) has three major functions...):
The loss stopped decreasing at ~5,000 iterations but I continued until ~7,300.
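The "duplicate adjacent phrases" symptom can be measured rather than eyeballed; a minimal sketch that counts immediately repeated word n-grams in a generated sample (function name is mine):

```python
def adjacent_ngram_repeats(text: str, n: int = 3) -> int:
    """Count positions where a word n-gram is immediately repeated,
    e.g. 'the viral genome the viral genome' for n=3."""
    words = text.split()
    count = 0
    for i in range(len(words) - 2 * n + 1):
        if words[i:i + n] == words[i + n:i + 2 * n]:
            count += 1
    return count

sample = "the viral genome the viral genome encodes structural proteins"
print(adjacent_ngram_repeats(sample, n=3))  # -> 1
```

Tracking this count across checkpoints gives a cheap proxy for whether sampling changes (temperature, top-k) or further training actually reduce the repetition.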
Here are my parameters from train.py:
Questions:
Is the input too complex for the model?
sciBERT and BioBERT can handle scientific papers. Should I try a different tokenizer (other than tiktoken)?
Should I try a different optimizer (other than AdamW)?