how to train it #19
That code is not released and maybe won't be released.
Can you create a model that accepts other languages (e.g. German, Russian)?
@jackylee1 Sorry, as others mentioned, we're not releasing training code for now, as this repository is currently for exploring the existing model. You may find the model.py code in this repo useful, and there are plenty of other projects that train transformer models out there! @branc116 we're not taking requests to train any models as we have limited bandwidth (especially if they require procuring a dataset differently!)
@openai @WuTheFWasThat
Same question. Thank you.
RE other pretrained models: definitely check out Google's BERT model and CMU/Google's Transformer-XL models |
nshepperd has released training code for retraining GPT-2-small: https://github.com/nshepperd/gpt-2/tree/finetuning. It works: I've used it for retraining on anime plot synopses & Project Gutenberg poetry.
Thank you @gwern |
@nshepperd @gwern Thank you so much for sharing the training code (ref: https://github.com/nshepperd/gpt-2/tree/finetuning)! I was able to train a model (using a Chinese dataset) on top of the default 117M model without problems. However, when I try to generate samples (either conditional or unconditional), I get "FileNotFoundError: [Errno 2] No such file or directory" for encoder.json, hparams.json and vocab.bpe. What would be the correct procedure for generating samples from the new model described above?
I am not entirely sure, because you said you are using Chinese rather than English text. For English text, all you need to do is copy those missing files over from the 117M model directory, which will have them (assuming you ran the download script, as you must have if you did any retraining of 117M). Since finetuning doesn't affect the BPE encoding details (it merely trains the Transformer model itself further), the model still assumes exactly the same encoding as OA's 117M model, and that encoding is defined by those files.
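For reference, a minimal sketch of that copy step in Python; the directory names are assumptions based on the default repo layout, so adjust them to wherever your fine-tuned model lives:

```python
import shutil
from pathlib import Path

# Copy the BPE/encoder definition files from the stock 117M model
# into the directory of the fine-tuned model. Paths are assumptions,
# not fixed by the repo; change them to match your setup.
src = Path("models/117M")
dst = Path("models/my-finetuned-model")  # hypothetical target directory
dst.mkdir(parents=True, exist_ok=True)

for name in ("encoder.json", "vocab.bpe", "hparams.json"):
    shutil.copy(src / name, dst / name)
```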
@gwern @nshepperd |
I don't know of any reason you couldn't do finetuning in Colab (the main restriction I'm aware of is that you only get something like 12 GPU-hours, which is more than enough for many finetuning tasks). But I have little familiarity with it or interest in setting up a notebook to do the finetuning. Colab seems like a very restrictive tool compared to running on your own machine.
@nshepperd @gwern |
On a heldout set? No idea. You'll have to code that yourself. (I suppose one dirty hack would be to set the learning rate to zero and 'train' on a 'new' dataset and watch the averaged loss...) |
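If you do want a rough held-out loss, here is an untested sketch (TF 1.x) that reuses model.py and encoder.py from this repo to average per-token cross-entropy of a checkpoint over a held-out text file. The checkpoint path, held-out file name, and the exact encoder.get_encoder signature are assumptions that may differ between forks:

```python
import json
import os

import numpy as np
import tensorflow as tf

import encoder  # from this repo's src/ directory
import model

model_name = "117M"
enc = encoder.get_encoder(model_name)  # some forks take (model_name, models_dir)
hparams = model.default_hparams()
with open(os.path.join("models", model_name, "hparams.json")) as f:
    hparams.override_from_dict(json.load(f))

with open("heldout.txt", encoding="utf-8") as f:  # hypothetical held-out file
    tokens = enc.encode(f.read())

context = tf.placeholder(tf.int32, [1, None])
output = model.model(hparams=hparams, X=context)
# Next-token prediction loss: predict token t+1 from tokens up to t.
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=context[:, 1:], logits=output["logits"][:, :-1]))

with tf.Session() as sess:
    saver = tf.train.Saver()
    saver.restore(sess, tf.train.latest_checkpoint("checkpoint/run1"))
    window = 1024  # GPT-2's context length
    losses = [sess.run(loss, feed_dict={context: [tokens[i:i + window]]})
              for i in range(0, max(len(tokens) - window, 1), window)]
    print("held-out loss:", np.mean(losses))
```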
@PapayasTehSkeletor I was able to set it up in Colab and train, if that is what you are looking for: https://github.com/ak9250/gpt-2-colab
@ak9250 |
@ak9250 @gwern @nshepperd Having said that, I have just one last question: how can I "save" my modifications? I don't want to train GPT-2 and then lose everything when I turn off the computer, having to spend two hours training all over again; I'd also like to add more texts to train on without having to restart the runtime (i.e. starting again from zero). From what I've found, I'm supposed to use "pull" or "fetch" to save the changes I make in the clone, but that still doesn't save the training. If you don't want to help me, no problem. I really don't know what use I'd make of the AI besides training it on hundreds of books and then seeing the result.
@PapayasTehSkeletor you will see a checkpoint folder at step 1000; save that entire folder by mounting your Drive in Colab.
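A minimal sketch of persisting the checkpoints from a Colab cell; the source and destination paths are assumptions based on the commands used elsewhere in this thread:

```python
import shutil

from google.colab import drive

drive.mount("/content/drive")  # prompts for Google Drive authorization

# Copy the checkpoint folder out of the ephemeral Colab VM so the
# fine-tuned weights survive a runtime reset. copytree requires the
# destination not to exist yet; pick a fresh folder name per run.
shutil.copytree("/content/gpt-2/checkpoint/run1",
                "/content/drive/My Drive/gpt2-checkpoints/run1")
```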
@ak9250 |
@ak9250 It works all the way through, except for actually running it. No errors, it just does nothing. Am I doing something wrong?
@northerain While training it should show some output in Colab. Which cell are you running? Also, to run it in Colab you have to use "Save a copy in Drive".
@ak9250 |
@northerain did you run this line to copy over the trained model: !cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/117M/
@ak9250 that's the line that does nothing. Loading icon then nothing. |
Yes, it won't show anything, because it just copies files over to the model dir.
@ak9250 |
Yes, or the other one, which shows unconditional samples.
Oh man, I'm dumb. Sorry to take up your time, but thank you for the help and for making this. |
Hello everyone, I'm new to Python and have a few questions. I was able to run it in Colab following ak9250's instructions and fed it Russian text for training; now during training the samples contain incomprehensible, made-up words. Is this part of the learning process and to be expected, or is this model just not adapted to other languages? Approximately how long does training take in Colab?
Another training implementation, based on |
Could you please take a look at this issue: #108
And another implementation, the same nshepperd fork but with scripts to process FB2 files: https://github.com/rkfg/gpt-2 It also uses the sentencepiece tokenizer. It's pretty easy to find a lot of Russian books in FB2 format (and some in other languages as well, but not as many), so this fork primarily targets the Russian audience. Unfortunately there's still no README for these scripts, but the basic workflow is:
I don't use validation, because it's easy to make a dataset so huge it would take months to sample fully, so basically training IS validation; it's practically impossible to overfit because the samples don't repeat.
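As an illustration only (not the fork's actual scripts), training and using a sentencepiece tokenizer on a plain-text dump of the books might look roughly like this; the file names, vocab size and model type are guesses, not what rkfg's fork actually uses:

```python
import sentencepiece as spm

# Train a BPE sentencepiece model on the extracted book text.
# All parameters below are illustrative placeholders.
spm.SentencePieceTrainer.Train(
    "--input=books.txt --model_prefix=sp_ru "
    "--vocab_size=50000 --model_type=bpe --character_coverage=1.0")

# Load the trained model and round-trip a sample sentence.
sp = spm.SentencePieceProcessor()
sp.Load("sp_ru.model")

ids = sp.EncodeAsIds("Пример русского текста.")
print(ids, sp.DecodeIds(ids))
```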
@nshepperd: What is the appropriate format of the dataset to feed into gpt-2 for fine-tuning? Is it sentence-segmented, or could it be the entire document/corpus? Thanks.
@negacy An entire document works fine. I'm using a modified fork of @ak9250's Google Colab notebook. I've included code for building a single huge English training text file based on Project Gutenberg (books) or Access to Insight (Dharma texts). ATI seems to overfit quite rapidly, but a 99 MB subset of Gutenberg came out quite convincing after only 4 hours of training. YMMV. https://github.com/CaptainValor/gpt-2-colab
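A minimal sketch of that corpus-building step; the folder layout and file names are assumptions, not the notebook's exact code:

```python
import glob

# Merge a folder of plain-text books into one big training file,
# separating documents with GPT-2's <|endoftext|> token so the model
# sees where one text ends and the next begins.
with open("train.txt", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("gutenberg/*.txt")):  # hypothetical folder
        with open(path, encoding="utf-8", errors="ignore") as f:
            out.write(f.read().strip())
        out.write("\n<|endoftext|>\n")
```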
So I noticed that nshepperd's fork doesn't preserve the optimizer state in the checkpoints it generates. Be aware that this affects the Adam optimizer, because it keeps individual learning-rate statistics for each model parameter. If you restart training from a checkpoint, that information is lost and your learning rate is set back to the provided value (1e-4 by default). This results in a (slight) loss increase after the checkpoint is loaded, and I think it might break the learning process at the late stages (the model stops converging or starts diverging). To store everything you need to remove this line (or replace it with
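To illustrate the difference, here is a sketch of the idea in TF 1.x, not the fork's exact code:

```python
import tensorflow as tf  # TF 1.x, as this repo requires

# Saving only the trainable model weights drops Adam's per-parameter
# moment estimates (its m/v slot variables), so a resumed run starts
# with a "cold" optimizer:
weights_only_saver = tf.train.Saver(var_list=tf.trainable_variables())

# With no var_list, the Saver collects every global variable, which
# includes the optimizer slots, so training can resume where it left off:
full_saver = tf.train.Saver()
```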
Hi, @gwern and others! Do you know -- can we fine-tune the gpt-2 for another task such as document scoring? |
Hi, I am using nshepperd's fork https://github.com/nshepperd/gpt-2/tree/finetuning to fine-tune the model on a custom dataset I created. But once I train it and use the trained model to generate text, it can only generate text in the style of the data I fine-tuned on; it forgets the original training released by OpenAI. So my question is: how do I fine-tune this model so that it can generate text both on what I trained it on and on what the default OpenAI release was trained on?
@HiDhineshRaja have you figured out why the model forgets what it was trained on by OpenAI when fine-tuning?
Have you tried a lower learning rate? Something like 1e-6 maybe. |
No, I left the learning rate parameter at its default value. I'll try a lower one.
I'm having trouble running your encode.py. After copying encode.py to the src folder, I run the following command:
And get the following message:
|
2 issues:
|
@rkfg I'm getting the following error while training:
|
Open a new issue in my repository because it's not really related to this original one. Post as many details as possible including the command line you're running. |
@kaihuchen I had that problem too and fixed it here: #156 |
How did you manage to encode the Chinese script?
@Zer0-dev115 Training on Hindi is more like retraining the entire model (not fine-tuning), so I would think you need really large datasets (at least on the order of several hundred MB, if not GB), and it could take several days of GPU time. You could do it in one of two ways: you can feed in Unicode text and also set up a simple bpe and json file for it, or, better, run learn_bpe.py from the BPE github on your Hindi text and generate the json and bpe.
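For example, a hedged sketch of that second route using subword-nmt's learn_bpe.py (the "BPE github" mentioned above); the merge count and file names are placeholders, and you would still need to build a matching encoder.json for GPT-2's encoder:

```python
import subprocess

# Learn a Hindi BPE merge table with subword-nmt's learn_bpe.py.
# Flag names follow subword-nmt's documented CLI; the merge count
# and file names are untested placeholders.
subprocess.run(
    "python learn_bpe.py -s 32000 < hindi_corpus.txt > hindi.bpe",
    shell=True, check=True)
```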