how to train it #19

Closed
jackylee1 opened this issue Feb 15, 2019 · 46 comments

Comments

@jackylee1 commented Feb 15, 2019

How do I train it?

@miweru commented Feb 15, 2019

That code is not released and may never be released.

@branc116 commented Feb 15, 2019

Can you create a model that accepts other languages (e.g. German, Russian)?

@WuTheFWasThat (Collaborator) commented Feb 20, 2019

@jackylee1 Sorry, as others mentioned, we're not releasing training code for now, as this repository is currently for exploring the existing model. You may find the model.py code in this repo useful, and there are plenty of other projects that train transformer models out there!

@branc116 we're not taking requests to train any models as we have limited bandwidth (especially if they require procuring a dataset differently!)

@marcpre commented Feb 24, 2019

@openai
Impressive results!

@WuTheFWasThat
Any suggestions for other transformer models that can be trained and played around with?

@armoreal mentioned this issue Feb 24, 2019
@guotong1988 commented Feb 25, 2019

Same question. Thank you.

@WuTheFWasThat (Collaborator) commented Feb 28, 2019

RE other pretrained models: definitely check out Google's BERT model and CMU/Google's Transformer-XL models

@gwern commented Mar 4, 2019

nshepperd has released training code for retraining GPT-2-small: https://github.com/nshepperd/gpt-2/tree/finetuning

It works: I've used it for retraining on anime plot synopses & Project Gutenberg poetry.

@guotong1988 commented Mar 4, 2019

Thank you @gwern

@kaihuchen commented Mar 5, 2019

@nshepperd @gwern Thank you so much for sharing the training code (ref: https://github.com/nshepperd/gpt-2/tree/finetuning) !

I was able to train a model (using a Chinese dataset) on top of the default 117M model without problems. However, when I try to generate samples (either conditional or unconditional), I get "FileNotFoundError: [Errno 2] No such file or directory" for encoder.json, hparams.json and vocab.bpe.

What would be the correct procedure for generating samples for the new model described above?

@gwern commented Mar 5, 2019

I am not entirely sure because you said you are using Chinese rather than English text.

For English text, all you need to do is copy over those missing files from the 117M model directory, which will have them (assuming you ran the download script, as you must have if you did any retraining of 117M).

Finetuning doesn't affect the BPE encoding details; it merely trains the Transformer model itself further. So the model still assumes exactly the same encoding as OA's 117M model, and that encoding is defined by those files.
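
A minimal sketch of that fix in Python (the source directory matches this repo's usual models/ layout; the destination directory name is a hypothetical example):

    import shutil

    SRC = 'models/117M'                  # the downloaded OpenAI model
    DST = 'models/my-finetuned-model'    # wherever your fine-tuned checkpoint lives

    # The fine-tuned model reuses 117M's BPE encoding, so copying these files is enough.
    for name in ('encoder.json', 'hparams.json', 'vocab.bpe'):
        shutil.copy(f'{SRC}/{name}', f'{DST}/{name}')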

@PapayasTehSkeletor commented Mar 6, 2019

@gwern @nshepperd
Hey Gwern, I'm a bit of a noob at all of this... Do you know how to use the finetuning code in Colab? That is, if there is a way to use it in Colab at all...

@gwern commented Mar 6, 2019

I don't know of any reason you couldn't do finetuning in Colab (the main restriction I'm aware of is you only get something like 12 GPU-hours? which is more than enough for many finetuning tasks). But I have little familiarity with it or interest in setting up a notebook to do the finetuning. Colab seems like a very restrictive tool compared to running on your own machine.

@guotong1988 commented Mar 7, 2019

@nshepperd @gwern
Have you evaluated the perplexity with https://github.com/nshepperd/gpt-2/tree/finetuning?
Thank you!!

@gwern commented Mar 7, 2019

On a heldout set? No idea. You'll have to code that yourself. (I suppose one dirty hack would be to set the learning rate to zero and 'train' on a 'new' dataset and watch the averaged loss...)
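
For context, a tiny sketch of how an averaged loss maps to perplexity (nothing here is specific to this repo; 3.3 is a made-up example value):

    import math

    def perplexity(avg_loss_nats: float) -> float:
        # Perplexity is exp of the average per-token cross-entropy loss (in nats).
        return math.exp(avg_loss_nats)

    print(perplexity(3.3))  # an averaged loss of 3.3 nats ~ perplexity 27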

@ak9250 commented Mar 7, 2019

@PapayasTehSkeletor I was able to set it up in Colab and train, if that is what you are looking for: https://github.com/ak9250/gpt-2-colab

@PapayasTehSkeletor commented Mar 8, 2019

@ak9250
Nice, thank you.

@PapayasTehSkeletor commented Mar 9, 2019

@ak9250 @gwern @nshepperd
Before my next comment, I want to make it clear that by "I'm a novice at this" I mean that I literally know nothing about this. However, ever since I heard the news about GPT-2, the AI has been on my mind and I wanted to try it myself. Even though I don't understand how it works, I still managed (with the help of information I collected on the internet) to use it in Colab (I think it's safer that way, since I'm not in danger of breaking something important in the process).

Having said that, I have just one last question: how can I save my modifications? I don't want to train GPT-2, lose everything when I turn off the computer, and have to spend another two hours training; I would also like to add more texts to train on without having to restart the runtime (i.e. starting again from zero).

From what I've read, I'm supposed to use "pull" or "fetch" to save the changes I make in the clone, but that still doesn't save the training.

If you don't want to help me, no problem. I honestly don't know what use I would make of the AI other than training it on hundreds of books and then seeing the result.

@ak9250 commented Mar 9, 2019

@PapayasTehSkeletor You will see a checkpoints folder at step 1000. Save that entire folder by mounting your Drive in Colab:
from google.colab import drive
drive.mount('/content/drive')
and then use !cp -r to copy those checkpoints to your Google Drive. Next time you can start off from the previously saved checkpoint.
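
A Python-only variant of that copy step, as a hedged sketch (shutil instead of !cp -r; the checkpoint path follows the notebook layout quoted later in this thread, and the Drive folder name is just an example):

    import shutil
    from google.colab import drive

    drive.mount('/content/drive')

    # Copy the run's checkpoint folder to Drive so it survives the Colab session.
    shutil.copytree('/content/gpt-2/checkpoint/run1',
                    '/content/drive/My Drive/gpt2-checkpoints/run1')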

@PapayasTehSkeletor commented Mar 9, 2019

@ak9250
It's working! Thank you very much!

@northerain commented Mar 15, 2019

@ak9250 It works all the way through, except for actually running it: no errors, it just does nothing. Am I doing something wrong?

@ak9250 commented Mar 15, 2019

@northerain While training, it should show some output in Colab. Which cell are you running? Also, you have to save a copy to Drive and run it in Colab.

@northerain commented Mar 15, 2019

@ak9250
I see the samples while training. When I run the command under "use your trained model", nothing happens.

@ak9250 commented Mar 15, 2019

@northerain Did you run this line to copy over the trained model? !cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/117M/

@northerain commented Mar 15, 2019

@ak9250 That's the line that does nothing: a loading icon, then nothing.

@ak9250 commented Mar 15, 2019

Yes, it won't show anything because it just copies files over to the model directory. Run the next line.

@northerain commented Mar 15, 2019

@ak9250
This one?
!python3 src/interactive_conditional_samples.py

@ak9250 commented Mar 15, 2019

Yes, or the other one, which shows unconditional samples.

@northerain commented Mar 15, 2019

Oh man, I'm dumb. Sorry to take up your time, but thank you for the help and for making this.

@Fermag commented Mar 16, 2019

Hello everyone, I'm new to Python and have a few questions. I was able to run it in Colab following ak9250's instructions, training on Russian text. During training, the samples contain incomprehensible, made-up words. Is that just part of the learning process and to be expected, or is this model simply not adapted to other languages? And roughly how long does training take in Colab?

@lopuhin commented Mar 19, 2019

Another training implementation, based on https://github.com/nshepperd/gpt-2/tree/finetuning, is here: https://github.com/lopuhin/transformer-lm - the main difference is that it uses the sentencepiece tokenizer, so it's possible to train it on your own language, not only English.
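
For reference, a hedged sketch of the sentencepiece step that makes other languages possible (this is not transformer-lm's actual code; file names and vocab size are made-up examples, and a reasonably recent sentencepiece version is assumed):

    import sentencepiece as spm

    # Train a BPE vocabulary on your own-language corpus instead of reusing
    # OpenAI's English-centric encoder.json/vocab.bpe.
    spm.SentencePieceTrainer.train(
        input='training_corpus.txt',
        model_prefix='sp',        # writes sp.model and sp.vocab
        vocab_size=50000,
        model_type='bpe')

    sp = spm.SentencePieceProcessor(model_file='sp.model')
    print(sp.encode('пример текста', out_type=str))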

@guotong1988 commented Mar 25, 2019

Could you please take a look at this issue: #108
Thank you.

@rkfg commented Mar 30, 2019

And another implementation, same nshepperd's fork but with scripts to process FB2 files: https://github.com/rkfg/gpt-2

It also uses the sentencepiece tokenizer. It's pretty easy to find a lot of Russian books in FB2 format (and some other languages as well, but not as many), so this fork primarily targets the Russian audience. Unfortunately there's still no README for these scripts, but the basic workflow is:

  • filterfb2.sh (to pick the books in the same language)
  • fb2totxt.sh (convert to plain .txt files)
  • concat.sh (concatenate all .txt into one file and insert <|n|> end of line tokens)
  • createspmodel.sh (prepare the BPE sentencepiece dictionary and model, also creates hparams.json; you can reuse those for other datasets if you wish)
  • encode.sh (produces an .npz ready to use with train.py).

I don't use validation: it's easy to build a dataset so huge that it would take months to go through it even once, so training effectively IS validation; it's practically impossible to overfit because the samples don't repeat.

@negacy commented Mar 30, 2019

@nshepperd: What is the appropriate format of the dataset to feed into GPT-2 for fine-tuning? Is it sentence-segmented, or could it be the entire document/corpus?

Thanks.

@CaptainValor commented Apr 5, 2019

@negacy An entire document works fine. I'm using a modified fork of @ak9250's Google Colab notebook. I've included code for building a single huge English training text file based on Project Gutenberg (books) or Access to Insight (Dharma texts). ATI seems to overfit quite rapidly, but a 99 MB subset of Gutenberg came out quite convincing after only 4 hours of training. YMMV: https://github.com/CaptainValor/gpt-2-colab
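
A rough sketch of that kind of corpus-building step (not CaptainValor's actual notebook code; paths are placeholders): concatenate plain-text books into one file, separated by the <|endoftext|> token that the GPT-2 encoder recognizes.

    import glob

    def build_corpus(input_glob: str, output_path: str) -> None:
        # Join every .txt book into a single training file, one <|endoftext|> per book.
        with open(output_path, 'w', encoding='utf-8') as out:
            for path in sorted(glob.glob(input_glob)):
                with open(path, encoding='utf-8', errors='ignore') as f:
                    out.write(f.read().strip())
                out.write('\n<|endoftext|>\n')

    build_corpus('gutenberg/*.txt', 'training_corpus.txt')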

@rkfg commented Apr 14, 2019

So I noticed that nshepperd's fork doesn't preserve the optimizer state in the checkpoints it generates. Be aware that this affects the Adam optimizer, because it maintains per-parameter statistics that effectively give each model parameter its own learning rate. If you restart training from a checkpoint, that information is lost and your learning rate is reset to the provided value (1e-4 by default). This results in a (slight) loss increase after the checkpoint is loaded, and I think it might break the learning process at the late stages (the model stops converging or starts diverging).

To store everything, you need to remove this line (or replace it with var_list=None); or, if you use an existing checkpoint and want to update it with the optimizer variables, you might want to create a new saver object with var_list=None at around this line, after restoring. The full checkpoint will then be saved later. Note that it's much bigger: about 1.9 GB instead of 400-500 MB.
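
A minimal sketch of that change, assuming TF1-style code like the fork's train.py (this is not the fork's exact source):

    import tensorflow as tf

    # var_list=None makes the Saver checkpoint every variable in the graph,
    # including Adam's per-parameter moment estimates, not just the model weights.
    saver = tf.train.Saver(var_list=None, max_to_keep=5)

    # Later, inside the training session:
    # saver.save(sess, 'checkpoint/run1/model', global_step=counter)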

@sanja7s commented Apr 17, 2019

nshepperd has released training code for retraining GPT-2-small: https://github.com/nshepperd/gpt-2/tree/finetuning

It works: I've used it for retraining on anime plot synopses & Project Gutenberg poetry.

Hi, @gwern and others!

Do you know -- can we fine-tune the gpt-2 for another task such as document scoring?

@HiDhineshRaja commented Apr 19, 2019

Hi, I am using nshepperd's fork https://github.com/nshepperd/gpt-2/tree/finetuning to fine-tune the model with a custom dataset I created. But once I train it and use the trained model to generate text, the model can only generate text in the style of the data I fine-tuned on and forgets the original training released by OpenAI. So my question is: how do I fine-tune this model so that it can generate text based on both my data and the original OpenAI training?

@tomasrasymas commented Apr 30, 2019

Hi, I am using nshepperd's fork https://github.com/nshepperd/gpt-2/tree/finetuning to fine-tune the model with a custom dataset I created. But once I train it and use the trained model to generate text, the model can only generate text in the style of the data I fine-tuned on and forgets the original training released by OpenAI. So my question is: how do I fine-tune this model so that it can generate text based on both my data and the original OpenAI training?

@HiDhineshRaja Have you figured out why the model forgets what OpenAI trained it on when fine-tuning?

@rkfg commented Apr 30, 2019

Have you tried a lower learning rate? Something like 1e-6 maybe.

@tomasrasymas commented Apr 30, 2019

No, I left the learning rate parameter at its default value. I will try a lower one.

@zendevil commented Jul 8, 2019

And another implementation, same nshepperd's fork but with scripts to process FB2 files: https://github.com/rkfg/gpt-2

It also uses the sentencepiece tokenizer. It's pretty easy to find a lot of Russian books in FB2 format (and some other languages as well, but not as many), so this fork primarily targets the Russian audience. Unfortunately there's still no README for these scripts, but the basic workflow is:

  • filterfb2.sh (to pick the books in the same language)
  • fb2totxt.sh (convert to plain .txt files)
  • concat.sh (concatenate all .txt into one file and insert <|n|> end of line tokens)
  • createspmodel.sh (prepare the BPE sentencepiece dictionary and model, also creates hparams.json; you can reuse those for other datasets if you wish)
  • encode.sh (produces an .npz ready to use with train.py).

I don't use validation: it's easy to build a dataset so huge that it would take months to go through it even once, so training effectively IS validation; it's practically impossible to overfit because the samples don't repeat.

I'm having trouble running your encode.py. After copying encode.py to the src folder, I run the following command:

./encode.py </home/psharma/gpt-2_fork/src/dataset.txt> /home/psharma/gpt-2_fork/src/output.npz

And get the following message:

usage: encode.py [-h] [--model_name MODEL] [--combine CHARS] PATH OUT.npz
encode.py: error: the following arguments are required: PATH, OUT.npz

@rkfg commented Jul 8, 2019

Two issues:

  1. You're supposed to run encode.sh, just like my post says, not encode.py.
  2. You should specify the path without <>. The <argument> notation means the argument is required, not that you should put it between < and >.

@zendevil commented Jul 8, 2019

@rkfg I'm getting the following error while training:

Traceback (most recent call last):
  File "train.py", line 221, in <module>
    main()
  File "train.py", line 92, in main
    os.path.join(CHECKPOINT_DIR, args.run_name))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/writer/writer.py", line 367, in __init__
    filename_suffix)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
    gfile.MakeDirs(self._logdir)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 442, in recursive_create_dir
    recursive_create_dir_v2(dirname)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 458, in recursive_create_dir_v2
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path), status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.FailedPreconditionError: models/345M/checkpoint/run1; Not a directory

@rkfg commented Jul 8, 2019

Open a new issue in my repository, because it's not really related to this original one. Post as many details as possible, including the command line you're running.

@GrahamboJangles commented Jul 19, 2019

@nshepperd @gwern Thank you so much for sharing the training code (ref: https://github.com/nshepperd/gpt-2/tree/finetuning) !

I was able to train a model (using a Chinese dataset) on top of the default 117M model without problems. However, when I try to generate samples (either conditional or unconditional), I get "FileNotFoundError: [Errno 2] No such file or directory" for encoder.json, hparams.json and vocab.bpe.

What would be the correct procedure for generating samples for the new model described above?

@kaihuchen I had that problem too and fixed it here: #156
