
how to train it #19

Closed
jackylee1 opened this issue Feb 15, 2019 · 49 comments
@jackylee1

How do I train it?

@miweru

miweru commented Feb 15, 2019

That code is not released, and it may never be released.

@branc116

Can you create a model that accepts other languages (e.g. German, Russian)?

@WuTheFWasThat
Contributor

@jackylee1 Sorry, as others mentioned, we're not releasing training code for now, as this repository is currently for exploring the existing model. You may find the model.py code in this repo useful, and there are plenty of other projects that train transformer models out there!

@branc116 we're not taking requests to train models, as we have limited bandwidth (especially for models that would require procuring a different dataset!)

@marcpre

marcpre commented Feb 24, 2019

@openai
Impressive results!

@WuTheFWasThat
Any suggestions for other transformer models that can be trained and experimented with?

@guotong1988

Same question. Thank you.

@WuTheFWasThat
Contributor

RE other pretrained models: definitely check out Google's BERT model and CMU/Google's Transformer-XL models

@gwern

gwern commented Mar 4, 2019

nshepperd has released training code for retraining GPT-2-small: https://github.com/nshepperd/gpt-2/tree/finetuning

It works: I've used it for retraining on anime plot synopses & Project Gutenberg poetry.

@guotong1988

Thank you @gwern

@kaihuchen

@nshepperd @gwern Thank you so much for sharing the training code (ref: https://github.com/nshepperd/gpt-2/tree/finetuning) !

I was able to train a model (using a Chinese dataset) on top of the default 117M model without problems. However, when I try to generate samples (either conditional or unconditional), I get "FileNotFoundError: [Errno 2] No such file or directory" for encoder.json, hparams.json and vocab.bpe.

What would be the correct procedure for generating samples with the new model described above?

@gwern

gwern commented Mar 5, 2019

I am not entirely sure because you said you are using Chinese rather than English text.

For English text, all you need to do is copy over those missing files from the 117M model directory, which will have them (assuming you ran the download script, as you must have if you did any retraining of 117M).

Fine-tuning doesn't affect the BPE encoding details; it is merely further training of the Transformer model itself. So the model still assumes exactly the same encoding as OA's 117M model, and that encoding is defined by those files.
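A minimal sketch of that copy step (the paths are illustrative and assume the models/ layout used by this repo's download script):

import shutil
from pathlib import Path

src = Path("models/117M")            # original model directory from the download script
dst = Path("models/117M-finetuned")  # hypothetical directory holding your new checkpoint
dst.mkdir(parents=True, exist_ok=True)

# The BPE/config files are unchanged by fine-tuning, so reusing the 117M copies is safe.
for name in ("encoder.json", "vocab.bpe", "hparams.json"):
    shutil.copy(src / name, dst / name)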

@PapayasTehSkeletor

@gwern @nshepperd
Hey Gwern, I'm a bit of a noob at all of this... Do you know how to use the fine-tuning code in Colab? That is, if there is a way to use it in Colab at all...

@gwern

gwern commented Mar 6, 2019

I don't know of any reason you couldn't do finetuning in Colab (the main restriction I'm aware of is you only get something like 12 GPU-hours? which is more than enough for many finetuning tasks). But I have little familiarity with it or interest in setting up a notebook to do the finetuning. Colab seems like a very restrictive tool compared to running on your own machine.

@guotong1988

@nshepperd @gwern
Have you evaluated the perplexity of https://github.com/nshepperd/gpt-2/tree/finetuning?
Thank you!

@gwern

gwern commented Mar 7, 2019

On a heldout set? No idea. You'll have to code that yourself. (I suppose one dirty hack would be to set the learning rate to zero and 'train' on a 'new' dataset and watch the averaged loss...)
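A rough sketch of turning that averaged loss into perplexity, assuming you have collected the per-token cross-entropy losses (in nats) from such a zero-learning-rate run; the numbers below are made up:

import math

heldout_losses = [3.12, 2.98, 3.05]          # hypothetical per-batch average losses
avg_loss = sum(heldout_losses) / len(heldout_losses)
perplexity = math.exp(avg_loss)              # perplexity = exp(mean cross-entropy)
print(f"held-out perplexity ~ {perplexity:.1f}")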

@ak9250

ak9250 commented Mar 7, 2019

@PapayasTehSkeletor I was able to set it up and train in Colab, if that is what you are looking for: https://github.com/ak9250/gpt-2-colab

@PapayasTehSkeletor

@ak9250
Nice, thank you.

@PapayasTehSkeletor

@ak9250 @gwern @nshepperd
Before my next comment, I want to make it clear that by "I'm a novice at this" I mean that I really don't know anything about this. However, ever since I heard the news about GPT-2, I have had the AI on my mind and wanted to try it myself. Although I don't know how it works, I still managed (with the help of information I collected on the internet) to use it in Colab (I think it's safer that way, since I'm not in danger of breaking something important in the process).

Having said that, I have just one last question: how can I "save" my modifications? I don't want to train GPT-2 and lose everything when I turn off the computer, having to spend two hours training all over again; I would also like to add more texts to train on without having to restart the runtime (i.e., starting again from zero).

From what I found, I'm supposed to use "pull" or "fetch" to save the changes I make in the clone, but that still doesn't save the training.

If you don't want to help me, no problem. I really don't know what use I would make of the AI anyway, other than training it on hundreds of books and then seeing the result.

@ak9250

ak9250 commented Mar 9, 2019

@PapayasTehSkeletor you will see a checkpoint folder after 1000 steps. Save that entire folder by mounting your Drive in Colab:
from google.colab import drive
drive.mount('/content/drive')
and then use !cp -r to copy those checkpoints to your Google Drive. Next time you can start off from the previously saved checkpoint.
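A rough sketch of that whole Colab cell, using Python's shutil in place of !cp -r; the Drive destination folder is illustrative:

from google.colab import drive
import shutil

drive.mount('/content/drive')

# Copy the fine-tuning checkpoints produced by nshepperd's train.py to Drive.
shutil.copytree('/content/gpt-2/checkpoint/run1',
                '/content/drive/My Drive/gpt2-checkpoints/run1')

# In a later session, copy the folder back to /content/gpt-2/checkpoint/run1
# before launching train.py again, so training resumes from the saved step.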

@PapayasTehSkeletor

@ak9250
It's working! Thank you very much!

@northerain

@ak9250 Works all the way through, except for running it. No errors, it just does nothing. Am I doing something wrong?

@ak9250

ak9250 commented Mar 15, 2019

@northerain while training it should show some output in Colab. Which cell are you running? Also, in Colab you have to save a copy of the notebook to your Drive and run it from there.

@northerain

northerain commented Mar 15, 2019

@ak9250
I see the samples when training. When I run the command under "use your trained model", nothing happens.

@ak9250

ak9250 commented Mar 15, 2019

@northerain did you run this line to copy over the trained model: !cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/117M/

@northerain

@ak9250 that's the line that does nothing. A loading icon, then nothing.

@ak9250

ak9250 commented Mar 15, 2019

Yes, it won't show anything because it just copies the files over to the model directory.
Run the next line.

@northerain

@ak9250
This one?
"!python3 src/interactive_conditional_samples.py"

@ak9250

ak9250 commented Mar 15, 2019

Yes, or the other one, which generates unconditional samples.

@northerain

Oh man, I'm dumb. Sorry to take up your time, but thank you for the help and for making this.

@Fermag

Fermag commented Mar 16, 2019

Hello everyone, I'm new to Python and have a few questions. I was able to run it in Colab following ak9250's instructions; for training I gave it text in Russian, and now during training the samples contain incomprehensible, made-up words. Is this part of the learning process and to be expected, or is this model simply not adapted to other languages? Approximately how long does training take in Colab?

@lopuhin

lopuhin commented Mar 19, 2019

Another training implementation, based on https://github.com/nshepperd/gpt-2/tree/finetuning, is here: https://github.com/lopuhin/transformer-lm - the main difference is that it uses the sentencepiece tokenizer, so it's possible to train it on your own language, not only on English.

@guotong1988

Could you please take a look at this issue: #108
Thank you.

@rkfg

rkfg commented Mar 30, 2019

And another implementation, the same nshepperd fork but with scripts for processing FB2 files: https://github.com/rkfg/gpt-2

It also uses the sentencepiece tokenizer. It's pretty easy to find a lot of Russian books in FB2 format (and some in other languages as well, but not as many), so this fork primarily targets a Russian audience. Unfortunately there's still no README for these scripts, but the basic workflow is:

  • filterfb2.sh (to pick the books in the same language)
  • fb2totxt.sh (convert to plain .txt files)
  • concat.sh (concatenate all .txt into one file and insert <|n|> end of line tokens)
  • createspmodel.sh (prepare the BPE sentencepiece dictionary and model, also creates hparams.json; you can reuse those for other datasets if you wish)
  • encode.sh (produces an .npz ready to use with train.py).

I don't use validation because it's easy to build a dataset so huge that it would take months to sample fully, so training essentially IS validation; it's practically impossible to overfit because the samples don't repeat.
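Not one of rkfg's scripts, but a rough Python sketch of the same sentencepiece step (train a BPE model on the concatenated corpus and encode it into an .npz for train.py); file names and vocabulary size are illustrative:

import sentencepiece as spm
import numpy as np

# 1. Train a BPE sentencepiece model on the concatenated corpus.
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=sp --vocab_size=50000 --model_type=bpe')

# 2. Encode the corpus into token ids and store them as a compressed .npz.
sp = spm.SentencePieceProcessor()
sp.Load('sp.model')
with open('corpus.txt', encoding='utf-8') as f:
    ids = sp.EncodeAsIds(f.read())
np.savez_compressed('corpus.npz', np.array(ids, dtype=np.int32))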

@negacy

negacy commented Mar 30, 2019

@nshepperd: What is the appropriate format of the dataset to feed into gpt-2 for fine-tuning? Should it be segmented into sentences, or can it be an entire document/corpus?

Thanks.

@CaptainValor

@negacy An entire document works fine. I'm using a modified fork of @ak9250's Google Colab notebook, and I've included code for building a single huge English training text file based on Project Gutenberg (books) or Access to Insight (Dharma texts). ATI seems to overfit quite rapidly, but a 99 MB subset of Gutenberg came out quite convincing after only 4 hours of training. YMMV: https://github.com/CaptainValor/gpt-2-colab
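A rough sketch of that concatenation step, assuming a folder of plain .txt books (the folder name is hypothetical); <|endoftext|> is the boundary token GPT-2's encoder already knows:

from pathlib import Path

books = sorted(Path('gutenberg_txt').glob('*.txt'))   # hypothetical input folder
with open('training.txt', 'w', encoding='utf-8') as out:
    for book in books:
        out.write(book.read_text(encoding='utf-8', errors='ignore'))
        out.write('\n<|endoftext|>\n')                 # mark the document boundary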

@rkfg

rkfg commented Apr 14, 2019

So I noticed that nshepperd's fork doesn't preserve the optimizer state in the checkpoints it generates. Be aware that this affects the Adam optimizer because it maintains per-parameter statistics that act as individual learning rates for the model parameters. If you restart training from a checkpoint, that information is lost and your learning rate is reset to the provided value (1e-4 by default). This results in a (slight) loss increase after the checkpoint is loaded, and I think it might break the learning process at late stages (the model stops converging or starts diverging).

To store everything, you need to remove this line (or replace it with var_list=None) or, if you use an existing checkpoint and want to update it with the optimizer variables, create a new saver object with var_list=None at around this line after restoring. The full checkpoint will then be saved later. Note that it's much bigger: about 1.9 GB instead of 400-500 MB.
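A rough TF1-style sketch of the difference (not the fork's actual code; the variable here is a stand-in for the GPT-2 weights):

import tensorflow as tf  # TF 1.x API

# A stand-in "model" variable; in the fork this would be the GPT-2 weights.
w = tf.Variable(tf.zeros([10, 10]), name='model/w')
opt = tf.train.AdamOptimizer(1e-4)
train_op = opt.minimize(tf.reduce_sum(w * w), var_list=[w])

partial_saver = tf.train.Saver(var_list=[w])  # model weights only; Adam's m/v slots are dropped
full_saver = tf.train.Saver(var_list=None)    # all variables, including the Adam slots

# Saving with full_saver keeps the per-parameter Adam statistics, so resuming
# from the checkpoint does not reset the optimizer state (at the cost of the
# much larger file noted above).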

@sanja7s

sanja7s commented Apr 17, 2019

nshepperd has released training code for retraining GPT-2-small: https://github.com/nshepperd/gpt-2/tree/finetuning

It works: I've used it for retraining on anime plot synopses & Project Gutenberg poetry.

Hi, @gwern and others!

Do you know whether we can fine-tune GPT-2 for another task, such as document scoring?

@HiDhineshRaja

Hi, I am using nshepperd's fork https://github.com/nshepperd/gpt-2/tree/finetuning to fine-tune the model on a custom dataset I created. But once I have trained it and use the trained model to generate text, it only generates text like the data I fine-tuned on and forgets the default behavior of the model released by OpenAI. So my question is: how do I fine-tune this model so that it can generate text both in the style of my data and in the style of the original OpenAI model?

@tomasrasymas

Hi, I am using nshepperd's fork https://github.com/nshepperd/gpt-2/tree/finetuning to fine-tune the model on a custom dataset I created. But once I have trained it and use the trained model to generate text, it only generates text like the data I fine-tuned on and forgets the default behavior of the model released by OpenAI. So my question is: how do I fine-tune this model so that it can generate text both in the style of my data and in the style of the original OpenAI model?

@HiDhineshRaja have you figured out why the model forgets what it was originally trained on by OpenAI when fine-tuning?

@rkfg

rkfg commented Apr 30, 2019

Have you tried a lower learning rate? Something like 1e-6 maybe.

@tomasrasymas

No, I left the default value for the learning rate parameter. I will try a lower one.

@zendevil

zendevil commented Jul 8, 2019

And another implementation, the same nshepperd fork but with scripts for processing FB2 files: https://github.com/rkfg/gpt-2

It also uses the sentencepiece tokenizer. It's pretty easy to find a lot of Russian books in FB2 format (and some in other languages as well, but not as many), so this fork primarily targets a Russian audience. Unfortunately there's still no README for these scripts, but the basic workflow is:

  • filterfb2.sh (to pick the books in the same language)
  • fb2totxt.sh (convert to plain .txt files)
  • concat.sh (concatenate all .txt into one file and insert <|n|> end of line tokens)
  • createspmodel.sh (prepare the BPE sentencepiece dictionary and model, also creates hparams.json; you can reuse those for other datasets if you wish)
  • encode.sh (produces an .npz ready to use with train.py).

I don't use validation because it's easy to build a dataset so huge that it would take months to sample fully, so training essentially IS validation; it's practically impossible to overfit because the samples don't repeat.

I'm having trouble running your encode.py. After copying encode.py to the src folder, I run the following command:

./encode.py </home/psharma/gpt-2_fork/src/dataset.txt> /home/psharma/gpt-2_fork/src/output.npz

And get the following message:

usage: encode.py [-h] [--model_name MODEL] [--combine CHARS] PATH OUT.npz
encode.py: error: the following arguments are required: PATH, OUT.npz

@rkfg

rkfg commented Jul 8, 2019

2 issues:

  1. You're supposed to run encode.sh, just like my post says, not encode.py.
  2. You should specify the path without <>. The <argument> notation means the argument is required, not that you should put it between < and >.

@rkfg

rkfg commented Jul 8, 2019

Right there, in scripts.

@zendevil

zendevil commented Jul 8, 2019

@rkfg I'm getting the following error while training:

Traceback (most recent call last):
  File "train.py", line 221, in <module>
    main()
  File "train.py", line 92, in main
    os.path.join(CHECKPOINT_DIR, args.run_name))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/writer/writer.py", line 367, in __init__
    filename_suffix)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
    gfile.MakeDirs(self._logdir)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 442, in recursive_create_dir
    recursive_create_dir_v2(dirname)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 458, in recursive_create_dir_v2
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path), status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.FailedPreconditionError: models/345M/checkpoint/run1; Not a directory

@rkfg

rkfg commented Jul 8, 2019

Open a new issue in my repository, because it's not really related to this one. Post as many details as possible, including the command line you're running.

@GrahamboJangles

GrahamboJangles commented Jul 19, 2019

@nshepperd @gwern Thank you so much for sharing the training code (ref: https://github.com/nshepperd/gpt-2/tree/finetuning) !

I was able to train a model (using a Chinese dataset) on top of the default 117M model without problems. However, when I try to generate samples (either conditional or unconditional), I get "FileNotFoundError: [Errno 2] No such file or directory" for encoder.json, hparams.json and vocab.bpe.

What would be the correct procedure for generating samples with the new model described above?

@kaihuchen I had that problem too and fixed it here: #156

@Zer0-dev115

@nshepperd @gwern Thank you so much for sharing the training code (ref: https://github.com/nshepperd/gpt-2/tree/finetuning) !

I was able to train a model (using a Chinese dataset) on top of the default 117M model without problems. However, when I try to generate samples (either conditional or unconditional), I get "FileNotFoundError: [Errno 2] No such file or directory" for encoder.json, hparams.json and vocab.bpe.

What would be the correct procedure for generating samples with the new model described above?

How did you manage to encode the Chinese script?
I am trying to train a model for Hindi, but my data is in Hindi script and I cannot train on it with nshepperd's code.

@ravi-annaswamy

@Zer0-dev115 Training on Hindi is more like retraining the entire model than fine-tuning it, so I would think you need a really large dataset (at least on the order of several hundred MB, if not GB), and it could take several days of GPU time.

And you could do it in one of two ways: you can feed in Unicode text and also set up a simple bpe and json file for it. A better way is to run learn_bpe.py from the BPE GitHub repo on your Hindi text and generate the json and bpe files.
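A rough sanity check for the non-English tokenization, assuming a sentencepiece model has already been trained on your Hindi corpus as in the earlier sentencepiece sketch ('hi.model' is a hypothetical file name):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('hi.model')                      # hypothetical model trained on Hindi text

text = 'यह एक परीक्षण वाक्य है।'            # "This is a test sentence."
ids = sp.EncodeAsIds(text)               # Devanagari -> subword ids
print(ids)
print(sp.DecodeIds(ids))                 # should closely match the input if coverage is sufficient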
