<a href="https://colab.research.google.com/github/pranjalchaubey/60-days-of-udacity-sixty-ai/blob/master/03%20Sixty%20AI%20Training/Sixty_AI_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Sixty AI - Train a GPT-2 on Text Corpus  

<br/> We will now use our processed text corpus to finetune OpenAI's GPT-2 model. 
<br/> Rather than working with the _original_ GPT-2 code from OpenAI directly, we will use _GPT-2 Simple_ by _**Max Woolf**_.
<br/> GPT-2 Simple is a Python package that wraps OpenAI's pretrained GPT-2 models and adds easy _finetuning_ and text generation. We will use our tiny text corpus to finetune the small _117M_ GPT-2 model, so that it starts generating text in the style of our corpus. This is _NLP Transfer Learning_ in action! 

<br/> For more information about `gpt-2-simple`, you can visit [this GitHub repository](https://github.com/minimaxir/gpt-2-simple).
 
 

In [0]:
# Install the GPT-2 Simple library 
!pip install -q gpt-2-simple

# Import Export business 
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

## GPU

At the time of writing, Colaboratory provides an Nvidia T4 GPU, which is slightly faster than the older Nvidia K80 for training GPT-2 and has more memory, allowing us to train the larger GPT-2 models and generate more text.

Let's verify which GPU is active.

In [12]:
!nvidia-smi

Thu Aug  1 19:17:39 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P0    29W /  70W |   6852MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-------
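
The same check can be done from Python, which is handy when a later cell should branch on the GPU type. This is a minimal sketch, assuming `nvidia-smi` is on the PATH. The helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and `query_gpus` returns an empty list when no NVIDIA driver is available.

```python
import shutil
import subprocess


def parse_gpu_csv(csv_text):
    """Parse 'name, memory.total' CSV rows into (name, memory) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, memory = (field.strip() for field in line.split(",", 1))
        gpus.append((name, memory))
    return gpus


def query_gpus():
    """Return [(name, total_memory), ...], or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


print(query_gpus())
```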

## Downloading GPT-2

We need to download the GPT-2 model first. 

There are two sizes of GPT-2:

* `117M` (default): the "small" model, 500MB on disk.
* `345M`: the "medium" model, 1.5GB on disk.

Larger models have more knowledge, but take longer to finetune and longer to generate text. 
<br/>We will use the smaller 117M model to start things off. 
<br/>The next cell downloads it from Google Cloud Storage and saves it in the Colaboratory VM at `models/<model_name>` (relative to the working directory).


In [2]:
# Download the smaller 117M model 
gpt2.download_gpt2(model_name="117M")

Fetching checkpoint: 1.05Mit [00:00, 267Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 93.5Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 845Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:04, 119Mit/s]                                   
Fetching model.ckpt.index: 1.05Mit [00:00, 257Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 154Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 172Mit/s]                                                       
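
Because the download is roughly 500MB, it can be worth skipping when the files are already on disk. This is a minimal sketch using only the standard library. The file list is copied from the download log above, the `models` directory is assumed to sit in the working directory, and the helper name `missing_model_files` is hypothetical.

```python
import os

# File names as they appear in the download log.
MODEL_FILES = [
    "checkpoint",
    "encoder.json",
    "hparams.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.bpe",
]


def missing_model_files(model_name="117M", model_dir="models"):
    """Return the expected model files that are not on disk yet."""
    folder = os.path.join(model_dir, model_name)
    return [
        name for name in MODEL_FILES
        if not os.path.isfile(os.path.join(folder, name))
    ]


missing = missing_model_files("117M")
if missing:
    print("Missing files, run gpt2.download_gpt2(model_name='117M'):", missing)
else:
    print("Model already downloaded.")
```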


## Mounting Google Drive

The best way to get the training text into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.
<br/>If you are wary of entering a Google Drive auth code (you should be!), I suggest you check what happens under the hood in the `gpt-2-simple` library by reading [its source file](https://github.com/minimaxir/gpt-2-simple/blob/master/gpt_2_simple/gpt_2.py). 

<br/>TL;DR: _It's Safe!_

In [3]:
gpt2.mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
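
The message above refers to Colab's own `drive.mount` call. This is a minimal sketch of mounting the drive directly, guarded so that it does nothing outside Colaboratory. The helper name `mount_drive_if_in_colab` is hypothetical.

```python
def mount_drive_if_in_colab(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colaboratory
    except ImportError:
        return False
    drive.mount(mount_point)  # asks for authorization on first use
    return True


print("Drive mounted:", mount_drive_if_in_colab())
```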


## Uploading our Text Corpus to be Trained 

Let's upload our text corpus through the _'Files'_ panel in the Colaboratory sidebar (this has to be done manually). 

In [0]:
file_name = "final_text_corpus.csv"
model_name = 'run1' # Run name for the checkpoint folder ('run1' is the default) 
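
Before finetuning, it can help to confirm the upload worked and to see how large the corpus is, since the `finetune()` notes in the next section suggest a lower learning rate for inputs under 1MB. This is a minimal sketch. The helper name `suggest_learning_rate` and the exact cut-off of 1,000,000 bytes are assumptions for illustration.

```python
import os


def suggest_learning_rate(num_bytes):
    """Rule of thumb from the finetune notes: 1e-5 under ~1MB, else 1e-4."""
    return 1e-5 if num_bytes < 1_000_000 else 1e-4


corpus_path = "final_text_corpus.csv"
if os.path.isfile(corpus_path):
    size = os.path.getsize(corpus_path)
    print(f"{corpus_path}: {size / 1e6:.2f} MB, "
          f"suggested learning_rate={suggest_learning_rate(size)}")
else:
    print(f"{corpus_path} not found; upload it via the Files panel first.")
```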

## Finetune GPT-2

Finally, it's time to _finetune_ GPT-2 on our extracted text corpus. 
<br/>The next cell starts the actual finetuning. It creates a persistent TensorFlow session that stores the training config, then runs the training for the specified number of `steps` (setting `steps=-1` runs the finetuning indefinitely).

The model checkpoints are saved in `checkpoint/run1` by default (relative to the working directory). A checkpoint is written every 500 steps and whenever the cell is stopped.
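
It helps to translate `steps` into passes over the data. This is a back-of-the-envelope sketch, assuming each step consumes one batch of 1024-token samples and that the batch size is 1. Both values are assumptions about the library's defaults, not something this notebook sets. The token count is the one reported in the training log.

```python
def passes_over_corpus(steps, corpus_tokens, tokens_per_sample=1024, batch_size=1):
    """Approximate how many times training sees the whole corpus."""
    tokens_seen = steps * batch_size * tokens_per_sample
    return tokens_seen / corpus_tokens


def checkpoint_steps(steps, save_every):
    """Steps at which a periodic checkpoint would be written."""
    return list(range(save_every, steps + 1, save_every))


print(round(passes_over_corpus(1000, 855013), 2))  # roughly 1.2 passes
print(checkpoint_steps(1000, 500))                 # [500, 1000]
```

Under these assumptions, 1000 steps amount to only a little more than one pass over the corpus, so a step is not the same thing as an epoch.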






In [7]:
"""
Parameters for gpt2.finetune(): 

restore_from: Set to 'fresh' to start training from the base GPT-2, 
  or set to 'latest' to restart training from an existing checkpoint.

sample_every: Number of steps between printed example outputs.

print_every: Number of steps between training progress printouts.

save_every: Number of steps between checkpoint saves.

learning_rate:  Learning rate for the training. 
  (default '1e-4', can lower to '1e-5' if you have <1MB input data)

run_name: subfolder within 'checkpoint' to save the model. 
  This is useful if you want to work with multiple models 
  (will also need to specify  'run_name' when loading the model)

overwrite: Set to 'True' if you want to continue finetuning an existing 
  model (w/ restore_from='latest') without creating duplicate copies. 
"""

# Start the tf session 
# LOL.....why they have a 'session' in tf?! :D 
sess = gpt2.start_tf_sess()

# We will train for 1000 steps, as I have noticed that beyond 1000
# steps the model starts to overfit on the data
gpt2.finetune(sess,
              dataset=file_name,
              model_name='117M',
              steps=1000,
              restore_from='fresh',
              run_name=model_name,
              print_every=10,
              sample_every=200,
              save_every=500
              )

W0801 16:33:05.579495 140450407880576 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/sample.py:17: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0801 16:33:21.663784 140450407880576 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.


Loading checkpoint models/117M/model.ckpt


100%|██████████| 1/1 [00:00<00:00, 15.83it/s]

Loading dataset...





dataset has 855013 tokens
Training...
[10 | 28.26] loss=3.07 avg=3.07
[20 | 50.50] loss=3.02 avg=3.04
[30 | 73.22] loss=2.92 avg=3.00
[40 | 96.38] loss=2.81 avg=2.95
[50 | 119.99] loss=2.56 avg=2.87
[60 | 143.29] loss=2.78 avg=2.86
[70 | 166.43] loss=2.44 avg=2.80
[80 | 189.68] loss=2.59 avg=2.77
[90 | 213.02] loss=2.43 avg=2.73
[100 | 236.32] loss=2.64 avg=2.72
[110 | 259.59] loss=1.86 avg=2.64
[120 | 282.85] loss=2.13 avg=2.59
[130 | 306.10] loss=2.18 avg=2.56
[140 | 329.37] loss=2.51 avg=2.56
[150 | 352.63] loss=1.60 avg=2.49
[160 | 375.86] loss=2.27 avg=2.47
[170 | 399.11] loss=2.09 avg=2.45
[180 | 422.37] loss=2.31 avg=2.44
[190 | 445.65] loss=2.39 avg=2.44
[200 | 468.93] loss=2.59 avg=2.45
 "of all the great moments of the day: #60daysofudacity" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "finally the day comes:   ". <|endoftext|>
<|startoftext|> thanks @bhadreshpsavani for being nice about your updates and keeping me on track . it's nice that we don’t have to wai

W0801 17:13:11.830547 140450407880576 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:960: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
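
The progress lines follow the pattern `[step | time] loss=... avg=...`, so they are easy to parse for plotting or for estimating throughput. This is a minimal sketch. The helper name `parse_progress` is hypothetical, and reading the second field as elapsed seconds is an interpretation of the log rather than something the notebook states.

```python
import re

PROGRESS_RE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")


def parse_progress(log_text):
    """Extract (step, seconds, loss, avg_loss) tuples from training output."""
    return [
        (int(step), float(secs), float(loss), float(avg))
        for step, secs, loss, avg in PROGRESS_RE.findall(log_text)
    ]


# Three lines copied from the training output above.
sample_log = """
[10 | 28.26] loss=3.07 avg=3.07
[100 | 236.32] loss=2.64 avg=2.72
[200 | 468.93] loss=2.59 avg=2.45
"""
rows = parse_progress(sample_log)
first, last = rows[0], rows[-1]
secs_per_step = (last[1] - first[1]) / (last[0] - first[0])
print(f"{secs_per_step:.2f} s/step, avg loss {first[3]} -> {last[3]}")
```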


Copy the trained model to Google Drive. The checkpoint folder is copied as a `.tar` archive.

In [0]:
gpt2.copy_checkpoint_to_gdrive(run_name=model_name)

## Generate Text From The Trained Model

Now that we have trained and/or loaded our finetuned model, it's time to generate text! 
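
Two of the knobs in the generation cell below, `temperature` and `top_k`, act on the model's next-token scores. This is a toy sketch in plain Python of the general technique: scale the scores by the temperature, optionally keep only the k best, then normalize. It illustrates the idea and is not `gpt-2-simple`'s implementation. The scores and the helper name `next_token_probs` are made up.

```python
import math


def next_token_probs(logits, temperature=1.0, top_k=0):
    """Softmax over temperature-scaled logits, optionally restricted to top_k."""
    scaled = [x / temperature for x in logits]
    if top_k > 0:
        k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[k - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
for t in (0.7, 1.0):
    print(t, [round(p, 3) for p in next_token_probs(logits, temperature=t)])
print("top_k=2", [round(p, 3) for p in next_token_probs(logits, top_k=2)])
```

Lower temperatures concentrate probability on the highest-scoring tokens, which is why higher values produce "crazier" text.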

In [10]:
'''
prefix: force the text to start with a given character sequence and generate 
  text from there 
nsamples: generate multiple texts at a time
batch_size: generate multiple samples in parallel, giving a massive speedup 
  (in Colaboratory, set a maximum of 20 for batch_size)
length: Number of tokens to generate (default 1023, the maximum)
temperature: The higher the temperature, the crazier the text 
  (default 0.7, recommended to keep between 0.7 and 1.0)
top_k: Limits the generated guesses to the top k guesses 
  (default 0 which disables the behavior; if the generated output is 
  super crazy, you may want to set top_k=40)
top_p: Nucleus sampling: limits the generated guesses to a 
  cumulative probability. (gets good results on a dataset with top_p=0.9)
truncate: Truncates the generated text at a given sequence, excluding that
  sequence (e.g. if truncate='<|endoftext|>', the returned text will include 
  everything before the first <|endoftext|>). It may be useful to combine this
  with a smaller length if the input texts are short.
include_prefix: If using truncate and include_prefix=False, the specified 
  prefix will not be included in the returned text.
'''


gpt2.generate(sess,
              length=512,
              temperature=0.8,
              prefix="Day: ",
              nsamples=10,
              batch_size=10
             )


Day:  done the project  <|endoftext|>
<|startoftext|> @mjmolinacontreras you had better get ready for some problems!!! and congratulations!!  i was wondering where to learn this !! <|endoftext|>
<|startoftext|> day 1: finished the videos of lesson 4. had to file a bug report with the firebase security team.   happy learning !! <|endoftext|>
<|startoftext|> day 1: 1. took the pledge of #60daysofudacity 2. finished lesson 2.3 3. started lesson 2.4 and started to understand the concepts deeply. 4. i would like to invite @preriec and @eileen.hertwig to join #60daysofudacity <|endoftext|>
<|startoftext|> *day:* 1. i took the pledge of #60daysofudacity. 2. completed 60 minute blitz of linear algebra in python 3. completed lesson 1 of intro to dl with pytorch youtube series 4. i would like to encourage @birozso, @casabiancadenny, @mariia.denysenko93, @arkachkrbrty <|endoftext|>
<|startoftext|> *day 1:* 1. i took the pledge 2. completed lesson 3 3. i encourage @sarahhelena.barmer and @djnavin6
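
Each generated sample contains several posts separated by `<|startoftext|>` and `<|endoftext|>` markers, as the output above shows. This is a minimal sketch for splitting one sample into individual posts. The helper name `split_posts` is hypothetical, and it assumes the generated text is available as a Python string rather than only printed.

```python
import re


def split_posts(sample):
    """Split generated text on the start/end markers and drop empty pieces."""
    pieces = re.split(r"<\|startoftext\|>|<\|endoftext\|>", sample)
    return [piece.strip() for piece in pieces if piece.strip()]


# Shortened excerpt of the generated output above.
sample = (
    "Day:  done the project  <|endoftext|>\n"
    "<|startoftext|> day 1: finished the videos of lesson 4. <|endoftext|>\n"
    "<|startoftext|> day 1: 1. took the pledge"
)
for post in split_posts(sample):
    print("-", post)
```

The last post of a sample is usually cut off because generation stops at `length` tokens, so it may be worth discarding.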

## Troubleshooting

If the notebook has errors (e.g. GPU Sync Fail or out-of-memory/OOM), force-kill the Colaboratory virtual machine and restart it with the command below:

In [0]:
!kill -9 -1