# GPT-2 Model training for ResiBot

ResiBot is a text-generation bot that was part of the Resilience2032 project, which took place in September 2020. The project was a thought experiment about what our world will look like in the year 2032. It explored a variety of issues related to racism, sexism, climate, etc., and asked what steps we could take to mitigate them. The project anticipated that AI bots would be commonplace by then, and ResiBot was developed to play the role of a contemporary AI of that time. The bot is trained on a corpus curated by the Resilience2032 writers' team so that its generated text gives the impression of having been written in 2032.

This notebook shows how to train (fine-tune) a GPT-2 model on our training corpus and generate a test paragraph of text from it.

In [None]:
# This cell mounts Google Drive. Upload the text file containing the training text to your Google Drive first.
# The notebook assumes the file is at the root of your Drive; if it is in a subfolder, adjust the path you use to locate it.
# %ls lists the files in the current working directory, and %pwd prints that directory.
# Run the cell, click the link that appears, log in, and enter the code in the box to mount the drive.

from google.colab import drive
# Mount at /content/drive (a mount point without spaces); "My Drive" then appears inside it.
drive.mount('/content/drive')

%ls

%pwd


Load all the libraries, including TensorFlow and gpt-2-simple

In [None]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple

import tensorflow as tf

import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

TensorFlow 1.x selected.
  Building wheel for gpt-2-simple (setup.py) ... done
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



Check the details of the GPU allocated to us by Google Colab

In [None]:
!nvidia-smi

Tue Sep  8 10:52:41 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Download the GPT-2 model. For our project we use the smallest model, which has 124 million parameters.

In [None]:
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 412Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 107Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 395Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:02, 174Mit/s]                                   
Fetching model.ckpt.index: 1.05Mit [00:00, 281Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 155Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 168Mit/s]                                                       


In [None]:
gpt2.mount_gdrive()


Enter the name of the training corpus file

In [None]:
# Supply the name of the text file to be used for training: replace the string below (shown in red in Colab) with your filename.
file_name="f_corpus.txt"
gpt2.copy_file_from_gdrive(file_name)
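
Before starting a long training run, it can help to confirm that the corpus file really landed in the Colab working directory and is not empty. The following is a minimal sketch, not part of gpt-2-simple: `summarize_corpus` is a hypothetical helper, and its word count is only a rough proxy for the token count reported at the start of training.

```python
from pathlib import Path


def summarize_corpus(path):
    """Return basic size statistics for a plain-text training corpus.

    Hypothetical helper for illustration; it is not part of gpt-2-simple.
    """
    corpus = Path(path)
    if not corpus.is_file():
        raise FileNotFoundError(f"Corpus file not found: {corpus}")

    # Replace undecodable bytes instead of failing on a stray character.
    text = corpus.read_text(encoding="utf-8", errors="replace")
    has_unterminated_last_line = bool(text) and not text.endswith("\n")
    return {
        "characters": len(text),
        "words": len(text.split()),
        "lines": text.count("\n") + (1 if has_unterminated_last_line else 0),
    }
```

In the notebook this could be called as `summarize_corpus(file_name)` right after the copy step; a missing file raises `FileNotFoundError` before any GPU time is spent.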


Start a session to train our GPT-2 Model

In [None]:
# Running this cell starts training (fine-tuning) the model.
# It may take a while, often 30 minutes or more.
# Sample text generated during training is displayed in the output every sample_every steps.

tf.reset_default_graph()
sess = gpt2.start_tf_sess()


gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              steps=1000,
              restore_from='fresh',
              run_name='AIR',
              print_every=10,
              sample_every=200,
              save_every=500
              )

Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:02<00:00,  2.63s/it]


dataset has 406448 tokens
Training...
[10 | 29.37] loss=3.54 avg=3.54
[20 | 51.69] loss=3.45 avg=3.50
[30 | 74.41] loss=3.26 avg=3.42
[40 | 97.77] loss=3.43 avg=3.42
[50 | 121.11] loss=3.40 avg=3.42
[60 | 144.09] loss=3.10 avg=3.36
[70 | 167.14] loss=3.16 avg=3.33
[80 | 190.38] loss=3.21 avg=3.32
[90 | 213.55] loss=3.44 avg=3.33
[100 | 236.69] loss=2.87 avg=3.28
[110 | 259.87] loss=3.01 avg=3.26
[120 | 283.11] loss=2.77 avg=3.21
[130 | 306.37] loss=2.95 avg=3.19
[140 | 329.61] loss=2.85 avg=3.17
[150 | 352.84] loss=3.01 avg=3.15
[160 | 376.05] loss=3.11 avg=3.15
[170 | 399.24] loss=2.55 avg=3.11
[180 | 422.41] loss=3.25 avg=3.12
[190 | 445.57] loss=2.76 avg=3.10
[200 | 468.73] loss=2.52 avg=3.07
 gas people.

But how to get people to stop buying products that are harmful and toward embracing an open alternative marketplace where health care is made non-discriminatory without barriers or mandates? The answer is through engaging with employers, insurers, consumers and technology companie
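
The progress lines in the log above have the form `[step | elapsed seconds] loss=... avg=...`, so they can be used to estimate how long the full run will take. The sketch below parses a few of those lines and computes seconds per step. The `approx_passes` helper is an assumption-laden estimate: it assumes a batch size of 1 and a 1024-token training window per step, which are not stated in this notebook.

```python
import re

# Matches progress lines such as "[200 | 468.73] loss=2.52 avg=3.07".
PROGRESS_RE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")


def parse_progress(log_text):
    """Extract (step, elapsed_seconds, loss, avg_loss) tuples from a training log."""
    return [
        (int(step), float(elapsed), float(loss), float(avg))
        for step, elapsed, loss, avg in PROGRESS_RE.findall(log_text)
    ]


def seconds_per_step(records):
    """Average wall-clock seconds per step between the first and last record."""
    if len(records) < 2:
        raise ValueError("Need at least two progress lines")
    first_step, first_time = records[0][0], records[0][1]
    last_step, last_time = records[-1][0], records[-1][1]
    return (last_time - first_time) / (last_step - first_step)


def approx_passes(steps, dataset_tokens, batch_size=1, window=1024):
    """Rough number of passes over the corpus.

    batch_size and window are assumptions, not values taken from this notebook.
    """
    return steps * batch_size * window / dataset_tokens


# Three lines copied from the training log above.
log = """
[10 | 29.37] loss=3.54 avg=3.54
[100 | 236.69] loss=2.87 avg=3.28
[200 | 468.73] loss=2.52 avg=3.07
"""
records = parse_progress(log)
rate = seconds_per_step(records)
print(f"{rate:.2f} s/step, about {rate * 1000 / 60:.0f} minutes for 1000 steps")
print(f"about {approx_passes(1000, 406448):.1f} passes over the corpus")
```

With these three lines the rate comes out at roughly 2.3 seconds per step, which puts the 1000-step run closer to 40 minutes than 30 on this GPU.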

Generate text from our freshly trained GPT-2 model

In [None]:
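def generate_text(prefix, temperature):
  # Start from a clean graph and session so the cell can be re-run safely.
  tf.reset_default_graph()
  sess = gpt2.start_tf_sess()

  # Load the fine-tuned checkpoint saved under checkpoint/AIR.
  gpt2.load_gpt2(sess, run_name='AIR')
  # return_as_list=True makes generate() return the samples instead of only printing them.
  gen_text = gpt2.generate(sess, model_name='124M', length=50, temperature=temperature,
                           prefix=prefix, nsamples=1, batch_size=1, return_as_list=True)
  return gen_text[0]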


Loading checkpoint checkpoint/AIR/model-1000
INFO:tensorflow:Restoring parameters from checkpoint/AIR/model-1000
Is the earth flat? Not really. The Earth’s surface is tilted 30 degrees to the center of the Milky Way.

The fact that the Milky Way is 30 degrees north or 30 degrees south is a known phenomenon. But the Earth’s surface is
Is the earth flat? That’s an interesting one. The argument goes that if the earth were lined up exactly one inch from the sun's surface, then the Earth’s gravitational field would be perturbed about the perimeter of the equator.

If
Is the earth flat? The scientists responsible for that prediction say it is not.

They add that “it is possible that the planet we know through the power of computers — the one with all the data — might be very different from what we know through the printing
Is the earth flat? Is the sun not moving? Is the moon not moving? The Sun-God hypothesis is that it is. The argument goes like this: If the Sun-God hypothesis is true, doe
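
The `temperature` argument passed to `generate_text` controls how adventurous the sampling is. As a rough illustration of the general technique (not gpt-2-simple's internal code), the sketch below divides a set of made-up logits by the temperature before applying a softmax. Lower temperatures concentrate probability on the most likely token, while higher temperatures flatten the distribution and make unusual continuations more likely.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens (illustrative values only).
logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running the loop shows the probability of the first token shrinking as the temperature rises, which is one way to think about why higher temperatures tend to produce stranger text from the same prompt.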