[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kamalkraj/minGPT-TF/blob/master/play_char.ipynb)

In [1]:
!git clone https://github.com/kamalkraj/minGPT-TF.git

Cloning into 'minGPT-TF'...
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 34 (delta 15), reused 24 (delta 9), pack-reused 0
Unpacking objects: 100% (34/34), done.


In [2]:
! pip install fastprogress==0.2.3

Collecting fastprogress==0.2.3
  Downloading https://files.pythonhosted.org/packages/a3/da/ffd8fe0daf7e679804a32a1e8654ac2988e2ef85937fc1d223e98eee736e/fastprogress-0.2.3-py3-none-any.whl
Installing collected packages: fastprogress
  Found existing installation: fastprogress 0.2.5
    Uninstalling fastprogress-0.2.5:
      Successfully uninstalled fastprogress-0.2.5
Successfully installed fastprogress-0.2.3


In [3]:
import os
os.chdir('minGPT-TF')

In [4]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2020-08-22 11:11:14--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2020-08-22 11:11:16 (70.2 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [5]:
import math
import numpy as np
import tensorflow as tf
from mingpt.model import GPT, GPTConfig

In [6]:
class CharDataset:

    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return math.ceil(len(self.data) / (self.block_size + 1))

    def __iter__(self):
        # instead of walking the text in order, "cheat" and draw each example
        # from a random starting offset in the dataset
        for _ in range(self.__len__()):
            i = np.random.randint(0, len(self.data) - (self.block_size + 1))
            chunk = self.data[i:i+self.block_size+1]
            dix = [self.stoi[s] for s in chunk]
            x = tf.convert_to_tensor(dix[:-1], dtype=tf.int32)
            y = tf.convert_to_tensor(dix[1:], dtype=tf.int32)
            yield x, y
    
    __call__ = __iter__
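
The class above maps characters to integer ids and yields input/target pairs in which the target is the input shifted by one character. The following is a minimal sketch of that idea on a toy string, using plain NumPy instead of TensorFlow. The helper names `build_vocab` and `make_example` are hypothetical and exist only for illustration.

```python
import numpy as np

def build_vocab(text):
    # Sorted unique characters give a stable char <-> id mapping.
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    return stoi, itos

def make_example(text, stoi, start, block_size):
    # Take block_size + 1 characters so input and target can overlap.
    chunk = text[start:start + block_size + 1]
    ids = np.array([stoi[ch] for ch in chunk], dtype=np.int32)
    # x is every id but the last; y is the same sequence shifted left by one.
    return ids[:-1], ids[1:]

toy_text = "hello world"
stoi, itos = build_vocab(toy_text)
x, y = make_example(toy_text, stoi, start=0, block_size=4)
print("input :", "".join(itos[int(i)] for i in x))  # hell
print("target:", "".join(itos[int(i)] for i in y))  # ello
```

At every position the model is asked to predict the next character, so a single chunk of `block_size + 1` characters supplies `block_size` training targets.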

In [7]:
block_size = 128  # context length: number of characters the model sees at once

In [8]:
with open('input.txt', 'r') as f:
    text = f.read()
train_dataset_gen = CharDataset(text, block_size) 

data has 1115394 characters, 65 unique.


In [9]:
train_dataset = tf.data.Dataset.from_generator(train_dataset_gen,(tf.int32,tf.int32))

In [10]:
from mingpt.model import GPT, GPTConfig
mconf = GPTConfig(train_dataset_gen.vocab_size, train_dataset_gen.block_size,
                  n_layer=8, n_head=8, n_embd=512)

In [11]:
from mingpt.trainer import Trainer, TrainerConfig

# initialize a trainer instance and kick off training
tconf = TrainerConfig(max_epochs=10, batch_size=128, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=512*20, final_tokens=200*len(train_dataset_gen)*block_size,
                      num_workers=4)
trainer = Trainer(GPT, mconf, train_dataset, len(train_dataset_gen), None, None, tconf)
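
The `lr_decay`, `warmup_tokens`, and `final_tokens` settings above describe a learning-rate schedule driven by the number of tokens processed. The sketch below assumes the trainer follows the schedule used in the original PyTorch minGPT: linear warmup over `warmup_tokens`, then cosine decay towards `final_tokens` with a floor of 10% of the base rate. That assumption is not confirmed against this TensorFlow port, and `lr_at` is a hypothetical helper, not part of `mingpt`.

```python
import math

def lr_at(tokens_seen, base_lr=6e-4, warmup_tokens=512 * 20,
          final_tokens=None, floor=0.1):
    # Assumed schedule (original minGPT): linear warmup, then cosine decay.
    if tokens_seen < warmup_tokens:
        mult = tokens_seen / max(1, warmup_tokens)
    else:
        progress = (tokens_seen - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return base_lr * mult

block_size = 128
data_size = 1115394  # character count reported when the dataset was built
samples_per_epoch = math.ceil(data_size / (block_size + 1))
tokens_per_epoch = samples_per_epoch * block_size
final_tokens = 200 * samples_per_epoch * block_size

for epoch in (1, 5, 10):
    lr = lr_at(epoch * tokens_per_epoch, final_tokens=final_tokens)
    print("epoch %d: lr %e" % (epoch, lr))
```

Because `final_tokens` corresponds to 200 epochs while `max_epochs` is 10, this run covers only the first 5% of the cosine curve, so the learning rate barely moves from 6e-4. Under the stated assumptions, the values printed here land close to the rates the trainer logs after each epoch.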

In [12]:
trainer.train()

epoch 1: train loss 374.67728. lr 5.999636e-04
epoch 2: train loss 312.57532. lr 5.998533e-04
epoch 3: train loss 293.06381. lr 5.996690e-04
epoch 4: train loss 266.85275. lr 5.994107e-04
epoch 5: train loss 245.73636. lr 5.990785e-04
epoch 6: train loss 228.52350. lr 5.986725e-04
epoch 7: train loss 215.94473. lr 5.981929e-04
epoch 8: train loss 206.24950. lr 5.976396e-04
epoch 9: train loss 197.78911. lr 5.970130e-04
epoch 10: train loss 191.63712. lr 5.963130e-04
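
The sampling cell that follows passes `temperature=0.9` and `top_k=5` to `sample`. As a rough illustration of what those two knobs usually mean, here is a NumPy sketch of a single sampling step: divide the logits by the temperature, keep only the `top_k` largest, and draw from the resulting softmax. `sample_next` is a hypothetical helper written for this note; the actual `mingpt.utils.sample` may differ in its details.

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None:
        # Mask everything below the k-th largest logit (ties at the cutoff are kept).
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    # Numerically stable softmax; masked entries get probability 0.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
toy_logits = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
draws = [sample_next(toy_logits, temperature=0.9, top_k=2, rng=rng) for _ in range(10)]
print(draws)  # only indices 5 and 6 can appear
```

A small `top_k` keeps the output from wandering into very unlikely characters, while sampling (rather than always taking the argmax) keeps the generated text from repeating itself.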


In [13]:
# alright, let's sample some character-level Shakespeare
from mingpt.utils import sample

context = "O God, O God!"
x = tf.convert_to_tensor([train_dataset_gen.stoi[s] for s in context], dtype=tf.int32)[None,...]
y = sample(trainer.model, x, 2000, temperature=0.9, sample=True, top_k=5)[0]
completion = ''.join([train_dataset_gen.itos[int(i)] for i in y])
print(completion)

O God, O God! think a the most have the signs
And that tell myself mine that
When they water they so see that hus a see
This way the statte, and that this save a from the
will but shall to be the seems of the contraiction
To see only that have the may that hast he wreturnk and his hands
That that whole though to to see so.

CORIOLANUS:
And traitior of your clow of this woes,
In he darking so my fortune and and here whom too
sight thou dread the father is that's show him they wish,
To shall how in my son a many and will and him.

LEONTES:
This new is a crown in thinking to take on the fair a feel
The seems of thing is forth and felcoment to so.

GLOUCESTER:
Ay, to hear in thy sorn trook of me were at see of a
do now otherself and she heart, and he has hath are they hath mird
And this, so the part to should shall nature and thee,
To spure the come of that she so danger in the will shall been;
And with they worshipp the crown of the since
Whereof thou shalt set a say.

Second Mercile:
I w