## CS310 Natural Language Processing
## Lab 11: Pretraining MinGPT 

Today's task is to build on Andrej Karpathy's `minGPT` project (original repo URL: https://github.com/karpathy/minGPT) by defining the dataset and training code for pretraining a **character-level** language model.

In [1]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset

from mingpt.model import GPT, GPTConfig
from mingpt.trainer import Trainer, TrainerConfig
from mingpt.utils import sample

### T1. Define the `CharDataset` class

`CharDataset` is a subclass of `torch.utils.data.Dataset` that reads a long string of text and returns an iterable sequence of character ID chunks.  

The length of the sequence is determined by the `__len__` method. The `__getitem__` method takes an integer `idx` as input and returns the `idx`-th **chunk** of character IDs.

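The size of this chunk is determined by the `block_size` parameter, i.e., the maximum context length for language modeling. The returned `x` and `y` each hold `block_size` character IDs from the same chunk, with `y` shifted one character to the right, so that `y[i]` is the character the model should predict after reading `x[0..i]`.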

For example, if the input text is "hello, world!", and `block_size=4`, then the first chunk will be `x="hell"` and `y="ello"`.
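The chunking rule above can be sketched on raw characters, before any ID encoding. This is only an illustration: `get_chunk` is a hypothetical helper, not part of `minGPT`, and it mirrors the slicing that `__getitem__` is expected to perform.

```python
text = "hello, world!"
block_size = 4


def get_chunk(text, idx, block_size):
    """Return the idx-th (x, y) pair of raw characters."""
    # Take block_size + 1 characters so that x and y can overlap by all but one.
    chunk = text[idx:idx + block_size + 1]
    # x drops the last character, y drops the first: y is x shifted by one.
    return chunk[:-1], chunk[1:]


# Every start position that still leaves room for block_size + 1 characters.
num_chunks = len(text) - block_size

first = get_chunk(text, 0, block_size)
last = get_chunk(text, num_chunks - 1, block_size)

print(num_chunks)  # 9
print(first)       # ('hell', 'ello')
print(last)        # ('orld', 'rld!')
```

Because consecutive chunks start one character apart, they overlap heavily; this is why the dataset length is close to the number of characters in the text.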

In [2]:
class CharDataset(Dataset):
    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size
    
    def __getitem__(self, idx):
        ### START YOUR CODE ###
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every character to an integer
        ids = [self.stoi[ch] for ch in chunk]

        # Convert to tensor
        x = torch.tensor(ids[:-1], dtype=torch.long)
        y = torch.tensor(ids[1:], dtype=torch.long)
        ### END YOUR CODE ###

        return x, y

In [3]:
# Test
sample_data = 'hello world!'
sample_dataset = CharDataset(sample_data, block_size=4)
print('chunk 0:', sample_dataset[0])

# You should see the expected output as follows:
# data has 12 characters, 9 unique.
# chunk 0: (tensor([4, 3, 5, 5]), tensor([3, 5, 5, 6]))

data has 12 characters, 9 unique.
chunk 0: (tensor([4, 3, 5, 5]), tensor([3, 5, 5, 6]))
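
To see where these IDs come from, the vocabulary built in `__init__` can be reproduced in plain Python. This is a small sketch that repeats the same sorting and enumeration steps without `torch`.

```python
sample_data = 'hello world!'

# Sorting the unique characters fixes the ID of each character.
chars = sorted(set(sample_data))
stoi = {ch: i for i, ch in enumerate(chars)}

x_ids = [stoi[ch] for ch in 'hell']
y_ids = [stoi[ch] for ch in 'ello']

print(chars)  # [' ', '!', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(x_ids)  # [4, 3, 5, 5]
print(y_ids)  # [3, 5, 5, 6]
```

The space and `!` sort before the letters, which is why `h` maps to 4 rather than to a smaller ID.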


### T2. Use the provided `trainer`

First, load a more substantial dataset, such as sampled text from Shakespeare's works.

In [4]:
block_size = 128
# read the whole corpus and close the file afterwards
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
train_dataset = CharDataset(text, block_size)

data has 1115394 characters, 65 unique.


Second, initialize the `GPT` model with suitable hyperparameters.

In [5]:
model_config = GPTConfig(
    train_dataset.vocab_size, 
    train_dataset.block_size,
    n_layer=8, n_head=8, n_embd=512)
model = GPT(model_config)
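
As a rough sanity check on these hyperparameters, the sketch below counts the parameters of a GPT with this configuration. The layer layout is an assumption (a standard GPT block with biased linear layers, two LayerNorms per block, a 4x-wide MLP, a final LayerNorm, and an output head without bias), and `count_gpt_params` is a hypothetical helper, not part of `minGPT`. Compare the result with the count of your own model before relying on it.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameters of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def count_gpt_params(vocab_size, block_size, n_layer, n_embd):
    """Estimate the parameter count under the assumed layout described above."""
    layer_norm = 2 * n_embd  # scale and shift vectors
    # Query, key, value, and output projections.
    attention = 4 * linear_params(n_embd, n_embd)
    # Two-layer MLP with a hidden size of 4 * n_embd.
    mlp = linear_params(n_embd, 4 * n_embd) + linear_params(4 * n_embd, n_embd)
    block = 2 * layer_norm + attention + mlp
    # Token embeddings plus learned positional embeddings.
    embeddings = vocab_size * n_embd + block_size * n_embd
    head = linear_params(n_embd, vocab_size, bias=False)
    return embeddings + n_layer * block + layer_norm + head


total = count_gpt_params(vocab_size=65, block_size=128, n_layer=8, n_embd=512)
head_dim = 512 // 8  # n_embd must be divisible by n_head

print(f"{total:,} parameters")  # 25,352,192 parameters under these assumptions
print(head_dim)                 # 64 dimensions per attention head
```

Almost all of the parameters sit in the eight transformer blocks; with only 65 characters in the vocabulary, the embedding and output layers are comparatively tiny.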

Now, let's initialize a `Trainer` and start training! It may take a long time to finish one epoch, but you can stop it at any time.

Notes:
- The `Trainer` class supports loading data in multiple worker processes, but to make it work in a Jupyter notebook, we set `num_workers=0` and run in a single process.
- `ckpt_path` specifies the path to save the model. By default, it saves the model every epoch. Set it to `None` if you don't want to save the model.
- No test data is specified, so the third argument of `Trainer` is set to `None`.
- Explore other parameters as you like in `trainer.py`.

In [None]:
trainer_config = TrainerConfig(max_epochs=2, batch_size=64, 
                      learning_rate=6e-4, lr_decay=True, 
                      warmup_tokens=512*20, final_tokens=2*len(train_dataset)*block_size,
                      ckpt_path='mingpt_ckpt.pth', num_workers=0)
trainer = Trainer(model, train_dataset, None, trainer_config)
trainer.train()
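
The numbers in this configuration can be worked out by hand. The sketch below computes the size of one epoch and the value of `final_tokens`, then models the learning rate as a linear warmup followed by a cosine decay with a floor at 10% of the base rate. The schedule is an assumption about what `lr_decay=True` does, and `lr_at` is a hypothetical helper; check `trainer.py` for the actual rule.

```python
import math

data_size = 1115394          # characters in input.txt, as printed above
block_size = 128
batch_size = 64
max_epochs = 2
learning_rate = 6e-4
warmup_tokens = 512 * 20

num_chunks = data_size - block_size                     # len(train_dataset)
batches_per_epoch = math.ceil(num_chunks / batch_size)  # iterations per epoch
final_tokens = max_epochs * num_chunks * block_size     # tokens seen by the end


def lr_at(tokens_seen):
    """Assumed schedule: linear warmup, then cosine decay down to 10%."""
    if tokens_seen < warmup_tokens:
        multiplier = tokens_seen / max(1, warmup_tokens)
    else:
        progress = (tokens_seen - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        multiplier = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return learning_rate * multiplier


print(num_chunks)         # 1115266
print(batches_per_epoch)  # 17427
print(final_tokens)       # 285508096
for tokens in (0, warmup_tokens, final_tokens // 2, final_tokens):
    print(tokens, lr_at(tokens))
```

With more than 17,000 batches per epoch, a single epoch is slow on a CPU, which is why stopping early and saving a checkpoint is a reasonable option for this lab.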

You can also manually save the model by calling `trainer.save_checkpoint()`.

In [10]:
trainer.save_checkpoint()

Now, you should see the model saved to `mingpt_ckpt.pth`, though it is not fully trained yet.

### T3. Sample from the model



`minGPT` provides a `sample` function (in `mingpt.utils`) that generates a completion for a given prompt.

In [11]:
prompt = "O God, O God!"

x = torch.tensor([train_dataset.stoi[s] for s in prompt], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 100, temperature=1.0, sample=True, top_k=10)[0]

completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

O God, O God!
Thoriu ti thad y ar we tot he bed d d withe tes hit tountoundd theild whas,

Bithinds t ther as,
To
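
The `temperature` and `top_k` arguments used above control how the next character is drawn from the model's output scores. The following is a minimal pure-Python sketch of the general idea, not `minGPT`'s actual code: `top_k_sample` is a hypothetical helper and the logits are made-up values.

```python
import math
import random


def top_k_sample(logits, k, temperature=1.0, rng=random):
    """Sample an index from the k highest-scoring logits."""
    # A lower temperature sharpens the distribution, a higher one flattens it.
    scaled = [value / temperature for value in logits]
    # Mask everything below the k-th largest score (ties at the cutoff are kept).
    cutoff = sorted(scaled, reverse=True)[k - 1]
    masked = [value if value >= cutoff else float('-inf') for value in scaled]
    # Softmax over the surviving scores; exp(-inf) evaluates to 0.0.
    peak = max(masked)
    exps = [math.exp(value - peak) for value in masked]
    total = sum(exps)
    probs = [value / total for value in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-character vocabulary
index, probs = top_k_sample(logits, k=2, rng=random.Random(0))
print(index, probs)
```

With `k=2`, only the two most likely characters can ever be chosen, which limits how far an undertrained model can wander into very unlikely characters.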


Of course, it does not read like Shakespeare at all, because your model has not been trained for long enough.

Instead, you can load a model that somebody else has already trained. Download `mingpt_model.pth` from the course website, and load the model weights with `torch.load`.

Note that the provided model was trained on a GPU, so you need to specify `map_location=torch.device('cpu')` when loading it.


In [12]:
### START YOUR CODE ###
pretrained_weight = torch.load('mingpt_model.pth', map_location=torch.device('cpu'))
### END YOUR CODE ###

print(type(pretrained_weight))

<class 'collections.OrderedDict'>


The loaded `pretrained_weight` is merely a dictionary of parameters (a state dict), not a `GPT` model instance.

So next, you need to instantiate a new `GPT` model, and load the weights using the `model.load_state_dict` method.
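
The same save-and-load round trip can be tried on a tiny module first. This sketch uses `nn.Linear` as a stand-in for `GPT` and an in-memory buffer as a stand-in for the `.pth` file; both are assumptions made only to keep the example small.

```python
import io

import torch
import torch.nn as nn

# A small module standing in for the trained model.
source = nn.Linear(4, 2)

# Save only the parameters (the state dict), not the module object itself.
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)
buffer.seek(0)

# Loading returns a dictionary of tensors, mapped onto the CPU.
weights = torch.load(buffer, map_location=torch.device('cpu'))

# A fresh module with the same architecture receives the saved parameters.
target = nn.Linear(4, 2)
result = target.load_state_dict(weights)

same = all(
    torch.equal(a, b)
    for a, b in zip(source.state_dict().values(), target.state_dict().values())
)
print(list(weights.keys()))
print(result)
print(same)
```

Loading only succeeds when the new module has the same parameter names and shapes as the saved one, which is why the new `GPT` instance must be built with a configuration matching the one used to train the saved weights.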

In [13]:
### START YOUR CODE ###
model_pretrained = GPT(model_config)
model_pretrained.load_state_dict(pretrained_weight)
### END YOUR CODE ###

<All keys matched successfully>

Now, re-run the generation code to see if there is any improvement.

In [14]:
prompt = "O God, O God!"

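# place the prompt on the same device as the loaded model (its weights were mapped to the CPU)
x = torch.tensor([train_dataset.stoi[s] for s in prompt], dtype=torch.long)[None,...].to(next(model_pretrained.parameters()).device)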
y = sample(model_pretrained, x, 100, temperature=1.0, sample=True, top_k=10)[0]

completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

O God, O God! my lames more?

BONELO:
There we said, we droth is fast and is.
All, in my stracke, which is a the 


The generated text should read more "Shakespearean" than before.

Congratulations! You have successfully completed the lab. Here are a few things you can check out next:
- `minGPT` is no longer actively maintained, but its successor `nanoGPT` is available at: https://github.com/karpathy/nanoGPT
- As the author puts it, `nanoGPT` "prioritizes teeth over education", which means you can use it to train your own GPT-2-level models, given enough data and GPUs.