This notebook illustrates my observations on the GPT-2 model. I have split the code into the important individual components used in the GitHub repo of this project (https://github.com/openai/gpt-2).

It also takes a step-by-step approach, building towards the final LM inference output, to get a better understanding of how the model works and to come up with use cases.

In [1]:
import sys, os, json
import tensorflow as tf
import regex as re
sys.path.append("src/")
import encoder, sample, model

model_name="117M"
cache = {}

def load_encoder_json():
    with open(os.path.join('models', model_name, 'encoder.json'), 'r') as f:
        encoder_json = json.load(f)
    return encoder_json
def load_bpe_merges():
    with open(os.path.join('models', model_name, 'vocab.bpe'), 'r', encoding="utf-8") as f:
        bpe_data = f.read()
    return [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
def overwrite_model_params(hparams):
    with open(os.path.join('models', model_name, 'hparams.json')) as f:
        hparams.override_from_dict(json.load(f))




# Get the Encoder

Encoding an input sentence involves the following steps:
1. Find the word tokens in the sentence with a regex ("I love apples" --> ["I", " love", " apples"]); notice that the leading space stays attached to the word.
2. Encode each token as UTF-8 bytes and map every byte to a printable unicode character using the byte-to-unicode table (refer to bytes_to_unicode in encoder.py), instead of working with raw ord() values.
3. Apply BPE to the tokens obtained from step 2; this returns each token as a string of BPE sub-tokens separated by spaces.
4. Split that string (.split(" ")) and look up the index of each BPE sub-token in encoder_json, which is just the vocabulary (word2index) used for training the GPT-2 model.
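Step 2 is the least obvious one, so here is a minimal standalone sketch of it. The function below is my own restatement of the byte-to-unicode table as I understand it from encoder.py; the name `bytes_to_unicode_sketch` and the `demo_*` variables are made up for this illustration and are not part of the repo.

```python
def bytes_to_unicode_sketch():
    """Map every byte value (0-255) to a printable unicode character."""
    # Bytes that already correspond to printable characters keep that character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte (space, control bytes, ...) is moved to code point 256 and up.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))


demo_byte_encoder = bytes_to_unicode_sketch()
demo_byte_decoder = {v: k for k, v in demo_byte_encoder.items()}

token = " love"
# Encode to UTF-8 bytes, then replace each byte with its printable stand-in.
mapped = "".join(demo_byte_encoder[b] for b in token.encode("utf-8"))
# Invert the mapping to get the original string back.
restored = bytearray(demo_byte_decoder[c] for c in mapped).decode("utf-8")

print(mapped)    # the leading space shows up as 'Ġ'
print(restored == token)
```

This is why many entries in encoder.json start with 'Ġ': it is the stand-in for the space that precedes a word.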

In [12]:
# load encoder json: vocabulary of 50,257 tokens (token to index); the inverted dict is the decoder (index to token)
encoder_json = load_encoder_json()
decoder_json = {v:k for k,v in encoder_json.items()}
# load bpe data: an ordered list of symbol-pair merges; the position in the file is the merge priority used by bpe() (lower rank is merged first)
bpe_merges = load_bpe_merges()
bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))

# byte_encoder maps each byte value to a printable unicode character, byte_decoder is the inverse map (refer bytes_to_unicode in encoder.py)
byte_encoder = encoder.bytes_to_unicode()
byte_decoder = {v:k for k, v in byte_encoder.items()}

In [28]:
# useful functions for encoding and decoding word tokens
def bpe(token):
    """function to get bpe token from a regular token, taken from encoder.py """
    if token in cache:
        return cache[token]
    word = tuple(token)
    pairs = encoder.get_pairs(word)
    if not pairs:
        return token
    while True:
        bigram = min(pairs, key = lambda pair: bpe_ranks.get(pair, float('inf')))
        if bigram not in bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
                new_word.extend(word[i:j])
                i = j
            except ValueError:  # `first` does not occur in the rest of the word
                new_word.extend(word[i:])
                break

            if word[i] == first and i < len(word)-1 and word[i+1] == second:
                new_word.append(first+second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_word = tuple(new_word)
        word = new_word
        if len(word) == 1:
            break
        else:
            pairs = encoder.get_pairs(word)
    word = ' '.join(word)
    cache[token] = word
    return word

def get_bpe_tokens(text):
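    """function to get the bpe token indices for a given text"""
    # regex that splits the text into contractions, letter runs, digit runs, punctuation runs and whitespace (a leading space stays with its token)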
    pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
    bpe_tokens = []
    for token in re.findall(pat, text):
        token = ''.join(byte_encoder[b] for b in token.encode('utf-8'))
        bpe_tokens.extend(encoder_json[bpe_token] for bpe_token in bpe(token).split(' '))
    return bpe_tokens

def decode_output(tokens):
    """decode output from the LM, convert word indices to text using the decoder json and byte decoders"""
    text = ''.join([decoder_json[token] for token in tokens])
    text = bytearray([byte_decoder[c] for c in text]).decode('utf-8', errors='replace')
    return text
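To see what the merge loop in bpe() does without loading vocab.bpe, here is a small self-contained sketch of the same idea. The merge table `toy_ranks` is invented for this example (the real ranks come from vocab.bpe), and `toy_bpe` is a simplified stand-in, not the repo's function.

```python
def get_pairs(word):
    """Return the set of adjacent symbol pairs in a sequence of symbols."""
    return set(zip(word, word[1:]))


def toy_bpe(token, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until none is left."""
    word = tuple(token)
    while len(word) > 1:
        pairs = get_pairs(word)
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        first, second = best
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                merged.append(first + second)  # apply the merge
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
    return list(word)


# Hypothetical merge priorities: lower number = merged earlier.
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
toy_result = toy_bpe("lower", toy_ranks)
print(toy_result)  # ['low', 'er']
```

The token starts as single characters, and each pass applies the best-ranked merge wherever it occurs: l+o, then lo+w, then e+r. The pair (low, er) is not in the table, so the loop stops there.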

In [31]:
[e for e in encoder_json.items()][:20]

[('!', 0),
 ('"', 1),
 ('#', 2),
 ('$', 3),
 ('%', 4),
 ('&', 5),
 ("'", 6),
 ('(', 7),
 (')', 8),
 ('*', 9),
 ('+', 10),
 (',', 11),
 ('-', 12),
 ('.', 13),
 ('/', 14),
 ('0', 15),
 ('1', 16),
 ('2', 17),
 ('3', 18),
 ('4', 19)]

# Inference parameters

In [25]:
nsamples = 1 # number of samples
batch_size = 1 
temperature = 1 # the logits are divided by this before sampling; values below 1 make the output more conservative, values above 1 more random
top_k = 40 # the next token is sampled (tf.multinomial) from the 40 most likely predictions of the LM
hparams = model.default_hparams() # model params
# overwrite the defaults with the values from hparams.json
overwrite_model_params(hparams)
length = hparams.n_ctx // 2 # number of tokens to generate; n_ctx is 1024 for this model, so this is 512
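A rough pure-Python sketch of what temperature and top_k do to the next-token distribution. This is not the TensorFlow code from sample.py; the function name `top_k_probs` and the logits are made up for illustration, and I am assuming the order is: scale by temperature, keep the top k logits, then sample.

```python
import math
import random


def top_k_probs(logits, k, temperature=1.0):
    """Turn logits into sampling probabilities after temperature scaling and top-k filtering."""
    scaled = [x / temperature for x in logits]
    if k > 0:
        cutoff = sorted(scaled, reverse=True)[k - 1]  # k-th largest logit
        # everything below the cutoff gets zero probability
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5, -1.0]
probs = top_k_probs(toy_logits, k=2, temperature=1.0)
probs_sharp = top_k_probs(toy_logits, k=2, temperature=0.5)

random.seed(0)
next_token = random.choices(range(len(probs)), weights=probs)[0]

print(probs)        # only the two largest logits keep a non-zero probability
print(probs_sharp)  # a lower temperature puts more weight on the top token
print(next_token)
```

With top_k = 40 the model never picks a token outside its 40 best guesses, which avoids very unlikely words while still leaving room for variety.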

# Input Text

In [22]:
text = "i love football"

# LM Sequence Generation

In [23]:
# the initial part of this code loads the model and the sampling ops into the tf graph
with tf.Session(graph=tf.Graph()) as sess:
    # input context
    context = tf.placeholder(tf.int32, [batch_size, None])
    
    # sample_sequence uses tf.multinomial to sample each next token from the top_k predictions
    output = sample.sample_sequence(
        hparams=hparams, length=length,
        context=context,
        batch_size=batch_size,
        temperature=temperature, top_k=top_k
    )
    
    # load model 
    saver = tf.train.Saver()
    ckpt = tf.train.latest_checkpoint(os.path.join('models', model_name))
    saver.restore(sess, ckpt)
    
    # get model context tokens
    context_tokens = get_bpe_tokens(text)
    
    # run the session and generate samples based on the input context
    generated = 0    
    for _ in range(nsamples // batch_size):
        # sess.run returns the context followed by 'length' generated tokens; the slice below drops the context
        out = sess.run(output, feed_dict={
            context: [context_tokens for _ in range(batch_size)]
        })[:, len(context_tokens):]
        
        # decode the output
        for i in range(batch_size):
            generated += 1
            generated_text = decode_output(out[i])  # separate name so the input `text` is not overwritten
            print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
            print(generated_text)

INFO:tensorflow:Restoring parameters from models/117M/model.ckpt
! The same goes for the rest of us.

Boris will take him to the London game where we watch him play for a very short amount of time.

And you just know what happens to him? He's going to have two more seasons at a rate of around £20 million.

You know what? He gets out of this job.<|endoftext|>A few weeks ago I posted about the problem with the NFS in my article "How to use DAGs for storage on S/M files". So I thought the following would be a good place to share with you the first step in what I've learned.

How do DAGs work to store a file on a partition of a machine?

A DAG needs some RAM to store the file in. For this we have to first decide on the "storage" type. In the example above, we have a file system partition. DAGs can store a large number of hard disks in the drive (a few thousand are used all the time. The partition contains the filesystems of the filesystem for which we want DAGs to be able to store the data

In [27]:
out.shape

(1, 512)
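The sample above keeps going after an <|endoftext|> marker and switches to an unrelated document, because the sampler always produces all 512 tokens. A minimal sketch, assuming only the text before the first marker is wanted; `first_document` is a hypothetical helper of my own, not something from the repo.

```python
END_OF_TEXT = "<|endoftext|>"


def first_document(sample_text):
    """Keep only the part of a generated sample that comes before the first end-of-text marker."""
    return sample_text.split(END_OF_TEXT, 1)[0]


demo_sample = "You know what? He gets out of this job.<|endoftext|>A few weeks ago I posted about the problem"
trimmed = first_document(demo_sample)
print(trimmed)
```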