<a href="https://colab.research.google.com/github/gpandu/CodeGenGPT/blob/main/CodeGPT_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Training GPT model is done using following steps.
1. Load and prepare dataset for tokenizer training.
2. Train BPE tokenizer from scratch.
3. Prepare tensorflow dataset with generator.
4. Create layer and model classes for GPT



*   We will use tokenizers library from HuggingFace to train tokenizer from scratch.
*   We will also use datasets to load the python "code_search_net" dataset. It has ~410k of training records. If we load the load at once we will run out of RAM, so we will take advantage streaming the batches.




In [8]:
!pip install tokenizers
!pip install datasets

Collecting tokenizers
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m39.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting huggingface_hub<0.18,>=0.16.4 (from tokenizers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m295.0/295.0 kB[0m [31m32.9 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: huggingface_hub, tokenizers
Successfully installed huggingface_hub-0.17.3 tokenizers-0.14.1
Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m519.6/519.6 kB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m15.1 MB

In [9]:
import math
import numpy as np
import tensorflow as tf
from datasets import load_dataset

from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)


*   Load the "code_search_net" adn we can check details of the dataset like below.



In [11]:


# load code dataset
raw_dataset = load_dataset("code_search_net", "python")


Downloading builder script:   0%|          | 0.00/8.44k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/18.5k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/12.9k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/941M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/412178 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/22176 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/23107 [00:00<?, ? examples/s]

In [6]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 412178
    })
    test: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 22176
    })
    validation: Dataset({
        features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
        num_rows: 23107
    })
})



*   Create a generator to load the data in batches for training tokenizer.

   >  Loading all the data at once may cause out of memory error.







In [16]:
tokenizer_batch_len = 1000
def get_training_corpus():
    dataset = raw_dataset["train"]
    for start_idx in range(0, len(dataset), tokenizer_batch_len):
        samples = dataset[start_idx : start_idx + tokenizer_batch_len]
        yield samples["whole_func_string"]


In [23]:
#check if we are able to iterate over the dataset.
iterat = iter(get_training_corpus())
next(iterat)

TypeError: ignored

Tokenization:



*   Subword Tokenization : Keep frequent words and break rearer words into subwords
*   A statastical Alogrothm learns how to do this based on corpus.

> Ex: Listeria ---> "more" , "over"

> "more" and "over" are likely to be more frequent than moreover


*   Tokenization has better chance of handling OOV words while decreasing the size of the overall dictionary.   

* We will use BPE(Byte Pair Encoding) to train tokenizer on "code_search_net" python Dataset.
*   For more information on BPE can be found here. https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt


In [14]:
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

# "<|endoftext|>" will used to stop the sequence generation during inference. This is also
#  a way telling GPT to learn to about the end of the sequence
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["PAD","<|endoftext|>"])
#Train the tokenizer using BPE trainer, loads the data in batches
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

TypeError: ignored


*    Check if we are able to tokenize and encode the data using trained BPE tokenizer
*   Actual maximum sequence length MAX_SEQ_LENGTH = 256, we will add one to 256 so that last sample will dropped for the inputs and first sample will be dropped from the outputs. This makes model to see only previous samples to predict next sample.

        For example:
                 input  :     295    4354   63    72      6035   63
                 output :     4354   63     72    6035    63     3170




In [31]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

context_size = 256
tokenizer.enable_padding(length=context_size+1, pad_id = 0, pad_token = "PAD")
tokenizer.enable_truncation(max_length=context_size+1)
tokenizer.post_process()
encoding = tokenizer.encode(raw_dataset["train"][1]["whole_func_string"])
print(encoding.ids)
print(encoding.attention_mask)

[296, 1621, 64, 18224, 64, 1147, 9, 294, 2913, 11455, 41, 275, 233, 291, 233, 21400, 6893, 7910, 232, 481, 1087, 691, 12336, 232, 2027, 6722, 9383, 533, 24681, 15, 842, 233, 758, 14208, 543, 481, 1203, 232, 1850, 462, 309, 369, 13, 308, 10, 234, 309, 2514, 54, 13, 19464, 626, 313, 311, 370, 759, 2913, 11455, 41, 27, 481, 386, 430, 233, 311, 395, 759, 2913, 11455, 41, 27, 7650, 257, 2027, 662, 293, 257, 1850, 481, 386, 310, 1293, 481, 2088, 313, 311, 1380, 27, 806, 233, 311, 781, 27, 394, 1850, 2667, 1444, 1087, 691, 12336, 1555, 462, 257, 6722, 9383, 533, 20205, 15, 1576, 5149, 1561, 301, 959, 915, 257, 1488, 1850, 301, 13386, 293, 1055, 257, 1497, 1445, 443, 257, 936, 1561, 481, 15, 233, 291, 233, 759, 2913, 234, 5149, 324, 233, 462, 481, 64, 1979, 9, 294, 2913, 11455, 41, 10, 443, 759, 1412, 39, 27, 223, 297, 814, 254, 759, 1412, 39, 27, 242, 1555, 13, 6722, 234, 814, 15, 1524, 1240, 925, 3613, 85, 529, 1019, 19, 62, 242, 759, 2913, 60, 1120, 62, 234, 6722, 313, 303, 759, 2913, 0, 0,



*   Generator to prepare inputs and outputs in the batches.
*   Inputs and Outputs will have sequences ids encoded from the tokenizer.



In [32]:
batch_size = 100
def generate_train_data():
  decoder_inputs = []
  decoder_targets = []
  dataset = raw_dataset["train"]
  for start_idx in range(0, len(dataset)-(len(dataset)%batch_size), batch_size):
      samples = dataset[start_idx : start_idx + batch_size]
      seqs = tokenizer.encode_batch(samples["whole_func_string"])
      decoder_inputs = [seq.ids[:-1] for seq in seqs] # Drop the last token in the sentence.
      decoder_targets = [seq.ids[1:] for seq in seqs]  # Drop the first token in the sentence.
      yield decoder_inputs, decoder_targets


In [None]:
#tf_dataset = raw_dataset["train"].to_tf_dataset(batch_size = 100, columns = ['whole_func_string'])
#tf_dataset

<_PrefetchDataset element_spec=TensorSpec(shape=(None,), dtype=tf.string, name=None)>

In [34]:
iterator = iter(generate_train_data())
decoder_inputs, decoder_targets = next(iterator)
print(decoder_inputs[80])


[296, 3020, 64, 877, 64, 2041, 9, 249, 13, 254, 64, 442, 275, 223, 291, 223, 7207, 533, 4161, 351, 9347, 18861, 4808, 15, 180, 842, 754, 19785, 3020, 223, 586, 3020, 1188, 257, 15548, 302, 936, 953, 351, 1436, 257, 5196, 18861, 223, 449, 816, 257, 975, 15, 223, 291, 223, 295, 648, 257, 936, 434, 301, 232, 672, 13, 1878, 293, 1808, 15, 223, 265, 736, 9, 220, 64, 442, 13, 430, 275, 242, 254, 64, 442, 234, 405, 463, 9, 68, 10, 297, 244, 254, 254, 64, 442, 62, 299, 2745, 234, 273, 15, 9849, 16617, 10909, 223, 297, 16266, 254, 254, 64, 442, 27, 242, 265, 273, 15, 5167, 1157, 734, 27, 285, 16266, 234, 273, 15, 21214, 9, 13703, 13, 2071, 10, 242, 297, 235, 254, 1005, 9, 25, 275, 285, 1963, 2041, 234, 2745, 1875, 273, 15, 4104, 35, 64, 9241, 285, 2745, 234, 3244, 3286, 4975, 397, 10, 1875, 273, 15, 9241, 10, 914, 3244, 13703, 689, 309, 24, 416, 235, 354, 1875, 444, 89, 1058, 10, 285, 265, 1963, 2041, 27, 371, 2745, 17302, 273, 15, 12243, 299, 297, 235, 254, 1005, 9, 249, 15, 7257, 275, 242, 19

In [35]:
#tf_dataset = tf.data.Dataset.from_tensor_slices((decoder_inputs, decoder_targets)).batch(10, drop_remainder=True)

tf_dataset = tf.data.Dataset.from_generator(generate_train_data, output_types=(tf.int32, tf.int32), output_shapes=(tf.TensorShape([batch_size, context_size]), tf.TensorShape([batch_size, context_size])))

# checks to see if data is loading properly.
iterator = iter(tf_dataset)
ins, outs = next(iterator)
print(ins.shape)
next(iterator)[0]

(100, 256)


<tf.Tensor: shape=(100, 256), dtype=int32, numpy=
array([[ 296, 1621,   64, ...,    0,    0,    0],
       [ 296, 7534,   64, ...,    0,    0,    0],
       [ 296, 6977,   64, ..., 7425, 1003,  233],
       ...,
       [ 296,  498,   64, ...,    0,    0,    0],
       [ 296,  498,   64, ...,    0,    0,    0],
       [ 296, 2275,   64, ...,    0,    0,    0]], dtype=int32)>

Multi Head Attention



*   Each Attention head performs Scaled Dot Product Self-Attention operation where given Keys, Query and Values, the return matrix of values given by below operation.

        Attention(Q,K,V) = softmax((Q*Transpose(K))/sqrt(d))*V





In [36]:
def scaled_dot_product_attention(query, key, value, mask=None):
  key_dims = tf.cast(tf.shape(key)[-1], tf.float32)
  scaled_scores = tf.matmul(query, key, transpose_b=True) / tf.math.sqrt(key_dims)

  if mask is not None:
    scaled_scores = tf.where(mask==0, -np.inf, scaled_scores)

  softmax = tf.keras.layers.Softmax()
  weights = softmax(scaled_scores)
  return tf.matmul(weights, value), weights



**Generating queries, keys, and values for multiple heads.**

> Now that we have a way to calculate self-attention, let's actually generate the input queries, keys, and values for multiple heads.

>  each attention head had its own separate set of query, key, and value weights. Each weight matrix was of dimension  d x d/h  where h was the number of heads.




In [37]:
class MultiHeadAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super(MultiHeadAttention, self).__init__()
    self.d_model = d_model
    self.num_heads = num_heads

    self.d_head = self.d_model // self.num_heads

    self.w_queries = tf.keras.layers.Dense(self.d_model, use_bias=False)
    self.w_keys = tf.keras.layers.Dense(self.d_model, use_bias=False)
    self.w_values = tf.keras.layers.Dense(self.d_model, use_bias=False)

    # Linear layer to generate the final output.
    self.dense = tf.keras.layers.Dense(self.d_model)

  def split_heads(self, x):
    batch_size = x.shape[0]

    split_inputs = tf.reshape(x, (batch_size, -1, self.num_heads, self.d_head))
    return tf.transpose(split_inputs, perm=[0, 2, 1, 3])

  def merge_heads(self, x):
    batch_size = x.shape[0]

    merged_inputs = tf.transpose(x, perm=[0, 2, 1, 3])
    return tf.reshape(merged_inputs, (batch_size, -1, self.d_model))

  def call(self, query, key, value, mask):
    queries = self.w_queries(query)
    keys = self.w_keys(key)
    values = self.w_values(value)

    queries = self.split_heads(queries)
    keys = self.split_heads(keys)
    values = self.split_heads(values)

    output, attn_weights = scaled_dot_product_attention(queries, keys, values, mask)
    output = self.merge_heads(output)

    return self.dense(output), attn_weights


Feed Forward Neural Network

In [38]:
def feed_forward_network(d_model, hidden_dim):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(hidden_dim, activation='relu'),
      tf.keras.layers.Dense(d_model)
  ])

Decode Block

In [39]:
class DecoderBlock(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, hidden_dim, dropout_rate=0.1):
    super(DecoderBlock, self).__init__()

    self.mhsa1 = MultiHeadAttention(d_model, num_heads)

    self.ffn = feed_forward_network(d_model, hidden_dim)

    self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
    self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    self.layernorm1 = tf.keras.layers.LayerNormalization()
    self.layernorm2 = tf.keras.layers.LayerNormalization()

  def call(self, input, training, decoder_mask):
    mhsa_output1, attn_weights = self.mhsa1(input, input, input, decoder_mask)
    mhsa_output1 = self.dropout1(mhsa_output1, training=training)
    mhsa_output1 = self.layernorm1(mhsa_output1 + input)

    ffn_output = self.ffn(mhsa_output1)
    ffn_output = self.dropout2(ffn_output, training=training)
    output = self.layernorm2(ffn_output + mhsa_output1)

    return output, attn_weights


Decoder with Mulitple Layers

In [40]:
class Decoder(tf.keras.layers.Layer):
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, target_vocab_size,
               max_seq_len, dropout_rate=0.1):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.max_seq_len = max_seq_len

    self.token_embed = tf.keras.layers.Embedding(target_vocab_size, self.d_model)
    self.pos_embed = tf.keras.layers.Embedding(max_seq_len, self.d_model)

    self.dropout = tf.keras.layers.Dropout(dropout_rate)

    self.blocks = [DecoderBlock(self.d_model, num_heads, hidden_dim, dropout_rate) for _ in range(num_blocks)]

  def call(self, input, training, decoder_mask):
    token_embeds = self.token_embed(input)

    seq_len = input.shape[1]
    # Generate position indices.
    num_pos = input.shape[0] * seq_len
    pos_idx = np.resize(np.arange(seq_len), num_pos)
    pos_idx = np.reshape(pos_idx, input.shape)

    pos_embeds = self.pos_embed(pos_idx)

    x = self.dropout(token_embeds + pos_embeds, training=training)

    for block in self.blocks:
      x, weights = block(x, training, decoder_mask)

    return x, weights

Custom loss function to remove effect of padding

In [41]:
def loss_func(targets, logits):
  ce_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
  mask = tf.cast(tf.math.not_equal(targets, 0), tf.float32)
  return ce_loss(targets, logits, sample_weight=mask)

In [42]:
class GPTModel(tf.keras.Model):
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, target_vocab_size,
               max_input_len, dropout_rate=0.1):
    super(GPTModel, self).__init__()

    self.decoder = Decoder(num_blocks, d_model, num_heads, hidden_dim, target_vocab_size,
                           max_input_len, dropout_rate)

    # The final dense layer to generate logits from the model output.
    self.output_layer = tf.keras.layers.Dense(target_vocab_size)

  @tf.function
  def train_step(self, inputs):
      loss = 0.

      input_seq, targets = inputs
      with tf.GradientTape() as tape:

        dec_padding_mask = tf.cast(tf.math.not_equal(input_seq, 0), tf.float32)
        dec_padding_mask = dec_padding_mask[:, tf.newaxis, tf.newaxis, :]
        input_seq_len = len(input_seq[0])
        look_ahead_mask = tf.linalg.band_part(tf.ones((input_seq_len,
                                               input_seq_len)), -1, 0)
        dec_mask = tf.minimum(dec_padding_mask, look_ahead_mask)

        logits, _ = self.decoder(input_seq, True, dec_mask)
        logits =   self.output_layer(logits)
        loss += self.loss(targets, logits)

      # Update the parameters and the optimizer
      variables = self.decoder.trainable_variables
      gradients = tape.gradient(loss, variables)
      self.optimizer.apply_gradients(zip(gradients, variables))

      return {'loss': loss}

  def call(self, input, training):
    logits, _ = self.decoder(input, False, None)
    logits =   self.output_layer(logits)
    return logits



In [43]:
model = GPTModel(
    num_blocks = 6,
    d_model = 512,
    num_heads = 4,
    hidden_dim = 1024,
    target_vocab_size = tokenizer.get_vocab_size(),
    max_input_len = 256)

optimizer = tf.keras.optimizers.Adam()
model.compile(optimizer=optimizer, loss=loss_func, run_eagerly=True)

In [44]:
epochs = 10
model.fit(tf_dataset, epochs=epochs)

Epoch 1/10
   4121/Unknown - 5848s 1s/step - loss: 5.8281

InvalidArgumentError: ignored

In [None]:
def generate_python_code(input_tex, max_len):

