### Overview

In this project, I want to generate star trek scripts. 

I originally tokenized them using the GPT2Tokenizer and tried to use TFGPT2LMHeadModel, but I found that even if I freeze everything but the embedding layers, this takes 13 hours per epoch to train on my hardware. I didn't really appreciate the size of the GPT model.

Now I will simplify a bit. I'll use a lower case tokenizer for a smaller vocab and use a simpler model.

In [1]:
import json
import numpy as np
# from transformers import TFGPT2LMHeadModel, TFGPT2Model, GPT2Tokenizer, GPT2Config
from transformers import DistilBertTokenizer, TFDistilBertModel
import re
from sklearn.model_selection import train_test_split
from random import randint
import matplotlib.pyplot as plt

import string
import random

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.callbacks import ModelCheckpoint

print(tf.__version__)

  from .autonotebook import tqdm as notebook_tqdm


2.13.0


In [2]:
f = open("data/StarTrek_scripts/all_scripts_raw.json")
json_file = json.load(f)
f.close()
#start with TOS: might be more manageable
TOS_scripts=json_file['TOS']
print(TOS_scripts['episode 0'][:1000])







The Star Trek Transcripts - The Cage



The
Cage
Unaired
pilot






 [Bridge]

SPOCK: Check the circuit. 
TYLER: All operating, sir. 
SPOCK: It can't be the screen then. Definitely something out there,
Captain, headed this way. 
TYLER: It could be these meteorites. 
ONE: No, it's something else. There's still something out there. 
TYLER: It's coming at the speed of light, collision course. The
meteorite beam has not deflected it, Captain.
ONE: Evasive manoeuvres, sir?
PIKE: Steady as we go.
GARISON: It's a radio wave, sir. We're passing through an old-style
distress signal.
PIKE: They were keyed to cause interference and attract attention this
way.
GARISON: A ship in trouble making a forced landing, sir. That's it. No
other message.
TYLER: I have a fix. It comes from the Talos star group.
ONE: We've no ships or Earth colonies that far out.
SPOCK: Their call letters check with a survey expedition. SS Columbia.
It disappeared in that region approximately eight


### General plan

I want to generate a star trek script. The model will be some kind of transformer. The input is a series of tokens, I'll start with 128 tokens (needs padded in case the input is shorter). The output is the next word, i.e. input: [The, quick, brown], output: [quick, brown, fox]

To make this, that means I need to:
- parse the scripts, remove line breaks and things.

    - Also need to remove episode title at beginning and copyright stuff at the end.
    - Probably should add a character for stage direction, or perhaps if I keep the colons such that kirk: is distinct from kirk.
    
- Create segments of input tokens in batches of 128

- Embed the wordings

- Split train/test data

- Create model and train
    


In [3]:
#functions to remove metadata and add stage direction tokens to script

def add_special_tokens(script):
    # Replace character names
    script = re.sub(r'\n([A-Z ]+):', r' <CHAR> \1:', script) #adds <CHAR> token any time theres a new line followed by "<CAPITALLETTERS>:"
    # You can add more substitutions here for stage directions or other special tokens
    script = re.sub(r'[\[\{]([^\]\}]+)[\]\}]', r' <LOC> \1 <LOC>', script) #add <LOC> token to indicate location
    script = re.sub(r'\(([^)]+)\)', r' <SD> \1 <SD>', script)
    return script

def remove_metadata(script):
    # Find the position of the 17th newline character
    start_pos = -1
    for _ in range(17):
        start_pos = script.find('\n', start_pos + 1)
        
    # Slice the string from the character after the 8th newline
    if start_pos != -1:
        script = script[start_pos + 1:]
    
    # Find the position of "<Back"
    pos = script.find("<Back")

    # If found, cut off everything past that point
    if pos != -1:
        script = script[:pos]
    return script

# def process_names(text):
#     unique_names = set()

#     # Function to replace "<CHAR> NAME:" with "<CHAR> Name:"
#     def char_replacer(match):
#         name = match.group(1)
#         if name.lower() == "mccoy": #McCoy needs special treatment due to unique capitalization
#             name = "McCoy"
#         else:
#             name = name.capitalize()
#         unique_names.add(f"{name}")
#         return f"<CHAR> {name}"
    
#     # Replace names after "<CHAR>"
#     text = re.sub(r'<CHAR>\s+([A-Z]{2,})', char_replacer, text)
#     # Function to replace all other instances of unique names
#     def name_replacer(match):
#         name = match.group(0)
#         if name == "MCCOY":
#             return " McCoy"
#         return name.capitalize() if name.upper() in unique_names else name

#     # Replace all other instances of unique names
#     text = re.sub(r'\b[A-Z]{2,}\b', name_replacer, text)

#     return text, unique_names

def process_names_lowercase(text):
    unique_names = set()

    # Function to replace "<CHAR> NAME:" with "<CHAR> Name:"
    def char_replacer(match):
        name = match.group(1)
        name = name.lower()
        unique_names.add(f"{name}")
        return f"<CHAR> {name}"
    
    # Replace names after "<CHAR>"
    text = re.sub(r'<CHAR>\s+([A-Z]{2,})', char_replacer, text)
    # Function to replace all other instances of unique names
    def name_replacer(match):
        name = match.group(0)
        return name.lower() if name.upper() in unique_names else name
    # Replace all other instances of unique names
    text = re.sub(r'\b[A-Z]{2,}\b', name_replacer, text)
    return text, unique_names

def preprocess_script(script):
    
    script=remove_metadata(add_special_tokens(script))
    script, names =process_names_lowercase(script)
    script=script.replace('\n', ' ')
    script=script.replace('\r', ' ')
    # Replace multiple spaces with a single space
    script = re.sub(' +', ' ', script)
    script += " <END>"
    script = script.strip()
    return script, names

for i in range(2):
    # print(TOS_scripts['episode '+str(i)][:1100])
    # print(script)
    script, names=preprocess_script(TOS_scripts['episode '+str(i)])
    # print(script[-1000:])
    print(script[:100])
    print(type(script))
    print(names)
    # print('\n')
print(preprocess_script('\nKIRK: Set ship status condition red'))
    

<LOC> Bridge <LOC> <CHAR> spock: Check the circuit. <CHAR> tyler: All operating, sir. <CHAR> spock: 
<class 'str'>
{'magistrate', 'garison', 'officer', 'geologist', 'one', 'survivor', 'orion', 'boyce', 'spock', 'haskins', 'talosian', 'colt', 'tyler', 'old', 'pike', 'pitcairn', 'vina'}
Captain's log, Stardate 1513.1. Our position, orbiting planet M-113. On board the Enterprise, Mister
<class 'str'>
{'crewman', 'sulu', 'crater', 'uhura', 'green', 'blueshirt', 'kirk', 'redshirt', 'nancy', 'mccoy', 'security', 'darnell', 'transporter', 'spock', 'blonde', 'rand'}
('<CHAR> kirk: Set ship status condition red <END>', {'kirk'})


In [4]:
# tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
script, names=preprocess_script(TOS_scripts['episode 62'])
custom_tokens=list(names)
for token in ["<LOC>", "<CHAR>", "<SD>", "<END>"]:
    custom_tokens.append(token)
new_tokens = [token for token in custom_tokens if token not in tokenizer.get_vocab()]
print(new_tokens)
# Add the new tokens to the tokenizer
tokenizer.add_tokens(new_tokens)

# Print the new vocabulary size
print("Number of added tokens: ", len(new_tokens))

def check_in_vocab(word_to_check):
    word_version=[vocab_word for vocab_word in tokenizer.get_vocab() 
                  if vocab_word.lower() == word_to_check.lower() 
                  or vocab_word.lower() == ("Ġ" + word_to_check).lower()]
    if word_version:
        print(f"Versions of the word '{word_to_check}' in the vocabulary: {', '.join(word_version)}")
    else:
        print(f"The word '{word_to_check}' is not in the vocabulary.")
check_in_vocab('kirk')
# Don't forget to resize the model embeddings to match the new vocabulary size
# model.resize_token_embeddings(len(tokenizer))

['sulu', 'uhura', 'chekov', 'klingon', 'spock', '<LOC>', '<CHAR>', '<SD>', '<END>']
Number of added tokens:  9
Versions of the word 'kirk' in the vocabulary: kirk


A problem I've discovered: I can add ĠKirk to the vocabulary, but the pretrained embeddings would prefer to use "ĠK", "irk". I might want to consider allowing the embeddings to be trainable, since GPT's embeddings have been trained on a general corpus, so they might not perfectly align with the specific nuances of my TV show scripts.

In [5]:
from collections import Counter

# Tokenize a large sample of your text
tokens = tokenizer.tokenize(script)

# Split the tokens into chunks of 128
token_chunks = [tokens[i:i + 129] for i in range(0, len(tokens), 129)]

# You may need to pad the last chunk if it's not 128 tokens long
last_chunk = token_chunks[-1]
if len(last_chunk) < 129:
    last_chunk = last_chunk + [tokenizer.pad_token] * (129 - len(last_chunk))
    token_chunks[-1] = last_chunk

print(token_chunks[0])
# # Convert chunks to input IDs
# input_ids_chunks = [tokenizer.convert_tokens_to_ids(chunk) for chunk in token_chunks]
# # Count the frequency of each token
# token_counts = Counter(tokens)
# print(script[:500])
# print([token for token, count in token_counts.items()])
# # Identify tokens that might be special
# # potential_special_tokens = [token for token, count in token_counts.items() if some_condition(token, count)]
print(script[:500])

['<loc>', 'planet', 'surface', '<loc>', '<sd>', 'kirk', ',', 'mccoy', ',', 'chekov', 'and', 'a', 'security', 'guard', 'have', 'beamed', 'down', 'to', 'a', 'planet', 'with', 'a', 'green', 'sky', 'and', 'occasional', 'out', '##cr', '##ops', 'of', 'vertical', 'rocks', '.', 'their', 'phase', '##rs', 'are', 'drawn', '.', '<sd>', '<char>', 'kirk', ':', 'report', ',', 'mister', 'chekov', '.', '<char>', 'chekov', ':', 'full', 'scan', '.', 'results', 'negative', '.', 'radiation', 'level', 'normal', '.', 'atmosphere', 'and', 'terrain', 'are', 'und', '##ist', '##urbed', '.', 'no', 'evidence', 'of', 'a', 'colony', 'nor', 'any', 'residual', 'after', '##ef', '##fect', 'of', 'a', 'force', 'that', 'might', 'have', 'ann', '##ih', '##ila', '##ted', 'it', '.', '<char>', 'kirk', ':', 'life', 'readings', ',', 'doctor', 'mccoy', '?', '<char>', 'mccoy', ':', 'nothing', '.', 'they', 'said', 'they', 'were', 'being', 'attacked', 'by', 'an', 'unidentified', 'ship', '.', '<char>', 'chekov', ':', 'which', 'we', 'w

#### Some thoughts on training and test
I want to tokenize all the scripts and make chunks of 80 tokens used to predict the next word. But if I take words 0:8, then 1:81, and so on, the data will be highly correlated. This means I can't just randomly take 20\% of these chunks out for test data. Instead, I'll split at the episode level. I have 80 episodes- I'll arbitrarily assign 16 episodes to test data, and set those scripts aside.

### Process scripts and chunk into training and test sets
I initially ran this with a standard layout of 10% validation, 20% test data, but I found that I wasn't really using the test data any different than the validation data. The real test is at the end when I generate something new. Furthermore, I compared two cases: one where it seemed to have its best validation loss (I only trained for ~50 epochs due to hardware limitations), and one where it had the best train loss, but was overfit. The overfit result actually seems to produce more realistic scripts in the final product, so I don't think overfitting as as dangerous for this application as it would be for others. For these reasons, I'm now going to try removing the test set and keeping only 5% in validation, and increasing the size of the training sample as much as I can, starting from my "overfit" results


In [67]:
scripts=[TOS_scripts['episode '+str(i)] for i in range(len(TOS_scripts))]
random_state=42
# train_scripts, test_scripts = train_test_split(scripts, test_size=0.2, random_state=random_state)
# train_scripts, val_scripts = train_test_split(train_scripts, test_size=1/8, random_state=random_state)  # 10% of 80% = 1/8
train_scripts, val_scripts = train_test_split(train_scripts, test_size=0.05, random_state=random_state)  # 10% of 80% = 1/8

# Initialize a set to hold unique new tokens
unique_new_tokens = set()

# Preprocess the scripts and collect new tokens
processed_scripts = []
for script in train_scripts:
    processed_text, new_tokens = preprocess_script(script)
    processed_scripts.append(processed_text)
    unique_new_tokens.update(new_tokens)
new_tokens=list(unique_new_tokens)
for token in ["<LOC>", "<CHAR>", "<SD>", "<END>", "[PAD]"]:
    new_tokens.append(token)
add_tokens = [token for token in new_tokens if token not in tokenizer.get_vocab()]
# Add unique new tokens to the tokenizer
tokenizer.add_tokens(list(add_tokens))
print('vocab size: ',len(tokenizer))

# # Tokenize all the processed scripts
tokenized_scripts = [tokenizer.tokenize(script) for script in processed_scripts]


maxlen = 80  # Max sequence size
def create_chunks(tokenized_scripts, chunk_size=maxlen+1, stride=1,padded_chunks_per_script=1):
    #create overlapping chunks of 80 tokens to predict the next word
    X=[]
    y=[]
    tokenizer.pad_token = '[PAD]'
    for tokenized_script in tokenized_scripts:
        # for i in range(0, len(tokenized_script) - chunk_size + 1, stride):
        #     chunk = tokenized_script[i:i + chunk_size]
        #     X.append(chunk[:-1])
        #     y.append([chunk[-1]])
        #The user might not provide full 80 words, so lets augment using random padded sequences.
        for _ in range(padded_chunks_per_script):
            start_index = randint(0, len(tokenized_script) - 2) # -2 to leave room for at least one token
            random_length = randint(2, chunk_size - 1) # Choose a random length less than chunk_size
            end_index = start_index + random_length
            # Select the random chunk
            
            chunk = tokenized_script[start_index:end_index]
            print('chunk:',chunk)
            y.append([chunk[-1]])
            X_chunk=chunk[:-1]
            print('y:',[chunk[-1]])
            # Pad the chunk to the desired length
            padding_needed = chunk_size - len(chunk)
            pad_token = tokenizer.pad_token_id # or whatever your padding token is
            X_chunk += [pad_token] * padding_needed
            X.append(X_chunk)
            print('X:',X_chunk)
    X =np.array([tokenizer.convert_tokens_to_ids(x) for x in X])
    y =np.array([tokenizer.convert_tokens_to_ids(_y) for _y in y])
    return X,y

# train_X, train_y=create_chunks(tokenized_scripts)

processed_val_scripts=[]
for script in val_scripts:
    processed_text, new_tokens = preprocess_script(script)
    processed_val_scripts.append(processed_text)
tokenized_val_scripts = [tokenizer.tokenize(script) for script in processed_val_scripts]
val_X, val_y=create_chunks(tokenized_val_scripts)

# processed_test_scripts=[]
# for script in test_scripts:
#     processed_text, new_tokens = preprocess_script(script)
#     processed_test_scripts.append(processed_text)
# tokenized_test_scripts = [tokenizer.tokenize(script) for script in processed_test_scripts]
# test_X, test_y=create_chunks(tokenized_test_scripts
print('train shape: ',np.shape(train_X))
print('val shape: ',np.shape(val_X))
# print('test shape: ',np.shape(test_X))



vocab size:  30699
chunk: ['.', 'however', ',', 'it', 'does', 'express', 'the', 'thought', '.', '<char>', 'scot', 't', ':', 'i', "'", 'd', 'like', 'to', 'see', 'the', 'advisor', "'", 's', 'face', '.', '<char>', 'spock', ':', 'you', 'will', 'have', 'that', 'opportunity', '.', '<char>', 'scot', 't', ':']
y: [':']
X: ['.', 'however', ',', 'it', 'does', 'express', 'the', 'thought', '.', '<char>', 'scot', 't', ':', 'i', "'", 'd', 'like', 'to', 'see', 'the', 'advisor', "'", 's', 'face', '.', '<char>', 'spock', ':', 'you', 'will', 'have', 'that', 'opportunity', '.', '<char>', 'scot', 't', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
chunk: ['##len', '##cy', '.', '<char>', 'hodin', ':', 'surely', ',', 'mister', 'spock', ',', 'you', 'do', 'not', 'intend', ',', 'i', 'hope', ',', 'to', 'create', 'a', 'dispute', 'between', 'the', 'federation', 'and', 'gideon', '?', 'spock', '<loc>', 'on', 'screen', '<loc>', ':', 'y

In [None]:
example=train_X[-1]
tokens = tokenizer.convert_ids_to_tokens(example)
print(tokens)
print(' '.join(tokens).replace(' ##', '')) # Joining tokens and handling subtokens
print(tokenizer.convert_ids_to_tokens(train_y[8]))

### Define miniature transformer model

I'm adapting the small transformer from https://keras.io/examples/generative/text_generation_with_miniature_gpt/
seeing as GPT is way to big for me to train with my hardware.

It looks like a big part of my parameters comes from word embeddings and the dense layer at the end, so I might want to try to reduce my vocabulary. I could use TextVectorization instead of BERT's tokenizer.

In [8]:
#build a miniature transformer model based on https://keras.io/examples/generative/text_generation_with_miniature_gpt/
def causal_attention_mask(batch_size, n_dest, n_src, dtype):
    """
    Mask the upper half of the dot product matrix in self attention.
    This prevents flow of information from future tokens to current token.
    1's in the lower triangle, counting from the lower right corner.
    """
    i = tf.range(n_dest)[:, None]
    j = tf.range(n_src)
    m = i >= j - n_src + n_dest
    mask = tf.cast(m, dtype)
    mask = tf.reshape(mask, [1, n_dest, n_src])
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
    )
    return tf.tile(mask, mult)


class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads, embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)
        attention_output = self.att(inputs, inputs, attention_mask=causal_mask)
        attention_output = self.dropout1(attention_output)
        out1 = self.layernorm1(inputs + attention_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        return self.layernorm2(out1 + ffn_output)
    
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
    
vocab_size = len(tokenizer)  # Only consider the top 20k words
embed_dim = 128  # Embedding size for each token
num_heads = 8  # Number of attention heads
feed_forward_dim = 128  # Hidden layer size in feed forward network inside transformer

def create_model():
    inputs = layers.Input(shape=(maxlen,), dtype=tf.int32)
    embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
    x = embedding_layer(inputs)
    transformer_block = TransformerBlock(embed_dim, num_heads, feed_forward_dim)
    x = transformer_block(x)
    # x = layers.Dense(vocab_size)(x)
    x = layers.GlobalAveragePooling1D()(x) #added this layer to change the architecture to predict only 1 word
    outputs = layers.Dense(vocab_size)(x) #logits of single predicted word
    model = keras.Model(inputs=inputs, outputs=outputs)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(
        "adam", loss=loss_fn,
    )  # No loss and optimization based on word embeddings from transformer block
    return model


In [None]:

model = create_model()
# model.load_weights('StarTrek_model_checkpoint.h5')
model.summary()
with open('StarTrek_oneword_history.json', 'r') as f:
    loaded_history = json.load(f)

for loop_i in range(10):
    checkpoint_filepath = 'StarTrek_oneword_model_checkpoint'+str(loop_i+1)+'.h5'

    model_checkpoint_callback = ModelCheckpoint(
        filepath=checkpoint_filepath,
        save_weights_only=True,
        monitor='loss',
        mode='min',
        save_best_only=True)
    model.load_weights('StarTrek_oneword_model_checkpoint'+str(loop_i)+'.h5')
    history=model.fit(train_X, 
                  train_y, 
                  validation_data=(val_X, val_y), 
                  callbacks=[model_checkpoint_callback], 
                  verbose=1, 
                  epochs=10)
    hist_dict=history.history
    for key in hist_dict.keys():
        loaded_history[key].extend(hist_dict[key])

Model: "model_13"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_14 (InputLayer)       [(None, 80)]              0         
                                                                 
 token_and_position_embeddi  (None, 80, 128)           3939712   
 ng_13 (TokenAndPositionEmb                                      
 edding)                                                         
                                                                 
 transformer_block_13 (Tran  (None, 80, 128)           561024    
 sformerBlock)                                                   
                                                                 
 global_average_pooling1d_1  (None, 128)               0         
 3 (GlobalAveragePooling1D)                                      
                                                                 
 dense_41 (Dense)            (None, 30699)             396

2023-08-26 08:20:09.702284: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.




2023-08-26 08:44:55.456867: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10

In [46]:
model = create_model()
model.load_weights('StarTrek_oneword_model_checkpoint8.h5')
loss = model.evaluate(test_X, test_y)

   1/2928 [..............................] - ETA: 11:21 - loss: 7.2165

2023-08-28 08:48:21.420623: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.




In [20]:
#plot history
# with open('StarTrek_oneword_history.json', 'w') as f:
#     json.dump(history.history, f)
# history_dict = history.history
loss = loaded_history['loss']
val_loss = loaded_history['val_loss']

epochs = range(1, len(loss) + 1)

# plt.figure(figsize=(10,6))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

# # Further training
# new_history = model.fit(train_X, train_y, epochs=5)

# Appending new history to initial history
# for key in initial_hist_dict.keys():
#     initial_hist_dict[key].extend(new_history.history[key])

NameError: name 'loaded_history' is not defined

In [47]:
def convert_input_to_token_idx(string):
    processed_string=preprocess_script(string)[0][:-5]
    input_tokens= tokenizer.tokenize(processed_string)
    input_tokens=tokenizer.convert_tokens_to_ids(input_tokens)
    padding_needed = 80 - len(input_tokens)
    pad_token = tokenizer.pad_token_id # or whatever your padding token is
    input_tokens += [pad_token] * padding_needed
    return np.array(input_tokens)

def convert_token_idx_to_string(tokens):
    string_array=tokenizer.convert_ids_to_tokens(tokens)
    string_array=[string for string in string_array if string != "[PAD]"]
    text = ' '.join(string_array).replace(' ##', '').replace('[PAD]', '')
    return text

input_string="Captain's log, star"
# input_string="\nKIRK: Set ship status condition"
input_tokens=convert_input_to_token_idx(input_string)
input_tokens=test_X[17]
print(input_tokens)
print(tokenizer.convert_ids_to_tokens(train_y[8]))
# print(input_tokens)
# print(type(train_X[0]))
# print(np.shape(train_X[0]))
print(tokenizer.convert_ids_to_tokens(input_tokens))
print('input = \n',convert_token_idx_to_string(input_tokens),'\n')
generated_token=np.argmax(model.predict(np.expand_dims(input_tokens,axis=0)),axis=-1)
next_word=convert_token_idx_to_string(generated_token)
print('\n')
print('generated = \n', convert_token_idx_to_string(input_tokens)+' '+next_word)


[11332  1024  1037  3371  1012 30528 16075  1024  2009  1005  1055 30526
  1012  2031  2017  4384  2505  4326  2055  2032  1029 30528 11332  1024
  2053  1010  2498  1999  3327  1012  2339  1029 30528 16075  1024  2092
  1010  2009  1005  1055  2498  1045  2064  9231  8400  2302  2019  7749
  1010  2021  2002  1005  1055  2468  6233  2717  3512  1012  2065  2002
  2020  2025  1037 25993  1010  1045  1005  1040  2471  2360  6091  1012
  1998  2005  2178  2518  1010  2002  1005  1055]
['have']
['kirk', ':', 'a', 'minute', '.', '<char>', 'mccoy', ':', 'it', "'", 's', 'spock', '.', 'have', 'you', 'noticed', 'anything', 'strange', 'about', 'him', '?', '<char>', 'kirk', ':', 'no', ',', 'nothing', 'in', 'particular', '.', 'why', '?', '<char>', 'mccoy', ':', 'well', ',', 'it', "'", 's', 'nothing', 'i', 'can', 'pin', '##point', 'without', 'an', 'examination', ',', 'but', 'he', "'", 's', 'become', 'increasingly', 'rest', '##ive', '.', 'if', 'he', 'were', 'not', 'a', 'vulcan', ',', 'i', "'", 'd',

2023-08-28 08:50:56.408249: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


In [49]:
#generate a longer script

input_tokens=test_X[17]
script=list(input_tokens)
for i in range(100):
    generated_token=list(np.argmax(model.predict(np.expand_dims(input_tokens,axis=0),verbose=0),axis=-1))
    script.append(generated_token[0])
    input_tokens=script[-80:]
script=convert_token_idx_to_string(script)
script = script.replace('<char>', '\n')
print(script)


kirk : a minute . 
 mccoy : it ' s spock . have you noticed anything strange about him ? 
 kirk : no , nothing in particular . why ? 
 mccoy : well , it ' s nothing i can pinpoint without an examination , but he ' s become increasingly restive . if he were not a vulcan , i ' d almost say nervous . and for another thing , he ' s not a heart . 
 mccoy : indeed it ' s the exactment the zen planet , but you said survive there . there said there said there could say they can do it do me . do they understand me do do me which achilles achilles me which which which which which perhaps all all , and spock . <sd> they enters to kill kill you ! 
 jailor : then you are you ! 
 kirk : no , no ! 
 kirk : your friends and you ' ll kill you ! 
 kirk : no


In [None]:
# Experimenting using TextVectorization tokenization instead of DistilBERT
# scripts=[TOS_scripts['episode '+str(i)] for i in range(len(TOS_scripts))]
# random_state=42
# train_scripts, test_scripts = train_test_split(scripts, test_size=0.2, random_state=random_state)
# train_scripts, val_scripts = train_test_split(train_scripts, test_size=1/8, random_state=random_state)  # 10% of 80% = 1/8

# processed_train_scripts = [preprocess_script(script)[0] for script in train_scripts]
# big_string = ' '.join(processed_train_scripts)

# # Convert to a TensorFlow Dataset
# dataset = tf.data.Dataset.from_tensor_slices([big_string])
# def custom_standardization(input_string):
#     """ Remove html line-break tags and handle punctuation """
#     lowercased = tf.strings.lower(input_string)
#     stripped_html = tf.strings.regex_replace(lowercased, "<br />", " ")
#     cleaned_string = tf.strings.regex_replace(stripped_html, "\xa0", " ")  # Replacing non-breaking space with regular space
#     # Exclude < and > from the punctuation string
#     punctuation_without_angle_brackets = string.punctuation.replace("<", "").replace(">", "")
#     return tf.strings.regex_replace(cleaned_string, f"([{punctuation_without_angle_brackets}])", r" \1")

# # Create a vectorization layer and adapt it to the text
# vectorize_layer = TextVectorization(
#     standardize=custom_standardization,
#     max_tokens=vocab_size - 1,
#     output_mode="int",
#     # output_sequence_length=maxlen + 1,
# )
# with tf.device('/CPU:0'):
#     # Your code here (e.g., adapting the vectorize layer)
#     vectorize_layer.adapt(dataset) #for some reason this won't work on my GPU
# vocab = vectorize_layer.get_vocabulary()  # To get words back from token indices

# #create chunks of training, val, and test data
# window_size = 81
# step_size = 3  # This defines the overlap; change as needed
# padded_chunks_per_script=0 #not yet implemented

# # Tokenize all the processed scripts
# processed_train_scripts_tensor = tf.constant(processed_train_scripts)
# vectorized_train_scripts = vectorize_layer(processed_train_scripts_tensor)
# def create_chunks(tokenized_scripts, window_size=window_size, step_size=step_size, padded_chunks_per_script=padded_chunks_per_script):
#     chunks = []
#     for script in tokenized_scripts:
#         for i in range(0, len(script) - window_size + 1, step_size):
#             chunk = script[i:i + window_size]
#             chunks.append(chunk)
#     return chunks

# train_chunks=create_chunks(vectorized_train_scripts)
# train_X=np.array([chunk[:-1] for chunk in train_chunks])
# train_y=np.array([chunk[1:] for chunk in train_chunks])

# #do the same for validation and test data
# processed_val_scripts = [preprocess_script(script)[0] for script in val_scripts]
# processed_val_scripts_tensor = tf.constant(processed_val_scripts)
# vectorized_val_scripts = vectorize_layer(processed_val_scripts_tensor)
# val_chunks=create_chunks(vectorized_val_scripts)
# val_X=np.array([chunk[:-1] for chunk in val_chunks])
# val_y=np.array([chunk[1:] for chunk in val_chunks])


# processed_test_scripts = [preprocess_script(script)[0] for script in test_scripts]
# processed_test_scripts_tensor = tf.constant(processed_test_scripts)
# vectorized_test_scripts = vectorize_layer(processed_test_scripts_tensor)
# test_chunks=create_chunks(vectorized_test_scripts)
# test_X=np.array([chunk[:-1] for chunk in test_chunks])
# test_y=np.array([chunk[1:] for chunk in test_chunks])

In [None]:
# #demonstrate how the vectorize layer works and how the x y vectors work
# print(np.shape(train_X))
# print(np.shape(val_X))
# print(np.shape(test_X))
# print(train_X[-1]) #batching caused shorter scripts to be filled with zeros. Maybe go back to the loop mehtod
# print([vocab[idx] for idx in val_X[12]])
# print([vocab[idx] for idx in val_y[12]])
# # print(vocab[:20])

In [None]:

# #use GPT-2 model. We'll need to train from here to learn star trek vocab
# Initializing a GPT2 configuration

configuration = GPT2Config.from_pretrained('distilgpt2')
configuration.vocab_size = len(tokenizer)
# configuration.n_positions = 128 #The maximum sequence length that this model might ever be used with. 
# configuration.n_embd = 768, #768, #Dimensionality of the embeddings and hidden states.
# configuration.n_layer = 4, #12, #Number of hidden layers in the Transformer encoder.
# configuration.n_head = 4, #12, #Number of attention heads for each attention layer in the Transformer encoder.
                           
model = TFGPT2Model(configuration)
model.resize_token_embeddings(len(tokenizer))  # resizing to match new vocab size

# If you want to see the summary, you can build the model first:
model.build(input_shape=(None, 128))  # None for batch size, 128 for sequence length

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()


# configuration = GPT2Config(vocab_size = len(tokenizer), 
#                            n_positions = 128, #The maximum sequence length that this model might ever be used with. 
#                            n_embd = 64, #768, #Dimensionality of the embeddings and hidden states.
#                            n_layer = 4, #12, #Number of hidden layers in the Transformer encoder.
#                            n_head = 4, #12, #Number of attention heads for each attention layer in the Transformer encoder.
#                            n_inner = None,
#                            activation_function = 'gelu_new',
#                            resid_pdrop = 0.1,
#                            embd_pdrop = 0.1,
#                            attn_pdrop = 0.1,
#                            layer_norm_epsilon = 1e-05,
#                            initializer_range = 0.02,
#                            summary_type = 'cls_index',
#                            summary_use_proj = True,
#                            summary_activation = None,
#                            summary_proj_to_labels = True,
#                            summary_first_dropout = 0.1,
#                            scale_attn_weights = True,
#                            use_cache = True,
#                            bos_token_id = 50256,
#                            eos_token_id = 50256,
#                            scale_attn_by_inverse_layer_idx = False,
#                            reorder_and_upcast_attn = False)

# # Initializing a model (with random weights) from the configuration
# model = TFGPT2LMHeadModel(configuration)
# # from tensorflow.keras.layers import Input, Dense
# # from tensorflow.keras.models import Model
# # input_layer = Input(shape=(128,), dtype='int32')
# # gpt2_output = gpt2_model(input_layer)
# # output_layer = Dense(vocab_size, activation='softmax')(gpt2_output[0][:,-1,:])

# # model = Model(inputs=input_layer, outputs=output_layer)
# # model = TFGPT2LMHeadModel.from_pretrained("gpt2")
# model.resize_token_embeddings(len(tokenizer))  # resizing to match new vocab size
# # print(len(tokenizer))

# # #use DistilBERT model. We'll need to train from here to learn star trek vocab
# # model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
# # model.resize_token_embeddings(len(tokenizer))  # resizing to match new vocab size
# print(len(tokenizer))

In [None]:
#training
# Train only the embedding layer

# Freeze all layers within the transformer
for layer in model.transformer.h:
    layer.trainable = False
# Unfreeze the embedding layer (wte)
model.transformer.wte.trainable = True
# Freeze all layers first
# for layer in model.layers:
#     layer.trainable = False

# # Unfreeze the embedding layer
# model.layers[0].embeddings.trainable = True

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()
history = model.fit(x=train_X, y=train_y, epochs=10, batch_size=32, validation_data=(val_X, val_y))

# # Save the weights
model.save_weights('StarTrek_weights.h5')

# # Unfreeze more layers (optional)
# for layer in model.layers[:n]:  # Unfreeze the first n layers
#     layer.trainable = True
# model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', lr=0.0001) # Smaller learning rate
# model.fit(training_data)

# # Unfreeze all layers (optional)
# for layer in model.layers:
#     layer.trainable = True
# model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', lr=0.00001) # Even smaller learning rate
# model.fit(training_data)

# # Save the final model
# model.save('final_model.h5')


In [None]:
# train_loss = history.history['loss']
# val_loss = history.history['val_loss']
# plt.plot(train_loss, label='Training Loss')
# plt.plot(val_loss, label='Validation Loss')
# plt.xlabel('Epochs')
# plt.ylabel('Loss')
# plt.legend()
# plt.show()