# Define the Language Model

https://keras.io/guides/sequential_model/

> When to use a Sequential model
A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.


## 2. Model Architecture:
1. Embedding layer
    - Helps the model capture the 'meaning' of words by mapping them into a dense vector space, instead of treating them as arbitrary integer IDs
2. Stacked LSTM layers
    - Stacked LSTMs add more depth than additional cells in a single LSTM layer (see paper: https://arxiv.org/abs/1303.5778)
    - The first LSTM layer must set `return_sequences=True` so that it passes its output at every timestep to the second LSTM layer, instead of only its final output
3. Dense (fully connected) hidden layer with ReLU activation
4. Dense layer with Softmax activation
    - Outputs word probability across entire vocab
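To make the final layer concrete, here is a minimal sketch of what a softmax output means, written in plain Python rather than Keras. The five-word vocabulary and the raw scores are made-up values for illustration only; the real model produces one score per word in the tokenizer vocabulary.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and raw scores (logits) from the last Dense layer.
vocab = ["money", "business", "sales", "time", "success"]
logits = [2.0, 1.0, 0.5, 0.1, -1.0]

probs = softmax(logits)

# The most likely next word is the one with the highest probability.
best_index = max(range(len(probs)), key=probs.__getitem__)
predicted_word = vocab[best_index]
print(predicted_word, round(probs[best_index], 3))
```

Picking the highest-probability word is only one option; sampling from `probs` instead would give more varied generated text.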

In [1]:
# Neural Net Preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Neural Net Layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding

# Neural Net Training
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping


In [4]:
# Import the data
import pandas as pd
DATA_DIRECTORY = 'data'
TRAIN_DATA_TEXT_GEN_FORMAT_FILE_NAME = 'youtube_video_ids_with_transcript_text_and_author.csv'
train_df = pd.read_csv(DATA_DIRECTORY + '/' + TRAIN_DATA_TEXT_GEN_FORMAT_FILE_NAME)
# Selecting Dan Lok as author style to emulate
author = train_df[train_df['author'] == 'DAN']["text"]
print('Number of training sentences: ',author.shape[0])

('Number of training sentences: ', 58)


In [7]:
max_words = 50000 # Max size of the dictionary
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(author.values)
sequences = tokenizer.texts_to_sequences(author.values)

# Flatten the list of lists resulting from the tokenization. This will reduce the list
# to one dimension, allowing us to apply the sliding window technique to predict the next word
text = [item for sublist in sequences for item in sublist]
vocab_size = len(tokenizer.word_index)

print('Vocabulary size in this corpus: ', vocab_size)

# Training on 19 words to predict the 20th
sentence_len = 20
pred_len = 1
train_len = sentence_len - pred_len
seq = []
# Sliding window to generate train data (+1 so the last full window is included)
for i in range(len(text) - sentence_len + 1):
    seq.append(text[i:i+sentence_len])
# Reverse dictionary to decode tokenized sequences back to words
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))

('Vocabulary size in this corpus: ', 7125)
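The windows built above still have to be split into inputs and targets before training. The following is a minimal sketch of that step on a toy token list, assuming the first `train_len` tokens of each window are the input and the last token is the target. The names `toy_text`, `toy_sentence_len`, and `one_hot` are hypothetical and used only for illustration; in Keras, the imported `to_categorical` would normally perform the one-hot step.

```python
# Toy stand-in for the flattened token list (values are made up).
toy_text = [5, 12, 7, 3, 9, 1, 4, 8]
toy_sentence_len = 4                      # 3 input tokens + 1 target token
toy_train_len = toy_sentence_len - 1

# Sliding window over the token list; "+ 1" keeps the last full window.
windows = [toy_text[i:i + toy_sentence_len]
           for i in range(len(toy_text) - toy_sentence_len + 1)]

# Inputs are the leading tokens of each window, the target is the final token.
X = [w[:toy_train_len] for w in windows]
y = [w[-1] for w in windows]

def one_hot(index, num_classes):
    """Hypothetical helper: a vector of zeros with a 1 at position `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

# Keras Tokenizer indices start at 1, so the class count is max index + 1
# (position 0 is never used as a target).
num_classes = max(toy_text) + 1
y_one_hot = [one_hot(target, num_classes) for target in y]

print(X[0], "->", y[0])
```

Each row of `y_one_hot` lines up with one row of `X`, which is the shape a softmax output layer trained with categorical cross-entropy expects.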


In [8]:
# define model
model = Sequential([
    Embedding(vocab_size+1, 50, input_length=train_len),
    LSTM(150, return_sequences=True),
    LSTM(150),
    Dense(150, activation='relu'),
    Dense(vocab_size, activation='softmax')
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 19, 50)            356300    
_________________________________________________________________
lstm (LSTM)                  (None, 19, 150)           120600    
_________________________________________________________________
lstm_1 (LSTM)                (None, 150)               180600    
_________________________________________________________________
dense (Dense)                (None, 150)               22650     
_________________________________________________________________
dense_1 (Dense)              (None, 7125)              1075875   
=================================================================
Total params: 1,756,025
Trainable params: 1,756,025
Non-trainable params: 0
_________________________________________________________________
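The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers with biases; the helper function names are hypothetical and are not part of Keras.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per token index.
    return input_dim * output_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias.
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    # One weight per input plus one bias, for every unit.
    return (input_dim + 1) * units

vocab_size = 7125  # vocabulary size printed earlier in the notebook

counts = {
    "embedding": embedding_params(vocab_size + 1, 50),
    "lstm": lstm_params(50, 150),
    "lstm_1": lstm_params(150, 150),
    "dense": dense_params(150, 150),
    "dense_1": dense_params(150, vocab_size),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:10s} {n:>9,}")
print(f"{'total':10s} {total:>9,}")
```

Most of the parameters sit in the final softmax layer, because it needs one set of weights for every word in the vocabulary.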


## 3. Colab Model Training (1-2+ Hours)
### *20x Speedup with TensorFlow on GPU*

[colab.research.google.com/drive/1Ckybg1q91bV-tj7ohu_mIeaAn_WZPXxJ#scrollTo=5WBM8bENqPpQ](https://colab.research.google.com/drive/1Ckybg1q91bV-tj7ohu_mIeaAn_WZPXxJ#scrollTo=5WBM8bENqPpQ)

## 4. HDF5 Model Weights
### Result: Large Set of Weights Saved in HDF5 File Format

In [10]:
!ls data/*hdf5

data/model_weights.hdf5


## 5. Measure and Track Accuracy
#### `loss: 1.3737 - accuracy: 0.6570`
#### `100 Epochs`, ~1-2+ hrs.
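One way to read the reported loss is to convert it to perplexity. This sketch assumes the loss is categorical cross-entropy measured with the natural logarithm (the Keras default for that loss); the notebook does not show the compile step, so treat that as an assumption. The loss of a model that guesses uniformly over the vocabulary is included as a baseline.

```python
import math

reported_loss = 1.3737   # loss value from the training log
vocab_size = 7125        # vocabulary size printed earlier in the notebook

# Perplexity: roughly how many equally likely words the model is choosing
# between at each step (assumes cross-entropy in nats).
perplexity = math.exp(reported_loss)

# Baseline: cross-entropy of a uniform guess over the whole vocabulary.
uniform_loss = math.log(vocab_size)

print("perplexity:", round(perplexity, 2))
print("uniform-guess loss:", round(uniform_loss, 2))
```

Because the log line does not say whether these figures come from training or held-out data, the comparison says how well the model fits the data it was scored on, not necessarily how well it generalizes.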