<a href="https://colab.research.google.com/github/mvenouziou/Project-Text-Generation/blob/main/Mo_nonlinear_text_gen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Generation RNN

This program constructs a character-level sequence model to generate text according to a character distribution learned from the dataset. 

- Try my web app implementation at www.communicatemission.com/ml-projects#text_generation. (Currently, only the standard model is implemented in the app.)
- See more at https://github.com/mvenouziou/Project-Text-Generation.

- See credits / attributions below.

The code implements two different model architectures: "linear" and "nonlinear."
The linear model is built on character-level embeddings alone. The nonlinear model adds a parallel word-level embedding network, which is merged with the character-level path. 

---

**What's New?**
*(These items are original in the sense that I had not seen them elsewhere at the time of coding. Citations are below for content I have seen elsewhere.)*

- Experiments with: Nonlinear model architecture uses parallel RNNs for word-level embeddings and character-level embeddings. 

- Experiments with: TensorFlow Probability layers to create a more interpretable probability distribution model (character model only). The standard text generation algorithm outputs logits, which we view as a distribution from which to generate the next character. Here, we formalize this by having the model output a TF Probability Distribution, using probabilistic weights in the Dense layer (instead of scalars) and training via maximum likelihood. 

- *(Note: the probabilistic model produces extremely poor results when used with the parallel word-level path. Perhaps it requires more units in the inner layers to use this more nuanced model. However, the available processing power (and the desire to ultimately run this without a GPU) limits our ability to increase network size.)*

- Option to implement either the standard linear model architecture (see credits below) or nonlinear architectures.

- Manage RNN statefulness for independent data sources. The approach to statefulness in the linear model code (credited below) imposes a dependence relation between samples / batches. This model can treat independent works (individual poems, books, authors, etc.) as truly independent samples by resetting RNN states and shuffling independent data sources.

- Load and prepare data from multiple CSV and text files. Each row of a CSV and each complete TXT file is treated as an independent data source. (CSV data prep accepts titles and content.) 

- Random crops and start locations to better match training data with desired generated text lengths.

- Parameters to perturb learned probabilities in the final generation function, to add extra variety to generated text.
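
One common way to perturb a learned distribution is to rescale the logits by a temperature before sampling. The sketch below illustrates that idea only; `softmax`, `sample_with_temperature`, and the `temperature` argument are hypothetical names and are not the notebook's `generator` function or its `precision_reduction` parameter, whose behavior is defined in the generation code.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by a temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Draw one index from the temperature-adjusted distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
sharp = softmax(logits, temperature=0.5)  # more peaked than the original
flat = softmax(logits, temperature=2.0)   # closer to uniform, more variety
```

A temperature above 1 flattens the distribution and so adds variety, while a temperature below 1 concentrates probability on the most likely characters.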

---
**Credits / Citations / Attributions:**

**Linear Model and Shared Code** 

Other than items noted in previous sections, this Python code and linear model structure are based heavily on Imperial College London's Coursera course, "Customising your models with Tensorflow 2" *(https://www.coursera.org/learn/customising-models-tensorflow2)* and the TensorFlow RNN text generation documentation *(https://www.tensorflow.org/tutorials/text/text_generation?hl=en).*


**Nonlinear Model:**   

This utilizes pretrained embeddings:
- Small BERT word embeddings from TensorFlow Hub *(credited to Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova's paper "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models," https://tfhub.dev/google/collections/bert/1)*
- ELECTRA-Small++ from TensorFlow Hub *(credited to Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning's paper "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators," https://hub.tensorflow.google.cn/google/electra_small/2)*

ELECTRA-Small++ has four times as many parameters as the Small BERT embedding, producing better results but at a large computational cost.

**Web App:** 

The web app is built on the Anvil platform and (at the time of this writing) is hosted on a Google Cloud server (CPU).

**Datasets:**

- *'robert_frost_collection.csv'* is a Kaggle dataset available at https://www.kaggle.com/archanghosh/robert-frost-collection. Any other datasets used are public domain works available from Project Gutenberg https://www.gutenberg.org.

---

**About**

Find me online at:
- LinkedIn: https://www.linkedin.com/in/movenouziou/ 
- GitHub: https://github.com/mvenouziou

---

In [131]:
#### PACKAGE IMPORTS ####
# TF Model design
import tensorflow as tf
from tensorflow import keras

# TF text processing (also required for TF HUB word encoders)
!pip install -q tensorflow-text
import tensorflow_text as text  

# TF pretrained models (for word encodings)
import tensorflow_hub as hub

# TF probability modules
import tensorflow_probability as tfp  
from tensorflow_probability import layers as tfpl
from tensorflow_probability import distributions as tfd

# TF TensorBoard notebook extension
%load_ext tensorboard
import datetime

# data handling
import numpy as np
import pandas as pd
import string
import random
import re

# file management
import os
import bz2
import pickle
import _pickle as cPickle

# integrations
!pip install -q anvil-uplink
import anvil.server

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


### Set Model Parameters

Define Paramaters class

In [132]:
class Paramaters:
    def __init__(self,  
                 # integrations
                use_gdrive, use_anvil,
                 # model architecture
                use_probability_layers,  # implements TensorFlow Probability
                use_word_path,  # note: TFP layers not recommended with word-level model 
                use_electra, # use False for BERT embeddings (fewer params, word model only)
                # datasets
                author, data_files, 
                datasets_dir='https://raw.githubusercontent.com/mvenouziou/text_generator/main/',
                # model params
                num_trailing_words=5, padded_example_length=8, batch_size=128):
        
        # save param choices
        # note: additional attributes are added below
        self._use_gdrive = use_gdrive
        self._use_anvil = use_anvil
        self._author = author       
        self._num_trailing_words = num_trailing_words
        self._padded_example_length = padded_example_length
        self._batch_size = batch_size
        self._use_probability_layers = use_probability_layers
        self._use_word_path = use_word_path
        self._use_electra = use_electra
        self._data_files = list(data_files)
        self._datasets_dir = datasets_dir
        
        # 3rd party integrations
        # Mount Google Drive:
        if self._use_gdrive:
            self._gdrive_dir = '/content/gdrive/'
            from google.colab import drive
            drive.mount(self._gdrive_dir)
        else:
            self._gdrive_dir = ''

        # Anvil's web app server
        if self._use_anvil:
            anvil.server.connect('53NFXI7IX7IE233XQTVJDXUM-PUGRV2WON2LETWBG')

        # Filepath Structure
        # path name conventions due to model structure
        if self._use_probability_layers :
            self._author +=  '/probability/' 
        if self._use_word_path:
            self._author += '_words_model/'
        if self._use_electra:
            self._author += 'electra/'

        # models / checkpoints directories
        # (Google Drive)
        self._filepath = self._gdrive_dir + 'MyDrive/Colab_Notebooks/models/text_generation/' + self._author
        self._checkpoint_dir = self._filepath + '/checkpoints/'
        self._prediction_model_dir = self._filepath + '/prediction_model/'
        self._training_model_dir = self._filepath + '/training_model/'
        self._processed_data_dir = self._filepath + '/proc_data/'
        self._tensorboard_dir = self._checkpoint_dir  + '/logs/'

        # Create Tokenizer / Set Vocab Size
        # character tokenizer
        def create_character_tokenizer():
        
            char_tokens = string.printable
            filters = '#$%&()*+-/<=>@[]^_`{|}~\t'

            # Initialize standard keras tokenizer
            tokenizer = tf.keras.preprocessing.text.Tokenizer(
                            num_words=None,  
                            filters=filters,
                            lower=False,  # conversion to lowercase letters
                            char_level=True,
                            oov_token=None,  # drop unknown characters
                            )      
            # fit tokenizer
            tokenizer.fit_on_texts(char_tokens)
            
            return tokenizer

        self._character_tokenizer = create_character_tokenizer()
        self._vocab_size = len(self._character_tokenizer.word_index) + 1
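
The `+ 1` in the vocabulary size reserves id 0, which the padding step uses and which no character is mapped to. A minimal pure-Python sketch of that convention follows. It is not the Keras tokenizer: `build_char_index` and `encode` are hypothetical helpers, ids here are assigned in order of first appearance (which may differ from the order Keras chooses), and the `filters` argument is not modeled.

```python
import string

def build_char_index(alphabet):
    """Map each distinct character to an integer id, starting at 1."""
    index = {}
    for char in alphabet:
        if char not in index:
            index[char] = len(index) + 1  # id 0 stays free for padding
    return index

def encode(text, index):
    """Convert text to ids, dropping characters that are not in the index."""
    return [index[char] for char in text if char in index]

char_index = build_char_index(string.printable)
vocab_size = len(char_index) + 1  # +1 accounts for the padding id 0
```

Dropping unknown characters in `encode` mirrors the `oov_token=None` choice above, where characters outside the fitted vocabulary are skipped rather than mapped to a placeholder.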

Create Paramaters (global) object

In [133]:
# parameter customizations
author='tests'
data_files=['robert_frost_collection.csv']
use_gdrive=True
use_anvil=False
use_probability_layers=False
use_word_path=False
use_electra=False

# create paramaters object
PARAMATERS = Paramaters(use_gdrive=use_gdrive, use_anvil=use_anvil, 
                        author=author, data_files=data_files,
                        use_probability_layers=use_probability_layers,
                        use_word_path=use_word_path, use_electra=use_electra)

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


### Define Encoders

Character-Level

In [134]:
def make_padded_array(text_blocks, paramaters):
    # Tokenizes and applies padding for uniform length

    # use the character tokenizer stored on the parameters object
    tokenizer = paramaters._character_tokenizer

    # tokenize
    token_blocks = tokenizer.texts_to_sequences(text_blocks)

    # zero padding
    padded_blocks = tf.keras.preprocessing.sequence.pad_sequences(
                        sequences=token_blocks,  # dataset
                        maxlen=paramaters._padded_example_length, 
                        dtype='int32', 
                        padding='pre',
                        truncating='pre', 
                        value=0.0
                        )
    
    return padded_blocks
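
To make the effect of `padding='pre'` and `truncating='pre'` concrete, here is a small pure-Python sketch of the same behavior on one sequence. `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad or left-truncate a list of token ids to exactly maxlen."""
    trimmed = sequence[-maxlen:]  # keep the most recent tokens
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([5, 6, 7], maxlen=8)          # zeros are added on the left
long = pad_pre(list(range(1, 11)), maxlen=8)  # oldest tokens are dropped
```

Padding and truncating on the left keep the most recent characters at the end of every row, which is where the next-character prediction is made.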

Word-Level (BERT or Electra pre-trained embedding)

In [135]:
def get_word_encoder(paramaters):

    # select pre-trained word encoder (ELECTRA or Small BERT)
    if paramaters._use_electra:
        encoder_url = 'https://tfhub.dev/google/electra_small/2'
    else:
        encoder_url = 'https://tfhub.dev/tensorflow/' \
                            + 'small_bert/bert_en_uncased_L-2_H-128_A-2/1'
    preprocessor_url = 'https://tfhub.dev/tensorflow/' \
                        + 'bert_en_uncased_preprocess/3'
                
    # preprocessing layer
    # get BERT components
    preprocessor = hub.load(preprocessor_url)
    bert_tokenizer = hub.KerasLayer(preprocessor.tokenize,
                                    name='bert_tokenizer')
    bert_packer = hub.KerasLayer(
                    preprocessor.bert_pack_inputs,
                    arguments=dict(seq_length=paramaters._num_trailing_words),
                    name='bert_input_packer')
    word_encoder = hub.KerasLayer(
                        encoder_url, 
                        trainable=False, 
                        name='Word_encoder')
    
    return bert_tokenizer, bert_packer, word_encoder

### Define Data Pre-processors

Load and Clean Datasets

In [136]:
# Function: loader for .csv files
def prepare_csv(filename, paramaters, content_columns=['Name', 'Content'], 
                shuffle_rows=True):
    
    # load data into DataFrame
    dataframe = pd.read_csv(paramaters._datasets_dir + filename).dropna()
    
    # extract titles and content
    # note: column headings must match those below
    if 'Name ' in dataframe.columns:  # required for the Robert Frost set
        # rename returns a new DataFrame, so the result must be reassigned
        dataframe = dataframe.rename(columns={'Name ': 'Name'})
    
    # prepare titles
    try: 
        dataframe['Name'] = dataframe['Name'].apply(
                            lambda x: x.upper() + ':\n')
    except KeyError:
        # no titles found
        content_columns = ['Content']

    # prepare content
    dataframe['Content'] = dataframe['Content'].apply(
                    lambda x: x + '\n')

    # restrict dataset
    dataframe = dataframe[content_columns]

    # shuffle entries (rows)
    if shuffle_rows:
        dataframe = dataframe.sample(frac=1)
    
    # data cleanup
    dataframe = dataframe[content_columns]
    
    # merge desired text columns
    dataframe['merge'] = dataframe[content_columns[0]]
    for i in range(1, len(content_columns)):
        dataframe['merge'] = dataframe['merge'] + dataframe[content_columns[i]]

    # convert to list of strings
    data_list = dataframe['merge'].tolist()
    
    return data_list   


# Function: Load and standardize data files
def load_parse(data_list, display_samples=True):  

    # remove paragraph / line marks and split up words  
    tokenizer = text.WhitespaceTokenizer()

    # tokenize data (outputs bytestrings)
    cleaned_list_byte = [tokenizer.tokenize(data).numpy() for data in data_list]

    # convert data back to string format
    num_entries = len(cleaned_list_byte)

    clean_list = [' '.join(map(lambda x: x.decode(), cleaned_list_byte[i])) 
                    for i in range(num_entries)]

    # Display text samples
    if display_samples:
        num_samples = 5
        inx = np.random.choice(len(clean_list), num_samples, replace=False)
        for example in np.array(clean_list)[inx]:
            print(example)
            print()

        print('len(text_chunks):', len(clean_list))

    return clean_list

In [137]:
def create_input_target_blocks(full_examples, paramaters):
    """ converts text into sliding n-grams of words and characters
    returning input / target sets """

    tokenizer = paramaters._character_tokenizer
    max_len = paramaters._padded_example_length
    num_words = paramaters._num_trailing_words


    # helper function to create word-level inputs
    def update_word_char_lists(text, chars_list, words_list):
        
        words_input = text.split(' ')  # separate words into list
        words_input = words_input[-num_words-1: -1]  # select trailing words

        # convert words list back to string (tensor)
        words_input = ' '.join(words_input)

        # add values to lists
        chars_list.append(text)
        words_list.append([words_input])
        
        return None

    blocks = []
    for example in full_examples:      

        char_block = []
        word_block = []
        example_length = len(example)

        # small blocks at start (will be zero-padded later)
        leading_characters = 1  # min chars to seed predictions
        for i in range(leading_characters, example_length - max_len - 1):
            text = example[: i]
            update_word_char_lists(text, char_block, word_block)

        # full length blocks
        for i in range(example_length - max_len - 1):
            # create n-gram
            text = example[i: max_len + i]
            update_word_char_lists(text, char_block, word_block)

        # small blocks at end (will be zero-padded later)
        for i in range(example_length - max_len - 1, example_length-1):
            text = example[i: ]
            update_word_char_lists(text, char_block, word_block)
    
        # tokenize and add pre-padding
        char_block = make_padded_array(char_block, paramaters)

        # separate into inputs and targets
        inputs_char = char_block[:, :-1]
        targets_char = char_block[:, 1:]

        # update blocks
        word_block = np.array(word_block)
        blocks.append((inputs_char, word_block, targets_char))

    return blocks
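
The core of the function above is a sliding window over characters, with targets equal to the inputs shifted by one position. The sketch below shows that idea on a toy string. `sliding_windows` and `split_input_target` are hypothetical helpers; they work on raw strings rather than padded token arrays, and the window range is simplified compared with the loops above.

```python
def sliding_windows(text, window):
    """Return every run of `window` consecutive characters in `text`."""
    return [text[i:i + window] for i in range(len(text) - window + 1)]

def split_input_target(block):
    """Input is all but the last item; target is all but the first."""
    return block[:-1], block[1:]

windows = sliding_windows("hello", window=4)
pairs = [split_input_target(w) for w in windows]
```

Each target character is the one that follows the corresponding input character, which is what lets the model learn next-character prediction at every position.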

In [138]:
# Function: data prep to create stateful RNN batches
# note: This will be applied separately on each example text, 
# so that RNN can reset internal state / distinguish between unrelated passages
# note: This code is taken directly from Imperial College London's 
# Coursera course cited above

def preprocess_stateful(char_input, word_input, target, paramaters):

    batch_size = paramaters._batch_size

    # Prepare input and output arrays for training the stateful RNN
    num_examples = char_input.shape[0]

    # adjust for batch size to divide evenly into sample size
    num_processed_examples = num_examples - (num_examples % batch_size)
    input_cropped = char_input[:num_processed_examples]
    target_cropped = target[:num_processed_examples]

    # separate out samples so rows of data match up across epochs
    # 'steps' measures how to space them out
    steps = num_processed_examples // batch_size  

    # define reordering
    inx = np.empty((0,), dtype=np.int32)  # initialize empty array object
    
    for i in range(steps):
        inx = np.concatenate((inx, i + np.arange(0, num_processed_examples, 
                                                    steps)))

    # reorder the data
    input_char_stateful = input_cropped[inx]
    input_word_stateful = word_input[inx]
    target_seq_stateful = target_cropped[inx]

    return input_char_stateful, input_word_stateful, target_seq_stateful
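
The reordering is easier to follow on a tiny example. This sketch reproduces the index computation with plain Python ranges; `stateful_order` is a hypothetical name, and the sizes are chosen only for illustration.

```python
def stateful_order(num_examples, batch_size):
    """Index order that lines up consecutive examples across batches."""
    usable = num_examples - (num_examples % batch_size)  # drop the remainder
    steps = usable // batch_size                         # number of batches
    order = []
    for i in range(steps):
        order.extend(range(i, usable, steps))
    return order

order = stateful_order(num_examples=10, batch_size=3)
batches = [order[i:i + 3] for i in range(0, len(order), 3)]
```

With these sizes the batches are `[0, 3, 6]`, `[1, 4, 7]`, and `[2, 5, 8]`: row 0 sees examples 0, 1, 2 in successive batches, so the state a stateful RNN carries over for that row always comes from the preceding example.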

Input Pipeline

In [139]:
def input_pipeline(paramaters, verbose=True, fresh_process=False):

    # unpack param
    saved_proc_dir = paramaters._processed_data_dir
    filepath = paramaters._datasets_dir

    # load previously processed data (pbz2 compressed file format)
    try:    
        assert(fresh_process is False)  # otherwise create dataset from files

        with bz2.open(saved_proc_dir + 'datafiles.pbz2', 'rb') as file:
            data_dict = cPickle.load(file)

        X_data_list = data_dict['X_data_list']
        Y_data_list = data_dict['Y_data_list']

        print('loaded saved pre-processed data')

    except:       

        # load data file
        data_list = []
        for filename in paramaters._data_files:
            print(filename)
            print(filepath + '/' + filename)

            # check file extension and select loader (csv or txt)
            _, file_extension = os.path.splitext(filename)     

            if file_extension == '.csv':   
                data = prepare_csv(filename, paramaters=paramaters,
                                   content_columns=['Name', 'Content'], 
                                   shuffle_rows=True)
            
            else: # file_extension == '.txt':
                with open(filepath + '/' + filename, 'r', encoding='utf-8') as file:
                    # treat the complete file as one independent data source
                    data = [file.read()]

            # add extracted list of texts to data list
            data_list += data

        if verbose:
            print('PROGRESS: data_list created')
        
        # clean data
        clean_list = load_parse(data_list, display_samples=False)
        if verbose:
            print('PROGRESS: clean_list created')
        
        # preprocess data
        tokenizer = paramaters._character_tokenizer
        blocks = create_input_target_blocks(full_examples=clean_list, 
                                            paramaters=paramaters)
        
        if verbose:
            print('PROGRESS: blocks created')
        
        # create separate input / target pairs for each block
        X_data_list = []
        Y_data_list = []

        i=0
        for block in blocks:
            if i % 10 == 0:
                print(f'PROGRESS: processing block {i} of {len(blocks)}')

            char_input = block[0] 
            word_input = block[1] 
            target = block[2]

            input_char_stateful, input_word_stateful, target_seq_stateful = \
                                    preprocess_stateful(char_input=char_input, 
                                                        word_input=word_input, 
                                                        target=target, 
                                                        paramaters=paramaters)

            # group for model input
            X = [input_char_stateful, input_word_stateful]
            Y = target_seq_stateful

            X_data_list.append(X)
            Y_data_list.append(Y)

            # advance index
            i += 1

        # save file (pbz2 compressed file format)
        with bz2.BZ2File(saved_proc_dir + 'datafiles.pbz2', 'wb') as sfile:
            cPickle.dump({'X_data_list': X_data_list, 
                          'Y_data_list': Y_data_list}, sfile)

    return X_data_list, Y_data_list

### Define Models and Training Loop

Compiler

In [140]:
#VOCAB_SIZE = len(create_character_tokenizer().word_index) + 1

def neg_log_likely_logits(y_true, y_pred, paramaters):
    """ loss function for probabilistic model """

    # convert labels to one-hot vectors
    y_true_hot = tf.one_hot(y_true, 
                            depth=paramaters._vocab_size,
                            axis=-1)

    # return negative log likelihood
    return -y_pred.log_prob(y_true_hot)
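
For a categorical distribution defined by logits, the negative log-likelihood of a one-hot target is the negative log-softmax at the target index. The sketch below computes that quantity in plain Python for a single prediction. `log_softmax` and `negative_log_likelihood` are hypothetical helpers, and the sketch assumes fixed logits, whereas the probabilistic Dense layer samples its weights.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def negative_log_likelihood(logits, target_index):
    """-log p(target) for a categorical distribution given by logits."""
    return -log_softmax(logits)[target_index]

loss = negative_log_likelihood([2.0, 1.0, 0.1], target_index=0)
```

For fixed logits this is the same quantity that sparse categorical cross-entropy with `from_logits=True` computes, so the two loss branches differ mainly in how the logits are produced.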

In [141]:
def compile_model(model, learning_rate, paramaters):
    from keras.optimizers import RMSprop
    from keras.losses import SparseCategoricalCrossentropy

    metrics=['sparse_categorical_accuracy']

    # select loss function
    if paramaters._use_probability_layers:
        # use negative log likelihood for probabilistic model
        loss_fn = lambda y_true, y_pred: neg_log_likely_logits(
                    y_true, y_pred, paramaters=paramaters)
        
        metrics=['sparse_categorical_accuracy', 
                 'sparse_top_k_categorical_accuracy']
    else:
        loss_fn = SparseCategoricalCrossentropy(from_logits=True)
        metrics=['sparse_categorical_accuracy']

    # compile model
    model.compile(optimizer=RMSprop(learning_rate=learning_rate),
                  loss=loss_fn,
                  metrics=metrics)
    return model

Training Model

In [142]:
# Function: Model Definition
def get_training_model(paramaters, for_prediction=False, 
                       load_fresh=True, verbose=True):
    
    """ Defines and compiles our stateful RNN model. 
    Note: batch size is required argument for stateful RNN. """

    # parameters
    if for_prediction:
        batch_size=1
    else:
        batch_size = paramaters._batch_size

    use_word_path = paramaters._use_word_path
    use_probability_layers = paramaters._use_probability_layers
    num_words = paramaters._num_trailing_words

    vocab_size = paramaters._vocab_size
    if use_word_path:
        embedding_dim = 32*6
    else:
        embedding_dim = 32*8
    merge_dim = embedding_dim

    # load and return saved model
    if load_fresh is False:
        model = tf.keras.models.load_model(paramaters._training_model_dir)
        return model
    
    from keras.layers import Input, Embedding, Concatenate, Dense, GRU,\
                             Average, AveragePooling1D, Dropout, \
                             BatchNormalization, Lambda

    # pre-trained encoder
    bert_tokenizer, bert_packer, bert_encoder = \
            get_word_encoder(paramaters)
    
    # Build model
    # define input shapes
    input_1 = Input(shape=(None, ), 
                    batch_size=batch_size,
                    dtype=tf.int32, 
                    name='char_input')
    
    input_2 = Input(shape=(), 
                    batch_size=batch_size,
                    dtype=tf.string, 
                    name='word_input')

    # travel individual paths
    # Character Level Path
    # ## Char: Embedding
    x1 = Embedding(input_dim=vocab_size, output_dim=embedding_dim, 
                   mask_zero=True, batch_input_shape=(batch_size, None),
                   name='char_embedding',)(input_1)

    # ## Char: GRU 1
    x1 = GRU(units=embedding_dim, stateful=True, 
             return_sequences=True, name='char_GRU_1',)(x1)
    #x1 = Dropout(rate=.10, name='char_Dropout_1')(x1)
    x1 = BatchNormalization(name='char_Batch_Norm_1')(x1)
    
    # ## Char: GRU Final --  must use output_dim = merge_dim!
    x1 = GRU(units=merge_dim, stateful=True, 
             return_sequences=True, name='char_GRU_final',)(x1)
    #x1 = Dropout(rate=.10, name='char_Dropout_final')(x1)
    x1 = BatchNormalization(name='char_Batch_Norm_final')(x1)

    # Word Encoding Path
    if use_word_path:
        
        x2 = bert_tokenizer(input_2)  # tokenize
        x2 = bert_packer([x2])  # pack inputs for encoder
        x2 = bert_encoder(x2)['sequence_output'] # encoding

        # ## Word: GRU 1
        x2 = GRU(units=32, stateful=True, 
                 return_sequences=True, name='word_GRU_1',)(x2)
        x2 = Dropout(rate=.10, name='word_Dropout_1')(x2)
        x2 = BatchNormalization(name='word_Batch_Norm_1')(x2)

        # ## Word: GRU 2
        x2 = GRU(units=32, stateful=True, 
                 return_sequences=True, name='word_GRU_2',)(x2)
        x2 = Dropout(rate=.10, name='word_Dropout_2')(x2)
        x2 = BatchNormalization(name='word_Batch_Norm_2')(x2)

        # ## Word: Required conversion to valid merge output dim = merge_dim!
        x2 = Dense(units=num_words, activation=None, 
                   name='word_Dense_pre_final')(x2)
        x2 = AveragePooling1D(pool_size=5, padding='same', 
                              name='word_pooling_final')(x2)

        # prepare character and word paths for merge
        # (layer names must be unique within a model)
        x1 = Dense(units=merge_dim, activation=None, 
                   name='char_Dense_final')(x1)

        x2 = Dense(units=merge_dim, activation=None, 
                   name='word_Dense_final')(x2)

        # Merge Paths
        x = Average(name='merged_layers')([x1, x2])

    else:  # update variable id to match next step
        x = Lambda(lambda x: x, name='rename_variable')(x1)  
    
    # Final GRU layer
    x = GRU(units=embedding_dim, stateful=True, 
            return_sequences=True, name='GRU_OUTPUT')(x)          

    # Character prediction (logits)
    # Note: Tensorflow Probability produces very poor results 
    # when paired with the word path in this architecture
    
    if use_probability_layers:
        # Dense layer with probabilistic weights
        x = tfpl.DenseReparameterization(
                units=tfpl.OneHotCategorical.params_size(vocab_size),
                activation=None)(x)          
        
        outputs = tfpl.OneHotCategorical(
                    vocab_size,
                    convert_to_tensor_fn=tfd.OneHotCategorical.logits,
                    name='Decoding')(x) 
    else:
        outputs = Dense(units=vocab_size, 
                        activation=None, 
                        name='Decoding')(x)       

    # create model
    model = keras.Model(inputs=[input_1, input_2], outputs=outputs)

    if verbose:
        print(model.summary())

    model = compile_model(model, learning_rate=.01, paramaters=paramaters)

    return model

Prediction Model

In [143]:
def get_prediction_model(trained_model, verbose, paramaters):

    """ rebuilds the model with batch size = 1 and copies the weights
     from the trained model, if one is supplied """

    # create model
    prediction_model = get_training_model(for_prediction=True,
                                          paramaters=paramaters,
                                          load_fresh=True,
                                          verbose=verbose)

    # load weights from pre-trained model
    if trained_model is not None:        
        trained_weights = trained_model.get_weights()
        prediction_model.set_weights(trained_weights)

    return prediction_model

Checkpoint Manager

In [144]:
# checkpoint manager
def create_checkpoint_manager(model, paramaters):

    checkpoint = tf.train.Checkpoint(model=model)

    checkpoint_manager = tf.train.CheckpointManager(
                            checkpoint=checkpoint, 
                            directory=paramaters._checkpoint_dir, 
                            max_to_keep=4, 
                            keep_checkpoint_every_n_hours=None,
                            checkpoint_name='ckpt', 
                            step_counter=None, 
                            checkpoint_interval=None,
                            init_fn=None
                            )
    
    return checkpoint, checkpoint_manager

Training Loop

In [145]:
# Function: Train model
def train_model(model, X_data_list, Y_data_list,
                num_epochs, 
                num_datasets_to_use,
                checkpoint, 
                checkpoint_manager,
                learning_rate,
                paramaters):

    # get params
    batch_size = paramaters._batch_size

    # compile model
    model = compile_model(model=model, learning_rate=learning_rate, 
                          paramaters=paramaters)

    # set checkpoint manager
    if checkpoint is None or checkpoint_manager is None:
        checkpoint, checkpoint_manager = \
                    create_checkpoint_manager(model=model, paramaters=paramaters)
                    
    # set callbacks
    # TensorBoard callback
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
                                log_dir=paramaters._tensorboard_dir, 
                                histogram_freq=1,
                                )
    
    # organize training data
    num_blocks = len(X_data_list)
    train_datasets_list = list(zip(X_data_list, Y_data_list)) 

    # if not specified, use all datasets
    if num_datasets_to_use is None \
            or num_datasets_to_use == -1 \
            or num_datasets_to_use > len(X_data_list):
        num_datasets_to_use = len(X_data_list)      
    
    # begin training loop
    for epoch in range(num_epochs):

        print(f'Epoch: {epoch}')

        # shuffle dataset order
        random.shuffle(train_datasets_list)
        print('shuffled datasets')


        # apply any filters
        """
        filtered_datasets_list = train_datasets_list  # no filter
        """
        
        # impose minimum sample size:
        # (avoids overtraining to small sample sets)
        min_size = random.choice(range(14*batch_size, 40*batch_size, batch_size))
        filtered_datasets_list = [train_datasets_list[i] 
                                  for i in range(len(X_data_list))
                                  if train_datasets_list[i][1].shape[0] > min_size]

        # compute min batches found among chosen datasets
        # this will be used to impose balanced sample sizes among training sets
        min_batches = np.min([filtered_datasets_list[i][1].shape[0]
                              for i in range(num_datasets_to_use)])
        min_batches = min_batches // batch_size
        
        x_data_list = []
        y_data_list = []
        
        print('Preparing epoch datasets')
        for i in range(num_datasets_to_use):

            # select dataset
            data = filtered_datasets_list[i]
            X = data[0]
            Y = data[1]

            
            # crop to uniform length, from random starting points
            # (avoids overtraining to large sample sets,
            # or to the start of each sample,
            # and improves training speed)
            multiple = max(Y.shape[0] // (min_batches * batch_size), 1)
            rand = np.random.randint(multiple) 
            start = rand * min_batches * batch_size
            end = (1 + rand) * min_batches * batch_size

            x_data_list.append([X[0][start:end, : ], X[1][start:end, : ]])
            y_data_list.append(Y[start:end, : ])

        
        for i in range(num_datasets_to_use):
            print(f'dataset: {i}')

            dataset = tf.data.Dataset.from_tensor_slices(
                ((x_data_list[i][0] , x_data_list[i][1]), y_data_list[i])
                )
            
            dataset = dataset.batch(batch_size, drop_remainder=True)\
                             .prefetch(tf.data.experimental.AUTOTUNE)
                
            # train model
            history = model.fit(dataset,
                                shuffle=False,
                                epochs=1,
                                verbose=1,                                
                                #callbacks=[tensorboard_callback],
                                )
            
            # reset RNN hidden states
            model.reset_states()

            # save checkpoint
            checkpoint_manager.save()

    return model
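
The cropping step inside the training loop above can be sketched in isolation. This is a minimal illustration using only the standard library, not the notebook's own code: the helper name `crop_window` and the sizes in the usage lines are hypothetical. It shows how a block-aligned window of `min_batches * batch_size` rows is chosen at a random offset.

```python
import random

def crop_window(num_rows, min_batches, batch_size, rng=random):
    """Return (start, end) row indices of one randomly chosen, block-aligned crop."""
    block = min_batches * batch_size        # rows kept from every dataset
    multiple = max(num_rows // block, 1)    # number of whole blocks available
    rand = rng.randrange(multiple)          # pick one block uniformly at random
    start = rand * block
    end = start + block
    return start, end

# hypothetical sizes: 1000 rows, batch size 32, smallest dataset has 5 batches
start, end = crop_window(num_rows=1000, min_batches=5, batch_size=32,
                         rng=random.Random(0))
print(start, end)
```

Every dataset contributes the same number of rows per epoch, and the window always starts on a block boundary, so each crop divides evenly into batches.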

Saving Models

In [146]:
# Store trained model separate from checkpoints
def save_model(model, model_type, paramaters):

    if model_type == 'training':
        model_dir = paramaters._training_model_dir
    elif model_type == 'prediction':
        model_dir = paramaters._prediction_model_dir
    else:
        raise ValueError(f'unknown model_type: {model_type!r}')

    # save model
    model.save(model_dir)

    return None

### Define Implementation Functions


In [147]:
def convert_to_input(last_token, text_string, paramaters):
        
    # words
    if paramaters._use_word_path:
        num_words = paramaters._num_trailing_words

        words_input = text_string.split(' ')  # separate words 
        words_input = words_input[-num_words-1:-1]  # get trailing words
        words_input = tf.constant(' '.join(words_input))  # convert to tensor
        
    else:
        words_input=tf.constant(' ')

    # convert character tokens to tensor
    inputs_char=tf.constant(last_token)

    # bundle character and word inputs for the model
    X = [inputs_char, words_input]

    return X
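
The word-path slice in `convert_to_input` is easy to misread, so here is a small stand-alone sketch of it. The helper name `trailing_words` is hypothetical. The slice `[-num_words-1:-1]` keeps the last `num_words` complete words and skips the final token, which may be a word still being generated.

```python
def trailing_words(text_string, num_words):
    """Mirror the slice used in convert_to_input: the last `num_words`
    complete words, skipping the final (possibly unfinished) token."""
    words = text_string.split(' ')
    return ' '.join(words[-num_words - 1:-1])

print(trailing_words('THE ROAD LESS TRAV', num_words=2))  # ROAD LESS
print(trailing_words('THE ROAD LESS ', num_words=2))      # ROAD LESS
print(repr(trailing_words('HI', num_words=2)))            # ''
```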

In [148]:
def generator(input_text, 
              prediction_model, 
              precision_reduction, 
              num_characters, 
              print_result,
              paramaters):

    # get character tokenizer from parameters
    tokenizer = paramaters._character_tokenizer
    
    # initialize generated text
    last_token =  tokenizer.texts_to_sequences([input_text])
    output_text = input_text.upper() + '\n'
    generated_text = list(output_text)
    
    # text generation loop
    initial_state = None

    for _ in range(num_characters):
        
        # prepare input for model
        inputs = convert_to_input(last_token=last_token, 
                                  text_string=output_text,
                                  paramaters=paramaters)
            
        # pass forward final GRU layer state
        GRU_layer = prediction_model.get_layer('GRU_OUTPUT')
        GRU_layer.reset_states(initial_state)
        
        # run model and get logits of last character prediction
        logits = prediction_model(inputs)[0, -1, :]
        
        # perturb probabilities before selecting token
        if precision_reduction != 0:   
            
            # perturb logits
            logits = logits.numpy()   
            fuzz_factor = tf.random.normal(shape=logits.shape, mean=1, stddev=.2)
            logits = logits * (1 + precision_reduction * fuzz_factor)

        # select token
        last_token = tf.random.categorical(logits=[logits], num_samples=1)       
        last_token = last_token.numpy().tolist()

        # replace any invalid character token
        if last_token == [[0]]:
            last_token = [[1]]

        #  get GRU state for next character prediction
        initial_state = GRU_layer.states[0].numpy()
        
        # get input for next character prediction
        input_text = tokenizer.sequences_to_texts(last_token)
        input_text = input_text[0]

        # record generated character
        generated_text.append(input_text)

        # reset for next run
        output_text = ''.join(generated_text)
        
    if print_result:
        print(output_text)
    
    return output_text
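
The token-selection step in `generator` relies on `tf.random.categorical`, which draws an index with probability proportional to the softmax of the logits. As a rough illustration only, the sketch below does the same thing with the standard library instead of TensorFlow. The names `softmax` and `sample_token` and the example logits are assumptions made for this sketch.

```python
import math
import random

def softmax(logits):
    """Convert unnormalized log-probabilities into probabilities."""
    peak = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, rng=random, fallback=1):
    """Draw one token index; swap index 0 (treated as invalid) for `fallback`."""
    probs = softmax(logits)
    token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return fallback if token == 0 else token

rng = random.Random(0)
tokens = [sample_token([0.0, 2.0, 1.0, -1.0], rng=rng) for _ in range(5)]
print(tokens)   # five indices drawn from {1, 2, 3}; index 0 never appears
```

Sampling, rather than always taking the most likely index, is what lets repeated runs from the same prompt produce different text.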

Final generation function for end user

In [149]:
def generate_text(starting_text, 
                  precision_reduction,
                  prediction_model,
                  print_result,
                  paramaters,
                  num_generation_steps=150): # set length of generated text

    # format user input
    starting_text = starting_text.upper() + ': '

    # get generated text
    # note: very rarely this produces a line indexing error;
    # if that happens, rerun the prediction

    prediction = generator(input_text=starting_text, 
                                prediction_model=prediction_model, 
                                precision_reduction=precision_reduction, 
                                num_characters=num_generation_steps, 
                                print_result=print_result,
                                paramaters=paramaters)
                                    
   
    # define formatting rules
    split_on = ['?', '.', ',', ';', '!', ':']
    splits = '([' + ''.join(split_on) + '])'
    split_lines_prediction = re.split(splits, prediction)

    # format output
    output = ''
    for line in split_lines_prediction:

        # capitalize first character of each line
        if len(line) >= 1:
            line_update = line[0].upper()  

            # add capitalized letter to remainder of line
            if len(line) >= 2:
                line_update += line[1:]
        else:
            line_update = ''
                
        # update output text
        if (len(line_update) >= 1 and line_update[-1] in split_on) \
          or line_update.endswith('\n'):
            output = ''.join([output, line_update])
        else:
            output = '\n'.join([output, line_update])

    
    return output + '... '
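
The formatting rules in `generate_text` depend on how `re.split` behaves when the pattern contains a capture group: the delimiters are kept as separate list items instead of being discarded. A short sketch with a made-up input string:

```python
import re

split_on = ['?', '.', ',', ';', '!', ':']
pattern = '([' + ''.join(split_on) + '])'   # capture group keeps the delimiters

pieces = re.split(pattern, 'the road: less traveled, by me.')
print(pieces)
# ['the road', ':', ' less traveled', ',', ' by me', '.', '']
```

Text pieces and punctuation alternate in the result, which is why the loop appends punctuation directly and starts other pieces on a new line. Pieces that follow a delimiter usually begin with a space, so upper-casing index 0 of such a piece leaves it unchanged.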

# Implementation

Load and Process Data

In [150]:
require_fresh_process = False

try:
    # check if prepared datasets already in memory
    assert(require_fresh_process is False)
    assert(len(X_data_list) > 0)
    print('dataset already loaded')

except (AssertionError, NameError):
    print('Preparing dataset')
    X_data_list, Y_data_list = input_pipeline(paramaters=PARAMATERS, 
                                              fresh_process=require_fresh_process)

Preparing dataset
loaded saved pre-processed data


Initialize Training Model

In [151]:
training_model = get_training_model(paramaters=PARAMATERS)









Model: "model_7"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_input (InputLayer)         [(128, None)]        0                                            
__________________________________________________________________________________________________
char_embedding (Embedding)      (128, None, 256)     25856       char_input[0][0]                 
__________________________________________________________________________________________________
char_GRU_1 (GRU)                (128, None, 256)     394752      char_embedding[0][0]             
__________________________________________________________________________________________________
char_Batch_Norm_1 (BatchNormali (128, None, 256)     1024        char_GRU_1[0][0]                 
____________________________________________________________________________________________

Load Latest Training Checkpoint

In [152]:
load_checkpoint=False

try:
    # load from checkpoint
    assert(load_checkpoint is True)
    checkpoint, checkpoint_manager = \
        create_checkpoint_manager(model=training_model, paramaters=PARAMATERS)

    checkpoint_manager.restore_or_initialize()
    print('loaded checkpoint')

except Exception:
    
    print('No matching checkpoints')
    checkpoint=None 
    checkpoint_manager=None

No matching checkpoints


Train Model

*Caution with the nonlinear model: overfitting can be a major problem. With enough training, the model can memorize complete segments of the source material and return them verbatim. A precision reduction factor can partially compensate. (This parameter in my final prediction function randomly perturbs the predicted logits before the next character is sampled.)*
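
To make the precision reduction factor concrete, here is a minimal sketch of the perturbation applied in `generator`, written with the standard library rather than TensorFlow. The function names and the logit values are made up for illustration. Each logit is multiplied by `1 + precision_reduction * fuzz`, where `fuzz` is drawn from a normal distribution with mean 1 and standard deviation 0.2.

```python
import math
import random

def perturb_logits(logits, precision_reduction, rng=random):
    """Scale each logit by (1 + precision_reduction * fuzz), fuzz ~ Normal(1, 0.2)."""
    if precision_reduction == 0:
        return list(logits)                       # no perturbation requested
    return [x * (1 + precision_reduction * rng.gauss(1.0, 0.2)) for x in logits]

def softmax(logits):
    """Convert logits to probabilities (max subtracted for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                          # made-up values for illustration
rng = random.Random(0)
for factor in (0, 0.5, 2.0):
    noisy = perturb_logits(logits, factor, rng=rng)
    print(factor, [round(p, 3) for p in softmax(noisy)])
```

Because the noise is centred on 1, the average multiplier is `1 + precision_reduction`. This form of perturbation therefore rescales the logits as well as jittering them, which is worth keeping in mind when choosing a value.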

In [None]:
train_model_now = True

learning_rate = 0.001
num_epochs = 200
num_datasets_per_epoch = 1 # small numbers are best so that some epochs
                           # have larger quantity of elements per batch. 
                           # Use '-1' to include all datasets

# train model
if train_model_now:
    training_model = train_model(training_model, X_data_list, Y_data_list,
                                num_epochs=num_epochs, 
                                learning_rate=learning_rate,
                                num_datasets_to_use=num_datasets_per_epoch,
                                paramaters=PARAMATERS,
                                checkpoint=checkpoint, 
                                checkpoint_manager=checkpoint_manager)  



Epoch: 0
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 1
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 2
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 3
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 4
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 5
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 6
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 7
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 8
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 9
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 10
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 11
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 12
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 13
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 14
shuffled datasets
Preparing epoch datasets
dataset: 0
Epoch: 15
shuffled datasets
Preparing epoch datase

Create Prediction Model

In [None]:
prediction_model = get_prediction_model(trained_model=training_model, 
                                        verbose=True, 
                                        paramaters=PARAMATERS)

Test Output: Generate Text

In [None]:
starting_text = 'The road less'  # alternative: 'AI is becoming accessible'
precision_reduction = 0 

gen = generate_text(starting_text=starting_text, 
                    prediction_model=prediction_model,
                    precision_reduction=precision_reduction,
                    print_result=True,
                    paramaters=PARAMATERS,
                    num_generation_steps=150)

print(gen)

Save Models

In [None]:
save_model_now = True

if save_model_now:
    # training model
    save_model(training_model, model_type='training', paramaters=PARAMATERS)

    # prediction model
    save_model(prediction_model, model_type='prediction', paramaters=PARAMATERS)

Anvil Web App Server Integration

In [None]:
if USE_ANVIL:

    # get tokenizer
    tokenizer = create_character_tokenizer()

    @anvil.server.callable
    def anvil_callable(starting_text, precision_reduction=0,
                        paramaters=PARAMATERS,
                        prediction_tokenizer=PARAMATERS._character_tokenizer,  # TODO: remove (unused)
                        prediction_model=prediction_model,
                        print_result=True,
                        author='assorted',
                        num_generation_steps=150):

        return generate_text(starting_text=starting_text, 
                                precision_reduction=precision_reduction,
                                prediction_model=prediction_model,
                                print_result=print_result,
                                paramaters=paramaters,
                                num_generation_steps=num_generation_steps)

    # start persistent connection to server
    anvil.server.wait_forever()