# Introduction

This notebook was inspired by the following articles, repositories, and discussions:

- Medium
    - How to build a Recurrent Neural Network in TensorFlow [[1]](https://medium.com/@erikhallstrm/hello-world-rnn-83cd7105b767)[[2]](https://medium.com/@erikhallstrm/tensorflow-rnn-api-2bb31821b185)[[3]](https://medium.com/@erikhallstrm/using-the-tensorflow-lstm-api-3-7-5f2b97ca6b73)[[4]](https://medium.com/@erikhallstrm/using-the-tensorflow-multilayered-lstm-api-f6e7da7bbe40)[[5]](https://medium.com/@erikhallstrm/using-the-dynamicrnn-api-in-tensorflow-7237aba7f7ea)[[6]](https://medium.com/@erikhallstrm/using-the-dropout-api-in-tensorflow-2b2e6561dfeb)  
    - [RNN example by Python](https://towardsdatascience.com/recurrent-neural-networks-by-example-in-python-ffd204f99470)       
- GitHub repos
    - [char-rnn-tensorflow](https://github.com/sherjilozair/char-rnn-tensorflow)
    - [RNN](https://github.com/WillKoehrsen/recurrent-neural-networks/tree/master/notebooks)
- Kaggle Repos
    - [Learn by example RNN/LSTM/GRU time series](https://www.kaggle.com/charel/learn-by-example-rnn-lstm-gru-time-series)
- Tutorial blogs
    - Machine Learning Mastery: [How to Develop a Character-Based Neural Language Model in Keras](https://machinelearningmastery.com/develop-character-based-neural-language-model-keras/)
    - Adventures in Machine Learning: [Keras LSTM tutorial](http://adventuresinmachinelearning.com/keras-lstm-tutorial/)
- Troubleshooting
    - [Input size of the LSTM layer](https://github.com/keras-team/keras/issues/2045)

# Imports

In [None]:
import pandas as pd
import numpy as np

# A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, 
# using a specialized syntax held in a pattern.
# https://www.tutorialspoint.com/python/python_reg_expressions.htm
import re

# KERAS
from keras.preprocessing.text import Tokenizer

# SKlearn
from sklearn.utils import shuffle

# Functions

## Formatting

In [None]:
# This uses the re package
def format_patent(patent):
    """Add a space before punctuation and remove parenthesized numeric references."""

    # Insert a space before . , ; or ? unless the preceding character
    # is whitespace or a digit, so punctuation becomes its own token
    # while decimals such as 0.5 stay intact
    patent = re.sub(r'(?<=[^\s0-9])(?=[.,;?])', r' ', patent)

    # Remove figure and citation references: any digits in parentheses, e.g. (12)
    patent = re.sub(r'\((\d+)\)', r'', patent)

    # Collapse each pair of consecutive whitespace characters into one space
    patent = re.sub(r'\s\s', ' ', patent)
    return patent
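
To make the effect of the three substitutions concrete, here is a minimal sketch on a made-up sentence. The helper name `format_text_sketch` and the sample text are assumptions for illustration only; the regular expressions are copied from `format_patent`.

```python
import re


def format_text_sketch(text):
    """Hypothetical helper that applies the same three substitutions as format_patent."""
    # Space before . , ; ? unless preceded by whitespace or a digit
    text = re.sub(r'(?<=[^\s0-9])(?=[.,;?])', ' ', text)
    # Drop parenthesized numbers such as (12)
    text = re.sub(r'\((\d+)\)', '', text)
    # Replace each pair of whitespace characters with a single space
    text = re.sub(r'\s\s', ' ', text)
    return text


sample = 'The network (12) learns quickly, with rate 0.5.'
cleaned = format_text_sketch(sample)
print(cleaned)  # The network learns quickly , with rate 0.5.
```

The comma is split off as its own token, the reference `(12)` disappears, and the periods that follow a digit are left untouched.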

## Sequencing

In [None]:
def make_sequences(texts,
                   training_length=50,
                   lower=True,
                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'):
    """Turn a set of texts into sequences of integers"""

    # Create the tokenizer object and train on texts
    tokenizer = Tokenizer(lower=lower, filters=filters)
    tokenizer.fit_on_texts(texts)

    # Create look-up dictionaries and reverse look-ups    
    word_idx = tokenizer.word_index # for each word there is an associated index
    idx_word = tokenizer.index_word # for each index there is an associated word
    num_words = len(word_idx) + 1
    word_counts = tokenizer.word_counts

    print(f'There are {num_words} unique words.')

    # Convert text to sequences of integers
    sequences = tokenizer.texts_to_sequences(texts)

    # Find the sequences with more than training_length + 20 tokens
    seq_lengths = [len(x) for x in sequences]
    over_idx = [
        i for i, l in enumerate(seq_lengths) if l > (training_length + 20)
    ]

    new_texts = []
    new_sequences = []

    # Keep only the texts and sequences that pass the length filter
    for i in over_idx:
        new_texts.append(texts[i])
        new_sequences.append(sequences[i])

    training_seq = []
    labels = []

    # Iterate through the sequences of tokens
    for seq in new_sequences:

        # Create multiple training examples from each sequence
        for i in range(training_length, len(seq)):
            # Extract the features and label
            extract = seq[i - training_length:i + 1]

            # Set the features and label
            training_seq.append(extract[:-1])
            labels.append(extract[-1])

    print(f'There are {len(training_seq)} training sequences.')

    # Return everything needed for setting up the model
    return word_idx, idx_word, num_words, word_counts, new_texts, new_sequences, training_seq, labels

## One-hot encoding and Train/Valid Split

In [None]:
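def create_train_valid(features,
                       labels,
                       num_words,
                       train_fraction=0.7,
                       random_state=50):
    """Create training and validation features and labels."""

    # The defaults mirror TRAIN_FRACTION and RANDOM_STATE from the Parameters
    # section; those constants are defined later in the notebook, so using
    # them as defaults here would raise a NameError when run top to bottom

    # Randomly shuffle features and labels
    features, labels = shuffle(features, labels, random_state=random_state)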

    # Decide on number of samples for training
    train_end = int(train_fraction * len(labels))

    train_features = np.array(features[:train_end])
    valid_features = np.array(features[train_end:])

    train_labels = labels[:train_end]
    valid_labels = labels[train_end:]

    # Convert to arrays
    X_train, X_valid = np.array(train_features), np.array(valid_features)

    # Using int8 for memory savings
    y_train = np.zeros((len(train_labels), num_words), dtype=np.int8)
    y_valid = np.zeros((len(valid_labels), num_words), dtype=np.int8)

    # One hot encoding of labels
    for example_index, word_index in enumerate(train_labels):
        y_train[example_index, word_index] = 1

    for example_index, word_index in enumerate(valid_labels):
        y_valid[example_index, word_index] = 1

    # Memory management
    import gc
    gc.enable()
    del features, labels, train_features, valid_features, train_labels, valid_labels
    gc.collect()

    return X_train, X_valid, y_train, y_valid

# Parameters

## Constants and Parameters

### Global

In [None]:
RANDOM_STATE = 50
EPOCHS = 150
BATCH_SIZE = 2048
TRAINING_LENGTH = 50
TRAIN_FRACTION = 0.7
LSTM_CELLS = 64
VERBOSE = 0
SAVE_MODEL = True

In [None]:
# File location
fileDir = 'data/rnn/neural_network_patent_query.csv'

## Hyper-parameters

# Data

## I/O

In [None]:
# Read in data
data = pd.read_csv(
    fileDir, parse_dates=['patent_date'])

# Extract abstracts
original_abstracts = list(data['patent_abstract'])
len(original_abstracts)

data.head()

In [None]:
print('There are {} patents in total with {} columns containing the following information:\n {}'.format(
    data.shape[0], data.shape[1], data.columns.values))

## Cleaning

### Formatting

In [None]:
formatted = []

# Iterate through all the original abstracts
for a in original_abstracts:
    formatted.append(format_patent(a))

### Convert Text to Sequence

The goal is to split each abstract into sequences of **n** words, with every word represented by an integer index into the vocabulary.

Our preprocessing is going to involve using a Tokenizer to convert the patents from sequences of words (strings) into sequences of integers.
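
As a rough illustration of what converting words to integers means, the sketch below builds a frequency-ranked vocabulary in plain Python. It is a simplified stand-in for the idea, not the Keras Tokenizer implementation, and the names `texts_toy`, `word_to_index`, and `index_to_word` are assumptions made for this example.

```python
from collections import Counter

texts_toy = ['the network learns', 'the network adapts']

# Count every word across all texts
counts = Counter(word for text in texts_toy for word in text.lower().split())

# The most frequent word gets index 1; index 0 is left unused
word_to_index = {
    word: i for i, (word, _) in enumerate(counts.most_common(), start=1)
}
index_to_word = {i: word for word, i in word_to_index.items()}

# Replace each word by its index
sequences_toy = [[word_to_index[w] for w in text.lower().split()]
                 for text in texts_toy]
print(sequences_toy)  # [[1, 2, 3], [1, 2, 4]]

# Vocabulary size including the unused index 0
vocab_size = len(word_to_index) + 1
print(vocab_size)  # 5
```

Leaving index 0 unused is why this notebook computes `num_words` as `len(word_idx) + 1`.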

A sequencing function performs this task. It takes several parameters, including a training length: the number of words fed into the network as features, with the word that follows used as the label. For example, if we set training_length = 50, the model takes in 50 words as features and the 51st word as the label.

For each abstract, we can make multiple training examples by slicing at different points. We can use the first 50 words as features with the 51st as the label, then the 2nd through 51st words as features with the 52nd as the label, then the 3rd through 52nd with the 53rd as the label, and so on. An abstract with N tokens therefore yields N - 50 training examples. This gives us much more data to train on, which generally helps the model's performance.
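
The slicing can be sketched on a toy sequence. The helper name `sliding_windows` and the six-token input are illustrative assumptions; the loop uses the same indexing as `make_sequences`, with a training length of 3 instead of 50.

```python
def sliding_windows(seq, training_length):
    """Hypothetical helper: split one token sequence into (features, label) pairs."""
    features, labels = [], []
    for i in range(training_length, len(seq)):
        # training_length tokens of context plus the token that follows
        window = seq[i - training_length:i + 1]
        features.append(window[:-1])
        labels.append(window[-1])
    return features, labels


tokens = [11, 12, 13, 14, 15, 16]
toy_features, toy_labels = sliding_windows(tokens, training_length=3)
print(toy_features)  # [[11, 12, 13], [12, 13, 14], [13, 14, 15]]
print(toy_labels)    # [14, 15, 16]
```

Six tokens with a window of three give 6 - 3 = 3 examples, which matches the N - training_length count produced by the loop.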

In [None]:
TRAINING_LENGTH = 50
filters = '!"#$%&()*+/:<=>@[\\]^_`{|}~\t\n'
word_idx, idx_word, num_words, word_counts, abstracts, sequences, features, labels = make_sequences(
    formatted, TRAINING_LENGTH, lower=True, filters=filters)

You can think of the above step as creating a feature matrix with the dimensions **[sentence][word]**, where each sentence contains 50 words and each word is represented by its index in the vocabulary.

The vocabulary is indexed by two variables:
- word_idx (maps each word to its index)
- idx_word (maps each index back to its word)

Two further variables provide summary statistics:
- num_words
- word_counts
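
To illustrate the **[sentence][word]** layout and the reverse look-up, here is a small sketch with a made-up four-word vocabulary. The names `idx_word_toy` and `features_toy` are assumptions for this example and stand in for the real `idx_word` and `features`.

```python
import numpy as np

# Toy reverse look-up: index -> word
idx_word_toy = {1: 'the', 2: 'network', 3: 'learns', 4: 'adapts'}

# Two "sentences" of three word indices each
features_toy = np.array([[1, 2, 3],
                         [1, 2, 4]])
print(features_toy.shape)  # (2, 3)

# Decode each row back into words
decoded = [' '.join(idx_word_toy[int(i)] for i in row) for row in features_toy]
print(decoded)  # ['the network learns', 'the network adapts']
```

The same decoding applied to a row of the real feature matrix is a quick way to check that the sequences still read as sensible text.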

# Training the RNN - LSTM

Split the data into training and validation sets and one-hot encode the labels.

In [None]:
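X_train, X_valid, y_train, y_valid = create_train_valid(
    features, labels, num_words, train_fraction=TRAIN_FRACTION)

# Show both shapes; a notebook cell only echoes its last expression
X_train.shape, y_train.shape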
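
One-hot encoding turns each label into a row of zeros with a single 1 at the position of the word index. The sketch below shows this on three made-up labels, using NumPy integer-array indexing as an alternative to the explicit loop in `create_train_valid`; the names `labels_toy`, `num_words_toy`, and `y_toy` are assumptions for this example.

```python
import numpy as np

labels_toy = [2, 0, 3]   # word index of the label for each example
num_words_toy = 4        # vocabulary size

# One row per example, one column per word; int8 keeps memory use low
y_toy = np.zeros((len(labels_toy), num_words_toy), dtype=np.int8)

# Set column labels_toy[k] to 1 in row k
y_toy[np.arange(len(labels_toy)), labels_toy] = 1
print(y_toy.tolist())  # [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]

# With int8, the matrix takes one byte per entry
print(y_toy.nbytes)  # 12
```

Since the label matrix has one column per vocabulary word, its size grows as the number of examples times `num_words`, which is why the one-byte dtype matters here.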

In [None]:
X_train[0]

In [None]:
X_train[0][0]

In [None]:
y_train[0]