# Corpus

A corpus is a large collection of text, and can be thought of as your model's input data. The corpus contains the text you want the model to learn about.
It is common to divide a large corpus into training and testing sets, using most of the corpus to train the model on and some unseen part of the corpus to test the model on, although the testing set can be an entirely different set of data. The corpus typically requires preprocessing to become fit for usage in a machine learning system. 
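As a rough sketch of the split described above, assuming the corpus is already loaded as a single string and using an arbitrary 90/10 ratio (the helper name `split_corpus`, the sample text, and the ratio are illustrative choices, not something this notebook prescribes):

```python
def split_corpus(text, train_fraction=0.9):
    """Split a corpus into a leading training part and a trailing test part."""
    cut = int(len(text) * train_fraction)
    return text[:cut], text[cut:]

# Toy corpus standing in for a real text file.
corpus = "the quick brown fox jumps over the lazy dog " * 10
train_text, test_text = split_corpus(corpus)
print(len(train_text), len(test_text))
```

The test part is taken from the end so that the model never sees it during training.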

# Encoding 

Encoding, sometimes referred to as word representation, is the process of converting text data into a form that a machine learning model can understand. Neural networks cannot work with raw text data, so the characters or words must be transformed into a series of numbers the network can interpret.
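As a minimal sketch of what such a transformation can look like, assuming a naive whitespace split and one integer id per word (only one of several possible encodings; the sample sentence and the names `word_to_id` and `encoded` are made up for illustration):

```python
text = "the cat sat on the mat"

tokens = text.split()          # naive whitespace split into words
vocab = sorted(set(tokens))    # unique words in a stable order
word_to_id = {word: i for i, word in enumerate(vocab)}

# Replace every word with its integer id.
encoded = [word_to_id[word] for word in tokens]
print(word_to_id)
print(encoded)
```

Both occurrences of "the" receive the same id, which is what lets a model treat them as the same symbol.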

Splitting the text into the units that get encoded is referred to as "tokenization", because you obtain tokens that represent the actual words; each token is then mapped to a number or a vector of numbers. There are multiple ways to encode words as numeric values.

# Recurrent Neural Network 

A recurrent neural network (RNN) differs from a "vanilla" neural network thanks to its ability to remember (https://towardsdatascience.com/learn-how-recurrent-neural-networks-work-84e975feaaf7) prior inputs from earlier steps of the sequence.

The outputs of layers in a recurrent neural network aren't influenced only by the weights and the output of the previous layer, as in a regular neural network; they are also influenced by the "context" so far, which is derived from prior inputs and outputs.

RNNs are useful for text processing because of their ability to remember the different parts of a series of inputs, which means that they can take the previous parts of a sentence into account to interpret context.
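The idea of "context" can be sketched with a deliberately tiny recurrent step. This is a scalar toy, not the network built later in this notebook; the weights `w_x` and `w_h` are arbitrary values chosen for illustration:

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a scalar RNN: the new state mixes the input with the old state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs):
    """Feed a sequence through the step function and collect every state."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states

# Both sequences end with the same input (1.0), but the final states differ
# because the state carries information about the earlier inputs.
print(run_rnn([0.0, 0.0, 1.0])[-1])
print(run_rnn([1.0, 1.0, 1.0])[-1])
```

The state `h` plays the role of the memory: it is the only channel through which earlier inputs can influence later outputs.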

# Long Short-Term Memory (LSTM)

Long Short-Term Memory networks are a specific type of recurrent neural network. In a plain RNN, the ability to preserve the context of earlier inputs degrades over time.

The longer the input series is, the more the network "forgets". Irrelevant data accumulates over time and blocks out the relevant data the network needs to make accurate predictions about the pattern of the text. This is related to what is known as the vanishing gradient problem.

An LSTM deals with this problem by selectively "forgetting" information deemed nonessential to the task at hand. By suppressing nonessential information, the LSTM is able to focus on only the information that genuinely matters, which mitigates the vanishing gradient problem. This makes LSTMs more robust when handling long strings of text.
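The selective forgetting can be illustrated with a scalar version of the standard LSTM cell update. This is a simplified sketch with hand-picked parameters, not the Keras implementation used later; the parameter dictionary layout and the names `keep` and `drop` are assumptions made for the example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = p[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hand-picked parameters: the input gate is almost closed in both cases,
# and only the forget-gate bias differs.
base = {"input": (0.0, 0.0, -10.0),
        "candidate": (1.0, 0.0, 0.0),
        "output": (0.0, 0.0, 10.0)}
keep = dict(base, forget=(0.0, 0.0, 10.0))    # forget gate close to 1
drop = dict(base, forget=(0.0, 0.0, -10.0))   # forget gate close to 0

_, c_keep = lstm_step(0.5, 0.0, 2.0, keep)
_, c_drop = lstm_step(0.5, 0.0, 2.0, drop)
print(c_keep, c_drop)
```

With the forget gate open, the previous cell state of 2.0 survives almost unchanged; with it closed, the state is wiped. In a trained network these gate values are learned rather than set by hand.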

# Text Generation Theory/Approach

## Word Embeddings

Word embedding refers to representing words or phrases as vectors of real numbers, much like one-hot encoding does. However, a word embedding can use more numbers than simply ones and zeros (decimal numbers, for example), and therefore it can form more complex representations. These representations can store important information about words, such as their relationship to other words, their morphology, and their context.

Word embeddings have fewer dimensions than one-hot encoded vectors do, which forces the model to represent similar words with similar vectors. Each word is represented by a vector in the embedding space, and the distance between vectors can be used to represent the relationship between the words. Word embeddings can generalize because semantically similar words have similar vectors.

The vectors of related words occupy a similar region of the embedding space, which helps capture context and semantics.
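To make the contrast with one-hot encoding concrete, here is a small sketch using cosine similarity. The two-dimensional "embedding" values are invented by hand for illustration; real embeddings are learned and usually have far more dimensions:

```python
import math

def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# One-hot vectors: every pair of distinct words is equally unrelated.
one_hot = {"king": [1, 0, 0], "queen": [0, 1, 0], "apple": [0, 0, 1]}

# Hypothetical dense vectors: related words point in similar directions.
embedding = {"king": [0.9, 0.8], "queen": [0.85, 0.9], "apple": [-0.7, 0.1]}

print(cosine(one_hot["king"], one_hot["queen"]))
print(cosine(embedding["king"], embedding["queen"]))
print(cosine(embedding["king"], embedding["apple"]))
```

The one-hot pair scores zero regardless of meaning, while the dense vectors can express that "king" is closer to "queen" than to "apple".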

## Word-Level Generation vs Character-Level Generation 

There are two ways to tackle a natural language processing task like text generation. You can analyze the data and make predictions about it at:

1. the level of the words in the corpus
2. the level of the individual characters

Both character-level generation and word-level generation have their advantages and disadvantages. 

Word-level language models tend to display higher accuracy than character-level language models. This is because they can form shorter representations of sentences and preserve the context between words more easily than character-level language models. However, large corpora are needed to sufficiently train word-level language models, and one-hot encoding isn't very feasible for word-level models.

Character-level language models are often quicker to train, requiring less memory and having faster inference than word-based models. This is because the "vocabulary" (the number of training features) for the model is likely to be much smaller overall, limited to some hundreds of characters rather than hundreds of thousands of words.

Character-based models also perform well when translating words between languages because they capture the characters which make up words, rather than trying to capture the semantic qualities of words.
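The trade-off between the two levels can be seen on a single sentence. This is only a toy comparison on an invented sample: on such a short text the character vocabulary is not yet smaller than the word vocabulary, and the vocabulary advantage of character-level models only shows up on large corpora, where the number of distinct words keeps growing while the character set stays bounded:

```python
text = "it was the best of times it was the worst of times"

word_seq = text.split()   # word-level view of the sentence
char_seq = list(text)     # character-level view of the same sentence

# The word-level sequence is much shorter, so less context has to be
# carried across steps to relate distant parts of the sentence.
print("sequence length:", len(word_seq), "words vs", len(char_seq), "characters")
print("vocabulary size:", len(set(word_seq)), "words vs", len(set(char_seq)), "characters")
```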

## Using an RNN/LSTM

First, we'll need to get some text data and preprocess the data. After that, we'll create the LSTM model and train it on the data. Finally, we'll evaluate the network. 

For text generation, we want our model to learn the probabilities of which character will come next, given a starting (random) seed. We will then chain these predictions together to create an output of many characters. We first need to convert our input text to numbers and then train the model on sequences of these numbers.

## Library

In [7]:
import numpy
import sys
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
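# alias tf.keras utils so np_utils.to_categorical works without mixing in standalone keras
from tensorflow.keras import utils as np_utils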
from tensorflow.keras.callbacks import ModelCheckpoint

## Read data

In [30]:
with open("84-0.txt", encoding="utf8") as f:
    file = f.read()

## Data cleaning

In [31]:
def tokenize_words(text):
    # lowercase everything to standardize it
    text = text.lower()
    # instantiate the tokenizer, which keeps runs of word characters only
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)
    # build the stop-word set once instead of reloading the list for every token
    # (requires the NLTK data: nltk.download('stopwords'))
    stop_words = set(stopwords.words('english'))
    # keep only the tokens that are not stop words
    filtered = [token for token in tokens if token not in stop_words]
    return " ".join(filtered)

In [32]:
# preprocess the input data, make tokens
processed_inputs = tokenize_words(file)
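To see what this cleaning step does to a sentence, here is a dependency-free approximation. It uses `re.findall` in place of `RegexpTokenizer` and a tiny hand-written stop-word set instead of NLTK's much longer English list, so it is a sketch of the behaviour rather than a drop-in replacement:

```python
import re

# Tiny illustrative stop-word set; NLTK's English list is far larger.
STOP_WORDS = {"the", "a", "of", "and", "was"}

def clean(text):
    """Lowercase, keep runs of word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = clean("The thunder of the waters was Ominous, and terrific!")
print(cleaned)
```

Punctuation and capitalisation disappear along with the stop words, which is why the text the model is trained on reads as a stream of lowercase content words.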

## Convert the input characters to numbers

We'll sort the set of all characters that appear in our input text, then use the enumerate function to get numbers that represent the characters. We then create a dictionary that stores the keys and values, that is, the characters and the numbers that represent them.

In [33]:
chars = sorted(list(set(processed_inputs)))
char_to_num = dict((c, i) for i, c in enumerate(chars))
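The same mapping on a short string makes the result easier to inspect. The names `toy_chars`, `toy_map` and `toy_encoded` are used here only so that the notebook's own variables are not overwritten:

```python
sample = "hello world"

toy_chars = sorted(set(sample))                       # unique characters, sorted
toy_map = {c: i for i, c in enumerate(toy_chars)}     # character -> integer
toy_encoded = [toy_map[c] for c in sample]

print(toy_chars)
print(toy_encoded)
```

Because the characters are sorted before numbering, the mapping is the same on every run for the same input text.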

We need the total length of our inputs and the total size of our set of characters for later data prep, so we'll store these in variables.

In [34]:
input_len = len(processed_inputs)
vocab_len = len(chars)
print ("Total number of characters:", input_len)
print ("Total vocab:", vocab_len)

Total number of characters: 177016
Total vocab: 43


Now that we've transformed the data into the form it needs to be in, we can begin making a dataset out of it, which we'll feed into our network. We need to define how long we want an individual sequence (one complete mapping of input characters as integers) to be. We'll set a length of 100 for now, and declare empty lists to store our input and output data:

In [35]:
seq_length = 100
x_data = []
y_data = []

The loop below creates one sequence for every starting position in the input data, so each sequence starts one character later than the previous one.

In [36]:
# loop through inputs, start at the beginning and go until we hit
# the final character we can create a sequence out of
for i in range(0, input_len - seq_length, 1):
    # input is the window of seq_length characters starting at position i
    in_seq = processed_inputs[i:i + seq_length]
    # output is the single character that follows that window
    out_seq = processed_inputs[i + seq_length]
    # convert the characters to integers using char_to_num
    # and add the values to our lists
    x_data.append([char_to_num[char] for char in in_seq])
    y_data.append(char_to_num[out_seq])

Now we have our input sequences of characters and our output, which is the character that should come after the sequence ends. We now have our training data features and labels, stored as x_data and y_data.

In [37]:
n_patterns = len(x_data)
print ("Total Patterns:", n_patterns)

Total Patterns: 176916
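The windowing can be reproduced on a toy string, which also shows why the pattern count equals the input length minus the sequence length (here 177016 - 100 = 176916). The helper `make_windows` is a hypothetical name introduced for this sketch, and it keeps the characters as text instead of converting them to integers:

```python
def make_windows(text, seq_length):
    """Return (input window, next character) pairs, sliding one step at a time."""
    pairs = []
    for i in range(len(text) - seq_length):
        pairs.append((text[i:i + seq_length], text[i + seq_length]))
    return pairs

for window, target in make_windows("abcdefg", 3):
    print(repr(window), "->", repr(target))
```

A string of 7 characters with a window of 3 yields 4 pairs, since the last three characters have no following character to serve as a label.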


Now we'll go ahead and convert our input sequences into a numpy array that our network can use, with the shape (number of patterns, sequence length, features per step). We'll also convert the integer values into floats between 0 and 1 by dividing by the vocabulary size, so that the inputs sit in a small range that the network's activation functions handle well.

In [38]:
X = numpy.reshape(x_data, (n_patterns, seq_length, 1))
X = X/float(vocab_len)

## One-hot encode our label data

In [39]:
y = np_utils.to_categorical(y_data)
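For readers unfamiliar with one-hot encoding, the following pure-Python stand-in shows the idea on three labels. It is only an illustration of the concept; the hypothetical `one_hot` helper takes the number of classes explicitly and is not how the Keras utility is implemented:

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0   # mark the position of the true class
        rows.append(row)
    return rows

encoded_labels = one_hot([2, 0, 1], 3)
print(encoded_labels)
```

Each row has a single 1 at the index of the true character, which is the target format that a softmax output layer with categorical cross-entropy expects.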

## Create LSTM model

In [40]:
model = Sequential()

model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))

model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))

model.add(LSTM(128))
model.add(Dropout(0.2))

model.add(Dense(y.shape[1], activation='softmax'))

In [41]:
model.compile(loss='categorical_crossentropy', 
              optimizer='adam')
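To put the loss values in the training log into perspective, here is a sketch of categorical cross-entropy for a single sample. It assumes the standard definition (negative log of the probability assigned to the true class) and uses the vocabulary size of 43 reported earlier:

```python
import math

def categorical_crossentropy(probs, true_index):
    """Loss for one sample: negative log of the probability of the true class."""
    return -math.log(probs[true_index])

vocab_len = 43  # vocabulary size printed earlier in this notebook
uniform = [1.0 / vocab_len] * vocab_len

# Loss of a model that spreads its probability evenly over all characters.
baseline = categorical_crossentropy(uniform, 0)
print(baseline)
```

A model that guesses uniformly scores about 3.76, so the first-epoch loss of roughly 2.94 already reflects some learned structure, such as the relative frequency of characters.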

### Save the weights

In [42]:
filepath = "model_weights_saved.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
desired_callbacks = [checkpoint]

In [43]:
model.fit(X, y, 
          epochs=20, 
          batch_size=256, 
          callbacks=desired_callbacks)

Train on 176916 samples
Epoch 1/20
Epoch 00001: loss improved from inf to 2.94398, saving model to model_weights_saved.hdf5
Epoch 2/20
Epoch 00002: loss improved from 2.94398 to 2.79131, saving model to model_weights_saved.hdf5
Epoch 3/20
Epoch 00003: loss improved from 2.79131 to 2.61232, saving model to model_weights_saved.hdf5
Epoch 4/20
Epoch 00004: loss improved from 2.61232 to 2.51186, saving model to model_weights_saved.hdf5
Epoch 5/20
Epoch 00005: loss improved from 2.51186 to 2.42136, saving model to model_weights_saved.hdf5
Epoch 6/20
Epoch 00006: loss improved from 2.42136 to 2.34638, saving model to model_weights_saved.hdf5
Epoch 7/20
Epoch 00007: loss improved from 2.34638 to 2.27749, saving model to model_weights_saved.hdf5
Epoch 8/20
Epoch 00008: loss improved from 2.27749 to 2.21767, saving model to model_weights_saved.hdf5
Epoch 9/20
Epoch 00009: loss improved from 2.21767 to 2.16469, saving model to model_weights_saved.hdf5
Epoch 10/20
Epoch 00010: loss improved from 

<tensorflow.python.keras.callbacks.History at 0x21ec3ded388>

### Recompile our model with the saved weights

In [44]:
filename = "model_weights_saved.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

Since we converted the characters to numbers earlier, we need to define a dictionary that will convert the output of the model back into characters:

In [45]:
num_to_char = dict((i, c) for i, c in enumerate(chars))

To generate characters, we need to provide our trained model with a random seed sequence of characters from which it can generate further characters:

In [50]:
start = numpy.random.randint(0, len(x_data) - 1)
pattern = x_data[start]
print("Random Seed:")
print("\"", ''.join([num_to_char[value] for value in pattern]), "\"")

Random Seed:
"  thunder progress waters rolled swelled beneath became every moment ominous terrific pressed vain wi "


We'll ask the model to predict what comes next based on the random seed, convert the output number to a character, and then append it to the pattern while dropping the pattern's oldest character, so that the pattern always holds the most recent characters at the length the model expects:

In [51]:
for i in range(1000):
    # shape the current pattern like the training data and scale it the same way
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(vocab_len)
    prediction = model.predict(x, verbose=0)
    # pick the most probable next character (greedy decoding)
    index = numpy.argmax(prediction)
    result = num_to_char[index]
    sys.stdout.write(result)
    # slide the window: append the new character and drop the oldest one
    pattern.append(index)
    pattern = pattern[1:]

thin sears seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel seel see

KeyboardInterrupt: 
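The generation loop always picks the single most probable character, which is deterministic and can lock into a repeating loop like the one in the output. A commonly used alternative is to sample from the predicted distribution, optionally adjusted by a temperature. The sketch below is not part of the original notebook: `sample_with_temperature` is a hypothetical helper, the example probabilities are made up, and whether it improves the output of this particular model is untested.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after rescaling by temperature.

    Temperatures below 1 sharpen the distribution (closer to argmax);
    temperatures above 1 flatten it (more variety, more mistakes).
    """
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    top = max(logits)
    weights = [math.exp(z - top) for z in logits]  # subtract max for stability
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

rng = random.Random(0)      # fixed seed so the run is repeatable
probs = [0.6, 0.3, 0.1]     # made-up next-character probabilities

draws = [sample_with_temperature(probs, 1.0, rng) for _ in range(1000)]
print({i: draws.count(i) for i in range(len(probs))})

# A very low temperature behaves almost like argmax.
greedy_like = sample_with_temperature(probs, 0.01, rng)
print(greedy_like)
```

In the loop above, a call such as `sample_with_temperature(prediction[0], 0.8)` could stand in for `numpy.argmax(prediction)`, assuming `prediction` has one row of probabilities per input sequence.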