# Sequence generation
In this assignment, we use a recurrent neural network (RNN) for sequence generation: we train an RNN to generate text character by character. We start from scratch, going from loading the dataset, to defining the model, to training it, and finally to generating text from the trained model.  
The assignment covers the following subsections:


1. Import prerequisite modules (Python standard library, Keras, and Google Colab)

2. Load the dataset  

3. Preprocessing of the dataset

4. Convert the sequential dataset to a supervised learning dataset format (input (X), output (Y))  

5. Normalize the dataset  

6. Define hyperparameters  

7. Define RNN model

8. Compile the model

9. Train the model

10. Evaluate the model


Each subsection comes with a short description and instructions for carrying out the subtask in the code cell that immediately follows. Some subsections are complete and self-sufficient, so you don't have to write any code for them. You are advised to go through the concepts and implementation of each subsection thoroughly.  



### Import the Google Colab drive helper module and mount the drive




In [0]:
from google.colab import drive
import os 

In [0]:
# mount your Google Drive in Google Colab
drive.mount('/content/drive')

# navigate to the folder path in your drive where keras utility files exist
current_path = '/content/drive/My Drive/rnn/keras/'
os.chdir(current_path)

# list all files under that path
!ls -al

### Import required Python and Keras modules

In [0]:
# python system and numpy modules
import sys
import numpy

# keras modules
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.optimizers import Adam
from keras.utils import np_utils

### Load the dataset

*   Preprocessing of the dataset  
1. Load the text file and look at what it contains. Next, convert all characters in the text to lowercase.  

2. Create an array of the unique characters that occur in the text file, including all digits and special characters. This array of unique characters is called the ```vocabulary```.  

3. Create two dictionaries, ```character_to_identifier``` and ```identifier_to_character```. The former maps each character to its integer index, and the latter maps each integer index back to its character. Both are built from the ```vocabulary``` array.  

4. Print the total number of characters in the text data and the length of the ```vocabulary``` array.
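
The steps above can be tried on a tiny string before running them on the whole book. The sketch below is only an illustration: `sample_text`, `sample_vocabulary`, `char_to_id`, and `id_to_char` are hypothetical names chosen for this example, not variables from the assignment.

```python
# Toy stand-in for the lowercased book text (assumption: any string works here).
sample_text = "Alice was".lower()

# Sorted list of the distinct characters, the space included.
sample_vocabulary = sorted(set(sample_text))

# Forward and reverse lookup tables built from the same enumeration.
char_to_id = {character: identifier for identifier, character in enumerate(sample_vocabulary)}
id_to_char = {identifier: character for identifier, character in enumerate(sample_vocabulary)}

# Encoding and then decoding should give back the original text.
encoded = [char_to_id[character] for character in sample_text]
decoded = "".join(id_to_char[identifier] for identifier in encoded)

print(sample_vocabulary)  # [' ', 'a', 'c', 'e', 'i', 'l', 's', 'w']
print(encoded)            # [1, 5, 4, 2, 3, 0, 7, 1, 6]
print(decoded)            # alice was
```

Because both dictionaries come from the same enumeration of the sorted vocabulary, they are exact inverses of each other, which is what makes decoding the model output possible later on.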




In [0]:
# load text file into a variable - raw_text_data
filename = "datasets/alice_in_wonderland.txt"
with open(filename) as text_file:
	raw_text_data = text_file.read()

# look at what the text is all about by printing its first 500 characters
print(raw_text_data[:500])

# convert all characters in raw_text_data to lowercase
text_data = raw_text_data.lower()

In [0]:
# create a sorted array of unique characters - vocabulary, which includes all types of characters
vocabulary = sorted(list(set(text_data)))

In [0]:
# create a dictionary - character_to_identifier, that has each character and its corresponding index as a (key, value) pair
character_to_identifier = dict((character, identifier) for identifier, character in enumerate(vocabulary))

# create a dictionary - identifier_to_character, that has each integer index and its corresponding character as a (key, value) pair
identifier_to_character = dict((identifier, character) for identifier, character in enumerate(vocabulary))

In [0]:
# print total number of characters in text_data and vocabulary array
total_characters = len(text_data)
vocabulary_length = len(vocabulary)

print('Total characters - ', total_characters)
print('Total vocabulary_length - ', vocabulary_length)

### Convert sequence data into supervised learning dataset format

In [0]:
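# number of characters in each input sequence
sequence_length = 100

X_data = []
Y_data = []

# slide a window over the text: each window of sequence_length characters is an input,
# and the character that immediately follows the window is the output
for i in range(0, total_characters - sequence_length, 1):
	sequence_input = text_data[i:i + sequence_length]
	sequence_output = text_data[i + sequence_length]
	X_data.append([character_to_identifier[char] for char in sequence_input])
	Y_data.append(character_to_identifier[sequence_output])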
 
number_of__patterns = len(X_data)
print('Total patterns - ', number_of__patterns)
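
To see what the sliding window in the cell above produces, here is a minimal sketch on a short string. It assumes a window of 4 characters instead of 100, and `sample_text`, `window`, and `pairs` are illustrative names only. The characters are kept as text rather than integers so the pairs are easy to read.

```python
sample_text = "hello world"
window = 4  # small stand-in for sequence_length = 100

pairs = []
for start in range(len(sample_text) - window):
    context = sample_text[start:start + window]  # input: `window` consecutive characters
    target = sample_text[start + window]         # output: the character right after them
    pairs.append((context, target))

for context, target in pairs[:3]:
    print(repr(context), "->", repr(target))
# 'hell' -> 'o'
# 'ello' -> ' '
# 'llo ' -> 'w'

print("Total patterns -", len(pairs))  # Total patterns - 7
```

A text of N characters yields N minus the window length patterns, since the last window must still have one character after it to serve as the target.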

### Normalize the dataset

In [0]:
# reshape input sequences into a 3-D tensor of shape (samples, time steps, features)
X = numpy.reshape(X_data, (number_of__patterns, sequence_length, 1))

# normalize the integer representation of the input sequences by dividing by vocabulary_length
X = X / float(vocabulary_length)

# convert integer representation of output character (0-vocabulary_length) into one-hot vector of length=vocabulary_length
Y = np_utils.to_categorical(Y_data)
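
The reshaping, scaling, and one-hot encoding can be checked on a toy example. The sketch below uses only NumPy and builds the one-hot rows from an identity matrix as a stand-in for `to_categorical`. All `toy_` names and the vocabulary size of 5 are assumptions made for illustration.

```python
import numpy

toy_vocabulary_length = 5             # assumed vocabulary size for this example
toy_X_data = [[0, 1, 2], [1, 2, 3]]   # two patterns, each three time steps long
toy_Y_data = [3, 4]                   # index of the next character for each pattern

# (samples, time steps, features): one feature per time step
toy_X = numpy.reshape(toy_X_data, (len(toy_X_data), 3, 1))

# scale the integer indices into the range [0, 1)
toy_X = toy_X / float(toy_vocabulary_length)

# one-hot targets: row i of the identity matrix encodes class i
toy_Y = numpy.eye(toy_vocabulary_length)[toy_Y_data]

print(toy_X.shape)  # (2, 3, 1)
print(toy_Y)
```

One difference worth keeping in mind: when `to_categorical` is called without `num_classes`, it infers the number of classes from the largest label it sees, so the width of `Y` equals `vocabulary_length` only if the highest-indexed character appears at least once as a target.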

### Define hyperparameters

*   Create an Adam optimizer object with a learning rate of 0.0001.
*   Set epochs to 1 and batch_size to 256. These values are tunable: try different values when training the model and keep the ones that give good results. 

In [0]:
# define optimizer
learning_rate = 0.0001
optimizer = Adam(lr=learning_rate)


# define other hyperparameters
epochs = 1
batch_size = 256

### Define a simple recurrent neural network model using Keras.

1. Create a sequential model. See [link](https://keras.io/models/sequential/) for details.

2. Add an LSTM layer to the model. See [link](https://keras.io/layers/recurrent) for more details.
Set units = 256, input_shape = (X.shape[1], X.shape[2]), and return_sequences = True for the layer, so that it passes its full output sequence on to the next LSTM layer.

3. Add a dropout layer to the model. See [link](https://keras.io/layers/core/) for details. Set a dropout rate = 0.2 for the layer.

4. Add a second LSTM layer with units = 256, followed by another dropout layer with a dropout rate = 0.2.

5. Add a dense layer to the model. See [link](https://keras.io/layers/core/) for details. Set units = Y.shape[1], activation = softmax for the layer.

In [0]:
# design an RNN-based model
model = Sequential()
model.add(LSTM(units=256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(units=256))
model.add(Dropout(0.2))
model.add(Dense(units=Y.shape[1], activation='softmax'))

### Compile the simple model

1. See [link](https://keras.io/models/model/) for details.

2. Set loss = categorical_crossentropy.

3. Set optimizer=optimizer (created Adam optimizer).

4. Set metrics = accuracy.

5. Print model summary for visualization.
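
For intuition about the `categorical_crossentropy` loss named above, here is a simplified sketch of the per-sample formula, the negative sum of target times log of predicted probability. It is not the Keras implementation (which also clips probabilities for numerical stability), and the function and variable names are illustrative.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probabilities):
    """Cross-entropy for one sample: -sum(t * log(p)) over all classes."""
    return -sum(
        t * math.log(p)
        for t, p in zip(one_hot_target, predicted_probabilities)
        if t > 0  # classes with a zero target contribute nothing
    )

target = [0.0, 1.0, 0.0]       # the true next character is class 1
confident = [0.1, 0.8, 0.1]    # prediction that favors the right class
unsure = [0.4, 0.2, 0.4]       # prediction that favors the wrong classes

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)

print(round(loss_confident, 4))  # 0.2231
print(round(loss_unsure, 4))     # 1.6094
```

With a one-hot target, the loss reduces to the negative log of the probability assigned to the correct character, so it shrinks toward zero as the model becomes more confident in the right answer.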

In [0]:
# compile the model
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])


# visualize the model (summary() prints the table itself and returns None)
model.summary()
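
The parameter counts reported in the model summary can be cross-checked by hand. The sketch below assumes the standard Keras LSTM layer with bias terms and the two-LSTM architecture defined in the cell above. `assumed_vocabulary_length` is a placeholder value, so substitute your own `vocabulary_length`.

```python
def lstm_parameter_count(input_features, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_features + units) + units)

def dense_parameter_count(input_features, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return input_features * units + units

assumed_vocabulary_length = 50  # placeholder, not the real vocabulary size

first_lstm = lstm_parameter_count(input_features=1, units=256)
second_lstm = lstm_parameter_count(input_features=256, units=256)
output_layer = dense_parameter_count(input_features=256, units=assumed_vocabulary_length)

print(first_lstm)    # 264192
print(second_lstm)   # 525312
print(output_layer)  # 12850
print(first_lstm + second_lstm + output_layer)  # 802354
```

Dropout layers have no trainable parameters, so they appear in the summary with a count of 0.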

### Train the model

1. Set the number of epochs to a value between 1 and 10.

2. Set the batch size to a multiple of 32, e.g. 128 or 256.

3. Call the function fit(...) on the model.

4. See [link](https://keras.io/models/model/) for details.

5. Pass X and Y for training the model.

In [0]:
model.fit(X, Y, epochs=epochs, batch_size=batch_size)
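
As a rough guide to how the batch size affects training, the number of weight updates in one epoch is the number of patterns divided by the batch size, rounded up. The sketch below uses a hypothetical pattern count of 100000, so replace it with your own `number_of__patterns`.

```python
import math

example_pattern_count = 100000  # hypothetical value, not the real dataset size

for example_batch_size in (128, 256):
    # the last batch may be smaller than the others, hence the rounding up
    updates_per_epoch = math.ceil(example_pattern_count / example_batch_size)
    print(example_batch_size, "->", updates_per_epoch)
# 128 -> 782
# 256 -> 391
```

Halving the batch size roughly doubles the number of updates per epoch, which usually makes each epoch slower but gives the optimizer more steps.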

### Evaluate the model
Since training the model takes time, we have already saved a trained model that we can use here.  
1. Load the weights of the pre-trained model using the ```load_weights``` function.

2. Extract a seed sequence from the input sequences (```X_data```) and convert its integer representation back to characters using the ```identifier_to_character``` dictionary.

3. Define the maximum number of characters that you want to generate from the model.

4. Start a loop that generates characters one at a time, up to the maximum number of characters.

5. Reshape ```model_input``` and normalize it in the same way as the training data.

6. Predict the next character in the sequence using the ```predict``` function.

7. Convert the model output's integer representation to its corresponding character using the ```identifier_to_character``` dictionary.

8. Print the generated character.

9. Append the predicted index to the pattern and drop the pattern's first element, so that the input length stays the same.
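
The generation loop described above can be sketched without a trained model. In the example below, `toy_predict` is a hypothetical stand-in for `model.predict` that simply favors the next letter of a four-letter vocabulary. All names are illustrative, and the loop picks the most probable character at each step (greedy decoding).

```python
toy_vocabulary = ['a', 'b', 'c', 'd']
toy_id_to_char = dict(enumerate(toy_vocabulary))

def toy_predict(window):
    """Hypothetical stand-in for model.predict: favors the letter after the last one."""
    probabilities = [0.1] * len(toy_vocabulary)
    probabilities[(window[-1] + 1) % len(toy_vocabulary)] = 0.7
    return probabilities

window = [0, 1, 2]  # seed sequence "abc" as integer indices
generated = []

for _ in range(6):
    probabilities = toy_predict(window)
    # argmax: index of the most probable next character
    next_id = max(range(len(probabilities)), key=probabilities.__getitem__)
    generated.append(toy_id_to_char[next_id])
    # slide the window: drop the oldest index, append the new one
    window = window[1:] + [next_id]

print(''.join(generated))  # dabcda
```

The window always keeps the same length, so every call to the predictor sees an input of the shape it was trained on.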

In [0]:
# load previously trained model weights
filename = 'models/best_wonderland_model.hdf5'
model.load_weights(filename)

In [0]:
# pick a random input sequence to use as the seed for generating the next characters
start = numpy.random.randint(0, len(X_data)-1)
# copy the seed so that the generation loop does not modify X_data
pattern = list(X_data[start])
print('Seed is -')
print('\"', ''.join([identifier_to_character[value] for value in pattern]), '\"')

In [0]:
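# Create a loop for generating one character at a time using the loaded model, up to maximum_characters
maximum_characters = 500
for character in range(maximum_characters):
	# reshape the current pattern to (1, time steps, 1) and normalize it like the training data
	model_input = numpy.reshape(pattern, (1, len(pattern), 1))
	model_input = model_input / float(vocabulary_length)

	# predict the probabilities of the next character and pick the most likely one
	prediction = model.predict(model_input, verbose=0)
	index = numpy.argmax(prediction)
	model_output = identifier_to_character[index]

	sys.stdout.write(model_output)

	# slide the window: append the predicted index and drop the oldest one
	pattern.append(index)
	pattern = pattern[1:len(pattern)]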