## Text Prediction/Generation with Keras using <font color= #13c113  >LSTM: *Long Short Term Memory* networks</font>

   In this example we will work with the book: Alice’s Adventures in Wonderland by Lewis Carroll.

  We are going to learn the dependencies between characters and the conditional probabilities of characters in sequences so that we can in turn generate wholly new and original sequences of characters.
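
As a rough illustration of what "conditional probabilities of characters" means, the sketch below estimates P(next character | previous character) from raw counts. The toy sentence and the helper `next_char_probs` are assumptions made for this example; the LSTM in this notebook conditions on 100 previous characters rather than one, and learns the probabilities instead of counting them.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration, not the book text).
text = "the cat sat on the mat"

# Count how often each character follows each one-character context.
follow_counts = defaultdict(Counter)
for prev_char, next_char in zip(text, text[1:]):
    follow_counts[prev_char][next_char] += 1

def next_char_probs(context):
    """Estimate P(next char | context) from the raw counts (hypothetical helper)."""
    counts = follow_counts[context]
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

# In this toy text, 't' is followed by 'h' twice and by a space twice.
probs_after_t = next_char_probs("t")
print(probs_after_t)
```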
    
![Text-Generation-With-LSTM-Recurrent-Neural-Networks-in-Python-with-Keras](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2016/08/Text-Generation-With-LSTM-Recurrent-Neural-Networks-in-Python-with-Keras.jpg)


### Adapted from:
#### [Text Generation With LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/)

By Jason Brownlee


<br>


<img src="https://s3.amazonaws.com/keras.io/img/keras-logo-2018-large-1200.png" alt="Keras logo" height="100" width="250"> 

---


# * [MSTC](http://mstc.ssr.upm.es/big-data-track) and MUIT: <font size=5 color='green'>Deep Learning</font>

* <font size=5 color='green'>Machine Learning Lab (MLLB)</font>
 
---
---




---

## Start by installing some libraries and doing some imports...

In [0]:
import numpy as np

import keras

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils




---

## Download the text file <font color= #2a9dad >*Alice’s Adventures in Wonderland*</font>

- ### We will first download the complete text in ASCII format (Plain Text UTF-8) 

- #### [Project Gutenberg](https://www.gutenberg.org/): gives free access to books that are no longer protected by copyright

- ### Text has been prepared in a Google Drive link



In [0]:
! pip install googledrivedownloader



In [0]:
from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id='1wG4PUnoYVUKrsaWgyWepSacUiYNNDEvM',
                                    dest_path='./wonderland.txt',
                                    unzip=False)

Downloading 1wG4PUnoYVUKrsaWgyWepSacUiYNNDEvM into ./wonderland.txt... Done.


- ### Read text for the book and convert all of the characters to lowercase to reduce the vocabulary that the network must learn

In [0]:
# load the ASCII text and convert it to lowercase
filename = "wonderland.txt"
# the with-block closes the file once it has been read
with open(filename) as f:
    raw_text = f.read()
raw_text = raw_text.lower()

In [0]:
print(raw_text[0:200])


alice's adventures in wonderland

lewis carroll

the millennium fulcrum edition 3.0




chapter i. down the rabbit-hole

alice was beginning to get very tired of sitting by her sister on the
bank, and
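
To see why lowercasing reduces the vocabulary, here is a minimal sketch on a made-up sample string (the string is an assumption, not taken from the file): upper- and lowercase versions of the same letter collapse into a single symbol.

```python
# Hypothetical sample string mixing upper- and lowercase letters.
sample = "Alice was beginning to get very tired. ALICE!"

vocab_mixed = sorted(set(sample))          # distinct characters as written
vocab_lower = sorted(set(sample.lower()))  # distinct characters after lowercasing

print("mixed case:", len(vocab_mixed), "symbols")
print("lowercase :", len(vocab_lower), "symbols")
```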


- ### We cannot model the text characters directly: we must use a numerical representation of them
- ### We will start with a simple one: integers
- ### (Some characters could have been removed to further clean up the text)

<font color=yellow  face="times, serif" size=5>============================================<br>
How many different characters are there in raw_text? Store them, ordered, in a list</font>

In [0]:
chars = sorted(list(set(raw_text)))

print(chars)
print('Number of characters: ',len(chars))

['\n', ' ', '!', '"', "'", '(', ')', '*', ',', '-', '.', '0', '3', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Number of characters:  45


<font color=yellow  face="times, serif" size=5>============================================<br>
MAP each character to an integer using a Python *dictionary*  with key(char) : value(int) </font>

In [0]:
char_to_int = dict((c, i) for i, c in enumerate(chars))

In [0]:
char_to_int['s']

37

<font color=yellow face="times, serif" size=5>============================================<br>
INVERSE MAP: get the char from an integer using a *dictionary*  int: char </font>

In [0]:
int_to_char = dict((i, c) for i, c in enumerate(chars))


In [0]:
int_to_char[3]

'"'

In [0]:
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  144431
Total Vocab:  45
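
A quick way to convince yourself that the two dictionaries are inverses of each other is a round trip: encode a string to integers and decode it back. The sketch below rebuilds both mappings on a short sample string (an assumption for illustration) so that it runs on its own.

```python
# Short lowercase sample, standing in for raw_text.
text = "down the rabbit-hole"

chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}

# Encode every character, then decode the integers back to text.
encoded = [char_to_int[c] for c in text]
decoded = "".join(int_to_char[i] for i in encoded)

print(encoded)
print(decoded)
```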


---

## Prediction Task:


- ### <font color=red> Number of steps</font>: We will split the book text up into subsequences with a <font color=red>fixed length of 100 characters, an arbitrary length</font>. 

- ### To train the network we slide a window of seq_length = 100 characters along the whole book



<font color=yellow  face="times, serif" size=5>============================================<br>
Slide a window extracting a sequence of seq_length = 100 characters along the book and store it in dataX: the input to the network</font>

In [0]:
seq_length = 100

dataX=[]

for i in range(0, n_chars - seq_length, 1):
  seq_in = raw_text[i:i+seq_length]
  dataX.append(seq_in)

#dataX

In [0]:
print('dataX length: ',len(dataX))
print('dataX first training example: \n',dataX[0])

dataX length:  144331
dataX first training example: 
 alice's adventures in wonderland

lewis carroll

the millennium fulcrum edition 3.0




chapter i. d


<font color=yellow  face="times, serif" size=5>============================================<br>
dataX MUST be numeric!!! make changes using our $char\_to\_int$ dictionary</font>

In [0]:
seq_length = 100

dataX=[]

for i in range(0, n_chars - seq_length, 1):
  seq_in = raw_text[i:i+seq_length]
  dataX.append([char_to_int[char] for char in seq_in])

In [0]:
print('dataX length: ',len(dataX))
print('dataX first training example: \n',dataX[0])

dataX length:  144331
dataX first training example: 
 [19, 30, 27, 21, 23, 4, 37, 1, 19, 22, 40, 23, 32, 38, 39, 36, 23, 37, 1, 27, 32, 1, 41, 33, 32, 22, 23, 36, 30, 19, 32, 22, 0, 0, 30, 23, 41, 27, 37, 1, 21, 19, 36, 36, 33, 30, 30, 0, 0, 38, 26, 23, 1, 31, 27, 30, 30, 23, 32, 32, 27, 39, 31, 1, 24, 39, 30, 21, 36, 39, 31, 1, 23, 22, 27, 38, 27, 33, 32, 1, 12, 10, 11, 0, 0, 0, 0, 0, 21, 26, 19, 34, 38, 23, 36, 1, 27, 10, 1, 22]


<font color=yellow  face="times, serif" size=5>============================================<br>
Now we have to create the output for each 100-character window: </font>
  - the OUTPUT will be the next character, that is: we will train the network to predict the next character after "seeing" the 100 previous characters.
  
### So add some code to the for loop to store the "next character" for each window in dataY: again, this MUST be numeric, so use our $char\_to\_int$ dictionary

In [0]:

seq_length = 100

dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
  seq_in = raw_text[i:i + seq_length]
  seq_out = raw_text[i + seq_length]
  dataX.append([char_to_int[char] for char in seq_in])
  dataY.append(char_to_int[seq_out])

#dataY

In [0]:
import numpy as np

n_patterns = len(dataX)

print("Total Patterns: ", n_patterns)
print("Pattern shape: ",np.array(dataX).shape)

Total Patterns:  144331
Pattern shape:  (144331, 100)
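
The same windowing can be followed by eye on a tiny example. The sketch below uses a toy string and a window of 4 characters (both assumptions for illustration) and keeps the characters as text so the input/target pairs are readable. The number of windows is always `len(text) - seq_length`, which for the book gives 144431 - 100 = 144331.

```python
text = "hello world"   # toy text instead of the book
seq_length = 4         # short window so the pairs are easy to read

windows, targets = [], []
for i in range(len(text) - seq_length):
    windows.append(text[i:i + seq_length])   # input: seq_length characters
    targets.append(text[i + seq_length])     # target: the character that follows

for w, t in zip(windows, targets):
    print(repr(w), "->", repr(t))
```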


- ### Let's see two examples:

In [0]:
print("------Window input dataX -------------------------------------")
print("\"", ''.join([int_to_char[value] for value in dataX[201]]), "\"")
print("\n -----Character to predict dataY:")
print("\"", int_to_char[dataY[201]])
print("\n")
print("------Window input dataX -------------------------------------")
print("\"", ''.join([int_to_char[value] for value in dataX[202]]), "\"")
print("\n -----Character to predict dataY:")
print("\"", int_to_char[dataY[202]])
print("\n")
print("\"", ''.join([int_to_char[value] for value in dataY[203:216]]), "\"")

------Window input dataX -------------------------------------
" of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it h "

 -----Character to predict dataY:
" a


------Window input dataX -------------------------------------
" f having nothing to do: once or twice she had peeped into the
book her sister was reading, but it ha "

 -----Character to predict dataY:
" d


"  no pictures  "




---

## We must now prepare our training data to be suitable for use with LSTM in Keras.

- ### First we must transform the list of input sequences into the form <font color= #3498db>  [no. samples or batches, time steps, features]</font> expected by an LSTM network. <font color=red> NOTE that our number of features is 1</font>

- ### Next we need to rescale the integers to the range 0-to-1 to make the patterns easier to learn by the LSTM network, whose gates use the sigmoid activation function by default.

- ### Finally, we need to convert the output patterns (single characters converted to integers) into a one hot encoding: to predict the probability of each of the different characters in the vocabulary
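
The three steps above can be sketched on a toy dataset. The sizes here (two windows, three time steps, a vocabulary of five) are assumptions chosen for readability, and the one-hot encoding uses NumPy identity-matrix indexing as a stand-in for Keras `to_categorical`; unlike `to_categorical`, it needs the number of classes to be given explicitly.

```python
import numpy as np

n_vocab = 5                       # toy vocabulary size
dataX = [[0, 1, 2], [1, 2, 3]]    # two windows of three integer codes
dataY = [3, 4]                    # code of the character following each window

# 1) reshape to [samples, time steps, features], with a single feature
X = np.reshape(dataX, (len(dataX), 3, 1))
# 2) rescale the integer codes to the range 0-to-1
X = X / float(n_vocab)
# 3) one-hot encode the targets: row k of the identity matrix encodes class k
y = np.eye(n_vocab)[dataY]

print(X.shape, y.shape)
print(y)
```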

In [0]:
print('dataX shape', np.array(dataX).shape)

dataX shape (144331, 100)


In [0]:
from keras.utils.np_utils import to_categorical

# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
y[6]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)

In [0]:
dataY[6]

23

What this is doing: since dataY[3] = dataY[100] = 1, both targets are the same character, namely character 1 of the 45 characters defined earlier. So y[3] takes the value stored at position 3 of dataY, and the one-hot encoding shows a 1 at the index of that character.

In [0]:
# or with keras
ykeras=keras.utils.to_categorical(dataY)

print('OHE example numpy',y[3])
print('OHE example keras',ykeras[3])


OHE example numpy [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
OHE example keras [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [0]:
X[200,0:10]*n_vocab

array([[ 1.],
       [33.],
       [24.],
       [ 1.],
       [26.],
       [19.],
       [40.],
       [27.],
       [32.],
       [25.]])

## NOTE that to go back from the normalized values to characters we must multiply by n_vocab and round to the nearest integer

In [0]:
print("\"", ''.join([int_to_char[int(value+0.5)] for value in X[200,:]*n_vocab]), "\"")
print("\"", ''.join([int_to_char[int(value)] for value in dataX[200]]), "\"")

"  of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it  "
"  of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it  "
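
The reason for rounding rather than truncating is floating-point error: `i / n_vocab * n_vocab` is not guaranteed to be exactly `i`. The sketch below checks every code of a 45-symbol vocabulary, comparing `int(value + 0.5)` (as used above) with plain `int(value)`. Whether truncation actually changes any code depends on the floating-point arithmetic, so the list of mismatches is printed rather than assumed.

```python
n_vocab = 45

# Normalize every possible code as in the notebook, then undo it.
normalized = [i / float(n_vocab) for i in range(n_vocab)]
recovered_round = [int(v * n_vocab + 0.5) for v in normalized]  # round to nearest
recovered_trunc = [int(v * n_vocab) for v in normalized]        # plain truncation

print("rounding recovers all codes:", recovered_round == list(range(n_vocab)))
mismatches = [i for i, r in enumerate(recovered_trunc) if r != i]
print("codes changed by plain truncation:", mismatches)
```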


---
## We can now define and compile our LSTM model:
- ### Here we define a single hidden LSTM layer with 512 memory units.
- ### The network uses dropout with a probability of 0.2 (20%) at the output of the LSTM. <font color=red> you can also use recurrent dropout</font> SEE: [Keras layers recurrent](https://keras.io/layers/recurrent/)
- ### The output layer is a Dense layer using the softmax activation function to output a probability between 0 and 1 for each of the possible characters.
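
To make the role of softmax concrete, here is a minimal pure-Python sketch (the function and the three example scores are assumptions for illustration, not Keras code). It turns arbitrary scores into values between 0 and 1 that sum to 1, which is what lets us read the output layer as one probability per character.

```python
import math

def softmax(scores):
    """Map raw scores to probabilities that sum to 1."""
    top = max(scores)
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]   # made-up scores for a 3-character vocabulary
probs = softmax(scores)
print(probs, sum(probs))
```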



In [0]:
print('X shape: ', X.shape)
print('y.shape: ',y.shape)
y[0,:]

X shape:  (144331, 100, 1)
y.shape:  (144331, 45)


array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)

<font color=yellow  face="times, serif" size=5>============================================<br>
Define the LSTM model using Sequential style. </font>
 

In [0]:
# define the LSTM model
model = Sequential()
model.add(LSTM(512, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

model.summary()





Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 512)               1052672   
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 45)                23085     
Total params: 1,075,757
Trainable params: 1,075,757
Non-trainable params: 0
_________________________________________________________________
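
The parameter counts in the summary can be checked by hand. An LSTM layer has four weight blocks (three gates plus the cell candidate), each with an input kernel, a recurrent kernel and a bias; the Dense layer has one weight per input-output pair plus one bias per output. The short calculation below uses the sizes of this model (512 units, 1 input feature, 45 characters).

```python
units = 512      # LSTM memory units
input_dim = 1    # one feature per time step
n_vocab = 45     # output classes

# 4 blocks x (input kernel + recurrent kernel + bias)
lstm_params = 4 * (units * input_dim + units * units + units)
# weights + biases of the softmax layer
dense_params = units * n_vocab + n_vocab
total_params = lstm_params + dense_params

print(lstm_params, dense_params, total_params)
```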


- ### Note that we are not really interested in prediction accuracy
- ### We are seeking a balance between generalization and overfitting, but short of memorization.
- ### Because training is slow, we will use model checkpointing to save all of the network weights to a file each time an improvement in loss is observed at the end of an epoch.
- ### We will use the best set of weights (lowest loss) to instantiate our generative model in the next section.



---
## SEE Keras Callbacks
[Keras Callbacks](https://keras.io/callbacks/)

- A callback is a set of functions to be applied at given stages of the training procedure. 

- You can use callbacks to get a view on internal states and statistics of the model during training.

- You can pass a list of callbacks (as the keyword argument callbacks) to the .fit() method of the Sequential or Model classes.

- The relevant methods of the callbacks will then be called at each stage of the training. 

---
---
## We will use the "checkpoint" callback to save the LSTM model to your Google Drive


In [0]:
# MOUNT your Google Drive to save the model
from google.colab import drive 
drive.mount('/content/gdrive') 


Mounted at /content/gdrive


In [0]:
# define the checkpoint
filepath="/content/gdrive/My Drive/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
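
What `save_best_only=True` with `mode='min'` amounts to can be sketched without Keras: keep track of the lowest loss seen so far and write a file only when an epoch beats it, filling the `{epoch:02d}` and `{loss:.4f}` placeholders of the file name with `str.format`. This is a simplified illustration, not the Keras implementation; the list of losses is partly made up (the third value is invented to show an epoch without improvement), and nothing is written to disk.

```python
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
epoch_losses = [2.96558, 2.73902, 2.80000, 2.60642]  # third value is made up

best = float("inf")   # with mode='min' the first epoch always improves
saved = []
for epoch, loss in enumerate(epoch_losses, start=1):
    if loss < best:   # save only when the monitored loss improves
        best = loss
        saved.append(filepath.format(epoch=epoch, loss=loss))

print(saved)
```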

## <font color=orange> Take a look at HDF5!</font>

  HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.

## [HDF5 Web portal](https://portal.hdfgroup.org/display/HDF5/HDF5)

---

## Fit our model to the data.
- ### Here we use a modest number of 20 epochs and a large batch size of 128 patterns

<font color=yellow  face="times, serif" size=5>============================================<br>
**TO DO:**   Fit the model, first with 20 epochs batch_size=128 AND $callbacks$ !! </font>
 

In [0]:
model.fit(X, y, epochs=200, batch_size=128, callbacks=callbacks_list)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where



Epoch 1/200






Epoch 00001: loss improved from inf to 2.96558, saving model to /content/gdrive/My Drive/weights-improvement-01-2.9656.hdf5
Epoch 2/200

Epoch 00002: loss improved from 2.96558 to 2.73902, saving model to /content/gdrive/My Drive/weights-improvement-02-2.7390.hdf5
Epoch 3/200

Epoch 00003: loss improved from 2.73902 to 2.60642, saving model to /content/gdrive/My Drive/weights-improvement-03-2.6064.hdf5
Epoch 4/200

Epoch 00004: loss improved from 2.60642 to 2.50336, saving model to /content/gdrive/My Drive/weights-improvement-04-2.5034.hdf5
Epoch 5/200

Epoch 00005: loss improved from 2.50336 to 2.40813, saving model to /content/gdrive/My Drive/weights-improvement-05-2.4081.hdf5
Epoch 6/200

Epoch 00006: loss improved from 2.40813 to 2.31746, saving model to /content/gdrive/My Drive/weights-improvement-06-2.3175.hdf5
Epoch 7/200

Epoch 00007: loss improved from 2.31746 to 

In [0]:
model.fit(X, y, epochs=2, batch_size=128)

Epoch 1/2
Epoch 2/2
  3712/144331 [..............................] - ETA: 3:48 - loss: 2.5961

KeyboardInterrupt: ignored

## Check that the callback has stored the best models

In [0]:
ls /content/gdrive/'My Drive'

 APMICRO/                         OSA-python/
'Carta mama.gdoc'                 preguntas/
'Colab Notebooks'/                'Proyecto: HOP IN'/
 english/                         RECM-SCOM/
 GPRO-PROJECT/                    SSMM/
'Guided Tour VALENCIA.pdf'        Sw.zip
 Info_BDApnea_QuironMalaga.xlsx   tfg/
'ISIN '/                          Vodafone/
 L3_MariaBrull.zip                weights-improvement-01-2.9259.hdf5
 MANU-MARIA/                      weights-improvement-02-2.7426.hdf5
 Material_L2.zip


In [0]:
! cp /content/gdrive/'My Drive'/weights-improvement-02-2.7426.hdf5 weights.hdf5



---

## Generating Text with an LSTM Network


---

## The network weights are loaded from a checkpoint file and the network does not need to be trained.

In [0]:
# load the network weights
filename = "weights.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

---

## Finally: make predictions.

- ### The simplest way is to first start with a seed sequence as input and predict the next character
- ### then update the seed sequence to add the predicted character on the end and trim off the first character.
- ### ...repeat this process to predict new characters (e.g. a sequence of 1,000 characters in length).
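
The seed-update loop described above can be sketched without the network. Here `predict_next` is a hypothetical stand-in for `model.predict` followed by `argmax`: it simply returns the last code plus one, modulo 5. The point is only the bookkeeping: the window keeps a fixed length while the oldest code is dropped and the predicted one is appended.

```python
def predict_next(window):
    """Hypothetical stand-in for the network: last code plus one, modulo 5."""
    return (window[-1] + 1) % 5

seed = [0, 1, 2]
pattern = list(seed)   # work on a copy so the seed itself is not modified
generated = []

for _ in range(6):
    index = predict_next(pattern)
    generated.append(index)
    # drop the oldest code and append the predicted one: length stays constant
    pattern = pattern[1:] + [index]

print(generated)
```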


In [0]:
import sys

# pick a random seed
start = np.random.randint(0, len(dataX)-1)
# copy the window so that appending below does not modify dataX[start]
pattern = list(dataX[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
print('\n GENERATE: \n')

# generate characters
for i in range(500):
  # same preprocessing as for training: reshape and normalize
  x = np.reshape(pattern, (1, len(pattern), 1))
  x = x / float(n_vocab)
  prediction = model.predict(x, verbose=0)
  # greedy choice: take the most probable next character
  index = int(np.argmax(prediction))
  result = int_to_char[index]
 
  # print every output character
  sys.stdout.write(result)
  
  # add output char
  pattern.append(index)
  # remove first char
  pattern = pattern[1:len(pattern)]
  
print("\nDone.")

Seed:
" at the great concert
given by the queen of hearts, and i had to sing

     "twinkle, twinkle, little "

 GENERATE: 

 toe toet to the toee  
nhe tae haree to the toree  
nhe toet ho the  and the toet toe toee th the woee  
nhe toet ho the  and the toet toe toee th the woee  
nhe toet ho the  and the toet toe toee th the woee  
nhe toet ho the  and the toet toe toee th the woee  
nhe toet ho the  and the toet toe toee th the woee  
nhe toet ho the  and the toet toe toee th the woee  
nhe toet ho the  and the toet toe toee th the woee  
nhe toet ho the  and the toet toe toee th the woee  
nhe toet ho the  and th
Done.


In [0]:
result

'o'

# You can look for some ideas and improvements in:

- ### [Learn about EMBEDDINGS](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/6.1-using-word-embeddings.ipynb)

- ### [text-generation-lstm-recurrent-neural-networks-python-keras](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/)

- ### [Deepanway Ghosal](https://github.com/deepanwayx/char-and-word-rnn-keras)

