# Long Short Term Memory - LSTM

Why change the method? The RNN worked well enough for predicting wine variety names.
But longer texts such as wine reviews create long-term dependencies, and RNNs have limitations when the input sequences span long intervals. These limitations are caused by the vanishing and exploding gradient problems.

In the vanishing gradient problem, the gradients of the weights become smaller and smaller and eventually approach zero as backpropagation moves from the last time-step towards the first time-step.
In the exploding gradient problem, the gradient values of the weights become bigger and bigger as backpropagation moves towards the first time-step. A common remedy is gradient clipping, which limits the maximum value of the gradients. This sounds a lot like compressing audio, where hard clipping can create distortion and thus introduce noise.

Training a neural network is nothing but adjusting the values of the weights so that the error gets reduced.
- Error measures the gap between the expected value and the output, for example the squared difference (expected - output)².
- Gradient is the rate of change of error with respect to weights. We can adjust the weights to reduce the error.
- The gradient values are multiplied by a small fraction and then subtracted from the weights to reduce the error.
- The fraction is called the learning rate, which influences how quickly the weight values converge to the optimal value.
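
To make the update rule concrete, here is a minimal sketch with a single weight and a single made-up training sample. The squared error and all the numbers are assumptions chosen for illustration; they are not taken from the model trained later in this notebook.

```python
# Toy example: fit one weight w so that w * x approximates the target.
x, target = 2.0, 6.0      # one training sample (made-up numbers)
w = 0.0                   # initial weight
learning_rate = 0.1       # the "small fraction" from the list above

for step in range(25):
    output = w * x
    error = (target - output) ** 2          # squared error
    gradient = -2 * x * (target - output)   # d(error)/d(w)
    w -= learning_rate * gradient           # step against the gradient

print(round(w, 4))  # w ends up very close to 3.0, since 3.0 * 2.0 == 6.0
```

With a much larger learning rate the same loop would overshoot and diverge, which is one way to see why the fraction is kept small.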

#### Vanishing and exploding gradients
Simple recurrent neural networks suffer from the problem of vanishing gradients where the gradients of the weights become smaller and smaller and eventually become zero as we move backward from the last time-step towards the first time-step.

Recurrent neural networks also suffer from the exploding gradient problem, where the gradient values of the weights become bigger and bigger as we back-propagate towards the first time-step.
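
A rough way to picture both problems is to pretend that every time-step multiplies the gradient by the same factor. That is a simplification I am assuming for illustration (real networks multiply by matrices that change from step to step), and the helper names below are hypothetical.

```python
import math

def backprop_factor(per_step_factor, n_steps):
    """Product of the same per-step factor over n_steps time-steps."""
    return per_step_factor ** n_steps

def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

vanishing = backprop_factor(0.9, 100)   # shrinks towards zero
exploding = backprop_factor(1.1, 100)   # grows very large
clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)

print(vanishing, exploding, clipped)
```

A factor only slightly below 1 fades to almost nothing after 100 steps, and a factor slightly above 1 blows up. Clipping keeps the direction of the gradient but caps its length.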

We can infer the future by learning from the past. But sometimes we don't need to go back very far; we can learn from the recent past.

#### Short and long term dependency example:
- Short term dependency would be predicting "aromas" in "With attractive melon and other tropical aromas"
- Long term dependency would be predicting the second part of the sentence in "this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics... it's a good companion to a hearty winter stew."

RNN is good for short term dependency but it does not handle long term dependencies as well. For this reason, for review generation I am going to train a Long Short Term Memory (LSTM) network.

Write like Ms.Sion Sommelier
- Input: A dataset of wine reviews (in English). I have about 130k of them, give or take a few thousand, but will use only those by Jim Gordon.
- Output: A complete sentence, as Jim Gordon would complete it.
Check him out on winespectator.com if you like. (https://www.winespectator.com/authors/jim-gordon).

In [2]:
import pickle
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM
from keras import backend


# Ignoring the warnings. This pull request explains the reason for warnings: 
# https://github.com/keras-team/keras/pull/13012




Using TensorFlow backend.


In [3]:
with open('/home/ec2-user/SageMaker/MSDS696/wine_msds.pkl', 'rb') as f:
    wine_msds = pickle.load(f)

In [4]:
wine_msds.head(3)

Unnamed: 0,id,country,description,designation,points,price,province,region,taster_name,title,variety,winery,vintage
16,16,Argentina,"Baked plum, molasses, balsamic vinegar and che...",Felix,87,30.0,Other,Cafayate,Michael Schachner,Felix Lavaque 2010 Felix Malbec (Cafayate),Malbec,Felix Lavaque,2010
17,17,Argentina,Raw black-cherry aromas are direct and simple ...,Winemaker Selection,87,13.0,Mendoza Province,Mendoza,Michael Schachner,Gaucho Andino 2011 Winemaker Selection Malbec ...,Malbec,Gaucho Andino,2011
183,183,Argentina,With attractive melon and other tropical aroma...,Salta,88,12.0,Other,Salta,Michael Schachner,Alamos 2007 Torrontés (Salta),Torrontés,Alamos,2007


In [5]:
df_rev = wine_msds[['description']]
df_rev.head(3)

Unnamed: 0,description
16,"Baked plum, molasses, balsamic vinegar and che..."
17,Raw black-cherry aromas are direct and simple ...
183,With attractive melon and other tropical aroma...


In [6]:
wine_msds['taster_name'].value_counts()
# Solving memory issues is going to require massive RAM. 
# I have to choose subsets.

Unknown Taster        24917
Roger Voss            23560
Michael Schachner     14046
Kerin O’Keefe          9697
Paul Gregutt           8868
Virginie Boone         8708
Matt Kettmann          5730
Joe Czerwinski         4766
Sean P. Sullivan       4461
Anna Lee C. Iijima     4017
Jim Gordon             3766
Anne Krebiehl MW       3290
Lauren Buzzeo          1700
Susan Kostrzewa        1022
Mike DeSimone           461
Jeff Jenssen            436
Alexander Peartree      383
Carrie Dykes            129
Fiona Adams              24
Christina Pickard         6
Name: taster_name, dtype: int64

#### For this I am going with my man Jim Gordon - His profile from wine spectator can be found here:
https://www.winespectator.com/authors/jim-gordon
#### With that said, I built everything using Jeff Jenssen, since only about 400 reviews could be processed in a reasonable development time.

In [7]:
#wine_msds.loc[wine_msds['taster_name'].isin(['Roger Voss', 'Michael Schachner'])]

#tasters = ['Jeff Jenssen'] 

tasters = ['Jim Gordon'] 
wine_msds_taster = wine_msds[wine_msds.taster_name.isin(tasters)]
wine_msds_taster.shape

# Also he is only about 3.7% of my total data set

(3766, 13)

In [8]:
# !pip install spacy
# !python3 -m spacy download en_core_web_sm

You should consider upgrading via the '/home/ec2-user/anaconda3/envs/tensorflow_p36/bin/python -m pip install --upgrade pip' command.
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/tensorflow_p36/bin/python3 -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [9]:
import string
# text processing
import spacy
#nlp = spacy.load('en_core_web_sm')
import en_core_web_sm
nlp = en_core_web_sm.load()

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
en_stopwords = stopwords.words('english')  
stopwords = set(en_stopwords + ['jay'])  # Checking how to add to stopwords. Works!

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
import os
os.getcwd()

import re
from colorama import Fore, Back, Style

In [11]:
# I was not able to remove everything using regex, so I am doing it the good old-fashioned way
spec_chars = ["!",'"',"#","%","&","'","(",")",
              "*","+",",","-",".","/",":",";","<",
              "=",">","?","@","[","\\","]","^","_", "‘", "’", "“", "”", "…",
              "`","{","|","}","~","–", "©",'¡',"—","¨","¬","°","º","½","%","$",".","-","•"]


In [12]:
# Removing code syntax from text 
for char in spec_chars:
    wine_msds_taster['description_clean'] = wine_msds_taster['description'].str.replace(char, '')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  app.launch_new_instance()
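
One thing worth checking in the loop above: each pass reads from the original `description` column, so only the replacement for the last character in `spec_chars` is kept. That would explain why the before and after printouts look the same. A minimal sketch of a chained version follows. The two-row frame and `chars_to_drop` are hypothetical stand-ins, and `regex=False` is my assumption about the intent (treat every character literally).

```python
import pandas as pd

# Hypothetical two-row stand-in for wine_msds_taster
reviews = pd.DataFrame(
    {'description': ['Spicy-smoky, "bold" wine!', 'Dry (and rich).']}
).copy()  # an explicit copy avoids the SettingWithCopyWarning on real slices
chars_to_drop = ['!', '"', '(', ')', ',', '-', '.']

cleaned = reviews['description']
for char in chars_to_drop:
    # Feed the result of the previous pass into the next one;
    # regex=False treats characters such as "(" or "." literally.
    cleaned = cleaned.str.replace(char, '', regex=False)

reviews['description_clean'] = cleaned
print(reviews['description_clean'].tolist())
```

Note that stripping hyphens this way also glues words such as "spicy-smoky" together, so replacing with a space instead of an empty string may be the better choice for some characters.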


In [13]:
print(wine_msds_taster['description'])
print(Fore.GREEN + Style.DIM + '\nAfter cleaning text-----\n')
print(wine_msds_taster['description_clean'])

# Nothing visible below but text had abnormalities that had to be treated

68        Very deep in color and spicy-smoky in flavor, ...
199       This is a beatifully balanced, not-too-full-bo...
424       This wine is dry and rather full bodied, start...
433       Smoky aromas and a bold apricot flavor give th...
519       Deliciously fruity but also well-structured, t...
                                ...                        
119664    This is a big, dark and concentrated wine with...
119731    This is one of the best Lodi Native wine yet. ...
119882    Tempting, slightly nutty aromas and a firm tex...
119893    This wine hits the mark, combining intriguing ...
119896    Almost like liquid cherry pie—but not sweet—th...
Name: description, Length: 3766, dtype: object
[32m[2m
After cleaning text-----

68        Very deep in color and spicy-smoky in flavor, ...
199       This is a beatifully balanced, not-too-full-bo...
424       This wine is dry and rather full bodied, start...
433       Smoky aromas and a bold apricot flavor give th...
519       Delicio

In [14]:
wine_review_corpus = " ".join(wine_msds_taster['description_clean'].str.lower())
wine_review_corpus[354:849]

# Lowercasing the vocabulary was difficult at later stages. In my proper run I did this earlier.

"it has classic black cherry, black olive and anise aromas, harmonious fruit flavors accented with light oak spiciness and a firm, fine-grained tannic structure. this wine is dry and rather full bodied, starting with fruity aromas like banana and coconut then seguing to more nutty, oaky, complex flavors and a rich texture. it's a serious dinner wine with heft and substance. smoky aromas and a bold apricot flavor give this dry and seemingly light-bodied wine a distinct personality. fermented "

Hmm, my corpus looks ready. However, I really do think I can get better results if I remove the wine variety names mentioned in the reviews. Having the words Chardonnay and Red Blend in the same sentence is a dead giveaway that the sentence was manufactured by a bot. Even a really drunk sommelier is not going to make this mistake.

I think I have two options here: 1) remove the variety name completely, or 2) replace it with either "wine" or "this wine".

I am going to go with option one. First I am going to get the list of unique wine variety names, and then add these words to the stopwords list. I think this is going to make the generated reviews more widely applicable.

In [15]:
# stopwords = set(STOPWORDS)
# stopwords.add("int")
# stopwords.add("ext")
# Tip: Another way to add the list. This is not what I am using

In [16]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
en_stopwords = stopwords.words('english')  

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
# get the varieties list
varieties = wine_msds[['variety']].drop_duplicates()
len(varieties['variety'])

# 708 unique wines

708

In [18]:
varieties[:5] # take a quick look for sanity check

Unnamed: 0,variety
16,Malbec
183,Torrontés
231,Bonarda
245,Chardonnay
261,Red Blend


In [19]:
print(varieties['variety'].head(3).tolist()) 
variety_list=varieties['variety'].tolist()

['Malbec', 'Torrontés', 'Bonarda']


In [None]:
# I made a mistake here: I should not have removed the stop words from the corpus.
# It has to be like below:
wine_variety_list = set(variety_list)

In [20]:
# stopwords_new = set(en_stopwords + variety_list)  # Add to the stopwords

In [21]:
len(wine_review_corpus) # Let's check before and after! This is before.

901853

In [22]:
# New way of removing the stopwords

# import re
# pattern = re.compile(r'\b(' + r'|'.join(stopwords_new) + r')\b\s*')
# wine_review_corpus_stopless = pattern.sub('', wine_review_corpus)

In [None]:
import re
pattern = re.compile(r'\b(' + r'|'.join(wine_variety_list) + r')\b\s*')
wine_review_corpus_stopless = pattern.sub('', wine_review_corpus)
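
Two details may matter for this pattern. The corpus was lowercased earlier while the variety names are capitalised, and some names contain characters that mean something special in a regex. A minimal sketch under those assumptions, using hypothetical miniature inputs instead of `variety_list` and the real corpus:

```python
import re

# Hypothetical miniature inputs standing in for variety_list and the corpus
sample_varieties = ['Malbec', 'Red Blend', 'G-S-M']
sample_corpus = 'this malbec is bold. a red blend with spice. a g-s-m from lodi.'

# Lowercase to match the corpus; longest names first so a multi-word
# name wins over any shorter name it contains.
names = sorted({v.lower() for v in sample_varieties}, key=len, reverse=True)

# re.escape keeps characters inside a name from being read as regex syntax.
variety_pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(n) for n in names) + r')\b\s*'
)

stripped = variety_pattern.sub('', sample_corpus)
print(stripped)
```

Removing a name outright leaves fragments such as "a with spice", which is the trade-off of option one compared with substituting "wine".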

In [23]:
len(wine_review_corpus_stopless)

684978

In [65]:
wine_review_corpus_stopless

"deep color spicy-smoky flavor, full-bodied wine packed dark-fruit flavors like blackberry blueberry. aromas like grilled beef spicy flavors like cardamom smoke give bold character 'hard deny. beatifully balanced, --full-bodied wine vines grown 2,400 feet sierra range. classic black cherry, black olive anise aromas, harmonious fruit flavors accented light oak spiciness firm, fine-grained tannic structure. wine dry rather full bodied, starting fruity aromas like banana coconut seguing nutty, oaky, complex flavors rich texture. 'serious dinner wine heft substance. smoky aromas bold apricot flavor give dry seemingly light-bodied wine distinct personality. fermented small concrete egg-shaped vats, 'far simple fruity, instead offers great acidity tangy, tempting, appetizing personality. deliciously fruity also well-structured, full-bodied wine excellent concentration good balance. effusive aromas blueberry black cherry lead opulent overripe berry flavors lightly accented clove cinnamon nuan

A valuable lesson I learned here was not to stick to the same method while assuming that corpus length was the main problem. I tried many runs using smaller and smaller data sets. I think the slowness was caused by the function comparing two large collections of words. I had added 708 words to my stop words, and checking whether each item of one set exists in the other is taxing; no matter how much processing power I threw at it, it would not solve my problem. At this point, good old regex came to help, and it works lightning fast for my application.

# Vocabulary and character to integer mapping

In [31]:
# Find the vocabulary
vocabulary = sorted(set(wine_review_corpus_stopless))

# Print the vocabulary size
print('Vocabulary size:', len(vocabulary))

# Too many; I need to lower this by replacing characters.

Vocabulary size: 63


In [32]:
print(vocabulary)

# I think I can generate better results if I mapped special chars to english vocabulary equivalents.
# Also what to do with the numbers? In a vocabulary there are no numbers, I need to drop any integer from corpus.

[' ', '!', '%', '&', "'", '(', ')', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xad', 'á', 'ç', 'è', 'é', 'ñ', 'ó', 'ô', 'ü', '–', '—', '“', '”']


### Drop integers from corpus

In [34]:
corpus_cl = wine_review_corpus_stopless.translate({ord(ch): None for ch in '0123456789'})
#re.sub(r'\b[0-9]+\b\s*', '', s)
#output = ''.join(map(lambda c: '' if c in '0123456789' else c, my_str))

# TIP: This time I used translate instead of regex. This matters when processing a larger corpus.
# re.sub took about twice as long as str.translate (slightly longer if you don't use a pre-compiled pattern).

In [35]:
# In case I missed a special char, I can treat it right here!
# temp = str.maketrans("á", "a") 
# full_str = full_str.translate(temp) 

# temp = str.maketrans("è", "e")
# full_str = full_str.translate(temp) 

# https://stackoverflow.com/questions/22654429/replacing-multiple-letters-in-a-word-with-number-in-python/22654598

In [36]:
# Not sure if this one by one is going to work. Initially I was only doing a sample but for this now I need a 
# better way to treat (ref from: https://stackoverflow.com/questions/22654429/replacing-multiple-letters-in-a-word-with-number-in-python/22654598)
d = {"á": "a", "è":"e", "é":"e", "è":"e", "ñ":"n", "\x9d":" ", "¡":" ", "¨":"", "ç":"c",
     "ó":"o", "ó":"o", "ó":"o", 'ò':"o",'ó':"o",'ô':"o",'õ':"o",'ö':"o",'ø':"o", "\u3000":" ",
     "ù":"u","ú":"u","û":"u","ü":"u", "•":" ", "–":"—", "ž": "z", ":":" ",
     "ì":"i", "í":"i", "î":"i", "ï":"i","°":" ",'´':" ",'º':" ",'½':" ",'à':"a",
     "á":"a","â":"a","ã":"a","ä":"a","è":"e","é":"e","ê":"e","ë":"e", "…":" ", "?":" ",
     "\x9d":" ","¡":" ",'¨':" ", "¬":" ", "\xad":" ", "\n":" ", "—":" ", "“":" ", "&":" ", "—":" ",
     "”":" ", "’":" ", "—":" ", "%":" ", "(":" ", ")":" ",
      "'":" ", "š": "s", "ÿ":"y", "ü":"u"}


In [37]:
wine_review_corpus_cl = ''.join(d.get(ch, ch) for ch in corpus_cl)
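
Since `str.translate` was already the faster choice for dropping digits, the same idea could probably handle this character substitution too. The sketch below assumes every key in the mapping is a single character, which `str.maketrans` requires when it is given a dict. `char_map` and `sample` are hypothetical.

```python
# Hypothetical small mapping in the same spirit as d
char_map = {'á': 'a', 'é': 'e', 'ñ': 'n', '“': ' ', '”': ' '}
table = str.maketrans(char_map)

sample = 'rosé from viña “old vines”'
translated = sample.translate(table)
print(repr(translated))
```

One behaviour to keep in mind: a dict literal silently keeps only the last value for a repeated key, so duplicates in the mapping are harmless but easy to misread.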

In [38]:
# Find the vocabulary
vocabulary_cl = sorted(set(wine_review_corpus_cl))

# Print the vocabulary size
print('Vocabulary size:', len(vocabulary_cl))

Vocabulary size: 34


In [39]:
vocabulary_cl

[' ',
 '!',
 ',',
 '-',
 '.',
 '/',
 ';',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '—']

### Interestingly I did not have to do this since I removed wine variety names from my corpus. Once those were cleaned, I did not have any foreign characters. It was a good decision to clean the wine variety.

In [40]:
# # This is probably the cleanest way!!!
# # Find the vocabulary
# vocabulary = sorted(set(full_str))

# # Print the vocabulary size
# print('Vocabulary size:', len(vocabulary))

## Finally ready for the char to int mapping

In [41]:
# Dictionary to save the mapping from char to integer
char_to_idx = { char : idx for idx, char in enumerate(vocabulary_cl) }

# Dictionary to save the mapping from integer to char
idx_to_char = { idx : char for idx, char in enumerate(vocabulary_cl) }

# Print char_to_idx and idx_to_char
print(char_to_idx)
print(idx_to_char)


# A plain English alphabet should be tried: 26 letters plus space, 27 characters in total. This run has 34.

{' ': 0, '!': 1, ',': 2, '-': 3, '.': 4, '/': 5, ';': 6, 'a': 7, 'b': 8, 'c': 9, 'd': 10, 'e': 11, 'f': 12, 'g': 13, 'h': 14, 'i': 15, 'j': 16, 'k': 17, 'l': 18, 'm': 19, 'n': 20, 'o': 21, 'p': 22, 'q': 23, 'r': 24, 's': 25, 't': 26, 'u': 27, 'v': 28, 'w': 29, 'x': 30, 'y': 31, 'z': 32, '—': 33}
{0: ' ', 1: '!', 2: ',', 3: '-', 4: '.', 5: '/', 6: ';', 7: 'a', 8: 'b', 9: 'c', 10: 'd', 11: 'e', 12: 'f', 13: 'g', 14: 'h', 15: 'i', 16: 'j', 17: 'k', 18: 'l', 19: 'm', 20: 'n', 21: 'o', 22: 'p', 23: 'q', 24: 'r', 25: 's', 26: 't', 27: 'u', 28: 'v', 29: 'w', 30: 'x', 31: 'y', 32: 'z', 33: '—'}


In [42]:
maxlen = 20

# First run was 40, now I am trying 20.

In [43]:
# Create empty lists for input and target datasets
input_data = []
target_data = []

# Iterate to get all substrings of length maxlen
for i in range(0, len(wine_review_corpus_cl) - maxlen):
    # Find the sequence of length maxlen starting at i
    input_data.append(wine_review_corpus_cl[i : i+maxlen])
    
    # Find the next char after this sequence 
    target_data.append(wine_review_corpus_cl[i+maxlen])

# Print number of sequences in input data
print('No of Sequences:', len(input_data))

No of Sequences: 682797
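
To see what the windowing loop produces, here is the same idea on a tiny string. `make_windows` is a hypothetical helper and the window of 3 is chosen only to keep the printout short.

```python
def make_windows(text, window):
    """Return (inputs, targets): every window-length slice and the char after it."""
    inputs = [text[i:i + window] for i in range(len(text) - window)]
    targets = [text[i + window] for i in range(len(text) - window)]
    return inputs, targets

inputs, targets = make_windows('dry wine', window=3)
for seq, nxt in zip(inputs, targets):
    print(repr(seq), '->', repr(nxt))
```

A text of length n yields n - window pairs, which is why the number of sequences is almost as large as the corpus itself.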


In [44]:
# Create a 3-D zero vector to contain the encoded input sequences
x = np.zeros((len(input_data), maxlen, len(vocabulary_cl)), dtype='float32')

# Create a 2-D zero vector to contain the encoded target characters
y = np.zeros((len(target_data), len(vocabulary_cl)), dtype='float32')

# The full set would not work. I had: 
# No of Sequences: 19856541
# MemoryError: Unable to allocate 82.8 GiB for an array with shape (19856541, 40, 28) and data type float32
# For this reason, lowering the count to only top 3 reviewers. Actually that did not work either
# We are now working with Jim Gordon. 
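
The size in the MemoryError can be reproduced with simple arithmetic: a dense float32 array needs 4 bytes per value. `one_hot_gib` is a hypothetical helper, and the second call uses the shape of this run (682797 sequences, window of 20, 34 characters).

```python
def one_hot_gib(n_sequences, window, vocab_size, bytes_per_value=4):
    """Size in GiB of a dense (n_sequences, window, vocab_size) array."""
    return n_sequences * window * vocab_size * bytes_per_value / 2**30

full_run = one_hot_gib(19856541, 40, 28)   # shape from the MemoryError
this_run = one_hot_gib(682797, 20, 34)     # shape used in this notebook
print(round(full_run, 1), round(this_run, 2))  # 82.8 1.73
```

Most of that memory holds zeros, so storing integer indices and encoding one batch at a time is a possible way around the limit, although I did not try it here.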

In [45]:
# Iterate over the sequences
for s_idx, sequence in enumerate(input_data):
    # Iterate over all characters in the sequence
    for idx, char in enumerate(sequence):
        # Fill up vector x
        x[s_idx, idx, char_to_idx[char]] = 1    
    # Fill up vector y
    y[s_idx, char_to_idx[target_data[s_idx]]] = 1
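
The nested loop is easy to follow but slow in pure Python for several hundred thousand sequences. A possible vectorised alternative, sketched on hypothetical toy data, is to integer-encode first and then index into an identity matrix:

```python
import numpy as np

# Hypothetical tiny vocabulary and windows, for illustration only
toy_vocab = sorted(set('dry wine'))
toy_char_to_idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_inputs = ['dry', 'ry ', 'y w']
toy_targets = [' ', 'w', 'i']

# Integer-encode first: shape (n_sequences, window)
encoded = np.array([[toy_char_to_idx[ch] for ch in seq] for seq in toy_inputs])
target_ids = np.array([toy_char_to_idx[ch] for ch in toy_targets])

# Row i of the identity matrix is the one-hot vector for index i
eye = np.eye(len(toy_vocab), dtype='float32')
x_toy = eye[encoded]       # shape (3, 3, vocab_size)
y_toy = eye[target_ids]    # shape (3, vocab_size)
print(x_toy.shape, y_toy.shape)
```

The result should match what the loop builds; the final array uses the same memory either way.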

### Create LSTM model in keras

In [46]:
# Create Sequential model 
model = Sequential()

# Add an LSTM layer of 128 units
model.add(LSTM(128, input_shape=(maxlen, len(vocabulary_cl))))

# Add a Dense output layer
model.add(Dense(len(vocabulary_cl), activation='softmax'))






In [47]:
# Compile the model
model.compile(loss="categorical_crossentropy", optimizer="adam")

# Print model summary
model.summary()

# One LSTM layer followed by a dense layer as shown in model summary.
# Now, you have built an LSTM network that can be trained on your dataset and used to generate new texts!



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 128)               83456     
_________________________________________________________________
dense_1 (Dense)              (None, 34)                4386      
Total params: 87,842
Trainable params: 87,842
Non-trainable params: 0
_________________________________________________________________
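
The parameter counts in the summary can be checked by hand. A standard LSTM layer has four gates, and each gate has input weights, recurrent weights and a bias. The helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units

lstm = lstm_params(34, 128)    # 34 characters in, 128 LSTM units
dense = dense_params(128, 34)  # 128 LSTM outputs in, 34 characters out
print(lstm, dense, lstm + dense)  # 83456 4386 87842
```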


In [48]:
### train using the fit function
model.fit(x, y, batch_size=128, epochs=20, validation_split=0.2)

# x and y are input and target vectors.
# batch_size: number of samples after which the weights get adjusted
# epochs: number of times to iterate over the full dataset
# validation_split: fraction of samples set aside for validation

# train on ~4.5M samples and validation on ~1.15M samples

# Run 1: C54xlarge - 1 epoch, batch size 64 - ran in 13.5 mins, finished with loss of 
# Run 2: C518xlarge - 20 epochs, batch size 128 - ran in xx mins, finished with loss of xx
# In between I tried a larger set but I cannot use anything but Jim Gordon. I would simply have to train for a few days
# or get a larger server. Mind you, using Jim Gordon (3.7%) as data and running an 18xlarge for 20 epochs took more than 3 hrs

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where



Train on 546237 samples, validate on 136560 samples
Epoch 1/20





Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f03fc629b00>

Initially I was thinking how daunting training jobs would be, but after working on a larger job I don't feel that way anymore. I think it is possible to set the number of epochs a bit larger and then watch the loss to see whether the point of diminishing returns has been reached. Regardless, I completed a 20-epoch run for this data. I wonder at what loss I am going to be able to achieve coherence. With one epoch the results are abysmal, and as the 20 epochs were running all I could think about was how the results were going to come out. Even with a c518Xlarge machine and 3.7% of the overall corpus, training took over 3 hrs. Another tip: the cost of larger machines adds up quickly, and it is easy to exceed the allowance you initially had in mind. But we adapt.
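
Watching the loss for diminishing returns can be automated. The sketch below is a framework-free version of the usual "patience" rule. `stopping_epoch` is a hypothetical helper and the loss values are made up, not results from this notebook.

```python
def stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which to stop, or None to keep training."""
    best = float('inf')
    epochs_without_gain = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss                 # meaningful improvement
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1    # no real progress this epoch
            if epochs_without_gain >= patience:
                return epoch
    return None

# Made-up validation losses for illustration
example_losses = [2.9, 2.4, 2.1, 2.0, 2.0, 2.01, 1.99]
print(stopping_epoch(example_losses, patience=2, min_delta=0.01))  # 6
```

Keras ships an `EarlyStopping` callback built on the same idea (with `monitor`, `patience` and `min_delta` arguments) that can be passed to `model.fit` through `callbacks`, which could also save some of the server cost mentioned above.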

### Inference using LSTM

Training and validation: Training is nothing but adjusting the weights of the network so that the overall network error is reduced. For unseen data, a separate sample is set aside; this is called the test or validation set.

In [52]:
# seed sentence
#sentence = "that poor contempt or claimd thou sle "  # 40 total chars including quotes
sentence = "this wine is so a"


# In my first run, it simply would not work. I reran the vocabulary, cleaned everything further,
# even added additional server power. However, I simply could not iterate over each
# char of the sample sentence. It throws a KeyError: 'T'. When I have an error, my usual
# reaction is to copy, paste and search. This one is so random that a search won't return anything useful.

# After many hours, I realized that I had started the sentence with a capital 'T'. I forgot
# that I had lowercased my corpus. With that, the KeyError makes so much sense now. Perhaps you will
# remember reading this and can spot it quickly if it happens to you.

In [53]:
# encoded sentence
X_test = np.zeros((1, maxlen, len(vocabulary_cl)))

In [54]:
# Iterate over each character and convert them to one-hot encoded vector.
for s_idx, char in enumerate(sentence): 
    X_test[0, s_idx, char_to_idx[char]] = 1

IndexError: index 20 is out of bounds for axis 1 with size 20
    
This error means the input sentence is longer than maxlen allows.
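
Both pitfalls, the KeyError from a capital letter and the IndexError from a seed that is too long, can be guarded against before encoding. `encode_seed` is a hypothetical helper and the toy vocabulary is an assumption. In the notebook the real `char_to_idx` and `maxlen` would be passed in.

```python
import numpy as np

def encode_seed(seed, char_to_idx, maxlen):
    """One-hot encode a seed string into shape (1, maxlen, vocab_size)."""
    seed = seed.lower()[-maxlen:]   # match the lowercased corpus, keep the tail
    unknown = sorted(set(seed) - set(char_to_idx))
    if unknown:
        raise ValueError(f'characters not in vocabulary: {unknown}')
    encoded = np.zeros((1, maxlen, len(char_to_idx)), dtype='float32')
    for t, char in enumerate(seed):
        encoded[0, t, char_to_idx[char]] = 1
    return encoded

# Hypothetical toy vocabulary: space plus the 26 lowercase letters
toy_map = {ch: i for i, ch in enumerate(' abcdefghijklmnopqrstuvwxyz')}
toy = encode_seed('This wine is so amazingly good', toy_map, maxlen=20)
print(toy.shape, int(toy.sum()))
```

Keeping the last `maxlen` characters rather than the first is a design choice: the most recent context is what the network saw right before each target during training.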

In [55]:
X_test[0, s_idx, char_to_idx[char]] = 1

In [56]:
# Get the probability distribution using model predict
preds = model.predict(X_test, verbose=0)

# Get the probability distribution for the first character after the sequence
preds_next_char = preds[0]

# Predict next char by feeding the encoded sentence to the LSTM network
# I could have done it this way too: preds = model.predict(X_test, verbose=0)[0]

In [57]:
preds_next_char # array holding the probability distribution for the next character

array([1.3499321e-02, 2.7528526e-07, 2.0037872e-04, 2.5763428e-03,
       1.2261264e-03, 1.5691012e-07, 1.0125599e-06, 1.8542690e-02,
       5.1168208e-03, 9.3439897e-04, 2.5190014e-04, 1.8872404e-02,
       2.0162959e-03, 4.9567603e-02, 1.1143660e-03, 3.1193081e-01,
       8.0018857e-05, 1.5020086e-06, 1.8903526e-02, 3.0531648e-03,
       9.0635521e-04, 4.6871364e-02, 1.5516396e-02, 2.3209873e-04,
       9.5382339e-04, 1.9660896e-01, 1.8865202e-03, 2.3131220e-01,
       3.2928805e-03, 5.4155789e-02, 3.5750918e-06, 3.5906999e-04,
       1.1705505e-05, 1.4472121e-07], dtype=float32)

### Generate text imitating reviewer 'Jim Gordon'

In [58]:
# Index with highest probability
next_index = np.argmax(preds_next_char)

In [59]:
# Mapping the index to actual char
next_char = idx_to_char[next_index]

In [60]:
next_char

'i'

In [61]:
sentence = "this wine is so a"

In [62]:
def generate_text(sentence, n):
    """
    Function to generate text
    Inputs: seed sentence and number of characters to be generated.
    Output: returns nothing but prints the generated sequence.
    """
    
    # Initialize the generated sequence with the seed sentence
    generated = ''
    generated += sentence
    
    # Iterate for each character to be generated
    for i in range(n):
      
        # Create input vector from the input sentence
        x_pred = np.zeros((1, maxlen, len(vocabulary_cl)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_to_idx[char]] = 1.

        # Get probability distribution for the next character
        preds = model.predict(x_pred, verbose=0)[0]
        
        # Get the index with maximum probability
        next_index = np.argmax(preds)
        next_char = idx_to_char[next_index]

        # Append the new character to the input sentence for next iteration
        sentence = sentence[1:] + next_char

        # Append the new character to the text generated so far
        generated += next_char
    
    # Print the generated text
    print(generated)

In [64]:
generate_text(sentence, 500)

this wine is so aieouooeeo. s  -la n eaoin.soooeo s si egeei nhg oaeooii aeeoiii oaaaa oa oi egeooi oaaoe creaei aeee noetu aaael.e oa ea oa eiaor oaeoeeo.eaooeoea s   ooa oaeoeeo.o si egeo oaeoor cru sssss s  l l n sa eattiyol oieoaeoa s s  oa egoo oaeou oaeoooaalo.o s s  lloooeoooi ci oa ea o ea o agoeo l n  etile naraytaii aeeoio oiiaoaaa cru ss si e si egeaoooo oeeoor cru ssssss oa oaeol cru sssss s  l l n sa eattiyol oieoaeoa s s  oa egoo oaeou oaeoooaalo.o s s  lloooeoooi ci oa ea o ea o agoeo l n  etile n


#### Test run: 1 epoch - completely gibberish
#### First run: 20 epochs - forgot and removed stopwords from the corpus - result: longer gibberish
#### Second run: 20 epochs - still an extra-clean vocabulary. Perhaps this is the issue: I need to keep the integrity of the corpus. Maybe I am introducing noise and getting only the most common letters. It is too homogeneous. Somehow, in trying to be coherent by eliminating variety, I went completely the opposite way.
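
The generated text also falls into repeating loops, which always picking the most probable character (argmax) tends to cause. Sampling from the predicted distribution with a temperature is a common alternative. I have not tried it in this notebook, so treat the sketch below as an idea to test rather than a fix. `sample_index` and the toy probabilities are hypothetical.

```python
import math
import random

def sample_index(probs, temperature=1.0, rng=random):
    """Draw an index from probs after sharpening or flattening it by temperature."""
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    peak = max(logits)                              # subtract for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_probs = [0.5, 0.3, 0.2]
draws = [sample_index(toy_probs, temperature=1.0, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in range(3)})
```

A temperature close to zero behaves like argmax, while higher values give less likely characters more of a chance. In `generate_text` this would replace the `np.argmax(preds)` line.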