Baby Names Generator Using LSTM (Long Short-Term Memory)

An LSTM network is a particular type of recurrent neural network (RNN). RNNs contain a recurrent layer (or cell) that is able to handle sequential data by making its own output at a particular timestep form part of the input to the next timestep, so that information from the past can affect the prediction at the current timestep. 

We use the term "LSTM network" to mean a neural network with an LSTM recurrent layer.
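
To make the recurrence concrete, here is a minimal sketch of a plain (non-LSTM) recurrent cell in NumPy. It is not the Keras implementation; the names `rnn_forward`, `W_x`, `W_h` and `b`, and the toy sizes, are assumptions chosen for illustration. The point is only that the hidden state computed at one timestep is fed back in at the next.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple recurrent cell over a sequence.

    inputs has shape (timesteps, input_dim); the function returns
    the hidden state after every timestep.
    """
    hidden = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        # the previous hidden state is combined with the new input,
        # so earlier timesteps influence the current one
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        states.append(hidden)
    return np.stack(states)

# toy sizes (assumptions for this sketch)
rng = np.random.default_rng(0)
input_dim, hidden_dim, timesteps = 3, 4, 5
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

sequence = rng.normal(size=(timesteps, input_dim))
states = rnn_forward(sequence, W_x, W_h, b)
print(states.shape)  # (5, 4)
```

An LSTM cell follows the same loop but replaces the single `tanh` update with gated updates to a separate cell state.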

In [1]:
# Import required libraries
from numpy import array
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM

In [2]:
# Mount Google Drive (works only inside Google Colab)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Load the data and report its size
with open('/content/drive/My Drive/Indian_babies_Names.txt', 'r') as f:
    data = f.read()
data = data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))

There are 608924 total characters and 29 unique characters in your data.


SEQUENCE GENERATION

In [4]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # pick the most likely next word index; predict_classes is not
        # available in recent TensorFlow releases, so take the argmax of predict
        yhat = int(model.predict(encoded, verbose=0).argmax(axis=-1)[0])
        # map predicted word index to word (index 0 is padding and maps to '')
        out_word = tokenizer.index_word.get(yhat, '')
        # append to input
        in_text += ' ' + out_word
    return in_text

TOKENIZER

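Tokenization is the process of splitting the text up into individual units, such as words or characters. The Keras Tokenizer works at the word level by default, so each token below is a whole word rather than a character. Using word tokenization means that the model will never be able to predict words outside of the training vocabulary.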
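
The difference between the two granularities can be sketched in plain Python, without Keras. The helper `encode` and the out-of-vocabulary word `rohan` are hypothetical, used only to show that a word-level index has no entry for an unseen word.

```python
text = "arjun ankit lavina"

# word-level tokens: split on whitespace
word_tokens = text.split()
# character-level tokens: every character, including the spaces
char_tokens = list(text)

# integer index for each word, starting at 1 so that 0 can mean padding
word_index = {w: i + 1 for i, w in enumerate(sorted(set(word_tokens)))}

def encode(words, index):
    # words missing from the index are skipped (an assumption of this sketch)
    return [index[w] for w in words if w in index]

print(word_tokens)                             # ['arjun', 'ankit', 'lavina']
print(len(char_tokens))                        # 18
print(encode(["ankit", "rohan"], word_index))  # [1]
```

Because `rohan` never appeared in the text that built the index, it has no integer code, which is the word-level limitation described above.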

In [5]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

Vocabulary Size: 19443
Total Sequences: 12718


In [6]:
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)

Max Sequence Length: 9


In [7]:
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:, :-1], sequences[:, -1]
# one-hot encode the target word
y = to_categorical(y, num_classes=vocab_size)
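
The preceding cells build prefix sequences, left-pad them, and split off the last column as the target. A toy walk-through in NumPy shows the same three steps. The encoded line `[4, 7, 2, 9]` is a made-up example, and the padding is written by hand here instead of calling `pad_sequences`.

```python
import numpy as np

# a hypothetical encoded line of four word indices
encoded = [4, 7, 2, 9]

# every prefix of length >= 2 becomes one training sequence:
# [[4, 7], [4, 7, 2], [4, 7, 2, 9]]
sequences = [encoded[:i + 1] for i in range(1, len(encoded))]

# left-pad with zeros to a common length, as padding='pre' does
max_length = max(len(seq) for seq in sequences)
padded = np.array([[0] * (max_length - len(seq)) + seq for seq in sequences])

# all but the last column are inputs; the last column is the target word
X, y = padded[:, :-1], padded[:, -1]
print(X.tolist())  # [[0, 0, 4], [0, 4, 7], [4, 7, 2]]
print(y.tolist())  # [7, 2, 9]
```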

LSTM Architecture: The input to the model is a sequence of integer tokens, and the output is the probability of each word in the vocabulary appearing next in the sequence.

An embedding layer is essentially a lookup table that converts each token into a vector of length embedding_size. The number of weights learned by this layer is therefore equal to the size of the vocabulary multiplied by embedding_size.
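
The lookup-table view can be illustrated with a small NumPy matrix. This is a sketch with toy sizes and random weights, not the Keras layer itself. Each token index selects one row, and the number of weights is the number of entries in the matrix.

```python
import numpy as np

# toy sizes (assumptions for this sketch)
vocab_size, embedding_size = 6, 3
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_size))

# a sequence of three integer tokens
tokens = np.array([0, 4, 1])

# embedding a token is a row lookup in the matrix
vectors = embedding_matrix[tokens]
print(vectors.shape)          # (3, 3)
print(embedding_matrix.size)  # 18, i.e. vocab_size * embedding_size
```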

In [8]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 9, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 8, 9)              174987    
_________________________________________________________________
lstm (LSTM)                  (None, 50)                12000     
_________________________________________________________________
dense (Dense)                (None, 19443)             991593    
Total params: 1,178,580
Trainable params: 1,178,580
Non-trainable params: 0
_________________________________________________________________
None
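
The parameter counts in the summary can be reproduced by hand. The arithmetic below assumes the standard LSTM layout of four weight sets (three gates plus the candidate cell state), each with input weights, recurrent weights and a bias. The variable names are chosen for illustration.

```python
vocab_size = 19443
embedding_size = 9
lstm_units = 50

# one vector of length embedding_size per vocabulary entry
embedding_params = vocab_size * embedding_size

# four weight sets, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embedding_size + lstm_units) + lstm_units)

# one weight per (LSTM unit, output word) pair plus one bias per output word
dense_params = lstm_units * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params)  # 174987 12000 991593
print(total)                                        # 1178580
```

Most of the weights sit in the final Dense layer, because it needs one output per word in the vocabulary.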


In [9]:
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)

Epoch 1/500
398/398 - 4s - loss: 7.3450 - accuracy: 0.0602
Epoch 2/500
398/398 - 4s - loss: 6.3439 - accuracy: 0.0624
Epoch 3/500
398/398 - 4s - loss: 6.1677 - accuracy: 0.0624
Epoch 4/500
398/398 - 4s - loss: 6.0594 - accuracy: 0.0624
Epoch 5/500
398/398 - 4s - loss: 5.9829 - accuracy: 0.0627
Epoch 6/500
398/398 - 4s - loss: 5.9209 - accuracy: 0.0639
Epoch 7/500
398/398 - 4s - loss: 5.8697 - accuracy: 0.0641
Epoch 8/500
398/398 - 4s - loss: 5.8221 - accuracy: 0.0682
Epoch 9/500
398/398 - 4s - loss: 5.7780 - accuracy: 0.0704
Epoch 10/500
398/398 - 4s - loss: 5.7312 - accuracy: 0.0693
Epoch 11/500
398/398 - 4s - loss: 5.6821 - accuracy: 0.0702
Epoch 12/500
398/398 - 4s - loss: 5.6312 - accuracy: 0.0735
Epoch 13/500
398/398 - 4s - loss: 5.5745 - accuracy: 0.0814
Epoch 14/500
398/398 - 4s - loss: 5.5165 - accuracy: 0.0866
Epoch 15/500
398/398 - 4s - loss: 5.4601 - accuracy: 0.0903
Epoch 16/500
398/398 - 4s - loss: 5.4044 - accuracy: 0.0949
Epoch 17/500
398/398 - 4s - loss: 5.3524 - accura

<tensorflow.python.keras.callbacks.History at 0x7f52e9445940>

In [11]:
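# evaluate model
# note: n_words is 0 here, so no words are generated and each call
# returns its seed text unchanged; pass a positive n_words to generate
print(generate_seq(model, tokenizer, max_length-1, 'ARJUN', 0))
print(generate_seq(model, tokenizer, max_length-1, 'ANKIT', 0))
print(generate_seq(model, tokenizer, max_length-1, 'LAVINA', 0))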

ARJUN
ANKIT
LAVINA
