<a href="https://colab.research.google.com/github/hardiksraja/RNN-Name-Generator/blob/master/RNN_Name_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Names Generator Using a Character-Level Language Model (Recurrent Neural Network)

In this project, we train a model on a collection of known Indian baby names. Using this model, we generate new, similar-looking random names, as well as names starting with a letter of the user's choice. We use a recurrent neural network (RNN) for this task.


# Setup

In [1]:
import numpy as np
import requests
import os
import tensorflow as tf

# Gather and clean up the data

The training data is a collection of Indian baby names downloaded from the website http://n3.datasn.io via an API that returns the name collection in JSON format. The dataset consists of 400 names, served in pages of 100.

In [2]:
results=[]
for i in range(1, 5):
  result = requests.get("http://n3.datasn.io/data/api/v1/n3_chennan/hindu_baby_names/by_table/baby_name/" +str(i) +"/?app=json").json()
  results.append(result)
print('All results stored in : ', type(results))
print('Each result is of type : ', type(results[0]))
print('Individual result sample : \n', results[0])

All results stored in :  <class 'list'>
Each result is of type :  <class 'dict'>
Individual result sample : 
 {'sample': {'is_sample': 'TRUE', 'limit_total': 400}, 'api': {'title': 'Hindu Baby Names', 'point': {'title': 'By Table', 'desc': 'Each API function returns data from a specific database table. '}}, 'input': {'url': False, 'get': False, 'post': False}, 'links': {'up': {'href': 'http://n3.datasn.io/data/api/v1/n3_chennan/hindu_baby_names/by_table/'}, 'home': {'href': 'http://n3.datasn.io/data/api/v1/n3_chennan/hindu_baby_names/by_table/baby_name/?app=json'}, 'next': {'href': 'http://n3.datasn.io/data/api/v1/n3_chennan/hindu_baby_names/by_table/baby_name/2/?app=json'}}, 'meta': {'query': {'page': 1, 'limit': 100, 'limit_total': 400, 'time': 0.00136}, 'stats': {'raw_rows': 100, 'rows': 100, 'rows_total': '4364', 'pages_total': 44}, 'struct': {'stem': ['baby_name'], 'leaf': 'baby_name.id', 'leaf_stem': [], 'leaf_stem_leaf': None}}, 'output': {'rows': {'1': {'baby_name.id': '1', 'baby

We need to extract just the names from the list.

In [3]:
full_names=[]
k=0
for j in range(len(results)):
  result=results[j]
  for i in range(1+k,result['meta']['query']['limit']+1+k):
    try:
      names = result['output']['rows'][str(i)]['baby_name.name']
      full_names.append(names)
    except KeyError:
      print('Exception: Data missing for row index : ',i, "Skipping the same !!")
  k=k+100

full_names[:15]

Exception: Data missing for row index :  205 Skipping the same !!
Exception: Data missing for row index :  301 Skipping the same !!
Exception: Data missing for row index :  347 Skipping the same !!


['Aadarsh',
 'Aadav',
 'Aadesh',
 'Aadhidev',
 'Aadhira',
 'Aadhishankar',
 'Aadit',
 'Aaditey',
 'Aagman',
 'Aagney',
 'Aahna',
 'Aahva',
 'Aahwaanith',
 'Aakaanksha',
 'Aakarshan']
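
The lookup above relies on row keys that run consecutively from 1 across pages. As a sketch of an alternative, assuming every page keeps the `output` / `rows` layout visible in the sample response, the rows can be iterated directly so that missing entries are skipped without tracking an offset. The `extract_names` helper and the miniature `pages` list are hypothetical and exist only for illustration.

```python
def extract_names(pages):
    """Collect the 'baby_name.name' field from every row of every page."""
    collected = []
    for page in pages:
        rows = page.get('output', {}).get('rows', {})
        for row in rows.values():
            name = row.get('baby_name.name')
            if name:  # skip rows where the name is missing or empty
                collected.append(name)
    return collected


# Hypothetical miniature pages that mimic the API layout shown above.
pages = [
    {'output': {'rows': {'1': {'baby_name.id': '1', 'baby_name.name': 'Aadarsh'},
                         '2': {'baby_name.id': '2'}}}},
    {'output': {'rows': {'101': {'baby_name.id': '101', 'baby_name.name': 'Aadav'}}}},
]
print(extract_names(pages))  # ['Aadarsh', 'Aadav']
```

With the real data, `extract_names(results)` would be the corresponding call, under the same layout assumption.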

This dataset is small enough to study manually.

1. In a couple of instances, a name is listed with all of its possible spellings. In such cases, we keep only the first spelling.
E.g.

> *'Chandani, Chandini'*, we keep only Chandani

> *'Cauvery, Cavery'*, we keep only Cauvery

2. For names containing special characters, we keep only the valid part of the name.

> *'Raghuveer/vir'*, we keep only Raghuveer

3. We convert all names to lower case, which helps later with vocabulary generation and with mapping characters to numbers and back.


In [4]:
names_duplicates = list(map(lambda s : s.split(',')[0], full_names))
names_duplicates = list(map(lambda s : s.split('/')[0], names_duplicates))
names_duplicates = list(map(lambda s : s.lower(), names_duplicates))
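
To make the three cleaning rules concrete, here is a minimal sketch that applies them to the example strings quoted above. The `clean_name` helper is hypothetical, and its `strip()` call is an extra safeguard against stray whitespace that the original cell does not include.

```python
def clean_name(raw):
    """Keep the first spelling, drop anything after '/', and lower-case."""
    first_spelling = raw.split(',')[0]         # rule 1: first spelling only
    valid_part = first_spelling.split('/')[0]  # rule 2: drop the part after '/'
    return valid_part.strip().lower()          # rule 3: lower case


for raw in ['Chandani, Chandini', 'Cauvery, Cavery', 'Raghuveer/vir']:
    print(repr(raw), '->', clean_name(raw))
```

Each example reduces to a single lower-case name: `chandani`, `cauvery`, and `raghuveer`.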

The length of the names corpus (may contain duplicates):

In [5]:
print('Total Names fetched from API : ', len(names_duplicates))

Total Names fetched from API :  397


Now we get rid of any repeated names in the training set.

In [6]:
names = list(set(names_duplicates))
print('Unique Names : ', len(names))

Unique Names :  388


The final list of names for training; here is a sample of 10 names:

In [7]:
names[:10]

['babul',
 'balaji',
 'jagamohan',
 'lakshmiraman',
 'madhavi',
 'naakesh',
 'bakula',
 'kalikesh',
 'indrina',
 'kalanath']

The last useful step is to add a '.' at the end of each name. This tells the RNN that the name is over.

In [8]:
names = list(map(lambda s: s + '.', names))
names[:10]

['babul.',
 'balaji.',
 'jagamohan.',
 'lakshmiraman.',
 'madhavi.',
 'naakesh.',
 'bakula.',
 'kalikesh.',
 'indrina.',
 'kalanath.']

# Transform the data

Now that our data is cleaned up, it's time to transform it into a form that the recurrent neural network can process. We will input characters into the network instead of words. Each character has to be converted to a number, and the conversion uses the following mappings:

In [9]:
# Convert from character to index
char_to_index = dict( (chr(i+96), i) for i in range(1,27))
char_to_index[' '] = 0
char_to_index['.'] = 27

# Convert from index to character
index_to_char = dict( (i, chr(i+96)) for i in range(1,27))
index_to_char[0] = ' '
index_to_char[27] = '.'
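
As a quick check of these mappings, the sketch below rebuilds them and round-trips one name through them. The `encode` and `decode` helpers are hypothetical names introduced for illustration.

```python
import string

# Same mapping as above: ' ' -> 0, 'a'..'z' -> 1..26, '.' -> 27
char_to_index = {c: i for i, c in enumerate(string.ascii_lowercase, start=1)}
char_to_index[' '] = 0
char_to_index['.'] = 27
index_to_char = {i: c for c, i in char_to_index.items()}


def encode(name):
    """Turn a name into a list of vocabulary indices."""
    return [char_to_index[c] for c in name]


def decode(indices):
    """Turn a list of vocabulary indices back into a string."""
    return ''.join(index_to_char[i] for i in indices)


print(encode('adan.'))          # [1, 4, 1, 14, 27]
print(decode(encode('adan.')))  # adan.
```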

In [10]:
# number of elements in the list of names
# this will be the number of training examples
m = len(names)
print('The number of training names : ', m)

# maximum number of characters in the training names
# this will be the number of time steps in the RNN
max_char = len(max(names, key=len))
print('Maximum No of characters in names among all the training names : ',max_char)

# number of potential characters i.e. Size of Vocabulary
# this will be the length of the input for each of the RNN units
char_dim = len(char_to_index)
print('Each character is represented with a one-hot encoding of size : ', char_dim)

The number of training names :  388
Maximum No of characters in names among all the training names :  15
Each character is represented with a one-hot encoding of size :  28


Now we convert the list of names into a training dataset that the RNN can consume. The network input *inputnames_array* is an array of shape (TrainingSampleSize, MaximumCharacters, EncodedCharacterRepresentation), here (388, 15, 28). It contains one matrix for each of the m training names (m as computed in the cell above). Each matrix has one row per character position. The number of rows is always the same, so if a name is too short to fill the whole matrix, the remaining rows contain only zeros. Each filled row represents one character encoded as a one-hot vector: a vector of zeros with a single one in the entry that corresponds to the character.

The output *outputnames_array* has the same shape as the input, but each name is shifted by one position: the ith character of a name in *outputnames_array* is the (i+1)th character of the actual name. The network therefore learns to predict the character that follows a given character in a sequence, i.e. a name.

In [11]:
inputnames_array = np.zeros((m, max_char, char_dim))
outputnames_array = np.zeros((m, max_char, char_dim))

for i in range(m):
    name = list(names[i])
    for j in range(len(name)):
        inputnames_array[i, j, char_to_index[name[j]]] = 1
        if j < len(name)-1:
            outputnames_array[i, j, char_to_index[name[j+1]]] = 1     
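
The shift between input and output is easier to see on a toy case. The sketch below repeats the same construction on two made-up names with a hypothetical five-symbol vocabulary (`toy_names` and `toy_index` are illustrative and not part of the dataset).

```python
import numpy as np

toy_names = ['ab.', 'c.']
toy_index = {' ': 0, 'a': 1, 'b': 2, 'c': 3, '.': 4}  # hypothetical vocabulary

m = len(toy_names)
max_char = max(len(name) for name in toy_names)
char_dim = len(toy_index)

X = np.zeros((m, max_char, char_dim))
Y = np.zeros((m, max_char, char_dim))
for i, name in enumerate(toy_names):
    for j, ch in enumerate(name):
        X[i, j, toy_index[ch]] = 1
        if j < len(name) - 1:
            # the target at step j is the character at step j + 1
            Y[i, j, toy_index[name[j + 1]]] = 1

print(X.shape)               # (2, 3, 5)
print(X[0].argmax(axis=1))   # [1 2 4], i.e. 'a', 'b', '.'
print(Y[0].sum(axis=1))      # [1. 1. 0.], the last target row is all zeros
print(Y[0, :2].argmax(axis=1))  # [2 4], i.e. 'b', '.'
```

The all-zero rows are the padding: the shorter name `'c.'` leaves its third input row empty, and every name leaves the target row after its final `'.'` empty.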

# RNN model

We will use a many-to-many recurrent neural network. This is a network that contains a given number of 'time' steps (equal to the maximum number of characters among all training names), which all apply the same weights to their individual inputs and are connected to one another. Each time step takes in one input (in this case one character) and outputs a probability vector over the vocabulary for the character that is the input of the next time step.

In [12]:
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import LambdaCallback

Using TensorFlow backend.


Here we consider only one recurrent layer, an LSTM with 128 units. We send the output of this layer to a fully connected dense layer that converts the LSTM output into a vector of EncodedCharacterRepresentation size using a softmax activation. We use categorical cross-entropy as the cost function because it pairs naturally with the softmax output, and we use Adam optimization. Since we are only generating new names, there is no really useful metric to judge how well the model does, so we will mostly just look at the results.

In [13]:
model = Sequential()
model.add(LSTM(128, input_shape=(max_char, char_dim), return_sequences=True))
model.add(Dense(char_dim, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

The model summary

In [14]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 15, 128)           80384     
_________________________________________________________________
dense_1 (Dense)              (None, 15, 28)            3612      
Total params: 83,996
Trainable params: 83,996
Non-trainable params: 0
_________________________________________________________________
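
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights, and a bias) and the usual dense-layer count. The helper names `lstm_params` and `dense_params` are hypothetical.

```python
def lstm_params(units, input_dim):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, outputs):
    """One weight per input-output pair plus one bias per output."""
    return inputs * outputs + outputs


lstm = lstm_params(units=128, input_dim=28)
dense = dense_params(inputs=128, outputs=28)
print(lstm, dense, lstm + dense)  # 80384 3612 83996
```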


Once this model is trained, we use it to create new baby names with the following function. The idea is to feed empty characters to the trained network and use the output of the first time step as a probability distribution for the first letter of the name. We sample the first character from this distribution, record it, and update the input so that this character is passed in at the second time step. We repeat this for the following time steps to build a name.

This is where the '.' at the end of each name becomes important: we stop the procedure once we get a '.' as an output, meaning that the generated name is complete. If we reach the length limit set by the longest name in the training set, we also append a '.' and end the procedure.

In [15]:
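def create_name(model):
    name = []
    # Start from an all-zero input; characters are filled in as they are sampled
    x = np.zeros((1, max_char, char_dim))
    end = False
    i = 0

    while not end:
        # Distribution over the next character, read from time step i
        probs = model.predict(x)[0, i].astype('float64')
        probs = probs / np.sum(probs)
        index = np.random.choice(range(char_dim), p=probs)
        if i == max_char-2:
            # Length limit reached: force the end-of-name marker
            character = '.'
            end = True
        else:
            character = index_to_char[index]
        name.append(character)
        # Feed the sampled character back in as the next time step's input
        x[0, i+1, index] = 1
        i += 1
        if character == '.':
            end = True

    print(''.join(name))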
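
The sampling loop can be tried without a trained network. The sketch below is a variant that returns the name instead of printing it and takes the prediction function as a parameter. Here `sample_name`, `predict`, and the uniform stand-in predictor are all hypothetical, so the output is random letters rather than plausible names.

```python
import numpy as np


def sample_name(predict, max_char, char_dim, index_to_char, rng):
    """Return a generated name ending in '.' (same stopping rules as above)."""
    x = np.zeros((1, max_char, char_dim))
    chars = []
    for i in range(max_char - 1):
        probs = np.asarray(predict(x)[0, i], dtype=np.float64)
        probs /= probs.sum()  # renormalise to guard against rounding error
        index = int(rng.choice(char_dim, p=probs))
        if i == max_char - 2:
            chars.append('.')  # length limit reached: force termination
            break
        chars.append(index_to_char[index])
        if chars[-1] == '.':
            break
        x[0, i + 1, index] = 1  # feed the sampled character to the next step
    return ''.join(chars)


max_char, char_dim = 15, 28
index_to_char = {i: chr(i + 96) for i in range(1, 27)}
index_to_char[0] = ' '
index_to_char[27] = '.'


def uniform(x):
    """Stand-in for a model: every character is equally likely at every step."""
    return np.full((1, max_char, char_dim), 1.0 / char_dim)


rng = np.random.default_rng(0)
print(sample_name(uniform, max_char, char_dim, index_to_char, rng))
```

With the trained network, `model.predict` would take the place of `uniform`. Returning the string also makes it possible to collect the generated names in a list for later checks.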

Now we use the function below during training to monitor how the generated names improve as the epochs pass. To this end, we create a function that is given to the model when we fit it. It runs the previous function three times every 50 epochs and prints the results.

In [16]:
def generate_name_loop(epoch, _):
    if epoch % 50 == 0:
        
        print('Following are the names generated after epoch %d:' % epoch)

        for i in range(3):
            create_name(model)
        
        print()

Below is an additional function that creates names starting with a given letter. It is called on the trained model.

In [17]:
def create_name_starting_with_character(model, character):
    character = character.lower()
    name = [character]
    x = np.zeros((1, max_char, char_dim))
    # Seed the first time step with the requested letter
    x[0,0,char_to_index[character]]=1
    end = False
    i = 1

    while not end:
        # Distribution over the next character, read from time step i
        probs = model.predict(x)[0, i].astype('float64')
        probs = probs / np.sum(probs)
        index = np.random.choice(range(char_dim), p=probs)
        if i == max_char-2:
            # Length limit reached: force the end-of-name marker
            character = '.'
            end = True
        else:
            character = index_to_char[index]
        name.append(character)
        # Feed the sampled character back in as the next time step's input
        x[0, i+1, index] = 1
        i += 1
        if character == '.':
            end = True

    print(''.join(name))

The following wraps the monitoring function in a Keras callback so that it runs during model fitting, alongside early stopping on the training loss.

In [18]:
def get_callbacks():
  return [
    tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5, mode='min'),
    LambdaCallback(on_epoch_end = generate_name_loop),
  ]
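
To illustrate what `patience=5` means for the early-stopping callback, here is a simplified pure-Python model of the rule: training stops once the monitored loss has failed to improve on its best value for five consecutive epochs. This is a sketch of the idea only, not the Keras implementation (it ignores options such as `min_delta`). The `stopping_epoch` helper and the loss values are hypothetical.

```python
def stopping_epoch(losses, patience=5):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, wait = loss, 0  # improvement: reset the counter
        else:
            wait += 1             # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Hypothetical loss curve: the best value 0.7 is reached at epoch 2,
# followed by five epochs without improvement.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74, 0.6]
print(stopping_epoch(losses))  # 7
```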

In [19]:
batch_size=64
epochs=1000

Now we fit the model with the callbacks and look at the results. The names make more and more sense as training progresses.

In [20]:
history = model.fit(inputnames_array, outputnames_array, batch_size=batch_size, epochs=epochs, callbacks=get_callbacks())

Epoch 1/1000
Following are the names generated after epoch 0:
kbnellia.
voyelptagotbe.
akqfnzcwx.

Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000
Epoch 13/1000
Epoch 14/1000
Epoch 15/1000
Epoch 16/1000
Epoch 17/1000
Epoch 18/1000
Epoch 19/1000
Epoch 20/1000
Epoch 21/1000
Epoch 22/1000
Epoch 23/1000
Epoch 24/1000
Epoch 25/1000
Epoch 26/1000
Epoch 27/1000
Epoch 28/1000
Epoch 29/1000
Epoch 30/1000
Epoch 31/1000
Epoch 32/1000
Epoch 33/1000
Epoch 34/1000
Epoch 35/1000
Epoch 36/1000
Epoch 37/1000
Epoch 38/1000
Epoch 39/1000
Epoch 40/1000
Epoch 41/1000
Epoch 42/1000
Epoch 43/1000
Epoch 44/1000
Epoch 45/1000
Epoch 46/1000
Epoch 47/1000
Epoch 48/1000
Epoch 49/1000
Epoch 50/1000
Epoch 51/1000
Following are the names generated after epoch 50:
amatina.
ashte.
onndre.

Epoch 52/1000
Epoch 53/1000
Epoch 54/1000
Epoch 55/1000
Epoch 56/1000
Epoch 57/1000
Epoch 58/1000
Epoch 59/1000
Epoch 60/1000
Epoch 61

Due to early stopping, training stopped at the *230th* epoch.



# Results

>1. We now use the final trained model to generate 30 new random names.

In [21]:
for i in range(30):
    create_name(model)

aaviri.
adan.
aahireya.
agaa.
aaksh.
adriyatiti.
ana.
laknata.
adrishorika.
adayya.
nabhay.
lakshmareer.
aarik.
aanekya.
aaniya.
aksh.
midul.
alita.
ujagi.
aadhira.
adrikaya.
elatoma.
ukari.
aitil.
alichanan.
aavula.
raanitya.
aani.
aadhri.
adri.


Of the 30 names generated, 28 are unique and 2 are repeats (aadhira, for example, already appears in the training data). We conclude that the model performs reasonably well at generating new names.
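
Counts like these can be checked programmatically rather than by eye. The sketch below assumes the generated names have been collected in a list, which `create_name` as written does not do because it prints. The `summarize` helper and both example lists are hypothetical.

```python
def summarize(generated, training):
    """Count distinct generated names and list those already in the training set."""
    distinct = set(generated)
    seen_in_training = distinct & set(training)
    return len(distinct), sorted(seen_in_training)


# Hypothetical small lists, for illustration only.
training = ['aadhira.', 'aadit.', 'balaji.']
generated = ['aaviri.', 'adan.', 'aadhira.', 'adan.']
print(summarize(generated, training))  # (3, ['aadhira.'])
```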


>2. We use the final trained model to generate 10 new names starting with the letter **'H'**.

In [22]:
for i in range(10):
  create_name_starting_with_character(model,'h') 

hnraar.
hrishikra.
hrina.
hrija.
hnrana.
hrishar.
hrina.
hresh.
hripaya.
hirattara.


We also notice during training that, as the number of epochs increases, the model improves at name generation and produces more and more realistic names.