# Implementing an LSTM Neural Network from Scratch
Dataset: US baby first names<br>
Link: https://www.kaggle.com/chr7stos/us-names-babies-and-presindent-names/data

## Import Required Libraries

In [1]:
import numpy as np  # for math
import pandas as pd  # for data manipulation
import matplotlib.pyplot as plt  # for visualization
%matplotlib inline

## Load the data

In [2]:
# full data
data = pd.read_csv("NationalNames.csv")

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1825433 entries, 0 to 1825432
Data columns (total 5 columns):
Id        int64
Name      object
Year      int64
Gender    object
Count     int64
dtypes: int64(3), object(2)
memory usage: 69.6+ MB


In [4]:
data.head()

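|   | Id | Name      | Year | Gender | Count |
|---|:--:|-----------|:----:|:------:|:-----:|
| 0 | 1  | Mary      | 1880 | F      | 7065  |
| 1 | 2  | Anna      | 1880 | F      | 2604  |
| 2 | 3  | Emma      | 1880 | F      | 2003  |
| 3 | 4  | Elizabeth | 1880 | F      | 1939  |
| 4 | 5  | Minnie    | 1880 | F      | 1746  |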


In [5]:
# get first 10000 names and convert to lower case
data = np.array(data.Name.apply(lambda x: x.lower()))[0:10000].reshape(-1,1)

In [6]:
print("Data Shape = {}".format(data.shape))
print('-'*30)
print("Lets see some names : ")
print(data[1:10])

Data Shape = (10000, 1)
------------------------------
Lets see some names : 
[['anna']
 ['emma']
 ['elizabeth']
 ['minnie']
 ['margaret']
 ['ida']
 ['alice']
 ['bertha']
 ['sarah']]


## Pad the names to equal length with dots ('.')

In [7]:
max_len = np.apply_along_axis(lambda x: len(x[0]), -1, data).max()
transform_data = np.apply_along_axis(lambda x: x[0] + '.'*(max_len-len(x[0])), -1, data).reshape(-1,1)

In [8]:
print("Transformed Data")
print(transform_data[1:10])

Transformed Data
[['anna........']
 ['emma........']
 ['elizabeth...']
 ['minnie......']
 ['margaret....']
 ['ida.........']
 ['alice.......']
 ['bertha......']
 ['sarah.......']]
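
The padding step can also be written with plain Python strings. The following is a minimal sketch on a toy list, not the notebook's own code; `toy_names`, `toy_max_len` and `toy_padded` are illustrative names that do not appear elsewhere in this notebook.

```python
# Toy list of names; the notebook itself pads the first 10000 names of the dataset.
toy_names = ["anna", "emma", "elizabeth"]

# The longest name determines the padded length.
toy_max_len = max(len(name) for name in toy_names)

# str.ljust pads on the right with the fill character up to the given width.
toy_padded = [name.ljust(toy_max_len, ".") for name in toy_names]

print(toy_padded)
```

Every padded name then has the same length, which is what lets a whole batch be stored in one rectangular array later on.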


## Build the Vocabulary

In [11]:
vocab = []
for name in transform_data[:,0]:
    vocab.extend(list(name))
vocab = sorted(list(set(vocab)))

print("Vocab size = {}".format(len(vocab)))
print("Vocab      = {}".format(vocab))

Vocab size = 27
Vocab      = ['.', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


## Map characters to ids and ids to characters

In [12]:
#map char to id and id to chars
char_id = dict()
id_char = dict()

for i,char in enumerate(vocab):
    char_id[char] = i
    id_char[i] = char

print(char_id)
print(id_char)

{'.': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}
{0: '.', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'}
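
To illustrate how the two dictionaries are used, the sketch below encodes a padded name into integer ids and decodes it back. It rebuilds the same 27-symbol vocabulary so that it runs on its own; `encode_name` and `decode_ids` are hypothetical helpers introduced only for this example.

```python
import string

# Same vocabulary as above: '.' followed by the lowercase letters (27 symbols).
toy_vocab = ["."] + list(string.ascii_lowercase)
toy_char_id = {ch: i for i, ch in enumerate(toy_vocab)}
toy_id_char = {i: ch for ch, i in toy_char_id.items()}


def encode_name(name):
    """Hypothetical helper: map a padded name to a list of integer ids."""
    return [toy_char_id[ch] for ch in name]


def decode_ids(ids):
    """Hypothetical helper: map a list of integer ids back to a string."""
    return "".join(toy_id_char[i] for i in ids)


ids = encode_name("anna........")
print(ids)
print(decode_ids(ids))
```

The round trip should return the original string, which is a quick sanity check that the two mappings are inverses of each other.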


## Build the Training Dataset

Example<br>
**Names:** - $[[\text{'mary.'}],[\text{'anna.'}]]$ <br>
**Character encoding:** <br>

| Character | Index | One-hot encoding |
|-----------|:-----:|------------------|
|    'm'    |   3   |   [0,0,0,1,0,0]  |
|    'a'    |   2   |   [0,0,1,0,0,0]  |
|    'r'    |   1   |   [0,1,0,0,0,0]  |
|    'y'    |   4   |   [0,0,0,0,1,0]  |
|    'n'    |   5   |   [0,0,0,0,0,1]  |
|    '.'    |   0   |   [1,0,0,0,0,0]  |

['mary.'] = [[0,0,0,1,0,0], [0,0,1,0,0,0], [0,1,0,0,0,0], [0,0,0,0,1,0], [1,0,0,0,0,0]] <br>
['anna.'] = [[0,0,1,0,0,0], [0,0,0,0,0,1], [0,0,0,0,0,1], [0,0,1,0,0,0], [1,0,0,0,0,0]] <br>

There are 27 characters in total, so each character is encoded as a 27-D vector whose element at index char_id is 1 and all other elements are 0. <br>
'.' = [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] <br>

batch_dataset = [['m', 'a'], ['a', 'n'], ['r', 'n'], ['y', 'a'], ['.', '.']]<br>
= [<br>
[[0,0,0,1,0,0], [0,0,1,0,0,0]],<br>
[[0,0,1,0,0,0], [0,0,0,0,0,1]],<br>
[[0,1,0,0,0,0], [0,0,0,0,0,1]],<br>
[[0,0,0,0,1,0], [0,0,1,0,0,0]],<br>
[[1,0,0,0,0,0], [1,0,0,0,0,0]],<br>
]

This is a batch of 2 names, each with 5 characters, and each character is encoded as a 6-D one-hot vector. <br>
Shape = (1, 5, 2, 6), i.e. (number of batches, name length, batch size, vocabulary size)<br>

The dataset has 10000 names of 12 characters each (after padding), and each character is a 27-D one-hot vector. <br> With a batch size of 20, the training data therefore has shape (500, 12, 20, 27).
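
The toy example above can be reproduced in a few lines. This is only a sketch of the layout for a single batch, using the 6-character toy vocabulary from the table; `toy_chars`, `toy_index`, `toy_batch` and `encoded` are illustrative names.

```python
import numpy as np

# Toy vocabulary from the table above; list position = character index.
toy_chars = [".", "r", "a", "m", "y", "n"]
toy_index = {ch: i for i, ch in enumerate(toy_chars)}

toy_batch = ["mary.", "anna."]
name_len = len(toy_batch[0])

# Axes: (character position, name in batch, one-hot dimension).
encoded = np.zeros((name_len, len(toy_batch), len(toy_chars)))
for pos in range(name_len):
    for n, name in enumerate(toy_batch):
        encoded[pos, n, toy_index[name[pos]]] = 1.0

print(encoded.shape)
print(encoded[0])  # one-hot rows for the first characters, 'm' and 'a'
```

Stacking one such array per batch adds the leading batch axis, giving the 4-D shape described above.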

In [28]:
# list of batches, each holding 20 names
train_dataset = []

batch_size = 20

# split the transformed data into full batches of 20 names
for i in range(len(transform_data) // batch_size):
    start = i * batch_size
    end = start + batch_size

    # names in this batch
    batch_data = transform_data[start:end]

    # one-hot encode the k-th character of every name in the batch
    char_list = []
    for k in range(len(batch_data[0][0])):
        batch_dataset = np.zeros([batch_size, len(vocab)])
        for j in range(batch_size):
            name = batch_data[j][0]
            char_index = char_id[name[k]]
            batch_dataset[j, char_index] = 1.0

        # store the one-hot matrix of the k-th character, shape (batch_size, len(vocab))
        char_list.append(batch_dataset)

    # store the per-character matrices of this batch
    train_dataset.append(char_list)

In [55]:
batch_size = 20

# number of batches, rounded up so that a partial last batch is kept
batch_numbers = len(transform_data)//batch_size
if batch_size * batch_numbers < len(transform_data):
    batch_numbers +=1
    
max_name_len = max_len
one_hot_dimenssion = len(vocab)
# shape: (number of batches, name length, batch size, vocabulary size)
train_data = np.zeros((batch_numbers, max_name_len, batch_size, one_hot_dimenssion))

for batch_id in range(batch_numbers):
    start_batch_id = batch_id * batch_size
    end_batch_id = start_batch_id + batch_size
    
    batch_train = transform_data[start_batch_id: end_batch_id]
    
    # the last batch may hold fewer than batch_size names;
    # its missing rows stay all-zero
    for name_id in range(len(batch_train)):
        name = batch_train[name_id][0]
        
        for char_idd in range(max_name_len):
            char = name[char_idd]
            
            # set the one-hot position for this character
            train_data[batch_id, char_idd, name_id, char_id[char]] = 1
            

In [66]:
(train_dataset == train_data).min()

True
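
The same 4-D layout can also be produced without explicit loops. The sketch below is one possible vectorized variant on toy data, not the method used in the cells above: it indexes an identity matrix with integer ids and then rearranges the axes. It assumes the number of names is divisible by the batch size; all `toy_*` names are illustrative.

```python
import string

import numpy as np

toy_vocab = ["."] + list(string.ascii_lowercase)
toy_char_id = {ch: i for i, ch in enumerate(toy_vocab)}

# Four padded names with batch size 2 give 2 batches (toy values, not the real data).
toy_names = ["mary.", "anna.", "emma.", "ida.."]
toy_batch_size = 2

# Integer ids, shape (number of names, name length).
ids = np.array([[toy_char_id[ch] for ch in name] for name in toy_names])

# Indexing an identity matrix with ids yields one-hot rows:
# shape (number of names, name length, vocabulary size).
one_hot = np.eye(len(toy_vocab))[ids]

# Group names into batches, then swap axes so the layout is
# (batch, character position, name in batch, vocabulary size).
n_batches = len(toy_names) // toy_batch_size
vectorized = one_hot.reshape(
    n_batches, toy_batch_size, -1, len(toy_vocab)
).transpose(0, 2, 1, 3)

# Reference result built with explicit loops, mirroring the cells above.
looped = np.zeros_like(vectorized)
for b in range(n_batches):
    for n in range(toy_batch_size):
        name = toy_names[b * toy_batch_size + n]
        for pos, ch in enumerate(name):
            looped[b, pos, n, toy_char_id[ch]] = 1.0

print(vectorized.shape)
print(np.array_equal(vectorized, looped))
```

`np.array_equal` compares shape and values in a single call, which makes the intent of the equality check explicit.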