<h1 style="font-family: Georgia; font-size:3em;color:#2462C0; font-style:bold">
Character-Level Language Model</h1><br>

Have you ever wondered how Gmail's automatic reply works? Or how a neural network can generate musical notes? The general way of generating a sequence of text is to train a model to predict the next word/character given all previous words/characters. Such a model is called a **Statistical Language Model**. So what is a statistical language model? It tries to capture the statistical structure (latent space) of the text it's trained on. Usually, the **Recurrent Neural Network (RNN)** family of models is used because RNNs are powerful and expressive: they remember and process past information through their high-dimensional hidden state units. The main goal of any language model is to learn the joint probability distribution of sequences of characters/words in the training text. For example, if we're trying to predict a sequence of $T$ words, we try to make the joint probability $P(w_1, w_2, ..., w_T)$ as big as we can. By the chain rule, this joint probability is equal to the product of the conditional probabilities $\prod_{t = 1}^T P(w_t|w_1, ..., w_{t-1})$ over all time steps (t).
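
To make the chain-rule factorization concrete, here is a minimal sketch that scores a four-step sequence. The conditional probabilities are made-up numbers chosen for illustration, not outputs of the model trained in this notebook.

```python
import math

# Hypothetical conditional probabilities P(w_t | w_1, ..., w_{t-1}) that a
# model might assign to each of the four steps of a sequence.
conditionals = [0.5, 0.4, 0.8, 0.9]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(conditionals)

# Same quantity in log space, which is numerically safer for long sequences.
log_joint = sum(math.log(p) for p in conditionals)

print(f"joint probability: {joint:.4f}")
print(f"log-likelihood:    {log_joint:.4f}")
```

Maximizing the joint probability is the same as maximizing the sum of the log-probabilities, which is why the loss later in the notebook is written as a sum of negative logs.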

In this notebook, we'll cover the **Character-level Language Model**; almost all of the concepts carry over to other language models such as word-level language models. The main task of a character-level language model is to predict the next character given all previous characters in a sequence, i.e. to generate text character by character. More formally, given a training sequence $(x^1, ... , x^T)$, the RNN uses the sequence of its output vectors $(o^1, ... , o^T)$ to obtain a sequence of predictive distributions $P(x^{t+1}|x^{\leq t}) = softmax(o^t)$.

Let's illustrate how the character-level language model works using my first name ("imad") as an example (see figure 1 for all the details of this example).
1. We first build a vocabulary dictionary whose keys are the unique letters of the names in the corpus, sorted in ascending order, and whose values are the letters' indices starting from zero (since Python is zero-indexed). For our example, the vocabulary dictionary would be: {"a": 0, "d": 1, "i": 2, "m": 3}. Therefore, "imad" would become a list of the following integers: [2, 3, 0, 1].
2. Convert the input and the output characters to lists of integers using the vocabulary dictionary. In this notebook, we'll assume that $x^1 = \vec{0}$ for all examples. Therefore, $y = "imad"$ and $x = \vec{0}\ + "ima"$. In other words, $x^{t + 1} = y^t$ which gives us: $y = [2, 3, 0, 1]$ and $x = [\vec{0}, 2, 3, 0]$.
3. For each character in the input:
    1. Convert the input characters into one-hot vectors. Notice how the first character $x^1 = \vec{0}$.
    2. Compute the hidden state layer.
    3. Compute the output layer and then pass it through softmax to get the results as probabilities.
    4. Feed the target character at time step (t) as the input character at time step $(t + 1)$.
    5. Go back to step 1 and repeat until we finish all the letters in the name.
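
Steps 1 and 2 above can be sketched in a few lines. This is a minimal illustration that uses only the single name "imad" as the corpus; `None` is used as a placeholder for the all-zero first input vector, mirroring the convention used in the training code later in this notebook.

```python
name = "imad"

# Vocabulary: unique characters sorted in ascending order, indexed from zero.
chars = sorted(set(name))
chars_to_idx = {ch: i for i, ch in enumerate(chars)}

# Targets are the characters themselves; inputs are the targets shifted one
# step to the right. None stands in for the all-zero first input vector.
y = [chars_to_idx[ch] for ch in name]
x = [None] + y[:-1]

print(chars_to_idx)  # {'a': 0, 'd': 1, 'i': 2, 'm': 3}
print(x)             # [None, 2, 3, 0]
print(y)             # [2, 3, 0, 1]
```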

The objective is to make the green numbers as big as we can and the red numbers as small as we can in the probability distribution layer, because the true index should have the highest probability, as close to 1 as possible. To do that, we measure the loss using cross-entropy and then compute the gradients of the loss w.r.t. all parameters, updating each parameter in the direction opposite to its gradient. Repeating this process many times, adjusting the parameters on each pass, lets the model learn to predict the next character given all previous ones, using all the names in the training text. Notice that hidden state $h^4$ carries information about all past characters.

<p align="left">
<img src="posts_images/char_level_model/char_level_example.png"; style="width: 800px; height: 600px"><br>
<caption><center><u><b><font color="purple">Figure 1:</font></b></u> Illustrative example of character-level language model using RNN</center></caption>
</p>

<h2 style="font-family: Georgia; font-size:2em;color:purple; font-style:bold">
Training</h2>

The [dataset](http://deron.meranda.us/data/census-derived-all-first.txt) we'll be using has 5,163 names: 4,275 female names, 1,219 male names, and 331 names that can be both female and male names. The RNN architecture we'll be using to train the character-level language model is called **many to many**, where the number of input time steps $(T_x)$ equals the number of output time steps $(T_y)$. In other words, the input and output sequences are synced (see figure 2).
<p align="left">
<img src="posts_images/char_level_model/rnn_architecture.PNG"; style="width: 600px; height: 600px"><br>
<caption><center><u><b><font color="purple">Figure 2:</font></b></u> RNN architecture: many to many</center></caption>
</p>
The character-level language model will be trained on names, which means that once training is done, we'll be able to generate interesting names :).

In this section, we'll go over four main parts:
1. Forward propagation.
2. Backpropagation.
3. Sampling.
4. Fitting the model.

<h3 style="font-family: Georgia; font-size:1.5em;color:purple; font-style:bold">
Forward Propagation</h3>

We'll be using Stochastic Gradient Descent (SGD) where each batch consists of only one example. In other words, the RNN model will learn from each example (name) separately, i.e. run both forward and backward passes on each example and update parameters accordingly. Below are all the steps needed for a forward pass:
- Create a vocabulary dictionary using the unique lower case letters.
    - Create a character to index dictionary that maps each character to its corresponding index in ascending order. For example, "a" would have index 1 (Python is zero-indexed, and we reserve index 0 for the EOS character "\n") and "z" would have index 26. We will use this dictionary to convert names into lists of integers, where each letter will then be represented as a one-hot vector.
    - Create an index to character dictionary that maps indices to characters. This dictionary will be used to convert the output of the RNN model into characters which will be translated into names.
- Initialize parameters: weights will be initialized to small random numbers from standard normal distribution to break symmetry and make sure different hidden units learn different things. On the other hand, biases will be initialized to zeros.
    - $W_{hh}$: weight matrix connecting previous hidden state $h^{t - 1}$ to current hidden state $h^t$.
    - $W_{xh}$: weight matrix connecting input $x^t$ to hidden state $h^t$.
    - $b$: hidden state bias vector.
    - $W_{hy}$: weight matrix connecting hidden state $h^t$ to output $o^t$.
    - $c$: output bias vector.
- Convert input $x^t$ and output $y^t$ into one-hot vectors. The dimension of each one-hot vector is vocab_size x 1: every entry is zero except the entry at the index of the letter at (t), which is 1. In our case, $x$ is the same as $y$ shifted one step to the right with $x^1 = \vec{0}$; that is, $x^{t + 1} = y^{t}$ for $t \geq 1$. For example, if we use "imad" as the input (with the four-letter vocabulary from the earlier example, now indexed from 1 so that index 0 is reserved for "\n"), then $y = [3, 4, 1, 2, 0]$ while $x = [\vec{0}, 3, 4, 1, 2]$. Notice that $x^1 = \vec{0}$ and not the index 0. Moreover, we're using "\n" as EOS (end of sentence/name) for each name, so the RNN learns "\n" like any other character and knows when to stop generating characters. Therefore, the last target character for all names will be "\n", which represents the end of the name.
- Compute the hidden state using the following formula:
$$h^t = tanh(W_{hh}h^{t-1} + W_{xh}x^t + b)\tag{1}\\{}$$
Notice that we use hyperbolic tangent $(\frac{e^x - e^{-x}}{e^x + e^{-x}})$ as the non-linear function. One of the main advantages of the hyperbolic tangent function is that it resembles the identity function near zero.
- Compute the output layer using the following formula:
$$o^t = W_{hy}h^{t} + c\tag{2}\\{}$$
- Pass the output through a softmax layer to normalize it so that it can be read as a probability distribution, i.e. all entries are between 0 and 1 and sum to 1. Below is the softmax formula:
$$\widehat{y^t}_i = \frac{e^{o^t_i}}{\sum_je^{o^t_j}}\tag{3}\\{}$$
The softmax layer has the same dimension as the output layer, which is vocab_size x 1. As a result, $\widehat{y^t}[i]$ is the probability of index $i$ being the next character at time step (t).
- As mentioned before, the objective of a character-level language model is to minimize the negative log-likelihood of the training sequence. Therefore, the loss function at time (t) and the total loss across all time steps are:
$$\mathcal{L}^t = -\sum_{i = 1}^{vocab\_size}y^t_i\log\widehat{y^t}_i\tag{4}\\{}$$
$$\mathcal{L} = \sum_{t = 1}^{T_y}\mathcal{L}^t(\widehat{y^t}, y^t)\tag{5}$$
Since we'll be using SGD, the loss will be noisy and have many oscillations, so it's good practice to smooth it out using an exponentially weighted average.
- Pass the target character $y^t$ as the next input $x^{t + 1}$ until we finish the sequence.
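
The steps above can be sketched for a single time step as follows. This is an illustrative sketch rather than the notebook's implementation: `softmax_sketch` is a stand-in for the imported `softmax` helper (whose code is not shown here), and the vocabulary size, hidden size, and the 0.01 weight scale are assumptions.

```python
import numpy as np


def softmax_sketch(o):
    """Column-wise softmax; subtracting the max avoids overflow."""
    e = np.exp(o - np.max(o, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)


vocab_size, n_h = 4, 3          # assumed toy sizes
rng = np.random.default_rng(0)

# Small random weights (0.01 scale is an assumption) and zero biases.
Wxh = rng.standard_normal((n_h, vocab_size)) * 0.01
Whh = rng.standard_normal((n_h, n_h)) * 0.01
Why = rng.standard_normal((vocab_size, n_h)) * 0.01
b = np.zeros((n_h, 1))
c = np.zeros((vocab_size, 1))

x_t = np.zeros((vocab_size, 1))   # x^1 is the zero vector
h_prev = np.zeros((n_h, 1))       # initial hidden state
target = 2                        # index of the true next character

h_t = np.tanh(Whh @ h_prev + Wxh @ x_t + b)   # equation (1)
o_t = Why @ h_t + c                           # equation (2)
y_hat = softmax_sketch(o_t)                   # equation (3)
loss_t = -np.log(y_hat[target, 0])            # equation (4) with one-hot y^t

print(y_hat.ravel(), loss_t)
```

Because the first input, the initial hidden state, and the biases are all zero, the very first prediction is uniform over the vocabulary and its loss is $\log(vocab\_size)$, regardless of the random weights.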

In [1]:
# Load packages
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

os.chdir("../scripts/")
from character_level_language_model import (initialize_parameters,
                                            initialize_rmsprop,
                                            softmax,
                                            smooth_loss,
                                            update_parameters_with_rmsprop)
os.chdir("../notebooks/")
%matplotlib inline
sns.set_context("notebook")
plt.style.use("fivethirtyeight")

In [2]:
def rnn_forward(x, y, h_prev, parameters):
    """
    Implement one Forward pass on one name.

    Arguments
    ---------
    x : list
        list of integers for the index of the characters in the example
        shifted one character to the right.
    y : list
        list of integers for the index of the characters in the example.
    h_prev : array
        last hidden state from the previous example.
    parameters : python dict
        dictionary containing the parameters.

    Returns
    -------
    loss : float
        cross-entropy loss.
    cache : tuple
        contains three python dictionaries:
            xs -- input of all time steps.
            hs -- hidden state of all time steps.
            probs -- probability distribution of each character at each time
                step.
    """
    # Retrieve parameters
    Wxh, Whh, b = parameters["Wxh"], parameters["Whh"], parameters["b"]
    Why, c = parameters["Why"], parameters["c"]

    # Initialize inputs, hidden state, logits, and probabilities dictionaries.
    # The logits dictionary is named `logits` rather than `os` so that it does
    # not shadow the imported `os` module.
    xs, hs, logits, probs = {}, {}, {}, {}

    # Infer the vocabulary size from the input-to-hidden weights instead of
    # relying on a global variable
    vocab_size = Wxh.shape[1]

    # Initialize x0 to zero vector
    xs[0] = np.zeros((vocab_size, 1))

    # Initialize loss and store h_prev as the hidden state before step 0
    loss = 0
    hs[-1] = np.copy(h_prev)

    # Forward pass: loop over all characters of the name
    for t in range(len(x)):
        # Convert to one-hot vector (xs[0] is the zero vector set above)
        if t > 0:
            xs[t] = np.zeros((vocab_size, 1))
            xs[t][x[t]] = 1
        # Hidden state
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t - 1]) + b)
        # Logits
        logits[t] = np.dot(Why, hs[t]) + c
        # Probs
        probs[t] = softmax(logits[t])
        # Loss
        loss -= np.log(probs[t][y[t], 0])

    cache = (xs, hs, probs)
    return loss, cache

<h3 style="font-family: Georgia; font-size:1.5em;color:purple; font-style:bold">
Backpropagation</h3>

With RNN-based models, the gradient-based technique used is called **Backpropagation Through Time (BPTT)**. We start at the last time step $T$, backpropagate the loss w.r.t. all parameters through every time step, and sum the per-step gradients (see figure 3).
<p align="left">
<img src="posts_images/char_level_model/backprop.PNG"; style="width: 800px; height: 400px"><br>
<caption><center><u><b><font color="purple">Figure 3:</font></b></u> Backpropagation Through Time (BPTT)</center></caption>
</p>
In addition, since recurrent networks are known to have steep cliffs (regions where $\mathcal{L}$ changes abruptly), a gradient step may overshoot the minimum and undo a lot of the work that was done, even if we are using adaptive learning methods such as RMSProp. The reason is that the gradient is a linear approximation of the loss function and may not capture information beyond the point where it was evaluated, such as the curvature of the loss surface. Therefore, it's common practice to clip the gradients to the interval [-maxValue, maxValue]. For this exercise, we'll clip the gradients to the interval [-5, 5]: any gradient entry greater than 5 is set to 5, and any entry less than -5 is set to -5. Below are all the formulas needed to compute the gradients w.r.t. all parameters at all time steps.

$$\nabla_{o^t}\mathcal{L} = \widehat{y^t} - y^t\tag{6}\\{}$$
$$\nabla_{W_{hy}}\mathcal{L} = \sum_t \nabla_{o^t}\mathcal{L}\cdot{h^t}^T\tag{7}\\{}$$
$$\nabla_{c}\mathcal{L} = \sum_t \nabla_{o^t}\mathcal{L} \tag{8}\\{}$$
$$\nabla_{h^t}\mathcal{L} = W_{hy}^T\cdot\nabla_{o^t}\mathcal{L} + \underbrace { W_{hh}^T\cdot\left(\nabla_{h^{t + 1}}\mathcal{L} * (1 - tanh(W_{hh}\cdot h^{t} + W_{xh}\cdot x^{t + 1} + b) ^ 2)\right)}_{dh_{next}} \tag{9}\\{}$$
$$\nabla_{h^{t - 1}}\mathcal{L} = W_{hh}^T\cdot\left(\nabla_{h^t}\mathcal{L} * (1 - tanh(W_{hh}\cdot h^{t-1} + W_{xh}\cdot x^t + b) ^ 2)\right)\tag{10}\\{}$$
$$\nabla_{x^t}\mathcal{L} = W_{xh}^T\cdot\left(\nabla_{h^t}\mathcal{L} * (1 - tanh(W_{hh}\cdot h^{t-1} + W_{xh}\cdot x^t + b) ^ 2)\right)\tag{11}\\{}$$
$$\nabla_{W_{hh}}\mathcal{L} = \sum_t \left(\nabla_{h^t}\mathcal{L} * (1 - tanh(W_{hh}\cdot h^{t-1} + W_{xh}\cdot x^t + b) ^ 2)\right)\cdot{h^{t - 1}}^T\tag{12}\\{}$$
$$\nabla_{W_{xh}}\mathcal{L} = \sum_t \left(\nabla_{h^t}\mathcal{L} * (1 - tanh(W_{hh}\cdot h^{t-1} + W_{xh}\cdot x^t + b) ^ 2)\right)\cdot{x^t}^T\tag{13}\\{}$$
$$\nabla_{b}\mathcal{L} = \sum_t \nabla_{h^t}\mathcal{L} * (1 - tanh(W_{hh}\cdot h^{t-1} + W_{xh}\cdot x^t + b) ^ 2) \tag{14}\\{}$$
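
Equation (6) is the starting point for all the other gradients, so it is worth checking numerically. The sketch below compares it against central finite differences on random logits; the helper names, the vocabulary size of 5, and the target index are illustrative assumptions.

```python
import numpy as np


def softmax_sketch(o):
    """Softmax of a 1-D logits vector, shifted by the max for stability."""
    e = np.exp(o - np.max(o))
    return e / np.sum(e)


def cross_entropy(o, target):
    """Loss at one time step when the true class is `target`."""
    return -np.log(softmax_sketch(o)[target])


rng = np.random.default_rng(1)
o = rng.standard_normal(5)     # logits for an assumed 5-character vocabulary
target = 3

# Analytical gradient from equation (6): y_hat - y, with y one-hot
analytic = softmax_sketch(o)
analytic[target] -= 1

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(o)
for i in range(o.size):
    step = np.zeros_like(o)
    step[i] = eps
    numeric[i] = (cross_entropy(o + step, target)
                  - cross_entropy(o - step, target)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The printed maximum difference should be tiny, which is the same `probs[t]` minus one-hot computation that the backward pass uses as `dy`.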

Note that at the last time step $T$, we initialize $dh_{next}$ to zeros since there are no values from the future to propagate back. Because SGD updates can oscillate a lot, we'll stabilize them with an adaptive learning rate optimizer, specifically [**Root Mean Squared Propagation (RMSProp)**](https://nbviewer.jupyter.org/github/ImadDabbura/Deep-Learning/blob/master/notebooks/Optimization-Algorithms.ipynb), which tends to have acceptable performance.
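
For reference, here is a minimal sketch of the standard RMSProp update rule for a single parameter array. The notebook's own `update_parameters_with_rmsprop` helper is imported from a script and is not shown, so the function name `rmsprop_step` and the hyperparameter values below are assumptions, not the author's settings.

```python
import numpy as np


def rmsprop_step(param, grad, s, learning_rate=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update for a single parameter array (illustrative)."""
    # Running average of squared gradients
    s = beta * s + (1 - beta) * grad ** 2
    # Scale each coordinate's step by the root of that average
    param = param - learning_rate * grad / (np.sqrt(s) + eps)
    return param, s


w = np.array([1.0, -2.0])
g = np.array([0.5, -4.0])      # made-up gradients of very different size
s = np.zeros_like(w)

w, s = rmsprop_step(w, g, s)
print(w, s)
```

Even though the two gradient entries differ by a factor of eight, both coordinates move by the same amount on this first step. That per-coordinate normalization is what damps the oscillations of plain SGD.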

In [3]:
def clip_gradients(gradients, max_value):
    """
    Implements gradient clipping element-wise on gradients to be between the
    interval [-max_value, max_value].

    Arguments
    ----------
    gradients : python dict
        dictionary that stores all the gradients.
    max_value : scalar
        edge of the interval [-max_value, max_value].

    Returns
    -------
    gradients : python dict
        dictionary where all gradients were clipped.
    """
    for grad in gradients.keys():
        np.clip(gradients[grad], -max_value, max_value, out=gradients[grad])

    return gradients


def rnn_backward(y, parameters, cache):
    """
    Implements Backpropagation on one name.

    Arguments
    ---------
    y : list
        list of integers for the index of the characters in the example.
    parameters : python dict
        dictionary containing the parameters.
    cache : tuple
            contains three python dictionaries:
                xs -- input of all time steps.
                hs -- hidden state of all time steps.
                probs -- probability distribution of each character at each time
                    step.

    Returns
    -------
    grads : python dict
        dictionary containing all the gradients.
    h_prev : array
        last hidden state from the current example.
    """
    # Retrieve xs, hs, and probs
    xs, hs, probs = cache

    # Initialize all gradients to zero
    dh_next = np.zeros_like(hs[0])

    parameters_names = ["Whh", "Wxh", "b", "Why", "c"]
    grads = {}
    for param_name in parameters_names:
        grads["d" + param_name] = np.zeros_like(parameters[param_name])

    # Iterate over all time steps in reverse order starting from Tx
    for t in reversed(range(len(xs))):
        dy = np.copy(probs[t])
        dy[y[t]] -= 1
        grads["dWhy"] += np.dot(dy, hs[t].T)
        grads["dc"] += dy
        dh = np.dot(parameters["Why"].T, dy) + dh_next
        dhraw = (1 - hs[t] ** 2) * dh
        grads["dWhh"] += np.dot(dhraw, hs[t - 1].T)
        grads["dWxh"] += np.dot(dhraw, xs[t].T)
        grads["db"] += dhraw
        dh_next = np.dot(parameters["Whh"].T, dhraw)
    # Clip the accumulated gradients once, using [-5, 5] as the interval
    grads = clip_gradients(grads, 5)
    # Get the last hidden state
    h_prev = hs[len(xs) - 1]

    return grads, h_prev
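
As a quick illustration of what element-wise clipping does, the sketch below applies the same `np.clip(..., out=...)` call used in `clip_gradients` to a small dictionary of made-up gradient values.

```python
import numpy as np

# Made-up gradients; some entries fall outside [-5, 5]
grads = {
    "dWhh": np.array([[12.0, -0.3], [-7.5, 4.9]]),
    "db": np.array([[0.2], [-9.0]]),
}

# Element-wise clipping, done in place via the `out` argument
for name in grads:
    np.clip(grads[name], -5, 5, out=grads[name])

print(grads["dWhh"])
print(grads["db"])
```

Entries already inside the interval are left unchanged, while 12.0, -7.5, and -9.0 are set to the nearest bound. Note that clipping entries independently can change the direction of the gradient, not just its magnitude.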

<h3 style="font-family: Georgia; font-size:1.5em;color:purple; font-style:bold">
Sampling</h3><br>
Sampling is what makes the text generated by the RNN interesting/creative. At each time step (t), the RNN outputs the conditional probability distribution of the next character given all the previous characters, i.e. $P(c_t|c_1, c_2, ..., c_{t-1})$. Let's assume that we are at time step $t = 3$ and we're trying to predict the third character, with the conditional probability distribution $P(c_3|c_1, c_2) = (0.2, 0.3, 0.4, 0.1)$. There are two extremes:
1. Maximum entropy: the character is picked randomly using a uniform probability distribution, which means that all characters in the vocabulary dictionary are equally likely. Therefore, we end up with maximum randomness in picking the next character, and the generated text will neither be meaningful nor sound real.
2. Minimum entropy: the character with the highest conditional probability is picked at each time step. That means the next character will be what the model estimates to be the right one based on the training text and learned parameters. As a result, the generated name will be both meaningful and sound real. However, it will also be repetitive and less interesting, since all the parameters were optimized to learn the joint probability distribution in predicting the next character.

As we increase randomness, the text loses local structure; as we decrease randomness, the generated text sounds more real and starts to preserve its local structure. For this exercise, we will sample from the distribution generated by the model, which can be seen as an intermediate level of randomness between maximum and minimum entropy (see figure 4). Using this sampling strategy on the above distribution, index 0 has a $20$% probability of being picked, while index 2 has a $40$% probability of being picked.
<p align="left">
<img src="posts_images/char_level_model/sampling.PNG"; style="width: 800px; height: 400px"><br>
<caption><center><u><b><font color="purple">Figure 4:</font></b></u> Sampling: An example of predicting next character using character-level language model</center></caption>
</p>
Therefore, sampling will be used at test time to generate names character by character.
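
The three strategies can be compared directly on the example distribution $(0.2, 0.3, 0.4, 0.1)$. The sketch below is illustrative only; it uses NumPy's `Generator` API with a fixed seed for reproducibility, whereas the notebook's `sample` function calls `np.random.choice`.

```python
import numpy as np

probs = np.array([0.2, 0.3, 0.4, 0.1])
rng = np.random.default_rng(0)
n_draws = 10_000

# Maximum entropy: ignore the model and pick uniformly
uniform_draws = rng.choice(len(probs), size=n_draws)

# Minimum entropy: always pick the most likely index
greedy_idx = int(np.argmax(probs))

# Intermediate: sample according to the model's distribution
model_draws = rng.choice(len(probs), size=n_draws, p=probs)

uniform_freq = np.bincount(uniform_draws, minlength=len(probs)) / n_draws
model_freq = np.bincount(model_draws, minlength=len(probs)) / n_draws

print("uniform :", uniform_freq)
print("greedy  :", greedy_idx)
print("sampled :", model_freq)
```

The uniform frequencies hover around 0.25 each, the greedy choice is always index 2, and the sampled frequencies track the model's distribution.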

In [5]:
def sample(parameters, idx_to_chars, chars_to_idx, n):
    """
    Implements sampling of a sequence of at most n characters. The sampling
    is based on the probability distribution output by the RNN.

    Arguments
    ---------
    parameters : python dict
        dictionary storing all the parameters of the model.
    idx_to_chars : python dict
        dictionary mapping indices to characters.
    chars_to_idx : python dict
        dictionary mapping characters to indices.
    n : scalar
        number of characters to output.

    Returns
    -------
    sequence : str
        sequence of characters sampled.
    """
    # Retrieve parameters, shapes, and vocab size
    Whh, Wxh, b = parameters["Whh"], parameters["Wxh"], parameters["b"]
    Why, c = parameters["Why"], parameters["c"]
    n_h, n_x = Wxh.shape
    vocab_size = c.shape[0]
    eos_idx = chars_to_idx["\n"]

    # Initialize h0 and x1 to zero vectors
    h_prev = np.zeros((n_h, 1))
    x = np.zeros((n_x, 1))

    # Sample until we have n characters or the EOS character is drawn
    indices = []
    idx = -1
    while len(indices) < n and idx != eos_idx:
        # Fwd propagation
        h = np.tanh(np.dot(Whh, h_prev) + np.dot(Wxh, x) + b)
        o = np.dot(Why, h) + c
        probs = softmax(o)

        # Sample the index of the character using generated probs distribution
        idx = np.random.choice(vocab_size, p=probs.ravel())
        indices.append(idx)

        # Update h_prev and x (one-hot vector of the sampled character)
        h_prev = np.copy(h)
        x = np.zeros((n_x, 1))
        x[idx] = 1

    # Drop the EOS character before joining the sampled characters
    sequence = "".join(idx_to_chars[idx] for idx in indices if idx != eos_idx)

    return sequence

<h3 style="font-family: Georgia; font-size:1.5em;color:purple; font-style:bold">
Fitting the model</h3><br>
After covering all the concepts/intuitions behind the character-level language model, we're now ready to fit the model. We'll use the default settings for RMSProp's hyperparameters and run the model for 100 epochs. At the start of each epoch, we'll print one sampled name and the smoothed loss, to see how the generated names get more interesting and how the loss decreases as training progresses. When done with fitting the model, we'll plot the loss function and generate some names.
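
The smoothed loss printed during training can be sketched as follows. The `smooth_loss` helper is imported from a script and not shown, so `ewma_loss` and its `beta=0.999` are assumptions; the starting value reproduces the notebook's `-np.log(1 / vocab_size) * 7` initialization, where reading the factor 7 as a typical name length is also an assumption.

```python
import numpy as np


def ewma_loss(smoothed, current, beta=0.999):
    """Exponentially weighted average of the per-example loss (sketch)."""
    return beta * smoothed + (1 - beta) * current


vocab_size = 51   # number of unique characters reported in the run below

# Loss of a uniform prediction over the vocabulary, scaled by 7 as in the
# notebook's initialization of the smoothed loss
initial = -np.log(1 / vocab_size) * 7
print(f"initial smoothed loss: {initial:.4f}")

smoothed = initial
for loss in [25.0, 19.0, 22.0, 15.0]:   # made-up per-example losses
    smoothed = ewma_loss(smoothed, loss)

print(f"after four examples:   {smoothed:.4f}")
```

With a `beta` close to 1 the smoothed value moves slowly, so a single unusually long or short name barely changes it.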

In [None]:
def model(
        file_path, chars_to_idx, idx_to_chars, hidden_layer_size, vocab_size,
        num_epochs=10, learning_rate=0.01):
    """
    Implements RNN to generate characters.

    Arguments
    ---------
    file_path : str
        path to the file of the raw data.
    num_epochs : int
        number of passes the optimization algorithm to go over the training
        data.
    learning_rate : float
        step size of learning.
    chars_to_idx : python dict
        dictionary mapping characters to indices.
    idx_to_chars : python dict
        dictionary mapping indices to characters.
    hidden_layer_size : int
        number of hidden units in the hidden layer.
    vocab_size : int
        size of vocabulary dictionary.

    Returns
    -------
    parameters : python dict
        dictionary storing all the parameters of the model.
    overall_loss : list
        list stores smoothed loss per epoch.
    """
    # Get the data: each line of the file is one name (one training example)
    with open(file_path) as f:
        data = f.readlines()
    examples = [x.lower().strip() for x in data]

    # Initialize parameters
    parameters = initialize_parameters(vocab_size, hidden_layer_size, load=False)

    # Initialize RMSProp parameters
    s = initialize_rmsprop(parameters)

    # Initialize the smoothed loss to the loss of a uniform prediction over the
    # vocabulary, scaled by 7
    smoothed_loss = -np.log(1 / vocab_size) * 7

    # Initialize hidden state h0 and overall loss
    h_prev = np.zeros((hidden_layer_size, 1))
    overall_loss = []

    # Iterate over number of epochs
    for epoch in range(num_epochs):
        print(f"\033[1m\033[94mEpoch {epoch}")
        print(f"\033[1m\033[92m=======")

        # Sample one name
        print(f"""Sampled name: {sample(parameters, idx_to_chars, chars_to_idx,
            10).capitalize()}""")
        print(f"Smoothed loss: {smoothed_loss:.4f}\n")

        # Shuffle examples
        np.random.shuffle(examples)

        # Iterate over all examples (SGD)
        for example in examples:
            # Inputs are the targets shifted right; None marks the zero vector
            x = [None] + [chars_to_idx[char] for char in example]
            y = x[1:] + [chars_to_idx["\n"]]
            # Fwd pass
            loss, cache = rnn_forward(x, y, h_prev, parameters)

            # Print the loss per example
            print("loss for {} is {}".format(example, loss))

            # Compute smooth loss
            smoothed_loss = smooth_loss(smoothed_loss, loss)
            # Bwd pass
            grads, h_prev = rnn_backward(y, parameters, cache)
            # Update parameters. learning_rate is not forwarded here, so
            # RMSProp runs with its default hyperparameters.
            parameters, s = update_parameters_with_rmsprop(
                parameters, grads, s)

        overall_loss.append(smoothed_loss)
    
    return parameters, overall_loss

import pickle


# With the parameters saved, training can later resume on a new corpus.
def save_model(parameters):
    """Serialize the trained parameters to disk with pickle."""
    print("Saving model parameters...")
    with open("sample_file.pk1", "wb") as output:
        pickle.dump(parameters, output, pickle.HIGHEST_PROTOCOL)
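
Reading the parameters back is the mirror image of saving them. The sketch below shows a round trip on a toy dictionary; `load_model` is a hypothetical helper that does not appear in the notebook, and the file name is made up for the example.

```python
import os
import pickle
import tempfile

import numpy as np


def load_model(path):
    """Hypothetical counterpart to save_model: read parameters back."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with a toy parameters dictionary in a temporary directory
toy_parameters = {"Whh": np.eye(2), "b": np.zeros((2, 1))}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_parameters.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_parameters, f, pickle.HIGHEST_PROTOCOL)
    restored = load_model(path)

same = all(np.array_equal(toy_parameters[k], restored[k])
           for k in toy_parameters)
print(same)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.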
    
    
    

In [None]:
# Load names: the raw text is used to build the vocabulary
with open("../data/names_augmented_with_nums.txt", "r") as f:
    data = f.read()

# Convert characters to lower case
data = data.lower()

# Construct vocabulary using unique characters, sort it in ascending order,
# then construct two dictionaries that maps character to index and index to
# characters.
chars = list(sorted(set(data)))
chars_to_idx = {ch:i for i, ch in enumerate(chars)}
idx_to_chars = {i:ch for ch, i in chars_to_idx.items()}

# Get the size of the data and vocab size
data_size = len(data)
vocab_size = len(chars_to_idx)
print(f"There are {data_size} characters and {vocab_size} unique characters.")

# Fitting the model on the same file
parameters, loss = model(
    "../data/names_augmented_with_nums.txt", chars_to_idx, idx_to_chars,
    hidden_layer_size=100, vocab_size=vocab_size, num_epochs=100,
    learning_rate=0.01)

save_model(parameters)

# Plotting the loss
plt.plot(range(len(loss)), loss)
plt.xlabel("Epochs")
plt.ylabel("Smoothed loss");

There are 36165 characters and 51 unique characters.
[1m[94mEpoch 0
Sampled name: !)]upy
Smoothed loss: 27.5228

loss for ashley is 27.522480049216455
loss for xiomara is 31.442464745303063
loss for marilee is 31.37267874502054
loss for mireya is 27.410993279216363
loss for bart is 19.573322271713643
loss for louise is 27.362176548765905
loss for jamaal is 27.0712874421518
loss for irmgard is 30.35621921614677
loss for tawny is 21.879420221091372
loss for tereasa is 22.684396227624198
loss for barbie is 20.922574911934937
loss for leatha is 19.406064594331635
loss for kimbra is 21.784608076790118
loss for brandon is 28.002474850043367
loss for alyce is 18.57442530808069
loss for sherlene is 26.762546889083257
loss for lelah is 15.62411334478367
loss for hassan is 20.122052133892485
loss for debbie is 21.715672706650633
loss for robbi is 18.657122283483794
loss for dalila is 18.066114300401953
loss for carman is 20.59540506714307
loss for stacie is 20.528937124245758
loss for windy is

loss for stephany is 28.617129741017724
loss for tyrell is 18.967132019988203
loss for eugenie is 22.331075182859887
loss for latrisha is 25.31859962318663
loss for verlie is 19.34644837831439
loss for kira is 13.98921962370014
loss for diedre is 18.720902206771175
loss for zora is 16.70111330545131
loss for wilbur is 24.951641023168957
loss for eveline is 21.23618893014574
loss for leonila is 19.955397189307487
loss for gennie is 17.94174938422101
loss for kendall is 22.22525848752912
loss for nakita is 19.29645560818457
loss for celinda is 21.105929972943994
loss for lettie is 17.15171641048838
loss for ivan is 14.27082504040833
loss for charline is 24.133818074092066
loss for kasie is 16.451641292844233
loss for ardella is 19.9762432330236
loss for jacquelin is 35.21185148875629
loss for adolph is 23.11767227928632
loss for alycia is 18.71844811432052
loss for freeman is 23.508245508659098
loss for kraig is 18.021378332202172
loss for chaya is 17.14835047681869
loss for lucio is 17.

loss for renae is 14.039320665744235
loss for leanne is 16.911121563068285
loss for madonna is 21.162433012738774
loss for hal is 10.679461222249016
loss for jina is 13.015281165854786
loss for saturnina is 26.388705385194868
loss for yee is 10.163576096133756
loss for ailene is 15.889510261599806
loss for matt is 14.06054725005317
loss for moses is 18.039956700194807
loss for nicolle is 21.973324092722645
loss for jenna is 14.236075616892103
loss for ismael is 19.023618498927554
loss for bobby is 23.895066911872306
loss for eilene is 16.003398880587206
loss for ana is 8.060838369137775
loss for alleen is 16.528862992207
loss for tinisha is 21.19874060996419
loss for lourdes is 24.734934774998216
loss for anne is 9.99327410557002
loss for coral is 16.59643425001083
loss for mavis is 18.542164606047333
loss for lashell is 20.860814878033317
loss for dann is 12.599273929082052
loss for sabra is 15.770810980237215
loss for albertha is 24.700311649953825
loss for cher is 14.596381044069107

loss for rosina is 16.50954726409183
loss for amada is 15.596713588928441
loss for russell is 24.84123474438306
loss for katlyn is 20.68579869040048
loss for vikki is 19.469834937476087
loss for cyril is 17.12676940126351
loss for maegan is 19.74315547556727
loss for betsey is 20.4108293725782
loss for jeniffer is 28.1378972115818
loss for maria is 13.224251900878883
loss for frederick is 29.4430000317729
loss for fannie is 18.370102864944133
loss for ashlie is 18.10724103095225
loss for cristine is 22.693212485892673
loss for jeanine is 19.277947375045397
loss for marry is 15.228716236948113
loss for miguelina is 27.18750679598636
loss for lakia is 13.930207423651352
loss for suk is 13.40955557864272
loss for marlen is 16.4617688133422
loss for latoyia is 21.31276054821725
loss for kyra is 12.851467388207158
loss for magaly is 19.226627725908532
loss for vicente is 21.39523353823758
loss for reginald is 25.82606070046724
loss for josef is 19.83304122694788
loss for deetta is 18.470796

loss for cody is 14.734666530429621
loss for marianna is 17.952831408592946
loss for jacquelyne is 33.442863709162225
loss for chauncey is 26.984139353276124
loss for dorethea is 22.232418073497325
loss for ardelia is 18.401628621974826
loss for leigha is 17.35063437307977
loss for connie is 17.30168170331872
loss for altagracia is 26.801131123656187
loss for lawerence is 27.286561215834073
loss for thaddeus is 27.03147474979879
loss for hettie is 15.71520296988781
loss for danny is 14.541369960186165
loss for antonio is 19.869194921834545
loss for shanell is 20.83761486491873
loss for sanjuana is 21.723655737060266
loss for pasty is 16.836128938373342
loss for lauralee is 22.230365280539647
loss for chelsea is 18.83150014656689
loss for blake is 18.45271143587119
loss for betsy is 17.989483903818986
loss for anneliese is 22.609911995060955
loss for lanie is 11.087603063435107
loss for aida is 11.217734244750634
loss for patti is 17.399658988289417
loss for kaci is 15.436178218770674
l

loss for vincent is 22.58523027797235
loss for lamar is 15.100281201017719
loss for colene is 15.308505663663132
loss for sandie is 15.468391683591717
loss for iola is 10.722891668058997
loss for claribel is 23.207380402235078
loss for nancey is 16.165176142979895
loss for elliott is 23.153726293420146
loss for maryln is 19.045436658137767
loss for peggie is 21.871210031804946
loss for mona is 10.344927930503951
loss for daine is 15.240727196857499
loss for royce is 16.7717915078401
loss for randell is 18.885206963477124
loss for leslie is 15.139764534484433
loss for tyree is 14.40832149449278
loss for lenora is 14.30303864725631
loss for susanna is 19.556832008943264
loss for tennille is 19.022204797930836
loss for sachiko is 23.712548869225905
loss for delcie is 15.352760669213586
loss for garrett is 21.914517156503997
loss for vernetta is 21.968048696334044
loss for brent is 16.435611374020404
loss for shelba is 17.670303190032946
loss for dulcie is 17.983057164261897

loss for sang is 14.567503958402227
loss for evelyne is 22.51690402602994
loss for viva is 15.022988456381803
loss for owen is 15.221983866608626
loss for vernia is 17.32643757509827
loss for tempie is 18.729515400551495
loss for narcisa is 17.592645235093517
loss for nelida is 15.175318195102008
loss for denna is 10.154015044500147
loss for palma is 13.111742624931873
loss for maryjane is 22.30658362842513
loss for amelia is 14.660257284408965
loss for tara is 8.074535732452823
loss for honey is 14.143285730236483
loss for alla is 9.051493773415507
loss for georgette is 28.210821865365425
loss for lavon is 17.492097641404236
loss for anastasia is 21.515905206768767
loss for jerry is 15.341396042625714
loss for zachery is 21.733384510344017
loss for suzanne is 21.737181811944463
loss for darrel is 14.533281025547742
loss for rob is 14.195582721524321
loss for nia is 8.73672589163861
loss for neida is 14.120917128099933
loss for tilda is 13.339907465012292
loss for talisha is 16.3851263

loss for mariette is 17.54820938785506
loss for ladawn is 23.00482459644209
loss for an is 5.9857441548022425
loss for renee is 13.432470929722749
loss for charlott is 20.18640495009051
loss for mee is 10.335110982449438
loss for katharyn is 20.899360470160612
loss for columbus is 31.535498382076113
loss for sherill is 17.736607957571536
loss for tracey is 20.63280854756565
loss for emmitt is 20.72618207737799
loss for kieth is 17.27471650106287
loss for lael is 12.836801297616887
loss for richie is 15.525714108594352
loss for veda is 13.800379445548064
loss for osvaldo is 23.412028715546644
loss for tristan is 18.573091717209948
loss for lillie is 14.07707827319408
loss for kaylee is 16.475566511869847
loss for milton is 16.5164178154148
loss for cecilia is 17.915027669363322
loss for bree is 14.261865119961909
loss for teresia is 16.166502907633024
loss for lala is 8.632784492116967
loss for krysta is 20.32932934431726
loss for adena is 13.552087043590593
loss for bo is 10.9790404836

loss for damion is 16.932909255517483
loss for lashaun is 19.563237299181843
loss for kamala is 16.534971036562954
loss for jesenia is 17.30152456104179
loss for donnie is 15.564223172337126
loss for carolann is 19.1713147758594
loss for vesta is 14.251498100024836
loss for yoshie is 17.283559666874538
loss for criselda is 21.863351240560004
loss for lisandra is 19.30108679229082
loss for stephaine is 27.665890750367293
loss for jutta is 15.903828406496558
loss for maribeth is 23.711805437646092
loss for karlene is 16.278420618449125
loss for dean is 12.169481858617626
loss for damon is 14.930980957514706
loss for lue is 10.31002276223293
loss for rory is 12.335169903096869
loss for letitia is 17.41098995882948
loss for dick is 17.132457012182265
loss for augustine is 26.406889771408185
loss for melba is 14.987664195520042
loss for hortensia is 23.537818612111426
loss for stella is 15.060849249488417
loss for bradford is 30.880544345998025
loss for cornell is 20.641354844509372

loss for valrie is 15.341314315817186
loss for anamaria is 18.462029678322164
loss for ara is 9.112870247258844
loss for lidia is 12.454248700005849
loss for mardell is 17.653874362965865
loss for niesha is 15.593520330980729
loss for fay is 11.710546386741372
loss for mi is 7.082114350679095
loss for gabriel is 22.390334159037646
loss for carmelita is 18.702999486958763
loss for ofelia is 17.139100028732752
loss for dionne is 16.021525776394313
loss for miquel is 24.741824877386513
loss for zack is 20.608711746763817
loss for chan is 10.334938285570248
loss for garland is 18.3658021302956
loss for nancie is 14.881519961514309
loss for tifany is 19.303528905564455
loss for carmelia is 17.34289872192171
loss for doug is 17.41817756853007
loss for ozie is 14.524734204401653
loss for sherron is 16.61827804085126
loss for lou is 13.852937969576697
loss for daria is 11.642597338455825
loss for ming is 14.31641144834927
loss for mary is 10.057907419660667
loss for nicolas is 21.2832122351439

loss for sherise is 14.04340013921142
loss for lorrine is 15.811232464603325
loss for danita is 12.082934139300768
loss for nell is 12.61051643973603
loss for kristle is 21.255836773086997
loss for janina is 13.098415396806105
loss for benedict is 24.54522441346096
loss for graciela is 20.850434080928874
loss for marcie is 14.699307122995183
loss for launa is 14.382642157362337
loss for rafael is 21.719973005349427
loss for yuki is 16.820269015129178
loss for sophia is 18.398502093479657
loss for easter is 18.591214772659605
loss for mindi is 14.527237208690915
loss for katelin is 17.363635010735894
loss for slyvia is 18.4850842939735
loss for dorsey is 16.53760884994908
loss for jerome is 16.124368155815763
loss for orval is 17.830846783069035
loss for roxane is 20.030102219608306
loss for sibyl is 18.897783295149413
loss for tobi is 14.159708253468503
loss for lindsy is 19.646431694150486
loss for homer is 15.014554438454411
loss for barb is 16.454607483277652
loss for iesha is 13.58

loss for denny is 12.518022308540287
loss for nicholas is 25.024146426863446
loss for kimberlee is 26.860840379073057
loss for aracely is 18.247790108874277
loss for dung is 16.837700448652384
loss for ike is 9.626883920140608
loss for julee is 15.121353575731465
loss for dahlia is 17.21719842531598
loss for lyndsey is 22.606072441726383
loss for kandra is 15.021898126231417
loss for melita is 12.626048643485992
loss for callie is 12.35267803966028
loss for chu is 13.862206888407506
loss for natosha is 18.45273854495786
loss for deena is 10.366950475085833
loss for mahalia is 16.924344050836364
loss for mozell is 22.924968193162734
loss for allegra is 19.913084531177542
loss for vickie is 17.481654183404054
loss for pearline is 19.689114801008643
loss for melodi is 17.291378925345914
loss for inez is 17.40306849059909
loss for nannie is 14.066345774738704
loss for jaymie is 16.046929773651495
loss for sigrid is 21.73723178847574
loss for edmond is 19.75197151653309
loss for hai is 12.3

loss for yulanda is 18.551007811353088
loss for georgene is 22.389190641597033
loss for lynne is 12.816740543937508
loss for brittany is 21.201009336802166
loss for anh is 13.952029166085588
loss for alyson is 17.793525310281574
loss for jerri is 14.746049206484125
loss for chung is 20.189531834652183
loss for rochell is 19.173549934384116
loss for man is 8.96687099812418
loss for cheryll is 18.70425516075418
loss for merissa is 15.53474946007209
loss for hayden is 15.123590057295981
loss for lilly is 14.836202246995585
loss for cordelia is 18.708068011266196
loss for kathlene is 18.38995934935129
loss for shenna is 12.078464947053
loss for kyla is 13.819835994135328
loss for annemarie is 22.363863062936854
loss for maida is 13.089599794102107
loss for tonita is 12.939521822112546
loss for mikaela is 19.848654268209398
loss for shanita is 12.836631223530752
loss for caren is 10.950668173686314
loss for latonia is 16.543199114985203
loss for kelsie is 14.94532652637775

loss for hwa is 14.269702217417985
loss for janice is 14.704481173358067
loss for ashlea is 18.107885842788328
loss for art is 13.296477752395397
loss for mohammed is 28.110092118570194
loss for kristina is 20.590653454266036
loss for sally is 13.379561249180197
loss for latrice is 18.34835279510991
loss for annabell is 22.71426310148998
loss for aileen is 16.703069720679224
loss for jerold is 18.448091089336213
loss for christian is 22.231595452127515
loss for nathanial is 22.144355397405665
loss for maggie is 17.856169115773906
loss for krystyna is 23.747283006180403
loss for raeann is 18.170846630454232
loss for neva is 13.449972892139275
loss for hilario is 20.635223465653176
loss for evangelina is 24.380024987429437
loss for pauletta is 20.74181492403278
loss for jason is 14.048568726254215
loss for tracy is 15.808329930003497
loss for jasmin is 16.433693400848618
loss for tanya is 13.350776287821907
loss for alva is 12.651220677115132
loss for hyo is 13.556993058571983

loss for nyla is 12.615125511416734
loss for shan is 8.624976509446817
loss for noe is 10.578377903933765
loss for morgan is 15.28977574703217
loss for kirby is 18.09693776585583
loss for myrle is 13.412049168013056
loss for bethel is 16.155965842585953
loss for audry is 15.877038487849733
loss for vada is 13.188503066729641
loss for shila is 10.898106411329467
loss for lida is 10.523308153068575
loss for dale is 9.685842448495361
loss for serena is 12.985818100901092
loss for jeramy is 17.456746779445186
loss for anderson is 20.841733425052094
loss for truman is 19.523699772980898
loss for rodolfo is 24.86301973510505
loss for ivana is 13.608137986858054
loss for cindie is 15.90136162404472
loss for tom is 13.9261657268638
loss for debera is 16.893213340783063
loss for shandra is 14.693663431372652
loss for genny is 13.678893423700934
loss for barbera is 16.400079866999285
loss for siobhan is 21.144394389536835
loss for karla is 11.357499761390661
loss for laree is 12.61479871346194

loss for annabel is 21.235583702440312
loss for juliane is 19.49545671998764
loss for angeles is 19.935097854119984
loss for tasia is 13.250422747847987
loss for ileana is 13.533328866575227
loss for paris is 14.342918875473371
loss for ula is 9.484696566156394
loss for angelique is 30.70728663223037
loss for shanel is 11.968037072538324
loss for caterina is 17.07837446652247
loss for vivian is 17.436742249881334
loss for lang is 13.42529985742276
loss for leida is 11.74819120172707
loss for jeana is 11.341274333932878
loss for reba is 12.796875899142702
loss for pauline is 19.377936968419625
loss for ivory is 17.40906559802223
loss for emma is 14.550290054712766
loss for effie is 19.65846243767589
loss for sanda is 10.323798981745142
loss for rowena is 17.63277209950898
loss for bulah is 18.379837696475875
loss for tania is 9.340336292010901
loss for alton is 13.742090190291362
loss for carmella is 18.536377105104634
loss for collene is 15.961660451968047
loss for roxie is 16.92138105

loss for phyliss is 24.947761759997803
loss for jong is 13.860029882045106
loss for helen is 12.299453251465401
loss for jacques is 24.50101270912571
loss for britany is 19.409572705384964
loss for yun is 11.90951427413673
loss for major is 16.340427519938792
loss for jan is 9.491114201685702
loss for leontine is 21.468539640172075
loss for courtney is 24.599462907816243
loss for cindy is 14.790410219093339
loss for stepanie is 22.71856053896412
loss for kasha is 12.178528014925103
loss for ciara is 12.35606932352591
loss for marivel is 17.545445380070287
loss for jessica is 18.390968641036647
loss for jeanene is 17.020991344822104
loss for viki is 14.406478745113262
loss for timothy is 21.69667874290339
loss for kyoko is 19.057300505354625
loss for arminda is 17.160317231753886
loss for lacie is 12.251915495245022
loss for marielle is 16.324160079725303
loss for noble is 14.89502662456695
loss for billy is 14.778352247289105
loss for tia is 8.340487000014178
loss for gidget is 20.6310

loss for alvaro is 18.7161230777011
loss for jimmy is 21.537751516081453
loss for chrystal is 22.20565967715409
loss for shandi is 12.66506245616912
loss for dwayne is 18.756627904896256
loss for collette is 17.760730825093052
loss for heath is 16.815087870468414
loss for alden is 12.94675527746378
loss for ruth is 14.19959609417347
loss for adam is 16.16140268286201
loss for bobbi is 23.232303771437806
loss for shawnna is 17.80507195996651
loss for antonina is 20.944037460064152
loss for earlean is 19.364088929202268
loss for eric is 13.825579129559717
loss for marva is 12.84015179239671
loss for vernie is 14.3729083389392
loss for teena is 11.606946619673275
loss for majorie is 17.139271388706085
loss for loida is 12.66280532439479
loss for dagny is 14.462657769496339
loss for lucila is 16.105910475008763
loss for sebastian is 25.11493972128837
loss for zoila is 16.027227842843477
loss for carlo is 12.596077223393449
loss for shalanda is 15.311657696617123
loss for yahaira is 23.1781

loss for herb is 14.502451935069388
loss for carleen is 18.177082104961105
loss for constance is 21.59187745964509
loss for gerardo is 18.748146324325305
loss for joshua is 17.921091717444106
loss for wilton is 17.692928941677398
loss for jordon is 14.252567151449107
loss for adrian is 15.193860106408005
loss for kathrine is 19.41283416841451
loss for jarvis is 15.385517565610193
loss for rebekah is 27.00090283747823
loss for loan is 12.596962203336421
loss for lewis is 16.71576053555119
loss for michaela is 20.76839473121634
loss for madalyn is 17.94893652737935
loss for monica is 13.352042826047212
loss for karmen is 13.549179584850291
loss for becky is 16.660118506395587
loss for roberta is 17.347434453562283
loss for lyda is 11.341295137184108
loss for tim is 11.740814743681836
loss for eileen is 17.539662190365707
loss for signe is 16.43918724184209
loss for aliza is 16.83398659137661
loss for lamont is 17.150788519752055
loss for mack is 15.89418775210201
loss for delilah is 21.7

loss for alexandria is 31.090484636797203
loss for keri is 11.508730939494821
loss for elizabet is 27.50864754658491
loss for valerie is 16.42447356342761
loss for mabel is 11.890129803990162
loss for noelia is 15.858116378158169
loss for danette is 14.554078082055309
loss for catherin is 19.591241942030006
loss for lashonda is 17.692913893508845
loss for dewitt is 18.069695088337742
loss for tess is 14.532755566987015
loss for glynis is 17.995945057899892
loss for lino is 11.207923829532575
loss for pia is 11.318040247554006
loss for gayle is 14.391909863102693
loss for salina is 11.492597419307808
loss for gertrud is 25.132913736727662
loss for ruby is 15.175917591146892
loss for veronica is 18.451805655637532
loss for elouise is 19.914432104520106
loss for eulah is 16.522366553187506
loss for kimi is 11.935569924748112
loss for fernanda is 18.861342245831285
loss for cinderella is 25.300592172142117
loss for darryl is 16.472062520022405
loss for annis is 12.727848846196816

loss for tami is 11.633992842206549
loss for giovanna is 22.733206416382217
loss for geraldine is 21.548209542367722
loss for simone is 16.230773923268757
loss for elois is 16.299712034746072
loss for georgine is 22.540412124827675
loss for aleta is 13.003759030150945
loss for antoine is 20.52789649241336
loss for sheridan is 15.824548306771735
loss for elicia is 13.849056789958805
loss for dedra is 13.466250943455325
loss for janna is 9.99001611276086
loss for lorita is 13.333713961264513
loss for nick is 16.365498336757227
loss for deloras is 19.498399495352118
loss for jerrell is 17.26904236768217
loss for korey is 14.765635717257414
loss for deja is 12.426460463950109
loss for larae is 13.828372117313258
loss for junie is 13.449428130717699
loss for kasey is 14.285129495911868
loss for brice is 14.175973777947076
loss for fran is 11.67835824798126
loss for shakita is 17.986016849885864
loss for era is 8.471265694181362
loss for therese is 16.973616293024406
loss for paz is 16.01447

loss for mana is 9.420818208061284
loss for chang is 14.59475210578606
loss for vanessa is 18.35039467819083
loss for mollie is 12.970807201819323
loss for jeanett is 16.639392246310443
loss for gus is 14.698043368280638
loss for tam is 12.479555995156966
loss for patrica is 19.66671161090799
loss for misti is 10.173783221814531
loss for dotty is 14.682383894799258
loss for lottie is 14.569939993865155
loss for kanisha is 15.277998186466627
loss for wendolyn is 22.632770349271887
loss for timmy is 18.049578601629023
loss for ladonna is 18.00219528936142
loss for bradly is 17.55854124218699
loss for jamey is 13.985083203029598
loss for carmel is 14.296060291242279
loss for dario is 13.25324603470852
loss for zulma is 15.970463477325453
loss for laurene is 18.00791941139305
loss for rosanne is 15.228505260592827
loss for shaunte is 17.182948385218623
loss for tamela is 12.6636096541431
loss for keiko is 17.939303839212368
loss for hayley is 16.015668007992602
loss for shani is 11.5425527

loss for lowell is 17.420879540692162
loss for donovan is 20.819064058375
loss for lauretta is 17.95169014087953
loss for genna is 11.091793190697842
loss for prudence is 25.546380296443985
loss for antione is 19.794471253962264
loss for zetta is 14.660715533571855
loss for heike is 14.894588703205129
loss for vanna is 10.085009875235443
loss for violette is 20.781434759495717
loss for tashina is 14.352176662442368
loss for mao is 12.204033282774898
loss for cecila is 16.34298088674771
loss for tawana is 15.325020722505332
loss for shantay is 15.571809248144346
loss for irena is 12.504458145792867
loss for sherita is 12.359331203305677
loss for noma is 11.504994364519579
loss for dominga is 18.284752577769193
loss for vicenta is 18.809052174322957
loss for gaston is 15.1921346282755
loss for wilda is 14.734884621347716
loss for dacia is 12.158941609000044
loss for johanna is 15.989769469538741
loss for sharan is 11.209024266363445
loss for natalie is 17.290226934961172

As you may notice, the generated names started to get more interesting after 15 epochs. One of the interesting names is "Yasira", which is an Arabic name :).

<h2 style="font-family: Georgia; font-size:2em;color:purple; font-style:bold">
Conclusion</h2><br>
Statistical language models are crucial to Natural Language Processing (NLP) tasks such as speech recognition and machine translation. In this notebook, we demonstrated the main concepts behind statistical language models using a character-level language model. The task of this model is to generate names character by character, using a census dataset of 5,163 names. Below are the key takeaways:
- If we have more data, a bigger model, and train longer, we may get more interesting results. However, to get very interesting results, we should instead use a **Long Short-Term Memory (LSTM)** model with more than one layer. People have used 3-layer LSTM models with dropout and generated very interesting results when training on cookbooks and Shakespeare poems. LSTM models outperform simple RNNs because they are better at capturing longer-term dependencies.
- With the sampling technique we're using, don't expect the RNN to generate meaningful sequences of characters (names).
- In this notebook, we treated each name as its own sequence; however, we may be able to speed up learning and get better results if we increase the batch size, say, from one name to a sequence of 50 characters.
- We can control the level of randomness through the sampling strategy. Here, we struck a balance between what the model thinks is the right character and the level of randomness.
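
The last takeaway, controlling randomness at sampling time, can be illustrated with a minimal sketch. It assumes a temperature-scaled softmax, which is one common way to trade off the model's preferred character against randomness; the `temperature` parameter, the `sample_index` helper, and the example scores are illustrative assumptions rather than the exact sampling code used in this notebook.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = logits - np.max(logits)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


def sample_index(logits, temperature=1.0, rng=None):
    """Sample one character index from temperature-scaled scores.

    temperature < 1 sharpens the distribution (less random);
    temperature > 1 flattens it (more random).
    `temperature` is an illustrative knob (an assumption), not
    something defined elsewhere in this notebook.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = np.random.default_rng() if rng is None else rng
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    return int(rng.choice(len(probs), p=probs)), probs


# Hypothetical output scores for the vocabulary {"a", "d", "i", "m"}.
logits = np.array([2.0, 1.0, 0.5, 0.1])
rng = np.random.default_rng(0)
for t in (0.5, 1.0, 2.0):
    idx, probs = sample_index(logits, temperature=t, rng=rng)
    print(f"temperature={t}: probs={np.round(probs, 3)}, sampled index={idx}")
```

With a low temperature, most of the probability mass sits on the highest-scoring character, so samples stay close to what the model considers most likely. With a high temperature, the distribution flattens and the generated names become more varied, but also less plausible.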