# Dialogue Generation

This project is the one I completed for Udacity's Deep Learning Nanodegree. In this notebook I present a more condensed version of the original.

## __NOTE__

I am attaching a list of the libraries and their respective versions, since there may be compatibility problems.

In [55]:
!conda list

# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main    defaults
altair                    1.2.1                      py_0    conda-forge
asn1crypto                0.22.0                   py36_0    conda-forge
atari-py                  0.1.7                    pypi_0    pypi
atomicwrites              1.3.0                    pypi_0    pypi
attrs                     19.1.0                     py_0    conda-forge
audioread                 2.1.6                    py36_0    conda-forge
av                        0.3.3                    py36_2    conda-forge
awscli                    1.16.17                  py36_0    conda-forge
backcall                  0.1.0                      py_0    conda-forge
backports                 1.0                      py36_1    conda-forge
backports.functools_lru_cache 1.4                      py36_1    conda-forge
backports.we

## Getting the Data

The data is a txt file containing dialogue from the series Seinfeld. First of all, we load the data.

In [37]:
import os
import pickle
import torch
import numpy as np

In [25]:
data_dir = './data/Seinfeld_Scripts.txt'
input_file = os.path.join(data_dir)
with open(input_file, "r") as f:
    data = f.read()

In [26]:
data[:300]

'jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, n'

## Creating the Functions to Build the Dictionaries and the Punctuation Tokens

### Building the Dictionaries

In [16]:
def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of tv scripts split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    vocab_to_int = {}

    # Assign ids in order of first appearance
    for word in text:
        if word not in vocab_to_int:
            vocab_to_int[word] = len(vocab_to_int)

    # Invert the mapping so ids can be turned back into words
    int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}

    return (vocab_to_int, int_to_vocab)
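
As a quick sanity check, here is a minimal sketch of what the two lookup tables look like on a toy sentence. `build_lookup` is a hypothetical, compact stand-in for `create_lookup_tables` (it is defined here only so the snippet runs on its own), and the `toy_` names are illustrative.

```python
def build_lookup(words):
    """Hypothetical compact stand-in for create_lookup_tables."""
    vocab_to_int = {}
    for word in words:
        # ids are assigned in order of first appearance
        vocab_to_int.setdefault(word, len(vocab_to_int))
    int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab


toy_words = "jerry: hello elaine: hello jerry:".split()
toy_vocab_to_int, toy_int_to_vocab = build_lookup(toy_words)

# Encode the words as ids, then decode them back
toy_ids = [toy_vocab_to_int[w] for w in toy_words]
toy_decoded = [toy_int_to_vocab[i] for i in toy_ids]

print(toy_ids)                    # repeated words share the same id
print(toy_decoded == toy_words)   # the round trip recovers the original words
```

Because the two dictionaries are exact inverses of each other, encoding and then decoding returns the original word list.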

### Replacing Punctuation

In [17]:
from string import punctuation

In [18]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenized dictionary where the key is the punctuation and the value is the token
    """
    symbols = ['.', ',', '"', ';', '!', '?', '(', ')', '-', '\n']
    token = ["||period||" , "||comma||" , "||quotationmark||" , "||semicolon||" , "||exclamation_mark||" , "||question_mark||" ,
            "||left_parentheses||" , "||right_parentheses||" , "||dash||" , "||return||"]
    return dict(zip(symbols, token))

## Now We Preprocess the Whole Text

In [29]:
SPECIAL_WORDS = {'PADDING': '<PAD>'}
text = data

# Ignore notice, since we don't use it for analysing the data
text = text[81:]

token_dict = token_lookup()
for key, token in token_dict.items():
    text = text.replace(key, ' {} '.format(token))

text = text.lower()
text = text.split()

vocab_to_int, int_to_vocab = create_lookup_tables(text + list(SPECIAL_WORDS.values()))
int_text = [vocab_to_int[word] for word in text]
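
To see why each punctuation token is padded with spaces before splitting, here is a small sketch on a single line of text. `toy_tokens` is a hypothetical two-entry subset of the dictionary returned by `token_lookup()`, and the sample sentence is made up.

```python
# Hypothetical two-entry subset of the dictionary returned by token_lookup()
toy_tokens = {'.': '||period||', '?': '||question_mark||'}

toy_line = "Do you know what this is? This is out."
for symbol, token in toy_tokens.items():
    # Pad with spaces so that split() isolates each token as its own word
    toy_line = toy_line.replace(symbol, ' {} '.format(token))

toy_split = toy_line.lower().split()
print(toy_split)
```

Without the padding, `is?` and `is` would become two different vocabulary entries; with it, the word and the punctuation mark are learned separately.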

## Building the Network

### Checking GPU Access

In [31]:
# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

In [38]:
words = np.array([1, 2, 3, 4, 5, 6, 7, 8 , 9 , 10 , 11 , 12 , 13])
sequence_length = 4

### Generating the Batches: Example

In [40]:
features = []
targets = []
print(f"Sample Text : {words}")
for i in range(len(words) - sequence_length):
    features.append(words[i : i+sequence_length])
    targets.append(words[i+sequence_length])
    print(f"Sequence {i+1} : {words[i : i+sequence_length]} Target : {words[i+sequence_length]}")

Sample Text : [ 1  2  3  4  5  6  7  8  9 10 11 12 13]
Sequence 1 : [1 2 3 4] Target : 5
Sequence 2 : [2 3 4 5] Target : 6
Sequence 3 : [3 4 5 6] Target : 7
Sequence 4 : [4 5 6 7] Target : 8
Sequence 5 : [5 6 7 8] Target : 9
Sequence 6 : [6 7 8 9] Target : 10
Sequence 7 : [ 7  8  9 10] Target : 11
Sequence 8 : [ 8  9 10 11] Target : 12
Sequence 9 : [ 9 10 11 12] Target : 13
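
The same windows can also be built without a Python loop. The following is only a sketch of an alternative, using NumPy index broadcasting on the same sample text; the `demo_` names are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np

demo_words = np.arange(1, 14)   # same sample text as above: 1..13
demo_seq_len = 4

n_sequences = len(demo_words) - demo_seq_len
# Row i holds the indices i, i+1, ..., i+demo_seq_len-1
window_idx = np.arange(n_sequences)[:, None] + np.arange(demo_seq_len)[None, :]

demo_features = demo_words[window_idx]     # one window per row
demo_targets = demo_words[demo_seq_len:]   # the word right after each window

print(demo_features.shape, demo_targets.shape)
```

A text of `N` words with windows of length `L` yields `N - L` feature/target pairs, which is why the example above produces 9 sequences from 13 words.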


In [41]:
from torch.utils.data import TensorDataset, DataLoader


def batch_data(words, sequence_length, batch_size):
    """
    Batch the neural network data using DataLoader
    :param words: The word ids of the TV scripts
    :param sequence_length: The sequence length of each batch
    :param batch_size: The size of each batch; the number of sequences in a batch
    :return: DataLoader with batched data
    """
    maxindex = len(words) - sequence_length
    features = []
    targets = []
    for i in range(len(words) - sequence_length):
        features.append(words[i : i+sequence_length])
        targets.append(words[i+sequence_length])
    dataset = TensorDataset(torch.from_numpy(np.array(features)), torch.from_numpy(np.array((targets))))
    data_loader = DataLoader(dataset, shuffle=True, batch_size=batch_size)
    return data_loader

### Testing the Generator

In [43]:
test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)  # built-in next() also works on newer PyTorch, where iterators have no .next() method

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[ 42,  43,  44,  45,  46],
        [ 38,  39,  40,  41,  42],
        [ 39,  40,  41,  42,  43],
        [ 21,  22,  23,  24,  25],
        [ 36,  37,  38,  39,  40],
        [ 23,  24,  25,  26,  27],
        [ 10,  11,  12,  13,  14],
        [ 35,  36,  37,  38,  39],
        [ 13,  14,  15,  16,  17],
        [  6,   7,   8,   9,  10]])

torch.Size([10])
tensor([ 47,  43,  44,  26,  41,  28,  15,  40,  18,  11])


# Building the Network

Note that some __print__ statements are commented out; they are left in for anyone who wants to see the outputs line by line.

In [46]:
import torch.nn as nn

class RNN(nn.Module):
    
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):
        """
        Initialize the PyTorch RNN Module
        :param vocab_size: The number of input dimensions of the neural network (the size of the vocabulary)
        :param output_size: The number of output dimensions of the neural network
        :param embedding_dim: The size of embeddings, should you choose to use them        
        :param hidden_dim: The size of the hidden layer outputs
        :param n_layers: The number of stacked LSTM layers
        :param dropout: dropout to add in between LSTM/GRU layers
        """
        super(RNN, self).__init__()
        
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        
        self.embedding = nn.Embedding(vocab_size, self.embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=dropout, batch_first=True)
        
        self.dropout = nn.Dropout(0.3)
        
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()    
    
    def forward(self, nn_input, hidden):
        """
        Forward propagation of the neural network
        :param nn_input: The input to the neural network
        :param hidden: The hidden state        
        :return: Two Tensors, the output of the neural network and the latest hidden state
        """ 
        batch_size = nn_input.size(0)
        embeds = self.embedding(nn_input)
        lstm_out, hidden = self.lstm(embeds, hidden)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        out = self.fc(lstm_out)
        out = out.view(batch_size, -1, self.output_size)
        out = out[:, -1] 
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        '''
        Initialize the hidden state of an LSTM/GRU
        :param batch_size: The batch_size of the hidden state
        :return: hidden state of dims (n_layers, batch_size, hidden_dim)
        '''
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden

### Defining Forward Propagation, Backpropagation, and Training

In [49]:
def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden):
    """
    Forward and backward propagation on the neural network
    :param rnn: The PyTorch Module that holds the neural network
    :param optimizer: The PyTorch optimizer for the neural network
    :param criterion: The PyTorch loss function
    :param inp: A batch of input to the neural network
    :param target: The target output for the batch of input
    :param hidden: The hidden state carried over from the previous batch
    :return: The loss and the latest hidden state Tensor
    """
    
    if(train_on_gpu):
        inp, target = inp.cuda(), target.cuda()            
    hidden = tuple([each.data for each in hidden])

    rnn.zero_grad()   

    output, hidden = rnn(inp, hidden)
    loss = criterion(output, target.long())
    loss.backward()
    optimizer.step()
    return loss.item(), hidden

## Training the Network

In [50]:
def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        hidden = rnn.init_hidden(batch_size)
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # record loss
            batch_losses.append(loss)

            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []
    return rnn

### Hyperparameters

In [22]:
# Data params
# Sequence Length
sequence_length = 10  # of words in a sequence
# Batch Size
batch_size = 64

train_loader = batch_data(int_text, sequence_length, batch_size)

In [23]:
# Training parameters
# Number of Epochs
num_epochs = 25
# Learning Rate
learning_rate = 0.001 

# Model parameters
# Vocab size
vocab_size = len(vocab_to_int)
# Output size
output_size = vocab_size
# Embedding Dimension
embedding_dim = 300
# Hidden Dimension
hidden_dim = 256
# Number of RNN Layers
n_layers = 2

# Show stats for every n number of batches
show_every_n_batches = 1000
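
For orientation, the number of optimisation steps per epoch follows directly from these settings. The sketch below reproduces the arithmetic used by `batch_data` and `train_rnn`; `batches_per_epoch` is a hypothetical helper and `n_words=100_000` is a made-up corpus size, not the real length of `int_text`.

```python
def batches_per_epoch(n_words, sequence_length, batch_size):
    """Full batches per epoch; the final partial batch is skipped, as in train_rnn."""
    n_sequences = n_words - sequence_length   # one window per possible start index
    return n_sequences // batch_size


# n_words is a made-up corpus size, used only for illustration
demo_batches = batches_per_epoch(n_words=100_000, sequence_length=10, batch_size=64)

# Loss printouts per epoch with show_every_n_batches = 1000
demo_log_lines = demo_batches // 1000

print(demo_batches, demo_log_lines)
```

The partial batch is dropped because the hidden state is created with shape `(n_layers, batch_size, hidden_dim)`, so a smaller final batch would not match it.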

### Train

In [24]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
from workspace_utils import active_session

with active_session():
    # create model and move to gpu if available
    rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)
    if train_on_gpu:
        rnn.cuda()

    # defining loss and optimization functions for training
    optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()

    # training the model
    trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

    # saving the trained model
    helper.save_model('./save/trained_rnn', trained_rnn)
    print('Model Trained and Saved')

Training for 25 epoch(s)...
Epoch:    1/25    Loss: 5.290066518783569

Epoch:    1/25    Loss: 4.728915277719498

Epoch:    1/25    Loss: 4.551168636798859

Epoch:    1/25    Loss: 4.459713696479797

Epoch:    1/25    Loss: 4.347080304145813

Epoch:    1/25    Loss: 4.311201413631439

Epoch:    1/25    Loss: 4.279906487464904

Epoch:    1/25    Loss: 4.228494143486023

Epoch:    1/25    Loss: 4.222282981157303

Epoch:    1/25    Loss: 4.191772379875183

Epoch:    1/25    Loss: 4.144320255041122

Epoch:    1/25    Loss: 4.169014077425003

Epoch:    1/25    Loss: 4.15263531255722

Epoch:    2/25    Loss: 4.021310191476634

Epoch:    2/25    Loss: 3.9547448177337645

Epoch:    2/25    Loss: 3.955145547866821

Epoch:    2/25    Loss: 3.935763961315155

Epoch:    2/25    Loss: 3.9508822515010835

Epoch:    2/25    Loss: 3.9760567026138305

Epoch:    2/25    Loss: 3.9530919563770293

Epoch:    2/25    Loss: 3.9880911026000976

Epoch:    2/25    Loss: 3.968250494480133

Epoch:    2/25    Loss

  "type " + obj.__name__ + ". It won't be checked "


Model Trained and Saved


In [52]:
import torch
import helper
import problem_unittests as tests

_, vocab_to_int, int_to_vocab, token_dict = pickle.load(open('preprocess2.p', mode='rb'))
trained_rnn = helper.load_model('./save/trained_rnn')

## Generating Dialogue: Short Explanatory Version

In [63]:
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
import torch.nn.functional as F

def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
    """
    Generate text using the neural network
    :param rnn: The PyTorch Module that holds the trained neural network
    :param prime_id: The word id to start the first prediction
    :param int_to_vocab: Dict of word id keys to word values
    :param token_dict: Dict of punctuation keys to punctuation token values
    :param pad_value: The value used to pad a sequence
    :param predict_len: The length of text to generate
    :return: The generated text
    """
    rnn.eval()
    
    # create a sequence (batch_size=1) with the prime_id
#     print(f"Initial Word Token for jerry: {prime_id}")
#     print(f"Padding Token : {pad_value}")
#     print(f"Padding Word : {int_to_vocab[pad_value]}")
    current_seq = np.full((1, sequence_length), pad_value)
#     print(f"This is the Initial Empty Sequence {current_seq}")
    current_seq[-1][-1] = prime_id
#     print(f"We set the initial token at the end : {current_seq}")
    predicted = [int_to_vocab[prime_id]]
#     print(f"Initial Predicted Word : {predicted}")
    
    for _ in range(predict_len):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        
        # initialize the hidden state
        hidden = rnn.init_hidden(current_seq.size(0))

        # get the output of the rnn
        output, _ = rnn(current_seq, hidden)
        #print(f"The output is length of vocabulary : {output.shape}")
        # get the next word probabilities
        p = F.softmax(output, dim=1).data
        #print(f"We apply softmax to convert to probability : {p.shape}")
        if(train_on_gpu):
            p = p.cpu() # move to cpu
         
        # use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        #print(f"After obtaining the probabilities we choose the top 5. We only care about the index : {top_i}")
        top_i = top_i.numpy().squeeze()
        
        # select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        #print(f"We add Randomness on the top 5 words : {word_i} ")
        
        # retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)
        #print(f"We then search the selected word and transform it into text. Finally we append to predited : {predicted}")
        
        # the generated word becomes the next "current sequence" and the cycle can continue
        current_seq = np.roll(current_seq, -1, 1)
        current_seq[-1][-1] = word_i
        #print(f"New Current Sequence : {current_seq}\n")
       #print("===========END OF PREDICTION, STARTING NEW PREDICTION===================\n")
    
    gen_sentences = ' '.join(predicted)
    
    # Replace punctuation tokens
    for key, token in token_dict.items():
        ending = ' ' if key in ['\n', '(', '"'] else ''
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    
    # return all the sentences
    return gen_sentences
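
The sampling step inside `generate` can be looked at in isolation. This is a minimal NumPy sketch, assuming made-up scores over a six-word vocabulary; the `logits` values, `top_k = 3`, and the seed are illustrative and are not taken from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable

# Hypothetical scores over a 6-word vocabulary
logits = np.array([2.0, 0.5, 1.0, 3.0, -1.0, 0.0])

# Softmax: turn the scores into probabilities
probs = np.exp(logits - logits.max())
probs /= probs.sum()

top_k = 3
top_idx = np.argsort(probs)[-top_k:]             # indices of the k most likely words
top_p = probs[top_idx] / probs[top_idx].sum()    # renormalise over the k candidates

# Draw the next word id among the k candidates, weighted by probability
next_word_id = int(rng.choice(top_idx, p=top_p))
print(sorted(top_idx.tolist()), next_word_id)
```

Restricting the draw to the top candidates keeps very unlikely words out of the script, while the random choice among them is what makes each run of the generation cell produce different text.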

In [56]:
# run the cell multiple times to get different results!
gen_length = 3 # modify the length to your preference
prime_word = 'jerry' # name for starting the script

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
pad_word = helper.SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)

Initial Word Token for jerry: 101
Padding Token : 21387
Padding Word : <PAD>
This is the Initial Empty Sequence [[21387 21387 21387 21387]]
We set the initial token at the end : [[21387 21387 21387   101]]
Initial Predicted Word : ['jerry:']
The output is length of vocabulary : torch.Size([1, 21388])
We apply softmax to convert to probability : torch.Size([1, 21388])
After obtaining the probabilities we choose the top 5. We only care about the index : tensor([[  3611,   2570,   5354,   7734,  13346]])
We add Randomness on the top 5 words : 5354 
We then search the selected word and transform it into text. Finally we append to predited : ['jerry:', 'community']
New Current Sequence : [[21387 21387   101  5354]]


The output is length of vocabulary : torch.Size([1, 21388])
We apply softmax to convert to probability : torch.Size([1, 21388])
After obtaining the probabilities we choose the top 5. We only care about the index : tensor([[     3,  13147,    568,   3994,  10005]])
We add Random



### Generating Dialogue: Long Version

In [64]:
# run the cell multiple times to get different results!
gen_length = 400 # modify the length to your preference
prime_word = 'jerry' # name for starting the script

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
pad_word = helper.SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)



jerry: saving bills.

kramer: hey, what do you mean, uh, i have to get rid of the way of the world series of spring.

jerry: you know i don't want to get a massage?

elaine: oh, yeah!(to jerry and elaine and i agreed to go to the hospital, you know...

jerry: oh, i forgot to get rid of her. i mean, i'm gonna go get a chance, and i was wondering.

elaine: oh, yeah, yeah.

kramer: oh, you gotta be able to be a clown...

kramer: oh, you got a little more flexible.

george: you know...

jerry: yeah, well, it's all emotional, and rich, and rich.

jerry:(looking at george) oh, i don't know, i'm afraid i can get you another one..

jerry: i mean, i can't believe this.

elaine: oh, yeah.(takes the pillow back to the freeze.

george: yeah, i got a little distracted, jerry, he doesn't deserve it.

george:(to kramer) hey, jerry.

kramer: oh, yeah, sure.

kramer: oh, i was wondering. you have a big salad.

jerry: yeah!!

kramer: well, i don't know what i think.

kramer: hey, i saw you whenever it g