# Tutorial - Generative Recurrent Neural Networks

Last time we discussed using recurrent neural networks to make predictions about sequences. In particular, we treated tweets as a **sequence** of words. Since tweets can have a variable number of words, we needed an architecture that can take variable-sized sequences as input.

This time, we will use recurrent neural networks to **generate** sequences.
Generating sequences is more involved than making predictions about
sequences. However, it is a very interesting task, and many students choose
sequence-generation tasks for their projects.

Much of today's content is an adaptation of the "Practical PyTorch" GitHub 
repository [1].

[1] https://github.com/spro/practical-pytorch/blob/master/char-rnn-generation/char-rnn-generation.ipynb

## Review

In recurrent neural networks, the input sequence is broken down into tokens. We could choose whether to tokenize based on words, or based on characters. The representation of each token (GloVe or one-hot) is processed by the RNN one step at a time to update the hidden (or context) state.

In a predictive RNN, the value of the hidden state is a representation of **all the text that was processed thus far**. Similarly, in a generative RNN, the value of the hidden state will be a representation of **all the text that still needs to be generated**. We will use this hidden state to produce the sequence, one token at a time.

As in the last tutorial, we will break the problem of generating text
down into generating one token at a time.

We will do so with the help of two functions:

1. We need to be able to generate the *next* token, given the current 
   hidden state. In practice, we get a probability distribution over 
   the next token, and sample from that probability distribution.
2. We need to be able to update the hidden state somehow. To do so,
   we need two pieces of information: the old hidden state, and the actual
   token that was generated in the previous step. The actual token generated
   will inform the subsequent tokens.

We will repeat both functions until a special "END OF SEQUENCE" token is
generated.
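
The two steps above can be sketched as a loop. This is only a toy illustration in plain Python: `next_token_distribution` and `update_hidden` are hypothetical stand-ins for what the RNN will do later, and the "hidden state" here is just a counter of generated tokens.

```python
import random

VOCAB = ["a", "b", "<EOS>"]

def next_token_distribution(hidden):
    """Hypothetical stand-in: probabilities over VOCAB given the hidden state."""
    if hidden >= 5:                 # after five tokens, always stop
        return [0.0, 0.0, 1.0]
    return [0.45, 0.45, 0.10]

def update_hidden(hidden, token):
    """Hypothetical stand-in: fold the generated token into the hidden state."""
    return hidden + 1               # this toy state only counts tokens

def generate(seed=0, max_len=20):
    rng = random.Random(seed)
    hidden, generated = 0, []
    for _ in range(max_len):
        probs = next_token_distribution(hidden)            # step 1: distribution
        token = rng.choices(VOCAB, weights=probs, k=1)[0]  # step 1: sample
        if token == "<EOS>":
            break
        generated.append(token)
        hidden = update_hidden(hidden, token)              # step 2: update state
    return "".join(generated)

sample = generate()
print(sample)
```

In the real model, both stand-ins are replaced by a single call to the RNN followed by a linear layer.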

Note that there are several tricky things that we will have to figure out.
For example, how do we sample a token from the probability
distribution over tokens? What do we do during training, and how does
that differ from testing/evaluation? We will answer those
questions as we implement the RNN.

For now, let's start with our training data.

## Data: Donald Trump's Tweets from 2018

The training set we use is a collection of Donald Trump's tweets from 2018.
We will only use tweets that are 140 characters or shorter, and tweets
that contain more than just a URL.
Since tweets often contain creative spelling and numbers, and upper vs. lower
case characters are read very differently, we will use a character-level RNN.

To start, let us load the trump.csv file to Google Colab and provide access to the drive. The file can be obtained from Quercus.

In [1]:
%pip install torch==1.8.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
%pip install torchtext==0.9 # Necessary to ensure we are using torchtext version 0.9, which has access to legacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.8.0+cu111
  Downloading https://download.pytorch.org/whl/cu111/torch-1.8.0%2Bcu111-cp37-cp37m-linux_x86_64.whl (1982.2 MB)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [3]:
import csv

# file location (make sure to use your file location)
file_dir = '/content/drive/My Drive/Colab Notebooks/Lab 6 Tutorial/'

# open the file in a `with` block so that it is closed after reading
with open(file_dir + 'trump.csv', newline='') as f:
    tweets = [line[0] for line in csv.reader(f)]
len(tweets)

22402

There are over 20000 tweets in this collection.
Let's look at a few of them, just to get a sense of the kind of text
we're dealing with:

In [4]:
print(tweets[100])
print(tweets[1000])
print(tweets[10000])

God Bless the people of Venezuela!
It was my honor. THANK YOU! https://t.co/1LvqbRQ1bi
Nobody but Donald Trump will save Israel. You are wasting your time with these politicians and political clowns. Best! #SheldonAdelson


## Generating One Tweet

Normally, when we build a new machine learning model, we want to make sure
that our model can overfit. To that end, we will first build a neural network
that can generate _one_ tweet really well. We can choose any tweet (or any other text) we want. Let's choose to build an RNN that generates `tweets[100]`.

In [5]:
tweet = tweets[100]
print(tweet)
print(len(tweet))

God Bless the people of Venezuela!
34


First, we will need to encode this tweet using a one-hot encoding.
We'll build dictionary mappings
from the character to the index of that character (a unique integer identifier),
and from the index to the character. We'll use the same naming scheme that `torchtext`
uses (`stoi` and `itos`).

For simplicity, we'll work with a limited vocabulary containing
just the characters in `tweets[100]`, plus two special tokens:

- `<EOS>` represents "End of String", which we'll append to the end of our tweet.
  Since tweets are variable-length, this is a way for the RNN to signal
  that the entire sequence has been generated.
- `<BOS>` represents "Beginning of String", which we'll prepend to the beginning of 
  our tweet. This is the first token that we will feed into the RNN.

The way we use these special tokens will become more clear as we build the model.

In [6]:
vocab = list(set(tweet)) + ["<BOS>", "<EOS>"]
vocab_stoi = {s: i for i, s in enumerate(vocab)}
vocab_itos = {i: s for i, s in enumerate(vocab)}
vocab_size = len(vocab)

In [7]:
print(vocab)
print(vocab_stoi)
print(vocab_itos)
print(vocab_size)

['s', ' ', 'd', 'f', 'n', 'e', 't', 'G', 'p', 'z', 'B', 'h', 'a', 'u', '!', 'V', 'l', 'o', '<BOS>', '<EOS>']
{'s': 0, ' ': 1, 'd': 2, 'f': 3, 'n': 4, 'e': 5, 't': 6, 'G': 7, 'p': 8, 'z': 9, 'B': 10, 'h': 11, 'a': 12, 'u': 13, '!': 14, 'V': 15, 'l': 16, 'o': 17, '<BOS>': 18, '<EOS>': 19}
{0: 's', 1: ' ', 2: 'd', 3: 'f', 4: 'n', 5: 'e', 6: 't', 7: 'G', 8: 'p', 9: 'z', 10: 'B', 11: 'h', 12: 'a', 13: 'u', 14: '!', 15: 'V', 16: 'l', 17: 'o', 18: '<BOS>', 19: '<EOS>'}
20
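
To make the role of the two mappings concrete, here is a small round-trip sketch in plain Python. The `encode` and `decode` helpers are hypothetical names introduced only for this illustration, and the vocabulary is sorted so that the index assignment is reproducible (the unsorted `set` ordering used above can differ from run to run).

```python
tweet = "God Bless the people of Venezuela!"

# Sorted for reproducibility; this is an assumption of the sketch only
vocab = sorted(set(tweet)) + ["<BOS>", "<EOS>"]
stoi = {s: i for i, s in enumerate(vocab)}
itos = {i: s for i, s in enumerate(vocab)}

def encode(text):
    """Characters -> indices, with <BOS> prepended and <EOS> appended."""
    return [stoi["<BOS>"]] + [stoi[ch] for ch in text] + [stoi["<EOS>"]]

def decode(indices):
    """Indices -> characters, dropping the two special tokens."""
    return "".join(itos[i] for i in indices if itos[i] not in ("<BOS>", "<EOS>"))

indices = encode(tweet)
print(len(vocab), len(indices))
print(decode(indices))
```

The encoded sequence is two tokens longer than the tweet because of the special tokens, and decoding recovers the original text.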


Now that we have our vocabulary, we can build the PyTorch model
for this problem.
The actual model is not as complex as you might think. We have
already learned about all the components that we need. (Using and training
the model is the hard part.)

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [9]:
class TextGenerator(nn.Module):
    def __init__(self, vocab_size, hidden_size, n_layers=1):
        super(TextGenerator, self).__init__()

        # identity matrix for generating one-hot vectors
        self.ident = torch.eye(vocab_size)

        # recurrent neural network
        self.rnn = nn.GRU(vocab_size, hidden_size, n_layers, batch_first=True)

        # a fully-connected layer that outputs a distribution over
        # the next token, given the RNN output
        self.decoder = nn.Linear(hidden_size, vocab_size)
    
    def forward(self, inp, hidden=None):
        inp = self.ident[inp]                  # generate one-hot vectors of input
        output, hidden = self.rnn(inp, hidden) # get the next output and hidden state
        output = self.decoder(output)          # predict distribution over next tokens
        return output, hidden

model = TextGenerator(vocab_size, 64)
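
The `self.ident[inp]` lookup in `forward` works because row `i` of an identity matrix is exactly the one-hot vector for index `i`. The sketch below shows the same idea with plain Python lists; the vocabulary size and token indices are made up for illustration.

```python
vocab_size = 4

# Plain-Python equivalent of an identity matrix of size vocab_size
ident = [[1.0 if row == col else 0.0 for col in range(vocab_size)]
         for row in range(vocab_size)]

token_indices = [2, 0, 3]

# Indexing the identity matrix by token index yields one one-hot row per token
one_hot = [ident[i] for i in token_indices]

for idx, vec in zip(token_indices, one_hot):
    print(idx, vec)
```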

## Training with Teacher Forcing

At a very high level, we want our RNN model to have a high probability
of generating the tweet. An RNN model generates text
one character at a time based on the hidden state value.
At each time step, we will check whether the model generated the
correct character. That is, at each time step,
we are trying to select the correct next character out of all the 
characters in our vocabulary. Recall that this problem is a multi-class
classification problem, and we can use Cross-Entropy loss to train our
network to become better at this type of problem.

In [10]:
criterion = nn.CrossEntropyLoss()

However, we don't just have a single multi-class classification problem.
Instead, we have **one classification problem per time-step** (per token)!
So, how do we predict the first token in the sequence? 
How do we predict the second token in the sequence? 

To help you understand what happens during RNN training, we'll start with
inefficient training code that shows you what happens step-by-step. We'll
start with computing the loss for the first token generated, then the second token,
and so on.
Later on, we'll switch to a simpler and more performant version of the code.

So, let's start with the first classification problem: the problem of generating
the **first** token (`tweet[0]`).

To generate the first token, we'll feed the RNN (with an initial, empty
hidden state) the "<BOS>" token. The output is then the model's (unnormalized)
prediction over the first token.

In [11]:
bos_input = torch.Tensor([vocab_stoi["<BOS>"]])
print(bos_input.shape, type(bos_input))
bos_input = bos_input.long()
print(bos_input.shape, type(bos_input))
bos_input = bos_input.unsqueeze(0)
print(bos_input.shape, type(bos_input))
output, hidden = model(bos_input, hidden=None)
output # distribution over the first token

torch.Size([1]) <class 'torch.Tensor'>
torch.Size([1]) <class 'torch.Tensor'>
torch.Size([1, 1]) <class 'torch.Tensor'>


tensor([[[ 0.0864,  0.0330,  0.0653, -0.0746, -0.0119, -0.0869, -0.0172,
          -0.0979,  0.0294,  0.0267, -0.0064, -0.0517, -0.0216, -0.0640,
          -0.0459,  0.0560, -0.0724, -0.0922, -0.1239,  0.0612]]],
       grad_fn=<AddBackward0>)

In [12]:
bos_input

tensor([[18]])

We can compute the loss using `criterion`. Since the model is untrained,
the loss is expected to be high. (For now, we won't do anything
with this loss, and omit the backward pass.)

In [13]:
target = torch.Tensor([vocab_stoi[tweet[0]]]).long().unsqueeze(0)
criterion(output.reshape(-1, vocab_size), # reshape to 2D tensor
          target.reshape(-1))             # reshape to 1D tensor

tensor(3.0751, grad_fn=<NllLossBackward>)
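
To see where this number comes from, we can redo the computation by hand. The sketch below assumes the logits printed earlier (rounded to four decimals) and target index 7, and computes the cross-entropy as the negative log of the softmax probability of the target. It also computes the loss of a uniform guess over 20 tokens for comparison.

```python
import math

# Logits copied from the printed model output above (rounded to 4 decimals)
logits = [0.0864, 0.0330, 0.0653, -0.0746, -0.0119, -0.0869, -0.0172,
          -0.0979, 0.0294, 0.0267, -0.0064, -0.0517, -0.0216, -0.0640,
          -0.0459, 0.0560, -0.0724, -0.0922, -0.1239, 0.0612]
target_index = 7   # index of the first character in the vocabulary printed earlier

# Cross-entropy = -log(softmax(logits)[target])
#               = log(sum(exp(logits))) - logits[target]
log_sum_exp = math.log(sum(math.exp(z) for z in logits))
loss = log_sum_exp - logits[target_index]

# Loss of a model that assigns equal probability to all 20 tokens
uniform_loss = math.log(len(logits))

print(round(loss, 4), round(uniform_loss, 4))
```

The hand-computed value should agree with the printed loss up to rounding, and it sits close to the uniform baseline, which is what we expect from an untrained model.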

In [14]:
print(target)
print(output)
print(output.reshape(-1, vocab_size))
print(target.reshape(-1))

tensor([[7]])
tensor([[[ 0.0864,  0.0330,  0.0653, -0.0746, -0.0119, -0.0869, -0.0172,
          -0.0979,  0.0294,  0.0267, -0.0064, -0.0517, -0.0216, -0.0640,
          -0.0459,  0.0560, -0.0724, -0.0922, -0.1239,  0.0612]]],
       grad_fn=<AddBackward0>)
tensor([[ 0.0864,  0.0330,  0.0653, -0.0746, -0.0119, -0.0869, -0.0172, -0.0979,
          0.0294,  0.0267, -0.0064, -0.0517, -0.0216, -0.0640, -0.0459,  0.0560,
         -0.0724, -0.0922, -0.1239,  0.0612]], grad_fn=<ViewBackward>)
tensor([7])


Now, we need to update the hidden state and generate a prediction
for the next token. To do so, we need to provide the current token to
the RNN. We already said that during test time, we'll need to sample
from the predicted probability distribution over tokens that the neural network
just generated. 

Right now, we can do something better: we can **use the ground-truth,
actual target token**. This technique is called **teacher-forcing**, 
and generally speeds up training. The reason is that right now, 
since our model does not perform well, the predicted probability
distribution is pretty far from the ground truth. So, it is very,
very difficult for the neural network to get back on track given bad
input data.
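
The difference between the two regimes can be illustrated with a toy sketch in plain Python. Here `untrained_predict` is a hypothetical stand-in for a poorly trained model that always guesses the same character, and `rollout_inputs` (also a made-up helper) records which tokens the model would be fed at each step.

```python
def rollout_inputs(target_tokens, predict, teacher_forcing):
    """Return the tokens that are fed to the model at each time step."""
    inputs = []
    prev = "<BOS>"
    for true_token in target_tokens:
        inputs.append(prev)
        predicted = predict(prev)   # the model's guess for this step
        # Teacher forcing feeds the ground truth; otherwise we feed the guess
        prev = true_token if teacher_forcing else predicted
    return inputs

def untrained_predict(prev_token):
    """Hypothetical untrained model: always guesses 'z'."""
    return "z"

target = list("God")
forced = rollout_inputs(target, untrained_predict, teacher_forcing=True)
free = rollout_inputs(target, untrained_predict, teacher_forcing=False)
print(forced)
print(free)
```

With teacher forcing, the inputs follow the real tweet regardless of how bad the guesses are. Without it, a single wrong guess contaminates every later input.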

In [15]:
# Use teacher-forcing: we pass in the ground truth `target`,
# rather than using the NN predicted distribution
output, hidden = model(target, hidden)
output # distribution over the second token

tensor([[[ 0.0499,  0.0223,  0.0319, -0.0223,  0.0070, -0.0608, -0.0521,
          -0.1293,  0.0113, -0.0102, -0.0344, -0.0561, -0.0203, -0.1048,
          -0.0514,  0.0311, -0.0252, -0.0617, -0.1148,  0.0835]]],
       grad_fn=<AddBackward0>)

Similar to the first step, we can compute the loss, quantifying the
difference between the predicted distribution and the actual next
token. This loss can be used to adjust the weights of the neural
network (which we are not doing yet).

In [16]:
target = torch.Tensor([vocab_stoi[tweet[1]]]).long().unsqueeze(0)
criterion(output.reshape(-1, vocab_size), # reshape to 2D tensor
          target.reshape(-1))             # reshape to 1D tensor

tensor(3.0336, grad_fn=<NllLossBackward>)

We can continue this process of:

- feeding the previous ground-truth token to the RNN,
- obtaining the prediction distribution over the next token, and
- computing the loss,

for as many steps as there are tokens in the ground-truth tweet.

In [17]:
for i in range(2, len(tweet)):
    output, hidden = model(target, hidden)
    target = torch.Tensor([vocab_stoi[tweet[i]]]).long().unsqueeze(0)
    loss = criterion(output.reshape(-1, vocab_size), # reshape to 2D tensor
                     target.reshape(-1))             # reshape to 1D tensor
    print(i, output, loss)

2 tensor([[[ 0.0409, -0.0133,  0.0268, -0.0090,  0.0048, -0.0664, -0.0378,
          -0.1011,  0.0004, -0.0028, -0.0399, -0.0260, -0.0419, -0.0620,
          -0.0636,  0.0104, -0.0434, -0.0901, -0.1486,  0.0677]]],
       grad_fn=<AddBackward0>) tensor(2.9404, grad_fn=<NllLossBackward>)
3 tensor([[[ 0.0549, -0.0159,  0.0251, -0.0630,  0.0098, -0.0398, -0.0125,
          -0.1126, -0.0071, -0.0531, -0.0219, -0.0377, -0.0284, -0.0566,
          -0.0407, -0.0169, -0.0441, -0.0767, -0.1402,  0.0749]]],
       grad_fn=<AddBackward0>) tensor(2.9827, grad_fn=<NllLossBackward>)
4 tensor([[[ 0.0416, -0.0229,  0.0243, -0.0450,  0.0093, -0.0124, -0.0390,
          -0.1375, -0.0193, -0.0237, -0.0031, -0.0392, -0.0345, -0.0648,
          -0.0088,  0.0077, -0.0722, -0.0652, -0.1343,  0.0538]]],
       grad_fn=<AddBackward0>) tensor(2.9708, grad_fn=<NllLossBackward>)
5 tensor([[[ 0.0171,  0.0002,  0.0278, -0.0523,  0.0537, -0.0610, -0.0495,
          -0.1542,  0.0454,  0.0036, -0.0040, -0.0369, -0.073

Finally, with our final token, we should expect to output the "<EOS>"
token, so that our RNN learns when to stop generating characters.

In [18]:
output, hidden = model(target, hidden)
target = torch.Tensor([vocab_stoi["<EOS>"]]).long().unsqueeze(0)
loss = criterion(output.reshape(-1, vocab_size), # reshape to 2D tensor
                 target.reshape(-1))             # reshape to 1D tensor
print(i, output, loss)

33 tensor([[[ 0.0837, -0.0484,  0.0037, -0.0262, -0.0054, -0.0636, -0.0039,
          -0.0972, -0.0037, -0.0724, -0.0550, -0.0971, -0.0152, -0.0912,
          -0.0069,  0.0197, -0.0410, -0.0639, -0.1346,  0.0635]]],
       grad_fn=<AddBackward0>) tensor(2.9009, grad_fn=<NllLossBackward>)


In practice, we don't really need a loop. Recall that in a predictive RNN,
the `nn.RNN` module can take an entire sequence as input. We can do the
same thing here:

In [19]:
tweet_ch = ["<BOS>"] + list(tweet) + ["<EOS>"]
tweet_indices = [vocab_stoi[ch] for ch in tweet_ch]
tweet_tensor = torch.Tensor(tweet_indices).long().unsqueeze(0)

print(tweet_tensor.shape)

output, hidden = model(tweet_tensor[:,:-1]) # <EOS> is never an input token
target = tweet_tensor[:,1:]                 # <BOS> is never a target token
loss = criterion(output.reshape(-1, vocab_size), # reshape to 2D tensor
                 target.reshape(-1))             # reshape to 1D tensor

torch.Size([1, 36])


Here, the input to our neural network model is the *entire*
sequence of input tokens (everything from "<BOS>" to the
last character of the tweet). The neural network generates a prediction distribution
of the next token at each step. We can compare each of these predictions with the ground-truth
`target`.
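
The one-step shift between inputs and targets is easy to get wrong, so here is the same alignment spelled out on characters instead of indices. This is a plain-Python sketch for illustration only.

```python
tweet = "God Bless the people of Venezuela!"
tokens = ["<BOS>"] + list(tweet) + ["<EOS>"]

inputs = tokens[:-1]    # what the RNN reads at each step
targets = tokens[1:]    # what it should predict at that same step

# Show the first few (input -> target) pairs
for inp, tgt in list(zip(inputs, targets))[:4]:
    print(repr(inp), "->", repr(tgt))
```

Each input token is paired with the token that follows it, so `<BOS>` is only ever an input and `<EOS>` is only ever a target.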


Our training loop (for learning to generate the single `tweet`) will therefore
look something like this:

In [20]:
print(tweet_tensor[:,:-1])
print(target)

tensor([[18,  7, 17,  2,  1, 10, 16,  5,  0,  0,  1,  6, 11,  5,  1,  8,  5, 17,
          8, 16,  5,  1, 17,  3,  1, 15,  5,  4,  5,  9, 13,  5, 16, 12, 14]])
tensor([[ 7, 17,  2,  1, 10, 16,  5,  0,  0,  1,  6, 11,  5,  1,  8,  5, 17,  8,
         16,  5,  1, 17,  3,  1, 15,  5,  4,  5,  9, 13,  5, 16, 12, 14, 19]])


In [21]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
for it in range(500):
    optimizer.zero_grad()
    output, _ = model(tweet_tensor[:,:-1])
    loss = criterion(output.reshape(-1, vocab_size),
                     target.reshape(-1))
    loss.backward()
    optimizer.step()

    if (it+1) % 100 == 0:
        print("[Iter %d] Loss %f" % (it+1, float(loss)))

[Iter 100] Loss 1.767341
[Iter 200] Loss 0.266412
[Iter 300] Loss 0.040046
[Iter 400] Loss 0.016533
[Iter 500] Loss 0.009435


The training loss is decreasing with training, which is what we expect.
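
One way to read these numbers: since the cross-entropy loss is an average of negative log-probabilities, `exp(-loss)` is the geometric-mean probability that the model assigns to the correct next token. The sketch below applies this to a few of the logged losses above.

```python
import math

# Losses copied from the training log above
logged_losses = {100: 1.767341, 300: 0.040046, 500: 0.009435}

for iteration, loss in logged_losses.items():
    # exp(-loss) = geometric-mean probability given to the correct token
    print(iteration, round(math.exp(-loss), 4))

final_prob = math.exp(-logged_losses[500])
```

By the last iteration the model assigns almost all of the probability mass to the correct character at every step, which is the overfitting we were aiming for.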

## Generating a Token

At this point, we want to see whether our model is actually learning
something. So, we need to talk about how to
actually use the RNN model to generate text. If we can 
generate text, we can make a qualitative assessment of how well
our RNN is performing.

The main difference between training and test-time (generation time)
is that we don't have the ground-truth tokens to feed as inputs
to the RNN. Instead, we need to actually **sample** a token based
on the neural network's prediction distribution.

But how can we sample a token from a distribution?

On one extreme, we can always take
the token with the largest probability (argmax). This has been our
go-to technique in other classification tasks. However, this idea
will fail here. The reason is that in practice, 
**we want to be able to generate a variety of different sequences from
the same model**. An RNN that can only generate a single new Trump Tweet
is fairly useless.

In short, we want some randomness. We can do so by using the logit
outputs from our model to construct a multinomial distribution over
the tokens and then sample a random token from that multinomial distribution.
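
The contrast between argmax and sampling can be sketched in plain Python. The three-token vocabulary and the logits below are made up for illustration, and `random.choices` stands in for the multinomial sampling step.

```python
import math
import random

vocab = ["a", "b", "c"]
logits = [2.0, 1.0, 0.1]   # made-up scores for illustration

# Softmax turns the logits into probabilities
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]

# Argmax always returns the same token
greedy = vocab[probs.index(max(probs))]

# Sampling returns different tokens, in proportion to their probability
rng = random.Random(0)
samples = [rng.choices(vocab, weights=probs, k=1)[0] for _ in range(1000)]
counts = {tok: samples.count(tok) for tok in vocab}

print(greedy, counts)
```

Greedy decoding would produce the same sequence every time, whereas sampling still favours likely tokens but occasionally picks the less likely ones.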

One natural multinomial distribution we can choose is the 
distribution we get after applying the softmax on the outputs.
However, we will do one more thing: we will add a **temperature**
parameter to manipulate the softmax outputs. We can set a
**higher temperature** to make the probability of each token
**more even** (more random), or a **lower temperature** to assign
more probability to the tokens with a higher logit (output).
A **higher temperature** means that we will get a more diverse sample,
with potentially more mistakes. A **lower temperature** means that we
may see repetitions of the same high probability sequence.
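
A short numerical sketch shows the effect of the temperature. The helper `softmax_with_temperature` is a hypothetical name, and the logits are made up. Dividing the logits by the temperature before the softmax is the same operation that the sampling code in this tutorial performs with `.div(temperature).exp()`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # made-up scores for illustration

low = softmax_with_temperature(logits, 0.5)
mid = softmax_with_temperature(logits, 1.0)
high = softmax_with_temperature(logits, 5.0)

for t, dist in ((0.5, low), (1.0, mid), (5.0, high)):
    print(t, [round(p, 3) for p in dist])
```

As the temperature grows, the distribution flattens towards uniform. As it shrinks, the distribution concentrates on the token with the largest logit.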

In [22]:
def sample_sequence(model, max_len=100, temperature=0.8):
    generated_sequence = ""
   
    inp = torch.Tensor([vocab_stoi["<BOS>"]]).long()
    hidden = None
    for p in range(max_len):
        output, hidden = model(inp.unsqueeze(0), hidden)
        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).div(temperature).exp()
        top_i = int(torch.multinomial(output_dist, 1)[0])
        # Add predicted character to string and use as next input
        predicted_char = vocab_itos[top_i]
        
        if predicted_char == "<EOS>":
            break
        generated_sequence += predicted_char       
        inp = torch.Tensor([top_i]).long()
    return generated_sequence

print(sample_sequence(model, temperature=0.8))
print(sample_sequence(model, temperature=1.0))
print(sample_sequence(model, temperature=1.5))
print(sample_sequence(model, temperature=2.0))
print(sample_sequence(model, temperature=5.0))

God Bless the people of Venezuela!
God Bless the people of Venezuela!
God Blesstthh peoplefof VenezuelalaG
GolsBless the people of VenezfeuVa
h


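Since we only trained the model on a single sequence, the temperature
parameter has a limited effect so far: at low temperatures the model simply
reproduces the training tweet, and at higher temperatures it makes more and
more mistakes.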

For now, the output of the calls to the `sample_sequence` function
assures us that our training code looks reasonable, and we can
proceed to training on our full dataset!

## Training the Trump Tweet Generator

For the actual training, let's use `torchtext` so that we can use
the `BucketIterator` to make batches. Like in Lab 5, we'll create a 
`torchtext.legacy.data.Field` to use `torchtext` to read the CSV file, and convert
characters into indices. The object has convenient parameters to specify
the BOS and EOS tokens.

In [23]:
import torchtext

text_field = torchtext.legacy.data.Field(sequential=True, # text sequence
                                  tokenize=lambda x: x, # because we are building a character-RNN
                                  include_lengths=True, # to track the length of sequences, for batching
                                  batch_first=True,
                                  use_vocab=True,       # to turn each character into an integer index
                                  init_token="<BOS>",   # BOS token
                                  eos_token="<EOS>")    # EOS token

fields = [('text', text_field), ('created_at', None), ('id_str', None)]
trump_tweets = torchtext.legacy.data.TabularDataset(file_dir + "trump.csv", "csv", fields)
len(trump_tweets) # should be >20,000 like before

22402

In [24]:
text_field.build_vocab(trump_tweets)
vocab_stoi = text_field.vocab.stoi # so we don't have to rewrite sample_sequence
vocab_itos = text_field.vocab.itos # so we don't have to rewrite sample_sequence
vocab_size = len(text_field.vocab.itos)
vocab_size

253

Let's verify that the `BucketIterator` works as expected, starting with a batch size of 10.

In [25]:
data_iter = torchtext.legacy.data.BucketIterator(trump_tweets, 
                                          batch_size=10,
                                          sort_key=lambda x: len(x.text),
                                          sort_within_batch=True)
for (tweet, lengths), label in data_iter:
    print(label)   # should be None
    print(lengths) # contains the length of the tweet(s) in batch
    print(tweet.shape) # should be [10, max(length)]
    break

None
tensor([113, 113, 113, 112, 112, 112, 112, 111, 111, 111])
torch.Size([10, 113])
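
Notice that the lengths within the batch are nearly identical. Grouping sequences of similar length keeps the amount of padding small, since every sequence in a batch must be padded to the length of the longest one. The plain-Python sketch below illustrates the idea with made-up texts; `pad_batch` and `count_padding` are hypothetical helpers, not `torchtext`'s implementation.

```python
def pad_batch(seqs, pad="<pad>"):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad] * (max_len - len(s)) for s in seqs]

def count_padding(seqs, batch_size):
    """Total number of pad tokens needed when batching seqs in the given order."""
    total = 0
    for start in range(0, len(seqs), batch_size):
        batch = pad_batch(seqs[start:start + batch_size])
        total += sum(row.count("<pad>") for row in batch)
    return total

texts = ["hi", "a much longer tweet", "ok",
         "another long message", "yes", "medium one"]

unsorted_pad = count_padding(texts, batch_size=2)
sorted_pad = count_padding(sorted(texts, key=len), batch_size=2)
print(unsorted_pad, sorted_pad)
```

Batching the texts in length order needs far fewer pad tokens than batching them in their original order.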


To account for batching, our actual training code will change, but just a little bit.
In fact, our training code from before will work with a batch size larger than one!

In [26]:
def train(model, data, batch_size=1, num_epochs=1, lr=0.001, print_every=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    it = 0
    
    data_iter = torchtext.legacy.data.BucketIterator(data,
                                              batch_size=batch_size,
                                              sort_key=lambda x: len(x.text),
                                              sort_within_batch=True)
    for e in range(num_epochs):
        # get training set
        avg_loss = 0
        for (tweet, lengths), label in data_iter:
            target = tweet[:, 1:] # Exclude BOS
            inp = tweet[:, :-1] # Exclude EOS
            # cleanup
            optimizer.zero_grad()
            # forward pass
            output, _ = model(inp)
            loss = criterion(output.reshape(-1, vocab_size), target.reshape(-1))
            # backward pass
            loss.backward()
            optimizer.step()

            avg_loss += loss.item() # accumulate a plain float so the autograd graph is not kept alive
            it += 1 # increment iteration count
            if it % print_every == 0:
                print("[Iter %d] Loss %f" % (it+1, float(avg_loss/print_every)))
                print("    " + sample_sequence(model, 140, 0.8))
                avg_loss = 0

model = TextGenerator(vocab_size, 64)

In [27]:
train(model, trump_tweets, batch_size=1, num_epochs=1, lr=0.004, print_every=100)
print(sample_sequence(model, temperature=0.8))
print(sample_sequence(model, temperature=0.8))
print(sample_sequence(model, temperature=1.0))
print(sample_sequence(model, temperature=1.0))
print(sample_sequence(model, temperature=1.5))
print(sample_sequence(model, temperature=1.5))
print(sample_sequence(model, temperature=2.0))
print(sample_sequence(model, temperature=2.0))
print(sample_sequence(model, temperature=5.0))
print(sample_sequence(model, temperature=5.0))

[Iter 101] Loss 3.717335
    Pl tahelheaoa-Gdo  llte  ap 
[Iter 201] Loss 3.228811
    The veds van home aw t yd nanenea (omnin hg0 nh/ Bt amou ouwt! an @tan  en Iont onth w /ad tirolir” b.p se the anouOt Hes ont 'one @ke athe 
[Iter 301] Loss 2.986001
    Sanlan Pzere  he tozyru ang allerone @0held yOrr rvens"a tocaty afinnLurtB " Tre f.s thamyten ory we:!
[Iter 401] Loss 2.915044
    iole't aneg on be kerbridt
[Iter 501] Loss 2.797385
    Gof.ce ppat. ounindo6 ist-lDonk.runf Arump itting bot. Riat ag done latt p for bes sonat me He Irat inanconn W- bat!
[Iter 601] Loss 2.667882
    @GA O- an ile thucind mes4 Pine is anding the cer Be real TMat. 0 AUmp Binaly fhandered distind deal onad bep hat  old Ga domad at in ere Me
[Iter 701] Loss 2.593516
    Doneling oon bantinn reat scourine OuNI. Trump @ru2lDoneldadPreanife wincing illicaldTrutp andere
[Iter 801] Loss 2.524553
    @DonalDent oner cump me to ans the reay sered real of how preat Treaj? ht.  U4F #LSNingine this  ale iling cou

In [28]:
train(model, trump_tweets, batch_size=32, num_epochs=1, lr=0.004, print_every=100)
print(sample_sequence(model, temperature=0.8))
print(sample_sequence(model, temperature=1.0))
print(sample_sequence(model, temperature=1.5))
print(sample_sequence(model, temperature=2.0))
print(sample_sequence(model, temperature=5.0))

[Iter 101] Loss 1.831786
    Thank you Eagil speack on the Unises and read instoristitide happen. and my want atton jobs and soment will be neaters his the President How
[Iter 201] Loss 1.743913
    @SraweFestrutts: @realDonaldTrump De trump  Thanks dotes.
[Iter 301] Loss 1.725483
    @mormallitsh: @realDonaldTrump President Borg.. Way : I have for detalter. The dore gandion gostingation Great... Bean agree!
[Iter 401] Loss 1.719499
    @Stannendey_Sens: @NexagaTherFore we all see mast head is a great on 7 10 shight https://t.co/Qq65rHVJEme
[Iter 501] Loss 1.706926
    Want the For Obama of trun a pock companer. Rete to americal and the Dems Country work. Grought have run belisters in Agimna! #Veverame Mary
[Iter 601] Loss 1.702637
    @justnerge: @realDonaldTrump @foxandfrieadd http://t.co/9mOpuKLrs
[Iter 701] Loss 1.684896
    @turconsh97 THE CYOLY to meeting burse in New York Will to for Mitt the mase fallicamt all a great every but and president the #USA #Trump
I will be and my of 

## Generative RNN using GPU
Training a generative RNN can be a slow process. Here's a sample GPU implementation to speed up the training. The changes required to enable GPU are provided in the comments below.

In [29]:
# Generative Recurrent Neural Network Implementation with GPU

def sample_sequence_cuda(model, max_len=100, temperature=0.8):
    generated_sequence = ""
   
    inp = torch.Tensor([vocab_stoi["<BOS>"]]).long().cuda()    # <----- GPU
    hidden = None
    for p in range(max_len):
        output, hidden = model(inp.unsqueeze(0), hidden)
        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).div(temperature).exp().cpu()
        top_i = int(torch.multinomial(output_dist, 1)[0])
        # Add predicted character to string and use as next input
        predicted_char = vocab_itos[top_i]
        
        if predicted_char == "<EOS>":
            break
        generated_sequence += predicted_char       
        inp = torch.Tensor([top_i]).long().cuda()    # <----- GPU
    return generated_sequence


def train_cuda(model, data, batch_size=1, num_epochs=1, lr=0.001, print_every=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    it = 0
    data_iter = torchtext.legacy.data.BucketIterator(data,
                                              batch_size=batch_size,
                                              sort_key=lambda x: len(x.text),
                                              sort_within_batch=True)
    for e in range(num_epochs):
        # get training set
        avg_loss = 0
        for (tweet, lengths), label in data_iter:
            target = tweet[:, 1:].cuda()              # <------- GPU
            inp = tweet[:, :-1].cuda()                # <------- GPU
            # cleanup
            optimizer.zero_grad()
            # forward pass
            output, _ = model(inp)
            loss = criterion(output.reshape(-1, vocab_size), target.reshape(-1))
            # backward pass
            loss.backward()
            optimizer.step()

            avg_loss += loss.item() # accumulate a plain float so the autograd graph is not kept alive
            it += 1 # increment iteration count
            if it % print_every == 0:
                print("[Iter %d] Loss %f" % (it+1, float(avg_loss/print_every)))
                print("    " + sample_sequence_cuda(model, 140, 0.8))
                avg_loss = 0

model = TextGenerator(vocab_size, 64)
model = model.cuda()
model.ident = model.ident.cuda()
train_cuda(model, trump_tweets, batch_size=32, num_epochs=1, lr=0.004, print_every=100)

[Iter 101] Loss 3.670088
    elcel @nao.oacRennces an lTwe1 n re olipilCaaarnneyue d tirelebe ulannB  UM regeoifggal tt/nMs iv<pad> wat antos:
[Iter 201] Loss 3.045497
    @reE hand oncaE😒 Bo Wan 7///or
[Iter 301] Loss 2.733087
    Tr @reat @restpicdons: @nean ang orge tof on fes he wint En fan till Gongen  sidt argint oniil co ttcong willere wool coule!1/t.
[Iter 401] Loss 2.564163
    No wPrur NonaldTrure s: Csiblicash Kase ton cotill htis . Bant. The the toplorialdce so16sTh @Mamp_yerDonathat or ofuguse ate yo/t seris wea
[Iter 501] Loss 2.442209
    I tor in #tament couinay in then inge ang of doth wive sCall mordist in alised https://tedTrumps Joxmxpiingtes ghtove and chath how of in to
[Iter 601] Loss 2.357620
    @ruslereagaldTrump the wome ares on #Mrettor tot Parsers Serest th
[Iter 701] Loss 2.275909
    @beyarCange @“DalDoneads: Thankh am @rearenens a tay runt soe sectory that! Coldars gat!


In [30]:
train_cuda(model, trump_tweets, batch_size=32, num_epochs=10, lr=0.004, print_every=500)

[Iter 501] Loss 2.122036
    Thank you to mect and anf as note and mextines httpy://t.co/8ckXOjNK9kR
[Iter 1001] Loss 1.149446
    Just bat support number with is by to First oul on and #Trump203 and businges would itrust with dissoour to inceveting counth.
[Iter 1501] Loss 0.359463
    @Swiliotharot: @realDonaldTrump Preay the surdary. I ciall comple of Seansel and he right!
[Iter 2001] Loss 1.808645
    .@stackersoneNG - If @Jebroman The plast.
[Iter 2501] Loss 1.409719
    It ack @rillyfoccernainess Great speech with https://t.co/g347JEsXmzY
[Iter 3001] Loss 0.682656
    @Heirnie12: @realDonaldTrump With  @RealWond Notcusion Massie on the Great will be this itle our will anytter better sail briend less than v
[Iter 3501] Loss 1.738280
    FerroSeed Stey on @Fogreardsarson @CBNBS Hillary Secolus Enorial on truth was in live very adaute. Arem working but strong fan this surn.
[Iter 4001] Loss 1.703777
    @leankunn__: @realDonaldTrump @realDonaldTrump the imer again.
[Iter 4501] Loss 

In [31]:
train_cuda(model, trump_tweets, batch_size=32, num_epochs=10, lr=0.0001, print_every=500)

[Iter 501] Loss 1.653411
    @rincaTer80: @realDonaldTrump @VarkTheernued New Ceris Meddation #MakiniggTrump
[Iter 1001] Loss 0.985465
    Great anyone show not could be a great whight the Jegh. He to eapi welt infort the will she him as been morgate we run why big tola made of 
[Iter 1501] Loss 0.322853
    Well our could thank you fake! The Tevenas to MikeNN Donald Trump! #CNNCTrump at Doma Link’ down people ratrally president!
[Iter 2001] Loss 1.645530
    #Demorea @realDonaldTrump I fantastic. New York #MakeAmericaGreatAgain
[Iter 2501] Loss 1.308349
    @waldMosticle: @realDonaldTrump sand getting are been chuckatey to be ore beries says to can for all again. I run in Best it he well the the
[Iter 3001] Loss 0.642856
    Thank you President  prowe into American by pursoned your but the will beto far Franth Countic. They up ratings and the reported the grivers
[Iter 3501] Loss 1.646331
    The York Naw.You an keeple by @terrilly President! http://t.co/6WXTfHMxR2n
[Iter 4001] Loss

In [32]:
train_cuda(model, trump_tweets, batch_size=32, num_epochs=10, lr=0.0001, print_every=500)

[Iter 501] Loss 1.642307
    Vivinbbore: @realDonaldTrump @realDonaldTrump 3 win a was #Iownoth Pastt?
[Iter 1001] Loss 0.982269
    Rembour a even politice the country and betting around job and some anshings to be the filded to thinks "plountational his plan the!
[Iter 1501] Loss 0.322010
    That Repless in 7 PRIDENT YEAPONNESTANKSCST winstry to place to sport's speech me in the mother now the country sad Mery weekence! #Trump201
[Iter 2001] Loss 1.641522
    @breindlomanMette: Leaders in White in NYC and Dike hand hought back the learnifulushing in ma the been is more the it. He incener getting e
[Iter 2501] Loss 1.305472
    @Kallet: @realDonaldTrump @realDonaldTrump https://t.co/7NCTo1UYHV
[Iter 3001] Loss 0.641530
    @_Bortich: @realDonaldTrump Implay the been bust TRUMP NEWE were is aneess!
[Iter 3501] Loss 1.643160
    Thanks to I and dones of place this a see cating her deluder on @NBCULDONELINSEONDAY A GREAT AGAIN!
[Iter 4001] Loss 1.624181
    @Joshawk: @realDonaldTrump @G

Let's generate some samples from the trained model at several temperatures, from low (0.2) to high (1.5).
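
As a reminder of what the temperature argument controls, here is a minimal sketch in plain Python. It assumes that `sample_sequence_cuda` divides the model's output scores by the temperature before turning them into a probability distribution, as the Practical PyTorch tutorial does; the helper `softmax_with_temperature` and the logit values are hypothetical and only for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature.

    Hypothetical helper for illustration; not part of the tutorial's code.
    """
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for a 4-character vocabulary (illustrative values only)
logits = [2.0, 1.0, 0.5, 0.1]

# The same temperatures used in some of the sampling cells
for t in [0.2, 0.6, 1.0, 1.5]:
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature nearly all of the probability lands on the highest-scoring character, so the samples tend to be repetitive. At a high temperature the distribution flattens, so the samples are more varied but contain more errors.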

In [33]:
for i in range(5):
  print(sample_sequence_cuda(model, 140, 0.2))

@Karchary12: @realDonaldTrump @realDonaldTrump is the great be all the great be a great be and discastion the president the the country.
@jestingers: @realDonaldTrump @realDonaldTrump @realDonaldTrump https://t.co/XE4I6IEFZ
@MaryTerees: @realDonaldTrump @realDonaldTrump https://t.co/PWskFgzQi
@Themittt: @realDonaldTrump @realDonaldTrump @realDonaldTrump https://t.co/7RRTVrsjET
@marchannelly: @realDonaldTrump @realDonaldTrump is a great the proper the country and in the been a great better has been the the president


In [34]:
for i in range(5):
  print(sample_sequence_cuda(model, 140, 0.6))

So Scotter live wall people and with the "Edmbrigretely on the defeat. A really and fan the U.S. his an on all he to be and epenting on the 
@Anerginnam: @realDonaldTrump @realDonaldTrump The New Your propisting for are to you work for the interviewed by America!
@AmlelBerice1: @realDonaldTrump I will be on @realDonaldTrump man on @FoxNews to respect and see president "gonks the Pame over if you can m
@Aralddritiem: @realDonaldTrump @realDonaldTrump #Trump2016 https://t.co/ywkRwzhpJ
@Mirkotter: @realDonaldTrump Trump to the even so in the office to wo have pamed to $100000 very one Big Michighing not get histon the Unite


In [35]:
for i in range(5):
  print(sample_sequence_cuda(model, 140, 0.8))

@chivents: @realDonaldTrump @realDonaldTrump May You thanks.
“Curat election is out of of this business same beeil ond back are campaills after the America will yestward to the alf.
@MaFirtty20:  https://t.co/yMTpEFP8k
I've won collase the promisles sisppon in MarketcKic  I winning of ObamaCare of made is one verabicagees don't totally millions and presiden
.@DannyickNickets by Harsson and going to eucuss going was said (not us the remeckine days watch he debate to of the presibate! https://t.co


In [36]:
for i in range(5):
  print(sample_sequence_cuda(model, 140, 1))

I president way best thoud on Hellatueing Irubact! Today pasqual hes with up big emustle many laws we will ip! The people
New a jobal craving” the enjoy!  intervicted mexiated. Je in Crookedro instway that a annmEn morated well menaiffor with the much can Interm
@ScikePruin_Jor: See Presidential Marther's sealto win should back lijust never anfryrater8 that UN!
The slanding crouccer Congratrover &amp; back im shourd alvoch our me lobring flopers shall: And BAC Sountred ar formed. They will be for ho
Great - tell all when media &amp; vigetimate will best!


In [37]:
for i in range(5):
  print(sample_sequence_cuda(model, 140, 1.5))

@MtotManxe_pPE6Y  I: There w/ke totkey real minilue gnt JUNG #Pubulany Juslatumac Jone or eve #ure. Sove
@EvineTfumpcqlavy Zight DonaldTruck iss.” Conglly.?👍! Sponoucled runnyines!#MA ot enjoyevoute” .M"
Only just”stom: 20ICE.YM.
Avit/yetcoppctipost pacrs 4's. USSTAKW MAXA FRTIR! I lyaght-loving
.@_S0Wez T❌RAFSVIRY JJC otwey Cuts I Donald Donsans a.CERFE PROCS.RC) Makm.
