# Recurrent Neural Networks
So far, the neural network architectures we have used have been simple in the sense that they take a single fixed-size input and give a single fixed-size output. What if we wanted to model something like language, where we want to feed in words of different lengths? Another issue is that each output depends only on the current input: the network has no 'memory' of previous inputs. Recurrent neural networks (RNNs) address both of these issues.

They do this by maintaining an internal hidden state, which can be thought of as a form of memory. At each time step, the new hidden state is calculated as a function of the previous hidden state and the current input. This hidden state can then be used directly as the output or can be passed through another function to compute the output. By 'function' we mean the same kind used in a standard neural network: a linear combination followed by an activation function.

### $h_t = f(x_t, h_{t-1})$

Here $f$ takes a linear combination of the input $x_t$ and the previous hidden state $h_{t-1}$ and applies an activation function to it.


As shown in the diagram below, which uses a further function to compute the output $o$ from the hidden state $s$, there are three matrices of parameters which we are trying to optimize: U, V and W. The diagram also demonstrates how these networks can be unfolded to show the variables at various time steps.

![](rnn.jpg)
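The recurrence in the diagram can be sketched in a few lines of NumPy. This is only a minimal illustration: it assumes the common convention that U maps the input to the hidden state, W maps the previous hidden state to the new one, and V maps the hidden state to the output, with tanh as the activation. The sizes and the helper name `rnn_step` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 3, 4  # illustrative sizes

# Assumed roles of the three parameter matrices
U = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
V = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output

def rnn_step(x_t, s_prev):
    """One time step: returns the new hidden state and the output."""
    s_t = np.tanh(U @ x_t + W @ s_prev)  # linear combination + activation
    o_t = V @ s_t                        # further function giving the output
    return s_t, o_t

# Unfold over a sequence of 5 inputs, reusing the same U, W and V at each step
xs = rng.normal(size=(5, input_size))
s = np.zeros(hidden_size)
outputs = []
for x_t in xs:
    s, o = rnn_step(x_t, s)
    outputs.append(o)
outputs = np.array(outputs)
```

Note that the loop reuses one set of parameters for every time step, which is what the unfolded diagram depicts.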

Standard neural networks can only model one-to-one relationships, while RNNs are extremely flexible in terms of input-output structures, which is one of the reasons they are so powerful. You can imagine a one-to-many structure being used to feed in a single image from which a caption is sequentially produced, or a many-to-one structure being used to feed in a sentence sequentially and give a single output describing the sentiment of the sentence.
- the diagram uses "s" for the hidden state (written $h$ in the equation above)
- X * U: the change in the hidden state contributed by the input
- hidden state = memory of the network; it can also be used directly to represent the output (label)
- one-to-many relationship: e.g. feed in an image, predict a transformation, then predict a further transformation. The network is sequential (the output of each step depends on the previous one), so the steps cannot be generated in parallel, which makes it slow
- many-to-one: feed in a sentence, keep updating the hidden state and use the final hidden state

![](rnnlayouts.jpeg)
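Two of the layouts described above can be sketched as plain loops around a single recurrent step. This is a rough sketch under stated assumptions: the function names, the sizes and the choice to feed an all-zero input after the first step of the one-to-many case are illustrative only (in practice the previous output is often fed back in instead).

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 3, 5  # made-up sizes
U = rng.normal(scale=0.1, size=(n_hid, n_in))
W = rng.normal(scale=0.1, size=(n_hid, n_hid))

def step(x_t, s_prev):
    """One recurrent update of the hidden state."""
    return np.tanh(U @ x_t + W @ s_prev)

def many_to_one(xs):
    """Consume the whole sequence and return only the final hidden state."""
    s = np.zeros(n_hid)
    for x_t in xs:
        s = step(x_t, s)
    return s

def one_to_many(x, n_steps):
    """Feed a single input, then keep stepping from the previous state."""
    s = step(x, np.zeros(n_hid))
    states = [s]
    for _ in range(n_steps - 1):
        s = step(np.zeros(n_in), s)  # each step needs the previous state first
        states.append(s)
    return np.array(states)

summary = many_to_one(rng.normal(size=(7, n_in)))
sequence = one_to_many(rng.normal(size=n_in), n_steps=4)
```

In both functions each iteration depends on the state produced by the one before it, which is why the steps cannot run in parallel.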

### Optimization
Surprisingly, despite this increased complexity in structure, the optimization method does not become any more difficult. It has a different name, back-propagation through time, but it is essentially the same thing. You feed in your sequence one step at a time to get the outputs, as usual. You then calculate the error at each timestep and sum these errors, as opposed to calculating the error for a single output as in standard neural networks. Then you can use gradient descent to update your weights iteratively until you are satisfied with your network's performance.
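The summing of per-timestep errors can be illustrated with NumPy. This is a sketch with made-up scores and targets, and the helper `cross_entropy` is written out by hand for the example; it only shows how the cost of a sequence is assembled, not how the gradients are computed.

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy of one prediction; logits are unnormalized scores."""
    shifted = logits - logits.max()                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    return -log_probs[target]

# Made-up scores for a 3-step sequence over 4 possible characters
logits_per_step = np.array([
    [2.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 3.0, 0.1],
    [0.5, 0.5, 0.5, 0.5],
])
targets = [0, 2, 1]  # made-up correct character at each step

per_step = [cross_entropy(step_logits, t)
            for step_logits, t in zip(logits_per_step, targets)]
sequence_cost = sum(per_step)  # the single number that is backpropagated
```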

RNNs are generally slower to optimize than standard neural networks because the computation at each time step depends on the result of the previous step, so the time steps cannot be processed in parallel.

For a long time, RNNs were considered difficult to train because of two problems called vanishing and exploding gradients. These problems also exist in standard neural networks but are much more pronounced in RNNs. However, modern techniques such as LSTM cells have greatly reduced this difficulty.
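A toy calculation gives some intuition for why gradients vanish or explode. It assumes the simplest possible recurrence, a scalar linear one, $h_t = w \cdot h_{t-1}$, for which the derivative of $h_T$ with respect to $h_0$ is $w^T$; real RNNs are more complicated, so treat this only as an illustration.

```python
def gradient_scale(w, n_steps):
    """Factor by which a gradient is scaled after n_steps of h_t = w * h_{t-1}."""
    return w ** n_steps

vanishing = gradient_scale(0.9, 100)  # roughly 2.7e-05: the signal all but disappears
exploding = gradient_scale(1.1, 100)  # roughly 1.4e+04: the signal blows up
```

A weight only slightly below or above 1 is enough to change the gradient by many orders of magnitude over 100 time steps.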

The goal in what follows is to learn the relationships between letters so that the network can generate a whole body of text.

## Implementation
We are going to implement a one-to-one character-level text prediction model. We will sequentially feed in a single character and ask our network to predict the next character, based on the 'memory' of all the previous characters stored in the hidden units.

As always, we begin by importing the required libraries.

In [0]:
%matplotlib notebook
import torch
import numpy as np
from torch.autograd import Variable
import torch.nn.functional as F
import matplotlib.pyplot as plt

In [0]:
# http://pytorch.org/
from os import path
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())

accelerator = 'cu80' if path.exists('/opt/bin/nvidia-smi') else 'cpu'

!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.3.0.post4-{platform}-linux_x86_64.whl torchvision
import torch

For this particular task, we will need to do quite a bit of pre-processing. We need to find the unique characters in our training text and give each one a unique number so we can one-hot encode them.<br>
We start by reading the file and converting all letters to lowercase to reduce the number of characters we need to model. We then define a function which takes in the text and gives us back a dictionary mapping a unique number to each letter. <br> 
For background on one-hot encoding sequence data, see https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/


In [5]:
from google.colab import files 
files.upload()

Saving lyrics to lyrics


{'lyrics': b'Aye, I remember syrup sandwiches and crime allowances\r\nFinesse a nigga with some counterfeits\r\nBut now I\'m counting this\r\nParmesan where my accountant lives in fact I\'m down at this\r\nD\'us\xc5\x9be with my boo bae, tastes like kool aid for the analysts\r\nGirl, I can buy your ass the world with my paystub\r\nOoh that pussy good, won\'t you sit it on my taste bloods\r\nI get way too petty once you let me do the extras\r\nPull up on your block, then break it down we playing Tetris\r\nA.M. to the P.M., P.M. to the A.M. funk\r\nPiss out your per diem you just gotta hate em, funk\r\nIf I quit your BM I still ride Mercedes, funk\r\nIf I quit this season I still be the greatest, funk\r\nMy left stroke just went viral\r\nRight stroke put lil baby in a spiral\r\nSoprano C, we like to keep it on a high note\r\nIt\'s levels to it, you and I know, bitch be humble\r\n\r\n(Hol\' up bitch) sit down,\r\n(Hol\' up lil bitch, hol\' up, lil bitch) be humble\r\n(Hol\' up bitch) sit 

In [0]:
#open our text file and read all of its contents into the rawtxt variable
with open('lyrics', 'r') as file:
    rawtxt = file.read()

#convert all of the text to lowercase, as this reduces the number of characters whose distribution our model needs to learn
rawtxt = rawtxt.lower()
#rawtxt is now one long string

#returns a dictionary that maps a unique number to each unique character in our text
def create_map(rawtxt):
    
    #set() keeps one copy of each character; a set has no guaranteed order, so the numbering can differ between runs
    letters = list(set(rawtxt)) #the list of unique characters in our raw text
    #enumerate() numbers the characters from 0 to len(letters)-1
    lettermap = dict(enumerate(letters)) #dictionary mapping a unique number to each character
    return lettermap

num_to_let = create_map(rawtxt) #store the dictionary mapping from numbers to characters in a variable
#e.g. num_to_let[10] is the character that was assigned the number 10
#zip() pairs each character (value) with its number (key), giving the reverse mapping from a character to its unique number
let_to_num = dict(zip(num_to_let.values(), num_to_let.keys()))

nchars = len(num_to_let) #number of unique characters in our text file
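
The two mappings are easiest to see on a tiny string. The sketch below follows the same idea as `create_map`, with one deliberate difference that is an addition for this example only: `sorted()` is used so that the numbering is reproducible.

```python
text = "hello"

# sorted() is added here only to make the numbering reproducible
letters = sorted(set(text))                       # ['e', 'h', 'l', 'o']
num_to_let_toy = dict(enumerate(letters))         # {0: 'e', 1: 'h', 2: 'l', 3: 'o'}
let_to_num_toy = {ch: n for n, ch in num_to_let_toy.items()}

encoded = [let_to_num_toy[ch] for ch in text]     # [1, 0, 2, 2, 3]
decoded = "".join(num_to_let_toy[n] for n in encoded)  # back to 'hello'
```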

We now define a function which takes in text and a dictionary and maps each character in the text to the value specified for it in the dictionary. We then use this to map all of our text into the unique numbers for each character so it can be used with our RNN model. The labels are the inputs shifted by one time step, as the label for each input is the character which comes after it.<br> In short: every character is converted into a number, i.e. mapped to its ID.

In [0]:
def maparray(txt, mapdict):
    
    txt = list(txt)

    #iterate through our text and change the value for each character to its mapped value
    for k, letter in enumerate(txt):
        txt[k] = mapdict[letter]

    txt = np.array(txt)
    return txt

#map our raw text into our input variables using the function defined earlier and passing in the mapping from letters to numbers
X = maparray(rawtxt, let_to_num)
#X is the whole text as one long array of character numbers

Y = np.roll(X, -1, axis=0) #our label is the next character, so roll shifts our array by one time step
#Y[i] is X[i+1]; the last label wraps around to the first character of the text

#convert to torch tensors so we can use them in our torch model
#LongTensor holds integers, which is what the embedding layer and the loss function expect
X = torch.LongTensor(X)
Y = torch.LongTensor(Y)
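
The effect of `np.roll` on the labels can be checked on a toy array. The numbers below are made up; the point is that every label is the next input, except for the last one, which wraps around to the start.

```python
import numpy as np

X_toy = np.array([7, 4, 11, 11, 14])  # made-up character numbers
Y_toy = np.roll(X_toy, -1, axis=0)    # shift left by one position

# Every label except the last is the character that follows its input
assert np.array_equal(Y_toy[:-1], X_toy[1:])
# The last label wraps around to the first character of the text
assert Y_toy[-1] == X_toy[0]
```

For a long text the single wrapped label has a negligible effect on training.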


Define our model, which takes the variables defining its structure as parameters. The encoder converts each unique number into an embedding, which is fed into the RNN. The RNN calculates the hidden state, which is converted into an output through a fully connected layer called the decoder.<br>
We also define the init_hidden function, which returns a tensor of zeros of the required size for the hidden state. <br> 
Each embedding is a vector of continuous values, which allows the model to learn similar vectors for letters that are more closely related.

embedding: performs this grouping of related letters
- input size = number of distinct letters that can be fed in
- assume that in the beginning, all letters 

In [0]:
class RNN(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_layers=1): # use parameters to define architecture of the network 
        super().__init__()
        #store input parameters in the object so we can use them later on
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers

        #layers required by the model
        self.encoder = torch.nn.Embedding(input_size, input_size)
        self.rnn = torch.nn.GRU(input_size, hidden_size, n_layers, batch_first=True) 
        #self.rnn = torch.nn.RNN(input_size, hidden_size, n_layers, batch_first=True) #a plain RNN also works here; which of the two performs better has not been checked
        self.decoder = torch.nn.Linear(hidden_size, output_size)
        #the linear layer maps the hidden state to one score per possible next character

    def forward(self, x, hidden): #forward takes the previous hidden state as well as the input
        x = self.encoder(x.view(1, -1)) #encode our input into a vector embedding (the view makes the input a row vector)
        output, hidden = self.rnn(x.view(1, 1, -1), hidden) #calculate the output from our rnn based on our input and previous hidden state
        #with one layer and one time step, output and hidden hold the same values
        output = self.decoder(output.view(1, -1)) #calculate our output scores from the output of the rnn

        return output, hidden

    def init_hidden(self):
        return Variable(torch.zeros(self.n_layers, 1, self.hidden_size)) #initialize our hidden state to zeros, with shape (n_layers, batch size of 1, hidden_size)
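
An embedding layer can be thought of as a lookup table with one learnable row per character. The NumPy sketch below, with made-up sizes and random weights, shows that picking a row by its index gives the same result as multiplying a one-hot vector by the weight matrix; it illustrates the idea and is not a description of how `torch.nn.Embedding` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)
nchars_toy, embed_dim = 5, 3                         # made-up sizes
weights = rng.normal(size=(nchars_toy, embed_dim))   # one row per character

char_id = 2
one_hot = np.zeros(nchars_toy)
one_hot[char_id] = 1.0

by_lookup = weights[char_id]   # select the row for this character
by_matmul = one_hot @ weights  # same vector via the one-hot encoding
```

Because the rows are trained along with the rest of the network, characters that behave similarly can end up with similar vectors.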

Instantiate our model and define the appropriate hyper-parameters, cost function and optimizer. We will be training on random samples from the text of length chunk_size, so chunk_size plays the role that batch size plays in normal neural networks.<br>
We also define the function which returns a random chunk from the text, which we can use to train our model.

In [0]:
#hyper-params
lr = 0.001
no_epochs = 20
chunk_size = 100 #the length of the sequences which we will optimize over

myrnn = RNN(nchars, 512, nchars, 1) #instantiate our model from the class defined earlier
#nchars is both the input size and the output size: the model outputs a score for each possible next character
criterion = torch.nn.CrossEntropyLoss() #define our cost function
optimizer = torch.optim.Adam(myrnn.parameters(), lr=lr) #choose optimizer

#train on random chunks of the text, feeding each chunk into the network one character at a time
#chunk_size plays a role similar to the batch size

#return a random chunk for training
#k is a random start index chosen so that the whole chunk fits inside X
def random_chunk(chunk_size):
    k = np.random.randint(0, len(X)-chunk_size)
    return X[k:k+chunk_size], Y[k:k+chunk_size] #return the input and output labels 

Define the axes for plotting our cost per epoch. Define the training loop, sequentially feeding in a random chunk of text, summing the cost for each character in the sequence (backpropagation through time) and calculating the gradients to update our weights.

In [0]:
#for plotting costs
costs = []
plt.ion()
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlabel('Epoch')
ax.set_ylabel('Cost')
ax.set_xlim(0, no_epochs-1)
plt.show()


#training loop
def train(no_epochs):
    for epoch in range(no_epochs):
        totcost = 0 #stores the cost per epoch
        generated = '' #stores the text generated by our model each epoch
        #given our chunk size, how many chunks do we need to optimize over to have gone through our whole dataset
        for _ in range(len(X)//chunk_size): 
            #number of chunks; // divides and rounds down to the nearest integer 
            h = myrnn.init_hidden() #reset our hidden state to 0s so that the previous chunk does not affect this one
            cost = 0 #cost for this chunk
            x, y = random_chunk(chunk_size) #get a random sequence chunk to train
            x, y = Variable(x), Variable(y) #turn into variables to be used with our model
            #sequentially input each character in our sequence and calculate loss
            for i in range(chunk_size):
                out, h = myrnn.forward(x[i], h) #calculate the output and the next hidden state from the input and previous hidden state

                cost += criterion(out, y[i]) #add the cost for this input to the cost for this current chunk

                #based on our output, what character does our network predict is next?
                predlet = out.data.max(1)[1] #max(1) returns (values, indices); the index is the number of the predicted character
                letter = num_to_let[predlet[0]]
                generated += letter #add the predicted letter to our generated sequence (once per input character)

            #based on the sum of the cost for this sequence (backpropagation through time) calculate the gradients and update our weights
            optimizer.zero_grad()
            cost.backward()
            optimizer.step()

            totcost+=cost.data[0] #add the cost of this sequence to the cost of this epoch
        totcost /= len(X)//chunk_size #divide by the number of chunks per epoch to get average cost per epoch
        #i.e. normalize by the number of chunks so that the cost is comparable across epochs

        #append the cost to the array and plot
        costs.append(totcost)
        ax.plot(costs, 'b')
        fig.canvas.draw()
        
        plt.figure()
        plt.plot(costs, 'b')
        plt.show()
        

        print('Epoch ', epoch+1, ' Avg cost/chunk: ', totcost)
        print('Generated text: ', generated[0:750], '\n')
        
train(no_epochs)

The generated text above picks the most probable next character each time. This is not the best way to do it as our model will be deterministic so it will produce the same text over and over again. To get it producing different text, we should instead sample from the probability distribution of possible next letters output by the network. That is what we will do with the next function.
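The difference between the two strategies can be seen with a made-up distribution over three characters. The sketch uses only the standard library; the characters, probabilities and seed are illustrative.

```python
import random

chars = ['a', 'b', 'c']
probs = [0.5, 0.3, 0.2]  # made-up next-character probabilities

# Greedy: always the most probable character, so always the same answer
greedy = [chars[probs.index(max(probs))] for _ in range(10)]

# Sampling: draw in proportion to the probabilities, so the results vary
random.seed(0)
sampled = random.choices(chars, weights=probs, k=1000)
```

Greedy picking returns 'a' every time, while sampling returns all three characters, with the more probable ones appearing more often.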

In [0]:
def generate(prime_str='a', str_len=150, temperature=0.75):
    generated = prime_str #the generated text starts with the priming string
    
    #initialize hidden state
    h = myrnn.init_hidden()
    
    prime_str = maparray(prime_str, let_to_num)
    x = Variable(torch.LongTensor(prime_str))  #convert into a variable to feed into model 
    
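    #prime our hidden state with every character of the input string except the last
    for i in range(len(x) - 1):
        out, h = myrnn.forward(x[i], h) #condition the hidden state on the priming text before making predictions
    
    x = x[-1] #the last priming character is the first input of the generation loop below, so it is only fed in once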
    
    for i in range(str_len):
        out, h = myrnn.forward(x, h)
        
        out_dist = out.data.view(-1).div(temperature).exp() #divide the scores by the temperature, then exponentiate so they are all positive
        #dividing by a temperature below 1 pushes the scores further apart, so the most likely characters dominate even more
        #dividing by a temperature above 1 brings the scores closer together, so the sampling is more random
        sample = torch.multinomial(out_dist, 1)[0] 
        #the model applies no softmax to its output; torch.multinomial accepts these unnormalized positive weights and samples in proportion to them
        pred_char = num_to_let[sample]
        
        generated += pred_char
        
        x = Variable(torch.LongTensor([sample])) #wrap the sampled character number in a Variable so it can be fed back in as the next input
        #repeating this generates the text one character at a time
    
    return generated
            

gen = generate('this be ', 1500, 0.75) #'this be ' is the priming string
#lower temperature --> sharper distribution --> more predictable, conservative text
#higher temperature --> flatter distribution --> more random text
print(gen)

this be ur every and your stitch inside aye
watchin' all the snakes, curvin' all the fakes
phone neve a sound like poetic justice
if i told you that a flower for you outta cotton, i jeselverse bask in sin?
pass the gin, i mix it what the fuck you heard
and pessimists never struck my seetion stackin' up the go
and sleepin' in a villa
sippin' from again, then win all a this how in is when i'm walking down thess we hood see my enemies and critics layalty, got royalty inside my dna
power shows inside my dna
dna
gimme some ganja, gimme some ganja
real nigga in my dna
ain't no ho inside my dna
dnappy hopean a tiffor, lovant they survival of resentment
resentment that sundress, ooh
good god, what you doing hail in office
we lost barack and crown vic, my memory been gone since
don’t ask about my foes
'less you askin' me abomate
i don't fabricate it, aye
most of ya'll be faking, aye
i stay modest but your pat you don't rapper like pyeaks, aye
i don't fabricate it, aye
most of ya'll be bark as b
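
The effect of the temperature argument used in generate can be sketched on a handful of made-up scores. The helper `temperature_probs` is hypothetical and written in plain Python; it divides the scores by the temperature, exponentiates them and normalizes, which gives the probabilities that sampling in proportion to the exponentiated scores would follow.

```python
import math

def temperature_probs(scores, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    weights = [math.exp(s / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

scores = [2.0, 1.0, 0.0]                 # made-up network outputs
cold = temperature_probs(scores, 0.25)   # sharper: the top score dominates
neutral = temperature_probs(scores, 1.0)
hot = temperature_probs(scores, 4.0)     # flatter: closer to uniform
```

With these scores the most likely character gets a larger share of the probability as the temperature falls and a smaller share as it rises, which is why low temperatures give more predictable text and high temperatures give more random text.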