<a href="https://colab.research.google.com/github/mirandasaari1/Karpathy-Tensor-Flow-Conversion/blob/master/Project_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Translating a Recurrent Neural Network (RNN) model developed in Python to TensorFlow

### Created By Rosa Garza and Miranda Saari

## Introduction

The following code is a TensorFlow interpretation of Andrej Karpathy's Python code. The original can be found at [Karpathy's Python code](https://gist.github.com/karpathy/d4dee566867f8291f086/).

## Challenges

Throughout this project we encountered a variety of challenges, mostly stemming from our unfamiliarity with TensorFlow and RNNs. The first major challenge was developing a clear understanding of Karpathy's code and translating it to TensorFlow. At first glance Karpathy’s code was overwhelming, especially the linear algebra performed inside the loss function. As a first step, we decided to look at the MNIST RNN model and base our program on its structure.

One of the first issues we faced, before our program would even run, was deciding how to feed our data into the RNN. Karpathy’s code made it clear that the characters had to be one-hot encoded, so we had to decide which approach was more efficient: encoding the data before it was sent to TensorFlow, or inside TensorFlow. From talking to Dr. Bruns we learned that the more efficient method was to one-hot encode the data in TensorFlow all at once. After some trouble trying to encode everything at one time, we settled on one-hot encoding the data one batch at a time. Our fetch-batch function already produced a NumPy array of character indices, and TensorFlow provides a function to encode such an array, so we fed the raw indices into the graph and encoded them there. Encoding the 2D array of indices adds a third dimension to our X input, which confused our thinking about how the RNN was being trained: visualizing training on 3D data made it hard to picture how the program was running.

After the data was encoded, we ran into several issues with how the data was being fed in. One error involved data types: our input data was typed as floats while our output values were integers. To avoid the error, we changed both the input and output values to type int.
 
Because we started from the MNIST RNN model, we realized we didn’t understand why certain TensorFlow functions were being used. For example, the MNIST RNN model didn’t use OutputProjectionWrapper() or dynamic_rnn(), which led us to other sources such as our course book’s RNN model and open source code on GitHub. Because we didn’t have a clear understanding of why each function was used, our RNN model ended up with functions that either weren’t needed or weren’t being used correctly. With this mix of functions in the program, we received confusing errors such as “Rank mismatch: Rank of labels (received 2) should equal rank of logits minus 1 (received 2)”. We then changed various dimensions being passed through TensorFlow, which produced more errors, specifically about the dimensions of our input and output data.

After struggling to understand the dimensionality of our input and output values, we scheduled a meeting with Dr. Bruns, which helped tremendously. Talking it through with Dr. Bruns and seeing the overall picture of Karpathy’s method gave us a clearer understanding of how our program should take in input and produce output. Discussing the reasoning behind OutputProjectionWrapper() helped us see why the function matters for training and for producing the correct number of outputs. It also taught us why programs include a dense layer. Moving forward, we know to take the time to understand how TensorFlow functions work, how they are used in the program, and why they are needed.

Because our RNN model was going to calculate the loss for us, and we would not write a loss function like Karpathy's, we mistakenly overlooked Karpathy’s loss function and focused on finding resources to get our RNN model working. As a result, the dimensions in our RNN model were wrong during training and the program crashed. After meeting with Dr. Bruns, we realized how much Karpathy’s loss function helps in figuring out which calculation occurs at each step of the neural network. This mistake forced us to learn the dimensions of every input and output of the RNN’s functions. Knowing the dimensions after each step let us debug the program by printing the shapes of certain variables, which also showed where in the model we were using variables incorrectly.

Once these challenges were solved, we were able to output the loss as we ran the RNN, which was our first big accomplishment. From there we looked for ways to build the array of indices representing the character predictions. Understanding how Karpathy implemented character output while keeping our own RNN implementation in mind proved difficult. For example, it was hard to tell which parts of Karpathy's code still needed to be implemented and which tasks our RNN was already performing for us. After applying a softmax function in our RNN we got back a tensor of probabilities, but we were unsure what to do with it. We then talked to Dr. Bruns about the best way to use this tensor, which has dimensions (batch size, 25, 69), for character selection. On his advice we took the probabilities for the first sequence in the batch at its last timestep (probs[0][-1] in the code). Lastly, like Karpathy, we built a list of 200 sampled indices and converted them to characters, which are printed at regular intervals during training (every 20 epochs in the execution code below). Seeing our very first output was a huge accomplishment for us, and from there we were able to build our second model.

After seeing the output from our RNN model, we had doubts about whether the program was training and making predictions correctly. Once we developed our second model, changed certain parameters, and trained for longer, we could see that the model was in fact improving. The very first outputs were gibberish, but by the end of training words were being formed and punctuation was correct. Watching the program begin to form words such as “I’m”, “Alice”, and “Oh,” over time was very exciting. We now feel accomplished at having built a neural network model that can predict text. From this experience we have gained good practice for developing future machine learning models and know the important steps required to build successful ones.

# Part 1

For Part 1 of our project we translated Karpathy's Python code to TensorFlow. Alongside the code below, we describe what stayed the same as in Karpathy's version and what changed.

### Libraries

In [0]:
import tensorflow as tf
import numpy as np

### Alice Data

The following is from Karpathy's original code, in which he creates the variables and lookup tables needed to convert between characters and integer indices.

In [0]:
# read the whole file; it should be a simple plain text file
with open('alice.txt', 'r') as f:
    data = f.read()
# unique list of characters in the file
chars = list(set(data))

# total number of characters and number of unique characters
data_size, vocab_size = len(data), len(chars)
print('data has %d characters, %d unique.' % (data_size, vocab_size))

# lookup tables between each character and its integer index
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

data has 144393 characters, 69 unique.
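
As a small illustration of how the two lookup tables work together, the sketch below encodes a short string into indices and decodes it back. The sample string and the `sample_`-prefixed names are hypothetical, and `sorted()` is an addition that is not in the cell above.

```python
# Hypothetical sample text; the notebook reads alice.txt instead.
sample_text = "hello alice"

# sorted() is an addition here: it makes the index assignment reproducible,
# whereas list(set(...)) can order characters differently between runs.
sample_chars = sorted(set(sample_text))
sample_char_to_ix = {ch: i for i, ch in enumerate(sample_chars)}
sample_ix_to_char = {i: ch for i, ch in enumerate(sample_chars)}

# Encode every character as its index, then map the indices back to text.
encoded_ixs = [sample_char_to_ix[ch] for ch in sample_text]
decoded_text = "".join(sample_ix_to_char[ix] for ix in encoded_ixs)

print(encoded_ixs)
print(decoded_text)
```

Because each table is the inverse of the other, decoding an encoded string should always return the original text.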


### Hyperparameters

The sequence length and the number of neurons in a cell stayed the same as in Karpathy's code.
Since Karpathy did not define a number of epochs, we chose to start with 1000 to see how our model performed and went up from there; the cell below shows a later setting of 4000.

In [0]:
num_hidden     = 100 # size of hidden layer of neurons
num_epochs     = 4000 # number of training iterations, one mini-batch each
learning_rate  = 0.001
batch_size     = 200 # sequences per mini-batch; 200 echoes Karpathy's sample length
training_steps = 5 # not referenced by the code below
timesteps      = 25 # number of time steps (characters) fed to the network per sequence
seq_length     = 25 # number of steps to unroll the RNN for
num_classes    = len(char_to_ix) # number of possible characters, the length of each one-hot vector

### Preprocessing data

The code below was a suggestion from Dr. Bruns of how to preprocess our initial array 'data' of text.

In [0]:
#create training sequences and corresponding labels
X = []
y = []
for i in range(0, len(data)-seq_length-1, 1):
        X.append([char_to_ix[ch] for ch in data[i:i+seq_length]])
        y.append([char_to_ix[ch] for ch in data[i+1:i+seq_length+1]])
# reshape the data
# in X_modified, each row is an encoded sequence of characters
X_modified = np.reshape(X, (len(X), seq_length))
# in Y modified, each row corresponds to X_modified, but each row's value is
# offset by 1 in comparison to X_mod.
y_modified = np.reshape(y, (len(y), seq_length))
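
To make the relationship between the inputs and the labels concrete, here is a minimal sketch of the same windowing loop on a tiny string. The string, the sequence length of 3, and the `toy_` names are hypothetical stand-ins chosen so the arrays are easy to inspect.

```python
import numpy as np

toy_data = "abcdefgh"      # hypothetical stand-in for the Alice text
toy_seq_length = 3         # hypothetical, far shorter than 25
toy_char_to_ix = {ch: i for i, ch in enumerate(sorted(set(toy_data)))}

toy_X, toy_y = [], []
for i in range(0, len(toy_data) - toy_seq_length - 1):
    # input window starts at i, label window starts one character later
    toy_X.append([toy_char_to_ix[ch] for ch in toy_data[i:i + toy_seq_length]])
    toy_y.append([toy_char_to_ix[ch] for ch in toy_data[i + 1:i + toy_seq_length + 1]])

toy_X = np.array(toy_X)
toy_y = np.array(toy_y)

print(toy_X.shape)           # one row per window, one column per time step
print(toy_X[0], toy_y[0])    # the label row is the input row shifted by one character
```

Every label row repeats its input row from the second character onward and adds one new character at the end, which is what lets the network learn to predict the next character at every time step.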

### Building RNN

In [0]:
tf.reset_default_graph()

Because Karpathy trained the Python RNN model on a one-hot encoding of the integer values in each array of text, we chose to feed the RNN model NumPy arrays of 25 integer indices and then use TensorFlow to encode each value. Karpathy's code takes in a sequence and outputs a sequence, so our output has the same length as the input sequence.

In [0]:
############################### CONSTRUCTION PHASE ##############################
X          = tf.placeholder(tf.int32, [None, timesteps],name='Input_batch') # (batch_size, 25)
encoded    = tf.one_hot(X,depth=num_classes) # encodes each row to get: (batch_size, 25, 69)
Y          = tf.placeholder(tf.int32, [None,timesteps], name="Output")
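
The change from 2D to 3D that one-hot encoding causes can be seen without TensorFlow. The sketch below uses NumPy as a stand-in for tf.one_hot; the vocabulary size of 5 and the `toy_` names are hypothetical.

```python
import numpy as np

toy_vocab_size = 5                       # hypothetical; the notebook has 69
toy_batch = np.array([[0, 2, 4],
                      [1, 1, 3]])        # shape (batch_size=2, timesteps=3)

# Indexing the identity matrix with an integer array returns one row per index,
# which appends a new axis of length toy_vocab_size.
toy_encoded = np.eye(toy_vocab_size, dtype=np.float32)[toy_batch]

print(toy_batch.shape)      # 2D: (batch_size, timesteps)
print(toy_encoded.shape)    # 3D: (batch_size, timesteps, vocab_size)
```

Each integer becomes a vector that is zero everywhere except at the position of that integer, so taking the argmax over the last axis recovers the original indices.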

Below we define our RNN cell: a basic tanh RNN cell wrapped so that each output is projected to num_classes values.

In [0]:
cell    = tf.contrib.rnn.OutputProjectionWrapper(
        tf.contrib.rnn.BasicRNNCell(num_units = num_hidden,activation = tf.nn.tanh),
        output_size = num_classes)


Instructions for updating:
This class is equivalent as tf.keras.layers.SimpleRNNCell, and will be replaced by that in Tensorflow 2.0.


The lines below perform the first four lines of the first for loop in the loss function of Karpathy's Python code. We receive "outputs", which corresponds to Karpathy's "ys" dictionary, and "states", which corresponds to Karpathy's "hs" dictionary. Note that dynamic_rnn returns only the final hidden state in "states", so it matches the last entry of "hs" rather than the whole dictionary.

In [0]:
outputs, states = tf.nn.dynamic_rnn(cell, encoded, dtype=tf.float32)
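
To show what the wrapped cell and dynamic_rnn compute, here is a minimal NumPy sketch of a vanilla RNN forward pass with an output projection. It is an illustration under assumptions, not the TensorFlow implementation: the sizes are hypothetical toy values, the weights are random, and the weight names are borrowed from Karpathy's gist but stored transposed so that a whole batch can be processed as row vectors.

```python
import numpy as np

rng = np.random.RandomState(0)
batch, steps, vocab, hidden = 2, 4, 5, 3   # hypothetical toy sizes

# Randomly initialised weights (row-vector convention, transposed relative to the gist).
Wxh = rng.randn(vocab, hidden) * 0.01
Whh = rng.randn(hidden, hidden) * 0.01
Why = rng.randn(hidden, vocab) * 0.01
bh = np.zeros(hidden)
by = np.zeros(vocab)

# One-hot encoded input of shape (batch, steps, vocab).
x_onehot = np.eye(vocab)[rng.randint(vocab, size=(batch, steps))]

h = np.zeros((batch, hidden))          # initial hidden state
ys = []
for t in range(steps):
    h = np.tanh(x_onehot[:, t] @ Wxh + h @ Whh + bh)   # hidden state update
    ys.append(h @ Why + by)                            # projection to vocabulary scores

toy_outputs = np.stack(ys, axis=1)     # (batch, steps, vocab), plays the role of outputs
toy_final_state = h                    # (batch, hidden), plays the role of states
print(toy_outputs.shape, toy_final_state.shape)
```

The projection step is the part that OutputProjectionWrapper adds: without it, each output would have `hidden` values instead of one score per character.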


In Karpathy's Python RNN code, backpropagation is implemented from scratch and the parameters are updated with Adagrad, which we did not do. Instead we used TensorFlow's AdamOptimizer: the cross-entropy between the softmax of our outputs and the true next characters gives the loss for this multi-class classification problem, and the optimizer reduces it. As we trained the program, we kept checking that the loss was decreasing.

In [0]:
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels = Y, logits = outputs,name="xentropy")
loss        = tf.reduce_mean(xentropy,name="loss")
optimizer   = tf.train.AdamOptimizer(learning_rate = learning_rate)
training_op = optimizer.minimize(loss)
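
The shape requirement behind the loss (integer labels with one fewer dimension than the logits) can be sketched in NumPy. This is an illustrative re-implementation under assumptions, not TensorFlow's code; `sparse_softmax_xent` and the `toy_` names are hypothetical.

```python
import numpy as np

def sparse_softmax_xent(logits, labels):
    """Per-position cross-entropy for integer labels.

    logits: float array of shape (..., num_classes)
    labels: int array whose shape equals logits.shape[:-1]
    """
    if labels.shape != logits.shape[:-1]:
        raise ValueError("labels must have one fewer dimension than logits")
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability of the correct class at every position.
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked[..., 0]

toy_logits = np.zeros((2, 3, 4))                 # (batch, timesteps, classes), all scores equal
toy_labels = np.array([[0, 1, 2], [3, 0, 1]])    # (batch, timesteps)
toy_loss = sparse_softmax_xent(toy_logits, toy_labels).mean()
print(toy_loss)   # equal scores give ln(4) at every position
```

With 3D logits of shape (batch_size, 25, 69), the labels therefore need shape (batch_size, 25), which is the shape of the Y placeholder.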

The code below corresponds to the fifth line of the first for loop in the loss function of Karpathy's Python code. It gives us the probabilities for our outputs.

In [0]:
probabilities = tf.nn.softmax(outputs)
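
tf.nn.softmax normalizes over the last axis by default, so every time step of every sequence gets its own distribution over characters. The NumPy sketch below mirrors that behavior on hypothetical toy scores and then selects the distribution for the first sequence at its last time step; `softmax_last_axis` and the `toy_` names are illustrative.

```python
import numpy as np

def softmax_last_axis(scores):
    """Softmax over the final axis, shifted by the max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

toy_scores = np.random.RandomState(1).randn(2, 3, 4)   # (batch, timesteps, classes)
toy_probs = softmax_last_axis(toy_scores)

next_char_dist = toy_probs[0][-1]   # first sequence, last time step
print(toy_probs.shape)
print(next_char_dist.shape, next_char_dist.sum())
```

The selected vector has one probability per character and sums to 1, which is what makes it usable as the `p` argument of np.random.choice.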

Below we define the correct predictions and the accuracy of our RNN model by comparing the most likely character at each timestep with the target character.

In [0]:
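# outputs has shape (batch_size, timesteps, num_classes) and Y holds integer labels
# of shape (batch_size, timesteps), so take the argmax over the class axis only
predicted   = tf.argmax(outputs, axis=2, output_type=tf.int32)
correct     = tf.equal(predicted, Y)
accuracy    = tf.reduce_mean(tf.cast(correct, tf.float32))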
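
For sequence outputs, per-character accuracy compares the argmax over the class axis with the integer labels. The NumPy sketch below works through that computation on hypothetical toy values; the `toy_` names are illustrative.

```python
import numpy as np

toy_acc_logits = np.array([[[0.1, 2.0, 0.3],
                            [1.5, 0.2, 0.1]],
                           [[0.0, 0.1, 3.0],
                            [0.2, 0.9, 0.4]]])   # (batch=2, timesteps=2, classes=3)
toy_acc_labels = np.array([[1, 0],
                           [2, 2]])              # (batch=2, timesteps=2)

toy_pred = toy_acc_logits.argmax(axis=-1)        # most likely class per position
toy_accuracy = (toy_pred == toy_acc_labels).mean()
print(toy_pred)
print(toy_accuracy)   # 3 of the 4 positions match
```

Taking the argmax over axis 1 instead would compare time steps rather than classes, which is why the class axis has to be named explicitly.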

In [0]:
init = tf.global_variables_initializer()

### Producing Mini Batches

Below is how we fetch mini-batches of our text data to train our RNN model.

In [0]:
#### FETCH BATCH ####
def fetch_batch(epoch, batch_size):
    np.random.seed(epoch * batch_size)
    indices = np.random.randint(len(X_modified), size=batch_size)
    X_batch = X_modified[indices] #(batch_size,25)
    Y_batch = y_modified[indices]
    return X_batch, Y_batch
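
One side effect worth knowing about: np.random.seed inside fetch_batch reseeds NumPy's global generator on every call, so any later np.random draws in the same epoch follow a stream fixed by the epoch number. A possible alternative, sketched below under assumptions, keeps a local generator instead. `fetch_batch_local` and the `toy_` arrays are hypothetical and are not part of the notebook.

```python
import numpy as np

def fetch_batch_local(X_all, y_all, step, batch_size):
    """Hypothetical variant of fetch_batch that keeps its own random generator."""
    rng = np.random.RandomState(step * batch_size)   # same seed rule as fetch_batch
    indices = rng.randint(len(X_all), size=batch_size)
    return X_all[indices], y_all[indices]

toy_inputs = np.arange(20).reshape(10, 2)    # 10 hypothetical sequences of length 2
toy_targets = toy_inputs + 1                 # hypothetical labels aligned row by row
xb, yb = fetch_batch_local(toy_inputs, toy_targets, step=3, batch_size=4)
print(xb.shape, yb.shape)
```

The batches stay reproducible for a given step, and the inputs and labels stay aligned because both are indexed with the same random rows.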

### Execution

Here is the execution phase of our RNN model. We print the epoch, which corresponds to the iteration in Karpathy's code, and the loss. The inner for loop generates text in the same way as Karpathy's sample function. It starts from the first sequence of the current mini-batch, which plays the role of inputs[0] in the call inside Karpathy's while loop: sample_ix = sample(hprev, inputs[0], 200).

In Karpathy's Python program, the first value of inputs is sent to the sample function and one-hot encoded so that the model can predict the following index. His sample function then keeps predicting for the remaining steps, and each next index is drawn by sampling from the list of probabilities rather than by always taking the most likely character. In our code this is done by the following line: pred_index=np.random.choice(range(len(probs[0][-1])), p=probs[0][-1].ravel()). The sampled index is then appended to the input window, and the oldest index is dropped, before the next prediction.

After continuing to save the predicted indices, we then translated the list into characters and output our predicted text.
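
The sampling procedure described above can be sketched without a trained network. In the sketch below, `toy_probabilities` is a hypothetical stand-in for probabilities.eval that simply favors "previous index + 1"; the vocabulary size, window length, and seed window are also hypothetical.

```python
import numpy as np

toy_vocab = 4
toy_window = 3

def toy_probabilities(x_window):
    """Hypothetical stand-in for probabilities.eval: returns (1, window, vocab)."""
    probs = np.full((1, x_window.shape[1], toy_vocab), 0.1)
    # Put most of the mass on "previous index + 1" so the behavior is easy to follow.
    for t, ix in enumerate(x_window[0]):
        probs[0, t, (ix + 1) % toy_vocab] = 0.7
    return probs

rng = np.random.RandomState(0)
x_temp = np.array([[0, 1, 2]])            # seed window, shape (1, toy_window)
sampled = []
for _ in range(10):
    probs = toy_probabilities(x_temp)
    next_ix = rng.choice(toy_vocab, p=probs[0][-1])   # sample from the last time step
    sampled.append(int(next_ix))
    # Slide the window: drop the oldest index, append the sampled one.
    x_temp = np.append(x_temp, next_ix)[1:].reshape(1, -1)

print(sampled)
```

Sampling from the distribution, instead of taking the argmax, keeps some randomness in the generated text, and the sliding window keeps the input at the fixed length the placeholder expects.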

In [0]:
########################## EXECUTION PHASE ##############################
with tf.Session() as sess:
    init.run() 
    for epoch in range(1,num_epochs+1):
        x_batch,y_batch = fetch_batch(epoch,batch_size)
        sess.run(training_op, feed_dict={X: x_batch, Y: y_batch})
        loss_val=loss.eval(feed_dict={X: x_batch, Y: y_batch})
        if epoch%20==0:
            print ('Epoch: ', epoch,"Loss: ",loss_val)
            predictions=[]
            x_temp=np.reshape(x_batch[0],(1,-1))
            #sampling 
            for i in range(200):
                probs=probabilities.eval(feed_dict={X: x_temp})
                pred_index=np.random.choice(range(len(probs[0][-1])), p=probs[0][-1].ravel())
                predictions.append(pred_index)
                x_temp=np.reshape(np.append(x_temp,pred_index)[1:],(1,-1))
            txt = ''.join(ix_to_char[ix] for ix in predictions)
            print ('----\n %s \n----' % (txt, ))


Epoch:  20 Loss:  3.3510735
----
 hd-, endeo?,!l QfJrduks  s eeaiDeUvk hhgtcNuv eeouDd mgb,erD(opih nn rrfh ,ewpdoihqFhukeDK-e  d nidnmvemfrslsNaueMdaeJheI w ip slose' seYn 'edTawpyheb k lejs ki esueeobe J r ,a C hesaoeiCdhwk onl e ae 
----
Epoch:  40 Loss:  3.1767948
----
 oseyiu
 ehs
ityd od Ma  l iato''ittpe le  e yrlrfoeb oTta oIsIyna !hllmr oii ycsi H t'w t inee tuhon!teti'lesr]oln olsoaom t l  lcR, aalDs,rkitna
ksuigCr- endetcervs,s e-od n   -dr:eee o  sh si  nehei 
----
Epoch:  60 Loss:  3.0921996
----
 g s rirMtnk k a d
w  haa ulmen  s  sen'iigs  fi p tIiolwsltnd
tenesases,uihegis tlst  hsveb'ynls  hot rlr scdtinttohh_y , p ildtnshndrscisuesAede  ngiiWhesoeaeir.e rtaO vrie os g   s uihea'ege wcsdhdn 
----
Epoch:  80 Loss:  3.0234797
----
 e,me lsEus si d Edetpt cW "ieu'yohans orreno
a s'asqaOxn yhi ,oto; r lr e es merke"re  rotff ae,Harsefe riouahn kia eitmo  moisuepe' hed  t Dtahtcpit oyotpuusaoguagshtneosartED'te ge ewust ncewhaann   
----
Epoch:  100 Loss:  2.9267733
----
 t

KeyboardInterrupt: ignored

In [0]:
with tf.Session() as sess:
    init.run() 
    for epoch in range(1,num_epochs+1):
        x_batch,y_batch = fetch_batch(epoch,batch_size)
        sess.run(training_op, feed_dict={X: x_batch, Y: y_batch})
        loss_val=loss.eval(feed_dict={X: x_batch, Y: y_batch})
    print ("Loss: ",loss_val)
    predictions=[]
    x_temp=np.reshape(x_batch[0],(1,-1))
    #sampling 
    for i in range(200):
      probs=probabilities.eval(feed_dict={X: x_temp})
      pred_index=np.random.choice(range(len(probs[0][-1])), p=probs[0][-1].ravel())
      predictions.append(pred_index)
      x_temp=np.reshape(np.append(x_temp,pred_index)[1:],(1,-1))
    txt = ''.join(ix_to_char[ix] for ix in predictions)
    print ('----\n %s \n----' % (txt, ))

Loss:  1.582048
----
 p a lotkly pleasces,' lare of herrystares
so to--quite oonning hold, it's kel makings!' Alise herryouid!' sardblst
about: notsefid itmesxiould,' said Alice.

Alice to one, you know the sholked twintw, 
----


# Part 2

For Part 2 we tried different combinations of the model's hyperparameters to see whether we could reach better results. We increased the batch size, the number of timesteps, and the sequence length. Although this increased training time, we wanted to see if our results would improve.

In [0]:
num_hidden     = 100 # size of hidden layer of neurons
num_epochs     = 4000
learning_rate  = 0.001
batch_size     = 400 # changed 
training_steps = 5
timesteps      = 40 # changed
seq_length     = 40 # changed
num_classes    = len(char_to_ix) 
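
As a rough way to see why these changes lengthen training, the sketch below counts how many (input, target) character pairs each mini-batch contains under the two settings. `chars_per_step` is a hypothetical helper, and this count is only a proxy for the work done per training step.

```python
def chars_per_step(batch_size, timesteps):
    """Number of (input, target) character pairs in one mini-batch."""
    return batch_size * timesteps

part1 = chars_per_step(batch_size=200, timesteps=25)   # Part 1 settings
part2 = chars_per_step(batch_size=400, timesteps=40)   # Part 2 settings
print(part1, part2, part2 / part1)
```

Each step of the second model processes 3.2 times as many characters as a step of the first, so the same number of epochs exposes it to more text.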

The following code is the same as above; for this model we only changed the hyperparameters.

In [0]:
#create training sequences and corresponding labels
X = []
y = []
for i in range(0, len(data)-seq_length-1, 1):
        X.append([char_to_ix[ch] for ch in data[i:i+seq_length]])
        y.append([char_to_ix[ch] for ch in data[i+1:i+seq_length+1]])
X_modified = np.reshape(X, (len(X), seq_length))
y_modified = np.reshape(y, (len(y), seq_length))

In [0]:
tf.reset_default_graph()

In [0]:
X          = tf.placeholder(tf.int32, [None, timesteps],name='Input_batch') # (batch_size, 40)
encoded    = tf.one_hot(X,depth=num_classes) # encodes each row to get: (batch_size, 40, 69)
Y          = tf.placeholder(tf.int32, [None,timesteps],name="Output")

In [0]:
cell    = tf.contrib.rnn.OutputProjectionWrapper(
        tf.contrib.rnn.BasicRNNCell(num_units = num_hidden,activation = tf.nn.tanh),
        output_size = num_classes)
outputs, states = tf.nn.dynamic_rnn(cell, encoded, dtype=tf.float32)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels = Y, logits = outputs,name="xentropy")
loss        = tf.reduce_mean(xentropy,name="loss")
optimizer   = tf.train.AdamOptimizer(learning_rate = learning_rate)
training_op = optimizer.minimize(loss)
probabilities = tf.nn.softmax(outputs) 

In [0]:
init = tf.global_variables_initializer()

After running the second model for the same number of epochs, it reached a lower final loss than the first model (about 1.47 versus 1.58 in the runs shown).

In [0]:
with tf.Session() as sess:
    init.run() 
    for epoch in range(1,num_epochs+1):
        x_batch,y_batch = fetch_batch(epoch,batch_size)
        sess.run(training_op, feed_dict={X: x_batch, Y: y_batch})
        loss_val=loss.eval(feed_dict={X: x_batch, Y: y_batch})
        if epoch%20==0:
            print ('Epoch: ', epoch,"Loss: ",loss_val)
            predictions=[]
            x_temp=np.reshape(x_batch[0],(1,-1))
            #sampling 
            for i in range(200):
                probs=probabilities.eval(feed_dict={X: x_temp})
                pred_index=np.random.choice(range(len(probs[0][-1])), p=probs[0][-1].ravel())
                predictions.append(pred_index)
                x_temp=np.reshape(np.append(x_temp,pred_index)[1:],(1,-1))
            txt = ''.join(ix_to_char[ix] for ix in predictions)
            print ('----\n %s \n----' % (txt, ))

Epoch:  20 Loss:  3.2930088
----
 h gi  pehed n o R!ttuntHUnoreeeoCroloQo-ltreTlho)Seaietnmr utnohmhiw e v aW
 d'; WsoitIo iW'wo sogKeeel,[rIdhn-M Vs ,as hkhlns tmia ajdmtt   :
hwo
 stss e  tLr'zu oneabw y 

d feu Fh u? ihy
(e'M'tA
_T 
----
Epoch:  40 Loss:  3.1733181
----
 eo'te g yeuurt n e
_aannm e   te
omhde  a-kaeek ' 
hsTuoxgohagtnsre i sn . fond ssuer-'af nt   ,o si.e,elrf al omu ktnd rrwsye
hf'ts  ee.hB l t' Uatntaclaw,to dgleied  m  ahi oleshenld'tucrae laognlh, 
----
Epoch:  60 Loss:  3.1233652
----
 wed  rder  ep'phu   nataa
 
r .ag  o n  ua-edlthr iun o
nh ensaniven w detr
 ;aeh n:ihr s l  Aus gdee
eoaelory t ins,gee Tc una "ht 
ig  aihir Beho wlr'anmsnwrec uesehadrue te  tgcp'hehiortitwunuae 
s 
----
Epoch:  80 Loss:  3.0606353
----
 fNfolwg h'''e ekevn pttt   pmeo tuilsihwoilde etn'ike
wi!uicce d eld aeiudvssaf,run ofoicdf dee teg to  ihrVahodf tein aul iarg .ti a
'unin efhiwaekdeha Iihuhvfe dmrnorl souerda  H mue o ecI
e o. llda 
----


KeyboardInterrupt: ignored

In [0]:
with tf.Session() as sess:
    init.run() 
    for epoch in range(1,num_epochs+1):
        x_batch,y_batch = fetch_batch(epoch,batch_size)
        sess.run(training_op, feed_dict={X: x_batch, Y: y_batch})
        loss_val=loss.eval(feed_dict={X: x_batch, Y: y_batch})
    print ("Loss: ",loss_val)
    predictions=[]
    x_temp=np.reshape(x_batch[0],(1,-1))
    #sampling 
    for i in range(200):
      probs=probabilities.eval(feed_dict={X: x_temp})
      pred_index=np.random.choice(range(len(probs[0][-1])), p=probs[0][-1].ravel())
      predictions.append(pred_index)
      x_temp=np.reshape(np.append(x_temp,pred_index)[1:],(1,-1))
    txt = ''.join(ix_to_char[ix] for ix in predictions)
    print ('----\n %s \n----' % (txt, ))

Loss:  1.4670002
----
 Lor OUR said Alice; 'Of
charsels, caring?' Alice,' and had seand the Duchess makes, and Gryphon, and hel putsing it was lookhang, chanst
maves--'

'I' lef 'No--ad you her myone, and walket into reelin 
----


# Conclusion

After many hours of trying different approaches through trial and error, we are happy with our implementation of Karpathy's Python code. We expect that the longer our models run, the better their results would be, with clearer words and more complete sentences. In just 1000 epochs, we saw a significant drop in our loss from around 3.4 to around 1.9. With 2000 epochs the loss dropped as low as 1.70 with the first model and 1.64 with the second model, which suggests that results keep improving as the model trains longer. Overall, this project was a new kind of challenge for us, and looking back we are glad we had the chance to work on it.
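
One way to put these loss values in context, offered here as an interpretation rather than as part of the original experiments, is to compare them with the loss of a model that guesses uniformly over the 69 characters. The cross-entropy is measured in nats, so that baseline is ln(69), and exp(loss) gives the perplexity, the effective number of equally likely choices per character.

```python
import math

vocab_size = 69                          # from the data summary printed in Part 1
uniform_loss = math.log(vocab_size)      # loss of a model that guesses uniformly
print(round(uniform_loss, 2))

for reported_loss in (3.4, 1.9, 1.70, 1.64):
    # exp(loss) is the perplexity for a cross-entropy measured in nats.
    print(reported_loss, round(math.exp(reported_loss), 1))
```

All of the reported losses sit below the uniform baseline, and each drop in loss corresponds to a smaller set of characters the model is effectively choosing among.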