### Chapter 3 (Neural Prediction) 

#### Notes
1. Parametric learning (a fixed number of parameters; predictions are tuned by trial and error) versus non-parametric learning (the number of parameters depends on the data, i.e. the features/predictors; predictions are made by counting and probabilities).
2. In a neural network, it's useful to think about the weights coming into a node rather than the weights going out of it. That way, all the weights coming into a node are jointly responsible for its output/prediction (so a weight vector should hold all the weights coming into one node).
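
The "weights coming into a node" view from note 2 can be sketched as follows. This is a minimal illustration that reuses the input and weight values from the code cell below; the names `weights_into_nodes`, `per_node` and `all_at_once` are made up for this example.

```python
import numpy as np

# Values reused from the notes; the per-node loop is only illustrative.
inputs = np.array([8.5, 0.65, 1.2])   # toes, win/loss record, fans

# Each row holds all the weights coming into ONE output node.
weights_into_nodes = np.array([
    [0.1, 0.1, -0.3],   # weights coming into output node 0
    [0.1, 0.2,  0.0],   # weights coming into output node 1
    [0.0, 1.3,  0.1],   # weights coming into output node 2
])

# One dot product per output node, using only that node's incoming weights.
per_node = np.array([inputs.dot(w) for w in weights_into_nodes])

# The same predictions computed in a single matrix-vector product.
all_at_once = weights_into_nodes.dot(inputs)

print(per_node)
print(all_at_once)
```

Both computations give the same three predictions, which is why a multi-output layer can be read as a stack of independent weighted sums.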

In [0]:
import numpy as np

num_toes = np.array([8.5, 9.5, 9.9, 9.0])
wlrec = np.array([0.65, 0.8, 0.8, 0.9])
nfans = np.array([1.2, 1.3, 0.5, 1.0])

weights = np.array([0.1, 0.2, 0])


# 1. Neural Network with multiple predictors and one output
#    This is the weighted sum/dot product function central to a neural network prediction.
#    Intuition - The more similar the two vectors, the higher the prediction.
def w_sum(inputs, weights):
    assert(len(inputs) == len(weights))
    return inputs.dot(weights) # same as sum(inputs * weights)

def neural_network_prediction_1(input_vector, weights):
    return w_sum(input_vector, weights)

input_vector = np.array([num_toes[0], wlrec[0], nfans[0]])    
print((neural_network_prediction_1(input_vector, weights)))

# 2. Neural Network with multiple predictors and multiple outputs
weight_matrix = np.matrix([[0.1, 0.1, -0.3],[0.1, 0.2, 0.0],[0.0, 1.3, 0.1]])

def neural_network_prediction_2(input_vector, weight_matrix):
    assert(np.shape(input_vector)[1] == np.shape(weight_matrix)[0])
    return np.matmul(input_vector, weight_matrix)

print((neural_network_prediction_2(input_vector.reshape(1, 3), np.transpose(weight_matrix))))

# Stacked Neural Layers.
hidden_layer_matrix = np.matrix([[0.1, 0.2, -0.1],[-0.1,0.1, 0.9], [0.1, 0.4, 0.1]])
output_layer_matrix = np.matrix([[0.3, 1.1, -0.3], [0.1, 0.2, 0.0], [0.0, 1.3, 0.1]])
def neural_network_prediction_3(input_vector, hidden_weight_matrix, output_layer_weight_matrix):
    assert(np.shape(input_vector)[1] == np.shape(hidden_weight_matrix)[0])
    hidden_output = np.matmul(input_vector, hidden_weight_matrix)
    return np.matmul(hidden_output, output_layer_weight_matrix)

print((neural_network_prediction_3(
        input_vector.reshape(1, 3), 
        np.transpose(hidden_layer_matrix),
        np.transpose(output_layer_matrix))))

### Chapter 4 (Neural Learning - Compare and Learn)

#### Notes
1. Good analogy for why we square errors: we want to ignore small errors and take large errors seriously. Parents do the same: they overlook our small mistakes but punish us for larger ones.
2. We also want a positive error function so that errors don't cancel out. Example: the first data point has error 100 and the second has -100. The overall error is 0, which wrongly suggests a perfect fit.
3. Good analogy: learning is 'error attribution', the art of figuring out how each weight contributed to the overall error. Learning is also a search problem: we search the space of all weights for the set that gives the lowest possible error.
4. Another way to think about gradient descent (assuming squared error): the weight update is [(pred - goalPred) * input]. Here (pred - goalPred) gives the direction and amount, while input is responsible for scaling (if the input is big, the weight update should be big too), negative reversal (when pred > goalPred we normally lower the weight to lower the prediction; with a negative input, lowering the weight would raise the prediction instead, so multiplying by the input flips the sign of the update and keeps it pointing the right way) and stopping (if the input coming into the node is 0, there should be no update).
5. The slope of the error curve tells us which way is downhill: moving the weight in the direction opposite to the slope's sign reduces the error.
6. We saw before that learning is just adjusting weights so that the error becomes ~0 and we make better predictions. In error = [((input * weight) - goalPred) ** 2], the only moving part we are allowed to modify is the weight. We cannot change input or goalPred. So it's all about understanding the relationship between the error and the weights.
7. When the input is large => the prediction is large => the error/derivative is large => the weight update is large. Picture it in terms of a parabola. If the weight update is too large, we overshoot and land on the other side. In the next step we overshoot again and land back on the original side (and if each overshoot is larger than the last, the error diverges). So we need a parameter that controls the size of the weight update. This is the learning rate.
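
The relationship between error and weight in note 6 can be checked numerically. The sketch below is not from the original notes: it uses the same numbers as the code cell that follows (input 2.0, goal 0.8, weight 0.5), and the helper names `error`, `analytic_gradient` and `numeric_gradient` are illustrative. Note that the exact derivative carries a factor of 2, which the notes' code drops (it can be thought of as absorbed into the learning rate).

```python
def error(weight, input_val=2.0, goal_pred=0.8):
    """Squared error of a one-weight network: (input * weight - goal) ** 2."""
    return (input_val * weight - goal_pred) ** 2

def analytic_gradient(weight, input_val=2.0, goal_pred=0.8):
    """d(error)/d(weight) = 2 * (pred - goal) * input."""
    pred = input_val * weight
    return 2.0 * (pred - goal_pred) * input_val

def numeric_gradient(weight, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (error(weight + eps) - error(weight - eps)) / (2.0 * eps)

weight = 0.5
print(analytic_gradient(weight))   # exact derivative
print(numeric_gradient(weight))    # should agree closely

# One update step with a learning rate moves the weight downhill.
lr = 0.1
new_weight = weight - lr * analytic_gradient(weight)
print(error(weight), error(new_weight))
```

The error after the step is smaller than before it, which is the behaviour the learning rate is meant to preserve.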


In [0]:
print("Without Learning Rate")
# Single Input - Single Output Network
actual_pred = 0.8
input_val = 2.0
weight_val = 0.5

for i in range(10):
    pred = input_val * weight_val
    mse_error = (pred - actual_pred) ** 2
    delta_error = (pred - actual_pred) # gradient of the squared error w.r.t. pred (constant factor 2 dropped)
    delta_weight = delta_error * input_val
    weight_val -= delta_weight
    print("Error = " + str(mse_error))

print("\nWith Learning Rate")

# Single Input - Single Output Network - with learning rate
actual_pred = 0.8
input_val = 2.0
weight_val = 0.5
lr = 0.1

for i in range(10):
    pred = input_val * weight_val
    mse_error = (pred - actual_pred) ** 2
    delta_error = (pred - actual_pred) # gradient of the squared error w.r.t. pred (constant factor 2 dropped)
    delta_weight = delta_error * input_val * lr
    weight_val -= delta_weight
    print("Error = " + str(mse_error))    
    

### Generalizing Gradient Descent - Learning multiple weights at a time

#### Notes
1. The idea is simple: as before, we compute the network's error and its delta (pred - goal) for the whole network. Each weight's update is that delta multiplied by the weight's own input, so each weight's share of the overall error is attributed according to how much input came its way.
2. Good debugging strategy: if the weights explode after a few iterations, the learning rate is probably missing or too large. Try the first code sample with a large lr. Even 0.1 is too large in this case.
3. With lr = 0.01, watching the weight updates shows that the weight with the largest input changes the most, so some weights move slowly compared to others. This is why we normalize the input data, so that all weights learn at a similar pace, and it is also why an lr of 0.1 was too large: one learning rate has to balance all the error curves.
4. When one weight is frozen (not updated), the overall error can still approach 0. This means the network has learnt to fit the training data without that input. This can be dangerous, since the same assumption may not hold for the data we predict on.
5. Just as multiple inputs with one output give one error_delta shared across multiple inputs, a single input with multiple outputs gives one input shared across multiple error_deltas.
6. Gradient descent is a general learning algorithm. If we can combine weights in a way that lets us calculate an error and its delta, GD can help drive that error toward zero.
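
The single-input, multiple-output case from note 5 has no code cell of its own, so here is a minimal sketch. It assumes one input (the first win/loss record, 0.65) and the first `hurt`, `win` and `sad` targets used elsewhere in these notes; the starting weights and the learning rate are made-up values.

```python
import numpy as np

# One input predicting three outputs; starting weights are illustrative.
input_val = 0.65
goals = np.array([0.1, 1.0, 0.1])     # hurt, win, sad for the first game
weights = np.array([0.3, 0.2, 0.9])   # one weight per output node
lr = 0.1

errors = []
for step in range(50):
    preds = input_val * weights        # three predictions from one input
    deltas = preds - goals             # one error delta per output
    errors.append(np.sum(deltas ** 2))
    weights -= lr * deltas * input_val # every delta is scaled by the shared input

print(errors[0], errors[-1])
```

Each weight sees its own delta but the same input, the mirror image of the multiple-input case where every weight sees its own input but the same delta.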

In [0]:
# Multiple Inputs - One Output -> Dig in. The first loop updates every weight; the second freezes weights[0] so we can see what happens when one weight is frozen
import numpy as np

'''
lr=0.01
weights = np.array([0.1, 0.2, -0.1])

toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]
win_or_lose_binary = [1, 1, 0, 1]

def pred(weights, inputs):
    assert(len(weights) == len(inputs))
    return inputs.dot(weights)


for i in range(4):
    print("Iteration " + str(i))
    print("Weights " + str(weights))
    network_pred = pred(weights, np.array([toes[0], wlrec[0], nfans[0]]))
    print("Prediction = " + str(network_pred))
    mse_error = (network_pred - win_or_lose_binary[0])**2
    print("MSE_Error = " + str(mse_error))
    absolute_error = (network_pred - win_or_lose_binary[0]) # gradient of loss function
    weights[0] -= lr*absolute_error*toes[0]
    weights[1] -= lr*absolute_error*wlrec[0] 
    weights[2] -= lr*absolute_error*nfans[0]
    print("\n")

# Multiple Inputs - One Output -> Dig in (freezing one weight i.e. we'll see what will happen when you freeze one weight)
for i in range(4):
    print("Iteration " + str(i))
    print("Weights " + str(weights))
    network_pred = pred(weights, np.array([toes[0], wlrec[0], nfans[0]]))
    print("Prediction = " + str(network_pred))
    mse_error = (network_pred - win_or_lose_binary[0])**2
    print("MSE_Error = " + str(mse_error))
    absolute_error = (network_pred - win_or_lose_binary[0]) # gradient of loss function
    #weights[0] -= lr*absolute_error*toes[0]
    weights[1] -= lr*absolute_error*wlrec[0] 
    weights[2] -= lr*absolute_error*nfans[0]
    print("\n")    
 '''
 
# Gradient Descent with multiple inputs and multiple outputs. Assume we have 3 inputs nodes and 3 output nodes.
weights = np.array([ [0.1, 0.1, -0.3], [0.1, 0.2, 0.0], [0.0, 1.3, 0.1]])
toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65,0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]
hurt = [0.1, 0.0, 0.0, 0.1]
win = [ 1, 1, 0, 1]
sad = [0.1, 0.0, 0.1, 0.2]
lr = 0.01

inputs = np.reshape(np.array([toes[0], wlrec[0], nfans[0]]), (1,3))
outputs = np.reshape(np.array([hurt[0], win[0], sad[0]]), (1,3))

for i in range(4):
    network_pred = np.matmul(inputs, weights)
    print("Network Prediction = " + str(network_pred))
    mse_error = (network_pred - outputs)**2
    print("Weights = " + str(weights))
    print("MSE Error = " + str(mse_error))
    error_deltas = (network_pred - outputs) # gradient part, i.e. how far off each prediction is: the direction and amount of correction
    weight_deltas = error_deltas * np.reshape(inputs, (3,1))
    weights -= lr*weight_deltas

### Introduction to Backpropagation
#### Notes
1. Batch Gradient Descent: we calculate the error for the whole dataset in one iteration/epoch, so the change in error w.r.t. the weights is computed over all samples. There are two ways to combine the per-sample gradients: sum them, or average them over the batch size. Summing calls for a lower learning rate, since the sum can blow up the weight update. The disadvantage of this version is that the dataset often does not fit in memory, so the whole pass cannot be vectorized at once. That is why we use the other version.
2. Stochastic Gradient Descent: weights are updated after a single sample (or a mini-batch). This version also works better on non-convex error surfaces, since its noisy weight updates can help us escape local minima.
3. Overfitting intuition: overfitting means the model has learnt just the training data, not the actual correlation between inputs and outputs. Another way to look at it: imagine a set of weights that perfectly predicts the training outputs. The error is then zero, so the network stops learning. Many such weight combinations exist, so the network can settle on one that fits the training data without capturing the real correlation. We have to avoid this.
4. Conflicting pressure on weights: if two inputs are well correlated with the output and a third has no correlation with it, that third input is noise. Adjusting its weight injects noise into the updates of the other weights, which might already be well calibrated. Regularization addresses this by shrinking the noisy weights so that they don't interfere with the others.
5. Intuition for stacking layers in a NN: if there is no direct correlation between the input and output layers, we try to create an intermediate dataset that does have some correlation with the output. The hidden layers are this intermediate dataset.
6. When doing backprop, larger weights contribute more to the prediction and thus to the error (which in essence is how far the prediction is over or under). So when moving delta_error backwards we simply multiply it by the weights. This is a weighted error: each hidden node's delta is the weighted sum of the deltas of the nodes it feeds.
7. Great intuition for why we need non-linearities: without them, a 3-layer network just performs two rounds of weighted averaging during prediction, which a single weighted average can also accomplish, so nothing new is added. This is where non-linearities come in. We need to be able to create an intermediate dataset that correlates with the output even when the input does not correlate with it directly. Looking closely at a hidden node, it is connected to every input node, so it subscribes to correlation from each input. To create a new intermediate correlation, the node must be able to selectively turn off its correlation to some inputs, forming a new set of correlations that will hopefully relate to the output. For instance, relu turns the subscription off whenever the node's weighted sum is < 0. Understanding this tells us where to use which non-linearity. This is "Sometimes Correlation".
8. Also, when calculating error_delta backwards as described in point (6), we need to compute weights*outputDelta*derivativeOfRelu. This follows the same intuition: if relu turned the correlation off, that node does not contribute to the final error_delta.
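
Note 7's claim that two stacked linear layers add nothing over a single one can be verified directly. The sketch below uses small hand-picked matrices (not from the notes) so the result is easy to follow; `w_hidden`, `w_output` and the local `relu` are illustrative names.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

x = np.array([[1.0, 0.0, 1.0]])        # one sample with 3 inputs
w_hidden = np.array([[1.0, -1.0],
                     [0.5,  0.5],
                     [0.5, -0.5]])     # input -> hidden (3 x 2)
w_output = np.array([[1.0],
                     [1.0]])           # hidden -> output (2 x 1)

# Two stacked linear layers...
two_layers = (x @ w_hidden) @ w_output
# ...equal one linear layer whose weights are the product of the two.
one_layer = x @ (w_hidden @ w_output)
print(np.allclose(two_layers, one_layer))

# With relu in between, the second hidden node (pre-activation -1.5)
# is switched off, so the prediction changes.
with_relu = relu(x @ w_hidden) @ w_output
print(two_layers, with_relu)
```

Here the hidden pre-activations are 1.5 and -1.5. Without relu they cancel to 0.0; with relu the negative one is turned off and the output becomes 1.5, something no single weight matrix applied to `x` alone would produce for every input.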

In [0]:
import numpy as np

# Learning the whole dataset with a Neural Network. Batch Gradient Descent(Vectorizing). 3 input nodes and 1 output node
weights = np.reshape(np.array([0.5,0.48,-0.7]), (1,3))
lr=0.01

streetlights = np.array([[ 1, 0, 1 ],[ 0, 1, 1 ],[ 0, 0, 1 ],[ 1, 1, 1 ],[ 0, 1, 1 ],[ 1, 0, 1 ]])
walk_vs_stop = np.array([0,1,0,1,1,0])

print("\nBatch GD\n")
for epoch in range(10):
    print("WEIGHTS = " + str(weights))
    network_pred = np.matmul(weights, np.transpose(streetlights)) # (1,6) vector with predictions from all samples
    mse_error = (network_pred - walk_vs_stop)**2
    #print("MSE_ERROR = " + str(np.sum(mse_error)))
    delta_error = (network_pred - walk_vs_stop) # (1,6) vector of all delta_errors
    weights = weights - lr*np.matmul(delta_error, streetlights) # sums each weight's gradient over all samples. We can also take the average here

print("\nStochastic GD\n")
# Learning the whole dataset with a Neural Network. Stochastic Gradient Descent
for epoch in range(10):
    print("WEIGHTS = " + str(weights))
    for row in range(len(streetlights)):
        input_vec = np.array(streetlights[row])
        network_pred = np.dot(weights, np.transpose(input_vec))
        mse_error = (network_pred - walk_vs_stop[row])**2
        #print("MSE_ERROR = " + str(mse_error))
        delta_error = (network_pred - walk_vs_stop[row])
        weights = weights - lr*delta_error*input_vec


In [0]:
# Backpropagation (in a 3 layer network with relu activation function - Stochastic Gradient Descent)
import numpy as np

# This dataset does not have a direct correlation between input and output. Thus we need an intermediate layer that can create an intermediate dataset
streetlights = np.array( [[ 1, 0, 1 ], [ 0, 1, 1 ], [ 0, 0, 1 ], [ 1, 1, 1 ] ] )
walk_vs_stop = np.array([ 1, 1, 0, 0])

epochs, lr = 60, 0.2
input_nodes = 3
hidden_nodes = 4
output_nodes = 1

def relu(x):
    return (x > 0) * x # returns x if x > 0, return 0 otherwise

def relu2deriv(output):
    return output>0 # returns 1 for input > 0, return 0 otherwise

# Network Initialization
np.random.seed(1)
weights_0_1 = 2*np.random.rand(3, 4) - 1
weights_1_2 = 2*np.random.rand(4, 1) - 1

for j in range(epochs):
    mse_error = 0
    for i in range(len(streetlights)):
        # Predict and Compare
        inp, out = streetlights[i], np.array(walk_vs_stop[i]) #(3,) and (1,)
        out_0_1 = relu(np.dot(np.transpose(weights_0_1), inp)) #(4,)
        out_1_2 = np.dot(np.transpose(weights_1_2), out_0_1) # (1,)
        
        mse_error += np.sum((out_1_2 - out)**2) # accumulate squared error over the epoch
        delta_err_1_2 = out_1_2 - out #(1,)
        delta_err_0_1 = np.dot(weights_1_2, delta_err_1_2) * relu2deriv(out_0_1) #(4,). relu2deriv is the relu mask: it decides whether delta flows back through each hidden node
        
        # Learn
        weights_1_2 -= lr*np.dot(np.reshape(out_0_1, (4,1)), np.reshape(delta_err_1_2, (1,1))) #(4,1)
        weights_0_1 -= lr*np.dot(np.reshape(inp, (3,1)), np.reshape(np.transpose(delta_err_0_1), (1,4))) #(3,4)
        
        # report the summed error every 5th epoch, after the last sample
        if(j % 5 == 0 and i == len(streetlights)-1):
            print("MSE ERROR = " + str(mse_error))
        

### Neural Network Architectures + Learning Signals and Ignoring Noise
1. Takeaway from previous topics (Correlation Summarization): a neural network tries to find correlation between its input and output layers, sometimes by creating an artificial intermediate correlation.
2. A good neural architecture channels signal so that finding correlation is easy and fast. For instance, think of the way CNNs work.
3. Good neural architectures channel signal so that correlation is easy to discover. Great architectures also filter noise to help prevent overfitting.

In [0]:
# Get the MNIST Dataset and only extract a small subset of it
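from sklearn.datasets import fetch_openml  # fetch_mldata was removed in scikit-learn 0.22
import numpy as np
import matplotlib.pyplot as plt

print("fetching mnist")
# 'mnist_784' is the OpenML copy of MNIST; as_frame=False returns NumPy arrays
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
print("fetched")

np.random.seed(1234) # set seed for deterministic ordering
p = np.random.permutation(mnist.data.shape[0])
X = mnist.data[p]
Y = mnist.target.astype(np.float32)[p] # OpenML returns the labels as strings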

# Show some samples from dataset
for i in range(10):
    plt.subplot(1,10,i+1)
    plt.imshow(X[i].reshape((28,28)), cmap='Greys_r')
    plt.axis('off')
plt.show()

TRAIN_SET_SIZE = 5000
TEST_SET_SIZE = 5000

X = X.astype(np.float32)/255
X_train = X[:TRAIN_SET_SIZE]
X_test = X[TRAIN_SET_SIZE:(TRAIN_SET_SIZE+TEST_SET_SIZE)]
Y_train = Y[:TRAIN_SET_SIZE]
Y_test = Y[TRAIN_SET_SIZE:(TRAIN_SET_SIZE+TEST_SET_SIZE)]

In [0]:
# Running Backprop on MNIST Dataset. Same code as before. 3 layer NN

lr = 0.005
epochs = 100
input_size = 784
hidden_size = 40
output_size = 10

weights_0_1 = 0.2*np.random.rand(input_size, hidden_size) - 0.1
weights_1_2 = 0.2*np.random.rand(hidden_size, output_size) - 0.1

def relu(x):
    return (x > 0) * x # returns x if x > 0, return 0 otherwise

def relu2deriv(output):
    return output>0 # returns 1 for input > 0, return 0 otherwise

def oneHotEncode(num, total_labels):
    assert(num >= 0.0 and num < total_labels)
    arr = np.zeros(int(total_labels)) # one slot per label
    arr[int(num)] = 1.0
    return arr
    
    
for i in range(epochs):
    for j in range(500):
        # Predict and Compare
        inp, out = np.reshape(X_train[j], (1, input_size)), np.reshape(oneHotEncode(Y_train[j], float(output_size)), (1,output_size))
        layer1_out = relu(np.dot(inp, weights_0_1)) # (1, hidden_size)
        layer2_out = np.dot(layer1_out, weights_1_2) # (1, output_size)
        
        mse_error = (layer2_out - out)**2 # (1, output_size)
        delta_layer2 = layer2_out - out # (1, output_size)
        delta_layer1 = np.dot(delta_layer2, np.transpose(weights_1_2)) * relu2deriv(layer1_out) # (1, hidden_size)
        
        # Learn
        weights_1_2 -= lr*np.dot(np.transpose(layer1_out), delta_layer2) # (hidden_size, output_size)
        weights_0_1 -= lr*np.dot(np.transpose(inp), delta_layer1)
        
        # Print error in between iterations (on the last of the 500 samples)
        if(i%10 == 0 and j == 499):
            print("MSE_ERROR = " + str(np.sum(mse_error)))

# Running the trained network on test images. This is the network with no regularization applied
total_correct = 0
for i in range(500):
    inp, out = np.reshape(X_test[i], (1, input_size)), int(Y_test[i])
    layer1_out = relu(np.dot(inp, weights_0_1)) # (1, hidden_size)
    layer2_out = np.dot(layer1_out, weights_1_2) # (1, output_size)
    
    actual_output = np.argmax(layer2_out)
    is_correct_pred = (out == actual_output)
    if(is_correct_pred):
        total_correct += 1
print("Total Correct Predictions out of 500 = " + str(total_correct))
print("Accuracy = " + str(float(total_correct)/500.0))        

In [0]:
# Let's visualize the weights learned by our network (TODO: This is not that informative. Look for other ways in which we can see what our network is learning)
from matplotlib import pyplot as plt

for i in range(hidden_size):
    weights_as_matrix = 100*np.reshape(weights_0_1[:,i], (28, 28))
    plt.imshow(weights_as_matrix, interpolation='nearest')
    plt.show()

### Learning signal and ignoring noise
1. Overfitting is like pressing forks into fresh clay. As more and more three-pronged forks are pressed into the same spot, the clay picks up the very intricate details of that fork. If we then bring a four-pronged fork, it won't fit the imprint. In essence, we do not want to learn the intricate details of one fork but a fuzzy version of it, so that the model generalizes instead of memorizing.
2. Regularization: a subset of methods used to encourage generalization in learned models, often by making it harder for a model to learn the fine-grained details of the training data.
3. Early Stopping: use a validation set to decide when to stop. If we used the test set for this, we might overfit on that too.
4. Dropout: acts as an ensembling technique. Switching nodes off at random in each iteration trains many subnetworks. Each subnetwork learns different noise but the same signal (the high-level view), so when they are averaged the noise cancels out. The idea is the same as training several randomly initialized networks, which would all learn different noise. Dropout also makes training harder, like running with weights on: training is tougher, but once the weights come off for the race we run very well.
5. Batch gradient descent also helps generalization: individual weight updates are noisy, so averaging them over a batch of samples reduces the noise.
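
Early stopping (note 3) is not implemented in the cells below, so here is a minimal sketch of one common variant, stopping with a patience window. The function name `early_stopping_epoch`, the `patience` parameter and the validation errors are all made up for illustration; in practice the errors would come from evaluating the network on a held-out validation set after each epoch.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return the index of the best epoch seen before stopping.

    Training stops once the validation error has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break   # no improvement for `patience` epochs: stop here
    return best_epoch

# Made-up validation errors: improving at first, then rising as the
# network starts to memorize the training set.
val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.30]
print(early_stopping_epoch(val_errors))
```

With these numbers the function returns epoch 3: the error stops improving after that, and training is cut off at epoch 6, before the later value 0.30 is ever seen. That trade-off (possibly missing a late improvement) is what the patience setting controls.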


In [0]:
# Running Backprop on MNIST Dataset. 3 layer NN + Dropout

lr = 0.005
epochs = 100
input_size = 784
hidden_size = 40
output_size = 10

weights_0_1 = 0.2*np.random.rand(input_size, hidden_size) - 0.1
weights_1_2 = 0.2*np.random.rand(hidden_size, output_size) - 0.1

def relu(x):
    return (x > 0) * x # returns x if x > 0, return 0 otherwise

def relu2deriv(output):
    return output>0 # returns 1 for input > 0, return 0 otherwise

def oneHotEncode(num, total_labels):
    assert(num >= 0.0 and num < total_labels)
    arr = np.zeros(int(total_labels)) # one slot per label
    arr[int(num)] = 1.0
    return arr
    
    
for i in range(epochs):
    for j in range(500):
        # Predict and Compare
        inp, out = np.reshape(X_train[j], (1, input_size)), np.reshape(oneHotEncode(Y_train[j], float(output_size)), (1,output_size))
        dropout_nodes = np.reshape(np.random.randint(2, size=hidden_size), (1, hidden_size)) # Randomly drop nodes
        dropout_nodes = dropout_nodes * 2 # Scale the surviving signals up, since without this dropout would severely degrade the signal from layer1 to layer2. The mask is not guaranteed to be exactly 50% zeros, but on average it is, since we sample from a uniform distribution
        
        layer1_out = dropout_nodes * relu(np.dot(inp, weights_0_1)) # (1, hidden_size)
        layer2_out = np.dot(layer1_out, weights_1_2) # (1, output_size)
        
        mse_error = (layer2_out - out)**2 # (1, output_size)
        delta_layer2 = layer2_out - out # (1, output_size)
        delta_layer1 = np.dot(delta_layer2, np.transpose(weights_1_2)) * relu2deriv(layer1_out) * dropout_nodes # (1, hidden_size)
        
        # Learn
        weights_1_2 -= lr*np.dot(np.transpose(layer1_out), delta_layer2) # (hidden_size, output_size)
        weights_0_1 -= lr*np.dot(np.transpose(inp), delta_layer1)
        
        # Print error in between iterations (on the last of the 500 samples)
        if(i%10 == 0 and j == 499):
            print("MSE_ERROR = " + str(np.sum(mse_error)))

# Running the trained network on test images. Dropout is not used in prediction
total_correct = 0
for i in range(500):
    inp, out = np.reshape(X_test[i], (1, input_size)), int(Y_test[i])
    layer1_out = relu(np.dot(inp, weights_0_1)) # (1, hidden_size)
    layer2_out = np.dot(layer1_out, weights_1_2) # (1, output_size)
    
    actual_output = np.argmax(layer2_out)
    is_correct_pred = (out == actual_output)
    if(is_correct_pred):
        total_correct += 1
print("Total Correct Predictions out of 500 = " + str(total_correct))
print("Accuracy = " + str(float(total_correct)/500.0))     

### Modelling Probabilities and Non-Linearities
1. Activation functions should be continuous and infinite in domain. This makes sense, since any input value can reach an activation function.
2. Activation functions should be monotonic (never changing direction). If a function is not monotonic, several inputs map to the same output, and there is no clear direction in which to adjust the weights. See [this](https://cs.stackexchange.com/questions/45281/should-activation-function-be-monotonic-in-neural-networks)
3. Activation functions should be non-linear, since only then can we achieve "Selective Correlation": to create a new set of correlations we have to use a non-linearity, otherwise we are simply scaling a weighted average.
4. Activation functions should be cheap to compute, since we are going to call them millions of times.
5. They should be differentiable (at least almost everywhere; relu, for example, has a kink at 0), since backpropagation needs their slope.
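
The properties listed above can be checked numerically for a few common activation functions. This is a small sketch, not part of the original notes: sigmoid and tanh are included as familiar examples alongside relu, and the monotonicity check only samples the range [-5, 5].

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_deriv(x):
    return (x > 0).astype(float)   # slope at exactly 0 is taken as 0 here

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_deriv(x):
    return 1.0 - np.tanh(x) ** 2

xs = np.linspace(-5.0, 5.0, 101)
for name, f in [("relu", relu), ("sigmoid", sigmoid), ("tanh", np.tanh)]:
    ys = f(xs)
    monotonic = bool(np.all(np.diff(ys) >= 0))   # never changes direction
    print(name, "monotonic on [-5, 5]:", monotonic)

# Slopes at 0, as used when backpropagating through each function.
print(sigmoid_deriv(0.0), tanh_deriv(0.0))
```

All three functions are cheap to evaluate, non-linear, and have non-negative slope everywhere, which is the sense in which they "never change direction".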

### Networks that understand Language (Neural Embeddings)
1. A good question to ask when working with a dataset is which input representation correlates with the output. For instance, in a movie review dataset, positive/negative words correlate directly with the sentiment of the review.
2. 

In [0]:

# Prepare Reviews Dataset
import numpy as np

f = open("/tmp/eider-user/userfile/depth/labeledTrainData.tsv")
reviews = np.array(f.readlines())

TRAIN_SIZE = 5000
TEST_SIZE = 5000

def constructReviewAndSentimentArr(data):
    review_arr, sentiment_arr = [], []
    for r in data:
        split_arr = r.split("\t")
        review_arr.append(split_arr[2])
        sentiment_arr.append(split_arr[1])
    return np.array(review_arr), np.array(sentiment_arr)

train_rev, train_sen = constructReviewAndSentimentArr(reviews[0:TRAIN_SIZE])
test_rev, test_sen = constructReviewAndSentimentArr(reviews[TRAIN_SIZE:TRAIN_SIZE+TEST_SIZE])

# construct dictionary of words (note: doing it in a very naive way and not taking care of words like you, you' etc.)
words = set()
whitespace = "\r\n\t"

for r in reviews[0:TRAIN_SIZE+TEST_SIZE]:
    wors = r.split("\t")[2].strip(whitespace).split(" ")
    for w in wors:
        if(w not in words):
            words.add(w)
            
words = list(words)

In [0]:
# Convert inputs to one hot encoded and run them through network
def oneHotEncodeReview(review, list_of_words, total_words):
    arr = np.zeros(total_words)
    sp = review.strip(whitespace).split(" ")
    for w in sp:
        arr[list_of_words.index(w)] = 1.0
    return arr

input_size = len(words)
hidden_size = 60
output_size = 2
epochs = 100

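for epoch in range(epochs):
    for index in range(TRAIN_SIZE):
        inp = oneHotEncodeReview(train_rev[index], words, input_size)
        # Assumed completion of a truncated line: one-hot target over the two
        # sentiment classes, assuming train_sen holds the strings "0" and "1"
        out = np.zeros(output_size)
        out[int(train_sen[index])] = 1.0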
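
The encoding helper above looks each word up with `list.index`, which scans the whole vocabulary for every word. A possible alternative, sketched here on made-up toy reviews, is to keep a word-to-index dictionary. The names `build_vocab` and `encode_review` are hypothetical and not part of the original notes, and unlike the original helper this version silently skips words that are not in the vocabulary.

```python
import numpy as np

def build_vocab(reviews):
    """Map each distinct space-separated token to a column index."""
    vocab = {}
    for review in reviews:
        for word in review.split(" "):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode_review(review, vocab):
    """Multi-hot vector: 1.0 for every vocabulary word present in the review."""
    vec = np.zeros(len(vocab))
    for word in review.split(" "):
        index = vocab.get(word)    # None for words outside the vocabulary
        if index is not None:
            vec[index] = 1.0
    return vec

toy_reviews = ["great movie", "terrible movie", "great acting"]   # made-up data
vocab = build_vocab(toy_reviews)
print(vocab)
print(encode_review("great terrible plot", vocab))
```

Dictionary lookups take roughly constant time per word, which matters once the vocabulary holds tens of thousands of entries and the encoding runs inside the training loop.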