In [15]:
import tensorflow as tf

In [2]:
# Reset the default TensorFlow computation graph, discarding any previously defined ops
tf.reset_default_graph()

## Training the Parameters

Example from lectures

In [3]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [2], name="x")
target = tf.placeholder(tf.float32, name="target")
learning_rate = tf.placeholder(
    tf.float32,
    name="learning_rate")

W = tf.get_variable("W", initializer=[0.2, 0.7])
y = tf.reduce_sum(x * W)

loss = tf.pow(target - y, 2.0)
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate)
train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        result, _ = sess.run(
            [y, train_op], 
            feed_dict={x: [1.0, 1.0], 
                       target: 20.0, 
                       learning_rate: 0.1}) 
        print("Result: ", result)

Result:  0.9
Result:  8.54
Result:  13.124001
Result:  15.874401
Result:  17.52464
Result:  18.514786
Result:  19.108871
Result:  19.46532
Result:  19.679192
Result:  19.807514
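
The numbers above can be reproduced by hand. The following is a minimal plain-Python sketch of the same update rule, assuming the loss (target - y)^2 and its gradient -2 (target - y) x with respect to W. The helper name train_by_hand is made up for illustration and is not part of TensorFlow.

```python
def train_by_hand(weights, inputs, target, learning_rate, epochs):
    """Gradient descent on loss = (target - y)^2 with y = sum(x * W)."""
    weights = list(weights)
    outputs = []
    for _ in range(epochs):
        # Forward pass: same computation as tf.reduce_sum(x * W)
        y = sum(x_i * w_i for x_i, w_i in zip(inputs, weights))
        outputs.append(y)
        # d(loss)/dW_i = -2 * (target - y) * x_i
        grads = [-2.0 * (target - y) * x_i for x_i in inputs]
        # Gradient descent step
        weights = [w_i - learning_rate * g for w_i, g in zip(weights, grads)]
    return outputs

outputs = train_by_hand([0.2, 0.7], [1.0, 1.0], 20.0, 0.1, 10)
for y in outputs:
    print("Result: ", round(y, 4))
```

The first values should come out as roughly 0.9, 8.54 and 13.124, matching the TensorFlow run up to floating-point precision. Each step removes 40% of the remaining distance to the target.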


## Network Layers

Instead of creating the trainable variables by hand, we can use a feedforward (dense) layer.

The call below creates a layer that takes x as input, has 1 output neuron and applies no non-linear activation. The parameters connecting the input to the output are created automatically and are trained during optimization.

Here we replace the manually created variables with a TensorFlow dense layer:

In [4]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 2], name="x")
target = tf.placeholder(tf.float32, name="target")
learning_rate = tf.placeholder(
    tf.float32, 
    name="learning_rate")

y = tf.layers.dense(x, 1, activation=None)

loss = tf.pow(target - y, 2.0)

optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate)

train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        result, _ = sess.run(
            [y, train_op], 
            feed_dict={x: [[1.0, 1.0]], 
                       target: 20.0, 
                       learning_rate: 0.1}) 
        print("Result: ", result[0][0])

Result:  -0.54732126
Result:  11.781071
Result:  16.71243
Result:  18.68497
Result:  19.473988
Result:  19.789597
Result:  19.91584
Result:  19.966335
Result:  19.986534
Result:  19.994614


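This solution actually converges faster because the dense layer also creates a bias parameter internally. With this extra trainable parameter, each update closes 60% of the remaining error, compared with 40% in the previous example.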
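
A rough way to see the effect of the bias is to run the same update by hand with and without it. This is only a sketch: the starting weights and the zero starting bias are assumptions reused from the first example (the dense layer initializes its weights randomly), and error_history is a hypothetical helper, not a TensorFlow function.

```python
def error_history(use_bias, learning_rate=0.1, epochs=5):
    """Train y = w1*x1 + w2*x2 (+ b) on one example and record target - y."""
    x = [1.0, 1.0]
    target = 20.0
    w = [0.2, 0.7]   # assumed starting weights, reused from the first example
    b = 0.0          # assumed starting bias
    errors = []
    for _ in range(epochs):
        y = w[0] * x[0] + w[1] * x[1] + (b if use_bias else 0.0)
        error = target - y
        errors.append(error)
        # Gradient descent on (target - y)^2: dy/dw_i = x_i
        w = [w_i + learning_rate * 2.0 * error * x_i for w_i, x_i in zip(w, x)]
        if use_bias:
            b = b + learning_rate * 2.0 * error  # dy/db = 1
    return errors

no_bias = error_history(use_bias=False)
with_bias = error_history(use_bias=True)
print("remaining error per step, no bias:  ", round(no_bias[1] / no_bias[0], 3))
print("remaining error per step, with bias:", round(with_bias[1] / with_bias[0], 3))
```

With the input [1.0, 1.0], the bias behaves like a third weight on a constant input of 1, so every step moves the output further towards the target.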

In large networks, you would normally chain together many large layers with **non-linear** activation functions:

In [5]:
x = tf.placeholder(tf.float32, [None, 300], name="x")
hidden1 = tf.layers.dense(x, 100, activation=tf.tanh)
hidden2 = tf.layers.dense(hidden1, 50, activation=tf.tanh)
y = tf.layers.dense(hidden2, 1, activation=tf.sigmoid)

## Activation Functions

The sigmoid function, also known as the logistic function, is the most classic non-linear activation. It transforms the value to a range between 0 and 1.

In modern networks, the tanh function is used more often. It has more flexibility, as it transforms the input value to a range between -1 and 1, and can therefore output negative values as well.

Another popular one is the **Rectified Linear Unit** function, or ReLU. This function is the identity above zero, but maps everything below zero to 0. The kink at zero is what makes it non-linear. The piecewise-linear property of the ReLU can help it converge faster on some tasks, although in practice I've found tanh to be a more robust option.

Finally, softmax is a special type of activation function. It takes a whole layer as input and converts it into a probability distribution, such that all values are between 0 and 1, and together they sum up to 1. It is often used in the output layers of networks when performing classification, in order to predict a probability distribution over all the possible classes.

In [6]:
# sigmoid
hidden = tf.layers.dense(x, 100, activation=tf.sigmoid)

# tanh
hidden = tf.layers.dense(x, 100, activation=tf.tanh)

# ReLU
hidden = tf.layers.dense(x, 100, activation=tf.nn.relu)

# softmax
output = tf.layers.dense(hidden, 2, activation=None)
probabilities = tf.nn.softmax(output)
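
To make the output ranges described above concrete, here is a small plain-Python sketch of the four functions. These are illustrative re-implementations under the usual textbook definitions, not the TensorFlow ops themselves.

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # Identity above zero, 0 below
    return max(0.0, z)

def softmax(values):
    # Subtract the maximum for numerical stability; the result is unchanged
    top = max(values)
    shifted = [math.exp(v - top) for v in values]
    total = sum(shifted)
    return [s / total for s in shifted]

# Columns: input, sigmoid, tanh, ReLU
for z in [-2.0, 0.0, 2.0]:
    print(z, round(sigmoid(z), 3), round(math.tanh(z), 3), relu(z))

# Softmax works on a whole layer at once and returns a distribution
probs = softmax([1.0, 2.0, 3.0])
print([round(p, 3) for p in probs])
```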

## Operations and Useful Functions

TensorFlow has its own versions of all the main mathematical operations you might want to use. This means you can add them into your computation graph and into your neural network.

These operators work on scalar values, but also on vectors, matrices and higher-order tensors, where they are applied element-wise:

In [7]:
tf.abs # absolute value
tf.negative # computes the negative value
tf.sign # returns 1, 0 or -1 depending on the sign of the input
tf.reciprocal # reciprocal 1/x
tf.square # return input squared
tf.round # return rounded value
tf.sqrt # square root
tf.rsqrt # reciprocal of square root
tf.pow # power
tf.exp # exponential

<function tensorflow.python.ops.gen_math_ops.exp(x, name=None)>

Other operators reduce a whole vector or matrix tensor to a single value (or, with the axis argument, reduce along one dimension only):

In [8]:
tf.reduce_sum # Add elements together
tf.reduce_mean # Average over elements
tf.reduce_min # Minimum value
tf.reduce_max # Maximum value
tf.argmax # Index of the largest value
tf.argmin # Index of the smallest value

<function tensorflow.python.ops.math_ops.argmin(input, axis=None, name=None, dimension=None, output_type=tf.int64)>
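
The difference between reducing over everything, reducing along one axis, and taking an argmax is easy to mix up. The sketch below mimics the behaviour with plain Python lists. It is only an analogy for what the TensorFlow ops compute, with the corresponding calls named in the comments.

```python
matrix = [[1.0, 5.0, 2.0],
          [4.0, 0.0, 3.0]]

# Reduce over all elements -> a single value (like tf.reduce_sum(matrix))
total = sum(sum(row) for row in matrix)

# Reduce along axis=1 -> one value per row (like tf.reduce_sum(matrix, axis=1))
row_sums = [sum(row) for row in matrix]

# Largest value per row vs. the index of that value
# (like tf.reduce_max(matrix, axis=1) and tf.argmax(matrix, axis=1))
row_max = [max(row) for row in matrix]
row_argmax = [row.index(max(row)) for row in matrix]

print(total, row_sums, row_max, row_argmax)
```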

Different optimization strategies, including adaptive learning rate methods, are also implemented in TensorFlow as classes. The main ones to try are:

In [9]:
tf.train.GradientDescentOptimizer
tf.train.AdadeltaOptimizer
tf.train.AdamOptimizer

tensorflow.python.training.adam.AdamOptimizer
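
To give a feel for what "adaptive" means, here is a sketch of one plain gradient descent step next to one Adam step for a single scalar parameter. It follows the standard published Adam update rule with commonly used default constants. TensorFlow's implementation may differ in small details, and sgd_step, adam_step and the toy loss are made up for this illustration.

```python
import math

def sgd_step(w, grad, lr):
    return w - lr * grad

def adam_step(w, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; state holds (m, v, t)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Toy loss 1000 * (w - 3)^2, chosen so that the raw gradient is very large
def grad(w):
    return 1000.0 * 2.0 * (w - 3.0)

w_start = 0.0
sgd_w = sgd_step(w_start, grad(w_start), lr=0.1)
adam_w, _ = adam_step(w_start, grad(w_start), (0.0, 0.0, 0), lr=0.1)
print("after one SGD step: ", sgd_w)
print("after one Adam step:", adam_w)
```

Plain gradient descent overshoots the minimum at 3 by a wide margin because its step scales with the gradient, while the first Adam step has a size of about the learning rate regardless of how large the gradient is.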

## Training an XOR Function

XOR is the function that takes two binary values and returns 1 only if one of them is 1 and the other 0, while returning 0 if both of them have the same value.

It can be a complicated function to optimize and cannot be modeled with a linear model. But let's try anyway.

Our dataset consists of all the possible different states that XOR can take:

In [10]:
data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]
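
As a quick sanity check, the labels can be derived from the inputs directly, since XOR is 1 exactly when the two inputs differ. The variable name xor_labels is only used for this illustration.

```python
data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]

# XOR is 1 exactly when the two inputs differ
xor_labels = [float(a != b) for a, b in data_x]
print(xor_labels == data_y)
```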

Now we construct a linear network and optimize it on this dataset, printing out the predictions at each epoch:

In [11]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 2], name="x")
target = tf.placeholder(tf.float32, [None], name="target")
learning_rate = tf.placeholder(tf.float32, name="learning_rate")

y = tf.reduce_sum(tf.layers.dense(x, 1, activation=None), axis=1)

loss = tf.reduce_sum(tf.pow(target - y, 2.0))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss)

data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]
lr = 0.1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(50):
        result, _ = sess.run([y, train_op], feed_dict={x: data_x, target: data_y, learning_rate: lr})
        if epoch % 10 == 0:
            print("Epoch ", epoch, ": ", "\t".join([str(v) for v in result]))

Epoch  0 :  0.0	1.2826594	-0.6343475	0.6483119
Epoch  10 :  0.38700476	0.585186	0.37934893	0.57753015
Epoch  20 :  0.47045797	0.5064148	0.4843132	0.52027
Epoch  30 :  0.4922764	0.49997452	0.4976014	0.5052995
Epoch  40 :  0.4979807	0.49981052	0.4995557	0.5013855


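As you can see, it's not doing very well. Ideally, the predictions should be [0, 1, 1, 0], but in this case they are hovering around 0.5 for every input case. That is in fact the best a linear model can do under squared error here: no straight line separates the two classes, so the model falls back to predicting the average target value.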
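
To check that 0.5 is where a linear model ends up on this data no matter where it starts, here is a plain-Python sketch of full-batch gradient descent on the same summed squared error. The starting weights are arbitrary assumptions and train_linear is a hypothetical helper, not the TensorFlow code above.

```python
data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]

def predict(w, b, x):
    return w[0] * x[0] + w[1] * x[1] + b

def train_linear(w, b, lr=0.1, epochs=500):
    """Full-batch gradient descent on sum((target - y)^2) for a linear model."""
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for x, t in zip(data_x, data_y):
            err = t - predict(w, b, x)
            grad_w[0] += -2.0 * err * x[0]
            grad_w[1] += -2.0 * err * x[1]
            grad_b += -2.0 * err
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b = b - lr * grad_b
    return w, b

# Arbitrary (assumed) starting point
w, b = train_linear([1.3, -0.6], 0.0)
linear_predictions = [predict(w, b, x) for x in data_x]
print([round(p, 3) for p in linear_predictions])
```

The weights shrink towards zero and the bias towards 0.5, so every input ends up with the same prediction.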

In order to improve this architecture, let's add some non-linear layers into our model.

In [12]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 2], name="x")
target = tf.placeholder(tf.float32, [None], name="target")
learning_rate = tf.placeholder(tf.float32, name="learning_rate")

hidden = tf.layers.dense(x, 5, activation=tf.tanh) # <- non-linear layer
y = tf.reduce_sum(tf.layers.dense(hidden, 1, activation=tf.sigmoid), axis=1) # <- non-linear layer

loss = tf.reduce_sum(tf.pow(target - y, 2.0))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss)

data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]
lr = 1.0

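# Note: the graph-level seed only applies to ops created after this call,
# so to seed the weight initializers it would have to be set before the layers are built.
tf.set_random_seed(20)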
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(50):
        result, _ = sess.run([y, train_op], feed_dict={x: data_x, target: data_y, learning_rate: lr})
        if epoch % 10 == 0:
            print("Epoch ", epoch, ": ", "\t".join([str(v) for v in result]))

Epoch  0 :  0.5	0.36924797	0.48300314	0.3753078
Epoch  10 :  0.35206163	0.5785061	0.5704008	0.491058
Epoch  20 :  0.12587155	0.65398777	0.64453834	0.22893003
Epoch  30 :  0.08895578	0.852751	0.8530928	0.17340218
Epoch  40 :  0.06627951	0.8881273	0.893137	0.12493227


This is much better. The values are much closer to [0, 1, 1, 0] than before, and they will continue improving if we train for longer.
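
To see why a non-linear hidden layer is enough, it helps to look at a tiny network with weights picked by hand. This is an illustrative construction, not what training finds: the weights, the two hidden units and the name tiny_network are all assumptions made for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tiny_network(x1, x2):
    # Hidden layer with two tanh units (weights chosen by hand, not trained)
    h_or = math.tanh(10.0 * (x1 + x2 - 0.5))    # about +1 if at least one input is 1
    h_and = math.tanh(10.0 * (x1 + x2 - 1.5))   # about +1 only if both inputs are 1
    # Output unit is on when "or" is on but "and" is off
    return sigmoid(10.0 * (h_or - h_and - 1.0))

data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
handmade_predictions = [tiny_network(x1, x2) for x1, x2 in data_x]
print([round(p, 3) for p in handmade_predictions])
```

The hidden units turn the inputs into features ("at least one" and "both") in which XOR becomes linearly separable, which is what the trained tanh layer has to discover on its own.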

We also had to increase the learning rate for this network. It was still learning with the smaller learning rate, but was converging very slowly. Remember, the learning rate is a hyperparameter that can vary quite a bit depending on the network architecture and dataset.

## XOR Classification

We can also do classification with TensorFlow. For this, we often use the softmax activation function described above, which predicts the probability for each of the possible classes.

We also have to change the loss function, as squared error is not well suited to classification. The loss function that works best with softmax is cross entropy. When minimizing cross entropy, we are essentially minimizing the negative log likelihood of the correct class for each datapoint. That's exactly what we want, as the model learns to assign a high probability to the correct label.
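
As a sketch of what this loss computes for a single datapoint, the function below takes the raw layer outputs (logits) and the index of the correct class and returns the negative log-probability of that class. It is a plain-Python illustration under the standard definition. The function name is hypothetical and only echoes the TensorFlow op used later.

```python
import math

def sparse_softmax_cross_entropy(logits, label):
    """Negative log-probability of the class at index label under softmax(logits)."""
    # log(sum(exp(logits))) computed stably by factoring out the maximum
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
    # -log(softmax(logits)[label]) = log_sum_exp - logits[label]
    return log_sum_exp - logits[label]

confident_right = sparse_softmax_cross_entropy([4.0, -4.0], label=0)
undecided = sparse_softmax_cross_entropy([0.0, 0.0], label=0)
confident_wrong = sparse_softmax_cross_entropy([-4.0, 4.0], label=0)
print(confident_right, undecided, confident_wrong)
```

The loss is close to zero for a confident correct prediction, log(2) when the two classes are considered equally likely, and grows roughly linearly with the logit gap when the model is confidently wrong.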

We can change the XOR example above to perform classification instead. In this case, we are constructing a binary classifier that chooses between the classes 0 and 1. When printing the output, we print the predicted classes, i.e. the ones assigned the highest probability by the network.

In [13]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 2], name="x")
target = tf.placeholder(tf.int32, [None], name="target")
learning_rate = tf.placeholder(tf.float32, name="learning_rate")

hidden = tf.layers.dense(x, 5, activation=tf.tanh)
output = tf.layers.dense(hidden, 2, activation=None)

probabilities = tf.nn.softmax(output) # moved to softmax activation function
predictions = tf.argmax(probabilities, axis=1)
loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output, labels=target)
loss = tf.reduce_mean(loss_) # moved to cross entropy loss function

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss)

data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_targets = [0, 1, 1, 0]
lr = 1.0

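# Note: the graph-level seed only applies to ops created after this call,
# so to seed the weight initializers it would have to be set before the layers are built.
tf.set_random_seed(20)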
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(50):
        result, _ = sess.run([predictions, train_op], feed_dict={x: data_x, target: data_targets, learning_rate: lr})
        if epoch % 10 == 0:
            print("Epoch ", epoch, ": ", " ".join([str(v) for v in result]))

Epoch  0 :  0 0 1 0
Epoch  10 :  0 0 1 1
Epoch  20 :  0 1 1 1
Epoch  30 :  0 1 1 0
Epoch  40 :  0 1 1 0


As you can see, the model starts off with incorrect predictions, but fairly soon learns to return the correct sequence of [0, 1, 1, 0].