# Practical Session 3: Deep learning

In this practical, we will continue from where the lecture left off and learn more about using Tensorflow. 

The practical will cover a few different network architectures and we will look at different components that are often used in neural networks.

To start off, let's import tensorflow into our notebook.

In [1]:
import tensorflow as tf

## Minimal Tensorflow Example

This is the first example from the lectures. We first create a network with two placeholders and an operation that adds them together. Then we execute this network with two input values, 4 and 5, which returns the result 9.

In [2]:
tf.reset_default_graph()

a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
y = a + b

with tf.Session() as sess:
    result = sess.run(y,
                      feed_dict={a:4, b:5})
    print("Result: ", result)


Result:  9.0


Occasionally throughout this notebook, the following function will be called:

In [3]:
tf.reset_default_graph()

This is necessary to reset the Tensorflow network. We have many different small networks in one notebook and we don't want them interfering with each other, so as a pre-emptive measure we will occasionally reset the computation graph.

## Training the Parameters

This is the second example from the lecture, showing how to optimize the parameters in your model.

We define a network that takes a vector x with two features as input, multiplies the features with corresponding parameters in W, and sums them together. We then train this network for 10 epochs over a single training point, optimizing the output towards value 20. Printing out the results, we can see that the output y gradually moves towards the target.

In [4]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [2], name="x")
target = tf.placeholder(tf.float32, name="target")
learning_rate = tf.placeholder(
    tf.float32,
    name="learning_rate")

W = tf.get_variable("W", initializer=[0.2, 0.7])
y = tf.reduce_sum(x * W)

loss = tf.pow(target - y, 2.0)
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate)
train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        result, _ = sess.run(
            [y, train_op], 
            feed_dict={x: [1.0, 1.0], 
                       target: 20.0, 
                       learning_rate: 0.1}) 
        print("Result: ", result)

Result:  0.9
Result:  8.54
Result:  13.124001
Result:  15.874401
Result:  17.52464
Result:  18.514786
Result:  19.108871
Result:  19.46532
Result:  19.679192
Result:  19.807514
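
The printed values can be checked by hand. Below is a minimal sketch in plain Python, assuming the optimizer performs ordinary gradient descent on the squared error; the names `w`, `lr` and `outputs` are chosen here for illustration and are not part of the lecture code.

```python
# Plain-Python re-derivation of the training loop above (no Tensorflow needed).
x = [1.0, 1.0]
w = [0.2, 0.7]      # same initial values as W
target = 20.0
lr = 0.1

outputs = []
for epoch in range(10):
    # Forward pass: y = sum_i x_i * w_i
    y = sum(xi * wi for xi, wi in zip(x, w))
    outputs.append(y)
    # Gradient of (target - y)^2 with respect to w_i is -2 * (target - y) * x_i
    grads = [-2.0 * (target - y) * xi for xi in x]
    # Gradient descent step
    w = [wi - lr * gi for wi, gi in zip(w, grads)]

print([round(v, 4) for v in outputs])
```

Up to floating-point rounding, the values in `outputs` follow the same sequence as the Tensorflow results: each step removes 40% of the remaining distance to the target.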


## Network Layers

For most cases, we don't actually need to create the trainable variables manually. Instead, the feedforward layer is available as a pre-defined module.

In [5]:
x = tf.placeholder(tf.float32, [None, 2], name="x")
y = tf.layers.dense(x, 1, activation=None)

This creates a hidden layer that takes x as input, has 1 output neuron (we can also create bigger layers of course), and has no non-linear activation. The parameters that connect the two layers together are created automatically and are trained during optimization. By default, these parameters are initialized randomly.

Let's replace the manually created variables with a Tensorflow dense layer.

In [6]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 2], name="x")
target = tf.placeholder(tf.float32, name="target")
learning_rate = tf.placeholder(
    tf.float32, 
    name="learning_rate")

y = tf.layers.dense(x, 1, activation=None)

loss = tf.pow(target - y, 2.0)

optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate)

train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        result, _ = sess.run(
            [y, train_op], 
            feed_dict={x: [[1.0, 1.0]], 
                       target: 20.0, 
                       learning_rate: 0.1}) 
        print("Result: ", result[0][0])


Result:  -1.1869861
Result:  11.5252075
Result:  16.610083
Result:  18.644033
Result:  19.457613
Result:  19.783047
Result:  19.913218
Result:  19.965288
Result:  19.986115
Result:  19.994446


This version actually gets to the correct solution a bit faster than before. That's because it is internally also creating a bias parameter, which adds a bit more power to the model.
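
To see why the bias speeds things up here, note that for a single training point with all inputs equal to 1.0, every gradient step shrinks the error by a fixed factor that depends on the number of parameters. The sketch below works this out in plain Python; `remaining_error` is a hypothetical helper written for this illustration, and it assumes a linear model with squared-error loss as in the two examples above.

```python
def remaining_error(initial_error, lr, n_inputs, use_bias, steps):
    """Error (target - y) left after `steps` gradient steps on one training
    point whose inputs are all 1.0 (linear model, squared-error loss)."""
    # A bias behaves like one more weight whose input is always 1.0.
    n_params = n_inputs + (1 if use_bias else 0)
    # Each step moves y by lr * 2 * error * n_params, so the error is
    # multiplied by this factor every time.
    factor = 1.0 - 2.0 * lr * n_params
    return initial_error * factor ** steps

# Manual variables (no bias): the error shrinks by a factor of 0.6 per step.
print(remaining_error(20.0 - 0.9, 0.1, 2, use_bias=False, steps=1))
# Dense layer (with bias): the error shrinks by a factor of 0.4 per step.
print(remaining_error(20.0 + 1.1869861, 0.1, 2, use_bias=True, steps=1))
```

The two printed errors correspond to the second line of each result list above (20 - 8.54 and 20 - 11.5252075, respectively).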

In large networks, you would normally chain together many large layers with non-linear activation functions:

In [7]:
x = tf.placeholder(tf.float32, [None, 300], name="x")
hidden1 = tf.layers.dense(x, 100, activation=tf.tanh)
hidden2 = tf.layers.dense(hidden1, 50, activation=tf.tanh)
y = tf.layers.dense(hidden2, 1, activation=tf.sigmoid)

## Activation Functions

In the last example, we used non-linear activation functions. As we saw in the lectures, this is what gives neural networks their power to model non-linear patterns in the data.

There are a number of different activation functions to choose from.

The [**sigmoid** function](https://en.wikipedia.org/wiki/Logistic_function), also known as the logistic function, is the most classic non-linear activation. It transforms the value to a range between 0 and 1.

In [8]:
hidden = tf.layers.dense(x, 100, activation=tf.sigmoid)

In modern networks, the [**tanh** function](https://en.wikipedia.org/wiki/Hyperbolic_function) is used more often. It has more flexibility, as it transforms the input value to a range between -1 and 1, and can therefore output negative values as well.

In [9]:
hidden = tf.layers.dense(x, 100, activation=tf.tanh)

Another popular one is the <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">**Rectified Linear Unit** function</a>, or the ReLU. This function acts as a linear function above zero, but restricts everything below zero to 0. Doing this, it also introduces a non-linearity.

In [10]:
hidden = tf.layers.dense(x, 100, activation=tf.nn.relu)

The piecewise-linear shape of the ReLU can help it converge faster on some tasks, although in practice I've found tanh to be a more robust option.
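
To get a feel for the three functions, it can help to evaluate them on a few scalar values. This is a small plain-Python sketch (the helper names `sigmoid` and `relu` are defined here for illustration; they are not the Tensorflow ops):

```python
import math

def sigmoid(z):
    # Logistic function: squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # Rectified Linear Unit: identity above zero, 0 below
    return max(0.0, z)

for z in [-2.0, 0.0, 2.0]:
    print(z, round(sigmoid(z), 3), round(math.tanh(z), 3), relu(z))
```

Note how only tanh produces negative outputs, and how the ReLU is the only one of the three that is unbounded above.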

Finally, [**softmax**](https://en.wikipedia.org/wiki/Softmax_function) is a special type of activation function. It takes a whole layer as input and converts it into a probability distribution, such that all values are between 0 and 1, and together they sum up to 1. It is often used in the output layers of networks when performing classification, in order to predict a probability distribution over all the possible classes.

In [11]:
output = tf.layers.dense(hidden, 2, activation=None)
probabilities = tf.nn.softmax(output)
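
As an illustration of what softmax computes, here is a minimal plain-Python sketch. The function below is written for this example only; subtracting the maximum before exponentiating is a common trick for numerical stability and does not change the result.

```python
import math

def softmax(logits):
    # Shift by the maximum so that exp() never overflows
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    # Normalize so that the values sum to 1
    return [e / total for e in exps]

probs = softmax([2.0, 0.0])
print(probs, sum(probs))
```

The larger input receives the larger probability, and equal inputs would receive equal probabilities.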

## Operations and Useful Functions

Tensorflow has corresponding versions of all the main operations you might want to use. This means you can add them into your computation graph and into your neural network.

In [12]:
tf.abs # absolute value
tf.negative # computes the negative value
tf.sign # returns 1, 0 or -1 depending on the sign of the input
tf.reciprocal # reciprocal 1/x
tf.square # return input squared
tf.round # return rounded value
tf.sqrt # square root
tf.rsqrt # reciprocal of square root
tf.pow # power
tf.exp # exponential

<function tensorflow.python.ops.gen_math_ops.exp(x, name=None)>

These operations can be applied to scalar values, but also to vectors, matrices and higher-order tensors. In the latter case, they will be applied element-wise. For example:

In [13]:
tf.reset_default_graph()

a = tf.placeholder(tf.int32, name="a")
b = tf.placeholder(tf.float32, [3], name="b")

c = tf.negative(a)
d = tf.square(b)

with tf.Session() as sess:
    c_, d_ = sess.run([c, d], feed_dict={a:4, b:[3.0,2.0,1.0]})
    print(c_, d_)

-4 [9. 4. 1.]


Some useful operations are performed over a whole vector/matrix tensor and return a single value:

In [14]:
tf.reduce_sum # Add elements together
tf.reduce_mean # Average over elements
tf.reduce_min # Minimum value
tf.reduce_max # Maximum value
tf.argmax # Index of the largest value
tf.argmin # Index of the smallest value

<function tensorflow.python.ops.math_ops.argmin(input, axis=None, name=None, dimension=None, output_type=tf.int64)>

In [15]:
tf.reset_default_graph()

b = tf.placeholder(tf.float32, [3,2], name="b")
c = tf.reduce_sum(b)

with tf.Session() as sess:
    c_ = sess.run([c], feed_dict={b:[[6.0, 5.0],[4.0,3.0],[2.0,1.0]]})
    print(c_)

[21.0]


Different optimization strategies, including adaptive learning rate methods, are also implemented in Tensorflow as classes. The main ones to try are:

In [16]:
tf.train.GradientDescentOptimizer
tf.train.AdadeltaOptimizer
tf.train.AdamOptimizer

tensorflow.python.training.adam.AdamOptimizer

If you are interested in the differences between these strategies, [this blog post](http://ruder.io/optimizing-gradient-descent/) provides more details.

## Training an XOR Function

[XOR](https://en.wikipedia.org/wiki/XOR_gate) is the function that takes two binary values and returns 1 only if one of them is 1 and the other 0, while returning 0 if both of them have the same value.

It can be a complicated function to optimize and cannot be modeled with a linear model. But let's try anyway.

Our dataset consists of all the possible different states that XOR can take:

In [17]:
data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]

Now we construct a linear network and optimize it on this dataset, printing out the predictions at each epoch:

In [18]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 2], name="x")
target = tf.placeholder(tf.float32, [None], name="target")
learning_rate = tf.placeholder(tf.float32, name="learning_rate")

y = tf.reduce_sum(tf.layers.dense(x, 1, activation=None), axis=1)

loss = tf.reduce_sum(tf.pow(target - y, 2.0))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss)

data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]
lr = 0.1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(50):
        result, _ = sess.run([y, train_op], feed_dict={x: data_x, target: data_y, learning_rate: lr})
        if epoch % 10 == 0:
            print("Epoch ", epoch, ": ", "\t".join([str(x) for x in result]))

Epoch  0 :  0.0	1.4035221	1.3642935	2.7678156
Epoch  10 :  0.1940817	0.45410073	0.44988856	0.70990753
Epoch  20 :  0.42001867	0.4876747	0.48722243	0.5548785
Epoch  30 :  0.47908926	0.49674278	0.4966942	0.51434773
Epoch  40 :  0.49453297	0.49914464	0.49913943	0.5037511


As you can see, it's not doing very well. Ideally, the predictions should be \[0, 1, 1, 0\], but in this case they are hovering around 0.5 for every input case.
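
The value 0.5 is not a coincidence. For a linear model with a sum-of-squares loss, the loss is convex in the parameters, so any point where all gradients are zero is a global minimum. The plain-Python sketch below checks that predicting 0.5 everywhere (both weights 0, bias 0.5) is such a point; `loss_and_grads` is a helper written for this illustration.

```python
data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]

def loss_and_grads(w1, w2, b):
    # Linear model: prediction = w1 * x1 + w2 * x2 + b
    preds = [w1 * x1 + w2 * x2 + b for x1, x2 in data_x]
    residuals = [t - p for t, p in zip(data_y, preds)]
    # Same loss as above: sum of squared errors over the four datapoints
    loss = sum(r * r for r in residuals)
    # Gradients of the loss with respect to each parameter
    g_w1 = sum(-2.0 * r * x[0] for r, x in zip(residuals, data_x))
    g_w2 = sum(-2.0 * r * x[1] for r, x in zip(residuals, data_x))
    g_b = sum(-2.0 * r for r in residuals)
    return loss, (g_w1, g_w2, g_b)

loss, grads = loss_and_grads(0.0, 0.0, 0.5)
print(loss, grads)
```

All three gradients vanish at this point, so gradient descent has nowhere better to go: the best a linear model can do on XOR is to output 0.5 for every input.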

In order to improve this architecture, let's add some non-linear layers into our model.

In [19]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 2], name="x")
target = tf.placeholder(tf.float32, [None], name="target")
learning_rate = tf.placeholder(tf.float32, name="learning_rate")

hidden = tf.layers.dense(x, 5, activation=tf.tanh) # <- non-linear layer
y = tf.reduce_sum(tf.layers.dense(hidden, 1, activation=tf.sigmoid), axis=1) # <- non-linear layer

loss = tf.reduce_sum(tf.pow(target - y, 2.0))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss)

data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_y = [0.0, 1.0, 1.0, 0.0]
lr = 1.0

tf.set_random_seed(20)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(50):
        result, _ = sess.run([y, train_op], feed_dict={x: data_x, target: data_y, learning_rate: lr})
        if epoch % 10 == 0:
            print("Epoch ", epoch, ": ", "\t".join([str(x) for x in result]))


Epoch  0 :  0.5	0.28850782	0.36305997	0.2056952
Epoch  10 :  0.28464514	0.5555107	0.6265177	0.42573875
Epoch  20 :  0.07027962	0.6488854	0.7141711	0.18525329
Epoch  30 :  0.0814386	0.8589512	0.8627182	0.15936361
Epoch  40 :  0.060135383	0.8947844	0.8951496	0.118306704


This is much better. The values are much closer to \[0, 1, 1, 0\] than before, and they will continue improving if we train for longer.

We also had to increase the learning rate for this network. It was still learning with the smaller learning rate, but was converging very slowly. As we discussed in the lectures, learning rate is a hyperparameter that can vary quite a bit depending on the network architecture and dataset.

## XOR Classification

We can also do classification with Tensorflow. For this, we often use the softmax activation function described above, which predicts the probability for each of the possible classes.

We also have to change the loss function, as squared error is not suitable for classification. The loss function that works best with softmax is [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy). When minimizing cross entropy, we are essentially minimizing the negative log likelihood of the correct class for each datapoint. That's exactly what we want, as the model learns to assign high values for the correct label.

We can change the XOR example above to perform classification instead. In this case, we are constructing a binary classifier - choosing between the classes of 0 and 1. When printing the output, we are printing the predicted classes, which were assigned the highest probability by the network.

In [20]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 2], name="x")
target = tf.placeholder(tf.int32, [None], name="target")
learning_rate = tf.placeholder(tf.float32, name="learning_rate")

hidden = tf.layers.dense(x, 5, activation=tf.tanh)
output = tf.layers.dense(hidden, 2, activation=None)

probabilities = tf.nn.softmax(output)
predictions = tf.argmax(probabilities, axis=1)
loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output, labels=target)
loss = tf.reduce_mean(loss_)

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss)

data_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
data_targets = [0, 1, 1, 0]
lr = 1.0

tf.set_random_seed(20)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(50):
        result, _ = sess.run([predictions, train_op], feed_dict={x: data_x, target: data_targets, learning_rate: lr})
        if epoch % 10 == 0:
            print("Epoch ", epoch, ": ", " ".join([str(x) for x in result]))

Epoch  0 :  0 1 0 1
Epoch  10 :  0 1 1 1
Epoch  20 :  0 1 1 0
Epoch  30 :  0 1 1 0
Epoch  40 :  0 1 1 0


As you can see, the model starts off with incorrect predictions, but fairly soon learns to return the correct sequence of \[0, 1, 1, 0\].

# Assignment: Classification of House Locations

In the first practical, you used the California House Prices Dataset in order to predict the prices of the houses based on various properties about the houses. In this assignment, we will experiment with Tensorflow and train a model to classify houses based on their ocean proximity.

First, we read in the dataset:

In [21]:
import pandas as pd
data = pd.read_csv('data/housing.csv')
data.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


Next, we split the ocean proximity column from the other features and convert the values to numerical IDs. Remember, the ocean_proximity column already contains discrete classes, so it is well-suited for the classification task. However, these are strings and in order to optimize the softmax function in Tensorflow, we need numerical IDs instead of strings. We can use the pandas map function to do the conversion:

In [22]:
# Input features: everything except the label column (drop returns a new DataFrame)
X = data.drop(["ocean_proximity"], axis=1)
# Labels: map each class name to an integer ID
Y = data["ocean_proximity"].map({"<1H OCEAN":0, "INLAND":1, "ISLAND": 2, "NEAR BAY": 3, "NEAR OCEAN": 4}).values

Now, let's split off some data for development and testing:

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, train_size=0.8, random_state=1)
X_train, X_dev, y_train, y_dev = train_test_split(X_train, y_train, test_size=0.2, train_size=0.8, random_state=1)

And finally, let's preprocess the input features before giving them to the network. We need to fill in missing values with the imputer, and standardize the values to a similar range using the scaler:

In [24]:
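# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
from sklearn import preprocessing

imputer = SimpleImputer(strategy="median")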
imputer.fit(X_train)

X_train = imputer.transform(X_train)
X_dev = imputer.transform(X_dev)
X_test = imputer.transform(X_test)

scaler = preprocessing.StandardScaler().fit(X_train)

X_train = scaler.transform(X_train)
X_dev = scaler.transform(X_dev)
X_test = scaler.transform(X_test)
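
Note that both the imputer and the scaler are fitted on the training set only, and the same statistics are then reused for the dev and test sets. The following plain-Python sketch shows the idea for a single feature column. The helper names and the toy numbers are made up for this illustration, and the population standard deviation is used here, which I believe matches the scaler's default behaviour.

```python
import statistics

def fit_standardizer(train_column):
    # Statistics are estimated from the training data only
    mean = statistics.mean(train_column)
    std = statistics.pstdev(train_column)  # population standard deviation
    return mean, std

def standardize(column, mean, std):
    # Shift to zero mean and scale to unit variance (w.r.t. the training data)
    return [(v - mean) / std for v in column]

train = [1.0, 2.0, 3.0, 4.0]   # toy training values
dev = [2.5, 5.0]               # toy development values

mean, std = fit_standardizer(train)
scaled_train = standardize(train, mean, std)
scaled_dev = standardize(dev, mean, std)
print(scaled_train, scaled_dev)
```

The scaled training column has mean 0 and standard deviation 1, while the dev column is only approximately standardized, because it is transformed with the training statistics.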

We now have a dataset that we can work with. 

Input features:

In [25]:
print(X_train.shape)
print(X_dev.shape)
print(X_test.shape)
print(X_train[:3])

(13209, 9)
(3303, 9)
(4128, 9)
[[ 0.89872022 -0.89773476 -1.24376976 -0.32171094 -0.69583159 -0.43931447
  -0.67219421  2.24405982  1.48686249]
 [ 0.68448647 -0.84637731 -0.60683118 -0.17791607 -0.20472295 -0.45260971
  -0.20959953  0.38636939  0.55799804]
 [ 0.92363112 -0.98177423 -1.16415244 -0.44476616 -0.5808403  -0.63785671
  -0.54340365  0.65584018  0.37757734]]


And the corresponding gold-standard labels:

In [26]:
print(y_train.shape)
print(y_dev.shape)
print(y_test.shape)
print(y_train[:10])

(13209,)
(3303,)
(4128,)
[0 4 0 1 0 0 1 0 4 4]


Based on the code examples above, construct a Tensorflow model, then train, tune and test it on this dataset. Experiment with different model settings and hyperparameters. Calculate and evaluate classification accuracy - the percentage of datapoints where the predicted class matches the gold-standard class.
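
As a reference for the evaluation metric, accuracy can be computed in a few lines of plain Python. This is only a sketch: the function name `accuracy` and the example labels are made up for illustration, and `predicted` is assumed to be a sequence of class IDs such as the ones returned by an argmax over the model output.

```python
def accuracy(predicted, gold):
    """Percentage of datapoints where the predicted class matches the gold class."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return 100.0 * correct / len(gold)

# Four out of five predictions match the gold labels
print(accuracy([0, 4, 0, 1, 1], [0, 4, 0, 1, 0]))  # 80.0
```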

During the practical session, give examples of what you tried and what your findings were.

Some suggestions and tips:
 * The XOR classification code can be a good place to start.
 * The output layer needs to have size 5, because the dataset has 5 possible classes.
 * Try testing on the development set as you are training, to make sure you don't overfit.
 * Evaluate on the dev set as much as you want, but evaluate on the test set only after you have chosen a good set of hyperparameters.
 * You could try different learning rates, hidden layer sizes, learning strategies, etc.
 * Adaptive learning rates can (and sometimes should) be used together with a regular hand-picked learning rate, and different adaptive learning rates can prefer very different regular learning rates.