# Task 11: Starting with tensorflow

_All credit for this Jupyter notebook tutorial goes to the book "Hands-On Machine Learning with Scikit-Learn & TensorFlow" by Aurélien Géron. Modifications were made in preparation for the hands-on sessions._

# Setup

First, import a few common modules, ensure Matplotlib plots figures inline and prepare a function to save the figures:

In [None]:
# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Reset the default graph and fix all random seeds. Note that this uses
# `tf`, which is imported further below, before the first call.
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# Function to save a figure. This also decides that all output files
# should be stored in the subdirectory 'output/tensorflow'.
PROJECT_ROOT_DIR = "."
EXERCISE = "tensorflow"

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "output", EXERCISE, fig_id + ".png")
    # Create the output directory if it does not exist yet.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

# Creating graphs

As we learnt in the lecture, tensorflow is based on _graph building_. It is important to understand that building a graph such as the one below does _not_ actually execute any computation of the results yet. This is done in a tensorflow _session_. What result would you expect from the created graph below?

In [None]:
import tensorflow as tf

reset_graph()

x = tf.Variable(5, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + 3*y - 2*x + 1
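
To make the idea of deferred execution more concrete, here is a minimal sketch in plain Python. The `Node` class and its `evaluate` method are hypothetical names invented for this illustration; they are not part of tensorflow and only mimic the principle that building an expression and computing its value are two separate steps.

```python
# Toy illustration only: this class is hypothetical and not part of TensorFlow.
class Node:
    """A node in a tiny deferred-computation graph."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "const", "add" or "mul"
        self.inputs = inputs  # upstream nodes this node depends on
        self.value = value    # only used by "const" nodes

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))

    def evaluate(self):
        """Walk the graph and compute the result; nothing runs before this call."""
        if self.op == "const":
            return self.value
        left, right = (node.evaluate() for node in self.inputs)
        return left + right if self.op == "add" else left * right


a = Node("const", value=2)
b = Node("const", value=3)
g = a * a + b        # builds the graph, computes nothing yet

print(g.op)          # add: g only describes a computation
print(g.evaluate())  # 7: the computation happens only here
```

In this toy model, `evaluate()` plays the role that a session run plays in tensorflow.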

The line below displays information about the type of `f`. As you can see, `f` is actually an output _tensor_; check out the [tf.Tensor](https://www.tensorflow.org/api_docs/python/tf/Tensor) class documentation and also the tensorflow [guide on tensors](https://www.tensorflow.org/guide/tensors). Quite generally, tensorflow's documentation and guide are worth reading, and spending some time on them will help you to understand the basics of the low-level tensorflow API.

In [None]:
f

Now, to execute the above graph, we still need to create a [tf.Session](https://www.tensorflow.org/api_docs/python/tf/Session). Again, there is a very good [guide on graphs and sessions](https://www.tensorflow.org/guide/graphs). We could initialise a session by hand, call its _run_ function for all tensorflow variables, obtain the result, and close the session afterwards. However, python's `with` keyword takes care of opening and closing the session in a much nicer way (and it makes `sess` within the `with` block the default session, which simplifies our run calls quite a lot):

In [None]:
with tf.Session() as sess:
    x.initializer.run()
    y.initializer.run()
    result = f.eval()
print("Result of the operation: %s" % result)

To avoid having to initialise _all_ variables by hand, we can create a global initialiser, and then simply need to call the `run` function on it once and evaluate our output tensor, as the following piece of code demonstrates:

In [None]:
init = tf.global_variables_initializer()

result = -1

with tf.Session() as sess:
    init.run()
    result = f.eval()
print("Result of the operation: %s" % result)

Another alternative, especially when using the interactive python interpreter, is to use an _interactive_ session ([tf.InteractiveSession](https://www.tensorflow.org/api_docs/python/tf/InteractiveSession)).

Another important thing to note is that individual graph runs do _not_ share values of individual nodes. They might evaluate to the same values for every single run, but each graph evaluation is done independently. On the one hand, this is great because it enables distribution of operations onto many different machines/resources. On the other hand, in some cases it might be desirable to evaluate two output tensors simultaneously, and to avoid double evaluation of some nodes. This is demonstrated in the following piece of code:

In [None]:
k = tf.constant(5)
m = k - 2
x = m + 5
y = m * 3

# Both x and y are evaluated in individual graph runs.
with tf.Session() as sess:
    print("Resulting values: %s, %s" % (x.eval(), y.eval()))
    
# x and y are evaluated simultaneously, i.e. the value
# of m is only computed once.
with tf.Session() as sess:
    x_val, y_val = sess.run([x, y])
    print("Resulting values: %s, %s" % (x_val, y_val))
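
The difference between the two variants can be made visible with a small counting experiment in plain Python. This is only a toy model under stated assumptions: `compute_m` and `run` are hypothetical helpers written for this sketch, not tensorflow functions. Each call to `run` stands for one graph run with its own cache of intermediate values.

```python
# Toy model, not TensorFlow code: count how often the shared node m is computed.
calls = {"m": 0}

def compute_m(k):
    calls["m"] += 1
    return k - 2

def run(fetches, k=5):
    """Evaluate the requested outputs in one 'run', computing m at most once."""
    cache = {}  # intermediate values live only for the duration of this run

    def m_value():
        if "m" not in cache:
            cache["m"] = compute_m(k)
        return cache["m"]

    outputs = {"x": lambda: m_value() + 5, "y": lambda: m_value() * 3}
    return [outputs[name]() for name in fetches]

# Two separate runs: m is computed twice.
x_single, = run(["x"])
y_single, = run(["y"])
separate_calls = calls["m"]

# One combined run: m is computed only once.
calls["m"] = 0
x_both, y_both = run(["x", "y"])
combined_calls = calls["m"]

print(x_single, y_single, separate_calls)  # 8 9 2
print(x_both, y_both, combined_calls)      # 8 9 1
```

The values are identical in both cases; only the amount of work differs.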

# Linear Regression

As we did with Scikit-Learn, tensorflow can also perform linear regression. Before trying the gradient descent method, we can build a tensorflow graph using the normal equation. If you don't remember the normal equation and how it can be used to solve linear regression problems, now is the perfect moment to go back and revise the previous content!

First, fetch an example dataset provided by Scikit-Learn (this might take a moment ...):

In [None]:
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
m, n = housing.data.shape
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]

Now let's build a tensorflow graph which calculates our theta, that is, our estimators for the values of the parameters minimising the cost function. Check out the [tf.linalg.matmul](https://www.tensorflow.org/api_docs/python/tf/linalg/matmul), [tf.transpose](https://www.tensorflow.org/api_docs/python/tf/transpose) and [tf.linalg.inv](https://www.tensorflow.org/api_docs/python/tf/linalg/inv) documentation if you'd like to learn more about the operations performed.

In [None]:
reset_graph()

X = tf.constant(housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
XT = tf.transpose(X)
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)

with tf.Session() as sess:
    theta_value = theta.eval()

In [None]:
print(theta_value)
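
As a cross-check of the formula itself, the same normal equation can be sketched with NumPy on a small synthetic dataset. The data, the variable names (`x1`, `y_toy`, `X_b`) and the true parameters 4 and 3 are assumptions made up for this illustration and have nothing to do with the housing data.

```python
import numpy as np

# Synthetic data (an assumption for this sketch): y = 4 + 3*x + small noise
rng = np.random.RandomState(42)
m_toy = 200
x1 = rng.uniform(-1.0, 1.0, size=(m_toy, 1))
y_toy = 4.0 + 3.0 * x1 + 0.1 * rng.randn(m_toy, 1)

# Add the bias column, as done for the housing data above.
X_b = np.c_[np.ones((m_toy, 1)), x1]

# Normal equation: theta = (X^T X)^{-1} X^T y
theta_inv = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y_toy

# The same solution from NumPy's least-squares solver, for comparison.
theta_lstsq = np.linalg.lstsq(X_b, y_toy, rcond=None)[0]

print(theta_inv.ravel())
print(theta_lstsq.ravel())
```

Both estimates should land close to the true parameters used to generate the data. Note that the tensorflow graph above stores `X` and `y` as `float32`, so small numerical differences with respect to a `float64` computation would not be surprising.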

For comparison, we can also use the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) class of Scikit-Learn for the same minimisation task. As you might remember, the syntax for Scikit-Learn would look like this:

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing.data, housing.target.reshape(-1, 1))

print(np.r_[lin_reg.intercept_.reshape(-1, 1), lin_reg.coef_.T])

Are the parameter values comparable between TF and Scikit-Learn?

Before moving from the normal equation to batch gradient descent, remember that we need to scale the input features (i.e. normalise them). Of course there are ways to do this in tensorflow, but Scikit-Learn's standard scaler works just as well.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_housing_data = scaler.fit_transform(housing.data)
scaled_housing_data_plus_bias = np.c_[np.ones((m, 1)), scaled_housing_data]
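
What the scaler does can be sketched by hand with NumPy: subtract each feature's mean and divide by its standard deviation. The tiny `features` array below is made up for illustration. The sketch uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses according to the Scikit-Learn documentation.

```python
import numpy as np

# Two made-up features on very different scales.
features = np.array([[1.0, 200.0],
                     [2.0, 400.0],
                     [3.0, 600.0],
                     [4.0, 800.0]])

mean = features.mean(axis=0)
std = features.std(axis=0)          # population standard deviation (ddof=0)
scaled = (features - mean) / std    # zero mean, unit variance per column

print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

After scaling, both columns have zero mean and unit standard deviation, so a single learning rate suits all features during gradient descent.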

The next step is to implement gradient descent. This is fairly straightforward: start with randomly distributed theta values, calculate the prediction based on the current theta, calculate the error of that prediction, and finally calculate the gradients based on this error. Then, the operation for each epoch is to update the values of theta with `theta - eta * gradients`, as implemented below:

In [None]:
# Reset the graph so that we don't use any previous results.
reset_graph()

# Let's start with a low learning rate and many epochs.
n_epochs = 1000
learning_rate = 0.01

# The tf input tensors: X for the input features of the data,
# y for the target values (this is a regression problem!).
X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")

# Start with a randomly initialised set of estimators for theta.
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")

# Calculate the predicted values, their errors w.r.t. the
# true values, and the mean squared error (MSE). Remember that
# all of this only builds a graph in tf, these are *not*
# actual computation operations at the moment.
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

# Calculate the gradients by hand, m is the size of the dataset.
gradients = 2/m * tf.matmul(tf.transpose(X), error)

# Calculate the gradients using tf's automated differentiation.
# gradients = tf.gradients(mse, [theta])[0]

# And finally define the training operation: update the values
# for theta, following the calculated gradients.
training_op = tf.assign(theta, theta - learning_rate * gradients)

# Initialise all variables.
init = tf.global_variables_initializer()

# And open a session to execute the graph.
with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)
   
    # Get the best value for theta by calling eval().
    best_theta = theta.eval()

Good, the MSE is decreasing, but what are the best values for theta?

In [None]:
print(best_theta)
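
For reference, the same update rule can be written in a few lines of NumPy. This is a sketch on synthetic data, not a replacement for the graph above: the dataset, `true_theta` and the learning rate `eta = 0.1` are assumptions chosen for this illustration. The result is compared with the closed-form least-squares solution.

```python
import numpy as np

# Synthetic, already well-scaled data (an assumption for this sketch).
rng = np.random.RandomState(42)
m_toy, n_toy = 500, 2
features = rng.randn(m_toy, n_toy)
X_b = np.c_[np.ones((m_toy, 1)), features]
true_theta = np.array([[2.0], [-1.0], [0.5]])
y_toy = X_b @ true_theta + 0.01 * rng.randn(m_toy, 1)

eta = 0.1
theta_gd = rng.uniform(-1.0, 1.0, size=(n_toy + 1, 1))  # random start
for epoch in range(1000):
    error = X_b @ theta_gd - y_toy            # prediction error
    gradients = 2 / m_toy * X_b.T @ error     # gradient of the MSE
    theta_gd = theta_gd - eta * gradients     # theta - eta * gradients

# Closed-form least-squares solution for comparison.
theta_exact = np.linalg.lstsq(X_b, y_toy, rcond=None)[0]
print(np.abs(theta_gd - theta_exact).max())
```

On scaled features like these, batch gradient descent converges to the same parameters as the closed-form solution.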

The above code currently implements the gradients by hand. However, this might not be very efficient computationally for more complex problems. Imagine having to calculate derivatives in many nodes of a neural network ... To avoid duplicate evaluations of derivatives, for example if the chain rule can be applied in some of them, we can make use of tf's sophisticated algorithms for differentiation, called _reverse-mode autodiff_. If you haven't heard of automatic differentiation and its reverse mode, this would be a good time to read up on it.

You can try the magic by replacing the line `gradients = 2/m * ...` in the code cell above with the commented-out line `gradients = tf.gradients(mse, [theta])[0]`, which will activate tf's autodiff algorithm instead. In the above example, this probably doesn't make a huge difference, but in multi-layer neural nets it certainly will.
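
To get a feeling for how reverse-mode autodiff works, here is a toy implementation for scalar additions and multiplications in plain Python. The `Var` class is hypothetical and written only for this sketch; it is not how tensorflow is implemented. It only shows the principle: record the local derivatives during the forward pass, then propagate gradients backwards through the graph in a single pass.

```python
# Toy reverse-mode autodiff for scalars; a sketch, not TensorFlow's implementation.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        """Propagate d(self)/d(node) to all ancestors in one backward pass."""
        # Build a topological order so that a node's gradient is complete
        # before it is passed on to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local  # chain rule


# f(x, y) = x*x*y + y, so df/dx = 2*x*y and df/dy = x*x + 1
x_v, y_v = Var(3.0), Var(4.0)
f_v = x_v * x_v * y_v + y_v
f_v.backward()
print(f_v.value, x_v.grad, y_v.grad)  # 40.0 24.0 10.0
```

One backward pass yields the derivatives with respect to all inputs at once, which is why the reverse mode suits functions with many parameters and a single output, such as a cost function.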

# Feeding data to the training algorithm

## Mini-batch Gradient Descent

In [None]:
n_epochs = 1000
learning_rate = 0.01

In [None]:
reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")

In [None]:
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

In [None]:
n_epochs = 10

In [None]:
batch_size = 100
n_batches = int(np.ceil(m / batch_size))

In [None]:
def fetch_batch(epoch, batch_index, batch_size):
    np.random.seed(epoch * n_batches + batch_index)  # not shown in the book
    indices = np.random.randint(m, size=batch_size)  # not shown
    X_batch = scaled_housing_data_plus_bias[indices] # not shown
    y_batch = housing.target.reshape(-1, 1)[indices] # not shown
    return X_batch, y_batch

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()

In [None]:
best_theta
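
Note that `fetch_batch` above draws its indices with `np.random.randint`, i.e. with replacement, so within one epoch some instances may appear several times and others not at all. A common alternative is to shuffle the data once per epoch and then walk through it in slices. The following is a sketch of that variant: `iterate_minibatches`, `X_toy` and `y_toy_batches` are hypothetical names introduced for this illustration and are not part of the original notebook.

```python
import numpy as np

def iterate_minibatches(X_data, y_data, batch_size, rng):
    """Yield mini-batches that together cover every row exactly once."""
    indices = rng.permutation(len(X_data))  # new random order for this epoch
    for start in range(0, len(X_data), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X_data[batch_idx], y_data[batch_idx]

rng = np.random.RandomState(42)
X_toy = np.arange(50, dtype=np.float32).reshape(25, 2)
y_toy_batches = np.arange(25, dtype=np.float32).reshape(-1, 1)

batches = list(iterate_minibatches(X_toy, y_toy_batches, batch_size=10, rng=rng))
print([len(X_part) for X_part, _ in batches])  # [10, 10, 5]
```

Each yielded pair could be passed to `sess.run` through `feed_dict` in the same way as the output of `fetch_batch`.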

# Visualisation with TensorBoard

In [None]:
reset_graph()

from datetime import datetime

now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir = "tf_logs"
logdir = "{}/run-{}/".format(root_logdir, now)

In [None]:
n_epochs = 1000
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

In [None]:
mse_summary = tf.summary.scalar('MSE', mse)
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

In [None]:
n_epochs = 10
batch_size = 100
n_batches = int(np.ceil(m / batch_size))

In [None]:
with tf.Session() as sess:                                                        # not shown in the book
    sess.run(init)                                                                # not shown

    for epoch in range(n_epochs):                                                 # not shown
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            if batch_index % 10 == 0:
                summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
                step = epoch * n_batches + batch_index
                file_writer.add_summary(summary_str, step)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()                                                     # not shown

In [None]:
file_writer.close()

In [None]:
best_theta
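
The bookkeeping in the training loop above (a summary every tenth batch, with `step = epoch * n_batches + batch_index`) can be checked with a few lines of plain Python. The sketch assumes that the California housing dataset has 20640 rows; all names ending in `_demo` or `_assumed` are introduced here only to avoid overwriting the notebook's variables.

```python
import math

m_assumed = 20640      # assumption: number of rows in the California housing data
batch_size_demo = 100
n_epochs_demo = 10

n_batches_demo = math.ceil(m_assumed / batch_size_demo)

# Summaries are written whenever batch_index % 10 == 0.
logged_per_epoch = len(range(0, n_batches_demo, 10))
steps_logged = [epoch * n_batches_demo + batch_index
                for epoch in range(n_epochs_demo)
                for batch_index in range(0, n_batches_demo, 10)]

print(n_batches_demo, logged_per_epoch, len(steps_logged))  # 207 21 210
print(steps_logged[:3], steps_logged[-1])                   # [0, 10, 20] 2063
```

The step values increase monotonically across epochs, which is what TensorBoard needs for the x-axis of the MSE curve. The logs written by the `FileWriter` can then be inspected by starting TensorBoard on the root log directory, e.g. `tensorboard --logdir tf_logs`.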

# Name scopes

In [None]:
reset_graph()

now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir = "tf_logs"
logdir = "{}/run-{}/".format(root_logdir, now)

n_epochs = 1000
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")

In [None]:
with tf.name_scope("loss") as scope:
    error = y_pred - y
    mse = tf.reduce_mean(tf.square(error), name="mse")

In [None]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

mse_summary = tf.summary.scalar('MSE', mse)
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

In [None]:
n_epochs = 10
batch_size = 100
n_batches = int(np.ceil(m / batch_size))

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            if batch_index % 10 == 0:
                summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
                step = epoch * n_batches + batch_index
                file_writer.add_summary(summary_str, step)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()

file_writer.flush()
file_writer.close()
print("Best theta:")
print(best_theta)

In [None]:
print(error.op.name)

In [None]:
print(mse.op.name)

In [None]:
reset_graph()

a1 = tf.Variable(0, name="a")      # name == "a"
a2 = tf.Variable(0, name="a")      # name == "a_1"

with tf.name_scope("param"):       # name == "param"
    a3 = tf.Variable(0, name="a")  # name == "param/a"

with tf.name_scope("param"):       # name == "param_1"
    a4 = tf.Variable(0, name="a")  # name == "param_1/a"

for node in (a1, a2, a3, a4):
    print(node.op.name)
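
The naming pattern printed above can be mimicked with a small counter-based registry. This is a toy model only: `NameRegistry` is a hypothetical helper written for this sketch, it is not the tensorflow implementation, and it ignores corner cases such as a user explicitly asking for a name like `a_1`.

```python
# Toy model of name uniquification; hypothetical helper, not TensorFlow's code.
class NameRegistry:
    def __init__(self):
        self.used = {}  # how often each requested name has been seen

    def unique(self, name):
        """Return name on first use, then name_1, name_2, ..."""
        count = self.used.get(name, 0)
        self.used[name] = count + 1
        return name if count == 0 else "%s_%d" % (name, count)


registry = NameRegistry()
names = [registry.unique("a"), registry.unique("a")]
for _ in range(2):
    scope_name = registry.unique("param")          # the scope is made unique first
    names.append(registry.unique(scope_name + "/a"))

print(names)  # ['a', 'a_1', 'param/a', 'param_1/a']
```

The key point is that a scope name is made unique as well, so the same variable name can be reused inside different scopes without clashes.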

# Modularity

Ugly, flat code:

In [None]:
reset_graph()

n_features = 3
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")

w1 = tf.Variable(tf.random_normal((n_features, 1)), name="weights1")
w2 = tf.Variable(tf.random_normal((n_features, 1)), name="weights2")
b1 = tf.Variable(0.0, name="bias1")
b2 = tf.Variable(0.0, name="bias2")

z1 = tf.add(tf.matmul(X, w1), b1, name="z1")
z2 = tf.add(tf.matmul(X, w2), b2, name="z2")

relu1 = tf.maximum(z1, 0., name="relu1")
relu2 = tf.maximum(z1, 0., name="relu2")  # Oops, cut&paste error! Did you spot it?

output = tf.add(relu1, relu2, name="output")

Much better, using a function to build the ReLUs:

In [None]:
reset_graph()

def relu(X):
    w_shape = (int(X.get_shape()[1]), 1)
    w = tf.Variable(tf.random_normal(w_shape), name="weights")
    b = tf.Variable(0.0, name="bias")
    z = tf.add(tf.matmul(X, w), b, name="z")
    return tf.maximum(z, 0., name="relu")

n_features = 3
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = [relu(X) for i in range(5)]
output = tf.add_n(relus, name="output")

In [None]:
file_writer = tf.summary.FileWriter("logs/relu1", tf.get_default_graph())
file_writer.close()

Even better using name scopes:

In [None]:
reset_graph()

def relu(X):
    with tf.name_scope("relu"):
        w_shape = (int(X.get_shape()[1]), 1)                          # not shown in the book
        w = tf.Variable(tf.random_normal(w_shape), name="weights")    # not shown
        b = tf.Variable(0.0, name="bias")                             # not shown
        z = tf.add(tf.matmul(X, w), b, name="z")                      # not shown
        return tf.maximum(z, 0., name="max")                          # not shown

In [None]:
n_features = 3
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = [relu(X) for i in range(5)]
output = tf.add_n(relus, name="output")

file_writer = tf.summary.FileWriter("logs/relu2", tf.get_default_graph())
file_writer.close()

## Sharing Variables

Sharing a `threshold` variable the classic way, by defining it outside of the `relu()` function then passing it as a parameter:

In [None]:
reset_graph()

def relu(X, threshold):
    with tf.name_scope("relu"):
        w_shape = (int(X.get_shape()[1]), 1)                        # not shown in the book
        w = tf.Variable(tf.random_normal(w_shape), name="weights")  # not shown
        b = tf.Variable(0.0, name="bias")                           # not shown
        z = tf.add(tf.matmul(X, w), b, name="z")                    # not shown
        return tf.maximum(z, threshold, name="max")

threshold = tf.Variable(0.0, name="threshold")
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = [relu(X, threshold) for i in range(5)]
output = tf.add_n(relus, name="output")

In [None]:
reset_graph()

def relu(X):
    with tf.name_scope("relu"):
        if not hasattr(relu, "threshold"):
            relu.threshold = tf.Variable(0.0, name="threshold")
        w_shape = int(X.get_shape()[1]), 1                          # not shown in the book
        w = tf.Variable(tf.random_normal(w_shape), name="weights")  # not shown
        b = tf.Variable(0.0, name="bias")                           # not shown
        z = tf.add(tf.matmul(X, w), b, name="z")                    # not shown
        return tf.maximum(z, relu.threshold, name="max")

In [None]:
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = [relu(X) for i in range(5)]
output = tf.add_n(relus, name="output")

In [None]:
reset_graph()

with tf.variable_scope("relu"):
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))

In [None]:
with tf.variable_scope("relu", reuse=True):
    threshold = tf.get_variable("threshold")

In [None]:
with tf.variable_scope("relu") as scope:
    scope.reuse_variables()
    threshold = tf.get_variable("threshold")
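
The three cells above rely on the create-or-reuse logic of `tf.get_variable`: without `reuse`, asking for a name that already exists is an error, and with `reuse`, asking for a name that does not exist yet is an error. The following plain-Python sketch mimics that behaviour. `VariableStore` is a hypothetical class invented for this illustration, not tensorflow code, and it stores simple dictionaries instead of real variables.

```python
# Toy model of get_variable-style sharing; hypothetical, not TensorFlow's code.
class VariableStore:
    def __init__(self):
        self.variables = {}

    def get_variable(self, scope, name, reuse=False, initial=0.0):
        key = scope + "/" + name
        if reuse:
            # Reuse mode: the variable must have been created before.
            if key not in self.variables:
                raise ValueError("Variable %s does not exist" % key)
            return self.variables[key]
        # Creation mode: the variable must not exist yet.
        if key in self.variables:
            raise ValueError("Variable %s already exists" % key)
        self.variables[key] = {"name": key, "value": initial}
        return self.variables[key]


store = VariableStore()
created = store.get_variable("relu", "threshold")
shared = store.get_variable("relu", "threshold", reuse=True)
print(shared is created)  # True: both names refer to the same object

try:
    store.get_variable("relu", "threshold")  # second creation without reuse
except ValueError as exc:
    print("Error:", exc)
```

This explicit distinction between creating and reusing is meant to prevent sharing a variable by accident.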

In [None]:
reset_graph()

def relu(X):
    with tf.variable_scope("relu", reuse=True):
        threshold = tf.get_variable("threshold")
        w_shape = int(X.get_shape()[1]), 1                          # not shown
        w = tf.Variable(tf.random_normal(w_shape), name="weights")  # not shown
        b = tf.Variable(0.0, name="bias")                           # not shown
        z = tf.add(tf.matmul(X, w), b, name="z")                    # not shown
        return tf.maximum(z, threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
with tf.variable_scope("relu"):
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))
relus = [relu(X) for relu_index in range(5)]
output = tf.add_n(relus, name="output")

In [None]:
file_writer = tf.summary.FileWriter("logs/relu6", tf.get_default_graph())
file_writer.close()

In [None]:
reset_graph()

def relu(X):
    with tf.variable_scope("relu"):
        threshold = tf.get_variable("threshold", shape=(), initializer=tf.constant_initializer(0.0))
        w_shape = (int(X.get_shape()[1]), 1)
        w = tf.Variable(tf.random_normal(w_shape), name="weights")
        b = tf.Variable(0.0, name="bias")
        z = tf.add(tf.matmul(X, w), b, name="z")
        return tf.maximum(z, threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
with tf.variable_scope("", default_name="") as scope:
    first_relu = relu(X)     # create the shared variable
    scope.reuse_variables()  # then reuse it
    relus = [first_relu] + [relu(X) for i in range(4)]
output = tf.add_n(relus, name="output")

file_writer = tf.summary.FileWriter("logs/relu8", tf.get_default_graph())
file_writer.close()

In [None]:
reset_graph()

def relu(X):
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))
    w_shape = (int(X.get_shape()[1]), 1)                        # not shown in the book
    w = tf.Variable(tf.random_normal(w_shape), name="weights")  # not shown
    b = tf.Variable(0.0, name="bias")                           # not shown
    z = tf.add(tf.matmul(X, w), b, name="z")                    # not shown
    return tf.maximum(z, threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = []
for relu_index in range(5):
    with tf.variable_scope("relu", reuse=(relu_index >= 1)) as scope:
        relus.append(relu(X))
output = tf.add_n(relus, name="output")

In [None]:
file_writer = tf.summary.FileWriter("logs/relu9", tf.get_default_graph())
file_writer.close()