# Task 11: Starting with tensorflow

_All credit for the code examples of this notebook goes to the book "Hands-On Machine Learning with Scikit-Learn & TensorFlow" by A. Géron. Modifications were made and text was added by K. Zoch in preparation for the hands-on sessions._

# Setup

First, import a few common modules, ensure Matplotlib plots figures inline and prepare a function to save the figures:

In [None]:
# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Reset the default graph and reseed tensorflow and NumPy, so that
# every example starts from the same, reproducible state.
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# Function to save a figure. All output files are stored
# in the subdirectory 'output/tensorflow'.
PROJECT_ROOT_DIR = "."
EXERCISE = "tensorflow"

def save_fig(fig_id, tight_layout=True):
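    path = os.path.join(PROJECT_ROOT_DIR, "output", EXERCISE, fig_id + ".png")
    # Create the output directory if it does not exist yet.
    os.makedirs(os.path.dirname(path), exist_ok=True)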
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

# Creating graphs

As we learnt in the lecture, tensorflow is based on _graph building_. It is important to understand that building a graph such as the one below does _not_ yet compute any results. That only happens when the graph is run in a tensorflow _session_. What result would you expect from the graph created below?

In [None]:
import tensorflow as tf

reset_graph()

x = tf.Variable(5, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + 3*y - 2*x + 1

The line below displays information about the type of `f`. As you can see, `f` is actually an output _tensor_; check out the [tf.Tensor](https://www.tensorflow.org/api_docs/python/tf/Tensor) class documentation and also the tensorflow [guide on tensors](https://www.tensorflow.org/guide/tensors). Quite generally, tensorflow's documentation and guides are worth reading, and spending some time on them will help you understand the basics of the low-level tensorflow API.

In [None]:
f
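
To make the distinction between building and running a graph concrete, here is a minimal pure-Python sketch of deferred evaluation. The `Node` class and its `evaluate` method are hypothetical and invented for this illustration; they are not part of tensorflow and only mimic the idea.

```python
import operator


class Node:
    """Hypothetical toy graph node (not a tensorflow class)."""

    def __init__(self, op=None, inputs=(), value=None):
        self.op = op          # function combining the input values, None for a leaf
        self.inputs = inputs  # upstream nodes
        self.value = value    # stored value, used by leaf nodes only

    @staticmethod
    def wrap(other):
        # Turn plain Python numbers into leaf nodes.
        return other if isinstance(other, Node) else Node(value=other)

    def __add__(self, other):
        return Node(operator.add, (self, Node.wrap(other)))

    def __sub__(self, other):
        return Node(operator.sub, (self, Node.wrap(other)))

    def __mul__(self, other):
        return Node(operator.mul, (self, Node.wrap(other)))

    def __rmul__(self, other):
        return Node(operator.mul, (Node.wrap(other), self))

    def evaluate(self):
        # Only now is anything computed, recursively from the leaves up.
        if self.op is None:
            return self.value
        return self.op(*(node.evaluate() for node in self.inputs))


x_node = Node(value=5)
y_node = Node(value=4)

# Building the expression only wires nodes together; nothing is computed yet.
f_node = x_node * x_node * y_node + 3 * y_node - 2 * x_node + 1

print(type(f_node).__name__)  # the expression is a Node object, not a number
print(f_node.evaluate())      # the computation happens here
```

In this toy version `evaluate` plays the role that running the graph in a session plays in tensorflow.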

Now, to execute the above graph, we still need to create a [tf.Session](https://www.tensorflow.org/api_docs/python/tf/Session). Again, there is a very good [guide on graphs and sessions](https://www.tensorflow.org/guide/graphs). We could initialise a session, call its _run_ function for all tensorflow variables, obtain the result, and close the session afterwards. However, python's `with` keyword takes care of opening and closing the session in a much nicer way (and it makes `sess` the default session within the `with` block, which simplifies our run calls quite a lot):

In [None]:
with tf.Session() as sess:
    x.initializer.run()
    y.initializer.run()
    result = f.eval()
print("Result of the operation: %s" % result)

To avoid having to initialise _all_ variables by hand, we can create a global initialiser. We then only need to call its `run` function once before evaluating our output tensor, as the following piece of code demonstrates:

In [None]:
init = tf.global_variables_initializer()

result = -1

with tf.Session() as sess:
    init.run()
    result = f.eval()
print("Result of the operation: %s" % result)

Another alternative, especially when using the interactive python interpreter, is to use an _interactive_ session ([tf.InteractiveSession](https://www.tensorflow.org/api_docs/python/tf/InteractiveSession)).

Another important thing to note is that individual graph runs do _not_ share values of individual nodes. They might evaluate to the same values for every single run, but each graph evaluation is done independently. On the one hand, this is great because it enables distribution of operations onto many different machines/resources. On the other hand, in some cases it might be desirable to evaluate two output tensors simultaneously, and to avoid double evaluation of some nodes. This is demonstrated in the following piece of code:

In [None]:
k = tf.constant(5)
m = k - 2
x = m + 5
y = m * 3

# x and y are evaluated in two separate graph runs,
# i.e. the value of m is computed twice.
with tf.Session() as sess:
    print("Resulting values: %s, %s" % (x.eval(), y.eval()))
    
# x and y are evaluated simultaneously, i.e. the value
# of m is only computed once.
with tf.Session() as sess:
    x_val, y_val = sess.run([x, y])
    print("Resulting values: %s, %s" % (x_val, y_val))
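
The difference between the two sessions above can be mimicked in plain Python by counting how often the intermediate value is computed. The functions `compute_m`, `run_separately` and `run_together` are hypothetical stand-ins written for this illustration, not tensorflow API.

```python
# Counter for how often the intermediate node "m" is computed.
m_evaluations = 0


def compute_m(k=5):
    """Stand-in for the graph node m = k - 2."""
    global m_evaluations
    m_evaluations += 1
    return k - 2


def run_separately():
    # Mimics x.eval() followed by y.eval(): each call recomputes m.
    x_val = compute_m() + 5
    y_val = compute_m() * 3
    return x_val, y_val


def run_together():
    # Mimics sess.run([x, y]): m is computed once and reused.
    m_val = compute_m()
    return m_val + 5, m_val * 3


separate = run_separately()
count_separate = m_evaluations

m_evaluations = 0
together = run_together()
count_together = m_evaluations

print("Separate runs:", separate, "with m computed", count_separate, "times")
print("Single run:   ", together, "with m computed", count_together, "time(s)")
```

Both variants give the same values; only the amount of repeated work differs, which matters once the shared node is expensive to compute.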

# Linear Regression

As we did with Scikit-Learn, tensorflow can also perform linear regression. Before trying the gradient descent method, we can build a tensorflow graph using the normal equation. If you don't remember the normal equation and how it can be used to solve linear regression problems, now is the perfect moment to go back and revise the previous content!
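
As a short reminder: for a feature matrix X (including the bias column of ones) and a target vector y, the normal equation gives the parameter vector that minimises the mean squared error in closed form, assuming that the matrix product of X transposed and X is invertible:

```latex
\hat{\boldsymbol{\theta}} = \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{y}
```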

First, fetch an example dataset provided by Scikit-Learn (this might take a moment ...):

In [None]:
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
m, n = housing.data.shape
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]

Now let's build a tensorflow graph which calculates our theta, that is, our estimators for the values of the parameters minimising the cost function. Check out the [tf.linalg.matmul](https://www.tensorflow.org/api_docs/python/tf/linalg/matmul), [tf.transpose](https://www.tensorflow.org/api_docs/python/tf/transpose) and [tf.linalg.inv](https://www.tensorflow.org/api_docs/python/tf/linalg/inv) documentation if you'd like to learn more about the operations performed (the code below calls the latter via its alias `tf.matrix_inverse`).

In [None]:
reset_graph()

X = tf.constant(housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
XT = tf.transpose(X)
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)

with tf.Session() as sess:
    theta_value = theta.eval()

In [None]:
print(theta_value)

For comparison, we can also use the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) class of Scikit-Learn for the same minimisation task. As you might remember, the syntax for Scikit-Learn would look like this:

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing.data, housing.target.reshape(-1, 1))

print(np.r_[lin_reg.intercept_.reshape(-1, 1), lin_reg.coef_.T])

Are the parameter values comparable between TF and Scikit-Learn?
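
To see how the closed-form solution and a least-squares solver agree on a problem where the answer is known, here is a small NumPy sketch on synthetic, noise-free data. The data, the coefficient values in `true_theta` and all variable names are made up for this illustration and are unrelated to the housing dataset.

```python
import numpy as np

rng = np.random.RandomState(42)

# Synthetic data: 50 instances, 2 features, no noise (illustrative values).
features = rng.rand(50, 2)
true_theta = np.array([[4.0], [3.0], [-2.0]])  # bias, then two weights

# Prepend a column of ones for the bias term.
features_plus_bias = np.c_[np.ones((50, 1)), features]
targets = features_plus_bias.dot(true_theta)

# Normal equation: theta = (X^T X)^-1 X^T y
XtX = features_plus_bias.T.dot(features_plus_bias)
theta_normal = np.linalg.inv(XtX).dot(features_plus_bias.T).dot(targets)

# A least-squares solver reaches the same result without an explicit inverse.
theta_lstsq, _, _, _ = np.linalg.lstsq(features_plus_bias, targets, rcond=None)

print(theta_normal.ravel())
print(theta_lstsq.ravel())
```

On the housing data, small differences between the two results above would not be surprising, since the tensorflow graph was built with 32-bit floats.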

Before moving from the normal equation to batch gradient descent, remember that we need to scale the input features (i.e. normalise them). Of course there are ways to do this in tensorflow, but Scikit-Learn's standard scaler works just as well.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_housing_data = scaler.fit_transform(housing.data)
scaled_housing_data_plus_bias = np.c_[np.ones((m, 1)), scaled_housing_data]
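
What the scaler does can be sketched by hand in NumPy: subtract each column's mean and divide by its standard deviation. The toy matrix below is made up for illustration, and the sketch assumes that no column is constant (otherwise the division would be by zero).

```python
import numpy as np

# Toy feature matrix with very different column scales (illustrative values).
toy_features = np.array([[1.0, 200.0],
                         [2.0, 300.0],
                         [3.0, 400.0],
                         [4.0, 500.0]])

# Standardise each column: zero mean, unit (population) standard deviation.
column_means = toy_features.mean(axis=0)
column_stds = toy_features.std(axis=0)
toy_scaled = (toy_features - column_means) / column_stds

print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```

After scaling, both columns live on the same scale, which is what keeps a single learning rate suitable for all parameters in gradient descent.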

The next step is to implement gradient descent. This is fairly straightforward: start with randomly distributed theta values, calculate the prediction based on the current theta, calculate the error of that prediction, and finally calculate the gradients based on this error. Then, the operation for each epoch is to update the values of theta with `theta - eta * gradients` (where `eta` is the learning rate), as implemented below:

In [None]:
# Reset the graph so that we don't use any previous results.
reset_graph()

# Let's start with a low learning rate and many epochs.
n_epochs = 1000
learning_rate = 0.01

# The tf input tensors: X for the input features of the data,
# y for the target values (this is a regression problem!).
X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")

# Start with a randomly initialised set of estimators for theta.
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")

# Calculate the predicted values, their errors w.r.t. the
# true values, and the mean squared error (MSE). Remember that
# all of this only builds a graph in tf, these are *not*
# actual computation operations at the moment.
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

# Calculate the gradients by hand, m is the size of the dataset.
gradients = 2/m * tf.matmul(tf.transpose(X), error)

# Calculate the gradients using tf's automatic differentiation.
# gradients = tf.gradients(mse, [theta])[0]

# And finally define the training operation: update the values
# for theta, following the calculated gradients.
training_op = tf.assign(theta, theta - learning_rate * gradients)

# Initialise all variables.
init = tf.global_variables_initializer()

# And open a session to execute the graph.
with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)
   
    # Get the best value for theta by calling eval().
    best_theta = theta.eval()

Good, the MSE is decreasing, but what are the best values for theta?

In [None]:
print(best_theta)
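
The same update rule can be written directly in NumPy, which makes each step of an epoch explicit. This is a sketch on synthetic, noise-free data with made-up coefficients (`true_theta`), not on the housing data; the learning rate `eta` and the number of epochs are illustrative choices.

```python
import numpy as np

rng = np.random.RandomState(42)

# Synthetic, already well-scaled data: 100 instances, bias column plus 2 features.
n_samples = 100
X_toy = np.c_[np.ones((n_samples, 1)), rng.randn(n_samples, 2)]
true_theta = np.array([[2.0], [0.5], [-1.0]])  # illustrative values
y_toy = X_toy.dot(true_theta)

eta = 0.1                                        # learning rate
theta_gd = rng.uniform(-1.0, 1.0, size=(3, 1))   # random starting point

for epoch in range(1000):
    residual = X_toy.dot(theta_gd) - y_toy       # prediction error
    grad = 2 / n_samples * X_toy.T.dot(residual) # gradient of the MSE
    theta_gd = theta_gd - eta * grad             # gradient descent step

mse_final = np.mean((X_toy.dot(theta_gd) - y_toy) ** 2)
print(theta_gd.ravel(), mse_final)
```

Because the toy data contain no noise, the loop should end up very close to the coefficients used to generate them.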

The above code currently implements the gradients by hand. However, this might not be very efficient computationally for more complex problems. Imagine having to calculate derivatives in many nodes of a neural network ... To avoid duplicate evaluations of derivatives, for example where the chain rule lets several of them share intermediate results, we can make use of tf's automatic differentiation, which uses an algorithm called _reverse-mode autodiff_. If you haven't heard of automatic differentiation and its reverse mode, this would be a good time to read up on it.

You can try the magic by replacing the line `gradients = 2/m * [...]` in the code cell above with the commented-out line `gradients = tf.gradients(mse, [theta])[0]` right below it, which will activate tf's autodiff algorithm instead. In the above example, this probably doesn't make a huge difference, but in multi-layer neural nets it certainly will.
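
The idea behind reverse-mode autodiff can be sketched in a few lines of pure Python for scalar expressions. The `Var` class below is hypothetical and written only for this illustration; it is not how tensorflow implements `tf.gradients`, merely a toy version of the same principle: record the local derivative at every node while building the expression, then walk the graph once from the output back to the inputs.

```python
class Var:
    """Hypothetical scalar node for a toy reverse-mode autodiff sketch."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent node, local derivative)
        self.grad = 0.0

    @staticmethod
    def wrap(other):
        return other if isinstance(other, Var) else Var(other)

    def __add__(self, other):
        other = Var.wrap(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        other = Var.wrap(other)
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        other = Var.wrap(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    # Addition and multiplication commute, so the reflected versions can reuse them.
    __radd__ = __add__
    __rmul__ = __mul__

    def backward(self):
        # Order the nodes so that each one comes after everything it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        # Walk the graph once from the output back to the inputs,
        # applying the chain rule at every node.
        self.grad = 1.0
        for node in reversed(order):
            for parent, local_derivative in node.parents:
                parent.grad += node.grad * local_derivative


x_var, y_var = Var(5.0), Var(4.0)
f_var = x_var * x_var * y_var + 3 * y_var - 2 * x_var + 1
f_var.backward()

# Analytically: df/dx = 2*x*y - 2 and df/dy = x*x + 3.
print(f_var.value, x_var.grad, y_var.grad)
```

A single backward pass yields the derivatives with respect to all inputs at once, which is why this mode suits functions with many parameters and a single output such as a cost function.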

# Visualisation with TensorBoard

The following section gives you an idea of how powerful tf's _TensorBoard_ is. TensorBoard is a visualisation tool shipped with tf and can be very helpful for debugging and/or understanding your model better. Again, the tf [guide on TensorBoard](https://www.tensorflow.org/guide/summaries_and_tensorboard) is worth reading.

To open TensorBoard, first open a terminal, navigate to the folder in which this jupyter notebook is placed, and type `tensorboard --logdir tf_logs`. This should give you some terminal output, which means that the TensorBoard instance has started. Then, open a new tab in your browser and type `localhost:6006` in the address bar. You should reach the TensorBoard main page.

The following code first sets up the logging of graph runs in tensorflow (this is needed by TensorBoard to visualise them).

In [None]:
reset_graph()

from datetime import datetime

now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir = "tf_logs"
logdir = "{}/run-{}/".format(root_logdir, now)

The following code block sets up a mini-batch gradient descent model trained on the "housing" data that we already used in the code examples above. Instead of feeding the whole dataset into constant input tensors, we need to set up [tf.placeholder](https://www.tensorflow.org/api_docs/python/tf/placeholder) objects whose values we provide at runtime. This is necessary because we only provide the dataset in mini-batches, one per training step. The rest essentially works as above (with a few commands added to provide logging information for TensorBoard).

In [None]:
learning_rate = 0.01
n_epochs = 10
batch_size = 100
n_batches = int(np.ceil(m / batch_size))

# Define a small function to create mini-batches.
def fetch_batch(epoch, batch_index, batch_size):
    np.random.seed(epoch * n_batches + batch_index)
    indices = np.random.randint(m, size=batch_size)
    X_batch = scaled_housing_data_plus_bias[indices]
    y_batch = housing.target.reshape(-1, 1)[indices]
    return X_batch, y_batch

# Define placeholders for the input features and target values,
# because we need to set them at runtime (since we're working
# with mini-batches, not the entire dataset).
X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")

# Initialise theta randomly, then define the predicted values,
# their errors and the mean squared error (MSE).
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

# Use a GradientDescentOptimizer and define the training operation.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

# Set up writing into a file. This is needed for logging
# details of the operations for the visualisation.
mse_summary = tf.summary.scalar('MSE', mse)
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

# Start the session run.
with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            # Fetch the mini-batch with the function defined before.
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            # Save information for every tenth batch.
            if batch_index % 10 == 0:
                summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
                step = epoch * n_batches + batch_index
                file_writer.add_summary(summary_str, step)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()   
    
file_writer.close()

What are our estimators for theta now? Are they comparable to what was achieved with batch gradient descent above?

In [None]:
print(best_theta)
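
Note that `fetch_batch` above draws its indices with `np.random.randint`, i.e. with replacement, so a given epoch does not necessarily visit every instance exactly once. A common alternative, sketched here as an assumption rather than as the notebook's method, is to shuffle the indices once per epoch and split them into batches. The function name `iterate_minibatches` and the toy sizes are hypothetical.

```python
import numpy as np


def iterate_minibatches(n_instances, batch_size, seed):
    """Yield index arrays that together cover every instance exactly once."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_instances)
    n_batches = int(np.ceil(n_instances / batch_size))
    # array_split tolerates a number of instances that is not a multiple
    # of the batch size and returns batches of nearly equal length.
    for batch_indices in np.array_split(shuffled, n_batches):
        yield batch_indices


# Toy example: 10 instances in batches of at most 4.
batches = list(iterate_minibatches(n_instances=10, batch_size=4, seed=0))
for batch_indices in batches:
    print(batch_indices)
```

The returned index arrays could be used in the same way as `indices` inside `fetch_batch`, for instance to select rows of the scaled feature matrix.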

Now have a look at the TensorBoard tab in your browser (you may have to refresh). You should now find _a lot_ of information about your model, the training procedure, etc. Feel free to browse around and explore the features.

With more complex models, overview graphs in TensorBoard can become very cluttered. Of course, tensorflow provides functionality to deal with that: name scopes for nodes. They allow you to group nodes into different contexts that are collapsed into one node in the graph, but can be expanded if needed. Designing the code in a modular way avoids duplication of identical nodes that are only valid in a limited scope anyway. Nodes can be shared amongst multiple components of the same graph if needed. Feel free to read up more on this by searching for "name scope", "variable scope", or "sharing variables" (and of course "tensorflow").