# TensorFlow <a class="anchor" id="top"></a>

This notebook follows Aurélien Géron's textbook *Hands-On Machine Learning with Scikit-Learn and TensorFlow* and its associated datasets ([Link to Github Repo](https://github.com/ageron/handson-ml/)). Its contents are my notes from reading the textbook.

### Notebook by Justin Bandoro

TensorFlow is an open source library for numerical computation that is well-suited for Machine Learning. You first define a graph of computations and then TensorFlow takes the graph and runs it efficiently using optimized `C++` code. An example is shown below:

<img src='tf_example.png'>

We can break up the computation into chunks and run them in parallel across multiple CPUs or GPUs. This also makes it possible to train neural networks on massive training sets in a reasonable amount of time.

* [1. First Graph and Session <a class="anchor" id="first"></a>](#1.-First-Graph-and-Session-<a-class="anchor"-id="first"></a>)
* [2. Managing Graphs <a class="anchor" id="manage"></a>](#2.-Managing-Graphs-<a-class="anchor"-id="manage"></a>)
* [2. Lifecycle of a Node Value <a class="anchor" id="lifecycle"></a>](#2.-Lifecycle-of-a-Node-Value-<a-class="anchor"-id="lifecycle"></a>)
* [3. Linear Regression with TensorFlow <a class="anchor" id="linreg"></a>](#3.-Linear-Regression-with-TensorFlow-<a-class="anchor"-id="linreg"></a>)
* [4. Gradient Descent <a class="anchor" id="gradient"></a>](#4.-Gradient-Descent-<a-class="anchor"-id="gradient"></a>)
	* [Autodiff in TensorFlow](#Autodiff-in-TensorFlow)
	* [Using an Optimizer](#Using-an-Optimizer)
* [5. Feeding Data into Training Algorithm <a class="anchor" id="feeding"></a>](#5.-Feeding-Data-into-Training-Algorithm-<a-class="anchor"-id="feeding"></a>)
* [6. Saving and Restoring Models <a class="anchor" id="saving"></a>](#6.-Saving-and-Restoring-Models-<a-class="anchor"-id="saving"></a>)
* [7. Visualizing the Graph using TensorBoard <a class="anchor" id="visualize"></a>](#7.-Visualizing-the-Graph-using-TensorBoard-<a-class="anchor"-id="visualize"></a>)
	* [Viewing TensorBoard in Jupyter](#Viewing-TensorBoard-in-Jupyter)
* [8. Name Scopes <a class="anchor" id="name"></a>](#8.-Name-Scopes-<a-class="anchor"-id="name"></a>)
* [9. Modularity <a class="anchor" id="modularity"></a>](#9.-Modularity-<a-class="anchor"-id="modularity"></a>)
* [10. Sharing Variables <a class="anchor" id="sharing"></a>](#10.-Sharing-Variables-<a-class="anchor"-id="sharing"></a>)
* [License](#License)

In [1]:
# Load modules
import matplotlib
%matplotlib inline
import matplotlib.pylab as plt
from IPython.display import display
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.io as sio
from matplotlib import cm
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import tensorflow as tf

## 1. First Graph and Session <a class="anchor" id="first"></a>

[[back to top]](#top)

This creates the graph for the figure shown above.

In [26]:
x= tf.Variable(3,name='x')
y= tf.Variable(4,name='y')
f= x*x*y + y + 2

The variables are not initialized yet. To evaluate the graph, we need to open a session and use it to initialize the variables and evaluate `f`. The session places the operations onto devices such as CPUs and GPUs and holds all the variable values. The code below creates a session, initializes the variables, evaluates `f`, and closes the session (freeing up resources).

In [27]:
sess = tf.Session()
sess.run(x.initializer)
sess.run(y.initializer)
result = sess.run(f)
print('f= ',result)
sess.close()

f=  42


Repeating `sess.run()` for every operation is cumbersome, so the session can instead be opened in a `with` block:

In [28]:
with tf.Session() as sess:
    x.initializer.run()
    y.initializer.run()
    result = f.eval()
print('f= ',result)

f=  42


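Inside the `with` block the session is set as the default session, so `x.initializer.run()` and `f.eval()` run against it without naming it explicitly. This makes the code easier to read, and the session is closed automatically at the end of the block.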

Instead of running the initializer for every variable, `global_variables_initializer()` can be used. It does not perform the initialization immediately, but creates a node in the graph that will initialize all variables when it is run.

In [29]:
init = tf.global_variables_initializer()
with tf.Session() as sess:
    init.run()
    result=f.eval()
print('f= ',result)

f=  42


In a notebook like this we can create an `InteractiveSession`, which automatically sets itself as the default session, so no `with` block is needed (but the session has to be closed manually).

In [30]:
sess = tf.InteractiveSession()
init.run()
result = f.eval()
print('f= ',result)
sess.close()

f=  42


A TensorFlow program is usually split into two parts:

1. Building the computation graph (construction phase) 
2. Running the graph (execution phase)

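The construction phase builds the graph representing the ML model and the computations needed to train it; the execution phase then runs that graph, typically evaluating a training step repeatedly.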
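The split between the two phases can be mimicked in plain Python. The sketch below is only an analogy and not how TensorFlow is implemented: the `Node` class and the `as_node()` and `run()` helpers are hypothetical names made up for this illustration. Writing the expression only records operations; nothing is computed until `run()` is called.

```python
# Toy illustration of "build first, run later" (an analogy, not TensorFlow code).
class Node:
    """A deferred computation: stores an operation name and its input nodes."""
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def __add__(self, other):
        return Node('add', (self, as_node(other)))

    def __mul__(self, other):
        return Node('mul', (self, as_node(other)))

def as_node(v):
    """Wrap plain numbers as constant nodes; leave existing nodes untouched."""
    return v if isinstance(v, Node) else Node('const', value=v)

def run(node):
    """Execution phase: recursively evaluate a node's inputs, then the node itself."""
    if node.op == 'const':
        return node.value
    a, b = (run(n) for n in node.inputs)
    return a + b if node.op == 'add' else a * b

# Construction phase: only the graph is built here, no arithmetic happens.
x = as_node(3)
y = as_node(4)
f = x * x * y + y + 2

# Execution phase: the graph is evaluated on demand.
result = run(f)
print('f =', result)
```

Here `f` is a `Node` object rather than a number, which mirrors why `f` in the TensorFlow cells above has to be evaluated inside a session.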

## 2. Managing Graphs <a class="anchor" id="manage"></a>

[[back to top]](#top)

Any node created is automatically added to the default graph.

In [7]:
x1 = tf.Variable(1)
x1.graph is tf.get_default_graph()

True

If you want to manage multiple graphs at once, new graphs can be created and temporarily made the default inside a `with` block:

In [8]:
graph = tf.Graph()
with graph.as_default():
    x2=tf.Variable(2)
display(x2.graph is graph)
x2.graph is tf.get_default_graph()

True

False

> In Jupyter notebooks it is common to run the same commands more than once while experimenting, so you may end up with a default graph containing duplicate nodes. One solution is to restart the kernel, but an easier way is to reset the default graph with `tf.reset_default_graph()`.

## 2. Lifecycle of a Node Value <a class="anchor" id="lifecycle"></a>

[[back to top]](#top)

When a node is evaluated, TensorFlow automatically determines the set of nodes that it depends on and evaluates these nodes first. For example consider the code:

In [9]:
w = tf.constant(3)
x = w + 2
y = x + 5
z = x*3

with tf.Session() as sess:
    print('y=',y.eval())
    print('z=',z.eval())

y= 10
z= 15


A session is started and runs the graph to evaluate `y` first. TF detects that `y` depends on `x`, which depends on `w`, so it evaluates `w`, then `x`, then `y`, and returns the value of `y`. The code then runs the graph to evaluate `z`. Once again, TF must evaluate `w` and `x` first. It **will not** reuse the result of the previous evaluation, so `w` and `x` are evaluated twice.

All node values are dropped between graph runs, except variable values, which are maintained by the session across graph runs. A variable's life begins when its initializer is run and ends when the session is closed. To evaluate `y` and `z` without evaluating `w` and `x` twice, TF must be asked to evaluate both in a single graph run:

In [10]:
with tf.Session() as sess:
    y_val,z_val = sess.run([y,z])
    print('y=',y_val);print('z=',z_val)

y= 10
z= 15
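The difference between two separate runs and one combined run can be made visible by counting evaluations. The following is a toy model in plain Python, not TensorFlow's actual mechanism: the `Op` class, the `run()` function, and the `eval_counts` dictionary are hypothetical helpers, and the per-run `cache` stands in for node values that are kept only for the duration of one graph run.

```python
# Toy model of node lifecycles (an analogy, not TensorFlow's implementation).
eval_counts = {}

class Op:
    """A named node with a function and the nodes it depends on."""
    def __init__(self, name, fn, *inputs):
        self.name, self.fn, self.inputs = name, fn, inputs

def run(fetches):
    """Evaluate several nodes in ONE run; values are cached only for this run."""
    cache = {}
    def evaluate(node):
        if node.name not in cache:
            args = [evaluate(i) for i in node.inputs]
            cache[node.name] = node.fn(*args)
            eval_counts[node.name] = eval_counts.get(node.name, 0) + 1
        return cache[node.name]
    return [evaluate(n) for n in fetches]

w = Op('w', lambda: 3)
x = Op('x', lambda w_val: w_val + 2, w)
y = Op('y', lambda x_val: x_val + 5, x)
z = Op('z', lambda x_val: x_val * 3, x)

# Two separate runs: w and x are each evaluated twice.
y_val, = run([y])
z_val, = run([z])
counts_separate = dict(eval_counts)

# One combined run: w and x are each evaluated once.
eval_counts.clear()
y_val2, z_val2 = run([y, z])
counts_combined = dict(eval_counts)
print(counts_separate)
print(counts_combined)
```

The counts for `w` and `x` drop from 2 to 1 when both fetches share a run, which is the saving that `sess.run([y,z])` provides.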


## 3. Linear Regression with TensorFlow <a class="anchor" id="linreg"></a>

[[back to top]](#top)

TF operations (ops) can take any number of inputs and produce any number of outputs. For example, the addition and multiplication ops each take two inputs and produce one output. Constants and variables take no input (they are known as *source ops*). The inputs and outputs are multidimensional arrays, called **tensors**. As in NumPy, tensors have a type and a shape.

An example of performing computations on arrays is shown below: linear regression on the California housing dataset. We fetch the data, scale it, and add the bias input feature ($x_0=1$) to all instances. Then two TF constant nodes are created to hold the data and targets ($X$ and $y$). Matrix operations are then used to evaluate the Normal Equation, which gives the $\hat{\theta}$ that minimizes the cost function in closed form:

$\hat{\theta} = (X^T \cdot X)^{-1} \cdot X^T \cdot y$

In [31]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Get data
housing = fetch_california_housing()
m,n     = housing.data.shape
# Scale the dataset
scaler= StandardScaler()
housing_scaled = scaler.fit_transform(housing.data)
# Add bias
housing_plus_bias = np.c_[np.ones((m,1)),housing_scaled]
# Tensor Flow
X = tf.constant(housing_plus_bias,name='X')
y = tf.constant(housing.target.reshape(-1,1),name='y')
# Normal Equation
XT = tf.transpose(X)
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT,X)),XT),y)

with tf.Session() as sess:
    theta_val = theta.eval()
print(theta_val)

[[ 2.06855817]
 [ 0.8296193 ]
 [ 0.11875165]
 [-0.26552688]
 [ 0.30569623]
 [-0.004503  ]
 [-0.03932627]
 [-0.89988565]
 [-0.870541  ]]


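The advantage over computing the Normal Equation directly with NumPy is that TF will automatically run the computation on a GPU if one is available (and TensorFlow was installed with GPU support).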
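For comparison, the same Normal Equation can be written in NumPy. The sketch below uses small synthetic data rather than the housing dataset, so every name and number in it (`X_demo`, `y_demo`, `true_theta`, the sizes and the noise level) is an assumption made for illustration. It also checks the result against `np.linalg.lstsq`, which is usually preferred numerically over forming an explicit inverse.

```python
import numpy as np

rng = np.random.RandomState(42)
m_demo, n_demo = 200, 3                       # hypothetical number of instances and features
X_demo = np.c_[np.ones((m_demo, 1)), rng.randn(m_demo, n_demo)]   # bias column + features
true_theta = np.array([[2.0], [0.5], [-1.0], [3.0]])              # made-up parameters
y_demo = X_demo @ true_theta + 0.01 * rng.randn(m_demo, 1)        # targets with small noise

# Normal Equation: theta = (X^T X)^{-1} X^T y
theta_ne = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Least-squares solver as a cross-check
theta_ls, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(theta_ne.ravel())
print(np.allclose(theta_ne, theta_ls))
```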

## 4. Gradient Descent <a class="anchor" id="gradient"></a>

[[back to top]](#top)

As we saw in earlier chapters, Batch Gradient Descent can be used instead of the Normal Equation to solve for $\hat{\theta}$. We will compare manually computing the gradients to using TF's `autodiff` feature that does it automatically. Lastly we will use TF's out-of-the-box optimizers.

The gradient vector of the MSE cost function is:

$$\nabla_\theta MSE(\theta) = \begin{pmatrix}
\frac{\partial}{\partial \theta_0} MSE(\theta)  \\
\frac{\partial}{\partial \theta_1} MSE(\theta)  \\
\vdots \\
\frac{\partial}{\partial \theta_n} MSE(\theta)
\end{pmatrix}  =  \frac{2}{m}X^T \cdot (X\cdot\theta - y)
$$


For BGD each step requires calculations over the entire training set. As found earlier, this makes it very slow on large training sets, but it scales well with the number of features. Using Batch Gradient Descent on a Linear Regression problem with thousands of features is much faster than using the Normal Equation!

Once the gradient vector $\nabla_\theta MSE(\theta)$ is calculated, we take a step in the opposite direction to go downhill, which means subtracting a multiple of $\nabla_\theta MSE(\theta)$ from $\theta$. The size of the step taken is determined by the **learning rate**, $\eta$:

$\theta^{(next)} = \theta - \eta \nabla_\theta MSE(\theta)$
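The gradient formula and the update rule above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic, noise-free data, not the housing dataset; the names `X_demo`, `y_demo`, `theta_true`, `theta_gd` and the values of `eta_demo` and the epoch count are assumptions chosen so that the loop converges quickly.

```python
import numpy as np

rng = np.random.RandomState(0)
m_demo = 500                                          # hypothetical number of instances
X_demo = np.c_[np.ones((m_demo, 1)), rng.randn(m_demo, 2)]   # bias column + 2 features
theta_true = np.array([[1.0], [2.0], [-3.0]])         # made-up parameters to recover
y_demo = X_demo @ theta_true                          # noise-free targets

eta_demo = 0.1                                        # learning rate
theta_gd = rng.uniform(-1.0, 1.0, size=(3, 1))        # random initialization in [-1, 1)
for epoch in range(500):
    # Gradient of the MSE: (2/m) X^T (X theta - y)
    gradients = 2.0 / m_demo * X_demo.T @ (X_demo @ theta_gd - y_demo)
    # Step against the gradient, scaled by the learning rate
    theta_gd = theta_gd - eta_demo * gradients

mse_final = float(np.mean((X_demo @ theta_gd - y_demo) ** 2))
print(theta_gd.ravel(), mse_final)
```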

The TensorFlow implementation below relies on the following elements:

* The `random_uniform()` function creates a node in the graph that will generate a tensor containing random values, given a shape and value range
* The `assign()` function creates a node that will assign a new value to a variable. In this case we use it to implement the update $\theta^{(next)} = \theta - \eta \nabla_\theta MSE(\theta)$
* The main loop executes the training step over and over, `n_epochs` times, and prints out the MSE every 100 iterations

In [15]:
n_epochs=1000
eta = 0.01 #learning rate
X = tf.constant(housing_plus_bias,name='X',dtype=tf.float32)
y = tf.constant(housing.target.reshape(-1,1),name='y',dtype=tf.float32)
# Define theta, y_pred, and mse
theta  = tf.Variable(tf.random_uniform([n+1,1],-1.0,1.0),name='theta') #set as random numbers between -1.0 and 1.0
y_pred = tf.matmul(X,theta,name='predictions')
error  = y_pred-y
mse    = tf.reduce_mean(tf.square(error),name='mse')
# Define gradients
gradients = 2/m*tf.matmul(tf.transpose(X),error)
# Update theta
training_op = tf.assign(theta,theta - eta*gradients)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch %100 ==0: #if it is 100 iterations
            print('Epoch ',epoch, "MSE= ",mse.eval())
        sess.run(training_op)
    best_theta = theta.eval()

print(best_theta)

Epoch  0 MSE=  5.53964
Epoch  100 MSE=  0.609647
Epoch  200 MSE=  0.543065
Epoch  300 MSE=  0.537878
Epoch  400 MSE=  0.534714
Epoch  500 MSE=  0.532339
Epoch  600 MSE=  0.530539
Epoch  700 MSE=  0.529168
Epoch  800 MSE=  0.528119
Epoch  900 MSE=  0.527312
[[  2.06855226e+00]
 [  8.57600331e-01]
 [  1.35107636e-01]
 [ -2.97613323e-01]
 [  3.23090762e-01]
 [  1.08387612e-03]
 [ -4.13064770e-02]
 [ -7.54380345e-01]
 [ -7.27287114e-01]]


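We see that the values of $\hat{\theta}$ found are close to those from the Normal Equation, though not identical after 1,000 epochs (the last two coefficients are about -0.75 and -0.73 here versus -0.90 and -0.87 above). The approach works well, but it requires mathematically deriving the gradients from the cost function (MSE). With linear regression this is quite easy, but with deep neural networks it can be a headache.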

Consider the function $f(x)=\exp(\exp(\exp(x)))$. Its derivative is $f'(x)=\exp(x)\times \exp(\exp(x))\times \exp(\exp(\exp(x)))$. If you write code that first computes $\exp(x)$, then $\exp(\exp(x))$, and then $\exp(\exp(\exp(x)))$, and returns all three, the third term gives you $f(x)$ directly, and the derivative is found by multiplying all three terms together.
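The reuse of intermediate results described above can be sketched in plain Python. The function name `f_and_derivative`, the evaluation point `x0`, and the finite-difference step `h` are assumptions introduced for this illustration; the numerical derivative is there only as a sanity check.

```python
import math

def f_and_derivative(x):
    """Return f(x) = exp(exp(exp(x))) and f'(x), sharing the intermediate terms."""
    a = math.exp(x)      # exp(x)
    b = math.exp(a)      # exp(exp(x))
    c = math.exp(b)      # exp(exp(exp(x))), which is f(x)
    return c, a * b * c  # the derivative is the product of the three terms

x0 = 0.1
f_val, df_val = f_and_derivative(x0)

# Central finite difference as an independent check of the derivative
h = 1e-6
df_numeric = (f_and_derivative(x0 + h)[0] - f_and_derivative(x0 - h)[0]) / (2 * h)
print(f_val, df_val, df_numeric)
```

Computing the derivative this way costs three exponentials instead of the six that the naive formula would need.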

### Autodiff in TensorFlow

TF's autodiff feature computes the gradients for you automatically and efficiently. Simply replace the `gradients=...` line in the BGD code with the following line:

`gradients = tf.gradients(mse,[theta])[0]`

The `gradients()` function takes an op (here `mse`) and a list of variables (here just `theta`) and creates a list of ops (one per variable) to compute the gradients of the op with regard to each variable. So the `gradients` node will compute the gradient vector of the MSE with regard to `theta`.

There are several approaches for computing gradients automatically. TensorFlow uses *reverse-mode autodiff*, which is efficient and accurate when there are many inputs and few outputs (as is the case for neural networks). It computes the partial derivatives of the outputs with regard to all the inputs in just $n_{outputs}+1$ traversals of the graph.
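To make the idea of reverse-mode autodiff concrete, here is a deliberately tiny scalar version in plain Python. It is a sketch of the general technique under simplifying assumptions (only `+` and `*`, scalars only) and not TensorFlow's implementation; the `Var` class and `backward()` function are hypothetical names. One forward pass computes the value and one reverse pass computes the partial derivatives with regard to every input, matching the $n_{outputs}+1$ count for a single output.

```python
class Var:
    """A scalar that records how it was computed so gradients can flow back."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def backward(output):
    """Single reverse traversal, visiting nodes in reverse topological order."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad   # chain rule

x = Var(3.0)
y = Var(4.0)
f = x * x * y + y + 2      # forward pass, same function as in section 1
backward(f)                # reverse pass
print(f.value, x.grad, y.grad)
```

For this function the analytic partial derivatives are $2xy$ and $x^2+1$, which the reverse pass reproduces for both inputs at once.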

### Using an Optimizer

We do not even have to write the `training_op = ...` line ourselves: TF also provides a number of optimizers out of the box, including a Gradient Descent optimizer. We can use it as follows:

`optimizer = tf.train.GradientDescentOptimizer(learning_rate=eta)`
`training_op= optimizer.minimize(mse)`

With an optimizer we no longer need to call the `gradients()` function ourselves. Other optimizers often converge faster than plain Gradient Descent, such as a momentum optimizer:

`optimizer = tf.train.MomentumOptimizer(learning_rate=eta,momentum=0.9)`

In [20]:
tf.reset_default_graph()
n_epochs=1000
eta = 0.01 #learning rate
X = tf.constant(housing_plus_bias,name='X',dtype=tf.float32)
y = tf.constant(housing.target.reshape(-1,1),name='y',dtype=tf.float32)
# Define theta, y_pred, and mse
theta  = tf.Variable(tf.random_uniform([n+1,1],-1.0,1.0),name='theta') #set as random numbers between -1.0 and 1.0
y_pred = tf.matmul(X,theta,name='predictions')
error  = y_pred-y
mse    = tf.reduce_mean(tf.square(error),name='mse')
# Optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=eta)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch %100 ==0: #if it is 100 iterations
            print('Epoch ',epoch, "MSE= ",mse.eval())
        sess.run(training_op)
    best_theta = theta.eval()
    y_preds = y_pred.eval()
print(best_theta)

Epoch  0 MSE=  3.70997
Epoch  100 MSE=  0.680958
Epoch  200 MSE=  0.611947
Epoch  300 MSE=  0.586895
Epoch  400 MSE=  0.569465
Epoch  500 MSE=  0.556918
Epoch  600 MSE=  0.547861
Epoch  700 MSE=  0.541324
Epoch  800 MSE=  0.536606
Epoch  900 MSE=  0.533198
[[ 2.06855226]
 [ 0.77114516]
 [ 0.13918838]
 [-0.09631982]
 [ 0.13994916]
 [ 0.0036909 ]
 [-0.03980835]
 [-0.8058219 ]
 [-0.76683038]]


## 5. Feeding Data into Training Algorithm <a class="anchor" id="feeding"></a>

[[back to top]](#top)

We will modify the previous code to implement Mini-Batch Gradient Descent. For this, we need a way to replace `X` and `y` at every iteration with the next mini-batch. The easiest way to do this is to use placeholder nodes. These nodes do not perform any computation; they just output the data you tell them to output at run time. They are typically used to pass training data to TF during training. A placeholder node is created by calling the `placeholder()` function and specifying the output tensor's data type. You can also specify its shape; specifying `None` for a dimension means "any size". For example:

In [13]:
A = tf.placeholder(tf.float32,shape=(None,3))
B = A + 5
with tf.Session() as sess:
    B_val_1 = B.eval(feed_dict={A:[[1,2,3]]})
    B_val_2 = B.eval(feed_dict={A:[[4,5,6],[7,8,9]]})
print(B_val_1)
print(B_val_2)

[[ 6.  7.  8.]]
[[  9.  10.  11.]
 [ 12.  13.  14.]]


To evaluate `B` above, we pass a `feed_dict` to the `eval()` method that specifies the value of `A`. Note that `A` must be rank 2 and have 3 columns as we specified in the placeholder declaration - it can have any number of rows.

> You can feed the output of any operation, not just placeholders. In this case TF does not try to evaluate these operations; it uses the values you feed it.

Below we will implement mini-batch Gradient Descent with placeholders.

In [32]:
tf.reset_default_graph()

# Placeholders
X = tf.placeholder(tf.float32,shape=(None,n+1),name='X')
y = tf.placeholder(tf.float32,shape=(None,1),name='y')

# Define theta, y_pred, and mse
theta  = tf.Variable(tf.random_uniform([n+1,1],-1.0,1.0),name='theta') #set as random numbers between -1.0 and 1.0
y_pred = tf.matmul(X,theta,name='predictions')
error  = y_pred-y
mse    = tf.reduce_mean(tf.square(error),name='mse')
# Optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=eta)
training_op = optimizer.minimize(mse)

# Define Batch Size and n_epochs
batch_size= 100
n_batches = int(np.ceil(m/batch_size))
n_epochs  = 10

# Load batches from housing data
def fetch_batch(epoch, batch_index, batch_size):
    np.random.seed(epoch*n_batches+batch_index)
    indices = np.random.randint(m,size=batch_size)
    X_batch = housing_plus_bias[indices]
    y_batch = housing.target.reshape(-1,1)[indices]
    return X_batch,y_batch

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch,y_batch = fetch_batch(epoch,batch_index,batch_size)
            sess.run(training_op,feed_dict={X:X_batch,y:y_batch})
        
    best_theta=theta.eval()

print(best_theta)

[[ 2.07046413]
 [ 0.89327163]
 [ 0.1302847 ]
 [-0.36166519]
 [ 0.4237586 ]
 [ 0.0065919 ]
 [-0.01447757]
 [-0.74432039]
 [-0.71741438]]
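Note that `fetch_batch()` above draws indices with `np.random.randint`, so within an epoch some instances may be picked several times and others not at all. A common alternative, shown here as a different technique rather than the one used above, is to shuffle the indices once per epoch and slice them into batches. The generator name `iterate_minibatches` and the small demo arrays are hypothetical.

```python
import numpy as np

def iterate_minibatches(X_data, y_data, batch_size, rng):
    """Yield mini-batches that together cover every instance exactly once."""
    permutation = rng.permutation(len(X_data))
    for start in range(0, len(X_data), batch_size):
        idx = permutation[start:start + batch_size]
        yield X_data[idx], y_data[idx]

rng = np.random.RandomState(42)
X_demo = np.arange(20).reshape(10, 2)    # 10 made-up instances with 2 features
y_demo = np.arange(10).reshape(-1, 1)    # made-up targets

batches = list(iterate_minibatches(X_demo, y_demo, batch_size=4, rng=rng))
for X_batch, y_batch in batches:
    print(X_batch.shape, y_batch.ravel())
```

Each `(X_batch, y_batch)` pair could be passed through a `feed_dict` in the same way as in the training loop above; the last batch is smaller when the number of instances is not a multiple of the batch size.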


## 6. Saving and Restoring Models <a class="anchor" id="saving"></a>

[[back to top]](#top)

TensorFlow allows for easy saving and restoring of a model. We have to create a `Saver` node at the end of the construction phase (after all variable nodes are created); then in the execution phase, just call its `save()` method where you want to save the model.

Let's repeat the Batch Gradient Descent model:

In [27]:
tf.reset_default_graph()
n_epochs=1000
eta = 0.01 #learning rate
X = tf.constant(housing_plus_bias,name='X',dtype=tf.float32)
y = tf.constant(housing.target.reshape(-1,1),name='y',dtype=tf.float32)
# Define theta, y_pred, and mse
theta  = tf.Variable(tf.random_uniform([n+1,1],-1.0,1.0),name='theta') #set as random numbers between -1.0 and 1.0
y_pred = tf.matmul(X,theta,name='predictions')
error  = y_pred-y
mse    = tf.reduce_mean(tf.square(error),name='mse')
# Optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=eta)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch %100 ==0: #if it is 100 iterations
            print('Epoch ',epoch, "MSE= ",mse.eval())
        sess.run(training_op)
    best_theta = theta.eval()
    save_path = saver.save(sess,"/tmp/my_BDG_model.ckpt")
print(best_theta)

Epoch  0 MSE=  5.14874
Epoch  100 MSE=  0.811114
Epoch  200 MSE=  0.647334
Epoch  300 MSE=  0.609895
Epoch  400 MSE=  0.586454
Epoch  500 MSE=  0.569662
Epoch  600 MSE=  0.557476
Epoch  700 MSE=  0.548611
Epoch  800 MSE=  0.542155
Epoch  900 MSE=  0.537445
[[ 2.06855249]
 [ 0.7307868 ]
 [ 0.13578592]
 [-0.01227765]
 [ 0.06702613]
 [ 0.00297579]
 [-0.03865971]
 [-0.86850804]
 [-0.82444233]]


Restoring the model is just as easy. As above, a `Saver` is created at the end of the construction phase; then, at the beginning of the execution phase, instead of initializing the variables with `init` we call the saver's `restore()` method:

In [28]:
with tf.Session() as sess:
    saver.restore(sess,"/tmp/my_BDG_model.ckpt")
    restored_theta = theta.eval()

np.allclose(best_theta,restored_theta)

INFO:tensorflow:Restoring parameters from /tmp/my_BDG_model.ckpt


True

We can also import a pretrained model without having the Python code that constructed the graph. This is useful when you keep tweaking and saving the model: you can load a previously saved model without having to search for the version of the code that built it.

This is possible because, by default, the saver also saves the graph structure itself in a second file with a `.meta` extension. The function `tf.train.import_meta_graph()` restores the graph structure: it loads the graph into the default graph and returns a `Saver` that can then be used to restore the graph state (i.e., the variable values).

In [30]:
tf.reset_default_graph() # we start with an empty graph no X,y, theta
# Restore Graph Structure
saver = tf.train.import_meta_graph("/tmp/my_BDG_model.ckpt.meta") 
theta = tf.get_default_graph().get_tensor_by_name("theta:0")

with tf.Session() as sess:
    # Restores the Graph's State
    saver.restore(sess,"/tmp/my_BDG_model.ckpt") 
    restored_theta_2 = theta.eval()

np.allclose(best_theta,restored_theta_2)

INFO:tensorflow:Restoring parameters from /tmp/my_BDG_model.ckpt


True

## 7. Visualizing the Graph using TensorBoard <a class="anchor" id="visualize"></a>

[[back to top]](#top)

We need a better way to follow progress during training than printing the MSE. TensorBoard provides interactive visualizations that can be viewed in the web browser.

We can first tweak the code so that it writes the graph definition and training stats (for example MSE) to a log directory that TensorBoard can read from. A different log directory is needed every time you run the program, or else TensorBoard will merge stats from different runs together. The easiest solution is to include a timestamp in the log directory name. 

Below is an example with the mini-batch Gradient Descent code we used earlier.

### Viewing TensorBoard in Jupyter

In [None]:
from IPython.display import clear_output, Image, display, HTML

def strip_consts(graph_def, max_const_size=32):
    """Strip large constant values from graph_def."""
    strip_def = tf.GraphDef()
    for n0 in graph_def.node:
        n = strip_def.node.add() 
        n.MergeFrom(n0)
        if n.op == 'Const':
            tensor = n.attr['value'].tensor
            size = len(tensor.tensor_content)
            if size > max_const_size:
                tensor.tensor_content = b"<stripped %d bytes>"%size
    return strip_def

def show_graph(graph_def, max_const_size=32):
    """Visualize TensorFlow graph."""
    if hasattr(graph_def, 'as_graph_def'):
        graph_def = graph_def.as_graph_def()
    strip_def = strip_consts(graph_def, max_const_size=max_const_size)
    code = """
        <script>
          function load() {{
            document.getElementById("{id}").pbtxt = {data};
          }}
        </script>
        <link rel="import" href="https://tensorboard.appspot.com/tf-graph-basic.build.html" onload=load()>
        <div style="height:600px">
          <tf-graph-basic id="{id}"></tf-graph-basic>
        </div>
    """.format(data=repr(str(strip_def)), id='graph'+str(np.random.rand()))

    iframe = """
        <iframe seamless style="width:1200px;height:620px;border:0" srcdoc="{}"></iframe>
    """.format(code.replace('"', '&quot;'))
    display(HTML(iframe))

In [24]:
# Timestamping
tf.reset_default_graph()
from datetime import datetime
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir = 'tf_logs'
logdir = "{}/run-{}/".format(root_logdir,now)

In [40]:
tf.reset_default_graph()

# Placeholders
X = tf.placeholder(tf.float32,shape=(None,n+1),name='X')
y = tf.placeholder(tf.float32,shape=(None,1),name='y')

# Define theta, y_pred, and mse
theta  = tf.Variable(tf.random_uniform([n+1,1],-1.0,1.0),name='theta') #set as random numbers between -1.0 and 1.0
y_pred = tf.matmul(X,theta,name='predictions')
error  = y_pred-y
mse    = tf.reduce_mean(tf.square(error),name='mse')
# Optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=eta)
training_op = optimizer.minimize(mse)

# Define Batch Size and n_epochs
batch_size= 100
n_batches = int(np.ceil(m/batch_size))
n_epochs  = 10

# Add MSE to TensorBoard
init = tf.global_variables_initializer()
mse_summary = tf.summary.scalar('MSE',mse)
file_writer = tf.summary.FileWriter(logdir,tf.get_default_graph())

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch,y_batch = fetch_batch(epoch,batch_index,batch_size)
            #Add for writing output
            if batch_index %10 == 0:
                summary_str = mse_summary.eval(feed_dict={X:X_batch,y:y_batch})
                step = epoch*n_batches+batch_index
                file_writer.add_summary(summary_str,step)
            sess.run(training_op,feed_dict={X:X_batch,y:y_batch})

    best_theta=theta.eval()
    show_graph(tf.get_default_graph())

file_writer.close()

## 8. Name Scopes <a class="anchor" id="name"></a>

[[back to top]](#top)

If you are dealing with more complex models such as neural networks, the graph can become cluttered with thousands of nodes. To avoid this, you can create *name scopes* to group related nodes. For example in the previous examples we could define `error` and `mse` operations with a name scope called "loss":

In [45]:
tf.reset_default_graph()
# Placeholders
X = tf.placeholder(tf.float32,shape=(None,n+1),name='X')
y = tf.placeholder(tf.float32,shape=(None,1),name='y')

# Define theta, y_pred, and mse
theta  = tf.Variable(tf.random_uniform([n+1,1],-1.0,1.0),name='theta') #set as random numbers between -1.0 and 1.0
y_pred = tf.matmul(X,theta,name='predictions')

with tf.name_scope('loss') as scope:
    error  = y_pred-y
    mse    = tf.reduce_mean(tf.square(error),name='mse')

# Optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=eta)
training_op = optimizer.minimize(mse)
print(error.op.name)
print(mse.op.name)
show_graph(tf.get_default_graph())

loss/sub
loss/mse


We see above that the names of the `mse` and `error` ops are now prefixed with `loss/`, meaning they are contained in the `loss` name scope.

## 9. Modularity <a class="anchor" id="modularity"></a>

[[back to top]](#top)

Suppose you want to create a graph that adds the outputs of two *rectified linear units* (ReLU). A ReLU computes a linear function of the inputs and outputs the result if it is positive, and 0 otherwise:

$h_{\textbf{w},b}(\textbf{X}) = \max(\textbf{X}\cdot \textbf{w} + b, 0 )$

We can construct a graph that performs this computation as shown below, but the code is quite repetitive:

In [46]:
tf.reset_default_graph()

n_features = 3
X = tf.placeholder(tf.float32,shape=(None,n_features),name='X')
# Weights and biases
w1= tf.Variable(tf.random_normal((n_features,1)),name='weights1')
w2= tf.Variable(tf.random_normal((n_features,1)),name='weights2')
b1= tf.Variable(0.0, name='bias1')
b2= tf.Variable(0.0, name='bias2')
# ReLU
z1= tf.add(tf.matmul(X,w1),b1,name='z1')
z2= tf.add(tf.matmul(X,w2),b2,name='z2')
relu1 = tf.maximum(z1,0,name='relu1')
relu2 = tf.maximum(z2,0,name='relu2')
# Output
output = tf.add(relu1,relu2,name='output')

The code becomes even more repetitive if we want more ReLUs. Wrapping the logic in a function makes this easier: below we construct 5 ReLUs and output their sum. The `add_n()` function creates an operation that computes the sum of a list of tensors. Each ReLU is also placed in its own name scope.

In [49]:
tf.reset_default_graph()

def relu(X):
    with tf.name_scope('relu'):
        w_shape = (int(X.get_shape()[1]),1)
        w = tf.Variable(tf.random_normal(w_shape),name='weights')
        b = tf.Variable(0.0,name='bias')
        z = tf.add(tf.matmul(X,w),b,name='z')
        return tf.maximum(z,0,name='relu')

n_features=3
X = tf.placeholder(tf.float32,shape=(None,n_features),name='X')
# Create 5 relus
relus = [relu(X) for i in range(5)]
# Output
output = tf.add_n(relus,name='output')
show_graph(tf.get_default_graph())

When you create a node, TF checks whether its name already exists, and if it does, it appends an underscore followed by an index to make the name unique. We see above that the name scopes helped condense the graph visually for all of the ReLUs.
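The naming rule described above can be imitated with a small counter. This is a simplified sketch of the rule as stated in the text, not TensorFlow's actual code, and `make_unique_namer()` and `unique_name()` are hypothetical helper names.

```python
def make_unique_namer():
    """Return a function that mimics '<name>_<index>' de-duplication of names."""
    counts = {}
    def unique_name(name):
        n = counts.get(name, 0)
        counts[name] = n + 1
        # The first use keeps the plain name; later uses get an index suffix.
        return name if n == 0 else '{}_{}'.format(name, n)
    return unique_name

unique_name = make_unique_namer()
names = [unique_name('relu') for _ in range(3)]
print(names)   # ['relu', 'relu_1', 'relu_2']
```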

## 10. Sharing Variables <a class="anchor" id="sharing"></a>

[[back to top]](#top)

If we wanted to share a variable between several components of the graph, one option is to create it first and then pass it as a parameter to the functions that need it.

For example, the ReLUs above used a fixed threshold of 0, but we can instead create a shared `threshold` variable and pass it to the `relu()` function, as shown below.

In [50]:
tf.reset_default_graph()
    
def relu(X,threshold):
    with tf.name_scope('relu'):
        w_shape = (int(X.get_shape()[1]),1)
        w = tf.Variable(tf.random_normal(w_shape),name='weights')
        b = tf.Variable(0.0,name='bias')
        z = tf.add(tf.matmul(X,w),b,name='z')
        return tf.maximum(z,threshold,name='relu')

threshold = tf.Variable(0.0,name='threshold')
n_features=3
X = tf.placeholder(tf.float32,shape=(None,n_features),name='X')
# Create 5 relus, passing the shared threshold to each
relus = [relu(X,threshold) for i in range(5)]
# Output
output = tf.add_n(relus,name='output')

While the above works well, it can be a pain to keep track of all the parameters all of the time. You can create a dictionary containing all the variables in the model, and pass it around to every function. You can also create a class ReLU using class variables to handle the shared parameter.
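The class-based option mentioned above might look roughly like the following. To keep the sketch runnable without a TensorFlow graph it uses NumPy arrays instead of TF variables, which is a simplification of this illustration; the `ReLU` class layout, the demo input `X_demo`, and the random weights are all assumptions.

```python
import numpy as np

class ReLU:
    threshold = 0.0   # class variable: one value shared by every instance

    def __init__(self, n_inputs, rng):
        self.w = rng.randn(n_inputs, 1)   # per-unit weights
        self.b = 0.0                      # per-unit bias

    def __call__(self, X):
        z = X @ self.w + self.b
        return np.maximum(z, ReLU.threshold)

rng = np.random.RandomState(0)
units = [ReLU(3, rng) for _ in range(5)]
X_demo = rng.randn(4, 3)                  # 4 made-up instances with 3 features

out_before = sum(u(X_demo) for u in units)
ReLU.threshold = 1.0                      # changing it once affects all 5 units
out_after = sum(u(X_demo) for u in units)
print(out_before.ravel(), out_after.ravel())
```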

Another option is to set the shared variable as an attribute of the `relu()` function the first time it is called:

In [52]:
tf.reset_default_graph()


def relu(X):
    with tf.name_scope('relu'):
        if not hasattr(relu,'threshold'):
            relu.threshold = tf.Variable(0.0,name='threshold')
        w_shape = (int(X.get_shape()[1]),1)
        w = tf.Variable(tf.random_normal(w_shape),name='weights')
        b = tf.Variable(0.0,name='bias')
        z = tf.add(tf.matmul(X,w),b,name='z')
        return tf.maximum(z,relu.threshold,name='relu')
    
n_features=3
X = tf.placeholder(tf.float32,shape=(None,n_features),name='X')
# Create 5 relus
relus = [relu(X) for i in range(5)]
# Output
output = tf.add_n(relus,name='output')

Another popular option is to use the `get_variable()` function to create the shared variable if it does not exist yet, or reuse it if it already exists. The behavior is controlled by the `reuse` attribute of the current `variable_scope()`. The following code creates a variable named `relu/threshold` (as a scalar) and then reuses it inside every `relu()` call:

In [54]:
tf.reset_default_graph()

def relu(X):
    with tf.variable_scope('relu',reuse=True):
        threshold = tf.get_variable('threshold')
        w_shape = (int(X.get_shape()[1]),1)
        w = tf.Variable(tf.random_normal(w_shape),name='weights')
        b = tf.Variable(0.0,name='bias')
        z = tf.add(tf.matmul(X,w),b,name='z')
        return tf.maximum(z,threshold,name='relu')
    
n_features=3
X = tf.placeholder(tf.float32,shape=(None,n_features),name='X')
with tf.variable_scope('relu'):
    threshold = tf.get_variable('threshold',shape=(),initializer=tf.constant_initializer(0.0))
# Create 5 relus
relus = [relu(X) for i in range(5)]
# Output
output = tf.add_n(relus,name='output')
show_graph(tf.get_default_graph())

The above code will fetch the existing `relu/threshold` variable or raise an exception if it does not exist or if it was not created using `get_variable()`. Alternatively we can set the `reuse` attribute to `True` inside the block by calling the scope's `reuse_variables()` method.

In [55]:
tf.reset_default_graph()

def relu(X):
    with tf.variable_scope('relu'):
        threshold = tf.get_variable("threshold", shape=(), initializer=tf.constant_initializer(0.0))
        w_shape = (int(X.get_shape()[1]),1)
        w = tf.Variable(tf.random_normal(w_shape),name='weights')
        b = tf.Variable(0.0,name='bias')
        z = tf.add(tf.matmul(X,w),b,name='z')
        return tf.maximum(z,threshold,name='relu')
    
n_features=3
X = tf.placeholder(tf.float32,shape=(None,n_features),name='X')

with tf.variable_scope("", default_name="") as scope:
    first_relu = relu(X)     # create the shared variable
    scope.reuse_variables()  # then reuse it
    relus = [first_relu] + [relu(X) for i in range(4)]
# Output
output = tf.add_n(relus,name='output')
show_graph(tf.get_default_graph())

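This version differs from the previous one in that the `threshold` variable is created inside the first `relu()` call, so it lives within the first ReLU unit instead of being defined outside the function.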
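The create-or-reuse behavior of `get_variable()` used in this section can be summarized with a toy registry. This is only a sketch of the lookup rules, not TensorFlow's implementation: the `VariableStore` class is hypothetical, the `reuse` flag is passed directly to the method instead of coming from a scope, and a one-element list stands in for a variable.

```python
class VariableStore:
    """Toy registry imitating create-or-reuse lookups (not TensorFlow's code)."""
    def __init__(self):
        self._vars = {}

    def get_variable(self, name, initial_value=None, reuse=False):
        if reuse:
            # Reuse mode: the variable must already exist.
            if name not in self._vars:
                raise ValueError('Variable {} does not exist'.format(name))
            return self._vars[name]
        # Creation mode: the variable must not exist yet.
        if name in self._vars:
            raise ValueError('Variable {} already exists'.format(name))
        self._vars[name] = [initial_value]   # one-element list stands in for a variable
        return self._vars[name]

store = VariableStore()
t1 = store.get_variable('relu/threshold', initial_value=0.0)   # creates the variable
t2 = store.get_variable('relu/threshold', reuse=True)          # fetches the same object
print(t1 is t2)
```

Calling `store.get_variable('relu/threshold')` a second time without `reuse=True` raises a `ValueError`, which corresponds to the exception TensorFlow raises when a scope tries to create a variable that already exists.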

## License

The material in this notebook is made available under the [Creative Commons Attribution license](https://creativecommons.org/licenses/by-nc/4.0/).