# Up and Running with TensorFlow

*TensorFlow* is a powerful open-source software library for numerical computation, particularly well suited and fine-tuned for large-scale machine learning. Its basic principle is simple: you first define in Python a graph of computations to perform, and then TensorFlow takes that graph and runs it efficiently using optimized C++ code.

Most importantly, it's possible to break the graph into several chunks and run them in parallel across multiple CPUs or GPUs. TensorFlow also supports distributed computing, so you can train colossal neural networks on humongous training sets in a reasonable amount of time by splitting the computations across hundreds of servers (see chapter 12). TensorFlow can train a network with millions of parameters on a training set composed of billions of instances with millions of features each. This shouldn't be a surprise, since TensorFlow was developed by Google's Brain team and powers many of Google's large-scale services, such as Google Cloud Speech, Google Photos, and Google Search.

Here's a table of open source Deep Learning libraries available (not exhaustive):

Library | API | Platforms | Started by | Year
--- | --- | --- | --- | ---
Caffe | Python, C++, Matlab | Linux, macOS, Windows | Y. Jia, UC Berkeley | 2013
Deeplearning4j | Java, Scala, Clojure | Linux, macOS, Windows, Android | A. Gibson, J. Patterson | 2014
H2O | Python, R | Linux, macOS, Windows | H2O.ai | 2014
MXNet | Python, C++, others | Linux, macOS, Windows, iOS, Android | DMLC | 2015
TensorFlow | Python, C++ | Linux, macOS, Windows, iOS, Android | Google | 2015
Theano | Python | Linux, macOS, iOS | University of Montreal | 2010
Torch | C++, Lua | Linux, macOS, iOS, Android | R. Collobert, K. Kavukcuoglu, C. Farabet | 2002

## Creating Your First Graph and Running It in a Session

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import numpy as np
import tensorflow as tf

x = tf.Variable(3, name='x')
y = tf.Variable(4, name='y')
f = x*x*y + y + 2

In [2]:
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [3]:
with tf.Session() as sess:
    x.initializer.run()
    y.initializer.run()
    result = f.eval()
result

42

In [4]:
init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    result = f.eval()
    
result

42

In [5]:
# Alternatively, you can create an interactive session
sess = tf.InteractiveSession()
init.run()
result = f.eval()
print(result)
sess.close()

42


A TensorFlow program is typically split into two parts: the first part builds a computation graph (the *construction phase*), and the second part runs it (the *execution phase*). The construction phase typically builds a computation graph representing the ML model and the computations required to train it. The execution phase generally runs a loop that evaluates a training step repeatedly (for example, one step per mini-batch), gradually improving the model parameters. We will go through an example shortly.
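
To make the two phases concrete, here is a minimal pure-Python sketch of deferred computation. The `Node` class and the `constant` and `as_node` helpers are hypothetical names invented for this illustration; they are not part of TensorFlow's API. The only point is that building `f` performs no arithmetic until `run()` is called.

```python
class Node:
    """Hypothetical graph node: stores an operation and its inputs, computes nothing yet."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def __add__(self, other):
        return Node(lambda a, b: a + b, self, as_node(other))

    def __mul__(self, other):
        return Node(lambda a, b: a * b, self, as_node(other))

    def run(self):
        # Execution: evaluate the inputs recursively, then apply this node's operation.
        return self.op(*(node.run() for node in self.inputs))


def constant(value):
    """Source node: takes no inputs and always yields the same value."""
    return Node(lambda: value)


def as_node(value):
    """Wrap plain Python numbers so they can take part in the graph."""
    return value if isinstance(value, Node) else constant(value)


# Construction phase: only the graph structure is built here.
x = constant(3)
y = constant(4)
f = x * x * y + y + 2

# Execution phase: the arithmetic actually happens now.
result = f.run()
print(result)  # 42
```

Note that this toy re-evaluates shared inputs every time they are needed; it makes no attempt to mimic how TensorFlow schedules or optimizes a real graph.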

## Managing Graphs

Any node you create is automatically added to the default graph:

In [6]:
x1 = tf.Variable(1)
x1.graph is tf.get_default_graph()

True

This is fine in most cases, but sometimes you may want to manage multiple independent graphs. You can do this by creating a new `Graph` and temporarily making it the default graph inside a `with` block, like so:

In [7]:
graph = tf.Graph()
with graph.as_default():
    x2 = tf.Variable(2)
    
x2.graph is graph

True

In [8]:
x2.graph is tf.get_default_graph()

False

*Note: in Jupyter (and in a Python shell) it's common to run the same commands repeatedly when experimenting. As a result, you may end up with a default graph containing many duplicate nodes. One solution is to restart the Jupyter kernel or Python shell, but a more convenient solution is to just reset the default graph by running `tf.reset_default_graph()`.*

## Lifecycle of a Node Value

When you evaluate a node, TensorFlow automatically determines the set of nodes that it depends on and evaluates those nodes first. For example, consider the following:

In [9]:
w = tf.constant(3)
x = w + 2
y = x + 5
z = x * 3

with tf.Session() as sess:
    print(y.eval()) # 10
    print(z.eval()) # 15

10
15


First, this code defines a simple graph. Then it starts a session and runs the graph to evaluate `y`: TensorFlow detects that `y` depends on `x`, which depends on `w`, so it first evaluates `w`, then `x`, then `y`, and returns the value of `y`. Finally, the code runs the graph to evaluate `z`. Once again, TensorFlow detects that it must first evaluate `w` and `x`, but it *will not* reuse the result of the previous evaluation. In short, the above code evaluates `w` and `x` twice.

All node values are dropped between graph runs, except variable values, which are maintained by the session across graph runs. A variable starts its life when its initializer (constructor) is run, and ends when the session is closed.

If you want to evaluate `y` and `z` efficiently, without evaluating `w` and `x` twice as in the previous code, you must ask TensorFlow to evaluate both `y` and `z` in just one graph run, as seen below:

In [10]:
with tf.Session() as sess:
    y_val, z_val = sess.run([y, z])
    print(y_val)
    print(z_val)

10
15
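
The difference between the two approaches can be illustrated with a pure-Python sketch that counts how often each node is computed. The `Node` class and `run()` function below are hypothetical stand-ins, not TensorFlow code; the assumption is simply that intermediate values are cached within a single run and discarded afterwards, as described above.

```python
from collections import Counter


class Node:
    """Hypothetical graph node with a name, an operation, and input nodes."""

    def __init__(self, name, op, *inputs):
        self.name = name
        self.op = op
        self.inputs = inputs


def run(fetches, counts):
    """Evaluate the requested nodes in one graph run, sharing intermediate values."""
    cache = {}  # node values live only for the duration of this run

    def evaluate(node):
        if node not in cache:
            counts[node.name] += 1  # record that this node was actually computed
            cache[node] = node.op(*(evaluate(i) for i in node.inputs))
        return cache[node]

    return [evaluate(node) for node in fetches]


w = Node("w", lambda: 3)
x = Node("x", lambda w_val: w_val + 2, w)
y = Node("y", lambda x_val: x_val + 5, x)
z = Node("z", lambda x_val: x_val * 3, x)

# Two separate runs: w and x are computed once per run, so twice in total.
separate = Counter()
y_val = run([y], separate)[0]
z_val = run([z], separate)[0]

# One run fetching both: w and x are computed only once.
combined = Counter()
y_val2, z_val2 = run([y, z], combined)

print(y_val, z_val, dict(separate))
print(y_val2, z_val2, dict(combined))
```

In the first case the counters show `w` and `x` computed twice each; in the second, every node is computed exactly once.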


*Note: in single-process TensorFlow, multiple sessions don't share any state, even if they reuse the same graph. In distributed TensorFlow (see chapter 12), variable state is stored on the servers, not in the sessions, so multiple sessions can share the same variables.*

## Linear Regression with TensorFlow

TensorFlow operations (known as *ops* for short) can take any number of inputs and produce any number of outputs. For example, addition and multiplication both take two inputs and produce one output. Constants and variables take no input (they are known as *source ops*). The inputs and outputs are multidimensional arrays known as *tensors*. Just like NumPy arrays, tensors have a type and a shape; in fact, in the Python API, tensor values are represented by NumPy ndarrays. They typically contain floats, but you can also use them to carry strings (arbitrary byte arrays).

In the examples so far, the tensors contained just a single scalar value, but you can of course perform computations on arrays of any shape.

The following code performs Linear Regression on the California housing data from earlier by manipulating 2D arrays. It starts by fetching the data, then adds an extra bias input feature to all training instances using NumPy. Next, it creates two TensorFlow constant nodes, `X` and `y`, to hold this data and the targets, and it uses some matrix operations provided by TensorFlow to define `theta`. You may recognize that the definition of `theta` corresponds to the Normal Equation ($\hat{\theta} = (\textbf{X}^T\cdot\textbf{X})^{-1}\cdot\textbf{X}^T\cdot\textbf{y}$; see chapter 4). Finally, the code creates a session and evaluates `theta`:

In [11]:
from sklearn.datasets import fetch_california_housing
from os.path import expanduser

housing = fetch_california_housing(data_home=expanduser('~/Coding Stuff/Python/handson-ml/datasets'))
m, n = housing.data.shape
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]

X = tf.constant(housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name='y')
XT = tf.transpose(X)
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)

with tf.Session() as sess:
    theta_value = theta.eval()
    
theta_value

array([[-3.7185181e+01],
       [ 4.3633747e-01],
       [ 9.3952334e-03],
       [-1.0711310e-01],
       [ 6.4479220e-01],
       [-4.0338000e-06],
       [-3.7813708e-03],
       [-4.2348403e-01],
       [-4.3721911e-01]], dtype=float32)

In [12]:
# Numpy version of what we just did
X = housing_data_plus_bias
y = housing.target.reshape(-1, 1)
theta_numpy = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

theta_numpy

array([[-3.69419202e+01],
       [ 4.36693293e-01],
       [ 9.43577803e-03],
       [-1.07322041e-01],
       [ 6.45065694e-01],
       [-3.97638942e-06],
       [-3.78654266e-03],
       [-4.21314378e-01],
       [-4.34513755e-01]])

In [13]:
# scikit learn version
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing.data, housing.target.reshape(-1, 1))

print(np.r_[lin_reg.intercept_.reshape(-1, 1), lin_reg.coef_.T])

[[-3.69419202e+01]
 [ 4.36693293e-01]
 [ 9.43577803e-03]
 [-1.07322041e-01]
 [ 6.45065694e-01]
 [-3.97638942e-06]
 [-3.78654265e-03]
 [-4.21314378e-01]
 [-4.34513755e-01]]




The main benefit of this code over computing the Normal Equation directly with NumPy is that TensorFlow will use your GPU if you have one, provided you installed TensorFlow with GPU support (see chapter 12).
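
To see the Normal Equation at work on something small enough to check by hand, here is a pure-Python sketch for a single feature plus a bias term. The data is invented for illustration (generated from $y = 4 + 3x$ with no noise), and the 2x2 matrix inverse is written out in closed form rather than using a linear algebra library.

```python
# Toy data generated from y = 4 + 3x with no noise (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [4.0 + 3.0 * x for x in xs]
m = len(xs)

# Entries of X^T X and X^T y when each row of X is [1, x] (bias term first).
sum_x = sum(xs)
sum_xx = sum(x * x for x in xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# X^T X = [[m, sum_x], [sum_x, sum_xx]]; invert the 2x2 matrix in closed form.
det = m * sum_xx - sum_x * sum_x
inv = [[sum_xx / det, -sum_x / det],
       [-sum_x / det, m / det]]

# theta = (X^T X)^{-1} X^T y
theta0 = inv[0][0] * sum_y + inv[0][1] * sum_xy  # intercept
theta1 = inv[1][0] * sum_y + inv[1][1] * sum_xy  # slope
print(theta0, theta1)
```

Up to floating-point rounding, the recovered parameters are the intercept 4 and the slope 3 used to generate the data.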

## Implementing Gradient Descent

Now it's time to use Batch Gradient Descent. We will first do this by manually computing the gradients, then we will use TensorFlow's autodiff feature to let TensorFlow compute the gradients automatically, and finally we will use a few of TensorFlow's out-of-the-box optimizers.

*Note: when using Gradient Descent, remember that it's important to first normalize the input feature vectors, or else training may be much slower. You can do this with TensorFlow, NumPy, Scikit-Learn's `StandardScaler`, or any other solution.*

In [14]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_housing_data = scaler.fit_transform(housing.data)
scaled_housing_data_plus_bias = np.c_[np.ones((m, 1)), scaled_housing_data]
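
As a rough picture of what standardizing a feature involves, the following pure-Python sketch subtracts the mean and divides by the standard deviation for a single column. The values are invented for illustration, and the use of the population standard deviation is an assumption made to mirror the usual definition of standardization; it is a sketch of the idea, not a reimplementation of `StandardScaler`.

```python
from statistics import fmean, pstdev

# Hypothetical raw feature column (values invented for illustration).
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = fmean(raw)
std = pstdev(raw)  # population standard deviation (divides by n, not n - 1)

# Standardize: zero mean and unit variance for this column.
scaled = [(value - mean) / std for value in raw]
print(mean, std)
print(scaled)
```

After scaling, the column has mean 0 and standard deviation 1, so every feature contributes on a comparable scale to the gradient.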

### Manually Compute the Gradients

The next code is fairly self-explanatory, except for a few new elements:

* The `random_uniform()` function creates a node in the graph that will generate a tensor containing random values, given its shape and value range, much like NumPy's `rand()` function.

* The `assign()` function creates a node that will assign a new value to a variable. In this case, it implements the Batch Gradient Descent step $\theta^{(\text{next step})} = \theta - \eta\nabla_{\theta}\text{MSE}(\theta)$.

* The main loop executes the training step over and over again (`n_epochs` times), and every 100 iterations it prints out the current Mean Squared Error (`mse`). You should see the MSE go down as training progresses.
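
For reference, the expression assigned to `gradients` in the code below can be derived from the MSE cost function. Writing the cost in matrix form (with $m$ training instances, and using the same notation as the Normal Equation):

$$\text{MSE}(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\theta^T\cdot\textbf{x}^{(i)} - y^{(i)}\right)^2 = \frac{1}{m}\left(\textbf{X}\cdot\theta - \textbf{y}\right)^T\cdot\left(\textbf{X}\cdot\theta - \textbf{y}\right)$$

Differentiating with respect to $\theta$ gives:

$$\nabla_{\theta}\text{MSE}(\theta) = \frac{2}{m}\,\textbf{X}^T\cdot\left(\textbf{X}\cdot\theta - \textbf{y}\right)$$

Since `error` holds $\textbf{X}\cdot\theta - \textbf{y}$, this is what `2/m * tf.matmul(tf.transpose(X), error)` computes.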

In [15]:
reset_graph()

n_epochs = 1000
learning_rate = 0.1

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name='X')
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name='y')
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name='theta')
y_pred = tf.matmul(X, theta, name='predictions')
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name='mse')
gradients = 2/m * tf.matmul(tf.transpose(X), error)
training_op = tf.assign(theta, theta - learning_rate * gradients)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)
    best_theta = theta.eval()

Epoch 0 MSE = 9.161543
Epoch 100 MSE = 0.5305595
Epoch 200 MSE = 0.52515554
Epoch 300 MSE = 0.5244485
Epoch 400 MSE = 0.52434105
Epoch 500 MSE = 0.52432406
Epoch 600 MSE = 0.52432144
Epoch 700 MSE = 0.52432096
Epoch 800 MSE = 0.5243211
Epoch 900 MSE = 0.52432084
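
The same update rule can be followed step by step in plain Python on a tiny problem. The sketch below uses invented data (generated from $y = 4 + 3x$ with an already centered feature) and writes the matrix products out as sums; it is meant only to show the mechanics of the Batch Gradient Descent step, not to reproduce the housing results.

```python
# Toy data from y = 4 + 3x with a centered feature (values invented for illustration).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [4.0 + 3.0 * x for x in xs]
m = len(xs)

learning_rate = 0.1
n_epochs = 200
theta = [0.0, 0.0]  # [bias, weight]


def mse(params):
    """Mean squared error of the linear model over the whole training set."""
    return sum((params[0] + params[1] * x - y) ** 2 for x, y in zip(xs, ys)) / m


initial_mse = mse(theta)
for epoch in range(n_epochs):
    errors = [theta[0] + theta[1] * x - y for x, y in zip(xs, ys)]
    # gradients = 2/m * X^T . error, written out for the two parameters
    grad_bias = 2 / m * sum(errors)
    grad_weight = 2 / m * sum(e * x for e, x in zip(errors, xs))
    # Gradient Descent step: theta <- theta - learning_rate * gradients
    theta = [theta[0] - learning_rate * grad_bias,
             theta[1] - learning_rate * grad_weight]

final_mse = mse(theta)
print(theta, initial_mse, final_mse)
```

With this learning rate the parameters converge to the bias 4 and weight 3 that generated the data, and the MSE falls to essentially zero.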


### Using autodiff

The preceding code works just fine, but it requires mathematically deriving the gradients from the cost function. This can get really, *really* messy if you use deep neural nets (or even SVMs, for that matter). You could use *symbolic differentiation* to automatically find the equations for the partial derivatives for you, but the resulting code wouldn't necessarily be efficient.

To see why, consider $f(x) = \exp(\exp(\exp(x)))$. If you know calculus, you can work out its derivative, $f'(x) = \exp(x) \times \exp(\exp(x)) \times \exp(\exp(\exp(x)))$. If you code $f(x)$ and $f'(x)$ separately, exactly as they appear, your code won't be as efficient as it could be. A better solution is to reuse intermediate results: write a function that first computes $\exp(x)$, then $\exp(\exp(x))$, then $\exp(\exp(\exp(x)))$, and returns all three. This gives you $f(x)$ directly (the third term), and if you need the derivative you can just multiply all three terms and you're done. The naïve approach calls the `exp` function nine times to compute both $f(x)$ and $f'(x)$; this approach calls it just three times. Things get worse when the function is defined by arbitrary code.
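
The call counts are easy to verify with a small pure-Python sketch. `CountingExp`, `naive`, and `shared` are names invented for this illustration; the wrapper simply counts calls to `math.exp`.

```python
import math


class CountingExp:
    """Wraps math.exp and counts how many times it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, value):
        self.calls += 1
        return math.exp(value)


def naive(x, exp):
    # f and f' coded separately, exactly as the formulas are written.
    f = exp(exp(exp(x)))
    df = exp(x) * exp(exp(x)) * exp(exp(exp(x)))
    return f, df


def shared(x, exp):
    # Compute each intermediate once and reuse it for both f and f'.
    a = exp(x)
    b = exp(a)
    c = exp(b)
    return c, a * b * c


naive_exp, shared_exp = CountingExp(), CountingExp()
naive_result = naive(0.5, naive_exp)
shared_result = shared(0.5, shared_exp)
print(naive_exp.calls, shared_exp.calls)  # 9 3
```

Both functions return the same pair of values; only the amount of repeated work differs.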

Luckily, TensorFlow's autodiff feature comes to the rescue: it can automatically and efficiently compute the gradients for you. Simply replace the `gradients = ...` line in the Gradient Descent code in the previous section with the following line, and the code will continue to work just fine:
`gradients = tf.gradients(mse, [theta])[0]`

The `gradients()` function takes an op (in this case `mse`) and a list of variables (in this case just `theta`), and it creates a list of ops (one per variable) to compute the gradients of the op with respect to each variable. So the `gradients` node will compute the gradient vector of the MSE with respect to `theta`.

There are four main approaches to computing gradients automatically. They're summarized in the following table. TensorFlow uses *reverse-mode autodiff*, which is perfect (efficient and accurate) when there are many inputs and few outputs, as is often the case in neural networks. It computes all of the partial derivatives of the outputs with respect to all of the inputs in just $n_{\text{outputs}} + 1$ graph traversals.

Technique | Graph traversals to compute all gradients | Accuracy | Supports arbitrary code | Comment
--- | --- | --- | --- | ---
Numerical differentiation | $n_{\text{inputs}} + 1$ | Low | Yes | Trivial to implement
Symbolic differentiation | N/A | High | No | Builds a very different graph
Forward-mode autodiff | $n_{\text{inputs}}$ | High | Yes | Uses *dual numbers*
Reverse-mode autodiff | $n_{\text{outputs}} + 1$ | High | Yes | Implemented by TensorFlow
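
The table mentions *dual numbers* without explaining them. As a minimal sketch of forward-mode autodiff, the hypothetical `Dual` class below carries a value together with its derivative through additions and multiplications, using the function $f(x, y) = x^2y + y + 2$ from the first example. Only `+` and `*` are supported, which is an assumption made to keep the example short.

```python
class Dual:
    """Dual number a + b*eps with eps**2 == 0; `deriv` carries the derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        # Treat plain numbers as constants, whose derivative is zero.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (a + a'eps)(b + b'eps) = ab + (a b' + a' b) eps
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __radd__ = __add__
    __rmul__ = __mul__


def f(x, y):
    return x * x * y + y + 2


# One forward pass per input: seed that input's derivative with 1.
df_dx = f(Dual(3.0, 1.0), Dual(4.0)).deriv
df_dy = f(Dual(3.0), Dual(4.0, 1.0)).deriv
print(df_dx, df_dy)  # 24.0 10.0
```

Each partial derivative needs its own pass through the function, which is why forward-mode autodiff requires $n_{\text{inputs}}$ traversals and becomes expensive when there are many parameters.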

The following code implements the same training loop as above, but with autodiff:

In [18]:
reset_graph()

n_epochs = 1000
learning_rate = 0.1

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name='y')
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name='theta')
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

In [19]:
gradients = tf.gradients(mse, [theta])[0]

In [20]:
training_op = tf.assign(theta, theta - learning_rate * gradients)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)
    best_theta = theta.eval()
    
print("Best theta: \n{}".format(best_theta))

Epoch 0 MSE = 9.161543
Epoch 100 MSE = 0.5305595
Epoch 200 MSE = 0.52515554
Epoch 300 MSE = 0.5244485
Epoch 400 MSE = 0.524341
Epoch 500 MSE = 0.524324
Epoch 600 MSE = 0.5243215
Epoch 700 MSE = 0.52432096
Epoch 800 MSE = 0.524321
Epoch 900 MSE = 0.52432084
Best theta: 
[[ 2.0685577 ]
 [ 0.8296404 ]
 [ 0.11875559]
 [-0.26556668]
 [ 0.30572915]
 [-0.00450187]
 [-0.03932704]
 [-0.8998372 ]
 [-0.87049496]]
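
For some intuition about reverse-mode autodiff, here is a toy pure-Python sketch. It is not how TensorFlow is implemented: the `Var` class and `backward()` function are hypothetical, support only scalar `+` and `*`, and exist purely to show how a single backward pass over the recorded graph yields the partial derivatives with respect to every input.

```python
class Var:
    """Hypothetical scalar node that records how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (input node, local partial derivative)
        self.grad = 0.0

    @staticmethod
    def lift(other):
        return other if isinstance(other, Var) else Var(other)

    def __add__(self, other):
        other = Var.lift(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = Var.lift(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate d(output)/d(node) from the output back to every input."""
    order, seen = [], set()

    def visit(node):
        # Depth-first traversal producing a topological order (inputs first).
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        # Chain rule: pass this node's gradient on to each of its inputs.
        for parent, local in node.parents:
            parent.grad += local * node.grad


x, y = Var(3.0), Var(4.0)
f = x * x * y + y + 2   # forward pass records the graph
backward(f)             # one backward pass computes all partial derivatives
print(f.value, x.grad, y.grad)  # 42.0 24.0 10.0
```

One forward pass plus one backward pass is enough here regardless of how many inputs there are, which matches the $n_{\text{outputs}} + 1$ traversal count given for reverse-mode autodiff.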


### Using an Optimizer

So TensorFlow computes the gradients for you. But it gets even easier: it also provides a number of optimizers out of the box, including a `GradientDescentOptimizer`. You can simply replace the preceding `gradients = ...` and `training_op = ...` lines with the following, and everything will once again work just fine:

```
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)
```

If you want to use a different type of optimizer, you just need to change the first line. For example, you could use a Momentum optimizer (see chapter 11) by defining the optimizer like this: `optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9)`

The code is implemented below:

In [21]:
reset_graph()

n_epochs = 1000
learning_rate = 0.1

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name='y')
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name='theta')
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

In [22]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

In [23]:
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)
    
    best_theta = theta.eval()

print("Best theta:")
print(best_theta)

Epoch 0 MSE = 9.161543
Epoch 100 MSE = 0.5305595
Epoch 200 MSE = 0.52515554
Epoch 300 MSE = 0.5244485
Epoch 400 MSE = 0.524341
Epoch 500 MSE = 0.524324
Epoch 600 MSE = 0.5243215
Epoch 700 MSE = 0.52432096
Epoch 800 MSE = 0.524321
Epoch 900 MSE = 0.52432084
Best theta:
[[ 2.0685577 ]
 [ 0.8296404 ]
 [ 0.11875559]
 [-0.26556668]
 [ 0.30572915]
 [-0.00450187]
 [-0.03932704]
 [-0.8998372 ]
 [-0.87049496]]


In [24]:
# or use a momentum optimizer
reset_graph()

n_epochs = 1000
learning_rate = 0.1

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name='y')
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name='theta')
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

In [25]:
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9)
training_op = optimizer.minimize(mse)

In [26]:
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)
    
    best_theta = theta.eval()

print("Best theta:")
print(best_theta)

Epoch 0 MSE = 9.161543
Epoch 100 MSE = 0.5244434
Epoch 200 MSE = 0.5243208
Epoch 300 MSE = 0.52432084
Epoch 400 MSE = 0.52432096
Epoch 500 MSE = 0.52432096
Epoch 600 MSE = 0.52432096
Epoch 700 MSE = 0.52432096
Epoch 800 MSE = 0.52432096
Epoch 900 MSE = 0.52432096
Best theta:
[[ 2.0685582 ]
 [ 0.82961947]
 [ 0.11875169]
 [-0.26552713]
 [ 0.30569643]
 [-0.00450299]
 [-0.03932627]
 [-0.8998851 ]
 [-0.8705405 ]]
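
To show how a momentum update differs from a plain Gradient Descent step, the sketch below applies both to a one-dimensional toy cost, $(\theta - 3)^2$. It assumes the classic formulation in which a velocity term accumulates past gradients; TensorFlow's optimizer may differ in implementation details, and the cost function and step count are invented for illustration.

```python
def gradient(theta):
    # Derivative of the toy cost (theta - 3)**2, whose minimum is at theta = 3.
    return 2.0 * (theta - 3.0)


learning_rate = 0.1
momentum = 0.9
n_steps = 500

# Plain Gradient Descent: step directly along the negative gradient.
theta_gd = 0.0
for _ in range(n_steps):
    theta_gd -= learning_rate * gradient(theta_gd)

# Momentum: accumulate a velocity from past gradients, then step along it.
theta_momentum, velocity = 0.0, 0.0
for _ in range(n_steps):
    velocity = momentum * velocity + gradient(theta_momentum)
    theta_momentum -= learning_rate * velocity

print(theta_gd, theta_momentum)
```

Both loops end up very close to the minimum at 3. This toy problem is too simple to show which method converges faster in general; it only illustrates the two update rules.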


## Feeding Data to the Training Algorithm

Let's try to modify the previous code to implement Mini-batch Gradient Descent. For this, we need a way to replace `X` and `y` at every iteration with the next mini-batch. The simplest way to do this is to use placeholder nodes. These nodes are special because they don't actually perform any computation; they just output the data you tell them to output at runtime. They're typically used to pass the training data to TensorFlow during training. If you don't specify a value at runtime for a placeholder, you get an exception.

To create a placeholder node, you must call the `placeholder()` function and specify the output tensor's data type. Optionally, you can also specify its shape. If you specify `None` for a dimension, it means "any size":

In [28]:
reset_graph()

A = tf.placeholder(tf.float32, shape=(None, 3))
B = A + 5
with tf.Session() as sess:
    B_val_1 = B.eval(feed_dict={A: [[1, 2, 3]]})
    B_val_2 = B.eval(feed_dict={A: [[4, 5, 6], [7, 8, 9]]})

print(B_val_1)
print(B_val_2)


[[6. 7. 8.]]
[[ 9. 10. 11.]
 [12. 13. 14.]]
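
The contract of a placeholder can be mimicked in a few lines of plain Python. In this sketch, `Placeholder` and `eval_plus_five` are hypothetical names, not TensorFlow functions: the placeholder holds no data, the value is supplied through a dictionary at evaluation time, and a missing value raises an exception.

```python
class Placeholder:
    """Hypothetical stand-in for a placeholder node: it holds no data of its own."""

    def __init__(self, name):
        self.name = name


def eval_plus_five(placeholder, feed_dict=None):
    """Evaluate `placeholder + 5`, reading the placeholder's value from feed_dict."""
    feed_dict = feed_dict or {}
    if placeholder not in feed_dict:
        raise ValueError(f"no value fed for placeholder {placeholder.name!r}")
    return [[value + 5 for value in row] for row in feed_dict[placeholder]]


A = Placeholder("A")
B_val_1 = eval_plus_five(A, {A: [[1, 2, 3]]})
B_val_2 = eval_plus_five(A, {A: [[4, 5, 6], [7, 8, 9]]})
print(B_val_1)  # [[6, 7, 8]]
print(B_val_2)  # [[9, 10, 11], [12, 13, 14]]

try:
    eval_plus_five(A)  # nothing fed for A
    feed_error = None
except ValueError as exc:
    feed_error = str(exc)
print(feed_error)
```

The same expression is reused with inputs of different numbers of rows, which is what makes placeholders convenient for feeding one mini-batch after another.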


By default, a `Saver` saves and restores all variables under their own name, but if you need more control, you can specify which variables to save or restore and what names to use. For example, the following `Saver` will save or restore only the `theta` variable under the name `weights`:
`saver = tf.train.Saver({'weights': theta})`

By default, the `save()` method also saves the structure of the graph in a second file with the same name plus a `.meta` extension. You can load this graph structure using `tf.train.import_meta_graph()`. This adds the graph to the default graph and returns a `Saver` instance that you can then use to restore the graph's state (i.e., the variable values):

```
saver = tf.train.import_meta_graph("/tmp/my_model_final.ckpt.meta")
with tf.Session() as sess:
    saver.restore(sess, "/tmp/my_model_final.ckpt")
    [...]
```

This allows you to fully restore a saved model, including both the graph structure and the variable values, without having to search for the code that built it. Next, back to Mini-batch Gradient Descent:

In [29]:
n_epochs = 1000
learning_rate = 0.01

reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name='y')

In [30]:
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name='theta')
y_pred = tf.matmul(X, theta, name='predictions')
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name='mse')
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

In [31]:
n_epochs = 10

batch_size = 100
n_batches = int(np.ceil(m / batch_size))

In [32]:
def fetch_batch(epoch, batch_index, batch_size):
    # Seed per (epoch, batch) pair so that every run draws the same mini-batches
    np.random.seed(epoch * n_batches + batch_index)
    # Sample batch_size row indices uniformly at random, with replacement
    indices = np.random.randint(m, size=batch_size)
    X_batch = scaled_housing_data_plus_bias[indices]
    y_batch = housing.target.reshape(-1, 1)[indices]
    return X_batch, y_batch

with tf.Session() as sess:
    sess.run(init)
    
    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
    best_theta = theta.eval()
    
best_theta

array([[ 2.0703337 ],
       [ 0.8637145 ],
       [ 0.12255151],
       [-0.31211874],
       [ 0.38510373],
       [ 0.00434168],
       [-0.01232954],
       [-0.83376896],
       [-0.8030471 ]], dtype=float32)
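
Note that `fetch_batch()` samples row indices with replacement, so within an epoch some instances may be drawn several times and others not at all. A common alternative, shown here as a pure-Python sketch rather than as the method used above, is to shuffle the indices once per epoch and slice them into consecutive batches. The function name `iterate_minibatches` and the tiny sizes are invented for illustration.

```python
import math
import random


def iterate_minibatches(n_instances, batch_size, seed):
    """Yield lists of row indices that together cover every instance exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # seeded so each epoch is reproducible
    for start in range(0, n_instances, batch_size):
        yield indices[start:start + batch_size]


n_instances, batch_size = 10, 4
batches = list(iterate_minibatches(n_instances, batch_size, seed=42))
print(batches)
```

With 10 instances and a batch size of 4 this yields three batches, the last one holding the 2 remaining indices, and every instance appears exactly once per epoch.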

## Visualizing the Graph and Training Curves Using TensorBoard