# Notes on TensorFlow

From Chapter 9 of the *Hands-On Machine Learning* textbook by Aurélien Géron.

TensorFlow is a powerful open source library for numerical computation, particularly well suited and fine-tuned for large-scale machine learning.

Its basic principle is simple: first you define in Python a graph of computations to perform, and then TensorFlow takes that graph and runs it efficiently using optimized C++ code.

![](Pictures/tensorflow.png)

**Most importantly, it is possible to break up the graph into several chunks and run them in parallel across multiple CPUs or GPUs. TensorFlow also supports distributed computing, so you can train colossal neural networks on humongous training sets in a reasonable amount of time by splitting the computations across hundreds of servers.**

#### TensorFlow highlights:
- It runs on mobile devices!
- It provides a very simple Python API called TF.Learn (tensorflow.contrib.learn), compatible with Scikit-Learn. You can use it to train various types of neural networks in just a few lines of code.
- It also provides another simple API called TF-slim (tensorflow.contrib.slim) to simplify building, training, and evaluating neural nets.
- Several other high-level APIs have been built independently on top of TensorFlow, such as Keras and Pretty Tensor.
- Its main Python API offers much more flexibility to create all sorts of computations, including any neural net architecture you can think of.
- It includes highly efficient C++ implementations of many ML operations, particularly those needed to build neural networks. **There is also a C++ API to define your own high-performance operations.**
- It provides several advanced optimization nodes to search for parameters that minimize a cost function. **These are very easy to use since TensorFlow automatically takes care of computing the gradients of the functions you define. THIS IS CALLED AUTOMATIC DIFFERENTIATION (OR AUTODIFF).**
- It also comes with a great visualization tool called TensorBoard that allows you to browse through the computation graph, view learning curves, and more.
- Google has also launched a cloud service to run TensorFlow graphs.
- Last but not least, it has a dedicated team of passionate and helpful developers. It is one of the most popular open source projects on GitHub.

### Creating your first graph and running it in a session

The following code creates the graph represented below:
![](Pictures/Picture2.emf)

In [6]:
import tensorflow as tf

x = tf.Variable(3, name='x')
y = tf.Variable(4, name='y')
f = x*x*y + y + 2

That's it! The most important thing to understand is that this code does not actually perform any computation; **it just creates a computation graph**. <font color=red>In fact, even the variables are not initialized yet.</font> To evaluate this graph, you need to open a TensorFlow session and use it to initialize the variables and evaluate 'f'.

**A TensorFlow session takes care of placing the operations onto devices such as CPUs and GPUs and running them, and it holds all the variable values. The following code creates a session, initializes the variables, evaluates f, and then closes the session (which frees up resources).**

In [7]:
sess = tf.Session()
sess.run(x.initializer)
sess.run(y.initializer)
result = sess.run(f)
print(result)  # 42
sess.close()

42


Having to repeat sess.run() all the time is a bit cumbersome; here's a better way, using a 'with' block:

Inside the 'with' block, the session is set as the default session. Calling **x.initializer.run()** is equivalent to calling **tf.get_default_session().run(x.initializer)**, and similarly **f.eval()** is equivalent to calling **tf.get_default_session().run(f)**. This makes the code easier to read, and the session is automatically closed at the end of the block.

In [8]:
with tf.Session() as sess:
    x.initializer.run()
    y.initializer.run()
    result = f.eval()

<font color =red>Instead of manually calling the initializer for every single variable, you can use the 'global_variables_initializer()' function. Note that it does not actually perform the initialization immediately, but rather creates a node in the graph that will initialize all variables when it is run:</font>

In [9]:
init = tf.global_variables_initializer() #prepare an init node

with tf.Session() as sess:
    init.run() #initialize all the variables
    result = f.eval()

Inside Jupyter or a Python shell you may prefer to create an **InteractiveSession**. The only difference from a regular Session is that when an InteractiveSession is created it **automatically sets itself as the default session,** so you don't need a 'with' block (but you do need to close the session manually when you are done with it).

In [10]:
sess = tf.InteractiveSession()
init.run()
result = f.eval()
print(result)

sess.close()

42


### A TF program is typically split into two parts:
1. The first part builds a computation graph (this is called the construction phase).
2. The second part runs it (this is the execution phase).
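
The two phases can be mimicked with a toy deferred-computation class. This is only a sketch in plain Python, not how TensorFlow is implemented; the `Node` class and the `const`, `add`, and `mul` helpers are hypothetical names invented for the illustration. It rebuilds the same expression as the first graph in these notes.

```python
# Toy illustration only: NOT TensorFlow's implementation.
class Node:
    """A deferred computation: stores an operation and its input nodes."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def evaluate(self):
        """Recursively evaluate the inputs first, then this node."""
        if self.op == "const":
            return self.value
        args = [node.evaluate() for node in self.inputs]
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        raise ValueError("unknown op: " + self.op)


def const(v):
    return Node("const", value=v)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


# Construction phase: this only builds Node objects, nothing is computed.
x = const(3)
y = const(4)
f = add(add(mul(mul(x, x), y), y), const(2))  # x*x*y + y + 2

# Execution phase: the computation happens only now.
result = f.evaluate()
print(result)  # 42
```

Until `evaluate()` is called, `f` is just a description of the computation, which is the same reason the TensorFlow code above needs a session before it produces a value.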

### Managing Graphs

Any node you create is automatically added to the default graph.

In [11]:
x1 = tf.Variable(1)
x1.graph is tf.get_default_graph()

True

In most cases this is fine, but sometimes you may want to manage multiple independent graphs. You can do this by creating a new Graph and temporarily making it the default graph inside a 'with' block.

In [12]:
new_graph = tf.Graph()
with new_graph.as_default():
    x2 = tf.Variable(2)
    
print(x2.graph is new_graph)

print(x2.graph is tf.get_default_graph())

True
False


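As the code above shows, the **with block** temporarily makes new_graph the default graph, so x2 is added to it; outside the block, the original default graph is the default again.

<font color=red>**To avoid accumulating duplicate nodes when you re-run the same cells (in Jupyter or a Python shell), you can reset the default graph by running tf.reset_default_graph().**</font>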

In [13]:
tf.reset_default_graph()

### Lifecycle of a Node Value

When you evaluate a node, TF automatically determines the set of nodes that it depends on and evaluates those nodes first.

In [14]:
w = tf.constant(3)
x = w + 2
y = x + 5
z = x * 3
with tf.Session() as sess:
    print(y.eval()) # 10
    print(z.eval()) # 15

10
15


First, this code defines a very simple graph. Then it starts a session and runs the graph to evaluate y: TensorFlow automatically detects that y depends on x, which depends on w, so it first evaluates w, then x, then y, and returns the value of y. Finally, the code runs the graph to evaluate z.

Once again, TensorFlow detects that it must first evaluate w and x. <font color =red>It is important to note that it will not reuse the result of the previous evaluation of w and x. **In short, the preceding code evaluates w and x twice, BECAUSE EACH EVAL() CALL IS A SEPARATE GRAPH RUN.**</font>

**All node values are dropped between graph runs, except variable values, which are maintained by the session across graph runs (queues and readers also maintain some state, as we will see in Chapter 12).**

<font color=red>A variable starts its life when its initializer is run, and it ends when the session is closed.</font>

**If you want to evaluate y and z efficiently, without evaluating w and x twice as in the previous code,
you must ask TensorFlow to evaluate both y and z in just one graph run, as shown in the following
code:**

In [15]:
with tf.Session() as sess:
    y_val, z_val = sess.run([y, z])
    print(y_val) # 10
    print(z_val) # 15

10
15
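
The difference between two separate graph runs and a single run can be mimicked with a per-run cache. This is a toy model in plain Python, not TensorFlow code; the `run` function and the `eval_counts` dictionary are hypothetical names used only to count how often w and x are computed.

```python
# Toy model of "node values are dropped between graph runs" (not TensorFlow code).
eval_counts = {"w": 0, "x": 0}


def run(fetches):
    """One 'graph run': computed values are cached only for this call."""
    cache = {}

    def value(name):
        if name in cache:
            return cache[name]
        if name == "w":
            eval_counts["w"] += 1
            result = 3
        elif name == "x":
            result = value("w") + 2
            eval_counts["x"] += 1
        elif name == "y":
            result = value("x") + 5
        elif name == "z":
            result = value("x") * 3
        else:
            raise KeyError(name)
        cache[name] = result
        return result

    return [value(name) for name in fetches]


# Two separate runs: w and x are computed once per run, so twice in total.
y_val = run(["y"])[0]
z_val = run(["z"])[0]
separate_counts = dict(eval_counts)

# One run fetching both y and z: w and x are computed only once.
eval_counts.update(w=0, x=0)
y_val_single, z_val_single = run(["y", "z"])
single_counts = dict(eval_counts)

print(separate_counts, single_counts)
```

The cache lives only inside one call to `run`, which is the toy equivalent of node values being dropped between graph runs.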


**WARNING:** In single-process TensorFlow, multiple sessions do not share any state, even if they reuse the same graph (each session
would have its own copy of every variable). In distributed TensorFlow (see Chapter 12), variable state is stored on the
servers, not in the sessions, so multiple sessions can share the same variables.

### Linear Regression with TF

TF Operations (called ops for short) can take any number of inputs and produce any number of outputs. 

<font color = red size=5>The inputs and outputs are multidimensional arrays, called tensors. Just like NumPy arrays, tensors have a type and a shape. In the Python API, tensors are simply represented by NumPy ndarrays. They typically contain floats.</font>

In the examples so far, the tensors contained a single scalar value, but you can perform computations on arrays of any shape. The following code manipulates 2D arrays to perform Linear Regression.

- It starts by fetching the dataset; 
- then it adds an extra bias input feature (x0 = 1) to all training instances (it does so using NumPy so it runs immediately); 
- then it creates **two TensorFlow constant nodes, X and y, to hold this data and the targets, and it uses some of the matrix operations provided by TensorFlow to define theta.** These matrix functions (transpose(), matmul(), and matrix_inverse()) are self-explanatory, but as usual they do not perform any computations immediately; instead, they create nodes in the graph that will perform them when the graph is run. You may recognize that the definition of theta corresponds to the Normal Equation $\hat{\theta} = (X^T X)^{-1} X^T y$ (see Chapter 4).
- Finally, the code creates a session and uses it to evaluate theta.
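
To make the Normal Equation concrete, here is a minimal worked example in plain Python for a single feature plus the bias term. The four data points are made up for the illustration (they lie exactly on y = 1 + 2x), and the 2x2 inverse is written out by hand instead of calling a matrix library.

```python
# Normal Equation on toy data (hypothetical values): theta = (X^T X)^-1 X^T y
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # exactly y = 1 + 2x

X = [[1.0, x] for x in xs]         # add the bias feature x0 = 1 to every instance

# Entries of the symmetric 2x2 matrix X^T X
a = sum(row[0] * row[0] for row in X)
b = sum(row[0] * row[1] for row in X)
d = sum(row[1] * row[1] for row in X)

# Entries of the vector X^T y
p = sum(row[0] * t for row, t in zip(X, ys))
q = sum(row[1] * t for row, t in zip(X, ys))

# Closed-form inverse of [[a, b], [b, d]] applied to [p, q]
det = a * d - b * b
theta0 = (d * p - b * q) / det     # intercept (bias term)
theta1 = (-b * p + a * q) / det    # slope

print(theta0, theta1)
```

Because the toy targets are an exact linear function of x, the recovered parameters match the intercept and slope used to generate them.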

In [16]:
import numpy as np
import tensorflow as tf
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
m,n = housing.data.shape
housing_data_plus_bias = np.c_[np.ones((m,1)), housing.data]

# Use constant nodes: the data and targets are fixed inputs that never change
X = tf.constant(housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1,1), dtype=tf.float32, name="y")
XT = tf.transpose(X)
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)

with tf.Session() as sess:
    theta_value = theta.eval()
    
print(theta_value)

[[ -3.74651413e+01]
 [  4.35734153e-01]
 [  9.33829229e-03]
 [ -1.06622010e-01]
 [  6.44106984e-01]
 [ -4.25131839e-06]
 [ -3.77322501e-03]
 [ -4.26648885e-01]
 [ -4.40514028e-01]]


In [17]:
#Compare with pure numpy
X = housing_data_plus_bias
y = housing.target.reshape(-1, 1)
theta_numpy = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

print(theta_numpy)

[[ -3.69419202e+01]
 [  4.36693293e-01]
 [  9.43577803e-03]
 [ -1.07322041e-01]
 [  6.45065694e-01]
 [ -3.97638942e-06]
 [ -3.78654265e-03]
 [ -4.21314378e-01]
 [ -4.34513755e-01]]


In [18]:
#Compare with scikit learn
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing.data, housing.target.reshape(-1, 1))

print(np.r_[lin_reg.intercept_.reshape(-1, 1), lin_reg.coef_.T])

[[ -3.69419202e+01]
 [  4.36693293e-01]
 [  9.43577803e-03]
 [ -1.07322041e-01]
 [  6.45065694e-01]
 [ -3.97638942e-06]
 [ -3.78654265e-03]
 [ -4.21314378e-01]
 [ -4.34513755e-01]]


<font color =blue size =4>The main benefit of this code versus computing the Normal Equation directly using NumPy is that **TensorFlow will automatically run this on your GPU card if you have one** (provided you installed TensorFlow with GPU support, of course; see Chapter 12 for more details).</font>

### Implementing Gradient Descent

Let’s try using Batch Gradient Descent (introduced in Chapter 4) instead of the Normal Equation.
First we will do this by manually computing the gradients, then we will use TensorFlow’s autodiff
feature to let TensorFlow compute the gradients automatically, and finally we will use a couple of
TensorFlow’s out-of-the-box optimizers.

**WARNING:** <font color=red>When using Gradient Descent, remember that it is important to first normalize the input feature vectors, or else training may
be much slower.</font> You can do this using TensorFlow, NumPy, Scikit-Learn’s StandardScaler, or any other solution you
prefer. The following code assumes that this normalization has already been done.

### Manually Computing Gradients

The following code should be fairly self-explanatory, except for a few new elements:

- random_uniform(): creates a node in the graph that will generate a tensor containing random values, given its shape and value range (much like NumPy's rand() function).
- assign(): creates a node that will assign a new value to a variable. **In this case it implements the Batch Gradient Descent step $\theta^{(\text{next step})} = \theta - \eta \nabla_{\theta} \text{MSE}(\theta)$.**
- The main loop executes the training step over and over (n_epochs times), and every 100 iterations it prints the current MSE. You should see the MSE go down at every iteration.
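
The update rule can be checked by hand on a tiny problem. The following is a minimal sketch in plain Python, assuming made-up data (three points on y = 1 + 2x, with the bias feature already added) and a hypothetical learning rate of 0.1; it uses the same gradient formula as the TensorFlow cell, (2/m) · Xᵀ · error.

```python
# Batch Gradient Descent by hand on toy data (hypothetical values).
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]   # first column is the bias feature x0 = 1
y = [-1.0, 1.0, 3.0]                        # exactly y = 1 + 2x
m = len(X)

eta = 0.1            # learning rate (assumed value for this toy problem)
theta = [0.0, 0.0]   # initial parameters


def predict(theta, row):
    return theta[0] * row[0] + theta[1] * row[1]


def mse(theta):
    return sum((predict(theta, row) - t) ** 2 for row, t in zip(X, y)) / m


history = [mse(theta)]
for epoch in range(200):
    errors = [predict(theta, row) - t for row, t in zip(X, y)]
    # Gradient of the MSE with respect to theta: (2/m) * X^T . errors
    gradients = [2.0 / m * sum(row[j] * e for row, e in zip(X, errors))
                 for j in range(2)]
    # Gradient Descent step: theta <- theta - eta * gradients
    theta = [theta[j] - eta * gradients[j] for j in range(2)]
    history.append(mse(theta))

print(theta, history[-1])
```

On this toy problem the parameters approach the intercept and slope used to generate the data, and the recorded MSE decreases towards zero.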


In [19]:
# Scale the data first: ALWAYS NORMALIZE X BEFORE TRAINING WITH GRADIENT DESCENT
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_housing_data = scaler.fit_transform(housing.data)
scaled_housing_data_plus_bias = np.c_[np.ones((m,1)),scaled_housing_data]
scaled_housing_data_plus_bias[0:5]

array([[ 1.        ,  2.34476576,  0.98214266,  0.62855945, -0.15375759,
        -0.9744286 , -0.04959654,  1.05254828, -1.32783522],
       [ 1.        ,  2.33223796, -0.60701891,  0.32704136, -0.26333577,
         0.86143887, -0.09251223,  1.04318455, -1.32284391],
       [ 1.        ,  1.7826994 ,  1.85618152,  1.15562047, -0.04901636,
        -0.82077735, -0.02584253,  1.03850269, -1.33282653],
       [ 1.        ,  0.93296751,  1.85618152,  0.15696608, -0.04983292,
        -0.76602806, -0.0503293 ,  1.03850269, -1.33781784],
       [ 1.        , -0.012881  ,  1.85618152,  0.3447108 , -0.03290586,
        -0.75984669, -0.08561576,  1.03850269, -1.33781784]])

In [20]:
n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
# theta: a column vector of n + 1 random initial values (one per feature, plus the bias term)
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

# compute the gradients of the MSE manually: (2/m) * X^T . error
gradients = 2/m * tf.matmul(tf.transpose(X), error)

# assign() creates a node that overwrites theta with its updated value
training_op = tf.assign(theta, theta - learning_rate * gradients)

#initialize all variables (just theta in this instance)
init = tf.global_variables_initializer()

# run da thing!
with tf.Session() as sess:
    # first run the initializer
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())  # print every 100 epochs
        sess.run(training_op)  # one Gradient Descent step per epoch
    # This reads the most recent theta, which is not necessarily the best one seen during training
    best_theta = theta.eval()
    
print('\n',"Best theta vals: ",'\n', best_theta)

Epoch 0 MSE = 11.1272
Epoch 100 MSE = 0.820041
Epoch 200 MSE = 0.620612
Epoch 300 MSE = 0.591767
Epoch 400 MSE = 0.572972
Epoch 500 MSE = 0.559444
Epoch 600 MSE = 0.549682
Epoch 700 MSE = 0.542636
Epoch 800 MSE = 0.53755
Epoch 900 MSE = 0.533878

 Best theta vals:  
 [[ 2.06855249]
 [ 0.7698015 ]
 [ 0.14014694]
 [-0.09151634]
 [ 0.13498099]
 [ 0.0040568 ]
 [-0.03986073]
 [-0.80009288]
 [-0.76083553]]


### Using Autodiff

**The preceding code works fine, but it requires mathematically deriving the gradients from the cost function (MSE).**

In the case of Linear Regression, the gradients are easy to derive and calculate, BUT if you had to do this with a deep neural network it would be a real pain: tedious and error-prone. ***You COULD use symbolic differentiation to automatically find the equations for the partial derivatives for you, but the resulting code would not necessarily be very efficient.***

To understand why, consider the function f(x)= exp(exp(exp(x))). If you know calculus, you can figure out its derivative f′(x) = exp(x) × exp(exp(x)) × exp(exp(exp(x))). If you code f(x) and f′(x) separately
and exactly as they appear, your code will not be as efficient as it could be. A more efficient solution would be to write a function that first computes exp(x), then exp(exp(x)), then exp(exp(exp(x))), and
returns all three. This gives you f(x) directly (the third term), and if you need the derivative you can just multiply all three terms and you are done. With the naïve approach you would have had to call the
exp function nine times to compute both f(x) and f′(x). With this approach you just need to call it three times.
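
The call counts in this argument are easy to verify. This small sketch wraps `math.exp` in a counting helper (the wrapper and the `naive` and `shared` function names are made up for the illustration) and compares the two approaches at an arbitrary input.

```python
import math

exp_calls = {"count": 0}


def exp(v):
    """math.exp wrapper that counts how many times it is called."""
    exp_calls["count"] += 1
    return math.exp(v)


def naive(x):
    """Code f(x) and f'(x) separately, exactly as they are written."""
    f = exp(exp(exp(x)))                           # 3 calls
    df = exp(x) * exp(exp(x)) * exp(exp(exp(x)))   # 1 + 2 + 3 = 6 calls
    return f, df


def shared(x):
    """Compute the three intermediate terms once and reuse them."""
    e1 = exp(x)
    e2 = exp(e1)
    e3 = exp(e2)
    return e3, e1 * e2 * e3


exp_calls["count"] = 0
f_naive, df_naive = naive(0.5)
naive_calls = exp_calls["count"]

exp_calls["count"] = 0
f_shared, df_shared = shared(0.5)
shared_calls = exp_calls["count"]

print(naive_calls, shared_calls)  # 9 3
```

Both versions return the same values; only the amount of repeated work differs.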

<font color=red><br>**It gets worse when your function is defined by some arbitrary code. Can you find the equation (or the code) to compute the partial derivatives of the following function? Hint: don’t even try.**</font>

    def my_func(a, b):
        z = 0
        for i in range(100):
            z = a * np.cos(z + i) + z * np.sin(b - i)
        return z

<font color = blue size=4>Fortunately, TF's **autodiff** feature comes to the rescue: it can automatically and efficiently compute the gradients for you.</font> Simply replace the 'gradients = ...' line in the Gradient Descent code from the previous section, like this:

In [21]:
n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
# theta: a column vector of n + 1 random initial values (one per feature, plus the bias term)
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

# compute the gradients: THIS TIME LET TENSORFLOW DO IT WITH AUTODIFF
# tf.gradients() takes an op (mse) and a list of variables ([theta]) and returns
# a list with one gradient tensor per variable, so [0] picks the gradient for theta
gradients = tf.gradients(mse, [theta])[0]

# assign() creates a node that overwrites theta with its updated value
training_op = tf.assign(theta, theta - learning_rate * gradients)

#initialize all variables (just theta in this instance)
init = tf.global_variables_initializer()

# run da thing!
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch % 50 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())  # print every 50 epochs
            # Also print the gradients, rounded so they are easier to read.
            # As training progresses you can see them converging towards 0!
            grads = []
            for g in gradients.eval():
                grads.append(g[0])
            print(np.array([round(gr,10) for gr in grads]).reshape(9,1))
        sess.run(training_op)  # one Gradient Descent step per epoch
    best_theta = theta.eval()
    
print('\n',"Best theta vals: ",'\n', best_theta)

Epoch 0 MSE = 8.04903
[[-3.75065756]
 [-2.22404027]
 [ 1.87810469]
 [-2.79127312]
 [-2.27046943]
 [-0.37205476]
 [ 1.22339511]
 [ 0.88830465]
 [-0.60488385]]
Epoch 50 MSE = 1.5338
[[-1.36587632]
 [-0.55734181]
 [ 0.42228666]
 [-0.29444665]
 [-0.31611839]
 [-0.04471397]
 [ 0.45046836]
 [ 0.30207318]
 [ 0.10403683]]
Epoch 100 MSE = 0.928238
[[-0.49741107]
 [-0.17276448]
 [ 0.12586282]
 [ 0.04735763]
 [-0.10181199]
 [ 0.01067678]
 [ 0.16459675]
 [ 0.20067784]
 [ 0.18626732]]
Epoch 150 MSE = 0.816813
[[-0.18114249]
 [-0.06663112]
 [ 0.05456889]
 [ 0.08639927]
 [-0.08751009]
 [ 0.01530829]
 [ 0.05916692]
 [ 0.17607491]
 [ 0.18177451]]
Epoch 200 MSE = 0.769569
[[-0.06596741]
 [-0.03268528]
 [ 0.03373747]
 [ 0.08390116]
 [-0.08612089]
 [ 0.01275524]
 [ 0.02050991]
 [ 0.16306704]
 [ 0.16883875]]
Epoch 250 MSE = 0.735273
[[-0.02402345]
 [-0.02024526]
 [ 0.02609884]
 [ 0.07642988]
 [-0.08165816]
 [ 0.01035774]
 [ 0.00644491]
 [ 0.15214403]
 [ 0.15674564]]
Epoch 300 MSE = 0.706643
[[-0.00874934]


Neat! Printing gradients.eval() every 50 epochs shows the gradient for each coefficient shrinking towards 0 as training converges.

<font color=red size=5>Below is a list of the main approaches to computing gradients automatically. TensorFlow uses reverse-mode autodiff, which is perfect (efficient and accurate) when there are many inputs and few outputs, as is often the case with neural networks. **It computes all the partial derivatives of the outputs with regard to all the inputs in just n_outputs + 1 graph traversals.**</font>

![](Pictures/Picture3.png)
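
To get a feel for reverse-mode autodiff, here is a deliberately tiny sketch in plain Python. It is not TensorFlow's implementation: the `Var` class and the `backward` function are hypothetical names, and only addition and multiplication are supported. It differentiates the expression from the first graph in these notes, f = x*x*y + y + 2, with one forward pass (building the graph and computing values) and one backward pass for the single output.

```python
# Minimal reverse-mode autodiff sketch (illustration only, supports + and *).
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate d(output)/d(node) from the output back to every input."""
    order, seen = [], set()

    def visit(node):
        # Depth-first post-order: parents are appended before their consumers.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk consumers before producers so each grad is complete when it is used.
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad


x = Var(3.0)
y = Var(4.0)
f = x * x * y + y + 2      # forward pass: f.value is 42.0
backward(f)                # backward pass: fills in x.grad and y.grad

print(f.value, x.grad, y.grad)  # 42.0 24.0 10.0
```

The results agree with the hand-derived partial derivatives ∂f/∂x = 2xy = 24 and ∂f/∂y = x² + 1 = 10 at x = 3, y = 4.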

### Using an Optimizer
So TF computes the gradients for you. But it gets even easier: it also provides a number of optimizers out of the box, including a Gradient Descent optimizer. You can simply replace the preceding 'gradients = ...' and 'training_op = ...' lines with the following code:

In [22]:
n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

#################### previous code ####################
# gradients = tf.gradients(mse,[theta])[0] 
# training_op = tf.assign(theta, theta - learning_rate * gradients)

#################### optimized TF code ####################
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

#initialize all variables (just theta in this instance)
init = tf.global_variables_initializer()

# run da thing!
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())  # print every 100 epochs
        sess.run(training_op)  # one Gradient Descent step per epoch
    best_theta = theta.eval()
best_theta

Epoch 0 MSE = 9.87982
Epoch 100 MSE = 0.721473
Epoch 200 MSE = 0.565347
Epoch 300 MSE = 0.552013
Epoch 400 MSE = 0.545321
Epoch 500 MSE = 0.540453
Epoch 600 MSE = 0.536786
Epoch 700 MSE = 0.534003
Epoch 800 MSE = 0.53188
Epoch 900 MSE = 0.530254


array([[ 2.06855249],
       [ 0.86707652],
       [ 0.14191604],
       [-0.3061046 ],
       [ 0.32597232],
       [ 0.00343692],
       [-0.04208428],
       [-0.69584179],
       [-0.66939598]], dtype=float32)

If you want to use a different type of optimizer, you just need to change one line. **For example, you can
use a momentum optimizer (which often converges much faster than Gradient Descent) by defining the optimizer like this:**

In [23]:
n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

#################### previous code ####################
# gradients = tf.gradients(mse,[theta])[0] 
# training_op = tf.assign(theta, theta - learning_rate * gradients)

#################### optimized TF code ####################
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                      momentum=0.9)
training_op = optimizer.minimize(mse)

#initialize all variables (just theta in this instance)
init = tf.global_variables_initializer()

# run da thing!
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())  # print every 100 epochs
        sess.run(training_op)  # one Momentum optimizer step per epoch
    best_theta = theta.eval()

Epoch 0 MSE = 5.65754
Epoch 100 MSE = 0.530181
Epoch 200 MSE = 0.524699
Epoch 300 MSE = 0.524364
Epoch 400 MSE = 0.524327
Epoch 500 MSE = 0.524322
Epoch 600 MSE = 0.524321
Epoch 700 MSE = 0.524321
Epoch 800 MSE = 0.52432
Epoch 900 MSE = 0.524321
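
Why momentum can help is easier to see on a one-dimensional toy cost. The sketch below is plain Python and uses the classic momentum update (accumulate a velocity, then step along it); treating this as the exact rule inside TensorFlow's optimizer is an assumption. The cost f(θ) = θ² and the step count are made up for the illustration, while the learning rate and momentum match the values used above.

```python
def grad(theta):
    """Gradient of the toy cost f(theta) = theta**2."""
    return 2.0 * theta


learning_rate = 0.01
momentum = 0.9
n_steps = 300

# Plain Gradient Descent: theta <- theta - learning_rate * gradient
theta_gd = 1.0
for _ in range(n_steps):
    theta_gd -= learning_rate * grad(theta_gd)

# Classic momentum: keep a running velocity of past gradients and step along it
theta_mom, velocity = 1.0, 0.0
for _ in range(n_steps):
    velocity = momentum * velocity + grad(theta_mom)
    theta_mom -= learning_rate * velocity

print(abs(theta_gd), abs(theta_mom))
```

After the same number of steps, the momentum run ends much closer to the minimum at 0 than plain Gradient Descent does, which is consistent with the faster drop in MSE seen in the output above.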


### Feeding Data to the Training Algorithm
Let's try to modify the previous code to implement Mini-batch Gradient Descent.

For this, **we need to replace X and y at every iteration with the next mini-batch. The simplest way to do this is to use placeholder nodes.**
- These nodes are special because they don't actually perform any computation; they just output the data you tell them to output at runtime.
- They are typically used to pass the training data to TensorFlow during training. If you don't specify a value at runtime for a placeholder, you get an exception.

To create a placeholder node:
- you must call the placeholder() function and specify the output tensor's data type.
- Optionally, you can also specify its shape, if you want to enforce it (use None for a dimension to allow any size).

For example, the following code creates a placeholder node A, and also a node B = A + 5. 
- **When we evaluate B, we pass a feed_dict to the eval() method that specifies the value of A.** 
- Note that A must have rank 2 (i.e., it must be two-dimensional) and there must be three columns (or else an exception is raised), but it can have any number of rows.

In [24]:
#reset the graph
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
reset_graph()

# Set # of cols to 3!!! You could also leave it as None
A = tf.placeholder(tf.float32, shape=(None, 3))
B = A + 5
with tf.Session() as sess:
    B_val_1 = B.eval(feed_dict={A: [[1, 2, 3]]})
    B_val_2 = B.eval(feed_dict={A: [[4, 5, 6], [7, 8, 9]]})
    # gotta have 3 columns but unlimited rows!!!!
    B_val_3 = B.eval(feed_dict={A: [[1, 2, 3],
                                   [1, 2, 3],
                                   [1, 2, 3]]})

print(B_val_1)
print(B_val_2)
print(B_val_3)

[[ 6.  7.  8.]]
[[  9.  10.  11.]
 [ 12.  13.  14.]]
[[ 6.  7.  8.]
 [ 6.  7.  8.]
 [ 6.  7.  8.]]
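
The shape rule for a placeholder declared with shape=(None, 3) can be summarized in a few lines of plain Python. The `shape_matches` helper is a hypothetical function written for this note, not part of TensorFlow; it only mirrors the rule that None means "any size" for that dimension.

```python
def shape_matches(spec, shape):
    """Return True if a concrete shape is compatible with a placeholder-style spec.

    A None entry in spec means "any size" for that dimension.
    """
    if len(spec) != len(shape):
        return False   # wrong rank
    return all(s is None or s == d for s, d in zip(spec, shape))


spec = (None, 3)

ok_one_row = shape_matches(spec, (1, 3))     # one row, three columns
ok_many_rows = shape_matches(spec, (5, 3))   # any number of rows is fine
bad_columns = shape_matches(spec, (2, 4))    # four columns: rejected
bad_rank = shape_matches(spec, (3,))         # rank 1 instead of rank 2: rejected

print(ok_one_row, ok_many_rows, bad_columns, bad_rank)  # True True False False
```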


### Mini-batch Gradient Descent
To implement Mini-batch Gradient Descent, we only need to tweak the existing code slightly:

1. First, change the definition of X and y in the construction phase to make them placeholder nodes.
2. Then define the batch size and compute the total number of batches.
3. Finally, in the execution phase, fetch the mini-batches one by one, and provide the values of X and y via the feed_dict parameter when evaluating a node that depends on either of them:

In [25]:
n_epochs = 1000
learning_rate = 0.01
reset_graph()

##################################### SET UP LINREG #####################################

#OLD CODE
#X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
#y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")

# set up placeholder nodes for Mini-batch GD
X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")

# set up the Linear Regression graph
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")  # random initial theta
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")  # cost function node
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)  # Gradient Descent optimizer
training_op = optimizer.minimize(mse)  # training op: one GD step that minimizes the MSE

init = tf.global_variables_initializer() #initialize all variables

##################################### RUN MINI-BATCH GD #####################################

n_epochs = 10  # overrides the n_epochs = 1000 set at the top of this cell
batch_size = 100
n_batches = int(np.ceil(m / batch_size))

def fetch_batch(epoch, batch_index, batch_size):
    np.random.seed(epoch * n_batches + batch_index)  # not shown in the book
    indices = np.random.randint(m, size=batch_size)  # not shown
    X_batch = scaled_housing_data_plus_bias[indices] # not shown
    y_batch = housing.target.reshape(-1, 1)[indices] # not shown
    return X_batch, y_batch

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()
print(best_theta)

[[ 2.07033372]
 [ 0.86371452]
 [ 0.12255151]
 [-0.31211874]
 [ 0.38510373]
 [ 0.00434168]
 [-0.01232954]
 [-0.83376896]
 [-0.80304712]]
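
The fetch_batch() helper above draws random indices with replacement, so within one epoch some instances may be picked several times and others not at all. A common alternative, sketched here in plain Python under that assumption, is to shuffle the indices once per epoch and slice them into consecutive batches. The `iterate_minibatches` name and the dataset size of 1050 are made up for the illustration.

```python
import math
import random

m = 1050                 # hypothetical number of training instances
batch_size = 100
n_batches = math.ceil(m / batch_size)   # the last batch may be smaller


def iterate_minibatches(m, batch_size, seed):
    """Yield lists of indices that together cover every instance exactly once."""
    rng = random.Random(seed)
    indices = list(range(m))
    rng.shuffle(indices)
    for start in range(0, m, batch_size):
        yield indices[start:start + batch_size]


batches = list(iterate_minibatches(m, batch_size, seed=0))

print(n_batches, len(batches), len(batches[-1]))  # 11 11 50
```

Each list of indices could then be used to slice the feature matrix and the targets before passing them through feed_dict.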


In [26]:
s= set([1,2,3,1,1,1])
s.add(5)
print(s)

s.update({1,2,3,4,5,6})
print(s)

s.intersection_update({1,2})
print(s)

{1, 2, 3, 5}
{1, 2, 3, 4, 5, 6}
{1, 2}


<font color = red size=5>NOTE: We do not need to pass the values of X and y when evaluating 'theta', since it does not depend on either of them!</font>

### Saving and Restoring Models

**Saving a model:** TF makes saving and restoring a model very easy. Just create a Saver node at the end of the construction phase (after all variable nodes are created); then, in the execution phase, just call its save() method whenever you want to save the model, passing it the session and the path of the checkpoint file.

**Moreover, you
probably want to save checkpoints at regular intervals during training so that if your computer crashes
during training you can continue from the last checkpoint rather than start over from scratch.**


In [27]:
reset_graph()

n_epochs = 1000                                                                       # not shown in the book
learning_rate = 0.01                                                                  # not shown

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")            # not shown
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")            # not shown
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")                                      # not shown
error = y_pred - y                                                                    # not shown
mse = tf.reduce_mean(tf.square(error), name="mse")                                    # not shown
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)            # not shown
training_op = optimizer.minimize(mse)                                                 # not shown

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval()) 
            # Save at regular intervals!
            save_path = saver.save(sess, "/tmp/my_model.ckpt")
        sess.run(training_op)
    
    best_theta = theta.eval()
    # save final model!
    save_path = saver.save(sess, "/tmp/my_model_final.ckpt")

print(best_theta)

Epoch 0 MSE = 9.16154
Epoch 100 MSE = 0.714501
Epoch 200 MSE = 0.566705
Epoch 300 MSE = 0.555572
Epoch 400 MSE = 0.548812
Epoch 500 MSE = 0.543636
Epoch 600 MSE = 0.539629
Epoch 700 MSE = 0.536509
Epoch 800 MSE = 0.534068
Epoch 900 MSE = 0.532147
[[ 2.06855249]
 [ 0.88740271]
 [ 0.14401658]
 [-0.34770882]
 [ 0.36178368]
 [ 0.00393811]
 [-0.04269556]
 [-0.66145277]
 [-0.6375277 ]]


**Restoring a model:** Create a Saver at the end of the construction phase just like before, but then at the beginning of the execution phase, instead of initializing the variables using the init node, you call the restore() method of the Saver object.

By default a Saver saves and restores all variables under their own name, but if you need more
control, you can specify which variables to save or restore, and what names to use. For example, the
following Saver will save or restore only the theta variable under the name weights:

    saver = tf.train.Saver({"weights": theta})


In [28]:
with tf.Session() as sess:
    saver.restore(sess, "/tmp/my_model_final.ckpt")
    best_theta_restored = theta.eval() # not shown in the book
print(best_theta_restored)

INFO:tensorflow:Restoring parameters from /tmp/my_model_final.ckpt
[[ 2.06855249]
 [ 0.88740271]
 [ 0.14401658]
 [-0.34770882]
 [ 0.36178368]
 [ 0.00393811]
 [-0.04269556]
 [-0.66145277]
 [-0.6375277 ]]


***You can also restore the graph structure itself: by default, save() also writes it to a second file with the same name plus a .meta extension, which you can load with tf.train.import_meta_graph().***

<font color=red size=5>This means that you can import a pretrained model without having to have the corresponding Python code to build the graph. This is very handy when you keep tweaking and saving your model: you can load a previously saved model without having to search for the version of the code that built it.</font>

In [29]:
reset_graph()
# notice that we start with an empty graph.

saver = tf.train.import_meta_graph("/tmp/my_model_final.ckpt.meta")  # this loads the graph structure
theta = tf.get_default_graph().get_tensor_by_name("theta:0") # not shown in the book

with tf.Session() as sess:
    saver.restore(sess, "/tmp/my_model_final.ckpt")  # this restores the graph's state
    best_theta_restored = theta.eval() # not shown in the book

INFO:tensorflow:Restoring parameters from /tmp/my_model_final.ckpt


# Visualizing the Graph and Training Curves Using TensorBoard
So now we have a computation graph that trains a Linear Regression model using Mini-batch
Gradient Descent, and we are saving checkpoints at regular intervals. Sounds sophisticated, doesn’t
it? However, we are still relying on the print() function to visualize progress during training. 

There is a better way: enter TensorBoard. If you feed it some training stats, it will display nice interactive visualizations of these stats in your web browser (e.g., learning curves). You can also provide it the graph’s definition and it will give you a great interface to browse through it. This is very useful to identify errors in the graph, to find bottlenecks, and so on.

- The first step is to tweak your program a bit so it writes the graph definition and some training stats — for example, the training error (MSE) — to a log directory that TensorBoard will read from. 
- You need to use a different log directory every time you run your program, or else TensorBoard will merge stats from different runs, which will mess up the visualizations. 
- **The simplest solution for this is to include a timestamp in the log directory name. Add the following code at the beginning of the program:**
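
The snippet this bullet points to is missing from these notes. A minimal sketch, mirroring the setup lines of the full listing further down, might look like the following. The helper name `make_logdir` is an assumption added for illustration, and `datetime.now(timezone.utc)` is used in place of `datetime.utcnow()`, which newer Python versions deprecate.

```python
from datetime import datetime, timezone

def make_logdir(root_logdir="tf_logs"):
    """Return a log directory path that is unique per run (hypothetical helper)."""
    # UTC timestamp down to the second, e.g. 20160906091959
    now = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    return "{}/run-{}/".format(root_logdir, now)

logdir = make_logdir()
print(logdir)  # something like tf_logs/run-20160906091959/
```

Two runs started within the same second would still collide, which is rarely a concern for manual experiments.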

You also need to add these lines to the end of the construction phase:

    mse_summary = tf.summary.scalar('MSE', mse)
    file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())
   
The first line creates a node in the graph that will evaluate the MSE value and write it to a
TensorBoard-compatible binary log string called a summary. The second line creates a FileWriter
that you will use to write summaries to logfiles in the log directory. The first parameter indicates the
path of the log directory (in this case something like tf_logs/run-20160906091959/, relative to the
current directory). The second (optional) parameter is the graph you want to visualize. Upon creation,
**the FileWriter creates the log directory if it does not already exist (and its parent directories if
needed), and writes the graph definition in a binary logfile called an events file.**

Next you need to update the execution phase to evaluate the mse_summary node regularly during
training (e.g., every 10 mini-batches). This will output a summary that you can then write to the events
file using the file_writer. Then you can close the FileWriter at the end of the program.

Here is the updated code:

In [30]:
#reset default graph
reset_graph()

#set up logdir to save model in
from datetime import datetime
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir = "tf_logs"
logdir = "{}/run-{}/".format(root_logdir, now)

#################### set up graph: construction phase ####################
n_epochs = 1000
learning_rate = 0.01
X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)
init = tf.global_variables_initializer()
# Add nodes that log the MSE summary and write the graph definition to the log directory
mse_summary = tf.summary.scalar('MSE', mse) 
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

#################### train model: execution phase ####################

n_epochs = 10
batch_size = 100
n_batches = int(np.ceil(m / batch_size))

with tf.Session() as sess:                                                        # not shown in the book
    sess.run(init)                                                                # not shown

    for epoch in range(n_epochs):                                                 # not shown
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            if batch_index % 10 == 0:
                summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
                step = epoch * n_batches + batch_index
                file_writer.add_summary(summary_str, step)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()                                                     # not shown

file_writer.close()

best_theta

array([[ 2.07033372],
       [ 0.86371452],
       [ 0.12255151],
       [-0.31211874],
       [ 0.38510373],
       [ 0.00434168],
       [-0.01232954],
       [-0.83376896],
       [-0.80304712]], dtype=float32)

#### NOTE: Avoid logging training stats at every single training step, as this would significantly slow down training.
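
To get a feel for how much logging the `batch_index % 10 == 0` check in the loop above produces, the bookkeeping can be replayed without TensorFlow. This is only a sketch: `m = 20640` is an assumed training-set size used for illustration, and `logged_steps` is a hypothetical helper that is not part of the original code.

```python
import math

def logged_steps(m, batch_size, n_epochs, log_every=10):
    """Return the batch count per epoch and the global steps that get logged."""
    n_batches = math.ceil(m / batch_size)
    steps = []
    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            if batch_index % log_every == 0:
                # same step formula as in the training loop above
                steps.append(epoch * n_batches + batch_index)
    return n_batches, steps

# m is an assumption; batch_size and n_epochs match the listing above
n_batches, steps = logged_steps(m=20640, batch_size=100, n_epochs=10)
print(n_batches * 10, "training steps,", len(steps), "summaries written")
print(steps[19:23])  # steps around the first epoch boundary
```

With these assumed numbers, 210 of the 2,070 training steps write a summary. The slice printed last is 190, 200, 207, 217, which shows the every-10th-batch count restarting at each epoch boundary.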

## List the tf_logs directory to see the runs

Run this command from the command line (make sure you are in the directory that contains tf_logs first):
    
    dir tf_logs
    
It lists one subdirectory per run, each named with the timestamp of that run.

WARNING: On Windows, avoid colons in the --logdir value. TensorBoard treats a colon as something else (apparently as the separator in its `name:path` syntax, though I am not exactly sure), so you can't pass a full file path with a drive letter to --logdir. See https://github.com/tensorflow/tensorflow/issues/7856

Command line code to run tensorboard:

    (py36) C:\Users\mciniello>cd C:\Users\mciniello\Desktop\Data Science Fundementals\Data Mining and Advanced Analytics\Text book code

    (py36) C:\Users\mciniello\Desktop\Data Science Fundementals\Data Mining and Advanced Analytics\Text book code>tensorboard --logdir tf_logs/
    Starting TensorBoard b'54' at http://CA47496-mcini04:6006

    (Press CTRL+C to quit)

# Name Scopes

When dealing with more complex models such as neural networks, the graph can easily become
cluttered with thousands of nodes. To avoid this, you can create name scopes to group related nodes.
For example, let’s modify the previous code to define the error and mse ops within a name scope
called "loss":

In [31]:
#reset default graph
reset_graph()

#set up logdir to save model in
from datetime import datetime
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir = "tf_logs"
logdir = "{}/run-{}/".format(root_logdir, now)

#################### set up graph: construction phase ####################
n_epochs = 1000
learning_rate = 0.01
X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")

# Adjust model to include name scopes
with tf.name_scope('loss') as scope:
    error = y_pred - y
    mse = tf.reduce_mean(tf.square(error), name="mse")
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)
init = tf.global_variables_initializer()
mse_summary = tf.summary.scalar('MSE', mse) 
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

#################### train model: execution phase ####################

n_epochs = 10
batch_size = 100
n_batches = int(np.ceil(m / batch_size))

with tf.Session() as sess:                                                        # not shown in the book
    sess.run(init)                                                                # not shown

    for epoch in range(n_epochs):                                                 # not shown
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            if batch_index % 10 == 0:
                summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
                step = epoch * n_batches + batch_index
                file_writer.add_summary(summary_str, step)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()                                                     # not shown

file_writer.close()

best_theta

array([[ 2.07033372],
       [ 0.86371452],
       [ 0.12255151],
       [-0.31211874],
       [ 0.38510373],
       [ 0.00434168],
       [-0.01232954],
       [-0.83376896],
       [-0.80304712]], dtype=float32)

### The mse and error nodes now appear inside the loss namespace, which appears collapsed by default.

BEFORE:
![](pictures/tb1.png)

AFTER:
![](pictures/tb2.png)



# Modularity
Suppose you want to create a graph that adds the output of two rectified linear units (ReLU). A ReLU
computes a linear function of the inputs, and outputs the result if it is positive, and 0 otherwise, as
shown in Equation 9-1.

![](pictures/relu.emf)
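
In case the image above does not render (it is an .emf file), the equation can be written out as follows. This is a transcription based on the description above, with $\mathbf{X}$ the inputs, $\mathbf{w}$ the weights and $b$ the bias, matching the `X`, `w` and `b` names used in the code below:

$$
h_{\mathbf{w},\, b}(\mathbf{X}) = \max(\mathbf{X} \cdot \mathbf{w} + b,\ 0)
$$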


The following code does the job, but it’s quite repetitive:
    
    n_features = 3
    X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
    w1 = tf.Variable(tf.random_normal((n_features, 1)), name="weights1")
    w2 = tf.Variable(tf.random_normal((n_features, 1)), name="weights2")
    b1 = tf.Variable(0.0, name="bias1")
    b2 = tf.Variable(0.0, name="bias2")
    z1 = tf.add(tf.matmul(X, w1), b1, name="z1")
    z2 = tf.add(tf.matmul(X, w2), b2, name="z2")
    relu1 = tf.maximum(z1, 0., name="relu1")
    relu2 = tf.maximum(z1, 0., name="relu2")
    output = tf.add(relu1, relu2, name="output")
    
Such repetitive code is hard to maintain and error-prone (in fact, this code contains a cut-and-paste
error; did you spot it?). It would become even worse if you wanted to add a few more ReLUs.
Fortunately, TensorFlow lets you stay DRY (Don’t Repeat Yourself): simply create a function to build
a ReLU. **The following code creates five ReLUs and outputs their sum (note that add_n() creates an operation that will compute the sum of a list of tensors):**

In [32]:
reset_graph()

def relu(X):
    w_shape = (int(X.get_shape()[1]), 1)
    w = tf.Variable(tf.random_normal(w_shape), name="weights")
    b = tf.Variable(0.0, name="bias")
    z = tf.add(tf.matmul(X, w), b, name="z")
    return tf.maximum(z, 0., name="relu")

n_features = 3
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = [relu(X) for i in range(5)]
output = tf.add_n(relus, name="output")

Note that when you create a node, TensorFlow checks whether its name already exists, and if it does
it appends an underscore followed by an index to make the name unique. So the first ReLU contains
nodes named "weights", "bias", "z", and "relu" (plus many more nodes with their default name,
such as "MatMul"); the second ReLU contains nodes named "weights_1", "bias_1", and so on; the
third ReLU contains nodes named "weights_2", "bias_2" and so on. TensorBoard identifies such series and collapses them together to reduce clutter. 
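
The renaming rule is easy to mimic in plain Python. The sketch below is a toy model of the behaviour described above, not TensorFlow's actual implementation; `NameRegistry` and `unique_name` are hypothetical names chosen for illustration.

```python
class NameRegistry:
    """Toy stand-in for a graph's bookkeeping of node names."""

    def __init__(self):
        self._counts = {}  # base name -> number of times it has been requested

    def unique_name(self, name):
        count = self._counts.get(name, 0)
        self._counts[name] = count + 1
        # the first use keeps the bare name, later uses get _1, _2, ...
        return name if count == 0 else "{}_{}".format(name, count)

graph = NameRegistry()
for _ in range(3):  # three "ReLUs", each creating the same four named nodes
    print([graph.unique_name(n) for n in ("weights", "bias", "z", "relu")])
```

Each pass through the loop prints one ReLU's node names: bare names first, then the `_1` series, then the `_2` series.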

Using name scopes, you can make the graph much clearer. Simply move all the content of the relu()
function inside a name scope. Figure 9-7 shows the resulting graph. **Notice that TensorFlow also
gives the name scopes unique names by appending _1, _2, and so on.**

In [33]:
#### even better code using name_scope
reset_graph()

def relu(X):
    with tf.name_scope("relu"):
        w_shape = (int(X.get_shape()[1]), 1)                          # not shown in the book
        w = tf.Variable(tf.random_normal(w_shape), name="weights")    # not shown
        b = tf.Variable(0.0, name="bias")                             # not shown
        z = tf.add(tf.matmul(X, w), b, name="z")                      # not shown
        return tf.maximum(z, 0., name="max")                          # not shown

n_features = 3
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = [relu(X) for i in range(5)]
output = tf.add_n(relus, name="output")

file_writer = tf.summary.FileWriter("tf_logs/relu2", tf.get_default_graph())
file_writer.close()

# Sharing Variables
... TensorFlow offers another option, which may lead to slightly cleaner and more modular code than the
previous solutions. This solution is a bit tricky to understand at first, but since it is used a lot in
TensorFlow it is worth going into a bit of detail. The idea is to use the get_variable() function to
create the shared variable if it does not exist yet, or reuse it if it already exists. The desired behavior
(creating or reusing) is controlled by an attribute of the current variable_scope(). For example, the
following code will create a variable named "relu/threshold" (as a scalar, since shape=(), and
using 0.0 as the initial value):

In [34]:
reset_graph()

with tf.variable_scope("relu"):
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))

Note that if the variable has already been created by an earlier call to get_variable(), this code
will raise an exception. This behavior prevents reusing variables by mistake. If you want to reuse a
variable, you need to explicitly say so by setting the variable scope’s reuse attribute to True (in
which case you don’t have to specify the shape or the initializer):

In [35]:
with tf.variable_scope("relu", reuse=True):
    threshold = tf.get_variable("threshold")

This code will fetch the existing "relu/threshold" variable, or raise an exception if it does not
exist or if it was not created using get_variable(). Alternatively, you can set the reuse attribute to
True inside the block by calling the scope’s reuse_variables() method:

In [36]:
with tf.variable_scope("relu") as scope:
    scope.reuse_variables()
    threshold = tf.get_variable("threshold")

**WARNING:**
Once reuse is set to True, it cannot be set back to False within the block. Moreover, if you define other variable scopes
inside this one, they will automatically inherit reuse=True. 

<font color=blue size 56> Lastly, only variables created by get_variable() can be reused this way. **This is why, when you run a loop that keeps defining threshold, X, y, etc., only threshold gets reused: it is the only variable created using get_variable(), while X, y, etc. all get _1, _2, _3, etc. appended to their names!**</font>
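
The create-or-reuse rule can be mimicked with a small dictionary-backed store. This is a toy model of the semantics described above, not how TensorFlow implements them; `VariableStore` and its `get_variable` signature are hypothetical and only loosely modelled on the real function.

```python
class VariableStore:
    """Toy model of the create-or-reuse rule (not TensorFlow's implementation)."""

    def __init__(self):
        self._vars = {}  # full name -> variable record

    def get_variable(self, scope, name, reuse=False, initial_value=None):
        full_name = "{}/{}".format(scope, name)
        if reuse:
            # reuse=True: the variable must already exist
            if full_name not in self._vars:
                raise ValueError(full_name + " does not exist, cannot reuse it")
            return self._vars[full_name]
        # reuse=False: creating the same variable twice is an error
        if full_name in self._vars:
            raise ValueError(full_name + " already exists, set reuse=True")
        self._vars[full_name] = {"name": full_name, "value": initial_value}
        return self._vars[full_name]

store = VariableStore()
created = store.get_variable("relu", "threshold", initial_value=0.0)
shared = store.get_variable("relu", "threshold", reuse=True)
print(shared is created)  # the second call returns the very same object

try:
    store.get_variable("relu", "threshold")  # second creation attempt
except ValueError as err:
    print("error:", err)
```

Requiring an explicit `reuse=True` is what stops two parts of a program from accidentally sharing a variable just because they picked the same name.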


Now you have all the pieces you need to make the relu() function access the threshold variable
without having to pass it as a parameter:

<FONT COLOR = RED SIZE=4>  This code first defines the relu() function, then creates the relu/threshold variable (as a scalar
that will later be initialized to 0.0) and builds five ReLUs by calling the relu() function. **The
relu() function reuses the relu/threshold variable, and creates the other ReLU nodes.**

In [37]:
reset_graph()

def relu(X):
    with tf.variable_scope("relu", reuse=True):
        threshold = tf.get_variable("threshold")
        w_shape = int(X.get_shape()[1]), 1                          # not shown
        w = tf.Variable(tf.random_normal(w_shape), name="weights")  # not shown
        b = tf.Variable(0.0, name="bias")                           # not shown
        z = tf.add(tf.matmul(X, w), b, name="z")                    # not shown
        return tf.maximum(z, threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
with tf.variable_scope("relu"):
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))
relus = [relu(X) for relu_index in range(5)]
output = tf.add_n(relus, name="output")

file_writer = tf.summary.FileWriter("tf_logs/relu6", tf.get_default_graph())
file_writer.close()

<font color = blue size=4> **Variables created using get_variable() are always named using the name of their variable_scope as a prefix (e.g.,
"relu/threshold"),** but for all other nodes (including variables created with tf.Variable()) the **variable scope acts like a
new name scope.** In particular, if a name scope with an identical name was already created, then a suffix is added to make the name unique. **For example, all nodes created in the preceding code (except the threshold variable) have a name prefixed with "relu_1/" to "relu_5/", as shown in Figure 9-8.**

It is somewhat unfortunate that the threshold variable must be defined outside the relu() function, where all the rest of the ReLU code resides. 
To fix this, **the following code creates the threshold variable within the relu() function upon the first call, then reuses it in subsequent calls.**

Now the relu() function does not have to worry about name scopes or variable sharing: it just calls get_variable(), which will create or reuse the threshold variable (it does not need to know which is the case). **The rest of the code calls relu() five times, making sure to set reuse=False on
the first call, and reuse=True for the other calls.**

In [38]:
reset_graph()

def relu(X):
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))
    w_shape = (int(X.get_shape()[1]), 1)                        # not shown in the book
    w = tf.Variable(tf.random_normal(w_shape), name="weights")  # not shown
    b = tf.Variable(0.0, name="bias")                           # not shown
    z = tf.add(tf.matmul(X, w), b, name="z")                    # not shown
    return tf.maximum(z, threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = []
for relu_index in range(5):
    with tf.variable_scope("relu", reuse=(relu_index >= 1)) as scope:
        relus.append(relu(X))
output = tf.add_n(relus, name="output")

file_writer = tf.summary.FileWriter("tf_logs/relu9", tf.get_default_graph())
file_writer.close()

# Questions:

**1. What are the main benefits of creating a computation graph rather than directly executing the computations? What are the main drawbacks?**

Main benefits:
- TensorFlow can automatically compute the gradients for you (using reverse-mode autodiff). **Note: Scikit-Learn does not do this; its estimators rely on hand-written gradient code or dedicated solvers.**
- TensorFlow can take care of running the operations in parallel in different threads.
- It makes it easier to run the same model across different devices.
- It simplifies introspection — for example, **to view the model in TensorBoard**.

Main drawbacks:
- It makes the learning curve steeper.
- It makes step-by-step debugging harder. **VERY TRUE! But simpler implementations for neural nets seem to be released every day!**

**2. Is the statement a_val = a.eval(session=sess) equivalent to a_val = sess.run(a)?**

Yes!

**3. Is the statement a_val, b_val = a.eval(session=sess), b.eval(session=sess) equivalent to a_val, b_val = sess.run([a, b])?**

No! Indeed, the first statement runs the graph twice (once to compute a, once to compute b), while the second statement runs the graph only once. If any of these operations (or the ops they depend on) have side effects
(e.g., a variable is modified, an item is inserted in a queue, or a reader reads a file), then the
effects will be different. **If they don’t have side effects, both statements will return the same
result, but the second statement will be faster than the first.**
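
The difference is easiest to see with a side effect. The following is a toy graph evaluator in plain Python, not TensorFlow code: `Node` and `run` are hypothetical stand-ins for graph operations and `sess.run()`, and the shared `increment` node plays the role of an op that modifies a variable.

```python
class Node:
    """Toy graph node: a function applied to the values of its input nodes."""
    def __init__(self, fn, inputs=()):
        self.fn, self.inputs = fn, inputs

def run(fetches):
    """Evaluate all fetches in ONE pass, computing each node at most once."""
    cache = {}
    def evaluate(node):
        if node not in cache:
            cache[node] = node.fn(*[evaluate(i) for i in node.inputs])
        return cache[node]
    return [evaluate(f) for f in fetches]

state = {"counter": 0}

def increment():
    # side effect: modifies state, like an op that updates a variable
    state["counter"] += 1
    return state["counter"]

inc = Node(increment)
a = Node(lambda c: c * 10, [inc])
b = Node(lambda c: c + 1, [inc])

# Two separate runs (like a.eval() then b.eval()): increment executes twice
a_val, b_val = run([a])[0], run([b])[0]
print(a_val, b_val, state["counter"])  # 10 3 2

# One run with both fetches (like sess.run([a, b])): increment executes once
state["counter"] = 0
a_val, b_val = run([a, b])
print(a_val, b_val, state["counter"])  # 10 2 1
```

In the first case `b` sees the counter after two increments, so the two statements return different results as well as doing different amounts of work.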

**4. Can you run two graphs in the same session?**
<font color=red><br>No, you cannot run two graphs in the same session. You would have to merge the graphs into a single graph first.</font>

**5. If you create a graph g containing a variable w, then start two threads and open a session in each thread, both using the same graph g, will each session have its own copy of the variable w or will it be shared?**

In local TensorFlow, sessions manage variable values, so if you create a graph g containing
a variable w, then start two threads and open a local session in each thread, both using the
same graph g, then each session will have its own copy of the variable w. However, in
distributed TensorFlow, variable values are stored in containers managed by the cluster, so
if both sessions connect to the same cluster and use the same container, then they will share
the same variable value for w.

**6. When is a variable initialized? When is it destroyed?**

<font color=red> A variable is initialized when you call its initializer, and it is destroyed when the session
ends.</font> In distributed TensorFlow, variables live in containers on the cluster, so closing a
session will not destroy the variable. To destroy a variable, you need to clear its container.

**7. What is the difference between a placeholder and a variable?**

Variables and placeholders are very different:
- A variable is an operation that holds a value. If you run the variable, it returns that value, and before you run it you need to initialize it. You can also change the variable's value by using an assignment operation. Variables are stateful, meaning that they keep the same value upon successive runs of the graph (unless you change it with an assignment operation). **Variables are typically used to hold model parameters but also for other purposes (e.g. to count the global training step).**
- Placeholders technically do not do very much: they just hold information about the type and shape of the tensor they represent, but they have no value. In fact, if you try to evaluate an operation that depends on a placeholder, you must feed TensorFlow the value of the placeholder (using the feed_dict argument) or else you will get an exception. **Placeholders are typically used to feed training or test data to TensorFlow during the EXECUTION PHASE, or to pass a value to an assignment node to change the value of a variable.**

**8. What happens when you run the graph to evaluate an operation that depends on a placeholder but you don’t feed its value? What happens if the operation does not depend on the placeholder?**

If you run the graph to evaluate an operation that depends on a placeholder but you don’t feed its value, you get an exception. If the operation does not depend on the placeholder, then no exception is raised.

**9. When you run a graph, can you feed the output value of any operation, or just the value of placeholders?**

When you run a graph, you can feed the output value of any operation, not just the value of
placeholders. **In practice, however, this is rather rare.**

**10. How can you set a variable to any value you want (during the execution phase)?**

You can specify a variable’s initial value when constructing the graph, and it will be initialized later when you run the variable’s initializer during the execution phase. **If you want to change that variable’s value to anything you want during the execution phase, then the simplest option is to create an assignment node (during the graph construction phase) using the tf.assign() function, passing the variable and a placeholder as parameters.** During the execution phase, you can run the assignment operation and feed the variable’s new value using the placeholder.

            import tensorflow as tf
            x = tf.Variable(tf.random_uniform(shape=(), minval=0.0, maxval=1.0))
            x_new_val = tf.placeholder(shape=(), dtype=tf.float32)
            x_assign = tf.assign(x, x_new_val)
            with tf.Session():
                x.initializer.run() # random number is sampled *now*
                print(x.eval()) # 0.646157 (some random number)
                x_assign.eval(feed_dict={x_new_val: 5.0})
                print(x.eval()) # 5.0

**11. How many times does reverse-mode autodiff need to traverse the graph in order to compute the gradients of the cost function with regards to 10 variables? What about forward-mode autodiff? And symbolic differentiation?**

- Reverse-mode autodiff (implemented by TensorFlow) needs to traverse the graph only twice in order to compute the gradients of the cost function with regards to any number of variables. 
- Forward-mode autodiff would need to run once for each variable (so 10 times if we want the gradients with regards to 10 different variables). 
- As for symbolic differentiation, it would build a different graph to compute the gradients, so it would not traverse the original graph at all (except when building the new gradients graph). A highly optimized symbolic differentiation system could potentially run the new gradients graph only once to compute the gradients with regards to all variables, but that new graph may be horribly complex and inefficient compared to the original graph.
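
The pass counts for the first two approaches can be checked on a toy example. The sketch below is a plain-Python illustration, not TensorFlow's implementation: `Dual`, `Var`, `backward` and `cost` are hypothetical names, and the cost function is an arbitrary one over 10 variables chosen so that the gradients are easy to verify by hand.

```python
class Dual:
    """Forward mode: a value plus its derivative w.r.t. ONE chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv
    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)
    def __mul__(self, other):
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

class Var:
    """Reverse mode: a value that records its parents and local gradients."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0
    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def backward(output):
    """Single backward traversal that fills in .grad for every node."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # parents are appended before the node
    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # each node is handled before its parents
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad

def cost(xs):
    # f(x) = x0*x1 + x1*x2 + ... + x8*x9
    total = xs[0] * xs[1]
    for i in range(1, len(xs) - 1):
        total = total + xs[i] * xs[i + 1]
    return total

values = [float(i + 1) for i in range(10)]

# Forward mode: one pass through cost() per input variable (10 passes)
forward_grads = []
for i in range(len(values)):
    duals = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(values)]
    forward_grads.append(cost(duals).deriv)

# Reverse mode: one forward pass plus one backward pass, for all inputs at once
variables = [Var(v) for v in values]
backward(cost(variables))
reverse_grads = [v.grad for v in variables]

print(forward_grads == reverse_grads)  # True
print(reverse_grads)
```

Both modes produce the same gradients here; the difference is that forward mode called `cost()` ten times while reverse mode needed one call plus one `backward()` traversal.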