# 02. Graph Structures, Printing/Drawing Theano graphs, Derivatives in Theano

# Contents

* Graph Structures
* Printing/Drawing Theano graphs
* Derivatives in Theano

# Graph Structures

* Theano Graphs
* Automatic Differentiation
* Optimizations

## Theano Graphs

* This chapter introduces the minimum you need to know about Theano's inner workings. For more detail, see Extending Theano.
    - variables: The first step in writing Theano code is to write down all mathematical relations using symbolic placeholders (variables).
    - ops: When writing down these expressions you use operations like +, -, **, sum(), tanh(). All of these are represented internally as ops.
* Internally, Theano builds a graph structure composed of interconnected
    - variable nodes
    - op nodes
    - apply nodes - An apply node represents the application of an op to some variables.
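The relationship between these three kinds of node can be sketched in plain Python. This is a toy model, not Theano's implementation: the class names `Variable`, `Op`, and `Apply` mirror Theano's terms, but their fields and the `add` helper are assumptions made for illustration.

```python
# Toy model of a computation graph (illustrative only, not Theano's classes).

class Op(object):
    """A kind of computation, e.g. addition."""
    def __init__(self, name):
        self.name = name


class Variable(object):
    """A symbolic placeholder; `owner` is the Apply node that produced it."""
    def __init__(self, name=None, owner=None):
        self.name = name
        self.owner = owner  # None for graph inputs


class Apply(object):
    """The application of one op to specific input variables."""
    def __init__(self, op, inputs):
        self.op = op
        self.inputs = list(inputs)
        self.output = Variable(owner=self)


add_op = Op("add")


def add(a, b):
    # Hypothetical helper: builds an Apply node and returns its output variable.
    return Apply(add_op, [a, b]).output


x = Variable("x")
y = Variable("y")
z = add(x, y)

print(z.owner.op.name)                    # add
print([v.name for v in z.owner.inputs])   # ['x', 'y']
print(x.owner)                            # None
```

Following `owner` from an output and `inputs` from an apply node is the same kind of navigation the notebook cells below perform on a real Theano graph.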

In [8]:
import theano.tensor as T

x = T.dmatrix('x')
y = T.dmatrix('y')
z = x + y

<img src="http://deeplearning.net/software/theano/_images/apply1.png" />

Interaction between instances of Apply (blue), Variable (red), Op (green), and Type (purple).

In [9]:
import theano
x = theano.tensor.dmatrix('x')
y = x * 2.

In [10]:
y.owner

Elemwise{mul,no_inplace}(x, DimShuffle{x,x}.0)

In [11]:
type(y.owner)

theano.gof.graph.Apply

In [12]:
y.owner.op.name

'Elemwise{mul,no_inplace}'

In [13]:
len(y.owner.inputs)

2

In [14]:
y.owner.inputs[0]

x

In [15]:
y.owner.inputs[1]

DimShuffle{x,x}.0

In [16]:
type(y.owner.inputs[1])

theano.tensor.var.TensorVariable

In [17]:
type(y.owner.inputs[1].owner)

theano.gof.graph.Apply

In [18]:
y.owner.inputs[1].owner.op

<theano.tensor.elemwise.DimShuffle at 0x104f8ed50>

In [19]:
y.owner.inputs[1].owner.inputs

[TensorConstant{2.0}]

## Automatic Differentiation

* Given the graph structure, automatic differentiation is straightforward. All tensor.grad() has to do is traverse the graph from the outputs back towards the inputs through all apply nodes (the apply nodes define which computations the graph performs).
* For each such apply node, its op defines how to compute the gradient of the node’s outputs with respect to its inputs. Note that if an op does not provide this information, the gradient is assumed to be undefined.
* Using the chain rule, these gradients can be composed to obtain the expression of the gradient of the graph’s output with respect to the graph’s inputs.
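The backward traversal described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how tensor.grad() is implemented: it works on numbers rather than symbolic expressions, the names `Node`, `add`, `mul`, and `grad` are hypothetical, and it sums over every path from output to input, which is simple but not efficient for large graphs.

```python
# Toy reverse traversal with the chain rule (illustrative only).

class Node(object):
    def __init__(self, value, parents=()):
        self.value = value
        # parents: (parent_node, local_gradient) pairs, one per input
        self.parents = list(parents)


def add(a, b):
    # d(a + b)/da = 1, d(a + b)/db = 1
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])


def mul(a, b):
    # d(a * b)/da = b, d(a * b)/db = a
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])


def grad(output, wrt):
    """Sum the chain-rule products along every path from output back to wrt."""
    total = 0.0
    stack = [(output, 1.0)]
    while stack:
        node, upstream = stack.pop()
        if node is wrt:
            total += upstream
        for parent, local in node.parents:
            stack.append((parent, upstream * local))
    return total


x = Node(3.0)
y = add(mul(x, x), x)   # y = x * x + x
print(grad(y, x))       # 7.0, i.e. 2 * x + 1 at x = 3
```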

## Optimizations

In [20]:
import theano

In [21]:
a = theano.tensor.vector("a")      # declare symbolic variable

In [22]:
b = a + a ** 10                    # build symbolic expression

In [23]:
f = theano.function([a], b)        # compile function

In [24]:
print(f([0, 1, 2]))

[    0.     2.  1026.]


#### Unoptimized graph

* Tip: if you get a pydot-related error, see
    - http://stackoverflow.com/questions/15951748/pydot-and-graphviz-error-couldnt-import-dot-parser-loading-of-dot-files-will
    - In short: 1) install the pyparsing version that pydot expects, and 2) install graphviz on your system.

In [25]:
theano.printing.pydotprint(b, outfile="./symbolic_graph_unopt.png", var_with_name_simple=True)

The output file is available at ./symbolic_graph_unopt.png


<img src="./symbolic_graph_unopt.png" />

#### Optimized graph

In [26]:
theano.printing.pydotprint(f, outfile="./symbolic_graph_opt.png", var_with_name_simple=True)

The output file is available at ./symbolic_graph_opt.png


<img src="./symbolic_graph_opt.png" />

# Printing/Drawing Theano graphs

* Pretty Printing
* Debug Printing
* Picture Printing

* Theano provides two functions for printing a graph to the terminal, before or after compilation:
    - theano.printing.pprint()
    - theano.printing.debugprint()
* pprint() is more compact and math-like; debugprint() is more verbose.
* Theano also provides pydotprint(), which creates an image of the graph.
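To make the difference between a compact and a verbose rendering concrete, here is a toy sketch in plain Python. It does not use or reproduce Theano's printers: the nested-tuple expression format and the functions `compact` and `verbose` are assumptions made for illustration only.

```python
# Toy printers over a nested-tuple expression (illustrative only).

# (op_name, arg, arg, ...) stands for an op applied to its arguments.
expr = ("gt", ("true_div", 1, ("add", 1, ("exp", ("neg", "z")))), 0.5)


def compact(e):
    """Render the whole expression on one line, function-call style."""
    if not isinstance(e, tuple):
        return str(e)
    return "%s(%s)" % (e[0], ", ".join(compact(arg) for arg in e[1:]))


def verbose(e, depth=0):
    """Render one node per line, indented by its depth in the tree."""
    indent = " |" * depth
    if not isinstance(e, tuple):
        return [indent + str(e)]
    lines = [indent + e[0]]
    for arg in e[1:]:
        lines.extend(verbose(arg, depth + 1))
    return lines


print(compact(expr))
print("\n".join(verbose(expr)))
```

The one-line form is easier to read as a formula, while the one-node-per-line form makes the nesting and each intermediate node explicit.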

In [28]:
# Consider again the logistic regression example:

import numpy
import theano
import theano.tensor as T

In [29]:
rng = numpy.random

In [30]:
# Training data
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX), rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

In [31]:
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]

In [32]:
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1 

In [33]:
# Compute gradients
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = T.grad(cost, [w,b])

In [34]:
# Training and prediction function
train = theano.function(inputs=[x,y], outputs=[prediction, xent], updates=[[w, w-0.01*gw], [b, b-0.01*gb]], name = "train")
predict = theano.function(inputs=[x], outputs=prediction, name = "predict")

## Pretty Printing

In [35]:
theano.printing.pprint(prediction)

'gt((TensorConstant{1} / (TensorConstant{1} + exp(((-(x \\dot w)) - b)))), TensorConstant{0.5})'

## Debug Printing

In [36]:
# The pre-compilation graph:
theano.printing.debugprint(prediction)

Elemwise{gt,no_inplace} [@A] ''   
 |Elemwise{true_div,no_inplace} [@B] ''   
 | |DimShuffle{x} [@C] ''   
 | | |TensorConstant{1} [@D]
 | |Elemwise{add,no_inplace} [@E] ''   
 |   |DimShuffle{x} [@F] ''   
 |   | |TensorConstant{1} [@D]
 |   |Elemwise{exp,no_inplace} [@G] ''   
 |     |Elemwise{sub,no_inplace} [@H] ''   
 |       |Elemwise{neg,no_inplace} [@I] ''   
 |       | |dot [@J] ''   
 |       |   |x [@K]
 |       |   |w [@L]
 |       |DimShuffle{x} [@M] ''   
 |         |b [@N]
 |DimShuffle{x} [@O] ''   
   |TensorConstant{0.5} [@P]


In [37]:
# The post-compilation graph:
theano.printing.debugprint(predict)

Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}} [@A] ''   4
 |CGemv{inplace} [@B] ''   3
 | |Alloc [@C] ''   2
 | | |TensorConstant{0.0} [@D]
 | | |Shape_i{0} [@E] ''   1
 | |   |x [@F]
 | |TensorConstant{1.0} [@G]
 | |x [@F]
 | |w [@H]
 | |TensorConstant{0.0} [@D]
 |InplaceDimShuffle{x} [@I] ''   0
 | |b [@J]
 |TensorConstant{(1,) of 0.5} [@K]


## Picture Printing

In [38]:
# The pre-compilation graph:
theano.printing.pydotprint(prediction, outfile="logreg_pydotprint_prediction.png", var_with_name_simple=True)

The output file is available at logreg_pydotprint_prediction.png


<img src="logreg_pydotprint_prediction.png" />

In [39]:
# The post-compilation graph:
theano.printing.pydotprint(predict, outfile="logreg_pydotprint_predict.png", var_with_name_simple=True)

The output file is available at logreg_pydotprint_predict.png


<img src="logreg_pydotprint_predict.png" />

In [40]:
# The optimized training graph:
theano.printing.pydotprint(train, outfile="logreg_pydotprint_train.png", var_with_name_simple=True)

The output file is available at logreg_pydotprint_train.png


<img src="logreg_pydotprint_train.png" />

# Derivatives in Theano

* Computing Gradients
* Computing the Jacobian
* Computing the Hessian
* Jacobian times a Vector
    - R-operator
    - L-operator
* Hessian times a Vector
* Final Pointers

## Computing Gradients

Now let’s use Theano for a slightly more sophisticated task: create a function which computes the derivative of some expression y with respect to its parameter x. To do this we will use the macro T.grad. 

<img src="http://deeplearning.net/software/theano/_images/math/7e4e2058a694a404e828a8d7ab97827b2ed9cd25.png" />

In [41]:
from theano import pp
x = T.dscalar('x')
y = x ** 2
gy = T.grad(y, x)

In [42]:
pp(gy)  # print out the gradient prior to optimization

'((fill((x ** TensorConstant{2}), TensorConstant{1.0}) * TensorConstant{2}) * (x ** (TensorConstant{2} - TensorConstant{1})))'

In [44]:
f = theano.function([x], gy)

In [45]:
f(4)

array(8.0)

We can also compute the gradient of complex expressions such as the logistic function defined above. It turns out that the derivative of the logistic is: 

<img src = "http://deeplearning.net/software/theano/_images/math/82020178fdc03e96814d00b3e2391c6f7270ce5c.png" />

<img src="http://deeplearning.net/software/theano/_images/dlogistic.png" />

In [46]:
x = T.dmatrix('x')
s = T.sum(1 / (1 + T.exp(-x)))
gs = T.grad(s, x)
dlogistic = theano.function([x], gs)

In [47]:
dlogistic([[0, 1], [-1, -2]])

array([[ 0.25      ,  0.19661193],
       [ 0.19661193,  0.10499359]])
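The values above can be cross-checked without Theano. The sketch below assumes the standard closed form for the derivative of the logistic function, s(x) * (1 - s(x)), and evaluates it elementwise on the same input; the function names are hypothetical.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def dlogistic_closed_form(x):
    # Closed-form derivative of the logistic function: s(x) * (1 - s(x)).
    s = logistic(x)
    return s * (1.0 - s)


inputs = [[0, 1], [-1, -2]]
result = [[dlogistic_closed_form(v) for v in row] for row in inputs]
for row in result:
    print(["%.8f" % v for v in row])
```

Because the cost is a sum over elements, the gradient with respect to each element is just the derivative of the logistic at that element, which is why the elementwise formula matches.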

## Computing the Jacobian

In Theano’s parlance, the term Jacobian designates the tensor comprising the first partial derivatives of the output of a function with respect to its inputs. (This is a generalization of the so-called Jacobian matrix in mathematics.) Theano implements the theano.gradient.jacobian() macro that does all that is needed to compute the Jacobian. The cell below shows how to compute it manually with theano.scan, one output element at a time.

In [48]:
x = T.dvector('x')
y = x ** 2
J, updates = theano.scan(lambda i, y,x : T.grad(y[i], x), sequences=T.arange(y.shape[0]), non_sequences=[y,x])
f = theano.function([x], J, updates=updates)

In [49]:
f([4, 4])

array([[ 8.,  0.],
       [ 0.,  8.]])
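As a sanity check on the result above, the Jacobian of an elementwise square can be approximated numerically in plain Python. This is a finite-difference sketch, not Theano's symbolic method; `square` and `numeric_jacobian` are hypothetical names and the step size `eps` is an arbitrary choice.

```python
def square(x):
    return [xi ** 2 for xi in x]


def numeric_jacobian(func, x, eps=1e-6):
    """jac[i][j] approximates d func(x)[i] / d x[j] by central differences."""
    n_out = len(func(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac


J_numeric = numeric_jacobian(square, [4.0, 4.0])
for row in J_numeric:
    print([round(v, 4) for v in row])
```

Each output element depends only on the matching input element, so the off-diagonal entries are zero and the diagonal holds 2 * x.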

## Computing the Hessian

In Theano, the term Hessian has the usual mathematical meaning: it is the matrix comprising the second-order partial derivatives of a function with scalar output and vector input. Theano implements the theano.gradient.hessian() macro that does all that is needed to compute the Hessian. The cell below shows how to compute it manually by taking the Jacobian of the gradient.

In [51]:
x = T.dvector('x')
y = x ** 2
cost = y.sum()
gy = T.grad(cost, x)
H, updates = theano.scan(lambda i, gy,x : T.grad(gy[i], x), sequences=T.arange(gy.shape[0]), non_sequences=[gy, x])
f = theano.function([x], H, updates=updates)

In [52]:
f([4, 4])

array([[ 2.,  0.],
       [ 0.,  2.]])
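The result above does not depend on the input values, which a short hand derivation for this particular cost makes clear (here the symbol \delta_{ij} denotes the Kronecker delta, introduced only for this derivation):

```latex
\mathrm{cost}(x) = \sum_k x_k^2,
\qquad
\frac{\partial\,\mathrm{cost}}{\partial x_i} = 2 x_i,
\qquad
H_{ij} = \frac{\partial^2\,\mathrm{cost}}{\partial x_i\,\partial x_j} = 2\,\delta_{ij}
```

So the Hessian is twice the identity matrix for any x, including x = [4, 4].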

## Jacobian times a Vector

### R-operator

The R-operator evaluates the product of a Jacobian and a vector, with the Jacobian on the left.

In [53]:
W = T.dmatrix('W')
V = T.dmatrix('V')
x = T.dvector('x')
y = T.dot(x, W)
JV = T.Rop(y, W, V)
f = theano.function([W, V, x], JV)

In [54]:
f([[1, 1], [1, 1]], [[2, 2], [2, 2]], [0,1])

array([ 2.,  2.])

### L-operator

Similarly to the R-operator, the L-operator computes a row vector times the Jacobian.

In [55]:
W = T.dmatrix('W')
v = T.dvector('v')
x = T.dvector('x')
y = T.dot(x, W)
VJ = T.Lop(y, W, v)
f = theano.function([v,x], VJ)

In [56]:
f([2, 2], [0, 1])

array([[ 0.,  0.],
       [ 2.,  2.]])
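For this specific example, y = dot(x, W), both products can be worked out by hand and checked in plain Python. The sketch below is specific to this one expression and is not a general R-operator or L-operator; `rop_dot` and `lop_dot` are hypothetical names. Since y[j] is the sum over i of x[i] * W[i][j], the Jacobian applied to V reduces to dot(x, V), and v times the Jacobian reduces to the outer product of x and v.

```python
def rop_dot(x, V):
    """Jacobian of y = dot(x, W) with respect to W, applied to V: dot(x, V)."""
    n_cols = len(V[0])
    return [sum(x[i] * V[i][j] for i in range(len(x))) for j in range(n_cols)]


def lop_dot(x, v):
    """Row vector v times the same Jacobian: outer(x, v), shaped like W."""
    return [[xi * vk for vk in v] for xi in x]


print(rop_dot([0, 1], [[2, 2], [2, 2]]))   # [2, 2]
print(lop_dot([0, 1], [2, 2]))             # [[0, 0], [2, 2]]
```

Neither result depends on the values in W, which is consistent with y being linear in W.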

## Hessian times a Vector

If you need to compute the Hessian times a vector, you can make use of the above-defined operators to do it more efficiently than actually computing the exact Hessian and then performing the product. Due to the symmetry of the Hessian matrix, you have two options that will give you the same result, though these options might exhibit differing performances. Hence, we suggest profiling the methods before using either one of the two:

In [57]:
x = T.dvector('x')
v = T.dvector('v')
y = T.sum(x ** 2)
gy = T.grad(y, x)
vH = T.grad(T.sum(gy * v), x)
f = theano.function([x, v], vH)

In [58]:
f([4, 4], [2, 2])

array([ 4.,  4.])

In [59]:
# or, making use of the R-operator:
x = T.dvector('x')
v = T.dvector('v')
y = T.sum(x ** 2)
gy = T.grad(y, x)
Hv = T.Rop(gy, x, v)
f = theano.function([x, v], Hv)

In [60]:
f([4, 4], [2, 2])

array([ 4.,  4.])
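The idea of getting a Hessian-vector product without forming the Hessian can also be sketched numerically: it is the directional derivative of the gradient along v. The plain-Python sketch below assumes a hand-derived gradient for y = sum(x ** 2) and uses central differences; it is not what Theano does symbolically, and the names and step size are hypothetical.

```python
def grad_y(x):
    # Hand-derived gradient of y = sum(x_k ** 2): 2 * x.
    return [2.0 * xi for xi in x]


def hessian_vector_product(grad, x, v, eps=1e-6):
    """Approximate H.v as the directional derivative of grad along v."""
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    g_plus, g_minus = grad(x_plus), grad(x_minus)
    return [(gp - gm) / (2 * eps) for gp, gm in zip(g_plus, g_minus)]


Hv_numeric = hessian_vector_product(grad_y, [4.0, 4.0], [2.0, 2.0])
print([round(h, 4) for h in Hv_numeric])
```

Only two gradient evaluations are needed, regardless of the dimension of x, whereas building the full Hessian first would need one gradient row per dimension.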

## Final Pointers

* The grad function works symbolically: it receives and returns Theano variables.
* grad can be compared to a macro since it can be applied repeatedly.
* Only scalar costs can be handled directly by grad. Arrays are handled through repeated applications.
* Built-in functions make it possible to efficiently compute vector times Jacobian and vector times Hessian.
* Work is in progress on the optimizations required to efficiently compute the full Jacobian and Hessian matrices, as well as the Jacobian times vector.

# References

* [1] Theano 0.7 Tutorial - http://deeplearning.net/software/theano/tutorial/index.html