In [32]:
import numpy as np
import theano.tensor as T
from theano import *


# Exercise 3-2

## (c)

### i.

Generate two scalar variables x1 and x2. These variables will later be filled with values.

In [33]:
x1 = T.scalar()
x2 = T.scalar()

### ii.

From the two variables construct an expression `e`.

In [34]:
e = x1**2 + x1*x2 + 3

Print the representation of this expression.

In [35]:
print(pp(e))

(((<TensorType(float64, scalar)> ** TensorConstant{2}) + (<TensorType(float64, scalar)> * <TensorType(float64, scalar)>)) + TensorConstant{3})


### iii.

As a next step, we want to evaluate this expression. Generate a theano function `f` that allows us to do so. It should use the variables `x1` and `x2` as inputs.

In [36]:
f = function([x1, x2], e)

Execute the function for `x1=2` and `x2=3`.

In [37]:
print(f(2, 3))

13.0


Print the function's representation.

In [38]:
print(f)

<theano.compile.function_module.Function object at 0x7f66be452cf8>


### iv.

Define two new variables `x3` and `x4`, and redefine `f` using `x3` and `x4` as inputs.
*Hint*: Use the `givens` input parameter of `theano.function`.

In [39]:
x3 = T.scalar()
x4 = T.scalar()
f = function([x3, x4], e, givens={x1: x3, x2: x4})

### v.

Another interesting property of theano functions is that parameters can be updated when executing a function. To test this define a shared variable (`theano.shared`), set its initial value to zero, and redefine the previous expression by replacing `3` with this shared variable.

In [40]:
state = shared(0)
e = x1**2 + x1*x2 + state

Generate a theano function from this expression that increases the output by `1` each time the function is called.

In [41]:
f = function([x1, x2], e, updates={state: state+1})

Call this function several times and interpret the results.

In [42]:
for _ in range(5):
    print(f(2, 3))
print(state.get_value())

10.0
11.0
12.0
13.0
14.0
5


While this example is nonsense, the concept of updates makes it easy to update the weights and biases of a neural network: the theano function is defined on the network loss, and the updates are computed from the gradient of the loss on the current input batch.
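The order of evaluation seen in the output above (the result uses the old state, then the state is incremented) can be mimicked in plain Python. This is only an analogy under the assumption that the output is computed before the update is applied, as the printed values suggest; `StatefulFunction` and `f_py` are hypothetical names and not part of Theano.

```python
class StatefulFunction:
    """Plain-Python analogy for a function compiled with `updates` (hypothetical helper)."""

    def __init__(self, initial_state=0):
        # Plays the role of the shared variable `state`.
        self.state = initial_state

    def __call__(self, x1, x2):
        # The output is computed from the state *before* the update ...
        result = x1**2 + x1 * x2 + self.state
        # ... and the update (state -> state + 1) is applied afterwards.
        self.state = self.state + 1
        return float(result)


f_py = StatefulFunction(initial_state=0)
outputs = [f_py(2, 3) for _ in range(5)]
print(outputs)     # [10.0, 11.0, 12.0, 13.0, 14.0]
print(f_py.state)  # 5
```

In a training setting the state would be the weights and the update rule a gradient step instead of `+ 1`.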

### vi.

As we have already mentioned, theano also allows differentiation. Write an expression that represents the gradient of `e` with respect to `x1`, i.e. the partial derivative of `e` with respect to `x1`.

In [43]:
ge = grad(e, x1)

Then generate a theano function that allows us to evaluate this expression. Evaluate it at x1=3, x2=1.

In [44]:
f = function([x1, x2], ge)
print(f(3, 1))

7.0


Check the results by computing the gradient by hand.

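With $e = x_1^2 + x_1 x_2 + s$, where the shared variable $s$ does not depend on $x_1$, the partial derivative is $\frac{\partial e}{\partial x_1} = 2 x_1 + x_2$. At $x_1 = 3$, $x_2 = 1$ this gives $2 \cdot 3 + 1 = 7$, which matches the output above.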

### vii.

Theano can compute several partial derivatives at the same time. Generate an expression `g2` that computes the partial derivatives of `e` with respect to all free variables of `e`.

In [45]:
g2 = grad(e, [x1, x2])

How is `g2` represented?

In [46]:
print(g2)

[Elemwise{add,no_inplace}.0, Elemwise{mul}.0]


Test this expression by defining an appropriate theano function.

In [47]:
f = function([x1, x2], g2)
print(f(3, 1))

[array(7.0), array(3.0)]
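Both partial derivatives can be cross-checked numerically without Theano. The following is a minimal sketch using central finite differences; `e_py` and `central_difference` are hypothetical helper names, and the shared variable is treated as a constant (its value does not influence the derivatives). The analytic values are $\frac{\partial e}{\partial x_1} = 2 x_1 + x_2$ and $\frac{\partial e}{\partial x_2} = x_1$.

```python
def e_py(x1, x2, state=0.0):
    # Same expression as `e`, with the shared variable as a plain constant.
    return x1**2 + x1 * x2 + state


def central_difference(func, x1, x2, h=1e-6):
    # Approximate both partial derivatives by symmetric difference quotients.
    d_x1 = (func(x1 + h, x2) - func(x1 - h, x2)) / (2 * h)
    d_x2 = (func(x1, x2 + h) - func(x1, x2 - h)) / (2 * h)
    return d_x1, d_x2


numeric = central_difference(e_py, 3.0, 1.0)
analytic = (2 * 3.0 + 1.0, 3.0)  # (2*x1 + x2, x1) at x1=3, x2=1
print(numeric)
print(analytic)
```

Up to rounding error, the numeric values should agree with the `7.0` and `3.0` returned by the theano function.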


## (d)

**Broadcasting** is an extension of matrix operations simplifying life in machine learning, see [Theano Docs](http://deeplearning.net/software/theano/tutorial/numpy.html#broadcasting) and [SciPy Docs](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).

This technique is used extensively when working with artificial neural networks, where it allows a mathematical expression to be applied to a whole set of input vectors, i.e. a mini-batch, at once. We will test this using numpy for the sake of simplicity.
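A typical case is adding one bias vector to every row of a mini-batch. The sketch below assumes a hypothetical layer with 2 inputs and 3 outputs; the values of `W` and `b` are made up purely for illustration.

```python
import numpy as np

# Mini-batch of 4 input vectors with 2 features each, shape (4, 2).
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
# Hypothetical weights, shape (2, 3), and bias, shape (3,).
W = np.array([[1.0, 0.0, -1.0], [0.5, 2.0, 1.0]])
b = np.array([0.1, 0.2, 0.3])

# `b` is broadcast over the first axis, i.e. added to every row of X.dot(W).
Z = X.dot(W) + b

# The same computation written as an explicit loop over the samples.
Z_loop = np.stack([x.dot(W) + b for x in X])

print(Z.shape)                 # (4, 3)
print(np.allclose(Z, Z_loop))  # True
```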

Compute `A*B` and `B*A` with `A=[[1,1],[2,2],[3,3],[4,4]]` and `B=[[2,3]]`.

In [48]:
import numpy as np

A = np.asarray([(1,1), (2,2), (3,3), (4,4)])
B = np.asarray([(2, 3)])

print("A*B: {}".format(A*B))
print("B*A: {}".format(B*A))

A*B: [[ 2  3]
 [ 4  6]
 [ 6  9]
 [ 8 12]]
B*A: [[ 2  3]
 [ 4  6]
 [ 6  9]
 [ 8 12]]


Interpret the result.

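-> `B` has shape `(1, 2)` and is broadcast (virtually repeated along the first axis) to the shape `(4, 2)` of `A`; the product is then taken element-wise. Since element-wise multiplication is commutative, the two expressions give identical results.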
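The broadcasting step can be made explicit with `np.broadcast_to`, which returns a read-only view of `B` stretched to the shape of `A`. This is a small sketch to illustrate the interpretation; `B_stretched` is a name introduced here.

```python
import numpy as np

A = np.asarray([(1, 1), (2, 2), (3, 3), (4, 4)])  # shape (4, 2)
B = np.asarray([(2, 3)])                          # shape (1, 2)

# View of B repeated along the first axis so that it matches A's shape.
B_stretched = np.broadcast_to(B, A.shape)
print(B_stretched)

# The implicit broadcast gives the same result as the explicit one,
# and the element-wise product does not depend on the operand order.
print(np.array_equal(A * B, A * B_stretched))  # True
print(np.array_equal(A * B, B * A))            # True
```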

## (e)

Compute `A.dot(B.T)`.

In [49]:
print(A.dot(B.T))

[[ 5]
 [10]
 [15]
 [20]]


Interpret the results.

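-> Each row of `A` is multiplied element-wise with the single row of `B` and the products are summed, e.g. `1*2 + 1*3 = 5` for the first row. The result is a `(4, 1)` column vector holding one dot product per row of `A`, which is how a single linear unit with weights `B` is applied to a whole mini-batch at once.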
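The relation between this matrix product and the broadcast product from (d) can be checked directly: summing the element-wise product `A*B` over each row gives the same column vector as `A.dot(B.T)`. A minimal sketch, with `dot_result` and `sum_result` as names introduced here:

```python
import numpy as np

A = np.asarray([(1, 1), (2, 2), (3, 3), (4, 4)])  # shape (4, 2)
B = np.asarray([(2, 3)])                          # shape (1, 2)

# Matrix product: (4, 2) times (2, 1) gives (4, 1).
dot_result = A.dot(B.T)

# Broadcast product followed by a sum over the feature axis.
sum_result = (A * B).sum(axis=1, keepdims=True)

print(dot_result.shape)                        # (4, 1)
print(np.array_equal(dot_result, sum_result))  # True
```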

## (f)

* Define a **Perceptron with 2 inputs and one output** using Theano.
* Use the **data, labels and weights from exercise 1.4**.
* Use the **sigmoid function from theano as an activation function** and use the **learning rate from exercise 3**.
* As a **cost function** use the **squared Euclidean loss**: $\sum\limits_{i=0}^{N-1} (\hat{y}_i - y_i)^2$
* Generate an expression for calculating the gradient of this cost function.
* Based on the gradient and the cost expressions, define a function that receives a matrix of feature vectors (a minibatch) and a vector of labels as input, calculates the cost of these inputs, and updates the weights and biases of the neural network at the same time.
* Finally, train the neural network.

*Why are we supposed to update/train a NEURAL NETWORK when the first sentence says that we are to define a PERCEPTRON? =_=*

In [60]:
# Data, labels and initial weights from exercise 1-4
# Feature vectors (x_1, x_2), one sample per row
data = np.asarray([[2,4], [1, 0.5], [0.5, 1.5], [0, 0.5]], dtype='float64')
# Prepend the constant bias input x_0 = 1 to every sample
data = np.insert(data, 0, values=1, axis=1)
# Replaced -1 with 0 to fit sigmoid activation function
labels = np.asarray([1, 1, 0, 0], dtype='int8')
initial_weights = np.asarray([0.0, 1.0, -1.0], dtype='float64')

# Learning rate from exercise 1-3
learning_rate = 0.2

# Theano variables
t_x = T.dmatrix('X')
t_y = T.bvector('Y')
t_W = shared(initial_weights, name="W")

# Theano expressions
t_activation = T.nnet.sigmoid(T.dot(t_x, t_W))
t_prediction = t_activation
t_cost = T.sum((t_prediction - t_y)**2)
t_grad_cost = grad(t_cost, t_W)

print(pp(t_activation))
print(pp(t_prediction))
print(pp(t_cost))
print(pp(t_grad_cost))

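# Compile the training step: it returns the cost and applies one gradient-descent update to W.
# `function` is called unqualified, consistent with the cells above (`from theano import *`).
train_func = function([t_x, t_y], t_cost,
                      updates=[(t_W, t_W - learning_rate*t_grad_cost)],
                      allow_input_downcast=True)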

sigmoid((X \dot W))
sigmoid((X \dot W))
Sum{acc_dtype=float64}(((sigmoid((X \dot W)) - Y) ** TensorConstant{2}))
(X.T \dot ((((fill(((sigmoid((X \dot W)) - Y) ** TensorConstant{2}), fill(Sum{acc_dtype=float64}(((sigmoid((X \dot W)) - Y) ** TensorConstant{2})), TensorConstant{1.0})) * TensorConstant{2}) * ((sigmoid((X \dot W)) - Y) ** (TensorConstant{2} - TensorConstant{1}))) * sigmoid((X \dot W))) * (TensorConstant{1.0} - sigmoid((X \dot W)))))


In [61]:
from itertools import count
t_W.set_value(initial_weights)
for it in count():
    cost = train_func(data, labels)
    if cost <= 0.005:
        print("Cost at {} after {} iterations".format(cost, it))
        break
    elif it >= 5000:
        print("Did not converge after 5000 iterations.")
        break

Cost at 0.004998064480008461 after 2361 iterations


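With the sigmoid activation the cost can never reach exactly zero, since the output is never exactly 0 or 1. Training is therefore stopped once the cost drops below a small threshold (here 0.005) instead of waiting for a cost of zero.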
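For comparison, the same training procedure can be sketched in plain numpy with the gradient written out by hand. This is not the Theano graph itself but an assumed equivalent: for the squared loss with sigmoid output $p$, the gradient with respect to the weights is $X^T \left( 2 (p - y)\, p\, (1 - p) \right)$, which corresponds to the printed gradient expression above. The names `sigmoid`, `costs` and `grad_W` are introduced here.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Same data as above, with the bias input x_0 = 1 already prepended.
X = np.array([[1, 2, 4], [1, 1, 0.5], [1, 0.5, 1.5], [1, 0, 0.5]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)
W = np.array([0.0, 1.0, -1.0])
learning_rate = 0.2

costs = []
for it in range(5001):
    p = sigmoid(X.dot(W))                        # forward pass
    cost = np.sum((p - y) ** 2)                  # squared Euclidean loss
    costs.append(cost)
    grad_W = X.T.dot(2 * (p - y) * p * (1 - p))  # hand-derived gradient
    W = W - learning_rate * grad_W               # gradient-descent update
    if cost <= 0.005:                            # same stopping threshold
        break

print("Stopped at iteration {} with cost {}".format(it, cost))
```

As in the theano version, the cost is computed with the weights before the update, so the iteration count should be directly comparable.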