# Maximum Likelihood Estimation in TensorFlow

## 0. Documentation

You may find the documentation at:

- [TensorFlow in R](https://tensorflow.rstudio.com) - The R package we're using
- [TensorFlow Python API](https://www.tensorflow.org/api_docs/python/tf) - TensorFlow itself (the documentation is for a Python library)

## 1. Computation Graphs and Sessions

### 1.1 Sessions

To start with, let's load the packages we will need:

In [None]:
library(tensorflow)

<p>A session represents a connection between the client program and the C++ runtime. We need one to run any computation, among other reasons because the session holds the values of variables. Let's create one:</p>

In [None]:
sess = tf$Session()

When we finish working with a graph, we should close the session, to free up the resources associated with it:

In [None]:
sess$close()

But since we will need a session for the rest of this notebook, let's create a new one:

In [None]:
sess = tf$Session()

### 1.2 Computation graphs

<p>
    Computations in TensorFlow are expressed as Computation Graphs, which consist of multiple types of nodes:
</p>

#### - constants
<p>Constants are tensors whose values cannot be changed after creation. We can use them to store observation data, especially if we would like to estimate our model parameters on General Purpose Graphics Processing Units (GPGPUs). This way we avoid repeated data transfer between the GPGPU and the main computer memory.</p>

In [None]:
cnst = tf$constant(42) # let's define a constant which contains the answer to life, the universe and everything

<p><b>sess$run(var)</b> means "evaluate and return the value of a node which is indicated by the variable <b>var</b>"</p>

In [None]:
sess$run(cnst) # and now let's read its value

<p>Constants and other tensors do not need to be scalars. They can be vectors, matrices, or arrays with more than two dimensions.</p>

In [None]:
vec = 1:10
cnst_vec = tf$constant(vec) # a constant containing a vector
sess$run(cnst_vec)

In [None]:
mat = matrix(1:16, 4, 4)
cnst_mat = tf$constant(mat) # a constant containing a 2D matrix
sess$run(cnst_mat)

#### - variables
<p>Variables are tensors the values of which can be changed when operations are run. They can be initialised using a constant or a random value from a selected distribution. We can use them to represent model parameters.</p>

In [None]:
vrbl = tf$Variable(                                           # let's create a variable
    tf$random_normal(shape = shape(1), mean = 0, stddev = 1)  # that will hold a single value (a 1-element tensor), initialised from the normal(0,1) distribution
)

In [None]:
sess$run(tf$global_variables_initializer()) # let's initialise our variable (this must be run even if a variable is initialised with a constant)
# because we initialise this variable from a random distribution, its value will change every time we rerun this cell

In [None]:
sess$run(vrbl) # and check its value

#### - placeholders
<p>Placeholders are tensors that we use to pass values into a graph at the moment it is run. We must provide a placeholder's value whenever we run a node that either is that placeholder or depends on it, directly or indirectly. In general, placeholders behave similarly to constants, but their values have to be transferred to the memory of the processing unit that runs the graph each time the graph is run. This is not a big deal when the processing is done on a local CPU, but it will slow down computations run on a GPU.</p>

In [None]:
phld = tf$placeholder(tf$float32) # let's define a placeholder, which will expect to have a 32-bit float (not double) assigned

In [None]:
sess$run(phld) # when we run a graph that contains placeholders without setting their values, it throws an error, because the value of a placeholder isn't set

In [None]:
sess$run(phld, feed_dict = dict(phld = 8)) # when we set a value to a placeholder when running a graph that contains it, its value will be returned

#### - operations
<p>Operations are nodes that do the actual calculations. They take tensors as inputs and produce tensors as outputs.</p>

In [None]:
a = tf$constant(1L) # let's define two constant nodes...
b = tf$constant(2L)

c = a + b # ... and a node which is equal to their sum

![](img/1_2_graph.png)

In [None]:
sess$run(c) # yep, it's confirmed, 1 + 2 = 3. Math in TensorFlow works like anywhere else

### 1.3 Example - Pythagorean theorem

<p>In this example, we will create a graph that uses the Pythagorean theorem to calculate the length of the <i>hypotenuse</i>, given the lengths of the <i>legs</i> as inputs. The equation we need is:</p>
<p>$c = \sqrt{a^2 + b^2}$</p>
<p>Let's express it as a computation graph:</p>

In [None]:
a = tf$placeholder(tf$float32) # placeholders for the lengths of the two legs
b = tf$placeholder(tf$float32)

c_square = a^2 + b^2  # a node that computes the sum of the squared leg lengths

c = tf$sqrt(c_square) # a node that computes the square root of c_square

<p>The graph we created above looks like:</p>

![](img/1_3_pyth_graph.png)

<p>Now we can evaluate the <i>c</i> node value in a session:</p>

In [None]:
sess$run(c, feed_dict = dict(a = 3, b = 4)) # because a and b are placeholders, and they're needed for c node calculation
                                            # we need to assign a and b values when running the c node calculation

In [None]:
sess$run(list(a = a, b = b, c_2 = c_square, c = c), feed_dict = dict(a = 3, b = 4)) # of course, we can also run multiple nodes in one call;
                                                                                    # the value of each node is then returned in a named list

### 1.4 Example - vector cosine

<p>Now we will create a graph that will calculate the value of a cosine between two vectors. The equation we will use is:</p>
$\cos(∡\mathbf{A}\mathbf{B}) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$

In [None]:
vec_a = tf$placeholder(tf$float64)
vec_b = tf$placeholder(tf$float64)

dotprod_ab = tf$tensordot(vec_a, vec_b, 1L) # 1L as we need to pass an integer 1
len_a = tf$norm(vec_a)
len_b = tf$norm(vec_b)

len_ab_prod = len_a * len_b

cos_ab = dotprod_ab / len_ab_prod

![](img/1_4_cosine.png)

In [None]:
sess$run(cos_ab, feed_dict = dict(vec_a = c(3,4,5,6), vec_b = c(6,8,10,12))) # the result here should be 1, because
                                                                             # these vectors have exactly the same direction

In [None]:
sess$run(cos_ab, feed_dict = dict(vec_a = c(3,4), vec_b = c(-4,3))) # the result here should be 0, because these vectors are perpendicular

In [None]:
sess$run(cos_ab, feed_dict = dict(vec_a = c(3,4,10), vec_b = c(-3,-4,-10))) # the result here should be -1, because a points in the direction opposite to b
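
As a cross-check that does not depend on TensorFlow, the same formula can be written in a few lines of base R. This is only a sketch: the helper name `cosine_similarity` is made up for this illustration and is not part of any package.

```r
# Hypothetical helper (not from the tensorflow package): cosine of the angle
# between two numeric vectors, written directly from the formula above.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cos_same     <- cosine_similarity(c(3, 4, 5, 6), c(6, 8, 10, 12))  # same direction
cos_perp     <- cosine_similarity(c(3, 4), c(-4, 3))               # perpendicular
cos_opposite <- cosine_similarity(c(3, 4, 10), c(-3, -4, -10))     # opposite direction

print(c(cos_same, cos_perp, cos_opposite))
```

The three values should agree, up to floating-point rounding, with what the TensorFlow graph returns for the same inputs.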

### 1.5 Exercise

<p>Build a graph that calculates the covariance between the vectors <i>x</i> and <i>y</i>, which should be provided as placeholder values in a sess$run() call.</p>

$cov(x,y) = {\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y}) \over n-1}$

In [None]:
x = NA # TODO: define a placeholder for x
y = NA # TODO: define a placeholder for y

mean_x = NA # TODO: calculate the mean of x
mean_y = NA # TODO: calculate the mean of y

n_1 = tf$cast(tf$size(x), tf$float64) - 1 # number of elements in x minus 1

diff_x = NA # TODO: calculate the difference between vector x and its mean
diff_y = NA # TODO: calculate the difference between vector y and its mean

diff_xy_prod = NA # TODO: calculate the dot product of diff_x and diff_y. You can use tf$reduce_sum or tf$tensordot

cov_xy = NA # TODO: divide diff_xy_prod by n - 1

In [None]:
sess$run(cov_xy, feed_dict = dict(x = c(5,20,40), y = c(10,24,33)))

<p>The expected result is: <i>199.166666666667</i></p>
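
If you want to double-check the expected result independently of your graph, base R can compute the same quantity. The sketch below writes the formula out by hand and compares it with R's built-in `cov()`; the variable names `x_obs`, `y_obs`, `cov_manual` and `cov_builtin` are chosen only for this illustration.

```r
x_obs <- c(5, 20, 40)
y_obs <- c(10, 24, 33)

# Covariance written out from the formula above
cov_manual <- sum((x_obs - mean(x_obs)) * (y_obs - mean(y_obs))) / (length(x_obs) - 1)

# R's built-in sample covariance uses the same n - 1 denominator
cov_builtin <- cov(x_obs, y_obs)

print(c(cov_manual, cov_builtin))
```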

## 2. Optimizers

### 2.1 Loss function

<p>A loss function measures how well a given model predicts the expected values. Different loss functions compare predicted and expected values in different ways. The most popular loss functions are MSE, MAE and cross-entropy.
    <br>
    <b>MSE</b> and <b>MAE</b> are used when a specific value is estimated (e.g. height in cm).
<br>
    <b>Cross-entropy</b> is used when a probability is estimated (e.g. the probability that a student will answer a question correctly).
</p>
<p><i>Usually the name 'loss function' refers to a single observation and 'cost function' to an entire dataset. In TensorFlow, however, the name 'loss function' is used for an entire dataset as well.</i>
    </p>

#### 2.1.1 Example - MSE


<p> <b> MSE - Mean Squared Error </b> is a loss function represented by the following equation:</p> 
$MSE = {1 \over{n}} \sum\limits_{i=1}^{n}{(\overline{y_i} - y_i)^2}$
<br>
where <br>
$n$ - number of observations<br>
$\overline{y_i}$ - value predicted by the model for observation $i$ <br>
$y_i$ - actual value for observation $i$ <br>

<p> In this example, we will:
    <ul>
        <li> implement MSE from scratch in plain R, </li>
        <li> use already implemented MSE in tensorflow.</li>
    </ul>
</p>     
    


In [None]:
# Generating data
set.seed(1234)
y <- 1:10 
(y_ <- y + rnorm(10, 0, 1))

In [None]:
# MSE from scratch in R
MSE <- mean((y_-y)^2)
sprintf('MSE calculated from scratch in R: %.5f', MSE)

In [None]:
# Expressing data as nodes
ty <- tf$constant(y, dtype = 'float32') # a constant holding the actual (observed) values
ty_ <- tf$constant(y_, dtype = 'float32') # the predictions are a constant here, as we only do math operations on them;
                                          # usually, however, they depend on a Variable (which changes during training)

# MSE already implemented in tensorflow
MSE_tf <- tf$losses$mean_squared_error(labels = ty, predictions = ty_)
sprintf('MSE already implemented in tensorflow: %.5f', sess$run(MSE_tf))

#### 2.1.2 Exercise - MAE

<p> <b> MAE - Mean Absolute Error </b> is a loss function represented by the following equation:</p> 
$MAE = {1 \over{n}} \sum\limits_{i=1}^{n}{\mid \overline{y_i} - y_i \mid}$
<br>
where <br>
$n$ - number of observations<br>
$\overline{y_i}$ - value predicted by the model for observation $i$ <br>
$y_i$ - actual value for observation $i$ <br>
<br>
<p> In this exercise, try to:
    <ul>
        <li> implement MAE from scratch in plain R, </li>
        <li> use already implemented MAE in tensorflow</li>
    </ul>
    as it was done in the example above.
</p>     
    

In [None]:
# Generating data
set.seed(1234)
y <- 1:10 
(y_ <- y + rnorm(10, 0, 1))

# Expressing data as nodes
ty <- tf$constant(y, dtype = 'float32') # a constant holding the actual (observed) values
ty_ <- tf$constant(y_, dtype = 'float32') # the predictions are a constant here, as we only do math operations on them;
                                          # usually, however, they depend on a Variable (which changes during training)

In [None]:
MAE <- NA # TODO: calculate MAE in plain R
sprintf('MAE calculated from scratch in R: %.5f', MAE)

In [None]:
MAE_tf <- NA # TODO: find appropriate function already implemented in tensorflow 
sprintf('MAE already implemented in tensorflow: %.5f', sess$run(MAE_tf)) 

The expected value for each run is: $0.84257$

#### 2.1.3 Example - cross-entropy

<p> <b> Cross-entropy </b> is a loss function represented by the following equation:</p> 
$\text{cross-entropy} = -{1 \over{n}} \sum\limits_{i=1}^{n} \left[ y_i \cdot \log \overline{y_i} + (1 - y_i) \cdot \log (1 - \overline{y_i}) \right]$
<br>
where <br>
$n$ - number of observations<br>
$\overline{y_i}$ - probability predicted by the model for observation $i$ <br>
$y_i$ - actual outcome for observation $i$ <br>


In [None]:
ty <- tf$constant(rep(c(1,0), 5))
ty_ <- tf$random_uniform(shape(10), minval = 0, maxval = 1, seed = 12345)

sess$run(ty)
sess$run(ty_) # note: ty_ is a random op, so each sess$run() call draws a new set of values

In [None]:
ce <- tf$losses$log_loss(labels = ty, predictions = ty_)
sess$run(ce)
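
To see what the formula does without TensorFlow, here is a small base R sketch with made-up numbers. The helper `binary_cross_entropy` is hypothetical and follows the equation above literally; as far as we know, `tf$losses$log_loss` additionally adds a small epsilon inside the logarithms, so its results may differ slightly.

```r
# Hypothetical helper: mean binary cross-entropy, as in the formula above.
# y: observed outcomes (0 or 1); p: predicted probabilities strictly between 0 and 1.
binary_cross_entropy <- function(y, p) {
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y_true <- c(1, 0, 1)
p_good <- c(0.9, 0.2, 0.6)  # probabilities that lean towards the observed outcomes
p_bad  <- c(0.1, 0.8, 0.4)  # the same confidence, pointing the wrong way

ce_good <- binary_cross_entropy(y_true, p_good)
ce_bad  <- binary_cross_entropy(y_true, p_bad)
print(c(ce_good, ce_bad))
```

Predictions that put high probability on what actually happened get a lower cross-entropy than predictions that do the opposite.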

## 2.2 Gradient descent

<p>The goal of training a model is to find the best-fitting parameters. The loss function evaluates model performance, so to find the best estimates we need to minimize the loss. Gradient descent is a first-order iterative optimization algorithm: it uses only the first derivatives (the gradient) of the loss function.</p>


### Gradient descent rule

The parameters $\vec{\theta} = [{\theta}_0, {\theta}_1, ..., {\theta}_k] $ are updated iteratively:
<br><br>
${\theta}_i := \theta_i - \alpha \cdot \frac{\partial}{\partial \theta_i} loss(\vec{\theta}, X)$

where <br>
${\theta}_i$ - the $i$-th parameter <br>
$\alpha$ - learning rate (hyperparameter) <br>
$\frac{\partial}{\partial \theta_i} loss(\vec{\theta}, X)$ - partial derivative of the loss function with respect to parameter ${\theta}_i$<br>
$X$ - observations
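
The rule above can be sketched in base R as a short loop. This is a minimal illustration under our own assumptions: the helper `gradient_descent` and the toy loss are made up for this example, and the gradient is supplied by hand rather than computed automatically as TensorFlow does.

```r
# Hypothetical helper: repeat theta := theta - alpha * gradient(theta).
gradient_descent <- function(grad_fn, theta, alpha, n_iter) {
  for (step in seq_len(n_iter)) {
    theta <- theta - alpha * grad_fn(theta)
  }
  theta
}

# Toy loss, chosen only for illustration: (theta_0 - 3)^2 + (theta_1 + 1)^2,
# which has its minimum at theta = (3, -1).
toy_gradient <- function(theta) c(2 * (theta[1] - 3), 2 * (theta[2] + 1))

theta_hat <- gradient_descent(toy_gradient, theta = c(0, 0), alpha = 0.1, n_iter = 100)
print(theta_hat)
```

All parameters are updated at once from the same gradient vector, which is what the rule means when it is applied for every $i$.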



### Intuition
![](img/2_2_Gradient_descent.png)

### 2.2.1 Example - one number


<p> We can implement gradient descent from scratch in plain R. To do it, we need to calculate the derivative of our loss function.
Let's assume that:
    <ul>
        <li>we use MSE loss function,</li>
        <li>we have only one training example ($n = 1$),</li>
        <li>$\overline{y} = \theta_0$.</li> 
       </ul>
Then $MSE = {1 \over{n}} \sum\limits_{i=1}^{n}{(\overline{y_i} - y_i)^2}$ for $n = 1$ is:
    <br>
    $ MSE = ( \overline{y_1} - y_1)^2 = (\theta_0 - y_1)^2$ 
    <br>
    and 
    <br>
    $ \frac{\partial}{\partial \theta_0 } MSE(\theta_0) = 2 \cdot (\theta_0 - y_1) = 2 \cdot (\overline {y_1} - y_1)$
</p>
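
Before using this derivative, we can sanity-check it numerically. The sketch below compares the formula with a central finite difference in plain R; the names `single_obs_mse`, `analytic_grad` and `numeric_grad` are made up for this illustration.

```r
y1 <- 10                                              # observed value
single_obs_mse <- function(theta0) (theta0 - y1)^2    # MSE for a single observation

theta0 <- 15     # point at which we check the derivative
h <- 1e-6        # small step for the finite difference

analytic_grad <- 2 * (theta0 - y1)                                               # from the formula
numeric_grad  <- (single_obs_mse(theta0 + h) - single_obs_mse(theta0 - h)) / (2 * h)  # central difference

print(c(analytic_grad, numeric_grad))
```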


In [None]:
y <- 10 # assigning real value 
y_ <- 15 # assigning initial value of estimated parameter

for (i in 1:10){
  dy_ <- 2*(y_-y) # calculating derivative of loss function 
  y_ <- y_ - 0.1 * dy_ # updating estimated value according to gradient descent with learning rate = 0.1
  print(y_)
}

The gradient descent optimizer is already implemented in TensorFlow.

In [None]:
ty <- tf$constant(10, dtype = 'float32') # real value 
ty_ <- tf$Variable(15, dtype = 'float32') # variable which we estimate; generally it's an operation node 
                                          # with result that depends on some variables, and the optimization process
                                          # optimizes values of these variables
sess$run(tf$variables_initializer(list(ty_)))

In [None]:
optimizer <- tf$train$GradientDescentOptimizer(learning_rate = 0.1) # defining optimizer with learning rate = 0.1
loss <- tf$losses$mean_squared_error(labels = ty, predictions = ty_) # defining loss function

In [None]:
train <- optimizer$minimize(loss) # create the training op once; each run performs one gradient descent step that reduces the MSE loss
for (i in 1:10){
  sess$run(train)
  print(sess$run(ty_))
}

### 2.2.2 Example - linear regression

<p>Let us use a more complex example to show how gradient descent works. Our task is to estimate the parameters $a, b$ and $c$ of a linear model: 
<br>
$z = a \cdot x + b \cdot y + c $  
using gradient descent to minimize the MSE loss function. 
</p>

In [None]:
example_222 <- readr::read_csv('data/example_222.csv') # read data
head(example_222) # look at the data 

tx <- tf$constant(example_222$x, dtype = 'float32') # create node 
ty <- tf$constant(example_222$y, dtype = 'float32') # create node
tz <- tf$constant(example_222$z, dtype = 'float32') # create node


In [None]:
ta <- tf$Variable(2) # init parameters
tb <- tf$Variable(4) # init parameters
tc <- tf$Variable(-14) # init parameters
sess$run(tf$global_variables_initializer())

tz_ <- ta*tx + tb*ty + tc # create node to compute estimation of z 

# MODEL 
optimizer <- tf$train$GradientDescentOptimizer(learning_rate = .005) # choose Gradient Descent as optimizer
MSE <- tf$losses$mean_squared_error(labels = tz, predictions = tz_) # choose MSE as loss function
train <- optimizer$minimize(MSE) # minimizing MSE with Gradient Descent optimizer


In [None]:
iter <- 100 # number of iterations in training 

# helper table to save results
results <- tibble::tibble(
  it = 0:iter,  
  mse = c(sess$run(MSE), rep(NA_real_, iter)),
  a = c(sess$run(ta), rep(NA_real_, iter)),
  b = c(sess$run(tb), rep(NA_real_, iter)),
  c = c(sess$run(tc), rep(NA_real_, iter)))

results[1,]

In [None]:
# TRAINING 
for (i in 1:iter){

  sess$run(train) # one training
  
  # save results
  results$it[i+1] <- i
  results$mse[i+1] <- sess$run(MSE)
  results$a[i+1] <- sess$run(ta) 
  results$b[i+1] <- sess$run(tb)
  results$c[i+1] <- sess$run(tc)    

}

In [None]:
# Let's check results
head(results)
tail(results)

In [None]:
library(magrittr)
library(plotly)

# Let's plot how loss has changed during training
results %>% 
  plot_ly(x=~it, y=~mse) %>% 
  add_lines() %>% 
  layout(xaxis = list(title = 'Iteration'), 
         yaxis = list(title = 'Loss (mse)'))

In [None]:
# Let's plot how parameters have changed during training
results %>% 
  tidyr::gather(key, Values, -c(it, mse)) %>% 
  plot_ly(x = ~it, y = ~ Values, color = ~key) %>% 
  add_lines() %>% 
  layout(xaxis = list(title = 'Iteration')) 
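
For a linear model with MSE loss, gradient descent should end up close to the ordinary least squares solution. The base R sketch below illustrates this on simulated data (an assumption of this example, since it does not read `data/example_222.csv`); the learning rate and iteration count are also our own choices.

```r
set.seed(42)

# Simulated data (an assumption for this sketch, not the course data set)
n <- 200
x <- rnorm(n)
y <- rnorm(n)
z <- 1.5 * x - 2 * y + 0.5 + rnorm(n, sd = 0.1)

X <- cbind(x, y, 1)        # design matrix; the column of ones carries c
theta <- c(0, 0, 0)        # initial values of a, b, c
alpha <- 0.1               # learning rate

for (i in 1:1000) {
  resid_vec <- as.vector(X %*% theta) - z               # predictions minus observations
  gradient  <- 2 * as.vector(t(X) %*% resid_vec) / n    # gradient of MSE w.r.t. a, b, c
  theta     <- theta - alpha * gradient
}

ols <- coef(lm(z ~ x + y))[c("x", "y", "(Intercept)")]  # reorder to a, b, c
print(rbind(gradient_descent = theta, least_squares = ols))
```

The gradient here is derived by hand for this particular model; in the TensorFlow version the optimizer derives it from the graph.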

### 2.2.3 Exercise - lasso regression 

<p> A <b>linear regression model</b> fits the best linear function to the data by minimizing MSE. 
    <br>
    <b>Lasso</b> and <b>ridge linear regression models</b> also fit a linear function to the data, but by <b>minimizing MSE plus an additional penalty component</b>.
    <br> 
    <br>
    For the linear model $z = a \cdot x + b \cdot y + c $, <br>
    where $a, b, c$ are parameters:
    <ul>
        <li>ridge regression minimizes: $MSE + \lambda \cdot (a^2 + b^2)$ </li>
        <li>lasso regression minimizes: $MSE + \lambda \cdot (\mid a \mid  + \mid b \mid)$ </li>
    </ul> 
    where $\lambda$ is a hyperparameter which determines how severe the additional penalty is.
</p>
<br>
    More details about ridge and lasso regression can be found in these videos:

- [Ridge Regression](https://www.youtube.com/watch?v=Q81RR3yKn30)
- [Lasso Regression](https://www.youtube.com/watch?v=NGf0voTMlcs) 


Now that you know what the loss functions for ridge and lasso regression models look like, try to implement <b>lasso regression</b> on the data from the example above (2.2.2 Example - linear regression).
<br> 
<br>
The task is to estimate $a, b$ and $c$ parameters in $z = a \cdot x + b \cdot y + c$   model using lasso regression. 

In [None]:
# reading data and creating nodes

example_222 <- readr::read_csv('data/example_222.csv') # read data
head(example_222) # look at the data 
tx <- tf$constant(example_222$x, dtype = 'float32') # create node 
ty <- tf$constant(example_222$y, dtype = 'float32') # create node
tz <- tf$constant(example_222$z, dtype = 'float32') # create node

In [None]:
ta <- NA # TODO: init parameter a=2
tb <- NA # TODO: init parameter b=4
tc <- NA # TODO: init parameter c=-14

sess$run(tf$global_variables_initializer())

In [None]:
tz_ <- NA # TODO: create node for z (node with equation which represents linear model)

optimizer <- NA # TODO: select gradient descent optimizer and set learning rate=0.005
mse <- NA # TODO: select mse loss function
loss <- NA # TODO: create whole loss function: to mse add additional component for lasso regression and lambda = 0.2
train <- NA # TODO: minimize loss function

model <- list(mse = mse, 
             loss = loss, 
             optimizer = optimizer, 
             train = train)

In [None]:
iter <- 100

results <- tibble::tibble(
  it = 0:iter,  
  mse = c(sess$run(mse), rep(NA_real_, iter)),
  loss = c(sess$run(loss), rep(NA_real_, iter)),  
  a = c(sess$run(ta), rep(NA_real_, iter)),
  b = c(sess$run(tb), rep(NA_real_, iter)),
  c = c(sess$run(tc), rep(NA_real_, iter)))


In [None]:
for (i in 1:iter){

  NA # TODO: run a single training iteration
  
 # save results
  results$it[i+1] <- i
  results$mse[i+1] <- sess$run(mse)
  results$loss[i+1] <- sess$run(loss)
  results$a[i+1] <- sess$run(ta) 
  results$b[i+1] <- sess$run(tb)
  results$c[i+1] <- sess$run(tc)
    
}

In [None]:
head(results)
tail(results)

## 2.3 Other optimizers

Gradient descent is one example of an optimizer. There are also other optimizers based on gradient descent:
<ul> 
    <li><b><a href="https://www.tensorflow.org/api_docs/python/tf/train/MomentumOptimizer">Momentum optimizer</a></b> - helps accelerate gradient descent in the right direction and decreases oscillations by accumulating past gradients (the momentum).</li>
    <li><b><a href="https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer">RMSProp optimizer</a></b> - also aims to decrease oscillations, but with a different formula: it scales each parameter's step by a running average of the magnitudes of its recent gradients.
    </li>
    <li><b><a href="https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer">Adam optimizer</a></b> - combines the Momentum and RMSProp approaches</li>
    <li> ...</li>
    </ul>
Let's see how to call different optimizers.

In [None]:
# expressing data in nodes
ty <- tf$constant(10, dtype = 'float32') # real value 
ty_ <- tf$Variable(15, dtype = 'float32') # variable which we estimate; generally it's an operation node 

loss <- tf$losses$mean_squared_error(labels = ty, predictions = ty_) # loss function

In [None]:
optimizer <- tf$train$GradientDescentOptimizer(learning_rate = 0.005) # gradient descent optimizer
train <- optimizer$minimize(loss) # train 

sess$run(tf$global_variables_initializer()) # init variables

# Gradient descent
for (i in 1:10) {
  sess$run(train)
  print(sess$run(ty_))
}

In [None]:
optimizer <- tf$train$MomentumOptimizer(learning_rate = 0.005, momentum = 0.8) # momentum optimizer
train <- optimizer$minimize(loss) # train 

sess$run(tf$global_variables_initializer()) # init variables

# Momentum optimizer
for (i in 1:10) {
  sess$run(train)
  print(sess$run(ty_))
}
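
To make the difference between the two optimizers concrete, here is a base R sketch of both update rules on the same one-number problem. It assumes the momentum rule as described in the TensorFlow documentation for `MomentumOptimizer` (accumulation = momentum * accumulation + gradient; variable = variable - learning_rate * accumulation) and ignores options such as Nesterov momentum. All variable names are made up for this illustration.

```r
target   <- 10       # value we want to recover
alpha    <- 0.005    # learning rate, as in the cells above
momentum <- 0.8

grad <- function(theta) 2 * (theta - target)   # derivative of (theta - target)^2

theta_gd  <- 15      # plain gradient descent
theta_mom <- 15      # gradient descent with momentum
velocity  <- 0       # accumulated gradients

for (i in 1:10) {
  theta_gd <- theta_gd - alpha * grad(theta_gd)

  velocity  <- momentum * velocity + grad(theta_mom)
  theta_mom <- theta_mom - alpha * velocity
}

print(c(plain = theta_gd, momentum = theta_mom))
```

With the same learning rate, the momentum variant moves further towards 10 in ten steps, because consecutive gradients point in the same direction and add up.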

## 2.4 Exercises

### 2.4.1 Pythagorean theorem

<p>We know that $b=12$ and $c=13$, and that $a, b$ and $c$ satisfy the Pythagorean theorem:  $c = \sqrt{a^2 + b^2}$.
Let's assume that we are not able to rearrange this equation to calculate $a$ directly. 
    <br> The task is to estimate $a$, knowing only the values of $b$ and $c$ and the Pythagorean theorem.
</p>

In [None]:
tf_b <- NA # TODO: create node with b = 12
tf_c <- NA # TODO: create node with c = 13
tf_a <- NA # TODO: create node for a; initialize it with any value you want

c_ <- NA # TODO: create node with c_ which follows the Pythagorean theorem

loss <- NA # TODO: as loss function choose MSE
optimizer <- NA # TODO: as optimizer choose Adam optimizer, choose the learning rate 
train <- NA # TODO: set optimizer to minimize loss function

In [None]:
NA # TODO: initialize all variables

print(paste0('current loss: ', sess$run(loss))) # print current value of loss function for randomly initialized value of a

NA # TODO: do one iteration of training

print(paste0('loss after one iteration of training: ', sess$run(loss))) # print current value of loss function after one iteration
print(paste0('value of a after one iteration of training: ', sess$run(tf_a))) # print current value of a 

In [None]:
results <- tibble::tibble(current_loss = sess$run(loss), 
                   current_value = sess$run(tf_a))
                   
for (i in 1:200){

  NA # TODO:  do one iteration of training

  # save results to training table
  results <- results %>% 
    rbind(tibble::tibble(current_loss = sess$run(loss), 
                 current_value = sess$run(tf_a)))
}

Let us plot how loss and parameter value have changed over time.

In [None]:
results %>% 
  plot_ly(x = 1:nrow(.), y = ~current_loss) %>% 
  add_lines() %>% 
  layout(xaxis = list(title = 'Iteration'), 
         yaxis = list(title = 'Loss'))

In [None]:
results %>% 
  plot_ly(x = 1:nrow(.), y = ~current_value) %>% 
  add_lines() %>% 
  layout(xaxis = list(title = 'Iteration'), 
         yaxis = list(title = 'Value of a'))

### 2.4.2 Times table (optional)

There are 10 different numbers whose values we don't know. However, we know the product of every pair of those numbers, including each number multiplied by itself (100 products). The goal is to estimate the values of those 10 numbers.

In [None]:
times_table <- readr::read_csv('data/example_times_table.csv') %>% # read data
    dplyr::mutate(                                                  # dplyr:: prefix, as dplyr is not attached explicitly at this point
        a_id = as.integer(a_id),
        b_id = as.integer(b_id)
    )
head(times_table) # look at the data

How to read data: 
- The first number squared equals 4
- Product of the second and the first numbers equals 6
- Product of the third and the first numbers equals 10 
- ...

In [None]:
a_indices <- NA # TODO: create node for a indices (from times_table) with dtype = 'int32'
b_indices <- NA # TODO: create node for b indices (from times_table) with dtype = 'int32'
y <- NA # TODO: create node for a product (from times_table)

In [None]:
values <- NA # TODO: create node for values of parameters which we'll estimate,
            # we'll estimate 10 numbers (length(unique(times_table$a_id)))
            # init values from uniform distribution [1,10]





We need to match indices with their values (that is: each factor ($a$ or $b$) with index 0 is matched with the first value from the tensor <b>values</b>, each factor ($a$ or $b$) with index 1 is matched with the second value from the tensor <b>values</b>, and so on). To do this, we use the function <b>gather</b>. [More information about gather is available here.](https://www.tensorflow.org/api_docs/python/tf/gather)

In [None]:
t_a_gathered <- tf$gather(values, a_indices) # gather a parameters with indices
t_b_gathered <- tf$gather(values, b_indices) # gather b parameters with indices
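
In plain R terms, `gather` is ordinary vector indexing with zero-based indices. The sketch below uses made-up values that are consistent with the three rows described above; the names `values_r`, `a_idx` and `b_idx` are chosen only for this illustration.

```r
values_r <- c(2, 3, 5)          # hypothetical values for indices 0, 1, 2
a_idx    <- c(0L, 1L, 2L, 0L)   # zero-based indices, as used by TensorFlow
b_idx    <- c(0L, 0L, 0L, 1L)

# R vectors are one-based, so we add 1 before indexing
a_gathered <- values_r[a_idx + 1L]
b_gathered <- values_r[b_idx + 1L]

print(a_gathered)
print(b_gathered)
```

Each position of the gathered vectors now holds the current estimate of the number referred to by the corresponding row.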

In [None]:
y_ <- NA # TODO: create node which is a product of a and b; NOTE: you need to use gathered values

In [None]:
# MODEL 
loss <- NA # TODO: as loss function choose MSE
optimizer <- NA # TODO: as optimizer choose Adam optimizer, choose the learning rate 
train <- NA # TODO: set optimizer to minimize loss function


In [None]:
NA # TODO: init all variables 

In [None]:
library(tibble)
# let us take a look at the initialized values of t_a_gathered, t_b_gathered and y_
print('a values:')
sess$run(t_a_gathered)
print('b values:')
sess$run(t_b_gathered)
print('products for iteration 0:')
sess$run(y_)

# Let us check what is a loss for randomly initialized values
print(paste0('loss for iteration 0: ', sess$run(loss)))

# Let us prepare a helper table losses with step number and loss value
step = 0
losses = tibble(
    step = step,
    current_loss = sess$run(loss)
)

# Let us prepare a helper table params with step number, values of parameters for each index in a given estimation step, 
# and indices 0:9 for each value (just to plot it later)
params <- tibble(step = step,
                 values_calc = sess$run(values),
                 id = 0:9)

In [None]:
# TRAIN 

for (i in 1:500) {
    
    NA # TODO: one iteration of training
    
    losses = rbind(
        losses,
        tibble(step = step + i,
               current_loss = sess$run(loss))
    )
    
    params = rbind(
        params, 
        tibble(step = step + i,   # record the current step, as in the losses table above
               values_calc = sess$run(values),
               id = 0:9)
    )
}

In [None]:
# Let's check our results
tail(losses, 5)
tail(params, 10)

In [None]:
# Let's plot loss function

losses %>% 
  plot_ly(x = ~step, y = ~current_loss) %>% 
  add_lines() %>% 
  layout(xaxis = list(title = 'Iteration'), 
         yaxis = list(title = 'Loss'))

# 3. Maximum Likelihood Estimation

<p>
    <b>Likelihood</b> measures the plausibility of a model parameter value, given the observations we have.<br/>
    $\mathcal{L}(\theta \mid x) = p_\theta (x) = P (X=x; \theta)$, where
    <ul>
        <li>$\mathcal{L}(\theta \mid x)$ is the likelihood value</li>
        <li>$p_\theta (x) = P (X=x; \theta)$ is the probability of observing $x$ when the parameter values are equal to $\theta$</li>
        <li>$x$ is the observed data</li>
        <li>$\theta$ denotes the parameter values</li>
    </ul>
</p>
<p>
    In other words, it tells us the probability of observing what we observed, assuming the given model parameter values. The value of this probability changes with $\theta$. When the parameters have values under which $x$ cannot occur, $P (X=x; \theta) = 0$, and when they have values under which $x$ must always occur, $P (X=x; \theta) = 1$.
</p>
<p>
    Of course, we usually have more than one observation: $x_1$, $x_2$, $x_3$, ... If the observations are independent, their joint likelihood is equal to:<br/>
    $\mathcal{L}(\theta \mid x_1, x_2, ..., x_n) = P (X=[x_1, x_2, ..., x_n]; \theta) = \prod_{i=1}^n P (X=x_i; \theta)$<br/>
</p>
<p>
    The $\theta$ for which the value of $\prod_{i=1}^n P (X=x_i; \theta)$ is highest is called the Maximum Likelihood Estimate, and in frequentist statistics it is used as an estimate of a latent parameter (a parameter that can't be measured directly).
</p>
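
To make the definition concrete, the base R sketch below evaluates the likelihood of a tiny made-up sample of binary observations on a grid of candidate $\theta$ values. The data and all names (`obs`, `theta_grid`, `likelihood`) are assumptions of this illustration.

```r
obs <- c(1, 1, 0, 1)                        # four made-up binary observations
theta_grid <- seq(0.05, 0.95, by = 0.05)    # candidate parameter values

# Likelihood of the whole sample for one value of theta:
# the product of P(X = x_i; theta) over the observations
likelihood <- function(theta) prod(ifelse(obs == 1, theta, 1 - theta))

lik_values <- sapply(theta_grid, likelihood)
theta_best <- theta_grid[which.max(lik_values)]

print(theta_best)    # grid value with the highest likelihood
print(mean(obs))     # share of ones in the sample
```

On this grid the likelihood is highest at the share of ones in the sample, which is the Maximum Likelihood Estimate for this simple model.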

## 3.1 Cross-entropy and MLE

<p>
    Now, let's take the equation from earlier<br/>
    $\mathcal{L}(\theta \mid x_1, x_2, ..., x_n) = \prod_{i=1}^n P (X=x_i; \theta)$
</p>
<p>
    and take the logarithm of both sides of this equation:<br/>
    $\log \mathcal{L}(\theta \mid x_1, x_2, ..., x_n) = \log\prod_{i=1}^n P (X=x_i; \theta) = \sum_{i=1}^n \log P (X=x_i; \theta)$<br/>
    The logarithm of the likelihood is called the log-likelihood. It attains its maximum at the same $\theta$ as the likelihood, because the logarithm is a monotonically increasing function.
</p>
<p>
    Now, let's assume that we model the probability that something happened ($y_i = 1$), e.g. that a student answered a test question correctly: $P (Y=1; \theta)$. The probability of that not occurring ($y_i = 0$) under the same conditions, $P (Y=0; \theta)$, is equal to $1 - P (Y=1; \theta)$. Then:<br/>
    $\log \mathcal{L}(\theta \mid y_i) = \log P (Y=y_i; \theta) = y_i \cdot \log P(Y=1; \theta) + (1 - y_i) \cdot \log (1 - P(Y=1; \theta))$
</p>
<p>
    For simplification, let's denote $P(Y=1; \theta)$ for given $y_i$ as $p_{i,\theta}$. Now:<br/>
    $log \mathcal{L}(\theta \mid y_i) = y_i \cdot log p_{i,\theta} + (1 - y_i) \cdot log (1 - p_{i,\theta})$
</p>
<p>
    And now let's multiply both sides by $-1$:<br/>
    $-\log \mathcal{L}(\theta \mid y_i) = -(y_i \cdot \log p_{i,\theta} + (1 - y_i) \cdot \log (1 - p_{i,\theta})) = \text{cross-entropy}(p_{i,\theta}, y_i)$
</p>
<p>
    Thus:
    <ul>
        <li><b>when we minimise the value of cross-entropy we also maximise the likelihood</b> (because we minimise the negative log-likelihood)</li>
        <li>what we do in TensorFlow is not magic that just works; it is a well-established statistical method that has been in use for over a century</li>
        <li>it also applies to very complicated models called neural networks, although in that case we usually do not expect the individual estimated parameters to have a meaningful interpretation</li>
    </ul>
</p>
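
The equality between cross-entropy and the negative log-likelihood can be checked numerically. The sketch below uses made-up outcomes and probabilities and compares the formula above with the negative log-likelihood computed from R's built-in Bernoulli density (`dbinom` with `size = 1`); the variable names are chosen only for this illustration.

```r
y_obs <- c(1, 0, 1, 1, 0)              # made-up outcomes
p_hat <- c(0.8, 0.3, 0.6, 0.9, 0.2)    # made-up predicted probabilities of y = 1

# Cross-entropy, summed over observations
cross_entropy_sum <- -sum(y_obs * log(p_hat) + (1 - y_obs) * log(1 - p_hat))

# Negative log-likelihood from the Bernoulli (binomial with size 1) density
neg_log_lik <- -sum(dbinom(y_obs, size = 1, prob = p_hat, log = TRUE))

print(c(cross_entropy_sum, neg_log_lik))
```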

## 3.2 Example - an unbalanced coin and MLE

In [None]:
# Loading packages we will need
library(tensorflow)
library(tidyverse)
library(plotly)

In this section we will use TensorFlow and MLE to estimate the probability that an unbalanced coin lands heads up, $p_H$. Below are the results of 100 coin tosses.

In [None]:
# 1 - HEADS, 0 - TAILS
results = c(0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1)

In [None]:
t_p = tf$Variable(0.01) # the variable containing a probability of a coin landing heads up. Let's initialise
                        # it with 0.01 (assume this coin almost always lands tails up)
t_y = tf$constant(results)

p_vector = tf$fill(list(tf$size(t_y)), t_p) # we need to calculate the expected probability for each coin toss we observed
                                            # however, it's the same in all attempts, so we just assume the current t_p estimate for each observation

t_loss = tf$losses$log_loss(t_y, p_vector) # the cross-entropy between the estimated probability and the observations; with its default settings log_loss already averages over observations
loss = tf$reduce_mean(t_loss)              # a single value that we want to minimise (reduce_mean leaves an already averaged scalar unchanged)

# and now, we also need to establish the optimisation process...
optimizer = tf$train$AdamOptimizer(0.1)
train = optimizer$minimize(loss)

# as always we need to create a session
sess = tf$Session() 
sess$run(tf$global_variables_initializer())

In [None]:
iter = 1000 # number of iterations to run

# table for storing model parameter optimisation progress log
r = tibble(
    it = 1:iter,
    loss = rep(NA_real_, iter),
    prob = rep(NA_real_, iter)
)

In [None]:
# now we've got everything, so LET'S START THE ESTIMATION
for(i in 1:iter){
    sess$run(train)
    r$loss[i] = sess$run(loss)
    r$prob[i] = sess$run(t_p)
}

In [None]:
cat('Final probability that the coin lands heads up is ', sess$run(t_p), '.')
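
As a sanity check independent of TensorFlow, the same kind of estimate can be sketched in base R. The snippet below is only an illustration: it uses a short hypothetical sample (`tosses`) instead of the `results` vector, and `neg_mean_loglik` is a helper introduced here, not part of the notebook. `optimize()` is base R's one-dimensional minimiser.

```r
# Hypothetical short sample (1 = heads, 0 = tails); not the notebook's data
tosses <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)

# Negative mean log-likelihood of a Bernoulli(p) model, i.e. the mean log loss
neg_mean_loglik <- function(p, y) {
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# One-dimensional numerical minimisation over p in (0, 1)
fit <- optimize(neg_mean_loglik, interval = c(1e-6, 1 - 1e-6), y = tosses)

p_numeric <- fit$minimum   # numerical estimate of p
p_closed  <- mean(tosses)  # closed-form MLE: the share of heads in the sample

cat("numerical estimate:", p_numeric, "\n")
cat("share of heads:    ", p_closed, "\n")
```

If the two numbers agree up to the optimiser's tolerance, the loss is doing what we expect. Running `mean(results)` on the notebook's data gives the value the TensorFlow estimate should be close to.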

Once the estimation process has finished, we can check how the loss value and the estimated probability changed over the iterations:

In [None]:
plot_ly() %>%
  add_lines(
    x = ~it,
    y = ~loss,
    data = r
  )

In [None]:
plot_ly() %>%
  add_lines(
    x = ~it,
    y = ~prob,
    data = r
  )

## 3.3 Example - MLE of normal distribution parameters

<p>
    In this section, we will do the maximum likelihood estimation of parameters for a continuous probability distribution. In such cases we need to use an alternative likelihood definition:<br/>
    $\mathcal{L}(\theta \mid x) = f (X=x; \theta)$<br/>
    where $f (X=x; \theta)$ is the value of the probability density function (<i>pdf</i>)
</p>
<p>
    Since we are working with the normal distribution, which has two parameters (mean and standard deviation), the total likelihood of all observations takes the form:<br/>
    $\mathcal{L}(mean, sd \mid x_1, x_2, ..., x_n) = \prod_{i=1}^n f (X=x_i; mean, sd)$,
    and<br/>
    $\log\mathcal{L}(mean, sd \mid x_1, x_2, ..., x_n) = \sum_{i=1}^n \log f (X=x_i; mean, sd)$
</p>
<p>
    <b>QUESTION</b>: is there any reason why we should prefer one of these forms over the other? (e.g. a sum of logarithms over a product)
</p>
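
One practical consideration can be checked numerically. The base R sketch below is not part of the original exercise; it uses simulated data with an arbitrary seed and compares the two forms using `dnorm()`, base R's normal density function.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
x <- rnorm(10000, mean = 3, sd = 2)  # simulated observations

# Product of 10,000 density values, each smaller than 1
lik_product <- prod(dnorm(x, mean = 3, sd = 2))

# Sum of the logarithms of the same density values
loglik_sum <- sum(dnorm(x, mean = 3, sd = 2, log = TRUE))

cat("product of densities:", lik_product, "\n")
cat("sum of log-densities:", loglik_sum, "\n")
```

With this many observations the product is too small to be represented as a double, while the sum of logarithms remains an ordinary finite number that an optimiser can work with.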

In [None]:
# Loading packages we will need
library(tensorflow)
library(tidyverse)
library(plotly)

Let's start with generating some sample data:

In [None]:
nd_mean = 3   # the mean of our distribution
nd_sd = 2     # the standard deviation of our distribution
nd_data = rnorm(10000, nd_mean, nd_sd) # and the actual data

Now, we need to build a model:

In [None]:
t_mean = tf$Variable(0.0) # a variable for the mean of the distribution, let's set it initially to 0
t_sd = tf$Variable(1.0)   # a variable for the sd of the distribution, let's set it initially to 1

dist = tf$distributions$Normal(loc = t_mean, scale = t_sd) # a distribution we will use, it takes variables storing mean and sd as parameters

lprobs = dist$log_prob(nd_data) # log-likelihoods of parameters for each observation, i.e. logarithms of pdf values assuming current t_mean and t_sd 
negLL = -tf$reduce_sum(lprobs)  # negative sum of the observations' log-likelihoods. It is negated because it is the value we will be minimising

# of course, we also need to establish the optimisation process...
optimizer = tf$train$AdamOptimizer(0.1)
train = optimizer$minimize(negLL)

# ... and create a session
sess = tf$Session() 
sess$run(tf$global_variables_initializer())

In [None]:
iter = 1000 # number of iterations to run

# table for storing model parameter optimisation progress log
r = tibble(
    it = 1:iter,
    negLL = rep(NA_real_, iter),
    mean = rep(NA_real_, iter),
    sd = rep(NA_real_, iter)
)

In [None]:
# now we've got everything, so let's start
for(i in 1:iter){
    sess$run(train)
    r$negLL[i] = sess$run(negLL)
    r$mean[i] = sess$run(t_mean)
    r$sd[i] = sess$run(t_sd)
}

Now that our model parameters have been fitted to the data, we can check how it went. Let's start with the likelihood maximisation (or rather, the minimisation of the negative log-likelihood):

In [None]:
plot_ly() %>%
  add_lines(
    x = ~it,
    y = ~negLL,
    data = r
  )

We can also check how the values of the mean and the standard deviation changed over the iterations:

In [None]:
plot_ly(data = r) %>%
  add_lines(
    x = ~it,
    y = ~mean,
    name = 'mean'
  ) %>%
  add_lines(
    x = ~it,
    y = ~sd,
    name = 'sd'
  )
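
For the normal distribution the maximum likelihood estimates also have a well-known closed form, which gives a reference point for the values the optimiser should approach. The sketch below is a base R illustration on freshly simulated data (arbitrary seed, same mean and sd as above), not a replacement for the TensorFlow model; the variable names are introduced here.

```r
set.seed(123)                        # arbitrary seed, only for reproducibility
x <- rnorm(10000, mean = 3, sd = 2)  # same setup as nd_data above

mle_mean <- mean(x)                     # MLE of the mean: the sample mean
mle_sd <- sqrt(mean((x - mle_mean)^2))  # MLE of sd: divides by n, not n - 1

cat("MLE mean:", mle_mean, "\n")
cat("MLE sd:  ", mle_sd, "\n")
cat("sd(x):   ", sd(x), "\n")  # R's sd() divides by n - 1, so it is slightly larger
```

Applying the same two lines to `nd_data` gives the numbers that the final `t_mean` and `t_sd` can be compared against.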

## 3.4 Exercise - A bribe-taking mayor

In [None]:
library(tidyverse)
library(tensorflow)

<p>
    The mayor of Los Data-flowos is a well-known bribe-taker. The local data science club decided to find out under what conditions he is willing to take a bribe. They discovered two main factors that influence his decision.
</p>
<p>
    The first one is the amount of money offered, which has a logarithmic influence on the mayor, proportional to:<br/>
    $mc \cdot \log(ao + ml)$, where<br/>
    <ul>
        <li><b>mc</b> - money coefficient, currently unknown and needs to be estimated</li>
        <li><b>ao</b> - amount of money offered by a briber</li>
        <li><b>ml</b> - location parameter of the amount of money offered, currently unknown and needs to be estimated</li>
    </ul>
</p>
<p>
    The second one is the number of policemen the potential briber knows, which has an influence on the mayor's decision proportional to:<br/>
    $pc \cdot np^{pp}$, where<br/>
    <ul>
        <li><b>pc</b> - coefficient for the number of known policemen, currently unknown and needs to be estimated</li>
        <li><b>np</b> - number of policemen known by the briber</li>
        <li><b>pp</b> - exponent for the number of policemen, currently unknown and needs to be estimated</li>
    </ul>
</p>
<p>
    Members of the club were also able to collect some observations. Each of these observations consists of three features:
    <ul>
        <li><b>money</b> - amount of money offered</li>
        <li><b>policemen</b> - number of policemen potential briber knows</li>
        <li><b>bribe_accepted</b> - binary flag informing if mayor accepted the bribe</li>
    </ul>
</p>
<p>
    Your task is to use TensorFlow to estimate parameter values of a model predicting if the mayor will take a bribe:
    <ul>
        <li><b>mc</b> called <b>v_money_coef</b> below</li>
        <li><b>ml</b> called <b>v_money_loc</b> below</li>
        <li><b>pc</b> called <b>v_policemen_coef</b> below</li>
        <li><b>pp</b> called <b>v_policemen_pow</b> below</li>
        <li><b>intercept</b> called <b>v_intercept</b> below</li>
    </ul>
</p>
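
Putting the two components together, the model described above can be summarised as follows. This is a restatement, assuming the components and the intercept are added up and passed through a sigmoid, as the TODO comments below indicate; $\sigma$ denotes the sigmoid function:

$$\text{logit} = mc \cdot \log(ao + ml) + pc \cdot np^{pp} + intercept$$

$$P(\text{bribe accepted}) = \sigma(\text{logit}) = \frac{1}{1 + e^{-\text{logit}}}$$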

In [None]:
bribers = read_csv('data/bribers.csv') %>%
    mutate(
        policemen = as.double(policemen)
    )

In [None]:
# let's take a peek
head(bribers)

Start with loading the data into the graph...

In [None]:
t_money = NA           # TODO: create a constant with amounts of money offered
t_policemen = NA       # TODO: create a constant with numbers of policemen known
t_bribe_accepted = NA  # TODO: create a constant with flags indicating whether the bribe was accepted in a given attempt

... and declaring variables for parameters you want to estimate. Set the initial value of each variable to 1.

In [None]:
v_money_loc = tf$abs(NA) # TODO: create a variable for money_loc (ml). Initialise it with 1.0. (later we only use absolute value to avoid negatives)
v_money_coef = NA        # TODO: create a variable for money_coef (mc). Initialise it with 1.0.

v_policemen_pow = NA     # TODO: create a variable for policemen_power (pp). Initialise it with 1.0.
v_policemen_coef = NA    # TODO: create a variable for policemen_coef (pc). Initialise it with 1.0.

v_intercept = NA         # TODO: create a variable for intercept. Initialise it with 1.0.

Now, you need to build a graph describing how the probability of taking a bribe can be calculated using the constants and variables. You also need to define the total loss and the optimisation process.

In [None]:
money_comp = NA     # TODO: calculate money component as money_coef * log(money + money_loc)
policemen_comp = NA # TODO: calculate policemen component as policemen_coef * (policemen ^ policemen_power)

logit = NA          # TODO: calculate logit as a sum of money component, policemen component and intercept
prob = NA           # TODO: convert logit to a probability using a sigmoid function

loss = NA           # TODO: calculate mean cross-entropy loss of the estimated probability

optimizer = NA      # TODO: create an Adam optimiser with a learning rate equal to 0.05
train = NA          # TODO: use optimiser to minimise loss

And now, it's time to create a session, and run the optimisation process:

In [None]:
sess = tf$Session()
sess$run(tf$global_variables_initializer())

In [None]:
for(i in 0:1000){
    sess$run(train)
    if(i %% 100 == 0){
        cat(i, ') ', format(Sys.time(), "%Y-%m-%d %X"), ' Loss: ', sess$run(loss), '\n', sep = '')
    }
}

When optimisation is finished, we can read parameter values:

In [None]:
cat("v_money_loc: ", sess$run(v_money_loc), "\n")
cat("v_money_coef: ", sess$run(v_money_coef), "\n")
cat("v_policemen_pow: ", sess$run(v_policemen_pow), "\n")
cat("v_policemen_coef: ", sess$run(v_policemen_coef), "\n")
cat("v_intercept: ", sess$run(v_intercept), "\n")

<p>
    The local fortune-teller is sure that these parameters should be equal to:<br/>
    money_loc = 2<br/>
    money_coef = 1<br/>
    policemen_pow = 1.3<br/>
    policemen_coef = -0.5<br/>
    intercept = -1.5<br/>
    These numbers seem to be correct, as they have often worked. Are your estimates similar?
</p>
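
To get a feeling for what such parameter values imply, the model can be evaluated directly in base R. This is only a sketch: `bribe_prob` is a hypothetical helper, its defaults are the fortune-teller's values, and the offers below are made-up inputs rather than rows from `bribers.csv`. `plogis()` is base R's logistic (sigmoid) function.

```r
# Probability that the bribe is accepted, for given inputs and parameter values
bribe_prob <- function(money, policemen,
                       money_loc = 2, money_coef = 1,
                       policemen_pow = 1.3, policemen_coef = -0.5,
                       intercept = -1.5) {
  money_comp <- money_coef * log(money + money_loc)
  policemen_comp <- policemen_coef * policemen^policemen_pow
  plogis(money_comp + policemen_comp + intercept)  # sigmoid of the logit
}

# Same (made-up) offer, different numbers of policemen known by the briber
p_few  <- bribe_prob(money = 100, policemen = 1)
p_many <- bribe_prob(money = 100, policemen = 10)

cat("knows 1 policeman:  ", p_few, "\n")
cat("knows 10 policemen: ", p_many, "\n")
```

Because the policemen coefficient is negative, knowing more policemen lowers the predicted probability. Plugging your own estimates into the function's arguments shows whether they lead to similar predictions.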