<a href="https://colab.research.google.com/github/lblogan14/reinforcement_learning_with_tensorflow/blob/master/ch1_deep_learning_review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Activation functions for deep learning
Activation functions decide whether a particular neuron is activated, that is, whether the information received by the neuron is relevant.

##Sigmoid function
$$\sigma(x)=\frac{1}{1+e^{-x}}$$
![alt text](https://github.com/lblogan14/reinforcement_learning_with_tensorflow/blob/master/note_images/ch1/sigmoid.PNG?raw=true)

Observing the curve of the function, we see that the gradient is high when $x$ lies between -3 and 3, but the curve becomes flat beyond that. Thus, small changes in $x$ within this range bring large changes in the value of the sigmoid function.

Therefore, the function tends to push its output towards the extremes, 0 and 1. This is why the sigmoid function is used in classification problems.

The gradient of the sigmoid function is shown below. It is a smooth curve dependent on $x$, so it is easy to backpropagate the error and update the parameters, $w$ and $b$.
![alt text](https://github.com/lblogan14/reinforcement_learning_with_tensorflow/blob/master/note_images/ch1/sigmoid_grad.PNG?raw=true)

Sigmoid disadvantage: whenever the input falls above +3 or below -3, the gradient approaches zero and the learning of the neural network comes to a halt.
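
The saturation described above can be checked numerically. The following is a minimal sketch in plain Python; the helper names `sigmoid` and `sigmoid_grad` are illustrative and not part of the original notebook. It relies on the identity $\sigma'(x)=\sigma(x)(1-\sigma(x))$.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 3.0, 6.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.4f}  grad={sigmoid_grad(x):.6f}")
```

The gradient is largest at $x=0$, where it equals 0.25, and it is already small at $x=3$, which is the region where learning slows down.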

##tanh function

Unlike the sigmoid function, whose output ranges from 0 to 1, the tanh function is symmetric around the origin and ranges from -1 to 1.
$$\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$$
![alt text](https://github.com/lblogan14/reinforcement_learning_with_tensorflow/blob/master/note_images/ch1/tanh.PNG?raw=true)

It is continuous, non-linear, and differentiable at all points. Although symmetric, it becomes flat for inputs below -2 and above 2.

The gradient of the tanh function is shown below; it is steeper than that of the sigmoid. Like the sigmoid, the tanh function also suffers from the vanishing gradient problem.

![alt text](https://github.com/lblogan14/reinforcement_learning_with_tensorflow/blob/master/note_images/ch1/tanh_grad.PNG?raw=true)
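
To make the comparison with the sigmoid concrete, here is a small sketch in plain Python. The function names are illustrative; the derivative used is $\tanh'(x)=1-\tanh^2(x)$.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid, for comparison."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Compare the two gradients at the origin and further out.
for x in (0.0, 1.0, 2.0, 5.0):
    print(f"x={x:3.1f}  tanh'={tanh_grad(x):.6f}  sigmoid'={sigmoid_grad(x):.6f}")
```

At the origin the tanh gradient is 1 while the sigmoid gradient is 0.25, yet both approach zero for large $|x|$, which is the vanishing gradient behaviour mentioned above.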

##Softmax function
The softmax function is mainly used to handle classification problems, preferably in the output layer, where it outputs the probabilities of the output classes.
$$\mathrm{softmax}(x_i)=\frac{e^{x_i}}{\sum\limits_j^ce^{x_j}}$$
While the sigmoid function can handle only two classes, the softmax function handles multiple classes: it generates a value for every class, and these values follow the rules of probability.

For example, let $x=[1,2,3,4]$, where the four entries of $x$ are the scores of four classes. The softmax function then gives:
$$\mathrm{softmax}(x_1)=\frac{e^{x_1}}{\sum\limits_j^4e^{x_j}}=\frac{e^1}{e^1+e^2+e^3+e^4}=0.032$$
$$\mathrm{softmax}(x_2)=\frac{e^{x_2}}{\sum\limits_j^4e^{x_j}}=\frac{e^2}{e^1+e^2+e^3+e^4}=0.087$$
$$\mathrm{softmax}(x_3)=\frac{e^{x_3}}{\sum\limits_j^4e^{x_j}}=\frac{e^3}{e^1+e^2+e^3+e^4}=0.237$$
$$\mathrm{softmax}(x_4)=\frac{e^{x_4}}{\sum\limits_j^4e^{x_j}}=\frac{e^4}{e^1+e^2+e^3+e^4}=0.644$$
Thus, these are the probabilities of all the classes, and they sum to 1.

The softmax function exponentiates the output for each class and divides it by the sum of the exponentiated outputs for all the classes, so every value lies between 0 and 1 and the values sum to 1. This is why the softmax function is a natural choice for the output layer activation function of a classifier.
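
The four-class example can be evaluated with a few lines of plain Python. This is a sketch, not code from the original notebook; subtracting the maximum score before exponentiating is a common numerical-stability step added here as an assumption, and it does not change the result.

```python
import math

def softmax(scores):
    """Softmax over a list of scores, shifted by the maximum for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0, 4.0])
print([round(p, 3) for p in probs])  # one probability per class
print(sum(probs))                    # the probabilities sum to 1 (up to rounding)
```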

##Rectified linear unit (ReLU) function
$$\mathrm{ReLU}(x)=\max(0,x)$$
![alt text](https://github.com/lblogan14/reinforcement_learning_with_tensorflow/blob/master/note_images/ch1/relu.PNG?raw=true)

ReLU is non-linear and easy to backpropagate through. Therefore, multiple hidden layers activated by ReLU can be stacked.

Compared to other activation functions, ReLU does not activate all the neurons at the same time. If the input is negative, it outputs zero and the neuron does not activate, which results in a sparse network and fast, easy computation.

![alt text](https://github.com/lblogan14/reinforcement_learning_with_tensorflow/blob/master/note_images/ch1/relu_grad.PNG?raw=true)

On the negative side, the gradient of ReLU is constantly zero. Therefore, activations falling in that region have zero gradients and their weights are not updated, so those nodes/neurons become inactive and stop learning. To overcome this problem, the Leaky ReLU is introduced:
$$\mathrm{LeakyReLU}(x)=
    \left\{
        \begin{array}{ll}
            ax & \quad x \leq 0 \\
            x & \quad x > 0
        \end{array}
    \right.$$
This prevents the gradient from becoming zero on the negative side, so weight training continues, but slowly, owing to the low value of $a$.
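
The difference between the two functions on the negative side can be seen in a short sketch. The slope `a=0.01` is an assumed illustrative value, and treating the gradient at exactly $x=0$ as belonging to the negative branch follows the piecewise definition above.

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, a=0.01):
    """Leaky ReLU: x for positive inputs, a * x otherwise (a is an assumed small slope)."""
    return x if x > 0 else a * x

def leaky_relu_grad(x, a=0.01):
    """Gradient of Leaky ReLU: 1 for positive inputs, a otherwise."""
    return 1.0 if x > 0 else a

for x in (-2.0, 0.5):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For a negative input, `relu_grad` returns zero (a dead neuron receives no update), while `leaky_relu_grad` returns the small slope `a`, so the weights still move.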

##Choose the right activation function
* Sigmoid functions work very well in shallow networks and binary
classifiers. In deeper networks they may lead to vanishing gradients.
* The ReLU function is the most widely used, so start with ReLU and move to another
activation function if ReLU doesn't provide good results. Try Leaky ReLU to avoid
the case of dead neurons.
* Use softmax in the output layer for multi-class classification.
* Avoid using ReLU in the output layer.

#Limitations of deep learning

##Vanishing gradient problem
Neurons in the early layers are not able to learn because the gradients that train their weights shrink towards zero. This happens because of the greater depth of the neural network, combined with activation functions whose derivatives take small values.

## Exploding gradient problem
The learning of the neurons in the early layers diverges because the gradients become so large that they cause severe changes in the weights and prevent convergence. This generally happens if the weights are not initialized properly.
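
Both the vanishing and the exploding case come from the same mechanism: during backpropagation the gradient is multiplied by one factor per layer. The sketch below is a deliberately simplified scalar model (one weight and one activation slope per layer, both hypothetical values), not a real network.

```python
def gradient_scale(depth, weight, activation_grad):
    """Factor multiplying the gradient after `depth` layers in a scalar chain,
    where each layer contributes |weight * activation_grad|."""
    scale = 1.0
    for _ in range(depth):
        scale *= abs(weight * activation_grad)
    return scale

# The sigmoid derivative is at most 0.25, so with weights near 1 the gradient shrinks.
vanishing = gradient_scale(depth=20, weight=1.0, activation_grad=0.25)

# Large weights with a unit-slope activation (for example ReLU on its active side) make it grow.
exploding = gradient_scale(depth=20, weight=3.0, activation_grad=1.0)

print(f"vanishing: {vanishing:.3e}")
print(f"exploding: {exploding:.3e}")
```

With 20 layers the first factor is far below any useful step size and the second is in the billions, which is why depth, activation choice, and weight scale all matter.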

##Overcoming the limitations of deep learning
* Minimizing the use of the sigmoid and tanh activation functions
* Using a momentum-based stochastic gradient descent
* Proper initialization of weights and biases, such as Xavier initialization
* Regularization (add regularization loss along with data loss and minimize that)
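
As an illustration of the initialization point in the list above, here is a sketch of the uniform variant of Xavier (Glorot) initialization in plain Python. The function name and the layer sizes are hypothetical; the bound $\sqrt{6/(\text{fan\_in}+\text{fan\_out})}$ is the standard one for this variant.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng=random):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit),
    with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

random.seed(0)  # fixed seed so the sketch is repeatable
weights = xavier_uniform(fan_in=256, fan_out=128)
print(len(weights), len(weights[0]))
```

Scaling the range by the layer sizes keeps the magnitude of activations and gradients roughly stable from layer to layer, which helps against both vanishing and exploding gradients.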

#Reinforcement learning

##Terminologies and conventions
* **Agent**: The program we create so that it can sense the
environment, perform actions, receive feedback, and try to maximize rewards.
* **Environment**: The world where the agent resides. It can be real or simulated.
* **State**: The perception or configuration of the environment that the agent senses.
State spaces can be finite or infinite.
* **Rewards**: Feedback the agent receives after any action it has taken. The goal of
the agent is to maximize the overall reward, that is, the immediate and the future
reward. Rewards are defined in advance. Therefore, they must be created
properly to achieve the goal efficiently.
* **Actions**: Anything that the agent is capable of doing in the given environment.
Action space can be finite or infinite.
* **SAR triple**: (state, action, reward) is referred to as the SAR triple, represented as $(s, a, r)$.
* **Episode**: Represents one complete run of the whole task.
![alt text](https://github.com/lblogan14/reinforcement_learning_with_tensorflow/blob/master/note_images/ch1/rl.PNG?raw=true)

Every task is a sequence of SAR triples. Starting from state $S(t)$, the agent performs action $A(t)$, thereby receives a reward $R(t+1)$, and lands in a new state $S(t+1)$. The current state and action pair thus determines the reward for the next step. Since $S(t)$ and $A(t)$ result in $S(t+1)$, we also have a triple of (current state, action, new state), that is, $[S(t),A(t),S(t+1)]$ or $(s,a,s')$.
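
The interaction loop can be sketched with a toy environment. Everything here is hypothetical and only for illustration: the `Corridor` class, its five positions, the reward of 1 at the goal, and the random action choice.

```python
import random

class Corridor:
    """Hypothetical toy environment: positions 0..4, goal at position 4."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to [0, 4]
        self.state = min(4, max(0, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

random.seed(0)
env = Corridor()
s = env.state
episode = []                      # records of (s, a, r, s')
for _ in range(1000):             # cap on steps so the loop always ends
    a = random.choice((-1, 1))    # the agent acts (here: at random)
    s_next, r, done = env.step(a) # the environment returns reward and new state
    episode.append((s, a, r, s_next))
    s = s_next
    if done:
        break

print(len(episode), episode[-1])
```

Each record holds both views described above: `(s, a, r)` is the SAR triple and `(s, a, s_next)` is the $(s,a,s')$ triple.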

##Optimality criteria
An optimality criterion is a measure of the goodness of fit of the model created over the data. For example, in supervised classification learning algorithms, maximum likelihood
is the optimality criterion. Thus, the optimality criterion differs depending on the problem statement and objective.

The major goal in reinforcement learning is to maximize the future rewards. There are two different optimality criteria:
* **Value function**: To quantify a state on the basis of future probable rewards
* **Policy**: To guide an agent on what action to take in a given state

###Value function for optimality
Both the immediate and the future rewards should be considered. Therefore, a value is assigned to each encountered state that reflects this future information. This is called the **value function**. The notion of **delayed reward** is introduced here to capture how actions taken now lead to potential rewards in the future.

$V(s)$, the value of the state, is defined as the expected value of the rewards to be received in
the future for all the actions taken from this state to subsequent states until the agent reaches
the goal state. Thus, the value function tells how good it is to be in this state. The higher the value, the better the state.

The reward assigned to each $(s, a, s')$ triple is fixed. This is not the case for the value of a state, which is subject to change with every action in the episode and across different episodes.

To save time and computation expense, the knowledge of the current state is stored in $V(s)$:
$$V(s)=\mathrm{E}[\text{all future rewards, discounted}\mid S(t)=s]$$
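
A small sketch can make the expectation concrete. It assumes the usual geometric discounting with a factor `gamma` (0.9 here is an assumed value) and estimates the value of a state by averaging the discounted returns of sampled reward sequences that start in that state. The function names are illustrative.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^k * r_k over a reward sequence, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def estimate_value(reward_sequences, gamma=0.9):
    """Monte Carlo estimate of V(s): the mean discounted return over
    reward sequences observed after visiting state s."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)

# Two hypothetical episodes starting from the same state.
samples = [[0.0, 0.0, 1.0], [0.0, 1.0]]
print(discounted_return(samples[0]))
print(estimate_value(samples))
```

A reward that arrives later contributes less to the return, so the first sequence, which reaches its reward one step later, has the lower return of the two.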

###Policy model for optimality
**Policy** is defined as the model that guides the agent with action selection in different states.
Policy is denoted as $\pi$. $\pi$ is basically the probability of a certain action given a particular state:
$$\pi(a,s)=p(A(t)=a|S(t)=s)$$

The policy map provides the set of probabilities of the different actions given a particular state. Together, the policy and the value function form a solution that lets the
agent navigate according to the policy and the calculated values of the states.
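
A tabular policy can be sketched as a nested dictionary of action probabilities. The states, actions, and probabilities below are made up for illustration.

```python
import random

# Hypothetical policy table: policy[state][action] = probability of that action
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state, rng=random):
    """Draw an action according to the probabilities pi(a, s) for the given state."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choices(actions, weights=probs, k=1)[0]

def greedy_action(policy, state):
    """Return the most probable action in the given state."""
    return max(policy[state], key=policy[state].get)

random.seed(0)
print(sample_action(policy, "s0"))
print(greedy_action(policy, "s0"))
```

Each row of the table is a probability distribution over actions, so its entries must sum to 1.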

##Q-learning approach to reinforcement learning
Q-learning is an attempt to learn the value $Q(s,a)$ of a specific action available to the agent in a
particular state.

Consider a table where the number of rows represents the number of states,
and the number of columns represents the number of actions. This is called a Q-table. The agent needs to learn these values to find which action is the best in a given state.

Steps for Q-learning:
1. Initialize the table of $Q(s,a)$ with uniform values (say, all zeros).
2. Observe the current state, $s$
3. Choose an action, $a$, by epsilon greedy or any other action selection policies, and
take the action
4. As a result, a reward, $r$, is received and a new state, $s'$, is perceived
5. Update the $Q$ value of the $(s,a)$ pair in the table by using the following Bellman
equation:
$$Q(s,a)=r+\gamma\max_{a'}Q(s',a')$$
  * $\gamma$ is the discounting factor
6. Then, set the new state as the current state and repeat the process to
complete one episode, that is, until the terminal state is reached
7. Run multiple episodes to train the agent

To simplify this concept:

The Q-value for a given state, $s$, and action, $a$, is updated to the
sum of the current reward, $r$, and the discounted (by $\gamma$) maximum $Q$ value for the new state among all its actions. The discount factor reduces the weight of future rewards relative to
present rewards. For example, a reward of 100 today is worth more than a reward of 100 in the
future. Thus, the future rewards are discounted. Repeating this update process continuously results in
Q-table values converging to accurate measures of the expected future reward for a given
action in a given state.
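
The steps above can be sketched as a tabular Q-learning loop in plain Python. The environment is a hypothetical deterministic corridor with five states, and the values of `GAMMA` and `EPSILON` are assumptions. The update uses the Bellman equation exactly as written above, without a learning rate, which is adequate here because the toy environment is deterministic.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical toy task)
ACTIONS = (-1, +1)    # move left or right
GAMMA = 0.9           # discount factor (assumed value)
EPSILON = 0.2         # exploration rate for epsilon-greedy (assumed value)

def step(state, action):
    """Deterministic transition: reward 1 on reaching the terminal state."""
    next_state = min(N_STATES - 1, max(0, state + action))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

def choose_action(q, state, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a, v in zip(ACTIONS, q[state]) if v == best])

def train(episodes=200, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]   # step 1: all zeros
    for _ in range(episodes):                               # step 7: many episodes
        s = 0                                               # step 2: observe state
        for _ in range(max_steps):
            a = choose_action(q, s, rng)                    # step 3: pick and take action
            s_next, r, done = step(s, a)                    # step 4: reward and new state
            # step 5: Q(s,a) = r + gamma * max_a' Q(s',a'); the terminal row stays zero
            q[s][ACTIONS.index(a)] = r + GAMMA * max(q[s_next])
            s = s_next                                      # step 6: move on
            if done:
                break
    return q

q_table = train()
for s, row in enumerate(q_table):
    print(s, [round(v, 3) for v in row])
```

After training, the column for moving right should dominate in every non-terminal state, with values that decay by a factor of `GAMMA` for each extra step away from the goal.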

##Asynchronous advantage actor-critic (A3C)
The A3C algorithm was published in June 2016 by the combined team of Google DeepMind
and MILA. It is simpler and has a lighter framework that uses asynchronous gradient
descent to optimize the deep neural network. It is faster and is able to show good
results on a multi-core CPU instead of a GPU. One of A3C's big advantages is that it can
work on continuous as well as discrete action spaces. As a result, it has opened the gateway
for many new challenging problems that have complex state and action spaces.

Basic overview of the algorithm:
* **Asynchronous**: In Deep Q-Network, a neural network is used with the agent
to predict actions. This means there is one agent interacting with one
environment. A3C instead creates multiple copies of the agent-environment pair
to make the agent learn more efficiently. A3C has a global network and multiple
worker agents, where each agent has its own set of network parameters and
interacts with its own copy of the environment simultaneously, without
interacting with another agent's environment. The reason this works better than a
single agent is that the experience of each agent is independent of the experience
of the other agents. Thus, the overall experience from all the worker agents
results in diverse training.
* **Actor-critic**: Actor-critic combines the benefits of both value iteration and policy
iteration. Thus, the network estimates both a value function, $V(s)$, and a
policy, $\pi(s)$, for a given state, $s$. Two separate fully-connected layers
at the top of the function approximator neural network output the value
and the policy of the state, respectively. The agent uses the value, which acts as the
critic, to update the policy, that is, the intelligent actor.
* **Advantage**: Policy gradients use discounted returns to tell the agent whether
an action was good or bad. Replacing those returns with the advantage not only quantifies
how good or bad the action was but also helps in encouraging and
discouraging actions better.
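
The advantage idea can be sketched in a few lines. This assumes the common estimate $A_t = R_t - V(s_t)$, where $R_t$ is the discounted return from step $t$ and $V(s_t)$ is the critic's value estimate; the rewards, critic outputs, and `gamma` below are made-up numbers, and this is not the full A3C algorithm.

```python
def discounted_returns(rewards, bootstrap_value=0.0, gamma=0.99):
    """Discounted return for every step of a rollout, computed backwards.
    bootstrap_value is the critic's estimate for the state after the last step."""
    returns = []
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def advantages(rewards, values, bootstrap_value=0.0, gamma=0.99):
    """Advantage estimate A_t = R_t - V(s_t) for every step of the rollout."""
    returns = discounted_returns(rewards, bootstrap_value, gamma)
    return [g - v for g, v in zip(returns, values)]

rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.9]   # hypothetical critic outputs V(s_t)
adv = advantages(rewards, values, gamma=0.5)
print(adv)
```

A positive advantage means the action turned out better than the critic expected and should be encouraged; a negative one means it turned out worse and should be discouraged.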

#Basic computations in TensorFlow

In [0]:
import tensorflow as tf

In [2]:
tf.zeros(3)

<tf.Tensor 'zeros:0' shape=(3,) dtype=float32>

In [3]:
tf.ones(3)

<tf.Tensor 'ones:0' shape=(3,) dtype=float32>

TensorFlow returns a reference to the tensor and not the value of the tensor.
To get the value, we can use the tensor's `eval()` method or the session's `run()` method
inside a running session:

In [0]:
a = tf.zeros(3)

In [6]:
with tf.Session() as sess:
  sess.run(a)
  print(a.eval())

[0. 0. 0.]


`tf.fill()` and `tf.constant()` methods create a tensor of a certain shape and value:

In [0]:
a = tf.fill((2,2), value=4.)
b = tf.constant(4., shape=(3,3))

In [10]:
with tf.Session() as sess:
  sess.run(a)
  sess.run(b)
  print(a.eval())
  print(b.eval())

[[4. 4.]
 [4. 4.]]
[[4. 4. 4.]
 [4. 4. 4.]
 [4. 4. 4.]]


Random number initializers.
* `tf.random_normal()`: Samples random values from the Normal distribution of
specified mean and standard deviation
* `tf.random_uniform()`: Samples random values from the Uniform distribution
of a specified range

In [0]:
a = tf.random_normal((4,4), mean=0, stddev=1)
b = tf.random_uniform((2,2), minval=-3, maxval=3)

In [13]:
with tf.Session() as sess:
  sess.run(a)
  sess.run(b)
  print(a.eval())
  print()
  print(b.eval())

[[-0.3831449  -1.8989109  -0.60519767  1.0701469 ]
 [-1.3313991   2.2024977  -0.57883346  0.82913184]
 [ 0.08403209  0.08545087 -1.2015274   0.37987992]
 [ 1.303021    0.38369858 -0.2085839  -0.8417597 ]]

[[-0.50290275  1.2167172 ]
 [ 0.61045575 -2.814738  ]]


`tf.Variable()` defines variables in TensorFlow, which are holders for tensors:

In [14]:
# pass the shape as a tuple; in tf.ones(2,2) the second argument is read as the dtype
a = tf.Variable(tf.ones((2,2)))
a

<tf.Variable 'Variable:0' shape=(2, 2) dtype=float32_ref>

Evaluating a variable fails unless it has been explicitly initialized by
running `tf.global_variables_initializer()` within a session:

In [17]:
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print(a.eval())

[[1. 1.]
 [1. 1.]]


`tf.eye()` creates an identity matrix:

In [20]:
identity = tf.eye(4)  # named `identity` to avoid shadowing the built-in id()
with tf.Session() as sess:
  sess.run(identity)
  print(identity.eval())

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]


`tf.diag()` creates a diagonal matrix from the given diagonal values:

In [0]:
a = tf.range(1,5,1)
md = tf.diag(a)
mdn = tf.diag([1,2,5,3,2])

In [22]:
with tf.Session() as sess:
  sess.run(md)
  sess.run(mdn)
  print(md.eval())
  print()
  print(mdn.eval())

[[1 0 0 0]
 [0 2 0 0]
 [0 0 3 0]
 [0 0 0 4]]

[[1 0 0 0 0]
 [0 2 0 0 0]
 [0 0 5 0 0]
 [0 0 0 3 0]
 [0 0 0 0 2]]


`tf.transpose()` transposes the given matrix:

In [0]:
a = tf.ones((2,3))
b = tf.transpose(a)

In [24]:
with tf.Session() as sess:
  sess.run(a)
  sess.run(b)
  print(a.eval())
  print()
  print(b.eval())

[[1. 1. 1.]
 [1. 1. 1.]]

[[1. 1.]
 [1. 1.]
 [1. 1.]]


`tf.matmul()` performs matrix multiplication:

In [0]:
a = tf.ones((3,2))
b = tf.ones((2,4))

c = tf.matmul(a,b)

In [26]:
with tf.Session() as sess:
  sess.run(a)
  sess.run(b)
  sess.run(c)
  
  print('a:', a.eval())
  print()
  print('b:', b.eval())
  print()
  print('c:', c.eval())

a: [[1. 1.]
 [1. 1.]
 [1. 1.]]

b: [[1. 1. 1. 1.]
 [1. 1. 1. 1.]]

c: [[2. 2. 2. 2.]
 [2. 2. 2. 2.]
 [2. 2. 2. 2.]]


`tf.reshape()` reshapes a tensor from one shape to another:

In [0]:
a = tf.ones((2,4))
b = tf.reshape(a, (8,))
# reshape it to a vector of size 8

c = tf.reshape(a, (2,2,2))
# reshape tensor a to shape (2,2,2)

d = tf.reshape(b, (2,2,2))
# reshape tensor b to shape (2,2,2)

In [28]:
with tf.Session() as sess:
  sess.run(a)
  sess.run(b)
  sess.run(c)
  sess.run(d)
  
  print('a:', a.eval())
  print('b:', b.eval())
  print('c:', c.eval())
  print('d:', d.eval())

a: [[1. 1. 1. 1.]
 [1. 1. 1. 1.]]
b: [1. 1. 1. 1. 1. 1. 1. 1.]
c: [[[1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]]]
d: [[[1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]]]


The flow of computation in TensorFlow is represented as a computational graph, which is
an instance of `tf.Graph`. The graph contains tensor and operation objects, and keeps track
of the series of operations and tensors involved.

The default instance of the graph can be fetched by `tf.get_default_graph()`:

In [29]:
tf.get_default_graph()

<tensorflow.python.framework.ops.Graph at 0x7f7781422630>