# Reinforcement Learning
This notebook is my adaptation of <a href="https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0">this wonderful tutorial by Arthur Juliani</a>, which I am using to learn reinforcement learning.

## Part 0 - Q-Learning with Tables and Neural Networks
Q-Learning attempts to learn the value of being in a given state, and taking a specific action there.

### The game: FrozenLake
For this tutorial we will try to solve the FrozenLake environment from OpenAI Gym.

The FrozenLake environment consists of a 4x4 grid of blocks, each one being either the start block, the goal block, a safe frozen block, or a dangerous hole. The objective is to have an agent learn to navigate from the start to the goal without moving onto a hole. At any given time the agent can choose to move up, down, left, or right. The catch is that there is a wind which occasionally blows the agent onto a space it didn’t choose. As such, perfect performance every time is impossible, but learning to avoid the holes and reach the goal is certainly still doable.
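
To make the grid concrete, here is a small sketch of the default 4x4 layout as I understand it from the gym source (`S` = start, `F` = frozen, `H` = hole, `G` = goal). The helper `state_index` is my own illustrative name, not part of gym; it assumes that states are numbered row by row from the top-left corner.

```python
# Default 4x4 FrozenLake layout (assumed from the gym source):
# S = start, F = frozen (safe), H = hole, G = goal.
FROZEN_LAKE_4X4 = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]

def state_index(row, col, n_cols=4):
    """Hypothetical helper: map a (row, col) grid position to a state number."""
    return row * n_cols + col

# Collect the state numbers of the interesting blocks.
holes = [
    state_index(r, c)
    for r, line in enumerate(FROZEN_LAKE_4X4)
    for c, block in enumerate(line)
    if block == "H"
]
start = state_index(0, 0)
goal = state_index(3, 3)
```

With this numbering the start is state 0, the goal is state 15, and the agent has to avoid four hole states on the way.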

The reward at every step is 0, except for entering the goal, which provides a reward of 1. Thus, we will need an algorithm that learns long-term expected rewards. This is exactly what Q-Learning is designed to provide.

### Q-Learning
In its simplest implementation, Q-Learning is a table of values for every state (row) and action (column) possible in the environment. Within each cell of the table, we learn a value for how good it is to take a given action within a given state.

In the case of the FrozenLake environment, we have 16 possible states (one for each block), and 4 possible actions (the four directions of movement), giving us a 16x4 table of Q-values. We start by initializing the table to be uniform (all zeros), and then as we observe the rewards we obtain for various actions, we update the table accordingly.
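
As a rough sketch of what such a table looks like, the snippet below builds a 16x4 table of zeros with plain Python lists (standing in for the NumPy array used later in the notebook). The action numbering (0 = left, 1 = down, 2 = right, 3 = up) is what I believe gym uses for FrozenLake; `ACTION_NAMES` and `best_action` are my own illustrative names.

```python
N_STATES = 16   # one state per block of the 4x4 grid
N_ACTIONS = 4   # the four directions of movement

# Assumed action numbering for FrozenLake.
ACTION_NAMES = ["left", "down", "right", "up"]

# One row per state, one column per action, all initialised to zero.
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def best_action(q_table, state):
    """Return the index of the highest-valued action in the given state."""
    row = q_table[state]
    return row.index(max(row))

# Pretend we have learned that moving right from state 14 is good.
Q[14][2] = 1.0
chosen = ACTION_NAMES[best_action(Q, 14)]
```

Reading a policy out of the table is then just a matter of taking the best column in each row.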

#### The update rule: Bellman equation
We make updates to our Q-table using something called the **Bellman equation**, which states that the expected long-term reward for a given action is equal to the immediate reward from the current action combined with the expected reward from the best future action taken at the following state.

$$
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
$$

where 

- $s$ is the current state, while $s'$ is the next state
- $a$ is the action taken in the current state, while $a'$ is an action available in the next state
- $r$ is the current reward
- $\gamma$ is a discount factor

This says that the Q-value for a given state and action should represent the current reward plus the maximum discounted future reward expected, according to our own table, for the next state we would end up in. The discount factor $\gamma$ lets us decide how important possible future rewards are compared to the present reward.
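
To see the update in numbers, here is a minimal worked example with a hand-filled table. The transitions are hypothetical (I simply assume that moving right from state 13 lands in state 14), and `bellman_target` is my own helper name; $\gamma = 0.95$ matches the value used in the training loop.

```python
gamma = 0.95

def bellman_target(reward, next_q_values, gamma):
    """Right-hand side of the update: r + gamma * max over a' of Q(s', a')."""
    return reward + gamma * max(next_q_values)

# 16 states x 4 actions, all zeros to begin with.
Q = [[0.0] * 4 for _ in range(16)]

# Suppose we already learned that action 2 in state 14 is worth 1.
Q[14][2] = 1.0

# Hypothetical transition: in state 13 we take action 2, receive no reward,
# and end up in state 14.
s, a, r, s_next = 13, 2, 0.0, 14
Q[s][a] = bellman_target(r, Q[s_next], gamma)
```

After this single update `Q[13][2]` holds 0.95: the goal reward, discounted once because it is one step further away. Repeating the update pushes the value back towards the start, shrinking by a factor of $\gamma$ per step.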

In [2]:
import gym
import numpy as np

In [3]:
env = gym.make("FrozenLake-v0")

In [7]:
Q = np.zeros((env.observation_space.n, env.action_space.n))

In [None]:
gamma = 0.95
num_episodes = 2000
limit_per_episode = 99

for i in range(num_episodes):
    s = env.reset()
    j = 0  # step counter for this episode

    while j < limit_per_episode:
        j += 1
        # Choose an action greedily from the Q-table, with added Gaussian noise.
        # The noise shrinks with each episode, so we explore less as we gain confidence.
        a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n)/(i + 1))

        # Perform the action and observe the new state and reward
        new_s, reward, done, _ = env.step(a)

        # Update the Q-table using the Bellman equation
        Q[s, a] = reward + gamma*np.max(Q[new_s, :])
        # Move to the new state before choosing the next action
        s = new_s
        # The episode ends when we reach the goal or fall into a hole
        if done:
            break

In [14]:
env.compu

AttributeError: 'TimeLimit' object has no attribute 'P'