## Reinforcement learning

Up until now you have worked only with supervised and unsupervised learning. Today we will look at a third machine learning paradigm, called reinforcement learning (RL). It applies when the task requires feedback from the environment, but there is no supervisor to judge each output against a known correct answer. Instead, the learning signal is the reward that the environment gives to the agent.

RL is meant to be a straightforward framing of the problem of learning from interaction to achieve a goal: the agent tries to maximize the reward it collects over the course of interacting with the environment.
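
The "collected reward" is commonly formalised as the discounted return, in which later rewards count less than earlier ones. Below is a minimal sketch, assuming a discount factor `gamma`; the function name `discounted_return` is ours and purely illustrative.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    # Iterate backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

With `gamma` close to 1 the agent values long-term reward almost as much as immediate reward; with `gamma` close to 0 it becomes short-sighted.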

![reinforcement learning](https://www.researchgate.net/profile/Roohollah_Amiri/publication/323867253/figure/fig2/AS:606095550738432@1521515848671/Reinforcement-Learning-Agent-and-Environment.png)

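An RL problem consists of several parts:

- environment - the world the agent interacts with
- agent - the actor within the environment
- state - set of parameters describing the environment at time t
- action - action to be taken by the agent
- reward - feedback given by the environment
- (policy) - the agent's rule for choosing an action in a given state
- (value function) - estimate of the long-term reward expected from a state (or a state-action pair)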
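
The parts listed above can be sketched as a loop between an agent and an environment. The `Corridor` class below is a hypothetical toy environment invented for this illustration (it is not part of Gym), and the agent simply follows a random policy.

```python
import random


class Corridor:
    """Hypothetical toy environment: cells 0..4, the agent starts in cell 0
    and the episode ends when it reaches the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position                      # initial state

    def step(self, action):
        # action 0 moves left, action 1 moves right (walls clamp the position)
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1            # small cost per step, bonus at the goal
        return self.position, reward, done


def run_random_episode(env, max_steps=1000, seed=0):
    """Agent with a purely random policy; returns the total collected reward."""
    rng = random.Random(seed)
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = rng.choice([0, 1])               # the policy: pick an action for this state
        state, reward, done = env.step(action)    # the environment answers with state and reward
        total_reward += reward
        if done:
            break
    return total_reward


if __name__ == "__main__":
    print(run_random_episode(Corridor()))
```

A learning agent differs from this one only in how it picks `action`: it uses the observed rewards to improve its policy over time.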

Unlike (un)supervised learning, RL does not have any data in advance. The agent acts within the environment: at first it explores to acquire knowledge, then it uses that knowledge to exploit the environment for its own gain (it tries to maximize reward).

![mouse and cheese](https://miro.medium.com/max/1448/0*WH2kRYzeDx1zTdzI.png)

The simplest approach is to let an algorithm learn, for each state, which action yields the highest reward (either immediate or long-term). This is called Q-learning; besides its obvious use in simple problems, it serves as a basis for more complex algorithms. The Q-table, which represents this state-action mapping, is explicit: the algorithm has to try most (preferably all) state-action combinations to learn properly. Some states will naturally be visited more often than others (and are likely more important). As long as the state and action spaces are discrete, finite, and small enough, we can use Q-learning. As they grow, however, the table becomes at first impractical and later outright impossible to store and fill (due to memory and computational limitations).
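
To make the Q-table concrete, here is a minimal sketch of tabular Q-learning on a hypothetical five-cell corridor (the environment, the constants and the hyperparameters `alpha`, `gamma` and `epsilon` are assumptions chosen for illustration, not part of this notebook's task).

```python
import random

N_STATES, N_ACTIONS, GOAL = 5, 2, 4   # hypothetical corridor: cells 0..4, actions 0=left, 1=right


def step(state, action):
    """Toy environment dynamics: returns (next_state, reward, done)."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else -0.1), done


def train_q_table(episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # the Q-table: one row per state
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # epsilon-greedy: mostly exploit the table, sometimes explore
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                action = max(range(N_ACTIONS), key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = next_state
    return q


if __name__ == "__main__":
    q_table = train_q_table()
    # Greedy action per non-terminal state, read straight from the table
    greedy = [max(range(N_ACTIONS), key=lambda a: q_table[s][a]) for s in range(GOAL)]
    print(greedy)
```

The table here has only 5 x 2 entries. A task with continuous state variables has no such finite table, which is where function approximation comes in.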

The Q-table, a learned estimate of the long-term reward for each state-action pair, can be replaced by a neural network that approximates it.
Today we will be using a Deep Q-Network (DQN).

DQN has successfully learned to play a range of Atari games, including Breakout.
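
When a network replaces the table, the Q-learning target becomes a regression target for the network output. The following is a sketch in standard notation; the symbols $\theta$ (parameters of the trained network) and $\theta^{-}$ (parameters of a periodically synced target network, which in the simplest variant equal $\theta$) are our assumptions and do not appear in the text above.

```latex
y =
\begin{cases}
  r & \text{if } s' \text{ is terminal} \\
  r + \gamma \max_{a'} Q(s', a'; \theta^{-}) & \text{otherwise}
\end{cases}
\qquad
L(\theta) = \bigl( y - Q(s, a; \theta) \bigr)^{2}
```

Here $s$, $a$, $r$ and $s'$ are the state, action, reward and next state of one stored transition, and $\gamma$ is the discount factor.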

Further discussion (too long to be written out here):

- exploration vs exploitation
- value-based vs policy-based
- model-free vs model-based
- actor-critic
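
As a small taste of the first topic, exploration vs exploitation is often handled with an epsilon-greedy rule whose `epsilon` decays over time. The sketch below is illustrative only: the helper names are ours, and the default values of `eps_dec` and `eps_min` mirror the defaults of the `Agent` class used later in this notebook.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


def decay_epsilon(epsilon, eps_dec=5e-4, eps_min=0.05):
    """Linear decay with a floor, so that some exploration always remains."""
    return max(epsilon - eps_dec, eps_min)


if __name__ == "__main__":
    epsilon, steps = 1.0, 0
    while epsilon > 0.05:
        epsilon = decay_epsilon(epsilon)
        steps += 1
    print(steps)  # number of decay steps until the floor is reached
```

Starting from `epsilon = 1.0` the agent acts fully at random; as `epsilon` shrinks, it relies more and more on what it has learned.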


We will be using OpenAI Gym (with PyTorch) as our playground today. You can install it by entering:
```
pip install gym
pip install box2d-py
```
in your (virtual) environment.

Official documentation is available here: https://gym.openai.com/

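The code in this notebook is written for the classic Gym API (gym versions before 0.26), in which `env.reset()` returns only the observation and `env.step()` returns four values. To check whether everything is installed properly, try running one of the available environments (just ignore the warning):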

In [None]:
import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample()) # take a random action
env.close()

In [3]:
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import gym

The network itself is just a simple fully connected network with two hidden layers.

In [4]:
class DeepQNetwork(nn.Module):
    def __init__(self, lr, input_dims, fc1_dims, fc2_dims, 
            n_actions):
        super(DeepQNetwork, self).__init__()
        self.input_dims = input_dims
        self.fc1_dims = fc1_dims
        self.fc2_dims = fc2_dims
        self.n_actions = n_actions
        self.fc1 = nn.Linear(*self.input_dims, self.fc1_dims)
        self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims)
        self.fc3 = nn.Linear(self.fc2_dims, self.n_actions)

        self.optimizer = optim.Adam(self.parameters(), lr=lr)
        self.loss = nn.MSELoss()
        self.device = T.device('cuda:0' if T.cuda.is_available() else 'cpu')
        self.to(self.device)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        actions = self.fc3(x)

        return actions

In [5]:
class Agent():
    def __init__(self, gamma, epsilon, lr, input_dims, batch_size, n_actions,
            max_mem_size=750000, eps_end=0.05, eps_dec=5e-4):
        self.gamma = gamma              # discount factor
        self.epsilon = epsilon          # current exploration rate
        self.eps_min = eps_end          # floor for the exploration rate
        self.eps_dec = eps_dec          # linear decay per learning step
        self.lr = lr
        self.action_space = [i for i in range(n_actions)]
        self.mem_size = max_mem_size
        self.batch_size = batch_size
        self.mem_cntr = 0
        self.iter_cntr = 0
        self.replace_target = 100       # sync the target network every 100 learning steps

        # Online network (trained) and target network (periodically synced copy).
        # Both need the same architecture so that the weights can be copied over.
        self.Q_eval = DeepQNetwork(lr, n_actions=n_actions, input_dims=input_dims,
                                    fc1_dims=256, fc2_dims=256)
        self.Q_next = DeepQNetwork(lr, n_actions=n_actions, input_dims=input_dims,
                                    fc1_dims=256, fc2_dims=256)
        self.Q_next.load_state_dict(self.Q_eval.state_dict())

        # Replay memory, stored as preallocated arrays
        self.state_memory = np.zeros((self.mem_size, *input_dims), dtype=np.float32)
        self.new_state_memory = np.zeros((self.mem_size, *input_dims), dtype=np.float32)
        self.action_memory = np.zeros(self.mem_size, dtype=np.int32)
        self.reward_memory = np.zeros(self.mem_size, dtype=np.float32)
        # np.bool was removed in NumPy 1.24; the builtin bool is equivalent here
        self.terminal_memory = np.zeros(self.mem_size, dtype=bool)

    def store_transition(self, state, action, reward, state_, terminal):
        # Overwrite the oldest entry once the memory is full
        index = self.mem_cntr % self.mem_size
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        self.reward_memory[index] = reward
        self.action_memory[index] = action
        self.terminal_memory[index] = terminal

        self.mem_cntr += 1

    def choose_action(self, observation):
        # Epsilon-greedy: exploit with probability 1 - epsilon, explore otherwise
        if np.random.random() > self.epsilon:
            state = T.tensor(np.array([observation]), dtype=T.float32).to(self.Q_eval.device)
            with T.no_grad():
                actions = self.Q_eval.forward(state)
            action = T.argmax(actions).item()
        else:
            action = np.random.choice(self.action_space)

        return action

    def learn(self):
        # Wait until there is at least one full batch in memory
        if self.mem_cntr < self.batch_size:
            return

        # Periodically copy the online weights into the target network
        if self.iter_cntr % self.replace_target == 0:
            self.Q_next.load_state_dict(self.Q_eval.state_dict())

        self.Q_eval.optimizer.zero_grad()

        # Sample a random batch from the filled part of the memory
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, self.batch_size, replace=False)
        batch_index = np.arange(self.batch_size, dtype=np.int32)

        state_batch = T.tensor(self.state_memory[batch]).to(self.Q_eval.device)
        new_state_batch = T.tensor(self.new_state_memory[batch]).to(self.Q_eval.device)
        action_batch = self.action_memory[batch]
        reward_batch = T.tensor(self.reward_memory[batch]).to(self.Q_eval.device)
        terminal_batch = T.tensor(self.terminal_memory[batch]).to(self.Q_eval.device)

        # Q(s, a) for the actions that were actually taken
        q_eval = self.Q_eval.forward(state_batch)[batch_index, action_batch]

        # The target is a constant for the optimizer, so no gradient flows through it
        with T.no_grad():
            q_next = self.Q_next.forward(new_state_batch)
            q_next[terminal_batch] = 0.0    # no future reward after a terminal state
            q_target = reward_batch + self.gamma*T.max(q_next, dim=1)[0]

        loss = self.Q_eval.loss(q_eval, q_target)
        loss.backward()
        self.Q_eval.optimizer.step()

        self.iter_cntr += 1
        self.epsilon = self.epsilon - self.eps_dec if self.epsilon > self.eps_min \
                       else self.eps_min

In [None]:
# main training loop
# NOTE: written for the classic Gym API (gym < 0.26), where reset() returns the
# observation only and step() returns (observation, reward, done, info).

env = gym.make('LunarLander-v2')
agent = Agent(gamma=0.99, epsilon=1.0, batch_size=64, n_actions=4, input_dims=[8], lr=0.005)
scores = []
eps_history = []
n_games = 1000
score = 0
for i in range(n_games):
    # Progress report: at this point `score` still holds the previous episode's result
    if i % 10 == 0 and i > 0:
        avg_score = np.mean(scores[max(0, i-10):(i+1)])
        print('episode ', i, ' score ', score, ' average score %.3f' %avg_score, ' epsilon %.3f' %agent.epsilon)
    else:
        print('episode ', i, ' score ', score)
    score = 0
    eps_history.append(agent.epsilon)
    observation = env.reset()
    done = False
    while not done:
        env.render()
        action = agent.choose_action(observation)
        observation_, reward, done, info = env.step(action)
        score += reward
        # Store the transition, then take one learning step on a sampled batch
        agent.store_transition(observation, action, reward, observation_, done)
        agent.learn()
        observation = observation_
    scores.append(score)

# Close the environment once, after all episodes have finished
env.close()
    
    




episode  0  score  0
episode  1  score  -116.13652522049954
episode  2  score  -209.65344247587473
episode  3  score  -153.50514077670348
episode  4  score  -94.7009321839665
episode  5  score  -191.47330947532961
episode  6  score  -15.2539408106704
episode  7  score  -92.0021286631884
