# Reinforcement-Learning Reference
Reference implementation for the Reinforcement Learning assignment.

### Source:
These are the files required to build your reinforcement learning algorithm. 

- [common.py](common.py) with constants
- [util.py](util.py) with util functions
- [game.py](game.py) with drawing calls
- [environment.py](environment.py) contains the scenario behavior
- [agent.py](agent.py) contains training components, such as environment interaction and previous state

### Assignment
The goal of this assignment is to implement the core of the Q-Learning algorithm. You will be responsible for implementing three distinct methods:
- The exploration function (**f()** method)
- The Q-Learning update (**get_action()** method)
- A decreasing function for the learning rate (**alpha()** method)

In this scenario, reinforcement learning helps the agent navigate a map and maximise its rewards while moving from an initial state to the goal state, represented by a treasure chest. In some scenarios there is also a rupee that the agent can collect. The rewards are +50 for reaching the chest, +40 for collecting the rupee, and -1 for any other tile.
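
To see why collecting the rupee can be worth a longer path, here is a minimal sketch that compares the discounted return of two hypothetical routes. The reward values come from the text above and `gamma=0.9` mirrors `self.gamma` in the base code. The tile names, the `discounted_return` helper, and both routes are assumptions made for illustration only, as is the idea that each visited tile contributes exactly one reward.

```python
# Reward values from the assignment text; the tile names are hypothetical.
REWARDS = {"chest": 50, "rupee": 40, "other": -1}


def discounted_return(tiles, gamma=0.9):
    """Sum the rewards of the visited tiles, discounting step t by gamma**t."""
    return sum((gamma ** t) * REWARDS[tile] for t, tile in enumerate(tiles))


# Two illustrative routes: straight to the chest, or a detour via the rupee.
direct = discounted_return(["other", "other", "chest"])
detour = discounted_return(["other", "rupee", "other", "other", "chest"])
print(round(direct, 3), round(detour, 3))
```

Under these assumptions the direct route is worth about 38.6 and the detour about 66.3, so a well-trained agent should prefer to pick up the rupee first.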

This assignment is not graded, so no tests are provided.

### Execution
You can complete this assignment entirely in this Jupyter Notebook or in two distinct Python files. If you want to program and test outside of Jupyter, follow the instructions below. Note that there is currently a problem between pygame (the package that displays the agent moving in the environment) and Jupyter: the Jupyter kernel crashes after the pygame window is closed.

To test your code and find the episode at which it converges, run the environment.py file:
```
python environment.py [Map]
```

To check the converged solution of your algorithm, you can run the GUI to see the agent executing the learned policy in each map.
```
python game.py [Map]
```

### Implementation

In this file, we provide the basic architecture to build your Q-Learning algorithm. Additionally, you can implement your algorithm in the [link_ref.py](link_ref.py) file, if you wish to work outside of Jupyter Notebook. 



## Exploration function
Implement the optimistic estimate function described by the following equation.
$$
f(u,n) = \begin{cases}
R^{+} & \text{if } n < N_{e} \\
u & \text{otherwise}
\end{cases}
$$
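Here *u* is the current utility estimate, *n* is the number of times the state-action pair has been tried, $N_{e}$ is the exploration threshold, and $R^{+}$ is an optimistic estimate of the best possible reward. Note that the Python method does not receive the utility *u* directly as a parameter. Instead it receives **qv**, which is a Q-Value (state-action pair).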
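
A minimal sketch of the equation as a standalone function, kept separate from the `Link` class so it can be run on its own. The name `exploration_value` is hypothetical. The defaults `r_plus=50` and `n_e=1` mirror `self.r_plus` and `self.exploration` in the base code, on the assumption that `self.exploration` plays the role of $N_{e}$.

```python
def exploration_value(u, n, r_plus=50, n_e=1):
    """Return r_plus while the pair has been tried fewer than n_e times, else u."""
    if n < n_e:
        return r_plus  # optimistic estimate: makes untried actions look attractive
    return u           # enough visits: trust the learned utility


print(exploration_value(u=3.0, n=0))  # pair never tried
print(exploration_value(u=3.0, n=1))  # pair tried once
```

Inside the agent, *u* and *n* would presumably be looked up from `self.q_values` and `self.frequency` using **qv** as the key, but that mapping is an assumption about how you choose to store them.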



In [None]:
# Receives a Q-value and returns a utility
def f(self, qv):
    return self.r_plus



## Q-Learning
Implement the Q-Learning algorithm. Consider the learning rate as a fixed value for now.
<img src="https://user-images.githubusercontent.com/4201145/45648565-4d0c0580-ba9f-11e8-82fd-1a4f127c1959.png" width="70%" height="70%"/>
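
A minimal sketch of the standard tabular Q-Learning update, which moves `Q(s, a)` towards `r + gamma * max Q(s', a')` by a fraction `alpha`. It uses a plain dictionary as the Q-table; the figure above may present the algorithm in a different form. The helper `q_update` and the string state and action names are hypothetical and not part of the provided files. The defaults mirror `self.gamma` and the fixed learning rate of 0.9 in the base code.

```python
def q_update(q, key, reward, next_max, alpha=0.9, gamma=0.9):
    """Apply one temporal-difference update to q[key] and return the new value."""
    old = q.get(key, 0.0)  # unseen pairs start at 0
    q[key] = old + alpha * (reward + gamma * next_max - old)
    return q[key]


q = {}
# First visit: nothing is known about the next state yet.
first = q_update(q, ("s0", "right"), reward=-1, next_max=0.0)
# Later visit: the next state is now known to lead to the chest.
second = q_update(q, ("s0", "right"), reward=-1, next_max=50.0)
print(first, second)
```

In the `Link` class, `next_max` would presumably come from `max_a(state)` and the learning rate from `alpha(qv)`, with the previous state, action, and reward kept in `p_state`, `p_action`, and `p_reward`.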

In [None]:
def get_action(self, state):
    # Placeholder: returns a random available action. Remove this when you implement Q-Learning.
    import random
    return random.choice(self.env.available_actions((state.x, state.y)))

## Learning rate
Implement a decreasing function for the learning rate, replacing the fixed value returned below.

In [None]:
def alpha(self,qv):
    # Implement here a more sophisticated learning rate
    return 0.9
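
One common choice for a decreasing learning rate is a schedule that decays roughly as 1/n, where n is the number of times a state-action pair has been visited. The sketch below is only one possible option, not the expected solution. The name `decaying_alpha` and the constant `c=60` are assumptions made for illustration.

```python
def decaying_alpha(n, c=60):
    """Learning rate that equals 1.0 at n = 1 and decays roughly as 1/n."""
    n = max(n, 1)  # treat unvisited pairs like a first visit
    return c / (c - 1 + n)


for visits in (1, 10, 61, 1000):
    print(visits, round(decaying_alpha(visits), 3))
```

A larger `c` keeps the learning rate high for longer, while a smaller `c` makes it drop sooner. In the agent, `n` would presumably be read from `self.frequency` for the given **qv**.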

# Base code

In [None]:
#!/usr/bin/env python
# Four spaces as indentation [no tabs]
# Standard Q-Learning implementation.
import math, copy, random, logging
import qvalue  # max_a and argmax_a refer to qvalue.QValue by module name
from qvalue import *
from common import *
from util import *
from agent import *

class Link(Agent):

    def __init__(self):

        Agent.__init__(self)
        self.q_values = dict()
        self.frequency = dict()
        self.state = None
        self.reward = None
        self.action = NO_OP
        self.p_state = None 
        self.p_reward = None
        self.p_action = None
        self.gamma = 0.9
        self.r_plus = 50
        self.exploration = 1
        self.env = None
        self.prev_qtable = dict()

    def reset(self, env):
        """
        Reset the state to the initial environment state
        """
        self.state = env.init

    def train(self, env):
        """
        Run up to MAX_TRAINING_EPISODES episodes, or stop earlier on convergence.
        """
        print('It will converge at', CONVERGENCE_THRESHOLD)

        self.reset(env)
        self.env = env

        executions = 0
        last_plan = []
        while executions < MAX_TRAINING_EPISODES:
            self.state = self.make_state(env)
            action = self.get_action(self.state)
            last_plan.append(action)
            self.env.execute(action)
            if env.terminal((self.state.x, self.state.y)):
                executions += 1
                
                self.p_state = self.p_action = self.p_reward = self.state = self.action = self.reward = None
                self.reset(env)

                if self.converged():
                    break
                else:
                    last_plan = []
                    self.prev_qtable = copy.deepcopy(self.q_values)

                #print('Episode', executions, ': convergence %', self.convergence)

        print('Episode' , executions, ' : converged at', self.convergence)
        print('Last plan executed: ', [ACTIONS_NAMES[x] for x in last_plan])
        #self.return_policy()

    def alpha(self, qv):
        """
        Learning rate. Delegates to the notebook-level alpha(), which currently returns 0.9 because it converges pretty fast.
        """
        return alpha(self,qv)

    def f(self, qv):
        """
        Exploration function. Use the optimistic reward r_plus if the Q-value has not been explored yet.
        """
        return f(self,qv)

    def get_action(self, state):
        return get_action(self,state)   

    def max_a(self, state):
        """
        Standard max action implementation java style.
        Return the max value you can obtain in a certain state.
        """
        max_value = float('-inf')
        if self.env.terminal((state.x, state.y)):
            max_value = self.q_values[qvalue.QValue(state, NO_OP)]
        else:
            for action in self.env.available_actions((state.x, state.y)):
                qv = qvalue.QValue(state, action)
                if qv in self.q_values:
                    q_sa = self.q_values[qv]
                    if q_sa > max_value:
                        max_value = q_sa
        if max_value == float('-inf'): 
            max_value = 0.0
        return max_value

    def argmax_a(self, state):
        """
        Standard argmax action implementation java style.
        Return the best action you can perform in a certain state.
        """
        a = NO_OP
        max_value = float('-inf')
        if state is None:
            return a
        for action in self.env.available_actions((state.x, state.y)):
            qv = qvalue.QValue(state, action)
            fvalue = self.f(qv)
            if fvalue > max_value:
                max_value = fvalue
                a = action
        return a

    def make_state(self, env):
        """
        Build state using position and rupees.
        """
        return State(env.state[0], env.state[1], env.rupees)

    def return_qvalue(self, qvalue):
        if qvalue in self.q_values:
            return self.q_values[qvalue]
        return 0

    def converged(self):
        """
        Return True if the change between the previous Q-table and the current Q-table
        is smaller than CONVERGENCE_THRESHOLD.
        """
        self.convergence = self.convergence_metric()
        return self.convergence < CONVERGENCE_THRESHOLD

    def run(self, env):
        """
        Execute actions.
        """
        self.action = self.argmax_a(self.make_state(env))
        # print("Running action:", ACTIONS_NAMES[self.action])
        self.state, self.reward = env.execute(self.action)
        return self.action, self.state


    def convergence_metric(self):
        """
        Return the convergence metric.
        """
        prev = sum(self.prev_qtable.values())
        curr = sum(self.q_values.values())
        return math.sqrt(abs(curr - prev))




# Train Agent
In this cell, we train the agent. You can change the map and add any code you like here.

In [None]:
try:
    from game import *
    pg = True
except ImportError:
    pg = False
    
import time  # used for the timing below; imported explicitly instead of relying on a star import
from environment import *

logger = logging.getLogger()
sx, sy, map_data, map_width, map_height = read_map("maps/medium.txt")

agt = Link()

env = Environment(sx, sy, map_data, map_width, map_height)

start_time = time.time()
agt.train(env)
elapsed_time = time.time() - start_time
print('It took', elapsed_time,'seconds to train.' )
if pg:
    # Comment out the next line if you do not want to use the UI
    Game(env, agt)
    pass