##### Reinforcement Learning and Decision Making: Homework #4

# Q-Learning

## Description

In this homework, you will have the complete reinforcement-learning experience:  training an agent from scratch to solve a simple domain using Q-learning.

The environment you will be applying Q-learning to is called [Taxi](https://github.com/openai/gym/blob/master/gym/envs/toy_text/taxi.py) (Taxi-v3).  The Taxi problem was introduced by [Dietterich 1998](https://www.jair.org/index.php/jair/article/download/10266/24463) and has been used for reinforcement-learning research in the past.  It is a grid-based environment where the goal of the agent is to pick up a passenger at one location and drop them off at another.

The map is fixed and the environment has deterministic transitions.  However, the distinct pickup and drop-off points are chosen randomly from 4 fixed locations in the grid, each assigned a different letter.  The starting location of the taxicab is also chosen randomly.
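
For orientation, the integer state indices that appear in the test cases can be unpacked into their components. The sketch below assumes the encoding used in Gym's `taxi.py`, where a state index packs the taxi row, taxi column, passenger location (0 to 3 for the four lettered locations, 4 for inside the taxi), and destination index. The helper names `decode_taxi_state` and `encode_taxi_state` are hypothetical and not part of the assignment template.

```python
def decode_taxi_state(state):
    """Unpack a Taxi state index into (taxi_row, taxi_col, passenger_loc, destination).

    Assumes the layout ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination.
    """
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


def encode_taxi_state(taxi_row, taxi_col, passenger_loc, destination):
    """Inverse of decode_taxi_state under the same assumed layout."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


print(decode_taxi_state(462))  # (4, 3, 0, 2)
```

Under this layout there are 5 × 5 × 5 × 4 = 500 state indices, which is the number of rows a Q-table for this environment needs.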

The agent has 6 actions: 4 for movement, 1 for pickup, and 1 for drop-off.  Attempting a pickup when there is no passenger at the location incurs a reward of −10.  Dropping off a passenger outside one of the four designated zones is prohibited, and attempting it also incurs a reward of −10.  Dropping the passenger off at the correct destination provides the agent with a reward of 20.  Otherwise, the agent incurs a reward of −1 per time step.

Your job is to train your agent until it converges to the optimal state-action value function.  You will have to think carefully about algorithm implementation, especially exploration parameters.

## Q-learning

Q-learning is a fundamental reinforcement-learning algorithm that has been successfully used to solve a variety of decision-making problems.  Like Sarsa, it is a model-free method based on temporal-difference learning.  However, unlike Sarsa, Q-learning is *off-policy*, which means the policy it learns about can differ from the policy it uses to generate its behavior.  In Q-learning, this *target* policy is the greedy policy with respect to the current value-function estimate.
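
The off-policy character shows up in the update target, which bootstraps from the greedy action in the next state regardless of which action the behavior policy actually takes next. A minimal sketch of one such update on a NumPy table follows; the function name `q_learning_update`, the toy table, and the step size `alpha` are illustrative assumptions, not part of the assignment template.

```python
import numpy as np


def q_learning_update(Q, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # The target policy is greedy: bootstrap from the best action in next_state.
    td_target = reward + gamma * np.max(Q[next_state])
    td_error = td_target - Q[state, action]
    Q[state, action] += alpha * td_error
    return td_error


# Toy table with 3 states and 2 actions (values chosen only for illustration).
Q = np.zeros((3, 2))
Q[1] = [2.0, 5.0]
q_learning_update(Q, state=0, action=1, reward=-1.0, next_state=1, alpha=0.5, gamma=0.9)
print(Q[0, 1])  # 0.5 * (-1 + 0.9 * 5) = 1.75
```

Sarsa would instead replace `np.max(Q[next_state])` with the value of the action actually selected in `next_state`, which is what makes it on-policy.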

## Procedure

- You should return the optimal *Q-value* for a specific state-action pair of the Taxi environment.

- To solve this problem, implement the Q-learning algorithm and use it to solve the Taxi environment. The agent should explore the MDP and collect data to learn an optimal policy and the optimal Q-value function.  Be mindful of how you handle terminal states: if $S_t$ is a terminal state, then $V(S_t)$ should always be 0.  Use $\gamma = 0.90$; this is important, as the optimal value function depends on the discount rate.  Also, note that an $\epsilon$-greedy strategy can find an optimal policy despite finding sub-optimal Q-values.  As we are looking for optimal Q-values, you will have to carefully consider your exploration strategy.
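
To see how the terminal-state rule and the discount rate together pin down the optimal Q-values, consider a state from which the optimal policy needs `k` ordinary steps (reward −1 each) before a successful drop-off (reward 20). Because the value after the drop-off is 0, the optimal Q-value of the first action on that path follows by backing up from the end. The sketch below is illustrative only; the function name `optimal_q_on_shortest_path` is hypothetical.

```python
def optimal_q_on_shortest_path(k, gamma=0.9, step_reward=-1.0, dropoff_reward=20.0):
    """Optimal Q-value of the first action when k ordinary steps precede the drop-off."""
    q = dropoff_reward  # the drop-off leads to a terminal state, whose value is 0
    for _ in range(k):
        q = step_reward + gamma * q  # one Bellman backup per earlier step
    return q


for k in range(5):
    print(k, round(optimal_q_on_shortest_path(k), 4))
# k = 0..4 gives 20.0, 17.0, 14.3, 11.87, 9.683
```

Changing `gamma` changes every one of these values, which is why the discount rate must match the one specified here.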

## Resources

The concepts explored in this homework are covered by:

-   Lesson 4: Convergence

-   Lesson 7: Exploring Exploration

-   Chapter 6 (6.5 Q-learning: Off-policy TD Control) of Sutton and Barto, [Reinforcement Learning: An Introduction, 2nd edition](http://incompleteideas.net/book/the-book-2nd.html)

-   Chapter 2 (2.6.1 Q-learning) of 'Algorithms for Sequential Decision Making', M.
    Littman, 1996

## Submission

-   The due date is indicated on the Canvas page for this assignment.
    Make sure you have set your timezone in Canvas to ensure the deadline is accurate.

-   Submit your finished notebook on Gradescope. Your grade is based on
    a set of hidden test cases. You will have unlimited submissions;
    only the last score is kept.

-   Use the template below to implement your code. We have also provided
    some test cases for you. If your code passes the given test cases,
    it will run (though possibly not pass all the tests) on Gradescope.
    Be cognisant of performance.  If the autograder fails because of memory
    or runtime issues, you need to refactor your solution.

-   Gradescope is using Python 3.6.x. For permitted libraries, please see
    the requirements.txt file. You can also use any core library
    (i.e., anything in the Python standard library).
    No other library can be used.  Also, make sure the name of your
    notebook matches the name of the provided notebook.  Gradescope times
    out after 10 minutes.


In [48]:
import numpy as np
import gym


class QLearningAgent(object):
    def __init__(self):
        """
        Initialize your Q table and hyperparameters
        """
        env = gym.make('Taxi-v3')
        seed = 42
        env.seed(seed)
        np.random.seed(seed)

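        # Q-table: one row per state, one column per action, initialized to zero
        self.Q = np.zeros([env.observation_space.n, env.action_space.n])
        # Strip the outermost wrapper returned by gym.make
        self.env = env.env
        self.gamma = 0.9  # discount factor required by the assignment
        self.epsilon = 0.9  # initial exploration rate
        self.epsilon_decay = 0.00001  # multiplicative decay applied once per episode
        self.alpha = 1.0  # learning rate; transitions are deterministic
        self.num_episodes = 100000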

    def solve(self):
        """
        Implement the Q learning algorithm
        """
        # loop over episodes
        for i in range(self.num_episodes):
            # reset the environment
            state = self.env.reset()
            done = False
            # loop over this episode until it is done
            while not done:
                # select an action using epsilon greedy policy
                action = self.get_epsilon_greedy_action(state, self.Q, self.epsilon, self.env)
                # take a step in the environment
                next_state, reward, done, _ = self.env.step(action)
                # select the next action using greedy policy (no exploration, need this for Bellman)
                next_state_max_action = self.get_greedy_action(next_state, self.Q)
                # update the Q table - Bellman equation
                update = self.alpha * (reward + self.gamma * self.Q[next_state, next_state_max_action] - self.Q[state, action])
                self.Q[state, action] += update
                state = next_state
            
            # decay epsilon 
            if self.epsilon > 0.075:
                self.epsilon = self.epsilon * (1 - self.epsilon_decay)
        self.env.close()


    def Q_table(self, state, action):
        """
        return the optimal value for State-Action pair in the Q Table
        """
        return self.Q[state][action]
    

    def get_epsilon_greedy_action(self, state, Q, epsilon, env):
        """
        return the action using epsilon greedy policy
        """
        if np.random.rand() < epsilon:
            return np.random.randint(env.action_space.n)
        return np.argmax(Q[state, :])


    def get_greedy_action(self, state, Q):
        """
        return the greedy action for the given state
        """
        return np.argmax(Q[state, :])
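
When tuning exploration, it helps to check what a decay schedule actually does over the planned number of episodes. The sketch below replays the multiplicative per-episode decay used in `solve` above (start 0.9, decay 0.00001, floor check at 0.075) outside the training loop; the helper name `epsilon_after` is hypothetical.

```python
def epsilon_after(num_episodes, epsilon=0.9, decay=0.00001, floor=0.075):
    """Replay a multiplicative epsilon decay that stops once epsilon reaches the floor."""
    for _ in range(num_episodes):
        if epsilon > floor:
            epsilon *= 1 - decay
    return epsilon


print(round(epsilon_after(100000), 3))  # 0.331
```

Under these settings epsilon is still about 0.33 after 100,000 episodes, so the 0.075 floor is never reached and the agent keeps exploring heavily throughout training.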

## 2. Test cases

In [49]:

## DO NOT MODIFY THIS CODE.  This code will ensure that your submission is correct 
## and will work properly with the autograder

import unittest


class TestQNotebook(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.agent = QLearningAgent()
        cls.agent.solve()
        
    def test_case_1(self):
        np.testing.assert_almost_equal(
            self.agent.Q_table(462, 4),
            -11.374402515,
            decimal=3
        )
        
    def test_case_2(self):
        np.testing.assert_almost_equal(
            self.agent.Q_table(398, 3),
            4.348907,
            decimal=3
        )
    
    def test_case_3(self):
        np.testing.assert_almost_equal(
            self.agent.Q_table(253, 0),
            -0.5856821173,
            decimal=3
        )

    def test_case_4(self):
        np.testing.assert_almost_equal(
            self.agent.Q_table(377, 1),
            9.683,
            decimal=3
        )

    def test_case_5(self):
        np.testing.assert_almost_equal(
            self.agent.Q_table(83, 5),
            -13.9968,
            decimal=3
        )

unittest.main(argv=[''], verbosity=2, exit=False)

test_case_1 (__main__.TestQNotebook.test_case_1) ... ok
test_case_2 (__main__.TestQNotebook.test_case_2) ... ok
test_case_3 (__main__.TestQNotebook.test_case_3) ... ok
test_case_4 (__main__.TestQNotebook.test_case_4) ... ok
test_case_5 (__main__.TestQNotebook.test_case_5) ... ok

----------------------------------------------------------------------
Ran 5 tests in 49.913s

OK


<unittest.main.TestProgram at 0x1c0d285dad0>