##### Reinforcement Learning, Homework #4

# Q-Learning

## Description

In this homework, you will have the complete reinforcement-learning experience:  training an agent from scratch to solve a simple domain using Q-learning.

The environment you will be applying Q-learning to is called [Taxi](https://github.com/openai/gym/blob/master/gym/envs/toy_text/taxi.py) (Taxi-v3).  The Taxi problem was introduced by [Dietterich 1998](https://www.jair.org/index.php/jair/article/download/10266/24463) and has been used for reinforcement-learning research in the past.  It is a grid-based environment where the goal of the agent is to pick up a passenger at one location and drop them off at another.

The map is fixed and the environment has deterministic transitions.  However, the distinct pickup and drop-off points are chosen randomly from 4 fixed locations in the grid, each assigned a different letter.  The starting location of the taxicab is also chosen randomly.
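
Because the test cases refer to states by integer index, it can help to see what an index stands for. The sketch below assumes the encoding used by gym's Taxi-v3 implementation (a 5 x 5 grid, five passenger positions including "in the taxi", and four destinations, for 500 states in total). `decode_state` and `encode_state` are hypothetical helper names, not part of the template.

```python
def decode_state(index):
    """Split a Taxi-v3 state index into (taxi_row, taxi_col, passenger, destination).

    Assumed layout, mirroring gym's Taxi-v3 encoding:
    index = ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination
    """
    index, destination = divmod(index, 4)  # 4 possible destinations
    index, passenger = divmod(index, 5)    # 4 lettered locations, or 4 = in the taxi
    taxi_row, taxi_col = divmod(index, 5)  # 5 x 5 grid
    return taxi_row, taxi_col, passenger, destination


def encode_state(taxi_row, taxi_col, passenger, destination):
    """Inverse of decode_state under the same assumed layout."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination


print(decode_state(462))  # (4, 3, 0, 2)
```

Under this assumption, state 462 places the taxi at row 4, column 3, with the passenger waiting at location 0 and headed for location 2.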

The agent has 6 actions: 4 for movement, 1 for pickup, and 1 for drop-off.  Attempting a pickup when there is no passenger at the taxi's location incurs a reward of −10.  Dropping off a passenger outside one of the four designated zones is prohibited, and attempting it also incurs a reward of −10.  Dropping the passenger off at the correct destination gives the agent a reward of 20.  Otherwise, the agent incurs a reward of −1 per time step.
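
The reward structure above can be summarized in a few lines. This is only an illustrative sketch: the action ordering is assumed to match gym's Taxi-v3 (0 to 3 for movement, 4 for pickup, 5 for drop-off), and `reward_for` with its `legal` and `delivered` flags is a hypothetical helper rather than anything the environment exposes.

```python
from enum import IntEnum


class Action(IntEnum):
    """Action indices, assumed to follow gym's Taxi-v3 ordering."""
    SOUTH = 0
    NORTH = 1
    EAST = 2
    WEST = 3
    PICKUP = 4
    DROPOFF = 5


STEP_REWARD = -1       # every ordinary time step
ILLEGAL_REWARD = -10   # illegal pickup or drop-off attempt
DELIVERY_REWARD = 20   # passenger dropped at the correct destination


def reward_for(action, legal=True, delivered=False):
    """Return the reward described in the text for one action (hypothetical helper)."""
    if action == Action.DROPOFF and delivered:
        return DELIVERY_REWARD
    if action in (Action.PICKUP, Action.DROPOFF) and not legal:
        return ILLEGAL_REWARD
    return STEP_REWARD


print(reward_for(Action.PICKUP, legal=False))
```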

Your job is to train your agent until it converges to the optimal state-action value function.  You will have to think carefully about algorithm implementation, especially exploration parameters.

## Q-learning

Q-learning is a fundamental reinforcement-learning algorithm that has been successfully used to solve a variety of decision-making problems. Like Sarsa, it is a model-free method based on temporal-difference learning. Unlike Sarsa, however, Q-learning is *off-policy*: the policy it learns about can differ from the policy it uses to generate its behavior.  In Q-learning, this *target* policy is the greedy policy with respect to the current value-function estimate.
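
The off-policy distinction shows up in the bootstrapped target. The following sketch contrasts the two targets on made-up numbers; the function names and the example values are illustrative assumptions only.

```python
def q_learning_target(reward, next_q_values, gamma):
    """Off-policy target: bootstrap from the greedy (maximum) next action value."""
    return reward + gamma * max(next_q_values)


def sarsa_target(reward, next_q_values, next_action, gamma):
    """On-policy target: bootstrap from the action the behavior policy actually took."""
    return reward + gamma * next_q_values[next_action]


# Made-up Q-values for the next state; the behavior policy explores with action 2.
next_q = [1.0, 4.0, 2.0]
exploratory_action = 2

print(q_learning_target(-1, next_q, 0.9))
print(sarsa_target(-1, next_q, exploratory_action, 0.9))
```

When the behavior policy happens to pick the greedy action, the two targets coincide. They differ only on exploratory steps, which is why Q-learning can learn about the greedy policy while still exploring.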

## Procedure

- You should return the optimal *Q-value* for a specific state-action pair of the Taxi environment.

- To solve this problem, implement the Q-learning algorithm and use it to solve the Taxi environment. The agent should explore the MDP and collect data to learn both an optimal policy and the optimal Q-value function.  Be mindful of how you handle terminal states: if $S_t$ is a terminal state, then $V(S_t)$ should always be 0.  Use $\gamma = 0.90$. This is important, as the optimal value function depends on the discount rate.  Also note that an $\epsilon$-greedy strategy can find an optimal policy even when its Q-values are sub-optimal.  Because we are looking for optimal Q-values, you will have to consider your exploration strategy carefully.
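
To see why the discount rate matters, consider the return of a trajectory whose $n$-th and final action is a successful drop-off, with every earlier action costing −1 and the terminal state contributing a value of 0: $G = -\sum_{k=0}^{n-2} \gamma^k + 20\,\gamma^{n-1}$. The sketch below evaluates this sum; `delivery_return` is a hypothetical helper, and it assumes no illegal actions along the way.

```python
def delivery_return(num_actions, gamma=0.9, step_reward=-1.0, delivery_reward=20.0):
    """Discounted return when the last of `num_actions` actions is a successful drop-off.

    Assumes every earlier action earns `step_reward` and that the terminal
    state has value 0, so nothing is added after the delivery reward.
    """
    if num_actions < 1:
        raise ValueError("num_actions must be at least 1")
    total = 0.0
    for k in range(num_actions - 1):
        total += (gamma ** k) * step_reward
    total += (gamma ** (num_actions - 1)) * delivery_reward
    return total


for n in (1, 5, 10):
    print(n, round(delivery_return(n), 4))
```

With $\gamma = 0.90$, a drop-off on the very next action is worth 20, while a drop-off five actions away is worth about 9.683. A different discount rate would give different optimal Q-values.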

## Resources

The concepts explored in this homework are covered by:

-   Lesson 4: Convergence

-   Lesson 7: Exploring Exploration

-   Chapter 6 (6.5 Q-learning: Off-policy TD Control) of Sutton and Barto, *Reinforcement Learning: An Introduction*, 2nd edition (http://incompleteideas.net/book/the-book-2nd.html)

-   Chapter 2 (2.6.1 Q-learning) of M. Littman, *Algorithms for Sequential Decision Making*, 1996

## Submission

-   The due date is indicated on the Syllabus page for this assignment.

-   Use the template code to implement your work.

-   Please use *Python 3.6.x*, *gym==0.17.2*, and *numpy==1.18.0*, or
    more recent versions. You may also use any core library (i.e., anything
    in the Python standard library). No other libraries may be used.


In [13]:
################
# DO NOT REMOVE
# Versions
# gym==0.17.2
# numpy==1.18.0
# Xingyan Liu
# October 16, 2024
################
import gym
import numpy as np

class QLearningAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=1.0, epsilon_min=0.1, epsilon_decay=0.995, episodes=2000000):
        """
        Initialize the Q-learning agent with given parameters.
        
        Parameters:
        - alpha: Learning rate, controls how much new information overrides old information.
        - gamma: Discount factor, determines the importance of future rewards.
        - epsilon: Initial exploration rate for epsilon-greedy policy.
        - epsilon_min: Minimum epsilon value, defines the lowest exploration rate allowed.
        - epsilon_decay: Factor to reduce epsilon after each episode.
        - episodes: Number of training episodes.
        """
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        self.episodes = episodes
        self.env = gym.make("Taxi-v3").env  # Unwrapped Taxi-v3 env: no TimeLimit, so episodes end only on delivery
        self.Q = np.zeros([self.env.observation_space.n, self.env.action_space.n])  # Initialize Q-table to zeros

    def epsilon_greedy_policy(self, state):
        """
        Select an action using epsilon-greedy policy.
        
        Parameters:
        - state: The current state of the environment.

        Returns:
        - action: The chosen action, either exploratory (random) or exploitative (greedy).
        """
        if np.random.uniform(0, 1) < self.epsilon:
            return self.env.action_space.sample()  # Explore: choose a random action
        else:
            return np.argmax(self.Q[state])  # Exploit: choose action with the highest Q-value

    def train(self):
        """
        Train the agent using the Q-learning algorithm over multiple episodes.
        Updates the Q-table based on the agent's experiences.
        """
        for episode in range(self.episodes):
            state = self.env.reset()  # Reset the environment for a new episode
            done = False  # Variable to track if the episode has finished

            while not done:
                # Select action based on epsilon-greedy policy
                action = self.epsilon_greedy_policy(state)
                
                # Perform the action and observe the next state, reward, and done flag
                next_state, reward, done, _ = self.env.step(action)

                # Q-learning update rule; a terminal next state contributes no future value
                next_value = 0.0 if done else np.max(self.Q[next_state])
                self.Q[state][action] += self.alpha * (
                    reward + self.gamma * next_value - self.Q[state][action]
                )

                state = next_state  # Move to the next state

            # Decay epsilon after each episode to reduce exploration over time
            self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

    def solve(self):
        """
        Solve the environment by training the agent using the Q-learning algorithm.
        """
        self.train()

    def Q_table(self, state, action):
        """
        Get the Q-value for a specific state-action pair from the Q-table.
        
        Parameters:
        - state: The current state.
        - action: The chosen action.

        Returns:
        - Q-value: The value of the state-action pair.
        """
        return self.Q[state][action]
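
The defaults in `QLearningAgent.__init__` decay epsilon multiplicatively once per episode. As a rough check of how quickly that schedule reaches its floor, the sketch below replays the same decay rule outside the agent. `episodes_until_floor` is a hypothetical helper and is not part of the template.

```python
def episodes_until_floor(epsilon=1.0, epsilon_min=0.1, epsilon_decay=0.995):
    """Count the episodes needed for the decayed epsilon to reach epsilon_min."""
    if not 0.0 < epsilon_decay < 1.0:
        raise ValueError("epsilon_decay must lie strictly between 0 and 1")
    episodes = 0
    while epsilon > epsilon_min:
        # Same rule as in train(): decay once per episode, clamped at the floor.
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        episodes += 1
    return episodes


print(episodes_until_floor())
```

With the default values this returns 460, so nearly all of the 2,000,000 training episodes run at the floor of 0.1. That constant exploration keeps non-greedy state-action pairs visited, which matters when the goal is accurate Q-values rather than only a good policy.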

## Test cases

In [14]:

## DO NOT MODIFY THIS CODE.

import unittest


class TestQNotebook(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.agent = QLearningAgent()
        cls.agent.solve()
        
    def test_case_1(self):
        np.testing.assert_almost_equal(
            self.agent.Q_table(462, 4),
            -11.374402515,
            decimal=3
        )
        
    def test_case_2(self):
        np.testing.assert_almost_equal(
            self.agent.Q_table(398, 3),
            4.348907,
            decimal=3
        )
    
    def test_case_3(self):
        np.testing.assert_almost_equal(
            self.agent.Q_table(253, 0),
            -0.5856821173,
            decimal=3
        )

    def test_case_4(self):
        np.testing.assert_almost_equal(
            self.agent.Q_table(377, 1),
            9.683,
            decimal=3
        )

    def test_case_5(self):
        np.testing.assert_almost_equal(
            self.agent.Q_table(83, 5),
            -13.9968,
#            -12.8232,
            decimal=3
        )

unittest.main(argv=[''], verbosity=2, exit=False)

test_case_1 (__main__.TestQNotebook.test_case_1) ... ok
test_case_2 (__main__.TestQNotebook.test_case_2) ... ok
test_case_3 (__main__.TestQNotebook.test_case_3) ... ok
test_case_4 (__main__.TestQNotebook.test_case_4) ... ok
test_case_5 (__main__.TestQNotebook.test_case_5) ... ok

----------------------------------------------------------------------
Ran 5 tests in 398.612s

OK

