# The Taxi-v3 environment
Taxi-v3 is a grid-world environment in which a taxi moves around a 5x5 grid. In each episode the taxi must pick up a passenger at one of four marked locations (Red, Green, Yellow, and Blue) and drop them off at another. Every step costs a small penalty, so the agent maximizes its reward by taking the shortest legal route to the passenger and then to the destination.

## Key Components:
* Action Space: Comprises six actions where 0 moves the taxi south, 1 north, 2 east, 3 west, 4 picks up a passenger, and 5 drops off a passenger.
* Observation Space: Comprises 500 discrete states, accounting for 25 taxi positions, 5 possible passenger locations (the four marked locations or inside the taxi), and 4 destinations.
* Rewards System: Includes a penalty of -1 for each step taken without other rewards, +20 for successful passenger delivery, and -10 for illegal pickup or dropoff actions. Actions resulting in no operation, like hitting a wall, also incur a time step penalty.
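
To make the 500-state count concrete, here is a minimal sketch of how the four state components can be packed into a single integer and unpacked again. The formula follows the encoding described in the Gymnasium documentation for Taxi; the helper names `encode_state` and `decode_state` are hypothetical and are not part of the environment's API.

```python
def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack (row, col, passenger, destination) into one integer in [0, 500)."""
    # 5 rows x 5 columns x 5 passenger locations x 4 destinations = 500 states
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination


def decode_state(state):
    """Invert encode_state (hypothetical helper, for illustration only)."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination


# Taxi at row 3, column 1; passenger at location 2; destination 0
example_state = encode_state(3, 1, 2, 0)
print(example_state, decode_state(example_state))
```

Each row of the Q-table used later corresponds to one such integer, so decoding a state this way can help when inspecting what the agent has learned for a particular situation.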

In [None]:
# Re-run this cell to install and import the necessary libraries and load the required variables
import numpy as np
import gymnasium as gym
import imageio
from IPython.display import Image
from gymnasium.utils import seeding

# Initialize the Taxi-v3 environment
env = gym.make("Taxi-v3", render_mode='rgb_array')

# Seed the environment for reproducibility
env.np_random, _ = seeding.np_random(42)
env.action_space.seed(42)
np.random.seed(42)

# Maximum number of actions per training episode
max_actions = 100 

## Instructions
* Train an agent with Q-learning over 2,000 episodes, allowing a maximum of 100 actions per episode (`max_actions`). Record the total reward achieved in each episode and save these values in a list named `episode_returns`.
* What are the learned Q-values? Save these in a numpy array named `q_table`.
* What is the learned policy? Save it in a dictionary named `policy`.
* Test the agent's learned policy for one episode, starting with a seed of 42. Save the encountered states from `env.render()` as frames in a list named `frames`, and the sum of collected rewards in a variable named `episode_total_reward`. Make sure your agent does not execute more than 16 actions to solve the episode. If your learning process is efficient, `episode_total_reward` should be at least 4.
* Execute the last provided cell to visualize your agent's performance in navigating the environment effectively. Please note that it might take up to one minute to render.
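
As a reference point for the first instruction, the standard tabular Q-learning update can be written as shown below. The symbols are assumptions introduced here for illustration: $\alpha$ is the learning rate, $\gamma$ the discount factor, $s'$ the state reached after taking action $a$ in state $s$, and $r$ the reward received for that step.

$$
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]
$$

Once training is done, a greedy policy can be read off the table by choosing, for each state, the action with the largest Q-value.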

In [None]:
num_episodes = 2000
alpha = 0.1
gamma = 1
num_states = env.observation_space.n
num_actions = env.action_space.n
print(f"Number of states: {num_states}")
print(f"Number of actions: {num_actions}")

In [None]:
q_table = np.zeros((num_states, num_actions)) # Global Q table

# Derive the greedy policy (best known action per state) from the current Q-table
def get_policy():
    policy = {state: np.argmax(q_table[state]) for state in range(num_states)}
    return policy


In [None]:
# Q-learning update
def update_q_table(state, action, reward, new_state):
    old_value = q_table[state, action]
    # Best estimated value obtainable from the next state
    next_max = np.max(q_table[new_state])
    q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)


In [None]:
# Epsilon-greedy action selection
def epsilon_greedy(epsilon, state):
    if np.random.rand() < epsilon: # Explore
        action = env.action_space.sample()
    else: # Exploit
        action = np.argmax(q_table[state, :])
    return action

In [None]:
# Q-learning implementation
episode_returns = []
# Initial exploration rate
epsilon = 1.0
# Multiplicative decay applied to epsilon after each episode
epsilon_decay = 0.999
# Lower bound on the exploration rate
min_epsilon = 0.01

for episode in range(num_episodes):
    state, info = env.reset()
    terminated = False
    actions_taken = 0
    episode_reward = 0
    while (not terminated) and (actions_taken < max_actions):
        # Epsilon-greedy action selection
        action = epsilon_greedy(epsilon, state)
        # Take action and observe new state and reward
        new_state, reward, terminated, truncated, info = env.step(action)
        actions_taken += 1
        # Update Q-table
        update_q_table(state, action, reward, new_state)
        episode_reward += reward
        state = new_state
    
    episode_returns.append(episode_reward)
    epsilon = max(min_epsilon, epsilon * epsilon_decay)
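
The decay schedule above can be examined on its own. The following is a minimal sketch, assuming the same multiplicative rule as the training loop; the helper names `epsilon_after` and `episodes_to_floor` are hypothetical and are used only for illustration.

```python
import math


def epsilon_after(episodes, start=1.0, decay=0.999, floor=0.01):
    """Exploration rate after `episodes` decay steps, clipped at `floor`."""
    return max(floor, start * decay ** episodes)


def episodes_to_floor(start=1.0, decay=0.999, floor=0.01):
    """Smallest number of decay steps after which epsilon reaches `floor`."""
    return math.ceil(math.log(floor / start) / math.log(decay))


print(f"epsilon after 2000 episodes: {epsilon_after(2000):.3f}")
print(f"episodes needed to reach the floor: {episodes_to_floor()}")
```

With these settings epsilon is still well above `min_epsilon` when the 2,000 training episodes end, so the agent keeps exploring a noticeable fraction of the time throughout training. A smaller `epsilon_decay` would shift the balance toward exploitation earlier.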

In [None]:
# Evaluate the learned greedy policy for one episode

# Seed the environment for reproducibility
env.np_random, _ = seeding.np_random(42)
env.action_space.seed(42)
np.random.seed(42)

policy = get_policy()
episode_total_reward = 0.0
frames = []
eval_max_actions = 16

state, info = env.reset()
terminated = False
actions_taken = 0
while (not terminated) and (actions_taken < eval_max_actions):
    # Select the best action based on learned Q-table
    action = policy[state]
    # Take action and observe new state
    new_state, reward, terminated, truncated, info = env.step(action)
    actions_taken += 1
    state = new_state
    episode_total_reward += reward
    frames.append(env.render())

print(f"Episode reward: {episode_total_reward}")

In [None]:
imageio.mimsave('taxi_agent_behavior.gif', frames, fps=5, loop=0)

# Display GIF
gif_path = "taxi_agent_behavior.gif" 
Image(gif_path) 