# Q* Learning with Taxi-v2
<br> 
In this notebook, we'll implement a Q-learning agent <b>that plays Taxi-v2</b>: a taxi has to pick up a passenger and drop them off at the right location.



## Step 0: Import the dependencies 📚
We use 3 libraries:
- `NumPy` for our Q-table
- `OpenAI Gym` for our Taxi-v2 environment
- `random` to generate random numbers

In [1]:
import numpy as np
import gym
import random

## Step 1: Create the environment 🎮
- Here we'll create the Taxi-v2 environment. 
- OpenAI Gym is a library <b> composed of many environments that we can use to train our agents.</b>
- In our case we choose to use Taxi-v2.

In [3]:
env = gym.make("Taxi-v2")
env.render()

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | :[43m [0m|
|[34;1mY[0m| : |[35mB[0m: |
+---------+
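
The cells in this notebook use the older Gym API, where `reset()` returns only the observation and `step()` returns four values. On newer Gym releases (and Gymnasium), `Taxi-v2` has been superseded by `Taxi-v3`, `reset()` returns `(observation, info)`, and `step()` returns five values. The helpers below are a minimal sketch of how both styles could be normalized; `unpack_reset` and `unpack_step` are hypothetical names, not part of Gym, and the tuples at the end are stand-ins so the sketch runs without Gym installed.

```python
def unpack_reset(result):
    """Return only the observation from env.reset(), for either API style."""
    # Newer API: (observation, info); older API: the observation alone.
    # Taxi observations are plain integers, so a tuple means the newer style.
    if isinstance(result, tuple):
        return result[0]
    return result


def unpack_step(result):
    """Return (observation, reward, done, info) from env.step(), for either API style."""
    if len(result) == 5:
        # Newer API: (observation, reward, terminated, truncated, info).
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    # Older API: already (observation, reward, done, info).
    return result


# Stand-in return values (assumed shapes), used here instead of a real environment.
print(unpack_reset(328))
print(unpack_reset((328, {})))
print(unpack_step((428, -1, False, False, {})))
```

With these, `state = unpack_reset(env.reset())` and `new_state, reward, done, info = unpack_step(env.step(action))` would work with either return shape.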



## Step 2: Create the Q-table and initialize it 🗄️
- Now we'll create our Q-table. To know how many rows (states) and columns (actions) it needs, we first compute `state_size` and `action_size`.
- OpenAI Gym gives us both values: `env.observation_space.n` and `env.action_space.n`

In [4]:
action_size = env.action_space.n
state_size = env.observation_space.n
print('Action_size',action_size)
print('State-Size',state_size)

Action_size 6
State-Size 500
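
Where do these numbers come from? The 6 actions are the four moves (south, north, east, west) plus pickup and dropoff. The 500 states are 25 taxi positions (5 rows × 5 columns) × 5 passenger locations (the four marked squares, or inside the taxi) × 4 destinations. The sketch below shows one way to pack those four components into a single integer. The function names and the exact ordering are assumptions for illustration; check them against the environment's own source before relying on them.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one index in [0, 500)."""
    index = taxi_row                   # 5 possible rows
    index = index * 5 + taxi_col       # 5 possible columns
    index = index * 5 + passenger_loc  # 4 marked squares + "in the taxi"
    index = index * 4 + destination    # 4 possible destinations
    return index


def decode(index):
    """Invert encode(): recover (taxi_row, taxi_col, passenger_loc, destination)."""
    destination = index % 4
    index //= 4
    passenger_loc = index % 5
    index //= 5
    taxi_col = index % 5
    taxi_row = index // 5
    return taxi_row, taxi_col, passenger_loc, destination


num_states = 5 * 5 * 5 * 4
print(num_states)
print(decode(encode(3, 4, 2, 1)))
```

Each row of the Q-table corresponds to one such index, and each of its 6 columns to one action.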


In [5]:
qtable = np.zeros((state_size, action_size))
print(qtable)

[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]


## Step 3: Create the hyperparameters ⚙️
- Here, we'll specify the hyperparameters

In [6]:
total_episodes = 15000        # Total episodes
learning_rate = 0.8           # Learning rate
max_steps = 99                # Max steps per episode
gamma = 0.95                  # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.005            # Exponential decay rate for exploration prob
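
To get a feel for the exploration parameters, we can evaluate the decay formula used later in the training loop at a few episode numbers. This is a small standalone sketch: `epsilon_at` is a hypothetical helper, and it uses `math.exp` instead of NumPy so it runs on its own.

```python
import math

max_epsilon = 1.0    # Exploration probability at start
min_epsilon = 0.01   # Minimum exploration probability
decay_rate = 0.005   # Exponential decay rate for exploration prob


def epsilon_at(episode):
    """Exploration rate after the given number of episodes."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


# Print the exploration rate at a few points of training.
for episode in (0, 100, 500, 1000, 5000):
    print(episode, round(epsilon_at(episode), 4))
```

With these values, epsilon is already close to its floor of 0.01 after about 1,000 episodes, so most of the 15,000 training episodes are spent almost entirely exploiting the Q-table.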

## Step 4: The Q-learning algorithm 
- Now we implement the Q-learning algorithm:
<img src="qtable_algo.png" alt="Q algo"/>
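
The heart of the algorithm is a single update of one Q-table entry. Here is a minimal sketch of that update on plain numbers, using the learning rate and discount from Step 3. `q_update` is a hypothetical helper name, and the example rewards (-1 for an ordinary step, +20 for a successful dropoff) are what Taxi is understood to hand out; treat them as assumptions.

```python
def q_update(q_sa, reward, max_q_next, learning_rate=0.8, gamma=0.95):
    """Return the new Q(s,a) after one Q-learning update."""
    # Target: immediate reward plus discounted best value of the next state.
    td_target = reward + gamma * max_q_next
    # Move the old estimate a fraction learning_rate of the way towards the target.
    return q_sa + learning_rate * (td_target - q_sa)


# An ordinary step from an untrained table: the estimate moves towards -1.
print(q_update(0.0, -1, 0.0))
# A successful dropoff from an untrained table: the estimate moves towards +20.
print(q_update(0.0, 20, 0.0))
# A step into a state whose best action is already valued at 10.
print(q_update(2.0, -1, 10.0))
```

Because the target includes the best value of the next state, the dropoff reward gradually propagates backwards to the states that lead to it.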

In [7]:
# List of total rewards, one entry per episode
rewards = []

for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    done = False
    total_rewards = 0
    for step in range(max_steps):
        # Draw a random number to choose between exploration and exploitation
        tradeoff = random.uniform(0, 1)
        # Exploitation: take the action with the highest Q-value for this state
        if tradeoff > epsilon:
            action = np.argmax(qtable[state, :])
        # Exploration: take a random action
        else:
            action = env.action_space.sample()

        # Take the action and observe the new state and the reward
        new_state, reward, done, info = env.step(action)
        # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]
        # qtable[new_state, :] holds the values of all actions we can take from the new state
        qtable[state, action] = qtable[state, action] + learning_rate * (
            reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        total_rewards += reward
        state = new_state
        # Stop the episode once the environment reports it is finished
        if done:
            break
    # Reduce epsilon after each completed episode (we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * (episode + 1))
    rewards.append(total_rewards)
print ("Score over time: " +  str(sum(rewards)/total_episodes))
print(qtable)

Score over time: 3.5244
[[  0.           0.           0.           0.           0.
    0.        ]
 [209.1467106  247.81502235 244.00784289 205.99354804 273.30166436
   19.83568234]
 [264.71837527 154.0705892  273.21825079 283.71094898 304.98799375
  279.30056638]
 ...
 [ -3.29500467  -3.11368704  -3.0130688   -3.74129345 -10.208
  -11.5297792 ]
 [ -5.22752287  37.60953452  -5.55419899  -5.21580815 -13.06558208
  -13.90711512]
 [ -1.568       -1.568       -1.568      378.99958753 -10.6496
  -10.7136    ]]
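
Note that "Score over time" averages over all training episodes, including the early ones where the agent acts almost at random, so it understates how well the final Q-table performs. A rough sketch of a complementary metric, the average over only the most recent episodes, is shown below. `average_of_last` is a hypothetical helper, and the reward list is made up for illustration; it is not output from this notebook.

```python
def average_of_last(rewards, n):
    """Average of the last n entries of a non-empty list (or of all entries if fewer)."""
    tail = rewards[-n:]
    return sum(tail) / len(tail)


# Made-up per-episode totals: poor at first, better once the agent has learned.
example_rewards = [-200, -150, -40, 5, 8, 9, 7, 10]

print(sum(example_rewards) / len(example_rewards))  # average over every episode
print(average_of_last(example_rewards, 4))          # average over the last 4 only
```

In the notebook, calling `average_of_last(rewards, 1000)` after training would summarize the last 1,000 episodes instead of all 15,000.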


## Step 5: Use our Q-table to play Taxi-v2 ! 👾
- After 15,000 episodes, our Q-table can be used as a "cheatsheet" to play Taxi-v2.
- By running this cell you can see our agent playing Taxi-v2.

In [8]:
env.reset()

for episode in range(5):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        env.render()
        # Take the action (index) that has the maximum expected future reward in this state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            # The episode is over: print the (zero-based) index of the step at which it ended
            print("Number of steps", step)
            break
        state = new_state
env.close()

****************************************************
EPISODE  0
+---------+
|R: | : :[34;1mG[0m|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B:[43m [0m|
+---------+

+---------+
|R: | : :[34;1mG[0m|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |[43mB[0m: |
+---------+
  (West)
+---------+
|R: | : :[34;1mG[0m|
| : : : : |
| : : : : |
| | : |[43m [0m: |
|[35mY[0m| : |B: |
+---------+
  (North)
+---------+
|R: | : :[34;1mG[0m|
| : : : : |
| : : :[43m [0m: |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (North)
+---------+
|R: | : :[34;1mG[0m|
| : : :[43m [0m: |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (North)
+---------+
|R: | :[43m [0m:[34;1mG[0m|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (North)
+---------+
|R: | : :[34;1m[43mG[0m[0m|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (East)
+---------+
|R: | : :[42mG[0m|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B