# Reinforcement Learning Using Python and OpenAI Gym

## Taxi problem - transport people in a parking lot to four different locations (R,G,Y,B)

Reinforcement learning is a class of machine learning in which an agent learns how to behave in an environment by
performing actions and observing the results.

A few applications that motivate building reinforcement learning systems:

- Self-driving cars
- Gaming
- Robotics
- Recommendation systems
- Advertising and marketing

Reinforcement learning supports automation by learning from the environment it operates in. Other machine learning and
deep learning approaches support automation too, though with a different strategy. So why reinforcement learning?

It closely resembles natural learning: the model receives feedback on whether it has performed well or not. Other machine
learning and deep learning methods are learning processes as well, but they mostly focus on finding patterns in existing
data. Reinforcement learning instead learns by trial and error and, with enough exploration, moves toward the right actions.
A significant additional advantage is that we need not provide a complete labelled training set as in supervised learning;
the agent gathers its own experience by interacting with the environment.

##### Problem Statement:
“There are 4 locations (labelled by different letters), and our job is to pick up the passenger at one location and drop
him off at another. We receive +20 points for a successful drop-off and lose 1 point for every time-step it takes. There is
also a 10 point penalty for illegal pick-up and drop-off actions.” (Source: https://gym.openai.com/envs/Taxi-v2/ )

`env` is the core of OpenAI Gym: it is the unified environment interface. The following `env` methods are the most useful to us:

- `env.reset()`: Resets the environment and returns a random initial state.
- `env.step(action)`: Steps the environment by one timestep.
- `env.render()`: Renders one frame of the environment (helpful for visualizing it).

`env.step(action)` returns the following values:

- `observation`: The observation of the environment after the action.
- `reward`: A number indicating whether the action was beneficial or not.
- `done`: Indicates whether we have successfully picked up and dropped off a passenger, which ends one episode.
- `info`: Additional information, such as performance and latency, for debugging purposes.
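
The reset/step contract described above can be tried without Gym installed. The sketch below uses `CountdownEnv`, a hypothetical toy class that is not part of Gym; it only mimics the four-value return of `env.step` and borrows the Taxi reward scheme (-1 per timestep, +20 on success) for illustration.

```python
class CountdownEnv:
    """Hypothetical toy environment (not part of Gym) with a Gym-like interface."""

    def __init__(self, start=5):
        self.start = start
        self.state = start

    def reset(self):
        # Return the initial observation, as env.reset() does.
        self.state = self.start
        return self.state

    def step(self, action):
        # Action 1 moves one step closer to the goal; any other action wastes the timestep.
        if action == 1:
            self.state -= 1
        done = self.state == 0
        reward = 20 if done else -1
        info = {}  # nothing to report in this toy example
        return self.state, reward, done, info


def run_episode(env, policy):
    """Run one episode with the given policy and return (total_reward, timesteps)."""
    state = env.reset()
    total_reward, timesteps, done = 0, 0, False
    while not done:
        state, reward, done, info = env.step(policy(state))
        total_reward += reward
        timesteps += 1
    return total_reward, timesteps


# A policy that always moves toward the goal finishes in 5 steps: 4 * (-1) + 20 = 16.
total, steps = run_episode(CountdownEnv(start=5), policy=lambda s: 1)
print(total, steps)
```

The same `while not done` loop shape is what the Taxi cells in this walkthrough use; only the environment and the policy change.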

We can break up the parking lot into a 5x5 grid, which gives us 25 possible taxi locations. These 25 locations are one part
of our state space. In the example state rendered later, the taxi is at (row, col) coordinate (3, 1).

In the environment, there are four possible locations where the taxi can pick up or drop off a passenger: R, G, Y, B, or
[(0,0), (0,4), (4,0), (4,3)] in (row, col) coordinates, reading the rendered environment as a coordinate grid.

When we also account for one (1) additional passenger state of being inside the taxi, we can combine all passenger
locations and destination locations to get the total number of states for our taxi environment: there are four (4)
destinations and five (4 + 1) passenger locations. So our taxi environment has 5×5×5×4=500 total possible states.
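
The count of 500 corresponds to packing the four components into a single integer. The sketch below assumes a mixed-radix layout with radices (5, 5, 5, 4); the standalone `encode` and `decode` helpers are illustrative and are not the Gym methods themselves, although the layout reproduces the state number 328 that the notebook prints for `env.encode(3, 1, 2, 0)`.

```python
def encode(taxi_row, taxi_col, passenger_idx, dest_idx):
    """Pack (row, col, passenger, destination) into one integer in [0, 500)."""
    state = taxi_row
    state = state * 5 + taxi_col       # 5 columns
    state = state * 5 + passenger_idx  # 4 locations + "in taxi"
    state = state * 4 + dest_idx       # 4 destinations
    return state


def decode(state):
    """Inverse of encode: unpack the integer back into its four components."""
    state, dest_idx = divmod(state, 4)
    state, passenger_idx = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_idx, dest_idx


print(encode(3, 1, 2, 0))  # 328
print(decode(328))         # (3, 1, 2, 0)
print(encode(4, 4, 4, 3))  # 499, the largest state index
```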

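We have six possible actions: pickup, drop-off, north, east, south, and west. In the environment they are indexed
0 = south, 1 = north, 2 = east, 3 = west, 4 = pickup, 5 = drop-off.

The action space is the set of all the actions that our agent can take in a given state.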
        
Note that the taxi cannot perform certain actions in certain states because of walls. In the environment's code,
a move into a wall simply costs the usual -1 timestep penalty and the taxi stays where it is.

When the Taxi environment is created, an initial reward table called `P` is also created. It maps each state to a
dictionary of the form `{action: [(probability, next_state, reward, done)]}`.

In [4]:
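import gym

# Taxi-v2 matches the outputs shown here; later Gym releases replaced it with "Taxi-v3".
# .env unwraps the time-limit wrapper so episodes are not cut off at 200 steps.
env = gym.make("Taxi-v2").env
env.render()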

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Colours lost in export: the taxi is highlighted at G, the passenger waits at Y, and the destination is G.)



In [5]:
env.reset()
env.render()

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Colours lost in export: the taxi is highlighted at row 1, column 2; the passenger waits at G; the destination is R.)



In [6]:
print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

Action Space Discrete(6)
State Space Discrete(500)


In [7]:
state = env.encode(3, 1, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

env.s = state
env.render()

State: 328
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Colours lost in export: the taxi is highlighted at row 3, column 1; the passenger waits at Y; the destination is R.)



In [8]:
#The reward table
env.P[328]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}
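
To make the table easier to read, the sketch below copies the output above into a plain dictionary and labels each entry. `P_328`, `ACTION_NAMES`, and `summarize` are illustrative names introduced here; the action order (0 = south, 1 = north, 2 = east, 3 = west, 4 = pickup, 5 = drop-off) is assumed from the Taxi environment's documented indexing.

```python
# Transition table for state 328, copied from the env.P[328] output above.
P_328 = {
    0: [(1.0, 428, -1, False)],
    1: [(1.0, 228, -1, False)],
    2: [(1.0, 348, -1, False)],
    3: [(1.0, 328, -1, False)],
    4: [(1.0, 328, -10, False)],
    5: [(1.0, 328, -10, False)],
}

# Assumed action ordering of the Taxi environment.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def summarize(transitions):
    """Flatten {action: [(prob, next_state, reward, done)]} into labelled rows."""
    rows = []
    for action, outcomes in transitions.items():
        for prob, next_state, reward, done in outcomes:
            rows.append((ACTION_NAMES[action], next_state, reward, done))
    return rows


rows = summarize(P_328)
for name, next_state, reward, done in rows:
    print(f"{name:8s} -> state {next_state}, reward {reward}, done {done}")

# Illegal pickup/drop-off attempts cost -10; a move that leaves the state
# unchanged at the normal -1 cost is a move into a wall.
illegal = [name for name, _, reward, _ in rows if reward == -10]
blocked = [name for name, nxt, reward, _ in rows if nxt == 328 and reward == -1]
print("Illegal here:", illegal)
print("Blocked by a wall:", blocked)
```

In this state only the westward move is blocked, which agrees with the wall drawn to the left of the taxi in the rendering of state 328.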

In [11]:
epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

Timesteps taken: 512
Penalties incurred: 177


In [15]:
# Q-learning
import numpy as np
import random
from IPython.display import clear_output

# Initialize the Q-table with zeros: one row per state, one column per action
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration rate


all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    # Reset the per-episode counters
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            # Explore: sample a random action
            action = env.action_space.sample()
        else:
            # Exploit: take the action with the highest learned Q-value
            action = np.argmax(q_table[state])

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        # Update the new value
        new_value = (1 - alpha) * old_value + alpha * \
            (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.")

Episode: 100000
Training finished.
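
The update applied inside the training loop can be checked by hand. The helper below is a minimal sketch (the name `q_update` is illustrative and does not appear in the notebook) that applies the same formula, with alpha = 0.1 and gamma = 0.6 as above, to a Q-table entry that still holds its initial value of zero.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """Blend the old estimate with the new target (reward + discounted best next value)."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)


# First visit to an ordinary step: 0.9 * 0 + 0.1 * (-1 + 0.6 * 0) = -0.1
ordinary_step = q_update(old_value=0.0, reward=-1, next_max=0.0)

# First illegal pickup or drop-off: 0.9 * 0 + 0.1 * (-10 + 0.6 * 0) = -1.0
illegal_action = q_update(old_value=0.0, reward=-10, next_max=0.0)

# First successful drop-off: 0.9 * 0 + 0.1 * (20 + 0.6 * 0) = 2.0
successful_dropoff = q_update(old_value=0.0, reward=20, next_max=0.0)

print(ordinary_step, illegal_action, successful_dropoff)
```

Because alpha is small, each visit moves the estimate only a tenth of the way toward its target, which is one reason many episodes are needed before the values settle.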


In [17]:
q_table[328]

array([ -2.41484698,  -2.27325184,  -2.39970199,  -2.36120876,
       -10.80978258, -10.59069293])
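
Reading this row shows what the agent has learned for state 328. The sketch below copies the printed values into a plain list and picks the greedy action in pure Python, which is what `np.argmax` does during evaluation; the action names are assumed from the Taxi environment's indexing (0 = south, 1 = north, 2 = east, 3 = west, 4 = pickup, 5 = drop-off).

```python
# Q-values for state 328, copied from the output above.
q_row_328 = [-2.41484698, -2.27325184, -2.39970199, -2.36120876,
             -10.80978258, -10.59069293]

# Assumed action ordering of the Taxi environment.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Greedy choice: the index of the largest Q-value.
best_action = max(range(len(q_row_328)), key=lambda a: q_row_328[a])
print(best_action, ACTION_NAMES[best_action])  # 1 north
```

North is a sensible choice here: the passenger is at Y, but a wall separates the taxi's column from Y's column in the bottom two rows, so the taxi first has to move up before it can cross over. Pickup and drop-off score far lower because both are illegal in this state.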

In [18]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

Results after 100 episodes:
Average timesteps per episode: 12.72
Average penalties per episode: 0.0
