# Reinforcement Learning Example: PathFinder Bot

Suppose a building has five rooms, A to E, connected by doors.
We can treat the outside of the building as one big room, F, that surrounds it.
Two doors lead into the building from F: one through room B and one through room E.


![title](RL_problem.png)

Which path should the agent choose?



# Step 1: Modeling the environment

- Represent the rooms as a graph.
- Each room is a vertex (or node).
- Each door is an edge (or link).
- The goal room is node F.

![image.png](RL1.png)




Goal: outside the building (node F)
Assign a reward value to each door (each move between rooms).

State: each room (including the outside of the building)

Action: the agent’s movement from one room to the next

Initial state: C (chosen at random)

Reward: 100 for a move into the goal node, 0 for every other move

State Diagram 
![image.png](RL2.png)


![title](RL_image.png)
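The reward table used in the following cells can also be derived from the graph instead of being typed by hand. The sketch below shows one way to do that, assuming the door list read off the R matrix in this notebook. The names `ROOMS`, `GOAL`, `doors`, and `build_rewards` are hypothetical and are not part of the original code.

```python
import numpy as np

ROOMS = ["A", "B", "C", "D", "E", "F"]   # indices 0..5, same order as the notebook
GOAL = 5                                  # node F (outside the building)

# Doors (undirected edges), read off the R matrix used in this notebook.
doors = [("A", "E"), ("B", "D"), ("B", "F"), ("C", "D"), ("D", "E"), ("E", "F")]

def build_rewards(doors, goal=GOAL, goal_reward=100):
    """Return a table where -1 = no door, 0 = door, goal_reward = move into the goal."""
    n = len(ROOMS)
    R = -np.ones((n, n), dtype=int)
    for a, b in doors:
        i, j = ROOMS.index(a), ROOMS.index(b)
        # A door can be used in both directions.
        R[i, j] = goal_reward if j == goal else 0
        R[j, i] = goal_reward if i == goal else 0
    # Staying in the goal is allowed and rewarded, matching the last row of R.
    R[goal, goal] = goal_reward
    return R

R = build_rewards(doors)
print(R)
```

Building the table from an edge list keeps the graph and the rewards in sync if a door is added or removed.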

In [1]:
import numpy as np

In [2]:
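# R matrix: rows are the current state, columns are the next state (A..F = 0..5).
# -1 = no door, 0 = door, 100 = door leading to the goal F.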

Rewards = np.matrix([ [-1,-1,-1,-1,0,-1],
            [-1,-1,-1,0,-1,100],
            [-1,-1,-1,0,-1,-1],
            [-1,0,0,-1,0,-1],
            [0,-1,-1,0,-1,100],
            [-1,0,-1,-1,0,100] ])

Rewards

matrix([[ -1,  -1,  -1,  -1,   0,  -1],
        [ -1,  -1,  -1,   0,  -1, 100],
        [ -1,  -1,  -1,   0,  -1,  -1],
        [ -1,   0,   0,  -1,   0,  -1],
        [  0,  -1,  -1,   0,  -1, 100],
        [ -1,   0,  -1,  -1,   0, 100]])

In [3]:
# Q matrix
Q = np.matrix(np.zeros([6,6]))
Q

matrix([[0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.]])

In [4]:
# Gamma (discount factor): how much future rewards count toward the current Q value.
gamma = 0.8

In [5]:
# Initial state (1 = room B here; usually chosen at random).
initial_state = 1

# Write your code to choose a random state
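One possible way to fill in the placeholder above, assuming the six states are numbered 0 to 5 as in the R matrix. The names `n_states` and `random_initial_state` are hypothetical, so the fixed `initial_state` used by the following cells is left untouched.

```python
import numpy as np

n_states = 6  # rooms A..F, indexed 0..5

# np.random.randint draws from [low, high), so this yields an integer in 0..5.
random_initial_state = int(np.random.randint(0, n_states))
print(random_initial_state)
```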

In [6]:
# This function returns all available actions in the state given as an argument
def available_actions(state):
    current_state_row = Rewards[state,]
    av_act = np.where(current_state_row >= 0)[1]
    return av_act

In [7]:
# Get available actions in the current state
available_act = available_actions(initial_state) 

In [8]:
# This function chooses at random which action to be performed within the range 
# of all the available actions.
def sample_next_action(available_actions_range):
    # Sample from the argument, not from the global available_act
    next_action = int(np.random.choice(available_actions_range))
    return next_action

In [9]:
# Sample next action to be performed
action = sample_next_action(available_act)

In [10]:
# This function updates the Q matrix according to the path selected and the Q 
# learning algorithm
def update(current_state, action, gamma):
    
    # Best-valued action(s) in the next state; the action index is also the next state
    max_index = np.where(Q[action,] == np.max(Q[action,]))[1]

    # Break ties at random
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index))
    else:
        max_index = int(max_index[0])
    max_value = Q[action, max_index]
    
    # Q learning formula
    Q[current_state, action] = Rewards[current_state, action] + gamma * max_value

# Update Q matrix
update(initial_state,action,gamma)
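The `update` function above applies a simplified Q-learning rule with no learning rate. Written out, with $s$ the current state, $a$ the action, and $s'$ the next state (notation introduced for this note; in this environment the action index is also the next state, so $s' = a$):

$$Q(s, a) \leftarrow R(s, a) + \gamma \max_{a'} Q(s', a')$$

As a worked step, take the all-zero Q matrix, $s = 1$ (room B), and $\gamma = 0.8$. If the sampled action is $a = 5$ (room F):

$$Q(1, 5) = R(1, 5) + 0.8 \cdot \max_{a'} Q(5, a') = 100 + 0.8 \cdot 0 = 100$$

If the sampled action is $a = 3$ (room D) instead, the result is $Q(1, 3) = 0 + 0.8 \cdot 0 = 0$, so the first update only changes Q when the agent steps through a door to the goal.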

In [11]:
#-------------------------------------------------------------------------------
# Training

# Train over 10 000 iterations. (Re-iterate the process above).
for i in range(10000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state,action,gamma)

# The "trained" Q matrix
print("The Trained Q matrix:")
print(Q)


# Normalize the "trained" Q matrix
print("Trained Normalized Q matrix:")
print(Q/np.max(Q)*100)

The Trained Q matrix:
[[  0.   0.   0.   0. 400.   0.]
 [  0.   0.   0. 320.   0. 500.]
 [  0.   0.   0. 320.   0.   0.]
 [  0. 400. 256.   0. 400.   0.]
 [320.   0.   0. 320.   0. 500.]
 [  0. 400.   0.   0. 400. 500.]]
Trained Normalized Q matrix:
[[  0.    0.    0.    0.   80.    0. ]
 [  0.    0.    0.   64.    0.  100. ]
 [  0.    0.    0.   64.    0.    0. ]
 [  0.   80.   51.2   0.   80.    0. ]
 [ 64.    0.    0.   64.    0.  100. ]
 [  0.   80.    0.    0.   80.  100. ]]
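The training loop above can also be written with plain arrays and a seeded random generator, which makes a run repeatable. This is a minimal sketch under those assumptions, not the notebook's own implementation: the function `train` and its `seed` parameter are hypothetical, and `np.array` is swapped in for `np.matrix`, which the NumPy documentation no longer recommends.

```python
import numpy as np

R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])

def train(R, gamma=0.8, iterations=10_000, seed=0):
    """Random-exploration Q-learning using the same update rule as the notebook."""
    rng = np.random.default_rng(seed)      # seeded so the run is repeatable
    Q = np.zeros(R.shape, dtype=float)
    for _ in range(iterations):
        state = rng.integers(R.shape[0])           # random current state
        actions = np.flatnonzero(R[state] >= 0)    # doors available from that state
        action = rng.choice(actions)               # random available action
        # The action index is also the next state, so use that row's best value.
        Q[state, action] = R[state, action] + gamma * Q[action].max()
    return Q

Q = train(R)
print(np.round(Q / Q.max() * 100, 1))
```

Because every state-action pair is revisited many times, the values settle at the fixed point of the update rule regardless of the order in which pairs are sampled.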

In [12]:
#-------------------------------------------------------------------------------
# Testing

# States   = [A, B, C, D, E, F]
# Indices  = [0, 1, 2, 3, 4, 5]

# Goal state = 5
# Starting from 2, the best paths are 2, 3, 1, 5 and 2, 3, 4, 5.
# Both are equally good, so the tie at state 3 is broken at random.

current_state = 2
steps = [current_state]

while current_state != 5:

    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]
    
    # Break ties at random; otherwise take the single best index
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index))
    else:
        next_step_index = int(next_step_index[0])
    
    steps.append(next_step_index)
    current_state = next_step_index

In [13]:
# Print selected sequence of steps
print("Selected path:")
print(steps)


Selected path:
[2, 3, 4, 5]
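The same greedy walk can be wrapped in a function that reports room letters instead of indices. The sketch below assumes the trained Q values printed earlier in this notebook; `greedy_path`, `Q_trained`, and the `max_steps` guard are hypothetical additions for illustration.

```python
import numpy as np

ROOMS = ["A", "B", "C", "D", "E", "F"]

# Trained Q values, copied from the output printed earlier in this notebook.
Q_trained = np.array([[  0,   0,   0,   0, 400,   0],
                      [  0,   0,   0, 320,   0, 500],
                      [  0,   0,   0, 320,   0,   0],
                      [  0, 400, 256,   0, 400,   0],
                      [320,   0,   0, 320,   0, 500],
                      [  0, 400,   0,   0, 400, 500]], dtype=float)

def greedy_path(Q, start, goal=5, max_steps=20, seed=None):
    """Follow the highest Q value from `start` to `goal`, breaking ties at random."""
    rng = np.random.default_rng(seed)
    path = [start]
    state = start
    # The step limit guards against endless loops on an untrained Q table.
    while state != goal and len(path) <= max_steps:
        best = np.flatnonzero(Q[state] == Q[state].max())
        state = int(rng.choice(best))
        path.append(state)
    return path

path = greedy_path(Q_trained, start=2)
print(" -> ".join(ROOMS[s] for s in path))
```

From room C the walk always goes to D first; at D the Q values for B and E are tied at 400, so either C, D, B, F or C, D, E, F can be returned.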


![image.png](RL_prob.png)
