Machine learning is a subset of AI that gives machines the ability to learn automatically and improve from experience without being explicitly programmed.

In [None]:
Types of ML : Supervised learning, Unsupervised Learning, Reinforcement Learning.
    

Reinforcement learning (RL) is a type of ML where an agent learns how to behave in an environment by performing actions and observing the results.

Definitions:
Agent : The RL algorithm that learns by trial and error.
Environment : The world through which the agent moves.
Action : One of the possible steps the agent can take.
State : The current condition of the agent, as returned by the environment.
Reward : An immediate return from the environment that evaluates the last action.
Policy : The approach the agent uses to determine the next action based on the current state.
Value : The expected long-term discounted return, as opposed to the short-term reward R.
Action-Value : Similar to value, except that it takes an extra parameter, the current action.
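
To make the difference between the short-term reward and the long-term discounted return concrete, here is a small sketch. The function name `discounted_return` and the sample reward sequence are illustrative assumptions, not part of the original notebook.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

# Hypothetical episode: two moves with reward 0, then a move worth 100.
sample_rewards = [0, 0, 100]
print(discounted_return(sample_rewards, gamma=0.8))
```

The later a reward arrives, the less it contributes, which is why shorter routes to the same reward end up with a higher value.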

Exploration : Trying out actions to capture more information about the environment.
Exploitation : Using information that is already known to maximize the rewards.
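
One common way to balance the two is an epsilon-greedy rule. This notebook does not use it (it samples actions uniformly at random during training), so treat the sketch below as an assumption-laden illustration; `epsilon_greedy`, `epsilon` and the sample Q row are hypothetical names and values.

```python
import numpy as np

def epsilon_greedy(q_row, valid_actions, epsilon, rng):
    """With probability epsilon explore, otherwise exploit the best-known action."""
    valid_actions = np.asarray(valid_actions)
    if rng.random() < epsilon:
        # Explore: any valid action, chosen uniformly at random
        return int(rng.choice(valid_actions))
    # Exploit: the valid action with the highest Q value (first one on ties)
    return int(valid_actions[np.argmax(q_row[valid_actions])])

rng = np.random.default_rng(0)
q_row = np.array([0.0, 80.0, 51.2, 0.0, 80.0, 0.0])  # hypothetical Q values for one state
print(epsilon_greedy(q_row, [1, 2, 4], epsilon=0.1, rng=rng))
```

With `epsilon=0` the rule always exploits, and with `epsilon=1` it always explores.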

The math behind RL is the Markov Decision Process (MDP) : for example, finding the shortest path to a goal while collecting the maximum possible reward.

Q-Learning:

#There are 5 rooms (0 to 4) and the agent explores all the ways to find the way out (room 5) with maximum reward.
 Room 0 is connected to 4
 Room 1 is connected to 3 & 5
 Room 2 is connected to 3
 Room 3 is connected to 1, 2 & 4
 Room 4 is connected to 0, 3 & 5
 Room 5 is connected to 1, 4 & itself.




2 -> 
  <- 3   ->
     ||  <-  1
     ||     ||
     ||     ||
0 -> ||     ||
  <- 4 <-   ||
       ->   5


Rewards are 100 points for going directly to room 5, i.e. 1 -> 5 & 4 -> 5.
 If a path exists but does not lead directly to room 5, the reward is 0. E.g. in 2 -> 3 -> 1 -> 5, the reward for 2 -> 3 is 0 because the path exists, and the reward for 3 -> 1 is also 0.
 For the rest (non-existing paths) the reward is -1 (null).
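
The reward rules above can be turned into a matrix mechanically. The sketch below assumes every connection can be walked in both directions (as the arrows in the diagram suggest) and that staying in room 5 also earns 100; `edges`, `build_rewards` and `R_sketch` are hypothetical names.

```python
import numpy as np

# Connections between rooms, treated as two-way doors (an assumption).
edges = [(0, 4), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5)]
goal = 5

def build_rewards(edges, n_states, goal):
    """-1 for no path, 0 for an existing path, 100 for a path into the goal."""
    rewards = np.full((n_states, n_states), -1)
    for a, b in edges:
        for src, dst in ((a, b), (b, a)):
            rewards[src, dst] = 100 if dst == goal else 0
    rewards[goal, goal] = 100  # room 5 is connected to itself
    return rewards

R_sketch = build_rewards(edges, n_states=6, goal=goal)
print(R_sketch)
```

Rows are the current room (state) and columns are the room moved into (action).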




In [2]:
# Reward matrix R (rows = state, columns = action, -1 = no path)
#
#                   Action
#   State     0    1    2    3    4    5
#     0      -1   -1   -1   -1    0   -1
#     1      -1   -1   -1    0   -1  100
#     2      -1   -1   -1    0   -1   -1
#     3      -1    0    0   -1    0   -1
#     4       0   -1   -1    0   -1  100
#     5      -1    0   -1   -1    0  100

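To construct the Q matrix:
Q(state, action) = R(state, action) + Gamma * Max(Q(next_state, all actions))

Gamma is the discount factor, between 0 and 1.
Gamma towards 0 - the agent considers mainly the immediate reward
Gamma towards 1 - the agent gives more weight to future rewards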
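
As a worked example of the formula above, here are two updates done by hand with gamma = 0.8. The helper name `q_update` is an assumption made for illustration.

```python
def q_update(reward, gamma, next_q_values):
    """One application of Q(s, a) = R(s, a) + gamma * max(Q(s', all actions))."""
    return reward + gamma * max(next_q_values)

gamma = 0.8
# Move 1 -> 5: reward 100, and the Q row of state 5 is still all zeros.
q_1_5 = q_update(100, gamma, [0, 0, 0, 0, 0, 0])
# Move 3 -> 1: reward 0, and the Q row of state 1 now holds q_1_5 in column 5.
q_3_1 = q_update(0, gamma, [0, 0, 0, 0, 0, q_1_5])
print(q_1_5, q_3_1)
```

The second value is smaller than the first because the reward is one step further away and is discounted by gamma.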


Q-Learning Algorithm

1. Set the gamma parameter and the environment rewards in matrix R
2. Initialize the Q matrix to 0
3. Select a random initial state
4. Set current_state = initial_state
5. Select one among all possible actions for the current state
6. Using this action, consider going to the next state
7. Get the max Q value of the next state based on all its possible actions
8. Compute the Q value with the above formula
9. Repeat steps 5 to 8 until current_state = goal_state (which is 5 here)
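
The numbered steps can be sketched as one episode-based training function. This is only a sketch under stated assumptions: it uses a plain NumPy array, a reward matrix built from the room connections, and the hypothetical names `REWARDS`, `train_q`, `episodes` and `seed`. The cells that follow take a slightly different route and update one randomly chosen state per iteration.

```python
import numpy as np

# Step 1: gamma and the reward matrix (-1 = no path, 0 = path, 100 = path into room 5).
# The matrix is an illustrative assumption derived from the room connections.
GAMMA = 0.8
REWARDS = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

def train_q(rewards, goal_state, gamma, episodes=2000, seed=0):
    rng = np.random.default_rng(seed)
    n_states = rewards.shape[0]
    q = np.zeros(rewards.shape)                        # step 2
    for _ in range(episodes):
        state = int(rng.integers(n_states))            # steps 3 and 4
        while True:
            actions = np.flatnonzero(rewards[state] >= 0)
            action = int(rng.choice(actions))          # step 5
            next_state = action                        # step 6: the action is the room moved into
            best_next = q[next_state].max()            # step 7
            q[state, action] = rewards[state, action] + gamma * best_next  # step 8
            state = next_state
            if state == goal_state:                    # step 9
                break
    return q

q_sketch = train_q(REWARDS, goal_state=5, gamma=GAMMA)
print(np.round(q_sketch / q_sketch.max() * 100, 1))
```

Each episode makes at least one move, so an episode that starts in room 5 still updates that room's row before it ends.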

In [3]:
import numpy as np
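# R matrix: rows are states, columns are actions
# (-1 = no path, 0 = path exists, 100 = path into room 5)

R = np.matrix([[ -1, -1, -1, -1, 0, -1],
               [ -1, -1, -1, 0, -1, 100],
               [ -1, -1, -1, 0, -1, -1],
               [ -1, 0 , 0 , -1, 0, -1],
               # Note: this row allows 4 -> 1 and 4 -> 2, whereas the reward table above
               # puts the zeros for state 4 in columns 0 and 3 (moves 4 -> 0 and 4 -> 3).
               # The outputs shown below correspond to the row as written here.
               [ -1, 0, 0, -1, -1, 100],
               [ -1, 0, -1, -1, 0, 100]])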

In [4]:
# Initialize Q matrix
Q = np.matrix(np.zeros([6,6]))

In [7]:
# Gamma (discount factor)
gamma = 0.8 # Change it further and check in test runs

In [8]:
# Initial state (usually chosen at random)
initial_state = 1

In [11]:
# Function to return all available actions in the state given as argument.
# From the R matrix, choose the entries with value >= 0: these are the possible moves from the given state.
def available_actions(state):
    current_state_row = R[state]
    # R is an np.matrix, so the row is 2-D; index [1] holds the column indices
    av_act = np.where(current_state_row >= 0)[1]
    return av_act

In [12]:
# Get the available actions in the chosen state
available_act = available_actions(initial_state)

In [13]:
# Function to choose, at random, which action to perform from the range of available actions
def sample_next_action(available_actions_range):
    # Use the argument rather than the global available_act
    next_action = int(np.random.choice(available_actions_range))
    return next_action


In [18]:
# Next action to be performed
action = sample_next_action(available_act)

In [19]:
# Function to update the Q matrix for the selected move using the Q-learning formula
def update(current_state, action, gamma):
    # The action is the room moved into, so it is also the next state.
    # Find the column(s) with the largest Q value in the next state's row.
    max_index = np.where(Q[action,] == np.max(Q[action,]))[1]

    # Break ties between equally good actions at random
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index))
    else:
        max_index = int(max_index[0])
    max_value = Q[action, max_index]

    # Q-learning formula
    Q[current_state, action] = R[current_state, action] + gamma * max_value

In [20]:
update(initial_state, action, gamma)

In [24]:
# Training over 10000 iterations
for i in range(10000):
    # Pick a random state, then a random available action, and update Q for that pair
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)

In [25]:
# Normalize the trained Q matrix so that the largest value is 100
print ("Trained Q matrix")
print (Q/np.max(Q) * 100)

Trained Q matrix
[[  0.    0.    0.    0.   80.    0. ]
 [  0.    0.    0.   64.    0.  100. ]
 [  0.    0.    0.   64.    0.    0. ]
 [  0.   80.   51.2   0.   80.    0. ]
 [  0.   80.   51.2   0.    0.  100. ]
 [  0.   80.    0.    0.   80.  100. ]]


In [29]:
# Testing
goal_state = 5
# Example best path starting from 2: 2, 3, 1, 5

current_state = 4
steps = [current_state]

while current_state != goal_state:
    # Column(s) with the highest Q value in the current state's row
    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]

    # Break ties at random
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index))
    else:
        next_step_index = int(next_step_index[0])

    steps.append(next_step_index)
    current_state = next_step_index

print("selected path")
print(steps)
    


selected path
[4, 5]
