Title - Reinforcement Learning

a. Calculating Reward
b. Discounted Reward
c. Calculating Optimal Quantities
d. Implementing Q Learning
e. Setting up an Optimal Action

Reinforcement Learning (RL) is a machine learning paradigm in which an agent learns to interact with an environment to achieve a specific goal. The agent receives feedback in the form of rewards or penalties for its actions, which guides it toward maximizing cumulative reward over time. The code in this notebook demonstrates several components of RL: calculating rewards, implementing Q-learning, and finding optimal actions.

a. Calculating Reward:
In RL, the reward is the immediate feedback the agent receives after each action. The reward_matrix in the code holds the reward for each transition: entry [i, j] is the reward for moving from state i to state j. Moves between neighboring states in the chain A, B, C, D, E have a reward of 1, and every other transition (including staying in place) has a reward of 0.
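
To make the matrix layout concrete, the sketch below lists the transitions that carry a reward from a given state. It assumes the same 5-state reward matrix used in this notebook, with rows as the current state and columns as the next state. The helper name `rewarded_moves` and the `states` list are hypothetical.

```python
import numpy as np

states = ['A', 'B', 'C', 'D', 'E']
reward_matrix = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

def rewarded_moves(state):
    """Return the names of next states that yield a nonzero reward from `state`."""
    row = reward_matrix[states.index(state)]
    return [states[j] for j in np.flatnonzero(row)]

for s in states:
    print(s, '->', rewarded_moves(s))
```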

b. Discounted Reward:
The discounted reward is a fundamental concept in RL. It reflects the idea that future rewards are worth less than immediate ones. The discount factor gamma (0.95 in this code) sets how much less: a reward received t steps in the future is weighted by gamma^t. With discounting, the agent gives more weight to short-term gains while still accounting for long-term cumulative reward.
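
As a rough illustration of what a discount factor of 0.95 means in practice, the sketch below prints the weight gamma**t applied to a reward received t steps ahead. It also prints the limit 1 / (1 - gamma) that the discounted sum approaches when every step pays a reward of 1. The variable names are illustrative and not part of the notebook's code.

```python
gamma = 0.95

# Weight applied to a reward that arrives t steps in the future.
weights = [gamma ** t for t in range(5)]
print("Discount weights:", [round(w, 6) for w in weights])

# If every step paid a reward of 1 forever, the discounted sum would
# approach the geometric-series limit 1 / (1 - gamma).
max_return = 1 / (1 - gamma)
print("Limit of the discounted sum for unit rewards:", round(max_return, 6))
```

With gamma = 0.95 this limit is 20, which is also the largest value in the Q-matrix printed later in the notebook.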

c. Calculating Optimal Quantities:
In RL, the agent aims to learn an optimal policy, one that maximizes cumulative reward. The Q-matrix (quality matrix) stores the estimated value (Q-value) of taking each action from each state. In this code both dimensions are indexed by state, because an action is the choice of which state to move to next. The Q-value represents the expected cumulative reward of taking a particular action in a given state and following the optimal policy thereafter.
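
In standard notation, the optimal Q-values satisfy the Bellman optimality equation. The symbols below (Q^*, s', a') are the usual textbook ones rather than names from the notebook's code. The second line is a specialization that assumes, as in this notebook, that transitions are deterministic and the chosen action is itself the next state.

```latex
Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]

Q^*(s, a) = R[s, a] + \gamma \max_{a'} Q^*(a, a') \qquad \text{(deterministic case, } s' = a\text{)}
```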

d. Implementing Q Learning:
The q_learning_update function implements the update rule of Q-learning, a popular model-free RL method. After each observed transition it moves the Q-value of the state-action pair toward a target made of the immediate reward plus the discounted maximum Q-value of the next state. The learning rate alpha (0.1 in this code) controls how far each update moves the estimate.
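
A single update can be traced by hand. The sketch below applies the standard Q-learning rule to a zero-initialized table, using the notebook's alpha = 0.1 and gamma = 0.95. The function name `q_update` and its explicit parameters are assumptions made for this example, not the notebook's `q_learning_update` signature.

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning update to Q[s, a] in place."""
    target = reward + gamma * np.max(Q[s_next])  # bootstrapped return estimate
    Q[s, a] += alpha * (target - Q[s, a])        # move Q[s, a] toward the target
    return Q

Q = np.zeros((5, 5))

# Move A -> B (indices 0 -> 1) with reward 1: target = 1 + 0.95 * 0 = 1
q_update(Q, s=0, a=1, reward=1, s_next=1)
print(Q[0, 1])

# Move B -> A with reward 1: target = 1 + 0.95 * Q[0, 1]
q_update(Q, s=1, a=0, reward=1, s_next=0)
print(Q[1, 0])
```

The form `Q[s, a] += alpha * (target - Q[s, a])` is algebraically the same as `(1 - alpha) * Q[s, a] + alpha * target`.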

e. Setting up an Optimal Action:
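The function find_optimal_route runs Q-learning episodes that start in the initial state 'A' and end when the agent reaches the goal state 'E'. In each step the agent picks an action with an epsilon-greedy rule, observes the reward, and updates the Q-matrix. The function returns the learned Q-matrix, and the greedy action for each state is then read off as the highest-valued entry in that state's row. Given enough visits to every state-action pair, the Q-values approach the optimal ones. Because the reward matrix gives no extra reward for reaching 'E', the greedy policy is not guaranteed to lead to the goal; in the recorded run it moves between 'A' and 'B'.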
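
Once a Q-matrix is available, a route can be read off by repeatedly taking the highest-valued action. The sketch below assumes, as in this notebook, that an action index is also the index of the next state. The names `greedy_route` and `max_steps` and the small hand-written Q-table are hypothetical and only illustrate the mechanics. The step cap is there because a greedy policy is not guaranteed to reach the goal.

```python
import numpy as np

states = ['A', 'B', 'C', 'D', 'E']

def greedy_route(Q, start, goal, max_steps=10):
    """Follow argmax actions from `start`; stop at `goal` or after `max_steps` moves."""
    route = [start]
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        state = int(np.argmax(Q[state]))  # the chosen action is the next state
        route.append(state)
    return [states[i] for i in route]

# Hand-written Q-table (illustrative values, not learned) that prefers moving right.
Q_demo = np.zeros((5, 5))
for s in range(4):
    Q_demo[s, s + 1] = 1.0

print(greedy_route(Q_demo, start=0, goal=4))
```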

In conclusion, Reinforcement Learning is a powerful technique for training agents to learn from their interactions with an environment. The provided code showcases some fundamental aspects of RL, such as calculating rewards, implementing Q-learning, and finding optimal actions. By combining exploration and exploitation, RL algorithms can discover optimal policies and make effective decisions in various real-world scenarios, from game playing to robotics and autonomous systems. The ability to learn from feedback and adapt to changing environments makes RL an essential tool for solving complex decision-making problems where traditional supervised learning approaches may not be applicable.

In [None]:
import numpy as np

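# Map state names to matrix indices and back
state_to_index = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
index_to_state = {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E'}

# reward_matrix[i, j] is the immediate reward for moving from state i to state j
reward_matrix = np.array([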
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0]
])

In [None]:
print("Reward Matrix:")
print(reward_matrix)

Reward Matrix:
[[0 1 0 0 0]
 [1 0 1 0 0]
 [0 1 0 1 0]
 [0 0 1 0 1]
 [0 0 0 1 0]]


In [None]:
gamma = 0.95  # Discount factor
alpha = 0.1  # Learning rate

In [None]:
print("Discount Factor (Gamma):", gamma)

Discount Factor (Gamma): 0.95


In [None]:
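state_size = len(state_to_index)
action_size = state_size  # an action is the index of the state to move to
Q_matrix = np.zeros([state_size, action_size])

def q_learning_update(s, a, reward, s2, Q_matrix):
    """Apply one Q-learning update for the transition (s, a, reward, s2); return the new state and Q-matrix."""
    # Blend the old estimate with the target: reward + gamma * best Q-value in the next state
    Q_matrix[s, a] = (1 - alpha) * Q_matrix[s, a] + alpha * (reward + gamma * np.max(Q_matrix[s2, :]))
    return s2, Q_matrix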

In [None]:
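def get_action(state, Q_matrix, epsilon=0.1):
    """Epsilon-greedy choice: a random action with probability epsilon, otherwise the highest-valued one."""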
    if np.random.random() < epsilon:
        return np.random.choice(action_size)
    return np.argmax(Q_matrix[state, :])
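
Because `get_action` draws from NumPy's global random state, repeated runs can give different Q-matrices. One possible way to make runs repeatable is to pass in a seeded generator, as sketched below. The function `get_action_seeded` and its `rng` parameter are hypothetical additions, not part of the notebook.

```python
import numpy as np

def get_action_seeded(state, Q, rng, epsilon=0.1):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # uniform random action
    return int(np.argmax(Q[state]))           # best known action

rng = np.random.default_rng(seed=0)
Q = np.zeros((5, 5))
Q[0, 1] = 1.0  # pretend A -> B currently looks best

actions = [get_action_seeded(0, Q, rng) for _ in range(1000)]
print("Share of greedy picks:", actions.count(1) / len(actions))
```

With epsilon = 0.1 and five actions, the greedy action should be picked roughly 92% of the time, since a random draw also lands on it one time in five.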

In [None]:
def find_optimal_route(initial_state, goal_state, Q_matrix, episodes=1000):
    """Run Q-learning episodes from initial_state until goal_state is reached; return the learned Q-matrix."""
    for _ in range(episodes):
        state = initial_state
        while state != goal_state:
            action = get_action(state, Q_matrix)
            next_state = action  # the action is the index of the state to move to
            reward = reward_matrix[state, action]
            state, Q_matrix = q_learning_update(state, action, reward, next_state, Q_matrix)
    return Q_matrix

In [None]:
initial_state = state_to_index['A']
goal_state = state_to_index['E']
Q_matrix = find_optimal_route(initial_state, goal_state, Q_matrix)

print("Q-matrix:")
print(Q_matrix)

Q-matrix:
[[19.         20.         18.05       18.05        0.        ]
 [20.         19.         19.05       18.05        0.        ]
 [19.         16.66329168 15.68551912 17.49748427  0.        ]
 [19.         15.82919946 15.84060089 16.06107254  0.89058101]
 [ 0.          0.          0.          0.          0.        ]]


In [None]:
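# Greedy action per state: the column with the highest Q-value in each row
# (np.argmax returns the first index on ties, so the all-zero row for 'E' maps to 'A')
optimal_actions = [np.argmax(Q_matrix[state, :]) for state in range(state_size)]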
optimal_actions = [index_to_state[action] for action in optimal_actions]
print("Optimal Actions for each state:", optimal_actions)

Optimal Actions for each state: ['B', 'A', 'A', 'A', 'A']
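
The greedy actions above never head toward 'E'. One way to check whether that is a sampling artifact is to compute the fixed point of the same update rule directly, without sampling. The sketch below uses value iteration rather than Q-learning (a different, model-based technique). It treats 'E' as terminal with value 0, as the notebook's loop effectively does, and every name in it is hypothetical.

```python
import numpy as np

gamma = 0.95
goal = 4  # index of state 'E'
R = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

Q = np.zeros_like(R)
for _ in range(2000):
    V = Q.max(axis=1)   # value of each state under the greedy policy
    V[goal] = 0.0       # episodes end at the goal, so nothing follows it
    Q = R + gamma * V   # broadcasting: column j gets gamma * V[j]
    Q[goal, :] = 0.0    # the goal row is never updated in the notebook

print(np.round(Q, 2))
```

Under these assumptions every non-goal state converges to a value of 1 / (1 - gamma) = 20, while the move from 'D' to 'E' is worth only 1. The reward matrix pays the same reward for any adjacent move and nothing extra for reaching 'E', which suggests the recorded policy reflects the reward design rather than insufficient training.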


In [None]:
for state in state_to_index:
    for action in state_to_index:
        state_idx = state_to_index[state]
        action_idx = state_to_index[action]
        immediate_reward = reward_matrix[state_idx, action_idx]
        print(f"Immediate Reward for moving from state {state} to state {action}: {immediate_reward}")

Immediate Reward for moving from state A to state A: 0
Immediate Reward for moving from state A to state B: 1
Immediate Reward for moving from state A to state C: 0
Immediate Reward for moving from state A to state D: 0
Immediate Reward for moving from state A to state E: 0
Immediate Reward for moving from state B to state A: 1
Immediate Reward for moving from state B to state B: 0
Immediate Reward for moving from state B to state C: 1
Immediate Reward for moving from state B to state D: 0
Immediate Reward for moving from state B to state E: 0
Immediate Reward for moving from state C to state A: 0
Immediate Reward for moving from state C to state B: 1
Immediate Reward for moving from state C to state C: 0
Immediate Reward for moving from state C to state D: 1
Immediate Reward for moving from state C to state E: 0
Immediate Reward for moving from state D to state A: 0
Immediate Reward for moving from state D to state B: 0
Immediate Reward for moving from state D to state C: 1
Immediate 

In [None]:
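# Discounted reward collected along the path A -> B -> C -> D -> E
action_sequence = ['A', 'B', 'C', 'D', 'E']
discounted_reward = 0
current_gamma = 1  # running weight gamma**i applied to the i-th reward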

for i in range(len(action_sequence) - 1):
    state = state_to_index[action_sequence[i]]
    next_state = state_to_index[action_sequence[i+1]]
    reward = reward_matrix[state, next_state]
    discounted_reward += current_gamma * reward
    current_gamma *= gamma

print("Discounted Reward for the action sequence:", discounted_reward)

Discounted Reward for the action sequence: 3.709875
