# **Experiment 9**

## **Aim: Reinforcement Learning using Q-Learning**

## **Theory**

**Q-Learning in Reinforcement Learning**

1. Introduction:


Q-Learning is a model-free Reinforcement Learning (RL) algorithm in which an agent learns to act optimally by interacting with its environment. It learns a Q-value (action-value function) that estimates the expected cumulative discounted reward for taking a particular action in a given state and acting greedily afterwards.
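
The update behind this estimate is the standard Q-learning rule, $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$, where $\alpha$ is the learning rate and $\gamma$ is the discount factor. As a minimal sketch, the function below applies one such update. The name `q_update` and the default values 0.1 and 0.9 are illustrative assumptions, chosen to match the parameters used in the code for this experiment.

```python
def q_update(old_q, reward, next_max, alpha=0.1, gamma=0.9):
    """Return the Q-value after one Q-learning update (hypothetical helper)."""
    td_target = reward + gamma * next_max      # best estimate of the return
    return old_q + alpha * (td_target - old_q)  # move old_q a fraction alpha towards it

# First step into a terminal goal: the old estimate is 0, the reward is +10,
# and the terminal state's Q-values are all 0.
print(q_update(old_q=0.0, reward=10, next_max=0.0))  # 1.0
```

Repeating the same transition moves the estimate closer to 10 each time, which is why a value that is updated often ends up near its target.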

2. Exploration vs Exploitation:


	•	Exploration helps the agent try new actions to discover rewards.
	•	Exploitation uses the current best-known action.
	•	The balance between the two is controlled by an $\epsilon$-greedy strategy:
		•	With probability $\epsilon$, choose a random action.
		•	Otherwise (with probability $1 - \epsilon$), choose the best-known action.
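
The selection rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used in the experiment: `choose_action` is a hypothetical helper, and it uses only the standard library so that it runs on its own.

```python
import random

def choose_action(q_values, epsilon):
    """Return an action index chosen with an epsilon-greedy rule (hypothetical helper)."""
    if random.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return random.randrange(len(q_values))
    # Exploit: pick the index of the largest Q-value (first one on ties).
    return max(range(len(q_values)), key=q_values.__getitem__)

q_values = [0.2, 1.8, 0.4, 1.5]
print(choose_action(q_values, epsilon=0.0))  # 1, because epsilon=0 never explores
```

With `epsilon=1.0` the same function ignores the Q-values entirely and always returns a random index.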

3. GridWorld Example:


In the 4x4 grid used in this experiment:

	•	The agent starts at (0,0) and learns to reach the goal at (3,3), which gives a reward of +10.
	•	The trap at (1,2) gives a heavy penalty (-10) and ends the episode.
	•	Every other move costs -1, so the Q-values guide the agent towards the shortest path that also avoids the trap.

## **Code & Output**

import numpy as np
import random

# Environment setup
grid_size = 4
actions = ['up', 'down', 'left', 'right']
q_table = np.zeros((grid_size, grid_size, len(actions)))

# Rewards: Goal at (3,3), Trap at (1,2)
rewards = np.full((grid_size, grid_size), -1)
rewards[3][3] = 10   # Goal
rewards[1][2] = -10  # Trap

# Parameters
alpha = 0.1      # Learning rate
gamma = 0.9      # Discount factor
epsilon = 0.2    # Exploration rate
episodes = 500

# Helper function to move the agent.
# x is the row index and y the column index; moves off the grid are clamped.
def take_action(state, action):
    x, y = state
    if action == 'up': x = max(0, x - 1)
    elif action == 'down': x = min(grid_size - 1, x + 1)
    elif action == 'left': y = max(0, y - 1)
    elif action == 'right': y = min(grid_size - 1, y + 1)
    return (x, y)

# Training
for episode in range(episodes):
    state = (0, 0)  # every episode starts in the top-left corner
    while state != (3, 3):  # until goal is reached
        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action_index = random.randint(0, len(actions) - 1)  # explore
        else:
            action_index = np.argmax(q_table[state[0], state[1]])  # exploit

        action = actions[action_index]
        next_state = take_action(state, action)
        reward = rewards[next_state]

        # Q-learning update:
        # Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
        old_q = q_table[state[0], state[1], action_index]
        next_max = np.max(q_table[next_state[0], next_state[1]])
        q_table[state[0], state[1], action_index] = old_q + alpha * (reward + gamma * next_max - old_q)

        if next_state == (1, 2):  # Trap: the episode ends here
            break
        state = next_state

# Display learned Q-values
print("Learned Q-Table:")
for i in range(grid_size):
    for j in range(grid_size):
        print(f"State ({i},{j}): {q_table[i, j]}")

Learned Q-Table:
State (0,0): [0.24543278 1.8098     0.36287732 1.4591348 ]
State (0,1): [-1.09201107  3.06784721 -0.729901   -1.65522723]
State (0,2): [-1.04450422 -3.439      -1.07953973 -0.95013658]
State (0,3): [-0.4900995   0.79039566 -0.55713484 -0.58001377]
State (1,0): [0.1246892  2.56065608 1.55014939 3.122     ]
State (1,1): [ 0.84249298  4.58        1.32198696 -9.57608842]
State (1,2): [0. 0. 0. 0.]
State (1,3): [-0.20791     5.25875108 -3.439       0.19456136]
State (2,0): [-0.49972567 -0.44585305 -0.52413971  4.54636841]
State (2,1): [2.95219901 6.2        2.33562076 4.81198969]
State (2,2): [-2.71        0.57666524  1.11142402  7.79455787]
State (2,3): [1.29342602 9.96618608 1.05388684 1.67746402]
State (3,0): [-0.32803325  0.10361247  0.90974524  5.8400335 ]
State (3,1): [4.2783026  5.73477138 3.36764273 8.        ]
State (3,2): [ 4.3460774   7.47536036  5.77465145 10.        ]
State (3,3): [0. 0. 0. 0.]
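
One way to read the table above is to follow the highest-valued action from the start state. The sketch below copies the printed Q-values (action order up, down, left, right, as in the `actions` list) into a plain dictionary and traces that greedy path. `greedy_path`, `moves`, and the `max_steps` cap are hypothetical names introduced only for this illustration.

```python
actions = ['up', 'down', 'left', 'right']
moves = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

# Q-values copied from the printed output of this run.
q_table = {
    (0, 0): [0.24543278, 1.8098, 0.36287732, 1.4591348],
    (0, 1): [-1.09201107, 3.06784721, -0.729901, -1.65522723],
    (0, 2): [-1.04450422, -3.439, -1.07953973, -0.95013658],
    (0, 3): [-0.4900995, 0.79039566, -0.55713484, -0.58001377],
    (1, 0): [0.1246892, 2.56065608, 1.55014939, 3.122],
    (1, 1): [0.84249298, 4.58, 1.32198696, -9.57608842],
    (1, 2): [0.0, 0.0, 0.0, 0.0],
    (1, 3): [-0.20791, 5.25875108, -3.439, 0.19456136],
    (2, 0): [-0.49972567, -0.44585305, -0.52413971, 4.54636841],
    (2, 1): [2.95219901, 6.2, 2.33562076, 4.81198969],
    (2, 2): [-2.71, 0.57666524, 1.11142402, 7.79455787],
    (2, 3): [1.29342602, 9.96618608, 1.05388684, 1.67746402],
    (3, 0): [-0.32803325, 0.10361247, 0.90974524, 5.8400335],
    (3, 1): [4.2783026, 5.73477138, 3.36764273, 8.0],
    (3, 2): [4.3460774, 7.47536036, 5.77465145, 10.0],
    (3, 3): [0.0, 0.0, 0.0, 0.0],
}

def greedy_path(q_table, start=(0, 0), goal=(3, 3), grid_size=4, max_steps=16):
    """Follow the best-valued action from start until the goal or the step cap."""
    state, path = start, [start]
    for _ in range(max_steps):
        if state == goal:
            break
        values = q_table[state]
        best = max(range(len(values)), key=values.__getitem__)
        dx, dy = moves[actions[best]]
        # Clamp to the grid, mirroring take_action in the training code.
        x = min(grid_size - 1, max(0, state[0] + dx))
        y = min(grid_size - 1, max(0, state[1] + dy))
        state = (x, y)
        path.append(state)
    return path

print(greedy_path(q_table))
# [(0, 0), (1, 0), (1, 1), (2, 1), (3, 1), (3, 2), (3, 3)]
```

For this particular table the greedy path takes six moves, the minimum possible from (0,0) to (3,3), and it never enters the trap at (1,2). Because action selection during training is random, another run may produce different values and possibly a different path.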
