## **Task 6 - Q-learning**
Implement the Q-learning algorithm and apply it to the [Cliff Walking](https://gymnasium.farama.org/environments/toy_text/cliff_walking/) problem. The environment is available in the gymnasium package (`gym.make('CliffWalking-v0')`).

**Steps:**
1. Q-learning algorithm implementation.
2. Investigate how the algorithm performs on the Cliff Walking problem for different values of the learning rate and different numbers of training episodes.

**Notes:**
- The implementation of the algorithm should be usable for a variety of environments with a discrete space of states and actions.

In [None]:
from time import sleep
from IPython.display import clear_output

import numpy as np
import gymnasium as gym
from matplotlib import pyplot as plt
from typing import List

In [None]:
env = gym.make("CliffWalking-v0", render_mode="rgb_array")

### **1. Q-learning algorithm implementation**

**Action selection by epsilon greedy method**

In [None]:
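def choose_action(Q: np.ndarray, state: int, epsilon: float) -> int:
    # Explore: with probability epsilon, pick a uniformly random action.
    # The action count is read from Q, so the function does not rely on a global env.
    if np.random.uniform(0, 1) < epsilon:
        return int(np.random.randint(Q.shape[1]))
    # Exploit: pick a greedy action, breaking ties at random.
    max_value = np.max(Q[state])
    max_keys = np.where(Q[state] == max_value)[0]
    return int(np.random.choice(max_keys))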
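
The selection rule can be checked in isolation, without an environment. The following is a minimal sketch, not part of the original notebook: the names `epsilon_greedy`, `rng`, and `Q_demo` are assumptions made for illustration, and a seeded NumPy generator is used so the result is reproducible.

```python
import numpy as np


def epsilon_greedy(Q, state, epsilon, rng):
    """Return a random action with probability epsilon, else a greedy one."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    # Indices of all actions that share the maximum value (ties broken at random).
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(rng.choice(best))


rng = np.random.default_rng(0)
# One state, four actions; actions 1 and 2 tie for the best value.
Q_demo = np.array([[0.0, 1.0, 1.0, -2.0]])

greedy_picks = {epsilon_greedy(Q_demo, 0, 0.0, rng) for _ in range(200)}
random_picks = {epsilon_greedy(Q_demo, 0, 1.0, rng) for _ in range(200)}
```

With `epsilon=0.0` only the tied best actions are ever chosen, while `epsilon=1.0` visits every action.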

**Implementation of the Q-learning algorithm training process**

In [None]:
def train_q_learning(env, beta=0.1, gamma=0.99, epsilon=0.1, max_episodes=5_000, max_steps=1000):
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    training_rewards = []

    for _ in range(max_episodes):
        state, _ = env.reset()

        steps_counter = 0

        reward_per_episode = 0

        while steps_counter < max_steps:
            action = choose_action(Q, state, epsilon)
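            next_state, reward, terminated, truncated, _ = env.step(action)

            # Q-learning update: move Q[state, action] toward the TD target.
            Q[state, action] += beta * (reward + gamma * np.max(Q[next_state]) - Q[state, action])

            state = next_state

            steps_counter += 1
            reward_per_episode += reward

            # Stop on a terminal state or when the environment truncates the episode.
            if terminated or truncated:
                break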

        training_rewards.append(reward_per_episode)

    return Q, training_rewards
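
The update inside the training loop is the standard Q-learning rule, with $\beta$ as the learning rate and $\gamma$ as the discount factor:

$$Q(s, a) \leftarrow Q(s, a) + \beta \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

A small worked example may make the arithmetic concrete. The helper name `q_update` and the numbers below are illustrative assumptions; the reward of -1 matches the per-step reward in Cliff Walking.

```python
def q_update(q_sa, reward, max_next_q, beta=0.1, gamma=0.99):
    """Apply one Q-learning update to a single Q(s, a) value."""
    td_target = reward + gamma * max_next_q
    return q_sa + beta * (td_target - q_sa)


# First visit: everything is still zero, the step costs -1.
first = q_update(q_sa=0.0, reward=-1.0, max_next_q=0.0)

# Second visit: the next state's best value is now assumed to be -0.1.
second = q_update(q_sa=first, reward=-1.0, max_next_q=-0.1)
```

Here `first` is -0.1 (a tenth of the way toward the target -1), and `second` is about -0.1999, since the target becomes -1 + 0.99 * (-0.1) = -1.099.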

**Implementation of the Q-learning algorithm testing process**

In [None]:
def test_q_learning(env: gym.Env, Q: np.ndarray, max_steps=1000):
    print("Testing Q-learning")
    state, _ = env.reset()
    env.render()
    total_reward = 0
    step_counter = 0
    while step_counter < max_steps:
        max_values = np.max(Q[state])
        max_keys = np.where(Q[state] == max_values)[0]
        action = np.random.choice(max_keys)
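        state, reward, terminated, truncated, _ = env.step(action)
        env.render()
        total_reward += reward
        # Stop on a terminal state or when the environment truncates the episode.
        if terminated or truncated:
            break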
        step_counter += 1
    print("Total reward:", total_reward)

### **2. Visualize the results**

**Visualization of the reward in subsequent episodes.**

In [None]:
def visualize_rewards_per_episode(rewards_per_episode, beta, episodes_num):
    plt.plot(rewards_per_episode)
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.title(f"Dependence of reward on episode for beta={beta}, number of episodes={episodes_num}")
    plt.show()
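
Per-episode rewards under an epsilon-greedy policy are noisy, because a single exploratory step into the cliff costs -100. One option, sketched here as an assumption rather than as part of the original notebook, is to smooth the curve with a moving average before plotting. The name `moving_average` and the window size are hypothetical.

```python
import numpy as np


def moving_average(values, window=50):
    """Return the moving average of `values` over `window` consecutive entries."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the window fits entirely.
    return np.convolve(values, kernel, mode="valid")


smoothed = moving_average([-100, -50, -30, -20, -14, -13], window=3)
```

The result has `len(values) - window + 1` entries, so the smoothed curve is slightly shorter than the raw one; it could be passed to `plt.plot` in place of the raw rewards.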

**Visualization of the environment and the agent's strategy.**

In [None]:
def get_all_steps(env: gym.Env, Q: np.ndarray, max_steps: int = 1000):
    state, _ = env.reset()

    frames = []

    # Cap the rollout so a poorly trained policy cannot loop forever.
    for _ in range(max_steps):
        # Greedy action, with ties broken at random.
        max_values = np.max(Q[state])
        max_keys = np.where(Q[state] == max_values)[0]
        action = np.random.choice(max_keys)
        state, reward, terminated, truncated, _ = env.step(action)
        frames.append({
            "frame": env.render(),
            "state": state,
            "action": action,
            "reward": reward
        })
        if terminated or truncated:
            break
    return frames

In [None]:
def print_frames(frames: List[dict]):
    for frame in frames:
        clear_output(wait=True)
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        plt.imshow(frame["frame"])
        plt.show()
        sleep(0.5)
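
Besides the frame-by-frame animation, the greedy strategy can be summarized as a grid of arrows. The sketch below is an illustrative assumption, not part of the original notebook: it relies on the Cliff Walking layout (4 rows by 12 columns, state = row * 12 + column) and its action encoding (0 up, 1 right, 2 down, 3 left). The names `ARROWS`, `policy_grid`, and `Q_demo` are hypothetical.

```python
import numpy as np

# Cliff Walking action encoding: 0 = up, 1 = right, 2 = down, 3 = left.
ARROWS = {0: "^", 1: ">", 2: "v", 3: "<"}


def policy_grid(Q, n_rows=4, n_cols=12):
    """Return the greedy action of every state as rows of arrow characters."""
    # np.argmax returns the first maximum, so ties resolve to the lowest action index.
    greedy = np.argmax(Q, axis=1).reshape(n_rows, n_cols)
    return ["".join(ARROWS[int(a)] for a in row) for row in greedy]


# Hand-built Q-table describing the path that hugs the cliff edge.
Q_demo = np.zeros((48, 4))
Q_demo[36, 0] = 1.0      # start cell: go up
Q_demo[24:35, 1] = 1.0   # third row: go right
Q_demo[35, 2] = 1.0      # last cell of the third row: go down to the goal

rows = policy_grid(Q_demo)
print("\n".join(rows))
```

Cells that were never updated (for example the cliff and goal cells) show an arbitrary arrow, so only the cells along the path are meaningful.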

### **3. Experiments**

**The value of the reward in subsequent episodes for the random strategy (epsilon=1).**

In [None]:
Q_random, rewards_random = train_q_learning(env, epsilon=1.0, max_episodes=2_000)

In [None]:
visualize_rewards_per_episode(rewards_random, beta=0.1, episodes_num=2_000)

**Reward in subsequent episodes for the epsilon-greedy strategy (epsilon=0.1) for different learning rates and numbers of episodes.**

In [None]:
BETA_VALUES = [0.1, 0.3, 0.5, 0.7, 0.9]
EPISODES_VALUES = [500, 1000, 5000, 10000]

In [None]:
for beta in BETA_VALUES:
    for episodes in EPISODES_VALUES:
        # Average over 5 independent training runs to reduce noise.
        res = [train_q_learning(env, beta=beta, max_episodes=episodes)[1] for _ in range(5)]
        # Visualize mean rewards per episode.
        visualize_rewards_per_episode(np.mean(res, axis=0), beta, episodes)

**Example visualization of the environment and agent strategy.**

In [None]:
Q, rewards = train_q_learning(env, beta=0.1, max_episodes=10_000)

In [None]:
frames = get_all_steps(env, Q)
print_frames(frames)