In [None]:
import warnings
warnings.filterwarnings('ignore')

## Monte Carlo ES

In [None]:
import gym
import numpy as np

# Initialize the environment
env = gym.make('CliffWalking-v0')

# Set hyperparameters
num_episodes = 50
gamma = 1.0
epsilon = 1.0  # with epsilon = 1.0 choose_action always samples a random action, so Q never drives behaviour

# Initialize Q-values
Q = np.zeros((env.observation_space.n, env.action_space.n))
returns = {}

# Define function to choose an action

def choose_action(state):
    if np.random.uniform() < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(Q[state])
    return action

# Run Monte Carlo ES algorithm
steps_es = []
rewards_es = []
for i in range(num_episodes):
    episode_states = []
    episode_actions = []
    episode_rewards = []
    state = env.reset()
    done = False

    # Choose starting action randomly
    action = env.action_space.sample()

    # Play episode and store states, actions, and rewards
    while not done:
        episode_states.append(state)
        episode_actions.append(action)
        state, reward, done, _ = env.step(action)
        episode_rewards.append(reward)

        # Choose next action using epsilon-greedy policy
        action = choose_action(state)

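    # Calculate returns and update Q-values (first-visit)
    # Time step at which each (state, action) pair first occurs in this episode
    first_visit = {}
    for t, pair in enumerate(zip(episode_states, episode_actions)):
        first_visit.setdefault(pair, t)

    G = 0
    for t in range(len(episode_states)-1, -1, -1):
        s = episode_states[t]
        a = episode_actions[t]
        r = episode_rewards[t]
        G = gamma * G + r
        # Only the first occurrence of (s, a) contributes a return
        if first_visit[(s, a)] == t:
            if (s, a) not in returns:
                returns[(s, a)] = []
            returns[(s, a)].append(G)
            Q[s][a] = np.mean(returns[(s, a)])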

    # Calculate steps
    steps_es.append(len(episode_states))
    rewards_es.append(sum(episode_rewards))

# Print results
print(f"Monte Carlo ES: average steps = {np.mean(steps_es)}, average rewards = {np.mean(rewards_es)}")


Monte Carlo ES: average steps = 5976.74, average rewards = -60674.24


The code above prints the average number of steps and the average total reward per episode across the 50 episodes (`num_episodes = 50`). It does not draw a plot; the per-episode values are kept in `steps_es` and `rewards_es` and can be plotted separately.

The printed results do not show Monte Carlo ES finding a good policy in the Cliff Walking environment. Over the 50 episodes, an episode takes 5976.74 steps on average and collects an average total reward of -60674.24. Two details of the setup explain this. First, `epsilon = 1.0` makes `choose_action` pick a random action at every step, so the learned Q-values never influence behaviour. Second, only the first action of each episode is randomised while the start state is always the same, so the exploring-starts assumption (every state-action pair can begin an episode) is only partly met.


Overall, as configured here, Monte Carlo ES does not quickly discover the best course of action. Judging it fairly would require a smaller `epsilon` and more episodes.
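The backward return computation used in the update loop can be checked by hand on a tiny episode. The sketch below is illustrative only: `first_visit_returns` is a hypothetical helper, not part of the notebook, and the three-step episode is made up. It shows that when a state-action pair occurs twice, only the return from its first occurrence is kept.

```python
def first_visit_returns(states, actions, rewards, gamma=1.0):
    """Map each (state, action) pair to the return following its first visit."""
    # Time step of the first occurrence of every (state, action) pair
    first_visit = {}
    for t, pair in enumerate(zip(states, actions)):
        first_visit.setdefault(pair, t)

    returns = {}
    G = 0.0
    # Walk the episode backwards so G accumulates the discounted future reward
    for t in range(len(states) - 1, -1, -1):
        G = gamma * G + rewards[t]
        pair = (states[t], actions[t])
        if first_visit[pair] == t:
            returns[pair] = G
    return returns


# Made-up episode: the pair (36, 3) occurs at t=0 and t=1, (36, 0) at t=2
toy_states = [36, 36, 36]
toy_actions = [3, 3, 0]
toy_rewards = [-1, -1, -1]
print(first_visit_returns(toy_states, toy_actions, toy_rewards))
# {(36, 0): -1.0, (36, 3): -3.0}
```

The pair `(36, 3)` receives the return from `t=0` (three rewards of -1), not the shorter return from its second occurrence at `t=1`.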


## MC Control

In [None]:
import gym
import numpy as np
import matplotlib.pyplot as plt

# Initialize the environment
env = gym.make("CliffWalking-v0")

# Set hyperparameters
num_episodes = 500
gamma = 1.0
epsilon = 1.0  # with epsilon = 1.0 choose_action always samples a random action, so Q never drives behaviour

# Initialize Q-values and visit counts
Q = np.zeros((env.observation_space.n, env.action_space.n))
N = np.zeros((env.observation_space.n, env.action_space.n))

# Define function to choose an action
def choose_action(state):
    if np.random.uniform() < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(Q[state])
    return action

# Run on-policy first-visit MC control algorithm
steps_mc = []
rewards_mc = []
for i in range(num_episodes):
    episode_states = []
    episode_actions = []
    episode_rewards = []
    state = env.reset()
    done = False

    # Choose starting action using epsilon-soft policy
    action = choose_action(state)

    # Play episode and store states, actions, and rewards
    while not done:
        episode_states.append(state)
        episode_actions.append(action)
        state, reward, done, _ = env.step(action)
        episode_rewards.append(reward)

        # Choose next action using epsilon-soft policy
        action = choose_action(state)

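    # Update Q-values and visit counts (first-visit)
    # Time step at which each (state, action) pair first occurs in this episode
    first_visit = {}
    for t, pair in enumerate(zip(episode_states, episode_actions)):
        first_visit.setdefault(pair, t)

    G = 0
    for t in range(len(episode_states)-1, -1, -1):
        s = episode_states[t]
        a = episode_actions[t]
        r = episode_rewards[t]
        G = gamma * G + r
        # Only the first occurrence of (s, a) updates the running mean
        if first_visit[(s, a)] == t:
            N[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / N[s][a]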

    # Calculate steps and rewards
    steps_mc.append(len(episode_states))
    rewards_mc.append(sum(episode_rewards))

# Print results
print(f"On-policy first-visit MC control: average steps = {np.mean(steps_mc)}, average rewards = {np.mean(rewards_mc)}")


On-policy first-visit MC control: average steps = 6075.076, average rewards = -61539.034


The code above prints the average number of steps and the average total reward per episode across the 500 episodes of on-policy first-visit MC control. It does not draw a plot; the per-episode values are kept in `steps_mc` and `rewards_mc`.

The printed results do not show on-policy first-visit MC control with an epsilon-soft policy learning a good policy either. Over the 500 episodes, an episode takes 6075.076 steps on average and collects an average total reward of -61539.034. As in the previous section, `epsilon = 1.0` means every action is chosen at random, so the Q-values are updated but never used to act.

In this setting, Monte Carlo ES and on-policy first-visit MC control with an epsilon-soft policy perform similarly: roughly 6000 steps per episode and an average total reward of about -61000. The two cells differ in how they average returns (a stored list of returns versus a visit count with an incremental update) rather than in how they behave, and the Monte Carlo ES run covers only 50 episodes against 500 for MC control. In general, Monte Carlo ES might need more episodes to converge than on-policy first-visit MC control, but these runs cannot confirm that. Depending on the features of other environments, the performance of these algorithms may differ.
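The two cells estimate Q differently: the Monte Carlo ES cell stores every return and calls `np.mean`, while the MC control cell keeps a visit count `N` and applies `Q += (G - Q) / N`. A minimal sketch, using made-up returns for a single state-action pair, suggests the two give the same estimate up to floating-point rounding. `incremental_mean` is a hypothetical helper written for this illustration.

```python
def incremental_mean(values):
    """Running mean using the update q += (g - q) / n, one value at a time."""
    q, n = 0.0, 0
    for g in values:
        n += 1
        q += (g - q) / n
    return q


# Made-up returns observed for one (state, action) pair
sample_returns = [-13.0, -113.0, -20.0, -15.0]

batch_mean = sum(sample_returns) / len(sample_returns)
running_mean = incremental_mean(sample_returns)

# The two estimates agree up to floating-point rounding
print(batch_mean, running_mean)
```

The incremental form needs only two numbers per state-action pair, whereas the list of returns grows with every visit.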



## Conclusion

On-policy first-visit MC control is similar to Monte Carlo ES in terms of average reward in these runs. In principle it can reach a good policy with less exploration, because it only needs to explore enough to keep the policy epsilon-soft, whereas Monte Carlo ES relies on every state-action pair being able to start an episode so that each one is visited. The runs above do not demonstrate that advantage. With a smaller `epsilon`, on-policy first-visit MC control with epsilon-soft policies remains a reasonable approach to try in the Cliff Walking environment.
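To make "suitably soft" concrete, the sketch below computes the action probabilities implied by the `choose_action` rule used in both cells: with probability `epsilon` a uniformly random action, otherwise the greedy one. `epsilon_greedy_probs` and the Q-values in `q_row` are hypothetical and serve only as an illustration.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over one row of Q."""
    n = len(q_values)
    # Index of the first maximal entry, mirroring np.argmax tie-breaking
    greedy = max(range(n), key=lambda a: q_values[a])
    # Every action gets epsilon / n from the random branch
    probs = [epsilon / n] * n
    # The greedy action also gets the remaining 1 - epsilon
    probs[greedy] += 1.0 - epsilon
    return probs


q_row = [-5.0, -2.0, -7.0, -4.0]  # made-up Q-values for one state

print(epsilon_greedy_probs(q_row, 0.1))  # most of the mass on action 1
print(epsilon_greedy_probs(q_row, 1.0))  # [0.25, 0.25, 0.25, 0.25]
```

With `epsilon = 1.0`, as set in both cells, the policy is uniform over the four actions regardless of Q, which is consistent with the very long episodes in the printed output.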

According to the results, neither Monte Carlo ES nor on-policy first-visit MC control performed well in terms of the number of steps per episode. The average is above 5000 steps for both algorithms (over 50 episodes for Monte Carlo ES and 500 for MC control), which is a lot given that the Cliff Walking grid has only 48 states.

On average reward, Monte Carlo ES came out slightly ahead of on-policy first-visit MC control (-60674.24 versus -61539.034). This might be because Monte Carlo ES investigates the state-action space more thoroughly than on-policy first-visit MC control, which relies on the epsilon-soft policy and might not examine all state-action pairs. However, the gap is under 2% and the two runs use different numbers of episodes, so it may simply be noise.

Note that neither algorithm is likely to learn the best policy in the Cliff Walking environment within 50 episodes. Running both for more episodes, with the same episode count for each, and comparing the outcomes would give a better picture of how well they work. Other reinforcement learning methods could also be tested to see whether they perform better on this task.
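When comparing longer runs, per-episode rewards are noisy, so a moving average can make any trend easier to see. The helper below is a hypothetical sketch, not part of the notebook; it could be applied to lists such as `rewards_es` and `rewards_mc`, and the toy values are made up.

```python
def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    # Slide the window one step at a time, updating the running sum
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


toy_rewards = [-100, -80, -90, -60, -50, -40]  # made-up per-episode rewards
print(moving_average(toy_rewards, 3))
```

A smoothed curve that rises over episodes would indicate learning; a flat one, as expected with `epsilon = 1.0`, would not.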