# Implement Model-Free Prediction & Control With Monte Carlo (MC)

Monte Carlo Prediction
The goal is to estimate the state-value function V(s) for a given policy π.

Monte Carlo Control
The goal is to optimize the policy by iteratively improving the action-value function Q(s, a).
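
For reference, the two quantities can be written in the usual textbook notation. The symbols below (the return G_t, the discount factor γ, the reward R, and the final time step T) are standard conventions assumed here, not definitions taken from this notebook:

$$
G_t = \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1}, \qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right], \qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]
$$

Monte Carlo methods approximate these expectations by averaging the returns observed in complete episodes.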

In [1]:
import numpy as np
from collections import defaultdict
import gym

## Monte Carlo Prediction

Goal: Learn how good it is to be in a certain state when following a specific policy.

* Imagine playing a game (like Blackjack) over and over while following a certain strategy (policy).
* The first time you reach a state in a game (e.g., your current cards in Blackjack), you note the total reward you collect from that point until the end of the game. This total is called the return.
* After enough games, you take the average of these returns for each state.
* This average becomes your estimate of how valuable that state is (called V(s)) when using that strategy.

Think of it as keeping track of "how good" each state is by observing what happens when you visit it many times.
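
The averaging steps above can be sketched on a toy problem before touching Blackjack. This is a minimal illustration, not the notebook's implementation: the helper `first_visit_returns`, the state names `"A"` and `"B"`, and the two episodes are all made up for the example.

```python
from collections import defaultdict


def first_visit_returns(episode, gamma=1.0):
    """Return {state: return following its first visit} for one episode.

    `episode` is a list of (state, reward) pairs, where `reward` is the
    reward received after leaving `state`.
    """
    G = 0.0
    first_visit = {}
    # Walk backwards so G accumulates the discounted future reward.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        # Earlier visits are processed later in this backward loop, so the
        # value left in the dict belongs to the first visit.
        first_visit[state] = G
    return first_visit


# Two hypothetical games; state "A" is visited twice in the first one.
episodes = [
    [("A", 0.0), ("B", 1.0), ("A", 1.0)],
    [("B", 1.0), ("A", -1.0)],
]

returns = defaultdict(list)
for episode in episodes:
    for state, G in first_visit_returns(episode).items():
        returns[state].append(G)

# Average the collected returns per state.
V = {state: sum(gs) / len(gs) for state, gs in returns.items()}
print(V)
```

With more episodes, each average settles toward the expected return of its state under the policy that generated the games.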

In [2]:
# Monte Carlo Prediction
def mc_prediction(policy, env, num_episodes, gamma=1.0):
    V = defaultdict(float)
    returns = defaultdict(list)

    for episode in range(num_episodes):
        # Generate an episode
        episode_data = []
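        # gym >= 0.26: reset() returns (observation, info)
        state, _ = env.reset()
        done = False

        while not done:
            action = policy(state)
            # gym >= 0.26: step() returns five values
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_data.append((state, action, reward))
            state = next_state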

        # Calculate returns
        G = 0
        for t in reversed(range(len(episode_data))):
            state, _, reward = episode_data[t]
            G = reward + gamma * G
            if state not in [x[0] for x in episode_data[:t]]:
                returns[state].append(G)
                V[state] = np.mean(returns[state])

    return V
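
The function above keeps every return in a list and recomputes the mean each time. A common alternative, sketched here as an assumption rather than as part of the original notebook, is to update a running mean with a visit counter, which gives the same average without storing the lists. The helper name `update_running_mean` and the sample returns are hypothetical.

```python
from collections import defaultdict


def update_running_mean(V, counts, state, G):
    """Fold one new return G into the running average stored in V[state]."""
    counts[state] += 1
    # new_mean = old_mean + (sample - old_mean) / n
    V[state] += (G - V[state]) / counts[state]


V = defaultdict(float)
counts = defaultdict(int)

# Four made-up returns observed for the same state.
for G in [1.0, -1.0, 1.0, 1.0]:
    update_running_mean(V, counts, "s", G)

print(V["s"], counts["s"])
```

The result matches the plain average of the four returns, and memory use stays constant per state no matter how many episodes are run.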

Prediction: Estimates how good a situation is under a specific strategy. For example, "If I have 15 in Blackjack, what result can I expect on average?"

## Monte Carlo Control

Goal: Find the best strategy (policy) to play the game, so you win more often.
How it works:
* Instead of just tracking how good each state is, you now track how good each action is in each state. This is called Q(s,a), the action-value function.
* At the start, you try actions randomly (this is called exploration).
* Over time, you see which actions lead to the best outcomes and start choosing those actions more often (this is called exploitation).
* To keep improving, you mix random actions (exploration) and smart actions (exploitation) using a method called epsilon-greedy: most of the time, you pick the best-known action, but sometimes you try something random to learn more.
* After many games, the strategy improves because it increasingly picks the actions with the highest estimated value based on what you’ve learned.

Think of this as learning the best way to play the game by trial and error over thousands of games.
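
The epsilon-greedy rule described above can be tried in isolation. The sketch below uses only the standard library; the function name `epsilon_greedy`, the two action values, and the seed are assumptions chosen for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: any action, chosen uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: the action with the highest current estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)   # seeded generator so runs are repeatable
q_values = [0.2, 0.8]    # made-up estimates; action 1 looks better

picks = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print(f"share of greedy picks: {greedy_share:.3f}")
```

With `epsilon=0.1` and two actions, the greedy action should be chosen roughly 95% of the time: 90% from exploitation plus half of the 10% random picks.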

In [3]:
def mc_control_epsilon_greedy(env, num_episodes, gamma=1.0, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    returns = defaultdict(list)

    def epsilon_greedy_policy(state):
        if np.random.rand() < epsilon:
            return np.random.choice(env.action_space.n)  # Explore
        return np.argmax(Q[state])  # Exploit

    for episode in range(num_episodes):
        # Generate an episode
        episode_data = []
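        # gym >= 0.26: reset() returns (observation, info)
        state, _ = env.reset()
        done = False

        while not done:
            action = epsilon_greedy_policy(state)
            # gym >= 0.26: step() returns five values
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_data.append((state, action, reward))
            state = next_state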

        # Calculate returns
        G = 0
        for t in reversed(range(len(episode_data))):
            state, action, reward = episode_data[t]
            G = reward + gamma * G
            if (state, action) not in [(x[0], x[1]) for x in episode_data[:t]]:
                returns[(state, action)].append(G)
                Q[state][action] = np.mean(returns[(state, action)])

    # Derive the policy from Q
    policy = {state: np.argmax(actions) for state, actions in Q.items()}
    return Q, policy

Control: Helps you figure out the best strategy to win. For example, "What should I do when I have 15 in Blackjack: hit or stand?"
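
To make the "hit or stand" question concrete, the sketch below reads the greedy action out of a small Q-table. It assumes the Blackjack-v1 conventions: a state is (player sum, dealer's visible card, usable ace), action 0 means stand (called "stick" in the Gym documentation), and action 1 means hit. The Q-values are invented for illustration and are not results from training.

```python
ACTION_NAMES = {0: "stand", 1: "hit"}

# Hypothetical action values: Q[state] = [value of stand, value of hit].
Q = {
    (15, 10, False): [-0.60, -0.55],
    (20, 10, False): [0.45, -0.85],
}


def greedy_action(Q, state):
    """Return the index of the action with the highest estimated value."""
    values = Q[state]
    return max(range(len(values)), key=lambda a: values[a])


for state in Q:
    action = greedy_action(Q, state)
    print(state, "->", ACTION_NAMES[action])
```

The same lookup works on the `Q` returned by `mc_control_epsilon_greedy`, since each entry there is also indexed by action number.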

In [4]:
if __name__ == "__main__":
    env = gym.make("Blackjack-v1")

    # Random policy for prediction
    def random_policy(state):
        return np.random.choice(env.action_space.n)

    # Prediction
    V = mc_prediction(random_policy, env, num_episodes=50000)
    print("State-Value Function (V):", V)

    # Control
    Q, optimal_policy = mc_control_epsilon_greedy(env, num_episodes=50000)
    print("Action-Value Function (Q):", Q)
    print("Optimal Policy:", optimal_policy)
