<a href="https://colab.research.google.com/github/issaouimarwa/NewRepo/blob/master/reinforcementlearning_implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os

### Markov Decision Processes (MDPs)
A Markov Decision Process is a mathematical model for decision-making in situations where an agent interacts with an environment. It is characterized by the following components:

- State Space (S): The set of all possible situations or configurations.
- Action Space (A): The set of all possible actions the agent can take.
- Transition Probabilities (P): The probability of transitioning from one state to another given a particular action.
- Reward Function (R): The immediate reward the agent receives after taking an action in a particular state.
- Policy (π): A strategy or a mapping from states to actions, defining the agent's behavior.

In an MDP, the Markov property holds, which means that the future state depends only on the current state and action, not on the sequence of events that preceded them.
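
As a rough illustration of these components, the sketch below encodes a tiny MDP as a nested dictionary and checks that its transition probabilities are well formed. The two-state dynamics, the tuple layout `(probability, next_state, reward)`, and the helper name `is_valid_mdp` are assumptions made up for this example.

```python
# Hypothetical two-state, two-action MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)],                   # action 0 in state 0: stay put
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},   # action 1: usually reach state 1
    1: {0: [(1.0, 0, 0.0)],                   # action 0 in state 1: go back
        1: [(1.0, 1, 0.5)]},                  # action 1: stay and collect 0.5
}

def is_valid_mdp(P, tol=1e-9):
    """Check that outgoing probabilities sum to 1 for every (state, action) pair."""
    return all(
        abs(sum(prob for prob, _, _ in outcomes) - 1.0) < tol
        for actions in P.values()
        for outcomes in actions.values()
    )

print(is_valid_mdp(P))
```

Because each entry depends only on the current state and action, a table like this is enough to describe the dynamics, which is the Markov property in practice.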


### Bellman Equations
 - The Bellman Equations are a set of recursive equations that describe the relationship between the value of a state or state-action pair and the values of its successor states. They play a crucial role in dynamic programming and reinforcement learning.

### Bellman Expectation Equation (for Action Values)

\( Q(s, a) = \sum_{s', r} P(s', r|s, a) [r + \gamma \sum_{a'} \pi(a'|s') Q(s', a')] \)
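
For action values, the inner sum over next actions equals the state value of the successor, so an action value can be computed from state values with a one-step lookahead: \( Q(s, a) = \sum_{s', r} P(s', r|s, a) [r + \gamma V(s')] \). The sketch below evaluates that expression; the dynamics, the values in `V`, and the function name `action_value` are illustrative assumptions.

```python
def action_value(P, V, s, a, gamma=0.9):
    """One-step lookahead: expected reward plus discounted value of the successor."""
    return sum(prob * (reward + gamma * V[s_next]) for prob, s_next, reward in P[s][a])

# Hypothetical dynamics for a single state-action pair:
# 50% chance to stay in state 0 with reward 1, 50% chance to reach state 1 with reward 0.
P = {0: {0: [(0.5, 0, 1.0), (0.5, 1, 0.0)]}}
V = [2.0, 4.0]  # assumed state values under some policy

q = action_value(P, V, s=0, a=0)
print(q)
```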

# Summary

- **Reinforcement Learning:** A machine learning paradigm where an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or punishments.

- **Markov Decision Processes (MDPs):** A mathematical model for decision-making characterized by state space, action space, transition probabilities, reward function, and policy. It follows the Markov property.

- **Bellman Equations:** Recursive equations describing the relationship between the value of a state or state-action pair and the values of its successor states. The Bellman Expectation Equation is crucial in reinforcement learning for both state values and action values.


# Dynamic Programming Policy Evaluation

Dynamic Programming (DP) is a technique used in reinforcement learning to solve problems by breaking them down into smaller subproblems and solving them iteratively. Dynamic Programming Policy Evaluation is a specific application of DP in the context of evaluating a policy in a Markov Decision Process (MDP).

## Markov Decision Process (MDP)

A Markov Decision Process is a mathematical model used to describe decision-making situations where an agent interacts with an environment. It consists of:

- **States (S):** Possible situations the agent can be in.
- **Actions (A):** Possible moves or decisions the agent can make.
- **Transition Probabilities (P):** Probabilities of moving from one state to another after taking an action.
- **Rewards (R):** Immediate rewards associated with state-action pairs.
- **Policy (π):** A strategy or plan that defines the agent's behavior.

## Policy Evaluation

Policy Evaluation is the process of determining the expected return of a given policy in an MDP. Dynamic Programming Policy Evaluation focuses on iteratively estimating the state values for a given policy. The state value is the expected return when starting in a particular state and following the given policy.
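
To make "expected return" concrete, the return of a single trajectory is the discounted sum of its rewards, and the state value is the average of that quantity over trajectories. A minimal sketch, with a made-up reward sequence and a hypothetical helper name `discounted_return`:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the k-th reward is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (return from the next step)
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Three steps with reward 1 each: 1 + 0.9 + 0.81
print(discounted_return([1.0, 1.0, 1.0]))
```

The backward recursion in the loop is the same structure the Bellman equation expresses in expectation.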

### Algorithm

The iterative process for Dynamic Programming Policy Evaluation can be described as follows:

1. **Initialize:** Set the initial estimates for all state values arbitrarily.
2. **Iterate:** Update the value of each state using the Bellman Expectation Equation.
   - Bellman Expectation Equation: \( V(s) = \sum_{a} \pi(a|s) \sum_{s', r} P(s', r|s, a) [r + \gamma V(s')] \)
   - \( V(s) \): Value of state \( s \)
   - \( \pi(a|s) \): Probability of taking action \( a \) in state \( s \) according to the policy \( \pi \)
   - \( P(s', r|s, a) \): Transition probability to reach state \( s' \) with reward \( r \) after taking action \( a \) in state \( s \)
   - \( \gamma \): Discount factor for future rewards.
3. **Repeat:** Sweep over all states again until the largest change in any state value during a sweep falls below a predefined threshold.
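
For small problems, the fixed point that these sweeps approach can also be obtained directly, because the Bellman Expectation Equation is linear in \( V \): \( V = R_\pi + \gamma P_\pi V \), so \( V = (I - \gamma P_\pi)^{-1} R_\pi \). The sketch below solves that system with NumPy. The matrix `P_pi` (state-to-state probabilities under the policy) and vector `R_pi` (expected immediate reward per state) are made-up numbers, and `evaluate_policy_exact` is a hypothetical helper name.

```python
import numpy as np

def evaluate_policy_exact(P_pi, R_pi, gamma=0.9):
    """Solve (I - gamma * P_pi) V = R_pi for the state values of a fixed policy."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# Hypothetical two-state chain under some fixed policy
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
R_pi = np.array([1.0, 0.0])

V = evaluate_policy_exact(P_pi, R_pi)
print(V)
```

The direct solve costs roughly cubic time in the number of states, which is one reason the iterative version is preferred as the state space grows.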

### Example

Consider a grid-world environment where an agent can move left, right, up, or down. The overall goal is to find the policy that maximizes the expected return. Policy evaluation alone does not produce that policy; it estimates the state values under a given policy, and those values are what a later improvement step uses to choose better actions.

Dynamic Programming Policy Evaluation is a fundamental step in reinforcement learning, laying the groundwork for more advanced algorithms such as Policy Iteration and Value Iteration.





# Policy Evaluation in Gridworld

This code implements policy evaluation for a simple gridworld environment using dynamic programming. The gridworld is a classic example where an agent navigates a grid, receiving rewards and updating its state values based on a given policy.

## Gridworld Environment

The `GridworldEnv` class represents the gridworld environment. It has the following attributes and methods:

- `shape`: Tuple representing the gridworld dimensions.
- `nS`: Number of states in the gridworld.
- `nA`: Number of actions (left, right, up, down).
- `P`: Transition dynamics, stored as a dictionary; `P[s][a]` is a list of `(prob, next_state, reward, done)` tuples.

The `_build_transition_matrix` method initializes the transition probabilities for each state-action pair.
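
When filling such a table, the next state has to be expressed as a state index rather than a row or column number. The sketch below shows one way to do that conversion, assuming the row-major numbering `s = i * cols + j` and the action order left, right, up, down; the function name `next_state` is hypothetical and is not part of `GridworldEnv`.

```python
def next_state(s, a, shape=(4, 4)):
    """Return the state index reached from s by action a (0=left, 1=right, 2=up, 3=down)."""
    rows, cols = shape
    i, j = divmod(s, cols)          # state index -> (row, col)
    if a == 0:
        j = max(j - 1, 0)           # left, clipped at the first column
    elif a == 1:
        j = min(j + 1, cols - 1)    # right, clipped at the last column
    elif a == 2:
        i = max(i - 1, 0)           # up, clipped at the first row
    elif a == 3:
        i = min(i + 1, rows - 1)    # down, clipped at the last row
    return i * cols + j             # (row, col) -> state index

# From state 5 (row 1, col 1): left, right, up, down
print([next_state(5, a) for a in range(4)])  # [4, 6, 1, 9]
```

A move that would leave the grid returns the same index, so `next_state(s, a) == s` is a simple test for hitting a wall.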

## Policy Evaluation Function

The `policy_evaluation` function takes a policy, the gridworld environment, and optional parameters:

- `policy`: 2D array of shape `(nS, nA)`; `policy[s][a]` is the probability of taking action `a` in state `s`.
- `env`: Gridworld environment.
- `theta`: Convergence threshold.
- `discount_factor`: Discount factor for future rewards.

The function iteratively updates the state values using the Bellman Expectation Equation until convergence.

## Example Usage

In the example usage section:

1. An instance of the `GridworldEnv` class is created, representing a 4x4 gridworld.
2. A random policy is defined (uniform distribution over actions).
3. Policy evaluation is performed using the `policy_evaluation` function.
4. The resulting state values are printed, reshaped to match the gridworld shape.

This code serves as a basic template for policy evaluation in a gridworld environment. It can be extended or modified based on specific gridworld scenarios or other reinforcement learning environments.


In [None]:
# Policy Evaluation in Python (Gridworld)

class GridworldEnv:
    def __init__(self):
        self.shape = (4, 4)
        self.nS = np.prod(self.shape)  # Number of states
        self.nA = 4  # Number of actions (left, right, up, down)
        self.P = self._build_transition_matrix()  # Transition probabilities

    def _build_transition_matrix(self):
        P = {}

        for s in range(self.nS):
            P[s] = {a: [] for a in range(self.nA)}

        def add_transition(s, a, s_, prob, reward, done):
            P[s][a].append((prob, s_, reward, done))

        for i in range(self.shape[0]):
            for j in range(self.shape[1]):
                s = i * self.shape[1] + j

                # Define possible actions (left, right, up, down)
                for a in range(self.nA):
                    # Move within the grid; a move off the edge leaves the agent in place
                    ni, nj = i, j
                    if a == 0:  # Left
                        nj = max(j - 1, 0)
                    elif a == 1:  # Right
                        nj = min(j + 1, self.shape[1] - 1)
                    elif a == 2:  # Up
                        ni = max(i - 1, 0)
                    elif a == 3:  # Down
                        ni = min(i + 1, self.shape[0] - 1)
                    s_ = ni * self.shape[1] + nj  # Convert (row, col) back to a state index

                    # Probabilities, next state, reward, done
                    # Assumed intent: moves are deterministic, and bumping into a wall
                    # (next state equals current state) costs -1; all other moves cost 0
                    prob = 1.0
                    reward = -1.0 if s_ == s else 0.0
                    done = False

                    add_transition(s, a, s_, prob, reward, done)

        return P

def policy_evaluation(policy, env, theta=1e-6, discount_factor=0.9):
    num_states = env.nS

    # Initialize state values arbitrarily
    V = np.zeros(num_states)

    while True:
        delta = 0
        for s in range(num_states):
            v = 0
            # Accumulate value for each action in the policy
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.P[s][a]:
                    v += action_prob * prob * (reward + discount_factor * V[next_state])

            # Track the largest change in this sweep
            delta = max(delta, np.abs(v - V[s]))
            V[s] = v

        # Check for convergence
        if delta < theta:
            break

    return V

# use:
if __name__ == "__main__":
    env = GridworldEnv()

    # Define a random policy (uniform distribution over actions)
    random_policy = np.ones([env.nS, env.nA]) / env.nA

    # Perform policy evaluation
    state_values = policy_evaluation(random_policy, env)

    # Print the resulting state values, rounded to two decimals for readability
    print("State Values:")
    print(np.round(state_values.reshape(env.shape), 2))


State Values:
[[-2.95 -2.5  -2.5  -2.95]
 [-2.5  -2.05 -2.05 -2.5 ]
 [-2.5  -2.05 -2.05 -2.5 ]
 [-2.95 -2.5  -2.5  -2.95]]


# Policy Iteration in Gridworld

Policy Iteration is an iterative algorithm used in reinforcement learning to find the optimal policy for a given Markov Decision Process (MDP). It alternates between two steps: policy evaluation and policy improvement.

## Gridworld Environment

The `GridworldEnv` class represents the gridworld environment. It includes the grid shape, the number of states (`nS`), the number of actions (`nA`), the transition probabilities (`P`), and the current policy.

## Policy Evaluation

The `policy_evaluation` function iteratively estimates the state values under the current policy. It applies the Bellman Expectation Equation until the values converge within a specified threshold (`theta`).

## Policy Improvement

The `policy_improvement` function takes the current policy, the environment, the state values, and the discount factor as inputs. For each state it greedily selects the action with the highest one-step expected return under those state values, which yields a deterministic policy that is at least as good as the current one.
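
The greedy step can be sketched on a toy problem as follows. The two-state dynamics and the values in `V` are made up for illustration, `greedy_policy` is a hypothetical helper name, and the tuple layout `(prob, next_state, reward, done)` is assumed to match the one used by `GridworldEnv`.

```python
import numpy as np

def greedy_policy(P, V, n_actions, gamma=0.9):
    """Return a one-hot policy that picks the best one-step lookahead action per state."""
    n_states = len(P)
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        # Expected return of each action under the current value estimates
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][a])
             for a in range(n_actions)]
        policy[s, int(np.argmax(q))] = 1.0  # ties go to the lowest action index
    return policy

# Hypothetical dynamics: action 0 stays in place, action 1 moves to the other state
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 0, 0.0, False)]},
}
V = np.array([0.0, 1.0])  # assumed values: state 1 is the better place to be

print(greedy_policy(P, V, n_actions=2))
```

With these values, state 0 chooses to move (action 1) and state 1 chooses to stay (action 0).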

## Policy Iteration Algorithm

The `policy_iteration` function combines policy evaluation and improvement. It iteratively performs policy evaluation and improvement until the policy converges. The algorithm stops when the policy no longer changes.

### Example Usage

In the example usage section:

1. An instance of the `GridworldEnv` class is created, representing a 4x4 gridworld.
2. Policy Iteration is performed using the `policy_iteration` function.
3. The resulting optimal policy and state values are printed.

This code provides a basic template for understanding and implementing Policy Iteration in a gridworld environment. Adjustments can be made based on specific gridworld scenarios or other reinforcement learning environments.


In [None]:
# Policy Iteration in Python (Gridworld)
import numpy as np

class GridworldEnv:
    def __init__(self):
        self.shape = (4, 4)
        self.nS = np.prod(self.shape)  # Number of states
        self.nA = 4  # Number of actions (left, right, up, down)
        self.P = self._build_transition_matrix()  # Transition probabilities
        self.policy = np.ones([self.nS, self.nA]) / self.nA  # Initialize a random policy

    def _build_transition_matrix(self):
        P = {}

        for s in range(self.nS):
            P[s] = {a: [] for a in range(self.nA)}

        def add_transition(s, a, s_, prob, reward, done):
            P[s][a].append((prob, s_, reward, done))

        for i in range(self.shape[0]):
            for j in range(self.shape[1]):
                s = i * self.shape[1] + j

                # Define possible actions (left, right, up, down)
                for a in range(self.nA):
                    # Move within the grid; a move off the edge leaves the agent in place
                    ni, nj = i, j
                    if a == 0:  # Left
                        nj = max(j - 1, 0)
                    elif a == 1:  # Right
                        nj = min(j + 1, self.shape[1] - 1)
                    elif a == 2:  # Up
                        ni = max(i - 1, 0)
                    elif a == 3:  # Down
                        ni = min(i + 1, self.shape[0] - 1)
                    s_ = ni * self.shape[1] + nj  # Convert (row, col) back to a state index

                    # Probabilities, next state, reward, done
                    # Assumed intent: moves are deterministic, and bumping into a wall
                    # (next state equals current state) costs -1; all other moves cost 0
                    prob = 1.0
                    reward = -1.0 if s_ == s else 0.0
                    done = False

                    add_transition(s, a, s_, prob, reward, done)

        return P

def policy_evaluation(policy, env, theta=1e-6, discount_factor=0.9):
    num_states = env.nS

    # Initialize state values arbitrarily
    V = np.zeros(num_states)

    while True:
        delta = 0
        for s in range(num_states):
            v = 0
            # Accumulate value for each action in the policy
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.P[s][a]:
                    v += action_prob * prob * (reward + discount_factor * V[next_state])

            # Track the largest change in this sweep
            delta = max(delta, np.abs(v - V[s]))
            V[s] = v

        # Check for convergence
        if delta < theta:
            break

    return V

def policy_improvement(policy, env, V, discount_factor=0.9):
    num_states = env.nS
    num_actions = env.nA

    # Start from an all-zero table; the greedy action of each state gets probability 1
    new_policy = np.zeros([num_states, num_actions])

    for s in range(num_states):
        # One-step lookahead: expected return of each action under the current value function
        action_values = [
            sum(prob * (reward + discount_factor * V[next_state])
                for prob, next_state, reward, _ in env.P[s][a])
            for a in range(num_actions)
        ]
        # Find the best action (argmax); ties go to the lowest action index
        best_action = np.argmax(action_values)
        new_policy[s][best_action] = 1.0

    return new_policy

def policy_iteration(env, theta=1e-6, discount_factor=0.9, max_iterations=1000):
    for i in range(max_iterations):
        # Policy Evaluation
        V = policy_evaluation(env.policy, env, theta, discount_factor)

        # Policy Improvement
        new_policy = policy_improvement(env.policy, env, V, discount_factor)

        # Check if the policy has converged
        if np.array_equal(new_policy, env.policy):
            break

        # Update the policy
        env.policy = new_policy

    return env.policy, V

# Example usage:
if __name__ == "__main__":
    env = GridworldEnv()

    # Perform Policy Iteration
    optimal_policy, optimal_values = policy_iteration(env)

    # Print the resulting optimal policy and state values
    print("Optimal Policy:")
    print(optimal_policy)
    print("\nOptimal State Values:")
    print(optimal_values.reshape(env.shape))


Optimal Policy:
[[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]]

Optimal State Values:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


# Value Iteration in Gridworld

Value Iteration is another iterative algorithm used in reinforcement learning to find the optimal policy for a Markov Decision Process (MDP). It repeatedly updates the state values with the Bellman Optimality Equation until they converge, and then extracts the optimal policy from the converged values in a single greedy pass.

## Gridworld Environment

The `GridworldEnv` class represents the gridworld environment. It includes the grid shape, the number of states (`nS`), the number of actions (`nA`), and the transition probabilities (`P`).

## Value Iteration Function

The `value_iteration` function takes the gridworld environment, convergence threshold (`theta`), discount factor (`discount_factor`), and the maximum number of iterations (`max_iterations`) as inputs. It iteratively updates the state values until convergence and extracts the optimal policy.

- `num_states`: Number of states in the gridworld.
- `num_actions`: Number of actions (left, right, up, down).
- `V`: Array representing the state values.
- `delta`: Convergence measure.

The algorithm stops when the change in state values is smaller than the specified threshold.
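
For reference, the update applied to each state in a sweep can be written in the same notation as the Bellman Expectation Equation, with the average over the policy's actions replaced by a maximum. This is the standard form of the Bellman optimality backup, shown here as a sketch rather than as a transcription of the code:

\( V(s) \leftarrow \max_{a} \sum_{s', r} P(s', r|s, a) [r + \gamma V(s')] \)

Once the values have converged, a greedy policy is read off with the same one-step lookahead:

\( \pi(s) = \arg\max_{a} \sum_{s', r} P(s', r|s, a) [r + \gamma V(s')] \)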

### Example Usage

In the example usage section:

1. An instance of the `GridworldEnv` class is created, representing a 4x4 gridworld.
2. Value Iteration is performed using the `value_iteration` function.
3. The resulting optimal policy and state values are printed.

This code serves as a basic template for understanding and implementing Value Iteration in a gridworld environment. Adjustments can be made based on specific gridworld scenarios or other reinforcement learning environments.


In [None]:
# Implement Value Iteration in Python (Gridworld)
class GridworldEnv:
    def __init__(self):
        self.shape = (4, 4)
        self.nS = np.prod(self.shape)  # Number of states
        self.nA = 4  # Number of actions (left, right, up, down)
        self.P = self._build_transition_matrix()  # Transition probabilities

    def _build_transition_matrix(self):
        P = {}

        for s in range(self.nS):
            P[s] = {a: [] for a in range(self.nA)}

        def add_transition(s, a, s_, prob, reward, done):
            P[s][a].append((prob, s_, reward, done))

        for i in range(self.shape[0]):
            for j in range(self.shape[1]):
                s = i * self.shape[1] + j

                # Define possible actions (left, right, up, down)
                for a in range(self.nA):
                    # Move within the grid; a move off the edge leaves the agent in place
                    ni, nj = i, j
                    if a == 0:  # Left
                        nj = max(j - 1, 0)
                    elif a == 1:  # Right
                        nj = min(j + 1, self.shape[1] - 1)
                    elif a == 2:  # Up
                        ni = max(i - 1, 0)
                    elif a == 3:  # Down
                        ni = min(i + 1, self.shape[0] - 1)
                    s_ = ni * self.shape[1] + nj  # Convert (row, col) back to a state index

                    # Probabilities, next state, reward, done
                    # Assumed intent: moves are deterministic, and bumping into a wall
                    # (next state equals current state) costs -1; all other moves cost 0
                    prob = 1.0
                    reward = -1.0 if s_ == s else 0.0
                    done = False

                    add_transition(s, a, s_, prob, reward, done)

        return P

def value_iteration(env, theta=1e-6, discount_factor=0.9, max_iterations=1000):
    num_states = env.nS
    num_actions = env.nA

    def action_values(s, V):
        # One-step lookahead: expected return of every action available in state s
        return [
            sum(prob * (reward + discount_factor * V[next_state])
                for prob, next_state, reward, _ in env.P[s][a])
            for a in range(num_actions)
        ]

    # Initialize state values arbitrarily
    V = np.zeros(num_states)

    for i in range(max_iterations):
        delta = 0
        for s in range(num_states):
            # Compute the new value for each state using the Bellman Optimality Equation
            v = max(action_values(s, V))

            delta = max(delta, np.abs(v - V[s]))
            V[s] = v

        # Check for convergence
        if delta < theta:
            break

    # Extract and return the optimal policy from the computed state values
    optimal_policy = np.zeros([num_states, num_actions])
    for s in range(num_states):
        # Ties go to the lowest action index
        best_action = np.argmax(action_values(s, V))
        optimal_policy[s][best_action] = 1.0

    return optimal_policy, V

# Example usage:
if __name__ == "__main__":
    env = GridworldEnv()

    # Perform Value Iteration
    optimal_policy, optimal_values = value_iteration(env)

    # Print the resulting optimal policy and state values
    print("Optimal Policy:")
    print(optimal_policy)
    print("\nOptimal State Values:")
    print(optimal_values.reshape(env.shape))


Optimal Policy:
[[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]]

Optimal State Values:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


# Gambler's Problem using Value Iteration

The Gambler's Problem is a classic reinforcement learning scenario where a gambler aims to reach a target amount of money by making bets on a sequence of coin flips. On each flip the gambler stakes part of the current capital, winning the stake on heads and losing it on tails, and the game ends when the capital reaches the target or zero. The goal is to find the betting policy that maximizes the probability of reaching the target.

## Value Iteration Function

The `value_iteration_gamblers_problem` function performs the Value Iteration algorithm for the Gambler's Problem. It takes the target amount, the probability of heads (`p_heads`), convergence threshold (`theta`), and discount factor (`discount_factor`) as inputs. The function iteratively updates the state values and extracts the optimal policy.

- `calculate_reward`: Function to calculate the reward for a state-action pair.
- `calculate_next_state`: Function to calculate the next state based on the current state and action.
- The main loop updates state values until convergence.

## Plotting Function

The `plot_gamblers_problem_solution` function is used to visualize the results. It plots the value function and the optimal policy against the gambler's capital.

## Example Usage

In the example usage section:

1. Set the target amount (`target_amount`) and call `value_iteration_gamblers_problem` to obtain the optimal policy and value function.
2. Print the optimal policy and plot the results using `plot_gamblers_problem_solution`.

This code provides a basic implementation of the Gambler's Problem using the Value Iteration algorithm. It is a useful template for understanding and experimenting with reinforcement learning in this specific scenario. Feel free to modify parameters or explore additional features based on your needs.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

def value_iteration_gamblers_problem(target, p_heads=0.4, theta=1e-9, discount_factor=1.0):
    max_states = target + 1
    V = np.zeros(max_states)
    policy = np.zeros(max_states)

    def calculate_reward(state, action):
        # Reward is 1 if the target is reached, else 0
        return 1.0 if state + action == target else 0.0

    def calculate_next_state(state, action):
        # Capital after the flip: pass +stake for a win (heads), -stake for a loss (tails).
        # The outcome is not sampled here, because the update below already weights
        # both outcomes by their probabilities.
        return state + action

    while True:
        delta = 0
        for s in range(1, target):  # Exclude states 0 and target (terminal states)
            v = V[s]
            max_value = 0
            for a in range(1, min(s, target - s) + 1):  # Possible stakes
                next_state_value = p_heads * (calculate_reward(s, a) + discount_factor * V[calculate_next_state(s, a)]) + \
                                   (1 - p_heads) * (calculate_reward(s, -a) + discount_factor * V[calculate_next_state(s, -a)])
                if next_state_value > max_value:
                    max_value = next_state_value

            V[s] = max_value
            delta = max(delta, np.abs(v - V[s]))

        if delta < theta:
            break

    # Extract the optimal policy
    for s in range(1, target):
        max_action = 0
        max_value = 0
        for a in range(1, min(s, target - s) + 1):
            next_state_value = p_heads * (calculate_reward(s, a) + discount_factor * V[calculate_next_state(s, a)]) + \
                               (1 - p_heads) * (calculate_reward(s, -a) + discount_factor * V[calculate_next_state(s, -a)])
            if next_state_value > max_value:
                max_value = next_state_value
                max_action = a

        policy[s] = max_action

    return policy, V

def plot_gamblers_problem_solution(policy, value_function):
    plt.figure(figsize=(10, 5))

    plt.subplot(2, 1, 1)
    plt.plot(range(1, len(value_function) - 1), value_function[1:-1], marker='o')
    plt.title("Value Function")
    plt.xlabel("Capital")
    plt.ylabel("Value")

    plt.subplot(2, 1, 2)
    plt.scatter(range(1, len(policy) - 1), policy[1:-1], marker='o')
    plt.title("Optimal Policy")
    plt.xlabel("Capital")
    plt.ylabel("Stake")

    plt.tight_layout()
    plt.show()

# Example usage:
if __name__ == "__main__":
    target_amount = 100
    optimal_policy, optimal_value_function = value_iteration_gamblers_problem(target_amount)

    # Print and plot the results
    print("Optimal Policy:")
    print(optimal_policy[1:-1])  # Exclude states 0 and target
    plot_gamblers_problem_solution(optimal_policy, optimal_value_function)


- Value Iteration can be computationally expensive, especially for larger state spaces. If the code is taking longer than expected to run, you might consider a few optimizations:

    - Parallelization: If your machine has multiple cores, you can consider parallelizing the loop that iterates over states. The joblib library is a popular choice for parallelizing loops in Python.

    - Numpy Vectorization: Numpy operations are highly optimized and can be faster than using explicit loops. Try to use vectorized operations wherever possible.

    - Memory Efficiency: Ensure that you are not using excessive memory. If memory becomes a bottleneck, consider more memory-efficient data structures.

    - Convergence Threshold: You might try adjusting the convergence threshold (theta). A smaller threshold might lead to more precise results but may require more iterations.
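
As one possible illustration of the vectorization point, the sketch below computes the backup for a single capital level over all stakes at once using NumPy array indexing instead of an inner Python loop. It assumes a discount factor of 1 and a reward of 1 only when the target is reached; the function name `backup_state_vectorized` is hypothetical.

```python
import numpy as np

def backup_state_vectorized(V, s, target, p_heads=0.4):
    """Best expected value over all legal stakes for capital s (discount factor 1 assumed)."""
    stakes = np.arange(1, min(s, target - s) + 1)
    # Reward of 1 only for stakes that land exactly on the target when the flip is won
    win_reward = (s + stakes == target).astype(float)
    # Expected value of every stake in one array expression
    action_values = p_heads * (win_reward + V[s + stakes]) + (1 - p_heads) * V[s - stakes]
    return action_values.max()

# With all values still zero, only the stake that reaches the target has nonzero value
V = np.zeros(11)
print(backup_state_vectorized(V, s=5, target=10))
```

The arithmetic matches the loop-based update; only the iteration over stakes has moved into NumPy.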

In [None]:
# value_iteration_gamblers_problem function using joblib to parallelize the per-state updates
import numpy as np
import matplotlib.pyplot as plt
from joblib import Parallel, delayed

def value_iteration_gamblers_problem_parallel(target, p_heads=0.4, theta=1e-9, discount_factor=1.0, num_cores=1):
    max_states = target + 1
    V = np.zeros(max_states)
    policy = np.zeros(max_states)

    def calculate_reward(state, action):
        # Reward is 1 if the target is reached, else 0
        return 1.0 if state + action == target else 0.0

    def calculate_next_state(state, action):
        # Capital after the flip: +stake for a win, -stake for a loss.
        # Not sampled at random, because stake_value weights both outcomes itself.
        return state + action

    def stake_value(s, a):
        # Expected value of staking a with capital s
        return p_heads * (calculate_reward(s, a) + discount_factor * V[calculate_next_state(s, a)]) + \
               (1 - p_heads) * (calculate_reward(s, -a) + discount_factor * V[calculate_next_state(s, -a)])

    def update_state(s):
        max_value = 0
        for a in range(1, min(s, target - s) + 1):
            max_value = max(max_value, stake_value(s, a))

        return max_value

    while True:
        delta = 0

        # Parallelize the state update loop; every state is updated from the values
        # of the previous sweep, since V is only written after all results are back
        updated_values = Parallel(n_jobs=num_cores)(delayed(update_state)(s) for s in range(1, target))

        for s, updated_value in zip(range(1, target), updated_values):
            v = V[s]
            V[s] = updated_value
            delta = max(delta, np.abs(v - V[s]))

        if delta < theta:
            break

    # Extract the optimal policy
    for s in range(1, target):
        # np.argmax returns a list position; stakes start at 1, so shift by one
        max_action = int(np.argmax([stake_value(s, a) for a in range(1, min(s, target - s) + 1)])) + 1
        policy[s] = max_action

    return policy, V

def plot_gamblers_problem_solution(policy, value_function):
    # Defined in this cell as well so that the example below runs on its own
    plt.figure(figsize=(10, 5))

    plt.subplot(2, 1, 1)
    plt.plot(range(1, len(value_function) - 1), value_function[1:-1], marker='o')
    plt.title("Value Function")
    plt.xlabel("Capital")
    plt.ylabel("Value")

    plt.subplot(2, 1, 2)
    plt.scatter(range(1, len(policy) - 1), policy[1:-1], marker='o')
    plt.title("Optimal Policy")
    plt.xlabel("Capital")
    plt.ylabel("Stake")

    plt.tight_layout()
    plt.show()

# Example usage:
if __name__ == "__main__":
    target_amount = 100
    optimal_policy, optimal_value_function = value_iteration_gamblers_problem_parallel(target_amount, num_cores=2)

    # Print and plot the results
    print("Optimal Policy:")
    print(optimal_policy[1:-1])  # Exclude states 0 and target
    plot_gamblers_problem_solution(optimal_policy, optimal_value_function)
