<center>
<h2 style="color:blue;font-size:30px;">Artificial Intelligence CS-414</h2>
<h3 style="color:purple">Assignment 4</h3>
 </center>

<br>

<p>Consider the Volcano Crossing problem discussed in class. Define your own instance of the problem, using a different grid size, different rewards for the end states, and so on. You are free to define the problem, its states, its actions, etc. Solve the problem using the following algorithms.</p>
<ul style="color:purple">
<li>Model-free Monte Carlo</li>
<li>SARSA</li>
<li>Q-Learning</li>
</ul>
<ol style="color:green">
<li>Run the algorithms using different numbers of episodes generated by a uniformly random policy, and show the Q-values and average utility.</li>
<li>Use different slip probabilities ranging from 0.0 to 0.3 and show the results for each algorithm.</li>
<li>Use an epsilon-greedy policy to generate episodes, mixing the uniformly random policy (for exploration) with the policy that chooses the best action (for exploitation).</li>
<li>Write a 2-3 page report explaining your code and results.</li>
</ol>
<p>Develop a user-friendly GUI application in which the user can choose the appropriate options, e.g. slip probability, epsilon value, number of episodes, etc.</p>

In [2]:
import numpy as np

<h3 style="color:purple">MODEL-FREE MONTE CARLO</h3>

In [3]:
import numpy as np

# Function to initialize the environment
def initialize_environment(grid_size, start_state, safe_end_states, dangerous_end_state, slip_prob):
    # Initialize the grid world
    grid = np.zeros(grid_size)
    
    # Set rewards for end states
    for state in safe_end_states:
        grid[state[0], state[1]] = 20  # Access elements using [row, column]
    grid[dangerous_end_state[0], dangerous_end_state[1]] = -50
    
    return grid

In [4]:
# Function to get the next state based on the action and slip probability
def get_next_state(current_state, action, slip_prob, grid):
    # Define possible directions
    directions = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
    
    # Convert integer action index to string
    action_str = list(directions.keys())[action]
    
    # Check if slip occurs based on slip probability
    if np.random.rand() < slip_prob:
        # Slip: move in a random direction other than the intended one
        possible_actions = list(directions.keys())
        possible_actions.remove(action_str)  # Exclude the intended direction
        random_action = np.random.choice(possible_actions)
        next_state = tuple(np.array(current_state) + np.array(directions[random_action]))
    else:
        # No slip: Move in the chosen direction
        next_state = tuple(np.array(current_state) + np.array(directions[action_str]))
    
    # Ensure the next state is within the grid
    next_state = (
        max(0, min(next_state[0], grid.shape[0] - 1)),
        max(0, min(next_state[1], grid.shape[1] - 1))
    )
    
    return next_state

# Function to get the reward for a given state
def get_reward(state, grid):
    # Return the reward for the given state from the grid
    return grid[state]
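
<p>The slip model above can also be written out as an exact probability table. The following is a minimal, dependency-free sketch, not part of the notebook: <code>transition_probs</code> and <code>DIRECTIONS</code> are hypothetical names, and the sketch assumes, as in <code>get_next_state</code>, that a slip picks one of the three other directions uniformly and that moves off the grid are clamped to the border.</p>

```python
# Hypothetical helper (not part of the notebook): exact next-state
# probabilities for one state-action pair under the slip model above.
DIRECTIONS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def transition_probs(state, action, slip_prob, grid_shape):
    """Return {next_state: probability} for taking `action` in `state`."""
    rows, cols = grid_shape
    probs = {}
    for name, (dr, dc) in DIRECTIONS.items():
        # Intended direction keeps 1 - slip_prob; the other three share slip_prob
        p = 1.0 - slip_prob if name == action else slip_prob / 3.0
        # Clamp to the grid, so moving into a wall leaves the agent in place
        r = max(0, min(state[0] + dr, rows - 1))
        c = max(0, min(state[1] + dc, cols - 1))
        probs[(r, c)] = probs.get((r, c), 0.0) + p
    return probs


probs = transition_probs((2, 1), "E", 0.3, (4, 4))
for next_state, p in sorted(probs.items()):
    print(next_state, round(p, 3))
```

<p>In a corner, several directions collapse onto the same cell, so the probability of staying in place is the sum of their shares.</p>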

In [5]:
# Function to perform epsilon-greedy action selection
def epsilon_greedy(Q, state, epsilon, num_actions):
    if np.random.rand() < epsilon:
        # Exploration: Choose a random action
        return np.random.choice(num_actions)
    else:
        # Exploitation: Choose the action with the highest Q-value
        return np.argmax(Q[state])
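
<p>To make the exploration/exploitation split concrete, the sketch below computes the probability with which epsilon-greedy selection picks each action. It is an illustration only: <code>epsilon_greedy_probs</code> is a hypothetical helper, and it assumes the Q-values of a single state are given as a plain list with one entry per action.</p>

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Probability of picking each action under epsilon-greedy selection."""
    n = len(q_values)
    # Greedy action, breaking ties by lowest index (as np.argmax does)
    best = max(range(n), key=lambda a: q_values[a])
    # Exploration spreads epsilon uniformly over all actions, greedy one included
    probs = [epsilon / n] * n
    # Exploitation adds the remaining 1 - epsilon to the greedy action
    probs[best] += 1.0 - epsilon
    return probs


print(epsilon_greedy_probs([0.0, 5.0, 1.0, -2.0], 0.1))
```

<p>With epsilon = 1 every action is equally likely (the uniformly random policy); with epsilon = 0 the greedy action is always chosen.</p>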

In [6]:
# Function to perform Monte Carlo sampling with epsilon-greedy exploration
def monte_carlo(grid, slip_prob, num_episodes, epsilon):
    # Initialize the value table. It has the grid's shape, so it holds one
    # value per state rather than one per state-action pair.
    Q = np.zeros_like(grid)
    visit_count = np.zeros_like(grid)
    
    # Loop over episodes
    for episode in range(num_episodes):
        # Initialize the episode
        episode_states = []
        episode_actions = []
        episode_rewards = []
    
        # Starting state
        current_state = (2, 1)
    
        # Generate an episode using an epsilon-greedy policy
        while True:
            # Choose action using epsilon-greedy strategy.
            # Q[current_state] is a scalar here, so the greedy branch of
            # epsilon_greedy always returns action 0 (N).
            num_actions = 4  # One action per direction: N, E, S, W
            action = epsilon_greedy(Q, current_state, epsilon, num_actions)
        
            # Store current state and action
            episode_states.append(current_state)
            episode_actions.append(action)
        
            # Determine the next state based on the action and slip probability
            next_state = get_next_state(current_state, action, slip_prob, grid)
        
            # Get the reward for the next state
            reward = get_reward(next_state, grid)
            episode_rewards.append(reward)
        
            # Update the current state
            current_state = next_state
        
            # Check if the episode has ended
            if next_state in [(0, 3), (2, 3)]:
                break


        
        # Update Q-values based on the observed returns
        total_return = 0
        for t in range(len(episode_states) - 1, -1, -1):
            total_return += episode_rewards[t]
            
            # First-visit check: update only at the earliest occurrence of
            # this state in the episode
            if episode_states[t] not in episode_states[:t]:
                state = episode_states[t]
                action = episode_actions[t]
                
                # Increment the visit count for the state
                visit_count[state] += 1
                
                # Update the running mean of the returns observed from this state
                Q[state] += (total_return - Q[state]) / visit_count[state]
                
    # Report the mean of all table entries as the average utility
    average_utility = np.mean(Q)
    
    return Q, average_utility
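
<p>For comparison, here is a minimal sketch of first-visit Monte Carlo that keeps one estimate per state-action pair. It is not the notebook's implementation: <code>first_visit_returns</code> and <code>update_q</code> are hypothetical helpers, the table is a plain dictionary keyed by <code>(state, action)</code>, and the two episodes are hand-written for illustration using the notebook's end-state rewards (+20 and -50).</p>

```python
def first_visit_returns(episode, gamma=1.0):
    """Map each (state, action) pair to the return following its first visit.

    `episode` is a list of (state, action, reward) tuples in time order.
    """
    returns = {}
    g = 0.0
    # Walk backwards so g accumulates the return from step t onward
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        # Overwriting keeps the earliest occurrence, since it is processed last
        returns[(state, action)] = g
    return returns


def update_q(q, counts, episode, gamma=1.0):
    """Incremental-mean update of a dict-based Q-table from one episode."""
    for key, g in first_visit_returns(episode, gamma).items():
        counts[key] = counts.get(key, 0) + 1
        old = q.get(key, 0.0)
        q[key] = old + (g - old) / counts[key]


q, counts = {}, {}
# Two hand-written episodes starting from (2, 1)
update_q(q, counts, [((2, 1), "N", 0), ((1, 1), "E", 0),
                     ((1, 2), "E", 0), ((1, 3), "N", 20)])
update_q(q, counts, [((2, 1), "E", 0), ((2, 2), "E", -50)])
print(q)
```

<p>Because the estimates are keyed by action, the two moves out of (2, 1) get separate values, which is what a greedy policy needs in order to prefer one over the other.</p>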

In [7]:
# Function to run experiments with different slip probabilities and epsilon values
def run_experiments():
    grid_size = (4, 4)
    start_state = (2, 1)
    safe_end_states = [(0, 3)]  # Safe end state (reward +20)
    dangerous_end_state = (2, 3)  # Volcano end state (reward -50)
    num_episodes = 1000  # Adjust as needed
    
    slip_probabilities = [0.0, 0.1, 0.2, 0.3]
    epsilon_values = [0.1]  # Adjust as needed
    
    for slip_prob in slip_probabilities:
        for epsilon in epsilon_values:
            # Initialize the environment
            grid = initialize_environment(grid_size, start_state, safe_end_states, dangerous_end_state, slip_prob)
            
            # Run Monte Carlo algorithm with epsilon-greedy exploration
            Q_values, avg_utility = monte_carlo(grid, slip_prob, num_episodes, epsilon)
            
            # Display results for each slip probability and epsilon value
            print(f"Results for Slip Probability {slip_prob} and Epsilon {epsilon}:")
            print("Q-values:")
            print(Q_values)
            print(f"Average Utility: {avg_utility}")
            print("\n")

# Main function to run experiments
def main():
    run_experiments()

if __name__ == "__main__":
    main()

Results for Slip Probability 0.0 and Epsilon 0.1:
Q-values:
[[20.         20.         20.          0.        ]
 [20.         20.         20.         20.        ]
 [20.         19.86       17.77777778  0.        ]
 [20.         20.         20.          0.        ]]
Average Utility: 16.10236111111111


Results for Slip Probability 0.1 and Epsilon 0.1:
Q-values:
[[19.05970149 19.26238145 19.29149798  0.        ]
 [18.97810219 19.26624738 18.81556684 13.28767123]
 [19.16167665 19.23       17.00854701  0.        ]
 [20.         16.05633803 20.          0.        ]]
Average Utility: 14.963608140238424


Results for Slip Probability 0.2 and Epsilon 0.1:
Q-values:
[[ 18.50381679  18.60927152  18.83333333   0.        ]
 [ 18.75886525  18.40736728  17.68138801  14.06779661]
 [ 18.68421053  17.34         9.42708333   0.        ]
 [ 18.67924528  16.41025641  -5.2        -50.        ]]
Average Utility: 9.38766464718444


Results for Slip Probability 0.3 and Epsilon 0.1:
Q-values:
[[ 16.752       16
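
<p>The "Average Utility" printed above is the mean of the table entries. Another common reading of average utility is the mean return collected per episode. The sketch below shows that alternative under stated assumptions: <code>discounted_return</code> and <code>average_utility</code> are hypothetical helpers, and the reward sequences are made up for illustration.</p>

```python
def discounted_return(rewards, gamma=1.0):
    """Return of one episode: r_0 + gamma * r_1 + gamma**2 * r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def average_utility(episodes_rewards, gamma=1.0):
    """Mean return over a list of per-episode reward sequences."""
    returns = [discounted_return(r, gamma) for r in episodes_rewards]
    return sum(returns) / len(returns)


# Hypothetical reward sequences for three episodes
episodes = [[0, 0, 0, 20], [0, -50], [0, 0, 20]]
print(average_utility(episodes))
```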

<h3 style="color:purple">SARSA</h3>

In [8]:
# Function to perform SARSA with epsilon-greedy exploration
def sarsa(grid, slip_prob, num_episodes, epsilon, alpha, gamma):
    # Initialize Q-values
    Q = np.zeros(grid.shape + (4,))  # Separate Q array for each state-action pair
    
    # Loop over episodes
    for episode in range(num_episodes):
        # Starting state
        current_state = (2, 1)
        
        # Choose action using epsilon-greedy strategy
        num_actions = Q.shape[2]
        current_action = epsilon_greedy(Q, current_state, epsilon, num_actions)
        
        # Initialize flag for episode completion
        episode_complete = False
        
        while not episode_complete:
            # Determine the next state based on the action and slip probability
            next_state = get_next_state(current_state, current_action, slip_prob, grid)
            
            # Get the reward for the next state
            reward = get_reward(next_state, grid)
            
            # Choose next action using epsilon-greedy strategy
            next_action = epsilon_greedy(Q, next_state, epsilon, num_actions)
            
            # Update Q-value using the SARSA update rule
            Q[current_state + (current_action,)] += alpha * (reward + gamma * Q[next_state + (next_action,)] - Q[current_state + (current_action,)])
            
            # Update current state and action
            current_state = next_state
            current_action = next_action
            
            # Check if the episode has ended
            episode_complete = next_state in [(0, 3), (2, 3)]
    
    # Report the mean of all Q-table entries as the average utility
    average_utility = np.mean(Q)
    
    return Q, average_utility
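
<p>A single SARSA update can be traced by hand. The sketch below is an illustration, with <code>sarsa_update</code> as a hypothetical helper: it repeats the update for a move that enters the +20 end state, using alpha = 0.1 and gamma = 0.9. The value after three updates, 5.42, is consistent with the entry for moving north from (1, 3) in the SARSA output for slip probability 0.0.</p>

```python
def sarsa_update(q_sa, reward, q_next, alpha, gamma):
    """One SARSA step: move Q(s, a) toward reward + gamma * Q(s', a')."""
    return q_sa + alpha * (reward + gamma * q_next - q_sa)


q = 0.0
history = []
for _ in range(3):
    # Q(s', a') stays 0 because the end state's entries are never updated
    q = sarsa_update(q, reward=20, q_next=0.0, alpha=0.1, gamma=0.9)
    history.append(round(q, 2))
print(history)  # [2.0, 3.8, 5.42]
```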

In [9]:
# Function to run experiments with different slip probabilities, epsilon values, alpha, and gamma
def run_sarsa_experiments():
    grid_size = (4, 4)
    start_state = (2, 1)
    safe_end_states = [(0, 3)]  # Safe end state (reward +20)
    dangerous_end_state = (2, 3)  # Volcano end state (reward -50)
    num_episodes = 1000  # Adjust as needed
    
    slip_probabilities = [0.0, 0.1, 0.2, 0.3]
    epsilon_values = [0.1]  # Adjust as needed
    alpha_values = [0.1]  # Learning rate
    gamma_values = [0.9]  # Discount factor
    
    for slip_prob in slip_probabilities:
        for epsilon in epsilon_values:
            for alpha in alpha_values:
                for gamma in gamma_values:
                    # Initialize the environment
                    grid = initialize_environment(grid_size, start_state, safe_end_states, dangerous_end_state, slip_prob)
                    
                    # Run SARSA algorithm with epsilon-greedy exploration
                    Q_values, avg_utility = sarsa(grid, slip_prob, num_episodes, epsilon, alpha, gamma)
                    
                    # Display results for each combination of parameters
                    print(f"Results for Slip Probability {slip_prob}, Epsilon {epsilon}, Alpha {alpha}, Gamma {gamma}:")
                    print("Q-values:")
                    print(Q_values)
                    print(f"Average Utility: {avg_utility}")
                    print("\n")

# Main function to run SARSA experiments
def main_sarsa():
    run_sarsa_experiments()

if __name__ == "__main__":
    main_sarsa()


Results for Slip Probability 0.0, Epsilon 0.1, Alpha 0.1, Gamma 0.9:
Q-values:
[[[ 0.         15.3952904   0.8649457   1.16898189]
  [14.10473839 17.66840659 11.08395535 11.48915468]
  [16.65557727 20.         13.29291439 14.28995138]
  [ 0.          0.          0.          0.        ]]

 [[10.37943526  0.          0.95779768  0.52032741]
  [15.07069906  9.60641951 10.89873532  6.00772044]
  [17.12346342  0.504       2.33968775  1.38009835]
  [ 5.42        0.          0.          0.        ]]

 [[ 1.91938916  7.82819427  0.          0.        ]
  [13.56648855  8.58297164  7.22784837  2.92279226]
  [12.20811227  0.          0.          0.        ]
  [ 0.          0.          0.          0.        ]]

 [[ 0.          1.44875385  0.          0.        ]
  [10.62560327  0.02139837  0.          0.06112863]
  [ 0.5414789   0.          0.          0.        ]
  [ 0.          0.          0.          0.        ]]]
Average Utility: 4.424632184263154


Results for Slip Probability 0.1, Epsilon 0.

<h3 style="color:purple">Q LEARNING</h3>

In [10]:
# Function to perform Q-learning with epsilon-greedy exploration
def q_learning(grid, slip_prob, num_episodes, epsilon, alpha, gamma):
    # Initialize Q-values
    Q = np.zeros(grid.shape + (4,))  # Separate Q array for each state-action pair
    
    # Loop over episodes
    for episode in range(num_episodes):
        # Starting state
        current_state = (2, 1)
        
        # Initialize flag for episode completion
        episode_complete = False
        
        while not episode_complete:
            # Choose action using epsilon-greedy strategy
            num_actions = Q.shape[2]
            current_action = epsilon_greedy(Q, current_state, epsilon, num_actions)
            
            # Determine the next state based on the action and slip probability
            next_state = get_next_state(current_state, current_action, slip_prob, grid)
            
            # Get the reward for the next state
            reward = get_reward(next_state, grid)
            
            # Update Q-value using the Q-learning update rule
            Q[current_state + (current_action,)] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[current_state + (current_action,)])
            
            # Update current state
            current_state = next_state
            
            # Check if the episode has ended
            episode_complete = next_state in [(0, 3), (2, 3)]
    
    # Report the mean of all Q-table entries as the average utility
    average_utility = np.mean(Q)
    
    return Q, average_utility
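
<p>The only difference from SARSA is the bootstrap term: Q-learning uses the best action in the next state rather than the action actually chosen. The sketch below is an illustration with a hypothetical helper <code>q_learning_update</code>. It also computes the values that a slip-free path toward the +20 end state converges to when intermediate rewards are zero and gamma = 0.9. The sequence 20, 18, 16.2, 14.58 matches the largest entries along the route (2, 1), (1, 1), (0, 1), (0, 2) in the Q-learning output for slip probability 0.0.</p>

```python
def q_learning_update(q_sa, reward, next_q_values, alpha, gamma):
    """One Q-learning step: bootstrap from the best action in the next state."""
    target = reward + gamma * max(next_q_values)
    return q_sa + alpha * (target - q_sa)


# Fixed point along a slip-free path: with zero step rewards, each state one
# step further from the +20 end state is worth gamma times the next value.
gamma = 0.9
value = 20.0
chain = [value]
for _ in range(3):
    value = gamma * value
    chain.append(round(value, 2))
print(chain)  # [20.0, 18.0, 16.2, 14.58]
```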

In [11]:
# Function to run experiments with different slip probabilities, epsilon values, alpha, and gamma
def run_q_learning_experiments():
    grid_size = (4, 4)
    start_state = (2, 1)
    safe_end_states = [(0, 3)]  # Safe end state (reward +20)
    dangerous_end_state = (2, 3)  # Volcano end state (reward -50)
    num_episodes = 1000  # Adjust as needed
    
    slip_probabilities = [0.0, 0.1, 0.2, 0.3]
    epsilon_values = [0.1]  # Adjust as needed
    alpha_values = [0.1]  # Learning rate
    gamma_values = [0.9]  # Discount factor
    
    for slip_prob in slip_probabilities:
        for epsilon in epsilon_values:
            for alpha in alpha_values:
                for gamma in gamma_values:
                    # Initialize the environment
                    grid = initialize_environment(grid_size, start_state, safe_end_states, dangerous_end_state, slip_prob)
                    
                    # Run Q-learning algorithm with epsilon-greedy exploration
                    Q_values, avg_utility = q_learning(grid, slip_prob, num_episodes, epsilon, alpha, gamma)
                    
                    # Display results for each combination of parameters
                    print(f"Results for Slip Probability {slip_prob}, Epsilon {epsilon}, Alpha {alpha}, Gamma {gamma}:")
                    print("Q-values:")
                    print(Q_values)
                    print(f"Average Utility: {avg_utility}")
                    print("\n")

# Main function to run Q-learning experiments
def main_q_learning():
    run_q_learning_experiments()

if __name__ == "__main__":
    main_q_learning()


Results for Slip Probability 0.0, Epsilon 0.1, Alpha 0.1, Gamma 0.9:
Q-values:
[[[ 1.43554257 16.16966334  1.42935663  2.17522747]
  [15.35178506 18.         13.51443224 12.21597405]
  [16.22741087 20.         14.0269638  14.23028215]
  [ 0.          0.          0.          0.        ]]

 [[13.94372498  0.          0.          0.        ]
  [16.2        13.45566717 11.94633964 10.72090542]
  [17.98256613  0.18        0.95297645  0.        ]
  [ 3.8         0.          0.          0.        ]]

 [[ 2.08463962 10.12010729  0.          0.62320112]
  [14.58        9.23638977  7.34422099  4.99661602]
  [14.3451499   0.          0.          0.        ]
  [ 0.          0.          0.          0.        ]]

 [[ 0.08343709  0.          0.          0.        ]
  [11.52994563  0.          0.9625912   0.        ]
  [ 0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.        ]]]
Average Utility: 4.841642446774151


Results for Slip Probability 0.1, Epsilon 0.

<hr>
<h3 style="color:purple">GUI</h3>

In [12]:
# Function to run experiments with different slip probabilities and epsilon values
def run_experiments(slip_prob, epsilon, num_episodes):
    grid_size = (4, 4)
    start_state = (2, 1)
    safe_end_states = [(0, 3)]  # Safe end state (reward +20)
    dangerous_end_state = (2, 3)  # Volcano end state (reward -50)
    
    if not isinstance(slip_prob, (list, tuple)):
        slip_prob = [slip_prob]
    
    if not isinstance(epsilon, (list, tuple)):
        epsilon = [epsilon]
    
    for slip_prob_val in slip_prob:
        for epsilon_val in epsilon:
            # Initialize the environment
            grid = initialize_environment(grid_size, start_state, safe_end_states, dangerous_end_state, slip_prob_val)
            
            # Run Monte Carlo algorithm with epsilon-greedy exploration
            Q_values, avg_utility = monte_carlo(grid, slip_prob_val, num_episodes, epsilon_val)
            
            # Display results for each slip probability and epsilon value
            print("MODEL-FREE MONTE CARLO")
            print(f"Results for Slip Probability {slip_prob_val} and Epsilon {epsilon_val}:")
            print("Q-values:")
            print(Q_values)
            print(f"Average Utility: {avg_utility}")
            print("\n")


In [13]:
# Function to run experiments with different slip probabilities, epsilon values, alpha, and gamma
def run_sarsa_experiments(slip_probabilities, epsilon_values, num_episodes):
    grid_size = (4, 4)
    start_state = (2, 1)
    safe_end_states = [(0, 3)]  # Safe end state (reward +20)
    dangerous_end_state = (2, 3)  # Volcano end state (reward -50)
    alpha_values = [0.1]  # Learning rate
    gamma_values = [0.9]  # Discount factor
    
    if not isinstance(slip_probabilities, (list, tuple)):
        slip_probabilities = [slip_probabilities]
    
    if not isinstance(epsilon_values, (list, tuple)):
        epsilon_values = [epsilon_values]
    
    if not isinstance(alpha_values, (list, tuple)):
        alpha_values = [alpha_values]
    
    if not isinstance(gamma_values, (list, tuple)):
        gamma_values = [gamma_values]
    
    for slip_prob in slip_probabilities:
        for epsilon in epsilon_values:
            for alpha in alpha_values:
                for gamma in gamma_values:
                    # Initialize the environment
                    grid = initialize_environment(grid_size, start_state, safe_end_states, dangerous_end_state, slip_prob)
                    
                    # Run SARSA algorithm with epsilon-greedy exploration
                    Q_values, avg_utility = sarsa(grid, slip_prob, num_episodes, epsilon, alpha, gamma)
                    
                    # Display results for each combination of parameters
                    print("SARSA")
                    print(f"Results for Slip Probability {slip_prob}, Epsilon {epsilon}, Alpha {alpha}, Gamma {gamma}:")
                    print("Q-values:")
                    print(Q_values)
                    print(f"Average Utility: {avg_utility}")
                    print("\n")



In [14]:
# Function to run experiments with different slip probabilities, epsilon values, alpha, and gamma
def run_q_learning_experiments(slip_probabilities, epsilon_values, num_episodes):
    grid_size = (4, 4)
    start_state = (2, 1)
    safe_end_states = [(0, 3)]  # Safe end state (reward +20)
    dangerous_end_state = (2, 3)  # Volcano end state (reward -50)
    
    alpha_values = [0.1]  # Learning rate
    gamma_values = [0.9]  # Discount factor
    
    if not isinstance(slip_probabilities, (list, tuple)):
        slip_probabilities = [slip_probabilities]
    
    if not isinstance(epsilon_values, (list, tuple)):
        epsilon_values = [epsilon_values]
    
    if not isinstance(alpha_values, (list, tuple)):
        alpha_values = [alpha_values]
    
    if not isinstance(gamma_values, (list, tuple)):
        gamma_values = [gamma_values]
    
    for slip_prob in slip_probabilities:
        for epsilon in epsilon_values:
            for alpha in alpha_values:
                for gamma in gamma_values:
                    # Initialize the environment
                    grid = initialize_environment(grid_size, start_state, safe_end_states, dangerous_end_state, slip_prob)
                    
                    # Run Q-learning algorithm with epsilon-greedy exploration
                    Q_values, avg_utility = q_learning(grid, slip_prob, num_episodes, epsilon, alpha, gamma)
                    
                    # Display results for each combination of parameters
                    print("Q LEARNING")
                    print(f"Results for Slip Probability {slip_prob}, Epsilon {epsilon}, Alpha {alpha}, Gamma {gamma}:")
                    print("Q-values:")
                    print(Q_values)
                    print(f"Average Utility: {avg_utility}")
                    print("\n")

In [15]:
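import tkinter as tk
from tkinter import messagebox, ttk
from functools import partial


def on_run_button_click(slip_prob_entry, epsilon_entry, num_episodes_entry):
    # Parse the three fields; report bad input instead of raising in the callback
    try:
        slip_prob = float(slip_prob_entry.get())
        epsilon = float(epsilon_entry.get())
        num_episodes = int(num_episodes_entry.get())
    except ValueError:
        messagebox.showerror("Invalid input", "Slip probability and epsilon must be numbers; episodes must be an integer.")
        return

    # Run all three algorithms with the chosen settings
    run_experiments(slip_prob, epsilon, num_episodes)
    run_sarsa_experiments(slip_prob, epsilon, num_episodes)
    run_q_learning_experiments(slip_prob, epsilon, num_episodes)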


def create_gui():
    root = tk.Tk()
    root.title("Reinforcement Learning Experiments")

    # Create labels and entry fields
    tk.Label(root, text="Slip Probability:").grid(row=0, column=0, padx=10, pady=5)
    slip_prob_entry = tk.Entry(root)
    slip_prob_entry.grid(row=0, column=1, padx=10, pady=5)

    tk.Label(root, text="Epsilon Value:").grid(row=1, column=0, padx=10, pady=5)
    epsilon_entry = tk.Entry(root)
    epsilon_entry.grid(row=1, column=1, padx=10, pady=5)

    tk.Label(root, text="Number of Episodes:").grid(row=2, column=0, padx=10, pady=5)
    num_episodes_entry = tk.Entry(root)
    num_episodes_entry.grid(row=2, column=1, padx=10, pady=5)

    run_button = tk.Button(root, text="Run Experiments", command=partial(on_run_button_click, slip_prob_entry, epsilon_entry, num_episodes_entry))
    run_button.grid(row=3, column=0, columnspan=2, pady=10)

    root.mainloop()

create_gui()

MODEL-FREE MONTE CARLO
Results for Slip Probability 0.2 and Epsilon 0.3:
Q-values:
[[ 15.90425532  16.09970674  16.50623886   0.        ]
 [ 15.74814815  15.52631579  14.63443396   9.67741935]
 [ 14.81481481  13.68292683   4.2278481    0.        ]
 [ 15.81196581  12.25409836  -0.68181818 -39.23076923]]
Average Utility: 7.810974042706549


SARSA
Results for Slip Probability 0.2, Epsilon 0.3, Alpha 0.1, Gamma 0.9:
Q-values:
[[[ 9.87672514e+00  1.26704629e+01  8.80203200e+00  1.01544024e+01]
  [ 1.19507279e+01  1.45686707e+01  1.03045423e+01  1.04854352e+01]
  [ 1.55549285e+01  1.82153099e+01  1.14690914e+01  1.26028510e+01]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]]

 [[ 1.02488890e+01  9.55370256e+00  6.99867132e+00  8.77838676e+00]
  [ 1.20101695e+01  9.86455239e+00  6.44437742e+00  8.63970053e+00]
  [ 1.28492388e+01  1.10202225e+01  4.12339922e-01  8.78122879e+00]
  [ 1.86168852e+01  6.41037541e+00 -1.61217897e+01  5.05855768e+00]]

 [[ 8.34434314e+00  5.5998
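
<p>The values typed into the GUI are plain strings, so it can help to check their ranges before starting a run. The following is a minimal sketch, assuming the three entry texts are passed in as strings. <code>parse_settings</code> is a hypothetical helper, and the allowed ranges are assumptions (probabilities in [0, 1], at least one episode).</p>

```python
def parse_settings(slip_text, epsilon_text, episodes_text):
    """Convert raw entry strings to validated (slip_prob, epsilon, num_episodes)."""
    # float() and int() raise ValueError on non-numeric text
    slip_prob = float(slip_text)
    epsilon = float(epsilon_text)
    num_episodes = int(episodes_text)
    if not 0.0 <= slip_prob <= 1.0:
        raise ValueError("slip probability must be between 0 and 1")
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be between 0 and 1")
    if num_episodes <= 0:
        raise ValueError("number of episodes must be positive")
    return slip_prob, epsilon, num_episodes


print(parse_settings("0.2", "0.3", "1000"))
```

<p>A button callback could call this helper inside a <code>try</code> block and show the error message to the user when a <code>ValueError</code> is raised.</p>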

<h3 style="color:purple">Submitted By</h3>

<ul>
<li> Nasir Hussain</li>
<li> Laiba Masood</li>
</ul>