# Environment Setup

In [None]:
#Pretty standard stuff here

!mkdir PongReinforcementLearning
!cd PongReinforcementLearning

# Then, I set up a virtual environment (venv)
python -m venv PongReinforcementLearningVENV
!source PongReinforcementLearningVENV/bin/activate

# Make the venv recognizable to Jupyter Notebooks.
# This is the bridge that connects Jupyter to my isolated Python environment.
%pip install ipyconfig
python -m ipykernel install --user --name=PongReinforcementLearningVENV

# Time to fire up Jupyter Notebook.
# Make sure to select the new venv as the Python interpreter.
jupyter notebook

# Finally, installing some libs, i usually do these via the console but Jupyter's % operator usually works just fine
%pip3 install pygame
%pip install numpy

# See if I can run an external Pygame window from a Jupyter notebook on macosx

In [2]:
import pygame
pygame.init()

# Create external window
win = pygame.display.set_mode((500, 500))

# Main game loop
run = True
while run:
    pygame.time.delay(100)
    
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            run = False
            
    # Game logic here (e.g., move a rectangle)
    pygame.draw.rect(win, (255, 0, 0), (250, 250, 50, 50))
    
    pygame.display.update()

pygame.quit()


pygame 2.5.1 (SDL 2.28.2, Python 3.10.9)
Hello from the pygame community. https://www.pygame.org/contribute.html


**Well, it runs but shutdown isn't graceful.  The window pops up, draws a glorious red square.  But then simple window commands like "close" fail.  I had to Force Quit which then also brought the Jupyter notebook kernel to the ground.  This may wind up being a royal PITA but i'll give it a shot for now.  Worst case I'll switch to a simple python script run from the console.**

# Pong

In [1]:
import pygame
import random
import numpy as np  
import pickle
import os
import math

#Helper function to load data from a pickle file
def load_data_from_pickle_file(filename, default_value):
    try: return pickle.load(open(filename, "rb")) if os.path.exists(filename) else default_value
    except Exception as e: print(f"Error loading {filename}: {e}"); return default_value

#Constants
DATA_FILE_PREFIX = 'v23-'
Q_TABLE_LEFT_FILE = 'data/' + DATA_FILE_PREFIX + 'Q_table_left.pkl'
Q_TABLE_RIGHT_FILE = 'data/' + DATA_FILE_PREFIX + 'Q_table_right.pkl'
EPISODE_COUNT_FILE = 'data/' + DATA_FILE_PREFIX + 'Episode_count.pkl'
DEBUG_OFF = 0
DEBUG_INFO = 1
DEBUG_DEBUG = 2
DEBUG_LEVEL = DEBUG_OFF # Default debug level setting
GAME_BOARD_GRID_SIZE = 100

# Initialize epsilon for the epsilon-greedy policy
epsilon = 1.0 #(orig 1.0)
epsilon_min = 0.10 #(orig .01)
epsilon_decay = 0.9995 #(orig .995)

# Initialize hyperparameters
alpha = 0.5  # Learning rate (orig .1)
gamma = 0.90  # Discount factor (orig .99)

#Rewards lookback period (for debugging, not training)
reward_lookback_period = 100  # Number of episodes to average over
recent_rewards_left = []
recent_rewards_right = []

# Initialize Q-tables
Q_table_left = {}
Q_table_right = {}

#Q-table save frequency
episode_count = 0  # Initialize episode count
save_frequency = 100  # Save every 100 episodes

#Load data from pickle
Q_table_left = load_data_from_pickle_file(Q_TABLE_RIGHT_FILE, {})
Q_table_right = load_data_from_pickle_file(Q_TABLE_RIGHT_FILE, {})
episode_count = load_data_from_pickle_file(EPISODE_COUNT_FILE, 0)

# Initialize scores
left_score = 0
right_score = 0

# Define the action space
action_space = [0, 1, 2]  # 0: Move Up, 1: Move Down, 2: Stay Still

# Initialize reward
reward = 0

# Initialize iterations_this_game
iterations_this_game = 0

# Initialize Pygame
pygame.init()

# Create a window
width, height = 800, 600  # Window dimensions
window = pygame.display.set_mode((width, height))
pygame.display.set_caption('Pong Game')

# Initialize paddle and ball attributes
paddle_width, paddle_height = 20, 100
ball_radius = 15

# Initial positions
left_paddle_pos = [50, height // 2 - paddle_height // 2]
right_paddle_pos = [width - 50 - paddle_width, height // 2 - paddle_height // 2]
ball_pos = [width // 2, height // 2]

# Ball velocity
ball_velocity = [random.choice([-4, 4]), random.choice([-4, 4])]

#Convert input coordinate to discrete grid space.  this smaller grid space should make learning easier.
def discretize_grid(coordinate): 
    return coordinate // GAME_BOARD_GRID_SIZE

#Convert velocity into discretized space (of only 4 options!)
def discretize_velocity(velocity_x, velocity_y):
    if velocity_x > 0 and velocity_y > 0:
        return 0  # Up-Right
    elif velocity_x > 0 and velocity_y < 0:
        return 1  # Down-Right
    elif velocity_x < 0 and velocity_y > 0:
        return 2  # Up-Left
    elif velocity_x < 0 and velocity_y < 0:
        return 3  # Down-Left

# Main game loop
run = True
while run:
    #pygame.time.delay(10)
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            run = False
            
    #Track game loops in this episode/game and report to screen to get a sense of how many loops a game lasts
    iterations_this_game += 1
    
    # Reset rewards to 0 at the beginning of each pass through the game loop
    reward_left = 0
    reward_right = 0
            
    # Create the state representation for both agents
    state_left = (discretize_grid(left_paddle_pos[1]), discretize_grid(ball_pos[0]), discretize_grid(ball_pos[1]), discretize_velocity(ball_velocity[0], ball_velocity[1]))
    state_right = (discretize_grid(right_paddle_pos[1]), discretize_grid(ball_pos[0]), discretize_grid(ball_pos[1]), discretize_velocity(ball_velocity[0], ball_velocity[1]))

    # Initialize Q-values for the states if not already present
    if state_left not in Q_table_left:
        Q_table_left[state_left] = {action: np.random.uniform(-1, 1) for action in action_space}
    if state_right not in Q_table_right:
        Q_table_right[state_right] = {action: np.random.uniform(-1, 1) for action in action_space}

    # Choose an action for both agents using the epsilon-greedy policy
    action_left = max(Q_table_left[state_left], key=Q_table_left[state_left].get) if np.random.rand() >= epsilon else np.random.choice(action_space)
    action_right = max(Q_table_right[state_right], key=Q_table_right[state_right].get) if np.random.rand() >= epsilon else np.random.choice(action_space)
   
    # Manual human paddle movement with boundary checks
    #keys = pygame.key.get_pressed()
    #if keys[pygame.K_w] and left_paddle_pos[1] > 0:
    #    left_paddle_pos[1] -= 5
    #if keys[pygame.K_s] and left_paddle_pos[1] < height - paddle_height:
    #    left_paddle_pos[1] += 5
    #if keys[pygame.K_UP] and right_paddle_pos[1] > 0:
    #    right_paddle_pos[1] -= 5
    #if keys[pygame.K_DOWN] and right_paddle_pos[1] < height - paddle_height:
    #    right_paddle_pos[1] += 5

    #Left AI agent moves the paddle!!
    if action_left == 0 and left_paddle_pos[1] > 0:  # Move Up
        left_paddle_pos[1] -= 15
    elif action_left == 1 and left_paddle_pos[1] < height - paddle_height:  # Move Down
        left_paddle_pos[1] += 15
    #elif action_left == 2: 
        # Stay Still, so no movement
        
    #Right AI agent moves the paddle!!
    if action_right == 0 and right_paddle_pos[1] > 0:  # Move Up
        right_paddle_pos[1] -= 15
    elif action_right == 1 and right_paddle_pos[1] < height - paddle_height:  # Move Down
        right_paddle_pos[1] += 15
    #elif action_right == 2: 
        # Stay Still, so no movement

    # Update ball position
    ball_pos[0] += ball_velocity[0]
    ball_pos[1] += ball_velocity[1]

    # Collision detection with walls
    if ball_pos[1] <= 0 or ball_pos[1] >= height:
        ball_velocity[1] = -ball_velocity[1]

    #Debug track whether we have a rewarded event in this loop
    reward_applied_this_loop = False
    
    # Collision detection with paddles
    collision_offset = 5  # Define an offset to push the ball away from the paddle
    if (left_paddle_pos[0] <= ball_pos[0] <= left_paddle_pos[0] + paddle_width and
        left_paddle_pos[1] <= ball_pos[1] <= left_paddle_pos[1] + paddle_height):
        ball_velocity[0] = -ball_velocity[0]
        ball_pos[0] += collision_offset  # Push the ball away from the paddle
        reward_left = 1  # Add reward for left agent
        reward_applied_this_loop = True
    elif (right_paddle_pos[0] <= ball_pos[0] <= right_paddle_pos[0] + paddle_width and
          right_paddle_pos[1] <= ball_pos[1] <= right_paddle_pos[1] + paddle_height):
        ball_velocity[0] = -ball_velocity[0]
        ball_pos[0] -= collision_offset  # Push the ball away from the paddle
        reward_right = 1  # Add reward for right agent
        reward_applied_this_loop = True
    
    #Penalties for not exploring enough
    #extreme_zones = [[0, height // 8], [7 * height // 8, height]]  # Define the extreme zones
    #bonus = 0.1  # Define the bonus
    #center_zone = [height // 4, 3 * height // 4]  # Define the center zone
    #penalty = -0.1  # Define the penalty
    # Apply penalty for center zone
    #if center_zone[0] <= left_paddle_pos[1] <= center_zone[1]:
    #    reward_left += penalty
    #if center_zone[0] <= right_paddle_pos[1] <= center_zone[1]:
    #    reward_right += penalty
    # Apply bonus for extreme zones
    #for zone in extreme_zones:
    #    if zone[0] <= left_paddle_pos[1] <= zone[1]:
    #        reward_left += bonus
    #    if zone[0] <= right_paddle_pos[1] <= zone[1]:
    #        reward_right += bonus
        
    # Ball reset, scoring, and immediate feedback game-over condition
    if ball_pos[0] < 0:
        # Reset paddle positions to the middle
        left_paddle_pos = [50, height // 2 - paddle_height // 2]
        right_paddle_pos = [width - 50 - paddle_width, height // 2 - paddle_height // 2]
        #Reset the ball to the center in a random direction
        ball_pos = [width // 2, height // 2]
        ball_velocity = [random.choice([-4, 4]), random.choice([-4, 4])]
        #Scoring
        right_score += 1  # Right player scores
        #Rewards
        reward_left += -1  # Negative reward for the left agent
        reward_right += 1  # Positive reward for the right agent
        reward_applied_this_loop = True
        #Signal the end of an episode
        episode_count += 1  # Increment episode count
        iterations_this_game = 0
        # Decay epsilon at the end of a game/episode
        if epsilon > epsilon_min:
            epsilon *= epsilon_decay
        # Save the Q-tables every save_frequency episodes
        if episode_count % save_frequency == 0:
            with open(Q_TABLE_LEFT_FILE, "wb") as f:
                pickle.dump(Q_table_left, f)
            with open(Q_TABLE_RIGHT_FILE, "wb") as f:
                pickle.dump(Q_table_right, f)
            with open(EPISODE_COUNT_FILE, "wb") as f:
                pickle.dump(episode_count, f)
    elif ball_pos[0] > width:
        # Reset paddle positions to the middle
        left_paddle_pos = [50, height // 2 - paddle_height // 2]
        right_paddle_pos = [width - 50 - paddle_width, height // 2 - paddle_height // 2]
        #Reset the ball to the center in a random direction
        ball_pos = [width // 2, height // 2]
        ball_velocity = [random.choice([-4, 4]), random.choice([-4, 4])]
        #Scoring
        left_score += 1  # Left player scores
        #Rewards
        reward_left += 1  # Positive reward for the left agent
        reward_right += -1  # Negative reward for the right agent
        reward_applied_this_loop = True
        #Signal the end of an episode
        episode_count += 1  # Increment episode count
        iterations_this_game = 0
        # Decay epsilon at the end of a game/episode
        if epsilon > epsilon_min:
            epsilon *= epsilon_decay
        # Save the Q-tables every save_frequency episodes
        if episode_count % save_frequency == 0:
            with open(Q_TABLE_LEFT_FILE, "wb") as f:
                pickle.dump(Q_table_left, f)
            with open(Q_TABLE_RIGHT_FILE, "wb") as f:
                pickle.dump(Q_table_right, f)
            with open(EPISODE_COUNT_FILE, "wb") as f:
                pickle.dump(episode_count, f)
                
    # After taking an action, observe new state and reward
    new_state_left = (discretize_grid(left_paddle_pos[1]), right_paddle_pos[1], discretize_grid(ball_pos[0]), discretize_grid(ball_pos[1]), discretize_velocity(ball_velocity[0], ball_velocity[1]))
    new_state_right = (discretize_grid(left_paddle_pos[1]), right_paddle_pos[1], discretize_grid(ball_pos[0]), discretize_grid(ball_pos[1]), discretize_velocity(ball_velocity[0], ball_velocity[1]))

    # Initialize Q-values-left for the new state if not already present
    if new_state_left not in Q_table_left:
        Q_table_left[new_state_left] = {action: 0 for action in action_space}

    # Initialize Q-values-right for the new state if not already present
    if new_state_right not in Q_table_right:
        Q_table_right[new_state_right] = {action: 0 for action in action_space}

    # Calculate the best next action for both agents
    best_next_action_left = max(Q_table_left[new_state_left], key=Q_table_left[new_state_left].get)
    best_next_action_right = max(Q_table_right[new_state_right], key=Q_table_right[new_state_right].get)

    # Debug: Print Q-values before update
    if new_state_left in Q_table_left and state_left in Q_table_left:
        print(f"Debug: Before Update - Q_value Left for state {state_left} and action {action_left}: {Q_table_left[state_left][action_left]}")

    if new_state_right in Q_table_right and state_right in Q_table_right:
        print(f"Debug: Before Update - Q_value Right for state {state_right} and action {action_right}: {Q_table_right[state_right][action_right]}")

    # Q-Learning update rule for both agents
    Q_table_left[state_left][action_left] = (1 - alpha) * Q_table_left[state_left][action_left] + alpha * (reward_left + gamma * Q_table_left[new_state_left][best_next_action_left])
    Q_table_right[state_right][action_right] = (1 - alpha) * Q_table_right[state_right][action_right] + alpha * (reward_right + gamma * Q_table_right[new_state_right][best_next_action_right])

    # Debug: Check for approximate equality
    tolerance = 1e-5  # You can adjust this value
    if abs(Q_table_left[state_left][action_left] - ((1 - alpha) * Q_table_left[state_left][action_left] + alpha * (reward_left + gamma * Q_table_left[new_state_left][best_next_action_left]))) > tolerance:
        print(f"Debug: Mismatch in Q_value Left Update!")
        print(f"Alpha: {alpha}, Gamma: {gamma}, Reward Left: {reward_left}, Best Next Action Left: {best_next_action_left}")

    if abs(Q_table_right[state_right][action_right] - ((1 - alpha) * Q_table_right[state_right][action_right] + alpha * (reward_right + gamma * Q_table_right[new_state_right][best_next_action_right]))) > tolerance:
        print(f"Debug: Mismatch in Q_value Right Update!")
        print(f"Alpha: {alpha}, Gamma: {gamma}, Reward Right: {reward_right}, Best Next Action Right: {best_next_action_right}")
    
    # Update current state for next iteration
    state_left = new_state_left
    state_right = new_state_right
         
    # Append the reward of the current episode to the list
    recent_rewards_left.append(reward_left)
    recent_rewards_right.append(reward_right)
    # Remove the oldest reward if the list grows too large
    if len(recent_rewards_left) > reward_lookback_period:
        del recent_rewards_left[0]
    if len(recent_rewards_right) > reward_lookback_period:
        del recent_rewards_right[0]
    # Calculate the average reward
    avg_reward_left = sum(recent_rewards_left) / len(recent_rewards_left)
    avg_reward_right = sum(recent_rewards_right) / len(recent_rewards_right)
        
    # Draw paddles, ball, and scores
    window.fill((0, 0, 0))  # Clear screen
    pygame.draw.rect(window, (255, 255, 255), left_paddle_pos + [paddle_width, paddle_height])
    pygame.draw.rect(window, (255, 255, 255), right_paddle_pos + [paddle_width, paddle_height])
    pygame.draw.circle(window, (255, 255, 255), ball_pos, ball_radius)

    # Display scores
    font = pygame.font.SysFont(None, 30)
    score_display = font.render(f"score: {left_score} - {right_score}", True, (255, 255, 255))
    window.blit(score_display, (width // 2 - 45, 10))
    
    # Display episode count
    font = pygame.font.SysFont(None, 30)
    episode_display = font.render(f"episodes played: {episode_count}", True, (255, 255, 255))
    window.blit(episode_display, (width // 2 - 100, 40))
    
    # Display current epsilon
    font = pygame.font.SysFont(None, 30)
    epsilon_display = font.render(f"Epsilon: {epsilon:.4f}", True, (255, 255, 255))
    window.blit(epsilon_display, (10, 70))

    # Display average reward for left and right agents
    font = pygame.font.SysFont(None, 30)
    avg_reward_left_display = font.render(f"Avg Reward Left: {avg_reward_left:.6f}", True, (255, 255, 255))
    window.blit(avg_reward_left_display, (10, 100))
    avg_reward_right_display = font.render(f"Avg Reward Right: {avg_reward_right:.6f}", True, (255, 255, 255))
    window.blit(avg_reward_right_display, (10, 130))
    
    # Display current frame within game
    #font = pygame.font.SysFont(None, 30)
    #epsilon_display = font.render(f"iterations_this_game: {iterations_this_game}", True, (255, 255, 255))
    #window.blit(epsilon_display, (10, 160))
    
    if (DEBUG_LEVEL>=DEBUG_DEBUG): 
        if Q_table_left[state_left][action_left] != 0:
            print(f"Episode: {episode_count}, Iteration: {iterations_this_game}")
            print(f"Current State Left: {state_left}, Action Left: {action_left}, Q-Value: {Q_table_left[state_left][action_left]}")
            print(f"New State Left: {new_state_left}, Best Next Action Left: {best_next_action_left}, Q-Value: {Q_table_left[new_state_left][best_next_action_left]}")
            print(f"Reward Left: {reward_left}")
            print(f"Current Epsilon: {epsilon}")
            #print(f"Avg Reward Left: {avg_reward_left}")
            print(f"----------")
            print(f" ")


    pygame.display.update()
    
pygame.quit()


pygame 2.5.1 (SDL 2.28.2, Python 3.10.9)
Hello from the pygame community. https://www.pygame.org/contribute.html
Debug: Before Update - Q_value Left for state (2, 4, 3, 1) and action 1: 0.9729657907733198
Debug: Before Update - Q_value Right for state (2, 4, 3, 1) and action 0: 0.4788019483939463
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (2, 4, 2, 1) and action 0: 0.8279742648620194
Debug: Before Update - Q_value Right for state (2, 4, 2, 1) and action 0: 0.7359454468436755
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (2, 4, 2, 1) and action 1: -0.370881929128661

Debug: Before Update - Q_value Left for state (3, 5, 1, 1) and action 2: 0.10584610546102624
Debug: Before Update - Q_value Right for state (3, 5, 1, 1) and action 0: -0.29212106530632675
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (3, 5, 1, 1) and action 1: -0.05282799041311237
Debug: Before Update - Q_value Right for state (3, 5, 1, 1) and action 0: -0.14606053265316338
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (3, 5, 1, 1) and action 1: -0.026413995206556184
Debug: Before Update - Q_value Right for state (2, 5, 1, 1) and action 2: 0.13314059443776904
Debug: M

Debug: Before Update - Q_value Left for state (3, 6, 0, 1) and action 1: 0.001323714365891935
Debug: Before Update - Q_value Right for state (2, 6, 0, 1) and action 2: -0.08716678001464453
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (3, 6, 0, 1) and action 0: -0.04378473019057286
Debug: Before Update - Q_value Right for state (2, 6, 0, 1) and action 1: -0.029054802712047988
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (3, 6, 0, 1) and action 0: -0.02189236509528643
Debug: Before Update - Q_value Right for state (2, 6, 0, 1) and action 0: -0.011778693032792872
Debug

Debug: Before Update - Q_value Left for state (2, 4, 2, 1) and action 0: 0.0004042843090146579
Debug: Before Update - Q_value Right for state (2, 4, 2, 1) and action 1: 0.0004618226432913405
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (2, 4, 2, 1) and action 2: 0.00012763898259877872
Debug: Before Update - Q_value Right for state (2, 4, 2, 1) and action 0: 0.0001796741813583192
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (2, 4, 2, 1) and action 2: 6.381949129938936e-05
Debug: Before Update - Q_value Right for state (2, 4, 2, 1) and action 1: 0.00023091132164567026

Debug: Before Update - Q_value Left for state (1, 6, 0, 1) and action 2: -0.001315227837803426
Debug: Before Update - Q_value Right for state (2, 6, 0, 1) and action 1: -0.007263700678011997
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (1, 6, 0, 1) and action 0: -0.08407654014846319
Debug: Before Update - Q_value Right for state (2, 6, 0, 1) and action 2: -0.005447923750915283
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (1, 6, 0, 1) and action 1: 0.17683014161991967
Debug: Before Update - Q_value Right for state (2, 6, 0, 1) and action 0: -4.601051965934716e-05
Deb

Debug: Before Update - Q_value Left for state (2, 4, 2, 1) and action 1: -1.1318418247334643e-05
Debug: Before Update - Q_value Right for state (2, 4, 2, 1) and action 2: -7.435424307715827e-06
Debug: Before Update - Q_value Left for state (2, 4, 2, 1) and action 2: 1.9943591031059176e-06
Debug: Before Update - Q_value Right for state (2, 4, 2, 1) and action 1: 3.607989400713598e-06
Debug: Before Update - Q_value Left for state (2, 4, 2, 1) and action 2: 9.971795515529588e-07
Debug: Before Update - Q_value Right for state (2, 4, 2, 1) and action 1: 1.803994700356799e-06
Debug: Before Update - Q_value Left for state (2, 4, 2, 1) and action 1: -5.6592091236673215e-06
Debug: Before Update - Q_value Right for state (2, 4, 2, 1) and action 0: 1.122963633489495e-05
Debug: Before Update - Q_value Left for state (2, 4, 2, 1) and action 1: -2.8296045618336607e-06
Debug: Before Update - Q_value Right for state (2, 4, 2, 1) and action 0: 5.614818167447475e-06
Debug: Before Update - Q_value Left f

Debug: Before Update - Q_value Left for state (3, 5, 1, 1) and action 1: -2.5794917193902524e-05
Debug: Before Update - Q_value Right for state (2, 5, 1, 1) and action 1: -1.1138192254863689e-06
Debug: Before Update - Q_value Left for state (4, 5, 1, 1) and action 1: 0.2401047213580485
Debug: Before Update - Q_value Right for state (2, 5, 1, 1) and action 2: 5.078910615454447e-07
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Before Update - Q_value Left for state (4, 5, 1, 1) and action 0: -0.49137750547504333
Debug: Before Update - Q_value Right for state (2, 5, 1, 1) and action 0: 6.303125072765019e-05
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (4, 5, 1, 1) and action 1: 0.12005236067902425
Debug: Before Up

Debug: Before Update - Q_value Left for state (4, 7, 0, 0) and action 1: 0.050212642533676216
Debug: Before Update - Q_value Right for state (4, 7, 0, 0) and action 2: 0.23445236636000077
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (4, 7, 0, 0) and action 2: 0.03118169501488914
Debug: Before Update - Q_value Right for state (4, 7, 0, 0) and action 0: -0.13236402090458527
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (4, 7, 0, 0) and action 1: 0.025106321266838108
Debug: Before Update - Q_value Right for state (4, 7, 0, 0) and action 1: -0.11878448386544807
Debug: Mi

Debug: Before Update - Q_value Left for state (2, 3, 2, 3) and action 1: -0.0188747859913469
Debug: Before Update - Q_value Right for state (2, 3, 2, 3) and action 0: -0.004252510782367361
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (3, 3, 2, 3) and action 0: -0.13739408543143394
Debug: Before Update - Q_value Right for state (2, 3, 2, 3) and action 1: -0.0070953499774840355
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (2, 3, 2, 3) and action 2: 0.00019780322866096883
Debug: Before Update - Q_value Right for state (2, 3, 2, 3) and action 1: -0.0035476749887420177
D

Debug: Before Update - Q_value Left for state (3, 1, 0, 3) and action 1: -0.141692042295735
Debug: Before Update - Q_value Right for state (3, 1, 0, 3) and action 0: 0.4462183033784429
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (4, 1, 0, 3) and action 1: -0.41266308918472316
Debug: Before Update - Q_value Right for state (3, 1, 0, 3) and action 1: -0.25899503623250797
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (4, 1, 0, 3) and action 0: -0.3257468712999685
Debug: Before Update - Q_value Right for state (3, 1, 0, 3) and action 2: -0.4806215298371167
Debug: Mismat

Debug: Before Update - Q_value Left for state (2, 4, 3, 2) and action 0: -0.2943907034860238
Debug: Before Update - Q_value Right for state (2, 4, 3, 2) and action 2: -0.47673492938797857
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (2, 3, 3, 2) and action 1: 0.5908018213554207
Debug: Before Update - Q_value Right for state (2, 3, 3, 2) and action 1: 0.21812846874392844
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (2, 3, 3, 2) and action 1: 0.29540091067771035
Debug: Before Update - Q_value Right for state (2, 3, 3, 2) and action 0: -0.37572312187999035
Debug: Misma

Debug: Before Update - Q_value Left for state (4, 2, 4, 2) and action 0: 0.08420129475141705
Debug: Before Update - Q_value Right for state (2, 2, 4, 2) and action 2: 0.008179773000103939
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (3, 2, 4, 2) and action 0: 0.5902375962908217
Debug: Before Update - Q_value Right for state (2, 2, 4, 2) and action 1: 0.015522950313811859
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (3, 2, 4, 2) and action 2: -0.012061850727688922
Debug: Before Update - Q_value Right for state (2, 2, 4, 2) and action 0: -0.02686166708044238
Debug: Mi

Debug: Before Update - Q_value Left for state (4, 0, 5, 3) and action 0: 0.8478990789049077
Debug: Before Update - Q_value Right for state (2, 0, 5, 3) and action 1: -0.9905876620657086
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (4, 0, 5, 3) and action 2: 0.3691962478764492
Debug: Before Update - Q_value Right for state (2, 0, 5, 3) and action 1: -0.4952938310328543
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (4, 0, 5, 3) and action 1: 0.33066210272533136
Debug: Before Update - Q_value Right for state (3, 0, 5, 3) and action 2: 0.3131026249131055
Debug: Mismatch 

Debug: Before Update - Q_value Left for state (4, 1, 4, 1) and action 1: -0.054332094283259635
Debug: Before Update - Q_value Right for state (3, 1, 4, 1) and action 1: -0.008829986753404004
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (4, 1, 4, 1) and action 0: -0.2158938052130861
Debug: Before Update - Q_value Right for state (3, 1, 4, 1) and action 1: -0.004414993376702002
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (4, 1, 4, 1) and action 2: 0.02352964543691967
Debug: Before Update - Q_value Right for state (4, 1, 4, 1) and action 0: -0.019894057493961004
Debug

Debug: Before Update - Q_value Left for state (2, 3, 2, 1) and action 0: 0.40371374718953
Debug: Before Update - Q_value Right for state (4, 3, 2, 1) and action 0: 0.1976141901635342
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (2, 3, 2, 1) and action 2: 0.7809781652777241
Debug: Before Update - Q_value Right for state (4, 3, 2, 1) and action 2: -0.08903223643095959
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (2, 3, 2, 1) and action 0: 0.201856873594765
Debug: Before Update - Q_value Right for state (4, 3, 2, 1) and action 0: 0.0988070950817671
Debug: Mismatch in Q

Debug: Before Update - Q_value Left for state (1, 5, 1, 1) and action 0: 0.003054758020647705
Debug: Before Update - Q_value Right for state (4, 5, 1, 1) and action 1: 0.17027893094247104
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (1, 5, 1, 1) and action 0: 0.0015273790103238526
Debug: Before Update - Q_value Right for state (4, 5, 1, 1) and action 0: 0.017177663746502717
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (1, 5, 0, 1) and action 2: 0.5498132876861874
Debug: Before Update - Q_value Right for state (4, 5, 0, 1) and action 2: -0.17764856363123305
Debug: Mi

Debug: Before Update - Q_value Left for state (2, 7, 0, 0) and action 2: -0.0002927224747437465
Debug: Before Update - Q_value Right for state (2, 7, 0, 0) and action 1: -0.007613454916720955
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (2, 7, 0, 0) and action 1: -2.3624947791144914e-05
Debug: Before Update - Q_value Right for state (3, 7, 0, 0) and action 2: -0.0016737085193493234
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (2, 7, 0, 0) and action 0: -0.005942486729150473
Debug: Before Update - Q_value Right for state (3, 7, 0, 0) and action 1: -0.05246343212106794
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 

Debug: Before Update - Q_value Left for state (4, 4, 2, 1) and action 2: -0.20744049298799516
Debug: Before Update - Q_value Right for state (3, 4, 2, 1) and action 1: 0.234216982186212
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (4, 4, 2, 1) and action 2: -0.10372024649399758
Debug: Before Update - Q_value Right for state (3, 4, 2, 1) and action 0: -0.47279939487992584
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0
Debug: Mismatch in Q_value Right Update!
Alpha: 0.5, Gamma: 0.9, Reward Right: 0, Best Next Action Right: 0
Debug: Before Update - Q_value Left for state (4, 4, 2, 1) and action 0: 0.43117904022885023
Debug: Before Update - Q_value Right for state (3, 4, 2, 1) and action 1: 0.117108491093106
Debug: Mismatc

Debug: Before Update - Q_value Left for state (3, 6, 0, 1) and action 1: 0.00016546429573649187
Debug: Before Update - Q_value Right for state (2, 6, 0, 1) and action 0: -8.986429620966242e-08
Debug: Mismatch in Q_value Left Update!
Alpha: 0.5, Gamma: 0.9, Reward Left: 0, Best Next Action Left: 0


# Notes

## Implementing Game Mechanics for Pong

### 1. Initialize Pygame and Create Window
- Initialized Pygame and created an 800x600 window for the game.

### 2. Initialize Paddle and Ball Attributes
- Defined the dimensions of the paddles and the ball. Initialized their starting positions.

### 3. Paddle Movement
- Implemented keyboard controls for moving the paddles up and down.

### 4. Ball Movement and Collision Detection
- Added logic for ball movement and collision detection with the walls and paddles.

### 5. Ball Reset and Scoring
- Implemented ball reset and scoring mechanics. The ball resets to the center after a point is scored.

### 6. Paddle Boundaries
- Added boundaries to prevent the paddles from moving out of the window.

### 7. Game Over Conditions
- Implemented immediate feedback game-over conditions. The game resets after each point, serving as an episode in RL terms.


## Defining RL Elements for Pong

### 1. State Representation
- Decide how to represent the state of the game. Consider the trade-offs between granularity and computational complexity.

### 2. Action Space
- Define the set of actions I can take (e.g., move paddle up, move paddle down, stay still).

### 3. Reward Structure
- Design the rewards I receive for various outcomes (e.g., +1 for scoring, -1 for opponent scoring).

### 4. Policy Initialization
- Initialize my policy, which could be a Q-table, a neural network, or some other function mapping states to actions.

### 5. Learning Algorithm
- Choose and implement a learning algorithm (e.g., Q-learning, SARSA, Deep Q-Networks) to update my policy based on experiences.

### 6. Exploration-Exploitation Strategy
- Decide on a strategy for balancing exploration (trying new actions) and exploitation (sticking with known good actions), such as ε-greedy.

### 7. Training Loop
- Implement the training loop where I interact with the environment, update my policy, and optionally log metrics like average reward over time.

### 8. Evaluation Metrics
- Define metrics to evaluate my performance (e.g., average reward, win rate).

### 9. Hyperparameter Tuning
- Experiment with different learning rates, discount factors, and other hyperparameters to optimize performance.

### 10. Testing and Validation
- Test the trained agent to see how well it performs and validate that it is learning effectively.


## Q-Learning Algorithm

Q-Learning is a model-free reinforcement learning algorithm that aims to learn a policy, which tells an agent what action to take under what circumstances. It defines a function \( Q(s, a) \), representing the quality or the utility of taking action \( a \) in state \( s \).

### Outline

1. **Initialize Q-Table**: Create a table to store the Q-values for each state-action pair.
2. **Policy**: Define how the agent chooses an action (e.g., \(\epsilon\)-greedy).
3. **Learning**: Update the Q-values using the Q-Learning update rule.
4. **Training Loop**: Incorporate these elements into the game loop.

The Q-table will be represented as a Python dictionary. The keys will be the states, and the values will be another dictionary mapping actions to Q-values.


## max() reference

| Iterable Type | What It Returns to `max()` | Example of Using `max()` |
|---------------|----------------------------|--------------------------|
| List          | Individual list elements   | `max([1, 2, 3])` returns `3` |
| Tuple         | Individual tuple elements  | `max((1, 2, 3))` returns `3` |
| String        | Individual characters     | `max("abc")` returns `'c'` |
| Set           | Individual set elements    | `max({1, 2, 3})` returns `3` |
| Dictionary    | Dictionary keys           | `max({'a': 1, 'b': 2}, key=lambda k: k)` returns `'b'` |
|               |                            | `max({'a': 1, 'b': 2}.values())` returns `2` |
|               |                            | `max({'a': 1, 'b': 2}, key=lambda k: {'a': 1, 'b': 2}[k])` returns `'b'` |
| Numpy Array   | Individual array elements  | `import numpy as np; max(np.array([1, 2, 3]))` returns `3` |


## Building intuition around training variables

1. **Alpha (α) - Learning Rate**: 
    - **What it does**: Determines how much of the new Q-value estimate I adopt.
    - **Intuition**: Think of it as a "blending factor." If α is 1, I consider only the most recent information. If α is 0, I learn nothing and stick to my prior knowledge. A value between 0 and 1 blends the old and new information.
    - **Example**: If α is high (closer to 1), I will rapidly adapt to new strategies but may also forget useful past knowledge quickly.

2. **Gamma (γ) - Discount Factor**: 
    - **What it does**: Influences how much future rewards contribute to the Q-value.
    - **Intuition**: It's like a "patience meter." A high γ makes me prioritize long-term reward over short-term reward.
    - **Example**: If γ is close to 1, I will consider future rewards with greater weight, making me more strategic but potentially slower to train.

3. **Epsilon (ε) - Exploration Rate**: 
    - **What it does**: Controls the trade-off between exploration (trying new actions) and exploitation (sticking with known actions).
    - **Intuition**: It's like the "curiosity level." A high ε encourages me to try new things, while a low ε makes me stick to what I know.
    - **Example**: If ε starts high and decays over time (ε-decay), I will initially explore a lot and gradually shift to exploiting my learned knowledge.
