# Environment Setup

In [None]:
# Pretty standard stuff here. The setup steps are shell commands run in a terminal
# (in a notebook, `!cd` and `!source` only affect a throwaway subshell).

mkdir PongReinforcementLearning
cd PongReinforcementLearning

# Then, I set up a virtual environment (venv)
python -m venv PongReinforcementLearningVENV
source PongReinforcementLearningVENV/bin/activate

# Make the venv recognizable to Jupyter Notebooks.
# This is the bridge that connects Jupyter to my isolated Python environment.
pip install ipykernel
python -m ipykernel install --user --name=PongReinforcementLearningVENV

# Time to fire up Jupyter Notebook.
# Make sure to select the new venv as the kernel.
jupyter notebook

# Finally, install some libs. I usually do this from the console, but Jupyter's %pip magic usually works just fine.
%pip install pygame
%pip install numpy
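
A quick sanity check that the notebook kernel really is running from the venv can save some confusion later. The snippet below is a minimal sketch using only the standard library; the helper name `running_in_venv` is made up for illustration and is not part of the original notebook.

```python
import sys

def running_in_venv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory, while
    # sys.base_prefix points at the interpreter the venv was created from.
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Inside a venv:", running_in_venv())
```

If the second line prints `False`, the notebook is probably still using the default kernel rather than the one registered with `ipykernel install`.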

# See if I can run an external Pygame window from a Jupyter notebook on macOS

In [2]:
import pygame
pygame.init()

# Create external window
win = pygame.display.set_mode((500, 500))

# Main game loop
run = True
while run:
    pygame.time.delay(100)
    
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            run = False
            
    # Game logic here (e.g., move a rectangle)
    pygame.draw.rect(win, (255, 0, 0), (250, 250, 50, 50))
    
    pygame.display.update()

pygame.quit()


pygame 2.5.1 (SDL 2.28.2, Python 3.10.9)
Hello from the pygame community. https://www.pygame.org/contribute.html


**Well, it runs, but shutdown isn't graceful. The window pops up and draws a glorious red square, but then simple window commands like "close" fail. I had to Force Quit, which also brought the Jupyter notebook kernel down with it. This may wind up being a royal PITA, but I'll give it a shot for now. Worst case, I'll switch to a simple Python script run from the console.**
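
One possible middle ground between the two options is to keep the notebook but launch the game in a child process, so that a Force Quit on the window kills only the child and not the kernel. This is a sketch under assumptions, not something tested in this notebook: the helper `run_game_script` and the file name `pong.py` are hypothetical.

```python
import subprocess
import sys

def run_game_script(script_path, timeout=None):
    """Run a Python script in a child process and return its exit code."""
    # sys.executable keeps the child on the same interpreter (and venv) as the kernel.
    completed = subprocess.run([sys.executable, script_path], timeout=timeout)
    return completed.returncode

# Hypothetical usage, assuming the game loop has been saved to pong.py:
# exit_code = run_game_script("pong.py")
```

On POSIX systems a negative return code means the child was terminated by a signal, which is what a Force Quit would be expected to produce.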

# Pong

In [1]:
import pygame
import random
import numpy as np  
import pickle
import os
import math

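#Helper function to load data from a pickle file
def load_data_from_pickle_file(filename, default_value):
    """Return the unpickled contents of filename, or default_value if it is missing or unreadable."""
    if not os.path.exists(filename):
        return default_value
    try:
        with open(filename, "rb") as f:  # The context manager closes the file handle
            return pickle.load(f)
    except Exception as e:
        print(f"Error loading {filename}: {e}")
        return default_value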

#Constants
DATA_FILE_PREFIX = 'v25-'
Q_TABLE_LEFT_FILE = 'data/' + DATA_FILE_PREFIX + 'Q_table_left.pkl'
Q_TABLE_RIGHT_FILE = 'data/' + DATA_FILE_PREFIX + 'Q_table_right.pkl'
EPISODE_COUNT_FILE = 'data/' + DATA_FILE_PREFIX + 'Episode_count.pkl'
DEBUG_OFF = 0
DEBUG_INFO = 1
DEBUG_DEBUG = 2
DEBUG_LEVEL = DEBUG_OFF # Default debug level setting
GAME_BOARD_GRID_SIZE = 50

# Initialize epsilon for the epsilon-greedy policy
epsilon = 1.0 #(orig 1.0)
epsilon_min = 0.10 #(orig .01)
epsilon_decay = 0.9995 #(orig .995)

# Initialize hyperparameters
alpha = 0.5  # Learning rate (orig .1)
gamma = 0.90  # Discount factor (orig .99)

#Rewards lookback period (for debugging, not training)
reward_lookback_period = 100  # Number of game-loop iterations to average over
recent_rewards_left = []
recent_rewards_right = []

# Initialize Q-tables
Q_table_left = {}
Q_table_right = {}

#Q-table save frequency
episode_count = 0  # Initialize episode count
save_frequency = 100  # Save every 100 episodes

#Load data from pickle (create the data directory first so the periodic saves don't fail)
os.makedirs('data', exist_ok=True)
Q_table_left = load_data_from_pickle_file(Q_TABLE_LEFT_FILE, {})
Q_table_right = load_data_from_pickle_file(Q_TABLE_RIGHT_FILE, {})
episode_count = load_data_from_pickle_file(EPISODE_COUNT_FILE, 0)

# Initialize scores
left_score = 0
right_score = 0

# Define the action space
action_space = [0, 1, 2]  # 0: Move Up, 1: Move Down, 2: Stay Still

# Initialize reward
reward = 0

# Initialize iterations_this_game
iterations_this_game = 0

# Initialize Pygame
pygame.init()

# Create a window
width, height = 800, 600  # Window dimensions
window = pygame.display.set_mode((width, height))
pygame.display.set_caption('Pong Game')

# Initialize paddle and ball attributes
paddle_width, paddle_height = 20, 100
ball_radius = 15

# Initial positions
left_paddle_pos = [50, height // 2 - paddle_height // 2]
right_paddle_pos = [width - 50 - paddle_width, height // 2 - paddle_height // 2]
ball_pos = [width // 2, height // 2]

# Ball velocity
ball_velocity = [random.choice([-4, 4]), random.choice([-4, 4])]

#Convert an input coordinate to a discrete grid cell. This smaller state space should make learning easier.
def discretize_grid(coordinate):
    return coordinate // GAME_BOARD_GRID_SIZE

#Convert velocity into discretized space (of only 4 options!)
#Pygame's y-axis points down, so velocity_y > 0 means the ball is moving down the screen.
#Both components are always -4 or 4 here, so one of the four branches always matches.
def discretize_velocity(velocity_x, velocity_y):
    if velocity_x > 0 and velocity_y > 0:
        return 0  # Down-Right
    elif velocity_x > 0 and velocity_y < 0:
        return 1  # Up-Right
    elif velocity_x < 0 and velocity_y > 0:
        return 2  # Down-Left
    elif velocity_x < 0 and velocity_y < 0:
        return 3  # Up-Left

# Main game loop
run = True
while run:
    #pygame.time.delay(10)
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            run = False
            
    #Track game loops in this episode/game and report to screen to get a sense of how many loops a game lasts
    iterations_this_game += 1
    
    #Debug track whether we have a rewarded event in this loop
    reward_applied_this_loop = False
    
    # Reset rewards to 0 at the beginning of each pass through the game loop
    reward_left = 0
    reward_right = 0
            
    # Create the state representation for both agents
    state_left = (discretize_grid(left_paddle_pos[1]), discretize_grid(ball_pos[0]), discretize_grid(ball_pos[1]), discretize_velocity(ball_velocity[0], ball_velocity[1]))
    state_right = (discretize_grid(right_paddle_pos[1]), discretize_grid(ball_pos[0]), discretize_grid(ball_pos[1]), discretize_velocity(ball_velocity[0], ball_velocity[1]))

    # Initialize Q-values for the states if not already present
    if state_left not in Q_table_left:
        Q_table_left[state_left] = {action: np.random.uniform(-1, 1) for action in action_space}
    if state_right not in Q_table_right:
        Q_table_right[state_right] = {action: np.random.uniform(-1, 1) for action in action_space}

    # Choose an action for both agents using the epsilon-greedy policy
    action_left = max(Q_table_left[state_left], key=Q_table_left[state_left].get) if np.random.rand() >= epsilon else np.random.choice(action_space)
    action_right = max(Q_table_right[state_right], key=Q_table_right[state_right].get) if np.random.rand() >= epsilon else np.random.choice(action_space)
   
    # Manual human paddle movement with boundary checks
    #keys = pygame.key.get_pressed()
    #if keys[pygame.K_w] and left_paddle_pos[1] > 0:
    #    left_paddle_pos[1] -= 5
    #if keys[pygame.K_s] and left_paddle_pos[1] < height - paddle_height:
    #    left_paddle_pos[1] += 5
    #if keys[pygame.K_UP] and right_paddle_pos[1] > 0:
    #    right_paddle_pos[1] -= 5
    #if keys[pygame.K_DOWN] and right_paddle_pos[1] < height - paddle_height:
    #    right_paddle_pos[1] += 5

    #Left AI agent moves the paddle!!
    if action_left == 0 and left_paddle_pos[1] > 0:  # Move Up
        left_paddle_pos[1] -= 15
    elif action_left == 1 and left_paddle_pos[1] < height - paddle_height:  # Move Down
        left_paddle_pos[1] += 15
    #elif action_left == 2: 
        # Stay Still, so no movement
        
    #Right AI agent moves the paddle!!
    if action_right == 0 and right_paddle_pos[1] > 0:  # Move Up
        right_paddle_pos[1] -= 15
    elif action_right == 1 and right_paddle_pos[1] < height - paddle_height:  # Move Down
        right_paddle_pos[1] += 15
    #elif action_right == 2: 
        # Stay Still, so no movement
        
    # Debugging code to print current state and action for both agents
    #print(f"Current State Left: {state_left}, Action Taken Left: {action_left}")
    #print(f"Current State Right: {state_right}, Action Taken Right: {action_right}")

    # Update ball position
    ball_pos[0] += ball_velocity[0]
    ball_pos[1] += ball_velocity[1]

    # Collision detection with walls
    if ball_pos[1] <= 0 or ball_pos[1] >= height:
        ball_velocity[1] = -ball_velocity[1]
    
    # Collision detection with paddles
    collision_offset = 5  # Define an offset to push the ball away from the paddle
    if (left_paddle_pos[0] <= ball_pos[0] <= left_paddle_pos[0] + paddle_width and
        left_paddle_pos[1] <= ball_pos[1] <= left_paddle_pos[1] + paddle_height):
        ball_velocity[0] = -ball_velocity[0]
        ball_pos[0] += collision_offset  # Push the ball away from the paddle
        reward_left = 1  # Add reward for left agent
        reward_applied_this_loop = True
    elif (right_paddle_pos[0] <= ball_pos[0] <= right_paddle_pos[0] + paddle_width and
          right_paddle_pos[1] <= ball_pos[1] <= right_paddle_pos[1] + paddle_height):
        ball_velocity[0] = -ball_velocity[0]
        ball_pos[0] -= collision_offset  # Push the ball away from the paddle
        reward_right = 1  # Add reward for right agent
        reward_applied_this_loop = True
    
    #Penalties for not exploring enough
    #extreme_zones = [[0, height // 8], [7 * height // 8, height]]  # Define the extreme zones
    #bonus = 0.1  # Define the bonus
    #center_zone = [height // 4, 3 * height // 4]  # Define the center zone
    #penalty = -0.1  # Define the penalty
    # Apply penalty for center zone
    #if center_zone[0] <= left_paddle_pos[1] <= center_zone[1]:
    #    reward_left += penalty
    #if center_zone[0] <= right_paddle_pos[1] <= center_zone[1]:
    #    reward_right += penalty
    # Apply bonus for extreme zones
    #for zone in extreme_zones:
    #    if zone[0] <= left_paddle_pos[1] <= zone[1]:
    #        reward_left += bonus
    #    if zone[0] <= right_paddle_pos[1] <= zone[1]:
    #        reward_right += bonus
        
    # Ball reset, scoring, and immediate feedback game-over condition
    if ball_pos[0] < 0:
        # Reset paddle positions to the middle
        left_paddle_pos = [50, height // 2 - paddle_height // 2]
        right_paddle_pos = [width - 50 - paddle_width, height // 2 - paddle_height // 2]
        #Reset the ball to the center in a random direction
        ball_pos = [width // 2, height // 2]
        ball_velocity = [random.choice([-4, 4]), random.choice([-4, 4])]
        #Scoring
        right_score += 1  # Right player scores
        #Rewards
        reward_left += -1  # Negative reward for the left agent
        reward_right += 1  # Positive reward for the right agent
        reward_applied_this_loop = True
        #Signal the end of an episode
        episode_count += 1  # Increment episode count
        iterations_this_game = 0
        # Decay epsilon at the end of a game/episode
        if epsilon > epsilon_min:
            epsilon *= epsilon_decay
        # Save the Q-tables every save_frequency episodes
        if episode_count % save_frequency == 0:
            with open(Q_TABLE_LEFT_FILE, "wb") as f:
                pickle.dump(Q_table_left, f)
            with open(Q_TABLE_RIGHT_FILE, "wb") as f:
                pickle.dump(Q_table_right, f)
            with open(EPISODE_COUNT_FILE, "wb") as f:
                pickle.dump(episode_count, f)
    elif ball_pos[0] > width:
        # Reset paddle positions to the middle
        left_paddle_pos = [50, height // 2 - paddle_height // 2]
        right_paddle_pos = [width - 50 - paddle_width, height // 2 - paddle_height // 2]
        #Reset the ball to the center in a random direction
        ball_pos = [width // 2, height // 2]
        ball_velocity = [random.choice([-4, 4]), random.choice([-4, 4])]
        #Scoring
        left_score += 1  # Left player scores
        #Rewards
        reward_left += 1  # Positive reward for the left agent
        reward_right += -1  # Negative reward for the right agent
        reward_applied_this_loop = True
        #Signal the end of an episode
        episode_count += 1  # Increment episode count
        iterations_this_game = 0
        # Decay epsilon at the end of a game/episode
        if epsilon > epsilon_min:
            epsilon *= epsilon_decay
        # Save the Q-tables every save_frequency episodes
        if episode_count % save_frequency == 0:
            with open(Q_TABLE_LEFT_FILE, "wb") as f:
                pickle.dump(Q_table_left, f)
            with open(Q_TABLE_RIGHT_FILE, "wb") as f:
                pickle.dump(Q_table_right, f)
            with open(EPISODE_COUNT_FILE, "wb") as f:
                pickle.dump(episode_count, f)
                
    # After taking an action, observe new state and reward.
    # The new states must use the same 4-tuple layout as state_left/state_right
    # (own paddle, ball x, ball y, velocity code), otherwise the Q-table keys never match.
    new_state_left = (discretize_grid(left_paddle_pos[1]), discretize_grid(ball_pos[0]), discretize_grid(ball_pos[1]), discretize_velocity(ball_velocity[0], ball_velocity[1]))
    new_state_right = (discretize_grid(right_paddle_pos[1]), discretize_grid(ball_pos[0]), discretize_grid(ball_pos[1]), discretize_velocity(ball_velocity[0], ball_velocity[1]))

    # Check if state has changed
    if new_state_left != state_left or new_state_right != state_right or reward_applied_this_loop:
    
        # Initialize Q-values-left for the new state if not already present
        if new_state_left not in Q_table_left:
            Q_table_left[new_state_left] = {action: 0 for action in action_space}

        # Initialize Q-values-right for the new state if not already present
        if new_state_right not in Q_table_right:
            Q_table_right[new_state_right] = {action: 0 for action in action_space}

        # Calculate the best next action for both agents
        best_next_action_left = max(Q_table_left[new_state_left], key=Q_table_left[new_state_left].get)
        best_next_action_right = max(Q_table_right[new_state_right], key=Q_table_right[new_state_right].get)

        # Q-Learning update rule for both agents
        Q_table_left[state_left][action_left] = (1 - alpha) * Q_table_left[state_left][action_left] + alpha * (reward_left + gamma * Q_table_left[new_state_left][best_next_action_left])
        Q_table_right[state_right][action_right] = (1 - alpha) * Q_table_right[state_right][action_right] + alpha * (reward_right + gamma * Q_table_right[new_state_right][best_next_action_right])

    # Update current state for next iteration
    state_left = new_state_left
    state_right = new_state_right
         
    # Append the reward from the current loop iteration to the list
    recent_rewards_left.append(reward_left)
    recent_rewards_right.append(reward_right)
    # Remove the oldest reward if the list grows too large
    if len(recent_rewards_left) > reward_lookback_period:
        del recent_rewards_left[0]
    if len(recent_rewards_right) > reward_lookback_period:
        del recent_rewards_right[0]
    # Calculate the average reward
    avg_reward_left = sum(recent_rewards_left) / len(recent_rewards_left)
    avg_reward_right = sum(recent_rewards_right) / len(recent_rewards_right)
        
    # Draw paddles, ball, and scores
    window.fill((0, 0, 0))  # Clear screen
    pygame.draw.rect(window, (255, 255, 255), left_paddle_pos + [paddle_width, paddle_height])
    pygame.draw.rect(window, (255, 255, 255), right_paddle_pos + [paddle_width, paddle_height])
    pygame.draw.circle(window, (255, 255, 255), ball_pos, ball_radius)

    # Create the font once per frame and reuse it for all three labels
    font = pygame.font.SysFont(None, 30)

    # Display scores
    score_display = font.render(f"score: {left_score} - {right_score}", True, (255, 255, 255))
    window.blit(score_display, (width // 2 - 45, 10))
    
    # Display episode count
    episode_display = font.render(f"episodes played: {episode_count}", True, (255, 255, 255))
    window.blit(episode_display, (width // 2 - 100, 40))
    
    # Display current epsilon
    epsilon_display = font.render(f"Epsilon: {epsilon:.4f}", True, (255, 255, 255))
    window.blit(epsilon_display, (10, 70))

    # Display average reward for left and right agents
    #font = pygame.font.SysFont(None, 30)
    #avg_reward_left_display = font.render(f"Avg Reward Left: {avg_reward_left:.6f}", True, (255, 255, 255))
    #window.blit(avg_reward_left_display, (10, 100))
    #avg_reward_right_display = font.render(f"Avg Reward Right: {avg_reward_right:.6f}", True, (255, 255, 255))
    #window.blit(avg_reward_right_display, (10, 130))
    
    # Display current frame within game
    #font = pygame.font.SysFont(None, 30)
    #epsilon_display = font.render(f"iterations_this_game: {iterations_this_game}", True, (255, 255, 255))
    #window.blit(epsilon_display, (10, 160))
    
    if DEBUG_LEVEL >= DEBUG_DEBUG:
        if Q_table_left[state_left][action_left] != 0:
            # Recompute here: the update block above is skipped when nothing changed,
            # so best_next_action_left may not have been set in this iteration.
            best_next_action_left = max(Q_table_left[new_state_left], key=Q_table_left[new_state_left].get)
            print(f"Episode: {episode_count}, Iteration: {iterations_this_game}")
            print(f"Current State Left: {state_left}, Action Left: {action_left}, Q-Value: {Q_table_left[state_left][action_left]}")
            print(f"New State Left: {new_state_left}, Best Next Action Left: {best_next_action_left}, Q-Value: {Q_table_left[new_state_left][best_next_action_left]}")
            print(f"Reward Left: {reward_left}")
            print(f"Current Epsilon: {epsilon}")
            #print(f"Avg Reward Left: {avg_reward_left}")
            print("----------")
            print(" ")


    pygame.display.update()
    
pygame.quit()


pygame 2.5.1 (SDL 2.28.2, Python 3.10.9)
Hello from the pygame community. https://www.pygame.org/contribute.html
Debug: Q-values Left for state (2, 4, 3, 0): {0: 0.9742968127212739, 1: 0.6285983495895948, 2: -0.006178325905292881}
Debug: Q-values Right for state (2, 4, 3, 0): {0: 0.13116414301946622, 1: 0.36964792953528236, 2: -0.8978960688047888}
Debug: Q-values Left for state (2, 4, 3, 0): {0: 0.9742968127212739, 1: 0.6285983495895948, 2: -0.0030891629526464404}
Debug: Q-values Right for state (2, 4, 3, 0): {0: 0.06558207150973311, 1: 0.36964792953528236, 2: -0.8978960688047888}
Debug: Q-values Left for state (2, 4, 3, 0): {0: 0.48714840636063694, 1: 0.6285983495895948, 2: -0.0030891629526464404}
Debug: Q-values Right for state (2, 4, 3, 0): {0: 0.032791035754866554, 1: 0.36964792953528236, 2: -0.8978960688047888}
Debug: Q-values Left for state (2, 4, 3, 0): {0: 0.48714840636063694, 1: 0.6285983495895948, 2: -0.0015445814763232202}
Debug: Q-values Right for state (2, 4, 3, 0): {0: 0.

Debug: Q-values Left for state (2, 7, 5, 1): {0: -0.18959325635869229, 1: 0.43981335100651897, 2: -0.6480297001700184}
Debug: Q-values Right for state (1, 7, 5, 1): {0: 0.011317733405155406, 1: -0.8951297799403921, 2: 0.21296510904624677}
Debug: Q-values Left for state (1, 7, 5, 1): {0: -0.7206279191780656, 1: 0.43065011119464236, 2: -0.1130069594707187}
Debug: Q-values Right for state (1, 7, 5, 1): {0: 0.011317733405155406, 1: -0.44756488997019606, 2: 0.21296510904624677}
Debug: Q-values Left for state (1, 7, 5, 1): {0: -0.3603139595890328, 1: 0.43065011119464236, 2: -0.1130069594707187}
Debug: Q-values Right for state (1, 7, 5, 1): {0: 0.011317733405155406, 1: -0.22378244498509803, 2: 0.21296510904624677}
Debug: Q-values Left for state (1, 7, 5, 1): {0: -0.3603139595890328, 1: 0.21532505559732118, 2: -0.1130069594707187}
Debug: Q-values Right for state (2, 7, 5, 1): {0: 0.1274950555373655, 1: 0.3670952153135034, 2: -0.699357044302827}
Debug: Q-values Left for state (1, 7, 5, 1): {0: 

Debug: Q-values Left for state (3, 2, 4, 2): {0: -0.34069706757032425, 1: 0.02209808534175997, 2: -0.22608010634837195}
Debug: Q-values Right for state (2, 2, 4, 2): {0: 0.11314765684520267, 1: 0.1912181757080249, 2: -0.02596800780851588}
Debug: Q-values Left for state (3, 2, 4, 2): {0: -0.34069706757032425, 1: 0.011049042670879985, 2: -0.22608010634837195}
Debug: Q-values Right for state (2, 2, 4, 2): {0: 0.11314765684520267, 1: 0.09560908785401245, 2: -0.02596800780851588}
Debug: Q-values Left for state (4, 2, 4, 2): {0: -0.4764200111480953, 1: -0.30505062686638973, 2: 0.22310185107037417}
Debug: Q-values Right for state (2, 2, 4, 2): {0: 0.056573828422601335, 1: 0.09560908785401245, 2: -0.02596800780851588}
Debug: Q-values Left for state (3, 2, 4, 2): {0: -0.34069706757032425, 1: 0.011049042670879985, 2: -0.11304005317418597}
Debug: Q-values Right for state (2, 2, 4, 2): {0: 0.028286914211300668, 1: 0.09560908785401245, 2: -0.02596800780851588}
Debug: Q-values Left for state (3, 2, 

Debug: Q-values Left for state (4, 0, 5, 3): {0: 0.09176115560847972, 1: 0.06545288728175269, 2: -0.018465650925447448}
Debug: Q-values Right for state (0, 0, 5, 3): {0: 0.02423847062686027, 1: 0.10394079099182713, 2: -0.2718513871065268}
Debug: Q-values Left for state (4, 0, 5, 3): {0: 0.09176115560847972, 1: 0.032726443640876346, 2: -0.018465650925447448}
Debug: Q-values Right for state (0, 0, 5, 3): {0: 0.02423847062686027, 1: 0.05197039549591356, 2: -0.2718513871065268}
Debug: Q-values Left for state (4, 0, 5, 3): {0: 0.09176115560847972, 1: 0.032726443640876346, 2: -0.009232825462723724}
Debug: Q-values Right for state (0, 0, 5, 3): {0: 0.02423847062686027, 1: 0.02598519774795678, 2: -0.2718513871065268}
Debug: Q-values Left for state (4, 0, 5, 3): {0: 0.04588057780423986, 1: 0.032726443640876346, 2: -0.009232825462723724}
Debug: Q-values Right for state (1, 0, 5, 3): {0: -0.02214908934679599, 1: 0.7880082068822789, 2: 0.3854172089573552}
Debug: Q-values Left for state (4, 0, 5, 3

Debug: Q-values Left for state (3, 2, 4, 2): {0: -0.0001663559900245724, 1: 1.3487600916601545e-06, 2: -0.0008831254154233279}
Debug: Q-values Right for state (1, 2, 4, 2): {0: -0.0007086000468044052, 1: -0.030422544322976815, 2: 8.020368205127228e-05}
Debug: Q-values Left for state (3, 2, 4, 2): {0: -0.0001663559900245724, 1: 1.3487600916601545e-06, 2: -0.00044156270771166396}
Debug: Q-values Right for state (1, 2, 4, 2): {0: -0.0007086000468044052, 1: -0.030422544322976815, 2: 4.010184102563614e-05}
Debug: Q-values Left for state (3, 2, 4, 2): {0: -8.31779950122862e-05, 1: 1.3487600916601545e-06, 2: -0.00044156270771166396}
Debug: Q-values Right for state (1, 2, 4, 2): {0: -0.0007086000468044052, 1: -0.030422544322976815, 2: 2.005092051281807e-05}
Debug: Q-values Left for state (2, 2, 4, 2): {0: -0.1718256573427973, 1: -0.9554329709843423, 2: -0.8617450838721081}
Debug: Q-values Right for state (1, 2, 4, 2): {0: -0.0007086000468044052, 1: -0.015211272161488407, 2: 2.005092051281807e-

Debug: Q-values Left for state (2, 3, 2, 3): {0: 0.1768000070714526, 1: -0.18169580317654432, 2: 0.0331546030692412}
Debug: Q-values Right for state (2, 3, 2, 3): {0: -0.08307692993243865, 1: -0.04624764895345601, 2: -0.8238710266420977}
Debug: Q-values Left for state (2, 3, 2, 3): {0: 0.1768000070714526, 1: -0.18169580317654432, 2: 0.0165773015346206}
Debug: Q-values Right for state (2, 3, 2, 3): {0: -0.08307692993243865, 1: -0.04624764895345601, 2: -0.41193551332104883}
Debug: Q-values Left for state (2, 3, 2, 3): {0: 0.1768000070714526, 1: -0.09084790158827216, 2: 0.0165773015346206}
Debug: Q-values Right for state (2, 3, 2, 3): {0: -0.08307692993243865, 1: -0.04624764895345601, 2: -0.20596775666052441}
Debug: Q-values Left for state (2, 3, 2, 3): {0: 0.1768000070714526, 1: -0.09084790158827216, 2: 0.0082886507673103}
Debug: Q-values Right for state (2, 3, 2, 3): {0: -0.08307692993243865, 1: -0.04624764895345601, 2: -0.10298387833026221}
Debug: Q-values Left for state (2, 3, 2, 3): 

Debug: Q-values Left for state (3, 1, 0, 3): {0: 0.002538901766755322, 1: 0.0021195796180245472, 2: -0.06624718110858474}
Debug: Q-values Right for state (1, 1, 0, 3): {0: 0.0075471516690861894, 1: 0.02476894156019114, 2: 0.04753189172007913}
Debug: Q-values Left for state (3, 1, 0, 3): {0: 0.001269450883377661, 1: 0.0021195796180245472, 2: -0.06624718110858474}
Debug: Q-values Right for state (1, 1, 0, 3): {0: 0.0075471516690861894, 1: 0.02476894156019114, 2: 0.023765945860039564}
Debug: Q-values Left for state (3, 1, 0, 3): {0: 0.001269450883377661, 1: 0.0010597898090122736, 2: -0.06624718110858474}
Debug: Q-values Right for state (1, 1, 0, 3): {0: 0.0075471516690861894, 1: 0.02476894156019114, 2: 0.011882972930019782}
Debug: Q-values Left for state (3, 1, 0, 3): {0: 0.001269450883377661, 1: 0.0005298949045061368, 2: -0.06624718110858474}
Debug: Q-values Right for state (1, 1, 0, 3): {0: 0.0075471516690861894, 1: 0.01238447078009557, 2: 0.011882972930019782}
Debug: Q-values Left for 

Debug: Q-values Left for state (2, 3, 2, 3): {0: 1.3488770070759018e-06, 1: -5.5449158684248146e-06, 2: 3.237754205980586e-05}
Debug: Q-values Right for state (1, 3, 2, 3): {0: 0.48005388450814135, 1: -0.04310791268376188, 2: -0.023469122968100803}
Debug: Q-values Left for state (1, 3, 2, 3): {0: 0.020638439376826634, 1: -0.5209186730860607, 2: -0.19847210965380468}
Debug: Q-values Right for state (1, 3, 2, 3): {0: 0.48005388450814135, 1: -0.04310791268376188, 2: -0.011734561484050401}
Debug: Q-values Left for state (1, 3, 2, 3): {0: 0.010319219688413317, 1: -0.5209186730860607, 2: -0.19847210965380468}
Debug: Q-values Right for state (1, 3, 2, 3): {0: 0.48005388450814135, 1: -0.02155395634188094, 2: -0.011734561484050401}
Debug: Q-values Left for state (1, 3, 2, 3): {0: 0.010319219688413317, 1: -0.26045933654303033, 2: -0.19847210965380468}
Debug: Q-values Right for state (2, 3, 2, 3): {0: -1.014122679839339e-05, 1: -0.0014452390297955003, 2: -6.285637105118543e-06}
Debug: Q-values Le

Debug: Q-values Left for state (1, 0, 0, 2): {0: -0.03705300193505909, 1: -0.3823138130632402, 2: 0.332951592828177}
Debug: Q-values Right for state (1, 0, 0, 2): {0: -0.00993601559749233, 1: -0.00018500318803693118, 2: 6.64794754914746e-05}
Debug: Q-values Left for state (1, 0, 0, 2): {0: -0.018526500967529544, 1: -0.3823138130632402, 2: 0.332951592828177}
Debug: Q-values Right for state (1, 0, 0, 2): {0: -0.00993601559749233, 1: -0.00018500318803693118, 2: 3.32397377457373e-05}
Debug: Q-values Left for state (1, 0, 0, 2): {0: -0.018526500967529544, 1: -0.3823138130632402, 2: 0.1664757964140885}
Debug: Q-values Right for state (1, 0, 0, 2): {0: -0.00993601559749233, 1: -9.250159401846559e-05, 2: 3.32397377457373e-05}
Debug: Q-values Left for state (1, 0, 0, 2): {0: -0.018526500967529544, 1: -0.1911569065316201, 2: 0.1664757964140885}
Debug: Q-values Right for state (2, 0, 0, 2): {0: 0.29218240971463805, 1: 0.23962114341358465, 2: -0.21461252982496748}
Debug: Q-values Left for state (1

Debug: Q-values Left for state (3, 2, 4, 2): {0: -4.15889975061431e-05, 1: 3.371900229150386e-07, 2: -0.00022078135385583198}
Debug: Q-values Right for state (2, 2, 4, 2): {0: 6.905984914868327e-06, 1: 0.0014938919977189446, 2: -0.0016230004880322425}
Debug: Q-values Left for state (3, 2, 4, 2): {0: -4.15889975061431e-05, 1: 1.685950114575193e-07, 2: -0.00022078135385583198}
Debug: Q-values Right for state (2, 2, 4, 2): {0: 3.4529924574341635e-06, 1: 0.0014938919977189446, 2: -0.0016230004880322425}
Debug: Q-values Left for state (3, 2, 4, 2): {0: -4.15889975061431e-05, 1: 8.429750572875965e-08, 2: -0.00022078135385583198}
Debug: Q-values Right for state (2, 2, 4, 2): {0: 1.7264962287170818e-06, 1: 0.0014938919977189446, 2: -0.0016230004880322425}
Debug: Q-values Left for state (3, 2, 4, 2): {0: -4.15889975061431e-05, 1: 8.429750572875965e-08, 2: -0.00011039067692791599}
Debug: Q-values Right for state (1, 2, 4, 2): {0: -0.0007086000468044052, 1: -0.007605636080744204, 2: 2.00509205128

Debug: Q-values Left for state (2, 0, 5, 3): {0: -0.05362023704622043, 1: -0.11658010268087354, 2: -0.0962653612097476}
Debug: Q-values Right for state (2, 0, 5, 3): {0: 0.006258179404085488, 1: 0.12499777275438645, 2: 0.00027283793459816877}
Debug: Q-values Left for state (2, 0, 5, 3): {0: -0.05362023704622043, 1: -0.05829005134043677, 2: -0.0962653612097476}
Debug: Q-values Right for state (2, 0, 5, 3): {0: 0.006258179404085488, 1: 0.12499777275438645, 2: 0.00013641896729908439}
Debug: Q-values Left for state (2, 0, 5, 3): {0: -0.05362023704622043, 1: -0.05829005134043677, 2: -0.0481326806048738}
Debug: Q-values Right for state (2, 0, 5, 3): {0: 0.003129089702042744, 1: 0.12499777275438645, 2: 0.00013641896729908439}
Debug: Q-values Left for state (2, 0, 5, 3): {0: -0.026810118523110216, 1: -0.05829005134043677, 2: -0.0481326806048738}
Debug: Q-values Right for state (2, 0, 5, 3): {0: 0.003129089702042744, 1: 0.12499777275438645, 2: 6.820948364954219e-05}
Debug: Q-values Left for sta

Debug: Q-values Left for state (2, 2, 4, 2): {0: -0.002684775895981208, 1: -0.059714560686521395, 2: -0.10771813548401352}
Debug: Q-values Right for state (2, 2, 4, 2): {0: 4.3162405717927044e-07, 1: 5.835515616089627e-06, 2: -1.2679691312751895e-05}
Debug: Q-values Left for state (2, 2, 4, 2): {0: -0.002684775895981208, 1: -0.029857280343260698, 2: -0.10771813548401352}
Debug: Q-values Right for state (2, 2, 4, 2): {0: 2.1581202858963522e-07, 1: 5.835515616089627e-06, 2: -1.2679691312751895e-05}
Debug: Q-values Left for state (2, 2, 4, 2): {0: -0.002684775895981208, 1: -0.014928640171630349, 2: -0.10771813548401352}
Debug: Q-values Right for state (2, 2, 4, 2): {0: 1.0790601429481761e-07, 1: 5.835515616089627e-06, 2: -1.2679691312751895e-05}
Debug: Q-values Left for state (2, 2, 4, 2): {0: -0.002684775895981208, 1: -0.014928640171630349, 2: -0.05385906774200676}
Debug: Q-values Right for state (2, 2, 4, 2): {0: 5.3953007147408805e-08, 1: 5.835515616089627e-06, 2: -1.2679691312751895e-

Debug: Q-values Left for state (2, 0, 5, 3): {0: -0.25670252963077755, 1: -0.0018215641043886491, 2: -0.00037603656722557657}
Debug: Q-values Right for state (3, 0, 5, 3): {0: 0.8352593075438737, 1: -0.216250590567428, 2: 0.172033334109696}
Debug: Q-values Left for state (2, 0, 5, 3): {0: -0.12835126481538878, 1: -0.0018215641043886491, 2: -0.00037603656722557657}
Debug: Q-values Right for state (3, 0, 5, 3): {0: 0.8352593075438737, 1: -0.108125295283714, 2: 0.172033334109696}
Debug: Q-values Left for state (2, 0, 5, 3): {0: -0.06417563240769439, 1: -0.0018215641043886491, 2: -0.00037603656722557657}
Debug: Q-values Right for state (3, 0, 5, 3): {0: 0.41762965377193684, 1: -0.108125295283714, 2: 0.172033334109696}
Debug: Q-values Left for state (2, 0, 5, 3): {0: -0.06417563240769439, 1: -0.0018215641043886491, 2: -0.00018801828361278828}
Debug: Q-values Right for state (3, 0, 5, 3): {0: 0.41762965377193684, 1: -0.108125295283714, 2: 0.086016667054848}
Debug: Q-values Left for state (2,

Debug: Q-values Left for state (2, 2, 1, 3): {0: -8.527337478566287e-05, 1: -5.538575363582277e-06, 2: -7.094599859926404e-05}
Debug: Q-values Right for state (2, 2, 1, 3): {0: 0.006913516325667196, 1: 0.023928243918970513, 2: -5.443883749335426e-06}
Debug: Q-values Left for state (2, 2, 1, 3): {0: -8.527337478566287e-05, 1: -5.538575363582277e-06, 2: -3.547299929963202e-05}
Debug: Q-values Right for state (2, 2, 1, 3): {0: 0.006913516325667196, 1: 0.011964121959485256, 2: -5.443883749335426e-06}
Debug: Q-values Left for state (2, 2, 1, 3): {0: -4.2636687392831434e-05, 1: -5.538575363582277e-06, 2: -3.547299929963202e-05}
Debug: Q-values Right for state (2, 2, 1, 3): {0: 0.006913516325667196, 1: 0.005982060979742628, 2: -5.443883749335426e-06}
Debug: Q-values Left for state (2, 2, 1, 3): {0: -2.1318343696415717e-05, 1: -5.538575363582277e-06, 2: -3.547299929963202e-05}
Debug: Q-values Right for state (2, 2, 1, 3): {0: 0.006913516325667196, 1: 0.002991030489871314, 2: -5.443883749335426

Debug: Q-values Left for state (2, 4, 3, 1): {0: 0.4330954136368752, 1: 0.0044115685137853244, 2: -0.5784402695339843}
Debug: Q-values Right for state (2, 4, 3, 1): {0: 0.40549122990095765, 1: -0.31410823774399454, 2: 0.2651844137449082}
Debug: Q-values Left for state (2, 4, 2, 1): {0: 0.27286815517375795, 1: 0.6805024536801072, 2: 0.00190944301401208}
Debug: Q-values Right for state (2, 4, 2, 1): {0: 0.2148905101131513, 1: -0.5149586111251383, 2: 0.29793607248362586}
Debug: Q-values Left for state (2, 4, 2, 1): {0: 0.27286815517375795, 1: 0.3402512268400536, 2: 0.00190944301401208}
Debug: Q-values Right for state (2, 4, 2, 1): {0: 0.2148905101131513, 1: -0.5149586111251383, 2: 0.14896803624181293}
Debug: Q-values Left for state (2, 4, 2, 1): {0: 0.27286815517375795, 1: 0.1701256134200268, 2: 0.00190944301401208}
Debug: Q-values Right for state (2, 4, 2, 1): {0: 0.2148905101131513, 1: -0.25747930556256915, 2: 0.14896803624181293}
Debug: Q-values Left for state (2, 4, 2, 1): {0: 0.27286

Debug: Q-values Left for state (1, 6, 0, 1): {0: -0.08388896433189619, 1: 0.017953425276174384, 2: -0.20593311388096464}
Debug: Q-values Right for state (1, 6, 0, 1): {0: 0.04614218367052933, 1: -0.030220659046954347, 2: 0.02640337043596458}
Debug: Q-values Left for state (1, 6, 0, 1): {0: -0.041944482165948094, 1: 0.017953425276174384, 2: -0.20593311388096464}
Debug: Q-values Right for state (1, 6, 0, 1): {0: 0.023071091835264665, 1: -0.030220659046954347, 2: 0.02640337043596458}
Debug: Q-values Left for state (1, 6, 0, 1): {0: -0.041944482165948094, 1: 0.017953425276174384, 2: -0.10296655694048232}
Debug: Q-values Right for state (1, 6, 0, 1): {0: 0.023071091835264665, 1: -0.015110329523477174, 2: 0.02640337043596458}
Debug: Q-values Left for state (1, 6, 0, 1): {0: -0.020972241082974047, 1: 0.017953425276174384, 2: -0.10296655694048232}
Debug: Q-values Right for state (1, 6, 0, 1): {0: 0.023071091835264665, 1: -0.007555164761738587, 2: 0.02640337043596458}
Debug: Q-values Left for s

Debug: Q-values Left for state (2, 4, 3, 0): {0: 3.716647387394996e-06, 1: 0.0006138655757710887, 2: -3.770950869929737e-07}
Debug: Q-values Right for state (2, 4, 3, 0): {0: 3.202249585436187e-05, 1: 9.02460765466998e-05, 2: -0.001753703259384353}
Debug: Q-values Left for state (2, 4, 3, 0): {0: 3.716647387394996e-06, 1: 0.00030693278788554434, 2: -3.770950869929737e-07}
Debug: Q-values Right for state (2, 4, 3, 0): {0: 1.6011247927180935e-05, 1: 9.02460765466998e-05, 2: -0.001753703259384353}
Debug: Q-values Left for state (2, 4, 3, 0): {0: 3.716647387394996e-06, 1: 0.00030693278788554434, 2: -1.8854754349648684e-07}
Debug: Q-values Right for state (2, 4, 3, 0): {0: 1.6011247927180935e-05, 1: 4.51230382733499e-05, 2: -0.001753703259384353}
Debug: Q-values Left for state (2, 4, 3, 0): {0: 3.716647387394996e-06, 1: 0.00015346639394277217, 2: -1.8854754349648684e-07}
Debug: Q-values Right for state (2, 4, 3, 0): {0: 1.6011247927180935e-05, 1: 2.256151913667495e-05, 2: -0.001753703259384

Debug: Q-values Left for state (3, 6, 5, 0): {0: 0.042109430021718686, 1: 0.003269143242566528, 2: 0.015031841108780197}
Debug: Q-values Right for state (3, 6, 5, 0): {0: -0.1537312486493485, 1: 0.007591931216674865, 2: 0.030557412194498715}
Debug: Q-values Left for state (3, 6, 5, 0): {0: 0.042109430021718686, 1: 0.003269143242566528, 2: 0.007515920554390099}
Debug: Q-values Right for state (4, 6, 5, 0): {0: 0.017455657442530872, 1: -0.010013588840998122, 2: -0.0386266058318365}
Debug: Q-values Left for state (3, 6, 5, 0): {0: 0.021054715010859343, 1: 0.003269143242566528, 2: 0.007515920554390099}
Debug: Q-values Right for state (4, 6, 5, 0): {0: 0.017455657442530872, 1: -0.005006794420499061, 2: -0.0386266058318365}
Debug: Q-values Left for state (3, 7, 6, 1): {0: -0.3000419706837729, 1: -0.19733941719983816, 2: 0.28158112052022455}
Debug: Q-values Right for state (4, 7, 6, 1): {0: -0.4318153682091548, 1: 0.4749938741128572, 2: 0.17502818633751904}
Debug: Q-values Left for state (3, 

Debug: Q-values Left for state (2, 5, 1, 1): {0: 0.0029348840287833633, 1: -0.11082759218937552, 2: -0.7224547116179127}
Debug: Q-values Right for state (2, 5, 1, 1): {0: -0.0004584188087293625, 1: 0.00584087139098511, 2: -0.015806023934913746}
Debug: Q-values Left for state (2, 5, 1, 1): {0: 0.0014674420143916816, 1: -0.11082759218937552, 2: -0.7224547116179127}
Debug: Q-values Right for state (2, 5, 1, 1): {0: -0.0004584188087293625, 1: 0.00584087139098511, 2: -0.007903011967456873}
Debug: Q-values Left for state (2, 5, 1, 1): {0: 0.0014674420143916816, 1: -0.11082759218937552, 2: -0.3612273558089564}
Debug: Q-values Right for state (2, 5, 1, 1): {0: -0.0004584188087293625, 1: 0.002920435695492555, 2: -0.007903011967456873}
Debug: Q-values Left for state (2, 5, 1, 1): {0: 0.0014674420143916816, 1: -0.05541379609468776, 2: -0.3612273558089564}
Debug: Q-values Right for state (2, 5, 1, 1): {0: -0.0004584188087293625, 1: 0.002920435695492555, 2: -0.0039515059837284365}
Debug: Q-values L

Debug: Q-values Left for state (2, 7, 0, 0): {0: -0.03578624192777485, 1: -0.05369553166671914, 2: 0.03609748164545511}
Debug: Q-values Right for state (2, 7, 0, 0): {0: 0.020327941758268797, 1: 0.29749465790036167, 2: -0.10358757130408683}
Debug: Q-values Left for state (2, 7, 0, 0): {0: -0.03578624192777485, 1: -0.05369553166671914, 2: 0.018048740822727553}
Debug: Q-values Right for state (2, 7, 0, 0): {0: 0.010163970879134399, 1: 0.29749465790036167, 2: -0.10358757130408683}
Debug: Q-values Left for state (2, 7, 0, 0): {0: -0.03578624192777485, 1: -0.02684776583335957, 2: 0.018048740822727553}
Debug: Q-values Right for state (2, 7, 0, 0): {0: 0.005081985439567199, 1: 0.29749465790036167, 2: -0.10358757130408683}
Debug: Q-values Left for state (2, 7, 0, 0): {0: -0.03578624192777485, 1: -0.02684776583335957, 2: 0.009024370411363777}
Debug: Q-values Right for state (2, 7, 0, 0): {0: 0.005081985439567199, 1: 0.14874732895018083, 2: -0.10358757130408683}
Debug: Q-values Left for state (2

# Notes

## Implementing Game Mechanics for Pong

### 1. Initialize Pygame and Create Window
- Initialized Pygame and created an 800x600 window for the game.

### 2. Initialize Paddle and Ball Attributes
- Defined the dimensions of the paddles and the ball. Initialized their starting positions.

### 3. Paddle Movement
- Implemented keyboard controls for moving the paddles up and down.

### 4. Ball Movement and Collision Detection
- Added logic for ball movement and collision detection with the walls and paddles.
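
The bounce and paddle-hit logic can be sketched without Pygame as plain arithmetic on positions and velocities. This is only an illustration: the 800x600 window comes from the notes above, while `BALL_SIZE`, the paddle dimensions, and both function names are hypothetical and may not match the notebook's own code.

```python
# Only the 800x600 window size comes from the notes; everything else is assumed.
WIDTH, HEIGHT = 800, 600
BALL_SIZE = 10  # hypothetical ball edge length in pixels


def step_ball(x, y, vx, vy):
    """Advance the ball one frame and bounce it off the top and bottom walls."""
    x += vx
    y += vy
    if y <= 0:                       # hit the top wall: move down from now on
        y = 0
        vy = abs(vy)
    elif y + BALL_SIZE >= HEIGHT:    # hit the bottom wall: move up from now on
        y = HEIGHT - BALL_SIZE
        vy = -abs(vy)
    return x, y, vx, vy


def hits_paddle(ball_x, ball_y, paddle_x, paddle_y, paddle_w=10, paddle_h=100):
    """Axis-aligned rectangle overlap test between the ball and one paddle."""
    return (ball_x < paddle_x + paddle_w and ball_x + BALL_SIZE > paddle_x
            and ball_y < paddle_y + paddle_h and ball_y + BALL_SIZE > paddle_y)
```

Using `abs()` for the new vertical velocity, rather than just flipping the sign, keeps the ball from getting stuck if it overlaps a wall for two frames in a row.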

### 5. Ball Reset and Scoring
- Implemented ball reset and scoring mechanics. The ball resets to the center after a point is scored.

### 6. Paddle Boundaries
- Added boundaries to prevent the paddles from moving out of the window.
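
One common way to enforce the boundaries is to clamp the paddle's new position into the legal range. A minimal sketch, assuming a 600-pixel-tall window and a hypothetical `PADDLE_HEIGHT` of 100:

```python
HEIGHT = 600          # window height from the notes
PADDLE_HEIGHT = 100   # hypothetical paddle height in pixels


def clamp_paddle(y, dy):
    """Move the paddle by dy, keeping its top edge inside [0, HEIGHT - PADDLE_HEIGHT]."""
    return max(0, min(HEIGHT - PADDLE_HEIGHT, y + dy))
```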

### 7. Game Over Conditions
- Implemented game-over conditions that give immediate feedback. The game resets after each point, so each point serves as one episode in RL terms.


## Defining RL Elements for Pong

### 1. State Representation
- Decide how to represent the state of the game. Consider the trade-offs between granularity and computational complexity.
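
To make the granularity trade-off concrete, here is one possible discretization. It is a sketch under assumptions: the choice of features (paddle position, ball position, vertical direction), the bin count, and the function names are all hypothetical, and the notebook's actual state tuple may encode different things.

```python
WIDTH, HEIGHT = 800, 600  # window size from the notes


def to_bin(value, limit, n_bins):
    """Map a value in [0, limit] to an integer bin in [0, n_bins - 1]."""
    index = int(value / limit * n_bins)
    return max(0, min(n_bins - 1, index))


def discretize(paddle_y, ball_x, ball_y, ball_vy, n_bins=8):
    """Build a hashable state tuple from raw pixel positions (assumed features)."""
    direction = 0 if ball_vy < 0 else 1  # 0 = moving up, 1 = moving down
    return (
        to_bin(paddle_y, HEIGHT, n_bins),
        to_bin(ball_x, WIDTH, n_bins),
        to_bin(ball_y, HEIGHT, n_bins),
        direction,
    )
```

With `n_bins=8` this layout has 8 × 8 × 8 × 2 = 1024 possible states. Doubling the bin count multiplies that by eight, which is the cost side of finer granularity: more table entries to visit before the Q-values become useful.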

### 2. Action Space
- Define the set of actions I can take (e.g., move paddle up, move paddle down, stay still).

### 3. Reward Structure
- Design the rewards I receive for various outcomes (e.g., +1 for scoring, -1 for opponent scoring).

### 4. Policy Initialization
- Initialize whatever represents my policy: a Q-table or neural network of action values from which the policy is derived, or some other function mapping states to actions.

### 5. Learning Algorithm
- Choose and implement a learning algorithm (e.g., Q-learning, SARSA, Deep Q-Networks) to update my policy based on experiences.

### 6. Exploration-Exploitation Strategy
- Decide on a strategy for balancing exploration (trying new actions) and exploitation (sticking with known good actions), such as ε-greedy.
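
An ε-greedy choice can be sketched in a few lines. The action keys `0`, `1`, `2` match the debug output earlier in the notebook, but what each key means, and the function name `choose_action`, are assumptions made for illustration.

```python
import random

ACTIONS = (0, 1, 2)  # keys as seen in the debug output; their meaning is assumed


def choose_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)                # explore
    return max(q_values, key=q_values.get)        # exploit


# Example: with epsilon=0 the choice is purely greedy.
greedy = choose_action({0: 0.1, 1: 0.9, 2: -0.2}, epsilon=0.0)
```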

### 7. Training Loop
- Implement the training loop where I interact with the environment, update my policy, and optionally log metrics like average reward over time.

### 8. Evaluation Metrics
- Define metrics to evaluate my performance (e.g., average reward, win rate).
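
Both metrics are easier to read as moving averages over recent episodes than as lifetime totals. A minimal sketch, where the class name `RunningStats`, the window size, and the rule "a positive episode reward counts as a win" are all assumptions:

```python
from collections import deque


class RunningStats:
    """Track average reward and win rate over the most recent episodes."""

    def __init__(self, window=100):
        self.rewards = deque(maxlen=window)  # old episodes fall off automatically

    def record(self, episode_reward):
        self.rewards.append(episode_reward)

    def average_reward(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

    def win_rate(self):
        if not self.rewards:
            return 0.0
        return sum(1 for r in self.rewards if r > 0) / len(self.rewards)
```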

### 9. Hyperparameter Tuning
- Experiment with different learning rates, discount factors, and other hyperparameters to optimize performance.

### 10. Testing and Validation
- Test the trained agent to see how well it performs and validate that it is learning effectively.


## Q-Learning Algorithm

Q-Learning is a model-free reinforcement learning algorithm that learns a policy: a rule telling the agent which action to take in each state. It does so by estimating a function \( Q(s, a) \), the expected discounted return (the "quality") of taking action \( a \) in state \( s \) and acting greedily afterward.

### Outline

1. **Initialize Q-Table**: Create a table to store the Q-values for each state-action pair.
2. **Policy**: Define how the agent chooses an action (e.g., \(\epsilon\)-greedy).
3. **Learning**: Update the Q-values using the Q-Learning update rule.
4. **Training Loop**: Incorporate these elements into the game loop.
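
For reference, the update rule in step 3 is usually written in the following textbook form, where \( r \) is the reward received, \( s' \) is the next state, \( \alpha \) is the learning rate, and \( \gamma \) is the discount factor. The notebook's own implementation may differ in detail.

\[
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
\]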

The Q-table will be represented as a Python dictionary. The keys will be the states, and the values will be another dictionary mapping actions to Q-values.
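
A minimal sketch of that nested-dictionary layout, together with one standard Q-learning update step. The helper names `q_row` and `q_update` and the default values for `alpha` and `gamma` are hypothetical, not taken from the notebook.

```python
ACTIONS = (0, 1, 2)  # action keys as seen in the debug output


def q_row(q_table, state):
    """Return the {action: Q-value} dict for a state, creating it at zero if unseen."""
    return q_table.setdefault(state, {a: 0.0 for a in ACTIONS})


def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Move Q(state, action) toward reward + gamma * max over next-state Q-values."""
    best_next = max(q_row(q_table, next_state).values())
    target = reward + gamma * best_next
    row = q_row(q_table, state)
    row[action] += alpha * (target - row[action])
    return row[action]


q_table = {}
first = q_update(q_table, (2, 4, 3, 1), 1, reward=1.0, next_state=(2, 4, 2, 1))
```

A plain `dict` with `setdefault` is used here instead of a `defaultdict` with a lambda, because a lambda-backed `defaultdict` cannot be saved with `pickle`.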


## max() reference

| Iterable Type | Elements `max()` Compares | Example of Using `max()` |
|---------------|---------------------------|--------------------------|
| List          | Individual list elements  | `max([1, 2, 3])` returns `3` |
| Tuple         | Individual tuple elements | `max((1, 2, 3))` returns `3` |
| String        | Individual characters     | `max("abc")` returns `'c'` |
| Set           | Individual set elements   | `max({1, 2, 3})` returns `3` |
| Dictionary    | Dictionary keys           | `max({'a': 2, 'b': 1})` returns `'b'` |
| Dictionary `.values()` | Dictionary values | `max({'a': 2, 'b': 1}.values())` returns `2` |
| Dictionary with `key=` | Dictionary keys, ranked by their values | `d = {'a': 2, 'b': 1}; max(d, key=d.get)` returns `'a'` |
| NumPy Array (1-D) | Individual array elements | `import numpy as np; max(np.array([1, 2, 3]))` returns `3` |
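
Applied to a Q-table row, the dictionary forms in the table answer three different questions. The sketch below uses one row copied from the debug output earlier in the notebook; the variable names are illustrative only.

```python
# Q-values copied from one line of the debug output: state (2, 4, 3, 1), left paddle.
q_row = {0: 0.4330954136368752, 1: 0.0044115685137853244, 2: -0.5784402695339843}

best_action = max(q_row, key=q_row.get)                 # which action is best
best_value = max(q_row.values())                        # how good that action is
best_pair = max(q_row.items(), key=lambda kv: kv[1])    # both, as (action, value)

# On ties, max() returns the first maximal item it encounters.
untrained = {0: 0.0, 1: 0.0, 2: 0.0}
tie_winner = max(untrained, key=untrained.get)
```

The tie behavior matters for an untrained table: every unseen state resolves to the first action key, so without exploration the agent would repeat that one action.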


## Building intuition around training variables

1. **Alpha (α) - Learning Rate**: 
    - **What it does**: Determines how much of the new Q-value estimate I adopt.
    - **Intuition**: Think of it as a "blending factor." If α is 1, I consider only the most recent information. If α is 0, I learn nothing and stick to my prior knowledge. A value between 0 and 1 blends the old and new information.
    - **Example**: If α is high (closer to 1), I will rapidly adapt to new strategies but may also forget useful past knowledge quickly.

2. **Gamma (γ) - Discount Factor**: 
    - **What it does**: Scales how much future rewards contribute to the Q-value. A reward that arrives k steps later is weighted by γ^k.
    - **Intuition**: It's like a "patience meter." A high γ makes me prioritize long-term reward over short-term reward.
    - **Example**: If γ is close to 1, I will consider future rewards with greater weight, making me more strategic but potentially slower to train.

3. **Epsilon (ε) - Exploration Rate**: 
    - **What it does**: Sets the probability of picking a random action (exploration) instead of the best-known action (exploitation).
    - **Intuition**: It's like the "curiosity level." A high ε encourages me to try new things, while a low ε makes me stick to what I know.
    - **Example**: If ε starts high and decays over time (ε-decay), I will initially explore a lot and gradually shift to exploiting my learned knowledge.
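
The three intuitions above can be checked with small numbers. This is a sketch only: the function names and the decay schedule parameters (`start`, `end`, `decay`) are hypothetical, and the notebook may use different values or a different schedule.

```python
def blend(old_q, target, alpha):
    """Alpha as a blending factor between the old estimate and the new target."""
    return (1 - alpha) * old_q + alpha * target


def discounted_return(rewards, gamma):
    """Gamma as a patience meter: a reward k steps away is weighted by gamma**k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Epsilon as a curiosity level that shrinks each episode but never below `end`."""
    return max(end, start * decay ** episode)


keep_old = blend(0.2, 1.0, alpha=0.0)     # alpha 0: nothing is learned
take_new = blend(0.2, 1.0, alpha=1.0)     # alpha 1: old estimate is discarded
patient = discounted_return([0, 0, 1], gamma=0.9)
impatient = discounted_return([0, 0, 1], gamma=0.5)
```

The same reward two steps away is worth about 0.81 to the patient agent and 0.25 to the impatient one, which is why a high γ favors setting up a later point over an immediate gain.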
