# Tic-Tac-Toe: a human player versus an AI that uses Q-learning

## How to play this game

#### Start the Game: Make your move by entering a row and a column (each 0-2), separated by a space, to place your marker (1). Cells are numbered from the top-left (0 0) to the bottom-right (2 2).
#### Understand the AI: You're playing against an AI (-1) that picks its moves with an ε-greedy policy over a table of Q-learning values.
#### Play to Win: Aim to align three of your markers (1) in a row, column, or diagonal to win.
#### Game End: The game ends when one player wins by aligning three markers or when all grid spaces are filled, resulting in a draw.

import numpy as np
import random

# Initialize the Q-values dictionary to store state-action pairs and their values
Q = {}

# Learning parameters
learning_rate = 0.1
discount_factor = 0.9
exploration_rate = 0.1

# Initialize the game state as a 3x3 matrix of zeros
def initialize_game_state():
    return np.zeros((3, 3), dtype=int)

# Define possible actions based on the current state (empty cells)
def available_actions(state):
    return [(i, j) for i in range(3) for j in range(3) if state[i][j] == 0]

# Function to calculate the reward for the current state
def calculate_reward(state, player):
    # Check for a win for the current player
    if (np.any(np.all(state == player, axis=0)) or
        np.any(np.all(state == player, axis=1)) or
        np.all(np.diag(state) == player) or
        np.all(np.diag(np.fliplr(state)) == player)):
        return 1  # Reward for winning
    if np.all(state != 0):
        return 0.5  # Reward for a tie
    return 0  # No reward if the game is still ongoing

# Epsilon-greedy algorithm to choose the next action
def select_action(state, exploration_rate):
    if random.random() < exploration_rate:
        return random.choice(available_actions(state))
    else:
        # Return the best action based on Q-values
        return max(available_actions(state), key=lambda action: Q.get((tuple(map(tuple, state)), action), 0))

# Q-learning update formula
def update_Q_values(prev_state, action, reward, next_state):
    prev_state = tuple(map(tuple, prev_state))
    next_state = tuple(map(tuple, next_state))
    best_next_action = max(available_actions(next_state), key=lambda x: Q.get((next_state, x), 0), default=(0, 0))
    td_target = reward + discount_factor * Q.get((next_state, best_next_action), 0)
    td_error = td_target - Q.get((prev_state, action), 0)
    Q[(prev_state, action)] = Q.get((prev_state, action), 0) + learning_rate * td_error

# Main function to execute the game loop
def play_game():
    state = initialize_game_state()
    game_over = False
    current_player = 1  # Start with player 1

    while not game_over:
        print("\n Active status:\n", state)
        if current_player == 1:
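            try:
                row, col = map(int, input("\n Input row followed by column (0-2), divided by a space: ").split())
                # Reject out-of-range values explicitly; negative indices would otherwise wrap around
                if not (0 <= row <= 2 and 0 <= col <= 2):
                    raise ValueError("Row and column must each be between 0 and 2.")
                if state[row][col] != 0:
                    raise ValueError("This cell is already taken.")
                state[row][col] = current_player  # Player's move
            except ValueError as e:
                # Also raised by int() and tuple unpacking for malformed input such as "a b" or "1"
                print(str(e), "Please try again.")
                continue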
        else:
            print("\n AI is making a move")
            action = select_action(state, exploration_rate)
            state[action[0]][action[1]] = current_player  # AI's move

        reward = calculate_reward(state, current_player)
        if reward == 1:  # A draw returns 0.5 and must not be reported as a win
            print("You" if current_player == 1 else "AI", "win!")
            game_over = True
        elif np.all(state != 0):
            print("It's a draw!")
            game_over = True
        else:
            current_player = -1 if current_player == 1 else 1  # Switch player

if __name__ == "__main__":
    play_game()



 Active status:
 [[0 0 0]
 [0 0 0]
 [0 0 0]]

 Input row followed by column (0-2), divided by a space: 1 0

 Active status:
 [[0 0 0]
 [1 0 0]
 [0 0 0]]

 AI is making a move

 Active status:
 [[-1  0  0]
 [ 1  0  0]
 [ 0  0  0]]

 Input row followed by column (0-2), divided by a space: 1 1

 Active status:
 [[-1  0  0]
 [ 1  1  0]
 [ 0  0  0]]

 AI is making a move

 Active status:
 [[-1 -1  0]
 [ 1  1  0]
 [ 0  0  0]]

 Input row followed by column (0-2), divided by a space: 1 2
You win!


### Breaking down the code step-by-step:

Imports and Initial Setup:
numpy: Used for matrix operations, which are ideal for representing the game board.
random: Used for generating random numbers, which are essential for implementing exploration in the Q-learning algorithm.

Q-dictionary Initialization:
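Q: A dictionary that maps (state, action) pairs to Q-values. States are converted to tuples of tuples so that they can be used as dictionary keys, and any pair that has not been seen yet is treated as having a value of 0.

Learning Parameters: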

learning_rate (α): The rate at which the AI learns. A higher rate means the AI updates its Q-values more aggressively based on new information.

discount_factor (γ): Represents the importance of future rewards. A factor close to 1 means future rewards are almost as important as immediate rewards.

exploration_rate (ε): The probability of choosing a random action. This parameter balances exploration (trying new actions) with exploitation (choosing known best actions).

initialize_game_state(): Initializes the game board as a 3x3 matrix filled with zeros. Here 0 represents an empty cell, 1 the human player, and -1 the AI.

available_actions(state): Returns a list of all possible actions (empty cells) available in the current state of the board.
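As a quick illustration of the board representation and the action list, the following sketch repeats the two helpers from the listing and applies them to a hand-built position. The position itself is made up for the example.

```python
import numpy as np

def initialize_game_state():
    return np.zeros((3, 3), dtype=int)

def available_actions(state):
    return [(i, j) for i in range(3) for j in range(3) if state[i][j] == 0]

board = initialize_game_state()
board[1][1] = 1    # human takes the centre
board[0][0] = -1   # AI takes the top-left corner

actions = available_actions(board)
print(len(actions))   # 7 empty cells remain
print(actions[:3])    # [(0, 1), (0, 2), (1, 0)] (row-major order)
```

The row-major ordering matters later: when several actions have the same Q-value, the greedy choice returns the first one in this list.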

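calculate_reward(state, player): Calculates the reward for the given state and player. The player gets 1 for a completed row, column, or diagonal, 0.5 when the board is full and that player has no line, and 0 otherwise. The function only looks for lines belonging to the given player, so it does not detect a win by the opponent.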
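The three reward cases can be checked on small hand-built boards. The sketch below copies calculate_reward from the listing; the two example positions are invented for illustration.

```python
import numpy as np

def calculate_reward(state, player):
    if (np.any(np.all(state == player, axis=0)) or
        np.any(np.all(state == player, axis=1)) or
        np.all(np.diag(state) == player) or
        np.all(np.diag(np.fliplr(state)) == player)):
        return 1
    if np.all(state != 0):
        return 0.5
    return 0

# Human (1) has completed the top row
top_row_win = np.array([[ 1,  1,  1],
                        [-1, -1,  0],
                        [ 0,  0,  0]])

# Full board with no completed line for either side
full_board = np.array([[ 1, -1,  1],
                       [ 1, -1, -1],
                       [-1,  1,  1]])

print(calculate_reward(top_row_win, 1))    # 1: the human has a line
print(calculate_reward(top_row_win, -1))   # 0: no line for the AI, board not full
print(calculate_reward(full_board, 1))     # 0.5: draw
```

The second call shows that a loss is not penalised: from the AI's point of view a lost position and an ongoing game both score 0.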

select_action(state, exploration_rate): Implements the ε-greedy strategy for action selection:
With probability ε, it chooses a random action (exploration),
With probability 1-ε, it chooses the best action based on the Q-values (exploitation).
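The two branches of the ε-greedy rule can be made visible by forcing ε to 0 (always exploit) and to 1 (always explore). The sketch below is a lightly restructured copy of select_action from the listing; the Q-value of 0.8 is a made-up number used only to show the effect of a learned entry.

```python
import random
import numpy as np

Q = {}

def available_actions(state):
    return [(i, j) for i in range(3) for j in range(3) if state[i][j] == 0]

def select_action(state, exploration_rate):
    actions = available_actions(state)
    if random.random() < exploration_rate:
        return random.choice(actions)            # explore
    key = tuple(map(tuple, state))               # hashable form of the board
    return max(actions, key=lambda action: Q.get((key, action), 0))  # exploit

state = np.zeros((3, 3), dtype=int)
key = tuple(map(tuple, state))

# With an empty Q table every action scores 0, and max() returns the first
# candidate, so the greedy choice is the top-left cell.
greedy_untrained = select_action(state, exploration_rate=0.0)
print(greedy_untrained)   # (0, 0)

# Hypothetical learned value: pretend the centre has turned out to be good.
Q[(key, (1, 1))] = 0.8
greedy_trained = select_action(state, exploration_rate=0.0)
print(greedy_trained)     # (1, 1)

# exploration_rate=1.0 always explores, so any empty cell may be returned.
explored = select_action(state, exploration_rate=1.0)
print(explored)
```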

update_Q_values(prev_state, action, reward, next_state): Updates the Q-values using the Q-learning formula:
TD Target: The sum of the reward and the discounted maximum Q-value of the next state,
TD Error: The difference between the TD target and the current Q-value of the action taken,
Updates the Q-value by adjusting it in the direction of the TD error, scaled by the learning rate.
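To make the update rule concrete, the sketch below runs update_Q_values (copied from the listing) on one transition and traces the arithmetic in comments. The pre-seeded value of 0.5 for the next state is a made-up number, chosen only so that the discounted term is non-zero.

```python
import numpy as np

Q = {}
learning_rate = 0.1
discount_factor = 0.9

def available_actions(state):
    return [(i, j) for i in range(3) for j in range(3) if state[i][j] == 0]

def update_Q_values(prev_state, action, reward, next_state):
    prev_state = tuple(map(tuple, prev_state))
    next_state = tuple(map(tuple, next_state))
    best_next_action = max(available_actions(next_state),
                           key=lambda x: Q.get((next_state, x), 0), default=(0, 0))
    td_target = reward + discount_factor * Q.get((next_state, best_next_action), 0)
    td_error = td_target - Q.get((prev_state, action), 0)
    Q[(prev_state, action)] = Q.get((prev_state, action), 0) + learning_rate * td_error

prev_state = np.zeros((3, 3), dtype=int)
action = (1, 1)
next_state = prev_state.copy()
next_state[1][1] = -1                      # the AI plays the centre

prev_key = tuple(map(tuple, prev_state))
next_key = tuple(map(tuple, next_state))
Q[(next_key, (0, 0))] = 0.5                # hypothetical value already in the table

update_Q_values(prev_state, action, reward=0, next_state=next_state)

# td_target = 0 + 0.9 * 0.5       = 0.45
# td_error  = 0.45 - 0            = 0.45
# new Q     = 0 + 0.1 * 0.45      = 0.045 (up to floating-point rounding)
new_value = Q[(prev_key, action)]
print(round(new_value, 6))
```

With a learning rate of 0.1, the stored value moves only a tenth of the way toward the target on each update, so many visits are needed before it settles.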

play_game(): The main game loop, which runs until the game ends in a win or a draw. The human player (1) moves first and enters each move at the console; the AI (-1) then picks its move with select_action. After each move, the loop checks for a win or a full board, and otherwise hands the turn to the other player.

Execution: The script starts by initializing the game state and running the main game loop. It continuously updates the board and switches between players until the game concludes with a win or a draw.

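This implementation provides the building blocks of a Q-learning Tic-Tac-Toe opponent: a Q table, an ε-greedy policy, and an update rule. Note that play_game() as listed never calls update_Q_values, so Q stays empty during play and the greedy choice falls back to the first empty cell in row-major order, which is consistent with the AI playing (0, 0) and then (0, 1) in the sample game above. For the AI to improve over time, update_Q_values has to be called after its moves, for example during a training phase that runs before the interactive game.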
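One possible way to fill the Q table is to let the AI play many games against a random opponent before facing a human. The following is a minimal sketch under several assumptions that are not part of the original listing: the opponent moves uniformly at random, a loss is rewarded with -1, the update is applied after the opponent's reply so that the next state is again the AI's turn, and the helper names has_won, key_of, and train_episode as well as the terminal flag are introduced here for illustration.

```python
import random
import numpy as np

Q = {}
learning_rate, discount_factor, exploration_rate = 0.1, 0.9, 0.1

def available_actions(state):
    return [(i, j) for i in range(3) for j in range(3) if state[i][j] == 0]

def has_won(state, player):  # hypothetical helper: win check only
    return bool(np.any(np.all(state == player, axis=0)) or
                np.any(np.all(state == player, axis=1)) or
                np.all(np.diag(state) == player) or
                np.all(np.diag(np.fliplr(state)) == player))

def key_of(state):  # hypothetical helper: hashable board
    return tuple(map(tuple, state.tolist()))

def select_action(state, epsilon):
    actions = available_actions(state)
    if random.random() < epsilon:
        return random.choice(actions)
    k = key_of(state)
    return max(actions, key=lambda a: Q.get((k, a), 0))

def update_Q_values(prev_state, action, reward, next_state, terminal):
    prev_k, next_k = key_of(prev_state), key_of(next_state)
    # No bootstrapping from a finished game
    future = 0.0 if terminal else max(
        (Q.get((next_k, a), 0) for a in available_actions(next_state)), default=0.0)
    old = Q.get((prev_k, action), 0)
    Q[(prev_k, action)] = old + learning_rate * (reward + discount_factor * future - old)

def train_episode():
    state = np.zeros((3, 3), dtype=int)
    r, c = random.choice(available_actions(state))
    state[r][c] = 1                        # random stand-in for the human moves first
    while True:
        prev_state = state.copy()
        action = select_action(state, exploration_rate)
        state[action[0]][action[1]] = -1   # AI move
        if has_won(state, -1):
            update_Q_values(prev_state, action, 1.0, state, terminal=True)
            return
        r, c = random.choice(available_actions(state))
        state[r][c] = 1                    # opponent's reply
        if has_won(state, 1):
            update_Q_values(prev_state, action, -1.0, state, terminal=True)  # assumed penalty
            return
        if not available_actions(state):
            update_Q_values(prev_state, action, 0.5, state, terminal=True)   # draw
            return
        update_Q_values(prev_state, action, 0.0, state, terminal=False)

random.seed(0)
for _ in range(2000):
    train_episode()
print(len(Q), "state-action pairs learned")
```

Because the AI moves second, the board can only fill up on the opponent's move, which is why the draw check appears only after the reply. A policy trained against a random opponent is not guaranteed to play well against a careful human; the episode count and the reward values are starting points to experiment with rather than tuned settings.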