# Reinforcement Learning 
---
## Introduction 

Reinforcement learning is a type of machine learning that allows an agent to learn from its environment through trial and error, rather than being explicitly taught. The agent learns from its past experiences and tries to capture the best possible knowledge to make accurate business decisions. Reinforcement learning is different from supervised learning in a way that in supervised learning the training data has the answer key with it so the model is trained with the correct answer itself whereas in reinforcement learning, there is no answer but the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its experience.

<img src="https://miro.medium.com/v2/resize:fit:735/1*bD2QuWCSVcnFH8j2iPMtbQ.png" width="400" >

---

## Reinforcement Learning Terminologies

To explain in more detail, let's break it down:

1. Agent: This is the "learner" or "decision-maker". In the tic-tac-toe example, the agent would be the program playing the game.

2. Environment: Everything that the agent interacts with falls under this. In the tic-tac-toe example, this would be the game board.

3. Actions: These are the set of all possible things the agent can do. For tic-tac-toe, an action would be placing a circle or a cross in a certain spot on the board.

4. State: This is the current situation the agent finds itself in. The state of a tic-tac-toe game could be the current configuration of the game board.

5. Reward: Feedback given to the agent after each action it takes. This could be positive (if the action was good) or negative (if the action was bad). For instance, in tic-tac-toe, winning the game might yield a reward of +1, losing might yield -1, and a draw might yield 0.

The objective of the agent is to learn a policy, which is a strategy that dictates which action to take in each state so as to maximize the sum of rewards over time. This is achieved via exploration (trying out new actions to see their effect) and exploitation (using actions that are known to yield high rewards).

---

## Implementing Tabular Methods - tic-tac-toe

I will Apply tabular methods for solving a well known RL problems, tic-tac-toe, like the Q-learning algorithm. In this context, a table is used to store the value (the expected future reward) of each action in each state. Initially, this table is initialized arbitrarily, and then it's updated iteratively based on the rewards the agent receives as it interacts with the environment. Over time, the agent learns to choose the action with the highest value in each state, thereby maximizing its total reward.



In [8]:
#define the environment

class TicTacToe:
    def __init__(self):
        self.state = ' ' * 9
        self.player = 'X'

    def available_actions(self):
        return [i for i, spot in enumerate(self.state) if spot == ' ']

    def update_state(self, action):
        state = list(self.state)
        state[action] = self.player
        self.state = ''.join(state)
        self.player = 'O' if self.player == 'X' else 'X'

    def is_done(self):
        winning_spots = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]
        for spot in winning_spots:
            if self.state[spot[0]] == self.state[spot[1]] == self.state[spot[2]] != ' ':
                return True
        return ' ' not in self.state

    def reward(self):
        if self.is_done():
            if self.player == 'O':
                return -1
            elif self.player == 'X':
                return 1
        return 0


In [9]:
import numpy as np

class QLearningAgent:
    def __init__(self, epsilon=0.2, alpha=0.3, gamma=0.9):
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        self.q_table = {}

    def get_q_value(self, state, action):
        return self.q_table.get((state, action), 0.0)

    def choose_action(self, state, available_actions):
        if np.random.random() < self.epsilon:
            return np.random.choice(available_actions)
        else:
            q_values = [self.get_q_value(state, action) for action in available_actions]
            return available_actions[np.argmax(q_values)]

    def learn(self, state, action, reward, next_state, next_actions):
        old_q_value = self.get_q_value(state, action)
        if next_actions:
            max_next_q_value = max([self.get_q_value(next_state, next_action) for next_action in next_actions])
        else:
            max_next_q_value = 0
        self.q_table[(state, action)] = (1 - self.alpha) * old_q_value + self.alpha * (reward + self.gamma * max_next_q_value)


In [10]:
N_EPISODES = 10000

agent = QLearningAgent()

for episode in range(N_EPISODES):
    game = TicTacToe()
    state = game.state
    done = False
    while not done:
        action = agent.choose_action(state, game.available_actions())
        game.update_state(action)
        reward = game.reward()
        next_state = game.state
        done = game.is_done()
        agent.learn(state, action, reward, next_state, game.available_actions())
        state = next_state


In [4]:
def play(agent, human_starts=True):
    game = TicTacToe()
    if human_starts:
        game.player = 'O'

    while not game.is_done():
        print(game.state[:3] + "\n" + game.state[3:6] + "\n" + game.state[6:])
        if game.player == 'X':
            action = agent.choose_action(game.state, game.available_actions())
            game.update_state(action)
        else:
            action = int(input("Choose your action (0-8): "))
            while action not in game.available_actions():
                print("Invalid action.")
                action = int(input("Choose your action (0-8): "))
            game.update_state(action)

    if game.reward() == 1:
        print("The agent has won!")
    elif game.reward() == -1:
        print("You've won!")
    else:
        print("It's a draw!")


In [7]:
play(agent, human_starts=False)

   
   
   
X  
   
   
X O
   
   
XXO
   
   
XXO
O  
   
XXO
O X
   
XXO
OOX
   
XXO
OOX
 X 
The agent has won!
