# Lab 3: Agents and Environments

In the lecture we have seen multiple different agent types. 
In this lab you will implement the first two:
- A simple reflex agent
- A model-based reflex agent

We will use two environments, one for each agent. 

In the first one, the simple reflex agent will be playing the game "TicTacToe" and in the second, you will implement a model-based agent that can play Rock-Paper-Scissors. 

*Note*: Those agents will not be learning automatically, you will create them on a rule-based system.
We will revisit automatic learning in the later lectures, under the topic: Reinforcement Learning.


After this lab you will:
- Understand how to implement and use environments (which will be useful when we advance to reinforcement learning)
- Have practiced using Numpy and Matplotlib to modify arrays and plot results, which will be handy when working further with machine learning
- And if you are still starting off with Python, you will have seen/used Python classes and functions, and should hopefully be also able to implement them in the future



## Environments
Remember from the lecture, environments are the task to which **agent** are the solution. 
The **agent** takes an **action** (which is the input into the environment), upon which the environment is updated. 
In reinforcement learning we also have a reward for each step, although we will not use that in this lab. 




### TicTacToe Environment
The following code implements the TicTacToe environment.

Read and try to understand the code and its functions. 
Each environment is initialized with its initial values, and a *step* outlines an agent-interaction with an environment.
We then also have a function to reset an environment to its initial state (good for repeating interactions) and also have a get_actions function which lets agent's know what actions they can take.


In [44]:
import numpy as np


class TicTacToeEnvironment:
    def __init__(self):
        self.board = [" "] * 9
        self.done = False
        self.reset()

    def step(self, index: int, player):
        # take a step in the environment
        assert not self.done, "Game is over"
        assert player in ("X", "O"), "Invalid player"
        assert self.board[index] not in ("X", "O"), "Square already taken"

        self.board[index] = player

        return self.board, self.get_actions()

    def get_actions(self) -> list:
        return [i for i in range(len(self.board)) if self.board[i] == " "]

    def reset(self):
        # reset the board/environment
        self.done = False
        self.board = [" "] * 9

    def check_winner(self) -> str:
        # check if there is a winner, returns X, O, D (draw) or N (none) if there is no winner
        board = np.reshape(self.board, (3, 3))

        self.done = True
        # check rows/columns
        for b in [board, np.transpose(board)]:
            for row in b:
                if len(set(row)) == 1 and row[0] != " ":
                    return row[0]

        # check diagonals
        if len(set([board[i][i] for i in range(3)])) == 1 and board[0][0] != " ":
            return board[0][0]
        if len(set([board[i][2 - i] for i in range(3)])) == 1 and board[0][2] != " ":
            return board[0][2]

        # check for draw
        if len(self.get_actions()) == 0:
            return "Draw"
        self.done = False
        return "N"

    def show_board(self):
        print(self.board[0] + " | " + self.board[1] + " | " + self.board[2])
        print("---------")
        print(self.board[3] + " | " + self.board[4] + " | " + self.board[5])
        print("---------")
        print(self.board[6] + " | " + self.board[7] + " | " + self.board[8])
        print("")


### Simple Reflex Agent Interaction with Environment
The following code shows how the agent interacts with the environment. Again, try to read and understand the code and how it has been implemented.
The agent percept's the board and the action is a turn of an agent.

In [57]:
def play_ttt(env, agent1, agent2, iterations: int = 100):
    # play games of tic tac toe
    results = {"X": 0, "O": 0, "Draw": 0}
    for i in range(iterations):
        # Alternate starting turn between agents
        if i % 2 == 0:
            turn = agent1
            no_turn = agent2
        else:
            turn = agent2
            no_turn = agent1

        board = env.board
        actions = env.get_actions()

        while True:
            # play a turn
            action, player = turn(board, actions)
            board, actions = env.step(action, player)
            env.show_board()
            # check the winner
            winner = env.check_winner()
            if winner != "N":
                print("Winner is " + winner)
                results[winner] += 1
                env.reset()
                break
            # switch players
            turn, no_turn = no_turn, turn
    print(results)


#### Random Reflex Based Agent
As you can see, the agent observes the environment (e.g. the current state of the environment) and then performs an action based upon that. 

In this basic example, the agent simply picks a random action. This is suboptimal but can serve as a performance baseline.

In [58]:
def random_agent(state, actions: list) -> int:
    # random agent
    return np.random.choice(actions)


#### Manual Agent
This code implements a manual agent (e.g. let's you interact with the environment). 

Use this code for testing.


In [59]:
from IPython.display import clear_output


def manual_agent_ttt(state, actions: list) -> int:
    clear_output(wait = True)
    print("-------------------")
    print("Current Board")
    print(np.array(state).reshape(3, 3))
    print(f"Available Actions{actions}")
    action = int(input("Enter action: "))
    print(f"Chosen Action:{action}")
    return action


#### Running the Agent-Environment Interactions

The following code contains two functions. These are wrapper functions to assign an agent the "X" or "O" player. Change the return statement if you want to switch agents around.
The call to the *play_ttt* function lets the agents interact with this environment. In this particular case, it is a Multi-Agent system, as now, several agents are interacting with one another in this environment.

In [60]:
def playerX_ttt(state, actions: list)  -> tuple:
    # uses an agent and adds X as the player identifier to the output
    return random_agent(state, actions) , "X"


def playerO_ttt(state, actions: list) -> tuple:
    # uses an agent and adds O as the player identifier to the output
    return random_agent(state, actions) , "O"


play_ttt(TicTacToeEnvironment(), playerX_ttt, playerO_ttt, iterations=13)


  |   |  
---------
  |   |  
---------
  | X |  

  |   |  
---------
  |   |  
---------
  | X | O

  |   |  
---------
  | X |  
---------
  | X | O

  | O |  
---------
  | X |  
---------
  | X | O

  | O | X
---------
  | X |  
---------
  | X | O

  | O | X
---------
O | X |  
---------
  | X | O

  | O | X
---------
O | X |  
---------
X | X | O

Winner is X
  |   |  
---------
O |   |  
---------
  |   |  

  |   |  
---------
O |   |  
---------
X |   |  

  |   |  
---------
O | O |  
---------
X |   |  

  |   | X
---------
O | O |  
---------
X |   |  

  |   | X
---------
O | O |  
---------
X |   | O

  |   | X
---------
O | O |  
---------
X | X | O

  |   | X
---------
O | O | O
---------
X | X | O

Winner is O
  |   |  
---------
  | X |  
---------
  |   |  

  |   |  
---------
  | X |  
---------
  |   | O

  |   |  
---------
  | X |  
---------
  | X | O

O |   |  
---------
  | X |  
---------
  | X | O

O | X |  
---------
  | X |  
---------
  | X | O

Winner 

# Exercises
## General
- Run the code above, try to understand it and change some variables such as the iterations in the call to the *play* function
- Add further comments to the code if it is not clear (ask the lab tutor if you have problems understanding parts of it)
- For any exercises you can add markdown or code blocks as you wish
- The more often you let agents *play* against one another, the larger the sample size is. This could provide more accurate statistics for the results. 

## Exercise 1: Simple Reflex Agent
1) Create a new agent function with the same format as the Random Reflex Agent above. Implement a rule-based strategy that always tries to start with placing a mark at a particular location (if available). Test this against the purely random reflex agent. How does this fare?
2) Think about the strategy that you would use to win this game. Implement this as a new agent and test how it fares against the previous two, and against itself. 
3) If you have built a successful agent ion the previous step, see if you can test it against the strategy of one of your peers. 

In [67]:
def rule_start_reflex_agent_ttt(state, actions: list) -> int:
    # rule-based strategy that always tries to start with placing a mark at a particular location
    # TODO: your code for Exercise 1.1 here
    # action = np.insert(board, none, [['O']], axis=[1,1])
    if 4 in actions:
        return 4
    else:
        return np.random.choice(actions)


def your_strategy_reflex_agent_ttt(state, actions: list) -> int:
    # TODO: your code for Exercise 1.2 here
    if not 4 in actions:
        if 0 in actions:
            return 0
        elif 2 in actions:
            return 2
        elif 6 in actions:
            return 6
        elif 8 in actions:
            return 8
        else:
            return np.random.choice(actions)
    elif 4 in actions:
        return 4
    else:
        return np.random.choice(actions)


def playerX_ttt(state, actions: list) -> tuple:
    # uses an agent and adds X as the player identifier to the output
    return random_agent(state, actions) , "X"


def playerOttt(state, actions: list) -> tuple:
    # uses an agent and adds O as the player identifier to the output
    return rule_start_reflex_agent_ttt(state, actions) , "O"


play_ttt(TicTacToeEnvironment(), playerX_ttt, playerO_ttt, iterations=10)


  |   |  
---------
  |   |  
---------
  | X |  

  |   |  
---------
  |   | O
---------
  | X |  

  | X |  
---------
  |   | O
---------
  | X |  

  | X |  
---------
  | O | O
---------
  | X |  

  | X |  
---------
  | O | O
---------
X | X |  

  | X | O
---------
  | O | O
---------
X | X |  

  | X | O
---------
  | O | O
---------
X | X | X

Winner is X
O |   |  
---------
  |   |  
---------
  |   |  

O |   |  
---------
  |   | X
---------
  |   |  

O |   |  
---------
  |   | X
---------
  | O |  

O |   | X
---------
  |   | X
---------
  | O |  

O |   | X
---------
O |   | X
---------
  | O |  

O |   | X
---------
O |   | X
---------
X | O |  

O |   | X
---------
O | O | X
---------
X | O |  

O |   | X
---------
O | O | X
---------
X | O | X

Winner is X
  | X |  
---------
  |   |  
---------
  |   |  

O | X |  
---------
  |   |  
---------
  |   |  

O | X |  
---------
  |   | X
---------
  |   |  

O | X |  
---------
  |   | X
---------
  |   | O

O | X |

### Rock-Paper-Scissors Environment
The following code partially implements an environment that lets two agents play Rock-Paper-Scissors using Best-Of-Five (whoever wins three rounds first).


In [None]:
import numpy as np

ACTION_DCT = {0: "Rock", 1: "Paper", 2: "Scissors"}
WIN_DCT = {"Agent1":0, "Agent2":1}

class RockPaperScissorsEnvironment:
    def __init__(self):
        self.done = False
        self.reset()
        
        self.agent_1_score = 0
        self.agent_2_score = 0
        self.total_rounds = 5
        self.done = False
        self.history = []

    def step(self, agent_1_action:int, agent_2_action:int):
        # TODO: create an assertion that both actions are valid. Either raise an exception (https://docs.python.org/3/tutorial/errors.html) or use assert (https://www.w3schools.com/python/ref_keyword_assert.asp)
        # TODO: stop this function form being executed if the game is done. You should use the self.done variable to do this.
        # Hint: use the ACTION_DCT variable to translate from int to string
        
        wins_against = {
            "Paper": "Scissors",
            "Rock": "Paper",
            "Scissors": "Rock"
        }

        # TODO: create code that checks who wins the round and updates the score accordingly.
        # A draw will just incur a repretition of the round, there is no winner. 

        # TODO: create code that stores the history of the game. You should use the self.history variable for this. The history should be a list of tuples. Each tuple should contain the actions of both agents and the winner of the round. For example: [(0, 1, 1), (1, 0, 2), (2, 2, 0)]
        # Note: the history does not need to be updated on a draw
        # Hint: use the ACTION_DCT variable to translate from int to string for the winner if you want to parse the history
    
        self.check_done()

        return winner, self.history

    def get_actions(self) -> list:
        # TODO: return available actions as a list
        return None
    
    def check_done(self):
        # TODO: check if the game is done and set the self.done variable accordingly
        pass

    def reset(self):
        # TODO: code to reset the scores, done and the history 
        pass



def show_history(history):
    # helper function to show the history
    for ix, h in enumerate(history):
        print(f"Round {ix+1}: Agent 1:{ACTION_DCT[h[0]]}, Agent 2: {ACTION_DCT[h[1]]} -- Winner Agent {h[2]}")


def play_rps(agent_1, agent_2, iterations=100, verbose=False):
    """Code to play the Rock Paper Scissors Environment

    Args:
        agent_1 (func): Python function for the first agent
        agent_2 (func): Python function for the second agent
        iterations (int, optional): How many rounds of best-of-five to play. Defaults to 100.
        verbose (bool, optional): Whether to show the result after every round. Defaults to False.
    """
    env = RockPaperScissorsEnvironment()
    results = {"Agent1": 0, "Agent2": 0}
    history = []
    for i in range(iterations):
        
        while not env.done:
            agent_1_action = agent_1(history, env.get_actions())
            agent_2_action = agent_2(history, env.get_actions())
            winner, history = env.step(agent_1_action, agent_2_action)

        if env.agent_1_score > env.agent_2_score:
            results["Agent1"] += 1
            winner = "Agent 1"
        else:
            results["Agent2"] += 1
            winner = "Agent 2"

        if verbose:
            print("#"*72)
            print(f"Iteration {i+1} Done. Game Summary:")
            show_history(history)
            print(f"Winner: {winner}")
            print(f"Current Score: Agent 1: {results['Agent1']}, Agent 2: {results['Agent2']}")
            print("#"*72)


        
        env.reset()

    print(results)

# run the environment and play two random agents against one another
play_rps(random_agent, random_agent, iterations=1000)


## Exercise 2: Model Based Reflex Agent
1) Complete the code above to create the RockPaperScissors Environment.
2) Test the environment by creating a manual agent, and run the *play_rps* function with that agent.
3) Implement a model-based agent that can play the game. It should look at the history of the game and use this to make decisions.

In [None]:
def manual_agent_rps(history, actions: list) -> int:
    # TODO: your code for Exercise 2.2. here
    # Note: modify the code from manual_agent_ttt()

    return None


def your_strategy_model_based_reflex_agent_rps(history, actions:list) -> int:
    # TODO: your code for Exercise 2.3 here
    return None


play_rps(random_agent, manual_agent_rps, iterations=1000, verbose=True)


# Exercise 3: Model Based Recurrent Neural Network Agent (Extra Exercise)
1. The code below contains an agent based on recurrent neural networks (RNN's) that has been trained on thousands of games of rock-paper-scissors (adapted from https://github.com/PaulKlinger/rps-rnn). Test (and adapt the strategy) of your model-based agent to play, and win, against this RNN agent.


In [None]:
# you may need to install a tensorflow version >2. in your conda environment to run this RNN Agent
# this code does not need to be modified or changed
import tensorflow as tf

def build_deep_model(state_dims, batch_size, stateful=False):
    # source: https://github.com/PaulKlinger/rps-rnn
    return tf.keras.Sequential([
        tf.keras.layers.SimpleRNN(state_dims[0], batch_input_shape=[batch_size, None, 6],
                                 return_sequences=True,  stateful=stateful, activation="softsign"),
    ] + [tf.keras.layers.SimpleRNN(s, return_sequences=True, stateful=stateful, activation="softsign") 
         for s in state_dims[1:]
        ] + [
        tf.keras.layers.Dense(3),
        tf.keras.layers.Softmax()
    ])

deep_model_3l = build_deep_model([10,10,10], 1)
# make sure you have downloaded the weights from GCU-Learn
deep_model_3l.load_weights("deep_3l_s10_softsign_sim.h5")

def make_input_vector(opponent_move=None, model_move=None):
    # source: https://github.com/PaulKlinger/rps-rnn
    if opponent_move is None and model_move is None:
        return np.zeros((1,1,6)).astype(np.float32)
    elif opponent_move is None or model_move is None:
        raise ValueError
    move_ids = [[1,0,0],[0,1,0],[0,0,1]]
    return np.array([[move_ids[opponent_move] + move_ids[model_move]]]).astype(np.float32)

def rnn_agent(history, actions) -> int:
    # RNN based agent. Note that this agent needs to be Agent 1 for the environment. 

    # parse history
    if len(history) == 0:
        action = np.random.choice(actions)
    else:
        # convert game history into format for RNN
        inp = np.zeros([1,len(history),6], dtype=np.float64)
        for ix, h in enumerate(history):
            a1, a2, _ = h
            inp[0,ix] = make_input_vector(a2, a1)

        prediction = deep_model_3l(inp).numpy()

        # sample from softmax with temperature to avoid being stuck in an infinite loop if two RNN's play against one another
        temperature = 0.1
        pred = np.log(prediction[0,-1]) / temperature
        exp = np.exp(pred)
        pred = exp / np.sum(exp)
        action = np.argmax(np.random.multinomial(1, pred, 1)[0])
    # make RNN start from scratch every iteration
    deep_model_3l.reset_states()
    return action




In [None]:
# Run this to test your agent against the RNN agent. 
# Please note, due to current limitations this agent only works if it is agent1 (e.g. the first function argument)
# change the random agent against the model based agent you have built (your_strategy_model_based_reflex_agent_rps()).
play_rps(rnn_agent, random_agent, iterations=10)
