# Regret Matching with Rock, Paper, Scissors
---
### Regret
For a group of players let $s_i$ be the action player by player $i$ and $s_{-i}$ the actions played by the remaining players.
Together, these actions form an action profile $a \in A$.
For all other actions $s_i'$ we could've played, we define the **regret of not playing action $s'_i$** as a difference in utility:  

<center>
$
\begin{align*}
    regret(s_i', a) &= u(s_i', s_{i-1}) - u(a)
\end{align*}
$
</center>  

For example, if we play scissors and our opponent plays rock, then $a=(scissors, rock)$ and our utility for this play is $u(scissors, rock) = -1$.
We can compute the regret for all of our possible actions to find:  
<center>
$
\begin{align*}
    regret(rock, a) &= u(rock, rock) - u((scissors, rock)) = 0 - (-1) = 1 \\
    regret(paper, a) &= u(paper, rock) - u((scissors, rock)) = 1 - (-1) = 2 \\
    regret(scissors, a) &= u(scissors, rock) - u((scissors, rock)) = -1 - (-1) = 0
\end{align*}
$  
</center>  

Thus we regret not playing "paper" the most, and regret not playing "rock" more than playing "scissors".Note that when $s_i' = s_i$ the regret is zero. 

---
### Regret Matching

Actions that have positive regret is an indicator that we should've chosen these actions to maximize our utility. 
Thus if we track the regret for each action, if we choose actions at random with probability proportional to how positive their regret is, we can hopefully maximize our utility. Actions with negative regret are given zero probability.
Such a weighting is called **regret matching**.  

If we play another game, using the above regrets we play "paper" with probability $\frac{2}{3}$ and "rock" with probability $\frac{1}{3}$. Suppose we play "paper" while our opponent plays "scissors". 
Thus our regret for this game is:
<center>
$
\begin{align*}
    regret(rock, a) &= u(rock, paper) - u((paper, scissors)) = -1 - (-1) = 0 \\
    regret(paper, a) &= u(paper, paper) - u((paper, scissors)) = 0 - (-1) = 1 \\
    regret(scissors, a) &= u(scissors, paper) - u((paper, scissors)) = 1 - (-1) = 2
\end{align*}
$  
</center>  

If we add these regrets to our previous regrets, we can compute the **cumulative regrets** of $(1,3,2)$ respectively, which is normalized to $(\frac{1}{6},\frac{3}{6},\frac{2}{6})$.
These normalized weights form a mixed-strategy that can be used for the next game.

---

### Example RPS:
We given an example of regret matching against an opponent that plays rock slightly more than paper or scissors.
We initialize the cumulative regret of our agent to 0.
Over several iterations, our agent will choose rock, paper, or scissors with a probability proportional to their cummulative regret via a strategy porfile. We then update their regrets using this action and repeat.  

Regret Matching Algorithm:  
`  
Initialize cummulative regret to 0
For some number of training iterations:
    - Use cummulative regret to define strategy profile
    - Add strategy profile to cummulative strategy profile (will use average after training.)
    - Agent selects action according to strategy profile.
    - Compute agent's regret, given opponents action.
    - Update cummulative regret.
Normalize cummulative strategy profile by number of training iterations.
Return normalized strategy profile. 
`

In [1]:
import random as rd 
import numpy as np

np.set_printoptions(suppress=True)

In [8]:
class RPS:
    
    ROCK, PAPER, SCISSORS = 0, 1, 2
    N_ACTIONS = 3
    # Payoff matrix, First index is our agent's choice, Second the opponents choice
    util = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])

    def __init__(self, oppo_strat=np.array([1/3, 1/3, 1/3]), use_softmax=False):
        self.use_softmax = use_softmax
        self.reset(oppo_strat)
        
    def reset(self, oppo_strat):
        self.regret_sum = np.array([0.0, 0.0, 0.0])
        self.strategy_sum = np.array([0.0, 0.0, 0.0])

        self.agent_strategy = np.array([0.0, 0.0, 0.0])
        self.oppo_strategy = oppo_strat
        
    def set_agent_strategy(self):
        if self.use_softmax:
            # Set agent's strategy proportional to current regret, or uniform if all are non positive.
            normalize_factor = 0
            for a in range(self.N_ACTIONS):
                self.agent_strategy[a] = self.regret_sum[a] if self.regret_sum[a] > 0 else 0
                normalize_factor += self.agent_strategy[a]

            for a in range(self.N_ACTIONS):
                if (normalize_factor > 0):
                    self.agent_strategy[a] /= normalize_factor
                else:
                    self.agent_strategy[a] = 1 / self.N_ACTIONS
        else:
            # Set agent's strategy proportional to current regret via softmax.
            exp = np.exp(self.regret_sum)
            self.agent_strategy = exp / np.sum(exp)
    
    def update_running_strategy(self):
        self.strategy_sum += self.agent_strategy
        
    def get_action(self, strategy):
        
        # Given strategy, generate rdnum in [0,1], return action corresponding to bin this number falls into. 
        r = rd.uniform(0,1)

        if r < strategy[0]:
            return self.ROCK
        elif r < strategy[1]+strategy[0]:
            return self.PAPER
        else:
            return self.SCISSORS
        
    def print_arrays(self):
        # For debugging. 
        print(f'Regret sum: {self.regret_sum}')
        print(f'Strategy sum: {self.strategy_sum}')
        print(f'Agent Strategy: {self.agent_strategy}')
        
    def softmax(self, array):
        x = np.exp(array)
        return x / np.sum(x)
        
        
    def train(self, n_epochs=10_000):
        # Determine strategy for agent. 
        for epoch in range(1, n_epochs + 1):
            self.set_agent_strategy()
            self.update_running_strategy()
            agent_action = self.get_action(self.agent_strategy)
            oppo_action = self.get_action(self.oppo_strategy)
            for i in range(self.N_ACTIONS):
                self.regret_sum[i] += (self.util[i][oppo_action] - self.util[agent_action][oppo_action])
        
#             if epoch % 1000 == 0:
#                 x = self.strategy_sum / epoch
#                 print(f'Epoch {epoch}. Average strategy: {x}')
        
        

We'll attempt to learn an optimal strategy to play when our opponent has a preference for playing rock more than the other two options. We will run 10 experiments to see what strategy is generated:

In [11]:
n_epochs = 50_000
oppo_strat_rock = [0.5, 0.25, 0.25]

In [12]:
for i in range(1, 11):
    rps = RPS(oppo_strat_rock, use_softmax=False)
    rps.train(n_epochs)
    print(f'Iteration {i}: Strategy: {rps.strategy_sum / n_epochs}')

Iteration 1: Strategy: [0.00046486 0.99951195 0.00002319]
Iteration 2: Strategy: [0.00017554 0.99971422 0.00011024]
Iteration 3: Strategy: [0.00003328 0.99989098 0.00007574]
Iteration 4: Strategy: [0.00001765 0.99994535 0.000037  ]
Iteration 5: Strategy: [0.00050425 0.99945306 0.00004268]
Iteration 6: Strategy: [0.00002327 0.99996027 0.00001646]
Iteration 7: Strategy: [0.0001353  0.99969058 0.00017412]
Iteration 8: Strategy: [0.00001273 0.99996128 0.00002599]
Iteration 9: Strategy: [0.00001201 0.99992869 0.0000593 ]
Iteration 10: Strategy: [0.00004421 0.99993374 0.00002206]


For all iterations, the average strategy returned all suggest that we play paper most of the time. This is reasonable given the opponent's strategy: as they typically play rock, in the long run we will win the most games (and maximize our utility) by playing paper. 

Let's repeat this experiment but choose actions to play via a softmax function. Recall that we choose actions proportional to their regret.

In [13]:
for i in range(1, 11):
    rps = RPS(oppo_strat_rock, use_softmax=True)
    rps.train(n_epochs)
    print(f'Iteration {i}: Strategy: {rps.strategy_sum / n_epochs}')

Iteration 1: Strategy: [0.00191633 0.998067   0.00001667]
Iteration 2: Strategy: [0.00000667 0.99998667 0.00000667]
Iteration 3: Strategy: [0.00008333 0.99989    0.00002667]
Iteration 4: Strategy: [0.00123229 0.99875438 0.00001333]
Iteration 5: Strategy: [0.00006333 0.99993    0.00000667]
Iteration 6: Strategy: [0.00065849 0.99871333 0.00062818]
Iteration 7: Strategy: [0.00019977 0.99975023 0.00005   ]
Iteration 8: Strategy: [0.0006033  0.99939004 0.00000667]
Iteration 9: Strategy: [0.0002324  0.99969882 0.00006878]
Iteration 10: Strategy: [0.00001333 0.99998    0.00000667]


Again, we achieve similar results. Finally, let's see what happens when the opponent is clever and uses actions uniformly at random. Here, our regret matching is likely to fail us as their is no dominant strategy to play against this opponent that maximizes our utility.

In [15]:
oppo_strat_uniform = [1/3.0, 1/3.0, 1/3.0]
for i in range(1, 11):
    rps = RPS(oppo_strat_uniform, use_softmax=False)
    rps.train(n_epochs)
    print(f'Iteration {i}: Strategy: {rps.strategy_sum / n_epochs}')

Iteration 1: Strategy: [0.4617967  0.11100432 0.42719899]
Iteration 2: Strategy: [0.86425294 0.11852674 0.01722032]
Iteration 3: Strategy: [0.05346663 0.91809169 0.02844169]
Iteration 4: Strategy: [0.06794672 0.1074527  0.82460057]
Iteration 5: Strategy: [0.07040969 0.02806498 0.90152533]
Iteration 6: Strategy: [0.2899674  0.46907745 0.24095515]
Iteration 7: Strategy: [0.74658699 0.00005238 0.25336062]
Iteration 8: Strategy: [0.88282237 0.00003953 0.11713809]
Iteration 9: Strategy: [0.00116568 0.30869513 0.69013919]
Iteration 10: Strategy: [0.75313986 0.00355577 0.24330438]


As expected the resulting strategy does not favour a single action consistently as seen in the previous experiments.