# Regret Matching with Rock, Paper, Scissors
---
### Regret
For a group of players, let $s_i$ be the action played by player $i$ and $s_{-i}$ the actions played by the remaining players.
Together, these actions form an action profile $a = (s_i, s_{-i}) \in A$.
For every other action $s_i'$ we could have played, we define the **regret of not playing action $s'_i$** as a difference in utility:  

<center>
$
\begin{align*}
    regret(s_i', a) &= u(s_i', s_{-i}) - u(a)
\end{align*}
$
</center>  

For example, if we play scissors and our opponent plays rock, then $a=(scissors, rock)$ and our utility for this play is $u(scissors, rock) = -1$.
We can compute the regret for all of our possible actions to find:  
<center>
$
\begin{align*}
    regret(rock, a) &= u(rock, rock) - u((scissors, rock)) = 0 - (-1) = 1 \\
    regret(paper, a) &= u(paper, rock) - u((scissors, rock)) = 1 - (-1) = 2 \\
    regret(scissors, a) &= u(scissors, rock) - u((scissors, rock)) = -1 - (-1) = 0
\end{align*}
$  
</center>  

Thus we regret not playing "paper" the most, and we regret not playing "rock" more than having played "scissors". Note that when $s_i' = s_i$ the regret is zero.
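
The calculation above can be sketched in a few lines of Python. The payoff matrix `U` and the helper `regrets` are illustrative names assumed here, not part of the definitions above; rows index our action and columns index the opponent's action.

```python
import numpy as np

ROCK, PAPER, SCISSORS = 0, 1, 2

# Assumed payoff matrix: U[ours, theirs] is our utility (+1 win, 0 tie, -1 loss).
U = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]])

def regrets(our_action, opp_action):
    """Regret of not having played each action, given what was actually played."""
    # u(s_i', s_-i) for every alternative s_i', minus the utility we actually got.
    return U[:, opp_action] - U[our_action, opp_action]

print(regrets(SCISSORS, ROCK))  # [1 2 0]
```

The three entries are the regrets for rock, paper, and scissors, matching the worked example.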

---
### Regret Matching

An action with positive regret is one we should have chosen to increase our utility.
If we track the regret of each action and choose actions at random with probability proportional to their positive regret, we can hope to improve our utility over time. Actions with negative regret are given zero probability.
Such a weighting is called **regret matching**.  
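
A minimal sketch of this weighting, assuming the regrets are held in a NumPy array; the function name `regret_matching_strategy` is hypothetical. When no regret is positive, the sketch falls back to a uniform strategy, which is a common convention rather than something the definition above requires.

```python
import numpy as np

def regret_matching_strategy(cumulative_regret):
    """Map regrets to a mixed strategy via regret matching."""
    # Negative regrets are clipped to zero so they receive zero probability.
    positive = np.maximum(np.asarray(cumulative_regret, dtype=float), 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # Assumption: play uniformly when no action has positive regret.
    return np.full(len(positive), 1.0 / len(positive))

print(regret_matching_strategy([1, 2, 0]))   # rock 1/3, paper 2/3, scissors 0
print(regret_matching_strategy([-1, 2, 2]))  # the negative regret gets probability 0
```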

If we play another game using the above regrets, we play "paper" with probability $\frac{2}{3}$ and "rock" with probability $\frac{1}{3}$. Suppose we play "paper" while our opponent plays "scissors", so that $a=(paper, scissors)$ and $u(paper, scissors) = -1$.
Our regrets for this game are:
<center>
$
\begin{align*}
    regret(rock, a) &= u(rock, scissors) - u((paper, scissors)) = 1 - (-1) = 2 \\
    regret(paper, a) &= u(paper, scissors) - u((paper, scissors)) = -1 - (-1) = 0 \\
    regret(scissors, a) &= u(scissors, scissors) - u((paper, scissors)) = 0 - (-1) = 1
\end{align*}
$  
</center>  

Adding these to our previous regrets $(1,2,0)$ gives **cumulative regrets** of $(3,2,1)$ for rock, paper, and scissors respectively, which normalize to $(\frac{3}{6},\frac{2}{6},\frac{1}{6})$.
These normalized weights form a mixed strategy that can be used for the next game.

---

### Example RPS:
We give an example of regret matching against an opponent that plays rock more often than paper or scissors.
We initialize the cumulative regret of our agent to 0.
Over several iterations, our agent chooses rock, paper, or scissors with probability proportional to its positive cumulative regret, as given by a strategy profile. We then update the regrets using the chosen action and repeat.  

Regret Matching Algorithm:  
`  
Initialize cumulative regret to 0
For some number of training iterations:
    - Use cumulative regret to define strategy profile
    - Add strategy profile to cumulative strategy profile (the average is used after training.)
    - Agent selects action according to strategy profile.
    - Compute agent's regret, given opponent's action.
    - Update cumulative regret.
Normalize cumulative strategy profile by number of training iterations.
Return normalized strategy profile.
`

In [9]:
import random as rd 
import numpy as np

ROCK, PAPER, SCISSORS = 0, 1, 2
N_ACTIONS = 3
# Payoff matrix: first index is our agent's choice, second is the opponent's choice.
PM = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])

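# Use float arrays: strategies and their running sums hold fractional values,
# which an integer array would silently truncate.
regret_sum = np.zeros(N_ACTIONS)
strategy_sum = np.zeros(N_ACTIONS)

agent_strategy = np.zeros(N_ACTIONS)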
oppo_strategy = np.array([0.5, 0.25, 0.25])


In [6]:
def get_action(strategy):
    """
        Given strategy, generate rdnum in [0,1],
        Return action corresponding to bin this number
        falls into. 
    """
    r = rd.uniform(0,1)
    
    if r < strategy[0]:
        return ROCK
    elif r < strategy[1]+strategy[0]:
        return PAPER
    else:
        return SCISSORS
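
As an alternative to the hand-written bins, NumPy's random generator can sample an index directly from a probability vector, which extends to any number of actions. This is a sketch rather than the approach used in this notebook; `sample_action` and the seeded `rng` are names assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

def sample_action(strategy):
    """Draw an action index with probability given by `strategy`."""
    return int(rng.choice(len(strategy), p=strategy))

# Empirical frequencies should approach the strategy as the sample grows.
draws = [sample_action([0.5, 0.25, 0.25]) for _ in range(10_000)]
print(np.bincount(draws, minlength=3) / len(draws))
```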


In [7]:
def train(n_epochs):
    # These module-level arrays are rebound below, so they must be declared
    # global here; otherwise updating strategy_sum raises UnboundLocalError.
    global regret_sum, strategy_sum, agent_strategy

    for i in range(n_epochs):

        # Set agent's strategy proportional to positive cumulative regret,
        # or uniform if no regret is positive.
        positive_regret = np.maximum(regret_sum, 0).astype(float)
        normalize_factor = positive_regret.sum()
        if normalize_factor > 0:
            agent_strategy = positive_regret / normalize_factor
        else:
            agent_strategy = np.full(N_ACTIONS, 1 / N_ACTIONS)

        # Update running strategy sum.
        strategy_sum = strategy_sum + agent_strategy

        # Get player actions.
        agent_action = get_action(agent_strategy)
        oppo_action = get_action(oppo_strategy)

        # Regret of each alternative action: u(s_i', s_-i) - u(a).
        regret_sum = regret_sum + (PM[:, oppo_action] - PM[agent_action, oppo_action])

    # Each per-iteration strategy sums to 1, so this averages over all iterations.
    return strategy_sum / strategy_sum.sum()


In [8]:
train(500)
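
As a sanity check on what training should converge toward, the expected utility of each pure action against the opponent's fixed strategy can be computed directly from the payoff matrix. The sketch below restates `PM` and `oppo_strategy` from the cells above so that it runs on its own; `expected_utility` is a name introduced here.

```python
import numpy as np

PM = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
oppo_strategy = np.array([0.5, 0.25, 0.25])

# Expected utility of rock, paper, and scissors against the opponent's mix.
expected_utility = PM @ oppo_strategy
print(expected_utility)                  # 0 for rock, 0.25 for paper, -0.25 for scissors
print(int(np.argmax(expected_utility)))  # 1, i.e. paper
```

Paper has the highest expected utility, so the average strategy produced by regret matching is expected to place most of its weight on paper against this opponent.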