#  Playing perfect defence with regret matching


You have probably heard of reinforcement learning and how it is making headlines: [playing Atari](https://deepmind.com/research/publications/playing-atari-deep-reinforcement-learning/), [beating Go masters](https://www.theverge.com/2017/10/18/16495548/deepmind-ai-go-alphago-zero-self-taught) and [saving energy for Google](https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/), among other things.

In a reinforcement learning problem, an **agent** receives a signal from the **environment** that encodes all the information it needs to take an **action** in order to maximize a certain **reward**. The examples above fit this description, but so do many others: choosing the content that is delivered to you online (ads or articles), or designing adaptive treatments for chronic diseases. The framework is quite general, so it covers a wide range of applications.

The aim is to behave optimally under uncertainty: which action should the agent take at each decision point in order to maximize its reward?

In this post, we will talk about reinforcement learning and game theory. Game theory is concerned with modelling strategic decision making for more than one agent. What does it have to do with reinforcement learning? It turns out that an agent can improve by playing against itself! This idea is not new, but it has turned out to be incredibly powerful. Recently, *Libratus*, an AI, won 1.7 million in chips (no real money) at Texas Hold'em, playing against 4 of the top players for 20 days, 11 hours per day. Why is this a big deal? Unlike chess or Go, poker is an incomplete information game, and such games were long believed to be too hard for computers.

Around a month earlier than *Libratus*, a Czech team of researchers from Charles University and Czech Technical University created a similar AI, called *DeepStack*. You can read more about the comparison between them [here](http://www.nature.com/news/how-rival-bots-battled-their-way-to-poker-supremacy-1.21580). 

Formally, a two-player zero-sum game is a strategic interaction between two players in which one player's gain is exactly the other's loss. Examples of zero-sum games are:

- Chess
- Rock, paper, scissors
- (Two person) Poker

Note that these games differ in a few important ways. The most important one concerns **information**: in chess the state of the world is known to both players, but in poker each player is missing a crucial piece of information (the opponent's hand).

We can summarize this as follows:

- Chess: Perfect information, alternating and deterministic moves. 
- Rock, paper, scissors: Perfect information with simultaneous moves.
- Poker: Incomplete information, alternating moves. 



First, a bit of game theory lingo: a **strategy** is a rule that tells each player how to behave, and a **strategy profile** is the set of strategies of the players. A strategy profile is a **Nash equilibrium (NE)** if no player can do better by unilaterally changing his or her strategy.

To see what this means, imagine that each player is told the strategies of the others. Suppose then that each player asks themselves: "Knowing the strategies of the other players, and treating the strategies of the other players as set in stone, can I benefit by changing my strategy?" If every player prefers not to switch (or is indifferent between switching and not) then the strategy profile is a Nash equilibrium. 
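The "can I benefit by changing my strategy?" question can be checked numerically. The sketch below is only an illustration for rock, paper, scissors: the payoff matrix `PAYOFF` and the helper `best_deviation_gain` are names made up here, and payoffs are assumed to be +1 for a win, 0 for a tie and -1 for a loss. Because the game is symmetric, checking one player is enough.

```python
import numpy as np

# Payoff matrix for the row player (0 = Rock, 1 = Paper, 2 = Scissors);
# the column player receives the negative of each entry.
# This matrix and the helper below are illustrative assumptions.
PAYOFF = np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
])

def best_deviation_gain(my_strategy, other_strategy):
    """How much the row player could gain by switching to the best pure action."""
    action_values = PAYOFF @ other_strategy       # expected payoff of each pure action
    current_value = my_strategy @ action_values   # expected payoff of the current mix
    return action_values.max() - current_value

uniform = np.ones(3) / 3
always_rock = np.array([1.0, 0.0, 0.0])

# Both players mix uniformly: nobody can gain by deviating.
gain_at_uniform = best_deviation_gain(uniform, uniform)

# Both players always play rock: switching to paper gains a full point.
gain_vs_rock = best_deviation_gain(always_rock, always_rock)

print(gain_at_uniform, gain_vs_rock)
```

A gain of zero for every player is exactly the Nash condition, so the uniform profile passes the test while "always rock" does not.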

Hence, a Nash equilibrium teaches us to play self-defence: each player is doing the best it can given what the others do. Since most interesting games involve an element of luck, no strategy is guaranteed to win every time, so playing perfect self-defence is a sensible goal.

Finding Nash equilibria is by no means easy. For zero-sum games, however, players can reach a Nash equilibrium strategy through **self-play**. This roughly goes as follows:

- Two agents start with a random strategy.
- They improve their strategy after each game by some adaptive strategy.
- When convergence is reached, your strategy is ready to go.

The topic of this post is one of the simplest adaptive strategies, **regret matching**. It was introduced by Hart and Mas-Colell in 2000. The idea is that players reach equilibrium by tracking their regrets and, in the future, favouring the actions they most regret not having played. Let's illustrate this with an example:

<div class="alert alert-block alert-warning">
<h4>Example</h4>
<ul>
    <li>Suppose we play rock, paper, scissors for money. Each of us puts 100 EUR on the table, we play and the winner takes the money. </li>
    <li> The strategy sets are $I=J=\{0,1,2\}$ and the payoff is $u(i,j),$ which takes values in $\{-100, 0, 100\}$. </li>
    <li> You play rock and I play paper, so I win and take 100 EUR from you. Our payoffs are $(+100,-100)$ for me and you respectively. </li>
    <li> Your <b>regret</b> for not playing paper is 100 (a tie would have cost you nothing), but your regret for not playing scissors is even higher: 200 (you would have won 100 instead of losing 100). </li>
</ul>
</div>
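
The numbers in the example can be reproduced in a few lines. This is a minimal sketch: the `payoff` matrix, the `STAKE` constant and the `regrets` helper are illustrative names, not part of the implementation shown later in the post.

```python
import numpy as np

ROCK, PAPER, SCISSORS = 0, 1, 2
STAKE = 100  # EUR each player puts on the table, as in the example

# payoff[i, j]: what the player choosing i wins (or loses) against an opponent choosing j
payoff = STAKE * np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
])

def regrets(my_action, other_action):
    """Regret of each action: its payoff minus the payoff of the action actually played."""
    return payoff[:, other_action] - payoff[my_action, other_action]

# You played rock, I played paper
example_regrets = regrets(my_action=ROCK, other_action=PAPER)
print(example_regrets.tolist())  # [0, 100, 200]
```

The entry for rock is zero because that is the action you actually played; paper and scissors carry regrets of 100 and 200, as in the example.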

So how do we make an algorithm out of that? Simply play proportionally to your regret! The more you regret not doing a certain action, the more likely you are to play that action in the future.

To put it clearly:

<div class="alert alert-block alert-info">
<h4>Regret matching algorithm</h4>
<ul>
    <li> Initialize a counter for cumulative regret. </li>
    <li> After each round, compute your regret for each action. If we played $(i^*,j^*)$, the regret for action $i$ is:
 $$\max(u(i,j^*)-u(i^*,j^*),0).$$ </li>
    <li> Add these regrets to the cumulative regret counter. </li>
    <li> Play a mixed strategy in which each action is played with probability proportional to its cumulative regret (divide the cumulative regret of each action by the sum of the cumulative regrets). </li>
</ul>

</div>
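
The last step of the algorithm, turning cumulative regrets into a mixed strategy, can be sketched as follows. The function name `regret_matching_strategy` is hypothetical, and falling back to the uniform strategy when no regret is positive is one common convention assumed here.

```python
import numpy as np

def regret_matching_strategy(cumulative_regret):
    """Turn a cumulative regret vector into a mixed strategy."""
    positive = np.maximum(cumulative_regret, 0.0)  # negative regrets count as zero
    total = positive.sum()
    if total > 0:
        return positive / total
    # No positive regret yet: fall back to the uniform strategy (an assumption)
    n_actions = len(cumulative_regret)
    return np.ones(n_actions) / n_actions

# Regrets from the example above: scissors is played twice as often as paper
strategy_from_example = regret_matching_strategy(np.array([0.0, 100.0, 200.0]))

# No positive regret at all: every action is equally likely
strategy_without_regret = regret_matching_strategy(np.array([-50.0, 0.0, -10.0]))

print(strategy_from_example, strategy_without_regret)
```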



Regret matching is the basic ingredient of Libratus, the poker champion AI. Libratus uses a self-play algorithm (called CFR+) together with a special solver for the end of the game and a continual improvement meta-algorithm which improves after each match.

[CFR](http://modelai.gettysburg.edu/2013/cfr/cfr.pdf) stands for *counterfactual regret minimization*. *Counterfactual* means "If I had known". This algorithm is roughly an adaptation of regret matching to the tree structure of the game.

Let's do a simple example of regret minimization for rock, paper and scissors. 
First, we need to import a few standard libraries:

In [1]:
# Regret minimization for rock, paper, scissors 
import random
import numpy as np

np.set_printoptions(precision=2)


We will create our regret player as a class with three methods: one for sampling a move from a mixed strategy, one for calculating the regret, and one for calculating the mixed strategy corresponding to the regret vector.

In [2]:
class RegretPlayer():
    def __init__(self):
        ''' Keeps track of the history of the game'''
        self.regret_sum = np.zeros(3)
        self.my_moves = []
        self.other_moves = []
        
    def move(self, strategy):
        '''
        Input: a vector of probabilities
        Output: an action sampled from this vector
        '''
        raise NotImplementedError

    def regret(self):
        '''Calculates the regret vector given the history'''
        raise NotImplementedError

    def get_regret_mixed_strategy(self):
        '''Calculates the strategy from the regrets (normalizing the regret vector)'''
        raise NotImplementedError

You can try to implement those yourself! Either way, here is a full implementation.

In [3]:
   
class RegretPlayer():
    def __init__(self):
        self.regret_sum = np.zeros(3)
        self.my_moves = []
        self.other_moves = []
        
    def move(self, strategy):
        # Input: a probability vector over the actions (a mixed strategy)
        # Output: a pure action sampled from that vector
        
        r = random.uniform(0,1)
        n_actions = len(strategy)
        
        a = 0
        cumulative_proba = 0.0
        
        while a<n_actions-1:
            cumulative_proba += strategy[a] 
            if r < cumulative_proba: return a
            a +=1
        return a

    def regret(self):
        
        if len(self.my_moves)>0:
            my_action = self.my_moves[-1]
            his_action = self.other_moves[-1]
        else:
            return np.zeros(3)
        
        # Payoffs from this player's perspective, one entry per action I could have played
        my_payoff = np.zeros(3)
        
        # If we play the same, I don't get any payoff
        my_payoff[his_action] = 0.
                 
        # I win when he plays scissors and I play rock, 
        # or when I play the "next" action (Rock = 0, Paper = 1, Scissors = 2)
        my_payoff[0 if his_action == 2 else his_action + 1] = 1
                
        # I lose when I play scissors and he plays rock, 
        # or when I play the "previous" action         
        my_payoff[2 if his_action == 0 else his_action -1] = -1
                 
        regrets = [my_payoff[a]-my_payoff[my_action] for a in range(3)]
        regrets = np.array(regrets)
        return regrets
               
    def get_regret_mixed_strategy(self):
        # Add the regret of the last round before computing the strategy,
        # so that the strategy reflects every round played so far
        self.regret_sum += self.regret()

        # Negative cumulative regrets count as zero
        strategy = np.maximum(self.regret_sum, 0)
        normalize_sum = strategy.sum()

        # If no cumulative regret is positive, play uniformly at random
        if normalize_sum > 0:
            strategy = strategy / normalize_sum
        else:
            strategy = np.ones(3)/3

        return strategy
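
The `move` method above samples an action by walking through the cumulative probabilities. As a side note, the same sampling can be delegated to NumPy's random generator. The helper `sample_action` below is a hypothetical stand-alone alternative, not part of the class:

```python
import numpy as np

def sample_action(strategy, rng):
    """Sample a pure action (0, 1 or 2) according to the mixed strategy."""
    # Generator.choice draws an index in range(len(strategy)) with probabilities p
    return int(rng.choice(len(strategy), p=strategy))

rng = np.random.default_rng(seed=0)  # the seed is fixed only to make runs repeatable

# A degenerate strategy that always plays scissors
always_scissors = np.array([0.0, 0.0, 1.0])
samples = [sample_action(always_scissors, rng) for _ in range(10)]
print(samples)
```

Passing the generator in explicitly keeps the sampling reproducible, which helps when debugging self-play runs.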


Now we need a function that simulates the game:

In [4]:
def run(n_rounds = 1000, verbose=False):
    p1 = RegretPlayer()
    p2 = RegretPlayer() 
    strategies = []

    # Play exactly n_rounds rounds
    for n in range(n_rounds):
        
        regret_strategy_p1 = p1.get_regret_mixed_strategy()
        regret_strategy_p2 = p2.get_regret_mixed_strategy()
        
        m1 = p1.move(regret_strategy_p1)      
        m2 = p2.move(regret_strategy_p2)
        
        # Save the regret strategies
        strategies.append((regret_strategy_p1,regret_strategy_p2))
                
        # Players update the info of the moves
        p1.my_moves.append(m1)
        p1.other_moves.append(m2)
        
        p2.my_moves.append(m2)
        p2.other_moves.append(m1)    
            

        #### Display results: useful for debugging
        if verbose:
            moves_map = {0:"Rock", 1:"Paper", 2:"Scissors"}
            print('-'*50)
            print("Player 1 strategy:", regret_strategy_p1)
            print("Player 2 strategy:", regret_strategy_p2)
            print("Player 1 move: %s" % moves_map[m1])
            print("Player 2 move: %s" % moves_map[m2])


    return strategies

eq = run()



Note that it is the **average** strategy profile that converges to the Nash equilibrium strategy! In this case, we know that the Nash equilibrium strategy is to play each action at random with equal probability, that is, we expect the vector $\left (\frac{1}{3},\frac{1}{3},\frac{1}{3} \right).$

In [5]:
average_strategy_p1 = np.mean([x[0] for x in eq], axis=0)
average_strategy_p2 = np.mean([x[1] for x in eq], axis=0)
print("Average strategy of player 1: ", average_strategy_p1)
print("Average strategy of player 2: ", average_strategy_p2)

Average strategy of player 1:  [ 0.32  0.31  0.37]
Average strategy of player 2:  [ 0.34  0.34  0.32]
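
One way to judge how close these averages are to equilibrium is to ask how much a perfectly informed opponent could win against them per round. The sketch below assumes the usual +1/0/-1 payoffs; the name `exploitability` and the matrix `PAYOFF` are introduced here for illustration only.

```python
import numpy as np

# Payoff of the opponent's action (row) against our action (column),
# with 0 = Rock, 1 = Paper, 2 = Scissors
PAYOFF = np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
])

def exploitability(strategy):
    """Expected payoff per round of the best pure response against `strategy`."""
    # Entry i of PAYOFF @ strategy is the opponent's expected payoff for action i
    return (PAYOFF @ strategy).max()

# Average strategies printed above
avg_p1 = np.array([0.32, 0.31, 0.37])
avg_p2 = np.array([0.34, 0.34, 0.32])

exploit_p1 = exploitability(avg_p1)
exploit_p2 = exploitability(avg_p2)
exploit_uniform = exploitability(np.ones(3) / 3)

print(exploit_p1, exploit_p2, exploit_uniform)
```

For the averages above, the best response wins about 0.06 per round against player 1 and about 0.02 against player 2, compared with exactly 0 against the uniform strategy. Running more rounds should push these values closer to zero.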


Now we have an approximate Nash equilibrium strategy ready to use!