# Two-player, zero-sum games



### Why games? 

- Modelling strategic decision making for more than one agent. 
- An agent can improve by playing against itself (self-play).


### Examples

- Chess: Perfect information, alternating and deterministic moves. 
- Rock, paper, scissors: Complete information, simultaneous moves (neither player sees the other's move before choosing).
- Poker: Imperfect information (hidden cards), alternating moves.


### Recent success in Computer Poker
- *Libratus*, an AI, won 1.7 million in chips (no real money) in Texas Hold'em.
- 20 days, 11 hours per day, 4 of the top players.
- **Why is this a thing?**
	-  Imperfect information, unlike chess or go.
	-  In particular, the computer needs to learn to bluff. 
- Two research groups (CMU & CZ/Canada) came to the same benchmark (guess who made more fuss about it...)




### Some terminology
- A player plays a **pure** strategy if he chooses a single action. 
- A player plays a **mixed** strategy if he chooses actions randomly, according to a probability distribution over his actions.
- The expected payoff is simply the expectation of the payoff when both players draw their actions from their mixed strategies.
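
To make the expected payoff concrete, here is a minimal sketch for rock, paper, scissors. The payoff matrix `RPS`, the function name `expected_payoff`, and the example strategies are illustrative assumptions, not part of the original slides. With row strategy $x$ and column strategy $y$, the row player's expected payoff is $x^\top A y$.

```python
import numpy as np

# Row player's payoffs (assumed encoding: 0 = Rock, 1 = Paper, 2 = Scissors).
# Rows are the row player's action, columns the opponent's action.
RPS = np.array([
    [0, -1, 1],   # Rock: loses to Paper, beats Scissors
    [1, 0, -1],   # Paper: beats Rock, loses to Scissors
    [-1, 1, 0],   # Scissors: loses to Rock, beats Paper
])


def expected_payoff(payoff_matrix, x, y):
    """Expected payoff to the row player when row plays x and column plays y."""
    return float(np.asarray(x) @ payoff_matrix @ np.asarray(y))


if __name__ == "__main__":
    pure_paper = [0.0, 1.0, 0.0]   # a pure strategy is a degenerate mixed one
    biased = [0.4, 0.3, 0.3]       # opponent slightly biased towards Rock
    print(expected_payoff(RPS, pure_paper, biased))
```

Against an opponent who favours Rock, always playing Paper has a positive expected payoff, while the uniform strategy earns exactly zero against anything.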


### Best reply
- A **best reply** strategy for player $A$, given the strategy of player $B$, maximizes $A$'s expected payoff.
- A **Nash equilibrium** is a pair of strategies such that each player's strategy is a best reply to the other player's strategy.
- No player can improve by changing strategy alone.
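
The following sketch illustrates both definitions on rock, paper, scissors. The names `best_reply` and `is_symmetric_equilibrium` are hypothetical, and the equilibrium check relies on the assumption that the game is symmetric and zero-sum, so a pair $(s, s)$ is an equilibrium exactly when $s$ is a best reply to itself.

```python
import numpy as np

# Row player's payoffs (assumed encoding: 0 = Rock, 1 = Paper, 2 = Scissors).
RPS = np.array([
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
])


def best_reply(payoff_matrix, opponent_strategy):
    """Return a pure best reply and the expected payoff of every pure action."""
    action_values = payoff_matrix @ np.asarray(opponent_strategy)
    return int(np.argmax(action_values)), action_values


def is_symmetric_equilibrium(payoff_matrix, strategy, tol=1e-9):
    """Check whether (s, s) is a Nash equilibrium of a symmetric zero-sum game.

    No pure deviation may earn more against s than s itself does.
    """
    s = np.asarray(strategy)
    action_values = payoff_matrix @ s
    return bool(action_values.max() <= s @ action_values + tol)


if __name__ == "__main__":
    print(best_reply(RPS, [0.4, 0.3, 0.3])[0])                 # index of the best pure reply
    print(is_symmetric_equilibrium(RPS, [1/3, 1/3, 1/3]))
    print(is_symmetric_equilibrium(RPS, [0.4, 0.3, 0.3]))
```

Checking pure deviations is enough: the expected payoff of a mixed strategy is an average of the pure payoffs, so it can never exceed the best pure one.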


### Goal

- Play well in the game. 
- One *solution concept* is the Nash equilibrium; others are possible.



### Example: Prisoner's Dilemma
![](images/pdilemma.png)
The Nash equilibrium does not give the best payoff. We say that the equilibrium is *inefficient*. The outcome $(D,D)$ is called Pareto optimal. 


### Example: Battle of the sexes
![](images/battle.png)
- Two pure equilibria, $(S,S)$, and $(H,H)$. 
- One mixed equilibrium, which can be reached through a **correlation device**.
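
As a rough illustration, the sketch below enumerates the pure equilibria of a 2x2 game and computes the fully mixed one from the indifference conditions. The payoff numbers are hypothetical (the figure's actual values may differ), and the names `pure_nash_equilibria` and `mixed_equilibrium_2x2` are assumptions made for this example.

```python
import numpy as np

# Hypothetical Battle of the Sexes payoffs (actions: 0 = S, 1 = H).
# A[i, j] is the row player's payoff, B[i, j] the column player's payoff.
A = np.array([[2, 0],
              [0, 1]])
B = np.array([[1, 0],
              [0, 2]])


def pure_nash_equilibria(A, B):
    """All action pairs (i, j) where neither player gains by deviating alone."""
    equilibria = []
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            row_is_best = A[i, j] >= A[:, j].max()   # row cannot do better vs column j
            col_is_best = B[i, j] >= B[i, :].max()   # column cannot do better vs row i
            if row_is_best and col_is_best:
                equilibria.append((i, j))
    return equilibria


def mixed_equilibrium_2x2(A, B):
    """Fully mixed equilibrium of a 2x2 game, assuming one exists.

    p = probability that row plays action 0, chosen so column is indifferent.
    q = probability that column plays action 0, chosen so row is indifferent.
    """
    p = (B[1, 1] - B[1, 0]) / (B[0, 0] - B[0, 1] - B[1, 0] + B[1, 1])
    q = (A[1, 1] - A[0, 1]) / (A[0, 0] - A[0, 1] - A[1, 0] + A[1, 1])
    return float(p), float(q)


if __name__ == "__main__":
    print(pure_nash_equilibria(A, B))
    print(mixed_equilibrium_2x2(A, B))
```

With these assumed payoffs, each player mixes towards his own preferred outcome, and both earn less in the mixed equilibrium than in either pure one.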



### Zero-sum games

- A **two-player**, **zero-sum game** is a strategic interaction between two agents, called **players**, where each tries to maximize his own reward and the two rewards always sum to zero.

- One player's gain is the other player's loss.

- Strategically equivalent to **constant-sum** games, where the two players' payoffs sum to the same constant in every outcome.
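
The equivalence can be seen with a short derivation. The symbols $u_A$, $u_B$, $c$ and $o$ are notation introduced here for illustration. Suppose the payoffs satisfy $u_A(o) + u_B(o) = c$ for every outcome $o$, and shift both by the same constant:

$$
\tilde{u}_A(o) = u_A(o) - \frac{c}{2}, \qquad
\tilde{u}_B(o) = u_B(o) - \frac{c}{2}
\quad\Longrightarrow\quad
\tilde{u}_A(o) + \tilde{u}_B(o) = 0 .
$$

Shifting a player's payoff by a constant changes every expected payoff by that same constant, so the ranking of his strategies, and therefore the best replies and equilibria, stay the same.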


### Example
![](images/soccer.png)


## What should players do in a game?



 
### Different frameworks to think about this
- **Rational Expectations**
- Play an inherited strategy
- Copy a successful strategy
- **Adapt the strategy to the outcome**



 
### Rational Expectations: 
- Behave as if everyone were rational.
- Generalized beauty contest
	- Guess a number from zero to 100.
	- The winner is the one whose guess is closest to two-thirds of the average guess of all participants.
	- **Example**: Three players guess 20, 30, 40. Who wins?



- Zero-level thinker: "Oh, God, maths... just 50".
- First-level thinker: "These guys will say 50. So $\frac{2}{3}\cdot 50 \approx 33$".
- Second-level thinker: "These guys are smart, and think the others aren't. So 22."
- ...

A hyper-rational player would keep iterating this reasoning and arrive at the equilibrium guess of zero.
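
The contest rule and the level-$k$ reasoning above can be sketched in a few lines. The function names `winner` and `level_k_guess` are hypothetical, and the starting guess of 50 for the zero-level thinker is taken from the list above.

```python
def winner(guesses, factor=2 / 3):
    """Return the winning guess and the target (factor times the average guess)."""
    target = factor * sum(guesses) / len(guesses)
    best_guess = min(guesses, key=lambda g: abs(g - target))
    return best_guess, target


def level_k_guess(k, level_zero=50.0, factor=2 / 3):
    """Guess of a level-k thinker who assumes everyone else reasons at level k-1."""
    return level_zero * factor ** k


if __name__ == "__main__":
    print(winner([20, 30, 40]))
    for k in range(6):
        print(k, round(level_k_guess(k), 1))
```

Each extra level of reasoning multiplies the guess by two-thirds, so the sequence shrinks geometrically towards zero, the only guess that is a best reply to itself.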



### What do people do?
![](images/ft.png)
https://www.ft.com/content/6149527a-25b8-11e5-bd83-71cb60e8f08c



## Adaptive strategy



- **Goal**: Obtain a Nash equilibrium (NE) strategy from simple rules.
- In expectation, the NE strategy will not do worse than a tie.
- Because of luck in the game, there is no guarantee in any single match (for **any** strategy, not only NE).
- A NE strategy just plays perfect defence.



### Finding Nash Equilibrium
- Self-play
	- Two agents with a random strategy.
	- They improve their strategy using regret matching.
	- After each game, an improved strategy.
	- When convergence is reached, your strategy is ready to go.
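
A minimal self-play sketch for rock, paper, scissors, under stated assumptions: both agents use a plain regret-matching update, and the strategy that is "ready to go" is taken to be the average strategy over all rounds rather than the last one. The names `strategy_from_regrets` and `self_play`, the round count, and the seed are illustrative choices.

```python
import numpy as np

# Payoffs from the acting player's perspective: RPS[my_action, other_action]
# (assumed encoding: 0 = Rock, 1 = Paper, 2 = Scissors).
RPS = np.array([
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
])


def strategy_from_regrets(regret_sum):
    """Play each action in proportion to its positive cumulative regret."""
    positive = np.maximum(regret_sum, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.ones(len(regret_sum)) / len(regret_sum)  # no regret yet: uniform


def self_play(n_rounds=20000, seed=0):
    """Let two regret-matching agents play each other; return average strategies."""
    rng = np.random.default_rng(seed)
    regret_sums = [np.zeros(3), np.zeros(3)]
    strategy_sums = [np.zeros(3), np.zeros(3)]
    for _ in range(n_rounds):
        strategies = [strategy_from_regrets(r) for r in regret_sums]
        actions = [rng.choice(3, p=s) for s in strategies]
        for me in (0, 1):
            other = 1 - me
            # What each of my actions would have earned against the actual move
            action_payoffs = RPS[:, actions[other]]
            regret_sums[me] += action_payoffs - action_payoffs[actions[me]]
            strategy_sums[me] += strategies[me]
    return [s / n_rounds for s in strategy_sums]


if __name__ == "__main__":
    for average_strategy in self_play():
        print(np.round(average_strategy, 3))
```

The strategy played in any single round can swing widely; it is the average over rounds that approaches the uniform equilibrium.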


### Fictitious Play
![](images/fplay.png)


### 
- Convergence is guaranteed for zero-sum games and a few other special classes of games.
- DeepMind 2015 paper (http://proceedings.mlr.press/v37/heinrich15.pdf):
	- Extension of FP to extensive form games.
	- Fictitious self-play (FSP):
		- FP + Reinforcement Learning + Supervised Learning.

*FSP has a lot of potential to scale to large and even continuous-action game-theoretic applications* 



### Demo: Fictitious Play
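
In place of the live demo, here is a minimal sketch of fictitious play on rock, paper, scissors. It assumes simultaneous updates, an arbitrary first move (Rock for both players), and ties broken by the lowest action index; the function name `fictitious_play` is hypothetical. In each round, every player best-responds to the empirical frequencies of the opponent's past moves.

```python
import numpy as np

# Row player's payoffs; the column player receives the negative (zero-sum).
# Assumed encoding: 0 = Rock, 1 = Paper, 2 = Scissors.
RPS = np.array([
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
])


def fictitious_play(payoff_matrix, n_rounds=10000):
    """Return the empirical action frequencies of both players after n_rounds."""
    n_rows, n_cols = payoff_matrix.shape
    row_counts = np.zeros(n_rows)
    col_counts = np.zeros(n_cols)
    # Arbitrary first round: both players start with action 0
    row_counts[0] += 1
    col_counts[0] += 1
    for _ in range(n_rounds - 1):
        col_beliefs = col_counts / col_counts.sum()
        row_beliefs = row_counts / row_counts.sum()
        # Row maximizes its expected payoff against the column's empirical mix
        row_action = np.argmax(payoff_matrix @ col_beliefs)
        # Column minimizes the row player's expected payoff (zero-sum game)
        col_action = np.argmin(row_beliefs @ payoff_matrix)
        row_counts[row_action] += 1
        col_counts[col_action] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()


if __name__ == "__main__":
    row_freq, col_freq = fictitious_play(RPS)
    print(np.round(row_freq, 3), np.round(col_freq, 3))
```

The actual play cycles through long runs of Rock, Paper and Scissors; only the empirical frequencies settle near the uniform equilibrium.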



- **Regret matching**, Hart and Mas-Colell, 2000.
- Players reach equilibrium by tracking their regrets: in the future, they play more often the actions they most regret not having played in the past.
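
One common way to write the rule is given below. The notation is an assumption made for this note: $u(a, b)$ is my payoff when I play $a$ and the opponent plays $b$, $(a^t, b^t)$ are the actions played in round $t$, and $\sigma^{T+1}$ is the mixed strategy for round $T+1$.

$$
R^T(a) = \sum_{t=1}^{T} \Big( u(a, b^t) - u(a^t, b^t) \Big),
\qquad
\sigma^{T+1}(a) = \frac{\max\big(R^T(a), 0\big)}{\sum_{a'} \max\big(R^T(a'), 0\big)} .
$$

If no action has positive cumulative regret, the denominator is zero and the player falls back to the uniform strategy.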


### Example
![](images/regret.png)


### Algorithm
![](images/regret_algo.png)



### Demo: Regret Matching




### Example: Libratus
- Self-play algorithm (CFR+)
- Endgame solver
	- Special solver for the end of the game.
- Continual improvement meta-algorithm
	- Improvement after each match.

### CFR+
- **Counterfactual**: "If I had known"
- A CFR+ strategy can be so close to equilibrium that an opponent playing perfectly against it would need more than a human lifetime of poker to show, with 95% statistical significance, that it is not exactly optimal.
- Difference from regret matching: CFR+ adapts to the tree structure of the game.





```python
# Code sample: Regret minimization for rock, paper, scissors

import random

import matplotlib.pyplot as plt
import numpy as np

np.set_printoptions(precision=2)


# Base player: samples a pure action from the mixed strategy it is given
class Player:
    def __init__(self):
        self.my_moves = []
        self.other_moves = []

    def move(self, strategy):
        # Input: a probability distribution over actions
        # Output: a pure action, sampled by walking the cumulative distribution
        r = random.uniform(0, 1)
        n_actions = len(strategy)

        a = 0
        cumulative_proba = 0.0

        while a < n_actions - 1:
            cumulative_proba += strategy[a]
            if r < cumulative_proba:
                return a
            a += 1
        return a


class RegretPlayer(Player):
    def __init__(self):
        super().__init__()
        self.regret_sum = np.zeros(3)

    def regret(self):
        # Regret of the last round for each action: what that action would
        # have paid against the opponent's move, minus what I actually got
        if len(self.my_moves) > 0:
            my_action = self.my_moves[-1]
            his_action = self.other_moves[-1]
        else:
            return np.zeros(3)

        # Payoffs from my perspective
        my_payoff = np.zeros(3)

        # If we play the same, I don't get any payoff
        my_payoff[his_action] = 0.

        # I win when he plays scissors and I play rock,
        # or when I play the "next" action (Rock = 0, Paper = 1, Scissors = 2)
        my_payoff[0 if his_action == 2 else his_action + 1] = 1

        # I lose when I play scissors and he plays rock,
        # or when I play the "previous" action
        my_payoff[2 if his_action == 0 else his_action - 1] = -1

        return my_payoff - my_payoff[my_action]

    def get_regret_mixed_strategy(self):
        # Add the regret of the last round before choosing the next strategy
        self.regret_sum += self.regret()

        # Play each action in proportion to its positive cumulative regret
        strategy = np.maximum(self.regret_sum, 0)
        normalize_sum = strategy.sum()

        # If no action has positive regret, play uniformly at random
        if normalize_sum > 0:
            strategy = strategy / normalize_sum
        else:
            strategy = np.ones(3) / 3

        return strategy


def payoff(my_action, his_action):
    # Payoff to me: 0 for a tie, -1 for a loss, +1 for a win
    if my_action == his_action:
        return 0
    if (his_action == 0 and my_action == 2) or my_action == his_action - 1:
        return -1
    return 1


def run(n_rounds=10):
    p1 = RegretPlayer()
    p2 = Player()

    total_p1 = 0.0
    total_p2 = 0.0

    rounds_won_p1 = 0.

    moves_map = {0: "Rock", 1: "Paper", 2: "Scissors"}

    plt.ion()
    plt.axis([-0.1, n_rounds + 0.1, -0.1, 1.1])
    plt.title("Fraction of rounds won using a regret strategy")
    print("*" * 100)
    print("The match begins")
    print("*" * 100)
    for n in range(1, n_rounds + 1):

        regret_strategy_p1 = p1.get_regret_mixed_strategy()

        m1 = p1.move(regret_strategy_p1)

        # The opponent plays a constant mixed strategy, biased towards rock
        m2 = p2.move([0.4, 0.3, 0.3])

        # Players update the info of the moves
        p1.my_moves.append(m1)
        p1.other_moves.append(m2)

        # Zero-sum: whatever p1 wins, p2 loses
        round_payoff = payoff(m1, m2)
        total_p1 += round_payoff
        total_p2 -= round_payoff

        #### SHOW RESULTS
        print('-' * 50)
        print("Regret:")
        print(p1.regret_sum)
        print("Strategy:", regret_strategy_p1)
        print("My move: %s" % moves_map[m1])
        print("His move: %s" % moves_map[m2])
        print("Payoffs:")
        print(total_p1)
        print(total_p2)

        rounds_won_p1 += 1 if round_payoff > 0 else 0

        # Plot the fraction of rounds won so far
        plt.scatter(n, rounds_won_p1 / n, color="red")
        plt.pause(0.1)


run(n_rounds=100)
```


