---
title: "Simple reinforcement for signalling games"
author: "Philippos Triantafyllou"
date-modified: last-modified
date-format: long
lang: en
format: html
theme: cosmo
toc: true
number-sections: true
number-depth: 2
code-line-numbers: true
echo: true
output: true
cap-location: top
embed-resources: true
---

:::{.callout-note}
## Instructions

This notebook aims to build simple reinforcement learning for simple signalling games. The first goal is to understand how to implement and run simple reinforcement learning for basic signalling games. The second goal is to understand how to analyze and interpret the simulation.

:::

## Scenario

The scenario is a two-player signalling game with an environment (Nature) that selects a state. Given the state, Alice chooses a message; on receiving it, Bob selects an action. If Bob's action matches the state chosen by Nature, the cooperative game is a success; otherwise it is a failure.

The game is characterized by:

 - the number of states $S$ available to Nature;
 - the number of messages $M$ available to Alice;
 - the number of actions $A$ available to Bob.

States, messages and actions are represented by vectors of non-negative weights that must be normalized before sampling. Since the weights are non-negative (and not all zero), they can be normalized directly: $$w_i = \frac{w_i}{\sum_{j=1}^k w_j}$$ for a vector $\mathbf{w}$ of length $k$.
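The normalization above can be sketched in a few lines. This is a minimal illustration using only the standard library; the helper names `normalize` and `sample_index` are hypothetical and not part of the notebook's class.

```python
import random

def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        # Normalization is undefined when every weight is zero.
        raise ValueError("at least one weight must be strictly positive")
    return [w / total for w in weights]

def sample_index(weights, rng):
    """Draw an index with probability proportional to its weight."""
    probs = normalize(weights)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)  # seed chosen arbitrarily for the illustration
probs = normalize([1.0, 3.0, 0.0])
print(probs)  # [0.25, 0.75, 0.0]
index = sample_index([1.0, 3.0, 0.0], rng)  # index 2 has weight 0 and is never drawn
```

The explicit check for an all-zero vector matters here: a row of weights that sums to zero cannot be turned into a probability distribution.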

## Simple implementation of the signalling game

Implement in Python a class representing this signalling game. $S$, $M$ and $A$ should be parameters.

In [1]:
import numpy as np

class SignallingGame():

    def __init__(self, states: int, messages: int, actions: int, seed: int = 42):
        self.states = states
        self.messages = messages
        self.message_weights = np.full((states, messages), 1e-6)  # sender: one row per state
        self.actions = actions
        self.action_weights = np.full((messages, actions), 1e-6)  # receiver: one row per message
        self.rng = np.random.RandomState(seed)
        self.stats = []

    def world_state(self):
        # Nature draws a state uniformly at random.
        state = self.rng.randint(self.states)
        return state

    def emit_message(self, state):
        probs = self.message_weights[state, :]/np.sum(self.message_weights[state, :])
        # Sample with the seeded generator so that runs are reproducible.
        message = self.rng.choice(self.messages, p=probs)
        return message

    def perform_action(self, message):
        probs = self.action_weights[message, :]/np.sum(self.action_weights[message, :])
        # Sample with the seeded generator so that runs are reproducible.
        action = self.rng.choice(self.actions, p=probs)
        return action

    def payoff(self, state, action):
        return 1 if action == state else 0

    def update_weights(self, state, message, action, payoff):
        self.message_weights[(state, message)] += payoff 
        self.action_weights[(message, action)] += payoff

    def snapshot(self, state, message, action, payoff):
        self.stats.append({
            "s": state,
            "m": message,
            "a": action,
            "p": payoff,
            "mw": self.message_weights.copy(),
            "aw": self.action_weights.copy(),
        })

    def play(self, N):
        for _ in range(N):
            state = self.world_state()
            message = self.emit_message(state)
            action = self.perform_action(message)
            payoff = self.payoff(state, action)
            self.update_weights(state, message, action, payoff)
            self.snapshot(state, message, action, payoff)

We can test it on a game with three states, three messages and three actions.

In [4]:
game = SignallingGame(3, 3, 3)
game.play(100)

In [5]:
print([x["p"] for x in game.stats])

[0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


It works: after about twenty games of mixed results, every game is a success.

## Update weight matrices with Roth-Erev update

Implement a simulation method that plays the game $N$ times and updates the players' weight vectors using the Roth-Erev update. For a single game, the update is the following:

- $w_i = \lambda w_i + u$ if $i$ was chosen;
- $w_i = \lambda w_i$ if $i$ was not chosen;
- $w_i = \lambda w_i$ if $i$ was not sampled in this game;

where $u$ is the payoff and $\lambda \in [0,1]$.
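As a worked example of the rule above on a single weight vector, here is a minimal sketch. The function name `roth_erev_update` and the parameter name `lam` (standing for $\lambda$) are assumptions made for the illustration.

```python
def roth_erev_update(weights, chosen, payoff, lam):
    """Discount every weight by lam, then add the payoff to the chosen entry."""
    updated = [lam * w for w in weights]
    updated[chosen] += payoff
    return updated

weights = [1.0, 2.0, 4.0]
# With lam = 0.5, old weights are halved before the reward is added.
print(roth_erev_update(weights, chosen=1, payoff=1, lam=0.5))  # [0.5, 2.0, 2.0]
# With lam = 1, nothing is forgotten and the rule reduces to plain reinforcement.
print(roth_erev_update(weights, chosen=1, payoff=1, lam=1.0))  # [1.0, 3.0, 4.0]
```

Smaller values of $\lambda$ make past rewards fade faster, so recent games weigh more heavily in the next draw.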

In [6]:
class RothErevGame(SignallingGame):
    def __init__(self, states: int, messages: int, actions: int, l: float, seed: int = 42):
        super().__init__(states, messages, actions, seed)
        self.l = l

    def update_weights(self, state, message, action, payoff):
        self.message_weights[state] *= self.l
        self.message_weights[(state, message)] += payoff
        self.action_weights[message] *= self.l
        self.action_weights[(message, action)] += payoff

Let's look at the results.

In [7]:
roth_erev_game = RothErevGame(3, 3, 3, 0.5)
roth_erev_game.play(100)

In [8]:
print([x["p"] for x in roth_erev_game.stats])

[1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Well... it works: a handful of failures in the first thirty-odd games, then only successes.

## Plot

Design a plotting function that displays each of your statistics, typically as a function of the number of games played.

In [None]:
def plot_res():
    pass
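One possible statistic is the running success rate, plotted against the number of games played. The sketch below is only one way to do it: the function names are hypothetical, and the plotting helper assumes that matplotlib is installed (it is imported inside the function so that the statistic itself can be computed without it).

```python
def running_success_rate(payoffs):
    """Cumulative mean of the payoffs: entry t is the success rate over the first t+1 games."""
    rates, total = [], 0
    for t, p in enumerate(payoffs, start=1):
        total += p
        rates.append(total / t)
    return rates

def plot_success_rate(payoffs):
    """Plot the running success rate against the number of games played."""
    import matplotlib.pyplot as plt  # assumption: matplotlib is available

    rates = running_success_rate(payoffs)
    fig, ax = plt.subplots()
    ax.plot(range(1, len(rates) + 1), rates)
    ax.set_xlabel("games played")
    ax.set_ylabel("running success rate")
    ax.set_ylim(0, 1)
    return fig

print(running_success_rate([0, 1, 1, 1]))  # [0.0, 0.5, 0.6666666666666666, 0.75]
```

With the classes defined earlier, a call such as `plot_success_rate([x["p"] for x in game.stats])` would draw the curve for one run.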

Observe the simulation. How stable is it? Which parameters are important? Which are less important? Consider in particular:

- the initial conditions;
- the number of games played;
- the value of lambda;
- the number of states;
- the number of messages;
- the number of actions;
- the reward used (use a default 0/1 reward in the first place).
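To compare parameter settings, one option is to repeat the simulation over several seeds and look at the success rate in the last games of each run. The sketch below is a stand-alone, standard-library reimplementation of the game with the Roth-Erev update, not the notebook's classes; `simulate`, `final_success_rate`, the window of 100 games and the swept values of `lam` are all assumptions chosen for the illustration.

```python
import random

def simulate(states, messages, actions, lam, n_games, seed=0, init=1e-6):
    """Play n_games with Roth-Erev updates and return the list of 0/1 payoffs."""
    rng = random.Random(seed)
    sender = [[init] * messages for _ in range(states)]     # state -> message weights
    receiver = [[init] * actions for _ in range(messages)]  # message -> action weights
    payoffs = []
    for _ in range(n_games):
        state = rng.randrange(states)
        # random.choices samples proportionally to the weights, so no explicit
        # normalization is needed here.
        message = rng.choices(range(messages), weights=sender[state])[0]
        action = rng.choices(range(actions), weights=receiver[message])[0]
        payoff = 1 if action == state else 0
        # Discount the rows that were used, then reward the choices made.
        sender[state] = [lam * w for w in sender[state]]
        sender[state][message] += payoff
        receiver[message] = [lam * w for w in receiver[message]]
        receiver[message][action] += payoff
        payoffs.append(payoff)
    return payoffs

def final_success_rate(payoffs, window=100):
    """Mean payoff over the last `window` games."""
    tail = payoffs[-window:]
    return sum(tail) / len(tail)

# lam = 0 is left out: a single failed game would set a whole row to zero,
# after which sampling from that row is impossible.
for lam in (0.5, 0.9, 1.0):
    rates = [final_success_rate(simulate(3, 3, 3, lam, n_games=500, seed=s))
             for s in range(10)]
    print(lam, sum(rates) / len(rates))
```

Averaging over seeds separates the effect of a parameter from the luck of a single run; the same loop can be repeated over the number of states, messages or actions, or over the initial weight `init`.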