### Markov games

The following definitions are from Zhang et al. (2019). A **Markov game** is defined by a tuple $\left(\mathcal{N}, \mathcal{S}, \{ \mathcal{A}^i \}_{i \in \mathcal{N}}, \mathcal{P}, \{ R^i \}_{i \in \mathcal{N}}, \gamma \right)$, where $\mathcal{N} = \{ 1, \dots, N \}$ denotes the set of $N>1$ agents, $\mathcal{S}$ denotes the state space observed by all agents, and $\mathcal{A}^i$ denotes the action space of agent $i$. Let $\mathcal{A} := \mathcal{A}^1 \times \dots \times \mathcal{A}^N$ denote the joint action space. Then $\mathcal{P} : \mathcal{S} \times \mathcal{A} \rightarrow \Delta (\mathcal{S})$ denotes the transition probability from any state $s \in \mathcal{S}$ to any state $s' \in \mathcal{S}$ for any joint action $a \in \mathcal{A}$; $R^i: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ is the reward function that determines the immediate reward received by agent $i$ for a transition from $(s,a)$ to $s'$; and $\gamma \in [0,1]$ is the discount factor.

At time $t$, each agent $i \in \mathcal{N}$ executes an action $a_t^i$ according to the system state $s_t$. The system then transitions to state $s_{t+1}$ and rewards each agent $i$ with $R^i(s_t,a_t,s_{t+1})$, where $a_t = (a_t^1, \dots, a_t^N)$ is the joint action. The goal of agent $i$ is to optimise its long-term reward by finding a policy $\pi^i : \mathcal{S} \rightarrow \Delta (\mathcal{A}^i)$ such that $a^i_t \sim \pi^i(\cdot|s_t)$.

The value function $V^i:\mathcal{S} \rightarrow \mathbb{R}$ of agent $i$ becomes a function of the joint policy $\pi: \mathcal{S} \rightarrow \Delta (\mathcal{A})$ defined as $\pi(a|s) := \prod_{i \in \mathcal{N}} \pi^i (a^i|s)$. In particular, for any joint policy $\pi$ and state $s \in \mathcal{S}$,

$$ V^i_{\pi^i,\pi^{-i}} (s) := \mathbb{E} \left[ \sum_{t \geq 0} \gamma^t R^i (s_t,a_t,s_{t+1}) \,\middle|\, a_t^i \sim \pi^i (\cdot|s_t),\, s_0=s \right] $$

where $-i$ represents the indices of all agents in $\mathcal{N}$, not including $i$. 
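
The value function above can be computed numerically for a small game. The following is a minimal sketch using iterative policy evaluation on a hypothetical two-agent, two-state game. The transition rule, the rewards, the value $\gamma = 0.9$ and all function names are assumptions made for illustration and do not come from Zhang et al. (2019).

```python
from itertools import product

# Hypothetical toy game: 2 agents, 2 states, 2 actions per agent (all numbers made up).
STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9


def transition(s, a):
    """P(. | s, a) as a dict s' -> probability, for joint action a = (a^1, a^2)."""
    # Assumption: state 1 is reached w.p. 0.8 if the agents match actions, else 0.2.
    p1 = 0.8 if a[0] == a[1] else 0.2
    return {0: 1.0 - p1, 1: p1}


def reward(i, s, a, s_next):
    """R^i(s, a, s'): agent 0 is paid for reaching state 1, agent 1 for state 0."""
    return 1.0 if s_next == (1 - i) else 0.0


def joint_policy(policies, s, a):
    """pi(a|s) = product over agents of pi^i(a^i|s)."""
    prob = 1.0
    for pi_i, a_i in zip(policies, a):
        prob *= pi_i[s][a_i]
    return prob


def evaluate(i, policies, tol=1e-10):
    """Iterative policy evaluation of V^i under the joint policy."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {}
        for s in STATES:
            total = 0.0
            for a in product(ACTIONS, repeat=len(policies)):
                weight = joint_policy(policies, s, a)
                for s_next, p in transition(s, a).items():
                    total += weight * p * (reward(i, s, a, s_next) + GAMMA * V[s_next])
            new_V[s] = total
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V


# Both agents act uniformly at random in every state.
uniform = {s: [0.5, 0.5] for s in STATES}
V0 = evaluate(0, [uniform, uniform])
V1 = evaluate(1, [uniform, uniform])
```

Under the uniform joint policy each state is reached with probability 0.5, so each agent earns 0.5 per step in expectation and both value functions converge to $0.5 / (1 - 0.9) = 5$ in every state. Changing either agent's policy changes both $V^0$ and $V^1$, which is the dependence on the joint policy that the definition expresses.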

### Nash equilibrium (NE)

In a Markov game $\left(\mathcal{N}, \mathcal{S}, \{ \mathcal{A}^i \}_{i \in \mathcal{N}}, \mathcal{P}, \{ R^i \}_{i \in \mathcal{N}}, \gamma \right)$, a **Nash equilibrium** is a joint policy $\pi^* = \left(\pi^{1,*}, \dots, \pi^{N,*}\right)$ such that for any $s \in \mathcal{S}$ and $i \in \mathcal{N}$,

$$V^i_{\pi^{i,*}, \pi^{-i,*}} (s) \geq V^i_{\pi^i, \pi^{-i,*}} (s), \text{ for any } \pi^i$$ 

It characterises an equilibrium point $\pi^*$ from which none of the agents has any incentive to deviate. In a NE, each player effectively holds a correct expectation about the other players' behaviours and acts rationally with respect to this expectation. Acting rationally means the agent's strategy is a best response to the others' strategies, so no unilateral deviation would make that agent better off. In other words, for any agent $i \in \mathcal{N}$, the policy $\pi^{i,*}$ is the **best response** to $\pi^{-i, *}$.
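
The equilibrium condition can be checked directly in the simplest case: a single-state game played once, where the value function reduces to the expected one-shot payoff. The sketch below tests the inequality above for a two-player Prisoner's Dilemma. The payoff numbers and the names `PAYOFF`, `expected_payoff` and `is_nash` are illustrative assumptions. Only deviations to pure actions are tested, which suffices because an agent's expected payoff is linear in its own mixed strategy.

```python
from itertools import product

# PAYOFF[(a0, a1)] = (payoff to agent 0, payoff to agent 1)
# Action 0 = cooperate, action 1 = defect. Illustrative Prisoner's Dilemma numbers.
PAYOFF = {
    (0, 0): (-1, -1),
    (0, 1): (-3, 0),
    (1, 0): (0, -3),
    (1, 1): (-2, -2),
}
N_ACTIONS = 2


def expected_payoff(i, strategies):
    """Expected payoff to agent i when each agent samples from its mixed strategy."""
    total = 0.0
    for a in product(range(N_ACTIONS), repeat=2):
        prob = strategies[0][a[0]] * strategies[1][a[1]]
        total += prob * PAYOFF[a][i]
    return total


def is_nash(strategies, tol=1e-9):
    """True if no agent can gain by unilaterally switching to any pure action."""
    for i in range(2):
        current = expected_payoff(i, strategies)
        for a_i in range(N_ACTIONS):
            deviation = list(strategies)
            deviation[i] = [1.0 if k == a_i else 0.0 for k in range(N_ACTIONS)]
            if expected_payoff(i, deviation) > current + tol:
                return False
    return True


cooperate = [1.0, 0.0]
defect = [0.0, 1.0]

both_defect_is_ne = is_nash([defect, defect])           # True
both_cooperate_is_ne = is_nash([cooperate, cooperate])  # False
```

Mutual defection passes the check because switching to cooperation lowers the deviating agent's payoff from -2 to -3. Mutual cooperation fails because either agent can raise its payoff from -1 to 0 by defecting.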

However, NE suffers from non-uniqueness, meaning there may be multiple NE in one game. This complicates convergence: for example, agents may oscillate between their optimal strategies, or they may not all converge to the same NE.
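
Both problems show up even in a tiny example. The sketch below uses a hypothetical two-player coordination game (payoff 1 to both agents when their actions match, 0 otherwise) to enumerate its pure NE and to run simultaneous best-response dynamics. Those dynamics are chosen here purely for illustration and are not a learning algorithm from the source; all names are assumptions.

```python
from itertools import product

# PAYOFF[(a0, a1)] = (payoff to agent 0, payoff to agent 1); illustrative numbers.
PAYOFF = {
    (0, 0): (1, 1),
    (0, 1): (0, 0),
    (1, 0): (0, 0),
    (1, 1): (1, 1),
}
ACTIONS = [0, 1]


def best_response(i, other_action):
    """Pure action that maximises agent i's payoff against the other agent's action."""
    def payoff(a_i):
        joint = (a_i, other_action) if i == 0 else (other_action, a_i)
        return PAYOFF[joint][i]
    return max(ACTIONS, key=payoff)


def pure_nash_equilibria():
    """Joint actions in which each agent already plays a best response."""
    return [
        a for a in product(ACTIONS, repeat=2)
        if PAYOFF[a][0] == max(PAYOFF[(b, a[1])][0] for b in ACTIONS)
        and PAYOFF[a][1] == max(PAYOFF[(a[0], b)][1] for b in ACTIONS)
    ]


def simultaneous_best_response(start, steps):
    """Both agents best-respond at once to the other's previous action."""
    path = [start]
    for _ in range(steps):
        a0, a1 = path[-1]
        path.append((best_response(0, a1), best_response(1, a0)))
    return path


equilibria = pure_nash_equilibria()
path = simultaneous_best_response((0, 1), 4)
```

Here `equilibria` contains two pure NE, `(0, 0)` and `(1, 1)`, while `path` alternates between `(0, 1)` and `(1, 0)` indefinitely: each agent keeps chasing a different equilibrium, so the pair never settles on either one.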

Moreover, the definition rests on several assumptions, such as knowledge of the other agents' strategies and rationality, which do not generally hold in practice. In decision-making, the rationality of individuals is limited by the information they have, the cognitive limitations of their minds and the finite amount of time they have to make a decision. This applies both to human agents and machine agents and is referred to as bounded rationality.


### Nash Q-Learning

For an $N$-agent system, the Q-function is extended from the state-action pair of a single agent to the state and the actions of all agents: $Q(s,a^1,\dots,a^N)$. A **Nash Q-value** is then defined as the expected sum of discounted rewards when all agents follow specified NE strategies from the next period on.

Let $Q^i_*$ denote the **Nash Q-function** for agent $i$. It is defined over $(s,a^1,\dots, a^N)$ as agent $i$'s current reward plus its discounted future rewards when all agents follow a joint NE strategy:

$$  Q_*^i (s,a^1,\dots,a^N) = r^i(s,a^1,\dots,a^N) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P} (s'|s,a^1,\dots,a^N) V^i_{\pi^{i,*}, \pi^{-i,*}} (s') $$

where $(\pi^{1,*},\dots,\pi^{N,*})$ is the joint NE strategy, $r^i(s,a^1,\dots,a^N)$ is agent $i$'s one-period reward in state $s$ under joint action $(a^1, \dots, a^N)$, and $V^i_{\pi^{i,*}, \pi^{-i,*}} (s')$ is agent $i$'s total discounted reward over infinite periods starting from state $s'$, given that all agents follow the equilibrium strategies.
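
The Nash Q-function is a one-step backup: the immediate reward plus the discounted expectation, over next states, of the equilibrium value. A minimal sketch of that backup follows. The function `nash_q`, the toy transition and reward functions, and the numbers in `v_star` are hypothetical placeholders; in particular, `v_star` is simply assumed as given here and is not computed from an actual equilibrium.

```python
def nash_q(s, joint_action, reward_i, transition, v_star_i, gamma):
    """r^i(s, a) + gamma * sum over s' of P(s'|s, a) * V^i_*(s')."""
    expected_next = sum(
        p * v_star_i[s_next]
        for s_next, p in transition(s, joint_action).items()
    )
    return reward_i(s, joint_action) + gamma * expected_next


# Placeholder inputs for one agent in a two-state game (all values are assumptions).
v_star = {0: 2.0, 1: 4.0}  # assumed equilibrium values V^i_*(s')


def toy_transition(s, joint_action):
    """P(. | s, a): the same distribution for every (s, a), to keep the example small."""
    return {0: 0.25, 1: 0.75}


def toy_reward(s, joint_action):
    """r^i(s, a): a constant one-period reward."""
    return 1.0


q_value = nash_q(0, (0, 1), toy_reward, toy_transition, v_star, gamma=0.5)
```

With these inputs the expected next-state value is $0.25 \cdot 2 + 0.75 \cdot 4 = 3.5$, so the backup gives $1 + 0.5 \cdot 3.5 = 2.75$. Note that the value is evaluated at the next state $s'$ inside the sum, not at the current state $s$.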