# Active Inference Demo: T-Maze Environment
This demo notebook provides a full walk-through of active inference using the `Agent()` class of `inferactively`. The canonical example used here is the 'T-maze' task, often used in the active inference literature [cite stuff here].

### Imports

First, import `inferactively` and the modules we'll need.

In [1]:
import os
import sys
import pathlib
import numpy as np

path = pathlib.Path(os.getcwd())
module_path = str(path.parent.parent) + '/'
sys.path.append(module_path)

from inferactively.agent import Agent
from inferactively import core
from inferactively.distributions import Categorical, Dirichlet
from inferactively.envs import TMazeEnv

### Auxiliary Functions

Define some demo-specific auxiliary functions that will be helpful for plotting.

## Environment

Here we consider an agent navigating a three-armed 'T-maze,' with the agent starting in a central location of the maze. The bottom arm of the maze contains an informative cue, which signals in which of the two top arms ('Left' or 'Right', the ends of the 'T') a reward is likely to be found. 

At each timestep, the environment is described by the joint occurrence of two qualitatively-different 'kinds' of states (hereafter referred to as _hidden state factors_). These hidden state factors are independent of one another.

We represent the first hidden state factor (`Location`) as a $1 \times 4$ vector that encodes the current position of the agent, and can take the following values: {`CENTER`, `RIGHT ARM`, `LEFT ARM`, or `CUE LOCATION`}. For example, if the agent is in the `CUE LOCATION`, the current state of this factor would be $s_1 = [0 \ 0 \ 0 \ 1]$.

We represent the second hidden state factor (`Reward Condition`) as a $1 \times 2$ vector that encodes the reward condition of the trial: {`Reward on Right`, or `Reward on Left`}. A trial where the reward condition is `Reward on Left` is thus encoded as the state $s_2 = [0 \ 1]$.

The environment is designed such that when the agent is located in the `RIGHT ARM` and the reward condition is `Reward on Right`, the agent has a specified probability $a$ (where $a > 0.5$) of receiving a reward, and a low probability $b = 1 - a$ of receiving a 'loss' (we can think of this as an aversive or unpreferred stimulus). If the agent is in the `LEFT ARM` for the same reward condition, the reward probabilities are swapped, and the agent experiences loss with probability $a$, and reward with lower probability $b = 1 - a$. These reward contingencies are intuitively swapped for the `Reward on Left` condition. 

For instance, we can encode the state of the environment at the first time step in a `Reward on Right` trial with the following pair of hidden state vectors: $s_1 = [1 \ 0 \ 0 \ 0]$, $s_2 = [1 \ 0]$, where we assume the agent starts sitting in the central location. If the agent moved to the right arm, then the corresponding hidden state vectors would now be $s_1 = [0 \ 1 \ 0 \ 0]$, $s_2 = [1 \ 0]$. This highlights the _independence_ of the two hidden state factors -- the location of the agent ($s_1$) can change without affecting the identity of the reward condition ($s_2$).
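
The encoding described above can be sketched in plain NumPy. This is only an illustration: the `one_hot` helper and the two label lists are hypothetical names introduced here, not part of `inferactively`.

```python
import numpy as np

# Label order follows the text: index 0 is CENTER / 'Reward on Right'.
LOCATIONS = ["CENTER", "RIGHT ARM", "LEFT ARM", "CUE LOCATION"]
REWARD_CONDITIONS = ["Reward on Right", "Reward on Left"]

def one_hot(index, size):
    """Return a length-`size` vector with a 1 at `index` and 0 elsewhere."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

# First timestep of a 'Reward on Right' trial: the agent sits in the centre.
s1 = one_hot(LOCATIONS.index("CENTER"), len(LOCATIONS))
s2 = one_hot(REWARD_CONDITIONS.index("Reward on Right"), len(REWARD_CONDITIONS))

# Moving to the right arm changes only the location factor.
s1_next = one_hot(LOCATIONS.index("RIGHT ARM"), len(LOCATIONS))
s2_next = s2.copy()
```

Because the two factors are stored as separate vectors, updating `s1` never touches `s2`, which is the independence property the paragraph describes.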


### 1. Initialize environment
Now we can initialize the T-maze environment using the built-in `TMazeEnv` class from the `inferactively.envs` module.

Choose reward probabilities $a$ and $b$: in the 'correct' arm the agent receives a reward with probability $a$ and a loss with probability $b$, while in the 'incorrect' arm these probabilities are swapped (loss with probability $a$, reward with probability $b$). Which arm counts as 'correct' vs. 'incorrect' depends on the reward condition (state of the 2nd hidden state factor).

In [6]:
reward_probabilities = [0.98, 0.02] # probabilities used in Karl's original SPM demo

Initialize an instance of the T-maze environment

In [7]:
env = TMazeEnv(reward_probs = reward_probabilities)

### Structure of the state --> outcome mapping
We can 'peer into' the rules encoded by the environment (also known as the _generative process_) by looking at the probability distributions that map from hidden states to observations. Following the SPM version of active inference, we refer to this collection of probabilistic relationships as the $A$ array. In the case of the true rules of the environment, we refer to this array as `A_gp` (where `gp` denotes the generative process). 

It is worth outlining what the observations are in this task. Here, we have three sensory channels or observation modalities: `Location`, `Reward`, and `Cue`. 

>The `Location` observation values are identical to the `Location` hidden state values. In this case, the agent always unambiguously observes its own state - if the agent is in `RIGHT ARM`, it receives a `RIGHT ARM` observation in the corresponding modality. This might be analogized to a 'proprioceptive' sense of place.

>The `Reward` observation modality assumes the values `No Reward`, `Reward` or `Loss`. The agent receives the `No Reward` (index 0) observation whenever it isn't occupying one of the two T-maze arms (the right or left arms). The `Reward` (index 1) and `Loss` (index 2) observations occur in the right and left arms of the T-maze, with probabilities that depend on the reward condition (i.e. on the value of the second hidden state factor).

> The `Cue` observation modality assumes the values `Cue Right` and `Cue Left`. When the agent is in the `CUE LOCATION`, this observation unambiguously signals the reward condition of the trial, and therefore in which arm the `Reward` observation is more probable. When the agent occupies any other location, the `Cue` observation will be `Cue Right` or `Cue Left` with equal probability. However (as we'll see below when we initialize the agent), the agent's beliefs about the likelihood mapping render these observations uninformative and irrelevant to state inference.
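
As a rough illustration of the `Cue` modality just described, the array below is built by hand in NumPy. The `[observation, location, reward condition]` index order is an assumption (it mirrors how the `Reward` modality is sliced later in this notebook), and `A_cue` is a hypothetical name rather than something read out of `TMazeEnv`.

```python
import numpy as np

n_locations, n_conditions = 4, 2  # CENTER, RIGHT ARM, LEFT ARM, CUE LOCATION
CUE_LOCATION = 3

# A_cue[o, location, condition] = P(cue observation o | location, condition)
# Away from the cue location both cue observations are equally likely.
A_cue = np.full((2, n_locations, n_conditions), 0.5)

# At the cue location the observation matches the reward condition exactly.
A_cue[:, CUE_LOCATION, :] = np.eye(n_conditions)
```

Every `(location, condition)` column sums to one, so each column is a valid categorical distribution over the two cue observations.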



In [16]:
A_gp = env.get_likelihood_dist()

In [17]:
A_gp[1][:,:,0] # mapping between arms and reward probabilities in 'Reward Right' condition

<Categorical Distribution> 
 [[1.   0.   0.   1.  ]
 [0.   0.98 0.02 0.  ]
 [0.   0.02 0.98 0.  ]]

In [18]:
A_gp[1][:,:,1] # mapping between arms and reward probabilities in 'Reward Left' condition

<Categorical Distribution> 
 [[1.   0.   0.   1.  ]
 [0.   0.02 0.98 0.  ]
 [0.   0.98 0.02 0.  ]]
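
The two slices printed above can be rebuilt by hand, which makes the role of $a$ and $b$ explicit. This is a sketch in plain NumPy, not the code `TMazeEnv` uses internally; `A_reward` is a hypothetical name.

```python
import numpy as np

a, b = 0.98, 0.02  # same values as reward_probabilities

# A_reward[o, location, condition]; rows are No Reward, Reward, Loss.
A_reward = np.zeros((3, 4, 2))

# CENTER (0) and CUE LOCATION (3) always yield 'No Reward'.
A_reward[0, [0, 3], :] = 1.0

# 'Reward on Right' (condition 0)
A_reward[1:, 1, 0] = [a, b]  # RIGHT ARM: reward likely
A_reward[1:, 2, 0] = [b, a]  # LEFT ARM: loss likely

# 'Reward on Left' (condition 1): contingencies swapped
A_reward[1:, 1, 1] = [b, a]
A_reward[1:, 2, 1] = [a, b]
```

Slicing `A_reward[:, :, 0]` and `A_reward[:, :, 1]` gives the same matrices as the two outputs above.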

In [21]:
B_gp = env.get_transition_dist()
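
The contents of `B_gp` are not printed in this notebook, so the following is only a guess at its general shape. It assumes that action `k` deterministically moves the agent to location `k`, and that the reward condition never changes within a trial. The real environment may differ, and `B_loc` and `B_cond` are hypothetical names.

```python
import numpy as np

n_locations = 4

# B_loc[next, current, action]: ASSUMPTION, action k always leads to location k.
B_loc = np.zeros((n_locations, n_locations, n_locations))
for action in range(n_locations):
    B_loc[action, :, action] = 1.0

# The reward condition is uncontrolled and fixed: a single identity transition.
B_cond = np.eye(2).reshape(2, 2, 1)

# Starting in CENTER and choosing action 1 (move to RIGHT ARM):
s1 = np.array([1.0, 0.0, 0.0, 0.0])
s1_next = B_loc[:, :, 1] @ s1

# The reward condition is carried over unchanged.
s2 = np.array([1.0, 0.0])
s2_next = B_cond[:, :, 0] @ s2
```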

### The generative model
Now we can move on to setting up the generative model of the agent: namely, the agent's beliefs about how hidden states give rise to observations and how its own actions affect hidden states.

In [22]:
A_gm = A_gp.copy()
B_gm = B_gp.copy()
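
To see how a likelihood mapping supports inference about the reward condition, here is a worked single-factor Bayesian update. It assumes the agent already knows it is in the `RIGHT ARM` and has just observed `Reward`. It uses exact Bayes' rule for illustration; the scheme used inside `agent.infer_states` may differ.

```python
import numpy as np

# Prior belief over [Reward on Right, Reward on Left]
prior = np.array([0.5, 0.5])

# P(Reward observation | RIGHT ARM, condition), read off the printed A_gp[1] slices
likelihood = np.array([0.98, 0.02])

# Bayes' rule: posterior is proportional to likelihood times prior
unnormalised = likelihood * prior
posterior = unnormalised / unnormalised.sum()
```

With a flat prior the posterior simply equals the normalised likelihood, so a single reward in the right arm already makes `Reward on Right` the far more probable condition.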

In [23]:
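# the agent can control only the first hidden state factor (Location)
agent = Agent(A=A_gm, B=B_gm, control_fac_idx=[0])
# preferences over the Reward modality (index 1): favour `Reward` (index 1), avoid `Loss` (index 2)
agent.C[1][1] = 3.0
agent.C[1][2] = -3.0
T = 10  # number of timesteps to simulate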
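
One common way to read the `C` values is as unnormalised log-preferences over outcomes, which a softmax turns into a probability distribution. The sketch below assumes that reading and assumes the `No Reward` entry stays at zero. Whether `Agent` applies a softmax internally is not shown in this notebook.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Preferences over the Reward modality: No Reward, Reward, Loss
C_reward = np.array([0.0, 3.0, -3.0])
preferred = softmax(C_reward)
```

Under this reading, `Reward` carries almost all of the preferred probability mass, `No Reward` a small share, and `Loss` close to none.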

In [24]:
obs = env.reset()

reward_conditions = ["Reward on Right", "Reward on Left"]  # index order matches the second hidden state factor
msg = """ === Starting experiment === \n Reward condition: {}, Initial observation {} """
print(msg.format(reward_conditions[env.reward_condition], obs))

for t in range(T):
    # update beliefs about hidden states given the latest observation
    qx = agent.infer_states(obs)
    msg = """[{}] Inference [Arm {} / reward {}] """
    print(msg.format(t, qx[0].sample(), qx[1].sample()))

    # evaluate policies, then sample an action
    q_pi, efe = agent.infer_policies()

    action = agent.sample_action()

    msg = """[Step {}] Action: [Move to Arm {}]"""
    print(msg.format(t, action[0]))

    # apply the action to the environment and collect the next observation
    obs = env.step(action)

    msg = """[Step {}] Observation: [Arm {}, Reward {}]"""
    print(msg.format(t, obs[0], obs[1]))

 === Starting experiment === 
 Reward condition: Reward on Right, Initial observation (0, 0, 1) 
[0] Inference [Arm 0 / reward 0] 
[Step 0] Action: [Move to Arm 1]
[Step 0] Observation: [Arm 1, Reward 1]
[1] Inference [Arm 1 / reward 0] 
[Step 1] Action: [Move to Arm 1]
[Step 1] Observation: [Arm 1, Reward 1]
[2] Inference [Arm 1 / reward 0] 
[Step 2] Action: [Move to Arm 1]
[Step 2] Observation: [Arm 1, Reward 1]
[3] Inference [Arm 1 / reward 0] 
[Step 3] Action: [Move to Arm 1]
[Step 3] Observation: [Arm 1, Reward 1]
[4] Inference [Arm 1 / reward 0] 
[Step 4] Action: [Move to Arm 1]
[Step 4] Observation: [Arm 1, Reward 1]
[5] Inference [Arm 1 / reward 0] 
[Step 5] Action: [Move to Arm 1]
[Step 5] Observation: [Arm 1, Reward 1]
[6] Inference [Arm 1 / reward 0] 
[Step 6] Action: [Move to Arm 1]
[Step 6] Observation: [Arm 1, Reward 1]
[7] Inference [Arm 1 / reward 0] 
[Step 7] Action: [Move to Arm 1]
[Step 7] Observation: [Arm 1, Reward 1]
[8] Inference [Arm 1 / reward 0] 
[Step 8] Actio

  We have removed zeros by adding a small non-negative scalar to each value."
