# Agent environment with bit-flipping actions and a random target

## Objective
The state space is $S=\{0,1\}^n$ for a fixed $n$. The objective is to find a policy $\pi_t(s)$, conditioned on a target $t \in S$, that reaches $t$ from $s$ by flipping bits. The action set is $A=\{0,\ldots,n-1\}$, where action $a$ flips bit $a$. The corresponding DQN is $Q: S \times S \times A \rightarrow \mathbb{R}$, mapping $(s,t,a)$ to a $Q$-value.
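
The setup above can be sketched as a small environment. This is a minimal illustration under stated assumptions, not the `env` object returned by `train_DQN_agent`: the class name `BitFlipEnv`, its methods, and the choice to end an episode once the state equals the target are all hypothetical.

```python
import random


class BitFlipEnv:
    """Hypothetical bit-flipping environment (not the notebook's own `env`)."""

    def __init__(self, n, seed=None):
        self.n = n
        self.rng = random.Random(seed)
        self.state = [0] * n
        self.target = [0] * n

    def reset(self):
        # Sample the start state s and the target t uniformly from {0,1}^n.
        self.state = [self.rng.randint(0, 1) for _ in range(self.n)]
        self.target = [self.rng.randint(0, 1) for _ in range(self.n)]
        return tuple(self.state), tuple(self.target)

    def step(self, action):
        # Action a in {0, ..., n-1} flips bit a of the current state.
        self.state[action] ^= 1
        # Assumption: the episode is done once the state equals the target.
        done = self.state == self.target
        return tuple(self.state), done


env = BitFlipEnv(n=5, seed=0)
s, t = env.reset()
# Flipping exactly the bits where s and t differ reaches the target.
for i in range(env.n):
    if s[i] != t[i]:
        env.step(i)
```

After the loop, `env.state` matches the target, which is what an optimal policy would achieve in the fewest possible steps.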

## Neural network model details
 * Input: a vector in $\mathbb{R}^n \times \mathbb{R}^n$ holding the bit sequences of the state and the goal (e.g. $00011 \sim (0, 0, 0, 1, 1)$).
 * Output: a vector in $\mathbb{R}^{n}$ of $Q$-values
    * Entry $a \in \{0,\ldots,n-1\}$ corresponds to flipping bit $a$
    * So $\text{model}(s,t)[a]$ is the $Q$-value $Q(s,t,a)$
 * Architecture: Simple MLP
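
To make the input and output shapes concrete, here is a dependency-free stand-in for the network. It is only a sketch: the real models (`runtime.UVFA`, `runtime.HANDCRAFTED`) are not shown in this notebook, and the helper names `encode`, `init_mlp`, `forward`, the single hidden layer, its width, and the ReLU activation are assumptions.

```python
import random


def encode(state, goal):
    # Concatenate state bits and goal bits into one vector of length 2n.
    return [float(b) for b in state] + [float(b) for b in goal]


def init_mlp(n, hidden=16, seed=0):
    # Hypothetical one-hidden-layer MLP: 2n inputs -> hidden units -> n outputs.
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(2 * n)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(n)]
    b2 = [0.0] * n
    return w1, b1, w2, b2


def forward(params, x):
    w1, b1, w2, b2 = params
    # Hidden layer with ReLU activation (an assumed choice).
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    # Linear output layer: one Q-value per bit index.
    return [sum(w * hi for w, hi in zip(row, h)) + b
            for row, b in zip(w2, b2)]


n = 5
params = init_mlp(n)
s, t = (0, 0, 0, 1, 1), (1, 0, 0, 1, 0)
q_values = forward(params, encode(s, t))                # plays the role of model(s, t)
greedy_action = max(range(n), key=lambda a: q_values[a])  # bit with the largest Q-value
```

Here `q_values[a]` corresponds to $\text{model}(s,t)[a]$, and the greedy action is the index with the largest entry.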

## Training method
Simple DQN with replay
   * Exploration step -> update Q network -> validation step

**Exploration step**
We initialize 16 agents, each at its own random starting state. Each agent takes the action chosen by the DQN, except that with probability $\epsilon$ the action is instead drawn uniformly at random. The resulting transitions are added to the experience buffer. Each episode lasts at most $n$ steps.
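
The exploration step could look roughly like the sketch below. It is not the code in `src.runtime`: the function `explore`, the `q_fn` callable, the transition layout, drawing a fresh random target per agent, stopping early when the target is reached, and the sparse reward (0 on reaching the target, -1 otherwise) are all assumptions made for illustration. The agents are also run one after another here rather than in parallel.

```python
import random
from collections import deque


def explore(q_fn, n, epsilon, buffer, num_agents=16, rng=random):
    """Collect epsilon-greedy transitions; q_fn(s, t) returns n Q-values."""
    for _ in range(num_agents):
        s = tuple(rng.randint(0, 1) for _ in range(n))
        t = tuple(rng.randint(0, 1) for _ in range(n))
        for _ in range(n):  # an episode has at most n steps
            if s == t:
                break  # assumed: stop once the target is reached
            if rng.random() < epsilon:
                a = rng.randrange(n)  # uniform random action
            else:
                q = q_fn(s, t)
                a = max(range(n), key=lambda i: q[i])  # greedy action
            s_next = s[:a] + (1 - s[a],) + s[a + 1:]  # flip bit a
            done = s_next == t
            reward = 0.0 if done else -1.0  # assumed sparse reward
            buffer.append((s, t, a, reward, s_next, done))
            s = s_next


buffer = deque(maxlen=10_000)  # assumed buffer capacity
# Placeholder Q-function that returns zeros, standing in for the DQN.
explore(lambda s, t: [0.0] * 5, n=5, epsilon=0.3, buffer=buffer,
        rng=random.Random(0))
```

Each stored tuple is `(s, t, a, reward, s_next, done)`, so the goal travels with the transition and can be fed back to the network during the update.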

**Update Q network**
Sample a random batch of $256$ transitions from the experience buffer and update the DQN by gradient descent so that its $Q$-values better satisfy the Bellman equation.
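
As a rough illustration of the regression targets implied by the Bellman equation, the sketch below computes $y = r + \gamma \max_{a'} Q(s', t, a')$ for non-terminal transitions and $y = r$ for terminal ones. The helper names, the transition layout `(s, t, a, r, s_next, done)`, the reward values, and the discount factor $\gamma$ are assumptions; the notebook does not state them, nor whether a separate target network is used.

```python
import random


def sample_batch(buffer, batch_size=256, rng=random):
    # Uniformly sample transitions without replacement
    # (fewer than batch_size if the buffer is still small).
    items = list(buffer)
    return rng.sample(items, min(batch_size, len(items)))


def td_targets(batch, q_fn, gamma=0.99):
    """Bellman targets; q_fn(s, t) returns one Q-value per action."""
    targets = []
    for s, t, a, r, s_next, done in batch:
        if done:
            targets.append(r)  # no bootstrap term after a terminal step
        else:
            targets.append(r + gamma * max(q_fn(s_next, t)))
    return targets


def mse_loss(batch, targets, q_fn):
    # Mean squared error between Q(s, t, a) and its target;
    # gradient descent on this quantity is the DQN update.
    errors = [(q_fn(s, t)[a] - y) ** 2
              for (s, t, a, _, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)


# Two hand-written transitions for n = 2 (hypothetical values).
batch = [
    ((0, 1), (1, 1), 0, 0.0, (1, 1), True),
    ((0, 0), (1, 1), 0, -1.0, (1, 0), False),
]
targets = td_targets(batch, lambda s, t: [1.0, 4.0], gamma=0.5)
loss = mse_loss(batch, targets, lambda s, t: [1.0, 4.0])
```

In an actual training loop the loss would be computed on tensors so that an optimizer can backpropagate through the network; the pure-Python version here only shows the arithmetic.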

**Validation step**
Assess the performance of the learnt policy. Initialize 1024 agents and let the DQN fully decide the actions (no action is replaced by a uniformly random one with probability $\epsilon$). Since this is the validation step, it does not interfere with training: neither the experience buffer nor the model weights are updated.
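
A validation pass along these lines might be sketched as follows. The function `validate`, the reported metrics (success rate and mean steps over successful episodes), and the step cap of $n$ per episode are assumptions rather than a description of what `train_DQN_agent` actually logs. An oracle Q-function that prefers mismatched bits stands in for a trained model.

```python
import random


def validate(q_fn, n, num_agents=1024, rng=random):
    """Greedy rollouts only: nothing is stored and no weights change."""
    successes, total_steps = 0, 0
    for _ in range(num_agents):
        s = tuple(rng.randint(0, 1) for _ in range(n))
        t = tuple(rng.randint(0, 1) for _ in range(n))
        steps = 0
        while s != t and steps < n:  # assumed cap of n steps per episode
            q = q_fn(s, t)
            a = max(range(n), key=lambda i: q[i])  # always greedy, no epsilon
            s = s[:a] + (1 - s[a],) + s[a + 1:]
            steps += 1
        if s == t:
            successes += 1
            total_steps += steps
    success_rate = successes / num_agents
    mean_steps = total_steps / successes if successes else float("nan")
    return success_rate, mean_steps


def oracle_q(s, t):
    # Hypothetical optimal Q-function: prefer any bit that still differs.
    return [1.0 if si != ti else 0.0 for si, ti in zip(s, t)]


success_rate, mean_steps = validate(oracle_q, n=6, rng=random.Random(0))
```

With the oracle, every episode succeeds and the mean step count should sit near $n/2$, which gives a reference point for judging a learnt model.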


## Notes
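For an optimal agent, the number of steps equals the Hamming distance between the start state and the target, i.e. the number of bits that differ. With both drawn uniformly at random, a distance of $k$ occurs with probability $\dbinom{n}{k}/2^n$, so $E[\text{steps}] = \frac{1}{2^n}\sum_{k=0}^n \dbinom{n}{k}k = \frac{n}{2}$. We expect the average number of steps to be close to $\frac{n}{2}$ if the learnt agent is optimal.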
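
The closed form can be checked numerically with exact arithmetic for the values of $n$ used in the experiments. The helper name `expected_optimal_steps` is introduced here only for illustration.

```python
from fractions import Fraction
from math import comb


def expected_optimal_steps(n):
    # E[steps] = (1 / 2^n) * sum_{k=0}^{n} C(n, k) * k, computed exactly.
    return Fraction(sum(comb(n, k) * k for k in range(n + 1)), 2 ** n)


# Compare against n / 2 for the problem sizes tested in this notebook.
checks = {n: expected_optimal_steps(n) == Fraction(n, 2)
          for n in (5, 10, 15, 20, 30)}
```

Every entry of `checks` is `True`, consistent with the identity $\sum_{k} \binom{n}{k} k = n\,2^{n-1}$.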

In [1]:
import torch
import src.runtime as runtime
from src.runtime import train_DQN_agent

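# Use the first GPU when available; otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")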

### Testing $n=5$, UVFA

In [None]:
agent, env = train_DQN_agent(5, device=device, episodes=3000, use_HER=False, model_type=runtime.UVFA)

### Testing $n=10$, UVFA

In [None]:
agent, env = train_DQN_agent(10, device=device, episodes=3000, use_HER=False, model_type=runtime.UVFA)

### Testing $n=15$, UVFA

In [None]:
agent, env = train_DQN_agent(15, device=device, episodes=3000, use_HER=False, model_type=runtime.UVFA)

### Testing $n=20$, UVFA

In [None]:
agent, env = train_DQN_agent(20, device=device, episodes=3000, use_HER=False, model_type=runtime.UVFA)

### Testing $n=20$, Handcrafted

In [None]:
agent, env = train_DQN_agent(20, device=device, episodes=3000, use_HER=False, model_type=runtime.HANDCRAFTED)

### Testing $n=30$, Handcrafted

In [None]:
agent, env = train_DQN_agent(30, device=device, episodes=3000, use_HER=False, model_type=runtime.HANDCRAFTED)