<a href="https://colab.research.google.com/github/lefteryx/RIS-MISO-Deep-Reinforcement-Learning/blob/main/RIS_MISO_Deep_Reinforcement_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RIS-MISO-Deep-Reinforcement-Learning
This notebook aims to relate equations in the IEEE-published research paper
[Reconfigurable Intelligent Surface Assisted Multiuser MISO Systems Exploiting Deep Reinforcement Learning](https://arxiv.org/abs/2002.10072) to corresponding code in its [PyTorch implementation](https://github.com/baturaysaglam/RIS-MISO-Deep-Reinforcement-Learning).

## Equations 1 and 3
$$
y_{k}=\mathbf{h}_{k, 2}^{T} \boldsymbol{\Phi} \mathbf{H}_{1} \mathbf{G x}+w_{k}, \tag{1}
$$

and

$$
y_{k}=\mathbf{h}_{k, 2}^{T} \boldsymbol{\Phi} \mathbf{H}_{1} \mathbf{g}_{k} x_{k}+\sum_{n, n \neq k}^{K} \mathbf{h}_{k, 2}^{T} \boldsymbol{\Phi} \mathbf{H}_{1} \mathbf{g}_{n} x_{n}+w_{k}, \tag{3}
$$

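These equations correspond to code in ```lines 83 to 86``` in ```environment.py```, which computes the power of the desired-signal term $\left|\mathbf{h}_{k, 2}^{T} \boldsymbol{\Phi} \mathbf{H}_{1} \mathbf{g}_{k}\right|^{2}$ :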
```py
x = np.abs(h_2_k.T @ Phi @ self.H_1 @ g_k) ** 2
x = x.item()
```
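
To see that Equations (1) and (3) describe the same received signal, the following minimal sketch builds random channels and evaluates both forms. The dimensions, the random channel model, and the helper `crandn` are assumptions made for illustration; they are not taken from ```environment.py```, which stores its vectors as column matrices rather than 1-D arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small dimensions: M BS antennas, N RIS elements, K users
M, N, K = 4, 8, 3


def crandn(*shape):
    """Draw i.i.d. complex Gaussian entries (illustrative channel model)."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)


H_1 = crandn(N, M)   # BS -> RIS channel
h_2 = crandn(N, K)   # column k is the RIS -> user k channel h_{k,2}
G = crandn(M, K)     # column k is the beamforming vector g_k
Phi = np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, N)))  # unit-modulus phase shifts
x = crandn(K)        # transmitted symbols x_k
w = 0.1 * crandn(K)  # receiver noise w_k

k = 0
h_2_k = h_2[:, k]

# Equation (1): received signal written with the full beamforming matrix G
y_eq1 = h_2_k.T @ Phi @ H_1 @ G @ x + w[k]

# Equation (3): the same signal split into desired and interfering parts
desired = h_2_k.T @ Phi @ H_1 @ G[:, k] * x[k]
interference = sum(h_2_k.T @ Phi @ H_1 @ G[:, n] * x[n] for n in range(K) if n != k)
y_eq3 = desired + interference + w[k]

# Power of the desired-signal coefficient, as in the quoted lines
signal_power = np.abs(h_2_k.T @ Phi @ H_1 @ G[:, k]) ** 2

print(np.isclose(y_eq1, y_eq3))
```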

## Equation 4

$$
\rho_{k}=\frac{\left|\mathbf{h}_{k, 2}^{T} \boldsymbol{\Phi} \mathbf{H}_{1} \mathbf{g}_{k}\right|^{2}}{\sum_{n, n \neq k}^{K}\left|\mathbf{h}_{k, 2}^{T} \boldsymbol{\Phi} \mathbf{H}_{1} \mathbf{g}_{n}\right|^{2}+\sigma_{n}^{2}}, \tag{4}
$$

Corresponds to code in ```lines 89 to 92``` in ```environment.py``` :

```py
interference = np.sum(np.abs(h_2_k.T @ Phi @ self.H_1 @ G_removed) ** 2)
y = interference + (self.K - 1) * self.awgn_var
rho_k = x / y
```           
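
As a rough illustration of Equation (4), the sketch below computes $\rho_k$ for all users at once. The function name `sinr_per_user` and the array layout are assumptions. It follows the equation as written and adds the noise variance once, whereas the quoted line scales `self.awgn_var` by `self.K - 1`.

```python
import numpy as np


def sinr_per_user(h_2, Phi, H_1, G, noise_var):
    """Return rho_k of Equation (4) for every user k.

    h_2: (N, K) RIS-to-user channels, Phi: (N, N), H_1: (N, M), G: (M, K).
    """
    # Entry (k, n) is |h_{k,2}^T Phi H_1 g_n|^2
    power = np.abs(h_2.T @ Phi @ H_1 @ G) ** 2
    signal = np.diag(power)                    # terms with n == k
    interference = power.sum(axis=1) - signal  # terms with n != k
    return signal / (interference + noise_var)


# Toy example with identity channels so the numbers are easy to follow
eye = np.eye(2)
G_toy = np.array([[1.0, 0.5], [0.5, 1.0]])
rho = sinr_per_user(eye, eye, eye, G_toy, noise_var=0.25)
print(rho)
```

In the toy example each user receives signal power 1, interference power 0.25, and noise power 0.25.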

## Equation 5
$$
C\left(\mathbf{G}, \mathbf{\Phi}, \mathbf{h}_{k, 2}, \mathbf{H}_{1}\right)=\sum_{k=1}^{K} R_{k}, \tag{5}
$$


where $R_{k}=\log_{2}\left(1+\rho_{k}\right)$.

Corresponds to code in ```line 94``` in ```environment.py```, where dividing by `np.log(2)` converts the natural logarithm to base 2 :

```py
reward += np.log(1 + rho_k) / np.log(2)
```
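
A small sketch of Equation (5), assuming the per-user SINR values are already available. The helper `sum_rate` and the sample values are hypothetical; the second computation mirrors the accumulation style of the quoted line.

```python
import numpy as np


def sum_rate(rho):
    """Sum of log2(1 + rho_k) over all users, as in Equation (5)."""
    return float(np.sum(np.log2(1.0 + np.asarray(rho))))


rho = [1.0, 3.0, 7.0]  # illustrative SINR values
total = sum_rate(rho)  # log2(2) + log2(4) + log2(8)

# Same quantity accumulated user by user with a change of base
reward = 0.0
for rho_k in rho:
    reward += np.log(1 + rho_k) / np.log(2)

print(total, reward)
```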

## Equation 6

$$
\begin{aligned}
& \max _{\mathbf{G}, \boldsymbol{\Phi}} C\left(\mathbf{G}, \mathbf{\Phi}, \mathbf{h}_{k, 2}, \mathbf{H}_{1}\right) \\
& \text { s.t. } \operatorname{tr}\left\{\mathbf{G G}^{\mathcal{H}}\right\} \leq P_{t} \\
& \left|\phi_{n}\right|=1 \forall n=1,2, \ldots, N \text {. }
\end{aligned}
$$

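The closest counterpart in the code is ```line 95``` in file ```environment.py```, which accumulates the sum rate with the interference term dropped. For the same $\mathbf{G}$ and $\boldsymbol{\Phi}$ this is an upper bound on the reward, not the constrained maximization itself :

```py
opt_reward += np.log(1 + x / ((self.K - 1) * self.awgn_var)) / np.log(2)
```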
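
The two constraints of Equation (6) can be illustrated with simple projections. This is only a sketch under assumed names (`project_beamformer`, `project_phases`); it is not necessarily how the PyTorch implementation enforces the constraints.

```python
import numpy as np


def project_beamformer(G, P_t):
    """Scale G down, if needed, so that tr{G G^H} <= P_t."""
    power = np.real(np.trace(G @ G.conj().T))
    if power > P_t:
        G = G * np.sqrt(P_t / power)
    return G


def project_phases(phi):
    """Map each entry onto the unit circle so that |phi_n| = 1."""
    return np.exp(1j * np.angle(phi))


G = 2.0 * np.ones((2, 2), dtype=complex)  # tr{G G^H} = 16, violates P_t = 4
G_proj = project_beamformer(G, P_t=4.0)
power_after = np.real(np.trace(G_proj @ G_proj.conj().T))

phi = project_phases(np.array([3 + 4j, -2j, 0.5]))
print(power_after, np.abs(phi))
```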

## Equation 7

$$
\begin{aligned}
Q_{\pi}\left(s^{(t)}, a^{(t)}\right) & =\mathcal{E}_{\pi}\left[R^{(t)} \mid s^{(t)}=s, a^{(t)}=a\right] \\
R^{(t)} & =\sum_{\tau=0}^{\infty} \gamma^{\tau} r^{(t+\tau+1)},
\end{aligned}
$$
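
The discounted return $R^{(t)}$ in Equation 7 can be sketched for a finite reward sequence as follows. Truncating the infinite sum to a finite list is an assumption made for illustration, and `discounted_return` is a hypothetical helper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**tau * rewards[tau] over a finite list of rewards."""
    total = 0.0
    # Work backwards so each step is one multiply and one add
    for r in reversed(rewards):
        total = r + gamma * total
    return total


R = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
print(R)
```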

## Equation 8

$$
\begin{aligned}
Q_{\pi}\left(s^{(t)}, a^{(t)}\right)= & \mathcal{E}_{\pi}\left[r^{(t+1)} \mid s^{(t)}=s, a^{(t)}=a\right] \\
& +\gamma \sum_{s^{\prime} \in S} P_{s s^{\prime}}^{a}\left(\sum_{a^{\prime} \in A} \pi\left(s^{\prime}, a^{\prime}\right) Q^{\pi}\left(s^{\prime}, a^{\prime}\right)\right),
\end{aligned}
$$ 


Equations 7 and 8 state that the Q-function satisfies the Bellman equation. Together with Equation 10, this is what the program implements as a Q-network.

Corresponds to code in ```lines 77 to 96``` in file ```DDPG.py``` :

```py
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        hidden_dim = 1 if (state_dim + action_dim) == 0 else 2 ** ((state_dim + action_dim) - 1).bit_length()

        self.l1 = nn.Linear(state_dim, hidden_dim)
        self.l2 = nn.Linear(hidden_dim + action_dim, hidden_dim)
        self.l3 = nn.Linear(hidden_dim, 1)

        self.bn1 = nn.BatchNorm1d(hidden_dim)

    def forward(self, state, action):
        q = torch.tanh(self.l1(state.float()))

        q = self.bn1(q)
        q = torch.tanh(self.l2(torch.cat([q, action], 1)))

        q = self.l3(q)

        return q
```

The critic network is the Q-network: it maps a state-action pair to a single Q-value estimate.
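
For intuition, Equation 8 can be applied repeatedly to a tabular Q-function on a tiny Markov decision process. The MDP below (two states, one action, reward 1 everywhere) and the function name are assumptions chosen so that the fixed point is easy to check by hand; the critic above approximates the same quantity with a neural network.

```python
import numpy as np


def bellman_expectation_backup(Q, P, r, pi, gamma):
    """One application of Equation (8) to a tabular Q of shape (S, A).

    P[s, a, s2] is the transition probability, r[s, a] the expected
    immediate reward, and pi[s, a] the policy's action probabilities.
    """
    V = np.sum(pi * Q, axis=1)   # inner sum over a' for each next state
    return r + gamma * (P @ V)   # outer sum over next states


# Hypothetical chain: state 0 -> state 1 -> state 1, reward 1 on every step
P = np.zeros((2, 1, 2))
P[0, 0, 1] = 1.0
P[1, 0, 1] = 1.0
r = np.ones((2, 1))
pi = np.ones((2, 1))

Q = np.zeros((2, 1))
for _ in range(500):
    Q = bellman_expectation_backup(Q, P, r, pi, gamma=0.9)

print(Q)
```

With $\gamma = 0.9$ and a reward of 1 at every step, the return is $1 / (1 - 0.9) = 10$ from either state.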


## Equation 9

$$
\begin{aligned}
Q^{*}\left(s^{(t)}, a^{(t)}\right)= & r^{(t+1)}\left(s^{(t)}=s, a^{(t)}, \pi=\pi^{*}\right) \\
& +\gamma \sum_{s^{\prime} \in S} P_{s s^{\prime}}^{a} \max _{a^{\prime} \in A} Q^{*}\left(s^{\prime}, a^{\prime}\right) .
\end{aligned}
$$
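
Equation 9 differs from Equation 8 only in replacing the policy average with a maximum over next actions. A minimal tabular sketch, on a hypothetical two-state MDP with deterministic transitions, iterates this backup until it settles:

```python
import numpy as np


def bellman_optimality_backup(Q, P, r, gamma):
    """One application of Equation (9) to a tabular Q of shape (S, A)."""
    return r + gamma * (P @ Q.max(axis=1))


# Hypothetical MDP: in state 0, action 0 stays (reward 0) and action 1
# moves to state 1 (reward 1); state 1 is absorbing with reward 2.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, 0, 1] = 1.0
P[1, 1, 1] = 1.0
r = np.array([[0.0, 1.0], [2.0, 2.0]])

Q = np.zeros((2, 2))
for _ in range(100):
    Q = bellman_optimality_backup(Q, P, r, gamma=0.5)

greedy = Q.argmax(axis=1)  # greedy action in each state
print(Q, greedy)
```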


## Equations 10 and 13

$$
Q^{*}\left(s^{(t)}, a^{(t)}\right) \leftarrow (1-\alpha) Q^{*}\left(s^{(t)}, a^{(t)}\right)+\alpha\left(r^{(t+1)}+\gamma \max_{a^{\prime}} Q_{\pi}\left(s^{(t+1)}, a^{\prime}\right)\right), \tag{10}
$$

$$
y=r^{(t+1)}+\gamma \max _{a^{\prime}} Q\left(\theta^{(\text {target })} \mid s^{(t+1)}, a^{\prime}\right), \tag{13}
$$

Corresponds to code in ```lines 134 and 135``` in file ```DDPG.py``` :

```py
target_Q = self.critic_target(next_state, self.actor_target(next_state))
target_Q = reward + (not_done * self.discount * target_Q).detach()
```
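
Note that the quoted lines do not take an explicit maximum over $a^{\prime}$: the action space is continuous, so the target actor supplies the next action instead. For comparison, a tabular sketch of Equations (10) and (13) with a discrete maximum might look like this. The function names and the toy table are assumptions; `not_done` plays the same masking role as in the quoted code.

```python
import numpy as np


def td_target(reward, next_q_values, gamma, not_done=1.0):
    """Equation (13): y = r + gamma * max over a' of Q(s', a')."""
    return reward + not_done * gamma * np.max(next_q_values)


def q_learning_update(Q, s, a, reward, s_next, alpha, gamma):
    """Equation (10) applied in place to a tabular Q of shape (S, A)."""
    y = td_target(reward, Q[s_next], gamma)
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * y
    return Q


Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]  # illustrative values for the next state
Q = q_learning_update(Q, s=0, a=0, reward=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q[0, 0])
```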

## Equation 11

$$
Q\left(s^{(t)}, a^{(t)}\right) \triangleq Q\left(\theta \mid s^{(t)}, a^{(t)}\right)
$$

## Equation 12

$$
\theta^{(t+1)}=\theta^{(t)}-\mu \Delta_{\theta} \ell(\theta)
$$

Corresponds to code in ```lines 108 and 113 ``` in file ```DDPG.py``` :

```py
#108
self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=actor_lr, weight_decay=actor_decay)
#113
self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=critic_lr, weight_decay=critic_decay)
```

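The update is carried out by Adam, the stochastic optimization algorithm used for both networks, so the plain gradient step of Equation 12 is replaced by Adam's adaptive step.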
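
For reference, the plain gradient step of Equation 12, without Adam's moment estimates, can be sketched on a one-dimensional loss. The quadratic loss and the helper `gradient_descent` are hypothetical and serve only to show the update rule.

```python
def gradient_descent(grad, theta, mu, steps):
    """Repeat theta <- theta - mu * grad(theta), as in Equation (12)."""
    for _ in range(steps):
        theta = theta - mu * grad(theta)
    return theta


# Hypothetical loss (theta - 3)**2, whose gradient is 2 * (theta - 3)
theta_star = gradient_descent(lambda th: 2.0 * (th - 3.0), theta=0.0, mu=0.1, steps=200)
print(theta_star)
```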

## Equations 14-16

$$
\ell(\theta)=\left(y-Q\left(\theta^{(\text {train })} \mid s^{(t)}, a^{(t)}\right)\right)^{2}, \tag{14}
$$

$$
\theta_{c}^{(t+1)}= \theta_{c}^{(t)}-\mu_{c} \Delta_{\theta_{c}^{(\text {train })}} \ell\left(\theta_{c}^{(\text {train})}\right), \tag{15}
$$


$$
\ell\left(\theta_{c}^{(\text {train})}\right)= \left(r^{(t)}+\gamma q\left(\theta_{c}^{(\text {target})} \mid s^{(t+1)}, a^{\prime}\right)-q\left(\theta_{c}^{(\text {train})} \mid s^{(t)}, a^{(t)}\right)\right)^{2}, \tag{16}
$$

Corresponds to code in ```lines 105-113, 140, 141``` in file ```DDPG.py``` :

```py
# Initialize actor networks and optimizer
self.actor = Actor(state_dim, action_dim, M, N, K, powert_t_W, max_action=max_action, device=device).to(self.device)
self.actor_target = copy.deepcopy(self.actor)
self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=actor_lr, weight_decay=actor_decay)

# Initialize critic networks and optimizer
self.critic = Critic(state_dim, action_dim).to(self.device)
self.critic_target = copy.deepcopy(self.critic)
self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=critic_lr, weight_decay=critic_decay)

...

# Compute the critic loss
critic_loss = F.mse_loss(current_Q, target_Q)
```
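
The squared error of Equation (16), averaged over a batch as a mean-squared-error loss does by default, can be sketched in NumPy. The function name and the sample batch are assumptions for illustration.

```python
import numpy as np


def critic_loss(reward, q_next_target, q_current, gamma):
    """Mean over the batch of the squared TD error in Equation (16)."""
    y = reward + gamma * q_next_target  # target value, treated as a constant
    return float(np.mean((y - q_current) ** 2))


loss = critic_loss(
    reward=np.array([1.0, 0.0]),
    q_next_target=np.array([2.0, 4.0]),  # target critic evaluated at the next state
    q_current=np.array([1.5, 2.0]),      # training critic evaluated at the current state
    gamma=0.5,
)
print(loss)
```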

## Equation 17

$$
\theta_{a}^{(t+1)}=\theta_{a}^{(t)}-\mu_{a} \Delta_{a} q\left(\theta_{c}^{(\text {target })} \mid s^{(t)}, a\right) \Delta_{\theta_{a}^{(\text {train })}} \pi\left(\theta_{a}^{(\text {train })} \mid s^{(t)}\right), \tag{17}
$$

## Equation 18

$$ 
\theta_{c}^{(\text {target})} \leftarrow \tau_{c} \theta_{c}^{(\text {train})}+\left(1-\tau_{c}\right) \theta_{c}^{(\text {target})},
$$

$$ 
\theta_{a}^{(\text {target })} \leftarrow \tau_{a} \theta_{a}^{(\text {train })}+\left(1-\tau_{a}\right) \theta_{a}^{(\text {target})}, \tag{18}
$$

Corresponds to code in ```lines 156-161``` in file ```DDPG.py``` :

```py
# Soft update the target networks
for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
    target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
    target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
```
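
Equation (18) allows separate rates $\tau_c$ and $\tau_a$, while the quoted code uses a single `self.tau` for both networks. The update itself is a convex combination of the two parameter sets, as in this minimal sketch on plain Python lists (the helper `soft_update` is hypothetical):

```python
def soft_update(train_params, target_params, tau):
    """Return tau * train + (1 - tau) * target for each parameter pair."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(train_params, target_params)]


target = soft_update([1.0, 2.0], [0.0, 0.0], tau=0.1)
print(target)
```

A small `tau` moves the target parameters only slightly toward the trained ones at each step.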