# $\ell_0$-linear layer implementation

Almost verbatim from [1712.01312](https://arxiv.org/abs/1712.01312.pdf)

Efficient gradient-based optimization of the expected $\ell_0$ `norm` of the parameter $\theta$. Let $s$ be a continuous rv with density $q(s\mid \phi)$. The gates $z$ are driven by
$$
    z = \min\bigl\{1, \max\{s, 0\}\bigr\}
    \,, s \sim q(s\big\vert \phi)
    \,, $$
where the weights of a linear layer become $w = \theta \odot z$. This formulation allows the gate to be exactly zero, while the underlying continuous rv $s$ still lets us compute the probability that the gate is active:
$$ \Pr\bigl(z\neq 0\big\vert \phi\bigr)
    = \int_0^{+\infty} q(s\big\vert \phi)ds
    \,. $$

The regularized ERM thus becomes
$$
    \mathbb{E}_{q(s\mid \phi)}
        \hat{\mathbb{E}}_{x,y \sim \mathcal{S}}
            \ell(h(x; \theta \odot g(s)), y)
        + \tfrac\lambda{\lvert \theta \rvert}
            \sum_j \Pr(z_j \neq 0\big\vert \phi_j)
    \,, $$
for $g\colon \mathbb{R} \to [0, 1] \colon x \mapsto \min\{1, \max\{x, 0\}\}$.

Let's use the reparametrization trick on $s$:
$$
    s = f(\phi, \varepsilon)
        \,, \varepsilon \sim p
    \,.$$
The hard-concrete distribution admits this trick: it is a binary concrete rv ([concrete.pdf](http://www.stats.ox.ac.uk/~cmaddis/pubs/concrete.pdf)) passed through a hard-sigmoid.

So consider an rv $s\in (0, 1)$ with pdf $q_s(s\mid \phi)$ and cdf $Q_s(s\mid \phi)$. The rv is parameterized by $\log \alpha$, the location, and $\beta$, the temperature (recall the Gumbel-softmax trick, which seems to be related). Let also $\gamma < 0 < \zeta$ be the parameters which stretch the distribution of $s$ to the interval $(\gamma, \zeta)$.

$$
    u \sim \mathrm{U}(0, 1)
    \,, s = \sigma_{\tfrac1\beta}(\log \tfrac{u}{1-u} + \log \alpha)
    \,, \bar{s} = (\zeta - \gamma) s + \gamma
    \,, z = g(\bar{s})
    \,, $$
where $\sigma_\beta\colon x \mapsto \tfrac1{1+e^{-\beta x}}$, $s$ is a concrete rv, and $g$ is the hard-sigmoid.
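The sampling chain $u \to s \to \bar{s} \to z$ can be sketched in plain Python. This is a minimal illustration, not the layer's implementation: the names `sigmoid` and `sample_hard_concrete`, the default values, and the clipping of $u$ away from $\{0, 1\}$ are assumptions made here for the example.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_hard_concrete(log_alpha, beta=2 / 3, gamma=-0.1, zeta=1.1, rng=random):
    """Draw one gate z = g((zeta - gamma) * s + gamma), with s a concrete rv."""
    # clip u away from {0, 1} so that the logit stays finite (assumption)
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    logit = math.log(u) - math.log1p(-u)
    s = sigmoid((logit + log_alpha) / beta)  # concrete rv in (0, 1)
    s_bar = (zeta - gamma) * s + gamma       # stretched to (gamma, zeta)
    return min(1.0, max(0.0, s_bar))         # hard-sigmoid g


rng = random.Random(0)
draws = [sample_hard_concrete(0.0, rng=rng) for _ in range(5000)]
print("share of exact zeros:", sum(z == 0.0 for z in draws) / len(draws))
print("share of exact ones: ", sum(z == 1.0 for z in draws) / len(draws))
```

Because $\gamma < 0$ and $\zeta > 1$ here, a visible share of the draws lands exactly on $0$ or exactly on $1$, which is the point of the stretching.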

<br>

For $t\in (0, 1)$
$$
    \{\sigma_{\tfrac1\beta}(x) \leq t\}
        = \{
            \tfrac{1-t}{t} \leq e^{-\tfrac1\beta x}
        \}
        = \{
            \log\tfrac{1-t}{t} \leq -\tfrac1\beta x
        \}
        = \{
            x \leq \beta\log\tfrac{t}{1-t}
        \}
    \,. $$

For any $u \in (0, 1)$
$$
    \{u\leq \sigma(x)\}
        = \{ 1+e^{-x} \leq \tfrac1u \}
        = \{ -x \leq \log\tfrac{1-u}u \}
        = \{ \log\tfrac{u}{1-u} \leq x \}
    \,. $$

Thus
$$\begin{align}
    \Pr\bigl(z = 0\big\vert \phi\bigr)
        &= \Pr\bigl(\bar{s} \leq 0 \big\vert \phi\bigr)
        = \Pr\bigl(s \leq \tfrac{-\gamma}{\zeta - \gamma} \big\vert \phi\bigr)
        \\
        &= \Pr\bigl(
            \log \tfrac{u}{1-u} + \log \alpha
                \leq \beta \log \tfrac{\tfrac{-\gamma}{\zeta - \gamma}}{1-\tfrac{-\gamma}{\zeta - \gamma}}
        \bigr)
        \\
        &= \Pr\bigl(
            \log \tfrac{u}{1-u} \leq \beta \log \tfrac{-\gamma}{\zeta} - \log \alpha
        \bigr)
        \\
        &= \Pr\Bigl(
            u \leq \sigma\bigl(\beta \log \tfrac{-\gamma}{\zeta} - \log \alpha\bigr)
        \Bigr)
        \,.
\end{align}$$

For $t \in (0, 1)$:
$$\begin{align}
    \Pr\bigl(z \leq t \big \vert \phi\bigr)
        &= \Pr\bigl( s \leq \tfrac{t - \gamma}{\zeta - \gamma} \big \vert \phi\bigr)
        = \Pr\bigl(
            \log \tfrac{u}{1-u} + \log \alpha
                \leq \beta \log \tfrac{\tfrac{t - \gamma}{\zeta - \gamma}}{1-\tfrac{t - \gamma}{\zeta - \gamma}}
        \bigr)
        \\
        &= \Pr\bigl(
            \log \tfrac{u}{1-u}
                \leq \beta \log \tfrac{t - \gamma}{\zeta - t}
                    - \log \alpha
        \bigr)
        = \sigma\bigl(
            \beta \log \tfrac{t - \gamma}{\zeta - t} - \log \alpha
        \bigr)
        \,.
\end{align}
$$

Hence
$$\begin{align}
    \Pr\bigl(z \neq 0\big\vert \phi\bigr)
        &= 1 - \Pr\Bigl(
            u \leq \sigma\bigl(\beta \log \tfrac{-\gamma}{\zeta} - \log \alpha\bigr)
        \Bigr)
        \\
        &= \Pr\Bigl(
            u \geq \sigma\bigl(\beta \log \tfrac{-\gamma}{\zeta} - \log \alpha\bigr)
        \Bigr)
        \\
        &= \Pr\Bigl(
            1-u \leq 1-\sigma\bigl(\beta \log \tfrac{-\gamma}{\zeta} - \log \alpha\bigr)
        \Bigr)
        \\
        &= \Pr\Bigl(
            1-u \leq \sigma\bigl(\log \alpha - \beta \log \tfrac{-\gamma}{\zeta}\bigr)
        \Bigr)
        \\
        &= \Pr\Bigl(
            u \leq \sigma\bigl(\log \alpha - \beta \log \tfrac{-\gamma}{\zeta}\bigr)
        \Bigr)
        \\
        &= \sigma\bigl(\log \alpha - \beta \log \tfrac{-\gamma}{\zeta}\bigr)
        \,.
\end{align}$$
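A quick Monte Carlo sanity check of $\Pr(z \neq 0 \mid \phi) = \sigma\bigl(\log \alpha - \beta \log \tfrac{-\gamma}{\zeta}\bigr)$. This is a sketch under assumed parameter values; the helper names `prob_nonzero` and `sample_gate` are hypothetical and exist only for this example.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def prob_nonzero(log_alpha, beta, gamma, zeta):
    """Closed-form Pr(z != 0) derived above."""
    return sigmoid(log_alpha - beta * math.log(-gamma / zeta))


def sample_gate(log_alpha, beta, gamma, zeta, rng):
    """One hard-concrete draw via the reparametrization in u ~ U(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    s = sigmoid((math.log(u) - math.log1p(-u) + log_alpha) / beta)
    return min(1.0, max(0.0, (zeta - gamma) * s + gamma))


# assumed illustrative values
log_alpha, beta, gamma, zeta = 0.5, 2 / 3, -0.1, 1.1
rng = random.Random(0)
n = 20000

freq = sum(sample_gate(log_alpha, beta, gamma, zeta, rng) != 0.0
           for _ in range(n)) / n
exact = prob_nonzero(log_alpha, beta, gamma, zeta)
print(f"empirical {freq:.4f} vs closed form {exact:.4f}")
```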

<br>

Let's carefully write the distribution of $z$:
$$\begin{align}
    \Pr\bigl(z \leq t\bigr)
        &= 0
        \,, \text{ for } t < 0
        \,, \\
    \Pr\bigl(z \leq t\bigr)
        &= \sigma\bigl(
            \beta \log \tfrac{t - \gamma}{\zeta - t} - \log \alpha
        \bigr)
        \,, \text{ for } t \in [0, 1)
        \,, \\
    \Pr\bigl(z \leq t\bigr)
        &= 1
        \,, \text{ for } t \geq 1
        \,.
\end{align}$$
This distribution has atoms at $0$ and $1$:
$$\begin{align}
    \Pr\bigl(z = 0\bigr)
        &= \Pr\bigl(z \leq 0\bigr) - \Pr\bigl(z < 0\bigr)
        = \sigma\bigl(
            \beta \log \tfrac{- \gamma}\zeta - \log \alpha
        \bigr) - \lim_{t\uparrow 0} \Pr\bigl(z \leq t\bigr)
        = \sigma\bigl(
            \beta \log \tfrac{- \gamma}\zeta - \log \alpha
        \bigr)
        \,, \\
    \Pr\bigl(z = 1\bigr)
        &= 1 - \lim_{t\uparrow 1} \Pr\bigl(z \leq t\bigr)
        = 1 - \sigma\bigl(
            \beta \log \tfrac{1 - \gamma}{\zeta - 1} - \log \alpha
        \bigr)
        = \sigma\bigl(
            \log \alpha - \beta \log \tfrac{1 - \gamma}{\zeta - 1}
        \bigr)
        \,.
\end{align}$$
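The piecewise cdf and the atom at $1$ can be compared against simulated gates. The sketch below uses assumed parameter values and hypothetical helper names (`gate_cdf`, `sample_gate`); it checks the jump at $t = 1$ and the cdf at one interior point.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gate_cdf(t, log_alpha, beta, gamma, zeta):
    """Pr(z <= t) for the hard-concrete gate, as written above."""
    if t < 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return sigmoid(beta * math.log((t - gamma) / (zeta - t)) - log_alpha)


def sample_gate(log_alpha, beta, gamma, zeta, rng):
    """One hard-concrete draw via the reparametrization in u ~ U(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    s = sigmoid((math.log(u) - math.log1p(-u) + log_alpha) / beta)
    return min(1.0, max(0.0, (zeta - gamma) * s + gamma))


# assumed illustrative values
log_alpha, beta, gamma, zeta = 0.5, 2 / 3, -0.1, 1.1
rng = random.Random(1)
draws = [sample_gate(log_alpha, beta, gamma, zeta, rng) for _ in range(20000)]

# atom at 1: the jump of the cdf at t = 1
atom_one = sigmoid(log_alpha - beta * math.log((1 - gamma) / (zeta - 1)))
freq_one = sum(z == 1.0 for z in draws) / len(draws)

# cdf at an interior point
cdf_half = gate_cdf(0.5, log_alpha, beta, gamma, zeta)
freq_half = sum(z <= 0.5 for z in draws) / len(draws)

print(f"Pr(z = 1):    exact {atom_one:.4f}, empirical {freq_one:.4f}")
print(f"Pr(z <= 0.5): exact {cdf_half:.4f}, empirical {freq_half:.4f}")
```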

<br>

For $t\in (0, 1)$ the pdf of $z$ is
$$
    \tfrac{d}{dt} \Pr(z \leq t)
        = \tfrac{d}{dt} \sigma\bigl(
            \beta \log \tfrac{t - \gamma}{\zeta - t} - \log \alpha
        \bigr)
        = \sigma(x) \sigma(-x) \big\vert_{x = \beta \log \tfrac{t - \gamma}{\zeta - t} - \log \alpha}
%         \Bigl( \beta \tfrac1{t - \gamma} + \beta \tfrac{1}{\zeta - t} \Bigr)
%         = (\ldots)
        \beta \tfrac{\zeta - \gamma}{(t - \gamma)(\zeta - t)}
%         \beta (\log{t - \gamma} - \log {\zeta - t})
%         \tfrac{- e^{-x}}{(1+e^{-x})^2}
    \,. $$
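The closed-form density can be checked against a central finite difference of the cdf on $(0, 1)$. A minimal sketch with assumed parameter values; `gate_cdf` and `gate_pdf` are hypothetical helper names.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# assumed illustrative values
log_alpha, beta, gamma, zeta = 0.5, 2 / 3, -0.1, 1.1


def gate_cdf(t):
    """Pr(z <= t) for t in (0, 1)."""
    return sigmoid(beta * math.log((t - gamma) / (zeta - t)) - log_alpha)


def gate_pdf(t):
    """Density of z on (0, 1): sigma(x) sigma(-x) times dx/dt."""
    x = beta * math.log((t - gamma) / (zeta - t)) - log_alpha
    dx_dt = beta * (zeta - gamma) / ((t - gamma) * (zeta - t))
    return sigmoid(x) * sigmoid(-x) * dx_dt


h = 1e-6
checks = []
for t in (0.1, 0.5, 0.9):
    finite_diff = (gate_cdf(t + h) - gate_cdf(t - h)) / (2 * h)
    checks.append((t, gate_pdf(t), finite_diff))
    print(f"t={t}: pdf {gate_pdf(t):.6f}, finite difference {finite_diff:.6f}")
```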

The expectation of $z$:
$$
\begin{align}
    \mathbb{E}_{\mathrm{hc}} z
        &= \int z d\mathbb{P}
        = \int_{\{0\}} + \int_{(0, 1)} + \int_{\{1\}}
        = 0 \sigma\bigl(
            \beta \log \tfrac{- \gamma}\zeta - \log \alpha
        \bigr)
        + 1 \sigma\bigl(
            \log \alpha - \beta \log \tfrac{1 - \gamma}{\zeta - 1}
        \bigr)
        + \int_{(0, 1)}
            t \tfrac{d}{dt} \Pr(z \leq t)
        dt
        \\
        &= \sigma\bigl(
            \log \alpha - \beta \log \tfrac{1 - \gamma}{\zeta - 1}
        \bigr)
        + \sigma\bigl(
                \beta \log \tfrac{1 - \gamma}{\zeta - 1} - \log \alpha
            \bigr)
        - \int_{(0, 1)}
            \sigma\bigl(
                \beta \log \tfrac{t - \gamma}{\zeta - t} - \log \alpha
            \bigr) dt
        \\
        &= 1 - \int_{(0, 1)}
            \sigma\bigl(
                \beta \log \tfrac{t - \gamma}{\zeta - t} - \log \alpha
            \bigr) dt
%         = \int_0^1 \Pr(z \geq t) dt
    \,.
\end{align}$$
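The integration-by-parts step can be verified numerically: the direct form (atom at $1$ plus $\int_0^1 t\, \tfrac{d}{dt}\Pr(z \leq t)\, dt$) should agree with $1 - \int_0^1 \Pr(z \leq t)\, dt$. The sketch below uses a plain midpoint rule and assumed parameter values; all helper names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# assumed illustrative values
log_alpha, beta, gamma, zeta = 0.5, 2 / 3, -0.1, 1.1


def gate_cdf(t):
    """Pr(z <= t) for t in (0, 1)."""
    return sigmoid(beta * math.log((t - gamma) / (zeta - t)) - log_alpha)


def gate_pdf(t):
    """Density of z on (0, 1)."""
    x = beta * math.log((t - gamma) / (zeta - t)) - log_alpha
    return (sigmoid(x) * sigmoid(-x)
            * beta * (zeta - gamma) / ((t - gamma) * (zeta - t)))


def midpoint(f, a, b, n=4000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


# the atom at 0 contributes nothing, the atom at 1 contributes its mass
atom_one = sigmoid(log_alpha - beta * math.log((1 - gamma) / (zeta - 1)))
direct = atom_one + midpoint(lambda t: t * gate_pdf(t), 0.0, 1.0)
by_parts = 1.0 - midpoint(gate_cdf, 0.0, 1.0)

print(f"direct {direct:.6f}, by parts {by_parts:.6f}")
```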

<br>

Consider a linear transformation $y = Wx + b$ with $W = \theta \odot z$, for $x\in \mathbb{R}^m$ and $y\in \mathbb{R}^n$. Then
$$\begin{align}
    y_i &= b_i + \sum_j \theta_{ij} z_{ij} x_j
        = b_i + \sum_j z_{ij} e_i^\top \theta e_j x_j
        \\
        &= b_i + e_i^\top \theta \mathop{diag}(z_i) x
    \,.
\end{align}
$$
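A tiny worked instance of the row-wise view: row $i$ of $W = \theta \odot z$ is row $i$ of $\theta$ with its entries scaled by the gates $z_i$, i.e. $e_i^\top \theta \mathop{diag}(z_i)$. The numbers and the helpers `matvec` and `diag` are made up for illustration.

```python
# n x m = 2 x 3 toy example with made-up values
theta = [[1.0, -2.0, 0.5],
         [0.0, 3.0, -1.0]]
z = [[1.0, 0.0, 0.5],
     [0.25, 1.0, 0.0]]
b = [0.1, -0.2]
x = [2.0, -1.0, 4.0]


def matvec(A, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]


def diag(v):
    """Square matrix with v on the diagonal."""
    return [[v[i] if i == j else 0.0 for j in range(len(v))]
            for i in range(len(v))]


# masked weights W = theta * z (elementwise), then y = W x + b
W = [[t * g for t, g in zip(t_row, z_row)] for t_row, z_row in zip(theta, z)]
y = [yi + bi for yi, bi in zip(matvec(W, x), b)]

# row-wise form: y_i = b_i + theta_i . (diag(z_i) x)
y_rows = [b[i] + sum(t * v for t, v in zip(theta[i], matvec(diag(z[i]), x)))
          for i in range(len(theta))]

print(y, y_rows)
```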
Local reparametrization trick? The ICLR 2018 paper uses a single mask sample (!) per minibatch; otherwise we would need `batch x n x m` samples, which is a lot.

$$\begin{align}
    \partial y
        &= \partial W x + W \partial x
        = \bigl( \partial \theta \odot z + \theta \odot \partial z \bigr) x + W \partial x
        \,.
\end{align}$$

See the last paragraph before section 3 of their paper! There they say that they use group dropout.

Linear layer $\mathbb{R}^m \to \mathbb{R}^n$
$$
    y = W x + b
    \,, W_{ij} = \theta_{ij} z_{ij}
\,, $$

where $z \in [0, 1]^{n\times m}$ is a learnable variational dropout mask

$$
    u_{ij} \sim \mathrm{U}(0, 1)
    \,, s_{ij} = \sigma\bigl(\tfrac1\beta (\log \tfrac{u_{ij}}{1-u_{ij}} + \log \alpha_{ij})\bigr)
    \,, \\
    z_{ij} = \min\bigl\{1, \max\{0, (\zeta - \gamma) s_{ij} + \gamma\}\bigr\}
    \,, \sigma\colon x \mapsto \tfrac1{1+e^{-x}}
    \,.
$$

Hyperparameters: $\gamma=-0.1$, $\zeta=+1.1$, $\beta = 0.66$.

Objective
$$
    \tfrac1{2B} \sum_b \|h(x_b) - y_b \|_2^2
        + \lambda \tfrac1{\# \text{par}}
            \sum_{ij} \Pr(z_{ij} \neq 0)
    \,. $$
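The objective can be sketched in plain Python as a squared-error fit term plus the averaged probability of active gates. This sketch assumes the paper's $\log \alpha$ parametrization (not the negated one used by some implementations); `prob_nonzero`, `l0_objective`, and the toy data are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def prob_nonzero(log_alpha, beta=2 / 3, gamma=-0.1, zeta=1.1):
    """Pr(z != 0) for a single gate."""
    return sigmoid(log_alpha - beta * math.log(-gamma / zeta))


def l0_objective(preds, targets, log_alphas, lam):
    """(1 / 2B) sum_b ||pred_b - y_b||^2 + lam * mean_ij Pr(z_ij != 0)."""
    batch = len(preds)
    fit = sum((p - t) ** 2
              for pred, target in zip(preds, targets)
              for p, t in zip(pred, target)) / (2 * batch)
    penalty = sum(prob_nonzero(a) for a in log_alphas) / len(log_alphas)
    return fit + lam * penalty


# toy batch of two examples; log_alphas is the flattened list of all gates
preds = [[1.0, 2.0], [0.0, 0.0]]
targets = [[1.0, 0.0], [0.0, 0.0]]
log_alphas = [10.0, -10.0, 10.0, -10.0]

value = l0_objective(preds, targets, log_alphas, lam=0.1)
print(value)
```

With half of the gates almost surely open and half almost surely closed, the penalty term is close to $\lambda / 2$.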

<br>

In [None]:
import torch
import torch.nn.functional as F

import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

from cplxmodule.relevance import named_penalties, penalties

from torch.nn import Linear
from cplxmodule.relevance import LinearARD
from cplxmodule.relevance import LinearL0ARD, LinearLASSO

In [None]:
shape = 250, 50

In [None]:
# take the first `shape[1]` inputs as targets; flip `True` to use a random subset
index = np.arange(shape[0]) if True else np.random.permutation(shape[0])
index = index[:shape[1]]

In [None]:
model_ard = LinearL0ARD(*shape, bias=False, reduction="mean", group=None)  # longer to learn on small data, noisy grads indeed
# model_ard = LinearARD(*shape, bias=False)
# model_ard = Linear(*shape, bias=False)

# model_ard = LinearLASSO(*shape, bias=False)

In [None]:
from cplxmodule.masked import LinearMasked

model_masked = LinearMasked(*shape, model_ard.bias is not None)

<br>

Train on a simple identity mapping

In [None]:
model = model_masked

In [None]:
model = model_ard

In [None]:
import tqdm

optim = torch.optim.Adam(model.parameters())
losses = []

model.train()
x = torch.randn(100, shape[0])
y = -x[:, index]

with tqdm.tqdm(range(8000)) as bar:
    for epoch in bar:
        optim.zero_grad()

        loss = F.mse_loss(model(x), y)
        loss += sum(penalties(model))

        loss.backward()

        optim.step()
        losses.append(float(loss))
    # end for
# end with

plt.semilogy(losses)

In [None]:
sum(penalties(model))

In [None]:
from cplxmodule.relevance import sparsity

import math

tau = 0.5

sparsity(model, threshold=tau)

In [None]:
from dpd.tools.utils import sparsity_details

In [None]:
[*sparsity_details(model, None)]

In [None]:
model.eval()

test_loss = []
for _ in tqdm.tqdm(range(1000)):
    x = torch.randn(100, shape[0])
    y = -x[:, index]

    test_loss.append(float(F.mse_loss(model(x), y)))

plt.hist(test_loss, bins=51);

In [None]:
from cplxmodule.masked import is_sparse

fig, ax = plt.subplots(3, 1, figsize=(12, 6))

with torch.no_grad():
    if "LinearL0ARD" in globals() and isinstance(model, LinearL0ARD):
        z = model.gate(None).detach()
        ax[0].imshow(z.numpy(), cmap=plt.cm.bone)
        ax[0].set_title(r"Gate $z_{ij}$")

    elif is_sparse(model):
        z = model.mask
        ax[0].imshow(z.numpy(), cmap=plt.cm.bone)
        ax[0].set_title(r"Gate $z_{ij}$")
    
    ax[1].imshow(abs(getattr(model, "weight_masked", model.weight)).numpy(), cmap=plt.cm.binary_r)
    ax[1].set_title(r"Absolute learnt weight $| \theta_{ij} |$")

    if isinstance(model, LinearMasked):
        relevance = 1 - model.mask
    else:
        relevance = model.log_alpha
    ax[2].imshow(relevance.detach().numpy(), cmap=plt.cm.bone_r)
    ax[2].set_title(r"Relevance $\log \alpha_{ij}$ / mask")
    

plt.tight_layout()
plt.show()

In [None]:
model.weight[np.r_[:shape[1]], index]

<br>

In [None]:
from cplxmodule.masked import compute_ard_masks, deploy_masks

masks = compute_ard_masks(model_ard, threshold=0)

model_masked = deploy_masks(model_masked, state_dict=masks)

In [None]:
[*model_masked.named_buffers()]

In [None]:
model_masked.weight_masked

In [None]:
# [*model.named_parameters()]

In [None]:
model_masked.state_dict()

In [None]:
assert False

In [None]:
from cplxmodule.relevance.base import named_relevance, named_sparsity

In [None]:
from dpd.models import ModelHUAWEI
from dpd.models.huawei import ModelHUAWEIBoosted

In [None]:
recipe = ModelHUAWEI.default_recipe(32, 4)

In [None]:
# model = ModelHUAWEI(*recipe, n_columns=1, ard=True)
model = ModelHUAWEIBoosted(*recipe, n_columns=1, linear=LinearARD)

In [None]:
self = model["boost00"][0]["dense00"]

In [None]:
[*sparsity_details(model, 4.0)]

<br>

In [None]:
import math
import torch
import torch.nn.functional as F
from torch.nn import Parameter

%matplotlib inline
import matplotlib.pyplot as plt

from cplxmodule.relevance import named_penalties, penalties
from cplxmodule.relevance.base import BaseARD

In [None]:
class LinearL0ARD(torch.nn.Linear, BaseARD):
    """L0 regularized linear layer according to [1]_.
    
    Details
    -------
    IMPORTANT: This implementation uses the -ve log-alpha parametrization
    in order to keep the layer's parameter interpretation aligned
    with the interpretation in the variational dropout layer of Kingma
    et al. (relevance.LinearARD).
    
    References
    ----------
    [1]_ :: https://arxiv.org/abs/1712.01312.pdf
    [2]_ :: https://arxiv.org/abs/1902.09574.pdf
    """
    beta, gamma, zeta = .25, -0.1, 1.1
    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias=bias)

        self.log_alpha = Parameter(
            torch.Tensor(*self.weight.shape))
        self.reset_variational_parameters()

    def reset_variational_parameters(self):
        # assume everything is important (but do not saturate the sigmoid too much)
        self.log_alpha.data.uniform_(-0.45, -0.45)

    def forward(self, input):
        if not self.training:
            # suppose u = 0.5, then the logit input is zero
            z = self.sample_hc(0.0, beta=self.beta)  # 1.0 in paper eq(13), self.beta in code
            return F.linear(input, self.weight * z, self.bias)

        # a single mask sample for the whole batch!
        u = torch.rand_like(self.log_alpha)
        z = self.sample_hc(torch.log(u) - torch.log(1 - u), beta=self.beta)
        return F.linear(input, self.weight * z, self.bias)

    def sample_hc(self, logit=0.0, beta=1.0):
        s = torch.sigmoid((logit - self.log_alpha) / beta)
        return torch.clamp((self.zeta - self.gamma) * s + self.gamma, 0, 1)
    
    @property
    def penalty(self):
        shift = self.beta * math.log(- self.gamma / self.zeta)
        return 1 - torch.sigmoid(self.log_alpha + shift).mean()

    def get_sparsity_mask(self, threshold):
        r"""Get the dropout mask based on the log-relevance."""
        with torch.no_grad():
            return torch.ge(self.log_alpha, threshold)

In [None]:
class LinearL0_arxiv171201312(torch.nn.Linear, BaseARD):
    """L0 regularized linear layer according to [1]_."""
    beta, gamma, zeta = .5, -0.1, 1.1

    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias=bias)

        self.log_alpha = Parameter(torch.Tensor(*self.weight.shape))
        self.reset_variational_parameters()

    def reset_variational_parameters(self):
        # 0.6 = a / (a+1) = n / (n + m), for a = n/m, n=2, m=3. put n, m = 9, 11
        self.log_alpha.data.normal_(0.45, std=0.01)

    def forward(self, input):
        if not self.training:
            s = torch.sigmoid(self.log_alpha)
            z = torch.clamp((self.zeta - self.gamma) * s + self.gamma, 0, 1)
            return F.linear(input, self.weight * z, self.bias)

        # a single mask sample for the whole batch!
        u = torch.rand_like(self.log_alpha)
        logit = torch.log(u) - torch.log(1 - u) + self.log_alpha
        s = torch.sigmoid(logit / self.beta)
        z = torch.clamp((self.zeta - self.gamma) * s + self.gamma, 0, 1)
        return F.linear(input, self.weight * z, self.bias)

    @property
    def penalty(self):
        shift = self.beta * math.log(- self.gamma / self.zeta)
        return torch.sigmoid(self.log_alpha - shift).mean()

<br>

In [None]:
x = torch.randn(1000, 50) * 10
x1 = F.hardtanh(x, 0., 1.)
x2 = torch.clamp(x, 0., 1.)

assert torch.allclose(x2, x1)
assert torch.allclose(x1, x2)

<br>

In [None]:
from cplxmodule.relevance.base import BaseARD
from cplxmodule.relevance.utils import torch_sparse_linear, torch_sparse_tensor
from cplxmodule.relevance.utils import parameter_to_buffer, buffer_to_parameter

from torch.nn import Parameter

class Dummy(torch.nn.Linear):
    @property
    def is_sparse(self):
        mode = getattr(self, "sparsity_mode_", None)
        return mode is not None

    def forward(self, input):
        if self.is_sparse:
            return self.forward_sparse(input)
        return super().forward(input)

    def forward_sparse(self, input):
        if self.sparsity_mode_ == "dense":
            weight = self.weight_ * self.nonzero_
            return F.linear(input, weight, self.bias)

        else:
            weight = torch_sparse_tensor(self.nonzero_, self.weight_,
                                         self.weight.shape)
            return torch_sparse_linear(input, weight, self.bias)

    def sparsify(self, threshold=1.0, mode="dense"):
        if mode is not None and mode not in ("dense", "sparse"):
            raise ValueError(f"""`mode` must be either 'dense', 'sparse' or """
                             f"""`None` (got '{mode}').""")

        if mode is not None:
            with torch.no_grad():
                mask = torch.gt(abs(self.weight), threshold)

            if mode == "sparse":
                # truly sparse mode
                weight = self.weight.data[mask].clone()
                self.register_buffer("nonzero_", mask.nonzero().t())

            elif mode == "dense":
                # simulated sparse mode
                mask = mask.to(self.weight).data
                weight = self.weight.data * mask
                self.register_buffer("nonzero_", mask)

            # make weight into a buffer (load_state dict doesn't care
            #  about param/buffer distinction!)
            self.register_parameter("weight_", torch.nn.Parameter(weight))
            parameter_to_buffer(self, "weight")

        elif self.is_sparse:
            # reinstate the weight as the parameter and delete runtime stuff
            del self.nonzero_, self.weight_
            buffer_to_parameter(self, "weight")

        # end if

        self.sparsity_mode_ = mode

        return self

In [None]:
from cplxmodule.layers import CplxParameter, CplxLinear
from cplxmodule import Cplx

In [None]:
def parameter_to_buffer(module, name):
    # par could be a solo parameter or a container (essentially a submodule)
    par = getattr(module, name)
    if isinstance(par, (torch.nn.ParameterDict, torch.nn.ParameterList)):
        # parameter containers do not use buffers and aren't expected to.
        #  So we hide parameters there. This precludes access via __getitem__
        #  though, but not via __getattr__.

        # create a copy of the container's master parameter dict's keys and mutate
        for name in list(par._parameters):
            # By design of Parameter containers this never recurses deeper
            parameter_to_buffer(par, name)
        return

    # a solo parameter
    if par is not None and not isinstance(par, torch.nn.Parameter):
        raise KeyError(f"parameter '{name}' is not a tensor.")

    # remove the parameter and mutate into a grad-detached buffer
    delattr(module, name)
    par = par.detach() if par is not None else None
    module.register_buffer(name, par)

def buffer_to_parameter(module, name):
    # a buffer here can be a buffer or a former mutated parameter container
    buf = getattr(module, name)
    if isinstance(buf, (torch.nn.ParameterDict, torch.nn.ParameterList)):
        # create a copy of the container's master buffer dict's keys and restore
        for name in list(buf._buffers):
            # By design of Parameter containers this never goes deeper
            #  than this call
            buffer_to_parameter(buf, name)
        return

    if buf is not None and not isinstance(buf, torch.Tensor):
        raise KeyError(f"buffer '{name}' is not a tensor.")

    # remove the buffer and mutate back into a proper parameter
    delattr(module, name)
    buf = torch.nn.Parameter(buf) if buf is not None else None
    module.register_parameter(name, buf)

In [None]:
mod = torch.nn.Module()
mod.par = torch.nn.Parameter(torch.randn(10))
mod.par_d = torch.nn.ParameterDict({
    "par_1": torch.nn.Parameter(torch.randn(10)),
    "par_2": torch.nn.Parameter(torch.randn(10)),
})
mod.par_l = torch.nn.ParameterList([
    torch.nn.Parameter(torch.randn(10)),
    torch.nn.Parameter(torch.randn(10)),
])

In [None]:
lin = CplxLinear(10, 10)

lin.test = torch.nn.ParameterList([
    torch.nn.Parameter(lin.weight.real.clone()),
    None,
    torch.nn.Parameter(lin.weight.real.clone())
])

lin.test_2 = torch.nn.ParameterDict({
    "a": torch.nn.Parameter(torch.randn(10)),
    "z": torch.nn.Parameter(torch.randn(10)),
    "_12omega": None,
})

In [None]:
before = Cplx(**lin.weight)

parameter_to_buffer(lin, "weight")
print([n for n, b in lin.named_buffers()], list(lin.weight._buffers))

buffer_to_parameter(lin, "weight")
print([n for n, b in lin.named_buffers()], list(lin.weight._buffers))

assert np.allclose(Cplx(**lin.weight).detach().numpy(), before.detach().numpy())

In [None]:
copy = [*lin.test]

parameter_to_buffer(lin, "test")
print([n for n, b in lin.named_buffers()], list(lin.test._buffers))

buffer_to_parameter(lin, "test")
print([n for n, b in lin.named_buffers()], list(lin.test._buffers))

assert all(a is None and b is None or torch.allclose(a, b)
           for a, b in zip(lin.test, copy))

In [None]:
copy = {**lin.test_2}

parameter_to_buffer(lin, "test_2")
print([n for n, b in lin.named_buffers()], list(lin.test_2._buffers))

buffer_to_parameter(lin, "test_2")
print([n for n, b in lin.named_buffers()], list(lin.test_2._buffers))

assert all(a is None and b is None or torch.allclose(a, b)
           for a, b in zip(lin.test_2.values(), copy.values()))

<br>

CDF

In [None]:
import math
from scipy.special import expit

beta, gamma, zeta = 0.95, -0.5, 1.5

In [None]:
mesh = np.meshgrid(np.linspace(-.25, 1.25, num=501),
                   np.linspace(-12, 12, num=101))

t, log_a = mesh

tcl = np.clip(t, 0, 1)
cdf = expit(beta * (np.log(tcl - gamma) - np.log(zeta - tcl)) - log_a)
cdf[t >= 1.] = 1.
cdf[t < 0] = 0.

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111, projection='3d')

ax.plot_surface(t, log_a, cdf)

ax.view_init(35, 120)
plt.show()

Slices of CDF at large $\log \alpha$

In [None]:
log_a[-50:-35, 0]

In [None]:
t[0][-90:]

In [None]:
# plt.plot(t[0], np.diff(cdf, axis=1, prepend=0).T)
plt.plot(t[0], cdf[-50:-35].T)
plt.show()

In [None]:
# plt.plot(t[0], np.diff(cdf, axis=1, prepend=0).T)
plt.plot(log_a.T[0], cdf.T[-90:-84].T)
plt.show()

$p \mapsto \log \alpha$ such that
$$
    p = \Pr(z=1)
        = \sigma\bigl( \log \alpha - \beta \log\tfrac{1-\gamma}{\zeta- 1}\bigr)
    \,.
$$

In [None]:
from scipy.special import logit

eps = 1
beta, gamma, zeta = 0.5, -eps, 1. + eps

p = np.linspace(0.0, 1.0, num=101)[1:-1]
plt.plot(p, logit(p) + beta * (math.log(1 - gamma) - math.log(zeta - 1)))

In [None]:
0.5, -0.15, 1.15

In [None]:
beta, zeta, gamma = .1, 1.25, -0.25

la = np.linspace(-8, 8, num=1001)

In [None]:
u = np.random.rand(1001, 10000)
z = np.log(u) - np.log(1-u)

In [None]:
mc = expit((z-la[:, np.newaxis])/beta).mean(axis=-1)
pp = expit(-la)

In [None]:
plt.plot(mc-pp)