Maybe just a bunch of ANDs in every layer, then ANDs of ANDs? Issue: sparsity goes down.

Just replace ANDs with at-least-two-of-a-small-subset functions? Say each input node is on with prob
$p$; consider an output node with $k$ inputs; the prob this has exactly two inputs on is
$p^2(1-p)^{k-2}\binom{k}{2}$. The prob it has at least two inputs on is $1-(1-p)^k-kp(1-p)^{k-1}$.
The difference between these two is $O\left(\frac{(pk)^3}{1-pk}\right)$, from summing a geometric
series. Setting the first one to $p$ gives $p^2(1-p)^{k-2}\binom{k}{2}=p$, which implies
$p(1-p)^{k-2}=\frac{2}{k(k-1)}$. This should work out at about $p=\frac{2}{(k-1)^2}$, up to
lower-order corrections, which also makes the diff small. Equivalently, $k=\sqrt{\frac{2}{p}}+1$.
We might want to correct the ideal network such that it is more precisely binary though? I.e., we
might want to do it without ReLUs? I guess we can try both options, but let's first try the one
without ReLUs that just computes these gates with perfect binary outputs.
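
As a quick sanity check of the approximation above, here is a minimal sketch that compares the exactly-two and at-least-two probabilities at $p=\frac{2}{(k-1)^2}$ for a few fan-ins. The function names are illustrative and not part of the notebook.

```python
import math


def prob_exactly_two(p: float, k: int) -> float:
    """Probability that exactly two of k independent Bernoulli(p) inputs are on."""
    return math.comb(k, 2) * p**2 * (1 - p) ** (k - 2)


def prob_at_least_two(p: float, k: int) -> float:
    """Probability that at least two of k independent Bernoulli(p) inputs are on."""
    return 1 - (1 - p) ** k - k * p * (1 - p) ** (k - 1)


def approx_p_for_fan_in(k: int) -> float:
    """Leading-order guess p = 2 / (k - 1)^2 from the text."""
    return 2 / (k - 1) ** 2


for k in (10, 30, 100):
    p = approx_p_for_fan_in(k)
    # If the approximation is good, both ratios should be close to 1.
    print(
        f"k={k:4d}  p={p:.6f}  "
        f"exactly-two/p={prob_exactly_two(p, k) / p:.4f}  "
        f"at-least-two/p={prob_at_least_two(p, k) / p:.4f}"
    )
```

The ratios should approach 1 as $k$ grows, since the correction terms are of relative order $pk$.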


Here's an alternative calculation in a slightly different setup (though probably they are the same
up to error terms in some reasonable sense). Let's say each entry of the weight matrix is Bernoulli
with probability $q$ and each input is Bernoulli with probability $p$. We want it to be the case
that, taking a random matrix and a random input, the probability that an output is on is $p$. The
probability it is on is the probability that there are at least two simultaneous hits from that row
of the weight matrix and the input. Each hit has probability $pq$, so this has probability
$1-(1-pq)^m-mpq(1-pq)^{m-1}$. So to keep sparsity constant, we want
$1-(1-pq)^m-mpq(1-pq)^{m-1}=p$. Up to something like an $O((mpq)^3/(1-mpq))$ term as before, we can
just solve $p=(pq)^2 m(m-1)/2$. This gives $q=\sqrt{\frac{2}{m(m-1)p}}$. A more precise solution can
be found using numerical methods (after all, fixing $m$ and $p$, it's just a matter of finding a
root of a polynomial in $q$), I think. But this should be fine for us for now.
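
The numerical route mentioned above could look like the following sketch, which uses plain bisection on $q\in[0,1]$ (the on-probability is increasing in $q$). The names `prob_output_on` and `solve_q` are hypothetical helpers, and bisection is just one reasonable choice of root finder.

```python
import math


def prob_output_on(q: float, m: int, p: float) -> float:
    """Probability of at least two hits among m positions, each hit having probability p*q."""
    h = p * q
    return 1 - (1 - h) ** m - m * h * (1 - h) ** (m - 1)


def solve_q(m: int, p: float, tol: float = 1e-12) -> float:
    """Find q in [0, 1] with prob_output_on(q, m, p) == p by bisection."""
    lo, hi = 0.0, 1.0
    if prob_output_on(hi, m, p) < p:
        raise ValueError("no q in [0, 1] keeps the on-probability at p")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_output_on(mid, m, p) < p:
            lo = mid  # output too sparse: need denser weights
        else:
            hi = mid
    return (lo + hi) / 2


m_example = 1000
p_example = 1 / math.sqrt(m_example)
q_closed_form = math.sqrt(2 / (m_example * (m_example - 1) * p_example))
q_numeric = solve_q(m_example, p_example)
print(f"closed form q = {q_closed_form:.6f}, numeric q = {q_numeric:.6f}")
```

Since the probability of at least two hits is at most $\binom{m}{2}(pq)^2$, the numeric $q$ should never be smaller than the closed-form one. For $m=1000$ and $p=1/\sqrt{m}$ the expected number of hits $mpq$ is roughly $0.25$, so the gap is not entirely negligible.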

Jake: roughly we want $p=(qp)^2 m^2\implies q=\frac{1}{m\sqrt p}$. If we decide to pick $p=q$, we have
$p^{3/2}=\frac{1}{m}$, i.e. $p=m^{-2/3}$.
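
A quick numerical check of this heuristic, taking it in the form $q=\frac{1}{m\sqrt p}$ and setting $p=q$ (so $p^{3/2}=\frac{1}{m}$). This is only a sketch; the helper names are made up for illustration.

```python
import math


def heuristic_q(m: int, p: float) -> float:
    """Rough weight density q = 1 / (m * sqrt(p)), ignoring constant factors."""
    return 1 / (m * math.sqrt(p))


def heuristic_p_when_equal(m: int) -> float:
    """Solve p = 1 / (m * sqrt(p)) for p, i.e. p = m^(-2/3)."""
    return m ** (-2 / 3)


m_check = 1000
p_check = heuristic_p_when_equal(m_check)
q_check = heuristic_q(m_check, p_check)

# With p = q, the rough balance p = (q * p * m)^2 should hold.
print(f"p = {p_check:.6f}, q = {q_check:.6f}, (qpm)^2 = {(q_check * p_check * m_check) ** 2:.6f}")
```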


In [None]:
import math

m = 1000  # dim of ideal sparse net
p = 1 / math.sqrt(m)  # prob each input is on; should also be the prob each gate later on is on
q = math.sqrt(2 / (m * (m - 1) * p))  # prob each weight matrix entry is 1

# k = math.sqrt(2/p)+1 # fan-in

import torch
import torch.nn as nn
import torch.nn.functional as F


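# Custom activation function: clamps its argument to [0, 1].
# With 0/1 inputs, 0/1 weights and bias -1, the pre-activation is (number of hits - 1),
# so the output is 1 iff at least two hits occur, and 0 otherwise.
def custom_activation(x):
    return F.relu(x) - F.relu(x - 1)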


# Custom layer with specified properties
class CustomLayer(nn.Module):
    def __init__(self, input_dim, output_dim, probability_q):
        super(CustomLayer, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.probability_q = probability_q

        # Initialize weights and biases
        self.weights = nn.Parameter(torch.Tensor(output_dim, input_dim))
        self.bias = nn.Parameter(torch.Tensor(output_dim))
        self.reset_parameters()

    def reset_parameters(self):
        self.weights.data = torch.bernoulli(
            torch.full((self.output_dim, self.input_dim), self.probability_q)
        )
        self.bias.data.fill_(-1)

    def forward(self, x):
        return custom_activation(F.linear(x, self.weights, self.bias))


class SummationLayer(nn.Module):
    def __init__(self, input_dim):
        super(SummationLayer, self).__init__()
        self.input_dim = input_dim
        # Fixed (non-trainable) coefficients, each 1 or -1 with 50/50 probability.
        # Registered as a buffer so they move with the module across devices.
        coefficients = torch.where(
            torch.rand(input_dim) > 0.5, torch.tensor(1.0), torch.tensor(-1.0)
        )
        self.register_buffer("coefficients", coefficients)

    def forward(self, x):
        # Sum over the feature (last) dimension; this covers both a single
        # 1D sample and a 2D batch of samples.
        return torch.sum(x * self.coefficients, dim=-1, keepdim=True)


# Neural Network with L Custom Layers and a Summation Layer at the end
class CustomNetwork(nn.Module):
    def __init__(self, layer_dims, probability_q):
        super(CustomNetwork, self).__init__()
        self.layers = nn.ModuleList()
        for i in range(1, len(layer_dims)):
            self.layers.append(CustomLayer(layer_dims[i - 1], layer_dims[i], probability_q))

        # Add the summation layer at the end, treated as just another layer
        self.summation_layer = SummationLayer(layer_dims[-1])

    def forward(self, x):
        # List to store activations from each layer, including the input and output layers
        activations = [x]
        for layer in self.layers:
            x = layer(x)
            activations.append(x)  # Store the activation of each layer
        # Apply the summation layer and treat its output as the activation of the final layer
        x = self.summation_layer(x)
        activations.append(x)  # Include the final output as the last "activation"
        return activations  # Now, this returns a list of activations for all layers, including the final layer


# Example usage
L = 10  # num of layers, including input but not the 1-neuron output
layer_dims = [m] * L  # Dimension of each layer including input and output dimension
probability_q = q  # Probability of presence of each entry in weight matrix

# Initialize the network
ideal_network = CustomNetwork(layer_dims, probability_q)

# Assuming x is your input tensor, with binary entries that are each on with probability p
# x = torch.bernoulli(torch.full((batch_size, layer_dims[0]), p))  # Example input; adjust the size accordingly
# activations = ideal_network(x)
# `activations` is a list of activations from each layer, including the final layer.

In [None]:
n = 100  # dim into which we'll try to compress the ideal net

# creating the embedding from U_1 to V_1

import torch


def create_matrix_with_unit_norm_columns(n, m):
    """
    Create an n x m matrix E where each column has a unit norm.

    Args:
        n (int): Number of rows in E, corresponding to the dimension of V_1.
        m (int): Number of columns in E, corresponding to the dimension of U_1.

    Returns:
        torch.Tensor: The matrix E with each column normalized to have a unit norm.
    """
    # Step 1: Generate an n x m matrix with Gaussian entries
    E = torch.randn(n, m)

    # Step 2: Normalize each column to have a unit norm
    norms = torch.norm(E, dim=0, keepdim=True)
    E_normalized = E / norms

    return E_normalized


# Example dimensions
# n = 100  # Dimension for the small network input (V_1)
# m = 200  # Dimension for the big network input (U_1)

# Generate the matrix E
E = create_matrix_with_unit_norm_columns(n, m)

print(f"Shape of E: {E.shape}")