concept of deepsets:
imagine a dataset of students who are eligible for extra credit if their average attendance is at least some x%. it doesn't matter which class each attendance figure comes from or in what order the figures are listed; the only requirement is an overall attendance of x%.
    - unordered sets: each student's record is a set of attendance percentages from different classes, and the order in which they appear doesn't matter
    - goal: the model simply needs to answer whether a given student qualifies or not
it could be conceptualized as follows:
    - consider each attendance percentage separately
    - combine the percentages in a way that reflects overall attendance
    - compare the aggregated result against x to determine eligibility
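the three steps above can be sketched in plain python. this is only an illustration: the function name `qualifies`, the threshold of 75, and the sample percentages are made up for the example, and the mean is just one possible order-independent way to combine the percentages.

```python
from statistics import mean

def qualifies(attendance, threshold):
    """Return True if the average of the attendance percentages reaches the threshold."""
    # step 1: take each percentage on its own (here it is only converted to a float)
    values = [float(p) for p in attendance]
    # step 2: combine them with an order-independent aggregate (the mean)
    overall = mean(values)
    # step 3: compare the aggregate against the threshold x
    return overall >= threshold

record = [80.0, 60.0, 90.0]      # hypothetical percentages from three classes
reordered = [90.0, 80.0, 60.0]   # the same record listed in a different order
print(qualifies(record, 75.0), qualifies(reordered, 75.0))
```

because the mean does not depend on the order of its inputs, both calls give the same answer.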


imagine there are some pictures and the goal is to decide whether a picture contains n distinct types of animals. it doesn't matter whether there is only 1 animal of a type or some arbitrary number of each type; neither the order nor the count matters, only that n distinct types are present.
    - unordered sets: each animal detected in the image is an element of the set, and neither the order nor the count of the animals matters
    - goal: the model should detect whether there are n distinct types of animals, regardless of how many animals of each type are present
it could be conceptualized as follows:
    - identify each animal in the image
    - process each detection to determine the type of animal
    - aggregate the results by counting the number of distinct animal types
    - check whether the count of distinct animal types is equal to the given n
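assuming the detection and classification steps have already produced one label per detected animal, the aggregation and the final check could look like this. the function name `has_n_distinct_animals` and the label lists are hypothetical; a real detector would supply the labels.

```python
def has_n_distinct_animals(detections, n):
    """Return True if the detections contain exactly n distinct animal types."""
    # a Python set drops both the order and the duplicates of the detected labels
    distinct_types = set(detections)
    return len(distinct_types) == n

detections = ["cat", "dog", "cat", "bird", "dog", "dog"]   # hypothetical detector output
shuffled = ["dog", "bird", "dog", "cat", "dog", "cat"]     # same detections, different order
print(has_n_distinct_animals(detections, 3), has_n_distinct_animals(shuffled, 3))
```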


real-world scenarios:
    - in robotics or autonomous vehicles, one goal of the sensors can be to classify whether certain types of objects are around (rocks, animals, potholes), where the point is not to impose an order on the detections but to know whether the objects are in the vicinity
    - when processing a large body of text where the goal is to find certain keywords, and whether they are included matters more than where they appear in the document, the words in the document can be treated as elements of an unordered set, and the goal becomes knowing whether those keywords are present in that particular document
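the keyword scenario can be sketched in a few lines. this is a simplified illustration that assumes whitespace tokenization and lowercase keywords; the function name `contains_keywords` and the sample sentence are made up.

```python
def contains_keywords(document, keywords):
    """Map each keyword to True/False depending on whether it occurs in the document."""
    # treating the document as a set of words discards word positions and repeats
    words = set(document.lower().split())
    return {keyword: (keyword in words) for keyword in keywords}

text = "The pump failed after the valve leaked"   # hypothetical document
print(contains_keywords(text, ["valve", "pump", "motor"]))
```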
    

a concrete example to abstract the deepsets model
imagine the goal is to build a system to analyze temperature readings from several servers:
    - each sensor reading could be a simple 3D vector (for example: temperature, humidity, pressure)
    - the goal is to extract more informative features from the three existing ones, features that reveal more important patterns
    - this could be achieved by building a neural network that takes in a 3D reading and outputs a 5D vector
    - the exact same neural network is applied to each sensor reading, so swapping the order of the readings doesn't matter: each reading's extracted features stay the same, they just appear in the new order
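a minimal sketch of this idea, assuming a small two-layer network called `phi` (the name, the hidden width of 10, and the random readings are illustrative choices). it checks that reordering the readings only reorders the rows of the extracted features.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# shared network: 3D reading -> 5D features (hidden width 10 is an illustrative choice)
phi = nn.Sequential(nn.Linear(3, 10), nn.ReLU(), nn.Linear(10, 5))

readings = torch.randn(4, 3)         # 4 hypothetical readings: temperature, humidity, pressure
perm = torch.tensor([2, 0, 3, 1])    # an arbitrary reordering of the readings

with torch.no_grad():
    features = phi(readings)                     # shape (4, 5)
    features_of_reordered = phi(readings[perm])  # same network applied to the reordered readings

# reordering the inputs only reorders the rows of the output
print(torch.allclose(features_of_reordered, features[perm], atol=1e-6))
```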

additional notion of expanding to a higher dimension: in neural network terms, expanding the dimensions or the number of layers can reveal richer patterns. it's like describing something with 10 words instead of 3: you can say more about it. if you then reduce it back to 5 words, you keep the most important ones. so the 2 extra words don't come out of nowhere; the same concept that was described with 3 words is first described with 10, and the 5 words taken from those 10 form a richer description. in that sense you are expanding and then improving the way the concept is described.

In [20]:
"""
with the concrete example of analyzing temperature as a base:
    - input_dim = 3 refers to temperature, humidity, pressure (only in this context; it could vary for other concepts)
    - hidden_dim = 10 could capture intermediate patterns or combinations of these values
    - output_dim = 5 gives a 5D feature vector, which can be considered a richer representation of a sensor reading

picture it like this: the 3d space with our three components is transformed into a 10d space,
where the three units in the input layer are weighted and added into each unit of the next layer, that is
the 10d space. right before the values become a point in that 10d space, any negative components are
set to 0 and all other components are kept as is (ReLU), so this is not a linear transformation.
the same happens from the second layer to the third: when a point is passed from the 10d space into the 5d space
it is not sent as is; components greater than 0 are kept, and components less than 0 are not dropped
but set to 0 in their respective position.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F

class Deepsetlayers(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(Deepsetlayers, self).__init__()

        # two linear layers with ReLU activation in between which will be applied to every element in the set
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

        # key point: every reading goes through the same layers fc1 and fc2, which means the same
        # weights and biases are used for all readings, so the transformation is consistent regardless of the order of the readings

    def forward(self, x):
        """
        x: tensor of shape (batch_size, set_size, input_dim)

        in the context of analyzing temperature:
        batch_size: total number of different servers
        set_size: number of items in each set, i.e. the number of readings per server
        input_dim: number of features in each item, here for example temperature, humidity, and pressure

        within one batch tensor, each set must have the same number of items.
        conceptually, what has to match is the input dimension, not the set size: let one
        sample in the batch be one server, and let different servers have different numbers of readings.
        if a server has n readings, each reading still has its own temperature, pressure, and humidity,
        so while the number of elements in a set may vary, the number of dimensions of each reading must be the same.
        if the sets have different sizes, they must be brought to a common size using dummy (padding) elements
        or a similar strategy, and the padding elements should then be excluded (masked) from any aggregation

        reshaping:
        imagine there are n servers and each server has a set of sensor readings:
            for example, if each server has 4 sensor readings and each reading is a 3D vector (3 values),
            the overall shape is (n, 4, 3)

        here the neural network layers are treated as working on 2D tensors (matrices) where each row is a sample, and
        we want to process each sensor reading independently using the same layers

        so we combine the batch dimension and the set dimension into a single dimension, thus
        (n, 4, 3) becomes (n*4, 3)

        this flattens the data so that all n*4 sensor readings from all the servers are arranged in one batch

        after the first linear layer the shape is (n*4, 10), where 3d was transformed into 10d,
        and after the second linear layer it is (n*4, 5)

        we reshape it back to the structure (n, 4, 5) so that it stays consistent with the input and so
        that any additional aggregation over the set dimension can be performed; it also keeps the conceptual notion
        that there are n servers, each server has 4 readings, and each reading now has 5 features. the 5 features
        are not the original 3 plus 2 extra ones; all 5 are new values computed by the layers that took the
        dimensions from 3 to 10 and then to 5
        """
        batch_size, set_size, input_dim = x.shape
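        # reshape x to combine the batch and set dimensions: (batch_size * set_size, input_dim)
        # reshape (rather than view) also handles non-contiguous inputs
        x = x.reshape(-1, input_dim)

        # apply shared transformations where the same weights are used for every element
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))

        # restore the set structure: (batch_size, set_size, output_dim)
        x = x.reshape(batch_size, set_size, -1)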
        return x
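
the shape bookkeeping described in the docstring can be traced step by step. the sketch below rebuilds the two linear layers on their own (n = 2 servers, 4 readings each, hidden width 10; all of these numbers are illustrative) and also checks an alternative: since `nn.Linear` acts on the last dimension of its input, passing the 3D tensor directly should give the same result as flattening and reshaping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n, set_size, input_dim, hidden_dim, output_dim = 2, 4, 3, 10, 5
fc1 = nn.Linear(input_dim, hidden_dim)
fc2 = nn.Linear(hidden_dim, output_dim)

x = torch.randn(n, set_size, input_dim)          # (2, 4, 3)

with torch.no_grad():
    flat = x.reshape(-1, input_dim)              # (8, 3): all readings in one batch
    hidden = torch.relu(fc1(flat))               # (8, 10)
    out_flat = torch.relu(fc2(hidden))           # (8, 5)
    out = out_flat.reshape(n, set_size, -1)      # (2, 4, 5): set structure restored

    # nn.Linear acts on the last dimension, so the 3D tensor can also be passed directly
    out_direct = torch.relu(fc2(torch.relu(fc1(x))))

print(flat.shape, hidden.shape, out_flat.shape, out.shape)
print(torch.allclose(out, out_direct, atol=1e-6))
```

the explicit flatten and reshape makes the "one row per reading" picture visible, which is why it can still be worth keeping even though the direct call is shorter.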


In [21]:
"""
one of the key goals is to give a prediction that is invariant to permutations of the inputs: no matter
in which order the readings in a server's set arrive, the set is treated the same.
the prediction is therefore not based on the individual readings and their positions, but on the
patterns that the combination of readings in the set gives rise to.

here we can imagine there are ten servers and we take some n readings from each server, where
each reading is 3-dimensional in our case. the neural network transforms each 3d reading into a 5d vector, and
we aggregate these 5d vectors by summing over the set. since a sum does not depend on the order of its
terms, any reordering (permutation) of the readings in a set gives the same aggregated value, so the
function is symmetric: all permutations of the same set are treated as identical and can be grouped
as one. two different sets may or may not aggregate to the same value; what the second network (agg)
sees is only the aggregated pattern, and that is what the prediction is built on
"""

class DeepsetsModel(nn.Module):
    def __init__(self, ds_input_dim, ds_hidden_dim, ds_output_dim, agg_hidden_dim, agg_output_dim):

        super(DeepsetsModel, self).__init__()

        self.ds = Deepsetlayers(ds_input_dim, ds_hidden_dim, ds_output_dim)

        self.agg = nn.Sequential(
            nn.Linear(ds_output_dim, agg_hidden_dim),
            nn.ReLU(),
            nn.Linear(agg_hidden_dim, agg_output_dim)
        )


    def forward(self, x):
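        # per-element features: (batch_size, set_size, ds_output_dim)
        ds_out = self.ds(x)

        # sum over the set dimension; the sum is what makes the output independent of element order
        aggregated = torch.sum(ds_out, dim=1)
        # map the pooled vector (batch_size, ds_output_dim) to the prediction
        output = self.agg(aggregated)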
        return output
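
the permutation invariance can be checked numerically. the sketch below is self-contained and does not reuse the classes above: `phi` (per-element network) and `rho` (network applied after pooling) are names chosen for this example, and the layer sizes and random inputs are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# per-element network and post-pooling network (sizes are illustrative)
phi = nn.Sequential(nn.Linear(3, 15), nn.ReLU(), nn.Linear(15, 5), nn.ReLU())
rho = nn.Sequential(nn.Linear(5, 15), nn.ReLU(), nn.Linear(15, 1))

def set_function(x):
    """x: (batch_size, set_size, 3) -> (batch_size, 1), using sum pooling over the set."""
    return rho(phi(x).sum(dim=1))

x = torch.randn(2, 10, 3)        # 2 hypothetical servers, 10 readings each
perm = torch.randperm(10)        # random reordering of the readings

with torch.no_grad():
    original = set_function(x)
    permuted = set_function(x[:, perm, :])

# equal up to floating-point rounding, since the sum does not depend on the order of its terms
print(torch.allclose(original, permuted, atol=1e-5))
```

replacing the sum with another symmetric operation such as the mean or the max would keep this property; an order-dependent operation such as concatenating the element features would not.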

In [23]:
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

num_sets  = 1000
set_size = 10
element_dim = 3


torch.manual_seed(32)
X = torch.randn(num_sets, set_size, element_dim)
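# target: per-set mean of each feature, summed over the features (a permutation-invariant function of the set)
y = X.mean(dim=1).sum(dim=1)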

dataset = TensorDataset(X, y)

dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

ds_output_dim = 5
ds_hidden_dim = 15
agg_hidden_dim = 15
output_dim = 1

model = DeepsetsModel(ds_input_dim=element_dim, ds_hidden_dim=ds_hidden_dim,
                      ds_output_dim=ds_output_dim, agg_hidden_dim=agg_hidden_dim,
                      agg_output_dim=output_dim)

loss_function = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

num_epochs = 100

for epoch in range(num_epochs):
    model.train()
    for batch_X, batch_y in dataloader:
        outputs = model(batch_X)
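        # squeeze only the last dimension so a batch of size 1 keeps its batch dimension
        loss = loss_function(outputs.squeeze(-1), batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # this check sits inside the batch loop, so in the last epoch it prints one line per batch;
        # the bracketed value is the fraction of epochs completed (1.0 once all epochs are done)
        if (epoch +1)%100 == 0:
            print(f'Epoch: [{(epoch+1)/num_epochs}], loss: {loss.item()}')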


Epoch: [1.0], loss: 2.3239370420924388e-05
Epoch: [1.0], loss: 2.746977042988874e-05
Epoch: [1.0], loss: 5.466531001729891e-05
Epoch: [1.0], loss: 0.00017508803284727037
Epoch: [1.0], loss: 4.981818710803054e-05
Epoch: [1.0], loss: 9.63455477176467e-06
Epoch: [1.0], loss: 3.272198955528438e-05
Epoch: [1.0], loss: 5.155348117114045e-05
Epoch: [1.0], loss: 5.903784040128812e-05
Epoch: [1.0], loss: 0.00014058210945222527
Epoch: [1.0], loss: 3.3488100598333403e-05
Epoch: [1.0], loss: 2.616386700537987e-05
Epoch: [1.0], loss: 9.928573126671836e-05
Epoch: [1.0], loss: 3.3010390325216576e-05
Epoch: [1.0], loss: 8.904028800316155e-05
Epoch: [1.0], loss: 5.170291115064174e-05
Epoch: [1.0], loss: 5.409807636169717e-05
Epoch: [1.0], loss: 0.00012830189371015877
Epoch: [1.0], loss: 6.804956228734227e-06
Epoch: [1.0], loss: 9.463547030463815e-05
Epoch: [1.0], loss: 4.653465293813497e-05
Epoch: [1.0], loss: 0.00011484741844469681
Epoch: [1.0], loss: 0.00010923575609922409
Epoch: [1.0], loss: 4.74401