# Multi signal modelling 👯

This notebook shows how to initialize a `proteovae.models.MultiGuidedVAE`, which lets multiple supervised factors in your dataset drive the embeddings.

In [3]:
from proteovae.models import MultiGuidedVAE, MultiGuidedConfig 
from proteovae.models.base import Guide, Encoder, Decoder
from proteovae.trainers import ScheduledTrainer
import numpy as np 

First we'll need a dataset in which each sample carries multiple labels. For simplicity we'll randomly generate both the data and the labels to show how to get the model up and running (it's hard to come up with a high-dimensional dataset and a valid labelling ad hoc :p)

In [27]:
nsamples = 5000
nfeatures = 500 

data = np.random.normal(size = (nsamples, nfeatures))

labels_1 = np.random.randint(low = 0, high = 2, size = (nsamples,1)) #labels from categorical w/ 2 choices
labels_2 = np.random.randint(low = 0, high = 5, size = (nsamples,1)) #labels from categorical w/ 5 choices

labels = np.concatenate((labels_1, labels_2), axis=1)

print(f'data shape: {data.shape}, labels shape: {labels.shape}')

data shape: (5000, 500), labels shape: (5000, 2)


Now we prep the data in the standard fashion for PyTorch model training: we define training, validation, and test splits and feed the training split to a torch `DataLoader`.

To do this we need a torch-compatible dataset class that wraps our custom data.

In [28]:
import torch
from torch.utils.data import Dataset

# custom torch datasets just need to provide __len__ and __getitem__ methods!
class TorchDataset(Dataset):
    def __init__(self, data, labels):
        # torch.Tensor(...) casts both arrays to float32
        self.data = torch.Tensor(data)
        self.labels = torch.Tensor(labels)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

In [29]:
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

batch_size = 16
test_size = 0.2


# split 1: hold out the test set
X_train_val, X_test, Y_train_val, Y_test = train_test_split(data, labels, test_size=test_size)

# split 2: carve the validation set out of the remaining data
X_train, X_val, Y_train, Y_val = train_test_split(X_train_val, Y_train_val, test_size=test_size)

#loaders 
train_data = TorchDataset(X_train, Y_train) 
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=False)

val_data = TorchDataset(X_val, Y_val) 
val_data = (val_data.data, val_data.labels)

print(f'shape of the batched labels: {next(iter(train_loader))[1].shape}')

shape of the batched labels: torch.Size([16, 2])
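
Because the second split is applied to what remains after the first, the effective proportions are not simply 80/20. A rough sketch of the arithmetic, assuming scikit-learn rounds the held-out count up when `test_size` is a fraction (the `split_sizes` helper is hypothetical and exists only for this illustration):

```python
import math

def split_sizes(n, test_size):
    """Return (n_kept, n_held_out) for one fractional split.

    Assumption: the held-out count is rounded up, the rest is kept.
    """
    n_held_out = math.ceil(n * test_size)
    return n - n_held_out, n_held_out

nsamples, test_size, batch_size = 5000, 0.2, 16

n_train_val, n_test = split_sizes(nsamples, test_size)  # first split: test set
n_train, n_val = split_sizes(n_train_val, test_size)    # second split: validation set
n_batches = math.ceil(n_train / batch_size)             # batches per epoch in train_loader

print(n_train, n_val, n_test, n_batches)
```

With the values used here this works out to 3200 training, 800 validation, and 1000 test samples (64% / 16% / 20% of the data), and 200 batches per epoch.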


## Model training 

Here we'll again make some arbitrary choices for the latent dimension and for how many latent dimensions to reserve for predicting each supervised factor in the data. For instance, we can allocate *m* neurons in the latent space to track label \#1 and *n* neurons to track label \#2.

In [30]:
# Model config 
input_dim = data.shape[1]
device = "cuda" if torch.cuda.is_available() else "cpu"
epochs = 200

#here we reserve 1 neuron for label 1, and 3 neurons for label 2 out of the 16 available dimensions in the latent space 
latent_dim = 16
guided_dims = [1,3] 

encoder_dims = (256, 128, ) 
decoder_dims = encoder_dims[::-1]


model_config = MultiGuidedConfig(
    input_dim = input_dim,
    latent_dim = latent_dim,
    device = device,
    guided_dims = guided_dims,
)

Finally, define a list of guide objects (either plain `nn.Module`s or `proteovae.models.base.Guide`s) and you're on your way.

In [31]:
#recall the number of choices for each labelled factor ! 
guides = [Guide(model_config.guided_dims[0], 2),
          Guide(model_config.guided_dims[1], 5)]

model = MultiGuidedVAE(
    config = model_config,
    encoder = Encoder(model_config.input_dim, model_config.latent_dim, encoder_dims), 
    decoder = Decoder(model_config.input_dim, model_config.latent_dim, decoder_dims),
    guides = guides 
)

model 

MultiGuidedVAE(
  (encoder): Encoder(
    (linear_block): Sequential(
      (0): Linear(in_features=500, out_features=256, bias=True)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=True)
      (3): ReLU()
    )
    (fc_mu): Linear(in_features=128, out_features=16, bias=True)
    (fc_logvar): Linear(in_features=128, out_features=16, bias=True)
  )
  (decoder): Decoder(
    (linear_block): Sequential(
      (0): Linear(in_features=16, out_features=128, bias=True)
      (1): ReLU()
      (2): Linear(in_features=128, out_features=256, bias=True)
      (3): ReLU()
      (4): Linear(in_features=256, out_features=500, bias=True)
    )
  )
  (guides): ModuleList(
    (0): Guide(
      (classifier): Sequential(
        (0): Linear(in_features=1, out_features=2, bias=True)
      )
    )
    (1): Guide(
      (classifier): Sequential(
        (0): Linear(in_features=3, out_features=5, bias=True)
      )
    )
  )
)

**note:** the list of guides gets converted to an `nn.ModuleList`, so each guide's parameters are registered with the model and get updated when the loss is optimized!
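
To see why the conversion matters, here is a minimal sketch in plain PyTorch, independent of `proteovae`. The two toy classes are hypothetical; they only contrast a Python list of sub-modules with an `nn.ModuleList` holding layers of the same sizes as the guides above.

```python
import torch.nn as nn

class WithPlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # a plain Python list: these layers are NOT registered as sub-modules
        self.guides = [nn.Linear(1, 2), nn.Linear(3, 5)]

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each layer, so .parameters() can find them
        self.guides = nn.ModuleList([nn.Linear(1, 2), nn.Linear(3, 5)])

n_plain = sum(p.numel() for p in WithPlainList().parameters())
n_registered = sum(p.numel() for p in WithModuleList().parameters())

print(n_plain, n_registered)
```

Only the registered variant exposes its weights through `.parameters()`, which is what an optimizer built from `model.parameters()` iterates over.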

## Training 

In [32]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-03)
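# LambdaLR multiplies the base lr by the lambda's value: a 0.95 decay per scheduler step, floored at a factor of 5e-03
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda e: max(0.95**e, 5e-03), last_epoch=-1)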

trainer = ScheduledTrainer(model, optimizer, scheduler)
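
`LambdaLR` multiplies the optimizer's base learning rate by the value the lambda returns, so the floor of `5e-03` is a multiplier, not a learning rate. The sketch below reproduces the schedule in plain Python, assuming the scheduler is stepped once per epoch (how `ScheduledTrainer` steps it is not shown in this notebook); `lr_factor` and `lr_at` are illustrative names.

```python
base_lr = 1e-03

def lr_factor(epoch):
    # same expression as the lambda passed to LambdaLR
    return max(0.95 ** epoch, 5e-03)

def lr_at(epoch):
    # effective learning rate = base lr * multiplicative factor
    return base_lr * lr_factor(epoch)

# first epoch at which the exponential decay drops below the floor
first_floor_epoch = next(e for e in range(1000) if 0.95 ** e < 5e-03)

for e in (0, 10, 50, 100, first_floor_epoch, 199):
    print(f"epoch {e:3d}: lr = {lr_at(e):.2e}")
```

Under that assumption the decay reaches its floor at epoch 104, so roughly the second half of the 200 epochs runs at a constant learning rate of 5e-06.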

In [34]:
trainer.train(train_loader, epochs=epochs, val_data = val_data)

## Post-analysis 
To check the quality/informativeness of the embeddings, we can embed the test set and fit some vanilla sklearn models to see how (un)predictive each group of guided dimensions is for its associated label.

In our case the labels are random, so we expect chance-level predictive performance at best.
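
For reference, chance level for a uniformly distributed label with *k* classes is 1/*k*: 0.5 for the first factor and 0.2 for the second. A small simulation using only the standard library illustrates this; `chance_accuracy` is a hypothetical helper, not part of any library used here.

```python
import random

def chance_accuracy(n_classes, n_samples=100_000, seed=0):
    """Accuracy of uniform random guesses against uniform random labels."""
    rng = random.Random(seed)
    hits = sum(
        rng.randrange(n_classes) == rng.randrange(n_classes)
        for _ in range(n_samples)
    )
    return hits / n_samples

acc_2 = chance_accuracy(2)  # factor with 2 choices
acc_5 = chance_accuracy(5)  # factor with 5 choices
print(f"2 classes: {acc_2:.3f}, 5 classes: {acc_5:.3f}")
```

Any classifier scoring close to these values on the corresponding factor is doing no better than guessing.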

In [38]:
zs = model.embed(torch.tensor(X_test, device = device, dtype=torch.float32))

zs = zs.cpu().detach().numpy()   

y = Y_test
    
print(zs.shape, y.shape)

(1000, 16) (1000, 2)


In [None]:
#!pip install lazypredict 

In [43]:
from lazypredict.Supervised import LazyClassifier

#choose a factor (indexed base 0)
i = 1 
#recall that guided latent dimensions are allocated in the final neurons of the embedding dimensions ... 
l = model.latent_dim-1

#latents without guided dimensions for factor of interest
sz = np.hstack((zs[:, :l-model.guided_dims[i+1]], zs[:, l-model.guided_dims[i]:]))

#latents corresponding to factor of interest 
szz = zs[:,l-model.guided_dims[i+1]:l-model.guided_dims[i]]


print(sz.shape)

(1000, 13)
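
The indexing above relies on how `model.guided_dims` is stored internally: since `model.guided_dims[i+1]` is read with `i = 1`, the attribute must hold at least three entries, so it is not the raw `[1, 3]` list from the config. As a minimal sketch of the same idea with explicit offsets, the hypothetical helper below assumes a layout in which the guided blocks sit at the very end of the latent vector in factor order. That layout is an assumption for illustration only and should be checked against the library before use.

```python
import numpy as np

def split_latents(zs, guided_dims, i):
    """Split zs into (guided block for factor i, all remaining columns).

    ASSUMED layout (hypothetical, verify against the library): guided
    blocks occupy the final sum(guided_dims) columns, in factor order.
    """
    latent_dim = zs.shape[1]
    start = latent_dim - sum(guided_dims) + sum(guided_dims[:i])
    stop = start + guided_dims[i]
    guided = zs[:, start:stop]
    rest = np.delete(zs, np.s_[start:stop], axis=1)
    return guided, rest

# toy latent matrix: 2 samples, 16 latent dimensions, values 0..31
zs_demo = np.arange(2 * 16, dtype=float).reshape(2, 16)
guided, rest = split_latents(zs_demo, guided_dims=[1, 3], i=1)
print(guided.shape, rest.shape)
```

As in the cell above, removing a 3-dimensional guided block from a 16-dimensional embedding leaves 13 columns.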


You can now check that the latents without the guided neurons are uninformative for the signal of interest, and that the guided latents themselves **are** informative. The cell below runs the first check by fitting on `sz`; swap in `szz` for the second.

In [41]:
X_train_lazy, X_test_lazy, y_train_lazy, y_test_lazy = train_test_split(sz, y[:,i], test_size= 0.5)
lazy_clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = lazy_clf.fit(X_train_lazy, X_test_lazy, y_train_lazy, y_test_lazy)

#view 
models 

| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken |
|---|---|---|---|---|---|
| Perceptron | 0.22 | 0.22 | | 0.17 | 0.01 |
| BaggingClassifier | 0.21 | 0.22 | | 0.21 | 0.05 |
| SVC | 0.21 | 0.21 | | 0.17 | 0.04 |
| ExtraTreesClassifier | 0.21 | 0.21 | | 0.21 | 0.14 |
| LGBMClassifier | 0.21 | 0.21 | | 0.21 | 0.29 |
| KNeighborsClassifier | 0.21 | 0.21 | | 0.2 | 0.02 |
| RandomForestClassifier | 0.21 | 0.21 | | 0.2 | 0.21 |
| BernoulliNB | 0.2 | 0.21 | | 0.17 | 0.01 |
| DecisionTreeClassifier | 0.21 | 0.21 | | 0.2 | 0.01 |
| GaussianNB | 0.2 | 0.21 | | 0.18 | 0.01 |
