Categorical features alongside continuous features? #3

Open
fountaindive opened this issue Mar 16, 2023 · 5 comments

Comments

@fountaindive

Hi,

Really interesting work on categorical normalising flows (CNF), I'm reading your paper now.

I'm interested in applying normalising flows to generic tabular datasets that can have both continuous and categorical features. Is it possible to combine CNF with standard normalising flows to handle such datasets?

Many thanks!

@phlippe
Owner

phlippe commented Mar 19, 2023

Hi @fountaindive, thanks for your interest in the categorical normalising flows!
Yes, it is possible to combine categorical and continuous features in CNFs. In the CNF, we first map the categorical features into a continuous space. For continuous features, you can simply skip this step and concatenate them to the continuous representation of the categorical features. On that input, you can apply any standard NF of your preference. Hope that helps!
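
As a rough illustration of the resulting data layout only (this is not the library's API), the concatenation could look like the NumPy snippet below. The fixed lookup table `embedding` is a hypothetical stand-in for the learned categorical encoding, and all names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

x_cont = rng.normal(size=(6, 2))        # 6 samples, 2 continuous features
x_cat = rng.integers(0, 3, size=6)      # 6 samples, 1 categorical feature with 3 categories

# Hypothetical stand-in for the learned encoding: one 2-d vector per category.
embedding = rng.normal(size=(3, 2))
z_cat = embedding[x_cat]                # (6,) -> (6, 2)

# Continuous features skip the encoding and are concatenated as-is.
z = np.concatenate([x_cont, z_cat], axis=1)   # (6, 4): input to a standard flow
```

Any standard flow would then operate on the 4-dimensional `z`.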

@fountaindive
Author

Hi @phlippe, ah great yeah that makes perfect sense. I am definitely going to try this now!

@fountaindive
Author

Hi @phlippe, I was wondering if you would be interested in helping me put together a minimal working example of modelling a toy dataset with continuous and categorical features? I'm happy to add it to the repo as an example notebook.

I'm toying around with 2 continuous features and a single categorical feature but I'm not really sure how to use your library 😅

many thanks either way :)

fountaindive reopened this Mar 28, 2023

@phlippe
Owner

phlippe commented Apr 2, 2023

Hi @fountaindive, sure that would be great! Let me summarize the important modules/classes and steps needed:

  • The general model class is FlowModel. This allows you to simply stack flow layers together and execute them sequentially. To specify the individual layers and specifics for a task, you can inherit from it in a subclass. For example, see the FlowSetModeling class for how it is used for set modeling.
  • To map the categorical features into continuous space, we are using VariationalCategoricalEncoding. As inputs, you need to specify the number of dimensions you want to map the categorical values to (e.g. 2 or 3), and the vocabulary size (how many different categories each value can take). As flow_config, you can pass an empty dictionary, which should create the default config.
  • To use continuous and categorical features at the same time, you will need to write a subclass of FlowLayer that creates an instance of VariationalCategoricalEncoding. Specifically, the forward method of this flow layer should take both the continuous and the categorical features as input, pass the categorical features through the Encoding flow, and then return the concatenation of the original continuous and the encoded categorical features. The LDJ of this layer is simply the one from the Encoding flow. For reversing, split the input again into the continuous and encoded categorical features, and use the reverse method of the Encoding layer to obtain the categorical features.
  • Once this is implemented, you can create a full model with this layer. You can use the FlowSetModeling class as an example. Your first flow layer would be your newly implemented encoding layer, and all the rest can be standard flow layers. Depending on your application, you can look at the different tasks we have implemented here (graph generation, set generation, sequence generation).
  • If you want to fully train a model, you can create a new subfolder in experiments. I recommend checking out the README there and our other tasks, from which you can see what you need to implement.
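
The combined encoding layer described above can be sketched roughly as follows. This is a minimal, self-contained PyTorch illustration under stated assumptions, not the repository's code: `ToyCategoricalEncoding` is a hypothetical deterministic stand-in for VariationalCategoricalEncoding (it returns a placeholder LDJ of zero), `MixedFeatureEncoding` is a made-up name, and the plain `nn.Module` with separate `forward`/`reverse` methods is an assumption rather than the actual FlowLayer interface.

```python
import torch
import torch.nn as nn


class ToyCategoricalEncoding(nn.Module):
    """Hypothetical stand-in for an encoding flow; NOT the library's class."""

    def __init__(self, num_dims, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, num_dims)

    def forward(self, x_cat):
        z_cat = self.embed(x_cat)              # (N,) -> (N, num_dims)
        ldj = torch.zeros(x_cat.shape[0])      # placeholder; a real encoding returns its own LDJ
        return z_cat, ldj

    def reverse(self, z_cat):
        # Recover each category as the index of the nearest embedding vector.
        dists = torch.cdist(z_cat, self.embed.weight)   # (N, vocab_size)
        return dists.argmin(dim=-1)


class MixedFeatureEncoding(nn.Module):
    """Encodes the categorical column and passes continuous columns through."""

    def __init__(self, num_cont, num_dims, vocab_size):
        super().__init__()
        self.num_cont = num_cont
        self.encoding = ToyCategoricalEncoding(num_dims, vocab_size)

    def forward(self, x_cont, x_cat):
        z_cat, ldj = self.encoding(x_cat)
        # The layer's LDJ is just the encoding's LDJ; continuous features are unchanged.
        return torch.cat([x_cont, z_cat], dim=-1), ldj

    def reverse(self, z):
        # Split back into continuous features and encoded categorical features.
        x_cont, z_cat = z[:, :self.num_cont], z[:, self.num_cont:]
        return x_cont, self.encoding.reverse(z_cat)


torch.manual_seed(0)
layer = MixedFeatureEncoding(num_cont=2, num_dims=3, vocab_size=4)
x_cont = torch.randn(5, 2)
x_cat = torch.randint(0, 4, (5,))
with torch.no_grad():
    z, ldj = layer(x_cont, x_cat)               # z: (5, 2 + 3)
    x_cont_rec, x_cat_rec = layer.reverse(z)    # round trip back to the inputs
```

In a real model, the stand-in would be replaced by the library's encoding flow, and the standard flow layers that follow would operate on the concatenated output `z`.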

Hope that helps! Let me know if you have any questions or face any issues :)

@fountaindive
Author

Hi @phlippe, thank you very much for your detailed notes, I really appreciate it!

I'm trying a slightly simpler case first which I'll use to expand from. Suppose my dataset is just two columns of continuous features. I'm trying to build a "Tabular" flow model class to model this.

I think I've got most of the code written but there is something wrong and perhaps you can help?

The logic is as follows:

I'd like to model some data X with shape (N, M) where N is the number of samples and M is the number of continuous features, let's set M = 2 for now.

I think the simplest flow model would have the following layers:

1. (input) permute
2. coupling
3. permute
4. coupling (also the output layer)

I'm using the InvertibleConv layer to permute but I'm not sure if that is appropriate here. What I'd like to do is permute the columns of the data.
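
For reference, a plain column permutation is itself an invertible linear map whose log-determinant is zero, which is why it can sit between coupling layers at no cost to the likelihood. The NumPy snippet below is a generic illustration of that fact and does not use the repository's InvertibleConv; `perm` and `P` are names chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=(10, 2))           # N = 10 samples, M = 2 features

perm = np.array([1, 0])                 # swap the two columns
P = np.eye(2)[:, perm]                  # permutation matrix: x @ P == x[:, perm]

z = x @ P                               # forward: permute columns
x_rec = z @ P.T                         # reverse: P is orthogonal, so its inverse is P.T

sign, logabsdet = np.linalg.slogdet(P)  # log|det P| is 0 for any permutation
```

A learned invertible matrix generalises this by replacing `P` with a trainable matrix, in which case the log-determinant term is no longer zero.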

For the Coupling Network I use a small dense network with 1 input and 2 outputs: 1 input because we have 2 input features and mask 50% of them with CouplingLayer.create_channel_mask, and 2 outputs because the coupling layer has two parameters, i.e. the shift and the scale.
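
To make the shapes concrete, a generic affine coupling step for M = 2 can be sketched in NumPy as below. This is a RealNVP-style illustration, not the repository's CouplingLayer, whose expected network input and output shapes may differ. `conditioner`, `W` and `b` are made-up names, and a single tanh layer stands in for the dense network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in "network": 1 input feature -> 2 outputs (shift and log-scale).
W, b = rng.normal(size=(1, 2)), rng.normal(size=2)


def conditioner(x_a):
    out = np.tanh(x_a @ W + b)          # (N, 1) -> (N, 2)
    return out[:, :1], out[:, 1:]       # shift, log_scale, each (N, 1)


def coupling_forward(x):
    x_a, x_b = x[:, :1], x[:, 1:]       # first column conditions, second is transformed
    shift, log_scale = conditioner(x_a)
    z_b = x_b * np.exp(log_scale) + shift
    ldj = log_scale.sum(axis=1)         # log-det-Jacobian per sample
    return np.concatenate([x_a, z_b], axis=1), ldj


def coupling_reverse(z):
    z_a, z_b = z[:, :1], z[:, 1:]
    shift, log_scale = conditioner(z_a)  # z_a equals x_a, so the same parameters are recovered
    x_b = (z_b - shift) * np.exp(-log_scale)
    return np.concatenate([z_a, x_b], axis=1)


x = rng.uniform(size=(10, 2))           # N = 10 samples, M = 2 features
z, ldj = coupling_forward(x)
x_rec = coupling_reverse(z)
```

The conditioning column passes through unchanged, which is why a permutation between coupling layers is needed so that both columns get transformed.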

Currently I'm getting a shape error with the following code:

import sys
sys.path.append("path_to_directory/CategoricalNF")

import numpy as np

import torch
import torch.nn as nn 

from layers.flows.flow_model import FlowModel
from layers.flows.permutation_layers import InvertibleConv
from layers.flows.coupling_layer import CouplingLayer

class CouplingNetwork(nn.Module):
    def __init__(self, c_in, c_out, hidden_size):
        """
        this neural network models the shift and scale parameters
        of the coupling layer
        """
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(c_in, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, c_out),
        )
        
    def forward(self, x):
        return self.model(x)

class FlowTabularModelling(FlowModel):
    def __init__(self):
        super().__init__(layers=None, name="Tabular")
        
        self._create_layers()
        self.print_overview()
    
    
    def _create_layers(self):
        """
        I want the simplest model possible
        
        input: 2 features
        permute
        coupling
        permute
        coupling
        output: 2 features
        """
        
        n_dim = 2
        c_in = 1 # we only have one input to the CouplingNetwork because we only have 2 features and we mask 50%
        c_out = 2 # two parameters: 1 for the shift and 1 for the scale in the coupling flow?
        model_func = lambda c_out: CouplingNetwork(c_in=c_in, c_out=c_out, hidden_size=128)
        
        # Will mask half the features at a time?
        coupling_mask = CouplingLayer.create_channel_mask(n_dim)
        
        layers = [
            InvertibleConv(n_dim),
            CouplingLayer(n_dim, coupling_mask, model_func, c_out=c_out),
            InvertibleConv(n_dim),
            CouplingLayer(n_dim, coupling_mask, model_func, c_out=c_out),
        ]
        
        self.flow_layers = nn.ModuleList(layers)
        
    def forward(self, z, ldj=None, reverse=False, length=None):
        return super().forward(z, ldj=ldj, reverse=reverse, length=length)

And here is some fake data to test the forward pass:

x = np.random.uniform(size=(10, 1))
x = x.astype(np.float32)
x = torch.from_numpy(x)

ftm = FlowTabularModelling()
ftm.forward(x)

This currently gives the following shape error:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x2 and 1x10)

Hopefully that makes sense! I'm still new to normalising flows so might have gotten some terminology wrong. Thanks again!
